Intel-XE Archive on lore.kernel.org
* [RFC PATCH 00/28] Introduce GPU SVM and Xe SVM implementation
@ 2024-08-28  2:48 Matthew Brost
  2024-08-28  2:48 ` [RFC PATCH 01/28] dma-buf: Split out dma fence array create into alloc and arm functions Matthew Brost
                   ` (31 more replies)
  0 siblings, 32 replies; 100+ messages in thread
From: Matthew Brost @ 2024-08-28  2:48 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: airlied, christian.koenig, thomas.hellstrom, matthew.auld, daniel

This is a continuation of the SVM work by Oak Zeng [1][2], reworked based
on community feedback. It introduces a GPU SVM layer and new Xe uAPI, and
supports GPU page faults for system allocations (e.g., malloc), runtime
allocations (e.g., binding a BO), migration to and from VRAM, and unified
eviction (BO and SVM VRAM allocations can evict each other). The series is
fully tested; more on this below.

The patch breakdown is as follows:

1. Preparation patches already on the list [3].
	- Patches 1-3.
	- Please refrain from reviewing these here.	
2. New migrate layer functionality.
	- Patch 4.
	- Required for eviction to avoid locking inversion between
	  dma-resv and mmap lock.
3. GPU SVM.
	- Patch 5.
	- This is what needs community review.
	- Inspired by GPUVM.
	- Kernel doc should explain design principles.
	- There is certainly room for optimization of the implementation
	  and improvements with existing core MM interaction. Pulling in
	  pending DMA mapping work [4] and additional core MM support
	  for SVM is also likely desired. However, this serves as a good
	  starting point for any SVM discussions and could be used as a
	  stepping stone to future core MM work.
4. Basic SVM support in Xe (i.e., SRAM backing only).
	- Patches 6-15.
	- The uAPI in these patches could benefit from community input;
	  a rough usage sketch follows this list.
5. SVM VRAM migration support in Xe.
	- Patches 16-23.
	- Using TTM BOs for SVM VRAM allocations could use community
	  input. Patch 23 has a detailed explanation of this design
	  choice in the commit message.
6. SVM eviction support in Xe.
	- Patch 24.
	- Should work with exhaustive eviction [5] when it merges.
7. Xe SVM debug / tuning.
	- Patches 25-28.
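
For reference, a rough sketch of how userspace might exercise the system
allocator flag. The bind op field usage for the new flag is exactly what
needs input; everything below other than the flag name (taken from patch 6)
follows the existing Xe uAPI and should be treated as an assumption:

	#include <stdint.h>
	#include <xf86drm.h>
	#include <drm/xe_drm.h>

	/* Hypothetical sketch: bind a VA range as a system allocator
	 * mapping, after which plain malloc'd memory in that range can be
	 * accessed by the GPU via the SVM fault handler in this series. */
	static int bind_system_allocator(int fd, uint32_t vm_id,
					 uint64_t start, uint64_t range)
	{
		struct drm_xe_vm_bind bind = {
			.vm_id = vm_id,
			.num_binds = 1,
			.bind = {
				.op = DRM_XE_VM_BIND_OP_MAP,
				.flags = DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATON,
				.addr = start,	/* CPU VA range to mirror */
				.range = range,
				.obj = 0,	/* no BO, backed by system allocations */
			},
		};

		return drmIoctl(fd, DRM_IOCTL_XE_VM_BIND, &bind);
	}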

Aside from the GPU SVM and uAPI patches, kernel documentation and commit
messages are relatively light, as this is an RFC.

Testing has been conducted quite thoroughly with a new IGT [6]. Various
system allocation types (malloc, mmap, mmap flags, huge pages, different
sizes, different alignments), mixing runtime allocations, unmapping corner
cases, invalid faults, and eviction have been tested. Testing scales
from single thread to multiple threads and multiple processes. Tests
pass on LNL, BMG, PVC 1 tile, and PVC 2 tile.

Planned follow-up work, not part of this series:

1. Multiple GPU support.
	- This is likely to follow or occur in parallel to this work.
2. Userptr unification with GPU SVM.
	- This is essentially designed in my head (likely involving a
	  few new GPU SVM layer functions) but would require some fairly
	  invasive changes to Xe KMD to test out. Therefore, I would
	  like GPU SVM to be reviewed first before proceeding with these
	  changes.
3. Madvise and prefetch IOCTLs
	- This is likely to follow or occur in parallel to this work.

Given the size of the series, I have pushed a GitLab branch for
reference [7].

Matt

[1] https://patchwork.freedesktop.org/series/128910/
[2] https://patchwork.freedesktop.org/series/132229/
[3] https://patchwork.freedesktop.org/series/137805/
[4] https://lore.kernel.org/linux-rdma/cover.1709635535.git.leon@kernel.org/
[5] https://patchwork.freedesktop.org/series/133643/
[6] https://patchwork.freedesktop.org/patch/610942/?series=137545&rev=2
[7] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-svm-post-8-27-24/-/tree/post?ref_type=heads

Matthew Brost (28):
  dma-buf: Split out dma fence array create into alloc and arm functions
  drm/xe: Invalidate media_gt TLBs in PT code
  drm/xe: Retry BO allocation
  mm/migrate: Add migrate_device_vma_range
  drm/gpusvm: Add support for GPU Shared Virtual Memory
  drm/xe/uapi: Add DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATON flag
  drm/xe: Add SVM init / fini to faulting VMs
  drm/xe: Add dma_addr res cursor
  drm/xe: Add SVM range invalidation
  drm/gpuvm: Add DRM_GPUVA_OP_USER
  drm/xe: Add (re)bind to SVM page fault handler
  drm/xe: Add SVM garbage collector
  drm/xe: Add unbind to SVM garbage collector
  drm/xe: Do not allow system allocator VMA unbind if the GPU has
    bindings
  drm/xe: Enable system allocator uAPI
  drm/xe: Add migrate layer functions for SVM support
  drm/xe: Add SVM device memory mirroring
  drm/xe: Add GPUSVM copy SRAM / VRAM vfunc functions
  drm/xe: Update PT layer to understand ranges in VRAM
  drm/xe: Add Xe SVM populate_vram_pfn vfunc
  drm/xe: Add Xe SVM vram_release vfunc
  drm/xe: Add BO flags required for SVM
  drm/xe: Add SVM VRAM migration
  drm/xe: Basic SVM BO eviction
  drm/xe: Add SVM debug
  drm/xe: Add modparam for SVM notifier size
  drm/xe: Add modparam for SVM prefault
  drm/gpusvm: Ensure all pages migrated upon eviction

 drivers/dma-buf/dma-fence-array.c    |   78 +-
 drivers/gpu/drm/xe/Makefile          |    4 +-
 drivers/gpu/drm/xe/drm_gpusvm.c      | 2213 ++++++++++++++++++++++++++
 drivers/gpu/drm/xe/drm_gpusvm.h      |  415 +++++
 drivers/gpu/drm/xe/xe_bo.c           |   54 +-
 drivers/gpu/drm/xe/xe_bo.h           |    2 +
 drivers/gpu/drm/xe/xe_bo_types.h     |    3 +
 drivers/gpu/drm/xe/xe_device_types.h |    8 +
 drivers/gpu/drm/xe/xe_gt_pagefault.c |   17 +-
 drivers/gpu/drm/xe/xe_migrate.c      |  150 ++
 drivers/gpu/drm/xe/xe_migrate.h      |   10 +
 drivers/gpu/drm/xe/xe_module.c       |    7 +
 drivers/gpu/drm/xe/xe_module.h       |    2 +
 drivers/gpu/drm/xe/xe_pt.c           |  456 +++++-
 drivers/gpu/drm/xe/xe_pt.h           |    3 +
 drivers/gpu/drm/xe/xe_pt_types.h     |    2 +
 drivers/gpu/drm/xe/xe_res_cursor.h   |   50 +-
 drivers/gpu/drm/xe/xe_svm.c          |  775 +++++++++
 drivers/gpu/drm/xe/xe_svm.h          |   70 +
 drivers/gpu/drm/xe/xe_tile.c         |    5 +
 drivers/gpu/drm/xe/xe_vm.c           |  286 +++-
 drivers/gpu/drm/xe/xe_vm.h           |   15 +-
 drivers/gpu/drm/xe/xe_vm_types.h     |   44 +
 include/drm/drm_gpuvm.h              |    5 +
 include/linux/dma-fence-array.h      |    6 +
 include/linux/migrate.h              |    3 +
 include/uapi/drm/xe_drm.h            |   19 +-
 mm/migrate_device.c                  |   53 +
 28 files changed, 4615 insertions(+), 140 deletions(-)
 create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.c
 create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.h
 create mode 100644 drivers/gpu/drm/xe/xe_svm.c
 create mode 100644 drivers/gpu/drm/xe/xe_svm.h

-- 
2.34.1



* [RFC PATCH 01/28] dma-buf: Split out dma fence array create into alloc and arm functions
  2024-08-28  2:48 [RFC PATCH 00/28] Introduce GPU SVM and Xe SVM implementation Matthew Brost
@ 2024-08-28  2:48 ` Matthew Brost
  2024-08-28  2:48 ` [RFC PATCH 02/28] drm/xe: Invalidate media_gt TLBs in PT code Matthew Brost
                   ` (30 subsequent siblings)
  31 siblings, 0 replies; 100+ messages in thread
From: Matthew Brost @ 2024-08-28  2:48 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: airlied, christian.koenig, thomas.hellstrom, matthew.auld, daniel

Useful to preallocate a dma fence array and then arm it in the path of
reclaim or dma fence signaling.

v2:
 - s/arm/init (Christian)
 - Drop !array warn (Christian)
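
For reference, the intended usage pattern looks roughly like this
(num_fences and fences are set up by the caller as for
dma_fence_array_create()):

	/* Preallocation, outside the path of reclaim / fence signaling */
	struct dma_fence_array *array;

	array = dma_fence_array_alloc(num_fences);
	if (!array)
		return -ENOMEM;

	/* ... later, in the path of reclaim or dma fence signaling ... */
	dma_fence_array_init(array, num_fences, fences,
			     dma_fence_context_alloc(1), 1, false);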

Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/dma-buf/dma-fence-array.c | 78 ++++++++++++++++++++++---------
 include/linux/dma-fence-array.h   |  6 +++
 2 files changed, 63 insertions(+), 21 deletions(-)

diff --git a/drivers/dma-buf/dma-fence-array.c b/drivers/dma-buf/dma-fence-array.c
index c74ac197d5fe..0659e6b29b3c 100644
--- a/drivers/dma-buf/dma-fence-array.c
+++ b/drivers/dma-buf/dma-fence-array.c
@@ -144,37 +144,38 @@ const struct dma_fence_ops dma_fence_array_ops = {
 EXPORT_SYMBOL(dma_fence_array_ops);
 
 /**
- * dma_fence_array_create - Create a custom fence array
+ * dma_fence_array_alloc - Allocate a custom fence array
+ * @num_fences:		[in]	number of fences to add in the array
+ *
+ * Return dma fence array on success, NULL on failure
+ */
+struct dma_fence_array *dma_fence_array_alloc(int num_fences)
+{
+	struct dma_fence_array *array;
+
+	return kzalloc(struct_size(array, callbacks, num_fences), GFP_KERNEL);
+}
+EXPORT_SYMBOL(dma_fence_array_alloc);
+
+/**
+ * dma_fence_array_init - Arm a custom fence array
+ * @array:		[in]	dma fence array to arm
  * @num_fences:		[in]	number of fences to add in the array
  * @fences:		[in]	array containing the fences
  * @context:		[in]	fence context to use
  * @seqno:		[in]	sequence number to use
  * @signal_on_any:	[in]	signal on any fence in the array
  *
- * Allocate a dma_fence_array object and initialize the base fence with
- * dma_fence_init().
- * In case of error it returns NULL.
- *
- * The caller should allocate the fences array with num_fences size
- * and fill it with the fences it wants to add to the object. Ownership of this
- * array is taken and dma_fence_put() is used on each fence on release.
- *
- * If @signal_on_any is true the fence array signals if any fence in the array
- * signals, otherwise it signals when all fences in the array signal.
+ * Implementation of @dma_fence_array_create without allocation. Useful to arm a
+ * preallocated dma fence array in the path of reclaim or dma fence signaling.
  */
-struct dma_fence_array *dma_fence_array_create(int num_fences,
-					       struct dma_fence **fences,
-					       u64 context, unsigned seqno,
-					       bool signal_on_any)
+void dma_fence_array_init(struct dma_fence_array *array,
+			  int num_fences, struct dma_fence **fences,
+			  u64 context, unsigned seqno,
+			  bool signal_on_any)
 {
-	struct dma_fence_array *array;
-
 	WARN_ON(!num_fences || !fences);
 
-	array = kzalloc(struct_size(array, callbacks, num_fences), GFP_KERNEL);
-	if (!array)
-		return NULL;
-
 	array->num_fences = num_fences;
 
 	spin_lock_init(&array->lock);
@@ -200,6 +201,41 @@ struct dma_fence_array *dma_fence_array_create(int num_fences,
 	 */
 	while (num_fences--)
 		WARN_ON(dma_fence_is_container(fences[num_fences]));
+}
+EXPORT_SYMBOL(dma_fence_array_init);
+
+/**
+ * dma_fence_array_create - Create a custom fence array
+ * @num_fences:		[in]	number of fences to add in the array
+ * @fences:		[in]	array containing the fences
+ * @context:		[in]	fence context to use
+ * @seqno:		[in]	sequence number to use
+ * @signal_on_any:	[in]	signal on any fence in the array
+ *
+ * Allocate a dma_fence_array object and initialize the base fence with
+ * dma_fence_init().
+ * In case of error it returns NULL.
+ *
+ * The caller should allocate the fences array with num_fences size
+ * and fill it with the fences it wants to add to the object. Ownership of this
+ * array is taken and dma_fence_put() is used on each fence on release.
+ *
+ * If @signal_on_any is true the fence array signals if any fence in the array
+ * signals, otherwise it signals when all fences in the array signal.
+ */
+struct dma_fence_array *dma_fence_array_create(int num_fences,
+					       struct dma_fence **fences,
+					       u64 context, unsigned seqno,
+					       bool signal_on_any)
+{
+	struct dma_fence_array *array;
+
+	array = dma_fence_array_alloc(num_fences);
+	if (!array)
+		return NULL;
+
+	dma_fence_array_init(array, num_fences, fences,
+			     context, seqno, signal_on_any);
 
 	return array;
 }
diff --git a/include/linux/dma-fence-array.h b/include/linux/dma-fence-array.h
index 29c5650c1038..079b3dec0a16 100644
--- a/include/linux/dma-fence-array.h
+++ b/include/linux/dma-fence-array.h
@@ -79,6 +79,12 @@ to_dma_fence_array(struct dma_fence *fence)
 	for (index = 0, fence = dma_fence_array_first(head); fence;	\
 	     ++(index), fence = dma_fence_array_next(head, index))
 
+struct dma_fence_array *dma_fence_array_alloc(int num_fences);
+void dma_fence_array_init(struct dma_fence_array *array,
+			  int num_fences, struct dma_fence **fences,
+			  u64 context, unsigned seqno,
+			  bool signal_on_any);
+
 struct dma_fence_array *dma_fence_array_create(int num_fences,
 					       struct dma_fence **fences,
 					       u64 context, unsigned seqno,
-- 
2.34.1



* [RFC PATCH 02/28] drm/xe: Invalidate media_gt TLBs in PT code
  2024-08-28  2:48 [RFC PATCH 00/28] Introduce GPU SVM and Xe SVM implementation Matthew Brost
  2024-08-28  2:48 ` [RFC PATCH 01/28] dma-buf: Split out dma fence array create into alloc and arm functions Matthew Brost
@ 2024-08-28  2:48 ` Matthew Brost
  2024-08-28  2:48 ` [RFC PATCH 03/28] drm/xe: Retry BO allocation Matthew Brost
                   ` (29 subsequent siblings)
  31 siblings, 0 replies; 100+ messages in thread
From: Matthew Brost @ 2024-08-28  2:48 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: airlied, christian.koenig, thomas.hellstrom, matthew.auld, daniel

Testing on LNL has shown that the media GT's TLBs need to be invalidated
via the GuC; update the PT code accordingly.

v2:
 - Do dma_fence_get before first call of invalidation_fence_init (Himal)
 - No need to check for valid chain fence (Himal)
v3:
 - Use dma-fence-array

Fixes: 3330361543fc ("drm/xe/lnl: Add LNL platform definition")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_pt.c | 117 ++++++++++++++++++++++++++++++-------
 1 file changed, 96 insertions(+), 21 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
index 579ed31b46db..d6353e8969f0 100644
--- a/drivers/gpu/drm/xe/xe_pt.c
+++ b/drivers/gpu/drm/xe/xe_pt.c
@@ -3,6 +3,8 @@
  * Copyright © 2022 Intel Corporation
  */
 
+#include <linux/dma-fence-array.h>
+
 #include "xe_pt.h"
 
 #include "regs/xe_gtt_defs.h"
@@ -1627,9 +1629,11 @@ xe_pt_update_ops_rfence_interval(struct xe_vm_pgtable_update_ops *pt_update_ops,
 
 static int vma_reserve_fences(struct xe_device *xe, struct xe_vma *vma)
 {
+	int shift = xe_device_get_root_tile(xe)->media_gt ? 1 : 0;
+
 	if (!xe_vma_has_no_bo(vma) && !xe_vma_bo(vma)->vm)
 		return dma_resv_reserve_fences(xe_vma_bo(vma)->ttm.base.resv,
-					       xe->info.tile_count);
+					       xe->info.tile_count << shift);
 
 	return 0;
 }
@@ -1816,6 +1820,7 @@ int xe_pt_update_ops_prepare(struct xe_tile *tile, struct xe_vma_ops *vops)
 	struct xe_vm_pgtable_update_ops *pt_update_ops =
 		&vops->pt_update_ops[tile->id];
 	struct xe_vma_op *op;
+	int shift = tile->media_gt ? 1 : 0;
 	int err;
 
 	lockdep_assert_held(&vops->vm->lock);
@@ -1824,7 +1829,7 @@ int xe_pt_update_ops_prepare(struct xe_tile *tile, struct xe_vma_ops *vops)
 	xe_pt_update_ops_init(pt_update_ops);
 
 	err = dma_resv_reserve_fences(xe_vm_resv(vops->vm),
-				      tile_to_xe(tile)->info.tile_count);
+				      tile_to_xe(tile)->info.tile_count << shift);
 	if (err)
 		return err;
 
@@ -1849,13 +1854,20 @@ int xe_pt_update_ops_prepare(struct xe_tile *tile, struct xe_vma_ops *vops)
 
 static void bind_op_commit(struct xe_vm *vm, struct xe_tile *tile,
 			   struct xe_vm_pgtable_update_ops *pt_update_ops,
-			   struct xe_vma *vma, struct dma_fence *fence)
+			   struct xe_vma *vma, struct dma_fence *fence,
+			   struct dma_fence *fence2)
 {
-	if (!xe_vma_has_no_bo(vma) && !xe_vma_bo(vma)->vm)
+	if (!xe_vma_has_no_bo(vma) && !xe_vma_bo(vma)->vm) {
 		dma_resv_add_fence(xe_vma_bo(vma)->ttm.base.resv, fence,
 				   pt_update_ops->wait_vm_bookkeep ?
 				   DMA_RESV_USAGE_KERNEL :
 				   DMA_RESV_USAGE_BOOKKEEP);
+		if (fence2)
+			dma_resv_add_fence(xe_vma_bo(vma)->ttm.base.resv, fence2,
+					   pt_update_ops->wait_vm_bookkeep ?
+					   DMA_RESV_USAGE_KERNEL :
+					   DMA_RESV_USAGE_BOOKKEEP);
+	}
 	vma->tile_present |= BIT(tile->id);
 	vma->tile_staged &= ~BIT(tile->id);
 	if (xe_vma_is_userptr(vma)) {
@@ -1875,13 +1887,20 @@ static void bind_op_commit(struct xe_vm *vm, struct xe_tile *tile,
 
 static void unbind_op_commit(struct xe_vm *vm, struct xe_tile *tile,
 			     struct xe_vm_pgtable_update_ops *pt_update_ops,
-			     struct xe_vma *vma, struct dma_fence *fence)
+			     struct xe_vma *vma, struct dma_fence *fence,
+			     struct dma_fence *fence2)
 {
-	if (!xe_vma_has_no_bo(vma) && !xe_vma_bo(vma)->vm)
+	if (!xe_vma_has_no_bo(vma) && !xe_vma_bo(vma)->vm) {
 		dma_resv_add_fence(xe_vma_bo(vma)->ttm.base.resv, fence,
 				   pt_update_ops->wait_vm_bookkeep ?
 				   DMA_RESV_USAGE_KERNEL :
 				   DMA_RESV_USAGE_BOOKKEEP);
+		if (fence2)
+			dma_resv_add_fence(xe_vma_bo(vma)->ttm.base.resv, fence2,
+					   pt_update_ops->wait_vm_bookkeep ?
+					   DMA_RESV_USAGE_KERNEL :
+					   DMA_RESV_USAGE_BOOKKEEP);
+	}
 	vma->tile_present &= ~BIT(tile->id);
 	if (!vma->tile_present) {
 		list_del_init(&vma->combined_links.rebind);
@@ -1898,7 +1917,8 @@ static void unbind_op_commit(struct xe_vm *vm, struct xe_tile *tile,
 static void op_commit(struct xe_vm *vm,
 		      struct xe_tile *tile,
 		      struct xe_vm_pgtable_update_ops *pt_update_ops,
-		      struct xe_vma_op *op, struct dma_fence *fence)
+		      struct xe_vma_op *op, struct dma_fence *fence,
+		      struct dma_fence *fence2)
 {
 	xe_vm_assert_held(vm);
 
@@ -1907,26 +1927,28 @@ static void op_commit(struct xe_vm *vm,
 		if (!op->map.immediate && xe_vm_in_fault_mode(vm))
 			break;
 
-		bind_op_commit(vm, tile, pt_update_ops, op->map.vma, fence);
+		bind_op_commit(vm, tile, pt_update_ops, op->map.vma, fence,
+			       fence2);
 		break;
 	case DRM_GPUVA_OP_REMAP:
 		unbind_op_commit(vm, tile, pt_update_ops,
-				 gpuva_to_vma(op->base.remap.unmap->va), fence);
+				 gpuva_to_vma(op->base.remap.unmap->va), fence,
+				 fence2);
 
 		if (op->remap.prev)
 			bind_op_commit(vm, tile, pt_update_ops, op->remap.prev,
-				       fence);
+				       fence, fence2);
 		if (op->remap.next)
 			bind_op_commit(vm, tile, pt_update_ops, op->remap.next,
-				       fence);
+				       fence, fence2);
 		break;
 	case DRM_GPUVA_OP_UNMAP:
 		unbind_op_commit(vm, tile, pt_update_ops,
-				 gpuva_to_vma(op->base.unmap.va), fence);
+				 gpuva_to_vma(op->base.unmap.va), fence, fence2);
 		break;
 	case DRM_GPUVA_OP_PREFETCH:
 		bind_op_commit(vm, tile, pt_update_ops,
-			       gpuva_to_vma(op->base.prefetch.va), fence);
+			       gpuva_to_vma(op->base.prefetch.va), fence, fence2);
 		break;
 	default:
 		drm_warn(&vm->xe->drm, "NOT POSSIBLE");
@@ -1963,7 +1985,9 @@ xe_pt_update_ops_run(struct xe_tile *tile, struct xe_vma_ops *vops)
 	struct xe_vm_pgtable_update_ops *pt_update_ops =
 		&vops->pt_update_ops[tile->id];
 	struct dma_fence *fence;
-	struct invalidation_fence *ifence = NULL;
+	struct invalidation_fence *ifence = NULL, *mfence = NULL;
+	struct dma_fence **fences = NULL;
+	struct dma_fence_array *cf = NULL;
 	struct xe_range_fence *rfence;
 	struct xe_vma_op *op;
 	int err = 0, i;
@@ -1996,6 +2020,23 @@ xe_pt_update_ops_run(struct xe_tile *tile, struct xe_vma_ops *vops)
 			err = -ENOMEM;
 			goto kill_vm_tile1;
 		}
+		if (tile->media_gt) {
+			mfence = kzalloc(sizeof(*ifence), GFP_KERNEL);
+			if (!mfence) {
+				err = -ENOMEM;
+				goto free_ifence;
+			}
+			fences = kmalloc_array(2, sizeof(*fences), GFP_KERNEL);
+			if (!fences) {
+				err = -ENOMEM;
+				goto free_ifence;
+			}
+			cf = dma_fence_array_alloc(2);
+			if (!cf) {
+				err = -ENOMEM;
+				goto free_ifence;
+			}
+		}
 	}
 
 	rfence = kzalloc(sizeof(*rfence), GFP_KERNEL);
@@ -2027,19 +2068,50 @@ xe_pt_update_ops_run(struct xe_tile *tile, struct xe_vma_ops *vops)
 
 	/* tlb invalidation must be done before signaling rebind */
 	if (ifence) {
+		if (mfence)
+			dma_fence_get(fence);
 		invalidation_fence_init(tile->primary_gt, ifence, fence,
 					pt_update_ops->start,
 					pt_update_ops->last, vm->usm.asid);
-		fence = &ifence->base.base;
+		if (mfence) {
+			invalidation_fence_init(tile->media_gt, mfence, fence,
+						pt_update_ops->start,
+						pt_update_ops->last, vm->usm.asid);
+			fences[0] = &ifence->base.base;
+			fences[1] = &mfence->base.base;
+			dma_fence_array_init(cf, 2, fences,
+					     vm->composite_fence_ctx,
+					     vm->composite_fence_seqno++,
+					     false);
+			fence = &cf->base;
+		} else {
+			fence = &ifence->base.base;
+		}
 	}
 
-	dma_resv_add_fence(xe_vm_resv(vm), fence,
-			   pt_update_ops->wait_vm_bookkeep ?
-			   DMA_RESV_USAGE_KERNEL :
-			   DMA_RESV_USAGE_BOOKKEEP);
+	if (!mfence) {
+		dma_resv_add_fence(xe_vm_resv(vm), fence,
+				   pt_update_ops->wait_vm_bookkeep ?
+				   DMA_RESV_USAGE_KERNEL :
+				   DMA_RESV_USAGE_BOOKKEEP);
 
-	list_for_each_entry(op, &vops->list, link)
-		op_commit(vops->vm, tile, pt_update_ops, op, fence);
+		list_for_each_entry(op, &vops->list, link)
+			op_commit(vops->vm, tile, pt_update_ops, op, fence, NULL);
+	} else {
+		dma_resv_add_fence(xe_vm_resv(vm), &ifence->base.base,
+				   pt_update_ops->wait_vm_bookkeep ?
+				   DMA_RESV_USAGE_KERNEL :
+				   DMA_RESV_USAGE_BOOKKEEP);
+
+		dma_resv_add_fence(xe_vm_resv(vm), &mfence->base.base,
+				   pt_update_ops->wait_vm_bookkeep ?
+				   DMA_RESV_USAGE_KERNEL :
+				   DMA_RESV_USAGE_BOOKKEEP);
+
+		list_for_each_entry(op, &vops->list, link)
+			op_commit(vops->vm, tile, pt_update_ops, op,
+				  &ifence->base.base, &mfence->base.base);
+	}
 
 	if (pt_update_ops->needs_userptr_lock)
 		up_read(&vm->userptr.notifier_lock);
@@ -2049,6 +2121,9 @@ xe_pt_update_ops_run(struct xe_tile *tile, struct xe_vma_ops *vops)
 free_rfence:
 	kfree(rfence);
 free_ifence:
+	kfree(cf);
+	kfree(fences);
+	kfree(mfence);
 	kfree(ifence);
 kill_vm_tile1:
 	if (err != -EAGAIN && tile->id)
-- 
2.34.1



* [RFC PATCH 03/28] drm/xe: Retry BO allocation
  2024-08-28  2:48 [RFC PATCH 00/28] Introduce GPU SVM and Xe SVM implementation Matthew Brost
  2024-08-28  2:48 ` [RFC PATCH 01/28] dma-buf: Split out dma fence array create into alloc and arm functions Matthew Brost
  2024-08-28  2:48 ` [RFC PATCH 02/28] drm/xe: Invalidate media_gt TLBs in PT code Matthew Brost
@ 2024-08-28  2:48 ` Matthew Brost
  2024-08-28  2:48 ` [RFC PATCH 04/28] mm/migrate: Add migrate_device_vma_range Matthew Brost
                   ` (28 subsequent siblings)
  31 siblings, 0 replies; 100+ messages in thread
From: Matthew Brost @ 2024-08-28  2:48 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: airlied, christian.koenig, thomas.hellstrom, matthew.auld, daniel

TTM doesn't support fair eviction via WW locking; this is mitigated by
using retry loops in the exec and preempt rebind worker paths. Extend this
retry loop to BO allocation. Once TTM supports fair eviction, this patch
can be reverted.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_bo.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
index cbe7bf098970..b6c6a4a3b4d4 100644
--- a/drivers/gpu/drm/xe/xe_bo.c
+++ b/drivers/gpu/drm/xe/xe_bo.c
@@ -1977,6 +1977,7 @@ int xe_gem_create_ioctl(struct drm_device *dev, void *data,
 	struct xe_file *xef = to_xe_file(file);
 	struct drm_xe_gem_create *args = data;
 	struct xe_vm *vm = NULL;
+	ktime_t end = 0;
 	struct xe_bo *bo;
 	unsigned int bo_flags;
 	u32 handle;
@@ -2042,11 +2043,14 @@ int xe_gem_create_ioctl(struct drm_device *dev, void *data,
 		vm = xe_vm_lookup(xef, args->vm_id);
 		if (XE_IOCTL_DBG(xe, !vm))
 			return -ENOENT;
+	}
+
+retry:
+	if (vm) {
 		err = xe_vm_lock(vm, true);
 		if (err)
 			goto out_vm;
 	}
-
 	bo = xe_bo_create_user(xe, NULL, vm, args->size, args->cpu_caching,
 			       bo_flags);
 
@@ -2055,6 +2059,8 @@ int xe_gem_create_ioctl(struct drm_device *dev, void *data,
 
 	if (IS_ERR(bo)) {
 		err = PTR_ERR(bo);
+		if (xe_vm_validate_should_retry(NULL, err, &end))
+			goto retry;
 		goto out_vm;
 	}
 
-- 
2.34.1



* [RFC PATCH 04/28] mm/migrate: Add migrate_device_vma_range
  2024-08-28  2:48 [RFC PATCH 00/28] Introduce GPU SVM and Xe SVM implementation Matthew Brost
                   ` (2 preceding siblings ...)
  2024-08-28  2:48 ` [RFC PATCH 03/28] drm/xe: Retry BO allocation Matthew Brost
@ 2024-08-28  2:48 ` Matthew Brost
  2024-08-29  9:03   ` Daniel Vetter
  2024-08-28  2:48 ` [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory Matthew Brost
                   ` (27 subsequent siblings)
  31 siblings, 1 reply; 100+ messages in thread
From: Matthew Brost @ 2024-08-28  2:48 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: airlied, christian.koenig, thomas.hellstrom, matthew.auld, daniel

Add migrate_device_vma_range, which prepares an array of pre-populated
device pages for migration and issues an MMU invalidation.
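
For reference, the intended caller side in a device memory eviction path
looks roughly as follows (driver_populate_dst_pfns_and_copy is a
driver-specific placeholder):

	/* Eviction path sketch: src_pfns has been pre-populated with the
	 * device private pfns backing the allocation (driver-specific
	 * lookup); no mmap lock is held here. */
	err = migrate_device_vma_range(mm, pgmap_owner, src_pfns, npages, start);
	if (err)
		return err;

	/* Allocate destination system pages into dst_pfns and copy the data
	 * (driver-specific), then finalize as for migrate_device_range(). */
	err = driver_populate_dst_pfns_and_copy(src_pfns, dst_pfns, npages);
	if (err)
		return err;

	migrate_device_pages(src_pfns, dst_pfns, npages);
	migrate_device_finalize(src_pfns, dst_pfns, npages);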

Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 include/linux/migrate.h |  3 +++
 mm/migrate_device.c     | 53 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 56 insertions(+)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 644be30b69c8..e8cce05bf9c2 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -226,6 +226,9 @@ void migrate_vma_pages(struct migrate_vma *migrate);
 void migrate_vma_finalize(struct migrate_vma *migrate);
 int migrate_device_range(unsigned long *src_pfns, unsigned long start,
 			unsigned long npages);
+int migrate_device_vma_range(struct mm_struct *mm, void *pgmap_owner,
+			     unsigned long *src_pfns, unsigned long npages,
+			     unsigned long start);
 void migrate_device_pages(unsigned long *src_pfns, unsigned long *dst_pfns,
 			unsigned long npages);
 void migrate_device_finalize(unsigned long *src_pfns,
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 6d66dc1c6ffa..e25f12a132e8 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -920,6 +920,59 @@ int migrate_device_range(unsigned long *src_pfns, unsigned long start,
 }
 EXPORT_SYMBOL(migrate_device_range);
 
+/**
+ * migrate_device_vma_range() - migrate device private pfns to normal memory and
+ * trigger MMU invalidation.
+ * @mm: struct mm of device pages.
+ * @pgmap_owner: page group map owner of device pages.
+ * @src_pfns: pre-populated array of source device private pfns to migrate.
+ * @npages: number of pages to migrate.
+ * @start: VMA start of device pages.
+ *
+ * Similar to migrate_device_range() but supports a non-contiguous pre-populated
+ * array of device pages to migrate. Also triggers MMU invalidation. Useful in
+ * device memory eviction paths where a lock is held protecting the device pages
+ * but where the mmap lock cannot be taken due to a locking inversion (e.g.
+ * DRM drivers). Since the mmap lock is not required to be held, the MMU
+ * invalidation can race with the VMA start being repurposed; worst case, this
+ * would result in an unnecessary invalidation.
+ */
+int migrate_device_vma_range(struct mm_struct *mm, void *pgmap_owner,
+			     unsigned long *src_pfns, unsigned long npages,
+			     unsigned long start)
+{
+	struct mmu_notifier_range range;
+	unsigned long i;
+
+	mmu_notifier_range_init_owner(&range, MMU_NOTIFY_MIGRATE, 0,
+				      mm, start, start + npages * PAGE_SIZE,
+				      pgmap_owner);
+	mmu_notifier_invalidate_range_start(&range);
+
+	for (i = 0; i < npages; i++) {
+		struct page *page = pfn_to_page(src_pfns[i]);
+
+		if (!get_page_unless_zero(page)) {
+			src_pfns[i] = 0;
+			continue;
+		}
+
+		if (!trylock_page(page)) {
+			src_pfns[i] = 0;
+			put_page(page);
+			continue;
+		}
+
+		src_pfns[i] = migrate_pfn(src_pfns[i]) | MIGRATE_PFN_MIGRATE;
+	}
+
+	migrate_device_unmap(src_pfns, npages, NULL);
+	mmu_notifier_invalidate_range_end(&range);
+
+	return 0;
+}
+EXPORT_SYMBOL(migrate_device_vma_range);
+
 /*
  * Migrate a device coherent page back to normal memory. The caller should have
  * a reference on page which will be copied to the new page if migration is
-- 
2.34.1



* [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory
  2024-08-28  2:48 [RFC PATCH 00/28] Introduce GPU SVM and Xe SVM implementation Matthew Brost
                   ` (3 preceding siblings ...)
  2024-08-28  2:48 ` [RFC PATCH 04/28] mm/migrate: Add migrate_device_vma_range Matthew Brost
@ 2024-08-28  2:48 ` Matthew Brost
  2024-08-28 14:31   ` Daniel Vetter
                     ` (7 more replies)
  2024-08-28  2:48 ` [RFC PATCH 06/28] drm/xe/uapi: Add DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATON flag Matthew Brost
                   ` (26 subsequent siblings)
  31 siblings, 8 replies; 100+ messages in thread
From: Matthew Brost @ 2024-08-28  2:48 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: airlied, christian.koenig, thomas.hellstrom, matthew.auld, daniel

This patch introduces support for GPU Shared Virtual Memory (SVM) in the
Direct Rendering Manager (DRM) subsystem. SVM allows for seamless
sharing of memory between the CPU and GPU, enhancing performance and
flexibility in GPU computing tasks.

The patch adds the necessary infrastructure for SVM, including data
structures and functions for managing SVM ranges and notifiers. It also
provides mechanisms for allocating, deallocating, and migrating memory
regions between system RAM and GPU VRAM.

This mid-layer is largely inspired by GPUVM.
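
For reference, a minimal sketch of the driver-side hookup, implementing
only the required invalidate vfunc (driver_* names and the address range
values are placeholders):

	static void driver_invalidate(struct drm_gpusvm *gpusvm,
				      struct drm_gpusvm_notifier *notifier,
				      const struct mmu_notifier_range *mmu_range)
	{
		/* Zap GPU PTEs / queue ranges for the garbage collector; see
		 * the examples in the kernel doc below. */
	}

	static const struct drm_gpusvm_ops driver_gpusvm_ops = {
		.invalidate = driver_invalidate,
	};

	/* Chunk sizes in descending order, last entry SZ_4K */
	static const u64 driver_chunk_sizes[] = { SZ_2M, SZ_64K, SZ_4K };

	err = drm_gpusvm_init(&vm->svm, "driver-svm", drm, current->mm,
			      driver_device_private_owner, 0 /* mm_start */,
			      TASK_SIZE /* mm_range */, SZ_512M /* notifier size */,
			      &driver_gpusvm_ops, driver_chunk_sizes,
			      ARRAY_SIZE(driver_chunk_sizes));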

Cc: Dave Airlie <airlied@redhat.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: <dri-devel@lists.freedesktop.org>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/Makefile     |    3 +-
 drivers/gpu/drm/xe/drm_gpusvm.c | 2174 +++++++++++++++++++++++++++++++
 drivers/gpu/drm/xe/drm_gpusvm.h |  415 ++++++
 3 files changed, 2591 insertions(+), 1 deletion(-)
 create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.c
 create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.h

diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
index b9670ae09a9e..b8fc2ee58f1a 100644
--- a/drivers/gpu/drm/xe/Makefile
+++ b/drivers/gpu/drm/xe/Makefile
@@ -25,7 +25,8 @@ $(obj)/generated/%_wa_oob.c $(obj)/generated/%_wa_oob.h: $(obj)/xe_gen_wa_oob \
 
 # core driver code
 
-xe-y += xe_bb.o \
+xe-y += drm_gpusvm.o \
+	xe_bb.o \
 	xe_bo.o \
 	xe_bo_evict.o \
 	xe_devcoredump.o \
diff --git a/drivers/gpu/drm/xe/drm_gpusvm.c b/drivers/gpu/drm/xe/drm_gpusvm.c
new file mode 100644
index 000000000000..fc1e44e6ae72
--- /dev/null
+++ b/drivers/gpu/drm/xe/drm_gpusvm.c
@@ -0,0 +1,2174 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2024 Intel Corporation
+ *
+ * Authors:
+ *     Matthew Brost <matthew.brost@intel.com>
+ */
+
+#include <linux/dma-mapping.h>
+#include <linux/interval_tree_generic.h>
+#include <linux/hmm.h>
+#include <linux/memremap.h>
+#include <linux/migrate.h>
+#include <linux/mm_types.h>
+#include <linux/pagemap.h>
+#include <linux/slab.h>
+
+#include <drm/drm_device.h>
+#include "drm_gpusvm.h"
+
+/**
+ * DOC: Overview
+ *
+ * GPU Shared Virtual Memory (GPU SVM) layer for the Direct Rendering Manager (DRM)
+ *
+ * The GPU SVM layer is a component of the DRM framework designed to manage shared
+ * virtual memory between the CPU and GPU. It enables efficient data exchange and
+ * processing for GPU-accelerated applications by allowing memory sharing and
+ * synchronization between the CPU's and GPU's virtual address spaces.
+ *
+ * Key GPU SVM Components:
+ * - Notifiers: Used for tracking memory intervals and notifying the
+ *		GPU of changes, notifiers are sized based on a GPU SVM
+ *		initialization parameter, with a recommendation of 512M or
+ *		larger. They maintain a Red-Black tree and a list of ranges that
+ *		fall within the notifier interval. Notifiers are tracked within
+ *		a GPU SVM Red-Black tree and list and are dynamically inserted
+ *		or removed as ranges within the interval are created or
+ *		destroyed.
+ * - Ranges: Represent memory ranges mapped in a DRM device and managed
+ *	     by GPU SVM. They are sized based on an array of chunk sizes, which
+ *	     is a GPU SVM initialization parameter, and the CPU address space.
+ *	     Upon GPU fault, the largest aligned chunk that fits within the
+ *	     faulting CPU address space is chosen for the range size. Ranges are
+ *	     expected to be dynamically allocated on GPU fault and removed on an
+ *	     MMU notifier UNMAP event. As mentioned above, ranges are tracked in
+ *	     a notifier's Red-Black tree.
+ * - Operations: Define the interface for driver-specific SVM operations such as
+ *		 allocation, page collection, migration, invalidations, and VRAM
+ *		 release.
+ *
+ * This layer provides interfaces for allocating, mapping, migrating, and
+ * releasing memory ranges between the CPU and GPU. It handles all core memory
+ * management interactions (DMA mapping, HMM, and migration) and provides
+ * driver-specific virtual functions (vfuncs). This infrastructure is sufficient
+ * to build the expected driver components for an SVM implementation as detailed
+ * below.
+ *
+ * Expected Driver Components:
+ * - GPU page fault handler: Used to create ranges and notifiers based on the
+ *			     fault address, optionally migrate the range to
+ *			     VRAM, and create GPU bindings.
+ * - Garbage collector: Used to destroy GPU bindings for ranges. Ranges are
+ *			expected to be added to the garbage collector upon
+ *			MMU_NOTIFY_UNMAP event.
+ */
+
+/**
+ * DOC: Locking
+ *
+ * GPU SVM handles locking for core MM interactions, i.e., it locks/unlocks the
+ * mmap lock as needed. Alternatively, if the driver prefers to handle the mmap
+ * lock itself, a 'locked' argument is provided to the functions that require
+ * the mmap lock. This option may be useful for drivers that need to call into
+ * GPU SVM while also holding a dma-resv lock, thus preventing locking
+ * inversions between the mmap and dma-resv locks.
+ *
+ * GPU SVM introduces a global notifier lock, which safeguards the notifier's
+ * range RB tree and list, as well as the range's DMA mappings and sequence
+ * number. GPU SVM manages all necessary locking and unlocking operations,
+ * except for the recheck of the range's sequence number
+ * (mmu_interval_read_retry) when the driver is committing GPU bindings. This
+ * lock corresponds to the 'driver->update' lock mentioned in the HMM
+ * documentation (TODO: Link). Future revisions may transition from a GPU SVM
+ * global lock to a per-notifier lock if finer-grained locking is deemed
+ * necessary.
+ *
+ * In addition to the locking mentioned above, the driver should implement a
+ * lock to safeguard core GPU SVM function calls that modify state, such as
+ * drm_gpusvm_range_find_or_insert and drm_gpusvm_range_remove. Alternatively,
+ * these core functions can be called within a single kernel thread, for
+ * instance, using an ordered work queue. This lock is denoted as
+ * 'driver_svm_lock' in code examples.
+ */
+
+/**
+ * DOC: Migration
+ *
+ * The migration support is quite simple, allowing migration between SRAM and
+ * VRAM at the range granularity. For example, GPU SVM currently does not
+ * support mixing SRAM and VRAM pages within a range. This means that upon GPU
+ * fault, the entire range can be migrated to VRAM, and upon CPU fault, the
+ * entire range is migrated to SRAM.
+ *
+ * The reasoning for only supporting range granularity is as follows: it
+ * simplifies the implementation, and range sizes are driver-defined and should
+ * be relatively small.
+ */
+
+/**
+ * DOC: Partial Unmapping of Ranges
+ *
+ * Partial unmapping of ranges (e.g., 1M out of 2M is unmapped by CPU resulting
+ * in MMU_NOTIFY_UNMAP event) presents several challenges, with the main one
+ * being that a subset of the range still has CPU and GPU mappings. If the
+ * backing store for the range is in VRAM, a subset of the backing store has
+ * references. One option would be to split the range and VRAM backing store,
+ * but the implementation for this would be quite complicated. Given that
+ * partial unmappings are rare and driver-defined range sizes are relatively
+ * small, GPU SVM does not support splitting of ranges.
+ *
+ * With no support for range splitting, upon partial unmapping of a range, the
+ * driver is expected to invalidate and destroy the entire range. If the range
+ * has VRAM as its backing, the driver is also expected to migrate any remaining
+ * pages back to SRAM.
+ */
+
+/**
+ * DOC: Examples
+ *
+ * This section provides two examples of how to build the expected driver
+ * components: the GPU page fault handler and the garbage collector. A third
+ * example demonstrates a sample invalidation driver vfunc.
+ *
+ * The generic code provided does not include logic for complex migration
+ * policies, optimized invalidations, or other potentially required driver
+ * locking (e.g., DMA-resv locks).
+ *
+ * 1) GPU page fault handler
+ *
+ *	int driver_bind_range(struct drm_gpusvm *gpusvm, struct drm_gpusvm_range *range)
+ *	{
+ *		int err = 0;
+ *
+ *		driver_alloc_and_setup_memory_for_bind(gpusvm, range);
+ *
+ *		drm_gpusvm_notifier_lock(gpusvm);
+ *		if (drm_gpusvm_range_pages_valid(range))
+ *			driver_commit_bind(gpusvm, range);
+ *		else
+ *			err = -EAGAIN;
+ *		drm_gpusvm_notifier_unlock(gpusvm);
+ *
+ *		return err;
+ *	}
+ *
+ *	int driver_gpu_fault(struct drm_gpusvm *gpusvm, u64 fault_addr,
+ *			     u64 gpuva_start, u64 gpuva_end)
+ *	{
+ *		struct drm_gpusvm_ctx ctx = {};
+ *		int err;
+ *
+ *		driver_svm_lock();
+ *	retry:
+ *		// Always process UNMAPs first so view of GPU SVM ranges is current
+ *		driver_garbage_collector(gpusvm);
+ *
+ *		range = drm_gpusvm_range_find_or_insert(gpusvm, fault_addr,
+ *							gpuva_start, gpuva_end,
+ *						        &ctx);
+ *		if (IS_ERR(range)) {
+ *			err = PTR_ERR(range);
+ *			goto unlock;
+ *		}
+ *
+ *		if (driver_migration_policy(range)) {
+ *			bo = driver_alloc_bo();
+ *			err = drm_gpusvm_migrate_to_vram(gpusvm, range, bo, &ctx);
+ *			if (err)	// CPU mappings may have changed
+ *				goto retry;
+ *		}
+ *
+ *		err = drm_gpusvm_range_get_pages(gpusvm, range, &ctx);
+ *		if (err == -EFAULT || err == -EPERM)	// CPU mappings changed
+ *			goto retry;
+ *		else if (err)
+ *			goto unlock;
+ *
+ *		err = driver_bind_range(gpusvm, range);
+ *		if (err == -EAGAIN)	// CPU mappings changed
+ *			goto retry
+ *
+ *	unlock:
+ *		driver_svm_unlock();
+ *		return err;
+ *	}
+ *
+ * 2) Garbage Collector.
+ *
+ *	void __driver_garbage_collector(struct drm_gpusvm *gpusvm,
+ *					struct drm_gpusvm_range *range)
+ *	{
+ *		struct drm_gpusvm_ctx ctx = {};
+ *
+ *		assert_driver_svm_locked(gpusvm);
+ *
+ *		// Partial unmap, migrate any remaining VRAM pages back to SRAM
+ *		if (range->flags.partial_unmap)
+ *			drm_gpusvm_migrate_to_sram(gpusvm, range, &ctx);
+ *
+ *		driver_unbind_range(range);
+ *		drm_gpusvm_range_remove(gpusvm, range);
+ *	}
+ *
+ *	void driver_garbage_collector(struct drm_gpusvm *gpusvm)
+ *	{
+ *		assert_driver_svm_locked(gpusvm);
+ *
+ *		for_each_range_in_garbage_collector(gpusvm, range)
+ *			__driver_garbage_collector(gpusvm, range);
+ *	}
+ *
+ * 3) Invalidation driver vfunc.
+ *
+ *	void driver_invalidation(struct drm_gpusvm *gpusvm,
+ *				 struct drm_gpusvm_notifier *notifier,
+ *				 const struct mmu_notifier_range *mmu_range)
+ *	{
+ *		struct drm_gpusvm_ctx ctx = { .in_notifier = true, };
+ *		struct drm_gpusvm_range *range = NULL;
+ *
+ *		driver_invalidate_device_tlb(gpusvm, mmu_range->start, mmu_range->end);
+ *
+ *		drm_gpusvm_for_each_range(range, notifier, mmu_range->start,
+ *					  mmu_range->end) {
+ *			drm_gpusvm_range_unmap_pages(gpusvm, range, &ctx);
+ *
+ *			if (mmu_range->event != MMU_NOTIFY_UNMAP)
+ *				continue;
+ *
+ *			drm_gpusvm_range_set_unmapped(range, mmu_range);
+ *			driver_garbage_collector_add(gpusvm, range);
+ *		}
+ *	}
+ */
+
+#define DRM_GPUSVM_RANGE_START(_range)	((_range)->va.start)
+#define DRM_GPUSVM_RANGE_END(_range)	((_range)->va.end - 1)
+INTERVAL_TREE_DEFINE(struct drm_gpusvm_range, rb.node, u64, rb.__subtree_last,
+		     DRM_GPUSVM_RANGE_START, DRM_GPUSVM_RANGE_END,
+		     static __maybe_unused, range);
+
+#define DRM_GPUSVM_NOTIFIER_START(_notifier)	((_notifier)->interval.start)
+#define DRM_GPUSVM_NOTIFIER_END(_notifier)	((_notifier)->interval.end - 1)
+INTERVAL_TREE_DEFINE(struct drm_gpusvm_notifier, rb.node, u64,
+		     rb.__subtree_last, DRM_GPUSVM_NOTIFIER_START,
+		     DRM_GPUSVM_NOTIFIER_END, static __maybe_unused, notifier);
+
+/**
+ * npages_in_range() - Calculate the number of pages in a given range
+ * @start__: The start address of the range
+ * @end__: The end address of the range
+ *
+ * This macro calculates the number of pages in a given memory range,
+ * specified by the start and end addresses. It divides the difference
+ * between the end and start addresses by the page size (PAGE_SIZE) to
+ * determine the number of pages in the range.
+ *
+ * Return: The number of pages in the specified range.
+ */
+#define npages_in_range(start__, end__)	\
+	(((end__) - (start__)) >> PAGE_SHIFT)
+
+/**
+ * struct drm_gpusvm_zdd - GPU SVM zone device data
+ *
+ * @refcount: Reference count for the zdd
+ * @destroy_work: Work structure for asynchronous zdd destruction
+ * @range: Pointer to the GPU SVM range
+ * @vram_allocation: Driver-private pointer to the VRAM allocation
+ *
+ * This structure serves as a generic wrapper installed in
+ * page->zone_device_data. It provides infrastructure for looking up a range
+ * upon CPU page fault and asynchronously releasing VRAM once the CPU has no
+ * page references. Asynchronous release is useful because CPU page references
+ * can be dropped in IRQ contexts, while releasing VRAM likely requires sleeping
+ * locks.
+ */
+struct drm_gpusvm_zdd {
+	struct kref refcount;
+	struct work_struct destroy_work;
+	struct drm_gpusvm_range *range;
+	void *vram_allocation;
+};
+
+/**
+ * drm_gpusvm_zdd_destroy_work_func - Work function for destroying a zdd
+ * @w: Pointer to the work_struct
+ *
+ * This function releases VRAM, puts GPU SVM range, and frees zdd.
+ */
+static void drm_gpusvm_zdd_destroy_work_func(struct work_struct *w)
+{
+	struct drm_gpusvm_zdd *zdd =
+		container_of(w, struct drm_gpusvm_zdd, destroy_work);
+	struct drm_gpusvm_range *range = zdd->range;
+	struct drm_gpusvm *gpusvm = range->gpusvm;
+
+	if (gpusvm->ops->vram_release && zdd->vram_allocation)
+		gpusvm->ops->vram_release(zdd->vram_allocation);
+	drm_gpusvm_range_put(range);
+	kfree(zdd);
+}
+
+/**
+ * drm_gpusvm_zdd_alloc - Allocate a zdd structure.
+ * @range: Pointer to the GPU SVM range.
+ *
+ * This function allocates and initializes a new zdd structure. It sets up the
+ * reference count, initializes the destroy work, and links the provided GPU SVM
+ * range.
+ *
+ * Returns:
+ * Pointer to the allocated zdd on success, ERR_PTR() on failure.
+ */
+static struct drm_gpusvm_zdd *
+drm_gpusvm_zdd_alloc(struct drm_gpusvm_range *range)
+{
+	struct drm_gpusvm_zdd *zdd;
+
+	zdd = kmalloc(sizeof(*zdd), GFP_KERNEL);
+	if (!zdd)
+		return NULL;
+
+	kref_init(&zdd->refcount);
+	INIT_WORK(&zdd->destroy_work, drm_gpusvm_zdd_destroy_work_func);
+	zdd->range = drm_gpusvm_range_get(range);
+	zdd->vram_allocation = NULL;
+
+	return zdd;
+}
+
+/**
+ * drm_gpusvm_zdd_get - Get a reference to a zdd structure.
+ * @zdd: Pointer to the zdd structure.
+ *
+ * This function increments the reference count of the provided zdd structure.
+ *
+ * Returns: Pointer to the zdd structure.
+ */
+static struct drm_gpusvm_zdd *drm_gpusvm_zdd_get(struct drm_gpusvm_zdd *zdd)
+{
+	kref_get(&zdd->refcount);
+	return zdd;
+}
+
+/**
+ * drm_gpusvm_zdd_destroy - Destroy a zdd structure.
+ * @ref: Pointer to the reference count structure.
+ *
+ * This function queues the destroy_work of the zdd for asynchronous destruction.
+ */
+static void drm_gpusvm_zdd_destroy(struct kref *ref)
+{
+	struct drm_gpusvm_zdd *zdd =
+		container_of(ref, struct drm_gpusvm_zdd, refcount);
+	struct drm_gpusvm *gpusvm = zdd->range->gpusvm;
+
+	queue_work(gpusvm->zdd_wq, &zdd->destroy_work);
+}
+
+/**
+ * drm_gpusvm_zdd_put - Put a zdd reference.
+ * @zdd: Pointer to the zdd structure.
+ *
+ * This function decrements the reference count of the provided zdd structure
+ * and schedules its destruction if the count drops to zero.
+ */
+static void drm_gpusvm_zdd_put(struct drm_gpusvm_zdd *zdd)
+{
+	kref_put(&zdd->refcount, drm_gpusvm_zdd_destroy);
+}
+
+/**
+ * drm_gpusvm_range_find - Find GPU SVM range from GPU SVM notifier
+ * @notifier: Pointer to the GPU SVM notifier structure.
+ * @start: Start address of the range
+ * @end: End address of the range
+ *
+ * Return: A pointer to the drm_gpusvm_range if found or NULL
+ */
+struct drm_gpusvm_range *
+drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64 start, u64 end)
+{
+	return range_iter_first(&notifier->root, start, end - 1);
+}
+
+/**
+ * drm_gpusvm_for_each_range_safe - Safely iterate over GPU SVM ranges in a notifier
+ * @range__: Iterator variable for the ranges
+ * @next__: Iterator variable for the ranges temporary storage
+ * @notifier__: Pointer to the GPU SVM notifier
+ * @start__: Start address of the range
+ * @end__: End address of the range
+ *
+ * This macro is used to iterate over GPU SVM ranges in a notifier while
+ * removing ranges from it.
+ */
+#define drm_gpusvm_for_each_range_safe(range__, next__, notifier__, start__, end__)	\
+	for ((range__) = drm_gpusvm_range_find((notifier__), (start__), (end__)),	\
+	     (next__) = __drm_gpusvm_range_next(range__);				\
+	     (range__) && (range__->va.start < (end__));				\
+	     (range__) = (next__), (next__) = __drm_gpusvm_range_next(range__))
+
+/**
+ * __drm_gpusvm_notifier_next - get the next drm_gpusvm_notifier in the list
+ * @notifier: a pointer to the current drm_gpusvm_notifier
+ *
+ * Return: A pointer to the next drm_gpusvm_notifier if available, or NULL if
+ *         the current notifier is the last one or if the input notifier is
+ *         NULL.
+ */
+static struct drm_gpusvm_notifier *
+__drm_gpusvm_notifier_next(struct drm_gpusvm_notifier *notifier)
+{
+	if (notifier && !list_is_last(&notifier->rb.entry,
+				      &notifier->gpusvm->notifier_list))
+		return list_next_entry(notifier, rb.entry);
+
+	return NULL;
+}
+
+/**
+ * drm_gpusvm_for_each_notifier - Iterate over GPU SVM notifiers in a gpusvm
+ * @notifier__: Iterator variable for the notifiers
+ * @gpusvm__: Pointer to the GPU SVM structure
+ * @start__: Start address of the notifier
+ * @end__: End address of the notifier
+ *
+ * This macro is used to iterate over GPU SVM notifiers in a gpusvm.
+ */
+#define drm_gpusvm_for_each_notifier(notifier__, gpusvm__, start__, end__)		\
+	for ((notifier__) = notifier_iter_first(&(gpusvm__)->root, (start__), (end__) - 1);	\
+	     (notifier__) && (notifier__->interval.start < (end__));			\
+	     (notifier__) = __drm_gpusvm_notifier_next(notifier__))
+
+/**
+ * drm_gpusvm_for_each_notifier_safe - Safely iterate over GPU SVM notifiers in a gpusvm
+ * @notifier__: Iterator variable for the notifiers
+ * @next__: Iterator variable for the notifiers temporary storage
+ * @gpusvm__: Pointer to the GPU SVM structure
+ * @start__: Start address of the notifier
+ * @end__: End address of the notifier
+ *
+ * This macro is used to iterate over GPU SVM notifiers in a gpusvm while
+ * removing notifiers from it.
+ */
+#define drm_gpusvm_for_each_notifier_safe(notifier__, next__, gpusvm__, start__, end__)	\
+	for ((notifier__) = notifier_iter_first(&(gpusvm__)->root, (start__), (end__) - 1),	\
+	     (next__) = __drm_gpusvm_notifier_next(notifier__);				\
+	     (notifier__) && (notifier__->interval.start < (end__));			\
+	     (notifier__) = (next__), (next__) = __drm_gpusvm_notifier_next(notifier__))
+
+/**
+ * drm_gpusvm_notifier_invalidate - Invalidate a GPU SVM notifier.
+ * @mni: Pointer to the mmu_interval_notifier structure.
+ * @mmu_range: Pointer to the mmu_notifier_range structure.
+ * @cur_seq: Current sequence number.
+ *
+ * This function serves as a generic MMU notifier for GPU SVM. It sets the MMU
+ * notifier sequence number and calls the driver invalidate vfunc under
+ * gpusvm->notifier_lock.
+ *
+ * Returns:
+ * true if the operation succeeds, false otherwise.
+ */
+static bool
+drm_gpusvm_notifier_invalidate(struct mmu_interval_notifier *mni,
+			       const struct mmu_notifier_range *mmu_range,
+			       unsigned long cur_seq)
+{
+	struct drm_gpusvm_notifier *notifier =
+		container_of(mni, typeof(*notifier), notifier);
+	struct drm_gpusvm *gpusvm = notifier->gpusvm;
+
+	if (!mmu_notifier_range_blockable(mmu_range))
+		return false;
+
+	down_write(&gpusvm->notifier_lock);
+	mmu_interval_set_seq(mni, cur_seq);
+	gpusvm->ops->invalidate(gpusvm, notifier, mmu_range);
+	up_write(&gpusvm->notifier_lock);
+
+	return true;
+}
+
+/**
+ * drm_gpusvm_notifier_ops - MMU interval notifier operations for GPU SVM
+ */
+static const struct mmu_interval_notifier_ops drm_gpusvm_notifier_ops = {
+	.invalidate = drm_gpusvm_notifier_invalidate,
+};
+
+/**
+ * drm_gpusvm_init - Initialize the GPU SVM.
+ * @gpusvm: Pointer to the GPU SVM structure.
+ * @name: Name of the GPU SVM.
+ * @drm: Pointer to the DRM device structure.
+ * @mm: Pointer to the mm_struct for the address space.
+ * @device_private_page_owner: Device private pages owner.
+ * @mm_start: Start address of GPU SVM.
+ * @mm_range: Range of the GPU SVM.
+ * @notifier_size: Size of individual notifiers.
+ * @ops: Pointer to the operations structure for GPU SVM.
+ * @chunk_sizes: Pointer to the array of chunk sizes used in range allocation.
+ *               Entries should be powers of 2 in descending order with last
+ *               entry being SZ_4K.
+ * @num_chunks: Number of chunks.
+ *
+ * This function initializes the GPU SVM.
+ *
+ * Returns:
+ * 0 on success, a negative error code on failure.
+ */
+int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
+		    const char *name, struct drm_device *drm,
+		    struct mm_struct *mm, void *device_private_page_owner,
+		    u64 mm_start, u64 mm_range, u64 notifier_size,
+		    const struct drm_gpusvm_ops *ops,
+		    const u64 *chunk_sizes, int num_chunks)
+{
+	if (!ops->invalidate || !num_chunks)
+		return -EINVAL;
+
+	gpusvm->name = name;
+	gpusvm->drm = drm;
+	gpusvm->mm = mm;
+	gpusvm->device_private_page_owner = device_private_page_owner;
+	gpusvm->mm_start = mm_start;
+	gpusvm->mm_range = mm_range;
+	gpusvm->notifier_size = notifier_size;
+	gpusvm->ops = ops;
+	gpusvm->chunk_sizes = chunk_sizes;
+	gpusvm->num_chunks = num_chunks;
+	gpusvm->zdd_wq = system_wq;
+
+	mmgrab(mm);
+	gpusvm->root = RB_ROOT_CACHED;
+	INIT_LIST_HEAD(&gpusvm->notifier_list);
+
+	init_rwsem(&gpusvm->notifier_lock);
+
+	fs_reclaim_acquire(GFP_KERNEL);
+	might_lock(&gpusvm->notifier_lock);
+	fs_reclaim_release(GFP_KERNEL);
+
+	return 0;
+}
+
+/**
+ * drm_gpusvm_notifier_find - Find GPU SVM notifier
+ * @gpusvm__: Pointer to the GPU SVM structure
+ * @fault_addr__: Fault address
+ *
+ * This macro finds the GPU SVM notifier associated with the fault address.
+ *
+ * Returns:
+ * Pointer to the GPU SVM notifier on success, NULL otherwise.
+ */
+#define drm_gpusvm_notifier_find(gpusvm__, fault_addr__)	\
+	notifier_iter_first(&(gpusvm__)->root, (fault_addr__),	\
+			    (fault_addr__ + 1))
+
+/**
+ * to_drm_gpusvm_notifier - retrieve the container struct for a given rbtree node
+ * @node__: a pointer to the rbtree node embedded within a drm_gpusvm_notifier struct
+ *
+ * Return: A pointer to the containing drm_gpusvm_notifier structure.
+ */
+#define to_drm_gpusvm_notifier(__node)				\
+	container_of((__node), struct drm_gpusvm_notifier, rb.node)
+
+/**
+ * drm_gpusvm_notifier_insert - Insert GPU SVM notifier
+ * @gpusvm: Pointer to the GPU SVM structure
+ * @notifier: Pointer to the GPU SVM notifier structure
+ *
+ * This function inserts the GPU SVM notifier into the GPU SVM RB tree and list.
+ */
+static void drm_gpusvm_notifier_insert(struct drm_gpusvm *gpusvm,
+				       struct drm_gpusvm_notifier *notifier)
+{
+	struct rb_node *node;
+	struct list_head *head;
+
+	notifier_insert(notifier, &gpusvm->root);
+
+	node = rb_prev(&notifier->rb.node);
+	if (node)
+		head = &(to_drm_gpusvm_notifier(node))->rb.entry;
+	else
+		head = &gpusvm->notifier_list;
+
+	list_add(&notifier->rb.entry, head);
+}
+
+/**
+ * drm_gpusvm_notifier_remove - Remove GPU SVM notifier
+ * @gpusvm__: Pointer to the GPU SVM structure
+ * @notifier__: Pointer to the GPU SVM notifier structure
+ *
+ * This macro removes the GPU SVM notifier from the GPU SVM RB tree and list.
+ */
+#define drm_gpusvm_notifier_remove(gpusvm__, notifier__)	\
+	notifier_remove((notifier__), &(gpusvm__)->root);	\
+	list_del(&(notifier__)->rb.entry)
+
+/**
+ * drm_gpusvm_fini - Finalize the GPU SVM.
+ * @gpusvm: Pointer to the GPU SVM structure.
+ *
+ * This function finalizes the GPU SVM by cleaning up any remaining ranges and
+ * notifiers, and dropping a reference to struct MM.
+ */
+void drm_gpusvm_fini(struct drm_gpusvm *gpusvm)
+{
+	struct drm_gpusvm_notifier *notifier, *next;
+
+	drm_gpusvm_for_each_notifier_safe(notifier, next, gpusvm, 0, LONG_MAX) {
+		struct drm_gpusvm_range *range, *__next;
+
+		/*
+		 * Remove notifier first to avoid racing with any invalidation
+		 */
+		mmu_interval_notifier_remove(&notifier->notifier);
+		notifier->flags.removed = true;
+
+		drm_gpusvm_for_each_range_safe(range, __next, notifier, 0,
+					       LONG_MAX)
+			drm_gpusvm_range_remove(gpusvm, range);
+	}
+
+	mmdrop(gpusvm->mm);
+	WARN_ON(!RB_EMPTY_ROOT(&gpusvm->root.rb_root));
+}
+
+/**
+ * drm_gpusvm_notifier_alloc - Allocate GPU SVM notifier
+ * @gpusvm: Pointer to the GPU SVM structure
+ * @fault_addr: Fault address
+ *
+ * This function allocates and initializes the GPU SVM notifier structure.
+ *
+ * Returns:
+ * Pointer to the allocated GPU SVM notifier on success, ERR_PTR() on failure.
+ */
+static struct drm_gpusvm_notifier *
+drm_gpusvm_notifier_alloc(struct drm_gpusvm *gpusvm, u64 fault_addr)
+{
+	struct drm_gpusvm_notifier *notifier;
+
+	if (gpusvm->ops->notifier_alloc)
+		notifier = gpusvm->ops->notifier_alloc();
+	else
+		notifier = kzalloc(sizeof(*notifier), GFP_KERNEL);
+
+	if (!notifier)
+		return ERR_PTR(-ENOMEM);
+
+	notifier->gpusvm = gpusvm;
+	notifier->interval.start = ALIGN_DOWN(fault_addr, gpusvm->notifier_size);
+	notifier->interval.end = ALIGN(fault_addr + 1, gpusvm->notifier_size);
+	INIT_LIST_HEAD(&notifier->rb.entry);
+	notifier->root = RB_ROOT_CACHED;
+	INIT_LIST_HEAD(&notifier->range_list);
+
+	return notifier;
+}
+
+/**
+ * drm_gpusvm_notifier_free - Free GPU SVM notifier
+ * @gpusvm: Pointer to the GPU SVM structure
+ * @notifier: Pointer to the GPU SVM notifier structure
+ *
+ * This function frees the GPU SVM notifier structure.
+ */
+static void drm_gpusvm_notifier_free(struct drm_gpusvm *gpusvm,
+				     struct drm_gpusvm_notifier *notifier)
+{
+	WARN_ON(!RB_EMPTY_ROOT(&notifier->root.rb_root));
+
+	if (gpusvm->ops->notifier_free)
+		gpusvm->ops->notifier_free(notifier);
+	else
+		kfree(notifier);
+}
+
+/**
+ * to_drm_gpusvm_range - retrieve the container struct for a given rbtree node
+ * @node__: a pointer to the rbtree node embedded within a drm_gpusvm_range struct
+ *
+ * Return: A pointer to the containing drm_gpusvm_range structure.
+ */
+#define to_drm_gpusvm_range(node__)	\
+	container_of((node__), struct drm_gpusvm_range, rb.node)
+
+/**
+ * drm_gpusvm_range_insert - Insert GPU SVM range
+ * @notifier: Pointer to the GPU SVM notifier structure
+ * @range: Pointer to the GPU SVM range structure
+ *
+ * This function inserts the GPU SVM range into the notifier RB tree and list.
+ */
+static void drm_gpusvm_range_insert(struct drm_gpusvm_notifier *notifier,
+				    struct drm_gpusvm_range *range)
+{
+	struct rb_node *node;
+	struct list_head *head;
+
+	drm_gpusvm_notifier_lock(notifier->gpusvm);
+	range_insert(range, &notifier->root);
+
+	node = rb_prev(&range->rb.node);
+	if (node)
+		head = &(to_drm_gpusvm_range(node))->rb.entry;
+	else
+		head = &notifier->range_list;
+
+	list_add(&range->rb.entry, head);
+	drm_gpusvm_notifier_unlock(notifier->gpusvm);
+}
+
+/**
+ * __drm_gpusvm_range_remove - Remove GPU SVM range
+ * @notifier__: Pointer to the GPU SVM notifier structure
+ * @range__: Pointer to the GPU SVM range structure
+ *
+ * This macro removes the GPU SVM range from the notifier RB tree and list.
+ */
+#define __drm_gpusvm_range_remove(notifier__, range__)		\
+	range_remove((range__), &(notifier__)->root);		\
+	list_del(&(range__)->rb.entry)
+
+/**
+ * drm_gpusvm_range_alloc - Allocate GPU SVM range
+ * @gpusvm: Pointer to the GPU SVM structure
+ * @notifier: Pointer to the GPU SVM notifier structure
+ * @fault_addr: Fault address
+ * @chunk_size: Chunk size
+ * @migrate_vram: Flag indicating whether to migrate VRAM
+ *
+ * This function allocates and initializes the GPU SVM range structure.
+ *
+ * Returns:
+ * Pointer to the allocated GPU SVM range on success, ERR_PTR() on failure.
+ */
+static struct drm_gpusvm_range *
+drm_gpusvm_range_alloc(struct drm_gpusvm *gpusvm,
+		       struct drm_gpusvm_notifier *notifier,
+		       u64 fault_addr, u64 chunk_size, bool migrate_vram)
+{
+	struct drm_gpusvm_range *range;
+
+	if (gpusvm->ops->range_alloc)
+		range = gpusvm->ops->range_alloc(gpusvm);
+	else
+		range = kzalloc(sizeof(*range), GFP_KERNEL);
+
+	if (!range)
+		return ERR_PTR(-ENOMEM);
+
+	kref_init(&range->refcount);
+	range->gpusvm = gpusvm;
+	range->notifier = notifier;
+	range->va.start = ALIGN_DOWN(fault_addr, chunk_size);
+	range->va.end = ALIGN(fault_addr + 1, chunk_size);
+	INIT_LIST_HEAD(&range->rb.entry);
+	range->notifier_seq = LONG_MAX;
+	range->flags.migrate_vram = migrate_vram ? 1 : 0;
+
+	return range;
+}
+
+/**
+ * drm_gpusvm_check_pages - Check pages
+ * @gpusvm: Pointer to the GPU SVM structure
+ * @notifier: Pointer to the GPU SVM notifier structure
+ * @start: Start address
+ * @end: End address
+ *
+ * Check if pages between start and end have been faulted in on the CPU. Used
+ * to prevent migration of pages without CPU backing store.
+ *
+ * Returns:
+ * True if pages have been faulted into CPU, False otherwise
+ */
+static bool drm_gpusvm_check_pages(struct drm_gpusvm *gpusvm,
+				   struct drm_gpusvm_notifier *notifier,
+				   u64 start, u64 end)
+{
+	struct hmm_range hmm_range = {
+		.default_flags = 0,
+		.notifier = &notifier->notifier,
+		.start = start,
+		.end = end,
+		.dev_private_owner = gpusvm->device_private_page_owner,
+	};
+	unsigned long timeout =
+		jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
+	unsigned long *pfns;
+	unsigned long npages = npages_in_range(start, end);
+	int err, i;
+
+	mmap_assert_locked(gpusvm->mm);
+
+	pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
+	if (!pfns)
+		return false;
+
+	hmm_range.notifier_seq = mmu_interval_read_begin(&notifier->notifier);
+	hmm_range.hmm_pfns = pfns;
+
+	while (true) {
+		err = hmm_range_fault(&hmm_range);
+		if (err == -EBUSY) {
+			if (time_after(jiffies, timeout))
+				break;
+
+			hmm_range.notifier_seq = mmu_interval_read_begin(&notifier->notifier);
+			continue;
+		}
+		break;
+	}
+	if (err)
+		goto err_free;
+
+	for (i = 0; i < npages; ++i) {
+		if (!(pfns[i] & HMM_PFN_VALID)) {
+			err = -EFAULT;
+			goto err_free;
+		}
+	}
+
+err_free:
+	kvfree(pfns);
+	return err ? false : true;
+}
+
+/**
+ * drm_gpusvm_range_chunk_size - Determine chunk size for GPU SVM range
+ * @gpusvm: Pointer to the GPU SVM structure
+ * @notifier: Pointer to the GPU SVM notifier structure
+ * @vas: Pointer to the virtual memory area structure
+ * @fault_addr: Fault address
+ * @gpuva_start: Start address of GPUVA which mirrors CPU
+ * @gpuva_end: End address of GPUVA which mirrors CPU
+ * @check_pages: Flag indicating whether to check pages
+ *
+ * This function determines the chunk size for the GPU SVM range based on the
+ * fault address, GPU SVM chunk sizes, existing GPU SVM ranges, and the virtual
+ * memory area boundaries.
+ *
+ * Returns:
+ * Chunk size on success, LONG_MAX on failure.
+ */
+static u64 drm_gpusvm_range_chunk_size(struct drm_gpusvm *gpusvm,
+				       struct drm_gpusvm_notifier *notifier,
+				       struct vm_area_struct *vas,
+				       u64 fault_addr, u64 gpuva_start,
+				       u64 gpuva_end, bool check_pages)
+{
+	u64 start, end;
+	int i = 0;
+
+retry:
+	for (; i < gpusvm->num_chunks; ++i) {
+		start = ALIGN_DOWN(fault_addr, gpusvm->chunk_sizes[i]);
+		end = ALIGN(fault_addr + 1, gpusvm->chunk_sizes[i]);
+
+		if (start >= vas->vm_start && end <= vas->vm_end &&
+		    start >= notifier->interval.start &&
+		    end <= notifier->interval.end &&
+		    start >= gpuva_start && end <= gpuva_end)
+			break;
+	}
+
+	if (i == gpusvm->num_chunks)
+		return LONG_MAX;
+
+	/*
+	 * If the allocation is more than a page, ensure it does not overlap
+	 * with existing ranges.
+	 */
+	if (end - start != SZ_4K) {
+		struct drm_gpusvm_range *range;
+
+		range = drm_gpusvm_range_find(notifier, start, end);
+		if (range) {
+			++i;
+			goto retry;
+		}
+
+		/*
+		 * XXX: Only create range on pages CPU has faulted in. Without
+		 * this check, or prefault, on BMG 'xe_exec_system_allocator --r
+		 * process-many-malloc' fails. In the failure case, each process
+		 * mallocs 16k but the CPU VMA is ~128k which results in 64k SVM
+		 * ranges. When migrating the SVM ranges, some processes fail in
+		 * drm_gpusvm_migrate_to_vram with 'migrate.cpages != npages'
+		 * and then upon drm_gpusvm_range_get_pages device pages from
+		 * other processes are collected + faulted in which creates all
+		 * sorts of problems. Unsure exactly how this is happening; the
+		 * problem also goes away if 'xe_exec_system_allocator --r
+		 * process-many-malloc' mallocs at least 64k at a time.
+		 */
+		if (check_pages &&
+		    !drm_gpusvm_check_pages(gpusvm, notifier, start, end)) {
+			++i;
+			goto retry;
+		}
+	}
+
+	return end - start;
+}
+
+/**
+ * drm_gpusvm_range_find_or_insert - Find or insert GPU SVM range
+ * @gpusvm: Pointer to the GPU SVM structure
+ * @fault_addr: Fault address
+ * @gpuva_start: Start address of GPUVA which mirrors CPU
+ * @gpuva_end: End address of GPUVA which mirrors CPU
+ * @ctx: GPU SVM context
+ *
+ * This function finds or inserts a newly allocated GPU SVM range based on the
+ * fault address. Caller must hold a lock to protect range lookup and insertion.
+ *
+ * Returns:
+ * Pointer to the GPU SVM range on success, ERR_PTR() on failure.
+ */
+struct drm_gpusvm_range *
+drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64 fault_addr,
+				u64 gpuva_start, u64 gpuva_end,
+				const struct drm_gpusvm_ctx *ctx)
+{
+	struct drm_gpusvm_notifier *notifier;
+	struct drm_gpusvm_range *range;
+	struct mm_struct *mm = gpusvm->mm;
+	struct vm_area_struct *vas;
+	bool notifier_alloc = false;
+	u64 chunk_size;
+	int err;
+	bool migrate_vram;
+
+	if (fault_addr < gpusvm->mm_start ||
+	    fault_addr > gpusvm->mm_start + gpusvm->mm_range) {
+		err = -EINVAL;
+		goto err_out;
+	}
+
+	if (!ctx->mmap_locked) {
+		if (!mmget_not_zero(mm)) {
+			err = -EFAULT;
+			goto err_out;
+		}
+		mmap_write_lock(mm);
+	}
+
+	mmap_assert_write_locked(mm);
+
+	notifier = drm_gpusvm_notifier_find(gpusvm, fault_addr);
+	if (!notifier) {
+		notifier = drm_gpusvm_notifier_alloc(gpusvm, fault_addr);
+		if (IS_ERR(notifier)) {
+			err = PTR_ERR(notifier);
+			goto err_mmunlock;
+		}
+		notifier_alloc = true;
+		err = mmu_interval_notifier_insert_locked(&notifier->notifier,
+							  mm, notifier->interval.start,
+							  notifier->interval.end -
+							  notifier->interval.start,
+							  &drm_gpusvm_notifier_ops);
+		if (err)
+			goto err_notifier;
+	}
+
+	vas = vma_lookup(mm, fault_addr);
+	if (!vas) {
+		err = -ENOENT;
+		goto err_notifier_remove;
+	}
+
+	if (!ctx->read_only && !(vas->vm_flags & VM_WRITE)) {
+		err = -EPERM;
+		goto err_notifier_remove;
+	}
+
+	range = drm_gpusvm_range_find(notifier, fault_addr, fault_addr + 1);
+	if (range)
+		goto out_mmunlock;
+	/*
+	 * XXX: Short-circuiting migration based on migrate_vma_* current
+	 * limitations. If/when migrate_vma_* add more support, this logic will
+	 * have to change.
+	 */
+	migrate_vram = ctx->vram_possible &&
+		vma_is_anonymous(vas) && !is_vm_hugetlb_page(vas);
+
+	chunk_size = drm_gpusvm_range_chunk_size(gpusvm, notifier, vas,
+						 fault_addr, gpuva_start,
+						 gpuva_end, migrate_vram &&
+						 !ctx->prefault);
+	if (chunk_size == LONG_MAX) {
+		err = -EINVAL;
+		goto err_notifier_remove;
+	}
+
+	range = drm_gpusvm_range_alloc(gpusvm, notifier, fault_addr, chunk_size,
+				       migrate_vram);
+	if (IS_ERR(range)) {
+		err = PTR_ERR(range);
+		goto err_notifier_remove;
+	}
+
+	drm_gpusvm_range_insert(notifier, range);
+	if (notifier_alloc)
+		drm_gpusvm_notifier_insert(gpusvm, notifier);
+
+	if (ctx->prefault) {
+		struct drm_gpusvm_ctx __ctx = *ctx;
+
+		__ctx.mmap_locked = true;
+		err = drm_gpusvm_range_get_pages(gpusvm, range, &__ctx);
+		if (err)
+			goto err_range_remove;
+	}
+
+out_mmunlock:
+	if (!ctx->mmap_locked) {
+		mmap_write_unlock(mm);
+		mmput(mm);
+	}
+
+	return range;
+
+err_range_remove:
+	__drm_gpusvm_range_remove(notifier, range);
+err_notifier_remove:
+	if (notifier_alloc)
+		mmu_interval_notifier_remove(&notifier->notifier);
+err_notifier:
+	if (notifier_alloc)
+		drm_gpusvm_notifier_free(gpusvm, notifier);
+err_mmunlock:
+	if (!ctx->mmap_locked) {
+		mmap_write_unlock(mm);
+		mmput(mm);
+	}
+err_out:
+	return ERR_PTR(err);
+}
+
+/**
+ * for_each_dma_page - iterate over pages in a DMA region
+ * @i__: the current page index in the iteration
+ * @j__: the current block index (one step per 2^@order__ pages) in the iteration
+ * @npages__: the total number of pages in the DMA region
+ * @order__: the order of the pages in the DMA region
+ *
+ * This macro iterates over the pages in a DMA region. The DMA region is
+ * assumed to be composed of blocks of 2^@order__ pages, and the macro steps
+ * through the region one block of 2^@order__ pages at a time.
+ */
+#define for_each_dma_page(i__, j__, npages__, order__)	\
+	for ((i__) = 0, (j__) = 0; (i__) < (npages__);	\
+	     (j__)++, (i__) += 0x1 << (order__))
+
+/**
+ * __drm_gpusvm_range_unmap_pages - Unmap pages associated with a GPU SVM range (internal)
+ * @gpusvm: Pointer to the GPU SVM structure
+ * @range: Pointer to the GPU SVM range structure
+ *
+ * This function unmaps pages associated with a GPU SVM range. Assumes and
+ * asserts correct locking is in place when called.
+ */
+static void __drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
+					   struct drm_gpusvm_range *range)
+{
+	lockdep_assert_held(&gpusvm->notifier_lock);
+
+	if (range->pages) {
+		unsigned long i, j, npages = npages_in_range(range->va.start,
+							     range->va.end);
+
+		if (range->flags.has_dma_mapping) {
+			for_each_dma_page(i, j, npages, range->order)
+				dma_unmap_page(gpusvm->drm->dev,
+					       range->dma_addr[j],
+					       PAGE_SIZE << range->order,
+					       DMA_BIDIRECTIONAL);
+		}
+
+		range->flags.has_vram_pages = false;
+		range->flags.has_dma_mapping = false;
+	}
+}
+
+/**
+ * drm_gpusvm_range_free_pages - Free pages associated with a GPU SVM range
+ * @gpusvm: Pointer to the GPU SVM structure
+ * @range: Pointer to the GPU SVM range structure
+ *
+ * This function frees pages associated with a GPU SVM range.
+ */
+static void drm_gpusvm_range_free_pages(struct drm_gpusvm *gpusvm,
+					struct drm_gpusvm_range *range)
+{
+	lockdep_assert_held(&gpusvm->notifier_lock);
+
+	if (range->pages) {
+		if (range->flags.kfree_mapping) {
+			kfree(range->dma_addr);
+			range->flags.kfree_mapping = false;
+			range->pages = NULL;
+		} else {
+			kvfree(range->pages);
+			range->pages = NULL;
+		}
+	}
+}
+
+/**
+ * drm_gpusvm_range_remove - Remove GPU SVM range
+ * @gpusvm: Pointer to the GPU SVM structure
+ * @range: Pointer to the GPU SVM range to be removed
+ *
+ * This function removes the specified GPU SVM range and also removes the parent
+ * GPU SVM notifier if no more ranges remain in the notifier. The caller must
+ * hold a lock to protect range and notifier removal.
+ */
+void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
+			     struct drm_gpusvm_range *range)
+{
+	struct drm_gpusvm_notifier *notifier;
+
+	notifier = drm_gpusvm_notifier_find(gpusvm, range->va.start);
+	if (WARN_ON_ONCE(!notifier))
+		return;
+
+	drm_gpusvm_notifier_lock(gpusvm);
+	__drm_gpusvm_range_unmap_pages(gpusvm, range);
+	drm_gpusvm_range_free_pages(gpusvm, range);
+	__drm_gpusvm_range_remove(notifier, range);
+	drm_gpusvm_notifier_unlock(gpusvm);
+
+	drm_gpusvm_range_put(range);
+
+	if (RB_EMPTY_ROOT(&notifier->root.rb_root)) {
+		if (!notifier->flags.removed)
+			mmu_interval_notifier_remove(&notifier->notifier);
+		drm_gpusvm_notifier_remove(gpusvm, notifier);
+		drm_gpusvm_notifier_free(gpusvm, notifier);
+	}
+}
+
+/**
+ * drm_gpusvm_range_get - Get a reference to GPU SVM range
+ * @range: Pointer to the GPU SVM range
+ *
+ * This function increments the reference count of the specified GPU SVM range.
+ *
+ * Returns:
+ * Pointer to the GPU SVM range.
+ */
+struct drm_gpusvm_range *
+drm_gpusvm_range_get(struct drm_gpusvm_range *range)
+{
+	kref_get(&range->refcount);
+
+	return range;
+}
+
+/**
+ * drm_gpusvm_range_destroy - Destroy GPU SVM range
+ * @refcount: Pointer to the reference counter embedded in the GPU SVM range
+ *
+ * This function destroys the specified GPU SVM range when its reference count
+ * reaches zero. If a custom range-free function is provided, it is invoked to
+ * free the range; otherwise, the range is deallocated using kfree().
+ */
+static void drm_gpusvm_range_destroy(struct kref *refcount)
+{
+	struct drm_gpusvm_range *range =
+		container_of(refcount, struct drm_gpusvm_range, refcount);
+	struct drm_gpusvm *gpusvm = range->gpusvm;
+
+	if (gpusvm->ops->range_free)
+		gpusvm->ops->range_free(range);
+	else
+		kfree(range);
+}
+
+/**
+ * drm_gpusvm_range_put - Put a reference to GPU SVM range
+ * @range: Pointer to the GPU SVM range
+ *
+ * This function decrements the reference count of the specified GPU SVM range
+ * and frees it when the count reaches zero.
+ */
+void drm_gpusvm_range_put(struct drm_gpusvm_range *range)
+{
+	kref_put(&range->refcount, drm_gpusvm_range_destroy);
+}
+
+/**
+ * drm_gpusvm_range_pages_valid - GPU SVM range pages valid
+ * @gpusvm: Pointer to the GPU SVM structure
+ * @range: Pointer to the GPU SVM range structure
+ *
+ * This function determines if a GPU SVM range's pages are valid. Expected to
+ * be called holding gpusvm->notifier_lock and as the last step before
+ * committing a GPU binding.
+ *
+ * Returns:
+ * True if GPU SVM range has valid pages, False otherwise
+ */
+bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
+				  struct drm_gpusvm_range *range)
+{
+	lockdep_assert_held(&gpusvm->notifier_lock);
+
+	return range->flags.has_vram_pages || range->flags.has_dma_mapping;
+}
+
+/**
+ * drm_gpusvm_range_pages_valid_unlocked - GPU SVM range pages valid unlocked
+ * @gpusvm: Pointer to the GPU SVM structure
+ * @range: Pointer to the GPU SVM range structure
+ *
+ * This function determines if a GPU SVM range's pages are valid. Expected to
+ * be called without holding gpusvm->notifier_lock.
+ *
+ * Returns:
+ * True if GPU SVM range has valid pages, False otherwise
+ */
+static bool
+drm_gpusvm_range_pages_valid_unlocked(struct drm_gpusvm *gpusvm,
+				      struct drm_gpusvm_range *range)
+{
+	bool pages_valid;
+
+	if (!range->pages)
+		return false;
+
+	drm_gpusvm_notifier_lock(gpusvm);
+	pages_valid = drm_gpusvm_range_pages_valid(gpusvm, range);
+	if (!pages_valid && range->flags.kfree_mapping) {
+		kfree(range->dma_addr);
+		range->flags.kfree_mapping = false;
+		range->pages = NULL;
+	}
+	drm_gpusvm_notifier_unlock(gpusvm);
+
+	return pages_valid;
+}
+
+/**
+ * drm_gpusvm_range_get_pages - Get pages for a GPU SVM range
+ * @gpusvm: Pointer to the GPU SVM structure
+ * @range: Pointer to the GPU SVM range structure
+ * @ctx: GPU SVM context
+ *
+ * This function gets pages for a GPU SVM range and ensures they are mapped for
+ * DMA access.
+ *
+ * Returns:
+ * 0 on success, negative error code on failure.
+ */
+int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
+			       struct drm_gpusvm_range *range,
+			       const struct drm_gpusvm_ctx *ctx)
+{
+	struct mmu_interval_notifier *notifier = &range->notifier->notifier;
+	struct hmm_range hmm_range = {
+		.default_flags = HMM_PFN_REQ_FAULT | (ctx->read_only ? 0 :
+			HMM_PFN_REQ_WRITE),
+		.notifier = notifier,
+		.start = range->va.start,
+		.end = range->va.end,
+		.dev_private_owner = gpusvm->device_private_page_owner,
+	};
+	struct mm_struct *mm = gpusvm->mm;
+	unsigned long timeout =
+		jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
+	unsigned long i, j;
+	unsigned long npages = npages_in_range(range->va.start, range->va.end);
+	unsigned int order = 0;
+	unsigned long *pfns;
+	struct page **pages;
+	int err = 0;
+	bool vram_pages = !!range->flags.migrate_vram;
+	bool alloc_pfns = false, kfree_mapping;
+
+retry:
+	kfree_mapping = false;
+	hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
+	if (drm_gpusvm_range_pages_valid_unlocked(gpusvm, range))
+		return 0;
+
+	if (range->notifier_seq == hmm_range.notifier_seq && range->pages) {
+		if (ctx->prefault)
+			return 0;
+
+		pfns = (unsigned long *)range->pages;
+		pages = range->pages;
+		goto map_pages;
+	}
+
+	if (!range->pages) {
+		pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
+		if (!pfns)
+			return -ENOMEM;
+		alloc_pfns = true;
+	} else {
+		pfns = (unsigned long *)range->pages;
+	}
+
+	if (!ctx->mmap_locked) {
+		if (!mmget_not_zero(mm)) {
+			err = -EFAULT;
+			goto err_out;
+		}
+	}
+
+	hmm_range.hmm_pfns = pfns;
+	while (true) {
+		/* Must be checked after mmu_interval_read_begin */
+		if (range->flags.unmapped) {
+			err = -EFAULT;
+			break;
+		}
+
+		if (!ctx->mmap_locked) {
+			/*
+			 * XXX: HMM locking document indicates only a read-lock
+			 * is required but there appears to be a window between
+			 * the MMU_NOTIFY_MIGRATE event triggered in a CPU fault
+			 * via migrate_vma_setup and the pages actually moving
+			 * in migrate_vma_finalize in which this code can grab
+			 * garbage pages. Grabbing the write-lock if the range
+			 * is attached to vram appears to protect against this
+			 * race.
+			 */
+			if (vram_pages)
+				mmap_write_lock(mm);
+			else
+				mmap_read_lock(mm);
+		}
+		err = hmm_range_fault(&hmm_range);
+		if (!ctx->mmap_locked) {
+			if (vram_pages)
+				mmap_write_unlock(mm);
+			else
+				mmap_read_unlock(mm);
+		}
+
+		if (err == -EBUSY) {
+			if (time_after(jiffies, timeout))
+				break;
+
+			hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
+			continue;
+		}
+		break;
+	}
+	if (!ctx->mmap_locked)
+		mmput(mm);
+	if (err)
+		goto err_free;
+
+	pages = (struct page **)pfns;
+
+	if (ctx->prefault) {
+		range->pages = pages;
+		goto set_seqno;
+	}
+
+map_pages:
+	if (is_device_private_page(hmm_pfn_to_page(pfns[0]))) {
+		WARN_ON_ONCE(!range->vram_allocation);
+
+		for (i = 0; i < npages; ++i) {
+			pages[i] = hmm_pfn_to_page(pfns[i]);
+
+			if (WARN_ON_ONCE(!is_device_private_page(pages[i]))) {
+				err = -EOPNOTSUPP;
+				goto err_free;
+			}
+		}
+
+		/* Do not race with notifier unmapping pages */
+		drm_gpusvm_notifier_lock(gpusvm);
+		range->flags.has_vram_pages = true;
+		range->pages = pages;
+		if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
+			err = -EAGAIN;
+			__drm_gpusvm_range_unmap_pages(gpusvm, range);
+		}
+		drm_gpusvm_notifier_unlock(gpusvm);
+	} else {
+		dma_addr_t *dma_addr = (dma_addr_t *)pfns;
+
+		for_each_dma_page(i, j, npages, order) {
+			if (WARN_ON_ONCE(i && order !=
+					 hmm_pfn_to_map_order(pfns[i]))) {
+				err = -EOPNOTSUPP;
+				npages = i;
+				goto err_unmap;
+			}
+			order = hmm_pfn_to_map_order(pfns[i]);
+
+			pages[j] = hmm_pfn_to_page(pfns[i]);
+			if (WARN_ON_ONCE(is_zone_device_page(pages[j]))) {
+				err = -EOPNOTSUPP;
+				npages = i;
+				goto err_unmap;
+			}
+
+			set_page_dirty_lock(pages[j]);
+			mark_page_accessed(pages[j]);
+
+			dma_addr[j] = dma_map_page(gpusvm->drm->dev,
+						   pages[j], 0,
+						   PAGE_SIZE << order,
+						   DMA_BIDIRECTIONAL);
+			if (dma_mapping_error(gpusvm->drm->dev, dma_addr[j])) {
+				err = -EFAULT;
+				npages = i;
+				goto err_unmap;
+			}
+		}
+
+		/* Huge pages, reduce memory footprint */
+		if (order) {
+			dma_addr = kmalloc_array(j, sizeof(*dma_addr),
+						 GFP_KERNEL);
+			if (dma_addr) {
+				for (i = 0; i < j; ++i)
+					dma_addr[i] = (dma_addr_t)pfns[i];
+				kvfree(pfns);
+				kfree_mapping = true;
+			} else {
+				dma_addr = (dma_addr_t *)pfns;
+			}
+		}
+
+		/* Do not race with notifier unmapping pages */
+		drm_gpusvm_notifier_lock(gpusvm);
+		range->order = order;
+		range->flags.kfree_mapping = kfree_mapping;
+		range->flags.has_dma_mapping = true;
+		range->dma_addr = dma_addr;
+		range->vram_allocation = NULL;
+		if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
+			err = -EAGAIN;
+			__drm_gpusvm_range_unmap_pages(gpusvm, range);
+		}
+		drm_gpusvm_notifier_unlock(gpusvm);
+	}
+
+	if (err == -EAGAIN)
+		goto retry;
+set_seqno:
+	range->notifier_seq = hmm_range.notifier_seq;
+
+	return 0;
+
+err_unmap:
+	for_each_dma_page(i, j, npages, order)
+		dma_unmap_page(gpusvm->drm->dev,
+			       (dma_addr_t)pfns[j],
+			       PAGE_SIZE << order, DMA_BIDIRECTIONAL);
+err_free:
+	if (alloc_pfns)
+		kvfree(pfns);
+err_out:
+	return err;
+}
+
+/**
+ * drm_gpusvm_range_unmap_pages - Unmap pages associated with a GPU SVM range
+ * @gpusvm: Pointer to the GPU SVM structure
+ * @range: Pointer to the GPU SVM range structure
+ * @ctx: GPU SVM context
+ *
+ * This function unmaps pages associated with a GPU SVM range. If @in_notifier
+ * is set, it is assumed that gpusvm->notifier_lock is held in write mode; if it
+ * is clear, it acquires gpusvm->notifier_lock in read mode. Must be called on
+ * each GPU SVM range attached to notifier in gpusvm->ops->invalidate for IOMMU
+ * security model.
+ */
+void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
+				  struct drm_gpusvm_range *range,
+				  const struct drm_gpusvm_ctx *ctx)
+{
+	if (ctx->in_notifier)
+		lockdep_assert_held_write(&gpusvm->notifier_lock);
+	else
+		drm_gpusvm_notifier_lock(gpusvm);
+
+	__drm_gpusvm_range_unmap_pages(gpusvm, range);
+
+	if (!ctx->in_notifier)
+		drm_gpusvm_notifier_unlock(gpusvm);
+}
+
+/**
+ * drm_gpusvm_migration_put_page - Put a migration page
+ * @page: Pointer to the page to put
+ *
+ * This function unlocks and puts a page.
+ */
+static void drm_gpusvm_migration_put_page(struct page *page)
+{
+	unlock_page(page);
+	put_page(page);
+}
+
+/**
+ * drm_gpusvm_migration_put_pages - Put migration pages
+ * @npages: Number of pages
+ * @migrate_pfn: Array of migrate page frame numbers
+ *
+ * This function puts an array of pages.
+ */
+static void drm_gpusvm_migration_put_pages(unsigned long npages,
+					   unsigned long *migrate_pfn)
+{
+	unsigned long i;
+
+	for (i = 0; i < npages; ++i) {
+		if (!migrate_pfn[i])
+			continue;
+
+		drm_gpusvm_migration_put_page(migrate_pfn_to_page(migrate_pfn[i]));
+		migrate_pfn[i] = 0;
+	}
+}
+
+/**
+ * drm_gpusvm_get_vram_page - Get a reference to a VRAM page
+ * @page: Pointer to the page
+ * @zdd: Pointer to the GPU SVM zone device data
+ *
+ * This function associates the given page with the specified GPU SVM zone
+ * device data and initializes it for zone device usage.
+ */
+static void drm_gpusvm_get_vram_page(struct page *page,
+				     struct drm_gpusvm_zdd *zdd)
+{
+	page->zone_device_data = drm_gpusvm_zdd_get(zdd);
+	zone_device_page_init(page);
+}
+
+/**
+ * drm_gpusvm_migrate_map_pages() - Map migration pages for GPU SVM migration
+ * @dev: The device for which the pages are being mapped
+ * @dma_addr: Array to store DMA addresses corresponding to mapped pages
+ * @migrate_pfn: Array of migrate page frame numbers to map
+ * @npages: Number of pages to map
+ * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
+ *
+ * This function maps pages of memory for migration usage in GPU SVM. It
+ * iterates over each page frame number provided in @migrate_pfn, maps the
+ * corresponding page, and stores the DMA address in the provided @dma_addr
+ * array.
+ *
+ * Return: 0 on success, -EFAULT if an error occurs during mapping.
+ */
+static int drm_gpusvm_migrate_map_pages(struct device *dev,
+					dma_addr_t *dma_addr,
+					long unsigned int *migrate_pfn,
+					unsigned long npages,
+					enum dma_data_direction dir)
+{
+	unsigned long i;
+
+	for (i = 0; i < npages; ++i) {
+		struct page *page = migrate_pfn_to_page(migrate_pfn[i]);
+
+		if (!page)
+			continue;
+
+		if (WARN_ON_ONCE(is_zone_device_page(page)))
+			return -EFAULT;
+
+		dma_addr[i] = dma_map_page(dev, page, 0, PAGE_SIZE, dir);
+		if (dma_mapping_error(dev, dma_addr[i]))
+			return -EFAULT;
+	}
+
+	return 0;
+}
+
+/**
+ * drm_gpusvm_migrate_unmap_pages() - Unmap pages previously mapped for GPU SVM migration
+ * @dev: The device for which the pages were mapped
+ * @dma_addr: Array of DMA addresses corresponding to mapped pages
+ * @npages: Number of pages to unmap
+ * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
+ *
+ * This function unmaps previously mapped pages of memory for GPU Shared Virtual
+ * Memory (SVM). It iterates over each DMA address provided in @dma_addr, checks
+ * if it's valid and not already unmapped, and unmaps the corresponding page.
+ */
+static void drm_gpusvm_migrate_unmap_pages(struct device *dev,
+					   dma_addr_t *dma_addr,
+					   unsigned long npages,
+					   enum dma_data_direction dir)
+{
+	unsigned long i;
+
+	for (i = 0; i < npages; ++i) {
+		if (!dma_addr[i] || dma_mapping_error(dev, dma_addr[i]))
+			continue;
+
+		dma_unmap_page(dev, dma_addr[i], PAGE_SIZE, dir);
+	}
+}
+
+/**
+ * drm_gpusvm_migrate_to_vram - Migrate GPU SVM range to VRAM
+ * @gpusvm: Pointer to the GPU SVM structure
+ * @range: Pointer to the GPU SVM range structure
+ * @vram_allocation: Driver-private pointer to the VRAM allocation. The caller
+ *                   should hold a reference to the VRAM allocation, which
+ *                   should be dropped via ops->vram_release or upon the
+ *                   failure of this function.
+ * @ctx: GPU SVM context
+ *
+ * This function migrates the specified GPU SVM range to VRAM. It performs the
+ * necessary setup and invokes the driver-specific operations for migration to
+ * VRAM. Upon successful return, @vram_allocation can safely reference @range
+ * until ops->vram_release is called, which only happens after this function
+ * returns successfully.
+ *
+ * Returns:
+ * 0 on success, negative error code on failure.
+ */
+int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
+			       struct drm_gpusvm_range *range,
+			       void *vram_allocation,
+			       const struct drm_gpusvm_ctx *ctx)
+{
+	u64 start = range->va.start, end = range->va.end;
+	struct migrate_vma migrate = {
+		.start		= start,
+		.end		= end,
+		.pgmap_owner	= gpusvm->device_private_page_owner,
+		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
+	};
+	struct mm_struct *mm = gpusvm->mm;
+	unsigned long i, npages = npages_in_range(start, end);
+	struct vm_area_struct *vas;
+	struct drm_gpusvm_zdd *zdd = NULL;
+	struct page **pages;
+	dma_addr_t *dma_addr;
+	void *buf;
+	int err;
+
+	if (!range->flags.migrate_vram)
+		return -EINVAL;
+
+	if (!gpusvm->ops->populate_vram_pfn || !gpusvm->ops->copy_to_vram ||
+	    !gpusvm->ops->copy_to_sram)
+		return -EOPNOTSUPP;
+
+	if (!ctx->mmap_locked) {
+		if (!mmget_not_zero(mm)) {
+			err = -EFAULT;
+			goto err_out;
+		}
+		mmap_write_lock(mm);
+	}
+
+	mmap_assert_locked(mm);
+
+	vas = vma_lookup(mm, start);
+	if (!vas) {
+		err = -ENOENT;
+		goto err_mmunlock;
+	}
+
+	if (end > vas->vm_end || start < vas->vm_start) {
+		err = -EINVAL;
+		goto err_mmunlock;
+	}
+
+	if (!vma_is_anonymous(vas)) {
+		err = -EBUSY;
+		goto err_mmunlock;
+	}
+
+	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) + sizeof(*dma_addr) +
+		       sizeof(*pages), GFP_KERNEL);
+	if (!buf) {
+		err = -ENOMEM;
+		goto err_mmunlock;
+	}
+	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
+	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr)) * npages;
+
+	zdd = drm_gpusvm_zdd_alloc(range);
+	if (!zdd) {
+		err = -ENOMEM;
+		goto err_free;
+	}
+
+	migrate.vma = vas;
+	migrate.src = buf;
+	migrate.dst = migrate.src + npages;
+
+	err = migrate_vma_setup(&migrate);
+	if (err)
+		goto err_free;
+
+	/*
+	 * FIXME: The cases below, !migrate.cpages and migrate.cpages != npages,
+	 * are not always an error. Need to revisit possible cases and how to
+	 * handle them. We could prefault on migrate.cpages != npages via
+	 * hmm_range_fault.
+	 */
+
+	if (!migrate.cpages) {
+		err = -EFAULT;
+		goto err_free;
+	}
+
+	if (migrate.cpages != npages) {
+		err = -EBUSY;
+		goto err_finalize;
+	}
+
+	err = gpusvm->ops->populate_vram_pfn(gpusvm, vram_allocation, npages,
+					     migrate.dst);
+	if (err)
+		goto err_finalize;
+
+	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev, dma_addr,
+					   migrate.src, npages, DMA_TO_DEVICE);
+	if (err)
+		goto err_finalize;
+
+	for (i = 0; i < npages; ++i) {
+		struct page *page = pfn_to_page(migrate.dst[i]);
+
+		pages[i] = page;
+		migrate.dst[i] = migrate_pfn(migrate.dst[i]);
+		drm_gpusvm_get_vram_page(page, zdd);
+	}
+
+	err = gpusvm->ops->copy_to_vram(gpusvm, pages, dma_addr, npages);
+	if (err)
+		goto err_finalize;
+
+	/* Upon success bind vram allocation to range and zdd */
+	range->vram_allocation = vram_allocation;
+	WRITE_ONCE(zdd->vram_allocation, vram_allocation);	/* Owns ref */
+
+err_finalize:
+	if (err)
+		drm_gpusvm_migration_put_pages(npages, migrate.dst);
+	migrate_vma_pages(&migrate);
+	migrate_vma_finalize(&migrate);
+	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr, npages,
+				       DMA_TO_DEVICE);
+err_free:
+	if (zdd)
+		drm_gpusvm_zdd_put(zdd);
+	kvfree(buf);
+err_mmunlock:
+	if (!ctx->mmap_locked) {
+		mmap_write_unlock(mm);
+		mmput(mm);
+	}
+err_out:
+	return err;
+}
+
+/**
+ * drm_gpusvm_migrate_populate_sram_pfn - Populate SRAM PFNs for a VM area
+ * @vas: Pointer to the VM area structure, can be NULL
+ * @npages: Number of pages to populate
+ * @src_mpfn: Source array of migrate PFNs
+ * @mpfn: Array of migrate PFNs to populate
+ * @addr: Start address for PFN allocation
+ *
+ * This function populates the SRAM migrate page frame numbers (PFNs) for the
+ * specified VM area structure. It allocates and locks pages in the VM area for
+ * SRAM usage. If @vas is non-NULL, alloc_page_vma() is used for allocation;
+ * otherwise, alloc_page() is used.
+ *
+ * Returns:
+ * 0 on success, negative error code on failure.
+ */
+static int drm_gpusvm_migrate_populate_sram_pfn(struct vm_area_struct *vas,
+						unsigned long npages,
+						unsigned long *src_mpfn,
+						unsigned long *mpfn, u64 addr)
+{
+	unsigned long i;
+
+	for (i = 0; i < npages; ++i, addr += PAGE_SIZE) {
+		struct page *page;
+
+		if (!(src_mpfn[i] & MIGRATE_PFN_MIGRATE))
+			continue;
+
+		if (vas)
+			page = alloc_page_vma(GFP_HIGHUSER, vas, addr);
+		else
+			page = alloc_page(GFP_HIGHUSER);
+
+		if (!page)
+			return -ENOMEM;
+
+		lock_page(page);
+		mpfn[i] = migrate_pfn(page_to_pfn(page));
+	}
+
+	return 0;
+}
+
+/**
+ * drm_gpusvm_evict_to_sram - Evict GPU SVM range to SRAM
+ * @gpusvm: Pointer to the GPU SVM structure
+ * @range: Pointer to the GPU SVM range structure
+ *
+ * Similar to __drm_gpusvm_migrate_to_sram but does not require the mmap lock;
+ * migration is done via the migrate_device_* functions. This is a fallback
+ * path, as it is preferred to issue migrations with the mmap lock held.
+ *
+ * Returns:
+ * 0 on success, negative error code on failure.
+ */
+static int drm_gpusvm_evict_to_sram(struct drm_gpusvm *gpusvm,
+				    struct drm_gpusvm_range *range)
+{
+	unsigned long npages;
+	struct page **pages;
+	unsigned long *src, *dst;
+	dma_addr_t *dma_addr;
+	void *buf;
+	int i, err = 0;
+
+	npages = npages_in_range(range->va.start, range->va.end);
+
+	buf = kvcalloc(npages, 2 * sizeof(*src) + sizeof(*dma_addr) +
+		       sizeof(*pages), GFP_KERNEL);
+	if (!buf) {
+		err = -ENOMEM;
+		goto err_out;
+	}
+	src = buf;
+	dst = buf + (sizeof(*src) * npages);
+	dma_addr = buf + (2 * sizeof(*src) * npages);
+	pages = buf + (2 * sizeof(*src) + sizeof(*dma_addr)) * npages;
+
+	err = gpusvm->ops->populate_vram_pfn(gpusvm, range->vram_allocation,
+					     npages, src);
+	if (err)
+		goto err_free;
+
+	err = migrate_device_vma_range(gpusvm->mm,
+				       gpusvm->device_private_page_owner, src,
+				       npages, range->va.start);
+	if (err)
+		goto err_free;
+
+	err = drm_gpusvm_migrate_populate_sram_pfn(NULL, npages, src, dst, 0);
+	if (err)
+		goto err_finalize;
+
+	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev, dma_addr,
+					   dst, npages, DMA_BIDIRECTIONAL);
+	if (err)
+		goto err_finalize;
+
+	for (i = 0; i < npages; ++i)
+		pages[i] = migrate_pfn_to_page(src[i]);
+
+	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr, npages);
+	if (err)
+		goto err_finalize;
+
+err_finalize:
+	if (err)
+		drm_gpusvm_migration_put_pages(npages, dst);
+	migrate_device_pages(src, dst, npages);
+	migrate_device_finalize(src, dst, npages);
+	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr, npages,
+				       DMA_BIDIRECTIONAL);
+err_free:
+	kvfree(buf);
+err_out:
+
+	return err;
+}
+
+/**
+ * __drm_gpusvm_migrate_to_sram - Migrate GPU SVM range to SRAM (internal)
+ * @gpusvm: Pointer to the GPU SVM structure
+ * @vas: Pointer to the VM area structure
+ * @page: Pointer to the page for fault handling (can be NULL)
+ * @start: Start address of the migration range
+ * @end: End address of the migration range
+ *
+ * This internal function performs the migration of the specified GPU SVM range
+ * to SRAM. It sets up the migration, populates + dma maps SRAM PFNs, and
+ * invokes the driver-specific operations for migration to SRAM.
+ *
+ * Returns:
+ * 0 on success, negative error code on failure.
+ */
+static int __drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
+					struct vm_area_struct *vas,
+					struct page *page,
+					u64 start, u64 end)
+{
+	struct migrate_vma migrate = {
+		.vma		= vas,
+		.pgmap_owner	= gpusvm->device_private_page_owner,
+		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
+		.fault_page	= page,
+	};
+	unsigned long npages;
+	struct page **pages;
+	dma_addr_t *dma_addr;
+	void *buf;
+	int i, err = 0;
+
+	mmap_assert_locked(gpusvm->mm);
+
+	/* Corner case where the VM area struct has been partially unmapped */
+	if (start < vas->vm_start)
+		start = vas->vm_start;
+	if (end > vas->vm_end)
+		end = vas->vm_end;
+
+	migrate.start = start;
+	migrate.end = end;
+	npages = npages_in_range(start, end);
+
+	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) + sizeof(*dma_addr) +
+		       sizeof(*pages), GFP_KERNEL);
+	if (!buf) {
+		err = -ENOMEM;
+		goto err_out;
+	}
+	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
+	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr)) * npages;
+
+	migrate.vma = vas;
+	migrate.src = buf;
+	migrate.dst = migrate.src + npages;
+
+	err = migrate_vma_setup(&migrate);
+	if (err)
+		goto err_free;
+
+	/* Raced with another CPU fault, nothing to do */
+	if (!migrate.cpages)
+		goto err_free;
+
+	err = drm_gpusvm_migrate_populate_sram_pfn(vas, npages,
+						   migrate.src, migrate.dst,
+						   start);
+	if (err)
+		goto err_finalize;
+
+	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev, dma_addr,
+					   migrate.dst, npages,
+					   DMA_BIDIRECTIONAL);
+	if (err)
+		goto err_finalize;
+
+	for (i = 0; i < npages; ++i)
+		pages[i] = migrate_pfn_to_page(migrate.src[i]);
+
+	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr, npages);
+	if (err)
+		goto err_finalize;
+
+err_finalize:
+	if (err)
+		drm_gpusvm_migration_put_pages(npages, migrate.dst);
+	migrate_vma_pages(&migrate);
+	migrate_vma_finalize(&migrate);
+	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr, npages,
+				       DMA_BIDIRECTIONAL);
+err_free:
+	kvfree(buf);
+err_out:
+	mmap_assert_locked(gpusvm->mm);
+
+	return err;
+}
+
+/**
+ * drm_gpusvm_migrate_to_sram - Migrate (evict) GPU SVM range to SRAM
+ * @gpusvm: Pointer to the GPU SVM structure
+ * @range: Pointer to the GPU SVM range structure
+ * @ctx: GPU SVM context
+ *
+ * This function initiates the migration of the specified GPU SVM range to
+ * SRAM. It performs necessary checks and invokes the internal migration
+ * function for actual migration.
+ *
+ * Returns:
+ * 0 on success, negative error code on failure.
+ */
+int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
+			       struct drm_gpusvm_range *range,
+			       const struct drm_gpusvm_ctx *ctx)
+{
+	u64 start = range->va.start, end = range->va.end;
+	struct mm_struct *mm = gpusvm->mm;
+	struct vm_area_struct *vas;
+	int err;
+	bool retry = false;
+
+	if (!ctx->mmap_locked) {
+		if (!mmget_not_zero(mm)) {
+			err = -EFAULT;
+			goto err_out;
+		}
+		if (ctx->trylock_mmap) {
+			if (!mmap_read_trylock(mm))  {
+				err = drm_gpusvm_evict_to_sram(gpusvm, range);
+				goto err_mmput;
+			}
+		} else {
+			mmap_read_lock(mm);
+		}
+	}
+
+	mmap_assert_locked(mm);
+
+	/*
+	 * Loop required to find all VM area structs for the corner case when
+	 * VRAM backing has been partially unmapped from the MM's address space.
+	 */
+again:
+	vas = find_vma(mm, start);
+	if (!vas) {
+		if (!retry)
+			err = -ENOENT;
+		goto err_mmunlock;
+	}
+
+	if (end <= vas->vm_start || start >= vas->vm_end) {
+		if (!retry)
+			err = -EINVAL;
+		goto err_mmunlock;
+	}
+
+	err = __drm_gpusvm_migrate_to_sram(gpusvm, vas, NULL, start, end);
+	if (err)
+		goto err_mmunlock;
+
+	if (vas->vm_end < end) {
+		retry = true;
+		start = vas->vm_end;
+		goto again;
+	}
+
+	if (!ctx->mmap_locked) {
+		mmap_read_unlock(mm);
+		/*
+		 * Using mmput_async as this function can be called while
+		 * holding a dma-resv lock, and a final put can grab the mmap
+		 * lock, causing a lock inversion.
+		 */
+		mmput_async(mm);
+	}
+
+	return 0;
+
+err_mmunlock:
+	if (!ctx->mmap_locked)
+		mmap_read_unlock(mm);
+err_mmput:
+	if (!ctx->mmap_locked)
+		mmput_async(mm);
+err_out:
+	return err;
+}
+
+/**
+ * drm_gpusvm_page_free - Put GPU SVM zone device data associated with a page
+ * @page: Pointer to the page
+ *
+ * This function is a callback used to put the GPU SVM zone device data
+ * associated with a page when it is being released.
+ */
+static void drm_gpusvm_page_free(struct page *page)
+{
+	drm_gpusvm_zdd_put(page->zone_device_data);
+}
+
+/**
+ * drm_gpusvm_migrate_to_ram - Migrate GPU SVM range to RAM (page fault handler)
+ * @vmf: Pointer to the fault information structure
+ *
+ * This function is a page fault handler used to migrate a GPU SVM range to RAM.
+ * It retrieves the GPU SVM range information from the faulting page and invokes
+ * the internal migration function to migrate the range back to RAM.
+ *
+ * Returns:
+ * VM_FAULT_SIGBUS on failure, 0 on success.
+ */
+static vm_fault_t drm_gpusvm_migrate_to_ram(struct vm_fault *vmf)
+{
+	struct drm_gpusvm_zdd *zdd = vmf->page->zone_device_data;
+	int err;
+
+	err = __drm_gpusvm_migrate_to_sram(zdd->range->gpusvm,
+					   vmf->vma, vmf->page,
+					   zdd->range->va.start,
+					   zdd->range->va.end);
+
+	return err ? VM_FAULT_SIGBUS : 0;
+}
+
+/**
+ * drm_gpusvm_pagemap_ops - Device page map operations for GPU SVM
+ */
+static const struct dev_pagemap_ops drm_gpusvm_pagemap_ops = {
+	.page_free = drm_gpusvm_page_free,
+	.migrate_to_ram = drm_gpusvm_migrate_to_ram,
+};
+
+/**
+ * drm_gpusvm_pagemap_ops_get - Retrieve GPU SVM device page map operations
+ *
+ * Returns:
+ * Pointer to the GPU SVM device page map operations structure.
+ */
+const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void)
+{
+	return &drm_gpusvm_pagemap_ops;
+}
+
+/**
+ * drm_gpusvm_has_mapping - Check if GPU SVM has mapping for the given address range
+ * @gpusvm: Pointer to the GPU SVM structure.
+ * @start: Start address
+ * @end: End address
+ *
+ * Returns:
+ * True if GPU SVM has mapping, False otherwise
+ */
+bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start, u64 end)
+{
+	struct drm_gpusvm_notifier *notifier;
+
+	drm_gpusvm_for_each_notifier(notifier, gpusvm, start, end) {
+		struct drm_gpusvm_range *range = NULL;
+
+		drm_gpusvm_for_each_range(range, notifier, start, end)
+			return true;
+	}
+
+	return false;
+}
diff --git a/drivers/gpu/drm/xe/drm_gpusvm.h b/drivers/gpu/drm/xe/drm_gpusvm.h
new file mode 100644
index 000000000000..0ea70f8534a8
--- /dev/null
+++ b/drivers/gpu/drm/xe/drm_gpusvm.h
@@ -0,0 +1,415 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2024 Intel Corporation
+ */
+
+#ifndef __DRM_GPUSVM_H__
+#define __DRM_GPUSVM_H__
+
+#include <linux/kref.h>
+#include <linux/mmu_notifier.h>
+#include <linux/workqueue.h>
+
+struct dev_pagemap_ops;
+struct drm_device;
+struct drm_gpusvm;
+struct drm_gpusvm_notifier;
+struct drm_gpusvm_ops;
+struct drm_gpusvm_range;
+
+/**
+ * struct drm_gpusvm_ops - Operations structure for GPU SVM
+ *
+ * This structure defines the operations for GPU Shared Virtual Memory (SVM).
+ * These operations are provided by the GPU driver to manage SVM ranges and
+ * perform operations such as migration between VRAM and system RAM.
+ */
+struct drm_gpusvm_ops {
+	/**
+	 * @notifier_alloc: Allocate a GPU SVM notifier (optional)
+	 *
+	 * This function shall allocate a GPU SVM notifier.
+	 *
+	 * Returns:
+	 * Pointer to the allocated GPU SVM notifier on success, NULL on failure.
+	 */
+	struct drm_gpusvm_notifier *(*notifier_alloc)(void);
+
+	/**
+	 * @notifier_free: Free a GPU SVM notifier (optional)
+	 * @notifier: Pointer to the GPU SVM notifier to be freed
+	 *
+	 * This function shall free a GPU SVM notifier.
+	 */
+	void (*notifier_free)(struct drm_gpusvm_notifier *notifier);
+
+	/**
+	 * @range_alloc: Allocate a GPU SVM range (optional)
+	 * @gpusvm: Pointer to the GPU SVM
+	 *
+	 * This function shall allocate a GPU SVM range.
+	 *
+	 * Returns:
+	 * Pointer to the allocated GPU SVM range on success, NULL on failure.
+	 */
+	struct drm_gpusvm_range *(*range_alloc)(struct drm_gpusvm *gpusvm);
+
+	/**
+	 * @range_free: Free a GPU SVM range (optional)
+	 * @range: Pointer to the GPU SVM range to be freed
+	 *
+	 * This function shall free a GPU SVM range.
+	 */
+	void (*range_free)(struct drm_gpusvm_range *range);
+
+	/**
+	 * @vram_release: Release VRAM allocation (optional)
+	 * @vram_allocation: Driver-private pointer to the VRAM allocation
+	 *
+	 * This function shall release VRAM allocation and expects to drop a
+	 * reference to VRAM allocation.
+	 */
+	void (*vram_release)(void *vram_allocation);
+
+	/**
+	 * @populate_vram_pfn: Populate VRAM PFN (required for migration)
+	 * @gpusvm: Pointer to the GPU SVM
+	 * @vram_allocation: Driver-private pointer to the VRAM allocation
+	 * @npages: Number of pages to populate
+	 * @pfn: Array of page frame numbers to populate
+	 *
+	 * This function shall populate VRAM page frame numbers (PFN).
+	 *
+	 * Returns:
+	 * 0 on success, a negative error code on failure.
+	 */
+	int (*populate_vram_pfn)(struct drm_gpusvm *gpusvm,
+				 void *vram_allocation,
+				 unsigned long npages,
+				 unsigned long *pfn);
+
+	/**
+	 * @copy_to_vram: Copy to VRAM (required for migration)
+	 * @gpusvm: Pointer to the GPU SVM
+	 * @pages: Pointer to array of VRAM pages (destination)
+	 * @dma_addr: Pointer to array of DMA addresses (source)
+	 * @npages: Number of pages to copy
+	 *
+	 * This function shall copy pages to VRAM.
+	 *
+	 * Returns:
+	 * 0 on success, a negative error code on failure.
+	 */
+	int (*copy_to_vram)(struct drm_gpusvm *gpusvm,
+			    struct page **pages,
+			    dma_addr_t *dma_addr,
+			    unsigned long npages);
+
+	/**
+	 * @copy_to_sram: Copy to system RAM (required for migration)
+	 * @gpusvm: Pointer to the GPU SVM
+	 * @pages: Pointer to array of VRAM pages (source)
+	 * @dma_addr: Pointer to array of DMA addresses (destination)
+	 * @npages: Number of pages to copy
+	 *
+	 * This function shall copy pages to system RAM.
+	 *
+	 * Returns:
+	 * 0 on success, a negative error code on failure.
+	 */
+	int (*copy_to_sram)(struct drm_gpusvm *gpusvm,
+			    struct page **pages,
+			    dma_addr_t *dma_addr,
+			    unsigned long npages);
+
+	/**
+	 * @invalidate: Invalidate GPU SVM notifier (required)
+	 * @gpusvm: Pointer to the GPU SVM
+	 * @notifier: Pointer to the GPU SVM notifier
+	 * @mmu_range: Pointer to the mmu_notifier_range structure
+	 *
+	 * This function shall invalidate the GPU page tables. It can safely
+	 * walk the notifier range RB tree/list in this function. Called while
+	 * holding the notifier lock.
+	 */
+	void (*invalidate)(struct drm_gpusvm *gpusvm,
+			   struct drm_gpusvm_notifier *notifier,
+			   const struct mmu_notifier_range *mmu_range);
+};
+
+/**
+ * struct drm_gpusvm_notifier - Structure representing a GPU SVM notifier
+ *
+ * @gpusvm: Pointer to the GPU SVM structure
+ * @notifier: MMU interval notifier
+ * @interval: Interval for the notifier
+ * @rb: Red-black tree node for the parent GPU SVM structure notifier tree
+ * @root: Cached root node of the RB tree containing ranges
+ * @range_list: List head of ranges in the same order they appear in the
+ *              interval tree. This is useful to keep iterating over ranges
+ *              while modifying the RB tree.
+ * @flags.removed: Flag indicating whether the MMU interval notifier has been
+ *                 removed
+ *
+ * This structure represents a GPU SVM notifier.
+ */
+struct drm_gpusvm_notifier {
+	struct drm_gpusvm *gpusvm;
+	struct mmu_interval_notifier notifier;
+	struct {
+		u64 start;
+		u64 end;
+	} interval;
+	struct {
+		struct rb_node node;
+		struct list_head entry;
+		u64 __subtree_last;
+	} rb;
+	struct rb_root_cached root;
+	struct list_head range_list;
+	struct {
+		u32 removed : 1;
+	} flags;
+};
+
+/**
+ * struct drm_gpusvm_range - Structure representing a GPU SVM range
+ *
+ * @gpusvm: Pointer to the GPU SVM structure
+ * @notifier: Pointer to the GPU SVM notifier
+ * @refcount: Reference count for the range
+ * @rb: Red-black tree node for the parent GPU SVM notifier structure range tree
+ * @va: Virtual address range
+ * @notifier_seq: Notifier sequence number of the range's pages
+ * @pages: Pointer to the array of pages (if backing store is in VRAM)
+ * @dma_addr: DMA address array (if backing store is SRAM and DMA mapped)
+ * @vram_allocation: Driver-private pointer to the VRAM allocation
+ * @order: Order of dma mapping. i.e. PAGE_SIZE << order is mapping size
+ * @flags.migrate_vram: Flag indicating whether the range can be migrated to VRAM
+ * @flags.unmapped: Flag indicating if the range has been unmapped
+ * @flags.partial_unmap: Flag indicating if the range has been partially unmapped
+ * @flags.has_vram_pages: Flag indicating if the range has vram pages
+ * @flags.has_dma_mapping: Flag indicating if the range has a DMA mapping
+ * @flags.kfree_mapping: Flag indicating @dma_addr is a compact allocation based
+ *                       on @order which is released via kfree()
+ *
+ * This structure represents a GPU SVM range used for tracking memory ranges
+ * mapped in a DRM device.
+ */
+struct drm_gpusvm_range {
+	struct drm_gpusvm *gpusvm;
+	struct drm_gpusvm_notifier *notifier;
+	struct kref refcount;
+	struct {
+		struct rb_node node;
+		struct list_head entry;
+		u64 __subtree_last;
+	} rb;
+	struct {
+		u64 start;
+		u64 end;
+	} va;
+	unsigned long notifier_seq;
+	union {
+		struct page **pages;
+		dma_addr_t *dma_addr;
+	};
+	void *vram_allocation;
+	u16 order;
+	struct {
+		/* All flags below must be set upon creation */
+		u16 migrate_vram : 1;
+		/* All flags below must be set / cleared under notifier lock */
+		u16 unmapped : 1;
+		u16 partial_unmap : 1;
+		u16 has_vram_pages : 1;
+		u16 has_dma_mapping : 1;
+		u16 kfree_mapping : 1;
+	} flags;
+};
+
+/**
+ * struct drm_gpusvm - GPU SVM structure
+ *
+ * @name: Name of the GPU SVM
+ * @drm: Pointer to the DRM device structure
+ * @mm: Pointer to the mm_struct for the address space
+ * @device_private_page_owner: Device private pages owner
+ * @mm_start: Start address of GPU SVM
+ * @mm_range: Range of the GPU SVM
+ * @notifier_size: Size of individual notifiers
+ * @ops: Pointer to the operations structure for GPU SVM
+ * @chunk_sizes: Pointer to the array of chunk sizes used in range allocation.
+ *               Entries should be powers of 2 in descending order.
+ * @num_chunks: Number of chunks
+ * @notifier_lock: Read-write semaphore for protecting notifier operations
+ * @zdd_wq: Workqueue for deferred work on zdd destruction
+ * @root: Cached root node of the Red-Black tree containing GPU SVM notifiers
+ * @notifier_list: List head of notifiers in the same order they appear in the
+ *                 interval tree. This is useful to keep iterating over
+ *                 notifiers while modifying the RB tree.
+ *
+ * This structure represents a GPU SVM (Shared Virtual Memory) used for tracking
+ * memory ranges mapped in a DRM (Direct Rendering Manager) device.
+ *
+ * No reference counting is provided, as this is expected to be embedded in the
+ * driver VM structure along with the struct drm_gpuvm, which handles reference
+ * counting.
+ */
+struct drm_gpusvm {
+	const char *name;
+	struct drm_device *drm;
+	struct mm_struct *mm;
+	void *device_private_page_owner;
+	u64 mm_start;
+	u64 mm_range;
+	u64 notifier_size;
+	const struct drm_gpusvm_ops *ops;
+	const u64 *chunk_sizes;
+	int num_chunks;
+	struct rw_semaphore notifier_lock;
+	struct workqueue_struct *zdd_wq;
+	struct rb_root_cached root;
+	struct list_head notifier_list;
+};
+
+/**
+ * struct drm_gpusvm_ctx - DRM GPU SVM context
+ *
+ * @mmap_locked: mmap lock is locked
+ * @trylock_mmap: trylock mmap lock, used to avoid locking inversions
+ *                (e.g. dma-resv -> mmap lock)
+ * @in_notifier: entering from a MMU notifier
+ * @read_only: operating on read-only memory
+ * @vram_possible: possible to use VRAM
+ * @prefault: prefault pages
+ *
+ * Context that DRM GPU SVM is operating in (i.e. user arguments).
+ */
+struct drm_gpusvm_ctx {
+	u32 mmap_locked :1;
+	u32 trylock_mmap :1;
+	u32 in_notifier :1;
+	u32 read_only :1;
+	u32 vram_possible :1;
+	u32 prefault :1;
+};
+
+int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
+		    const char *name, struct drm_device *drm,
+		    struct mm_struct *mm, void *device_private_page_owner,
+		    u64 mm_start, u64 mm_range, u64 notifier_size,
+		    const struct drm_gpusvm_ops *ops,
+		    const u64 *chunk_sizes, int num_chunks);
+void drm_gpusvm_fini(struct drm_gpusvm *gpusvm);
+void drm_gpusvm_free(struct drm_gpusvm *gpusvm);
+
+struct drm_gpusvm_range *
+drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64 fault_addr,
+				u64 gpuva_start, u64 gpuva_end,
+				const struct drm_gpusvm_ctx *ctx);
+void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
+			     struct drm_gpusvm_range *range);
+
+struct drm_gpusvm_range *
+drm_gpusvm_range_get(struct drm_gpusvm_range *range);
+void drm_gpusvm_range_put(struct drm_gpusvm_range *range);
+
+bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
+				  struct drm_gpusvm_range *range);
+
+int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
+			       struct drm_gpusvm_range *range,
+			       const struct drm_gpusvm_ctx *ctx);
+void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
+				  struct drm_gpusvm_range *range,
+				  const struct drm_gpusvm_ctx *ctx);
+
+int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
+			       struct drm_gpusvm_range *range,
+			       void *vram_allocation,
+			       const struct drm_gpusvm_ctx *ctx);
+int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
+			       struct drm_gpusvm_range *range,
+			       const struct drm_gpusvm_ctx *ctx);
+
+const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void);
+
+bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start, u64 end);
+
+struct drm_gpusvm_range *
+drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64 start, u64 end);
+
+/**
+ * drm_gpusvm_notifier_lock - Lock GPU SVM notifier
+ * @gpusvm__: Pointer to the GPU SVM structure.
+ *
+ * Abstract client usage GPU SVM notifier lock, take lock
+ */
+#define drm_gpusvm_notifier_lock(gpusvm__)	\
+	down_read(&(gpusvm__)->notifier_lock)
+
+/**
+ * drm_gpusvm_notifier_unlock - Unlock GPU SVM notifier
+ * @gpusvm__: Pointer to the GPU SVM structure.
+ *
+ * Abstract client usage GPU SVM notifier lock, drop lock
+ */
+#define drm_gpusvm_notifier_unlock(gpusvm__)	\
+	up_read(&(gpusvm__)->notifier_lock)
+
+/**
+ * __drm_gpusvm_range_next - Get the next GPU SVM range in the list
+ * @range: a pointer to the current GPU SVM range
+ *
+ * Return: A pointer to the next drm_gpusvm_range if available, or NULL if the
+ *         current range is the last one or if the input range is NULL.
+ */
+static inline struct drm_gpusvm_range *
+__drm_gpusvm_range_next(struct drm_gpusvm_range *range)
+{
+	if (range && !list_is_last(&range->rb.entry,
+				   &range->notifier->range_list))
+		return list_next_entry(range, rb.entry);
+
+	return NULL;
+}
+
+/**
+ * drm_gpusvm_for_each_range - Iterate over GPU SVM ranges in a notifier
+ * @range__: Iterator variable for the ranges. If set, it indicates the start of
+ *	     the iterator. If NULL, call drm_gpusvm_range_find() to get the range.
+ * @notifier__: Pointer to the GPU SVM notifier
+ * @start__: Start address of the range
+ * @end__: End address of the range
+ *
+ * This macro is used to iterate over GPU SVM ranges in a notifier. It is safe
+ * to use while holding the driver SVM lock or the notifier lock.
+ */
+#define drm_gpusvm_for_each_range(range__, notifier__, start__, end__)	\
+	for ((range__) = (range__) ?:					\
+	     drm_gpusvm_range_find((notifier__), (start__), (end__));	\
+	     (range__) && (range__->va.start < (end__));		\
+	     (range__) = __drm_gpusvm_range_next(range__))
+
+/**
+ * drm_gpusvm_range_set_unmapped - Mark a GPU SVM range as unmapped
+ * @range: Pointer to the GPU SVM range structure.
+ * @mmu_range: Pointer to the MMU notifier range structure.
+ *
+ * This function marks a GPU SVM range as unmapped and sets the partial_unmap flag
+ * if the range partially falls within the provided MMU notifier range.
+ */
+static inline void
+drm_gpusvm_range_set_unmapped(struct drm_gpusvm_range *range,
+			      const struct mmu_notifier_range *mmu_range)
+{
+	lockdep_assert_held_write(&range->gpusvm->notifier_lock);
+
+	range->flags.unmapped = true;
+	if (range->va.start < mmu_range->start ||
+	    range->va.end > mmu_range->end)
+		range->flags.partial_unmap = true;
+}
+
+#endif /* __DRM_GPUSVM_H__ */
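
As a rough illustration of how a driver is expected to consume this interface,
here is a minimal sketch wired up against the functions declared above. All
my_* names, the chunk sizes, the VA range, and the notifier size are
hypothetical placeholders, and the migration hooks (populate_vram_pfn,
copy_to_vram, copy_to_sram) are omitted, so this instance would be SRAM-only:

	/* Illustrative sketch only; my_vm, my_bind_range, etc. are made up. */
	#include "drm_gpusvm.h"

	static void my_invalidate(struct drm_gpusvm *gpusvm,
				  struct drm_gpusvm_notifier *notifier,
				  const struct mmu_notifier_range *mmu_range)
	{
		struct drm_gpusvm_ctx ctx = { .in_notifier = true, };
		struct drm_gpusvm_range *range = NULL;

		/* Driver-specific GPU page table zap would go here. */

		drm_gpusvm_for_each_range(range, notifier, mmu_range->start,
					  mmu_range->end) {
			if (mmu_range->event == MMU_NOTIFY_UNMAP)
				drm_gpusvm_range_set_unmapped(range, mmu_range);
			drm_gpusvm_range_unmap_pages(gpusvm, range, &ctx);
		}
	}

	static const struct drm_gpusvm_ops my_gpusvm_ops = {
		.invalidate = my_invalidate,	/* only required op here */
	};

	/* Chunk sizes in descending order, as documented for @chunk_sizes. */
	static const u64 my_chunks[] = { SZ_2M, SZ_64K, SZ_4K };

	int my_vm_svm_init(struct my_vm *vm)
	{
		return drm_gpusvm_init(&vm->svm, "my-svm", vm->drm, current->mm,
				       NULL, 0, 1ull << 47, SZ_512M,
				       &my_gpusvm_ops, my_chunks,
				       ARRAY_SIZE(my_chunks));
	}

	/* GPU fault handler: find/insert a range, get pages, commit binding. */
	int my_handle_fault(struct my_vm *vm, u64 fault_addr, u64 gpuva_start,
			    u64 gpuva_end)
	{
		struct drm_gpusvm_ctx ctx = {};
		struct drm_gpusvm_range *range;
		int err;

		range = drm_gpusvm_range_find_or_insert(&vm->svm, fault_addr,
							gpuva_start, gpuva_end,
							&ctx);
		if (IS_ERR(range))
			return PTR_ERR(range);

		err = drm_gpusvm_range_get_pages(&vm->svm, range, &ctx);
		if (err)
			return err;

		/* Validity check is the last step before committing. */
		drm_gpusvm_notifier_lock(&vm->svm);
		if (drm_gpusvm_range_pages_valid(&vm->svm, range))
			err = my_bind_range(vm, range);	/* driver PT update */
		else
			err = -EAGAIN;
		drm_gpusvm_notifier_unlock(&vm->svm);

		return err;
	}

The invalidate hook runs with the notifier lock held in write mode, which is
why the unmap is issued with in_notifier set, and the page-validity check is
done under the notifier lock as the last step before committing the binding.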
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH 06/28] drm/xe/uapi: Add DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATON flag
  2024-08-28  2:48 [RFC PATCH 00/28] Introduce GPU SVM and Xe SVM implementation Matthew Brost
                   ` (4 preceding siblings ...)
  2024-08-28  2:48 ` [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory Matthew Brost
@ 2024-08-28  2:48 ` Matthew Brost
  2024-08-28  2:48 ` [RFC PATCH 07/28] drm/xe: Add SVM init / fini to faulting VMs Matthew Brost
                   ` (25 subsequent siblings)
  31 siblings, 0 replies; 100+ messages in thread
From: Matthew Brost @ 2024-08-28  2:48 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: airlied, christian.koenig, thomas.hellstrom, matthew.auld, daniel

Add the DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR flag, which is used to
create unpopulated virtual memory areas (VMAs) without memory backing or
GPU page tables. These VMAs are referred to as system allocator VMAs.
The idea is that upon a page fault or prefetch, the memory backing and
GPU page tables will be populated.

System allocator VMAs only update GPUVM state; they do not have an
internal page table (PT) state, nor do they have GPU mappings.

It is expected that system allocator VMAs will be mixed with buffer
object (BO) VMAs within a single VM. In other words, system allocations
and runtime allocations can be mixed within a single user-mode driver
(UMD) program.

Expected usage:

- Bind the entire virtual address (VA) space upon program load using the
  DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR flag.
- If a buffer object (BO) requires GPU mapping, allocate an address
  using malloc, and bind the BO to the malloc'd address using existing
  bind IOCTLs (runtime allocation).
- If a BO no longer requires GPU mapping, bind the mapping address with
  the DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR flag.
- Any malloc'd address accessed by the GPU will be faulted in via the
  SVM implementation (system allocation).
- Upon freeing any malloc'd data, the SVM implementation will remove GPU
  mappings.

Only a 1-to-1 mapping between the user address space and the GPU address
space is supported at the moment, as that is the expected use case. The
uAPI defines an interface for non 1-to-1 mappings but enforces 1-to-1;
this restriction can be lifted if use cases arise for non 1-to-1 mappings.
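
A rough sketch of the expected usage flow above from userspace (illustrative
pseudo-C only: fd, vm_id, bo_handle and the sizes are assumed to exist, and
using DRM_XE_VM_BIND_OP_MAP with no BO for the system allocator binding is an
assumption of this example, not something the uAPI mandates):

	#include <stdint.h>
	#include <stdlib.h>
	#include <sys/ioctl.h>
	#include <drm/xe_drm.h>

	static void example_system_allocator_flow(int fd, uint32_t vm_id,
						  uint32_t bo_handle,
						  uint64_t bo_size,
						  uint64_t va_size)
	{
		struct drm_xe_vm_bind bind = {
			.vm_id = vm_id,
			.num_binds = 1,
			.bind = {
				.op = DRM_XE_VM_BIND_OP_MAP,
				.flags = DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR,
				.addr = 0,
				.range = va_size,	/* entire VA space */
			},
		};
		void *ptr;

		/* 1. Bind the whole VA space at program load. */
		ioctl(fd, DRM_IOCTL_XE_VM_BIND, &bind);

		/* 2. Runtime allocation: GPU-map a BO at a malloc'd address. */
		ptr = malloc(bo_size);
		bind.bind.flags = 0;
		bind.bind.obj = bo_handle;
		bind.bind.addr = (uintptr_t)ptr;
		bind.bind.range = bo_size;
		ioctl(fd, DRM_IOCTL_XE_VM_BIND, &bind);

		/*
		 * 3. BO mapping no longer needed: re-bind the range as a
		 * system allocator VMA so later GPU accesses to malloc'd data
		 * in this range fault in via SVM.
		 */
		bind.bind.flags = DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR;
		bind.bind.obj = 0;
		ioctl(fd, DRM_IOCTL_XE_VM_BIND, &bind);

		free(ptr);
	}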

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_pt.c       |  76 +++++++++++++++++-----
 drivers/gpu/drm/xe/xe_vm.c       | 107 ++++++++++++++++++++-----------
 drivers/gpu/drm/xe/xe_vm.h       |   8 ++-
 drivers/gpu/drm/xe/xe_vm_types.h |   3 +
 include/uapi/drm/xe_drm.h        |  19 +++++-
 5 files changed, 157 insertions(+), 56 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
index d6353e8969f0..d21e45efeaab 100644
--- a/drivers/gpu/drm/xe/xe_pt.c
+++ b/drivers/gpu/drm/xe/xe_pt.c
@@ -1068,6 +1068,11 @@ static int op_add_deps(struct xe_vm *vm, struct xe_vma_op *op,
 {
 	int err = 0;
 
+	/*
+	 * No need to check for is_system_allocator here as vma_add_deps is a
+	 * NOP if the VMA is a system allocator VMA
+	 */
+
 	switch (op->base.op) {
 	case DRM_GPUVA_OP_MAP:
 		if (!op->map.immediate && xe_vm_in_fault_mode(vm))
@@ -1646,6 +1651,7 @@ static int bind_op_prepare(struct xe_vm *vm, struct xe_tile *tile,
 	struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[current_op];
 	int err;
 
+	xe_tile_assert(tile, !xe_vma_is_system_allocator(vma));
 	xe_bo_assert_held(xe_vma_bo(vma));
 
 	vm_dbg(&xe_vma_vm(vma)->xe->drm,
@@ -1713,6 +1719,7 @@ static int unbind_op_prepare(struct xe_tile *tile,
 	if (!((vma->tile_present | vma->tile_staged) & BIT(tile->id)))
 		return 0;
 
+	xe_tile_assert(tile, !xe_vma_is_system_allocator(vma));
 	xe_bo_assert_held(xe_vma_bo(vma));
 
 	vm_dbg(&xe_vma_vm(vma)->xe->drm,
@@ -1759,15 +1766,21 @@ static int op_prepare(struct xe_vm *vm,
 
 	switch (op->base.op) {
 	case DRM_GPUVA_OP_MAP:
-		if (!op->map.immediate && xe_vm_in_fault_mode(vm))
+		if ((!op->map.immediate && xe_vm_in_fault_mode(vm)) ||
+		    op->map.is_system_allocator)
 			break;
 
 		err = bind_op_prepare(vm, tile, pt_update_ops, op->map.vma);
 		pt_update_ops->wait_vm_kernel = true;
 		break;
 	case DRM_GPUVA_OP_REMAP:
-		err = unbind_op_prepare(tile, pt_update_ops,
-					gpuva_to_vma(op->base.remap.unmap->va));
+	{
+		struct xe_vma *old = gpuva_to_vma(op->base.remap.unmap->va);
+
+		if (xe_vma_is_system_allocator(old))
+			break;
+
+		err = unbind_op_prepare(tile, pt_update_ops, old);
 
 		if (!err && op->remap.prev) {
 			err = bind_op_prepare(vm, tile, pt_update_ops,
@@ -1780,15 +1793,28 @@ static int op_prepare(struct xe_vm *vm,
 			pt_update_ops->wait_vm_bookkeep = true;
 		}
 		break;
+	}
 	case DRM_GPUVA_OP_UNMAP:
-		err = unbind_op_prepare(tile, pt_update_ops,
-					gpuva_to_vma(op->base.unmap.va));
+	{
+		struct xe_vma *vma = gpuva_to_vma(op->base.unmap.va);
+
+		if (xe_vma_is_system_allocator(vma))
+			break;
+
+		err = unbind_op_prepare(tile, pt_update_ops, vma);
 		break;
+	}
 	case DRM_GPUVA_OP_PREFETCH:
-		err = bind_op_prepare(vm, tile, pt_update_ops,
-				      gpuva_to_vma(op->base.prefetch.va));
+	{
+		struct xe_vma *vma = gpuva_to_vma(op->base.prefetch.va);
+
+		if (xe_vma_is_system_allocator(vma))
+			break;
+
+		err = bind_op_prepare(vm, tile, pt_update_ops, vma);
 		pt_update_ops->wait_vm_kernel = true;
 		break;
+	}
 	default:
 		drm_warn(&vm->xe->drm, "NOT POSSIBLE");
 	}
@@ -1857,6 +1883,8 @@ static void bind_op_commit(struct xe_vm *vm, struct xe_tile *tile,
 			   struct xe_vma *vma, struct dma_fence *fence,
 			   struct dma_fence *fence2)
 {
+	xe_tile_assert(tile, !xe_vma_is_system_allocator(vma));
+
 	if (!xe_vma_has_no_bo(vma) && !xe_vma_bo(vma)->vm) {
 		dma_resv_add_fence(xe_vma_bo(vma)->ttm.base.resv, fence,
 				   pt_update_ops->wait_vm_bookkeep ?
@@ -1890,6 +1918,8 @@ static void unbind_op_commit(struct xe_vm *vm, struct xe_tile *tile,
 			     struct xe_vma *vma, struct dma_fence *fence,
 			     struct dma_fence *fence2)
 {
+	xe_tile_assert(tile, !xe_vma_is_system_allocator(vma));
+
 	if (!xe_vma_has_no_bo(vma) && !xe_vma_bo(vma)->vm) {
 		dma_resv_add_fence(xe_vma_bo(vma)->ttm.base.resv, fence,
 				   pt_update_ops->wait_vm_bookkeep ?
@@ -1924,16 +1954,21 @@ static void op_commit(struct xe_vm *vm,
 
 	switch (op->base.op) {
 	case DRM_GPUVA_OP_MAP:
-		if (!op->map.immediate && xe_vm_in_fault_mode(vm))
+		if ((!op->map.immediate && xe_vm_in_fault_mode(vm)) ||
+		    op->map.is_system_allocator)
 			break;
 
 		bind_op_commit(vm, tile, pt_update_ops, op->map.vma, fence,
 			       fence2);
 		break;
 	case DRM_GPUVA_OP_REMAP:
-		unbind_op_commit(vm, tile, pt_update_ops,
-				 gpuva_to_vma(op->base.remap.unmap->va), fence,
-				 fence2);
+	{
+		struct xe_vma *old = gpuva_to_vma(op->base.remap.unmap->va);
+
+		if (xe_vma_is_system_allocator(old))
+			break;
+
+		unbind_op_commit(vm, tile, pt_update_ops, old, fence, fence2);
 
 		if (op->remap.prev)
 			bind_op_commit(vm, tile, pt_update_ops, op->remap.prev,
@@ -1942,14 +1977,25 @@ static void op_commit(struct xe_vm *vm,
 			bind_op_commit(vm, tile, pt_update_ops, op->remap.next,
 				       fence, fence2);
 		break;
+	}
 	case DRM_GPUVA_OP_UNMAP:
-		unbind_op_commit(vm, tile, pt_update_ops,
-				 gpuva_to_vma(op->base.unmap.va), fence, fence2);
+	{
+		struct xe_vma *vma = gpuva_to_vma(op->base.unmap.va);
+
+		if (!xe_vma_is_system_allocator(vma))
+			unbind_op_commit(vm, tile, pt_update_ops, vma, fence,
+					 fence2);
 		break;
+	}
 	case DRM_GPUVA_OP_PREFETCH:
-		bind_op_commit(vm, tile, pt_update_ops,
-			       gpuva_to_vma(op->base.prefetch.va), fence, fence2);
+	{
+		struct xe_vma *vma = gpuva_to_vma(op->base.prefetch.va);
+
+		if (!xe_vma_is_system_allocator(vma))
+			bind_op_commit(vm, tile, pt_update_ops, vma, fence,
+				       fence2);
 		break;
+	}
 	default:
 		drm_warn(&vm->xe->drm, "NOT POSSIBLE");
 	}
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 4cc13eddb6b3..5ec160561662 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -901,9 +901,10 @@ static void xe_vma_free(struct xe_vma *vma)
 		kfree(vma);
 }
 
-#define VMA_CREATE_FLAG_READ_ONLY	BIT(0)
-#define VMA_CREATE_FLAG_IS_NULL		BIT(1)
-#define VMA_CREATE_FLAG_DUMPABLE	BIT(2)
+#define VMA_CREATE_FLAG_READ_ONLY		BIT(0)
+#define VMA_CREATE_FLAG_IS_NULL			BIT(1)
+#define VMA_CREATE_FLAG_DUMPABLE		BIT(2)
+#define VMA_CREATE_FLAG_IS_SYSTEM_ALLOCATOR	BIT(3)
 
 static struct xe_vma *xe_vma_create(struct xe_vm *vm,
 				    struct xe_bo *bo,
@@ -917,6 +918,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
 	bool read_only = (flags & VMA_CREATE_FLAG_READ_ONLY);
 	bool is_null = (flags & VMA_CREATE_FLAG_IS_NULL);
 	bool dumpable = (flags & VMA_CREATE_FLAG_DUMPABLE);
+	bool is_system_allocator =
+		(flags & VMA_CREATE_FLAG_IS_SYSTEM_ALLOCATOR);
 
 	xe_assert(vm->xe, start < end);
 	xe_assert(vm->xe, end < vm->size);
@@ -925,7 +928,7 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
 	 * Allocate and ensure that the xe_vma_is_userptr() return
 	 * matches what was allocated.
 	 */
-	if (!bo && !is_null) {
+	if (!bo && !is_null && !is_system_allocator) {
 		struct xe_userptr_vma *uvma = kzalloc(sizeof(*uvma), GFP_KERNEL);
 
 		if (!uvma)
@@ -937,6 +940,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
 		if (!vma)
 			return ERR_PTR(-ENOMEM);
 
+		if (is_system_allocator)
+			vma->gpuva.flags |= XE_VMA_SYSTEM_ALLOCATOR;
 		if (is_null)
 			vma->gpuva.flags |= DRM_GPUVA_SPARSE;
 		if (bo)
@@ -979,7 +984,7 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
 		drm_gpuva_link(&vma->gpuva, vm_bo);
 		drm_gpuvm_bo_put(vm_bo);
 	} else /* userptr or null */ {
-		if (!is_null) {
+		if (!is_null && !is_system_allocator) {
 			struct xe_userptr *userptr = &to_userptr_vma(vma)->userptr;
 			u64 size = end - start + 1;
 			int err;
@@ -1029,7 +1034,7 @@ static void xe_vma_destroy_late(struct xe_vma *vma)
 		 */
 		mmu_interval_notifier_remove(&userptr->notifier);
 		xe_vm_put(vm);
-	} else if (xe_vma_is_null(vma)) {
+	} else if (xe_vma_is_null(vma) || xe_vma_is_system_allocator(vma)) {
 		xe_vm_put(vm);
 	} else {
 		xe_bo_put(xe_vma_bo(vma));
@@ -1068,7 +1073,7 @@ static void xe_vma_destroy(struct xe_vma *vma, struct dma_fence *fence)
 		spin_lock(&vm->userptr.invalidated_lock);
 		list_del(&to_userptr_vma(vma)->userptr.invalidate_link);
 		spin_unlock(&vm->userptr.invalidated_lock);
-	} else if (!xe_vma_is_null(vma)) {
+	} else if (!xe_vma_is_null(vma) && !xe_vma_is_system_allocator(vma)) {
 		xe_bo_assert_held(xe_vma_bo(vma));
 
 		drm_gpuva_unlink(&vma->gpuva);
@@ -1971,6 +1976,8 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
 			op->map.read_only =
 				flags & DRM_XE_VM_BIND_FLAG_READONLY;
 			op->map.is_null = flags & DRM_XE_VM_BIND_FLAG_NULL;
+			op->map.is_system_allocator = flags &
+				DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR;
 			op->map.dumpable = flags & DRM_XE_VM_BIND_FLAG_DUMPABLE;
 			op->map.pat_index = pat_index;
 		} else if (__op->op == DRM_GPUVA_OP_PREFETCH) {
@@ -2162,6 +2169,8 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
 				VMA_CREATE_FLAG_IS_NULL : 0;
 			flags |= op->map.dumpable ?
 				VMA_CREATE_FLAG_DUMPABLE : 0;
+			flags |= op->map.is_system_allocator ?
+				VMA_CREATE_FLAG_IS_SYSTEM_ALLOCATOR : 0;
 
 			vma = new_vma(vm, &op->base.map, op->map.pat_index,
 				      flags);
@@ -2169,7 +2178,8 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
 				return PTR_ERR(vma);
 
 			op->map.vma = vma;
-			if (op->map.immediate || !xe_vm_in_fault_mode(vm))
+			if ((op->map.immediate || !xe_vm_in_fault_mode(vm)) &&
+			    !op->map.is_system_allocator)
 				xe_vma_ops_incr_pt_update_ops(vops,
 							      op->tile_mask);
 			break;
@@ -2178,21 +2188,24 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
 		{
 			struct xe_vma *old =
 				gpuva_to_vma(op->base.remap.unmap->va);
+			bool skip = xe_vma_is_system_allocator(old);
 
 			op->remap.start = xe_vma_start(old);
 			op->remap.range = xe_vma_size(old);
 
-			if (op->base.remap.prev) {
-				flags |= op->base.remap.unmap->va->flags &
-					XE_VMA_READ_ONLY ?
-					VMA_CREATE_FLAG_READ_ONLY : 0;
-				flags |= op->base.remap.unmap->va->flags &
-					DRM_GPUVA_SPARSE ?
-					VMA_CREATE_FLAG_IS_NULL : 0;
-				flags |= op->base.remap.unmap->va->flags &
-					XE_VMA_DUMPABLE ?
-					VMA_CREATE_FLAG_DUMPABLE : 0;
+			flags |= op->base.remap.unmap->va->flags &
+				XE_VMA_READ_ONLY ?
+				VMA_CREATE_FLAG_READ_ONLY : 0;
+			flags |= op->base.remap.unmap->va->flags &
+				DRM_GPUVA_SPARSE ?
+				VMA_CREATE_FLAG_IS_NULL : 0;
+			flags |= op->base.remap.unmap->va->flags &
+				XE_VMA_DUMPABLE ?
+				VMA_CREATE_FLAG_DUMPABLE : 0;
+			flags |= xe_vma_is_system_allocator(old) ?
+				VMA_CREATE_FLAG_IS_SYSTEM_ALLOCATOR : 0;
 
+			if (op->base.remap.prev) {
 				vma = new_vma(vm, op->base.remap.prev,
 					      old->pat_index, flags);
 				if (IS_ERR(vma))
@@ -2204,9 +2217,10 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
 				 * Userptr creates a new SG mapping so
 				 * we must also rebind.
 				 */
-				op->remap.skip_prev = !xe_vma_is_userptr(old) &&
+				op->remap.skip_prev = skip ||
+					(!xe_vma_is_userptr(old) &&
 					IS_ALIGNED(xe_vma_end(vma),
-						   xe_vma_max_pte_size(old));
+						   xe_vma_max_pte_size(old)));
 				if (op->remap.skip_prev) {
 					xe_vma_set_pte_size(vma, xe_vma_max_pte_size(old));
 					op->remap.range -=
@@ -2222,16 +2236,6 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
 			}
 
 			if (op->base.remap.next) {
-				flags |= op->base.remap.unmap->va->flags &
-					XE_VMA_READ_ONLY ?
-					VMA_CREATE_FLAG_READ_ONLY : 0;
-				flags |= op->base.remap.unmap->va->flags &
-					DRM_GPUVA_SPARSE ?
-					VMA_CREATE_FLAG_IS_NULL : 0;
-				flags |= op->base.remap.unmap->va->flags &
-					XE_VMA_DUMPABLE ?
-					VMA_CREATE_FLAG_DUMPABLE : 0;
-
 				vma = new_vma(vm, op->base.remap.next,
 					      old->pat_index, flags);
 				if (IS_ERR(vma))
@@ -2243,9 +2247,10 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
 				 * Userptr creates a new SG mapping so
 				 * we must also rebind.
 				 */
-				op->remap.skip_next = !xe_vma_is_userptr(old) &&
+				op->remap.skip_next = skip ||
+					(!xe_vma_is_userptr(old) &&
 					IS_ALIGNED(xe_vma_start(vma),
-						   xe_vma_max_pte_size(old));
+						   xe_vma_max_pte_size(old)));
 				if (op->remap.skip_next) {
 					xe_vma_set_pte_size(vma, xe_vma_max_pte_size(old));
 					op->remap.range -=
@@ -2258,14 +2263,27 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
 					xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
 				}
 			}
-			xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
+			if (!skip)
+				xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
 			break;
 		}
 		case DRM_GPUVA_OP_UNMAP:
+		{
+			struct xe_vma *vma = gpuva_to_vma(op->base.unmap.va);
+
+			if (!xe_vma_is_system_allocator(vma))
+				xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
+			break;
+		}
 		case DRM_GPUVA_OP_PREFETCH:
+		{
+			struct xe_vma *vma = gpuva_to_vma(op->base.prefetch.va);
+
 			/* FIXME: Need to skip some prefetch ops */
-			xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
+			if (!xe_vma_is_system_allocator(vma))
+				xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
 			break;
+		}
 		default:
 			drm_warn(&vm->xe->drm, "NOT POSSIBLE");
 		}
@@ -2706,7 +2724,8 @@ static int vm_bind_ioctl_ops_execute(struct xe_vm *vm,
 	(DRM_XE_VM_BIND_FLAG_READONLY | \
 	 DRM_XE_VM_BIND_FLAG_IMMEDIATE | \
 	 DRM_XE_VM_BIND_FLAG_NULL | \
-	 DRM_XE_VM_BIND_FLAG_DUMPABLE)
+	 DRM_XE_VM_BIND_FLAG_DUMPABLE | \
+	 DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR)
 
 #ifdef TEST_VM_OPS_ERROR
 #define SUPPORTED_FLAGS	(SUPPORTED_FLAGS_STUB | FORCE_OP_ERROR)
@@ -2761,9 +2780,17 @@ static int vm_bind_ioctl_check_args(struct xe_device *xe,
 		u64 obj_offset = (*bind_ops)[i].obj_offset;
 		u32 prefetch_region = (*bind_ops)[i].prefetch_mem_region_instance;
 		bool is_null = flags & DRM_XE_VM_BIND_FLAG_NULL;
+		bool is_system_allocator = flags &
+			DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR;
 		u16 pat_index = (*bind_ops)[i].pat_index;
 		u16 coh_mode;
 
+		/* FIXME: Disabling system allocator for now */
+		if (XE_IOCTL_DBG(xe, is_system_allocator)) {
+			err = -EOPNOTSUPP;
+			goto free_bind_ops;
+		}
+
 		if (XE_IOCTL_DBG(xe, pat_index >= xe->pat.n_entries)) {
 			err = -EINVAL;
 			goto free_bind_ops;
@@ -2784,13 +2811,14 @@ static int vm_bind_ioctl_check_args(struct xe_device *xe,
 
 		if (XE_IOCTL_DBG(xe, op > DRM_XE_VM_BIND_OP_PREFETCH) ||
 		    XE_IOCTL_DBG(xe, flags & ~SUPPORTED_FLAGS) ||
-		    XE_IOCTL_DBG(xe, obj && is_null) ||
-		    XE_IOCTL_DBG(xe, obj_offset && is_null) ||
+		    XE_IOCTL_DBG(xe, obj && (is_null || is_system_allocator)) ||
+		    XE_IOCTL_DBG(xe, obj_offset && (is_null ||
+				 is_system_allocator)) ||
 		    XE_IOCTL_DBG(xe, op != DRM_XE_VM_BIND_OP_MAP &&
-				 is_null) ||
+				 (is_null || is_system_allocator)) ||
 		    XE_IOCTL_DBG(xe, !obj &&
 				 op == DRM_XE_VM_BIND_OP_MAP &&
-				 !is_null) ||
+				 !is_null && !is_system_allocator) ||
 		    XE_IOCTL_DBG(xe, !obj &&
 				 op == DRM_XE_VM_BIND_OP_UNMAP_ALL) ||
 		    XE_IOCTL_DBG(xe, addr &&
@@ -3165,6 +3193,7 @@ int xe_vm_invalidate_vma(struct xe_vma *vma)
 	int ret = 0;
 
 	xe_assert(xe, !xe_vma_is_null(vma));
+	xe_assert(xe, !xe_vma_is_system_allocator(vma));
 	trace_xe_vma_invalidate(vma);
 
 	vm_dbg(&xe_vma_vm(vma)->xe->drm,
diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
index c864dba35e1d..1a5aed678214 100644
--- a/drivers/gpu/drm/xe/xe_vm.h
+++ b/drivers/gpu/drm/xe/xe_vm.h
@@ -151,6 +151,11 @@ static inline bool xe_vma_is_null(struct xe_vma *vma)
 	return vma->gpuva.flags & DRM_GPUVA_SPARSE;
 }
 
+static inline bool xe_vma_is_system_allocator(struct xe_vma *vma)
+{
+	return vma->gpuva.flags & XE_VMA_SYSTEM_ALLOCATOR;
+}
+
 static inline bool xe_vma_has_no_bo(struct xe_vma *vma)
 {
 	return !xe_vma_bo(vma);
@@ -158,7 +163,8 @@ static inline bool xe_vma_has_no_bo(struct xe_vma *vma)
 
 static inline bool xe_vma_is_userptr(struct xe_vma *vma)
 {
-	return xe_vma_has_no_bo(vma) && !xe_vma_is_null(vma);
+	return xe_vma_has_no_bo(vma) && !xe_vma_is_null(vma) &&
+		!xe_vma_is_system_allocator(vma);
 }
 
 /**
diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
index 7f9a303e51d8..1764781c376b 100644
--- a/drivers/gpu/drm/xe/xe_vm_types.h
+++ b/drivers/gpu/drm/xe/xe_vm_types.h
@@ -42,6 +42,7 @@ struct xe_vm_pgtable_update_op;
 #define XE_VMA_PTE_64K		(DRM_GPUVA_USERBITS << 6)
 #define XE_VMA_PTE_COMPACT	(DRM_GPUVA_USERBITS << 7)
 #define XE_VMA_DUMPABLE		(DRM_GPUVA_USERBITS << 8)
+#define XE_VMA_SYSTEM_ALLOCATOR	(DRM_GPUVA_USERBITS << 9)
 
 /** struct xe_userptr - User pointer */
 struct xe_userptr {
@@ -294,6 +295,8 @@ struct xe_vma_op_map {
 	bool read_only;
 	/** @is_null: is NULL binding */
 	bool is_null;
+	/** @is_system_allocator: is system allocator binding */
+	bool is_system_allocator;
 	/** @dumpable: whether BO is dumped on GPU hang */
 	bool dumpable;
 	/** @pat_index: The pat index to use for this operation. */
diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
index b6fbe4988f2e..27003777cb62 100644
--- a/include/uapi/drm/xe_drm.h
+++ b/include/uapi/drm/xe_drm.h
@@ -904,6 +904,12 @@ struct drm_xe_vm_destroy {
  *    will only be valid for DRM_XE_VM_BIND_OP_MAP operations, the BO
  *    handle MBZ, and the BO offset MBZ. This flag is intended to
  *    implement VK sparse bindings.
+ *  - %DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR - When the system allocator flag is
+ *    set, no mappings are created; rather, the range is reserved for system
+ *    allocations which will be populated on GPU page faults. Only valid on VMs
+ *    with DRM_XE_VM_CREATE_FLAG_FAULT_MODE set. The system allocator flag is
+ *    only valid for DRM_XE_VM_BIND_OP_MAP operations, the BO handle MBZ, and
+ *    the BO offset MBZ.
  */
 struct drm_xe_vm_bind_op {
 	/** @extensions: Pointer to the first extension struct, if any */
@@ -956,7 +962,9 @@ struct drm_xe_vm_bind_op {
 	 * on the @pat_index. For such mappings there is no actual memory being
 	 * mapped (the address in the PTE is invalid), so the various PAT memory
 	 * attributes likely do not apply.  Simply leaving as zero is one
-	 * option (still a valid pat_index).
+	 * option (still a valid pat_index). The same applies to
+	 * DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR bindings, as for such mappings
+	 * there is no actual memory being mapped.
 	 */
 	__u16 pat_index;
 
@@ -972,6 +980,14 @@ struct drm_xe_vm_bind_op {
 
 		/** @userptr: user pointer to bind on */
 		__u64 userptr;
+
+		/**
+		 * @system_allocator_offset: Offset from GPU @addr to create
+		 * system allocator mappings. MBZ with the current level of
+		 * support (i.e. only a 1:1 mapping between GPU and CPU
+		 * address spaces is supported).
+		 */
+		__s64 system_allocator_offset;
 	};
 
 	/**
@@ -994,6 +1010,7 @@ struct drm_xe_vm_bind_op {
 #define DRM_XE_VM_BIND_FLAG_IMMEDIATE	(1 << 1)
 #define DRM_XE_VM_BIND_FLAG_NULL	(1 << 2)
 #define DRM_XE_VM_BIND_FLAG_DUMPABLE	(1 << 3)
+#define DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR	(1 << 4)
 	/** @flags: Bind flags */
 	__u32 flags;
 
-- 
2.34.1


* [RFC PATCH 07/28] drm/xe: Add SVM init / fini to faulting VMs
  2024-08-28  2:48 [RFC PATCH 00/28] Introduce GPU SVM and Xe SVM implementation Matthew Brost
                   ` (5 preceding siblings ...)
  2024-08-28  2:48 ` [RFC PATCH 06/28] drm/xe/uapi: Add DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR flag Matthew Brost
@ 2024-08-28  2:48 ` Matthew Brost
  2024-08-28  2:48 ` [RFC PATCH 08/28] drm/xe: Add dma_addr res cursor Matthew Brost
                   ` (24 subsequent siblings)
  31 siblings, 0 replies; 100+ messages in thread
From: Matthew Brost @ 2024-08-28  2:48 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: airlied, christian.koenig, thomas.hellstrom, matthew.auld, daniel

Add SVM init / fini to faulting VMs. Minimal implementation.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/Makefile      |  1 +
 drivers/gpu/drm/xe/xe_svm.c      | 40 ++++++++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_svm.h      | 14 +++++++++++
 drivers/gpu/drm/xe/xe_vm.c       | 10 ++++++++
 drivers/gpu/drm/xe/xe_vm_types.h |  7 ++++++
 5 files changed, 72 insertions(+)
 create mode 100644 drivers/gpu/drm/xe/xe_svm.c
 create mode 100644 drivers/gpu/drm/xe/xe_svm.h

diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
index b8fc2ee58f1a..17bd7cfc9a62 100644
--- a/drivers/gpu/drm/xe/Makefile
+++ b/drivers/gpu/drm/xe/Makefile
@@ -94,6 +94,7 @@ xe-y += drm_gpusvm.o \
 	xe_sa.o \
 	xe_sched_job.o \
 	xe_step.o \
+	xe_svm.o \
 	xe_sync.o \
 	xe_tile.o \
 	xe_tile_sysfs.o \
diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
new file mode 100644
index 000000000000..7166100e3298
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -0,0 +1,40 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2024 Intel Corporation
+ */
+
+#include "drm_gpusvm.h"
+
+#include "xe_svm.h"
+#include "xe_vm.h"
+#include "xe_vm_types.h"
+
+static void xe_svm_invalidate(struct drm_gpusvm *gpusvm,
+			      struct drm_gpusvm_notifier *notifier,
+			      const struct mmu_notifier_range *mmu_range)
+{
+	/* TODO: Implement */
+}
+
+static const struct drm_gpusvm_ops gpusvm_ops = {
+	.invalidate = xe_svm_invalidate,
+};
+
+static const u64 fault_chunk_sizes[] = {
+	SZ_2M,
+	SZ_64K,
+	SZ_4K,
+};
+
+int xe_svm_init(struct xe_vm *vm)
+{
+	return drm_gpusvm_init(&vm->svm.gpusvm, "Xe SVM", &vm->xe->drm,
+			       current->mm, NULL, 0, vm->size,
+			       SZ_512M, &gpusvm_ops, fault_chunk_sizes,
+			       ARRAY_SIZE(fault_chunk_sizes));
+}
+
+void xe_svm_fini(struct xe_vm *vm)
+{
+	drm_gpusvm_fini(&vm->svm.gpusvm);
+}
diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
new file mode 100644
index 000000000000..4982d9168095
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2024 Intel Corporation
+ */
+
+#ifndef _XE_SVM_H_
+#define _XE_SVM_H_
+
+struct xe_vm;
+
+int xe_svm_init(struct xe_vm *vm);
+void xe_svm_fini(struct xe_vm *vm);
+
+#endif
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 5ec160561662..17ad6a533b2f 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -35,6 +35,7 @@
 #include "xe_preempt_fence.h"
 #include "xe_pt.h"
 #include "xe_res_cursor.h"
+#include "xe_svm.h"
 #include "xe_sync.h"
 #include "xe_trace_bo.h"
 #include "xe_wa.h"
@@ -1503,6 +1504,12 @@ struct xe_vm *xe_vm_create(struct xe_device *xe, u32 flags)
 		}
 	}
 
+	if (flags & XE_VM_FLAG_FAULT_MODE) {
+		err = xe_svm_init(vm);
+		if (err)
+			goto err_close;
+	}
+
 	if (number_tiles > 1)
 		vm->composite_fence_ctx = dma_fence_context_alloc(1);
 
@@ -1616,6 +1623,9 @@ void xe_vm_close_and_put(struct xe_vm *vm)
 		xe_vma_destroy_unlocked(vma);
 	}
 
+	if (xe_vm_in_fault_mode(vm))
+		xe_svm_fini(vm);
+
 	up_write(&vm->lock);
 
 	mutex_lock(&xe->usm.lock);
diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
index 1764781c376b..bd1c0e368238 100644
--- a/drivers/gpu/drm/xe/xe_vm_types.h
+++ b/drivers/gpu/drm/xe/xe_vm_types.h
@@ -6,6 +6,7 @@
 #ifndef _XE_VM_TYPES_H_
 #define _XE_VM_TYPES_H_
 
+#include "drm_gpusvm.h"
 #include <drm/drm_gpuvm.h>
 
 #include <linux/dma-resv.h>
@@ -140,6 +141,12 @@ struct xe_vm {
 	/** @gpuvm: base GPUVM used to track VMAs */
 	struct drm_gpuvm gpuvm;
 
+	/** @svm: Shared virtual memory state */
+	struct {
+		/** @svm.gpusvm: base GPUSVM used to track fault allocations */
+		struct drm_gpusvm gpusvm;
+	} svm;
+
 	struct xe_device *xe;
 
 	/* exec queue used for (un)binding vma's */
-- 
2.34.1


* [RFC PATCH 08/28] drm/xe: Add dma_addr res cursor
  2024-08-28  2:48 [RFC PATCH 00/28] Introduce GPU SVM and Xe SVM implementation Matthew Brost
                   ` (6 preceding siblings ...)
  2024-08-28  2:48 ` [RFC PATCH 07/28] drm/xe: Add SVM init / fini to faulting VMs Matthew Brost
@ 2024-08-28  2:48 ` Matthew Brost
  2024-08-28  2:48 ` [RFC PATCH 09/28] drm/xe: Add SVM range invalidation Matthew Brost
                   ` (23 subsequent siblings)
  31 siblings, 0 replies; 100+ messages in thread
From: Matthew Brost @ 2024-08-28  2:48 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: airlied, christian.koenig, thomas.hellstrom, matthew.auld, daniel

Useful for SVM ranges in SRAM and for programming page tables.
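
As a rough sketch (not part of this patch) of how the new cursor mode is
intended to be walked when emitting PTEs for an SVM range backed by a
dma_addr array; emit_pte() is a placeholder, and the range fields are
those of the GPU SVM range introduced earlier in the series:

	struct xe_res_cursor cur;
	u64 size = range->base.va.end - range->base.va.start;

	xe_res_first_dma(range->base.dma_addr, 0, size, range->base.order,
			 &cur);
	while (cur.remaining) {
		/* Device address of the current (PAGE_SIZE << order) chunk */
		u64 addr = xe_res_dma(&cur);

		emit_pte(addr, PAGE_SIZE << range->base.order); /* placeholder */
		xe_res_next(&cur, PAGE_SIZE << range->base.order);
	}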

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_res_cursor.h | 50 +++++++++++++++++++++++++++++-
 1 file changed, 49 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/xe_res_cursor.h b/drivers/gpu/drm/xe/xe_res_cursor.h
index dca374b6521c..3df630af9252 100644
--- a/drivers/gpu/drm/xe/xe_res_cursor.h
+++ b/drivers/gpu/drm/xe/xe_res_cursor.h
@@ -43,7 +43,9 @@ struct xe_res_cursor {
 	u64 remaining;
 	void *node;
 	u32 mem_type;
+	unsigned int order;
 	struct scatterlist *sgl;
+	const dma_addr_t *dma_addr;
 	struct drm_buddy *mm;
 };
 
@@ -70,6 +72,7 @@ static inline void xe_res_first(struct ttm_resource *res,
 				struct xe_res_cursor *cur)
 {
 	cur->sgl = NULL;
+	cur->dma_addr = NULL;
 	if (!res)
 		goto fallback;
 
@@ -160,11 +163,43 @@ static inline void xe_res_first_sg(const struct sg_table *sg,
 	cur->start = start;
 	cur->remaining = size;
 	cur->size = 0;
+	cur->dma_addr = NULL;
 	cur->sgl = sg->sgl;
 	cur->mem_type = XE_PL_TT;
 	__xe_res_sg_next(cur);
 }
 
+/**
+ * xe_res_first_dma - initialize a xe_res_cursor with dma_addr array
+ *
+ * @dma_addr: dma_addr array to walk
+ * @start: Start of the range
+ * @size: Size of the range
+ * @order: Order of dma mapping. i.e. PAGE_SIZE << order is mapping size
+ * @cur: cursor object to initialize
+ *
+ * Start walking over the range of allocations between @start and @size.
+ */
+static inline void xe_res_first_dma(const dma_addr_t *dma_addr,
+				    u64 start, u64 size,
+				    unsigned int order,
+				    struct xe_res_cursor *cur)
+{
+	XE_WARN_ON(start);
+	XE_WARN_ON(!dma_addr);
+	XE_WARN_ON(!IS_ALIGNED(start, PAGE_SIZE) ||
+		   !IS_ALIGNED(size, PAGE_SIZE));
+
+	cur->node = NULL;
+	cur->start = start;
+	cur->remaining = size;
+	cur->size = PAGE_SIZE << order;
+	cur->dma_addr = dma_addr;
+	cur->order = order;
+	cur->sgl = NULL;
+	cur->mem_type = XE_PL_TT;
+}
+
 /**
  * xe_res_next - advance the cursor
  *
@@ -191,6 +226,13 @@ static inline void xe_res_next(struct xe_res_cursor *cur, u64 size)
 		return;
 	}
 
+	if (cur->dma_addr) {
+		cur->size = (PAGE_SIZE << cur->order) -
+			(size - cur->size);
+		cur->start += size;
+		return;
+	}
+
 	if (cur->sgl) {
 		cur->start += size;
 		__xe_res_sg_next(cur);
@@ -232,6 +274,12 @@ static inline void xe_res_next(struct xe_res_cursor *cur, u64 size)
  */
 static inline u64 xe_res_dma(const struct xe_res_cursor *cur)
 {
-	return cur->sgl ? sg_dma_address(cur->sgl) + cur->start : cur->start;
+	if (cur->dma_addr)
+		return cur->dma_addr[cur->start >> (PAGE_SHIFT + cur->order)] +
+			(cur->start & ((PAGE_SIZE << cur->order) - 1));
+	else if (cur->sgl)
+		return sg_dma_address(cur->sgl) + cur->start;
+	else
+		return cur->start;
 }
 #endif
-- 
2.34.1


* [RFC PATCH 09/28] drm/xe: Add SVM range invalidation
  2024-08-28  2:48 [RFC PATCH 00/28] Introduce GPU SVM and Xe SVM implementation Matthew Brost
                   ` (7 preceding siblings ...)
  2024-08-28  2:48 ` [RFC PATCH 08/28] drm/xe: Add dma_addr res cursor Matthew Brost
@ 2024-08-28  2:48 ` Matthew Brost
  2024-08-28  2:48 ` [RFC PATCH 10/28] drm/gpuvm: Add DRM_GPUVA_OP_USER Matthew Brost
                   ` (22 subsequent siblings)
  31 siblings, 0 replies; 100+ messages in thread
From: Matthew Brost @ 2024-08-28  2:48 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: airlied, christian.koenig, thomas.hellstrom, matthew.auld, daniel

Add SVM range invalidation vfunc.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_gt_pagefault.c |  17 ++-
 drivers/gpu/drm/xe/xe_pt.c           |  24 ++++
 drivers/gpu/drm/xe/xe_pt.h           |   3 +
 drivers/gpu/drm/xe/xe_svm.c          | 201 ++++++++++++++++++++++++++-
 drivers/gpu/drm/xe/xe_svm.h          |  14 ++
 5 files changed, 253 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c b/drivers/gpu/drm/xe/xe_gt_pagefault.c
index 0be4687bfc20..e1f32d782f65 100644
--- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
+++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
@@ -19,6 +19,7 @@
 #include "xe_guc.h"
 #include "xe_guc_ct.h"
 #include "xe_migrate.h"
+#include "xe_svm.h"
 #include "xe_trace_bo.h"
 #include "xe_vm.h"
 
@@ -125,18 +126,17 @@ static int xe_pf_begin(struct drm_exec *exec, struct xe_vma *vma,
 	return 0;
 }
 
-static int handle_vma_pagefault(struct xe_tile *tile, struct pagefault *pf,
-				struct xe_vma *vma)
+static int handle_vma_pagefault(struct xe_tile *tile, struct xe_vma *vma,
+				bool atomic)
 {
 	struct xe_vm *vm = xe_vma_vm(vma);
 	struct drm_exec exec;
 	struct dma_fence *fence;
 	ktime_t end = 0;
 	int err;
-	bool atomic;
 
+	lockdep_assert_held_write(&vm->lock);
 	trace_xe_vma_pagefault(vma);
-	atomic = access_is_atomic(pf->access_type);
 
 	/* Check if VMA is valid */
 	if (vma_is_valid(tile, vma) && !atomic)
@@ -192,6 +192,7 @@ static int handle_pagefault(struct xe_gt *gt, struct pagefault *pf)
 	struct xe_vm *vm;
 	struct xe_vma *vma = NULL;
 	int err;
+	bool atomic;
 
 	/* SW isn't expected to handle TRTT faults */
 	if (pf->trva_fault)
@@ -218,7 +219,13 @@ static int handle_pagefault(struct xe_gt *gt, struct pagefault *pf)
 		goto unlock_vm;
 	}
 
-	err = handle_vma_pagefault(tile, pf, vma);
+	atomic = access_is_atomic(pf->access_type);
+
+	if (xe_vma_is_system_allocator(vma))
+		err = xe_svm_handle_pagefault(vm, vma, tile,
+					      pf->page_addr, atomic);
+	else
+		err = handle_vma_pagefault(tile, vma, atomic);
 
 unlock_vm:
 	if (!err)
diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
index d21e45efeaab..b2db79251825 100644
--- a/drivers/gpu/drm/xe/xe_pt.c
+++ b/drivers/gpu/drm/xe/xe_pt.c
@@ -20,6 +20,7 @@
 #include "xe_res_cursor.h"
 #include "xe_sched_job.h"
 #include "xe_sync.h"
+#include "xe_svm.h"
 #include "xe_trace.h"
 #include "xe_ttm_stolen_mgr.h"
 #include "xe_vm.h"
@@ -829,6 +830,29 @@ bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma)
 	return xe_walk.needs_invalidate;
 }
 
+bool xe_pt_zap_ptes_range(struct xe_tile *tile, struct xe_vm *vm,
+			  struct xe_svm_range *range)
+{
+	struct xe_pt_zap_ptes_walk xe_walk = {
+		.base = {
+			.ops = &xe_pt_zap_ptes_ops,
+			.shifts = xe_normal_pt_shifts,
+			.max_level = XE_PT_HIGHEST_LEVEL,
+		},
+		.tile = tile,
+	};
+	struct xe_pt *pt = vm->pt_root[tile->id];
+	u8 pt_mask = (range->tile_present & ~range->tile_invalidated);
+
+	if (!(pt_mask & BIT(tile->id)))
+		return false;
+
+	(void)xe_pt_walk_shared(&pt->base, pt->level, range->base.va.start,
+				range->base.va.end, &xe_walk.base);
+
+	return xe_walk.needs_invalidate;
+}
+
 static void
 xe_vm_populate_pgtable(struct xe_migrate_pt_update *pt_update, struct xe_tile *tile,
 		       struct iosys_map *map, void *data,
diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h
index 9ab386431cad..5f333eeedf5c 100644
--- a/drivers/gpu/drm/xe/xe_pt.h
+++ b/drivers/gpu/drm/xe/xe_pt.h
@@ -13,6 +13,7 @@ struct dma_fence;
 struct xe_bo;
 struct xe_device;
 struct xe_exec_queue;
+struct xe_svm_range;
 struct xe_sync_entry;
 struct xe_tile;
 struct xe_vm;
@@ -42,5 +43,7 @@ void xe_pt_update_ops_fini(struct xe_tile *tile, struct xe_vma_ops *vops);
 void xe_pt_update_ops_abort(struct xe_tile *tile, struct xe_vma_ops *vops);
 
 bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
+bool xe_pt_zap_ptes_range(struct xe_tile *tile, struct xe_vm *vm,
+			  struct xe_svm_range *range);
 
 #endif
diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index 7166100e3298..3ac84f9615e2 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -5,18 +5,189 @@
 
 #include "drm_gpusvm.h"
 
+#include "xe_gt_tlb_invalidation.h"
+#include "xe_pt.h"
 #include "xe_svm.h"
 #include "xe_vm.h"
 #include "xe_vm_types.h"
 
+static struct xe_vm *gpusvm_to_vm(struct drm_gpusvm *gpusvm)
+{
+	return container_of(gpusvm, struct xe_vm, svm.gpusvm);
+}
+
+static struct xe_vm *range_to_vm(struct drm_gpusvm_range *r)
+{
+	return gpusvm_to_vm(r->gpusvm);
+}
+
+static struct drm_gpusvm_range *
+xe_svm_range_alloc(struct drm_gpusvm *gpusvm)
+{
+	struct xe_svm_range *range;
+
+	range = kzalloc(sizeof(*range), GFP_KERNEL);
+	if (!range)
+		return ERR_PTR(-ENOMEM);
+
+	xe_vm_get(gpusvm_to_vm(gpusvm));
+
+	return &range->base;
+}
+
+static void xe_svm_range_free(struct drm_gpusvm_range *range)
+{
+	xe_vm_put(range_to_vm(range));
+	kfree(range);
+}
+
+static struct xe_svm_range *to_xe_range(struct drm_gpusvm_range *r)
+{
+	return container_of(r, struct xe_svm_range, base);
+}
+
+static u8
+xe_svm_range_notifier_event_begin(struct xe_vm *vm, struct drm_gpusvm_range *r,
+				  const struct mmu_notifier_range *mmu_range,
+				  u64 *adj_start, u64 *adj_end)
+{
+	struct xe_svm_range *range = to_xe_range(r);
+	struct xe_device *xe = vm->xe;
+	struct xe_tile *tile;
+	u8 tile_mask = 0;
+	u8 id;
+
+	/* Skip if already unmapped or if no binding exists */
+	if (range->base.flags.unmapped || !range->tile_present)
+		return 0;
+
+	/* Adjust invalidation to range boundaries */
+	if (range->base.va.start < mmu_range->start)
+		*adj_start = range->base.va.start;
+	if (range->base.va.end > mmu_range->end)
+		*adj_end = range->base.va.end;
+
+	/*
+	 * XXX: Ideally would zap PTEs in one shot in xe_svm_invalidate but the
+	 * invalidation code can't correctly cope with sparse ranges or
+	 * invalidations spanning multiple ranges.
+	 */
+	for_each_tile(tile, xe, id)
+		if (xe_pt_zap_ptes_range(tile, vm, range)) {
+			tile_mask |= BIT(id);
+			range->tile_invalidated |= BIT(id);
+		}
+
+	return tile_mask;
+}
+
+static void
+xe_svm_range_notifier_event_end(struct xe_vm *vm, struct drm_gpusvm_range *r,
+				const struct mmu_notifier_range *mmu_range)
+{
+	struct drm_gpusvm_ctx ctx = { .in_notifier = true, };
+
+	drm_gpusvm_range_unmap_pages(&vm->svm.gpusvm, r, &ctx);
+	/* TODO: Add range to garbage collector */
+}
+
 static void xe_svm_invalidate(struct drm_gpusvm *gpusvm,
 			      struct drm_gpusvm_notifier *notifier,
 			      const struct mmu_notifier_range *mmu_range)
 {
-	/* TODO: Implement */
+	struct xe_vm *vm = gpusvm_to_vm(gpusvm);
+	struct xe_device *xe = vm->xe;
+	struct xe_tile *tile;
+	struct drm_gpusvm_range *r, *first;
+	struct xe_gt_tlb_invalidation_fence
+		fence[XE_MAX_TILES_PER_DEVICE * XE_MAX_GT_PER_TILE];
+	u64 adj_start = mmu_range->start, adj_end = mmu_range->end;
+	u8 tile_mask = 0;
+	u8 id;
+	u32 fence_id = 0;
+	long err;
+
+	/* Adjust invalidation to notifier boundaries */
+	if (adj_start < notifier->interval.start)
+		adj_start = notifier->interval.start;
+	if (adj_end > notifier->interval.end)
+		adj_end = notifier->interval.end;
+
+	first = drm_gpusvm_range_find(notifier, adj_start, adj_end);
+	if (!first)
+		return;
+
+	/*
+	 * XXX: Less than ideal to always wait on the VM's resv slots if an
+	 * invalidation is not required. Could walk the range list twice to
+	 * figure out if an invalidation is needed, but also not ideal. Maybe
+	 * a counter within the notifier could work.
+	 */
+	err = dma_resv_wait_timeout(xe_vm_resv(vm),
+				    DMA_RESV_USAGE_BOOKKEEP,
+				    false, MAX_SCHEDULE_TIMEOUT);
+	XE_WARN_ON(err <= 0);
+
+	r = first;
+	drm_gpusvm_for_each_range(r, notifier, adj_start, adj_end)
+		tile_mask |= xe_svm_range_notifier_event_begin(vm, r, mmu_range,
+							       &adj_start,
+							       &adj_end);
+	if (!tile_mask)
+		goto range_notifier_event_end;
+
+	xe_device_wmb(xe);
+
+	for_each_tile(tile, xe, id) {
+		if (tile_mask & BIT(id)) {
+			int err;
+
+			xe_gt_tlb_invalidation_fence_init(tile->primary_gt,
+							  &fence[fence_id], true);
+
+			err = xe_gt_tlb_invalidation_range(tile->primary_gt,
+							   &fence[fence_id],
+							   adj_start,
+							   adj_end,
+							   vm->usm.asid);
+			if (WARN_ON_ONCE(err < 0)) {
+				xe_gt_tlb_invalidation_fence_fini(&fence[fence_id]);
+				goto wait;
+			}
+			++fence_id;
+
+			if (!tile->media_gt)
+				continue;
+
+			xe_gt_tlb_invalidation_fence_init(tile->media_gt,
+							  &fence[fence_id], true);
+
+			err = xe_gt_tlb_invalidation_range(tile->media_gt,
+							   &fence[fence_id],
+							   adj_start,
+							   adj_end,
+							   vm->usm.asid);
+			if (WARN_ON_ONCE(err < 0)) {
+				xe_gt_tlb_invalidation_fence_fini(&fence[fence_id]);
+				goto wait;
+			}
+			++fence_id;
+		}
+	}
+
+wait:
+	for (id = 0; id < fence_id; ++id)
+		xe_gt_tlb_invalidation_fence_wait(&fence[id]);
+
+range_notifier_event_end:
+	r = first;
+	drm_gpusvm_for_each_range(r, notifier, adj_start, adj_end)
+		xe_svm_range_notifier_event_end(vm, r, mmu_range);
 }
 
 static const struct drm_gpusvm_ops gpusvm_ops = {
+	.range_alloc = xe_svm_range_alloc,
+	.range_free = xe_svm_range_free,
 	.invalidate = xe_svm_invalidate,
 };
 
@@ -38,3 +209,31 @@ void xe_svm_fini(struct xe_vm *vm)
 {
 	drm_gpusvm_fini(&vm->svm.gpusvm);
 }
+
+int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
+			    struct xe_tile *tile, u64 fault_addr,
+			    bool atomic)
+{
+	struct drm_gpusvm_ctx ctx = { .read_only = xe_vma_read_only(vma), };
+	struct drm_gpusvm_range *r;
+	int err;
+
+	lockdep_assert_held_write(&vm->lock);
+
+retry:
+	/* TODO: Run garbage collector */
+
+	r = drm_gpusvm_range_find_or_insert(&vm->svm.gpusvm, fault_addr,
+					    xe_vma_start(vma), xe_vma_end(vma),
+					    &ctx);
+	if (IS_ERR(r))
+		return PTR_ERR(r);
+
+	err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r, false);
+	if (err == -EFAULT || err == -EPERM)	/* Corner case where CPU mappings have changed */
+		goto retry;
+
+	/* TODO: Issue bind */
+
+	return err;
+}
diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index 4982d9168095..b053b11692f0 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -6,9 +6,23 @@
 #ifndef _XE_SVM_H_
 #define _XE_SVM_H_
 
+#include "drm_gpusvm.h"
+
+struct xe_tile;
 struct xe_vm;
+struct xe_vma;
+
+struct xe_svm_range {
+	struct drm_gpusvm_range base;
+	u8 tile_present;
+	u8 tile_invalidated;
+};
 
 int xe_svm_init(struct xe_vm *vm);
 void xe_svm_fini(struct xe_vm *vm);
 
+int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
+			    struct xe_tile *tile, u64 fault_addr,
+			    bool atomic);
+
 #endif
-- 
2.34.1


* [RFC PATCH 10/28] drm/gpuvm: Add DRM_GPUVA_OP_USER
  2024-08-28  2:48 [RFC PATCH 00/28] Introduce GPU SVM and Xe SVM implementation Matthew Brost
                   ` (8 preceding siblings ...)
  2024-08-28  2:48 ` [RFC PATCH 09/28] drm/xe: Add SVM range invalidation Matthew Brost
@ 2024-08-28  2:48 ` Matthew Brost
  2024-08-28  2:48 ` [RFC PATCH 11/28] drm/xe: Add (re)bind to SVM page fault handler Matthew Brost
                   ` (21 subsequent siblings)
  31 siblings, 0 replies; 100+ messages in thread
From: Matthew Brost @ 2024-08-28  2:48 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: airlied, christian.koenig, thomas.hellstrom, matthew.auld, daniel

Add DRM_GPUVA_OP_USER, which allows drivers to define their own GPUVM ops.
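
A minimal sketch of the intended use (not part of this patch): a driver
stores its own sub-operation alongside the base op and dispatches on
DRM_GPUVA_OP_USER in its op handling, knowing the GPUVM core never
generates this type itself. handle_driver_subop() is a placeholder;
later patches in this series use the same pattern for SVM range binds.

	switch (op->base.op) {
	case DRM_GPUVA_OP_MAP:
	case DRM_GPUVA_OP_REMAP:
	case DRM_GPUVA_OP_UNMAP:
	case DRM_GPUVA_OP_PREFETCH:
		/* Operations generated by the GPUVM core */
		break;
	case DRM_GPUVA_OP_USER:
		/*
		 * Never generated by the GPUVM core; decode the driver's
		 * private sub-operation stored alongside the base op.
		 */
		err = handle_driver_subop(op);
		break;
	default:
		break;
	}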

Cc: Danilo Krummrich <dakr@redhat.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 include/drm/drm_gpuvm.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/drm/drm_gpuvm.h b/include/drm/drm_gpuvm.h
index 00d4e43b76b6..cc3f8ed5113b 100644
--- a/include/drm/drm_gpuvm.h
+++ b/include/drm/drm_gpuvm.h
@@ -812,6 +812,11 @@ enum drm_gpuva_op_type {
 	 * @DRM_GPUVA_OP_PREFETCH: the prefetch op type
 	 */
 	DRM_GPUVA_OP_PREFETCH,
+
+	/**
+	 * @DRM_GPUVA_OP_USER: the user defined op type
+	 */
+	DRM_GPUVA_OP_USER,
 };
 
 /**
-- 
2.34.1


* [RFC PATCH 11/28] drm/xe: Add (re)bind to SVM page fault handler
  2024-08-28  2:48 [RFC PATCH 00/28] Introduce GPU SVM and Xe SVM implementation Matthew Brost
                   ` (9 preceding siblings ...)
  2024-08-28  2:48 ` [RFC PATCH 10/28] drm/gpuvm: Add DRM_GPUVA_OP_USER Matthew Brost
@ 2024-08-28  2:48 ` Matthew Brost
  2024-08-28  2:48 ` [RFC PATCH 12/28] drm/xe: Add SVM garbage collector Matthew Brost
                   ` (20 subsequent siblings)
  31 siblings, 0 replies; 100+ messages in thread
From: Matthew Brost @ 2024-08-28  2:48 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: airlied, christian.koenig, thomas.hellstrom, matthew.auld, daniel

Add (re)bind to the SVM page fault handler. To facilitate this, add a
support function to the VM layer which (re)binds an SVM range. Also
teach the PT layer to understand (re)binds of SVM ranges.
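
A condensed sketch of the fault path this patch completes (locking,
retries, and error corner cases are elided; the function name is a
placeholder and the real flow is xe_svm_handle_pagefault() in the diff
below):

static int svm_fault_path_sketch(struct xe_vm *vm, struct xe_vma *vma,
				 struct xe_tile *tile, u64 fault_addr)
{
	struct drm_gpusvm_ctx ctx = { .read_only = xe_vma_read_only(vma), };
	struct drm_gpusvm_range *r;
	struct dma_fence *fence;
	int err;

	/* 1. Find or create the SVM range covering the faulting address */
	r = drm_gpusvm_range_find_or_insert(&vm->svm.gpusvm, fault_addr,
					    xe_vma_start(vma),
					    xe_vma_end(vma), &ctx);
	if (IS_ERR(r))
		return PTR_ERR(r);

	/* 2. Populate and DMA map the backing pages for the range */
	err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r, &ctx);
	if (err)
		return err;

	/* 3. (Re)bind the range in the GPU page tables via the VM layer */
	fence = xe_vm_range_rebind(vm, vma, to_xe_range(r), BIT(tile->id));
	if (IS_ERR(fence))
		return PTR_ERR(fence);

	/* 4. Wait for the bind to land before the fault is acknowledged */
	dma_fence_wait(fence, false);
	dma_fence_put(fence);

	return 0;
}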

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_pt.c       | 140 +++++++++++++++++++++++++++----
 drivers/gpu/drm/xe/xe_pt_types.h |   2 +
 drivers/gpu/drm/xe/xe_svm.c      |  45 +++++++++-
 drivers/gpu/drm/xe/xe_svm.h      |  11 +++
 drivers/gpu/drm/xe/xe_vm.c       |  80 ++++++++++++++++++
 drivers/gpu/drm/xe/xe_vm.h       |   5 ++
 drivers/gpu/drm/xe/xe_vm_types.h |  19 +++++
 7 files changed, 286 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
index b2db79251825..d5e444af7e02 100644
--- a/drivers/gpu/drm/xe/xe_pt.c
+++ b/drivers/gpu/drm/xe/xe_pt.c
@@ -587,6 +587,7 @@ static const struct xe_pt_walk_ops xe_pt_stage_bind_ops = {
  * range.
  * @tile: The tile we're building for.
  * @vma: The vma indicating the address range.
+ * @range: The range indicating the address range.
  * @entries: Storage for the update entries used for connecting the tree to
  * the main tree at commit time.
  * @num_entries: On output contains the number of @entries used.
@@ -602,6 +603,7 @@ static const struct xe_pt_walk_ops xe_pt_stage_bind_ops = {
  */
 static int
 xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
+		 struct xe_svm_range *range,
 		 struct xe_vm_pgtable_update *entries, u32 *num_entries)
 {
 	struct xe_device *xe = tile_to_xe(tile);
@@ -618,7 +620,8 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
 		.vm = xe_vma_vm(vma),
 		.tile = tile,
 		.curs = &curs,
-		.va_curs_start = xe_vma_start(vma),
+		.va_curs_start = range ? range->base.va.start :
+			xe_vma_start(vma),
 		.vma = vma,
 		.wupd.entries = entries,
 		.needs_64K = (xe_vma_vm(vma)->flags & XE_VM_FLAG_64K) && is_devmem,
@@ -671,7 +674,11 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
 
 	xe_bo_assert_held(bo);
 
-	if (!xe_vma_is_null(vma)) {
+	if (range) {
+		xe_res_first_dma(range->base.dma_addr, 0,
+				 range->base.va.end - range->base.va.start,
+				 range->base.order, &curs);
+	} else if (!xe_vma_is_null(vma)) {
 		if (xe_vma_is_userptr(vma))
 			xe_res_first_sg(to_userptr_vma(vma)->userptr.sg, 0,
 					xe_vma_size(vma), &curs);
@@ -685,8 +692,10 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
 		curs.size = xe_vma_size(vma);
 	}
 
-	ret = xe_pt_walk_range(&pt->base, pt->level, xe_vma_start(vma),
-			       xe_vma_end(vma), &xe_walk.base);
+	ret = xe_pt_walk_range(&pt->base, pt->level,
+			       range ? range->base.va.start : xe_vma_start(vma),
+			       range ? range->base.va.end : xe_vma_end(vma),
+			       &xe_walk.base);
 
 	*num_entries = xe_walk.wupd.num_used_entries;
 	return ret;
@@ -902,7 +911,7 @@ static void xe_pt_commit_locks_assert(struct xe_vma *vma)
 
 	lockdep_assert_held(&vm->lock);
 
-	if (!xe_vma_is_userptr(vma) && !xe_vma_is_null(vma))
+	if (!xe_vma_has_no_bo(vma))
 		dma_resv_assert_held(xe_vma_bo(vma)->ttm.base.resv);
 
 	xe_vm_assert_held(vm);
@@ -1004,12 +1013,13 @@ static void xe_pt_free_bind(struct xe_vm_pgtable_update *entries,
 
 static int
 xe_pt_prepare_bind(struct xe_tile *tile, struct xe_vma *vma,
+		   struct xe_svm_range *range,
 		   struct xe_vm_pgtable_update *entries, u32 *num_entries)
 {
 	int err;
 
 	*num_entries = 0;
-	err = xe_pt_stage_bind(tile, vma, entries, num_entries);
+	err = xe_pt_stage_bind(tile, vma, range, entries, num_entries);
 	if (!err)
 		xe_tile_assert(tile, *num_entries);
 
@@ -1115,6 +1125,8 @@ static int op_add_deps(struct xe_vm *vm, struct xe_vma_op *op,
 	case DRM_GPUVA_OP_PREFETCH:
 		err = vma_add_deps(gpuva_to_vma(op->base.prefetch.va), job);
 		break;
+	case DRM_GPUVA_OP_USER:
+		break;
 	default:
 		drm_warn(&vm->xe->drm, "NOT POSSIBLE");
 	}
@@ -1339,6 +1351,34 @@ static int xe_pt_userptr_pre_commit(struct xe_migrate_pt_update *pt_update)
 	return err;
 }
 
+static int xe_pt_svm_pre_commit(struct xe_migrate_pt_update *pt_update)
+{
+	struct xe_vm *vm = pt_update->vops->vm;
+	struct xe_vma_ops *vops = pt_update->vops;
+	struct xe_vma_op *op;
+	int err;
+
+	err = xe_pt_pre_commit(pt_update);
+	if (err)
+		return err;
+
+	xe_svm_notifier_lock(vm);
+
+	list_for_each_entry(op, &vops->list, link) {
+		struct xe_svm_range *range = op->map_range.range;
+
+		xe_assert(vm->xe, xe_vma_is_system_allocator(op->map_range.vma));
+		xe_assert(vm->xe, op->subop == XE_VMA_SUBOP_MAP_RANGE);
+
+		if (!xe_svm_range_pages_valid(range)) {
+			xe_svm_notifier_unlock(vm);
+			return -EAGAIN;
+		}
+	}
+
+	return 0;
+}
+
 struct invalidation_fence {
 	struct xe_gt_tlb_invalidation_fence base;
 	struct xe_gt *gt;
@@ -1632,12 +1672,12 @@ xe_pt_commit_prepare_unbind(struct xe_vma *vma,
 
 static void
 xe_pt_update_ops_rfence_interval(struct xe_vm_pgtable_update_ops *pt_update_ops,
-				 struct xe_vma *vma)
+				 u64 start, u64 end)
 {
+	u64 last;
 	u32 current_op = pt_update_ops->current_op;
 	struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[current_op];
 	int i, level = 0;
-	u64 start, last;
 
 	for (i = 0; i < pt_op->num_entries; i++) {
 		const struct xe_vm_pgtable_update *entry = &pt_op->entries[i];
@@ -1647,8 +1687,8 @@ xe_pt_update_ops_rfence_interval(struct xe_vm_pgtable_update_ops *pt_update_ops,
 	}
 
 	/* Greedy (non-optimal) calculation but simple */
-	start = ALIGN_DOWN(xe_vma_start(vma), 0x1ull << xe_pt_shift(level));
-	last = ALIGN(xe_vma_end(vma), 0x1ull << xe_pt_shift(level)) - 1;
+	start = ALIGN_DOWN(start, 0x1ull << xe_pt_shift(level));
+	last = ALIGN(end, 0x1ull << xe_pt_shift(level)) - 1;
 
 	if (start < pt_update_ops->start)
 		pt_update_ops->start = start;
@@ -1690,7 +1730,7 @@ static int bind_op_prepare(struct xe_vm *vm, struct xe_tile *tile,
 	if (err)
 		return err;
 
-	err = xe_pt_prepare_bind(tile, vma, pt_op->entries,
+	err = xe_pt_prepare_bind(tile, vma, NULL, pt_op->entries,
 				 &pt_op->num_entries);
 	if (!err) {
 		xe_tile_assert(tile, pt_op->num_entries <=
@@ -1698,7 +1738,9 @@ static int bind_op_prepare(struct xe_vm *vm, struct xe_tile *tile,
 		xe_vm_dbg_print_entries(tile_to_xe(tile), pt_op->entries,
 					pt_op->num_entries, true);
 
-		xe_pt_update_ops_rfence_interval(pt_update_ops, vma);
+		xe_pt_update_ops_rfence_interval(pt_update_ops,
+						 xe_vma_start(vma),
+						 xe_vma_end(vma));
 		++pt_update_ops->current_op;
 		pt_update_ops->needs_userptr_lock |= xe_vma_is_userptr(vma);
 
@@ -1732,6 +1774,48 @@ static int bind_op_prepare(struct xe_vm *vm, struct xe_tile *tile,
 	return err;
 }
 
+static int bind_range_prepare(struct xe_vm *vm, struct xe_tile *tile,
+			      struct xe_vm_pgtable_update_ops *pt_update_ops,
+			      struct xe_vma *vma, struct xe_svm_range *range)
+{
+	u32 current_op = pt_update_ops->current_op;
+	struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[current_op];
+	int err;
+
+	xe_tile_assert(tile, xe_vma_is_system_allocator(vma));
+
+	vm_dbg(&xe_vma_vm(vma)->xe->drm,
+	       "Preparing bind, with range [%llx...%llx)\n",
+	       range->base.va.start, range->base.va.end - 1);
+
+	pt_op->vma = NULL;
+	pt_op->bind = true;
+	pt_op->rebind = BIT(tile->id) & range->tile_present;
+
+	err = xe_pt_prepare_bind(tile, vma, range, pt_op->entries,
+				 &pt_op->num_entries);
+	if (!err) {
+		xe_tile_assert(tile, pt_op->num_entries <=
+			       ARRAY_SIZE(pt_op->entries));
+		xe_vm_dbg_print_entries(tile_to_xe(tile), pt_op->entries,
+					pt_op->num_entries, true);
+
+		xe_pt_update_ops_rfence_interval(pt_update_ops,
+						 range->base.va.start,
+						 range->base.va.end);
+		++pt_update_ops->current_op;
+		pt_update_ops->needs_svm_lock = true;
+
+		pt_op->vma = vma;
+		xe_pt_commit_prepare_bind(vma, pt_op->entries,
+					  pt_op->num_entries, pt_op->rebind);
+	} else {
+		xe_pt_cancel_bind(vma, pt_op->entries, pt_op->num_entries);
+	}
+
+	return err;
+}
+
 static int unbind_op_prepare(struct xe_tile *tile,
 			     struct xe_vm_pgtable_update_ops *pt_update_ops,
 			     struct xe_vma *vma)
@@ -1769,7 +1853,8 @@ static int unbind_op_prepare(struct xe_tile *tile,
 
 	xe_vm_dbg_print_entries(tile_to_xe(tile), pt_op->entries,
 				pt_op->num_entries, false);
-	xe_pt_update_ops_rfence_interval(pt_update_ops, vma);
+	xe_pt_update_ops_rfence_interval(pt_update_ops, xe_vma_start(vma),
+					 xe_vma_end(vma));
 	++pt_update_ops->current_op;
 	pt_update_ops->needs_userptr_lock |= xe_vma_is_userptr(vma);
 	pt_update_ops->needs_invalidation = true;
@@ -1839,6 +1924,15 @@ static int op_prepare(struct xe_vm *vm,
 		pt_update_ops->wait_vm_kernel = true;
 		break;
 	}
+	case DRM_GPUVA_OP_USER:
+		if (op->subop == XE_VMA_SUBOP_MAP_RANGE) {
+			xe_assert(vm->xe, xe_vma_is_system_allocator(op->map_range.vma));
+
+			err = bind_range_prepare(vm, tile, pt_update_ops,
+						 op->map_range.vma,
+						 op->map_range.range);
+		}
+		break;
 	default:
 		drm_warn(&vm->xe->drm, "NOT POSSIBLE");
 	}
@@ -2020,6 +2114,14 @@ static void op_commit(struct xe_vm *vm,
 				       fence2);
 		break;
 	}
+	case DRM_GPUVA_OP_USER:
+	{
+		if (op->subop == XE_VMA_SUBOP_MAP_RANGE) {
+			op->map_range.range->tile_present |= BIT(tile->id);
+			op->map_range.range->tile_invalidated &= ~BIT(tile->id);
+		}
+		break;
+	}
 	default:
 		drm_warn(&vm->xe->drm, "NOT POSSIBLE");
 	}
@@ -2037,6 +2139,12 @@ static const struct xe_migrate_pt_update_ops userptr_migrate_ops = {
 	.pre_commit = xe_pt_userptr_pre_commit,
 };
 
+static const struct xe_migrate_pt_update_ops svm_migrate_ops = {
+	.populate = xe_vm_populate_pgtable,
+	.clear = xe_migrate_clear_pgtable_callback,
+	.pre_commit = xe_pt_svm_pre_commit,
+};
+
 /**
  * xe_pt_update_ops_run() - Run PT update operations
  * @tile: Tile of PT update operations
@@ -2062,7 +2170,9 @@ xe_pt_update_ops_run(struct xe_tile *tile, struct xe_vma_ops *vops)
 	struct xe_vma_op *op;
 	int err = 0, i;
 	struct xe_migrate_pt_update update = {
-		.ops = pt_update_ops->needs_userptr_lock ?
+		.ops = pt_update_ops->needs_svm_lock ?
+			&svm_migrate_ops :
+			pt_update_ops->needs_userptr_lock ?
 			&userptr_migrate_ops :
 			&migrate_ops,
 		.vops = vops,
@@ -2183,6 +2293,8 @@ xe_pt_update_ops_run(struct xe_tile *tile, struct xe_vma_ops *vops)
 				  &ifence->base.base, &mfence->base.base);
 	}
 
+	if (pt_update_ops->needs_svm_lock)
+		xe_svm_notifier_unlock(vm);
 	if (pt_update_ops->needs_userptr_lock)
 		up_read(&vm->userptr.notifier_lock);
 
diff --git a/drivers/gpu/drm/xe/xe_pt_types.h b/drivers/gpu/drm/xe/xe_pt_types.h
index 384cc04de719..69eab6f37cfe 100644
--- a/drivers/gpu/drm/xe/xe_pt_types.h
+++ b/drivers/gpu/drm/xe/xe_pt_types.h
@@ -104,6 +104,8 @@ struct xe_vm_pgtable_update_ops {
 	u32 num_ops;
 	/** @current_op: current operations */
 	u32 current_op;
+	/** @needs_svm_lock: Needs SVM lock */
+	bool needs_svm_lock;
 	/** @needs_userptr_lock: Needs userptr lock */
 	bool needs_userptr_lock;
 	/** @needs_invalidation: Needs invalidation */
diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index 3ac84f9615e2..28d139b3dbb7 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -210,12 +210,22 @@ void xe_svm_fini(struct xe_vm *vm)
 	drm_gpusvm_fini(&vm->svm.gpusvm);
 }
 
+static bool xe_svm_range_is_valid(struct xe_svm_range *range,
+				  struct xe_tile *tile)
+{
+	return (range->tile_present & ~range->tile_invalidated) & BIT(tile->id);
+}
+
 int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
 			    struct xe_tile *tile, u64 fault_addr,
 			    bool atomic)
 {
 	struct drm_gpusvm_ctx ctx = { .read_only = xe_vma_read_only(vma), };
+	struct xe_svm_range *range;
 	struct drm_gpusvm_range *r;
+	struct drm_exec exec;
+	struct dma_fence *fence;
+	ktime_t end = 0;
 	int err;
 
 	lockdep_assert_held_write(&vm->lock);
@@ -229,11 +239,42 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
 	if (IS_ERR(r))
 		return PTR_ERR(r);
 
-	err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r, false);
+	range = to_xe_range(r);
+	if (xe_svm_range_is_valid(range, tile))
+		return 0;
+
+	err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r, &ctx);
 	if (err == -EFAULT || err == -EPERM)	/* Corner case where CPU mappings have changed */
 		goto retry;
+	if (err)
+		goto err_out;
+
+retry_bind:
+	drm_exec_init(&exec, 0, 0);
+	drm_exec_until_all_locked(&exec) {
+		err = drm_exec_lock_obj(&exec, vm->gpuvm.r_obj);
+		drm_exec_retry_on_contention(&exec);
+		if (err) {
+			drm_exec_fini(&exec);
+			goto err_out;
+		}
+
+		fence = xe_vm_range_rebind(vm, vma, range, BIT(tile->id));
+		if (IS_ERR(fence)) {
+			drm_exec_fini(&exec);
+			err = PTR_ERR(fence);
+			if (err == -EAGAIN)
+				goto retry;
+			if (xe_vm_validate_should_retry(&exec, err, &end))
+				goto retry_bind;
+			goto err_out;
+		}
+	}
+	drm_exec_fini(&exec);
 
-	/* TODO: Issue bind */
+	dma_fence_wait(fence, false);
+	dma_fence_put(fence);
 
+err_out:
 	return err;
 }
diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index b053b11692f0..7dabffaf4c65 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -25,4 +25,15 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
 			    struct xe_tile *tile, u64 fault_addr,
 			    bool atomic);
 
+static inline bool xe_svm_range_pages_valid(struct xe_svm_range *range)
+{
+	return drm_gpusvm_range_pages_valid(range->base.gpusvm, &range->base);
+}
+
+#define xe_svm_notifier_lock(vm__)	\
+	drm_gpusvm_notifier_lock(&(vm__)->svm.gpusvm)
+
+#define xe_svm_notifier_unlock(vm__)	\
+	drm_gpusvm_notifier_unlock(&(vm__)->svm.gpusvm)
+
 #endif
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 17ad6a533b2f..6261a4cb2e1d 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -894,6 +894,84 @@ struct dma_fence *xe_vma_rebind(struct xe_vm *vm, struct xe_vma *vma, u8 tile_ma
 	return fence;
 }
 
+static void xe_vm_populate_range_rebind(struct xe_vma_op *op,
+					struct xe_vma *vma,
+					struct xe_svm_range *range,
+					u8 tile_mask)
+{
+	INIT_LIST_HEAD(&op->link);
+	op->tile_mask = tile_mask;
+	op->base.op = DRM_GPUVA_OP_USER;
+	op->subop = XE_VMA_SUBOP_MAP_RANGE;
+	op->map_range.vma = vma;
+	op->map_range.range = range;
+}
+
+static int
+xe_vm_ops_add_range_rebind(struct xe_vma_ops *vops,
+			   struct xe_vma *vma,
+			   struct xe_svm_range *range,
+			   u8 tile_mask)
+{
+	struct xe_vma_op *op;
+
+	op = kzalloc(sizeof(*op), GFP_KERNEL);
+	if (!op)
+		return -ENOMEM;
+
+	xe_vm_populate_range_rebind(op, vma, range, tile_mask);
+	list_add_tail(&op->link, &vops->list);
+	xe_vma_ops_incr_pt_update_ops(vops, tile_mask);
+
+	return 0;
+}
+
+struct dma_fence *xe_vm_range_rebind(struct xe_vm *vm,
+				     struct xe_vma *vma,
+				     struct xe_svm_range *range,
+				     u8 tile_mask)
+{
+	struct dma_fence *fence = NULL;
+	struct xe_vma_ops vops;
+	struct xe_vma_op *op, *next_op;
+	struct xe_tile *tile;
+	u8 id;
+	int err;
+
+	lockdep_assert_held(&vm->lock);
+	xe_vm_assert_held(vm);
+	xe_assert(vm->xe, xe_vm_in_fault_mode(vm));
+	xe_assert(vm->xe, xe_vma_is_system_allocator(vma));
+
+	xe_vma_ops_init(&vops, vm, NULL, NULL, 0);
+	for_each_tile(tile, vm->xe, id) {
+		vops.pt_update_ops[id].wait_vm_bookkeep = true;
+		vops.pt_update_ops[tile->id].q =
+			xe_tile_migrate_exec_queue(tile);
+	}
+
+	err = xe_vm_ops_add_range_rebind(&vops, vma, range, tile_mask);
+	if (err)
+		return ERR_PTR(err);
+
+	err = xe_vma_ops_alloc(&vops, false);
+	if (err) {
+		fence = ERR_PTR(err);
+		goto free_ops;
+	}
+
+	fence = ops_execute(vm, &vops);
+
+free_ops:
+	list_for_each_entry_safe(op, next_op, &vops.list, link) {
+		list_del(&op->link);
+		kfree(op);
+	}
+	xe_vma_ops_fini(&vops);
+
+	return fence;
+}
+
 static void xe_vma_free(struct xe_vma *vma)
 {
 	if (xe_vma_is_userptr(vma))
@@ -2516,6 +2594,8 @@ static void op_trace(struct xe_vma_op *op)
 	case DRM_GPUVA_OP_PREFETCH:
 		trace_xe_vma_bind(gpuva_to_vma(op->base.prefetch.va));
 		break;
+	case DRM_GPUVA_OP_USER:
+		break;
 	default:
 		XE_WARN_ON("NOT POSSIBLE");
 	}
diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
index 1a5aed678214..8bd921b33090 100644
--- a/drivers/gpu/drm/xe/xe_vm.h
+++ b/drivers/gpu/drm/xe/xe_vm.h
@@ -22,6 +22,7 @@ struct ttm_validate_buffer;
 struct xe_exec_queue;
 struct xe_file;
 struct xe_sync_entry;
+struct xe_svm_range;
 struct drm_exec;
 
 struct xe_vm *xe_vm_create(struct xe_device *xe, u32 flags);
@@ -217,6 +218,10 @@ int xe_vm_userptr_check_repin(struct xe_vm *vm);
 int xe_vm_rebind(struct xe_vm *vm, bool rebind_worker);
 struct dma_fence *xe_vma_rebind(struct xe_vm *vm, struct xe_vma *vma,
 				u8 tile_mask);
+struct dma_fence *xe_vm_range_rebind(struct xe_vm *vm,
+				     struct xe_vma *vma,
+				     struct xe_svm_range *range,
+				     u8 tile_mask);
 
 int xe_vm_invalidate_vma(struct xe_vma *vma);
 
diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
index bd1c0e368238..b736e53779d2 100644
--- a/drivers/gpu/drm/xe/xe_vm_types.h
+++ b/drivers/gpu/drm/xe/xe_vm_types.h
@@ -19,6 +19,7 @@
 #include "xe_range_fence.h"
 
 struct xe_bo;
+struct xe_svm_range;
 struct xe_sync_entry;
 struct xe_user_fence;
 struct xe_vm;
@@ -334,6 +335,14 @@ struct xe_vma_op_prefetch {
 	u32 region;
 };
 
+/** struct xe_vma_op_map_range - VMA map range operation */
+struct xe_vma_op_map_range {
+	/** @vma: VMA to map (system allocator VMA) */
+	struct xe_vma *vma;
+	/** @range: SVM range to map */
+	struct xe_svm_range *range;
+};
+
 /** enum xe_vma_op_flags - flags for VMA operation */
 enum xe_vma_op_flags {
 	/** @XE_VMA_OP_COMMITTED: VMA operation committed */
@@ -344,6 +353,12 @@ enum xe_vma_op_flags {
 	XE_VMA_OP_NEXT_COMMITTED	= BIT(2),
 };
 
+/** enum xe_vma_subop - VMA sub-operation */
+enum xe_vma_subop {
+	/** @XE_VMA_SUBOP_MAP_RANGE: Map range */
+	XE_VMA_SUBOP_MAP_RANGE,
+};
+
 /** struct xe_vma_op - VMA operation */
 struct xe_vma_op {
 	/** @base: GPUVA base operation */
@@ -352,6 +367,8 @@ struct xe_vma_op {
 	struct list_head link;
 	/** @flags: operation flags */
 	enum xe_vma_op_flags flags;
+	/** @subop: user defined sub-operation */
+	enum xe_vma_subop subop;
 	/** @tile_mask: Tile mask for operation */
 	u8 tile_mask;
 
@@ -362,6 +379,8 @@ struct xe_vma_op {
 		struct xe_vma_op_remap remap;
 		/** @prefetch: VMA prefetch operation specific data */
 		struct xe_vma_op_prefetch prefetch;
+		/** @map: VMA map range operation specific data */
+		struct xe_vma_op_map_range map_range;
 	};
 };
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH 12/28] drm/xe: Add SVM garbage collector
  2024-08-28  2:48 [RFC PATCH 00/28] Introduce GPU SVM and Xe SVM implementation Matthew Brost
                   ` (10 preceding siblings ...)
  2024-08-28  2:48 ` [RFC PATCH 11/28] drm/xe: Add (re)bind to SVM page fault handler Matthew Brost
@ 2024-08-28  2:48 ` Matthew Brost
  2024-08-28  2:48 ` [RFC PATCH 13/28] drm/xe: Add unbind to " Matthew Brost
                   ` (19 subsequent siblings)
  31 siblings, 0 replies; 100+ messages in thread
From: Matthew Brost @ 2024-08-28  2:48 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: airlied, christian.koenig, thomas.hellstrom, matthew.auld, daniel

Add a basic SVM garbage collector which can destroy an SVM range upon an
MMU UNMAP event.
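
Condensed, the flow added here has two halves (pf_wq below stands for the
USM page-fault workqueue used in the patch; error handling omitted):

	/* Notifier side: mark the range unmapped and queue it for collection */
	drm_gpusvm_range_set_unmapped(&range->base, mmu_range);
	spin_lock(&vm->svm.garbage_collector.lock);
	if (list_empty(&range->garbage_collector_link))
		list_add_tail(&range->garbage_collector_link,
			      &vm->svm.garbage_collector.range_list);
	spin_unlock(&vm->svm.garbage_collector.lock);
	queue_work(pf_wq, &vm->svm.garbage_collector.work);

	/* Worker side: drain the list under the VM lock and destroy ranges */
	down_write(&vm->lock);
	xe_svm_garbage_collector(vm);
	up_write(&vm->lock);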

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_svm.c      | 85 +++++++++++++++++++++++++++++++-
 drivers/gpu/drm/xe/xe_svm.h      |  1 +
 drivers/gpu/drm/xe/xe_vm.c       |  6 +++
 drivers/gpu/drm/xe/xe_vm_types.h |  5 ++
 4 files changed, 95 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index 28d139b3dbb7..20010c09e125 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -30,6 +30,7 @@ xe_svm_range_alloc(struct drm_gpusvm *gpusvm)
 	if (!range)
 		return ERR_PTR(-ENOMEM);
 
+	INIT_LIST_HEAD(&range->garbage_collector_link);
 	xe_vm_get(gpusvm_to_vm(gpusvm));
 
 	return &range->base;
@@ -46,6 +47,24 @@ static struct xe_svm_range *to_xe_range(struct drm_gpusvm_range *r)
 	return container_of(r, struct xe_svm_range, base);
 }
 
+static void
+xe_svm_garbage_collector_add_range(struct xe_vm *vm, struct xe_svm_range *range,
+				   const struct mmu_notifier_range *mmu_range)
+{
+	struct xe_device *xe = vm->xe;
+
+	drm_gpusvm_range_set_unmapped(&range->base, mmu_range);
+
+	spin_lock(&vm->svm.garbage_collector.lock);
+	if (list_empty(&range->garbage_collector_link))
+		list_add_tail(&range->garbage_collector_link,
+			      &vm->svm.garbage_collector.range_list);
+	spin_unlock(&vm->svm.garbage_collector.lock);
+
+	queue_work(xe_device_get_root_tile(xe)->primary_gt->usm.pf_wq,
+		   &vm->svm.garbage_collector.work);
+}
+
 static u8
 xe_svm_range_notifier_event_begin(struct xe_vm *vm, struct drm_gpusvm_range *r,
 				  const struct mmu_notifier_range *mmu_range,
@@ -88,7 +107,9 @@ xe_svm_range_notifier_event_end(struct xe_vm *vm, struct drm_gpusvm_range *r,
 	struct drm_gpusvm_ctx ctx = { .in_notifier = true, };
 
 	drm_gpusvm_range_unmap_pages(&vm->svm.gpusvm, r, &ctx);
-	/* TODO: Add range to garbage collector */
+	if (mmu_range->event == MMU_NOTIFY_UNMAP)
+		xe_svm_garbage_collector_add_range(vm, to_xe_range(r),
+						   mmu_range);
 }
 
 static void xe_svm_invalidate(struct drm_gpusvm *gpusvm,
@@ -185,6 +206,58 @@ static void xe_svm_invalidate(struct drm_gpusvm *gpusvm,
 		xe_svm_range_notifier_event_end(vm, r, mmu_range);
 }
 
+static int __xe_svm_garbage_collector(struct xe_vm *vm,
+				      struct xe_svm_range *range)
+{
+	/* TODO: Do unbind */
+
+	drm_gpusvm_range_remove(&vm->svm.gpusvm, &range->base);
+
+	return 0;
+}
+
+static int xe_svm_garbage_collector(struct xe_vm *vm)
+{
+	struct xe_svm_range *range, *next;
+	int err;
+
+	lockdep_assert_held_write(&vm->lock);
+
+	if (xe_vm_is_closed_or_banned(vm))
+		return -ENOENT;
+
+	spin_lock(&vm->svm.garbage_collector.lock);
+	list_for_each_entry_safe(range, next,
+				 &vm->svm.garbage_collector.range_list,
+				 garbage_collector_link) {
+		list_del(&range->garbage_collector_link);
+		spin_unlock(&vm->svm.garbage_collector.lock);
+
+		err = __xe_svm_garbage_collector(vm, range);
+		if (err) {
+			drm_warn(&vm->xe->drm,
+				 "Garbage collection failed: %d\n", err);
+			xe_vm_kill(vm, true);
+			return err;
+		}
+
+		spin_lock(&vm->svm.garbage_collector.lock);
+	}
+	spin_unlock(&vm->svm.garbage_collector.lock);
+
+	return 0;
+}
+
+static void xe_svm_garbage_collector_work_func(struct work_struct *w)
+{
+	struct xe_vm *vm = container_of(w, struct xe_vm,
+					svm.garbage_collector.work);
+
+	down_write(&vm->lock);
+	xe_svm_garbage_collector(vm);
+	up_write(&vm->lock);
+}
+
 static const struct drm_gpusvm_ops gpusvm_ops = {
 	.range_alloc = xe_svm_range_alloc,
 	.range_free = xe_svm_range_free,
@@ -199,6 +272,11 @@ static const u64 fault_chunk_sizes[] = {
 
 int xe_svm_init(struct xe_vm *vm)
 {
+	spin_lock_init(&vm->svm.garbage_collector.lock);
+	INIT_LIST_HEAD(&vm->svm.garbage_collector.range_list);
+	INIT_WORK(&vm->svm.garbage_collector.work,
+		  xe_svm_garbage_collector_work_func);
+
 	return drm_gpusvm_init(&vm->svm.gpusvm, "Xe SVM", &vm->xe->drm,
 			       current->mm, NULL, 0, vm->size,
 			       SZ_512M, &gpusvm_ops, fault_chunk_sizes,
@@ -231,7 +309,10 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
 	lockdep_assert_held_write(&vm->lock);
 
 retry:
-	/* TODO: Run garbage collector */
+	/* Always process UNMAPs first so the view of SVM ranges is current */
+	err = xe_svm_garbage_collector(vm);
+	if (err)
+		return err;
 
 	r = drm_gpusvm_range_find_or_insert(&vm->svm.gpusvm, fault_addr,
 					    xe_vma_start(vma), xe_vma_end(vma),
diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index 7dabffaf4c65..84fd0d8c3380 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -14,6 +14,7 @@ struct xe_vma;
 
 struct xe_svm_range {
 	struct drm_gpusvm_range base;
+	struct list_head garbage_collector_link;
 	u8 tile_present;
 	u8 tile_invalidated;
 };
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 6261a4cb2e1d..1fd2c99245f2 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -1633,6 +1633,8 @@ void xe_vm_close_and_put(struct xe_vm *vm)
 	xe_vm_close(vm);
 	if (xe_vm_in_preempt_fence_mode(vm))
 		flush_work(&vm->preempt.rebind_work);
+	if (xe_vm_in_fault_mode(vm))
+		flush_work(&vm->svm.garbage_collector.work);
 
 	down_write(&vm->lock);
 	for_each_tile(tile, xe, id) {
@@ -3064,6 +3066,10 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 		goto put_exec_queue;
 	}
 
+	/* Ensure all UNMAPs are visible */
+	if (xe_vm_in_fault_mode(vm))
+		flush_work(&vm->svm.garbage_collector.work);
+
 	err = down_write_killable(&vm->lock);
 	if (err)
 		goto put_vm;
diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
index b736e53779d2..2eae3575c409 100644
--- a/drivers/gpu/drm/xe/xe_vm_types.h
+++ b/drivers/gpu/drm/xe/xe_vm_types.h
@@ -146,6 +146,11 @@ struct xe_vm {
 	struct {
 		/** @svm.gpusvm: base GPUSVM used to track fault allocations */
 		struct drm_gpusvm gpusvm;
+		struct {
+			spinlock_t lock;
+			struct list_head range_list;
+			struct work_struct work;
+		} garbage_collector;
 	} svm;
 
 	struct xe_device *xe;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH 13/28] drm/xe: Add unbind to SVM garbage collector
  2024-08-28  2:48 [RFC PATCH 00/28] Introduce GPU SVM and Xe SVM implementation Matthew Brost
                   ` (11 preceding siblings ...)
  2024-08-28  2:48 ` [RFC PATCH 12/28] drm/xe: Add SVM garbage collector Matthew Brost
@ 2024-08-28  2:48 ` Matthew Brost
  2024-08-28  2:48 ` [RFC PATCH 14/28] drm/xe: Do not allow system allocator VMA unbind if the GPU has bindings Matthew Brost
                   ` (18 subsequent siblings)
  31 siblings, 0 replies; 100+ messages in thread
From: Matthew Brost @ 2024-08-28  2:48 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: airlied, christian.koenig, thomas.hellstrom, matthew.auld, daniel

Add unbind to the SVM garbage collector. To facilitate this, add an unbind
support function to the VM layer which unbinds an SVM range. Also teach the
PT layer to understand unbinds of SVM ranges.
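
Condensed, for each queued range the collector now does roughly the
following (mirroring __xe_svm_garbage_collector below):

	xe_vm_lock(vm, false);
	fence = xe_vm_range_unbind(vm, range);	/* VM layer: build and execute unbind */
	xe_vm_unlock(vm);
	if (IS_ERR(fence))
		return PTR_ERR(fence);
	dma_fence_put(fence);
	drm_gpusvm_range_remove(&vm->svm.gpusvm, &range->base);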

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_pt.c       | 84 ++++++++++++++++++++++++++------
 drivers/gpu/drm/xe/xe_svm.c      |  9 +++-
 drivers/gpu/drm/xe/xe_vm.c       | 73 +++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_vm.h       |  2 +
 drivers/gpu/drm/xe/xe_vm_types.h | 12 ++++-
 5 files changed, 162 insertions(+), 18 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
index d5e444af7e02..fc86adf9f0a6 100644
--- a/drivers/gpu/drm/xe/xe_pt.c
+++ b/drivers/gpu/drm/xe/xe_pt.c
@@ -905,10 +905,16 @@ static void xe_pt_cancel_bind(struct xe_vma *vma,
 	}
 }
 
+#define INVALID_VMA	((struct xe_vma *)0xdeaddeadull)
+
 static void xe_pt_commit_locks_assert(struct xe_vma *vma)
 {
-	struct xe_vm *vm = xe_vma_vm(vma);
+	struct xe_vm *vm;
 
+	if (vma == INVALID_VMA)
+		return;
+
+	vm = xe_vma_vm(vma);
 	lockdep_assert_held(&vm->lock);
 
 	if (!xe_vma_has_no_bo(vma))
@@ -934,7 +940,8 @@ static void xe_pt_commit(struct xe_vma *vma,
 		for (j = 0; j < entries[i].qwords; j++) {
 			struct xe_pt *oldpte = entries[i].pt_entries[j].pt;
 
-			xe_pt_destroy(oldpte, xe_vma_vm(vma)->flags, deferred);
+			xe_pt_destroy(oldpte, (vma == INVALID_VMA) ? 0 :
+				      xe_vma_vm(vma)->flags, deferred);
 		}
 	}
 }
@@ -1367,6 +1374,9 @@ static int xe_pt_svm_pre_commit(struct xe_migrate_pt_update *pt_update)
 	list_for_each_entry(op, &vops->list, link) {
 		struct xe_svm_range *range = op->map_range.range;
 
+		if (op->subop == XE_VMA_SUBOP_UNMAP_RANGE)
+			continue;
+
 		xe_assert(vm->xe, xe_vma_is_system_allocator(op->map_range.vma));
 		xe_assert(vm->xe, op->subop == XE_VMA_SUBOP_MAP_RANGE);
 
@@ -1565,7 +1575,9 @@ static const struct xe_pt_walk_ops xe_pt_stage_unbind_ops = {
  * xe_pt_stage_unbind() - Build page-table update structures for an unbind
  * operation
  * @tile: The tile we're unbinding for.
+ * @vm: The vm
  * @vma: The vma we're unbinding.
+ * @range: The range we're unbinding.
  * @entries: Caller-provided storage for the update structures.
  *
  * Builds page-table update structures for an unbind operation. The function
@@ -1575,9 +1587,14 @@ static const struct xe_pt_walk_ops xe_pt_stage_unbind_ops = {
  *
  * Return: The number of entries used.
  */
-static unsigned int xe_pt_stage_unbind(struct xe_tile *tile, struct xe_vma *vma,
+static unsigned int xe_pt_stage_unbind(struct xe_tile *tile,
+				       struct xe_vm *vm,
+				       struct xe_vma *vma,
+				       struct xe_svm_range *range,
 				       struct xe_vm_pgtable_update *entries)
 {
+	u64 start = range ? range->base.va.start : xe_vma_start(vma);
+	u64 end = range ? range->base.va.end : xe_vma_end(vma);
 	struct xe_pt_stage_unbind_walk xe_walk = {
 		.base = {
 			.ops = &xe_pt_stage_unbind_ops,
@@ -1585,14 +1602,14 @@ static unsigned int xe_pt_stage_unbind(struct xe_tile *tile, struct xe_vma *vma,
 			.max_level = XE_PT_HIGHEST_LEVEL,
 		},
 		.tile = tile,
-		.modified_start = xe_vma_start(vma),
-		.modified_end = xe_vma_end(vma),
+		.modified_start = start,
+		.modified_end = end,
 		.wupd.entries = entries,
 	};
-	struct xe_pt *pt = xe_vma_vm(vma)->pt_root[tile->id];
+	struct xe_pt *pt = vm->pt_root[tile->id];
 
-	(void)xe_pt_walk_shared(&pt->base, pt->level, xe_vma_start(vma),
-				xe_vma_end(vma), &xe_walk.base);
+	(void)xe_pt_walk_shared(&pt->base, pt->level, start, end,
+				&xe_walk.base);
 
 	return xe_walk.wupd.num_used_entries;
 }
@@ -1834,13 +1851,6 @@ static int unbind_op_prepare(struct xe_tile *tile,
 	       "Preparing unbind, with range [%llx...%llx)\n",
 	       xe_vma_start(vma), xe_vma_end(vma) - 1);
 
-	/*
-	 * Wait for invalidation to complete. Can corrupt internal page table
-	 * state if an invalidation is running while preparing an unbind.
-	 */
-	if (xe_vma_is_userptr(vma) && xe_vm_in_fault_mode(xe_vma_vm(vma)))
-		mmu_interval_read_begin(&to_userptr_vma(vma)->userptr.notifier);
-
 	pt_op->vma = vma;
 	pt_op->bind = false;
 	pt_op->rebind = false;
@@ -1849,7 +1859,8 @@ static int unbind_op_prepare(struct xe_tile *tile,
 	if (err)
 		return err;
 
-	pt_op->num_entries = xe_pt_stage_unbind(tile, vma, pt_op->entries);
+	pt_op->num_entries = xe_pt_stage_unbind(tile, xe_vma_vm(vma),
+						vma, NULL, pt_op->entries);
 
 	xe_vm_dbg_print_entries(tile_to_xe(tile), pt_op->entries,
 				pt_op->num_entries, false);
@@ -1864,6 +1875,42 @@ static int unbind_op_prepare(struct xe_tile *tile,
 	return 0;
 }
 
+static int unbind_range_prepare(struct xe_vm *vm,
+				struct xe_tile *tile,
+				struct xe_vm_pgtable_update_ops *pt_update_ops,
+				struct xe_svm_range *range)
+{
+	u32 current_op = pt_update_ops->current_op;
+	struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[current_op];
+
+	if (!(range->tile_present & BIT(tile->id)))
+		return 0;
+
+	vm_dbg(&vm->xe->drm,
+	       "Preparing unbind, with range [%llx...%llx)\n",
+	       range->base.va.start, range->base.va.end - 1);
+
+	pt_op->vma = INVALID_VMA;
+	pt_op->bind = false;
+	pt_op->rebind = false;
+
+	pt_op->num_entries = xe_pt_stage_unbind(tile, vm, NULL, range,
+						pt_op->entries);
+
+	xe_vm_dbg_print_entries(tile_to_xe(tile), pt_op->entries,
+				pt_op->num_entries, false);
+	xe_pt_update_ops_rfence_interval(pt_update_ops, range->base.va.start,
+					 range->base.va.end);
+	++pt_update_ops->current_op;
+	pt_update_ops->needs_svm_lock = true;
+	pt_update_ops->needs_invalidation = true;
+
+	xe_pt_commit_prepare_unbind(INVALID_VMA, pt_op->entries,
+				    pt_op->num_entries);
+
+	return 0;
+}
+
 static int op_prepare(struct xe_vm *vm,
 		      struct xe_tile *tile,
 		      struct xe_vm_pgtable_update_ops *pt_update_ops,
@@ -1931,6 +1978,9 @@ static int op_prepare(struct xe_vm *vm,
 			err = bind_range_prepare(vm, tile, pt_update_ops,
 						 op->map_range.vma,
 						 op->map_range.range);
+		} else if (op->subop == XE_VMA_SUBOP_UNMAP_RANGE) {
+			err = unbind_range_prepare(vm, tile, pt_update_ops,
+						   op->unmap_range.range);
 		}
 		break;
 	default:
@@ -2119,6 +2169,8 @@ static void op_commit(struct xe_vm *vm,
 		if (op->subop == XE_VMA_SUBOP_MAP_RANGE) {
 			op->map_range.range->tile_present |= BIT(tile->id);
 			op->map_range.range->tile_invalidated &= ~BIT(tile->id);
+		} else if (op->subop == XE_VMA_SUBOP_UNMAP_RANGE) {
+			op->unmap_range.range->tile_present &= ~BIT(tile->id);
 		}
 		break;
 	}
diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index 20010c09e125..7188aa590fa5 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -209,7 +209,14 @@ static void xe_svm_invalidate(struct drm_gpusvm *gpusvm,
 static int __xe_svm_garbage_collector(struct xe_vm *vm,
 				      struct xe_svm_range *range)
 {
-	/* TODO: Do unbind */
+	struct dma_fence *fence;
+
+	xe_vm_lock(vm, false);
+	fence = xe_vm_range_unbind(vm, range);
+	xe_vm_unlock(vm);
+	if (IS_ERR(fence))
+		return PTR_ERR(fence);
+	dma_fence_put(fence);
 
 	drm_gpusvm_range_remove(&vm->svm.gpusvm, &range->base);
 
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 1fd2c99245f2..6916cdfe4be3 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -972,6 +972,79 @@ struct dma_fence *xe_vm_range_rebind(struct xe_vm *vm,
 	return fence;
 }
 
+static void xe_vm_populate_range_unbind(struct xe_vma_op *op,
+					struct xe_svm_range *range)
+{
+	INIT_LIST_HEAD(&op->link);
+	op->tile_mask = range->tile_present;
+	op->base.op = DRM_GPUVA_OP_USER;
+	op->subop = XE_VMA_SUBOP_UNMAP_RANGE;
+	op->unmap_range.range = range;
+}
+
+static int
+xe_vm_ops_add_range_unbind(struct xe_vma_ops *vops,
+			   struct xe_svm_range *range)
+{
+	struct xe_vma_op *op;
+
+	op = kzalloc(sizeof(*op), GFP_KERNEL);
+	if (!op)
+		return -ENOMEM;
+
+	xe_vm_populate_range_unbind(op, range);
+	list_add_tail(&op->link, &vops->list);
+	xe_vma_ops_incr_pt_update_ops(vops, range->tile_present);
+
+	return 0;
+}
+
+struct dma_fence *xe_vm_range_unbind(struct xe_vm *vm,
+				     struct xe_svm_range *range)
+{
+	struct dma_fence *fence = NULL;
+	struct xe_vma_ops vops;
+	struct xe_vma_op *op, *next_op;
+	struct xe_tile *tile;
+	u8 id;
+	int err;
+
+	lockdep_assert_held(&vm->lock);
+	xe_vm_assert_held(vm);
+	xe_assert(vm->xe, xe_vm_in_fault_mode(vm));
+
+	if (!range->tile_present)
+		return dma_fence_get_stub();
+
+	xe_vma_ops_init(&vops, vm, NULL, NULL, 0);
+	for_each_tile(tile, vm->xe, id) {
+		vops.pt_update_ops[id].wait_vm_bookkeep = true;
+		vops.pt_update_ops[tile->id].q =
+			xe_tile_migrate_exec_queue(tile);
+	}
+
+	err = xe_vm_ops_add_range_unbind(&vops, range);
+	if (err)
+		return ERR_PTR(err);
+
+	err = xe_vma_ops_alloc(&vops, false);
+	if (err) {
+		fence = ERR_PTR(err);
+		goto free_ops;
+	}
+
+	fence = ops_execute(vm, &vops);
+
+free_ops:
+	list_for_each_entry_safe(op, next_op, &vops.list, link) {
+		list_del(&op->link);
+		kfree(op);
+	}
+	xe_vma_ops_fini(&vops);
+
+	return fence;
+}
+
 static void xe_vma_free(struct xe_vma *vma)
 {
 	if (xe_vma_is_userptr(vma))
diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
index 8bd921b33090..d577ca9e3d65 100644
--- a/drivers/gpu/drm/xe/xe_vm.h
+++ b/drivers/gpu/drm/xe/xe_vm.h
@@ -222,6 +222,8 @@ struct dma_fence *xe_vm_range_rebind(struct xe_vm *vm,
 				     struct xe_vma *vma,
 				     struct xe_svm_range *range,
 				     u8 tile_mask);
+struct dma_fence *xe_vm_range_unbind(struct xe_vm *vm,
+				     struct xe_svm_range *range);
 
 int xe_vm_invalidate_vma(struct xe_vma *vma);
 
diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
index 2eae3575c409..d38cf7558f62 100644
--- a/drivers/gpu/drm/xe/xe_vm_types.h
+++ b/drivers/gpu/drm/xe/xe_vm_types.h
@@ -348,6 +348,12 @@ struct xe_vma_op_map_range {
 	struct xe_svm_range *range;
 };
 
+/** struct xe_vma_op_unmap_range - VMA unmap range operation */
+struct xe_vma_op_unmap_range {
+	/** @range: SVM range to unmap */
+	struct xe_svm_range *range;
+};
+
 /** enum xe_vma_op_flags - flags for VMA operation */
 enum xe_vma_op_flags {
 	/** @XE_VMA_OP_COMMITTED: VMA operation committed */
@@ -362,6 +368,8 @@ enum xe_vma_op_flags {
 enum xe_vma_subop {
 	/** @XE_VMA_SUBOP_MAP_RANGE: Map range */
 	XE_VMA_SUBOP_MAP_RANGE,
+	/** @XE_VMA_SUBOP_UNMAP_RANGE: Unmap range */
+	XE_VMA_SUBOP_UNMAP_RANGE,
 };
 
 /** struct xe_vma_op - VMA operation */
@@ -384,8 +392,10 @@ struct xe_vma_op {
 		struct xe_vma_op_remap remap;
 		/** @prefetch: VMA prefetch operation specific data */
 		struct xe_vma_op_prefetch prefetch;
-		/** @map: VMA map range operation specific data */
+		/** @map_range: VMA map range operation specific data */
 		struct xe_vma_op_map_range map_range;
+		/** @unmap_range: VMA unmap range operation specific data */
+		struct xe_vma_op_unmap_range unmap_range;
 	};
 };
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH 14/28] drm/xe: Do not allow system allocator VMA unbind if the GPU has bindings
  2024-08-28  2:48 [RFC PATCH 00/28] Introduce GPU SVM and Xe SVM implementation Matthew Brost
                   ` (12 preceding siblings ...)
  2024-08-28  2:48 ` [RFC PATCH 13/28] drm/xe: Add unbind to " Matthew Brost
@ 2024-08-28  2:48 ` Matthew Brost
  2024-08-28  2:48 ` [RFC PATCH 15/28] drm/xe: Enable system allocator uAPI Matthew Brost
                   ` (17 subsequent siblings)
  31 siblings, 0 replies; 100+ messages in thread
From: Matthew Brost @ 2024-08-28  2:48 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: airlied, christian.koenig, thomas.hellstrom, matthew.auld, daniel

The uAPI is designed around the use case that only mapping a BO to a
malloc'd address will unbind a system allocator VMA. Thus it doesn't
make much sense to allow a system allocator VMA unbind if the GPU has
bindings in the range being unbound. Do not support this, as it
simplifies the code. This can always be revisited if a use case for it
arises.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_svm.c |  5 +++++
 drivers/gpu/drm/xe/xe_svm.h |  1 +
 drivers/gpu/drm/xe/xe_vm.c  | 16 ++++++++++++++++
 3 files changed, 22 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index 7188aa590fa5..2339359a1d91 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -366,3 +366,8 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
 err_out:
 	return err;
 }
+
+bool xe_svm_has_mapping(struct xe_vm *vm, u64 start, u64 end)
+{
+	return drm_gpusvm_has_mapping(&vm->svm.gpusvm, start, end);
+}
diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index 84fd0d8c3380..a4f764bcd835 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -25,6 +25,7 @@ void xe_svm_fini(struct xe_vm *vm);
 int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
 			    struct xe_tile *tile, u64 fault_addr,
 			    bool atomic);
+bool xe_svm_has_mapping(struct xe_vm *vm, u64 start, u64 end);
 
 static inline bool xe_svm_range_pages_valid(struct xe_svm_range *range)
 {
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 6916cdfe4be3..d9bff07ef8d1 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -2352,6 +2352,17 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
 			struct xe_vma *old =
 				gpuva_to_vma(op->base.remap.unmap->va);
 			bool skip = xe_vma_is_system_allocator(old);
+			u64 start = xe_vma_start(old), end = xe_vma_end(old);
+
+			if (op->base.remap.prev)
+				start = op->base.remap.prev->va.addr +
+					op->base.remap.prev->va.range;
+			if (op->base.remap.next)
+				end = op->base.remap.next->va.addr;
+
+			if (xe_vma_is_system_allocator(old) &&
+			    xe_svm_has_mapping(vm, start, end))
+				return -EBUSY;
 
 			op->remap.start = xe_vma_start(old);
 			op->remap.range = xe_vma_size(old);
@@ -2434,6 +2445,11 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
 		{
 			struct xe_vma *vma = gpuva_to_vma(op->base.unmap.va);
 
+			if (xe_vma_is_system_allocator(vma) &&
+			    xe_svm_has_mapping(vm, xe_vma_start(vma),
+					       xe_vma_end(vma)))
+				return -EBUSY;
+
 			if (!xe_vma_is_system_allocator(vma))
 				xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
 			break;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH 15/28] drm/xe: Enable system allocator uAPI
  2024-08-28  2:48 [RFC PATCH 00/28] Introduce GPU SVM and Xe SVM implementation Matthew Brost
                   ` (13 preceding siblings ...)
  2024-08-28  2:48 ` [RFC PATCH 14/28] drm/xe: Do not allow system allocator VMA unbind if the GPU has bindings Matthew Brost
@ 2024-08-28  2:48 ` Matthew Brost
  2024-08-28  2:48 ` [RFC PATCH 16/28] drm/xe: Add migrate layer functions for SVM support Matthew Brost
                   ` (16 subsequent siblings)
  31 siblings, 0 replies; 100+ messages in thread
From: Matthew Brost @ 2024-08-28  2:48 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: airlied, christian.koenig, thomas.hellstrom, matthew.auld, daniel

With support for system allocator bindings in SRAM fully in place, enable
the implementation.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_vm.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index d9bff07ef8d1..f7dc681a8b0e 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -2966,12 +2966,6 @@ static int vm_bind_ioctl_check_args(struct xe_device *xe,
 		u16 pat_index = (*bind_ops)[i].pat_index;
 		u16 coh_mode;
 
-		/* FIXME: Disabling system allocator for now */
-		if (XE_IOCTL_DBG(xe, is_system_allocator)) {
-			err = -EOPNOTSUPP;
-			goto free_bind_ops;
-		}
-
 		if (XE_IOCTL_DBG(xe, pat_index >= xe->pat.n_entries)) {
 			err = -EINVAL;
 			goto free_bind_ops;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH 16/28] drm/xe: Add migrate layer functions for SVM support
  2024-08-28  2:48 [RFC PATCH 00/28] Introduce GPU SVM and Xe SVM implementation Matthew Brost
                   ` (14 preceding siblings ...)
  2024-08-28  2:48 ` [RFC PATCH 15/28] drm/xe: Enable system allocator uAPI Matthew Brost
@ 2024-08-28  2:48 ` Matthew Brost
  2024-08-28  2:48 ` [RFC PATCH 17/28] drm/xe: Add SVM device memory mirroring Matthew Brost
                   ` (15 subsequent siblings)
  31 siblings, 0 replies; 100+ messages in thread
From: Matthew Brost @ 2024-08-28  2:48 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: airlied, christian.koenig, thomas.hellstrom, matthew.auld, daniel

Add functions which migrate to / from VRAM, accepting a single DPA
argument (VRAM) and an array of DMA addresses (SRAM).
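
As a rough usage sketch (caller-side names are illustrative; a single call
copies at most MAX_PREEMPTDISABLE_TRANSFER bytes, so callers chunk larger
ranges):

	/* Copy npages of DMA-mapped system memory to contiguous VRAM at vram_dpa */
	struct dma_fence *fence;

	fence = xe_migrate_to_vram(tile->migrate, npages, dma_addr, vram_dpa);
	if (IS_ERR(fence))
		return PTR_ERR(fence);
	dma_fence_wait(fence, false);
	dma_fence_put(fence);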

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_migrate.c | 150 ++++++++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_migrate.h |  10 +++
 2 files changed, 160 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
index cbf54be224c9..ec033f354e1c 100644
--- a/drivers/gpu/drm/xe/xe_migrate.c
+++ b/drivers/gpu/drm/xe/xe_migrate.c
@@ -1542,6 +1542,156 @@ void xe_migrate_wait(struct xe_migrate *m)
 		dma_fence_wait(m->fence, false);
 }
 
+static u32 pte_update_cmd_size(u64 size)
+{
+	u32 dword;
+	u64 entries = DIV_ROUND_UP(size, XE_PAGE_SIZE);
+
+	XE_WARN_ON(size > MAX_PREEMPTDISABLE_TRANSFER);
+	/*
+	 * The MI_STORE_DATA_IMM command is used to update the page table. Each
+	 * instruction can update at most 0x1ff PTE entries. To update
+	 * n (n <= 0x1ff) PTE entries, we need:
+	 * 1 dword for the MI_STORE_DATA_IMM command header (opcode etc.)
+	 * 2 dwords for the page table's physical location
+	 * 2*n dwords for the PTE values to fill (each PTE entry is 2 dwords)
+	 */
+	dword = (1 + 2) * DIV_ROUND_UP(entries, 0x1ff);
+	dword += entries * 2;
+
+	return dword;
+}
+
+static void build_pt_update_batch_sram(struct xe_migrate *m,
+				       struct xe_bb *bb, u32 pt_offset,
+				       dma_addr_t *sram_addr, u32 size)
+{
+	u16 pat_index = tile_to_xe(m->tile)->pat.idx[XE_CACHE_WB];
+	u32 ptes;
+	int i = 0;
+
+	ptes = DIV_ROUND_UP(size, XE_PAGE_SIZE);
+	while (ptes) {
+		u32 chunk = min(0x1ffU, ptes);
+
+		bb->cs[bb->len++] = MI_STORE_DATA_IMM | MI_SDI_NUM_QW(chunk);
+		bb->cs[bb->len++] = pt_offset;
+		bb->cs[bb->len++] = 0;
+
+		pt_offset += chunk * 8;
+		ptes -= chunk;
+
+		while (chunk--) {
+			u64 addr = sram_addr[i++] & PAGE_MASK;
+
+			xe_tile_assert(m->tile, addr);
+			addr = m->q->vm->pt_ops->pte_encode_addr(m->tile->xe,
+								 addr, pat_index,
+								 0, false, 0);
+			bb->cs[bb->len++] = lower_32_bits(addr);
+			bb->cs[bb->len++] = upper_32_bits(addr);
+		}
+	}
+}
+
+enum xe_migrate_copy_dir {
+	XE_MIGRATE_COPY_TO_VRAM,
+	XE_MIGRATE_COPY_TO_SRAM,
+};
+
+static struct dma_fence *xe_migrate_vram(struct xe_migrate *m,
+					 unsigned long npages,
+					 dma_addr_t *sram_addr, u64 vram_addr,
+					 const enum xe_migrate_copy_dir dir)
+{
+	struct xe_gt *gt = m->tile->primary_gt;
+	struct xe_device *xe = gt_to_xe(gt);
+	struct dma_fence *fence = NULL;
+	u32 batch_size = 2;
+	u64 src_L0_ofs, dst_L0_ofs;
+	u64 round_update_size;
+	struct xe_sched_job *job;
+	struct xe_bb *bb;
+	u32 update_idx, pt_slot = 0;
+	int err;
+
+	round_update_size = min_t(u64, npages * PAGE_SIZE,
+				  MAX_PREEMPTDISABLE_TRANSFER);
+	batch_size += pte_update_cmd_size(round_update_size);
+	batch_size += EMIT_COPY_DW;
+
+	bb = xe_bb_new(gt, batch_size, true);
+	if (IS_ERR(bb)) {
+		err = PTR_ERR(bb);
+		return ERR_PTR(err);
+	}
+
+	build_pt_update_batch_sram(m, bb, pt_slot * XE_PAGE_SIZE,
+				   sram_addr, round_update_size);
+
+	if (dir == XE_MIGRATE_COPY_TO_VRAM) {
+		src_L0_ofs = xe_migrate_vm_addr(pt_slot, 0);
+		dst_L0_ofs = xe_migrate_vram_ofs(xe, vram_addr, false);
+
+	} else {
+		src_L0_ofs = xe_migrate_vram_ofs(xe, vram_addr, false);
+		dst_L0_ofs = xe_migrate_vm_addr(pt_slot, 0);
+	}
+
+	bb->cs[bb->len++] = MI_BATCH_BUFFER_END;
+	update_idx = bb->len;
+
+	emit_copy(gt, bb, src_L0_ofs, dst_L0_ofs, round_update_size,
+		  XE_PAGE_SIZE);
+
+	job = xe_bb_create_migration_job(m->q, bb,
+					 xe_migrate_batch_base(m, true),
+					 update_idx);
+	if (IS_ERR(job)) {
+		err = PTR_ERR(job);
+		goto err;
+	}
+
+	xe_sched_job_add_migrate_flush(job, 0);
+
+	mutex_lock(&m->job_mutex);
+	xe_sched_job_arm(job);
+	fence = dma_fence_get(&job->drm.s_fence->finished);
+	xe_sched_job_push(job);
+
+	dma_fence_put(m->fence);
+	m->fence = dma_fence_get(fence);
+	mutex_unlock(&m->job_mutex);
+
+	xe_bb_free(bb, fence);
+
+	return fence;
+
+err:
+	mutex_unlock(&m->job_mutex);
+	xe_bb_free(bb, NULL);
+
+	return ERR_PTR(err);
+}
+
+struct dma_fence *xe_migrate_to_vram(struct xe_migrate *m,
+				     unsigned long npages,
+				     dma_addr_t *src_addr,
+				     u64 dst_addr)
+{
+	return xe_migrate_vram(m, npages, src_addr, dst_addr,
+			       XE_MIGRATE_COPY_TO_VRAM);
+}
+
+struct dma_fence *xe_migrate_from_vram(struct xe_migrate *m,
+				       unsigned long npages,
+				       u64 src_addr,
+				       dma_addr_t *dst_addr)
+{
+	return xe_migrate_vram(m, npages, dst_addr, src_addr,
+			       XE_MIGRATE_COPY_TO_SRAM);
+}
+
 #if IS_ENABLED(CONFIG_DRM_XE_KUNIT_TEST)
 #include "tests/xe_migrate.c"
 #endif
diff --git a/drivers/gpu/drm/xe/xe_migrate.h b/drivers/gpu/drm/xe/xe_migrate.h
index 0109866e398a..6ff9a963425c 100644
--- a/drivers/gpu/drm/xe/xe_migrate.h
+++ b/drivers/gpu/drm/xe/xe_migrate.h
@@ -95,6 +95,16 @@ struct xe_migrate_pt_update {
 
 struct xe_migrate *xe_migrate_init(struct xe_tile *tile);
 
+struct dma_fence *xe_migrate_to_vram(struct xe_migrate *m,
+				     unsigned long npages,
+				     dma_addr_t *src_addr,
+				     u64 dst_addr);
+
+struct dma_fence *xe_migrate_from_vram(struct xe_migrate *m,
+				       unsigned long npages,
+				       u64 src_addr,
+				       dma_addr_t *dst_addr);
+
 struct dma_fence *xe_migrate_copy(struct xe_migrate *m,
 				  struct xe_bo *src_bo,
 				  struct xe_bo *dst_bo,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH 17/28] drm/xe: Add SVM device memory mirroring
  2024-08-28  2:48 [RFC PATCH 00/28] Introduce GPU SVM and Xe SVM implementation Matthew Brost
                   ` (15 preceding siblings ...)
  2024-08-28  2:48 ` [RFC PATCH 16/28] drm/xe: Add migrate layer functions for SVM support Matthew Brost
@ 2024-08-28  2:48 ` Matthew Brost
  2024-08-28  2:48 ` [RFC PATCH 18/28] drm/xe: Add GPUSVM copy SRAM / VRAM vfunc functions Matthew Brost
                   ` (14 subsequent siblings)
  31 siblings, 0 replies; 100+ messages in thread
From: Matthew Brost @ 2024-08-28  2:48 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: airlied, christian.koenig, thomas.hellstrom, matthew.auld, daniel

Add SVM device memory mirroring, which provides struct pages for device
memory, enabling migration.

TODO: Hide this behind Kconfig
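
Once remapped, a device-private struct page's host PFN can be translated
back to a device physical address using the offsets recorded here; roughly
(mirroring a helper added later in this series):

	/* hpa_base: start of the remapped host physical range for this region */
	/* dpa_base: start of this region's VRAM in the device physical space  */
	u64 offset = (page_to_pfn(page) << PAGE_SHIFT) - mr->hpa_base;
	u64 dpa = mr->dpa_base + offset;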

Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_device_types.h |  8 ++++
 drivers/gpu/drm/xe/xe_svm.c          | 56 +++++++++++++++++++++++++++-
 drivers/gpu/drm/xe/xe_svm.h          |  3 ++
 drivers/gpu/drm/xe/xe_tile.c         |  5 +++
 4 files changed, 70 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index 4ecd620921a3..b4367efae55b 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -105,6 +105,14 @@ struct xe_mem_region {
 	resource_size_t actual_physical_size;
 	/** @mapping: pointer to VRAM mappable space */
 	void __iomem *mapping;
+	/** @pagemap: Used to remap device memory as ZONE_DEVICE */
+	struct dev_pagemap pagemap;
+	/**
+	 * @hpa_base: base host physical address
+	 *
+	 * This is generated when remap device memory as ZONE_DEVICE
+	 */
+	resource_size_t hpa_base;
 };
 
 /**
diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index 2339359a1d91..258a94e83e57 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -21,6 +21,11 @@ static struct xe_vm *range_to_vm(struct drm_gpusvm_range *r)
 	return gpusvm_to_vm(r->gpusvm);
 }
 
+static void *xe_svm_devm_owner(struct xe_device *xe)
+{
+	return xe;
+}
+
 static struct drm_gpusvm_range *
 xe_svm_range_alloc(struct drm_gpusvm *gpusvm)
 {
@@ -285,8 +290,9 @@ int xe_svm_init(struct xe_vm *vm)
 		  xe_svm_garbage_collector_work_func);
 
 	return drm_gpusvm_init(&vm->svm.gpusvm, "Xe SVM", &vm->xe->drm,
-			       current->mm, NULL, 0, vm->size,
-			       SZ_512M, &gpusvm_ops, fault_chunk_sizes,
+			       current->mm, xe_svm_devm_owner(vm->xe), 0,
+			       vm->size, SZ_512M, &gpusvm_ops,
+			       fault_chunk_sizes,
 			       ARRAY_SIZE(fault_chunk_sizes));
 }
 
@@ -371,3 +377,49 @@ bool xe_svm_has_mapping(struct xe_vm *vm, u64 start, u64 end)
 {
 	return drm_gpusvm_has_mapping(&vm->svm.gpusvm, start, end);
 }
+
+/**
+ * xe_devm_add: Remap and provide memmap backing for device memory
+ * @tile: tile that the memory region belongs to
+ * @mr: memory region to remap
+ *
+ * This remaps device memory to the host physical address space and creates
+ * struct pages to back the device memory.
+ *
+ * Return: 0 on success, standard error code otherwise
+ */
+int xe_devm_add(struct xe_tile *tile, struct xe_mem_region *mr)
+{
+	struct xe_device *xe = tile_to_xe(tile);
+	struct device *dev = &to_pci_dev(xe->drm.dev)->dev;
+	struct resource *res;
+	void *addr;
+	int ret;
+
+	res = devm_request_free_mem_region(dev, &iomem_resource,
+					   mr->usable_size);
+	if (IS_ERR(res)) {
+		ret = PTR_ERR(res);
+		return ret;
+	}
+
+	mr->pagemap.type = MEMORY_DEVICE_PRIVATE;
+	mr->pagemap.range.start = res->start;
+	mr->pagemap.range.end = res->end;
+	mr->pagemap.nr_range = 1;
+	mr->pagemap.ops = drm_gpusvm_pagemap_ops_get();
+	mr->pagemap.owner = xe_svm_devm_owner(xe);
+	addr = devm_memremap_pages(dev, &mr->pagemap);
+	if (IS_ERR(addr)) {
+		devm_release_mem_region(dev, res->start, resource_size(res));
+		ret = PTR_ERR(addr);
+		drm_err(&xe->drm, "Failed to remap tile %d memory, errno %d\n",
+				tile->id, ret);
+		return ret;
+	}
+	mr->hpa_base = res->start;
+
+	drm_info(&xe->drm, "Added tile %d memory [%llx-%llx] to devm, remapped to %pr\n",
+		 tile->id, mr->io_start, mr->io_start + mr->usable_size, res);
+	return 0;
+}
diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index a4f764bcd835..f15df5c813f1 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -8,6 +8,7 @@
 
 #include "drm_gpusvm.h"
 
+struct xe_mem_region;
 struct xe_tile;
 struct xe_vm;
 struct xe_vma;
@@ -19,6 +20,8 @@ struct xe_svm_range {
 	u8 tile_invalidated;
 };
 
+int xe_devm_add(struct xe_tile *tile, struct xe_mem_region *mr);
+
 int xe_svm_init(struct xe_vm *vm);
 void xe_svm_fini(struct xe_vm *vm);
 
diff --git a/drivers/gpu/drm/xe/xe_tile.c b/drivers/gpu/drm/xe/xe_tile.c
index 15ea0a942f67..1c1b3d406f1e 100644
--- a/drivers/gpu/drm/xe/xe_tile.c
+++ b/drivers/gpu/drm/xe/xe_tile.c
@@ -10,6 +10,7 @@
 #include "xe_gt.h"
 #include "xe_migrate.h"
 #include "xe_sa.h"
+#include "xe_svm.h"
 #include "xe_tile.h"
 #include "xe_tile_sysfs.h"
 #include "xe_ttm_vram_mgr.h"
@@ -158,6 +159,7 @@ static int tile_ttm_mgr_init(struct xe_tile *tile)
  */
 int xe_tile_init_noalloc(struct xe_tile *tile)
 {
+	struct xe_device *xe = tile_to_xe(tile);
 	int err;
 
 	err = tile_ttm_mgr_init(tile);
@@ -170,6 +172,9 @@ int xe_tile_init_noalloc(struct xe_tile *tile)
 
 	xe_wa_apply_tile_workarounds(tile);
 
+	if (xe->info.has_usm && IS_DGFX(xe))
+		xe_devm_add(tile, &tile->mem.vram);
+
 	err = xe_tile_sysfs_init(tile);
 
 	return 0;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH 18/28] drm/xe: Add GPUSVM copy SRAM / VRAM vfunc functions
  2024-08-28  2:48 [RFC PATCH 00/28] Introduce GPU SVM and Xe SVM implementation Matthew Brost
                   ` (16 preceding siblings ...)
  2024-08-28  2:48 ` [RFC PATCH 17/28] drm/xe: Add SVM device memory mirroring Matthew Brost
@ 2024-08-28  2:48 ` Matthew Brost
  2024-08-28  2:48 ` [RFC PATCH 19/28] drm/xe: Update PT layer to understand ranges in VRAM Matthew Brost
                   ` (13 subsequent siblings)
  31 siblings, 0 replies; 100+ messages in thread
From: Matthew Brost @ 2024-08-28  2:48 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: airlied, christian.koenig, thomas.hellstrom, matthew.auld, daniel

Add GPUSVM SRAM / VRAM copy vfuncs and connect them to the migration
layer.
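
The copy function walks the dma_addr / page arrays and coalesces runs of
physically contiguous VRAM pages into single blitter copies of up to 2M.
A rough illustration with made-up addresses (v and w are two separate VRAM
runs, 0 marks an entry that is skipped):

	/* dma_addr[]:  s0    s1    s2    0    s4    s5   */
	/* VRAM DPA:    v     v+4K  v+8K  -    w     w+4K */
	/* issued ops:  copy 3 pages (dma_addr + 0 <-> v),
	 *              copy 2 pages (dma_addr + 4 <-> w) */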

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

Me: Fix vram_addr == 0 case
---
 drivers/gpu/drm/xe/xe_svm.c | 153 ++++++++++++++++++++++++++++++++++++
 1 file changed, 153 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index 258a94e83e57..6c690ba827e7 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -6,6 +6,7 @@
 #include "drm_gpusvm.h"
 
 #include "xe_gt_tlb_invalidation.h"
+#include "xe_migrate.h"
 #include "xe_pt.h"
 #include "xe_svm.h"
 #include "xe_vm.h"
@@ -270,9 +271,161 @@ static void xe_svm_garbage_collector_work_func(struct work_struct *w)
 	up_write(&vm->lock);
 }
 
+static struct xe_mem_region *page_to_mr(struct page *page)
+{
+	return container_of(page->pgmap, struct xe_mem_region, pagemap);
+}
+
+static struct xe_tile *mr_to_tile(struct xe_mem_region *mr)
+{
+	return container_of(mr, struct xe_tile, mem.vram);
+}
+
+static u64 xe_mem_region_page_to_dpa(struct xe_mem_region *mr,
+				     struct page *page)
+{
+	u64 dpa;
+	struct xe_tile *tile = mr_to_tile(mr);
+	u64 pfn = page_to_pfn(page);
+	u64 offset;
+
+	xe_tile_assert(tile, is_device_private_page(page));
+	xe_tile_assert(tile, (pfn << PAGE_SHIFT) >= mr->hpa_base);
+
+	offset = (pfn << PAGE_SHIFT) - mr->hpa_base;
+	dpa = mr->dpa_base + offset;
+
+	return dpa;
+}
+
+enum xe_svm_copy_dir {
+	XE_SVM_COPY_TO_VRAM,
+	XE_SVM_COPY_TO_SRAM,
+};
+
+static int xe_svm_copy(struct drm_gpusvm *gpusvm, struct page **pages,
+		       dma_addr_t *dma_addr, unsigned long npages,
+		       const enum xe_svm_copy_dir dir)
+{
+	struct xe_vm *vm = gpusvm_to_vm(gpusvm);
+	struct xe_mem_region *mr = NULL;
+	struct xe_tile *tile;
+	struct dma_fence *fence = NULL;
+	unsigned long i;
+#define VRAM_ADDR_INVALID	~0x0ull
+	u64 vram_addr = VRAM_ADDR_INVALID;
+	int err = 0, pos = 0;
+	bool sram = dir == XE_SVM_COPY_TO_SRAM;
+
+	for (i = 0; i < npages; ++i) {
+		struct page *spage = pages[i];
+		struct dma_fence *__fence;
+		u64 __vram_addr;
+		bool match = false, chunk, last;
+
+		chunk = (i - pos) == (SZ_2M / PAGE_SIZE);
+		last = (i + 1) == npages;
+
+		if (!dma_addr[i] && vram_addr == VRAM_ADDR_INVALID)
+			continue;
+
+		if (!mr) {
+			mr = page_to_mr(spage);
+			tile = mr_to_tile(mr);
+		}
+
+		if (dma_addr[i]) {
+			__vram_addr = xe_mem_region_page_to_dpa(mr, spage);
+			if (vram_addr == VRAM_ADDR_INVALID) {
+				vram_addr = __vram_addr;
+				pos = i;
+			}
+
+			xe_assert(vm->xe, __vram_addr != VRAM_ADDR_INVALID);
+			xe_assert(vm->xe, vram_addr != VRAM_ADDR_INVALID);
+
+			match = vram_addr + PAGE_SIZE * (i - pos) == __vram_addr;
+		}
+
+		if (!match || chunk || last) {
+			int incr = (match && last) ? 1 : 0;
+
+			if (vram_addr != VRAM_ADDR_INVALID) {
+				if (sram)
+					__fence = xe_migrate_from_vram(tile->migrate,
+								       i - pos + incr,
+								       vram_addr,
+								       dma_addr + pos);
+				else
+					__fence = xe_migrate_to_vram(tile->migrate,
+								     i - pos + incr,
+								     dma_addr + pos,
+								     vram_addr);
+				if (IS_ERR(__fence)) {
+					err = PTR_ERR(__fence);
+					goto err_out;
+				}
+
+				dma_fence_put(fence);
+				fence = __fence;
+			}
+
+			if (dma_addr[i]) {
+				vram_addr = __vram_addr;
+				pos = i;
+			} else {
+				vram_addr = VRAM_ADDR_INVALID;
+			}
+
+			if (!match && last && dma_addr[i]) {
+				if (sram)
+					__fence = xe_migrate_from_vram(tile->migrate, 1,
+								       vram_addr,
+								       dma_addr + pos);
+				else
+					__fence = xe_migrate_to_vram(tile->migrate, 1,
+								     dma_addr + pos,
+								     vram_addr);
+				if (IS_ERR(__fence)) {
+					err = PTR_ERR(__fence);
+					goto err_out;
+				}
+
+				dma_fence_put(fence);
+				fence = __fence;
+			}
+		}
+	}
+
+err_out:
+	if (fence) {
+		dma_fence_wait(fence, false);
+		dma_fence_put(fence);
+	}
+
+	return err;
+#undef VRAM_ADDR_INVALID
+}
+
+static int xe_svm_copy_to_vram(struct drm_gpusvm *gpusvm, struct page **pages,
+			       dma_addr_t *dma_addr, unsigned long npages)
+{
+	return xe_svm_copy(gpusvm, pages, dma_addr, npages,
+			   XE_SVM_COPY_TO_VRAM);
+}
+
+static int xe_svm_copy_to_sram(struct drm_gpusvm *gpusvm, struct page **pages,
+			       dma_addr_t *dma_addr, unsigned long npages)
+{
+	return xe_svm_copy(gpusvm, pages, dma_addr, npages,
+			   XE_SVM_COPY_TO_SRAM);
+}
+
 static const struct drm_gpusvm_ops gpusvm_ops = {
 	.range_alloc = xe_svm_range_alloc,
 	.range_free = xe_svm_range_free,
+	.copy_to_vram = xe_svm_copy_to_vram,
+	.copy_to_sram = xe_svm_copy_to_sram,
 	.invalidate = xe_svm_invalidate,
 };
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH 19/28] drm/xe: Update PT layer to understand ranges in VRAM
  2024-08-28  2:48 [RFC PATCH 00/28] Introduce GPU SVM and Xe SVM implementation Matthew Brost
                   ` (17 preceding siblings ...)
  2024-08-28  2:48 ` [RFC PATCH 18/28] drm/xe: Add GPUSVM copy SRAM / VRAM vfunc functions Matthew Brost
@ 2024-08-28  2:48 ` Matthew Brost
  2024-08-28  2:48 ` [RFC PATCH 20/28] drm/xe: Add Xe SVM populate_vram_pfn vfunc Matthew Brost
                   ` (12 subsequent siblings)
  31 siblings, 0 replies; 100+ messages in thread
From: Matthew Brost @ 2024-08-28  2:48 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: airlied, christian.koenig, thomas.hellstrom, matthew.auld, daniel

Kind of cheating here by using the BO directly rather than VRAM pages.
This is equivalent at the moment as mixed mappings are not supported. If
that changes, the array of pages / DMA addresses will need a cursor.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_pt.c  | 22 ++++++++++++++++------
 drivers/gpu/drm/xe/xe_svm.h | 10 ++++++++++
 2 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
index fc86adf9f0a6..e9195029ea60 100644
--- a/drivers/gpu/drm/xe/xe_pt.c
+++ b/drivers/gpu/drm/xe/xe_pt.c
@@ -607,9 +607,12 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
 		 struct xe_vm_pgtable_update *entries, u32 *num_entries)
 {
 	struct xe_device *xe = tile_to_xe(tile);
-	struct xe_bo *bo = xe_vma_bo(vma);
-	bool is_devmem = !xe_vma_is_userptr(vma) && bo &&
-		(xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo));
+	bool range_devmem = range && xe_svm_range_in_vram(range);
+	struct xe_bo *bo = range_devmem ? range->base.vram_allocation :
+		xe_vma_bo(vma);
+	bool is_devmem = range_devmem ||
+		(!xe_vma_is_userptr(vma) && bo &&
+		(xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo)));
 	struct xe_res_cursor curs;
 	struct xe_pt_stage_bind_walk xe_walk = {
 		.base = {
@@ -675,9 +678,16 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
 	xe_bo_assert_held(bo);
 
 	if (range) {
-		xe_res_first_dma(range->base.dma_addr, 0,
-				 range->base.va.end - range->base.va.start,
-				 range->base.order, &curs);
+		if (is_devmem)
+			xe_res_first(bo->ttm.resource, 0,
+				     range->base.va.end - range->base.va.start,
+				     &curs);
+		else if (xe_svm_range_has_dma_mapping(range))
+			xe_res_first_dma(range->base.dma_addr, 0,
+					 range->base.va.end - range->base.va.start,
+					 range->base.order, &curs);
+		else
+			return -EAGAIN;	/* Invalidation corner case */
 	} else if (!xe_vma_is_null(vma)) {
 		if (xe_vma_is_userptr(vma))
 			xe_res_first_sg(to_userptr_vma(vma)->userptr.sg, 0,
diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index f15df5c813f1..8b72e91cc37d 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -35,6 +35,16 @@ static inline bool xe_svm_range_pages_valid(struct xe_svm_range *range)
 	return drm_gpusvm_range_pages_valid(range->base.gpusvm, &range->base);
 }
 
+static inline bool xe_svm_range_in_vram(struct xe_svm_range *range)
+{
+	return range->base.flags.has_vram_pages;
+}
+
+static inline bool xe_svm_range_has_dma_mapping(struct xe_svm_range *range)
+{
+	return range->base.flags.has_dma_mapping;
+}
+
 #define xe_svm_notifier_lock(vm__)	\
 	drm_gpusvm_notifier_lock(&(vm__)->svm.gpusvm)
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH 20/28] drm/xe: Add Xe SVM populate_vram_pfn vfunc
  2024-08-28  2:48 [RFC PATCH 00/28] Introduce GPU SVM and Xe SVM implementation Matthew Brost
                   ` (18 preceding siblings ...)
  2024-08-28  2:48 ` [RFC PATCH 19/28] drm/xe: Update PT layer to understand ranges in VRAM Matthew Brost
@ 2024-08-28  2:48 ` Matthew Brost
  2024-08-28  2:48 ` [RFC PATCH 21/28] drm/xe: Add Xe SVM vram_release vfunc Matthew Brost
                   ` (11 subsequent siblings)
  31 siblings, 0 replies; 100+ messages in thread
From: Matthew Brost @ 2024-08-28  2:48 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: airlied, christian.koenig, thomas.hellstrom, matthew.auld, daniel

Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_svm.c | 37 +++++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index 6c690ba827e7..82cb5a260c87 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -9,6 +9,7 @@
 #include "xe_migrate.h"
 #include "xe_pt.h"
 #include "xe_svm.h"
+#include "xe_ttm_vram_mgr.h"
 #include "xe_vm.h"
 #include "xe_vm_types.h"
 
@@ -421,9 +422,45 @@ static int xe_svm_copy_to_sram(struct drm_gpusvm *gpusvm, struct page **pages,
 			   XE_SVM_COPY_TO_SRAM);
 }
 
+static u64 block_offset_to_pfn(struct xe_mem_region *mr, u64 offset)
+{
+	return PHYS_PFN(offset + mr->hpa_base);
+}
+
+static struct drm_buddy *tile_to_buddy(struct xe_tile *tile)
+{
+	return &tile->mem.vram_mgr->mm;
+}
+
+static int xe_svm_populate_vram_pfn(struct drm_gpusvm *gpusvm,
+				    void *vram_allocation,
+				    unsigned long npages,
+				    unsigned long *pfn)
+{
+	struct xe_bo *bo = vram_allocation;
+	struct ttm_resource *res = bo->ttm.resource;
+	struct list_head *blocks = &to_xe_ttm_vram_mgr_resource(res)->blocks;
+	struct drm_buddy_block *block;
+	int j = 0;
+
+	list_for_each_entry(block, blocks, link) {
+		struct xe_mem_region *mr = block->private;
+		struct xe_tile *tile = mr_to_tile(mr);
+		struct drm_buddy *buddy = tile_to_buddy(tile);
+		u64 block_pfn = block_offset_to_pfn(mr, drm_buddy_block_offset(block));
+		int i;
+
+		for (i = 0; i < drm_buddy_block_size(buddy, block) >> PAGE_SHIFT; ++i)
+			pfn[j++] = block_pfn + i;
+	}
+
+	return 0;
+}
+
 static const struct drm_gpusvm_ops gpusvm_ops = {
 	.range_alloc = xe_svm_range_alloc,
 	.range_free = xe_svm_range_free,
+	.populate_vram_pfn = xe_svm_populate_vram_pfn,
 	.copy_to_vram = xe_svm_copy_to_vram,
 	.copy_to_sram = xe_svm_copy_to_sram,
 	.invalidate = xe_svm_invalidate,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH 21/28] drm/xe: Add Xe SVM vram_release vfunc
  2024-08-28  2:48 [RFC PATCH 00/28] Introduce GPU SVM and Xe SVM implementation Matthew Brost
                   ` (19 preceding siblings ...)
  2024-08-28  2:48 ` [RFC PATCH 20/28] drm/xe: Add Xe SVM populate_vram_pfn vfunc Matthew Brost
@ 2024-08-28  2:48 ` Matthew Brost
  2024-08-28  2:48 ` [RFC PATCH 22/28] drm/xe: Add BO flags required for SVM Matthew Brost
                   ` (10 subsequent siblings)
  31 siblings, 0 replies; 100+ messages in thread
From: Matthew Brost @ 2024-08-28  2:48 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: airlied, christian.koenig, thomas.hellstrom, matthew.auld, daniel

Implement with a simple BO put.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_svm.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index 82cb5a260c87..4372c02a341f 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -5,6 +5,7 @@
 
 #include "drm_gpusvm.h"
 
+#include "xe_bo.h"
 #include "xe_gt_tlb_invalidation.h"
 #include "xe_migrate.h"
 #include "xe_pt.h"
@@ -422,6 +423,11 @@ static int xe_svm_copy_to_sram(struct drm_gpusvm *gpusvm, struct page **pages,
 			   XE_SVM_COPY_TO_SRAM);
 }
 
+static void xe_svm_vram_release(void *vram_allocation)
+{
+	xe_bo_put(vram_allocation);
+}
+
 static u64 block_offset_to_pfn(struct xe_mem_region *mr, u64 offset)
 {
 	return PHYS_PFN(offset + mr->hpa_base);
@@ -460,6 +466,7 @@ static int xe_svm_populate_vram_pfn(struct drm_gpusvm *gpusvm,
 static const struct drm_gpusvm_ops gpusvm_ops = {
 	.range_alloc = xe_svm_range_alloc,
 	.range_free = xe_svm_range_free,
+	.vram_release = xe_svm_vram_release,
 	.populate_vram_pfn = xe_svm_populate_vram_pfn,
 	.copy_to_vram = xe_svm_copy_to_vram,
 	.copy_to_sram = xe_svm_copy_to_sram,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH 22/28] drm/xe: Add BO flags required for SVM
  2024-08-28  2:48 [RFC PATCH 00/28] Introduce GPU SVM and Xe SVM implementation Matthew Brost
                   ` (20 preceding siblings ...)
  2024-08-28  2:48 ` [RFC PATCH 21/28] drm/xe: Add Xe SVM vram_release vfunc Matthew Brost
@ 2024-08-28  2:48 ` Matthew Brost
  2024-08-28  2:48 ` [RFC PATCH 23/28] drm/xe: Add SVM VRAM migration Matthew Brost
                   ` (9 subsequent siblings)
  31 siblings, 0 replies; 100+ messages in thread
From: Matthew Brost @ 2024-08-28  2:48 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: airlied, christian.koenig, thomas.hellstrom, matthew.auld, daniel

Add XE_BO_FLAG_SYSTEM_ALLOC to indicate a BO is tied to an SVM range.

Add XE_BO_FLAG_SKIP_CLEAR to indicate a BO does not need to be cleared.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_bo.c | 11 +++++++----
 drivers/gpu/drm/xe/xe_bo.h |  2 ++
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
index b6c6a4a3b4d4..ad804b6f9e84 100644
--- a/drivers/gpu/drm/xe/xe_bo.c
+++ b/drivers/gpu/drm/xe/xe_bo.c
@@ -704,9 +704,10 @@ static int xe_bo_move(struct ttm_buffer_object *ttm_bo, bool evict,
 						(!mem_type_is_vram(old_mem_type) && !tt_has_data);
 
 	clear_system_pages = ttm && (ttm->page_flags & TTM_TT_FLAG_CLEARED_ON_FREE);
-	needs_clear = (ttm && ttm->page_flags & TTM_TT_FLAG_ZERO_ALLOC) ||
+	needs_clear = !(bo->flags & XE_BO_FLAG_SKIP_CLEAR) &&
+		((ttm && ttm->page_flags & TTM_TT_FLAG_ZERO_ALLOC) ||
 		(!ttm && ttm_bo->type == ttm_bo_type_device) ||
-		clear_system_pages;
+		clear_system_pages);
 
 	if (new_mem->mem_type == XE_PL_TT) {
 		ret = xe_tt_map_sg(ttm);
@@ -1284,7 +1285,8 @@ struct xe_bo *___xe_bo_create_locked(struct xe_device *xe, struct xe_bo *bo,
 	int err;
 
 	/* Only kernel objects should set GT */
-	xe_assert(xe, !tile || type == ttm_bo_type_kernel);
+	xe_assert(xe, !tile || type == ttm_bo_type_kernel ||
+		  flags & XE_BO_FLAG_SYSTEM_ALLOC);
 
 	if (XE_WARN_ON(!size)) {
 		xe_bo_free(bo);
@@ -2292,7 +2294,8 @@ bool xe_bo_needs_ccs_pages(struct xe_bo *bo)
 	 * can't be used since there's no CCS storage associated with
 	 * non-VRAM addresses.
 	 */
-	if (IS_DGFX(xe) && (bo->flags & XE_BO_FLAG_SYSTEM))
+	if (IS_DGFX(xe) && ((bo->flags & XE_BO_FLAG_SYSTEM) ||
+	    (bo->flags & XE_BO_FLAG_SYSTEM_ALLOC)))
 		return false;
 
 	return true;
diff --git a/drivers/gpu/drm/xe/xe_bo.h b/drivers/gpu/drm/xe/xe_bo.h
index dbfb3209615d..fe2ce641b256 100644
--- a/drivers/gpu/drm/xe/xe_bo.h
+++ b/drivers/gpu/drm/xe/xe_bo.h
@@ -39,6 +39,8 @@
 #define XE_BO_FLAG_NEEDS_64K		BIT(15)
 #define XE_BO_FLAG_NEEDS_2M		BIT(16)
 #define XE_BO_FLAG_GGTT_INVALIDATE	BIT(17)
+#define XE_BO_FLAG_SYSTEM_ALLOC		BIT(18)
+#define XE_BO_FLAG_SKIP_CLEAR		BIT(19)
 /* this one is trigger internally only */
 #define XE_BO_FLAG_INTERNAL_TEST	BIT(30)
 #define XE_BO_FLAG_INTERNAL_64K		BIT(31)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH 23/28] drm/xe: Add SVM VRAM migration
  2024-08-28  2:48 [RFC PATCH 00/28] Introduce GPU SVM and Xe SVM implementation Matthew Brost
                   ` (21 preceding siblings ...)
  2024-08-28  2:48 ` [RFC PATCH 22/28] drm/xe: Add BO flags required for SVM Matthew Brost
@ 2024-08-28  2:48 ` Matthew Brost
  2024-08-28 16:06   ` Daniel Vetter
  2024-08-28  2:48 ` [RFC PATCH 24/28] drm/xe: Basic SVM BO eviction Matthew Brost
                   ` (8 subsequent siblings)
  31 siblings, 1 reply; 100+ messages in thread
From: Matthew Brost @ 2024-08-28  2:48 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: airlied, christian.koenig, thomas.hellstrom, matthew.auld, daniel

Migration is implemented at range granularity, with the VRAM backing being
a VM-private TTM BO (i.e., it shares the dma-resv with the VM). The lifetime
of the TTM BO is limited to when the SVM range is in VRAM (i.e., when a VRAM
SVM range is migrated to SRAM, the TTM BO is destroyed).

The design choice of using a TTM BO for the VRAM backing store, as opposed
to direct buddy allocation, is as follows:

- DRM buddy allocations are not at page granularity, offering no
  advantage over a BO.
- DRM buddy allocations do not solve locking inversion problems between
  mmap lock and dma-resv locks.
- Unified eviction is required (SVM VRAM and TTM BOs need to be able to
  evict each other).
- For exhaustive eviction [1], SVM VRAM allocations will almost certainly
  require a dma-resv.
- The likely allocation size is 2M, which makes the size of a BO (872 bytes)
  acceptable per allocation (872 / 2M == 0.0004158; see the rough check
  below).
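
(As a rough check of the last point: 872 bytes of struct xe_bo per 2 MiB
allocation is 872 / 2097152 ≈ 0.0004, i.e. roughly 0.04 % metadata
overhead.)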

With this, using a TTM BO for the VRAM backing store seems to be an obvious
choice, as it allows leveraging the TTM eviction code.

The current migration policy is to migrate any SVM range greater than or
equal to 64k, and to do so only once.
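
Roughly, the fault handler path with migration enabled becomes (condensed
from xe_svm_handle_pagefault below; retry and error handling omitted, the
tile mask shown is illustrative):

	r = drm_gpusvm_range_find_or_insert(&vm->svm.gpusvm, fault_addr,
					    xe_vma_start(vma), xe_vma_end(vma),
					    &ctx);
	range = to_xe_range(r);
	if (IS_DGFX(vm->xe) && range->base.flags.migrate_vram &&
	    range->base.va.end - range->base.va.start >= SZ_64K)
		bo = xe_svm_alloc_vram(vm, tile, range, &ctx);	/* migrate to VRAM */
	err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r, &ctx);
	fence = xe_vm_range_rebind(vm, vma, range, BIT(tile->id));	/* GPU (re)bind */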

[1] https://patchwork.freedesktop.org/series/133643/

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_svm.c | 81 ++++++++++++++++++++++++++++++++++++-
 drivers/gpu/drm/xe/xe_svm.h |  1 +
 2 files changed, 81 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index 4372c02a341f..fd8987e0a506 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -217,8 +217,13 @@ static void xe_svm_invalidate(struct drm_gpusvm *gpusvm,
 static int __xe_svm_garbage_collector(struct xe_vm *vm,
 				      struct xe_svm_range *range)
 {
+	struct drm_gpusvm_ctx ctx = {};
 	struct dma_fence *fence;
 
+	/* Evict any pages holding references to vram allocation */
+	if (range->base.flags.partial_unmap && IS_DGFX(vm->xe))
+		drm_gpusvm_migrate_to_sram(&vm->svm.gpusvm, &range->base, &ctx);
+
 	xe_vm_lock(vm, false);
 	fence = xe_vm_range_unbind(vm, range);
 	xe_vm_unlock(vm);
@@ -504,21 +509,77 @@ static bool xe_svm_range_is_valid(struct xe_svm_range *range,
 	return (range->tile_present & ~range->tile_invalidated) & BIT(tile->id);
 }
 
+static struct xe_mem_region *tile_to_mr(struct xe_tile *tile)
+{
+	return &tile->mem.vram;
+}
+
+static struct xe_bo *xe_svm_alloc_vram(struct xe_vm *vm, struct xe_tile *tile,
+				       struct xe_svm_range *range,
+				       const struct drm_gpusvm_ctx *ctx)
+{
+	struct xe_mem_region *mr = tile_to_mr(tile);
+	struct drm_buddy_block *block;
+	struct list_head *blocks;
+	struct xe_bo *bo;
+	ktime_t end = 0;
+	int err;
+
+retry:
+	xe_vm_lock(vm, false);
+	bo = xe_bo_create(tile_to_xe(tile), tile, vm, range->base.va.end -
+			  range->base.va.start, ttm_bo_type_device,
+			  XE_BO_FLAG_VRAM_IF_DGFX(tile) |
+			  XE_BO_FLAG_SYSTEM_ALLOC | XE_BO_FLAG_SKIP_CLEAR);
+	xe_vm_unlock(vm);
+	if (IS_ERR(bo)) {
+		err = PTR_ERR(bo);
+		if (xe_vm_validate_should_retry(NULL, err, &end))
+			goto retry;
+		return bo;
+	}
+
+	blocks = &to_xe_ttm_vram_mgr_resource(bo->ttm.resource)->blocks;
+	list_for_each_entry(block, blocks, link)
+		block->private = mr;
+
+	/*
+	 * Take ref because as soon as drm_gpusvm_migrate_to_vram succeeds the
+	 * creation ref can be dropped upon CPU fault or unmap.
+	 */
+	xe_bo_get(bo);
+
+	err = drm_gpusvm_migrate_to_vram(&vm->svm.gpusvm, &range->base,
+					 bo, ctx);
+	if (err) {
+		xe_bo_put(bo);	/* Local ref */
+		xe_bo_put(bo);	/* Creation ref */
+		return ERR_PTR(err);
+	}
+
+	return bo;
+}
+
 int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
 			    struct xe_tile *tile, u64 fault_addr,
 			    bool atomic)
 {
-	struct drm_gpusvm_ctx ctx = { .read_only = xe_vma_read_only(vma), };
+	struct drm_gpusvm_ctx ctx = { .read_only = xe_vma_read_only(vma),
+		.vram_possible = IS_DGFX(vm->xe), };
 	struct xe_svm_range *range;
 	struct drm_gpusvm_range *r;
 	struct drm_exec exec;
 	struct dma_fence *fence;
+	struct xe_bo *bo = NULL;
 	ktime_t end = 0;
 	int err;
 
 	lockdep_assert_held_write(&vm->lock);
 
 retry:
+	xe_bo_put(bo);
+	bo = NULL;
+
 	/* Always process UNMAPs first so view SVM ranges is current */
 	err = xe_svm_garbage_collector(vm);
 	if (err)
@@ -534,6 +595,22 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
 	if (xe_svm_range_is_valid(range, tile))
 		return 0;
 
+	/* XXX: Add migration policy, for now migrate range once */
+	if (IS_DGFX(vm->xe) && !range->migrated &&
+	    range->base.flags.migrate_vram &&
+	    (range->base.va.end - range->base.va.start) >= SZ_64K) {
+		range->migrated = true;
+
+		bo = xe_svm_alloc_vram(vm, tile, range, &ctx);
+		if (IS_ERR(bo)) {
+			drm_info(&vm->xe->drm,
+				 "VRAM allocation failed, falling back to retrying, asid=%u, errno %ld\n",
+				 vm->usm.asid, PTR_ERR(bo));
+			bo = NULL;
+			goto retry;
+		}
+	}
+
 	err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r, &ctx);
 	if (err == -EFAULT || err == -EPERM)	/* Corner where CPU mappings have change */
 	       goto retry;
@@ -567,6 +644,8 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
 	dma_fence_put(fence);
 
 err_out:
+	xe_bo_put(bo);
+
 	return err;
 }
 
diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index 8b72e91cc37d..3f432483a230 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -18,6 +18,7 @@ struct xe_svm_range {
 	struct list_head garbage_collector_link;
 	u8 tile_present;
 	u8 tile_invalidated;
+	u8 migrated	:1;
 };
 
 int xe_devm_add(struct xe_tile *tile, struct xe_mem_region *mr);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH 24/28] drm/xe: Basic SVM BO eviction
  2024-08-28  2:48 [RFC PATCH 00/28] Introduce GPU SVM and Xe SVM implementation Matthew Brost
                   ` (22 preceding siblings ...)
  2024-08-28  2:48 ` [RFC PATCH 23/28] drm/xe: Add SVM VRAM migration Matthew Brost
@ 2024-08-28  2:48 ` Matthew Brost
  2024-08-29 10:14   ` Daniel Vetter
  2024-08-28  2:48 ` [RFC PATCH 25/28] drm/xe: Add SVM debug Matthew Brost
                   ` (7 subsequent siblings)
  31 siblings, 1 reply; 100+ messages in thread
From: Matthew Brost @ 2024-08-28  2:48 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: airlied, christian.koenig, thomas.hellstrom, matthew.auld, daniel

Wire up xe_bo_move to GPU SVM migration to SRAM, trylocking the mmap
lock.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_bo.c       | 35 +++++++++++++++++++++++++++++++-
 drivers/gpu/drm/xe/xe_bo_types.h |  3 +++
 drivers/gpu/drm/xe/xe_svm.c      |  2 ++
 drivers/gpu/drm/xe/xe_svm.h      | 13 ++++++++++++
 4 files changed, 52 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
index ad804b6f9e84..ae71fcbe5380 100644
--- a/drivers/gpu/drm/xe/xe_bo.c
+++ b/drivers/gpu/drm/xe/xe_bo.c
@@ -25,6 +25,7 @@
 #include "xe_pm.h"
 #include "xe_preempt_fence.h"
 #include "xe_res_cursor.h"
+#include "xe_svm.h"
 #include "xe_trace_bo.h"
 #include "xe_ttm_stolen_mgr.h"
 #include "xe_vm.h"
@@ -250,6 +251,8 @@ int xe_bo_placement_for_flags(struct xe_device *xe, struct xe_bo *bo,
 static void xe_evict_flags(struct ttm_buffer_object *tbo,
 			   struct ttm_placement *placement)
 {
+	struct xe_bo *bo;
+
 	if (!xe_bo_is_xe_bo(tbo)) {
 		/* Don't handle scatter gather BOs */
 		if (tbo->type == ttm_bo_type_sg) {
@@ -261,6 +264,12 @@ static void xe_evict_flags(struct ttm_buffer_object *tbo,
 		return;
 	}
 
+	bo = ttm_to_xe_bo(tbo);
+	if (bo->flags & XE_BO_FLAG_SYSTEM_ALLOC) {
+		*placement = sys_placement;
+		return;
+	}
+
 	/*
 	 * For xe, sg bos that are evicted to system just triggers a
 	 * rebind of the sg list upon subsequent validation to XE_PL_TT.
@@ -758,6 +767,17 @@ static int xe_bo_move(struct ttm_buffer_object *ttm_bo, bool evict,
 		}
 	}
 
+	if (!move_lacks_source && (bo->flags & XE_BO_FLAG_SYSTEM_ALLOC) &&
+	    new_mem->mem_type == XE_PL_SYSTEM) {
+		ret = xe_svm_range_evict(bo->range);
+		if (!ret) {
+			drm_dbg(&xe->drm, "Evict system allocator BO success\n");
+			ttm_bo_move_null(ttm_bo, new_mem);
+		}
+
+		goto out;
+	}
+
 	if (!move_lacks_source &&
 	    ((old_mem_type == XE_PL_SYSTEM && resource_is_vram(new_mem)) ||
 	     (mem_type_is_vram(old_mem_type) &&
@@ -1096,6 +1116,19 @@ static void xe_ttm_bo_delete_mem_notify(struct ttm_buffer_object *ttm_bo)
 	}
 }
 
+static bool xe_bo_eviction_valuable(struct ttm_buffer_object *ttm_bo,
+				    const struct ttm_place *place)
+{
+	struct xe_bo *bo = ttm_to_xe_bo(ttm_bo);
+
+	/* Do not evict SVMs before having a binding */
+	if (bo->flags & XE_BO_FLAG_SYSTEM_ALLOC &&
+	    !xe_svm_range_has_vram_binding(bo->range))
+		return false;
+
+	return ttm_bo_eviction_valuable(ttm_bo, place);
+}
+
 const struct ttm_device_funcs xe_ttm_funcs = {
 	.ttm_tt_create = xe_ttm_tt_create,
 	.ttm_tt_populate = xe_ttm_tt_populate,
@@ -1106,7 +1139,7 @@ const struct ttm_device_funcs xe_ttm_funcs = {
 	.io_mem_reserve = xe_ttm_io_mem_reserve,
 	.io_mem_pfn = xe_ttm_io_mem_pfn,
 	.release_notify = xe_ttm_bo_release_notify,
-	.eviction_valuable = ttm_bo_eviction_valuable,
+	.eviction_valuable = xe_bo_eviction_valuable,
 	.delete_mem_notify = xe_ttm_bo_delete_mem_notify,
 };
 
diff --git a/drivers/gpu/drm/xe/xe_bo_types.h b/drivers/gpu/drm/xe/xe_bo_types.h
index 2ed558ac2264..4523b033417c 100644
--- a/drivers/gpu/drm/xe/xe_bo_types.h
+++ b/drivers/gpu/drm/xe/xe_bo_types.h
@@ -16,6 +16,7 @@
 #include "xe_ggtt_types.h"
 
 struct xe_device;
+struct xe_svm_range;
 struct xe_vm;
 
 #define XE_BO_MAX_PLACEMENTS	3
@@ -47,6 +48,8 @@ struct xe_bo {
 	struct ttm_bo_kmap_obj kmap;
 	/** @pinned_link: link to present / evicted list of pinned BO */
 	struct list_head pinned_link;
+	/** @range: SVM range for BO */
+	struct xe_svm_range *range;
 #ifdef CONFIG_PROC_FS
 	/**
 	 * @client: @xe_drm_client which created the bo
diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index fd8987e0a506..dc9810828c0a 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -531,6 +531,8 @@ static struct xe_bo *xe_svm_alloc_vram(struct xe_vm *vm, struct xe_tile *tile,
 			  range->base.va.start, ttm_bo_type_device,
 			  XE_BO_FLAG_VRAM_IF_DGFX(tile) |
 			  XE_BO_FLAG_SYSTEM_ALLOC | XE_BO_FLAG_SKIP_CLEAR);
+	if (!IS_ERR(bo))
+		bo->range = range;
 	xe_vm_unlock(vm);
 	if (IS_ERR(bo)) {
 		err = PTR_ERR(bo);
diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index 3f432483a230..b9cf0e2500da 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -46,6 +46,19 @@ static inline bool xe_svm_range_has_dma_mapping(struct xe_svm_range *range)
 	return range->base.flags.has_dma_mapping;
 }
 
+static inline bool xe_svm_range_has_vram_binding(struct xe_svm_range *range)
+{
+	return xe_svm_range_in_vram(range) && range->tile_present;
+}
+
+static inline int xe_svm_range_evict(struct xe_svm_range *range)
+{
+	struct drm_gpusvm_ctx ctx = { .trylock_mmap = true, };
+
+	return drm_gpusvm_migrate_to_sram(range->base.gpusvm, &range->base,
+					  &ctx);
+}
+
 #define xe_svm_notifier_lock(vm__)	\
 	drm_gpusvm_notifier_lock(&(vm__)->svm.gpusvm)
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH 25/28] drm/xe: Add SVM debug
  2024-08-28  2:48 [RFC PATCH 00/28] Introduce GPU SVM and Xe SVM implementation Matthew Brost
                   ` (23 preceding siblings ...)
  2024-08-28  2:48 ` [RFC PATCH 24/28] drm/xe: Basic SVM BO eviction Matthew Brost
@ 2024-08-28  2:48 ` Matthew Brost
  2024-08-28  2:48 ` [RFC PATCH 26/28] drm/xe: Add modparam for SVM notifier size Matthew Brost
                   ` (6 subsequent siblings)
  31 siblings, 0 replies; 100+ messages in thread
From: Matthew Brost @ 2024-08-28  2:48 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: airlied, christian.koenig, thomas.hellstrom, matthew.auld, daniel

Add some useful SVM debug logging.
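
The messages are emitted via vm_dbg(), which (assuming it maps to
drm_dbg() and therefore the DRM driver debug category) can be enabled at
runtime with the standard DRM debug knob, e.g.:

	echo 0x2 > /sys/module/drm/parameters/debug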

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_pt.c  | 13 ++++--
 drivers/gpu/drm/xe/xe_svm.c | 93 ++++++++++++++++++++++++++++++++-----
 drivers/gpu/drm/xe/xe_svm.h |  2 +
 3 files changed, 93 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
index e9195029ea60..e31af84ceb32 100644
--- a/drivers/gpu/drm/xe/xe_pt.c
+++ b/drivers/gpu/drm/xe/xe_pt.c
@@ -678,16 +678,20 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
 	xe_bo_assert_held(bo);
 
 	if (range) {
-		if (is_devmem)
+		if (is_devmem) {
+			xe_svm_range_debug(range, "BIND PREPARE - VRAM");
 			xe_res_first(bo->ttm.resource, 0,
 				     range->base.va.end - range->base.va.start,
 				     &curs);
-		else if (xe_svm_range_has_dma_mapping(range))
+		} else if (xe_svm_range_has_dma_mapping(range)) {
+			xe_svm_range_debug(range, "BIND PREPARE - DMA");
 			xe_res_first_dma(range->base.dma_addr, 0,
 					 range->base.va.end - range->base.va.start,
 					 range->base.order, &curs);
-		else
+		} else {
+			xe_svm_range_debug(range, "BIND PREPARE - RETRY");
 			return -EAGAIN;	/* Invalidation corner case */
+		}
 	} else if (!xe_vma_is_null(vma)) {
 		if (xe_vma_is_userptr(vma))
 			xe_res_first_sg(to_userptr_vma(vma)->userptr.sg, 0,
@@ -1387,10 +1391,13 @@ static int xe_pt_svm_pre_commit(struct xe_migrate_pt_update *pt_update)
 		if (op->subop == XE_VMA_SUBOP_UNMAP_RANGE)
 			continue;
 
+		xe_svm_range_debug(range, "PRE-COMMIT");
+
 		xe_assert(vm->xe, xe_vma_is_system_allocator(op->map_range.vma));
 		xe_assert(vm->xe, op->subop == XE_VMA_SUBOP_MAP_RANGE);
 
 		if (!xe_svm_range_pages_valid(range)) {
+			xe_svm_range_debug(range, "PRE-COMMIT - RETRY");
 			xe_svm_notifier_unlock(vm);
 			return -EAGAIN;
 		}
diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index dc9810828c0a..f9c2bffd1783 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -24,6 +24,23 @@ static struct xe_vm *range_to_vm(struct drm_gpusvm_range *r)
 	return gpusvm_to_vm(r->gpusvm);
 }
 
+#define range_debug(r__, operaton__)					\
+	vm_dbg(&range_to_vm(&(r__)->base)->xe->drm,			\
+	       "%s: asid=%u, gpusvm=0x%016llx, vram=%d,%d,%d, seqno=%lu, order=%u, start=0x%014llx, end=0x%014llx, size=%llu",	\
+	       (operaton__), range_to_vm(&(r__)->base)->usm.asid,	\
+	       (u64)(r__)->base.gpusvm,					\
+	       (r__)->base.vram_allocation ? 1 : 0,			\
+	       xe_svm_range_in_vram((r__)) ? 1 : 0,			\
+	       xe_svm_range_has_vram_binding((r__)) ? 1 : 0,		\
+	       (r__)->base.notifier_seq, (r__)->base.order,		\
+	       (r__)->base.va.start, (r__)->base.va.end,		\
+	       (r__)->base.va.end - (r__)->base.va.start)
+
+void xe_svm_range_debug(struct xe_svm_range *range, const char *operation)
+{
+	range_debug(range, operation);
+}
+
 static void *xe_svm_devm_owner(struct xe_device *xe)
 {
 	return xe;
@@ -61,6 +78,8 @@ xe_svm_garbage_collector_add_range(struct xe_vm *vm, struct xe_svm_range *range,
 {
 	struct xe_device *xe = vm->xe;
 
+	range_debug(range, "GARBAGE COLLECTOR ADD");
+
 	drm_gpusvm_range_set_unmapped(&range->base, mmu_range);
 
 	spin_lock(&vm->svm.garbage_collector.lock);
@@ -84,10 +103,14 @@ xe_svm_range_notifier_event_begin(struct xe_vm *vm, struct drm_gpusvm_range *r,
 	u8 tile_mask = 0;
 	u8 id;
 
+	range_debug(range, "NOTIFIER");
+
 	/* Skip if already unmapped or if no binding exist */
 	if (range->base.flags.unmapped || !range->tile_present)
 		return 0;
 
+	range_debug(range, "NOTIFIER - EXECUTE");
+
 	/* Adjust invalidation to range boundaries */
 	if (range->base.va.start < mmu_range->start)
 		*adj_start = range->base.va.start;
@@ -136,6 +159,11 @@ static void xe_svm_invalidate(struct drm_gpusvm *gpusvm,
 	u32 fence_id = 0;
 	long err;
 
+	vm_dbg(&gpusvm_to_vm(gpusvm)->xe->drm,
+	       "INVALIDATE: asid=%u, gpusvm=0x%016llx, seqno=%lu, start=0x%016lx, end=0x%016lx, event=%d",
+	       vm->usm.asid, (u64)gpusvm, notifier->notifier.invalidate_seq,
+	       mmu_range->start, mmu_range->end, mmu_range->event);
+
 	/* Adjust invalidation to notifier boundaries */
 	if (adj_start < notifier->interval.start)
 		adj_start = notifier->interval.start;
@@ -220,9 +248,13 @@ static int __xe_svm_garbage_collector(struct xe_vm *vm,
 	struct drm_gpusvm_ctx ctx = {};
 	struct dma_fence *fence;
 
+	range_debug(range, "GARBAGE COLLECTOR");
+
 	/* Evict any pages holding references to vram allocation */
-	if (range->base.flags.partial_unmap && IS_DGFX(vm->xe))
+	if (range->base.flags.partial_unmap && IS_DGFX(vm->xe)) {
+		range_debug(range, "GARBAGE COLLECTOR - EVICT");
 		drm_gpusvm_migrate_to_sram(&vm->svm.gpusvm, &range->base, &ctx);
+	}
 
 	xe_vm_lock(vm, false);
 	fence = xe_vm_range_unbind(vm, range);
@@ -358,16 +390,25 @@ static int xe_svm_copy(struct drm_gpusvm *gpusvm, struct page **pages,
 			int incr = (match && last) ? 1 : 0;
 
 			if (vram_addr != VRAM_ADDR_INVALID) {
-				if (sram)
+				if (sram) {
+					vm_dbg(&gpusvm_to_vm(gpusvm)->xe->drm,
+					       "COPY TO SRAM - 0x%016llx -> 0x%016llx, NPAGES=%ld, asid=%u, gpusvm=0x%016llx",
+					       vram_addr, dma_addr[pos], i - pos + incr,
+					       vm->usm.asid, (u64)gpusvm);
 					__fence = xe_migrate_from_vram(tile->migrate,
 								       i - pos + incr,
 								       vram_addr,
 								       dma_addr + pos);
-				else
+				} else {
+					vm_dbg(&gpusvm_to_vm(gpusvm)->xe->drm,
+					       "COPY TO VRAM - 0x%016llx -> 0x%016llx, NPAGES=%ld, asid=%u, gpusvm=0x%016llx",
+					       dma_addr[pos], vram_addr, i - pos + incr,
+					       vm->usm.asid, (u64)gpusvm);
 					__fence = xe_migrate_to_vram(tile->migrate,
 								     i - pos + incr,
 								     dma_addr + pos,
 								     vram_addr);
+				}
 				if (IS_ERR(__fence)) {
 					err = PTR_ERR(__fence);
 					goto err_out;
@@ -385,14 +426,23 @@ static int xe_svm_copy(struct drm_gpusvm *gpusvm, struct page **pages,
 			}
 
 			if (!match && last && dma_addr[i]) {
-				if (sram)
+				if (sram) {
+					vm_dbg(&gpusvm_to_vm(gpusvm)->xe->drm,
+					       "COPY TO SRAM - 0x%016llx -> 0x%016llx, NPAGES=%d, asid=%u, gpusvm=0x%016llx",
+					       vram_addr, dma_addr[pos], 1,
+					       vm->usm.asid, (u64)gpusvm);
 					__fence = xe_migrate_from_vram(tile->migrate, 1,
 								       vram_addr,
 								       dma_addr + pos);
-				else
+				} else {
+					vm_dbg(&gpusvm_to_vm(gpusvm)->xe->drm,
+					       "COPY TO VRAM - 0x%016llx -> 0x%016llx, NPAGES=%d, asid=%u, gpusvm=0x%016llx",
+					       dma_addr[pos], vram_addr, 1,
+					       vm->usm.asid, (u64)gpusvm);
 					__fence = xe_migrate_to_vram(tile->migrate, 1,
 								     dma_addr + pos,
 								     vram_addr);
+				}
 				if (IS_ERR(__fence)) {
 					err = PTR_ERR(__fence);
 					goto err_out;
@@ -519,12 +569,14 @@ static struct xe_bo *xe_svm_alloc_vram(struct xe_vm *vm, struct xe_tile *tile,
 				       const struct drm_gpusvm_ctx *ctx)
 {
 	struct xe_mem_region *mr = tile_to_mr(tile);
+	struct drm_buddy *buddy = tile_to_buddy(tile);
 	struct drm_buddy_block *block;
 	struct list_head *blocks;
 	struct xe_bo *bo;
 	ktime_t end = 0;
 	int err;
 
+	range_debug(range, "ALLOCATE VRAM");
 retry:
 	xe_vm_lock(vm, false);
 	bo = xe_bo_create(tile_to_xe(tile), tile, vm, range->base.va.end -
@@ -542,8 +594,13 @@ static struct xe_bo *xe_svm_alloc_vram(struct xe_vm *vm, struct xe_tile *tile,
 	}
 
 	blocks = &to_xe_ttm_vram_mgr_resource(bo->ttm.resource)->blocks;
-	list_for_each_entry(block, blocks, link)
+	list_for_each_entry(block, blocks, link) {
+		vm_dbg(&vm->xe->drm, "ALLOC VRAM: asid=%u, gpusvm=0x%016llx, pfn=%llu, npages=%llu",
+		       vm->usm.asid, (u64)&vm->svm.gpusvm,
+		       block_offset_to_pfn(mr, drm_buddy_block_offset(block)),
+		       drm_buddy_block_size(buddy, block) >> PAGE_SHIFT);
 		block->private = mr;
+	}
 
 	/*
 	 * Take ref because as soon as drm_gpusvm_migrate_to_vram succeeds the
@@ -597,6 +654,8 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
 	if (xe_svm_range_is_valid(range, tile))
 		return 0;
 
+	range_debug(range, "PAGE FAULT");
+
 	/* XXX: Add migration policy, for now migrate range once */
 	if (IS_DGFX(vm->xe) && !range->migrated &&
 	    range->base.flags.migrate_vram &&
@@ -606,18 +665,26 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
 		bo = xe_svm_alloc_vram(vm, tile, range, &ctx);
 		if (IS_ERR(bo)) {
 			drm_info(&vm->xe->drm,
-				 "VRAM allocation failed, falling back to retrying, asid=%u, errno %ld\n",
-				 vm->usm.asid, PTR_ERR(bo));
+				 "VRAM allocation failed, falling back to retrying, asid=%u, gpusvm=0x%016llx, errno %ld\n",
+				 vm->usm.asid, (u64)&vm->svm.gpusvm,
+				 PTR_ERR(bo));
 			bo = NULL;
 			goto retry;
 		}
 	}
 
+	range_debug(range, "GET PAGES");
 	err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r, &ctx);
-	if (err == -EFAULT || err == -EPERM)	/* Corner where CPU mappings have change */
-	       goto retry;
-	if (err)
+	if (err == -EFAULT || err == -EPERM) {	/* Corner where CPU mappings have changed */
+		range_debug(range, "PAGE FAULT - RETRY PAGES");
+		goto retry;
+	}
+	if (err) {
+		range_debug(range, "PAGE FAULT - FAIL PAGE COLLECT");
 		goto err_out;
+	}
+
+	range_debug(range, "PAGE FAULT - BIND");
 
 retry_bind:
 	drm_exec_init(&exec, 0, 0);
@@ -633,8 +700,10 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
 		if (IS_ERR(fence)) {
 			drm_exec_fini(&exec);
 			err = PTR_ERR(fence);
-			if (err == -EAGAIN)
+			if (err == -EAGAIN) {
+				range_debug(range, "PAGE FAULT - RETRY BIND");
 				goto retry;
+			}
 			if (xe_vm_validate_should_retry(&exec, err, &end))
 				goto retry_bind;
 			goto err_out;
diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index b9cf0e2500da..1ea5d29a6868 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -31,6 +31,8 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
 			    bool atomic);
 bool xe_svm_has_mapping(struct xe_vm *vm, u64 start, u64 end);
 
+void xe_svm_range_debug(struct xe_svm_range *range, const char *operation);
+
 static inline bool xe_svm_range_pages_valid(struct xe_svm_range *range)
 {
 	return drm_gpusvm_range_pages_valid(range->base.gpusvm, &range->base);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH 26/28] drm/xe: Add modparam for SVM notifier size
  2024-08-28  2:48 [RFC PATCH 00/28] Introduce GPU SVM and Xe SVM implementation Matthew Brost
                   ` (24 preceding siblings ...)
  2024-08-28  2:48 ` [RFC PATCH 25/28] drm/xe: Add SVM debug Matthew Brost
@ 2024-08-28  2:48 ` Matthew Brost
  2024-08-28  2:49 ` [RFC PATCH 27/28] drm/xe: Add modparam for SVM prefault Matthew Brost
                   ` (5 subsequent siblings)
  31 siblings, 0 replies; 100+ messages in thread
From: Matthew Brost @ 2024-08-28  2:48 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: airlied, christian.koenig, thomas.hellstrom, matthew.auld, daniel

Useful for experimenting with the notifier size and how it affects
performance.
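
For example (the value here is arbitrary; the parameter defined below
defaults to 512 MiB and must be a power of two):

	modprobe xe svm_notifier_size=256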

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_module.c | 4 ++++
 drivers/gpu/drm/xe/xe_module.h | 1 +
 drivers/gpu/drm/xe/xe_svm.c    | 5 +++--
 3 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_module.c b/drivers/gpu/drm/xe/xe_module.c
index 923460119cec..30cfb76344a1 100644
--- a/drivers/gpu/drm/xe/xe_module.c
+++ b/drivers/gpu/drm/xe/xe_module.c
@@ -22,9 +22,13 @@ struct xe_modparam xe_modparam = {
 	.max_vfs = IS_ENABLED(CONFIG_DRM_XE_DEBUG) ? ~0 : 0,
 #endif
 	.wedged_mode = 1,
+	.svm_notifier_size = 512,
 	/* the rest are 0 by default */
 };
 
+module_param_named(svm_notifier_size, xe_modparam.svm_notifier_size, uint, 0600);
+MODULE_PARM_DESC(svm_notifier_size, "Set the SVM notifier size (in MiB), must be a power of two");
+
 module_param_named_unsafe(force_execlist, xe_modparam.force_execlist, bool, 0444);
 MODULE_PARM_DESC(force_execlist, "Force Execlist submission");
 
diff --git a/drivers/gpu/drm/xe/xe_module.h b/drivers/gpu/drm/xe/xe_module.h
index 161a5e6f717f..5a3bfea8b7b4 100644
--- a/drivers/gpu/drm/xe/xe_module.h
+++ b/drivers/gpu/drm/xe/xe_module.h
@@ -22,6 +22,7 @@ struct xe_modparam {
 	unsigned int max_vfs;
 #endif
 	int wedged_mode;
+	u32 svm_notifier_size;
 };
 
 extern struct xe_modparam xe_modparam;
diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index f9c2bffd1783..5e2ec25c3cb2 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -8,6 +8,7 @@
 #include "xe_bo.h"
 #include "xe_gt_tlb_invalidation.h"
 #include "xe_migrate.h"
+#include "xe_module.h"
 #include "xe_pt.h"
 #include "xe_svm.h"
 #include "xe_ttm_vram_mgr.h"
@@ -543,8 +544,8 @@ int xe_svm_init(struct xe_vm *vm)
 
 	return drm_gpusvm_init(&vm->svm.gpusvm, "Xe SVM", &vm->xe->drm,
 			       current->mm, xe_svm_devm_owner(vm->xe), 0,
-			       vm->size, SZ_512M, &gpusvm_ops,
-			       fault_chunk_sizes,
+			       vm->size, xe_modparam.svm_notifier_size * SZ_1M,
+			       &gpusvm_ops, fault_chunk_sizes,
 			       ARRAY_SIZE(fault_chunk_sizes));
 }
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH 27/28] drm/xe: Add modparam for SVM prefault
  2024-08-28  2:48 [RFC PATCH 00/28] Introduce GPU SVM and Xe SVM implementation Matthew Brost
                   ` (25 preceding siblings ...)
  2024-08-28  2:48 ` [RFC PATCH 26/28] drm/xe: Add modparam for SVM notifier size Matthew Brost
@ 2024-08-28  2:49 ` Matthew Brost
  2024-08-28  2:49 ` [RFC PATCH 28/28] drm/gpusvm: Ensure all pages migrated upon eviction Matthew Brost
                   ` (4 subsequent siblings)
  31 siblings, 0 replies; 100+ messages in thread
From: Matthew Brost @ 2024-08-28  2:49 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: airlied, christian.koenig, thomas.hellstrom, matthew.auld, daniel

Useful for experimenting with SVM prefault and how it affects
performance.
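
For example (prefault is off by default; the parameter defined below is
read-only at runtime, so it is set at module load):

	modprobe xe svm_prefault=1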

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_module.c | 3 +++
 drivers/gpu/drm/xe/xe_module.h | 1 +
 drivers/gpu/drm/xe/xe_svm.c    | 2 ++
 3 files changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_module.c b/drivers/gpu/drm/xe/xe_module.c
index 30cfb76344a1..edda9898f3cf 100644
--- a/drivers/gpu/drm/xe/xe_module.c
+++ b/drivers/gpu/drm/xe/xe_module.c
@@ -29,6 +29,9 @@ struct xe_modparam xe_modparam = {
 module_param_named(svm_notifier_size, xe_modparam.svm_notifier_size, uint, 0600);
 MODULE_PARM_DESC(svm_notifier_size, "Set the SVM notifier size (in MiB), must be a power of two");
 
+module_param_named(svm_prefault, xe_modparam.svm_prefault, bool, 0444);
+MODULE_PARM_DESC(svm_prefault, "SVM prefault CPU pages upon range allocation");
+
 module_param_named_unsafe(force_execlist, xe_modparam.force_execlist, bool, 0444);
 MODULE_PARM_DESC(force_execlist, "Force Execlist submission");
 
diff --git a/drivers/gpu/drm/xe/xe_module.h b/drivers/gpu/drm/xe/xe_module.h
index 5a3bfea8b7b4..c1571cd8f9fe 100644
--- a/drivers/gpu/drm/xe/xe_module.h
+++ b/drivers/gpu/drm/xe/xe_module.h
@@ -12,6 +12,7 @@
 struct xe_modparam {
 	bool force_execlist;
 	bool probe_display;
+	bool svm_prefault;
 	u32 force_vram_bar_size;
 	int guc_log_level;
 	char *guc_firmware_path;
diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index 5e2ec25c3cb2..8e80e8704534 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -645,9 +645,11 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
 	if (err)
 		return err;
 
+	ctx.prefault = xe_modparam.svm_prefault;
 	r = drm_gpusvm_range_find_or_insert(&vm->svm.gpusvm, fault_addr,
 					    xe_vma_start(vma), xe_vma_end(vma),
 					    &ctx);
+	ctx.prefault = false;
 	if (IS_ERR(r))
 		return PTR_ERR(r);
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH 28/28] drm/gpusvm: Ensure all pages migrated upon eviction
  2024-08-28  2:48 [RFC PATCH 00/28] Introduce GPU SVM and Xe SVM implementation Matthew Brost
                   ` (26 preceding siblings ...)
  2024-08-28  2:49 ` [RFC PATCH 27/28] drm/xe: Add modparam for SVM prefault Matthew Brost
@ 2024-08-28  2:49 ` Matthew Brost
  2024-08-28  2:55 ` ✓ CI.Patch_applied: success for Introduce GPU SVM and Xe SVM implementation Patchwork
                   ` (3 subsequent siblings)
  31 siblings, 0 replies; 100+ messages in thread
From: Matthew Brost @ 2024-08-28  2:49 UTC (permalink / raw)
  To: intel-xe, dri-devel
  Cc: airlied, christian.koenig, thomas.hellstrom, matthew.auld, daniel

Let's make sure we know what we are doing and check to ensure all pages
are migrated upon eviction.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/drm_gpusvm.c | 39 +++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/drivers/gpu/drm/xe/drm_gpusvm.c b/drivers/gpu/drm/xe/drm_gpusvm.c
index fc1e44e6ae72..6df3580cf4ca 100644
--- a/drivers/gpu/drm/xe/drm_gpusvm.c
+++ b/drivers/gpu/drm/xe/drm_gpusvm.c
@@ -1830,6 +1830,40 @@ static int drm_gpusvm_migrate_populate_sram_pfn(struct vm_area_struct *vas,
 	return 0;
 }
 
+#define DRM_GPUSVM_DEBUG	/* TODO: Connect to Kconfig */
+
+#ifdef DRM_GPUSVM_DEBUG
+/**
+ * drm_gpusvm_pages_migrated - count the number of pages migrated
+ * @src_pfns: source migration pfns
+ * @npages: the total number of pages in src_pfns
+ *
+ * Examine the MIGRATE_PFN_MIGRATE bit of each src_pfn to get a count of the
+ * number of pages migrated.
+ *
+ * Returns:
+ * Number of pages migrated
+ */
+static unsigned long
+drm_gpusvm_pages_migrated(unsigned long *src_pfns, unsigned long npages)
+{
+	unsigned long pages_migrated = 0;
+	unsigned long i;
+
+	for (i = 0; i < npages; ++i)
+		if (src_pfns[i] && src_pfns[i] & MIGRATE_PFN_MIGRATE)
+			++pages_migrated;
+
+	return pages_migrated;
+}
+#else
+static unsigned long
+drm_gpusvm_pages_migrated(unsigned long *src_pfns, unsigned long npages)
+{
+	return npages;
+}
+#endif
+
 /**
  * drm_gpusvm_evict_to_sram - Evict GPU SVM range to SRAM
  * @gpusvm: Pointer to the GPU SVM structure
@@ -1896,6 +1930,8 @@ static int drm_gpusvm_evict_to_sram(struct drm_gpusvm *gpusvm,
 	if (err)
 		drm_gpusvm_migration_put_pages(npages, dst);
 	migrate_device_pages(src, dst, npages);
+	if (!err)
+		WARN_ON(npages > drm_gpusvm_pages_migrated(src, npages));
 	migrate_device_finalize(src, dst, npages);
 	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr, npages,
 				       DMA_BIDIRECTIONAL);
@@ -1994,6 +2030,9 @@ static int __drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
 	if (err)
 		drm_gpusvm_migration_put_pages(npages, migrate.dst);
 	migrate_vma_pages(&migrate);
+	if (!err && !page)	/* Only check on eviction */
+		WARN_ON(migrate.cpages >
+			drm_gpusvm_pages_migrated(migrate.src, npages));
 	migrate_vma_finalize(&migrate);
 	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr, npages,
 				       DMA_BIDIRECTIONAL);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* ✓ CI.Patch_applied: success for Introduce GPU SVM and Xe SVM implementation
  2024-08-28  2:48 [RFC PATCH 00/28] Introduce GPU SVM and Xe SVM implementation Matthew Brost
                   ` (27 preceding siblings ...)
  2024-08-28  2:49 ` [RFC PATCH 28/28] drm/gpusvm: Ensure all pages migrated upon eviction Matthew Brost
@ 2024-08-28  2:55 ` Patchwork
  2024-08-28  2:55 ` ✗ CI.checkpatch: warning " Patchwork
                   ` (2 subsequent siblings)
  31 siblings, 0 replies; 100+ messages in thread
From: Patchwork @ 2024-08-28  2:55 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-xe

== Series Details ==

Series: Introduce GPU SVM and Xe SVM implementation
URL   : https://patchwork.freedesktop.org/series/137870/
State : success

== Summary ==

=== Applying kernel patches on branch 'drm-tip' with base: ===
Base commit: 2940d1fa7abe drm-tip: 2024y-08m-27d-22h-16m-15s UTC integration manifest
=== git am output follows ===
Applying: dma-buf: Split out dma fence array create into alloc and arm functions
Applying: drm/xe: Invalidate media_gt TLBs in PT code
Applying: drm/xe: Retry BO allocation
Applying: mm/migrate: Add migrate_device_vma_range
Applying: drm/gpusvm: Add support for GPU Shared Virtual Memory
Applying: drm/xe/uapi: Add DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATON flag
Applying: drm/xe: Add SVM init / fini to faulting VMs
Applying: drm/xe: Add dma_addr res cursor
Applying: drm/xe: Add SVM range invalidation
Applying: drm/gpuvm: Add DRM_GPUVA_OP_USER
Applying: drm/xe: Add (re)bind to SVM page fault handler
Applying: drm/xe: Add SVM garbage collector
Applying: drm/xe: Add unbind to SVM garbage collector
Applying: drm/xe: Do not allow system allocator VMA unbind if the GPU has bindings
Applying: drm/xe: Enable system allocator uAPI
Applying: drm/xe: Add migrate layer functions for SVM support
Applying: drm/xe: Add SVM device memory mirroring
Applying: drm/xe: Add GPUSVM copy SRAM / VRAM vfunc functions
Applying: drm/xe: Update PT layer to understand ranges in VRAM
Applying: drm/xe: Add Xe SVM populate_vram_pfn vfunc
Applying: drm/xe: Add Xe SVM vram_release vfunc
Applying: drm/xe: Add BO flags required for SVM
Applying: drm/xe: Add SVM VRAM migration
Applying: drm/xe: Basic SVM BO eviction
Applying: drm/xe: Add SVM debug
Applying: drm/xe: Add modparam for SVM notifier size
Applying: drm/xe: Add modparam for SVM prefault
Applying: drm/gpusvm: Ensure all pages migrated upon eviction



^ permalink raw reply	[flat|nested] 100+ messages in thread

* ✗ CI.checkpatch: warning for Introduce GPU SVM and Xe SVM implementation
  2024-08-28  2:48 [RFC PATCH 00/28] Introduce GPU SVM and Xe SVM implementation Matthew Brost
                   ` (28 preceding siblings ...)
  2024-08-28  2:55 ` ✓ CI.Patch_applied: success for Introduce GPU SVM and Xe SVM implementation Patchwork
@ 2024-08-28  2:55 ` Patchwork
  2024-08-28  2:56 ` ✗ CI.KUnit: failure " Patchwork
  2024-09-24  9:16 ` [RFC PATCH 00/28] " Simona Vetter
  31 siblings, 0 replies; 100+ messages in thread
From: Patchwork @ 2024-08-28  2:55 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-xe

== Series Details ==

Series: Introduce GPU SVM and Xe SVM implementation
URL   : https://patchwork.freedesktop.org/series/137870/
State : warning

== Summary ==

+ KERNEL=/kernel
+ git clone https://gitlab.freedesktop.org/drm/maintainer-tools mt
Cloning into 'mt'...
warning: redirecting to https://gitlab.freedesktop.org/drm/maintainer-tools.git/
+ git -C mt rev-list -n1 origin/master
9fe5037901cabbcdf27a6fe0dfb047ca1474d363
+ cd /kernel
+ git config --global --add safe.directory /kernel
+ git log -n1
commit 0a43e1983467701baae395f1f912d8eb5cb3b119
Author: Matthew Brost <matthew.brost@intel.com>
Date:   Tue Aug 27 19:49:01 2024 -0700

    drm/gpusvm: Ensure all pages migrated upon eviction
    
    Let's make sure we know what we are doing and check to ensure all pages
    are migrated upon eviction.
    
    Signed-off-by: Matthew Brost <matthew.brost@intel.com>
+ /mt/dim checkpatch 2940d1fa7abe0d2a9acc95fd1c704a8d8cbc68f4 drm-intel
69ec1662f608 dma-buf: Split out dma fence array create into alloc and arm functions
-:63: WARNING:REPEATED_WORD: Possible repeated word: 'fence'
#63: FILE: drivers/dma-buf/dma-fence-array.c:170:
+ * preallocated dma fence fence in the path of reclaim or dma fence signaling.

-:71: WARNING:UNSPECIFIED_INT: Prefer 'unsigned int' to bare use of 'unsigned'
#71: FILE: drivers/dma-buf/dma-fence-array.c:174:
+			  u64 context, unsigned seqno,

-:113: WARNING:UNSPECIFIED_INT: Prefer 'unsigned int' to bare use of 'unsigned'
#113: FILE: drivers/dma-buf/dma-fence-array.c:228:
+					       u64 context, unsigned seqno,

-:132: WARNING:SUSPECT_CODE_INDENT: suspect code indent for conditional statements (8, 0)
#132: FILE: include/linux/dma-fence-array.h:79:
 	for (index = 0, fence = dma_fence_array_first(head); fence;	\
[...]
+struct dma_fence_array *dma_fence_array_alloc(int num_fences);

-:138: WARNING:UNSPECIFIED_INT: Prefer 'unsigned int' to bare use of 'unsigned'
#138: FILE: include/linux/dma-fence-array.h:85:
+			  u64 context, unsigned seqno,

total: 0 errors, 5 warnings, 0 checks, 112 lines checked
b32ee4ad3a3d drm/xe: Invalidate media_gt TLBs in PT code
295019e1310a drm/xe: Retry BO allocation
d39141b7860c mm/migrate: Add migrate_device_vma_range
-:48: WARNING:REPEATED_WORD: Possible repeated word: 'with'
#48: FILE: mm/migrate_device.c:937:
+ * invalidation can race with with VMA start being repurposed, worst case this

-:49: WARNING:TYPO_SPELLING: 'unecessary' may be misspelled - perhaps 'unnecessary'?
#49: FILE: mm/migrate_device.c:938:
+ * would result in an unecessary invalidation.
                       ^^^^^^^^^^

total: 0 errors, 2 warnings, 0 checks, 68 lines checked
5c6446919277 drm/gpusvm: Add support for GPU Shared Virtual Memory
-:42: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#42: 
new file mode 100644

-:455: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'range__' - possible side-effects?
#455: FILE: drivers/gpu/drm/xe/drm_gpusvm.c:409:
+#define drm_gpusvm_for_each_range_safe(range__, next__, notifier__, start__, end__)	\
+	for ((range__) = drm_gpusvm_range_find((notifier__), (start__), (end__)),	\
+	     (next__) = __drm_gpusvm_range_next(range__);				\
+	     (range__) && (range__->va.start < (end__));				\
+	     (range__) = (next__), (next__) = __drm_gpusvm_range_next(range__))

-:455: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'next__' - possible side-effects?
#455: FILE: drivers/gpu/drm/xe/drm_gpusvm.c:409:
+#define drm_gpusvm_for_each_range_safe(range__, next__, notifier__, start__, end__)	\
+	for ((range__) = drm_gpusvm_range_find((notifier__), (start__), (end__)),	\
+	     (next__) = __drm_gpusvm_range_next(range__);				\
+	     (range__) && (range__->va.start < (end__));				\
+	     (range__) = (next__), (next__) = __drm_gpusvm_range_next(range__))

-:455: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'end__' - possible side-effects?
#455: FILE: drivers/gpu/drm/xe/drm_gpusvm.c:409:
+#define drm_gpusvm_for_each_range_safe(range__, next__, notifier__, start__, end__)	\
+	for ((range__) = drm_gpusvm_range_find((notifier__), (start__), (end__)),	\
+	     (next__) = __drm_gpusvm_range_next(range__);				\
+	     (range__) && (range__->va.start < (end__));				\
+	     (range__) = (next__), (next__) = __drm_gpusvm_range_next(range__))

-:488: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'notifier__' - possible side-effects?
#488: FILE: drivers/gpu/drm/xe/drm_gpusvm.c:442:
+#define drm_gpusvm_for_each_notifier(notifier__, gpusvm__, start__, end__)		\
+	for ((notifier__) = notifier_iter_first(&(gpusvm__)->root, (start__), (end__) - 1);	\
+	     (notifier__) && (notifier__->interval.start < (end__));			\
+	     (notifier__) = __drm_gpusvm_notifier_next(notifier__))

-:488: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'end__' - possible side-effects?
#488: FILE: drivers/gpu/drm/xe/drm_gpusvm.c:442:
+#define drm_gpusvm_for_each_notifier(notifier__, gpusvm__, start__, end__)		\
+	for ((notifier__) = notifier_iter_first(&(gpusvm__)->root, (start__), (end__) - 1);	\
+	     (notifier__) && (notifier__->interval.start < (end__));			\
+	     (notifier__) = __drm_gpusvm_notifier_next(notifier__))

-:504: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'notifier__' - possible side-effects?
#504: FILE: drivers/gpu/drm/xe/drm_gpusvm.c:458:
+#define drm_gpusvm_for_each_notifier_safe(notifier__, next__, gpusvm__, start__, end__)	\
+	for ((notifier__) = notifier_iter_first(&(gpusvm__)->root, (start__), (end__) - 1),	\
+	     (next__) = __drm_gpusvm_notifier_next(notifier__);				\
+	     (notifier__) && (notifier__->interval.start < (end__));			\
+	     (notifier__) = (next__), (next__) = __drm_gpusvm_notifier_next(notifier__))

-:504: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'next__' - possible side-effects?
#504: FILE: drivers/gpu/drm/xe/drm_gpusvm.c:458:
+#define drm_gpusvm_for_each_notifier_safe(notifier__, next__, gpusvm__, start__, end__)	\
+	for ((notifier__) = notifier_iter_first(&(gpusvm__)->root, (start__), (end__) - 1),	\
+	     (next__) = __drm_gpusvm_notifier_next(notifier__);				\
+	     (notifier__) && (notifier__->interval.start < (end__));			\
+	     (notifier__) = (next__), (next__) = __drm_gpusvm_notifier_next(notifier__))

-:504: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'end__' - possible side-effects?
#504: FILE: drivers/gpu/drm/xe/drm_gpusvm.c:458:
+#define drm_gpusvm_for_each_notifier_safe(notifier__, next__, gpusvm__, start__, end__)	\
+	for ((notifier__) = notifier_iter_first(&(gpusvm__)->root, (start__), (end__) - 1),	\
+	     (next__) = __drm_gpusvm_notifier_next(notifier__);				\
+	     (notifier__) && (notifier__->interval.start < (end__));			\
+	     (notifier__) = (next__), (next__) = __drm_gpusvm_notifier_next(notifier__))

-:616: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'fault_addr__' - possible side-effects?
#616: FILE: drivers/gpu/drm/xe/drm_gpusvm.c:570:
+#define drm_gpusvm_notifier_find(gpusvm__, fault_addr__)	\
+	notifier_iter_first(&(gpusvm__)->root, (fault_addr__),	\
+			    (fault_addr__ + 1))

-:660: ERROR:MULTISTATEMENT_MACRO_USE_DO_WHILE: Macros with multiple statements should be enclosed in a do - while loop
#660: FILE: drivers/gpu/drm/xe/drm_gpusvm.c:614:
+#define drm_gpusvm_notifier_remove(gpusvm__, notifier__)	\
+	notifier_remove((notifier__), &(gpusvm__)->root);	\
+	list_del(&(notifier__)->rb.entry)

-:660: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'notifier__' - possible side-effects?
#660: FILE: drivers/gpu/drm/xe/drm_gpusvm.c:614:
+#define drm_gpusvm_notifier_remove(gpusvm__, notifier__)	\
+	notifier_remove((notifier__), &(gpusvm__)->root);	\
+	list_del(&(notifier__)->rb.entry)

-:786: ERROR:MULTISTATEMENT_MACRO_USE_DO_WHILE: Macros with multiple statements should be enclosed in a do - while loop
#786: FILE: drivers/gpu/drm/xe/drm_gpusvm.c:740:
+#define __drm_gpusvm_range_remove(notifier__, range__)		\
+	range_remove((range__), &(notifier__)->root);		\
+	list_del(&(range__)->rb.entry)

-:786: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'range__' - possible side-effects?
#786: FILE: drivers/gpu/drm/xe/drm_gpusvm.c:740:
+#define __drm_gpusvm_range_remove(notifier__, range__)		\
+	range_remove((range__), &(notifier__)->root);		\
+	list_del(&(range__)->rb.entry)

-:1120: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'i__' - possible side-effects?
#1120: FILE: drivers/gpu/drm/xe/drm_gpusvm.c:1074:
+#define for_each_dma_page(i__, j__, npages__, order__)	\
+	for ((i__) = 0, (j__) = 0; (i__) < (npages__);	\
+	     (j__)++, (i__) += 0x1 << (order__))

-:1120: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'j__' - possible side-effects?
#1120: FILE: drivers/gpu/drm/xe/drm_gpusvm.c:1074:
+#define for_each_dma_page(i__, j__, npages__, order__)	\
+	for ((i__) = 0, (j__) = 0; (i__) < (npages__);	\
+	     (j__)++, (i__) += 0x1 << (order__))

-:1267: WARNING:TYPO_SPELLING: 'commiting' may be misspelled - perhaps 'committing'?
#1267: FILE: drivers/gpu/drm/xe/drm_gpusvm.c:1221:
+ * called holding gpusvm->notifier_lock and as the last step before commiting a
                                                                     ^^^^^^^^^

-:1628: WARNING:MISORDERED_TYPE: type 'long unsigned int *' should be specified in [[un]signed] [short|int|long|long long] order
#1628: FILE: drivers/gpu/drm/xe/drm_gpusvm.c:1582:
+					long unsigned int *migrate_pfn,

-:1628: WARNING:UNNECESSARY_INT: Prefer 'unsigned long *' over 'long unsigned int *' as the int is unnecessary
#1628: FILE: drivers/gpu/drm/xe/drm_gpusvm.c:1582:
+					long unsigned int *migrate_pfn,

-:2615: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'range__' - possible side-effects?
#2615: FILE: drivers/gpu/drm/xe/drm_gpusvm.h:389:
+#define drm_gpusvm_for_each_range(range__, notifier__, start__, end__)	\
+	for ((range__) = (range__) ?:					\
+	     drm_gpusvm_range_find((notifier__), (start__), (end__));	\
+	     (range__) && (range__->va.start < (end__));		\
+	     (range__) = __drm_gpusvm_range_next(range__))

-:2615: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'end__' - possible side-effects?
#2615: FILE: drivers/gpu/drm/xe/drm_gpusvm.h:389:
+#define drm_gpusvm_for_each_range(range__, notifier__, start__, end__)	\
+	for ((range__) = (range__) ?:					\
+	     drm_gpusvm_range_find((notifier__), (start__), (end__));	\
+	     (range__) && (range__->va.start < (end__));		\
+	     (range__) = __drm_gpusvm_range_next(range__))

total: 2 errors, 4 warnings, 15 checks, 2598 lines checked
4c792b5b6097 drm/xe/uapi: Add DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATON flag
-:449: CHECK:PARENTHESIS_ALIGNMENT: Alignment should match open parenthesis
#449: FILE: drivers/gpu/drm/xe/xe_vm.c:2816:
+		    XE_IOCTL_DBG(xe, obj_offset && (is_null ||
+				 is_system_allocator)) ||

total: 0 errors, 0 warnings, 1 checks, 469 lines checked
0240aafda33a drm/xe: Add SVM init / fini to faulting VMs
-:23: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#23: 
new file mode 100644

total: 0 errors, 1 warnings, 0 checks, 108 lines checked
a5a3d7b38838 drm/xe: Add dma_addr res cursor
0d0c1a9c8d54 drm/xe: Add SVM range invalidation
-:143: ERROR:OPEN_BRACE: open brace '{' following function definitions go on the next line
#143: FILE: drivers/gpu/drm/xe/xe_svm.c:14:
+static struct xe_vm *gpusvm_to_vm(struct drm_gpusvm *gpusvm)
+ {

-:144: WARNING:LEADING_SPACE: please, no spaces at the start of a line
#144: FILE: drivers/gpu/drm/xe/xe_svm.c:15:
+ {$

-:349: WARNING:SUSPECT_CODE_INDENT: suspect code indent for conditional statements (8, 15)
#349: FILE: drivers/gpu/drm/xe/xe_svm.c:233:
+	if (err == -EFAULT || err == -EPERM)	/* Corner where CPU mappings have change */
+	       goto retry;

-:350: WARNING:TABSTOP: Statements should start on a tabstop
#350: FILE: drivers/gpu/drm/xe/xe_svm.c:234:
+	       goto retry;

total: 1 errors, 3 warnings, 0 checks, 343 lines checked
7e67697bcf92 drm/gpuvm: Add DRM_GPUVA_OP_USER
9cf1590db98f drm/xe: Add (re)bind to SVM page fault handler
14c4590f803b drm/xe: Add SVM garbage collector
-:187: CHECK:UNCOMMENTED_DEFINITION: spinlock_t definition without comment
#187: FILE: drivers/gpu/drm/xe/xe_vm_types.h:150:
+			spinlock_t lock;

total: 0 errors, 0 warnings, 1 checks, 157 lines checked
09f5cb686eb8 drm/xe: Add unbind to SVM garbage collector
-:20: ERROR:POINTER_LOCATION: "(foo*)" should be "(foo *)"
#20: FILE: drivers/gpu/drm/xe/xe_pt.c:908:
+#define INVALID_VMA	(struct xe_vma*)(0xdeaddeadull)

-:20: ERROR:COMPLEX_MACRO: Macros with complex values should be enclosed in parentheses
#20: FILE: drivers/gpu/drm/xe/xe_pt.c:908:
+#define INVALID_VMA	(struct xe_vma*)(0xdeaddeadull)

total: 2 errors, 0 warnings, 0 checks, 292 lines checked
7ff1f55df61c drm/xe: Do not allow system allocator VMA unbind if the GPU has bindings
-:7: WARNING:REPEATED_WORD: Possible repeated word: 'the'
#7: 
uAPI is designed with the the use case that only mapping a BO to a

total: 0 errors, 1 warnings, 0 checks, 43 lines checked
8995548d9786 drm/xe: Enable system allocator uAPI
08c7ca1fff12 drm/xe: Add migrate layer functions for SVM support
1e2bf1e50dc7 drm/xe: Add SVM device memory mirroring
-:11: ERROR:BAD_SIGN_OFF: Unrecognized email address: 'Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com'
#11: 
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com

-:103: CHECK:PARENTHESIS_ALIGNMENT: Alignment should match open parenthesis
#103: FILE: drivers/gpu/drm/xe/xe_svm.c:417:
+		drm_err(&xe->drm, "Failed to remap tile %d memory, errno %d\n",
+				tile->id, ret);

total: 1 errors, 0 warnings, 1 checks, 123 lines checked
8ce3d1594574 drm/xe: Add GPUSVM copy SRAM / VRAM vfunc functions
54efe69fa8ae drm/xe: Update PT layer to understand ranges in VRAM
88b624083ec7 drm/xe: Add Xe SVM populate_vram_pfn vfunc
-:6: ERROR:BAD_SIGN_OFF: Unrecognized email address: 'Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com'
#6: 
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com

-:7: WARNING:COMMIT_MESSAGE: Missing commit description - Add an appropriate one

-:45: ERROR:SPACING: spaces required around that '=' (ctx:WxV)
#45: FILE: drivers/gpu/drm/xe/xe_svm.c:444:
+	int j =0;
 	      ^

-:54: ERROR:SPACING: space required before the open parenthesis '('
#54: FILE: drivers/gpu/drm/xe/xe_svm.c:453:
+		for(i = 0; i < drm_buddy_block_size(buddy, block) >> PAGE_SHIFT; ++i)

total: 3 errors, 1 warnings, 0 checks, 52 lines checked
e033412ea53d drm/xe: Add Xe SVM vram_release vfunc
45d9b40b5b71 drm/xe: Add BO flags required for SVM
-:45: CHECK:PARENTHESIS_ALIGNMENT: Alignment should match open parenthesis
#45: FILE: drivers/gpu/drm/xe/xe_bo.c:2305:
+	if (IS_DGFX(xe) && ((bo->flags & XE_BO_FLAG_SYSTEM) ||
+	    (bo->flags & XE_BO_FLAG_SYSTEM_ALLOC)))

total: 0 errors, 0 warnings, 1 checks, 38 lines checked
0cd6cedd73b1 drm/xe: Add SVM VRAM migration
-:33: ERROR:BAD_SIGN_OFF: Unrecognized email address: 'Matthew Brost matthew.brost@intel.com'
#33: 
Signed-off-by: Matthew Brost matthew.brost@intel.com

-:175: ERROR:NO_AUTHOR_SIGN_OFF: Missing Signed-off-by: line by nominal patch author 'Matthew Brost <matthew.brost@intel.com>'

total: 2 errors, 0 warnings, 0 checks, 128 lines checked
93416a88b507 drm/xe: Basic SVM BO eviction
eb17623048e9 drm/xe: Add SVM debug
-:60: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'r__' - possible side-effects?
#60: FILE: drivers/gpu/drm/xe/xe_svm.c:27:
+#define range_debug(r__, operaton__)					\
+	vm_dbg(&range_to_vm(&(r__)->base)->xe->drm,			\
+	       "%s: asid=%u, gpusvm=0x%016llx, vram=%d,%d,%d, seqno=%lu, order=%u, start=0x%014llx, end=0x%014llx, size=%llu",	\
+	       (operaton__), range_to_vm(&(r__)->base)->usm.asid,	\
+	       (u64)(r__)->base.gpusvm,					\
+	       (r__)->base.vram_allocation ? 1 : 0,			\
+	       xe_svm_range_in_vram((r__)) ? 1 : 0,			\
+	       xe_svm_range_has_vram_binding((r__)) ? 1 : 0,		\
+	       (r__)->base.notifier_seq, (r__)->base.order,		\
+	       (r__)->base.va.start, (r__)->base.va.end,		\
+	       (r__)->base.va.end - (r__)->base.va.start)

-:62: WARNING:LONG_LINE: line length of 129 exceeds 100 columns
#62: FILE: drivers/gpu/drm/xe/xe_svm.c:29:
+	       "%s: asid=%u, gpusvm=0x%016llx, vram=%d,%d,%d, seqno=%lu, order=%u, start=0x%014llx, end=0x%014llx, size=%llu",	\

total: 0 errors, 1 warnings, 1 checks, 244 lines checked
1f9f78fbd7e0 drm/xe: Add modparam for SVM notifier size
adabbeef048c drm/xe: Add modparam for SVM prefault
0a43e1983467 drm/gpusvm: Ensure all pages migrated upon eviction



^ permalink raw reply	[flat|nested] 100+ messages in thread

* ✗ CI.KUnit: failure for Introduce GPU SVM and Xe SVM implementation
  2024-08-28  2:48 [RFC PATCH 00/28] Introduce GPU SVM and Xe SVM implementation Matthew Brost
                   ` (29 preceding siblings ...)
  2024-08-28  2:55 ` ✗ CI.checkpatch: warning " Patchwork
@ 2024-08-28  2:56 ` Patchwork
  2024-09-24  9:16 ` [RFC PATCH 00/28] " Simona Vetter
  31 siblings, 0 replies; 100+ messages in thread
From: Patchwork @ 2024-08-28  2:56 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-xe

== Series Details ==

Series: Introduce GPU SVM and Xe SVM implementation
URL   : https://patchwork.freedesktop.org/series/137870/
State : failure

== Summary ==

+ trap cleanup EXIT
+ /kernel/tools/testing/kunit/kunit.py run --kunitconfig /kernel/drivers/gpu/drm/xe/.kunitconfig
ERROR:root:../drivers/gpu/drm/xe/drm_gpusvm.c: In function ‘drm_gpusvm_range_get_pages’:
../drivers/gpu/drm/xe/drm_gpusvm.c:1430:4: error: implicit declaration of function ‘mark_page_accessed’; did you mean ‘mark_page_reserved’? [-Werror=implicit-function-declaration]
 1430 |    mark_page_accessed(pages[j]);
      |    ^~~~~~~~~~~~~~~~~~
      |    mark_page_reserved
../drivers/gpu/drm/xe/drm_gpusvm.c: In function ‘drm_gpusvm_get_vram_page’:
../drivers/gpu/drm/xe/drm_gpusvm.c:1562:2: error: implicit declaration of function ‘zone_device_page_init’ [-Werror=implicit-function-declaration]
 1562 |  zone_device_page_init(page);
      |  ^~~~~~~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors
make[7]: *** [../scripts/Makefile.build:244: drivers/gpu/drm/xe/drm_gpusvm.o] Error 1
make[7]: *** Waiting for unfinished jobs....
make[6]: *** [../scripts/Makefile.build:485: drivers/gpu/drm/xe] Error 2
make[5]: *** [../scripts/Makefile.build:485: drivers/gpu/drm] Error 2
make[4]: *** [../scripts/Makefile.build:485: drivers/gpu] Error 2
make[4]: *** Waiting for unfinished jobs....
make[3]: *** [../scripts/Makefile.build:485: drivers] Error 2
make[3]: *** Waiting for unfinished jobs....
../lib/iomap.c:156:5: warning: no previous prototype for ‘ioread64_lo_hi’ [-Wmissing-prototypes]
  156 | u64 ioread64_lo_hi(const void __iomem *addr)
      |     ^~~~~~~~~~~~~~
../lib/iomap.c:163:5: warning: no previous prototype for ‘ioread64_hi_lo’ [-Wmissing-prototypes]
  163 | u64 ioread64_hi_lo(const void __iomem *addr)
      |     ^~~~~~~~~~~~~~
../lib/iomap.c:170:5: warning: no previous prototype for ‘ioread64be_lo_hi’ [-Wmissing-prototypes]
  170 | u64 ioread64be_lo_hi(const void __iomem *addr)
      |     ^~~~~~~~~~~~~~~~
../lib/iomap.c:178:5: warning: no previous prototype for ‘ioread64be_hi_lo’ [-Wmissing-prototypes]
  178 | u64 ioread64be_hi_lo(const void __iomem *addr)
      |     ^~~~~~~~~~~~~~~~
../lib/iomap.c:264:6: warning: no previous prototype for ‘iowrite64_lo_hi’ [-Wmissing-prototypes]
  264 | void iowrite64_lo_hi(u64 val, void __iomem *addr)
      |      ^~~~~~~~~~~~~~~
../lib/iomap.c:272:6: warning: no previous prototype for ‘iowrite64_hi_lo’ [-Wmissing-prototypes]
  272 | void iowrite64_hi_lo(u64 val, void __iomem *addr)
      |      ^~~~~~~~~~~~~~~
../lib/iomap.c:280:6: warning: no previous prototype for ‘iowrite64be_lo_hi’ [-Wmissing-prototypes]
  280 | void iowrite64be_lo_hi(u64 val, void __iomem *addr)
      |      ^~~~~~~~~~~~~~~~~
../lib/iomap.c:288:6: warning: no previous prototype for ‘iowrite64be_hi_lo’ [-Wmissing-prototypes]
  288 | void iowrite64be_hi_lo(u64 val, void __iomem *addr)
      |      ^~~~~~~~~~~~~~~~~
make[2]: *** [/kernel/Makefile:1925: .] Error 2
make[1]: *** [/kernel/Makefile:224: __sub-make] Error 2
make: *** [Makefile:224: __sub-make] Error 2

[02:55:56] Configuring KUnit Kernel ...
Generating .config ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
[02:56:00] Building KUnit Kernel ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
Building with:
$ make ARCH=um O=.kunit --jobs=48
+ cleanup
++ stat -c %u:%g /kernel
+ chown -R 1003:1003 /kernel



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory
  2024-08-28  2:48 ` [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory Matthew Brost
@ 2024-08-28 14:31   ` Daniel Vetter
  2024-08-28 14:46     ` Christian König
  2024-08-30  5:00     ` Matthew Brost
  2024-08-28 18:50   ` Daniel Vetter
                     ` (6 subsequent siblings)
  7 siblings, 2 replies; 100+ messages in thread
From: Daniel Vetter @ 2024-08-28 14:31 UTC (permalink / raw)
  To: Matthew Brost
  Cc: intel-xe, dri-devel, airlied, christian.koenig, thomas.hellstrom,
	matthew.auld, daniel

On Tue, Aug 27, 2024 at 07:48:38PM -0700, Matthew Brost wrote:
> +		if (!ctx->mmap_locked) {
> +			/*
> +			 * XXX: HMM locking document indicates only a read-lock
> +			 * is required but there apears to be a window between
> +			 * the MMU_NOTIFY_MIGRATE event triggered in a CPU fault
> +			 * via migrate_vma_setup and the pages actually moving
> +			 * in migrate_vma_finalize in which this code can grab
> +			 * garbage pages. Grabbing the write-lock if the range
> +			 * is attached to vram appears to protect against this
> +			 * race.
> +			 */

This one is really scary, since it means the entire migrate pte trickery
is essentially completely busted. Grabbing the mmap write lock just means
you block out pretty much everything interesting from concurrently
happening.

My gut feeling says we need to figure out what's happening here, because
this looks a bit too fundamental to me.
-Sima


> +			if (vram_pages)
> +				mmap_write_lock(mm);
> +			else
> +				mmap_read_lock(mm);
> +		}
> +		err = hmm_range_fault(&hmm_range);
> +		if (!ctx->mmap_locked) {
> +			if (vram_pages)
> +				mmap_write_unlock(mm);
> +			else
> +				mmap_read_unlock(mm);
> +		}
> +
> +		if (err == -EBUSY) {
> +			if (time_after(jiffies, timeout))
> +				break;
> +
> +			hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
> +			continue;
> +		}
> +		break;
> +	}
> +	if (!ctx->mmap_locked)
> +		mmput(mm);
> +	if (err)
> +		goto err_free;
> +
> +	pages = (struct page **)pfns;
> +
> +	if (ctx->prefault) {
> +		range->pages = pages;
> +		goto set_seqno;
> +	}
> +
> +map_pages:
> +	if (is_device_private_page(hmm_pfn_to_page(pfns[0]))) {
> +		WARN_ON_ONCE(!range->vram_allocation);
> +
> +		for (i = 0; i < npages; ++i) {
> +			pages[i] = hmm_pfn_to_page(pfns[i]);
> +
> +			if (WARN_ON_ONCE(!is_device_private_page(pages[i]))) {
> +				err = -EOPNOTSUPP;
> +				goto err_free;
> +			}
> +		}
> +
> +		/* Do not race with notifier unmapping pages */
> +		drm_gpusvm_notifier_lock(gpusvm);
> +		range->flags.has_vram_pages = true;
> +		range->pages = pages;
> +		if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
> +			err = -EAGAIN;
> +			__drm_gpusvm_range_unmap_pages(gpusvm, range);
> +		}
> +		drm_gpusvm_notifier_unlock(gpusvm);
> +	} else {
> +		dma_addr_t *dma_addr = (dma_addr_t *)pfns;
> +
> +		for_each_dma_page(i, j, npages, order) {
> +			if (WARN_ON_ONCE(i && order !=
> +					 hmm_pfn_to_map_order(pfns[i]))) {
> +				err = -EOPNOTSUPP;
> +				npages = i;
> +				goto err_unmap;
> +			}
> +			order = hmm_pfn_to_map_order(pfns[i]);
> +
> +			pages[j] = hmm_pfn_to_page(pfns[i]);
> +			if (WARN_ON_ONCE(is_zone_device_page(pages[j]))) {
> +				err = -EOPNOTSUPP;
> +				npages = i;
> +				goto err_unmap;
> +			}
> +
> +			set_page_dirty_lock(pages[j]);
> +			mark_page_accessed(pages[j]);
> +
> +			dma_addr[j] = dma_map_page(gpusvm->drm->dev,
> +						   pages[j], 0,
> +						   PAGE_SIZE << order,
> +						   DMA_BIDIRECTIONAL);
> +			if (dma_mapping_error(gpusvm->drm->dev, dma_addr[j])) {
> +				err = -EFAULT;
> +				npages = i;
> +				goto err_unmap;
> +			}
> +		}
> +
> +		/* Huge pages, reduce memory footprint */
> +		if (order) {
> +			dma_addr = kmalloc_array(j, sizeof(*dma_addr),
> +						 GFP_KERNEL);
> +			if (dma_addr) {
> +				for (i = 0; i < j; ++i)
> +					dma_addr[i] = (dma_addr_t)pfns[i];
> +				kvfree(pfns);
> +				kfree_mapping = true;
> +			} else {
> +				dma_addr = (dma_addr_t *)pfns;
> +			}
> +		}
> +
> +		/* Do not race with notifier unmapping pages */
> +		drm_gpusvm_notifier_lock(gpusvm);
> +		range->order = order;
> +		range->flags.kfree_mapping = kfree_mapping;
> +		range->flags.has_dma_mapping = true;
> +		range->dma_addr = dma_addr;
> +		range->vram_allocation = NULL;
> +		if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
> +			err = -EAGAIN;
> +			__drm_gpusvm_range_unmap_pages(gpusvm, range);
> +		}
> +		drm_gpusvm_notifier_unlock(gpusvm);
> +	}
> +
> +	if (err == -EAGAIN)
> +		goto retry;
> +set_seqno:
> +	range->notifier_seq = hmm_range.notifier_seq;
> +
> +	return 0;
> +
> +err_unmap:
> +	for_each_dma_page(i, j, npages, order)
> +		dma_unmap_page(gpusvm->drm->dev,
> +			       (dma_addr_t)pfns[j],
> +			       PAGE_SIZE << order, DMA_BIDIRECTIONAL);
> +err_free:
> +	if (alloc_pfns)
> +		kvfree(pfns);
> +err_out:
> +	return err;
> +}
> +
> +/**
> + * drm_gpusvm_range_unmap_pages - Unmap pages associated with a GPU SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + * @ctx: GPU SVM context
> + *
> + * This function unmaps pages associated with a GPU SVM range. If @in_notifier
> + * is set, it is assumed that gpusvm->notifier_lock is held in write mode; if it
> + * is clear, it acquires gpusvm->notifier_lock in read mode. Must be called on
> + * each GPU SVM range attached to notifier in gpusvm->ops->invalidate for IOMMU
> + * security model.
> + */
> +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> +				  struct drm_gpusvm_range *range,
> +				  const struct drm_gpusvm_ctx *ctx)
> +{
> +	if (ctx->in_notifier)
> +		lockdep_assert_held_write(&gpusvm->notifier_lock);
> +	else
> +		drm_gpusvm_notifier_lock(gpusvm);
> +
> +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> +
> +	if (!ctx->in_notifier)
> +		drm_gpusvm_notifier_unlock(gpusvm);
> +}
> +
> +/**
> + * drm_gpusvm_migration_put_page - Put a migration page
> + * @page: Pointer to the page to put
> + *
> + * This function unlocks and puts a page.
> + */
> +static void drm_gpusvm_migration_put_page(struct page *page)
> +{
> +	unlock_page(page);
> +	put_page(page);
> +}
> +
> +/**
> + * drm_gpusvm_migration_put_pages - Put migration pages
> + * @npages: Number of pages
> + * @migrate_pfn: Array of migrate page frame numbers
> + *
> + * This function puts an array of pages.
> + */
> +static void drm_gpusvm_migration_put_pages(unsigned long npages,
> +					   unsigned long *migrate_pfn)
> +{
> +	unsigned long i;
> +
> +	for (i = 0; i < npages; ++i) {
> +		if (!migrate_pfn[i])
> +			continue;
> +
> +		drm_gpusvm_migration_put_page(migrate_pfn_to_page(migrate_pfn[i]));
> +		migrate_pfn[i] = 0;
> +	}
> +}
> +
> +/**
> + * drm_gpusvm_get_vram_page - Get a reference to a VRAM page
> + * @page: Pointer to the page
> + * @zdd: Pointer to the GPU SVM zone device data
> + *
> + * This function associates the given page with the specified GPU SVM zone
> + * device data and initializes it for zone device usage.
> + */
> +static void drm_gpusvm_get_vram_page(struct page *page,
> +				     struct drm_gpusvm_zdd *zdd)
> +{
> +	page->zone_device_data = drm_gpusvm_zdd_get(zdd);
> +	zone_device_page_init(page);
> +}
> +
> +/**
> + * drm_gpusvm_migrate_map_pages() - Map migration pages for GPU SVM migration
> + * @dev: The device for which the pages are being mapped
> + * @dma_addr: Array to store DMA addresses corresponding to mapped pages
> + * @migrate_pfn: Array of migrate page frame numbers to map
> + * @npages: Number of pages to map
> + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> + *
> + * This function maps pages of memory for migration usage in GPU SVM. It
> + * iterates over each page frame number provided in @migrate_pfn, maps the
> + * corresponding page, and stores the DMA address in the provided @dma_addr
> + * array.
> + *
> + * Return: 0 on success, -EFAULT if an error occurs during mapping.
> + */
> +static int drm_gpusvm_migrate_map_pages(struct device *dev,
> +					dma_addr_t *dma_addr,
> +					long unsigned int *migrate_pfn,
> +					unsigned long npages,
> +					enum dma_data_direction dir)
> +{
> +	unsigned long i;
> +
> +	for (i = 0; i < npages; ++i) {
> +		struct page *page = migrate_pfn_to_page(migrate_pfn[i]);
> +
> +		if (!page)
> +			continue;
> +
> +		if (WARN_ON_ONCE(is_zone_device_page(page)))
> +			return -EFAULT;
> +
> +		dma_addr[i] = dma_map_page(dev, page, 0, PAGE_SIZE, dir);
> +		if (dma_mapping_error(dev, dma_addr[i]))
> +			return -EFAULT;
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * drm_gpusvm_migrate_unmap_pages() - Unmap pages previously mapped for GPU SVM migration
> + * @dev: The device for which the pages were mapped
> + * @dma_addr: Array of DMA addresses corresponding to mapped pages
> + * @npages: Number of pages to unmap
> + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> + *
> + * This function unmaps previously mapped pages of memory for GPU Shared Virtual
> + * Memory (SVM). It iterates over each DMA address provided in @dma_addr, checks
> + * if it's valid and not already unmapped, and unmaps the corresponding page.
> + */
> +static void drm_gpusvm_migrate_unmap_pages(struct device *dev,
> +					   dma_addr_t *dma_addr,
> +					   unsigned long npages,
> +					   enum dma_data_direction dir)
> +{
> +	unsigned long i;
> +
> +	for (i = 0; i < npages; ++i) {
> +		if (!dma_addr[i] || dma_mapping_error(dev, dma_addr[i]))
> +			continue;
> +
> +		dma_unmap_page(dev, dma_addr[i], PAGE_SIZE, dir);
> +	}
> +}
> +
> +/**
> + * drm_gpusvm_migrate_to_vram - Migrate GPU SVM range to VRAM
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + * @vram_allocation: Driver-private pointer to the VRAM allocation. The caller
> + *                   should hold a reference to the VRAM allocation, which
> + *                   should be dropped via ops->vram_release or upon the
> + *                   failure of this function.
> + * @ctx: GPU SVM context
> + *
> + * This function migrates the specified GPU SVM range to VRAM. It performs the
> + * necessary setup and invokes the driver-specific operations for migration to
> + * VRAM. Upon successful return, @vram_allocation can safely reference @range
> + * until ops->vram_release is called, which happens only upon successful return.
> + *
> + * Returns:
> + * 0 on success, negative error code on failure.
> + */
> +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> +			       struct drm_gpusvm_range *range,
> +			       void *vram_allocation,
> +			       const struct drm_gpusvm_ctx *ctx)
> +{
> +	u64 start = range->va.start, end = range->va.end;
> +	struct migrate_vma migrate = {
> +		.start		= start,
> +		.end		= end,
> +		.pgmap_owner	= gpusvm->device_private_page_owner,
> +		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
> +	};
> +	struct mm_struct *mm = gpusvm->mm;
> +	unsigned long i, npages = npages_in_range(start, end);
> +	struct vm_area_struct *vas;
> +	struct drm_gpusvm_zdd *zdd = NULL;
> +	struct page **pages;
> +	dma_addr_t *dma_addr;
> +	void *buf;
> +	int err;
> +
> +	if (!range->flags.migrate_vram)
> +		return -EINVAL;
> +
> +	if (!gpusvm->ops->populate_vram_pfn || !gpusvm->ops->copy_to_vram ||
> +	    !gpusvm->ops->copy_to_sram)
> +		return -EOPNOTSUPP;
> +
> +	if (!ctx->mmap_locked) {
> +		if (!mmget_not_zero(mm)) {
> +			err = -EFAULT;
> +			goto err_out;
> +		}
> +		mmap_write_lock(mm);
> +	}
> +
> +	mmap_assert_locked(mm);
> +
> +	vas = vma_lookup(mm, start);
> +	if (!vas) {
> +		err = -ENOENT;
> +		goto err_mmunlock;
> +	}
> +
> +	if (end > vas->vm_end || start < vas->vm_start) {
> +		err = -EINVAL;
> +		goto err_mmunlock;
> +	}
> +
> +	if (!vma_is_anonymous(vas)) {
> +		err = -EBUSY;
> +		goto err_mmunlock;
> +	}
> +
> +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) + sizeof(*dma_addr) +
> +		       sizeof(*pages), GFP_KERNEL);
> +	if (!buf) {
> +		err = -ENOMEM;
> +		goto err_mmunlock;
> +	}
> +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr)) * npages;
> +
> +	zdd = drm_gpusvm_zdd_alloc(range);
> +	if (!zdd) {
> +		err = -ENOMEM;
> +		goto err_free;
> +	}
> +
> +	migrate.vma = vas;
> +	migrate.src = buf;
> +	migrate.dst = migrate.src + npages;
> +
> +	err = migrate_vma_setup(&migrate);
> +	if (err)
> +		goto err_free;
> +
> +	/*
> +	 * FIXME: Below cases, !migrate.cpages and migrate.cpages != npages, not
> +	 * always an error. Need to revisit possible cases and how to handle. We
> +	 * could prefault on migrate.cpages != npages via hmm_range_fault.
> +	 */
> +
> +	if (!migrate.cpages) {
> +		err = -EFAULT;
> +		goto err_free;
> +	}
> +
> +	if (migrate.cpages != npages) {
> +		err = -EBUSY;
> +		goto err_finalize;
> +	}
> +
> +	err = gpusvm->ops->populate_vram_pfn(gpusvm, vram_allocation, npages,
> +					     migrate.dst);
> +	if (err)
> +		goto err_finalize;
> +
> +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev, dma_addr,
> +					   migrate.src, npages, DMA_TO_DEVICE);
> +	if (err)
> +		goto err_finalize;
> +
> +	for (i = 0; i < npages; ++i) {
> +		struct page *page = pfn_to_page(migrate.dst[i]);
> +
> +		pages[i] = page;
> +		migrate.dst[i] = migrate_pfn(migrate.dst[i]);
> +		drm_gpusvm_get_vram_page(page, zdd);
> +	}
> +
> +	err = gpusvm->ops->copy_to_vram(gpusvm, pages, dma_addr, npages);
> +	if (err)
> +		goto err_finalize;
> +
> +	/* Upon success bind vram allocation to range and zdd */
> +	range->vram_allocation = vram_allocation;
> +	WRITE_ONCE(zdd->vram_allocation, vram_allocation);	/* Owns ref */
> +
> +err_finalize:
> +	if (err)
> +		drm_gpusvm_migration_put_pages(npages, migrate.dst);
> +	migrate_vma_pages(&migrate);
> +	migrate_vma_finalize(&migrate);
> +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr, npages,
> +				       DMA_TO_DEVICE);
> +err_free:
> +	if (zdd)
> +		drm_gpusvm_zdd_put(zdd);
> +	kvfree(buf);
> +err_mmunlock:
> +	if (!ctx->mmap_locked) {
> +		mmap_write_unlock(mm);
> +		mmput(mm);
> +	}
> +err_out:
> +	return err;
> +}
> +
> +/**
> + * drm_gpusvm_migrate_populate_sram_pfn - Populate SRAM PFNs for a VM area
> + * @vas: Pointer to the VM area structure, can be NULL
> + * @npages: Number of pages to populate
> + * @src_mpfn: Source array of migrate PFNs
> + * @mpfn: Array of migrate PFNs to populate
> + * @addr: Start address for PFN allocation
> + *
> + * This function populates the SRAM migrate page frame numbers (PFNs) for the
> + * specified VM area structure. It allocates and locks pages in the VM area for
> + * SRAM usage. If vas is non-NULL use alloc_page_vma for allocation, if NULL use
> + * alloc_page for allocation.
> + *
> + * Returns:
> + * 0 on success, negative error code on failure.
> + */
> +static int drm_gpusvm_migrate_populate_sram_pfn(struct vm_area_struct *vas,
> +						unsigned long npages,
> +						unsigned long *src_mpfn,
> +						unsigned long *mpfn, u64 addr)
> +{
> +	unsigned long i;
> +
> +	for (i = 0; i < npages; ++i, addr += PAGE_SIZE) {
> +		struct page *page;
> +
> +		if (!(src_mpfn[i] & MIGRATE_PFN_MIGRATE))
> +			continue;
> +
> +		if (vas)
> +			page = alloc_page_vma(GFP_HIGHUSER, vas, addr);
> +		else
> +			page = alloc_page(GFP_HIGHUSER);
> +
> +		if (!page)
> +			return -ENOMEM;
> +
> +		lock_page(page);
> +		mpfn[i] = migrate_pfn(page_to_pfn(page));
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * drm_gpusvm_evict_to_sram - Evict GPU SVM range to SRAM
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + *
> + * Similar to __drm_gpusvm_migrate_to_sram but does not require mmap lock and
> + * migration done via migrate_device_* functions. Fallback path as it is
> + * preferred to issue migrations with mmap lock.
> + *
> + * Returns:
> + * 0 on success, negative error code on failure.
> + */
> +static int drm_gpusvm_evict_to_sram(struct drm_gpusvm *gpusvm,
> +				    struct drm_gpusvm_range *range)
> +{
> +	unsigned long npages;
> +	struct page **pages;
> +	unsigned long *src, *dst;
> +	dma_addr_t *dma_addr;
> +	void *buf;
> +	int i, err = 0;
> +
> +	npages = npages_in_range(range->va.start, range->va.end);
> +
> +	buf = kvcalloc(npages, 2 * sizeof(*src) + sizeof(*dma_addr) +
> +		       sizeof(*pages), GFP_KERNEL);
> +	if (!buf) {
> +		err = -ENOMEM;
> +		goto err_out;
> +	}
> +	src = buf;
> +	dst = buf + (sizeof(*src) * npages);
> +	dma_addr = buf + (2 * sizeof(*src) * npages);
> +	pages = buf + (2 * sizeof(*src) + sizeof(*dma_addr)) * npages;
> +
> +	err = gpusvm->ops->populate_vram_pfn(gpusvm, range->vram_allocation,
> +					     npages, src);
> +	if (err)
> +		goto err_free;
> +
> +	err = migrate_device_vma_range(gpusvm->mm,
> +				       gpusvm->device_private_page_owner, src,
> +				       npages, range->va.start);
> +	if (err)
> +		goto err_free;
> +
> +	err = drm_gpusvm_migrate_populate_sram_pfn(NULL, npages, src, dst, 0);
> +	if (err)
> +		goto err_finalize;
> +
> +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev, dma_addr,
> +					   dst, npages, DMA_BIDIRECTIONAL);
> +	if (err)
> +		goto err_finalize;
> +
> +	for (i = 0; i < npages; ++i)
> +		pages[i] = migrate_pfn_to_page(src[i]);
> +
> +	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr, npages);
> +	if (err)
> +		goto err_finalize;
> +
> +err_finalize:
> +	if (err)
> +		drm_gpusvm_migration_put_pages(npages, dst);
> +	migrate_device_pages(src, dst, npages);
> +	migrate_device_finalize(src, dst, npages);
> +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr, npages,
> +				       DMA_BIDIRECTIONAL);
> +err_free:
> +	kvfree(buf);
> +err_out:
> +
> +	return err;
> +}
> +
> +/**
> + * __drm_gpusvm_migrate_to_sram - Migrate GPU SVM range to SRAM (internal)
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @vas: Pointer to the VM area structure
> + * @page: Pointer to the page for fault handling (can be NULL)
> + * @start: Start address of the migration range
> + * @end: End address of the migration range
> + *
> + * This internal function performs the migration of the specified GPU SVM range
> + * to SRAM. It sets up the migration, populates + dma maps SRAM PFNs, and
> + * invokes the driver-specific operations for migration to SRAM.
> + *
> + * Returns:
> + * 0 on success, negative error code on failure.
> + */
> +static int __drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> +					struct vm_area_struct *vas,
> +					struct page *page,
> +					u64 start, u64 end)
> +{
> +	struct migrate_vma migrate = {
> +		.vma		= vas,
> +		.pgmap_owner	= gpusvm->device_private_page_owner,
> +		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> +		.fault_page	= page,
> +	};
> +	unsigned long npages;
> +	struct page **pages;
> +	dma_addr_t *dma_addr;
> +	void *buf;
> +	int i, err = 0;
> +
> +	mmap_assert_locked(gpusvm->mm);
> +
> +	/* Corner where VMA area struct has been partially unmapped */
> +	if (start < vas->vm_start)
> +		start = vas->vm_start;
> +	if (end > vas->vm_end)
> +		end = vas->vm_end;
> +
> +	migrate.start = start;
> +	migrate.end = end;
> +	npages = npages_in_range(start, end);
> +
> +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) + sizeof(*dma_addr) +
> +		       sizeof(*pages), GFP_KERNEL);
> +	if (!buf) {
> +		err = -ENOMEM;
> +		goto err_out;
> +	}
> +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr)) * npages;
> +
> +	migrate.vma = vas;
> +	migrate.src = buf;
> +	migrate.dst = migrate.src + npages;
> +
> +	err = migrate_vma_setup(&migrate);
> +	if (err)
> +		goto err_free;
> +
> +	/* Raced with another CPU fault, nothing to do */
> +	if (!migrate.cpages)
> +		goto err_free;
> +
> +	err = drm_gpusvm_migrate_populate_sram_pfn(vas, npages,
> +						   migrate.src, migrate.dst,
> +						   start);
> +	if (err)
> +		goto err_finalize;
> +
> +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev, dma_addr,
> +					   migrate.dst, npages,
> +					   DMA_BIDIRECTIONAL);
> +	if (err)
> +		goto err_finalize;
> +
> +	for (i = 0; i < npages; ++i)
> +		pages[i] = migrate_pfn_to_page(migrate.src[i]);
> +
> +	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr, npages);
> +	if (err)
> +		goto err_finalize;
> +
> +err_finalize:
> +	if (err)
> +		drm_gpusvm_migration_put_pages(npages, migrate.dst);
> +	migrate_vma_pages(&migrate);
> +	migrate_vma_finalize(&migrate);
> +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr, npages,
> +				       DMA_BIDIRECTIONAL);
> +err_free:
> +	kvfree(buf);
> +err_out:
> +	mmap_assert_locked(gpusvm->mm);
> +
> +	return err;
> +}
> +
> +/**
> + * drm_gpusvm_migrate_to_sram - Migrate (evict) GPU SVM range to SRAM
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + * @ctx: GPU SVM context
> + *
> + * This function initiates the migration of the specified GPU SVM range to
> + * SRAM. It performs necessary checks and invokes the internal migration
> + * function for actual migration.
> + *
> + * Returns:
> + * 0 on success, negative error code on failure.
> + */
> +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> +			       struct drm_gpusvm_range *range,
> +			       const struct drm_gpusvm_ctx *ctx)
> +{
> +	u64 start = range->va.start, end = range->va.end;
> +	struct mm_struct *mm = gpusvm->mm;
> +	struct vm_area_struct *vas;
> +	int err;
> +	bool retry = false;
> +
> +	if (!ctx->mmap_locked) {
> +		if (!mmget_not_zero(mm)) {
> +			err = -EFAULT;
> +			goto err_out;
> +		}
> +		if (ctx->trylock_mmap) {
> +			if (!mmap_read_trylock(mm))  {
> +				err = drm_gpusvm_evict_to_sram(gpusvm, range);
> +				goto err_mmput;
> +			}
> +		} else {
> +			mmap_read_lock(mm);
> +		}
> +	}
> +
> +	mmap_assert_locked(mm);
> +
> +	/*
> +	 * Loop required to find all VMA area structs for the corner case when
> +	 * VRAM backing has been partially unmapped from MM's address space.
> +	 */
> +again:
> +	vas = find_vma(mm, start);
> +	if (!vas) {
> +		if (!retry)
> +			err = -ENOENT;
> +		goto err_mmunlock;
> +	}
> +
> +	if (end <= vas->vm_start || start >= vas->vm_end) {
> +		if (!retry)
> +			err = -EINVAL;
> +		goto err_mmunlock;
> +	}
> +
> +	err = __drm_gpusvm_migrate_to_sram(gpusvm, vas, NULL, start, end);
> +	if (err)
> +		goto err_mmunlock;
> +
> +	if (vas->vm_end < end) {
> +		retry = true;
> +		start = vas->vm_end;
> +		goto again;
> +	}
> +
> +	if (!ctx->mmap_locked) {
> +		mmap_read_unlock(mm);
> +		/*
> +		 * Using mmput_async as this function can be called while
> +		 * holding a dma-resv lock, and a final put can grab the mmap
> +		 * lock, causing a lock inversion.
> +		 */
> +		mmput_async(mm);
> +	}
> +
> +	return 0;
> +
> +err_mmunlock:
> +	if (!ctx->mmap_locked)
> +		mmap_read_unlock(mm);
> +err_mmput:
> +	if (!ctx->mmap_locked)
> +		mmput_async(mm);
> +err_out:
> +	return err;
> +}
> +
> +/**
> + * drm_gpusvm_page_free - Put GPU SVM zone device data associated with a page
> + * @page: Pointer to the page
> + *
> + * This function is a callback used to put the GPU SVM zone device data
> + * associated with a page when it is being released.
> + */
> +static void drm_gpusvm_page_free(struct page *page)
> +{
> +	drm_gpusvm_zdd_put(page->zone_device_data);
> +}
> +
> +/**
> + * drm_gpusvm_migrate_to_ram - Migrate GPU SVM range to RAM (page fault handler)
> + * @vmf: Pointer to the fault information structure
> + *
> + * This function is a page fault handler used to migrate a GPU SVM range to RAM.
> + * It retrieves the GPU SVM range information from the faulting page and invokes
> + * the internal migration function to migrate the range back to RAM.
> + *
> + * Returns:
> + * VM_FAULT_SIGBUS on failure, 0 on success.
> + */
> +static vm_fault_t drm_gpusvm_migrate_to_ram(struct vm_fault *vmf)
> +{
> +	struct drm_gpusvm_zdd *zdd = vmf->page->zone_device_data;
> +	int err;
> +
> +	err = __drm_gpusvm_migrate_to_sram(zdd->range->gpusvm,
> +					   vmf->vma, vmf->page,
> +					   zdd->range->va.start,
> +					   zdd->range->va.end);
> +
> +	return err ? VM_FAULT_SIGBUS : 0;
> +}
> +
> +/**
> + * drm_gpusvm_pagemap_ops - Device page map operations for GPU SVM
> + */
> +static const struct dev_pagemap_ops drm_gpusvm_pagemap_ops = {
> +	.page_free = drm_gpusvm_page_free,
> +	.migrate_to_ram = drm_gpusvm_migrate_to_ram,
> +};
> +
> +/**
> + * drm_gpusvm_pagemap_ops_get - Retrieve GPU SVM device page map operations
> + *
> + * Returns:
> + * Pointer to the GPU SVM device page map operations structure.
> + */
> +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void)
> +{
> +	return &drm_gpusvm_pagemap_ops;
> +}
> +
> +/**
> + * drm_gpusvm_has_mapping - Check if GPU SVM has mapping for the given address range
> + * @gpusvm: Pointer to the GPU SVM structure.
> + * @start: Start address
> + * @end: End address
> + *
> + * Returns:
> + * True if GPU SVM has mapping, False otherwise
> + */
> +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start, u64 end)
> +{
> +	struct drm_gpusvm_notifier *notifier;
> +
> +	drm_gpusvm_for_each_notifier(notifier, gpusvm, start, end) {
> +		struct drm_gpusvm_range *range = NULL;
> +
> +		drm_gpusvm_for_each_range(range, notifier, start, end)
> +			return true;
> +	}
> +
> +	return false;
> +}
> diff --git a/drivers/gpu/drm/xe/drm_gpusvm.h b/drivers/gpu/drm/xe/drm_gpusvm.h
> new file mode 100644
> index 000000000000..0ea70f8534a8
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/drm_gpusvm.h
> @@ -0,0 +1,415 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2024 Intel Corporation
> + */
> +
> +#ifndef __DRM_GPUSVM_H__
> +#define __DRM_GPUSVM_H__
> +
> +#include <linux/kref.h>
> +#include <linux/mmu_notifier.h>
> +#include <linux/workqueue.h>
> +
> +struct dev_pagemap_ops;
> +struct drm_device;
> +struct drm_gpusvm;
> +struct drm_gpusvm_notifier;
> +struct drm_gpusvm_ops;
> +struct drm_gpusvm_range;
> +
> +/**
> + * struct drm_gpusvm_ops - Operations structure for GPU SVM
> + *
> + * This structure defines the operations for GPU Shared Virtual Memory (SVM).
> + * These operations are provided by the GPU driver to manage SVM ranges and
> + * perform operations such as migration between VRAM and system RAM.
> + */
> +struct drm_gpusvm_ops {
> +	/**
> +	 * @notifier_alloc: Allocate a GPU SVM notifier (optional)
> +	 *
> +	 * This function shall allocate a GPU SVM notifier.
> +	 *
> +	 * Returns:
> +	 * Pointer to the allocated GPU SVM notifier on success, NULL on failure.
> +	 */
> +	struct drm_gpusvm_notifier *(*notifier_alloc)(void);
> +
> +	/**
> +	 * @notifier_free: Free a GPU SVM notifier (optional)
> +	 * @notifier: Pointer to the GPU SVM notifier to be freed
> +	 *
> +	 * This function shall free a GPU SVM notifier.
> +	 */
> +	void (*notifier_free)(struct drm_gpusvm_notifier *notifier);
> +
> +	/**
> +	 * @range_alloc: Allocate a GPU SVM range (optional)
> +	 * @gpusvm: Pointer to the GPU SVM
> +	 *
> +	 * This function shall allocate a GPU SVM range.
> +	 *
> +	 * Returns:
> +	 * Pointer to the allocated GPU SVM range on success, NULL on failure.
> +	 */
> +	struct drm_gpusvm_range *(*range_alloc)(struct drm_gpusvm *gpusvm);
> +
> +	/**
> +	 * @range_free: Free a GPU SVM range (optional)
> +	 * @range: Pointer to the GPU SVM range to be freed
> +	 *
> +	 * This function shall free a GPU SVM range.
> +	 */
> +	void (*range_free)(struct drm_gpusvm_range *range);
> +
> +	/**
> +	 * @vram_release: Release VRAM allocation (optional)
> +	 * @vram_allocation: Driver-private pointer to the VRAM allocation
> +	 *
> +	 * This function shall release VRAM allocation and expects to drop a
> +	 * reference to VRAM allocation.
> +	 */
> +	void (*vram_release)(void *vram_allocation);
> +
> +	/**
> +	 * @populate_vram_pfn: Populate VRAM PFN (required for migration)
> +	 * @gpusvm: Pointer to the GPU SVM
> +	 * @vram_allocation: Driver-private pointer to the VRAM allocation
> +	 * @npages: Number of pages to populate
> +	 * @pfn: Array of page frame numbers to populate
> +	 *
> +	 * This function shall populate VRAM page frame numbers (PFN).
> +	 *
> +	 * Returns:
> +	 * 0 on success, a negative error code on failure.
> +	 */
> +	int (*populate_vram_pfn)(struct drm_gpusvm *gpusvm,
> +				 void *vram_allocation,
> +				 unsigned long npages,
> +				 unsigned long *pfn);
> +
> +	/**
> +	 * @copy_to_vram: Copy to VRAM (required for migration)
> +	 * @gpusvm: Pointer to the GPU SVM
> +	 * @pages: Pointer to array of VRAM pages (destination)
> +	 * @dma_addr: Pointer to array of DMA addresses (source)
> +	 * @npages: Number of pages to copy
> +	 *
> +	 * This function shall copy pages to VRAM.
> +	 *
> +	 * Returns:
> +	 * 0 on success, a negative error code on failure.
> +	 */
> +	int (*copy_to_vram)(struct drm_gpusvm *gpusvm,
> +			    struct page **pages,
> +			    dma_addr_t *dma_addr,
> +			    unsigned long npages);
> +
> +	/**
> +	 * @copy_to_sram: Copy to system RAM (required for migration)
> +	 * @gpusvm: Pointer to the GPU SVM
> +	 * @pages: Pointer to array of VRAM pages (source)
> +	 * @dma_addr: Pointer to array of DMA addresses (destination)
> +	 * @npages: Number of pages to copy
> +	 *
> +	 * This function shall copy pages to system RAM.
> +	 *
> +	 * Returns:
> +	 * 0 on success, a negative error code on failure.
> +	 */
> +	int (*copy_to_sram)(struct drm_gpusvm *gpusvm,
> +			    struct page **pages,
> +			    dma_addr_t *dma_addr,
> +			    unsigned long npages);
> +
> +	/**
> +	 * @invalidate: Invalidate GPU SVM notifier (required)
> +	 * @gpusvm: Pointer to the GPU SVM
> +	 * @notifier: Pointer to the GPU SVM notifier
> +	 * @mmu_range: Pointer to the mmu_notifier_range structure
> +	 *
> +	 * This function shall invalidate the GPU page tables. It can safely
> +	 * walk the notifier range RB tree/list in this function. Called while
> +	 * holding the notifier lock.
> +	 */
> +	void (*invalidate)(struct drm_gpusvm *gpusvm,
> +			   struct drm_gpusvm_notifier *notifier,
> +			   const struct mmu_notifier_range *mmu_range);
> +};
> +
> +/**
> + * struct drm_gpusvm_notifier - Structure representing a GPU SVM notifier
> + *
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: MMU interval notifier
> + * @interval: Interval for the notifier
> + * @rb: Red-black tree node for the parent GPU SVM structure notifier tree
> + * @root: Cached root node of the RB tree containing ranges
> + * @range_list: List head of ranges in the same order they appear in the
> + *              interval tree. This is useful to keep iterating over ranges
> + *              while doing modifications to the RB tree.
> + * @flags.removed: Flag indicating whether the MMU interval notifier has been
> + *                 removed
> + *
> + * This structure represents a GPU SVM notifier.
> + */
> +struct drm_gpusvm_notifier {
> +	struct drm_gpusvm *gpusvm;
> +	struct mmu_interval_notifier notifier;
> +	struct {
> +		u64 start;
> +		u64 end;
> +	} interval;
> +	struct {
> +		struct rb_node node;
> +		struct list_head entry;
> +		u64 __subtree_last;
> +	} rb;
> +	struct rb_root_cached root;
> +	struct list_head range_list;
> +	struct {
> +		u32 removed : 1;
> +	} flags;
> +};
> +
> +/**
> + * struct drm_gpusvm_range - Structure representing a GPU SVM range
> + *
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: Pointer to the GPU SVM notifier
> + * @refcount: Reference count for the range
> + * @rb: Red-black tree node for the parent GPU SVM notifier structure range tree
> + * @va: Virtual address range
> + * @notifier_seq: Notifier sequence number of the range's pages
> + * @pages: Pointer to the array of pages (if backing store is in VRAM)
> + * @dma_addr: DMA address array (if backing store is SRAM and DMA mapped)
> + * @vram_allocation: Driver-private pointer to the VRAM allocation
> + * @order: Order of dma mapping. i.e. PAGE_SIZE << order is mapping size
> + * @flags.migrate_vram: Flag indicating whether the range can be migrated to VRAM
> + * @flags.unmapped: Flag indicating if the range has been unmapped
> + * @flags.partial_unmap: Flag indicating if the range has been partially unmapped
> + * @flags.has_vram_pages: Flag indicating if the range has vram pages
> + * @flags.has_dma_mapping: Flag indicating if the range has a DMA mapping
> + * @flags.kfree_mapping: Flag indicating @dma_addr is a compact allocation based
> + *                       on @order which releases via kfree
> + *
> + * This structure represents a GPU SVM range used for tracking memory ranges
> + * mapped in a DRM device.
> + */
> +struct drm_gpusvm_range {
> +	struct drm_gpusvm *gpusvm;
> +	struct drm_gpusvm_notifier *notifier;
> +	struct kref refcount;
> +	struct {
> +		struct rb_node node;
> +		struct list_head entry;
> +		u64 __subtree_last;
> +	} rb;
> +	struct {
> +		u64 start;
> +		u64 end;
> +	} va;
> +	unsigned long notifier_seq;
> +	union {
> +		struct page **pages;
> +		dma_addr_t *dma_addr;
> +	};
> +	void *vram_allocation;
> +	u16 order;
> +	struct {
> +		/* All flags below must be set upon creation */
> +		u16 migrate_vram : 1;
> +		/* All flags below must be set / cleared under notifier lock */
> +		u16 unmapped : 1;
> +		u16 partial_unmap : 1;
> +		u16 has_vram_pages : 1;
> +		u16 has_dma_mapping : 1;
> +		u16 kfree_mapping : 1;
> +	} flags;
> +};
> +
> +/**
> + * struct drm_gpusvm - GPU SVM structure
> + *
> + * @name: Name of the GPU SVM
> + * @drm: Pointer to the DRM device structure
> + * @mm: Pointer to the mm_struct for the address space
> + * @device_private_page_owner: Device private pages owner
> + * @mm_start: Start address of GPU SVM
> + * @mm_range: Range of the GPU SVM
> + * @notifier_size: Size of individual notifiers
> + * @ops: Pointer to the operations structure for GPU SVM
> + * @chunk_sizes: Pointer to the array of chunk sizes used in range allocation.
> + *               Entries should be powers of 2 in descending order.
> + * @num_chunks: Number of chunks
> + * @notifier_lock: Read-write semaphore for protecting notifier operations
> + * @zdd_wq: Workqueue for deferred work on zdd destruction
> + * @root: Cached root node of the Red-Black tree containing GPU SVM notifiers
> + * @notifier_list: List head of notifiers in the same order they appear in the
> + *                 interval tree. This is useful to keep iterating over
> + *                 notifiers while doing modifications to the RB tree.
> + *
> + * This structure represents a GPU SVM (Shared Virtual Memory) used for tracking
> + * memory ranges mapped in a DRM (Direct Rendering Manager) device.
> + *
> + * No reference counting is provided, as this is expected to be embedded in the
> + * driver VM structure along with the struct drm_gpuvm, which handles reference
> + * counting.
> + */
> +struct drm_gpusvm {
> +	const char *name;
> +	struct drm_device *drm;
> +	struct mm_struct *mm;
> +	void *device_private_page_owner;
> +	u64 mm_start;
> +	u64 mm_range;
> +	u64 notifier_size;
> +	const struct drm_gpusvm_ops *ops;
> +	const u64 *chunk_sizes;
> +	int num_chunks;
> +	struct rw_semaphore notifier_lock;
> +	struct workqueue_struct *zdd_wq;
> +	struct rb_root_cached root;
> +	struct list_head notifier_list;
> +};
> +
> +/**
> + * struct drm_gpusvm_ctx - DRM GPU SVM context
> + *
> + * @mmap_locked: mmap lock is locked
> + * @trylock_mmap: trylock mmap lock, used to avoid locking inversions
> + *                (e.g. dma-resv -> mmap lock)
> + * @in_notifier: entering from a MMU notifier
> + * @read_only: operating on read-only memory
> + * @vram_possible: possible to use VRAM
> + * @prefault: prefault pages
> + *
> + * Context that DRM GPUSVM is operating in (i.e. user arguments).
> + */
> +struct drm_gpusvm_ctx {
> +	u32 mmap_locked :1;
> +	u32 trylock_mmap :1;
> +	u32 in_notifier :1;
> +	u32 read_only :1;
> +	u32 vram_possible :1;
> +	u32 prefault :1;
> +};
> +
> +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> +		    const char *name, struct drm_device *drm,
> +		    struct mm_struct *mm, void *device_private_page_owner,
> +		    u64 mm_start, u64 mm_range, u64 notifier_size,
> +		    const struct drm_gpusvm_ops *ops,
> +		    const u64 *chunk_sizes, int num_chunks);
> +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm);
> +void drm_gpusvm_free(struct drm_gpusvm *gpusvm);
> +
> +struct drm_gpusvm_range *
> +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64 fault_addr,
> +				u64 gpuva_start, u64 gpuva_end,
> +				const struct drm_gpusvm_ctx *ctx);
> +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> +			     struct drm_gpusvm_range *range);
> +
> +struct drm_gpusvm_range *
> +drm_gpusvm_range_get(struct drm_gpusvm_range *range);
> +void drm_gpusvm_range_put(struct drm_gpusvm_range *range);
> +
> +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> +				  struct drm_gpusvm_range *range);
> +
> +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> +			       struct drm_gpusvm_range *range,
> +			       const struct drm_gpusvm_ctx *ctx);
> +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> +				  struct drm_gpusvm_range *range,
> +				  const struct drm_gpusvm_ctx *ctx);
> +
> +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> +			       struct drm_gpusvm_range *range,
> +			       void *vram_allocation,
> +			       const struct drm_gpusvm_ctx *ctx);
> +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> +			       struct drm_gpusvm_range *range,
> +			       const struct drm_gpusvm_ctx *ctx);
> +
> +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void);
> +
> +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start, u64 end);
> +
> +struct drm_gpusvm_range *
> +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64 start, u64 end);
> +
> +/**
> + * drm_gpusvm_notifier_lock - Lock GPU SVM notifier
> + * @gpusvm__: Pointer to the GPU SVM structure.
> + *
> + * Abstract client usage of the GPU SVM notifier lock: take the lock
> + */
> +#define drm_gpusvm_notifier_lock(gpusvm__)	\
> +	down_read(&(gpusvm__)->notifier_lock)
> +
> +/**
> + * drm_gpusvm_notifier_unlock - Unlock GPU SVM notifier
> + * @gpusvm__: Pointer to the GPU SVM structure.
> + *
> + * Abstract client usage of the GPU SVM notifier lock: drop the lock
> + */
> +#define drm_gpusvm_notifier_unlock(gpusvm__)	\
> +	up_read(&(gpusvm__)->notifier_lock)
> +
> +/**
> + * __drm_gpusvm_range_next - Get the next GPU SVM range in the list
> + * @range: a pointer to the current GPU SVM range
> + *
> + * Return: A pointer to the next drm_gpusvm_range if available, or NULL if the
> + *         current range is the last one or if the input range is NULL.
> + */
> +static inline struct drm_gpusvm_range *
> +__drm_gpusvm_range_next(struct drm_gpusvm_range *range)
> +{
> +	if (range && !list_is_last(&range->rb.entry,
> +				   &range->notifier->range_list))
> +		return list_next_entry(range, rb.entry);
> +
> +	return NULL;
> +}
> +
> +/**
> + * drm_gpusvm_for_each_range - Iterate over GPU SVM ranges in a notifier
> + * @range__: Iterator variable for the ranges. If set, it indicates the start of
> + *	     the iterator. If NULL, call drm_gpusvm_range_find() to get the range.
> + * @notifier__: Pointer to the GPU SVM notifier
> + * @start__: Start address of the range
> + * @end__: End address of the range
> + *
> + * This macro is used to iterate over GPU SVM ranges in a notifier. It is safe
> + * to use while holding the driver SVM lock or the notifier lock.
> + */
> +#define drm_gpusvm_for_each_range(range__, notifier__, start__, end__)	\
> +	for ((range__) = (range__) ?:					\
> +	     drm_gpusvm_range_find((notifier__), (start__), (end__));	\
> +	     (range__) && (range__->va.start < (end__));		\
> +	     (range__) = __drm_gpusvm_range_next(range__))
> +
> +/**
> + * drm_gpusvm_range_set_unmapped - Mark a GPU SVM range as unmapped
> + * @range: Pointer to the GPU SVM range structure.
> + * @mmu_range: Pointer to the MMU notifier range structure.
> + *
> + * This function marks a GPU SVM range as unmapped and sets the partial_unmap flag
> + * if the range partially falls within the provided MMU notifier range.
> + */
> +static inline void
> +drm_gpusvm_range_set_unmapped(struct drm_gpusvm_range *range,
> +			      const struct mmu_notifier_range *mmu_range)
> +{
> +	lockdep_assert_held_write(&range->gpusvm->notifier_lock);
> +
> +	range->flags.unmapped = true;
> +	if (range->va.start < mmu_range->start ||
> +	    range->va.end > mmu_range->end)
> +		range->flags.partial_unmap = true;
> +}
> +
> +#endif /* __DRM_GPUSVM_H__ */
> -- 
> 2.34.1
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory
  2024-08-28 14:31   ` Daniel Vetter
@ 2024-08-28 14:46     ` Christian König
  2024-08-28 15:43       ` Matthew Brost
  2024-08-30  5:00     ` Matthew Brost
  1 sibling, 1 reply; 100+ messages in thread
From: Christian König @ 2024-08-28 14:46 UTC (permalink / raw)
  To: Daniel Vetter, Matthew Brost
  Cc: intel-xe, dri-devel, airlied, thomas.hellstrom, matthew.auld,
	daniel

On 28.08.24 16:31, Daniel Vetter wrote:
> On Tue, Aug 27, 2024 at 07:48:38PM -0700, Matthew Brost wrote:
>> +		if (!ctx->mmap_locked) {
>> +			/*
>> +			 * XXX: HMM locking document indicates only a read-lock
>> +			 * is required but there appears to be a window between
>> +			 * the MMU_NOTIFY_MIGRATE event triggered in a CPU fault
>> +			 * via migrate_vma_setup and the pages actually moving
>> +			 * in migrate_vma_finalize in which this code can grab
>> +			 * garbage pages. Grabbing the write-lock if the range
>> +			 * is attached to vram appears to protect against this
>> +			 * race.
>> +			 */
> This one is really scary, since it means the entire migrate pte trickery
> is essentially completely busted. Grabbing the mmap write lock just means
> you block out pretty much everything interesting from concurrently
> happening.
>
> My gut feeling says we need to figure out what's happening here, because
> this looks a bit too fundamental to me.

I think I have at least a high-level understanding of what's going on here;
Felix and especially Philip should know more of the details.

In general, grabbing the mm_lock to protect PTEs from changing is complete
nonsense. The mm_lock is there to protect the VMAs and *not* the PTEs!

Even with the write side of the mm_lock taken, it is perfectly possible
that PTEs change. It's just less likely.

We ran into multiple issues as well before we figured out this important
distinction.
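
What does provide PTE-level stability is the interval-notifier sequence: any
PTE change is funnelled through the driver's ->invalidate() callback, which
bumps the sequence under the driver's own lock so that a later
mmu_interval_read_retry() fails. Roughly, as a sketch (all my_* names are
made up, not the patch's code):

#include <linux/mmu_notifier.h>
#include <linux/rwsem.h>

struct my_range {
	struct mmu_interval_notifier notifier;
	struct rw_semaphore *notifier_lock;	/* driver-side lock */
};

static void my_zap_gpu_ptes(struct my_range *r, unsigned long start,
			    unsigned long end);	/* placeholder */

static bool my_invalidate(struct mmu_interval_notifier *mni,
			  const struct mmu_notifier_range *range,
			  unsigned long cur_seq)
{
	struct my_range *r = container_of(mni, struct my_range, notifier);

	if (!mmu_notifier_range_blockable(range))
		return false;

	down_write(r->notifier_lock);
	/* Bump the sequence so any in-flight mmu_interval_read_retry() fails */
	mmu_interval_set_seq(mni, cur_seq);
	my_zap_gpu_ptes(r, range->start, range->end);
	up_write(r->notifier_lock);

	return true;
}

static const struct mmu_interval_notifier_ops my_notifier_ops = {
	.invalidate = my_invalidate,
};

With that in place the mmap lock mode is irrelevant for PTE stability; taking
the write lock only narrows the race window, it doesn't close it.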

Christian.

> -Sima
>
>
>> +			if (vram_pages)
>> +				mmap_write_lock(mm);
>> +			else
>> +				mmap_read_lock(mm);
>> +		}
>> +		err = hmm_range_fault(&hmm_range);
>> +		if (!ctx->mmap_locked) {
>> +			if (vram_pages)
>> +				mmap_write_unlock(mm);
>> +			else
>> +				mmap_read_unlock(mm);
>> +		}
>> +
>> +		if (err == -EBUSY) {
>> +			if (time_after(jiffies, timeout))
>> +				break;
>> +
>> +			hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
>> +			continue;
>> +		}
>> +		break;
>> +	}
>> +	if (!ctx->mmap_locked)
>> +		mmput(mm);
>> +	if (err)
>> +		goto err_free;
>> +
>> +	pages = (struct page **)pfns;
>> +
>> +	if (ctx->prefault) {
>> +		range->pages = pages;
>> +		goto set_seqno;
>> +	}
>> +
>> +map_pages:
>> +	if (is_device_private_page(hmm_pfn_to_page(pfns[0]))) {
>> +		WARN_ON_ONCE(!range->vram_allocation);
>> +
>> +		for (i = 0; i < npages; ++i) {
>> +			pages[i] = hmm_pfn_to_page(pfns[i]);
>> +
>> +			if (WARN_ON_ONCE(!is_device_private_page(pages[i]))) {
>> +				err = -EOPNOTSUPP;
>> +				goto err_free;
>> +			}
>> +		}
>> +
>> +		/* Do not race with notifier unmapping pages */
>> +		drm_gpusvm_notifier_lock(gpusvm);
>> +		range->flags.has_vram_pages = true;
>> +		range->pages = pages;
>> +		if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
>> +			err = -EAGAIN;
>> +			__drm_gpusvm_range_unmap_pages(gpusvm, range);
>> +		}
>> +		drm_gpusvm_notifier_unlock(gpusvm);
>> +	} else {
>> +		dma_addr_t *dma_addr = (dma_addr_t *)pfns;
>> +
>> +		for_each_dma_page(i, j, npages, order) {
>> +			if (WARN_ON_ONCE(i && order !=
>> +					 hmm_pfn_to_map_order(pfns[i]))) {
>> +				err = -EOPNOTSUPP;
>> +				npages = i;
>> +				goto err_unmap;
>> +			}
>> +			order = hmm_pfn_to_map_order(pfns[i]);
>> +
>> +			pages[j] = hmm_pfn_to_page(pfns[i]);
>> +			if (WARN_ON_ONCE(is_zone_device_page(pages[j]))) {
>> +				err = -EOPNOTSUPP;
>> +				npages = i;
>> +				goto err_unmap;
>> +			}
>> +
>> +			set_page_dirty_lock(pages[j]);
>> +			mark_page_accessed(pages[j]);
>> +
>> +			dma_addr[j] = dma_map_page(gpusvm->drm->dev,
>> +						   pages[j], 0,
>> +						   PAGE_SIZE << order,
>> +						   DMA_BIDIRECTIONAL);
>> +			if (dma_mapping_error(gpusvm->drm->dev, dma_addr[j])) {
>> +				err = -EFAULT;
>> +				npages = i;
>> +				goto err_unmap;
>> +			}
>> +		}
>> +
>> +		/* Huge pages, reduce memory footprint */
>> +		if (order) {
>> +			dma_addr = kmalloc_array(j, sizeof(*dma_addr),
>> +						 GFP_KERNEL);
>> +			if (dma_addr) {
>> +				for (i = 0; i < j; ++i)
>> +					dma_addr[i] = (dma_addr_t)pfns[i];
>> +				kvfree(pfns);
>> +				kfree_mapping = true;
>> +			} else {
>> +				dma_addr = (dma_addr_t *)pfns;
>> +			}
>> +		}
>> +
>> +		/* Do not race with notifier unmapping pages */
>> +		drm_gpusvm_notifier_lock(gpusvm);
>> +		range->order = order;
>> +		range->flags.kfree_mapping = kfree_mapping;
>> +		range->flags.has_dma_mapping = true;
>> +		range->dma_addr = dma_addr;
>> +		range->vram_allocation = NULL;
>> +		if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
>> +			err = -EAGAIN;
>> +			__drm_gpusvm_range_unmap_pages(gpusvm, range);
>> +		}
>> +		drm_gpusvm_notifier_unlock(gpusvm);
>> +	}
>> +
>> +	if (err == -EAGAIN)
>> +		goto retry;
>> +set_seqno:
>> +	range->notifier_seq = hmm_range.notifier_seq;
>> +
>> +	return 0;
>> +
>> +err_unmap:
>> +	for_each_dma_page(i, j, npages, order)
>> +		dma_unmap_page(gpusvm->drm->dev,
>> +			       (dma_addr_t)pfns[j],
>> +			       PAGE_SIZE << order, DMA_BIDIRECTIONAL);
>> +err_free:
>> +	if (alloc_pfns)
>> +		kvfree(pfns);
>> +err_out:
>> +	return err;
>> +}
>> +
>> +/**
>> + * drm_gpusvm_range_unmap_pages - Unmap pages associated with a GPU SVM range
>> + * @gpusvm: Pointer to the GPU SVM structure
>> + * @range: Pointer to the GPU SVM range structure
>> + * @ctx: GPU SVM context
>> + *
>> + * This function unmaps pages associated with a GPU SVM range. If @in_notifier
>> + * is set, it is assumed that gpusvm->notifier_lock is held in write mode; if it
>> + * is clear, it acquires gpusvm->notifier_lock in read mode. Must be called on
>> + * each GPU SVM range attached to notifier in gpusvm->ops->invalidate for IOMMU
>> + * security model.
>> + */
>> +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
>> +				  struct drm_gpusvm_range *range,
>> +				  const struct drm_gpusvm_ctx *ctx)
>> +{
>> +	if (ctx->in_notifier)
>> +		lockdep_assert_held_write(&gpusvm->notifier_lock);
>> +	else
>> +		drm_gpusvm_notifier_lock(gpusvm);
>> +
>> +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
>> +
>> +	if (!ctx->in_notifier)
>> +		drm_gpusvm_notifier_unlock(gpusvm);
>> +}
>> +
>> +/**
>> + * drm_gpusvm_migration_put_page - Put a migration page
>> + * @page: Pointer to the page to put
>> + *
>> + * This function unlocks and puts a page.
>> + */
>> +static void drm_gpusvm_migration_put_page(struct page *page)
>> +{
>> +	unlock_page(page);
>> +	put_page(page);
>> +}
>> +
>> +/**
>> + * drm_gpusvm_migration_put_pages - Put migration pages
>> + * @npages: Number of pages
>> + * @migrate_pfn: Array of migrate page frame numbers
>> + *
>> + * This function puts an array of pages.
>> + */
>> +static void drm_gpusvm_migration_put_pages(unsigned long npages,
>> +					   unsigned long *migrate_pfn)
>> +{
>> +	unsigned long i;
>> +
>> +	for (i = 0; i < npages; ++i) {
>> +		if (!migrate_pfn[i])
>> +			continue;
>> +
>> +		drm_gpusvm_migration_put_page(migrate_pfn_to_page(migrate_pfn[i]));
>> +		migrate_pfn[i] = 0;
>> +	}
>> +}
>> +
>> +/**
>> + * drm_gpusvm_get_vram_page - Get a reference to a VRAM page
>> + * @page: Pointer to the page
>> + * @zdd: Pointer to the GPU SVM zone device data
>> + *
>> + * This function associates the given page with the specified GPU SVM zone
>> + * device data and initializes it for zone device usage.
>> + */
>> +static void drm_gpusvm_get_vram_page(struct page *page,
>> +				     struct drm_gpusvm_zdd *zdd)
>> +{
>> +	page->zone_device_data = drm_gpusvm_zdd_get(zdd);
>> +	zone_device_page_init(page);
>> +}
>> +
>> +/**
>> + * drm_gpusvm_migrate_map_pages() - Map migration pages for GPU SVM migration
>> + * @dev: The device for which the pages are being mapped
>> + * @dma_addr: Array to store DMA addresses corresponding to mapped pages
>> + * @migrate_pfn: Array of migrate page frame numbers to map
>> + * @npages: Number of pages to map
>> + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
>> + *
>> + * This function maps pages of memory for migration usage in GPU SVM. It
>> + * iterates over each page frame number provided in @migrate_pfn, maps the
>> + * corresponding page, and stores the DMA address in the provided @dma_addr
>> + * array.
>> + *
>> + * Return: 0 on success, -EFAULT if an error occurs during mapping.
>> + */
>> +static int drm_gpusvm_migrate_map_pages(struct device *dev,
>> +					dma_addr_t *dma_addr,
>> +					long unsigned int *migrate_pfn,
>> +					unsigned long npages,
>> +					enum dma_data_direction dir)
>> +{
>> +	unsigned long i;
>> +
>> +	for (i = 0; i < npages; ++i) {
>> +		struct page *page = migrate_pfn_to_page(migrate_pfn[i]);
>> +
>> +		if (!page)
>> +			continue;
>> +
>> +		if (WARN_ON_ONCE(is_zone_device_page(page)))
>> +			return -EFAULT;
>> +
>> +		dma_addr[i] = dma_map_page(dev, page, 0, PAGE_SIZE, dir);
>> +		if (dma_mapping_error(dev, dma_addr[i]))
>> +			return -EFAULT;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +/**
>> + * drm_gpusvm_migrate_unmap_pages() - Unmap pages previously mapped for GPU SVM migration
>> + * @dev: The device for which the pages were mapped
>> + * @dma_addr: Array of DMA addresses corresponding to mapped pages
>> + * @npages: Number of pages to unmap
>> + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
>> + *
>> + * This function unmaps previously mapped pages of memory for GPU Shared Virtual
>> + * Memory (SVM). It iterates over each DMA address provided in @dma_addr, checks
>> + * if it's valid and not already unmapped, and unmaps the corresponding page.
>> + */
>> +static void drm_gpusvm_migrate_unmap_pages(struct device *dev,
>> +					   dma_addr_t *dma_addr,
>> +					   unsigned long npages,
>> +					   enum dma_data_direction dir)
>> +{
>> +	unsigned long i;
>> +
>> +	for (i = 0; i < npages; ++i) {
>> +		if (!dma_addr[i] || dma_mapping_error(dev, dma_addr[i]))
>> +			continue;
>> +
>> +		dma_unmap_page(dev, dma_addr[i], PAGE_SIZE, dir);
>> +	}
>> +}
>> +
>> +/**
>> + * drm_gpusvm_migrate_to_vram - Migrate GPU SVM range to VRAM
>> + * @gpusvm: Pointer to the GPU SVM structure
>> + * @range: Pointer to the GPU SVM range structure
>> + * @vram_allocation: Driver-private pointer to the VRAM allocation. The caller
>> + *                   should hold a reference to the VRAM allocation, which
>> + *                   should be dropped via ops->vram_release or upon the
>> + *                   failure of this function.
>> + * @ctx: GPU SVM context
>> + *
>> + * This function migrates the specified GPU SVM range to VRAM. It performs the
>> + * necessary setup and invokes the driver-specific operations for migration to
>> + * VRAM. Upon successful return, @vram_allocation can safely reference @range
>> + * until ops->vram_release is called, which happens only upon successful return.
>> + *
>> + * Returns:
>> + * 0 on success, negative error code on failure.
>> + */
>> +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
>> +			       struct drm_gpusvm_range *range,
>> +			       void *vram_allocation,
>> +			       const struct drm_gpusvm_ctx *ctx)
>> +{
>> +	u64 start = range->va.start, end = range->va.end;
>> +	struct migrate_vma migrate = {
>> +		.start		= start,
>> +		.end		= end,
>> +		.pgmap_owner	= gpusvm->device_private_page_owner,
>> +		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
>> +	};
>> +	struct mm_struct *mm = gpusvm->mm;
>> +	unsigned long i, npages = npages_in_range(start, end);
>> +	struct vm_area_struct *vas;
>> +	struct drm_gpusvm_zdd *zdd = NULL;
>> +	struct page **pages;
>> +	dma_addr_t *dma_addr;
>> +	void *buf;
>> +	int err;
>> +
>> +	if (!range->flags.migrate_vram)
>> +		return -EINVAL;
>> +
>> +	if (!gpusvm->ops->populate_vram_pfn || !gpusvm->ops->copy_to_vram ||
>> +	    !gpusvm->ops->copy_to_sram)
>> +		return -EOPNOTSUPP;
>> +
>> +	if (!ctx->mmap_locked) {
>> +		if (!mmget_not_zero(mm)) {
>> +			err = -EFAULT;
>> +			goto err_out;
>> +		}
>> +		mmap_write_lock(mm);
>> +	}
>> +
>> +	mmap_assert_locked(mm);
>> +
>> +	vas = vma_lookup(mm, start);
>> +	if (!vas) {
>> +		err = -ENOENT;
>> +		goto err_mmunlock;
>> +	}
>> +
>> +	if (end > vas->vm_end || start < vas->vm_start) {
>> +		err = -EINVAL;
>> +		goto err_mmunlock;
>> +	}
>> +
>> +	if (!vma_is_anonymous(vas)) {
>> +		err = -EBUSY;
>> +		goto err_mmunlock;
>> +	}
>> +
>> +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) + sizeof(*dma_addr) +
>> +		       sizeof(*pages), GFP_KERNEL);
>> +	if (!buf) {
>> +		err = -ENOMEM;
>> +		goto err_mmunlock;
>> +	}
>> +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
>> +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr)) * npages;
>> +
>> +	zdd = drm_gpusvm_zdd_alloc(range);
>> +	if (!zdd) {
>> +		err = -ENOMEM;
>> +		goto err_free;
>> +	}
>> +
>> +	migrate.vma = vas;
>> +	migrate.src = buf;
>> +	migrate.dst = migrate.src + npages;
>> +
>> +	err = migrate_vma_setup(&migrate);
>> +	if (err)
>> +		goto err_free;
>> +
>> +	/*
>> +	 * FIXME: Below cases, !migrate.cpages and migrate.cpages != npages, not
>> +	 * always an error. Need to revisit possible cases and how to handle. We
>> +	 * could prefault on migrate.cpages != npages via hmm_range_fault.
>> +	 */
>> +
>> +	if (!migrate.cpages) {
>> +		err = -EFAULT;
>> +		goto err_free;
>> +	}
>> +
>> +	if (migrate.cpages != npages) {
>> +		err = -EBUSY;
>> +		goto err_finalize;
>> +	}
>> +
>> +	err = gpusvm->ops->populate_vram_pfn(gpusvm, vram_allocation, npages,
>> +					     migrate.dst);
>> +	if (err)
>> +		goto err_finalize;
>> +
>> +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev, dma_addr,
>> +					   migrate.src, npages, DMA_TO_DEVICE);
>> +	if (err)
>> +		goto err_finalize;
>> +
>> +	for (i = 0; i < npages; ++i) {
>> +		struct page *page = pfn_to_page(migrate.dst[i]);
>> +
>> +		pages[i] = page;
>> +		migrate.dst[i] = migrate_pfn(migrate.dst[i]);
>> +		drm_gpusvm_get_vram_page(page, zdd);
>> +	}
>> +
>> +	err = gpusvm->ops->copy_to_vram(gpusvm, pages, dma_addr, npages);
>> +	if (err)
>> +		goto err_finalize;
>> +
>> +	/* Upon success bind vram allocation to range and zdd */
>> +	range->vram_allocation = vram_allocation;
>> +	WRITE_ONCE(zdd->vram_allocation, vram_allocation);	/* Owns ref */
>> +
>> +err_finalize:
>> +	if (err)
>> +		drm_gpusvm_migration_put_pages(npages, migrate.dst);
>> +	migrate_vma_pages(&migrate);
>> +	migrate_vma_finalize(&migrate);
>> +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr, npages,
>> +				       DMA_TO_DEVICE);
>> +err_free:
>> +	if (zdd)
>> +		drm_gpusvm_zdd_put(zdd);
>> +	kvfree(buf);
>> +err_mmunlock:
>> +	if (!ctx->mmap_locked) {
>> +		mmap_write_unlock(mm);
>> +		mmput(mm);
>> +	}
>> +err_out:
>> +	return err;
>> +}
>> +
>> +/**
>> + * drm_gpusvm_migrate_populate_sram_pfn - Populate SRAM PFNs for a VM area
>> + * @vas: Pointer to the VM area structure, can be NULL
>> + * @npages: Number of pages to populate
>> + * @src_mpfn: Source array of migrate PFNs
>> + * @mpfn: Array of migrate PFNs to populate
>> + * @addr: Start address for PFN allocation
>> + *
>> + * This function populates the SRAM migrate page frame numbers (PFNs) for the
>> + * specified VM area structure. It allocates and locks pages in the VM area for
>> + * SRAM usage. If vas is non-NULL use alloc_page_vma for allocation, if NULL use
>> + * alloc_page for allocation.
>> + *
>> + * Returns:
>> + * 0 on success, negative error code on failure.
>> + */
>> +static int drm_gpusvm_migrate_populate_sram_pfn(struct vm_area_struct *vas,
>> +						unsigned long npages,
>> +						unsigned long *src_mpfn,
>> +						unsigned long *mpfn, u64 addr)
>> +{
>> +	unsigned long i;
>> +
>> +	for (i = 0; i < npages; ++i, addr += PAGE_SIZE) {
>> +		struct page *page;
>> +
>> +		if (!(src_mpfn[i] & MIGRATE_PFN_MIGRATE))
>> +			continue;
>> +
>> +		if (vas)
>> +			page = alloc_page_vma(GFP_HIGHUSER, vas, addr);
>> +		else
>> +			page = alloc_page(GFP_HIGHUSER);
>> +
>> +		if (!page)
>> +			return -ENOMEM;
>> +
>> +		lock_page(page);
>> +		mpfn[i] = migrate_pfn(page_to_pfn(page));
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +/**
>> + * drm_gpusvm_evict_to_sram - Evict GPU SVM range to SRAM
>> + * @gpusvm: Pointer to the GPU SVM structure
>> + * @range: Pointer to the GPU SVM range structure
>> + *
>> + * Similar to __drm_gpusvm_migrate_to_sram but does not require mmap lock and
>> + * migration done via migrate_device_* functions. Fallback path as it is
>> + * preferred to issue migrations with mmap lock.
>> + *
>> + * Returns:
>> + * 0 on success, negative error code on failure.
>> + */
>> +static int drm_gpusvm_evict_to_sram(struct drm_gpusvm *gpusvm,
>> +				    struct drm_gpusvm_range *range)
>> +{
>> +	unsigned long npages;
>> +	struct page **pages;
>> +	unsigned long *src, *dst;
>> +	dma_addr_t *dma_addr;
>> +	void *buf;
>> +	int i, err = 0;
>> +
>> +	npages = npages_in_range(range->va.start, range->va.end);
>> +
>> +	buf = kvcalloc(npages, 2 * sizeof(*src) + sizeof(*dma_addr) +
>> +		       sizeof(*pages), GFP_KERNEL);
>> +	if (!buf) {
>> +		err = -ENOMEM;
>> +		goto err_out;
>> +	}
>> +	src = buf;
>> +	dst = buf + (sizeof(*src) * npages);
>> +	dma_addr = buf + (2 * sizeof(*src) * npages);
>> +	pages = buf + (2 * sizeof(*src) + sizeof(*dma_addr)) * npages;
>> +
>> +	err = gpusvm->ops->populate_vram_pfn(gpusvm, range->vram_allocation,
>> +					     npages, src);
>> +	if (err)
>> +		goto err_free;
>> +
>> +	err = migrate_device_vma_range(gpusvm->mm,
>> +				       gpusvm->device_private_page_owner, src,
>> +				       npages, range->va.start);
>> +	if (err)
>> +		goto err_free;
>> +
>> +	err = drm_gpusvm_migrate_populate_sram_pfn(NULL, npages, src, dst, 0);
>> +	if (err)
>> +		goto err_finalize;
>> +
>> +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev, dma_addr,
>> +					   dst, npages, DMA_BIDIRECTIONAL);
>> +	if (err)
>> +		goto err_finalize;
>> +
>> +	for (i = 0; i < npages; ++i)
>> +		pages[i] = migrate_pfn_to_page(src[i]);
>> +
>> +	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr, npages);
>> +	if (err)
>> +		goto err_finalize;
>> +
>> +err_finalize:
>> +	if (err)
>> +		drm_gpusvm_migration_put_pages(npages, dst);
>> +	migrate_device_pages(src, dst, npages);
>> +	migrate_device_finalize(src, dst, npages);
>> +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr, npages,
>> +				       DMA_BIDIRECTIONAL);
>> +err_free:
>> +	kvfree(buf);
>> +err_out:
>> +
>> +	return err;
>> +}
>> +
>> +/**
>> + * __drm_gpusvm_migrate_to_sram - Migrate GPU SVM range to SRAM (internal)
>> + * @gpusvm: Pointer to the GPU SVM structure
>> + * @vas: Pointer to the VM area structure
>> + * @page: Pointer to the page for fault handling (can be NULL)
>> + * @start: Start address of the migration range
>> + * @end: End address of the migration range
>> + *
>> + * This internal function performs the migration of the specified GPU SVM range
>> + * to SRAM. It sets up the migration, populates + dma maps SRAM PFNs, and
>> + * invokes the driver-specific operations for migration to SRAM.
>> + *
>> + * Returns:
>> + * 0 on success, negative error code on failure.
>> + */
>> +static int __drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
>> +					struct vm_area_struct *vas,
>> +					struct page *page,
>> +					u64 start, u64 end)
>> +{
>> +	struct migrate_vma migrate = {
>> +		.vma		= vas,
>> +		.pgmap_owner	= gpusvm->device_private_page_owner,
>> +		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
>> +		.fault_page	= page,
>> +	};
>> +	unsigned long npages;
>> +	struct page **pages;
>> +	dma_addr_t *dma_addr;
>> +	void *buf;
>> +	int i, err = 0;
>> +
>> +	mmap_assert_locked(gpusvm->mm);
>> +
>> +	/* Corner where VMA area struct has been partially unmapped */
>> +	if (start < vas->vm_start)
>> +		start = vas->vm_start;
>> +	if (end > vas->vm_end)
>> +		end = vas->vm_end;
>> +
>> +	migrate.start = start;
>> +	migrate.end = end;
>> +	npages = npages_in_range(start, end);
>> +
>> +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) + sizeof(*dma_addr) +
>> +		       sizeof(*pages), GFP_KERNEL);
>> +	if (!buf) {
>> +		err = -ENOMEM;
>> +		goto err_out;
>> +	}
>> +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
>> +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr)) * npages;
>> +
>> +	migrate.vma = vas;
>> +	migrate.src = buf;
>> +	migrate.dst = migrate.src + npages;
>> +
>> +	err = migrate_vma_setup(&migrate);
>> +	if (err)
>> +		goto err_free;
>> +
>> +	/* Raced with another CPU fault, nothing to do */
>> +	if (!migrate.cpages)
>> +		goto err_free;
>> +
>> +	err = drm_gpusvm_migrate_populate_sram_pfn(vas, npages,
>> +						   migrate.src, migrate.dst,
>> +						   start);
>> +	if (err)
>> +		goto err_finalize;
>> +
>> +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev, dma_addr,
>> +					   migrate.dst, npages,
>> +					   DMA_BIDIRECTIONAL);
>> +	if (err)
>> +		goto err_finalize;
>> +
>> +	for (i = 0; i < npages; ++i)
>> +		pages[i] = migrate_pfn_to_page(migrate.src[i]);
>> +
>> +	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr, npages);
>> +	if (err)
>> +		goto err_finalize;
>> +
>> +err_finalize:
>> +	if (err)
>> +		drm_gpusvm_migration_put_pages(npages, migrate.dst);
>> +	migrate_vma_pages(&migrate);
>> +	migrate_vma_finalize(&migrate);
>> +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr, npages,
>> +				       DMA_BIDIRECTIONAL);
>> +err_free:
>> +	kvfree(buf);
>> +err_out:
>> +	mmap_assert_locked(gpusvm->mm);
>> +
>> +	return err;
>> +}
>> +
>> +/**
>> + * drm_gpusvm_migrate_to_sram - Migrate (evict) GPU SVM range to SRAM
>> + * @gpusvm: Pointer to the GPU SVM structure
>> + * @range: Pointer to the GPU SVM range structure
>> + * @ctx: GPU SVM context
>> + *
>> + * This function initiates the migration of the specified GPU SVM range to
>> + * SRAM. It performs necessary checks and invokes the internal migration
>> + * function for actual migration.
>> + *
>> + * Returns:
>> + * 0 on success, negative error code on failure.
>> + */
>> +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
>> +			       struct drm_gpusvm_range *range,
>> +			       const struct drm_gpusvm_ctx *ctx)
>> +{
>> +	u64 start = range->va.start, end = range->va.end;
>> +	struct mm_struct *mm = gpusvm->mm;
>> +	struct vm_area_struct *vas;
>> +	int err;
>> +	bool retry = false;
>> +
>> +	if (!ctx->mmap_locked) {
>> +		if (!mmget_not_zero(mm)) {
>> +			err = -EFAULT;
>> +			goto err_out;
>> +		}
>> +		if (ctx->trylock_mmap) {
>> +			if (!mmap_read_trylock(mm))  {
>> +				err = drm_gpusvm_evict_to_sram(gpusvm, range);
>> +				goto err_mmput;
>> +			}
>> +		} else {
>> +			mmap_read_lock(mm);
>> +		}
>> +	}
>> +
>> +	mmap_assert_locked(mm);
>> +
>> +	/*
>> +	 * Loop required to find all VMA area structs for the corner case when
>> +	 * VRAM backing has been partially unmapped from MM's address space.
>> +	 */
>> +again:
>> +	vas = find_vma(mm, start);
>> +	if (!vas) {
>> +		if (!retry)
>> +			err = -ENOENT;
>> +		goto err_mmunlock;
>> +	}
>> +
>> +	if (end <= vas->vm_start || start >= vas->vm_end) {
>> +		if (!retry)
>> +			err = -EINVAL;
>> +		goto err_mmunlock;
>> +	}
>> +
>> +	err = __drm_gpusvm_migrate_to_sram(gpusvm, vas, NULL, start, end);
>> +	if (err)
>> +		goto err_mmunlock;
>> +
>> +	if (vas->vm_end < end) {
>> +		retry = true;
>> +		start = vas->vm_end;
>> +		goto again;
>> +	}
>> +
>> +	if (!ctx->mmap_locked) {
>> +		mmap_read_unlock(mm);
>> +		/*
>> +		 * Using mmput_async as this function can be called while
>> +		 * holding a dma-resv lock, and a final put can grab the mmap
>> +		 * lock, causing a lock inversion.
>> +		 */
>> +		mmput_async(mm);
>> +	}
>> +
>> +	return 0;
>> +
>> +err_mmunlock:
>> +	if (!ctx->mmap_locked)
>> +		mmap_read_unlock(mm);
>> +err_mmput:
>> +	if (!ctx->mmap_locked)
>> +		mmput_async(mm);
>> +err_out:
>> +	return err;
>> +}
>> +
>> +/**
>> + * drm_gpusvm_page_free - Put GPU SVM zone device data associated with a page
>> + * @page: Pointer to the page
>> + *
>> + * This function is a callback used to put the GPU SVM zone device data
>> + * associated with a page when it is being released.
>> + */
>> +static void drm_gpusvm_page_free(struct page *page)
>> +{
>> +	drm_gpusvm_zdd_put(page->zone_device_data);
>> +}
>> +
>> +/**
>> + * drm_gpusvm_migrate_to_ram - Migrate GPU SVM range to RAM (page fault handler)
>> + * @vmf: Pointer to the fault information structure
>> + *
>> + * This function is a page fault handler used to migrate a GPU SVM range to RAM.
>> + * It retrieves the GPU SVM range information from the faulting page and invokes
>> + * the internal migration function to migrate the range back to RAM.
>> + *
>> + * Returns:
>> + * VM_FAULT_SIGBUS on failure, 0 on success.
>> + */
>> +static vm_fault_t drm_gpusvm_migrate_to_ram(struct vm_fault *vmf)
>> +{
>> +	struct drm_gpusvm_zdd *zdd = vmf->page->zone_device_data;
>> +	int err;
>> +
>> +	err = __drm_gpusvm_migrate_to_sram(zdd->range->gpusvm,
>> +					   vmf->vma, vmf->page,
>> +					   zdd->range->va.start,
>> +					   zdd->range->va.end);
>> +
>> +	return err ? VM_FAULT_SIGBUS : 0;
>> +}
>> +
>> +/**
>> + * drm_gpusvm_pagemap_ops - Device page map operations for GPU SVM
>> + */
>> +static const struct dev_pagemap_ops drm_gpusvm_pagemap_ops = {
>> +	.page_free = drm_gpusvm_page_free,
>> +	.migrate_to_ram = drm_gpusvm_migrate_to_ram,
>> +};
>> +
>> +/**
>> + * drm_gpusvm_pagemap_ops_get - Retrieve GPU SVM device page map operations
>> + *
>> + * Returns:
>> + * Pointer to the GPU SVM device page map operations structure.
>> + */
>> +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void)
>> +{
>> +	return &drm_gpusvm_pagemap_ops;
>> +}
>> +
>> +/**
>> + * drm_gpusvm_has_mapping - Check if GPU SVM has mapping for the given address range
>> + * @gpusvm: Pointer to the GPU SVM structure.
>> + * @start: Start address
>> + * @end: End address
>> + *
>> + * Returns:
>> + * True if GPU SVM has mapping, False otherwise
>> + */
>> +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start, u64 end)
>> +{
>> +	struct drm_gpusvm_notifier *notifier;
>> +
>> +	drm_gpusvm_for_each_notifier(notifier, gpusvm, start, end) {
>> +		struct drm_gpusvm_range *range = NULL;
>> +
>> +		drm_gpusvm_for_each_range(range, notifier, start, end)
>> +			return true;
>> +	}
>> +
>> +	return false;
>> +}
>> diff --git a/drivers/gpu/drm/xe/drm_gpusvm.h b/drivers/gpu/drm/xe/drm_gpusvm.h
>> new file mode 100644
>> index 000000000000..0ea70f8534a8
>> --- /dev/null
>> +++ b/drivers/gpu/drm/xe/drm_gpusvm.h
>> @@ -0,0 +1,415 @@
>> +/* SPDX-License-Identifier: MIT */
>> +/*
>> + * Copyright © 2024 Intel Corporation
>> + */
>> +
>> +#ifndef __DRM_GPUSVM_H__
>> +#define __DRM_GPUSVM_H__
>> +
>> +#include <linux/kref.h>
>> +#include <linux/mmu_notifier.h>
>> +#include <linux/workqueue.h>
>> +
>> +struct dev_pagemap_ops;
>> +struct drm_device;
>> +struct drm_gpusvm;
>> +struct drm_gpusvm_notifier;
>> +struct drm_gpusvm_ops;
>> +struct drm_gpusvm_range;
>> +
>> +/**
>> + * struct drm_gpusvm_ops - Operations structure for GPU SVM
>> + *
>> + * This structure defines the operations for GPU Shared Virtual Memory (SVM).
>> + * These operations are provided by the GPU driver to manage SVM ranges and
>> + * perform operations such as migration between VRAM and system RAM.
>> + */
>> +struct drm_gpusvm_ops {
>> +	/**
>> +	 * @notifier_alloc: Allocate a GPU SVM notifier (optional)
>> +	 *
>> +	 * This function shall allocate a GPU SVM notifier.
>> +	 *
>> +	 * Returns:
>> +	 * Pointer to the allocated GPU SVM notifier on success, NULL on failure.
>> +	 */
>> +	struct drm_gpusvm_notifier *(*notifier_alloc)(void);
>> +
>> +	/**
>> +	 * @notifier_free: Free a GPU SVM notifier (optional)
>> +	 * @notifier: Pointer to the GPU SVM notifier to be freed
>> +	 *
>> +	 * This function shall free a GPU SVM notifier.
>> +	 */
>> +	void (*notifier_free)(struct drm_gpusvm_notifier *notifier);
>> +
>> +	/**
>> +	 * @range_alloc: Allocate a GPU SVM range (optional)
>> +	 * @gpusvm: Pointer to the GPU SVM
>> +	 *
>> +	 * This function shall allocate a GPU SVM range.
>> +	 *
>> +	 * Returns:
>> +	 * Pointer to the allocated GPU SVM range on success, NULL on failure.
>> +	 */
>> +	struct drm_gpusvm_range *(*range_alloc)(struct drm_gpusvm *gpusvm);
>> +
>> +	/**
>> +	 * @range_free: Free a GPU SVM range (optional)
>> +	 * @range: Pointer to the GPU SVM range to be freed
>> +	 *
>> +	 * This function shall free a GPU SVM range.
>> +	 */
>> +	void (*range_free)(struct drm_gpusvm_range *range);
>> +
>> +	/**
>> +	 * @vram_release: Release VRAM allocation (optional)
>> +	 * @vram_allocation: Driver-private pointer to the VRAM allocation
>> +	 *
>> +	 * This function shall release the VRAM allocation and is expected to
>> +	 * drop its reference to the VRAM allocation.
>> +	 */
>> +	void (*vram_release)(void *vram_allocation);
>> +
>> +	/**
>> +	 * @populate_vram_pfn: Populate VRAM PFN (required for migration)
>> +	 * @gpusvm: Pointer to the GPU SVM
>> +	 * @vram_allocation: Driver-private pointer to the VRAM allocation
>> +	 * @npages: Number of pages to populate
>> +	 * @pfn: Array of page frame numbers to populate
>> +	 *
>> +	 * This function shall populate VRAM page frame numbers (PFN).
>> +	 *
>> +	 * Returns:
>> +	 * 0 on success, a negative error code on failure.
>> +	 */
>> +	int (*populate_vram_pfn)(struct drm_gpusvm *gpusvm,
>> +				 void *vram_allocation,
>> +				 unsigned long npages,
>> +				 unsigned long *pfn);
>> +
>> +	/**
>> +	 * @copy_to_vram: Copy to VRAM (required for migration)
>> +	 * @gpusvm: Pointer to the GPU SVM
>> +	 * @pages: Pointer to array of VRAM pages (destination)
>> +	 * @dma_addr: Pointer to array of DMA addresses (source)
>> +	 * @npages: Number of pages to copy
>> +	 *
>> +	 * This function shall copy pages to VRAM.
>> +	 *
>> +	 * Returns:
>> +	 * 0 on success, a negative error code on failure.
>> +	 */
>> +	int (*copy_to_vram)(struct drm_gpusvm *gpusvm,
>> +			    struct page **pages,
>> +			    dma_addr_t *dma_addr,
>> +			    unsigned long npages);
>> +
>> +	/**
>> +	 * @copy_to_sram: Copy to system RAM (required for migration)
>> +	 * @gpusvm: Pointer to the GPU SVM
>> +	 * @pages: Pointer to array of VRAM pages (source)
>> +	 * @dma_addr: Pointer to array of DMA addresses (destination)
>> +	 * @npages: Number of pages to copy
>> +	 *
>> +	 * This function shall copy pages to system RAM.
>> +	 *
>> +	 * Returns:
>> +	 * 0 on success, a negative error code on failure.
>> +	 */
>> +	int (*copy_to_sram)(struct drm_gpusvm *gpusvm,
>> +			    struct page **pages,
>> +			    dma_addr_t *dma_addr,
>> +			    unsigned long npages);
>> +
>> +	/**
>> +	 * @invalidate: Invalidate GPU SVM notifier (required)
>> +	 * @gpusvm: Pointer to the GPU SVM
>> +	 * @notifier: Pointer to the GPU SVM notifier
>> +	 * @mmu_range: Pointer to the mmu_notifier_range structure
>> +	 *
>> +	 * This function shall invalidate the GPU page tables. It can safely
>> +	 * walk the notifier range RB tree/list in this function. Called while
>> +	 * holding the notifier lock.
>> +	 */
>> +	void (*invalidate)(struct drm_gpusvm *gpusvm,
>> +			   struct drm_gpusvm_notifier *notifier,
>> +			   const struct mmu_notifier_range *mmu_range);
>> +};
>> +
>> +/**
>> + * struct drm_gpusvm_notifier - Structure representing a GPU SVM notifier
>> + *
>> + * @gpusvm: Pointer to the GPU SVM structure
>> + * @notifier: MMU interval notifier
>> + * @interval: Interval for the notifier
>> + * @rb: Red-black tree node for the parent GPU SVM structure notifier tree
>> + * @root: Cached root node of the RB tree containing ranges
>> + * @range_list: List head containing ranges in the same order they appear in
>> + *              the interval tree. This is useful to keep iterating over ranges
>> + *              while doing modifications to the RB tree.
>> + * @flags.removed: Flag indicating whether the MMU interval notifier has been
>> + *                 removed
>> + *
>> + * This structure represents a GPU SVM notifier.
>> + */
>> +struct drm_gpusvm_notifier {
>> +	struct drm_gpusvm *gpusvm;
>> +	struct mmu_interval_notifier notifier;
>> +	struct {
>> +		u64 start;
>> +		u64 end;
>> +	} interval;
>> +	struct {
>> +		struct rb_node node;
>> +		struct list_head entry;
>> +		u64 __subtree_last;
>> +	} rb;
>> +	struct rb_root_cached root;
>> +	struct list_head range_list;
>> +	struct {
>> +		u32 removed : 1;
>> +	} flags;
>> +};
>> +
>> +/**
>> + * struct drm_gpusvm_range - Structure representing a GPU SVM range
>> + *
>> + * @gpusvm: Pointer to the GPU SVM structure
>> + * @notifier: Pointer to the GPU SVM notifier
>> + * @refcount: Reference count for the range
>> + * @rb: Red-black tree node for the parent GPU SVM notifier structure range tree
>> + * @va: Virtual address range
>> + * @notifier_seq: Notifier sequence number of the range's pages
>> + * @pages: Pointer to the array of pages (if backing store is in VRAM)
>> + * @dma_addr: DMA address array (if backing store is SRAM and DMA mapped)
>> + * @vram_allocation: Driver-private pointer to the VRAM allocation
>> + * @order: Order of dma mapping. i.e. PAGE_SIZE << order is mapping size
>> + * @flags.migrate_vram: Flag indicating whether the range can be migrated to VRAM
>> + * @flags.unmapped: Flag indicating if the range has been unmapped
>> + * @flags.partial_unmap: Flag indicating if the range has been partially unmapped
>> + * @flags.has_vram_pages: Flag indicating if the range has vram pages
>> + * @flags.has_dma_mapping: Flag indicating if the range has a DMA mapping
>> + * @flags.kfree_mapping: Flag indicating @dma_addr is a compact allocation based
>> + *                       on @order, which is released via kfree
>> + *
>> + * This structure represents a GPU SVM range used for tracking memory ranges
>> + * mapped in a DRM device.
>> + */
>> +struct drm_gpusvm_range {
>> +	struct drm_gpusvm *gpusvm;
>> +	struct drm_gpusvm_notifier *notifier;
>> +	struct kref refcount;
>> +	struct {
>> +		struct rb_node node;
>> +		struct list_head entry;
>> +		u64 __subtree_last;
>> +	} rb;
>> +	struct {
>> +		u64 start;
>> +		u64 end;
>> +	} va;
>> +	unsigned long notifier_seq;
>> +	union {
>> +		struct page **pages;
>> +		dma_addr_t *dma_addr;
>> +	};
>> +	void *vram_allocation;
>> +	u16 order;
>> +	struct {
>> +		/* All flags below must be set upon creation */
>> +		u16 migrate_vram : 1;
>> +		/* All flags below must be set / cleared under notifier lock */
>> +		u16 unmapped : 1;
>> +		u16 partial_unmap : 1;
>> +		u16 has_vram_pages : 1;
>> +		u16 has_dma_mapping : 1;
>> +		u16 kfree_mapping : 1;
>> +	} flags;
>> +};
>> +
>> +/**
>> + * struct drm_gpusvm - GPU SVM structure
>> + *
>> + * @name: Name of the GPU SVM
>> + * @drm: Pointer to the DRM device structure
>> + * @mm: Pointer to the mm_struct for the address space
>> + * @device_private_page_owner: Device private pages owner
>> + * @mm_start: Start address of GPU SVM
>> + * @mm_range: Range of the GPU SVM
>> + * @notifier_size: Size of individual notifiers
>> + * @ops: Pointer to the operations structure for GPU SVM
>> + * @chunk_sizes: Pointer to the array of chunk sizes used in range allocation.
>> + *               Entries should be powers of 2 in descending order.
>> + * @num_chunks: Number of chunks
>> + * @notifier_lock: Read-write semaphore for protecting notifier operations
>> + * @zdd_wq: Workqueue for deferred work on zdd destruction
>> + * @root: Cached root node of the Red-Black tree containing GPU SVM notifiers
>> + * @notifier_list: List head containing notifiers in the same order they
>> + *                 appear in the interval tree. This is useful to keep iterating
>> + *                 over notifiers while doing modifications to the RB tree.
>> + *
>> + * This structure represents a GPU SVM (Shared Virtual Memory) used for tracking
>> + * memory ranges mapped in a DRM (Direct Rendering Manager) device.
>> + *
>> + * No reference counting is provided, as this is expected to be embedded in the
>> + * driver VM structure along with the struct drm_gpuvm, which handles reference
>> + * counting.
>> + */
>> +struct drm_gpusvm {
>> +	const char *name;
>> +	struct drm_device *drm;
>> +	struct mm_struct *mm;
>> +	void *device_private_page_owner;
>> +	u64 mm_start;
>> +	u64 mm_range;
>> +	u64 notifier_size;
>> +	const struct drm_gpusvm_ops *ops;
>> +	const u64 *chunk_sizes;
>> +	int num_chunks;
>> +	struct rw_semaphore notifier_lock;
>> +	struct workqueue_struct *zdd_wq;
>> +	struct rb_root_cached root;
>> +	struct list_head notifier_list;
>> +};
>> +
>> +/**
>> + * struct drm_gpusvm_ctx - DRM GPU SVM context
>> + *
>> + * @mmap_locked: mmap lock is locked
>> + * @trylock_mmap: trylock mmap lock, used to avoid locking inversions
>> + *                (e.g. dma-resv -> mmap lock)
>> + * @in_notifier: entering from a MMU notifier
>> + * @read_only: operating on read-only memory
>> + * @vram_possible: possible to use VRAM
>> + * @prefault: prefault pages
>> + *
>> + * Context that DRM GPU SVM is operating in (i.e. user arguments).
>> + */
>> +struct drm_gpusvm_ctx {
>> +	u32 mmap_locked :1;
>> +	u32 trylock_mmap :1;
>> +	u32 in_notifier :1;
>> +	u32 read_only :1;
>> +	u32 vram_possible :1;
>> +	u32 prefault :1;
>> +};
>> +
>> +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
>> +		    const char *name, struct drm_device *drm,
>> +		    struct mm_struct *mm, void *device_private_page_owner,
>> +		    u64 mm_start, u64 mm_range, u64 notifier_size,
>> +		    const struct drm_gpusvm_ops *ops,
>> +		    const u64 *chunk_sizes, int num_chunks);
>> +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm);
>> +void drm_gpusvm_free(struct drm_gpusvm *gpusvm);
>> +
>> +struct drm_gpusvm_range *
>> +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64 fault_addr,
>> +				u64 gpuva_start, u64 gpuva_end,
>> +				const struct drm_gpusvm_ctx *ctx);
>> +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
>> +			     struct drm_gpusvm_range *range);
>> +
>> +struct drm_gpusvm_range *
>> +drm_gpusvm_range_get(struct drm_gpusvm_range *range);
>> +void drm_gpusvm_range_put(struct drm_gpusvm_range *range);
>> +
>> +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
>> +				  struct drm_gpusvm_range *range);
>> +
>> +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
>> +			       struct drm_gpusvm_range *range,
>> +			       const struct drm_gpusvm_ctx *ctx);
>> +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
>> +				  struct drm_gpusvm_range *range,
>> +				  const struct drm_gpusvm_ctx *ctx);
>> +
>> +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
>> +			       struct drm_gpusvm_range *range,
>> +			       void *vram_allocation,
>> +			       const struct drm_gpusvm_ctx *ctx);
>> +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
>> +			       struct drm_gpusvm_range *range,
>> +			       const struct drm_gpusvm_ctx *ctx);
>> +
>> +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void);
>> +
>> +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start, u64 end);
>> +
>> +struct drm_gpusvm_range *
>> +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64 start, u64 end);
>> +
>> +/**
>> + * drm_gpusvm_notifier_lock - Lock GPU SVM notifier
>> + * @gpusvm__: Pointer to the GPU SVM structure.
>> + *
>> + * Abstract client usage GPU SVM notifier lock, take lock
>> + */
>> +#define drm_gpusvm_notifier_lock(gpusvm__)	\
>> +	down_read(&(gpusvm__)->notifier_lock)
>> +
>> +/**
>> + * drm_gpusvm_notifier_unlock - Unlock GPU SVM notifier
>> + * @gpusvm__: Pointer to the GPU SVM structure.
>> + *
>> + * Abstract client usage GPU SVM notifier lock, drop lock
>> + */
>> +#define drm_gpusvm_notifier_unlock(gpusvm__)	\
>> +	up_read(&(gpusvm__)->notifier_lock)
>> +
>> +/**
>> + * __drm_gpusvm_range_next - Get the next GPU SVM range in the list
>> + * @range: a pointer to the current GPU SVM range
>> + *
>> + * Return: A pointer to the next drm_gpusvm_range if available, or NULL if the
>> + *         current range is the last one or if the input range is NULL.
>> + */
>> +static inline struct drm_gpusvm_range *
>> +__drm_gpusvm_range_next(struct drm_gpusvm_range *range)
>> +{
>> +	if (range && !list_is_last(&range->rb.entry,
>> +				   &range->notifier->range_list))
>> +		return list_next_entry(range, rb.entry);
>> +
>> +	return NULL;
>> +}
>> +
>> +/**
>> + * drm_gpusvm_for_each_range - Iterate over GPU SVM ranges in a notifier
>> + * @range__: Iterator variable for the ranges. If set, it indicates the start of
>> + *	     the iterator. If NULL, call drm_gpusvm_range_find() to get the range.
>> + * @notifier__: Pointer to the GPU SVM notifier
>> + * @start__: Start address of the range
>> + * @end__: End address of the range
>> + *
>> + * This macro is used to iterate over GPU SVM ranges in a notifier. It is safe
>> + * to use while holding the driver SVM lock or the notifier lock.
>> + */
>> +#define drm_gpusvm_for_each_range(range__, notifier__, start__, end__)	\
>> +	for ((range__) = (range__) ?:					\
>> +	     drm_gpusvm_range_find((notifier__), (start__), (end__));	\
>> +	     (range__) && (range__->va.start < (end__));		\
>> +	     (range__) = __drm_gpusvm_range_next(range__))
>> +
>> +/**
>> + * drm_gpusvm_range_set_unmapped - Mark a GPU SVM range as unmapped
>> + * @range: Pointer to the GPU SVM range structure.
>> + * @mmu_range: Pointer to the MMU notifier range structure.
>> + *
>> + * This function marks a GPU SVM range as unmapped and sets the partial_unmap flag
>> + * if the range partially falls within the provided MMU notifier range.
>> + */
>> +static inline void
>> +drm_gpusvm_range_set_unmapped(struct drm_gpusvm_range *range,
>> +			      const struct mmu_notifier_range *mmu_range)
>> +{
>> +	lockdep_assert_held_write(&range->gpusvm->notifier_lock);
>> +
>> +	range->flags.unmapped = true;
>> +	if (range->va.start < mmu_range->start ||
>> +	    range->va.end > mmu_range->end)
>> +		range->flags.partial_unmap = true;
>> +}
>> +
>> +#endif /* __DRM_GPUSVM_H__ */
>> -- 
>> 2.34.1
>>


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory
  2024-08-28 14:46     ` Christian König
@ 2024-08-28 15:43       ` Matthew Brost
  2024-08-28 16:06         ` Alex Deucher
  2024-08-28 16:25         ` Daniel Vetter
  0 siblings, 2 replies; 100+ messages in thread
From: Matthew Brost @ 2024-08-28 15:43 UTC (permalink / raw)
  To: Christian König
  Cc: Daniel Vetter, intel-xe, dri-devel, airlied, thomas.hellstrom,
	matthew.auld, daniel

On Wed, Aug 28, 2024 at 04:46:24PM +0200, Christian König wrote:
> Am 28.08.24 um 16:31 schrieb Daniel Vetter:
> > On Tue, Aug 27, 2024 at 07:48:38PM -0700, Matthew Brost wrote:
> > > +		if (!ctx->mmap_locked) {
> > > +			/*
> > > +			 * XXX: HMM locking document indicates only a read-lock
> > > +			 * is required but there appears to be a window between
> > > +			 * the MMU_NOTIFY_MIGRATE event triggered in a CPU fault
> > > +			 * via migrate_vma_setup and the pages actually moving
> > > +			 * in migrate_vma_finalize in which this code can grab
> > > +			 * garbage pages. Grabbing the write-lock if the range
> > > +			 * is attached to vram appears to protect against this
> > > +			 * race.
> > > +			 */

Thanks for the comments, replying to both of you inline.

> > This one is really scary, since it means the entire migrate pte trickery
> > is essentially completely busted. Grabbing the mmap write lock just means
> > you block out pretty much everything interesting from concurrently
> > happening.
> > 
> > My gut feeling says we need to figure out what's happening here, because
> > this looks a bit too fundamental to me.

I agree. I haven’t looked into this issue for a couple of months but
really need to understand what is going on.

I should have mentioned this in the cover letter: the goal of this
series was to produce something for review that is stable and supports
UMDs/user applications. It was not intended to be presented as a final
solution. This issue certainly falls into the category of "needs to be
understood and requires a proper fix."

One open question I have is whether the test case that triggers this
issue is even defined behavior. The test creates concurrent access
between the GPU and CPU to the same memory address, resulting in GPU and
CPU faults racing against each other. It’s possible that this is
undefined behavior, so data corruption might be acceptable—i.e., the
kernel can’t crash, but incorrect results might be permissible.

e.g. This is the only defined usage model:

alloc_memory();
start_compute_kernel();
sync_on_compute_kernel_completion();
read_memory();
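
For contrast, the racing case looks roughly like the below. This is only a
rough sketch of the access pattern, not the actual IGT code, and
write_memory() is just a stand-in for the CPU touching the same range:

alloc_memory();
start_compute_kernel();			/* GPU faults range to VRAM */
write_memory();				/* CPU touches same range mid-kernel */
sync_on_compute_kernel_completion();
read_memory();

Here GPU faults migrating to VRAM and CPU faults migrating back to SRAM can
be in flight at the same time, which appears to be where the window
described in the XXX comment above opens up.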

Hopefully, in the next week or so, I'll be heavily engaging with the UMD
teams. Development can then start, and applications will be running soon
after. This will allow us to address issues like this, collect data on
memory usage, and verify some of the assumptions I've made, such as
optimizing for 2M+ allocations.
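
(To be concrete about that last assumption, the sort of thing I have in mind
is roughly the below init. Sizes and names here are illustrative only, not
the exact values the Xe patches use:

	static const u64 chunk_sizes[] = { SZ_2M, SZ_64K, SZ_4K };

	drm_gpusvm_init(&vm->svm, "Xe SVM", &xe->drm, current->mm, owner,
			0, vm_size, SZ_512M, &gpusvm_ops,
			chunk_sizes, ARRAY_SIZE(chunk_sizes));

i.e. 2M is the largest chunk size and the fault handler prefers it whenever
the mapping and allocation heuristics allow.)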

> 
> I think I have at least a high level understanding what's going on here,
> Felix and especially Philip should know more of the details.
> 

I meant to reach out to AMD for issues like this. So, Felix
(felix.kuehling@amd.com) and Philip (Philip.Yang@amd.com) would be good
contacts?

> In general, grabbing the mm_lock to protect PTEs from changing is complete
> nonsense. The mm_lock is there to protect the VMAs and *not* the PTEs!
> 

Thanks for the hint. I believe I noticed some additional locks for migration
in the AMD implementation, which might be how you mitigated this issue.
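
If that is the case, the direction I would lean toward on our side is
something like the below. This is purely a sketch and the naming is
hypothetical, i.e. not claiming this is what AMD actually does:

	struct drm_gpusvm_range {
		...
		/* hypothetical: serializes migrations for this range */
		struct mutex migrate_lock;
	};

Both the CPU fault path (migrate_to_ram) and the GPU fault / eviction paths
would then take range->migrate_lock around migrate_vma_setup() ->
migrate_vma_finalize(), instead of upgrading to the mmap write lock as the
XXX comment does today.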

I must say it is a bit unfortunate that the HMM locking documentation
doesn’t mention this. I believe the documentation needs additional
information, which I can add once we finalize the solution.

Matt 

> Even with the write side of the mm_lock taken it is perfectly possible that
> PTEs change. It's just less likely.
> 
> We ran into multiple issues before we figured out this important distinction
> as well.
> 
> Christian.
> 
> > -Sima
> > 
> > 
> > > +			if (vram_pages)
> > > +				mmap_write_lock(mm);
> > > +			else
> > > +				mmap_read_lock(mm);
> > > +		}
> > > +		err = hmm_range_fault(&hmm_range);
> > > +		if (!ctx->mmap_locked) {
> > > +			if (vram_pages)
> > > +				mmap_write_unlock(mm);
> > > +			else
> > > +				mmap_read_unlock(mm);
> > > +		}
> > > +
> > > +		if (err == -EBUSY) {
> > > +			if (time_after(jiffies, timeout))
> > > +				break;
> > > +
> > > +			hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
> > > +			continue;
> > > +		}
> > > +		break;
> > > +	}
> > > +	if (!ctx->mmap_locked)
> > > +		mmput(mm);
> > > +	if (err)
> > > +		goto err_free;
> > > +
> > > +	pages = (struct page **)pfns;
> > > +
> > > +	if (ctx->prefault) {
> > > +		range->pages = pages;
> > > +		goto set_seqno;
> > > +	}
> > > +
> > > +map_pages:
> > > +	if (is_device_private_page(hmm_pfn_to_page(pfns[0]))) {
> > > +		WARN_ON_ONCE(!range->vram_allocation);
> > > +
> > > +		for (i = 0; i < npages; ++i) {
> > > +			pages[i] = hmm_pfn_to_page(pfns[i]);
> > > +
> > > +			if (WARN_ON_ONCE(!is_device_private_page(pages[i]))) {
> > > +				err = -EOPNOTSUPP;
> > > +				goto err_free;
> > > +			}
> > > +		}
> > > +
> > > +		/* Do not race with notifier unmapping pages */
> > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > +		range->flags.has_vram_pages = true;
> > > +		range->pages = pages;
> > > +		if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
> > > +			err = -EAGAIN;
> > > +			__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > > +		}
> > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > +	} else {
> > > +		dma_addr_t *dma_addr = (dma_addr_t *)pfns;
> > > +
> > > +		for_each_dma_page(i, j, npages, order) {
> > > +			if (WARN_ON_ONCE(i && order !=
> > > +					 hmm_pfn_to_map_order(pfns[i]))) {
> > > +				err = -EOPNOTSUPP;
> > > +				npages = i;
> > > +				goto err_unmap;
> > > +			}
> > > +			order = hmm_pfn_to_map_order(pfns[i]);
> > > +
> > > +			pages[j] = hmm_pfn_to_page(pfns[i]);
> > > +			if (WARN_ON_ONCE(is_zone_device_page(pages[j]))) {
> > > +				err = -EOPNOTSUPP;
> > > +				npages = i;
> > > +				goto err_unmap;
> > > +			}
> > > +
> > > +			set_page_dirty_lock(pages[j]);
> > > +			mark_page_accessed(pages[j]);
> > > +
> > > +			dma_addr[j] = dma_map_page(gpusvm->drm->dev,
> > > +						   pages[j], 0,
> > > +						   PAGE_SIZE << order,
> > > +						   DMA_BIDIRECTIONAL);
> > > +			if (dma_mapping_error(gpusvm->drm->dev, dma_addr[j])) {
> > > +				err = -EFAULT;
> > > +				npages = i;
> > > +				goto err_unmap;
> > > +			}
> > > +		}
> > > +
> > > +		/* Huge pages, reduce memory footprint */
> > > +		if (order) {
> > > +			dma_addr = kmalloc_array(j, sizeof(*dma_addr),
> > > +						 GFP_KERNEL);
> > > +			if (dma_addr) {
> > > +				for (i = 0; i < j; ++i)
> > > +					dma_addr[i] = (dma_addr_t)pfns[i];
> > > +				kvfree(pfns);
> > > +				kfree_mapping = true;
> > > +			} else {
> > > +				dma_addr = (dma_addr_t *)pfns;
> > > +			}
> > > +		}
> > > +
> > > +		/* Do not race with notifier unmapping pages */
> > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > +		range->order = order;
> > > +		range->flags.kfree_mapping = kfree_mapping;
> > > +		range->flags.has_dma_mapping = true;
> > > +		range->dma_addr = dma_addr;
> > > +		range->vram_allocation = NULL;
> > > +		if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
> > > +			err = -EAGAIN;
> > > +			__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > > +		}
> > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > +	}
> > > +
> > > +	if (err == -EAGAIN)
> > > +		goto retry;
> > > +set_seqno:
> > > +	range->notifier_seq = hmm_range.notifier_seq;
> > > +
> > > +	return 0;
> > > +
> > > +err_unmap:
> > > +	for_each_dma_page(i, j, npages, order)
> > > +		dma_unmap_page(gpusvm->drm->dev,
> > > +			       (dma_addr_t)pfns[j],
> > > +			       PAGE_SIZE << order, DMA_BIDIRECTIONAL);
> > > +err_free:
> > > +	if (alloc_pfns)
> > > +		kvfree(pfns);
> > > +err_out:
> > > +	return err;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_unmap_pages - Unmap pages associated with a GPU SVM range
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + * @ctx: GPU SVM context
> > > + *
> > > + * This function unmaps pages associated with a GPU SVM range. If @in_notifier
> > > + * is set, it is assumed that gpusvm->notifier_lock is held in write mode; if it
> > > + * is clear, it acquires gpusvm->notifier_lock in read mode. Must be called on
> > > + * each GPU SVM range attached to notifier in gpusvm->ops->invalidate for IOMMU
> > > + * security model.
> > > + */
> > > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> > > +				  struct drm_gpusvm_range *range,
> > > +				  const struct drm_gpusvm_ctx *ctx)
> > > +{
> > > +	if (ctx->in_notifier)
> > > +		lockdep_assert_held_write(&gpusvm->notifier_lock);
> > > +	else
> > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > +
> > > +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > > +
> > > +	if (!ctx->in_notifier)
> > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migration_put_page - Put a migration page
> > > + * @page: Pointer to the page to put
> > > + *
> > > + * This function unlocks and puts a page.
> > > + */
> > > +static void drm_gpusvm_migration_put_page(struct page *page)
> > > +{
> > > +	unlock_page(page);
> > > +	put_page(page);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migration_put_pages - Put migration pages
> > > + * @npages: Number of pages
> > > + * @migrate_pfn: Array of migrate page frame numbers
> > > + *
> > > + * This function puts an array of pages.
> > > + */
> > > +static void drm_gpusvm_migration_put_pages(unsigned long npages,
> > > +					   unsigned long *migrate_pfn)
> > > +{
> > > +	unsigned long i;
> > > +
> > > +	for (i = 0; i < npages; ++i) {
> > > +		if (!migrate_pfn[i])
> > > +			continue;
> > > +
> > > +		drm_gpusvm_migration_put_page(migrate_pfn_to_page(migrate_pfn[i]));
> > > +		migrate_pfn[i] = 0;
> > > +	}
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_get_vram_page - Get a reference to a VRAM page
> > > + * @page: Pointer to the page
> > > + * @zdd: Pointer to the GPU SVM zone device data
> > > + *
> > > + * This function associates the given page with the specified GPU SVM zone
> > > + * device data and initializes it for zone device usage.
> > > + */
> > > +static void drm_gpusvm_get_vram_page(struct page *page,
> > > +				     struct drm_gpusvm_zdd *zdd)
> > > +{
> > > +	page->zone_device_data = drm_gpusvm_zdd_get(zdd);
> > > +	zone_device_page_init(page);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migrate_map_pages() - Map migration pages for GPU SVM migration
> > > + * @dev: The device for which the pages are being mapped
> > > + * @dma_addr: Array to store DMA addresses corresponding to mapped pages
> > > + * @migrate_pfn: Array of migrate page frame numbers to map
> > > + * @npages: Number of pages to map
> > > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > > + *
> > > + * This function maps pages of memory for migration usage in GPU SVM. It
> > > + * iterates over each page frame number provided in @migrate_pfn, maps the
> > > + * corresponding page, and stores the DMA address in the provided @dma_addr
> > > + * array.
> > > + *
> > > + * Return: 0 on success, -EFAULT if an error occurs during mapping.
> > > + */
> > > +static int drm_gpusvm_migrate_map_pages(struct device *dev,
> > > +					dma_addr_t *dma_addr,
> > > +					long unsigned int *migrate_pfn,
> > > +					unsigned long npages,
> > > +					enum dma_data_direction dir)
> > > +{
> > > +	unsigned long i;
> > > +
> > > +	for (i = 0; i < npages; ++i) {
> > > +		struct page *page = migrate_pfn_to_page(migrate_pfn[i]);
> > > +
> > > +		if (!page)
> > > +			continue;
> > > +
> > > +		if (WARN_ON_ONCE(is_zone_device_page(page)))
> > > +			return -EFAULT;
> > > +
> > > +		dma_addr[i] = dma_map_page(dev, page, 0, PAGE_SIZE, dir);
> > > +		if (dma_mapping_error(dev, dma_addr[i]))
> > > +			return -EFAULT;
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migrate_unmap_pages() - Unmap pages previously mapped for GPU SVM migration
> > > + * @dev: The device for which the pages were mapped
> > > + * @dma_addr: Array of DMA addresses corresponding to mapped pages
> > > + * @npages: Number of pages to unmap
> > > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > > + *
> > > + * This function unmaps previously mapped pages of memory for GPU Shared Virtual
> > > + * Memory (SVM). It iterates over each DMA address provided in @dma_addr, checks
> > > + * if it's valid and not already unmapped, and unmaps the corresponding page.
> > > + */
> > > +static void drm_gpusvm_migrate_unmap_pages(struct device *dev,
> > > +					   dma_addr_t *dma_addr,
> > > +					   unsigned long npages,
> > > +					   enum dma_data_direction dir)
> > > +{
> > > +	unsigned long i;
> > > +
> > > +	for (i = 0; i < npages; ++i) {
> > > +		if (!dma_addr[i] || dma_mapping_error(dev, dma_addr[i]))
> > > +			continue;
> > > +
> > > +		dma_unmap_page(dev, dma_addr[i], PAGE_SIZE, dir);
> > > +	}
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migrate_to_vram - Migrate GPU SVM range to VRAM
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + * @vram_allocation: Driver-private pointer to the VRAM allocation. The caller
> > > + *                   should hold a reference to the VRAM allocation, which
> > > + *                   should be dropped via ops->vram_release or upon the
> > > + *                   failure of this function.
> > > + * @ctx: GPU SVM context
> > > + *
> > > + * This function migrates the specified GPU SVM range to VRAM. It performs the
> > > + * necessary setup and invokes the driver-specific operations for migration to
> > > + * VRAM. Upon successful return, @vram_allocation can safely reference @range
> > > + * until ops->vram_release is called, which only happens after a successful
> > > + * return.
> > > + *
> > > + * Returns:
> > > + * 0 on success, negative error code on failure.
> > > + */
> > > +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> > > +			       struct drm_gpusvm_range *range,
> > > +			       void *vram_allocation,
> > > +			       const struct drm_gpusvm_ctx *ctx)
> > > +{
> > > +	u64 start = range->va.start, end = range->va.end;
> > > +	struct migrate_vma migrate = {
> > > +		.start		= start,
> > > +		.end		= end,
> > > +		.pgmap_owner	= gpusvm->device_private_page_owner,
> > > +		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
> > > +	};
> > > +	struct mm_struct *mm = gpusvm->mm;
> > > +	unsigned long i, npages = npages_in_range(start, end);
> > > +	struct vm_area_struct *vas;
> > > +	struct drm_gpusvm_zdd *zdd = NULL;
> > > +	struct page **pages;
> > > +	dma_addr_t *dma_addr;
> > > +	void *buf;
> > > +	int err;
> > > +
> > > +	if (!range->flags.migrate_vram)
> > > +		return -EINVAL;
> > > +
> > > +	if (!gpusvm->ops->populate_vram_pfn || !gpusvm->ops->copy_to_vram ||
> > > +	    !gpusvm->ops->copy_to_sram)
> > > +		return -EOPNOTSUPP;
> > > +
> > > +	if (!ctx->mmap_locked) {
> > > +		if (!mmget_not_zero(mm)) {
> > > +			err = -EFAULT;
> > > +			goto err_out;
> > > +		}
> > > +		mmap_write_lock(mm);
> > > +	}
> > > +
> > > +	mmap_assert_locked(mm);
> > > +
> > > +	vas = vma_lookup(mm, start);
> > > +	if (!vas) {
> > > +		err = -ENOENT;
> > > +		goto err_mmunlock;
> > > +	}
> > > +
> > > +	if (end > vas->vm_end || start < vas->vm_start) {
> > > +		err = -EINVAL;
> > > +		goto err_mmunlock;
> > > +	}
> > > +
> > > +	if (!vma_is_anonymous(vas)) {
> > > +		err = -EBUSY;
> > > +		goto err_mmunlock;
> > > +	}
> > > +
> > > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) + sizeof(*dma_addr) +
> > > +		       sizeof(*pages), GFP_KERNEL);
> > > +	if (!buf) {
> > > +		err = -ENOMEM;
> > > +		goto err_mmunlock;
> > > +	}
> > > +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > > +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr)) * npages;
> > > +
> > > +	zdd = drm_gpusvm_zdd_alloc(range);
> > > +	if (!zdd) {
> > > +		err = -ENOMEM;
> > > +		goto err_free;
> > > +	}
> > > +
> > > +	migrate.vma = vas;
> > > +	migrate.src = buf;
> > > +	migrate.dst = migrate.src + npages;
> > > +
> > > +	err = migrate_vma_setup(&migrate);
> > > +	if (err)
> > > +		goto err_free;
> > > +
> > > +	/*
> > > +	 * FIXME: Below cases, !migrate.cpages and migrate.cpages != npages, not
> > > +	 * always an error. Need to revisit possible cases and how to handle. We
> > > +	 * could prefault on migrate.cpages != npages via hmm_range_fault.
> > > +	 */
> > > +
> > > +	if (!migrate.cpages) {
> > > +		err = -EFAULT;
> > > +		goto err_free;
> > > +	}
> > > +
> > > +	if (migrate.cpages != npages) {
> > > +		err = -EBUSY;
> > > +		goto err_finalize;
> > > +	}
> > > +
> > > +	err = gpusvm->ops->populate_vram_pfn(gpusvm, vram_allocation, npages,
> > > +					     migrate.dst);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev, dma_addr,
> > > +					   migrate.src, npages, DMA_TO_DEVICE);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	for (i = 0; i < npages; ++i) {
> > > +		struct page *page = pfn_to_page(migrate.dst[i]);
> > > +
> > > +		pages[i] = page;
> > > +		migrate.dst[i] = migrate_pfn(migrate.dst[i]);
> > > +		drm_gpusvm_get_vram_page(page, zdd);
> > > +	}
> > > +
> > > +	err = gpusvm->ops->copy_to_vram(gpusvm, pages, dma_addr, npages);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	/* Upon success bind vram allocation to range and zdd */
> > > +	range->vram_allocation = vram_allocation;
> > > +	WRITE_ONCE(zdd->vram_allocation, vram_allocation);	/* Owns ref */
> > > +
> > > +err_finalize:
> > > +	if (err)
> > > +		drm_gpusvm_migration_put_pages(npages, migrate.dst);
> > > +	migrate_vma_pages(&migrate);
> > > +	migrate_vma_finalize(&migrate);
> > > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr, npages,
> > > +				       DMA_TO_DEVICE);
> > > +err_free:
> > > +	if (zdd)
> > > +		drm_gpusvm_zdd_put(zdd);
> > > +	kvfree(buf);
> > > +err_mmunlock:
> > > +	if (!ctx->mmap_locked) {
> > > +		mmap_write_unlock(mm);
> > > +		mmput(mm);
> > > +	}
> > > +err_out:
> > > +	return err;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migrate_populate_sram_pfn - Populate SRAM PFNs for a VM area
> > > + * @vas: Pointer to the VM area structure, can be NULL
> > > + * @npages: Number of pages to populate
> > > + * @src_mpfn: Source array of migrate PFNs
> > > + * @mpfn: Array of migrate PFNs to populate
> > > + * @addr: Start address for PFN allocation
> > > + *
> > > + * This function populates the SRAM migrate page frame numbers (PFNs) for the
> > > + * specified VM area structure. It allocates and locks pages in the VM area for
> > > + * SRAM usage. If vas is non-NULL use alloc_page_vma for allocation, if NULL use
> > > + * alloc_page for allocation.
> > > + *
> > > + * Returns:
> > > + * 0 on success, negative error code on failure.
> > > + */
> > > +static int drm_gpusvm_migrate_populate_sram_pfn(struct vm_area_struct *vas,
> > > +						unsigned long npages,
> > > +						unsigned long *src_mpfn,
> > > +						unsigned long *mpfn, u64 addr)
> > > +{
> > > +	unsigned long i;
> > > +
> > > +	for (i = 0; i < npages; ++i, addr += PAGE_SIZE) {
> > > +		struct page *page;
> > > +
> > > +		if (!(src_mpfn[i] & MIGRATE_PFN_MIGRATE))
> > > +			continue;
> > > +
> > > +		if (vas)
> > > +			page = alloc_page_vma(GFP_HIGHUSER, vas, addr);
> > > +		else
> > > +			page = alloc_page(GFP_HIGHUSER);
> > > +
> > > +		if (!page)
> > > +			return -ENOMEM;
> > > +
> > > +		lock_page(page);
> > > +		mpfn[i] = migrate_pfn(page_to_pfn(page));
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_evict_to_sram - Evict GPU SVM range to SRAM
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + *
> > > + * Similar to __drm_gpusvm_migrate_to_sram but does not require the mmap lock;
> > > + * migration is done via the migrate_device_* functions. This is a fallback
> > > + * path, as it is preferred to issue migrations while holding the mmap lock.
> > > + *
> > > + * Returns:
> > > + * 0 on success, negative error code on failure.
> > > + */
> > > +static int drm_gpusvm_evict_to_sram(struct drm_gpusvm *gpusvm,
> > > +				    struct drm_gpusvm_range *range)
> > > +{
> > > +	unsigned long npages;
> > > +	struct page **pages;
> > > +	unsigned long *src, *dst;
> > > +	dma_addr_t *dma_addr;
> > > +	void *buf;
> > > +	int i, err = 0;
> > > +
> > > +	npages = npages_in_range(range->va.start, range->va.end);
> > > +
> > > +	buf = kvcalloc(npages, 2 * sizeof(*src) + sizeof(*dma_addr) +
> > > +		       sizeof(*pages), GFP_KERNEL);
> > > +	if (!buf) {
> > > +		err = -ENOMEM;
> > > +		goto err_out;
> > > +	}
> > > +	src = buf;
> > > +	dst = buf + (sizeof(*src) * npages);
> > > +	dma_addr = buf + (2 * sizeof(*src) * npages);
> > > +	pages = buf + (2 * sizeof(*src) + sizeof(*dma_addr)) * npages;
> > > +
> > > +	err = gpusvm->ops->populate_vram_pfn(gpusvm, range->vram_allocation,
> > > +					     npages, src);
> > > +	if (err)
> > > +		goto err_free;
> > > +
> > > +	err = migrate_device_vma_range(gpusvm->mm,
> > > +				       gpusvm->device_private_page_owner, src,
> > > +				       npages, range->va.start);
> > > +	if (err)
> > > +		goto err_free;
> > > +
> > > +	err = drm_gpusvm_migrate_populate_sram_pfn(NULL, npages, src, dst, 0);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev, dma_addr,
> > > +					   dst, npages, DMA_BIDIRECTIONAL);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	for (i = 0; i < npages; ++i)
> > > +		pages[i] = migrate_pfn_to_page(src[i]);
> > > +
> > > +	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr, npages);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +err_finalize:
> > > +	if (err)
> > > +		drm_gpusvm_migration_put_pages(npages, dst);
> > > +	migrate_device_pages(src, dst, npages);
> > > +	migrate_device_finalize(src, dst, npages);
> > > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr, npages,
> > > +				       DMA_BIDIRECTIONAL);
> > > +err_free:
> > > +	kvfree(buf);
> > > +err_out:
> > > +
> > > +	return err;
> > > +}
> > > +
> > > +/**
> > > + * __drm_gpusvm_migrate_to_sram - Migrate GPU SVM range to SRAM (internal)
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @vas: Pointer to the VM area structure
> > > + * @page: Pointer to the page for fault handling (can be NULL)
> > > + * @start: Start address of the migration range
> > > + * @end: End address of the migration range
> > > + *
> > > + * This internal function performs the migration of the specified GPU SVM range
> > > + * to SRAM. It sets up the migration, populates + dma maps SRAM PFNs, and
> > > + * invokes the driver-specific operations for migration to SRAM.
> > > + *
> > > + * Returns:
> > > + * 0 on success, negative error code on failure.
> > > + */
> > > +static int __drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > > +					struct vm_area_struct *vas,
> > > +					struct page *page,
> > > +					u64 start, u64 end)
> > > +{
> > > +	struct migrate_vma migrate = {
> > > +		.vma		= vas,
> > > +		.pgmap_owner	= gpusvm->device_private_page_owner,
> > > +		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> > > +		.fault_page	= page,
> > > +	};
> > > +	unsigned long npages;
> > > +	struct page **pages;
> > > +	dma_addr_t *dma_addr;
> > > +	void *buf;
> > > +	int i, err = 0;
> > > +
> > > +	mmap_assert_locked(gpusvm->mm);
> > > +
> > > +	/* Corner where VMA area struct has been partially unmapped */
> > > +	if (start < vas->vm_start)
> > > +		start = vas->vm_start;
> > > +	if (end > vas->vm_end)
> > > +		end = vas->vm_end;
> > > +
> > > +	migrate.start = start;
> > > +	migrate.end = end;
> > > +	npages = npages_in_range(start, end);
> > > +
> > > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) + sizeof(*dma_addr) +
> > > +		       sizeof(*pages), GFP_KERNEL);
> > > +	if (!buf) {
> > > +		err = -ENOMEM;
> > > +		goto err_out;
> > > +	}
> > > +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > > +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr)) * npages;
> > > +
> > > +	migrate.vma = vas;
> > > +	migrate.src = buf;
> > > +	migrate.dst = migrate.src + npages;
> > > +
> > > +	err = migrate_vma_setup(&migrate);
> > > +	if (err)
> > > +		goto err_free;
> > > +
> > > +	/* Raced with another CPU fault, nothing to do */
> > > +	if (!migrate.cpages)
> > > +		goto err_free;
> > > +
> > > +	err = drm_gpusvm_migrate_populate_sram_pfn(vas, npages,
> > > +						   migrate.src, migrate.dst,
> > > +						   start);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev, dma_addr,
> > > +					   migrate.dst, npages,
> > > +					   DMA_BIDIRECTIONAL);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	for (i = 0; i < npages; ++i)
> > > +		pages[i] = migrate_pfn_to_page(migrate.src[i]);
> > > +
> > > +	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr, npages);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +err_finalize:
> > > +	if (err)
> > > +		drm_gpusvm_migration_put_pages(npages, migrate.dst);
> > > +	migrate_vma_pages(&migrate);
> > > +	migrate_vma_finalize(&migrate);
> > > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr, npages,
> > > +				       DMA_BIDIRECTIONAL);
> > > +err_free:
> > > +	kvfree(buf);
> > > +err_out:
> > > +	mmap_assert_locked(gpusvm->mm);
> > > +
> > > +	return err;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migrate_to_sram - Migrate (evict) GPU SVM range to SRAM
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + * @ctx: GPU SVM context
> > > + *
> > > + * This function initiates the migration of the specified GPU SVM range to
> > > + * SRAM. It performs necessary checks and invokes the internal migration
> > > + * function for actual migration.
> > > + *
> > > + * Returns:
> > > + * 0 on success, negative error code on failure.
> > > + */
> > > +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > > +			       struct drm_gpusvm_range *range,
> > > +			       const struct drm_gpusvm_ctx *ctx)
> > > +{
> > > +	u64 start = range->va.start, end = range->va.end;
> > > +	struct mm_struct *mm = gpusvm->mm;
> > > +	struct vm_area_struct *vas;
> > > +	int err;
> > > +	bool retry = false;
> > > +
> > > +	if (!ctx->mmap_locked) {
> > > +		if (!mmget_not_zero(mm)) {
> > > +			err = -EFAULT;
> > > +			goto err_out;
> > > +		}
> > > +		if (ctx->trylock_mmap) {
> > > +			if (!mmap_read_trylock(mm))  {
> > > +				err = drm_gpusvm_evict_to_sram(gpusvm, range);
> > > +				goto err_mmput;
> > > +			}
> > > +		} else {
> > > +			mmap_read_lock(mm);
> > > +		}
> > > +	}
> > > +
> > > +	mmap_assert_locked(mm);
> > > +
> > > +	/*
> > > +	 * Loop required to find all VMA area structs for the corner case when
> > > +	 * VRAM backing has been partially unmapped from MM's address space.
> > > +	 */
> > > +again:
> > > +	vas = find_vma(mm, start);
> > > +	if (!vas) {
> > > +		if (!retry)
> > > +			err = -ENOENT;
> > > +		goto err_mmunlock;
> > > +	}
> > > +
> > > +	if (end <= vas->vm_start || start >= vas->vm_end) {
> > > +		if (!retry)
> > > +			err = -EINVAL;
> > > +		goto err_mmunlock;
> > > +	}
> > > +
> > > +	err = __drm_gpusvm_migrate_to_sram(gpusvm, vas, NULL, start, end);
> > > +	if (err)
> > > +		goto err_mmunlock;
> > > +
> > > +	if (vas->vm_end < end) {
> > > +		retry = true;
> > > +		start = vas->vm_end;
> > > +		goto again;
> > > +	}
> > > +
> > > +	if (!ctx->mmap_locked) {
> > > +		mmap_read_unlock(mm);
> > > +		/*
> > > +		 * Using mmput_async as this function can be called while
> > > +		 * holding a dma-resv lock, and a final put can grab the mmap
> > > +		 * lock, causing a lock inversion.
> > > +		 */
> > > +		mmput_async(mm);
> > > +	}
> > > +
> > > +	return 0;
> > > +
> > > +err_mmunlock:
> > > +	if (!ctx->mmap_locked)
> > > +		mmap_read_unlock(mm);
> > > +err_mmput:
> > > +	if (!ctx->mmap_locked)
> > > +		mmput_async(mm);
> > > +err_out:
> > > +	return err;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_page_free - Put GPU SVM zone device data associated with a page
> > > + * @page: Pointer to the page
> > > + *
> > > + * This function is a callback used to put the GPU SVM zone device data
> > > + * associated with a page when it is being released.
> > > + */
> > > +static void drm_gpusvm_page_free(struct page *page)
> > > +{
> > > +	drm_gpusvm_zdd_put(page->zone_device_data);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migrate_to_ram - Migrate GPU SVM range to RAM (page fault handler)
> > > + * @vmf: Pointer to the fault information structure
> > > + *
> > > + * This function is a page fault handler used to migrate a GPU SVM range to RAM.
> > > + * It retrieves the GPU SVM range information from the faulting page and invokes
> > > + * the internal migration function to migrate the range back to RAM.
> > > + *
> > > + * Returns:
> > > + * VM_FAULT_SIGBUS on failure, 0 on success.
> > > + */
> > > +static vm_fault_t drm_gpusvm_migrate_to_ram(struct vm_fault *vmf)
> > > +{
> > > +	struct drm_gpusvm_zdd *zdd = vmf->page->zone_device_data;
> > > +	int err;
> > > +
> > > +	err = __drm_gpusvm_migrate_to_sram(zdd->range->gpusvm,
> > > +					   vmf->vma, vmf->page,
> > > +					   zdd->range->va.start,
> > > +					   zdd->range->va.end);
> > > +
> > > +	return err ? VM_FAULT_SIGBUS : 0;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_pagemap_ops - Device page map operations for GPU SVM
> > > + */
> > > +static const struct dev_pagemap_ops drm_gpusvm_pagemap_ops = {
> > > +	.page_free = drm_gpusvm_page_free,
> > > +	.migrate_to_ram = drm_gpusvm_migrate_to_ram,
> > > +};
> > > +
> > > +/**
> > > + * drm_gpusvm_pagemap_ops_get - Retrieve GPU SVM device page map operations
> > > + *
> > > + * Returns:
> > > + * Pointer to the GPU SVM device page map operations structure.
> > > + */
> > > +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void)
> > > +{
> > > +	return &drm_gpusvm_pagemap_ops;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_has_mapping - Check if GPU SVM has mapping for the given address range
> > > + * @gpusvm: Pointer to the GPU SVM structure.
> > > + * @start: Start address
> > > + * @end: End address
> > > + *
> > > + * Returns:
> > > + * True if GPU SVM has mapping, False otherwise
> > > + */
> > > +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start, u64 end)
> > > +{
> > > +	struct drm_gpusvm_notifier *notifier;
> > > +
> > > +	drm_gpusvm_for_each_notifier(notifier, gpusvm, start, end) {
> > > +		struct drm_gpusvm_range *range = NULL;
> > > +
> > > +		drm_gpusvm_for_each_range(range, notifier, start, end)
> > > +			return true;
> > > +	}
> > > +
> > > +	return false;
> > > +}
> > > diff --git a/drivers/gpu/drm/xe/drm_gpusvm.h b/drivers/gpu/drm/xe/drm_gpusvm.h
> > > new file mode 100644
> > > index 000000000000..0ea70f8534a8
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/xe/drm_gpusvm.h
> > > @@ -0,0 +1,415 @@
> > > +/* SPDX-License-Identifier: MIT */
> > > +/*
> > > + * Copyright © 2024 Intel Corporation
> > > + */
> > > +
> > > +#ifndef __DRM_GPUSVM_H__
> > > +#define __DRM_GPUSVM_H__
> > > +
> > > +#include <linux/kref.h>
> > > +#include <linux/mmu_notifier.h>
> > > +#include <linux/workqueue.h>
> > > +
> > > +struct dev_pagemap_ops;
> > > +struct drm_device;
> > > +struct drm_gpusvm;
> > > +struct drm_gpusvm_notifier;
> > > +struct drm_gpusvm_ops;
> > > +struct drm_gpusvm_range;
> > > +
> > > +/**
> > > + * struct drm_gpusvm_ops - Operations structure for GPU SVM
> > > + *
> > > + * This structure defines the operations for GPU Shared Virtual Memory (SVM).
> > > + * These operations are provided by the GPU driver to manage SVM ranges and
> > > + * perform operations such as migration between VRAM and system RAM.
> > > + */
> > > +struct drm_gpusvm_ops {
> > > +	/**
> > > +	 * @notifier_alloc: Allocate a GPU SVM notifier (optional)
> > > +	 *
> > > +	 * This function shall allocate a GPU SVM notifier.
> > > +	 *
> > > +	 * Returns:
> > > +	 * Pointer to the allocated GPU SVM notifier on success, NULL on failure.
> > > +	 */
> > > +	struct drm_gpusvm_notifier *(*notifier_alloc)(void);
> > > +
> > > +	/**
> > > +	 * @notifier_free: Free a GPU SVM notifier (optional)
> > > +	 * @notifier: Pointer to the GPU SVM notifier to be freed
> > > +	 *
> > > +	 * This function shall free a GPU SVM notifier.
> > > +	 */
> > > +	void (*notifier_free)(struct drm_gpusvm_notifier *notifier);
> > > +
> > > +	/**
> > > +	 * @range_alloc: Allocate a GPU SVM range (optional)
> > > +	 * @gpusvm: Pointer to the GPU SVM
> > > +	 *
> > > +	 * This function shall allocate a GPU SVM range.
> > > +	 *
> > > +	 * Returns:
> > > +	 * Pointer to the allocated GPU SVM range on success, NULL on failure.
> > > +	 */
> > > +	struct drm_gpusvm_range *(*range_alloc)(struct drm_gpusvm *gpusvm);
> > > +
> > > +	/**
> > > +	 * @range_free: Free a GPU SVM range (optional)
> > > +	 * @range: Pointer to the GPU SVM range to be freed
> > > +	 *
> > > +	 * This function shall free a GPU SVM range.
> > > +	 */
> > > +	void (*range_free)(struct drm_gpusvm_range *range);
> > > +
> > > +	/**
> > > +	 * @vram_release: Release VRAM allocation (optional)
> > > +	 * @vram_allocation: Driver-private pointer to the VRAM allocation
> > > +	 *
> > > +	 * This function shall release the VRAM allocation and is expected to
> > > +	 * drop its reference to the VRAM allocation.
> > > +	 */
> > > +	void (*vram_release)(void *vram_allocation);
> > > +
> > > +	/**
> > > +	 * @populate_vram_pfn: Populate VRAM PFN (required for migration)
> > > +	 * @gpusvm: Pointer to the GPU SVM
> > > +	 * @vram_allocation: Driver-private pointer to the VRAM allocation
> > > +	 * @npages: Number of pages to populate
> > > +	 * @pfn: Array of page frame numbers to populate
> > > +	 *
> > > +	 * This function shall populate VRAM page frame numbers (PFN).
> > > +	 *
> > > +	 * Returns:
> > > +	 * 0 on success, a negative error code on failure.
> > > +	 */
> > > +	int (*populate_vram_pfn)(struct drm_gpusvm *gpusvm,
> > > +				 void *vram_allocation,
> > > +				 unsigned long npages,
> > > +				 unsigned long *pfn);
> > > +
> > > +	/**
> > > +	 * @copy_to_vram: Copy to VRAM (required for migration)
> > > +	 * @gpusvm: Pointer to the GPU SVM
> > > +	 * @pages: Pointer to array of VRAM pages (destination)
> > > +	 * @dma_addr: Pointer to array of DMA addresses (source)
> > > +	 * @npages: Number of pages to copy
> > > +	 *
> > > +	 * This function shall copy pages to VRAM.
> > > +	 *
> > > +	 * Returns:
> > > +	 * 0 on success, a negative error code on failure.
> > > +	 */
> > > +	int (*copy_to_vram)(struct drm_gpusvm *gpusvm,
> > > +			    struct page **pages,
> > > +			    dma_addr_t *dma_addr,
> > > +			    unsigned long npages);
> > > +
> > > +	/**
> > > +	 * @copy_to_sram: Copy to system RAM (required for migration)
> > > +	 * @gpusvm: Pointer to the GPU SVM
> > > +	 * @pages: Pointer to array of VRAM pages (source)
> > > +	 * @dma_addr: Pointer to array of DMA addresses (destination)
> > > +	 * @npages: Number of pages to copy
> > > +	 *
> > > +	 * This function shall copy pages to system RAM.
> > > +	 *
> > > +	 * Returns:
> > > +	 * 0 on success, a negative error code on failure.
> > > +	 */
> > > +	int (*copy_to_sram)(struct drm_gpusvm *gpusvm,
> > > +			    struct page **pages,
> > > +			    dma_addr_t *dma_addr,
> > > +			    unsigned long npages);
> > > +
> > > +	/**
> > > +	 * @invalidate: Invalidate GPU SVM notifier (required)
> > > +	 * @gpusvm: Pointer to the GPU SVM
> > > +	 * @notifier: Pointer to the GPU SVM notifier
> > > +	 * @mmu_range: Pointer to the mmu_notifier_range structure
> > > +	 *
> > > +	 * This function shall invalidate the GPU page tables. It can safely
> > > +	 * walk the notifier range RB tree/list in this function. Called while
> > > +	 * holding the notifier lock.
> > > +	 */
> > > +	void (*invalidate)(struct drm_gpusvm *gpusvm,
> > > +			   struct drm_gpusvm_notifier *notifier,
> > > +			   const struct mmu_notifier_range *mmu_range);
> > > +};
> > > +
> > > +/**
> > > + * struct drm_gpusvm_notifier - Structure representing a GPU SVM notifier
> > > + *
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @notifier: MMU interval notifier
> > > + * @interval: Interval for the notifier
> > > + * @rb: Red-black tree node for the parent GPU SVM structure notifier tree
> > > + * @root: Cached root node of the RB tree containing ranges
> > > + * @range_list: List head of ranges in the same order they appear in the
> > > + *              interval tree. This is useful for iterating over ranges while
> > > + *              modifying the RB tree.
> > > + * @flags.removed: Flag indicating whether the MMU interval notifier has been
> > > + *                 removed
> > > + *
> > > + * This structure represents a GPU SVM notifier.
> > > + */
> > > +struct drm_gpusvm_notifier {
> > > +	struct drm_gpusvm *gpusvm;
> > > +	struct mmu_interval_notifier notifier;
> > > +	struct {
> > > +		u64 start;
> > > +		u64 end;
> > > +	} interval;
> > > +	struct {
> > > +		struct rb_node node;
> > > +		struct list_head entry;
> > > +		u64 __subtree_last;
> > > +	} rb;
> > > +	struct rb_root_cached root;
> > > +	struct list_head range_list;
> > > +	struct {
> > > +		u32 removed : 1;
> > > +	} flags;
> > > +};
> > > +
> > > +/**
> > > + * struct drm_gpusvm_range - Structure representing a GPU SVM range
> > > + *
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @notifier: Pointer to the GPU SVM notifier
> > > + * @refcount: Reference count for the range
> > > + * @rb: Red-black tree node for the parent GPU SVM notifier structure range tree
> > > + * @va: Virtual address range
> > > + * @notifier_seq: Notifier sequence number of the range's pages
> > > + * @pages: Pointer to the array of pages (if backing store is in VRAM)
> > > + * @dma_addr: DMA address array (if backing store is SRAM and DMA mapped)
> > > + * @vram_allocation: Driver-private pointer to the VRAM allocation
> > > + * @order: Order of dma mapping. i.e. PAGE_SIZE << order is mapping size
> > > + * @flags.migrate_vram: Flag indicating whether the range can be migrated to VRAM
> > > + * @flags.unmapped: Flag indicating if the range has been unmapped
> > > + * @flags.partial_unmap: Flag indicating if the range has been partially unmapped
> > > + * @flags.has_vram_pages: Flag indicating if the range has vram pages
> > > + * @flags.has_dma_mapping: Flag indicating if the range has a DMA mapping
> > > + * @flags.kfree_mapping: Flag indicating @dma_addr is a compact allocation based
> > > + *                       on @order which releases via kfree
> > > + *
> > > + * This structure represents a GPU SVM range used for tracking memory ranges
> > > + * mapped in a DRM device.
> > > + */
> > > +struct drm_gpusvm_range {
> > > +	struct drm_gpusvm *gpusvm;
> > > +	struct drm_gpusvm_notifier *notifier;
> > > +	struct kref refcount;
> > > +	struct {
> > > +		struct rb_node node;
> > > +		struct list_head entry;
> > > +		u64 __subtree_last;
> > > +	} rb;
> > > +	struct {
> > > +		u64 start;
> > > +		u64 end;
> > > +	} va;
> > > +	unsigned long notifier_seq;
> > > +	union {
> > > +		struct page **pages;
> > > +		dma_addr_t *dma_addr;
> > > +	};
> > > +	void *vram_allocation;
> > > +	u16 order;
> > > +	struct {
> > > +		/* All flags below must be set upon creation */
> > > +		u16 migrate_vram : 1;
> > > +		/* All flags below must be set / cleared under notifier lock */
> > > +		u16 unmapped : 1;
> > > +		u16 partial_unmap : 1;
> > > +		u16 has_vram_pages : 1;
> > > +		u16 has_dma_mapping : 1;
> > > +		u16 kfree_mapping : 1;
> > > +	} flags;
> > > +};
> > > +
> > > +/**
> > > + * struct drm_gpusvm - GPU SVM structure
> > > + *
> > > + * @name: Name of the GPU SVM
> > > + * @drm: Pointer to the DRM device structure
> > > + * @mm: Pointer to the mm_struct for the address space
> > > + * @device_private_page_owner: Device private pages owner
> > > + * @mm_start: Start address of GPU SVM
> > > + * @mm_range: Range of the GPU SVM
> > > + * @notifier_size: Size of individual notifiers
> > > + * @ops: Pointer to the operations structure for GPU SVM
> > > + * @chunk_sizes: Pointer to the array of chunk sizes used in range allocation.
> > > + *               Entries should be powers of 2 in descending order.
> > > + * @num_chunks: Number of chunks
> > > + * @notifier_lock: Read-write semaphore for protecting notifier operations
> > > + * @zdd_wq: Workqueue for deferred work on zdd destruction
> > > + * @root: Cached root node of the Red-Black tree containing GPU SVM notifiers
> > > + * @notifier_list: List head of notifiers in the same order they appear in
> > > + *                 the interval tree. This is useful for iterating over
> > > + *                 notifiers while modifying the RB tree.
> > > + *
> > > + * This structure represents a GPU SVM (Shared Virtual Memory) used for tracking
> > > + * memory ranges mapped in a DRM (Direct Rendering Manager) device.
> > > + *
> > > + * No reference counting is provided, as this is expected to be embedded in the
> > > + * driver VM structure along with the struct drm_gpuvm, which handles reference
> > > + * counting.
> > > + */
> > > +struct drm_gpusvm {
> > > +	const char *name;
> > > +	struct drm_device *drm;
> > > +	struct mm_struct *mm;
> > > +	void *device_private_page_owner;
> > > +	u64 mm_start;
> > > +	u64 mm_range;
> > > +	u64 notifier_size;
> > > +	const struct drm_gpusvm_ops *ops;
> > > +	const u64 *chunk_sizes;
> > > +	int num_chunks;
> > > +	struct rw_semaphore notifier_lock;
> > > +	struct workqueue_struct *zdd_wq;
> > > +	struct rb_root_cached root;
> > > +	struct list_head notifier_list;
> > > +};
> > > +
> > > +/**
> > > + * struct drm_gpusvm_ctx - DRM GPU SVM context
> > > + *
> > > + * @mmap_locked: mmap lock is locked
> > > + * @trylock_mmap: trylock mmap lock, used to avoid locking inversions
> > > + *                (e.g. dma-resv -> mmap lock)
> > > + * @in_notifier: entering from a MMU notifier
> > > + * @read_only: operating on read-only memory
> > > + * @vram_possible: possible to use VRAM
> > > + * @prefault: prefault pages
> > > + *
> > > + * Context that DRM GPU SVM is operating in (i.e. user arguments).
> > > + */
> > > +struct drm_gpusvm_ctx {
> > > +	u32 mmap_locked :1;
> > > +	u32 trylock_mmap :1;
> > > +	u32 in_notifier :1;
> > > +	u32 read_only :1;
> > > +	u32 vram_possible :1;
> > > +	u32 prefault :1;
> > > +};
> > > +
> > > +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> > > +		    const char *name, struct drm_device *drm,
> > > +		    struct mm_struct *mm, void *device_private_page_owner,
> > > +		    u64 mm_start, u64 mm_range, u64 notifier_size,
> > > +		    const struct drm_gpusvm_ops *ops,
> > > +		    const u64 *chunk_sizes, int num_chunks);
> > > +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm);
> > > +void drm_gpusvm_free(struct drm_gpusvm *gpusvm);
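
As a rough illustration of how a driver might embed and initialize this
(hypothetical my_vm / my_gpusvm_ops names; the address-space range, notifier
size and chunk sizes below are arbitrary and not taken from the series):

/* Chunk sizes in descending order, as the kernel-doc above requires */
static const u64 my_chunk_sizes[] = { SZ_2M, SZ_64K, SZ_4K };

static int my_vm_svm_init(struct my_vm *vm)
{
	/* vm->svm is a struct drm_gpusvm embedded in the driver VM */
	return drm_gpusvm_init(&vm->svm, "my-svm", vm->drm, current->mm,
			       vm /* device_private_page_owner */,
			       0, 1ull << 47 /* mm_start, mm_range */,
			       SZ_512M /* notifier_size */,
			       &my_gpusvm_ops,
			       my_chunk_sizes, ARRAY_SIZE(my_chunk_sizes));
}
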
> > > +
> > > +struct drm_gpusvm_range *
> > > +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64 fault_addr,
> > > +				u64 gpuva_start, u64 gpuva_end,
> > > +				const struct drm_gpusvm_ctx *ctx);
> > > +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> > > +			     struct drm_gpusvm_range *range);
> > > +
> > > +struct drm_gpusvm_range *
> > > +drm_gpusvm_range_get(struct drm_gpusvm_range *range);
> > > +void drm_gpusvm_range_put(struct drm_gpusvm_range *range);
> > > +
> > > +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> > > +				  struct drm_gpusvm_range *range);
> > > +
> > > +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> > > +			       struct drm_gpusvm_range *range,
> > > +			       const struct drm_gpusvm_ctx *ctx);
> > > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> > > +				  struct drm_gpusvm_range *range,
> > > +				  const struct drm_gpusvm_ctx *ctx);
> > > +
> > > +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> > > +			       struct drm_gpusvm_range *range,
> > > +			       void *vram_allocation,
> > > +			       const struct drm_gpusvm_ctx *ctx);
> > > +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > > +			       struct drm_gpusvm_range *range,
> > > +			       const struct drm_gpusvm_ctx *ctx);
> > > +
> > > +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void);
> > > +
> > > +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start, u64 end);
> > > +
> > > +struct drm_gpusvm_range *
> > > +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64 start, u64 end);
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_lock - Lock GPU SVM notifier
> > > + * @gpusvm__: Pointer to the GPU SVM structure.
> > > + *
> > > + * Abstracts the client's use of the GPU SVM notifier lock; takes the lock.
> > > + */
> > > +#define drm_gpusvm_notifier_lock(gpusvm__)	\
> > > +	down_read(&(gpusvm__)->notifier_lock)
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_unlock - Unlock GPU SVM notifier
> > > + * @gpusvm__: Pointer to the GPU SVM structure.
> > > + *
> > > + * Abstracts the client's use of the GPU SVM notifier lock; drops the lock.
> > > + */
> > > +#define drm_gpusvm_notifier_unlock(gpusvm__)	\
> > > +	up_read(&(gpusvm__)->notifier_lock)
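
Putting the declarations and lock macros above together, a driver GPU
page-fault handler would presumably look something like the sketch below.
Everything here is illustrative (my_* names, the bare vram_allocation cookie,
and the assumption that drm_gpusvm_range_find_or_insert() returns an ERR_PTR
on failure), not code from the series:

static int my_handle_gpu_fault(struct drm_gpusvm *gpusvm, u64 fault_addr,
			       u64 gpuva_start, u64 gpuva_end,
			       void *vram_allocation)
{
	const struct drm_gpusvm_ctx ctx = {
		.vram_possible = !!vram_allocation,
	};
	struct drm_gpusvm_range *range;
	int err;

	range = drm_gpusvm_range_find_or_insert(gpusvm, fault_addr,
						gpuva_start, gpuva_end, &ctx);
	if (IS_ERR(range))
		return PTR_ERR(range);

	/* Migration failure is not fatal here; SRAM backing still works */
	if (vram_allocation && range->flags.migrate_vram)
		drm_gpusvm_migrate_to_vram(gpusvm, range, vram_allocation, &ctx);

	err = drm_gpusvm_range_get_pages(gpusvm, range, &ctx);
	if (err)
		return err;

	/*
	 * Publish the GPU mapping under the notifier lock so a concurrent
	 * invalidation cannot race the bind; retry the fault on -EAGAIN.
	 */
	drm_gpusvm_notifier_lock(gpusvm);
	if (!drm_gpusvm_range_pages_valid(gpusvm, range)) {
		drm_gpusvm_notifier_unlock(gpusvm);
		return -EAGAIN;
	}
	err = my_program_gpu_page_tables(range);	/* driver-specific */
	drm_gpusvm_notifier_unlock(gpusvm);

	return err;
}
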
> > > +
> > > +/**
> > > + * __drm_gpusvm_range_next - Get the next GPU SVM range in the list
> > > + * @range: a pointer to the current GPU SVM range
> > > + *
> > > + * Return: A pointer to the next drm_gpusvm_range if available, or NULL if the
> > > + *         current range is the last one or if the input range is NULL.
> > > + */
> > > +static inline struct drm_gpusvm_range *
> > > +__drm_gpusvm_range_next(struct drm_gpusvm_range *range)
> > > +{
> > > +	if (range && !list_is_last(&range->rb.entry,
> > > +				   &range->notifier->range_list))
> > > +		return list_next_entry(range, rb.entry);
> > > +
> > > +	return NULL;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_for_each_range - Iterate over GPU SVM ranges in a notifier
> > > + * @range__: Iterator variable for the ranges. If set, it indicates the start of
> > > + *	     the iterator. If NULL, call drm_gpusvm_range_find() to get the range.
> > > + * @notifier__: Pointer to the GPU SVM notifier
> > > + * @start__: Start address of the range
> > > + * @end__: End address of the range
> > > + *
> > > + * This macro is used to iterate over GPU SVM ranges in a notifier. It is safe
> > > + * to use while holding the driver SVM lock or the notifier lock.
> > > + */
> > > +#define drm_gpusvm_for_each_range(range__, notifier__, start__, end__)	\
> > > +	for ((range__) = (range__) ?:					\
> > > +	     drm_gpusvm_range_find((notifier__), (start__), (end__));	\
> > > +	     (range__) && (range__->va.start < (end__));		\
> > > +	     (range__) = __drm_gpusvm_range_next(range__))
> > > +
> > > +/**
> > > + * drm_gpusvm_range_set_unmapped - Mark a GPU SVM range as unmapped
> > > + * @range: Pointer to the GPU SVM range structure.
> > > + * @mmu_range: Pointer to the MMU notifier range structure.
> > > + *
> > > + * This function marks a GPU SVM range as unmapped and sets the partial_unmap flag
> > > + * if the range partially falls within the provided MMU notifier range.
> > > + */
> > > +static inline void
> > > +drm_gpusvm_range_set_unmapped(struct drm_gpusvm_range *range,
> > > +			      const struct mmu_notifier_range *mmu_range)
> > > +{
> > > +	lockdep_assert_held_write(&range->gpusvm->notifier_lock);
> > > +
> > > +	range->flags.unmapped = true;
> > > +	if (range->va.start < mmu_range->start ||
> > > +	    range->va.end > mmu_range->end)
> > > +		range->flags.partial_unmap = true;
> > > +}
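
For completeness, an ->invalidate() hook built on the helpers above might
look roughly like the following. This is illustrative only; in particular,
marking ranges unmapped solely on MMU_NOTIFY_UNMAP events is my assumption,
not something this header mandates:

static void my_invalidate(struct drm_gpusvm *gpusvm,
			  struct drm_gpusvm_notifier *notifier,
			  const struct mmu_notifier_range *mmu_range)
{
	const struct drm_gpusvm_ctx ctx = { .in_notifier = true };
	struct drm_gpusvm_range *range = NULL;

	/* The notifier lock is already held in write mode here */
	drm_gpusvm_for_each_range(range, notifier, mmu_range->start,
				  mmu_range->end) {
		/* Zap the GPU PTEs covering range->va (driver-specific) */
		my_zap_gpu_page_tables(range);

		if (mmu_range->event == MMU_NOTIFY_UNMAP)
			drm_gpusvm_range_set_unmapped(range, mmu_range);

		/* Required on each range per the unmap_pages kernel-doc */
		drm_gpusvm_range_unmap_pages(gpusvm, range, &ctx);
	}
}
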
> > > +
> > > +#endif /* __DRM_GPUSVM_H__ */
> > > -- 
> > > 2.34.1
> > > 
> 

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory
  2024-08-28 15:43       ` Matthew Brost
@ 2024-08-28 16:06         ` Alex Deucher
  2024-08-28 16:25         ` Daniel Vetter
  1 sibling, 0 replies; 100+ messages in thread
From: Alex Deucher @ 2024-08-28 16:06 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Christian König, Daniel Vetter, intel-xe, dri-devel, airlied,
	thomas.hellstrom, matthew.auld, daniel

On Wed, Aug 28, 2024 at 11:53 AM Matthew Brost <matthew.brost@intel.com> wrote:
>
> On Wed, Aug 28, 2024 at 04:46:24PM +0200, Christian König wrote:
> > > On 28.08.24 at 16:31, Daniel Vetter wrote:
> > > On Tue, Aug 27, 2024 at 07:48:38PM -0700, Matthew Brost wrote:
> > > > +         if (!ctx->mmap_locked) {
> > > > +                 /*
> > > > +                  * XXX: HMM locking document indicates only a read-lock
> > > > +                  * is required but there appears to be a window between
> > > > +                  * the MMU_NOTIFY_MIGRATE event triggered in a CPU fault
> > > > +                  * via migrate_vma_setup and the pages actually moving
> > > > +                  * in migrate_vma_finalize in which this code can grab
> > > > +                  * garbage pages. Grabbing the write-lock if the range
> > > > +                  * is attached to vram appears to protect against this
> > > > +                  * race.
> > > > +                  */
>
> Thanks for the comments, replying to both of you inline.
>
> > > This one is really scary, since it means the entire migrate pte trickery
> > > is essentially completely busted. Grabbing the mmap write lock just means
> > > you block out pretty much everything interesting from concurrently
> > > happening.
> > >
> > > My gut feeling says we need to figure out what's happening here, because
> > > this looks a bit too fundamental to me.
>
> I agree. I haven’t looked into this issue for a couple of months but
> really need to understand what is going on.
>
> I should have mentioned this in the cover letter: the goal of this
> series was to produce something for review that is stable and supports
> UMDs/user applications. It was not intended to be presented as a final
> solution. This issue certainly falls into the category of "needs to be
> understood and requires a proper fix."
>
> One open question I have is whether the test case that triggers this
> issue is even defined behavior. The test creates concurrent access
> between the GPU and CPU to the same memory address, resulting in GPU and
> CPU faults racing against each other. It’s possible that this is
> undefined behavior, so data corruption might be acceptable—i.e., the
> kernel can’t crash, but incorrect results might be permissible.
>
> e.g. This is the only defined usage model:
>
> alloc_memory();
> start_compute_kernel();
> sync_on_compute_kernel_completion();
> read_memory();
>
> Hopefully, in the next week or so, I'll be heavily engaging with the UMD
> teams. Development can then start, and applications will be running soon
> after. This will allow us to address issues like this, collect data on
> memory usage, and verify some of the assumptions I've made, such as
> optimizing for 2M+ allocations.
>
> >
> > I think I have at least a high-level understanding of what's going on here;
> > Felix and especially Philip should know more of the details.
> >
>
> I meant to reach out to AMD for issues like this. So, Felix
> (felix.kuehling@amd.com) and Philip (Philip.Yang@amd.com) would be good
> contacts?

Yes.

Alex

>
> > In general, grabbing the mm_lock to protect PTEs from changing is complete
> > nonsense. The mm_lock is to protect the VMAs and *not* the PTEs!
> >
>
> Thanks for the hint. I believe I noticed some additional locks for migration
> in the AMD implementation, which might be how you mitigated this issue.
>
> I must say it is a bit unfortunate that the HMM locking documentation
> doesn’t mention this. I believe the documentation needs additional
> information, which I can add once we finalize the solution.
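
For reference, the retry pattern Documentation/mm/hmm.rst does describe keys
correctness off the notifier sequence number plus a driver-side lock rather
than the mmap lock. A paraphrased sketch of that documented flow (my_* names
and the update_lock are placeholders, not code from this series):

static int my_populate_range(struct my_driver *driver,
			     struct hmm_range *range,
			     struct mmu_interval_notifier *interval_sub,
			     struct mm_struct *mm)
{
	int ret;

again:
	range->notifier_seq = mmu_interval_read_begin(interval_sub);
	mmap_read_lock(mm);
	ret = hmm_range_fault(range);
	mmap_read_unlock(mm);
	if (ret) {
		if (ret == -EBUSY)
			goto again;
		return ret;
	}

	mutex_lock(&driver->update_lock);
	if (mmu_interval_read_retry(interval_sub, range->notifier_seq)) {
		mutex_unlock(&driver->update_lock);
		goto again;
	}
	/* Use range->hmm_pfns to update the device page tables here */
	mutex_unlock(&driver->update_lock);
	return 0;
}
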
>
> Matt
>
> > Even with the write side of the mm_lock taken, it is perfectly possible that
> > PTEs change. It's just less likely.
> >
> > We ran into multiple issues before we figured out this important distinction
> > as well.
> >
> > Christian.
> >
> > > -Sima
> > >
> > >
> > > > +                 if (vram_pages)
> > > > +                         mmap_write_lock(mm);
> > > > +                 else
> > > > +                         mmap_read_lock(mm);
> > > > +         }
> > > > +         err = hmm_range_fault(&hmm_range);
> > > > +         if (!ctx->mmap_locked) {
> > > > +                 if (vram_pages)
> > > > +                         mmap_write_unlock(mm);
> > > > +                 else
> > > > +                         mmap_read_unlock(mm);
> > > > +         }
> > > > +
> > > > +         if (err == -EBUSY) {
> > > > +                 if (time_after(jiffies, timeout))
> > > > +                         break;
> > > > +
> > > > +                 hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
> > > > +                 continue;
> > > > +         }
> > > > +         break;
> > > > + }
> > > > + if (!ctx->mmap_locked)
> > > > +         mmput(mm);
> > > > + if (err)
> > > > +         goto err_free;
> > > > +
> > > > + pages = (struct page **)pfns;
> > > > +
> > > > + if (ctx->prefault) {
> > > > +         range->pages = pages;
> > > > +         goto set_seqno;
> > > > + }
> > > > +
> > > > +map_pages:
> > > > + if (is_device_private_page(hmm_pfn_to_page(pfns[0]))) {
> > > > +         WARN_ON_ONCE(!range->vram_allocation);
> > > > +
> > > > +         for (i = 0; i < npages; ++i) {
> > > > +                 pages[i] = hmm_pfn_to_page(pfns[i]);
> > > > +
> > > > +                 if (WARN_ON_ONCE(!is_device_private_page(pages[i]))) {
> > > > +                         err = -EOPNOTSUPP;
> > > > +                         goto err_free;
> > > > +                 }
> > > > +         }
> > > > +
> > > > +         /* Do not race with notifier unmapping pages */
> > > > +         drm_gpusvm_notifier_lock(gpusvm);
> > > > +         range->flags.has_vram_pages = true;
> > > > +         range->pages = pages;
> > > > +         if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
> > > > +                 err = -EAGAIN;
> > > > +                 __drm_gpusvm_range_unmap_pages(gpusvm, range);
> > > > +         }
> > > > +         drm_gpusvm_notifier_unlock(gpusvm);
> > > > + } else {
> > > > +         dma_addr_t *dma_addr = (dma_addr_t *)pfns;
> > > > +
> > > > +         for_each_dma_page(i, j, npages, order) {
> > > > +                 if (WARN_ON_ONCE(i && order !=
> > > > +                                  hmm_pfn_to_map_order(pfns[i]))) {
> > > > +                         err = -EOPNOTSUPP;
> > > > +                         npages = i;
> > > > +                         goto err_unmap;
> > > > +                 }
> > > > +                 order = hmm_pfn_to_map_order(pfns[i]);
> > > > +
> > > > +                 pages[j] = hmm_pfn_to_page(pfns[i]);
> > > > +                 if (WARN_ON_ONCE(is_zone_device_page(pages[j]))) {
> > > > +                         err = -EOPNOTSUPP;
> > > > +                         npages = i;
> > > > +                         goto err_unmap;
> > > > +                 }
> > > > +
> > > > +                 set_page_dirty_lock(pages[j]);
> > > > +                 mark_page_accessed(pages[j]);
> > > > +
> > > > +                 dma_addr[j] = dma_map_page(gpusvm->drm->dev,
> > > > +                                            pages[j], 0,
> > > > +                                            PAGE_SIZE << order,
> > > > +                                            DMA_BIDIRECTIONAL);
> > > > +                 if (dma_mapping_error(gpusvm->drm->dev, dma_addr[j])) {
> > > > +                         err = -EFAULT;
> > > > +                         npages = i;
> > > > +                         goto err_unmap;
> > > > +                 }
> > > > +         }
> > > > +
> > > > +         /* Huge pages, reduce memory footprint */
> > > > +         if (order) {
> > > > +                 dma_addr = kmalloc_array(j, sizeof(*dma_addr),
> > > > +                                          GFP_KERNEL);
> > > > +                 if (dma_addr) {
> > > > +                         for (i = 0; i < j; ++i)
> > > > +                                 dma_addr[i] = (dma_addr_t)pfns[i];
> > > > +                         kvfree(pfns);
> > > > +                         kfree_mapping = true;
> > > > +                 } else {
> > > > +                         dma_addr = (dma_addr_t *)pfns;
> > > > +                 }
> > > > +         }
> > > > +
> > > > +         /* Do not race with notifier unmapping pages */
> > > > +         drm_gpusvm_notifier_lock(gpusvm);
> > > > +         range->order = order;
> > > > +         range->flags.kfree_mapping = kfree_mapping;
> > > > +         range->flags.has_dma_mapping = true;
> > > > +         range->dma_addr = dma_addr;
> > > > +         range->vram_allocation = NULL;
> > > > +         if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
> > > > +                 err = -EAGAIN;
> > > > +                 __drm_gpusvm_range_unmap_pages(gpusvm, range);
> > > > +         }
> > > > +         drm_gpusvm_notifier_unlock(gpusvm);
> > > > + }
> > > > +
> > > > + if (err == -EAGAIN)
> > > > +         goto retry;
> > > > +set_seqno:
> > > > + range->notifier_seq = hmm_range.notifier_seq;
> > > > +
> > > > + return 0;
> > > > +
> > > > +err_unmap:
> > > > + for_each_dma_page(i, j, npages, order)
> > > > +         dma_unmap_page(gpusvm->drm->dev,
> > > > +                        (dma_addr_t)pfns[j],
> > > > +                        PAGE_SIZE << order, DMA_BIDIRECTIONAL);
> > > > +err_free:
> > > > + if (alloc_pfns)
> > > > +         kvfree(pfns);
> > > > +err_out:
> > > > + return err;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_unmap_pages - Unmap pages associated with a GPU SVM range
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @range: Pointer to the GPU SVM range structure
> > > > + * @ctx: GPU SVM context
> > > > + *
> > > > + * This function unmaps pages associated with a GPU SVM range. If @in_notifier
> > > > + * is set, it is assumed that gpusvm->notifier_lock is held in write mode; if it
> > > > + * is clear, it acquires gpusvm->notifier_lock in read mode. Must be called on
> > > > + * each GPU SVM range attached to notifier in gpusvm->ops->invalidate for IOMMU
> > > > + * security model.
> > > > + */
> > > > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> > > > +                           struct drm_gpusvm_range *range,
> > > > +                           const struct drm_gpusvm_ctx *ctx)
> > > > +{
> > > > + if (ctx->in_notifier)
> > > > +         lockdep_assert_held_write(&gpusvm->notifier_lock);
> > > > + else
> > > > +         drm_gpusvm_notifier_lock(gpusvm);
> > > > +
> > > > + __drm_gpusvm_range_unmap_pages(gpusvm, range);
> > > > +
> > > > + if (!ctx->in_notifier)
> > > > +         drm_gpusvm_notifier_unlock(gpusvm);
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_migration_put_page - Put a migration page
> > > > + * @page: Pointer to the page to put
> > > > + *
> > > > + * This function unlocks and puts a page.
> > > > + */
> > > > +static void drm_gpusvm_migration_put_page(struct page *page)
> > > > +{
> > > > + unlock_page(page);
> > > > + put_page(page);
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_migration_put_pages - Put migration pages
> > > > + * @npages: Number of pages
> > > > + * @migrate_pfn: Array of migrate page frame numbers
> > > > + *
> > > > + * This function puts an array of pages.
> > > > + */
> > > > +static void drm_gpusvm_migration_put_pages(unsigned long npages,
> > > > +                                    unsigned long *migrate_pfn)
> > > > +{
> > > > + unsigned long i;
> > > > +
> > > > + for (i = 0; i < npages; ++i) {
> > > > +         if (!migrate_pfn[i])
> > > > +                 continue;
> > > > +
> > > > +         drm_gpusvm_migration_put_page(migrate_pfn_to_page(migrate_pfn[i]));
> > > > +         migrate_pfn[i] = 0;
> > > > + }
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_get_vram_page - Get a reference to a VRAM page
> > > > + * @page: Pointer to the page
> > > > + * @zdd: Pointer to the GPU SVM zone device data
> > > > + *
> > > > + * This function associates the given page with the specified GPU SVM zone
> > > > + * device data and initializes it for zone device usage.
> > > > + */
> > > > +static void drm_gpusvm_get_vram_page(struct page *page,
> > > > +                              struct drm_gpusvm_zdd *zdd)
> > > > +{
> > > > + page->zone_device_data = drm_gpusvm_zdd_get(zdd);
> > > > + zone_device_page_init(page);
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_migrate_map_pages() - Map migration pages for GPU SVM migration
> > > > + * @dev: The device for which the pages are being mapped
> > > > + * @dma_addr: Array to store DMA addresses corresponding to mapped pages
> > > > + * @migrate_pfn: Array of migrate page frame numbers to map
> > > > + * @npages: Number of pages to map
> > > > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > > > + *
> > > > + * This function maps pages of memory for migration usage in GPU SVM. It
> > > > + * iterates over each page frame number provided in @migrate_pfn, maps the
> > > > + * corresponding page, and stores the DMA address in the provided @dma_addr
> > > > + * array.
> > > > + *
> > > > + * Return: 0 on success, -EFAULT if an error occurs during mapping.
> > > > + */
> > > > +static int drm_gpusvm_migrate_map_pages(struct device *dev,
> > > > +                                 dma_addr_t *dma_addr,
> > > > +                                 long unsigned int *migrate_pfn,
> > > > +                                 unsigned long npages,
> > > > +                                 enum dma_data_direction dir)
> > > > +{
> > > > + unsigned long i;
> > > > +
> > > > + for (i = 0; i < npages; ++i) {
> > > > +         struct page *page = migrate_pfn_to_page(migrate_pfn[i]);
> > > > +
> > > > +         if (!page)
> > > > +                 continue;
> > > > +
> > > > +         if (WARN_ON_ONCE(is_zone_device_page(page)))
> > > > +                 return -EFAULT;
> > > > +
> > > > +         dma_addr[i] = dma_map_page(dev, page, 0, PAGE_SIZE, dir);
> > > > +         if (dma_mapping_error(dev, dma_addr[i]))
> > > > +                 return -EFAULT;
> > > > + }
> > > > +
> > > > + return 0;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_migrate_unmap_pages() - Unmap pages previously mapped for GPU SVM migration
> > > > + * @dev: The device for which the pages were mapped
> > > > + * @dma_addr: Array of DMA addresses corresponding to mapped pages
> > > > + * @npages: Number of pages to unmap
> > > > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > > > + *
> > > > + * This function unmaps previously mapped pages of memory for GPU Shared Virtual
> > > > + * Memory (SVM). It iterates over each DMA address provided in @dma_addr, checks
> > > > + * if it's valid and not already unmapped, and unmaps the corresponding page.
> > > > + */
> > > > +static void drm_gpusvm_migrate_unmap_pages(struct device *dev,
> > > > +                                    dma_addr_t *dma_addr,
> > > > +                                    unsigned long npages,
> > > > +                                    enum dma_data_direction dir)
> > > > +{
> > > > + unsigned long i;
> > > > +
> > > > + for (i = 0; i < npages; ++i) {
> > > > +         if (!dma_addr[i] || dma_mapping_error(dev, dma_addr[i]))
> > > > +                 continue;
> > > > +
> > > > +         dma_unmap_page(dev, dma_addr[i], PAGE_SIZE, dir);
> > > > + }
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_migrate_to_vram - Migrate GPU SVM range to VRAM
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @range: Pointer to the GPU SVM range structure
> > > > + * @vram_allocation: Driver-private pointer to the VRAM allocation. The caller
> > > > + *                   should hold a reference to the VRAM allocation, which
> > > > + *                   should be dropped via ops->vram_release or upon the
> > > > + *                   failure of this function.
> > > > + * @ctx: GPU SVM context
> > > > + *
> > > > + * This function migrates the specified GPU SVM range to VRAM. It performs the
> > > > + * necessary setup and invokes the driver-specific operations for migration to
> > > > + * VRAM. Upon successful return, @vram_allocation can safely reference @range
> > > > + * until ops->vram_release is called, which only happens after a successful
> > > > + * return of this function.
> > > > + *
> > > > + * Returns:
> > > > + * 0 on success, negative error code on failure.
> > > > + */
> > > > +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> > > > +                        struct drm_gpusvm_range *range,
> > > > +                        void *vram_allocation,
> > > > +                        const struct drm_gpusvm_ctx *ctx)
> > > > +{
> > > > + u64 start = range->va.start, end = range->va.end;
> > > > + struct migrate_vma migrate = {
> > > > +         .start          = start,
> > > > +         .end            = end,
> > > > +         .pgmap_owner    = gpusvm->device_private_page_owner,
> > > > +         .flags          = MIGRATE_VMA_SELECT_SYSTEM,
> > > > + };
> > > > + struct mm_struct *mm = gpusvm->mm;
> > > > + unsigned long i, npages = npages_in_range(start, end);
> > > > + struct vm_area_struct *vas;
> > > > + struct drm_gpusvm_zdd *zdd = NULL;
> > > > + struct page **pages;
> > > > + dma_addr_t *dma_addr;
> > > > + void *buf;
> > > > + int err;
> > > > +
> > > > + if (!range->flags.migrate_vram)
> > > > +         return -EINVAL;
> > > > +
> > > > + if (!gpusvm->ops->populate_vram_pfn || !gpusvm->ops->copy_to_vram ||
> > > > +     !gpusvm->ops->copy_to_sram)
> > > > +         return -EOPNOTSUPP;
> > > > +
> > > > + if (!ctx->mmap_locked) {
> > > > +         if (!mmget_not_zero(mm)) {
> > > > +                 err = -EFAULT;
> > > > +                 goto err_out;
> > > > +         }
> > > > +         mmap_write_lock(mm);
> > > > + }
> > > > +
> > > > + mmap_assert_locked(mm);
> > > > +
> > > > + vas = vma_lookup(mm, start);
> > > > + if (!vas) {
> > > > +         err = -ENOENT;
> > > > +         goto err_mmunlock;
> > > > + }
> > > > +
> > > > + if (end > vas->vm_end || start < vas->vm_start) {
> > > > +         err = -EINVAL;
> > > > +         goto err_mmunlock;
> > > > + }
> > > > +
> > > > + if (!vma_is_anonymous(vas)) {
> > > > +         err = -EBUSY;
> > > > +         goto err_mmunlock;
> > > > + }
> > > > +
> > > > + buf = kvcalloc(npages, 2 * sizeof(*migrate.src) + sizeof(*dma_addr) +
> > > > +                sizeof(*pages), GFP_KERNEL);
> > > > + if (!buf) {
> > > > +         err = -ENOMEM;
> > > > +         goto err_mmunlock;
> > > > + }
> > > > + dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > > > + pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr)) * npages;
> > > > +
> > > > + zdd = drm_gpusvm_zdd_alloc(range);
> > > > + if (!zdd) {
> > > > +         err = -ENOMEM;
> > > > +         goto err_free;
> > > > + }
> > > > +
> > > > + migrate.vma = vas;
> > > > + migrate.src = buf;
> > > > + migrate.dst = migrate.src + npages;
> > > > +
> > > > + err = migrate_vma_setup(&migrate);
> > > > + if (err)
> > > > +         goto err_free;
> > > > +
> > > > + /*
> > > > +  * FIXME: The cases below, !migrate.cpages and migrate.cpages != npages, are
> > > > +  * not always an error. Need to revisit the possible cases and how to handle
> > > > +  * them. We could prefault on migrate.cpages != npages via hmm_range_fault.
> > > > +  */
> > > > +
> > > > + if (!migrate.cpages) {
> > > > +         err = -EFAULT;
> > > > +         goto err_free;
> > > > + }
> > > > +
> > > > + if (migrate.cpages != npages) {
> > > > +         err = -EBUSY;
> > > > +         goto err_finalize;
> > > > + }
> > > > +
> > > > + err = gpusvm->ops->populate_vram_pfn(gpusvm, vram_allocation, npages,
> > > > +                                      migrate.dst);
> > > > + if (err)
> > > > +         goto err_finalize;
> > > > +
> > > > + err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev, dma_addr,
> > > > +                                    migrate.src, npages, DMA_TO_DEVICE);
> > > > + if (err)
> > > > +         goto err_finalize;
> > > > +
> > > > + for (i = 0; i < npages; ++i) {
> > > > +         struct page *page = pfn_to_page(migrate.dst[i]);
> > > > +
> > > > +         pages[i] = page;
> > > > +         migrate.dst[i] = migrate_pfn(migrate.dst[i]);
> > > > +         drm_gpusvm_get_vram_page(page, zdd);
> > > > + }
> > > > +
> > > > + err = gpusvm->ops->copy_to_vram(gpusvm, pages, dma_addr, npages);
> > > > + if (err)
> > > > +         goto err_finalize;
> > > > +
> > > > + /* Upon success bind vram allocation to range and zdd */
> > > > + range->vram_allocation = vram_allocation;
> > > > + WRITE_ONCE(zdd->vram_allocation, vram_allocation);      /* Owns ref */
> > > > +
> > > > +err_finalize:
> > > > + if (err)
> > > > +         drm_gpusvm_migration_put_pages(npages, migrate.dst);
> > > > + migrate_vma_pages(&migrate);
> > > > + migrate_vma_finalize(&migrate);
> > > > + drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr, npages,
> > > > +                                DMA_TO_DEVICE);
> > > > +err_free:
> > > > + if (zdd)
> > > > +         drm_gpusvm_zdd_put(zdd);
> > > > + kvfree(buf);
> > > > +err_mmunlock:
> > > > + if (!ctx->mmap_locked) {
> > > > +         mmap_write_unlock(mm);
> > > > +         mmput(mm);
> > > > + }
> > > > +err_out:
> > > > + return err;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_migrate_populate_sram_pfn - Populate SRAM PFNs for a VM area
> > > > + * @vas: Pointer to the VM area structure, can be NULL
> > > > + * @npages: Number of pages to populate
> > > > + * @src_mpfn: Source array of migrate PFNs
> > > > + * @mpfn: Array of migrate PFNs to populate
> > > > + * @addr: Start address for PFN allocation
> > > > + *
> > > > + * This function populates the SRAM migrate page frame numbers (PFNs) for the
> > > > + * specified VM area structure. It allocates and locks pages in the VM area for
> > > > + * SRAM usage. If @vas is non-NULL, alloc_page_vma() is used for allocation;
> > > > + * otherwise, alloc_page() is used.
> > > > + *
> > > > + * Returns:
> > > > + * 0 on success, negative error code on failure.
> > > > + */
> > > > +static int drm_gpusvm_migrate_populate_sram_pfn(struct vm_area_struct *vas,
> > > > +                                         unsigned long npages,
> > > > +                                         unsigned long *src_mpfn,
> > > > +                                         unsigned long *mpfn, u64 addr)
> > > > +{
> > > > + unsigned long i;
> > > > +
> > > > + for (i = 0; i < npages; ++i, addr += PAGE_SIZE) {
> > > > +         struct page *page;
> > > > +
> > > > +         if (!(src_mpfn[i] & MIGRATE_PFN_MIGRATE))
> > > > +                 continue;
> > > > +
> > > > +         if (vas)
> > > > +                 page = alloc_page_vma(GFP_HIGHUSER, vas, addr);
> > > > +         else
> > > > +                 page = alloc_page(GFP_HIGHUSER);
> > > > +
> > > > +         if (!page)
> > > > +                 return -ENOMEM;
> > > > +
> > > > +         lock_page(page);
> > > > +         mpfn[i] = migrate_pfn(page_to_pfn(page));
> > > > + }
> > > > +
> > > > + return 0;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_evict_to_sram - Evict GPU SVM range to SRAM
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @range: Pointer to the GPU SVM range structure
> > > > + *
> > > > + * Similar to __drm_gpusvm_migrate_to_sram but does not require the mmap lock;
> > > > + * migration is done via the migrate_device_* functions. This is a fallback
> > > > + * path, as it is preferred to issue migrations with the mmap lock held.
> > > > + *
> > > > + * Returns:
> > > > + * 0 on success, negative error code on failure.
> > > > + */
> > > > +static int drm_gpusvm_evict_to_sram(struct drm_gpusvm *gpusvm,
> > > > +                             struct drm_gpusvm_range *range)
> > > > +{
> > > > + unsigned long npages;
> > > > + struct page **pages;
> > > > + unsigned long *src, *dst;
> > > > + dma_addr_t *dma_addr;
> > > > + void *buf;
> > > > + int i, err = 0;
> > > > +
> > > > + npages = npages_in_range(range->va.start, range->va.end);
> > > > +
> > > > + buf = kvcalloc(npages, 2 * sizeof(*src) + sizeof(*dma_addr) +
> > > > +                sizeof(*pages), GFP_KERNEL);
> > > > + if (!buf) {
> > > > +         err = -ENOMEM;
> > > > +         goto err_out;
> > > > + }
> > > > + src = buf;
> > > > + dst = buf + (sizeof(*src) * npages);
> > > > + dma_addr = buf + (2 * sizeof(*src) * npages);
> > > > + pages = buf + (2 * sizeof(*src) + sizeof(*dma_addr)) * npages;
> > > > +
> > > > + err = gpusvm->ops->populate_vram_pfn(gpusvm, range->vram_allocation,
> > > > +                                      npages, src);
> > > > + if (err)
> > > > +         goto err_free;
> > > > +
> > > > + err = migrate_device_vma_range(gpusvm->mm,
> > > > +                                gpusvm->device_private_page_owner, src,
> > > > +                                npages, range->va.start);
> > > > + if (err)
> > > > +         goto err_free;
> > > > +
> > > > + err = drm_gpusvm_migrate_populate_sram_pfn(NULL, npages, src, dst, 0);
> > > > + if (err)
> > > > +         goto err_finalize;
> > > > +
> > > > + err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev, dma_addr,
> > > > +                                    dst, npages, DMA_BIDIRECTIONAL);
> > > > + if (err)
> > > > +         goto err_finalize;
> > > > +
> > > > + for (i = 0; i < npages; ++i)
> > > > +         pages[i] = migrate_pfn_to_page(src[i]);
> > > > +
> > > > + err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr, npages);
> > > > + if (err)
> > > > +         goto err_finalize;
> > > > +
> > > > +err_finalize:
> > > > + if (err)
> > > > +         drm_gpusvm_migration_put_pages(npages, dst);
> > > > + migrate_device_pages(src, dst, npages);
> > > > + migrate_device_finalize(src, dst, npages);
> > > > + drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr, npages,
> > > > +                                DMA_BIDIRECTIONAL);
> > > > +err_free:
> > > > + kvfree(buf);
> > > > +err_out:
> > > > +
> > > > + return err;
> > > > +}
> > > > +
> > > > +/**
> > > > + * __drm_gpusvm_migrate_to_sram - Migrate GPU SVM range to SRAM (internal)
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @vas: Pointer to the VM area structure
> > > > + * @page: Pointer to the page for fault handling (can be NULL)
> > > > + * @start: Start address of the migration range
> > > > + * @end: End address of the migration range
> > > > + *
> > > > + * This internal function performs the migration of the specified GPU SVM range
> > > > + * to SRAM. It sets up the migration, populates and DMA maps the SRAM PFNs, and
> > > > + * invokes the driver-specific operations for migration to SRAM.
> > > > + *
> > > > + * Returns:
> > > > + * 0 on success, negative error code on failure.
> > > > + */
> > > > +static int __drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > > > +                                 struct vm_area_struct *vas,
> > > > +                                 struct page *page,
> > > > +                                 u64 start, u64 end)
> > > > +{
> > > > + struct migrate_vma migrate = {
> > > > +         .vma            = vas,
> > > > +         .pgmap_owner    = gpusvm->device_private_page_owner,
> > > > +         .flags          = MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> > > > +         .fault_page     = page,
> > > > + };
> > > > + unsigned long npages;
> > > > + struct page **pages;
> > > > + dma_addr_t *dma_addr;
> > > > + void *buf;
> > > > + int i, err = 0;
> > > > +
> > > > + mmap_assert_locked(gpusvm->mm);
> > > > +
> > > > + /* Corner case where the VM area struct has been partially unmapped */
> > > > + if (start < vas->vm_start)
> > > > +         start = vas->vm_start;
> > > > + if (end > vas->vm_end)
> > > > +         end = vas->vm_end;
> > > > +
> > > > + migrate.start = start;
> > > > + migrate.end = end;
> > > > + npages = npages_in_range(start, end);
> > > > +
> > > > + buf = kvcalloc(npages, 2 * sizeof(*migrate.src) + sizeof(*dma_addr) +
> > > > +                sizeof(*pages), GFP_KERNEL);
> > > > + if (!buf) {
> > > > +         err = -ENOMEM;
> > > > +         goto err_out;
> > > > + }
> > > > + dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > > > + pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr)) * npages;
> > > > +
> > > > + migrate.vma = vas;
> > > > + migrate.src = buf;
> > > > + migrate.dst = migrate.src + npages;
> > > > +
> > > > + err = migrate_vma_setup(&migrate);
> > > > + if (err)
> > > > +         goto err_free;
> > > > +
> > > > + /* Raced with another CPU fault, nothing to do */
> > > > + if (!migrate.cpages)
> > > > +         goto err_free;
> > > > +
> > > > + err = drm_gpusvm_migrate_populate_sram_pfn(vas, npages,
> > > > +                                            migrate.src, migrate.dst,
> > > > +                                            start);
> > > > + if (err)
> > > > +         goto err_finalize;
> > > > +
> > > > + err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev, dma_addr,
> > > > +                                    migrate.dst, npages,
> > > > +                                    DMA_BIDIRECTIONAL);
> > > > + if (err)
> > > > +         goto err_finalize;
> > > > +
> > > > + for (i = 0; i < npages; ++i)
> > > > +         pages[i] = migrate_pfn_to_page(migrate.src[i]);
> > > > +
> > > > + err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr, npages);
> > > > + if (err)
> > > > +         goto err_finalize;
> > > > +
> > > > +err_finalize:
> > > > + if (err)
> > > > +         drm_gpusvm_migration_put_pages(npages, migrate.dst);
> > > > + migrate_vma_pages(&migrate);
> > > > + migrate_vma_finalize(&migrate);
> > > > + drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr, npages,
> > > > +                                DMA_BIDIRECTIONAL);
> > > > +err_free:
> > > > + kvfree(buf);
> > > > +err_out:
> > > > + mmap_assert_locked(gpusvm->mm);
> > > > +
> > > > + return err;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_migrate_to_sram - Migrate (evict) GPU SVM range to SRAM
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @range: Pointer to the GPU SVM range structure
> > > > + * @ctx: GPU SVM context
> > > > + *
> > > > + * This function initiates the migration of the specified GPU SVM range to
> > > > + * SRAM. It performs necessary checks and invokes the internal migration
> > > > + * function for actual migration.
> > > > + *
> > > > + * Returns:
> > > > + * 0 on success, negative error code on failure.
> > > > + */
> > > > +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > > > +                        struct drm_gpusvm_range *range,
> > > > +                        const struct drm_gpusvm_ctx *ctx)
> > > > +{
> > > > + u64 start = range->va.start, end = range->va.end;
> > > > + struct mm_struct *mm = gpusvm->mm;
> > > > + struct vm_area_struct *vas;
> > > > + int err;
> > > > + bool retry = false;
> > > > +
> > > > + if (!ctx->mmap_locked) {
> > > > +         if (!mmget_not_zero(mm)) {
> > > > +                 err = -EFAULT;
> > > > +                 goto err_out;
> > > > +         }
> > > > +         if (ctx->trylock_mmap) {
> > > > +                 if (!mmap_read_trylock(mm))  {
> > > > +                         err = drm_gpusvm_evict_to_sram(gpusvm, range);
> > > > +                         goto err_mmput;
> > > > +                 }
> > > > +         } else {
> > > > +                 mmap_read_lock(mm);
> > > > +         }
> > > > + }
> > > > +
> > > > + mmap_assert_locked(mm);
> > > > +
> > > > + /*
> > > > +  * Loop required to find all VM area structs for the corner case when the
> > > > +  * VRAM backing has been partially unmapped from the MM's address space.
> > > > +  */
> > > > +again:
> > > > + vas = find_vma(mm, start);
> > > > + if (!vas) {
> > > > +         if (!retry)
> > > > +                 err = -ENOENT;
> > > > +         goto err_mmunlock;
> > > > + }
> > > > +
> > > > + if (end <= vas->vm_start || start >= vas->vm_end) {
> > > > +         if (!retry)
> > > > +                 err = -EINVAL;
> > > > +         goto err_mmunlock;
> > > > + }
> > > > +
> > > > + err = __drm_gpusvm_migrate_to_sram(gpusvm, vas, NULL, start, end);
> > > > + if (err)
> > > > +         goto err_mmunlock;
> > > > +
> > > > + if (vas->vm_end < end) {
> > > > +         retry = true;
> > > > +         start = vas->vm_end;
> > > > +         goto again;
> > > > + }
> > > > +
> > > > + if (!ctx->mmap_locked) {
> > > > +         mmap_read_unlock(mm);
> > > > +         /*
> > > > +          * Using mmput_async as this function can be called while
> > > > +          * holding a dma-resv lock, and a final put can grab the mmap
> > > > +          * lock, causing a lock inversion.
> > > > +          */
> > > > +         mmput_async(mm);
> > > > + }
> > > > +
> > > > + return 0;
> > > > +
> > > > +err_mmunlock:
> > > > + if (!ctx->mmap_locked)
> > > > +         mmap_read_unlock(mm);
> > > > +err_mmput:
> > > > + if (!ctx->mmap_locked)
> > > > +         mmput_async(mm);
> > > > +err_out:
> > > > + return err;
> > > > +}
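
As an aside, the trylock_mmap handling above suggests an eviction-side
caller along these lines (purely illustrative, hypothetical name):

static int my_evict_range_to_sram(struct drm_gpusvm *gpusvm,
				  struct drm_gpusvm_range *range)
{
	/*
	 * Called with a dma-resv lock already held, so only trylock the
	 * mmap lock; drm_gpusvm_migrate_to_sram() falls back to the
	 * lockless drm_gpusvm_evict_to_sram() path when the trylock fails.
	 */
	const struct drm_gpusvm_ctx ctx = { .trylock_mmap = true };

	return drm_gpusvm_migrate_to_sram(gpusvm, range, &ctx);
}
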
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_page_free - Put GPU SVM zone device data associated with a page
> > > > + * @page: Pointer to the page
> > > > + *
> > > > + * This function is a callback used to put the GPU SVM zone device data
> > > > + * associated with a page when it is being released.
> > > > + */
> > > > +static void drm_gpusvm_page_free(struct page *page)
> > > > +{
> > > > + drm_gpusvm_zdd_put(page->zone_device_data);
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_migrate_to_ram - Migrate GPU SVM range to RAM (page fault handler)
> > > > + * @vmf: Pointer to the fault information structure
> > > > + *
> > > > + * This function is a page fault handler used to migrate a GPU SVM range to RAM.
> > > > + * It retrieves the GPU SVM range information from the faulting page and invokes
> > > > + * the internal migration function to migrate the range back to RAM.
> > > > + *
> > > > + * Returns:
> > > > + * VM_FAULT_SIGBUS on failure, 0 on success.
> > > > + */
> > > > +static vm_fault_t drm_gpusvm_migrate_to_ram(struct vm_fault *vmf)
> > > > +{
> > > > + struct drm_gpusvm_zdd *zdd = vmf->page->zone_device_data;
> > > > + int err;
> > > > +
> > > > + err = __drm_gpusvm_migrate_to_sram(zdd->range->gpusvm,
> > > > +                                    vmf->vma, vmf->page,
> > > > +                                    zdd->range->va.start,
> > > > +                                    zdd->range->va.end);
> > > > +
> > > > + return err ? VM_FAULT_SIGBUS : 0;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_pagemap_ops - Device page map operations for GPU SVM
> > > > + */
> > > > +static const struct dev_pagemap_ops drm_gpusvm_pagemap_ops = {
> > > > + .page_free = drm_gpusvm_page_free,
> > > > + .migrate_to_ram = drm_gpusvm_migrate_to_ram,
> > > > +};
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_pagemap_ops_get - Retrieve GPU SVM device page map operations
> > > > + *
> > > > + * Returns:
> > > > + * Pointer to the GPU SVM device page map operations structure.
> > > > + */
> > > > +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void)
> > > > +{
> > > > + return &drm_gpusvm_pagemap_ops;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_has_mapping - Check if GPU SVM has mapping for the given address range
> > > > + * @gpusvm: Pointer to the GPU SVM structure.
> > > > + * @start: Start address
> > > > + * @end: End address
> > > > + *
> > > > + * Returns:
> > > > + * True if GPU SVM has mapping, False otherwise
> > > > + */
> > > > +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start, u64 end)
> > > > +{
> > > > + struct drm_gpusvm_notifier *notifier;
> > > > +
> > > > + drm_gpusvm_for_each_notifier(notifier, gpusvm, start, end) {
> > > > +         struct drm_gpusvm_range *range = NULL;
> > > > +
> > > > +         drm_gpusvm_for_each_range(range, notifier, start, end)
> > > > +                 return true;
> > > > + }
> > > > +
> > > > + return false;
> > > > +}
> > > > diff --git a/drivers/gpu/drm/xe/drm_gpusvm.h b/drivers/gpu/drm/xe/drm_gpusvm.h
> > > > new file mode 100644
> > > > index 000000000000..0ea70f8534a8
> > > > --- /dev/null
> > > > +++ b/drivers/gpu/drm/xe/drm_gpusvm.h
> > > > @@ -0,0 +1,415 @@
> > > > +/* SPDX-License-Identifier: MIT */
> > > > +/*
> > > > + * Copyright © 2024 Intel Corporation
> > > > + */
> > > > +
> > > > +#ifndef __DRM_GPUSVM_H__
> > > > +#define __DRM_GPUSVM_H__
> > > > +
> > > > +#include <linux/kref.h>
> > > > +#include <linux/mmu_notifier.h>
> > > > +#include <linux/workqueue.h>
> > > > +
> > > > +struct dev_pagemap_ops;
> > > > +struct drm_device;
> > > > +struct drm_gpusvm;
> > > > +struct drm_gpusvm_notifier;
> > > > +struct drm_gpusvm_ops;
> > > > +struct drm_gpusvm_range;
> > > > +
> > > > +/**
> > > > + * struct drm_gpusvm_ops - Operations structure for GPU SVM
> > > > + *
> > > > + * This structure defines the operations for GPU Shared Virtual Memory (SVM).
> > > > + * These operations are provided by the GPU driver to manage SVM ranges and
> > > > + * perform operations such as migration between VRAM and system RAM.
> > > > + */
> > > > +struct drm_gpusvm_ops {
> > > > + /**
> > > > +  * @notifier_alloc: Allocate a GPU SVM notifier (optional)
> > > > +  *
> > > > +  * This function shall allocate a GPU SVM notifier.
> > > > +  *
> > > > +  * Returns:
> > > > +  * Pointer to the allocated GPU SVM notifier on success, NULL on failure.
> > > > +  */
> > > > + struct drm_gpusvm_notifier *(*notifier_alloc)(void);
> > > > +
> > > > + /**
> > > > +  * @notifier_free: Free a GPU SVM notifier (optional)
> > > > +  * @notifier: Pointer to the GPU SVM notifier to be freed
> > > > +  *
> > > > +  * This function shall free a GPU SVM notifier.
> > > > +  */
> > > > + void (*notifier_free)(struct drm_gpusvm_notifier *notifier);
> > > > +
> > > > + /**
> > > > +  * @range_alloc: Allocate a GPU SVM range (optional)
> > > > +  * @gpusvm: Pointer to the GPU SVM
> > > > +  *
> > > > +  * This function shall allocate a GPU SVM range.
> > > > +  *
> > > > +  * Returns:
> > > > +  * Pointer to the allocated GPU SVM range on success, NULL on failure.
> > > > +  */
> > > > + struct drm_gpusvm_range *(*range_alloc)(struct drm_gpusvm *gpusvm);
> > > > +
> > > > + /**
> > > > +  * @range_free: Free a GPU SVM range (optional)
> > > > +  * @range: Pointer to the GPU SVM range to be freed
> > > > +  *
> > > > +  * This function shall free a GPU SVM range.
> > > > +  */
> > > > + void (*range_free)(struct drm_gpusvm_range *range);
> > > > +
> > > > + /**
> > > > +  * @vram_release: Release VRAM allocation (optional)
> > > > +  * @vram_allocation: Driver-private pointer to the VRAM allocation
> > > > +  *
> > > > +  * This function shall release VRAM allocation and expects to drop a
> > > > +  * reference to VRAM allocation.
> > > > +  */
> > > > + void (*vram_release)(void *vram_allocation);
> > > > +
> > > > + /**
> > > > +  * @populate_vram_pfn: Populate VRAM PFN (required for migration)
> > > > +  * @gpusvm: Pointer to the GPU SVM
> > > > +  * @vram_allocation: Driver-private pointer to the VRAM allocation
> > > > +  * @npages: Number of pages to populate
> > > > +  * @pfn: Array of page frame numbers to populate
> > > > +  *
> > > > +  * This function shall populate VRAM page frame numbers (PFN).
> > > > +  *
> > > > +  * Returns:
> > > > +  * 0 on success, a negative error code on failure.
> > > > +  */
> > > > + int (*populate_vram_pfn)(struct drm_gpusvm *gpusvm,
> > > > +                          void *vram_allocation,
> > > > +                          unsigned long npages,
> > > > +                          unsigned long *pfn);
> > > > +
> > > > + /**
> > > > +  * @copy_to_vram: Copy to VRAM (required for migration)
> > > > +  * @gpusvm: Pointer to the GPU SVM
> > > > +  * @pages: Pointer to array of VRAM pages (destination)
> > > > +  * @dma_addr: Pointer to array of DMA addresses (source)
> > > > +  * @npages: Number of pages to copy
> > > > +  *
> > > > +  * This function shall copy pages to VRAM.
> > > > +  *
> > > > +  * Returns:
> > > > +  * 0 on success, a negative error code on failure.
> > > > +  */
> > > > + int (*copy_to_vram)(struct drm_gpusvm *gpusvm,
> > > > +                     struct page **pages,
> > > > +                     dma_addr_t *dma_addr,
> > > > +                     unsigned long npages);
> > > > +
> > > > + /**
> > > > +  * @copy_to_sram: Copy to system RAM (required for migration)
> > > > +  * @gpusvm: Pointer to the GPU SVM
> > > > +  * @pages: Pointer to array of VRAM pages (source)
> > > > +  * @dma_addr: Pointer to array of DMA addresses (destination)
> > > > +  * @npages: Number of pages to copy
> > > > +  *
> > > > +  * This function shall copy pages to system RAM.
> > > > +  *
> > > > +  * Returns:
> > > > +  * 0 on success, a negative error code on failure.
> > > > +  */
> > > > + int (*copy_to_sram)(struct drm_gpusvm *gpusvm,
> > > > +                     struct page **pages,
> > > > +                     dma_addr_t *dma_addr,
> > > > +                     unsigned long npages);
> > > > +
> > > > + /**
> > > > +  * @invalidate: Invalidate GPU SVM notifier (required)
> > > > +  * @gpusvm: Pointer to the GPU SVM
> > > > +  * @notifier: Pointer to the GPU SVM notifier
> > > > +  * @mmu_range: Pointer to the mmu_notifier_range structure
> > > > +  *
> > > > +  * This function shall invalidate the GPU page tables. It can safely
> > > > +  * walk the notifier range RB tree/list in this function. Called while
> > > > +  * holding the notifier lock.
> > > > +  */
> > > > + void (*invalidate)(struct drm_gpusvm *gpusvm,
> > > > +                    struct drm_gpusvm_notifier *notifier,
> > > > +                    const struct mmu_notifier_range *mmu_range);
> > > > +};
> > > > +
> > > > +/**
> > > > + * struct drm_gpusvm_notifier - Structure representing a GPU SVM notifier
> > > > + *
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @notifier: MMU interval notifier
> > > > + * @interval: Interval for the notifier
> > > > + * @rb: Red-black tree node for the parent GPU SVM structure notifier tree
> > > > + * @root: Cached root node of the RB tree containing ranges
> > > > + * @range_list: List head of ranges in the same order they appear in the
> > > > + *              interval tree. This is useful for iterating over ranges while
> > > > + *              modifying the RB tree.
> > > > + * @flags.removed: Flag indicating whether the MMU interval notifier has been
> > > > + *                 removed
> > > > + *
> > > > + * This structure represents a GPU SVM notifier.
> > > > + */
> > > > +struct drm_gpusvm_notifier {
> > > > + struct drm_gpusvm *gpusvm;
> > > > + struct mmu_interval_notifier notifier;
> > > > + struct {
> > > > +         u64 start;
> > > > +         u64 end;
> > > > + } interval;
> > > > + struct {
> > > > +         struct rb_node node;
> > > > +         struct list_head entry;
> > > > +         u64 __subtree_last;
> > > > + } rb;
> > > > + struct rb_root_cached root;
> > > > + struct list_head range_list;
> > > > + struct {
> > > > +         u32 removed : 1;
> > > > + } flags;
> > > > +};
> > > > +
> > > > +/**
> > > > + * struct drm_gpusvm_range - Structure representing a GPU SVM range
> > > > + *
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @notifier: Pointer to the GPU SVM notifier
> > > > + * @refcount: Reference count for the range
> > > > + * @rb: Red-black tree node for the parent GPU SVM notifier structure range tree
> > > > + * @va: Virtual address range
> > > > + * @notifier_seq: Notifier sequence number of the range's pages
> > > > + * @pages: Pointer to the array of pages (if backing store is in VRAM)
> > > > + * @dma_addr: DMA address array (if backing store is SRAM and DMA mapped)
> > > > + * @vram_allocation: Driver-private pointer to the VRAM allocation
> > > > + * @order: Order of dma mapping. i.e. PAGE_SIZE << order is mapping size
> > > > + * @flags.migrate_vram: Flag indicating whether the range can be migrated to VRAM
> > > > + * @flags.unmapped: Flag indicating if the range has been unmapped
> > > > + * @flags.partial_unmap: Flag indicating if the range has been partially unmapped
> > > > + * @flags.has_vram_pages: Flag indicating if the range has vram pages
> > > > + * @flags.has_dma_mapping: Flag indicating if the range has a DMA mapping
> > > > + * @flags.kfree_mapping: Flag indicating @dma_addr is a compact allocation based
> > > > + *                       on @order, which is released via kfree
> > > > + *
> > > > + * This structure represents a GPU SVM range used for tracking memory ranges
> > > > + * mapped in a DRM device.
> > > > + */
> > > > +struct drm_gpusvm_range {
> > > > + struct drm_gpusvm *gpusvm;
> > > > + struct drm_gpusvm_notifier *notifier;
> > > > + struct kref refcount;
> > > > + struct {
> > > > +         struct rb_node node;
> > > > +         struct list_head entry;
> > > > +         u64 __subtree_last;
> > > > + } rb;
> > > > + struct {
> > > > +         u64 start;
> > > > +         u64 end;
> > > > + } va;
> > > > + unsigned long notifier_seq;
> > > > + union {
> > > > +         struct page **pages;
> > > > +         dma_addr_t *dma_addr;
> > > > + };
> > > > + void *vram_allocation;
> > > > + u16 order;
> > > > + struct {
> > > > +         /* All flags below must be set upon creation */
> > > > +         u16 migrate_vram : 1;
> > > > +         /* All flags below must be set / cleared under notifier lock */
> > > > +         u16 unmapped : 1;
> > > > +         u16 partial_unmap : 1;
> > > > +         u16 has_vram_pages : 1;
> > > > +         u16 has_dma_mapping : 1;
> > > > +         u16 kfree_mapping : 1;
> > > > + } flags;
> > > > +};
> > > > +
> > > > +/**
> > > > + * struct drm_gpusvm - GPU SVM structure
> > > > + *
> > > > + * @name: Name of the GPU SVM
> > > > + * @drm: Pointer to the DRM device structure
> > > > + * @mm: Pointer to the mm_struct for the address space
> > > > + * @device_private_page_owner: Device private pages owner
> > > > + * @mm_start: Start address of GPU SVM
> > > > + * @mm_range: Range of the GPU SVM
> > > > + * @notifier_size: Size of individual notifiers
> > > > + * @ops: Pointer to the operations structure for GPU SVM
> > > > + * @chunk_sizes: Pointer to the array of chunk sizes used in range allocation.
> > > > + *               Entries should be powers of 2 in descending order.
> > > > + * @num_chunks: Number of chunks
> > > > + * @notifier_lock: Read-write semaphore for protecting notifier operations
> > > > + * @zdd_wq: Workqueue for deferred work on zdd destruction
> > > > + * @root: Cached root node of the Red-Black tree containing GPU SVM notifiers
> > > > + * @notifier_list: List head of notifiers in the same order they appear in the
> > > > + *                 interval tree. This is useful to keep iterating notifiers
> > > > + *                 while doing modifications to the RB tree.
> > > > + *
> > > > + * This structure represents a GPU SVM (Shared Virtual Memory) used for tracking
> > > > + * memory ranges mapped in a DRM (Direct Rendering Manager) device.
> > > > + *
> > > > + * No reference counting is provided, as this is expected to be embedded in the
> > > > + * driver VM structure along with the struct drm_gpuvm, which handles reference
> > > > + * counting.
> > > > + */
> > > > +struct drm_gpusvm {
> > > > + const char *name;
> > > > + struct drm_device *drm;
> > > > + struct mm_struct *mm;
> > > > + void *device_private_page_owner;
> > > > + u64 mm_start;
> > > > + u64 mm_range;
> > > > + u64 notifier_size;
> > > > + const struct drm_gpusvm_ops *ops;
> > > > + const u64 *chunk_sizes;
> > > > + int num_chunks;
> > > > + struct rw_semaphore notifier_lock;
> > > > + struct workqueue_struct *zdd_wq;
> > > > + struct rb_root_cached root;
> > > > + struct list_head notifier_list;
> > > > +};
> > > > +
> > > > +/**
> > > > + * struct drm_gpusvm_ctx - DRM GPU SVM context
> > > > + *
> > > > + * @mmap_locked: mmap lock is locked
> > > > + * @trylock_mmap: trylock mmap lock, used to avoid locking inversions
> > > > + *                (e.g. dma-resv -> mmap lock)
> > > > + * @in_notifier: entering from a MMU notifier
> > > > + * @read_only: operating on read-only memory
> > > > + * @vram_possible: possible to use VRAM
> > > > + * @prefault: prefault pages
> > > > + *
> > > > + * Context that DRM GPUSVM is operating in (i.e. user arguments).
> > > > + */
> > > > +struct drm_gpusvm_ctx {
> > > > + u32 mmap_locked :1;
> > > > + u32 trylock_mmap :1;
> > > > + u32 in_notifier :1;
> > > > + u32 read_only :1;
> > > > + u32 vram_possible :1;
> > > > + u32 prefault :1;
> > > > +};
> > > > +
> > > > +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> > > > +             const char *name, struct drm_device *drm,
> > > > +             struct mm_struct *mm, void *device_private_page_owner,
> > > > +             u64 mm_start, u64 mm_range, u64 notifier_size,
> > > > +             const struct drm_gpusvm_ops *ops,
> > > > +             const u64 *chunk_sizes, int num_chunks);
> > > > +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm);
> > > > +void drm_gpusvm_free(struct drm_gpusvm *gpusvm);
> > > > +
> > > > +struct drm_gpusvm_range *
> > > > +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64 fault_addr,
> > > > +                         u64 gpuva_start, u64 gpuva_end,
> > > > +                         const struct drm_gpusvm_ctx *ctx);
> > > > +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> > > > +                      struct drm_gpusvm_range *range);
> > > > +
> > > > +struct drm_gpusvm_range *
> > > > +drm_gpusvm_range_get(struct drm_gpusvm_range *range);
> > > > +void drm_gpusvm_range_put(struct drm_gpusvm_range *range);
> > > > +
> > > > +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> > > > +                           struct drm_gpusvm_range *range);
> > > > +
> > > > +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> > > > +                        struct drm_gpusvm_range *range,
> > > > +                        const struct drm_gpusvm_ctx *ctx);
> > > > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> > > > +                           struct drm_gpusvm_range *range,
> > > > +                           const struct drm_gpusvm_ctx *ctx);
> > > > +
> > > > +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> > > > +                        struct drm_gpusvm_range *range,
> > > > +                        void *vram_allocation,
> > > > +                        const struct drm_gpusvm_ctx *ctx);
> > > > +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > > > +                        struct drm_gpusvm_range *range,
> > > > +                        const struct drm_gpusvm_ctx *ctx);
> > > > +
> > > > +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void);
> > > > +
> > > > +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start, u64 end);
> > > > +
> > > > +struct drm_gpusvm_range *
> > > > +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64 start, u64 end);
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_notifier_lock - Lock GPU SVM notifier
> > > > + * @gpusvm__: Pointer to the GPU SVM structure.
> > > > + *
> > > > + * Abstracts the client usage of the GPU SVM notifier lock; takes the lock.
> > > > + */
> > > > +#define drm_gpusvm_notifier_lock(gpusvm__)       \
> > > > + down_read(&(gpusvm__)->notifier_lock)
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_notifier_unlock - Unlock GPU SVM notifier
> > > > + * @gpusvm__: Pointer to the GPU SVM structure.
> > > > + *
> > > > + * Abstracts the client usage of the GPU SVM notifier lock; drops the lock.
> > > > + */
> > > > +#define drm_gpusvm_notifier_unlock(gpusvm__)     \
> > > > + up_read(&(gpusvm__)->notifier_lock)
> > > > +
> > > > +/**
> > > > + * __drm_gpusvm_range_next - Get the next GPU SVM range in the list
> > > > + * @range: a pointer to the current GPU SVM range
> > > > + *
> > > > + * Return: A pointer to the next drm_gpusvm_range if available, or NULL if the
> > > > + *         current range is the last one or if the input range is NULL.
> > > > + */
> > > > +static inline struct drm_gpusvm_range *
> > > > +__drm_gpusvm_range_next(struct drm_gpusvm_range *range)
> > > > +{
> > > > + if (range && !list_is_last(&range->rb.entry,
> > > > +                            &range->notifier->range_list))
> > > > +         return list_next_entry(range, rb.entry);
> > > > +
> > > > + return NULL;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_for_each_range - Iterate over GPU SVM ranges in a notifier
> > > > + * @range__: Iterator variable for the ranges. If set, it indicates the start of
> > > > + *            the iterator. If NULL, call drm_gpusvm_range_find() to get the range.
> > > > + * @notifier__: Pointer to the GPU SVM notifier
> > > > + * @start__: Start address of the range
> > > > + * @end__: End address of the range
> > > > + *
> > > > + * This macro is used to iterate over GPU SVM ranges in a notifier. It is safe
> > > > + * to use while holding the driver SVM lock or the notifier lock.
> > > > + */
> > > > +#define drm_gpusvm_for_each_range(range__, notifier__, start__, end__)   \
> > > > + for ((range__) = (range__) ?:                                   \
> > > > +      drm_gpusvm_range_find((notifier__), (start__), (end__));   \
> > > > +      (range__) && (range__->va.start < (end__));                \
> > > > +      (range__) = __drm_gpusvm_range_next(range__))
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_set_unmapped - Mark a GPU SVM range as unmapped
> > > > + * @range: Pointer to the GPU SVM range structure.
> > > > + * @mmu_range: Pointer to the MMU notifier range structure.
> > > > + *
> > > > + * This function marks a GPU SVM range as unmapped and sets the partial_unmap flag
> > > > + * if the range partially falls within the provided MMU notifier range.
> > > > + */
> > > > +static inline void
> > > > +drm_gpusvm_range_set_unmapped(struct drm_gpusvm_range *range,
> > > > +                       const struct mmu_notifier_range *mmu_range)
> > > > +{
> > > > + lockdep_assert_held_write(&range->gpusvm->notifier_lock);
> > > > +
> > > > + range->flags.unmapped = true;
> > > > + if (range->va.start < mmu_range->start ||
> > > > +     range->va.end > mmu_range->end)
> > > > +         range->flags.partial_unmap = true;
> > > > +}
> > > > +
> > > > +#endif /* __DRM_GPUSVM_H__ */
> > > > --
> > > > 2.34.1
> > > >
> >

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 23/28] drm/xe: Add SVM VRAM migration
  2024-08-28  2:48 ` [RFC PATCH 23/28] drm/xe: Add SVM VRAM migration Matthew Brost
@ 2024-08-28 16:06   ` Daniel Vetter
  2024-08-28 18:22     ` Daniel Vetter
  2024-08-29  9:24     ` Christian König
  0 siblings, 2 replies; 100+ messages in thread
From: Daniel Vetter @ 2024-08-28 16:06 UTC (permalink / raw)
  To: Matthew Brost
  Cc: intel-xe, dri-devel, airlied, christian.koenig, thomas.hellstrom,
	matthew.auld, daniel

On Tue, Aug 27, 2024 at 07:48:56PM -0700, Matthew Brost wrote:
> Migration is implemented with range granularity, with VRAM backing being
> a VM private TTM BO (i.e., shares dma-resv with VM). The lifetime of the
> TTM BO is limited to when the SVM range is in VRAM (i.e., when a VRAM
> SVM range is migrated to SRAM, the TTM BO is destroyed).
> 
> The design choice for using TTM BO for VRAM backing store, as opposed to
> direct buddy allocation, is as follows:
> 
> - DRM buddy allocations are not at page granularity, offering no
>   advantage over a BO.

This one I'm not understanding.

> - DRM buddy allocations do not solve locking inversion problems between
>   mmap lock and dma-resv locks.

Which mmap -> dma_resv inversion? I've seen a lot ... I guess it also
matters hugely which migration path we're in, i.e. opportunistic
migration, cpu fault where we have to migrate or die, or when we run out
of vram and need to evict stuff to make space.

> - Unified eviction is required (SVM VRAM and TTM BOs need to be able to
>   evict each other).

So core mm handles this by just roughly equally shrinking everything.
Seems to work, and it has a pile of object shrinkers, and the page lru is
also split into page cache and anon memory.

I think you need to put in more justification that unified eviction is
required than just stating it, because a look at mm/ gives a very well
established counterexample.
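
For illustration only, a rough sketch of the object-shrinker shape core mm
expects; the drm_foo_* names and backing cache are hypothetical and not part
of this series. The relevant property is that the scan side only ever
trylocks:

	static unsigned long drm_foo_cache_count(struct shrinker *shrink,
						 struct shrink_control *sc)
	{
		/* Report how many objects could plausibly be freed right now. */
		unsigned long nr = drm_foo_cache_nr_objects();

		return nr ?: SHRINK_EMPTY;
	}

	static unsigned long drm_foo_cache_scan(struct shrinker *shrink,
						struct shrink_control *sc)
	{
		/*
		 * Free up to sc->nr_to_scan objects, taking locks only via
		 * trylock (e.g. dma_resv_trylock); report SHRINK_STOP when
		 * nothing could be reclaimed without blocking.
		 */
		return drm_foo_cache_free_some(sc->nr_to_scan) ?: SHRINK_STOP;
	}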

> - For exhaustive eviction [1], SVM VRAM allocations will almost certainly
>   require a dma-resv.

So from the TTM side we need exhaustive eviction, or at least something a
bit more exhaustive than what ttm currently has. Note that i915-gem also
never really got to perfect exhaustive eviction, it's just a pile better
than ttm right now.

Now if there's also SVM VRAM managed on a page lru, TTM exhaustive
eviction is going to win because the shrinkers can only trylock dma_resv.
So this part works. It actually works so well on the system memory side
that if we're not careful we can trigger oom, because we're too good at
getting at all the memory.

SVM VRAM allocations otoh do not need exhaustive evictions. Or at least I
don't see why, because the idea is that thanks to gpu and cpu page faults,
you can always get out of a pinch by just trashing everything for a while
and migrating the handful of available pages a lot.

> - Likely allocation size is 2M, which makes the size of a BO (872 bytes)
>   acceptable overhead per allocation (872 / 2M == .0004158).
> 
> With this, using TTM BO for VRAM backing store seems to be an obvious
> choice as it allows leveraging of the TTM eviction code.

Except it requires that you hold dma_resv, which brings in all kinds of
pain. And for eviction we really don't need a lot of synchronization, so a
lot of that locking is not needed, unlike the case where we have a cpu
fault, where we absolutely need mmap_lock and all that to make sure we
fault in the right page.

But for eviction we only need to throw out some pages, if we're not
entirely precise with picking the right ones (or have no idea into which
vma they're all currently mapped into) it doesn't matter. That's why
migrate_device_pages doesn't care about any of that at all, it doesn't
need to by design. But with bo backing memory you drag in all that stuff
that's causing headaches for eviction.

The only thing migration tries to do is remove all pte, and if that
succeeds, move the page. Specialized for the gpusvm case, looking at mm/
code as cheat sheet, we need roughly:

- reverse mapping structure like anon_vma. Except gpusvm can assume that
  there's currently only one gpu side mapping, so we can just stuff the
  gpusvm and va_address into the page, and protect it with the page lock
  (see the sketch after this list).

- we need pagetable locks, so that we can manipulate pagetables (well
  specifically make ptes invalid) without taking any other locks.

- everyone else inserting or removing ptes for svm mappings also needs to
  lock the page, or we have races. This might be the hmm_range_fault races
  you're seeing when allowing vram pages, since I don't think there's
  anything else stopping the page lookup otherwise from succeeding.

- we might also need to stuff migrate ptes into the gpu side, like the cpu
  does, to hold up refaults before the migration has finished. But I think
  those are only needed for anon memory in sram because there's no other
  way to find the right page than swap pte entries, of which migration
  entries are a special case.

- core code also expects us to handle the page refcount correctly for svm
  device memory, so we can't free the pages like normal bo pages either
  directly to drm_buddy.
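
As a purely hypothetical sketch of that first point (not something this series
implements), the reverse map could be as small as stuffing the owning gpusvm
and the gpu va into the device page, reusing zone_device_data here just for
illustration and relying on lock_page() for serialization:

	struct gpusvm_rmap {
		struct drm_gpusvm *gpusvm;	/* assumes a single gpu-side mapping */
		unsigned long va;		/* gpu virtual address of that mapping */
	};

	/* Caller holds lock_page(page). */
	static void gpusvm_rmap_set(struct page *page, struct drm_gpusvm *gpusvm,
				    unsigned long va)
	{
		struct gpusvm_rmap *rmap = page->zone_device_data;

		rmap->gpusvm = gpusvm;
		rmap->va = va;
	}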

Now typing this all up will look an awful lot like what you have, with the
dma_resv lock serving as the page lock and the pagetable lock. The only
reason is that these locks are much smaller and nest within all the other
stuff going on and so avoid the inversion issues.

So one annoying part is that this is a lot of pointless-looking typing.
The other is that it's full of races, because core mm really is yolo all
the way down. So lots of ways you lock the wrong page and fun stuff like
that, but the few cases that matter work:

- svm fault handling with hmm_range_fault retries with mmu notifiers. Note
  that we need to have vram pages locked and the notifier retry needs to
  be under the pagetable lock, or there's room to escape. At least that's
  what I came up with last time I thought it all through. (A minimal
  retry-loop sketch follows this list.)

- migrate_to_ram: it will hold a page reference which we know was the
  valid vram page when the cpu pte was locked, but it might not be it
  anymore. So we have to lock the page and check whether it's still gpu
  mapped, and if not retry the entire fault since most likely another
  migrate_to_ram has succeeded meanwhile in parallel.

- for eviction we don't care, we might actually be migrating a page no one
  even wants anymore.
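
For the first case, a minimal sketch of the retry loop, following the pattern
in Documentation/mm/hmm.rst; take_pagetable_lock()/release_pagetable_lock()
stand in for whatever driver-side lock protects the gpu pagetables:

	again:
		range.notifier_seq = mmu_interval_read_begin(&sub);
		mmap_read_lock(mm);
		ret = hmm_range_fault(&range);
		mmap_read_unlock(mm);
		if (ret) {
			if (ret == -EBUSY)
				goto again;
			return ret;
		}

		take_pagetable_lock();
		if (mmu_interval_read_retry(&sub, range.notifier_seq)) {
			release_pagetable_lock();
			goto again;
		}
		/* Program gpu ptes from range.hmm_pfns[], vram pages held locked. */
		release_pagetable_lock();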

Now I think you can get all this done with the dma_resv lock and maybe the
bo refcount. But it does involve a tremendous amount of headaches and
impedance mismatch, because that's not how page faults and migrations
work in core mm.

Cheers, Sima

> Current migration policy is migrate any SVM range greater than or equal
> to 64k once.
> 
> [1] https://patchwork.freedesktop.org/series/133643/
> 
> Signed-off-by: Matthew Brost matthew.brost@intel.com
> ---
>  drivers/gpu/drm/xe/xe_svm.c | 81 ++++++++++++++++++++++++++++++++++++-
>  drivers/gpu/drm/xe/xe_svm.h |  1 +
>  2 files changed, 81 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
> index 4372c02a341f..fd8987e0a506 100644
> --- a/drivers/gpu/drm/xe/xe_svm.c
> +++ b/drivers/gpu/drm/xe/xe_svm.c
> @@ -217,8 +217,13 @@ static void xe_svm_invalidate(struct drm_gpusvm *gpusvm,
>  static int __xe_svm_garbage_collector(struct xe_vm *vm,
>  				      struct xe_svm_range *range)
>  {
> +	struct drm_gpusvm_ctx ctx = {};
>  	struct dma_fence *fence;
>  
> +	/* Evict any pages holding references to vram allocation */
> +	if (range->base.flags.partial_unmap && IS_DGFX(vm->xe))
> +		drm_gpusvm_migrate_to_sram(&vm->svm.gpusvm, &range->base, &ctx);
> +
>  	xe_vm_lock(vm, false);
>  	fence = xe_vm_range_unbind(vm, range);
>  	xe_vm_unlock(vm);
> @@ -504,21 +509,77 @@ static bool xe_svm_range_is_valid(struct xe_svm_range *range,
>  	return (range->tile_present & ~range->tile_invalidated) & BIT(tile->id);
>  }
>  
> +static struct xe_mem_region *tile_to_mr(struct xe_tile *tile)
> +{
> +	return &tile->mem.vram;
> +}
> +
> +static struct xe_bo *xe_svm_alloc_vram(struct xe_vm *vm, struct xe_tile *tile,
> +				       struct xe_svm_range *range,
> +				       const struct drm_gpusvm_ctx *ctx)
> +{
> +	struct xe_mem_region *mr = tile_to_mr(tile);
> +	struct drm_buddy_block *block;
> +	struct list_head *blocks;
> +	struct xe_bo *bo;
> +	ktime_t end = 0;
> +	int err;
> +
> +retry:
> +	xe_vm_lock(vm, false);
> +	bo = xe_bo_create(tile_to_xe(tile), tile, vm, range->base.va.end -
> +			  range->base.va.start, ttm_bo_type_device,
> +			  XE_BO_FLAG_VRAM_IF_DGFX(tile) |
> +			  XE_BO_FLAG_SYSTEM_ALLOC | XE_BO_FLAG_SKIP_CLEAR);
> +	xe_vm_unlock(vm);
> +	if (IS_ERR(bo)) {
> +		err = PTR_ERR(bo);
> +		if (xe_vm_validate_should_retry(NULL, err, &end))
> +			goto retry;
> +		return bo;
> +	}
> +
> +	blocks = &to_xe_ttm_vram_mgr_resource(bo->ttm.resource)->blocks;
> +	list_for_each_entry(block, blocks, link)
> +		block->private = mr;
> +
> +	/*
> +	 * Take ref because as soon as drm_gpusvm_migrate_to_vram succeeds the
> +	 * creation ref can be dropped upon CPU fault or unmap.
> +	 */
> +	xe_bo_get(bo);
> +
> +	err = drm_gpusvm_migrate_to_vram(&vm->svm.gpusvm, &range->base,
> +					 bo, ctx);
> +	if (err) {
> +		xe_bo_put(bo);	/* Local ref */
> +		xe_bo_put(bo);	/* Creation ref */
> +		return ERR_PTR(err);
> +	}
> +
> +	return bo;
> +}
> +
>  int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
>  			    struct xe_tile *tile, u64 fault_addr,
>  			    bool atomic)
>  {
> -	struct drm_gpusvm_ctx ctx = { .read_only = xe_vma_read_only(vma), };
> +	struct drm_gpusvm_ctx ctx = { .read_only = xe_vma_read_only(vma),
> +		.vram_possible = IS_DGFX(vm->xe), };
>  	struct xe_svm_range *range;
>  	struct drm_gpusvm_range *r;
>  	struct drm_exec exec;
>  	struct dma_fence *fence;
> +	struct xe_bo *bo = NULL;
>  	ktime_t end = 0;
>  	int err;
>  
>  	lockdep_assert_held_write(&vm->lock);
>  
>  retry:
> +	xe_bo_put(bo);
> +	bo = NULL;
> +
>  	/* Always process UNMAPs first so the view of SVM ranges is current */
>  	err = xe_svm_garbage_collector(vm);
>  	if (err)
> @@ -534,6 +595,22 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
>  	if (xe_svm_range_is_valid(range, tile))
>  		return 0;
>  
> +	/* XXX: Add migration policy, for now migrate range once */
> +	if (IS_DGFX(vm->xe) && !range->migrated &&
> +	    range->base.flags.migrate_vram &&
> +	    (range->base.va.end - range->base.va.start) >= SZ_64K) {
> +		range->migrated = true;
> +
> +		bo = xe_svm_alloc_vram(vm, tile, range, &ctx);
> +		if (IS_ERR(bo)) {
> +			drm_info(&vm->xe->drm,
> +				 "VRAM allocation failed, falling back to retrying, asid=%u, errno %ld\n",
> +				 vm->usm.asid, PTR_ERR(bo));
> +			bo = NULL;
> +			goto retry;
> +		}
> +	}
> +
>  	err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r, &ctx);
>  	if (err == -EFAULT || err == -EPERM)	/* Corner where CPU mappings have changed */
>  	       goto retry;
> @@ -567,6 +644,8 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
>  	dma_fence_put(fence);
>  
>  err_out:
> +	xe_bo_put(bo);
> +
>  	return err;
>  }
>  
> diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> index 8b72e91cc37d..3f432483a230 100644
> --- a/drivers/gpu/drm/xe/xe_svm.h
> +++ b/drivers/gpu/drm/xe/xe_svm.h
> @@ -18,6 +18,7 @@ struct xe_svm_range {
>  	struct list_head garbage_collector_link;
>  	u8 tile_present;
>  	u8 tile_invalidated;
> +	u8 migrated	:1;
>  };
>  
>  int xe_devm_add(struct xe_tile *tile, struct xe_mem_region *mr);
> -- 
> 2.34.1
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory
  2024-08-28 15:43       ` Matthew Brost
  2024-08-28 16:06         ` Alex Deucher
@ 2024-08-28 16:25         ` Daniel Vetter
  2024-08-29 16:40           ` Matthew Brost
  1 sibling, 1 reply; 100+ messages in thread
From: Daniel Vetter @ 2024-08-28 16:25 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Christian König, Daniel Vetter, intel-xe, dri-devel, airlied,
	thomas.hellstrom, matthew.auld, daniel

On Wed, Aug 28, 2024 at 03:43:48PM +0000, Matthew Brost wrote:
> On Wed, Aug 28, 2024 at 04:46:24PM +0200, Christian König wrote:
> > On 28.08.24 at 16:31, Daniel Vetter wrote:
> > > On Tue, Aug 27, 2024 at 07:48:38PM -0700, Matthew Brost wrote:
> > > > +		if (!ctx->mmap_locked) {
> > > > +			/*
> > > > +			 * XXX: HMM locking document indicates only a read-lock
> > > > +			 * is required but there apears to be a window between
> > > > +			 * the MMU_NOTIFY_MIGRATE event triggered in a CPU fault
> > > > +			 * via migrate_vma_setup and the pages actually moving
> > > > +			 * in migrate_vma_finalize in which this code can grab
> > > > +			 * garbage pages. Grabbing the write-lock if the range
> > > > +			 * is attached to vram appears to protect against this
> > > > +			 * race.
> > > > +			 */
> 
> Thanks the comments, replying to both of you inline.
> 
> > > This one is really scary, since it means the entire migrate pte trickery
> > > is essentially completely busted. Grabbing the mmap write lock just means
> > > you block out pretty much everything interesting from concurrently
> > > happening.
> > > 
> > > My gut feeling says we need to figure out what's happening here, because
> > > this looks a bit too fundamental to me.
> 
> I agree. I haven’t looked into this issue for a couple of months but
> really need to understand what is going on.
> 
> I should have mentioned this in the cover letter: the goal of this
> series was to produce something for review that is stable and supports
> UMDs/user applications. It was not intended to be presented as a final
> solution. This issue certainly falls into the category of "needs to be
> understood and requires a proper fix."
> 
> One open question I have is whether the test case that triggers this
> issue is even defined behavior. The test creates concurrent access
> between the GPU and CPU to the same memory address, resulting in GPU and
> CPU faults racing against each other. It’s possible that this is
> undefined behavior, so data corruption might be acceptable—i.e., the
> kernel can’t crash, but incorrect results might be permissible.

Yes this is supposed to be defined, at least from an hmm pov. And core mm/
is ridiculous in how many races it allows, especially around concurrent
fault handling.

It is ofc really slow if every fault results in a migration, but that's a
matter of the application setting stupid memory migration hints for the
gpu.

> e.g. This is the only defined usage model:
> 
> alloc_memory();
> start_compute_kernel();
> sync_on_compute_kernel_completion();
> read_memory();
> 
> Hopefully, in the next week or so, I'll be heavily engaging with the UMD
> teams. Development can then start, and applications will be running soon
> after. This will allow us to address issues like this, collect data on
> memory usage, and verify some of the assumptions I've made, such as
> optimizing for 2M+ allocations.
> 
> > 
> > I think I have at least a high level understanding what's going on here,
> > Felix and especially Philip should know more of the details.
> > 
> 
> I meant to reach out to AMD for issues like this. So, Felix
> (felix.kuehling@amd.com) and Philip (Philip.Yang@amd.com) would be good
> contacts?
> 
> > In general grabbing the mm_lock to protect PTEs from changing is completely
> > nonsense. The mm_lock is to protect the VMAs and *not* the PTEs!
> > 
> 
> Thanks for the hint. I believe that in the AMD implementation, I noticed
> some additional locks for migration, which might be how you mitigated
> this issue.

Yeah, so in general holding mmap_read is indeed pure magic thinking for
preventing pte changes, like Christian points out. It doesn't stop
invalidates, and with the per vma locking it also doesn't stop new valid
ptes from being inserted at least for anon memory.

Except migration pte entries that point at vram pages are special, and are
_only_ resolved while holding mmap_read. Which means holding mmap_write
for the case of looking up our own vram pages with hmm_range_fault
actually prevents issues. And so this duct-tape of holding mmap_write very
much looks like a working hack to plug any races against concurrently
ongoing migrations to system memory due to cpu faults.

An even more fun corner case is multiple concurrent cpu faults on the same
vram page. fork gets you that, or maybe a bit more reasonable mremap with
MREMAP_DONTUNMAP | MREMAP_MAYMOVE. I think just hammering the same va with
multiple threads alone isn't enough, it's better to have a private va for
each thread pointing at the same anon memory page, so that you can get
more parallel faults due to finely grained pte locking.

Would be a good testcase to add, if you don't have it yet.
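
A rough sketch of such a testcase; run_gpu_kernel() is a hypothetical helper
that gets the page migrated to vram, and fork gives each child its own vma
over the same CoW anon page so the cpu faults race:

	#include <sys/mman.h>
	#include <sys/wait.h>
	#include <unistd.h>

	extern void run_gpu_kernel(void *buf);	/* hypothetical helper */

	#define NPROC 8

	int main(void)
	{
		char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
				 MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

		buf[0] = 1;		/* back the va with an anon page */
		run_gpu_kernel(buf);	/* page now lives in vram */

		for (int i = 0; i < NPROC; i++)
			if (fork() == 0) {
				volatile char v = buf[0];	/* racing cpu faults */
				_exit(v);
			}

		while (wait(NULL) > 0)
			;
		return 0;
	}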

> I must say it is a bit unfortunate that the HMM locking documentation
> doesn’t mention this. I believe the documentation needs additional
> information, which I can add once we finalize the solution.

Yeah, at least from my very cursory look you don't have enough locking.
I've written an in-depth reply to patch 23 with the high-level summary of
my thoughts.

Cheers, Sima

> 
> Matt 
> 
> > Even with the write side of the mm_lock taken it is perfectly possible that
> > PTEs change. It's just less likely.
> > 
> > We ran into multiple issues before we figured out this important distinction
> > as well.
> > 
> > Christian.
> > 
> > > -Sima
> > > 
> > > 
> > > > +			if (vram_pages)
> > > > +				mmap_write_lock(mm);
> > > > +			else
> > > > +				mmap_read_lock(mm);
> > > > +		}
> > > > +		err = hmm_range_fault(&hmm_range);
> > > > +		if (!ctx->mmap_locked) {
> > > > +			if (vram_pages)
> > > > +				mmap_write_unlock(mm);
> > > > +			else
> > > > +				mmap_read_unlock(mm);
> > > > +		}
> > > > +
> > > > +		if (err == -EBUSY) {
> > > > +			if (time_after(jiffies, timeout))
> > > > +				break;
> > > > +
> > > > +			hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
> > > > +			continue;
> > > > +		}
> > > > +		break;
> > > > +	}
> > > > +	if (!ctx->mmap_locked)
> > > > +		mmput(mm);
> > > > +	if (err)
> > > > +		goto err_free;
> > > > +
> > > > +	pages = (struct page **)pfns;
> > > > +
> > > > +	if (ctx->prefault) {
> > > > +		range->pages = pages;
> > > > +		goto set_seqno;
> > > > +	}
> > > > +
> > > > +map_pages:
> > > > +	if (is_device_private_page(hmm_pfn_to_page(pfns[0]))) {
> > > > +		WARN_ON_ONCE(!range->vram_allocation);
> > > > +
> > > > +		for (i = 0; i < npages; ++i) {
> > > > +			pages[i] = hmm_pfn_to_page(pfns[i]);
> > > > +
> > > > +			if (WARN_ON_ONCE(!is_device_private_page(pages[i]))) {
> > > > +				err = -EOPNOTSUPP;
> > > > +				goto err_free;
> > > > +			}
> > > > +		}
> > > > +
> > > > +		/* Do not race with notifier unmapping pages */
> > > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > > +		range->flags.has_vram_pages = true;
> > > > +		range->pages = pages;
> > > > +		if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
> > > > +			err = -EAGAIN;
> > > > +			__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > > > +		}
> > > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > > +	} else {
> > > > +		dma_addr_t *dma_addr = (dma_addr_t *)pfns;
> > > > +
> > > > +		for_each_dma_page(i, j, npages, order) {
> > > > +			if (WARN_ON_ONCE(i && order !=
> > > > +					 hmm_pfn_to_map_order(pfns[i]))) {
> > > > +				err = -EOPNOTSUPP;
> > > > +				npages = i;
> > > > +				goto err_unmap;
> > > > +			}
> > > > +			order = hmm_pfn_to_map_order(pfns[i]);
> > > > +
> > > > +			pages[j] = hmm_pfn_to_page(pfns[i]);
> > > > +			if (WARN_ON_ONCE(is_zone_device_page(pages[j]))) {
> > > > +				err = -EOPNOTSUPP;
> > > > +				npages = i;
> > > > +				goto err_unmap;
> > > > +			}
> > > > +
> > > > +			set_page_dirty_lock(pages[j]);
> > > > +			mark_page_accessed(pages[j]);
> > > > +
> > > > +			dma_addr[j] = dma_map_page(gpusvm->drm->dev,
> > > > +						   pages[j], 0,
> > > > +						   PAGE_SIZE << order,
> > > > +						   DMA_BIDIRECTIONAL);
> > > > +			if (dma_mapping_error(gpusvm->drm->dev, dma_addr[j])) {
> > > > +				err = -EFAULT;
> > > > +				npages = i;
> > > > +				goto err_unmap;
> > > > +			}
> > > > +		}
> > > > +
> > > > +		/* Huge pages, reduce memory footprint */
> > > > +		if (order) {
> > > > +			dma_addr = kmalloc_array(j, sizeof(*dma_addr),
> > > > +						 GFP_KERNEL);
> > > > +			if (dma_addr) {
> > > > +				for (i = 0; i < j; ++i)
> > > > +					dma_addr[i] = (dma_addr_t)pfns[i];
> > > > +				kvfree(pfns);
> > > > +				kfree_mapping = true;
> > > > +			} else {
> > > > +				dma_addr = (dma_addr_t *)pfns;
> > > > +			}
> > > > +		}
> > > > +
> > > > +		/* Do not race with notifier unmapping pages */
> > > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > > +		range->order = order;
> > > > +		range->flags.kfree_mapping = kfree_mapping;
> > > > +		range->flags.has_dma_mapping = true;
> > > > +		range->dma_addr = dma_addr;
> > > > +		range->vram_allocation = NULL;
> > > > +		if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
> > > > +			err = -EAGAIN;
> > > > +			__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > > > +		}
> > > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > > +	}
> > > > +
> > > > +	if (err == -EAGAIN)
> > > > +		goto retry;
> > > > +set_seqno:
> > > > +	range->notifier_seq = hmm_range.notifier_seq;
> > > > +
> > > > +	return 0;
> > > > +
> > > > +err_unmap:
> > > > +	for_each_dma_page(i, j, npages, order)
> > > > +		dma_unmap_page(gpusvm->drm->dev,
> > > > +			       (dma_addr_t)pfns[j],
> > > > +			       PAGE_SIZE << order, DMA_BIDIRECTIONAL);
> > > > +err_free:
> > > > +	if (alloc_pfns)
> > > > +		kvfree(pfns);
> > > > +err_out:
> > > > +	return err;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_unmap_pages - Unmap pages associated with a GPU SVM range
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @range: Pointer to the GPU SVM range structure
> > > > + * @ctx: GPU SVM context
> > > > + *
> > > > + * This function unmaps pages associated with a GPU SVM range. If @in_notifier
> > > > + * is set, it is assumed that gpusvm->notifier_lock is held in write mode; if it
> > > > + * is clear, it acquires gpusvm->notifier_lock in read mode. Must be called on
> > > > + * each GPU SVM range attached to notifier in gpusvm->ops->invalidate for IOMMU
> > > > + * security model.
> > > > + */
> > > > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> > > > +				  struct drm_gpusvm_range *range,
> > > > +				  const struct drm_gpusvm_ctx *ctx)
> > > > +{
> > > > +	if (ctx->in_notifier)
> > > > +		lockdep_assert_held_write(&gpusvm->notifier_lock);
> > > > +	else
> > > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > > +
> > > > +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > > > +
> > > > +	if (!ctx->in_notifier)
> > > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_migration_put_page - Put a migration page
> > > > + * @page: Pointer to the page to put
> > > > + *
> > > > + * This function unlocks and puts a page.
> > > > + */
> > > > +static void drm_gpusvm_migration_put_page(struct page *page)
> > > > +{
> > > > +	unlock_page(page);
> > > > +	put_page(page);
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_migration_put_pages - Put migration pages
> > > > + * @npages: Number of pages
> > > > + * @migrate_pfn: Array of migrate page frame numbers
> > > > + *
> > > > + * This function puts an array of pages.
> > > > + */
> > > > +static void drm_gpusvm_migration_put_pages(unsigned long npages,
> > > > +					   unsigned long *migrate_pfn)
> > > > +{
> > > > +	unsigned long i;
> > > > +
> > > > +	for (i = 0; i < npages; ++i) {
> > > > +		if (!migrate_pfn[i])
> > > > +			continue;
> > > > +
> > > > +		drm_gpusvm_migration_put_page(migrate_pfn_to_page(migrate_pfn[i]));
> > > > +		migrate_pfn[i] = 0;
> > > > +	}
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_get_vram_page - Get a reference to a VRAM page
> > > > + * @page: Pointer to the page
> > > > + * @zdd: Pointer to the GPU SVM zone device data
> > > > + *
> > > > + * This function associates the given page with the specified GPU SVM zone
> > > > + * device data and initializes it for zone device usage.
> > > > + */
> > > > +static void drm_gpusvm_get_vram_page(struct page *page,
> > > > +				     struct drm_gpusvm_zdd *zdd)
> > > > +{
> > > > +	page->zone_device_data = drm_gpusvm_zdd_get(zdd);
> > > > +	zone_device_page_init(page);
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_migrate_map_pages() - Map migration pages for GPU SVM migration
> > > > + * @dev: The device for which the pages are being mapped
> > > > + * @dma_addr: Array to store DMA addresses corresponding to mapped pages
> > > > + * @migrate_pfn: Array of migrate page frame numbers to map
> > > > + * @npages: Number of pages to map
> > > > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > > > + *
> > > > + * This function maps pages of memory for migration usage in GPU SVM. It
> > > > + * iterates over each page frame number provided in @migrate_pfn, maps the
> > > > + * corresponding page, and stores the DMA address in the provided @dma_addr
> > > > + * array.
> > > > + *
> > > > + * Return: 0 on success, -EFAULT if an error occurs during mapping.
> > > > + */
> > > > +static int drm_gpusvm_migrate_map_pages(struct device *dev,
> > > > +					dma_addr_t *dma_addr,
> > > > +					long unsigned int *migrate_pfn,
> > > > +					unsigned long npages,
> > > > +					enum dma_data_direction dir)
> > > > +{
> > > > +	unsigned long i;
> > > > +
> > > > +	for (i = 0; i < npages; ++i) {
> > > > +		struct page *page = migrate_pfn_to_page(migrate_pfn[i]);
> > > > +
> > > > +		if (!page)
> > > > +			continue;
> > > > +
> > > > +		if (WARN_ON_ONCE(is_zone_device_page(page)))
> > > > +			return -EFAULT;
> > > > +
> > > > +		dma_addr[i] = dma_map_page(dev, page, 0, PAGE_SIZE, dir);
> > > > +		if (dma_mapping_error(dev, dma_addr[i]))
> > > > +			return -EFAULT;
> > > > +	}
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_migrate_unmap_pages() - Unmap pages previously mapped for GPU SVM migration
> > > > + * @dev: The device for which the pages were mapped
> > > > + * @dma_addr: Array of DMA addresses corresponding to mapped pages
> > > > + * @npages: Number of pages to unmap
> > > > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > > > + *
> > > > + * This function unmaps previously mapped pages of memory for GPU Shared Virtual
> > > > + * Memory (SVM). It iterates over each DMA address provided in @dma_addr, checks
> > > > + * if it's valid and not already unmapped, and unmaps the corresponding page.
> > > > + */
> > > > +static void drm_gpusvm_migrate_unmap_pages(struct device *dev,
> > > > +					   dma_addr_t *dma_addr,
> > > > +					   unsigned long npages,
> > > > +					   enum dma_data_direction dir)
> > > > +{
> > > > +	unsigned long i;
> > > > +
> > > > +	for (i = 0; i < npages; ++i) {
> > > > +		if (!dma_addr[i] || dma_mapping_error(dev, dma_addr[i]))
> > > > +			continue;
> > > > +
> > > > +		dma_unmap_page(dev, dma_addr[i], PAGE_SIZE, dir);
> > > > +	}
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_migrate_to_vram - Migrate GPU SVM range to VRAM
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @range: Pointer to the GPU SVM range structure
> > > > + * @vram_allocation: Driver-private pointer to the VRAM allocation. The caller
> > > > + *                   should hold a reference to the VRAM allocation, which
> > > > + *                   should be dropped via ops->vram_release or upon the
> > > > + *                   failure of this function.
> > > > + * @ctx: GPU SVM context
> > > > + *
> > > > + * This function migrates the specified GPU SVM range to VRAM. It performs the
> > > > + * necessary setup and invokes the driver-specific operations for migration to
> > > > + * VRAM. Upon successful return, @vram_allocation can safely reference @range
> > > > + * until ops->vram_release is called, which only happens upon successful return.
> > > > + *
> > > > + * Returns:
> > > > + * 0 on success, negative error code on failure.
> > > > + */
> > > > +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> > > > +			       struct drm_gpusvm_range *range,
> > > > +			       void *vram_allocation,
> > > > +			       const struct drm_gpusvm_ctx *ctx)
> > > > +{
> > > > +	u64 start = range->va.start, end = range->va.end;
> > > > +	struct migrate_vma migrate = {
> > > > +		.start		= start,
> > > > +		.end		= end,
> > > > +		.pgmap_owner	= gpusvm->device_private_page_owner,
> > > > +		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
> > > > +	};
> > > > +	struct mm_struct *mm = gpusvm->mm;
> > > > +	unsigned long i, npages = npages_in_range(start, end);
> > > > +	struct vm_area_struct *vas;
> > > > +	struct drm_gpusvm_zdd *zdd = NULL;
> > > > +	struct page **pages;
> > > > +	dma_addr_t *dma_addr;
> > > > +	void *buf;
> > > > +	int err;
> > > > +
> > > > +	if (!range->flags.migrate_vram)
> > > > +		return -EINVAL;
> > > > +
> > > > +	if (!gpusvm->ops->populate_vram_pfn || !gpusvm->ops->copy_to_vram ||
> > > > +	    !gpusvm->ops->copy_to_sram)
> > > > +		return -EOPNOTSUPP;
> > > > +
> > > > +	if (!ctx->mmap_locked) {
> > > > +		if (!mmget_not_zero(mm)) {
> > > > +			err = -EFAULT;
> > > > +			goto err_out;
> > > > +		}
> > > > +		mmap_write_lock(mm);
> > > > +	}
> > > > +
> > > > +	mmap_assert_locked(mm);
> > > > +
> > > > +	vas = vma_lookup(mm, start);
> > > > +	if (!vas) {
> > > > +		err = -ENOENT;
> > > > +		goto err_mmunlock;
> > > > +	}
> > > > +
> > > > +	if (end > vas->vm_end || start < vas->vm_start) {
> > > > +		err = -EINVAL;
> > > > +		goto err_mmunlock;
> > > > +	}
> > > > +
> > > > +	if (!vma_is_anonymous(vas)) {
> > > > +		err = -EBUSY;
> > > > +		goto err_mmunlock;
> > > > +	}
> > > > +
> > > > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) + sizeof(*dma_addr) +
> > > > +		       sizeof(*pages), GFP_KERNEL);
> > > > +	if (!buf) {
> > > > +		err = -ENOMEM;
> > > > +		goto err_mmunlock;
> > > > +	}
> > > > +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > > > +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr)) * npages;
> > > > +
> > > > +	zdd = drm_gpusvm_zdd_alloc(range);
> > > > +	if (!zdd) {
> > > > +		err = -ENOMEM;
> > > > +		goto err_free;
> > > > +	}
> > > > +
> > > > +	migrate.vma = vas;
> > > > +	migrate.src = buf;
> > > > +	migrate.dst = migrate.src + npages;
> > > > +
> > > > +	err = migrate_vma_setup(&migrate);
> > > > +	if (err)
> > > > +		goto err_free;
> > > > +
> > > > +	/*
> > > > +	 * FIXME: Below cases, !migrate.cpages and migrate.cpages != npages, not
> > > > +	 * always an error. Need to revisit possible cases and how to handle. We
> > > > +	 * could prefault on migrate.cpages != npages via hmm_range_fault.
> > > > +	 */
> > > > +
> > > > +	if (!migrate.cpages) {
> > > > +		err = -EFAULT;
> > > > +		goto err_free;
> > > > +	}
> > > > +
> > > > +	if (migrate.cpages != npages) {
> > > > +		err = -EBUSY;
> > > > +		goto err_finalize;
> > > > +	}
> > > > +
> > > > +	err = gpusvm->ops->populate_vram_pfn(gpusvm, vram_allocation, npages,
> > > > +					     migrate.dst);
> > > > +	if (err)
> > > > +		goto err_finalize;
> > > > +
> > > > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev, dma_addr,
> > > > +					   migrate.src, npages, DMA_TO_DEVICE);
> > > > +	if (err)
> > > > +		goto err_finalize;
> > > > +
> > > > +	for (i = 0; i < npages; ++i) {
> > > > +		struct page *page = pfn_to_page(migrate.dst[i]);
> > > > +
> > > > +		pages[i] = page;
> > > > +		migrate.dst[i] = migrate_pfn(migrate.dst[i]);
> > > > +		drm_gpusvm_get_vram_page(page, zdd);
> > > > +	}
> > > > +
> > > > +	err = gpusvm->ops->copy_to_vram(gpusvm, pages, dma_addr, npages);
> > > > +	if (err)
> > > > +		goto err_finalize;
> > > > +
> > > > +	/* Upon success bind vram allocation to range and zdd */
> > > > +	range->vram_allocation = vram_allocation;
> > > > +	WRITE_ONCE(zdd->vram_allocation, vram_allocation);	/* Owns ref */
> > > > +
> > > > +err_finalize:
> > > > +	if (err)
> > > > +		drm_gpusvm_migration_put_pages(npages, migrate.dst);
> > > > +	migrate_vma_pages(&migrate);
> > > > +	migrate_vma_finalize(&migrate);
> > > > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr, npages,
> > > > +				       DMA_TO_DEVICE);
> > > > +err_free:
> > > > +	if (zdd)
> > > > +		drm_gpusvm_zdd_put(zdd);
> > > > +	kvfree(buf);
> > > > +err_mmunlock:
> > > > +	if (!ctx->mmap_locked) {
> > > > +		mmap_write_unlock(mm);
> > > > +		mmput(mm);
> > > > +	}
> > > > +err_out:
> > > > +	return err;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_migrate_populate_sram_pfn - Populate SRAM PFNs for a VM area
> > > > + * @vas: Pointer to the VM area structure, can be NULL
> > > > + * @npages: Number of pages to populate
> > > > + * @src_mpfn: Source array of migrate PFNs
> > > > + * @mpfn: Array of migrate PFNs to populate
> > > > + * @addr: Start address for PFN allocation
> > > > + *
> > > > + * This function populates the SRAM migrate page frame numbers (PFNs) for the
> > > > + * specified VM area structure. It allocates and locks pages in the VM area for
> > > > + * SRAM usage. If @vas is non-NULL, alloc_page_vma() is used for allocation; if
> > > > + * NULL, alloc_page() is used.
> > > > + *
> > > > + * Returns:
> > > > + * 0 on success, negative error code on failure.
> > > > + */
> > > > +static int drm_gpusvm_migrate_populate_sram_pfn(struct vm_area_struct *vas,
> > > > +						unsigned long npages,
> > > > +						unsigned long *src_mpfn,
> > > > +						unsigned long *mpfn, u64 addr)
> > > > +{
> > > > +	unsigned long i;
> > > > +
> > > > +	for (i = 0; i < npages; ++i, addr += PAGE_SIZE) {
> > > > +		struct page *page;
> > > > +
> > > > +		if (!(src_mpfn[i] & MIGRATE_PFN_MIGRATE))
> > > > +			continue;
> > > > +
> > > > +		if (vas)
> > > > +			page = alloc_page_vma(GFP_HIGHUSER, vas, addr);
> > > > +		else
> > > > +			page = alloc_page(GFP_HIGHUSER);
> > > > +
> > > > +		if (!page)
> > > > +			return -ENOMEM;
> > > > +
> > > > +		lock_page(page);
> > > > +		mpfn[i] = migrate_pfn(page_to_pfn(page));
> > > > +	}
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_evict_to_sram - Evict GPU SVM range to SRAM
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @range: Pointer to the GPU SVM range structure
> > > > + *
> > > > + * Similar to __drm_gpusvm_migrate_to_sram but does not require mmap lock and
> > > > + * migration is done via migrate_device_* functions. This is a fallback path,
> > > > + * as it is preferred to issue migrations with the mmap lock held.
> > > > + *
> > > > + * Returns:
> > > > + * 0 on success, negative error code on failure.
> > > > + */
> > > > +static int drm_gpusvm_evict_to_sram(struct drm_gpusvm *gpusvm,
> > > > +				    struct drm_gpusvm_range *range)
> > > > +{
> > > > +	unsigned long npages;
> > > > +	struct page **pages;
> > > > +	unsigned long *src, *dst;
> > > > +	dma_addr_t *dma_addr;
> > > > +	void *buf;
> > > > +	int i, err = 0;
> > > > +
> > > > +	npages = npages_in_range(range->va.start, range->va.end);
> > > > +
> > > > +	buf = kvcalloc(npages, 2 * sizeof(*src) + sizeof(*dma_addr) +
> > > > +		       sizeof(*pages), GFP_KERNEL);
> > > > +	if (!buf) {
> > > > +		err = -ENOMEM;
> > > > +		goto err_out;
> > > > +	}
> > > > +	src = buf;
> > > > +	dst = buf + (sizeof(*src) * npages);
> > > > +	dma_addr = buf + (2 * sizeof(*src) * npages);
> > > > +	pages = buf + (2 * sizeof(*src) + sizeof(*dma_addr)) * npages;
> > > > +
> > > > +	err = gpusvm->ops->populate_vram_pfn(gpusvm, range->vram_allocation,
> > > > +					     npages, src);
> > > > +	if (err)
> > > > +		goto err_free;
> > > > +
> > > > +	err = migrate_device_vma_range(gpusvm->mm,
> > > > +				       gpusvm->device_private_page_owner, src,
> > > > +				       npages, range->va.start);
> > > > +	if (err)
> > > > +		goto err_free;
> > > > +
> > > > +	err = drm_gpusvm_migrate_populate_sram_pfn(NULL, npages, src, dst, 0);
> > > > +	if (err)
> > > > +		goto err_finalize;
> > > > +
> > > > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev, dma_addr,
> > > > +					   dst, npages, DMA_BIDIRECTIONAL);
> > > > +	if (err)
> > > > +		goto err_finalize;
> > > > +
> > > > +	for (i = 0; i < npages; ++i)
> > > > +		pages[i] = migrate_pfn_to_page(src[i]);
> > > > +
> > > > +	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr, npages);
> > > > +	if (err)
> > > > +		goto err_finalize;
> > > > +
> > > > +err_finalize:
> > > > +	if (err)
> > > > +		drm_gpusvm_migration_put_pages(npages, dst);
> > > > +	migrate_device_pages(src, dst, npages);
> > > > +	migrate_device_finalize(src, dst, npages);
> > > > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr, npages,
> > > > +				       DMA_BIDIRECTIONAL);
> > > > +err_free:
> > > > +	kvfree(buf);
> > > > +err_out:
> > > > +
> > > > +	return err;
> > > > +}
> > > > +
> > > > +/**
> > > > + * __drm_gpusvm_migrate_to_sram - Migrate GPU SVM range to SRAM (internal)
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @vas: Pointer to the VM area structure
> > > > + * @page: Pointer to the page for fault handling (can be NULL)
> > > > + * @start: Start address of the migration range
> > > > + * @end: End address of the migration range
> > > > + *
> > > > + * This internal function performs the migration of the specified GPU SVM range
> > > > + * to SRAM. It sets up the migration, populates + dma maps SRAM PFNs, and
> > > > + * invokes the driver-specific operations for migration to SRAM.
> > > > + *
> > > > + * Returns:
> > > > + * 0 on success, negative error code on failure.
> > > > + */
> > > > +static int __drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > > > +					struct vm_area_struct *vas,
> > > > +					struct page *page,
> > > > +					u64 start, u64 end)
> > > > +{
> > > > +	struct migrate_vma migrate = {
> > > > +		.vma		= vas,
> > > > +		.pgmap_owner	= gpusvm->device_private_page_owner,
> > > > +		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> > > > +		.fault_page	= page,
> > > > +	};
> > > > +	unsigned long npages;
> > > > +	struct page **pages;
> > > > +	dma_addr_t *dma_addr;
> > > > +	void *buf;
> > > > +	int i, err = 0;
> > > > +
> > > > +	mmap_assert_locked(gpusvm->mm);
> > > > +
> > > > +	/* Corner where VMA area struct has been partially unmapped */
> > > > +	if (start < vas->vm_start)
> > > > +		start = vas->vm_start;
> > > > +	if (end > vas->vm_end)
> > > > +		end = vas->vm_end;
> > > > +
> > > > +	migrate.start = start;
> > > > +	migrate.end = end;
> > > > +	npages = npages_in_range(start, end);
> > > > +
> > > > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) + sizeof(*dma_addr) +
> > > > +		       sizeof(*pages), GFP_KERNEL);
> > > > +	if (!buf) {
> > > > +		err = -ENOMEM;
> > > > +		goto err_out;
> > > > +	}
> > > > +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > > > +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr)) * npages;
> > > > +
> > > > +	migrate.vma = vas;
> > > > +	migrate.src = buf;
> > > > +	migrate.dst = migrate.src + npages;
> > > > +
> > > > +	err = migrate_vma_setup(&migrate);
> > > > +	if (err)
> > > > +		goto err_free;
> > > > +
> > > > +	/* Raced with another CPU fault, nothing to do */
> > > > +	if (!migrate.cpages)
> > > > +		goto err_free;
> > > > +
> > > > +	err = drm_gpusvm_migrate_populate_sram_pfn(vas, npages,
> > > > +						   migrate.src, migrate.dst,
> > > > +						   start);
> > > > +	if (err)
> > > > +		goto err_finalize;
> > > > +
> > > > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev, dma_addr,
> > > > +					   migrate.dst, npages,
> > > > +					   DMA_BIDIRECTIONAL);
> > > > +	if (err)
> > > > +		goto err_finalize;
> > > > +
> > > > +	for (i = 0; i < npages; ++i)
> > > > +		pages[i] = migrate_pfn_to_page(migrate.src[i]);
> > > > +
> > > > +	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr, npages);
> > > > +	if (err)
> > > > +		goto err_finalize;
> > > > +
> > > > +err_finalize:
> > > > +	if (err)
> > > > +		drm_gpusvm_migration_put_pages(npages, migrate.dst);
> > > > +	migrate_vma_pages(&migrate);
> > > > +	migrate_vma_finalize(&migrate);
> > > > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr, npages,
> > > > +				       DMA_BIDIRECTIONAL);
> > > > +err_free:
> > > > +	kvfree(buf);
> > > > +err_out:
> > > > +	mmap_assert_locked(gpusvm->mm);
> > > > +
> > > > +	return err;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_migrate_to_sram - Migrate (evict) GPU SVM range to SRAM
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @range: Pointer to the GPU SVM range structure
> > > > + * @ctx: GPU SVM context
> > > > + *
> > > > + * This function initiates the migration of the specified GPU SVM range to
> > > > + * SRAM. It performs necessary checks and invokes the internal migration
> > > > + * function for actual migration.
> > > > + *
> > > > + * Returns:
> > > > + * 0 on success, negative error code on failure.
> > > > + */
> > > > +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > > > +			       struct drm_gpusvm_range *range,
> > > > +			       const struct drm_gpusvm_ctx *ctx)
> > > > +{
> > > > +	u64 start = range->va.start, end = range->va.end;
> > > > +	struct mm_struct *mm = gpusvm->mm;
> > > > +	struct vm_area_struct *vas;
> > > > +	int err;
> > > > +	bool retry = false;
> > > > +
> > > > +	if (!ctx->mmap_locked) {
> > > > +		if (!mmget_not_zero(mm)) {
> > > > +			err = -EFAULT;
> > > > +			goto err_out;
> > > > +		}
> > > > +		if (ctx->trylock_mmap) {
> > > > +			if (!mmap_read_trylock(mm))  {
> > > > +				err = drm_gpusvm_evict_to_sram(gpusvm, range);
> > > > +				goto err_mmput;
> > > > +			}
> > > > +		} else {
> > > > +			mmap_read_lock(mm);
> > > > +		}
> > > > +	}
> > > > +
> > > > +	mmap_assert_locked(mm);
> > > > +
> > > > +	/*
> > > > +	 * Loop required to find all VMA area structs for the corner case when
> > > > +	 * VRAM backing has been partially unmapped from MM's address space.
> > > > +	 */
> > > > +again:
> > > > +	vas = find_vma(mm, start);
> > > > +	if (!vas) {
> > > > +		if (!retry)
> > > > +			err = -ENOENT;
> > > > +		goto err_mmunlock;
> > > > +	}
> > > > +
> > > > +	if (end <= vas->vm_start || start >= vas->vm_end) {
> > > > +		if (!retry)
> > > > +			err = -EINVAL;
> > > > +		goto err_mmunlock;
> > > > +	}
> > > > +
> > > > +	err = __drm_gpusvm_migrate_to_sram(gpusvm, vas, NULL, start, end);
> > > > +	if (err)
> > > > +		goto err_mmunlock;
> > > > +
> > > > +	if (vas->vm_end < end) {
> > > > +		retry = true;
> > > > +		start = vas->vm_end;
> > > > +		goto again;
> > > > +	}
> > > > +
> > > > +	if (!ctx->mmap_locked) {
> > > > +		mmap_read_unlock(mm);
> > > > +		/*
> > > > +		 * Using mmput_async as this function can be called while
> > > > +		 * holding a dma-resv lock, and a final put can grab the mmap
> > > > +		 * lock, causing a lock inversion.
> > > > +		 */
> > > > +		mmput_async(mm);
> > > > +	}
> > > > +
> > > > +	return 0;
> > > > +
> > > > +err_mmunlock:
> > > > +	if (!ctx->mmap_locked)
> > > > +		mmap_read_unlock(mm);
> > > > +err_mmput:
> > > > +	if (!ctx->mmap_locked)
> > > > +		mmput_async(mm);
> > > > +err_out:
> > > > +	return err;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_page_free - Put GPU SVM zone device data associated with a page
> > > > + * @page: Pointer to the page
> > > > + *
> > > > + * This function is a callback used to put the GPU SVM zone device data
> > > > + * associated with a page when it is being released.
> > > > + */
> > > > +static void drm_gpusvm_page_free(struct page *page)
> > > > +{
> > > > +	drm_gpusvm_zdd_put(page->zone_device_data);
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_migrate_to_ram - Migrate GPU SVM range to RAM (page fault handler)
> > > > + * @vmf: Pointer to the fault information structure
> > > > + *
> > > > + * This function is a page fault handler used to migrate a GPU SVM range to RAM.
> > > > + * It retrieves the GPU SVM range information from the faulting page and invokes
> > > > + * the internal migration function to migrate the range back to RAM.
> > > > + *
> > > > + * Returns:
> > > > + * VM_FAULT_SIGBUS on failure, 0 on success.
> > > > + */
> > > > +static vm_fault_t drm_gpusvm_migrate_to_ram(struct vm_fault *vmf)
> > > > +{
> > > > +	struct drm_gpusvm_zdd *zdd = vmf->page->zone_device_data;
> > > > +	int err;
> > > > +
> > > > +	err = __drm_gpusvm_migrate_to_sram(zdd->range->gpusvm,
> > > > +					   vmf->vma, vmf->page,
> > > > +					   zdd->range->va.start,
> > > > +					   zdd->range->va.end);
> > > > +
> > > > +	return err ? VM_FAULT_SIGBUS : 0;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_pagemap_ops - Device page map operations for GPU SVM
> > > > + */
> > > > +static const struct dev_pagemap_ops drm_gpusvm_pagemap_ops = {
> > > > +	.page_free = drm_gpusvm_page_free,
> > > > +	.migrate_to_ram = drm_gpusvm_migrate_to_ram,
> > > > +};
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_pagemap_ops_get - Retrieve GPU SVM device page map operations
> > > > + *
> > > > + * Returns:
> > > > + * Pointer to the GPU SVM device page map operations structure.
> > > > + */
> > > > +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void)
> > > > +{
> > > > +	return &drm_gpusvm_pagemap_ops;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_has_mapping - Check if GPU SVM has mapping for the given address range
> > > > + * @gpusvm: Pointer to the GPU SVM structure.
> > > > + * @start: Start address
> > > > + * @end: End address
> > > > + *
> > > > + * Returns:
> > > > + * True if GPU SVM has mapping, False otherwise
> > > > + */
> > > > +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start, u64 end)
> > > > +{
> > > > +	struct drm_gpusvm_notifier *notifier;
> > > > +
> > > > +	drm_gpusvm_for_each_notifier(notifier, gpusvm, start, end) {
> > > > +		struct drm_gpusvm_range *range = NULL;
> > > > +
> > > > +		drm_gpusvm_for_each_range(range, notifier, start, end)
> > > > +			return true;
> > > > +	}
> > > > +
> > > > +	return false;
> > > > +}
> > > > diff --git a/drivers/gpu/drm/xe/drm_gpusvm.h b/drivers/gpu/drm/xe/drm_gpusvm.h
> > > > new file mode 100644
> > > > index 000000000000..0ea70f8534a8
> > > > --- /dev/null
> > > > +++ b/drivers/gpu/drm/xe/drm_gpusvm.h
> > > > @@ -0,0 +1,415 @@
> > > > +/* SPDX-License-Identifier: MIT */
> > > > +/*
> > > > + * Copyright © 2024 Intel Corporation
> > > > + */
> > > > +
> > > > +#ifndef __DRM_GPUSVM_H__
> > > > +#define __DRM_GPUSVM_H__
> > > > +
> > > > +#include <linux/kref.h>
> > > > +#include <linux/mmu_notifier.h>
> > > > +#include <linux/workqueue.h>
> > > > +
> > > > +struct dev_pagemap_ops;
> > > > +struct drm_device;
> > > > +struct drm_gpusvm;
> > > > +struct drm_gpusvm_notifier;
> > > > +struct drm_gpusvm_ops;
> > > > +struct drm_gpusvm_range;
> > > > +
> > > > +/**
> > > > + * struct drm_gpusvm_ops - Operations structure for GPU SVM
> > > > + *
> > > > + * This structure defines the operations for GPU Shared Virtual Memory (SVM).
> > > > + * These operations are provided by the GPU driver to manage SVM ranges and
> > > > + * perform operations such as migration between VRAM and system RAM.
> > > > + */
> > > > +struct drm_gpusvm_ops {
> > > > +	/**
> > > > +	 * @notifier_alloc: Allocate a GPU SVM notifier (optional)
> > > > +	 *
> > > > +	 * This function shall allocate a GPU SVM notifier.
> > > > +	 *
> > > > +	 * Returns:
> > > > +	 * Pointer to the allocated GPU SVM notifier on success, NULL on failure.
> > > > +	 */
> > > > +	struct drm_gpusvm_notifier *(*notifier_alloc)(void);
> > > > +
> > > > +	/**
> > > > +	 * @notifier_free: Free a GPU SVM notifier (optional)
> > > > +	 * @notifier: Pointer to the GPU SVM notifier to be freed
> > > > +	 *
> > > > +	 * This function shall free a GPU SVM notifier.
> > > > +	 */
> > > > +	void (*notifier_free)(struct drm_gpusvm_notifier *notifier);
> > > > +
> > > > +	/**
> > > > +	 * @range_alloc: Allocate a GPU SVM range (optional)
> > > > +	 * @gpusvm: Pointer to the GPU SVM
> > > > +	 *
> > > > +	 * This function shall allocate a GPU SVM range.
> > > > +	 *
> > > > +	 * Returns:
> > > > +	 * Pointer to the allocated GPU SVM range on success, NULL on failure.
> > > > +	 */
> > > > +	struct drm_gpusvm_range *(*range_alloc)(struct drm_gpusvm *gpusvm);
> > > > +
> > > > +	/**
> > > > +	 * @range_free: Free a GPU SVM range (optional)
> > > > +	 * @range: Pointer to the GPU SVM range to be freed
> > > > +	 *
> > > > +	 * This function shall free a GPU SVM range.
> > > > +	 */
> > > > +	void (*range_free)(struct drm_gpusvm_range *range);
> > > > +
> > > > +	/**
> > > > +	 * @vram_release: Release VRAM allocation (optional)
> > > > +	 * @vram_allocation: Driver-private pointer to the VRAM allocation
> > > > +	 *
> > > > +	 * This function shall release VRAM allocation and expects to drop a
> > > > +	 * reference to VRAM allocation.
> > > > +	 */
> > > > +	void (*vram_release)(void *vram_allocation);
> > > > +
> > > > +	/**
> > > > +	 * @populate_vram_pfn: Populate VRAM PFN (required for migration)
> > > > +	 * @gpusvm: Pointer to the GPU SVM
> > > > +	 * @vram_allocation: Driver-private pointer to the VRAM allocation
> > > > +	 * @npages: Number of pages to populate
> > > > +	 * @pfn: Array of page frame numbers to populate
> > > > +	 *
> > > > +	 * This function shall populate VRAM page frame numbers (PFN).
> > > > +	 *
> > > > +	 * Returns:
> > > > +	 * 0 on success, a negative error code on failure.
> > > > +	 */
> > > > +	int (*populate_vram_pfn)(struct drm_gpusvm *gpusvm,
> > > > +				 void *vram_allocation,
> > > > +				 unsigned long npages,
> > > > +				 unsigned long *pfn);
> > > > +
> > > > +	/**
> > > > +	 * @copy_to_vram: Copy to VRAM (required for migration)
> > > > +	 * @gpusvm: Pointer to the GPU SVM
> > > > +	 * @pages: Pointer to array of VRAM pages (destination)
> > > > +	 * @dma_addr: Pointer to array of DMA addresses (source)
> > > > +	 * @npages: Number of pages to copy
> > > > +	 *
> > > > +	 * This function shall copy pages to VRAM.
> > > > +	 *
> > > > +	 * Returns:
> > > > +	 * 0 on success, a negative error code on failure.
> > > > +	 */
> > > > +	int (*copy_to_vram)(struct drm_gpusvm *gpusvm,
> > > > +			    struct page **pages,
> > > > +			    dma_addr_t *dma_addr,
> > > > +			    unsigned long npages);
> > > > +
> > > > +	/**
> > > > +	 * @copy_to_sram: Copy to system RAM (required for migration)
> > > > +	 * @gpusvm: Pointer to the GPU SVM
> > > > +	 * @pages: Pointer to array of VRAM pages (source)
> > > > +	 * @dma_addr: Pointer to array of DMA addresses (destination)
> > > > +	 * @npages: Number of pages to copy
> > > > +	 *
> > > > +	 * This function shall copy pages to system RAM.
> > > > +	 *
> > > > +	 * Returns:
> > > > +	 * 0 on success, a negative error code on failure.
> > > > +	 */
> > > > +	int (*copy_to_sram)(struct drm_gpusvm *gpusvm,
> > > > +			    struct page **pages,
> > > > +			    dma_addr_t *dma_addr,
> > > > +			    unsigned long npages);
> > > > +
> > > > +	/**
> > > > +	 * @invalidate: Invalidate GPU SVM notifier (required)
> > > > +	 * @gpusvm: Pointer to the GPU SVM
> > > > +	 * @notifier: Pointer to the GPU SVM notifier
> > > > +	 * @mmu_range: Pointer to the mmu_notifier_range structure
> > > > +	 *
> > > > +	 * This function shall invalidate the GPU page tables. It can safely
> > > > +	 * walk the notifier range RB tree/list in this function. Called while
> > > > +	 * holding the notifier lock.
> > > > +	 */
> > > > +	void (*invalidate)(struct drm_gpusvm *gpusvm,
> > > > +			   struct drm_gpusvm_notifier *notifier,
> > > > +			   const struct mmu_notifier_range *mmu_range);
> > > > +};
> > > > +
> > > > +/**
> > > > + * struct drm_gpusvm_notifier - Structure representing a GPU SVM notifier
> > > > + *
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @notifier: MMU interval notifier
> > > > + * @interval: Interval for the notifier
> > > > + * @rb: Red-black tree node for the parent GPU SVM structure notifier tree
> > > > + * @root: Cached root node of the RB tree containing ranges
> > > > + * @range_list: List head of ranges in the same order they appear in the
> > > > + *              interval tree. This is useful to keep iterating ranges while
> > > > + *              doing modifications to the RB tree.
> > > > + * @flags.removed: Flag indicating whether the MMU interval notifier has been
> > > > + *                 removed
> > > > + *
> > > > + * This structure represents a GPU SVM notifier.
> > > > + */
> > > > +struct drm_gpusvm_notifier {
> > > > +	struct drm_gpusvm *gpusvm;
> > > > +	struct mmu_interval_notifier notifier;
> > > > +	struct {
> > > > +		u64 start;
> > > > +		u64 end;
> > > > +	} interval;
> > > > +	struct {
> > > > +		struct rb_node node;
> > > > +		struct list_head entry;
> > > > +		u64 __subtree_last;
> > > > +	} rb;
> > > > +	struct rb_root_cached root;
> > > > +	struct list_head range_list;
> > > > +	struct {
> > > > +		u32 removed : 1;
> > > > +	} flags;
> > > > +};
> > > > +
> > > > +/**
> > > > + * struct drm_gpusvm_range - Structure representing a GPU SVM range
> > > > + *
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @notifier: Pointer to the GPU SVM notifier
> > > > + * @refcount: Reference count for the range
> > > > + * @rb: Red-black tree node for the parent GPU SVM notifier structure range tree
> > > > + * @va: Virtual address range
> > > > + * @notifier_seq: Notifier sequence number of the range's pages
> > > > + * @pages: Pointer to the array of pages (if backing store is in VRAM)
> > > > + * @dma_addr: DMA address array (if backing store is SRAM and DMA mapped)
> > > > + * @vram_allocation: Driver-private pointer to the VRAM allocation
> > > > + * @order: Order of dma mapping. i.e. PAGE_SIZE << order is mapping size
> > > > + * @flags.migrate_vram: Flag indicating whether the range can be migrated to VRAM
> > > > + * @flags.unmapped: Flag indicating if the range has been unmapped
> > > > + * @flags.partial_unmap: Flag indicating if the range has been partially unmapped
> > > > + * @flags.has_vram_pages: Flag indicating if the range has vram pages
> > > > + * @flags.has_dma_mapping: Flag indicating if the range has a DMA mapping
> > > > + * @flags.kfree_mapping: Flag indicating @dma_addr is a compact allocation based
> > > > + *                       on @order which releases via kfree
> > > > + *
> > > > + * This structure represents a GPU SVM range used for tracking memory ranges
> > > > + * mapped in a DRM device.
> > > > + */
> > > > +struct drm_gpusvm_range {
> > > > +	struct drm_gpusvm *gpusvm;
> > > > +	struct drm_gpusvm_notifier *notifier;
> > > > +	struct kref refcount;
> > > > +	struct {
> > > > +		struct rb_node node;
> > > > +		struct list_head entry;
> > > > +		u64 __subtree_last;
> > > > +	} rb;
> > > > +	struct {
> > > > +		u64 start;
> > > > +		u64 end;
> > > > +	} va;
> > > > +	unsigned long notifier_seq;
> > > > +	union {
> > > > +		struct page **pages;
> > > > +		dma_addr_t *dma_addr;
> > > > +	};
> > > > +	void *vram_allocation;
> > > > +	u16 order;
> > > > +	struct {
> > > > +		/* All flags below must be set upon creation */
> > > > +		u16 migrate_vram : 1;
> > > > +		/* All flags below must be set / cleared under notifier lock */
> > > > +		u16 unmapped : 1;
> > > > +		u16 partial_unmap : 1;
> > > > +		u16 has_vram_pages : 1;
> > > > +		u16 has_dma_mapping : 1;
> > > > +		u16 kfree_mapping : 1;
> > > > +	} flags;
> > > > +};
> > > > +
> > > > +/**
> > > > + * struct drm_gpusvm - GPU SVM structure
> > > > + *
> > > > + * @name: Name of the GPU SVM
> > > > + * @drm: Pointer to the DRM device structure
> > > > + * @mm: Pointer to the mm_struct for the address space
> > > > + * @device_private_page_owner: Device private pages owner
> > > > + * @mm_start: Start address of GPU SVM
> > > > + * @mm_range: Range of the GPU SVM
> > > > + * @notifier_size: Size of individual notifiers
> > > > + * @ops: Pointer to the operations structure for GPU SVM
> > > > + * @chunk_sizes: Pointer to the array of chunk sizes used in range allocation.
> > > > + *               Entries should be powers of 2 in descending order.
> > > > + * @num_chunks: Number of chunks
> > > > + * @notifier_lock: Read-write semaphore for protecting notifier operations
> > > > + * @zdd_wq: Workqueue for deferred work on zdd destruction
> > > > + * @root: Cached root node of the Red-Black tree containing GPU SVM notifiers
> > > > + * @notifier_list: List head of notifiers in the same order they appear in
> > > > + *                 the interval tree. This is useful to keep iterating
> > > > + *                 notifiers while doing modifications to the RB tree.
> > > > + *
> > > > + * This structure represents a GPU SVM (Shared Virtual Memory) used for tracking
> > > > + * memory ranges mapped in a DRM (Direct Rendering Manager) device.
> > > > + *
> > > > + * No reference counting is provided, as this is expected to be embedded in the
> > > > + * driver VM structure along with the struct drm_gpuvm, which handles reference
> > > > + * counting.
> > > > + */
> > > > +struct drm_gpusvm {
> > > > +	const char *name;
> > > > +	struct drm_device *drm;
> > > > +	struct mm_struct *mm;
> > > > +	void *device_private_page_owner;
> > > > +	u64 mm_start;
> > > > +	u64 mm_range;
> > > > +	u64 notifier_size;
> > > > +	const struct drm_gpusvm_ops *ops;
> > > > +	const u64 *chunk_sizes;
> > > > +	int num_chunks;
> > > > +	struct rw_semaphore notifier_lock;
> > > > +	struct workqueue_struct *zdd_wq;
> > > > +	struct rb_root_cached root;
> > > > +	struct list_head notifier_list;
> > > > +};
> > > > +
> > > > +/**
> > > > + * struct drm_gpusvm_ctx - DRM GPU SVM context
> > > > + *
> > > > + * @mmap_locked: mmap lock is locked
> > > > + * @trylock_mmap: trylock mmap lock, used to avoid locking inversions
> > > > + *                (e.g. dma-resv -> mmap lock)
> > > > + * @in_notifier: entering from a MMU notifier
> > > > + * @read_only: operating on read-only memory
> > > > + * @vram_possible: possible to use VRAM
> > > > + * @prefault: prefault pages
> > > > + *
> > > > + * Context that DRM GPU SVM is operating in (i.e. user arguments).
> > > > + */
> > > > +struct drm_gpusvm_ctx {
> > > > +	u32 mmap_locked :1;
> > > > +	u32 trylock_mmap :1;
> > > > +	u32 in_notifier :1;
> > > > +	u32 read_only :1;
> > > > +	u32 vram_possible :1;
> > > > +	u32 prefault :1;
> > > > +};
> > > > +
> > > > +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> > > > +		    const char *name, struct drm_device *drm,
> > > > +		    struct mm_struct *mm, void *device_private_page_owner,
> > > > +		    u64 mm_start, u64 mm_range, u64 notifier_size,
> > > > +		    const struct drm_gpusvm_ops *ops,
> > > > +		    const u64 *chunk_sizes, int num_chunks);
> > > > +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm);
> > > > +void drm_gpusvm_free(struct drm_gpusvm *gpusvm);
> > > > +
> > > > +struct drm_gpusvm_range *
> > > > +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64 fault_addr,
> > > > +				u64 gpuva_start, u64 gpuva_end,
> > > > +				const struct drm_gpusvm_ctx *ctx);
> > > > +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> > > > +			     struct drm_gpusvm_range *range);
> > > > +
> > > > +struct drm_gpusvm_range *
> > > > +drm_gpusvm_range_get(struct drm_gpusvm_range *range);
> > > > +void drm_gpusvm_range_put(struct drm_gpusvm_range *range);
> > > > +
> > > > +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> > > > +				  struct drm_gpusvm_range *range);
> > > > +
> > > > +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> > > > +			       struct drm_gpusvm_range *range,
> > > > +			       const struct drm_gpusvm_ctx *ctx);
> > > > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> > > > +				  struct drm_gpusvm_range *range,
> > > > +				  const struct drm_gpusvm_ctx *ctx);
> > > > +
> > > > +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> > > > +			       struct drm_gpusvm_range *range,
> > > > +			       void *vram_allocation,
> > > > +			       const struct drm_gpusvm_ctx *ctx);
> > > > +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > > > +			       struct drm_gpusvm_range *range,
> > > > +			       const struct drm_gpusvm_ctx *ctx);
> > > > +
> > > > +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void);
> > > > +
> > > > +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start, u64 end);
> > > > +
> > > > +struct drm_gpusvm_range *
> > > > +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64 start, u64 end);
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_notifier_lock - Lock GPU SVM notifier
> > > > + * @gpusvm__: Pointer to the GPU SVM structure.
> > > > + *
> > > > + * Abstract client usage GPU SVM notifier lock, take lock
> > > > + */
> > > > +#define drm_gpusvm_notifier_lock(gpusvm__)	\
> > > > +	down_read(&(gpusvm__)->notifier_lock)
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_notifier_unlock - Unlock GPU SVM notifier
> > > > + * @gpusvm__: Pointer to the GPU SVM structure.
> > > > + *
> > > > + * Abstract client usage GPU SVM notifier lock, drop lock
> > > > + */
> > > > +#define drm_gpusvm_notifier_unlock(gpusvm__)	\
> > > > +	up_read(&(gpusvm__)->notifier_lock)
> > > > +
> > > > +/**
> > > > + * __drm_gpusvm_range_next - Get the next GPU SVM range in the list
> > > > + * @range: a pointer to the current GPU SVM range
> > > > + *
> > > > + * Return: A pointer to the next drm_gpusvm_range if available, or NULL if the
> > > > + *         current range is the last one or if the input range is NULL.
> > > > + */
> > > > +static inline struct drm_gpusvm_range *
> > > > +__drm_gpusvm_range_next(struct drm_gpusvm_range *range)
> > > > +{
> > > > +	if (range && !list_is_last(&range->rb.entry,
> > > > +				   &range->notifier->range_list))
> > > > +		return list_next_entry(range, rb.entry);
> > > > +
> > > > +	return NULL;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_for_each_range - Iterate over GPU SVM ranges in a notifier
> > > > + * @range__: Iterator variable for the ranges. If set, it indicates the start of
> > > > + *	     the iterator. If NULL, call drm_gpusvm_range_find() to get the range.
> > > > + * @notifier__: Pointer to the GPU SVM notifier
> > > > + * @start__: Start address of the range
> > > > + * @end__: End address of the range
> > > > + *
> > > > + * This macro is used to iterate over GPU SVM ranges in a notifier. It is safe
> > > > + * to use while holding the driver SVM lock or the notifier lock.
> > > > + */
> > > > +#define drm_gpusvm_for_each_range(range__, notifier__, start__, end__)	\
> > > > +	for ((range__) = (range__) ?:					\
> > > > +	     drm_gpusvm_range_find((notifier__), (start__), (end__));	\
> > > > +	     (range__) && (range__->va.start < (end__));		\
> > > > +	     (range__) = __drm_gpusvm_range_next(range__))
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_set_unmapped - Mark a GPU SVM range as unmapped
> > > > + * @range: Pointer to the GPU SVM range structure.
> > > > + * @mmu_range: Pointer to the MMU notifier range structure.
> > > > + *
> > > > + * This function marks a GPU SVM range as unmapped and sets the partial_unmap flag
> > > > + * if the range partially falls within the provided MMU notifier range.
> > > > + */
> > > > +static inline void
> > > > +drm_gpusvm_range_set_unmapped(struct drm_gpusvm_range *range,
> > > > +			      const struct mmu_notifier_range *mmu_range)
> > > > +{
> > > > +	lockdep_assert_held_write(&range->gpusvm->notifier_lock);
> > > > +
> > > > +	range->flags.unmapped = true;
> > > > +	if (range->va.start < mmu_range->start ||
> > > > +	    range->va.end > mmu_range->end)
> > > > +		range->flags.partial_unmap = true;
> > > > +}
> > > > +
> > > > +#endif /* __DRM_GPUSVM_H__ */
> > > > -- 
> > > > 2.34.1
> > > > 
> > 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 23/28] drm/xe: Add SVM VRAM migration
  2024-08-28 16:06   ` Daniel Vetter
@ 2024-08-28 18:22     ` Daniel Vetter
  2024-08-29  9:24     ` Christian König
  1 sibling, 0 replies; 100+ messages in thread
From: Daniel Vetter @ 2024-08-28 18:22 UTC (permalink / raw)
  To: Matthew Brost
  Cc: intel-xe, dri-devel, airlied, christian.koenig, thomas.hellstrom,
	matthew.auld, daniel

On Wed, Aug 28, 2024 at 06:06:47PM +0200, Daniel Vetter wrote:
> On Tue, Aug 27, 2024 at 07:48:56PM -0700, Matthew Brost wrote:
> > Migration is implemented with range granularity, with VRAM backing being
> > a VM private TTM BO (i.e., shares dma-resv with VM). The lifetime of the
> > TTM BO is limited to when the SVM range is in VRAM (i.e., when a VRAM
> > SVM range is migrated to SRAM, the TTM BO is destroyed).
> > 
> > The design choice for using TTM BO for VRAM backing store, as opposed to
> > direct buddy allocation, is as follows:
> > 
> > - DRM buddy allocations are not at page granularity, offering no
> >   advantage over a BO.
> 
> This one I'm not understanding.
> 
> > - DRM buddy allocations do not solve locking inversion problems between
> >   mmap lock and dma-resv locks.
> 
> Which mmap -> dma_resv inversion? I've seen a lot ... I guess it also
> matters hugely which migration path we're in, i.e. opportunistic
> migration, cpu fault where we have to migrate or die, or when we run out
> of vram and need to evict stuff to make space.
> 
> > - Unified eviction is required (SVM VRAM and TTM BOs need to be able to
> >   evict each other).
> 
> So core mm handles this by just roughly equally shrinking everything.
> Seems to work, and it has a pile of object shrinkers, and the page lru is
> also split into page cache and anon memory.
> 
> I think you need to put in more justification that unified eviction is
> required than just stating it, because a look at mm/ gives a very well
> established counterexample.
> 
> > - For exhaustive eviction [1], SVM VRAM allocations will almost certainly
> >   require a dma-resv.
> 
> So from the TTM side we need exhaustive eviction, or at least something a
> bit more exhaustive than what ttm currently has. Note that i915-gem also
> never really got to perfect exhaustive eviction, it's just a pile better
> than ttm right now.
> 
> Now if there's also SVM VRAM managed on a page lru, TTM exhaustive
> eviction is going to win because the shrinkers can only trylock dma_resv.
> So this part works. It actually works so well on the system memory side
> that if we're not careful we can trigger oom, because we're too good at
> getting at all the memory.
> 
> SVM VRAM allocations otoh do not need exhaustive evictions. Or at least I
> don't see why, because the idea is that thanks to gpu and cpu page faults,
> you can always get out of a pinch by just thrashing everything for a while
> and migrating the handful of available pages a lot.
> 
> > - Likely allocation size is 2M which makes the size of the BO (872 bytes)
> >   acceptable per allocation (872 / 2M == .0004158).
> > 
> > With this, using TTM BO for VRAM backing store seems to be an obvious
> > choice as it allows leveraging of the TTM eviction code.
> 
> Except it requires that you hold dma_resv, which brings in all kinds of
> pain. And for eviction we really don't need a lot of synchronization, so a
> lot of that locking is not needed, unlike the case where we have a cpu
> fault, where we absolutely need mmap_lock and all that to make sure we
> fault in the right page.
> 
> But for eviction we only need to throw out some pages, if we're not
> entirely precise with picking the right ones (or have no idea which
> vma they're all currently mapped into) it doesn't matter. That's why
> migrate_device_pages doesn't care about any of that at all, it doesn't
> need to by design. But with bo-backed memory you drag in all that stuff
> that's causing headaches for eviction.
> 
> The only thing migration tries to do is remove all ptes, and if that
> succeeds, move the page. Specialized for the gpusvm case, looking at mm/
> code as a cheat sheet, we need roughly:
> 
> - reverse mapping structure like anon_vma. Except gpusvm can assume that
>   there's currently only one gpu side mapping, so we can just stuff the
>   gpusvm and va_address into the page, and protect it with the page lock.
> 
> - we need pagetable locks, so that we can manipulate pagetables (well
>   specifically make ptes invalid) without taking any other locks.
> 
> - everyone else inserting or removing ptes for svm mappings also needs to
>   lock the page, or we have races. This might be the hmm_range_fault races
>   you're seeing when allowing vram pages, since I don't think there's
>   anything else stopping the page lookup otherwise from succeeding.
> 
> - we might also need to stuff migrate ptes into the gpu side, like the cpu
>   does, to hold up refaults before the migration has finished. But I think
>   those are only needed for anon memory in sram because there's no other
>   way to find the right page than swap pte entries, of which migration
>   entries are a special case.
> 
> - core code also expects us to handle the page refcount correctly for svm
>   device memory, so we can't free the pages like normal bo pages either
>   directly to drm_buddy.
> 
> Now typing this all up will look an awful lot like what you have, with the
> dma_resv lock serving as the page lock and the pagetable lock. The only
> reason is that these locks are much smaller and nest within all the other
> stuff going on and so avoid the inversion issues.
> 
> So one annoying part is that this is a lot of pointless-looking typing.
> The other is that it's full of races, because core mm really is yolo all
> the way down. So lots of ways you lock the wrong page and fun stuff like
> that, but the few cases that matter work:
> 
> - svm fault handling with hmm_range_fault retries with mmu notifiers. Note
>   that we need to have vram pages locked and the notifier retry needs to
>   be under the pagetable lock, or there's room to escape. At least that's
>   what I came up with last time I thought it all through.
> 
> - migrate_to_ram: it will hold a page reference which we know was the
>   valid vram page when the cpu pte was locked, but it might not be it
>   anymore. So we have to lock the page and check whether it's still gpu
>   mapped, and if not retry the entire fault since most likely another
>   migrate_to_ram has succeeded meanwhile in parallel.
> 
> - for eviction we don't care, we might actually be migrating a page no one
>   even wants anymore.
> 
> Now I think you can get all this done with the dma_resv lock and maybe the
> bo refcount. But it does involve a tremendous amount of headaches and
>   impedance mismatch, because that's not how page faults and migrations
> work in core mm.

Bit of a different take, and this might be completely wrong because I've
misread your gpusvm code or the amdkfd code. But I figured it might be
good for understanding if I explain where the current page lock and
pagetable locks are.

For the pagetable lock:

In amdkfd that's svm_range_lock/unlock. This is also held while
re-checking mmu_notifiers. In gpusvm the equivalent is the
gpusvm->notifier_lock, which is global instead of per notifier range, but
the same idea.

For the page lock:

In amdkfd this is essentially svm_range->migrate_mutex, it's the thing that
ensures we're consistent with any other concurrent migrations in the same
range. Note that this protects a virtual address range, but because that
has a 1:1 mapping to both the cpu mm va ranges and to any vram allocations
it de facto serves as the page lock since there can never be more than one
svm_range for a svm vram allocation. That avoids the revalidation dance
core mm/ needs to do once it locks a page, since the va->page mappings
might have changed meanwhile. That's why, pretty much everywhere, it just
trylocks while holding the pgtable locks and bails out if that fails.

In your gpusvm design I guess it should be the bo's dma_resv lock that
backs a gpusvm_range. But it's not consistently enough used, or not with
big enough locking scope to protect against concurrent migration races.
The trouble with using the dma_resv lock like this is that it's not very
compatible with classic dma_resv usage (at least I think, might be wrong).
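
To make this a bit more concrete, the kind of pattern I'd expect in the
fault path if the dma_resv lock really is the page lock (pure pseudo-code,
every sketch_* helper name below is made up, not your actual API):

	/*
	 * Pseudo-code sketch only. The range's dma_resv plays the role of
	 * the core mm page lock in migrate_to_ram.
	 */
	static vm_fault_t sketch_migrate_to_ram(struct vm_fault *vmf)
	{
		struct drm_gpusvm_zdd *zdd = vmf->page->zone_device_data;
		struct drm_gpusvm_range *range = zdd->range;

		/* The page reference only keeps the allocation alive ... */
		dma_resv_lock(sketch_range_resv(range), NULL);

		/*
		 * ... so recheck under the lock that this vram allocation
		 * still backs the range. If not, a parallel migration won
		 * the race; return and let the cpu refault, the pte no
		 * longer points at us.
		 */
		if (!sketch_range_backed_by(range, zdd->vram_allocation)) {
			dma_resv_unlock(sketch_range_resv(range));
			return 0;
		}

		/* actual migration to sram goes here, still under the lock */

		dma_resv_unlock(sketch_range_resv(range));
		return 0;
	}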

For core kernel:

Pagetable lock is in each pagetable at each level, so it parallelizes
ridiculously well. We could adopt that scheme by also storing the mmu
notifier seqno into each pgtable page. That would give us mmu notifier
ranges that pretty much perfectly scale as we add/remove pagetables and
map/unmap stuff, instead of having to maintain a separate rbtree of ad-hoc
notifiers like gpusvm and amdkfd do. And because it's hierarchical and we
could store the mmu notifier seqno at each level, you can even optimize for
both small and huge ranges and it all checks out, since if the entire huge
range hasn't been invalidated, we can skip checking the ranges below.
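
Hand-wavy sketch of the data structure I mean (invented names, nothing that
exists today):

	/*
	 * Sketch only: each pagetable page remembers the notifier seqno it
	 * was last validated against, and invalidation bumps the seqno on
	 * every level it walks through.
	 */
	struct sketch_pt_page {
		unsigned long notifier_seq;
		/* pdes/ptes of this level ... */
	};

	static bool sketch_subtree_valid(struct sketch_pt_page *pt,
					 unsigned long read_seq)
	{
		/*
		 * Same idea as mmu_interval_read_retry(): if nothing under
		 * this pagetable page was invalidated since read_seq was
		 * sampled, the entire subtree below can be skipped.
		 */
		return pt->notifier_seq == read_seq;
	}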

page lock is the per-page (or well nowadays often the per-folio) lock. It
makes sure no one can rip the actual data stored in your page away from
underneath you (with migration or something else funny), since the
reference count only makes sure the allocation stays. The refcount itself
does not guarantee that the page you've looked up is actually still the
one that's mapped, it's like the bo refcount in that regard in gpusvm and
amdkfd for svm allocations in vram. Also with folios the locks are now per
chunk size, like gpusvm does.

Cheers, Sima
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory
  2024-08-28  2:48 ` [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory Matthew Brost
  2024-08-28 14:31   ` Daniel Vetter
@ 2024-08-28 18:50   ` Daniel Vetter
  2024-08-29 16:49     ` Matthew Brost
  2024-08-29  9:16   ` Thomas Hellström
                     ` (5 subsequent siblings)
  7 siblings, 1 reply; 100+ messages in thread
From: Daniel Vetter @ 2024-08-28 18:50 UTC (permalink / raw)
  To: Matthew Brost
  Cc: intel-xe, dri-devel, airlied, christian.koenig, thomas.hellstrom,
	matthew.auld, daniel

On Tue, Aug 27, 2024 at 07:48:38PM -0700, Matthew Brost wrote:
> +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> +			       struct drm_gpusvm_range *range,
> +			       const struct drm_gpusvm_ctx *ctx)
> +{
> +	u64 start = range->va.start, end = range->va.end;
> +	struct mm_struct *mm = gpusvm->mm;
> +	struct vm_area_struct *vas;
> +	int err;
> +	bool retry = false;
> +
> +	if (!ctx->mmap_locked) {
> +		if (!mmget_not_zero(mm)) {
> +			err = -EFAULT;
> +			goto err_out;
> +		}
> +		if (ctx->trylock_mmap) {
> +			if (!mmap_read_trylock(mm))  {
> +				err = drm_gpusvm_evict_to_sram(gpusvm, range);
> +				goto err_mmput;
> +			}
> +		} else {
> +			mmap_read_lock(mm);
> +		}
> +	}
> +
> +	mmap_assert_locked(mm);
> +
> +	/*
> +	 * Loop required to find all VMA area structs for the corner case when
> +	 * VRAM backing has been partially unmapped from MM's address space.
> +	 */
> +again:
> +	vas = find_vma(mm, start);

So a hilarious case that amdkfd handles a bit better, but still not entirely,
is that the original vma might be entirely gone, even when you can still get
at the mm of that process. This happens with cow (or shared too I think)
mappings in forked child processes, or also if you play fun mremap games.

I think that outside of the ->migrate_to_ram callback migration/eviction
to sram cannot assume there's any reasonable vma around and has to
unconditionally go with the drm_gpusvm_evict_to_sram path.

Also in the migrate_to_ram case the vma is essentially nothing more than
informational about which ranges we might need if we prefault a bit (in
case the child changed the vma compared to the original one). So it's fine
as a parameter for migrate_vma_setup, but for absolutely nothing else.

amdkfd almost gets this right by being entirely based on their svm_range
structures, except they still have the lingering check that the original mm
is still alive. Of course you cannot ever use that memory on the gpu
anymore, but the child process could get very pissed if their memory is
suddenly gone. Also the eviction code has the same issue as yours and
limits itself to vma that still exist in the original mm, leaving anything
that's orphaned in children or remaps stuck in vram. At least that's my
understanding, I might very well be wrong.

So we probably want a bunch of these testcases too to make sure that all
works, and we're not stuck with memory allocations in vram that we can't
move out.
-Sima
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 04/28] mm/migrate: Add migrate_device_vma_range
  2024-08-28  2:48 ` [RFC PATCH 04/28] mm/migrate: Add migrate_device_vma_range Matthew Brost
@ 2024-08-29  9:03   ` Daniel Vetter
  2024-08-29 15:58     ` Matthew Brost
  0 siblings, 1 reply; 100+ messages in thread
From: Daniel Vetter @ 2024-08-29  9:03 UTC (permalink / raw)
  To: Matthew Brost
  Cc: intel-xe, dri-devel, airlied, christian.koenig, thomas.hellstrom,
	matthew.auld, daniel

On Tue, Aug 27, 2024 at 07:48:37PM -0700, Matthew Brost wrote:
> Add migrate_device_vma_range which prepares an array of pre-populated
> device pages for migration and issues a MMU invalidation.
> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  include/linux/migrate.h |  3 +++
>  mm/migrate_device.c     | 53 +++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 56 insertions(+)
> 
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 644be30b69c8..e8cce05bf9c2 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -226,6 +226,9 @@ void migrate_vma_pages(struct migrate_vma *migrate);
>  void migrate_vma_finalize(struct migrate_vma *migrate);
>  int migrate_device_range(unsigned long *src_pfns, unsigned long start,
>  			unsigned long npages);
> +int migrate_device_vma_range(struct mm_struct *mm, void *pgmap_owner,
> +			     unsigned long *src_pfns, unsigned long npages,
> +			     unsigned long start);
>  void migrate_device_pages(unsigned long *src_pfns, unsigned long *dst_pfns,
>  			unsigned long npages);
>  void migrate_device_finalize(unsigned long *src_pfns,
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 6d66dc1c6ffa..e25f12a132e8 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -920,6 +920,59 @@ int migrate_device_range(unsigned long *src_pfns, unsigned long start,
>  }
>  EXPORT_SYMBOL(migrate_device_range);
>  
> +/**
> + * migrate_device_vma_range() - migrate device private pfns to normal memory and
> + * trigger MMU invalidation.
> + * @mm: struct mm of device pages.
> + * @src_pfns: pre-populated array of source device private pfns to migrate.
> + * @pgmap_owner: page group map owner of device pages.
> + * @npages: number of pages to migrate.
> + * @start: VMA start of device pages.
> + *
> + * Similar to migrate_device_range() but supports a non-contiguous pre-populated
> + * array of device pages to migrate. Also triggers MMU invalidation. Useful in
> + * device memory eviction paths where a lock is held protecting the device pages
> + * but where the mmap lock cannot be taken due to a locking inversion (e.g.
> + * DRM drivers). Since the mmap lock is not required to be held, the MMU
> + * invalidation can race with the VMA start being repurposed; worst case this
> + * would result in an unnecessary invalidation.
> + */
> +int migrate_device_vma_range(struct mm_struct *mm, void *pgmap_owner,
> +			     unsigned long *src_pfns, unsigned long npages,
> +			     unsigned long start)
> +{
> +	struct mmu_notifier_range range;
> +	unsigned long i;
> +
> +	mmu_notifier_range_init_owner(&range, MMU_NOTIFY_MIGRATE, 0,
> +				      mm, start, start + npages * PAGE_SIZE,
> +				      pgmap_owner);
> +	mmu_notifier_invalidate_range_start(&range);

This isn't needed, try_to_migrate called from migrate_device_unmap already
has a notifier, if there are actually any ptes to clear. If you need this
one you've missed a pte clear notification somewhere, or there's some
other bad bug somewhere.
-Sima

> +
> +	for (i = 0; i < npages; i++) {
> +		struct page *page = pfn_to_page(src_pfns[i]);
> +
> +		if (!get_page_unless_zero(page)) {
> +			src_pfns[i] = 0;
> +			continue;
> +		}
> +
> +		if (!trylock_page(page)) {
> +			src_pfns[i] = 0;
> +			put_page(page);
> +			continue;
> +		}
> +
> +		src_pfns[i] = migrate_pfn(src_pfns[i]) | MIGRATE_PFN_MIGRATE;
> +	}
> +
> +	migrate_device_unmap(src_pfns, npages, NULL);
> +	mmu_notifier_invalidate_range_end(&range);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(migrate_device_vma_range);
> +
>  /*
>   * Migrate a device coherent page back to normal memory. The caller should have
>   * a reference on page which will be copied to the new page if migration is
> -- 
> 2.34.1
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory
  2024-08-28  2:48 ` [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory Matthew Brost
  2024-08-28 14:31   ` Daniel Vetter
  2024-08-28 18:50   ` Daniel Vetter
@ 2024-08-29  9:16   ` Thomas Hellström
  2024-08-29 17:45     ` Matthew Brost
  2024-08-30  1:35     ` Matthew Brost
  2024-08-29  9:45   ` Daniel Vetter
                     ` (4 subsequent siblings)
  7 siblings, 2 replies; 100+ messages in thread
From: Thomas Hellström @ 2024-08-29  9:16 UTC (permalink / raw)
  To: Matthew Brost, intel-xe, dri-devel
  Cc: airlied, christian.koenig, matthew.auld, daniel

Hi, Matt. 

Some initial design comments / questions:

On Tue, 2024-08-27 at 19:48 -0700, Matthew Brost wrote:
> This patch introduces support for GPU Shared Virtual Memory (SVM) in
> the
> Direct Rendering Manager (DRM) subsystem. SVM allows for seamless
> sharing of memory between the CPU and GPU, enhancing performance and
> flexibility in GPU computing tasks.
> 
> The patch adds the necessary infrastructure for SVM, including data
> structures and functions for managing SVM ranges and notifiers. It
> also
> provides mechanisms for allocating, deallocating, and migrating
> memory
> regions between system RAM and GPU VRAM.
> 
> This mid-layer is largely inspired by GPUVM.
> 
> Cc: Dave Airlie <airlied@redhat.com>
> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: <dri-devel@lists.freedesktop.org>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/xe/Makefile     |    3 +-
>  drivers/gpu/drm/xe/drm_gpusvm.c | 2174
> +++++++++++++++++++++++++++++++
>  drivers/gpu/drm/xe/drm_gpusvm.h |  415 ++++++
>  3 files changed, 2591 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.c
>  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.h
> 
> diff --git a/drivers/gpu/drm/xe/Makefile
> b/drivers/gpu/drm/xe/Makefile
> index b9670ae09a9e..b8fc2ee58f1a 100644
> --- a/drivers/gpu/drm/xe/Makefile
> +++ b/drivers/gpu/drm/xe/Makefile
> @@ -25,7 +25,8 @@ $(obj)/generated/%_wa_oob.c
> $(obj)/generated/%_wa_oob.h: $(obj)/xe_gen_wa_oob \
>  
>  # core driver code
>  
> -xe-y += xe_bb.o \
> +xe-y += drm_gpusvm.o \
> +	xe_bb.o \
>  	xe_bo.o \
>  	xe_bo_evict.o \
>  	xe_devcoredump.o \
> diff --git a/drivers/gpu/drm/xe/drm_gpusvm.c
> b/drivers/gpu/drm/xe/drm_gpusvm.c
> new file mode 100644
> index 000000000000..fc1e44e6ae72
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/drm_gpusvm.c
> @@ -0,0 +1,2174 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2024 Intel Corporation
> + *
> + * Authors:
> + *     Matthew Brost <matthew.brost@intel.com>
> + */
> +
> +#include <linux/dma-mapping.h>
> +#include <linux/interval_tree_generic.h>
> +#include <linux/hmm.h>
> +#include <linux/memremap.h>
> +#include <linux/migrate.h>
> +#include <linux/mm_types.h>
> +#include <linux/pagemap.h>
> +#include <linux/slab.h>
> +
> +#include <drm/drm_device.h>
> +#include "drm_gpusvm.h"
> +
> +/**
> + * DOC: Overview
> + *
> + * GPU Shared Virtual Memory (GPU SVM) layer for the Direct
> Rendering Manager (DRM)
> + *
> + * The GPU SVM layer is a component of the DRM framework designed to
> manage shared
> + * virtual memory between the CPU and GPU. It enables efficient data
> exchange and
> + * processing for GPU-accelerated applications by allowing memory
> sharing and
> + * synchronization between the CPU's and GPU's virtual address
> spaces.
> + *
> + * Key GPU SVM Components:
> + * - Notifiers: Used for tracking memory intervals and
> notifying the
> + *		GPU of changes, notifiers are sized based on a GPU
> SVM
> + *		initialization parameter, with a recommendation of
> 512M or
> + *		larger. They maintain a Red-BlacK tree and a list of
> ranges that
> + *		fall within the notifier interval. Notifiers are
> tracked within
> + *		a GPU SVM Red-BlacK tree and list and are
> dynamically inserted
> + *		or removed as ranges within the interval are created
> or
> + *		destroyed.

What is the benefit of this extra layer compared to direct insertion of
ranges using mmu_interval_notifier_insert?

IIRC the argument made previously about having wide notifiers was that
the rb tree lookups inside the core were costly and if there were only
a few, then the rb tree lookups within a notifier range could be
replaced with the page-table radix-tree-like lookup, so each lookup
complexity would be O(log(n_notifiers) + page_table_depth).

But now we have first an rb-tree lookup in the core and then an rb-tree
lookup within each notifier, yielding O(log(n_ranges)).
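
For example (made-up numbers, just to put magnitudes on it): with 2M ranges
and 512M notifiers over a 1T working set that is ~512k ranges in ~2k
notifiers, so roughly log2(2048) + log2(256) = 11 + 8 rb-tree steps per
lookup, versus log2(2048) rb-tree steps plus a 4-5 level page-table walk
for the radix-tree-like scheme.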

I can see a small benefit in that inserting directly into the core rb-
tree will block pending ongoing invalidations, but at a cost of an
extra multiplexing layer.

> + * - Ranges: Represent memory ranges mapped in a DRM device and
> managed
> + *	     by GPU SVM. They are sized based on an array of chunk
> sizes, which
> + *	     is a GPU SVM initialization parameter, and the CPU
> address space.
> + *	     Upon GPU fault, the largest aligned chunk that fits
> within the
> + *	     faulting CPU address space is chosen for the range
> size. Ranges are
> + *	     expected to be dynamically allocated on GPU fault and
> removed on an
> + *	     MMU notifier UNMAP event. As mentioned above, ranges
> are tracked in
> + *	     a notifier's Red-Black tree.

How do ranges and chunks map to
 
a) Prefaulting granularity
b) Migration granularity?

> + * - Operations: Define the interface for driver-specific SVM
> operations such as
> + *		 allocation, page collection, migration,
> invalidations, and VRAM
> + *		 release.
> + *
> + * This layer provides interfaces for allocating, mapping,
> migrating, and
> + * releasing memory ranges between the CPU and GPU. It handles all
> core memory
> + * management interactions (DMA mapping, HMM, and migration) and
> provides
> + * driver-specific virtual functions (vfuncs). This infrastructure
> is sufficient
> + * to build the expected driver components for an SVM implementation
> as detailed
> + * below.
> + *
> + * Expected Driver Components:
> + * - GPU page fault handler: Used to create ranges and notifiers
> based on the
> + *			     fault address, optionally migrate the
> range to
> + *			     VRAM, and create GPU bindings.
> + * - Garbage collector: Used to destroy GPU bindings for ranges.
> Ranges are
> + *			expected to be added to the garbage
> collector upon
> + *			MMU_NOTIFY_UNMAP event.
> + */
> +
> +/**
> + * DOC: Locking
> + *
> + * GPU SVM handles locking for core MM interactions, i.e., it
> locks/unlocks the
> + * mmap lock as needed. Alternatively, if the driver prefers to
> handle the mmap
> + * lock itself, a 'locked' argument is provided to the functions
> that require
> + * the mmap lock. This option may be useful for drivers that need to
> call into
> + * GPU SVM while also holding a dma-resv lock, thus preventing
> locking
> + * inversions between the mmap and dma-resv locks.
> + *
> + * GPU SVM introduces a global notifier lock, which safeguards the
> notifier's
> + * range RB tree and list, as well as the range's DMA mappings and
> sequence
> + * number. GPU SVM manages all necessary locking and unlocking
> operations,
> + * except for the recheck of the range's sequence number
> + * (mmu_interval_read_retry) when the driver is committing GPU
> bindings. This
> + * lock corresponds to the 'driver->update' lock mentioned in the
> HMM
> + * documentation (TODO: Link). Future revisions may transition from
> a GPU SVM
> + * global lock to a per-notifier lock if finer-grained locking is
> deemed
> + * necessary.
> + *
> + * In addition to the locking mentioned above, the driver should
> implement a
> + * lock to safeguard core GPU SVM function calls that modify state,
> such as
> + * drm_gpusvm_range_find_or_insert and drm_gpusvm_range_remove.
> Alternatively,
> + * these core functions can be called within a single kernel thread,
> for
> + * instance, using an ordered work queue. This lock is denoted as
> + * 'driver_svm_lock' in code examples.
> + */
> +
> +/**
> + * DOC: Migration
> + *
> + * The migration support is quite simple, allowing migration between
> SRAM and
> + * VRAM at the range granularity. For example, GPU SVM currently
> does not
> + * support mixing SRAM and VRAM pages within a range. This means
> that upon GPU
> + * fault, the entire range can be migrated to VRAM, and upon CPU
> fault, the
> + * entire range is migrated to SRAM.
> + *
> + * The reasoning for only supporting range granularity is as
> follows: it
> + * simplifies the implementation, and range sizes are driver-defined
> and should
> + * be relatively small.
> + */
> +
> +/**
> + * DOC: Partial Unmapping of Ranges
> + *
> + * Partial unmapping of ranges (e.g., 1M out of 2M is unmapped by
> CPU resulting
> + * in MMU_NOTIFY_UNMAP event) presents several challenges, with the
> main one
> + * being that a subset of the range still has CPU and GPU mappings.
> If the
> + * backing store for the range is in VRAM, a subset of the backing
> store has
> + * references. One option would be to split the range and VRAM
> backing store,
> + * but the implementation for this would be quite complicated. Given
> that
> + * partial unmappings are rare and driver-defined range sizes are
> relatively
> + * small, GPU SVM does not support splitting of ranges.
> + *
> + * With no support for range splitting, upon partial unmapping of a
> range, the
> + * driver is expected to invalidate and destroy the entire range. If
> the range
> + * has VRAM as its backing, the driver is also expected to migrate
> any remaining
> + * pages back to SRAM.

So what happens if we get a one-page invalidation, say protection
change event, or NUMA accounting event, in the middle of a range? Can
we unmap just that single gpu pte covering that range, that is, how do
the ranges map to invalidation granularity? Does this differ between
igfx and dgfx?

Thanks,
Thomas




> + */
> +
> +/**
> + * DOC: Examples
> + *
> + * This section provides two examples of how to build the expected
> driver
> + * components: the GPU page fault handler and the garbage collector.
> A third
> + * example demonstrates a sample invalidation driver vfunc.
> + *
> + * The generic code provided does not include logic for complex
> migration
> + * policies, optimized invalidations, or other potentially required
> driver
> + * locking (e.g., DMA-resv locks).
> + *
> + * 1) GPU page fault handler
> + *
> + *	int driver_bind_range(struct drm_gpusvm *gpusvm, struct
> drm_gpusvm_range *range)
> + *	{
> + *		int err = 0;
> + *
> + *		driver_alloc_and_setup_memory_for_bind(gpusvm,
> range);
> + *
> + *		drm_gpusvm_notifier_lock(gpusvm);
> + *		if (drm_gpusvm_range_pages_valid(range))
> + *			driver_commit_bind(gpusvm, range);
> + *		else
> + *			err = -EAGAIN;
> + *		drm_gpusvm_notifier_unlock(gpusvm);
> + *
> + *		return err;
> + *	}
> + *
> + *	int driver_gpu_fault(struct drm_gpusvm *gpusvm, u64
> fault_addr,
> + *			     u64 gpuva_start, u64 gpuva_end)
> + *	{
> + *		struct drm_gpusvm_ctx ctx = {};
> + *		int err;
> + *
> + *		driver_svm_lock();
> + *	retry:
> + *		// Always process UNMAPs first so view of GPU SVM
> ranges is current
> + *		driver_garbage_collector(gpusvm);
> + *
> + *		range = drm_gpusvm_range_find_or_insert(gpusvm,
> fault_addr,
> + *							gpuva_start,
> gpuva_end,
> + *						        &ctx);
> + *		if (IS_ERR(range)) {
> + *			err = PTR_ERR(range);
> + *			goto unlock;
> + *		}
> + *
> + *		if (driver_migration_policy(range)) {
> + *			bo = driver_alloc_bo();
> + *			err = drm_gpusvm_migrate_to_vram(gpusvm,
> range, bo, &ctx);
> + *			if (err)	// CPU mappings may have
> changed
> + *				goto retry;
> + *		}
> + *
> + *		err = drm_gpusvm_range_get_pages(gpusvm, range,
> &ctx);
> + *		if (err == -EFAULT || err == -EPERM)	// CPU
> mappings changed
> + *			goto retry;
> + *		else if (err)
> + *			goto unlock;
> + *
> + *		err = driver_bind_range(gpusvm, range);
> + *		if (err == -EAGAIN)	// CPU mappings changed
> + *			goto retry
> + *
> + *	unlock:
> + *		driver_svm_unlock();
> + *		return err;
> + *	}
> + *
> + * 2) Garbage Collector.
> + *
> + *	void __driver_garbage_collector(struct drm_gpusvm *gpusvm,
> + *					struct drm_gpusvm_range
> *range)
> + *	{
> + *		struct drm_gpusvm_ctx ctx = {};
> + *
> + *		assert_driver_svm_locked(gpusvm);
> + *
> + *		// Partial unmap, migrate any remaining VRAM pages
> back to SRAM
> + *		if (range->flags.partial_unmap)
> + *			drm_gpusvm_migrate_to_sram(gpusvm, range,
> &ctx);
> + *
> + *		driver_unbind_range(range);
> + *		drm_gpusvm_range_remove(gpusvm, range);
> + *	}
> + *
> + *	void driver_garbage_collector(struct drm_gpusvm *gpusvm)
> + *	{
> + *		assert_driver_svm_locked(gpusvm);
> + *
> + *		for_each_range_in_garbage_collector(gpusvm, range)
> + *			__driver_garbage_collector(gpusvm, range);
> + *	}
> + *
> + * 3) Invalidation driver vfunc.
> + *
> + *	void driver_invalidation(struct drm_gpusvm *gpusvm,
> + *				 struct drm_gpusvm_notifier
> *notifier,
> + *				 const struct mmu_notifier_range
> *mmu_range)
> + *	{
> + *		struct drm_gpusvm_ctx ctx = { .in_notifier = true,
> };
> + *		struct drm_gpusvm_range *range = NULL;
> + *
> + *		driver_invalidate_device_tlb(gpusvm, mmu_range-
> >start, mmu_range->end);
> + *
> + *		drm_gpusvm_for_each_range(range, notifier,
> mmu_range->start,
> + *					  mmu_range->end) {
> + *			drm_gpusvm_range_unmap_pages(gpusvm, range,
> &ctx);
> + *
> + *			if (mmu_range->event != MMU_NOTIFY_UNMAP)
> + *				continue;
> + *
> + *			drm_gpusvm_range_set_unmapped(range,
> mmu_range);
> + *			driver_garbage_collector_add(gpusvm, range);
> + *		}
> + *	}
> + */
> +
> +#define DRM_GPUSVM_RANGE_START(_range)	((_range)->va.start)
> +#define DRM_GPUSVM_RANGE_END(_range)	((_range)->va.end - 1)
> +INTERVAL_TREE_DEFINE(struct drm_gpusvm_range, rb.node, u64,
> rb.__subtree_last,
> +		     DRM_GPUSVM_RANGE_START, DRM_GPUSVM_RANGE_END,
> +		     static __maybe_unused, range);
> +
> +#define DRM_GPUSVM_NOTIFIER_START(_notifier)	((_notifier)-
> >interval.start)
> +#define DRM_GPUSVM_NOTIFIER_END(_notifier)	((_notifier)-
> >interval.end - 1)
> +INTERVAL_TREE_DEFINE(struct drm_gpusvm_notifier, rb.node, u64,
> +		     rb.__subtree_last, DRM_GPUSVM_NOTIFIER_START,
> +		     DRM_GPUSVM_NOTIFIER_END, static __maybe_unused,
> notifier);
> +
> +/**
> + * npages_in_range() - Calculate the number of pages in a given
> range
> + * @start__: The start address of the range
> + * @end__: The end address of the range
> + *
> + * This macro calculates the number of pages in a given memory
> range,
> + * specified by the start and end addresses. It divides the
> difference
> + * between the end and start addresses by the page size (PAGE_SIZE)
> to
> + * determine the number of pages in the range.
> + *
> + * Return: The number of pages in the specified range.
> + */
> +#define npages_in_range(start__, end__)	\
> +	(((end__) - (start__)) >> PAGE_SHIFT)
> +
> +/**
> + * struct drm_gpusvm_zdd - GPU SVM zone device data
> + *
> + * @refcount: Reference count for the zdd
> + * @destroy_work: Work structure for asynchronous zdd destruction
> + * @range: Pointer to the GPU SVM range
> + * @vram_allocation: Driver-private pointer to the VRAM allocation
> + *
> + * This structure serves as a generic wrapper installed in
> + * page->zone_device_data. It provides infrastructure for looking up
> a range
> + * upon CPU page fault and asynchronously releasing VRAM once the
> CPU has no
> + * page references. Asynchronous release is useful because CPU page
> references
> + * can be dropped in IRQ contexts, while releasing VRAM likely
> requires sleeping
> + * locks.
> + */
> +struct drm_gpusvm_zdd {
> +	struct kref refcount;
> +	struct work_struct destroy_work;
> +	struct drm_gpusvm_range *range;
> +	void *vram_allocation;
> +};
> +
> +/**
> + * drm_gpusvm_zdd_destroy_work_func - Work function for destroying a
> zdd
> + * @w: Pointer to the work_struct
> + *
> + * This function releases VRAM, puts GPU SVM range, and frees zdd.
> + */
> +static void drm_gpusvm_zdd_destroy_work_func(struct work_struct *w)
> +{
> +	struct drm_gpusvm_zdd *zdd =
> +		container_of(w, struct drm_gpusvm_zdd,
> destroy_work);
> +	struct drm_gpusvm_range *range = zdd->range;
> +	struct drm_gpusvm *gpusvm = range->gpusvm;
> +
> +	if (gpusvm->ops->vram_release && zdd->vram_allocation)
> +		gpusvm->ops->vram_release(zdd->vram_allocation);
> +	drm_gpusvm_range_put(range);
> +	kfree(zdd);
> +}
> +
> +/**
> + * drm_gpusvm_zdd_alloc - Allocate a zdd structure.
> + * @range: Pointer to the GPU SVM range.
> + *
> + * This function allocates and initializes a new zdd structure. It
> sets up the
> + * reference count, initializes the destroy work, and links the
> provided GPU SVM
> + * range.
> + *
> + * Returns:
> + * Pointer to the allocated zdd on success, ERR_PTR() on failure.
> + */
> +static struct drm_gpusvm_zdd *
> +drm_gpusvm_zdd_alloc(struct drm_gpusvm_range *range)
> +{
> +	struct drm_gpusvm_zdd *zdd;
> +
> +	zdd = kmalloc(sizeof(*zdd), GFP_KERNEL);
> +	if (!zdd)
> +		return NULL;
> +
> +	kref_init(&zdd->refcount);
> +	INIT_WORK(&zdd->destroy_work,
> drm_gpusvm_zdd_destroy_work_func);
> +	zdd->range = drm_gpusvm_range_get(range);
> +	zdd->vram_allocation = NULL;
> +
> +	return zdd;
> +}
> +
> +/**
> + * drm_gpusvm_zdd_get - Get a reference to a zdd structure.
> + * @zdd: Pointer to the zdd structure.
> + *
> + * This function increments the reference count of the provided zdd
> structure.
> + *
> + * Returns: Pointer to the zdd structure.
> + */
> +static struct drm_gpusvm_zdd *drm_gpusvm_zdd_get(struct
> drm_gpusvm_zdd *zdd)
> +{
> +	kref_get(&zdd->refcount);
> +	return zdd;
> +}
> +
> +/**
> + * drm_gpusvm_zdd_destroy - Destroy a zdd structure.
> + * @ref: Pointer to the reference count structure.
> + *
> + * This function queues the destroy_work of the zdd for asynchronous
> destruction.
> + */
> +static void drm_gpusvm_zdd_destroy(struct kref *ref)
> +{
> +	struct drm_gpusvm_zdd *zdd =
> +		container_of(ref, struct drm_gpusvm_zdd, refcount);
> +	struct drm_gpusvm *gpusvm = zdd->range->gpusvm;
> +
> +	queue_work(gpusvm->zdd_wq, &zdd->destroy_work);
> +}
> +
> +/**
> + * drm_gpusvm_zdd_put - Put a zdd reference.
> + * @zdd: Pointer to the zdd structure.
> + *
> + * This function decrements the reference count of the provided zdd
> structure
> + * and schedules its destruction if the count drops to zero.
> + */
> +static void drm_gpusvm_zdd_put(struct drm_gpusvm_zdd *zdd)
> +{
> +	kref_put(&zdd->refcount, drm_gpusvm_zdd_destroy);
> +}
> +
> +/**
> + * drm_gpusvm_range_find - Find GPU SVM range from GPU SVM notifier
> + * @notifier: Pointer to the GPU SVM notifier structure.
> + * @start: Start address of the range
> + * @end: End address of the range
> + *
> + * Return: A pointer to the drm_gpusvm_range if found or NULL
> + */
> +struct drm_gpusvm_range *
> +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64
> start, u64 end)
> +{
> +	return range_iter_first(&notifier->root, start, end - 1);
> +}
> +
> +/**
> + * drm_gpusvm_for_each_range_safe - Safely iterate over GPU SVM
> ranges in a notifier
> + * @range__: Iterator variable for the ranges
> + * @next__: Iterator variable for the ranges temporary storage
> + * @notifier__: Pointer to the GPU SVM notifier
> + * @start__: Start address of the range
> + * @end__: End address of the range
> + *
> + * This macro is used to iterate over GPU SVM ranges in a notifier
> while
> + * removing ranges from it.
> + */
> +#define drm_gpusvm_for_each_range_safe(range__, next__, notifier__,
> start__, end__)	\
> +	for ((range__) = drm_gpusvm_range_find((notifier__),
> (start__), (end__)),	\
> +	     (next__) =
> __drm_gpusvm_range_next(range__);				\
> +	     (range__) && (range__->va.start <
> (end__));				\
> +	     (range__) = (next__), (next__) =
> __drm_gpusvm_range_next(range__))
> +
> +/**
> + * __drm_gpusvm_notifier_next - get the next drm_gpusvm_notifier in
> the list
> + * @notifier: a pointer to the current drm_gpusvm_notifier
> + *
> + * Return: A pointer to the next drm_gpusvm_notifier if available,
> or NULL if
> + *         the current notifier is the last one or if the input
> notifier is
> + *         NULL.
> + */
> +static struct drm_gpusvm_notifier *
> +__drm_gpusvm_notifier_next(struct drm_gpusvm_notifier *notifier)
> +{
> +	if (notifier && !list_is_last(&notifier->rb.entry,
> +				      &notifier->gpusvm-
> >notifier_list))
> +		return list_next_entry(notifier, rb.entry);
> +
> +	return NULL;
> +}
> +
> +/**
> + * drm_gpusvm_for_each_notifier - Iterate over GPU SVM notifiers in
> a gpusvm
> + * @notifier__: Iterator variable for the notifiers
> + * @gpusvm__: Pointer to the GPU SVM structure
> + * @start__: Start address of the notifier
> + * @end__: End address of the notifier
> + *
> + * This macro is used to iterate over GPU SVM notifiers in a gpusvm.
> + */
> +#define drm_gpusvm_for_each_notifier(notifier__, gpusvm__, start__,
> end__)		\
> +	for ((notifier__) = notifier_iter_first(&(gpusvm__)->root,
> (start__), (end__) - 1);	\
> +	     (notifier__) && (notifier__->interval.start <
> (end__));			\
> +	     (notifier__) = __drm_gpusvm_notifier_next(notifier__))
> +
> +/**
> + * drm_gpusvm_for_each_notifier_safe - Safely iterate over GPU SVM
> notifiers in a gpusvm
> + * @notifier__: Iterator variable for the notifiers
> + * @next__: Iterator variable for the notifiers temporary storage
> + * @gpusvm__: Pointer to the GPU SVM structure
> + * @start__: Start address of the notifier
> + * @end__: End address of the notifier
> + *
> + * This macro is used to iterate over GPU SVM notifiers in a gpusvm
> while
> + * removing notifiers from it.
> + */
> +#define drm_gpusvm_for_each_notifier_safe(notifier__, next__,
> gpusvm__, start__, end__)	\
> +	for ((notifier__) = notifier_iter_first(&(gpusvm__)->root,
> (start__), (end__) - 1),	\
> +	     (next__) =
> __drm_gpusvm_notifier_next(notifier__);				\
> +	     (notifier__) && (notifier__->interval.start <
> (end__));			\
> +	     (notifier__) = (next__), (next__) =
> __drm_gpusvm_notifier_next(notifier__))
> +
> +/**
> + * drm_gpusvm_notifier_invalidate - Invalidate a GPU SVM notifier.
> + * @mni: Pointer to the mmu_interval_notifier structure.
> + * @mmu_range: Pointer to the mmu_notifier_range structure.
> + * @cur_seq: Current sequence number.
> + *
> + * This function serves as a generic MMU notifier for GPU SVM. It
> sets the MMU
> + * notifier sequence number and calls the driver invalidate vfunc
> under
> + * gpusvm->notifier_lock.
> + *
> + * Returns:
> + * true if the operation succeeds, false otherwise.
> + */
> +static bool
> +drm_gpusvm_notifier_invalidate(struct mmu_interval_notifier *mni,
> +			       const struct mmu_notifier_range
> *mmu_range,
> +			       unsigned long cur_seq)
> +{
> +	struct drm_gpusvm_notifier *notifier =
> +		container_of(mni, typeof(*notifier), notifier);
> +	struct drm_gpusvm *gpusvm = notifier->gpusvm;
> +
> +	if (!mmu_notifier_range_blockable(mmu_range))
> +		return false;
> +
> +	down_write(&gpusvm->notifier_lock);
> +	mmu_interval_set_seq(mni, cur_seq);
> +	gpusvm->ops->invalidate(gpusvm, notifier, mmu_range);
> +	up_write(&gpusvm->notifier_lock);
> +
> +	return true;
> +}
> +
> +/**
> + * drm_gpusvm_notifier_ops - MMU interval notifier operations for
> GPU SVM
> + */
> +static const struct mmu_interval_notifier_ops
> drm_gpusvm_notifier_ops = {
> +	.invalidate = drm_gpusvm_notifier_invalidate,
> +};
> +
> +/**
> + * drm_gpusvm_init - Initialize the GPU SVM.
> + * @gpusvm: Pointer to the GPU SVM structure.
> + * @name: Name of the GPU SVM.
> + * @drm: Pointer to the DRM device structure.
> + * @mm: Pointer to the mm_struct for the address space.
> + * @device_private_page_owner: Device private pages owner.
> + * @mm_start: Start address of GPU SVM.
> + * @mm_range: Range of the GPU SVM.
> + * @notifier_size: Size of individual notifiers.
> + * @ops: Pointer to the operations structure for GPU SVM.
> + * @chunk_sizes: Pointer to the array of chunk sizes used in range
> allocation.
> + *               Entries should be powers of 2 in descending order
> with last
> + *               entry being SZ_4K.
> + * @num_chunks: Number of chunks.
> + *
> + * This function initializes the GPU SVM.
> + *
> + * Returns:
> + * 0 on success, a negative error code on failure.
> + */
> +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> +		    const char *name, struct drm_device *drm,
> +		    struct mm_struct *mm, void
> *device_private_page_owner,
> +		    u64 mm_start, u64 mm_range, u64 notifier_size,
> +		    const struct drm_gpusvm_ops *ops,
> +		    const u64 *chunk_sizes, int num_chunks)
> +{
> +	if (!ops->invalidate || !num_chunks)
> +		return -EINVAL;
> +
> +	gpusvm->name = name;
> +	gpusvm->drm = drm;
> +	gpusvm->mm = mm;
> +	gpusvm->device_private_page_owner =
> device_private_page_owner;
> +	gpusvm->mm_start = mm_start;
> +	gpusvm->mm_range = mm_range;
> +	gpusvm->notifier_size = notifier_size;
> +	gpusvm->ops = ops;
> +	gpusvm->chunk_sizes = chunk_sizes;
> +	gpusvm->num_chunks = num_chunks;
> +	gpusvm->zdd_wq = system_wq;
> +
> +	mmgrab(mm);
> +	gpusvm->root = RB_ROOT_CACHED;
> +	INIT_LIST_HEAD(&gpusvm->notifier_list);
> +
> +	init_rwsem(&gpusvm->notifier_lock);
> +
> +	fs_reclaim_acquire(GFP_KERNEL);
> +	might_lock(&gpusvm->notifier_lock);
> +	fs_reclaim_release(GFP_KERNEL);
> +
> +	return 0;
> +}
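
For my own understanding, a driver would wire this up at VM creation time
roughly as below (my_vm, my_svm_ops and the concrete sizes are made-up
placeholders, not taken from this series):

	static const u64 chunk_sizes[] = { SZ_2M, SZ_64K, SZ_4K };
	int err;

	/* Descending powers of two, last entry SZ_4K, as documented above. */
	err = drm_gpusvm_init(&my_vm->svm, "my-svm", my_vm->drm, current->mm,
			      NULL /* device_private_page_owner */,
			      0, 1ull << 47, SZ_512M, &my_svm_ops,
			      chunk_sizes, ARRAY_SIZE(chunk_sizes));
	if (err)
		return err;
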
> +
> +/**
> + * drm_gpusvm_notifier_find - Find GPU SVM notifier
> + * @gpusvm__: Pointer to the GPU SVM structure
> + * @fault_addr__: Fault address
> + *
> + * This macro finds the GPU SVM notifier associated with the fault
> address.
> + *
> + * Returns:
> + * Pointer to the GPU SVM notifier on success, NULL otherwise.
> + */
> +#define drm_gpusvm_notifier_find(gpusvm__, fault_addr__)	\
> +	notifier_iter_first(&(gpusvm__)->root, (fault_addr__),	\
> +			    (fault_addr__ + 1))
> +
> +/**
> + * to_drm_gpusvm_notifier - retrieve the container struct for a
> given rbtree node
> + * @node__: a pointer to the rbtree node embedded within a
> drm_gpusvm_notifier struct
> + *
> + * Return: A pointer to the containing drm_gpusvm_notifier
> structure.
> + */
> +#define to_drm_gpusvm_notifier(__node)				\
> +	container_of((__node), struct drm_gpusvm_notifier, rb.node)
> +
> +/**
> + * drm_gpusvm_notifier_insert - Insert GPU SVM notifier
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: Pointer to the GPU SVM notifier structure
> + *
> + * This function inserts the GPU SVM notifier into the GPU SVM RB
> tree and list.
> + */
> +static void drm_gpusvm_notifier_insert(struct drm_gpusvm *gpusvm,
> +				       struct drm_gpusvm_notifier
> *notifier)
> +{
> +	struct rb_node *node;
> +	struct list_head *head;
> +
> +	notifier_insert(notifier, &gpusvm->root);
> +
> +	node = rb_prev(&notifier->rb.node);
> +	if (node)
> +		head = &(to_drm_gpusvm_notifier(node))->rb.entry;
> +	else
> +		head = &gpusvm->notifier_list;
> +
> +	list_add(&notifier->rb.entry, head);
> +}
> +
> +/**
> + * drm_gpusvm_notifier_remove - Remove GPU SVM notifier
> + * @gpusvm__: Pointer to the GPU SVM structure
> + * @notifier__: Pointer to the GPU SVM notifier structure
> + *
> + * This macro removes the GPU SVM notifier from the GPU SVM RB tree
> and list.
> + */
> +#define drm_gpusvm_notifier_remove(gpusvm__, notifier__)	\
> +	notifier_remove((notifier__), &(gpusvm__)->root);	\
> +	list_del(&(notifier__)->rb.entry)
> +
> +/**
> + * drm_gpusvm_fini - Finalize the GPU SVM.
> + * @gpusvm: Pointer to the GPU SVM structure.
> + *
> + * This function finalizes the GPU SVM by cleaning up any remaining
> ranges and
> + * notifiers, and dropping a reference to struct MM.
> + */
> +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm)
> +{
> +	struct drm_gpusvm_notifier *notifier, *next;
> +
> +	drm_gpusvm_for_each_notifier_safe(notifier, next, gpusvm, 0,
> LONG_MAX) {
> +		struct drm_gpusvm_range *range, *__next;
> +
> +		/*
> +		 * Remove notifier first to avoid racing with any
> invalidation
> +		 */
> +		mmu_interval_notifier_remove(&notifier->notifier);
> +		notifier->flags.removed = true;
> +
> +		drm_gpusvm_for_each_range_safe(range, __next,
> notifier, 0,
> +					       LONG_MAX)
> +			drm_gpusvm_range_remove(gpusvm, range);
> +	}
> +
> +	mmdrop(gpusvm->mm);
> +	WARN_ON(!RB_EMPTY_ROOT(&gpusvm->root.rb_root));
> +}
> +
> +/**
> + * drm_gpusvm_notifier_alloc - Allocate GPU SVM notifier
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @fault_addr: Fault address
> + *
> + * This function allocates and initializes the GPU SVM notifier
> structure.
> + *
> + * Returns:
> + * Pointer to the allocated GPU SVM notifier on success, ERR_PTR()
> on failure.
> + */
> +static struct drm_gpusvm_notifier *
> +drm_gpusvm_notifier_alloc(struct drm_gpusvm *gpusvm, u64 fault_addr)
> +{
> +	struct drm_gpusvm_notifier *notifier;
> +
> +	if (gpusvm->ops->notifier_alloc)
> +		notifier = gpusvm->ops->notifier_alloc();
> +	else
> +		notifier = kzalloc(sizeof(*notifier), GFP_KERNEL);
> +
> +	if (!notifier)
> +		return ERR_PTR(-ENOMEM);
> +
> +	notifier->gpusvm = gpusvm;
> +	notifier->interval.start = ALIGN_DOWN(fault_addr, gpusvm-
> >notifier_size);
> +	notifier->interval.end = ALIGN(fault_addr + 1, gpusvm-
> >notifier_size);
> +	INIT_LIST_HEAD(&notifier->rb.entry);
> +	notifier->root = RB_ROOT_CACHED;
> +	INIT_LIST_HEAD(&notifier->range_list);
> +
> +	return notifier;
> +}
> +
> +/**
> + * drm_gpusvm_notifier_free - Free GPU SVM notifier
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: Pointer to the GPU SVM notifier structure
> + *
> + * This function frees the GPU SVM notifier structure.
> + */
> +static void drm_gpusvm_notifier_free(struct drm_gpusvm *gpusvm,
> +				     struct drm_gpusvm_notifier
> *notifier)
> +{
> +	WARN_ON(!RB_EMPTY_ROOT(&notifier->root.rb_root));
> +
> +	if (gpusvm->ops->notifier_free)
> +		gpusvm->ops->notifier_free(notifier);
> +	else
> +		kfree(notifier);
> +}
> +
> +/**
> + * to_drm_gpusvm_range - retrieve the container struct for a given
> rbtree node
> + * @node__: a pointer to the rbtree node embedded within a
> drm_gpusvm_range struct
> + *
> + * Return: A pointer to the containing drm_gpusvm_range structure.
> + */
> +#define to_drm_gpusvm_range(node__)	\
> +	container_of((node__), struct drm_gpusvm_range, rb.node)
> +
> +/**
> + * drm_gpusvm_range_insert - Insert GPU SVM range
> + * @notifier: Pointer to the GPU SVM notifier structure
> + * @range: Pointer to the GPU SVM range structure
> + *
> + * This function inserts the GPU SVM range into the notifier RB tree
> and list.
> + */
> +static void drm_gpusvm_range_insert(struct drm_gpusvm_notifier
> *notifier,
> +				    struct drm_gpusvm_range *range)
> +{
> +	struct rb_node *node;
> +	struct list_head *head;
> +
> +	drm_gpusvm_notifier_lock(notifier->gpusvm);
> +	range_insert(range, &notifier->root);
> +
> +	node = rb_prev(&range->rb.node);
> +	if (node)
> +		head = &(to_drm_gpusvm_range(node))->rb.entry;
> +	else
> +		head = &notifier->range_list;
> +
> +	list_add(&range->rb.entry, head);
> +	drm_gpusvm_notifier_unlock(notifier->gpusvm);
> +}
> +
> +/**
> + * __drm_gpusvm_range_remove - Remove GPU SVM range
> + * @notifier__: Pointer to the GPU SVM notifier structure
> + * @range__: Pointer to the GPU SVM range structure
> + *
> + * This macro removes the GPU SVM range from the notifier RB tree
> and list.
> + */
> +#define __drm_gpusvm_range_remove(notifier__, range__)		\
> +	range_remove((range__), &(notifier__)->root);		\
> +	list_del(&(range__)->rb.entry)
> +
> +/**
> + * drm_gpusvm_range_alloc - Allocate GPU SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: Pointer to the GPU SVM notifier structure
> + * @fault_addr: Fault address
> + * @chunk_size: Chunk size
> + * @migrate_vram: Flag indicating whether to migrate VRAM
> + *
> + * This function allocates and initializes the GPU SVM range
> structure.
> + *
> + * Returns:
> + * Pointer to the allocated GPU SVM range on success, ERR_PTR() on
> failure.
> + */
> +static struct drm_gpusvm_range *
> +drm_gpusvm_range_alloc(struct drm_gpusvm *gpusvm,
> +		       struct drm_gpusvm_notifier *notifier,
> +		       u64 fault_addr, u64 chunk_size, bool
> migrate_vram)
> +{
> +	struct drm_gpusvm_range *range;
> +
> +	if (gpusvm->ops->range_alloc)
> +		range = gpusvm->ops->range_alloc(gpusvm);
> +	else
> +		range = kzalloc(sizeof(*range), GFP_KERNEL);
> +
> +	if (!range)
> +		return ERR_PTR(-ENOMEM);
> +
> +	kref_init(&range->refcount);
> +	range->gpusvm = gpusvm;
> +	range->notifier = notifier;
> +	range->va.start = ALIGN_DOWN(fault_addr, chunk_size);
> +	range->va.end = ALIGN(fault_addr + 1, chunk_size);
> +	INIT_LIST_HEAD(&range->rb.entry);
> +	range->notifier_seq = LONG_MAX;
> +	range->flags.migrate_vram = migrate_vram ? 1 : 0;
> +
> +	return range;
> +}
> +
> +/**
> + * drm_gpusvm_check_pages - Check pages
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: Pointer to the GPU SVM notifier structure
> + * @start: Start address
> + * @end: End address
> + *
> + * Check if pages between start and end have been faulted in on the
> CPU. Used to
> + * prevent migration of pages without CPU backing store.
> + *
> + * Returns:
> + * True if pages have been faulted into CPU, False otherwise
> + */
> +static bool drm_gpusvm_check_pages(struct drm_gpusvm *gpusvm,
> +				   struct drm_gpusvm_notifier
> *notifier,
> +				   u64 start, u64 end)
> +{
> +	struct hmm_range hmm_range = {
> +		.default_flags = 0,
> +		.notifier = &notifier->notifier,
> +		.start = start,
> +		.end = end,
> +		.dev_private_owner = gpusvm-
> >device_private_page_owner,
> +	};
> +	unsigned long timeout =
> +		jiffies +
> msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> +	unsigned long *pfns;
> +	unsigned long npages = npages_in_range(start, end);
> +	int err, i;
> +
> +	mmap_assert_locked(gpusvm->mm);
> +
> +	pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
> +	if (!pfns)
> +		return false;
> +
> +	hmm_range.notifier_seq = mmu_interval_read_begin(&notifier-
> >notifier);
> +	hmm_range.hmm_pfns = pfns;
> +
> +	while (true) {
> +		err = hmm_range_fault(&hmm_range);
> +		if (err == -EBUSY) {
> +			if (time_after(jiffies, timeout))
> +				break;
> +
> +			hmm_range.notifier_seq =
> mmu_interval_read_begin(&notifier->notifier);
> +			continue;
> +		}
> +		break;
> +	}
> +	if (err)
> +		goto err_free;
> +
> +	for (i = 0; i < npages; ++i) {
> +		if (!(pfns[i] & HMM_PFN_VALID)) {
> +			err = -EFAULT;
> +			goto err_free;
> +		}
> +	}
> +
> +err_free:
> +	kvfree(pfns);
> +	return err ? false : true;
> +}
> +
> +/**
> + * drm_gpusvm_range_chunk_size - Determine chunk size for GPU SVM
> range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: Pointer to the GPU SVM notifier structure
> + * @vas: Pointer to the virtual memory area structure
> + * @fault_addr: Fault address
> + * @gpuva_start: Start address of GPUVA which mirrors CPU
> + * @gpuva_end: End address of GPUVA which mirrors CPU
> + * @check_pages: Flag indicating whether to check pages
> + *
> + * This function determines the chunk size for the GPU SVM range
> based on the
> + * fault address, GPU SVM chunk sizes, existing GPU SVM ranges, and
> the virtual
> + * memory area boundaries.
> + *
> + * Returns:
> + * Chunk size on success, LONG_MAX on failure.
> + */
> +static u64 drm_gpusvm_range_chunk_size(struct drm_gpusvm *gpusvm,
> +				       struct drm_gpusvm_notifier
> *notifier,
> +				       struct vm_area_struct *vas,
> +				       u64 fault_addr, u64
> gpuva_start,
> +				       u64 gpuva_end, bool
> check_pages)
> +{
> +	u64 start, end;
> +	int i = 0;
> +
> +retry:
> +	for (; i < gpusvm->num_chunks; ++i) {
> +		start = ALIGN_DOWN(fault_addr, gpusvm-
> >chunk_sizes[i]);
> +		end = ALIGN(fault_addr + 1, gpusvm->chunk_sizes[i]);
> +
> +		if (start >= vas->vm_start && end <= vas->vm_end &&
> +		    start >= notifier->interval.start &&
> +		    end <= notifier->interval.end &&
> +		    start >= gpuva_start && end <= gpuva_end)
> +			break;
> +	}
> +
> +	if (i == gpusvm->num_chunks)
> +		return LONG_MAX;
> +
> +	/*
> +	 * If the allocation spans more than one page, ensure it does not
> +	 * overlap with existing ranges.
> +	 */
> +	if (end - start != SZ_4K) {
> +		struct drm_gpusvm_range *range;
> +
> +		range = drm_gpusvm_range_find(notifier, start, end);
> +		if (range) {
> +			++i;
> +			goto retry;
> +		}
> +
> +		/*
> +		 * XXX: Only create range on pages CPU has faulted
> in. Without
> +		 * this check, or prefault, on BMG
> 'xe_exec_system_allocator --r
> +		 * process-many-malloc' fails. In the failure case,
> each process
> +		 * mallocs 16k but the CPU VMA is ~128k which
> results in 64k SVM
> +		 * ranges. When migrating the SVM ranges, some
> processes fail in
> +		 * drm_gpusvm_migrate_to_vram with 'migrate.cpages
> != npages'
> +		 * and then upon drm_gpusvm_range_get_pages device
> pages from
> +		 * other processes are collected + faulted in which
> creates all
> +		 * sorts of problems. Unsure exactly how this is happening;
> +		 * the problem also goes away if 'xe_exec_system_allocator
> +		 * --r process-many-malloc' mallocs at least 64k at a time.
> +		 */
> +		if (check_pages &&
> +		    !drm_gpusvm_check_pages(gpusvm, notifier, start,
> end)) {
> +			++i;
> +			goto retry;
> +		}
> +	}
> +
> +	return end - start;
> +}
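
To make the selection above concrete: with chunk sizes {SZ_2M, SZ_64K, SZ_4K}
and fault_addr = 0x201000, the 2M chunk [0x200000, 0x400000) is considered
first; if it crosses the VMA, notifier or GPUVA bounds, overlaps an existing
range, or fails the check_pages test, the 64K chunk [0x200000, 0x210000) is
tried next, and finally the single 4K page [0x201000, 0x202000).
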
> +
> +/**
> + * drm_gpusvm_range_find_or_insert - Find or insert GPU SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @fault_addr: Fault address
> + * @gpuva_start: Start address of GPUVA which mirrors CPU
> + * @gpuva_end: End address of GPUVA which mirrors CPU
> + * @ctx: GPU SVM context
> + *
> + * This function finds or inserts a newly allocated GPU SVM range
> based on the
> + * fault address. Caller must hold a lock to protect range lookup
> and insertion.
> + *
> + * Returns:
> + * Pointer to the GPU SVM range on success, ERR_PTR() on failure.
> + */
> +struct drm_gpusvm_range *
> +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64
> fault_addr,
> +				u64 gpuva_start, u64 gpuva_end,
> +				const struct drm_gpusvm_ctx *ctx)
> +{
> +	struct drm_gpusvm_notifier *notifier;
> +	struct drm_gpusvm_range *range;
> +	struct mm_struct *mm = gpusvm->mm;
> +	struct vm_area_struct *vas;
> +	bool notifier_alloc = false;
> +	u64 chunk_size;
> +	int err;
> +	bool migrate_vram;
> +
> +	if (fault_addr < gpusvm->mm_start ||
> +	    fault_addr > gpusvm->mm_start + gpusvm->mm_range) {
> +		err = -EINVAL;
> +		goto err_out;
> +	}
> +
> +	if (!ctx->mmap_locked) {
> +		if (!mmget_not_zero(mm)) {
> +			err = -EFAULT;
> +			goto err_out;
> +		}
> +		mmap_write_lock(mm);
> +	}
> +
> +	mmap_assert_write_locked(mm);
> +
> +	notifier = drm_gpusvm_notifier_find(gpusvm, fault_addr);
> +	if (!notifier) {
> +		notifier = drm_gpusvm_notifier_alloc(gpusvm,
> fault_addr);
> +		if (IS_ERR(notifier)) {
> +			err = PTR_ERR(notifier);
> +			goto err_mmunlock;
> +		}
> +		notifier_alloc = true;
> +		err = mmu_interval_notifier_insert_locked(&notifier-
> >notifier,
> +							  mm,
> notifier->interval.start,
> +							  notifier-
> >interval.end -
> +							  notifier-
> >interval.start,
> +							 
> &drm_gpusvm_notifier_ops);
> +		if (err)
> +			goto err_notifier;
> +	}
> +
> +	vas = vma_lookup(mm, fault_addr);
> +	if (!vas) {
> +		err = -ENOENT;
> +		goto err_notifier_remove;
> +	}
> +
> +	if (!ctx->read_only && !(vas->vm_flags & VM_WRITE)) {
> +		err = -EPERM;
> +		goto err_notifier_remove;
> +	}
> +
> +	range = drm_gpusvm_range_find(notifier, fault_addr,
> fault_addr + 1);
> +	if (range)
> +		goto out_mmunlock;
> +	/*
> +	 * XXX: Short-circuiting migration based on migrate_vma_*
> current
> +	 * limitations. If/when migrate_vma_* add more support, this
> logic will
> +	 * have to change.
> +	 */
> +	migrate_vram = ctx->vram_possible &&
> +		vma_is_anonymous(vas) && !is_vm_hugetlb_page(vas);
> +
> +	chunk_size = drm_gpusvm_range_chunk_size(gpusvm, notifier,
> vas,
> +						 fault_addr,
> gpuva_start,
> +						 gpuva_end,
> migrate_vram &&
> +						 !ctx->prefault);
> +	if (chunk_size == LONG_MAX) {
> +		err = -EINVAL;
> +		goto err_notifier_remove;
> +	}
> +
> +	range = drm_gpusvm_range_alloc(gpusvm, notifier, fault_addr,
> chunk_size,
> +				       migrate_vram);
> +	if (IS_ERR(range)) {
> +		err = PTR_ERR(range);
> +		goto err_notifier_remove;
> +	}
> +
> +	drm_gpusvm_range_insert(notifier, range);
> +	if (notifier_alloc)
> +		drm_gpusvm_notifier_insert(gpusvm, notifier);
> +
> +	if (ctx->prefault) {
> +		struct drm_gpusvm_ctx __ctx = *ctx;
> +
> +		__ctx.mmap_locked = true;
> +		err = drm_gpusvm_range_get_pages(gpusvm, range,
> &__ctx);
> +		if (err)
> +			goto err_range_remove;
> +	}
> +
> +out_mmunlock:
> +	if (!ctx->mmap_locked) {
> +		mmap_write_unlock(mm);
> +		mmput(mm);
> +	}
> +
> +	return range;
> +
> +err_range_remove:
> +	__drm_gpusvm_range_remove(notifier, range);
> +err_notifier_remove:
> +	if (notifier_alloc)
> +		mmu_interval_notifier_remove(&notifier->notifier);
> +err_notifier:
> +	if (notifier_alloc)
> +		drm_gpusvm_notifier_free(gpusvm, notifier);
> +err_mmunlock:
> +	if (!ctx->mmap_locked) {
> +		mmap_write_unlock(mm);
> +		mmput(mm);
> +	}
> +err_out:
> +	return ERR_PTR(err);
> +}
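
As a sanity check on the intended flow, a driver GPU fault handler would look
roughly like this (the surrounding driver lock and the retry policy are
hypothetical, not part of this patch):

	struct drm_gpusvm_ctx ctx = { .vram_possible = true };
	struct drm_gpusvm_range *range;
	int err;

	/* Driver lock protecting range lookup/insertion is held here. */
	range = drm_gpusvm_range_find_or_insert(gpusvm, fault_addr,
						gpuva_start, gpuva_end, &ctx);
	if (IS_ERR(range))
		return PTR_ERR(range);

	/* Optionally migrate to VRAM first, then grab the backing pages. */
	err = drm_gpusvm_range_get_pages(gpusvm, range, &ctx);
	if (err)
		return err;	/* driver-specific retry policy */
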
> +
> +/**
> + * for_each_dma_page - iterate over pages in a DMA region
> + * @i__: the current page index in the iteration
> + * @j__: the current block index in the iteration (one block is 2^@order__ pages)
> + * @npages__: the total number of pages in the DMA region
> + * @order__: the order of the pages in the DMA region
> + *
> + * This macro iterates over each page in a DMA region. The DMA
> region
> + * is assumed to be composed of 2^@order__ pages, and the macro will
> + * step through the region one block of 2^@order__ pages at a time.
> + */
> +#define for_each_dma_page(i__, j__, npages__, order__)	\
> +	for ((i__) = 0, (j__) = 0; (i__) < (npages__);	\
> +	     (j__)++, (i__) += 0x1 << (order__))
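
As a worked example of the iteration above: with npages__ = 8 and order__ = 2
the body runs twice, with (i__, j__) taking (0, 0) and (4, 1), i.e. i__ steps
through the region one 4-page block at a time while j__ counts the blocks.
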
> +
> +/**
> + * __drm_gpusvm_range_unmap_pages - Unmap pages associated with a
> GPU SVM range (internal)
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + *
> + * This function unmaps pages associated with a GPU SVM range.
> Assumes and
> + * asserts correct locking is in place when called.
> + */
> +static void __drm_gpusvm_range_unmap_pages(struct drm_gpusvm
> *gpusvm,
> +					   struct drm_gpusvm_range
> *range)
> +{
> +	lockdep_assert_held(&gpusvm->notifier_lock);
> +
> +	if (range->pages) {
> +		unsigned long i, j, npages = npages_in_range(range-
> >va.start,
> +							     range-
> >va.end);
> +
> +		if (range->flags.has_dma_mapping) {
> +			for_each_dma_page(i, j, npages, range-
> >order)
> +				dma_unmap_page(gpusvm->drm->dev,
> +					       range->dma_addr[j],
> +					       PAGE_SIZE << range-
> >order,
> +					       DMA_BIDIRECTIONAL);
> +		}
> +
> +		range->flags.has_vram_pages = false;
> +		range->flags.has_dma_mapping = false;
> +	}
> +}
> +
> +/**
> + * drm_gpusvm_range_free_pages - Free pages associated with a GPU
> SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + *
> + * This function frees pages associated with a GPU SVM range.
> + */
> +static void drm_gpusvm_range_free_pages(struct drm_gpusvm *gpusvm,
> +					struct drm_gpusvm_range
> *range)
> +{
> +	lockdep_assert_held(&gpusvm->notifier_lock);
> +
> +	if (range->pages) {
> +		if (range->flags.kfree_mapping) {
> +			kfree(range->dma_addr);
> +			range->flags.kfree_mapping = false;
> +			range->pages = NULL;
> +		} else {
> +			kvfree(range->pages);
> +			range->pages = NULL;
> +		}
> +	}
> +}
> +
> +/**
> + * drm_gpusvm_range_remove - Remove GPU SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range to be removed
> + *
> + * This function removes the specified GPU SVM range and also
> removes the parent
> + * GPU SVM notifier if no more ranges remain in the notifier. The
> caller must
> + * hold a lock to protect range and notifier removal.
> + */
> +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> +			     struct drm_gpusvm_range *range)
> +{
> +	struct drm_gpusvm_notifier *notifier;
> +
> +	notifier = drm_gpusvm_notifier_find(gpusvm, range-
> >va.start);
> +	if (WARN_ON_ONCE(!notifier))
> +		return;
> +
> +	drm_gpusvm_notifier_lock(gpusvm);
> +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> +	drm_gpusvm_range_free_pages(gpusvm, range);
> +	__drm_gpusvm_range_remove(notifier, range);
> +	drm_gpusvm_notifier_unlock(gpusvm);
> +
> +	drm_gpusvm_range_put(range);
> +
> +	if (RB_EMPTY_ROOT(&notifier->root.rb_root)) {
> +		if (!notifier->flags.removed)
> +			mmu_interval_notifier_remove(&notifier-
> >notifier);
> +		drm_gpusvm_notifier_remove(gpusvm, notifier);
> +		drm_gpusvm_notifier_free(gpusvm, notifier);
> +	}
> +}
> +
> +/**
> + * drm_gpusvm_range_get - Get a reference to GPU SVM range
> + * @range: Pointer to the GPU SVM range
> + *
> + * This function increments the reference count of the specified GPU
> SVM range.
> + *
> + * Returns:
> + * Pointer to the GPU SVM range.
> + */
> +struct drm_gpusvm_range *
> +drm_gpusvm_range_get(struct drm_gpusvm_range *range)
> +{
> +	kref_get(&range->refcount);
> +
> +	return range;
> +}
> +
> +/**
> + * drm_gpusvm_range_destroy - Destroy GPU SVM range
> + * @refcount: Pointer to the reference counter embedded in the GPU
> SVM range
> + *
> + * This function destroys the specified GPU SVM range when its
> reference count
> + * reaches zero. If a custom range-free function is provided, it is
> invoked to
> + * free the range; otherwise, the range is deallocated using
> kfree().
> + */
> +static void drm_gpusvm_range_destroy(struct kref *refcount)
> +{
> +	struct drm_gpusvm_range *range =
> +		container_of(refcount, struct drm_gpusvm_range,
> refcount);
> +	struct drm_gpusvm *gpusvm = range->gpusvm;
> +
> +	if (gpusvm->ops->range_free)
> +		gpusvm->ops->range_free(range);
> +	else
> +		kfree(range);
> +}
> +
> +/**
> + * drm_gpusvm_range_put - Put a reference to GPU SVM range
> + * @range: Pointer to the GPU SVM range
> + *
> + * This function decrements the reference count of the specified GPU
> SVM range
> + * and frees it when the count reaches zero.
> + */
> +void drm_gpusvm_range_put(struct drm_gpusvm_range *range)
> +{
> +	kref_put(&range->refcount, drm_gpusvm_range_destroy);
> +}
> +
> +/**
> + * drm_gpusvm_range_pages_valid - GPU SVM range pages valid
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + *
> + * This function determines if a GPU SVM range's pages are valid.
> + * Expected to be called holding gpusvm->notifier_lock and as the last
> + * step before committing a GPU binding.
> + *
> + * Returns:
> + * True if GPU SVM range has valid pages, False otherwise
> + */
> +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> +				  struct drm_gpusvm_range *range)
> +{
> +	lockdep_assert_held(&gpusvm->notifier_lock);
> +
> +	return range->flags.has_vram_pages || range-
> >flags.has_dma_mapping;
> +}
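
A sketch of the bind-time pattern this implies (driver-side pseudocode; the
page-table commit step is of course driver specific):

	drm_gpusvm_notifier_lock(gpusvm);
	if (!drm_gpusvm_range_pages_valid(gpusvm, range)) {
		drm_gpusvm_notifier_unlock(gpusvm);
		goto retry;	/* e.g. back to drm_gpusvm_range_get_pages() */
	}
	/* ... commit GPU page-table update for the range ... */
	drm_gpusvm_notifier_unlock(gpusvm);
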
> +
> +/**
> + * drm_gpusvm_range_pages_valid_unlocked - GPU SVM range pages valid
> unlocked
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + *
> + * This function determines if a GPU SVM range's pages are valid.
> + * Expected to be called without holding gpusvm->notifier_lock.
> + *
> + * Returns:
> + * True if GPU SVM range has valid pages, False otherwise
> + */
> +static bool
> +drm_gpusvm_range_pages_valid_unlocked(struct drm_gpusvm *gpusvm,
> +				      struct drm_gpusvm_range
> *range)
> +{
> +	bool pages_valid;
> +
> +	if (!range->pages)
> +		return false;
> +
> +	drm_gpusvm_notifier_lock(gpusvm);
> +	pages_valid = drm_gpusvm_range_pages_valid(gpusvm, range);
> +	if (!pages_valid && range->flags.kfree_mapping) {
> +		kfree(range->dma_addr);
> +		range->flags.kfree_mapping = false;
> +		range->pages = NULL;
> +	}
> +	drm_gpusvm_notifier_unlock(gpusvm);
> +
> +	return pages_valid;
> +}
> +
> +/**
> + * drm_gpusvm_range_get_pages - Get pages for a GPU SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + * @ctx: GPU SVM context
> + *
> + * This function gets pages for a GPU SVM range and ensures they are
> mapped for
> + * DMA access.
> + *
> + * Returns:
> + * 0 on success, negative error code on failure.
> + */
> +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> +			       struct drm_gpusvm_range *range,
> +			       const struct drm_gpusvm_ctx *ctx)
> +{
> +	struct mmu_interval_notifier *notifier = &range->notifier-
> >notifier;
> +	struct hmm_range hmm_range = {
> +		.default_flags = HMM_PFN_REQ_FAULT | (ctx->read_only
> ? 0 :
> +			HMM_PFN_REQ_WRITE),
> +		.notifier = notifier,
> +		.start = range->va.start,
> +		.end = range->va.end,
> +		.dev_private_owner = gpusvm-
> >device_private_page_owner,
> +	};
> +	struct mm_struct *mm = gpusvm->mm;
> +	unsigned long timeout =
> +		jiffies +
> msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> +	unsigned long i, j;
> +	unsigned long npages = npages_in_range(range->va.start,
> range->va.end);
> +	unsigned int order = 0;
> +	unsigned long *pfns;
> +	struct page **pages;
> +	int err = 0;
> +	bool vram_pages = !!range->flags.migrate_vram;
> +	bool alloc_pfns = false, kfree_mapping;
> +
> +retry:
> +	kfree_mapping = false;
> +	hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
> +	if (drm_gpusvm_range_pages_valid_unlocked(gpusvm, range))
> +		return 0;
> +
> +	if (range->notifier_seq == hmm_range.notifier_seq && range-
> >pages) {
> +		if (ctx->prefault)
> +			return 0;
> +
> +		pfns = (unsigned long *)range->pages;
> +		pages = range->pages;
> +		goto map_pages;
> +	}
> +
> +	if (!range->pages) {
> +		pfns = kvmalloc_array(npages, sizeof(*pfns),
> GFP_KERNEL);
> +		if (!pfns)
> +			return -ENOMEM;
> +		alloc_pfns = true;
> +	} else {
> +		pfns = (unsigned long *)range->pages;
> +	}
> +
> +	if (!ctx->mmap_locked) {
> +		if (!mmget_not_zero(mm)) {
> +			err = -EFAULT;
> +			goto err_out;
> +		}
> +	}
> +
> +	hmm_range.hmm_pfns = pfns;
> +	while (true) {
> +		/* Must be checked after mmu_interval_read_begin */
> +		if (range->flags.unmapped) {
> +			err = -EFAULT;
> +			break;
> +		}
> +
> +		if (!ctx->mmap_locked) {
> +			/*
> +			 * XXX: HMM locking document indicates only
> a read-lock
> +			 * is required but there appears to be a
> window between
> +			 * the MMU_NOTIFY_MIGRATE event triggered in
> a CPU fault
> +			 * via migrate_vma_setup and the pages
> actually moving
> +			 * in migrate_vma_finalize in which this
> code can grab
> +			 * garbage pages. Grabbing the write-lock if
> the range
> +			 * is attached to vram appears to protect
> against this
> +			 * race.
> +			 */
> +			if (vram_pages)
> +				mmap_write_lock(mm);
> +			else
> +				mmap_read_lock(mm);
> +		}
> +		err = hmm_range_fault(&hmm_range);
> +		if (!ctx->mmap_locked) {
> +			if (vram_pages)
> +				mmap_write_unlock(mm);
> +			else
> +				mmap_read_unlock(mm);
> +		}
> +
> +		if (err == -EBUSY) {
> +			if (time_after(jiffies, timeout))
> +				break;
> +
> +			hmm_range.notifier_seq =
> mmu_interval_read_begin(notifier);
> +			continue;
> +		}
> +		break;
> +	}
> +	if (!ctx->mmap_locked)
> +		mmput(mm);
> +	if (err)
> +		goto err_free;
> +
> +	pages = (struct page **)pfns;
> +
> +	if (ctx->prefault) {
> +		range->pages = pages;
> +		goto set_seqno;
> +	}
> +
> +map_pages:
> +	if (is_device_private_page(hmm_pfn_to_page(pfns[0]))) {
> +		WARN_ON_ONCE(!range->vram_allocation);
> +
> +		for (i = 0; i < npages; ++i) {
> +			pages[i] = hmm_pfn_to_page(pfns[i]);
> +
> +			if
> (WARN_ON_ONCE(!is_device_private_page(pages[i]))) {
> +				err = -EOPNOTSUPP;
> +				goto err_free;
> +			}
> +		}
> +
> +		/* Do not race with notifier unmapping pages */
> +		drm_gpusvm_notifier_lock(gpusvm);
> +		range->flags.has_vram_pages = true;
> +		range->pages = pages;
> +		if (mmu_interval_read_retry(notifier,
> hmm_range.notifier_seq)) {
> +			err = -EAGAIN;
> +			__drm_gpusvm_range_unmap_pages(gpusvm,
> range);
> +		}
> +		drm_gpusvm_notifier_unlock(gpusvm);
> +	} else {
> +		dma_addr_t *dma_addr = (dma_addr_t *)pfns;
> +
> +		for_each_dma_page(i, j, npages, order) {
> +			if (WARN_ON_ONCE(i && order !=
> +					
> hmm_pfn_to_map_order(pfns[i]))) {
> +				err = -EOPNOTSUPP;
> +				npages = i;
> +				goto err_unmap;
> +			}
> +			order = hmm_pfn_to_map_order(pfns[i]);
> +
> +			pages[j] = hmm_pfn_to_page(pfns[i]);
> +			if
> (WARN_ON_ONCE(is_zone_device_page(pages[j]))) {
> +				err = -EOPNOTSUPP;
> +				npages = i;
> +				goto err_unmap;
> +			}
> +
> +			set_page_dirty_lock(pages[j]);
> +			mark_page_accessed(pages[j]);
> +
> +			dma_addr[j] = dma_map_page(gpusvm->drm->dev,
> +						   pages[j], 0,
> +						   PAGE_SIZE <<
> order,
> +						  
> DMA_BIDIRECTIONAL);
> +			if (dma_mapping_error(gpusvm->drm->dev,
> dma_addr[j])) {
> +				err = -EFAULT;
> +				npages = i;
> +				goto err_unmap;
> +			}
> +		}
> +
> +		/* Huge pages, reduce memory footprint */
> +		if (order) {
> +			dma_addr = kmalloc_array(j,
> sizeof(*dma_addr),
> +						 GFP_KERNEL);
> +			if (dma_addr) {
> +				for (i = 0; i < j; ++i)
> +					dma_addr[i] =
> (dma_addr_t)pfns[i];
> +				kvfree(pfns);
> +				kfree_mapping = true;
> +			} else {
> +				dma_addr = (dma_addr_t *)pfns;
> +			}
> +		}
> +
> +		/* Do not race with notifier unmapping pages */
> +		drm_gpusvm_notifier_lock(gpusvm);
> +		range->order = order;
> +		range->flags.kfree_mapping = kfree_mapping;
> +		range->flags.has_dma_mapping = true;
> +		range->dma_addr = dma_addr;
> +		range->vram_allocation = NULL;
> +		if (mmu_interval_read_retry(notifier,
> hmm_range.notifier_seq)) {
> +			err = -EAGAIN;
> +			__drm_gpusvm_range_unmap_pages(gpusvm,
> range);
> +		}
> +		drm_gpusvm_notifier_unlock(gpusvm);
> +	}
> +
> +	if (err == -EAGAIN)
> +		goto retry;
> +set_seqno:
> +	range->notifier_seq = hmm_range.notifier_seq;
> +
> +	return 0;
> +
> +err_unmap:
> +	for_each_dma_page(i, j, npages, order)
> +		dma_unmap_page(gpusvm->drm->dev,
> +			       (dma_addr_t)pfns[j],
> +			       PAGE_SIZE << order,
> DMA_BIDIRECTIONAL);
> +err_free:
> +	if (alloc_pfns)
> +		kvfree(pfns);
> +err_out:
> +	return err;
> +}
> +
> +/**
> + * drm_gpusvm_range_unmap_pages - Unmap pages associated with a GPU
> SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + * @ctx: GPU SVM context
> + *
> + * This function unmaps pages associated with a GPU SVM range. If
> @in_notifier
> + * is set, it is assumed that gpusvm->notifier_lock is held in write
> mode; if it
> + * is clear, it acquires gpusvm->notifier_lock in read mode. Must be
> called on
> + * each GPU SVM range attached to notifier in gpusvm->ops-
> >invalidate for IOMMU
> + * security model.
> + */
> +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> +				  struct drm_gpusvm_range *range,
> +				  const struct drm_gpusvm_ctx *ctx)
> +{
> +	if (ctx->in_notifier)
> +		lockdep_assert_held_write(&gpusvm->notifier_lock);
> +	else
> +		drm_gpusvm_notifier_lock(gpusvm);
> +
> +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> +
> +	if (!ctx->in_notifier)
> +		drm_gpusvm_notifier_unlock(gpusvm);
> +}
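
So a driver's invalidate hook is expected to look something like the sketch
below (my_invalidate and the page-table zap are hypothetical placeholders):

static void my_invalidate(struct drm_gpusvm *gpusvm,
			  struct drm_gpusvm_notifier *notifier,
			  const struct mmu_notifier_range *mmu_range)
{
	struct drm_gpusvm_ctx ctx = { .in_notifier = true };
	struct drm_gpusvm_range *range = NULL;

	/* ... zap GPU page tables covering the invalidated span ... */

	drm_gpusvm_for_each_range(range, notifier, mmu_range->start,
				  mmu_range->end)
		drm_gpusvm_range_unmap_pages(gpusvm, range, &ctx);
}
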
> +
> +/**
> + * drm_gpusvm_migration_put_page - Put a migration page
> + * @page: Pointer to the page to put
> + *
> + * This function unlocks and puts a page.
> + */
> +static void drm_gpusvm_migration_put_page(struct page *page)
> +{
> +	unlock_page(page);
> +	put_page(page);
> +}
> +
> +/**
> + * drm_gpusvm_migration_put_pages - Put migration pages
> + * @npages: Number of pages
> + * @migrate_pfn: Array of migrate page frame numbers
> + *
> + * This function puts an array of pages.
> + */
> +static void drm_gpusvm_migration_put_pages(unsigned long npages,
> +					   unsigned long
> *migrate_pfn)
> +{
> +	unsigned long i;
> +
> +	for (i = 0; i < npages; ++i) {
> +		if (!migrate_pfn[i])
> +			continue;
> +
> +		drm_gpusvm_migration_put_page(migrate_pfn_to_page(mi
> grate_pfn[i]));
> +		migrate_pfn[i] = 0;
> +	}
> +}
> +
> +/**
> + * drm_gpusvm_get_vram_page - Get a reference to a VRAM page
> + * @page: Pointer to the page
> + * @zdd: Pointer to the GPU SVM zone device data
> + *
> + * This function associates the given page with the specified GPU
> SVM zone
> + * device data and initializes it for zone device usage.
> + */
> +static void drm_gpusvm_get_vram_page(struct page *page,
> +				     struct drm_gpusvm_zdd *zdd)
> +{
> +	page->zone_device_data = drm_gpusvm_zdd_get(zdd);
> +	zone_device_page_init(page);
> +}
> +
> +/**
> + * drm_gpusvm_migrate_map_pages() - Map migration pages for GPU SVM
> migration
> + * @dev: The device for which the pages are being mapped
> + * @dma_addr: Array to store DMA addresses corresponding to mapped
> pages
> + * @migrate_pfn: Array of migrate page frame numbers to map
> + * @npages: Number of pages to map
> + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> + *
> + * This function maps pages of memory for migration usage in GPU
> SVM. It
> + * iterates over each page frame number provided in @migrate_pfn,
> maps the
> + * corresponding page, and stores the DMA address in the provided
> @dma_addr
> + * array.
> + *
> + * Return: 0 on success, -EFAULT if an error occurs during mapping.
> + */
> +static int drm_gpusvm_migrate_map_pages(struct device *dev,
> +					dma_addr_t *dma_addr,
> +					long unsigned int
> *migrate_pfn,
> +					unsigned long npages,
> +					enum dma_data_direction dir)
> +{
> +	unsigned long i;
> +
> +	for (i = 0; i < npages; ++i) {
> +		struct page *page =
> migrate_pfn_to_page(migrate_pfn[i]);
> +
> +		if (!page)
> +			continue;
> +
> +		if (WARN_ON_ONCE(is_zone_device_page(page)))
> +			return -EFAULT;
> +
> +		dma_addr[i] = dma_map_page(dev, page, 0, PAGE_SIZE,
> dir);
> +		if (dma_mapping_error(dev, dma_addr[i]))
> +			return -EFAULT;
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * drm_gpusvm_migrate_unmap_pages() - Unmap pages previously mapped
> for GPU SVM migration
> + * @dev: The device for which the pages were mapped
> + * @dma_addr: Array of DMA addresses corresponding to mapped pages
> + * @npages: Number of pages to unmap
> + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> + *
> + * This function unmaps previously mapped pages of memory for GPU
> Shared Virtual
> + * Memory (SVM). It iterates over each DMA address provided in
> @dma_addr, checks
> + * if it's valid and not already unmapped, and unmaps the
> corresponding page.
> + */
> +static void drm_gpusvm_migrate_unmap_pages(struct device *dev,
> +					   dma_addr_t *dma_addr,
> +					   unsigned long npages,
> +					   enum dma_data_direction
> dir)
> +{
> +	unsigned long i;
> +
> +	for (i = 0; i < npages; ++i) {
> +		if (!dma_addr[i] || dma_mapping_error(dev,
> dma_addr[i]))
> +			continue;
> +
> +		dma_unmap_page(dev, dma_addr[i], PAGE_SIZE, dir);
> +	}
> +}
> +
> +/**
> + * drm_gpusvm_migrate_to_vram - Migrate GPU SVM range to VRAM
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + * @vram_allocation: Driver-private pointer to the VRAM allocation.
> The caller
> + *                   should hold a reference to the VRAM allocation,
> which
> + *                   should be dropped via ops->vram_allocation or
> upon the
> + *                   failure of this function.
> + * @ctx: GPU SVM context
> + *
> + * This function migrates the specified GPU SVM range to VRAM. It
> performs the
> + * necessary setup and invokes the driver-specific operations for
> migration to
> + * VRAM. Upon successful return, @vram_allocation can safely reference
> + * @range until ops->vram_release is called; this guarantee only applies
> + * on successful return.
> + *
> + * Returns:
> + * 0 on success, negative error code on failure.
> + */
> +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> +			       struct drm_gpusvm_range *range,
> +			       void *vram_allocation,
> +			       const struct drm_gpusvm_ctx *ctx)
> +{
> +	u64 start = range->va.start, end = range->va.end;
> +	struct migrate_vma migrate = {
> +		.start		= start,
> +		.end		= end,
> +		.pgmap_owner	= gpusvm->device_private_page_owner,
> +		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
> +	};
> +	struct mm_struct *mm = gpusvm->mm;
> +	unsigned long i, npages = npages_in_range(start, end);
> +	struct vm_area_struct *vas;
> +	struct drm_gpusvm_zdd *zdd = NULL;
> +	struct page **pages;
> +	dma_addr_t *dma_addr;
> +	void *buf;
> +	int err;
> +
> +	if (!range->flags.migrate_vram)
> +		return -EINVAL;
> +
> +	if (!gpusvm->ops->populate_vram_pfn || !gpusvm->ops-
> >copy_to_vram ||
> +	    !gpusvm->ops->copy_to_sram)
> +		return -EOPNOTSUPP;
> +
> +	if (!ctx->mmap_locked) {
> +		if (!mmget_not_zero(mm)) {
> +			err = -EFAULT;
> +			goto err_out;
> +		}
> +		mmap_write_lock(mm);
> +	}
> +
> +	mmap_assert_locked(mm);
> +
> +	vas = vma_lookup(mm, start);
> +	if (!vas) {
> +		err = -ENOENT;
> +		goto err_mmunlock;
> +	}
> +
> +	if (end > vas->vm_end || start < vas->vm_start) {
> +		err = -EINVAL;
> +		goto err_mmunlock;
> +	}
> +
> +	if (!vma_is_anonymous(vas)) {
> +		err = -EBUSY;
> +		goto err_mmunlock;
> +	}
> +
> +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) +
> sizeof(*dma_addr) +
> +		       sizeof(*pages), GFP_KERNEL);
> +	if (!buf) {
> +		err = -ENOMEM;
> +		goto err_mmunlock;
> +	}
> +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr))
> * npages;
> +
> +	zdd = drm_gpusvm_zdd_alloc(range);
> +	if (!zdd) {
> +		err = -ENOMEM;
> +		goto err_free;
> +	}
> +
> +	migrate.vma = vas;
> +	migrate.src = buf;
> +	migrate.dst = migrate.src + npages;
> +
> +	err = migrate_vma_setup(&migrate);
> +	if (err)
> +		goto err_free;
> +
> +	/*
> +	 * FIXME: The cases below, !migrate.cpages and migrate.cpages !=
> +	 * npages, are not always an error. Need to revisit possible cases
> +	 * and how to handle them. We could prefault on migrate.cpages !=
> +	 * npages via hmm_range_fault.
> +	 */
> +
> +	if (!migrate.cpages) {
> +		err = -EFAULT;
> +		goto err_free;
> +	}
> +
> +	if (migrate.cpages != npages) {
> +		err = -EBUSY;
> +		goto err_finalize;
> +	}
> +
> +	err = gpusvm->ops->populate_vram_pfn(gpusvm,
> vram_allocation, npages,
> +					     migrate.dst);
> +	if (err)
> +		goto err_finalize;
> +
> +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> dma_addr,
> +					   migrate.src, npages,
> DMA_TO_DEVICE);
> +	if (err)
> +		goto err_finalize;
> +
> +	for (i = 0; i < npages; ++i) {
> +		struct page *page = pfn_to_page(migrate.dst[i]);
> +
> +		pages[i] = page;
> +		migrate.dst[i] = migrate_pfn(migrate.dst[i]);
> +		drm_gpusvm_get_vram_page(page, zdd);
> +	}
> +
> +	err = gpusvm->ops->copy_to_vram(gpusvm, pages, dma_addr,
> npages);
> +	if (err)
> +		goto err_finalize;
> +
> +	/* Upon success bind vram allocation to range and zdd */
> +	range->vram_allocation = vram_allocation;
> +	WRITE_ONCE(zdd->vram_allocation, vram_allocation);	/*
> Owns ref */
> +
> +err_finalize:
> +	if (err)
> +		drm_gpusvm_migration_put_pages(npages, migrate.dst);
> +	migrate_vma_pages(&migrate);
> +	migrate_vma_finalize(&migrate);
> +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr,
> npages,
> +				       DMA_TO_DEVICE);
> +err_free:
> +	if (zdd)
> +		drm_gpusvm_zdd_put(zdd);
> +	kvfree(buf);
> +err_mmunlock:
> +	if (!ctx->mmap_locked) {
> +		mmap_write_unlock(mm);
> +		mmput(mm);
> +	}
> +err_out:
> +	return err;
> +}
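
For context, the expected caller pattern (my_alloc_vram()/my_put_vram() are
hypothetical driver helpers that own the reference mentioned above):

	if (range->flags.migrate_vram) {
		void *vram = my_alloc_vram(range->va.end - range->va.start);

		/* On migration failure simply fall back to system pages. */
		if (vram && drm_gpusvm_migrate_to_vram(gpusvm, range, vram, &ctx))
			my_put_vram(vram);
	}
	err = drm_gpusvm_range_get_pages(gpusvm, range, &ctx);
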
> +
> +/**
> + * drm_gpusvm_migrate_populate_sram_pfn - Populate SRAM PFNs for a
> VM area
> + * @vas: Pointer to the VM area structure, can be NULL
> + * @npages: Number of pages to populate
> + * @src_mpfn: Source array of migrate PFNs
> + * @mpfn: Array of migrate PFNs to populate
> + * @addr: Start address for PFN allocation
> + *
> + * This function populates the SRAM migrate page frame numbers
> (PFNs) for the
> + * specified VM area structure. It allocates and locks pages in the
> VM area for
> + * SRAM usage. If @vas is non-NULL, pages are allocated with
> + * alloc_page_vma(); otherwise alloc_page() is used.
> + *
> + * Returns:
> + * 0 on success, negative error code on failure.
> + */
> +static int drm_gpusvm_migrate_populate_sram_pfn(struct
> vm_area_struct *vas,
> +						unsigned long
> npages,
> +						unsigned long
> *src_mpfn,
> +						unsigned long *mpfn,
> u64 addr)
> +{
> +	unsigned long i;
> +
> +	for (i = 0; i < npages; ++i, addr += PAGE_SIZE) {
> +		struct page *page;
> +
> +		if (!(src_mpfn[i] & MIGRATE_PFN_MIGRATE))
> +			continue;
> +
> +		if (vas)
> +			page = alloc_page_vma(GFP_HIGHUSER, vas,
> addr);
> +		else
> +			page = alloc_page(GFP_HIGHUSER);
> +
> +		if (!page)
> +			return -ENOMEM;
> +
> +		lock_page(page);
> +		mpfn[i] = migrate_pfn(page_to_pfn(page));
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * drm_gpusvm_evict_to_sram - Evict GPU SVM range to SRAM
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + *
> + * Similar to __drm_gpusvm_migrate_to_sram but does not require the mmap
> + * lock; migration is done via the migrate_device_* functions. This is a
> + * fallback path, as it is preferred to issue migrations with the mmap
> + * lock held.
> + *
> + * Returns:
> + * 0 on success, negative error code on failure.
> + */
> +static int drm_gpusvm_evict_to_sram(struct drm_gpusvm *gpusvm,
> +				    struct drm_gpusvm_range *range)
> +{
> +	unsigned long npages;
> +	struct page **pages;
> +	unsigned long *src, *dst;
> +	dma_addr_t *dma_addr;
> +	void *buf;
> +	int i, err = 0;
> +
> +	npages = npages_in_range(range->va.start, range->va.end);
> +
> +	buf = kvcalloc(npages, 2 * sizeof(*src) + sizeof(*dma_addr)
> +
> +		       sizeof(*pages), GFP_KERNEL);
> +	if (!buf) {
> +		err = -ENOMEM;
> +		goto err_out;
> +	}
> +	src = buf;
> +	dst = buf + (sizeof(*src) * npages);
> +	dma_addr = buf + (2 * sizeof(*src) * npages);
> +	pages = buf + (2 * sizeof(*src) + sizeof(*dma_addr)) *
> npages;
> +
> +	err = gpusvm->ops->populate_vram_pfn(gpusvm, range-
> >vram_allocation,
> +					     npages, src);
> +	if (err)
> +		goto err_free;
> +
> +	err = migrate_device_vma_range(gpusvm->mm,
> +				       gpusvm-
> >device_private_page_owner, src,
> +				       npages, range->va.start);
> +	if (err)
> +		goto err_free;
> +
> +	err = drm_gpusvm_migrate_populate_sram_pfn(NULL, npages,
> src, dst, 0);
> +	if (err)
> +		goto err_finalize;
> +
> +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> dma_addr,
> +					   dst, npages,
> DMA_BIDIRECTIONAL);
> +	if (err)
> +		goto err_finalize;
> +
> +	for (i = 0; i < npages; ++i)
> +		pages[i] = migrate_pfn_to_page(src[i]);
> +
> +	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr,
> npages);
> +	if (err)
> +		goto err_finalize;
> +
> +err_finalize:
> +	if (err)
> +		drm_gpusvm_migration_put_pages(npages, dst);
> +	migrate_device_pages(src, dst, npages);
> +	migrate_device_finalize(src, dst, npages);
> +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr,
> npages,
> +				       DMA_BIDIRECTIONAL);
> +err_free:
> +	kvfree(buf);
> +err_out:
> +
> +	return err;
> +}
> +
> +/**
> + * __drm_gpusvm_migrate_to_sram - Migrate GPU SVM range to SRAM
> (internal)
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @vas: Pointer to the VM area structure
> + * @page: Pointer to the page for fault handling (can be NULL)
> + * @start: Start address of the migration range
> + * @end: End address of the migration range
> + *
> + * This internal function performs the migration of the specified
> GPU SVM range
> + * to SRAM. It sets up the migration, populates and DMA maps SRAM
> PFNs, and
> + * invokes the driver-specific operations for migration to SRAM.
> + *
> + * Returns:
> + * 0 on success, negative error code on failure.
> + */
> +static int __drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> +					struct vm_area_struct *vas,
> +					struct page *page,
> +					u64 start, u64 end)
> +{
> +	struct migrate_vma migrate = {
> +		.vma		= vas,
> +		.pgmap_owner	= gpusvm->device_private_page_owner,
> +		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> +		.fault_page	= page,
> +	};
> +	unsigned long npages;
> +	struct page **pages;
> +	dma_addr_t *dma_addr;
> +	void *buf;
> +	int i, err = 0;
> +
> +	mmap_assert_locked(gpusvm->mm);
> +
> +	/* Corner case where the VM area struct has been partially unmapped */
> +	if (start < vas->vm_start)
> +		start = vas->vm_start;
> +	if (end > vas->vm_end)
> +		end = vas->vm_end;
> +
> +	migrate.start = start;
> +	migrate.end = end;
> +	npages = npages_in_range(start, end);
> +
> +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) +
> sizeof(*dma_addr) +
> +		       sizeof(*pages), GFP_KERNEL);
> +	if (!buf) {
> +		err = -ENOMEM;
> +		goto err_out;
> +	}
> +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr))
> * npages;
> +
> +	migrate.vma = vas;
> +	migrate.src = buf;
> +	migrate.dst = migrate.src + npages;
> +
> +	err = migrate_vma_setup(&migrate);
> +	if (err)
> +		goto err_free;
> +
> +	/* Raced with another CPU fault, nothing to do */
> +	if (!migrate.cpages)
> +		goto err_free;
> +
> +	err = drm_gpusvm_migrate_populate_sram_pfn(vas, npages,
> +						   migrate.src,
> migrate.dst,
> +						   start);
> +	if (err)
> +		goto err_finalize;
> +
> +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> dma_addr,
> +					   migrate.dst, npages,
> +					   DMA_BIDIRECTIONAL);
> +	if (err)
> +		goto err_finalize;
> +
> +	for (i = 0; i < npages; ++i)
> +		pages[i] = migrate_pfn_to_page(migrate.src[i]);
> +
> +	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr,
> npages);
> +	if (err)
> +		goto err_finalize;
> +
> +err_finalize:
> +	if (err)
> +		drm_gpusvm_migration_put_pages(npages, migrate.dst);
> +	migrate_vma_pages(&migrate);
> +	migrate_vma_finalize(&migrate);
> +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr,
> npages,
> +				       DMA_BIDIRECTIONAL);
> +err_free:
> +	kvfree(buf);
> +err_out:
> +	mmap_assert_locked(gpusvm->mm);
> +
> +	return err;
> +}
> +
> +/**
> + * drm_gpusvm_migrate_to_sram - Migrate (evict) GPU SVM range to
> SRAM
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + * @ctx: GPU SVM context
> + *
> + * This function initiates the migration of the specified GPU SVM
> range to
> + * SRAM. It performs necessary checks and invokes the internal
> migration
> + * function for actual migration.
> + *
> + * Returns:
> + * 0 on success, negative error code on failure.
> + */
> +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> +			       struct drm_gpusvm_range *range,
> +			       const struct drm_gpusvm_ctx *ctx)
> +{
> +	u64 start = range->va.start, end = range->va.end;
> +	struct mm_struct *mm = gpusvm->mm;
> +	struct vm_area_struct *vas;
> +	int err;
> +	bool retry = false;
> +
> +	if (!ctx->mmap_locked) {
> +		if (!mmget_not_zero(mm)) {
> +			err = -EFAULT;
> +			goto err_out;
> +		}
> +		if (ctx->trylock_mmap) {
> +			if (!mmap_read_trylock(mm))  {
> +				err =
> drm_gpusvm_evict_to_sram(gpusvm, range);
> +				goto err_mmput;
> +			}
> +		} else {
> +			mmap_read_lock(mm);
> +		}
> +	}
> +
> +	mmap_assert_locked(mm);
> +
> +	/*
> +	 * Loop required to find all VM area structs for the corner case
> +	 * when VRAM backing has been partially unmapped from the MM's
> +	 * address space.
> +	 */
> +again:
> +	vas = find_vma(mm, start);
> +	if (!vas) {
> +		if (!retry)
> +			err = -ENOENT;
> +		goto err_mmunlock;
> +	}
> +
> +	if (end <= vas->vm_start || start >= vas->vm_end) {
> +		if (!retry)
> +			err = -EINVAL;
> +		goto err_mmunlock;
> +	}
> +
> +	err = __drm_gpusvm_migrate_to_sram(gpusvm, vas, NULL, start,
> end);
> +	if (err)
> +		goto err_mmunlock;
> +
> +	if (vas->vm_end < end) {
> +		retry = true;
> +		start = vas->vm_end;
> +		goto again;
> +	}
> +
> +	if (!ctx->mmap_locked) {
> +		mmap_read_unlock(mm);
> +		/*
> +		 * Using mmput_async as this function can be called
> while
> +		 * holding a dma-resv lock, and a final put can grab
> the mmap
> +		 * lock, causing a lock inversion.
> +		 */
> +		mmput_async(mm);
> +	}
> +
> +	return 0;
> +
> +err_mmunlock:
> +	if (!ctx->mmap_locked)
> +		mmap_read_unlock(mm);
> +err_mmput:
> +	if (!ctx->mmap_locked)
> +		mmput_async(mm);
> +err_out:
> +	return err;
> +}
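
Eviction-path usage is presumably along these lines (sketch only; the caller
may hold a dma-resv lock, hence trylock_mmap):

	struct drm_gpusvm_ctx ctx = { .trylock_mmap = true };

	err = drm_gpusvm_migrate_to_sram(gpusvm, range, &ctx);
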
> +
> +/**
> + * drm_gpusvm_page_free - Put GPU SVM zone device data associated
> with a page
> + * @page: Pointer to the page
> + *
> + * This function is a callback used to put the GPU SVM zone device
> data
> + * associated with a page when it is being released.
> + */
> +static void drm_gpusvm_page_free(struct page *page)
> +{
> +	drm_gpusvm_zdd_put(page->zone_device_data);
> +}
> +
> +/**
> + * drm_gpusvm_migrate_to_ram - Migrate GPU SVM range to RAM (page
> fault handler)
> + * @vmf: Pointer to the fault information structure
> + *
> + * This function is a page fault handler used to migrate a GPU SVM
> range to RAM.
> + * It retrieves the GPU SVM range information from the faulting page
> and invokes
> + * the internal migration function to migrate the range back to RAM.
> + *
> + * Returns:
> + * VM_FAULT_SIGBUS on failure, 0 on success.
> + */
> +static vm_fault_t drm_gpusvm_migrate_to_ram(struct vm_fault *vmf)
> +{
> +	struct drm_gpusvm_zdd *zdd = vmf->page->zone_device_data;
> +	int err;
> +
> +	err = __drm_gpusvm_migrate_to_sram(zdd->range->gpusvm,
> +					   vmf->vma, vmf->page,
> +					   zdd->range->va.start,
> +					   zdd->range->va.end);
> +
> +	return err ? VM_FAULT_SIGBUS : 0;
> +}
> +
> +/**
> + * drm_gpusvm_pagemap_ops - Device page map operations for GPU SVM
> + */
> +static const struct dev_pagemap_ops drm_gpusvm_pagemap_ops = {
> +	.page_free = drm_gpusvm_page_free,
> +	.migrate_to_ram = drm_gpusvm_migrate_to_ram,
> +};
> +
> +/**
> + * drm_gpusvm_pagemap_ops_get - Retrieve GPU SVM device page map
> operations
> + *
> + * Returns:
> + * Pointer to the GPU SVM device page map operations structure.
> + */
> +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void)
> +{
> +	return &drm_gpusvm_pagemap_ops;
> +}
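
Driver-side sketch of onlining VRAM as device-private memory with these ops
(my_vram_region, res, dev and my_owner are hypothetical; error handling
elided):

	struct dev_pagemap *pagemap = &my_vram_region->pagemap;

	pagemap->type = MEMORY_DEVICE_PRIVATE;
	pagemap->range.start = res->start;
	pagemap->range.end = res->end;
	pagemap->nr_range = 1;
	pagemap->ops = drm_gpusvm_pagemap_ops_get();
	/* Must match the owner passed to drm_gpusvm_init(). */
	pagemap->owner = my_owner;

	addr = devm_memremap_pages(dev, pagemap);
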
> +
> +/**
> + * drm_gpusvm_has_mapping - Check if GPU SVM has mapping for the
> given address range
> + * @gpusvm: Pointer to the GPU SVM structure.
> + * @start: Start address
> + * @end: End address
> + *
> + * Returns:
> + * True if GPU SVM has mapping, False otherwise
> + */
> +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start,
> u64 end)
> +{
> +	struct drm_gpusvm_notifier *notifier;
> +
> +	drm_gpusvm_for_each_notifier(notifier, gpusvm, start, end) {
> +		struct drm_gpusvm_range *range = NULL;
> +
> +		drm_gpusvm_for_each_range(range, notifier, start,
> end)
> +			return true;
> +	}
> +
> +	return false;
> +}
> diff --git a/drivers/gpu/drm/xe/drm_gpusvm.h
> b/drivers/gpu/drm/xe/drm_gpusvm.h
> new file mode 100644
> index 000000000000..0ea70f8534a8
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/drm_gpusvm.h
> @@ -0,0 +1,415 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2024 Intel Corporation
> + */
> +
> +#ifndef __DRM_GPUSVM_H__
> +#define __DRM_GPUSVM_H__
> +
> +#include <linux/kref.h>
> +#include <linux/mmu_notifier.h>
> +#include <linux/workqueue.h>
> +
> +struct dev_pagemap_ops;
> +struct drm_device;
> +struct drm_gpusvm;
> +struct drm_gpusvm_notifier;
> +struct drm_gpusvm_ops;
> +struct drm_gpusvm_range;
> +
> +/**
> + * struct drm_gpusvm_ops - Operations structure for GPU SVM
> + *
> + * This structure defines the operations for GPU Shared Virtual
> Memory (SVM).
> + * These operations are provided by the GPU driver to manage SVM
> ranges and
> + * perform operations such as migration between VRAM and system RAM.
> + */
> +struct drm_gpusvm_ops {
> +	/**
> +	 * @notifier_alloc: Allocate a GPU SVM notifier (optional)
> +	 *
> +	 * This function shall allocate a GPU SVM notifier.
> +	 *
> +	 * Returns:
> +	 * Pointer to the allocated GPU SVM notifier on success,
> NULL on failure.
> +	 */
> +	struct drm_gpusvm_notifier *(*notifier_alloc)(void);
> +
> +	/**
> +	 * @notifier_free: Free a GPU SVM notifier (optional)
> +	 * @notifier: Pointer to the GPU SVM notifier to be freed
> +	 *
> +	 * This function shall free a GPU SVM notifier.
> +	 */
> +	void (*notifier_free)(struct drm_gpusvm_notifier *notifier);
> +
> +	/**
> +	 * @range_alloc: Allocate a GPU SVM range (optional)
> +	 * @gpusvm: Pointer to the GPU SVM
> +	 *
> +	 * This function shall allocate a GPU SVM range.
> +	 *
> +	 * Returns:
> +	 * Pointer to the allocated GPU SVM range on success, NULL
> on failure.
> +	 */
> +	struct drm_gpusvm_range *(*range_alloc)(struct drm_gpusvm
> *gpusvm);
> +
> +	/**
> +	 * @range_free: Free a GPU SVM range (optional)
> +	 * @range: Pointer to the GPU SVM range to be freed
> +	 *
> +	 * This function shall free a GPU SVM range.
> +	 */
> +	void (*range_free)(struct drm_gpusvm_range *range);
> +
> +	/**
> +	 * @vram_release: Release VRAM allocation (optional)
> +	 * @vram_allocation: Driver-private pointer to the VRAM
> allocation
> +	 *
> +	 * This function shall release the VRAM allocation and is expected
> +	 * to drop its reference to the VRAM allocation.
> +	 */
> +	void (*vram_release)(void *vram_allocation);
> +
> +	/**
> +	 * @populate_vram_pfn: Populate VRAM PFN (required for
> migration)
> +	 * @gpusvm: Pointer to the GPU SVM
> +	 * @vram_allocation: Driver-private pointer to the VRAM
> allocation
> +	 * @npages: Number of pages to populate
> +	 * @pfn: Array of page frame numbers to populate
> +	 *
> +	 * This function shall populate VRAM page frame numbers
> (PFN).
> +	 *
> +	 * Returns:
> +	 * 0 on success, a negative error code on failure.
> +	 */
> +	int (*populate_vram_pfn)(struct drm_gpusvm *gpusvm,
> +				 void *vram_allocation,
> +				 unsigned long npages,
> +				 unsigned long *pfn);
> +
> +	/**
> +	 * @copy_to_vram: Copy to VRAM (required for migration)
> +	 * @gpusvm: Pointer to the GPU SVM
> +	 * @pages: Pointer to array of VRAM pages (destination)
> +	 * @dma_addr: Pointer to array of DMA addresses (source)
> +	 * @npages: Number of pages to copy
> +	 *
> +	 * This function shall copy pages to VRAM.
> +	 *
> +	 * Returns:
> +	 * 0 on success, a negative error code on failure.
> +	 */
> +	int (*copy_to_vram)(struct drm_gpusvm *gpusvm,
> +			    struct page **pages,
> +			    dma_addr_t *dma_addr,
> +			    unsigned long npages);
> +
> +	/**
> +	 * @copy_to_sram: Copy to system RAM (required for
> migration)
> +	 * @gpusvm: Pointer to the GPU SVM
> +	 * @pages: Pointer to array of VRAM pages (source)
> +	 * @dma_addr: Pointer to array of DMA addresses
> (destination)
> +	 * @npages: Number of pages to copy
> +	 *
> +	 * This function shall copy pages to system RAM.
> +	 *
> +	 * Returns:
> +	 * 0 on success, a negative error code on failure.
> +	 */
> +	int (*copy_to_sram)(struct drm_gpusvm *gpusvm,
> +			    struct page **pages,
> +			    dma_addr_t *dma_addr,
> +			    unsigned long npages);
> +
> +	/**
> +	 * @invalidate: Invalidate GPU SVM notifier (required)
> +	 * @gpusvm: Pointer to the GPU SVM
> +	 * @notifier: Pointer to the GPU SVM notifier
> +	 * @mmu_range: Pointer to the mmu_notifier_range structure
> +	 *
> +	 * This function shall invalidate the GPU page tables. It
> can safely
> +	 * walk the notifier range RB tree/list in this function.
> Called while
> +	 * holding the notifier lock.
> +	 */
> +	void (*invalidate)(struct drm_gpusvm *gpusvm,
> +			   struct drm_gpusvm_notifier *notifier,
> +			   const struct mmu_notifier_range
> *mmu_range);
> +};
> +
> +/**
> + * struct drm_gpusvm_notifier - Structure representing a GPU SVM
> notifier
> + *
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: MMU interval notifier
> + * @interval: Interval for the notifier
> + * @rb: Red-black tree node for the parent GPU SVM structure
> notifier tree
> + * @root: Cached root node of the RB tree containing ranges
> + * @range_list: List head of ranges in the same order
> they appear in
> + *              interval tree. This is useful to keep iterating
> ranges while
> + *              doing modifications to RB tree.
> + * @flags.removed: Flag indicating whether the MMU interval notifier
> has been
> + *                 removed
> + *
> + * This structure represents a GPU SVM notifier.
> + */
> +struct drm_gpusvm_notifier {
> +	struct drm_gpusvm *gpusvm;
> +	struct mmu_interval_notifier notifier;
> +	struct {
> +		u64 start;
> +		u64 end;
> +	} interval;
> +	struct {
> +		struct rb_node node;
> +		struct list_head entry;
> +		u64 __subtree_last;
> +	} rb;
> +	struct rb_root_cached root;
> +	struct list_head range_list;
> +	struct {
> +		u32 removed : 1;
> +	} flags;
> +};
> +
> +/**
> + * struct drm_gpusvm_range - Structure representing a GPU SVM range
> + *
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: Pointer to the GPU SVM notifier
> + * @refcount: Reference count for the range
> + * @rb: Red-black tree node for the parent GPU SVM notifier
> structure range tree
> + * @va: Virtual address range
> + * @notifier_seq: Notifier sequence number of the range's pages
> + * @pages: Pointer to the array of pages (if backing store is in
> VRAM)
> + * @dma_addr: DMA address array (if backing store is SRAM and DMA
> mapped)
> + * @vram_allocation: Driver-private pointer to the VRAM allocation
> + * @order: Order of dma mapping. i.e. PAGE_SIZE << order is mapping
> size
> + * @flags.migrate_vram: Flag indicating whether the range can be
> migrated to VRAM
> + * @flags.unmapped: Flag indicating if the range has been unmapped
> + * @flags.partial_unmap: Flag indicating if the range has been
> partially unmapped
> + * @flags.has_vram_pages: Flag indicating if the range has vram
> pages
> + * @flags.has_dma_mapping: Flag indicating if the range has a DMA
> mapping
> + * @flags.kfree_mapping: Flag indicating @dma_addr is a compact
> allocation based
> + *                       on @order which releases via kfree
> + *
> + * This structure represents a GPU SVM range used for tracking
> memory ranges
> + * mapped in a DRM device.
> + */
> +struct drm_gpusvm_range {
> +	struct drm_gpusvm *gpusvm;
> +	struct drm_gpusvm_notifier *notifier;
> +	struct kref refcount;
> +	struct {
> +		struct rb_node node;
> +		struct list_head entry;
> +		u64 __subtree_last;
> +	} rb;
> +	struct {
> +		u64 start;
> +		u64 end;
> +	} va;
> +	unsigned long notifier_seq;
> +	union {
> +		struct page **pages;
> +		dma_addr_t *dma_addr;
> +	};
> +	void *vram_allocation;
> +	u16 order;
> +	struct {
> +		/* All flags below must be set upon creation */
> +		u16 migrate_vram : 1;
> +		/* All flags below must be set / cleared under
> notifier lock */
> +		u16 unmapped : 1;
> +		u16 partial_unmap : 1;
> +		u16 has_vram_pages : 1;
> +		u16 has_dma_mapping : 1;
> +		u16 kfree_mapping : 1;
> +	} flags;
> +};
> +
> +/**
> + * struct drm_gpusvm - GPU SVM structure
> + *
> + * @name: Name of the GPU SVM
> + * @drm: Pointer to the DRM device structure
> + * @mm: Pointer to the mm_struct for the address space
> + * @device_private_page_owner: Device private pages owner
> + * @mm_start: Start address of GPU SVM
> + * @mm_range: Range of the GPU SVM
> + * @notifier_size: Size of individual notifiers
> + * @ops: Pointer to the operations structure for GPU SVM
> + * @chunk_sizes: Pointer to the array of chunk sizes used in range
> allocation.
> + *               Entries should be powers of 2 in descending order.
> + * @num_chunks: Number of chunks
> + * @notifier_lock: Read-write semaphore for protecting notifier
> operations
> + * @zdd_wq: Workqueue for deferred work on zdd destruction
> + * @root: Cached root node of the Red-Black tree containing GPU SVM
> notifiers
> + * @notifier_list: List head of notifiers in the same
> order they
> + *                 appear in interval tree. This is useful to keep
> iterating
> + *                 notifiers while doing modifications to RB tree.
> + *
> + * This structure represents a GPU SVM (Shared Virtual Memory) used
> for tracking
> + * memory ranges mapped in a DRM (Direct Rendering Manager) device.
> + *
> + * No reference counting is provided, as this is expected to be
> embedded in the
> + * driver VM structure along with the struct drm_gpuvm, which
> handles reference
> + * counting.
> + */
> +struct drm_gpusvm {
> +	const char *name;
> +	struct drm_device *drm;
> +	struct mm_struct *mm;
> +	void *device_private_page_owner;
> +	u64 mm_start;
> +	u64 mm_range;
> +	u64 notifier_size;
> +	const struct drm_gpusvm_ops *ops;
> +	const u64 *chunk_sizes;
> +	int num_chunks;
> +	struct rw_semaphore notifier_lock;
> +	struct workqueue_struct *zdd_wq;
> +	struct rb_root_cached root;
> +	struct list_head notifier_list;
> +};
> +
> +/**
> + * struct drm_gpusvm_ctx - DRM GPU SVM context
> + *
> + * @mmap_locked: mmap lock is locked
> + * @trylock_mmap: trylock mmap lock, used to avoid locking
> inversions
> + *                (e.g. dma-resv -> mmap lock)
> + * @in_notifier: entering from a MMU notifier
> + * @read_only: operating on read-only memory
> + * @vram_possible: possible to use VRAM
> + * @prefault: prefault pages
> + *
> + * Context that DRM GPU SVM is operating in (i.e. user arguments).
> + */
> +struct drm_gpusvm_ctx {
> +	u32 mmap_locked :1;
> +	u32 trylock_mmap :1;
> +	u32 in_notifier :1;
> +	u32 read_only :1;
> +	u32 vram_possible :1;
> +	u32 prefault :1;
> +};
> +
> +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> +		    const char *name, struct drm_device *drm,
> +		    struct mm_struct *mm, void
> *device_private_page_owner,
> +		    u64 mm_start, u64 mm_range, u64 notifier_size,
> +		    const struct drm_gpusvm_ops *ops,
> +		    const u64 *chunk_sizes, int num_chunks);
> +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm);
> +void drm_gpusvm_free(struct drm_gpusvm *gpusvm);
> +
> +struct drm_gpusvm_range *
> +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64
> fault_addr,
> +				u64 gpuva_start, u64 gpuva_end,
> +				const struct drm_gpusvm_ctx *ctx);
> +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> +			     struct drm_gpusvm_range *range);
> +
> +struct drm_gpusvm_range *
> +drm_gpusvm_range_get(struct drm_gpusvm_range *range);
> +void drm_gpusvm_range_put(struct drm_gpusvm_range *range);
> +
> +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> +				  struct drm_gpusvm_range *range);
> +
> +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> +			       struct drm_gpusvm_range *range,
> +			       const struct drm_gpusvm_ctx *ctx);
> +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> +				  struct drm_gpusvm_range *range,
> +				  const struct drm_gpusvm_ctx *ctx);
> +
> +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> +			       struct drm_gpusvm_range *range,
> +			       void *vram_allocation,
> +			       const struct drm_gpusvm_ctx *ctx);
> +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> +			       struct drm_gpusvm_range *range,
> +			       const struct drm_gpusvm_ctx *ctx);
> +
> +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void);
> +
> +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start,
> u64 end);
> +
> +struct drm_gpusvm_range *
> +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64
> start, u64 end);
> +
> +/**
> + * drm_gpusvm_notifier_lock - Lock GPU SVM notifier
> + * @gpusvm__: Pointer to the GPU SVM structure.
> + *
> + * Abstract client usage GPU SVM notifier lock, take lock
> + */
> +#define drm_gpusvm_notifier_lock(gpusvm__)	\
> +	down_read(&(gpusvm__)->notifier_lock)
> +
> +/**
> + * drm_gpusvm_notifier_unlock - Unlock GPU SVM notifier
> + * @gpusvm__: Pointer to the GPU SVM structure.
> + *
> + * Abstract client usage GPU SVM notifier lock, drop lock
> + */
> +#define drm_gpusvm_notifier_unlock(gpusvm__)	\
> +	up_read(&(gpusvm__)->notifier_lock)
> +
> +/**
> + * __drm_gpusvm_range_next - Get the next GPU SVM range in the list
> + * @range: a pointer to the current GPU SVM range
> + *
> + * Return: A pointer to the next drm_gpusvm_range if available, or
> NULL if the
> + *         current range is the last one or if the input range is
> NULL.
> + */
> +static inline struct drm_gpusvm_range *
> +__drm_gpusvm_range_next(struct drm_gpusvm_range *range)
> +{
> +	if (range && !list_is_last(&range->rb.entry,
> +				   &range->notifier->range_list))
> +		return list_next_entry(range, rb.entry);
> +
> +	return NULL;
> +}
> +
> +/**
> + * drm_gpusvm_for_each_range - Iterate over GPU SVM ranges in a
> notifier
> + * @range__: Iterator variable for the ranges. If set, it indicates
> the start of
> + *	     the iterator. If NULL, call drm_gpusvm_range_find() to
> get the range.
> + * @notifier__: Pointer to the GPU SVM notifier
> + * @start__: Start address of the range
> + * @end__: End address of the range
> + *
> + * This macro is used to iterate over GPU SVM ranges in a notifier.
> It is safe
> + * to use while holding the driver SVM lock or the notifier lock.
> + */
> +#define drm_gpusvm_for_each_range(range__, notifier__, start__,
> end__)	\
> +	for ((range__) = (range__)
> ?:					\
> +	     drm_gpusvm_range_find((notifier__), (start__),
> (end__));	\
> +	     (range__) && (range__->va.start <
> (end__));		\
> +	     (range__) = __drm_gpusvm_range_next(range__))
> +
> +/**
> + * drm_gpusvm_range_set_unmapped - Mark a GPU SVM range as unmapped
> + * @range: Pointer to the GPU SVM range structure.
> + * @mmu_range: Pointer to the MMU notifier range structure.
> + *
> + * This function marks a GPU SVM range as unmapped and sets the
> partial_unmap flag
> + * if the range partially falls within the provided MMU notifier
> range.
> + */
> +static inline void
> +drm_gpusvm_range_set_unmapped(struct drm_gpusvm_range *range,
> +			      const struct mmu_notifier_range
> *mmu_range)
> +{
> +	lockdep_assert_held_write(&range->gpusvm->notifier_lock);
> +
> +	range->flags.unmapped = true;
> +	if (range->va.start < mmu_range->start ||
> +	    range->va.end > mmu_range->end)
> +		range->flags.partial_unmap = true;
> +}
> +
> +#endif /* __DRM_GPUSVM_H__ */


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 23/28] drm/xe: Add SVM VRAM migration
  2024-08-28 16:06   ` Daniel Vetter
  2024-08-28 18:22     ` Daniel Vetter
@ 2024-08-29  9:24     ` Christian König
  2024-08-29  9:53       ` Thomas Hellström
  2024-08-29 21:48       ` Matthew Brost
  1 sibling, 2 replies; 100+ messages in thread
From: Christian König @ 2024-08-29  9:24 UTC (permalink / raw)
  To: Daniel Vetter, Matthew Brost
  Cc: intel-xe, dri-devel, airlied, thomas.hellstrom, matthew.auld,
	daniel, Paneer Selvam, Arunpravin

[-- Attachment #1: Type: text/plain, Size: 12071 bytes --]

Am 28.08.24 um 18:06 schrieb Daniel Vetter:
> On Tue, Aug 27, 2024 at 07:48:56PM -0700, Matthew Brost wrote:
>> Migration is implemented with range granularity, with VRAM backing being
>> a VM private TTM BO (i.e., shares dma-resv with VM). The lifetime of the
>> TTM BO is limited to when the SVM range is in VRAM (i.e., when a VRAM
>> SVM range is migrated to SRAM, the TTM BO is destroyed).
>>
>> The design choice for using TTM BO for VRAM backing store, as opposed to
>> direct buddy allocation, is as follows:
>>
>> - DRM buddy allocations are not at page granularity, offering no
>>    advantage over a BO.
> This one I'm not understanding.

Adding Arun as well. I couldn't understand it fully either, but maybe 
it's because the buddy allocator is more optimized for higher orders of 
allocations?
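
Just so we are talking about the same thing, this is roughly the
page-granularity buddy usage I have in mind (sketch only, vram_size is a
placeholder, and I might be misreading what "page granularity" means in
the commit message):

	struct drm_buddy mm;
	LIST_HEAD(blocks);
	int err;

	/* buddy manager with a 4K minimum chunk size */
	err = drm_buddy_init(&mm, vram_size, SZ_4K);
	if (err)
		return err;

	/* a single order-0 (4K) block anywhere in [0, vram_size) */
	err = drm_buddy_alloc_blocks(&mm, 0, vram_size, SZ_4K, SZ_4K,
				     &blocks, 0);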

>
>> - DRM buddy allocations do not solve locking inversion problems between
>>    mmap lock and dma-resv locks.
> Which mmap -> dma_resv inversion? I've seen a lot ... I guess it also
> matters hugely which migration path we're in, i.e. opportunistic
> migration, cpu fault where we have to migrate or die, or when we run out
> of vram and need to evict stuff to make space.

Mhm I think the locking order between mmap lock and dma-resv lock is 
well defined since dma_resv_lockdep() was added.

>> - Unified eviction is required (SVM VRAM and TTM BOs need to be able to
>>    evict each other).
> So core mm handles this by just roughly equally shrinking everything.
> Seems to work, and it has a pile of object shrinkers, and the page lru is
> also split into page cache and anon memory.
>
> I think you need to put in more justification that unified eviction is
> required than just stating it, because a look at mm/ gives a very well
> established counterexample.
>
>> - For exhaustive eviction [1], SVM VRAM allocations will almost certainly
>>    require a dma-resv.
> So from the TTM side we need exhaustive eviction, or at least something a
> bit more exhaustive than what ttm currently has. Note that i915-gem also
> never really got to perfect exhaustive eviction, it's just a pile better
> than ttm right now.

Please define what exhaustive eviction should mean? I think I know what 
it is and I have been pushing TTM in the direction of solving this for
years.

The last missing puzzle piece is to use drm_exec for TTM evictions, but 
apart from that everything should work now.
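
Something along these lines is what I have in mind, purely as a sketch
(driver_evict_locked_bo() is a made-up placeholder, not an existing
helper):

	struct drm_exec exec;
	int ret = 0;

	/* lock the eviction victim with full ww-mutex back-off instead of
	 * the trylock-only path TTM uses today */
	drm_exec_init(&exec, DRM_EXEC_INTERRUPTIBLE_WAIT, 0);
	drm_exec_until_all_locked(&exec) {
		ret = drm_exec_lock_obj(&exec, &victim_bo->base);
		drm_exec_retry_on_contention(&exec);
		if (ret)
			break;
	}
	if (!ret)
		ret = driver_evict_locked_bo(victim_bo);
	drm_exec_fini(&exec);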

Regards,
Christian.

> Now if there's also SVM VRAM managed on a page lru, TTM exhaustive
> eviction is going to win because the shrinkers can only trylock dma_resv.
> So this part works. It actually works so well on the system memory side
> that if we're not careful we can trigger oom, because we're too good at
> getting at all the memory.
>
> SVM VRAM allocations otoh do not need exhaustive evictions. Or at least I
> don't see why, because the idea is that thanks to gpu and cpu page faults,
> you can always get out of a pinch by just trashing everything for a while
> and migrating the handfull of available pages a lot.
>
>> - Likely allocation size is 2M which makes the size of a BO (872)
>>    acceptable per allocation (872 / 2M == .0004158).
>>
>> With this, using TTM BO for VRAM backing store seems to be an obvious
>> choice as it allows leveraging of the TTM eviction code.
> Except it requires that you hold dma_resv, which brings in all kinds of
> pain. And for eviction we really don't need a lot of synchronization, so a
> lot of that locking is not needed, unlike the case where we have a cpu
> fault, where we absolutely need mmap_lock and all that to make sure we
> fault in the right page.
>
> But for eviction we only need to throw out some pages, if we're not
> entirely precise with picking the right ones (or have no idea which
> vma they're all currently mapped into) it doesn't matter. That's why
> migrate_device_pages doesn't care about any of that at all, it doesn't
> need to by design. But by bo backing memory you drag in all that stuff
> that's causing headaches for eviction.
>
> The only thing migration tries to do is remove all pte, and if that
> succeeds, move the page. Specialized for the gpusvm case, looking at mm/
> code as cheat sheet, we need roughly:
>
> - reverse mapping structure like anon_vma. Except gpusvm can assume that
>    there's currently only one gpu side mapping, so we can just stuff the
>    gpusvm and va_address into the page, and protect it with the page lock.
>
> - we need pagetable locks, so that we can manipulate pagetables (well
>    specifically make ptes invalid) without taking any other locks.
>
> - everyone else inserting or removing ptes for svm mappings also needs to
>    lock the page, or we have races. This might be the hmm_range_fault races
>    you're seeing when allowing vram pages, since I don't think there's
>    anything else stopping the page lookup otherwise from succeeding.
>
> - we might also need to stuff migrate ptes into the gpu side, like the cpu
>    does, to hold up refaults before the migration has finished. But I think
>    those are only needed for anon memory in sram because there's no other
>    way to find the right page than swap pte entries, of which migration
>    entries are a special case.
>
> - core code also expects us to handle the page refcount correctly for svm
>    device memory, so we can't free the pages like normal bo pages either
>    directly to drm_buddy.
>
> Now typing this all up will look an awful lot like what you have, with the
> dma_resv lock serving as the page lock and the pagetable lock. The only
> reason is that these locks are much smaller and nest within all the other
> stuff going on and so avoid the inversion issues.
>
> So one annoying part is that this is a lot of pointless-looking typing.
> The other is that it's full of races, because core mm really is yolo all
> the way down. So lots of ways you lock the wrong page and fun stuff like
> that, but the few cases that matter work:
>
> - svm fault handling with hmm_range_fault retries with mmu notifiers. Note
>    that we need to have vram pages locked and the notifier retry needs to
>    be under the pagetable lock, or there's room to escape. At least that's
>    what I came up with last time I thought it all through.
>
> - migrate_to_ram: it will hold a page reference which we know was the
>    valid vram page when the cpu pte was locked, but it might not be it
>    anymore. So we have to lock the page and check whether it's still gpu
>    mapped, and if not retry the entire fault since most likely another
>    migrate_to_ram has succeeded meanwhile in parallel.
>
> - for eviction we don't care, we might actually be migrating a page no one
>    even wants anymore.
>
> Now I think you can get all this done with the dma_resv lock and maybe the
> bo refcount. But it does involve a tremendous amount of headaches and
> impedance mismatch, because that's not how page faults and migrations
> work in core mm.
>
> Cheers, Sima
>
>> Current migration policy is migrate any SVM range greater than or equal
>> to 64k once.
>>
>> [1] https://patchwork.freedesktop.org/series/133643/
>>
>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>> ---
>>   drivers/gpu/drm/xe/xe_svm.c | 81 ++++++++++++++++++++++++++++++++++++-
>>   drivers/gpu/drm/xe/xe_svm.h |  1 +
>>   2 files changed, 81 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
>> index 4372c02a341f..fd8987e0a506 100644
>> --- a/drivers/gpu/drm/xe/xe_svm.c
>> +++ b/drivers/gpu/drm/xe/xe_svm.c
>> @@ -217,8 +217,13 @@ static void xe_svm_invalidate(struct drm_gpusvm *gpusvm,
>>   static int __xe_svm_garbage_collector(struct xe_vm *vm,
>>   				      struct xe_svm_range *range)
>>   {
>> +	struct drm_gpusvm_ctx ctx = {};
>>   	struct dma_fence *fence;
>>   
>> +	/* Evict any pages holding references to vram allocation */
>> +	if (range->base.flags.partial_unmap && IS_DGFX(vm->xe))
>> +		drm_gpusvm_migrate_to_sram(&vm->svm.gpusvm, &range->base, &ctx);
>> +
>>   	xe_vm_lock(vm, false);
>>   	fence = xe_vm_range_unbind(vm, range);
>>   	xe_vm_unlock(vm);
>> @@ -504,21 +509,77 @@ static bool xe_svm_range_is_valid(struct xe_svm_range *range,
>>   	return (range->tile_present & ~range->tile_invalidated) & BIT(tile->id);
>>   }
>>   
>> +static struct xe_mem_region *tile_to_mr(struct xe_tile *tile)
>> +{
>> +	return &tile->mem.vram;
>> +}
>> +
>> +static struct xe_bo *xe_svm_alloc_vram(struct xe_vm *vm, struct xe_tile *tile,
>> +				       struct xe_svm_range *range,
>> +				       const struct drm_gpusvm_ctx *ctx)
>> +{
>> +	struct xe_mem_region *mr = tile_to_mr(tile);
>> +	struct drm_buddy_block *block;
>> +	struct list_head *blocks;
>> +	struct xe_bo *bo;
>> +	ktime_t end = 0;
>> +	int err;
>> +
>> +retry:
>> +	xe_vm_lock(vm, false);
>> +	bo = xe_bo_create(tile_to_xe(tile), tile, vm, range->base.va.end -
>> +			  range->base.va.start, ttm_bo_type_device,
>> +			  XE_BO_FLAG_VRAM_IF_DGFX(tile) |
>> +			  XE_BO_FLAG_SYSTEM_ALLOC | XE_BO_FLAG_SKIP_CLEAR);
>> +	xe_vm_unlock(vm);
>> +	if (IS_ERR(bo)) {
>> +		err = PTR_ERR(bo);
>> +		if (xe_vm_validate_should_retry(NULL, err, &end))
>> +			goto retry;
>> +		return bo;
>> +	}
>> +
>> +	blocks = &to_xe_ttm_vram_mgr_resource(bo->ttm.resource)->blocks;
>> +	list_for_each_entry(block, blocks, link)
>> +		block->private = mr;
>> +
>> +	/*
>> +	 * Take ref because as soon as drm_gpusvm_migrate_to_vram succeeds the
>> +	 * creation ref can be dropped upon CPU fault or unmap.
>> +	 */
>> +	xe_bo_get(bo);
>> +
>> +	err = drm_gpusvm_migrate_to_vram(&vm->svm.gpusvm, &range->base,
>> +					 bo, ctx);
>> +	if (err) {
>> +		xe_bo_put(bo);	/* Local ref */
>> +		xe_bo_put(bo);	/* Creation ref */
>> +		return ERR_PTR(err);
>> +	}
>> +
>> +	return bo;
>> +}
>> +
>>   int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
>>   			    struct xe_tile *tile, u64 fault_addr,
>>   			    bool atomic)
>>   {
>> -	struct drm_gpusvm_ctx ctx = { .read_only = xe_vma_read_only(vma), };
>> +	struct drm_gpusvm_ctx ctx = { .read_only = xe_vma_read_only(vma),
>> +		.vram_possible = IS_DGFX(vm->xe), };
>>   	struct xe_svm_range *range;
>>   	struct drm_gpusvm_range *r;
>>   	struct drm_exec exec;
>>   	struct dma_fence *fence;
>> +	struct xe_bo *bo = NULL;
>>   	ktime_t end = 0;
>>   	int err;
>>   
>>   	lockdep_assert_held_write(&vm->lock);
>>   
>>   retry:
>> +	xe_bo_put(bo);
>> +	bo = NULL;
>> +
>>   	/* Always process UNMAPs first so view SVM ranges is current */
>>   	err = xe_svm_garbage_collector(vm);
>>   	if (err)
>> @@ -534,6 +595,22 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
>>   	if (xe_svm_range_is_valid(range, tile))
>>   		return 0;
>>   
>> +	/* XXX: Add migration policy, for now migrate range once */
>> +	if (IS_DGFX(vm->xe) && !range->migrated &&
>> +	    range->base.flags.migrate_vram &&
>> +	    (range->base.va.end - range->base.va.start) >= SZ_64K) {
>> +		range->migrated = true;
>> +
>> +		bo = xe_svm_alloc_vram(vm, tile, range, &ctx);
>> +		if (IS_ERR(bo)) {
>> +			drm_info(&vm->xe->drm,
>> +				 "VRAM allocation failed, falling back to retrying, asid=%u, errno %ld\n",
>> +				 vm->usm.asid, PTR_ERR(bo));
>> +			bo = NULL;
>> +			goto retry;
>> +		}
>> +	}
>> +
>>   	err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r, &ctx);
>>   	if (err == -EFAULT || err == -EPERM)	/* Corner where CPU mappings have change */
>>   	       goto retry;
>> @@ -567,6 +644,8 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
>>   	dma_fence_put(fence);
>>   
>>   err_out:
>> +	xe_bo_put(bo);
>> +
>>   	return err;
>>   }
>>   
>> diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
>> index 8b72e91cc37d..3f432483a230 100644
>> --- a/drivers/gpu/drm/xe/xe_svm.h
>> +++ b/drivers/gpu/drm/xe/xe_svm.h
>> @@ -18,6 +18,7 @@ struct xe_svm_range {
>>   	struct list_head garbage_collector_link;
>>   	u8 tile_present;
>>   	u8 tile_invalidated;
>> +	u8 migrated	:1;
>>   };
>>   
>>   int xe_devm_add(struct xe_tile *tile, struct xe_mem_region *mr);
>> -- 
>> 2.34.1
>>

[-- Attachment #2: Type: text/html, Size: 13405 bytes --]

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory
  2024-08-28  2:48 ` [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory Matthew Brost
                     ` (2 preceding siblings ...)
  2024-08-29  9:16   ` Thomas Hellström
@ 2024-08-29  9:45   ` Daniel Vetter
  2024-08-29 17:27     ` Matthew Brost
  2024-08-30  9:16   ` Thomas Hellström
                     ` (3 subsequent siblings)
  7 siblings, 1 reply; 100+ messages in thread
From: Daniel Vetter @ 2024-08-29  9:45 UTC (permalink / raw)
  To: Matthew Brost
  Cc: intel-xe, dri-devel, airlied, christian.koenig, thomas.hellstrom,
	matthew.auld, daniel

On Tue, Aug 27, 2024 at 07:48:38PM -0700, Matthew Brost wrote:
> This patch introduces support for GPU Shared Virtual Memory (SVM) in the
> Direct Rendering Manager (DRM) subsystem. SVM allows for seamless
> sharing of memory between the CPU and GPU, enhancing performance and
> flexibility in GPU computing tasks.
> 
> The patch adds the necessary infrastructure for SVM, including data
> structures and functions for managing SVM ranges and notifiers. It also
> provides mechanisms for allocating, deallocating, and migrating memory
> regions between system RAM and GPU VRAM.
> 
> This mid-layer is largely inspired by GPUVM.
> 
> Cc: Dave Airlie <airlied@redhat.com>
> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: <dri-devel@lists.freedesktop.org>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>

Still not sure I've got the right race that you paper over with
mmap_write_lock, but I spotted a few things, comments inline.

> ---
>  drivers/gpu/drm/xe/Makefile     |    3 +-
>  drivers/gpu/drm/xe/drm_gpusvm.c | 2174 +++++++++++++++++++++++++++++++
>  drivers/gpu/drm/xe/drm_gpusvm.h |  415 ++++++
>  3 files changed, 2591 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.c
>  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.h
> 
> diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> index b9670ae09a9e..b8fc2ee58f1a 100644
> --- a/drivers/gpu/drm/xe/Makefile
> +++ b/drivers/gpu/drm/xe/Makefile
> @@ -25,7 +25,8 @@ $(obj)/generated/%_wa_oob.c $(obj)/generated/%_wa_oob.h: $(obj)/xe_gen_wa_oob \
>  
>  # core driver code
>  
> -xe-y += xe_bb.o \
> +xe-y += drm_gpusvm.o \
> +	xe_bb.o \
>  	xe_bo.o \
>  	xe_bo_evict.o \
>  	xe_devcoredump.o \
> diff --git a/drivers/gpu/drm/xe/drm_gpusvm.c b/drivers/gpu/drm/xe/drm_gpusvm.c
> new file mode 100644
> index 000000000000..fc1e44e6ae72
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/drm_gpusvm.c
> @@ -0,0 +1,2174 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2024 Intel Corporation
> + *
> + * Authors:
> + *     Matthew Brost <matthew.brost@intel.com>
> + */
> +
> +#include <linux/dma-mapping.h>
> +#include <linux/interval_tree_generic.h>
> +#include <linux/hmm.h>
> +#include <linux/memremap.h>
> +#include <linux/migrate.h>
> +#include <linux/mm_types.h>
> +#include <linux/pagemap.h>
> +#include <linux/slab.h>
> +
> +#include <drm/drm_device.h>
> +#include "drm_gpusvm.h"
> +
> +/**
> + * DOC: Overview
> + *
> + * GPU Shared Virtual Memory (GPU SVM) layer for the Direct Rendering Manager (DRM)
> + *
> + * The GPU SVM layer is a component of the DRM framework designed to manage shared
> + * virtual memory between the CPU and GPU. It enables efficient data exchange and
> + * processing for GPU-accelerated applications by allowing memory sharing and
> + * synchronization between the CPU's and GPU's virtual address spaces.
> + *
> + * Key GPU SVM Components:
> + * - Notifiers: Used for tracking memory intervals and notifying the
> + *		GPU of changes, notifiers are sized based on a GPU SVM
> + *		initialization parameter, with a recommendation of 512M or
> + *		larger. They maintain a Red-Black tree and a list of ranges that
> + *		fall within the notifier interval. Notifiers are tracked within
> + *		a GPU SVM Red-Black tree and list and are dynamically inserted
> + *		or removed as ranges within the interval are created or
> + *		destroyed.
> + * - Ranges: Represent memory ranges mapped in a DRM device and managed
> + *	     by GPU SVM. They are sized based on an array of chunk sizes, which
> + *	     is a GPU SVM initialization parameter, and the CPU address space.
> + *	     Upon GPU fault, the largest aligned chunk that fits within the
> + *	     faulting CPU address space is chosen for the range size. Ranges are
> + *	     expected to be dynamically allocated on GPU fault and removed on an
> + *	     MMU notifier UNMAP event. As mentioned above, ranges are tracked in
> + *	     a notifier's Red-Black tree.
> + * - Operations: Define the interface for driver-specific SVM operations such as
> + *		 allocation, page collection, migration, invalidations, and VRAM
> + *		 release.
> + *
> + * This layer provides interfaces for allocating, mapping, migrating, and
> + * releasing memory ranges between the CPU and GPU. It handles all core memory
> + * management interactions (DMA mapping, HMM, and migration) and provides
> + * driver-specific virtual functions (vfuncs). This infrastructure is sufficient
> + * to build the expected driver components for an SVM implementation as detailed
> + * below.
> + *
> + * Expected Driver Components:
> + * - GPU page fault handler: Used to create ranges and notifiers based on the
> + *			     fault address, optionally migrate the range to
> + *			     VRAM, and create GPU bindings.
> + * - Garbage collector: Used to destroy GPU bindings for ranges. Ranges are
> + *			expected to be added to the garbage collector upon
> + *			MMU_NOTIFY_UNMAP event.
> + */
> +
> +/**
> + * DOC: Locking
> + *
> + * GPU SVM handles locking for core MM interactions, i.e., it locks/unlocks the
> + * mmap lock as needed. Alternatively, if the driver prefers to handle the mmap
> + * lock itself, a 'locked' argument is provided to the functions that require
> + * the mmap lock. This option may be useful for drivers that need to call into
> + * GPU SVM while also holding a dma-resv lock, thus preventing locking
> + * inversions between the mmap and dma-resv locks.
> + *
> + * GPU SVM introduces a global notifier lock, which safeguards the notifier's
> + * range RB tree and list, as well as the range's DMA mappings and sequence
> + * number. GPU SVM manages all necessary locking and unlocking operations,
> + * except for the recheck of the range's sequence number
> + * (mmu_interval_read_retry) when the driver is committing GPU bindings. This
> + * lock corresponds to the 'driver->update' lock mentioned in the HMM
> + * documentation (TODO: Link). Future revisions may transition from a GPU SVM
> + * global lock to a per-notifier lock if finer-grained locking is deemed
> + * necessary.
> + *
> + * In addition to the locking mentioned above, the driver should implement a
> + * lock to safeguard core GPU SVM function calls that modify state, such as
> + * drm_gpusvm_range_find_or_insert and drm_gpusvm_range_remove. Alternatively,
> + * these core functions can be called within a single kernel thread, for
> + * instance, using an ordered work queue. This lock is denoted as
> + * 'driver_svm_lock' in code examples.

I think this doesn't work, because essentially it forces a single threaded
design. Core mm isn't single threaded, and you cannot lock them all out,
at least not easily.

So I think a design requirement is that gpusvm can cope with migrations to
ram due to cpu faults, migrations for other reasons, gpu fault handling
all concurrently. Currently with the combo of driver_svm_lock + taking
mmap_write_lock you serialize this all a lot, which I think is hiding
design bugs.
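
For reference, the retry dance documented in Documentation/mm/hmm.rst is
what copes with all of that concurrency without a driver-wide lock;
take_lock()/release_lock() below stand in for whatever fine-grained lock
protects the gpu pagetables (the notifier lock here):

	again:
		range.notifier_seq = mmu_interval_read_begin(&interval_sub);
		mmap_read_lock(mm);
		ret = hmm_range_fault(&range);
		mmap_read_unlock(mm);
		if (ret) {
			if (ret == -EBUSY)
				goto again;
			return ret;
		}

		take_lock(driver->update);
		if (mmu_interval_read_retry(&interval_sub,
					    range.notifier_seq)) {
			release_lock(driver->update);
			goto again;
		}
		/* commit gpu pagetables from range.hmm_pfns; a concurrent
		 * invalidation or migration just forces another pass */
		release_lock(driver->update);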

> + */
> +
> +/**
> + * DOC: Migration
> + *
> + * The migration support is quite simple, allowing migration between SRAM and
> + * VRAM at the range granularity. For example, GPU SVM currently does not
> + * support mixing SRAM and VRAM pages within a range. This means that upon GPU
> + * fault, the entire range can be migrated to VRAM, and upon CPU fault, the
> + * entire range is migrated to SRAM.
> + *
> + * The reasoning for only supporting range granularity is as follows: it
> + * simplifies the implementation, and range sizes are driver-defined and should
> + * be relatively small.
> + */
> +
> +/**
> + * DOC: Partial Unmapping of Ranges
> + *
> + * Partial unmapping of ranges (e.g., 1M out of 2M is unmapped by CPU resulting
> + * in MMU_NOTIFY_UNMAP event) presents several challenges, with the main one
> + * being that a subset of the range still has CPU and GPU mappings. If the
> + * backing store for the range is in VRAM, a subset of the backing store has
> + * references. One option would be to split the range and VRAM backing store,
> + * but the implementation for this would be quite complicated. Given that
> + * partial unmappings are rare and driver-defined range sizes are relatively
> + * small, GPU SVM does not support splitting of ranges.
> + *
> + * With no support for range splitting, upon partial unmapping of a range, the
> + * driver is expected to invalidate and destroy the entire range. If the range
> + * has VRAM as its backing, the driver is also expected to migrate any remaining
> + * pages back to SRAM.
> + */
> +
> +/**
> + * DOC: Examples
> + *
> + * This section provides two examples of how to build the expected driver
> + * components: the GPU page fault handler and the garbage collector. A third
> + * example demonstrates a sample invalidation driver vfunc.
> + *
> + * The generic code provided does not include logic for complex migration
> + * policies, optimized invalidations, or other potentially required driver
> + * locking (e.g., DMA-resv locks).
> + *
> + * 1) GPU page fault handler
> + *
> + *	int driver_bind_range(struct drm_gpusvm *gpusvm, struct drm_gpusvm_range *range)
> + *	{
> + *		int err = 0;
> + *
> + *		driver_alloc_and_setup_memory_for_bind(gpusvm, range);
> + *
> + *		drm_gpusvm_notifier_lock(gpusvm);
> + *		if (drm_gpusvm_range_pages_valid(range))
> + *			driver_commit_bind(gpusvm, range);
> + *		else
> + *			err = -EAGAIN;
> + *		drm_gpusvm_notifier_unlock(gpusvm);
> + *
> + *		return err;
> + *	}
> + *
> + *	int driver_gpu_fault(struct drm_gpusvm *gpusvm, u64 fault_addr,
> + *			     u64 gpuva_start, u64 gpuva_end)
> + *	{
> + *		struct drm_gpusvm_ctx ctx = {};
> + *		int err;
> + *
> + *		driver_svm_lock();
> + *	retry:
> + *		// Always process UNMAPs first so view of GPU SVM ranges is current
> + *		driver_garbage_collector(gpusvm);
> + *
> + *		range = drm_gpusvm_range_find_or_insert(gpusvm, fault_addr,
> + *							gpuva_start, gpuva_end,
> + *						        &ctx);
> + *		if (IS_ERR(range)) {
> + *			err = PTR_ERR(range);
> + *			goto unlock;
> + *		}
> + *
> + *		if (driver_migration_policy(range)) {
> + *			bo = driver_alloc_bo();
> + *			err = drm_gpusvm_migrate_to_vram(gpusvm, range, bo, &ctx);
> + *			if (err)	// CPU mappings may have changed
> + *				goto retry;
> + *		}
> + *
> + *		err = drm_gpusvm_range_get_pages(gpusvm, range, &ctx);
> + *		if (err == -EFAULT || err == -EPERM)	// CPU mappings changed
> + *			goto retry;
> + *		else if (err)
> + *			goto unlock;
> + *
> + *		err = driver_bind_range(gpusvm, range);
> + *		if (err == -EAGAIN)	// CPU mappings changed
> + *			goto retry
> + *
> + *	unlock:
> + *		driver_svm_unlock();
> + *		return err;
> + *	}
> + *
> + * 2) Garbage Collector.
> + *
> + *	void __driver_garbage_collector(struct drm_gpusvm *gpusvm,
> + *					struct drm_gpusvm_range *range)
> + *	{
> + *		struct drm_gpusvm_ctx ctx = {};
> + *
> + *		assert_driver_svm_locked(gpusvm);
> + *
> + *		// Partial unmap, migrate any remaining VRAM pages back to SRAM
> + *		if (range->flags.partial_unmap)
> + *			drm_gpusvm_migrate_to_sram(gpusvm, range, &ctx);

Note that the migration back to sram isn't guaranteed to succeed, so you
might still be stuck with a partially migrated range. This might be a case
where hmm gives you vram pfns, but the range you have doesn't have any
vram allocation anymore because you dropped it here. Not sure tbh.
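
Independent of that, in the garbage collector example above I'd expect
the return value to at least be looked at, something like (sketch only):

	if (range->flags.partial_unmap) {
		err = drm_gpusvm_migrate_to_sram(gpusvm, range, &ctx);
		if (err) {
			/* vram pages may still be referenced; keep the
			 * range on the garbage collector list and retry
			 * later instead of tearing it down now */
			return err;
		}
	}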

> + *
> + *		driver_unbind_range(range);
> + *		drm_gpusvm_range_remove(gpusvm, range);
> + *	}
> + *
> + *	void driver_garbage_collector(struct drm_gpusvm *gpusvm)
> + *	{
> + *		assert_driver_svm_locked(gpusvm);
> + *
> + *		for_each_range_in_garbage_collector(gpusvm, range)
> + *			__driver_garbage_collector(gpusvm, range);
> + *	}
> + *
> + * 3) Invalidation driver vfunc.
> + *
> + *	void driver_invalidation(struct drm_gpusvm *gpusvm,
> + *				 struct drm_gpusvm_notifier *notifier,
> + *				 const struct mmu_notifier_range *mmu_range)
> + *	{
> + *		struct drm_gpusvm_ctx ctx = { .in_notifier = true, };
> + *		struct drm_gpusvm_range *range = NULL;
> + *
> + *		driver_invalidate_device_tlb(gpusvm, mmu_range->start, mmu_range->end);
> + *
> + *		drm_gpusvm_for_each_range(range, notifier, mmu_range->start,
> + *					  mmu_range->end) {
> + *			drm_gpusvm_range_unmap_pages(gpusvm, range, &ctx);
> + *
> + *			if (mmu_range->event != MMU_NOTIFY_UNMAP)
> + *				continue;
> + *
> + *			drm_gpusvm_range_set_unmapped(range, mmu_range);
> + *			driver_garbage_collector_add(gpusvm, range);
> + *		}
> + *	}
> + */
> +
> +#define DRM_GPUSVM_RANGE_START(_range)	((_range)->va.start)
> +#define DRM_GPUSVM_RANGE_END(_range)	((_range)->va.end - 1)
> +INTERVAL_TREE_DEFINE(struct drm_gpusvm_range, rb.node, u64, rb.__subtree_last,
> +		     DRM_GPUSVM_RANGE_START, DRM_GPUSVM_RANGE_END,
> +		     static __maybe_unused, range);
> +
> +#define DRM_GPUSVM_NOTIFIER_START(_notifier)	((_notifier)->interval.start)
> +#define DRM_GPUSVM_NOTIFIER_END(_notifier)	((_notifier)->interval.end - 1)
> +INTERVAL_TREE_DEFINE(struct drm_gpusvm_notifier, rb.node, u64,
> +		     rb.__subtree_last, DRM_GPUSVM_NOTIFIER_START,
> +		     DRM_GPUSVM_NOTIFIER_END, static __maybe_unused, notifier);
> +
> +/**
> + * npages_in_range() - Calculate the number of pages in a given range
> + * @start__: The start address of the range
> + * @end__: The end address of the range
> + *
> + * This macro calculates the number of pages in a given memory range,
> + * specified by the start and end addresses. It divides the difference
> + * between the end and start addresses by the page size (PAGE_SIZE) to
> + * determine the number of pages in the range.
> + *
> + * Return: The number of pages in the specified range.
> + */
> +#define npages_in_range(start__, end__)	\
> +	(((end__) - (start__)) >> PAGE_SHIFT)
> +
> +/**
> + * struct drm_gpusvm_zdd - GPU SVM zone device data
> + *
> + * @refcount: Reference count for the zdd
> + * @destroy_work: Work structure for asynchronous zdd destruction
> + * @range: Pointer to the GPU SVM range
> + * @vram_allocation: Driver-private pointer to the VRAM allocation
> + *
> + * This structure serves as a generic wrapper installed in
> + * page->zone_device_data. It provides infrastructure for looking up a range
> + * upon CPU page fault and asynchronously releasing VRAM once the CPU has no
> + * page references. Asynchronous release is useful because CPU page references
> + * can be dropped in IRQ contexts, while releasing VRAM likely requires sleeping
> + * locks.
> + */
> +struct drm_gpusvm_zdd {
> +	struct kref refcount;
> +	struct work_struct destroy_work;
> +	struct drm_gpusvm_range *range;
> +	void *vram_allocation;
> +};
> +
> +/**
> + * drm_gpusvm_zdd_destroy_work_func - Work function for destroying a zdd
> + * @w: Pointer to the work_struct
> + *
> + * This function releases VRAM, puts GPU SVM range, and frees zdd.
> + */
> +static void drm_gpusvm_zdd_destroy_work_func(struct work_struct *w)
> +{
> +	struct drm_gpusvm_zdd *zdd =
> +		container_of(w, struct drm_gpusvm_zdd, destroy_work);
> +	struct drm_gpusvm_range *range = zdd->range;
> +	struct drm_gpusvm *gpusvm = range->gpusvm;
> +
> +	if (gpusvm->ops->vram_release && zdd->vram_allocation)
> +		gpusvm->ops->vram_release(zdd->vram_allocation);
> +	drm_gpusvm_range_put(range);
> +	kfree(zdd);
> +}
> +
> +/**
> + * drm_gpusvm_zdd_alloc - Allocate a zdd structure.
> + * @range: Pointer to the GPU SVM range.
> + *
> + * This function allocates and initializes a new zdd structure. It sets up the
> + * reference count, initializes the destroy work, and links the provided GPU SVM
> + * range.
> + *
> + * Returns:
> + * Pointer to the allocated zdd on success, ERR_PTR() on failure.
> + */
> +static struct drm_gpusvm_zdd *
> +drm_gpusvm_zdd_alloc(struct drm_gpusvm_range *range)
> +{
> +	struct drm_gpusvm_zdd *zdd;
> +
> +	zdd = kmalloc(sizeof(*zdd), GFP_KERNEL);
> +	if (!zdd)
> +		return NULL;
> +
> +	kref_init(&zdd->refcount);
> +	INIT_WORK(&zdd->destroy_work, drm_gpusvm_zdd_destroy_work_func);
> +	zdd->range = drm_gpusvm_range_get(range);
> +	zdd->vram_allocation = NULL;
> +
> +	return zdd;
> +}
> +
> +/**
> + * drm_gpusvm_zdd_get - Get a reference to a zdd structure.
> + * @zdd: Pointer to the zdd structure.
> + *
> + * This function increments the reference count of the provided zdd structure.
> + *
> + * Returns: Pointer to the zdd structure.
> + */
> +static struct drm_gpusvm_zdd *drm_gpusvm_zdd_get(struct drm_gpusvm_zdd *zdd)
> +{
> +	kref_get(&zdd->refcount);
> +	return zdd;
> +}
> +
> +/**
> + * drm_gpusvm_zdd_destroy - Destroy a zdd structure.
> + * @ref: Pointer to the reference count structure.
> + *
> + * This function queues the destroy_work of the zdd for asynchronous destruction.
> + */
> +static void drm_gpusvm_zdd_destroy(struct kref *ref)
> +{
> +	struct drm_gpusvm_zdd *zdd =
> +		container_of(ref, struct drm_gpusvm_zdd, refcount);
> +	struct drm_gpusvm *gpusvm = zdd->range->gpusvm;
> +
> +	queue_work(gpusvm->zdd_wq, &zdd->destroy_work);
> +}
> +
> +/**
> + * drm_gpusvm_zdd_put - Put a zdd reference.
> + * @zdd: Pointer to the zdd structure.
> + *
> + * This function decrements the reference count of the provided zdd structure
> + * and schedules its destruction if the count drops to zero.
> + */
> +static void drm_gpusvm_zdd_put(struct drm_gpusvm_zdd *zdd)
> +{
> +	kref_put(&zdd->refcount, drm_gpusvm_zdd_destroy);
> +}
> +
> +/**
> + * drm_gpusvm_range_find - Find GPU SVM range from GPU SVM notifier
> + * @notifier: Pointer to the GPU SVM notifier structure.
> + * @start: Start address of the range
> + * @end: End address of the range
> + *
> + * Return: A pointer to the drm_gpusvm_range if found or NULL
> + */
> +struct drm_gpusvm_range *
> +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64 start, u64 end)
> +{
> +	return range_iter_first(&notifier->root, start, end - 1);
> +}
> +
> +/**
> + * drm_gpusvm_for_each_range_safe - Safely iterate over GPU SVM ranges in a notifier
> + * @range__: Iterator variable for the ranges
> + * @next__: Iterator variable for the ranges temporary storage
> + * @notifier__: Pointer to the GPU SVM notifier
> + * @start__: Start address of the range
> + * @end__: End address of the range
> + *
> + * This macro is used to iterate over GPU SVM ranges in a notifier while
> + * removing ranges from it.
> + */
> +#define drm_gpusvm_for_each_range_safe(range__, next__, notifier__, start__, end__)	\
> +	for ((range__) = drm_gpusvm_range_find((notifier__), (start__), (end__)),	\
> +	     (next__) = __drm_gpusvm_range_next(range__);				\
> +	     (range__) && (range__->va.start < (end__));				\
> +	     (range__) = (next__), (next__) = __drm_gpusvm_range_next(range__))
> +
> +/**
> + * __drm_gpusvm_notifier_next - get the next drm_gpusvm_notifier in the list
> + * @notifier: a pointer to the current drm_gpusvm_notifier
> + *
> + * Return: A pointer to the next drm_gpusvm_notifier if available, or NULL if
> + *         the current notifier is the last one or if the input notifier is
> + *         NULL.
> + */
> +static struct drm_gpusvm_notifier *
> +__drm_gpusvm_notifier_next(struct drm_gpusvm_notifier *notifier)
> +{
> +	if (notifier && !list_is_last(&notifier->rb.entry,
> +				      &notifier->gpusvm->notifier_list))
> +		return list_next_entry(notifier, rb.entry);
> +
> +	return NULL;
> +}
> +
> +/**
> + * drm_gpusvm_for_each_notifier - Iterate over GPU SVM notifiers in a gpusvm
> + * @notifier__: Iterator variable for the notifiers
> + * @gpusvm__: Pointer to the GPU SVM structure
> + * @start__: Start address of the notifier
> + * @end__: End address of the notifier
> + *
> + * This macro is used to iterate over GPU SVM notifiers in a gpusvm.
> + */
> +#define drm_gpusvm_for_each_notifier(notifier__, gpusvm__, start__, end__)		\
> +	for ((notifier__) = notifier_iter_first(&(gpusvm__)->root, (start__), (end__) - 1);	\
> +	     (notifier__) && (notifier__->interval.start < (end__));			\
> +	     (notifier__) = __drm_gpusvm_notifier_next(notifier__))
> +
> +/**
> + * drm_gpusvm_for_each_notifier_safe - Safely iterate over GPU SVM notifiers in a gpusvm
> + * @notifier__: Iterator variable for the notifiers
> + * @next__: Iterator variable for the notifiers temporary storage
> + * @gpusvm__: Pointer to the GPU SVM structure
> + * @start__: Start address of the notifier
> + * @end__: End address of the notifier
> + *
> + * This macro is used to iterate over GPU SVM notifiers in a gpusvm while
> + * removing notifiers from it.
> + */
> +#define drm_gpusvm_for_each_notifier_safe(notifier__, next__, gpusvm__, start__, end__)	\
> +	for ((notifier__) = notifier_iter_first(&(gpusvm__)->root, (start__), (end__) - 1),	\
> +	     (next__) = __drm_gpusvm_notifier_next(notifier__);				\
> +	     (notifier__) && (notifier__->interval.start < (end__));			\
> +	     (notifier__) = (next__), (next__) = __drm_gpusvm_notifier_next(notifier__))
> +
> +/**
> + * drm_gpusvm_notifier_invalidate - Invalidate a GPU SVM notifier.
> + * @mni: Pointer to the mmu_interval_notifier structure.
> + * @mmu_range: Pointer to the mmu_notifier_range structure.
> + * @cur_seq: Current sequence number.
> + *
> + * This function serves as a generic MMU notifier for GPU SVM. It sets the MMU
> + * notifier sequence number and calls the driver invalidate vfunc under
> + * gpusvm->notifier_lock.
> + *
> + * Returns:
> + * true if the operation succeeds, false otherwise.
> + */
> +static bool
> +drm_gpusvm_notifier_invalidate(struct mmu_interval_notifier *mni,
> +			       const struct mmu_notifier_range *mmu_range,
> +			       unsigned long cur_seq)
> +{
> +	struct drm_gpusvm_notifier *notifier =
> +		container_of(mni, typeof(*notifier), notifier);
> +	struct drm_gpusvm *gpusvm = notifier->gpusvm;
> +
> +	if (!mmu_notifier_range_blockable(mmu_range))
> +		return false;
> +
> +	down_write(&gpusvm->notifier_lock);
> +	mmu_interval_set_seq(mni, cur_seq);
> +	gpusvm->ops->invalidate(gpusvm, notifier, mmu_range);
> +	up_write(&gpusvm->notifier_lock);
> +
> +	return true;
> +}
> +
> +/**
> + * drm_gpusvm_notifier_ops - MMU interval notifier operations for GPU SVM
> + */
> +static const struct mmu_interval_notifier_ops drm_gpusvm_notifier_ops = {
> +	.invalidate = drm_gpusvm_notifier_invalidate,
> +};
> +
> +/**
> + * drm_gpusvm_init - Initialize the GPU SVM.
> + * @gpusvm: Pointer to the GPU SVM structure.
> + * @name: Name of the GPU SVM.
> + * @drm: Pointer to the DRM device structure.
> + * @mm: Pointer to the mm_struct for the address space.
> + * @device_private_page_owner: Device private pages owner.
> + * @mm_start: Start address of GPU SVM.
> + * @mm_range: Range of the GPU SVM.
> + * @notifier_size: Size of individual notifiers.
> + * @ops: Pointer to the operations structure for GPU SVM.
> + * @chunk_sizes: Pointer to the array of chunk sizes used in range allocation.
> + *               Entries should be powers of 2 in descending order with last
> + *               entry being SZ_4K.
> + * @num_chunks: Number of chunks.
> + *
> + * This function initializes the GPU SVM.
> + *
> + * Returns:
> + * 0 on success, a negative error code on failure.
> + */
> +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> +		    const char *name, struct drm_device *drm,
> +		    struct mm_struct *mm, void *device_private_page_owner,
> +		    u64 mm_start, u64 mm_range, u64 notifier_size,
> +		    const struct drm_gpusvm_ops *ops,
> +		    const u64 *chunk_sizes, int num_chunks)
> +{
> +	if (!ops->invalidate || !num_chunks)
> +		return -EINVAL;
> +
> +	gpusvm->name = name;
> +	gpusvm->drm = drm;
> +	gpusvm->mm = mm;
> +	gpusvm->device_private_page_owner = device_private_page_owner;
> +	gpusvm->mm_start = mm_start;
> +	gpusvm->mm_range = mm_range;
> +	gpusvm->notifier_size = notifier_size;
> +	gpusvm->ops = ops;
> +	gpusvm->chunk_sizes = chunk_sizes;
> +	gpusvm->num_chunks = num_chunks;
> +	gpusvm->zdd_wq = system_wq;
> +
> +	mmgrab(mm);
> +	gpusvm->root = RB_ROOT_CACHED;
> +	INIT_LIST_HEAD(&gpusvm->notifier_list);
> +
> +	init_rwsem(&gpusvm->notifier_lock);
> +
> +	fs_reclaim_acquire(GFP_KERNEL);
> +	might_lock(&gpusvm->notifier_lock);
> +	fs_reclaim_release(GFP_KERNEL);
> +
> +	return 0;
> +}
> +
> +/**
> + * drm_gpusvm_notifier_find - Find GPU SVM notifier
> + * @gpusvm__: Pointer to the GPU SVM structure
> + * @fault_addr__: Fault address
> + *
> + * This macro finds the GPU SVM notifier associated with the fault address.
> + *
> + * Returns:
> + * Pointer to the GPU SVM notifier on success, NULL otherwise.
> + */
> +#define drm_gpusvm_notifier_find(gpusvm__, fault_addr__)	\
> +	notifier_iter_first(&(gpusvm__)->root, (fault_addr__),	\
> +			    (fault_addr__ + 1))
> +
> +/**
> + * to_drm_gpusvm_notifier - retrieve the container struct for a given rbtree node
> + * @node__: a pointer to the rbtree node embedded within a drm_gpusvm_notifier struct
> + *
> + * Return: A pointer to the containing drm_gpusvm_notifier structure.
> + */
> +#define to_drm_gpusvm_notifier(node__)				\
> +	container_of((node__), struct drm_gpusvm_notifier, rb.node)
> +
> +/**
> + * drm_gpusvm_notifier_insert - Insert GPU SVM notifier
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: Pointer to the GPU SVM notifier structure
> + *
> + * This function inserts the GPU SVM notifier into the GPU SVM RB tree and list.
> + */
> +static void drm_gpusvm_notifier_insert(struct drm_gpusvm *gpusvm,
> +				       struct drm_gpusvm_notifier *notifier)
> +{
> +	struct rb_node *node;
> +	struct list_head *head;
> +
> +	notifier_insert(notifier, &gpusvm->root);
> +
> +	node = rb_prev(&notifier->rb.node);
> +	if (node)
> +		head = &(to_drm_gpusvm_notifier(node))->rb.entry;
> +	else
> +		head = &gpusvm->notifier_list;
> +
> +	list_add(&notifier->rb.entry, head);
> +}
> +
> +/**
> + * drm_gpusvm_notifier_remove - Remove GPU SVM notifier
> + * @gpusvm__: Pointer to the GPU SVM structure
> + * @notifier__: Pointer to the GPU SVM notifier structure
> + *
> + * This macro removes the GPU SVM notifier from the GPU SVM RB tree and list.
> + */
> +#define drm_gpusvm_notifier_remove(gpusvm__, notifier__)	\
> +	notifier_remove((notifier__), &(gpusvm__)->root);	\
> +	list_del(&(notifier__)->rb.entry)
> +
> +/**
> + * drm_gpusvm_fini - Finalize the GPU SVM.
> + * @gpusvm: Pointer to the GPU SVM structure.
> + *
> + * This function finalizes the GPU SVM by cleaning up any remaining ranges and
> + * notifiers, and dropping a reference to struct MM.
> + */
> +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm)
> +{
> +	struct drm_gpusvm_notifier *notifier, *next;
> +
> +	drm_gpusvm_for_each_notifier_safe(notifier, next, gpusvm, 0, LONG_MAX) {
> +		struct drm_gpusvm_range *range, *__next;
> +
> +		/*
> +		 * Remove notifier first to avoid racing with any invalidation
> +		 */
> +		mmu_interval_notifier_remove(&notifier->notifier);
> +		notifier->flags.removed = true;
> +
> +		drm_gpusvm_for_each_range_safe(range, __next, notifier, 0,
> +					       LONG_MAX)
> +			drm_gpusvm_range_remove(gpusvm, range);
> +	}
> +
> +	mmdrop(gpusvm->mm);
> +	WARN_ON(!RB_EMPTY_ROOT(&gpusvm->root.rb_root));
> +}
> +
> +/**
> + * drm_gpusvm_notifier_alloc - Allocate GPU SVM notifier
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @fault_addr: Fault address
> + *
> + * This function allocates and initializes the GPU SVM notifier structure.
> + *
> + * Returns:
> + * Pointer to the allocated GPU SVM notifier on success, ERR_PTR() on failure.
> + */
> +static struct drm_gpusvm_notifier *
> +drm_gpusvm_notifier_alloc(struct drm_gpusvm *gpusvm, u64 fault_addr)
> +{
> +	struct drm_gpusvm_notifier *notifier;
> +
> +	if (gpusvm->ops->notifier_alloc)
> +		notifier = gpusvm->ops->notifier_alloc();
> +	else
> +		notifier = kzalloc(sizeof(*notifier), GFP_KERNEL);
> +
> +	if (!notifier)
> +		return ERR_PTR(-ENOMEM);
> +
> +	notifier->gpusvm = gpusvm;
> +	notifier->interval.start = ALIGN_DOWN(fault_addr, gpusvm->notifier_size);
> +	notifier->interval.end = ALIGN(fault_addr + 1, gpusvm->notifier_size);
> +	INIT_LIST_HEAD(&notifier->rb.entry);
> +	notifier->root = RB_ROOT_CACHED;
> +	INIT_LIST_HEAD(&notifier->range_list);
> +
> +	return notifier;
> +}
> +
> +/**
> + * drm_gpusvm_notifier_free - Free GPU SVM notifier
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: Pointer to the GPU SVM notifier structure
> + *
> + * This function frees the GPU SVM notifier structure.
> + */
> +static void drm_gpusvm_notifier_free(struct drm_gpusvm *gpusvm,
> +				     struct drm_gpusvm_notifier *notifier)
> +{
> +	WARN_ON(!RB_EMPTY_ROOT(&notifier->root.rb_root));
> +
> +	if (gpusvm->ops->notifier_free)
> +		gpusvm->ops->notifier_free(notifier);
> +	else
> +		kfree(notifier);
> +}
> +
> +/**
> + * to_drm_gpusvm_range - retrieve the container struct for a given rbtree node
> + * @node__: a pointer to the rbtree node embedded within a drm_gpusvm_range struct
> + *
> + * Return: A pointer to the containing drm_gpusvm_range structure.
> + */
> +#define to_drm_gpusvm_range(node__)	\
> +	container_of((node__), struct drm_gpusvm_range, rb.node)
> +
> +/**
> + * drm_gpusvm_range_insert - Insert GPU SVM range
> + * @notifier: Pointer to the GPU SVM notifier structure
> + * @range: Pointer to the GPU SVM range structure
> + *
> + * This function inserts the GPU SVM range into the notifier RB tree and list.
> + */
> +static void drm_gpusvm_range_insert(struct drm_gpusvm_notifier *notifier,
> +				    struct drm_gpusvm_range *range)
> +{
> +	struct rb_node *node;
> +	struct list_head *head;
> +
> +	drm_gpusvm_notifier_lock(notifier->gpusvm);
> +	range_insert(range, &notifier->root);
> +
> +	node = rb_prev(&range->rb.node);
> +	if (node)
> +		head = &(to_drm_gpusvm_range(node))->rb.entry;
> +	else
> +		head = &notifier->range_list;
> +
> +	list_add(&range->rb.entry, head);
> +	drm_gpusvm_notifier_unlock(notifier->gpusvm);
> +}
> +
> +/**
> + * __drm_gpusvm_range_remove - Remove GPU SVM range
> + * @notifier__: Pointer to the GPU SVM notifier structure
> + * @range__: Pointer to the GPU SVM range structure
> + *
> + * This macro removes the GPU SVM range from the notifier RB tree and list.
> + */
> +#define __drm_gpusvm_range_remove(notifier__, range__)		\
> +	range_remove((range__), &(notifier__)->root);		\
> +	list_del(&(range__)->rb.entry)
> +
> +/**
> + * drm_gpusvm_range_alloc - Allocate GPU SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: Pointer to the GPU SVM notifier structure
> + * @fault_addr: Fault address
> + * @chunk_size: Chunk size
> + * @migrate_vram: Flag indicating whether to migrate VRAM
> + *
> + * This function allocates and initializes the GPU SVM range structure.
> + *
> + * Returns:
> + * Pointer to the allocated GPU SVM range on success, ERR_PTR() on failure.
> + */
> +static struct drm_gpusvm_range *
> +drm_gpusvm_range_alloc(struct drm_gpusvm *gpusvm,
> +		       struct drm_gpusvm_notifier *notifier,
> +		       u64 fault_addr, u64 chunk_size, bool migrate_vram)
> +{
> +	struct drm_gpusvm_range *range;
> +
> +	if (gpusvm->ops->range_alloc)
> +		range = gpusvm->ops->range_alloc(gpusvm);
> +	else
> +		range = kzalloc(sizeof(*range), GFP_KERNEL);
> +
> +	if (!range)
> +		return ERR_PTR(-ENOMEM);
> +
> +	kref_init(&range->refcount);
> +	range->gpusvm = gpusvm;
> +	range->notifier = notifier;
> +	range->va.start = ALIGN_DOWN(fault_addr, chunk_size);
> +	range->va.end = ALIGN(fault_addr + 1, chunk_size);
> +	INIT_LIST_HEAD(&range->rb.entry);
> +	range->notifier_seq = LONG_MAX;
> +	range->flags.migrate_vram = migrate_vram ? 1 : 0;
> +
> +	return range;
> +}
> +
> +/**
> + * drm_gpusvm_check_pages - Check pages
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: Pointer to the GPU SVM notifier structure
> + * @start: Start address
> + * @end: End address
> + *
> + * Check if pages between start and end have been faulted in on the CPU. Used
> + * to prevent migration of pages without a CPU backing store.
> + *
> + * Returns:
> + * True if pages have been faulted into CPU, False otherwise
> + */
> +static bool drm_gpusvm_check_pages(struct drm_gpusvm *gpusvm,
> +				   struct drm_gpusvm_notifier *notifier,
> +				   u64 start, u64 end)
> +{
> +	struct hmm_range hmm_range = {
> +		.default_flags = 0,
> +		.notifier = &notifier->notifier,
> +		.start = start,
> +		.end = end,
> +		.dev_private_owner = gpusvm->device_private_page_owner,
> +	};
> +	unsigned long timeout =
> +		jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> +	unsigned long *pfns;
> +	unsigned long npages = npages_in_range(start, end);
> +	int err, i;
> +
> +	mmap_assert_locked(gpusvm->mm);
> +
> +	pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
> +	if (!pfns)
> +		return false;
> +
> +	hmm_range.notifier_seq = mmu_interval_read_begin(&notifier->notifier);
> +	hmm_range.hmm_pfns = pfns;
> +
> +	while (true) {
> +		err = hmm_range_fault(&hmm_range);
> +		if (err == -EBUSY) {
> +			if (time_after(jiffies, timeout))
> +				break;
> +
> +			hmm_range.notifier_seq = mmu_interval_read_begin(&notifier->notifier);
> +			continue;
> +		}
> +		break;
> +	}
> +	if (err)
> +		goto err_free;
> +
> +	for (i = 0; i < npages; ++i) {
> +		if (!(pfns[i] & HMM_PFN_VALID)) {
> +			err = -EFAULT;
> +			goto err_free;
> +		}
> +	}
> +
> +err_free:
> +	kvfree(pfns);
> +	return err ? false : true;
> +}
> +
> +/**
> + * drm_gpusvm_range_chunk_size - Determine chunk size for GPU SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: Pointer to the GPU SVM notifier structure
> + * @vas: Pointer to the virtual memory area structure
> + * @fault_addr: Fault address
> + * @gpuva_start: Start address of GPUVA which mirrors CPU
> + * @gpuva_end: End address of GPUVA which mirrors CPU
> + * @check_pages: Flag indicating whether to check pages
> + *
> + * This function determines the chunk size for the GPU SVM range based on the
> + * fault address, GPU SVM chunk sizes, existing GPU SVM ranges, and the virtual
> + * memory area boundaries.
> + *
> + * Returns:
> + * Chunk size on success, LONG_MAX on failure.
> + */
> +static u64 drm_gpusvm_range_chunk_size(struct drm_gpusvm *gpusvm,
> +				       struct drm_gpusvm_notifier *notifier,
> +				       struct vm_area_struct *vas,
> +				       u64 fault_addr, u64 gpuva_start,
> +				       u64 gpuva_end, bool check_pages)
> +{
> +	u64 start, end;
> +	int i = 0;
> +
> +retry:
> +	for (; i < gpusvm->num_chunks; ++i) {
> +		start = ALIGN_DOWN(fault_addr, gpusvm->chunk_sizes[i]);
> +		end = ALIGN(fault_addr + 1, gpusvm->chunk_sizes[i]);
> +
> +		if (start >= vas->vm_start && end <= vas->vm_end &&
> +		    start >= notifier->interval.start &&
> +		    end <= notifier->interval.end &&
> +		    start >= gpuva_start && end <= gpuva_end)
> +			break;
> +	}
> +
> +	if (i == gpusvm->num_chunks)
> +		return LONG_MAX;
> +
> +	/*
> +	 * If the allocation is larger than a page, ensure it does not overlap
> +	 * with existing ranges.
> +	 */
> +	if (end - start != SZ_4K) {
> +		struct drm_gpusvm_range *range;
> +
> +		range = drm_gpusvm_range_find(notifier, start, end);
> +		if (range) {
> +			++i;
> +			goto retry;
> +		}
> +
> +		/*
> +		 * XXX: Only create ranges on pages the CPU has faulted in.
> +		 * Without this check, or prefault, on BMG
> +		 * 'xe_exec_system_allocator --r process-many-malloc' fails.
> +		 * In the failure case, each process mallocs 16k but the CPU
> +		 * VMA is ~128k, which results in 64k SVM ranges. When
> +		 * migrating the SVM ranges, some processes fail in
> +		 * drm_gpusvm_migrate_to_vram with 'migrate.cpages != npages'
> +		 * and then upon drm_gpusvm_range_get_pages device pages from
> +		 * other processes are collected + faulted in, which creates
> +		 * all sorts of problems. Unsure exactly how this is
> +		 * happening; the problem also goes away if
> +		 * 'xe_exec_system_allocator --r process-many-malloc' mallocs
> +		 * at least 64k at a time.
> +		 */
> +		if (check_pages &&
> +		    !drm_gpusvm_check_pages(gpusvm, notifier, start, end)) {
> +			++i;
> +			goto retry;
> +		}
> +	}
> +
> +	return end - start;
> +}
> +
> +/**
> + * drm_gpusvm_range_find_or_insert - Find or insert GPU SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @fault_addr: Fault address
> + * @gpuva_start: Start address of GPUVA which mirrors CPU
> + * @gpuva_end: End address of GPUVA which mirrors CPU
> + * @ctx: GPU SVM context
> + *
> + * This function finds or inserts a newly allocated GPU SVM range based on the
> + * fault address. Caller must hold a lock to protect range lookup and insertion.
> + *
> + * Returns:
> + * Pointer to the GPU SVM range on success, ERR_PTR() on failure.
> + */
> +struct drm_gpusvm_range *
> +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64 fault_addr,
> +				u64 gpuva_start, u64 gpuva_end,
> +				const struct drm_gpusvm_ctx *ctx)
> +{
> +	struct drm_gpusvm_notifier *notifier;
> +	struct drm_gpusvm_range *range;
> +	struct mm_struct *mm = gpusvm->mm;
> +	struct vm_area_struct *vas;
> +	bool notifier_alloc = false;
> +	u64 chunk_size;
> +	int err;
> +	bool migrate_vram;
> +
> +	if (fault_addr < gpusvm->mm_start ||
> +	    fault_addr > gpusvm->mm_start + gpusvm->mm_range) {
> +		err = -EINVAL;
> +		goto err_out;
> +	}
> +
> +	if (!ctx->mmap_locked) {
> +		if (!mmget_not_zero(mm)) {
> +			err = -EFAULT;
> +			goto err_out;
> +		}
> +		mmap_write_lock(mm);
> +	}
> +
> +	mmap_assert_write_locked(mm);
> +
> +	notifier = drm_gpusvm_notifier_find(gpusvm, fault_addr);
> +	if (!notifier) {
> +		notifier = drm_gpusvm_notifier_alloc(gpusvm, fault_addr);
> +		if (IS_ERR(notifier)) {
> +			err = PTR_ERR(notifier);
> +			goto err_mmunlock;
> +		}
> +		notifier_alloc = true;
> +		err = mmu_interval_notifier_insert_locked(&notifier->notifier,
> +							  mm, notifier->interval.start,
> +							  notifier->interval.end -
> +							  notifier->interval.start,
> +							  &drm_gpusvm_notifier_ops);
> +		if (err)
> +			goto err_notifier;
> +	}
> +
> +	vas = vma_lookup(mm, fault_addr);
> +	if (!vas) {
> +		err = -ENOENT;
> +		goto err_notifier_remove;
> +	}
> +
> +	if (!ctx->read_only && !(vas->vm_flags & VM_WRITE)) {
> +		err = -EPERM;
> +		goto err_notifier_remove;
> +	}
> +
> +	range = drm_gpusvm_range_find(notifier, fault_addr, fault_addr + 1);
> +	if (range)
> +		goto out_mmunlock;
> +	/*
> +	 * XXX: Short-circuiting migration based on migrate_vma_* current
> +	 * limitations. If/when migrate_vma_* add more support, this logic will
> +	 * have to change.
> +	 */
> +	migrate_vram = ctx->vram_possible &&
> +		vma_is_anonymous(vas) && !is_vm_hugetlb_page(vas);
> +
> +	chunk_size = drm_gpusvm_range_chunk_size(gpusvm, notifier, vas,
> +						 fault_addr, gpuva_start,
> +						 gpuva_end, migrate_vram &&
> +						 !ctx->prefault);
> +	if (chunk_size == LONG_MAX) {
> +		err = -EINVAL;
> +		goto err_notifier_remove;
> +	}
> +
> +	range = drm_gpusvm_range_alloc(gpusvm, notifier, fault_addr, chunk_size,
> +				       migrate_vram);
> +	if (IS_ERR(range)) {
> +		err = PTR_ERR(range);
> +		goto err_notifier_remove;
> +	}
> +
> +	drm_gpusvm_range_insert(notifier, range);
> +	if (notifier_alloc)
> +		drm_gpusvm_notifier_insert(gpusvm, notifier);
> +
> +	if (ctx->prefault) {
> +		struct drm_gpusvm_ctx __ctx = *ctx;
> +
> +		__ctx.mmap_locked = true;
> +		err = drm_gpusvm_range_get_pages(gpusvm, range, &__ctx);
> +		if (err)
> +			goto err_range_remove;
> +	}
> +
> +out_mmunlock:
> +	if (!ctx->mmap_locked) {
> +		mmap_write_unlock(mm);
> +		mmput(mm);
> +	}
> +
> +	return range;
> +
> +err_range_remove:
> +	__drm_gpusvm_range_remove(notifier, range);
> +err_notifier_remove:
> +	if (notifier_alloc)
> +		mmu_interval_notifier_remove(&notifier->notifier);
> +err_notifier:
> +	if (notifier_alloc)
> +		drm_gpusvm_notifier_free(gpusvm, notifier);
> +err_mmunlock:
> +	if (!ctx->mmap_locked) {
> +		mmap_write_unlock(mm);
> +		mmput(mm);
> +	}
> +err_out:
> +	return ERR_PTR(err);
> +}
> +
> +/**
> + * for_each_dma_page - iterate over pages in a DMA region
> + * @i__: the current page index in the iteration
> + * @j__: the current block index (page index divided by 2^@order__) in the iteration
> + * @npages__: the total number of pages in the DMA region
> + * @order__: the order of the pages in the DMA region
> + *
> + * This macro iterates over each page in a DMA region. The DMA region
> + * is assumed to be composed of 2^@order__ pages, and the macro will
> + * step through the region one block of 2^@order__ pages at a time.
> + */
> +#define for_each_dma_page(i__, j__, npages__, order__)	\
> +	for ((i__) = 0, (j__) = 0; (i__) < (npages__);	\
> +	     (j__)++, (i__) += 0x1 << (order__))
> +
> +/**
> + * __drm_gpusvm_range_unmap_pages - Unmap pages associated with a GPU SVM range (internal)
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + *
> + * This function unmaps pages associated with a GPU SVM range. Assumes and
> + * asserts correct locking is in place when called.
> + */
> +static void __drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> +					   struct drm_gpusvm_range *range)
> +{
> +	lockdep_assert_held(&gpusvm->notifier_lock);
> +
> +	if (range->pages) {
> +		unsigned long i, j, npages = npages_in_range(range->va.start,
> +							     range->va.end);
> +
> +		if (range->flags.has_dma_mapping) {
> +			for_each_dma_page(i, j, npages, range->order)
> +				dma_unmap_page(gpusvm->drm->dev,
> +					       range->dma_addr[j],
> +					       PAGE_SIZE << range->order,
> +					       DMA_BIDIRECTIONAL);
> +		}
> +
> +		range->flags.has_vram_pages = false;
> +		range->flags.has_dma_mapping = false;
> +	}
> +}
> +
> +/**
> + * drm_gpusvm_range_free_pages - Free pages associated with a GPU SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + *
> + * This function frees pages associated with a GPU SVM range.
> + */
> +static void drm_gpusvm_range_free_pages(struct drm_gpusvm *gpusvm,
> +					struct drm_gpusvm_range *range)
> +{
> +	lockdep_assert_held(&gpusvm->notifier_lock);
> +
> +	if (range->pages) {
> +		if (range->flags.kfree_mapping) {
> +			kfree(range->dma_addr);
> +			range->flags.kfree_mapping = false;
> +			range->pages = NULL;
> +		} else {
> +			kvfree(range->pages);
> +			range->pages = NULL;
> +		}
> +	}
> +}
> +
> +/**
> + * drm_gpusvm_range_remove - Remove GPU SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range to be removed
> + *
> + * This function removes the specified GPU SVM range and also removes the parent
> + * GPU SVM notifier if no more ranges remain in the notifier. The caller must
> + * hold a lock to protect range and notifier removal.
> + */
> +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> +			     struct drm_gpusvm_range *range)
> +{
> +	struct drm_gpusvm_notifier *notifier;
> +
> +	notifier = drm_gpusvm_notifier_find(gpusvm, range->va.start);
> +	if (WARN_ON_ONCE(!notifier))
> +		return;
> +
> +	drm_gpusvm_notifier_lock(gpusvm);
> +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> +	drm_gpusvm_range_free_pages(gpusvm, range);
> +	__drm_gpusvm_range_remove(notifier, range);
> +	drm_gpusvm_notifier_unlock(gpusvm);
> +
> +	drm_gpusvm_range_put(range);
> +
> +	if (RB_EMPTY_ROOT(&notifier->root.rb_root)) {
> +		if (!notifier->flags.removed)
> +			mmu_interval_notifier_remove(&notifier->notifier);
> +		drm_gpusvm_notifier_remove(gpusvm, notifier);
> +		drm_gpusvm_notifier_free(gpusvm, notifier);
> +	}
> +}
> +
> +/**
> + * drm_gpusvm_range_get - Get a reference to GPU SVM range
> + * @range: Pointer to the GPU SVM range
> + *
> + * This function increments the reference count of the specified GPU SVM range.
> + *
> + * Returns:
> + * Pointer to the GPU SVM range.
> + */
> +struct drm_gpusvm_range *
> +drm_gpusvm_range_get(struct drm_gpusvm_range *range)
> +{
> +	kref_get(&range->refcount);
> +
> +	return range;
> +}
> +
> +/**
> + * drm_gpusvm_range_destroy - Destroy GPU SVM range
> + * @refcount: Pointer to the reference counter embedded in the GPU SVM range
> + *
> + * This function destroys the specified GPU SVM range when its reference count
> + * reaches zero. If a custom range-free function is provided, it is invoked to
> + * free the range; otherwise, the range is deallocated using kfree().
> + */
> +static void drm_gpusvm_range_destroy(struct kref *refcount)
> +{
> +	struct drm_gpusvm_range *range =
> +		container_of(refcount, struct drm_gpusvm_range, refcount);
> +	struct drm_gpusvm *gpusvm = range->gpusvm;
> +
> +	if (gpusvm->ops->range_free)
> +		gpusvm->ops->range_free(range);
> +	else
> +		kfree(range);
> +}
> +
> +/**
> + * drm_gpusvm_range_put - Put a reference to GPU SVM range
> + * @range: Pointer to the GPU SVM range
> + *
> + * This function decrements the reference count of the specified GPU SVM range
> + * and frees it when the count reaches zero.
> + */
> +void drm_gpusvm_range_put(struct drm_gpusvm_range *range)
> +{
> +	kref_put(&range->refcount, drm_gpusvm_range_destroy);
> +}
> +
> +/**
> + * drm_gpusvm_range_pages_valid - GPU SVM range pages valid
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + *
> + * This function determines if a GPU SVM range's pages are valid. Expected to
> + * be called while holding gpusvm->notifier_lock and as the last step before
> + * committing a GPU binding.
> + *
> + * Returns:
> + * True if GPU SVM range has valid pages, False otherwise
> + */
> +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> +				  struct drm_gpusvm_range *range)
> +{
> +	lockdep_assert_held(&gpusvm->notifier_lock);
> +
> +	return range->flags.has_vram_pages || range->flags.has_dma_mapping;
> +}
> +
> +/**
> + * drm_gpusvm_range_pages_valid_unlocked - GPU SVM range pages valid unlocked
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + *
> + * This function determines if a GPU SVM range's pages are valid. Expected to
> + * be called without holding gpusvm->notifier_lock.
> + *
> + * Returns:
> + * True if GPU SVM range has valid pages, False otherwise
> + */
> +static bool
> +drm_gpusvm_range_pages_valid_unlocked(struct drm_gpusvm *gpusvm,
> +				      struct drm_gpusvm_range *range)
> +{
> +	bool pages_valid;
> +
> +	if (!range->pages)
> +		return false;
> +
> +	drm_gpusvm_notifier_lock(gpusvm);
> +	pages_valid = drm_gpusvm_range_pages_valid(gpusvm, range);
> +	if (!pages_valid && range->flags.kfree_mapping) {
> +		kfree(range->dma_addr);
> +		range->flags.kfree_mapping = false;
> +		range->pages = NULL;
> +	}
> +	drm_gpusvm_notifier_unlock(gpusvm);
> +
> +	return pages_valid;
> +}
> +
> +/**
> + * drm_gpusvm_range_get_pages - Get pages for a GPU SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + * @ctx: GPU SVM context
> + *
> + * This function gets pages for a GPU SVM range and ensures they are mapped for
> + * DMA access.
> + *
> + * Returns:
> + * 0 on success, negative error code on failure.
> + */
> +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> +			       struct drm_gpusvm_range *range,
> +			       const struct drm_gpusvm_ctx *ctx)
> +{
> +	struct mmu_interval_notifier *notifier = &range->notifier->notifier;
> +	struct hmm_range hmm_range = {
> +		.default_flags = HMM_PFN_REQ_FAULT | (ctx->read_only ? 0 :
> +			HMM_PFN_REQ_WRITE),
> +		.notifier = notifier,
> +		.start = range->va.start,
> +		.end = range->va.end,
> +		.dev_private_owner = gpusvm->device_private_page_owner,
> +	};
> +	struct mm_struct *mm = gpusvm->mm;
> +	unsigned long timeout =
> +		jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> +	unsigned long i, j;
> +	unsigned long npages = npages_in_range(range->va.start, range->va.end);
> +	unsigned int order = 0;
> +	unsigned long *pfns;
> +	struct page **pages;
> +	int err = 0;
> +	bool vram_pages = !!range->flags.migrate_vram;
> +	bool alloc_pfns = false, kfree_mapping;
> +
> +retry:
> +	kfree_mapping = false;
> +	hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
> +	if (drm_gpusvm_range_pages_valid_unlocked(gpusvm, range))
> +		return 0;
> +
> +	if (range->notifier_seq == hmm_range.notifier_seq && range->pages) {
> +		if (ctx->prefault)
> +			return 0;
> +
> +		pfns = (unsigned long *)range->pages;
> +		pages = range->pages;
> +		goto map_pages;
> +	}
> +
> +	if (!range->pages) {
> +		pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
> +		if (!pfns)
> +			return -ENOMEM;
> +		alloc_pfns = true;
> +	} else {
> +		pfns = (unsigned long *)range->pages;
> +	}
> +
> +	if (!ctx->mmap_locked) {
> +		if (!mmget_not_zero(mm)) {
> +			err = -EFAULT;
> +			goto err_out;
> +		}
> +	}
> +
> +	hmm_range.hmm_pfns = pfns;
> +	while (true) {
> +		/* Must be checked after mmu_interval_read_begin */
> +		if (range->flags.unmapped) {
> +			err = -EFAULT;
> +			break;
> +		}
> +
> +		if (!ctx->mmap_locked) {
> +			/*
> +			 * XXX: HMM locking document indicates only a read-lock
> +			 * is required but there appears to be a window between
> +			 * the MMU_NOTIFY_MIGRATE event triggered in a CPU fault
> +			 * via migrate_vma_setup and the pages actually moving
> +			 * in migrate_vma_finalize in which this code can grab
> +			 * garbage pages. Grabbing the write-lock if the range
> +			 * is attached to vram appears to protect against this
> +			 * race.
> +			 */
> +			if (vram_pages)
> +				mmap_write_lock(mm);
> +			else
> +				mmap_read_lock(mm);
> +		}
> +		err = hmm_range_fault(&hmm_range);
> +		if (!ctx->mmap_locked) {
> +			if (vram_pages)
> +				mmap_write_unlock(mm);
> +			else
> +				mmap_read_unlock(mm);
> +		}
> +
> +		if (err == -EBUSY) {
> +			if (time_after(jiffies, timeout))
> +				break;
> +
> +			hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
> +			continue;
> +		}
> +		break;
> +	}
> +	if (!ctx->mmap_locked)
> +		mmput(mm);
> +	if (err)
> +		goto err_free;
> +
> +	pages = (struct page **)pfns;
> +
> +	if (ctx->prefault) {
> +		range->pages = pages;
> +		goto set_seqno;
> +	}
> +
> +map_pages:
> +	if (is_device_private_page(hmm_pfn_to_page(pfns[0]))) {
> +		WARN_ON_ONCE(!range->vram_allocation);
> +
> +		for (i = 0; i < npages; ++i) {
> +			pages[i] = hmm_pfn_to_page(pfns[i]);
> +
> +			if (WARN_ON_ONCE(!is_device_private_page(pages[i]))) {
> +				err = -EOPNOTSUPP;
> +				goto err_free;
> +			}
> +		}

You can't do the above, because the pfns you get from hmm come with zero
guarantees: you neither hold a page reference nor the page lock. The only
thing you can do is grab the pagetable lock (or mmu notifier locks) and
check it's still valid before you touch any state. I think the
range->vram_allocation is probably always valid since you clean that up
under the same lock/thread, but there's a good chance the vram allocation
is otherwise already gone for good. Or you get an inconsistent snapshot.
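
A rough sketch of the direction suggested above (untested, reusing the
patch's own helpers): take the notifier lock and check for a retry *before*
translating the pfns or dereferencing range->vram_allocation, rather than
after:

	/* Hypothetical reordering; names and labels as in the patch. */
	drm_gpusvm_notifier_lock(gpusvm);
	if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
		drm_gpusvm_notifier_unlock(gpusvm);
		err = -EAGAIN;
		goto retry;
	}
	/* The pfn snapshot is now stable until the lock is dropped. */
	for (i = 0; i < npages; ++i) {
		pages[i] = hmm_pfn_to_page(pfns[i]);
		if (WARN_ON_ONCE(!is_device_private_page(pages[i]))) {
			drm_gpusvm_notifier_unlock(gpusvm);
			err = -EOPNOTSUPP;
			goto err_free;
		}
	}
	range->flags.has_vram_pages = true;
	range->pages = pages;
	drm_gpusvm_notifier_unlock(gpusvm);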

> +
> +		/* Do not race with notifier unmapping pages */
> +		drm_gpusvm_notifier_lock(gpusvm);
> +		range->flags.has_vram_pages = true;
> +		range->pages = pages;
> +		if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
> +			err = -EAGAIN;
> +			__drm_gpusvm_range_unmap_pages(gpusvm, range);
> +		}
> +		drm_gpusvm_notifier_unlock(gpusvm);
> +	} else {
> +		dma_addr_t *dma_addr = (dma_addr_t *)pfns;
> +
> +		for_each_dma_page(i, j, npages, order) {
> +			if (WARN_ON_ONCE(i && order !=
> +					 hmm_pfn_to_map_order(pfns[i]))) {
> +				err = -EOPNOTSUPP;
> +				npages = i;
> +				goto err_unmap;
> +			}
> +			order = hmm_pfn_to_map_order(pfns[i]);
> +
> +			pages[j] = hmm_pfn_to_page(pfns[i]);
> +			if (WARN_ON_ONCE(is_zone_device_page(pages[j]))) {
> +				err = -EOPNOTSUPP;
> +				npages = i;
> +				goto err_unmap;
> +			}
> +
> +			set_page_dirty_lock(pages[j]);
> +			mark_page_accessed(pages[j]);

You can't do these, because you don't hold a page reference. They're also
not needed because hmm_range_fault goes through the full mkwrite dance,
which takes care of these, unlike the gup family of functions.
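
If that's right, the loop body would presumably shrink to just the
translation and the DMA mapping (sketch, untested):

	pages[j] = hmm_pfn_to_page(pfns[i]);
	if (WARN_ON_ONCE(is_zone_device_page(pages[j]))) {
		err = -EOPNOTSUPP;
		npages = i;
		goto err_unmap;
	}
	/* set_page_dirty_lock()/mark_page_accessed() dropped per the above. */
	dma_addr[j] = dma_map_page(gpusvm->drm->dev, pages[j], 0,
				   PAGE_SIZE << order, DMA_BIDIRECTIONAL);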

> +
> +			dma_addr[j] = dma_map_page(gpusvm->drm->dev,
> +						   pages[j], 0,
> +						   PAGE_SIZE << order,
> +						   DMA_BIDIRECTIONAL);
> +			if (dma_mapping_error(gpusvm->drm->dev, dma_addr[j])) {
> +				err = -EFAULT;
> +				npages = i;
> +				goto err_unmap;
> +			}

Aside: dma_map_page is about the only thing that's ok, because it doesn't
do anything harmful and especially doesn't make any assumption about what
that page is.

> +		}
> +
> +		/* Huge pages, reduce memory footprint */
> +		if (order) {
> +			dma_addr = kmalloc_array(j, sizeof(*dma_addr),
> +						 GFP_KERNEL);
> +			if (dma_addr) {
> +				for (i = 0; i < j; ++i)
> +					dma_addr[i] = (dma_addr_t)pfns[i];
> +				kvfree(pfns);
> +				kfree_mapping = true;
> +			} else {
> +				dma_addr = (dma_addr_t *)pfns;
> +			}
> +		}
> +
> +		/* Do not race with notifier unmapping pages */
> +		drm_gpusvm_notifier_lock(gpusvm);
> +		range->order = order;
> +		range->flags.kfree_mapping = kfree_mapping;
> +		range->flags.has_dma_mapping = true;
> +		range->dma_addr = dma_addr;
> +		range->vram_allocation = NULL;
> +		if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
> +			err = -EAGAIN;
> +			__drm_gpusvm_range_unmap_pages(gpusvm, range);
> +		}
> +		drm_gpusvm_notifier_unlock(gpusvm);
> +	}
> +
> +	if (err == -EAGAIN)
> +		goto retry;
> +set_seqno:
> +	range->notifier_seq = hmm_range.notifier_seq;
> +
> +	return 0;
> +
> +err_unmap:
> +	for_each_dma_page(i, j, npages, order)
> +		dma_unmap_page(gpusvm->drm->dev,
> +			       (dma_addr_t)pfns[j],
> +			       PAGE_SIZE << order, DMA_BIDIRECTIONAL);
> +err_free:
> +	if (alloc_pfns)
> +		kvfree(pfns);
> +err_out:
> +	return err;
> +}
> +
> +/**
> + * drm_gpusvm_range_unmap_pages - Unmap pages associated with a GPU SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + * @ctx: GPU SVM context
> + *
> + * This function unmaps pages associated with a GPU SVM range. If @in_notifier
> + * is set, it is assumed that gpusvm->notifier_lock is held in write mode; if it
> + * is clear, it acquires gpusvm->notifier_lock in read mode. Must be called on
> + * each GPU SVM range attached to the notifier in gpusvm->ops->invalidate for
> + * the IOMMU security model.
> + */
> +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> +				  struct drm_gpusvm_range *range,
> +				  const struct drm_gpusvm_ctx *ctx)
> +{
> +	if (ctx->in_notifier)
> +		lockdep_assert_held_write(&gpusvm->notifier_lock);
> +	else
> +		drm_gpusvm_notifier_lock(gpusvm);
> +
> +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> +
> +	if (!ctx->in_notifier)
> +		drm_gpusvm_notifier_unlock(gpusvm);
> +}
> +
> +/**
> + * drm_gpusvm_migration_put_page - Put a migration page
> + * @page: Pointer to the page to put
> + *
> + * This function unlocks and puts a page.
> + */
> +static void drm_gpusvm_migration_put_page(struct page *page)
> +{
> +	unlock_page(page);
> +	put_page(page);
> +}
> +
> +/**
> + * drm_gpusvm_migration_put_pages - Put migration pages
> + * @npages: Number of pages
> + * @migrate_pfn: Array of migrate page frame numbers
> + *
> + * This function puts an array of pages.
> + */
> +static void drm_gpusvm_migration_put_pages(unsigned long npages,
> +					   unsigned long *migrate_pfn)
> +{
> +	unsigned long i;
> +
> +	for (i = 0; i < npages; ++i) {
> +		if (!migrate_pfn[i])
> +			continue;
> +
> +		drm_gpusvm_migration_put_page(migrate_pfn_to_page(migrate_pfn[i]));
> +		migrate_pfn[i] = 0;
> +	}
> +}
> +
> +/**
> + * drm_gpusvm_get_vram_page - Get a reference to a VRAM page
> + * @page: Pointer to the page
> + * @zdd: Pointer to the GPU SVM zone device data
> + *
> + * This function associates the given page with the specified GPU SVM zone
> + * device data and initializes it for zone device usage.
> + */
> +static void drm_gpusvm_get_vram_page(struct page *page,
> +				     struct drm_gpusvm_zdd *zdd)
> +{
> +	page->zone_device_data = drm_gpusvm_zdd_get(zdd);
> +	zone_device_page_init(page);
> +}
> +
> +/**
> + * drm_gpusvm_migrate_map_pages() - Map migration pages for GPU SVM migration
> + * @dev: The device for which the pages are being mapped
> + * @dma_addr: Array to store DMA addresses corresponding to mapped pages
> + * @migrate_pfn: Array of migrate page frame numbers to map
> + * @npages: Number of pages to map
> + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> + *
> + * This function maps pages of memory for migration usage in GPU SVM. It
> + * iterates over each page frame number provided in @migrate_pfn, maps the
> + * corresponding page, and stores the DMA address in the provided @dma_addr
> + * array.
> + *
> + * Return: 0 on success, -EFAULT if an error occurs during mapping.
> + */
> +static int drm_gpusvm_migrate_map_pages(struct device *dev,
> +					dma_addr_t *dma_addr,
> +					long unsigned int *migrate_pfn,
> +					unsigned long npages,
> +					enum dma_data_direction dir)
> +{
> +	unsigned long i;
> +
> +	for (i = 0; i < npages; ++i) {
> +		struct page *page = migrate_pfn_to_page(migrate_pfn[i]);
> +
> +		if (!page)
> +			continue;
> +
> +		if (WARN_ON_ONCE(is_zone_device_page(page)))
> +			return -EFAULT;
> +
> +		dma_addr[i] = dma_map_page(dev, page, 0, PAGE_SIZE, dir);
> +		if (dma_mapping_error(dev, dma_addr[i]))
> +			return -EFAULT;
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * drm_gpusvm_migrate_unmap_pages() - Unmap pages previously mapped for GPU SVM migration
> + * @dev: The device for which the pages were mapped
> + * @dma_addr: Array of DMA addresses corresponding to mapped pages
> + * @npages: Number of pages to unmap
> + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> + *
> + * This function unmaps previously mapped pages of memory for GPU Shared Virtual
> + * Memory (SVM). It iterates over each DMA address provided in @dma_addr, checks
> + * if it's valid and not already unmapped, and unmaps the corresponding page.
> + */
> +static void drm_gpusvm_migrate_unmap_pages(struct device *dev,
> +					   dma_addr_t *dma_addr,
> +					   unsigned long npages,
> +					   enum dma_data_direction dir)
> +{
> +	unsigned long i;
> +
> +	for (i = 0; i < npages; ++i) {
> +		if (!dma_addr[i] || dma_mapping_error(dev, dma_addr[i]))
> +			continue;
> +
> +		dma_unmap_page(dev, dma_addr[i], PAGE_SIZE, dir);
> +	}
> +}
> +
> +/**
> + * drm_gpusvm_migrate_to_vram - Migrate GPU SVM range to VRAM
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + * @vram_allocation: Driver-private pointer to the VRAM allocation. The caller
> + *                   should hold a reference to the VRAM allocation, which
> + *                   should be dropped via ops->vram_allocation or upon the
> + *                   failure of this function.
> + * @ctx: GPU SVM context
> + *
> + * This function migrates the specified GPU SVM range to VRAM. It performs the
> + * necessary setup and invokes the driver-specific operations for migration to
> + * VRAM. Upon successful return, @vram_allocation can safely reference @range
> + * until ops->vram_release is called, which only happens upon a successful
> + * return of this function.
> + *
> + * Returns:
> + * 0 on success, negative error code on failure.
> + */
> +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> +			       struct drm_gpusvm_range *range,
> +			       void *vram_allocation,
> +			       const struct drm_gpusvm_ctx *ctx)
> +{
> +	u64 start = range->va.start, end = range->va.end;
> +	struct migrate_vma migrate = {
> +		.start		= start,
> +		.end		= end,
> +		.pgmap_owner	= gpusvm->device_private_page_owner,
> +		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
> +	};
> +	struct mm_struct *mm = gpusvm->mm;
> +	unsigned long i, npages = npages_in_range(start, end);
> +	struct vm_area_struct *vas;
> +	struct drm_gpusvm_zdd *zdd = NULL;
> +	struct page **pages;
> +	dma_addr_t *dma_addr;
> +	void *buf;
> +	int err;
> +
> +	if (!range->flags.migrate_vram)
> +		return -EINVAL;
> +
> +	if (!gpusvm->ops->populate_vram_pfn || !gpusvm->ops->copy_to_vram ||
> +	    !gpusvm->ops->copy_to_sram)
> +		return -EOPNOTSUPP;
> +
> +	if (!ctx->mmap_locked) {
> +		if (!mmget_not_zero(mm)) {
> +			err = -EFAULT;
> +			goto err_out;
> +		}
> +		mmap_write_lock(mm);
> +	}
> +
> +	mmap_assert_locked(mm);
> +
> +	vas = vma_lookup(mm, start);
> +	if (!vas) {
> +		err = -ENOENT;
> +		goto err_mmunlock;
> +	}
> +
> +	if (end > vas->vm_end || start < vas->vm_start) {
> +		err = -EINVAL;
> +		goto err_mmunlock;
> +	}
> +
> +	if (!vma_is_anonymous(vas)) {
> +		err = -EBUSY;
> +		goto err_mmunlock;
> +	}
> +
> +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) + sizeof(*dma_addr) +
> +		       sizeof(*pages), GFP_KERNEL);
> +	if (!buf) {
> +		err = -ENOMEM;
> +		goto err_mmunlock;
> +	}
> +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr)) * npages;
> +
> +	zdd = drm_gpusvm_zdd_alloc(range);
> +	if (!zdd) {
> +		err = -ENOMEM;
> +		goto err_free;
> +	}
> +
> +	migrate.vma = vas;
> +	migrate.src = buf;
> +	migrate.dst = migrate.src + npages;
> +
> +	err = migrate_vma_setup(&migrate);
> +	if (err)
> +		goto err_free;
> +
> +	/*
> +	 * FIXME: Below cases, !migrate.cpages and migrate.cpages != npages, not
> +	 * always an error. Need to revisit possible cases and how to handle. We
> +	 * could prefault on migrate.cpages != npages via hmm_range_fault.
> +	 */

Yeah, I think that especially under contention, partial migrations, at
least back to sram due to cpu faults, are pretty much expected. And you
need to cope somehow.
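
One possible shape for coping (very rough sketch; assumes the
populate_vram_pfn and copy hooks are taught to handle holes in the arrays):

	/* Hypothetical: accept a partial collection instead of -EBUSY. */
	if (!migrate.cpages) {
		err = -EFAULT;
		goto err_free;
	}

	for (i = 0; i < npages; ++i) {
		/* Pages the CPU kept (raced faults, locked, etc.) stay put. */
		if (!(migrate.src[i] & MIGRATE_PFN_MIGRATE)) {
			migrate.dst[i] = 0;
			pages[i] = NULL;
		}
	}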

> +
> +	if (!migrate.cpages) {
> +		err = -EFAULT;
> +		goto err_free;
> +	}
> +
> +	if (migrate.cpages != npages) {
> +		err = -EBUSY;
> +		goto err_finalize;
> +	}
> +
> +	err = gpusvm->ops->populate_vram_pfn(gpusvm, vram_allocation, npages,
> +					     migrate.dst);
> +	if (err)
> +		goto err_finalize;
> +
> +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev, dma_addr,
> +					   migrate.src, npages, DMA_TO_DEVICE);
> +	if (err)
> +		goto err_finalize;
> +
> +	for (i = 0; i < npages; ++i) {
> +		struct page *page = pfn_to_page(migrate.dst[i]);
> +
> +		pages[i] = page;
> +		migrate.dst[i] = migrate_pfn(migrate.dst[i]);
> +		drm_gpusvm_get_vram_page(page, zdd);
> +	}
> +
> +	err = gpusvm->ops->copy_to_vram(gpusvm, pages, dma_addr, npages);
> +	if (err)
> +		goto err_finalize;
> +
> +	/* Upon success bind vram allocation to range and zdd */
> +	range->vram_allocation = vram_allocation;
> +	WRITE_ONCE(zdd->vram_allocation, vram_allocation);	/* Owns ref */
> +
> +err_finalize:
> +	if (err)
> +		drm_gpusvm_migration_put_pages(npages, migrate.dst);
> +	migrate_vma_pages(&migrate);
> +	migrate_vma_finalize(&migrate);
> +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr, npages,
> +				       DMA_TO_DEVICE);
> +err_free:
> +	if (zdd)
> +		drm_gpusvm_zdd_put(zdd);
> +	kvfree(buf);
> +err_mmunlock:
> +	if (!ctx->mmap_locked) {
> +		mmap_write_unlock(mm);
> +		mmput(mm);
> +	}
> +err_out:
> +	return err;
> +}
> +
> +/**
> + * drm_gpusvm_migrate_populate_sram_pfn - Populate SRAM PFNs for a VM area
> + * @vas: Pointer to the VM area structure, can be NULL
> + * @npages: Number of pages to populate
> + * @src_mpfn: Source array of migrate PFNs
> + * @mpfn: Array of migrate PFNs to populate
> + * @addr: Start address for PFN allocation
> + *
> + * This function populates the SRAM migrate page frame numbers (PFNs) for the
> + * specified VM area structure. It allocates and locks pages in the VM area for
> + * SRAM usage. If @vas is non-NULL, alloc_page_vma() is used for allocation;
> + * otherwise alloc_page() is used.
> + *
> + * Returns:
> + * 0 on success, negative error code on failure.
> + */
> +static int drm_gpusvm_migrate_populate_sram_pfn(struct vm_area_struct *vas,
> +						unsigned long npages,
> +						unsigned long *src_mpfn,
> +						unsigned long *mpfn, u64 addr)
> +{
> +	unsigned long i;
> +
> +	for (i = 0; i < npages; ++i, addr += PAGE_SIZE) {
> +		struct page *page;
> +
> +		if (!(src_mpfn[i] & MIGRATE_PFN_MIGRATE))
> +			continue;
> +
> +		if (vas)
> +			page = alloc_page_vma(GFP_HIGHUSER, vas, addr);
> +		else
> +			page = alloc_page(GFP_HIGHUSER);
> +
> +		if (!page)
> +			return -ENOMEM;
> +
> +		lock_page(page);
> +		mpfn[i] = migrate_pfn(page_to_pfn(page));
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * drm_gpusvm_evict_to_sram - Evict GPU SVM range to SRAM
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + *
> + * Similar to __drm_gpusvm_migrate_to_sram() but does not require the mmap lock;
> + * migration is done via the migrate_device_* functions. This is a fallback
> + * path, as it is preferred to issue migrations with the mmap lock held.
> + *
> + * Returns:
> + * 0 on success, negative error code on failure.
> + */
> +static int drm_gpusvm_evict_to_sram(struct drm_gpusvm *gpusvm,
> +				    struct drm_gpusvm_range *range)
> +{
> +	unsigned long npages;
> +	struct page **pages;
> +	unsigned long *src, *dst;
> +	dma_addr_t *dma_addr;
> +	void *buf;
> +	int i, err = 0;
> +
> +	npages = npages_in_range(range->va.start, range->va.end);
> +
> +	buf = kvcalloc(npages, 2 * sizeof(*src) + sizeof(*dma_addr) +
> +		       sizeof(*pages), GFP_KERNEL);
> +	if (!buf) {
> +		err = -ENOMEM;
> +		goto err_out;
> +	}
> +	src = buf;
> +	dst = buf + (sizeof(*src) * npages);
> +	dma_addr = buf + (2 * sizeof(*src) * npages);
> +	pages = buf + (2 * sizeof(*src) + sizeof(*dma_addr)) * npages;
> +
> +	err = gpusvm->ops->populate_vram_pfn(gpusvm, range->vram_allocation,
> +					     npages, src);
> +	if (err)
> +		goto err_free;
> +
> +	err = migrate_device_vma_range(gpusvm->mm,
> +				       gpusvm->device_private_page_owner, src,
> +				       npages, range->va.start);
> +	if (err)
> +		goto err_free;
> +
> +	err = drm_gpusvm_migrate_populate_sram_pfn(NULL, npages, src, dst, 0);
> +	if (err)
> +		goto err_finalize;
> +
> +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev, dma_addr,
> +					   dst, npages, DMA_BIDIRECTIONAL);
> +	if (err)
> +		goto err_finalize;
> +
> +	for (i = 0; i < npages; ++i)
> +		pages[i] = migrate_pfn_to_page(src[i]);
> +
> +	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr, npages);
> +	if (err)
> +		goto err_finalize;
> +
> +err_finalize:
> +	if (err)
> +		drm_gpusvm_migration_put_pages(npages, dst);
> +	migrate_device_pages(src, dst, npages);
> +	migrate_device_finalize(src, dst, npages);
> +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr, npages,
> +				       DMA_BIDIRECTIONAL);
> +err_free:
> +	kvfree(buf);
> +err_out:
> +
> +	return err;
> +}
> +
> +/**
> + * __drm_gpusvm_migrate_to_sram - Migrate GPU SVM range to SRAM (internal)
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @vas: Pointer to the VM area structure
> + * @page: Pointer to the page for fault handling (can be NULL)
> + * @start: Start address of the migration range
> + * @end: End address of the migration range
> + *
> + * This internal function performs the migration of the specified GPU SVM range
> + * to SRAM. It sets up the migration, populates + dma maps SRAM PFNs, and
> + * invokes the driver-specific operations for migration to SRAM.
> + *
> + * Returns:
> + * 0 on success, negative error code on failure.
> + */
> +static int __drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> +					struct vm_area_struct *vas,
> +					struct page *page,
> +					u64 start, u64 end)
> +{
> +	struct migrate_vma migrate = {
> +		.vma		= vas,
> +		.pgmap_owner	= gpusvm->device_private_page_owner,
> +		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> +		.fault_page	= page,
> +	};
> +	unsigned long npages;
> +	struct page **pages;
> +	dma_addr_t *dma_addr;
> +	void *buf;
> +	int i, err = 0;
> +
> +	mmap_assert_locked(gpusvm->mm);

That's the wrong mm, at least for the ->migrate_to_ram path. You might be
called on an anon mapping from a child process. That also means that the
vma you're looking at might have no relationship with anything you're
tracking in your gpusvm.
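
A minimal sketch of what that would imply here (assuming the rest of the
function is reworked to use the vma's mm throughout):

	/*
	 * Hypothetical: use the mm that owns the faulting vma, which may not
	 * be gpusvm->mm when e.g. a child process faults on shared anon
	 * memory.
	 */
	struct mm_struct *mm = vas->vm_mm;

	mmap_assert_locked(mm);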

> +
> +	/* Corner case where the VMA has been partially unmapped */
> +	if (start < vas->vm_start)
> +		start = vas->vm_start;
> +	if (end > vas->vm_end)
> +		end = vas->vm_end;
> +
> +	migrate.start = start;
> +	migrate.end = end;
> +	npages = npages_in_range(start, end);
> +
> +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) + sizeof(*dma_addr) +
> +		       sizeof(*pages), GFP_KERNEL);
> +	if (!buf) {
> +		err = -ENOMEM;
> +		goto err_out;
> +	}
> +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr)) * npages;
> +
> +	migrate.vma = vas;
> +	migrate.src = buf;
> +	migrate.dst = migrate.src + npages;
> +
> +	err = migrate_vma_setup(&migrate);
> +	if (err)
> +		goto err_free;
> +
> +	/* Raced with another CPU fault, nothing to do */
> +	if (!migrate.cpages)
> +		goto err_free;
> +
> +	err = drm_gpusvm_migrate_populate_sram_pfn(vas, npages,
> +						   migrate.src, migrate.dst,
> +						   start);
> +	if (err)
> +		goto err_finalize;
> +
> +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev, dma_addr,
> +					   migrate.dst, npages,
> +					   DMA_BIDIRECTIONAL);
> +	if (err)
> +		goto err_finalize;
> +
> +	for (i = 0; i < npages; ++i)
> +		pages[i] = migrate_pfn_to_page(migrate.src[i]);
> +
> +	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr, npages);
> +	if (err)
> +		goto err_finalize;
> +
> +err_finalize:
> +	if (err)
> +		drm_gpusvm_migration_put_pages(npages, migrate.dst);
> +	migrate_vma_pages(&migrate);
> +	migrate_vma_finalize(&migrate);
> +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr, npages,
> +				       DMA_BIDIRECTIONAL);
> +err_free:
> +	kvfree(buf);
> +err_out:
> +	mmap_assert_locked(gpusvm->mm);
> +
> +	return err;
> +}
> +
> +/**
> + * drm_gpusvm_migrate_to_sram - Migrate (evict) GPU SVM range to SRAM
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + * @ctx: GPU SVM context
> + *
> + * This function initiates the migration of the specified GPU SVM range to
> + * SRAM. It performs necessary checks and invokes the internal migration
> + * function for actual migration.
> + *
> + * Returns:
> + * 0 on success, negative error code on failure.
> + */
> +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> +			       struct drm_gpusvm_range *range,
> +			       const struct drm_gpusvm_ctx *ctx)
> +{
> +	u64 start = range->va.start, end = range->va.end;
> +	struct mm_struct *mm = gpusvm->mm;
> +	struct vm_area_struct *vas;
> +	int err;
> +	bool retry = false;
> +
> +	if (!ctx->mmap_locked) {
> +		if (!mmget_not_zero(mm)) {
> +			err = -EFAULT;
> +			goto err_out;
> +		}
> +		if (ctx->trylock_mmap) {
> +			if (!mmap_read_trylock(mm))  {
> +				err = drm_gpusvm_evict_to_sram(gpusvm, range);
> +				goto err_mmput;
> +			}
> +		} else {
> +			mmap_read_lock(mm);
> +		}
> +	}
> +
> +	mmap_assert_locked(mm);
> +
> +	/*
> +	 * Loop required to find all VMAs for the corner case when the VRAM
> +	 * backing has been partially unmapped from the MM's address space.
> +	 */
> +again:
> +	vas = find_vma(mm, start);
> +	if (!vas) {
> +		if (!retry)
> +			err = -ENOENT;
> +		goto err_mmunlock;
> +	}
> +
> +	if (end <= vas->vm_start || start >= vas->vm_end) {
> +		if (!retry)
> +			err = -EINVAL;
> +		goto err_mmunlock;
> +	}
> +
> +	err = __drm_gpusvm_migrate_to_sram(gpusvm, vas, NULL, start, end);
> +	if (err)
> +		goto err_mmunlock;
> +
> +	if (vas->vm_end < end) {
> +		retry = true;
> +		start = vas->vm_end;
> +		goto again;
> +	}
> +
> +	if (!ctx->mmap_locked) {
> +		mmap_read_unlock(mm);
> +		/*
> +		 * Using mmput_async as this function can be called while
> +		 * holding a dma-resv lock, and a final put can grab the mmap
> +		 * lock, causing a lock inversion.
> +		 */
> +		mmput_async(mm);
> +	}
> +
> +	return 0;
> +
> +err_mmunlock:
> +	if (!ctx->mmap_locked)
> +		mmap_read_unlock(mm);
> +err_mmput:
> +	if (!ctx->mmap_locked)
> +		mmput_async(mm);
> +err_out:
> +	return err;
> +}
> +
> +/**
> + * drm_gpusvm_page_free - Put GPU SVM zone device data associated with a page
> + * @page: Pointer to the page
> + *
> + * This function is a callback used to put the GPU SVM zone device data
> + * associated with a page when it is being released.
> + */
> +static void drm_gpusvm_page_free(struct page *page)
> +{
> +	drm_gpusvm_zdd_put(page->zone_device_data);
> +}
> +
> +/**
> + * drm_gpusvm_migrate_to_ram - Migrate GPU SVM range to RAM (page fault handler)
> + * @vmf: Pointer to the fault information structure
> + *
> + * This function is a page fault handler used to migrate a GPU SVM range to RAM.
> + * It retrieves the GPU SVM range information from the faulting page and invokes
> + * the internal migration function to migrate the range back to RAM.
> + *
> + * Returns:
> + * VM_FAULT_SIGBUS on failure, 0 on success.
> + */
> +static vm_fault_t drm_gpusvm_migrate_to_ram(struct vm_fault *vmf)
> +{
> +	struct drm_gpusvm_zdd *zdd = vmf->page->zone_device_data;
> +	int err;
> +
> +	err = __drm_gpusvm_migrate_to_sram(zdd->range->gpusvm,

So I think zdd->range doesn't work, because even within a single mm the
vma mapping a given piece of anon memory does not need to be unique, you
can duplicate them with mremap.

So all you have here is the physical memory and the vma, which might or
might not be from the same process as gpusvm->mm.

Also the child process scenario means that using mmap_write on the fault
side doesn't stop all cpu faults from migrating stuff back.

Somewhat aside, but I think that means amdkfd's svm_range->migration_mutex
is busted, because it's va based and so misses concurrently ongoing
different mappings moving physical storage around underneath.
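
If that's right, the fault handler probably has to work from the faulting
vma and page alone. Something in this direction, perhaps (very rough, names
from the patch, ALIGN_DOWN assumes the chunk size is a power of two,
untested):

	static vm_fault_t drm_gpusvm_migrate_to_ram(struct vm_fault *vmf)
	{
		struct drm_gpusvm_zdd *zdd = vmf->page->zone_device_data;
		struct vm_area_struct *vma = vmf->vma;
		/* Window size taken from the range, clamped to *this* vma. */
		unsigned long size = zdd->range->va.end - zdd->range->va.start;
		unsigned long start = max(ALIGN_DOWN(vmf->address, size),
					  vma->vm_start);
		unsigned long end = min(start + size, vma->vm_end);
		int err;

		err = __drm_gpusvm_migrate_to_sram(zdd->range->gpusvm, vma,
						   vmf->page, start, end);

		return err ? VM_FAULT_SIGBUS : 0;
	}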


Cheers, Sima

> +					   vmf->vma, vmf->page,
> +					   zdd->range->va.start,
> +					   zdd->range->va.end);
> +
> +	return err ? VM_FAULT_SIGBUS : 0;
> +}
> +
> +/**
> + * drm_gpusvm_pagemap_ops - Device page map operations for GPU SVM
> + */
> +static const struct dev_pagemap_ops drm_gpusvm_pagemap_ops = {
> +	.page_free = drm_gpusvm_page_free,
> +	.migrate_to_ram = drm_gpusvm_migrate_to_ram,
> +};
> +
> +/**
> + * drm_gpusvm_pagemap_ops_get - Retrieve GPU SVM device page map operations
> + *
> + * Returns:
> + * Pointer to the GPU SVM device page map operations structure.
> + */
> +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void)
> +{
> +	return &drm_gpusvm_pagemap_ops;
> +}
> +
> +/**
> + * drm_gpusvm_has_mapping - Check if GPU SVM has mapping for the given address range
> + * @gpusvm: Pointer to the GPU SVM structure.
> + * @start: Start address
> + * @end: End address
> + *
> + * Returns:
> + * True if GPU SVM has mapping, False otherwise
> + */
> +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start, u64 end)
> +{
> +	struct drm_gpusvm_notifier *notifier;
> +
> +	drm_gpusvm_for_each_notifier(notifier, gpusvm, start, end) {
> +		struct drm_gpusvm_range *range = NULL;
> +
> +		drm_gpusvm_for_each_range(range, notifier, start, end)
> +			return true;
> +	}
> +
> +	return false;
> +}
> diff --git a/drivers/gpu/drm/xe/drm_gpusvm.h b/drivers/gpu/drm/xe/drm_gpusvm.h
> new file mode 100644
> index 000000000000..0ea70f8534a8
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/drm_gpusvm.h
> @@ -0,0 +1,415 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2024 Intel Corporation
> + */
> +
> +#ifndef __DRM_GPUSVM_H__
> +#define __DRM_GPUSVM_H__
> +
> +#include <linux/kref.h>
> +#include <linux/mmu_notifier.h>
> +#include <linux/workqueue.h>
> +
> +struct dev_pagemap_ops;
> +struct drm_device;
> +struct drm_gpusvm;
> +struct drm_gpusvm_notifier;
> +struct drm_gpusvm_ops;
> +struct drm_gpusvm_range;
> +
> +/**
> + * struct drm_gpusvm_ops - Operations structure for GPU SVM
> + *
> + * This structure defines the operations for GPU Shared Virtual Memory (SVM).
> + * These operations are provided by the GPU driver to manage SVM ranges and
> + * perform operations such as migration between VRAM and system RAM.
> + */
> +struct drm_gpusvm_ops {
> +	/**
> +	 * @notifier_alloc: Allocate a GPU SVM notifier (optional)
> +	 *
> +	 * This function shall allocate a GPU SVM notifier.
> +	 *
> +	 * Returns:
> +	 * Pointer to the allocated GPU SVM notifier on success, NULL on failure.
> +	 */
> +	struct drm_gpusvm_notifier *(*notifier_alloc)(void);
> +
> +	/**
> +	 * @notifier_free: Free a GPU SVM notifier (optional)
> +	 * @notifier: Pointer to the GPU SVM notifier to be freed
> +	 *
> +	 * This function shall free a GPU SVM notifier.
> +	 */
> +	void (*notifier_free)(struct drm_gpusvm_notifier *notifier);
> +
> +	/**
> +	 * @range_alloc: Allocate a GPU SVM range (optional)
> +	 * @gpusvm: Pointer to the GPU SVM
> +	 *
> +	 * This function shall allocate a GPU SVM range.
> +	 *
> +	 * Returns:
> +	 * Pointer to the allocated GPU SVM range on success, NULL on failure.
> +	 */
> +	struct drm_gpusvm_range *(*range_alloc)(struct drm_gpusvm *gpusvm);
> +
> +	/**
> +	 * @range_free: Free a GPU SVM range (optional)
> +	 * @range: Pointer to the GPU SVM range to be freed
> +	 *
> +	 * This function shall free a GPU SVM range.
> +	 */
> +	void (*range_free)(struct drm_gpusvm_range *range);
> +
> +	/**
> +	 * @vram_release: Release VRAM allocation (optional)
> +	 * @vram_allocation: Driver-private pointer to the VRAM allocation
> +	 *
> +	 * This function shall release VRAM allocation and expects to drop a
> +	 * reference to VRAM allocation.
> +	 */
> +	void (*vram_release)(void *vram_allocation);
> +
> +	/**
> +	 * @populate_vram_pfn: Populate VRAM PFN (required for migration)
> +	 * @gpusvm: Pointer to the GPU SVM
> +	 * @vram_allocation: Driver-private pointer to the VRAM allocation
> +	 * @npages: Number of pages to populate
> +	 * @pfn: Array of page frame numbers to populate
> +	 *
> +	 * This function shall populate VRAM page frame numbers (PFN).
> +	 *
> +	 * Returns:
> +	 * 0 on success, a negative error code on failure.
> +	 */
> +	int (*populate_vram_pfn)(struct drm_gpusvm *gpusvm,
> +				 void *vram_allocation,
> +				 unsigned long npages,
> +				 unsigned long *pfn);
> +
> +	/**
> +	 * @copy_to_vram: Copy to VRAM (required for migration)
> +	 * @gpusvm: Pointer to the GPU SVM
> +	 * @pages: Pointer to array of VRAM pages (destination)
> +	 * @dma_addr: Pointer to array of DMA addresses (source)
> +	 * @npages: Number of pages to copy
> +	 *
> +	 * This function shall copy pages to VRAM.
> +	 *
> +	 * Returns:
> +	 * 0 on success, a negative error code on failure.
> +	 */
> +	int (*copy_to_vram)(struct drm_gpusvm *gpusvm,
> +			    struct page **pages,
> +			    dma_addr_t *dma_addr,
> +			    unsigned long npages);
> +
> +	/**
> +	 * @copy_to_sram: Copy to system RAM (required for migration)
> +	 * @gpusvm: Pointer to the GPU SVM
> +	 * @pages: Pointer to array of VRAM pages (source)
> +	 * @dma_addr: Pointer to array of DMA addresses (destination)
> +	 * @npages: Number of pages to copy
> +	 *
> +	 * This function shall copy pages to system RAM.
> +	 *
> +	 * Returns:
> +	 * 0 on success, a negative error code on failure.
> +	 */
> +	int (*copy_to_sram)(struct drm_gpusvm *gpusvm,
> +			    struct page **pages,
> +			    dma_addr_t *dma_addr,
> +			    unsigned long npages);
> +
> +	/**
> +	 * @invalidate: Invalidate GPU SVM notifier (required)
> +	 * @gpusvm: Pointer to the GPU SVM
> +	 * @notifier: Pointer to the GPU SVM notifier
> +	 * @mmu_range: Pointer to the mmu_notifier_range structure
> +	 *
> +	 * This function shall invalidate the GPU page tables. It can safely
> +	 * walk the notifier range RB tree/list in this function. Called while
> +	 * holding the notifier lock.
> +	 */
> +	void (*invalidate)(struct drm_gpusvm *gpusvm,
> +			   struct drm_gpusvm_notifier *notifier,
> +			   const struct mmu_notifier_range *mmu_range);
> +};
> +
> +/**
> + * struct drm_gpusvm_notifier - Structure representing a GPU SVM notifier
> + *
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: MMU interval notifier
> + * @interval: Interval for the notifier
> + * @rb: Red-black tree node for the parent GPU SVM structure notifier tree
> + * @root: Cached root node of the RB tree containing ranges
> + * @range_list: List head of ranges in the same order they appear in the
> + *              interval tree. This is useful to keep iterating ranges while
> + *              modifying the RB tree.
> + * @flags.removed: Flag indicating whether the MMU interval notifier has been
> + *                 removed
> + *
> + * This structure represents a GPU SVM notifier.
> + */
> +struct drm_gpusvm_notifier {
> +	struct drm_gpusvm *gpusvm;
> +	struct mmu_interval_notifier notifier;
> +	struct {
> +		u64 start;
> +		u64 end;
> +	} interval;
> +	struct {
> +		struct rb_node node;
> +		struct list_head entry;
> +		u64 __subtree_last;
> +	} rb;
> +	struct rb_root_cached root;
> +	struct list_head range_list;
> +	struct {
> +		u32 removed : 1;
> +	} flags;
> +};
> +
> +/**
> + * struct drm_gpusvm_range - Structure representing a GPU SVM range
> + *
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: Pointer to the GPU SVM notifier
> + * @refcount: Reference count for the range
> + * @rb: Red-black tree node for the parent GPU SVM notifier structure range tree
> + * @va: Virtual address range
> + * @notifier_seq: Notifier sequence number of the range's pages
> + * @pages: Pointer to the array of pages (if backing store is in VRAM)
> + * @dma_addr: DMA address array (if backing store is SRAM and DMA mapped)
> + * @vram_allocation: Driver-private pointer to the VRAM allocation
> + * @order: Order of dma mapping. i.e. PAGE_SIZE << order is mapping size
> + * @flags.migrate_vram: Flag indicating whether the range can be migrated to VRAM
> + * @flags.unmapped: Flag indicating if the range has been unmapped
> + * @flags.partial_unmap: Flag indicating if the range has been partially unmapped
> + * @flags.has_vram_pages: Flag indicating if the range has vram pages
> + * @flags.has_dma_mapping: Flag indicating if the range has a DMA mapping
> + * @flags.kfree_mapping: Flag indicating @dma_addr is a compact allocation based
> + *                       on @order which releases via kfree
> + *
> + * This structure represents a GPU SVM range used for tracking memory ranges
> + * mapped in a DRM device.
> + */
> +struct drm_gpusvm_range {
> +	struct drm_gpusvm *gpusvm;
> +	struct drm_gpusvm_notifier *notifier;
> +	struct kref refcount;
> +	struct {
> +		struct rb_node node;
> +		struct list_head entry;
> +		u64 __subtree_last;
> +	} rb;
> +	struct {
> +		u64 start;
> +		u64 end;
> +	} va;
> +	unsigned long notifier_seq;
> +	union {
> +		struct page **pages;
> +		dma_addr_t *dma_addr;
> +	};
> +	void *vram_allocation;
> +	u16 order;
> +	struct {
> +		/* All flags below must be set upon creation */
> +		u16 migrate_vram : 1;
> +		/* All flags below must be set / cleared under notifier lock */
> +		u16 unmapped : 1;
> +		u16 partial_unmap : 1;
> +		u16 has_vram_pages : 1;
> +		u16 has_dma_mapping : 1;
> +		u16 kfree_mapping : 1;
> +	} flags;
> +};
> +
> +/**
> + * struct drm_gpusvm - GPU SVM structure
> + *
> + * @name: Name of the GPU SVM
> + * @drm: Pointer to the DRM device structure
> + * @mm: Pointer to the mm_struct for the address space
> + * @device_private_page_owner: Device private pages owner
> + * @mm_start: Start address of GPU SVM
> + * @mm_range: Range of the GPU SVM
> + * @notifier_size: Size of individual notifiers
> + * @ops: Pointer to the operations structure for GPU SVM
> + * @chunk_sizes: Pointer to the array of chunk sizes used in range allocation.
> + *               Entries should be powers of 2 in descending order.
> + * @num_chunks: Number of chunks
> + * @notifier_lock: Read-write semaphore for protecting notifier operations
> + * @zdd_wq: Workqueue for deferred work on zdd destruction
> + * @root: Cached root node of the Red-Black tree containing GPU SVM notifiers
> + * @notifier_list: List head of notifiers in the same order they appear in the
> + *                 interval tree. This is useful to keep iterating notifiers
> + *                 while modifying the RB tree.
> + *
> + * This structure represents a GPU SVM (Shared Virtual Memory) used for tracking
> + * memory ranges mapped in a DRM (Direct Rendering Manager) device.
> + *
> + * No reference counting is provided, as this is expected to be embedded in the
> + * driver VM structure along with the struct drm_gpuvm, which handles reference
> + * counting.
> + */
> +struct drm_gpusvm {
> +	const char *name;
> +	struct drm_device *drm;
> +	struct mm_struct *mm;
> +	void *device_private_page_owner;
> +	u64 mm_start;
> +	u64 mm_range;
> +	u64 notifier_size;
> +	const struct drm_gpusvm_ops *ops;
> +	const u64 *chunk_sizes;
> +	int num_chunks;
> +	struct rw_semaphore notifier_lock;
> +	struct workqueue_struct *zdd_wq;
> +	struct rb_root_cached root;
> +	struct list_head notifier_list;
> +};
> +
> +/**
> + * struct drm_gpusvm_ctx - DRM GPU SVM context
> + *
> + * @mmap_locked: mmap lock is locked
> + * @trylock_mmap: trylock mmap lock, used to avoid locking inversions
> + *                (e.g. dma-resv -> mmap lock)
> + * @in_notifier: entering from a MMU notifier
> + * @read_only: operating on read-only memory
> + * @vram_possible: possible to use VRAM
> + * @prefault: prefault pages
> + *
> + * Context that DRM GPU SVM is operating in (i.e., user arguments).
> + */
> +struct drm_gpusvm_ctx {
> +	u32 mmap_locked :1;
> +	u32 trylock_mmap :1;
> +	u32 in_notifier :1;
> +	u32 read_only :1;
> +	u32 vram_possible :1;
> +	u32 prefault :1;
> +};
> +
> +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> +		    const char *name, struct drm_device *drm,
> +		    struct mm_struct *mm, void *device_private_page_owner,
> +		    u64 mm_start, u64 mm_range, u64 notifier_size,
> +		    const struct drm_gpusvm_ops *ops,
> +		    const u64 *chunk_sizes, int num_chunks);
> +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm);
> +void drm_gpusvm_free(struct drm_gpusvm *gpusvm);
> +
> +struct drm_gpusvm_range *
> +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64 fault_addr,
> +				u64 gpuva_start, u64 gpuva_end,
> +				const struct drm_gpusvm_ctx *ctx);
> +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> +			     struct drm_gpusvm_range *range);
> +
> +struct drm_gpusvm_range *
> +drm_gpusvm_range_get(struct drm_gpusvm_range *range);
> +void drm_gpusvm_range_put(struct drm_gpusvm_range *range);
> +
> +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> +				  struct drm_gpusvm_range *range);
> +
> +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> +			       struct drm_gpusvm_range *range,
> +			       const struct drm_gpusvm_ctx *ctx);
> +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> +				  struct drm_gpusvm_range *range,
> +				  const struct drm_gpusvm_ctx *ctx);
> +
> +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> +			       struct drm_gpusvm_range *range,
> +			       void *vram_allocation,
> +			       const struct drm_gpusvm_ctx *ctx);
> +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> +			       struct drm_gpusvm_range *range,
> +			       const struct drm_gpusvm_ctx *ctx);
> +
> +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void);
> +
> +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start, u64 end);
> +
> +struct drm_gpusvm_range *
> +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64 start, u64 end);
> +
> +/**
> + * drm_gpusvm_notifier_lock - Lock GPU SVM notifier
> + * @gpusvm__: Pointer to the GPU SVM structure.
> + *
> + * Abstract client usage GPU SVM notifier lock, take lock
> + */
> +#define drm_gpusvm_notifier_lock(gpusvm__)	\
> +	down_read(&(gpusvm__)->notifier_lock)
> +
> +/**
> + * drm_gpusvm_notifier_unlock - Unlock GPU SVM notifier
> + * @gpusvm__: Pointer to the GPU SVM structure.
> + *
> + * Abstract client usage of the GPU SVM notifier lock, drop lock
> + */
> +#define drm_gpusvm_notifier_unlock(gpusvm__)	\
> +	up_read(&(gpusvm__)->notifier_lock)
> +
> +/**
> + * __drm_gpusvm_range_next - Get the next GPU SVM range in the list
> + * @range: a pointer to the current GPU SVM range
> + *
> + * Return: A pointer to the next drm_gpusvm_range if available, or NULL if the
> + *         current range is the last one or if the input range is NULL.
> + */
> +static inline struct drm_gpusvm_range *
> +__drm_gpusvm_range_next(struct drm_gpusvm_range *range)
> +{
> +	if (range && !list_is_last(&range->rb.entry,
> +				   &range->notifier->range_list))
> +		return list_next_entry(range, rb.entry);
> +
> +	return NULL;
> +}
> +
> +/**
> + * drm_gpusvm_for_each_range - Iterate over GPU SVM ranges in a notifier
> + * @range__: Iterator variable for the ranges. If set, it indicates the start of
> + *	     the iterator. If NULL, call drm_gpusvm_range_find() to get the range.
> + * @notifier__: Pointer to the GPU SVM notifier
> + * @start__: Start address of the range
> + * @end__: End address of the range
> + *
> + * This macro is used to iterate over GPU SVM ranges in a notifier. It is safe
> + * to use while holding the driver SVM lock or the notifier lock.
> + */
> +#define drm_gpusvm_for_each_range(range__, notifier__, start__, end__)	\
> +	for ((range__) = (range__) ?:					\
> +	     drm_gpusvm_range_find((notifier__), (start__), (end__));	\
> +	     (range__) && (range__->va.start < (end__));		\
> +	     (range__) = __drm_gpusvm_range_next(range__))
> +
> +/**
> + * drm_gpusvm_range_set_unmapped - Mark a GPU SVM range as unmapped
> + * @range: Pointer to the GPU SVM range structure.
> + * @mmu_range: Pointer to the MMU notifier range structure.
> + *
> + * This function marks a GPU SVM range as unmapped and sets the partial_unmap flag
> + * if the range partially falls within the provided MMU notifier range.
> + */
> +static inline void
> +drm_gpusvm_range_set_unmapped(struct drm_gpusvm_range *range,
> +			      const struct mmu_notifier_range *mmu_range)
> +{
> +	lockdep_assert_held_write(&range->gpusvm->notifier_lock);
> +
> +	range->flags.unmapped = true;
> +	if (range->va.start < mmu_range->start ||
> +	    range->va.end > mmu_range->end)
> +		range->flags.partial_unmap = true;
> +}
> +
> +#endif /* __DRM_GPUSVM_H__ */
> -- 
> 2.34.1
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 23/28] drm/xe: Add SVM VRAM migration
  2024-08-29  9:24     ` Christian König
@ 2024-08-29  9:53       ` Thomas Hellström
  2024-08-29 11:02         ` Daniel Vetter
  2024-08-29 14:30         ` Christian König
  2024-08-29 21:48       ` Matthew Brost
  1 sibling, 2 replies; 100+ messages in thread
From: Thomas Hellström @ 2024-08-29  9:53 UTC (permalink / raw)
  To: Christian König, Daniel Vetter, Matthew Brost
  Cc: intel-xe, dri-devel, airlied, matthew.auld, daniel,
	Paneer Selvam, Arunpravin

Hi, Christian,

On Thu, 2024-08-29 at 11:24 +0200, Christian König wrote:
> 

...

> > > - Unified eviction is required (SVM VRAM and TTM BOs need to be
> > > able to
> > >    evict each other).
> > So core mm handles this by just roughly equally shrinking
> > everything.
> > Seems to work, and it has a pile of object shrinkers, and the page
> > lru is
> > also split into page cache and anon memory.
> > 
> > I think you need to put in more justification that unified eviction
> > is
> > required than just stating it, because a look at mm/ gives a very
> > well
> > established counterexample.
> > 
> > > - For exhaustive eviction [1], SVM VRAM allocations will almost
> > > certainly
> > >    require a dma-resv.
> > So from the TTM side we need exhaustive eviction, or at least
> > something a
> > bit more exhaustive than what ttm currently has. Note that i915-gem
> > also
> > never really got to perfect exhaustive eviction, it's just a pile
> > better
> > than ttm right now.
> 
> Please define what exhaustive eviction should mean? I think I know
> what 
> it is and I have been pushing TTM into the direction of solving this
> for 
> years.

We internally refer to exhaustive eviction as meaning that a client is
always guaranteed to eventually make progress in obtaining non-pinned
vram, typically by incrementally locking and keeping dma-resvs across a
single validation, including validations during buffer object
allocations.
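
Roughly, with drm_exec that looks like the sketch below; xe_vm_validate_one()
is a made-up placeholder for one unit of validation work, only the
lock-holding pattern matters:

#include <drm/drm_exec.h>

static int xe_vm_validate_exhaustive(struct xe_vm *vm)
{
	struct drm_exec exec;
	int err = 0;

	drm_exec_init(&exec, DRM_EXEC_INTERRUPTIBLE_WAIT, 0);
	drm_exec_until_all_locked(&exec) {
		/*
		 * Made-up helper: locks the dma-resvs of everything needed
		 * for one validation via @exec, evicting other allocations
		 * as required. The locks are only dropped in drm_exec_fini(),
		 * so the VRAM just obtained cannot be stolen back before
		 * this client has made progress.
		 */
		err = xe_vm_validate_one(vm, &exec);
		drm_exec_retry_on_contention(&exec);
		if (err)
			break;
	}
	drm_exec_fini(&exec);

	return err;
}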

> 
> The last missing puzzle piece is to use drm_exec for TTM evictions,

and IMO keeping the dma-resv locks grabbed during eviction until at
least one unit of progress (one validation) has succeeded.

> but 
> apart from that everything should work now.
> 
> 
> Regards,
> Christian.

But as Sima pointed out in private communication, exhaustive eviction
is not really needed for faulting to make (crawling) progress.
Watermarks and VRAM trylock shrinking should suffice, since we're
strictly only required to service a single gpu page granule at a time.

However, ordinary bo-based jobs would still like to be able to
completely evict SVM vram. Whether that is important enough to strive
for is ofc up for discussion.

/Thomas



> 
> > Now if there's also SVM VRAM managed on a page lru, TTM exhaustive
> > eviction is going to win because the shrinkers can only trylock
> > dma_resv.
> > So this part works. It actually works so well on the system memory
> > side
> > that if we're not careful we can trigger oom, because we're too
> > good at
> > getting at all the memory.
> > 
> > SVM VRAM allocations otoh do not need exhaustive evictions. Or at
> > least I
> > don't see why, because the idea is that thanks to gpu and cpu page
> > faults,
> > you can always get out of a pinch by just trashing everything for a
> > while
> > and migrating the handfull of available pages a lot.
> > 
> > > - Likely allocation size is 2M which makes of size of BO (872)
> > >    acceptable per allocation (872 / 2M == .0004158).
> > > 
> > > With this, using TTM BO for VRAM backing store seems to be an
> > > obvious
> > > choice as it allows leveraging of the TTM eviction code.
> > Except it requires that you hold dma_resv, which brings in all
> > kinds of
> > pain. And for eviction we really don't need a lot of
> > synchronization, so a
> > lot of that locking is not needed, unlike the case where we have a
> > cpu
> > fault, where we absolutely need mmap_lock and all that to make sure
> > we
> > fault in the right page.
> > 
> > But for eviction we only need to throw out some pages, if we're not
> > entirely precise with picking the right ones (or have no idea into
> > which
> > vma they're all currently mapped into) it doesn't matter. That's
> > why
> > migrate_device_pages doesn't care about any of that at all, it
> > doesn't
> > need to by design. But by bo backing memory you drag in all that
> > stuff
> > that's causing headacheds for eviction.
> > 
> > The only thing migration tries to do is remove all pte, and if that
> > succeeds, move the page. Specialized for the gpusvm case, looking
> > at mm/
> > code as cheat sheet, we need roughly:
> > 
> > - reverse mapping structure like anon_vma. Except gpusvm can assume
> > that
> >    there's currently only one gpu side mapping, so we can just
> > stuff the
> >    gpusvm an va_address into the page, and protect it with the page
> > lock.
> > 
> > - we need pagetable locks, so that we can manipulate pagetables
> > (well
> >    specifically make ptes invalid) without taking any other locks.
> > 
> > - everyone else inserting or removing ptes for svm mappings also
> > needs to
> >    lock the page, or we have races. This might be the
> > hmm_range_fault races
> >    you're seeing when allowing vram pages, since I don't think
> > there's
> >    anything else stopping the page lookup otherwise from
> > succeeding.
> > 
> > - we might also need to stuff migrate ptes into the gpu side, like
> > the cpu
> >    does, to hold up refaults before the migration has finished. But
> > I think
> >    those are only needed for anon memory in sram because there's no
> > other
> >    way to find the right page than swap pte entries, of which
> > migration
> >    entries are a special case.
> > 
> > - core code also expects us to handle the page refcount correctly
> > for svm
> >    device memory, so we can't free the pages like normal bo pages
> > either
> >    directly to drm_buddy.
> > 
> > Now typing this all up will look an awful lot like what you have,
> > with the
> > dma_resv lock serving as the page lock and the pagetable lock. The
> > only
> > reason is that these locks are much smaller and nest within all the
> > other
> > stuff going on and so avoid the inversion issues.
> > 
> > So one annoying part is that this is a lot of pointlessly looking
> > typing.
> > The other is that it's full of races, because core mm really is
> > yolo all
> > the way down. So lots of ways you lock the wrong page and fun stuff
> > like
> > that, but the few cases that matter work:
> > 
> > - svm fault handling with hmm_range fault retries with mmu
> > notifiers. Note
> >    that we need to have vram pages locked and the notifier retrie
> > needs to
> >    be under the pagetable lock, or there's room to escape. At least
> > that's
> >    what I came up with last time I thought it all through.
> > 
> > - migrate_to_ram: it will hold a page reference which we know was
> > the
> >    valid vram page when the cpu pte was locked, but it might not be
> > it
> >    anymore. So we have to lock the page and check whether it's
> > still gpu
> >    mapped, and if not retry the entire fault since most likey
> > another
> >    migrate_to_ram has succeed meanwhile in parallel.
> > 
> > - for eviction we don't care, we might actually be migrating a page
> > no one
> >    even wants anymore.
> > 
> > Now I think you can get all this done with the dma_resv lock and
> > maybe the
> > bo refcount. But it does involve a tremendous amount of headaches
> > and
> > impendence mismatch, because that's not how page faults and
> > migrations
> > work in core mm.
> > 
> > Cheers, Sima
> > 
> > > Current migration policy is migrate any SVM range greater than or
> > > equal
> > > to 64k once.
> > > 
> > > [1] https://patchwork.freedesktop.org/series/133643/
> > > 
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > ---
> > >   drivers/gpu/drm/xe/xe_svm.c | 81
> > > ++++++++++++++++++++++++++++++++++++-
> > >   drivers/gpu/drm/xe/xe_svm.h |  1 +
> > >   2 files changed, 81 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/gpu/drm/xe/xe_svm.c
> > > b/drivers/gpu/drm/xe/xe_svm.c
> > > index 4372c02a341f..fd8987e0a506 100644
> > > --- a/drivers/gpu/drm/xe/xe_svm.c
> > > +++ b/drivers/gpu/drm/xe/xe_svm.c
> > > @@ -217,8 +217,13 @@ static void xe_svm_invalidate(struct
> > > drm_gpusvm *gpusvm,
> > >   static int __xe_svm_garbage_collector(struct xe_vm *vm,
> > >   				      struct xe_svm_range
> > > *range)
> > >   {
> > > +	struct drm_gpusvm_ctx ctx = {};
> > >   	struct dma_fence *fence;
> > >   
> > > +	/* Evict any pages holding references to vram allocation
> > > */
> > > +	if (range->base.flags.partial_unmap && IS_DGFX(vm->xe))
> > > +		drm_gpusvm_migrate_to_sram(&vm->svm.gpusvm,
> > > &range->base, &ctx);
> > > +
> > >   	xe_vm_lock(vm, false);
> > >   	fence = xe_vm_range_unbind(vm, range);
> > >   	xe_vm_unlock(vm);
> > > @@ -504,21 +509,77 @@ static bool xe_svm_range_is_valid(struct
> > > xe_svm_range *range,
> > >   	return (range->tile_present & ~range->tile_invalidated)
> > > & BIT(tile->id);
> > >   }
> > >   
> > > +static struct xe_mem_region *tile_to_mr(struct xe_tile *tile)
> > > +{
> > > +	return &tile->mem.vram;
> > > +}
> > > +
> > > +static struct xe_bo *xe_svm_alloc_vram(struct xe_vm *vm, struct
> > > xe_tile *tile,
> > > +				       struct xe_svm_range
> > > *range,
> > > +				       const struct
> > > drm_gpusvm_ctx *ctx)
> > > +{
> > > +	struct xe_mem_region *mr = tile_to_mr(tile);
> > > +	struct drm_buddy_block *block;
> > > +	struct list_head *blocks;
> > > +	struct xe_bo *bo;
> > > +	ktime_t end = 0;
> > > +	int err;
> > > +
> > > +retry:
> > > +	xe_vm_lock(vm, false);
> > > +	bo = xe_bo_create(tile_to_xe(tile), tile, vm, range-
> > > >base.va.end -
> > > +			  range->base.va.start,
> > > ttm_bo_type_device,
> > > +			  XE_BO_FLAG_VRAM_IF_DGFX(tile) |
> > > +			  XE_BO_FLAG_SYSTEM_ALLOC |
> > > XE_BO_FLAG_SKIP_CLEAR);
> > > +	xe_vm_unlock(vm);
> > > +	if (IS_ERR(bo)) {
> > > +		err = PTR_ERR(bo);
> > > +		if (xe_vm_validate_should_retry(NULL, err,
> > > &end))
> > > +			goto retry;
> > > +		return bo;
> > > +	}
> > > +
> > > +	blocks = &to_xe_ttm_vram_mgr_resource(bo->ttm.resource)-
> > > >blocks;
> > > +	list_for_each_entry(block, blocks, link)
> > > +		block->private = mr;
> > > +
> > > +	/*
> > > +	 * Take ref because as soon as
> > > drm_gpusvm_migrate_to_vram succeeds the
> > > +	 * creation ref can be dropped upon CPU fault or unmap.
> > > +	 */
> > > +	xe_bo_get(bo);
> > > +
> > > +	err = drm_gpusvm_migrate_to_vram(&vm->svm.gpusvm,
> > > &range->base,
> > > +					 bo, ctx);
> > > +	if (err) {
> > > +		xe_bo_put(bo);	/* Local ref */
> > > +		xe_bo_put(bo);	/* Creation ref */
> > > +		return ERR_PTR(err);
> > > +	}
> > > +
> > > +	return bo;
> > > +}
> > > +
> > >   int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma
> > > *vma,
> > >   			    struct xe_tile *tile, u64
> > > fault_addr,
> > >   			    bool atomic)
> > >   {
> > > -	struct drm_gpusvm_ctx ctx = { .read_only =
> > > xe_vma_read_only(vma), };
> > > +	struct drm_gpusvm_ctx ctx = { .read_only =
> > > xe_vma_read_only(vma),
> > > +		.vram_possible = IS_DGFX(vm->xe), };
> > >   	struct xe_svm_range *range;
> > >   	struct drm_gpusvm_range *r;
> > >   	struct drm_exec exec;
> > >   	struct dma_fence *fence;
> > > +	struct xe_bo *bo = NULL;
> > >   	ktime_t end = 0;
> > >   	int err;
> > >   
> > >   	lockdep_assert_held_write(&vm->lock);
> > >   
> > >   retry:
> > > +	xe_bo_put(bo);
> > > +	bo = NULL;
> > > +
> > >   	/* Always process UNMAPs first so view SVM ranges is
> > > current */
> > >   	err = xe_svm_garbage_collector(vm);
> > >   	if (err)
> > > @@ -534,6 +595,22 @@ int xe_svm_handle_pagefault(struct xe_vm
> > > *vm, struct xe_vma *vma,
> > >   	if (xe_svm_range_is_valid(range, tile))
> > >   		return 0;
> > >   
> > > +	/* XXX: Add migration policy, for now migrate range once
> > > */
> > > +	if (IS_DGFX(vm->xe) && !range->migrated &&
> > > +	    range->base.flags.migrate_vram &&
> > > +	    (range->base.va.end - range->base.va.start) >=
> > > SZ_64K) {
> > > +		range->migrated = true;
> > > +
> > > +		bo = xe_svm_alloc_vram(vm, tile, range, &ctx);
> > > +		if (IS_ERR(bo)) {
> > > +			drm_info(&vm->xe->drm,
> > > +				 "VRAM allocation failed,
> > > falling back to retrying, asid=%u, errno %ld\n",
> > > +				 vm->usm.asid, PTR_ERR(bo));
> > > +			bo = NULL;
> > > +			goto retry;
> > > +		}
> > > +	}
> > > +
> > >   	err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r,
> > > &ctx);
> > >   	if (err == -EFAULT || err == -EPERM)	/* Corner where
> > > CPU mappings have changed */
> > >   	       goto retry;
> > > @@ -567,6 +644,8 @@ int xe_svm_handle_pagefault(struct xe_vm *vm,
> > > struct xe_vma *vma,
> > >   	dma_fence_put(fence);
> > >   
> > >   err_out:
> > > +	xe_bo_put(bo);
> > > +
> > >   	return err;
> > >   }
> > >   
> > > diff --git a/drivers/gpu/drm/xe/xe_svm.h
> > > b/drivers/gpu/drm/xe/xe_svm.h
> > > index 8b72e91cc37d..3f432483a230 100644
> > > --- a/drivers/gpu/drm/xe/xe_svm.h
> > > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > > @@ -18,6 +18,7 @@ struct xe_svm_range {
> > >   	struct list_head garbage_collector_link;
> > >   	u8 tile_present;
> > >   	u8 tile_invalidated;
> > > +	u8 migrated	:1;
> > >   };
> > >   
> > >   int xe_devm_add(struct xe_tile *tile, struct xe_mem_region
> > > *mr);
> > > -- 
> > > 2.34.1
> > > 


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 24/28] drm/xe: Basic SVM BO eviction
  2024-08-28  2:48 ` [RFC PATCH 24/28] drm/xe: Basic SVM BO eviction Matthew Brost
@ 2024-08-29 10:14   ` Daniel Vetter
  2024-08-29 15:55     ` Matthew Brost
  0 siblings, 1 reply; 100+ messages in thread
From: Daniel Vetter @ 2024-08-29 10:14 UTC (permalink / raw)
  To: Matthew Brost
  Cc: intel-xe, dri-devel, airlied, christian.koenig, thomas.hellstrom,
	matthew.auld, daniel

On Tue, Aug 27, 2024 at 07:48:57PM -0700, Matthew Brost wrote:
> Wire xe_bo_move to GPUSVM migration to SRAM with trylocking of mmap
> lock.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_bo.c       | 35 +++++++++++++++++++++++++++++++-
>  drivers/gpu/drm/xe/xe_bo_types.h |  3 +++
>  drivers/gpu/drm/xe/xe_svm.c      |  2 ++
>  drivers/gpu/drm/xe/xe_svm.h      | 13 ++++++++++++
>  4 files changed, 52 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
> index ad804b6f9e84..ae71fcbe5380 100644
> --- a/drivers/gpu/drm/xe/xe_bo.c
> +++ b/drivers/gpu/drm/xe/xe_bo.c
> @@ -25,6 +25,7 @@
>  #include "xe_pm.h"
>  #include "xe_preempt_fence.h"
>  #include "xe_res_cursor.h"
> +#include "xe_svm.h"
>  #include "xe_trace_bo.h"
>  #include "xe_ttm_stolen_mgr.h"
>  #include "xe_vm.h"
> @@ -250,6 +251,8 @@ int xe_bo_placement_for_flags(struct xe_device *xe, struct xe_bo *bo,
>  static void xe_evict_flags(struct ttm_buffer_object *tbo,
>  			   struct ttm_placement *placement)
>  {
> +	struct xe_bo *bo;
> +
>  	if (!xe_bo_is_xe_bo(tbo)) {
>  		/* Don't handle scatter gather BOs */
>  		if (tbo->type == ttm_bo_type_sg) {
> @@ -261,6 +264,12 @@ static void xe_evict_flags(struct ttm_buffer_object *tbo,
>  		return;
>  	}
>  
> +	bo = ttm_to_xe_bo(tbo);
> +	if (bo->flags & XE_BO_FLAG_SYSTEM_ALLOC) {
> +		*placement = sys_placement;
> +		return;
> +	}
> +
>  	/*
>  	 * For xe, sg bos that are evicted to system just triggers a
>  	 * rebind of the sg list upon subsequent validation to XE_PL_TT.
> @@ -758,6 +767,17 @@ static int xe_bo_move(struct ttm_buffer_object *ttm_bo, bool evict,
>  		}
>  	}
>  
> +	if (!move_lacks_source && (bo->flags & XE_BO_FLAG_SYSTEM_ALLOC) &&
> +	    new_mem->mem_type == XE_PL_SYSTEM) {
> +		ret = xe_svm_range_evict(bo->range);
> +		if (!ret) {
> +			drm_dbg(&xe->drm, "Evict system allocator BO success\n");
> +			ttm_bo_move_null(ttm_bo, new_mem);
> +		}
> +
> +		goto out;
> +	}
> +
>  	if (!move_lacks_source &&
>  	    ((old_mem_type == XE_PL_SYSTEM && resource_is_vram(new_mem)) ||
>  	     (mem_type_is_vram(old_mem_type) &&
> @@ -1096,6 +1116,19 @@ static void xe_ttm_bo_delete_mem_notify(struct ttm_buffer_object *ttm_bo)
>  	}
>  }
>  
> +static bool xe_bo_eviction_valuable(struct ttm_buffer_object *ttm_bo,
> +				    const struct ttm_place *place)
> +{
> +	struct xe_bo *bo = ttm_to_xe_bo(ttm_bo);
> +
> +	/* Do not evict SVMs before having a binding */
> +	if (bo->flags & XE_BO_FLAG_SYSTEM_ALLOC &&
> +	    !xe_svm_range_has_vram_binding(bo->range))
> +		return false;
> +
> +	return ttm_bo_eviction_valuable(ttm_bo, place);
> +}
> +
>  const struct ttm_device_funcs xe_ttm_funcs = {
>  	.ttm_tt_create = xe_ttm_tt_create,
>  	.ttm_tt_populate = xe_ttm_tt_populate,
> @@ -1106,7 +1139,7 @@ const struct ttm_device_funcs xe_ttm_funcs = {
>  	.io_mem_reserve = xe_ttm_io_mem_reserve,
>  	.io_mem_pfn = xe_ttm_io_mem_pfn,
>  	.release_notify = xe_ttm_bo_release_notify,
> -	.eviction_valuable = ttm_bo_eviction_valuable,
> +	.eviction_valuable = xe_bo_eviction_valuable,
>  	.delete_mem_notify = xe_ttm_bo_delete_mem_notify,
>  };
>  
> diff --git a/drivers/gpu/drm/xe/xe_bo_types.h b/drivers/gpu/drm/xe/xe_bo_types.h
> index 2ed558ac2264..4523b033417c 100644
> --- a/drivers/gpu/drm/xe/xe_bo_types.h
> +++ b/drivers/gpu/drm/xe/xe_bo_types.h
> @@ -16,6 +16,7 @@
>  #include "xe_ggtt_types.h"
>  
>  struct xe_device;
> +struct xe_svm_range;
>  struct xe_vm;
>  
>  #define XE_BO_MAX_PLACEMENTS	3
> @@ -47,6 +48,8 @@ struct xe_bo {
>  	struct ttm_bo_kmap_obj kmap;
>  	/** @pinned_link: link to present / evicted list of pinned BO */
>  	struct list_head pinned_link;
> +	/** @range: SVM range for BO */
> +	struct xe_svm_range *range;
>  #ifdef CONFIG_PROC_FS
>  	/**
>  	 * @client: @xe_drm_client which created the bo
> diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
> index fd8987e0a506..dc9810828c0a 100644
> --- a/drivers/gpu/drm/xe/xe_svm.c
> +++ b/drivers/gpu/drm/xe/xe_svm.c
> @@ -531,6 +531,8 @@ static struct xe_bo *xe_svm_alloc_vram(struct xe_vm *vm, struct xe_tile *tile,
>  			  range->base.va.start, ttm_bo_type_device,
>  			  XE_BO_FLAG_VRAM_IF_DGFX(tile) |
>  			  XE_BO_FLAG_SYSTEM_ALLOC | XE_BO_FLAG_SKIP_CLEAR);
> +	if (!IS_ERR(bo))
> +		bo->range = range;
>  	xe_vm_unlock(vm);
>  	if (IS_ERR(bo)) {
>  		err = PTR_ERR(bo);
> diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> index 3f432483a230..b9cf0e2500da 100644
> --- a/drivers/gpu/drm/xe/xe_svm.h
> +++ b/drivers/gpu/drm/xe/xe_svm.h
> @@ -46,6 +46,19 @@ static inline bool xe_svm_range_has_dma_mapping(struct xe_svm_range *range)
>  	return range->base.flags.has_dma_mapping;
>  }
>  
> +static inline bool xe_svm_range_has_vram_binding(struct xe_svm_range *range)
> +{
> +	return xe_svm_range_in_vram(range) && range->tile_present;
> +}
> +
> +static inline int xe_svm_range_evict(struct xe_svm_range *range)
> +{
> +	struct drm_gpusvm_ctx ctx = { .trylock_mmap = true, };

So even trying to acquire an mmap lock for eviction is I think a design
bug for svm memory ranges. It's a bunch of physical memory, you have no
idea how many mm/vma map it and which one you pick as the special one is
fairly arbitrary.

So don't: eviction should entirely ignore va/mm issues at the top level
like the migrate_device_range function does (maybe we need a
scatter-gather version of that instead of just a range).

That function internally makes sure you're in sync with any vma/vm by:
- installing migration ptes everywhere, which does the mmu_notifier dance
- locking the pages to prevent other concurrent migration or other fun
  stuff from happening
- then restoring ptes to something sensible when it's all done

And it does that by looping over _all_ possible mappings of a page with
the rmap_walk infrastructure.
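
Roughly, the mm-only eviction path then looks like the sketch below,
following the documented migrate_vma / migrate_device pattern;
vram_start_pfn/npages describe the device pfn range being evicted, the
device-to-system copy is driver-specific and error handling is elided:

	unsigned long *src, *dst;
	unsigned long i;

	src = kvcalloc(npages, sizeof(*src), GFP_KERNEL);
	dst = kvcalloc(npages, sizeof(*dst), GFP_KERNEL);

	/* Locks the device pages and installs migration ptes, no mmap lock. */
	migrate_device_range(src, vram_start_pfn, npages);

	for (i = 0; i < npages; i++) {
		struct page *dpage;

		if (!(src[i] & MIGRATE_PFN_MIGRATE))
			continue;

		dpage = alloc_page(GFP_HIGHUSER);
		lock_page(dpage);
		dst[i] = migrate_pfn(page_to_pfn(dpage));
		/* kick off the device page -> dpage copy here */
	}

	migrate_device_pages(src, dst, npages);
	/* wait for the copies to complete */
	migrate_device_finalize(src, dst, npages);

	kvfree(src);
	kvfree(dst);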

The only case where we need the mmap lock (or vma lock or whatever) is if
we need to be coherent with other concurrent mm updates of a specific mm.
That should only be the case when migrating to vram, where the gpusvm->mm
is the special one, and when migrating to sram due to cpu faults, where
the vmf->vma->mm is special (and might at best have a tenuous relationship
to the gpusvm->mm). But those are the only cases where a specific mm and vma
have any relevance to svm vram allocations.

-Sima

> +
> +	return drm_gpusvm_migrate_to_sram(range->base.gpusvm, &range->base,
> +					  &ctx);
> +}
> +
>  #define xe_svm_notifier_lock(vm__)	\
>  	drm_gpusvm_notifier_lock(&(vm__)->svm.gpusvm)
>  
> -- 
> 2.34.1
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 23/28] drm/xe: Add SVM VRAM migration
  2024-08-29  9:53       ` Thomas Hellström
@ 2024-08-29 11:02         ` Daniel Vetter
  2024-08-29 22:12           ` Matthew Brost
  2024-08-29 14:30         ` Christian König
  1 sibling, 1 reply; 100+ messages in thread
From: Daniel Vetter @ 2024-08-29 11:02 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: Christian König, Daniel Vetter, Matthew Brost, intel-xe,
	dri-devel, airlied, matthew.auld, daniel,
	Paneer Selvam, Arunpravin

On Thu, Aug 29, 2024 at 11:53:58AM +0200, Thomas Hellström wrote:
> But as Sima pointed out in private communication, exhaustive eviction
> is not really needed for faulting to make (crawling) progress.
> Watermarks and VRAM trylock shrinking should suffice, since we're
> strictly only required to service a single gpu page granule at a time.
> 
> However, ordinary bo-based jobs would still like to be able to
> completely evict SVM vram. Whether that is important enough to strive
> for is ofc up for discussion.

My take is that you don't win anything for exhaustive eviction by having
the dma_resv somewhere in there for svm allocations. Roughly, for the split lru
world, where svm ignores bo/dma_resv:

When evicting vram from the ttm side we'll fairly switch between selecting
bo and throwing out svm pages. With drm_exec/ww_acquire_ctx selecting bo
will eventually succeed in vacuuming up everything (with a few retries
perhaps, if we're not yet at the head of the ww ticket queue).

svm pages we need to try to evict anyway - there's no guarantee, because
the core mm might be holding temporary page references (which block
migration) or have the page locked (which also block the migration). But
as long as those two steps succeed, we'll win and get the pages. There
might be some thrashing against concurrent svm faults stealing them again,
but they have a disadvantage since they can't steal dma_resv_locked bo.
And if it's still too much we can stall them in the page allocator.

So it's not entirely reliable, but should be close enough.

Now for bo based svm the picture isn't any different, because holding
dma_resv is not actually enough to migrate svm mappings. We still need to
hope there's no temporary page references around, and we still need to
succeed at locking the page. And the migration code only does trylocks,
because that's its deadlock-prevention algorithm when different migrations
need the same set of pages but acquire them in a different order. So
we win nothing.

Worse, if dma_resv does actually hold up svm migration and reclaim, then
we potentially deadlock because that lock is for a bigger range than
individual pages (or folios). And the core mm assumes that it can get out
of a deadlock bind by (at least stochastically) eventually succeeding in
acquiring/locking down a single page.

This means we cannot use dma_resv tricks to give the ttm world an
advantage in exhaustive eviction against concurrent svm faults. Or at
least not more than we can do without by just stalling svm faults that
need to allocate gpu memory (but that must happen without holding locks or
we're busted).

So the only benefit I'm seeing is the unified lru, which I'm not sure is
worth it. There's also a bit of a lru design tension here, because for the bo
world we want objects that are locked to stay on the lru, so that the
competing processes can figure out who has the winning ww ticket. The core
mm design otoh does isolate pages and remove them from the lru when
they're acquired, so that they don't gunk up other processes from trying
to make forward progress and are better hidden. Which reduces temporary
page references (from lru walk) preventing migration and stuff like that.

Cheers, Sima
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 23/28] drm/xe: Add SVM VRAM migration
  2024-08-29  9:53       ` Thomas Hellström
  2024-08-29 11:02         ` Daniel Vetter
@ 2024-08-29 14:30         ` Christian König
  2024-08-29 21:53           ` Matthew Brost
  1 sibling, 1 reply; 100+ messages in thread
From: Christian König @ 2024-08-29 14:30 UTC (permalink / raw)
  To: Thomas Hellström, Daniel Vetter, Matthew Brost
  Cc: intel-xe, dri-devel, airlied, matthew.auld, daniel,
	Paneer Selvam, Arunpravin



On 29.08.24 at 11:53, Thomas Hellström wrote:
> Hi, Christian,
>
> On Thu, 2024-08-29 at 11:24 +0200, Christian König wrote:
> ...
>
>>>> - Unified eviction is required (SVM VRAM and TTM BOs need to be
>>>> able to
>>>>     evict each other).
>>> So core mm handles this by just roughly equally shrinking
>>> everything.
>>> Seems to work, and it has a pile of object shrinkers, and the page
>>> lru is
>>> also split into page cache and anon memory.
>>>
>>> I think you need to put in more justification that unified eviction
>>> is
>>> required than just stating it, because a look at mm/ gives a very
>>> well
>>> established counterexample.
>>>
>>>> - For exhaustive eviction [1], SVM VRAM allocations will almost
>>>> certainly
>>>>     require a dma-resv.
>>> So from the TTM side we need exhaustive eviction, or at least
>>> something a
>>> bit more exhaustive than what ttm currently has. Note that i915-gem
>>> also
>>> never really got to perfect exhaustive eviction, it's just a pile
>>> better
>>> than ttm right now.
>> Please define what exhaustive eviction should mean? I think I know
>> what
>> it is and I have been pushing TTM into the direction of solving this
>> for
>> years.
> We internally refer to exhaustive eviction being a client is always
> guaranteed to eventually make progress in obtaining non-pinned vram,
> typically by incrementally locking and keeping dma-resvs across a
> single validation including validations during buffer object
> allocations.
>
>> The last missing puzzle piece is to use drm_exec for TTM evictions,
> and IMO keeping the dma-resv locks grabbed during eviction until at
> least one unit of progress (one validation) has succeeded.

Yes, exactly that. My guessed understanding was actually correct.

>
>> but
>> apart from that everything should work now.
>>
>>
>> Regards,
>> Christian.
> But as Sima pointed out in private communication, exhaustive eviction
> is not really needed for faulting to make (crawling) progress.
> Watermarks and VRAM trylock shrinking should suffice, since we're
> strictly only required to service a single gpu page granule at a time.

Yeah fault based memory management should be able to keep working as 
long as the page isn't re-migrated before you make any progress.

Since the number of VRAM or system memory pages is very high that should 
basically never happen.

> However, ordinary bo-based jobs would still like to be able to
> completely evict SVM vram. Whether that is important enough to strive
> for is ofc up for discussion.

Yes, exactly that. Felix, Alex, a bunch of other AMD folks and I came up 
with the same conclusion at AMD internally as well.

Regards,
Christian.

>
> /Thomas
>
>
>
>>> Now if there's also SVM VRAM managed on a page lru, TTM exhaustive
>>> eviction is going to win because the shrinkers can only trylock
>>> dma_resv.
>>> So this part works. It actually works so well on the system memory
>>> side
>>> that if we're not careful we can trigger oom, because we're too
>>> good at
>>> getting at all the memory.
>>>
>>> SVM VRAM allocations otoh do not need exhaustive evictions. Or at
>>> least I
>>> don't see why, because the idea is that thanks to gpu and cpu page
>>> faults,
>>> you can always get out of a pinch by just trashing everything for a
>>> while
>>> and migrating the handfull of available pages a lot.
>>>
>>>> - Likely allocation size is 2M which makes of size of BO (872)
>>>>     acceptable per allocation (872 / 2M == .0004158).
>>>>
>>>> With this, using TTM BO for VRAM backing store seems to be an
>>>> obvious
>>>> choice as it allows leveraging of the TTM eviction code.
>>> Except it requires that you hold dma_resv, which brings in all
>>> kinds of
>>> pain. And for eviction we really don't need a lot of
>>> synchronization, so a
>>> lot of that locking is not needed, unlike the case where we have a
>>> cpu
>>> fault, where we absolutely need mmap_lock and all that to make sure
>>> we
>>> fault in the right page.
>>>
>>> But for eviction we only need to throw out some pages, if we're not
>>> entirely precise with picking the right ones (or have no idea into
>>> which
>>> vma they're all currently mapped into) it doesn't matter. That's
>>> why
>>> migrate_device_pages doesn't care about any of that at all, it
>>> doesn't
>>> need to by design. But by bo backing memory you drag in all that
>>> stuff
>>> that's causing headacheds for eviction.
>>>
>>> The only thing migration tries to do is remove all pte, and if that
>>> succeeds, move the page. Specialized for the gpusvm case, looking
>>> at mm/
>>> code as cheat sheet, we need roughly:
>>>
>>> - reverse mapping structure like anon_vma. Except gpusvm can assume
>>> that
>>>     there's currently only one gpu side mapping, so we can just
>>> stuff the
>>>     gpusvm an va_address into the page, and protect it with the page
>>> lock.
>>>
>>> - we need pagetable locks, so that we can manipulate pagetables
>>> (well
>>>     specifically make ptes invalid) without taking any other locks.
>>>
>>> - everyone else inserting or removing ptes for svm mappings also
>>> needs to
>>>     lock the page, or we have races. This might be the
>>> hmm_range_fault races
>>>     you're seeing when allowing vram pages, since I don't think
>>> there's
>>>     anything else stopping the page lookup otherwise from
>>> succeeding.
>>>
>>> - we might also need to stuff migrate ptes into the gpu side, like
>>> the cpu
>>>     does, to hold up refaults before the migration has finished. But
>>> I think
>>>     those are only needed for anon memory in sram because there's no
>>> other
>>>     way to find the right page than swap pte entries, of which
>>> migration
>>>     entries are a special case.
>>>
>>> - core code also expects us to handle the page refcount correctly
>>> for svm
>>>     device memory, so we can't free the pages like normal bo pages
>>> either
>>>     directly to drm_buddy.
>>>
>>> Now typing this all up will look an awful lot like what you have,
>>> with the
>>> dma_resv lock serving as the page lock and the pagetable lock. The
>>> only
>>> reason is that these locks are much smaller and nest within all the
>>> other
>>> stuff going on and so avoid the inversion issues.
>>>
>>> So one annoying part is that this is a lot of pointlessly looking
>>> typing.
>>> The other is that it's full of races, because core mm really is
>>> yolo all
>>> the way down. So lots of ways you lock the wrong page and fun stuff
>>> like
>>> that, but the few cases that matter work:
>>>
>>> - svm fault handling with hmm_range fault retries with mmu
>>> notifiers. Note
>>>     that we need to have vram pages locked and the notifier retrie
>>> needs to
>>>     be under the pagetable lock, or there's room to escape. At least
>>> that's
>>>     what I came up with last time I thought it all through.
>>>
>>> - migrate_to_ram: it will hold a page reference which we know was
>>> the
>>>     valid vram page when the cpu pte was locked, but it might not be
>>> it
>>>     anymore. So we have to lock the page and check whether it's
>>> still gpu
>>>     mapped, and if not retry the entire fault since most likey
>>> another
>>>     migrate_to_ram has succeed meanwhile in parallel.
>>>
>>> - for eviction we don't care, we might actually be migrating a page
>>> no one
>>>     even wants anymore.
>>>
>>> Now I think you can get all this done with the dma_resv lock and
>>> maybe the
>>> bo refcount. But it does involve a tremendous amount of headaches
>>> and
>>> impendence mismatch, because that's not how page faults and
>>> migrations
>>> work in core mm.
>>>
>>> Cheers, Sima
>>>
>>>> Current migration policy is migrate any SVM range greater than or
>>>> equal
>>>> to 64k once.
>>>>
>>>> [1] https://patchwork.freedesktop.org/series/133643/
>>>>
>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>> ---
>>>>    drivers/gpu/drm/xe/xe_svm.c | 81
>>>> ++++++++++++++++++++++++++++++++++++-
>>>>    drivers/gpu/drm/xe/xe_svm.h |  1 +
>>>>    2 files changed, 81 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/xe/xe_svm.c
>>>> b/drivers/gpu/drm/xe/xe_svm.c
>>>> index 4372c02a341f..fd8987e0a506 100644
>>>> --- a/drivers/gpu/drm/xe/xe_svm.c
>>>> +++ b/drivers/gpu/drm/xe/xe_svm.c
>>>> @@ -217,8 +217,13 @@ static void xe_svm_invalidate(struct
>>>> drm_gpusvm *gpusvm,
>>>>    static int __xe_svm_garbage_collector(struct xe_vm *vm,
>>>>    				      struct xe_svm_range
>>>> *range)
>>>>    {
>>>> +	struct drm_gpusvm_ctx ctx = {};
>>>>    	struct dma_fence *fence;
>>>>    
>>>> +	/* Evict any pages holding references to vram allocation
>>>> */
>>>> +	if (range->base.flags.partial_unmap && IS_DGFX(vm->xe))
>>>> +		drm_gpusvm_migrate_to_sram(&vm->svm.gpusvm,
>>>> &range->base, &ctx);
>>>> +
>>>>    	xe_vm_lock(vm, false);
>>>>    	fence = xe_vm_range_unbind(vm, range);
>>>>    	xe_vm_unlock(vm);
>>>> @@ -504,21 +509,77 @@ static bool xe_svm_range_is_valid(struct
>>>> xe_svm_range *range,
>>>>    	return (range->tile_present & ~range->tile_invalidated)
>>>> & BIT(tile->id);
>>>>    }
>>>>    
>>>> +static struct xe_mem_region *tile_to_mr(struct xe_tile *tile)
>>>> +{
>>>> +	return &tile->mem.vram;
>>>> +}
>>>> +
>>>> +static struct xe_bo *xe_svm_alloc_vram(struct xe_vm *vm, struct
>>>> xe_tile *tile,
>>>> +				       struct xe_svm_range
>>>> *range,
>>>> +				       const struct
>>>> drm_gpusvm_ctx *ctx)
>>>> +{
>>>> +	struct xe_mem_region *mr = tile_to_mr(tile);
>>>> +	struct drm_buddy_block *block;
>>>> +	struct list_head *blocks;
>>>> +	struct xe_bo *bo;
>>>> +	ktime_t end = 0;
>>>> +	int err;
>>>> +
>>>> +retry:
>>>> +	xe_vm_lock(vm, false);
>>>> +	bo = xe_bo_create(tile_to_xe(tile), tile, vm, range-
>>>>> base.va.end -
>>>> +			  range->base.va.start,
>>>> ttm_bo_type_device,
>>>> +			  XE_BO_FLAG_VRAM_IF_DGFX(tile) |
>>>> +			  XE_BO_FLAG_SYSTEM_ALLOC |
>>>> XE_BO_FLAG_SKIP_CLEAR);
>>>> +	xe_vm_unlock(vm);
>>>> +	if (IS_ERR(bo)) {
>>>> +		err = PTR_ERR(bo);
>>>> +		if (xe_vm_validate_should_retry(NULL, err,
>>>> &end))
>>>> +			goto retry;
>>>> +		return bo;
>>>> +	}
>>>> +
>>>> +	blocks = &to_xe_ttm_vram_mgr_resource(bo->ttm.resource)-
>>>>> blocks;
>>>> +	list_for_each_entry(block, blocks, link)
>>>> +		block->private = mr;
>>>> +
>>>> +	/*
>>>> +	 * Take ref because as soon as
>>>> drm_gpusvm_migrate_to_vram succeeds the
>>>> +	 * creation ref can be dropped upon CPU fault or unmap.
>>>> +	 */
>>>> +	xe_bo_get(bo);
>>>> +
>>>> +	err = drm_gpusvm_migrate_to_vram(&vm->svm.gpusvm,
>>>> &range->base,
>>>> +					 bo, ctx);
>>>> +	if (err) {
>>>> +		xe_bo_put(bo);	/* Local ref */
>>>> +		xe_bo_put(bo);	/* Creation ref */
>>>> +		return ERR_PTR(err);
>>>> +	}
>>>> +
>>>> +	return bo;
>>>> +}
>>>> +
>>>>    int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma
>>>> *vma,
>>>>    			    struct xe_tile *tile, u64
>>>> fault_addr,
>>>>    			    bool atomic)
>>>>    {
>>>> -	struct drm_gpusvm_ctx ctx = { .read_only =
>>>> xe_vma_read_only(vma), };
>>>> +	struct drm_gpusvm_ctx ctx = { .read_only =
>>>> xe_vma_read_only(vma),
>>>> +		.vram_possible = IS_DGFX(vm->xe), };
>>>>    	struct xe_svm_range *range;
>>>>    	struct drm_gpusvm_range *r;
>>>>    	struct drm_exec exec;
>>>>    	struct dma_fence *fence;
>>>> +	struct xe_bo *bo = NULL;
>>>>    	ktime_t end = 0;
>>>>    	int err;
>>>>    
>>>>    	lockdep_assert_held_write(&vm->lock);
>>>>    
>>>>    retry:
>>>> +	xe_bo_put(bo);
>>>> +	bo = NULL;
>>>> +
>>>>    	/* Always process UNMAPs first so view SVM ranges is
>>>> current */
>>>>    	err = xe_svm_garbage_collector(vm);
>>>>    	if (err)
>>>> @@ -534,6 +595,22 @@ int xe_svm_handle_pagefault(struct xe_vm
>>>> *vm, struct xe_vma *vma,
>>>>    	if (xe_svm_range_is_valid(range, tile))
>>>>    		return 0;
>>>>    
>>>> +	/* XXX: Add migration policy, for now migrate range once
>>>> */
>>>> +	if (IS_DGFX(vm->xe) && !range->migrated &&
>>>> +	    range->base.flags.migrate_vram &&
>>>> +	    (range->base.va.end - range->base.va.start) >=
>>>> SZ_64K) {
>>>> +		range->migrated = true;
>>>> +
>>>> +		bo = xe_svm_alloc_vram(vm, tile, range, &ctx);
>>>> +		if (IS_ERR(bo)) {
>>>> +			drm_info(&vm->xe->drm,
>>>> +				 "VRAM allocation failed,
>>>> falling back to retrying, asid=%u, errno %ld\n",
>>>> +				 vm->usm.asid, PTR_ERR(bo));
>>>> +			bo = NULL;
>>>> +			goto retry;
>>>> +		}
>>>> +	}
>>>> +
>>>>    	err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r,
>>>> &ctx);
>>>>    	if (err == -EFAULT || err == -EPERM)	/* Corner where
>>>> CPU mappings have changed */
>>>>    	       goto retry;
>>>> @@ -567,6 +644,8 @@ int xe_svm_handle_pagefault(struct xe_vm *vm,
>>>> struct xe_vma *vma,
>>>>    	dma_fence_put(fence);
>>>>    
>>>>    err_out:
>>>> +	xe_bo_put(bo);
>>>> +
>>>>    	return err;
>>>>    }
>>>>    
>>>> diff --git a/drivers/gpu/drm/xe/xe_svm.h
>>>> b/drivers/gpu/drm/xe/xe_svm.h
>>>> index 8b72e91cc37d..3f432483a230 100644
>>>> --- a/drivers/gpu/drm/xe/xe_svm.h
>>>> +++ b/drivers/gpu/drm/xe/xe_svm.h
>>>> @@ -18,6 +18,7 @@ struct xe_svm_range {
>>>>    	struct list_head garbage_collector_link;
>>>>    	u8 tile_present;
>>>>    	u8 tile_invalidated;
>>>> +	u8 migrated	:1;
>>>>    };
>>>>    
>>>>    int xe_devm_add(struct xe_tile *tile, struct xe_mem_region
>>>> *mr);
>>>> -- 
>>>> 2.34.1
>>>>


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 24/28] drm/xe: Basic SVM BO eviction
  2024-08-29 10:14   ` Daniel Vetter
@ 2024-08-29 15:55     ` Matthew Brost
  2024-09-02 13:05       ` Daniel Vetter
  0 siblings, 1 reply; 100+ messages in thread
From: Matthew Brost @ 2024-08-29 15:55 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: intel-xe, dri-devel, airlied, christian.koenig, thomas.hellstrom,
	matthew.auld, daniel

On Thu, Aug 29, 2024 at 12:14:53PM +0200, Daniel Vetter wrote:
> On Tue, Aug 27, 2024 at 07:48:57PM -0700, Matthew Brost wrote:
> > Wire xe_bo_move to GPUSVM migration to SRAM with trylocking of mmap
> > lock.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_bo.c       | 35 +++++++++++++++++++++++++++++++-
> >  drivers/gpu/drm/xe/xe_bo_types.h |  3 +++
> >  drivers/gpu/drm/xe/xe_svm.c      |  2 ++
> >  drivers/gpu/drm/xe/xe_svm.h      | 13 ++++++++++++
> >  4 files changed, 52 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
> > index ad804b6f9e84..ae71fcbe5380 100644
> > --- a/drivers/gpu/drm/xe/xe_bo.c
> > +++ b/drivers/gpu/drm/xe/xe_bo.c
> > @@ -25,6 +25,7 @@
> >  #include "xe_pm.h"
> >  #include "xe_preempt_fence.h"
> >  #include "xe_res_cursor.h"
> > +#include "xe_svm.h"
> >  #include "xe_trace_bo.h"
> >  #include "xe_ttm_stolen_mgr.h"
> >  #include "xe_vm.h"
> > @@ -250,6 +251,8 @@ int xe_bo_placement_for_flags(struct xe_device *xe, struct xe_bo *bo,
> >  static void xe_evict_flags(struct ttm_buffer_object *tbo,
> >  			   struct ttm_placement *placement)
> >  {
> > +	struct xe_bo *bo;
> > +
> >  	if (!xe_bo_is_xe_bo(tbo)) {
> >  		/* Don't handle scatter gather BOs */
> >  		if (tbo->type == ttm_bo_type_sg) {
> > @@ -261,6 +264,12 @@ static void xe_evict_flags(struct ttm_buffer_object *tbo,
> >  		return;
> >  	}
> >  
> > +	bo = ttm_to_xe_bo(tbo);
> > +	if (bo->flags & XE_BO_FLAG_SYSTEM_ALLOC) {
> > +		*placement = sys_placement;
> > +		return;
> > +	}
> > +
> >  	/*
> >  	 * For xe, sg bos that are evicted to system just triggers a
> >  	 * rebind of the sg list upon subsequent validation to XE_PL_TT.
> > @@ -758,6 +767,17 @@ static int xe_bo_move(struct ttm_buffer_object *ttm_bo, bool evict,
> >  		}
> >  	}
> >  
> > +	if (!move_lacks_source && (bo->flags & XE_BO_FLAG_SYSTEM_ALLOC) &&
> > +	    new_mem->mem_type == XE_PL_SYSTEM) {
> > +		ret = xe_svm_range_evict(bo->range);
> > +		if (!ret) {
> > +			drm_dbg(&xe->drm, "Evict system allocator BO success\n");
> > +			ttm_bo_move_null(ttm_bo, new_mem);
> > +		}
> > +
> > +		goto out;
> > +	}
> > +
> >  	if (!move_lacks_source &&
> >  	    ((old_mem_type == XE_PL_SYSTEM && resource_is_vram(new_mem)) ||
> >  	     (mem_type_is_vram(old_mem_type) &&
> > @@ -1096,6 +1116,19 @@ static void xe_ttm_bo_delete_mem_notify(struct ttm_buffer_object *ttm_bo)
> >  	}
> >  }
> >  
> > +static bool xe_bo_eviction_valuable(struct ttm_buffer_object *ttm_bo,
> > +				    const struct ttm_place *place)
> > +{
> > +	struct xe_bo *bo = ttm_to_xe_bo(ttm_bo);
> > +
> > +	/* Do not evict SVMs before having a binding */
> > +	if (bo->flags & XE_BO_FLAG_SYSTEM_ALLOC &&
> > +	    !xe_svm_range_has_vram_binding(bo->range))
> > +		return false;
> > +
> > +	return ttm_bo_eviction_valuable(ttm_bo, place);
> > +}
> > +
> >  const struct ttm_device_funcs xe_ttm_funcs = {
> >  	.ttm_tt_create = xe_ttm_tt_create,
> >  	.ttm_tt_populate = xe_ttm_tt_populate,
> > @@ -1106,7 +1139,7 @@ const struct ttm_device_funcs xe_ttm_funcs = {
> >  	.io_mem_reserve = xe_ttm_io_mem_reserve,
> >  	.io_mem_pfn = xe_ttm_io_mem_pfn,
> >  	.release_notify = xe_ttm_bo_release_notify,
> > -	.eviction_valuable = ttm_bo_eviction_valuable,
> > +	.eviction_valuable = xe_bo_eviction_valuable,
> >  	.delete_mem_notify = xe_ttm_bo_delete_mem_notify,
> >  };
> >  
> > diff --git a/drivers/gpu/drm/xe/xe_bo_types.h b/drivers/gpu/drm/xe/xe_bo_types.h
> > index 2ed558ac2264..4523b033417c 100644
> > --- a/drivers/gpu/drm/xe/xe_bo_types.h
> > +++ b/drivers/gpu/drm/xe/xe_bo_types.h
> > @@ -16,6 +16,7 @@
> >  #include "xe_ggtt_types.h"
> >  
> >  struct xe_device;
> > +struct xe_svm_range;
> >  struct xe_vm;
> >  
> >  #define XE_BO_MAX_PLACEMENTS	3
> > @@ -47,6 +48,8 @@ struct xe_bo {
> >  	struct ttm_bo_kmap_obj kmap;
> >  	/** @pinned_link: link to present / evicted list of pinned BO */
> >  	struct list_head pinned_link;
> > +	/** @range: SVM range for BO */
> > +	struct xe_svm_range *range;
> >  #ifdef CONFIG_PROC_FS
> >  	/**
> >  	 * @client: @xe_drm_client which created the bo
> > diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
> > index fd8987e0a506..dc9810828c0a 100644
> > --- a/drivers/gpu/drm/xe/xe_svm.c
> > +++ b/drivers/gpu/drm/xe/xe_svm.c
> > @@ -531,6 +531,8 @@ static struct xe_bo *xe_svm_alloc_vram(struct xe_vm *vm, struct xe_tile *tile,
> >  			  range->base.va.start, ttm_bo_type_device,
> >  			  XE_BO_FLAG_VRAM_IF_DGFX(tile) |
> >  			  XE_BO_FLAG_SYSTEM_ALLOC | XE_BO_FLAG_SKIP_CLEAR);
> > +	if (!IS_ERR(bo))
> > +		bo->range = range;
> >  	xe_vm_unlock(vm);
> >  	if (IS_ERR(bo)) {
> >  		err = PTR_ERR(bo);
> > diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> > index 3f432483a230..b9cf0e2500da 100644
> > --- a/drivers/gpu/drm/xe/xe_svm.h
> > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > @@ -46,6 +46,19 @@ static inline bool xe_svm_range_has_dma_mapping(struct xe_svm_range *range)
> >  	return range->base.flags.has_dma_mapping;
> >  }
> >  
> > +static inline bool xe_svm_range_has_vram_binding(struct xe_svm_range *range)
> > +{
> > +	return xe_svm_range_in_vram(range) && range->tile_present;
> > +}
> > +
> > +static inline int xe_svm_range_evict(struct xe_svm_range *range)
> > +{
> > +	struct drm_gpusvm_ctx ctx = { .trylock_mmap = true, };
> 
> So even trying to acquire an mmap lock for eviction is I think a design
> bug for svm memory ranges. It's a bunch of physical memory, you have no
> idea how many mm/vma map it and which one you pick as the special one is
> fairly arbitrary.
> 

Let me drop the whole trylock thing and just evict via
drm_gpusvm_evict_to_sram / migrate_device_vma_range, which does not
require the mmap lock. I added this code very recently per Matt Auld's
suggestion and agree it makes the trylocking unnecessary.

> So dont, eviction should entirely ignore va/mm issues at the top level
> like the migrate_device_range function does (maybe we need a
> scatter-gather version of that instead of just a range.
> 

I needed to add migrate_device_vma_range (might be a bad name) as VRAM
may be backed by non-contiguous pfns when memory pressure exists, whereas
migrate_device_range only supports contiguous pfns.
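
For reference, the intended eviction-side usage is roughly the sketch
below; the xe_buddy_block_* helpers and the blocks list walk are
illustrative, the point is only that src_pfns can be non-contiguous:

	unsigned long npages = (range->va.end - range->va.start) >> PAGE_SHIFT;
	unsigned long *src_pfns;
	struct drm_buddy_block *block;
	unsigned long i = 0;
	int err;

	src_pfns = kvcalloc(npages, sizeof(*src_pfns), GFP_KERNEL);

	/*
	 * Under memory pressure the VRAM behind the range can be split
	 * across several buddy blocks, so the device pfns are not
	 * necessarily contiguous. Gather them in range order.
	 */
	list_for_each_entry(block, blocks, link) {	/* BO's buddy blocks */
		unsigned long pfn = xe_buddy_block_to_pfn(block);
		unsigned long j;

		for (j = 0; j < xe_buddy_block_npages(block) && i < npages; j++)
			src_pfns[i++] = pfn + j;
	}

	err = migrate_device_vma_range(gpusvm->mm,
				       gpusvm->device_private_page_owner,
				       src_pfns, npages, range->va.start);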

> That function internally makes sure you're in sync with any vma/vm by:
> - installing migration ptes everywhere, which does the mmu_notifer dance
> - locking the pages to prevent other concurrent migration or other fun
>   stuff from happening
> - then restore ptes to something sensisble when it's all done
> 
> And it does that by looping over _all_ possible mappings of a page with
> the rmap_walk infrastructure.
> 
> The only reason when we need the mmap lock (or vma lock or whatever) is if
> we need to be coherent with other concurrent mm updates of a specific mm.
> That should only be the case when migrating to vram, where the gpusvm->mm
> is the special one, and when migrating to sram due to cpu faults, where
> the vmf->vma->mm is special (and might at best have a tenous relationship
> to the gpusvm->mm). But that's the only cases where a specific mm and vma
> have any relevance to svm vram allocations.
> 

Thanks for the info.

Matt

> -Sima
> 
> > +
> > +	return drm_gpusvm_migrate_to_sram(range->base.gpusvm, &range->base,
> > +					  &ctx);
> > +}
> > +
> >  #define xe_svm_notifier_lock(vm__)	\
> >  	drm_gpusvm_notifier_lock(&(vm__)->svm.gpusvm)
> >  
> > -- 
> > 2.34.1
> > 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 04/28] mm/migrate: Add migrate_device_vma_range
  2024-08-29  9:03   ` Daniel Vetter
@ 2024-08-29 15:58     ` Matthew Brost
  0 siblings, 0 replies; 100+ messages in thread
From: Matthew Brost @ 2024-08-29 15:58 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: intel-xe, dri-devel, airlied, christian.koenig, thomas.hellstrom,
	matthew.auld, daniel

On Thu, Aug 29, 2024 at 11:03:29AM +0200, Daniel Vetter wrote:
> On Tue, Aug 27, 2024 at 07:48:37PM -0700, Matthew Brost wrote:
> > Add migrate_device_vma_range which prepares an array of pre-populated
> > device pages for migration and issues a MMU invalidation.
> > 
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  include/linux/migrate.h |  3 +++
> >  mm/migrate_device.c     | 53 +++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 56 insertions(+)
> > 
> > diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> > index 644be30b69c8..e8cce05bf9c2 100644
> > --- a/include/linux/migrate.h
> > +++ b/include/linux/migrate.h
> > @@ -226,6 +226,9 @@ void migrate_vma_pages(struct migrate_vma *migrate);
> >  void migrate_vma_finalize(struct migrate_vma *migrate);
> >  int migrate_device_range(unsigned long *src_pfns, unsigned long start,
> >  			unsigned long npages);
> > +int migrate_device_vma_range(struct mm_struct *mm, void *pgmap_owner,
> > +			     unsigned long *src_pfns, unsigned long npages,
> > +			     unsigned long start);
> >  void migrate_device_pages(unsigned long *src_pfns, unsigned long *dst_pfns,
> >  			unsigned long npages);
> >  void migrate_device_finalize(unsigned long *src_pfns,
> > diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> > index 6d66dc1c6ffa..e25f12a132e8 100644
> > --- a/mm/migrate_device.c
> > +++ b/mm/migrate_device.c
> > @@ -920,6 +920,59 @@ int migrate_device_range(unsigned long *src_pfns, unsigned long start,
> >  }
> >  EXPORT_SYMBOL(migrate_device_range);
> >  
> > +/**
> > + * migrate_device_vma_range() - migrate device private pfns to normal memory and
> > + * trigger MMU invalidation.
> > + * @mm: struct mm of device pages.
> > > + * @src_pfns: pre-populated array of source device private pfns to migrate.
> > + * @pgmap_owner: page group map owner of device pages.
> > + * @npages: number of pages to migrate.
> > + * @start: VMA start of device pages.
> > + *
> > > + * Similar to migrate_device_range() but supports a non-contiguous pre-populated
> > > + * array of device pages to migrate. Also triggers MMU invalidation. Useful in
> > > + * device memory eviction paths where a lock is held protecting the device pages
> > > + * but where the mmap lock cannot be taken due to a locking inversion (e.g.
> > > + * DRM drivers). Since the mmap lock is not required to be held, the MMU
> > > + * invalidation can race with the VMA start being repurposed; worst case this
> > > + * would result in an unnecessary invalidation.
> > + */
> > +int migrate_device_vma_range(struct mm_struct *mm, void *pgmap_owner,
> > +			     unsigned long *src_pfns, unsigned long npages,
> > +			     unsigned long start)
> > +{
> > +	struct mmu_notifier_range range;
> > +	unsigned long i;
> > +
> > +	mmu_notifier_range_init_owner(&range, MMU_NOTIFY_MIGRATE, 0,
> > +				      mm, start, start + npages * PAGE_SIZE,
> > +				      pgmap_owner);
> > +	mmu_notifier_invalidate_range_start(&range);
> 
> This isn't needed, try_to_migrate called from migrate_device_unmap already
> has a notifier, if there's actually any ptes to clear. If you need this
> one you've missed a pte clear notification somewhere, or there's some
> other bad bug somewhere.

Thanks for the tip; let me pull this out and confirm that we get a
notifier from try_to_migrate when this function is called. Agreed, if we
do get a notifier, this is not needed.

Matt 

> -Sima
> 
> > +
> > +	for (i = 0; i < npages; i++) {
> > +		struct page *page = pfn_to_page(src_pfns[i]);
> > +
> > +		if (!get_page_unless_zero(page)) {
> > +			src_pfns[i] = 0;
> > +			continue;
> > +		}
> > +
> > +		if (!trylock_page(page)) {
> > +			src_pfns[i] = 0;
> > +			put_page(page);
> > +			continue;
> > +		}
> > +
> > +		src_pfns[i] = migrate_pfn(src_pfns[i]) | MIGRATE_PFN_MIGRATE;
> > +	}
> > +
> > +	migrate_device_unmap(src_pfns, npages, NULL);
> > +	mmu_notifier_invalidate_range_end(&range);
> > +
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL(migrate_device_vma_range);
> > +
> >  /*
> >   * Migrate a device coherent page back to normal memory. The caller should have
> >   * a reference on page which will be copied to the new page if migration is
> > -- 
> > 2.34.1
> > 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory
  2024-08-28 16:25         ` Daniel Vetter
@ 2024-08-29 16:40           ` Matthew Brost
  2024-09-02 11:29             ` Daniel Vetter
  0 siblings, 1 reply; 100+ messages in thread
From: Matthew Brost @ 2024-08-29 16:40 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Christian König, intel-xe, dri-devel, airlied,
	thomas.hellstrom, matthew.auld, daniel

On Wed, Aug 28, 2024 at 06:25:18PM +0200, Daniel Vetter wrote:
> On Wed, Aug 28, 2024 at 03:43:48PM +0000, Matthew Brost wrote:
> > On Wed, Aug 28, 2024 at 04:46:24PM +0200, Christian König wrote:
> > > Am 28.08.24 um 16:31 schrieb Daniel Vetter:
> > > > On Tue, Aug 27, 2024 at 07:48:38PM -0700, Matthew Brost wrote:
> > > > > +		if (!ctx->mmap_locked) {
> > > > > +			/*
> > > > > +			 * XXX: HMM locking document indicates only a read-lock
> > > > > +			 * is required but there appears to be a window between
> > > > > +			 * the MMU_NOTIFY_MIGRATE event triggered in a CPU fault
> > > > > +			 * via migrate_vma_setup and the pages actually moving
> > > > > +			 * in migrate_vma_finalize in which this code can grab
> > > > > +			 * garbage pages. Grabbing the write-lock if the range
> > > > > +			 * is attached to vram appears to protect against this
> > > > > +			 * race.
> > > > > +			 */
> > 
> > Thanks the comments, replying to both of you inline.
> > 
> > > > This one is really scary, since it means the entire migrate pte trickery
> > > > is essentially completely busted. Grabbing the mmap write lock just means
> > > > you block out pretty much everything interesting from concurrently
> > > > happening.
> > > > 
> > > > My gut feeling says we need to figure out what's happening here, because
> > > > this looks a bit too fundamental to me.
> > 
> > I agree. I haven’t looked into this issue for a couple of months but
> > really need to understand what is going on.
> > 
> > I should have mentioned this in the cover letter: the goal of this
> > series was to produce something for review that is stable and supports
> > UMDs/user applications. It was not intended to be presented as a final
> > solution. This issue certainly falls into the category of "needs to be
> > understood and requires a proper fix."
> > 
> > One open question I have is whether the test case that triggers this
> > issue is even defined behavior. The test creates concurrent access
> > between the GPU and CPU to the same memory address, resulting in GPU and
> > CPU faults racing against each other. It’s possible that this is
> > undefined behavior, so data corruption might be acceptable—i.e., the
> > kernel can’t crash, but incorrect results might be permissible.
> 
> Yes this is supposed to be defined, at least from an hmm pov. And core mm/
> is ridiculous in how many races it allows, especially around concurrent
> fault handling.
> 
> It is ofc really slow if every fault results in a migration, but that's a
> matter of the application setting stupid memory migration hints for the
> gpu.
> 
> > e.g. This is the only defined usage model:
> > 
> > alloc_memory();
> > start_compute_kernel();
> > sync_on_compute_kernel_completion();
> > read_memory();
> > 
> > Hopefully, in the next week or so, I'll be heavily engaging with the UMD
> > teams. Development can then start, and applications will be running soon
> > after. This will allow us to address issues like this, collect data on
> > memory usage, and verify some of the assumptions I've made, such as
> > optimizing for 2M+ allocations.
> > 
> > > 
> > > I think I have at least a high level understanding what's going on here,
> > > Felix and especially Philip should know more of the details.
> > > 
> > 
> > I meant to reach out to AMD for issues like this. So, Felix
> > (felix.kuehling@amd.com) and Philip (Philip.Yang@amd.com) would be good
> > contacts?
> > 
> > > In general grabbing the mm_lock to protect PTEs from changing is completely
> > > nonsense. The mm_lock is to protect the VMAs and *not* the PTEs!
> > > 
> > 
> > Thanks for the hint. I believe that in the AMD implementation, I noticed
> > some additional locks for migration, which might be how you mitigated
> > this issue.
> 
> Yeah, so in general hold mmap_reading is indeed pure magic thinking for
> preventing pte changes, like Christian points out. It doesn't stop
> invalidates, and with the per vma locking it also doesn't stop new valid

Invalidations happening in parallel to migrations, get-pages, or
bindings should be fine. The notifier lock usage should make all of
this safe.
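
e.g., a minimal sketch of the pattern relied on (commit_pages_sketch()
is made up for illustration, everything else matches the patch below):

static int commit_pages_sketch(struct drm_gpusvm *gpusvm,
			       struct drm_gpusvm_range *range,
			       struct mmu_interval_notifier *notifier,
			       unsigned long seq, struct page **pages)
{
	int err = 0;

	drm_gpusvm_notifier_lock(gpusvm);
	if (mmu_interval_read_retry(notifier, seq))
		err = -EAGAIN;	/* invalidation raced, caller re-runs hmm_range_fault() */
	else
		range->pages = pages;	/* safe: invalidate holds the same lock */
	drm_gpusvm_notifier_unlock(gpusvm);

	return err;
}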

> ptes from being inserted at least for anon memory.
> 
> Except migration pte entries that point at vram pages are special, and are
> _only_ resolved while holding mmap_read. Which means holding mmap_write
> for the case of looking up our own vram pages with hmm_range_fault
> actually prevents issues. And so this duct-tape of holding mmap_write very
> much looks like a working hack to plug any races against concurrently
> ongoing migrations to system memory due to cpu faults.
> 

Agree holding mmap_write is a hack. Looking at AMD's comment 'To
serialize concurrent migrations or validations of the same range, the
prange->migrate_mutex must be held.', it seems I could drop the mmap
write lock abuse and use something like this here. This would likely be
an inner lock of the mmap lock.

Does this seem like a reasonable thing to explore?
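
Purely illustrative of the direction (the migrate_mutex field and the
wrapper are made up, nothing here is tested):

static int xe_svm_migrate_to_vram_sketch(struct drm_gpusvm *gpusvm,
					 struct drm_gpusvm_range *range,
					 void *vram_allocation,
					 const struct drm_gpusvm_ctx *ctx)
{
	int err;

	/* Hypothetical inner lock of the mmap lock; also taken in the
	 * CPU fault / eviction paths so concurrent migrations or
	 * validations of the same range serialize here rather than on
	 * mmap_write_lock. */
	mutex_lock(&range->migrate_mutex);
	err = drm_gpusvm_migrate_to_vram(gpusvm, range, vram_allocation, ctx);
	mutex_unlock(&range->migrate_mutex);

	return err;
}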

> An even more fun corner case is multiple concurrent cpu faults on the same
> vram page. fork gets you that, or maybe a bit more reasonable mremap with

My understanding is that memory shared between processes cannot be
migrated due to current limitations in the migrate layer.

e.g. mmap called with MAP_SHARED is not eligible for migration.

Unsure what the behavior is if fork() is called on a process with
memory in VRAM and the child tries to access it. Maybe fork() is
different from MAP_SHARED in that parent / child processes can share
memory in VRAM?
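
For reference, roughly the test I have in mind for the fork() case.
gpu_touch() / cpu_check() are placeholders for the IGT plumbing and
error handling is omitted, so this is a sketch only:

#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern void gpu_touch(void *ptr, size_t size);		/* placeholder: GPU fault, migrate to VRAM */
extern void cpu_check(void *ptr, char val, size_t size);	/* placeholder: CPU read + compare */

static void fork_read_test(size_t size)
{
	char *ptr = malloc(size);
	pid_t pid;

	memset(ptr, 0xa5, size);
	gpu_touch(ptr, size);		/* parent's anon pages now device-private (VRAM) */

	pid = fork();
	if (pid == 0) {
		/* child CPU read of COW'd VRAM-backed pages -> migrate_to_ram */
		cpu_check(ptr, 0xa5, size);
		_exit(0);
	}
	waitpid(pid, NULL, 0);
	free(ptr);
}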

Also, I'm really unsure what would happen if user space calls fork()
while it has an Xe VM open and then tries to use it too. Before
commenting more on this, I need to play around with test cases like
this to educate myself.

My current tests don't use mremap; agree that would be good to add.
Again, before commenting more here, let me add more test cases to
educate myself.
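
Roughly what I understand the mremap corner to be (gpu_touch() /
cpu_check() placeholders as above, needs a 5.7+ kernel for
MREMAP_DONTUNMAP, error handling omitted):

#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>

static void mremap_dontunmap_test(size_t size)
{
	char *old = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *moved;

	memset(old, 0x5a, size);
	gpu_touch(old, size);		/* pages now device-private (VRAM) */

	/* PTEs (including device-private ones) move to the new VA and
	 * the old VA stays mapped but empty, yet the GPU SVM range
	 * still tracks the old VA. */
	moved = mremap(old, size, size, MREMAP_MAYMOVE | MREMAP_DONTUNMAP);

	cpu_check(moved, 0x5a, size);	/* CPU fault at a VA the driver never faulted */

	munmap(old, size);
	munmap(moved, size);
}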

> MREMAP_DONTUNMAP | MREMAP_MAYMOVE. I think just hammer the same va with
> multiple threads along isn't enough, it's better to have a private va for

I do have test cases where multiple CPU faults from threads hammer the
same memory. They found some bugs in my initial code, but as far as I
can tell multiple CPU faults in parallel do occur in my testing and do
work.
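
i.e. roughly along these lines (simplified, not the actual IGT code;
gpu_touch() placeholder as above):

#include <pthread.h>

#define NUM_CPU_THREADS 8

static void *cpu_hammer(void *arg)
{
	volatile char *ptr = arg;
	int i;

	for (i = 0; i < 1024; ++i)
		(void)ptr[0];	/* CPU fault -> migrate_to_ram whenever the page is in VRAM */

	return NULL;
}

static void threads_same_page_test(char *ptr, size_t size)
{
	pthread_t threads[NUM_CPU_THREADS];
	int i;

	for (i = 0; i < NUM_CPU_THREADS; ++i)
		pthread_create(&threads[i], NULL, cpu_hammer, ptr);
	gpu_touch(ptr, size);	/* concurrent GPU faults migrating back to VRAM */
	for (i = 0; i < NUM_CPU_THREADS; ++i)
		pthread_join(threads[i], NULL);
}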

> each thread pointing at the same anon memory page, so that you can get

You are losing me here - 'private va for each thread pointing at the
same anon memory page'. Is this the fork() case, where the parent
allocates memory and then all children try to read it in parallel?

> more parallel faults due to finely grained pte locking.
> 
> Would be a good testcase to add, if you don't have it yet.
>

See above; agreed, these are good test cases which I haven't
considered, and I will expand my suite to include them. Thanks for the
tip - IMO testing is as important as, or even more important than, the
KMD design, and I need to ensure I have all possible uses covered.

> > I must say it is a bit unfortunate that the HMM locking documentation
> > doesn’t mention this. I believe the documentation needs additional
> > information, which I can add once we finalize the solution.
> 
> Yeah, at least from my very cursory look you don't have enough locking.
> I've written an in-depth reply to patch 23 with the high-level summary of
> my thoughts.
>

Will look and reply there.

Matt

> Cheers, Sima
> 
> > 
> > Matt 
> > 
> > > Even with the write side of the mm_lock taken it is perfectly possible that
> > > PTE change. It's just less likely.
> > > 
> > > We run into multiple issues before we figured out this important distinction
> > > as well.
> > > 
> > > Christian.
> > > 
> > > > -Sima
> > > > 
> > > > 
> > > > > +			if (vram_pages)
> > > > > +				mmap_write_lock(mm);
> > > > > +			else
> > > > > +				mmap_read_lock(mm);
> > > > > +		}
> > > > > +		err = hmm_range_fault(&hmm_range);
> > > > > +		if (!ctx->mmap_locked) {
> > > > > +			if (vram_pages)
> > > > > +				mmap_write_unlock(mm);
> > > > > +			else
> > > > > +				mmap_read_unlock(mm);
> > > > > +		}
> > > > > +
> > > > > +		if (err == -EBUSY) {
> > > > > +			if (time_after(jiffies, timeout))
> > > > > +				break;
> > > > > +
> > > > > +			hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
> > > > > +			continue;
> > > > > +		}
> > > > > +		break;
> > > > > +	}
> > > > > +	if (!ctx->mmap_locked)
> > > > > +		mmput(mm);
> > > > > +	if (err)
> > > > > +		goto err_free;
> > > > > +
> > > > > +	pages = (struct page **)pfns;
> > > > > +
> > > > > +	if (ctx->prefault) {
> > > > > +		range->pages = pages;
> > > > > +		goto set_seqno;
> > > > > +	}
> > > > > +
> > > > > +map_pages:
> > > > > +	if (is_device_private_page(hmm_pfn_to_page(pfns[0]))) {
> > > > > +		WARN_ON_ONCE(!range->vram_allocation);
> > > > > +
> > > > > +		for (i = 0; i < npages; ++i) {
> > > > > +			pages[i] = hmm_pfn_to_page(pfns[i]);
> > > > > +
> > > > > +			if (WARN_ON_ONCE(!is_device_private_page(pages[i]))) {
> > > > > +				err = -EOPNOTSUPP;
> > > > > +				goto err_free;
> > > > > +			}
> > > > > +		}
> > > > > +
> > > > > +		/* Do not race with notifier unmapping pages */
> > > > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > > > +		range->flags.has_vram_pages = true;
> > > > > +		range->pages = pages;
> > > > > +		if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
> > > > > +			err = -EAGAIN;
> > > > > +			__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > > > > +		}
> > > > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > > > +	} else {
> > > > > +		dma_addr_t *dma_addr = (dma_addr_t *)pfns;
> > > > > +
> > > > > +		for_each_dma_page(i, j, npages, order) {
> > > > > +			if (WARN_ON_ONCE(i && order !=
> > > > > +					 hmm_pfn_to_map_order(pfns[i]))) {
> > > > > +				err = -EOPNOTSUPP;
> > > > > +				npages = i;
> > > > > +				goto err_unmap;
> > > > > +			}
> > > > > +			order = hmm_pfn_to_map_order(pfns[i]);
> > > > > +
> > > > > +			pages[j] = hmm_pfn_to_page(pfns[i]);
> > > > > +			if (WARN_ON_ONCE(is_zone_device_page(pages[j]))) {
> > > > > +				err = -EOPNOTSUPP;
> > > > > +				npages = i;
> > > > > +				goto err_unmap;
> > > > > +			}
> > > > > +
> > > > > +			set_page_dirty_lock(pages[j]);
> > > > > +			mark_page_accessed(pages[j]);
> > > > > +
> > > > > +			dma_addr[j] = dma_map_page(gpusvm->drm->dev,
> > > > > +						   pages[j], 0,
> > > > > +						   PAGE_SIZE << order,
> > > > > +						   DMA_BIDIRECTIONAL);
> > > > > +			if (dma_mapping_error(gpusvm->drm->dev, dma_addr[j])) {
> > > > > +				err = -EFAULT;
> > > > > +				npages = i;
> > > > > +				goto err_unmap;
> > > > > +			}
> > > > > +		}
> > > > > +
> > > > > +		/* Huge pages, reduce memory footprint */
> > > > > +		if (order) {
> > > > > +			dma_addr = kmalloc_array(j, sizeof(*dma_addr),
> > > > > +						 GFP_KERNEL);
> > > > > +			if (dma_addr) {
> > > > > +				for (i = 0; i < j; ++i)
> > > > > +					dma_addr[i] = (dma_addr_t)pfns[i];
> > > > > +				kvfree(pfns);
> > > > > +				kfree_mapping = true;
> > > > > +			} else {
> > > > > +				dma_addr = (dma_addr_t *)pfns;
> > > > > +			}
> > > > > +		}
> > > > > +
> > > > > +		/* Do not race with notifier unmapping pages */
> > > > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > > > +		range->order = order;
> > > > > +		range->flags.kfree_mapping = kfree_mapping;
> > > > > +		range->flags.has_dma_mapping = true;
> > > > > +		range->dma_addr = dma_addr;
> > > > > +		range->vram_allocation = NULL;
> > > > > +		if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
> > > > > +			err = -EAGAIN;
> > > > > +			__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > > > > +		}
> > > > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > > > +	}
> > > > > +
> > > > > +	if (err == -EAGAIN)
> > > > > +		goto retry;
> > > > > +set_seqno:
> > > > > +	range->notifier_seq = hmm_range.notifier_seq;
> > > > > +
> > > > > +	return 0;
> > > > > +
> > > > > +err_unmap:
> > > > > +	for_each_dma_page(i, j, npages, order)
> > > > > +		dma_unmap_page(gpusvm->drm->dev,
> > > > > +			       (dma_addr_t)pfns[j],
> > > > > +			       PAGE_SIZE << order, DMA_BIDIRECTIONAL);
> > > > > +err_free:
> > > > > +	if (alloc_pfns)
> > > > > +		kvfree(pfns);
> > > > > +err_out:
> > > > > +	return err;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_range_unmap_pages - Unmap pages associated with a GPU SVM range
> > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > + * @range: Pointer to the GPU SVM range structure
> > > > > + * @ctx: GPU SVM context
> > > > > + *
> > > > > + * This function unmaps pages associated with a GPU SVM range. If @in_notifier
> > > > > + * is set, it is assumed that gpusvm->notifier_lock is held in write mode; if it
> > > > > + * is clear, it acquires gpusvm->notifier_lock in read mode. Must be called on
> > > > > + * each GPU SVM range attached to notifier in gpusvm->ops->invalidate for IOMMU
> > > > > + * security model.
> > > > > + */
> > > > > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> > > > > +				  struct drm_gpusvm_range *range,
> > > > > +				  const struct drm_gpusvm_ctx *ctx)
> > > > > +{
> > > > > +	if (ctx->in_notifier)
> > > > > +		lockdep_assert_held_write(&gpusvm->notifier_lock);
> > > > > +	else
> > > > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > > > +
> > > > > +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > > > > +
> > > > > +	if (!ctx->in_notifier)
> > > > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_migration_put_page - Put a migration page
> > > > > + * @page: Pointer to the page to put
> > > > > + *
> > > > > + * This function unlocks and puts a page.
> > > > > + */
> > > > > +static void drm_gpusvm_migration_put_page(struct page *page)
> > > > > +{
> > > > > +	unlock_page(page);
> > > > > +	put_page(page);
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_migration_put_pages - Put migration pages
> > > > > + * @npages: Number of pages
> > > > > + * @migrate_pfn: Array of migrate page frame numbers
> > > > > + *
> > > > > + * This function puts an array of pages.
> > > > > + */
> > > > > +static void drm_gpusvm_migration_put_pages(unsigned long npages,
> > > > > +					   unsigned long *migrate_pfn)
> > > > > +{
> > > > > +	unsigned long i;
> > > > > +
> > > > > +	for (i = 0; i < npages; ++i) {
> > > > > +		if (!migrate_pfn[i])
> > > > > +			continue;
> > > > > +
> > > > > +		drm_gpusvm_migration_put_page(migrate_pfn_to_page(migrate_pfn[i]));
> > > > > +		migrate_pfn[i] = 0;
> > > > > +	}
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_get_vram_page - Get a reference to a VRAM page
> > > > > + * @page: Pointer to the page
> > > > > + * @zdd: Pointer to the GPU SVM zone device data
> > > > > + *
> > > > > + * This function associates the given page with the specified GPU SVM zone
> > > > > + * device data and initializes it for zone device usage.
> > > > > + */
> > > > > +static void drm_gpusvm_get_vram_page(struct page *page,
> > > > > +				     struct drm_gpusvm_zdd *zdd)
> > > > > +{
> > > > > +	page->zone_device_data = drm_gpusvm_zdd_get(zdd);
> > > > > +	zone_device_page_init(page);
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_migrate_map_pages() - Map migration pages for GPU SVM migration
> > > > > + * @dev: The device for which the pages are being mapped
> > > > > + * @dma_addr: Array to store DMA addresses corresponding to mapped pages
> > > > > + * @migrate_pfn: Array of migrate page frame numbers to map
> > > > > + * @npages: Number of pages to map
> > > > > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > > > > + *
> > > > > + * This function maps pages of memory for migration usage in GPU SVM. It
> > > > > + * iterates over each page frame number provided in @migrate_pfn, maps the
> > > > > + * corresponding page, and stores the DMA address in the provided @dma_addr
> > > > > + * array.
> > > > > + *
> > > > > + * Return: 0 on success, -EFAULT if an error occurs during mapping.
> > > > > + */
> > > > > +static int drm_gpusvm_migrate_map_pages(struct device *dev,
> > > > > +					dma_addr_t *dma_addr,
> > > > > +					long unsigned int *migrate_pfn,
> > > > > +					unsigned long npages,
> > > > > +					enum dma_data_direction dir)
> > > > > +{
> > > > > +	unsigned long i;
> > > > > +
> > > > > +	for (i = 0; i < npages; ++i) {
> > > > > +		struct page *page = migrate_pfn_to_page(migrate_pfn[i]);
> > > > > +
> > > > > +		if (!page)
> > > > > +			continue;
> > > > > +
> > > > > +		if (WARN_ON_ONCE(is_zone_device_page(page)))
> > > > > +			return -EFAULT;
> > > > > +
> > > > > +		dma_addr[i] = dma_map_page(dev, page, 0, PAGE_SIZE, dir);
> > > > > +		if (dma_mapping_error(dev, dma_addr[i]))
> > > > > +			return -EFAULT;
> > > > > +	}
> > > > > +
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_migrate_unmap_pages() - Unmap pages previously mapped for GPU SVM migration
> > > > > + * @dev: The device for which the pages were mapped
> > > > > + * @dma_addr: Array of DMA addresses corresponding to mapped pages
> > > > > + * @npages: Number of pages to unmap
> > > > > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > > > > + *
> > > > > + * This function unmaps previously mapped pages of memory for GPU Shared Virtual
> > > > > + * Memory (SVM). It iterates over each DMA address provided in @dma_addr, checks
> > > > > + * if it's valid and not already unmapped, and unmaps the corresponding page.
> > > > > + */
> > > > > +static void drm_gpusvm_migrate_unmap_pages(struct device *dev,
> > > > > +					   dma_addr_t *dma_addr,
> > > > > +					   unsigned long npages,
> > > > > +					   enum dma_data_direction dir)
> > > > > +{
> > > > > +	unsigned long i;
> > > > > +
> > > > > +	for (i = 0; i < npages; ++i) {
> > > > > +		if (!dma_addr[i] || dma_mapping_error(dev, dma_addr[i]))
> > > > > +			continue;
> > > > > +
> > > > > +		dma_unmap_page(dev, dma_addr[i], PAGE_SIZE, dir);
> > > > > +	}
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_migrate_to_vram - Migrate GPU SVM range to VRAM
> > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > + * @range: Pointer to the GPU SVM range structure
> > > > > + * @vram_allocation: Driver-private pointer to the VRAM allocation. The caller
> > > > > + *                   should hold a reference to the VRAM allocation, which
> > > > > + *                   should be dropped via ops->vram_allocation or upon the
> > > > > + *                   failure of this function.
> > > > > + * @ctx: GPU SVM context
> > > > > + *
> > > > > + * This function migrates the specified GPU SVM range to VRAM. It performs the
> > > > > + * necessary setup and invokes the driver-specific operations for migration to
> > > > > + * VRAM. Upon successful return, @vram_allocation can safely reference @range
> > > > > + * until ops->vram_release is called which only upon successful return.
> > > > > + *
> > > > > + * Returns:
> > > > > + * 0 on success, negative error code on failure.
> > > > > + */
> > > > > +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> > > > > +			       struct drm_gpusvm_range *range,
> > > > > +			       void *vram_allocation,
> > > > > +			       const struct drm_gpusvm_ctx *ctx)
> > > > > +{
> > > > > +	u64 start = range->va.start, end = range->va.end;
> > > > > +	struct migrate_vma migrate = {
> > > > > +		.start		= start,
> > > > > +		.end		= end,
> > > > > +		.pgmap_owner	= gpusvm->device_private_page_owner,
> > > > > +		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
> > > > > +	};
> > > > > +	struct mm_struct *mm = gpusvm->mm;
> > > > > +	unsigned long i, npages = npages_in_range(start, end);
> > > > > +	struct vm_area_struct *vas;
> > > > > +	struct drm_gpusvm_zdd *zdd = NULL;
> > > > > +	struct page **pages;
> > > > > +	dma_addr_t *dma_addr;
> > > > > +	void *buf;
> > > > > +	int err;
> > > > > +
> > > > > +	if (!range->flags.migrate_vram)
> > > > > +		return -EINVAL;
> > > > > +
> > > > > +	if (!gpusvm->ops->populate_vram_pfn || !gpusvm->ops->copy_to_vram ||
> > > > > +	    !gpusvm->ops->copy_to_sram)
> > > > > +		return -EOPNOTSUPP;
> > > > > +
> > > > > +	if (!ctx->mmap_locked) {
> > > > > +		if (!mmget_not_zero(mm)) {
> > > > > +			err = -EFAULT;
> > > > > +			goto err_out;
> > > > > +		}
> > > > > +		mmap_write_lock(mm);
> > > > > +	}
> > > > > +
> > > > > +	mmap_assert_locked(mm);
> > > > > +
> > > > > +	vas = vma_lookup(mm, start);
> > > > > +	if (!vas) {
> > > > > +		err = -ENOENT;
> > > > > +		goto err_mmunlock;
> > > > > +	}
> > > > > +
> > > > > +	if (end > vas->vm_end || start < vas->vm_start) {
> > > > > +		err = -EINVAL;
> > > > > +		goto err_mmunlock;
> > > > > +	}
> > > > > +
> > > > > +	if (!vma_is_anonymous(vas)) {
> > > > > +		err = -EBUSY;
> > > > > +		goto err_mmunlock;
> > > > > +	}
> > > > > +
> > > > > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) + sizeof(*dma_addr) +
> > > > > +		       sizeof(*pages), GFP_KERNEL);
> > > > > +	if (!buf) {
> > > > > +		err = -ENOMEM;
> > > > > +		goto err_mmunlock;
> > > > > +	}
> > > > > +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > > > > +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr)) * npages;
> > > > > +
> > > > > +	zdd = drm_gpusvm_zdd_alloc(range);
> > > > > +	if (!zdd) {
> > > > > +		err = -ENOMEM;
> > > > > +		goto err_free;
> > > > > +	}
> > > > > +
> > > > > +	migrate.vma = vas;
> > > > > +	migrate.src = buf;
> > > > > +	migrate.dst = migrate.src + npages;
> > > > > +
> > > > > +	err = migrate_vma_setup(&migrate);
> > > > > +	if (err)
> > > > > +		goto err_free;
> > > > > +
> > > > > +	/*
> > > > > +	 * FIXME: Below cases, !migrate.cpages and migrate.cpages != npages, not
> > > > > +	 * always an error. Need to revisit possible cases and how to handle. We
> > > > > +	 * could prefault on migrate.cpages != npages via hmm_range_fault.
> > > > > +	 */
> > > > > +
> > > > > +	if (!migrate.cpages) {
> > > > > +		err = -EFAULT;
> > > > > +		goto err_free;
> > > > > +	}
> > > > > +
> > > > > +	if (migrate.cpages != npages) {
> > > > > +		err = -EBUSY;
> > > > > +		goto err_finalize;
> > > > > +	}
> > > > > +
> > > > > +	err = gpusvm->ops->populate_vram_pfn(gpusvm, vram_allocation, npages,
> > > > > +					     migrate.dst);
> > > > > +	if (err)
> > > > > +		goto err_finalize;
> > > > > +
> > > > > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev, dma_addr,
> > > > > +					   migrate.src, npages, DMA_TO_DEVICE);
> > > > > +	if (err)
> > > > > +		goto err_finalize;
> > > > > +
> > > > > +	for (i = 0; i < npages; ++i) {
> > > > > +		struct page *page = pfn_to_page(migrate.dst[i]);
> > > > > +
> > > > > +		pages[i] = page;
> > > > > +		migrate.dst[i] = migrate_pfn(migrate.dst[i]);
> > > > > +		drm_gpusvm_get_vram_page(page, zdd);
> > > > > +	}
> > > > > +
> > > > > +	err = gpusvm->ops->copy_to_vram(gpusvm, pages, dma_addr, npages);
> > > > > +	if (err)
> > > > > +		goto err_finalize;
> > > > > +
> > > > > +	/* Upon success bind vram allocation to range and zdd */
> > > > > +	range->vram_allocation = vram_allocation;
> > > > > +	WRITE_ONCE(zdd->vram_allocation, vram_allocation);	/* Owns ref */
> > > > > +
> > > > > +err_finalize:
> > > > > +	if (err)
> > > > > +		drm_gpusvm_migration_put_pages(npages, migrate.dst);
> > > > > +	migrate_vma_pages(&migrate);
> > > > > +	migrate_vma_finalize(&migrate);
> > > > > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr, npages,
> > > > > +				       DMA_TO_DEVICE);
> > > > > +err_free:
> > > > > +	if (zdd)
> > > > > +		drm_gpusvm_zdd_put(zdd);
> > > > > +	kvfree(buf);
> > > > > +err_mmunlock:
> > > > > +	if (!ctx->mmap_locked) {
> > > > > +		mmap_write_unlock(mm);
> > > > > +		mmput(mm);
> > > > > +	}
> > > > > +err_out:
> > > > > +	return err;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_migrate_populate_sram_pfn - Populate SRAM PFNs for a VM area
> > > > > + * @vas: Pointer to the VM area structure, can be NULL
> > > > > + * @npages: Number of pages to populate
> > > > > + * @src_mpfn: Source array of migrate PFNs
> > > > > + * @mpfn: Array of migrate PFNs to populate
> > > > > + * @addr: Start address for PFN allocation
> > > > > + *
> > > > > + * This function populates the SRAM migrate page frame numbers (PFNs) for the
> > > > > + * specified VM area structure. It allocates and locks pages in the VM area for
> > > > > + * SRAM usage. If vas is non-NULL use alloc_page_vma for allocation, if NULL use
> > > > > + * alloc_page for allocation.
> > > > > + *
> > > > > + * Returns:
> > > > > + * 0 on success, negative error code on failure.
> > > > > + */
> > > > > +static int drm_gpusvm_migrate_populate_sram_pfn(struct vm_area_struct *vas,
> > > > > +						unsigned long npages,
> > > > > +						unsigned long *src_mpfn,
> > > > > +						unsigned long *mpfn, u64 addr)
> > > > > +{
> > > > > +	unsigned long i;
> > > > > +
> > > > > +	for (i = 0; i < npages; ++i, addr += PAGE_SIZE) {
> > > > > +		struct page *page;
> > > > > +
> > > > > +		if (!(src_mpfn[i] & MIGRATE_PFN_MIGRATE))
> > > > > +			continue;
> > > > > +
> > > > > +		if (vas)
> > > > > +			page = alloc_page_vma(GFP_HIGHUSER, vas, addr);
> > > > > +		else
> > > > > +			page = alloc_page(GFP_HIGHUSER);
> > > > > +
> > > > > +		if (!page)
> > > > > +			return -ENOMEM;
> > > > > +
> > > > > +		lock_page(page);
> > > > > +		mpfn[i] = migrate_pfn(page_to_pfn(page));
> > > > > +	}
> > > > > +
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_evict_to_sram - Evict GPU SVM range to SRAM
> > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > + * @range: Pointer to the GPU SVM range structure
> > > > > + *
> > > > > + * Similar to __drm_gpusvm_migrate_to_sram but does not require mmap lock and
> > > > > + * migration done via migrate_device_* functions. Fallback path as it is
> > > > > + * preferred to issue migrations with mmap lock.
> > > > > + *
> > > > > + * Returns:
> > > > > + * 0 on success, negative error code on failure.
> > > > > + */
> > > > > +static int drm_gpusvm_evict_to_sram(struct drm_gpusvm *gpusvm,
> > > > > +				    struct drm_gpusvm_range *range)
> > > > > +{
> > > > > +	unsigned long npages;
> > > > > +	struct page **pages;
> > > > > +	unsigned long *src, *dst;
> > > > > +	dma_addr_t *dma_addr;
> > > > > +	void *buf;
> > > > > +	int i, err = 0;
> > > > > +
> > > > > +	npages = npages_in_range(range->va.start, range->va.end);
> > > > > +
> > > > > +	buf = kvcalloc(npages, 2 * sizeof(*src) + sizeof(*dma_addr) +
> > > > > +		       sizeof(*pages), GFP_KERNEL);
> > > > > +	if (!buf) {
> > > > > +		err = -ENOMEM;
> > > > > +		goto err_out;
> > > > > +	}
> > > > > +	src = buf;
> > > > > +	dst = buf + (sizeof(*src) * npages);
> > > > > +	dma_addr = buf + (2 * sizeof(*src) * npages);
> > > > > +	pages = buf + (2 * sizeof(*src) + sizeof(*dma_addr)) * npages;
> > > > > +
> > > > > +	err = gpusvm->ops->populate_vram_pfn(gpusvm, range->vram_allocation,
> > > > > +					     npages, src);
> > > > > +	if (err)
> > > > > +		goto err_free;
> > > > > +
> > > > > +	err = migrate_device_vma_range(gpusvm->mm,
> > > > > +				       gpusvm->device_private_page_owner, src,
> > > > > +				       npages, range->va.start);
> > > > > +	if (err)
> > > > > +		goto err_free;
> > > > > +
> > > > > +	err = drm_gpusvm_migrate_populate_sram_pfn(NULL, npages, src, dst, 0);
> > > > > +	if (err)
> > > > > +		goto err_finalize;
> > > > > +
> > > > > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev, dma_addr,
> > > > > +					   dst, npages, DMA_BIDIRECTIONAL);
> > > > > +	if (err)
> > > > > +		goto err_finalize;
> > > > > +
> > > > > +	for (i = 0; i < npages; ++i)
> > > > > +		pages[i] = migrate_pfn_to_page(src[i]);
> > > > > +
> > > > > +	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr, npages);
> > > > > +	if (err)
> > > > > +		goto err_finalize;
> > > > > +
> > > > > +err_finalize:
> > > > > +	if (err)
> > > > > +		drm_gpusvm_migration_put_pages(npages, dst);
> > > > > +	migrate_device_pages(src, dst, npages);
> > > > > +	migrate_device_finalize(src, dst, npages);
> > > > > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr, npages,
> > > > > +				       DMA_BIDIRECTIONAL);
> > > > > +err_free:
> > > > > +	kvfree(buf);
> > > > > +err_out:
> > > > > +
> > > > > +	return err;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * __drm_gpusvm_migrate_to_sram - Migrate GPU SVM range to SRAM (internal)
> > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > + * @vas: Pointer to the VM area structure
> > > > > + * @page: Pointer to the page for fault handling (can be NULL)
> > > > > + * @start: Start address of the migration range
> > > > > + * @end: End address of the migration range
> > > > > + *
> > > > > + * This internal function performs the migration of the specified GPU SVM range
> > > > > + * to SRAM. It sets up the migration, populates + dma maps SRAM PFNs, and
> > > > > + * invokes the driver-specific operations for migration to SRAM.
> > > > > + *
> > > > > + * Returns:
> > > > > + * 0 on success, negative error code on failure.
> > > > > + */
> > > > > +static int __drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > > > > +					struct vm_area_struct *vas,
> > > > > +					struct page *page,
> > > > > +					u64 start, u64 end)
> > > > > +{
> > > > > +	struct migrate_vma migrate = {
> > > > > +		.vma		= vas,
> > > > > +		.pgmap_owner	= gpusvm->device_private_page_owner,
> > > > > +		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> > > > > +		.fault_page	= page,
> > > > > +	};
> > > > > +	unsigned long npages;
> > > > > +	struct page **pages;
> > > > > +	dma_addr_t *dma_addr;
> > > > > +	void *buf;
> > > > > +	int i, err = 0;
> > > > > +
> > > > > +	mmap_assert_locked(gpusvm->mm);
> > > > > +
> > > > > +	/* Corner where VMA area struct has been partially unmapped */
> > > > > +	if (start < vas->vm_start)
> > > > > +		start = vas->vm_start;
> > > > > +	if (end > vas->vm_end)
> > > > > +		end = vas->vm_end;
> > > > > +
> > > > > +	migrate.start = start;
> > > > > +	migrate.end = end;
> > > > > +	npages = npages_in_range(start, end);
> > > > > +
> > > > > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) + sizeof(*dma_addr) +
> > > > > +		       sizeof(*pages), GFP_KERNEL);
> > > > > +	if (!buf) {
> > > > > +		err = -ENOMEM;
> > > > > +		goto err_out;
> > > > > +	}
> > > > > +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > > > > +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr)) * npages;
> > > > > +
> > > > > +	migrate.vma = vas;
> > > > > +	migrate.src = buf;
> > > > > +	migrate.dst = migrate.src + npages;
> > > > > +
> > > > > +	err = migrate_vma_setup(&migrate);
> > > > > +	if (err)
> > > > > +		goto err_free;
> > > > > +
> > > > > +	/* Raced with another CPU fault, nothing to do */
> > > > > +	if (!migrate.cpages)
> > > > > +		goto err_free;
> > > > > +
> > > > > +	err = drm_gpusvm_migrate_populate_sram_pfn(vas, npages,
> > > > > +						   migrate.src, migrate.dst,
> > > > > +						   start);
> > > > > +	if (err)
> > > > > +		goto err_finalize;
> > > > > +
> > > > > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev, dma_addr,
> > > > > +					   migrate.dst, npages,
> > > > > +					   DMA_BIDIRECTIONAL);
> > > > > +	if (err)
> > > > > +		goto err_finalize;
> > > > > +
> > > > > +	for (i = 0; i < npages; ++i)
> > > > > +		pages[i] = migrate_pfn_to_page(migrate.src[i]);
> > > > > +
> > > > > +	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr, npages);
> > > > > +	if (err)
> > > > > +		goto err_finalize;
> > > > > +
> > > > > +err_finalize:
> > > > > +	if (err)
> > > > > +		drm_gpusvm_migration_put_pages(npages, migrate.dst);
> > > > > +	migrate_vma_pages(&migrate);
> > > > > +	migrate_vma_finalize(&migrate);
> > > > > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr, npages,
> > > > > +				       DMA_BIDIRECTIONAL);
> > > > > +err_free:
> > > > > +	kvfree(buf);
> > > > > +err_out:
> > > > > +	mmap_assert_locked(gpusvm->mm);
> > > > > +
> > > > > +	return err;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_migrate_to_sram - Migrate (evict) GPU SVM range to SRAM
> > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > + * @range: Pointer to the GPU SVM range structure
> > > > > + * @ctx: GPU SVM context
> > > > > + *
> > > > > + * This function initiates the migration of the specified GPU SVM range to
> > > > > + * SRAM. It performs necessary checks and invokes the internal migration
> > > > > + * function for actual migration.
> > > > > + *
> > > > > + * Returns:
> > > > > + * 0 on success, negative error code on failure.
> > > > > + */
> > > > > +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > > > > +			       struct drm_gpusvm_range *range,
> > > > > +			       const struct drm_gpusvm_ctx *ctx)
> > > > > +{
> > > > > +	u64 start = range->va.start, end = range->va.end;
> > > > > +	struct mm_struct *mm = gpusvm->mm;
> > > > > +	struct vm_area_struct *vas;
> > > > > +	int err;
> > > > > +	bool retry = false;
> > > > > +
> > > > > +	if (!ctx->mmap_locked) {
> > > > > +		if (!mmget_not_zero(mm)) {
> > > > > +			err = -EFAULT;
> > > > > +			goto err_out;
> > > > > +		}
> > > > > +		if (ctx->trylock_mmap) {
> > > > > +			if (!mmap_read_trylock(mm))  {
> > > > > +				err = drm_gpusvm_evict_to_sram(gpusvm, range);
> > > > > +				goto err_mmput;
> > > > > +			}
> > > > > +		} else {
> > > > > +			mmap_read_lock(mm);
> > > > > +		}
> > > > > +	}
> > > > > +
> > > > > +	mmap_assert_locked(mm);
> > > > > +
> > > > > +	/*
> > > > > +	 * Loop required to find all VMA area structs for the corner case when
> > > > > +	 * VRAM backing has been partially unmapped from MM's address space.
> > > > > +	 */
> > > > > +again:
> > > > > +	vas = find_vma(mm, start);
> > > > > +	if (!vas) {
> > > > > +		if (!retry)
> > > > > +			err = -ENOENT;
> > > > > +		goto err_mmunlock;
> > > > > +	}
> > > > > +
> > > > > +	if (end <= vas->vm_start || start >= vas->vm_end) {
> > > > > +		if (!retry)
> > > > > +			err = -EINVAL;
> > > > > +		goto err_mmunlock;
> > > > > +	}
> > > > > +
> > > > > +	err = __drm_gpusvm_migrate_to_sram(gpusvm, vas, NULL, start, end);
> > > > > +	if (err)
> > > > > +		goto err_mmunlock;
> > > > > +
> > > > > +	if (vas->vm_end < end) {
> > > > > +		retry = true;
> > > > > +		start = vas->vm_end;
> > > > > +		goto again;
> > > > > +	}
> > > > > +
> > > > > +	if (!ctx->mmap_locked) {
> > > > > +		mmap_read_unlock(mm);
> > > > > +		/*
> > > > > +		 * Using mmput_async as this function can be called while
> > > > > +		 * holding a dma-resv lock, and a final put can grab the mmap
> > > > > +		 * lock, causing a lock inversion.
> > > > > +		 */
> > > > > +		mmput_async(mm);
> > > > > +	}
> > > > > +
> > > > > +	return 0;
> > > > > +
> > > > > +err_mmunlock:
> > > > > +	if (!ctx->mmap_locked)
> > > > > +		mmap_read_unlock(mm);
> > > > > +err_mmput:
> > > > > +	if (!ctx->mmap_locked)
> > > > > +		mmput_async(mm);
> > > > > +err_out:
> > > > > +	return err;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_page_free - Put GPU SVM zone device data associated with a page
> > > > > + * @page: Pointer to the page
> > > > > + *
> > > > > + * This function is a callback used to put the GPU SVM zone device data
> > > > > + * associated with a page when it is being released.
> > > > > + */
> > > > > +static void drm_gpusvm_page_free(struct page *page)
> > > > > +{
> > > > > +	drm_gpusvm_zdd_put(page->zone_device_data);
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_migrate_to_ram - Migrate GPU SVM range to RAM (page fault handler)
> > > > > + * @vmf: Pointer to the fault information structure
> > > > > + *
> > > > > + * This function is a page fault handler used to migrate a GPU SVM range to RAM.
> > > > > + * It retrieves the GPU SVM range information from the faulting page and invokes
> > > > > + * the internal migration function to migrate the range back to RAM.
> > > > > + *
> > > > > + * Returns:
> > > > > + * VM_FAULT_SIGBUS on failure, 0 on success.
> > > > > + */
> > > > > +static vm_fault_t drm_gpusvm_migrate_to_ram(struct vm_fault *vmf)
> > > > > +{
> > > > > +	struct drm_gpusvm_zdd *zdd = vmf->page->zone_device_data;
> > > > > +	int err;
> > > > > +
> > > > > +	err = __drm_gpusvm_migrate_to_sram(zdd->range->gpusvm,
> > > > > +					   vmf->vma, vmf->page,
> > > > > +					   zdd->range->va.start,
> > > > > +					   zdd->range->va.end);
> > > > > +
> > > > > +	return err ? VM_FAULT_SIGBUS : 0;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_pagemap_ops - Device page map operations for GPU SVM
> > > > > + */
> > > > > +static const struct dev_pagemap_ops drm_gpusvm_pagemap_ops = {
> > > > > +	.page_free = drm_gpusvm_page_free,
> > > > > +	.migrate_to_ram = drm_gpusvm_migrate_to_ram,
> > > > > +};
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_pagemap_ops_get - Retrieve GPU SVM device page map operations
> > > > > + *
> > > > > + * Returns:
> > > > > + * Pointer to the GPU SVM device page map operations structure.
> > > > > + */
> > > > > +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void)
> > > > > +{
> > > > > +	return &drm_gpusvm_pagemap_ops;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_has_mapping - Check if GPU SVM has mapping for the given address range
> > > > > + * @gpusvm: Pointer to the GPU SVM structure.
> > > > > + * @start: Start address
> > > > > + * @end: End address
> > > > > + *
> > > > > + * Returns:
> > > > > + * True if GPU SVM has mapping, False otherwise
> > > > > + */
> > > > > +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start, u64 end)
> > > > > +{
> > > > > +	struct drm_gpusvm_notifier *notifier;
> > > > > +
> > > > > +	drm_gpusvm_for_each_notifier(notifier, gpusvm, start, end) {
> > > > > +		struct drm_gpusvm_range *range = NULL;
> > > > > +
> > > > > +		drm_gpusvm_for_each_range(range, notifier, start, end)
> > > > > +			return true;
> > > > > +	}
> > > > > +
> > > > > +	return false;
> > > > > +}
> > > > > diff --git a/drivers/gpu/drm/xe/drm_gpusvm.h b/drivers/gpu/drm/xe/drm_gpusvm.h
> > > > > new file mode 100644
> > > > > index 000000000000..0ea70f8534a8
> > > > > --- /dev/null
> > > > > +++ b/drivers/gpu/drm/xe/drm_gpusvm.h
> > > > > @@ -0,0 +1,415 @@
> > > > > +/* SPDX-License-Identifier: MIT */
> > > > > +/*
> > > > > + * Copyright © 2024 Intel Corporation
> > > > > + */
> > > > > +
> > > > > +#ifndef __DRM_GPUSVM_H__
> > > > > +#define __DRM_GPUSVM_H__
> > > > > +
> > > > > +#include <linux/kref.h>
> > > > > +#include <linux/mmu_notifier.h>
> > > > > +#include <linux/workqueue.h>
> > > > > +
> > > > > +struct dev_pagemap_ops;
> > > > > +struct drm_device;
> > > > > +struct drm_gpusvm;
> > > > > +struct drm_gpusvm_notifier;
> > > > > +struct drm_gpusvm_ops;
> > > > > +struct drm_gpusvm_range;
> > > > > +
> > > > > +/**
> > > > > + * struct drm_gpusvm_ops - Operations structure for GPU SVM
> > > > > + *
> > > > > + * This structure defines the operations for GPU Shared Virtual Memory (SVM).
> > > > > + * These operations are provided by the GPU driver to manage SVM ranges and
> > > > > + * perform operations such as migration between VRAM and system RAM.
> > > > > + */
> > > > > +struct drm_gpusvm_ops {
> > > > > +	/**
> > > > > +	 * @notifier_alloc: Allocate a GPU SVM notifier (optional)
> > > > > +	 *
> > > > > +	 * This function shall allocate a GPU SVM notifier.
> > > > > +	 *
> > > > > +	 * Returns:
> > > > > +	 * Pointer to the allocated GPU SVM notifier on success, NULL on failure.
> > > > > +	 */
> > > > > +	struct drm_gpusvm_notifier *(*notifier_alloc)(void);
> > > > > +
> > > > > +	/**
> > > > > +	 * @notifier_free: Free a GPU SVM notifier (optional)
> > > > > +	 * @notifier: Pointer to the GPU SVM notifier to be freed
> > > > > +	 *
> > > > > +	 * This function shall free a GPU SVM notifier.
> > > > > +	 */
> > > > > +	void (*notifier_free)(struct drm_gpusvm_notifier *notifier);
> > > > > +
> > > > > +	/**
> > > > > +	 * @range_alloc: Allocate a GPU SVM range (optional)
> > > > > +	 * @gpusvm: Pointer to the GPU SVM
> > > > > +	 *
> > > > > +	 * This function shall allocate a GPU SVM range.
> > > > > +	 *
> > > > > +	 * Returns:
> > > > > +	 * Pointer to the allocated GPU SVM range on success, NULL on failure.
> > > > > +	 */
> > > > > +	struct drm_gpusvm_range *(*range_alloc)(struct drm_gpusvm *gpusvm);
> > > > > +
> > > > > +	/**
> > > > > +	 * @range_free: Free a GPU SVM range (optional)
> > > > > +	 * @range: Pointer to the GPU SVM range to be freed
> > > > > +	 *
> > > > > +	 * This function shall free a GPU SVM range.
> > > > > +	 */
> > > > > +	void (*range_free)(struct drm_gpusvm_range *range);
> > > > > +
> > > > > +	/**
> > > > > +	 * @vram_release: Release VRAM allocation (optional)
> > > > > +	 * @vram_allocation: Driver-private pointer to the VRAM allocation
> > > > > +	 *
> > > > > +	 * This function shall release VRAM allocation and expects to drop a
> > > > > +	 * reference to VRAM allocation.
> > > > > +	 */
> > > > > +	void (*vram_release)(void *vram_allocation);
> > > > > +
> > > > > +	/**
> > > > > +	 * @populate_vram_pfn: Populate VRAM PFN (required for migration)
> > > > > +	 * @gpusvm: Pointer to the GPU SVM
> > > > > +	 * @vram_allocation: Driver-private pointer to the VRAM allocation
> > > > > +	 * @npages: Number of pages to populate
> > > > > +	 * @pfn: Array of page frame numbers to populate
> > > > > +	 *
> > > > > +	 * This function shall populate VRAM page frame numbers (PFN).
> > > > > +	 *
> > > > > +	 * Returns:
> > > > > +	 * 0 on success, a negative error code on failure.
> > > > > +	 */
> > > > > +	int (*populate_vram_pfn)(struct drm_gpusvm *gpusvm,
> > > > > +				 void *vram_allocation,
> > > > > +				 unsigned long npages,
> > > > > +				 unsigned long *pfn);
> > > > > +
> > > > > +	/**
> > > > > +	 * @copy_to_vram: Copy to VRAM (required for migration)
> > > > > +	 * @gpusvm: Pointer to the GPU SVM
> > > > > +	 * @pages: Pointer to array of VRAM pages (destination)
> > > > > +	 * @dma_addr: Pointer to array of DMA addresses (source)
> > > > > +	 * @npages: Number of pages to copy
> > > > > +	 *
> > > > > +	 * This function shall copy pages to VRAM.
> > > > > +	 *
> > > > > +	 * Returns:
> > > > > +	 * 0 on success, a negative error code on failure.
> > > > > +	 */
> > > > > +	int (*copy_to_vram)(struct drm_gpusvm *gpusvm,
> > > > > +			    struct page **pages,
> > > > > +			    dma_addr_t *dma_addr,
> > > > > +			    unsigned long npages);
> > > > > +
> > > > > +	/**
> > > > > +	 * @copy_to_sram: Copy to system RAM (required for migration)
> > > > > +	 * @gpusvm: Pointer to the GPU SVM
> > > > > +	 * @pages: Pointer to array of VRAM pages (source)
> > > > > +	 * @dma_addr: Pointer to array of DMA addresses (destination)
> > > > > +	 * @npages: Number of pages to copy
> > > > > +	 *
> > > > > +	 * This function shall copy pages to system RAM.
> > > > > +	 *
> > > > > +	 * Returns:
> > > > > +	 * 0 on success, a negative error code on failure.
> > > > > +	 */
> > > > > +	int (*copy_to_sram)(struct drm_gpusvm *gpusvm,
> > > > > +			    struct page **pages,
> > > > > +			    dma_addr_t *dma_addr,
> > > > > +			    unsigned long npages);
> > > > > +
> > > > > +	/**
> > > > > +	 * @invalidate: Invalidate GPU SVM notifier (required)
> > > > > +	 * @gpusvm: Pointer to the GPU SVM
> > > > > +	 * @notifier: Pointer to the GPU SVM notifier
> > > > > +	 * @mmu_range: Pointer to the mmu_notifier_range structure
> > > > > +	 *
> > > > > +	 * This function shall invalidate the GPU page tables. It can safely
> > > > > +	 * walk the notifier range RB tree/list in this function. Called while
> > > > > +	 * holding the notifier lock.
> > > > > +	 */
> > > > > +	void (*invalidate)(struct drm_gpusvm *gpusvm,
> > > > > +			   struct drm_gpusvm_notifier *notifier,
> > > > > +			   const struct mmu_notifier_range *mmu_range);
> > > > > +};
> > > > > +
> > > > > +/**
> > > > > + * struct drm_gpusvm_notifier - Structure representing a GPU SVM notifier
> > > > > + *
> > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > + * @notifier: MMU interval notifier
> > > > > + * @interval: Interval for the notifier
> > > > > + * @rb: Red-black tree node for the parent GPU SVM structure notifier tree
> > > > > + * @root: Cached root node of the RB tree containing ranges
> > > > > + * @range_list: List head containing of ranges in the same order they appear in
> > > > > + *              interval tree. This is useful to keep iterating ranges while
> > > > > + *              doing modifications to RB tree.
> > > > > + * @flags.removed: Flag indicating whether the MMU interval notifier has been
> > > > > + *                 removed
> > > > > + *
> > > > > + * This structure represents a GPU SVM notifier.
> > > > > + */
> > > > > +struct drm_gpusvm_notifier {
> > > > > +	struct drm_gpusvm *gpusvm;
> > > > > +	struct mmu_interval_notifier notifier;
> > > > > +	struct {
> > > > > +		u64 start;
> > > > > +		u64 end;
> > > > > +	} interval;
> > > > > +	struct {
> > > > > +		struct rb_node node;
> > > > > +		struct list_head entry;
> > > > > +		u64 __subtree_last;
> > > > > +	} rb;
> > > > > +	struct rb_root_cached root;
> > > > > +	struct list_head range_list;
> > > > > +	struct {
> > > > > +		u32 removed : 1;
> > > > > +	} flags;
> > > > > +};
> > > > > +
> > > > > +/**
> > > > > + * struct drm_gpusvm_range - Structure representing a GPU SVM range
> > > > > + *
> > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > + * @notifier: Pointer to the GPU SVM notifier
> > > > > + * @refcount: Reference count for the range
> > > > > + * @rb: Red-black tree node for the parent GPU SVM notifier structure range tree
> > > > > + * @va: Virtual address range
> > > > > + * @notifier_seq: Notifier sequence number of the range's pages
> > > > > + * @pages: Pointer to the array of pages (if backing store is in VRAM)
> > > > > + * @dma_addr: DMA address array (if backing store is SRAM and DMA mapped)
> > > > > + * @vram_allocation: Driver-private pointer to the VRAM allocation
> > > > > + * @order: Order of dma mapping. i.e. PAGE_SIZE << order is mapping size
> > > > > + * @flags.migrate_vram: Flag indicating whether the range can be migrated to VRAM
> > > > > + * @flags.unmapped: Flag indicating if the range has been unmapped
> > > > > + * @flags.partial_unmap: Flag indicating if the range has been partially unmapped
> > > > > + * @flags.has_vram_pages: Flag indicating if the range has vram pages
> > > > > + * @flags.has_dma_mapping: Flag indicating if the range has a DMA mapping
> > > > > + * @flags.kfree_mapping: Flag indicating @dma_addr is a compact allocation based
> > > > > + *                       on @order which releases via kfree
> > > > > + *
> > > > > + * This structure represents a GPU SVM range used for tracking memory ranges
> > > > > + * mapped in a DRM device.
> > > > > + */
> > > > > +struct drm_gpusvm_range {
> > > > > +	struct drm_gpusvm *gpusvm;
> > > > > +	struct drm_gpusvm_notifier *notifier;
> > > > > +	struct kref refcount;
> > > > > +	struct {
> > > > > +		struct rb_node node;
> > > > > +		struct list_head entry;
> > > > > +		u64 __subtree_last;
> > > > > +	} rb;
> > > > > +	struct {
> > > > > +		u64 start;
> > > > > +		u64 end;
> > > > > +	} va;
> > > > > +	unsigned long notifier_seq;
> > > > > +	union {
> > > > > +		struct page **pages;
> > > > > +		dma_addr_t *dma_addr;
> > > > > +	};
> > > > > +	void *vram_allocation;
> > > > > +	u16 order;
> > > > > +	struct {
> > > > > +		/* All flags below must be set upon creation */
> > > > > +		u16 migrate_vram : 1;
> > > > > +		/* All flags below must be set / cleared under notifier lock */
> > > > > +		u16 unmapped : 1;
> > > > > +		u16 partial_unmap : 1;
> > > > > +		u16 has_vram_pages : 1;
> > > > > +		u16 has_dma_mapping : 1;
> > > > > +		u16 kfree_mapping : 1;
> > > > > +	} flags;
> > > > > +};
> > > > > +
> > > > > +/**
> > > > > + * struct drm_gpusvm - GPU SVM structure
> > > > > + *
> > > > > + * @name: Name of the GPU SVM
> > > > > + * @drm: Pointer to the DRM device structure
> > > > > + * @mm: Pointer to the mm_struct for the address space
> > > > > + * @device_private_page_owner: Device private pages owner
> > > > > + * @mm_start: Start address of GPU SVM
> > > > > + * @mm_range: Range of the GPU SVM
> > > > > + * @notifier_size: Size of individual notifiers
> > > > > + * @ops: Pointer to the operations structure for GPU SVM
> > > > > + * @chunk_sizes: Pointer to the array of chunk sizes used in range allocation.
> > > > > + *               Entries should be powers of 2 in descending order.
> > > > > + * @num_chunks: Number of chunks
> > > > > + * @notifier_lock: Read-write semaphore for protecting notifier operations
> > > > > + * @zdd_wq: Workqueue for deferred work on zdd destruction
> > > > > + * @root: Cached root node of the Red-Black tree containing GPU SVM notifiers
> > > > > + * @notifier_list: list head containing of notifiers in the same order they
> > > > > + *                 appear in interval tree. This is useful to keep iterating
> > > > > + *                 notifiers while doing modifications to RB tree.
> > > > > + *
> > > > > + * This structure represents a GPU SVM (Shared Virtual Memory) used for tracking
> > > > > + * memory ranges mapped in a DRM (Direct Rendering Manager) device.
> > > > > + *
> > > > > + * No reference counting is provided, as this is expected to be embedded in the
> > > > > + * driver VM structure along with the struct drm_gpuvm, which handles reference
> > > > > + * counting.
> > > > > + */
> > > > > +struct drm_gpusvm {
> > > > > +	const char *name;
> > > > > +	struct drm_device *drm;
> > > > > +	struct mm_struct *mm;
> > > > > +	void *device_private_page_owner;
> > > > > +	u64 mm_start;
> > > > > +	u64 mm_range;
> > > > > +	u64 notifier_size;
> > > > > +	const struct drm_gpusvm_ops *ops;
> > > > > +	const u64 *chunk_sizes;
> > > > > +	int num_chunks;
> > > > > +	struct rw_semaphore notifier_lock;
> > > > > +	struct workqueue_struct *zdd_wq;
> > > > > +	struct rb_root_cached root;
> > > > > +	struct list_head notifier_list;
> > > > > +};
> > > > > +
> > > > > +/**
> > > > > + * struct drm_gpusvm_ctx - DRM GPU SVM context
> > > > > + *
> > > > > + * @mmap_locked: mmap lock is locked
> > > > > + * @trylock_mmap: trylock mmap lock, used to avoid locking inversions
> > > > > + *                (e.g.dma-revs -> mmap lock)
> > > > > + * @in_notifier: entering from a MMU notifier
> > > > > + * @read_only: operating on read-only memory
> > > > > + * @vram_possible: possible to use VRAM
> > > > > + * @prefault: prefault pages
> > > > > + *
> > > > > + * Context that is DRM GPUSVM is operating in (i.e. user arguments).
> > > > > + */
> > > > > +struct drm_gpusvm_ctx {
> > > > > +	u32 mmap_locked :1;
> > > > > +	u32 trylock_mmap :1;
> > > > > +	u32 in_notifier :1;
> > > > > +	u32 read_only :1;
> > > > > +	u32 vram_possible :1;
> > > > > +	u32 prefault :1;
> > > > > +};
> > > > > +
> > > > > +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> > > > > +		    const char *name, struct drm_device *drm,
> > > > > +		    struct mm_struct *mm, void *device_private_page_owner,
> > > > > +		    u64 mm_start, u64 mm_range, u64 notifier_size,
> > > > > +		    const struct drm_gpusvm_ops *ops,
> > > > > +		    const u64 *chunk_sizes, int num_chunks);
> > > > > +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm);
> > > > > +void drm_gpusvm_free(struct drm_gpusvm *gpusvm);
> > > > > +
> > > > > +struct drm_gpusvm_range *
> > > > > +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64 fault_addr,
> > > > > +				u64 gpuva_start, u64 gpuva_end,
> > > > > +				const struct drm_gpusvm_ctx *ctx);
> > > > > +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> > > > > +			     struct drm_gpusvm_range *range);
> > > > > +
> > > > > +struct drm_gpusvm_range *
> > > > > +drm_gpusvm_range_get(struct drm_gpusvm_range *range);
> > > > > +void drm_gpusvm_range_put(struct drm_gpusvm_range *range);
> > > > > +
> > > > > +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> > > > > +				  struct drm_gpusvm_range *range);
> > > > > +
> > > > > +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> > > > > +			       struct drm_gpusvm_range *range,
> > > > > +			       const struct drm_gpusvm_ctx *ctx);
> > > > > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> > > > > +				  struct drm_gpusvm_range *range,
> > > > > +				  const struct drm_gpusvm_ctx *ctx);
> > > > > +
> > > > > +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> > > > > +			       struct drm_gpusvm_range *range,
> > > > > +			       void *vram_allocation,
> > > > > +			       const struct drm_gpusvm_ctx *ctx);
> > > > > +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > > > > +			       struct drm_gpusvm_range *range,
> > > > > +			       const struct drm_gpusvm_ctx *ctx);
> > > > > +
> > > > > +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void);
> > > > > +
> > > > > +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start, u64 end);
> > > > > +
> > > > > +struct drm_gpusvm_range *
> > > > > +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64 start, u64 end);
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_notifier_lock - Lock GPU SVM notifier
> > > > > + * @gpusvm__: Pointer to the GPU SVM structure.
> > > > > + *
> > > > > + * Abstract client usage GPU SVM notifier lock, take lock
> > > > > + */
> > > > > +#define drm_gpusvm_notifier_lock(gpusvm__)	\
> > > > > +	down_read(&(gpusvm__)->notifier_lock)
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_notifier_unlock - Unlock GPU SVM notifier
> > > > > + * @gpusvm__: Pointer to the GPU SVM structure.
> > > > > + *
> > > > > + * Abstract client usage GPU SVM notifier lock, drop lock
> > > > > + */
> > > > > +#define drm_gpusvm_notifier_unlock(gpusvm__)	\
> > > > > +	up_read(&(gpusvm__)->notifier_lock)
> > > > > +
> > > > > +/**
> > > > > + * __drm_gpusvm_range_next - Get the next GPU SVM range in the list
> > > > > + * @range: a pointer to the current GPU SVM range
> > > > > + *
> > > > > + * Return: A pointer to the next drm_gpusvm_range if available, or NULL if the
> > > > > + *         current range is the last one or if the input range is NULL.
> > > > > + */
> > > > > +static inline struct drm_gpusvm_range *
> > > > > +__drm_gpusvm_range_next(struct drm_gpusvm_range *range)
> > > > > +{
> > > > > +	if (range && !list_is_last(&range->rb.entry,
> > > > > +				   &range->notifier->range_list))
> > > > > +		return list_next_entry(range, rb.entry);
> > > > > +
> > > > > +	return NULL;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_for_each_range - Iterate over GPU SVM ranges in a notifier
> > > > > + * @range__: Iterator variable for the ranges. If set, it indicates the start of
> > > > > + *	     the iterator. If NULL, call drm_gpusvm_range_find() to get the range.
> > > > > + * @notifier__: Pointer to the GPU SVM notifier
> > > > > + * @start__: Start address of the range
> > > > > + * @end__: End address of the range
> > > > > + *
> > > > > + * This macro is used to iterate over GPU SVM ranges in a notifier. It is safe
> > > > > + * to use while holding the driver SVM lock or the notifier lock.
> > > > > + */
> > > > > +#define drm_gpusvm_for_each_range(range__, notifier__, start__, end__)	\
> > > > > +	for ((range__) = (range__) ?:					\
> > > > > +	     drm_gpusvm_range_find((notifier__), (start__), (end__));	\
> > > > > +	     (range__) && (range__->va.start < (end__));		\
> > > > > +	     (range__) = __drm_gpusvm_range_next(range__))
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_range_set_unmapped - Mark a GPU SVM range as unmapped
> > > > > + * @range: Pointer to the GPU SVM range structure.
> > > > > + * @mmu_range: Pointer to the MMU notifier range structure.
> > > > > + *
> > > > > + * This function marks a GPU SVM range as unmapped and sets the partial_unmap flag
> > > > > + * if the range partially falls within the provided MMU notifier range.
> > > > > + */
> > > > > +static inline void
> > > > > +drm_gpusvm_range_set_unmapped(struct drm_gpusvm_range *range,
> > > > > +			      const struct mmu_notifier_range *mmu_range)
> > > > > +{
> > > > > +	lockdep_assert_held_write(&range->gpusvm->notifier_lock);
> > > > > +
> > > > > +	range->flags.unmapped = true;
> > > > > +	if (range->va.start < mmu_range->start ||
> > > > > +	    range->va.end > mmu_range->end)
> > > > > +		range->flags.partial_unmap = true;
> > > > > +}
> > > > > +
> > > > > +#endif /* __DRM_GPUSVM_H__ */
> > > > > -- 
> > > > > 2.34.1
> > > > > 
> > > 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory
  2024-08-28 18:50   ` Daniel Vetter
@ 2024-08-29 16:49     ` Matthew Brost
  2024-09-02 11:40       ` Daniel Vetter
  0 siblings, 1 reply; 100+ messages in thread
From: Matthew Brost @ 2024-08-29 16:49 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: intel-xe, dri-devel, airlied, christian.koenig, thomas.hellstrom,
	matthew.auld, daniel

On Wed, Aug 28, 2024 at 08:50:02PM +0200, Daniel Vetter wrote:
> On Tue, Aug 27, 2024 at 07:48:38PM -0700, Matthew Brost wrote:
> > +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > +			       struct drm_gpusvm_range *range,
> > +			       const struct drm_gpusvm_ctx *ctx)
> > +{
> > +	u64 start = range->va.start, end = range->va.end;
> > +	struct mm_struct *mm = gpusvm->mm;
> > +	struct vm_area_struct *vas;
> > +	int err;
> > +	bool retry = false;
> > +
> > +	if (!ctx->mmap_locked) {
> > +		if (!mmget_not_zero(mm)) {
> > +			err = -EFAULT;
> > +			goto err_out;
> > +		}
> > +		if (ctx->trylock_mmap) {
> > +			if (!mmap_read_trylock(mm))  {
> > +				err = drm_gpusvm_evict_to_sram(gpusvm, range);
> > +				goto err_mmput;
> > +			}
> > +		} else {
> > +			mmap_read_lock(mm);
> > +		}
> > +	}
> > +
> > +	mmap_assert_locked(mm);
> > +
> > +	/*
> > +	 * Loop required to find all VMA area structs for the corner case when
> > +	 * VRAM backing has been partially unmapped from MM's address space.
> > +	 */
> > +again:
> > +	vas = find_vma(mm, start);
> 
> So a hiliarous case that amdkfd gets a bit better but still not entirely
> is that the original vma might entirely gone. Even when you can still get
> at the mm of that process. This happens with cow (or shared too I think)
> mappings in forked child processes, or also if you play fun mremap games.
> 
> I think that outside of the ->migrate_to_ram callback migration/eviction
> to sram cannot assume there's any reasonable vma around and has to
> unconditionally go with the drm_gpusvm_evict_to_sram path.
> 

See my response here [1]. Let me drop the whole trylock thing and
convert to an 'evict' flag which calls drm_gpusvm_evict_to_sram in the
places where Xe needs to evict VRAM. Or maybe just export that function
and call it directly. That way the only place a VMA is looked up for a
VRAM -> SRAM migration is upon a CPU page fault.

[1] https://patchwork.freedesktop.org/patch/610955/?series=137870&rev=1#comment_1111164
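
i.e., something like this (purely illustrative; xe_svm_range_evict() is
a made-up caller and it assumes drm_gpusvm_evict_to_sram() gets
exported rather than staying static):

/* Xe eviction path sketch: no mmap lock, no VMA lookup, only the
 * migrate_device_* based path. */
static int xe_svm_range_evict(struct drm_gpusvm *gpusvm,
			      struct drm_gpusvm_range *range)
{
	return drm_gpusvm_evict_to_sram(gpusvm, range);
}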

> Also in the migrate_to_ram case the vma is essentially nothing else that
> informational about which ranges we might need if we prefault a bit (in
> case the child changed the vma compared to the original one). So it's good
> to as parameter for migrate_vma_setup, but absolutely nothing else.
> 
> amdkfd almost gets this right by being entirely based on their svm_range
> structures, except they still have the lingering check that the orignal mm
> is still alive. Of course you cannot ever use that memory on the gpu
> anymore, but the child process could get very pissed if their memory is
> suddenly gone. Also the eviction code has the same issue as yours and
> limits itself to vma that still exist in the original mm, leaving anything
> that's orphaned in children or remaps stuck in vram. At least that's my
> understanding, I might very well be wrong.
> 
> So we probably want a bunch of these testcases too to make sure that all
> works, and we're not stuck with memory allocations in vram that we can't
> move out.

When writing some additional test cases, let me add hooks in my IGTs so
we can also verify we are not orphaning VRAM.

Matt

> -Sima
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory
  2024-08-29  9:45   ` Daniel Vetter
@ 2024-08-29 17:27     ` Matthew Brost
  2024-09-02 11:53       ` Daniel Vetter
  0 siblings, 1 reply; 100+ messages in thread
From: Matthew Brost @ 2024-08-29 17:27 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: intel-xe, dri-devel, airlied, christian.koenig, thomas.hellstrom,
	matthew.auld, daniel

On Thu, Aug 29, 2024 at 11:45:08AM +0200, Daniel Vetter wrote:
> On Tue, Aug 27, 2024 at 07:48:38PM -0700, Matthew Brost wrote:
> > This patch introduces support for GPU Shared Virtual Memory (SVM) in the
> > Direct Rendering Manager (DRM) subsystem. SVM allows for seamless
> > sharing of memory between the CPU and GPU, enhancing performance and
> > flexibility in GPU computing tasks.
> > 
> > The patch adds the necessary infrastructure for SVM, including data
> > structures and functions for managing SVM ranges and notifiers. It also
> > provides mechanisms for allocating, deallocating, and migrating memory
> > regions between system RAM and GPU VRAM.
> > 
> > This mid-layer is largely inspired by GPUVM.
> > 
> > Cc: Dave Airlie <airlied@redhat.com>
> > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > Cc: Christian König <christian.koenig@amd.com>
> > Cc: <dri-devel@lists.freedesktop.org>
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> 
> Still not sure I've got the right race that you paper over with
> mmap_write_lock, but I spotted a few things, comments inline.
> 

I've replied to this issue several times; let's table the
mmap_write_lock issue in this reply - there are a lot of other things
to get through. Current thinking is to try to add a range->migrate_lock
like AMD's, which I state here [1]. Let's continue discussing the mmap
lock issue there if possible.

[1] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1111169
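
Roughly, as a sketch only (the migrate_lock field does not exist in this
revision and would need to be added to struct drm_gpusvm_range; the
wrapper name is made up):

	static int driver_migrate_to_vram(struct drm_gpusvm *gpusvm,
					  struct drm_gpusvm_range *range,
					  void *vram_allocation,
					  const struct drm_gpusvm_ctx *ctx)
	{
		int err;

		/*
		 * GPU fault migration, CPU fault migration, and eviction
		 * of the same range would serialize on the per-range lock
		 * instead of on mmap_write_lock.
		 */
		mutex_lock(&range->migrate_lock);
		err = drm_gpusvm_migrate_to_vram(gpusvm, range,
						 vram_allocation, ctx);
		mutex_unlock(&range->migrate_lock);

		return err;
	}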

> > ---
> >  drivers/gpu/drm/xe/Makefile     |    3 +-
> >  drivers/gpu/drm/xe/drm_gpusvm.c | 2174 +++++++++++++++++++++++++++++++
> >  drivers/gpu/drm/xe/drm_gpusvm.h |  415 ++++++
> >  3 files changed, 2591 insertions(+), 1 deletion(-)
> >  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.c
> >  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.h
> > 
> > diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> > index b9670ae09a9e..b8fc2ee58f1a 100644
> > --- a/drivers/gpu/drm/xe/Makefile
> > +++ b/drivers/gpu/drm/xe/Makefile
> > @@ -25,7 +25,8 @@ $(obj)/generated/%_wa_oob.c $(obj)/generated/%_wa_oob.h: $(obj)/xe_gen_wa_oob \
> >  
> >  # core driver code
> >  
> > -xe-y += xe_bb.o \
> > +xe-y += drm_gpusvm.o \
> > +	xe_bb.o \
> >  	xe_bo.o \
> >  	xe_bo_evict.o \
> >  	xe_devcoredump.o \
> > diff --git a/drivers/gpu/drm/xe/drm_gpusvm.c b/drivers/gpu/drm/xe/drm_gpusvm.c
> > new file mode 100644
> > index 000000000000..fc1e44e6ae72
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/drm_gpusvm.c
> > @@ -0,0 +1,2174 @@
> > +// SPDX-License-Identifier: MIT
> > +/*
> > + * Copyright © 2024 Intel Corporation
> > + *
> > + * Authors:
> > + *     Matthew Brost <matthew.brost@intel.com>
> > + */
> > +
> > +#include <linux/dma-mapping.h>
> > +#include <linux/interval_tree_generic.h>
> > +#include <linux/hmm.h>
> > +#include <linux/memremap.h>
> > +#include <linux/migrate.h>
> > +#include <linux/mm_types.h>
> > +#include <linux/pagemap.h>
> > +#include <linux/slab.h>
> > +
> > +#include <drm/drm_device.h>
> > +#include "drm_gpusvm.h"
> > +
> > +/**
> > + * DOC: Overview
> > + *
> > + * GPU Shared Virtual Memory (GPU SVM) layer for the Direct Rendering Manager (DRM)
> > + *
> > + * The GPU SVM layer is a component of the DRM framework designed to manage shared
> > + * virtual memory between the CPU and GPU. It enables efficient data exchange and
> > + * processing for GPU-accelerated applications by allowing memory sharing and
> > + * synchronization between the CPU's and GPU's virtual address spaces.
> > + *
> > + * Key GPU SVM Components:
> > + * - Notifiers: Used for tracking memory intervals and notifying the
> > + *		GPU of changes, notifiers are sized based on a GPU SVM
> > + *		initialization parameter, with a recommendation of 512M or
> > + *		larger. They maintain a Red-Black tree and a list of ranges that
> > + *		fall within the notifier interval. Notifiers are tracked within
> > + *		a GPU SVM Red-Black tree and list and are dynamically inserted
> > + *		or removed as ranges within the interval are created or
> > + *		destroyed.
> > + * - Ranges: Represent memory ranges mapped in a DRM device and managed
> > + *	     by GPU SVM. They are sized based on an array of chunk sizes, which
> > + *	     is a GPU SVM initialization parameter, and the CPU address space.
> > + *	     Upon GPU fault, the largest aligned chunk that fits within the
> > + *	     faulting CPU address space is chosen for the range size. Ranges are
> > + *	     expected to be dynamically allocated on GPU fault and removed on an
> > + *	     MMU notifier UNMAP event. As mentioned above, ranges are tracked in
> > + *	     a notifier's Red-Black tree.
> > + * - Operations: Define the interface for driver-specific SVM operations such as
> > + *		 allocation, page collection, migration, invalidations, and VRAM
> > + *		 release.
> > + *
> > + * This layer provides interfaces for allocating, mapping, migrating, and
> > + * releasing memory ranges between the CPU and GPU. It handles all core memory
> > + * management interactions (DMA mapping, HMM, and migration) and provides
> > + * driver-specific virtual functions (vfuncs). This infrastructure is sufficient
> > + * to build the expected driver components for an SVM implementation as detailed
> > + * below.
> > + *
> > + * Expected Driver Components:
> > + * - GPU page fault handler: Used to create ranges and notifiers based on the
> > + *			     fault address, optionally migrate the range to
> > + *			     VRAM, and create GPU bindings.
> > + * - Garbage collector: Used to destroy GPU bindings for ranges. Ranges are
> > + *			expected to be added to the garbage collector upon
> > + *			MMU_NOTIFY_UNMAP event.
> > + */
> > +
> > +/**
> > + * DOC: Locking
> > + *
> > + * GPU SVM handles locking for core MM interactions, i.e., it locks/unlocks the
> > + * mmap lock as needed. Alternatively, if the driver prefers to handle the mmap
> > + * lock itself, a 'locked' argument is provided to the functions that require
> > + * the mmap lock. This option may be useful for drivers that need to call into
> > + * GPU SVM while also holding a dma-resv lock, thus preventing locking
> > + * inversions between the mmap and dma-resv locks.
> > + *
> > + * GPU SVM introduces a global notifier lock, which safeguards the notifier's
> > + * range RB tree and list, as well as the range's DMA mappings and sequence
> > + * number. GPU SVM manages all necessary locking and unlocking operations,
> > + * except for the recheck of the range's sequence number
> > + * (mmu_interval_read_retry) when the driver is committing GPU bindings. This
> > + * lock corresponds to the 'driver->update' lock mentioned in the HMM
> > + * documentation (TODO: Link). Future revisions may transition from a GPU SVM
> > + * global lock to a per-notifier lock if finer-grained locking is deemed
> > + * necessary.
> > + *
> > + * In addition to the locking mentioned above, the driver should implement a
> > + * lock to safeguard core GPU SVM function calls that modify state, such as
> > + * drm_gpusvm_range_find_or_insert and drm_gpusvm_range_remove. Alternatively,
> > + * these core functions can be called within a single kernel thread, for
> > + * instance, using an ordered work queue. This lock is denoted as
> > + * 'driver_svm_lock' in code examples.
> 
> I think this doesn't work, because essentially it forces a single threaded
> design. Core mm isn't single threaded, and you cannot lock them all out,
> at least not easily.
> 
> So I think a design requirement is that gpusvm can cope with migrations to
> ram due to cpu faults, migrations for other reasons, gpu fault handling
> all concurrently. Currently with the combo of driver_svm_lock + taking
> mmap_write_lock you serialize this all a lot, which I think is hiding
> design bugs.

See above, mmap_write_lock is wrong; I will work on other solutions.
driver_svm_lock is a per-GPUSVM lock which in Xe maps to an existing
per-GPUVM lock. All of Xe's binding code requires this lock. It is only
taken in the path of a GPU fault, so only one GPU fault per VM can be
serviced at a time. Agree that CPU faults and migrations for other
reasons can happen in parallel with a GPU fault. Once we drop the mmap
write lock hack, this can freely happen.

> 
> > + */
> > +
> > +/**
> > + * DOC: Migration
> > + *
> > + * The migration support is quite simple, allowing migration between SRAM and
> > + * VRAM at the range granularity. For example, GPU SVM currently does not
> > + * support mixing SRAM and VRAM pages within a range. This means that upon GPU
> > + * fault, the entire range can be migrated to VRAM, and upon CPU fault, the
> > + * entire range is migrated to SRAM.
> > + *
> > + * The reasoning for only supporting range granularity is as follows: it
> > + * simplifies the implementation, and range sizes are driver-defined and should
> > + * be relatively small.
> > + */
> > +
> > +/**
> > + * DOC: Partial Unmapping of Ranges
> > + *
> > + * Partial unmapping of ranges (e.g., 1M out of 2M is unmapped by CPU resulting
> > + * in MMU_NOTIFY_UNMAP event) presents several challenges, with the main one
> > + * being that a subset of the range still has CPU and GPU mappings. If the
> > + * backing store for the range is in VRAM, a subset of the backing store has
> > + * references. One option would be to split the range and VRAM backing store,
> > + * but the implementation for this would be quite complicated. Given that
> > + * partial unmappings are rare and driver-defined range sizes are relatively
> > + * small, GPU SVM does not support splitting of ranges.
> > + *
> > + * With no support for range splitting, upon partial unmapping of a range, the
> > + * driver is expected to invalidate and destroy the entire range. If the range
> > + * has VRAM as its backing, the driver is also expected to migrate any remaining
> > + * pages back to SRAM.
> > + */
> > +
> > +/**
> > + * DOC: Examples
> > + *
> > + * This section provides two examples of how to build the expected driver
> > + * components: the GPU page fault handler and the garbage collector. A third
> > + * example demonstrates a sample invalidation driver vfunc.
> > + *
> > + * The generic code provided does not include logic for complex migration
> > + * policies, optimized invalidations, or other potentially required driver
> > + * locking (e.g., DMA-resv locks).
> > + *
> > + * 1) GPU page fault handler
> > + *
> > + *	int driver_bind_range(struct drm_gpusvm *gpusvm, struct drm_gpusvm_range *range)
> > + *	{
> > + *		int err = 0;
> > + *
> > + *		driver_alloc_and_setup_memory_for_bind(gpusvm, range);
> > + *
> > + *		drm_gpusvm_notifier_lock(gpusvm);
> > + *		if (drm_gpusvm_range_pages_valid(range))
> > + *			driver_commit_bind(gpusvm, range);
> > + *		else
> > + *			err = -EAGAIN;
> > + *		drm_gpusvm_notifier_unlock(gpusvm);
> > + *
> > + *		return err;
> > + *	}
> > + *
> > + *	int driver_gpu_fault(struct drm_gpusvm *gpusvm, u64 fault_addr,
> > + *			     u64 gpuva_start, u64 gpuva_end)
> > + *	{
> > + *		struct drm_gpusvm_ctx ctx = {};
> > + *		int err;
> > + *
> > + *		driver_svm_lock();
> > + *	retry:
> > + *		// Always process UNMAPs first so view of GPU SVM ranges is current
> > + *		driver_garbage_collector(gpusvm);
> > + *
> > + *		range = drm_gpusvm_range_find_or_insert(gpusvm, fault_addr,
> > + *							gpuva_start, gpuva_end,
> > + *						        &ctx);
> > + *		if (IS_ERR(range)) {
> > + *			err = PTR_ERR(range);
> > + *			goto unlock;
> > + *		}
> > + *
> > + *		if (driver_migration_policy(range)) {
> > + *			bo = driver_alloc_bo();
> > + *			err = drm_gpusvm_migrate_to_vram(gpusvm, range, bo, &ctx);
> > + *			if (err)	// CPU mappings may have changed
> > + *				goto retry;
> > + *		}
> > + *
> > + *		err = drm_gpusvm_range_get_pages(gpusvm, range, &ctx);
> > + *		if (err == -EFAULT || err == -EPERM)	// CPU mappings changed
> > + *			goto retry;
> > + *		else if (err)
> > + *			goto unlock;
> > + *
> > + *		err = driver_bind_range(gpusvm, range);
> > + *		if (err == -EAGAIN)	// CPU mappings changed
> > + *			goto retry;
> > + *
> > + *	unlock:
> > + *		driver_svm_unlock();
> > + *		return err;
> > + *	}
> > + *
> > + * 2) Garbage Collector.
> > + *
> > + *	void __driver_garbage_collector(struct drm_gpusvm *gpusvm,
> > + *					struct drm_gpusvm_range *range)
> > + *	{
> > + *		struct drm_gpusvm_ctx ctx = {};
> > + *
> > + *		assert_driver_svm_locked(gpusvm);
> > + *
> > + *		// Partial unmap, migrate any remaining VRAM pages back to SRAM
> > + *		if (range->flags.partial_unmap)
> > + *			drm_gpusvm_migrate_to_sram(gpusvm, range, &ctx);
> 
> Note that the migration back to sram isn't guaranteed to succeed, so you
> might still be stuck with a partially migrated range. This might be a case
> where hmm gives you vram pfns, but the range you have doesn't have any
> vram allocation anymore because you dropped it here. Not sure tbh.
>

Hmm isn't the picture here nor will a VMA once the
drm_gpusvm_evict_to_sram path is always taken as discussed here [2]. I
might have a corner case BO refcounting / TTM resource lookup bug in
somewhere in here which needs to be resolved though (e.g. eviction
racing with this code path), will try to close on that.

[2] https://patchwork.freedesktop.org/patch/610955/?series=137870&rev=1#comment_1111164
 
> > + *
> > + *		driver_unbind_range(range);
> > + *		drm_gpusvm_range_remove(gpusvm, range);
> > + *	}
> > + *
> > + *	void driver_garbage_collector(struct drm_gpusvm *gpusvm)
> > + *	{
> > + *		assert_driver_svm_locked(gpusvm);
> > + *
> > + *		for_each_range_in_garbage_collector(gpusvm, range)
> > + *			__driver_garbage_collector(gpusvm, range);
> > + *	}
> > + *
> > + * 3) Invalidation driver vfunc.
> > + *
> > + *	void driver_invalidation(struct drm_gpusvm *gpusvm,
> > + *				 struct drm_gpusvm_notifier *notifier,
> > + *				 const struct mmu_notifier_range *mmu_range)
> > + *	{
> > + *		struct drm_gpusvm_ctx ctx = { .in_notifier = true, };
> > + *		struct drm_gpusvm_range *range = NULL;
> > + *
> > + *		driver_invalidate_device_tlb(gpusvm, mmu_range->start, mmu_range->end);
> > + *
> > + *		drm_gpusvm_for_each_range(range, notifier, mmu_range->start,
> > + *					  mmu_range->end) {
> > + *			drm_gpusvm_range_unmap_pages(gpusvm, range, &ctx);
> > + *
> > + *			if (mmu_range->event != MMU_NOTIFY_UNMAP)
> > + *				continue;
> > + *
> > + *			drm_gpusvm_range_set_unmapped(range, mmu_range);
> > + *			driver_garbage_collector_add(gpusvm, range);
> > + *		}
> > + *	}
> > + */
> > +
> > +#define DRM_GPUSVM_RANGE_START(_range)	((_range)->va.start)
> > +#define DRM_GPUSVM_RANGE_END(_range)	((_range)->va.end - 1)
> > +INTERVAL_TREE_DEFINE(struct drm_gpusvm_range, rb.node, u64, rb.__subtree_last,
> > +		     DRM_GPUSVM_RANGE_START, DRM_GPUSVM_RANGE_END,
> > +		     static __maybe_unused, range);
> > +
> > +#define DRM_GPUSVM_NOTIFIER_START(_notifier)	((_notifier)->interval.start)
> > +#define DRM_GPUSVM_NOTIFIER_END(_notifier)	((_notifier)->interval.end - 1)
> > +INTERVAL_TREE_DEFINE(struct drm_gpusvm_notifier, rb.node, u64,
> > +		     rb.__subtree_last, DRM_GPUSVM_NOTIFIER_START,
> > +		     DRM_GPUSVM_NOTIFIER_END, static __maybe_unused, notifier);
> > +
> > +/**
> > + * npages_in_range() - Calculate the number of pages in a given range
> > + * @start__: The start address of the range
> > + * @end__: The end address of the range
> > + *
> > + * This macro calculates the number of pages in a given memory range,
> > + * specified by the start and end addresses. It divides the difference
> > + * between the end and start addresses by the page size (PAGE_SIZE) to
> > + * determine the number of pages in the range.
> > + *
> > + * Return: The number of pages in the specified range.
> > + */
> > +#define npages_in_range(start__, end__)	\
> > +	(((end__) - (start__)) >> PAGE_SHIFT)
> > +
> > +/**
> > + * struct drm_gpusvm_zdd - GPU SVM zone device data
> > + *
> > + * @refcount: Reference count for the zdd
> > + * @destroy_work: Work structure for asynchronous zdd destruction
> > + * @range: Pointer to the GPU SVM range
> > + * @vram_allocation: Driver-private pointer to the VRAM allocation
> > + *
> > + * This structure serves as a generic wrapper installed in
> > + * page->zone_device_data. It provides infrastructure for looking up a range
> > + * upon CPU page fault and asynchronously releasing VRAM once the CPU has no
> > + * page references. Asynchronous release is useful because CPU page references
> > + * can be dropped in IRQ contexts, while releasing VRAM likely requires sleeping
> > + * locks.
> > + */
> > +struct drm_gpusvm_zdd {
> > +	struct kref refcount;
> > +	struct work_struct destroy_work;
> > +	struct drm_gpusvm_range *range;
> > +	void *vram_allocation;
> > +};
> > +
> > +/**
> > + * drm_gpusvm_zdd_destroy_work_func - Work function for destroying a zdd
> > + * @w: Pointer to the work_struct
> > + *
> > + * This function releases VRAM, puts GPU SVM range, and frees zdd.
> > + */
> > +static void drm_gpusvm_zdd_destroy_work_func(struct work_struct *w)
> > +{
> > +	struct drm_gpusvm_zdd *zdd =
> > +		container_of(w, struct drm_gpusvm_zdd, destroy_work);
> > +	struct drm_gpusvm_range *range = zdd->range;
> > +	struct drm_gpusvm *gpusvm = range->gpusvm;
> > +
> > +	if (gpusvm->ops->vram_release && zdd->vram_allocation)
> > +		gpusvm->ops->vram_release(zdd->vram_allocation);
> > +	drm_gpusvm_range_put(range);
> > +	kfree(zdd);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_zdd_alloc - Allocate a zdd structure.
> > + * @range: Pointer to the GPU SVM range.
> > + *
> > + * This function allocates and initializes a new zdd structure. It sets up the
> > + * reference count, initializes the destroy work, and links the provided GPU SVM
> > + * range.
> > + *
> > + * Returns:
> > + * Pointer to the allocated zdd on success, ERR_PTR() on failure.
> > + */
> > +static struct drm_gpusvm_zdd *
> > +drm_gpusvm_zdd_alloc(struct drm_gpusvm_range *range)
> > +{
> > +	struct drm_gpusvm_zdd *zdd;
> > +
> > +	zdd = kmalloc(sizeof(*zdd), GFP_KERNEL);
> > +	if (!zdd)
> > +		return NULL;
> > +
> > +	kref_init(&zdd->refcount);
> > +	INIT_WORK(&zdd->destroy_work, drm_gpusvm_zdd_destroy_work_func);
> > +	zdd->range = drm_gpusvm_range_get(range);
> > +	zdd->vram_allocation = NULL;
> > +
> > +	return zdd;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_zdd_get - Get a reference to a zdd structure.
> > + * @zdd: Pointer to the zdd structure.
> > + *
> > + * This function increments the reference count of the provided zdd structure.
> > + *
> > + * Returns: Pointer to the zdd structure.
> > + */
> > +static struct drm_gpusvm_zdd *drm_gpusvm_zdd_get(struct drm_gpusvm_zdd *zdd)
> > +{
> > +	kref_get(&zdd->refcount);
> > +	return zdd;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_zdd_destroy - Destroy a zdd structure.
> > + * @ref: Pointer to the reference count structure.
> > + *
> > + * This function queues the destroy_work of the zdd for asynchronous destruction.
> > + */
> > +static void drm_gpusvm_zdd_destroy(struct kref *ref)
> > +{
> > +	struct drm_gpusvm_zdd *zdd =
> > +		container_of(ref, struct drm_gpusvm_zdd, refcount);
> > +	struct drm_gpusvm *gpusvm = zdd->range->gpusvm;
> > +
> > +	queue_work(gpusvm->zdd_wq, &zdd->destroy_work);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_zdd_put - Put a zdd reference.
> > + * @zdd: Pointer to the zdd structure.
> > + *
> > + * This function decrements the reference count of the provided zdd structure
> > + * and schedules its destruction if the count drops to zero.
> > + */
> > +static void drm_gpusvm_zdd_put(struct drm_gpusvm_zdd *zdd)
> > +{
> > +	kref_put(&zdd->refcount, drm_gpusvm_zdd_destroy);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_find - Find GPU SVM range from GPU SVM notifier
> > + * @notifier: Pointer to the GPU SVM notifier structure.
> > + * @start: Start address of the range
> > + * @end: End address of the range
> > + *
> > + * Return: A pointer to the drm_gpusvm_range if found or NULL
> > + */
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64 start, u64 end)
> > +{
> > +	return range_iter_first(&notifier->root, start, end - 1);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_for_each_range_safe - Safely iterate over GPU SVM ranges in a notifier
> > + * @range__: Iterator variable for the ranges
> > + * @next__: Iterator variable for the ranges temporary storage
> > + * @notifier__: Pointer to the GPU SVM notifier
> > + * @start__: Start address of the range
> > + * @end__: End address of the range
> > + *
> > + * This macro is used to iterate over GPU SVM ranges in a notifier while
> > + * removing ranges from it.
> > + */
> > +#define drm_gpusvm_for_each_range_safe(range__, next__, notifier__, start__, end__)	\
> > +	for ((range__) = drm_gpusvm_range_find((notifier__), (start__), (end__)),	\
> > +	     (next__) = __drm_gpusvm_range_next(range__);				\
> > +	     (range__) && (range__->va.start < (end__));				\
> > +	     (range__) = (next__), (next__) = __drm_gpusvm_range_next(range__))
> > +
> > +/**
> > + * __drm_gpusvm_notifier_next - get the next drm_gpusvm_notifier in the list
> > + * @notifier: a pointer to the current drm_gpusvm_notifier
> > + *
> > + * Return: A pointer to the next drm_gpusvm_notifier if available, or NULL if
> > + *         the current notifier is the last one or if the input notifier is
> > + *         NULL.
> > + */
> > +static struct drm_gpusvm_notifier *
> > +__drm_gpusvm_notifier_next(struct drm_gpusvm_notifier *notifier)
> > +{
> > +	if (notifier && !list_is_last(&notifier->rb.entry,
> > +				      &notifier->gpusvm->notifier_list))
> > +		return list_next_entry(notifier, rb.entry);
> > +
> > +	return NULL;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_for_each_notifier - Iterate over GPU SVM notifiers in a gpusvm
> > + * @notifier__: Iterator variable for the notifiers
> > + * @gpusvm__: Pointer to the GPU SVM structure
> > + * @start__: Start address of the notifier
> > + * @end__: End address of the notifier
> > + *
> > + * This macro is used to iterate over GPU SVM notifiers in a gpusvm.
> > + */
> > +#define drm_gpusvm_for_each_notifier(notifier__, gpusvm__, start__, end__)		\
> > +	for ((notifier__) = notifier_iter_first(&(gpusvm__)->root, (start__), (end__) - 1);	\
> > +	     (notifier__) && (notifier__->interval.start < (end__));			\
> > +	     (notifier__) = __drm_gpusvm_notifier_next(notifier__))
> > +
> > +/**
> > + * drm_gpusvm_for_each_notifier_safe - Safely iterate over GPU SVM notifiers in a gpusvm
> > + * @notifier__: Iterator variable for the notifiers
> > + * @next__: Iterator variable for the notifiers temporary storage
> > + * @gpusvm__: Pointer to the GPU SVM structure
> > + * @start__: Start address of the notifier
> > + * @end__: End address of the notifier
> > + *
> > + * This macro is used to iterate over GPU SVM notifiers in a gpusvm while
> > + * removing notifiers from it.
> > + */
> > +#define drm_gpusvm_for_each_notifier_safe(notifier__, next__, gpusvm__, start__, end__)	\
> > +	for ((notifier__) = notifier_iter_first(&(gpusvm__)->root, (start__), (end__) - 1),	\
> > +	     (next__) = __drm_gpusvm_notifier_next(notifier__);				\
> > +	     (notifier__) && (notifier__->interval.start < (end__));			\
> > +	     (notifier__) = (next__), (next__) = __drm_gpusvm_notifier_next(notifier__))
> > +
> > +/**
> > + * drm_gpusvm_notifier_invalidate - Invalidate a GPU SVM notifier.
> > + * @mni: Pointer to the mmu_interval_notifier structure.
> > + * @mmu_range: Pointer to the mmu_notifier_range structure.
> > + * @cur_seq: Current sequence number.
> > + *
> > + * This function serves as a generic MMU notifier for GPU SVM. It sets the MMU
> > + * notifier sequence number and calls the driver invalidate vfunc under
> > + * gpusvm->notifier_lock.
> > + *
> > + * Returns:
> > + * true if the operation succeeds, false otherwise.
> > + */
> > +static bool
> > +drm_gpusvm_notifier_invalidate(struct mmu_interval_notifier *mni,
> > +			       const struct mmu_notifier_range *mmu_range,
> > +			       unsigned long cur_seq)
> > +{
> > +	struct drm_gpusvm_notifier *notifier =
> > +		container_of(mni, typeof(*notifier), notifier);
> > +	struct drm_gpusvm *gpusvm = notifier->gpusvm;
> > +
> > +	if (!mmu_notifier_range_blockable(mmu_range))
> > +		return false;
> > +
> > +	down_write(&gpusvm->notifier_lock);
> > +	mmu_interval_set_seq(mni, cur_seq);
> > +	gpusvm->ops->invalidate(gpusvm, notifier, mmu_range);
> > +	up_write(&gpusvm->notifier_lock);
> > +
> > +	return true;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_notifier_ops - MMU interval notifier operations for GPU SVM
> > + */
> > +static const struct mmu_interval_notifier_ops drm_gpusvm_notifier_ops = {
> > +	.invalidate = drm_gpusvm_notifier_invalidate,
> > +};
> > +
> > +/**
> > + * drm_gpusvm_init - Initialize the GPU SVM.
> > + * @gpusvm: Pointer to the GPU SVM structure.
> > + * @name: Name of the GPU SVM.
> > + * @drm: Pointer to the DRM device structure.
> > + * @mm: Pointer to the mm_struct for the address space.
> > + * @device_private_page_owner: Device private pages owner.
> > + * @mm_start: Start address of GPU SVM.
> > + * @mm_range: Range of the GPU SVM.
> > + * @notifier_size: Size of individual notifiers.
> > + * @ops: Pointer to the operations structure for GPU SVM.
> > + * @chunk_sizes: Pointer to the array of chunk sizes used in range allocation.
> > + *               Entries should be powers of 2 in descending order with last
> > + *               entry being SZ_4K.
> > + * @num_chunks: Number of chunks.
> > + *
> > + * This function initializes the GPU SVM.
> > + *
> > + * Returns:
> > + * 0 on success, a negative error code on failure.
> > + */
> > +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> > +		    const char *name, struct drm_device *drm,
> > +		    struct mm_struct *mm, void *device_private_page_owner,
> > +		    u64 mm_start, u64 mm_range, u64 notifier_size,
> > +		    const struct drm_gpusvm_ops *ops,
> > +		    const u64 *chunk_sizes, int num_chunks)
> > +{
> > +	if (!ops->invalidate || !num_chunks)
> > +		return -EINVAL;
> > +
> > +	gpusvm->name = name;
> > +	gpusvm->drm = drm;
> > +	gpusvm->mm = mm;
> > +	gpusvm->device_private_page_owner = device_private_page_owner;
> > +	gpusvm->mm_start = mm_start;
> > +	gpusvm->mm_range = mm_range;
> > +	gpusvm->notifier_size = notifier_size;
> > +	gpusvm->ops = ops;
> > +	gpusvm->chunk_sizes = chunk_sizes;
> > +	gpusvm->num_chunks = num_chunks;
> > +	gpusvm->zdd_wq = system_wq;
> > +
> > +	mmgrab(mm);
> > +	gpusvm->root = RB_ROOT_CACHED;
> > +	INIT_LIST_HEAD(&gpusvm->notifier_list);
> > +
> > +	init_rwsem(&gpusvm->notifier_lock);
> > +
> > +	fs_reclaim_acquire(GFP_KERNEL);
> > +	might_lock(&gpusvm->notifier_lock);
> > +	fs_reclaim_release(GFP_KERNEL);
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_notifier_find - Find GPU SVM notifier
> > + * @gpusvm__: Pointer to the GPU SVM structure
> > + * @fault_addr__: Fault address
> > + *
> > + * This macro finds the GPU SVM notifier associated with the fault address.
> > + *
> > + * Returns:
> > + * Pointer to the GPU SVM notifier on success, NULL otherwise.
> > + */
> > +#define drm_gpusvm_notifier_find(gpusvm__, fault_addr__)	\
> > +	notifier_iter_first(&(gpusvm__)->root, (fault_addr__),	\
> > +			    (fault_addr__ + 1))
> > +
> > +/**
> > + * to_drm_gpusvm_notifier - retrieve the container struct for a given rbtree node
> > + * @node__: a pointer to the rbtree node embedded within a drm_gpusvm_notifier struct
> > + *
> > + * Return: A pointer to the containing drm_gpusvm_notifier structure.
> > + */
> > +#define to_drm_gpusvm_notifier(__node)				\
> > +	container_of((__node), struct drm_gpusvm_notifier, rb.node)
> > +
> > +/**
> > + * drm_gpusvm_notifier_insert - Insert GPU SVM notifier
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + *
> > + * This function inserts the GPU SVM notifier into the GPU SVM RB tree and list.
> > + */
> > +static void drm_gpusvm_notifier_insert(struct drm_gpusvm *gpusvm,
> > +				       struct drm_gpusvm_notifier *notifier)
> > +{
> > +	struct rb_node *node;
> > +	struct list_head *head;
> > +
> > +	notifier_insert(notifier, &gpusvm->root);
> > +
> > +	node = rb_prev(&notifier->rb.node);
> > +	if (node)
> > +		head = &(to_drm_gpusvm_notifier(node))->rb.entry;
> > +	else
> > +		head = &gpusvm->notifier_list;
> > +
> > +	list_add(&notifier->rb.entry, head);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_notifier_remove - Remove GPU SVM notifier
> > + * @gpusvm__: Pointer to the GPU SVM structure
> > + * @notifier__: Pointer to the GPU SVM notifier structure
> > + *
> > + * This macro removes the GPU SVM notifier from the GPU SVM RB tree and list.
> > + */
> > +#define drm_gpusvm_notifier_remove(gpusvm__, notifier__)	\
> > +	notifier_remove((notifier__), &(gpusvm__)->root);	\
> > +	list_del(&(notifier__)->rb.entry)
> > +
> > +/**
> > + * drm_gpusvm_fini - Finalize the GPU SVM.
> > + * @gpusvm: Pointer to the GPU SVM structure.
> > + *
> > + * This function finalizes the GPU SVM by cleaning up any remaining ranges and
> > + * notifiers, and dropping a reference to struct MM.
> > + */
> > +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm)
> > +{
> > +	struct drm_gpusvm_notifier *notifier, *next;
> > +
> > +	drm_gpusvm_for_each_notifier_safe(notifier, next, gpusvm, 0, LONG_MAX) {
> > +		struct drm_gpusvm_range *range, *__next;
> > +
> > +		/*
> > +		 * Remove notifier first to avoid racing with any invalidation
> > +		 */
> > +		mmu_interval_notifier_remove(&notifier->notifier);
> > +		notifier->flags.removed = true;
> > +
> > +		drm_gpusvm_for_each_range_safe(range, __next, notifier, 0,
> > +					       LONG_MAX)
> > +			drm_gpusvm_range_remove(gpusvm, range);
> > +	}
> > +
> > +	mmdrop(gpusvm->mm);
> > +	WARN_ON(!RB_EMPTY_ROOT(&gpusvm->root.rb_root));
> > +}
> > +
> > +/**
> > + * drm_gpusvm_notifier_alloc - Allocate GPU SVM notifier
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @fault_addr: Fault address
> > + *
> > + * This function allocates and initializes the GPU SVM notifier structure.
> > + *
> > + * Returns:
> > + * Pointer to the allocated GPU SVM notifier on success, ERR_PTR() on failure.
> > + */
> > +static struct drm_gpusvm_notifier *
> > +drm_gpusvm_notifier_alloc(struct drm_gpusvm *gpusvm, u64 fault_addr)
> > +{
> > +	struct drm_gpusvm_notifier *notifier;
> > +
> > +	if (gpusvm->ops->notifier_alloc)
> > +		notifier = gpusvm->ops->notifier_alloc();
> > +	else
> > +		notifier = kzalloc(sizeof(*notifier), GFP_KERNEL);
> > +
> > +	if (!notifier)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	notifier->gpusvm = gpusvm;
> > +	notifier->interval.start = ALIGN_DOWN(fault_addr, gpusvm->notifier_size);
> > +	notifier->interval.end = ALIGN(fault_addr + 1, gpusvm->notifier_size);
> > +	INIT_LIST_HEAD(&notifier->rb.entry);
> > +	notifier->root = RB_ROOT_CACHED;
> > +	INIT_LIST_HEAD(&notifier->range_list);
> > +
> > +	return notifier;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_notifier_free - Free GPU SVM notifier
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + *
> > + * This function frees the GPU SVM notifier structure.
> > + */
> > +static void drm_gpusvm_notifier_free(struct drm_gpusvm *gpusvm,
> > +				     struct drm_gpusvm_notifier *notifier)
> > +{
> > +	WARN_ON(!RB_EMPTY_ROOT(&notifier->root.rb_root));
> > +
> > +	if (gpusvm->ops->notifier_free)
> > +		gpusvm->ops->notifier_free(notifier);
> > +	else
> > +		kfree(notifier);
> > +}
> > +
> > +/**
> > + * to_drm_gpusvm_range - retrieve the container struct for a given rbtree node
> > + * @node__: a pointer to the rbtree node embedded within a drm_gpusvm_range struct
> > + *
> > + * Return: A pointer to the containing drm_gpusvm_range structure.
> > + */
> > +#define to_drm_gpusvm_range(node__)	\
> > +	container_of((node__), struct drm_gpusvm_range, rb.node)
> > +
> > +/**
> > + * drm_gpusvm_range_insert - Insert GPU SVM range
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + * @range: Pointer to the GPU SVM range structure
> > + *
> > + * This function inserts the GPU SVM range into the notifier RB tree and list.
> > + */
> > +static void drm_gpusvm_range_insert(struct drm_gpusvm_notifier *notifier,
> > +				    struct drm_gpusvm_range *range)
> > +{
> > +	struct rb_node *node;
> > +	struct list_head *head;
> > +
> > +	drm_gpusvm_notifier_lock(notifier->gpusvm);
> > +	range_insert(range, &notifier->root);
> > +
> > +	node = rb_prev(&range->rb.node);
> > +	if (node)
> > +		head = &(to_drm_gpusvm_range(node))->rb.entry;
> > +	else
> > +		head = &notifier->range_list;
> > +
> > +	list_add(&range->rb.entry, head);
> > +	drm_gpusvm_notifier_unlock(notifier->gpusvm);
> > +}
> > +
> > +/**
> > + * __drm_gpusvm_range_remove - Remove GPU SVM range
> > + * @notifier__: Pointer to the GPU SVM notifier structure
> > + * @range__: Pointer to the GPU SVM range structure
> > + *
> > + * This macro removes the GPU SVM range from the notifier RB tree and list.
> > + */
> > +#define __drm_gpusvm_range_remove(notifier__, range__)		\
> > +	range_remove((range__), &(notifier__)->root);		\
> > +	list_del(&(range__)->rb.entry)
> > +
> > +/**
> > + * drm_gpusvm_range_alloc - Allocate GPU SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + * @fault_addr: Fault address
> > + * @chunk_size: Chunk size
> > + * @migrate_vram: Flag indicating whether to migrate VRAM
> > + *
> > + * This function allocates and initializes the GPU SVM range structure.
> > + *
> > + * Returns:
> > + * Pointer to the allocated GPU SVM range on success, ERR_PTR() on failure.
> > + */
> > +static struct drm_gpusvm_range *
> > +drm_gpusvm_range_alloc(struct drm_gpusvm *gpusvm,
> > +		       struct drm_gpusvm_notifier *notifier,
> > +		       u64 fault_addr, u64 chunk_size, bool migrate_vram)
> > +{
> > +	struct drm_gpusvm_range *range;
> > +
> > +	if (gpusvm->ops->range_alloc)
> > +		range = gpusvm->ops->range_alloc(gpusvm);
> > +	else
> > +		range = kzalloc(sizeof(*range), GFP_KERNEL);
> > +
> > +	if (!range)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	kref_init(&range->refcount);
> > +	range->gpusvm = gpusvm;
> > +	range->notifier = notifier;
> > +	range->va.start = ALIGN_DOWN(fault_addr, chunk_size);
> > +	range->va.end = ALIGN(fault_addr + 1, chunk_size);
> > +	INIT_LIST_HEAD(&range->rb.entry);
> > +	range->notifier_seq = LONG_MAX;
> > +	range->flags.migrate_vram = migrate_vram ? 1 : 0;
> > +
> > +	return range;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_check_pages - Check pages
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + * @start: Start address
> > + * @end: End address
> > + *
> > + * Check if pages between start and end have been faulted in on the CPU. Used to
> > + * prevent migration of pages without CPU backing store.
> > + *
> > + * Returns:
> > + * True if pages have been faulted into CPU, False otherwise
> > + */
> > +static bool drm_gpusvm_check_pages(struct drm_gpusvm *gpusvm,
> > +				   struct drm_gpusvm_notifier *notifier,
> > +				   u64 start, u64 end)
> > +{
> > +	struct hmm_range hmm_range = {
> > +		.default_flags = 0,
> > +		.notifier = &notifier->notifier,
> > +		.start = start,
> > +		.end = end,
> > +		.dev_private_owner = gpusvm->device_private_page_owner,
> > +	};
> > +	unsigned long timeout =
> > +		jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > +	unsigned long *pfns;
> > +	unsigned long npages = npages_in_range(start, end);
> > +	int err, i;
> > +
> > +	mmap_assert_locked(gpusvm->mm);
> > +
> > +	pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
> > +	if (!pfns)
> > +		return false;
> > +
> > +	hmm_range.notifier_seq = mmu_interval_read_begin(&notifier->notifier);
> > +	hmm_range.hmm_pfns = pfns;
> > +
> > +	while (true) {
> > +		err = hmm_range_fault(&hmm_range);
> > +		if (err == -EBUSY) {
> > +			if (time_after(jiffies, timeout))
> > +				break;
> > +
> > +			hmm_range.notifier_seq = mmu_interval_read_begin(&notifier->notifier);
> > +			continue;
> > +		}
> > +		break;
> > +	}
> > +	if (err)
> > +		goto err_free;
> > +
> > +	for (i = 0; i < npages; ++i) {
> > +		if (!(pfns[i] & HMM_PFN_VALID)) {
> > +			err = -EFAULT;
> > +			goto err_free;
> > +		}
> > +	}
> > +
> > +err_free:
> > +	kvfree(pfns);
> > +	return err ? false : true;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_chunk_size - Determine chunk size for GPU SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + * @vas: Pointer to the virtual memory area structure
> > + * @fault_addr: Fault address
> > + * @gpuva_start: Start address of GPUVA which mirrors CPU
> > + * @gpuva_end: End address of GPUVA which mirrors CPU
> > + * @check_pages: Flag indicating whether to check pages
> > + *
> > + * This function determines the chunk size for the GPU SVM range based on the
> > + * fault address, GPU SVM chunk sizes, existing GPU SVM ranges, and the virtual
> > + * memory area boundaries.
> > + *
> > + * Returns:
> > + * Chunk size on success, LONG_MAX on failure.
> > + */
> > +static u64 drm_gpusvm_range_chunk_size(struct drm_gpusvm *gpusvm,
> > +				       struct drm_gpusvm_notifier *notifier,
> > +				       struct vm_area_struct *vas,
> > +				       u64 fault_addr, u64 gpuva_start,
> > +				       u64 gpuva_end, bool check_pages)
> > +{
> > +	u64 start, end;
> > +	int i = 0;
> > +
> > +retry:
> > +	for (; i < gpusvm->num_chunks; ++i) {
> > +		start = ALIGN_DOWN(fault_addr, gpusvm->chunk_sizes[i]);
> > +		end = ALIGN(fault_addr + 1, gpusvm->chunk_sizes[i]);
> > +
> > +		if (start >= vas->vm_start && end <= vas->vm_end &&
> > +		    start >= notifier->interval.start &&
> > +		    end <= notifier->interval.end &&
> > +		    start >= gpuva_start && end <= gpuva_end)
> > +			break;
> > +	}
> > +
> > +	if (i == gpusvm->num_chunks)
> > +		return LONG_MAX;
> > +
> > +	/*
> > +	 * If allocating more than a page, ensure it does not overlap with existing
> > +	 * ranges.
> > +	 */
> > +	if (end - start != SZ_4K) {
> > +		struct drm_gpusvm_range *range;
> > +
> > +		range = drm_gpusvm_range_find(notifier, start, end);
> > +		if (range) {
> > +			++i;
> > +			goto retry;
> > +		}
> > +
> > +		/*
> > +		 * XXX: Only create range on pages CPU has faulted in. Without
> > +		 * this check, or prefault, on BMG 'xe_exec_system_allocator --r
> > +		 * process-many-malloc' fails. In the failure case, each process
> > +		 * mallocs 16k but the CPU VMA is ~128k which results in 64k SVM
> > +		 * ranges. When migrating the SVM ranges, some processes fail in
> > +		 * drm_gpusvm_migrate_to_vram with 'migrate.cpages != npages'
> > +		 * and then upon drm_gpusvm_range_get_pages device pages from
> > +		 * other processes are collected + faulted in which creates all
> > +		 * sorts of problems. Unsure exactly how this is happening; also the
> > +		 * problem goes away if 'xe_exec_system_allocator --r
> > +		 * process-many-malloc' mallocs at least 64k at a time.
> > +		 */
> > +		if (check_pages &&
> > +		    !drm_gpusvm_check_pages(gpusvm, notifier, start, end)) {
> > +			++i;
> > +			goto retry;
> > +		}
> > +	}
> > +
> > +	return end - start;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_find_or_insert - Find or insert GPU SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @fault_addr: Fault address
> > + * @gpuva_start: Start address of GPUVA which mirrors CPU
> > + * @gpuva_end: End address of GPUVA which mirrors CPU
> > + * @ctx: GPU SVM context
> > + *
> > + * This function finds or inserts a newly allocated GPU SVM range based on the
> > + * fault address. Caller must hold a lock to protect range lookup and insertion.
> > + *
> > + * Returns:
> > + * Pointer to the GPU SVM range on success, ERR_PTR() on failure.
> > + */
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64 fault_addr,
> > +				u64 gpuva_start, u64 gpuva_end,
> > +				const struct drm_gpusvm_ctx *ctx)
> > +{
> > +	struct drm_gpusvm_notifier *notifier;
> > +	struct drm_gpusvm_range *range;
> > +	struct mm_struct *mm = gpusvm->mm;
> > +	struct vm_area_struct *vas;
> > +	bool notifier_alloc = false;
> > +	u64 chunk_size;
> > +	int err;
> > +	bool migrate_vram;
> > +
> > +	if (fault_addr < gpusvm->mm_start ||
> > +	    fault_addr > gpusvm->mm_start + gpusvm->mm_range) {
> > +		err = -EINVAL;
> > +		goto err_out;
> > +	}
> > +
> > +	if (!ctx->mmap_locked) {
> > +		if (!mmget_not_zero(mm)) {
> > +			err = -EFAULT;
> > +			goto err_out;
> > +		}
> > +		mmap_write_lock(mm);
> > +	}
> > +
> > +	mmap_assert_write_locked(mm);
> > +
> > +	notifier = drm_gpusvm_notifier_find(gpusvm, fault_addr);
> > +	if (!notifier) {
> > +		notifier = drm_gpusvm_notifier_alloc(gpusvm, fault_addr);
> > +		if (IS_ERR(notifier)) {
> > +			err = PTR_ERR(notifier);
> > +			goto err_mmunlock;
> > +		}
> > +		notifier_alloc = true;
> > +		err = mmu_interval_notifier_insert_locked(&notifier->notifier,
> > +							  mm, notifier->interval.start,
> > +							  notifier->interval.end -
> > +							  notifier->interval.start,
> > +							  &drm_gpusvm_notifier_ops);
> > +		if (err)
> > +			goto err_notifier;
> > +	}
> > +
> > +	vas = vma_lookup(mm, fault_addr);
> > +	if (!vas) {
> > +		err = -ENOENT;
> > +		goto err_notifier_remove;
> > +	}
> > +
> > +	if (!ctx->read_only && !(vas->vm_flags & VM_WRITE)) {
> > +		err = -EPERM;
> > +		goto err_notifier_remove;
> > +	}
> > +
> > +	range = drm_gpusvm_range_find(notifier, fault_addr, fault_addr + 1);
> > +	if (range)
> > +		goto out_mmunlock;
> > +	/*
> > +	 * XXX: Short-circuiting migration based on migrate_vma_* current
> > +	 * limitations. If/when migrate_vma_* add more support, this logic will
> > +	 * have to change.
> > +	 */
> > +	migrate_vram = ctx->vram_possible &&
> > +		vma_is_anonymous(vas) && !is_vm_hugetlb_page(vas);
> > +
> > +	chunk_size = drm_gpusvm_range_chunk_size(gpusvm, notifier, vas,
> > +						 fault_addr, gpuva_start,
> > +						 gpuva_end, migrate_vram &&
> > +						 !ctx->prefault);
> > +	if (chunk_size == LONG_MAX) {
> > +		err = -EINVAL;
> > +		goto err_notifier_remove;
> > +	}
> > +
> > +	range = drm_gpusvm_range_alloc(gpusvm, notifier, fault_addr, chunk_size,
> > +				       migrate_vram);
> > +	if (IS_ERR(range)) {
> > +		err = PTR_ERR(range);
> > +		goto err_notifier_remove;
> > +	}
> > +
> > +	drm_gpusvm_range_insert(notifier, range);
> > +	if (notifier_alloc)
> > +		drm_gpusvm_notifier_insert(gpusvm, notifier);
> > +
> > +	if (ctx->prefault) {
> > +		struct drm_gpusvm_ctx __ctx = *ctx;
> > +
> > +		__ctx.mmap_locked = true;
> > +		err = drm_gpusvm_range_get_pages(gpusvm, range, &__ctx);
> > +		if (err)
> > +			goto err_range_remove;
> > +	}
> > +
> > +out_mmunlock:
> > +	if (!ctx->mmap_locked) {
> > +		mmap_write_unlock(mm);
> > +		mmput(mm);
> > +	}
> > +
> > +	return range;
> > +
> > +err_range_remove:
> > +	__drm_gpusvm_range_remove(notifier, range);
> > +err_notifier_remove:
> > +	if (notifier_alloc)
> > +		mmu_interval_notifier_remove(&notifier->notifier);
> > +err_notifier:
> > +	if (notifier_alloc)
> > +		drm_gpusvm_notifier_free(gpusvm, notifier);
> > +err_mmunlock:
> > +	if (!ctx->mmap_locked) {
> > +		mmap_write_unlock(mm);
> > +		mmput(mm);
> > +	}
> > +err_out:
> > +	return ERR_PTR(err);
> > +}
> > +
> > +/**
> > + * for_each_dma_page - iterate over pages in a DMA region
> > + * @i__: the current page index in the iteration
> > + * @j__: the current page index, log order, in the iteration
> > + * @npages__: the total number of pages in the DMA region
> > + * @order__: the order of the pages in the DMA region
> > + *
> > + * This macro iterates over each page in a DMA region. The DMA region
> > + * is assumed to be composed of 2^@order__ pages, and the macro will
> > + * step through the region one block of 2^@order__ pages at a time.
> > + */
> > +#define for_each_dma_page(i__, j__, npages__, order__)	\
> > +	for ((i__) = 0, (j__) = 0; (i__) < (npages__);	\
> > +	     (j__)++, (i__) += 0x1 << (order__))
> > +
> > +/**
> > + * __drm_gpusvm_range_unmap_pages - Unmap pages associated with a GPU SVM range (internal)
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + *
> > + * This function unmaps pages associated with a GPU SVM range. Assumes and
> > + * asserts correct locking is in place when called.
> > + */
> > +static void __drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> > +					   struct drm_gpusvm_range *range)
> > +{
> > +	lockdep_assert_held(&gpusvm->notifier_lock);
> > +
> > +	if (range->pages) {
> > +		unsigned long i, j, npages = npages_in_range(range->va.start,
> > +							     range->va.end);
> > +
> > +		if (range->flags.has_dma_mapping) {
> > +			for_each_dma_page(i, j, npages, range->order)
> > +				dma_unmap_page(gpusvm->drm->dev,
> > +					       range->dma_addr[j],
> > +					       PAGE_SIZE << range->order,
> > +					       DMA_BIDIRECTIONAL);
> > +		}
> > +
> > +		range->flags.has_vram_pages = false;
> > +		range->flags.has_dma_mapping = false;
> > +	}
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_free_pages - Free pages associated with a GPU SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + *
> > + * This function frees pages associated with a GPU SVM range.
> > + */
> > +static void drm_gpusvm_range_free_pages(struct drm_gpusvm *gpusvm,
> > +					struct drm_gpusvm_range *range)
> > +{
> > +	lockdep_assert_held(&gpusvm->notifier_lock);
> > +
> > +	if (range->pages) {
> > +		if (range->flags.kfree_mapping) {
> > +			kfree(range->dma_addr);
> > +			range->flags.kfree_mapping = false;
> > +			range->pages = NULL;
> > +		} else {
> > +			kvfree(range->pages);
> > +			range->pages = NULL;
> > +		}
> > +	}
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_remove - Remove GPU SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range to be removed
> > + *
> > + * This function removes the specified GPU SVM range and also removes the parent
> > + * GPU SVM notifier if no more ranges remain in the notifier. The caller must
> > + * hold a lock to protect range and notifier removal.
> > + */
> > +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> > +			     struct drm_gpusvm_range *range)
> > +{
> > +	struct drm_gpusvm_notifier *notifier;
> > +
> > +	notifier = drm_gpusvm_notifier_find(gpusvm, range->va.start);
> > +	if (WARN_ON_ONCE(!notifier))
> > +		return;
> > +
> > +	drm_gpusvm_notifier_lock(gpusvm);
> > +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > +	drm_gpusvm_range_free_pages(gpusvm, range);
> > +	__drm_gpusvm_range_remove(notifier, range);
> > +	drm_gpusvm_notifier_unlock(gpusvm);
> > +
> > +	drm_gpusvm_range_put(range);
> > +
> > +	if (RB_EMPTY_ROOT(&notifier->root.rb_root)) {
> > +		if (!notifier->flags.removed)
> > +			mmu_interval_notifier_remove(&notifier->notifier);
> > +		drm_gpusvm_notifier_remove(gpusvm, notifier);
> > +		drm_gpusvm_notifier_free(gpusvm, notifier);
> > +	}
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_get - Get a reference to GPU SVM range
> > + * @range: Pointer to the GPU SVM range
> > + *
> > + * This function increments the reference count of the specified GPU SVM range.
> > + *
> > + * Returns:
> > + * Pointer to the GPU SVM range.
> > + */
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_get(struct drm_gpusvm_range *range)
> > +{
> > +	kref_get(&range->refcount);
> > +
> > +	return range;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_destroy - Destroy GPU SVM range
> > + * @refcount: Pointer to the reference counter embedded in the GPU SVM range
> > + *
> > + * This function destroys the specified GPU SVM range when its reference count
> > + * reaches zero. If a custom range-free function is provided, it is invoked to
> > + * free the range; otherwise, the range is deallocated using kfree().
> > + */
> > +static void drm_gpusvm_range_destroy(struct kref *refcount)
> > +{
> > +	struct drm_gpusvm_range *range =
> > +		container_of(refcount, struct drm_gpusvm_range, refcount);
> > +	struct drm_gpusvm *gpusvm = range->gpusvm;
> > +
> > +	if (gpusvm->ops->range_free)
> > +		gpusvm->ops->range_free(range);
> > +	else
> > +		kfree(range);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_put - Put a reference to GPU SVM range
> > + * @range: Pointer to the GPU SVM range
> > + *
> > + * This function decrements the reference count of the specified GPU SVM range
> > + * and frees it when the count reaches zero.
> > + */
> > +void drm_gpusvm_range_put(struct drm_gpusvm_range *range)
> > +{
> > +	kref_put(&range->refcount, drm_gpusvm_range_destroy);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_pages_valid - GPU SVM range pages valid
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + *
> > + * This function determines if a GPU SVM range's pages are valid. Expected to be
> > + * called holding gpusvm->notifier_lock and as the last step before committing a
> > + * GPU binding.
> > + *
> > + * Returns:
> > + * True if GPU SVM range has valid pages, False otherwise
> > + */
> > +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> > +				  struct drm_gpusvm_range *range)
> > +{
> > +	lockdep_assert_held(&gpusvm->notifier_lock);
> > +
> > +	return range->flags.has_vram_pages || range->flags.has_dma_mapping;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_pages_valid_unlocked - GPU SVM range pages valid unlocked
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + *
> > + * This function determines if a GPU SVM range's pages are valid. Expected to be
> > + * called without holding gpusvm->notifier_lock.
> > + *
> > + * Returns:
> > + * True if GPU SVM range has valid pages, False otherwise
> > + */
> > +static bool
> > +drm_gpusvm_range_pages_valid_unlocked(struct drm_gpusvm *gpusvm,
> > +				      struct drm_gpusvm_range *range)
> > +{
> > +	bool pages_valid;
> > +
> > +	if (!range->pages)
> > +		return false;
> > +
> > +	drm_gpusvm_notifier_lock(gpusvm);
> > +	pages_valid = drm_gpusvm_range_pages_valid(gpusvm, range);
> > +	if (!pages_valid && range->flags.kfree_mapping) {
> > +		kfree(range->dma_addr);
> > +		range->flags.kfree_mapping = false;
> > +		range->pages = NULL;
> > +	}
> > +	drm_gpusvm_notifier_unlock(gpusvm);
> > +
> > +	return pages_valid;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_get_pages - Get pages for a GPU SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + * @ctx: GPU SVM context
> > + *
> > + * This function gets pages for a GPU SVM range and ensures they are mapped for
> > + * DMA access.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> > +			       struct drm_gpusvm_range *range,
> > +			       const struct drm_gpusvm_ctx *ctx)
> > +{
> > +	struct mmu_interval_notifier *notifier = &range->notifier->notifier;
> > +	struct hmm_range hmm_range = {
> > +		.default_flags = HMM_PFN_REQ_FAULT | (ctx->read_only ? 0 :
> > +			HMM_PFN_REQ_WRITE),
> > +		.notifier = notifier,
> > +		.start = range->va.start,
> > +		.end = range->va.end,
> > +		.dev_private_owner = gpusvm->device_private_page_owner,
> > +	};
> > +	struct mm_struct *mm = gpusvm->mm;
> > +	unsigned long timeout =
> > +		jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > +	unsigned long i, j;
> > +	unsigned long npages = npages_in_range(range->va.start, range->va.end);
> > +	unsigned int order = 0;
> > +	unsigned long *pfns;
> > +	struct page **pages;
> > +	int err = 0;
> > +	bool vram_pages = !!range->flags.migrate_vram;
> > +	bool alloc_pfns = false, kfree_mapping;
> > +
> > +retry:
> > +	kfree_mapping = false;
> > +	hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
> > +	if (drm_gpusvm_range_pages_valid_unlocked(gpusvm, range))
> > +		return 0;
> > +
> > +	if (range->notifier_seq == hmm_range.notifier_seq && range->pages) {
> > +		if (ctx->prefault)
> > +			return 0;
> > +
> > +		pfns = (unsigned long *)range->pages;
> > +		pages = range->pages;
> > +		goto map_pages;
> > +	}
> > +
> > +	if (!range->pages) {
> > +		pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
> > +		if (!pfns)
> > +			return -ENOMEM;
> > +		alloc_pfns = true;
> > +	} else {
> > +		pfns = (unsigned long *)range->pages;
> > +	}
> > +
> > +	if (!ctx->mmap_locked) {
> > +		if (!mmget_not_zero(mm)) {
> > +			err = -EFAULT;
> > +			goto err_out;
> > +		}
> > +	}
> > +
> > +	hmm_range.hmm_pfns = pfns;
> > +	while (true) {
> > +		/* Must be checked after mmu_interval_read_begin */
> > +		if (range->flags.unmapped) {
> > +			err = -EFAULT;
> > +			break;
> > +		}
> > +
> > +		if (!ctx->mmap_locked) {
> > +			/*
> > +			 * XXX: HMM locking document indicates only a read-lock
> > +			 * is required but there appears to be a window between
> > +			 * the MMU_NOTIFY_MIGRATE event triggered in a CPU fault
> > +			 * via migrate_vma_setup and the pages actually moving
> > +			 * in migrate_vma_finalize in which this code can grab
> > +			 * garbage pages. Grabbing the write-lock if the range
> > +			 * is attached to vram appears to protect against this
> > +			 * race.
> > +			 */
> > +			if (vram_pages)
> > +				mmap_write_lock(mm);
> > +			else
> > +				mmap_read_lock(mm);
> > +		}
> > +		err = hmm_range_fault(&hmm_range);
> > +		if (!ctx->mmap_locked) {
> > +			if (vram_pages)
> > +				mmap_write_unlock(mm);
> > +			else
> > +				mmap_read_unlock(mm);
> > +		}
> > +
> > +		if (err == -EBUSY) {
> > +			if (time_after(jiffies, timeout))
> > +				break;
> > +
> > +			hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
> > +			continue;
> > +		}
> > +		break;
> > +	}
> > +	if (!ctx->mmap_locked)
> > +		mmput(mm);
> > +	if (err)
> > +		goto err_free;
> > +
> > +	pages = (struct page **)pfns;
> > +
> > +	if (ctx->prefault) {
> > +		range->pages = pages;
> > +		goto set_seqno;
> > +	}
> > +
> > +map_pages:
> > +	if (is_device_private_page(hmm_pfn_to_page(pfns[0]))) {
> > +		WARN_ON_ONCE(!range->vram_allocation);
> > +
> > +		for (i = 0; i < npages; ++i) {
> > +			pages[i] = hmm_pfn_to_page(pfns[i]);
> > +
> > +			if (WARN_ON_ONCE(!is_device_private_page(pages[i]))) {
> > +				err = -EOPNOTSUPP;
> > +				goto err_free;
> > +			}
> > +		}
> 
> You can't do the above, because the pfns you get from hmm come with zero
> guarantees: you neither hold a page reference nor the page lock. The only
> thing you can do is grab the pagetable lock (or mmu notifier locks) and
> check it's still valid, before you can touch any state. I think the
> range->vram_allocation is probably always valid since you clean that up
> under the same lock/thread, but there's good chances the vram allocation
> is otherwise already gone for good. Or you get an inconsistent snapshot.
> 

I haven't seen this pop in my testing yet, which is fairly thorough. My
thinking was that, with migration always being enforced at range
granularity, we'd never get mixed mappings from the core as migration
is completely under the driver's control. Maybe I'm not understanding
what you are saying here...

> > +
> > +		/* Do not race with notifier unmapping pages */
> > +		drm_gpusvm_notifier_lock(gpusvm);
> > +		range->flags.has_vram_pages = true;
> > +		range->pages = pages;
> > +		if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
> > +			err = -EAGAIN;
> > +			__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > +		}
> > +		drm_gpusvm_notifier_unlock(gpusvm);
> > +	} else {
> > +		dma_addr_t *dma_addr = (dma_addr_t *)pfns;
> > +
> > +		for_each_dma_page(i, j, npages, order) {
> > +			if (WARN_ON_ONCE(i && order !=
> > +					 hmm_pfn_to_map_order(pfns[i]))) {
> > +				err = -EOPNOTSUPP;
> > +				npages = i;
> > +				goto err_unmap;
> > +			}
> > +			order = hmm_pfn_to_map_order(pfns[i]);
> > +
> > +			pages[j] = hmm_pfn_to_page(pfns[i]);
> > +			if (WARN_ON_ONCE(is_zone_device_page(pages[j]))) {
> > +				err = -EOPNOTSUPP;
> > +				npages = i;
> > +				goto err_unmap;
> > +			}
> > +
> > +			set_page_dirty_lock(pages[j]);
> > +			mark_page_accessed(pages[j]);
> 
> You can't do these, because you don't hold a page reference. They're also
> not needed because hmm_range_fault goes through the full mkwrite dance,
> which takes care of these, unlike the gup family of functions.
>

This is a leftover from our existing userptr code and it does appear to
be incorrect. Let me remove this and fix up our userptr code while I'm
at it.
 
> > +
> > +			dma_addr[j] = dma_map_page(gpusvm->drm->dev,
> > +						   pages[j], 0,
> > +						   PAGE_SIZE << order,
> > +						   DMA_BIDIRECTIONAL);
> > +			if (dma_mapping_error(gpusvm->drm->dev, dma_addr[j])) {
> > +				err = -EFAULT;
> > +				npages = i;
> > +				goto err_unmap;
> > +			}
> 
> Aside: dma_map_page is about the only thing that's ok, because it doesn't
> do anything harmful and especially doesn't make any assumption about what
> that page is.
> 

+1

> > +		}
> > +
> > +		/* Huge pages, reduce memory footprint */
> > +		if (order) {
> > +			dma_addr = kmalloc_array(j, sizeof(*dma_addr),
> > +						 GFP_KERNEL);
> > +			if (dma_addr) {
> > +				for (i = 0; i < j; ++i)
> > +					dma_addr[i] = (dma_addr_t)pfns[i];
> > +				kvfree(pfns);
> > +				kfree_mapping = true;
> > +			} else {
> > +				dma_addr = (dma_addr_t *)pfns;
> > +			}
> > +		}
> > +
> > +		/* Do not race with notifier unmapping pages */
> > +		drm_gpusvm_notifier_lock(gpusvm);
> > +		range->order = order;
> > +		range->flags.kfree_mapping = kfree_mapping;
> > +		range->flags.has_dma_mapping = true;
> > +		range->dma_addr = dma_addr;
> > +		range->vram_allocation = NULL;
> > +		if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
> > +			err = -EAGAIN;
> > +			__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > +		}
> > +		drm_gpusvm_notifier_unlock(gpusvm);
> > +	}
> > +
> > +	if (err == -EAGAIN)
> > +		goto retry;
> > +set_seqno:
> > +	range->notifier_seq = hmm_range.notifier_seq;
> > +
> > +	return 0;
> > +
> > +err_unmap:
> > +	for_each_dma_page(i, j, npages, order)
> > +		dma_unmap_page(gpusvm->drm->dev,
> > +			       (dma_addr_t)pfns[j],
> > +			       PAGE_SIZE << order, DMA_BIDIRECTIONAL);
> > +err_free:
> > +	if (alloc_pfns)
> > +		kvfree(pfns);
> > +err_out:
> > +	return err;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_unmap_pages - Unmap pages associated with a GPU SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + * @ctx: GPU SVM context
> > + *
> > + * This function unmaps pages associated with a GPU SVM range. If @in_notifier
> > + * is set, it is assumed that gpusvm->notifier_lock is held in write mode; if it
> > + * is clear, it acquires gpusvm->notifier_lock in read mode. Must be called on
> > + * each GPU SVM range attached to notifier in gpusvm->ops->invalidate for IOMMU
> > + * security model.
> > + */
> > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> > +				  struct drm_gpusvm_range *range,
> > +				  const struct drm_gpusvm_ctx *ctx)
> > +{
> > +	if (ctx->in_notifier)
> > +		lockdep_assert_held_write(&gpusvm->notifier_lock);
> > +	else
> > +		drm_gpusvm_notifier_lock(gpusvm);
> > +
> > +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > +
> > +	if (!ctx->in_notifier)
> > +		drm_gpusvm_notifier_unlock(gpusvm);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migration_put_page - Put a migration page
> > + * @page: Pointer to the page to put
> > + *
> > + * This function unlocks and puts a page.
> > + */
> > +static void drm_gpusvm_migration_put_page(struct page *page)
> > +{
> > +	unlock_page(page);
> > +	put_page(page);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migration_put_pages - Put migration pages
> > + * @npages: Number of pages
> > + * @migrate_pfn: Array of migrate page frame numbers
> > + *
> > + * This function puts an array of pages.
> > + */
> > +static void drm_gpusvm_migration_put_pages(unsigned long npages,
> > +					   unsigned long *migrate_pfn)
> > +{
> > +	unsigned long i;
> > +
> > +	for (i = 0; i < npages; ++i) {
> > +		if (!migrate_pfn[i])
> > +			continue;
> > +
> > +		drm_gpusvm_migration_put_page(migrate_pfn_to_page(migrate_pfn[i]));
> > +		migrate_pfn[i] = 0;
> > +	}
> > +}
> > +
> > +/**
> > + * drm_gpusvm_get_vram_page - Get a reference to a VRAM page
> > + * @page: Pointer to the page
> > + * @zdd: Pointer to the GPU SVM zone device data
> > + *
> > + * This function associates the given page with the specified GPU SVM zone
> > + * device data and initializes it for zone device usage.
> > + */
> > +static void drm_gpusvm_get_vram_page(struct page *page,
> > +				     struct drm_gpusvm_zdd *zdd)
> > +{
> > +	page->zone_device_data = drm_gpusvm_zdd_get(zdd);
> > +	zone_device_page_init(page);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_map_pages() - Map migration pages for GPU SVM migration
> > + * @dev: The device for which the pages are being mapped
> > + * @dma_addr: Array to store DMA addresses corresponding to mapped pages
> > + * @migrate_pfn: Array of migrate page frame numbers to map
> > + * @npages: Number of pages to map
> > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > + *
> > + * This function maps pages of memory for migration usage in GPU SVM. It
> > + * iterates over each page frame number provided in @migrate_pfn, maps the
> > + * corresponding page, and stores the DMA address in the provided @dma_addr
> > + * array.
> > + *
> > + * Return: 0 on success, -EFAULT if an error occurs during mapping.
> > + */
> > +static int drm_gpusvm_migrate_map_pages(struct device *dev,
> > +					dma_addr_t *dma_addr,
> > +					long unsigned int *migrate_pfn,
> > +					unsigned long npages,
> > +					enum dma_data_direction dir)
> > +{
> > +	unsigned long i;
> > +
> > +	for (i = 0; i < npages; ++i) {
> > +		struct page *page = migrate_pfn_to_page(migrate_pfn[i]);
> > +
> > +		if (!page)
> > +			continue;
> > +
> > +		if (WARN_ON_ONCE(is_zone_device_page(page)))
> > +			return -EFAULT;
> > +
> > +		dma_addr[i] = dma_map_page(dev, page, 0, PAGE_SIZE, dir);
> > +		if (dma_mapping_error(dev, dma_addr[i]))
> > +			return -EFAULT;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_unmap_pages() - Unmap pages previously mapped for GPU SVM migration
> > + * @dev: The device for which the pages were mapped
> > + * @dma_addr: Array of DMA addresses corresponding to mapped pages
> > + * @npages: Number of pages to unmap
> > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > + *
> > + * This function unmaps previously mapped pages of memory for GPU Shared Virtual
> > + * Memory (SVM). It iterates over each DMA address provided in @dma_addr, checks
> > + * if it's valid and not already unmapped, and unmaps the corresponding page.
> > + */
> > +static void drm_gpusvm_migrate_unmap_pages(struct device *dev,
> > +					   dma_addr_t *dma_addr,
> > +					   unsigned long npages,
> > +					   enum dma_data_direction dir)
> > +{
> > +	unsigned long i;
> > +
> > +	for (i = 0; i < npages; ++i) {
> > +		if (!dma_addr[i] || dma_mapping_error(dev, dma_addr[i]))
> > +			continue;
> > +
> > +		dma_unmap_page(dev, dma_addr[i], PAGE_SIZE, dir);
> > +	}
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_to_vram - Migrate GPU SVM range to VRAM
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + * @vram_allocation: Driver-private pointer to the VRAM allocation. The caller
> > + *                   should hold a reference to the VRAM allocation, which
> > + *                   should be dropped via ops->vram_release or upon the
> > + *                   failure of this function.
> > + * @ctx: GPU SVM context
> > + *
> > + * This function migrates the specified GPU SVM range to VRAM. It performs the
> > + * necessary setup and invokes the driver-specific operations for migration to
> > + * VRAM. Upon successful return, @vram_allocation can safely reference @range
> > + * until ops->vram_release is called; ops->vram_release is only called upon
> > + * successful return of this function.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> > +			       struct drm_gpusvm_range *range,
> > +			       void *vram_allocation,
> > +			       const struct drm_gpusvm_ctx *ctx)
> > +{
> > +	u64 start = range->va.start, end = range->va.end;
> > +	struct migrate_vma migrate = {
> > +		.start		= start,
> > +		.end		= end,
> > +		.pgmap_owner	= gpusvm->device_private_page_owner,
> > +		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
> > +	};
> > +	struct mm_struct *mm = gpusvm->mm;
> > +	unsigned long i, npages = npages_in_range(start, end);
> > +	struct vm_area_struct *vas;
> > +	struct drm_gpusvm_zdd *zdd = NULL;
> > +	struct page **pages;
> > +	dma_addr_t *dma_addr;
> > +	void *buf;
> > +	int err;
> > +
> > +	if (!range->flags.migrate_vram)
> > +		return -EINVAL;
> > +
> > +	if (!gpusvm->ops->populate_vram_pfn || !gpusvm->ops->copy_to_vram ||
> > +	    !gpusvm->ops->copy_to_sram)
> > +		return -EOPNOTSUPP;
> > +
> > +	if (!ctx->mmap_locked) {
> > +		if (!mmget_not_zero(mm)) {
> > +			err = -EFAULT;
> > +			goto err_out;
> > +		}
> > +		mmap_write_lock(mm);
> > +	}
> > +
> > +	mmap_assert_locked(mm);
> > +
> > +	vas = vma_lookup(mm, start);
> > +	if (!vas) {
> > +		err = -ENOENT;
> > +		goto err_mmunlock;
> > +	}
> > +
> > +	if (end > vas->vm_end || start < vas->vm_start) {
> > +		err = -EINVAL;
> > +		goto err_mmunlock;
> > +	}
> > +
> > +	if (!vma_is_anonymous(vas)) {
> > +		err = -EBUSY;
> > +		goto err_mmunlock;
> > +	}
> > +
> > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) + sizeof(*dma_addr) +
> > +		       sizeof(*pages), GFP_KERNEL);
> > +	if (!buf) {
> > +		err = -ENOMEM;
> > +		goto err_mmunlock;
> > +	}
> > +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr)) * npages;
> > +
> > +	zdd = drm_gpusvm_zdd_alloc(range);
> > +	if (!zdd) {
> > +		err = -ENOMEM;
> > +		goto err_free;
> > +	}
> > +
> > +	migrate.vma = vas;
> > +	migrate.src = buf;
> > +	migrate.dst = migrate.src + npages;
> > +
> > +	err = migrate_vma_setup(&migrate);
> > +	if (err)
> > +		goto err_free;
> > +
> > +	/*
> > +	 * FIXME: Below cases, !migrate.cpages and migrate.cpages != npages, not
> > +	 * always an error. Need to revisit possible cases and how to handle. We
> > +	 * could prefault on migrate.cpages != npages via hmm_range_fault.

This is a bit stale, can update this comment.

> > +	 */
> 
> Yeah I think especially under contention partial migrations, at least back
> to sram due to cpu faults, are pretty much expected. And you need to cope
> somehow.
> 

I have seen these pop if the IGT calls mlock on the memory. My thinking
is that migration to VRAM is basically optional: if an error occurs we
fall back to leaving the range in SRAM rather than doing a partial
migration. This is what currently happens, so it is coped with.

If the memory is marked as must-be-in-VRAM (NIY), then the user program
has done something wrong and we can kill the app (akin to a segfault).
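
Roughly the fault-handler flow I have in mind is below. Just a sketch:
driver_range_must_be_vram() is a hypothetical stand-in for the NIY
attribute, and driver_migration_policy() is the same placeholder used in
the kernel-doc example:

	static int driver_fault_backing(struct drm_gpusvm *gpusvm,
					struct drm_gpusvm_range *range,
					void *vram_allocation,
					const struct drm_gpusvm_ctx *ctx)
	{
		int err;

		if (driver_migration_policy(range)) {
			err = drm_gpusvm_migrate_to_vram(gpusvm, range,
							 vram_allocation, ctx);
			/* Hypothetical NIY attribute: only then is failure fatal */
			if (err && driver_range_must_be_vram(range))
				return err;	/* user error, akin to segfault */
			/* Otherwise ignore err, the range stays in SRAM */
		}

		return drm_gpusvm_range_get_pages(gpusvm, range, ctx);
	}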

> > +
> > +	if (!migrate.cpages) {
> > +		err = -EFAULT;
> > +		goto err_free;
> > +	}
> > +
> > +	if (migrate.cpages != npages) {
> > +		err = -EBUSY;
> > +		goto err_finalize;
> > +	}
> > +
> > +	err = gpusvm->ops->populate_vram_pfn(gpusvm, vram_allocation, npages,
> > +					     migrate.dst);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev, dma_addr,
> > +					   migrate.src, npages, DMA_TO_DEVICE);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	for (i = 0; i < npages; ++i) {
> > +		struct page *page = pfn_to_page(migrate.dst[i]);
> > +
> > +		pages[i] = page;
> > +		migrate.dst[i] = migrate_pfn(migrate.dst[i]);
> > +		drm_gpusvm_get_vram_page(page, zdd);
> > +	}
> > +
> > +	err = gpusvm->ops->copy_to_vram(gpusvm, pages, dma_addr, npages);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	/* Upon success bind vram allocation to range and zdd */
> > +	range->vram_allocation = vram_allocation;
> > +	WRITE_ONCE(zdd->vram_allocation, vram_allocation);	/* Owns ref */
> > +
> > +err_finalize:
> > +	if (err)
> > +		drm_gpusvm_migration_put_pages(npages, migrate.dst);
> > +	migrate_vma_pages(&migrate);
> > +	migrate_vma_finalize(&migrate);
> > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr, npages,
> > +				       DMA_TO_DEVICE);
> > +err_free:
> > +	if (zdd)
> > +		drm_gpusvm_zdd_put(zdd);
> > +	kvfree(buf);
> > +err_mmunlock:
> > +	if (!ctx->mmap_locked) {
> > +		mmap_write_unlock(mm);
> > +		mmput(mm);
> > +	}
> > +err_out:
> > +	return err;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_populate_sram_pfn - Populate SRAM PFNs for a VM area
> > + * @vas: Pointer to the VM area structure, can be NULL
> > + * @npages: Number of pages to populate
> > + * @src_mpfn: Source array of migrate PFNs
> > + * @mpfn: Array of migrate PFNs to populate
> > + * @addr: Start address for PFN allocation
> > + *
> > + * This function populates the SRAM migrate page frame numbers (PFNs) for the
> > + * specified VM area structure. It allocates and locks pages in the VM area for
> > + * SRAM usage. If @vas is non-NULL, alloc_page_vma() is used for allocation;
> > + * otherwise alloc_page() is used.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +static int drm_gpusvm_migrate_populate_sram_pfn(struct vm_area_struct *vas,
> > +						unsigned long npages,
> > +						unsigned long *src_mpfn,
> > +						unsigned long *mpfn, u64 addr)
> > +{
> > +	unsigned long i;
> > +
> > +	for (i = 0; i < npages; ++i, addr += PAGE_SIZE) {
> > +		struct page *page;
> > +
> > +		if (!(src_mpfn[i] & MIGRATE_PFN_MIGRATE))
> > +			continue;
> > +
> > +		if (vas)
> > +			page = alloc_page_vma(GFP_HIGHUSER, vas, addr);
> > +		else
> > +			page = alloc_page(GFP_HIGHUSER);
> > +
> > +		if (!page)
> > +			return -ENOMEM;
> > +
> > +		lock_page(page);
> > +		mpfn[i] = migrate_pfn(page_to_pfn(page));
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_evict_to_sram - Evict GPU SVM range to SRAM
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + *
> > + * Similar to __drm_gpusvm_migrate_to_sram but does not require the mmap lock;
> > + * migration is done via the migrate_device_* functions. This is a fallback
> > + * path, as it is preferred to issue migrations with the mmap lock held.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +static int drm_gpusvm_evict_to_sram(struct drm_gpusvm *gpusvm,
> > +				    struct drm_gpusvm_range *range)
> > +{
> > +	unsigned long npages;
> > +	struct page **pages;
> > +	unsigned long *src, *dst;
> > +	dma_addr_t *dma_addr;
> > +	void *buf;
> > +	int i, err = 0;
> > +
> > +	npages = npages_in_range(range->va.start, range->va.end);
> > +
> > +	buf = kvcalloc(npages, 2 * sizeof(*src) + sizeof(*dma_addr) +
> > +		       sizeof(*pages), GFP_KERNEL);
> > +	if (!buf) {
> > +		err = -ENOMEM;
> > +		goto err_out;
> > +	}
> > +	src = buf;
> > +	dst = buf + (sizeof(*src) * npages);
> > +	dma_addr = buf + (2 * sizeof(*src) * npages);
> > +	pages = buf + (2 * sizeof(*src) + sizeof(*dma_addr)) * npages;
> > +
> > +	err = gpusvm->ops->populate_vram_pfn(gpusvm, range->vram_allocation,
> > +					     npages, src);
> > +	if (err)
> > +		goto err_free;
> > +
> > +	err = migrate_device_vma_range(gpusvm->mm,
> > +				       gpusvm->device_private_page_owner, src,
> > +				       npages, range->va.start);
> > +	if (err)
> > +		goto err_free;
> > +
> > +	err = drm_gpusvm_migrate_populate_sram_pfn(NULL, npages, src, dst, 0);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev, dma_addr,
> > +					   dst, npages, DMA_BIDIRECTIONAL);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	for (i = 0; i < npages; ++i)
> > +		pages[i] = migrate_pfn_to_page(src[i]);
> > +
> > +	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr, npages);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +err_finalize:
> > +	if (err)
> > +		drm_gpusvm_migration_put_pages(npages, dst);
> > +	migrate_device_pages(src, dst, npages);
> > +	migrate_device_finalize(src, dst, npages);
> > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr, npages,
> > +				       DMA_BIDIRECTIONAL);
> > +err_free:
> > +	kvfree(buf);
> > +err_out:
> > +
> > +	return err;
> > +}
> > +
> > +/**
> > + * __drm_gpusvm_migrate_to_sram - Migrate GPU SVM range to SRAM (internal)
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @vas: Pointer to the VM area structure
> > + * @page: Pointer to the page for fault handling (can be NULL)
> > + * @start: Start address of the migration range
> > + * @end: End address of the migration range
> > + *
> > + * This internal function performs the migration of the specified GPU SVM range
> > + * to SRAM. It sets up the migration, populates + dma maps SRAM PFNs, and
> > + * invokes the driver-specific operations for migration to SRAM.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +static int __drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > +					struct vm_area_struct *vas,
> > +					struct page *page,
> > +					u64 start, u64 end)
> > +{
> > +	struct migrate_vma migrate = {
> > +		.vma		= vas,
> > +		.pgmap_owner	= gpusvm->device_private_page_owner,
> > +		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> > +		.fault_page	= page,
> > +	};
> > +	unsigned long npages;
> > +	struct page **pages;
> > +	dma_addr_t *dma_addr;
> > +	void *buf;
> > +	int i, err = 0;
> > +
> > +	mmap_assert_locked(gpusvm->mm);
> 
> That's the wrong mm, at least for the ->migrate_to_ram path. You might be
> called on a anon mapping from a child process. That also means that the
> vma you're looking at might have no relationship with anythign you're
> tracking in your gpusvm.
>

Hmm, as discussed [3] I haven't added tests with child processes yet.
Let me do that and update the design as needed. This likely isn't
correct as you say.

[3] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1111169 
 
> > +
> > +	/* Corner where VMA area struct has been partially unmapped */
> > +	if (start < vas->vm_start)
> > +		start = vas->vm_start;
> > +	if (end > vas->vm_end)
> > +		end = vas->vm_end;
> > +
> > +	migrate.start = start;
> > +	migrate.end = end;
> > +	npages = npages_in_range(start, end);
> > +
> > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) + sizeof(*dma_addr) +
> > +		       sizeof(*pages), GFP_KERNEL);
> > +	if (!buf) {
> > +		err = -ENOMEM;
> > +		goto err_out;
> > +	}
> > +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr)) * npages;
> > +
> > +	migrate.vma = vas;
> > +	migrate.src = buf;
> > +	migrate.dst = migrate.src + npages;
> > +
> > +	err = migrate_vma_setup(&migrate);
> > +	if (err)
> > +		goto err_free;
> > +
> > +	/* Raced with another CPU fault, nothing to do */
> > +	if (!migrate.cpages)
> > +		goto err_free;
> > +
> > +	err = drm_gpusvm_migrate_populate_sram_pfn(vas, npages,
> > +						   migrate.src, migrate.dst,
> > +						   start);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev, dma_addr,
> > +					   migrate.dst, npages,
> > +					   DMA_BIDIRECTIONAL);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	for (i = 0; i < npages; ++i)
> > +		pages[i] = migrate_pfn_to_page(migrate.src[i]);
> > +
> > +	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr, npages);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +err_finalize:
> > +	if (err)
> > +		drm_gpusvm_migration_put_pages(npages, migrate.dst);
> > +	migrate_vma_pages(&migrate);
> > +	migrate_vma_finalize(&migrate);
> > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr, npages,
> > +				       DMA_BIDIRECTIONAL);
> > +err_free:
> > +	kvfree(buf);
> > +err_out:
> > +	mmap_assert_locked(gpusvm->mm);
> > +
> > +	return err;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_to_sram - Migrate (evict) GPU SVM range to SRAM
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + * @ctx: GPU SVM context
> > + *
> > + * This function initiates the migration of the specified GPU SVM range to
> > + * SRAM. It performs necessary checks and invokes the internal migration
> > + * function for actual migration.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > +			       struct drm_gpusvm_range *range,
> > +			       const struct drm_gpusvm_ctx *ctx)
> > +{
> > +	u64 start = range->va.start, end = range->va.end;
> > +	struct mm_struct *mm = gpusvm->mm;
> > +	struct vm_area_struct *vas;
> > +	int err;
> > +	bool retry = false;
> > +
> > +	if (!ctx->mmap_locked) {
> > +		if (!mmget_not_zero(mm)) {
> > +			err = -EFAULT;
> > +			goto err_out;
> > +		}
> > +		if (ctx->trylock_mmap) {
> > +			if (!mmap_read_trylock(mm))  {
> > +				err = drm_gpusvm_evict_to_sram(gpusvm, range);
> > +				goto err_mmput;
> > +			}
> > +		} else {
> > +			mmap_read_lock(mm);
> > +		}
> > +	}
> > +
> > +	mmap_assert_locked(mm);
> > +
> > +	/*
> > +	 * Loop required to find all VMA area structs for the corner case when
> > +	 * VRAM backing has been partially unmapped from MM's address space.
> > +	 */
> > +again:
> > +	vas = find_vma(mm, start);
> > +	if (!vas) {
> > +		if (!retry)
> > +			err = -ENOENT;
> > +		goto err_mmunlock;
> > +	}
> > +
> > +	if (end <= vas->vm_start || start >= vas->vm_end) {
> > +		if (!retry)
> > +			err = -EINVAL;
> > +		goto err_mmunlock;
> > +	}
> > +
> > +	err = __drm_gpusvm_migrate_to_sram(gpusvm, vas, NULL, start, end);
> > +	if (err)
> > +		goto err_mmunlock;
> > +
> > +	if (vas->vm_end < end) {
> > +		retry = true;
> > +		start = vas->vm_end;
> > +		goto again;
> > +	}
> > +
> > +	if (!ctx->mmap_locked) {
> > +		mmap_read_unlock(mm);
> > +		/*
> > +		 * Using mmput_async as this function can be called while
> > +		 * holding a dma-resv lock, and a final put can grab the mmap
> > +		 * lock, causing a lock inversion.
> > +		 */
> > +		mmput_async(mm);
> > +	}
> > +
> > +	return 0;
> > +
> > +err_mmunlock:
> > +	if (!ctx->mmap_locked)
> > +		mmap_read_unlock(mm);
> > +err_mmput:
> > +	if (!ctx->mmap_locked)
> > +		mmput_async(mm);
> > +err_out:
> > +	return err;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_page_free - Put GPU SVM zone device data associated with a page
> > + * @page: Pointer to the page
> > + *
> > + * This function is a callback used to put the GPU SVM zone device data
> > + * associated with a page when it is being released.
> > + */
> > +static void drm_gpusvm_page_free(struct page *page)
> > +{
> > +	drm_gpusvm_zdd_put(page->zone_device_data);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_to_ram - Migrate GPU SVM range to RAM (page fault handler)
> > + * @vmf: Pointer to the fault information structure
> > + *
> > + * This function is a page fault handler used to migrate a GPU SVM range to RAM.
> > + * It retrieves the GPU SVM range information from the faulting page and invokes
> > + * the internal migration function to migrate the range back to RAM.
> > + *
> > + * Returns:
> > + * VM_FAULT_SIGBUS on failure, 0 on success.
> > + */
> > +static vm_fault_t drm_gpusvm_migrate_to_ram(struct vm_fault *vmf)
> > +{
> > +	struct drm_gpusvm_zdd *zdd = vmf->page->zone_device_data;
> > +	int err;
> > +
> > +	err = __drm_gpusvm_migrate_to_sram(zdd->range->gpusvm,
> 
> So I think zdd->range doesn't work, because even within a single mm the
> vma mapping a given piece of anon memory does not need to be unique, you
> can duplicate them with mremap.
> 

This is attached to a page, not a VMA. Both AMD and Nvidia drivers use a
similar lookup mechanism.

> So all you have here is the physical memory and the vma, which might or
> might not be from the same process as gpusvm->mm.
> 
> Also the child process scenario means you using mmap_write on the fault
> side doesn't stop all cpu faults migrating stuff back.
> 
> Somewhat aside, but I think that means amdkfd's svm_range->migration_mutex
> is busted, because it's va based and so misses concurrently ongoing
> different mappings moving physical storage around underneath.
>

I think all of the above falls into the fork() + child process issues
which you have raised. Until I test this out I can't speak to it with
any level of confidence, so I won't. Thanks for raising this issue; let
me write test cases as discussed and educate myself. Once I do that, we
can engage in further discussions.

Matt

> 
> Cheers, Sima
> 
> > +					   vmf->vma, vmf->page,
> > +					   zdd->range->va.start,
> > +					   zdd->range->va.end);
> > +
> > +	return err ? VM_FAULT_SIGBUS : 0;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_pagemap_ops - Device page map operations for GPU SVM
> > + */
> > +static const struct dev_pagemap_ops drm_gpusvm_pagemap_ops = {
> > +	.page_free = drm_gpusvm_page_free,
> > +	.migrate_to_ram = drm_gpusvm_migrate_to_ram,
> > +};
> > +
> > +/**
> > + * drm_gpusvm_pagemap_ops_get - Retrieve GPU SVM device page map operations
> > + *
> > + * Returns:
> > + * Pointer to the GPU SVM device page map operations structure.
> > + */
> > +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void)
> > +{
> > +	return &drm_gpusvm_pagemap_ops;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_has_mapping - Check if GPU SVM has mapping for the given address range
> > + * @gpusvm: Pointer to the GPU SVM structure.
> > + * @start: Start address
> > + * @end: End address
> > + *
> > + * Returns:
> > + * True if GPU SVM has mapping, False otherwise
> > + */
> > +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start, u64 end)
> > +{
> > +	struct drm_gpusvm_notifier *notifier;
> > +
> > +	drm_gpusvm_for_each_notifier(notifier, gpusvm, start, end) {
> > +		struct drm_gpusvm_range *range = NULL;
> > +
> > +		drm_gpusvm_for_each_range(range, notifier, start, end)
> > +			return true;
> > +	}
> > +
> > +	return false;
> > +}
> > diff --git a/drivers/gpu/drm/xe/drm_gpusvm.h b/drivers/gpu/drm/xe/drm_gpusvm.h
> > new file mode 100644
> > index 000000000000..0ea70f8534a8
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/drm_gpusvm.h
> > @@ -0,0 +1,415 @@
> > +/* SPDX-License-Identifier: MIT */
> > +/*
> > + * Copyright © 2024 Intel Corporation
> > + */
> > +
> > +#ifndef __DRM_GPUSVM_H__
> > +#define __DRM_GPUSVM_H__
> > +
> > +#include <linux/kref.h>
> > +#include <linux/mmu_notifier.h>
> > +#include <linux/workqueue.h>
> > +
> > +struct dev_pagemap_ops;
> > +struct drm_device;
> > +struct drm_gpusvm;
> > +struct drm_gpusvm_notifier;
> > +struct drm_gpusvm_ops;
> > +struct drm_gpusvm_range;
> > +
> > +/**
> > + * struct drm_gpusvm_ops - Operations structure for GPU SVM
> > + *
> > + * This structure defines the operations for GPU Shared Virtual Memory (SVM).
> > + * These operations are provided by the GPU driver to manage SVM ranges and
> > + * perform operations such as migration between VRAM and system RAM.
> > + */
> > +struct drm_gpusvm_ops {
> > +	/**
> > +	 * @notifier_alloc: Allocate a GPU SVM notifier (optional)
> > +	 *
> > +	 * This function shall allocate a GPU SVM notifier.
> > +	 *
> > +	 * Returns:
> > +	 * Pointer to the allocated GPU SVM notifier on success, NULL on failure.
> > +	 */
> > +	struct drm_gpusvm_notifier *(*notifier_alloc)(void);
> > +
> > +	/**
> > +	 * @notifier_free: Free a GPU SVM notifier (optional)
> > +	 * @notifier: Pointer to the GPU SVM notifier to be freed
> > +	 *
> > +	 * This function shall free a GPU SVM notifier.
> > +	 */
> > +	void (*notifier_free)(struct drm_gpusvm_notifier *notifier);
> > +
> > +	/**
> > +	 * @range_alloc: Allocate a GPU SVM range (optional)
> > +	 * @gpusvm: Pointer to the GPU SVM
> > +	 *
> > +	 * This function shall allocate a GPU SVM range.
> > +	 *
> > +	 * Returns:
> > +	 * Pointer to the allocated GPU SVM range on success, NULL on failure.
> > +	 */
> > +	struct drm_gpusvm_range *(*range_alloc)(struct drm_gpusvm *gpusvm);
> > +
> > +	/**
> > +	 * @range_free: Free a GPU SVM range (optional)
> > +	 * @range: Pointer to the GPU SVM range to be freed
> > +	 *
> > +	 * This function shall free a GPU SVM range.
> > +	 */
> > +	void (*range_free)(struct drm_gpusvm_range *range);
> > +
> > +	/**
> > +	 * @vram_release: Release VRAM allocation (optional)
> > +	 * @vram_allocation: Driver-private pointer to the VRAM allocation
> > +	 *
> > +	 * This function shall release VRAM allocation and expects to drop a
> > +	 * reference to VRAM allocation.
> > +	 */
> > +	void (*vram_release)(void *vram_allocation);
> > +
> > +	/**
> > +	 * @populate_vram_pfn: Populate VRAM PFN (required for migration)
> > +	 * @gpusvm: Pointer to the GPU SVM
> > +	 * @vram_allocation: Driver-private pointer to the VRAM allocation
> > +	 * @npages: Number of pages to populate
> > +	 * @pfn: Array of page frame numbers to populate
> > +	 *
> > +	 * This function shall populate VRAM page frame numbers (PFN).
> > +	 *
> > +	 * Returns:
> > +	 * 0 on success, a negative error code on failure.
> > +	 */
> > +	int (*populate_vram_pfn)(struct drm_gpusvm *gpusvm,
> > +				 void *vram_allocation,
> > +				 unsigned long npages,
> > +				 unsigned long *pfn);
> > +
> > +	/**
> > +	 * @copy_to_vram: Copy to VRAM (required for migration)
> > +	 * @gpusvm: Pointer to the GPU SVM
> > +	 * @pages: Pointer to array of VRAM pages (destination)
> > +	 * @dma_addr: Pointer to array of DMA addresses (source)
> > +	 * @npages: Number of pages to copy
> > +	 *
> > +	 * This function shall copy pages to VRAM.
> > +	 *
> > +	 * Returns:
> > +	 * 0 on success, a negative error code on failure.
> > +	 */
> > +	int (*copy_to_vram)(struct drm_gpusvm *gpusvm,
> > +			    struct page **pages,
> > +			    dma_addr_t *dma_addr,
> > +			    unsigned long npages);
> > +
> > +	/**
> > +	 * @copy_to_sram: Copy to system RAM (required for migration)
> > +	 * @gpusvm: Pointer to the GPU SVM
> > +	 * @pages: Pointer to array of VRAM pages (source)
> > +	 * @dma_addr: Pointer to array of DMA addresses (destination)
> > +	 * @npages: Number of pages to copy
> > +	 *
> > +	 * This function shall copy pages to system RAM.
> > +	 *
> > +	 * Returns:
> > +	 * 0 on success, a negative error code on failure.
> > +	 */
> > +	int (*copy_to_sram)(struct drm_gpusvm *gpusvm,
> > +			    struct page **pages,
> > +			    dma_addr_t *dma_addr,
> > +			    unsigned long npages);
> > +
> > +	/**
> > +	 * @invalidate: Invalidate GPU SVM notifier (required)
> > +	 * @gpusvm: Pointer to the GPU SVM
> > +	 * @notifier: Pointer to the GPU SVM notifier
> > +	 * @mmu_range: Pointer to the mmu_notifier_range structure
> > +	 *
> > +	 * This function shall invalidate the GPU page tables. It can safely
> > +	 * walk the notifier range RB tree/list in this function. Called while
> > +	 * holding the notifier lock.
> > +	 */
> > +	void (*invalidate)(struct drm_gpusvm *gpusvm,
> > +			   struct drm_gpusvm_notifier *notifier,
> > +			   const struct mmu_notifier_range *mmu_range);
> > +};
> > +
> > +/**
> > + * struct drm_gpusvm_notifier - Structure representing a GPU SVM notifier
> > + *
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: MMU interval notifier
> > + * @interval: Interval for the notifier
> > + * @rb: Red-black tree node for the parent GPU SVM structure notifier tree
> > + * @root: Cached root node of the RB tree containing ranges
> > + * @range_list: List head of ranges in the same order they appear in the
> > + *              interval tree. This is useful to keep iterating ranges while
> > + *              doing modifications to the RB tree.
> > + * @flags.removed: Flag indicating whether the MMU interval notifier has been
> > + *                 removed
> > + *
> > + * This structure represents a GPU SVM notifier.
> > + */
> > +struct drm_gpusvm_notifier {
> > +	struct drm_gpusvm *gpusvm;
> > +	struct mmu_interval_notifier notifier;
> > +	struct {
> > +		u64 start;
> > +		u64 end;
> > +	} interval;
> > +	struct {
> > +		struct rb_node node;
> > +		struct list_head entry;
> > +		u64 __subtree_last;
> > +	} rb;
> > +	struct rb_root_cached root;
> > +	struct list_head range_list;
> > +	struct {
> > +		u32 removed : 1;
> > +	} flags;
> > +};
> > +
> > +/**
> > + * struct drm_gpusvm_range - Structure representing a GPU SVM range
> > + *
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier
> > + * @refcount: Reference count for the range
> > + * @rb: Red-black tree node for the parent GPU SVM notifier structure range tree
> > + * @va: Virtual address range
> > + * @notifier_seq: Notifier sequence number of the range's pages
> > + * @pages: Pointer to the array of pages (if backing store is in VRAM)
> > + * @dma_addr: DMA address array (if backing store is SRAM and DMA mapped)
> > + * @vram_allocation: Driver-private pointer to the VRAM allocation
> > + * @order: Order of dma mapping. i.e. PAGE_SIZE << order is mapping size
> > + * @flags.migrate_vram: Flag indicating whether the range can be migrated to VRAM
> > + * @flags.unmapped: Flag indicating if the range has been unmapped
> > + * @flags.partial_unmap: Flag indicating if the range has been partially unmapped
> > + * @flags.has_vram_pages: Flag indicating if the range has vram pages
> > + * @flags.has_dma_mapping: Flag indicating if the range has a DMA mapping
> > + * @flags.kfree_mapping: Flag indicating @dma_addr is a compact allocation based
> > + *                       on @order which releases via kfree
> > + *
> > + * This structure represents a GPU SVM range used for tracking memory ranges
> > + * mapped in a DRM device.
> > + */
> > +struct drm_gpusvm_range {
> > +	struct drm_gpusvm *gpusvm;
> > +	struct drm_gpusvm_notifier *notifier;
> > +	struct kref refcount;
> > +	struct {
> > +		struct rb_node node;
> > +		struct list_head entry;
> > +		u64 __subtree_last;
> > +	} rb;
> > +	struct {
> > +		u64 start;
> > +		u64 end;
> > +	} va;
> > +	unsigned long notifier_seq;
> > +	union {
> > +		struct page **pages;
> > +		dma_addr_t *dma_addr;
> > +	};
> > +	void *vram_allocation;
> > +	u16 order;
> > +	struct {
> > +		/* All flags below must be set upon creation */
> > +		u16 migrate_vram : 1;
> > +		/* All flags below must be set / cleared under notifier lock */
> > +		u16 unmapped : 1;
> > +		u16 partial_unmap : 1;
> > +		u16 has_vram_pages : 1;
> > +		u16 has_dma_mapping : 1;
> > +		u16 kfree_mapping : 1;
> > +	} flags;
> > +};
> > +
> > +/**
> > + * struct drm_gpusvm - GPU SVM structure
> > + *
> > + * @name: Name of the GPU SVM
> > + * @drm: Pointer to the DRM device structure
> > + * @mm: Pointer to the mm_struct for the address space
> > + * @device_private_page_owner: Device private pages owner
> > + * @mm_start: Start address of GPU SVM
> > + * @mm_range: Range of the GPU SVM
> > + * @notifier_size: Size of individual notifiers
> > + * @ops: Pointer to the operations structure for GPU SVM
> > + * @chunk_sizes: Pointer to the array of chunk sizes used in range allocation.
> > + *               Entries should be powers of 2 in descending order.
> > + * @num_chunks: Number of chunks
> > + * @notifier_lock: Read-write semaphore for protecting notifier operations
> > + * @zdd_wq: Workqueue for deferred work on zdd destruction
> > + * @root: Cached root node of the Red-Black tree containing GPU SVM notifiers
> > + * @notifier_list: List head of notifiers in the same order they appear in
> > + *                 the interval tree. This is useful to keep iterating
> > + *                 notifiers while doing modifications to the RB tree.
> > + *
> > + * This structure represents a GPU SVM (Shared Virtual Memory) used for tracking
> > + * memory ranges mapped in a DRM (Direct Rendering Manager) device.
> > + *
> > + * No reference counting is provided, as this is expected to be embedded in the
> > + * driver VM structure along with the struct drm_gpuvm, which handles reference
> > + * counting.
> > + */
> > +struct drm_gpusvm {
> > +	const char *name;
> > +	struct drm_device *drm;
> > +	struct mm_struct *mm;
> > +	void *device_private_page_owner;
> > +	u64 mm_start;
> > +	u64 mm_range;
> > +	u64 notifier_size;
> > +	const struct drm_gpusvm_ops *ops;
> > +	const u64 *chunk_sizes;
> > +	int num_chunks;
> > +	struct rw_semaphore notifier_lock;
> > +	struct workqueue_struct *zdd_wq;
> > +	struct rb_root_cached root;
> > +	struct list_head notifier_list;
> > +};
> > +
> > +/**
> > + * struct drm_gpusvm_ctx - DRM GPU SVM context
> > + *
> > + * @mmap_locked: mmap lock is locked
> > + * @trylock_mmap: trylock mmap lock, used to avoid locking inversions
> > + *                (e.g. dma-resv -> mmap lock)
> > + * @in_notifier: entering from a MMU notifier
> > + * @read_only: operating on read-only memory
> > + * @vram_possible: possible to use VRAM
> > + * @prefault: prefault pages
> > + *
> > + * Context in which DRM GPU SVM is operating (i.e. user arguments).
> > + */
> > +struct drm_gpusvm_ctx {
> > +	u32 mmap_locked :1;
> > +	u32 trylock_mmap :1;
> > +	u32 in_notifier :1;
> > +	u32 read_only :1;
> > +	u32 vram_possible :1;
> > +	u32 prefault :1;
> > +};
> > +
> > +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> > +		    const char *name, struct drm_device *drm,
> > +		    struct mm_struct *mm, void *device_private_page_owner,
> > +		    u64 mm_start, u64 mm_range, u64 notifier_size,
> > +		    const struct drm_gpusvm_ops *ops,
> > +		    const u64 *chunk_sizes, int num_chunks);
> > +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm);
> > +void drm_gpusvm_free(struct drm_gpusvm *gpusvm);
> > +
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64 fault_addr,
> > +				u64 gpuva_start, u64 gpuva_end,
> > +				const struct drm_gpusvm_ctx *ctx);
> > +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> > +			     struct drm_gpusvm_range *range);
> > +
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_get(struct drm_gpusvm_range *range);
> > +void drm_gpusvm_range_put(struct drm_gpusvm_range *range);
> > +
> > +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> > +				  struct drm_gpusvm_range *range);
> > +
> > +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> > +			       struct drm_gpusvm_range *range,
> > +			       const struct drm_gpusvm_ctx *ctx);
> > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> > +				  struct drm_gpusvm_range *range,
> > +				  const struct drm_gpusvm_ctx *ctx);
> > +
> > +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> > +			       struct drm_gpusvm_range *range,
> > +			       void *vram_allocation,
> > +			       const struct drm_gpusvm_ctx *ctx);
> > +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > +			       struct drm_gpusvm_range *range,
> > +			       const struct drm_gpusvm_ctx *ctx);
> > +
> > +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void);
> > +
> > +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start, u64 end);
> > +
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64 start, u64 end);
> > +
> > +/**
> > + * drm_gpusvm_notifier_lock - Lock GPU SVM notifier
> > + * @gpusvm__: Pointer to the GPU SVM structure.
> > + *
> > + * Abstract client usage GPU SVM notifier lock, take lock
> > + */
> > +#define drm_gpusvm_notifier_lock(gpusvm__)	\
> > +	down_read(&(gpusvm__)->notifier_lock)
> > +
> > +/**
> > + * drm_gpusvm_notifier_unlock - Unlock GPU SVM notifier
> > + * @gpusvm__: Pointer to the GPU SVM structure.
> > + *
> > + * Abstract client usage GPU SVM notifier lock, drop lock
> > + */
> > +#define drm_gpusvm_notifier_unlock(gpusvm__)	\
> > +	up_read(&(gpusvm__)->notifier_lock)
> > +
> > +/**
> > + * __drm_gpusvm_range_next - Get the next GPU SVM range in the list
> > + * @range: a pointer to the current GPU SVM range
> > + *
> > + * Return: A pointer to the next drm_gpusvm_range if available, or NULL if the
> > + *         current range is the last one or if the input range is NULL.
> > + */
> > +static inline struct drm_gpusvm_range *
> > +__drm_gpusvm_range_next(struct drm_gpusvm_range *range)
> > +{
> > +	if (range && !list_is_last(&range->rb.entry,
> > +				   &range->notifier->range_list))
> > +		return list_next_entry(range, rb.entry);
> > +
> > +	return NULL;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_for_each_range - Iterate over GPU SVM ranges in a notifier
> > + * @range__: Iterator variable for the ranges. If set, it indicates the start of
> > + *	     the iterator. If NULL, call drm_gpusvm_range_find() to get the range.
> > + * @notifier__: Pointer to the GPU SVM notifier
> > + * @start__: Start address of the range
> > + * @end__: End address of the range
> > + *
> > + * This macro is used to iterate over GPU SVM ranges in a notifier. It is safe
> > + * to use while holding the driver SVM lock or the notifier lock.
> > + */
> > +#define drm_gpusvm_for_each_range(range__, notifier__, start__, end__)	\
> > +	for ((range__) = (range__) ?:					\
> > +	     drm_gpusvm_range_find((notifier__), (start__), (end__));	\
> > +	     (range__) && (range__->va.start < (end__));		\
> > +	     (range__) = __drm_gpusvm_range_next(range__))
> > +
> > +/**
> > + * drm_gpusvm_range_set_unmapped - Mark a GPU SVM range as unmapped
> > + * @range: Pointer to the GPU SVM range structure.
> > + * @mmu_range: Pointer to the MMU notifier range structure.
> > + *
> > + * This function marks a GPU SVM range as unmapped and sets the partial_unmap flag
> > + * if the range partially falls within the provided MMU notifier range.
> > + */
> > +static inline void
> > +drm_gpusvm_range_set_unmapped(struct drm_gpusvm_range *range,
> > +			      const struct mmu_notifier_range *mmu_range)
> > +{
> > +	lockdep_assert_held_write(&range->gpusvm->notifier_lock);
> > +
> > +	range->flags.unmapped = true;
> > +	if (range->va.start < mmu_range->start ||
> > +	    range->va.end > mmu_range->end)
> > +		range->flags.partial_unmap = true;
> > +}
> > +
> > +#endif /* __DRM_GPUSVM_H__ */
> > -- 
> > 2.34.1
> > 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory
  2024-08-29  9:16   ` Thomas Hellström
@ 2024-08-29 17:45     ` Matthew Brost
  2024-08-29 18:13       ` Matthew Brost
  2024-08-29 19:18       ` Thomas Hellström
  2024-08-30  1:35     ` Matthew Brost
  1 sibling, 2 replies; 100+ messages in thread
From: Matthew Brost @ 2024-08-29 17:45 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: intel-xe, dri-devel, airlied, christian.koenig, matthew.auld,
	daniel

On Thu, Aug 29, 2024 at 11:16:49AM +0200, Thomas Hellström wrote:
> Hi, Matt. 
> 
> Some initial design comments / questions:
> 
> On Tue, 2024-08-27 at 19:48 -0700, Matthew Brost wrote:
> > This patch introduces support for GPU Shared Virtual Memory (SVM) in
> > the
> > Direct Rendering Manager (DRM) subsystem. SVM allows for seamless
> > sharing of memory between the CPU and GPU, enhancing performance and
> > flexibility in GPU computing tasks.
> > 
> > The patch adds the necessary infrastructure for SVM, including data
> > structures and functions for managing SVM ranges and notifiers. It
> > also
> > provides mechanisms for allocating, deallocating, and migrating
> > memory
> > regions between system RAM and GPU VRAM.
> > 
> > This mid-layer is largely inspired by GPUVM.
> > 
> > Cc: Dave Airlie <airlied@redhat.com>
> > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > Cc: Christian König <christian.koenig@amd.com>
> > Cc: <dri-devel@lists.freedesktop.org>
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/xe/Makefile     |    3 +-
> >  drivers/gpu/drm/xe/drm_gpusvm.c | 2174
> > +++++++++++++++++++++++++++++++
> >  drivers/gpu/drm/xe/drm_gpusvm.h |  415 ++++++
> >  3 files changed, 2591 insertions(+), 1 deletion(-)
> >  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.c
> >  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.h
> > 
> > diff --git a/drivers/gpu/drm/xe/Makefile
> > b/drivers/gpu/drm/xe/Makefile
> > index b9670ae09a9e..b8fc2ee58f1a 100644
> > --- a/drivers/gpu/drm/xe/Makefile
> > +++ b/drivers/gpu/drm/xe/Makefile
> > @@ -25,7 +25,8 @@ $(obj)/generated/%_wa_oob.c
> > $(obj)/generated/%_wa_oob.h: $(obj)/xe_gen_wa_oob \
> >  
> >  # core driver code
> >  
> > -xe-y += xe_bb.o \
> > +xe-y += drm_gpusvm.o \
> > +	xe_bb.o \
> >  	xe_bo.o \
> >  	xe_bo_evict.o \
> >  	xe_devcoredump.o \
> > diff --git a/drivers/gpu/drm/xe/drm_gpusvm.c
> > b/drivers/gpu/drm/xe/drm_gpusvm.c
> > new file mode 100644
> > index 000000000000..fc1e44e6ae72
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/drm_gpusvm.c
> > @@ -0,0 +1,2174 @@
> > +// SPDX-License-Identifier: MIT
> > +/*
> > + * Copyright © 2024 Intel Corporation
> > + *
> > + * Authors:
> > + *     Matthew Brost <matthew.brost@intel.com>
> > + */
> > +
> > +#include <linux/dma-mapping.h>
> > +#include <linux/interval_tree_generic.h>
> > +#include <linux/hmm.h>
> > +#include <linux/memremap.h>
> > +#include <linux/migrate.h>
> > +#include <linux/mm_types.h>
> > +#include <linux/pagemap.h>
> > +#include <linux/slab.h>
> > +
> > +#include <drm/drm_device.h>
> > +#include "drm_gpusvm.h"
> > +
> > +/**
> > + * DOC: Overview
> > + *
> > + * GPU Shared Virtual Memory (GPU SVM) layer for the Direct
> > Rendering Manager (DRM)
> > + *
> > + * The GPU SVM layer is a component of the DRM framework designed to
> > manage shared
> > + * virtual memory between the CPU and GPU. It enables efficient data
> > exchange and
> > + * processing for GPU-accelerated applications by allowing memory
> > sharing and
> > + * synchronization between the CPU's and GPU's virtual address
> > spaces.
> > + *
> > + * Key GPU SVM Components:
> > + * - Notifiers: Used for tracking memory intervals and
> > notifying the
> > + *		GPU of changes, notifiers are sized based on a GPU
> > SVM
> > + *		initialization parameter, with a recommendation of
> > 512M or
> > + *		larger. They maintain a Red-Black tree and a list of
> > ranges that
> > + *		fall within the notifier interval. Notifiers are
> > tracked within
> > + *		a GPU SVM Red-Black tree and list and are
> > dynamically inserted
> > + *		or removed as ranges within the interval are created
> > or
> > + *		destroyed.
> 
> What is the benefit of this extra layer compared to direct insertion of
> ranges using mmu_interval_notifier_insert?
> 
> IIRC the argument made previously about having wide notifiers was that
> the rb tree lookups inside the core were costly and if there were only
> a few, then the rb tree lookups within a notifier range could be
> replaced with the page-table radix-tree-like lookup, so each lookup
> complexity would be O(log(n_notifiers) + page_table_depth).
> 
> But now we have first an rb-tree lookup in the core and then an rb-tree
> lookup within each notifier yielding O(log(n_ranges))
> 
> I can see a small benefit in that inserting directly into the core rb-
> tree will block pending ongoing invalidations, but at a cost of an
> extra multiplexing layer.
> 

So when the notifier is triggered, the search is over a smaller range.
In a perfect world I'd eventually like to drop the SVM range completely.
There are a lot of changes required in Xe to make that possible, and I'm
not entirely convinced it is possible or that the ROI is worth it
(additional complexity vs. perf benefit). For now, this was a relatively
simple way to get SVM working (it mirrors both AMD's and Nvidia's
implementations with respect to having a range concept) but is also
flexible in the sense that the notifier size can easily be tweaked via a
modparam [1], following Jason's suggestion of larger notifiers.

[1] https://patchwork.freedesktop.org/patch/611007/?series=137870&rev=1
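
To make that concrete, the Xe call site ends up looking roughly like the
sketch below; identifiers such as xe_modparam.svm_notifier_size,
xe_svm_devm_owner() and the vm/xe fields are placeholders for whatever
[1] actually names them:

	/* Chunk sizes must be powers of 2 in descending order */
	static const u64 chunk_sizes[] = { SZ_2M, SZ_64K, SZ_4K };

	err = drm_gpusvm_init(&vm->svm.gpusvm, "Xe SVM", &xe->drm,
			      current->mm, xe_svm_devm_owner(xe),
			      0, vm->size,
			      (u64)xe_modparam.svm_notifier_size * SZ_1M,
			      &gpusvm_ops, chunk_sizes,
			      ARRAY_SIZE(chunk_sizes));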

> > + * - Ranges: Represent memory ranges mapped in a DRM device and
> > managed
> > + *	     by GPU SVM. They are sized based on an array of chunk
> > sizes, which
> > + *	     is a GPU SVM initialization parameter, and the CPU
> > address space.
> > + *	     Upon GPU fault, the largest aligned chunk that fits
> > within the
> > + *	     faulting CPU address space is chosen for the range
> > size. Ranges are
> > + *	     expected to be dynamically allocated on GPU fault and
> > removed on an
> > + *	     MMU notifier UNMAP event. As mentioned above, ranges
> > are tracked in
> > + *	     a notifier's Red-Black tree.
> 
> How do ranges and chunks map to
>  
> a) Prefaulting granularity
> b) Migration granularity?
> 
> > + * - Operations: Define the interface for driver-specific SVM
> > operations such as
> > + *		 allocation, page collection, migration,
> > invalidations, and VRAM
> > + *		 release.
> > + *
> > + * This layer provides interfaces for allocating, mapping,
> > migrating, and
> > + * releasing memory ranges between the CPU and GPU. It handles all
> > core memory
> > + * management interactions (DMA mapping, HMM, and migration) and
> > provides
> > + * driver-specific virtual functions (vfuncs). This infrastructure
> > is sufficient
> > + * to build the expected driver components for an SVM implementation
> > as detailed
> > + * below.
> > + *
> > + * Expected Driver Components:
> > + * - GPU page fault handler: Used to create ranges and notifiers
> > based on the
> > + *			     fault address, optionally migrate the
> > range to
> > + *			     VRAM, and create GPU bindings.
> > + * - Garbage collector: Used to destroy GPU bindings for ranges.
> > Ranges are
> > + *			expected to be added to the garbage
> > collector upon
> > + *			MMU_NOTIFY_UNMAP event.
> > + */
> > +
> > +/**
> > + * DOC: Locking
> > + *
> > + * GPU SVM handles locking for core MM interactions, i.e., it
> > locks/unlocks the
> > + * mmap lock as needed. Alternatively, if the driver prefers to
> > handle the mmap
> > + * lock itself, a 'locked' argument is provided to the functions
> > that require
> > + * the mmap lock. This option may be useful for drivers that need to
> > call into
> > + * GPU SVM while also holding a dma-resv lock, thus preventing
> > locking
> > + * inversions between the mmap and dma-resv locks.
> > + *
> > + * GPU SVM introduces a global notifier lock, which safeguards the
> > notifier's
> > + * range RB tree and list, as well as the range's DMA mappings and
> > sequence
> > + * number. GPU SVM manages all necessary locking and unlocking
> > operations,
> > + * except for the recheck of the range's sequence number
> > + * (mmu_interval_read_retry) when the driver is committing GPU
> > bindings. This
> > + * lock corresponds to the 'driver->update' lock mentioned in the
> > HMM
> > + * documentation (TODO: Link). Future revisions may transition from
> > a GPU SVM
> > + * global lock to a per-notifier lock if finer-grained locking is
> > deemed
> > + * necessary.
> > + *
> > + * In addition to the locking mentioned above, the driver should
> > implement a
> > + * lock to safeguard core GPU SVM function calls that modify state,
> > such as
> > + * drm_gpusvm_range_find_or_insert and drm_gpusvm_range_remove.
> > Alternatively,
> > + * these core functions can be called within a single kernel thread,
> > for
> > + * instance, using an ordered work queue. This lock is denoted as
> > + * 'driver_svm_lock' in code examples.
> > + */
> > +
> > +/**
> > + * DOC: Migration
> > + *
> > + * The migration support is quite simple, allowing migration between
> > SRAM and
> > + * VRAM at the range granularity. For example, GPU SVM currently
> > does not
> > + * support mixing SRAM and VRAM pages within a range. This means
> > that upon GPU
> > + * fault, the entire range can be migrated to VRAM, and upon CPU
> > fault, the
> > + * entire range is migrated to SRAM.
> > + *
> > + * The reasoning for only supporting range granularity is as
> > follows: it
> > + * simplifies the implementation, and range sizes are driver-defined
> > and should
> > + * be relatively small.
> > + */
> > +
> > +/**
> > + * DOC: Partial Unmapping of Ranges
> > + *
> > + * Partial unmapping of ranges (e.g., 1M out of 2M is unmapped by
> > CPU resulting
> > + * in MMU_NOTIFY_UNMAP event) presents several challenges, with the
> > main one
> > + * being that a subset of the range still has CPU and GPU mappings.
> > If the
> > + * backing store for the range is in VRAM, a subset of the backing
> > store has
> > + * references. One option would be to split the range and VRAM
> > backing store,
> > + * but the implementation for this would be quite complicated. Given
> > that
> > + * partial unmappings are rare and driver-defined range sizes are
> > relatively
> > + * small, GPU SVM does not support splitting of ranges.
> > + *
> > + * With no support for range splitting, upon partial unmapping of a
> > range, the
> > + * driver is expected to invalidate and destroy the entire range. If
> > the range
> > + * has VRAM as its backing, the driver is also expected to migrate
> > any remaining
> > + * pages back to SRAM.
> 
> So what happens if we get a one-page invalidation, say protection
> change event, or NUMA accounting event, in the middle of a range? Can
> we unmap just that single gpu pte covering that range, that is, how do
> the ranges map to invalidation granularity? Does this differ between
> igfx and dgfx?

Well, the idea of chunks is that ranges should be 1 GPU page (the chunk
array in Xe is 4k, 64k, and 2M). The design is flexible enough that this
doesn't have to be true, but it is optimized for the assumption that
each range is most likely 1 GPU page. If this isn't true, then all GPU
pages in the range are invalidated, which isn't ideal, but it keeps
things simple, and IMO that simplicity far outweighs the potential
benefits. In theory a driver could implement splitting / partial
invalidations too with a couple of updates to GPUSVM, but that would
largely be a driver implementation rather than GPUSVM.
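
Something like the sketch below is the chunk selection I'm describing:
pick the largest chunk whose aligned window around the fault address
still fits the intersection of the CPU VMA and the GPU VA range (again,
just a sketch, not the actual Xe code):

	/*
	 * Sketch: chunk_sizes is in descending order, so the first chunk
	 * whose aligned window fits [start, end) is the largest one. With
	 * 2M/64K/4K chunks most ranges end up being exactly one GPU page.
	 */
	static u64 pick_chunk_size(u64 fault_addr, u64 start, u64 end,
				   const u64 *chunk_sizes, int num_chunks)
	{
		int i;

		for (i = 0; i < num_chunks; ++i) {
			u64 chunk_start = ALIGN_DOWN(fault_addr, chunk_sizes[i]);
			u64 chunk_end = chunk_start + chunk_sizes[i];

			if (chunk_start >= start && chunk_end <= end)
				return chunk_sizes[i];
		}

		return 0;	/* no chunk fits; caller falls back / errors */
	}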

No difference between igfx and dgfx.

You bring up a good point about protection changes; I likely haven't
fully gotten that part of the implementation correct either. I can add
this to my TODO list and also update my IGTs to exercise things like
this.

Matt

> 
> Thanks,
> Thomas
> 
> 
> 
> 
> > + */
> > +
> > +/**
> > + * DOC: Examples
> > + *
> > + * This section provides two examples of how to build the expected
> > driver
> > + * components: the GPU page fault handler and the garbage collector.
> > A third
> > + * example demonstrates a sample invalidation driver vfunc.
> > + *
> > + * The generic code provided does not include logic for complex
> > migration
> > + * policies, optimized invalidations, or other potentially required
> > driver
> > + * locking (e.g., DMA-resv locks).
> > + *
> > + * 1) GPU page fault handler
> > + *
> > + *	int driver_bind_range(struct drm_gpusvm *gpusvm, struct
> > drm_gpusvm_range *range)
> > + *	{
> > + *		int err = 0;
> > + *
> > + *		driver_alloc_and_setup_memory_for_bind(gpusvm,
> > range);
> > + *
> > + *		drm_gpusvm_notifier_lock(gpusvm);
> > + *		if (drm_gpusvm_range_pages_valid(range))
> > + *			driver_commit_bind(gpusvm, range);
> > + *		else
> > + *			err = -EAGAIN;
> > + *		drm_gpusvm_notifier_unlock(gpusvm);
> > + *
> > + *		return err;
> > + *	}
> > + *
> > + *	int driver_gpu_fault(struct drm_gpusvm *gpusvm, u64
> > fault_addr,
> > + *			     u64 gpuva_start, u64 gpuva_end)
> > + *	{
> > + *		struct drm_gpusvm_ctx ctx = {};
> > + *		int err;
> > + *
> > + *		driver_svm_lock();
> > + *	retry:
> > + *		// Always process UNMAPs first so view of GPU SVM
> > ranges is current
> > + *		driver_garbage_collector(gpusvm);
> > + *
> > + *		range = drm_gpusvm_range_find_or_insert(gpusvm,
> > fault_addr,
> > + *							gpuva_start,
> > gpuva_end,
> > + *						        &ctx);
> > + *		if (IS_ERR(range)) {
> > + *			err = PTR_ERR(range);
> > + *			goto unlock;
> > + *		}
> > + *
> > + *		if (driver_migration_policy(range)) {
> > + *			bo = driver_alloc_bo();
> > + *			err = drm_gpusvm_migrate_to_vram(gpusvm,
> > range, bo, &ctx);
> > + *			if (err)	// CPU mappings may have
> > changed
> > + *				goto retry;
> > + *		}
> > + *
> > + *		err = drm_gpusvm_range_get_pages(gpusvm, range,
> > &ctx);
> > + *		if (err == -EFAULT || err == -EPERM)	// CPU
> > mappings changed
> > + *			goto retry;
> > + *		else if (err)
> > + *			goto unlock;
> > + *
> > + *		err = driver_bind_range(gpusvm, range);
> > + *		if (err == -EAGAIN)	// CPU mappings changed
> > + *			goto retry
> > + *
> > + *	unlock:
> > + *		driver_svm_unlock();
> > + *		return err;
> > + *	}
> > + *
> > + * 2) Garbage Collector.
> > + *
> > + *	void __driver_garbage_collector(struct drm_gpusvm *gpusvm,
> > + *					struct drm_gpusvm_range
> > *range)
> > + *	{
> > + *		struct drm_gpusvm_ctx ctx = {};
> > + *
> > + *		assert_driver_svm_locked(gpusvm);
> > + *
> > + *		// Partial unmap, migrate any remaining VRAM pages
> > back to SRAM
> > + *		if (range->flags.partial_unmap)
> > + *			drm_gpusvm_migrate_to_sram(gpusvm, range,
> > &ctx);
> > + *
> > + *		driver_unbind_range(range);
> > + *		drm_gpusvm_range_remove(gpusvm, range);
> > + *	}
> > + *
> > + *	void driver_garbage_collector(struct drm_gpusvm *gpusvm)
> > + *	{
> > + *		assert_driver_svm_locked(gpusvm);
> > + *
> > + *		for_each_range_in_garbage_collector(gpusvm, range)
> > + *			__driver_garbage_collector(gpusvm, range);
> > + *	}
> > + *
> > + * 3) Invalidation driver vfunc.
> > + *
> > + *	void driver_invalidation(struct drm_gpusvm *gpusvm,
> > + *				 struct drm_gpusvm_notifier
> > *notifier,
> > + *				 const struct mmu_notifier_range
> > *mmu_range)
> > + *	{
> > + *		struct drm_gpusvm_ctx ctx = { .in_notifier = true,
> > };
> > + *		struct drm_gpusvm_range *range = NULL;
> > + *
> > + *		driver_invalidate_device_tlb(gpusvm, mmu_range-
> > >start, mmu_range->end);
> > + *
> > + *		drm_gpusvm_for_each_range(range, notifier,
> > mmu_range->start,
> > + *					  mmu_range->end) {
> > + *			drm_gpusvm_range_unmap_pages(gpusvm, range,
> > &ctx);
> > + *
> > + *			if (mmu_range->event != MMU_NOTIFY_UNMAP)
> > + *				continue;
> > + *
> > + *			drm_gpusvm_range_set_unmapped(range,
> > mmu_range);
> > + *			driver_garbage_collector_add(gpusvm, range);
> > + *		}
> > + *	}
> > + */
> > +
> > +#define DRM_GPUSVM_RANGE_START(_range)	((_range)->va.start)
> > +#define DRM_GPUSVM_RANGE_END(_range)	((_range)->va.end - 1)
> > +INTERVAL_TREE_DEFINE(struct drm_gpusvm_range, rb.node, u64,
> > rb.__subtree_last,
> > +		     DRM_GPUSVM_RANGE_START, DRM_GPUSVM_RANGE_END,
> > +		     static __maybe_unused, range);
> > +
> > +#define DRM_GPUSVM_NOTIFIER_START(_notifier)	((_notifier)-
> > >interval.start)
> > +#define DRM_GPUSVM_NOTIFIER_END(_notifier)	((_notifier)-
> > >interval.end - 1)
> > +INTERVAL_TREE_DEFINE(struct drm_gpusvm_notifier, rb.node, u64,
> > +		     rb.__subtree_last, DRM_GPUSVM_NOTIFIER_START,
> > +		     DRM_GPUSVM_NOTIFIER_END, static __maybe_unused,
> > notifier);
> > +
> > +/**
> > + * npages_in_range() - Calculate the number of pages in a given
> > range
> > + * @start__: The start address of the range
> > + * @end__: The end address of the range
> > + *
> > + * This macro calculates the number of pages in a given memory
> > range,
> > + * specified by the start and end addresses. It divides the
> > difference
> > + * between the end and start addresses by the page size (PAGE_SIZE)
> > to
> > + * determine the number of pages in the range.
> > + *
> > + * Return: The number of pages in the specified range.
> > + */
> > +#define npages_in_range(start__, end__)	\
> > +	(((end__) - (start__)) >> PAGE_SHIFT)
> > +
> > +/**
> > + * struct drm_gpusvm_zdd - GPU SVM zone device data
> > + *
> > + * @refcount: Reference count for the zdd
> > + * @destroy_work: Work structure for asynchronous zdd destruction
> > + * @range: Pointer to the GPU SVM range
> > + * @vram_allocation: Driver-private pointer to the VRAM allocation
> > + *
> > + * This structure serves as a generic wrapper installed in
> > + * page->zone_device_data. It provides infrastructure for looking up
> > a range
> > + * upon CPU page fault and asynchronously releasing VRAM once the
> > CPU has no
> > + * page references. Asynchronous release is useful because CPU page
> > references
> > + * can be dropped in IRQ contexts, while releasing VRAM likely
> > requires sleeping
> > + * locks.
> > + */
> > +struct drm_gpusvm_zdd {
> > +	struct kref refcount;
> > +	struct work_struct destroy_work;
> > +	struct drm_gpusvm_range *range;
> > +	void *vram_allocation;
> > +};
> > +
> > +/**
> > + * drm_gpusvm_zdd_destroy_work_func - Work function for destroying a
> > zdd
> > + * @w: Pointer to the work_struct
> > + *
> > + * This function releases VRAM, puts GPU SVM range, and frees zdd.
> > + */
> > +static void drm_gpusvm_zdd_destroy_work_func(struct work_struct *w)
> > +{
> > +	struct drm_gpusvm_zdd *zdd =
> > +		container_of(w, struct drm_gpusvm_zdd,
> > destroy_work);
> > +	struct drm_gpusvm_range *range = zdd->range;
> > +	struct drm_gpusvm *gpusvm = range->gpusvm;
> > +
> > +	if (gpusvm->ops->vram_release && zdd->vram_allocation)
> > +		gpusvm->ops->vram_release(zdd->vram_allocation);
> > +	drm_gpusvm_range_put(range);
> > +	kfree(zdd);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_zdd_alloc - Allocate a zdd structure.
> > + * @range: Pointer to the GPU SVM range.
> > + *
> > + * This function allocates and initializes a new zdd structure. It
> > sets up the
> > + * reference count, initializes the destroy work, and links the
> > provided GPU SVM
> > + * range.
> > + *
> > + * Returns:
> > + * Pointer to the allocated zdd on success, ERR_PTR() on failure.
> > + */
> > +static struct drm_gpusvm_zdd *
> > +drm_gpusvm_zdd_alloc(struct drm_gpusvm_range *range)
> > +{
> > +	struct drm_gpusvm_zdd *zdd;
> > +
> > +	zdd = kmalloc(sizeof(*zdd), GFP_KERNEL);
> > +	if (!zdd)
> > +		return NULL;
> > +
> > +	kref_init(&zdd->refcount);
> > +	INIT_WORK(&zdd->destroy_work,
> > drm_gpusvm_zdd_destroy_work_func);
> > +	zdd->range = drm_gpusvm_range_get(range);
> > +	zdd->vram_allocation = NULL;
> > +
> > +	return zdd;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_zdd_get - Get a reference to a zdd structure.
> > + * @zdd: Pointer to the zdd structure.
> > + *
> > + * This function increments the reference count of the provided zdd
> > structure.
> > + *
> > + * Returns: Pointer to the zdd structure.
> > + */
> > +static struct drm_gpusvm_zdd *drm_gpusvm_zdd_get(struct
> > drm_gpusvm_zdd *zdd)
> > +{
> > +	kref_get(&zdd->refcount);
> > +	return zdd;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_zdd_destroy - Destroy a zdd structure.
> > + * @ref: Pointer to the reference count structure.
> > + *
> > + * This function queues the destroy_work of the zdd for asynchronous
> > destruction.
> > + */
> > +static void drm_gpusvm_zdd_destroy(struct kref *ref)
> > +{
> > +	struct drm_gpusvm_zdd *zdd =
> > +		container_of(ref, struct drm_gpusvm_zdd, refcount);
> > +	struct drm_gpusvm *gpusvm = zdd->range->gpusvm;
> > +
> > +	queue_work(gpusvm->zdd_wq, &zdd->destroy_work);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_zdd_put - Put a zdd reference.
> > + * @zdd: Pointer to the zdd structure.
> > + *
> > + * This function decrements the reference count of the provided zdd
> > structure
> > + * and schedules its destruction if the count drops to zero.
> > + */
> > +static void drm_gpusvm_zdd_put(struct drm_gpusvm_zdd *zdd)
> > +{
> > +	kref_put(&zdd->refcount, drm_gpusvm_zdd_destroy);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_find - Find GPU SVM range from GPU SVM notifier
> > + * @notifier: Pointer to the GPU SVM notifier structure.
> > + * @start: Start address of the range
> > + * @end: End address of the range
> > + *
> > + * Return: A pointer to the drm_gpusvm_range if found or NULL
> > + */
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64
> > start, u64 end)
> > +{
> > +	return range_iter_first(&notifier->root, start, end - 1);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_for_each_range_safe - Safely iterate over GPU SVM
> > ranges in a notifier
> > + * @range__: Iterator variable for the ranges
> > + * @next__: Iterator variable for the ranges temporary storage
> > + * @notifier__: Pointer to the GPU SVM notifier
> > + * @start__: Start address of the range
> > + * @end__: End address of the range
> > + *
> > + * This macro is used to iterate over GPU SVM ranges in a notifier
> > while
> > + * removing ranges from it.
> > + */
> > +#define drm_gpusvm_for_each_range_safe(range__, next__, notifier__,
> > start__, end__)	\
> > +	for ((range__) = drm_gpusvm_range_find((notifier__),
> > (start__), (end__)),	\
> > +	     (next__) =
> > __drm_gpusvm_range_next(range__);				\
> > +	     (range__) && (range__->va.start <
> > (end__));				\
> > +	     (range__) = (next__), (next__) =
> > __drm_gpusvm_range_next(range__))
> > +
> > +/**
> > + * __drm_gpusvm_notifier_next - get the next drm_gpusvm_notifier in
> > the list
> > + * @notifier: a pointer to the current drm_gpusvm_notifier
> > + *
> > + * Return: A pointer to the next drm_gpusvm_notifier if available,
> > or NULL if
> > + *         the current notifier is the last one or if the input
> > notifier is
> > + *         NULL.
> > + */
> > +static struct drm_gpusvm_notifier *
> > +__drm_gpusvm_notifier_next(struct drm_gpusvm_notifier *notifier)
> > +{
> > +	if (notifier && !list_is_last(&notifier->rb.entry,
> > +				      &notifier->gpusvm-
> > >notifier_list))
> > +		return list_next_entry(notifier, rb.entry);
> > +
> > +	return NULL;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_for_each_notifier - Iterate over GPU SVM notifiers in
> > a gpusvm
> > + * @notifier__: Iterator variable for the notifiers
> > + * @gpusvm__: Pointer to the GPU SVM structure
> > + * @start__: Start address of the notifier
> > + * @end__: End address of the notifier
> > + *
> > + * This macro is used to iterate over GPU SVM notifiers in a gpusvm.
> > + */
> > +#define drm_gpusvm_for_each_notifier(notifier__, gpusvm__, start__,
> > end__)		\
> > +	for ((notifier__) = notifier_iter_first(&(gpusvm__)->root,
> > (start__), (end__) - 1);	\
> > +	     (notifier__) && (notifier__->interval.start <
> > (end__));			\
> > +	     (notifier__) = __drm_gpusvm_notifier_next(notifier__))
> > +
> > +/**
> > + * drm_gpusvm_for_each_notifier_safe - Safely iterate over GPU SVM
> > notifiers in a gpusvm
> > + * @notifier__: Iterator variable for the notifiers
> > + * @next__: Iterator variable for the notifiers temporary storage
> > + * @gpusvm__: Pointer to the GPU SVM structure
> > + * @start__: Start address of the notifier
> > + * @end__: End address of the notifier
> > + *
> > + * This macro is used to iterate over GPU SVM notifiers in a gpusvm
> > while
> > + * removing notifiers from it.
> > + */
> > +#define drm_gpusvm_for_each_notifier_safe(notifier__, next__,
> > gpusvm__, start__, end__)	\
> > +	for ((notifier__) = notifier_iter_first(&(gpusvm__)->root,
> > (start__), (end__) - 1),	\
> > +	     (next__) =
> > __drm_gpusvm_notifier_next(notifier__);				\
> > +	     (notifier__) && (notifier__->interval.start <
> > (end__));			\
> > +	     (notifier__) = (next__), (next__) =
> > __drm_gpusvm_notifier_next(notifier__))
> > +
> > +/**
> > + * drm_gpusvm_notifier_invalidate - Invalidate a GPU SVM notifier.
> > + * @mni: Pointer to the mmu_interval_notifier structure.
> > + * @mmu_range: Pointer to the mmu_notifier_range structure.
> > + * @cur_seq: Current sequence number.
> > + *
> > + * This function serves as a generic MMU notifier for GPU SVM. It
> > sets the MMU
> > + * notifier sequence number and calls the driver invalidate vfunc
> > under
> > + * gpusvm->notifier_lock.
> > + *
> > + * Returns:
> > + * true if the operation succeeds, false otherwise.
> > + */
> > +static bool
> > +drm_gpusvm_notifier_invalidate(struct mmu_interval_notifier *mni,
> > +			       const struct mmu_notifier_range
> > *mmu_range,
> > +			       unsigned long cur_seq)
> > +{
> > +	struct drm_gpusvm_notifier *notifier =
> > +		container_of(mni, typeof(*notifier), notifier);
> > +	struct drm_gpusvm *gpusvm = notifier->gpusvm;
> > +
> > +	if (!mmu_notifier_range_blockable(mmu_range))
> > +		return false;
> > +
> > +	down_write(&gpusvm->notifier_lock);
> > +	mmu_interval_set_seq(mni, cur_seq);
> > +	gpusvm->ops->invalidate(gpusvm, notifier, mmu_range);
> > +	up_write(&gpusvm->notifier_lock);
> > +
> > +	return true;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_notifier_ops - MMU interval notifier operations for
> > GPU SVM
> > + */
> > +static const struct mmu_interval_notifier_ops
> > drm_gpusvm_notifier_ops = {
> > +	.invalidate = drm_gpusvm_notifier_invalidate,
> > +};
> > +
> > +/**
> > + * drm_gpusvm_init - Initialize the GPU SVM.
> > + * @gpusvm: Pointer to the GPU SVM structure.
> > + * @name: Name of the GPU SVM.
> > + * @drm: Pointer to the DRM device structure.
> > + * @mm: Pointer to the mm_struct for the address space.
> > + * @device_private_page_owner: Device private pages owner.
> > + * @mm_start: Start address of GPU SVM.
> > + * @mm_range: Range of the GPU SVM.
> > + * @notifier_size: Size of individual notifiers.
> > + * @ops: Pointer to the operations structure for GPU SVM.
> > + * @chunk_sizes: Pointer to the array of chunk sizes used in range
> > allocation.
> > + *               Entries should be powers of 2 in descending order
> > with last
> > + *               entry being SZ_4K.
> > + * @num_chunks: Number of chunks.
> > + *
> > + * This function initializes the GPU SVM.
> > + *
> > + * Returns:
> > + * 0 on success, a negative error code on failure.
> > + */
> > +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> > +		    const char *name, struct drm_device *drm,
> > +		    struct mm_struct *mm, void
> > *device_private_page_owner,
> > +		    u64 mm_start, u64 mm_range, u64 notifier_size,
> > +		    const struct drm_gpusvm_ops *ops,
> > +		    const u64 *chunk_sizes, int num_chunks)
> > +{
> > +	if (!ops->invalidate || !num_chunks)
> > +		return -EINVAL;
> > +
> > +	gpusvm->name = name;
> > +	gpusvm->drm = drm;
> > +	gpusvm->mm = mm;
> > +	gpusvm->device_private_page_owner =
> > device_private_page_owner;
> > +	gpusvm->mm_start = mm_start;
> > +	gpusvm->mm_range = mm_range;
> > +	gpusvm->notifier_size = notifier_size;
> > +	gpusvm->ops = ops;
> > +	gpusvm->chunk_sizes = chunk_sizes;
> > +	gpusvm->num_chunks = num_chunks;
> > +	gpusvm->zdd_wq = system_wq;
> > +
> > +	mmgrab(mm);
> > +	gpusvm->root = RB_ROOT_CACHED;
> > +	INIT_LIST_HEAD(&gpusvm->notifier_list);
> > +
> > +	init_rwsem(&gpusvm->notifier_lock);
> > +
> > +	fs_reclaim_acquire(GFP_KERNEL);
> > +	might_lock(&gpusvm->notifier_lock);
> > +	fs_reclaim_release(GFP_KERNEL);
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_notifier_find - Find GPU SVM notifier
> > + * @gpusvm__: Pointer to the GPU SVM structure
> > + * @fault_addr__: Fault address
> > + *
> > + * This macro finds the GPU SVM notifier associated with the fault
> > address.
> > + *
> > + * Returns:
> > + * Pointer to the GPU SVM notifier on success, NULL otherwise.
> > + */
> > +#define drm_gpusvm_notifier_find(gpusvm__, fault_addr__)	\
> > +	notifier_iter_first(&(gpusvm__)->root, (fault_addr__),	\
> > +			    (fault_addr__ + 1))
> > +
> > +/**
> > + * to_drm_gpusvm_notifier - retrieve the container struct for a
> > given rbtree node
> > + * @node__: a pointer to the rbtree node embedded within a
> > drm_gpusvm_notifier struct
> > + *
> > + * Return: A pointer to the containing drm_gpusvm_notifier
> > structure.
> > + */
> > +#define to_drm_gpusvm_notifier(__node)				\
> > +	container_of((__node), struct drm_gpusvm_notifier, rb.node)
> > +
> > +/**
> > + * drm_gpusvm_notifier_insert - Insert GPU SVM notifier
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + *
> > + * This function inserts the GPU SVM notifier into the GPU SVM RB
> > tree and list.
> > + */
> > +static void drm_gpusvm_notifier_insert(struct drm_gpusvm *gpusvm,
> > +				       struct drm_gpusvm_notifier
> > *notifier)
> > +{
> > +	struct rb_node *node;
> > +	struct list_head *head;
> > +
> > +	notifier_insert(notifier, &gpusvm->root);
> > +
> > +	node = rb_prev(&notifier->rb.node);
> > +	if (node)
> > +		head = &(to_drm_gpusvm_notifier(node))->rb.entry;
> > +	else
> > +		head = &gpusvm->notifier_list;
> > +
> > +	list_add(&notifier->rb.entry, head);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_notifier_remove - Remove GPU SVM notifier
> > + * @gpusvm__: Pointer to the GPU SVM structure
> > + * @notifier__: Pointer to the GPU SVM notifier structure
> > + *
> > + * This macro removes the GPU SVM notifier from the GPU SVM RB tree
> > and list.
> > + */
> > +#define drm_gpusvm_notifier_remove(gpusvm__, notifier__)	\
> > +	notifier_remove((notifier__), &(gpusvm__)->root);	\
> > +	list_del(&(notifier__)->rb.entry)
> > +
> > +/**
> > + * drm_gpusvm_fini - Finalize the GPU SVM.
> > + * @gpusvm: Pointer to the GPU SVM structure.
> > + *
> > + * This function finalizes the GPU SVM by cleaning up any remaining
> > ranges and
> > + * notifiers, and dropping a reference to struct MM.
> > + */
> > +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm)
> > +{
> > +	struct drm_gpusvm_notifier *notifier, *next;
> > +
> > +	drm_gpusvm_for_each_notifier_safe(notifier, next, gpusvm, 0,
> > LONG_MAX) {
> > +		struct drm_gpusvm_range *range, *__next;
> > +
> > +		/*
> > +		 * Remove notifier first to avoid racing with any
> > invalidation
> > +		 */
> > +		mmu_interval_notifier_remove(&notifier->notifier);
> > +		notifier->flags.removed = true;
> > +
> > +		drm_gpusvm_for_each_range_safe(range, __next,
> > notifier, 0,
> > +					       LONG_MAX)
> > +			drm_gpusvm_range_remove(gpusvm, range);
> > +	}
> > +
> > +	mmdrop(gpusvm->mm);
> > +	WARN_ON(!RB_EMPTY_ROOT(&gpusvm->root.rb_root));
> > +}
> > +
> > +/**
> > + * drm_gpusvm_notifier_alloc - Allocate GPU SVM notifier
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @fault_addr: Fault address
> > + *
> > + * This function allocates and initializes the GPU SVM notifier
> > structure.
> > + *
> > + * Returns:
> > + * Pointer to the allocated GPU SVM notifier on success, ERR_PTR()
> > on failure.
> > + */
> > +static struct drm_gpusvm_notifier *
> > +drm_gpusvm_notifier_alloc(struct drm_gpusvm *gpusvm, u64 fault_addr)
> > +{
> > +	struct drm_gpusvm_notifier *notifier;
> > +
> > +	if (gpusvm->ops->notifier_alloc)
> > +		notifier = gpusvm->ops->notifier_alloc();
> > +	else
> > +		notifier = kzalloc(sizeof(*notifier), GFP_KERNEL);
> > +
> > +	if (!notifier)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	notifier->gpusvm = gpusvm;
> > +	notifier->interval.start = ALIGN_DOWN(fault_addr, gpusvm-
> > >notifier_size);
> > +	notifier->interval.end = ALIGN(fault_addr + 1, gpusvm-
> > >notifier_size);
> > +	INIT_LIST_HEAD(&notifier->rb.entry);
> > +	notifier->root = RB_ROOT_CACHED;
> > +	INIT_LIST_HEAD(&notifier->range_list);
> > +
> > +	return notifier;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_notifier_free - Free GPU SVM notifier
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + *
> > + * This function frees the GPU SVM notifier structure.
> > + */
> > +static void drm_gpusvm_notifier_free(struct drm_gpusvm *gpusvm,
> > +				     struct drm_gpusvm_notifier
> > *notifier)
> > +{
> > +	WARN_ON(!RB_EMPTY_ROOT(&notifier->root.rb_root));
> > +
> > +	if (gpusvm->ops->notifier_free)
> > +		gpusvm->ops->notifier_free(notifier);
> > +	else
> > +		kfree(notifier);
> > +}
> > +
> > +/**
> > + * to_drm_gpusvm_range - retrieve the container struct for a given
> > rbtree node
> > + * @node__: a pointer to the rbtree node embedded within a
> > drm_gpusvm_range struct
> > + *
> > + * Return: A pointer to the containing drm_gpusvm_range structure.
> > + */
> > +#define to_drm_gpusvm_range(node__)	\
> > +	container_of((node__), struct drm_gpusvm_range, rb.node)
> > +
> > +/**
> > + * drm_gpusvm_range_insert - Insert GPU SVM range
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + * @range: Pointer to the GPU SVM range structure
> > + *
> > + * This function inserts the GPU SVM range into the notifier RB tree
> > and list.
> > + */
> > +static void drm_gpusvm_range_insert(struct drm_gpusvm_notifier
> > *notifier,
> > +				    struct drm_gpusvm_range *range)
> > +{
> > +	struct rb_node *node;
> > +	struct list_head *head;
> > +
> > +	drm_gpusvm_notifier_lock(notifier->gpusvm);
> > +	range_insert(range, &notifier->root);
> > +
> > +	node = rb_prev(&range->rb.node);
> > +	if (node)
> > +		head = &(to_drm_gpusvm_range(node))->rb.entry;
> > +	else
> > +		head = &notifier->range_list;
> > +
> > +	list_add(&range->rb.entry, head);
> > +	drm_gpusvm_notifier_unlock(notifier->gpusvm);
> > +}
> > +
> > +/**
> > + * __drm_gpusvm_range_remove - Remove GPU SVM range
> > + * @notifier__: Pointer to the GPU SVM notifier structure
> > + * @range__: Pointer to the GPU SVM range structure
> > + *
> > + * This macro removes the GPU SVM range from the notifier RB tree
> > and list.
> > + */
> > +#define __drm_gpusvm_range_remove(notifier__, range__)		\
> > +	range_remove((range__), &(notifier__)->root);		\
> > +	list_del(&(range__)->rb.entry)
> > +
> > +/**
> > + * drm_gpusvm_range_alloc - Allocate GPU SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + * @fault_addr: Fault address
> > + * @chunk_size: Chunk size
> > + * @migrate_vram: Flag indicating whether to migrate VRAM
> > + *
> > + * This function allocates and initializes the GPU SVM range
> > structure.
> > + *
> > + * Returns:
> > + * Pointer to the allocated GPU SVM range on success, ERR_PTR() on
> > failure.
> > + */
> > +static struct drm_gpusvm_range *
> > +drm_gpusvm_range_alloc(struct drm_gpusvm *gpusvm,
> > +		       struct drm_gpusvm_notifier *notifier,
> > +		       u64 fault_addr, u64 chunk_size, bool
> > migrate_vram)
> > +{
> > +	struct drm_gpusvm_range *range;
> > +
> > +	if (gpusvm->ops->range_alloc)
> > +		range = gpusvm->ops->range_alloc(gpusvm);
> > +	else
> > +		range = kzalloc(sizeof(*range), GFP_KERNEL);
> > +
> > +	if (!range)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	kref_init(&range->refcount);
> > +	range->gpusvm = gpusvm;
> > +	range->notifier = notifier;
> > +	range->va.start = ALIGN_DOWN(fault_addr, chunk_size);
> > +	range->va.end = ALIGN(fault_addr + 1, chunk_size);
> > +	INIT_LIST_HEAD(&range->rb.entry);
> > +	range->notifier_seq = LONG_MAX;
> > +	range->flags.migrate_vram = migrate_vram ? 1 : 0;
> > +
> > +	return range;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_check_pages - Check pages
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + * @start: Start address
> > + * @end: End address
> > + *
> > + * Check if pages between start and end have been faulted in on the
> > CPU. Used to
> > + * prevent migration of pages without CPU backing store.
> > + *
> > + * Returns:
> > + * True if pages have been faulted into CPU, False otherwise
> > + */
> > +static bool drm_gpusvm_check_pages(struct drm_gpusvm *gpusvm,
> > +				   struct drm_gpusvm_notifier
> > *notifier,
> > +				   u64 start, u64 end)
> > +{
> > +	struct hmm_range hmm_range = {
> > +		.default_flags = 0,
> > +		.notifier = &notifier->notifier,
> > +		.start = start,
> > +		.end = end,
> > +		.dev_private_owner = gpusvm-
> > >device_private_page_owner,
> > +	};
> > +	unsigned long timeout =
> > +		jiffies +
> > msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > +	unsigned long *pfns;
> > +	unsigned long npages = npages_in_range(start, end);
> > +	int err, i;
> > +
> > +	mmap_assert_locked(gpusvm->mm);
> > +
> > +	pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
> > +	if (!pfns)
> > +		return false;
> > +
> > +	hmm_range.notifier_seq = mmu_interval_read_begin(&notifier-
> > >notifier);
> > +	hmm_range.hmm_pfns = pfns;
> > +
> > +	while (true) {
> > +		err = hmm_range_fault(&hmm_range);
> > +		if (err == -EBUSY) {
> > +			if (time_after(jiffies, timeout))
> > +				break;
> > +
> > +			hmm_range.notifier_seq =
> > mmu_interval_read_begin(&notifier->notifier);
> > +			continue;
> > +		}
> > +		break;
> > +	}
> > +	if (err)
> > +		goto err_free;
> > +
> > +	for (i = 0; i < npages; ++i) {
> > +		if (!(pfns[i] & HMM_PFN_VALID)) {
> > +			err = -EFAULT;
> > +			goto err_free;
> > +		}
> > +	}
> > +
> > +err_free:
> > +	kvfree(pfns);
> > +	return err ? false : true;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_chunk_size - Determine chunk size for GPU SVM
> > range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + * @vas: Pointer to the virtual memory area structure
> > + * @fault_addr: Fault address
> > + * @gpuva_start: Start address of GPUVA which mirrors CPU
> > + * @gpuva_end: End address of GPUVA which mirrors CPU
> > + * @check_pages: Flag indicating whether to check pages
> > + *
> > + * This function determines the chunk size for the GPU SVM range
> > based on the
> > + * fault address, GPU SVM chunk sizes, existing GPU SVM ranges, and
> > the virtual
> > + * memory area boundaries.
> > + *
> > + * Returns:
> > + * Chunk size on success, LONG_MAX on failure.
> > + */
> > +static u64 drm_gpusvm_range_chunk_size(struct drm_gpusvm *gpusvm,
> > +				       struct drm_gpusvm_notifier
> > *notifier,
> > +				       struct vm_area_struct *vas,
> > +				       u64 fault_addr, u64
> > gpuva_start,
> > +				       u64 gpuva_end, bool
> > check_pages)
> > +{
> > +	u64 start, end;
> > +	int i = 0;
> > +
> > +retry:
> > +	for (; i < gpusvm->num_chunks; ++i) {
> > +		start = ALIGN_DOWN(fault_addr, gpusvm-
> > >chunk_sizes[i]);
> > +		end = ALIGN(fault_addr + 1, gpusvm->chunk_sizes[i]);
> > +
> > +		if (start >= vas->vm_start && end <= vas->vm_end &&
> > +		    start >= notifier->interval.start &&
> > +		    end <= notifier->interval.end &&
> > +		    start >= gpuva_start && end <= gpuva_end)
> > +			break;
> > +	}
> > +
> > +	if (i == gpusvm->num_chunks)
> > +		return LONG_MAX;
> > +
> > +	/*
> > +	 * If the allocation is more than a page, ensure it does not
> > +	 * overlap with existing ranges.
> > +	 */
> > +	if (end - start != SZ_4K) {
> > +		struct drm_gpusvm_range *range;
> > +
> > +		range = drm_gpusvm_range_find(notifier, start, end);
> > +		if (range) {
> > +			++i;
> > +			goto retry;
> > +		}
> > +
> > +		/*
> > +		 * XXX: Only create range on pages CPU has faulted
> > in. Without
> > +		 * this check, or prefault, on BMG
> > 'xe_exec_system_allocator --r
> > +		 * process-many-malloc' fails. In the failure case,
> > each process
> > +		 * mallocs 16k but the CPU VMA is ~128k which
> > results in 64k SVM
> > +		 * ranges. When migrating the SVM ranges, some
> > processes fail in
> > +		 * drm_gpusvm_migrate_to_vram with 'migrate.cpages
> > != npages'
> > +		 * and then upon drm_gpusvm_range_get_pages device
> > pages from
> > +		 * other processes are collected + faulted in which
> > creates all
> > +		 * sorts of problems. Unsure exactly how this is
> > happening; the
> > +		 * problem also goes away if 'xe_exec_system_allocator --
> > r
> > +		 * process-many-malloc' mallocs at least 64k at a
> > time.
> > +		 */
> > +		if (check_pages &&
> > +		    !drm_gpusvm_check_pages(gpusvm, notifier, start,
> > end)) {
> > +			++i;
> > +			goto retry;
> > +		}
> > +	}
> > +
> > +	return end - start;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_find_or_insert - Find or insert GPU SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @fault_addr: Fault address
> > + * @gpuva_start: Start address of GPUVA which mirrors CPU
> > + * @gpuva_end: End address of GPUVA which mirrors CPU
> > + * @ctx: GPU SVM context
> > + *
> > + * This function finds or inserts a newly allocated GPU SVM range
> > based on the
> > + * fault address. Caller must hold a lock to protect range lookup
> > and insertion.
> > + *
> > + * Returns:
> > + * Pointer to the GPU SVM range on success, ERR_PTR() on failure.
> > + */
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64
> > fault_addr,
> > +				u64 gpuva_start, u64 gpuva_end,
> > +				const struct drm_gpusvm_ctx *ctx)
> > +{
> > +	struct drm_gpusvm_notifier *notifier;
> > +	struct drm_gpusvm_range *range;
> > +	struct mm_struct *mm = gpusvm->mm;
> > +	struct vm_area_struct *vas;
> > +	bool notifier_alloc = false;
> > +	u64 chunk_size;
> > +	int err;
> > +	bool migrate_vram;
> > +
> > +	if (fault_addr < gpusvm->mm_start ||
> > +	    fault_addr > gpusvm->mm_start + gpusvm->mm_range) {
> > +		err = -EINVAL;
> > +		goto err_out;
> > +	}
> > +
> > +	if (!ctx->mmap_locked) {
> > +		if (!mmget_not_zero(mm)) {
> > +			err = -EFAULT;
> > +			goto err_out;
> > +		}
> > +		mmap_write_lock(mm);
> > +	}
> > +
> > +	mmap_assert_write_locked(mm);
> > +
> > +	notifier = drm_gpusvm_notifier_find(gpusvm, fault_addr);
> > +	if (!notifier) {
> > +		notifier = drm_gpusvm_notifier_alloc(gpusvm,
> > fault_addr);
> > +		if (IS_ERR(notifier)) {
> > +			err = PTR_ERR(notifier);
> > +			goto err_mmunlock;
> > +		}
> > +		notifier_alloc = true;
> > +		err = mmu_interval_notifier_insert_locked(&notifier-
> > >notifier,
> > +							  mm,
> > notifier->interval.start,
> > +							  notifier-
> > >interval.end -
> > +							  notifier-
> > >interval.start,
> > +							 
> > &drm_gpusvm_notifier_ops);
> > +		if (err)
> > +			goto err_notifier;
> > +	}
> > +
> > +	vas = vma_lookup(mm, fault_addr);
> > +	if (!vas) {
> > +		err = -ENOENT;
> > +		goto err_notifier_remove;
> > +	}
> > +
> > +	if (!ctx->read_only && !(vas->vm_flags & VM_WRITE)) {
> > +		err = -EPERM;
> > +		goto err_notifier_remove;
> > +	}
> > +
> > +	range = drm_gpusvm_range_find(notifier, fault_addr,
> > fault_addr + 1);
> > +	if (range)
> > +		goto out_mmunlock;
> > +	/*
> > +	 * XXX: Short-circuiting migration based on migrate_vma_*
> > current
> > +	 * limitations. If/when migrate_vma_* add more support, this
> > logic will
> > +	 * have to change.
> > +	 */
> > +	migrate_vram = ctx->vram_possible &&
> > +		vma_is_anonymous(vas) && !is_vm_hugetlb_page(vas);
> > +
> > +	chunk_size = drm_gpusvm_range_chunk_size(gpusvm, notifier,
> > vas,
> > +						 fault_addr,
> > gpuva_start,
> > +						 gpuva_end,
> > migrate_vram &&
> > +						 !ctx->prefault);
> > +	if (chunk_size == LONG_MAX) {
> > +		err = -EINVAL;
> > +		goto err_notifier_remove;
> > +	}
> > +
> > +	range = drm_gpusvm_range_alloc(gpusvm, notifier, fault_addr,
> > chunk_size,
> > +				       migrate_vram);
> > +	if (IS_ERR(range)) {
> > +		err = PTR_ERR(range);
> > +		goto err_notifier_remove;
> > +	}
> > +
> > +	drm_gpusvm_range_insert(notifier, range);
> > +	if (notifier_alloc)
> > +		drm_gpusvm_notifier_insert(gpusvm, notifier);
> > +
> > +	if (ctx->prefault) {
> > +		struct drm_gpusvm_ctx __ctx = *ctx;
> > +
> > +		__ctx.mmap_locked = true;
> > +		err = drm_gpusvm_range_get_pages(gpusvm, range,
> > &__ctx);
> > +		if (err)
> > +			goto err_range_remove;
> > +	}
> > +
> > +out_mmunlock:
> > +	if (!ctx->mmap_locked) {
> > +		mmap_write_unlock(mm);
> > +		mmput(mm);
> > +	}
> > +
> > +	return range;
> > +
> > +err_range_remove:
> > +	__drm_gpusvm_range_remove(notifier, range);
> > +err_notifier_remove:
> > +	if (notifier_alloc)
> > +		mmu_interval_notifier_remove(&notifier->notifier);
> > +err_notifier:
> > +	if (notifier_alloc)
> > +		drm_gpusvm_notifier_free(gpusvm, notifier);
> > +err_mmunlock:
> > +	if (!ctx->mmap_locked) {
> > +		mmap_write_unlock(mm);
> > +		mmput(mm);
> > +	}
> > +err_out:
> > +	return ERR_PTR(err);
> > +}
> > +
> > +/**
> > + * for_each_dma_page - iterate over pages in a DMA region
> > + * @i__: the current page index in the iteration
> > + * @j__: the current page index, log order, in the iteration
> > + * @npages__: the total number of pages in the DMA region
> > + * @order__: the order of the pages in the DMA region
> > + *
> > + * This macro iterates over each page in a DMA region. The DMA
> > region
> > + * is assumed to be composed of 2^@order__ pages, and the macro will
> > + * step through the region one block of 2^@order__ pages at a time.
> > + */
> > +#define for_each_dma_page(i__, j__, npages__, order__)	\
> > +	for ((i__) = 0, (j__) = 0; (i__) < (npages__);	\
> > +	     (j__)++, (i__) += 0x1 << (order__))
> > +
> > +/**
> > + * __drm_gpusvm_range_unmap_pages - Unmap pages associated with a
> > GPU SVM range (internal)
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + *
> > + * This function unmaps pages associated with a GPU SVM range.
> > Assumes and
> > + * asserts correct locking is in place when called.
> > + */
> > +static void __drm_gpusvm_range_unmap_pages(struct drm_gpusvm
> > *gpusvm,
> > +					   struct drm_gpusvm_range
> > *range)
> > +{
> > +	lockdep_assert_held(&gpusvm->notifier_lock);
> > +
> > +	if (range->pages) {
> > +		unsigned long i, j, npages = npages_in_range(range-
> > >va.start,
> > +							     range-
> > >va.end);
> > +
> > +		if (range->flags.has_dma_mapping) {
> > +			for_each_dma_page(i, j, npages, range-
> > >order)
> > +				dma_unmap_page(gpusvm->drm->dev,
> > +					       range->dma_addr[j],
> > +					       PAGE_SIZE << range-
> > >order,
> > +					       DMA_BIDIRECTIONAL);
> > +		}
> > +
> > +		range->flags.has_vram_pages = false;
> > +		range->flags.has_dma_mapping = false;
> > +	}
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_free_pages - Free pages associated with a GPU
> > SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + *
> > + * This function frees pages associated with a GPU SVM range.
> > + */
> > +static void drm_gpusvm_range_free_pages(struct drm_gpusvm *gpusvm,
> > +					struct drm_gpusvm_range
> > *range)
> > +{
> > +	lockdep_assert_held(&gpusvm->notifier_lock);
> > +
> > +	if (range->pages) {
> > +		if (range->flags.kfree_mapping) {
> > +			kfree(range->dma_addr);
> > +			range->flags.kfree_mapping = false;
> > +			range->pages = NULL;
> > +		} else {
> > +			kvfree(range->pages);
> > +			range->pages = NULL;
> > +		}
> > +	}
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_remove - Remove GPU SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range to be removed
> > + *
> > + * This function removes the specified GPU SVM range and also
> > removes the parent
> > + * GPU SVM notifier if no more ranges remain in the notifier. The
> > caller must
> > + * hold a lock to protect range and notifier removal.
> > + */
> > +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> > +			     struct drm_gpusvm_range *range)
> > +{
> > +	struct drm_gpusvm_notifier *notifier;
> > +
> > +	notifier = drm_gpusvm_notifier_find(gpusvm, range-
> > >va.start);
> > +	if (WARN_ON_ONCE(!notifier))
> > +		return;
> > +
> > +	drm_gpusvm_notifier_lock(gpusvm);
> > +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > +	drm_gpusvm_range_free_pages(gpusvm, range);
> > +	__drm_gpusvm_range_remove(notifier, range);
> > +	drm_gpusvm_notifier_unlock(gpusvm);
> > +
> > +	drm_gpusvm_range_put(range);
> > +
> > +	if (RB_EMPTY_ROOT(&notifier->root.rb_root)) {
> > +		if (!notifier->flags.removed)
> > +			mmu_interval_notifier_remove(&notifier-
> > >notifier);
> > +		drm_gpusvm_notifier_remove(gpusvm, notifier);
> > +		drm_gpusvm_notifier_free(gpusvm, notifier);
> > +	}
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_get - Get a reference to GPU SVM range
> > + * @range: Pointer to the GPU SVM range
> > + *
> > + * This function increments the reference count of the specified GPU
> > SVM range.
> > + *
> > + * Returns:
> > + * Pointer to the GPU SVM range.
> > + */
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_get(struct drm_gpusvm_range *range)
> > +{
> > +	kref_get(&range->refcount);
> > +
> > +	return range;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_destroy - Destroy GPU SVM range
> > + * @refcount: Pointer to the reference counter embedded in the GPU
> > SVM range
> > + *
> > + * This function destroys the specified GPU SVM range when its
> > reference count
> > + * reaches zero. If a custom range-free function is provided, it is
> > invoked to
> > + * free the range; otherwise, the range is deallocated using
> > kfree().
> > + */
> > +static void drm_gpusvm_range_destroy(struct kref *refcount)
> > +{
> > +	struct drm_gpusvm_range *range =
> > +		container_of(refcount, struct drm_gpusvm_range,
> > refcount);
> > +	struct drm_gpusvm *gpusvm = range->gpusvm;
> > +
> > +	if (gpusvm->ops->range_free)
> > +		gpusvm->ops->range_free(range);
> > +	else
> > +		kfree(range);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_put - Put a reference to GPU SVM range
> > + * @range: Pointer to the GPU SVM range
> > + *
> > + * This function decrements the reference count of the specified GPU
> > SVM range
> > + * and frees it when the count reaches zero.
> > + */
> > +void drm_gpusvm_range_put(struct drm_gpusvm_range *range)
> > +{
> > +	kref_put(&range->refcount, drm_gpusvm_range_destroy);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_pages_valid - GPU SVM range pages valid
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + *
> > + * This function determines if the pages of a GPU SVM range are valid.
> > + * Expected to be called while holding gpusvm->notifier_lock and as the
> > + * last step before committing a GPU binding.
> > + *
> > + * Returns:
> > + * True if GPU SVM range has valid pages, False otherwise
> > + */
> > +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> > +				  struct drm_gpusvm_range *range)
> > +{
> > +	lockdep_assert_held(&gpusvm->notifier_lock);
> > +
> > +	return range->flags.has_vram_pages || range-
> > >flags.has_dma_mapping;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_pages_valid_unlocked - GPU SVM range pages valid
> > unlocked
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + *
> > + * This function determines if the pages of a GPU SVM range are valid.
> > + * Expected to be called without holding gpusvm->notifier_lock.
> > + *
> > + * Returns:
> > + * True if GPU SVM range has valid pages, False otherwise
> > + */
> > +static bool
> > +drm_gpusvm_range_pages_valid_unlocked(struct drm_gpusvm *gpusvm,
> > +				      struct drm_gpusvm_range
> > *range)
> > +{
> > +	bool pages_valid;
> > +
> > +	if (!range->pages)
> > +		return false;
> > +
> > +	drm_gpusvm_notifier_lock(gpusvm);
> > +	pages_valid = drm_gpusvm_range_pages_valid(gpusvm, range);
> > +	if (!pages_valid && range->flags.kfree_mapping) {
> > +		kfree(range->dma_addr);
> > +		range->flags.kfree_mapping = false;
> > +		range->pages = NULL;
> > +	}
> > +	drm_gpusvm_notifier_unlock(gpusvm);
> > +
> > +	return pages_valid;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_get_pages - Get pages for a GPU SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + * @ctx: GPU SVM context
> > + *
> > + * This function gets pages for a GPU SVM range and ensures they are
> > mapped for
> > + * DMA access.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> > +			       struct drm_gpusvm_range *range,
> > +			       const struct drm_gpusvm_ctx *ctx)
> > +{
> > +	struct mmu_interval_notifier *notifier = &range->notifier-
> > >notifier;
> > +	struct hmm_range hmm_range = {
> > +		.default_flags = HMM_PFN_REQ_FAULT | (ctx->read_only
> > ? 0 :
> > +			HMM_PFN_REQ_WRITE),
> > +		.notifier = notifier,
> > +		.start = range->va.start,
> > +		.end = range->va.end,
> > +		.dev_private_owner = gpusvm-
> > >device_private_page_owner,
> > +	};
> > +	struct mm_struct *mm = gpusvm->mm;
> > +	unsigned long timeout =
> > +		jiffies +
> > msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > +	unsigned long i, j;
> > +	unsigned long npages = npages_in_range(range->va.start,
> > range->va.end);
> > +	unsigned int order = 0;
> > +	unsigned long *pfns;
> > +	struct page **pages;
> > +	int err = 0;
> > +	bool vram_pages = !!range->flags.migrate_vram;
> > +	bool alloc_pfns = false, kfree_mapping;
> > +
> > +retry:
> > +	kfree_mapping = false;
> > +	hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
> > +	if (drm_gpusvm_range_pages_valid_unlocked(gpusvm, range))
> > +		return 0;
> > +
> > +	if (range->notifier_seq == hmm_range.notifier_seq && range-
> > >pages) {
> > +		if (ctx->prefault)
> > +			return 0;
> > +
> > +		pfns = (unsigned long *)range->pages;
> > +		pages = range->pages;
> > +		goto map_pages;
> > +	}
> > +
> > +	if (!range->pages) {
> > +		pfns = kvmalloc_array(npages, sizeof(*pfns),
> > GFP_KERNEL);
> > +		if (!pfns)
> > +			return -ENOMEM;
> > +		alloc_pfns = true;
> > +	} else {
> > +		pfns = (unsigned long *)range->pages;
> > +	}
> > +
> > +	if (!ctx->mmap_locked) {
> > +		if (!mmget_not_zero(mm)) {
> > +			err = -EFAULT;
> > +			goto err_out;
> > +		}
> > +	}
> > +
> > +	hmm_range.hmm_pfns = pfns;
> > +	while (true) {
> > +		/* Must be checked after mmu_interval_read_begin */
> > +		if (range->flags.unmapped) {
> > +			err = -EFAULT;
> > +			break;
> > +		}
> > +
> > +		if (!ctx->mmap_locked) {
> > +			/*
> > +			 * XXX: HMM locking document indicates only
> > a read-lock
> > +			 * is required but there appears to be a
> > window between
> > +			 * the MMU_NOTIFY_MIGRATE event triggered in
> > a CPU fault
> > +			 * via migrate_vma_setup and the pages
> > actually moving
> > +			 * in migrate_vma_finalize in which this
> > code can grab
> > +			 * garbage pages. Grabbing the write-lock if
> > the range
> > +			 * is attached to vram appears to protect
> > against this
> > +			 * race.
> > +			 */
> > +			if (vram_pages)
> > +				mmap_write_lock(mm);
> > +			else
> > +				mmap_read_lock(mm);
> > +		}
> > +		err = hmm_range_fault(&hmm_range);
> > +		if (!ctx->mmap_locked) {
> > +			if (vram_pages)
> > +				mmap_write_unlock(mm);
> > +			else
> > +				mmap_read_unlock(mm);
> > +		}
> > +
> > +		if (err == -EBUSY) {
> > +			if (time_after(jiffies, timeout))
> > +				break;
> > +
> > +			hmm_range.notifier_seq =
> > mmu_interval_read_begin(notifier);
> > +			continue;
> > +		}
> > +		break;
> > +	}
> > +	if (!ctx->mmap_locked)
> > +		mmput(mm);
> > +	if (err)
> > +		goto err_free;
> > +
> > +	pages = (struct page **)pfns;
> > +
> > +	if (ctx->prefault) {
> > +		range->pages = pages;
> > +		goto set_seqno;
> > +	}
> > +
> > +map_pages:
> > +	if (is_device_private_page(hmm_pfn_to_page(pfns[0]))) {
> > +		WARN_ON_ONCE(!range->vram_allocation);
> > +
> > +		for (i = 0; i < npages; ++i) {
> > +			pages[i] = hmm_pfn_to_page(pfns[i]);
> > +
> > +			if
> > (WARN_ON_ONCE(!is_device_private_page(pages[i]))) {
> > +				err = -EOPNOTSUPP;
> > +				goto err_free;
> > +			}
> > +		}
> > +
> > +		/* Do not race with notifier unmapping pages */
> > +		drm_gpusvm_notifier_lock(gpusvm);
> > +		range->flags.has_vram_pages = true;
> > +		range->pages = pages;
> > +		if (mmu_interval_read_retry(notifier,
> > hmm_range.notifier_seq)) {
> > +			err = -EAGAIN;
> > +			__drm_gpusvm_range_unmap_pages(gpusvm,
> > range);
> > +		}
> > +		drm_gpusvm_notifier_unlock(gpusvm);
> > +	} else {
> > +		dma_addr_t *dma_addr = (dma_addr_t *)pfns;
> > +
> > +		for_each_dma_page(i, j, npages, order) {
> > +			if (WARN_ON_ONCE(i && order !=
> > +					
> > hmm_pfn_to_map_order(pfns[i]))) {
> > +				err = -EOPNOTSUPP;
> > +				npages = i;
> > +				goto err_unmap;
> > +			}
> > +			order = hmm_pfn_to_map_order(pfns[i]);
> > +
> > +			pages[j] = hmm_pfn_to_page(pfns[i]);
> > +			if
> > (WARN_ON_ONCE(is_zone_device_page(pages[j]))) {
> > +				err = -EOPNOTSUPP;
> > +				npages = i;
> > +				goto err_unmap;
> > +			}
> > +
> > +			set_page_dirty_lock(pages[j]);
> > +			mark_page_accessed(pages[j]);
> > +
> > +			dma_addr[j] = dma_map_page(gpusvm->drm->dev,
> > +						   pages[j], 0,
> > +						   PAGE_SIZE <<
> > order,
> > +						  
> > DMA_BIDIRECTIONAL);
> > +			if (dma_mapping_error(gpusvm->drm->dev,
> > dma_addr[j])) {
> > +				err = -EFAULT;
> > +				npages = i;
> > +				goto err_unmap;
> > +			}
> > +		}
> > +
> > +		/* Huge pages, reduce memory footprint */
> > +		if (order) {
> > +			dma_addr = kmalloc_array(j,
> > sizeof(*dma_addr),
> > +						 GFP_KERNEL);
> > +			if (dma_addr) {
> > +				for (i = 0; i < j; ++i)
> > +					dma_addr[i] =
> > (dma_addr_t)pfns[i];
> > +				kvfree(pfns);
> > +				kfree_mapping = true;
> > +			} else {
> > +				dma_addr = (dma_addr_t *)pfns;
> > +			}
> > +		}
> > +
> > +		/* Do not race with notifier unmapping pages */
> > +		drm_gpusvm_notifier_lock(gpusvm);
> > +		range->order = order;
> > +		range->flags.kfree_mapping = kfree_mapping;
> > +		range->flags.has_dma_mapping = true;
> > +		range->dma_addr = dma_addr;
> > +		range->vram_allocation = NULL;
> > +		if (mmu_interval_read_retry(notifier,
> > hmm_range.notifier_seq)) {
> > +			err = -EAGAIN;
> > +			__drm_gpusvm_range_unmap_pages(gpusvm,
> > range);
> > +		}
> > +		drm_gpusvm_notifier_unlock(gpusvm);
> > +	}
> > +
> > +	if (err == -EAGAIN)
> > +		goto retry;
> > +set_seqno:
> > +	range->notifier_seq = hmm_range.notifier_seq;
> > +
> > +	return 0;
> > +
> > +err_unmap:
> > +	for_each_dma_page(i, j, npages, order)
> > +		dma_unmap_page(gpusvm->drm->dev,
> > +			       (dma_addr_t)pfns[j],
> > +			       PAGE_SIZE << order,
> > DMA_BIDIRECTIONAL);
> > +err_free:
> > +	if (alloc_pfns)
> > +		kvfree(pfns);
> > +err_out:
> > +	return err;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_unmap_pages - Unmap pages associated with a GPU
> > SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + * @ctx: GPU SVM context
> > + *
> > + * This function unmaps pages associated with a GPU SVM range. If
> > @in_notifier
> > + * is set, it is assumed that gpusvm->notifier_lock is held in write
> > mode; if it
> > + * is clear, it acquires gpusvm->notifier_lock in read mode. Must be
> > called on
> > + * each GPU SVM range attached to notifier in gpusvm->ops-
> > >invalidate for IOMMU
> > + * security model.
> > + */
> > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> > +				  struct drm_gpusvm_range *range,
> > +				  const struct drm_gpusvm_ctx *ctx)
> > +{
> > +	if (ctx->in_notifier)
> > +		lockdep_assert_held_write(&gpusvm->notifier_lock);
> > +	else
> > +		drm_gpusvm_notifier_lock(gpusvm);
> > +
> > +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > +
> > +	if (!ctx->in_notifier)
> > +		drm_gpusvm_notifier_unlock(gpusvm);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migration_put_page - Put a migration page
> > + * @page: Pointer to the page to put
> > + *
> > + * This function unlocks and puts a page.
> > + */
> > +static void drm_gpusvm_migration_put_page(struct page *page)
> > +{
> > +	unlock_page(page);
> > +	put_page(page);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migration_put_pages - Put migration pages
> > + * @npages: Number of pages
> > + * @migrate_pfn: Array of migrate page frame numbers
> > + *
> > + * This function puts an array of pages.
> > + */
> > +static void drm_gpusvm_migration_put_pages(unsigned long npages,
> > +					   unsigned long
> > *migrate_pfn)
> > +{
> > +	unsigned long i;
> > +
> > +	for (i = 0; i < npages; ++i) {
> > +		if (!migrate_pfn[i])
> > +			continue;
> > +
> > +		drm_gpusvm_migration_put_page(migrate_pfn_to_page(mi
> > grate_pfn[i]));
> > +		migrate_pfn[i] = 0;
> > +	}
> > +}
> > +
> > +/**
> > + * drm_gpusvm_get_vram_page - Get a reference to a VRAM page
> > + * @page: Pointer to the page
> > + * @zdd: Pointer to the GPU SVM zone device data
> > + *
> > + * This function associates the given page with the specified GPU
> > SVM zone
> > + * device data and initializes it for zone device usage.
> > + */
> > +static void drm_gpusvm_get_vram_page(struct page *page,
> > +				     struct drm_gpusvm_zdd *zdd)
> > +{
> > +	page->zone_device_data = drm_gpusvm_zdd_get(zdd);
> > +	zone_device_page_init(page);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_map_pages() - Map migration pages for GPU SVM
> > migration
> > + * @dev: The device for which the pages are being mapped
> > + * @dma_addr: Array to store DMA addresses corresponding to mapped
> > pages
> > + * @migrate_pfn: Array of migrate page frame numbers to map
> > + * @npages: Number of pages to map
> > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > + *
> > + * This function maps pages of memory for migration usage in GPU
> > SVM. It
> > + * iterates over each page frame number provided in @migrate_pfn,
> > maps the
> > + * corresponding page, and stores the DMA address in the provided
> > @dma_addr
> > + * array.
> > + *
> > + * Return: 0 on success, -EFAULT if an error occurs during mapping.
> > + */
> > +static int drm_gpusvm_migrate_map_pages(struct device *dev,
> > +					dma_addr_t *dma_addr,
> > +					long unsigned int
> > *migrate_pfn,
> > +					unsigned long npages,
> > +					enum dma_data_direction dir)
> > +{
> > +	unsigned long i;
> > +
> > +	for (i = 0; i < npages; ++i) {
> > +		struct page *page =
> > migrate_pfn_to_page(migrate_pfn[i]);
> > +
> > +		if (!page)
> > +			continue;
> > +
> > +		if (WARN_ON_ONCE(is_zone_device_page(page)))
> > +			return -EFAULT;
> > +
> > +		dma_addr[i] = dma_map_page(dev, page, 0, PAGE_SIZE,
> > dir);
> > +		if (dma_mapping_error(dev, dma_addr[i]))
> > +			return -EFAULT;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_unmap_pages() - Unmap pages previously mapped
> > for GPU SVM migration
> > + * @dev: The device for which the pages were mapped
> > + * @dma_addr: Array of DMA addresses corresponding to mapped pages
> > + * @npages: Number of pages to unmap
> > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > + *
> > + * This function unmaps previously mapped pages of memory for GPU
> > Shared Virtual
> > + * Memory (SVM). It iterates over each DMA address provided in
> > @dma_addr, checks
> > + * if it's valid and not already unmapped, and unmaps the
> > corresponding page.
> > + */
> > +static void drm_gpusvm_migrate_unmap_pages(struct device *dev,
> > +					   dma_addr_t *dma_addr,
> > +					   unsigned long npages,
> > +					   enum dma_data_direction
> > dir)
> > +{
> > +	unsigned long i;
> > +
> > +	for (i = 0; i < npages; ++i) {
> > +		if (!dma_addr[i] || dma_mapping_error(dev,
> > dma_addr[i]))
> > +			continue;
> > +
> > +		dma_unmap_page(dev, dma_addr[i], PAGE_SIZE, dir);
> > +	}
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_to_vram - Migrate GPU SVM range to VRAM
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + * @vram_allocation: Driver-private pointer to the VRAM allocation.
> > The caller
> > + *                   should hold a reference to the VRAM allocation,
> > which
> > + *                   should be dropped via ops->vram_allocation or
> > upon the
> > + *                   failure of this function.
> > + * @ctx: GPU SVM context
> > + *
> > + * This function migrates the specified GPU SVM range to VRAM. It
> > performs the
> > + * necessary setup and invokes the driver-specific operations for
> > migration to
> > + * VRAM. Upon successful return, @vram_allocation can safely
> > reference @range
> > + * until ops->vram_release is called, which only happens upon
> > + * successful return.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> > +			       struct drm_gpusvm_range *range,
> > +			       void *vram_allocation,
> > +			       const struct drm_gpusvm_ctx *ctx)
> > +{
> > +	u64 start = range->va.start, end = range->va.end;
> > +	struct migrate_vma migrate = {
> > +		.start		= start,
> > +		.end		= end,
> > +		.pgmap_owner	= gpusvm->device_private_page_owner,
> > +		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
> > +	};
> > +	struct mm_struct *mm = gpusvm->mm;
> > +	unsigned long i, npages = npages_in_range(start, end);
> > +	struct vm_area_struct *vas;
> > +	struct drm_gpusvm_zdd *zdd = NULL;
> > +	struct page **pages;
> > +	dma_addr_t *dma_addr;
> > +	void *buf;
> > +	int err;
> > +
> > +	if (!range->flags.migrate_vram)
> > +		return -EINVAL;
> > +
> > +	if (!gpusvm->ops->populate_vram_pfn || !gpusvm->ops-
> > >copy_to_vram ||
> > +	    !gpusvm->ops->copy_to_sram)
> > +		return -EOPNOTSUPP;
> > +
> > +	if (!ctx->mmap_locked) {
> > +		if (!mmget_not_zero(mm)) {
> > +			err = -EFAULT;
> > +			goto err_out;
> > +		}
> > +		mmap_write_lock(mm);
> > +	}
> > +
> > +	mmap_assert_locked(mm);
> > +
> > +	vas = vma_lookup(mm, start);
> > +	if (!vas) {
> > +		err = -ENOENT;
> > +		goto err_mmunlock;
> > +	}
> > +
> > +	if (end > vas->vm_end || start < vas->vm_start) {
> > +		err = -EINVAL;
> > +		goto err_mmunlock;
> > +	}
> > +
> > +	if (!vma_is_anonymous(vas)) {
> > +		err = -EBUSY;
> > +		goto err_mmunlock;
> > +	}
> > +
> > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) +
> > sizeof(*dma_addr) +
> > +		       sizeof(*pages), GFP_KERNEL);
> > +	if (!buf) {
> > +		err = -ENOMEM;
> > +		goto err_mmunlock;
> > +	}
> > +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr))
> > * npages;
> > +
> > +	zdd = drm_gpusvm_zdd_alloc(range);
> > +	if (!zdd) {
> > +		err = -ENOMEM;
> > +		goto err_free;
> > +	}
> > +
> > +	migrate.vma = vas;
> > +	migrate.src = buf;
> > +	migrate.dst = migrate.src + npages;
> > +
> > +	err = migrate_vma_setup(&migrate);
> > +	if (err)
> > +		goto err_free;
> > +
> > +	/*
> > +	 * FIXME: Below cases, !migrate.cpages and migrate.cpages !=
> > npages, not
> > +	 * always an error. Need to revisit possible cases and how
> > to handle. We
> > +	 * could prefault on migrate.cpages != npages via
> > hmm_range_fault.
> > +	 */
> > +
> > +	if (!migrate.cpages) {
> > +		err = -EFAULT;
> > +		goto err_free;
> > +	}
> > +
> > +	if (migrate.cpages != npages) {
> > +		err = -EBUSY;
> > +		goto err_finalize;
> > +	}
> > +
> > +	err = gpusvm->ops->populate_vram_pfn(gpusvm,
> > vram_allocation, npages,
> > +					     migrate.dst);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> > dma_addr,
> > +					   migrate.src, npages,
> > DMA_TO_DEVICE);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	for (i = 0; i < npages; ++i) {
> > +		struct page *page = pfn_to_page(migrate.dst[i]);
> > +
> > +		pages[i] = page;
> > +		migrate.dst[i] = migrate_pfn(migrate.dst[i]);
> > +		drm_gpusvm_get_vram_page(page, zdd);
> > +	}
> > +
> > +	err = gpusvm->ops->copy_to_vram(gpusvm, pages, dma_addr,
> > npages);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	/* Upon success bind vram allocation to range and zdd */
> > +	range->vram_allocation = vram_allocation;
> > +	WRITE_ONCE(zdd->vram_allocation, vram_allocation);	/*
> > Owns ref */
> > +
> > +err_finalize:
> > +	if (err)
> > +		drm_gpusvm_migration_put_pages(npages, migrate.dst);
> > +	migrate_vma_pages(&migrate);
> > +	migrate_vma_finalize(&migrate);
> > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr,
> > npages,
> > +				       DMA_TO_DEVICE);
> > +err_free:
> > +	if (zdd)
> > +		drm_gpusvm_zdd_put(zdd);
> > +	kvfree(buf);
> > +err_mmunlock:
> > +	if (!ctx->mmap_locked) {
> > +		mmap_write_unlock(mm);
> > +		mmput(mm);
> > +	}
> > +err_out:
> > +	return err;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_populate_sram_pfn - Populate SRAM PFNs for a
> > VM area
> > + * @vas: Pointer to the VM area structure, can be NULL
> > + * @npages: Number of pages to populate
> > + * @src_mpfn: Source array of migrate PFNs
> > + * @mpfn: Array of migrate PFNs to populate
> > + * @addr: Start address for PFN allocation
> > + *
> > + * This function populates the SRAM migrate page frame numbers
> > (PFNs) for the
> > + * specified VM area structure. It allocates and locks pages in the
> > VM area for
> > + * SRAM usage. If vas is non-NULL use alloc_page_vma for allocation,
> > if NULL use
> > + * alloc_page for allocation.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +static int drm_gpusvm_migrate_populate_sram_pfn(struct
> > vm_area_struct *vas,
> > +						unsigned long
> > npages,
> > +						unsigned long
> > *src_mpfn,
> > +						unsigned long *mpfn,
> > u64 addr)
> > +{
> > +	unsigned long i;
> > +
> > +	for (i = 0; i < npages; ++i, addr += PAGE_SIZE) {
> > +		struct page *page;
> > +
> > +		if (!(src_mpfn[i] & MIGRATE_PFN_MIGRATE))
> > +			continue;
> > +
> > +		if (vas)
> > +			page = alloc_page_vma(GFP_HIGHUSER, vas,
> > addr);
> > +		else
> > +			page = alloc_page(GFP_HIGHUSER);
> > +
> > +		if (!page)
> > +			return -ENOMEM;
> > +
> > +		lock_page(page);
> > +		mpfn[i] = migrate_pfn(page_to_pfn(page));
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_evict_to_sram - Evict GPU SVM range to SRAM
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + *
> > + * Similar to __drm_gpusvm_migrate_to_sram but does not require the mmap
> > + * lock; migration is done via the migrate_device_* functions. This is a
> > + * fallback path, as it is preferred to issue migrations with the mmap
> > + * lock held.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +static int drm_gpusvm_evict_to_sram(struct drm_gpusvm *gpusvm,
> > +				    struct drm_gpusvm_range *range)
> > +{
> > +	unsigned long npages;
> > +	struct page **pages;
> > +	unsigned long *src, *dst;
> > +	dma_addr_t *dma_addr;
> > +	void *buf;
> > +	int i, err = 0;
> > +
> > +	npages = npages_in_range(range->va.start, range->va.end);
> > +
> > +	buf = kvcalloc(npages, 2 * sizeof(*src) + sizeof(*dma_addr) +
> > +		       sizeof(*pages), GFP_KERNEL);
> > +	if (!buf) {
> > +		err = -ENOMEM;
> > +		goto err_out;
> > +	}
> > +	src = buf;
> > +	dst = buf + (sizeof(*src) * npages);
> > +	dma_addr = buf + (2 * sizeof(*src) * npages);
> > +	pages = buf + (2 * sizeof(*src) + sizeof(*dma_addr)) *
> > npages;
> > +
> > +	err = gpusvm->ops->populate_vram_pfn(gpusvm, range->vram_allocation,
> > +					     npages, src);
> > +	if (err)
> > +		goto err_free;
> > +
> > +	err = migrate_device_vma_range(gpusvm->mm,
> > +				       gpusvm->device_private_page_owner, src,
> > +				       npages, range->va.start);
> > +	if (err)
> > +		goto err_free;
> > +
> > +	err = drm_gpusvm_migrate_populate_sram_pfn(NULL, npages,
> > src, dst, 0);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> > dma_addr,
> > +					   dst, npages,
> > DMA_BIDIRECTIONAL);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	for (i = 0; i < npages; ++i)
> > +		pages[i] = migrate_pfn_to_page(src[i]);
> > +
> > +	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr,
> > npages);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +err_finalize:
> > +	if (err)
> > +		drm_gpusvm_migration_put_pages(npages, dst);
> > +	migrate_device_pages(src, dst, npages);
> > +	migrate_device_finalize(src, dst, npages);
> > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr,
> > npages,
> > +				       DMA_BIDIRECTIONAL);
> > +err_free:
> > +	kvfree(buf);
> > +err_out:
> > +
> > +	return err;
> > +}
> > +
> > +/**
> > + * __drm_gpusvm_migrate_to_sram - Migrate GPU SVM range to SRAM
> > (internal)
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @vas: Pointer to the VM area structure
> > + * @page: Pointer to the page for fault handling (can be NULL)
> > + * @start: Start address of the migration range
> > + * @end: End address of the migration range
> > + *
> > + * This internal function performs the migration of the specified
> > GPU SVM range
> > + * to SRAM. It sets up the migration, populates and DMA maps SRAM
> > PFNs, and
> > + * invokes the driver-specific operations for migration to SRAM.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +static int __drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > +					struct vm_area_struct *vas,
> > +					struct page *page,
> > +					u64 start, u64 end)
> > +{
> > +	struct migrate_vma migrate = {
> > +		.vma		= vas,
> > +		.pgmap_owner	= gpusvm->device_private_page_owner,
> > +		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> > +		.fault_page	= page,
> > +	};
> > +	unsigned long npages;
> > +	struct page **pages;
> > +	dma_addr_t *dma_addr;
> > +	void *buf;
> > +	int i, err = 0;
> > +
> > +	mmap_assert_locked(gpusvm->mm);
> > +
> > +	/* Corner case where the VMA has been partially unmapped */
> > +	if (start < vas->vm_start)
> > +		start = vas->vm_start;
> > +	if (end > vas->vm_end)
> > +		end = vas->vm_end;
> > +
> > +	migrate.start = start;
> > +	migrate.end = end;
> > +	npages = npages_in_range(start, end);
> > +
> > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) +
> > sizeof(*dma_addr) +
> > +		       sizeof(*pages), GFP_KERNEL);
> > +	if (!buf) {
> > +		err = -ENOMEM;
> > +		goto err_out;
> > +	}
> > +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr))
> > * npages;
> > +
> > +	migrate.vma = vas;
> > +	migrate.src = buf;
> > +	migrate.dst = migrate.src + npages;
> > +
> > +	err = migrate_vma_setup(&migrate);
> > +	if (err)
> > +		goto err_free;
> > +
> > +	/* Raced with another CPU fault, nothing to do */
> > +	if (!migrate.cpages)
> > +		goto err_free;
> > +
> > +	err = drm_gpusvm_migrate_populate_sram_pfn(vas, npages,
> > +						   migrate.src,
> > migrate.dst,
> > +						   start);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> > dma_addr,
> > +					   migrate.dst, npages,
> > +					   DMA_BIDIRECTIONAL);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	for (i = 0; i < npages; ++i)
> > +		pages[i] = migrate_pfn_to_page(migrate.src[i]);
> > +
> > +	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr,
> > npages);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +err_finalize:
> > +	if (err)
> > +		drm_gpusvm_migration_put_pages(npages, migrate.dst);
> > +	migrate_vma_pages(&migrate);
> > +	migrate_vma_finalize(&migrate);
> > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr,
> > npages,
> > +				       DMA_BIDIRECTIONAL);
> > +err_free:
> > +	kvfree(buf);
> > +err_out:
> > +	mmap_assert_locked(gpusvm->mm);
> > +
> > +	return err;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_to_sram - Migrate (evict) GPU SVM range to
> > SRAM
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + * @ctx: GPU SVM context
> > + *
> > + * This function initiates the migration of the specified GPU SVM
> > range to
> > + * SRAM. It performs necessary checks and invokes the internal
> > migration
> > + * function for actual migration.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > +			       struct drm_gpusvm_range *range,
> > +			       const struct drm_gpusvm_ctx *ctx)
> > +{
> > +	u64 start = range->va.start, end = range->va.end;
> > +	struct mm_struct *mm = gpusvm->mm;
> > +	struct vm_area_struct *vas;
> > +	int err;
> > +	bool retry = false;
> > +
> > +	if (!ctx->mmap_locked) {
> > +		if (!mmget_not_zero(mm)) {
> > +			err = -EFAULT;
> > +			goto err_out;
> > +		}
> > +		if (ctx->trylock_mmap) {
> > +			if (!mmap_read_trylock(mm))  {
> > +				err =
> > drm_gpusvm_evict_to_sram(gpusvm, range);
> > +				goto err_mmput;
> > +			}
> > +		} else {
> > +			mmap_read_lock(mm);
> > +		}
> > +	}
> > +
> > +	mmap_assert_locked(mm);
> > +
> > +	/*
> > +	 * Loop required to find all VMAs for the corner
> > case when
> > +	 * VRAM backing has been partially unmapped from MM's
> > address space.
> > +	 */
> > +again:
> > +	vas = find_vma(mm, start);
> > +	if (!vas) {
> > +		if (!retry)
> > +			err = -ENOENT;
> > +		goto err_mmunlock;
> > +	}
> > +
> > +	if (end <= vas->vm_start || start >= vas->vm_end) {
> > +		if (!retry)
> > +			err = -EINVAL;
> > +		goto err_mmunlock;
> > +	}
> > +
> > +	err = __drm_gpusvm_migrate_to_sram(gpusvm, vas, NULL, start,
> > end);
> > +	if (err)
> > +		goto err_mmunlock;
> > +
> > +	if (vas->vm_end < end) {
> > +		retry = true;
> > +		start = vas->vm_end;
> > +		goto again;
> > +	}
> > +
> > +	if (!ctx->mmap_locked) {
> > +		mmap_read_unlock(mm);
> > +		/*
> > +		 * Using mmput_async as this function can be called
> > while
> > +		 * holding a dma-resv lock, and a final put can grab
> > the mmap
> > +		 * lock, causing a lock inversion.
> > +		 */
> > +		mmput_async(mm);
> > +	}
> > +
> > +	return 0;
> > +
> > +err_mmunlock:
> > +	if (!ctx->mmap_locked)
> > +		mmap_read_unlock(mm);
> > +err_mmput:
> > +	if (!ctx->mmap_locked)
> > +		mmput_async(mm);
> > +err_out:
> > +	return err;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_page_free - Put GPU SVM zone device data associated
> > with a page
> > + * @page: Pointer to the page
> > + *
> > + * This function is a callback used to put the GPU SVM zone device
> > data
> > + * associated with a page when it is being released.
> > + */
> > +static void drm_gpusvm_page_free(struct page *page)
> > +{
> > +	drm_gpusvm_zdd_put(page->zone_device_data);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_to_ram - Migrate GPU SVM range to RAM (page
> > fault handler)
> > + * @vmf: Pointer to the fault information structure
> > + *
> > + * This function is a page fault handler used to migrate a GPU SVM
> > range to RAM.
> > + * It retrieves the GPU SVM range information from the faulting page
> > and invokes
> > + * the internal migration function to migrate the range back to RAM.
> > + *
> > + * Returns:
> > + * VM_FAULT_SIGBUS on failure, 0 on success.
> > + */
> > +static vm_fault_t drm_gpusvm_migrate_to_ram(struct vm_fault *vmf)
> > +{
> > +	struct drm_gpusvm_zdd *zdd = vmf->page->zone_device_data;
> > +	int err;
> > +
> > +	err = __drm_gpusvm_migrate_to_sram(zdd->range->gpusvm,
> > +					   vmf->vma, vmf->page,
> > +					   zdd->range->va.start,
> > +					   zdd->range->va.end);
> > +
> > +	return err ? VM_FAULT_SIGBUS : 0;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_pagemap_ops - Device page map operations for GPU SVM
> > + */
> > +static const struct dev_pagemap_ops drm_gpusvm_pagemap_ops = {
> > +	.page_free = drm_gpusvm_page_free,
> > +	.migrate_to_ram = drm_gpusvm_migrate_to_ram,
> > +};
> > +
> > +/**
> > + * drm_gpusvm_pagemap_ops_get - Retrieve GPU SVM device page map
> > operations
> > + *
> > + * Returns:
> > + * Pointer to the GPU SVM device page map operations structure.
> > + */
> > +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void)
> > +{
> > +	return &drm_gpusvm_pagemap_ops;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_has_mapping - Check if GPU SVM has mapping for the
> > given address range
> > + * @gpusvm: Pointer to the GPU SVM structure.
> > + * @start: Start address
> > + * @end: End address
> > + *
> > + * Returns:
> > + * True if GPU SVM has mapping, False otherwise
> > + */
> > +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start,
> > u64 end)
> > +{
> > +	struct drm_gpusvm_notifier *notifier;
> > +
> > +	drm_gpusvm_for_each_notifier(notifier, gpusvm, start, end) {
> > +		struct drm_gpusvm_range *range = NULL;
> > +
> > +		drm_gpusvm_for_each_range(range, notifier, start,
> > end)
> > +			return true;
> > +	}
> > +
> > +	return false;
> > +}
> > diff --git a/drivers/gpu/drm/xe/drm_gpusvm.h
> > b/drivers/gpu/drm/xe/drm_gpusvm.h
> > new file mode 100644
> > index 000000000000..0ea70f8534a8
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/drm_gpusvm.h
> > @@ -0,0 +1,415 @@
> > +/* SPDX-License-Identifier: MIT */
> > +/*
> > + * Copyright © 2024 Intel Corporation
> > + */
> > +
> > +#ifndef __DRM_GPUSVM_H__
> > +#define __DRM_GPUSVM_H__
> > +
> > +#include <linux/kref.h>
> > +#include <linux/mmu_notifier.h>
> > +#include <linux/workqueue.h>
> > +
> > +struct dev_pagemap_ops;
> > +struct drm_device;
> > +struct drm_gpusvm;
> > +struct drm_gpusvm_notifier;
> > +struct drm_gpusvm_ops;
> > +struct drm_gpusvm_range;
> > +
> > +/**
> > + * struct drm_gpusvm_ops - Operations structure for GPU SVM
> > + *
> > + * This structure defines the operations for GPU Shared Virtual
> > Memory (SVM).
> > + * These operations are provided by the GPU driver to manage SVM
> > ranges and
> > + * perform operations such as migration between VRAM and system RAM.
> > + */
> > +struct drm_gpusvm_ops {
> > +	/**
> > +	 * @notifier_alloc: Allocate a GPU SVM notifier (optional)
> > +	 *
> > +	 * This function shall allocate a GPU SVM notifier.
> > +	 *
> > +	 * Returns:
> > +	 * Pointer to the allocated GPU SVM notifier on success,
> > NULL on failure.
> > +	 */
> > +	struct drm_gpusvm_notifier *(*notifier_alloc)(void);
> > +
> > +	/**
> > +	 * @notifier_free: Free a GPU SVM notifier (optional)
> > +	 * @notifier: Pointer to the GPU SVM notifier to be freed
> > +	 *
> > +	 * This function shall free a GPU SVM notifier.
> > +	 */
> > +	void (*notifier_free)(struct drm_gpusvm_notifier *notifier);
> > +
> > +	/**
> > +	 * @range_alloc: Allocate a GPU SVM range (optional)
> > +	 * @gpusvm: Pointer to the GPU SVM
> > +	 *
> > +	 * This function shall allocate a GPU SVM range.
> > +	 *
> > +	 * Returns:
> > +	 * Pointer to the allocated GPU SVM range on success, NULL
> > on failure.
> > +	 */
> > +	struct drm_gpusvm_range *(*range_alloc)(struct drm_gpusvm
> > *gpusvm);
> > +
> > +	/**
> > +	 * @range_free: Free a GPU SVM range (optional)
> > +	 * @range: Pointer to the GPU SVM range to be freed
> > +	 *
> > +	 * This function shall free a GPU SVM range.
> > +	 */
> > +	void (*range_free)(struct drm_gpusvm_range *range);
> > +
> > +	/**
> > +	 * @vram_release: Release VRAM allocation (optional)
> > +	 * @vram_allocation: Driver-private pointer to the VRAM
> > allocation
> > +	 *
> > +	 * This function shall release VRAM allocation and expects
> > to drop a
> > +	 * reference to VRAM allocation.
> > +	 */
> > +	void (*vram_release)(void *vram_allocation);
> > +
> > +	/**
> > +	 * @populate_vram_pfn: Populate VRAM PFN (required for
> > migration)
> > +	 * @gpusvm: Pointer to the GPU SVM
> > +	 * @vram_allocation: Driver-private pointer to the VRAM
> > allocation
> > +	 * @npages: Number of pages to populate
> > +	 * @pfn: Array of page frame numbers to populate
> > +	 *
> > +	 * This function shall populate VRAM page frame numbers
> > (PFN).
> > +	 *
> > +	 * Returns:
> > +	 * 0 on success, a negative error code on failure.
> > +	 */
> > +	int (*populate_vram_pfn)(struct drm_gpusvm *gpusvm,
> > +				 void *vram_allocation,
> > +				 unsigned long npages,
> > +				 unsigned long *pfn);
> > +
> > +	/**
> > +	 * @copy_to_vram: Copy to VRAM (required for migration)
> > +	 * @gpusvm: Pointer to the GPU SVM
> > +	 * @pages: Pointer to array of VRAM pages (destination)
> > +	 * @dma_addr: Pointer to array of DMA addresses (source)
> > +	 * @npages: Number of pages to copy
> > +	 *
> > +	 * This function shall copy pages to VRAM.
> > +	 *
> > +	 * Returns:
> > +	 * 0 on success, a negative error code on failure.
> > +	 */
> > +	int (*copy_to_vram)(struct drm_gpusvm *gpusvm,
> > +			    struct page **pages,
> > +			    dma_addr_t *dma_addr,
> > +			    unsigned long npages);
> > +
> > +	/**
> > +	 * @copy_to_sram: Copy to system RAM (required for
> > migration)
> > +	 * @gpusvm: Pointer to the GPU SVM
> > +	 * @pages: Pointer to array of VRAM pages (source)
> > +	 * @dma_addr: Pointer to array of DMA addresses
> > (destination)
> > +	 * @npages: Number of pages to copy
> > +	 *
> > +	 * This function shall copy pages to system RAM.
> > +	 *
> > +	 * Returns:
> > +	 * 0 on success, a negative error code on failure.
> > +	 */
> > +	int (*copy_to_sram)(struct drm_gpusvm *gpusvm,
> > +			    struct page **pages,
> > +			    dma_addr_t *dma_addr,
> > +			    unsigned long npages);
> > +
> > +	/**
> > +	 * @invalidate: Invalidate GPU SVM notifier (required)
> > +	 * @gpusvm: Pointer to the GPU SVM
> > +	 * @notifier: Pointer to the GPU SVM notifier
> > +	 * @mmu_range: Pointer to the mmu_notifier_range structure
> > +	 *
> > +	 * This function shall invalidate the GPU page tables. It
> > can safely
> > +	 * walk the notifier range RB tree/list in this function.
> > Called while
> > +	 * holding the notifier lock.
> > +	 */
> > +	void (*invalidate)(struct drm_gpusvm *gpusvm,
> > +			   struct drm_gpusvm_notifier *notifier,
> > +			   const struct mmu_notifier_range
> > *mmu_range);
> > +};
> > +
> > +/**
> > + * struct drm_gpusvm_notifier - Structure representing a GPU SVM
> > notifier
> > + *
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: MMU interval notifier
> > + * @interval: Interval for the notifier
> > + * @rb: Red-black tree node for the parent GPU SVM structure
> > notifier tree
> > + * @root: Cached root node of the RB tree containing ranges
> > + * @range_list: List head of ranges in the same order
> > they appear in
> > + *              interval tree. This is useful to keep iterating
> > ranges while
> > + *              doing modifications to RB tree.
> > + * @flags.removed: Flag indicating whether the MMU interval notifier
> > has been
> > + *                 removed
> > + *
> > + * This structure represents a GPU SVM notifier.
> > + */
> > +struct drm_gpusvm_notifier {
> > +	struct drm_gpusvm *gpusvm;
> > +	struct mmu_interval_notifier notifier;
> > +	struct {
> > +		u64 start;
> > +		u64 end;
> > +	} interval;
> > +	struct {
> > +		struct rb_node node;
> > +		struct list_head entry;
> > +		u64 __subtree_last;
> > +	} rb;
> > +	struct rb_root_cached root;
> > +	struct list_head range_list;
> > +	struct {
> > +		u32 removed : 1;
> > +	} flags;
> > +};
> > +
> > +/**
> > + * struct drm_gpusvm_range - Structure representing a GPU SVM range
> > + *
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier
> > + * @refcount: Reference count for the range
> > + * @rb: Red-black tree node for the parent GPU SVM notifier
> > structure range tree
> > + * @va: Virtual address range
> > + * @notifier_seq: Notifier sequence number of the range's pages
> > + * @pages: Pointer to the array of pages (if backing store is in
> > VRAM)
> > + * @dma_addr: DMA address array (if backing store is SRAM and DMA
> > mapped)
> > + * @vram_allocation: Driver-private pointer to the VRAM allocation
> > + * @order: Order of dma mapping. i.e. PAGE_SIZE << order is mapping
> > size
> > + * @flags.migrate_vram: Flag indicating whether the range can be
> > migrated to VRAM
> > + * @flags.unmapped: Flag indicating if the range has been unmapped
> > + * @flags.partial_unmap: Flag indicating if the range has been
> > partially unmapped
> > + * @flags.has_vram_pages: Flag indicating if the range has vram
> > pages
> > + * @flags.has_dma_mapping: Flag indicating if the range has a DMA
> > mapping
> > + * @flags.kfree_mapping: Flag indicating @dma_addr is a compact
> > allocation based
> > + *                       on @order, which is released via kfree()
> > + *
> > + * This structure represents a GPU SVM range used for tracking
> > memory ranges
> > + * mapped in a DRM device.
> > + */
> > +struct drm_gpusvm_range {
> > +	struct drm_gpusvm *gpusvm;
> > +	struct drm_gpusvm_notifier *notifier;
> > +	struct kref refcount;
> > +	struct {
> > +		struct rb_node node;
> > +		struct list_head entry;
> > +		u64 __subtree_last;
> > +	} rb;
> > +	struct {
> > +		u64 start;
> > +		u64 end;
> > +	} va;
> > +	unsigned long notifier_seq;
> > +	union {
> > +		struct page **pages;
> > +		dma_addr_t *dma_addr;
> > +	};
> > +	void *vram_allocation;
> > +	u16 order;
> > +	struct {
> > +		/* All flags below must be set upon creation */
> > +		u16 migrate_vram : 1;
> > +		/* All flags below must be set / cleared under
> > notifier lock */
> > +		u16 unmapped : 1;
> > +		u16 partial_unmap : 1;
> > +		u16 has_vram_pages : 1;
> > +		u16 has_dma_mapping : 1;
> > +		u16 kfree_mapping : 1;
> > +	} flags;
> > +};
> > +
> > +/**
> > + * struct drm_gpusvm - GPU SVM structure
> > + *
> > + * @name: Name of the GPU SVM
> > + * @drm: Pointer to the DRM device structure
> > + * @mm: Pointer to the mm_struct for the address space
> > + * @device_private_page_owner: Device private pages owner
> > + * @mm_start: Start address of GPU SVM
> > + * @mm_range: Range of the GPU SVM
> > + * @notifier_size: Size of individual notifiers
> > + * @ops: Pointer to the operations structure for GPU SVM
> > + * @chunk_sizes: Pointer to the array of chunk sizes used in range
> > allocation.
> > + *               Entries should be powers of 2 in descending order.
> > + * @num_chunks: Number of chunks
> > + * @notifier_lock: Read-write semaphore for protecting notifier
> > operations
> > + * @zdd_wq: Workqueue for deferred work on zdd destruction
> > + * @root: Cached root node of the Red-Black tree containing GPU SVM
> > notifiers
> > + * @notifier_list: List head of notifiers in the same
> > order they
> > + *                 appear in interval tree. This is useful to keep
> > iterating
> > + *                 notifiers while doing modifications to RB tree.
> > + *
> > + * This structure represents a GPU SVM (Shared Virtual Memory) used
> > for tracking
> > + * memory ranges mapped in a DRM (Direct Rendering Manager) device.
> > + *
> > + * No reference counting is provided, as this is expected to be
> > embedded in the
> > + * driver VM structure along with the struct drm_gpuvm, which
> > handles reference
> > + * counting.
> > + */
> > +struct drm_gpusvm {
> > +	const char *name;
> > +	struct drm_device *drm;
> > +	struct mm_struct *mm;
> > +	void *device_private_page_owner;
> > +	u64 mm_start;
> > +	u64 mm_range;
> > +	u64 notifier_size;
> > +	const struct drm_gpusvm_ops *ops;
> > +	const u64 *chunk_sizes;
> > +	int num_chunks;
> > +	struct rw_semaphore notifier_lock;
> > +	struct workqueue_struct *zdd_wq;
> > +	struct rb_root_cached root;
> > +	struct list_head notifier_list;
> > +};
> > +
> > +/**
> > + * struct drm_gpusvm_ctx - DRM GPU SVM context
> > + *
> > + * @mmap_locked: mmap lock is locked
> > + * @trylock_mmap: trylock mmap lock, used to avoid locking
> > inversions
> > + *                (e.g. dma-resv -> mmap lock)
> > + * @in_notifier: entering from a MMU notifier
> > + * @read_only: operating on read-only memory
> > + * @vram_possible: possible to use VRAM
> > + * @prefault: prefault pages
> > + *
> > + * Context that DRM GPUSVM is operating in (i.e., user arguments).
> > + */
> > +struct drm_gpusvm_ctx {
> > +	u32 mmap_locked :1;
> > +	u32 trylock_mmap :1;
> > +	u32 in_notifier :1;
> > +	u32 read_only :1;
> > +	u32 vram_possible :1;
> > +	u32 prefault :1;
> > +};
> > +
> > +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> > +		    const char *name, struct drm_device *drm,
> > +		    struct mm_struct *mm, void
> > *device_private_page_owner,
> > +		    u64 mm_start, u64 mm_range, u64 notifier_size,
> > +		    const struct drm_gpusvm_ops *ops,
> > +		    const u64 *chunk_sizes, int num_chunks);
> > +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm);
> > +void drm_gpusvm_free(struct drm_gpusvm *gpusvm);
> > +
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64
> > fault_addr,
> > +				u64 gpuva_start, u64 gpuva_end,
> > +				const struct drm_gpusvm_ctx *ctx);
> > +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> > +			     struct drm_gpusvm_range *range);
> > +
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_get(struct drm_gpusvm_range *range);
> > +void drm_gpusvm_range_put(struct drm_gpusvm_range *range);
> > +
> > +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> > +				  struct drm_gpusvm_range *range);
> > +
> > +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> > +			       struct drm_gpusvm_range *range,
> > +			       const struct drm_gpusvm_ctx *ctx);
> > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> > +				  struct drm_gpusvm_range *range,
> > +				  const struct drm_gpusvm_ctx *ctx);
> > +
> > +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> > +			       struct drm_gpusvm_range *range,
> > +			       void *vram_allocation,
> > +			       const struct drm_gpusvm_ctx *ctx);
> > +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > +			       struct drm_gpusvm_range *range,
> > +			       const struct drm_gpusvm_ctx *ctx);
> > +
> > +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void);
> > +
> > +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start,
> > u64 end);
> > +
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64
> > start, u64 end);
> > +
> > +/**
> > + * drm_gpusvm_notifier_lock - Lock GPU SVM notifier
> > + * @gpusvm__: Pointer to the GPU SVM structure.
> > + *
> > + * Abstracts client usage of the GPU SVM notifier lock; takes the lock.
> > + */
> > +#define drm_gpusvm_notifier_lock(gpusvm__)	\
> > +	down_read(&(gpusvm__)->notifier_lock)
> > +
> > +/**
> > + * drm_gpusvm_notifier_unlock - Unlock GPU SVM notifier
> > + * @gpusvm__: Pointer to the GPU SVM structure.
> > + *
> > + * Abstracts client usage of the GPU SVM notifier lock; drops the lock.
> > + */
> > +#define drm_gpusvm_notifier_unlock(gpusvm__)	\
> > +	up_read(&(gpusvm__)->notifier_lock)
> > +
> > +/**
> > + * __drm_gpusvm_range_next - Get the next GPU SVM range in the list
> > + * @range: a pointer to the current GPU SVM range
> > + *
> > + * Return: A pointer to the next drm_gpusvm_range if available, or
> > NULL if the
> > + *         current range is the last one or if the input range is
> > NULL.
> > + */
> > +static inline struct drm_gpusvm_range *
> > +__drm_gpusvm_range_next(struct drm_gpusvm_range *range)
> > +{
> > +	if (range && !list_is_last(&range->rb.entry,
> > +				   &range->notifier->range_list))
> > +		return list_next_entry(range, rb.entry);
> > +
> > +	return NULL;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_for_each_range - Iterate over GPU SVM ranges in a
> > notifier
> > + * @range__: Iterator variable for the ranges. If set, it indicates
> > the start of
> > + *	     the iterator. If NULL, call drm_gpusvm_range_find() to
> > get the range.
> > + * @notifier__: Pointer to the GPU SVM notifier
> > + * @start__: Start address of the range
> > + * @end__: End address of the range
> > + *
> > + * This macro is used to iterate over GPU SVM ranges in a notifier.
> > It is safe
> > + * to use while holding the driver SVM lock or the notifier lock.
> > + */
> > +#define drm_gpusvm_for_each_range(range__, notifier__, start__,
> > end__)	\
> > +	for ((range__) = (range__)
> > ?:					\
> > +	     drm_gpusvm_range_find((notifier__), (start__),
> > (end__));	\
> > +	     (range__) && (range__->va.start <
> > (end__));		\
> > +	     (range__) = __drm_gpusvm_range_next(range__))
> > +
> > +/**
> > + * drm_gpusvm_range_set_unmapped - Mark a GPU SVM range as unmapped
> > + * @range: Pointer to the GPU SVM range structure.
> > + * @mmu_range: Pointer to the MMU notifier range structure.
> > + *
> > + * This function marks a GPU SVM range as unmapped and sets the
> > partial_unmap flag
> > + * if the range partially falls within the provided MMU notifier
> > range.
> > + */
> > +static inline void
> > +drm_gpusvm_range_set_unmapped(struct drm_gpusvm_range *range,
> > +			      const struct mmu_notifier_range
> > *mmu_range)
> > +{
> > +	lockdep_assert_held_write(&range->gpusvm->notifier_lock);
> > +
> > +	range->flags.unmapped = true;
> > +	if (range->va.start < mmu_range->start ||
> > +	    range->va.end > mmu_range->end)
> > +		range->flags.partial_unmap = true;
> > +}
> > +
> > +#endif /* __DRM_GPUSVM_H__ */
> 

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory
  2024-08-29 17:45     ` Matthew Brost
@ 2024-08-29 18:13       ` Matthew Brost
  2024-08-29 19:18       ` Thomas Hellström
  1 sibling, 0 replies; 100+ messages in thread
From: Matthew Brost @ 2024-08-29 18:13 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: intel-xe, dri-devel, airlied, christian.koenig, matthew.auld,
	daniel

On Thu, Aug 29, 2024 at 05:45:07PM +0000, Matthew Brost wrote:
> On Thu, Aug 29, 2024 at 11:16:49AM +0200, Thomas Hellström wrote:
> > Hi, Matt. 
> > 
> > Some initial design comments / questions:
> > 
> > On Tue, 2024-08-27 at 19:48 -0700, Matthew Brost wrote:
> > > This patch introduces support for GPU Shared Virtual Memory (SVM) in
> > > the
> > > Direct Rendering Manager (DRM) subsystem. SVM allows for seamless
> > > sharing of memory between the CPU and GPU, enhancing performance and
> > > flexibility in GPU computing tasks.
> > > 
> > > The patch adds the necessary infrastructure for SVM, including data
> > > structures and functions for managing SVM ranges and notifiers. It
> > > also
> > > provides mechanisms for allocating, deallocating, and migrating
> > > memory
> > > regions between system RAM and GPU VRAM.
> > > 
> > > This mid-layer is largely inspired by GPUVM.
> > > 
> > > Cc: Dave Airlie <airlied@redhat.com>
> > > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > Cc: Christian König <christian.koenig@amd.com>
> > > Cc: <dri-devel@lists.freedesktop.org>
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/Makefile     |    3 +-
> > >  drivers/gpu/drm/xe/drm_gpusvm.c | 2174
> > > +++++++++++++++++++++++++++++++
> > >  drivers/gpu/drm/xe/drm_gpusvm.h |  415 ++++++
> > >  3 files changed, 2591 insertions(+), 1 deletion(-)
> > >  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.c
> > >  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.h
> > > 
> > > diff --git a/drivers/gpu/drm/xe/Makefile
> > > b/drivers/gpu/drm/xe/Makefile
> > > index b9670ae09a9e..b8fc2ee58f1a 100644
> > > --- a/drivers/gpu/drm/xe/Makefile
> > > +++ b/drivers/gpu/drm/xe/Makefile
> > > @@ -25,7 +25,8 @@ $(obj)/generated/%_wa_oob.c
> > > $(obj)/generated/%_wa_oob.h: $(obj)/xe_gen_wa_oob \
> > >  
> > >  # core driver code
> > >  
> > > -xe-y += xe_bb.o \
> > > +xe-y += drm_gpusvm.o \
> > > +	xe_bb.o \
> > >  	xe_bo.o \
> > >  	xe_bo_evict.o \
> > >  	xe_devcoredump.o \
> > > diff --git a/drivers/gpu/drm/xe/drm_gpusvm.c
> > > b/drivers/gpu/drm/xe/drm_gpusvm.c
> > > new file mode 100644
> > > index 000000000000..fc1e44e6ae72
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/xe/drm_gpusvm.c
> > > @@ -0,0 +1,2174 @@
> > > +// SPDX-License-Identifier: MIT
> > > +/*
> > > + * Copyright © 2024 Intel Corporation
> > > + *
> > > + * Authors:
> > > + *     Matthew Brost <matthew.brost@intel.com>
> > > + */
> > > +
> > > +#include <linux/dma-mapping.h>
> > > +#include <linux/interval_tree_generic.h>
> > > +#include <linux/hmm.h>
> > > +#include <linux/memremap.h>
> > > +#include <linux/migrate.h>
> > > +#include <linux/mm_types.h>
> > > +#include <linux/pagemap.h>
> > > +#include <linux/slab.h>
> > > +
> > > +#include <drm/drm_device.h>
> > > +#include "drm_gpusvm.h"
> > > +
> > > +/**
> > > + * DOC: Overview
> > > + *
> > > + * GPU Shared Virtual Memory (GPU SVM) layer for the Direct
> > > Rendering Manager (DRM)
> > > + *
> > > + * The GPU SVM layer is a component of the DRM framework designed to
> > > manage shared
> > > + * virtual memory between the CPU and GPU. It enables efficient data
> > > exchange and
> > > + * processing for GPU-accelerated applications by allowing memory
> > > sharing and
> > > + * synchronization between the CPU's and GPU's virtual address
> > > spaces.
> > > + *
> > > + * Key GPU SVM Components:
> > > + * - Notifiers: Used for tracking memory intervals and
> > > notifying the
> > > + *		GPU of changes, notifiers are sized based on a GPU
> > > SVM
> > > + *		initialization parameter, with a recommendation of
> > > 512M or
> > > + *		larger. They maintain a Red-Black tree and a list of
> > > ranges that
> > > + *		fall within the notifier interval. Notifiers are
> > > tracked within
> > > + *		a GPU SVM Red-Black tree and list and are
> > > dynamically inserted
> > > + *		or removed as ranges within the interval are created
> > > or
> > > + *		destroyed.
> > 
> > What is the benefit of this extra layer compared to direct insertion of
> > ranges using mmu_interval_notifier_insert?
> > 
> > IIRC the argument made previously about having wide notifiers was that
> > the rb tree lookups inside the core were costly and if there were only
> > a few, then the rb tree lookups within a notifier range could be
> > replaced with the page-table radix-tree-like lookup, so each lookup
> > complexity would be O(log(n_notifiers) + page_table_depth).
> > 
> > But now we have first an rb-tree lookup in the core and then an rb-tree
> > lookup within each notifier, yielding O(log(n_ranges)).
> > 
> > I can see a small benefit in that inserting directly into the core rb-
> > tree will block pending ongoing invalidations, but at a cost of an
> > extra multiplexing layer.
> > 
> 
> So when the notifier is triggered, the search covers a smaller range. In a
> perfect world I'd eventually like to drop the SVM range completely. There
> are a lot of changes required in Xe to make that possible, and I'm not
> entirely convinced it is possible or that the ROI is worth it (additional
> complexity vs. perf benefit). For now, this was a relatively simple way to
> get SVM working (it mirrors both AMD's and Nvidia's implementations with
> respect to having a range concept) but is also flexible in that the
> notifier size can easily be tweaked via a modparam [1], following Jason's
> suggestion of larger notifiers.
> 
> [1] https://patchwork.freedesktop.org/patch/611007/?series=137870&rev=1
> 

Sorry, double reply. Also worth noting that attaching ranges to the
notifiers easily lets us know when a notifier can be removed (no ranges
attached).
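
Concretely, that check is just an emptiness test on the notifier's range
list / tree. Rough illustrative sketch only -- in the series this lives in
the core removal path, and the helper name below is made up:

	/*
	 * Hypothetical helper, not code from the patch. Assumes the
	 * notifier lock (or the driver SVM lock) is held so the range
	 * list / tree cannot change underneath us.
	 */
	static bool driver_notifier_can_be_removed(struct drm_gpusvm_notifier *notifier)
	{
		return list_empty(&notifier->range_list) &&
		       RB_EMPTY_ROOT(&notifier->root.rb_root);
	}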

Lastly, if it isn't clear via [1]: we can use one notifier for the entire
VA space, and this design works for that too.
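
E.g., a single notifier covering the whole VA space is simply
notifier_size == mm_range at init time. Rough sketch with made-up values
(48-bit VA, 2M/64K/4K chunk array) and hypothetical names (driver_vm,
driver_gpusvm_ops) -- not lifted from the Xe patches:

	static const u64 driver_chunk_sizes[] = { SZ_2M, SZ_64K, SZ_4K };

	static int driver_svm_init(struct driver_vm *vm, struct drm_device *drm)
	{
		return drm_gpusvm_init(&vm->svm, "driver-svm", drm,
				       current->mm,
				       vm,		/* device_private_page_owner */
				       0, 1ull << 48,	/* mm_start, mm_range */
				       1ull << 48,	/* notifier_size == whole VA */
				       &driver_gpusvm_ops,
				       driver_chunk_sizes,
				       ARRAY_SIZE(driver_chunk_sizes));
	}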

Matt

> > > + * - Ranges: Represent memory ranges mapped in a DRM device and
> > > managed
> > > + *	     by GPU SVM. They are sized based on an array of chunk
> > > sizes, which
> > > + *	     is a GPU SVM initialization parameter, and the CPU
> > > address space.
> > > + *	     Upon GPU fault, the largest aligned chunk that fits
> > > within the
> > > + *	     faulting CPU address space is chosen for the range
> > > size. Ranges are
> > > + *	     expected to be dynamically allocated on GPU fault and
> > > removed on an
> > > + *	     MMU notifier UNMAP event. As mentioned above, ranges
> > > are tracked in
> > > + *	     a notifier's Red-Black tree.
> > 
> > How do ranges and chunks map to
> >  
> > a) Prefaulting granularity
> > b) Migration granularity?
> > 
> > > + * - Operations: Define the interface for driver-specific SVM
> > > operations such as
> > > + *		 allocation, page collection, migration,
> > > invalidations, and VRAM
> > > + *		 release.
> > > + *
> > > + * This layer provides interfaces for allocating, mapping,
> > > migrating, and
> > > + * releasing memory ranges between the CPU and GPU. It handles all
> > > core memory
> > > + * management interactions (DMA mapping, HMM, and migration) and
> > > provides
> > > + * driver-specific virtual functions (vfuncs). This infrastructure
> > > is sufficient
> > > + * to build the expected driver components for an SVM implementation
> > > as detailed
> > > + * below.
> > > + *
> > > + * Expected Driver Components:
> > > + * - GPU page fault handler: Used to create ranges and notifiers
> > > based on the
> > > + *			     fault address, optionally migrate the
> > > range to
> > > + *			     VRAM, and create GPU bindings.
> > > + * - Garbage collector: Used to destroy GPU bindings for ranges.
> > > Ranges are
> > > + *			expected to be added to the garbage
> > > collector upon
> > > + *			MMU_NOTIFY_UNMAP event.
> > > + */
> > > +
> > > +/**
> > > + * DOC: Locking
> > > + *
> > > + * GPU SVM handles locking for core MM interactions, i.e., it
> > > locks/unlocks the
> > > + * mmap lock as needed. Alternatively, if the driver prefers to
> > > handle the mmap
> > > + * lock itself, a 'locked' argument is provided to the functions
> > > that require
> > > + * the mmap lock. This option may be useful for drivers that need to
> > > call into
> > > + * GPU SVM while also holding a dma-resv lock, thus preventing
> > > locking
> > > + * inversions between the mmap and dma-resv locks.
> > > + *
> > > + * GPU SVM introduces a global notifier lock, which safeguards the
> > > notifier's
> > > + * range RB tree and list, as well as the range's DMA mappings and
> > > sequence
> > > + * number. GPU SVM manages all necessary locking and unlocking
> > > operations,
> > > + * except for the recheck of the range's sequence number
> > > + * (mmu_interval_read_retry) when the driver is committing GPU
> > > bindings. This
> > > + * lock corresponds to the 'driver->update' lock mentioned in the
> > > HMM
> > > + * documentation (TODO: Link). Future revisions may transition from
> > > a GPU SVM
> > > + * global lock to a per-notifier lock if finer-grained locking is
> > > deemed
> > > + * necessary.
> > > + *
> > > + * In addition to the locking mentioned above, the driver should
> > > implement a
> > > + * lock to safeguard core GPU SVM function calls that modify state,
> > > such as
> > > + * drm_gpusvm_range_find_or_insert and drm_gpusvm_range_remove.
> > > Alternatively,
> > > + * these core functions can be called within a single kernel thread,
> > > for
> > > + * instance, using an ordered work queue. This lock is denoted as
> > > + * 'driver_svm_lock' in code examples.
> > > + */
> > > +
> > > +/**
> > > + * DOC: Migration
> > > + *
> > > + * The migration support is quite simple, allowing migration between
> > > SRAM and
> > > + * VRAM at the range granularity. For example, GPU SVM currently
> > > does not
> > > + * support mixing SRAM and VRAM pages within a range. This means
> > > that upon GPU
> > > + * fault, the entire range can be migrated to VRAM, and upon CPU
> > > fault, the
> > > + * entire range is migrated to SRAM.
> > > + *
> > > + * The reasoning for only supporting range granularity is as
> > > follows: it
> > > + * simplifies the implementation, and range sizes are driver-defined
> > > and should
> > > + * be relatively small.
> > > + */
> > > +
> > > +/**
> > > + * DOC: Partial Unmapping of Ranges
> > > + *
> > > + * Partial unmapping of ranges (e.g., 1M out of 2M is unmapped by
> > > CPU resulting
> > > + * in MMU_NOTIFY_UNMAP event) presents several challenges, with the
> > > main one
> > > + * being that a subset of the range still has CPU and GPU mappings.
> > > If the
> > > + * backing store for the range is in VRAM, a subset of the backing
> > > store has
> > > + * references. One option would be to split the range and VRAM
> > > backing store,
> > > + * but the implementation for this would be quite complicated. Given
> > > that
> > > + * partial unmappings are rare and driver-defined range sizes are
> > > relatively
> > > + * small, GPU SVM does not support splitting of ranges.
> > > + *
> > > + * With no support for range splitting, upon partial unmapping of a
> > > range, the
> > > + * driver is expected to invalidate and destroy the entire range. If
> > > the range
> > > + * has VRAM as its backing, the driver is also expected to migrate
> > > any remaining
> > > + * pages back to SRAM.
> > 
> > So what happens if we get a one-page invalidation, say protection
> > change event, or NUMA accounting event, in the middle of a range? Can
> > we unmap just that single gpu pte covering that range, that is, how do
> > the ranges map to invalidation granularity? Does this differ between
> > igfx and dgfx?
> 
> Well, the idea of chunks is that a range should be 1 GPU page (the chunk
> array in Xe is 4k, 64k, and 2M). The design is flexible enough that this
> doesn't have to be true, but it is optimized for the expectation that each
> range is most likely 1 GPU page. If this isn't true, then all GPU pages in
> the range are invalidated, which isn't ideal but keeps things simple, which
> IMO far outweighs the potential benefits. In theory a driver could implement
> splitting / partial invalidations too with a couple of updates to GPUSVM,
> but that would largely be a driver implementation rather than GPUSVM.
> 
> No difference between igfx and dgfx.
> 
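
One more note to make the chunk -> range relationship concrete: on fault
the range size is conceptually just the largest chunk that is aligned to
the fault address and still fits inside the limits (intersection of the
faulting CPU VMA and the GPU VA window). Very rough sketch, not the code
in the patch, helper name made up:

	static u64 driver_pick_range_size(struct drm_gpusvm *gpusvm,
					  u64 fault_addr,
					  u64 start_limit, u64 end_limit)
	{
		int i;

		/* chunk_sizes is descending powers of 2, last entry SZ_4K */
		for (i = 0; i < gpusvm->num_chunks; ++i) {
			u64 size = gpusvm->chunk_sizes[i];
			u64 start = ALIGN_DOWN(fault_addr, size);

			if (start >= start_limit && start + size <= end_limit)
				return size;
		}

		return gpusvm->chunk_sizes[gpusvm->num_chunks - 1];
	}
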
> You bring up a good point about protection changes; I likely haven't
> fully gotten that part of the implementation correct either. I can add this
> to my TODO list and also update my IGTs to exercise cases like this.
> 
> Matt
> 
> > 
> > Thanks,
> > Thomas
> > 
> > 
> > 
> > 
> > > + */
> > > +
> > > +/**
> > > + * DOC: Examples
> > > + *
> > > + * This section provides two examples of how to build the expected
> > > driver
> > > + * components: the GPU page fault handler and the garbage collector.
> > > A third
> > > + * example demonstrates a sample invalidation driver vfunc.
> > > + *
> > > + * The generic code provided does not include logic for complex
> > > migration
> > > + * policies, optimized invalidations, or other potentially required
> > > driver
> > > + * locking (e.g., DMA-resv locks).
> > > + *
> > > + * 1) GPU page fault handler
> > > + *
> > > + *	int driver_bind_range(struct drm_gpusvm *gpusvm, struct
> > > drm_gpusvm_range *range)
> > > + *	{
> > > + *		int err = 0;
> > > + *
> > > + *		driver_alloc_and_setup_memory_for_bind(gpusvm,
> > > range);
> > > + *
> > > + *		drm_gpusvm_notifier_lock(gpusvm);
> > > + *		if (drm_gpusvm_range_pages_valid(range))
> > > + *			driver_commit_bind(gpusvm, range);
> > > + *		else
> > > + *			err = -EAGAIN;
> > > + *		drm_gpusvm_notifier_unlock(gpusvm);
> > > + *
> > > + *		return err;
> > > + *	}
> > > + *
> > > + *	int driver_gpu_fault(struct drm_gpusvm *gpusvm, u64
> > > fault_addr,
> > > + *			     u64 gpuva_start, u64 gpuva_end)
> > > + *	{
> > > + *		struct drm_gpusvm_ctx ctx = {};
> > > + *		int err;
> > > + *
> > > + *		driver_svm_lock();
> > > + *	retry:
> > > + *		// Always process UNMAPs first so view of GPU SVM
> > > ranges is current
> > > + *		driver_garbage_collector(gpusvm);
> > > + *
> > > + *		range = drm_gpusvm_range_find_or_insert(gpusvm,
> > > fault_addr,
> > > + *							gpuva_start,
> > > gpuva_end,
> > > + *						        &ctx);
> > > + *		if (IS_ERR(range)) {
> > > + *			err = PTR_ERR(range);
> > > + *			goto unlock;
> > > + *		}
> > > + *
> > > + *		if (driver_migration_policy(range)) {
> > > + *			bo = driver_alloc_bo();
> > > + *			err = drm_gpusvm_migrate_to_vram(gpusvm,
> > > range, bo, &ctx);
> > > + *			if (err)	// CPU mappings may have
> > > changed
> > > + *				goto retry;
> > > + *		}
> > > + *
> > > + *		err = drm_gpusvm_range_get_pages(gpusvm, range,
> > > &ctx);
> > > + *		if (err == -EFAULT || err == -EPERM)	// CPU
> > > mappings changed
> > > + *			goto retry;
> > > + *		else if (err)
> > > + *			goto unlock;
> > > + *
> > > + *		err = driver_bind_range(gpusvm, range);
> > > + *		if (err == -EAGAIN)	// CPU mappings changed
> > > + *			goto retry
> > > + *
> > > + *	unlock:
> > > + *		driver_svm_unlock();
> > > + *		return err;
> > > + *	}
> > > + *
> > > + * 2) Garbage Collector.
> > > + *
> > > + *	void __driver_garbage_collector(struct drm_gpusvm *gpusvm,
> > > + *					struct drm_gpusvm_range
> > > *range)
> > > + *	{
> > > + *		struct drm_gpusvm_ctx ctx = {};
> > > + *
> > > + *		assert_driver_svm_locked(gpusvm);
> > > + *
> > > + *		// Partial unmap, migrate any remaining VRAM pages
> > > back to SRAM
> > > + *		if (range->flags.partial_unmap)
> > > + *			drm_gpusvm_migrate_to_sram(gpusvm, range,
> > > &ctx);
> > > + *
> > > + *		driver_unbind_range(range);
> > > + *		drm_gpusvm_range_remove(gpusvm, range);
> > > + *	}
> > > + *
> > > + *	void driver_garbage_collector(struct drm_gpusvm *gpusvm)
> > > + *	{
> > > + *		assert_driver_svm_locked(gpusvm);
> > > + *
> > > + *		for_each_range_in_garbage_collector(gpusvm, range)
> > > + *			__driver_garbage_collector(gpusvm, range);
> > > + *	}
> > > + *
> > > + * 3) Invalidation driver vfunc.
> > > + *
> > > + *	void driver_invalidation(struct drm_gpusvm *gpusvm,
> > > + *				 struct drm_gpusvm_notifier
> > > *notifier,
> > > + *				 const struct mmu_notifier_range
> > > *mmu_range)
> > > + *	{
> > > + *		struct drm_gpusvm_ctx ctx = { .in_notifier = true,
> > > };
> > > + *		struct drm_gpusvm_range *range = NULL;
> > > + *
> > > + *		driver_invalidate_device_tlb(gpusvm, mmu_range->start, mmu_range->end);
> > > + *
> > > + *		drm_gpusvm_for_each_range(range, notifier,
> > > mmu_range->start,
> > > + *					  mmu_range->end) {
> > > + *			drm_gpusvm_range_unmap_pages(gpusvm, range,
> > > &ctx);
> > > + *
> > > + *			if (mmu_range->event != MMU_NOTIFY_UNMAP)
> > > + *				continue;
> > > + *
> > > + *			drm_gpusvm_range_set_unmapped(range,
> > > mmu_range);
> > > + *			driver_garbage_collector_add(gpusvm, range);
> > > + *		}
> > > + *	}
> > > + */
> > > +
> > > +#define DRM_GPUSVM_RANGE_START(_range)	((_range)->va.start)
> > > +#define DRM_GPUSVM_RANGE_END(_range)	((_range)->va.end - 1)
> > > +INTERVAL_TREE_DEFINE(struct drm_gpusvm_range, rb.node, u64,
> > > rb.__subtree_last,
> > > +		     DRM_GPUSVM_RANGE_START, DRM_GPUSVM_RANGE_END,
> > > +		     static __maybe_unused, range);
> > > +
> > > +#define DRM_GPUSVM_NOTIFIER_START(_notifier)	((_notifier)->interval.start)
> > > +#define DRM_GPUSVM_NOTIFIER_END(_notifier)	((_notifier)->interval.end - 1)
> > > +INTERVAL_TREE_DEFINE(struct drm_gpusvm_notifier, rb.node, u64,
> > > +		     rb.__subtree_last, DRM_GPUSVM_NOTIFIER_START,
> > > +		     DRM_GPUSVM_NOTIFIER_END, static __maybe_unused,
> > > notifier);
> > > +
> > > +/**
> > > + * npages_in_range() - Calculate the number of pages in a given
> > > range
> > > + * @start__: The start address of the range
> > > + * @end__: The end address of the range
> > > + *
> > > + * This macro calculates the number of pages in a given memory
> > > range,
> > > + * specified by the start and end addresses. It divides the
> > > difference
> > > + * between the end and start addresses by the page size (PAGE_SIZE)
> > > to
> > > + * determine the number of pages in the range.
> > > + *
> > > + * Return: The number of pages in the specified range.
> > > + */
> > > +#define npages_in_range(start__, end__)	\
> > > +	(((end__) - (start__)) >> PAGE_SHIFT)
> > > +
> > > +/**
> > > + * struct drm_gpusvm_zdd - GPU SVM zone device data
> > > + *
> > > + * @refcount: Reference count for the zdd
> > > + * @destroy_work: Work structure for asynchronous zdd destruction
> > > + * @range: Pointer to the GPU SVM range
> > > + * @vram_allocation: Driver-private pointer to the VRAM allocation
> > > + *
> > > + * This structure serves as a generic wrapper installed in
> > > + * page->zone_device_data. It provides infrastructure for looking up
> > > a range
> > > + * upon CPU page fault and asynchronously releasing VRAM once the
> > > CPU has no
> > > + * page references. Asynchronous release is useful because CPU page
> > > references
> > > + * can be dropped in IRQ contexts, while releasing VRAM likely
> > > requires sleeping
> > > + * locks.
> > > + */
> > > +struct drm_gpusvm_zdd {
> > > +	struct kref refcount;
> > > +	struct work_struct destroy_work;
> > > +	struct drm_gpusvm_range *range;
> > > +	void *vram_allocation;
> > > +};
> > > +
> > > +/**
> > > + * drm_gpusvm_zdd_destroy_work_func - Work function for destroying a
> > > zdd
> > > + * @w: Pointer to the work_struct
> > > + *
> > > + * This function releases VRAM, puts GPU SVM range, and frees zdd.
> > > + */
> > > +static void drm_gpusvm_zdd_destroy_work_func(struct work_struct *w)
> > > +{
> > > +	struct drm_gpusvm_zdd *zdd =
> > > +		container_of(w, struct drm_gpusvm_zdd,
> > > destroy_work);
> > > +	struct drm_gpusvm_range *range = zdd->range;
> > > +	struct drm_gpusvm *gpusvm = range->gpusvm;
> > > +
> > > +	if (gpusvm->ops->vram_release && zdd->vram_allocation)
> > > +		gpusvm->ops->vram_release(zdd->vram_allocation);
> > > +	drm_gpusvm_range_put(range);
> > > +	kfree(zdd);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_zdd_alloc - Allocate a zdd structure.
> > > + * @range: Pointer to the GPU SVM range.
> > > + *
> > > + * This function allocates and initializes a new zdd structure. It
> > > sets up the
> > > + * reference count, initializes the destroy work, and links the
> > > provided GPU SVM
> > > + * range.
> > > + *
> > > + * Returns:
> > > + * Pointer to the allocated zdd on success, NULL on failure.
> > > + */
> > > +static struct drm_gpusvm_zdd *
> > > +drm_gpusvm_zdd_alloc(struct drm_gpusvm_range *range)
> > > +{
> > > +	struct drm_gpusvm_zdd *zdd;
> > > +
> > > +	zdd = kmalloc(sizeof(*zdd), GFP_KERNEL);
> > > +	if (!zdd)
> > > +		return NULL;
> > > +
> > > +	kref_init(&zdd->refcount);
> > > +	INIT_WORK(&zdd->destroy_work,
> > > drm_gpusvm_zdd_destroy_work_func);
> > > +	zdd->range = drm_gpusvm_range_get(range);
> > > +	zdd->vram_allocation = NULL;
> > > +
> > > +	return zdd;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_zdd_get - Get a reference to a zdd structure.
> > > + * @zdd: Pointer to the zdd structure.
> > > + *
> > > + * This function increments the reference count of the provided zdd
> > > structure.
> > > + *
> > > + * Returns: Pointer to the zdd structure.
> > > + */
> > > +static struct drm_gpusvm_zdd *drm_gpusvm_zdd_get(struct
> > > drm_gpusvm_zdd *zdd)
> > > +{
> > > +	kref_get(&zdd->refcount);
> > > +	return zdd;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_zdd_destroy - Destroy a zdd structure.
> > > + * @ref: Pointer to the reference count structure.
> > > + *
> > > + * This function queues the destroy_work of the zdd for asynchronous
> > > destruction.
> > > + */
> > > +static void drm_gpusvm_zdd_destroy(struct kref *ref)
> > > +{
> > > +	struct drm_gpusvm_zdd *zdd =
> > > +		container_of(ref, struct drm_gpusvm_zdd, refcount);
> > > +	struct drm_gpusvm *gpusvm = zdd->range->gpusvm;
> > > +
> > > +	queue_work(gpusvm->zdd_wq, &zdd->destroy_work);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_zdd_put - Put a zdd reference.
> > > + * @zdd: Pointer to the zdd structure.
> > > + *
> > > + * This function decrements the reference count of the provided zdd
> > > structure
> > > + * and schedules its destruction if the count drops to zero.
> > > + */
> > > +static void drm_gpusvm_zdd_put(struct drm_gpusvm_zdd *zdd)
> > > +{
> > > +	kref_put(&zdd->refcount, drm_gpusvm_zdd_destroy);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_find - Find GPU SVM range from GPU SVM notifier
> > > + * @notifier: Pointer to the GPU SVM notifier structure.
> > > + * @start: Start address of the range
> > > + * @end: End address of the range
> > > + *
> > > + * Return: A pointer to the drm_gpusvm_range if found or NULL
> > > + */
> > > +struct drm_gpusvm_range *
> > > +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64
> > > start, u64 end)
> > > +{
> > > +	return range_iter_first(&notifier->root, start, end - 1);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_for_each_range_safe - Safely iterate over GPU SVM
> > > ranges in a notifier
> > > + * @range__: Iterator variable for the ranges
> > > + * @next__: Iterator variable for the ranges' temporary storage
> > > + * @notifier__: Pointer to the GPU SVM notifier
> > > + * @start__: Start address of the range
> > > + * @end__: End address of the range
> > > + *
> > > + * This macro is used to iterate over GPU SVM ranges in a notifier
> > > while
> > > + * removing ranges from it.
> > > + */
> > > +#define drm_gpusvm_for_each_range_safe(range__, next__, notifier__,
> > > start__, end__)	\
> > > +	for ((range__) = drm_gpusvm_range_find((notifier__),
> > > (start__), (end__)),	\
> > > +	     (next__) =
> > > __drm_gpusvm_range_next(range__);				\
> > > +	     (range__) && (range__->va.start <
> > > (end__));				\
> > > +	     (range__) = (next__), (next__) =
> > > __drm_gpusvm_range_next(range__))
> > > +
> > > +/**
> > > + * __drm_gpusvm_notifier_next - get the next drm_gpusvm_notifier in
> > > the list
> > > + * @notifier: a pointer to the current drm_gpusvm_notifier
> > > + *
> > > + * Return: A pointer to the next drm_gpusvm_notifier if available,
> > > or NULL if
> > > + *         the current notifier is the last one or if the input
> > > notifier is
> > > + *         NULL.
> > > + */
> > > +static struct drm_gpusvm_notifier *
> > > +__drm_gpusvm_notifier_next(struct drm_gpusvm_notifier *notifier)
> > > +{
> > > +	if (notifier && !list_is_last(&notifier->rb.entry,
> > > +				      &notifier->gpusvm->notifier_list))
> > > +		return list_next_entry(notifier, rb.entry);
> > > +
> > > +	return NULL;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_for_each_notifier - Iterate over GPU SVM notifiers in
> > > a gpusvm
> > > + * @notifier__: Iterator variable for the notifiers
> > > + * @gpusvm__: Pointer to the GPU SVM structure
> > > + * @start__: Start address of the notifier
> > > + * @end__: End address of the notifier
> > > + *
> > > + * This macro is used to iterate over GPU SVM notifiers in a gpusvm.
> > > + */
> > > +#define drm_gpusvm_for_each_notifier(notifier__, gpusvm__, start__,
> > > end__)		\
> > > +	for ((notifier__) = notifier_iter_first(&(gpusvm__)->root,
> > > (start__), (end__) - 1);	\
> > > +	     (notifier__) && (notifier__->interval.start <
> > > (end__));			\
> > > +	     (notifier__) = __drm_gpusvm_notifier_next(notifier__))
> > > +
> > > +/**
> > > + * drm_gpusvm_for_each_notifier_safe - Safely iterate over GPU SVM
> > > notifiers in a gpusvm
> > > + * @notifier__: Iterator variable for the notifiers
> > > + * @next__: Iterator variable for the notifiers' temporary storage
> > > + * @gpusvm__: Pointer to the GPU SVM structure
> > > + * @start__: Start address of the notifier
> > > + * @end__: End address of the notifier
> > > + *
> > > + * This macro is used to iterate over GPU SVM notifiers in a gpusvm
> > > while
> > > + * removing notifiers from it.
> > > + */
> > > +#define drm_gpusvm_for_each_notifier_safe(notifier__, next__,
> > > gpusvm__, start__, end__)	\
> > > +	for ((notifier__) = notifier_iter_first(&(gpusvm__)->root,
> > > (start__), (end__) - 1),	\
> > > +	     (next__) =
> > > __drm_gpusvm_notifier_next(notifier__);				\
> > > +	     (notifier__) && (notifier__->interval.start <
> > > (end__));			\
> > > +	     (notifier__) = (next__), (next__) =
> > > __drm_gpusvm_notifier_next(notifier__))
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_invalidate - Invalidate a GPU SVM notifier.
> > > + * @mni: Pointer to the mmu_interval_notifier structure.
> > > + * @mmu_range: Pointer to the mmu_notifier_range structure.
> > > + * @cur_seq: Current sequence number.
> > > + *
> > > + * This function serves as a generic MMU notifier for GPU SVM. It
> > > sets the MMU
> > > + * notifier sequence number and calls the driver invalidate vfunc
> > > under
> > > + * gpusvm->notifier_lock.
> > > + *
> > > + * Returns:
> > > + * true if the operation succeeds, false otherwise.
> > > + */
> > > +static bool
> > > +drm_gpusvm_notifier_invalidate(struct mmu_interval_notifier *mni,
> > > +			       const struct mmu_notifier_range
> > > *mmu_range,
> > > +			       unsigned long cur_seq)
> > > +{
> > > +	struct drm_gpusvm_notifier *notifier =
> > > +		container_of(mni, typeof(*notifier), notifier);
> > > +	struct drm_gpusvm *gpusvm = notifier->gpusvm;
> > > +
> > > +	if (!mmu_notifier_range_blockable(mmu_range))
> > > +		return false;
> > > +
> > > +	down_write(&gpusvm->notifier_lock);
> > > +	mmu_interval_set_seq(mni, cur_seq);
> > > +	gpusvm->ops->invalidate(gpusvm, notifier, mmu_range);
> > > +	up_write(&gpusvm->notifier_lock);
> > > +
> > > +	return true;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_ops - MMU interval notifier operations for
> > > GPU SVM
> > > + */
> > > +static const struct mmu_interval_notifier_ops
> > > drm_gpusvm_notifier_ops = {
> > > +	.invalidate = drm_gpusvm_notifier_invalidate,
> > > +};
> > > +
> > > +/**
> > > + * drm_gpusvm_init - Initialize the GPU SVM.
> > > + * @gpusvm: Pointer to the GPU SVM structure.
> > > + * @name: Name of the GPU SVM.
> > > + * @drm: Pointer to the DRM device structure.
> > > + * @mm: Pointer to the mm_struct for the address space.
> > > + * @device_private_page_owner: Device private pages owner.
> > > + * @mm_start: Start address of GPU SVM.
> > > + * @mm_range: Range of the GPU SVM.
> > > + * @notifier_size: Size of individual notifiers.
> > > + * @ops: Pointer to the operations structure for GPU SVM.
> > > + * @chunk_sizes: Pointer to the array of chunk sizes used in range
> > > allocation.
> > > + *               Entries should be powers of 2 in descending order
> > > with last
> > > + *               entry being SZ_4K.
> > > + * @num_chunks: Number of chunks.
> > > + *
> > > + * This function initializes the GPU SVM.
> > > + *
> > > + * Returns:
> > > + * 0 on success, a negative error code on failure.
> > > + */
> > > +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> > > +		    const char *name, struct drm_device *drm,
> > > +		    struct mm_struct *mm, void
> > > *device_private_page_owner,
> > > +		    u64 mm_start, u64 mm_range, u64 notifier_size,
> > > +		    const struct drm_gpusvm_ops *ops,
> > > +		    const u64 *chunk_sizes, int num_chunks)
> > > +{
> > > +	if (!ops->invalidate || !num_chunks)
> > > +		return -EINVAL;
> > > +
> > > +	gpusvm->name = name;
> > > +	gpusvm->drm = drm;
> > > +	gpusvm->mm = mm;
> > > +	gpusvm->device_private_page_owner =
> > > device_private_page_owner;
> > > +	gpusvm->mm_start = mm_start;
> > > +	gpusvm->mm_range = mm_range;
> > > +	gpusvm->notifier_size = notifier_size;
> > > +	gpusvm->ops = ops;
> > > +	gpusvm->chunk_sizes = chunk_sizes;
> > > +	gpusvm->num_chunks = num_chunks;
> > > +	gpusvm->zdd_wq = system_wq;
> > > +
> > > +	mmgrab(mm);
> > > +	gpusvm->root = RB_ROOT_CACHED;
> > > +	INIT_LIST_HEAD(&gpusvm->notifier_list);
> > > +
> > > +	init_rwsem(&gpusvm->notifier_lock);
> > > +
> > > +	fs_reclaim_acquire(GFP_KERNEL);
> > > +	might_lock(&gpusvm->notifier_lock);
> > > +	fs_reclaim_release(GFP_KERNEL);
> > > +
> > > +	return 0;
> > > +}
> > > +
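A quick note for anyone reading this without the later Xe patches applied: below is a minimal sketch of how a driver might call this init path. The vm->svm embedding, the "example-svm" name, example_gpusvm_ops, and the address-range / notifier-size values are all made-up placeholders for illustration, not taken from this series:

	static const u64 chunk_sizes[] = { SZ_2M, SZ_64K, SZ_4K };
	int err;

	/* all names and values here are illustrative placeholders */
	err = drm_gpusvm_init(&vm->svm, "example-svm", drm, current->mm,
			      NULL /* no VRAM migration in this sketch */,
			      0, 1ull << 47, SZ_512M /* notifier_size */,
			      &example_gpusvm_ops,
			      chunk_sizes, ARRAY_SIZE(chunk_sizes));
	if (err)
		return err;

Note the chunk sizes are descending powers of two ending in SZ_4K, per the kernel doc above, and ops->invalidate must be non-NULL or init returns -EINVAL.
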
> > > +/**
> > > + * drm_gpusvm_notifier_find - Find GPU SVM notifier
> > > + * @gpusvm__: Pointer to the GPU SVM structure
> > > + * @fault_addr__: Fault address
> > > + *
> > > + * This macro finds the GPU SVM notifier associated with the fault
> > > address.
> > > + *
> > > + * Returns:
> > > + * Pointer to the GPU SVM notifier on success, NULL otherwise.
> > > + */
> > > +#define drm_gpusvm_notifier_find(gpusvm__, fault_addr__)	\
> > > +	notifier_iter_first(&(gpusvm__)->root, (fault_addr__),	\
> > > +			    (fault_addr__ + 1))
> > > +
> > > +/**
> > > + * to_drm_gpusvm_notifier - retrieve the container struct for a
> > > given rbtree node
> > > + * @node__: a pointer to the rbtree node embedded within a
> > > drm_gpusvm_notifier struct
> > > + *
> > > + * Return: A pointer to the containing drm_gpusvm_notifier
> > > structure.
> > > + */
> > > +#define to_drm_gpusvm_notifier(node__)				\
> > > +	container_of((node__), struct drm_gpusvm_notifier, rb.node)
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_insert - Insert GPU SVM notifier
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > + *
> > > + * This function inserts the GPU SVM notifier into the GPU SVM RB
> > > tree and list.
> > > + */
> > > +static void drm_gpusvm_notifier_insert(struct drm_gpusvm *gpusvm,
> > > +				       struct drm_gpusvm_notifier
> > > *notifier)
> > > +{
> > > +	struct rb_node *node;
> > > +	struct list_head *head;
> > > +
> > > +	notifier_insert(notifier, &gpusvm->root);
> > > +
> > > +	node = rb_prev(&notifier->rb.node);
> > > +	if (node)
> > > +		head = &(to_drm_gpusvm_notifier(node))->rb.entry;
> > > +	else
> > > +		head = &gpusvm->notifier_list;
> > > +
> > > +	list_add(&notifier->rb.entry, head);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_remove - Remove GPU SVM notifier
> > > + * @gpusvm__: Pointer to the GPU SVM structure
> > > + * @notifier__: Pointer to the GPU SVM notifier structure
> > > + *
> > > + * This macro removes the GPU SVM notifier from the GPU SVM RB tree
> > > and list.
> > > + */
> > > +#define drm_gpusvm_notifier_remove(gpusvm__, notifier__)	\
> > > +	notifier_remove((notifier__), &(gpusvm__)->root);	\
> > > +	list_del(&(notifier__)->rb.entry)
> > > +
> > > +/**
> > > + * drm_gpusvm_fini - Finalize the GPU SVM.
> > > + * @gpusvm: Pointer to the GPU SVM structure.
> > > + *
> > > + * This function finalizes the GPU SVM by cleaning up any remaining
> > > ranges and
> > > + * notifiers, and dropping a reference to struct MM.
> > > + */
> > > +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm)
> > > +{
> > > +	struct drm_gpusvm_notifier *notifier, *next;
> > > +
> > > +	drm_gpusvm_for_each_notifier_safe(notifier, next, gpusvm, 0,
> > > LONG_MAX) {
> > > +		struct drm_gpusvm_range *range, *__next;
> > > +
> > > +		/*
> > > +		 * Remove notifier first to avoid racing with any
> > > invalidation
> > > +		 */
> > > +		mmu_interval_notifier_remove(&notifier->notifier);
> > > +		notifier->flags.removed = true;
> > > +
> > > +		drm_gpusvm_for_each_range_safe(range, __next,
> > > notifier, 0,
> > > +					       LONG_MAX)
> > > +			drm_gpusvm_range_remove(gpusvm, range);
> > > +	}
> > > +
> > > +	mmdrop(gpusvm->mm);
> > > +	WARN_ON(!RB_EMPTY_ROOT(&gpusvm->root.rb_root));
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_alloc - Allocate GPU SVM notifier
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @fault_addr: Fault address
> > > + *
> > > + * This function allocates and initializes the GPU SVM notifier
> > > structure.
> > > + *
> > > + * Returns:
> > > + * Pointer to the allocated GPU SVM notifier on success, ERR_PTR()
> > > on failure.
> > > + */
> > > +static struct drm_gpusvm_notifier *
> > > +drm_gpusvm_notifier_alloc(struct drm_gpusvm *gpusvm, u64 fault_addr)
> > > +{
> > > +	struct drm_gpusvm_notifier *notifier;
> > > +
> > > +	if (gpusvm->ops->notifier_alloc)
> > > +		notifier = gpusvm->ops->notifier_alloc();
> > > +	else
> > > +		notifier = kzalloc(sizeof(*notifier), GFP_KERNEL);
> > > +
> > > +	if (!notifier)
> > > +		return ERR_PTR(-ENOMEM);
> > > +
> > > +	notifier->gpusvm = gpusvm;
> > > +	notifier->interval.start = ALIGN_DOWN(fault_addr, gpusvm-
> > > >notifier_size);
> > > +	notifier->interval.end = ALIGN(fault_addr + 1, gpusvm-
> > > >notifier_size);
> > > +	INIT_LIST_HEAD(&notifier->rb.entry);
> > > +	notifier->root = RB_ROOT_CACHED;
> > > +	INIT_LIST_HEAD(&notifier->range_list);
> > > +
> > > +	return notifier;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_free - Free GPU SVM notifier
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > + *
> > > + * This function frees the GPU SVM notifier structure.
> > > + */
> > > +static void drm_gpusvm_notifier_free(struct drm_gpusvm *gpusvm,
> > > +				     struct drm_gpusvm_notifier
> > > *notifier)
> > > +{
> > > +	WARN_ON(!RB_EMPTY_ROOT(&notifier->root.rb_root));
> > > +
> > > +	if (gpusvm->ops->notifier_free)
> > > +		gpusvm->ops->notifier_free(notifier);
> > > +	else
> > > +		kfree(notifier);
> > > +}
> > > +
> > > +/**
> > > + * to_drm_gpusvm_range - retrieve the container struct for a given
> > > rbtree node
> > > + * @node__: a pointer to the rbtree node embedded within a
> > > drm_gpusvm_range struct
> > > + *
> > > + * Return: A pointer to the containing drm_gpusvm_range structure.
> > > + */
> > > +#define to_drm_gpusvm_range(node__)	\
> > > +	container_of((node__), struct drm_gpusvm_range, rb.node)
> > > +
> > > +/**
> > > + * drm_gpusvm_range_insert - Insert GPU SVM range
> > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + *
> > > + * This function inserts the GPU SVM range into the notifier RB tree
> > > and list.
> > > + */
> > > +static void drm_gpusvm_range_insert(struct drm_gpusvm_notifier
> > > *notifier,
> > > +				    struct drm_gpusvm_range *range)
> > > +{
> > > +	struct rb_node *node;
> > > +	struct list_head *head;
> > > +
> > > +	drm_gpusvm_notifier_lock(notifier->gpusvm);
> > > +	range_insert(range, &notifier->root);
> > > +
> > > +	node = rb_prev(&range->rb.node);
> > > +	if (node)
> > > +		head = &(to_drm_gpusvm_range(node))->rb.entry;
> > > +	else
> > > +		head = &notifier->range_list;
> > > +
> > > +	list_add(&range->rb.entry, head);
> > > +	drm_gpusvm_notifier_unlock(notifier->gpusvm);
> > > +}
> > > +
> > > +/**
> > > + * __drm_gpusvm_range_remove - Remove GPU SVM range
> > > + * @notifier__: Pointer to the GPU SVM notifier structure
> > > + * @range__: Pointer to the GPU SVM range structure
> > > + *
> > > + * This macro removes the GPU SVM range from the notifier RB tree
> > > and list.
> > > + */
> > > +#define __drm_gpusvm_range_remove(notifier__, range__)		\
> > > +	range_remove((range__), &(notifier__)->root);		\
> > > +	list_del(&(range__)->rb.entry)
> > > +
> > > +/**
> > > + * drm_gpusvm_range_alloc - Allocate GPU SVM range
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > + * @fault_addr: Fault address
> > > + * @chunk_size: Chunk size
> > > + * @migrate_vram: Flag indicating whether to migrate VRAM
> > > + *
> > > + * This function allocates and initializes the GPU SVM range
> > > structure.
> > > + *
> > > + * Returns:
> > > + * Pointer to the allocated GPU SVM range on success, ERR_PTR() on
> > > failure.
> > > + */
> > > +static struct drm_gpusvm_range *
> > > +drm_gpusvm_range_alloc(struct drm_gpusvm *gpusvm,
> > > +		       struct drm_gpusvm_notifier *notifier,
> > > +		       u64 fault_addr, u64 chunk_size, bool
> > > migrate_vram)
> > > +{
> > > +	struct drm_gpusvm_range *range;
> > > +
> > > +	if (gpusvm->ops->range_alloc)
> > > +		range = gpusvm->ops->range_alloc(gpusvm);
> > > +	else
> > > +		range = kzalloc(sizeof(*range), GFP_KERNEL);
> > > +
> > > +	if (!range)
> > > +		return ERR_PTR(-ENOMEM);
> > > +
> > > +	kref_init(&range->refcount);
> > > +	range->gpusvm = gpusvm;
> > > +	range->notifier = notifier;
> > > +	range->va.start = ALIGN_DOWN(fault_addr, chunk_size);
> > > +	range->va.end = ALIGN(fault_addr + 1, chunk_size);
> > > +	INIT_LIST_HEAD(&range->rb.entry);
> > > +	range->notifier_seq = LONG_MAX;
> > > +	range->flags.migrate_vram = migrate_vram ? 1 : 0;
> > > +
> > > +	return range;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_check_pages - Check pages
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > + * @start: Start address
> > > + * @end: End address
> > > + *
> > > + * Check if pages between start and end have been faulted in on the
> > > CPU. Use to
> > > + * prevent migration of pages without CPU backing store.
> > > + *
> > > + * Returns:
> > > + * True if pages have been faulted into CPU, False otherwise
> > > + */
> > > +static bool drm_gpusvm_check_pages(struct drm_gpusvm *gpusvm,
> > > +				   struct drm_gpusvm_notifier
> > > *notifier,
> > > +				   u64 start, u64 end)
> > > +{
> > > +	struct hmm_range hmm_range = {
> > > +		.default_flags = 0,
> > > +		.notifier = &notifier->notifier,
> > > +		.start = start,
> > > +		.end = end,
> > > +		.dev_private_owner = gpusvm-
> > > >device_private_page_owner,
> > > +	};
> > > +	unsigned long timeout =
> > > +		jiffies +
> > > msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > > +	unsigned long *pfns;
> > > +	unsigned long npages = npages_in_range(start, end);
> > > +	int err, i;
> > > +
> > > +	mmap_assert_locked(gpusvm->mm);
> > > +
> > > +	pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
> > > +	if (!pfns)
> > > +		return false;
> > > +
> > > +	hmm_range.notifier_seq = mmu_interval_read_begin(&notifier-
> > > >notifier);
> > > +	hmm_range.hmm_pfns = pfns;
> > > +
> > > +	while (true) {
> > > +		err = hmm_range_fault(&hmm_range);
> > > +		if (err == -EBUSY) {
> > > +			if (time_after(jiffies, timeout))
> > > +				break;
> > > +
> > > +			hmm_range.notifier_seq =
> > > mmu_interval_read_begin(&notifier->notifier);
> > > +			continue;
> > > +		}
> > > +		break;
> > > +	}
> > > +	if (err)
> > > +		goto err_free;
> > > +
> > > +	for (i = 0; i < npages; ++i) {
> > > +		if (!(pfns[i] & HMM_PFN_VALID)) {
> > > +			err = -EFAULT;
> > > +			goto err_free;
> > > +		}
> > > +	}
> > > +
> > > +err_free:
> > > +	kvfree(pfns);
> > > +	return err ? false : true;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_chunk_size - Determine chunk size for GPU SVM
> > > range
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > + * @vas: Pointer to the virtual memory area structure
> > > + * @fault_addr: Fault address
> > > + * @gpuva_start: Start address of GPUVA which mirrors CPU
> > > + * @gpuva_end: End address of GPUVA which mirrors CPU
> > > + * @check_pages: Flag indicating whether to check pages
> > > + *
> > > + * This function determines the chunk size for the GPU SVM range
> > > based on the
> > > + * fault address, GPU SVM chunk sizes, existing GPU SVM ranges, and
> > > the virtual
> > > + * memory area boundaries.
> > > + *
> > > + * Returns:
> > > + * Chunk size on success, LONG_MAX on failure.
> > > + */
> > > +static u64 drm_gpusvm_range_chunk_size(struct drm_gpusvm *gpusvm,
> > > +				       struct drm_gpusvm_notifier
> > > *notifier,
> > > +				       struct vm_area_struct *vas,
> > > +				       u64 fault_addr, u64
> > > gpuva_start,
> > > +				       u64 gpuva_end, bool
> > > check_pages)
> > > +{
> > > +	u64 start, end;
> > > +	int i = 0;
> > > +
> > > +retry:
> > > +	for (; i < gpusvm->num_chunks; ++i) {
> > > +		start = ALIGN_DOWN(fault_addr, gpusvm-
> > > >chunk_sizes[i]);
> > > +		end = ALIGN(fault_addr + 1, gpusvm->chunk_sizes[i]);
> > > +
> > > +		if (start >= vas->vm_start && end <= vas->vm_end &&
> > > +		    start >= notifier->interval.start &&
> > > +		    end <= notifier->interval.end &&
> > > +		    start >= gpuva_start && end <= gpuva_end)
> > > +			break;
> > > +	}
> > > +
> > > +	if (i == gpusvm->num_chunks)
> > > +		return LONG_MAX;
> > > +
> > > +	/*
> > > +	 * If the allocation is larger than a page, ensure it does not
> > > overlap
> > > +	 * with existing ranges.
> > > +	 */
> > > +	if (end - start != SZ_4K) {
> > > +		struct drm_gpusvm_range *range;
> > > +
> > > +		range = drm_gpusvm_range_find(notifier, start, end);
> > > +		if (range) {
> > > +			++i;
> > > +			goto retry;
> > > +		}
> > > +
> > > +		/*
> > > +		 * XXX: Only create range on pages CPU has faulted
> > > in. Without
> > > +		 * this check, or prefault, on BMG
> > > 'xe_exec_system_allocator --r
> > > +		 * process-many-malloc' fails. In the failure case,
> > > each process
> > > +		 * mallocs 16k but the CPU VMA is ~128k which
> > > results in 64k SVM
> > > +		 * ranges. When migrating the SVM ranges, some
> > > processes fail in
> > > +		 * drm_gpusvm_migrate_to_vram with 'migrate.cpages
> > > != npages'
> > > +		 * and then upon drm_gpusvm_range_get_pages device
> > > pages from
> > > +		 * other processes are collected + faulted in which
> > > creates all
> > > +		 * sorts of problems. Unsure exactly how this is
> > > happening; the
> > > +		 * problem also goes away if 'xe_exec_system_allocator --
> > > r
> > > +		 * process-many-malloc' mallocs at least 64k at a
> > > time.
> > > +		 */
> > > +		if (check_pages &&
> > > +		    !drm_gpusvm_check_pages(gpusvm, notifier, start,
> > > end)) {
> > > +			++i;
> > > +			goto retry;
> > > +		}
> > > +	}
> > > +
> > > +	return end - start;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_find_or_insert - Find or insert GPU SVM range
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @fault_addr: Fault address
> > > + * @gpuva_start: Start address of GPUVA which mirrors CPU
> > > + * @gpuva_end: End address of GPUVA which mirrors CPU
> > > + * @ctx: GPU SVM context
> > > + *
> > > + * This function finds or inserts a newly allocated GPU SVM range
> > > based on the
> > > + * fault address. Caller must hold a lock to protect range lookup
> > > and insertion.
> > > + *
> > > + * Returns:
> > > + * Pointer to the GPU SVM range on success, ERR_PTR() on failure.
> > > + */
> > > +struct drm_gpusvm_range *
> > > +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64
> > > fault_addr,
> > > +				u64 gpuva_start, u64 gpuva_end,
> > > +				const struct drm_gpusvm_ctx *ctx)
> > > +{
> > > +	struct drm_gpusvm_notifier *notifier;
> > > +	struct drm_gpusvm_range *range;
> > > +	struct mm_struct *mm = gpusvm->mm;
> > > +	struct vm_area_struct *vas;
> > > +	bool notifier_alloc = false;
> > > +	u64 chunk_size;
> > > +	int err;
> > > +	bool migrate_vram;
> > > +
> > > +	if (fault_addr < gpusvm->mm_start ||
> > > +	    fault_addr > gpusvm->mm_start + gpusvm->mm_range) {
> > > +		err = -EINVAL;
> > > +		goto err_out;
> > > +	}
> > > +
> > > +	if (!ctx->mmap_locked) {
> > > +		if (!mmget_not_zero(mm)) {
> > > +			err = -EFAULT;
> > > +			goto err_out;
> > > +		}
> > > +		mmap_write_lock(mm);
> > > +	}
> > > +
> > > +	mmap_assert_write_locked(mm);
> > > +
> > > +	notifier = drm_gpusvm_notifier_find(gpusvm, fault_addr);
> > > +	if (!notifier) {
> > > +		notifier = drm_gpusvm_notifier_alloc(gpusvm,
> > > fault_addr);
> > > +		if (IS_ERR(notifier)) {
> > > +			err = PTR_ERR(notifier);
> > > +			goto err_mmunlock;
> > > +		}
> > > +		notifier_alloc = true;
> > > +		err = mmu_interval_notifier_insert_locked(&notifier-
> > > >notifier,
> > > +							  mm,
> > > notifier->interval.start,
> > > +							  notifier-
> > > >interval.end -
> > > +							  notifier-
> > > >interval.start,
> > > +							 
> > > &drm_gpusvm_notifier_ops);
> > > +		if (err)
> > > +			goto err_notifier;
> > > +	}
> > > +
> > > +	vas = vma_lookup(mm, fault_addr);
> > > +	if (!vas) {
> > > +		err = -ENOENT;
> > > +		goto err_notifier_remove;
> > > +	}
> > > +
> > > +	if (!ctx->read_only && !(vas->vm_flags & VM_WRITE)) {
> > > +		err = -EPERM;
> > > +		goto err_notifier_remove;
> > > +	}
> > > +
> > > +	range = drm_gpusvm_range_find(notifier, fault_addr,
> > > fault_addr + 1);
> > > +	if (range)
> > > +		goto out_mmunlock;
> > > +	/*
> > > +	 * XXX: Short-circuiting migration based on migrate_vma_*
> > > current
> > > +	 * limitations. If/when migrate_vma_* add more support, this
> > > logic will
> > > +	 * have to change.
> > > +	 */
> > > +	migrate_vram = ctx->vram_possible &&
> > > +		vma_is_anonymous(vas) && !is_vm_hugetlb_page(vas);
> > > +
> > > +	chunk_size = drm_gpusvm_range_chunk_size(gpusvm, notifier,
> > > vas,
> > > +						 fault_addr,
> > > gpuva_start,
> > > +						 gpuva_end,
> > > migrate_vram &&
> > > +						 !ctx->prefault);
> > > +	if (chunk_size == LONG_MAX) {
> > > +		err = -EINVAL;
> > > +		goto err_notifier_remove;
> > > +	}
> > > +
> > > +	range = drm_gpusvm_range_alloc(gpusvm, notifier, fault_addr,
> > > chunk_size,
> > > +				       migrate_vram);
> > > +	if (IS_ERR(range)) {
> > > +		err = PTR_ERR(range);
> > > +		goto err_notifier_remove;
> > > +	}
> > > +
> > > +	drm_gpusvm_range_insert(notifier, range);
> > > +	if (notifier_alloc)
> > > +		drm_gpusvm_notifier_insert(gpusvm, notifier);
> > > +
> > > +	if (ctx->prefault) {
> > > +		struct drm_gpusvm_ctx __ctx = *ctx;
> > > +
> > > +		__ctx.mmap_locked = true;
> > > +		err = drm_gpusvm_range_get_pages(gpusvm, range,
> > > &__ctx);
> > > +		if (err)
> > > +			goto err_range_remove;
> > > +	}
> > > +
> > > +out_mmunlock:
> > > +	if (!ctx->mmap_locked) {
> > > +		mmap_write_unlock(mm);
> > > +		mmput(mm);
> > > +	}
> > > +
> > > +	return range;
> > > +
> > > +err_range_remove:
> > > +	__drm_gpusvm_range_remove(notifier, range);
> > > +err_notifier_remove:
> > > +	if (notifier_alloc)
> > > +		mmu_interval_notifier_remove(&notifier->notifier);
> > > +err_notifier:
> > > +	if (notifier_alloc)
> > > +		drm_gpusvm_notifier_free(gpusvm, notifier);
> > > +err_mmunlock:
> > > +	if (!ctx->mmap_locked) {
> > > +		mmap_write_unlock(mm);
> > > +		mmput(mm);
> > > +	}
> > > +err_out:
> > > +	return ERR_PTR(err);
> > > +}
> > > +
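To make the intended call flow concrete, here is a rough sketch of a GPU fault handler using this function. gpuva_start/gpuva_end, want_vram, and vram_allocation are driver-side placeholders rather than part of this API, error handling is abbreviated, and the bind step is sketched further down after drm_gpusvm_range_pages_valid:

	struct drm_gpusvm_ctx ctx = { .vram_possible = true, };
	struct drm_gpusvm_range *range;

	range = drm_gpusvm_range_find_or_insert(gpusvm, fault_addr,
						gpuva_start, gpuva_end, &ctx);
	if (IS_ERR(range))
		return PTR_ERR(range);

	/* Migration failure is not fatal; fall back to system pages. */
	if (want_vram && range->flags.migrate_vram)
		drm_gpusvm_migrate_to_vram(gpusvm, range, vram_allocation,
					   &ctx);
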
> > > +/**
> > > + * for_each_dma_page - iterate over pages in a DMA region
> > > + * @i__: the current page index in the iteration
> > > + * @j__: the current page index, log order, in the iteration
> > > + * @npages__: the total number of pages in the DMA region
> > > + * @order__: the order of the pages in the DMA region
> > > + *
> > > + * This macro iterates over each page in a DMA region. The DMA
> > > region
> > > + * is assumed to be composed of 2^@order__ pages, and the macro will
> > > + * step through the region one block of 2^@order__ pages at a time.
> > > + */
> > > +#define for_each_dma_page(i__, j__, npages__, order__)	\
> > > +	for ((i__) = 0, (j__) = 0; (i__) < (npages__);	\
> > > +	     (j__)++, (i__) += 0x1 << (order__))
> > > +
> > > +/**
> > > + * __drm_gpusvm_range_unmap_pages - Unmap pages associated with a
> > > GPU SVM range (internal)
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + *
> > > + * This function unmaps pages associated with a GPU SVM range.
> > > Assumes and
> > > + * asserts correct locking is in place when called.
> > > + */
> > > +static void __drm_gpusvm_range_unmap_pages(struct drm_gpusvm
> > > *gpusvm,
> > > +					   struct drm_gpusvm_range
> > > *range)
> > > +{
> > > +	lockdep_assert_held(&gpusvm->notifier_lock);
> > > +
> > > +	if (range->pages) {
> > > +		unsigned long i, j, npages = npages_in_range(range-
> > > >va.start,
> > > +							     range-
> > > >va.end);
> > > +
> > > +		if (range->flags.has_dma_mapping) {
> > > +			for_each_dma_page(i, j, npages, range-
> > > >order)
> > > +				dma_unmap_page(gpusvm->drm->dev,
> > > +					       range->dma_addr[j],
> > > +					       PAGE_SIZE << range-
> > > >order,
> > > +					       DMA_BIDIRECTIONAL);
> > > +		}
> > > +
> > > +		range->flags.has_vram_pages = false;
> > > +		range->flags.has_dma_mapping = false;
> > > +	}
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_free_pages - Free pages associated with a GPU
> > > SVM range
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + *
> > > + * This function frees pages associated with a GPU SVM range.
> > > + */
> > > +static void drm_gpusvm_range_free_pages(struct drm_gpusvm *gpusvm,
> > > +					struct drm_gpusvm_range
> > > *range)
> > > +{
> > > +	lockdep_assert_held(&gpusvm->notifier_lock);
> > > +
> > > +	if (range->pages) {
> > > +		if (range->flags.kfree_mapping) {
> > > +			kfree(range->dma_addr);
> > > +			range->flags.kfree_mapping = false;
> > > +			range->pages = NULL;
> > > +		} else {
> > > +			kvfree(range->pages);
> > > +			range->pages = NULL;
> > > +		}
> > > +	}
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_remove - Remove GPU SVM range
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range to be removed
> > > + *
> > > + * This function removes the specified GPU SVM range and also
> > > removes the parent
> > > + * GPU SVM notifier if no more ranges remain in the notifier. The
> > > caller must
> > > + * hold a lock to protect range and notifier removal.
> > > + */
> > > +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> > > +			     struct drm_gpusvm_range *range)
> > > +{
> > > +	struct drm_gpusvm_notifier *notifier;
> > > +
> > > +	notifier = drm_gpusvm_notifier_find(gpusvm, range-
> > > >va.start);
> > > +	if (WARN_ON_ONCE(!notifier))
> > > +		return;
> > > +
> > > +	drm_gpusvm_notifier_lock(gpusvm);
> > > +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > > +	drm_gpusvm_range_free_pages(gpusvm, range);
> > > +	__drm_gpusvm_range_remove(notifier, range);
> > > +	drm_gpusvm_notifier_unlock(gpusvm);
> > > +
> > > +	drm_gpusvm_range_put(range);
> > > +
> > > +	if (RB_EMPTY_ROOT(&notifier->root.rb_root)) {
> > > +		if (!notifier->flags.removed)
> > > +			mmu_interval_notifier_remove(&notifier-
> > > >notifier);
> > > +		drm_gpusvm_notifier_remove(gpusvm, notifier);
> > > +		drm_gpusvm_notifier_free(gpusvm, notifier);
> > > +	}
> > > +}
> > > +
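As a usage hint (my reading of the design, not something spelled out in this patch): the expected caller is a driver-side garbage collector running after the invalidate callback has marked a range unmapped, holding whatever driver lock also serializes range lookup/insertion, roughly:

	/* under the driver lock that also serializes find_or_insert() */
	if (range->flags.unmapped)
		drm_gpusvm_range_remove(gpusvm, range);
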
> > > +/**
> > > + * drm_gpusvm_range_get - Get a reference to GPU SVM range
> > > + * @range: Pointer to the GPU SVM range
> > > + *
> > > + * This function increments the reference count of the specified GPU
> > > SVM range.
> > > + *
> > > + * Returns:
> > > + * Pointer to the GPU SVM range.
> > > + */
> > > +struct drm_gpusvm_range *
> > > +drm_gpusvm_range_get(struct drm_gpusvm_range *range)
> > > +{
> > > +	kref_get(&range->refcount);
> > > +
> > > +	return range;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_destroy - Destroy GPU SVM range
> > > + * @refcount: Pointer to the reference counter embedded in the GPU
> > > SVM range
> > > + *
> > > + * This function destroys the specified GPU SVM range when its
> > > reference count
> > > + * reaches zero. If a custom range-free function is provided, it is
> > > invoked to
> > > + * free the range; otherwise, the range is deallocated using
> > > kfree().
> > > + */
> > > +static void drm_gpusvm_range_destroy(struct kref *refcount)
> > > +{
> > > +	struct drm_gpusvm_range *range =
> > > +		container_of(refcount, struct drm_gpusvm_range,
> > > refcount);
> > > +	struct drm_gpusvm *gpusvm = range->gpusvm;
> > > +
> > > +	if (gpusvm->ops->range_free)
> > > +		gpusvm->ops->range_free(range);
> > > +	else
> > > +		kfree(range);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_put - Put a reference to GPU SVM range
> > > + * @range: Pointer to the GPU SVM range
> > > + *
> > > + * This function decrements the reference count of the specified GPU
> > > SVM range
> > > + * and frees it when the count reaches zero.
> > > + */
> > > +void drm_gpusvm_range_put(struct drm_gpusvm_range *range)
> > > +{
> > > +	kref_put(&range->refcount, drm_gpusvm_range_destroy);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_pages_valid - GPU SVM range pages valid
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + *
> > > + * This function determines if a GPU SVM range's pages are valid.
> > > Expected to be
> > > + * called holding gpusvm->notifier_lock and as the last step before
> > > committing a
> > > + * GPU binding.
> > > + *
> > > + * Returns:
> > > + * True if GPU SVM range has valid pages, False otherwise
> > > + */
> > > +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> > > +				  struct drm_gpusvm_range *range)
> > > +{
> > > +	lockdep_assert_held(&gpusvm->notifier_lock);
> > > +
> > > +	return range->flags.has_vram_pages || range-
> > > >flags.has_dma_mapping;
> > > +}
> > > +
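To illustrate the "last step before committing a GPU binding" wording above, a sketch of the bind side as I understand it; the page-table programming is obviously driver-specific and elided here:

retry:
	err = drm_gpusvm_range_get_pages(gpusvm, range, &ctx);
	if (err)
		return err;

	drm_gpusvm_notifier_lock(gpusvm);
	if (!drm_gpusvm_range_pages_valid(gpusvm, range)) {
		/* Raced with an invalidation, collect pages again. */
		drm_gpusvm_notifier_unlock(gpusvm);
		goto retry;
	}
	/* driver-specific: commit GPU PTEs from range->dma_addr here */
	drm_gpusvm_notifier_unlock(gpusvm);
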
> > > +/**
> > > + * drm_gpusvm_range_pages_valid_unlocked - GPU SVM range pages valid
> > > unlocked
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + *
> > > + * This function determines if a GPU SVM range's pages are valid.
> > > Expected to be
> > > + * called without holding gpusvm->notifier_lock.
> > > + *
> > > + * Returns:
> > > + * True if GPU SVM range has valid pages, False otherwise
> > > + */
> > > +static bool
> > > +drm_gpusvm_range_pages_valid_unlocked(struct drm_gpusvm *gpusvm,
> > > +				      struct drm_gpusvm_range
> > > *range)
> > > +{
> > > +	bool pages_valid;
> > > +
> > > +	if (!range->pages)
> > > +		return false;
> > > +
> > > +	drm_gpusvm_notifier_lock(gpusvm);
> > > +	pages_valid = drm_gpusvm_range_pages_valid(gpusvm, range);
> > > +	if (!pages_valid && range->flags.kfree_mapping) {
> > > +		kfree(range->dma_addr);
> > > +		range->flags.kfree_mapping = false;
> > > +		range->pages = NULL;
> > > +	}
> > > +	drm_gpusvm_notifier_unlock(gpusvm);
> > > +
> > > +	return pages_valid;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_get_pages - Get pages for a GPU SVM range
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + * @ctx: GPU SVM context
> > > + *
> > > + * This function gets pages for a GPU SVM range and ensures they are
> > > mapped for
> > > + * DMA access.
> > > + *
> > > + * Returns:
> > > + * 0 on success, negative error code on failure.
> > > + */
> > > +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> > > +			       struct drm_gpusvm_range *range,
> > > +			       const struct drm_gpusvm_ctx *ctx)
> > > +{
> > > +	struct mmu_interval_notifier *notifier = &range->notifier-
> > > >notifier;
> > > +	struct hmm_range hmm_range = {
> > > +		.default_flags = HMM_PFN_REQ_FAULT | (ctx->read_only
> > > ? 0 :
> > > +			HMM_PFN_REQ_WRITE),
> > > +		.notifier = notifier,
> > > +		.start = range->va.start,
> > > +		.end = range->va.end,
> > > +		.dev_private_owner = gpusvm-
> > > >device_private_page_owner,
> > > +	};
> > > +	struct mm_struct *mm = gpusvm->mm;
> > > +	unsigned long timeout =
> > > +		jiffies +
> > > msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > > +	unsigned long i, j;
> > > +	unsigned long npages = npages_in_range(range->va.start,
> > > range->va.end);
> > > +	unsigned int order = 0;
> > > +	unsigned long *pfns;
> > > +	struct page **pages;
> > > +	int err = 0;
> > > +	bool vram_pages = !!range->flags.migrate_vram;
> > > +	bool alloc_pfns = false, kfree_mapping;
> > > +
> > > +retry:
> > > +	kfree_mapping = false;
> > > +	hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
> > > +	if (drm_gpusvm_range_pages_valid_unlocked(gpusvm, range))
> > > +		return 0;
> > > +
> > > +	if (range->notifier_seq == hmm_range.notifier_seq && range-
> > > >pages) {
> > > +		if (ctx->prefault)
> > > +			return 0;
> > > +
> > > +		pfns = (unsigned long *)range->pages;
> > > +		pages = range->pages;
> > > +		goto map_pages;
> > > +	}
> > > +
> > > +	if (!range->pages) {
> > > +		pfns = kvmalloc_array(npages, sizeof(*pfns),
> > > GFP_KERNEL);
> > > +		if (!pfns)
> > > +			return -ENOMEM;
> > > +		alloc_pfns = true;
> > > +	} else {
> > > +		pfns = (unsigned long *)range->pages;
> > > +	}
> > > +
> > > +	if (!ctx->mmap_locked) {
> > > +		if (!mmget_not_zero(mm)) {
> > > +			err = -EFAULT;
> > > +			goto err_out;
> > > +		}
> > > +	}
> > > +
> > > +	hmm_range.hmm_pfns = pfns;
> > > +	while (true) {
> > > +		/* Must be checked after mmu_interval_read_begin */
> > > +		if (range->flags.unmapped) {
> > > +			err = -EFAULT;
> > > +			break;
> > > +		}
> > > +
> > > +		if (!ctx->mmap_locked) {
> > > +			/*
> > > +			 * XXX: HMM locking document indicates only
> > > a read-lock
> > > +			 * is required but there appears to be a
> > > window between
> > > +			 * the MMU_NOTIFY_MIGRATE event triggered in
> > > a CPU fault
> > > +			 * via migrate_vma_setup and the pages
> > > actually moving
> > > +			 * in migrate_vma_finalize in which this
> > > code can grab
> > > +			 * garbage pages. Grabbing the write-lock if
> > > the range
> > > +			 * is attached to vram appears to protect
> > > against this
> > > +			 * race.
> > > +			 */
> > > +			if (vram_pages)
> > > +				mmap_write_lock(mm);
> > > +			else
> > > +				mmap_read_lock(mm);
> > > +		}
> > > +		err = hmm_range_fault(&hmm_range);
> > > +		if (!ctx->mmap_locked) {
> > > +			if (vram_pages)
> > > +				mmap_write_unlock(mm);
> > > +			else
> > > +				mmap_read_unlock(mm);
> > > +		}
> > > +
> > > +		if (err == -EBUSY) {
> > > +			if (time_after(jiffies, timeout))
> > > +				break;
> > > +
> > > +			hmm_range.notifier_seq =
> > > mmu_interval_read_begin(notifier);
> > > +			continue;
> > > +		}
> > > +		break;
> > > +	}
> > > +	if (!ctx->mmap_locked)
> > > +		mmput(mm);
> > > +	if (err)
> > > +		goto err_free;
> > > +
> > > +	pages = (struct page **)pfns;
> > > +
> > > +	if (ctx->prefault) {
> > > +		range->pages = pages;
> > > +		goto set_seqno;
> > > +	}
> > > +
> > > +map_pages:
> > > +	if (is_device_private_page(hmm_pfn_to_page(pfns[0]))) {
> > > +		WARN_ON_ONCE(!range->vram_allocation);
> > > +
> > > +		for (i = 0; i < npages; ++i) {
> > > +			pages[i] = hmm_pfn_to_page(pfns[i]);
> > > +
> > > +			if
> > > (WARN_ON_ONCE(!is_device_private_page(pages[i]))) {
> > > +				err = -EOPNOTSUPP;
> > > +				goto err_free;
> > > +			}
> > > +		}
> > > +
> > > +		/* Do not race with notifier unmapping pages */
> > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > +		range->flags.has_vram_pages = true;
> > > +		range->pages = pages;
> > > +		if (mmu_interval_read_retry(notifier,
> > > hmm_range.notifier_seq)) {
> > > +			err = -EAGAIN;
> > > +			__drm_gpusvm_range_unmap_pages(gpusvm,
> > > range);
> > > +		}
> > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > +	} else {
> > > +		dma_addr_t *dma_addr = (dma_addr_t *)pfns;
> > > +
> > > +		for_each_dma_page(i, j, npages, order) {
> > > +			if (WARN_ON_ONCE(i && order !=
> > > +					
> > > hmm_pfn_to_map_order(pfns[i]))) {
> > > +				err = -EOPNOTSUPP;
> > > +				npages = i;
> > > +				goto err_unmap;
> > > +			}
> > > +			order = hmm_pfn_to_map_order(pfns[i]);
> > > +
> > > +			pages[j] = hmm_pfn_to_page(pfns[i]);
> > > +			if
> > > (WARN_ON_ONCE(is_zone_device_page(pages[j]))) {
> > > +				err = -EOPNOTSUPP;
> > > +				npages = i;
> > > +				goto err_unmap;
> > > +			}
> > > +
> > > +			set_page_dirty_lock(pages[j]);
> > > +			mark_page_accessed(pages[j]);
> > > +
> > > +			dma_addr[j] = dma_map_page(gpusvm->drm->dev,
> > > +						   pages[j], 0,
> > > +						   PAGE_SIZE <<
> > > order,
> > > +						  
> > > DMA_BIDIRECTIONAL);
> > > +			if (dma_mapping_error(gpusvm->drm->dev,
> > > dma_addr[j])) {
> > > +				err = -EFAULT;
> > > +				npages = i;
> > > +				goto err_unmap;
> > > +			}
> > > +		}
> > > +
> > > +		/* Huge pages, reduce memory footprint */
> > > +		if (order) {
> > > +			dma_addr = kmalloc_array(j,
> > > sizeof(*dma_addr),
> > > +						 GFP_KERNEL);
> > > +			if (dma_addr) {
> > > +				for (i = 0; i < j; ++i)
> > > +					dma_addr[i] =
> > > (dma_addr_t)pfns[i];
> > > +				kvfree(pfns);
> > > +				kfree_mapping = true;
> > > +			} else {
> > > +				dma_addr = (dma_addr_t *)pfns;
> > > +			}
> > > +		}
> > > +
> > > +		/* Do not race with notifier unmapping pages */
> > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > +		range->order = order;
> > > +		range->flags.kfree_mapping = kfree_mapping;
> > > +		range->flags.has_dma_mapping = true;
> > > +		range->dma_addr = dma_addr;
> > > +		range->vram_allocation = NULL;
> > > +		if (mmu_interval_read_retry(notifier,
> > > hmm_range.notifier_seq)) {
> > > +			err = -EAGAIN;
> > > +			__drm_gpusvm_range_unmap_pages(gpusvm,
> > > range);
> > > +		}
> > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > +	}
> > > +
> > > +	if (err == -EAGAIN)
> > > +		goto retry;
> > > +set_seqno:
> > > +	range->notifier_seq = hmm_range.notifier_seq;
> > > +
> > > +	return 0;
> > > +
> > > +err_unmap:
> > > +	for_each_dma_page(i, j, npages, order)
> > > +		dma_unmap_page(gpusvm->drm->dev,
> > > +			       (dma_addr_t)pfns[j],
> > > +			       PAGE_SIZE << order,
> > > DMA_BIDIRECTIONAL);
> > > +err_free:
> > > +	if (alloc_pfns)
> > > +		kvfree(pfns);
> > > +err_out:
> > > +	return err;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_unmap_pages - Unmap pages associated with a GPU
> > > SVM range
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + * @ctx: GPU SVM context
> > > + *
> > > + * This function unmaps pages associated with a GPU SVM range. If
> > > @in_notifier
> > > + * is set, it is assumed that gpusvm->notifier_lock is held in write
> > > mode; if it
> > > + * is clear, it acquires gpusvm->notifier_lock in read mode. Must be
> > > called on
> > > + * each GPU SVM range attached to notifier in gpusvm->ops-
> > > >invalidate for IOMMU
> > > + * security model.
> > > + */
> > > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> > > +				  struct drm_gpusvm_range *range,
> > > +				  const struct drm_gpusvm_ctx *ctx)
> > > +{
> > > +	if (ctx->in_notifier)
> > > +		lockdep_assert_held_write(&gpusvm->notifier_lock);
> > > +	else
> > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > +
> > > +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > > +
> > > +	if (!ctx->in_notifier)
> > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > +}
> > > +
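Since the IOMMU-security requirement above is easy to miss, here is a skeleton of what a driver's ops->invalidate would then look like, with the signature inferred from the call in drm_gpusvm_notifier_invalidate(); the GPU PTE/TLB zap is driver-specific and only hinted at:

	static void example_invalidate(struct drm_gpusvm *gpusvm,
				       struct drm_gpusvm_notifier *notifier,
				       const struct mmu_notifier_range *mmu_range)
	{
		struct drm_gpusvm_ctx ctx = { .in_notifier = true, };
		struct drm_gpusvm_range *range = NULL;

		/* driver-specific: zap GPU PTEs / TLBs for mmu_range first */

		drm_gpusvm_for_each_range(range, notifier, mmu_range->start,
					  mmu_range->end)
			drm_gpusvm_range_unmap_pages(gpusvm, range, &ctx);
	}
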
> > > +/**
> > > + * drm_gpusvm_migration_put_page - Put a migration page
> > > + * @page: Pointer to the page to put
> > > + *
> > > + * This function unlocks and puts a page.
> > > + */
> > > +static void drm_gpusvm_migration_put_page(struct page *page)
> > > +{
> > > +	unlock_page(page);
> > > +	put_page(page);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migration_put_pages - Put migration pages
> > > + * @npages: Number of pages
> > > + * @migrate_pfn: Array of migrate page frame numbers
> > > + *
> > > + * This function puts an array of pages.
> > > + */
> > > +static void drm_gpusvm_migration_put_pages(unsigned long npages,
> > > +					   unsigned long
> > > *migrate_pfn)
> > > +{
> > > +	unsigned long i;
> > > +
> > > +	for (i = 0; i < npages; ++i) {
> > > +		if (!migrate_pfn[i])
> > > +			continue;
> > > +
> > > +		drm_gpusvm_migration_put_page(migrate_pfn_to_page(migrate_pfn[i]));
> > > +		migrate_pfn[i] = 0;
> > > +	}
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_get_vram_page - Get a reference to a VRAM page
> > > + * @page: Pointer to the page
> > > + * @zdd: Pointer to the GPU SVM zone device data
> > > + *
> > > + * This function associates the given page with the specified GPU
> > > SVM zone
> > > + * device data and initializes it for zone device usage.
> > > + */
> > > +static void drm_gpusvm_get_vram_page(struct page *page,
> > > +				     struct drm_gpusvm_zdd *zdd)
> > > +{
> > > +	page->zone_device_data = drm_gpusvm_zdd_get(zdd);
> > > +	zone_device_page_init(page);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migrate_map_pages() - Map migration pages for GPU SVM
> > > migration
> > > + * @dev: The device for which the pages are being mapped
> > > + * @dma_addr: Array to store DMA addresses corresponding to mapped
> > > pages
> > > + * @migrate_pfn: Array of migrate page frame numbers to map
> > > + * @npages: Number of pages to map
> > > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > > + *
> > > + * This function maps pages of memory for migration usage in GPU
> > > SVM. It
> > > + * iterates over each page frame number provided in @migrate_pfn,
> > > maps the
> > > + * corresponding page, and stores the DMA address in the provided
> > > @dma_addr
> > > + * array.
> > > + *
> > > + * Return: 0 on success, -EFAULT if an error occurs during mapping.
> > > + */
> > > +static int drm_gpusvm_migrate_map_pages(struct device *dev,
> > > +					dma_addr_t *dma_addr,
> > > +					long unsigned int
> > > *migrate_pfn,
> > > +					unsigned long npages,
> > > +					enum dma_data_direction dir)
> > > +{
> > > +	unsigned long i;
> > > +
> > > +	for (i = 0; i < npages; ++i) {
> > > +		struct page *page =
> > > migrate_pfn_to_page(migrate_pfn[i]);
> > > +
> > > +		if (!page)
> > > +			continue;
> > > +
> > > +		if (WARN_ON_ONCE(is_zone_device_page(page)))
> > > +			return -EFAULT;
> > > +
> > > +		dma_addr[i] = dma_map_page(dev, page, 0, PAGE_SIZE,
> > > dir);
> > > +		if (dma_mapping_error(dev, dma_addr[i]))
> > > +			return -EFAULT;
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migrate_unmap_pages() - Unmap pages previously mapped
> > > for GPU SVM migration
> > > + * @dev: The device for which the pages were mapped
> > > + * @dma_addr: Array of DMA addresses corresponding to mapped pages
> > > + * @npages: Number of pages to unmap
> > > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > > + *
> > > + * This function unmaps previously mapped pages of memory for GPU
> > > Shared Virtual
> > > + * Memory (SVM). It iterates over each DMA address provided in
> > > @dma_addr, checks
> > > + * if it's valid and not already unmapped, and unmaps the
> > > corresponding page.
> > > + */
> > > +static void drm_gpusvm_migrate_unmap_pages(struct device *dev,
> > > +					   dma_addr_t *dma_addr,
> > > +					   unsigned long npages,
> > > +					   enum dma_data_direction
> > > dir)
> > > +{
> > > +	unsigned long i;
> > > +
> > > +	for (i = 0; i < npages; ++i) {
> > > +		if (!dma_addr[i] || dma_mapping_error(dev,
> > > dma_addr[i]))
> > > +			continue;
> > > +
> > > +		dma_unmap_page(dev, dma_addr[i], PAGE_SIZE, dir);
> > > +	}
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migrate_to_vram - Migrate GPU SVM range to VRAM
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + * @vram_allocation: Driver-private pointer to the VRAM allocation.
> > > The caller
> > > + *                   should hold a reference to the VRAM allocation,
> > > which
> > > + *                   should be dropped via ops->vram_release or
> > > upon the
> > > + *                   failure of this function.
> > > + * @ctx: GPU SVM context
> > > + *
> > > + * This function migrates the specified GPU SVM range to VRAM. It
> > > performs the
> > > + * necessary setup and invokes the driver-specific operations for
> > > migration to
> > > + * VRAM. Upon successful return, @vram_allocation can safely
> > > reference @range
> > > + * until ops->vram_release is called, which only happens upon
> > > successful return.
> > > + *
> > > + * Returns:
> > > + * 0 on success, negative error code on failure.
> > > + */
> > > +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> > > +			       struct drm_gpusvm_range *range,
> > > +			       void *vram_allocation,
> > > +			       const struct drm_gpusvm_ctx *ctx)
> > > +{
> > > +	u64 start = range->va.start, end = range->va.end;
> > > +	struct migrate_vma migrate = {
> > > +		.start		= start,
> > > +		.end		= end,
> > > +		.pgmap_owner	= gpusvm->device_private_page_owner,
> > > +		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
> > > +	};
> > > +	struct mm_struct *mm = gpusvm->mm;
> > > +	unsigned long i, npages = npages_in_range(start, end);
> > > +	struct vm_area_struct *vas;
> > > +	struct drm_gpusvm_zdd *zdd = NULL;
> > > +	struct page **pages;
> > > +	dma_addr_t *dma_addr;
> > > +	void *buf;
> > > +	int err;
> > > +
> > > +	if (!range->flags.migrate_vram)
> > > +		return -EINVAL;
> > > +
> > > +	if (!gpusvm->ops->populate_vram_pfn || !gpusvm->ops-
> > > >copy_to_vram ||
> > > +	    !gpusvm->ops->copy_to_sram)
> > > +		return -EOPNOTSUPP;
> > > +
> > > +	if (!ctx->mmap_locked) {
> > > +		if (!mmget_not_zero(mm)) {
> > > +			err = -EFAULT;
> > > +			goto err_out;
> > > +		}
> > > +		mmap_write_lock(mm);
> > > +	}
> > > +
> > > +	mmap_assert_locked(mm);
> > > +
> > > +	vas = vma_lookup(mm, start);
> > > +	if (!vas) {
> > > +		err = -ENOENT;
> > > +		goto err_mmunlock;
> > > +	}
> > > +
> > > +	if (end > vas->vm_end || start < vas->vm_start) {
> > > +		err = -EINVAL;
> > > +		goto err_mmunlock;
> > > +	}
> > > +
> > > +	if (!vma_is_anonymous(vas)) {
> > > +		err = -EBUSY;
> > > +		goto err_mmunlock;
> > > +	}
> > > +
> > > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) +
> > > sizeof(*dma_addr) +
> > > +		       sizeof(*pages), GFP_KERNEL);
> > > +	if (!buf) {
> > > +		err = -ENOMEM;
> > > +		goto err_mmunlock;
> > > +	}
> > > +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > > +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr))
> > > * npages;
> > > +
> > > +	zdd = drm_gpusvm_zdd_alloc(range);
> > > +	if (!zdd) {
> > > +		err = -ENOMEM;
> > > +		goto err_free;
> > > +	}
> > > +
> > > +	migrate.vma = vas;
> > > +	migrate.src = buf;
> > > +	migrate.dst = migrate.src + npages;
> > > +
> > > +	err = migrate_vma_setup(&migrate);
> > > +	if (err)
> > > +		goto err_free;
> > > +
> > > +	/*
> > > +	 * FIXME: Below cases, !migrate.cpages and migrate.cpages !=
> > > npages, not
> > > +	 * always an error. Need to revisit possible cases and how
> > > to handle. We
> > > +	 * could prefault on migrate.cpages != npages via
> > > hmm_range_fault.
> > > +	 */
> > > +
> > > +	if (!migrate.cpages) {
> > > +		err = -EFAULT;
> > > +		goto err_free;
> > > +	}
> > > +
> > > +	if (migrate.cpages != npages) {
> > > +		err = -EBUSY;
> > > +		goto err_finalize;
> > > +	}
> > > +
> > > +	err = gpusvm->ops->populate_vram_pfn(gpusvm,
> > > vram_allocation, npages,
> > > +					     migrate.dst);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> > > dma_addr,
> > > +					   migrate.src, npages,
> > > DMA_TO_DEVICE);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	for (i = 0; i < npages; ++i) {
> > > +		struct page *page = pfn_to_page(migrate.dst[i]);
> > > +
> > > +		pages[i] = page;
> > > +		migrate.dst[i] = migrate_pfn(migrate.dst[i]);
> > > +		drm_gpusvm_get_vram_page(page, zdd);
> > > +	}
> > > +
> > > +	err = gpusvm->ops->copy_to_vram(gpusvm, pages, dma_addr,
> > > npages);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	/* Upon success bind vram allocation to range and zdd */
> > > +	range->vram_allocation = vram_allocation;
> > > +	WRITE_ONCE(zdd->vram_allocation, vram_allocation);	/*
> > > Owns ref */
> > > +
> > > +err_finalize:
> > > +	if (err)
> > > +		drm_gpusvm_migration_put_pages(npages, migrate.dst);
> > > +	migrate_vma_pages(&migrate);
> > > +	migrate_vma_finalize(&migrate);
> > > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr,
> > > npages,
> > > +				       DMA_TO_DEVICE);
> > > +err_free:
> > > +	if (zdd)
> > > +		drm_gpusvm_zdd_put(zdd);
> > > +	kvfree(buf);
> > > +err_mmunlock:
> > > +	if (!ctx->mmap_locked) {
> > > +		mmap_write_unlock(mm);
> > > +		mmput(mm);
> > > +	}
> > > +err_out:
> > > +	return err;
> > > +}
> > > +
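Worth spelling out: my understanding is that callers treat this as best-effort from the fault handler, i.e. a failure is logged and the range simply stays backed by system pages. The allocation of vram_allocation itself (a driver-private object) is not shown:

	err = drm_gpusvm_migrate_to_vram(gpusvm, range, vram_allocation, &ctx);
	if (err)
		drm_dbg(gpusvm->drm,
			"VRAM migration failed: %d, staying in system memory\n",
			err);
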
> > > +/**
> > > + * drm_gpusvm_migrate_populate_sram_pfn - Populate SRAM PFNs for a
> > > VM area
> > > + * @vas: Pointer to the VM area structure, can be NULL
> > > + * @npages: Number of pages to populate
> > > + * @src_mpfn: Source array of migrate PFNs
> > > + * @mpfn: Array of migrate PFNs to populate
> > > + * @addr: Start address for PFN allocation
> > > + *
> > > + * This function populates the SRAM migrate page frame numbers
> > > (PFNs) for the
> > > + * specified VM area structure. It allocates and locks pages in the
> > > VM area for
> > > + * SRAM usage. If vas is non-NULL use alloc_page_vma for allocation,
> > > if NULL use
> > > + * alloc_page for allocation.
> > > + *
> > > + * Returns:
> > > + * 0 on success, negative error code on failure.
> > > + */
> > > +static int drm_gpusvm_migrate_populate_sram_pfn(struct
> > > vm_area_struct *vas,
> > > +						unsigned long
> > > npages,
> > > +						unsigned long
> > > *src_mpfn,
> > > +						unsigned long *mpfn,
> > > u64 addr)
> > > +{
> > > +	unsigned long i;
> > > +
> > > +	for (i = 0; i < npages; ++i, addr += PAGE_SIZE) {
> > > +		struct page *page;
> > > +
> > > +		if (!(src_mpfn[i] & MIGRATE_PFN_MIGRATE))
> > > +			continue;
> > > +
> > > +		if (vas)
> > > +			page = alloc_page_vma(GFP_HIGHUSER, vas,
> > > addr);
> > > +		else
> > > +			page = alloc_page(GFP_HIGHUSER);
> > > +
> > > +		if (!page)
> > > +			return -ENOMEM;
> > > +
> > > +		lock_page(page);
> > > +		mpfn[i] = migrate_pfn(page_to_pfn(page));
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_evict_to_sram - Evict GPU SVM range to SRAM
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + *
> > > + * Similar to __drm_gpusvm_migrate_to_sram but does not require the
> > > mmap lock;
> > > + * migration is done via the migrate_device_* functions. This is a
> > > fallback
> > > + * path, as it is preferred to issue migrations with the mmap lock
> > > held.
> > > + *
> > > + * Returns:
> > > + * 0 on success, negative error code on failure.
> > > + */
> > > +static int drm_gpusvm_evict_to_sram(struct drm_gpusvm *gpusvm,
> > > +				    struct drm_gpusvm_range *range)
> > > +{
> > > +	unsigned long npages;
> > > +	struct page **pages;
> > > +	unsigned long *src, *dst;
> > > +	dma_addr_t *dma_addr;
> > > +	void *buf;
> > > +	int i, err = 0;
> > > +
> > > +	npages = npages_in_range(range->va.start, range->va.end);
> > > +
> > > +	buf = kvcalloc(npages, 2 * sizeof(*src) + sizeof(*dma_addr)
> > > +
> > > +		       sizeof(*pages), GFP_KERNEL);
> > > +	if (!buf) {
> > > +		err = -ENOMEM;
> > > +		goto err_out;
> > > +	}
> > > +	src = buf;
> > > +	dst = buf + (sizeof(*src) * npages);
> > > +	dma_addr = buf + (2 * sizeof(*src) * npages);
> > > +	pages = buf + (2 * sizeof(*src) + sizeof(*dma_addr)) *
> > > npages;
> > > +
> > > +	err = gpusvm->ops->populate_vram_pfn(gpusvm, range-
> > > >vram_allocation,
> > > +					     npages, src);
> > > +	if (err)
> > > +		goto err_free;
> > > +
> > > +	err = migrate_device_vma_range(gpusvm->mm,
> > > +				       gpusvm-
> > > >device_private_page_owner, src,
> > > +				       npages, range->va.start);
> > > +	if (err)
> > > +		goto err_free;
> > > +
> > > +	err = drm_gpusvm_migrate_populate_sram_pfn(NULL, npages,
> > > src, dst, 0);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> > > dma_addr,
> > > +					   dst, npages,
> > > DMA_BIDIRECTIONAL);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	for (i = 0; i < npages; ++i)
> > > +		pages[i] = migrate_pfn_to_page(src[i]);
> > > +
> > > +	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr,
> > > npages);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +err_finalize:
> > > +	if (err)
> > > +		drm_gpusvm_migration_put_pages(npages, dst);
> > > +	migrate_device_pages(src, dst, npages);
> > > +	migrate_device_finalize(src, dst, npages);
> > > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr,
> > > npages,
> > > +				       DMA_BIDIRECTIONAL);
> > > +err_free:
> > > +	kvfree(buf);
> > > +err_out:
> > > +
> > > +	return err;
> > > +}
> > > +
> > > +/**
> > > + * __drm_gpusvm_migrate_to_sram - Migrate GPU SVM range to SRAM
> > > (internal)
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @vas: Pointer to the VM area structure
> > > + * @page: Pointer to the page for fault handling (can be NULL)
> > > + * @start: Start address of the migration range
> > > + * @end: End address of the migration range
> > > + *
> > > + * This internal function performs the migration of the specified
> > > GPU SVM range
> > > + * to SRAM. It sets up the migration, populates + dma maps SRAM
> > > PFNs, and
> > > + * invokes the driver-specific operations for migration to SRAM.
> > > + *
> > > + * Returns:
> > > + * 0 on success, negative error code on failure.
> > > + */
> > > +static int __drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > > +					struct vm_area_struct *vas,
> > > +					struct page *page,
> > > +					u64 start, u64 end)
> > > +{
> > > +	struct migrate_vma migrate = {
> > > +		.vma		= vas,
> > > +		.pgmap_owner	= gpusvm->device_private_page_owner,
> > > +		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> > > +		.fault_page	= page,
> > > +	};
> > > +	unsigned long npages;
> > > +	struct page **pages;
> > > +	dma_addr_t *dma_addr;
> > > +	void *buf;
> > > +	int i, err = 0;
> > > +
> > > +	mmap_assert_locked(gpusvm->mm);
> > > +
> > > +	/* Corner case where the VMA has been partially unmapped */
> > > +	if (start < vas->vm_start)
> > > +		start = vas->vm_start;
> > > +	if (end > vas->vm_end)
> > > +		end = vas->vm_end;
> > > +
> > > +	migrate.start = start;
> > > +	migrate.end = end;
> > > +	npages = npages_in_range(start, end);
> > > +
> > > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) +
> > > sizeof(*dma_addr) +
> > > +		       sizeof(*pages), GFP_KERNEL);
> > > +	if (!buf) {
> > > +		err = -ENOMEM;
> > > +		goto err_out;
> > > +	}
> > > +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > > +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr))
> > > * npages;
> > > +
> > > +	migrate.vma = vas;
> > > +	migrate.src = buf;
> > > +	migrate.dst = migrate.src + npages;
> > > +
> > > +	err = migrate_vma_setup(&migrate);
> > > +	if (err)
> > > +		goto err_free;
> > > +
> > > +	/* Raced with another CPU fault, nothing to do */
> > > +	if (!migrate.cpages)
> > > +		goto err_free;
> > > +
> > > +	err = drm_gpusvm_migrate_populate_sram_pfn(vas, npages,
> > > +						   migrate.src,
> > > migrate.dst,
> > > +						   start);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> > > dma_addr,
> > > +					   migrate.dst, npages,
> > > +					   DMA_BIDIRECTIONAL);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	for (i = 0; i < npages; ++i)
> > > +		pages[i] = migrate_pfn_to_page(migrate.src[i]);
> > > +
> > > +	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr,
> > > npages);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +err_finalize:
> > > +	if (err)
> > > +		drm_gpusvm_migration_put_pages(npages, migrate.dst);
> > > +	migrate_vma_pages(&migrate);
> > > +	migrate_vma_finalize(&migrate);
> > > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr,
> > > npages,
> > > +				       DMA_BIDIRECTIONAL);
> > > +err_free:
> > > +	kvfree(buf);
> > > +err_out:
> > > +	mmap_assert_locked(gpusvm->mm);
> > > +
> > > +	return err;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migrate_to_sram - Migrate (evict) GPU SVM range to
> > > SRAM
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + * @ctx: GPU SVM context
> > > + *
> > > + * This function initiates the migration of the specified GPU SVM
> > > range to
> > > + * SRAM. It performs necessary checks and invokes the internal
> > > migration
> > > + * function for actual migration.
> > > + *
> > > + * Returns:
> > > + * 0 on success, negative error code on failure.
> > > + */
> > > +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > > +			       struct drm_gpusvm_range *range,
> > > +			       const struct drm_gpusvm_ctx *ctx)
> > > +{
> > > +	u64 start = range->va.start, end = range->va.end;
> > > +	struct mm_struct *mm = gpusvm->mm;
> > > +	struct vm_area_struct *vas;
> > > +	int err;
> > > +	bool retry = false;
> > > +
> > > +	if (!ctx->mmap_locked) {
> > > +		if (!mmget_not_zero(mm)) {
> > > +			err = -EFAULT;
> > > +			goto err_out;
> > > +		}
> > > +		if (ctx->trylock_mmap) {
> > > +			if (!mmap_read_trylock(mm))  {
> > > +				err =
> > > drm_gpusvm_evict_to_sram(gpusvm, range);
> > > +				goto err_mmput;
> > > +			}
> > > +		} else {
> > > +			mmap_read_lock(mm);
> > > +		}
> > > +	}
> > > +
> > > +	mmap_assert_locked(mm);
> > > +
> > > +	/*
> > > +	 * Loop required to find all VMAs for the corner
> > > case when
> > > +	 * VRAM backing has been partially unmapped from MM's
> > > address space.
> > > +	 */
> > > +again:
> > > +	vas = find_vma(mm, start);
> > > +	if (!vas) {
> > > +		if (!retry)
> > > +			err = -ENOENT;
> > > +		goto err_mmunlock;
> > > +	}
> > > +
> > > +	if (end <= vas->vm_start || start >= vas->vm_end) {
> > > +		if (!retry)
> > > +			err = -EINVAL;
> > > +		goto err_mmunlock;
> > > +	}
> > > +
> > > +	err = __drm_gpusvm_migrate_to_sram(gpusvm, vas, NULL, start,
> > > end);
> > > +	if (err)
> > > +		goto err_mmunlock;
> > > +
> > > +	if (vas->vm_end < end) {
> > > +		retry = true;
> > > +		start = vas->vm_end;
> > > +		goto again;
> > > +	}
> > > +
> > > +	if (!ctx->mmap_locked) {
> > > +		mmap_read_unlock(mm);
> > > +		/*
> > > +		 * Using mmput_async as this function can be called
> > > while
> > > +		 * holding a dma-resv lock, and a final put can grab
> > > the mmap
> > > +		 * lock, causing a lock inversion.
> > > +		 */
> > > +		mmput_async(mm);
> > > +	}
> > > +
> > > +	return 0;
> > > +
> > > +err_mmunlock:
> > > +	if (!ctx->mmap_locked)
> > > +		mmap_read_unlock(mm);
> > > +err_mmput:
> > > +	if (!ctx->mmap_locked)
> > > +		mmput_async(mm);
> > > +err_out:
> > > +	return err;
> > > +}
> > > +
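For the eviction case, my understanding of how this is meant to be called; the surrounding BO/dma-resv handling is driver-specific and omitted:

	struct drm_gpusvm_ctx ctx = { .trylock_mmap = true, };

	/*
	 * Called with the backing allocation's dma-resv held, so only a
	 * trylock of the mmap lock is allowed; on contention the
	 * migrate_device_* fallback above is used internally.
	 */
	err = drm_gpusvm_migrate_to_sram(gpusvm, range, &ctx);
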
> > > +/**
> > > + * drm_gpusvm_page_free - Put GPU SVM zone device data associated
> > > with a page
> > > + * @page: Pointer to the page
> > > + *
> > > + * This function is a callback used to put the GPU SVM zone device
> > > data
> > > + * associated with a page when it is being released.
> > > + */
> > > +static void drm_gpusvm_page_free(struct page *page)
> > > +{
> > > +	drm_gpusvm_zdd_put(page->zone_device_data);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migrate_to_ram - Migrate GPU SVM range to RAM (page
> > > fault handler)
> > > + * @vmf: Pointer to the fault information structure
> > > + *
> > > + * This function is a page fault handler used to migrate a GPU SVM
> > > range to RAM.
> > > + * It retrieves the GPU SVM range information from the faulting page
> > > and invokes
> > > + * the internal migration function to migrate the range back to RAM.
> > > + *
> > > + * Returns:
> > > + * VM_FAULT_SIGBUS on failure, 0 on success.
> > > + */
> > > +static vm_fault_t drm_gpusvm_migrate_to_ram(struct vm_fault *vmf)
> > > +{
> > > +	struct drm_gpusvm_zdd *zdd = vmf->page->zone_device_data;
> > > +	int err;
> > > +
> > > +	err = __drm_gpusvm_migrate_to_sram(zdd->range->gpusvm,
> > > +					   vmf->vma, vmf->page,
> > > +					   zdd->range->va.start,
> > > +					   zdd->range->va.end);
> > > +
> > > +	return err ? VM_FAULT_SIGBUS : 0;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_pagemap_ops - Device page map operations for GPU SVM
> > > + */
> > > +static const struct dev_pagemap_ops drm_gpusvm_pagemap_ops = {
> > > +	.page_free = drm_gpusvm_page_free,
> > > +	.migrate_to_ram = drm_gpusvm_migrate_to_ram,
> > > +};
> > > +
> > > +/**
> > > + * drm_gpusvm_pagemap_ops_get - Retrieve GPU SVM device page map
> > > operations
> > > + *
> > > + * Returns:
> > > + * Pointer to the GPU SVM device page map operations structure.
> > > + */
> > > +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void)
> > > +{
> > > +	return &drm_gpusvm_pagemap_ops;
> > > +}
> > > +
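For completeness, this is roughly how I'd expect a driver to hook these ops into its device-private pagemap when probing VRAM; pagemap, res, and dev are placeholders and the resource setup is abbreviated:

	pagemap->type = MEMORY_DEVICE_PRIVATE;
	pagemap->range.start = res->start;
	pagemap->range.end = res->end;
	pagemap->nr_range = 1;
	pagemap->ops = drm_gpusvm_pagemap_ops_get();
	pagemap->owner = device_private_page_owner;

	addr = devm_memremap_pages(dev, pagemap);
	if (IS_ERR(addr))
		return PTR_ERR(addr);

The owner set here has to match the device_private_page_owner passed to drm_gpusvm_init() so hmm_range_fault() and migrate_vma recognize the device pages.
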
> > > +/**
> > > + * drm_gpusvm_has_mapping - Check if GPU SVM has mapping for the
> > > given address range
> > > + * @gpusvm: Pointer to the GPU SVM structure.
> > > + * @start: Start address
> > > + * @end: End address
> > > + *
> > > + * Returns:
> > > + * True if GPU SVM has mapping, False otherwise
> > > + */
> > > +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start,
> > > u64 end)
> > > +{
> > > +	struct drm_gpusvm_notifier *notifier;
> > > +
> > > +	drm_gpusvm_for_each_notifier(notifier, gpusvm, start, end) {
> > > +		struct drm_gpusvm_range *range = NULL;
> > > +
> > > +		drm_gpusvm_for_each_range(range, notifier, start,
> > > end)
> > > +			return true;
> > > +	}
> > > +
> > > +	return false;
> > > +}
> > > diff --git a/drivers/gpu/drm/xe/drm_gpusvm.h
> > > b/drivers/gpu/drm/xe/drm_gpusvm.h
> > > new file mode 100644
> > > index 000000000000..0ea70f8534a8
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/xe/drm_gpusvm.h
> > > @@ -0,0 +1,415 @@
> > > +/* SPDX-License-Identifier: MIT */
> > > +/*
> > > + * Copyright © 2024 Intel Corporation
> > > + */
> > > +
> > > +#ifndef __DRM_GPUSVM_H__
> > > +#define __DRM_GPUSVM_H__
> > > +
> > > +#include <linux/kref.h>
> > > +#include <linux/mmu_notifier.h>
> > > +#include <linux/workqueue.h>
> > > +
> > > +struct dev_pagemap_ops;
> > > +struct drm_device;
> > > +struct drm_gpusvm;
> > > +struct drm_gpusvm_notifier;
> > > +struct drm_gpusvm_ops;
> > > +struct drm_gpusvm_range;
> > > +
> > > +/**
> > > + * struct drm_gpusvm_ops - Operations structure for GPU SVM
> > > + *
> > > + * This structure defines the operations for GPU Shared Virtual
> > > Memory (SVM).
> > > + * These operations are provided by the GPU driver to manage SVM
> > > ranges and
> > > + * perform operations such as migration between VRAM and system RAM.
> > > + */
> > > +struct drm_gpusvm_ops {
> > > +	/**
> > > +	 * @notifier_alloc: Allocate a GPU SVM notifier (optional)
> > > +	 *
> > > +	 * This function shall allocate a GPU SVM notifier.
> > > +	 *
> > > +	 * Returns:
> > > +	 * Pointer to the allocated GPU SVM notifier on success,
> > > NULL on failure.
> > > +	 */
> > > +	struct drm_gpusvm_notifier *(*notifier_alloc)(void);
> > > +
> > > +	/**
> > > +	 * @notifier_free: Free a GPU SVM notifier (optional)
> > > +	 * @notifier: Pointer to the GPU SVM notifier to be freed
> > > +	 *
> > > +	 * This function shall free a GPU SVM notifier.
> > > +	 */
> > > +	void (*notifier_free)(struct drm_gpusvm_notifier *notifier);
> > > +
> > > +	/**
> > > +	 * @range_alloc: Allocate a GPU SVM range (optional)
> > > +	 * @gpusvm: Pointer to the GPU SVM
> > > +	 *
> > > +	 * This function shall allocate a GPU SVM range.
> > > +	 *
> > > +	 * Returns:
> > > +	 * Pointer to the allocated GPU SVM range on success, NULL
> > > on failure.
> > > +	 */
> > > +	struct drm_gpusvm_range *(*range_alloc)(struct drm_gpusvm
> > > *gpusvm);
> > > +
> > > +	/**
> > > +	 * @range_free: Free a GPU SVM range (optional)
> > > +	 * @range: Pointer to the GPU SVM range to be freed
> > > +	 *
> > > +	 * This function shall free a GPU SVM range.
> > > +	 */
> > > +	void (*range_free)(struct drm_gpusvm_range *range);
> > > +
> > > +	/**
> > > +	 * @vram_release: Release VRAM allocation (optional)
> > > +	 * @vram_allocation: Driver-private pointer to the VRAM
> > > allocation
> > > +	 *
> > > +	 * This function shall release VRAM allocation and expects
> > > to drop a
> > > +	 * reference to VRAM allocation.
> > > +	 */
> > > +	void (*vram_release)(void *vram_allocation);
> > > +
> > > +	/**
> > > +	 * @populate_vram_pfn: Populate VRAM PFN (required for
> > > migration)
> > > +	 * @gpusvm: Pointer to the GPU SVM
> > > +	 * @vram_allocation: Driver-private pointer to the VRAM
> > > allocation
> > > +	 * @npages: Number of pages to populate
> > > +	 * @pfn: Array of page frame numbers to populate
> > > +	 *
> > > +	 * This function shall populate VRAM page frame numbers
> > > (PFN).
> > > +	 *
> > > +	 * Returns:
> > > +	 * 0 on success, a negative error code on failure.
> > > +	 */
> > > +	int (*populate_vram_pfn)(struct drm_gpusvm *gpusvm,
> > > +				 void *vram_allocation,
> > > +				 unsigned long npages,
> > > +				 unsigned long *pfn);
> > > +
> > > +	/**
> > > +	 * @copy_to_vram: Copy to VRAM (required for migration)
> > > +	 * @gpusvm: Pointer to the GPU SVM
> > > +	 * @pages: Pointer to array of VRAM pages (destination)
> > > +	 * @dma_addr: Pointer to array of DMA addresses (source)
> > > +	 * @npages: Number of pages to copy
> > > +	 *
> > > +	 * This function shall copy pages to VRAM.
> > > +	 *
> > > +	 * Returns:
> > > +	 * 0 on success, a negative error code on failure.
> > > +	 */
> > > +	int (*copy_to_vram)(struct drm_gpusvm *gpusvm,
> > > +			    struct page **pages,
> > > +			    dma_addr_t *dma_addr,
> > > +			    unsigned long npages);
> > > +
> > > +	/**
> > > +	 * @copy_to_sram: Copy to system RAM (required for
> > > migration)
> > > +	 * @gpusvm: Pointer to the GPU SVM
> > > +	 * @pages: Pointer to array of VRAM pages (source)
> > > +	 * @dma_addr: Pointer to array of DMA addresses
> > > (destination)
> > > +	 * @npages: Number of pages to copy
> > > +	 *
> > > +	 * This function shall copy pages to system RAM.
> > > +	 *
> > > +	 * Returns:
> > > +	 * 0 on success, a negative error code on failure.
> > > +	 */
> > > +	int (*copy_to_sram)(struct drm_gpusvm *gpusvm,
> > > +			    struct page **pages,
> > > +			    dma_addr_t *dma_addr,
> > > +			    unsigned long npages);
> > > +
> > > +	/**
> > > +	 * @invalidate: Invalidate GPU SVM notifier (required)
> > > +	 * @gpusvm: Pointer to the GPU SVM
> > > +	 * @notifier: Pointer to the GPU SVM notifier
> > > +	 * @mmu_range: Pointer to the mmu_notifier_range structure
> > > +	 *
> > > +	 * This function shall invalidate the GPU page tables. It
> > > can safely
> > > +	 * walk the notifier range RB tree/list in this function.
> > > Called while
> > > +	 * holding the notifier lock.
> > > +	 */
> > > +	void (*invalidate)(struct drm_gpusvm *gpusvm,
> > > +			   struct drm_gpusvm_notifier *notifier,
> > > +			   const struct mmu_notifier_range
> > > *mmu_range);
> > > +};
> > > +
> > > +/**
> > > + * struct drm_gpusvm_notifier - Structure representing a GPU SVM
> > > notifier
> > > + *
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @notifier: MMU interval notifier
> > > + * @interval: Interval for the notifier
> > > + * @rb: Red-black tree node for the parent GPU SVM structure
> > > notifier tree
> > > + * @root: Cached root node of the RB tree containing ranges
> > > + * @range_list: List head containing ranges in the same order
> > > they appear in
> > > + *              interval tree. This is useful to keep iterating
> > > ranges while
> > > + *              doing modifications to RB tree.
> > > + * @flags.removed: Flag indicating whether the MMU interval notifier
> > > has been
> > > + *                 removed
> > > + *
> > > + * This structure represents a GPU SVM notifier.
> > > + */
> > > +struct drm_gpusvm_notifier {
> > > +	struct drm_gpusvm *gpusvm;
> > > +	struct mmu_interval_notifier notifier;
> > > +	struct {
> > > +		u64 start;
> > > +		u64 end;
> > > +	} interval;
> > > +	struct {
> > > +		struct rb_node node;
> > > +		struct list_head entry;
> > > +		u64 __subtree_last;
> > > +	} rb;
> > > +	struct rb_root_cached root;
> > > +	struct list_head range_list;
> > > +	struct {
> > > +		u32 removed : 1;
> > > +	} flags;
> > > +};
> > > +
> > > +/**
> > > + * struct drm_gpusvm_range - Structure representing a GPU SVM range
> > > + *
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @notifier: Pointer to the GPU SVM notifier
> > > + * @refcount: Reference count for the range
> > > + * @rb: Red-black tree node for the parent GPU SVM notifier
> > > structure range tree
> > > + * @va: Virtual address range
> > > + * @notifier_seq: Notifier sequence number of the range's pages
> > > + * @pages: Pointer to the array of pages (if backing store is in
> > > VRAM)
> > > + * @dma_addr: DMA address array (if backing store is SRAM and DMA
> > > mapped)
> > > + * @vram_allocation: Driver-private pointer to the VRAM allocation
> > > + * @order: Order of dma mapping. i.e. PAGE_SIZE << order is mapping
> > > size
> > > + * @flags.migrate_vram: Flag indicating whether the range can be
> > > migrated to VRAM
> > > + * @flags.unmapped: Flag indicating if the range has been unmapped
> > > + * @flags.partial_unmap: Flag indicating if the range has been
> > > partially unmapped
> > > + * @flags.has_vram_pages: Flag indicating if the range has vram
> > > pages
> > > + * @flags.has_dma_mapping: Flag indicating if the range has a DMA
> > > mapping
> > > + * @flags.kfree_mapping: Flag indicating @dma_addr is a compact
> > > allocation based
> > > + *                       on @order which releases via kfree
> > > + *
> > > + * This structure represents a GPU SVM range used for tracking
> > > memory ranges
> > > + * mapped in a DRM device.
> > > + */
> > > +struct drm_gpusvm_range {
> > > +	struct drm_gpusvm *gpusvm;
> > > +	struct drm_gpusvm_notifier *notifier;
> > > +	struct kref refcount;
> > > +	struct {
> > > +		struct rb_node node;
> > > +		struct list_head entry;
> > > +		u64 __subtree_last;
> > > +	} rb;
> > > +	struct {
> > > +		u64 start;
> > > +		u64 end;
> > > +	} va;
> > > +	unsigned long notifier_seq;
> > > +	union {
> > > +		struct page **pages;
> > > +		dma_addr_t *dma_addr;
> > > +	};
> > > +	void *vram_allocation;
> > > +	u16 order;
> > > +	struct {
> > > +		/* All flags below must be set upon creation */
> > > +		u16 migrate_vram : 1;
> > > +		/* All flags below must be set / cleared under
> > > notifier lock */
> > > +		u16 unmapped : 1;
> > > +		u16 partial_unmap : 1;
> > > +		u16 has_vram_pages : 1;
> > > +		u16 has_dma_mapping : 1;
> > > +		u16 kfree_mapping : 1;
> > > +	} flags;
> > > +};
> > > +
> > > +/**
> > > + * struct drm_gpusvm - GPU SVM structure
> > > + *
> > > + * @name: Name of the GPU SVM
> > > + * @drm: Pointer to the DRM device structure
> > > + * @mm: Pointer to the mm_struct for the address space
> > > + * @device_private_page_owner: Device private pages owner
> > > + * @mm_start: Start address of GPU SVM
> > > + * @mm_range: Range of the GPU SVM
> > > + * @notifier_size: Size of individual notifiers
> > > + * @ops: Pointer to the operations structure for GPU SVM
> > > + * @chunk_sizes: Pointer to the array of chunk sizes used in range
> > > allocation.
> > > + *               Entries should be powers of 2 in descending order.
> > > + * @num_chunks: Number of chunks
> > > + * @notifier_lock: Read-write semaphore for protecting notifier
> > > operations
> > > + * @zdd_wq: Workqueue for deferred work on zdd destruction
> > > + * @root: Cached root node of the Red-Black tree containing GPU SVM
> > > notifiers
> > > + * @notifier_list: List head containing notifiers in the same
> > > order they
> > > + *                 appear in interval tree. This is useful to keep
> > > iterating
> > > + *                 notifiers while doing modifications to RB tree.
> > > + *
> > > + * This structure represents a GPU SVM (Shared Virtual Memory) used
> > > for tracking
> > > + * memory ranges mapped in a DRM (Direct Rendering Manager) device.
> > > + *
> > > + * No reference counting is provided, as this is expected to be
> > > embedded in the
> > > + * driver VM structure along with the struct drm_gpuvm, which
> > > handles reference
> > > + * counting.
> > > + */
> > > +struct drm_gpusvm {
> > > +	const char *name;
> > > +	struct drm_device *drm;
> > > +	struct mm_struct *mm;
> > > +	void *device_private_page_owner;
> > > +	u64 mm_start;
> > > +	u64 mm_range;
> > > +	u64 notifier_size;
> > > +	const struct drm_gpusvm_ops *ops;
> > > +	const u64 *chunk_sizes;
> > > +	int num_chunks;
> > > +	struct rw_semaphore notifier_lock;
> > > +	struct workqueue_struct *zdd_wq;
> > > +	struct rb_root_cached root;
> > > +	struct list_head notifier_list;
> > > +};
> > > +
> > > +/**
> > > + * struct drm_gpusvm_ctx - DRM GPU SVM context
> > > + *
> > > + * @mmap_locked: mmap lock is locked
> > > + * @trylock_mmap: trylock mmap lock, used to avoid locking
> > > inversions
> > > + *                (e.g. dma-resv -> mmap lock)
> > > + * @in_notifier: entering from a MMU notifier
> > > + * @read_only: operating on read-only memory
> > > + * @vram_possible: possible to use VRAM
> > > + * @prefault: prefault pages
> > > + *
> > > + * Context that DRM GPUSVM is operating in (i.e. user arguments).
> > > + */
> > > +struct drm_gpusvm_ctx {
> > > +	u32 mmap_locked :1;
> > > +	u32 trylock_mmap :1;
> > > +	u32 in_notifier :1;
> > > +	u32 read_only :1;
> > > +	u32 vram_possible :1;
> > > +	u32 prefault :1;
> > > +};
> > > +
> > > +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> > > +		    const char *name, struct drm_device *drm,
> > > +		    struct mm_struct *mm, void
> > > *device_private_page_owner,
> > > +		    u64 mm_start, u64 mm_range, u64 notifier_size,
> > > +		    const struct drm_gpusvm_ops *ops,
> > > +		    const u64 *chunk_sizes, int num_chunks);
> > > +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm);
> > > +void drm_gpusvm_free(struct drm_gpusvm *gpusvm);
> > > +
> > > +struct drm_gpusvm_range *
> > > +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64
> > > fault_addr,
> > > +				u64 gpuva_start, u64 gpuva_end,
> > > +				const struct drm_gpusvm_ctx *ctx);
> > > +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> > > +			     struct drm_gpusvm_range *range);
> > > +
> > > +struct drm_gpusvm_range *
> > > +drm_gpusvm_range_get(struct drm_gpusvm_range *range);
> > > +void drm_gpusvm_range_put(struct drm_gpusvm_range *range);
> > > +
> > > +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> > > +				  struct drm_gpusvm_range *range);
> > > +
> > > +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> > > +			       struct drm_gpusvm_range *range,
> > > +			       const struct drm_gpusvm_ctx *ctx);
> > > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> > > +				  struct drm_gpusvm_range *range,
> > > +				  const struct drm_gpusvm_ctx *ctx);
> > > +
> > > +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> > > +			       struct drm_gpusvm_range *range,
> > > +			       void *vram_allocation,
> > > +			       const struct drm_gpusvm_ctx *ctx);
> > > +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > > +			       struct drm_gpusvm_range *range,
> > > +			       const struct drm_gpusvm_ctx *ctx);
> > > +
> > > +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void);
> > > +
> > > +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start,
> > > u64 end);
> > > +
> > > +struct drm_gpusvm_range *
> > > +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64
> > > start, u64 end);
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_lock - Lock GPU SVM notifier
> > > + * @gpusvm__: Pointer to the GPU SVM structure.
> > > + *
> > > + * Abstract client usage GPU SVM notifier lock, take lock
> > > + */
> > > +#define drm_gpusvm_notifier_lock(gpusvm__)	\
> > > +	down_read(&(gpusvm__)->notifier_lock)
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_unlock - Unlock GPU SVM notifier
> > > + * @gpusvm__: Pointer to the GPU SVM structure.
> > > + *
> > > + * Abstract client usage GPU SVM notifier lock, drop lock
> > > + */
> > > +#define drm_gpusvm_notifier_unlock(gpusvm__)	\
> > > +	up_read(&(gpusvm__)->notifier_lock)
> > > +
> > > +/**
> > > + * __drm_gpusvm_range_next - Get the next GPU SVM range in the list
> > > + * @range: a pointer to the current GPU SVM range
> > > + *
> > > + * Return: A pointer to the next drm_gpusvm_range if available, or
> > > NULL if the
> > > + *         current range is the last one or if the input range is
> > > NULL.
> > > + */
> > > +static inline struct drm_gpusvm_range *
> > > +__drm_gpusvm_range_next(struct drm_gpusvm_range *range)
> > > +{
> > > +	if (range && !list_is_last(&range->rb.entry,
> > > +				   &range->notifier->range_list))
> > > +		return list_next_entry(range, rb.entry);
> > > +
> > > +	return NULL;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_for_each_range - Iterate over GPU SVM ranges in a
> > > notifier
> > > + * @range__: Iterator variable for the ranges. If set, it indicates
> > > the start of
> > > + *	     the iterator. If NULL, call drm_gpusvm_range_find() to
> > > get the range.
> > > + * @notifier__: Pointer to the GPU SVM notifier
> > > + * @start__: Start address of the range
> > > + * @end__: End address of the range
> > > + *
> > > + * This macro is used to iterate over GPU SVM ranges in a notifier.
> > > It is safe
> > > + * to use while holding the driver SVM lock or the notifier lock.
> > > + */
> > > +#define drm_gpusvm_for_each_range(range__, notifier__, start__,
> > > end__)	\
> > > +	for ((range__) = (range__)
> > > ?:					\
> > > +	     drm_gpusvm_range_find((notifier__), (start__),
> > > (end__));	\
> > > +	     (range__) && (range__->va.start <
> > > (end__));		\
> > > +	     (range__) = __drm_gpusvm_range_next(range__))
> > > +
> > > +/**
> > > + * drm_gpusvm_range_set_unmapped - Mark a GPU SVM range as unmapped
> > > + * @range: Pointer to the GPU SVM range structure.
> > > + * @mmu_range: Pointer to the MMU notifier range structure.
> > > + *
> > > + * This function marks a GPU SVM range as unmapped and sets the
> > > partial_unmap flag
> > > + * if the range partially falls within the provided MMU notifier
> > > range.
> > > + */
> > > +static inline void
> > > +drm_gpusvm_range_set_unmapped(struct drm_gpusvm_range *range,
> > > +			      const struct mmu_notifier_range
> > > *mmu_range)
> > > +{
> > > +	lockdep_assert_held_write(&range->gpusvm->notifier_lock);
> > > +
> > > +	range->flags.unmapped = true;
> > > +	if (range->va.start < mmu_range->start ||
> > > +	    range->va.end > mmu_range->end)
> > > +		range->flags.partial_unmap = true;
> > > +}
> > > +
> > > +#endif /* __DRM_GPUSVM_H__ */
> > 
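
FWIW, for anyone skimming the header above: a minimal sketch of how a
driver might wire this up (purely illustrative, hypothetical names,
untested), using only the interfaces declared in drm_gpusvm.h:

	/* Only .invalidate is required; the other ops are optional or only
	 * needed once VRAM migration is supported. */
	static void my_invalidate(struct drm_gpusvm *gpusvm,
				  struct drm_gpusvm_notifier *notifier,
				  const struct mmu_notifier_range *mmu_range)
	{
		/* Zap GPU PTEs covering [mmu_range->start, mmu_range->end). */
	}

	static const struct drm_gpusvm_ops my_ops = {
		.invalidate = my_invalidate,
	};

	/* Powers of 2 in descending order, last entry SZ_4K. */
	static const u64 my_chunk_sizes[] = { SZ_2M, SZ_64K, SZ_4K };

	/* At driver VM creation, assuming the VM embeds a struct drm_gpusvm: */
	err = drm_gpusvm_init(&vm->svm, "my-svm", drm, current->mm,
			      NULL /* device_private_page_owner */,
			      0, 1ull << 47 /* CPU VA range */, SZ_512M,
			      &my_ops, my_chunk_sizes,
			      ARRAY_SIZE(my_chunk_sizes));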

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory
  2024-08-29 17:45     ` Matthew Brost
  2024-08-29 18:13       ` Matthew Brost
@ 2024-08-29 19:18       ` Thomas Hellström
  2024-08-29 20:56         ` Matthew Brost
  1 sibling, 1 reply; 100+ messages in thread
From: Thomas Hellström @ 2024-08-29 19:18 UTC (permalink / raw)
  To: Matthew Brost
  Cc: intel-xe, dri-devel, airlied, christian.koenig, matthew.auld,
	daniel

Hi, Matthew,

On Thu, 2024-08-29 at 17:45 +0000, Matthew Brost wrote:
> On Thu, Aug 29, 2024 at 11:16:49AM +0200, Thomas Hellström wrote:
> > Hi, Matt. 
> > 
> > Some initial design comments / questions:
> > 
> > On Tue, 2024-08-27 at 19:48 -0700, Matthew Brost wrote:
> > > This patch introduces support for GPU Shared Virtual Memory (SVM)
> > > in
> > > the
> > > Direct Rendering Manager (DRM) subsystem. SVM allows for seamless
> > > sharing of memory between the CPU and GPU, enhancing performance
> > > and
> > > flexibility in GPU computing tasks.
> > > 
> > > The patch adds the necessary infrastructure for SVM, including
> > > data
> > > structures and functions for managing SVM ranges and notifiers.
> > > It
> > > also
> > > provides mechanisms for allocating, deallocating, and migrating
> > > memory
> > > regions between system RAM and GPU VRAM.
> > > 
> > > This mid-layer is largely inspired by GPUVM.
> > > 
> > > Cc: Dave Airlie <airlied@redhat.com>
> > > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > Cc: Christian König <christian.koenig@amd.com>
> > > Cc: <dri-devel@lists.freedesktop.org>
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/Makefile     |    3 +-
> > >  drivers/gpu/drm/xe/drm_gpusvm.c | 2174
> > > +++++++++++++++++++++++++++++++
> > >  drivers/gpu/drm/xe/drm_gpusvm.h |  415 ++++++
> > >  3 files changed, 2591 insertions(+), 1 deletion(-)
> > >  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.c
> > >  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.h
> > > 
> > > diff --git a/drivers/gpu/drm/xe/Makefile
> > > b/drivers/gpu/drm/xe/Makefile
> > > index b9670ae09a9e..b8fc2ee58f1a 100644
> > > --- a/drivers/gpu/drm/xe/Makefile
> > > +++ b/drivers/gpu/drm/xe/Makefile
> > > @@ -25,7 +25,8 @@ $(obj)/generated/%_wa_oob.c
> > > $(obj)/generated/%_wa_oob.h: $(obj)/xe_gen_wa_oob \
> > >  
> > >  # core driver code
> > >  
> > > -xe-y += xe_bb.o \
> > > +xe-y += drm_gpusvm.o \
> > > +	xe_bb.o \
> > >  	xe_bo.o \
> > >  	xe_bo_evict.o \
> > >  	xe_devcoredump.o \
> > > diff --git a/drivers/gpu/drm/xe/drm_gpusvm.c
> > > b/drivers/gpu/drm/xe/drm_gpusvm.c
> > > new file mode 100644
> > > index 000000000000..fc1e44e6ae72
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/xe/drm_gpusvm.c
> > > @@ -0,0 +1,2174 @@
> > > +// SPDX-License-Identifier: MIT
> > > +/*
> > > + * Copyright © 2024 Intel Corporation
> > > + *
> > > + * Authors:
> > > + *     Matthew Brost <matthew.brost@intel.com>
> > > + */
> > > +
> > > +#include <linux/dma-mapping.h>
> > > +#include <linux/interval_tree_generic.h>
> > > +#include <linux/hmm.h>
> > > +#include <linux/memremap.h>
> > > +#include <linux/migrate.h>
> > > +#include <linux/mm_types.h>
> > > +#include <linux/pagemap.h>
> > > +#include <linux/slab.h>
> > > +
> > > +#include <drm/drm_device.h>
> > > +#include "drm_gpusvm.h"
> > > +
> > > +/**
> > > + * DOC: Overview
> > > + *
> > > + * GPU Shared Virtual Memory (GPU SVM) layer for the Direct
> > > Rendering Manager (DRM)
> > > + *
> > > + * The GPU SVM layer is a component of the DRM framework
> > > designed to
> > > manage shared
> > > + * virtual memory between the CPU and GPU. It enables efficient
> > > data
> > > exchange and
> > > + * processing for GPU-accelerated applications by allowing
> > > memory
> > > sharing and
> > > + * synchronization between the CPU's and GPU's virtual address
> > > spaces.
> > > + *
> > > + * Key GPU SVM Components:
> > > + * - Notifiers: Used for tracking memory intervals
> > > and
> > > notifying the
> > > + *		GPU of changes, notifiers are sized based on a
> > > GPU
> > > SVM
> > > + *		initialization parameter, with a recommendation
> > > of
> > > 512M or
> > > + *		larger. They maintain a Red-Black tree and a
> > > list of
> > > ranges that
> > > + *		fall within the notifier interval. Notifiers are
> > > tracked within
> > > + *		a GPU SVM Red-Black tree and list and are
> > > dynamically inserted
> > > + *		or removed as ranges within the interval are
> > > created
> > > or
> > > + *		destroyed.
> > 
> > What is the benefit of this extra layer compared to direct
> > insertion of
> > ranges using mmu_interval_notifier_insert?
> > 
> > IIRC the argument made previously about having wide notifiers was
> > that
> > the rb tree lookups inside the core were costly and if there were
> > only
> > a few, then the rb tree lookups within a notifier range could be
> > replaced with the page-table radix-tree-like lookup, so each lookup
> > complexity would be O(log(n_notifiers) + page_table_depth).
> > 
> > But now we have first an rb-tree lookup in the core and then an rb-
> > tree
> > lookup within each notifier yielding O(log(n_ranges))
> > 
> > I can see a small benefit in that inserting directly into the core
> > rb-
> > tree will block pending ongoing invalidations, but at a cost of an
> > extra multiplexing layer.
> > 
> 
> So when the notifier is triggered the search is a smaller range. In a
> perfect world eventually I'd like to drop the SVM range completely.
> There are a lot of changes required in Xe to make that possible, and I'm
> not entirely convinced it is possible or that the ROI is worth it
> (additional complexity vs. perf benefit). For now, this was a relatively
> simple way to get SVM working (mirrors both AMD's and Nvidia's
> implementations wrt having a range concept) but is also flexible in the
> sense that the notifier
> size can be easily tweaked via a modparam [1] following Jason's
> suggestion of larger notifiers.
> 
> [1]
> https://patchwork.freedesktop.org/patch/611007/?series=137870&rev=1

What I meant was the core is already implementing the "one notifier for
the whole range", since your notifier duplicates the
mmu_interval_notifier functionality.

The mmu_interval_notifier first does an rbtree search to get to the
notifier, and then drm_gpusvm does an rbtree search to get to the
range.

If the svm notifier layer is skipped, mmu_interval_notifier has to
perform a wider rbtree search to get to the range. The point is, the
complexity is the same for both approaches so there is no point in
adding an svm notifier layer for that reason. The width of the notifier
just adjusts the relative size of the two rbtree searches, so from that
point of view the drm_gpusvm does not offer any benefit over inserting
the ranges into the mmu_interval_notifier directly (except that the
mmu_interval_notifier is slightly more heavyweight).
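
(Back-of-the-envelope, assuming ranges are spread evenly across
notifiers: log2(n_notifiers) + log2(n_ranges / n_notifiers) =
log2(n_ranges), i.e. the two stacked rbtree searches have the same total
depth as one flat search over all ranges.)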

As I understand it, Jason's comments were based on the assumption that
the drm_gpusvm search would be radix tree based, and hence with less
complexity than the rbtree search, and therefore providing a clear
benefit the larger they could be.

I.e. just calling something similar to xe_vm_invalidate_xxx over the
whole range, which will just skip subranges that are not populated.
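
Roughly something like this (illustrative only, made-up helper name):

	static void driver_invalidate(struct drm_gpusvm *gpusvm,
				      struct drm_gpusvm_notifier *notifier,
				      const struct mmu_notifier_range *mmu_range)
	{
		/* Clamp to the notifier interval and zap in one go; the GPU
		 * page-table walk skips holes that were never populated, so
		 * no per-range rbtree lookup is needed. */
		driver_zap_ptes(gpusvm,
				max_t(u64, mmu_range->start,
				      notifier->interval.start),
				min_t(u64, mmu_range->end,
				      notifier->interval.end));
	}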

/Thomas

> 
> > > + * - Ranges: Represent memory ranges mapped in a DRM device and
> > > managed
> > > + *	     by GPU SVM. They are sized based on an array of
> > > chunk
> > > sizes, which
> > > + *	     is a GPU SVM initialization parameter, and the CPU
> > > address space.
> > > + *	     Upon GPU fault, the largest aligned chunk that fits
> > > within the
> > > + *	     faulting CPU address space is chosen for the range
> > > size. Ranges are
> > > + *	     expected to be dynamically allocated on GPU fault
> > > and
> > > removed on an
> > > + *	     MMU notifier UNMAP event. As mentioned above,
> > > ranges
> > > are tracked in
> > > + *	     a notifier's Red-Black tree.
> > 
> > How do ranges and chunks map to
> >  
> > a) Prefaulting granularity
> > b) Migration granularity?
> > 
> > > + * - Operations: Define the interface for driver-specific SVM
> > > operations such as
> > > + *		 allocation, page collection, migration,
> > > invalidations, and VRAM
> > > + *		 release.
> > > + *
> > > + * This layer provides interfaces for allocating, mapping,
> > > migrating, and
> > > + * releasing memory ranges between the CPU and GPU. It handles
> > > all
> > > core memory
> > > + * management interactions (DMA mapping, HMM, and migration) and
> > > provides
> > > + * driver-specific virtual functions (vfuncs). This
> > > infrastructure
> > > is sufficient
> > > + * to build the expected driver components for an SVM
> > > implementation
> > > as detailed
> > > + * below.
> > > + *
> > > + * Expected Driver Components:
> > > + * - GPU page fault handler: Used to create ranges and notifiers
> > > based on the
> > > + *			     fault address, optionally migrate
> > > the
> > > range to
> > > + *			     VRAM, and create GPU bindings.
> > > + * - Garbage collector: Used to destroy GPU bindings for ranges.
> > > Ranges are
> > > + *			expected to be added to the garbage
> > > collector upon
> > > + *			MMU_NOTIFY_UNMAP event.
> > > + */
> > > +
> > > +/**
> > > + * DOC: Locking
> > > + *
> > > + * GPU SVM handles locking for core MM interactions, i.e., it
> > > locks/unlocks the
> > > + * mmap lock as needed. Alternatively, if the driver prefers to
> > > handle the mmap
> > > + * lock itself, a 'locked' argument is provided to the functions
> > > that require
> > > + * the mmap lock. This option may be useful for drivers that
> > > need to
> > > call into
> > > + * GPU SVM while also holding a dma-resv lock, thus preventing
> > > locking
> > > + * inversions between the mmap and dma-resv locks.
> > > + *
> > > + * GPU SVM introduces a global notifier lock, which safeguards
> > > the
> > > notifier's
> > > + * range RB tree and list, as well as the range's DMA mappings
> > > and
> > > sequence
> > > + * number. GPU SVM manages all necessary locking and unlocking
> > > operations,
> > > + * except for the recheck of the range's sequence number
> > > + * (mmu_interval_read_retry) when the driver is committing GPU
> > > bindings. This
> > > + * lock corresponds to the 'driver->update' lock mentioned in
> > > the
> > > HMM
> > > + * documentation (TODO: Link). Future revisions may transition
> > > from
> > > a GPU SVM
> > > + * global lock to a per-notifier lock if finer-grained locking
> > > is
> > > deemed
> > > + * necessary.
> > > + *
> > > + * In addition to the locking mentioned above, the driver should
> > > implement a
> > > + * lock to safeguard core GPU SVM function calls that modify
> > > state,
> > > such as
> > > + * drm_gpusvm_range_find_or_insert and drm_gpusvm_range_remove.
> > > Alternatively,
> > > + * these core functions can be called within a single kernel
> > > thread,
> > > for
> > > + * instance, using an ordered work queue. This lock is denoted
> > > as
> > > + * 'driver_svm_lock' in code examples.
> > > + */
> > > +
> > > +/**
> > > + * DOC: Migration
> > > + *
> > > + * The migration support is quite simple, allowing migration
> > > between
> > > SRAM and
> > > + * VRAM at the range granularity. For example, GPU SVM currently
> > > does not
> > > + * support mixing SRAM and VRAM pages within a range. This means
> > > that upon GPU
> > > + * fault, the entire range can be migrated to VRAM, and upon CPU
> > > fault, the
> > > + * entire range is migrated to SRAM.
> > > + *
> > > + * The reasoning for only supporting range granularity is as
> > > follows: it
> > > + * simplifies the implementation, and range sizes are driver-
> > > defined
> > > and should
> > > + * be relatively small.
> > > + */
> > > +
> > > +/**
> > > + * DOC: Partial Unmapping of Ranges
> > > + *
> > > + * Partial unmapping of ranges (e.g., 1M out of 2M is unmapped
> > > by
> > > CPU resulting
> > > + * in MMU_NOTIFY_UNMAP event) presents several challenges, with
> > > the
> > > main one
> > > + * being that a subset of the range still has CPU and GPU
> > > mappings.
> > > If the
> > > + * backing store for the range is in VRAM, a subset of the
> > > backing
> > > store has
> > > + * references. One option would be to split the range and VRAM
> > > backing store,
> > > + * but the implementation for this would be quite complicated.
> > > Given
> > > that
> > > + * partial unmappings are rare and driver-defined range sizes
> > > are
> > > relatively
> > > + * small, GPU SVM does not support splitting of ranges.
> > > + *
> > > + * With no support for range splitting, upon partial unmapping
> > > of a
> > > range, the
> > > + * driver is expected to invalidate and destroy the entire
> > > range. If
> > > the range
> > > + * has VRAM as its backing, the driver is also expected to
> > > migrate
> > > any remaining
> > > + * pages back to SRAM.
> > 
> > So what happens if we get a one-page invalidation, say protection
> > change event, or NUMA accounting event, in the middle of a range?
> > Can
> > we unmap just that single gpu pte covering that range, that is, how
> > do
> > the ranges map to invalidation granularity? Does this differ
> > between
> > igfx and dgfx?
> 
> Well the idea of chunks is that ranges should be 1 GPU page (the chunk
> array in Xe is 4k, 64k, and 2M). The design is flexible enough that this
> doesn't have to be true, but it is optimized for the assumption that each
> range is most likely 1 GPU page. If this isn't true, then all GPU pages
> in the range are invalidated, which isn't ideal but keeps it simple,
> which IMO far outweighs the potential benefits. In theory a driver could
> implement splitting / partial invalidations too with a couple of updates
> to GPUSVM, but that would likely largely be a driver implementation
> rather than GPUSVM.
> 
> No difference between igfx an dgfx.
> 
> You bring up a good point about protection changes; I likely haven't
> fully gotten that part of the implementation correct either. I can add
> this
> to my TODO list and also update my IGTs to do things like this.
> 
> Matt
> 
> > 
> > Thanks,
> > Thomas
> > 
> > 
> > 
> > 
> > > + */
> > > +
> > > +/**
> > > + * DOC: Examples
> > > + *
> > > + * This section provides two examples of how to build the
> > > expected
> > > driver
> > > + * components: the GPU page fault handler and the garbage
> > > collector.
> > > A third
> > > + * example demonstrates a sample invalidation driver vfunc.
> > > + *
> > > + * The generic code provided does not include logic for complex
> > > migration
> > > + * policies, optimized invalidations, or other potentially
> > > required
> > > driver
> > > + * locking (e.g., DMA-resv locks).
> > > + *
> > > + * 1) GPU page fault handler
> > > + *
> > > + *	int driver_bind_range(struct drm_gpusvm *gpusvm, struct
> > > drm_gpusvm_range *range)
> > > + *	{
> > > + *		int err = 0;
> > > + *
> > > + *		driver_alloc_and_setup_memory_for_bind(gpusvm,
> > > range);
> > > + *
> > > + *		drm_gpusvm_notifier_lock(gpusvm);
> > > + *		if (drm_gpusvm_range_pages_valid(range))
> > > + *			driver_commit_bind(gpusvm, range);
> > > + *		else
> > > + *			err = -EAGAIN;
> > > + *		drm_gpusvm_notifier_unlock(gpusvm);
> > > + *
> > > + *		return err;
> > > + *	}
> > > + *
> > > + *	int driver_gpu_fault(struct drm_gpusvm *gpusvm, u64
> > > fault_addr,
> > > + *			     u64 gpuva_start, u64 gpuva_end)
> > > + *	{
> > > + *		struct drm_gpusvm_ctx ctx = {};
> > > + *		int err;
> > > + *
> > > + *		driver_svm_lock();
> > > + *	retry:
> > > + *		// Always process UNMAPs first so view of GPU
> > > SVM
> > > ranges is current
> > > + *		driver_garbage_collector(gpusvm);
> > > + *
> > > + *		range = drm_gpusvm_range_find_or_insert(gpusvm,
> > > fault_addr,
> > > +
> > > *							gpuva_start,
> > > gpuva_end,
> > > + *						        &ctx);
> > > + *		if (IS_ERR(range)) {
> > > + *			err = PTR_ERR(range);
> > > + *			goto unlock;
> > > + *		}
> > > + *
> > > + *		if (driver_migration_policy(range)) {
> > > + *			bo = driver_alloc_bo();
> > > + *			err = drm_gpusvm_migrate_to_vram(gpusvm,
> > > range, bo, &ctx);
> > > + *			if (err)	// CPU mappings may have
> > > changed
> > > + *				goto retry;
> > > + *		}
> > > + *
> > > + *		err = drm_gpusvm_range_get_pages(gpusvm, range,
> > > &ctx);
> > > + *		if (err == -EFAULT || err == -EPERM)	// CPU
> > > mappings changed
> > > + *			goto retry;
> > > + *		else if (err)
> > > + *			goto unlock;
> > > + *
> > > + *		err = driver_bind_range(gpusvm, range);
> > > + *		if (err == -EAGAIN)	// CPU mappings changed
> > > + *			goto retry
> > > + *
> > > + *	unlock:
> > > + *		driver_svm_unlock();
> > > + *		return err;
> > > + *	}
> > > + *
> > > + * 2) Garbage Collector.
> > > + *
> > > + *	void __driver_garbage_collector(struct drm_gpusvm
> > > *gpusvm,
> > > + *					struct drm_gpusvm_range
> > > *range)
> > > + *	{
> > > + *		struct drm_gpusvm_ctx ctx = {};
> > > + *
> > > + *		assert_driver_svm_locked(gpusvm);
> > > + *
> > > + *		// Partial unmap, migrate any remaining VRAM
> > > pages
> > > back to SRAM
> > > + *		if (range->flags.partial_unmap)
> > > + *			drm_gpusvm_migrate_to_sram(gpusvm,
> > > range,
> > > &ctx);
> > > + *
> > > + *		driver_unbind_range(range);
> > > + *		drm_gpusvm_range_remove(gpusvm, range);
> > > + *	}
> > > + *
> > > + *	void driver_garbage_collector(struct drm_gpusvm *gpusvm)
> > > + *	{
> > > + *		assert_driver_svm_locked(gpusvm);
> > > + *
> > > + *		for_each_range_in_garbage_collector(gpusvm,
> > > range)
> > > + *			__driver_garbage_collector(gpusvm,
> > > range);
> > > + *	}
> > > + *
> > > + * 3) Invalidation driver vfunc.
> > > + *
> > > + *	void driver_invalidation(struct drm_gpusvm *gpusvm,
> > > + *				 struct drm_gpusvm_notifier
> > > *notifier,
> > > + *				 const struct mmu_notifier_range
> > > *mmu_range)
> > > + *	{
> > > + *		struct drm_gpusvm_ctx ctx = { .in_notifier =
> > > true,
> > > };
> > > + *		struct drm_gpusvm_range *range = NULL;
> > > + *
> > > + *		driver_invalidate_device_tlb(gpusvm, mmu_range-
> > > > start, mmu_range->end);
> > > + *
> > > + *		drm_gpusvm_for_each_range(range, notifier,
> > > mmu_range->start,
> > > + *					  mmu_range->end) {
> > > + *			drm_gpusvm_range_unmap_pages(gpusvm,
> > > range,
> > > &ctx);
> > > + *
> > > + *			if (mmu_range->event !=
> > > MMU_NOTIFY_UNMAP)
> > > + *				continue;
> > > + *
> > > + *			drm_gpusvm_range_set_unmapped(range,
> > > mmu_range);
> > > + *			driver_garbage_collector_add(gpusvm,
> > > range);
> > > + *		}
> > > + *	}
> > > + */
> > > +
> > > +#define DRM_GPUSVM_RANGE_START(_range)	((_range)->va.start)
> > > +#define DRM_GPUSVM_RANGE_END(_range)	((_range)->va.end - 1)
> > > +INTERVAL_TREE_DEFINE(struct drm_gpusvm_range, rb.node, u64,
> > > rb.__subtree_last,
> > > +		     DRM_GPUSVM_RANGE_START,
> > > DRM_GPUSVM_RANGE_END,
> > > +		     static __maybe_unused, range);
> > > +
> > > +#define DRM_GPUSVM_NOTIFIER_START(_notifier)	((_notifier)-
> > > > interval.start)
> > > +#define DRM_GPUSVM_NOTIFIER_END(_notifier)	((_notifier)-
> > > > interval.end - 1)
> > > +INTERVAL_TREE_DEFINE(struct drm_gpusvm_notifier, rb.node, u64,
> > > +		     rb.__subtree_last,
> > > DRM_GPUSVM_NOTIFIER_START,
> > > +		     DRM_GPUSVM_NOTIFIER_END, static
> > > __maybe_unused,
> > > notifier);
> > > +
> > > +/**
> > > + * npages_in_range() - Calculate the number of pages in a given
> > > range
> > > + * @start__: The start address of the range
> > > + * @end__: The end address of the range
> > > + *
> > > + * This macro calculates the number of pages in a given memory
> > > range,
> > > + * specified by the start and end addresses. It divides the
> > > difference
> > > + * between the end and start addresses by the page size
> > > (PAGE_SIZE)
> > > to
> > > + * determine the number of pages in the range.
> > > + *
> > > + * Return: The number of pages in the specified range.
> > > + */
> > > +#define npages_in_range(start__, end__)	\
> > > +	(((end__) - (start__)) >> PAGE_SHIFT)
> > > +
> > > +/**
> > > + * struct drm_gpusvm_zdd - GPU SVM zone device data
> > > + *
> > > + * @refcount: Reference count for the zdd
> > > + * @destroy_work: Work structure for asynchronous zdd
> > > destruction
> > > + * @range: Pointer to the GPU SVM range
> > > + * @vram_allocation: Driver-private pointer to the VRAM
> > > allocation
> > > + *
> > > + * This structure serves as a generic wrapper installed in
> > > + * page->zone_device_data. It provides infrastructure for
> > > looking up
> > > a range
> > > + * upon CPU page fault and asynchronously releasing VRAM once
> > > the
> > > CPU has no
> > > + * page references. Asynchronous release is useful because CPU
> > > page
> > > references
> > > + * can be dropped in IRQ contexts, while releasing VRAM likely
> > > requires sleeping
> > > + * locks.
> > > + */
> > > +struct drm_gpusvm_zdd {
> > > +	struct kref refcount;
> > > +	struct work_struct destroy_work;
> > > +	struct drm_gpusvm_range *range;
> > > +	void *vram_allocation;
> > > +};
> > > +
> > > +/**
> > > + * drm_gpusvm_zdd_destroy_work_func - Work function for
> > > destroying a
> > > zdd
> > > + * @w: Pointer to the work_struct
> > > + *
> > > + * This function releases VRAM, puts GPU SVM range, and frees
> > > zdd.
> > > + */
> > > +static void drm_gpusvm_zdd_destroy_work_func(struct work_struct
> > > *w)
> > > +{
> > > +	struct drm_gpusvm_zdd *zdd =
> > > +		container_of(w, struct drm_gpusvm_zdd,
> > > destroy_work);
> > > +	struct drm_gpusvm_range *range = zdd->range;
> > > +	struct drm_gpusvm *gpusvm = range->gpusvm;
> > > +
> > > +	if (gpusvm->ops->vram_release && zdd->vram_allocation)
> > > +		gpusvm->ops->vram_release(zdd->vram_allocation);
> > > +	drm_gpusvm_range_put(range);
> > > +	kfree(zdd);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_zdd_alloc - Allocate a zdd structure.
> > > + * @range: Pointer to the GPU SVM range.
> > > + *
> > > + * This function allocates and initializes a new zdd structure.
> > > It
> > > sets up the
> > > + * reference count, initializes the destroy work, and links the
> > > provided GPU SVM
> > > + * range.
> > > + *
> > > + * Returns:
> > > + * Pointer to the allocated zdd on success, ERR_PTR() on
> > > failure.
> > > + */
> > > +static struct drm_gpusvm_zdd *
> > > +drm_gpusvm_zdd_alloc(struct drm_gpusvm_range *range)
> > > +{
> > > +	struct drm_gpusvm_zdd *zdd;
> > > +
> > > +	zdd = kmalloc(sizeof(*zdd), GFP_KERNEL);
> > > +	if (!zdd)
> > > +		return NULL;
> > > +
> > > +	kref_init(&zdd->refcount);
> > > +	INIT_WORK(&zdd->destroy_work,
> > > drm_gpusvm_zdd_destroy_work_func);
> > > +	zdd->range = drm_gpusvm_range_get(range);
> > > +	zdd->vram_allocation = NULL;
> > > +
> > > +	return zdd;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_zdd_get - Get a reference to a zdd structure.
> > > + * @zdd: Pointer to the zdd structure.
> > > + *
> > > + * This function increments the reference count of the provided
> > > zdd
> > > structure.
> > > + *
> > > + * Returns: Pointer to the zdd structure.
> > > + */
> > > +static struct drm_gpusvm_zdd *drm_gpusvm_zdd_get(struct
> > > drm_gpusvm_zdd *zdd)
> > > +{
> > > +	kref_get(&zdd->refcount);
> > > +	return zdd;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_zdd_destroy - Destroy a zdd structure.
> > > + * @ref: Pointer to the reference count structure.
> > > + *
> > > + * This function queues the destroy_work of the zdd for
> > > asynchronous
> > > destruction.
> > > + */
> > > +static void drm_gpusvm_zdd_destroy(struct kref *ref)
> > > +{
> > > +	struct drm_gpusvm_zdd *zdd =
> > > +		container_of(ref, struct drm_gpusvm_zdd,
> > > refcount);
> > > +	struct drm_gpusvm *gpusvm = zdd->range->gpusvm;
> > > +
> > > +	queue_work(gpusvm->zdd_wq, &zdd->destroy_work);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_zdd_put - Put a zdd reference.
> > > + * @zdd: Pointer to the zdd structure.
> > > + *
> > > + * This function decrements the reference count of the provided
> > > zdd
> > > structure
> > > + * and schedules its destruction if the count drops to zero.
> > > + */
> > > +static void drm_gpusvm_zdd_put(struct drm_gpusvm_zdd *zdd)
> > > +{
> > > +	kref_put(&zdd->refcount, drm_gpusvm_zdd_destroy);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_find - Find GPU SVM range from GPU SVM
> > > notifier
> > > + * @notifier: Pointer to the GPU SVM notifier structure.
> > > + * @start: Start address of the range
> > > + * @end: End address of the range
> > > + *
> > > + * Return: A pointer to the drm_gpusvm_range if found or NULL
> > > + */
> > > +struct drm_gpusvm_range *
> > > +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64
> > > start, u64 end)
> > > +{
> > > +	return range_iter_first(&notifier->root, start, end -
> > > 1);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_for_each_range_safe - Safely iterate over GPU SVM
> > > ranges in a notifier
> > > + * @range__: Iterator variable for the ranges
> > > + * @next__: Iterator variable for the ranges temporary storage
> > > + * @notifier__: Pointer to the GPU SVM notifier
> > > + * @start__: Start address of the range
> > > + * @end__: End address of the range
> > > + *
> > > + * This macro is used to iterate over GPU SVM ranges in a
> > > notifier
> > > while
> > > + * removing ranges from it.
> > > + */
> > > +#define drm_gpusvm_for_each_range_safe(range__, next__,
> > > notifier__,
> > > start__, end__)	\
> > > +	for ((range__) = drm_gpusvm_range_find((notifier__),
> > > (start__), (end__)),	\
> > > +	     (next__) =
> > > __drm_gpusvm_range_next(range__);				\
> > > +	     (range__) && (range__->va.start <
> > > (end__));				\
> > > +	     (range__) = (next__), (next__) =
> > > __drm_gpusvm_range_next(range__))
> > > +
> > > +/**
> > > + * __drm_gpusvm_notifier_next - get the next drm_gpusvm_notifier
> > > in
> > > the list
> > > + * @notifier: a pointer to the current drm_gpusvm_notifier
> > > + *
> > > + * Return: A pointer to the next drm_gpusvm_notifier if
> > > available,
> > > or NULL if
> > > + *         the current notifier is the last one or if the input
> > > notifier is
> > > + *         NULL.
> > > + */
> > > +static struct drm_gpusvm_notifier *
> > > +__drm_gpusvm_notifier_next(struct drm_gpusvm_notifier *notifier)
> > > +{
> > > +	if (notifier && !list_is_last(&notifier->rb.entry,
> > > +				      &notifier->gpusvm-
> > > > notifier_list))
> > > +		return list_next_entry(notifier, rb.entry);
> > > +
> > > +	return NULL;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_for_each_notifier - Iterate over GPU SVM notifiers
> > > in
> > > a gpusvm
> > > + * @notifier__: Iterator variable for the notifiers
> > > + * @gpusvm__: Pointer to the GPU SVM structure
> > > + * @start__: Start address of the notifier
> > > + * @end__: End address of the notifier
> > > + *
> > > + * This macro is used to iterate over GPU SVM notifiers in a
> > > gpusvm.
> > > + */
> > > +#define drm_gpusvm_for_each_notifier(notifier__, gpusvm__,
> > > start__,
> > > end__)		\
> > > +	for ((notifier__) = notifier_iter_first(&(gpusvm__)-
> > > >root,
> > > (start__), (end__) - 1);	\
> > > +	     (notifier__) && (notifier__->interval.start <
> > > (end__));			\
> > > +	     (notifier__) =
> > > __drm_gpusvm_notifier_next(notifier__))
> > > +
> > > +/**
> > > + * drm_gpusvm_for_each_notifier_safe - Safely iterate over GPU
> > > SVM
> > > notifiers in a gpusvm
> > > + * @notifier__: Iterator variable for the notifiers
> > > + * @next__: Iterator variable for the notifiers temporary storage
> > > + * @gpusvm__: Pointer to the GPU SVM structure
> > > + * @start__: Start address of the notifier
> > > + * @end__: End address of the notifier
> > > + *
> > > + * This macro is used to iterate over GPU SVM notifiers in a
> > > gpusvm
> > > while
> > > + * removing notifiers from it.
> > > + */
> > > +#define drm_gpusvm_for_each_notifier_safe(notifier__, next__,
> > > gpusvm__, start__, end__)	\
> > > +	for ((notifier__) = notifier_iter_first(&(gpusvm__)-
> > > >root,
> > > (start__), (end__) - 1),	\
> > > +	     (next__) =
> > > __drm_gpusvm_notifier_next(notifier__);				\
> > > +	     (notifier__) && (notifier__->interval.start <
> > > (end__));			\
> > > +	     (notifier__) = (next__), (next__) =
> > > __drm_gpusvm_notifier_next(notifier__))
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_invalidate - Invalidate a GPU SVM
> > > notifier.
> > > + * @mni: Pointer to the mmu_interval_notifier structure.
> > > + * @mmu_range: Pointer to the mmu_notifier_range structure.
> > > + * @cur_seq: Current sequence number.
> > > + *
> > > + * This function serves as a generic MMU notifier for GPU SVM.
> > > It
> > > sets the MMU
> > > + * notifier sequence number and calls the driver invalidate
> > > vfunc
> > > under
> > > + * gpusvm->notifier_lock.
> > > + *
> > > + * Returns:
> > > + * true if the operation succeeds, false otherwise.
> > > + */
> > > +static bool
> > > +drm_gpusvm_notifier_invalidate(struct mmu_interval_notifier
> > > *mni,
> > > +			       const struct mmu_notifier_range
> > > *mmu_range,
> > > +			       unsigned long cur_seq)
> > > +{
> > > +	struct drm_gpusvm_notifier *notifier =
> > > +		container_of(mni, typeof(*notifier), notifier);
> > > +	struct drm_gpusvm *gpusvm = notifier->gpusvm;
> > > +
> > > +	if (!mmu_notifier_range_blockable(mmu_range))
> > > +		return false;
> > > +
> > > +	down_write(&gpusvm->notifier_lock);
> > > +	mmu_interval_set_seq(mni, cur_seq);
> > > +	gpusvm->ops->invalidate(gpusvm, notifier, mmu_range);
> > > +	up_write(&gpusvm->notifier_lock);
> > > +
> > > +	return true;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_ops - MMU interval notifier operations
> > > for
> > > GPU SVM
> > > + */
> > > +static const struct mmu_interval_notifier_ops
> > > drm_gpusvm_notifier_ops = {
> > > +	.invalidate = drm_gpusvm_notifier_invalidate,
> > > +};
> > > +
> > > +/**
> > > + * drm_gpusvm_init - Initialize the GPU SVM.
> > > + * @gpusvm: Pointer to the GPU SVM structure.
> > > + * @name: Name of the GPU SVM.
> > > + * @drm: Pointer to the DRM device structure.
> > > + * @mm: Pointer to the mm_struct for the address space.
> > > + * @device_private_page_owner: Device private pages owner.
> > > + * @mm_start: Start address of GPU SVM.
> > > + * @mm_range: Range of the GPU SVM.
> > > + * @notifier_size: Size of individual notifiers.
> > > + * @ops: Pointer to the operations structure for GPU SVM.
> > > + * @chunk_sizes: Pointer to the array of chunk sizes used in
> > > range
> > > allocation.
> > > + *               Entries should be powers of 2 in descending
> > > order
> > > with last
> > > + *               entry being SZ_4K.
> > > + * @num_chunks: Number of chunks.
> > > + *
> > > + * This function initializes the GPU SVM.
> > > + *
> > > + * Returns:
> > > + * 0 on success, a negative error code on failure.
> > > + */
> > > +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> > > +		    const char *name, struct drm_device *drm,
> > > +		    struct mm_struct *mm, void
> > > *device_private_page_owner,
> > > +		    u64 mm_start, u64 mm_range, u64
> > > notifier_size,
> > > +		    const struct drm_gpusvm_ops *ops,
> > > +		    const u64 *chunk_sizes, int num_chunks)
> > > +{
> > > +	if (!ops->invalidate || !num_chunks)
> > > +		return -EINVAL;
> > > +
> > > +	gpusvm->name = name;
> > > +	gpusvm->drm = drm;
> > > +	gpusvm->mm = mm;
> > > +	gpusvm->device_private_page_owner =
> > > device_private_page_owner;
> > > +	gpusvm->mm_start = mm_start;
> > > +	gpusvm->mm_range = mm_range;
> > > +	gpusvm->notifier_size = notifier_size;
> > > +	gpusvm->ops = ops;
> > > +	gpusvm->chunk_sizes = chunk_sizes;
> > > +	gpusvm->num_chunks = num_chunks;
> > > +	gpusvm->zdd_wq = system_wq;
> > > +
> > > +	mmgrab(mm);
> > > +	gpusvm->root = RB_ROOT_CACHED;
> > > +	INIT_LIST_HEAD(&gpusvm->notifier_list);
> > > +
> > > +	init_rwsem(&gpusvm->notifier_lock);
> > > +
> > > +	fs_reclaim_acquire(GFP_KERNEL);
> > > +	might_lock(&gpusvm->notifier_lock);
> > > +	fs_reclaim_release(GFP_KERNEL);
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_find - Find GPU SVM notifier
> > > + * @gpusvm__: Pointer to the GPU SVM structure
> > > + * @fault_addr__: Fault address
> > > + *
> > > + * This macro finds the GPU SVM notifier associated with the
> > > fault
> > > address.
> > > + *
> > > + * Returns:
> > > + * Pointer to the GPU SVM notifier on success, NULL otherwise.
> > > + */
> > > +#define drm_gpusvm_notifier_find(gpusvm__,
> > > fault_addr__)	\
> > > +	notifier_iter_first(&(gpusvm__)->root,
> > > (fault_addr__),	\
> > > +			    (fault_addr__ + 1))
> > > +
> > > +/**
> > > + * to_drm_gpusvm_notifier - retrieve the container struct for a
> > > given rbtree node
> > > + * @node__: a pointer to the rbtree node embedded within a
> > > drm_gpusvm_notifier struct
> > > + *
> > > + * Return: A pointer to the containing drm_gpusvm_notifier
> > > structure.
> > > + */
> > > +#define
> > > to_drm_gpusvm_notifier(__node)				\
> > > +	container_of((__node), struct drm_gpusvm_notifier,
> > > rb.node)
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_insert - Insert GPU SVM notifier
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > + *
> > > + * This function inserts the GPU SVM notifier into the GPU SVM
> > > RB
> > > tree and list.
> > > + */
> > > +static void drm_gpusvm_notifier_insert(struct drm_gpusvm
> > > *gpusvm,
> > > +				       struct
> > > drm_gpusvm_notifier
> > > *notifier)
> > > +{
> > > +	struct rb_node *node;
> > > +	struct list_head *head;
> > > +
> > > +	notifier_insert(notifier, &gpusvm->root);
> > > +
> > > +	node = rb_prev(&notifier->rb.node);
> > > +	if (node)
> > > +		head = &(to_drm_gpusvm_notifier(node))-
> > > >rb.entry;
> > > +	else
> > > +		head = &gpusvm->notifier_list;
> > > +
> > > +	list_add(&notifier->rb.entry, head);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_remove - Remove GPU SVM notifier
> > > + * @gpusvm__: Pointer to the GPU SVM structure
> > > + * @notifier__: Pointer to the GPU SVM notifier structure
> > > + *
> > > + * This macro removes the GPU SVM notifier from the GPU SVM RB
> > > tree
> > > and list.
> > > + */
> > > +#define drm_gpusvm_notifier_remove(gpusvm__,
> > > notifier__)	\
> > > +	notifier_remove((notifier__), &(gpusvm__)-
> > > >root);	\
> > > +	list_del(&(notifier__)->rb.entry)
> > > +
> > > +/**
> > > + * drm_gpusvm_fini - Finalize the GPU SVM.
> > > + * @gpusvm: Pointer to the GPU SVM structure.
> > > + *
> > > + * This function finalizes the GPU SVM by cleaning up any
> > > remaining
> > > ranges and
> > > + * notifiers, and dropping a reference to struct MM.
> > > + */
> > > +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm)
> > > +{
> > > +	struct drm_gpusvm_notifier *notifier, *next;
> > > +
> > > +	drm_gpusvm_for_each_notifier_safe(notifier, next,
> > > gpusvm, 0,
> > > LONG_MAX) {
> > > +		struct drm_gpusvm_range *range, *__next;
> > > +
> > > +		/*
> > > +		 * Remove notifier first to avoid racing with
> > > any
> > > invalidation
> > > +		 */
> > > +		mmu_interval_notifier_remove(&notifier-
> > > >notifier);
> > > +		notifier->flags.removed = true;
> > > +
> > > +		drm_gpusvm_for_each_range_safe(range, __next,
> > > notifier, 0,
> > > +					       LONG_MAX)
> > > +			drm_gpusvm_range_remove(gpusvm, range);
> > > +	}
> > > +
> > > +	mmdrop(gpusvm->mm);
> > > +	WARN_ON(!RB_EMPTY_ROOT(&gpusvm->root.rb_root));
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_alloc - Allocate GPU SVM notifier
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @fault_addr: Fault address
> > > + *
> > > + * This function allocates and initializes the GPU SVM notifier
> > > structure.
> > > + *
> > > + * Returns:
> > > + * Pointer to the allocated GPU SVM notifier on success,
> > > ERR_PTR()
> > > on failure.
> > > + */
> > > +static struct drm_gpusvm_notifier *
> > > +drm_gpusvm_notifier_alloc(struct drm_gpusvm *gpusvm, u64
> > > fault_addr)
> > > +{
> > > +	struct drm_gpusvm_notifier *notifier;
> > > +
> > > +	if (gpusvm->ops->notifier_alloc)
> > > +		notifier = gpusvm->ops->notifier_alloc();
> > > +	else
> > > +		notifier = kzalloc(sizeof(*notifier),
> > > GFP_KERNEL);
> > > +
> > > +	if (!notifier)
> > > +		return ERR_PTR(-ENOMEM);
> > > +
> > > +	notifier->gpusvm = gpusvm;
> > > +	notifier->interval.start = ALIGN_DOWN(fault_addr,
> > > gpusvm-
> > > > notifier_size);
> > > +	notifier->interval.end = ALIGN(fault_addr + 1, gpusvm-
> > > > notifier_size);
> > > +	INIT_LIST_HEAD(&notifier->rb.entry);
> > > +	notifier->root = RB_ROOT_CACHED;
> > > +	INIT_LIST_HEAD(&notifier->range_list);
> > > +
> > > +	return notifier;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_free - Free GPU SVM notifier
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > + *
> > > + * This function frees the GPU SVM notifier structure.
> > > + */
> > > +static void drm_gpusvm_notifier_free(struct drm_gpusvm *gpusvm,
> > > +				     struct drm_gpusvm_notifier
> > > *notifier)
> > > +{
> > > +	WARN_ON(!RB_EMPTY_ROOT(&notifier->root.rb_root));
> > > +
> > > +	if (gpusvm->ops->notifier_free)
> > > +		gpusvm->ops->notifier_free(notifier);
> > > +	else
> > > +		kfree(notifier);
> > > +}
> > > +
> > > +/**
> > > + * to_drm_gpusvm_range - retrieve the container struct for a
> > > given
> > > rbtree node
> > > + * @node__: a pointer to the rbtree node embedded within a
> > > drm_gpusvm_range struct
> > > + *
> > > + * Return: A pointer to the containing drm_gpusvm_range
> > > structure.
> > > + */
> > > +#define to_drm_gpusvm_range(node__)	\
> > > +	container_of((node__), struct drm_gpusvm_range, rb.node)
> > > +
> > > +/**
> > > + * drm_gpusvm_range_insert - Insert GPU SVM range
> > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + *
> > > + * This function inserts the GPU SVM range into the notifier RB
> > > tree
> > > and list.
> > > + */
> > > +static void drm_gpusvm_range_insert(struct drm_gpusvm_notifier
> > > *notifier,
> > > +				    struct drm_gpusvm_range
> > > *range)
> > > +{
> > > +	struct rb_node *node;
> > > +	struct list_head *head;
> > > +
> > > +	drm_gpusvm_notifier_lock(notifier->gpusvm);
> > > +	range_insert(range, &notifier->root);
> > > +
> > > +	node = rb_prev(&range->rb.node);
> > > +	if (node)
> > > +		head = &(to_drm_gpusvm_range(node))->rb.entry;
> > > +	else
> > > +		head = &notifier->range_list;
> > > +
> > > +	list_add(&range->rb.entry, head);
> > > +	drm_gpusvm_notifier_unlock(notifier->gpusvm);
> > > +}
> > > +
> > > +/**
> > > + * __drm_gpusvm_range_remove - Remove GPU SVM range
> > > + * @notifier__: Pointer to the GPU SVM notifier structure
> > > + * @range__: Pointer to the GPU SVM range structure
> > > + *
> > > + * This macro removes the GPU SVM range from the notifier RB
> > > tree
> > > and list.
> > > + */
> > > +#define __drm_gpusvm_range_remove(notifier__,
> > > range__)		\
> > > +	range_remove((range__), &(notifier__)-
> > > >root);		\
> > > +	list_del(&(range__)->rb.entry)
> > > +
> > > +/**
> > > + * drm_gpusvm_range_alloc - Allocate GPU SVM range
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > + * @fault_addr: Fault address
> > > + * @chunk_size: Chunk size
> > > + * @migrate_vram: Flag indicating whether to migrate VRAM
> > > + *
> > > + * This function allocates and initializes the GPU SVM range
> > > structure.
> > > + *
> > > + * Returns:
> > > + * Pointer to the allocated GPU SVM range on success, ERR_PTR()
> > > on
> > > failure.
> > > + */
> > > +static struct drm_gpusvm_range *
> > > +drm_gpusvm_range_alloc(struct drm_gpusvm *gpusvm,
> > > +		       struct drm_gpusvm_notifier *notifier,
> > > +		       u64 fault_addr, u64 chunk_size, bool
> > > migrate_vram)
> > > +{
> > > +	struct drm_gpusvm_range *range;
> > > +
> > > +	if (gpusvm->ops->range_alloc)
> > > +		range = gpusvm->ops->range_alloc(gpusvm);
> > > +	else
> > > +		range = kzalloc(sizeof(*range), GFP_KERNEL);
> > > +
> > > +	if (!range)
> > > +		return ERR_PTR(-ENOMEM);
> > > +
> > > +	kref_init(&range->refcount);
> > > +	range->gpusvm = gpusvm;
> > > +	range->notifier = notifier;
> > > +	range->va.start = ALIGN_DOWN(fault_addr, chunk_size);
> > > +	range->va.end = ALIGN(fault_addr + 1, chunk_size);
> > > +	INIT_LIST_HEAD(&range->rb.entry);
> > > +	range->notifier_seq = LONG_MAX;
> > > +	range->flags.migrate_vram = migrate_vram ? 1 : 0;
> > > +
> > > +	return range;
> > > +}
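
Since range_alloc()/range_free() are optional hooks, a driver that wants to
embed the SVM range inside its own tracking structure can provide them. A
minimal sketch, assuming a hypothetical my_svm_range wrapper (not part of this
patch); the same pattern applies to notifier_alloc()/notifier_free():

struct my_svm_range {
	struct drm_gpusvm_range base;
	/* driver-private bookkeeping goes here */
};

static struct drm_gpusvm_range *my_range_alloc(struct drm_gpusvm *gpusvm)
{
	struct my_svm_range *r = kzalloc(sizeof(*r), GFP_KERNEL);

	return r ? &r->base : NULL;
}

static void my_range_free(struct drm_gpusvm_range *range)
{
	kfree(container_of(range, struct my_svm_range, base));
}
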
> > > +
> > > +/**
> > > + * drm_gpusvm_check_pages - Check pages
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > + * @start: Start address
> > > + * @end: End address
> > > + *
> > > + * Check if pages between start and end have been faulted in on the
> > > + * CPU. Used to prevent migration of pages without a CPU backing store.
> > > + *
> > > + * Returns:
> > > + * True if pages have been faulted into CPU, False otherwise
> > > + */
> > > +static bool drm_gpusvm_check_pages(struct drm_gpusvm *gpusvm,
> > > +				   struct drm_gpusvm_notifier
> > > *notifier,
> > > +				   u64 start, u64 end)
> > > +{
> > > +	struct hmm_range hmm_range = {
> > > +		.default_flags = 0,
> > > +		.notifier = &notifier->notifier,
> > > +		.start = start,
> > > +		.end = end,
> > > +		.dev_private_owner = gpusvm->device_private_page_owner,
> > > +	};
> > > +	unsigned long timeout =
> > > +		jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > > +	unsigned long *pfns;
> > > +	unsigned long npages = npages_in_range(start, end);
> > > +	int err, i;
> > > +
> > > +	mmap_assert_locked(gpusvm->mm);
> > > +
> > > +	pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
> > > +	if (!pfns)
> > > +		return false;
> > > +
> > > +	hmm_range.notifier_seq = mmu_interval_read_begin(&notifier->notifier);
> > > +	hmm_range.hmm_pfns = pfns;
> > > +
> > > +	while (true) {
> > > +		err = hmm_range_fault(&hmm_range);
> > > +		if (err == -EBUSY) {
> > > +			if (time_after(jiffies, timeout))
> > > +				break;
> > > +
> > > +			hmm_range.notifier_seq =
> > > mmu_interval_read_begin(&notifier->notifier);
> > > +			continue;
> > > +		}
> > > +		break;
> > > +	}
> > > +	if (err)
> > > +		goto err_free;
> > > +
> > > +	for (i = 0; i < npages; ++i) {
> > > +		if (!(pfns[i] & HMM_PFN_VALID)) {
> > > +			err = -EFAULT;
> > > +			goto err_free;
> > > +		}
> > > +	}
> > > +
> > > +err_free:
> > > +	kvfree(pfns);
> > > +	return err ? false : true;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_chunk_size - Determine chunk size for GPU
> > > SVM
> > > range
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > + * @vas: Pointer to the virtual memory area structure
> > > + * @fault_addr: Fault address
> > > + * @gpuva_start: Start address of GPUVA which mirrors CPU
> > > + * @gpuva_end: End address of GPUVA which mirrors CPU
> > > + * @check_pages: Flag indicating whether to check pages
> > > + *
> > > + * This function determines the chunk size for the GPU SVM range
> > > based on the
> > > + * fault address, GPU SVM chunk sizes, existing GPU SVM ranges,
> > > and
> > > the virtual
> > > + * memory area boundaries.
> > > + *
> > > + * Returns:
> > > + * Chunk size on success, LONG_MAX on failure.
> > > + */
> > > +static u64 drm_gpusvm_range_chunk_size(struct drm_gpusvm
> > > *gpusvm,
> > > +				       struct
> > > drm_gpusvm_notifier
> > > *notifier,
> > > +				       struct vm_area_struct
> > > *vas,
> > > +				       u64 fault_addr, u64
> > > gpuva_start,
> > > +				       u64 gpuva_end, bool
> > > check_pages)
> > > +{
> > > +	u64 start, end;
> > > +	int i = 0;
> > > +
> > > +retry:
> > > +	for (; i < gpusvm->num_chunks; ++i) {
> > > +		start = ALIGN_DOWN(fault_addr, gpusvm->chunk_sizes[i]);
> > > +		end = ALIGN(fault_addr + 1, gpusvm->chunk_sizes[i]);
> > > +
> > > +		if (start >= vas->vm_start && end <= vas->vm_end &&
> > > +		    start >= notifier->interval.start &&
> > > +		    end <= notifier->interval.end &&
> > > +		    start >= gpuva_start && end <= gpuva_end)
> > > +			break;
> > > +	}
> > > +
> > > +	if (i == gpusvm->num_chunks)
> > > +		return LONG_MAX;
> > > +
> > > +	/*
> > > +	 * If the allocation is more than a page, ensure it does not overlap
> > > +	 * with existing ranges.
> > > +	 */
> > > +	if (end - start != SZ_4K) {
> > > +		struct drm_gpusvm_range *range;
> > > +
> > > +		range = drm_gpusvm_range_find(notifier, start, end);
> > > +		if (range) {
> > > +			++i;
> > > +			goto retry;
> > > +		}
> > > +
> > > +		/*
> > > +		 * XXX: Only create range on pages CPU has faulted in. Without
> > > +		 * this check, or prefault, on BMG 'xe_exec_system_allocator --r
> > > +		 * process-many-malloc' fails. In the failure case, each process
> > > +		 * mallocs 16k but the CPU VMA is ~128k which results in 64k SVM
> > > +		 * ranges. When migrating the SVM ranges, some processes fail in
> > > +		 * drm_gpusvm_migrate_to_vram with 'migrate.cpages != npages'
> > > +		 * and then upon drm_gpusvm_range_get_pages device pages from
> > > +		 * other processes are collected + faulted in which creates all
> > > +		 * sorts of problems. Unsure exactly how this is happening; the
> > > +		 * problem also goes away if 'xe_exec_system_allocator --r
> > > +		 * process-many-malloc' mallocs at least 64k at a time.
> > > +		 */
> > > +		if (check_pages &&
> > > +		    !drm_gpusvm_check_pages(gpusvm, notifier, start, end)) {
> > > +			++i;
> > > +			goto retry;
> > > +		}
> > > +	}
> > > +
> > > +	return end - start;
> > > +}
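
To make the retry loop above concrete: chunk sizes are tried largest first,
and the first chunk whose aligned [start, end) fits inside the CPU VMA, the
notifier interval and the GPUVA span, and (for multi-page chunks) neither
overlaps an existing range nor fails the check_pages test, wins. A purely
illustrative configuration (the array name is made up here, powers of two in
descending order as the struct drm_gpusvm kernel-doc requires):

static const u64 my_chunk_sizes[] = { SZ_2M, SZ_64K, SZ_4K };

With these chunks, a fault in a fully populated, 2 MiB-aligned malloc region
yields a 2 MiB range; if that overlaps an existing range or some of its pages
lack CPU backing, the loop falls back to 64 KiB and finally 4 KiB.
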
> > > +
> > > +/**
> > > + * drm_gpusvm_range_find_or_insert - Find or insert GPU SVM
> > > range
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @fault_addr: Fault address
> > > + * @gpuva_start: Start address of GPUVA which mirrors CPU
> > > + * @gpuva_end: End address of GPUVA which mirrors CPU
> > > + * @ctx: GPU SVM context
> > > + *
> > > + * This function finds or inserts a newly allocated GPU SVM range based
> > > + * on the fault address. The caller must hold a lock to protect range
> > > + * lookup and insertion.
> > > + *
> > > + * Returns:
> > > + * Pointer to the GPU SVM range on success, ERR_PTR() on
> > > failure.
> > > + */
> > > +struct drm_gpusvm_range *
> > > +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64
> > > fault_addr,
> > > +				u64 gpuva_start, u64 gpuva_end,
> > > +				const struct drm_gpusvm_ctx
> > > *ctx)
> > > +{
> > > +	struct drm_gpusvm_notifier *notifier;
> > > +	struct drm_gpusvm_range *range;
> > > +	struct mm_struct *mm = gpusvm->mm;
> > > +	struct vm_area_struct *vas;
> > > +	bool notifier_alloc = false;
> > > +	u64 chunk_size;
> > > +	int err;
> > > +	bool migrate_vram;
> > > +
> > > +	if (fault_addr < gpusvm->mm_start ||
> > > +	    fault_addr > gpusvm->mm_start + gpusvm->mm_range) {
> > > +		err = -EINVAL;
> > > +		goto err_out;
> > > +	}
> > > +
> > > +	if (!ctx->mmap_locked) {
> > > +		if (!mmget_not_zero(mm)) {
> > > +			err = -EFAULT;
> > > +			goto err_out;
> > > +		}
> > > +		mmap_write_lock(mm);
> > > +	}
> > > +
> > > +	mmap_assert_write_locked(mm);
> > > +
> > > +	notifier = drm_gpusvm_notifier_find(gpusvm, fault_addr);
> > > +	if (!notifier) {
> > > +		notifier = drm_gpusvm_notifier_alloc(gpusvm,
> > > fault_addr);
> > > +		if (IS_ERR(notifier)) {
> > > +			err = PTR_ERR(notifier);
> > > +			goto err_mmunlock;
> > > +		}
> > > +		notifier_alloc = true;
> > > +		err = mmu_interval_notifier_insert_locked(&notifier->notifier,
> > > +							   mm, notifier->interval.start,
> > > +							   notifier->interval.end -
> > > +							   notifier->interval.start,
> > > +							   &drm_gpusvm_notifier_ops);
> > > +		if (err)
> > > +			goto err_notifier;
> > > +	}
> > > +
> > > +	vas = vma_lookup(mm, fault_addr);
> > > +	if (!vas) {
> > > +		err = -ENOENT;
> > > +		goto err_notifier_remove;
> > > +	}
> > > +
> > > +	if (!ctx->read_only && !(vas->vm_flags & VM_WRITE)) {
> > > +		err = -EPERM;
> > > +		goto err_notifier_remove;
> > > +	}
> > > +
> > > +	range = drm_gpusvm_range_find(notifier, fault_addr, fault_addr + 1);
> > > +	if (range)
> > > +		goto out_mmunlock;
> > > +	/*
> > > +	 * XXX: Short-circuiting migration based on migrate_vma_* current
> > > +	 * limitations. If/when migrate_vma_* add more support, this logic
> > > +	 * will have to change.
> > > +	 */
> > > +	migrate_vram = ctx->vram_possible &&
> > > +		vma_is_anonymous(vas) && !is_vm_hugetlb_page(vas);
> > > +
> > > +	chunk_size = drm_gpusvm_range_chunk_size(gpusvm, notifier, vas,
> > > +						 fault_addr, gpuva_start,
> > > +						 gpuva_end, migrate_vram &&
> > > +						 !ctx->prefault);
> > > +	if (chunk_size == LONG_MAX) {
> > > +		err = -EINVAL;
> > > +		goto err_notifier_remove;
> > > +	}
> > > +
> > > +	range = drm_gpusvm_range_alloc(gpusvm, notifier, fault_addr,
> > > +				       chunk_size, migrate_vram);
> > > +	if (IS_ERR(range)) {
> > > +		err = PTR_ERR(range);
> > > +		goto err_notifier_remove;
> > > +	}
> > > +
> > > +	drm_gpusvm_range_insert(notifier, range);
> > > +	if (notifier_alloc)
> > > +		drm_gpusvm_notifier_insert(gpusvm, notifier);
> > > +
> > > +	if (ctx->prefault) {
> > > +		struct drm_gpusvm_ctx __ctx = *ctx;
> > > +
> > > +		__ctx.mmap_locked = true;
> > > +		err = drm_gpusvm_range_get_pages(gpusvm, range,
> > > &__ctx);
> > > +		if (err)
> > > +			goto err_range_remove;
> > > +	}
> > > +
> > > +out_mmunlock:
> > > +	if (!ctx->mmap_locked) {
> > > +		mmap_write_unlock(mm);
> > > +		mmput(mm);
> > > +	}
> > > +
> > > +	return range;
> > > +
> > > +err_range_remove:
> > > +	__drm_gpusvm_range_remove(notifier, range);
> > > +err_notifier_remove:
> > > +	if (notifier_alloc)
> > > +		mmu_interval_notifier_remove(&notifier->notifier);
> > > +err_notifier:
> > > +	if (notifier_alloc)
> > > +		drm_gpusvm_notifier_free(gpusvm, notifier);
> > > +err_mmunlock:
> > > +	if (!ctx->mmap_locked) {
> > > +		mmap_write_unlock(mm);
> > > +		mmput(mm);
> > > +	}
> > > +err_out:
> > > +	return ERR_PTR(err);
> > > +}
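
For readers reviewing the uAPI side, a rough sketch of the expected caller
flow in a GPU page-fault handler follows; the my_* names and the bind step are
placeholders, not part of this patch:

static int my_handle_gpu_fault(struct drm_gpusvm *gpusvm, u64 fault_addr,
			       u64 gpuva_start, u64 gpuva_end)
{
	struct drm_gpusvm_ctx ctx = { .vram_possible = true };
	struct drm_gpusvm_range *range;
	int err;

	range = drm_gpusvm_range_find_or_insert(gpusvm, fault_addr,
						gpuva_start, gpuva_end, &ctx);
	if (IS_ERR(range))
		return PTR_ERR(range);

	err = drm_gpusvm_range_get_pages(gpusvm, range, &ctx);
	if (err)
		return err;

	/*
	 * The driver-side bind is expected to take gpusvm->notifier_lock and
	 * check drm_gpusvm_range_pages_valid() as the last step before
	 * committing the GPU page tables.
	 */
	return my_bind_range_to_gpu(gpusvm, range);
}
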
> > > +
> > > +/**
> > > + * for_each_dma_page - iterate over pages in a DMA region
> > > + * @i__: the current page index in the iteration
> > > + * @j__: the current page index, log order, in the iteration
> > > + * @npages__: the total number of pages in the DMA region
> > > + * @order__: the order of the pages in the DMA region
> > > + *
> > > + * This macro iterates over each page in a DMA region. The DMA
> > > region
> > > + * is assumed to be composed of 2^@order__ pages, and the macro
> > > will
> > > + * step through the region one block of 2^@order__ pages at a
> > > time.
> > > + */
> > > +#define for_each_dma_page(i__, j__, npages__, order__)	\
> > > +	for ((i__) = 0, (j__) = 0; (i__) < (npages__);	\
> > > +	     (j__)++, (i__) += 0x1 << (order__))
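
A quick illustration of the iteration the macro performs (values chosen
arbitrarily here):

	unsigned long i, j;

	/* order = 9: each block covers 512 4 KiB pages */
	for_each_dma_page(i, j, 1024, 9)
		pr_debug("block %lu starts at page index %lu\n", j, i);
	/* runs twice: (i, j) = (0, 0) and (512, 1) */
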
> > > +
> > > +/**
> > > + * __drm_gpusvm_range_unmap_pages - Unmap pages associated with
> > > a
> > > GPU SVM range (internal)
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + *
> > > + * This function unmaps pages associated with a GPU SVM range. Assumes
> > > + * and asserts correct locking is in place when called.
> > > + */
> > > +static void __drm_gpusvm_range_unmap_pages(struct drm_gpusvm
> > > *gpusvm,
> > > +					   struct
> > > drm_gpusvm_range
> > > *range)
> > > +{
> > > +	lockdep_assert_held(&gpusvm->notifier_lock);
> > > +
> > > +	if (range->pages) {
> > > +		unsigned long i, j, npages = npages_in_range(range->va.start,
> > > +							     range->va.end);
> > > +
> > > +		if (range->flags.has_dma_mapping) {
> > > +			for_each_dma_page(i, j, npages, range->order)
> > > +				dma_unmap_page(gpusvm->drm->dev,
> > > +					       range->dma_addr[j],
> > > +					       PAGE_SIZE << range->order,
> > > +					       DMA_BIDIRECTIONAL);
> > > +		}
> > > +
> > > +		range->flags.has_vram_pages = false;
> > > +		range->flags.has_dma_mapping = false;
> > > +	}
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_free_pages - Free pages associated with a
> > > GPU
> > > SVM range
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + *
> > > + * This function frees pages associated with a GPU SVM range.
> > > + */
> > > +static void drm_gpusvm_range_free_pages(struct drm_gpusvm
> > > *gpusvm,
> > > +					struct drm_gpusvm_range
> > > *range)
> > > +{
> > > +	lockdep_assert_held(&gpusvm->notifier_lock);
> > > +
> > > +	if (range->pages) {
> > > +		if (range->flags.kfree_mapping) {
> > > +			kfree(range->dma_addr);
> > > +			range->flags.kfree_mapping = false;
> > > +			range->pages = NULL;
> > > +		} else {
> > > +			kvfree(range->pages);
> > > +			range->pages = NULL;
> > > +		}
> > > +	}
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_remove - Remove GPU SVM range
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range to be removed
> > > + *
> > > + * This function removes the specified GPU SVM range and also
> > > removes the parent
> > > + * GPU SVM notifier if no more ranges remain in the notifier.
> > > The
> > > caller must
> > > + * hold a lock to protect range and notifier removal.
> > > + */
> > > +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> > > +			     struct drm_gpusvm_range *range)
> > > +{
> > > +	struct drm_gpusvm_notifier *notifier;
> > > +
> > > +	notifier = drm_gpusvm_notifier_find(gpusvm, range->va.start);
> > > +	if (WARN_ON_ONCE(!notifier))
> > > +		return;
> > > +
> > > +	drm_gpusvm_notifier_lock(gpusvm);
> > > +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > > +	drm_gpusvm_range_free_pages(gpusvm, range);
> > > +	__drm_gpusvm_range_remove(notifier, range);
> > > +	drm_gpusvm_notifier_unlock(gpusvm);
> > > +
> > > +	drm_gpusvm_range_put(range);
> > > +
> > > +	if (RB_EMPTY_ROOT(&notifier->root.rb_root)) {
> > > +		if (!notifier->flags.removed)
> > > +			mmu_interval_notifier_remove(&notifier->notifier);
> > > +		drm_gpusvm_notifier_remove(gpusvm, notifier);
> > > +		drm_gpusvm_notifier_free(gpusvm, notifier);
> > > +	}
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_get - Get a reference to GPU SVM range
> > > + * @range: Pointer to the GPU SVM range
> > > + *
> > > + * This function increments the reference count of the specified
> > > GPU
> > > SVM range.
> > > + *
> > > + * Returns:
> > > + * Pointer to the GPU SVM range.
> > > + */
> > > +struct drm_gpusvm_range *
> > > +drm_gpusvm_range_get(struct drm_gpusvm_range *range)
> > > +{
> > > +	kref_get(&range->refcount);
> > > +
> > > +	return range;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_destroy - Destroy GPU SVM range
> > > + * @refcount: Pointer to the reference counter embedded in the
> > > GPU
> > > SVM range
> > > + *
> > > + * This function destroys the specified GPU SVM range when its
> > > reference count
> > > + * reaches zero. If a custom range-free function is provided, it
> > > is
> > > invoked to
> > > + * free the range; otherwise, the range is deallocated using
> > > kfree().
> > > + */
> > > +static void drm_gpusvm_range_destroy(struct kref *refcount)
> > > +{
> > > +	struct drm_gpusvm_range *range =
> > > +		container_of(refcount, struct drm_gpusvm_range,
> > > refcount);
> > > +	struct drm_gpusvm *gpusvm = range->gpusvm;
> > > +
> > > +	if (gpusvm->ops->range_free)
> > > +		gpusvm->ops->range_free(range);
> > > +	else
> > > +		kfree(range);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_put - Put a reference to GPU SVM range
> > > + * @range: Pointer to the GPU SVM range
> > > + *
> > > + * This function decrements the reference count of the specified
> > > GPU
> > > SVM range
> > > + * and frees it when the count reaches zero.
> > > + */
> > > +void drm_gpusvm_range_put(struct drm_gpusvm_range *range)
> > > +{
> > > +	kref_put(&range->refcount, drm_gpusvm_range_destroy);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_pages_valid - GPU SVM range pages valid
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + *
> > > + * This function determines if a GPU SVM range's pages are valid. It is
> > > + * expected to be called while holding gpusvm->notifier_lock and as the
> > > + * last step before committing a GPU binding.
> > > + *
> > > + * Returns:
> > > + * True if GPU SVM range has valid pages, False otherwise
> > > + */
> > > +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> > > +				  struct drm_gpusvm_range
> > > *range)
> > > +{
> > > +	lockdep_assert_held(&gpusvm->notifier_lock);
> > > +
> > > +	return range->flags.has_vram_pages || range->flags.has_dma_mapping;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_pages_valid_unlocked - GPU SVM range pages
> > > valid
> > > unlocked
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + *
> > > + * This function determines if a GPU SVM range's pages are valid. It is
> > > + * expected to be called without holding gpusvm->notifier_lock.
> > > + *
> > > + * Returns:
> > > + * True if GPU SVM range has valid pages, False otherwise
> > > + */
> > > +static bool
> > > +drm_gpusvm_range_pages_valid_unlocked(struct drm_gpusvm *gpusvm,
> > > +				      struct drm_gpusvm_range
> > > *range)
> > > +{
> > > +	bool pages_valid;
> > > +
> > > +	if (!range->pages)
> > > +		return false;
> > > +
> > > +	drm_gpusvm_notifier_lock(gpusvm);
> > > +	pages_valid = drm_gpusvm_range_pages_valid(gpusvm,
> > > range);
> > > +	if (!pages_valid && range->flags.kfree_mapping) {
> > > +		kfree(range->dma_addr);
> > > +		range->flags.kfree_mapping = false;
> > > +		range->pages = NULL;
> > > +	}
> > > +	drm_gpusvm_notifier_unlock(gpusvm);
> > > +
> > > +	return pages_valid;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_get_pages - Get pages for a GPU SVM range
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + * @ctx: GPU SVM context
> > > + *
> > > + * This function gets pages for a GPU SVM range and ensures they
> > > are
> > > mapped for
> > > + * DMA access.
> > > + *
> > > + * Returns:
> > > + * 0 on success, negative error code on failure.
> > > + */
> > > +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> > > +			       struct drm_gpusvm_range *range,
> > > +			       const struct drm_gpusvm_ctx *ctx)
> > > +{
> > > +	struct mmu_interval_notifier *notifier = &range->notifier->notifier;
> > > +	struct hmm_range hmm_range = {
> > > +		.default_flags = HMM_PFN_REQ_FAULT | (ctx->read_only ? 0 :
> > > +			HMM_PFN_REQ_WRITE),
> > > +		.notifier = notifier,
> > > +		.start = range->va.start,
> > > +		.end = range->va.end,
> > > +		.dev_private_owner = gpusvm->device_private_page_owner,
> > > +	};
> > > +	struct mm_struct *mm = gpusvm->mm;
> > > +	unsigned long timeout =
> > > +		jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > > +	unsigned long i, j;
> > > +	unsigned long npages = npages_in_range(range->va.start,
> > > range->va.end);
> > > +	unsigned int order = 0;
> > > +	unsigned long *pfns;
> > > +	struct page **pages;
> > > +	int err = 0;
> > > +	bool vram_pages = !!range->flags.migrate_vram;
> > > +	bool alloc_pfns = false, kfree_mapping;
> > > +
> > > +retry:
> > > +	kfree_mapping = false;
> > > +	hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
> > > +	if (drm_gpusvm_range_pages_valid_unlocked(gpusvm, range))
> > > +		return 0;
> > > +
> > > +	if (range->notifier_seq == hmm_range.notifier_seq && range->pages) {
> > > +		if (ctx->prefault)
> > > +			return 0;
> > > +
> > > +		pfns = (unsigned long *)range->pages;
> > > +		pages = range->pages;
> > > +		goto map_pages;
> > > +	}
> > > +
> > > +	if (!range->pages) {
> > > +		pfns = kvmalloc_array(npages, sizeof(*pfns),
> > > GFP_KERNEL);
> > > +		if (!pfns)
> > > +			return -ENOMEM;
> > > +		alloc_pfns = true;
> > > +	} else {
> > > +		pfns = (unsigned long *)range->pages;
> > > +	}
> > > +
> > > +	if (!ctx->mmap_locked) {
> > > +		if (!mmget_not_zero(mm)) {
> > > +			err = -EFAULT;
> > > +			goto err_out;
> > > +		}
> > > +	}
> > > +
> > > +	hmm_range.hmm_pfns = pfns;
> > > +	while (true) {
> > > +		/* Must be checked after mmu_interval_read_begin */
> > > +		if (range->flags.unmapped) {
> > > +			err = -EFAULT;
> > > +			break;
> > > +		}
> > > +
> > > +		if (!ctx->mmap_locked) {
> > > +			/*
> > > +			 * XXX: The HMM locking document indicates only a
> > > +			 * read-lock is required but there appears to be a
> > > +			 * window between the MMU_NOTIFY_MIGRATE event
> > > +			 * triggered in a CPU fault via migrate_vma_setup and
> > > +			 * the pages actually moving in migrate_vma_finalize
> > > +			 * in which this code can grab garbage pages. Grabbing
> > > +			 * the write-lock if the range is attached to vram
> > > +			 * appears to protect against this race.
> > > +			 */
> > > +			if (vram_pages)
> > > +				mmap_write_lock(mm);
> > > +			else
> > > +				mmap_read_lock(mm);
> > > +		}
> > > +		err = hmm_range_fault(&hmm_range);
> > > +		if (!ctx->mmap_locked) {
> > > +			if (vram_pages)
> > > +				mmap_write_unlock(mm);
> > > +			else
> > > +				mmap_read_unlock(mm);
> > > +		}
> > > +
> > > +		if (err == -EBUSY) {
> > > +			if (time_after(jiffies, timeout))
> > > +				break;
> > > +
> > > +			hmm_range.notifier_seq =
> > > mmu_interval_read_begin(notifier);
> > > +			continue;
> > > +		}
> > > +		break;
> > > +	}
> > > +	if (!ctx->mmap_locked)
> > > +		mmput(mm);
> > > +	if (err)
> > > +		goto err_free;
> > > +
> > > +	pages = (struct page **)pfns;
> > > +
> > > +	if (ctx->prefault) {
> > > +		range->pages = pages;
> > > +		goto set_seqno;
> > > +	}
> > > +
> > > +map_pages:
> > > +	if (is_device_private_page(hmm_pfn_to_page(pfns[0]))) {
> > > +		WARN_ON_ONCE(!range->vram_allocation);
> > > +
> > > +		for (i = 0; i < npages; ++i) {
> > > +			pages[i] = hmm_pfn_to_page(pfns[i]);
> > > +
> > > +			if (WARN_ON_ONCE(!is_device_private_page(pages[i]))) {
> > > +				err = -EOPNOTSUPP;
> > > +				goto err_free;
> > > +			}
> > > +		}
> > > +
> > > +		/* Do not race with notifier unmapping pages */
> > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > +		range->flags.has_vram_pages = true;
> > > +		range->pages = pages;
> > > +		if (mmu_interval_read_retry(notifier,
> > > +					    hmm_range.notifier_seq)) {
> > > +			err = -EAGAIN;
> > > +			__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > > +		}
> > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > +	} else {
> > > +		dma_addr_t *dma_addr = (dma_addr_t *)pfns;
> > > +
> > > +		for_each_dma_page(i, j, npages, order) {
> > > +			if (WARN_ON_ONCE(i && order !=
> > > +					 hmm_pfn_to_map_order(pfns[i]))) {
> > > +				err = -EOPNOTSUPP;
> > > +				npages = i;
> > > +				goto err_unmap;
> > > +			}
> > > +			order = hmm_pfn_to_map_order(pfns[i]);
> > > +
> > > +			pages[j] = hmm_pfn_to_page(pfns[i]);
> > > +			if (WARN_ON_ONCE(is_zone_device_page(pages[j]))) {
> > > +				err = -EOPNOTSUPP;
> > > +				npages = i;
> > > +				goto err_unmap;
> > > +			}
> > > +
> > > +			set_page_dirty_lock(pages[j]);
> > > +			mark_page_accessed(pages[j]);
> > > +
> > > +			dma_addr[j] = dma_map_page(gpusvm->drm->dev,
> > > +						   pages[j], 0,
> > > +						   PAGE_SIZE << order,
> > > +						   DMA_BIDIRECTIONAL);
> > > +			if (dma_mapping_error(gpusvm->drm->dev,
> > > +					      dma_addr[j])) {
> > > +				err = -EFAULT;
> > > +				npages = i;
> > > +				goto err_unmap;
> > > +			}
> > > +		}
> > > +
> > > +		/* Huge pages, reduce memory footprint */
> > > +		if (order) {
> > > +			dma_addr = kmalloc_array(j, sizeof(*dma_addr),
> > > +						 GFP_KERNEL);
> > > +			if (dma_addr) {
> > > +				for (i = 0; i < j; ++i)
> > > +					dma_addr[i] = (dma_addr_t)pfns[i];
> > > +				kvfree(pfns);
> > > +				kfree_mapping = true;
> > > +			} else {
> > > +				dma_addr = (dma_addr_t *)pfns;
> > > +			}
> > > +		}
> > > +
> > > +		/* Do not race with notifier unmapping pages */
> > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > +		range->order = order;
> > > +		range->flags.kfree_mapping = kfree_mapping;
> > > +		range->flags.has_dma_mapping = true;
> > > +		range->dma_addr = dma_addr;
> > > +		range->vram_allocation = NULL;
> > > +		if (mmu_interval_read_retry(notifier,
> > > +					    hmm_range.notifier_seq)) {
> > > +			err = -EAGAIN;
> > > +			__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > > +		}
> > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > +	}
> > > +
> > > +	if (err == -EAGAIN)
> > > +		goto retry;
> > > +set_seqno:
> > > +	range->notifier_seq = hmm_range.notifier_seq;
> > > +
> > > +	return 0;
> > > +
> > > +err_unmap:
> > > +	for_each_dma_page(i, j, npages, order)
> > > +		dma_unmap_page(gpusvm->drm->dev,
> > > +			       (dma_addr_t)pfns[j],
> > > +			       PAGE_SIZE << order,
> > > DMA_BIDIRECTIONAL);
> > > +err_free:
> > > +	if (alloc_pfns)
> > > +		kvfree(pfns);
> > > +err_out:
> > > +	return err;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_unmap_pages - Unmap pages associated with a
> > > GPU
> > > SVM range
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + * @ctx: GPU SVM context
> > > + *
> > > + * This function unmaps pages associated with a GPU SVM range. If
> > > + * @in_notifier is set, it is assumed that gpusvm->notifier_lock is held
> > > + * in write mode; if it is clear, it acquires gpusvm->notifier_lock in
> > > + * read mode. Must be called on each GPU SVM range attached to notifier
> > > + * in gpusvm->ops->invalidate for the IOMMU security model.
> > > + */
> > > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> > > +				  struct drm_gpusvm_range
> > > *range,
> > > +				  const struct drm_gpusvm_ctx
> > > *ctx)
> > > +{
> > > +	if (ctx->in_notifier)
> > > +		lockdep_assert_held_write(&gpusvm->notifier_lock);
> > > +	else
> > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > +
> > > +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > > +
> > > +	if (!ctx->in_notifier)
> > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > +}
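
To illustrate the "must be called on each range in ops->invalidate"
requirement above, a sketch of a driver invalidate hook; my_zap_gpu_ptes() is
a hypothetical helper and not part of this patch:

static void my_invalidate(struct drm_gpusvm *gpusvm,
			  struct drm_gpusvm_notifier *notifier,
			  const struct mmu_notifier_range *mmu_range)
{
	struct drm_gpusvm_ctx ctx = { .in_notifier = true };
	struct drm_gpusvm_range *range = NULL;

	/* zap GPU page tables covering the invalidated span */
	my_zap_gpu_ptes(gpusvm, mmu_range->start, mmu_range->end);

	/* notifier lock is already held in write mode here */
	drm_gpusvm_for_each_range(range, notifier, mmu_range->start,
				  mmu_range->end)
		drm_gpusvm_range_unmap_pages(gpusvm, range, &ctx);
}
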
> > > +
> > > +/**
> > > + * drm_gpusvm_migration_put_page - Put a migration page
> > > + * @page: Pointer to the page to put
> > > + *
> > > + * This function unlocks and puts a page.
> > > + */
> > > +static void drm_gpusvm_migration_put_page(struct page *page)
> > > +{
> > > +	unlock_page(page);
> > > +	put_page(page);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migration_put_pages - Put migration pages
> > > + * @npages: Number of pages
> > > + * @migrate_pfn: Array of migrate page frame numbers
> > > + *
> > > + * This function puts an array of pages.
> > > + */
> > > +static void drm_gpusvm_migration_put_pages(unsigned long npages,
> > > +					   unsigned long
> > > *migrate_pfn)
> > > +{
> > > +	unsigned long i;
> > > +
> > > +	for (i = 0; i < npages; ++i) {
> > > +		if (!migrate_pfn[i])
> > > +			continue;
> > > +
> > > +		drm_gpusvm_migration_put_page(migrate_pfn_to_page(migrate_pfn[i]));
> > > +		migrate_pfn[i] = 0;
> > > +	}
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_get_vram_page - Get a reference to a VRAM page
> > > + * @page: Pointer to the page
> > > + * @zdd: Pointer to the GPU SVM zone device data
> > > + *
> > > + * This function associates the given page with the specified
> > > GPU
> > > SVM zone
> > > + * device data and initializes it for zone device usage.
> > > + */
> > > +static void drm_gpusvm_get_vram_page(struct page *page,
> > > +				     struct drm_gpusvm_zdd *zdd)
> > > +{
> > > +	page->zone_device_data = drm_gpusvm_zdd_get(zdd);
> > > +	zone_device_page_init(page);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migrate_map_pages() - Map migration pages for GPU
> > > SVM
> > > migration
> > > + * @dev: The device for which the pages are being mapped
> > > + * @dma_addr: Array to store DMA addresses corresponding to
> > > mapped
> > > pages
> > > + * @migrate_pfn: Array of migrate page frame numbers to map
> > > + * @npages: Number of pages to map
> > > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > > + *
> > > + * This function maps pages of memory for migration usage in GPU
> > > SVM. It
> > > + * iterates over each page frame number provided in
> > > @migrate_pfn,
> > > maps the
> > > + * corresponding page, and stores the DMA address in the
> > > provided
> > > @dma_addr
> > > + * array.
> > > + *
> > > + * Return: 0 on success, -EFAULT if an error occurs during
> > > mapping.
> > > + */
> > > +static int drm_gpusvm_migrate_map_pages(struct device *dev,
> > > +					dma_addr_t *dma_addr,
> > > +					long unsigned int
> > > *migrate_pfn,
> > > +					unsigned long npages,
> > > +					enum dma_data_direction
> > > dir)
> > > +{
> > > +	unsigned long i;
> > > +
> > > +	for (i = 0; i < npages; ++i) {
> > > +		struct page *page =
> > > migrate_pfn_to_page(migrate_pfn[i]);
> > > +
> > > +		if (!page)
> > > +			continue;
> > > +
> > > +		if (WARN_ON_ONCE(is_zone_device_page(page)))
> > > +			return -EFAULT;
> > > +
> > > +		dma_addr[i] = dma_map_page(dev, page, 0,
> > > PAGE_SIZE,
> > > dir);
> > > +		if (dma_mapping_error(dev, dma_addr[i]))
> > > +			return -EFAULT;
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migrate_unmap_pages() - Unmap pages previously
> > > mapped
> > > for GPU SVM migration
> > > + * @dev: The device for which the pages were mapped
> > > + * @dma_addr: Array of DMA addresses corresponding to mapped
> > > pages
> > > + * @npages: Number of pages to unmap
> > > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > > + *
> > > + * This function unmaps previously mapped pages of memory for
> > > GPU
> > > Shared Virtual
> > > + * Memory (SVM). It iterates over each DMA address provided in
> > > @dma_addr, checks
> > > + * if it's valid and not already unmapped, and unmaps the
> > > corresponding page.
> > > + */
> > > +static void drm_gpusvm_migrate_unmap_pages(struct device *dev,
> > > +					   dma_addr_t *dma_addr,
> > > +					   unsigned long npages,
> > > +					   enum
> > > dma_data_direction
> > > dir)
> > > +{
> > > +	unsigned long i;
> > > +
> > > +	for (i = 0; i < npages; ++i) {
> > > +		if (!dma_addr[i] || dma_mapping_error(dev,
> > > dma_addr[i]))
> > > +			continue;
> > > +
> > > +		dma_unmap_page(dev, dma_addr[i], PAGE_SIZE,
> > > dir);
> > > +	}
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migrate_to_vram - Migrate GPU SVM range to VRAM
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + * @vram_allocation: Driver-private pointer to the VRAM allocation. The
> > > + *                   caller should hold a reference to the VRAM
> > > + *                   allocation, which should be dropped via
> > > + *                   ops->vram_release or upon the failure of this
> > > + *                   function.
> > > + * @ctx: GPU SVM context
> > > + *
> > > + * This function migrates the specified GPU SVM range to VRAM. It
> > > + * performs the necessary setup and invokes the driver-specific
> > > + * operations for migration to VRAM. Upon successful return,
> > > + * @vram_allocation can safely reference @range until ops->vram_release
> > > + * is called, which only happens upon a successful return.
> > > + *
> > > + * Returns:
> > > + * 0 on success, negative error code on failure.
> > > + */
> > > +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> > > +			       struct drm_gpusvm_range *range,
> > > +			       void *vram_allocation,
> > > +			       const struct drm_gpusvm_ctx *ctx)
> > > +{
> > > +	u64 start = range->va.start, end = range->va.end;
> > > +	struct migrate_vma migrate = {
> > > +		.start		= start,
> > > +		.end		= end,
> > > +		.pgmap_owner	= gpusvm->device_private_page_owner,
> > > +		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
> > > +	};
> > > +	struct mm_struct *mm = gpusvm->mm;
> > > +	unsigned long i, npages = npages_in_range(start, end);
> > > +	struct vm_area_struct *vas;
> > > +	struct drm_gpusvm_zdd *zdd = NULL;
> > > +	struct page **pages;
> > > +	dma_addr_t *dma_addr;
> > > +	void *buf;
> > > +	int err;
> > > +
> > > +	if (!range->flags.migrate_vram)
> > > +		return -EINVAL;
> > > +
> > > +	if (!gpusvm->ops->populate_vram_pfn || !gpusvm->ops->copy_to_vram ||
> > > +	    !gpusvm->ops->copy_to_sram)
> > > +		return -EOPNOTSUPP;
> > > +
> > > +	if (!ctx->mmap_locked) {
> > > +		if (!mmget_not_zero(mm)) {
> > > +			err = -EFAULT;
> > > +			goto err_out;
> > > +		}
> > > +		mmap_write_lock(mm);
> > > +	}
> > > +
> > > +	mmap_assert_locked(mm);
> > > +
> > > +	vas = vma_lookup(mm, start);
> > > +	if (!vas) {
> > > +		err = -ENOENT;
> > > +		goto err_mmunlock;
> > > +	}
> > > +
> > > +	if (end > vas->vm_end || start < vas->vm_start) {
> > > +		err = -EINVAL;
> > > +		goto err_mmunlock;
> > > +	}
> > > +
> > > +	if (!vma_is_anonymous(vas)) {
> > > +		err = -EBUSY;
> > > +		goto err_mmunlock;
> > > +	}
> > > +
> > > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) + sizeof(*dma_addr) +
> > > +		       sizeof(*pages), GFP_KERNEL);
> > > +	if (!buf) {
> > > +		err = -ENOMEM;
> > > +		goto err_mmunlock;
> > > +	}
> > > +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > > +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr)) * npages;
> > > +
> > > +	zdd = drm_gpusvm_zdd_alloc(range);
> > > +	if (!zdd) {
> > > +		err = -ENOMEM;
> > > +		goto err_free;
> > > +	}
> > > +
> > > +	migrate.vma = vas;
> > > +	migrate.src = buf;
> > > +	migrate.dst = migrate.src + npages;
> > > +
> > > +	err = migrate_vma_setup(&migrate);
> > > +	if (err)
> > > +		goto err_free;
> > > +
> > > +	/*
> > > +	 * FIXME: The cases below, !migrate.cpages and migrate.cpages !=
> > > +	 * npages, are not always an error. Need to revisit possible cases
> > > +	 * and how to handle them. We could prefault on migrate.cpages !=
> > > +	 * npages via hmm_range_fault.
> > > +	 */
> > > +
> > > +	if (!migrate.cpages) {
> > > +		err = -EFAULT;
> > > +		goto err_free;
> > > +	}
> > > +
> > > +	if (migrate.cpages != npages) {
> > > +		err = -EBUSY;
> > > +		goto err_finalize;
> > > +	}
> > > +
> > > +	err = gpusvm->ops->populate_vram_pfn(gpusvm,
> > > vram_allocation, npages,
> > > +					     migrate.dst);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> > > dma_addr,
> > > +					   migrate.src, npages,
> > > DMA_TO_DEVICE);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	for (i = 0; i < npages; ++i) {
> > > +		struct page *page = pfn_to_page(migrate.dst[i]);
> > > +
> > > +		pages[i] = page;
> > > +		migrate.dst[i] = migrate_pfn(migrate.dst[i]);
> > > +		drm_gpusvm_get_vram_page(page, zdd);
> > > +	}
> > > +
> > > +	err = gpusvm->ops->copy_to_vram(gpusvm, pages, dma_addr,
> > > npages);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	/* Upon success bind vram allocation to range and zdd */
> > > +	range->vram_allocation = vram_allocation;
> > > +	WRITE_ONCE(zdd->vram_allocation, vram_allocation);	/* Owns ref */
> > > +
> > > +err_finalize:
> > > +	if (err)
> > > +		drm_gpusvm_migration_put_pages(npages,
> > > migrate.dst);
> > > +	migrate_vma_pages(&migrate);
> > > +	migrate_vma_finalize(&migrate);
> > > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev,
> > > dma_addr,
> > > npages,
> > > +				       DMA_TO_DEVICE);
> > > +err_free:
> > > +	if (zdd)
> > > +		drm_gpusvm_zdd_put(zdd);
> > > +	kvfree(buf);
> > > +err_mmunlock:
> > > +	if (!ctx->mmap_locked) {
> > > +		mmap_write_unlock(mm);
> > > +		mmput(mm);
> > > +	}
> > > +err_out:
> > > +	return err;
> > > +}
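
A hedged sketch of how a fault handler might opt a range into VRAM before
binding; my_alloc_vram()/my_release_vram() stand in for whatever driver object
(e.g. a TTM BO) backs the allocation and are not part of this patch:

	if (range->flags.migrate_vram) {
		void *vram = my_alloc_vram(gpusvm, range->va.end - range->va.start);

		if (vram && drm_gpusvm_migrate_to_vram(gpusvm, range, vram, &ctx))
			my_release_vram(vram);	/* failed, fall back to SRAM pages */
	}

On success the reference is owned by the zdd and dropped later through
ops->vram_release, per the kernel-doc above.
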
> > > +
> > > +/**
> > > + * drm_gpusvm_migrate_populate_sram_pfn - Populate SRAM PFNs for
> > > a
> > > VM area
> > > + * @vas: Pointer to the VM area structure, can be NULL
> > > + * @npages: Number of pages to populate
> > > + * @src_mpfn: Source array of migrate PFNs
> > > + * @mpfn: Array of migrate PFNs to populate
> > > + * @addr: Start address for PFN allocation
> > > + *
> > > + * This function populates the SRAM migrate page frame numbers (PFNs)
> > > + * for the specified VM area structure. It allocates and locks pages in
> > > + * the VM area for SRAM usage. If @vas is non-NULL, alloc_page_vma() is
> > > + * used for allocation; if NULL, alloc_page() is used.
> > > + *
> > > + * Returns:
> > > + * 0 on success, negative error code on failure.
> > > + */
> > > +static int drm_gpusvm_migrate_populate_sram_pfn(struct vm_area_struct *vas,
> > > +						unsigned long npages,
> > > +						unsigned long *src_mpfn,
> > > +						unsigned long *mpfn,
> > > +						u64 addr)
> > > +{
> > > +	unsigned long i;
> > > +
> > > +	for (i = 0; i < npages; ++i, addr += PAGE_SIZE) {
> > > +		struct page *page;
> > > +
> > > +		if (!(src_mpfn[i] & MIGRATE_PFN_MIGRATE))
> > > +			continue;
> > > +
> > > +		if (vas)
> > > +			page = alloc_page_vma(GFP_HIGHUSER, vas,
> > > addr);
> > > +		else
> > > +			page = alloc_page(GFP_HIGHUSER);
> > > +
> > > +		if (!page)
> > > +			return -ENOMEM;
> > > +
> > > +		lock_page(page);
> > > +		mpfn[i] = migrate_pfn(page_to_pfn(page));
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_evict_to_sram - Evict GPU SVM range to SRAM
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + *
> > > + * Similar to __drm_gpusvm_migrate_to_sram but does not require the mmap
> > > + * lock; migration is done via the migrate_device_* functions. This is a
> > > + * fallback path, as it is preferred to issue migrations with the mmap
> > > + * lock held.
> > > + *
> > > + * Returns:
> > > + * 0 on success, negative error code on failure.
> > > + */
> > > +static int drm_gpusvm_evict_to_sram(struct drm_gpusvm *gpusvm,
> > > +				    struct drm_gpusvm_range
> > > *range)
> > > +{
> > > +	unsigned long npages;
> > > +	struct page **pages;
> > > +	unsigned long *src, *dst;
> > > +	dma_addr_t *dma_addr;
> > > +	void *buf;
> > > +	int i, err = 0;
> > > +
> > > +	npages = npages_in_range(range->va.start, range->va.end);
> > > +
> > > +	buf = kvcalloc(npages, 2 * sizeof(*src) + sizeof(*dma_addr) +
> > > +		       sizeof(*pages), GFP_KERNEL);
> > > +	if (!buf) {
> > > +		err = -ENOMEM;
> > > +		goto err_out;
> > > +	}
> > > +	src = buf;
> > > +	dst = buf + (sizeof(*src) * npages);
> > > +	dma_addr = buf + (2 * sizeof(*src) * npages);
> > > +	pages = buf + (2 * sizeof(*src) + sizeof(*dma_addr)) *
> > > npages;
> > > +
> > > +	err = gpusvm->ops->populate_vram_pfn(gpusvm, range->vram_allocation,
> > > +					     npages, src);
> > > +	if (err)
> > > +		goto err_free;
> > > +
> > > +	err = migrate_device_vma_range(gpusvm->mm,
> > > +				       gpusvm->device_private_page_owner, src,
> > > +				       npages, range->va.start);
> > > +	if (err)
> > > +		goto err_free;
> > > +
> > > +	err = drm_gpusvm_migrate_populate_sram_pfn(NULL, npages,
> > > src, dst, 0);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> > > dma_addr,
> > > +					   dst, npages,
> > > DMA_BIDIRECTIONAL);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	for (i = 0; i < npages; ++i)
> > > +		pages[i] = migrate_pfn_to_page(src[i]);
> > > +
> > > +	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr,
> > > npages);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +err_finalize:
> > > +	if (err)
> > > +		drm_gpusvm_migration_put_pages(npages, dst);
> > > +	migrate_device_pages(src, dst, npages);
> > > +	migrate_device_finalize(src, dst, npages);
> > > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev,
> > > dma_addr,
> > > npages,
> > > +				       DMA_BIDIRECTIONAL);
> > > +err_free:
> > > +	kvfree(buf);
> > > +err_out:
> > > +
> > > +	return err;
> > > +}
> > > +
> > > +/**
> > > + * __drm_gpusvm_migrate_to_sram - Migrate GPU SVM range to SRAM
> > > (internal)
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @vas: Pointer to the VM area structure
> > > + * @page: Pointer to the page for fault handling (can be NULL)
> > > + * @start: Start address of the migration range
> > > + * @end: End address of the migration range
> > > + *
> > > + * This internal function performs the migration of the
> > > specified
> > > GPU SVM range
> > > + * to SRAM. It sets up the migration, populates + dma maps SRAM
> > > PFNs, and
> > > + * invokes the driver-specific operations for migration to SRAM.
> > > + *
> > > + * Returns:
> > > + * 0 on success, negative error code on failure.
> > > + */
> > > +static int __drm_gpusvm_migrate_to_sram(struct drm_gpusvm
> > > *gpusvm,
> > > +					struct vm_area_struct
> > > *vas,
> > > +					struct page *page,
> > > +					u64 start, u64 end)
> > > +{
> > > +	struct migrate_vma migrate = {
> > > +		.vma		= vas,
> > > +		.pgmap_owner	= gpusvm->device_private_page_owner,
> > > +		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> > > +		.fault_page	= page,
> > > +	};
> > > +	unsigned long npages;
> > > +	struct page **pages;
> > > +	dma_addr_t *dma_addr;
> > > +	void *buf;
> > > +	int i, err = 0;
> > > +
> > > +	mmap_assert_locked(gpusvm->mm);
> > > +
> > > +	/* Corner case where the VMA has been partially unmapped */
> > > +	if (start < vas->vm_start)
> > > +		start = vas->vm_start;
> > > +	if (end > vas->vm_end)
> > > +		end = vas->vm_end;
> > > +
> > > +	migrate.start = start;
> > > +	migrate.end = end;
> > > +	npages = npages_in_range(start, end);
> > > +
> > > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) + sizeof(*dma_addr) +
> > > +		       sizeof(*pages), GFP_KERNEL);
> > > +	if (!buf) {
> > > +		err = -ENOMEM;
> > > +		goto err_out;
> > > +	}
> > > +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > > +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr)) * npages;
> > > +
> > > +	migrate.vma = vas;
> > > +	migrate.src = buf;
> > > +	migrate.dst = migrate.src + npages;
> > > +
> > > +	err = migrate_vma_setup(&migrate);
> > > +	if (err)
> > > +		goto err_free;
> > > +
> > > +	/* Raced with another CPU fault, nothing to do */
> > > +	if (!migrate.cpages)
> > > +		goto err_free;
> > > +
> > > +	err = drm_gpusvm_migrate_populate_sram_pfn(vas, npages,
> > > +						   migrate.src,
> > > migrate.dst,
> > > +						   start);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> > > dma_addr,
> > > +					   migrate.dst, npages,
> > > +					   DMA_BIDIRECTIONAL);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	for (i = 0; i < npages; ++i)
> > > +		pages[i] = migrate_pfn_to_page(migrate.src[i]);
> > > +
> > > +	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr,
> > > npages);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +err_finalize:
> > > +	if (err)
> > > +		drm_gpusvm_migration_put_pages(npages,
> > > migrate.dst);
> > > +	migrate_vma_pages(&migrate);
> > > +	migrate_vma_finalize(&migrate);
> > > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev,
> > > dma_addr,
> > > npages,
> > > +				       DMA_BIDIRECTIONAL);
> > > +err_free:
> > > +	kvfree(buf);
> > > +err_out:
> > > +	mmap_assert_locked(gpusvm->mm);
> > > +
> > > +	return err;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migrate_to_sram - Migrate (evict) GPU SVM range to
> > > SRAM
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + * @ctx: GPU SVM context
> > > + *
> > > + * This function initiates the migration of the specified GPU
> > > SVM
> > > range to
> > > + * SRAM. It performs necessary checks and invokes the internal
> > > migration
> > > + * function for actual migration.
> > > + *
> > > + * Returns:
> > > + * 0 on success, negative error code on failure.
> > > + */
> > > +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > > +			       struct drm_gpusvm_range *range,
> > > +			       const struct drm_gpusvm_ctx *ctx)
> > > +{
> > > +	u64 start = range->va.start, end = range->va.end;
> > > +	struct mm_struct *mm = gpusvm->mm;
> > > +	struct vm_area_struct *vas;
> > > +	int err;
> > > +	bool retry = false;
> > > +
> > > +	if (!ctx->mmap_locked) {
> > > +		if (!mmget_not_zero(mm)) {
> > > +			err = -EFAULT;
> > > +			goto err_out;
> > > +		}
> > > +		if (ctx->trylock_mmap) {
> > > +			if (!mmap_read_trylock(mm)) {
> > > +				err = drm_gpusvm_evict_to_sram(gpusvm, range);
> > > +				goto err_mmput;
> > > +			}
> > > +		} else {
> > > +			mmap_read_lock(mm);
> > > +		}
> > > +	}
> > > +
> > > +	mmap_assert_locked(mm);
> > > +
> > > +	/*
> > > +	 * Loop required to find all VMA area structs for the corner case
> > > +	 * when VRAM backing has been partially unmapped from the MM's
> > > +	 * address space.
> > > +	 */
> > > +again:
> > > +	vas = find_vma(mm, start);
> > > +	if (!vas) {
> > > +		if (!retry)
> > > +			err = -ENOENT;
> > > +		goto err_mmunlock;
> > > +	}
> > > +
> > > +	if (end <= vas->vm_start || start >= vas->vm_end) {
> > > +		if (!retry)
> > > +			err = -EINVAL;
> > > +		goto err_mmunlock;
> > > +	}
> > > +
> > > +	err = __drm_gpusvm_migrate_to_sram(gpusvm, vas, NULL,
> > > start,
> > > end);
> > > +	if (err)
> > > +		goto err_mmunlock;
> > > +
> > > +	if (vas->vm_end < end) {
> > > +		retry = true;
> > > +		start = vas->vm_end;
> > > +		goto again;
> > > +	}
> > > +
> > > +	if (!ctx->mmap_locked) {
> > > +		mmap_read_unlock(mm);
> > > +		/*
> > > +		 * Using mmput_async as this function can be called while
> > > +		 * holding a dma-resv lock, and a final put can grab the
> > > +		 * mmap lock, causing a lock inversion.
> > > +		 */
> > > +		mmput_async(mm);
> > > +	}
> > > +
> > > +	return 0;
> > > +
> > > +err_mmunlock:
> > > +	if (!ctx->mmap_locked)
> > > +		mmap_read_unlock(mm);
> > > +err_mmput:
> > > +	if (!ctx->mmap_locked)
> > > +		mmput_async(mm);
> > > +err_out:
> > > +	return err;
> > > +}
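
For the eviction path discussed above (where the caller may already hold a
dma-resv lock), the context would look roughly like the following; this is
only a sketch of the intended usage, not code from the patch:

	struct drm_gpusvm_ctx ctx = { .trylock_mmap = true };

	/* falls back to drm_gpusvm_evict_to_sram() if the mmap lock is contended */
	err = drm_gpusvm_migrate_to_sram(gpusvm, range, &ctx);
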
> > > +
> > > +/**
> > > + * drm_gpusvm_page_free - Put GPU SVM zone device data
> > > associated
> > > with a page
> > > + * @page: Pointer to the page
> > > + *
> > > + * This function is a callback used to put the GPU SVM zone
> > > device
> > > data
> > > + * associated with a page when it is being released.
> > > + */
> > > +static void drm_gpusvm_page_free(struct page *page)
> > > +{
> > > +	drm_gpusvm_zdd_put(page->zone_device_data);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migrate_to_ram - Migrate GPU SVM range to RAM
> > > (page
> > > fault handler)
> > > + * @vmf: Pointer to the fault information structure
> > > + *
> > > + * This function is a page fault handler used to migrate a GPU
> > > SVM
> > > range to RAM.
> > > + * It retrieves the GPU SVM range information from the faulting
> > > page
> > > and invokes
> > > + * the internal migration function to migrate the range back to
> > > RAM.
> > > + *
> > > + * Returns:
> > > + * VM_FAULT_SIGBUS on failure, 0 on success.
> > > + */
> > > +static vm_fault_t drm_gpusvm_migrate_to_ram(struct vm_fault
> > > *vmf)
> > > +{
> > > +	struct drm_gpusvm_zdd *zdd = vmf->page->zone_device_data;
> > > +	int err;
> > > +
> > > +	err = __drm_gpusvm_migrate_to_sram(zdd->range->gpusvm,
> > > +					   vmf->vma, vmf->page,
> > > +					   zdd->range->va.start,
> > > +					   zdd->range->va.end);
> > > +
> > > +	return err ? VM_FAULT_SIGBUS : 0;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_pagemap_ops - Device page map operations for GPU
> > > SVM
> > > + */
> > > +static const struct dev_pagemap_ops drm_gpusvm_pagemap_ops = {
> > > +	.page_free = drm_gpusvm_page_free,
> > > +	.migrate_to_ram = drm_gpusvm_migrate_to_ram,
> > > +};
> > > +
> > > +/**
> > > + * drm_gpusvm_pagemap_ops_get - Retrieve GPU SVM device page map
> > > operations
> > > + *
> > > + * Returns:
> > > + * Pointer to the GPU SVM device page map operations structure.
> > > + */
> > > +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void)
> > > +{
> > > +	return &drm_gpusvm_pagemap_ops;
> > > +}
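
For reference, the exported ops are meant to be plugged into the driver's
device-private pagemap registration. A sketch with illustrative values only
(resource setup elided; my_device_private_page_owner is a placeholder that
must match the owner passed to the GPU SVM):

	pagemap->type = MEMORY_DEVICE_PRIVATE;
	pagemap->range.start = res->start;
	pagemap->range.end = res->end;
	pagemap->nr_range = 1;
	pagemap->ops = drm_gpusvm_pagemap_ops_get();
	pagemap->owner = my_device_private_page_owner;
	if (IS_ERR(devm_memremap_pages(dev, pagemap)))
		return -ENOMEM;
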
> > > +
> > > +/**
> > > + * drm_gpusvm_has_mapping - Check if GPU SVM has mapping for the
> > > given address range
> > > + * @gpusvm: Pointer to the GPU SVM structure.
> > > + * @start: Start address
> > > + * @end: End address
> > > + *
> > > + * Returns:
> > > + * True if GPU SVM has mapping, False otherwise
> > > + */
> > > +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64
> > > start,
> > > u64 end)
> > > +{
> > > +	struct drm_gpusvm_notifier *notifier;
> > > +
> > > +	drm_gpusvm_for_each_notifier(notifier, gpusvm, start, end) {
> > > +		struct drm_gpusvm_range *range = NULL;
> > > +
> > > +		drm_gpusvm_for_each_range(range, notifier, start, end)
> > > +			return true;
> > > +	}
> > > +
> > > +	return false;
> > > +}
> > > diff --git a/drivers/gpu/drm/xe/drm_gpusvm.h
> > > b/drivers/gpu/drm/xe/drm_gpusvm.h
> > > new file mode 100644
> > > index 000000000000..0ea70f8534a8
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/xe/drm_gpusvm.h
> > > @@ -0,0 +1,415 @@
> > > +/* SPDX-License-Identifier: MIT */
> > > +/*
> > > + * Copyright © 2024 Intel Corporation
> > > + */
> > > +
> > > +#ifndef __DRM_GPUSVM_H__
> > > +#define __DRM_GPUSVM_H__
> > > +
> > > +#include <linux/kref.h>
> > > +#include <linux/mmu_notifier.h>
> > > +#include <linux/workqueue.h>
> > > +
> > > +struct dev_pagemap_ops;
> > > +struct drm_device;
> > > +struct drm_gpusvm;
> > > +struct drm_gpusvm_notifier;
> > > +struct drm_gpusvm_ops;
> > > +struct drm_gpusvm_range;
> > > +
> > > +/**
> > > + * struct drm_gpusvm_ops - Operations structure for GPU SVM
> > > + *
> > > + * This structure defines the operations for GPU Shared Virtual
> > > Memory (SVM).
> > > + * These operations are provided by the GPU driver to manage SVM
> > > ranges and
> > > + * perform operations such as migration between VRAM and system
> > > RAM.
> > > + */
> > > +struct drm_gpusvm_ops {
> > > +	/**
> > > +	 * @notifier_alloc: Allocate a GPU SVM notifier
> > > (optional)
> > > +	 *
> > > +	 * This function shall allocate a GPU SVM notifier.
> > > +	 *
> > > +	 * Returns:
> > > +	 * Pointer to the allocated GPU SVM notifier on success,
> > > NULL on failure.
> > > +	 */
> > > +	struct drm_gpusvm_notifier *(*notifier_alloc)(void);
> > > +
> > > +	/**
> > > +	 * @notifier_free: Free a GPU SVM notifier (optional)
> > > +	 * @notifier: Pointer to the GPU SVM notifier to be
> > > freed
> > > +	 *
> > > +	 * This function shall free a GPU SVM notifier.
> > > +	 */
> > > +	void (*notifier_free)(struct drm_gpusvm_notifier
> > > *notifier);
> > > +
> > > +	/**
> > > +	 * @range_alloc: Allocate a GPU SVM range (optional)
> > > +	 * @gpusvm: Pointer to the GPU SVM
> > > +	 *
> > > +	 * This function shall allocate a GPU SVM range.
> > > +	 *
> > > +	 * Returns:
> > > +	 * Pointer to the allocated GPU SVM range on success,
> > > NULL
> > > on failure.
> > > +	 */
> > > +	struct drm_gpusvm_range *(*range_alloc)(struct
> > > drm_gpusvm
> > > *gpusvm);
> > > +
> > > +	/**
> > > +	 * @range_free: Free a GPU SVM range (optional)
> > > +	 * @range: Pointer to the GPU SVM range to be freed
> > > +	 *
> > > +	 * This function shall free a GPU SVM range.
> > > +	 */
> > > +	void (*range_free)(struct drm_gpusvm_range *range);
> > > +
> > > +	/**
> > > +	 * @vram_release: Release VRAM allocation (optional)
> > > +	 * @vram_allocation: Driver-private pointer to the VRAM
> > > allocation
> > > +	 *
> > > +	 * This function shall release VRAM allocation and
> > > expects
> > > to drop a
> > > +	 * reference to VRAM allocation.
> > > +	 */
> > > +	void (*vram_release)(void *vram_allocation);
> > > +
> > > +	/**
> > > +	 * @populate_vram_pfn: Populate VRAM PFN (required for
> > > migration)
> > > +	 * @gpusvm: Pointer to the GPU SVM
> > > +	 * @vram_allocation: Driver-private pointer to the VRAM
> > > allocation
> > > +	 * @npages: Number of pages to populate
> > > +	 * @pfn: Array of page frame numbers to populate
> > > +	 *
> > > +	 * This function shall populate VRAM page frame numbers
> > > (PFN).
> > > +	 *
> > > +	 * Returns:
> > > +	 * 0 on success, a negative error code on failure.
> > > +	 */
> > > +	int (*populate_vram_pfn)(struct drm_gpusvm *gpusvm,
> > > +				 void *vram_allocation,
> > > +				 unsigned long npages,
> > > +				 unsigned long *pfn);
> > > +
> > > +	/**
> > > +	 * @copy_to_vram: Copy to VRAM (required for migration)
> > > +	 * @gpusvm: Pointer to the GPU SVM
> > > +	 * @pages: Pointer to array of VRAM pages (destination)
> > > +	 * @dma_addr: Pointer to array of DMA addresses (source)
> > > +	 * @npages: Number of pages to copy
> > > +	 *
> > > +	 * This function shall copy pages to VRAM.
> > > +	 *
> > > +	 * Returns:
> > > +	 * 0 on success, a negative error code on failure.
> > > +	 */
> > > +	int (*copy_to_vram)(struct drm_gpusvm *gpusvm,
> > > +			    struct page **pages,
> > > +			    dma_addr_t *dma_addr,
> > > +			    unsigned long npages);
> > > +
> > > +	/**
> > > +	 * @copy_to_sram: Copy to system RAM (required for
> > > migration)
> > > +	 * @gpusvm: Pointer to the GPU SVM
> > > +	 * @pages: Pointer to array of VRAM pages (source)
> > > +	 * @dma_addr: Pointer to array of DMA addresses
> > > (destination)
> > > +	 * @npages: Number of pages to copy
> > > +	 *
> > > +	 * This function shall copy pages to system RAM.
> > > +	 *
> > > +	 * Returns:
> > > +	 * 0 on success, a negative error code on failure.
> > > +	 */
> > > +	int (*copy_to_sram)(struct drm_gpusvm *gpusvm,
> > > +			    struct page **pages,
> > > +			    dma_addr_t *dma_addr,
> > > +			    unsigned long npages);
> > > +
> > > +	/**
> > > +	 * @invalidate: Invalidate GPU SVM notifier (required)
> > > +	 * @gpusvm: Pointer to the GPU SVM
> > > +	 * @notifier: Pointer to the GPU SVM notifier
> > > +	 * @mmu_range: Pointer to the mmu_notifier_range
> > > structure
> > > +	 *
> > > +	 * This function shall invalidate the GPU page tables.
> > > It
> > > can safely
> > > +	 * walk the notifier range RB tree/list in this
> > > function.
> > > Called while
> > > +	 * holding the notifier lock.
> > > +	 */
> > > +	void (*invalidate)(struct drm_gpusvm *gpusvm,
> > > +			   struct drm_gpusvm_notifier *notifier,
> > > +			   const struct mmu_notifier_range
> > > *mmu_range);
> > > +};
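
Putting the hooks together, a driver-side ops table might look like the
following; the my_* implementations are assumed to exist in the driver and are
not shown here:

static const struct drm_gpusvm_ops my_gpusvm_ops = {
	/* required */
	.invalidate		= my_invalidate,
	/* required for migration to/from VRAM */
	.populate_vram_pfn	= my_populate_vram_pfn,
	.copy_to_vram		= my_copy_to_vram,
	.copy_to_sram		= my_copy_to_sram,
	/* optional */
	.vram_release		= my_vram_release,
};
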
> > > +
> > > +/**
> > > + * struct drm_gpusvm_notifier - Structure representing a GPU SVM
> > > notifier
> > > + *
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @notifier: MMU interval notifier
> > > + * @interval: Interval for the notifier
> > > + * @rb: Red-black tree node for the parent GPU SVM structure
> > > notifier tree
> > > + * @root: Cached root node of the RB tree containing ranges
> > > + * @range_list: List head containing ranges in the same order they
> > > + *              appear in the interval tree. This is useful to keep
> > > + *              iterating ranges while doing modifications to the RB tree.
> > > + * @flags.removed: Flag indicating whether the MMU interval
> > > notifier
> > > has been
> > > + *                 removed
> > > + *
> > > + * This structure represents a GPU SVM notifier.
> > > + */
> > > +struct drm_gpusvm_notifier {
> > > +	struct drm_gpusvm *gpusvm;
> > > +	struct mmu_interval_notifier notifier;
> > > +	struct {
> > > +		u64 start;
> > > +		u64 end;
> > > +	} interval;
> > > +	struct {
> > > +		struct rb_node node;
> > > +		struct list_head entry;
> > > +		u64 __subtree_last;
> > > +	} rb;
> > > +	struct rb_root_cached root;
> > > +	struct list_head range_list;
> > > +	struct {
> > > +		u32 removed : 1;
> > > +	} flags;
> > > +};
> > > +
> > > +/**
> > > + * struct drm_gpusvm_range - Structure representing a GPU SVM
> > > range
> > > + *
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @notifier: Pointer to the GPU SVM notifier
> > > + * @refcount: Reference count for the range
> > > + * @rb: Red-black tree node for the parent GPU SVM notifier
> > > structure range tree
> > > + * @va: Virtual address range
> > > + * @notifier_seq: Notifier sequence number of the range's pages
> > > + * @pages: Pointer to the array of pages (if backing store is in
> > > VRAM)
> > > + * @dma_addr: DMA address array (if backing store is SRAM and
> > > DMA
> > > mapped)
> > > + * @vram_allocation: Driver-private pointer to the VRAM
> > > allocation
> > > + * @order: Order of dma mapping. i.e. PAGE_SIZE << order is
> > > mapping
> > > size
> > > + * @flags.migrate_vram: Flag indicating whether the range can be
> > > migrated to VRAM
> > > + * @flags.unmapped: Flag indicating if the range has been
> > > unmapped
> > > + * @flags.partial_unmap: Flag indicating if the range has been
> > > partially unmapped
> > > + * @flags.has_vram_pages: Flag indicating if the range has vram
> > > pages
> > > + * @flags.has_dma_mapping: Flag indicating if the range has a
> > > DMA
> > > mapping
> > > + * @flags.kfree_mapping: Flag indicating @dma_addr is a compact
> > > allocation based
> > > + *                       on @order which releases via kfree
> > > + *
> > > + * This structure represents a GPU SVM range used for tracking
> > > memory ranges
> > > + * mapped in a DRM device.
> > > + */
> > > +struct drm_gpusvm_range {
> > > +	struct drm_gpusvm *gpusvm;
> > > +	struct drm_gpusvm_notifier *notifier;
> > > +	struct kref refcount;
> > > +	struct {
> > > +		struct rb_node node;
> > > +		struct list_head entry;
> > > +		u64 __subtree_last;
> > > +	} rb;
> > > +	struct {
> > > +		u64 start;
> > > +		u64 end;
> > > +	} va;
> > > +	unsigned long notifier_seq;
> > > +	union {
> > > +		struct page **pages;
> > > +		dma_addr_t *dma_addr;
> > > +	};
> > > +	void *vram_allocation;
> > > +	u16 order;
> > > +	struct {
> > > +		/* All flags below must be set upon creation */
> > > +		u16 migrate_vram : 1;
> > > +		/* All flags below must be set / cleared under
> > > notifier lock */
> > > +		u16 unmapped : 1;
> > > +		u16 partial_unmap : 1;
> > > +		u16 has_vram_pages : 1;
> > > +		u16 has_dma_mapping : 1;
> > > +		u16 kfree_mapping : 1;
> > > +	} flags;
> > > +};
> > > +
> > > +/**
> > > + * struct drm_gpusvm - GPU SVM structure
> > > + *
> > > + * @name: Name of the GPU SVM
> > > + * @drm: Pointer to the DRM device structure
> > > + * @mm: Pointer to the mm_struct for the address space
> > > + * @device_private_page_owner: Device private pages owner
> > > + * @mm_start: Start address of GPU SVM
> > > + * @mm_range: Range of the GPU SVM
> > > + * @notifier_size: Size of individual notifiers
> > > + * @ops: Pointer to the operations structure for GPU SVM
> > > + * @chunk_sizes: Pointer to the array of chunk sizes used in
> > > range
> > > allocation.
> > > + *               Entries should be powers of 2 in descending
> > > order.
> > > + * @num_chunks: Number of chunks
> > > + * @notifier_lock: Read-write semaphore for protecting notifier
> > > operations
> > > + * @zdd_wq: Workqueue for deferred work on zdd destruction
> > > + * @root: Cached root node of the Red-Black tree containing GPU
> > > SVM
> > > notifiers
> > > + * @notifier_list: list head containing of notifiers in the same
> > > order they
> > > + *                 appear in interval tree. This is useful to
> > > keep
> > > iterating
> > > + *                 notifiers while doing modifications to RB
> > > tree.
> > > + *
> > > + * This structure represents a GPU SVM (Shared Virtual Memory)
> > > used
> > > for tracking
> > > + * memory ranges mapped in a DRM (Direct Rendering Manager)
> > > device.
> > > + *
> > > + * No reference counting is provided, as this is expected to be
> > > embedded in the
> > > + * driver VM structure along with the struct drm_gpuvm, which
> > > handles reference
> > > + * counting.
> > > + */
> > > +struct drm_gpusvm {
> > > +	const char *name;
> > > +	struct drm_device *drm;
> > > +	struct mm_struct *mm;
> > > +	void *device_private_page_owner;
> > > +	u64 mm_start;
> > > +	u64 mm_range;
> > > +	u64 notifier_size;
> > > +	const struct drm_gpusvm_ops *ops;
> > > +	const u64 *chunk_sizes;
> > > +	int num_chunks;
> > > +	struct rw_semaphore notifier_lock;
> > > +	struct workqueue_struct *zdd_wq;
> > > +	struct rb_root_cached root;
> > > +	struct list_head notifier_list;
> > > +};
> > > +
> > > +/**
> > > + * struct drm_gpusvm_ctx - DRM GPU SVM context
> > > + *
> > > + * @mmap_locked: mmap lock is locked
> > > + * @trylock_mmap: trylock mmap lock, used to avoid locking
> > > inversions
> > > + *                (e.g.dma-revs -> mmap lock)
> > > + * @in_notifier: entering from a MMU notifier
> > > + * @read_only: operating on read-only memory
> > > + * @vram_possible: possible to use VRAM
> > > + * @prefault: prefault pages
> > > + *
> > > + * Context that is DRM GPUSVM is operating in (i.e. user
> > > arguments).
> > > + */
> > > +struct drm_gpusvm_ctx {
> > > +	u32 mmap_locked :1;
> > > +	u32 trylock_mmap :1;
> > > +	u32 in_notifier :1;
> > > +	u32 read_only :1;
> > > +	u32 vram_possible :1;
> > > +	u32 prefault :1;
> > > +};
> > > +
> > > +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> > > +		    const char *name, struct drm_device *drm,
> > > +		    struct mm_struct *mm, void
> > > *device_private_page_owner,
> > > +		    u64 mm_start, u64 mm_range, u64
> > > notifier_size,
> > > +		    const struct drm_gpusvm_ops *ops,
> > > +		    const u64 *chunk_sizes, int num_chunks);
> > > +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm);
> > > +void drm_gpusvm_free(struct drm_gpusvm *gpusvm);
> > > +
> > > +struct drm_gpusvm_range *
> > > +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64
> > > fault_addr,
> > > +				u64 gpuva_start, u64 gpuva_end,
> > > +				const struct drm_gpusvm_ctx
> > > *ctx);
> > > +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> > > +			     struct drm_gpusvm_range *range);
> > > +
> > > +struct drm_gpusvm_range *
> > > +drm_gpusvm_range_get(struct drm_gpusvm_range *range);
> > > +void drm_gpusvm_range_put(struct drm_gpusvm_range *range);
> > > +
> > > +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> > > +				  struct drm_gpusvm_range
> > > *range);
> > > +
> > > +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> > > +			       struct drm_gpusvm_range *range,
> > > +			       const struct drm_gpusvm_ctx
> > > *ctx);
> > > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> > > +				  struct drm_gpusvm_range
> > > *range,
> > > +				  const struct drm_gpusvm_ctx
> > > *ctx);
> > > +
> > > +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> > > +			       struct drm_gpusvm_range *range,
> > > +			       void *vram_allocation,
> > > +			       const struct drm_gpusvm_ctx
> > > *ctx);
> > > +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > > +			       struct drm_gpusvm_range *range,
> > > +			       const struct drm_gpusvm_ctx
> > > *ctx);
> > > +
> > > +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void);
> > > +
> > > +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64
> > > start,
> > > u64 end);
> > > +
> > > +struct drm_gpusvm_range *
> > > +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64
> > > start, u64 end);
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_lock - Lock GPU SVM notifier
> > > + * @gpusvm__: Pointer to the GPU SVM structure.
> > > + *
> > > + * Abstract client usage GPU SVM notifier lock, take lock
> > > + */
> > > +#define drm_gpusvm_notifier_lock(gpusvm__)	\
> > > +	down_read(&(gpusvm__)->notifier_lock)
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_unlock - Unlock GPU SVM notifier
> > > + * @gpusvm__: Pointer to the GPU SVM structure.
> > > + *
> > > + * Abstract client usage GPU SVM notifier lock, drop lock
> > > + */
> > > +#define drm_gpusvm_notifier_unlock(gpusvm__)	\
> > > +	up_read(&(gpusvm__)->notifier_lock)
> > > +
> > > +/**
> > > + * __drm_gpusvm_range_next - Get the next GPU SVM range in the
> > > list
> > > + * @range: a pointer to the current GPU SVM range
> > > + *
> > > + * Return: A pointer to the next drm_gpusvm_range if available,
> > > or
> > > NULL if the
> > > + *         current range is the last one or if the input range
> > > is
> > > NULL.
> > > + */
> > > +static inline struct drm_gpusvm_range *
> > > +__drm_gpusvm_range_next(struct drm_gpusvm_range *range)
> > > +{
> > > +	if (range && !list_is_last(&range->rb.entry,
> > > +				   &range->notifier-
> > > >range_list))
> > > +		return list_next_entry(range, rb.entry);
> > > +
> > > +	return NULL;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_for_each_range - Iterate over GPU SVM ranges in a
> > > notifier
> > > + * @range__: Iterator variable for the ranges. If set, it
> > > indicates
> > > the start of
> > > + *	     the iterator. If NULL, call drm_gpusvm_range_find()
> > > to
> > > get the range.
> > > + * @notifier__: Pointer to the GPU SVM notifier
> > > + * @start__: Start address of the range
> > > + * @end__: End address of the range
> > > + *
> > > + * This macro is used to iterate over GPU SVM ranges in a
> > > notifier.
> > > It is safe
> > > + * to use while holding the driver SVM lock or the notifier
> > > lock.
> > > + */
> > > +#define drm_gpusvm_for_each_range(range__, notifier__, start__,
> > > end__)	\
> > > +	for ((range__) = (range__)
> > > ?:					\
> > > +	     drm_gpusvm_range_find((notifier__), (start__),
> > > (end__));	\
> > > +	     (range__) && (range__->va.start <
> > > (end__));		\
> > > +	     (range__) = __drm_gpusvm_range_next(range__))
> > > +
> > > +/**
> > > + * drm_gpusvm_range_set_unmapped - Mark a GPU SVM range as
> > > unmapped
> > > + * @range: Pointer to the GPU SVM range structure.
> > > + * @mmu_range: Pointer to the MMU notifier range structure.
> > > + *
> > > + * This function marks a GPU SVM range as unmapped and sets the
> > > partial_unmap flag
> > > + * if the range partially falls within the provided MMU notifier
> > > range.
> > > + */
> > > +static inline void
> > > +drm_gpusvm_range_set_unmapped(struct drm_gpusvm_range *range,
> > > +			      const struct mmu_notifier_range
> > > *mmu_range)
> > > +{
> > > +	lockdep_assert_held_write(&range->gpusvm-
> > > >notifier_lock);
> > > +
> > > +	range->flags.unmapped = true;
> > > +	if (range->va.start < mmu_range->start ||
> > > +	    range->va.end > mmu_range->end)
> > > +		range->flags.partial_unmap = true;
> > > +}
> > > +
> > > +#endif /* __DRM_GPUSVM_H__ */
> > 


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory
  2024-08-29 19:18       ` Thomas Hellström
@ 2024-08-29 20:56         ` Matthew Brost
  2024-08-30  8:18           ` Thomas Hellström
                             ` (2 more replies)
  0 siblings, 3 replies; 100+ messages in thread
From: Matthew Brost @ 2024-08-29 20:56 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: intel-xe, dri-devel, airlied, christian.koenig, matthew.auld,
	daniel

On Thu, Aug 29, 2024 at 09:18:29PM +0200, Thomas Hellström wrote:
> Hi, Matthew,
> 
> On Thu, 2024-08-29 at 17:45 +0000, Matthew Brost wrote:
> > On Thu, Aug 29, 2024 at 11:16:49AM +0200, Thomas Hellström wrote:
> > > Hi, Matt. 
> > > 
> > > Some initial design comments / questions:
> > > 
> > > On Tue, 2024-08-27 at 19:48 -0700, Matthew Brost wrote:
> > > > This patch introduces support for GPU Shared Virtual Memory (SVM)
> > > > in
> > > > the
> > > > Direct Rendering Manager (DRM) subsystem. SVM allows for seamless
> > > > sharing of memory between the CPU and GPU, enhancing performance
> > > > and
> > > > flexibility in GPU computing tasks.
> > > > 
> > > > The patch adds the necessary infrastructure for SVM, including
> > > > data
> > > > structures and functions for managing SVM ranges and notifiers.
> > > > It
> > > > also
> > > > provides mechanisms for allocating, deallocating, and migrating
> > > > memory
> > > > regions between system RAM and GPU VRAM.
> > > > 
> > > > This mid-layer is largely inspired by GPUVM.
> > > > 
> > > > Cc: Dave Airlie <airlied@redhat.com>
> > > > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > > Cc: Christian König <christian.koenig@amd.com>
> > > > Cc: <dri-devel@lists.freedesktop.org>
> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > ---
> > > >  drivers/gpu/drm/xe/Makefile     |    3 +-
> > > >  drivers/gpu/drm/xe/drm_gpusvm.c | 2174
> > > > +++++++++++++++++++++++++++++++
> > > >  drivers/gpu/drm/xe/drm_gpusvm.h |  415 ++++++
> > > >  3 files changed, 2591 insertions(+), 1 deletion(-)
> > > >  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.c
> > > >  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.h
> > > > 
> > > > diff --git a/drivers/gpu/drm/xe/Makefile
> > > > b/drivers/gpu/drm/xe/Makefile
> > > > index b9670ae09a9e..b8fc2ee58f1a 100644
> > > > --- a/drivers/gpu/drm/xe/Makefile
> > > > +++ b/drivers/gpu/drm/xe/Makefile
> > > > @@ -25,7 +25,8 @@ $(obj)/generated/%_wa_oob.c
> > > > $(obj)/generated/%_wa_oob.h: $(obj)/xe_gen_wa_oob \
> > > >  
> > > >  # core driver code
> > > >  
> > > > -xe-y += xe_bb.o \
> > > > +xe-y += drm_gpusvm.o \
> > > > +	xe_bb.o \
> > > >  	xe_bo.o \
> > > >  	xe_bo_evict.o \
> > > >  	xe_devcoredump.o \
> > > > diff --git a/drivers/gpu/drm/xe/drm_gpusvm.c
> > > > b/drivers/gpu/drm/xe/drm_gpusvm.c
> > > > new file mode 100644
> > > > index 000000000000..fc1e44e6ae72
> > > > --- /dev/null
> > > > +++ b/drivers/gpu/drm/xe/drm_gpusvm.c
> > > > @@ -0,0 +1,2174 @@
> > > > +// SPDX-License-Identifier: MIT
> > > > +/*
> > > > + * Copyright © 2024 Intel Corporation
> > > > + *
> > > > + * Authors:
> > > > + *     Matthew Brost <matthew.brost@intel.com>
> > > > + */
> > > > +
> > > > +#include <linux/dma-mapping.h>
> > > > +#include <linux/interval_tree_generic.h>
> > > > +#include <linux/hmm.h>
> > > > +#include <linux/memremap.h>
> > > > +#include <linux/migrate.h>
> > > > +#include <linux/mm_types.h>
> > > > +#include <linux/pagemap.h>
> > > > +#include <linux/slab.h>
> > > > +
> > > > +#include <drm/drm_device.h>
> > > > +#include "drm_gpusvm.h"
> > > > +
> > > > +/**
> > > > + * DOC: Overview
> > > > + *
> > > > + * GPU Shared Virtual Memory (GPU SVM) layer for the Direct
> > > > Rendering Manager (DRM)
> > > > + *
> > > > + * The GPU SVM layer is a component of the DRM framework
> > > > designed to
> > > > manage shared
> > > > + * virtual memory between the CPU and GPU. It enables efficient
> > > > data
> > > > exchange and
> > > > + * processing for GPU-accelerated applications by allowing
> > > > memory
> > > > sharing and
> > > > + * synchronization between the CPU's and GPU's virtual address
> > > > spaces.
> > > > + *
> > > > + * Key GPU SVM Components:
> > > > + * - Notifiers: Notifiers: Used for tracking memory intervals
> > > > and
> > > > notifying the
> > > > + *		GPU of changes, notifiers are sized based on a
> > > > GPU
> > > > SVM
> > > > + *		initialization parameter, with a recommendation
> > > > of
> > > > 512M or
> > > > + *		larger. They maintain a Red-BlacK tree and a
> > > > list of
> > > > ranges that
> > > > + *		fall within the notifier interval. Notifiers are
> > > > tracked within
> > > > + *		a GPU SVM Red-BlacK tree and list and are
> > > > dynamically inserted
> > > > + *		or removed as ranges within the interval are
> > > > created
> > > > or
> > > > + *		destroyed.
> > > 
> > > What is the benefit of this extra layer compared to direct
> > > insertion of
> > > ranges using mmu_interval_notifier_insert?
> > > 
> > > IIRC the argument made previously about having wide notifiers was
> > > that
> > > the rb tree lookups inside the core were costly and if there were
> > > only
> > > a few, then the rb tree lookups within a notifier range could be
> > > replaced with the page-table radix-tree-like lookup, so each lookup
> > > complexity would be O(log(n_notifiers) + page_table_depth).
> > > 
> > > But now we have first an rb-tree lookup in the core and then an rb-
> > > tree
> > > lookup within each notifier yeilding O(log(n_ranges))
> > > 
> > > I can see a small benefit in that inserting directly into the core
> > > rb-
> > > tree will block pending ongoing invalidations, but at a cost of an
> > > extra multiplexing layer.
> > > 
> > 
> > So when the notifier is triggered the search is a smaller range. In a
> > perfect world eventually I'd like to drop the SVM range completely.
> > There is a lot of changes required in Xe to make that possible and
> > not
> > entirely convinced it is possible and the ROI is worth it (additional
> > complexity vs. perf benefit). For now, this was a relatively simple
> > way
> > to get SVM working (mirrors boths AMD's and Nvidia's implement wrt to
> > having a range concept) but also is flexible in the sense the
> > notifier
> > size can be easily tweaked via a modparam [1] following Jason's
> > suggestion of larger notifiers.
> > 
> > [1]
> > https://patchwork.freedesktop.org/patch/611007/?series=137870&rev=1
> 
> What I meant was the core is already implementing the "one notifier for
> the whole range", since your notifier duplicates the
> mmu_interval_notifier functionality.
> 
> The mmu_interval_notifier first does an rbtree search to get to the
> notifier, and then drm_gpusvm does an rbtree search to get to the
> range.

Yes.

> 
> If the svm notifier layer is skipped, mmu_interval_notifier has to
> perform a wider rbtree search to get to the range. The point is, the
> complexity is the same for both approaches so there is no point in
> adding a svm notifier layer for that reason. The width of the notifier
> just adjust the relative size of the two rbtree searches, so from that
> point of view the drm_gpusvm does not offer any benefit from inserting
> the ranges into the mmu_interval_notifier directly (except that the
> mmu_interval_notifier is slightly more heavyweight).
> 

I think a large part of it was to avoid inserting / removing many
notifiers, as that was expensive. I agree the search is not fundamentally
faster the way I have this coded; it just avoids heavy insertion /
removal of notifiers.
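
To make that concrete, here is a minimal sketch (illustrative only, not
part of the patch) of how a fault address maps to a notifier interval
versus a range interval, using the same ALIGN math as
drm_gpusvm_notifier_alloc() and drm_gpusvm_range_alloc() in the patch;
SZ_512M / SZ_2M are just example sizes:

	static void sketch_interval_math(u64 fault_addr)
	{
		const u64 notifier_size = SZ_512M;	/* example notifier size */
		const u64 chunk_size = SZ_2M;		/* example chunk size */
		u64 notifier_start, notifier_end, range_start, range_end;

		notifier_start = ALIGN_DOWN(fault_addr, notifier_size);
		notifier_end = ALIGN(fault_addr + 1, notifier_size);
		range_start = ALIGN_DOWN(fault_addr, chunk_size);
		range_end = ALIGN(fault_addr + 1, chunk_size);

		pr_debug("notifier [%llx, %llx) range [%llx, %llx)\n",
			 notifier_start, notifier_end, range_start, range_end);

		/*
		 * Every fault within [notifier_start, notifier_end) reuses the
		 * same mmu_interval_notifier; only the per-chunk range in
		 * [range_start, range_end) is created or destroyed per fault,
		 * so mmu_interval_notifier_insert() / remove() traffic stays
		 * low.
		 */
	}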

> As I understand it, Jasons comments were based on the assumption that
> the drm_gpusvm search would be radix tree based, and hence with less
> complexity than the rbtree search, and therefore providing a clear
> benefit the larger they could be.
> 
> I.e. just calling something similar to xe_vm_invalidate_xxx over the
> whole range, which will just skip subranges that are not populated.
> 

As stated, I think eventually removing the SVM range is a good long-term
goal.

I almost coded that in this initial series but ran into a number of
issues which make it complex. To get something working in the simplest
way possible, enabling further test development, constructive upstream
discussions (which appear to be happening), UMD / application
development, and other upper-layer KMD development, I stuck with this
approach.

I think for any solution which requires an SVM range (fwiw both AMD and
Nvidia have a similar concept), attaching the ranges to a larger
notifier makes sense and is better than one notifier per range.
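
For reference, the per-notifier walk this buys on invalidation looks like
the "Invalidation driver vfunc" example in the kernel doc, condensed here;
driver_invalidate_device_tlb() and driver_garbage_collector_add() are
driver-side placeholders:

	static void driver_invalidation(struct drm_gpusvm *gpusvm,
					struct drm_gpusvm_notifier *notifier,
					const struct mmu_notifier_range *mmu_range)
	{
		struct drm_gpusvm_ctx ctx = { .in_notifier = true, };
		struct drm_gpusvm_range *range = NULL;

		driver_invalidate_device_tlb(gpusvm, mmu_range->start,
					     mmu_range->end);

		/*
		 * Only the ranges attached to this notifier are walked, so the
		 * rb-tree search on invalidation is bounded by the notifier
		 * interval rather than the whole address space.
		 */
		drm_gpusvm_for_each_range(range, notifier, mmu_range->start,
					  mmu_range->end) {
			drm_gpusvm_range_unmap_pages(gpusvm, range, &ctx);

			if (mmu_range->event != MMU_NOTIFY_UNMAP)
				continue;

			drm_gpusvm_range_set_unmapped(range, mmu_range);
			driver_garbage_collector_add(gpusvm, range);
		}
	}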

Issues with removing an SVM range:

- Xe bind code stores invalidation / present state in the VMA; this
  would need to be moved to the radix tree. I have a Jira open for that
  work, which I believe other developers are going to own.
- Where would the dma mapping / device pages be stored?
	- In the radix tree? What if ATS is enabled? We don't have a
	  driver-owned radix tree. How do we reasonably connect a
	  driver-owned radix tree to a common GPUSVM layer?
	- In the notifier? What if the notifier is sparsely populated?
	  We would be wasting huge amounts of memory. What if the
	  notifier is configured to span the entire virtual address
	  space?
- How does the garbage collector work? We can't allocate memory in the
  notifier, so we don't have anything to add to the garbage collector.
  We also can't directly modify page tables from the notifier, given
  the required locks can't be taken in the reclaim path.
- How do we deal with fault storms (e.g. tons of faults hitting the same
  SVM range in a row)? Without an SVM range there is no easy way to know
  whether the mapping is already valid so the GPU page fault handler can
  be short-circuited.
- Do we have a notifier seqno for every PTE?

I feel like I'm missing a few, and likely more issues would arise when
implementing this too.

To be clear, I'm not saying we shouldn't try to do this; all of the above
issues are likely workable, but doing all this upfront is akin to running
before we can walk. I'd rather solve the fundamental locking issues first,
have robust testing in place and passing, and UMDs / apps running before
trying to rework this one. Performance numbers for this would also be
helpful.

Matt

> /Thomas
> 
> > 
> > > > + * - Ranges: Represent memory ranges mapped in a DRM device and
> > > > managed
> > > > + *	     by GPU SVM. They are sized based on an array of
> > > > chunk
> > > > sizes, which
> > > > + *	     is a GPU SVM initialization parameter, and the CPU
> > > > address space.
> > > > + *	     Upon GPU fault, the largest aligned chunk that fits
> > > > within the
> > > > + *	     faulting CPU address space is chosen for the range
> > > > size. Ranges are
> > > > + *	     expected to be dynamically allocated on GPU fault
> > > > and
> > > > removed on an
> > > > + *	     MMU notifier UNMAP event. As mentioned above,
> > > > ranges
> > > > are tracked in
> > > > + *	     a notifier's Red-Black tree.
> > > 
> > > How do ranges and chunks map to
> > >  
> > > a) Prefaulting granularity
> > > b) Migration granularity?
> > > 
> > > > + * - Operations: Define the interface for driver-specific SVM
> > > > operations such as
> > > > + *		 allocation, page collection, migration,
> > > > invalidations, and VRAM
> > > > + *		 release.
> > > > + *
> > > > + * This layer provides interfaces for allocating, mapping,
> > > > migrating, and
> > > > + * releasing memory ranges between the CPU and GPU. It handles
> > > > all
> > > > core memory
> > > > + * management interactions (DMA mapping, HMM, and migration) and
> > > > provides
> > > > + * driver-specific virtual functions (vfuncs). This
> > > > infrastructure
> > > > is sufficient
> > > > + * to build the expected driver components for an SVM
> > > > implementation
> > > > as detailed
> > > > + * below.
> > > > + *
> > > > + * Expected Driver Components:
> > > > + * - GPU page fault handler: Used to create ranges and notifiers
> > > > based on the
> > > > + *			     fault address, optionally migrate
> > > > the
> > > > range to
> > > > + *			     VRAM, and create GPU bindings.
> > > > + * - Garbage collector: Used to destroy GPU bindings for ranges.
> > > > Ranges are
> > > > + *			expected to be added to the garbage
> > > > collector upon
> > > > + *			MMU_NOTIFY_UNMAP event.
> > > > + */
> > > > +
> > > > +/**
> > > > + * DOC: Locking
> > > > + *
> > > > + * GPU SVM handles locking for core MM interactions, i.e., it
> > > > locks/unlocks the
> > > > + * mmap lock as needed. Alternatively, if the driver prefers to
> > > > handle the mmap
> > > > + * lock itself, a 'locked' argument is provided to the functions
> > > > that require
> > > > + * the mmap lock. This option may be useful for drivers that
> > > > need to
> > > > call into
> > > > + * GPU SVM while also holding a dma-resv lock, thus preventing
> > > > locking
> > > > + * inversions between the mmap and dma-resv locks.
> > > > + *
> > > > + * GPU SVM introduces a global notifier lock, which safeguards
> > > > the
> > > > notifier's
> > > > + * range RB tree and list, as well as the range's DMA mappings
> > > > and
> > > > sequence
> > > > + * number. GPU SVM manages all necessary locking and unlocking
> > > > operations,
> > > > + * except for the recheck of the range's sequence number
> > > > + * (mmu_interval_read_retry) when the driver is committing GPU
> > > > bindings. This
> > > > + * lock corresponds to the 'driver->update' lock mentioned in
> > > > the
> > > > HMM
> > > > + * documentation (TODO: Link). Future revisions may transition
> > > > from
> > > > a GPU SVM
> > > > + * global lock to a per-notifier lock if finer-grained locking
> > > > is
> > > > deemed
> > > > + * necessary.
> > > > + *
> > > > + * In addition to the locking mentioned above, the driver should
> > > > implement a
> > > > + * lock to safeguard core GPU SVM function calls that modify
> > > > state,
> > > > such as
> > > > + * drm_gpusvm_range_find_or_insert and drm_gpusvm_range_remove.
> > > > Alternatively,
> > > > + * these core functions can be called within a single kernel
> > > > thread,
> > > > for
> > > > + * instance, using an ordered work queue. This lock is denoted
> > > > as
> > > > + * 'driver_svm_lock' in code examples.
> > > > + */
> > > > +
> > > > +/**
> > > > + * DOC: Migrataion
> > > > + *
> > > > + * The migration support is quite simple, allowing migration
> > > > between
> > > > SRAM and
> > > > + * VRAM at the range granularity. For example, GPU SVM currently
> > > > does not
> > > > + * support mixing SRAM and VRAM pages within a range. This means
> > > > that upon GPU
> > > > + * fault, the entire range can be migrated to VRAM, and upon CPU
> > > > fault, the
> > > > + * entire range is migrated to SRAM.
> > > > + *
> > > > + * The reasoning for only supporting range granularity is as
> > > > follows: it
> > > > + * simplifies the implementation, and range sizes are driver-
> > > > defined
> > > > and should
> > > > + * be relatively small.
> > > > + */
> > > > +
> > > > +/**
> > > > + * DOC: Partial Unmapping of Ranges
> > > > + *
> > > > + * Partial unmapping of ranges (e.g., 1M out of 2M is unmapped
> > > > by
> > > > CPU resulting
> > > > + * in MMU_NOTIFY_UNMAP event) presents several challenges, with
> > > > the
> > > > main one
> > > > + * being that a subset of the range still has CPU and GPU
> > > > mappings.
> > > > If the
> > > > + * backing store for the range is in VRAM, a subset of the
> > > > backing
> > > > store has
> > > > + * references. One option would be to split the range and VRAM
> > > > backing store,
> > > > + * but the implementation for this would be quite complicated.
> > > > Given
> > > > that
> > > > + * partial unmappings are rare and driver-defined range sizes
> > > > are
> > > > relatively
> > > > + * small, GPU SVM does not support splitting of ranges.
> > > > + *
> > > > + * With no support for range splitting, upon partial unmapping
> > > > of a
> > > > range, the
> > > > + * driver is expected to invalidate and destroy the entire
> > > > range. If
> > > > the range
> > > > + * has VRAM as its backing, the driver is also expected to
> > > > migrate
> > > > any remaining
> > > > + * pages back to SRAM.
> > > 
> > > So what happens if we get a one-page invalidation, say protection
> > > change event, or NUMA accounting event, in the middle of a range?
> > > Can
> > > we unmap just that single gpu pte covering that range, that is, how
> > > do
> > > the ranges map to invalidation granularity? Does this differ
> > > between
> > > igfx an dgfx?
> > 
> > Well the idea of chunks is ranges should be 1 GPU page (the chunk
> > array
> > in Xe is 4k, 64k, and 2M). The design is flexible enough that doesn't
> > have to true but optimized for the thinking each range is most likely
> > 1
> > GPU page. If this isn't true, then all GPU pages in the range are
> > invalidated which isn't ideal but keeps it simple which IMO far out
> > weighs the potential benefits. In theory a driver could implement
> > spliting / partial invalidaions too with a couple of updates to
> > GPUSVM
> > but would likely largely be a driver implementation rather than
> > GPUSVM.
> > 
> > No difference between igfx an dgfx.
> > 
> > You bring up a good point about protection changes, I likely haven't
> > fully gotten that part of implementation correct either. I can add
> > this
> > to my TODO list and also update my IGTs to do things like this.
> > 
> > Matt
> > 
> > > 
> > > Thanks,
> > > Thomas
> > > 
> > > 
> > > 
> > > 
> > > > + */
> > > > +
> > > > +/**
> > > > + * DOC: Examples
> > > > + *
> > > > + * This section provides two examples of how to build the
> > > > expected
> > > > driver
> > > > + * components: the GPU page fault handler and the garbage
> > > > collector.
> > > > A third
> > > > + * example demonstrates a sample invalidation driver vfunc.
> > > > + *
> > > > + * The generic code provided does not include logic for complex
> > > > migration
> > > > + * policies, optimized invalidations, or other potentially
> > > > required
> > > > driver
> > > > + * locking (e.g., DMA-resv locks).
> > > > + *
> > > > + * 1) GPU page fault handler
> > > > + *
> > > > + *	int driver_bind_range(struct drm_gpusvm *gpusvm, struct
> > > > drm_gpusvm_range *range)
> > > > + *	{
> > > > + *		int err = 0;
> > > > + *
> > > > + *		driver_alloc_and_setup_memory_for_bind(gpusvm,
> > > > range);
> > > > + *
> > > > + *		drm_gpusvm_notifier_lock(gpusvm);
> > > > + *		if (drm_gpusvm_range_pages_valid(range))
> > > > + *			driver_commit_bind(gpusvm, range);
> > > > + *		else
> > > > + *			err = -EAGAIN;
> > > > + *		drm_gpusvm_notifier_unlock(gpusvm);
> > > > + *
> > > > + *		return err;
> > > > + *	}
> > > > + *
> > > > + *	int driver_gpu_fault(struct drm_gpusvm *gpusvm, u64
> > > > fault_addr,
> > > > + *			     u64 gpuva_start, u64 gpuva_end)
> > > > + *	{
> > > > + *		struct drm_gpusvm_ctx ctx = {};
> > > > + *		int err;
> > > > + *
> > > > + *		driver_svm_lock();
> > > > + *	retry:
> > > > + *		// Always process UNMAPs first so view of GPU
> > > > SVM
> > > > ranges is current
> > > > + *		driver_garbage_collector(gpusvm);
> > > > + *
> > > > + *		range = drm_gpusvm_range_find_or_insert(gpusvm,
> > > > fault_addr,
> > > > +
> > > > *							gpuva_start,
> > > > gpuva_end,
> > > > + *						        &ctx);
> > > > + *		if (IS_ERR(range)) {
> > > > + *			err = PTR_ERR(range);
> > > > + *			goto unlock;
> > > > + *		}
> > > > + *
> > > > + *		if (driver_migration_policy(range)) {
> > > > + *			bo = driver_alloc_bo();
> > > > + *			err = drm_gpusvm_migrate_to_vram(gpusvm,
> > > > range, bo, &ctx);
> > > > + *			if (err)	// CPU mappings may have
> > > > changed
> > > > + *				goto retry;
> > > > + *		}
> > > > + *
> > > > + *		err = drm_gpusvm_range_get_pages(gpusvm, range,
> > > > &ctx);
> > > > + *		if (err == -EFAULT || err == -EPERM)	// CPU
> > > > mappings changed
> > > > + *			goto retry;
> > > > + *		else if (err)
> > > > + *			goto unlock;
> > > > + *
> > > > + *		err = driver_bind_range(gpusvm, range);
> > > > + *		if (err == -EAGAIN)	// CPU mappings changed
> > > > + *			goto retry
> > > > + *
> > > > + *	unlock:
> > > > + *		driver_svm_unlock();
> > > > + *		return err;
> > > > + *	}
> > > > + *
> > > > + * 2) Garbage Collector.
> > > > + *
> > > > + *	void __driver_garbage_collector(struct drm_gpusvm
> > > > *gpusvm,
> > > > + *					struct drm_gpusvm_range
> > > > *range)
> > > > + *	{
> > > > + *		struct drm_gpusvm_ctx ctx = {};
> > > > + *
> > > > + *		assert_driver_svm_locked(gpusvm);
> > > > + *
> > > > + *		// Partial unmap, migrate any remaining VRAM
> > > > pages
> > > > back to SRAM
> > > > + *		if (range->flags.partial_unmap)
> > > > + *			drm_gpusvm_migrate_to_sram(gpusvm,
> > > > range,
> > > > &ctx);
> > > > + *
> > > > + *		driver_unbind_range(range);
> > > > + *		drm_gpusvm_range_remove(gpusvm, range);
> > > > + *	}
> > > > + *
> > > > + *	void driver_garbage_collector(struct drm_gpusvm *gpusvm)
> > > > + *	{
> > > > + *		assert_driver_svm_locked(gpusvm);
> > > > + *
> > > > + *		for_each_range_in_garbage_collector(gpusvm,
> > > > range)
> > > > + *			__driver_garbage_collector(gpusvm,
> > > > range);
> > > > + *	}
> > > > + *
> > > > + * 3) Invalidation driver vfunc.
> > > > + *
> > > > + *	void driver_invalidation(struct drm_gpusvm *gpusvm,
> > > > + *				 struct drm_gpusvm_notifier
> > > > *notifier,
> > > > + *				 const struct mmu_notifier_range
> > > > *mmu_range)
> > > > + *	{
> > > > + *		struct drm_gpusvm_ctx ctx = { .in_notifier =
> > > > true,
> > > > };
> > > > + *		struct drm_gpusvm_range *range = NULL;
> > > > + *
> > > > + *		driver_invalidate_device_tlb(gpusvm, mmu_range-
> > > > > start, mmu_range->end);
> > > > + *
> > > > + *		drm_gpusvm_for_each_range(range, notifier,
> > > > mmu_range->start,
> > > > + *					  mmu_range->end) {
> > > > + *			drm_gpusvm_range_unmap_pages(gpusvm,
> > > > range,
> > > > &ctx);
> > > > + *
> > > > + *			if (mmu_range->event !=
> > > > MMU_NOTIFY_UNMAP)
> > > > + *				continue;
> > > > + *
> > > > + *			drm_gpusvm_range_set_unmapped(range,
> > > > mmu_range);
> > > > + *			driver_garbage_collector_add(gpusvm,
> > > > range);
> > > > + *		}
> > > > + *	}
> > > > + */
> > > > +
> > > > +#define DRM_GPUSVM_RANGE_START(_range)	((_range)->va.start)
> > > > +#define DRM_GPUSVM_RANGE_END(_range)	((_range)->va.end - 1)
> > > > +INTERVAL_TREE_DEFINE(struct drm_gpusvm_range, rb.node, u64,
> > > > rb.__subtree_last,
> > > > +		     DRM_GPUSVM_RANGE_START,
> > > > DRM_GPUSVM_RANGE_END,
> > > > +		     static __maybe_unused, range);
> > > > +
> > > > +#define DRM_GPUSVM_NOTIFIER_START(_notifier)	((_notifier)-
> > > > > interval.start)
> > > > +#define DRM_GPUSVM_NOTIFIER_END(_notifier)	((_notifier)-
> > > > > interval.end - 1)
> > > > +INTERVAL_TREE_DEFINE(struct drm_gpusvm_notifier, rb.node, u64,
> > > > +		     rb.__subtree_last,
> > > > DRM_GPUSVM_NOTIFIER_START,
> > > > +		     DRM_GPUSVM_NOTIFIER_END, static
> > > > __maybe_unused,
> > > > notifier);
> > > > +
> > > > +/**
> > > > + * npages_in_range() - Calculate the number of pages in a given
> > > > range
> > > > + * @start__: The start address of the range
> > > > + * @end__: The end address of the range
> > > > + *
> > > > + * This macro calculates the number of pages in a given memory
> > > > range,
> > > > + * specified by the start and end addresses. It divides the
> > > > difference
> > > > + * between the end and start addresses by the page size
> > > > (PAGE_SIZE)
> > > > to
> > > > + * determine the number of pages in the range.
> > > > + *
> > > > + * Return: The number of pages in the specified range.
> > > > + */
> > > > +#define npages_in_range(start__, end__)	\
> > > > +	(((end__) - (start__)) >> PAGE_SHIFT)
> > > > +
> > > > +/**
> > > > + * struct drm_gpusvm_zdd - GPU SVM zone device data
> > > > + *
> > > > + * @refcount: Reference count for the zdd
> > > > + * @destroy_work: Work structure for asynchronous zdd
> > > > destruction
> > > > + * @range: Pointer to the GPU SVM range
> > > > + * @vram_allocation: Driver-private pointer to the VRAM
> > > > allocation
> > > > + *
> > > > + * This structure serves as a generic wrapper installed in
> > > > + * page->zone_device_data. It provides infrastructure for
> > > > looking up
> > > > a range
> > > > + * upon CPU page fault and asynchronously releasing VRAM once
> > > > the
> > > > CPU has no
> > > > + * page references. Asynchronous release is useful because CPU
> > > > page
> > > > references
> > > > + * can be dropped in IRQ contexts, while releasing VRAM likely
> > > > requires sleeping
> > > > + * locks.
> > > > + */
> > > > +struct drm_gpusvm_zdd {
> > > > +	struct kref refcount;
> > > > +	struct work_struct destroy_work;
> > > > +	struct drm_gpusvm_range *range;
> > > > +	void *vram_allocation;
> > > > +};
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_zdd_destroy_work_func - Work function for
> > > > destroying a
> > > > zdd
> > > > + * @w: Pointer to the work_struct
> > > > + *
> > > > + * This function releases VRAM, puts GPU SVM range, and frees
> > > > zdd.
> > > > + */
> > > > +static void drm_gpusvm_zdd_destroy_work_func(struct work_struct
> > > > *w)
> > > > +{
> > > > +	struct drm_gpusvm_zdd *zdd =
> > > > +		container_of(w, struct drm_gpusvm_zdd,
> > > > destroy_work);
> > > > +	struct drm_gpusvm_range *range = zdd->range;
> > > > +	struct drm_gpusvm *gpusvm = range->gpusvm;
> > > > +
> > > > +	if (gpusvm->ops->vram_release && zdd->vram_allocation)
> > > > +		gpusvm->ops->vram_release(zdd->vram_allocation);
> > > > +	drm_gpusvm_range_put(range);
> > > > +	kfree(zdd);
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_zdd_alloc - Allocate a zdd structure.
> > > > + * @range: Pointer to the GPU SVM range.
> > > > + *
> > > > + * This function allocates and initializes a new zdd structure.
> > > > It
> > > > sets up the
> > > > + * reference count, initializes the destroy work, and links the
> > > > provided GPU SVM
> > > > + * range.
> > > > + *
> > > > + * Returns:
> > > > + * Pointer to the allocated zdd on success, ERR_PTR() on
> > > > failure.
> > > > + */
> > > > +static struct drm_gpusvm_zdd *
> > > > +drm_gpusvm_zdd_alloc(struct drm_gpusvm_range *range)
> > > > +{
> > > > +	struct drm_gpusvm_zdd *zdd;
> > > > +
> > > > +	zdd = kmalloc(sizeof(*zdd), GFP_KERNEL);
> > > > +	if (!zdd)
> > > > +		return NULL;
> > > > +
> > > > +	kref_init(&zdd->refcount);
> > > > +	INIT_WORK(&zdd->destroy_work,
> > > > drm_gpusvm_zdd_destroy_work_func);
> > > > +	zdd->range = drm_gpusvm_range_get(range);
> > > > +	zdd->vram_allocation = NULL;
> > > > +
> > > > +	return zdd;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_zdd_get - Get a reference to a zdd structure.
> > > > + * @zdd: Pointer to the zdd structure.
> > > > + *
> > > > + * This function increments the reference count of the provided
> > > > zdd
> > > > structure.
> > > > + *
> > > > + * Returns: Pointer to the zdd structure.
> > > > + */
> > > > +static struct drm_gpusvm_zdd *drm_gpusvm_zdd_get(struct
> > > > drm_gpusvm_zdd *zdd)
> > > > +{
> > > > +	kref_get(&zdd->refcount);
> > > > +	return zdd;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_zdd_destroy - Destroy a zdd structure.
> > > > + * @ref: Pointer to the reference count structure.
> > > > + *
> > > > + * This function queues the destroy_work of the zdd for
> > > > asynchronous
> > > > destruction.
> > > > + */
> > > > +static void drm_gpusvm_zdd_destroy(struct kref *ref)
> > > > +{
> > > > +	struct drm_gpusvm_zdd *zdd =
> > > > +		container_of(ref, struct drm_gpusvm_zdd,
> > > > refcount);
> > > > +	struct drm_gpusvm *gpusvm = zdd->range->gpusvm;
> > > > +
> > > > +	queue_work(gpusvm->zdd_wq, &zdd->destroy_work);
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_zdd_put - Put a zdd reference.
> > > > + * @zdd: Pointer to the zdd structure.
> > > > + *
> > > > + * This function decrements the reference count of the provided
> > > > zdd
> > > > structure
> > > > + * and schedules its destruction if the count drops to zero.
> > > > + */
> > > > +static void drm_gpusvm_zdd_put(struct drm_gpusvm_zdd *zdd)
> > > > +{
> > > > +	kref_put(&zdd->refcount, drm_gpusvm_zdd_destroy);
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_find - Find GPU SVM range from GPU SVM
> > > > notifier
> > > > + * @notifier: Pointer to the GPU SVM notifier structure.
> > > > + * @start: Start address of the range
> > > > + * @end: End address of the range
> > > > + *
> > > > + * Return: A pointer to the drm_gpusvm_range if found or NULL
> > > > + */
> > > > +struct drm_gpusvm_range *
> > > > +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64
> > > > start, u64 end)
> > > > +{
> > > > +	return range_iter_first(&notifier->root, start, end -
> > > > 1);
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_for_each_range_safe - Safely iterate over GPU SVM
> > > > ranges in a notifier
> > > > + * @range__: Iterator variable for the ranges
> > > > + * @next__: Iterator variable for the ranges temporay storage
> > > > + * @notifier__: Pointer to the GPU SVM notifier
> > > > + * @start__: Start address of the range
> > > > + * @end__: End address of the range
> > > > + *
> > > > + * This macro is used to iterate over GPU SVM ranges in a
> > > > notifier
> > > > while
> > > > + * removing ranges from it.
> > > > + */
> > > > +#define drm_gpusvm_for_each_range_safe(range__, next__,
> > > > notifier__,
> > > > start__, end__)	\
> > > > +	for ((range__) = drm_gpusvm_range_find((notifier__),
> > > > (start__), (end__)),	\
> > > > +	     (next__) =
> > > > __drm_gpusvm_range_next(range__);				\
> > > > +	     (range__) && (range__->va.start <
> > > > (end__));				\
> > > > +	     (range__) = (next__), (next__) =
> > > > __drm_gpusvm_range_next(range__))
> > > > +
> > > > +/**
> > > > + * __drm_gpusvm_notifier_next - get the next drm_gpusvm_notifier
> > > > in
> > > > the list
> > > > + * @notifier: a pointer to the current drm_gpusvm_notifier
> > > > + *
> > > > + * Return: A pointer to the next drm_gpusvm_notifier if
> > > > available,
> > > > or NULL if
> > > > + *         the current notifier is the last one or if the input
> > > > notifier is
> > > > + *         NULL.
> > > > + */
> > > > +static struct drm_gpusvm_notifier *
> > > > +__drm_gpusvm_notifier_next(struct drm_gpusvm_notifier *notifier)
> > > > +{
> > > > +	if (notifier && !list_is_last(&notifier->rb.entry,
> > > > +				      &notifier->gpusvm-
> > > > > notifier_list))
> > > > +		return list_next_entry(notifier, rb.entry);
> > > > +
> > > > +	return NULL;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_for_each_notifier - Iterate over GPU SVM notifiers
> > > > in
> > > > a gpusvm
> > > > + * @notifier__: Iterator variable for the notifiers
> > > > + * @notifier__: Pointer to the GPU SVM notifier
> > > > + * @start__: Start address of the notifier
> > > > + * @end__: End address of the notifier
> > > > + *
> > > > + * This macro is used to iterate over GPU SVM notifiers in a
> > > > gpusvm.
> > > > + */
> > > > +#define drm_gpusvm_for_each_notifier(notifier__, gpusvm__,
> > > > start__,
> > > > end__)		\
> > > > +	for ((notifier__) = notifier_iter_first(&(gpusvm__)-
> > > > >root,
> > > > (start__), (end__) - 1);	\
> > > > +	     (notifier__) && (notifier__->interval.start <
> > > > (end__));			\
> > > > +	     (notifier__) =
> > > > __drm_gpusvm_notifier_next(notifier__))
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_for_each_notifier_safe - Safely iterate over GPU
> > > > SVM
> > > > notifiers in a gpusvm
> > > > + * @notifier__: Iterator variable for the notifiers
> > > > + * @next__: Iterator variable for the notifiers temporay storage
> > > > + * @notifier__: Pointer to the GPU SVM notifier
> > > > + * @start__: Start address of the notifier
> > > > + * @end__: End address of the notifier
> > > > + *
> > > > + * This macro is used to iterate over GPU SVM notifiers in a
> > > > gpusvm
> > > > while
> > > > + * removing notifiers from it.
> > > > + */
> > > > +#define drm_gpusvm_for_each_notifier_safe(notifier__, next__,
> > > > gpusvm__, start__, end__)	\
> > > > +	for ((notifier__) = notifier_iter_first(&(gpusvm__)-
> > > > >root,
> > > > (start__), (end__) - 1),	\
> > > > +	     (next__) =
> > > > __drm_gpusvm_notifier_next(notifier__);				\
> > > > +	     (notifier__) && (notifier__->interval.start <
> > > > (end__));			\
> > > > +	     (notifier__) = (next__), (next__) =
> > > > __drm_gpusvm_notifier_next(notifier__))
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_notifier_invalidate - Invalidate a GPU SVM
> > > > notifier.
> > > > + * @mni: Pointer to the mmu_interval_notifier structure.
> > > > + * @mmu_range: Pointer to the mmu_notifier_range structure.
> > > > + * @cur_seq: Current sequence number.
> > > > + *
> > > > + * This function serves as a generic MMU notifier for GPU SVM.
> > > > It
> > > > sets the MMU
> > > > + * notifier sequence number and calls the driver invalidate
> > > > vfunc
> > > > under
> > > > + * gpusvm->notifier_lock.
> > > > + *
> > > > + * Returns:
> > > > + * true if the operation succeeds, false otherwise.
> > > > + */
> > > > +static bool
> > > > +drm_gpusvm_notifier_invalidate(struct mmu_interval_notifier
> > > > *mni,
> > > > +			       const struct mmu_notifier_range
> > > > *mmu_range,
> > > > +			       unsigned long cur_seq)
> > > > +{
> > > > +	struct drm_gpusvm_notifier *notifier =
> > > > +		container_of(mni, typeof(*notifier), notifier);
> > > > +	struct drm_gpusvm *gpusvm = notifier->gpusvm;
> > > > +
> > > > +	if (!mmu_notifier_range_blockable(mmu_range))
> > > > +		return false;
> > > > +
> > > > +	down_write(&gpusvm->notifier_lock);
> > > > +	mmu_interval_set_seq(mni, cur_seq);
> > > > +	gpusvm->ops->invalidate(gpusvm, notifier, mmu_range);
> > > > +	up_write(&gpusvm->notifier_lock);
> > > > +
> > > > +	return true;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_notifier_ops - MMU interval notifier operations
> > > > for
> > > > GPU SVM
> > > > + */
> > > > +static const struct mmu_interval_notifier_ops
> > > > drm_gpusvm_notifier_ops = {
> > > > +	.invalidate = drm_gpusvm_notifier_invalidate,
> > > > +};
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_init - Initialize the GPU SVM.
> > > > + * @gpusvm: Pointer to the GPU SVM structure.
> > > > + * @name: Name of the GPU SVM.
> > > > + * @drm: Pointer to the DRM device structure.
> > > > + * @mm: Pointer to the mm_struct for the address space.
> > > > + * @device_private_page_owner: Device private pages owner.
> > > > + * @mm_start: Start address of GPU SVM.
> > > > + * @mm_range: Range of the GPU SVM.
> > > > + * @notifier_size: Size of individual notifiers.
> > > > + * @ops: Pointer to the operations structure for GPU SVM.
> > > > + * @chunk_sizes: Pointer to the array of chunk sizes used in
> > > > range
> > > > allocation.
> > > > + *               Entries should be powers of 2 in descending
> > > > order
> > > > with last
> > > > + *               entry being SZ_4K.
> > > > + * @num_chunks: Number of chunks.
> > > > + *
> > > > + * This function initializes the GPU SVM.
> > > > + *
> > > > + * Returns:
> > > > + * 0 on success, a negative error code on failure.
> > > > + */
> > > > +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> > > > +		    const char *name, struct drm_device *drm,
> > > > +		    struct mm_struct *mm, void
> > > > *device_private_page_owner,
> > > > +		    u64 mm_start, u64 mm_range, u64
> > > > notifier_size,
> > > > +		    const struct drm_gpusvm_ops *ops,
> > > > +		    const u64 *chunk_sizes, int num_chunks)
> > > > +{
> > > > +	if (!ops->invalidate || !num_chunks)
> > > > +		return -EINVAL;
> > > > +
> > > > +	gpusvm->name = name;
> > > > +	gpusvm->drm = drm;
> > > > +	gpusvm->mm = mm;
> > > > +	gpusvm->device_private_page_owner =
> > > > device_private_page_owner;
> > > > +	gpusvm->mm_start = mm_start;
> > > > +	gpusvm->mm_range = mm_range;
> > > > +	gpusvm->notifier_size = notifier_size;
> > > > +	gpusvm->ops = ops;
> > > > +	gpusvm->chunk_sizes = chunk_sizes;
> > > > +	gpusvm->num_chunks = num_chunks;
> > > > +	gpusvm->zdd_wq = system_wq;
> > > > +
> > > > +	mmgrab(mm);
> > > > +	gpusvm->root = RB_ROOT_CACHED;
> > > > +	INIT_LIST_HEAD(&gpusvm->notifier_list);
> > > > +
> > > > +	init_rwsem(&gpusvm->notifier_lock);
> > > > +
> > > > +	fs_reclaim_acquire(GFP_KERNEL);
> > > > +	might_lock(&gpusvm->notifier_lock);
> > > > +	fs_reclaim_release(GFP_KERNEL);
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_notifier_find - Find GPU SVM notifier
> > > > + * @gpusvm__: Pointer to the GPU SVM structure
> > > > + * @fault_addr__: Fault address
> > > > + *
> > > > + * This macro finds the GPU SVM notifier associated with the
> > > > fault
> > > > address.
> > > > + *
> > > > + * Returns:
> > > > + * Pointer to the GPU SVM notifier on success, NULL otherwise.
> > > > + */
> > > > +#define drm_gpusvm_notifier_find(gpusvm__,
> > > > fault_addr__)	\
> > > > +	notifier_iter_first(&(gpusvm__)->root,
> > > > (fault_addr__),	\
> > > > +			    (fault_addr__ + 1))
> > > > +
> > > > +/**
> > > > + * to_drm_gpusvm_notifier - retrieve the container struct for a
> > > > given rbtree node
> > > > + * @node__: a pointer to the rbtree node embedded within a
> > > > drm_gpusvm_notifier struct
> > > > + *
> > > > + * Return: A pointer to the containing drm_gpusvm_notifier
> > > > structure.
> > > > + */
> > > > +#define
> > > > to_drm_gpusvm_notifier(__node)				\
> > > > +	container_of((__node), struct drm_gpusvm_notifier,
> > > > rb.node)
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_notifier_insert - Insert GPU SVM notifier
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > > + *
> > > > + * This function inserts the GPU SVM notifier into the GPU SVM
> > > > RB
> > > > tree and list.
> > > > + */
> > > > +static void drm_gpusvm_notifier_insert(struct drm_gpusvm
> > > > *gpusvm,
> > > > +				       struct
> > > > drm_gpusvm_notifier
> > > > *notifier)
> > > > +{
> > > > +	struct rb_node *node;
> > > > +	struct list_head *head;
> > > > +
> > > > +	notifier_insert(notifier, &gpusvm->root);
> > > > +
> > > > +	node = rb_prev(&notifier->rb.node);
> > > > +	if (node)
> > > > +		head = &(to_drm_gpusvm_notifier(node))-
> > > > >rb.entry;
> > > > +	else
> > > > +		head = &gpusvm->notifier_list;
> > > > +
> > > > +	list_add(&notifier->rb.entry, head);
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_notifier_remove - Remove GPU SVM notifier
> > > > + * @gpusvm__: Pointer to the GPU SVM tructure
> > > > + * @notifier__: Pointer to the GPU SVM notifier structure
> > > > + *
> > > > + * This macro removes the GPU SVM notifier from the GPU SVM RB
> > > > tree
> > > > and list.
> > > > + */
> > > > +#define drm_gpusvm_notifier_remove(gpusvm__,
> > > > notifier__)	\
> > > > +	notifier_remove((notifier__), &(gpusvm__)-
> > > > >root);	\
> > > > +	list_del(&(notifier__)->rb.entry)
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_fini - Finalize the GPU SVM.
> > > > + * @gpusvm: Pointer to the GPU SVM structure.
> > > > + *
> > > > + * This function finalizes the GPU SVM by cleaning up any
> > > > remaining
> > > > ranges and
> > > > + * notifiers, and dropping a reference to struct MM.
> > > > + */
> > > > +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm)
> > > > +{
> > > > +	struct drm_gpusvm_notifier *notifier, *next;
> > > > +
> > > > +	drm_gpusvm_for_each_notifier_safe(notifier, next,
> > > > gpusvm, 0,
> > > > LONG_MAX) {
> > > > +		struct drm_gpusvm_range *range, *__next;
> > > > +
> > > > +		/*
> > > > +		 * Remove notifier first to avoid racing with
> > > > any
> > > > invalidation
> > > > +		 */
> > > > +		mmu_interval_notifier_remove(&notifier-
> > > > >notifier);
> > > > +		notifier->flags.removed = true;
> > > > +
> > > > +		drm_gpusvm_for_each_range_safe(range, __next,
> > > > notifier, 0,
> > > > +					       LONG_MAX)
> > > > +			drm_gpusvm_range_remove(gpusvm, range);
> > > > +	}
> > > > +
> > > > +	mmdrop(gpusvm->mm);
> > > > +	WARN_ON(!RB_EMPTY_ROOT(&gpusvm->root.rb_root));
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_notifier_alloc - Allocate GPU SVM notifier
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @fault_addr: Fault address
> > > > + *
> > > > + * This function allocates and initializes the GPU SVM notifier
> > > > structure.
> > > > + *
> > > > + * Returns:
> > > > + * Pointer to the allocated GPU SVM notifier on success,
> > > > ERR_PTR()
> > > > on failure.
> > > > + */
> > > > +static struct drm_gpusvm_notifier *
> > > > +drm_gpusvm_notifier_alloc(struct drm_gpusvm *gpusvm, u64
> > > > fault_addr)
> > > > +{
> > > > +	struct drm_gpusvm_notifier *notifier;
> > > > +
> > > > +	if (gpusvm->ops->notifier_alloc)
> > > > +		notifier = gpusvm->ops->notifier_alloc();
> > > > +	else
> > > > +		notifier = kzalloc(sizeof(*notifier),
> > > > GFP_KERNEL);
> > > > +
> > > > +	if (!notifier)
> > > > +		return ERR_PTR(-ENOMEM);
> > > > +
> > > > +	notifier->gpusvm = gpusvm;
> > > > +	notifier->interval.start = ALIGN_DOWN(fault_addr,
> > > > gpusvm-
> > > > > notifier_size);
> > > > +	notifier->interval.end = ALIGN(fault_addr + 1, gpusvm-
> > > > > notifier_size);
> > > > +	INIT_LIST_HEAD(&notifier->rb.entry);
> > > > +	notifier->root = RB_ROOT_CACHED;
> > > > +	INIT_LIST_HEAD(&notifier->range_list);
> > > > +
> > > > +	return notifier;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_notifier_free - Free GPU SVM notifier
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > > + *
> > > > + * This function frees the GPU SVM notifier structure.
> > > > + */
> > > > +static void drm_gpusvm_notifier_free(struct drm_gpusvm *gpusvm,
> > > > +				     struct drm_gpusvm_notifier
> > > > *notifier)
> > > > +{
> > > > +	WARN_ON(!RB_EMPTY_ROOT(&notifier->root.rb_root));
> > > > +
> > > > +	if (gpusvm->ops->notifier_free)
> > > > +		gpusvm->ops->notifier_free(notifier);
> > > > +	else
> > > > +		kfree(notifier);
> > > > +}
> > > > +
> > > > +/**
> > > > + * to_drm_gpusvm_range - retrieve the container struct for a
> > > > given
> > > > rbtree node
> > > > + * @node__: a pointer to the rbtree node embedded within a
> > > > drm_gpusvm_range struct
> > > > + *
> > > > + * Return: A pointer to the containing drm_gpusvm_range
> > > > structure.
> > > > + */
> > > > +#define to_drm_gpusvm_range(node__)	\
> > > > +	container_of((node__), struct drm_gpusvm_range, rb.node)
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_insert - Insert GPU SVM range
> > > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > > + * @range: Pointer to the GPU SVM range structure
> > > > + *
> > > > + * This function inserts the GPU SVM range into the notifier RB
> > > > tree
> > > > and list.
> > > > + */
> > > > +static void drm_gpusvm_range_insert(struct drm_gpusvm_notifier
> > > > *notifier,
> > > > +				    struct drm_gpusvm_range
> > > > *range)
> > > > +{
> > > > +	struct rb_node *node;
> > > > +	struct list_head *head;
> > > > +
> > > > +	drm_gpusvm_notifier_lock(notifier->gpusvm);
> > > > +	range_insert(range, &notifier->root);
> > > > +
> > > > +	node = rb_prev(&range->rb.node);
> > > > +	if (node)
> > > > +		head = &(to_drm_gpusvm_range(node))->rb.entry;
> > > > +	else
> > > > +		head = &notifier->range_list;
> > > > +
> > > > +	list_add(&range->rb.entry, head);
> > > > +	drm_gpusvm_notifier_unlock(notifier->gpusvm);
> > > > +}
> > > > +
> > > > +/**
> > > > + * __drm_gpusvm_range_remove - Remove GPU SVM range
> > > > + * @notifier__: Pointer to the GPU SVM notifier structure
> > > > + * @range__: Pointer to the GPU SVM range structure
> > > > + *
> > > > + * This macro removes the GPU SVM range from the notifier RB
> > > > tree
> > > > and list.
> > > > + */
> > > > +#define __drm_gpusvm_range_remove(notifier__,
> > > > range__)		\
> > > > +	range_remove((range__), &(notifier__)-
> > > > >root);		\
> > > > +	list_del(&(range__)->rb.entry)
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_alloc - Allocate GPU SVM range
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > > + * @fault_addr: Fault address
> > > > + * @chunk_size: Chunk size
> > > > + * @migrate_vram: Flag indicating whether to migrate VRAM
> > > > + *
> > > > + * This function allocates and initializes the GPU SVM range
> > > > structure.
> > > > + *
> > > > + * Returns:
> > > > + * Pointer to the allocated GPU SVM range on success, ERR_PTR()
> > > > on
> > > > failure.
> > > > + */
> > > > +static struct drm_gpusvm_range *
> > > > +drm_gpusvm_range_alloc(struct drm_gpusvm *gpusvm,
> > > > +		       struct drm_gpusvm_notifier *notifier,
> > > > +		       u64 fault_addr, u64 chunk_size, bool
> > > > migrate_vram)
> > > > +{
> > > > +	struct drm_gpusvm_range *range;
> > > > +
> > > > +	if (gpusvm->ops->range_alloc)
> > > > +		range = gpusvm->ops->range_alloc(gpusvm);
> > > > +	else
> > > > +		range = kzalloc(sizeof(*range), GFP_KERNEL);
> > > > +
> > > > +	if (!range)
> > > > +		return ERR_PTR(-ENOMEM);
> > > > +
> > > > +	kref_init(&range->refcount);
> > > > +	range->gpusvm = gpusvm;
> > > > +	range->notifier = notifier;
> > > > +	range->va.start = ALIGN_DOWN(fault_addr, chunk_size);
> > > > +	range->va.end = ALIGN(fault_addr + 1, chunk_size);
> > > > +	INIT_LIST_HEAD(&range->rb.entry);
> > > > +	range->notifier_seq = LONG_MAX;
> > > > +	range->flags.migrate_vram = migrate_vram ? 1 : 0;
> > > > +
> > > > +	return range;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_check_pages - Check pages
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > > + * @start: Start address
> > > > + * @end: End address
> > > > + *
> > > > + * Check if pages between start and end have been faulted in on the
> > > > + * CPU. Used to prevent migration of pages without a CPU backing
> > > > + * store.
> > > > + *
> > > > + * Returns:
> > > > + * True if pages have been faulted into CPU, False otherwise
> > > > + */
> > > > +static bool drm_gpusvm_check_pages(struct drm_gpusvm *gpusvm,
> > > > +				   struct drm_gpusvm_notifier
> > > > *notifier,
> > > > +				   u64 start, u64 end)
> > > > +{
> > > > +	struct hmm_range hmm_range = {
> > > > +		.default_flags = 0,
> > > > +		.notifier = &notifier->notifier,
> > > > +		.start = start,
> > > > +		.end = end,
> > > > +		.dev_private_owner = gpusvm-
> > > > > device_private_page_owner,
> > > > +	};
> > > > +	unsigned long timeout =
> > > > +		jiffies +
> > > > msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > > > +	unsigned long *pfns;
> > > > +	unsigned long npages = npages_in_range(start, end);
> > > > +	int err, i;
> > > > +
> > > > +	mmap_assert_locked(gpusvm->mm);
> > > > +
> > > > +	pfns = kvmalloc_array(npages, sizeof(*pfns),
> > > > GFP_KERNEL);
> > > > +	if (!pfns)
> > > > +		return false;
> > > > +
> > > > +	hmm_range.notifier_seq =
> > > > mmu_interval_read_begin(&notifier-
> > > > > notifier);
> > > > +	hmm_range.hmm_pfns = pfns;
> > > > +
> > > > +	while (true) {
> > > > +		err = hmm_range_fault(&hmm_range);
> > > > +		if (err == -EBUSY) {
> > > > +			if (time_after(jiffies, timeout))
> > > > +				break;
> > > > +
> > > > +			hmm_range.notifier_seq =
> > > > mmu_interval_read_begin(&notifier->notifier);
> > > > +			continue;
> > > > +		}
> > > > +		break;
> > > > +	}
> > > > +	if (err)
> > > > +		goto err_free;
> > > > +
> > > > +	for (i = 0; i < npages; ++i) {
> > > > +		if (!(pfns[i] & HMM_PFN_VALID)) {
> > > > +			err = -EFAULT;
> > > > +			goto err_free;
> > > > +		}
> > > > +	}
> > > > +
> > > > +err_free:
> > > > +	kvfree(pfns);
> > > > +	return err ? false : true;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_chunk_size - Determine chunk size for GPU SVM range
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > > + * @vas: Pointer to the virtual memory area structure
> > > > + * @fault_addr: Fault address
> > > > + * @gpuva_start: Start address of GPUVA which mirrors CPU
> > > > + * @gpuva_end: End address of GPUVA which mirrors CPU
> > > > + * @check_pages: Flag indicating whether to check pages
> > > > + *
> > > > + * This function determines the chunk size for the GPU SVM range
> > > > + * based on the fault address, GPU SVM chunk sizes, existing GPU SVM
> > > > + * ranges, and the virtual memory area boundaries.
> > > > + *
> > > > + * Returns:
> > > > + * Chunk size on success, LONG_MAX on failure.
> > > > + */
> > > > +static u64 drm_gpusvm_range_chunk_size(struct drm_gpusvm
> > > > *gpusvm,
> > > > +				       struct
> > > > drm_gpusvm_notifier
> > > > *notifier,
> > > > +				       struct vm_area_struct
> > > > *vas,
> > > > +				       u64 fault_addr, u64
> > > > gpuva_start,
> > > > +				       u64 gpuva_end, bool
> > > > check_pages)
> > > > +{
> > > > +	u64 start, end;
> > > > +	int i = 0;
> > > > +
> > > > +retry:
> > > > +	for (; i < gpusvm->num_chunks; ++i) {
> > > > +		start = ALIGN_DOWN(fault_addr, gpusvm-
> > > > > chunk_sizes[i]);
> > > > +		end = ALIGN(fault_addr + 1, gpusvm-
> > > > >chunk_sizes[i]);
> > > > +
> > > > +		if (start >= vas->vm_start && end <= vas->vm_end
> > > > &&
> > > > +		    start >= notifier->interval.start &&
> > > > +		    end <= notifier->interval.end &&
> > > > +		    start >= gpuva_start && end <= gpuva_end)
> > > > +			break;
> > > > +	}
> > > > +
> > > > +	if (i == gpusvm->num_chunks)
> > > > +		return LONG_MAX;
> > > > +
> > > > +	/*
> > > > +	 * If the allocation is more than a page, ensure it does not
> > > > +	 * overlap with existing ranges.
> > > > +	 */
> > > > +	if (end - start != SZ_4K) {
> > > > +		struct drm_gpusvm_range *range;
> > > > +
> > > > +		range = drm_gpusvm_range_find(notifier, start,
> > > > end);
> > > > +		if (range) {
> > > > +			++i;
> > > > +			goto retry;
> > > > +		}
> > > > +
> > > > +		/*
> > > > +		 * XXX: Only create range on pages CPU has faulted in.
> > > > +		 * Without this check, or prefault, on BMG
> > > > +		 * 'xe_exec_system_allocator --r process-many-malloc'
> > > > +		 * fails. In the failure case, each process mallocs 16k
> > > > +		 * but the CPU VMA is ~128k which results in 64k SVM
> > > > +		 * ranges. When migrating the SVM ranges, some processes
> > > > +		 * fail in drm_gpusvm_migrate_to_vram with
> > > > +		 * 'migrate.cpages != npages' and then upon
> > > > +		 * drm_gpusvm_range_get_pages device pages from other
> > > > +		 * processes are collected + faulted in, which creates
> > > > +		 * all sorts of problems. Unsure exactly how this is
> > > > +		 * happening; the problem also goes away if
> > > > +		 * 'xe_exec_system_allocator --r process-many-malloc'
> > > > +		 * mallocs at least 64k at a time.
> > > > +		 */
> > > > +		if (check_pages &&
> > > > +		    !drm_gpusvm_check_pages(gpusvm, notifier,
> > > > start,
> > > > end)) {
> > > > +			++i;
> > > > +			goto retry;
> > > > +		}
> > > > +	}
> > > > +
> > > > +	return end - start;
> > > > +}
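
To make the fallback behaviour concrete, assuming a driver registers
descending chunk sizes such as { SZ_2M, SZ_64K, SZ_4K } (my assumption, not
something this patch mandates):

	/* hypothetical driver configuration, illustrative only */
	static const u64 my_chunk_sizes[] = { SZ_2M, SZ_64K, SZ_4K };

	/*
	 * Fault at 0x201000 inside a VMA covering [0x200000, 0x240000).
	 * The 2M candidate [0x200000, 0x400000) exceeds vm_end, so the
	 * loop falls through to the 64K candidate [0x200000, 0x210000),
	 * which fits and is returned as end - start.
	 */
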
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_find_or_insert - Find or insert GPU SVM range
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @fault_addr: Fault address
> > > > + * @gpuva_start: Start address of GPUVA which mirrors CPU
> > > > + * @gpuva_end: End address of GPUVA which mirrors CPU
> > > > + * @ctx: GPU SVM context
> > > > + *
> > > > + * This function finds or inserts a newly allocated GPU SVM range
> > > > + * based on the fault address. The caller must hold a lock to protect
> > > > + * range lookup and insertion.
> > > > + *
> > > > + * Returns:
> > > > + * Pointer to the GPU SVM range on success, ERR_PTR() on failure.
> > > > + */
> > > > +struct drm_gpusvm_range *
> > > > +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64
> > > > fault_addr,
> > > > +				u64 gpuva_start, u64 gpuva_end,
> > > > +				const struct drm_gpusvm_ctx
> > > > *ctx)
> > > > +{
> > > > +	struct drm_gpusvm_notifier *notifier;
> > > > +	struct drm_gpusvm_range *range;
> > > > +	struct mm_struct *mm = gpusvm->mm;
> > > > +	struct vm_area_struct *vas;
> > > > +	bool notifier_alloc = false;
> > > > +	u64 chunk_size;
> > > > +	int err;
> > > > +	bool migrate_vram;
> > > > +
> > > > +	if (fault_addr < gpusvm->mm_start ||
> > > > +	    fault_addr > gpusvm->mm_start + gpusvm->mm_range) {
> > > > +		err = -EINVAL;
> > > > +		goto err_out;
> > > > +	}
> > > > +
> > > > +	if (!ctx->mmap_locked) {
> > > > +		if (!mmget_not_zero(mm)) {
> > > > +			err = -EFAULT;
> > > > +			goto err_out;
> > > > +		}
> > > > +		mmap_write_lock(mm);
> > > > +	}
> > > > +
> > > > +	mmap_assert_write_locked(mm);
> > > > +
> > > > +	notifier = drm_gpusvm_notifier_find(gpusvm, fault_addr);
> > > > +	if (!notifier) {
> > > > +		notifier = drm_gpusvm_notifier_alloc(gpusvm,
> > > > fault_addr);
> > > > +		if (IS_ERR(notifier)) {
> > > > +			err = PTR_ERR(notifier);
> > > > +			goto err_mmunlock;
> > > > +		}
> > > > +		notifier_alloc = true;
> > > > +		err =
> > > > mmu_interval_notifier_insert_locked(&notifier-
> > > > > notifier,
> > > > +							  mm,
> > > > notifier->interval.start,
> > > > +							 
> > > > notifier-
> > > > > interval.end -
> > > > +							 
> > > > notifier-
> > > > > interval.start,
> > > > +							 
> > > > &drm_gpusvm_notifier_ops);
> > > > +		if (err)
> > > > +			goto err_notifier;
> > > > +	}
> > > > +
> > > > +	vas = vma_lookup(mm, fault_addr);
> > > > +	if (!vas) {
> > > > +		err = -ENOENT;
> > > > +		goto err_notifier_remove;
> > > > +	}
> > > > +
> > > > +	if (!ctx->read_only && !(vas->vm_flags & VM_WRITE)) {
> > > > +		err = -EPERM;
> > > > +		goto err_notifier_remove;
> > > > +	}
> > > > +
> > > > +	range = drm_gpusvm_range_find(notifier, fault_addr,
> > > > fault_addr + 1);
> > > > +	if (range)
> > > > +		goto out_mmunlock;
> > > > +	/*
> > > > +	 * XXX: Short-circuiting migration based on current migrate_vma_*
> > > > +	 * limitations. If/when migrate_vma_* adds more support, this
> > > > +	 * logic will have to change.
> > > > +	 */
> > > > +	migrate_vram = ctx->vram_possible &&
> > > > +		vma_is_anonymous(vas) &&
> > > > !is_vm_hugetlb_page(vas);
> > > > +
> > > > +	chunk_size = drm_gpusvm_range_chunk_size(gpusvm,
> > > > notifier,
> > > > vas,
> > > > +						 fault_addr,
> > > > gpuva_start,
> > > > +						 gpuva_end,
> > > > migrate_vram &&
> > > > +						 !ctx-
> > > > >prefault);
> > > > +	if (chunk_size == LONG_MAX) {
> > > > +		err = -EINVAL;
> > > > +		goto err_notifier_remove;
> > > > +	}
> > > > +
> > > > +	range = drm_gpusvm_range_alloc(gpusvm, notifier,
> > > > fault_addr,
> > > > chunk_size,
> > > > +				       migrate_vram);
> > > > +	if (IS_ERR(range)) {
> > > > +		err = PTR_ERR(range);
> > > > +		goto err_notifier_remove;
> > > > +	}
> > > > +
> > > > +	drm_gpusvm_range_insert(notifier, range);
> > > > +	if (notifier_alloc)
> > > > +		drm_gpusvm_notifier_insert(gpusvm, notifier);
> > > > +
> > > > +	if (ctx->prefault) {
> > > > +		struct drm_gpusvm_ctx __ctx = *ctx;
> > > > +
> > > > +		__ctx.mmap_locked = true;
> > > > +		err = drm_gpusvm_range_get_pages(gpusvm, range,
> > > > &__ctx);
> > > > +		if (err)
> > > > +			goto err_range_remove;
> > > > +	}
> > > > +
> > > > +out_mmunlock:
> > > > +	if (!ctx->mmap_locked) {
> > > > +		mmap_write_unlock(mm);
> > > > +		mmput(mm);
> > > > +	}
> > > > +
> > > > +	return range;
> > > > +
> > > > +err_range_remove:
> > > > +	__drm_gpusvm_range_remove(notifier, range);
> > > > +err_notifier_remove:
> > > > +	if (notifier_alloc)
> > > > +		mmu_interval_notifier_remove(&notifier-
> > > > >notifier);
> > > > +err_notifier:
> > > > +	if (notifier_alloc)
> > > > +		drm_gpusvm_notifier_free(gpusvm, notifier);
> > > > +err_mmunlock:
> > > > +	if (!ctx->mmap_locked) {
> > > > +		mmap_write_unlock(mm);
> > > > +		mmput(mm);
> > > > +	}
> > > > +err_out:
> > > > +	return ERR_PTR(err);
> > > > +}
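
To check my reading of the intended call flow, a driver GPU fault handler
would presumably start out roughly like this (gpusvm, fault_addr,
gpuva_start/gpuva_end and vram_allocation are locals of the hypothetical
handler, not defined in this patch):

	struct drm_gpusvm_ctx ctx = { .vram_possible = true };
	struct drm_gpusvm_range *range;

	range = drm_gpusvm_range_find_or_insert(gpusvm, fault_addr,
						gpuva_start, gpuva_end, &ctx);
	if (IS_ERR(range))
		return PTR_ERR(range);

	/* Best effort; on failure the fault can still be serviced from SRAM */
	if (range->flags.migrate_vram)
		drm_gpusvm_migrate_to_vram(gpusvm, range, vram_allocation, &ctx);
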
> > > > +
> > > > +/**
> > > > + * for_each_dma_page - Iterate over pages in a DMA region
> > > > + * @i__: the current page index in the iteration
> > > > + * @j__: the current page index, log order, in the iteration
> > > > + * @npages__: the total number of pages in the DMA region
> > > > + * @order__: the order of the pages in the DMA region
> > > > + *
> > > > + * This macro iterates over each page in a DMA region. The DMA region
> > > > + * is assumed to be composed of 2^@order__ pages, and the macro will
> > > > + * step through the region one block of 2^@order__ pages at a time.
> > > > + */
> > > > +#define for_each_dma_page(i__, j__, npages__, order__)	\
> > > > +	for ((i__) = 0, (j__) = 0; (i__) < (npages__);	\
> > > > +	     (j__)++, (i__) += 0x1 << (order__))
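
Quick worked example of the two indices, since they advance at different
strides: with npages = 32 and order = 4 (16 pages per block) the body runs
twice, with (i, j) = (0, 0) and (16, 1), i.e. @i__ walks PFN-array slots
while @j__ walks per-block (DMA address) slots:

	/* illustrative only */
	for_each_dma_page(i, j, 32, 4)
		pr_debug("block %lu starts at page index %lu\n", j, i);
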
> > > > +
> > > > +/**
> > > > + * __drm_gpusvm_range_unmap_pages - Unmap pages associated with a
> > > > + * GPU SVM range (internal)
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @range: Pointer to the GPU SVM range structure
> > > > + *
> > > > + * This function unmaps pages associated with a GPU SVM range.
> > > > + * Assumes and asserts correct locking is in place when called.
> > > > + */
> > > > +static void __drm_gpusvm_range_unmap_pages(struct drm_gpusvm
> > > > *gpusvm,
> > > > +					   struct
> > > > drm_gpusvm_range
> > > > *range)
> > > > +{
> > > > +	lockdep_assert_held(&gpusvm->notifier_lock);
> > > > +
> > > > +	if (range->pages) {
> > > > +		unsigned long i, j, npages =
> > > > npages_in_range(range-
> > > > > va.start,
> > > > +							    
> > > > range-
> > > > > va.end);
> > > > +
> > > > +		if (range->flags.has_dma_mapping) {
> > > > +			for_each_dma_page(i, j, npages, range-
> > > > > order)
> > > > +				dma_unmap_page(gpusvm->drm->dev,
> > > > +					       range-
> > > > >dma_addr[j],
> > > > +					       PAGE_SIZE <<
> > > > range-
> > > > > order,
> > > > +					      
> > > > DMA_BIDIRECTIONAL);
> > > > +		}
> > > > +
> > > > +		range->flags.has_vram_pages = false;
> > > > +		range->flags.has_dma_mapping = false;
> > > > +	}
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_free_pages - Free pages associated with a GPU SVM range
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @range: Pointer to the GPU SVM range structure
> > > > + *
> > > > + * This function frees pages associated with a GPU SVM range.
> > > > + */
> > > > +static void drm_gpusvm_range_free_pages(struct drm_gpusvm
> > > > *gpusvm,
> > > > +					struct drm_gpusvm_range
> > > > *range)
> > > > +{
> > > > +	lockdep_assert_held(&gpusvm->notifier_lock);
> > > > +
> > > > +	if (range->pages) {
> > > > +		if (range->flags.kfree_mapping) {
> > > > +			kfree(range->dma_addr);
> > > > +			range->flags.kfree_mapping = false;
> > > > +			range->pages = NULL;
> > > > +		} else {
> > > > +			kvfree(range->pages);
> > > > +			range->pages = NULL;
> > > > +		}
> > > > +	}
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_remove - Remove GPU SVM range
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @range: Pointer to the GPU SVM range to be removed
> > > > + *
> > > > + * This function removes the specified GPU SVM range and also removes
> > > > + * the parent GPU SVM notifier if no more ranges remain in the
> > > > + * notifier. The caller must hold a lock to protect range and notifier
> > > > + * removal.
> > > > + */
> > > > +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> > > > +			     struct drm_gpusvm_range *range)
> > > > +{
> > > > +	struct drm_gpusvm_notifier *notifier;
> > > > +
> > > > +	notifier = drm_gpusvm_notifier_find(gpusvm, range-
> > > > > va.start);
> > > > +	if (WARN_ON_ONCE(!notifier))
> > > > +		return;
> > > > +
> > > > +	drm_gpusvm_notifier_lock(gpusvm);
> > > > +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > > > +	drm_gpusvm_range_free_pages(gpusvm, range);
> > > > +	__drm_gpusvm_range_remove(notifier, range);
> > > > +	drm_gpusvm_notifier_unlock(gpusvm);
> > > > +
> > > > +	drm_gpusvm_range_put(range);
> > > > +
> > > > +	if (RB_EMPTY_ROOT(&notifier->root.rb_root)) {
> > > > +		if (!notifier->flags.removed)
> > > > +			mmu_interval_notifier_remove(&notifier-
> > > > > notifier);
> > > > +		drm_gpusvm_notifier_remove(gpusvm, notifier);
> > > > +		drm_gpusvm_notifier_free(gpusvm, notifier);
> > > > +	}
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_get - Get a reference to GPU SVM range
> > > > + * @range: Pointer to the GPU SVM range
> > > > + *
> > > > + * This function increments the reference count of the specified GPU
> > > > + * SVM range.
> > > > + *
> > > > + * Returns:
> > > > + * Pointer to the GPU SVM range.
> > > > + */
> > > > +struct drm_gpusvm_range *
> > > > +drm_gpusvm_range_get(struct drm_gpusvm_range *range)
> > > > +{
> > > > +	kref_get(&range->refcount);
> > > > +
> > > > +	return range;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_destroy - Destroy GPU SVM range
> > > > + * @refcount: Pointer to the reference counter embedded in the GPU
> > > > + *            SVM range
> > > > + *
> > > > + * This function destroys the specified GPU SVM range when its
> > > > + * reference count reaches zero. If a custom range-free function is
> > > > + * provided, it is invoked to free the range; otherwise, the range is
> > > > + * deallocated using kfree().
> > > > + */
> > > > +static void drm_gpusvm_range_destroy(struct kref *refcount)
> > > > +{
> > > > +	struct drm_gpusvm_range *range =
> > > > +		container_of(refcount, struct drm_gpusvm_range,
> > > > refcount);
> > > > +	struct drm_gpusvm *gpusvm = range->gpusvm;
> > > > +
> > > > +	if (gpusvm->ops->range_free)
> > > > +		gpusvm->ops->range_free(range);
> > > > +	else
> > > > +		kfree(range);
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_put - Put a reference to GPU SVM range
> > > > + * @range: Pointer to the GPU SVM range
> > > > + *
> > > > + * This function decrements the reference count of the specified GPU
> > > > + * SVM range and frees it when the count reaches zero.
> > > > + */
> > > > +void drm_gpusvm_range_put(struct drm_gpusvm_range *range)
> > > > +{
> > > > +	kref_put(&range->refcount, drm_gpusvm_range_destroy);
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_pages_valid - GPU SVM range pages valid
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @range: Pointer to the GPU SVM range structure
> > > > + *
> > > > + * This function determines if a GPU SVM range's pages are valid.
> > > > + * Expected to be called holding gpusvm->notifier_lock and as the last
> > > > + * step before committing a GPU binding.
> > > > + *
> > > > + * Returns:
> > > > + * True if GPU SVM range has valid pages, False otherwise
> > > > + */
> > > > +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> > > > +				  struct drm_gpusvm_range
> > > > *range)
> > > > +{
> > > > +	lockdep_assert_held(&gpusvm->notifier_lock);
> > > > +
> > > > +	return range->flags.has_vram_pages || range-
> > > > > flags.has_dma_mapping;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_pages_valid_unlocked - GPU SVM range pages valid
> > > > + * unlocked
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @range: Pointer to the GPU SVM range structure
> > > > + *
> > > > + * This function determines if a GPU SVM range's pages are valid.
> > > > + * Expected to be called without holding gpusvm->notifier_lock.
> > > > + *
> > > > + * Returns:
> > > > + * True if GPU SVM range has valid pages, False otherwise
> > > > + */
> > > > +static bool
> > > > +drm_gpusvm_range_pages_valid_unlocked(struct drm_gpusvm *gpusvm,
> > > > +				      struct drm_gpusvm_range
> > > > *range)
> > > > +{
> > > > +	bool pages_valid;
> > > > +
> > > > +	if (!range->pages)
> > > > +		return false;
> > > > +
> > > > +	drm_gpusvm_notifier_lock(gpusvm);
> > > > +	pages_valid = drm_gpusvm_range_pages_valid(gpusvm,
> > > > range);
> > > > +	if (!pages_valid && range->flags.kfree_mapping) {
> > > > +		kfree(range->dma_addr);
> > > > +		range->flags.kfree_mapping = false;
> > > > +		range->pages = NULL;
> > > > +	}
> > > > +	drm_gpusvm_notifier_unlock(gpusvm);
> > > > +
> > > > +	return pages_valid;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_get_pages - Get pages for a GPU SVM range
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @range: Pointer to the GPU SVM range structure
> > > > + * @ctx: GPU SVM context
> > > > + *
> > > > + * This function gets pages for a GPU SVM range and ensures they are
> > > > + * mapped for DMA access.
> > > > + *
> > > > + * Returns:
> > > > + * 0 on success, negative error code on failure.
> > > > + */
> > > > +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> > > > +			       struct drm_gpusvm_range *range,
> > > > +			       const struct drm_gpusvm_ctx *ctx)
> > > > +{
> > > > +	struct mmu_interval_notifier *notifier = &range-
> > > > >notifier-
> > > > > notifier;
> > > > +	struct hmm_range hmm_range = {
> > > > +		.default_flags = HMM_PFN_REQ_FAULT | (ctx-
> > > > >read_only
> > > > ? 0 :
> > > > +			HMM_PFN_REQ_WRITE),
> > > > +		.notifier = notifier,
> > > > +		.start = range->va.start,
> > > > +		.end = range->va.end,
> > > > +		.dev_private_owner = gpusvm-
> > > > > device_private_page_owner,
> > > > +	};
> > > > +	struct mm_struct *mm = gpusvm->mm;
> > > > +	unsigned long timeout =
> > > > +		jiffies +
> > > > msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > > > +	unsigned long i, j;
> > > > +	unsigned long npages = npages_in_range(range->va.start,
> > > > range->va.end);
> > > > +	unsigned int order = 0;
> > > > +	unsigned long *pfns;
> > > > +	struct page **pages;
> > > > +	int err = 0;
> > > > +	bool vram_pages = !!range->flags.migrate_vram;
> > > > +	bool alloc_pfns = false, kfree_mapping;
> > > > +
> > > > +retry:
> > > > +	kfree_mapping = false;
> > > > +	hmm_range.notifier_seq =
> > > > mmu_interval_read_begin(notifier);
> > > > +	if (drm_gpusvm_range_pages_valid_unlocked(gpusvm,
> > > > range))
> > > > +		return 0;
> > > > +
> > > > +	if (range->notifier_seq == hmm_range.notifier_seq &&
> > > > range-
> > > > > pages) {
> > > > +		if (ctx->prefault)
> > > > +			return 0;
> > > > +
> > > > +		pfns = (unsigned long *)range->pages;
> > > > +		pages = range->pages;
> > > > +		goto map_pages;
> > > > +	}
> > > > +
> > > > +	if (!range->pages) {
> > > > +		pfns = kvmalloc_array(npages, sizeof(*pfns),
> > > > GFP_KERNEL);
> > > > +		if (!pfns)
> > > > +			return -ENOMEM;
> > > > +		alloc_pfns = true;
> > > > +	} else {
> > > > +		pfns = (unsigned long *)range->pages;
> > > > +	}
> > > > +
> > > > +	if (!ctx->mmap_locked) {
> > > > +		if (!mmget_not_zero(mm)) {
> > > > +			err = -EFAULT;
> > > > +			goto err_out;
> > > > +		}
> > > > +	}
> > > > +
> > > > +	hmm_range.hmm_pfns = pfns;
> > > > +	while (true) {
> > > > +		/* Must be checked after mmu_interval_read_begin
> > > > */
> > > > +		if (range->flags.unmapped) {
> > > > +			err = -EFAULT;
> > > > +			break;
> > > > +		}
> > > > +
> > > > +		if (!ctx->mmap_locked) {
> > > > +			/*
> > > > +			 * XXX: The HMM locking document indicates only a
> > > > +			 * read-lock is required but there appears to be a
> > > > +			 * window between the MMU_NOTIFY_MIGRATE event
> > > > +			 * triggered in a CPU fault via migrate_vma_setup
> > > > +			 * and the pages actually moving in
> > > > +			 * migrate_vma_finalize in which this code can grab
> > > > +			 * garbage pages. Grabbing the write-lock if the
> > > > +			 * range is attached to vram appears to protect
> > > > +			 * against this race.
> > > > +			 */
> > > > +			if (vram_pages)
> > > > +				mmap_write_lock(mm);
> > > > +			else
> > > > +				mmap_read_lock(mm);
> > > > +		}
> > > > +		err = hmm_range_fault(&hmm_range);
> > > > +		if (!ctx->mmap_locked) {
> > > > +			if (vram_pages)
> > > > +				mmap_write_unlock(mm);
> > > > +			else
> > > > +				mmap_read_unlock(mm);
> > > > +		}
> > > > +
> > > > +		if (err == -EBUSY) {
> > > > +			if (time_after(jiffies, timeout))
> > > > +				break;
> > > > +
> > > > +			hmm_range.notifier_seq =
> > > > mmu_interval_read_begin(notifier);
> > > > +			continue;
> > > > +		}
> > > > +		break;
> > > > +	}
> > > > +	if (!ctx->mmap_locked)
> > > > +		mmput(mm);
> > > > +	if (err)
> > > > +		goto err_free;
> > > > +
> > > > +	pages = (struct page **)pfns;
> > > > +
> > > > +	if (ctx->prefault) {
> > > > +		range->pages = pages;
> > > > +		goto set_seqno;
> > > > +	}
> > > > +
> > > > +map_pages:
> > > > +	if (is_device_private_page(hmm_pfn_to_page(pfns[0]))) {
> > > > +		WARN_ON_ONCE(!range->vram_allocation);
> > > > +
> > > > +		for (i = 0; i < npages; ++i) {
> > > > +			pages[i] = hmm_pfn_to_page(pfns[i]);
> > > > +
> > > > +			if
> > > > (WARN_ON_ONCE(!is_device_private_page(pages[i]))) {
> > > > +				err = -EOPNOTSUPP;
> > > > +				goto err_free;
> > > > +			}
> > > > +		}
> > > > +
> > > > +		/* Do not race with notifier unmapping pages */
> > > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > > +		range->flags.has_vram_pages = true;
> > > > +		range->pages = pages;
> > > > +		if (mmu_interval_read_retry(notifier,
> > > > hmm_range.notifier_seq)) {
> > > > +			err = -EAGAIN;
> > > > +			__drm_gpusvm_range_unmap_pages(gpusvm,
> > > > range);
> > > > +		}
> > > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > > +	} else {
> > > > +		dma_addr_t *dma_addr = (dma_addr_t *)pfns;
> > > > +
> > > > +		for_each_dma_page(i, j, npages, order) {
> > > > +			if (WARN_ON_ONCE(i && order !=
> > > > +					
> > > > hmm_pfn_to_map_order(pfns[i]))) {
> > > > +				err = -EOPNOTSUPP;
> > > > +				npages = i;
> > > > +				goto err_unmap;
> > > > +			}
> > > > +			order = hmm_pfn_to_map_order(pfns[i]);
> > > > +
> > > > +			pages[j] = hmm_pfn_to_page(pfns[i]);
> > > > +			if
> > > > (WARN_ON_ONCE(is_zone_device_page(pages[j]))) {
> > > > +				err = -EOPNOTSUPP;
> > > > +				npages = i;
> > > > +				goto err_unmap;
> > > > +			}
> > > > +
> > > > +			set_page_dirty_lock(pages[j]);
> > > > +			mark_page_accessed(pages[j]);
> > > > +
> > > > +			dma_addr[j] = dma_map_page(gpusvm->drm-
> > > > >dev,
> > > > +						   pages[j], 0,
> > > > +						   PAGE_SIZE <<
> > > > order,
> > > > +						  
> > > > DMA_BIDIRECTIONAL);
> > > > +			if (dma_mapping_error(gpusvm->drm->dev,
> > > > dma_addr[j])) {
> > > > +				err = -EFAULT;
> > > > +				npages = i;
> > > > +				goto err_unmap;
> > > > +			}
> > > > +		}
> > > > +
> > > > +		/* Huge pages, reduce memory footprint */
> > > > +		if (order) {
> > > > +			dma_addr = kmalloc_array(j,
> > > > sizeof(*dma_addr),
> > > > +						 GFP_KERNEL);
> > > > +			if (dma_addr) {
> > > > +				for (i = 0; i < j; ++i)
> > > > +					dma_addr[i] =
> > > > (dma_addr_t)pfns[i];
> > > > +				kvfree(pfns);
> > > > +				kfree_mapping = true;
> > > > +			} else {
> > > > +				dma_addr = (dma_addr_t *)pfns;
> > > > +			}
> > > > +		}
> > > > +
> > > > +		/* Do not race with notifier unmapping pages */
> > > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > > +		range->order = order;
> > > > +		range->flags.kfree_mapping = kfree_mapping;
> > > > +		range->flags.has_dma_mapping = true;
> > > > +		range->dma_addr = dma_addr;
> > > > +		range->vram_allocation = NULL;
> > > > +		if (mmu_interval_read_retry(notifier,
> > > > hmm_range.notifier_seq)) {
> > > > +			err = -EAGAIN;
> > > > +			__drm_gpusvm_range_unmap_pages(gpusvm,
> > > > range);
> > > > +		}
> > > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > > +	}
> > > > +
> > > > +	if (err == -EAGAIN)
> > > > +		goto retry;
> > > > +set_seqno:
> > > > +	range->notifier_seq = hmm_range.notifier_seq;
> > > > +
> > > > +	return 0;
> > > > +
> > > > +err_unmap:
> > > > +	for_each_dma_page(i, j, npages, order)
> > > > +		dma_unmap_page(gpusvm->drm->dev,
> > > > +			       (dma_addr_t)pfns[j],
> > > > +			       PAGE_SIZE << order,
> > > > DMA_BIDIRECTIONAL);
> > > > +err_free:
> > > > +	if (alloc_pfns)
> > > > +		kvfree(pfns);
> > > > +err_out:
> > > > +	return err;
> > > > +}
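
And the back half of the hypothetical fault handler sketched earlier,
committing the GPU bind only while the notifier lock is held and the pages
are still valid (my_driver_bind_range and the retry label are placeholders):

	err = drm_gpusvm_range_get_pages(gpusvm, range, &ctx);
	if (err)
		return err;

	drm_gpusvm_notifier_lock(gpusvm);
	if (!drm_gpusvm_range_pages_valid(gpusvm, range)) {
		drm_gpusvm_notifier_unlock(gpusvm);
		goto retry;	/* pages were invalidated, refault */
	}
	err = my_driver_bind_range(vm, range);
	drm_gpusvm_notifier_unlock(gpusvm);
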
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_unmap_pages - Unmap pages associated with a GPU
> > > > + * SVM range
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @range: Pointer to the GPU SVM range structure
> > > > + * @ctx: GPU SVM context
> > > > + *
> > > > + * This function unmaps pages associated with a GPU SVM range. If
> > > > + * @in_notifier is set, it is assumed that gpusvm->notifier_lock is
> > > > + * held in write mode; if it is clear, it acquires
> > > > + * gpusvm->notifier_lock in read mode. Must be called on each GPU SVM
> > > > + * range attached to notifier in gpusvm->ops->invalidate for IOMMU
> > > > + * security model.
> > > > + */
> > > > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> > > > +				  struct drm_gpusvm_range
> > > > *range,
> > > > +				  const struct drm_gpusvm_ctx
> > > > *ctx)
> > > > +{
> > > > +	if (ctx->in_notifier)
> > > > +		lockdep_assert_held_write(&gpusvm-
> > > > >notifier_lock);
> > > > +	else
> > > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > > +
> > > > +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > > > +
> > > > +	if (!ctx->in_notifier)
> > > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > > +}
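
For what it is worth, the way I read the IOMMU security requirement above, a
driver invalidate() hook would look roughly like the sketch below
(my_driver_zap_ptes is a placeholder for the GPU page-table invalidation):

	static void my_invalidate(struct drm_gpusvm *gpusvm,
				  struct drm_gpusvm_notifier *notifier,
				  const struct mmu_notifier_range *mmu_range)
	{
		struct drm_gpusvm_ctx ctx = { .in_notifier = true, };
		struct drm_gpusvm_range *range = NULL;

		/* Zap GPU PTEs covering the invalidated window first */
		my_driver_zap_ptes(gpusvm, mmu_range->start, mmu_range->end);

		/* Then drop DMA mappings for every affected range */
		drm_gpusvm_for_each_range(range, notifier, mmu_range->start,
					  mmu_range->end)
			drm_gpusvm_range_unmap_pages(gpusvm, range, &ctx);
	}
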
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_migration_put_page - Put a migration page
> > > > + * @page: Pointer to the page to put
> > > > + *
> > > > + * This function unlocks and puts a page.
> > > > + */
> > > > +static void drm_gpusvm_migration_put_page(struct page *page)
> > > > +{
> > > > +	unlock_page(page);
> > > > +	put_page(page);
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_migration_put_pages - Put migration pages
> > > > + * @npages: Number of pages
> > > > + * @migrate_pfn: Array of migrate page frame numbers
> > > > + *
> > > > + * This function puts an array of pages.
> > > > + */
> > > > +static void drm_gpusvm_migration_put_pages(unsigned long npages,
> > > > +					   unsigned long
> > > > *migrate_pfn)
> > > > +{
> > > > +	unsigned long i;
> > > > +
> > > > +	for (i = 0; i < npages; ++i) {
> > > > +		if (!migrate_pfn[i])
> > > > +			continue;
> > > > +
> > > > +		drm_gpusvm_migration_put_page(migrate_pfn_to_page(migrate_pfn[i]));
> > > > +		migrate_pfn[i] = 0;
> > > > +	}
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_get_vram_page - Get a reference to a VRAM page
> > > > + * @page: Pointer to the page
> > > > + * @zdd: Pointer to the GPU SVM zone device data
> > > > + *
> > > > + * This function associates the given page with the specified GPU SVM
> > > > + * zone device data and initializes it for zone device usage.
> > > > + */
> > > > +static void drm_gpusvm_get_vram_page(struct page *page,
> > > > +				     struct drm_gpusvm_zdd *zdd)
> > > > +{
> > > > +	page->zone_device_data = drm_gpusvm_zdd_get(zdd);
> > > > +	zone_device_page_init(page);
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_migrate_map_pages() - Map migration pages for GPU SVM
> > > > + * migration
> > > > + * @dev: The device for which the pages are being mapped
> > > > + * @dma_addr: Array to store DMA addresses corresponding to mapped
> > > > + *            pages
> > > > + * @migrate_pfn: Array of migrate page frame numbers to map
> > > > + * @npages: Number of pages to map
> > > > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > > > + *
> > > > + * This function maps pages of memory for migration usage in GPU SVM.
> > > > + * It iterates over each page frame number provided in @migrate_pfn,
> > > > + * maps the corresponding page, and stores the DMA address in the
> > > > + * provided @dma_addr array.
> > > > + *
> > > > + * Return: 0 on success, -EFAULT if an error occurs during mapping.
> > > > + */
> > > > +static int drm_gpusvm_migrate_map_pages(struct device *dev,
> > > > +					dma_addr_t *dma_addr,
> > > > +					long unsigned int
> > > > *migrate_pfn,
> > > > +					unsigned long npages,
> > > > +					enum dma_data_direction
> > > > dir)
> > > > +{
> > > > +	unsigned long i;
> > > > +
> > > > +	for (i = 0; i < npages; ++i) {
> > > > +		struct page *page =
> > > > migrate_pfn_to_page(migrate_pfn[i]);
> > > > +
> > > > +		if (!page)
> > > > +			continue;
> > > > +
> > > > +		if (WARN_ON_ONCE(is_zone_device_page(page)))
> > > > +			return -EFAULT;
> > > > +
> > > > +		dma_addr[i] = dma_map_page(dev, page, 0,
> > > > PAGE_SIZE,
> > > > dir);
> > > > +		if (dma_mapping_error(dev, dma_addr[i]))
> > > > +			return -EFAULT;
> > > > +	}
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_migrate_unmap_pages() - Unmap pages previously mapped
> > > > + * for GPU SVM migration
> > > > + * @dev: The device for which the pages were mapped
> > > > + * @dma_addr: Array of DMA addresses corresponding to mapped pages
> > > > + * @npages: Number of pages to unmap
> > > > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > > > + *
> > > > + * This function unmaps previously mapped pages of memory for GPU
> > > > + * Shared Virtual Memory (SVM). It iterates over each DMA address
> > > > + * provided in @dma_addr, checks if it's valid and not already
> > > > + * unmapped, and unmaps the corresponding page.
> > > > + */
> > > > +static void drm_gpusvm_migrate_unmap_pages(struct device *dev,
> > > > +					   dma_addr_t *dma_addr,
> > > > +					   unsigned long npages,
> > > > +					   enum
> > > > dma_data_direction
> > > > dir)
> > > > +{
> > > > +	unsigned long i;
> > > > +
> > > > +	for (i = 0; i < npages; ++i) {
> > > > +		if (!dma_addr[i] || dma_mapping_error(dev,
> > > > dma_addr[i]))
> > > > +			continue;
> > > > +
> > > > +		dma_unmap_page(dev, dma_addr[i], PAGE_SIZE,
> > > > dir);
> > > > +	}
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_migrate_to_vram - Migrate GPU SVM range to VRAM
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @range: Pointer to the GPU SVM range structure
> > > > + * @vram_allocation: Driver-private pointer to the VRAM allocation.
> > > > + *                   The caller should hold a reference to the VRAM
> > > > + *                   allocation, which should be dropped via
> > > > + *                   ops->vram_release or upon the failure of this
> > > > + *                   function.
> > > > + * @ctx: GPU SVM context
> > > > + *
> > > > + * This function migrates the specified GPU SVM range to VRAM. It
> > > > + * performs the necessary setup and invokes the driver-specific
> > > > + * operations for migration to VRAM. Upon successful return,
> > > > + * @vram_allocation can safely reference @range until
> > > > + * ops->vram_release is called.
> > > > + *
> > > > + * Returns:
> > > > + * 0 on success, negative error code on failure.
> > > > + */
> > > > +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> > > > +			       struct drm_gpusvm_range *range,
> > > > +			       void *vram_allocation,
> > > > +			       const struct drm_gpusvm_ctx *ctx)
> > > > +{
> > > > +	u64 start = range->va.start, end = range->va.end;
> > > > +	struct migrate_vma migrate = {
> > > > +		.start		= start,
> > > > +		.end		= end,
> > > > +		.pgmap_owner	= gpusvm-
> > > > >device_private_page_owner,
> > > > +		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
> > > > +	};
> > > > +	struct mm_struct *mm = gpusvm->mm;
> > > > +	unsigned long i, npages = npages_in_range(start, end);
> > > > +	struct vm_area_struct *vas;
> > > > +	struct drm_gpusvm_zdd *zdd = NULL;
> > > > +	struct page **pages;
> > > > +	dma_addr_t *dma_addr;
> > > > +	void *buf;
> > > > +	int err;
> > > > +
> > > > +	if (!range->flags.migrate_vram)
> > > > +		return -EINVAL;
> > > > +
> > > > +	if (!gpusvm->ops->populate_vram_pfn || !gpusvm->ops-
> > > > > copy_to_vram ||
> > > > +	    !gpusvm->ops->copy_to_sram)
> > > > +		return -EOPNOTSUPP;
> > > > +
> > > > +	if (!ctx->mmap_locked) {
> > > > +		if (!mmget_not_zero(mm)) {
> > > > +			err = -EFAULT;
> > > > +			goto err_out;
> > > > +		}
> > > > +		mmap_write_lock(mm);
> > > > +	}
> > > > +
> > > > +	mmap_assert_locked(mm);
> > > > +
> > > > +	vas = vma_lookup(mm, start);
> > > > +	if (!vas) {
> > > > +		err = -ENOENT;
> > > > +		goto err_mmunlock;
> > > > +	}
> > > > +
> > > > +	if (end > vas->vm_end || start < vas->vm_start) {
> > > > +		err = -EINVAL;
> > > > +		goto err_mmunlock;
> > > > +	}
> > > > +
> > > > +	if (!vma_is_anonymous(vas)) {
> > > > +		err = -EBUSY;
> > > > +		goto err_mmunlock;
> > > > +	}
> > > > +
> > > > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) +
> > > > sizeof(*dma_addr) +
> > > > +		       sizeof(*pages), GFP_KERNEL);
> > > > +	if (!buf) {
> > > > +		err = -ENOMEM;
> > > > +		goto err_mmunlock;
> > > > +	}
> > > > +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > > > +	pages = buf + (2 * sizeof(*migrate.src) +
> > > > sizeof(*dma_addr))
> > > > * npages;
> > > > +
> > > > +	zdd = drm_gpusvm_zdd_alloc(range);
> > > > +	if (!zdd) {
> > > > +		err = -ENOMEM;
> > > > +		goto err_free;
> > > > +	}
> > > > +
> > > > +	migrate.vma = vas;
> > > > +	migrate.src = buf;
> > > > +	migrate.dst = migrate.src + npages;
> > > > +
> > > > +	err = migrate_vma_setup(&migrate);
> > > > +	if (err)
> > > > +		goto err_free;
> > > > +
> > > > +	/*
> > > > +	 * FIXME: The cases below, !migrate.cpages and migrate.cpages !=
> > > > +	 * npages, are not always an error. Need to revisit possible cases
> > > > +	 * and how to handle them. We could prefault on migrate.cpages !=
> > > > +	 * npages via hmm_range_fault.
> > > > +	 */
> > > > +
> > > > +	if (!migrate.cpages) {
> > > > +		err = -EFAULT;
> > > > +		goto err_free;
> > > > +	}
> > > > +
> > > > +	if (migrate.cpages != npages) {
> > > > +		err = -EBUSY;
> > > > +		goto err_finalize;
> > > > +	}
> > > > +
> > > > +	err = gpusvm->ops->populate_vram_pfn(gpusvm,
> > > > vram_allocation, npages,
> > > > +					     migrate.dst);
> > > > +	if (err)
> > > > +		goto err_finalize;
> > > > +
> > > > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> > > > dma_addr,
> > > > +					   migrate.src, npages,
> > > > DMA_TO_DEVICE);
> > > > +	if (err)
> > > > +		goto err_finalize;
> > > > +
> > > > +	for (i = 0; i < npages; ++i) {
> > > > +		struct page *page = pfn_to_page(migrate.dst[i]);
> > > > +
> > > > +		pages[i] = page;
> > > > +		migrate.dst[i] = migrate_pfn(migrate.dst[i]);
> > > > +		drm_gpusvm_get_vram_page(page, zdd);
> > > > +	}
> > > > +
> > > > +	err = gpusvm->ops->copy_to_vram(gpusvm, pages, dma_addr,
> > > > npages);
> > > > +	if (err)
> > > > +		goto err_finalize;
> > > > +
> > > > +	/* Upon success bind vram allocation to range and zdd */
> > > > +	range->vram_allocation = vram_allocation;
> > > > +	WRITE_ONCE(zdd->vram_allocation,
> > > > vram_allocation);	/*
> > > > Owns ref */
> > > > +
> > > > +err_finalize:
> > > > +	if (err)
> > > > +		drm_gpusvm_migration_put_pages(npages,
> > > > migrate.dst);
> > > > +	migrate_vma_pages(&migrate);
> > > > +	migrate_vma_finalize(&migrate);
> > > > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev,
> > > > dma_addr,
> > > > npages,
> > > > +				       DMA_TO_DEVICE);
> > > > +err_free:
> > > > +	if (zdd)
> > > > +		drm_gpusvm_zdd_put(zdd);
> > > > +	kvfree(buf);
> > > > +err_mmunlock:
> > > > +	if (!ctx->mmap_locked) {
> > > > +		mmap_write_unlock(mm);
> > > > +		mmput(mm);
> > > > +	}
> > > > +err_out:
> > > > +	return err;
> > > > +}
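
One usage note, mostly to confirm the reference ownership: the caller passes
its VRAM object (a TTM BO in Xe, as I understand it) as the opaque
@vram_allocation and must already hold a reference, which GPU SVM owns on
success. Sketch with placeholder driver helpers:

	bo = my_driver_alloc_vram(gpusvm, range->va.end - range->va.start);
	if (IS_ERR(bo))
		return PTR_ERR(bo);

	err = drm_gpusvm_migrate_to_vram(gpusvm, range, bo, &ctx);
	if (err)
		my_driver_bo_put(bo);	/* on failure the caller drops its ref */
	/* on success the ref is dropped later via ops->vram_release */
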
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_migrate_populate_sram_pfn - Populate SRAM PFNs for a
> > > > + * VM area
> > > > + * @vas: Pointer to the VM area structure, can be NULL
> > > > + * @npages: Number of pages to populate
> > > > + * @src_mpfn: Source array of migrate PFNs
> > > > + * @mpfn: Array of migrate PFNs to populate
> > > > + * @addr: Start address for PFN allocation
> > > > + *
> > > > + * This function populates the SRAM migrate page frame numbers (PFNs)
> > > > + * for the specified VM area structure. It allocates and locks pages
> > > > + * in the VM area for SRAM usage. If @vas is non-NULL, alloc_page_vma
> > > > + * is used for allocation; if NULL, alloc_page is used.
> > > > + *
> > > > + * Returns:
> > > > + * 0 on success, negative error code on failure.
> > > > + */
> > > > +static int drm_gpusvm_migrate_populate_sram_pfn(struct
> > > > vm_area_struct *vas,
> > > > +						unsigned long
> > > > npages,
> > > > +						unsigned long
> > > > *src_mpfn,
> > > > +						unsigned long
> > > > *mpfn,
> > > > u64 addr)
> > > > +{
> > > > +	unsigned long i;
> > > > +
> > > > +	for (i = 0; i < npages; ++i, addr += PAGE_SIZE) {
> > > > +		struct page *page;
> > > > +
> > > > +		if (!(src_mpfn[i] & MIGRATE_PFN_MIGRATE))
> > > > +			continue;
> > > > +
> > > > +		if (vas)
> > > > +			page = alloc_page_vma(GFP_HIGHUSER, vas,
> > > > addr);
> > > > +		else
> > > > +			page = alloc_page(GFP_HIGHUSER);
> > > > +
> > > > +		if (!page)
> > > > +			return -ENOMEM;
> > > > +
> > > > +		lock_page(page);
> > > > +		mpfn[i] = migrate_pfn(page_to_pfn(page));
> > > > +	}
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_evict_to_sram - Evict GPU SVM range to SRAM
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @range: Pointer to the GPU SVM range structure
> > > > + *
> > > > + * Similar to __drm_gpusvm_migrate_to_sram but does not require the
> > > > + * mmap lock; migration is done via the migrate_device_* functions.
> > > > + * Fallback path, as it is preferred to issue migrations with the mmap
> > > > + * lock held.
> > > > + *
> > > > + * Returns:
> > > > + * 0 on success, negative error code on failure.
> > > > + */
> > > > +static int drm_gpusvm_evict_to_sram(struct drm_gpusvm *gpusvm,
> > > > +				    struct drm_gpusvm_range
> > > > *range)
> > > > +{
> > > > +	unsigned long npages;
> > > > +	struct page **pages;
> > > > +	unsigned long *src, *dst;
> > > > +	dma_addr_t *dma_addr;
> > > > +	void *buf;
> > > > +	int i, err = 0;
> > > > +
> > > > +	npages = npages_in_range(range->va.start, range-
> > > > >va.end);
> > > > +
> > > > +	buf = kvcalloc(npages, 2 * sizeof(*src) + sizeof(*dma_addr) +
> > > > +		       sizeof(*pages), GFP_KERNEL);
> > > > +	if (!buf) {
> > > > +		err = -ENOMEM;
> > > > +		goto err_out;
> > > > +	}
> > > > +	src = buf;
> > > > +	dst = buf + (sizeof(*src) * npages);
> > > > +	dma_addr = buf + (2 * sizeof(*src) * npages);
> > > > +	pages = buf + (2 * sizeof(*src) + sizeof(*dma_addr)) *
> > > > npages;
> > > > +
> > > > +	err = gpusvm->ops->populate_vram_pfn(gpusvm, range-
> > > > > vram_allocation,
> > > > +					     npages, src);
> > > > +	if (err)
> > > > +		goto err_free;
> > > > +
> > > > +	err = migrate_device_vma_range(gpusvm->mm,
> > > > +				       gpusvm-
> > > > > device_private_page_owner, src,
> > > > +				       npages, range->va.start);
> > > > +	if (err)
> > > > +		goto err_free;
> > > > +
> > > > +	err = drm_gpusvm_migrate_populate_sram_pfn(NULL, npages,
> > > > src, dst, 0);
> > > > +	if (err)
> > > > +		goto err_finalize;
> > > > +
> > > > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> > > > dma_addr,
> > > > +					   dst, npages,
> > > > DMA_BIDIRECTIONAL);
> > > > +	if (err)
> > > > +		goto err_finalize;
> > > > +
> > > > +	for (i = 0; i < npages; ++i)
> > > > +		pages[i] = migrate_pfn_to_page(src[i]);
> > > > +
> > > > +	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr,
> > > > npages);
> > > > +	if (err)
> > > > +		goto err_finalize;
> > > > +
> > > > +err_finalize:
> > > > +	if (err)
> > > > +		drm_gpusvm_migration_put_pages(npages, dst);
> > > > +	migrate_device_pages(src, dst, npages);
> > > > +	migrate_device_finalize(src, dst, npages);
> > > > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev,
> > > > dma_addr,
> > > > npages,
> > > > +				       DMA_BIDIRECTIONAL);
> > > > +err_free:
> > > > +	kvfree(buf);
> > > > +err_out:
> > > > +
> > > > +	return err;
> > > > +}
> > > > +
> > > > +/**
> > > > + * __drm_gpusvm_migrate_to_sram - Migrate GPU SVM range to SRAM
> > > > (internal)
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @vas: Pointer to the VM area structure
> > > > + * @page: Pointer to the page for fault handling (can be NULL)
> > > > + * @start: Start address of the migration range
> > > > + * @end: End address of the migration range
> > > > + *
> > > > + * This internal function performs the migration of the specified GPU
> > > > + * SVM range to SRAM. It sets up the migration, populates + dma maps
> > > > + * SRAM PFNs, and invokes the driver-specific operations for migration
> > > > + * to SRAM.
> > > > + *
> > > > + * Returns:
> > > > + * 0 on success, negative error code on failure.
> > > > + */
> > > > +static int __drm_gpusvm_migrate_to_sram(struct drm_gpusvm
> > > > *gpusvm,
> > > > +					struct vm_area_struct
> > > > *vas,
> > > > +					struct page *page,
> > > > +					u64 start, u64 end)
> > > > +{
> > > > +	struct migrate_vma migrate = {
> > > > +		.vma		= vas,
> > > > +		.pgmap_owner	= gpusvm-
> > > > >device_private_page_owner,
> > > > +		.flags		=
> > > > MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> > > > +		.fault_page	= page,
> > > > +	};
> > > > +	unsigned long npages;
> > > > +	struct page **pages;
> > > > +	dma_addr_t *dma_addr;
> > > > +	void *buf;
> > > > +	int i, err = 0;
> > > > +
> > > > +	mmap_assert_locked(gpusvm->mm);
> > > > +
> > > > +	/* Corner case where the VMA has been partially unmapped */
> > > > +	if (start < vas->vm_start)
> > > > +		start = vas->vm_start;
> > > > +	if (end > vas->vm_end)
> > > > +		end = vas->vm_end;
> > > > +
> > > > +	migrate.start = start;
> > > > +	migrate.end = end;
> > > > +	npages = npages_in_range(start, end);
> > > > +
> > > > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) +
> > > > sizeof(*dma_addr) +
> > > > +		       sizeof(*pages), GFP_KERNEL);
> > > > +	if (!buf) {
> > > > +		err = -ENOMEM;
> > > > +		goto err_out;
> > > > +	}
> > > > +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > > > +	pages = buf + (2 * sizeof(*migrate.src) +
> > > > sizeof(*dma_addr))
> > > > * npages;
> > > > +
> > > > +	migrate.vma = vas;
> > > > +	migrate.src = buf;
> > > > +	migrate.dst = migrate.src + npages;
> > > > +
> > > > +	err = migrate_vma_setup(&migrate);
> > > > +	if (err)
> > > > +		goto err_free;
> > > > +
> > > > +	/* Raced with another CPU fault, nothing to do */
> > > > +	if (!migrate.cpages)
> > > > +		goto err_free;
> > > > +
> > > > +	err = drm_gpusvm_migrate_populate_sram_pfn(vas, npages,
> > > > +						   migrate.src,
> > > > migrate.dst,
> > > > +						   start);
> > > > +	if (err)
> > > > +		goto err_finalize;
> > > > +
> > > > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> > > > dma_addr,
> > > > +					   migrate.dst, npages,
> > > > +					   DMA_BIDIRECTIONAL);
> > > > +	if (err)
> > > > +		goto err_finalize;
> > > > +
> > > > +	for (i = 0; i < npages; ++i)
> > > > +		pages[i] = migrate_pfn_to_page(migrate.src[i]);
> > > > +
> > > > +	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr,
> > > > npages);
> > > > +	if (err)
> > > > +		goto err_finalize;
> > > > +
> > > > +err_finalize:
> > > > +	if (err)
> > > > +		drm_gpusvm_migration_put_pages(npages,
> > > > migrate.dst);
> > > > +	migrate_vma_pages(&migrate);
> > > > +	migrate_vma_finalize(&migrate);
> > > > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev,
> > > > dma_addr,
> > > > npages,
> > > > +				       DMA_BIDIRECTIONAL);
> > > > +err_free:
> > > > +	kvfree(buf);
> > > > +err_out:
> > > > +	mmap_assert_locked(gpusvm->mm);
> > > > +
> > > > +	return err;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_migrate_to_sram - Migrate (evict) GPU SVM range to SRAM
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @range: Pointer to the GPU SVM range structure
> > > > + * @ctx: GPU SVM context
> > > > + *
> > > > + * This function initiates the migration of the specified GPU SVM
> > > > + * range to SRAM. It performs necessary checks and invokes the
> > > > + * internal migration function for actual migration.
> > > > + *
> > > > + * Returns:
> > > > + * 0 on success, negative error code on failure.
> > > > + */
> > > > +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > > > +			       struct drm_gpusvm_range *range,
> > > > +			       const struct drm_gpusvm_ctx *ctx)
> > > > +{
> > > > +	u64 start = range->va.start, end = range->va.end;
> > > > +	struct mm_struct *mm = gpusvm->mm;
> > > > +	struct vm_area_struct *vas;
> > > > +	int err;
> > > > +	bool retry = false;
> > > > +
> > > > +	if (!ctx->mmap_locked) {
> > > > +		if (!mmget_not_zero(mm)) {
> > > > +			err = -EFAULT;
> > > > +			goto err_out;
> > > > +		}
> > > > +		if (ctx->trylock_mmap) {
> > > > +			if (!mmap_read_trylock(mm))  {
> > > > +				err =
> > > > drm_gpusvm_evict_to_sram(gpusvm, range);
> > > > +				goto err_mmput;
> > > > +			}
> > > > +		} else {
> > > > +			mmap_read_lock(mm);
> > > > +		}
> > > > +	}
> > > > +
> > > > +	mmap_assert_locked(mm);
> > > > +
> > > > +	/*
> > > > +	 * Loop required to find all VM area structs for the corner case
> > > > +	 * when VRAM backing has been partially unmapped from the MM's
> > > > +	 * address space.
> > > > +	 */
> > > > +again:
> > > > +	vas = find_vma(mm, start);
> > > > +	if (!vas) {
> > > > +		if (!retry)
> > > > +			err = -ENOENT;
> > > > +		goto err_mmunlock;
> > > > +	}
> > > > +
> > > > +	if (end <= vas->vm_start || start >= vas->vm_end) {
> > > > +		if (!retry)
> > > > +			err = -EINVAL;
> > > > +		goto err_mmunlock;
> > > > +	}
> > > > +
> > > > +	err = __drm_gpusvm_migrate_to_sram(gpusvm, vas, NULL,
> > > > start,
> > > > end);
> > > > +	if (err)
> > > > +		goto err_mmunlock;
> > > > +
> > > > +	if (vas->vm_end < end) {
> > > > +		retry = true;
> > > > +		start = vas->vm_end;
> > > > +		goto again;
> > > > +	}
> > > > +
> > > > +	if (!ctx->mmap_locked) {
> > > > +		mmap_read_unlock(mm);
> > > > +		/*
> > > > +		 * Using mmput_async as this function can be called while
> > > > +		 * holding a dma-resv lock, and a final put can grab the
> > > > +		 * mmap lock, causing a lock inversion.
> > > > +		 */
> > > > +		mmput_async(mm);
> > > > +	}
> > > > +
> > > > +	return 0;
> > > > +
> > > > +err_mmunlock:
> > > > +	if (!ctx->mmap_locked)
> > > > +		mmap_read_unlock(mm);
> > > > +err_mmput:
> > > > +	if (!ctx->mmap_locked)
> > > > +		mmput_async(mm);
> > > > +err_out:
> > > > +	return err;
> > > > +}
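
A note on the trylock_mmap flag, since the fallback is easy to miss: an
eviction path that already holds a dma-resv lock cannot unconditionally take
the mmap lock, so it would presumably do something like (sketch):

	struct drm_gpusvm_ctx ctx = {
		/* fall back to drm_gpusvm_evict_to_sram on contention */
		.trylock_mmap = true,
	};

	err = drm_gpusvm_migrate_to_sram(gpusvm, range, &ctx);
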
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_page_free - Put GPU SVM zone device data associated
> > > > + * with a page
> > > > + * @page: Pointer to the page
> > > > + *
> > > > + * This function is a callback used to put the GPU SVM zone device
> > > > + * data associated with a page when it is being released.
> > > > + */
> > > > +static void drm_gpusvm_page_free(struct page *page)
> > > > +{
> > > > +	drm_gpusvm_zdd_put(page->zone_device_data);
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_migrate_to_ram - Migrate GPU SVM range to RAM (page
> > > > + * fault handler)
> > > > + * @vmf: Pointer to the fault information structure
> > > > + *
> > > > + * This function is a page fault handler used to migrate a GPU SVM
> > > > + * range to RAM. It retrieves the GPU SVM range information from the
> > > > + * faulting page and invokes the internal migration function to
> > > > + * migrate the range back to RAM.
> > > > + *
> > > > + * Returns:
> > > > + * VM_FAULT_SIGBUS on failure, 0 on success.
> > > > + */
> > > > +static vm_fault_t drm_gpusvm_migrate_to_ram(struct vm_fault
> > > > *vmf)
> > > > +{
> > > > +	struct drm_gpusvm_zdd *zdd = vmf->page-
> > > > >zone_device_data;
> > > > +	int err;
> > > > +
> > > > +	err = __drm_gpusvm_migrate_to_sram(zdd->range->gpusvm,
> > > > +					   vmf->vma, vmf->page,
> > > > +					   zdd->range->va.start,
> > > > +					   zdd->range->va.end);
> > > > +
> > > > +	return err ? VM_FAULT_SIGBUS : 0;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_pagemap_ops - Device page map operations for GPU
> > > > SVM
> > > > + */
> > > > +static const struct dev_pagemap_ops drm_gpusvm_pagemap_ops = {
> > > > +	.page_free = drm_gpusvm_page_free,
> > > > +	.migrate_to_ram = drm_gpusvm_migrate_to_ram,
> > > > +};
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_pagemap_ops_get - Retrieve GPU SVM device page map
> > > > operations
> > > > + *
> > > > + * Returns:
> > > > + * Pointer to the GPU SVM device page map operations structure.
> > > > + */
> > > > +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void)
> > > > +{
> > > > +	return &drm_gpusvm_pagemap_ops;
> > > > +}
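
To double check how this is meant to be wired up: the returned ops would be
plugged into a MEMORY_DEVICE_PRIVATE dev_pagemap covering the device-private
PFN space for VRAM, something like the usual pattern below (my_vram_region
and res are placeholders for the driver's VRAM bookkeeping):

	struct dev_pagemap *pagemap = &my_vram_region->pagemap;
	void *addr;

	pagemap->type = MEMORY_DEVICE_PRIVATE;
	pagemap->range.start = res->start;
	pagemap->range.end = res->end;
	pagemap->nr_range = 1;
	pagemap->ops = drm_gpusvm_pagemap_ops_get();
	pagemap->owner = gpusvm->device_private_page_owner;

	addr = devm_memremap_pages(dev, pagemap);
	if (IS_ERR(addr))
		return PTR_ERR(addr);
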
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_has_mapping - Check if GPU SVM has mapping for the
> > > > + * given address range
> > > > + * @gpusvm: Pointer to the GPU SVM structure.
> > > > + * @start: Start address
> > > > + * @end: End address
> > > > + *
> > > > + * Returns:
> > > > + * True if GPU SVM has mapping, False otherwise
> > > > + */
> > > > +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64
> > > > start,
> > > > u64 end)
> > > > +{
> > > > +	struct drm_gpusvm_notifier *notifier;
> > > > +
> > > > +	drm_gpusvm_for_each_notifier(notifier, gpusvm, start,
> > > > end) {
> > > > +		struct drm_gpusvm_range *range = NULL;
> > > > +
> > > > +		drm_gpusvm_for_each_range(range, notifier,
> > > > start,
> > > > end)
> > > > +			return true;
> > > > +	}
> > > > +
> > > > +	return false;
> > > > +}
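
Minor usage observation: this looks like the hook a driver would use to
reject, for example, a BO bind over a VA window already backed by SVM ranges
(sketch):

	if (drm_gpusvm_has_mapping(gpusvm, addr, addr + size))
		return -EBUSY;	/* VA already has SVM mappings */
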
> > > > diff --git a/drivers/gpu/drm/xe/drm_gpusvm.h
> > > > b/drivers/gpu/drm/xe/drm_gpusvm.h
> > > > new file mode 100644
> > > > index 000000000000..0ea70f8534a8
> > > > --- /dev/null
> > > > +++ b/drivers/gpu/drm/xe/drm_gpusvm.h
> > > > @@ -0,0 +1,415 @@
> > > > +/* SPDX-License-Identifier: MIT */
> > > > +/*
> > > > + * Copyright © 2024 Intel Corporation
> > > > + */
> > > > +
> > > > +#ifndef __DRM_GPUSVM_H__
> > > > +#define __DRM_GPUSVM_H__
> > > > +
> > > > +#include <linux/kref.h>
> > > > +#include <linux/mmu_notifier.h>
> > > > +#include <linux/workqueue.h>
> > > > +
> > > > +struct dev_pagemap_ops;
> > > > +struct drm_device;
> > > > +struct drm_gpusvm;
> > > > +struct drm_gpusvm_notifier;
> > > > +struct drm_gpusvm_ops;
> > > > +struct drm_gpusvm_range;
> > > > +
> > > > +/**
> > > > + * struct drm_gpusvm_ops - Operations structure for GPU SVM
> > > > + *
> > > > + * This structure defines the operations for GPU Shared Virtual
> > > > + * Memory (SVM). These operations are provided by the GPU driver to
> > > > + * manage SVM ranges and perform operations such as migration between
> > > > + * VRAM and system RAM.
> > > > + */
> > > > +struct drm_gpusvm_ops {
> > > > +	/**
> > > > +	 * @notifier_alloc: Allocate a GPU SVM notifier
> > > > (optional)
> > > > +	 *
> > > > +	 * This function shall allocate a GPU SVM notifier.
> > > > +	 *
> > > > +	 * Returns:
> > > > +	 * Pointer to the allocated GPU SVM notifier on success,
> > > > NULL on failure.
> > > > +	 */
> > > > +	struct drm_gpusvm_notifier *(*notifier_alloc)(void);
> > > > +
> > > > +	/**
> > > > +	 * @notifier_free: Free a GPU SVM notifier (optional)
> > > > +	 * @notifier: Pointer to the GPU SVM notifier to be
> > > > freed
> > > > +	 *
> > > > +	 * This function shall free a GPU SVM notifier.
> > > > +	 */
> > > > +	void (*notifier_free)(struct drm_gpusvm_notifier
> > > > *notifier);
> > > > +
> > > > +	/**
> > > > +	 * @range_alloc: Allocate a GPU SVM range (optional)
> > > > +	 * @gpusvm: Pointer to the GPU SVM
> > > > +	 *
> > > > +	 * This function shall allocate a GPU SVM range.
> > > > +	 *
> > > > +	 * Returns:
> > > > +	 * Pointer to the allocated GPU SVM range on success,
> > > > NULL
> > > > on failure.
> > > > +	 */
> > > > +	struct drm_gpusvm_range *(*range_alloc)(struct
> > > > drm_gpusvm
> > > > *gpusvm);
> > > > +
> > > > +	/**
> > > > +	 * @range_free: Free a GPU SVM range (optional)
> > > > +	 * @range: Pointer to the GPU SVM range to be freed
> > > > +	 *
> > > > +	 * This function shall free a GPU SVM range.
> > > > +	 */
> > > > +	void (*range_free)(struct drm_gpusvm_range *range);
> > > > +
> > > > +	/**
> > > > +	 * @vram_release: Release VRAM allocation (optional)
> > > > +	 * @vram_allocation: Driver-private pointer to the VRAM
> > > > allocation
> > > > +	 *
> > > > +	 * This function shall release the VRAM allocation and is expected
> > > > +	 * to drop a reference to the VRAM allocation.
> > > > +	 */
> > > > +	void (*vram_release)(void *vram_allocation);
> > > > +
> > > > +	/**
> > > > +	 * @populate_vram_pfn: Populate VRAM PFN (required for
> > > > migration)
> > > > +	 * @gpusvm: Pointer to the GPU SVM
> > > > +	 * @vram_allocation: Driver-private pointer to the VRAM
> > > > allocation
> > > > +	 * @npages: Number of pages to populate
> > > > +	 * @pfn: Array of page frame numbers to populate
> > > > +	 *
> > > > +	 * This function shall populate VRAM page frame numbers
> > > > (PFN).
> > > > +	 *
> > > > +	 * Returns:
> > > > +	 * 0 on success, a negative error code on failure.
> > > > +	 */
> > > > +	int (*populate_vram_pfn)(struct drm_gpusvm *gpusvm,
> > > > +				 void *vram_allocation,
> > > > +				 unsigned long npages,
> > > > +				 unsigned long *pfn);
> > > > +
> > > > +	/**
> > > > +	 * @copy_to_vram: Copy to VRAM (required for migration)
> > > > +	 * @gpusvm: Pointer to the GPU SVM
> > > > +	 * @pages: Pointer to array of VRAM pages (destination)
> > > > +	 * @dma_addr: Pointer to array of DMA addresses (source)
> > > > +	 * @npages: Number of pages to copy
> > > > +	 *
> > > > +	 * This function shall copy pages to VRAM.
> > > > +	 *
> > > > +	 * Returns:
> > > > +	 * 0 on success, a negative error code on failure.
> > > > +	 */
> > > > +	int (*copy_to_vram)(struct drm_gpusvm *gpusvm,
> > > > +			    struct page **pages,
> > > > +			    dma_addr_t *dma_addr,
> > > > +			    unsigned long npages);
> > > > +
> > > > +	/**
> > > > +	 * @copy_to_sram: Copy to system RAM (required for
> > > > migration)
> > > > +	 * @gpusvm: Pointer to the GPU SVM
> > > > +	 * @pages: Pointer to array of VRAM pages (source)
> > > > +	 * @dma_addr: Pointer to array of DMA addresses
> > > > (destination)
> > > > +	 * @npages: Number of pages to copy
> > > > +	 *
> > > > +	 * This function shall copy pages to system RAM.
> > > > +	 *
> > > > +	 * Returns:
> > > > +	 * 0 on success, a negative error code on failure.
> > > > +	 */
> > > > +	int (*copy_to_sram)(struct drm_gpusvm *gpusvm,
> > > > +			    struct page **pages,
> > > > +			    dma_addr_t *dma_addr,
> > > > +			    unsigned long npages);
> > > > +
> > > > +	/**
> > > > +	 * @invalidate: Invalidate GPU SVM notifier (required)
> > > > +	 * @gpusvm: Pointer to the GPU SVM
> > > > +	 * @notifier: Pointer to the GPU SVM notifier
> > > > +	 * @mmu_range: Pointer to the mmu_notifier_range
> > > > structure
> > > > +	 *
> > > > +	 * This function shall invalidate the GPU page tables.
> > > > It
> > > > can safely
> > > > +	 * walk the notifier range RB tree/list in this
> > > > function.
> > > > Called while
> > > > +	 * holding the notifier lock.
> > > > +	 */
> > > > +	void (*invalidate)(struct drm_gpusvm *gpusvm,
> > > > +			   struct drm_gpusvm_notifier *notifier,
> > > > +			   const struct mmu_notifier_range
> > > > *mmu_range);
> > > > +};
> > > > +
> > > > +/**
> > > > + * struct drm_gpusvm_notifier - Structure representing a GPU SVM
> > > > notifier
> > > > + *
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @notifier: MMU interval notifier
> > > > + * @interval: Interval for the notifier
> > > > + * @rb: Red-black tree node for the parent GPU SVM structure
> > > > notifier tree
> > > > + * @root: Cached root node of the RB tree containing ranges
> > > > + * @range_list: List head containing of ranges in the same order
> > > > they appear in
> > > > + *              interval tree. This is useful to keep iterating
> > > > ranges while
> > > > + *              doing modifications to RB tree.
> > > > + * @flags.removed: Flag indicating whether the MMU interval
> > > > notifier
> > > > has been
> > > > + *                 removed
> > > > + *
> > > > + * This structure represents a GPU SVM notifier.
> > > > + */
> > > > +struct drm_gpusvm_notifier {
> > > > +	struct drm_gpusvm *gpusvm;
> > > > +	struct mmu_interval_notifier notifier;
> > > > +	struct {
> > > > +		u64 start;
> > > > +		u64 end;
> > > > +	} interval;
> > > > +	struct {
> > > > +		struct rb_node node;
> > > > +		struct list_head entry;
> > > > +		u64 __subtree_last;
> > > > +	} rb;
> > > > +	struct rb_root_cached root;
> > > > +	struct list_head range_list;
> > > > +	struct {
> > > > +		u32 removed : 1;
> > > > +	} flags;
> > > > +};
> > > > +
> > > > +/**
> > > > + * struct drm_gpusvm_range - Structure representing a GPU SVM
> > > > range
> > > > + *
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @notifier: Pointer to the GPU SVM notifier
> > > > + * @refcount: Reference count for the range
> > > > + * @rb: Red-black tree node for the parent GPU SVM notifier
> > > > structure range tree
> > > > + * @va: Virtual address range
> > > > + * @notifier_seq: Notifier sequence number of the range's pages
> > > > + * @pages: Pointer to the array of pages (if backing store is in
> > > > VRAM)
> > > > + * @dma_addr: DMA address array (if backing store is SRAM and
> > > > DMA
> > > > mapped)
> > > > + * @vram_allocation: Driver-private pointer to the VRAM
> > > > allocation
> > > > + * @order: Order of dma mapping. i.e. PAGE_SIZE << order is
> > > > mapping
> > > > size
> > > > + * @flags.migrate_vram: Flag indicating whether the range can be
> > > > migrated to VRAM
> > > > + * @flags.unmapped: Flag indicating if the range has been
> > > > unmapped
> > > > + * @flags.partial_unmap: Flag indicating if the range has been
> > > > partially unmapped
> > > > + * @flags.has_vram_pages: Flag indicating if the range has vram
> > > > pages
> > > > + * @flags.has_dma_mapping: Flag indicating if the range has a
> > > > DMA
> > > > mapping
> > > > + * @flags.kfree_mapping: Flag indicating @dma_addr is a compact
> > > > allocation based
> > > > + *                       on @order which releases via kfree
> > > > + *
> > > > + * This structure represents a GPU SVM range used for tracking
> > > > memory ranges
> > > > + * mapped in a DRM device.
> > > > + */
> > > > +struct drm_gpusvm_range {
> > > > +	struct drm_gpusvm *gpusvm;
> > > > +	struct drm_gpusvm_notifier *notifier;
> > > > +	struct kref refcount;
> > > > +	struct {
> > > > +		struct rb_node node;
> > > > +		struct list_head entry;
> > > > +		u64 __subtree_last;
> > > > +	} rb;
> > > > +	struct {
> > > > +		u64 start;
> > > > +		u64 end;
> > > > +	} va;
> > > > +	unsigned long notifier_seq;
> > > > +	union {
> > > > +		struct page **pages;
> > > > +		dma_addr_t *dma_addr;
> > > > +	};
> > > > +	void *vram_allocation;
> > > > +	u16 order;
> > > > +	struct {
> > > > +		/* All flags below must be set upon creation */
> > > > +		u16 migrate_vram : 1;
> > > > +		/* All flags below must be set / cleared under
> > > > notifier lock */
> > > > +		u16 unmapped : 1;
> > > > +		u16 partial_unmap : 1;
> > > > +		u16 has_vram_pages : 1;
> > > > +		u16 has_dma_mapping : 1;
> > > > +		u16 kfree_mapping : 1;
> > > > +	} flags;
> > > > +};
> > > > +
> > > > +/**
> > > > + * struct drm_gpusvm - GPU SVM structure
> > > > + *
> > > > + * @name: Name of the GPU SVM
> > > > + * @drm: Pointer to the DRM device structure
> > > > + * @mm: Pointer to the mm_struct for the address space
> > > > + * @device_private_page_owner: Device private pages owner
> > > > + * @mm_start: Start address of GPU SVM
> > > > + * @mm_range: Range of the GPU SVM
> > > > + * @notifier_size: Size of individual notifiers
> > > > + * @ops: Pointer to the operations structure for GPU SVM
> > > > + * @chunk_sizes: Pointer to the array of chunk sizes used in
> > > > range
> > > > allocation.
> > > > + *               Entries should be powers of 2 in descending
> > > > order.
> > > > + * @num_chunks: Number of chunks
> > > > + * @notifier_lock: Read-write semaphore for protecting notifier
> > > > operations
> > > > + * @zdd_wq: Workqueue for deferred work on zdd destruction
> > > > + * @root: Cached root node of the Red-Black tree containing GPU
> > > > SVM
> > > > notifiers
> > > > + * @notifier_list: list head containing of notifiers in the same
> > > > order they
> > > > + *                 appear in interval tree. This is useful to
> > > > keep
> > > > iterating
> > > > + *                 notifiers while doing modifications to RB
> > > > tree.
> > > > + *
> > > > + * This structure represents a GPU SVM (Shared Virtual Memory)
> > > > used
> > > > for tracking
> > > > + * memory ranges mapped in a DRM (Direct Rendering Manager)
> > > > device.
> > > > + *
> > > > + * No reference counting is provided, as this is expected to be
> > > > embedded in the
> > > > + * driver VM structure along with the struct drm_gpuvm, which
> > > > handles reference
> > > > + * counting.
> > > > + */
> > > > +struct drm_gpusvm {
> > > > +	const char *name;
> > > > +	struct drm_device *drm;
> > > > +	struct mm_struct *mm;
> > > > +	void *device_private_page_owner;
> > > > +	u64 mm_start;
> > > > +	u64 mm_range;
> > > > +	u64 notifier_size;
> > > > +	const struct drm_gpusvm_ops *ops;
> > > > +	const u64 *chunk_sizes;
> > > > +	int num_chunks;
> > > > +	struct rw_semaphore notifier_lock;
> > > > +	struct workqueue_struct *zdd_wq;
> > > > +	struct rb_root_cached root;
> > > > +	struct list_head notifier_list;
> > > > +};
> > > > +
> > > > +/**
> > > > + * struct drm_gpusvm_ctx - DRM GPU SVM context
> > > > + *
> > > > + * @mmap_locked: mmap lock is locked
> > > > + * @trylock_mmap: trylock mmap lock, used to avoid locking
> > > > inversions
> > > > + *                (e.g. dma-resv -> mmap lock)
> > > > + * @in_notifier: entering from a MMU notifier
> > > > + * @read_only: operating on read-only memory
> > > > + * @vram_possible: possible to use VRAM
> > > > + * @prefault: prefault pages
> > > > + *
> > > > + * Context that DRM GPUSVM is operating in (i.e. user
> > > > arguments).
> > > > + */
> > > > +struct drm_gpusvm_ctx {
> > > > +	u32 mmap_locked :1;
> > > > +	u32 trylock_mmap :1;
> > > > +	u32 in_notifier :1;
> > > > +	u32 read_only :1;
> > > > +	u32 vram_possible :1;
> > > > +	u32 prefault :1;
> > > > +};
> > > > +
> > > > +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> > > > +		    const char *name, struct drm_device *drm,
> > > > +		    struct mm_struct *mm, void
> > > > *device_private_page_owner,
> > > > +		    u64 mm_start, u64 mm_range, u64
> > > > notifier_size,
> > > > +		    const struct drm_gpusvm_ops *ops,
> > > > +		    const u64 *chunk_sizes, int num_chunks);
> > > > +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm);
> > > > +void drm_gpusvm_free(struct drm_gpusvm *gpusvm);
> > > > +
> > > > +struct drm_gpusvm_range *
> > > > +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64
> > > > fault_addr,
> > > > +				u64 gpuva_start, u64 gpuva_end,
> > > > +				const struct drm_gpusvm_ctx
> > > > *ctx);
> > > > +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> > > > +			     struct drm_gpusvm_range *range);
> > > > +
> > > > +struct drm_gpusvm_range *
> > > > +drm_gpusvm_range_get(struct drm_gpusvm_range *range);
> > > > +void drm_gpusvm_range_put(struct drm_gpusvm_range *range);
> > > > +
> > > > +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> > > > +				  struct drm_gpusvm_range
> > > > *range);
> > > > +
> > > > +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> > > > +			       struct drm_gpusvm_range *range,
> > > > +			       const struct drm_gpusvm_ctx
> > > > *ctx);
> > > > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> > > > +				  struct drm_gpusvm_range
> > > > *range,
> > > > +				  const struct drm_gpusvm_ctx
> > > > *ctx);
> > > > +
> > > > +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> > > > +			       struct drm_gpusvm_range *range,
> > > > +			       void *vram_allocation,
> > > > +			       const struct drm_gpusvm_ctx
> > > > *ctx);
> > > > +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > > > +			       struct drm_gpusvm_range *range,
> > > > +			       const struct drm_gpusvm_ctx
> > > > *ctx);
> > > > +
> > > > +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void);
> > > > +
> > > > +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64
> > > > start,
> > > > u64 end);
> > > > +
> > > > +struct drm_gpusvm_range *
> > > > +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64
> > > > start, u64 end);
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_notifier_lock - Lock GPU SVM notifier
> > > > + * @gpusvm__: Pointer to the GPU SVM structure.
> > > > + *
> > > > + * Abstract client usage GPU SVM notifier lock, take lock
> > > > + */
> > > > +#define drm_gpusvm_notifier_lock(gpusvm__)	\
> > > > +	down_read(&(gpusvm__)->notifier_lock)
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_notifier_unlock - Unlock GPU SVM notifier
> > > > + * @gpusvm__: Pointer to the GPU SVM structure.
> > > > + *
> > > > + * Abstract client usage GPU SVM notifier lock, drop lock
> > > > + */
> > > > +#define drm_gpusvm_notifier_unlock(gpusvm__)	\
> > > > +	up_read(&(gpusvm__)->notifier_lock)
> > > > +
> > > > +/**
> > > > + * __drm_gpusvm_range_next - Get the next GPU SVM range in the
> > > > list
> > > > + * @range: a pointer to the current GPU SVM range
> > > > + *
> > > > + * Return: A pointer to the next drm_gpusvm_range if available,
> > > > or
> > > > NULL if the
> > > > + *         current range is the last one or if the input range
> > > > is
> > > > NULL.
> > > > + */
> > > > +static inline struct drm_gpusvm_range *
> > > > +__drm_gpusvm_range_next(struct drm_gpusvm_range *range)
> > > > +{
> > > > +	if (range && !list_is_last(&range->rb.entry,
> > > > +				   &range->notifier-
> > > > >range_list))
> > > > +		return list_next_entry(range, rb.entry);
> > > > +
> > > > +	return NULL;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_for_each_range - Iterate over GPU SVM ranges in a
> > > > notifier
> > > > + * @range__: Iterator variable for the ranges. If set, it
> > > > indicates
> > > > the start of
> > > > + *	     the iterator. If NULL, call drm_gpusvm_range_find()
> > > > to
> > > > get the range.
> > > > + * @notifier__: Pointer to the GPU SVM notifier
> > > > + * @start__: Start address of the range
> > > > + * @end__: End address of the range
> > > > + *
> > > > + * This macro is used to iterate over GPU SVM ranges in a
> > > > notifier.
> > > > It is safe
> > > > + * to use while holding the driver SVM lock or the notifier
> > > > lock.
> > > > + */
> > > > +#define drm_gpusvm_for_each_range(range__, notifier__, start__,
> > > > end__)	\
> > > > +	for ((range__) = (range__)
> > > > ?:					\
> > > > +	     drm_gpusvm_range_find((notifier__), (start__),
> > > > (end__));	\
> > > > +	     (range__) && (range__->va.start <
> > > > (end__));		\
> > > > +	     (range__) = __drm_gpusvm_range_next(range__))
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_set_unmapped - Mark a GPU SVM range as
> > > > unmapped
> > > > + * @range: Pointer to the GPU SVM range structure.
> > > > + * @mmu_range: Pointer to the MMU notifier range structure.
> > > > + *
> > > > + * This function marks a GPU SVM range as unmapped and sets the
> > > > partial_unmap flag
> > > > + * if the range partially falls within the provided MMU notifier
> > > > range.
> > > > + */
> > > > +static inline void
> > > > +drm_gpusvm_range_set_unmapped(struct drm_gpusvm_range *range,
> > > > +			      const struct mmu_notifier_range
> > > > *mmu_range)
> > > > +{
> > > > +	lockdep_assert_held_write(&range->gpusvm-
> > > > >notifier_lock);
> > > > +
> > > > +	range->flags.unmapped = true;
> > > > +	if (range->va.start < mmu_range->start ||
> > > > +	    range->va.end > mmu_range->end)
> > > > +		range->flags.partial_unmap = true;
> > > > +}
> > > > +
> > > > +#endif /* __DRM_GPUSVM_H__ */
> > > 
> 

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 23/28] drm/xe: Add SVM VRAM migration
  2024-08-29  9:24     ` Christian König
  2024-08-29  9:53       ` Thomas Hellström
@ 2024-08-29 21:48       ` Matthew Brost
  2024-09-02 13:02         ` Daniel Vetter
  1 sibling, 1 reply; 100+ messages in thread
From: Matthew Brost @ 2024-08-29 21:48 UTC (permalink / raw)
  To: Christian König
  Cc: Daniel Vetter, intel-xe, dri-devel, airlied, thomas.hellstrom,
	matthew.auld, daniel, Paneer Selvam, Arunpravin

On Thu, Aug 29, 2024 at 11:24:26AM +0200, Christian König wrote:
>    On 28.08.24 at 18:06, Daniel Vetter wrote:
> 

A lot to unpack here. Will try to address as much as I can in this
single reply to both of you (Daniel, Christian).

> On Tue, Aug 27, 2024 at 07:48:56PM -0700, Matthew Brost wrote:
> 
> Migration is implemented with range granularity, with VRAM backing being
> a VM private TTM BO (i.e., shares dma-resv with VM). The lifetime of the
> TTM BO is limited to when the SVM range is in VRAM (i.e., when a VRAM
> SVM range is migrated to SRAM, the TTM BO is destroyed).
> 
> The design choice for using TTM BO for VRAM backing store, as opposed to
> direct buddy allocation, is as follows:
> 
> - DRM buddy allocations are not at page granularity, offering no
>   advantage over a BO.
> 
> This one I'm not understanding.
>
>    Adding Arun as well. I couldn't understand it fully either, but maybe
>    it's because the buddy allocator is more optimized for higher orders of
>    allocations?
> 

As currently written, a BO VRAM allocation resolves to a DRM buddy
allocation. Unless there is memory pressure this is likely a single buddy
block if the allocation is aligned (SVM should basically always be doing
aligned allocations, with the common case being 2M at a time).

A colleague of mine suggested in earlier revs that allocating directly
from the DRM buddy pool provides a benefit wrt freeing a page at a time.
It doesn't, given that even if you bypass a BO you are most likely going
to get a single buddy block which is larger than a page. In either case
you need to refcount the allocation or do some wild splitting algorithm
(I don't want to do that). Alternatively, you could write a new buddy
allocator which can easily cope with freeing a page at a time (I don't
want to do that either).

Lastly, the common case for dev_pagemap_ops.page_free callbacks is going
to be consecutive calls spanning the entire allocation (e.g. eviction or a
CPU fault which triggers migration).
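
To make the refcount point concrete, roughly the shape I have in mind
(illustrative sketch only; struct xe_svm_vram and the helper names below
are made up for this example, not what the series actually uses):

struct xe_svm_vram {
	struct kref refcount;	/* one reference per outstanding device page */
	/* backing store goes here, e.g. a single 2M buddy block or a BO */
};

static void xe_svm_vram_release(struct kref *kref)
{
	struct xe_svm_vram *vram = container_of(kref, struct xe_svm_vram,
						refcount);

	/* Last page_free for this allocation: release the backing store */
	kfree(vram);
}

/* dev_pagemap_ops.page_free, called once per device page */
static void xe_svm_page_free(struct page *page)
{
	struct xe_svm_vram *vram = page->zone_device_data;

	kref_put(&vram->refcount, xe_svm_vram_release);
}

Since page_free typically arrives as consecutive calls over the whole
allocation, the final put lands right at the end of an eviction or
CPU-fault migration and the backing store is freed exactly once.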

> 
> 
> - DRM buddy allocations do not solve locking inversion problems between
>   mmap lock and dma-resv locks.
> 
> Which mmap -> dma_resv inversion? I've seen a lot ... I guess it also
> matters hugely which migration path we're in, i.e. opportunistic
> migration, cpu fault where we have to migrate or die, or when we run out
> of vram and need to evict stuff to make space.
> 
>    Mhm I think the locking order between mmap lock and dma-resv lock is
>    well defined since dma_resv_lockdep() was added.
>

Yes. I also solved the inversion issue by using migrate_device_*. At one
point I had trylocking of the mmap lock (it's still kind of there), but
based on Daniel's feedback I have agreed to rip that out.
 
> - Unified eviction is required (SVM VRAM and TTM BOs need to be able to
>   evict each other).
> 
> So core mm handles this by just roughly equally shrinking everything.
> Seems to work, and it has a pile of object shrinkers, and the page lru is
> also split into page cache and anon memory.
> 
> I think you need to put in more justification that unified eviction is
> required than just stating it, because a look at mm/ gives a very well
> established counterexample.
> 
> 
> - For exhaustive eviction [1], SVM VRAM allocations will almost certainly
>   require a dma-resv.
> 
> So from the TTM side we need exhaustive eviction, or at least something a
> bit more exhaustive than what ttm currently has. Note that i915-gem also
> never really got to perfect exhaustive eviction, it's just a pile better
> than ttm right now.
> 
>    Please define what exhaustive eviction should mean? I think I know what
>    it is and I have been pushing TTM into the direction of solving this
>    for years.
>    The last missing puzzle piece is to use drm_exec for TTM evictions, but
>    apart from that everything should work now.
>    Regards,
>    Christian.

I think Thomas has defined this in his replies. He also touches on how our
SVM design allows mixing user BO mappings and SVM mappings within the
same VM. These need to be able to fairly evict each other. A dma-resv
lock provides a level of fairness and ensures forward progress once a
flavor of his series lands.

Also worth noting: in addition to user BOs, we have kernel BOs (for page
tables, user exec queues, etc.) in Xe which absolutely need to be able
to evict something or the application dies.

> 
> Now if there's also SVM VRAM managed on a page lru, TTM exhaustive

From my understanding, the page LRU isn't used for device pages.

> eviction is going to win because the shrinkers can only trylock dma_resv.
> So this part works. It actually works so well on the system memory side
> that if we're not careful we can trigger oom, because we're too good at
> getting at all the memory.
> 
> SVM VRAM allocations otoh do not need exhaustive evictions. Or at least I
> don't see why, because the idea is that thanks to gpu and cpu page faults,
> you can always get out of a pinch by just trashing everything for a while
> and migrating the handfull of available pages a lot.
> 
> 
> - Likely allocation size is 2M which makes of size of BO (872)
>   acceptable per allocation (872 / 2M == .0004158).
> 
> With this, using TTM BO for VRAM backing store seems to be an obvious
> choice as it allows leveraging of the TTM eviction code.
> 
> Except it requires that you hold dma_resv, which brings in all kinds of
> pain. And for eviction we really don't need a lot of synchronization, so a

Yes, but I think I have solved all those issues wrt dma-resv.

What really is the alternative here? Teaching TTM to evict non-BO SVM
allocations? Writing an SVM VRAM allocator which ends up looking exactly
like TTM and teaching it to evict TTM BOs? In the latter case we'd still
need to grab the dma-resv lock...

Do we write a new page-based buddy allocator and wire that into TTM
whenever SVM could possibly be used?

This would be tons of code, and I'm not really sure what the ROI is here.

> lot of that locking is not needed, unlike the case where we have a cpu
> fault, where we absolutely need mmap_lock and all that to make sure we
> fault in the right page.
> 
> But for eviction we only need to throw out some pages, if we're not
> entirely precise with picking the right ones (or have no idea into which
> vma they're all currently mapped into) it doesn't matter. That's why
> migrate_device_pages doesn't care about any of that at all, it doesn't
> need to by design. But by bo backing memory you drag in all that stuff
> that's causing headacheds for eviction.
> 
> The only thing migration tries to do is remove all pte, and if that
> succeeds, move the page. Specialized for the gpusvm case, looking at mm/
> code as cheat sheet, we need roughly:
> 
> - reverse mapping structure like anon_vma. Except gpusvm can assume that
>   there's currently only one gpu side mapping, so we can just stuff the
>   gpusvm an va_address into the page, and protect it with the page lock.
> 
> - we need pagetable locks, so that we can manipulate pagetables (well
>   specifically make ptes invalid) without taking any other locks.
> 
> - everyone else inserting or removing ptes for svm mappings also needs to
>   lock the page, or we have races. This might be the hmm_range_fault races
>   you're seeing when allowing vram pages, since I don't think there's
>   anything else stopping the page lookup otherwise from succeeding.

AMD looks to take the range->migration_mutex to prevent these races.

> 
> - we might also need to stuff migrate ptes into the gpu side, like the cpu
>   does, to hold up refaults before the migration has finished. But I think
>   those are only needed for anon memory in sram because there's no other
>   way to find the right page than swap pte entries, of which migration
>   entries are a special case.
> 
> - core code also expects us to handle the page refcount correctly for svm
>   device memory, so we can't free the pages like normal bo pages either
>   directly to drm_buddy.
> 
> Now typing this all up will look an awful lot like what you have, with the
> dma_resv lock serving as the page lock and the pagetable lock. The only

dma_resv is indeed one of the locks we need for page table updates (binds),
as we allocate TTM BOs for page tables and we install fences for binds
in dma-resv slots (certainly for non-SVM; we might be able to drop that
for SVM).

> reason is that these locks are much smaller and nest within all the other
> stuff going on and so avoid the inversion issues.
> 
> So one annoying part is that this is a lot of pointlessly looking typing.
> The other is that it's full of races, because core mm really is yolo all
> the way down. So lots of ways you lock the wrong page and fun stuff like
> that, but the few cases that matter work:
> 
> - svm fault handling with hmm_range fault retries with mmu notifiers. Note
>   that we need to have vram pages locked and the notifier retrie needs to
>   be under the pagetable lock, or there's room to escape. At least that's
>   what I came up with last time I thought it all through.
>

We grab the gpusvm->notifier lock just before committing the bind and
check for retry. If we need to retry, we completely unwind all locks and
restart the GPU fault.
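
In rough pseudo-C, the check-and-retry flow looks something like this
(illustrative sketch only; xe_svm_commit_bind() is a placeholder for
whatever actually writes the page tables, and the full lock unwind /
refault is compressed into a simple goto):

static int xe_svm_bind_retry_sketch(struct drm_gpusvm *gpusvm,
				    struct drm_gpusvm_range *range,
				    const struct drm_gpusvm_ctx *ctx)
{
	int err;

retry:
	err = drm_gpusvm_range_get_pages(gpusvm, range, ctx);
	if (err)
		return err;

	drm_gpusvm_notifier_lock(gpusvm);
	if (!drm_gpusvm_range_pages_valid(gpusvm, range)) {
		/* Notifier fired since get_pages; unwind and refault */
		drm_gpusvm_notifier_unlock(gpusvm);
		goto retry;
	}
	err = xe_svm_commit_bind(range);	/* placeholder for the PT update */
	drm_gpusvm_notifier_unlock(gpusvm);

	return err;
}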
 
> - migrate_to_ram: it will hold a page reference which we know was the
>   valid vram page when the cpu pte was locked, but it might not be it
>   anymore. So we have to lock the page and check whether it's still gpu
>   mapped, and if not retry the entire fault since most likey another
>   migrate_to_ram has succeed meanwhile in parallel.
> 
> - for eviction we don't care, we might actually be migrating a page no one
>   even wants anymore.
> 
> Now I think you can get all this done with the dma_resv lock and maybe the
> bo refcount. But it does involve a tremendous amount of headaches and

I don't think the headaches are too bad...

> impendence mismatch, because that's not how page faults and migrations
> work in core mm.

Agreed, there is a bit of an impedance mismatch, but as noted above I
can't really think of a better solution without thousands of lines of new
code and invasive changes across the subsystem.

What I have in place appears to work with very few code changes to Xe
and none to TTM. AMD also landed on a BO, likely for similar reasons to
those I have laid out.

Matt

> 
> Cheers, Sima
> 
> 
> Current migration policy is migrate any SVM range greater than or equal
> to 64k once.
> 
> [1] https://patchwork.freedesktop.org/series/133643/
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_svm.c | 81 ++++++++++++++++++++++++++++++++++++-
>  drivers/gpu/drm/xe/xe_svm.h |  1 +
>  2 files changed, 81 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
> index 4372c02a341f..fd8987e0a506 100644
> --- a/drivers/gpu/drm/xe/xe_svm.c
> +++ b/drivers/gpu/drm/xe/xe_svm.c
> @@ -217,8 +217,13 @@ static void xe_svm_invalidate(struct drm_gpusvm *gpusvm,
>  static int __xe_svm_garbage_collector(struct xe_vm *vm,
>                                       struct xe_svm_range *range)
>  {
> +       struct drm_gpusvm_ctx ctx = {};
>         struct dma_fence *fence;
> 
> +       /* Evict any pages holding references to vram allocation */
> +       if (range->base.flags.partial_unmap && IS_DGFX(vm->xe))
> +               drm_gpusvm_migrate_to_sram(&vm->svm.gpusvm, &range->base, &ctx);
> +
>         xe_vm_lock(vm, false);
>         fence = xe_vm_range_unbind(vm, range);
>         xe_vm_unlock(vm);
> @@ -504,21 +509,77 @@ static bool xe_svm_range_is_valid(struct xe_svm_range *range,
>         return (range->tile_present & ~range->tile_invalidated) & BIT(tile->id);
>  }
> 
> +static struct xe_mem_region *tile_to_mr(struct xe_tile *tile)
> +{
> +       return &tile->mem.vram;
> +}
> +
> +static struct xe_bo *xe_svm_alloc_vram(struct xe_vm *vm, struct xe_tile *tile,
> +                                      struct xe_svm_range *range,
> +                                      const struct drm_gpusvm_ctx *ctx)
> +{
> +       struct xe_mem_region *mr = tile_to_mr(tile);
> +       struct drm_buddy_block *block;
> +       struct list_head *blocks;
> +       struct xe_bo *bo;
> +       ktime_t end = 0;
> +       int err;
> +
> +retry:
> +       xe_vm_lock(vm, false);
> +       bo = xe_bo_create(tile_to_xe(tile), tile, vm, range->base.va.end -
> +                         range->base.va.start, ttm_bo_type_device,
> +                         XE_BO_FLAG_VRAM_IF_DGFX(tile) |
> +                         XE_BO_FLAG_SYSTEM_ALLOC | XE_BO_FLAG_SKIP_CLEAR);
> +       xe_vm_unlock(vm);
> +       if (IS_ERR(bo)) {
> +               err = PTR_ERR(bo);
> +               if (xe_vm_validate_should_retry(NULL, err, &end))
> +                       goto retry;
> +               return bo;
> +       }
> +
> +       blocks = &to_xe_ttm_vram_mgr_resource(bo->ttm.resource)->blocks;
> +       list_for_each_entry(block, blocks, link)
> +               block->private = mr;
> +
> +       /*
> +        * Take ref because as soon as drm_gpusvm_migrate_to_vram succeeds the
> +        * creation ref can be dropped upon CPU fault or unmap.
> +        */
> +       xe_bo_get(bo);
> +
> +       err = drm_gpusvm_migrate_to_vram(&vm->svm.gpusvm, &range->base,
> +                                        bo, ctx);
> +       if (err) {
> +               xe_bo_put(bo);  /* Local ref */
> +               xe_bo_put(bo);  /* Creation ref */
> +               return ERR_PTR(err);
> +       }
> +
> +       return bo;
> +}
> +
>  int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
>                             struct xe_tile *tile, u64 fault_addr,
>                             bool atomic)
>  {
> -       struct drm_gpusvm_ctx ctx = { .read_only = xe_vma_read_only(vma), };
> +       struct drm_gpusvm_ctx ctx = { .read_only = xe_vma_read_only(vma),
> +               .vram_possible = IS_DGFX(vm->xe), };
>         struct xe_svm_range *range;
>         struct drm_gpusvm_range *r;
>         struct drm_exec exec;
>         struct dma_fence *fence;
> +       struct xe_bo *bo = NULL;
>         ktime_t end = 0;
>         int err;
> 
>         lockdep_assert_held_write(&vm->lock);
> 
>  retry:
> +       xe_bo_put(bo);
> +       bo = NULL;
> +
>         /* Always process UNMAPs first so view SVM ranges is current */
>         err = xe_svm_garbage_collector(vm);
>         if (err)
> @@ -534,6 +595,22 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma
>  *vma,
>         if (xe_svm_range_is_valid(range, tile))
>                 return 0;
> 
> +       /* XXX: Add migration policy, for now migrate range once */
> +       if (IS_DGFX(vm->xe) && !range->migrated &&
> +           range->base.flags.migrate_vram &&
> +           (range->base.va.end - range->base.va.start) >= SZ_64K) {
> +               range->migrated = true;
> +
> +               bo = xe_svm_alloc_vram(vm, tile, range, &ctx);
> +               if (IS_ERR(bo)) {
> +                       drm_info(&vm->xe->drm,
> +                                "VRAM allocation failed, falling back to retrying, asid=%u, errno %ld\n",
> +                                vm->usm.asid, PTR_ERR(bo));
> +                       bo = NULL;
> +                       goto retry;
> +               }
> +       }
> +
>         err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r, &ctx);
>         if (err == -EFAULT || err == -EPERM)    /* Corner where CPU mappings have change */
>                goto retry;
> @@ -567,6 +644,8 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma
> *vma,
>         dma_fence_put(fence);
> 
>  err_out:
> +       xe_bo_put(bo);
> +
>         return err;
>  }
> 
> diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> index 8b72e91cc37d..3f432483a230 100644
> --- a/drivers/gpu/drm/xe/xe_svm.h
> +++ b/drivers/gpu/drm/xe/xe_svm.h
> @@ -18,6 +18,7 @@ struct xe_svm_range {
>         struct list_head garbage_collector_link;
>         u8 tile_present;
>         u8 tile_invalidated;
> +       u8 migrated     :1;
>  };
> 
>  int xe_devm_add(struct xe_tile *tile, struct xe_mem_region *mr);
> --
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 23/28] drm/xe: Add SVM VRAM migration
  2024-08-29 14:30         ` Christian König
@ 2024-08-29 21:53           ` Matthew Brost
  0 siblings, 0 replies; 100+ messages in thread
From: Matthew Brost @ 2024-08-29 21:53 UTC (permalink / raw)
  To: Christian König
  Cc: Thomas Hellström, Daniel Vetter, intel-xe, dri-devel,
	airlied, matthew.auld, daniel, Paneer Selvam, Arunpravin

On Thu, Aug 29, 2024 at 04:30:11PM +0200, Christian König wrote:
> 
> 
> On 29.08.24 at 11:53, Thomas Hellström wrote:
> > Hi, Christian,
> > 
> > On Thu, 2024-08-29 at 11:24 +0200, Christian König wrote:
> > ...
> > 
> > > > > - Unified eviction is required (SVM VRAM and TTM BOs need to be
> > > > > able to
> > > > >     evict each other).
> > > > So core mm handles this by just roughly equally shrinking
> > > > everything.
> > > > Seems to work, and it has a pile of object shrinkers, and the page
> > > > lru is
> > > > also split into page cache and anon memory.
> > > > 
> > > > I think you need to put in more justification that unified eviction
> > > > is
> > > > required than just stating it, because a look at mm/ gives a very
> > > > well
> > > > established counterexample.
> > > > 
> > > > > - For exhaustive eviction [1], SVM VRAM allocations will almost
> > > > > certainly
> > > > >     require a dma-resv.
> > > > So from the TTM side we need exhaustive eviction, or at least
> > > > something a
> > > > bit more exhaustive than what ttm currently has. Note that i915-gem
> > > > also
> > > > never really got to perfect exhaustive eviction, it's just a pile
> > > > better
> > > > than ttm right now.
> > > Please define what exhaustive eviction should mean? I think I know
> > > what
> > > it is and I have been pushing TTM into the direction of solving this
> > > for
> > > years.
> > We internally refer to exhaustive eviction being a client is always
> > guaranteed to eventually make progress in obtaining non-pinned vram,
> > typically by incrementally locking and keeping dma-resvs across a
> > single validation including validations during buffer object
> > allocations.
> > 
> > > The last missing puzzle piece is to use drm_exec for TTM evictions,
> > and IMO keeping the dma-resv locks grabbed during eviction until at
> > least one unit of progress (one validation) has succeeded.
> 
> Yes, exactly that. My guessed understanding was actually correct.
> 
> > 
> > > but
> > > apart from that everything should work now.
> > > 
> > > 
> > > Regards,
> > > Christian.
> > But as Sima pointed out in private communication, exhaustive eviction
> > is not really needed for faulting to make (crawling) progress.
> > Watermarks and VRAM trylock shrinking should suffice, since we're
> > strictly only required to service a single gpu page granule at a time.
> 
> Yeah fault based memory management should be able to keep working as long as
> the page isn't re-migrated before you make any progress.

I protect against that via the eviction_valuable vfunc. See here [1].

[1] https://patchwork.freedesktop.org/patch/610982/?series=137870&rev=1
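
The idea, roughly (hedged sketch, not the actual patch in [1];
xe_svm_bo_recently_faulted() is a made-up placeholder for whatever
condition the real code checks):

static bool xe_svm_eviction_valuable(struct ttm_buffer_object *bo,
				     const struct ttm_place *place)
{
	/*
	 * Don't evict a VRAM allocation we just faulted in; this avoids the
	 * re-migrate-before-any-progress problem described above.
	 */
	if (xe_svm_bo_recently_faulted(bo))	/* placeholder condition */
		return false;

	return ttm_bo_eviction_valuable(bo, place);
}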

> 
> Since the number of VRAM or system memory pages is very high that should
> basically never happen.
> 
> > However, ordinary bo-based jobs would still like to be able to
> > completely evict SVM vram. Whether that is important enough to strive
> > for is ofc up for discussion.
> 
> Yes, exactly that. Felix, Alex, a bunch of other AMD folks and I came up
> with the same conclusion at AMD internally as well.
>

Agree with both of you. I landed on a BO rather than rewriting the world,
as TTM appears to have everything needed for SVM aside from an impedance
mismatch Daniel has pointed out, which is resolved by refcounting.

Matt

> Regards,
> Christian.
> 
> > 
> > /Thomas
> > 
> > 
> > 
> > > > Now if there's also SVM VRAM managed on a page lru, TTM exhaustive
> > > > eviction is going to win because the shrinkers can only trylock
> > > > dma_resv.
> > > > So this part works. It actually works so well on the system memory
> > > > side
> > > > that if we're not careful we can trigger oom, because we're too
> > > > good at
> > > > getting at all the memory.
> > > > 
> > > > SVM VRAM allocations otoh do not need exhaustive evictions. Or at
> > > > least I
> > > > don't see why, because the idea is that thanks to gpu and cpu page
> > > > faults,
> > > > you can always get out of a pinch by just trashing everything for a
> > > > while
> > > > and migrating the handfull of available pages a lot.
> > > > 
> > > > > - Likely allocation size is 2M which makes of size of BO (872)
> > > > >     acceptable per allocation (872 / 2M == .0004158).
> > > > > 
> > > > > With this, using TTM BO for VRAM backing store seems to be an
> > > > > obvious
> > > > > choice as it allows leveraging of the TTM eviction code.
> > > > Except it requires that you hold dma_resv, which brings in all
> > > > kinds of
> > > > pain. And for eviction we really don't need a lot of
> > > > synchronization, so a
> > > > lot of that locking is not needed, unlike the case where we have a
> > > > cpu
> > > > fault, where we absolutely need mmap_lock and all that to make sure
> > > > we
> > > > fault in the right page.
> > > > 
> > > > But for eviction we only need to throw out some pages, if we're not
> > > > entirely precise with picking the right ones (or have no idea into
> > > > which
> > > > vma they're all currently mapped into) it doesn't matter. That's
> > > > why
> > > > migrate_device_pages doesn't care about any of that at all, it
> > > > doesn't
> > > > need to by design. But by bo backing memory you drag in all that
> > > > stuff
> > > > that's causing headacheds for eviction.
> > > > 
> > > > The only thing migration tries to do is remove all pte, and if that
> > > > succeeds, move the page. Specialized for the gpusvm case, looking
> > > > at mm/
> > > > code as cheat sheet, we need roughly:
> > > > 
> > > > - reverse mapping structure like anon_vma. Except gpusvm can assume
> > > > that
> > > >     there's currently only one gpu side mapping, so we can just
> > > > stuff the
> > > >     gpusvm an va_address into the page, and protect it with the page
> > > > lock.
> > > > 
> > > > - we need pagetable locks, so that we can manipulate pagetables
> > > > (well
> > > >     specifically make ptes invalid) without taking any other locks.
> > > > 
> > > > - everyone else inserting or removing ptes for svm mappings also
> > > > needs to
> > > >     lock the page, or we have races. This might be the
> > > > hmm_range_fault races
> > > >     you're seeing when allowing vram pages, since I don't think
> > > > there's
> > > >     anything else stopping the page lookup otherwise from
> > > > succeeding.
> > > > 
> > > > - we might also need to stuff migrate ptes into the gpu side, like
> > > > the cpu
> > > >     does, to hold up refaults before the migration has finished. But
> > > > I think
> > > >     those are only needed for anon memory in sram because there's no
> > > > other
> > > >     way to find the right page than swap pte entries, of which
> > > > migration
> > > >     entries are a special case.
> > > > 
> > > > - core code also expects us to handle the page refcount correctly
> > > > for svm
> > > >     device memory, so we can't free the pages like normal bo pages
> > > > either
> > > >     directly to drm_buddy.
> > > > 
> > > > Now typing this all up will look an awful lot like what you have,
> > > > with the
> > > > dma_resv lock serving as the page lock and the pagetable lock. The
> > > > only
> > > > reason is that these locks are much smaller and nest within all the
> > > > other
> > > > stuff going on and so avoid the inversion issues.
> > > > 
> > > > So one annoying part is that this is a lot of pointlessly looking
> > > > typing.
> > > > The other is that it's full of races, because core mm really is
> > > > yolo all
> > > > the way down. So lots of ways you lock the wrong page and fun stuff
> > > > like
> > > > that, but the few cases that matter work:
> > > > 
> > > > - svm fault handling with hmm_range fault retries with mmu
> > > > notifiers. Note
> > > >     that we need to have vram pages locked and the notifier retrie
> > > > needs to
> > > >     be under the pagetable lock, or there's room to escape. At least
> > > > that's
> > > >     what I came up with last time I thought it all through.
> > > > 
> > > > - migrate_to_ram: it will hold a page reference which we know was
> > > > the
> > > >     valid vram page when the cpu pte was locked, but it might not be
> > > > it
> > > >     anymore. So we have to lock the page and check whether it's
> > > > still gpu
> > > >     mapped, and if not retry the entire fault since most likey
> > > > another
> > > >     migrate_to_ram has succeed meanwhile in parallel.
> > > > 
> > > > - for eviction we don't care, we might actually be migrating a page
> > > > no one
> > > >     even wants anymore.
> > > > 
> > > > Now I think you can get all this done with the dma_resv lock and
> > > > maybe the
> > > > bo refcount. But it does involve a tremendous amount of headaches
> > > > and
> > > > impendence mismatch, because that's not how page faults and
> > > > migrations
> > > > work in core mm.
> > > > 
> > > > Cheers, Sima
> > > > 
> > > > > Current migration policy is migrate any SVM range greater than or
> > > > > equal
> > > > > to 64k once.
> > > > > 
> > > > > [1] https://patchwork.freedesktop.org/series/133643/
> > > > > 
> > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > > ---
> > > > >    drivers/gpu/drm/xe/xe_svm.c | 81
> > > > > ++++++++++++++++++++++++++++++++++++-
> > > > >    drivers/gpu/drm/xe/xe_svm.h |  1 +
> > > > >    2 files changed, 81 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/xe/xe_svm.c
> > > > > b/drivers/gpu/drm/xe/xe_svm.c
> > > > > index 4372c02a341f..fd8987e0a506 100644
> > > > > --- a/drivers/gpu/drm/xe/xe_svm.c
> > > > > +++ b/drivers/gpu/drm/xe/xe_svm.c
> > > > > @@ -217,8 +217,13 @@ static void xe_svm_invalidate(struct
> > > > > drm_gpusvm *gpusvm,
> > > > >    static int __xe_svm_garbage_collector(struct xe_vm *vm,
> > > > >    				      struct xe_svm_range
> > > > > *range)
> > > > >    {
> > > > > +	struct drm_gpusvm_ctx ctx = {};
> > > > >    	struct dma_fence *fence;
> > > > > +	/* Evict any pages holding references to vram allocation
> > > > > */
> > > > > +	if (range->base.flags.partial_unmap && IS_DGFX(vm->xe))
> > > > > +		drm_gpusvm_migrate_to_sram(&vm->svm.gpusvm,
> > > > > &range->base, &ctx);
> > > > > +
> > > > >    	xe_vm_lock(vm, false);
> > > > >    	fence = xe_vm_range_unbind(vm, range);
> > > > >    	xe_vm_unlock(vm);
> > > > > @@ -504,21 +509,77 @@ static bool xe_svm_range_is_valid(struct
> > > > > xe_svm_range *range,
> > > > >    	return (range->tile_present & ~range->tile_invalidated)
> > > > > & BIT(tile->id);
> > > > >    }
> > > > > +static struct xe_mem_region *tile_to_mr(struct xe_tile *tile)
> > > > > +{
> > > > > +	return &tile->mem.vram;
> > > > > +}
> > > > > +
> > > > > +static struct xe_bo *xe_svm_alloc_vram(struct xe_vm *vm, struct
> > > > > xe_tile *tile,
> > > > > +				       struct xe_svm_range
> > > > > *range,
> > > > > +				       const struct
> > > > > drm_gpusvm_ctx *ctx)
> > > > > +{
> > > > > +	struct xe_mem_region *mr = tile_to_mr(tile);
> > > > > +	struct drm_buddy_block *block;
> > > > > +	struct list_head *blocks;
> > > > > +	struct xe_bo *bo;
> > > > > +	ktime_t end = 0;
> > > > > +	int err;
> > > > > +
> > > > > +retry:
> > > > > +	xe_vm_lock(vm, false);
> > > > > +	bo = xe_bo_create(tile_to_xe(tile), tile, vm, range-
> > > > > > base.va.end -
> > > > > +			  range->base.va.start,
> > > > > ttm_bo_type_device,
> > > > > +			  XE_BO_FLAG_VRAM_IF_DGFX(tile) |
> > > > > +			  XE_BO_FLAG_SYSTEM_ALLOC |
> > > > > XE_BO_FLAG_SKIP_CLEAR);
> > > > > +	xe_vm_unlock(vm);
> > > > > +	if (IS_ERR(bo)) {
> > > > > +		err = PTR_ERR(bo);
> > > > > +		if (xe_vm_validate_should_retry(NULL, err,
> > > > > &end))
> > > > > +			goto retry;
> > > > > +		return bo;
> > > > > +	}
> > > > > +
> > > > > +	blocks = &to_xe_ttm_vram_mgr_resource(bo->ttm.resource)-
> > > > > > blocks;
> > > > > +	list_for_each_entry(block, blocks, link)
> > > > > +		block->private = mr;
> > > > > +
> > > > > +	/*
> > > > > +	 * Take ref because as soon as
> > > > > drm_gpusvm_migrate_to_vram succeeds the
> > > > > +	 * creation ref can be dropped upon CPU fault or unmap.
> > > > > +	 */
> > > > > +	xe_bo_get(bo);
> > > > > +
> > > > > +	err = drm_gpusvm_migrate_to_vram(&vm->svm.gpusvm,
> > > > > &range->base,
> > > > > +					 bo, ctx);
> > > > > +	if (err) {
> > > > > +		xe_bo_put(bo);	/* Local ref */
> > > > > +		xe_bo_put(bo);	/* Creation ref */
> > > > > +		return ERR_PTR(err);
> > > > > +	}
> > > > > +
> > > > > +	return bo;
> > > > > +}
> > > > > +
> > > > >    int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma
> > > > > *vma,
> > > > >    			    struct xe_tile *tile, u64
> > > > > fault_addr,
> > > > >    			    bool atomic)
> > > > >    {
> > > > > -	struct drm_gpusvm_ctx ctx = { .read_only =
> > > > > xe_vma_read_only(vma), };
> > > > > +	struct drm_gpusvm_ctx ctx = { .read_only =
> > > > > xe_vma_read_only(vma),
> > > > > +		.vram_possible = IS_DGFX(vm->xe), };
> > > > >    	struct xe_svm_range *range;
> > > > >    	struct drm_gpusvm_range *r;
> > > > >    	struct drm_exec exec;
> > > > >    	struct dma_fence *fence;
> > > > > +	struct xe_bo *bo = NULL;
> > > > >    	ktime_t end = 0;
> > > > >    	int err;
> > > > >    	lockdep_assert_held_write(&vm->lock);
> > > > >    retry:
> > > > > +	xe_bo_put(bo);
> > > > > +	bo = NULL;
> > > > > +
> > > > >    	/* Always process UNMAPs first so view SVM ranges is
> > > > > current */
> > > > >    	err = xe_svm_garbage_collector(vm);
> > > > >    	if (err)
> > > > > @@ -534,6 +595,22 @@ int xe_svm_handle_pagefault(struct xe_vm
> > > > > *vm, struct xe_vma *vma,
> > > > >    	if (xe_svm_range_is_valid(range, tile))
> > > > >    		return 0;
> > > > > +	/* XXX: Add migration policy, for now migrate range once
> > > > > */
> > > > > +	if (IS_DGFX(vm->xe) && !range->migrated &&
> > > > > +	    range->base.flags.migrate_vram &&
> > > > > +	    (range->base.va.end - range->base.va.start) >=
> > > > > SZ_64K) {
> > > > > +		range->migrated = true;
> > > > > +
> > > > > +		bo = xe_svm_alloc_vram(vm, tile, range, &ctx);
> > > > > +		if (IS_ERR(bo)) {
> > > > > +			drm_info(&vm->xe->drm,
> > > > > +				 "VRAM allocation failed,
> > > > > falling back to retrying, asid=%u, errno %ld\n",
> > > > > +				 vm->usm.asid, PTR_ERR(bo));
> > > > > +			bo = NULL;
> > > > > +			goto retry;
> > > > > +		}
> > > > > +	}
> > > > > +
> > > > >    	err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r,
> > > > > &ctx);
> > > > >    	if (err == -EFAULT || err == -EPERM)	/* Corner where
> > > > > CPU mappings have change */
> > > > >    	       goto retry;
> > > > > @@ -567,6 +644,8 @@ int xe_svm_handle_pagefault(struct xe_vm *vm,
> > > > > struct xe_vma *vma,
> > > > >    	dma_fence_put(fence);
> > > > >    err_out:
> > > > > +	xe_bo_put(bo);
> > > > > +
> > > > >    	return err;
> > > > >    }
> > > > > diff --git a/drivers/gpu/drm/xe/xe_svm.h
> > > > > b/drivers/gpu/drm/xe/xe_svm.h
> > > > > index 8b72e91cc37d..3f432483a230 100644
> > > > > --- a/drivers/gpu/drm/xe/xe_svm.h
> > > > > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > > > > @@ -18,6 +18,7 @@ struct xe_svm_range {
> > > > >    	struct list_head garbage_collector_link;
> > > > >    	u8 tile_present;
> > > > >    	u8 tile_invalidated;
> > > > > +	u8 migrated	:1;
> > > > >    };
> > > > >    int xe_devm_add(struct xe_tile *tile, struct xe_mem_region
> > > > > *mr);
> > > > > -- 
> > > > > 2.34.1
> > > > > 
> 

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 23/28] drm/xe: Add SVM VRAM migration
  2024-08-29 11:02         ` Daniel Vetter
@ 2024-08-29 22:12           ` Matthew Brost
  2024-08-29 22:23             ` Matthew Brost
                               ` (2 more replies)
  0 siblings, 3 replies; 100+ messages in thread
From: Matthew Brost @ 2024-08-29 22:12 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Thomas Hellström, Christian König, intel-xe, dri-devel,
	airlied, matthew.auld, daniel, Paneer Selvam, Arunpravin

On Thu, Aug 29, 2024 at 01:02:54PM +0200, Daniel Vetter wrote:
> On Thu, Aug 29, 2024 at 11:53:58AM +0200, Thomas Hellström wrote:
> > But as Sima pointed out in private communication, exhaustive eviction
> > is not really needed for faulting to make (crawling) progress.
> > Watermarks and VRAM trylock shrinking should suffice, since we're
> > strictly only required to service a single gpu page granule at a time.
> > 
> > However, ordinary bo-based jobs would still like to be able to
> > completely evict SVM vram. Whether that is important enough to strive
> > for is ofc up for discussion.
> 
> My take is that you don't win anything for exhaustive eviction by having
> the dma_resv somewhere in there for svm allocations. Roughly for split lru
> world, where svm ignores bo/dma_resv:
> 
> When evicting vram from the ttm side we'll fairly switch between selecting
> bo and throwing out svm pages. With drm_exec/ww_acquire_ctx selecting bo
> will eventually succeed in vacuuming up everything (with a few retries
> perhaps, if we're not yet at the head of the ww ticket queue).
> 
> svm pages we need to try to evict anyway - there's no guarantee, becaue
> the core mm might be holding temporary page references (which block

Yeah, but I think you could kill the app then - not suggesting we
should, but we could. To me this is akin to a CPU fault where the device
pages cannot be migrated - the migration layer doc says when this
happens, kick it to user space and segfault the app.

My last patch in the series adds some asserts to see if this ever
happens; thus far it never has. If it occurs we could gracefully handle it
by aborting the migration, I guess... I think the user really needs to do
something a bit crazy to trigger this condition - I don't think the core
randomly grabs refs to device pages, but I could be wrong.

> migration) or have the page locked (which also block the migration). But
> as long as those two steps succeed, we'll win and get the pages. There
> might be some thrashing against concurrent svm faults stealing them again,
> but they have a disadvantage since they can't steal dma_resv_locked bo.
> And if it's still too much we can stall them in the page allocator.
> 
> So it's not entirely reliable, but should be close enough.
> 
> Now for bo based svm the picture isn't any different, because holding
> dma_resv is not actually enough to migrate svm mappings. We still need to
> hope there's no temporary page references around, and we still need to
> succeed at locking the page. And the migration code only does trylocks,
> because that's it's deadlock prevent algorithm if different migrations
> needing the same set of pages, but acquiring them in a different order. So
> we win nothing.

Ok, maybe my statement above is false...

Wouldn't the only time this fails be if another migration is in
flight (e.g. a CPU fault) and they race? Then the eviction will naturally
happen via the refcount being dropped by the other migration. I guess I
likely need to update my eviction code to not free the TTM resource if
not all pages are migrated.
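
Something along these lines (illustrative only, not in the series): after
the eviction copy, check the migrate PFN array and only free the TTM
resource when every page actually moved.

static bool xe_svm_all_pages_migrated(const unsigned long *src_pfns,
				      unsigned long npages)
{
	unsigned long i;

	for (i = 0; i < npages; i++)
		if (!(src_pfns[i] & MIGRATE_PFN_MIGRATE))
			return false;	/* page locked or referenced elsewhere */

	return true;
}

If this returns false, keep the resource around; the remaining pages will
be released through page_free once the competing migration drops its
references.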

> 
> Worse, if dma_resv does actually hold up svm migration and reclaim, then
> we potentially deadlock because that lock is for a bigger range than
> individual pages (or folios). And the core mm assumes that it can get out
> of a deadlock bind by (at least stochastically) eventually succeeding in
> acquiring/locking down a single page.
> 
> This means we cannot use dma_resv tricks to give the ttm world an
> advantage in exhaustive eviction against concurrent svm faults. Or at
> least not more than we can do without by just stalling svm faults that
> need to allocate gpu memory (but that must happen without holding locks or
> we're busted).
> 

I'm a little lost here on the deadlock case. Do you mean that when we try
to evict an SVM BO we trigger reclaim by allocating system pages and can
deadlock? Doesn't TTM already have this dependency when evicting non-SVM
BOs?

> So the only benefit I'm seeing is the unified lru, which I'm not sure is
> worth it. There's also a bit a lru design tension here, because for the bo

Well also not rewriting the world...

Matt

> world we want objects that are locked to stay on the lru, so that the
> competing processes can figure out who has the winning ww ticket. The core
> mm design otoh does isolate pages and remove them from the lru when
> they're acquired, so that they don't gunk up other processes from trying
> to make forward progress and are better hidden. Which reduces temporary
> page references (from lru walk) preventing migration and stuff like that.
> 
> Cheers, Sima
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 23/28] drm/xe: Add SVM VRAM migration
  2024-08-29 22:12           ` Matthew Brost
@ 2024-08-29 22:23             ` Matthew Brost
  2024-09-02 11:01             ` Christian König
  2024-09-02 12:48             ` Daniel Vetter
  2 siblings, 0 replies; 100+ messages in thread
From: Matthew Brost @ 2024-08-29 22:23 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Thomas Hellström, Christian König, intel-xe, dri-devel,
	airlied, matthew.auld, daniel, Paneer Selvam, Arunpravin

On Thu, Aug 29, 2024 at 10:12:53PM +0000, Matthew Brost wrote:
> On Thu, Aug 29, 2024 at 01:02:54PM +0200, Daniel Vetter wrote:
> > On Thu, Aug 29, 2024 at 11:53:58AM +0200, Thomas Hellström wrote:
> > > But as Sima pointed out in private communication, exhaustive eviction
> > > is not really needed for faulting to make (crawling) progress.
> > > Watermarks and VRAM trylock shrinking should suffice, since we're
> > > strictly only required to service a single gpu page granule at a time.
> > > 
> > > However, ordinary bo-based jobs would still like to be able to
> > > completely evict SVM vram. Whether that is important enough to strive
> > > for is ofc up for discussion.
> > 
> > My take is that you don't win anything for exhaustive eviction by having
> > the dma_resv somewhere in there for svm allocations. Roughly for split lru
> > world, where svm ignores bo/dma_resv:
> > 
> > When evicting vram from the ttm side we'll fairly switch between selecting
> > bo and throwing out svm pages. With drm_exec/ww_acquire_ctx selecting bo
> > will eventually succeed in vacuuming up everything (with a few retries
> > perhaps, if we're not yet at the head of the ww ticket queue).
> > 
> > svm pages we need to try to evict anyway - there's no guarantee, becaue
> > the core mm might be holding temporary page references (which block
> 
> Yea, but think you can could kill the app then - not suggesting we
> should but could. To me this is akin to a CPU fault and not being able
> to migrate the device pages - the migration layer doc says when this
> happens kick this to user space and segfault the app.
> 
> My last patch in the series adds some asserts to see if this ever
> happens, thus far never. If it occurs we could gracefully handle it by
> aborting the migration I guess... I think the user really needs to
> something a bit crazy to trigger this condition - I don't think the core
> randomly grabs refs to device pages but could be wrong.
> 
> > migration) or have the page locked (which also block the migration). But
> > as long as those two steps succeed, we'll win and get the pages. There
> > might be some thrashing against concurrent svm faults stealing them again,
> > but they have a disadvantage since they can't steal dma_resv_locked bo.
> > And if it's still too much we can stall them in the page allocator.
> > 
> > So it's not entirely reliable, but should be close enough.
> > 
> > Now for bo based svm the picture isn't any different, because holding
> > dma_resv is not actually enough to migrate svm mappings. We still need to
> > hope there's no temporary page references around, and we still need to
> > succeed at locking the page. And the migration code only does trylocks,
> > because that's it's deadlock prevent algorithm if different migrations
> > needing the same set of pages, but acquiring them in a different order. So
> > we win nothing.
> 
> Ok, maybe my statement above is false...
> 
> Wouldn't be the only time this falls is if another migration is in
> flight (e.g. CPU fault) and they race? Then the eviction will naturally
> happen via refcount being dropped from the other migration. I guess I
> likely need to update my eviction code to not free the TTM resource if
> all pages are not migrated.
> 

Also, if we add something like AMD's range->migration_mutex, I'm pretty
sure this race goes away and we are left with 'the user has done something
not smart, and we could conceivably kill the app'.
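
For illustration, something like this (hedged sketch; the per-range
migrate mutex is a hypothetical field that does not exist in this series):

static int xe_svm_migrate_range_locked(struct drm_gpusvm *gpusvm,
				       struct drm_gpusvm_range *range,
				       void *vram_allocation,
				       const struct drm_gpusvm_ctx *ctx)
{
	int err;

	/* Hypothetical field, serializes GPU-fault and CPU-fault migration */
	mutex_lock(&range->migrate_mutex);
	err = drm_gpusvm_migrate_to_vram(gpusvm, range, vram_allocation, ctx);
	mutex_unlock(&range->migrate_mutex);

	return err;
}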

Matt

> > 
> > Worse, if dma_resv does actually hold up svm migration and reclaim, then
> > we potentially deadlock because that lock is for a bigger range than
> > individual pages (or folios). And the core mm assumes that it can get out
> > of a deadlock bind by (at least stochastically) eventually succeeding in
> > acquiring/locking down a single page.
> > 
> > This means we cannot use dma_resv tricks to give the ttm world an
> > advantage in exhaustive eviction against concurrent svm faults. Or at
> > least not more than we can do without by just stalling svm faults that
> > need to allocate gpu memory (but that must happen without holding locks or
> > we're busted).
> > 
> 
> I'm a little lost here on the deadlock case. Do you mean when we try to
> evict SVM BO we trigger reclaim by allocating system pages and can
> deadlock? Doesn't TTM already have this dependency when evicting non-SVM
> BOs?
> 
> > So the only benefit I'm seeing is the unified lru, which I'm not sure is
> > worth it. There's also a bit a lru design tension here, because for the bo
> 
> Well also not rewriting the world...
> 
> Matt
> 
> > world we want objects that are locked to stay on the lru, so that the
> > competing processes can figure out who has the winning ww ticket. The core
> > mm design otoh does isolate pages and remove them from the lru when
> > they're acquired, so that they don't gunk up other processes from trying
> > to make forward progress and are better hidden. Which reduces temporary
> > page references (from lru walk) preventing migration and stuff like that.
> > 
> > Cheers, Sima
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory
  2024-08-29  9:16   ` Thomas Hellström
  2024-08-29 17:45     ` Matthew Brost
@ 2024-08-30  1:35     ` Matthew Brost
  1 sibling, 0 replies; 100+ messages in thread
From: Matthew Brost @ 2024-08-30  1:35 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: intel-xe, dri-devel, airlied, christian.koenig, matthew.auld,
	daniel

On Thu, Aug 29, 2024 at 11:16:49AM +0200, Thomas Hellström wrote:
> Hi, Matt. 
> 
> Some initial design comments / questions:
> 

Hi, Thomas. I missed one question in my initial reply.

> On Tue, 2024-08-27 at 19:48 -0700, Matthew Brost wrote:
> > This patch introduces support for GPU Shared Virtual Memory (SVM) in
> > the
> > Direct Rendering Manager (DRM) subsystem. SVM allows for seamless
> > sharing of memory between the CPU and GPU, enhancing performance and
> > flexibility in GPU computing tasks.
> > 
> > The patch adds the necessary infrastructure for SVM, including data
> > structures and functions for managing SVM ranges and notifiers. It
> > also
> > provides mechanisms for allocating, deallocating, and migrating
> > memory
> > regions between system RAM and GPU VRAM.
> > 
> > This mid-layer is largely inspired by GPUVM.
> > 
> > Cc: Dave Airlie <airlied@redhat.com>
> > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > Cc: Christian König <christian.koenig@amd.com>
> > Cc: <dri-devel@lists.freedesktop.org>
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/xe/Makefile     |    3 +-
> >  drivers/gpu/drm/xe/drm_gpusvm.c | 2174
> > +++++++++++++++++++++++++++++++
> >  drivers/gpu/drm/xe/drm_gpusvm.h |  415 ++++++
> >  3 files changed, 2591 insertions(+), 1 deletion(-)
> >  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.c
> >  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.h
> > 
> > diff --git a/drivers/gpu/drm/xe/Makefile
> > b/drivers/gpu/drm/xe/Makefile
> > index b9670ae09a9e..b8fc2ee58f1a 100644
> > --- a/drivers/gpu/drm/xe/Makefile
> > +++ b/drivers/gpu/drm/xe/Makefile
> > @@ -25,7 +25,8 @@ $(obj)/generated/%_wa_oob.c
> > $(obj)/generated/%_wa_oob.h: $(obj)/xe_gen_wa_oob \
> >  
> >  # core driver code
> >  
> > -xe-y += xe_bb.o \
> > +xe-y += drm_gpusvm.o \
> > +	xe_bb.o \
> >  	xe_bo.o \
> >  	xe_bo_evict.o \
> >  	xe_devcoredump.o \
> > diff --git a/drivers/gpu/drm/xe/drm_gpusvm.c
> > b/drivers/gpu/drm/xe/drm_gpusvm.c
> > new file mode 100644
> > index 000000000000..fc1e44e6ae72
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/drm_gpusvm.c
> > @@ -0,0 +1,2174 @@
> > +// SPDX-License-Identifier: MIT
> > +/*
> > + * Copyright © 2024 Intel Corporation
> > + *
> > + * Authors:
> > + *     Matthew Brost <matthew.brost@intel.com>
> > + */
> > +
> > +#include <linux/dma-mapping.h>
> > +#include <linux/interval_tree_generic.h>
> > +#include <linux/hmm.h>
> > +#include <linux/memremap.h>
> > +#include <linux/migrate.h>
> > +#include <linux/mm_types.h>
> > +#include <linux/pagemap.h>
> > +#include <linux/slab.h>
> > +
> > +#include <drm/drm_device.h>
> > +#include "drm_gpusvm.h"
> > +
> > +/**
> > + * DOC: Overview
> > + *
> > + * GPU Shared Virtual Memory (GPU SVM) layer for the Direct
> > Rendering Manager (DRM)
> > + *
> > + * The GPU SVM layer is a component of the DRM framework designed to
> > manage shared
> > + * virtual memory between the CPU and GPU. It enables efficient data
> > exchange and
> > + * processing for GPU-accelerated applications by allowing memory
> > sharing and
> > + * synchronization between the CPU's and GPU's virtual address
> > spaces.
> > + *
> > + * Key GPU SVM Components:
> > + * - Notifiers: Used for tracking memory intervals and
> > notifying the
> > + *		GPU of changes, notifiers are sized based on a GPU
> > SVM
> > + *		initialization parameter, with a recommendation of
> > 512M or
> > larger. They maintain a Red-Black tree and a list of
> > ranges that
> > + *		fall within the notifier interval. Notifiers are
> > tracked within
> > + *		a GPU SVM Red-Black tree and list and are
> > dynamically inserted
> > + *		or removed as ranges within the interval are created
> > or
> > + *		destroyed.
> 
> What is the benefit of this extra layer compared to direct insertion of
> ranges using mmu_interval_notifier_insert?
> 
> IIRC the argument made previously about having wide notifiers was that
> the rb tree lookups inside the core were costly and if there were only
> a few, then the rb tree lookups within a notifier range could be
> replaced with the page-table radix-tree-like lookup, so each lookup
> complexity would be O(log(n_notifiers) + page_table_depth).
> 
> But now we have first an rb-tree lookup in the core and then an rb-tree
> lookup within each notifier, yielding O(log(n_ranges)).
> 
> I can see a small benefit in that inserting directly into the core rb-
> tree will block pending ongoing invalidations, but at a cost of an
> extra multiplexing layer.
> 
> > + * - Ranges: Represent memory ranges mapped in a DRM device and
> > managed
> > + *	     by GPU SVM. They are sized based on an array of chunk
> > sizes, which
> > + *	     is a GPU SVM initialization parameter, and the CPU
> > address space.
> > + *	     Upon GPU fault, the largest aligned chunk that fits
> > within the
> > + *	     faulting CPU address space is chosen for the range
> > size. Ranges are
> > + *	     expected to be dynamically allocated on GPU fault and
> > removed on an
> > + *	     MMU notifier UNMAP event. As mentioned above, ranges
> > are tracked in
> > + *	     a notifier's Red-Black tree.
> 
> How do ranges and chunks map to
>  
> a) Prefaulting granularity

Well, we haven't implemented prefetch yet, but my initial thinking is that
prefetch is an IOCTL which would create N ranges based on chunk size.
As an optimization we can likely make prefetch interruptible (e.g., stop
prefetch if a GPU fault occurs so the fault is serviced first) and hopefully
make prefetch parallel rather than a completely serial operation like a GPU
fault. We can start with a simple serial implementation built on top of
the faulting code, though.

As a further optimization it might be advantageous to trigger prefetch
upon a GPU fault too - e.g., service the fault for one range only, then
trigger prefetch for N ranges asynchronously. This could essentially be
thought of as the fault triggering a prefetch IOCTL. This would likely be
controlled by a madvise setting or perhaps a global modparam. A rough
sketch of the simple serial version is below.
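
Something like this, using the GPU SVM functions from this patch; the
driver_* helpers and the pending-fault check are placeholders:

/*
 * Hypothetical sketch of a simple, serial prefetch IOCTL built on top of the
 * fault path. The drm_gpusvm_* calls are the ones added in this patch.
 */
int driver_prefetch(struct drm_gpusvm *gpusvm, u64 start, u64 end,
		    u64 gpuva_start, u64 gpuva_end)
{
	struct drm_gpusvm_ctx ctx = {};
	u64 addr = start;
	int err = 0;

	driver_svm_lock();
	while (addr < end) {
		struct drm_gpusvm_range *range;

		/* Interruptible: let a real GPU fault win over prefetch */
		if (driver_gpu_fault_pending(gpusvm))
			break;

		range = drm_gpusvm_range_find_or_insert(gpusvm, addr,
							gpuva_start, gpuva_end,
							&ctx);
		if (IS_ERR(range)) {
			err = PTR_ERR(range);
			break;
		}

		if (driver_migration_policy(range)) {
			err = drm_gpusvm_migrate_to_vram(gpusvm, range,
							 driver_alloc_bo(),
							 &ctx);
			if (err)
				break;
		}

		err = drm_gpusvm_range_get_pages(gpusvm, range, &ctx);
		if (err)
			break;

		/* Optionally bind here, as in the fault handler */
		addr = range->va.end;	/* advance one chunk-sized range */
	}
	driver_svm_unlock();

	return err;
}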

All of the above is likely to be done in the tuning phase when we have
UMDs / apps running and are collecting performance data. Or at least that
is my thinking.

We have Jiras open for all of this, and I believe other engineers on the
team will own the implementation.

> b) Migration granularity?
> 

A chunk is both the size of the range and the migration granularity. We
choose the largest chunk for a range which fits within the GPU VMA and the
CPU VMA in an aligned manner. The chunks I'm currently using in Xe each map
to a single GPU page size (4k, 64k, or 2M). As I've mentioned, this is
flexible, so it doesn't have to be one GPU page, but I started there as it
makes a bit of sense. Roughly what the Xe side looks like is below.
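
As a rough illustration only (the xe_* names and fields are made up; the
drm_gpusvm_init() arguments follow the signature in this patch):

/* Illustrative only - the actual Xe-side hookup may differ. */
static const u64 xe_svm_chunk_sizes[] = { SZ_2M, SZ_64K, SZ_4K };

int xe_svm_init(struct xe_vm *vm)
{
	/*
	 * The largest chunk that fits the fault address in an aligned way
	 * within the CPU VMA, the GPU VMA, and the notifier becomes the
	 * range size, so ranges end up 2M, 64k, or 4k - one GPU page each.
	 */
	return drm_gpusvm_init(&vm->svm.gpusvm, "Xe SVM", &vm->xe->drm,
			       current->mm, vm->svm.pagemap_owner,
			       0, vm->size, SZ_512M /* notifier size */,
			       &xe_svm_gpusvm_ops, xe_svm_chunk_sizes,
			       ARRAY_SIZE(xe_svm_chunk_sizes));
}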

Matt

> > + * - Operations: Define the interface for driver-specific SVM
> > operations such as
> > + *		 allocation, page collection, migration,
> > invalidations, and VRAM
> > + *		 release.
> > + *
> > + * This layer provides interfaces for allocating, mapping,
> > migrating, and
> > + * releasing memory ranges between the CPU and GPU. It handles all
> > core memory
> > + * management interactions (DMA mapping, HMM, and migration) and
> > provides
> > + * driver-specific virtual functions (vfuncs). This infrastructure
> > is sufficient
> > + * to build the expected driver components for an SVM implementation
> > as detailed
> > + * below.
> > + *
> > + * Expected Driver Components:
> > + * - GPU page fault handler: Used to create ranges and notifiers
> > based on the
> > + *			     fault address, optionally migrate the
> > range to
> > + *			     VRAM, and create GPU bindings.
> > + * - Garbage collector: Used to destroy GPU bindings for ranges.
> > Ranges are
> > + *			expected to be added to the garbage
> > collector upon
> > + *			MMU_NOTIFY_UNMAP event.
> > + */
> > +
> > +/**
> > + * DOC: Locking
> > + *
> > + * GPU SVM handles locking for core MM interactions, i.e., it
> > locks/unlocks the
> > + * mmap lock as needed. Alternatively, if the driver prefers to
> > handle the mmap
> > + * lock itself, a 'locked' argument is provided to the functions
> > that require
> > + * the mmap lock. This option may be useful for drivers that need to
> > call into
> > + * GPU SVM while also holding a dma-resv lock, thus preventing
> > locking
> > + * inversions between the mmap and dma-resv locks.
> > + *
> > + * GPU SVM introduces a global notifier lock, which safeguards the
> > notifier's
> > + * range RB tree and list, as well as the range's DMA mappings and
> > sequence
> > + * number. GPU SVM manages all necessary locking and unlocking
> > operations,
> > + * except for the recheck of the range's sequence number
> > + * (mmu_interval_read_retry) when the driver is committing GPU
> > bindings. This
> > + * lock corresponds to the 'driver->update' lock mentioned in the
> > HMM
> > + * documentation (TODO: Link). Future revisions may transition from
> > a GPU SVM
> > + * global lock to a per-notifier lock if finer-grained locking is
> > deemed
> > + * necessary.
> > + *
> > + * In addition to the locking mentioned above, the driver should
> > implement a
> > + * lock to safeguard core GPU SVM function calls that modify state,
> > such as
> > + * drm_gpusvm_range_find_or_insert and drm_gpusvm_range_remove.
> > Alternatively,
> > + * these core functions can be called within a single kernel thread,
> > for
> > + * instance, using an ordered work queue. This lock is denoted as
> > + * 'driver_svm_lock' in code examples.
> > + */
> > +
> > +/**
> > + * DOC: Migration
> > + *
> > + * The migration support is quite simple, allowing migration between
> > SRAM and
> > + * VRAM at the range granularity. For example, GPU SVM currently
> > does not
> > + * support mixing SRAM and VRAM pages within a range. This means
> > that upon GPU
> > + * fault, the entire range can be migrated to VRAM, and upon CPU
> > fault, the
> > + * entire range is migrated to SRAM.
> > + *
> > + * The reasoning for only supporting range granularity is as
> > follows: it
> > + * simplifies the implementation, and range sizes are driver-defined
> > and should
> > + * be relatively small.
> > + */
> > +
> > +/**
> > + * DOC: Partial Unmapping of Ranges
> > + *
> > + * Partial unmapping of ranges (e.g., 1M out of 2M is unmapped by
> > CPU resulting
> > + * in MMU_NOTIFY_UNMAP event) presents several challenges, with the
> > main one
> > + * being that a subset of the range still has CPU and GPU mappings.
> > If the
> > + * backing store for the range is in VRAM, a subset of the backing
> > store has
> > + * references. One option would be to split the range and VRAM
> > backing store,
> > + * but the implementation for this would be quite complicated. Given
> > that
> > + * partial unmappings are rare and driver-defined range sizes are
> > relatively
> > + * small, GPU SVM does not support splitting of ranges.
> > + *
> > + * With no support for range splitting, upon partial unmapping of a
> > range, the
> > + * driver is expected to invalidate and destroy the entire range. If
> > the range
> > + * has VRAM as its backing, the driver is also expected to migrate
> > any remaining
> > + * pages back to SRAM.
> 
> So what happens if we get a one-page invalidation, say protection
> change event, or NUMA accounting event, in the middle of a range? Can
> we unmap just that single gpu pte covering that range, that is, how do
> the ranges map to invalidation granularity? Does this differ between
> igfx and dgfx?
> 
> Thanks,
> Thomas
> 
> 
> 
> 
> > + */
> > +
> > +/**
> > + * DOC: Examples
> > + *
> > + * This section provides two examples of how to build the expected
> > driver
> > + * components: the GPU page fault handler and the garbage collector.
> > A third
> > + * example demonstrates a sample invalidation driver vfunc.
> > + *
> > + * The generic code provided does not include logic for complex
> > migration
> > + * policies, optimized invalidations, or other potentially required
> > driver
> > + * locking (e.g., DMA-resv locks).
> > + *
> > + * 1) GPU page fault handler
> > + *
> > + *	int driver_bind_range(struct drm_gpusvm *gpusvm, struct
> > drm_gpusvm_range *range)
> > + *	{
> > + *		int err = 0;
> > + *
> > + *		driver_alloc_and_setup_memory_for_bind(gpusvm,
> > range);
> > + *
> > + *		drm_gpusvm_notifier_lock(gpusvm);
> > + *		if (drm_gpusvm_range_pages_valid(range))
> > + *			driver_commit_bind(gpusvm, range);
> > + *		else
> > + *			err = -EAGAIN;
> > + *		drm_gpusvm_notifier_unlock(gpusvm);
> > + *
> > + *		return err;
> > + *	}
> > + *
> > + *	int driver_gpu_fault(struct drm_gpusvm *gpusvm, u64
> > fault_addr,
> > + *			     u64 gpuva_start, u64 gpuva_end)
> > + *	{
> > + *		struct drm_gpusvm_ctx ctx = {};
> > + *		int err;
> > + *
> > + *		driver_svm_lock();
> > + *	retry:
> > + *		// Always process UNMAPs first so view of GPU SVM
> > ranges is current
> > + *		driver_garbage_collector(gpusvm);
> > + *
> > + *		range = drm_gpusvm_range_find_or_insert(gpusvm,
> > fault_addr,
> > + *							gpuva_start,
> > gpuva_end,
> > + *						        &ctx);
> > + *		if (IS_ERR(range)) {
> > + *			err = PTR_ERR(range);
> > + *			goto unlock;
> > + *		}
> > + *
> > + *		if (driver_migration_policy(range)) {
> > + *			bo = driver_alloc_bo();
> > + *			err = drm_gpusvm_migrate_to_vram(gpusvm,
> > range, bo, &ctx);
> > + *			if (err)	// CPU mappings may have
> > changed
> > + *				goto retry;
> > + *		}
> > + *
> > + *		err = drm_gpusvm_range_get_pages(gpusvm, range,
> > &ctx);
> > + *		if (err == -EFAULT || err == -EPERM)	// CPU
> > mappings changed
> > + *			goto retry;
> > + *		else if (err)
> > + *			goto unlock;
> > + *
> > + *		err = driver_bind_range(gpusvm, range);
> > + *		if (err == -EAGAIN)	// CPU mappings changed
> > + *			goto retry
> > + *
> > + *	unlock:
> > + *		driver_svm_unlock();
> > + *		return err;
> > + *	}
> > + *
> > + * 2) Garbage Collector.
> > + *
> > + *	void __driver_garbage_collector(struct drm_gpusvm *gpusvm,
> > + *					struct drm_gpusvm_range
> > *range)
> > + *	{
> > + *		struct drm_gpusvm_ctx ctx = {};
> > + *
> > + *		assert_driver_svm_locked(gpusvm);
> > + *
> > + *		// Partial unmap, migrate any remaining VRAM pages
> > back to SRAM
> > + *		if (range->flags.partial_unmap)
> > + *			drm_gpusvm_migrate_to_sram(gpusvm, range,
> > &ctx);
> > + *
> > + *		driver_unbind_range(range);
> > + *		drm_gpusvm_range_remove(gpusvm, range);
> > + *	}
> > + *
> > + *	void driver_garbage_collector(struct drm_gpusvm *gpusvm)
> > + *	{
> > + *		assert_driver_svm_locked(gpusvm);
> > + *
> > + *		for_each_range_in_garbage_collector(gpusvm, range)
> > + *			__driver_garbage_collector(gpusvm, range);
> > + *	}
> > + *
> > + * 3) Invalidation driver vfunc.
> > + *
> > + *	void driver_invalidation(struct drm_gpusvm *gpusvm,
> > + *				 struct drm_gpusvm_notifier
> > *notifier,
> > + *				 const struct mmu_notifier_range
> > *mmu_range)
> > + *	{
> > + *		struct drm_gpusvm_ctx ctx = { .in_notifier = true,
> > };
> > + *		struct drm_gpusvm_range *range = NULL;
> > + *
> > + *		driver_invalidate_device_tlb(gpusvm, mmu_range-
> > >start, mmu_range->end);
> > + *
> > + *		drm_gpusvm_for_each_range(range, notifier,
> > mmu_range->start,
> > + *					  mmu_range->end) {
> > + *			drm_gpusvm_range_unmap_pages(gpusvm, range,
> > &ctx);
> > + *
> > + *			if (mmu_range->event != MMU_NOTIFY_UNMAP)
> > + *				continue;
> > + *
> > + *			drm_gpusvm_range_set_unmapped(range,
> > mmu_range);
> > + *			driver_garbage_collector_add(gpusvm, range);
> > + *		}
> > + *	}
> > + */
> > +
> > +#define DRM_GPUSVM_RANGE_START(_range)	((_range)->va.start)
> > +#define DRM_GPUSVM_RANGE_END(_range)	((_range)->va.end - 1)
> > +INTERVAL_TREE_DEFINE(struct drm_gpusvm_range, rb.node, u64,
> > rb.__subtree_last,
> > +		     DRM_GPUSVM_RANGE_START, DRM_GPUSVM_RANGE_END,
> > +		     static __maybe_unused, range);
> > +
> > +#define DRM_GPUSVM_NOTIFIER_START(_notifier)	((_notifier)-
> > >interval.start)
> > +#define DRM_GPUSVM_NOTIFIER_END(_notifier)	((_notifier)-
> > >interval.end - 1)
> > +INTERVAL_TREE_DEFINE(struct drm_gpusvm_notifier, rb.node, u64,
> > +		     rb.__subtree_last, DRM_GPUSVM_NOTIFIER_START,
> > +		     DRM_GPUSVM_NOTIFIER_END, static __maybe_unused,
> > notifier);
> > +
> > +/**
> > + * npages_in_range() - Calculate the number of pages in a given
> > range
> > + * @start__: The start address of the range
> > + * @end__: The end address of the range
> > + *
> > + * This macro calculates the number of pages in a given memory
> > range,
> > + * specified by the start and end addresses. It divides the
> > difference
> > + * between the end and start addresses by the page size (PAGE_SIZE)
> > to
> > + * determine the number of pages in the range.
> > + *
> > + * Return: The number of pages in the specified range.
> > + */
> > +#define npages_in_range(start__, end__)	\
> > +	(((end__) - (start__)) >> PAGE_SHIFT)
> > +
> > +/**
> > + * struct drm_gpusvm_zdd - GPU SVM zone device data
> > + *
> > + * @refcount: Reference count for the zdd
> > + * @destroy_work: Work structure for asynchronous zdd destruction
> > + * @range: Pointer to the GPU SVM range
> > + * @vram_allocation: Driver-private pointer to the VRAM allocation
> > + *
> > + * This structure serves as a generic wrapper installed in
> > + * page->zone_device_data. It provides infrastructure for looking up
> > a range
> > + * upon CPU page fault and asynchronously releasing VRAM once the
> > CPU has no
> > + * page references. Asynchronous release is useful because CPU page
> > references
> > + * can be dropped in IRQ contexts, while releasing VRAM likely
> > requires sleeping
> > + * locks.
> > + */
> > +struct drm_gpusvm_zdd {
> > +	struct kref refcount;
> > +	struct work_struct destroy_work;
> > +	struct drm_gpusvm_range *range;
> > +	void *vram_allocation;
> > +};
> > +
> > +/**
> > + * drm_gpusvm_zdd_destroy_work_func - Work function for destroying a
> > zdd
> > + * @w: Pointer to the work_struct
> > + *
> > + * This function releases VRAM, puts GPU SVM range, and frees zdd.
> > + */
> > +static void drm_gpusvm_zdd_destroy_work_func(struct work_struct *w)
> > +{
> > +	struct drm_gpusvm_zdd *zdd =
> > +		container_of(w, struct drm_gpusvm_zdd,
> > destroy_work);
> > +	struct drm_gpusvm_range *range = zdd->range;
> > +	struct drm_gpusvm *gpusvm = range->gpusvm;
> > +
> > +	if (gpusvm->ops->vram_release && zdd->vram_allocation)
> > +		gpusvm->ops->vram_release(zdd->vram_allocation);
> > +	drm_gpusvm_range_put(range);
> > +	kfree(zdd);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_zdd_alloc - Allocate a zdd structure.
> > + * @range: Pointer to the GPU SVM range.
> > + *
> > + * This function allocates and initializes a new zdd structure. It
> > sets up the
> > + * reference count, initializes the destroy work, and links the
> > provided GPU SVM
> > + * range.
> > + *
> > + * Returns:
> > + * Pointer to the allocated zdd on success, ERR_PTR() on failure.
> > + */
> > +static struct drm_gpusvm_zdd *
> > +drm_gpusvm_zdd_alloc(struct drm_gpusvm_range *range)
> > +{
> > +	struct drm_gpusvm_zdd *zdd;
> > +
> > +	zdd = kmalloc(sizeof(*zdd), GFP_KERNEL);
> > +	if (!zdd)
> > +		return NULL;
> > +
> > +	kref_init(&zdd->refcount);
> > +	INIT_WORK(&zdd->destroy_work,
> > drm_gpusvm_zdd_destroy_work_func);
> > +	zdd->range = drm_gpusvm_range_get(range);
> > +	zdd->vram_allocation = NULL;
> > +
> > +	return zdd;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_zdd_get - Get a reference to a zdd structure.
> > + * @zdd: Pointer to the zdd structure.
> > + *
> > + * This function increments the reference count of the provided zdd
> > structure.
> > + *
> > + * Returns: Pointer to the zdd structure.
> > + */
> > +static struct drm_gpusvm_zdd *drm_gpusvm_zdd_get(struct
> > drm_gpusvm_zdd *zdd)
> > +{
> > +	kref_get(&zdd->refcount);
> > +	return zdd;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_zdd_destroy - Destroy a zdd structure.
> > + * @ref: Pointer to the reference count structure.
> > + *
> > + * This function queues the destroy_work of the zdd for asynchronous
> > destruction.
> > + */
> > +static void drm_gpusvm_zdd_destroy(struct kref *ref)
> > +{
> > +	struct drm_gpusvm_zdd *zdd =
> > +		container_of(ref, struct drm_gpusvm_zdd, refcount);
> > +	struct drm_gpusvm *gpusvm = zdd->range->gpusvm;
> > +
> > +	queue_work(gpusvm->zdd_wq, &zdd->destroy_work);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_zdd_put - Put a zdd reference.
> > + * @zdd: Pointer to the zdd structure.
> > + *
> > + * This function decrements the reference count of the provided zdd
> > structure
> > + * and schedules its destruction if the count drops to zero.
> > + */
> > +static void drm_gpusvm_zdd_put(struct drm_gpusvm_zdd *zdd)
> > +{
> > +	kref_put(&zdd->refcount, drm_gpusvm_zdd_destroy);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_find - Find GPU SVM range from GPU SVM notifier
> > + * @notifier: Pointer to the GPU SVM notifier structure.
> > + * @start: Start address of the range
> > + * @end: End address of the range
> > + *
> > + * Return: A pointer to the drm_gpusvm_range if found or NULL
> > + */
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64
> > start, u64 end)
> > +{
> > +	return range_iter_first(&notifier->root, start, end - 1);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_for_each_range_safe - Safely iterate over GPU SVM
> > ranges in a notifier
> > + * @range__: Iterator variable for the ranges
> > + * @next__: Iterator variable for the ranges temporary storage
> > + * @notifier__: Pointer to the GPU SVM notifier
> > + * @start__: Start address of the range
> > + * @end__: End address of the range
> > + *
> > + * This macro is used to iterate over GPU SVM ranges in a notifier
> > while
> > + * removing ranges from it.
> > + */
> > +#define drm_gpusvm_for_each_range_safe(range__, next__, notifier__,
> > start__, end__)	\
> > +	for ((range__) = drm_gpusvm_range_find((notifier__),
> > (start__), (end__)),	\
> > +	     (next__) =
> > __drm_gpusvm_range_next(range__);				\
> > +	     (range__) && (range__->va.start <
> > (end__));				\
> > +	     (range__) = (next__), (next__) =
> > __drm_gpusvm_range_next(range__))
> > +
> > +/**
> > + * __drm_gpusvm_notifier_next - get the next drm_gpusvm_notifier in
> > the list
> > + * @notifier: a pointer to the current drm_gpusvm_notifier
> > + *
> > + * Return: A pointer to the next drm_gpusvm_notifier if available,
> > or NULL if
> > + *         the current notifier is the last one or if the input
> > notifier is
> > + *         NULL.
> > + */
> > +static struct drm_gpusvm_notifier *
> > +__drm_gpusvm_notifier_next(struct drm_gpusvm_notifier *notifier)
> > +{
> > +	if (notifier && !list_is_last(&notifier->rb.entry,
> > +				      &notifier->gpusvm-
> > >notifier_list))
> > +		return list_next_entry(notifier, rb.entry);
> > +
> > +	return NULL;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_for_each_notifier - Iterate over GPU SVM notifiers in
> > a gpusvm
> > + * @notifier__: Iterator variable for the notifiers
> > + * @notifier__: Pointer to the GPU SVM notifier
> > + * @start__: Start address of the notifier
> > + * @end__: End address of the notifier
> > + *
> > + * This macro is used to iterate over GPU SVM notifiers in a gpusvm.
> > + */
> > +#define drm_gpusvm_for_each_notifier(notifier__, gpusvm__, start__,
> > end__)		\
> > +	for ((notifier__) = notifier_iter_first(&(gpusvm__)->root,
> > (start__), (end__) - 1);	\
> > +	     (notifier__) && (notifier__->interval.start <
> > (end__));			\
> > +	     (notifier__) = __drm_gpusvm_notifier_next(notifier__))
> > +
> > +/**
> > + * drm_gpusvm_for_each_notifier_safe - Safely iterate over GPU SVM
> > notifiers in a gpusvm
> > + * @notifier__: Iterator variable for the notifiers
> > + * @next__: Iterator variable for the notifiers temporary storage
> > + * @notifier__: Pointer to the GPU SVM notifier
> > + * @start__: Start address of the notifier
> > + * @end__: End address of the notifier
> > + *
> > + * This macro is used to iterate over GPU SVM notifiers in a gpusvm
> > while
> > + * removing notifiers from it.
> > + */
> > +#define drm_gpusvm_for_each_notifier_safe(notifier__, next__,
> > gpusvm__, start__, end__)	\
> > +	for ((notifier__) = notifier_iter_first(&(gpusvm__)->root,
> > (start__), (end__) - 1),	\
> > +	     (next__) =
> > __drm_gpusvm_notifier_next(notifier__);				\
> > +	     (notifier__) && (notifier__->interval.start <
> > (end__));			\
> > +	     (notifier__) = (next__), (next__) =
> > __drm_gpusvm_notifier_next(notifier__))
> > +
> > +/**
> > + * drm_gpusvm_notifier_invalidate - Invalidate a GPU SVM notifier.
> > + * @mni: Pointer to the mmu_interval_notifier structure.
> > + * @mmu_range: Pointer to the mmu_notifier_range structure.
> > + * @cur_seq: Current sequence number.
> > + *
> > + * This function serves as a generic MMU notifier for GPU SVM. It
> > sets the MMU
> > + * notifier sequence number and calls the driver invalidate vfunc
> > under
> > + * gpusvm->notifier_lock.
> > + *
> > + * Returns:
> > + * true if the operation succeeds, false otherwise.
> > + */
> > +static bool
> > +drm_gpusvm_notifier_invalidate(struct mmu_interval_notifier *mni,
> > +			       const struct mmu_notifier_range
> > *mmu_range,
> > +			       unsigned long cur_seq)
> > +{
> > +	struct drm_gpusvm_notifier *notifier =
> > +		container_of(mni, typeof(*notifier), notifier);
> > +	struct drm_gpusvm *gpusvm = notifier->gpusvm;
> > +
> > +	if (!mmu_notifier_range_blockable(mmu_range))
> > +		return false;
> > +
> > +	down_write(&gpusvm->notifier_lock);
> > +	mmu_interval_set_seq(mni, cur_seq);
> > +	gpusvm->ops->invalidate(gpusvm, notifier, mmu_range);
> > +	up_write(&gpusvm->notifier_lock);
> > +
> > +	return true;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_notifier_ops - MMU interval notifier operations for
> > GPU SVM
> > + */
> > +static const struct mmu_interval_notifier_ops
> > drm_gpusvm_notifier_ops = {
> > +	.invalidate = drm_gpusvm_notifier_invalidate,
> > +};
> > +
> > +/**
> > + * drm_gpusvm_init - Initialize the GPU SVM.
> > + * @gpusvm: Pointer to the GPU SVM structure.
> > + * @name: Name of the GPU SVM.
> > + * @drm: Pointer to the DRM device structure.
> > + * @mm: Pointer to the mm_struct for the address space.
> > + * @device_private_page_owner: Device private pages owner.
> > + * @mm_start: Start address of GPU SVM.
> > + * @mm_range: Range of the GPU SVM.
> > + * @notifier_size: Size of individual notifiers.
> > + * @ops: Pointer to the operations structure for GPU SVM.
> > + * @chunk_sizes: Pointer to the array of chunk sizes used in range
> > allocation.
> > + *               Entries should be powers of 2 in descending order
> > with last
> > + *               entry being SZ_4K.
> > + * @num_chunks: Number of chunks.
> > + *
> > + * This function initializes the GPU SVM.
> > + *
> > + * Returns:
> > + * 0 on success, a negative error code on failure.
> > + */
> > +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> > +		    const char *name, struct drm_device *drm,
> > +		    struct mm_struct *mm, void
> > *device_private_page_owner,
> > +		    u64 mm_start, u64 mm_range, u64 notifier_size,
> > +		    const struct drm_gpusvm_ops *ops,
> > +		    const u64 *chunk_sizes, int num_chunks)
> > +{
> > +	if (!ops->invalidate || !num_chunks)
> > +		return -EINVAL;
> > +
> > +	gpusvm->name = name;
> > +	gpusvm->drm = drm;
> > +	gpusvm->mm = mm;
> > +	gpusvm->device_private_page_owner =
> > device_private_page_owner;
> > +	gpusvm->mm_start = mm_start;
> > +	gpusvm->mm_range = mm_range;
> > +	gpusvm->notifier_size = notifier_size;
> > +	gpusvm->ops = ops;
> > +	gpusvm->chunk_sizes = chunk_sizes;
> > +	gpusvm->num_chunks = num_chunks;
> > +	gpusvm->zdd_wq = system_wq;
> > +
> > +	mmgrab(mm);
> > +	gpusvm->root = RB_ROOT_CACHED;
> > +	INIT_LIST_HEAD(&gpusvm->notifier_list);
> > +
> > +	init_rwsem(&gpusvm->notifier_lock);
> > +
> > +	fs_reclaim_acquire(GFP_KERNEL);
> > +	might_lock(&gpusvm->notifier_lock);
> > +	fs_reclaim_release(GFP_KERNEL);
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_notifier_find - Find GPU SVM notifier
> > + * @gpusvm__: Pointer to the GPU SVM structure
> > + * @fault_addr__: Fault address
> > + *
> > + * This macro finds the GPU SVM notifier associated with the fault
> > address.
> > + *
> > + * Returns:
> > + * Pointer to the GPU SVM notifier on success, NULL otherwise.
> > + */
> > +#define drm_gpusvm_notifier_find(gpusvm__, fault_addr__)	\
> > +	notifier_iter_first(&(gpusvm__)->root, (fault_addr__),	\
> > +			    (fault_addr__ + 1))
> > +
> > +/**
> > + * to_drm_gpusvm_notifier - retrieve the container struct for a
> > given rbtree node
> > + * @node__: a pointer to the rbtree node embedded within a
> > drm_gpusvm_notifier struct
> > + *
> > + * Return: A pointer to the containing drm_gpusvm_notifier
> > structure.
> > + */
> > +#define to_drm_gpusvm_notifier(__node)				\
> > +	container_of((__node), struct drm_gpusvm_notifier, rb.node)
> > +
> > +/**
> > + * drm_gpusvm_notifier_insert - Insert GPU SVM notifier
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + *
> > + * This function inserts the GPU SVM notifier into the GPU SVM RB
> > tree and list.
> > + */
> > +static void drm_gpusvm_notifier_insert(struct drm_gpusvm *gpusvm,
> > +				       struct drm_gpusvm_notifier
> > *notifier)
> > +{
> > +	struct rb_node *node;
> > +	struct list_head *head;
> > +
> > +	notifier_insert(notifier, &gpusvm->root);
> > +
> > +	node = rb_prev(&notifier->rb.node);
> > +	if (node)
> > +		head = &(to_drm_gpusvm_notifier(node))->rb.entry;
> > +	else
> > +		head = &gpusvm->notifier_list;
> > +
> > +	list_add(&notifier->rb.entry, head);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_notifier_remove - Remove GPU SVM notifier
> > + * @gpusvm__: Pointer to the GPU SVM structure
> > + * @notifier__: Pointer to the GPU SVM notifier structure
> > + *
> > + * This macro removes the GPU SVM notifier from the GPU SVM RB tree
> > and list.
> > + */
> > +#define drm_gpusvm_notifier_remove(gpusvm__, notifier__)	\
> > +	notifier_remove((notifier__), &(gpusvm__)->root);	\
> > +	list_del(&(notifier__)->rb.entry)
> > +
> > +/**
> > + * drm_gpusvm_fini - Finalize the GPU SVM.
> > + * @gpusvm: Pointer to the GPU SVM structure.
> > + *
> > + * This function finalizes the GPU SVM by cleaning up any remaining
> > ranges and
> > + * notifiers, and dropping a reference to struct MM.
> > + */
> > +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm)
> > +{
> > +	struct drm_gpusvm_notifier *notifier, *next;
> > +
> > +	drm_gpusvm_for_each_notifier_safe(notifier, next, gpusvm, 0,
> > LONG_MAX) {
> > +		struct drm_gpusvm_range *range, *__next;
> > +
> > +		/*
> > +		 * Remove notifier first to avoid racing with any
> > invalidation
> > +		 */
> > +		mmu_interval_notifier_remove(&notifier->notifier);
> > +		notifier->flags.removed = true;
> > +
> > +		drm_gpusvm_for_each_range_safe(range, __next,
> > notifier, 0,
> > +					       LONG_MAX)
> > +			drm_gpusvm_range_remove(gpusvm, range);
> > +	}
> > +
> > +	mmdrop(gpusvm->mm);
> > +	WARN_ON(!RB_EMPTY_ROOT(&gpusvm->root.rb_root));
> > +}
> > +
> > +/**
> > + * drm_gpusvm_notifier_alloc - Allocate GPU SVM notifier
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @fault_addr: Fault address
> > + *
> > + * This function allocates and initializes the GPU SVM notifier
> > structure.
> > + *
> > + * Returns:
> > + * Pointer to the allocated GPU SVM notifier on success, ERR_PTR()
> > on failure.
> > + */
> > +static struct drm_gpusvm_notifier *
> > +drm_gpusvm_notifier_alloc(struct drm_gpusvm *gpusvm, u64 fault_addr)
> > +{
> > +	struct drm_gpusvm_notifier *notifier;
> > +
> > +	if (gpusvm->ops->notifier_alloc)
> > +		notifier = gpusvm->ops->notifier_alloc();
> > +	else
> > +		notifier = kzalloc(sizeof(*notifier), GFP_KERNEL);
> > +
> > +	if (!notifier)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	notifier->gpusvm = gpusvm;
> > +	notifier->interval.start = ALIGN_DOWN(fault_addr, gpusvm-
> > >notifier_size);
> > +	notifier->interval.end = ALIGN(fault_addr + 1, gpusvm-
> > >notifier_size);
> > +	INIT_LIST_HEAD(&notifier->rb.entry);
> > +	notifier->root = RB_ROOT_CACHED;
> > +	INIT_LIST_HEAD(&notifier->range_list);
> > +
> > +	return notifier;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_notifier_free - Free GPU SVM notifier
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + *
> > + * This function frees the GPU SVM notifier structure.
> > + */
> > +static void drm_gpusvm_notifier_free(struct drm_gpusvm *gpusvm,
> > +				     struct drm_gpusvm_notifier
> > *notifier)
> > +{
> > +	WARN_ON(!RB_EMPTY_ROOT(&notifier->root.rb_root));
> > +
> > +	if (gpusvm->ops->notifier_free)
> > +		gpusvm->ops->notifier_free(notifier);
> > +	else
> > +		kfree(notifier);
> > +}
> > +
> > +/**
> > + * to_drm_gpusvm_range - retrieve the container struct for a given
> > rbtree node
> > + * @node__: a pointer to the rbtree node embedded within a
> > drm_gpusvm_range struct
> > + *
> > + * Return: A pointer to the containing drm_gpusvm_range structure.
> > + */
> > +#define to_drm_gpusvm_range(node__)	\
> > +	container_of((node__), struct drm_gpusvm_range, rb.node)
> > +
> > +/**
> > + * drm_gpusvm_range_insert - Insert GPU SVM range
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + * @range: Pointer to the GPU SVM range structure
> > + *
> > + * This function inserts the GPU SVM range into the notifier RB tree
> > and list.
> > + */
> > +static void drm_gpusvm_range_insert(struct drm_gpusvm_notifier
> > *notifier,
> > +				    struct drm_gpusvm_range *range)
> > +{
> > +	struct rb_node *node;
> > +	struct list_head *head;
> > +
> > +	drm_gpusvm_notifier_lock(notifier->gpusvm);
> > +	range_insert(range, &notifier->root);
> > +
> > +	node = rb_prev(&range->rb.node);
> > +	if (node)
> > +		head = &(to_drm_gpusvm_range(node))->rb.entry;
> > +	else
> > +		head = &notifier->range_list;
> > +
> > +	list_add(&range->rb.entry, head);
> > +	drm_gpusvm_notifier_unlock(notifier->gpusvm);
> > +}
> > +
> > +/**
> > + * __drm_gpusvm_range_remove - Remove GPU SVM range
> > + * @notifier__: Pointer to the GPU SVM notifier structure
> > + * @range__: Pointer to the GPU SVM range structure
> > + *
> > + * This macro removes the GPU SVM range from the notifier RB tree
> > and list.
> > + */
> > +#define __drm_gpusvm_range_remove(notifier__, range__)		\
> > +	range_remove((range__), &(notifier__)->root);		\
> > +	list_del(&(range__)->rb.entry)
> > +
> > +/**
> > + * drm_gpusvm_range_alloc - Allocate GPU SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + * @fault_addr: Fault address
> > + * @chunk_size: Chunk size
> > + * @migrate_vram: Flag indicating whether to migrate VRAM
> > + *
> > + * This function allocates and initializes the GPU SVM range
> > structure.
> > + *
> > + * Returns:
> > + * Pointer to the allocated GPU SVM range on success, ERR_PTR() on
> > failure.
> > + */
> > +static struct drm_gpusvm_range *
> > +drm_gpusvm_range_alloc(struct drm_gpusvm *gpusvm,
> > +		       struct drm_gpusvm_notifier *notifier,
> > +		       u64 fault_addr, u64 chunk_size, bool
> > migrate_vram)
> > +{
> > +	struct drm_gpusvm_range *range;
> > +
> > +	if (gpusvm->ops->range_alloc)
> > +		range = gpusvm->ops->range_alloc(gpusvm);
> > +	else
> > +		range = kzalloc(sizeof(*range), GFP_KERNEL);
> > +
> > +	if (!range)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	kref_init(&range->refcount);
> > +	range->gpusvm = gpusvm;
> > +	range->notifier = notifier;
> > +	range->va.start = ALIGN_DOWN(fault_addr, chunk_size);
> > +	range->va.end = ALIGN(fault_addr + 1, chunk_size);
> > +	INIT_LIST_HEAD(&range->rb.entry);
> > +	range->notifier_seq = LONG_MAX;
> > +	range->flags.migrate_vram = migrate_vram ? 1 : 0;
> > +
> > +	return range;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_check_pages - Check pages
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + * @start: Start address
> > + * @end: End address
> > + *
> > + * Check if pages between start and end have been faulted in on the
> > CPU. Used to
> > + * prevent migration of pages without CPU backing store.
> > + *
> > + * Returns:
> > + * True if pages have been faulted into CPU, False otherwise
> > + */
> > +static bool drm_gpusvm_check_pages(struct drm_gpusvm *gpusvm,
> > +				   struct drm_gpusvm_notifier
> > *notifier,
> > +				   u64 start, u64 end)
> > +{
> > +	struct hmm_range hmm_range = {
> > +		.default_flags = 0,
> > +		.notifier = &notifier->notifier,
> > +		.start = start,
> > +		.end = end,
> > +		.dev_private_owner = gpusvm-
> > >device_private_page_owner,
> > +	};
> > +	unsigned long timeout =
> > +		jiffies +
> > msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > +	unsigned long *pfns;
> > +	unsigned long npages = npages_in_range(start, end);
> > +	int err, i;
> > +
> > +	mmap_assert_locked(gpusvm->mm);
> > +
> > +	pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
> > +	if (!pfns)
> > +		return false;
> > +
> > +	hmm_range.notifier_seq = mmu_interval_read_begin(&notifier-
> > >notifier);
> > +	hmm_range.hmm_pfns = pfns;
> > +
> > +	while (true) {
> > +		err = hmm_range_fault(&hmm_range);
> > +		if (err == -EBUSY) {
> > +			if (time_after(jiffies, timeout))
> > +				break;
> > +
> > +			hmm_range.notifier_seq =
> > mmu_interval_read_begin(&notifier->notifier);
> > +			continue;
> > +		}
> > +		break;
> > +	}
> > +	if (err)
> > +		goto err_free;
> > +
> > +	for (i = 0; i < npages; ++i) {
> > +		if (!(pfns[i] & HMM_PFN_VALID)) {
> > +			err = -EFAULT;
> > +			goto err_free;
> > +		}
> > +	}
> > +
> > +err_free:
> > +	kvfree(pfns);
> > +	return err ? false : true;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_chunk_size - Determine chunk size for GPU SVM
> > range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + * @vas: Pointer to the virtual memory area structure
> > + * @fault_addr: Fault address
> > + * @gpuva_start: Start address of GPUVA which mirrors CPU
> > + * @gpuva_end: End address of GPUVA which mirrors CPU
> > + * @check_pages: Flag indicating whether to check pages
> > + *
> > + * This function determines the chunk size for the GPU SVM range
> > based on the
> > + * fault address, GPU SVM chunk sizes, existing GPU SVM ranges, and
> > the virtual
> > + * memory area boundaries.
> > + *
> > + * Returns:
> > + * Chunk size on success, LONG_MAX on failure.
> > + */
> > +static u64 drm_gpusvm_range_chunk_size(struct drm_gpusvm *gpusvm,
> > +				       struct drm_gpusvm_notifier
> > *notifier,
> > +				       struct vm_area_struct *vas,
> > +				       u64 fault_addr, u64
> > gpuva_start,
> > +				       u64 gpuva_end, bool
> > check_pages)
> > +{
> > +	u64 start, end;
> > +	int i = 0;
> > +
> > +retry:
> > +	for (; i < gpusvm->num_chunks; ++i) {
> > +		start = ALIGN_DOWN(fault_addr, gpusvm-
> > >chunk_sizes[i]);
> > +		end = ALIGN(fault_addr + 1, gpusvm->chunk_sizes[i]);
> > +
> > +		if (start >= vas->vm_start && end <= vas->vm_end &&
> > +		    start >= notifier->interval.start &&
> > +		    end <= notifier->interval.end &&
> > +		    start >= gpuva_start && end <= gpuva_end)
> > +			break;
> > +	}
> > +
> > +	if (i == gpusvm->num_chunks)
> > +		return LONG_MAX;
> > +
> > +	/*
> > +	 * If allocating more than a page, ensure not to overlap with
> > existing
> > +	 * ranges.
> > +	 */
> > +	if (end - start != SZ_4K) {
> > +		struct drm_gpusvm_range *range;
> > +
> > +		range = drm_gpusvm_range_find(notifier, start, end);
> > +		if (range) {
> > +			++i;
> > +			goto retry;
> > +		}
> > +
> > +		/*
> > +		 * XXX: Only create range on pages CPU has faulted
> > in. Without
> > +		 * this check, or prefault, on BMG
> > 'xe_exec_system_allocator --r
> > +		 * process-many-malloc' fails. In the failure case,
> > each process
> > +		 * mallocs 16k but the CPU VMA is ~128k which
> > results in 64k SVM
> > +		 * ranges. When migrating the SVM ranges, some
> > processes fail in
> > +		 * drm_gpusvm_migrate_to_vram with 'migrate.cpages
> > != npages'
> > +		 * and then upon drm_gpusvm_range_get_pages device
> > pages from
> > +		 * other processes are collected + faulted in which
> > creates all
> > +		 * sorts of problems. Unsure exactly how this
> > happening, also
> > +		 * problem goes away if 'xe_exec_system_allocator --
> > r
> > +		 * process-many-malloc' mallocs at least 64k at a
> > time.
> > +		 */
> > +		if (check_pages &&
> > +		    !drm_gpusvm_check_pages(gpusvm, notifier, start,
> > end)) {
> > +			++i;
> > +			goto retry;
> > +		}
> > +	}
> > +
> > +	return end - start;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_find_or_insert - Find or insert GPU SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @fault_addr: Fault address
> > + * @gpuva_start: Start address of GPUVA which mirrors CPU
> > + * @gpuva_end: End address of GPUVA which mirrors CPU
> > + * @ctx: GPU SVM context
> > + *
> > + * This function finds or inserts a newly allocated GPU SVM range
> > based on the
> > + * fault address. Caller must hold a lock to protect range lookup
> > and insertion.
> > + *
> > + * Returns:
> > + * Pointer to the GPU SVM range on success, ERR_PTR() on failure.
> > + */
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64
> > fault_addr,
> > +				u64 gpuva_start, u64 gpuva_end,
> > +				const struct drm_gpusvm_ctx *ctx)
> > +{
> > +	struct drm_gpusvm_notifier *notifier;
> > +	struct drm_gpusvm_range *range;
> > +	struct mm_struct *mm = gpusvm->mm;
> > +	struct vm_area_struct *vas;
> > +	bool notifier_alloc = false;
> > +	u64 chunk_size;
> > +	int err;
> > +	bool migrate_vram;
> > +
> > +	if (fault_addr < gpusvm->mm_start ||
> > +	    fault_addr > gpusvm->mm_start + gpusvm->mm_range) {
> > +		err = -EINVAL;
> > +		goto err_out;
> > +	}
> > +
> > +	if (!ctx->mmap_locked) {
> > +		if (!mmget_not_zero(mm)) {
> > +			err = -EFAULT;
> > +			goto err_out;
> > +		}
> > +		mmap_write_lock(mm);
> > +	}
> > +
> > +	mmap_assert_write_locked(mm);
> > +
> > +	notifier = drm_gpusvm_notifier_find(gpusvm, fault_addr);
> > +	if (!notifier) {
> > +		notifier = drm_gpusvm_notifier_alloc(gpusvm,
> > fault_addr);
> > +		if (IS_ERR(notifier)) {
> > +			err = PTR_ERR(notifier);
> > +			goto err_mmunlock;
> > +		}
> > +		notifier_alloc = true;
> > +		err = mmu_interval_notifier_insert_locked(&notifier-
> > >notifier,
> > +							  mm,
> > notifier->interval.start,
> > +							  notifier-
> > >interval.end -
> > +							  notifier-
> > >interval.start,
> > +							 
> > &drm_gpusvm_notifier_ops);
> > +		if (err)
> > +			goto err_notifier;
> > +	}
> > +
> > +	vas = vma_lookup(mm, fault_addr);
> > +	if (!vas) {
> > +		err = -ENOENT;
> > +		goto err_notifier_remove;
> > +	}
> > +
> > +	if (!ctx->read_only && !(vas->vm_flags & VM_WRITE)) {
> > +		err = -EPERM;
> > +		goto err_notifier_remove;
> > +	}
> > +
> > +	range = drm_gpusvm_range_find(notifier, fault_addr,
> > fault_addr + 1);
> > +	if (range)
> > +		goto out_mmunlock;
> > +	/*
> > +	 * XXX: Short-circuiting migration based on migrate_vma_*
> > current
> > +	 * limitations. If/when migrate_vma_* add more support, this
> > logic will
> > +	 * have to change.
> > +	 */
> > +	migrate_vram = ctx->vram_possible &&
> > +		vma_is_anonymous(vas) && !is_vm_hugetlb_page(vas);
> > +
> > +	chunk_size = drm_gpusvm_range_chunk_size(gpusvm, notifier,
> > vas,
> > +						 fault_addr,
> > gpuva_start,
> > +						 gpuva_end,
> > migrate_vram &&
> > +						 !ctx->prefault);
> > +	if (chunk_size == LONG_MAX) {
> > +		err = -EINVAL;
> > +		goto err_notifier_remove;
> > +	}
> > +
> > +	range = drm_gpusvm_range_alloc(gpusvm, notifier, fault_addr,
> > chunk_size,
> > +				       migrate_vram);
> > +	if (IS_ERR(range)) {
> > +		err = PTR_ERR(range);
> > +		goto err_notifier_remove;
> > +	}
> > +
> > +	drm_gpusvm_range_insert(notifier, range);
> > +	if (notifier_alloc)
> > +		drm_gpusvm_notifier_insert(gpusvm, notifier);
> > +
> > +	if (ctx->prefault) {
> > +		struct drm_gpusvm_ctx __ctx = *ctx;
> > +
> > +		__ctx.mmap_locked = true;
> > +		err = drm_gpusvm_range_get_pages(gpusvm, range,
> > &__ctx);
> > +		if (err)
> > +			goto err_range_remove;
> > +	}
> > +
> > +out_mmunlock:
> > +	if (!ctx->mmap_locked) {
> > +		mmap_write_unlock(mm);
> > +		mmput(mm);
> > +	}
> > +
> > +	return range;
> > +
> > +err_range_remove:
> > +	__drm_gpusvm_range_remove(notifier, range);
> > +err_notifier_remove:
> > +	if (notifier_alloc)
> > +		mmu_interval_notifier_remove(&notifier->notifier);
> > +err_notifier:
> > +	if (notifier_alloc)
> > +		drm_gpusvm_notifier_free(gpusvm, notifier);
> > +err_mmunlock:
> > +	if (!ctx->mmap_locked) {
> > +		mmap_write_unlock(mm);
> > +		mmput(mm);
> > +	}
> > +err_out:
> > +	return ERR_PTR(err);
> > +}
> > +
> > +/**
> > + * for_each_dma_page - iterate over pages in a DMA regio`n
> > + * @i__: the current page index in the iteration
> > + * @j__: the current page index, log order, in the iteration
> > + * @npages__: the total number of pages in the DMA region
> > + * @order__: the order of the pages in the DMA region
> > + *
> > + * This macro iterates over each page in a DMA region. The DMA
> > region
> > + * is assumed to be composed of 2^@order__ pages, and the macro will
> > + * step through the region one block of 2^@order__ pages at a time.
> > + */
> > +#define for_each_dma_page(i__, j__, npages__, order__)	\
> > +	for ((i__) = 0, (j__) = 0; (i__) < (npages__);	\
> > +	     (j__)++, (i__) += 0x1 << (order__))
> > +
> > +/**
> > + * __drm_gpusvm_range_unmap_pages - Unmap pages associated with a
> > GPU SVM range (internal)
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + *
> > + * This function unmaps pages associated with a GPU SVM range.
> > Assumes and
> > + * asserts correct locking is in place when called.
> > + */
> > +static void __drm_gpusvm_range_unmap_pages(struct drm_gpusvm
> > *gpusvm,
> > +					   struct drm_gpusvm_range
> > *range)
> > +{
> > +	lockdep_assert_held(&gpusvm->notifier_lock);
> > +
> > +	if (range->pages) {
> > +		unsigned long i, j, npages = npages_in_range(range-
> > >va.start,
> > +							     range-
> > >va.end);
> > +
> > +		if (range->flags.has_dma_mapping) {
> > +			for_each_dma_page(i, j, npages, range-
> > >order)
> > +				dma_unmap_page(gpusvm->drm->dev,
> > +					       range->dma_addr[j],
> > +					       PAGE_SIZE << range-
> > >order,
> > +					       DMA_BIDIRECTIONAL);
> > +		}
> > +
> > +		range->flags.has_vram_pages = false;
> > +		range->flags.has_dma_mapping = false;
> > +	}
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_free_pages - Free pages associated with a GPU
> > SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + *
> > + * This function frees pages associated with a GPU SVM range.
> > + */
> > +static void drm_gpusvm_range_free_pages(struct drm_gpusvm *gpusvm,
> > +					struct drm_gpusvm_range
> > *range)
> > +{
> > +	lockdep_assert_held(&gpusvm->notifier_lock);
> > +
> > +	if (range->pages) {
> > +		if (range->flags.kfree_mapping) {
> > +			kfree(range->dma_addr);
> > +			range->flags.kfree_mapping = false;
> > +			range->pages = NULL;
> > +		} else {
> > +			kvfree(range->pages);
> > +			range->pages = NULL;
> > +		}
> > +	}
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_remove - Remove GPU SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range to be removed
> > + *
> > + * This function removes the specified GPU SVM range and also
> > removes the parent
> > + * GPU SVM notifier if no more ranges remain in the notifier. The
> > caller must
> > + * hold a lock to protect range and notifier removal.
> > + */
> > +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> > +			     struct drm_gpusvm_range *range)
> > +{
> > +	struct drm_gpusvm_notifier *notifier;
> > +
> > +	notifier = drm_gpusvm_notifier_find(gpusvm, range-
> > >va.start);
> > +	if (WARN_ON_ONCE(!notifier))
> > +		return;
> > +
> > +	drm_gpusvm_notifier_lock(gpusvm);
> > +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > +	drm_gpusvm_range_free_pages(gpusvm, range);
> > +	__drm_gpusvm_range_remove(notifier, range);
> > +	drm_gpusvm_notifier_unlock(gpusvm);
> > +
> > +	drm_gpusvm_range_put(range);
> > +
> > +	if (RB_EMPTY_ROOT(&notifier->root.rb_root)) {
> > +		if (!notifier->flags.removed)
> > +			mmu_interval_notifier_remove(&notifier-
> > >notifier);
> > +		drm_gpusvm_notifier_remove(gpusvm, notifier);
> > +		drm_gpusvm_notifier_free(gpusvm, notifier);
> > +	}
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_get - Get a reference to GPU SVM range
> > + * @range: Pointer to the GPU SVM range
> > + *
> > + * This function increments the reference count of the specified GPU
> > SVM range.
> > + *
> > + * Returns:
> > + * Pointer to the GPU SVM range.
> > + */
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_get(struct drm_gpusvm_range *range)
> > +{
> > +	kref_get(&range->refcount);
> > +
> > +	return range;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_destroy - Destroy GPU SVM range
> > + * @refcount: Pointer to the reference counter embedded in the GPU
> > SVM range
> > + *
> > + * This function destroys the specified GPU SVM range when its
> > reference count
> > + * reaches zero. If a custom range-free function is provided, it is
> > invoked to
> > + * free the range; otherwise, the range is deallocated using
> > kfree().
> > + */
> > +static void drm_gpusvm_range_destroy(struct kref *refcount)
> > +{
> > +	struct drm_gpusvm_range *range =
> > +		container_of(refcount, struct drm_gpusvm_range,
> > refcount);
> > +	struct drm_gpusvm *gpusvm = range->gpusvm;
> > +
> > +	if (gpusvm->ops->range_free)
> > +		gpusvm->ops->range_free(range);
> > +	else
> > +		kfree(range);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_put - Put a reference to GPU SVM range
> > + * @range: Pointer to the GPU SVM range
> > + *
> > + * This function decrements the reference count of the specified GPU
> > SVM range
> > + * and frees it when the count reaches zero.
> > + */
> > +void drm_gpusvm_range_put(struct drm_gpusvm_range *range)
> > +{
> > +	kref_put(&range->refcount, drm_gpusvm_range_destroy);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_pages_valid - GPU SVM range pages valid
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + *
> > + * This function determines if a GPU SVM range's pages are valid.
> > Expected to be
> > + * called holding gpusvm->notifier_lock and as the last step before
> > committing a
> > + * GPU binding.
> > + *
> > + * Returns:
> > + * True if GPU SVM range has valid pages, False otherwise
> > + */
> > +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> > +				  struct drm_gpusvm_range *range)
> > +{
> > +	lockdep_assert_held(&gpusvm->notifier_lock);
> > +
> > +	return range->flags.has_vram_pages || range-
> > >flags.has_dma_mapping;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_pages_valid_unlocked - GPU SVM range pages valid
> > unlocked
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + *
> > + * This function determines if a GPU SVM range's pages are valid.
> > Expected to be
> > + * called without holding gpusvm->notifier_lock.
> > + *
> > + * Returns:
> > + * True if GPU SVM range has valid pages, False otherwise
> > + */
> > +static bool
> > +drm_gpusvm_range_pages_valid_unlocked(struct drm_gpusvm *gpusvm,
> > +				      struct drm_gpusvm_range
> > *range)
> > +{
> > +	bool pages_valid;
> > +
> > +	if (!range->pages)
> > +		return false;
> > +
> > +	drm_gpusvm_notifier_lock(gpusvm);
> > +	pages_valid = drm_gpusvm_range_pages_valid(gpusvm, range);
> > +	if (!pages_valid && range->flags.kfree_mapping) {
> > +		kfree(range->dma_addr);
> > +		range->flags.kfree_mapping = false;
> > +		range->pages = NULL;
> > +	}
> > +	drm_gpusvm_notifier_unlock(gpusvm);
> > +
> > +	return pages_valid;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_get_pages - Get pages for a GPU SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + * @ctx: GPU SVM context
> > + *
> > + * This function gets pages for a GPU SVM range and ensures they are
> > mapped for
> > + * DMA access.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> > +			       struct drm_gpusvm_range *range,
> > +			       const struct drm_gpusvm_ctx *ctx)
> > +{
> > +	struct mmu_interval_notifier *notifier = &range->notifier-
> > >notifier;
> > +	struct hmm_range hmm_range = {
> > +		.default_flags = HMM_PFN_REQ_FAULT | (ctx->read_only
> > ? 0 :
> > +			HMM_PFN_REQ_WRITE),
> > +		.notifier = notifier,
> > +		.start = range->va.start,
> > +		.end = range->va.end,
> > +		.dev_private_owner = gpusvm-
> > >device_private_page_owner,
> > +	};
> > +	struct mm_struct *mm = gpusvm->mm;
> > +	unsigned long timeout =
> > +		jiffies +
> > msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > +	unsigned long i, j;
> > +	unsigned long npages = npages_in_range(range->va.start,
> > range->va.end);
> > +	unsigned int order = 0;
> > +	unsigned long *pfns;
> > +	struct page **pages;
> > +	int err = 0;
> > +	bool vram_pages = !!range->flags.migrate_vram;
> > +	bool alloc_pfns = false, kfree_mapping;
> > +
> > +retry:
> > +	kfree_mapping = false;
> > +	hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
> > +	if (drm_gpusvm_range_pages_valid_unlocked(gpusvm, range))
> > +		return 0;
> > +
> > +	if (range->notifier_seq == hmm_range.notifier_seq && range-
> > >pages) {
> > +		if (ctx->prefault)
> > +			return 0;
> > +
> > +		pfns = (unsigned long *)range->pages;
> > +		pages = range->pages;
> > +		goto map_pages;
> > +	}
> > +
> > +	if (!range->pages) {
> > +		pfns = kvmalloc_array(npages, sizeof(*pfns),
> > GFP_KERNEL);
> > +		if (!pfns)
> > +			return -ENOMEM;
> > +		alloc_pfns = true;
> > +	} else {
> > +		pfns = (unsigned long *)range->pages;
> > +	}
> > +
> > +	if (!ctx->mmap_locked) {
> > +		if (!mmget_not_zero(mm)) {
> > +			err = -EFAULT;
> > +			goto err_out;
> > +		}
> > +	}
> > +
> > +	hmm_range.hmm_pfns = pfns;
> > +	while (true) {
> > +		/* Must be checked after mmu_interval_read_begin */
> > +		if (range->flags.unmapped) {
> > +			err = -EFAULT;
> > +			break;
> > +		}
> > +
> > +		if (!ctx->mmap_locked) {
> > +			/*
> > +			 * XXX: HMM locking document indicates only
> > a read-lock
> > +			 * is required but there appears to be a
> > window between
> > +			 * the MMU_NOTIFY_MIGRATE event triggered in
> > a CPU fault
> > +			 * via migrate_vma_setup and the pages
> > actually moving
> > +			 * in migrate_vma_finalize in which this
> > code can grab
> > +			 * garbage pages. Grabbing the write-lock if
> > the range
> > +			 * is attached to vram appears to protect
> > against this
> > +			 * race.
> > +			 */
> > +			if (vram_pages)
> > +				mmap_write_lock(mm);
> > +			else
> > +				mmap_read_lock(mm);
> > +		}
> > +		err = hmm_range_fault(&hmm_range);
> > +		if (!ctx->mmap_locked) {
> > +			if (vram_pages)
> > +				mmap_write_unlock(mm);
> > +			else
> > +				mmap_read_unlock(mm);
> > +		}
> > +
> > +		if (err == -EBUSY) {
> > +			if (time_after(jiffies, timeout))
> > +				break;
> > +
> > +			hmm_range.notifier_seq =
> > mmu_interval_read_begin(notifier);
> > +			continue;
> > +		}
> > +		break;
> > +	}
> > +	if (!ctx->mmap_locked)
> > +		mmput(mm);
> > +	if (err)
> > +		goto err_free;
> > +
> > +	pages = (struct page **)pfns;
> > +
> > +	if (ctx->prefault) {
> > +		range->pages = pages;
> > +		goto set_seqno;
> > +	}
> > +
> > +map_pages:
> > +	if (is_device_private_page(hmm_pfn_to_page(pfns[0]))) {
> > +		WARN_ON_ONCE(!range->vram_allocation);
> > +
> > +		for (i = 0; i < npages; ++i) {
> > +			pages[i] = hmm_pfn_to_page(pfns[i]);
> > +
> > +			if
> > (WARN_ON_ONCE(!is_device_private_page(pages[i]))) {
> > +				err = -EOPNOTSUPP;
> > +				goto err_free;
> > +			}
> > +		}
> > +
> > +		/* Do not race with notifier unmapping pages */
> > +		drm_gpusvm_notifier_lock(gpusvm);
> > +		range->flags.has_vram_pages = true;
> > +		range->pages = pages;
> > +		if (mmu_interval_read_retry(notifier,
> > hmm_range.notifier_seq)) {
> > +			err = -EAGAIN;
> > +			__drm_gpusvm_range_unmap_pages(gpusvm,
> > range);
> > +		}
> > +		drm_gpusvm_notifier_unlock(gpusvm);
> > +	} else {
> > +		dma_addr_t *dma_addr = (dma_addr_t *)pfns;
> > +
> > +		for_each_dma_page(i, j, npages, order) {
> > +			if (WARN_ON_ONCE(i && order !=
> > +					
> > hmm_pfn_to_map_order(pfns[i]))) {
> > +				err = -EOPNOTSUPP;
> > +				npages = i;
> > +				goto err_unmap;
> > +			}
> > +			order = hmm_pfn_to_map_order(pfns[i]);
> > +
> > +			pages[j] = hmm_pfn_to_page(pfns[i]);
> > +			if
> > (WARN_ON_ONCE(is_zone_device_page(pages[j]))) {
> > +				err = -EOPNOTSUPP;
> > +				npages = i;
> > +				goto err_unmap;
> > +			}
> > +
> > +			set_page_dirty_lock(pages[j]);
> > +			mark_page_accessed(pages[j]);
> > +
> > +			dma_addr[j] = dma_map_page(gpusvm->drm->dev,
> > +						   pages[j], 0,
> > +						   PAGE_SIZE <<
> > order,
> > +						  
> > DMA_BIDIRECTIONAL);
> > +			if (dma_mapping_error(gpusvm->drm->dev,
> > dma_addr[j])) {
> > +				err = -EFAULT;
> > +				npages = i;
> > +				goto err_unmap;
> > +			}
> > +		}
> > +
> > +		/* Huge pages, reduce memory footprint */
> > +		if (order) {
> > +			dma_addr = kmalloc_array(j,
> > sizeof(*dma_addr),
> > +						 GFP_KERNEL);
> > +			if (dma_addr) {
> > +				for (i = 0; i < j; ++i)
> > +					dma_addr[i] =
> > (dma_addr_t)pfns[i];
> > +				kvfree(pfns);
> > +				kfree_mapping = true;
> > +			} else {
> > +				dma_addr = (dma_addr_t *)pfns;
> > +			}
> > +		}
> > +
> > +		/* Do not race with notifier unmapping pages */
> > +		drm_gpusvm_notifier_lock(gpusvm);
> > +		range->order = order;
> > +		range->flags.kfree_mapping = kfree_mapping;
> > +		range->flags.has_dma_mapping = true;
> > +		range->dma_addr = dma_addr;
> > +		range->vram_allocation = NULL;
> > +		if (mmu_interval_read_retry(notifier,
> > hmm_range.notifier_seq)) {
> > +			err = -EAGAIN;
> > +			__drm_gpusvm_range_unmap_pages(gpusvm,
> > range);
> > +		}
> > +		drm_gpusvm_notifier_unlock(gpusvm);
> > +	}
> > +
> > +	if (err == -EAGAIN)
> > +		goto retry;
> > +set_seqno:
> > +	range->notifier_seq = hmm_range.notifier_seq;
> > +
> > +	return 0;
> > +
> > +err_unmap:
> > +	for_each_dma_page(i, j, npages, order)
> > +		dma_unmap_page(gpusvm->drm->dev,
> > +			       (dma_addr_t)pfns[j],
> > +			       PAGE_SIZE << order,
> > DMA_BIDIRECTIONAL);
> > +err_free:
> > +	if (alloc_pfns)
> > +		kvfree(pfns);
> > +err_out:
> > +	return err;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_unmap_pages - Unmap pages associated with a GPU
> > SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + * @ctx: GPU SVM context
> > + *
> > + * This function unmaps pages associated with a GPU SVM range. If
> > @in_notifier
> > + * is set, it is assumed that gpusvm->notifier_lock is held in write
> > mode; if it
> > + * is clear, it acquires gpusvm->notifier_lock in read mode. Must be
> > called on
> > + * each GPU SVM range attached to notifier in gpusvm->ops-
> > >invalidate for IOMMU
> > + * security model.
> > + */
> > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> > +				  struct drm_gpusvm_range *range,
> > +				  const struct drm_gpusvm_ctx *ctx)
> > +{
> > +	if (ctx->in_notifier)
> > +		lockdep_assert_held_write(&gpusvm->notifier_lock);
> > +	else
> > +		drm_gpusvm_notifier_lock(gpusvm);
> > +
> > +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > +
> > +	if (!ctx->in_notifier)
> > +		drm_gpusvm_notifier_unlock(gpusvm);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migration_put_page - Put a migration page
> > + * @page: Pointer to the page to put
> > + *
> > + * This function unlocks and puts a page.
> > + */
> > +static void drm_gpusvm_migration_put_page(struct page *page)
> > +{
> > +	unlock_page(page);
> > +	put_page(page);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migration_put_pages - Put migration pages
> > + * @npages: Number of pages
> > + * @migrate_pfn: Array of migrate page frame numbers
> > + *
> > + * This function puts an array of pages.
> > + */
> > +static void drm_gpusvm_migration_put_pages(unsigned long npages,
> > +					   unsigned long
> > *migrate_pfn)
> > +{
> > +	unsigned long i;
> > +
> > +	for (i = 0; i < npages; ++i) {
> > +		if (!migrate_pfn[i])
> > +			continue;
> > +
> > +		drm_gpusvm_migration_put_page(migrate_pfn_to_page(mi
> > grate_pfn[i]));
> > +		migrate_pfn[i] = 0;
> > +	}
> > +}
> > +
> > +/**
> > + * drm_gpusvm_get_vram_page - Get a reference to a VRAM page
> > + * @page: Pointer to the page
> > + * @zdd: Pointer to the GPU SVM zone device data
> > + *
> > + * This function associates the given page with the specified GPU
> > SVM zone
> > + * device data and initializes it for zone device usage.
> > + */
> > +static void drm_gpusvm_get_vram_page(struct page *page,
> > +				     struct drm_gpusvm_zdd *zdd)
> > +{
> > +	page->zone_device_data = drm_gpusvm_zdd_get(zdd);
> > +	zone_device_page_init(page);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_map_pages() - Map migration pages for GPU SVM
> > migration
> > + * @dev: The device for which the pages are being mapped
> > + * @dma_addr: Array to store DMA addresses corresponding to mapped
> > pages
> > + * @migrate_pfn: Array of migrate page frame numbers to map
> > + * @npages: Number of pages to map
> > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > + *
> > + * This function maps pages of memory for migration usage in GPU
> > SVM. It
> > + * iterates over each page frame number provided in @migrate_pfn,
> > maps the
> > + * corresponding page, and stores the DMA address in the provided
> > @dma_addr
> > + * array.
> > + *
> > + * Return: 0 on success, -EFAULT if an error occurs during mapping.
> > + */
> > +static int drm_gpusvm_migrate_map_pages(struct device *dev,
> > +					dma_addr_t *dma_addr,
> > +					long unsigned int
> > *migrate_pfn,
> > +					unsigned long npages,
> > +					enum dma_data_direction dir)
> > +{
> > +	unsigned long i;
> > +
> > +	for (i = 0; i < npages; ++i) {
> > +		struct page *page =
> > migrate_pfn_to_page(migrate_pfn[i]);
> > +
> > +		if (!page)
> > +			continue;
> > +
> > +		if (WARN_ON_ONCE(is_zone_device_page(page)))
> > +			return -EFAULT;
> > +
> > +		dma_addr[i] = dma_map_page(dev, page, 0, PAGE_SIZE,
> > dir);
> > +		if (dma_mapping_error(dev, dma_addr[i]))
> > +			return -EFAULT;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_unmap_pages() - Unmap pages previously mapped
> > for GPU SVM migration
> > + * @dev: The device for which the pages were mapped
> > + * @dma_addr: Array of DMA addresses corresponding to mapped pages
> > + * @npages: Number of pages to unmap
> > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > + *
> > + * This function unmaps previously mapped pages of memory for GPU
> > Shared Virtual
> > + * Memory (SVM). It iterates over each DMA address provided in
> > @dma_addr, checks
> > + * if it's valid and not already unmapped, and unmaps the
> > corresponding page.
> > + */
> > +static void drm_gpusvm_migrate_unmap_pages(struct device *dev,
> > +					   dma_addr_t *dma_addr,
> > +					   unsigned long npages,
> > +					   enum dma_data_direction
> > dir)
> > +{
> > +	unsigned long i;
> > +
> > +	for (i = 0; i < npages; ++i) {
> > +		if (!dma_addr[i] || dma_mapping_error(dev,
> > dma_addr[i]))
> > +			continue;
> > +
> > +		dma_unmap_page(dev, dma_addr[i], PAGE_SIZE, dir);
> > +	}
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_to_vram - Migrate GPU SVM range to VRAM
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + * @vram_allocation: Driver-private pointer to the VRAM allocation. The caller
> > + *                   should hold a reference to the VRAM allocation, which
> > + *                   should be dropped via ops->vram_release or upon the
> > + *                   failure of this function.
> > + * @ctx: GPU SVM context
> > + *
> > + * This function migrates the specified GPU SVM range to VRAM. It performs the
> > + * necessary setup and invokes the driver-specific operations for migration to
> > + * VRAM. Upon successful return, @vram_allocation can safely reference @range
> > + * until ops->vram_release is called, which only happens after a successful
> > + * return.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> > +			       struct drm_gpusvm_range *range,
> > +			       void *vram_allocation,
> > +			       const struct drm_gpusvm_ctx *ctx)
> > +{
> > +	u64 start = range->va.start, end = range->va.end;
> > +	struct migrate_vma migrate = {
> > +		.start		= start,
> > +		.end		= end,
> > +		.pgmap_owner	= gpusvm->device_private_page_owner,
> > +		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
> > +	};
> > +	struct mm_struct *mm = gpusvm->mm;
> > +	unsigned long i, npages = npages_in_range(start, end);
> > +	struct vm_area_struct *vas;
> > +	struct drm_gpusvm_zdd *zdd = NULL;
> > +	struct page **pages;
> > +	dma_addr_t *dma_addr;
> > +	void *buf;
> > +	int err;
> > +
> > +	if (!range->flags.migrate_vram)
> > +		return -EINVAL;
> > +
> > +	if (!gpusvm->ops->populate_vram_pfn || !gpusvm->ops-
> > >copy_to_vram ||
> > +	    !gpusvm->ops->copy_to_sram)
> > +		return -EOPNOTSUPP;
> > +
> > +	if (!ctx->mmap_locked) {
> > +		if (!mmget_not_zero(mm)) {
> > +			err = -EFAULT;
> > +			goto err_out;
> > +		}
> > +		mmap_write_lock(mm);
> > +	}
> > +
> > +	mmap_assert_locked(mm);
> > +
> > +	vas = vma_lookup(mm, start);
> > +	if (!vas) {
> > +		err = -ENOENT;
> > +		goto err_mmunlock;
> > +	}
> > +
> > +	if (end > vas->vm_end || start < vas->vm_start) {
> > +		err = -EINVAL;
> > +		goto err_mmunlock;
> > +	}
> > +
> > +	if (!vma_is_anonymous(vas)) {
> > +		err = -EBUSY;
> > +		goto err_mmunlock;
> > +	}
> > +
> > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) +
> > sizeof(*dma_addr) +
> > +		       sizeof(*pages), GFP_KERNEL);
> > +	if (!buf) {
> > +		err = -ENOMEM;
> > +		goto err_mmunlock;
> > +	}
> > +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr))
> > * npages;
> > +
> > +	zdd = drm_gpusvm_zdd_alloc(range);
> > +	if (!zdd) {
> > +		err = -ENOMEM;
> > +		goto err_free;
> > +	}
> > +
> > +	migrate.vma = vas;
> > +	migrate.src = buf;
> > +	migrate.dst = migrate.src + npages;
> > +
> > +	err = migrate_vma_setup(&migrate);
> > +	if (err)
> > +		goto err_free;
> > +
> > +	/*
> > +	 * FIXME: Below cases, !migrate.cpages and migrate.cpages !=
> > npages, not
> > +	 * always an error. Need to revisit possible cases and how
> > to handle. We
> > +	 * could prefault on migrate.cpages != npages via
> > hmm_range_fault.
> > +	 */
> > +
> > +	if (!migrate.cpages) {
> > +		err = -EFAULT;
> > +		goto err_free;
> > +	}
> > +
> > +	if (migrate.cpages != npages) {
> > +		err = -EBUSY;
> > +		goto err_finalize;
> > +	}
> > +
> > +	err = gpusvm->ops->populate_vram_pfn(gpusvm,
> > vram_allocation, npages,
> > +					     migrate.dst);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> > dma_addr,
> > +					   migrate.src, npages,
> > DMA_TO_DEVICE);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	for (i = 0; i < npages; ++i) {
> > +		struct page *page = pfn_to_page(migrate.dst[i]);
> > +
> > +		pages[i] = page;
> > +		migrate.dst[i] = migrate_pfn(migrate.dst[i]);
> > +		drm_gpusvm_get_vram_page(page, zdd);
> > +	}
> > +
> > +	err = gpusvm->ops->copy_to_vram(gpusvm, pages, dma_addr,
> > npages);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	/* Upon success bind vram allocation to range and zdd */
> > +	range->vram_allocation = vram_allocation;
> > +	WRITE_ONCE(zdd->vram_allocation, vram_allocation);	/*
> > Owns ref */
> > +
> > +err_finalize:
> > +	if (err)
> > +		drm_gpusvm_migration_put_pages(npages, migrate.dst);
> > +	migrate_vma_pages(&migrate);
> > +	migrate_vma_finalize(&migrate);
> > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr,
> > npages,
> > +				       DMA_TO_DEVICE);
> > +err_free:
> > +	if (zdd)
> > +		drm_gpusvm_zdd_put(zdd);
> > +	kvfree(buf);
> > +err_mmunlock:
> > +	if (!ctx->mmap_locked) {
> > +		mmap_write_unlock(mm);
> > +		mmput(mm);
> > +	}
> > +err_out:
> > +	return err;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_populate_sram_pfn - Populate SRAM PFNs for a
> > VM area
> > + * @vas: Pointer to the VM area structure, can be NULL
> > + * @npages: Number of pages to populate
> > + * @src_mpfn: Source array of migrate PFNs
> > + * @mpfn: Array of migrate PFNs to populate
> > + * @addr: Start address for PFN allocation
> > + *
> > + * This function populates the SRAM migrate page frame numbers
> > (PFNs) for the
> > + * specified VM area structure. It allocates and locks pages in the
> > VM area for
> > + * SRAM usage. If @vas is non-NULL, alloc_page_vma() is used for allocation;
> > + * if NULL, alloc_page() is used.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +static int drm_gpusvm_migrate_populate_sram_pfn(struct
> > vm_area_struct *vas,
> > +						unsigned long
> > npages,
> > +						unsigned long
> > *src_mpfn,
> > +						unsigned long *mpfn,
> > u64 addr)
> > +{
> > +	unsigned long i;
> > +
> > +	for (i = 0; i < npages; ++i, addr += PAGE_SIZE) {
> > +		struct page *page;
> > +
> > +		if (!(src_mpfn[i] & MIGRATE_PFN_MIGRATE))
> > +			continue;
> > +
> > +		if (vas)
> > +			page = alloc_page_vma(GFP_HIGHUSER, vas,
> > addr);
> > +		else
> > +			page = alloc_page(GFP_HIGHUSER);
> > +
> > +		if (!page)
> > +			return -ENOMEM;
> > +
> > +		lock_page(page);
> > +		mpfn[i] = migrate_pfn(page_to_pfn(page));
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_evict_to_sram - Evict GPU SVM range to SRAM
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + *
> > + * Similar to __drm_gpusvm_migrate_to_sram but does not require the mmap lock;
> > + * migration is done via the migrate_device_* functions. This is a fallback
> > + * path, as it is preferred to issue migrations with the mmap lock held.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +static int drm_gpusvm_evict_to_sram(struct drm_gpusvm *gpusvm,
> > +				    struct drm_gpusvm_range *range)
> > +{
> > +	unsigned long npages;
> > +	struct page **pages;
> > +	unsigned long *src, *dst;
> > +	dma_addr_t *dma_addr;
> > +	void *buf;
> > +	int i, err = 0;
> > +
> > +	npages = npages_in_range(range->va.start, range->va.end);
> > +
> > +	buf = kvcalloc(npages, 2 * sizeof(*src) + sizeof(*dma_addr)
> > +
> > +		       sizeof(*pages), GFP_KERNEL);
> > +	if (!buf) {
> > +		err = -ENOMEM;
> > +		goto err_out;
> > +	}
> > +	src = buf;
> > +	dst = buf + (sizeof(*src) * npages);
> > +	dma_addr = buf + (2 * sizeof(*src) * npages);
> > +	pages = buf + (2 * sizeof(*src) + sizeof(*dma_addr)) *
> > npages;
> > +
> > +	err = gpusvm->ops->populate_vram_pfn(gpusvm, range-
> > >vram_allocation,
> > +					     npages, src);
> > +	if (err)
> > +		goto err_free;
> > +
> > +	err = migrate_device_vma_range(gpusvm->mm,
> > +				       gpusvm-
> > >device_private_page_owner, src,
> > +				       npages, range->va.start);
> > +	if (err)
> > +		goto err_free;
> > +
> > +	err = drm_gpusvm_migrate_populate_sram_pfn(NULL, npages,
> > src, dst, 0);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> > dma_addr,
> > +					   dst, npages,
> > DMA_BIDIRECTIONAL);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	for (i = 0; i < npages; ++i)
> > +		pages[i] = migrate_pfn_to_page(src[i]);
> > +
> > +	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr,
> > npages);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +err_finalize:
> > +	if (err)
> > +		drm_gpusvm_migration_put_pages(npages, dst);
> > +	migrate_device_pages(src, dst, npages);
> > +	migrate_device_finalize(src, dst, npages);
> > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr,
> > npages,
> > +				       DMA_BIDIRECTIONAL);
> > +err_free:
> > +	kvfree(buf);
> > +err_out:
> > +
> > +	return err;
> > +}
> > +
> > +/**
> > + * __drm_gpusvm_migrate_to_sram - Migrate GPU SVM range to SRAM
> > (internal)
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @vas: Pointer to the VM area structure
> > + * @page: Pointer to the page for fault handling (can be NULL)
> > + * @start: Start address of the migration range
> > + * @end: End address of the migration range
> > + *
> > + * This internal function performs the migration of the specified
> > GPU SVM range
> > + * to SRAM. It sets up the migration, populates + dma maps SRAM
> > PFNs, and
> > + * invokes the driver-specific operations for migration to SRAM.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +static int __drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > +					struct vm_area_struct *vas,
> > +					struct page *page,
> > +					u64 start, u64 end)
> > +{
> > +	struct migrate_vma migrate = {
> > +		.vma		= vas,
> > +		.pgmap_owner	= gpusvm->device_private_page_owner,
> > +		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> > +		.fault_page	= page,
> > +	};
> > +	unsigned long npages;
> > +	struct page **pages;
> > +	dma_addr_t *dma_addr;
> > +	void *buf;
> > +	int i, err = 0;
> > +
> > +	mmap_assert_locked(gpusvm->mm);
> > +
> > +	/* Corner case where the VMA has been partially unmapped */
> > +	if (start < vas->vm_start)
> > +		start = vas->vm_start;
> > +	if (end > vas->vm_end)
> > +		end = vas->vm_end;
> > +
> > +	migrate.start = start;
> > +	migrate.end = end;
> > +	npages = npages_in_range(start, end);
> > +
> > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) +
> > sizeof(*dma_addr) +
> > +		       sizeof(*pages), GFP_KERNEL);
> > +	if (!buf) {
> > +		err = -ENOMEM;
> > +		goto err_out;
> > +	}
> > +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr))
> > * npages;
> > +
> > +	migrate.vma = vas;
> > +	migrate.src = buf;
> > +	migrate.dst = migrate.src + npages;
> > +
> > +	err = migrate_vma_setup(&migrate);
> > +	if (err)
> > +		goto err_free;
> > +
> > +	/* Raced with another CPU fault, nothing to do */
> > +	if (!migrate.cpages)
> > +		goto err_free;
> > +
> > +	err = drm_gpusvm_migrate_populate_sram_pfn(vas, npages,
> > +						   migrate.src,
> > migrate.dst,
> > +						   start);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> > dma_addr,
> > +					   migrate.dst, npages,
> > +					   DMA_BIDIRECTIONAL);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	for (i = 0; i < npages; ++i)
> > +		pages[i] = migrate_pfn_to_page(migrate.src[i]);
> > +
> > +	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr,
> > npages);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +err_finalize:
> > +	if (err)
> > +		drm_gpusvm_migration_put_pages(npages, migrate.dst);
> > +	migrate_vma_pages(&migrate);
> > +	migrate_vma_finalize(&migrate);
> > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr,
> > npages,
> > +				       DMA_BIDIRECTIONAL);
> > +err_free:
> > +	kvfree(buf);
> > +err_out:
> > +	mmap_assert_locked(gpusvm->mm);
> > +
> > +	return err;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_to_sram - Migrate (evict) GPU SVM range to
> > SRAM
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + * @ctx: GPU SVM context
> > + *
> > + * This function initiates the migration of the specified GPU SVM
> > range to
> > + * SRAM. It performs necessary checks and invokes the internal
> > migration
> > + * function for actual migration.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > +			       struct drm_gpusvm_range *range,
> > +			       const struct drm_gpusvm_ctx *ctx)
> > +{
> > +	u64 start = range->va.start, end = range->va.end;
> > +	struct mm_struct *mm = gpusvm->mm;
> > +	struct vm_area_struct *vas;
> > +	int err;
> > +	bool retry = false;
> > +
> > +	if (!ctx->mmap_locked) {
> > +		if (!mmget_not_zero(mm)) {
> > +			err = -EFAULT;
> > +			goto err_out;
> > +		}
> > +		if (ctx->trylock_mmap) {
> > +			if (!mmap_read_trylock(mm))  {
> > +				err =
> > drm_gpusvm_evict_to_sram(gpusvm, range);
> > +				goto err_mmput;
> > +			}
> > +		} else {
> > +			mmap_read_lock(mm);
> > +		}
> > +	}
> > +
> > +	mmap_assert_locked(mm);
> > +
> > +	/*
> > +	 * Loop required to find all VMAs for the corner case when the VRAM
> > +	 * backing has been partially unmapped from the MM's address space.
> > +	 */
> > +again:
> > +	vas = find_vma(mm, start);
> > +	if (!vas) {
> > +		if (!retry)
> > +			err = -ENOENT;
> > +		goto err_mmunlock;
> > +	}
> > +
> > +	if (end <= vas->vm_start || start >= vas->vm_end) {
> > +		if (!retry)
> > +			err = -EINVAL;
> > +		goto err_mmunlock;
> > +	}
> > +
> > +	err = __drm_gpusvm_migrate_to_sram(gpusvm, vas, NULL, start,
> > end);
> > +	if (err)
> > +		goto err_mmunlock;
> > +
> > +	if (vas->vm_end < end) {
> > +		retry = true;
> > +		start = vas->vm_end;
> > +		goto again;
> > +	}
> > +
> > +	if (!ctx->mmap_locked) {
> > +		mmap_read_unlock(mm);
> > +		/*
> > +		 * Using mmput_async as this function can be called
> > while
> > +		 * holding a dma-resv lock, and a final put can grab
> > the mmap
> > +		 * lock, causing a lock inversion.
> > +		 */
> > +		mmput_async(mm);
> > +	}
> > +
> > +	return 0;
> > +
> > +err_mmunlock:
> > +	if (!ctx->mmap_locked)
> > +		mmap_read_unlock(mm);
> > +err_mmput:
> > +	if (!ctx->mmap_locked)
> > +		mmput_async(mm);
> > +err_out:
> > +	return err;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_page_free - Put GPU SVM zone device data associated
> > with a page
> > + * @page: Pointer to the page
> > + *
> > + * This function is a callback used to put the GPU SVM zone device
> > data
> > + * associated with a page when it is being released.
> > + */
> > +static void drm_gpusvm_page_free(struct page *page)
> > +{
> > +	drm_gpusvm_zdd_put(page->zone_device_data);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_to_ram - Migrate GPU SVM range to RAM (page
> > fault handler)
> > + * @vmf: Pointer to the fault information structure
> > + *
> > + * This function is a page fault handler used to migrate a GPU SVM
> > range to RAM.
> > + * It retrieves the GPU SVM range information from the faulting page
> > and invokes
> > + * the internal migration function to migrate the range back to RAM.
> > + *
> > + * Returns:
> > + * VM_FAULT_SIGBUS on failure, 0 on success.
> > + */
> > +static vm_fault_t drm_gpusvm_migrate_to_ram(struct vm_fault *vmf)
> > +{
> > +	struct drm_gpusvm_zdd *zdd = vmf->page->zone_device_data;
> > +	int err;
> > +
> > +	err = __drm_gpusvm_migrate_to_sram(zdd->range->gpusvm,
> > +					   vmf->vma, vmf->page,
> > +					   zdd->range->va.start,
> > +					   zdd->range->va.end);
> > +
> > +	return err ? VM_FAULT_SIGBUS : 0;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_pagemap_ops - Device page map operations for GPU SVM
> > + */
> > +static const struct dev_pagemap_ops drm_gpusvm_pagemap_ops = {
> > +	.page_free = drm_gpusvm_page_free,
> > +	.migrate_to_ram = drm_gpusvm_migrate_to_ram,
> > +};
> > +
> > +/**
> > + * drm_gpusvm_pagemap_ops_get - Retrieve GPU SVM device page map
> > operations
> > + *
> > + * Returns:
> > + * Pointer to the GPU SVM device page map operations structure.
> > + */
> > +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void)
> > +{
> > +	return &drm_gpusvm_pagemap_ops;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_has_mapping - Check if GPU SVM has mapping for the
> > given address range
> > + * @gpusvm: Pointer to the GPU SVM structure.
> > + * @start: Start address
> > + * @end: End address
> > + *
> > + * Returns:
> > + * True if GPU SVM has mapping, False otherwise
> > + */
> > +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start,
> > u64 end)
> > +{
> > +	struct drm_gpusvm_notifier *notifier;
> > +
> > +	drm_gpusvm_for_each_notifier(notifier, gpusvm, start, end) {
> > +		struct drm_gpusvm_range *range = NULL;
> > +
> > +		drm_gpusvm_for_each_range(range, notifier, start,
> > end)
> > +			return true;
> > +	}
> > +
> > +	return false;
> > +}
> > diff --git a/drivers/gpu/drm/xe/drm_gpusvm.h
> > b/drivers/gpu/drm/xe/drm_gpusvm.h
> > new file mode 100644
> > index 000000000000..0ea70f8534a8
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/drm_gpusvm.h
> > @@ -0,0 +1,415 @@
> > +/* SPDX-License-Identifier: MIT */
> > +/*
> > + * Copyright © 2024 Intel Corporation
> > + */
> > +
> > +#ifndef __DRM_GPUSVM_H__
> > +#define __DRM_GPUSVM_H__
> > +
> > +#include <linux/kref.h>
> > +#include <linux/mmu_notifier.h>
> > +#include <linux/workqueue.h>
> > +
> > +struct dev_pagemap_ops;
> > +struct drm_device;
> > +struct drm_gpusvm;
> > +struct drm_gpusvm_notifier;
> > +struct drm_gpusvm_ops;
> > +struct drm_gpusvm_range;
> > +
> > +/**
> > + * struct drm_gpusvm_ops - Operations structure for GPU SVM
> > + *
> > + * This structure defines the operations for GPU Shared Virtual
> > Memory (SVM).
> > + * These operations are provided by the GPU driver to manage SVM
> > ranges and
> > + * perform operations such as migration between VRAM and system RAM.
> > + */
> > +struct drm_gpusvm_ops {
> > +	/**
> > +	 * @notifier_alloc: Allocate a GPU SVM notifier (optional)
> > +	 *
> > +	 * This function shall allocate a GPU SVM notifier.
> > +	 *
> > +	 * Returns:
> > +	 * Pointer to the allocated GPU SVM notifier on success,
> > NULL on failure.
> > +	 */
> > +	struct drm_gpusvm_notifier *(*notifier_alloc)(void);
> > +
> > +	/**
> > +	 * @notifier_free: Free a GPU SVM notifier (optional)
> > +	 * @notifier: Pointer to the GPU SVM notifier to be freed
> > +	 *
> > +	 * This function shall free a GPU SVM notifier.
> > +	 */
> > +	void (*notifier_free)(struct drm_gpusvm_notifier *notifier);
> > +
> > +	/**
> > +	 * @range_alloc: Allocate a GPU SVM range (optional)
> > +	 * @gpusvm: Pointer to the GPU SVM
> > +	 *
> > +	 * This function shall allocate a GPU SVM range.
> > +	 *
> > +	 * Returns:
> > +	 * Pointer to the allocated GPU SVM range on success, NULL
> > on failure.
> > +	 */
> > +	struct drm_gpusvm_range *(*range_alloc)(struct drm_gpusvm
> > *gpusvm);
> > +
> > +	/**
> > +	 * @range_free: Free a GPU SVM range (optional)
> > +	 * @range: Pointer to the GPU SVM range to be freed
> > +	 *
> > +	 * This function shall free a GPU SVM range.
> > +	 */
> > +	void (*range_free)(struct drm_gpusvm_range *range);
> > +
> > +	/**
> > +	 * @vram_release: Release VRAM allocation (optional)
> > +	 * @vram_allocation: Driver-private pointer to the VRAM
> > allocation
> > +	 *
> > +	 * This function shall release VRAM allocation and expects
> > to drop a
> > +	 * reference to VRAM allocation.
> > +	 */
> > +	void (*vram_release)(void *vram_allocation);
> > +
> > +	/**
> > +	 * @populate_vram_pfn: Populate VRAM PFN (required for
> > migration)
> > +	 * @gpusvm: Pointer to the GPU SVM
> > +	 * @vram_allocation: Driver-private pointer to the VRAM
> > allocation
> > +	 * @npages: Number of pages to populate
> > +	 * @pfn: Array of page frame numbers to populate
> > +	 *
> > +	 * This function shall populate VRAM page frame numbers
> > (PFN).
> > +	 *
> > +	 * Returns:
> > +	 * 0 on success, a negative error code on failure.
> > +	 */
> > +	int (*populate_vram_pfn)(struct drm_gpusvm *gpusvm,
> > +				 void *vram_allocation,
> > +				 unsigned long npages,
> > +				 unsigned long *pfn);
> > +
> > +	/**
> > +	 * @copy_to_vram: Copy to VRAM (required for migration)
> > +	 * @gpusvm: Pointer to the GPU SVM
> > +	 * @pages: Pointer to array of VRAM pages (destination)
> > +	 * @dma_addr: Pointer to array of DMA addresses (source)
> > +	 * @npages: Number of pages to copy
> > +	 *
> > +	 * This function shall copy pages to VRAM.
> > +	 *
> > +	 * Returns:
> > +	 * 0 on success, a negative error code on failure.
> > +	 */
> > +	int (*copy_to_vram)(struct drm_gpusvm *gpusvm,
> > +			    struct page **pages,
> > +			    dma_addr_t *dma_addr,
> > +			    unsigned long npages);
> > +
> > +	/**
> > +	 * @copy_to_sram: Copy to system RAM (required for
> > migration)
> > +	 * @gpusvm: Pointer to the GPU SVM
> > +	 * @pages: Pointer to array of VRAM pages (source)
> > +	 * @dma_addr: Pointer to array of DMA addresses
> > (destination)
> > +	 * @npages: Number of pages to copy
> > +	 *
> > +	 * This function shall copy pages to system RAM.
> > +	 *
> > +	 * Returns:
> > +	 * 0 on success, a negative error code on failure.
> > +	 */
> > +	int (*copy_to_sram)(struct drm_gpusvm *gpusvm,
> > +			    struct page **pages,
> > +			    dma_addr_t *dma_addr,
> > +			    unsigned long npages);
> > +
> > +	/**
> > +	 * @invalidate: Invalidate GPU SVM notifier (required)
> > +	 * @gpusvm: Pointer to the GPU SVM
> > +	 * @notifier: Pointer to the GPU SVM notifier
> > +	 * @mmu_range: Pointer to the mmu_notifier_range structure
> > +	 *
> > +	 * This function shall invalidate the GPU page tables. It
> > can safely
> > +	 * walk the notifier range RB tree/list in this function.
> > Called while
> > +	 * holding the notifier lock.
> > +	 */
> > +	void (*invalidate)(struct drm_gpusvm *gpusvm,
> > +			   struct drm_gpusvm_notifier *notifier,
> > +			   const struct mmu_notifier_range
> > *mmu_range);
> > +};
> > +
> > +/**
> > + * struct drm_gpusvm_notifier - Structure representing a GPU SVM
> > notifier
> > + *
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: MMU interval notifier
> > + * @interval: Interval for the notifier
> > + * @rb: Red-black tree node for the parent GPU SVM structure
> > notifier tree
> > + * @root: Cached root node of the RB tree containing ranges
> > + * @range_list: List head of ranges in the same order they appear in the
> > + *              interval tree. This is useful to keep iterating ranges while
> > + *              doing modifications to the RB tree.
> > + * @flags.removed: Flag indicating whether the MMU interval notifier
> > has been
> > + *                 removed
> > + *
> > + * This structure represents a GPU SVM notifier.
> > + */
> > +struct drm_gpusvm_notifier {
> > +	struct drm_gpusvm *gpusvm;
> > +	struct mmu_interval_notifier notifier;
> > +	struct {
> > +		u64 start;
> > +		u64 end;
> > +	} interval;
> > +	struct {
> > +		struct rb_node node;
> > +		struct list_head entry;
> > +		u64 __subtree_last;
> > +	} rb;
> > +	struct rb_root_cached root;
> > +	struct list_head range_list;
> > +	struct {
> > +		u32 removed : 1;
> > +	} flags;
> > +};
> > +
> > +/**
> > + * struct drm_gpusvm_range - Structure representing a GPU SVM range
> > + *
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier
> > + * @refcount: Reference count for the range
> > + * @rb: Red-black tree node for the parent GPU SVM notifier
> > structure range tree
> > + * @va: Virtual address range
> > + * @notifier_seq: Notifier sequence number of the range's pages
> > + * @pages: Pointer to the array of pages (if backing store is in
> > VRAM)
> > + * @dma_addr: DMA address array (if backing store is SRAM and DMA
> > mapped)
> > + * @vram_allocation: Driver-private pointer to the VRAM allocation
> > + * @order: Order of dma mapping. i.e. PAGE_SIZE << order is mapping
> > size
> > + * @flags.migrate_vram: Flag indicating whether the range can be
> > migrated to VRAM
> > + * @flags.unmapped: Flag indicating if the range has been unmapped
> > + * @flags.partial_unmap: Flag indicating if the range has been
> > partially unmapped
> > + * @flags.has_vram_pages: Flag indicating if the range has vram
> > pages
> > + * @flags.has_dma_mapping: Flag indicating if the range has a DMA
> > mapping
> > + * @flags.kfree_mapping: Flag indicating @dma_addr is a compact allocation
> > + *                       based on @order, which is released via kfree()
> > + *
> > + * This structure represents a GPU SVM range used for tracking
> > memory ranges
> > + * mapped in a DRM device.
> > + */
> > +struct drm_gpusvm_range {
> > +	struct drm_gpusvm *gpusvm;
> > +	struct drm_gpusvm_notifier *notifier;
> > +	struct kref refcount;
> > +	struct {
> > +		struct rb_node node;
> > +		struct list_head entry;
> > +		u64 __subtree_last;
> > +	} rb;
> > +	struct {
> > +		u64 start;
> > +		u64 end;
> > +	} va;
> > +	unsigned long notifier_seq;
> > +	union {
> > +		struct page **pages;
> > +		dma_addr_t *dma_addr;
> > +	};
> > +	void *vram_allocation;
> > +	u16 order;
> > +	struct {
> > +		/* All flags below must be set upon creation */
> > +		u16 migrate_vram : 1;
> > +		/* All flags below must be set / cleared under
> > notifier lock */
> > +		u16 unmapped : 1;
> > +		u16 partial_unmap : 1;
> > +		u16 has_vram_pages : 1;
> > +		u16 has_dma_mapping : 1;
> > +		u16 kfree_mapping : 1;
> > +	} flags;
> > +};
> > +
> > +/**
> > + * struct drm_gpusvm - GPU SVM structure
> > + *
> > + * @name: Name of the GPU SVM
> > + * @drm: Pointer to the DRM device structure
> > + * @mm: Pointer to the mm_struct for the address space
> > + * @device_private_page_owner: Device private pages owner
> > + * @mm_start: Start address of GPU SVM
> > + * @mm_range: Range of the GPU SVM
> > + * @notifier_size: Size of individual notifiers
> > + * @ops: Pointer to the operations structure for GPU SVM
> > + * @chunk_sizes: Pointer to the array of chunk sizes used in range
> > allocation.
> > + *               Entries should be powers of 2 in descending order.
> > + * @num_chunks: Number of chunks
> > + * @notifier_lock: Read-write semaphore for protecting notifier
> > operations
> > + * @zdd_wq: Workqueue for deferred work on zdd destruction
> > + * @root: Cached root node of the Red-Black tree containing GPU SVM
> > notifiers
> > + * @notifier_list: List head of notifiers in the same order they appear in
> > + *                 the interval tree. This is useful to keep iterating
> > + *                 notifiers while doing modifications to the RB tree.
> > + *
> > + * This structure represents a GPU SVM (Shared Virtual Memory) used
> > for tracking
> > + * memory ranges mapped in a DRM (Direct Rendering Manager) device.
> > + *
> > + * No reference counting is provided, as this is expected to be
> > embedded in the
> > + * driver VM structure along with the struct drm_gpuvm, which
> > handles reference
> > + * counting.
> > + */
> > +struct drm_gpusvm {
> > +	const char *name;
> > +	struct drm_device *drm;
> > +	struct mm_struct *mm;
> > +	void *device_private_page_owner;
> > +	u64 mm_start;
> > +	u64 mm_range;
> > +	u64 notifier_size;
> > +	const struct drm_gpusvm_ops *ops;
> > +	const u64 *chunk_sizes;
> > +	int num_chunks;
> > +	struct rw_semaphore notifier_lock;
> > +	struct workqueue_struct *zdd_wq;
> > +	struct rb_root_cached root;
> > +	struct list_head notifier_list;
> > +};
> > +
> > +/**
> > + * struct drm_gpusvm_ctx - DRM GPU SVM context
> > + *
> > + * @mmap_locked: mmap lock is locked
> > + * @trylock_mmap: trylock mmap lock, used to avoid locking inversions
> > + *                (e.g. dma-resv -> mmap lock)
> > + * @in_notifier: entering from a MMU notifier
> > + * @read_only: operating on read-only memory
> > + * @vram_possible: possible to use VRAM
> > + * @prefault: prefault pages
> > + *
> > + * Context that DRM GPU SVM is operating in (i.e. user arguments).
> > + */
> > +struct drm_gpusvm_ctx {
> > +	u32 mmap_locked :1;
> > +	u32 trylock_mmap :1;
> > +	u32 in_notifier :1;
> > +	u32 read_only :1;
> > +	u32 vram_possible :1;
> > +	u32 prefault :1;
> > +};
> > +
> > +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> > +		    const char *name, struct drm_device *drm,
> > +		    struct mm_struct *mm, void
> > *device_private_page_owner,
> > +		    u64 mm_start, u64 mm_range, u64 notifier_size,
> > +		    const struct drm_gpusvm_ops *ops,
> > +		    const u64 *chunk_sizes, int num_chunks);
> > +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm);
> > +void drm_gpusvm_free(struct drm_gpusvm *gpusvm);
> > +
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64
> > fault_addr,
> > +				u64 gpuva_start, u64 gpuva_end,
> > +				const struct drm_gpusvm_ctx *ctx);
> > +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> > +			     struct drm_gpusvm_range *range);
> > +
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_get(struct drm_gpusvm_range *range);
> > +void drm_gpusvm_range_put(struct drm_gpusvm_range *range);
> > +
> > +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> > +				  struct drm_gpusvm_range *range);
> > +
> > +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> > +			       struct drm_gpusvm_range *range,
> > +			       const struct drm_gpusvm_ctx *ctx);
> > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> > +				  struct drm_gpusvm_range *range,
> > +				  const struct drm_gpusvm_ctx *ctx);
> > +
> > +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> > +			       struct drm_gpusvm_range *range,
> > +			       void *vram_allocation,
> > +			       const struct drm_gpusvm_ctx *ctx);
> > +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > +			       struct drm_gpusvm_range *range,
> > +			       const struct drm_gpusvm_ctx *ctx);
> > +
> > +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void);
> > +
> > +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start,
> > u64 end);
> > +
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64
> > start, u64 end);
> > +
> > +/**
> > + * drm_gpusvm_notifier_lock - Lock GPU SVM notifier
> > + * @gpusvm__: Pointer to the GPU SVM structure.
> > + *
> > + * Abstract client usage GPU SVM notifier lock, take lock
> > + */
> > +#define drm_gpusvm_notifier_lock(gpusvm__)	\
> > +	down_read(&(gpusvm__)->notifier_lock)
> > +
> > +/**
> > + * drm_gpusvm_notifier_unlock - Unlock GPU SVM notifier
> > + * @gpusvm__: Pointer to the GPU SVM structure.
> > + *
> > + * Abstract client usage GPU SVM notifier lock, drop lock
> > + */
> > +#define drm_gpusvm_notifier_unlock(gpusvm__)	\
> > +	up_read(&(gpusvm__)->notifier_lock)
> > +
> > +/**
> > + * __drm_gpusvm_range_next - Get the next GPU SVM range in the list
> > + * @range: a pointer to the current GPU SVM range
> > + *
> > + * Return: A pointer to the next drm_gpusvm_range if available, or
> > NULL if the
> > + *         current range is the last one or if the input range is
> > NULL.
> > + */
> > +static inline struct drm_gpusvm_range *
> > +__drm_gpusvm_range_next(struct drm_gpusvm_range *range)
> > +{
> > +	if (range && !list_is_last(&range->rb.entry,
> > +				   &range->notifier->range_list))
> > +		return list_next_entry(range, rb.entry);
> > +
> > +	return NULL;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_for_each_range - Iterate over GPU SVM ranges in a
> > notifier
> > + * @range__: Iterator variable for the ranges. If set, it indicates
> > the start of
> > + *	     the iterator. If NULL, call drm_gpusvm_range_find() to
> > get the range.
> > + * @notifier__: Pointer to the GPU SVM notifier
> > + * @start__: Start address of the range
> > + * @end__: End address of the range
> > + *
> > + * This macro is used to iterate over GPU SVM ranges in a notifier.
> > It is safe
> > + * to use while holding the driver SVM lock or the notifier lock.
> > + */
> > +#define drm_gpusvm_for_each_range(range__, notifier__, start__,
> > end__)	\
> > +	for ((range__) = (range__)
> > ?:					\
> > +	     drm_gpusvm_range_find((notifier__), (start__),
> > (end__));	\
> > +	     (range__) && (range__->va.start <
> > (end__));		\
> > +	     (range__) = __drm_gpusvm_range_next(range__))
> > +
> > +/**
> > + * drm_gpusvm_range_set_unmapped - Mark a GPU SVM range as unmapped
> > + * @range: Pointer to the GPU SVM range structure.
> > + * @mmu_range: Pointer to the MMU notifier range structure.
> > + *
> > + * This function marks a GPU SVM range as unmapped and sets the
> > partial_unmap flag
> > + * if the range partially falls within the provided MMU notifier
> > range.
> > + */
> > +static inline void
> > +drm_gpusvm_range_set_unmapped(struct drm_gpusvm_range *range,
> > +			      const struct mmu_notifier_range
> > *mmu_range)
> > +{
> > +	lockdep_assert_held_write(&range->gpusvm->notifier_lock);
> > +
> > +	range->flags.unmapped = true;
> > +	if (range->va.start < mmu_range->start ||
> > +	    range->va.end > mmu_range->end)
> > +		range->flags.partial_unmap = true;
> > +}
> > +
> > +#endif /* __DRM_GPUSVM_H__ */
> 

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory
  2024-08-28 14:31   ` Daniel Vetter
  2024-08-28 14:46     ` Christian König
@ 2024-08-30  5:00     ` Matthew Brost
  2024-09-02 11:36       ` Daniel Vetter
  1 sibling, 1 reply; 100+ messages in thread
From: Matthew Brost @ 2024-08-30  5:00 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: intel-xe, dri-devel, airlied, christian.koenig, thomas.hellstrom,
	matthew.auld, daniel

On Wed, Aug 28, 2024 at 04:31:19PM +0200, Daniel Vetter wrote:
> On Tue, Aug 27, 2024 at 07:48:38PM -0700, Matthew Brost wrote:
> > +		if (!ctx->mmap_locked) {
> > +			/*
> > +			 * XXX: HMM locking document indicates only a read-lock
> > +			 * is required but there appears to be a window between
> > +			 * the MMU_NOTIFY_MIGRATE event triggered in a CPU fault
> > +			 * via migrate_vma_setup and the pages actually moving
> > +			 * in migrate_vma_finalize in which this code can grab
> > +			 * garbage pages. Grabbing the write-lock if the range
> > +			 * is attached to vram appears to protect against this
> > +			 * race.
> > +			 */
> 
> This one is really scary, since it means the entire migrate pte trickery
> is essentially completely busted. Grabbing the mmap write lock just means
> you block out pretty much everything interesting from concurrently
> happening.
> 
> My gut feeling says we need to figure out what's happening here, because
> this looks a bit too fundamental to me.
> -Sima
> 

Sima,

I’ve already replied to this.

We’ve discussed the mmap write hack extensively, so I’m not quite sure
where to put this. The reply chain is quickly becoming a mess. However,
I’ve looked into this and collected some data points based on your
feedback.

I’ve pushed a branch [1] with the updated code.

The first new commit [2] removes the mmap write lock hack and addresses
an issue related to VRAM migrations, which couldn’t collect all VRAM
pages without this hack.

With this commit [2], xe_exec_system_allocator --r twice*race* fails
quite regularly, perhaps 25% of the time. This test is a
single-thread/process test that races CPU and GPU faults with migration.

It fails with the following dmesg (the -95 fault response is -EOPNOTSUPP):

[   68.473007] WARNING: CPU: 12 PID: 1643 at drivers/gpu/drm/xe/drm_gpusvm.c:1407 drm_gpusvm_range_get_pages+0xbda/0x1480 [xe]
...
[   68.473836] xe 0000:03:00.0: [drm:pf_queue_work_func [xe]] Fault response: Unsuccessful -95
[   68.474024] xe 0000:03:00.0: [drm:xe_guc_exec_queue_memory_cat_error_handler [xe]] GT1: Engine memory cat error: engine_class=vecs, logical_mask: 0x2, guc_id=0
[   68.474163] xe 0000:03:00.0: [drm] exec queue reset detected
[   68.474696] xe 0000:03:00.0: [drm] GT1: Engine reset: engine_class=vecs, logical_mask: 0x2, guc_id=0

This means hmm_range_fault collects a mix of SRAM and VRAM pages, which
my design aims to avoid. Perhaps allowing a mix of SRAM and VRAM pages
in my design might work, but I highly doubt it based on AMD's
range->migration_mutex and my inspection of the migration layer.
Allowing mixed mappings would introduce significant complexity, so I’d
prefer to avoid this if possible. Additionally, allowing mixed mappings
would eliminate the use of huge GPU pages when a race like this occurs.

I also implemented a retry loop to see if the system stabilizes with
either only SRAM or VRAM pages. Unfortunately, it results in a
continuous loop of drm_gpusvm_range_get_pages / hmm_range_fault until
the test case kills the MM due to a timeout.
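
Roughly, that retry loop was shaped like this (illustrative sketch only,
not the actual code in [1]; the wrapper name is made up):

/*
 * Keep re-collecting pages until the range is uniformly SRAM or VRAM.
 * In practice this never stabilizes while CPU and GPU faults race on
 * the range, so it spins until the test kills the MM.
 */
static int example_get_pages_retry(struct drm_gpusvm *gpusvm,
				   struct drm_gpusvm_range *range,
				   const struct drm_gpusvm_ctx *ctx)
{
	int err;

	do {
		err = drm_gpusvm_range_get_pages(gpusvm, range, ctx);
		if (err == -EOPNOTSUPP)	/* mixed SRAM/VRAM pages */
			cond_resched();
	} while (err == -EOPNOTSUPP);

	return err;
}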

Next, I added a lock similar to AMD's range->migration_lock, but using
an rwsem [3]. The semantics are to allow read access for CPU access and
write access for GPU access, thus enabling parallel CPU page faults for
the same range, which matches existing core semantics. This provides
finer granularity compared to using the mmap write lock; it only
disallows CPU and GPU servicing in parallel for a given range, rather
than the entire MM. It also aligns with AMD’s approach. I haven’t
checked Nvidia’s approach wrt this locking but can do so if you think it
would be helpful.
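
To make the intended semantics concrete, here is a rough sketch
(illustrative only; the field and helper names are made up, the real
code is in [3]):

/*
 * Assume struct drm_gpusvm_range gains an rwsem, e.g.
 * range->migration_lock, initialized when the range is created.
 */

/* CPU fault path: shared, so parallel CPU faults on the range are OK */
static vm_fault_t example_cpu_fault(struct drm_gpusvm_range *range,
				    struct vm_fault *vmf)
{
	vm_fault_t ret;

	down_read(&range->migration_lock);
	ret = example_migrate_range_to_sram(range, vmf); /* assumed helper */
	up_read(&range->migration_lock);

	return ret;
}

/*
 * GPU fault path: exclusive, serializing migration and page collection
 * against concurrent CPU faults on this range only, not the whole MM.
 */
static int example_gpu_fault(struct drm_gpusvm *gpusvm,
			     struct drm_gpusvm_range *range,
			     void *vram_allocation,
			     const struct drm_gpusvm_ctx *ctx)
{
	int err;

	down_write(&range->migration_lock);
	err = drm_gpusvm_migrate_to_vram(gpusvm, range, vram_allocation, ctx);
	if (!err)
		err = drm_gpusvm_range_get_pages(gpusvm, range, ctx);
	up_write(&range->migration_lock);

	return err;
}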

Matt

[1] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-svm-post-8-27-24/-/commits/mmap_write_lock
[2] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-svm-post-8-27-24/-/commit/6cf67d98c719ffbb4ac6124a7cb81d797a5bad9f
[3] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-svm-post-8-27-24/-/commit/2b62075d193265b2c1634ecfd0497dffd2e18c13

> 
> > +			if (vram_pages)
> > +				mmap_write_lock(mm);
> > +			else
> > +				mmap_read_lock(mm);
> > +		}
> > +		err = hmm_range_fault(&hmm_range);
> > +		if (!ctx->mmap_locked) {
> > +			if (vram_pages)
> > +				mmap_write_unlock(mm);
> > +			else
> > +				mmap_read_unlock(mm);
> > +		}
> > +
> > +		if (err == -EBUSY) {
> > +			if (time_after(jiffies, timeout))
> > +				break;
> > +
> > +			hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
> > +			continue;
> > +		}
> > +		break;
> > +	}
> > +	if (!ctx->mmap_locked)
> > +		mmput(mm);
> > +	if (err)
> > +		goto err_free;
> > +
> > +	pages = (struct page **)pfns;
> > +
> > +	if (ctx->prefault) {
> > +		range->pages = pages;
> > +		goto set_seqno;
> > +	}
> > +
> > +map_pages:
> > +	if (is_device_private_page(hmm_pfn_to_page(pfns[0]))) {
> > +		WARN_ON_ONCE(!range->vram_allocation);
> > +
> > +		for (i = 0; i < npages; ++i) {
> > +			pages[i] = hmm_pfn_to_page(pfns[i]);
> > +
> > +			if (WARN_ON_ONCE(!is_device_private_page(pages[i]))) {
> > +				err = -EOPNOTSUPP;
> > +				goto err_free;
> > +			}
> > +		}
> > +
> > +		/* Do not race with notifier unmapping pages */
> > +		drm_gpusvm_notifier_lock(gpusvm);
> > +		range->flags.has_vram_pages = true;
> > +		range->pages = pages;
> > +		if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
> > +			err = -EAGAIN;
> > +			__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > +		}
> > +		drm_gpusvm_notifier_unlock(gpusvm);
> > +	} else {
> > +		dma_addr_t *dma_addr = (dma_addr_t *)pfns;
> > +
> > +		for_each_dma_page(i, j, npages, order) {
> > +			if (WARN_ON_ONCE(i && order !=
> > +					 hmm_pfn_to_map_order(pfns[i]))) {
> > +				err = -EOPNOTSUPP;
> > +				npages = i;
> > +				goto err_unmap;
> > +			}
> > +			order = hmm_pfn_to_map_order(pfns[i]);
> > +
> > +			pages[j] = hmm_pfn_to_page(pfns[i]);
> > +			if (WARN_ON_ONCE(is_zone_device_page(pages[j]))) {
> > +				err = -EOPNOTSUPP;
> > +				npages = i;
> > +				goto err_unmap;
> > +			}
> > +
> > +			set_page_dirty_lock(pages[j]);
> > +			mark_page_accessed(pages[j]);
> > +
> > +			dma_addr[j] = dma_map_page(gpusvm->drm->dev,
> > +						   pages[j], 0,
> > +						   PAGE_SIZE << order,
> > +						   DMA_BIDIRECTIONAL);
> > +			if (dma_mapping_error(gpusvm->drm->dev, dma_addr[j])) {
> > +				err = -EFAULT;
> > +				npages = i;
> > +				goto err_unmap;
> > +			}
> > +		}
> > +
> > +		/* Huge pages, reduce memory footprint */
> > +		if (order) {
> > +			dma_addr = kmalloc_array(j, sizeof(*dma_addr),
> > +						 GFP_KERNEL);
> > +			if (dma_addr) {
> > +				for (i = 0; i < j; ++i)
> > +					dma_addr[i] = (dma_addr_t)pfns[i];
> > +				kvfree(pfns);
> > +				kfree_mapping = true;
> > +			} else {
> > +				dma_addr = (dma_addr_t *)pfns;
> > +			}
> > +		}
> > +
> > +		/* Do not race with notifier unmapping pages */
> > +		drm_gpusvm_notifier_lock(gpusvm);
> > +		range->order = order;
> > +		range->flags.kfree_mapping = kfree_mapping;
> > +		range->flags.has_dma_mapping = true;
> > +		range->dma_addr = dma_addr;
> > +		range->vram_allocation = NULL;
> > +		if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
> > +			err = -EAGAIN;
> > +			__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > +		}
> > +		drm_gpusvm_notifier_unlock(gpusvm);
> > +	}
> > +
> > +	if (err == -EAGAIN)
> > +		goto retry;
> > +set_seqno:
> > +	range->notifier_seq = hmm_range.notifier_seq;
> > +
> > +	return 0;
> > +
> > +err_unmap:
> > +	for_each_dma_page(i, j, npages, order)
> > +		dma_unmap_page(gpusvm->drm->dev,
> > +			       (dma_addr_t)pfns[j],
> > +			       PAGE_SIZE << order, DMA_BIDIRECTIONAL);
> > +err_free:
> > +	if (alloc_pfns)
> > +		kvfree(pfns);
> > +err_out:
> > +	return err;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_unmap_pages - Unmap pages associated with a GPU SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + * @ctx: GPU SVM context
> > + *
> > + * This function unmaps pages associated with a GPU SVM range. If @in_notifier
> > + * is set, it is assumed that gpusvm->notifier_lock is held in write mode; if it
> > + * is clear, it acquires gpusvm->notifier_lock in read mode. Must be called on
> > + * each GPU SVM range attached to notifier in gpusvm->ops->invalidate for IOMMU
> > + * security model.
> > + */
> > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> > +				  struct drm_gpusvm_range *range,
> > +				  const struct drm_gpusvm_ctx *ctx)
> > +{
> > +	if (ctx->in_notifier)
> > +		lockdep_assert_held_write(&gpusvm->notifier_lock);
> > +	else
> > +		drm_gpusvm_notifier_lock(gpusvm);
> > +
> > +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > +
> > +	if (!ctx->in_notifier)
> > +		drm_gpusvm_notifier_unlock(gpusvm);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migration_put_page - Put a migration page
> > + * @page: Pointer to the page to put
> > + *
> > + * This function unlocks and puts a page.
> > + */
> > +static void drm_gpusvm_migration_put_page(struct page *page)
> > +{
> > +	unlock_page(page);
> > +	put_page(page);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migration_put_pages - Put migration pages
> > + * @npages: Number of pages
> > + * @migrate_pfn: Array of migrate page frame numbers
> > + *
> > + * This function puts an array of pages.
> > + */
> > +static void drm_gpusvm_migration_put_pages(unsigned long npages,
> > +					   unsigned long *migrate_pfn)
> > +{
> > +	unsigned long i;
> > +
> > +	for (i = 0; i < npages; ++i) {
> > +		if (!migrate_pfn[i])
> > +			continue;
> > +
> > +		drm_gpusvm_migration_put_page(migrate_pfn_to_page(migrate_pfn[i]));
> > +		migrate_pfn[i] = 0;
> > +	}
> > +}
> > +
> > +/**
> > + * drm_gpusvm_get_vram_page - Get a reference to a VRAM page
> > + * @page: Pointer to the page
> > + * @zdd: Pointer to the GPU SVM zone device data
> > + *
> > + * This function associates the given page with the specified GPU SVM zone
> > + * device data and initializes it for zone device usage.
> > + */
> > +static void drm_gpusvm_get_vram_page(struct page *page,
> > +				     struct drm_gpusvm_zdd *zdd)
> > +{
> > +	page->zone_device_data = drm_gpusvm_zdd_get(zdd);
> > +	zone_device_page_init(page);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_map_pages() - Map migration pages for GPU SVM migration
> > + * @dev: The device for which the pages are being mapped
> > + * @dma_addr: Array to store DMA addresses corresponding to mapped pages
> > + * @migrate_pfn: Array of migrate page frame numbers to map
> > + * @npages: Number of pages to map
> > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > + *
> > + * This function maps pages of memory for migration usage in GPU SVM. It
> > + * iterates over each page frame number provided in @migrate_pfn, maps the
> > + * corresponding page, and stores the DMA address in the provided @dma_addr
> > + * array.
> > + *
> > + * Return: 0 on success, -EFAULT if an error occurs during mapping.
> > + */
> > +static int drm_gpusvm_migrate_map_pages(struct device *dev,
> > +					dma_addr_t *dma_addr,
> > +					long unsigned int *migrate_pfn,
> > +					unsigned long npages,
> > +					enum dma_data_direction dir)
> > +{
> > +	unsigned long i;
> > +
> > +	for (i = 0; i < npages; ++i) {
> > +		struct page *page = migrate_pfn_to_page(migrate_pfn[i]);
> > +
> > +		if (!page)
> > +			continue;
> > +
> > +		if (WARN_ON_ONCE(is_zone_device_page(page)))
> > +			return -EFAULT;
> > +
> > +		dma_addr[i] = dma_map_page(dev, page, 0, PAGE_SIZE, dir);
> > +		if (dma_mapping_error(dev, dma_addr[i]))
> > +			return -EFAULT;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_unmap_pages() - Unmap pages previously mapped for GPU SVM migration
> > + * @dev: The device for which the pages were mapped
> > + * @dma_addr: Array of DMA addresses corresponding to mapped pages
> > + * @npages: Number of pages to unmap
> > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > + *
> > + * This function unmaps previously mapped pages of memory for GPU Shared Virtual
> > + * Memory (SVM). It iterates over each DMA address provided in @dma_addr, checks
> > + * if it's valid and not already unmapped, and unmaps the corresponding page.
> > + */
> > +static void drm_gpusvm_migrate_unmap_pages(struct device *dev,
> > +					   dma_addr_t *dma_addr,
> > +					   unsigned long npages,
> > +					   enum dma_data_direction dir)
> > +{
> > +	unsigned long i;
> > +
> > +	for (i = 0; i < npages; ++i) {
> > +		if (!dma_addr[i] || dma_mapping_error(dev, dma_addr[i]))
> > +			continue;
> > +
> > +		dma_unmap_page(dev, dma_addr[i], PAGE_SIZE, dir);
> > +	}
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_to_vram - Migrate GPU SVM range to VRAM
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + * @vram_allocation: Driver-private pointer to the VRAM allocation. The caller
> > + *                   should hold a reference to the VRAM allocation, which
> > + *                   should be dropped via ops->vram_release or upon the
> > + *                   failure of this function.
> > + * @ctx: GPU SVM context
> > + *
> > + * This function migrates the specified GPU SVM range to VRAM. It performs the
> > + * necessary setup and invokes the driver-specific operations for migration to
> > + * VRAM. Upon successful return, @vram_allocation can safely reference @range
> > + * until ops->vram_release is called, which only happens after a successful
> > + * return.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> > +			       struct drm_gpusvm_range *range,
> > +			       void *vram_allocation,
> > +			       const struct drm_gpusvm_ctx *ctx)
> > +{
> > +	u64 start = range->va.start, end = range->va.end;
> > +	struct migrate_vma migrate = {
> > +		.start		= start,
> > +		.end		= end,
> > +		.pgmap_owner	= gpusvm->device_private_page_owner,
> > +		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
> > +	};
> > +	struct mm_struct *mm = gpusvm->mm;
> > +	unsigned long i, npages = npages_in_range(start, end);
> > +	struct vm_area_struct *vas;
> > +	struct drm_gpusvm_zdd *zdd = NULL;
> > +	struct page **pages;
> > +	dma_addr_t *dma_addr;
> > +	void *buf;
> > +	int err;
> > +
> > +	if (!range->flags.migrate_vram)
> > +		return -EINVAL;
> > +
> > +	if (!gpusvm->ops->populate_vram_pfn || !gpusvm->ops->copy_to_vram ||
> > +	    !gpusvm->ops->copy_to_sram)
> > +		return -EOPNOTSUPP;
> > +
> > +	if (!ctx->mmap_locked) {
> > +		if (!mmget_not_zero(mm)) {
> > +			err = -EFAULT;
> > +			goto err_out;
> > +		}
> > +		mmap_write_lock(mm);
> > +	}
> > +
> > +	mmap_assert_locked(mm);
> > +
> > +	vas = vma_lookup(mm, start);
> > +	if (!vas) {
> > +		err = -ENOENT;
> > +		goto err_mmunlock;
> > +	}
> > +
> > +	if (end > vas->vm_end || start < vas->vm_start) {
> > +		err = -EINVAL;
> > +		goto err_mmunlock;
> > +	}
> > +
> > +	if (!vma_is_anonymous(vas)) {
> > +		err = -EBUSY;
> > +		goto err_mmunlock;
> > +	}
> > +
> > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) + sizeof(*dma_addr) +
> > +		       sizeof(*pages), GFP_KERNEL);
> > +	if (!buf) {
> > +		err = -ENOMEM;
> > +		goto err_mmunlock;
> > +	}
> > +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr)) * npages;
> > +
> > +	zdd = drm_gpusvm_zdd_alloc(range);
> > +	if (!zdd) {
> > +		err = -ENOMEM;
> > +		goto err_free;
> > +	}
> > +
> > +	migrate.vma = vas;
> > +	migrate.src = buf;
> > +	migrate.dst = migrate.src + npages;
> > +
> > +	err = migrate_vma_setup(&migrate);
> > +	if (err)
> > +		goto err_free;
> > +
> > +	/*
> > +	 * FIXME: The cases below, !migrate.cpages and migrate.cpages != npages, are
> > +	 * not always an error. Need to revisit possible cases and how to handle them.
> > +	 * We could prefault on migrate.cpages != npages via hmm_range_fault.
> > +	 */
> > +
> > +	if (!migrate.cpages) {
> > +		err = -EFAULT;
> > +		goto err_free;
> > +	}
> > +
> > +	if (migrate.cpages != npages) {
> > +		err = -EBUSY;
> > +		goto err_finalize;
> > +	}
> > +
> > +	err = gpusvm->ops->populate_vram_pfn(gpusvm, vram_allocation, npages,
> > +					     migrate.dst);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev, dma_addr,
> > +					   migrate.src, npages, DMA_TO_DEVICE);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	for (i = 0; i < npages; ++i) {
> > +		struct page *page = pfn_to_page(migrate.dst[i]);
> > +
> > +		pages[i] = page;
> > +		migrate.dst[i] = migrate_pfn(migrate.dst[i]);
> > +		drm_gpusvm_get_vram_page(page, zdd);
> > +	}
> > +
> > +	err = gpusvm->ops->copy_to_vram(gpusvm, pages, dma_addr, npages);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	/* Upon success bind vram allocation to range and zdd */
> > +	range->vram_allocation = vram_allocation;
> > +	WRITE_ONCE(zdd->vram_allocation, vram_allocation);	/* Owns ref */
> > +
> > +err_finalize:
> > +	if (err)
> > +		drm_gpusvm_migration_put_pages(npages, migrate.dst);
> > +	migrate_vma_pages(&migrate);
> > +	migrate_vma_finalize(&migrate);
> > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr, npages,
> > +				       DMA_TO_DEVICE);
> > +err_free:
> > +	if (zdd)
> > +		drm_gpusvm_zdd_put(zdd);
> > +	kvfree(buf);
> > +err_mmunlock:
> > +	if (!ctx->mmap_locked) {
> > +		mmap_write_unlock(mm);
> > +		mmput(mm);
> > +	}
> > +err_out:
> > +	return err;
> > +}
> > +
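
[Aside for readers of the ownership rule documented above: a minimal caller
sketch, with every driver_* name purely hypothetical, of how the
@vram_allocation reference is expected to flow. Illustration only, not part
of this series.]

	vram = driver_vram_alloc(size);		/* caller holds a reference */
	if (IS_ERR(vram))
		return PTR_ERR(vram);

	err = drm_gpusvm_migrate_to_vram(gpusvm, range, vram, &ctx);
	if (err)
		driver_vram_put(vram);		/* ref not consumed on failure */
	/* on success the reference is dropped later via ops->vram_release */
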
> > +/**
> > + * drm_gpusvm_migrate_populate_sram_pfn - Populate SRAM PFNs for a VM area
> > + * @vas: Pointer to the VM area structure, can be NULL
> > + * @npages: Number of pages to populate
> > + * @src_mpfn: Source array of migrate PFNs
> > + * @mpfn: Array of migrate PFNs to populate
> > + * @addr: Start address for PFN allocation
> > + *
> > + * This function populates the SRAM migrate page frame numbers (PFNs) for the
> > + * specified VM area structure. It allocates and locks pages in the VM area for
> > + * SRAM usage. If @vas is non-NULL, alloc_page_vma() is used for allocation;
> > + * otherwise, alloc_page() is used.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +static int drm_gpusvm_migrate_populate_sram_pfn(struct vm_area_struct *vas,
> > +						unsigned long npages,
> > +						unsigned long *src_mpfn,
> > +						unsigned long *mpfn, u64 addr)
> > +{
> > +	unsigned long i;
> > +
> > +	for (i = 0; i < npages; ++i, addr += PAGE_SIZE) {
> > +		struct page *page;
> > +
> > +		if (!(src_mpfn[i] & MIGRATE_PFN_MIGRATE))
> > +			continue;
> > +
> > +		if (vas)
> > +			page = alloc_page_vma(GFP_HIGHUSER, vas, addr);
> > +		else
> > +			page = alloc_page(GFP_HIGHUSER);
> > +
> > +		if (!page)
> > +			return -ENOMEM;
> > +
> > +		lock_page(page);
> > +		mpfn[i] = migrate_pfn(page_to_pfn(page));
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_evict_to_sram - Evict GPU SVM range to SRAM
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + *
> > + * Similar to __drm_gpusvm_migrate_to_sram() but does not require the mmap lock;
> > + * migration is done via the migrate_device_* functions. This is a fallback path,
> > + * as it is preferred to issue migrations while holding the mmap lock.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +static int drm_gpusvm_evict_to_sram(struct drm_gpusvm *gpusvm,
> > +				    struct drm_gpusvm_range *range)
> > +{
> > +	unsigned long npages;
> > +	struct page **pages;
> > +	unsigned long *src, *dst;
> > +	dma_addr_t *dma_addr;
> > +	void *buf;
> > +	int i, err = 0;
> > +
> > +	npages = npages_in_range(range->va.start, range->va.end);
> > +
> > +	buf = kvcalloc(npages, 2 * sizeof(*src) + sizeof(*dma_addr) +
> > +		       sizeof(*pages), GFP_KERNEL);
> > +	if (!buf) {
> > +		err = -ENOMEM;
> > +		goto err_out;
> > +	}
> > +	src = buf;
> > +	dst = buf + (sizeof(*src) * npages);
> > +	dma_addr = buf + (2 * sizeof(*src) * npages);
> > +	pages = buf + (2 * sizeof(*src) + sizeof(*dma_addr)) * npages;
> > +
> > +	err = gpusvm->ops->populate_vram_pfn(gpusvm, range->vram_allocation,
> > +					     npages, src);
> > +	if (err)
> > +		goto err_free;
> > +
> > +	err = migrate_device_vma_range(gpusvm->mm,
> > +				       gpusvm->device_private_page_owner, src,
> > +				       npages, range->va.start);
> > +	if (err)
> > +		goto err_free;
> > +
> > +	err = drm_gpusvm_migrate_populate_sram_pfn(NULL, npages, src, dst, 0);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev, dma_addr,
> > +					   dst, npages, DMA_BIDIRECTIONAL);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	for (i = 0; i < npages; ++i)
> > +		pages[i] = migrate_pfn_to_page(src[i]);
> > +
> > +	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr, npages);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +err_finalize:
> > +	if (err)
> > +		drm_gpusvm_migration_put_pages(npages, dst);
> > +	migrate_device_pages(src, dst, npages);
> > +	migrate_device_finalize(src, dst, npages);
> > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr, npages,
> > +				       DMA_BIDIRECTIONAL);
> > +err_free:
> > +	kvfree(buf);
> > +err_out:
> > +
> > +	return err;
> > +}
> > +
> > +/**
> > + * __drm_gpusvm_migrate_to_sram - Migrate GPU SVM range to SRAM (internal)
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @vas: Pointer to the VM area structure
> > + * @page: Pointer to the page for fault handling (can be NULL)
> > + * @start: Start address of the migration range
> > + * @end: End address of the migration range
> > + *
> > + * This internal function performs the migration of the specified GPU SVM range
> > + * to SRAM. It sets up the migration, populates + dma maps SRAM PFNs, and
> > + * invokes the driver-specific operations for migration to SRAM.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +static int __drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > +					struct vm_area_struct *vas,
> > +					struct page *page,
> > +					u64 start, u64 end)
> > +{
> > +	struct migrate_vma migrate = {
> > +		.vma		= vas,
> > +		.pgmap_owner	= gpusvm->device_private_page_owner,
> > +		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> > +		.fault_page	= page,
> > +	};
> > +	unsigned long npages;
> > +	struct page **pages;
> > +	dma_addr_t *dma_addr;
> > +	void *buf;
> > +	int i, err = 0;
> > +
> > +	mmap_assert_locked(gpusvm->mm);
> > +
> > +	/* Corner case where the VMA has been partially unmapped */
> > +	if (start < vas->vm_start)
> > +		start = vas->vm_start;
> > +	if (end > vas->vm_end)
> > +		end = vas->vm_end;
> > +
> > +	migrate.start = start;
> > +	migrate.end = end;
> > +	npages = npages_in_range(start, end);
> > +
> > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) + sizeof(*dma_addr) +
> > +		       sizeof(*pages), GFP_KERNEL);
> > +	if (!buf) {
> > +		err = -ENOMEM;
> > +		goto err_out;
> > +	}
> > +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr)) * npages;
> > +
> > +	migrate.vma = vas;
> > +	migrate.src = buf;
> > +	migrate.dst = migrate.src + npages;
> > +
> > +	err = migrate_vma_setup(&migrate);
> > +	if (err)
> > +		goto err_free;
> > +
> > +	/* Raced with another CPU fault, nothing to do */
> > +	if (!migrate.cpages)
> > +		goto err_free;
> > +
> > +	err = drm_gpusvm_migrate_populate_sram_pfn(vas, npages,
> > +						   migrate.src, migrate.dst,
> > +						   start);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev, dma_addr,
> > +					   migrate.dst, npages,
> > +					   DMA_BIDIRECTIONAL);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	for (i = 0; i < npages; ++i)
> > +		pages[i] = migrate_pfn_to_page(migrate.src[i]);
> > +
> > +	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr, npages);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +err_finalize:
> > +	if (err)
> > +		drm_gpusvm_migration_put_pages(npages, migrate.dst);
> > +	migrate_vma_pages(&migrate);
> > +	migrate_vma_finalize(&migrate);
> > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr, npages,
> > +				       DMA_BIDIRECTIONAL);
> > +err_free:
> > +	kvfree(buf);
> > +err_out:
> > +	mmap_assert_locked(gpusvm->mm);
> > +
> > +	return err;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_to_sram - Migrate (evict) GPU SVM range to SRAM
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + * @ctx: GPU SVM context
> > + *
> > + * This function initiates the migration of the specified GPU SVM range to
> > + * SRAM. It performs necessary checks and invokes the internal migration
> > + * function for actual migration.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > +			       struct drm_gpusvm_range *range,
> > +			       const struct drm_gpusvm_ctx *ctx)
> > +{
> > +	u64 start = range->va.start, end = range->va.end;
> > +	struct mm_struct *mm = gpusvm->mm;
> > +	struct vm_area_struct *vas;
> > +	int err;
> > +	bool retry = false;
> > +
> > +	if (!ctx->mmap_locked) {
> > +		if (!mmget_not_zero(mm)) {
> > +			err = -EFAULT;
> > +			goto err_out;
> > +		}
> > +		if (ctx->trylock_mmap) {
> > +			if (!mmap_read_trylock(mm))  {
> > +				err = drm_gpusvm_evict_to_sram(gpusvm, range);
> > +				goto err_mmput;
> > +			}
> > +		} else {
> > +			mmap_read_lock(mm);
> > +		}
> > +	}
> > +
> > +	mmap_assert_locked(mm);
> > +
> > +	/*
> > +	 * Loop required to find all VMAs for the corner case when the VRAM
> > +	 * backing has been partially unmapped from the MM's address space.
> > +	 */
> > +again:
> > +	vas = find_vma(mm, start);
> > +	if (!vas) {
> > +		if (!retry)
> > +			err = -ENOENT;
> > +		goto err_mmunlock;
> > +	}
> > +
> > +	if (end <= vas->vm_start || start >= vas->vm_end) {
> > +		if (!retry)
> > +			err = -EINVAL;
> > +		goto err_mmunlock;
> > +	}
> > +
> > +	err = __drm_gpusvm_migrate_to_sram(gpusvm, vas, NULL, start, end);
> > +	if (err)
> > +		goto err_mmunlock;
> > +
> > +	if (vas->vm_end < end) {
> > +		retry = true;
> > +		start = vas->vm_end;
> > +		goto again;
> > +	}
> > +
> > +	if (!ctx->mmap_locked) {
> > +		mmap_read_unlock(mm);
> > +		/*
> > +		 * Using mmput_async as this function can be called while
> > +		 * holding a dma-resv lock, and a final put can grab the mmap
> > +		 * lock, causing a lock inversion.
> > +		 */
> > +		mmput_async(mm);
> > +	}
> > +
> > +	return 0;
> > +
> > +err_mmunlock:
> > +	if (!ctx->mmap_locked)
> > +		mmap_read_unlock(mm);
> > +err_mmput:
> > +	if (!ctx->mmap_locked)
> > +		mmput_async(mm);
> > +err_out:
> > +	return err;
> > +}
> > +
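
[For the trylock path above, a sketch of a hypothetical eviction caller that
already holds a dma-resv lock; if the mmap lock cannot be taken,
drm_gpusvm_migrate_to_sram() falls back internally to
drm_gpusvm_evict_to_sram().]

	struct drm_gpusvm_ctx ctx = { .trylock_mmap = true, };
	int err;

	/* dma-resv is already held here, so only trylock the mmap lock */
	err = drm_gpusvm_migrate_to_sram(gpusvm, range, &ctx);
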
> > +/**
> > + * drm_gpusvm_page_free - Put GPU SVM zone device data associated with a page
> > + * @page: Pointer to the page
> > + *
> > + * This function is a callback used to put the GPU SVM zone device data
> > + * associated with a page when it is being released.
> > + */
> > +static void drm_gpusvm_page_free(struct page *page)
> > +{
> > +	drm_gpusvm_zdd_put(page->zone_device_data);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_to_ram - Migrate GPU SVM range to RAM (page fault handler)
> > + * @vmf: Pointer to the fault information structure
> > + *
> > + * This function is a page fault handler used to migrate a GPU SVM range to RAM.
> > + * It retrieves the GPU SVM range information from the faulting page and invokes
> > + * the internal migration function to migrate the range back to RAM.
> > + *
> > + * Returns:
> > + * VM_FAULT_SIGBUS on failure, 0 on success.
> > + */
> > +static vm_fault_t drm_gpusvm_migrate_to_ram(struct vm_fault *vmf)
> > +{
> > +	struct drm_gpusvm_zdd *zdd = vmf->page->zone_device_data;
> > +	int err;
> > +
> > +	err = __drm_gpusvm_migrate_to_sram(zdd->range->gpusvm,
> > +					   vmf->vma, vmf->page,
> > +					   zdd->range->va.start,
> > +					   zdd->range->va.end);
> > +
> > +	return err ? VM_FAULT_SIGBUS : 0;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_pagemap_ops - Device page map operations for GPU SVM
> > + */
> > +static const struct dev_pagemap_ops drm_gpusvm_pagemap_ops = {
> > +	.page_free = drm_gpusvm_page_free,
> > +	.migrate_to_ram = drm_gpusvm_migrate_to_ram,
> > +};
> > +
> > +/**
> > + * drm_gpusvm_pagemap_ops_get - Retrieve GPU SVM device page map operations
> > + *
> > + * Returns:
> > + * Pointer to the GPU SVM device page map operations structure.
> > + */
> > +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void)
> > +{
> > +	return &drm_gpusvm_pagemap_ops;
> > +}
> > +
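
[For context, a sketch of how a driver might plug these ops into its device
pagemap when setting up a VRAM region; the resource and the
driver_private_owner symbol are illustrative assumptions.]

	pagemap->type = MEMORY_DEVICE_PRIVATE;
	pagemap->range.start = res->start;
	pagemap->range.end = res->end;
	pagemap->nr_range = 1;
	pagemap->ops = drm_gpusvm_pagemap_ops_get();
	/* must match the device_private_page_owner given to drm_gpusvm_init() */
	pagemap->owner = driver_private_owner;

	addr = devm_memremap_pages(dev, pagemap);
	if (IS_ERR(addr))
		return PTR_ERR(addr);
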
> > +/**
> > + * drm_gpusvm_has_mapping - Check if GPU SVM has mapping for the given address range
> > + * @gpusvm: Pointer to the GPU SVM structure.
> > + * @start: Start address
> > + * @end: End address
> > + *
> > + * Returns:
> > + * True if GPU SVM has mapping, False otherwise
> > + */
> > +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start, u64 end)
> > +{
> > +	struct drm_gpusvm_notifier *notifier;
> > +
> > +	drm_gpusvm_for_each_notifier(notifier, gpusvm, start, end) {
> > +		struct drm_gpusvm_range *range = NULL;
> > +
> > +		drm_gpusvm_for_each_range(range, notifier, start, end)
> > +			return true;
> > +	}
> > +
> > +	return false;
> > +}
> > diff --git a/drivers/gpu/drm/xe/drm_gpusvm.h b/drivers/gpu/drm/xe/drm_gpusvm.h
> > new file mode 100644
> > index 000000000000..0ea70f8534a8
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/drm_gpusvm.h
> > @@ -0,0 +1,415 @@
> > +/* SPDX-License-Identifier: MIT */
> > +/*
> > + * Copyright © 2024 Intel Corporation
> > + */
> > +
> > +#ifndef __DRM_GPUSVM_H__
> > +#define __DRM_GPUSVM_H__
> > +
> > +#include <linux/kref.h>
> > +#include <linux/mmu_notifier.h>
> > +#include <linux/workqueue.h>
> > +
> > +struct dev_pagemap_ops;
> > +struct drm_device;
> > +struct drm_gpusvm;
> > +struct drm_gpusvm_notifier;
> > +struct drm_gpusvm_ops;
> > +struct drm_gpusvm_range;
> > +
> > +/**
> > + * struct drm_gpusvm_ops - Operations structure for GPU SVM
> > + *
> > + * This structure defines the operations for GPU Shared Virtual Memory (SVM).
> > + * These operations are provided by the GPU driver to manage SVM ranges and
> > + * perform operations such as migration between VRAM and system RAM.
> > + */
> > +struct drm_gpusvm_ops {
> > +	/**
> > +	 * @notifier_alloc: Allocate a GPU SVM notifier (optional)
> > +	 *
> > +	 * This function shall allocate a GPU SVM notifier.
> > +	 *
> > +	 * Returns:
> > +	 * Pointer to the allocated GPU SVM notifier on success, NULL on failure.
> > +	 */
> > +	struct drm_gpusvm_notifier *(*notifier_alloc)(void);
> > +
> > +	/**
> > +	 * @notifier_free: Free a GPU SVM notifier (optional)
> > +	 * @notifier: Pointer to the GPU SVM notifier to be freed
> > +	 *
> > +	 * This function shall free a GPU SVM notifier.
> > +	 */
> > +	void (*notifier_free)(struct drm_gpusvm_notifier *notifier);
> > +
> > +	/**
> > +	 * @range_alloc: Allocate a GPU SVM range (optional)
> > +	 * @gpusvm: Pointer to the GPU SVM
> > +	 *
> > +	 * This function shall allocate a GPU SVM range.
> > +	 *
> > +	 * Returns:
> > +	 * Pointer to the allocated GPU SVM range on success, NULL on failure.
> > +	 */
> > +	struct drm_gpusvm_range *(*range_alloc)(struct drm_gpusvm *gpusvm);
> > +
> > +	/**
> > +	 * @range_free: Free a GPU SVM range (optional)
> > +	 * @range: Pointer to the GPU SVM range to be freed
> > +	 *
> > +	 * This function shall free a GPU SVM range.
> > +	 */
> > +	void (*range_free)(struct drm_gpusvm_range *range);
> > +
> > +	/**
> > +	 * @vram_release: Release VRAM allocation (optional)
> > +	 * @vram_allocation: Driver-private pointer to the VRAM allocation
> > +	 *
> > +	 * This function shall release VRAM allocation and expects to drop a
> > +	 * reference to VRAM allocation.
> > +	 */
> > +	void (*vram_release)(void *vram_allocation);
> > +
> > +	/**
> > +	 * @populate_vram_pfn: Populate VRAM PFN (required for migration)
> > +	 * @gpusvm: Pointer to the GPU SVM
> > +	 * @vram_allocation: Driver-private pointer to the VRAM allocation
> > +	 * @npages: Number of pages to populate
> > +	 * @pfn: Array of page frame numbers to populate
> > +	 *
> > +	 * This function shall populate VRAM page frame numbers (PFN).
> > +	 *
> > +	 * Returns:
> > +	 * 0 on success, a negative error code on failure.
> > +	 */
> > +	int (*populate_vram_pfn)(struct drm_gpusvm *gpusvm,
> > +				 void *vram_allocation,
> > +				 unsigned long npages,
> > +				 unsigned long *pfn);
> > +
> > +	/**
> > +	 * @copy_to_vram: Copy to VRAM (required for migration)
> > +	 * @gpusvm: Pointer to the GPU SVM
> > +	 * @pages: Pointer to array of VRAM pages (destination)
> > +	 * @dma_addr: Pointer to array of DMA addresses (source)
> > +	 * @npages: Number of pages to copy
> > +	 *
> > +	 * This function shall copy pages to VRAM.
> > +	 *
> > +	 * Returns:
> > +	 * 0 on success, a negative error code on failure.
> > +	 */
> > +	int (*copy_to_vram)(struct drm_gpusvm *gpusvm,
> > +			    struct page **pages,
> > +			    dma_addr_t *dma_addr,
> > +			    unsigned long npages);
> > +
> > +	/**
> > +	 * @copy_to_sram: Copy to system RAM (required for migration)
> > +	 * @gpusvm: Pointer to the GPU SVM
> > +	 * @pages: Pointer to array of VRAM pages (source)
> > +	 * @dma_addr: Pointer to array of DMA addresses (destination)
> > +	 * @npages: Number of pages to copy
> > +	 *
> > +	 * This function shall copy pages to system RAM.
> > +	 *
> > +	 * Returns:
> > +	 * 0 on success, a negative error code on failure.
> > +	 */
> > +	int (*copy_to_sram)(struct drm_gpusvm *gpusvm,
> > +			    struct page **pages,
> > +			    dma_addr_t *dma_addr,
> > +			    unsigned long npages);
> > +
> > +	/**
> > +	 * @invalidate: Invalidate GPU SVM notifier (required)
> > +	 * @gpusvm: Pointer to the GPU SVM
> > +	 * @notifier: Pointer to the GPU SVM notifier
> > +	 * @mmu_range: Pointer to the mmu_notifier_range structure
> > +	 *
> > +	 * This function shall invalidate the GPU page tables. It can safely
> > +	 * walk the notifier range RB tree/list in this function. Called while
> > +	 * holding the notifier lock.
> > +	 */
> > +	void (*invalidate)(struct drm_gpusvm *gpusvm,
> > +			   struct drm_gpusvm_notifier *notifier,
> > +			   const struct mmu_notifier_range *mmu_range);
> > +};
> > +
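
[To make the required/optional split concrete, a hypothetical driver could
wire the ops up roughly as below; every driver_* symbol is made up for
illustration.]

	static const struct drm_gpusvm_ops driver_gpusvm_ops = {
		/* required */
		.invalidate		= driver_invalidate,
		/* required only if migration to/from VRAM is used */
		.populate_vram_pfn	= driver_populate_vram_pfn,
		.copy_to_vram		= driver_copy_to_vram,
		.copy_to_sram		= driver_copy_to_sram,
		/* optional */
		.vram_release		= driver_vram_release,
	};
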
> > +/**
> > + * struct drm_gpusvm_notifier - Structure representing a GPU SVM notifier
> > + *
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: MMU interval notifier
> > + * @interval: Interval for the notifier
> > + * @rb: Red-black tree node for the parent GPU SVM structure notifier tree
> > + * @root: Cached root node of the RB tree containing ranges
> > + * @range_list: List head of ranges in the same order they appear in the
> > + *              interval tree. This is useful to keep iterating over ranges
> > + *              while modifications are made to the RB tree.
> > + * @flags.removed: Flag indicating whether the MMU interval notifier has been
> > + *                 removed
> > + *
> > + * This structure represents a GPU SVM notifier.
> > + */
> > +struct drm_gpusvm_notifier {
> > +	struct drm_gpusvm *gpusvm;
> > +	struct mmu_interval_notifier notifier;
> > +	struct {
> > +		u64 start;
> > +		u64 end;
> > +	} interval;
> > +	struct {
> > +		struct rb_node node;
> > +		struct list_head entry;
> > +		u64 __subtree_last;
> > +	} rb;
> > +	struct rb_root_cached root;
> > +	struct list_head range_list;
> > +	struct {
> > +		u32 removed : 1;
> > +	} flags;
> > +};
> > +
> > +/**
> > + * struct drm_gpusvm_range - Structure representing a GPU SVM range
> > + *
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier
> > + * @refcount: Reference count for the range
> > + * @rb: Red-black tree node for the parent GPU SVM notifier structure range tree
> > + * @va: Virtual address range
> > + * @notifier_seq: Notifier sequence number of the range's pages
> > + * @pages: Pointer to the array of pages (if backing store is in VRAM)
> > + * @dma_addr: DMA address array (if backing store is SRAM and DMA mapped)
> > + * @vram_allocation: Driver-private pointer to the VRAM allocation
> > + * @order: Order of dma mapping. i.e. PAGE_SIZE << order is mapping size
> > + * @flags.migrate_vram: Flag indicating whether the range can be migrated to VRAM
> > + * @flags.unmapped: Flag indicating if the range has been unmapped
> > + * @flags.partial_unmap: Flag indicating if the range has been partially unmapped
> > + * @flags.has_vram_pages: Flag indicating if the range has vram pages
> > + * @flags.has_dma_mapping: Flag indicating if the range has a DMA mapping
> > + * @flags.kfree_mapping: Flag indicating @dma_addr is a compact allocation based
> > + *                       on @order which is released via kfree
> > + *
> > + * This structure represents a GPU SVM range used for tracking memory ranges
> > + * mapped in a DRM device.
> > + */
> > +struct drm_gpusvm_range {
> > +	struct drm_gpusvm *gpusvm;
> > +	struct drm_gpusvm_notifier *notifier;
> > +	struct kref refcount;
> > +	struct {
> > +		struct rb_node node;
> > +		struct list_head entry;
> > +		u64 __subtree_last;
> > +	} rb;
> > +	struct {
> > +		u64 start;
> > +		u64 end;
> > +	} va;
> > +	unsigned long notifier_seq;
> > +	union {
> > +		struct page **pages;
> > +		dma_addr_t *dma_addr;
> > +	};
> > +	void *vram_allocation;
> > +	u16 order;
> > +	struct {
> > +		/* All flags below must be set upon creation */
> > +		u16 migrate_vram : 1;
> > +		/* All flags below must be set / cleared under notifier lock */
> > +		u16 unmapped : 1;
> > +		u16 partial_unmap : 1;
> > +		u16 has_vram_pages : 1;
> > +		u16 has_dma_mapping : 1;
> > +		u16 kfree_mapping : 1;
> > +	} flags;
> > +};
> > +
> > +/**
> > + * struct drm_gpusvm - GPU SVM structure
> > + *
> > + * @name: Name of the GPU SVM
> > + * @drm: Pointer to the DRM device structure
> > + * @mm: Pointer to the mm_struct for the address space
> > + * @device_private_page_owner: Device private pages owner
> > + * @mm_start: Start address of GPU SVM
> > + * @mm_range: Range of the GPU SVM
> > + * @notifier_size: Size of individual notifiers
> > + * @ops: Pointer to the operations structure for GPU SVM
> > + * @chunk_sizes: Pointer to the array of chunk sizes used in range allocation.
> > + *               Entries should be powers of 2 in descending order.
> > + * @num_chunks: Number of chunks
> > + * @notifier_lock: Read-write semaphore for protecting notifier operations
> > + * @zdd_wq: Workqueue for deferred work on zdd destruction
> > + * @root: Cached root node of the Red-Black tree containing GPU SVM notifiers
> > + * @notifier_list: List head of notifiers in the same order they appear in the
> > + *                 interval tree. This is useful to keep iterating over
> > + *                 notifiers while modifications are made to the RB tree.
> > + *
> > + * This structure represents a GPU SVM (Shared Virtual Memory) used for tracking
> > + * memory ranges mapped in a DRM (Direct Rendering Manager) device.
> > + *
> > + * No reference counting is provided, as this is expected to be embedded in the
> > + * driver VM structure along with the struct drm_gpuvm, which handles reference
> > + * counting.
> > + */
> > +struct drm_gpusvm {
> > +	const char *name;
> > +	struct drm_device *drm;
> > +	struct mm_struct *mm;
> > +	void *device_private_page_owner;
> > +	u64 mm_start;
> > +	u64 mm_range;
> > +	u64 notifier_size;
> > +	const struct drm_gpusvm_ops *ops;
> > +	const u64 *chunk_sizes;
> > +	int num_chunks;
> > +	struct rw_semaphore notifier_lock;
> > +	struct workqueue_struct *zdd_wq;
> > +	struct rb_root_cached root;
> > +	struct list_head notifier_list;
> > +};
> > +
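
[As an illustration of @chunk_sizes / @num_chunks being descending powers of
two, and of the init parameters declared further down in this header; the
vm->svm embedding and all driver_* names are hypothetical.]

	static const u64 driver_chunk_sizes[] = { SZ_2M, SZ_64K, SZ_4K };

	err = drm_gpusvm_init(&vm->svm, "driver-svm", drm, current->mm,
			      driver_private_owner, 0, TASK_SIZE,
			      SZ_512M /* notifier_size */,
			      &driver_gpusvm_ops, driver_chunk_sizes,
			      ARRAY_SIZE(driver_chunk_sizes));
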
> > +/**
> > + * struct drm_gpusvm_ctx - DRM GPU SVM context
> > + *
> > + * @mmap_locked: mmap lock is locked
> > + * @trylock_mmap: trylock mmap lock, used to avoid locking inversions
> > + *                (e.g. dma-resv -> mmap lock)
> > + * @in_notifier: entering from a MMU notifier
> > + * @read_only: operating on read-only memory
> > + * @vram_possible: possible to use VRAM
> > + * @prefault: prefault pages
> > + *
> > + * Context that DRM GPU SVM is operating in (i.e. user arguments).
> > + */
> > +struct drm_gpusvm_ctx {
> > +	u32 mmap_locked :1;
> > +	u32 trylock_mmap :1;
> > +	u32 in_notifier :1;
> > +	u32 read_only :1;
> > +	u32 vram_possible :1;
> > +	u32 prefault :1;
> > +};
> > +
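
[A couple of hypothetical examples of how these flags are meant to combine.]

	/* GPU fault handler: VRAM migration possible, prefault the range */
	struct drm_gpusvm_ctx fault_ctx = {
		.vram_possible	= true,
		.prefault	= true,
	};

	/* MMU notifier callback: the notifier lock is already held */
	struct drm_gpusvm_ctx notifier_ctx = {
		.in_notifier	= true,
	};
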
> > +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> > +		    const char *name, struct drm_device *drm,
> > +		    struct mm_struct *mm, void *device_private_page_owner,
> > +		    u64 mm_start, u64 mm_range, u64 notifier_size,
> > +		    const struct drm_gpusvm_ops *ops,
> > +		    const u64 *chunk_sizes, int num_chunks);
> > +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm);
> > +void drm_gpusvm_free(struct drm_gpusvm *gpusvm);
> > +
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64 fault_addr,
> > +				u64 gpuva_start, u64 gpuva_end,
> > +				const struct drm_gpusvm_ctx *ctx);
> > +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> > +			     struct drm_gpusvm_range *range);
> > +
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_get(struct drm_gpusvm_range *range);
> > +void drm_gpusvm_range_put(struct drm_gpusvm_range *range);
> > +
> > +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> > +				  struct drm_gpusvm_range *range);
> > +
> > +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> > +			       struct drm_gpusvm_range *range,
> > +			       const struct drm_gpusvm_ctx *ctx);
> > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> > +				  struct drm_gpusvm_range *range,
> > +				  const struct drm_gpusvm_ctx *ctx);
> > +
> > +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> > +			       struct drm_gpusvm_range *range,
> > +			       void *vram_allocation,
> > +			       const struct drm_gpusvm_ctx *ctx);
> > +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > +			       struct drm_gpusvm_range *range,
> > +			       const struct drm_gpusvm_ctx *ctx);
> > +
> > +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void);
> > +
> > +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start, u64 end);
> > +
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64 start, u64 end);
> > +
> > +/**
> > + * drm_gpusvm_notifier_lock - Lock GPU SVM notifier
> > + * @gpusvm__: Pointer to the GPU SVM structure.
> > + *
> > + * Abstract client usage GPU SVM notifier lock, take lock
> > + */
> > +#define drm_gpusvm_notifier_lock(gpusvm__)	\
> > +	down_read(&(gpusvm__)->notifier_lock)
> > +
> > +/**
> > + * drm_gpusvm_notifier_unlock - Unlock GPU SVM notifier
> > + * @gpusvm__: Pointer to the GPU SVM structure.
> > + *
> > + * Abstract client usage GPU SVM notifier lock, drop lock
> > + */
> > +#define drm_gpusvm_notifier_unlock(gpusvm__)	\
> > +	up_read(&(gpusvm__)->notifier_lock)
> > +
> > +/**
> > + * __drm_gpusvm_range_next - Get the next GPU SVM range in the list
> > + * @range: a pointer to the current GPU SVM range
> > + *
> > + * Return: A pointer to the next drm_gpusvm_range if available, or NULL if the
> > + *         current range is the last one or if the input range is NULL.
> > + */
> > +static inline struct drm_gpusvm_range *
> > +__drm_gpusvm_range_next(struct drm_gpusvm_range *range)
> > +{
> > +	if (range && !list_is_last(&range->rb.entry,
> > +				   &range->notifier->range_list))
> > +		return list_next_entry(range, rb.entry);
> > +
> > +	return NULL;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_for_each_range - Iterate over GPU SVM ranges in a notifier
> > + * @range__: Iterator variable for the ranges. If set, it indicates the start of
> > + *	     the iterator. If NULL, call drm_gpusvm_range_find() to get the range.
> > + * @notifier__: Pointer to the GPU SVM notifier
> > + * @start__: Start address of the range
> > + * @end__: End address of the range
> > + *
> > + * This macro is used to iterate over GPU SVM ranges in a notifier. It is safe
> > + * to use while holding the driver SVM lock or the notifier lock.
> > + */
> > +#define drm_gpusvm_for_each_range(range__, notifier__, start__, end__)	\
> > +	for ((range__) = (range__) ?:					\
> > +	     drm_gpusvm_range_find((notifier__), (start__), (end__));	\
> > +	     (range__) && (range__->va.start < (end__));		\
> > +	     (range__) = __drm_gpusvm_range_next(range__))
> > +
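
[A minimal iteration sketch under the notifier lock; driver_unmap_range() is a
hypothetical per-range hook.]

	struct drm_gpusvm_range *range = NULL;

	drm_gpusvm_notifier_lock(gpusvm);
	drm_gpusvm_for_each_range(range, notifier, start, end)
		driver_unmap_range(range);
	drm_gpusvm_notifier_unlock(gpusvm);
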
> > +/**
> > + * drm_gpusvm_range_set_unmapped - Mark a GPU SVM range as unmapped
> > + * @range: Pointer to the GPU SVM range structure.
> > + * @mmu_range: Pointer to the MMU notifier range structure.
> > + *
> > + * This function marks a GPU SVM range as unmapped and sets the partial_unmap flag
> > + * if the provided MMU notifier range only partially covers the GPU SVM range.
> > + */
> > +static inline void
> > +drm_gpusvm_range_set_unmapped(struct drm_gpusvm_range *range,
> > +			      const struct mmu_notifier_range *mmu_range)
> > +{
> > +	lockdep_assert_held_write(&range->gpusvm->notifier_lock);
> > +
> > +	range->flags.unmapped = true;
> > +	if (range->va.start < mmu_range->start ||
> > +	    range->va.end > mmu_range->end)
> > +		range->flags.partial_unmap = true;
> > +}
> > +
> > +#endif /* __DRM_GPUSVM_H__ */
> > -- 
> > 2.34.1
> > 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory
  2024-08-29 20:56         ` Matthew Brost
@ 2024-08-30  8:18           ` Thomas Hellström
  2024-08-30 13:58             ` Matthew Brost
  2024-08-30  9:57           ` Thomas Hellström
  2024-09-02 12:33           ` Daniel Vetter
  2 siblings, 1 reply; 100+ messages in thread
From: Thomas Hellström @ 2024-08-30  8:18 UTC (permalink / raw)
  To: Matthew Brost
  Cc: intel-xe, dri-devel, airlied, christian.koenig, matthew.auld,
	daniel

Hi, Matthew,

On Thu, 2024-08-29 at 20:56 +0000, Matthew Brost wrote:
> On Thu, Aug 29, 2024 at 09:18:29PM +0200, Thomas Hellström wrote:
> > Hi, Matthew,
> > 
> > On Thu, 2024-08-29 at 17:45 +0000, Matthew Brost wrote:
> > > On Thu, Aug 29, 2024 at 11:16:49AM +0200, Thomas Hellström wrote:
> > > > Hi, Matt. 
> > > > 
> > > > Some initial design comments / questions:
> > > > 
> > > > On Tue, 2024-08-27 at 19:48 -0700, Matthew Brost wrote:
> > > > > This patch introduces support for GPU Shared Virtual Memory
> > > > > (SVM)
> > > > > in
> > > > > the
> > > > > Direct Rendering Manager (DRM) subsystem. SVM allows for
> > > > > seamless
> > > > > sharing of memory between the CPU and GPU, enhancing
> > > > > performance
> > > > > and
> > > > > flexibility in GPU computing tasks.
> > > > > 
> > > > > The patch adds the necessary infrastructure for SVM,
> > > > > including
> > > > > data
> > > > > structures and functions for managing SVM ranges and
> > > > > notifiers.
> > > > > It
> > > > > also
> > > > > provides mechanisms for allocating, deallocating, and
> > > > > migrating
> > > > > memory
> > > > > regions between system RAM and GPU VRAM.
> > > > > 
> > > > > This mid-layer is largely inspired by GPUVM.
> > > > > 
> > > > > Cc: Dave Airlie <airlied@redhat.com>
> > > > > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > > > Cc: Christian König <christian.koenig@amd.com>
> > > > > Cc: <dri-devel@lists.freedesktop.org>
> > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > > ---
> > > > >  drivers/gpu/drm/xe/Makefile     |    3 +-
> > > > >  drivers/gpu/drm/xe/drm_gpusvm.c | 2174
> > > > > +++++++++++++++++++++++++++++++
> > > > >  drivers/gpu/drm/xe/drm_gpusvm.h |  415 ++++++
> > > > >  3 files changed, 2591 insertions(+), 1 deletion(-)
> > > > >  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.c
> > > > >  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.h
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/xe/Makefile
> > > > > b/drivers/gpu/drm/xe/Makefile
> > > > > index b9670ae09a9e..b8fc2ee58f1a 100644
> > > > > --- a/drivers/gpu/drm/xe/Makefile
> > > > > +++ b/drivers/gpu/drm/xe/Makefile
> > > > > @@ -25,7 +25,8 @@ $(obj)/generated/%_wa_oob.c
> > > > > $(obj)/generated/%_wa_oob.h: $(obj)/xe_gen_wa_oob \
> > > > >  
> > > > >  # core driver code
> > > > >  
> > > > > -xe-y += xe_bb.o \
> > > > > +xe-y += drm_gpusvm.o \
> > > > > +	xe_bb.o \
> > > > >  	xe_bo.o \
> > > > >  	xe_bo_evict.o \
> > > > >  	xe_devcoredump.o \
> > > > > diff --git a/drivers/gpu/drm/xe/drm_gpusvm.c
> > > > > b/drivers/gpu/drm/xe/drm_gpusvm.c
> > > > > new file mode 100644
> > > > > index 000000000000..fc1e44e6ae72
> > > > > --- /dev/null
> > > > > +++ b/drivers/gpu/drm/xe/drm_gpusvm.c
> > > > > @@ -0,0 +1,2174 @@
> > > > > +// SPDX-License-Identifier: MIT
> > > > > +/*
> > > > > + * Copyright © 2024 Intel Corporation
> > > > > + *
> > > > > + * Authors:
> > > > > + *     Matthew Brost <matthew.brost@intel.com>
> > > > > + */
> > > > > +
> > > > > +#include <linux/dma-mapping.h>
> > > > > +#include <linux/interval_tree_generic.h>
> > > > > +#include <linux/hmm.h>
> > > > > +#include <linux/memremap.h>
> > > > > +#include <linux/migrate.h>
> > > > > +#include <linux/mm_types.h>
> > > > > +#include <linux/pagemap.h>
> > > > > +#include <linux/slab.h>
> > > > > +
> > > > > +#include <drm/drm_device.h>
> > > > > +#include "drm_gpusvm.h"
> > > > > +
> > > > > +/**
> > > > > + * DOC: Overview
> > > > > + *
> > > > > + * GPU Shared Virtual Memory (GPU SVM) layer for the Direct
> > > > > Rendering Manager (DRM)
> > > > > + *
> > > > > + * The GPU SVM layer is a component of the DRM framework
> > > > > designed to
> > > > > manage shared
> > > > > + * virtual memory between the CPU and GPU. It enables
> > > > > efficient
> > > > > data
> > > > > exchange and
> > > > > + * processing for GPU-accelerated applications by allowing
> > > > > memory
> > > > > sharing and
> > > > > + * synchronization between the CPU's and GPU's virtual
> > > > > address
> > > > > spaces.
> > > > > + *
> > > > > + * Key GPU SVM Components:
> > > > > + * - Notifiers: Used for tracking memory
> > > > > intervals
> > > > > and
> > > > > notifying the
> > > > > + *		GPU of changes, notifiers are sized based on
> > > > > a
> > > > > GPU
> > > > > SVM
> > > > > + *		initialization parameter, with a
> > > > > recommendation
> > > > > of
> > > > > 512M or
> > > > > + *		larger. They maintain a Red-Black tree and a
> > > > > list of
> > > > > ranges that
> > > > > + *		fall within the notifier interval. Notifiers
> > > > > are
> > > > > tracked within
> > > > > + *		a GPU SVM Red-Black tree and list and are
> > > > > dynamically inserted
> > > > > + *		or removed as ranges within the interval are
> > > > > created
> > > > > or
> > > > > + *		destroyed.
> > > > 
> > > > What is the benefit of this extra layer compared to direct
> > > > insertion of
> > > > ranges using mmu_interval_notifier_insert?
> > > > 
> > > > IIRC the argument made previously about having wide notifiers
> > > > was
> > > > that
> > > > the rb tree lookups inside the core were costly and if there
> > > > were
> > > > only
> > > > a few, then the rb tree lookups within a notifier range could
> > > > be
> > > > replaced with the page-table radix-tree-like lookup, so each
> > > > lookup
> > > > complexity would be O(log(n_notifiers) + page_table_depth).
> > > > 
> > > > But now we have first an rb-tree lookup in the core and then an
> > > > rb-
> > > > tree
> > > > lookup within each notifier yielding O(log(n_ranges))
> > > > 
> > > > I can see a small benefit in that inserting directly into the
> > > > core
> > > > rb-
> > > > tree will block pending ongoing invalidations, but at a cost of
> > > > an
> > > > extra multiplexing layer.
> > > > 
> > > 
> > > So when the notifier is triggered the search is a smaller range.
> > > In a
> > > perfect world eventually I'd like to drop the SVM range
> > > completely.
> > > There are a lot of changes required in Xe to make that possible and
> > > I'm not entirely convinced it is possible and that the ROI is worth it
> > > (additional
> > > complexity vs. perf benefit). For now, this was a relatively
> > > simple
> > > way
> > > to get SVM working (mirrors both AMD's and Nvidia's implementations
> > > wrt
> > > having a range concept) but also is flexible in the sense the
> > > notifier
> > > size can be easily tweaked via a modparam [1] following Jason's
> > > suggestion of larger notifiers.
> > > 
> > > [1]
> > > https://patchwork.freedesktop.org/patch/611007/?series=137870&rev=1
> > 
> > What I meant was the core is already implementing the "one notifier
> > for
> > the whole range", since your notifier duplicates the
> > mmu_interval_notifier functionality.
> > 
> > The mmu_interval_notifier first does an rbtree search to get to the
> > notifier, and then drm_gpusvm does an rbtree search to get to the
> > range.
> 
> Yes.
> 
> > 
> > If the svm notifier layer is skipped, mmu_interval_notifier has to
> > perform a wider rbtree search to get to the range. The point is,
> > the
> > complexity is the same for both approaches so there is no point in
> > adding a svm notifier layer for that reason. The width of the
> > notifier
> > just adjusts the relative size of the two rbtree searches, so from
> > that
> > point of view the drm_gpusvm does not offer any benefit over
> > inserting
> > the ranges into the mmu_interval_notifier directly (except that the
> > mmu_interval_notifier is slightly more heavyweight).
> > 
> 
> I think a large part of it was to avoid inserting / removing many
> notifiers as that was expensive. Agree the search is not
> fundamentally
> faster the way I have this coded. It just avoids heavy inserting /
> removing of notifiers.

So I specifically asked Jason about the performance problem about using
many notifiers vs using a single one, and he responded that the problem
is slowing down the core mm on invalidations, if the RB tree gets too
large to walk. He also mentioned that we should consider core
invalidation performance before faulting performance because the latter
is so slow anyway that we must have the driver stack avoid gpu faults using
user-space prefetching and similar techniques.

In particular inserting and removing into the mmu_interval tree is not
costly in terms of locking but because of correctness requirements
insertion might block on ongoing invalidations.

So basically what I'm trying to say is that as long as we're using SVM
ranges in the way we do (I'm not saying that is wrong at this point,
and I agree that could be fine-tuned later), the benefit of an extra
notifier layer is questionable compared to directly inserting the
ranges into the mmu_interval_tree. Hence my questions: given those
considerations, why this additional layer?

Anyway, a more detailed review of the code perhaps helps clear this
up.

> 
> > As I understand it, Jasons comments were based on the assumption
> > that
> > the drm_gpusvm search would be radix tree based, and hence with
> > less
> > complexity than the rbtree search, and therefore providing a clear
> > benefit the larger they could be.
> > 
> > I.e. just calling something similar to xe_vm_invalidate_xxx over
> > the
> > whole range, which will just skip subranges that are not populated.
> > 
> 
> As stated, I think eventually removing the SVM range is a good
> longterm
> goal.
> 
> I almost coded that in this initial series but ran into a number of
> issues which make this complex. To get something working in the simplest
> way possible, and to enable further test development, constructive
> upstream discussions (which appear to be happening), UMD / application
> development, and other upper layer KMD development, I stuck with this
> approach.
> 
> I think for any solution which requires a SVM range (fwiw both AMD
> and
> Nvidia have a similar concept), attaching the ranges to a larger
> notifier makes sense and is better than 1 notifier per range.
> 
> Issues with removing a SVM range:
> 
> - Xe bind code stores invalidation / present state in VMA, this would
>   need to be moved to the radix tree. I have Jira open for that work
>   which I believe other developers are going to own.
> - Where would the dma mapping / device pages be stored?
> 	- In the radix tree? What if ATS is enabled? We don't have a
> 	  driver owned radix tree. How do we reasonably connect a
> driver
> 	  owned radix to a common GPUSVM layer?
> 	- In the notifier? What if the notifier is sparsely
> populated?
> 	  We would be wasting huge amounts of memory. What if the
> 	  notifier is configured to span the entire virtual address
> 	  space?
> - How does the garbage collector work? We can't allocate memory in
> the
>   notifier so we don't have anything to add to the garbage collector. We
>   can't directly modify page tables given you need a lock in the path
> of
>   reclaim.
> - How do we deal with fault storms (e.g. tons of faults hitting the
> same
>   SVM range in a row)? Without an SVM range there is no easy way to know
>   if a mapping is valid so the GPU page fault handler can't be
>   short-circuited.
> - Do we have notifier seqno for every PTE?
> 
> I feel like I'm missing a few and likely more issues would arise
> when
> implementing this too.
> 
> To be clear, I'm saying we shouldn't try to do this and all of the
> above
> issues are likely workable but doing all this upfront is akin to running
> before we can walk. I'd rather solve the fundamental locking issues
> first,
> have robust testing in place + passing and UMDs / apps running before
> trying to rework this one. Performance numbers for this would also be
> helpful too.






> 
> Matt
> 
> > /Thomas
> > 
> > > 
> > > > > + * - Ranges: Represent memory ranges mapped in a DRM device
> > > > > and
> > > > > managed
> > > > > + *	     by GPU SVM. They are sized based on an array of
> > > > > chunk
> > > > > sizes, which
> > > > > + *	     is a GPU SVM initialization parameter, and the
> > > > > CPU
> > > > > address space.
> > > > > + *	     Upon GPU fault, the largest aligned chunk that
> > > > > fits
> > > > > within the
> > > > > + *	     faulting CPU address space is chosen for the
> > > > > range
> > > > > size. Ranges are
> > > > > + *	     expected to be dynamically allocated on GPU
> > > > > fault
> > > > > and
> > > > > removed on an
> > > > > + *	     MMU notifier UNMAP event. As mentioned above,
> > > > > ranges
> > > > > are tracked in
> > > > > + *	     a notifier's Red-Black tree.
> > > > 
> > > > How do ranges and chunks map to
> > > >  
> > > > a) Prefaulting granularity
> > > > b) Migration granularity?
> > > > 
> > > > > + * - Operations: Define the interface for driver-specific
> > > > > SVM
> > > > > operations such as
> > > > > + *		 allocation, page collection, migration,
> > > > > invalidations, and VRAM
> > > > > + *		 release.
> > > > > + *
> > > > > + * This layer provides interfaces for allocating, mapping,
> > > > > migrating, and
> > > > > + * releasing memory ranges between the CPU and GPU. It
> > > > > handles
> > > > > all
> > > > > core memory
> > > > > + * management interactions (DMA mapping, HMM, and migration)
> > > > > and
> > > > > provides
> > > > > + * driver-specific virtual functions (vfuncs). This
> > > > > infrastructure
> > > > > is sufficient
> > > > > + * to build the expected driver components for an SVM
> > > > > implementation
> > > > > as detailed
> > > > > + * below.
> > > > > + *
> > > > > + * Expected Driver Components:
> > > > > + * - GPU page fault handler: Used to create ranges and
> > > > > notifiers
> > > > > based on the
> > > > > + *			     fault address, optionally
> > > > > migrate
> > > > > the
> > > > > range to
> > > > > + *			     VRAM, and create GPU bindings.
> > > > > + * - Garbage collector: Used to destroy GPU bindings for
> > > > > ranges.
> > > > > Ranges are
> > > > > + *			expected to be added to the garbage
> > > > > collector upon
> > > > > + *			MMU_NOTIFY_UNMAP event.
> > > > > + */
> > > > > +
> > > > > +/**
> > > > > + * DOC: Locking
> > > > > + *
> > > > > + * GPU SVM handles locking for core MM interactions, i.e.,
> > > > > it
> > > > > locks/unlocks the
> > > > > + * mmap lock as needed. Alternatively, if the driver prefers
> > > > > to
> > > > > handle the mmap
> > > > > + * lock itself, a 'locked' argument is provided to the
> > > > > functions
> > > > > that require
> > > > > + * the mmap lock. This option may be useful for drivers that
> > > > > need to
> > > > > call into
> > > > > + * GPU SVM while also holding a dma-resv lock, thus
> > > > > preventing
> > > > > locking
> > > > > + * inversions between the mmap and dma-resv locks.
> > > > > + *
> > > > > + * GPU SVM introduces a global notifier lock, which
> > > > > safeguards
> > > > > the
> > > > > notifier's
> > > > > + * range RB tree and list, as well as the range's DMA
> > > > > mappings
> > > > > and
> > > > > sequence
> > > > > + * number. GPU SVM manages all necessary locking and
> > > > > unlocking
> > > > > operations,
> > > > > + * except for the recheck of the range's sequence number
> > > > > + * (mmu_interval_read_retry) when the driver is committing
> > > > > GPU
> > > > > bindings. This
> > > > > + * lock corresponds to the 'driver->update' lock mentioned
> > > > > in
> > > > > the
> > > > > HMM
> > > > > + * documentation (TODO: Link). Future revisions may
> > > > > transition
> > > > > from
> > > > > a GPU SVM
> > > > > + * global lock to a per-notifier lock if finer-grained
> > > > > locking
> > > > > is
> > > > > deemed
> > > > > + * necessary.
> > > > > + *
> > > > > + * In addition to the locking mentioned above, the driver
> > > > > should
> > > > > implement a
> > > > > + * lock to safeguard core GPU SVM function calls that modify
> > > > > state,
> > > > > such as
> > > > > + * drm_gpusvm_range_find_or_insert and
> > > > > drm_gpusvm_range_remove.
> > > > > Alternatively,
> > > > > + * these core functions can be called within a single kernel
> > > > > thread,
> > > > > for
> > > > > + * instance, using an ordered work queue. This lock is
> > > > > denoted
> > > > > as
> > > > > + * 'driver_svm_lock' in code examples.
> > > > > + */
> > > > > +
> > > > > +/**
> > > > > + * DOC: Migration
> > > > > + *
> > > > > + * The migration support is quite simple, allowing migration
> > > > > between
> > > > > SRAM and
> > > > > + * VRAM at the range granularity. For example, GPU SVM
> > > > > currently
> > > > > does not
> > > > > + * support mixing SRAM and VRAM pages within a range. This
> > > > > means
> > > > > that upon GPU
> > > > > + * fault, the entire range can be migrated to VRAM, and upon
> > > > > CPU
> > > > > fault, the
> > > > > + * entire range is migrated to SRAM.
> > > > > + *
> > > > > + * The reasoning for only supporting range granularity is as
> > > > > follows: it
> > > > > + * simplifies the implementation, and range sizes are
> > > > > driver-
> > > > > defined
> > > > > and should
> > > > > + * be relatively small.
> > > > > + */
> > > > > +
> > > > > +/**
> > > > > + * DOC: Partial Unmapping of Ranges
> > > > > + *
> > > > > + * Partial unmapping of ranges (e.g., 1M out of 2M is
> > > > > unmapped
> > > > > by
> > > > > CPU resulting
> > > > > + * in MMU_NOTIFY_UNMAP event) presents several challenges,
> > > > > with
> > > > > the
> > > > > main one
> > > > > + * being that a subset of the range still has CPU and GPU
> > > > > mappings.
> > > > > If the
> > > > > + * backing store for the range is in VRAM, a subset of the
> > > > > backing
> > > > > store has
> > > > > + * references. One option would be to split the range and
> > > > > VRAM
> > > > > backing store,
> > > > > + * but the implementation for this would be quite
> > > > > complicated.
> > > > > Given
> > > > > that
> > > > > + * partial unmappings are rare and driver-defined range
> > > > > sizes
> > > > > are
> > > > > relatively
> > > > > + * small, GPU SVM does not support splitting of ranges.
> > > > > + *
> > > > > + * With no support for range splitting, upon partial
> > > > > unmapping
> > > > > of a
> > > > > range, the
> > > > > + * driver is expected to invalidate and destroy the entire
> > > > > range. If
> > > > > the range
> > > > > + * has VRAM as its backing, the driver is also expected to
> > > > > migrate
> > > > > any remaining
> > > > > + * pages back to SRAM.
> > > > 
> > > > So what happens if we get a one-page invalidation, say
> > > > protection
> > > > change event, or NUMA accounting event, in the middle of a
> > > > range?
> > > > Can
> > > > we unmap just that single gpu pte covering that range, that is,
> > > > how
> > > > do
> > > > the ranges map to invalidation granularity? Does this differ
> > > > between
> > > > igfx and dgfx?
> > > 
> > > Well the idea of chunks is ranges should be 1 GPU page (the chunk
> > > array
> > > in Xe is 4k, 64k, and 2M). The design is flexible enough that this
> > > doesn't
> > > have to be true but is optimized for the thinking that each range is most
> > > likely
> > > 1
> > > GPU page. If this isn't true, then all GPU pages in the range are
> > > invalidated, which isn't ideal but keeps it simple, which IMO far
> > > outweighs the potential benefits. In theory a driver could implement
> > > splitting / partial invalidations too with a couple of updates to
> > > GPUSVM
> > > but would likely largely be a driver implementation rather than
> > > GPUSVM.
> > > 
> > > No difference between igfx and dgfx.
> > > 
> > > You bring up a good point about protection changes, I likely
> > > haven't
> > > fully gotten that part of the implementation correct either. I can
> > > add
> > > this
> > > to my TODO list and also update my IGTs to do things like this.
> > > 
> > > Matt
> > > 
> > > > 
> > > > Thanks,
> > > > Thomas
> > > > 
> > > > 
> > > > 
> > > > 
> > > > > + */
> > > > > +
> > > > > +/**
> > > > > + * DOC: Examples
> > > > > + *
> > > > > + * This section provides two examples of how to build the
> > > > > expected
> > > > > driver
> > > > > + * components: the GPU page fault handler and the garbage
> > > > > collector.
> > > > > A third
> > > > > + * example demonstrates a sample invalidation driver vfunc.
> > > > > + *
> > > > > + * The generic code provided does not include logic for
> > > > > complex
> > > > > migration
> > > > > + * policies, optimized invalidations, or other potentially
> > > > > required
> > > > > driver
> > > > > + * locking (e.g., DMA-resv locks).
> > > > > + *
> > > > > + * 1) GPU page fault handler
> > > > > + *
> > > > > + *	int driver_bind_range(struct drm_gpusvm *gpusvm,
> > > > > struct
> > > > > drm_gpusvm_range *range)
> > > > > + *	{
> > > > > + *		int err = 0;
> > > > > + *
> > > > > +
> > > > > *		driver_alloc_and_setup_memory_for_bind(gpusvm,
> > > > > range);
> > > > > + *
> > > > > + *		drm_gpusvm_notifier_lock(gpusvm);
> > > > > + *		if (drm_gpusvm_range_pages_valid(range))
> > > > > + *			driver_commit_bind(gpusvm, range);
> > > > > + *		else
> > > > > + *			err = -EAGAIN;
> > > > > + *		drm_gpusvm_notifier_unlock(gpusvm);
> > > > > + *
> > > > > + *		return err;
> > > > > + *	}
> > > > > + *
> > > > > + *	int driver_gpu_fault(struct drm_gpusvm *gpusvm, u64
> > > > > fault_addr,
> > > > > + *			     u64 gpuva_start, u64 gpuva_end)
> > > > > + *	{
> > > > > + *		struct drm_gpusvm_ctx ctx = {};
> > > > > + *		int err;
> > > > > + *
> > > > > + *		driver_svm_lock();
> > > > > + *	retry:
> > > > > + *		// Always process UNMAPs first so view of
> > > > > GPU
> > > > > SVM
> > > > > ranges is current
> > > > > + *		driver_garbage_collector(gpusvm);
> > > > > + *
> > > > > + *		range =
> > > > > drm_gpusvm_range_find_or_insert(gpusvm,
> > > > > fault_addr,
> > > > > +
> > > > > *							gpuv
> > > > > a_start,
> > > > > gpuva_end,
> > > > > + *						       
> > > > > &ctx);
> > > > > + *		if (IS_ERR(range)) {
> > > > > + *			err = PTR_ERR(range);
> > > > > + *			goto unlock;
> > > > > + *		}
> > > > > + *
> > > > > + *		if (driver_migration_policy(range)) {
> > > > > + *			bo = driver_alloc_bo();
> > > > > + *			err =
> > > > > drm_gpusvm_migrate_to_vram(gpusvm,
> > > > > range, bo, &ctx);
> > > > > + *			if (err)	// CPU mappings may
> > > > > have
> > > > > changed
> > > > > + *				goto retry;
> > > > > + *		}
> > > > > + *
> > > > > + *		err = drm_gpusvm_range_get_pages(gpusvm,
> > > > > range,
> > > > > &ctx);
> > > > > + *		if (err == -EFAULT || err == -EPERM)	//
> > > > > CPU
> > > > > mappings changed
> > > > > + *			goto retry;
> > > > > + *		else if (err)
> > > > > + *			goto unlock;
> > > > > + *
> > > > > + *		err = driver_bind_range(gpusvm, range);
> > > > > + *		if (err == -EAGAIN)	// CPU mappings
> > > > > changed
> > > > > + *			goto retry
> > > > > + *
> > > > > + *	unlock:
> > > > > + *		driver_svm_unlock();
> > > > > + *		return err;
> > > > > + *	}
> > > > > + *
> > > > > + * 2) Garbage Collector.
> > > > > + *
> > > > > + *	void __driver_garbage_collector(struct drm_gpusvm
> > > > > *gpusvm,
> > > > > + *					struct
> > > > > drm_gpusvm_range
> > > > > *range)
> > > > > + *	{
> > > > > + *		struct drm_gpusvm_ctx ctx = {};
> > > > > + *
> > > > > + *		assert_driver_svm_locked(gpusvm);
> > > > > + *
> > > > > + *		// Partial unmap, migrate any remaining VRAM
> > > > > pages
> > > > > back to SRAM
> > > > > + *		if (range->flags.partial_unmap)
> > > > > + *			drm_gpusvm_migrate_to_sram(gpusvm,
> > > > > range,
> > > > > &ctx);
> > > > > + *
> > > > > + *		driver_unbind_range(range);
> > > > > + *		drm_gpusvm_range_remove(gpusvm, range);
> > > > > + *	}
> > > > > + *
> > > > > + *	void driver_garbage_collector(struct drm_gpusvm
> > > > > *gpusvm)
> > > > > + *	{
> > > > > + *		assert_driver_svm_locked(gpusvm);
> > > > > + *
> > > > > + *		for_each_range_in_garbage_collector(gpusvm,
> > > > > range)
> > > > > + *			__driver_garbage_collector(gpusvm,
> > > > > range);
> > > > > + *	}
> > > > > + *
> > > > > + * 3) Invalidation driver vfunc.
> > > > > + *
> > > > > + *	void driver_invalidation(struct drm_gpusvm *gpusvm,
> > > > > + *				 struct drm_gpusvm_notifier
> > > > > *notifier,
> > > > > + *				 const struct
> > > > > mmu_notifier_range
> > > > > *mmu_range)
> > > > > + *	{
> > > > > + *		struct drm_gpusvm_ctx ctx = { .in_notifier =
> > > > > true,
> > > > > };
> > > > > + *		struct drm_gpusvm_range *range = NULL;
> > > > > + *
> > > > > + *		driver_invalidate_device_tlb(gpusvm,
> > > > > mmu_range-
> > > > > > start, mmu_range->end);
> > > > > + *
> > > > > + *		drm_gpusvm_for_each_range(range, notifier,
> > > > > mmu_range->start,
> > > > > + *					  mmu_range->end) {
> > > > > + *			drm_gpusvm_range_unmap_pages(gpusvm,
> > > > > range,
> > > > > &ctx);
> > > > > + *
> > > > > + *			if (mmu_range->event !=
> > > > > MMU_NOTIFY_UNMAP)
> > > > > + *				continue;
> > > > > + *
> > > > > + *			drm_gpusvm_range_set_unmapped(range,
> > > > > mmu_range);
> > > > > + *			driver_garbage_collector_add(gpusvm,
> > > > > range);
> > > > > + *		}
> > > > > + *	}
> > > > > + */
> > > > > +
> > > > > +#define DRM_GPUSVM_RANGE_START(_range)	((_range)->va.start)
> > > > > +#define DRM_GPUSVM_RANGE_END(_range)	((_range)->va.end -
> > > > > 1)
> > > > > +INTERVAL_TREE_DEFINE(struct drm_gpusvm_range, rb.node, u64,
> > > > > rb.__subtree_last,
> > > > > +		     DRM_GPUSVM_RANGE_START,
> > > > > DRM_GPUSVM_RANGE_END,
> > > > > +		     static __maybe_unused, range);
> > > > > +
> > > > > +#define
> > > > > DRM_GPUSVM_NOTIFIER_START(_notifier)	((_notifier)-
> > > > > > interval.start)
> > > > > +#define
> > > > > DRM_GPUSVM_NOTIFIER_END(_notifier)	((_notifier)-
> > > > > > interval.end - 1)
> > > > > +INTERVAL_TREE_DEFINE(struct drm_gpusvm_notifier, rb.node,
> > > > > u64,
> > > > > +		     rb.__subtree_last,
> > > > > DRM_GPUSVM_NOTIFIER_START,
> > > > > +		     DRM_GPUSVM_NOTIFIER_END, static
> > > > > __maybe_unused,
> > > > > notifier);
> > > > > +
> > > > > +/**
> > > > > + * npages_in_range() - Calculate the number of pages in a
> > > > > given
> > > > > range
> > > > > + * @start__: The start address of the range
> > > > > + * @end__: The end address of the range
> > > > > + *
> > > > > + * This macro calculates the number of pages in a given
> > > > > memory
> > > > > range,
> > > > > + * specified by the start and end addresses. It divides the
> > > > > difference
> > > > > + * between the end and start addresses by the page size
> > > > > (PAGE_SIZE)
> > > > > to
> > > > > + * determine the number of pages in the range.
> > > > > + *
> > > > > + * Return: The number of pages in the specified range.
> > > > > + */
> > > > > +#define npages_in_range(start__, end__)	\
> > > > > +	(((end__) - (start__)) >> PAGE_SHIFT)
> > > > > +
> > > > > +/**
> > > > > + * struct drm_gpusvm_zdd - GPU SVM zone device data
> > > > > + *
> > > > > + * @refcount: Reference count for the zdd
> > > > > + * @destroy_work: Work structure for asynchronous zdd
> > > > > destruction
> > > > > + * @range: Pointer to the GPU SVM range
> > > > > + * @vram_allocation: Driver-private pointer to the VRAM
> > > > > allocation
> > > > > + *
> > > > > + * This structure serves as a generic wrapper installed in
> > > > > + * page->zone_device_data. It provides infrastructure for
> > > > > looking up
> > > > > a range
> > > > > + * upon CPU page fault and asynchronously releasing VRAM
> > > > > once
> > > > > the
> > > > > CPU has no
> > > > > + * page references. Asynchronous release is useful because
> > > > > CPU
> > > > > page
> > > > > references
> > > > > + * can be dropped in IRQ contexts, while releasing VRAM
> > > > > likely
> > > > > requires sleeping
> > > > > + * locks.
> > > > > + */
> > > > > +struct drm_gpusvm_zdd {
> > > > > +	struct kref refcount;
> > > > > +	struct work_struct destroy_work;
> > > > > +	struct drm_gpusvm_range *range;
> > > > > +	void *vram_allocation;
> > > > > +};
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_zdd_destroy_work_func - Work function for
> > > > > destroying a
> > > > > zdd
> > > > > + * @w: Pointer to the work_struct
> > > > > + *
> > > > > + * This function releases VRAM, puts GPU SVM range, and
> > > > > frees
> > > > > zdd.
> > > > > + */
> > > > > > +static void drm_gpusvm_zdd_destroy_work_func(struct work_struct *w)
> > > > > +{
> > > > > +	struct drm_gpusvm_zdd *zdd =
> > > > > +		container_of(w, struct drm_gpusvm_zdd,
> > > > > destroy_work);
> > > > > +	struct drm_gpusvm_range *range = zdd->range;
> > > > > +	struct drm_gpusvm *gpusvm = range->gpusvm;
> > > > > +
> > > > > +	if (gpusvm->ops->vram_release && zdd-
> > > > > >vram_allocation)
> > > > > +		gpusvm->ops->vram_release(zdd-
> > > > > >vram_allocation);
> > > > > +	drm_gpusvm_range_put(range);
> > > > > +	kfree(zdd);
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_zdd_alloc - Allocate a zdd structure.
> > > > > + * @range: Pointer to the GPU SVM range.
> > > > > + *
> > > > > + * This function allocates and initializes a new zdd
> > > > > structure.
> > > > > It
> > > > > sets up the
> > > > > + * reference count, initializes the destroy work, and links
> > > > > the
> > > > > provided GPU SVM
> > > > > + * range.
> > > > > + *
> > > > > + * Returns:
> > > > > + * Pointer to the allocated zdd on success, ERR_PTR() on
> > > > > failure.
> > > > > + */
> > > > > +static struct drm_gpusvm_zdd *
> > > > > +drm_gpusvm_zdd_alloc(struct drm_gpusvm_range *range)
> > > > > +{
> > > > > +	struct drm_gpusvm_zdd *zdd;
> > > > > +
> > > > > +	zdd = kmalloc(sizeof(*zdd), GFP_KERNEL);
> > > > > +	if (!zdd)
> > > > > +		return NULL;
> > > > > +
> > > > > +	kref_init(&zdd->refcount);
> > > > > +	INIT_WORK(&zdd->destroy_work,
> > > > > drm_gpusvm_zdd_destroy_work_func);
> > > > > +	zdd->range = drm_gpusvm_range_get(range);
> > > > > +	zdd->vram_allocation = NULL;
> > > > > +
> > > > > +	return zdd;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_zdd_get - Get a reference to a zdd structure.
> > > > > + * @zdd: Pointer to the zdd structure.
> > > > > + *
> > > > > + * This function increments the reference count of the
> > > > > provided
> > > > > zdd
> > > > > structure.
> > > > > + *
> > > > > + * Returns: Pointer to the zdd structure.
> > > > > + */
> > > > > +static struct drm_gpusvm_zdd *drm_gpusvm_zdd_get(struct
> > > > > drm_gpusvm_zdd *zdd)
> > > > > +{
> > > > > +	kref_get(&zdd->refcount);
> > > > > +	return zdd;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_zdd_destroy - Destroy a zdd structure.
> > > > > + * @ref: Pointer to the reference count structure.
> > > > > + *
> > > > > + * This function queues the destroy_work of the zdd for
> > > > > asynchronous
> > > > > destruction.
> > > > > + */
> > > > > +static void drm_gpusvm_zdd_destroy(struct kref *ref)
> > > > > +{
> > > > > +	struct drm_gpusvm_zdd *zdd =
> > > > > +		container_of(ref, struct drm_gpusvm_zdd,
> > > > > refcount);
> > > > > +	struct drm_gpusvm *gpusvm = zdd->range->gpusvm;
> > > > > +
> > > > > +	queue_work(gpusvm->zdd_wq, &zdd->destroy_work);
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_zdd_put - Put a zdd reference.
> > > > > + * @zdd: Pointer to the zdd structure.
> > > > > + *
> > > > > + * This function decrements the reference count of the
> > > > > provided
> > > > > zdd
> > > > > structure
> > > > > + * and schedules its destruction if the count drops to zero.
> > > > > + */
> > > > > +static void drm_gpusvm_zdd_put(struct drm_gpusvm_zdd *zdd)
> > > > > +{
> > > > > +	kref_put(&zdd->refcount, drm_gpusvm_zdd_destroy);
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_range_find - Find GPU SVM range from GPU SVM
> > > > > notifier
> > > > > + * @notifier: Pointer to the GPU SVM notifier structure.
> > > > > + * @start: Start address of the range
> > > > > + * @end: End address of the range
> > > > > + *
> > > > > + * Return: A pointer to the drm_gpusvm_range if found or
> > > > > NULL
> > > > > + */
> > > > > +struct drm_gpusvm_range *
> > > > > +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier,
> > > > > u64
> > > > > start, u64 end)
> > > > > +{
> > > > > +	return range_iter_first(&notifier->root, start, end
> > > > > -
> > > > > 1);
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_for_each_range_safe - Safely iterate over GPU
> > > > > SVM
> > > > > ranges in a notifier
> > > > > + * @range__: Iterator variable for the ranges
> > > > > + * @next__: Iterator variable for the ranges temporay
> > > > > storage
> > > > > + * @notifier__: Pointer to the GPU SVM notifier
> > > > > + * @start__: Start address of the range
> > > > > + * @end__: End address of the range
> > > > > + *
> > > > > + * This macro is used to iterate over GPU SVM ranges in a
> > > > > notifier
> > > > > while
> > > > > + * removing ranges from it.
> > > > > + */
> > > > > +#define drm_gpusvm_for_each_range_safe(range__, next__,
> > > > > notifier__,
> > > > > start__, end__)	\
> > > > > +	for ((range__) = drm_gpusvm_range_find((notifier__),
> > > > > (start__), (end__)),	\
> > > > > +	     (next__) =
> > > > > __drm_gpusvm_range_next(range__);			
> > > > > 	\
> > > > > +	     (range__) && (range__->va.start <
> > > > > (end__));				\
> > > > > +	     (range__) = (next__), (next__) =
> > > > > __drm_gpusvm_range_next(range__))
> > > > > +
> > > > > +/**
> > > > > + * __drm_gpusvm_notifier_next - get the next
> > > > > drm_gpusvm_notifier
> > > > > in
> > > > > the list
> > > > > + * @notifier: a pointer to the current drm_gpusvm_notifier
> > > > > + *
> > > > > + * Return: A pointer to the next drm_gpusvm_notifier if
> > > > > available,
> > > > > or NULL if
> > > > > + *         the current notifier is the last one or if the
> > > > > input
> > > > > notifier is
> > > > > + *         NULL.
> > > > > + */
> > > > > +static struct drm_gpusvm_notifier *
> > > > > +__drm_gpusvm_notifier_next(struct drm_gpusvm_notifier
> > > > > *notifier)
> > > > > +{
> > > > > +	if (notifier && !list_is_last(&notifier->rb.entry,
> > > > > +				      &notifier->gpusvm-
> > > > > > notifier_list))
> > > > > +		return list_next_entry(notifier, rb.entry);
> > > > > +
> > > > > +	return NULL;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_for_each_notifier - Iterate over GPU SVM
> > > > > notifiers
> > > > > in
> > > > > a gpusvm
> > > > > + * @notifier__: Iterator variable for the notifiers
> > > > > + * @notifier__: Pointer to the GPU SVM notifier
> > > > > + * @start__: Start address of the notifier
> > > > > + * @end__: End address of the notifier
> > > > > + *
> > > > > + * This macro is used to iterate over GPU SVM notifiers in a
> > > > > gpusvm.
> > > > > + */
> > > > > +#define drm_gpusvm_for_each_notifier(notifier__, gpusvm__,
> > > > > start__,
> > > > > end__)		\
> > > > > +	for ((notifier__) = notifier_iter_first(&(gpusvm__)-
> > > > > > root,
> > > > > (start__), (end__) - 1);	\
> > > > > +	     (notifier__) && (notifier__->interval.start <
> > > > > (end__));			\
> > > > > +	     (notifier__) =
> > > > > __drm_gpusvm_notifier_next(notifier__))
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_for_each_notifier_safe - Safely iterate over
> > > > > GPU
> > > > > SVM
> > > > > notifiers in a gpusvm
> > > > > + * @notifier__: Iterator variable for the notifiers
> > > > > + * @next__: Iterator variable for the notifiers temporay
> > > > > storage
> > > > > + * @notifier__: Pointer to the GPU SVM notifier
> > > > > + * @start__: Start address of the notifier
> > > > > + * @end__: End address of the notifier
> > > > > + *
> > > > > + * This macro is used to iterate over GPU SVM notifiers in a
> > > > > gpusvm
> > > > > while
> > > > > + * removing notifiers from it.
> > > > > + */
> > > > > +#define drm_gpusvm_for_each_notifier_safe(notifier__,
> > > > > next__,
> > > > > gpusvm__, start__, end__)	\
> > > > > +	for ((notifier__) = notifier_iter_first(&(gpusvm__)-
> > > > > > root,
> > > > > (start__), (end__) - 1),	\
> > > > > +	     (next__) =
> > > > > __drm_gpusvm_notifier_next(notifier__);			
> > > > > 	\
> > > > > +	     (notifier__) && (notifier__->interval.start <
> > > > > (end__));			\
> > > > > +	     (notifier__) = (next__), (next__) =
> > > > > __drm_gpusvm_notifier_next(notifier__))
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_notifier_invalidate - Invalidate a GPU SVM
> > > > > notifier.
> > > > > + * @mni: Pointer to the mmu_interval_notifier structure.
> > > > > + * @mmu_range: Pointer to the mmu_notifier_range structure.
> > > > > + * @cur_seq: Current sequence number.
> > > > > + *
> > > > > + * This function serves as a generic MMU notifier for GPU
> > > > > SVM.
> > > > > It
> > > > > sets the MMU
> > > > > + * notifier sequence number and calls the driver invalidate
> > > > > vfunc
> > > > > under
> > > > > + * gpusvm->notifier_lock.
> > > > > + *
> > > > > + * Returns:
> > > > > + * true if the operation succeeds, false otherwise.
> > > > > + */
> > > > > +static bool
> > > > > +drm_gpusvm_notifier_invalidate(struct mmu_interval_notifier
> > > > > *mni,
> > > > > +			       const struct
> > > > > mmu_notifier_range
> > > > > *mmu_range,
> > > > > +			       unsigned long cur_seq)
> > > > > +{
> > > > > +	struct drm_gpusvm_notifier *notifier =
> > > > > +		container_of(mni, typeof(*notifier),
> > > > > notifier);
> > > > > +	struct drm_gpusvm *gpusvm = notifier->gpusvm;
> > > > > +
> > > > > +	if (!mmu_notifier_range_blockable(mmu_range))
> > > > > +		return false;
> > > > > +
> > > > > +	down_write(&gpusvm->notifier_lock);
> > > > > +	mmu_interval_set_seq(mni, cur_seq);
> > > > > +	gpusvm->ops->invalidate(gpusvm, notifier,
> > > > > mmu_range);
> > > > > +	up_write(&gpusvm->notifier_lock);
> > > > > +
> > > > > +	return true;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_notifier_ops - MMU interval notifier
> > > > > operations
> > > > > for
> > > > > GPU SVM
> > > > > + */
> > > > > +static const struct mmu_interval_notifier_ops
> > > > > drm_gpusvm_notifier_ops = {
> > > > > +	.invalidate = drm_gpusvm_notifier_invalidate,
> > > > > +};
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_init - Initialize the GPU SVM.
> > > > > + * @gpusvm: Pointer to the GPU SVM structure.
> > > > > + * @name: Name of the GPU SVM.
> > > > > + * @drm: Pointer to the DRM device structure.
> > > > > + * @mm: Pointer to the mm_struct for the address space.
> > > > > + * @device_private_page_owner: Device private pages owner.
> > > > > + * @mm_start: Start address of GPU SVM.
> > > > > + * @mm_range: Range of the GPU SVM.
> > > > > + * @notifier_size: Size of individual notifiers.
> > > > > + * @ops: Pointer to the operations structure for GPU SVM.
> > > > > + * @chunk_sizes: Pointer to the array of chunk sizes used in
> > > > > range
> > > > > allocation.
> > > > > + *               Entries should be powers of 2 in descending
> > > > > order
> > > > > with last
> > > > > + *               entry being SZ_4K.
> > > > > + * @num_chunks: Number of chunks.
> > > > > + *
> > > > > + * This function initializes the GPU SVM.
> > > > > + *
> > > > > + * Returns:
> > > > > + * 0 on success, a negative error code on failure.
> > > > > + */
> > > > > +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> > > > > +		    const char *name, struct drm_device
> > > > > *drm,
> > > > > +		    struct mm_struct *mm, void
> > > > > *device_private_page_owner,
> > > > > +		    u64 mm_start, u64 mm_range, u64
> > > > > notifier_size,
> > > > > +		    const struct drm_gpusvm_ops *ops,
> > > > > +		    const u64 *chunk_sizes, int num_chunks)
> > > > > +{
> > > > > +	if (!ops->invalidate || !num_chunks)
> > > > > +		return -EINVAL;
> > > > > +
> > > > > +	gpusvm->name = name;
> > > > > +	gpusvm->drm = drm;
> > > > > +	gpusvm->mm = mm;
> > > > > +	gpusvm->device_private_page_owner =
> > > > > device_private_page_owner;
> > > > > +	gpusvm->mm_start = mm_start;
> > > > > +	gpusvm->mm_range = mm_range;
> > > > > +	gpusvm->notifier_size = notifier_size;
> > > > > +	gpusvm->ops = ops;
> > > > > +	gpusvm->chunk_sizes = chunk_sizes;
> > > > > +	gpusvm->num_chunks = num_chunks;
> > > > > +	gpusvm->zdd_wq = system_wq;
> > > > > +
> > > > > +	mmgrab(mm);
> > > > > +	gpusvm->root = RB_ROOT_CACHED;
> > > > > +	INIT_LIST_HEAD(&gpusvm->notifier_list);
> > > > > +
> > > > > +	init_rwsem(&gpusvm->notifier_lock);
> > > > > +
> > > > > +	fs_reclaim_acquire(GFP_KERNEL);
> > > > > +	might_lock(&gpusvm->notifier_lock);
> > > > > +	fs_reclaim_release(GFP_KERNEL);
> > > > > +
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_notifier_find - Find GPU SVM notifier
> > > > > + * @gpusvm__: Pointer to the GPU SVM structure
> > > > > + * @fault_addr__: Fault address
> > > > > + *
> > > > > + * This macro finds the GPU SVM notifier associated with the
> > > > > fault
> > > > > address.
> > > > > + *
> > > > > + * Returns:
> > > > > + * Pointer to the GPU SVM notifier on success, NULL
> > > > > otherwise.
> > > > > + */
> > > > > +#define drm_gpusvm_notifier_find(gpusvm__,
> > > > > fault_addr__)	\
> > > > > +	notifier_iter_first(&(gpusvm__)->root,
> > > > > (fault_addr__),	\
> > > > > +			    (fault_addr__ + 1))
> > > > > +
> > > > > +/**
> > > > > + * to_drm_gpusvm_notifier - retrieve the container struct
> > > > > for a
> > > > > given rbtree node
> > > > > + * @node__: a pointer to the rbtree node embedded within a
> > > > > drm_gpusvm_notifier struct
> > > > > + *
> > > > > + * Return: A pointer to the containing drm_gpusvm_notifier
> > > > > structure.
> > > > > + */
> > > > > +#define
> > > > > to_drm_gpusvm_notifier(__node)				\
> > > > > +	container_of((__node), struct drm_gpusvm_notifier,
> > > > > rb.node)
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_notifier_insert - Insert GPU SVM notifier
> > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > > > + *
> > > > > + * This function inserts the GPU SVM notifier into the GPU
> > > > > SVM
> > > > > RB
> > > > > tree and list.
> > > > > + */
> > > > > +static void drm_gpusvm_notifier_insert(struct drm_gpusvm
> > > > > *gpusvm,
> > > > > +				       struct
> > > > > drm_gpusvm_notifier
> > > > > *notifier)
> > > > > +{
> > > > > +	struct rb_node *node;
> > > > > +	struct list_head *head;
> > > > > +
> > > > > +	notifier_insert(notifier, &gpusvm->root);
> > > > > +
> > > > > +	node = rb_prev(&notifier->rb.node);
> > > > > +	if (node)
> > > > > +		head = &(to_drm_gpusvm_notifier(node))-
> > > > > > rb.entry;
> > > > > +	else
> > > > > +		head = &gpusvm->notifier_list;
> > > > > +
> > > > > +	list_add(&notifier->rb.entry, head);
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_notifier_remove - Remove GPU SVM notifier
> > > > > + * @gpusvm__: Pointer to the GPU SVM tructure
> > > > > + * @notifier__: Pointer to the GPU SVM notifier structure
> > > > > + *
> > > > > + * This macro removes the GPU SVM notifier from the GPU SVM
> > > > > RB
> > > > > tree
> > > > > and list.
> > > > > + */
> > > > > > +#define drm_gpusvm_notifier_remove(gpusvm__, notifier__)	\
> > > > > > +	notifier_remove((notifier__), &(gpusvm__)->root);	\
> > > > > > +	list_del(&(notifier__)->rb.entry)
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_fini - Finalize the GPU SVM.
> > > > > + * @gpusvm: Pointer to the GPU SVM structure.
> > > > > + *
> > > > > + * This function finalizes the GPU SVM by cleaning up any
> > > > > remaining
> > > > > ranges and
> > > > > + * notifiers, and dropping a reference to struct MM.
> > > > > + */
> > > > > +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm)
> > > > > +{
> > > > > +	struct drm_gpusvm_notifier *notifier, *next;
> > > > > +
> > > > > +	drm_gpusvm_for_each_notifier_safe(notifier, next,
> > > > > gpusvm, 0,
> > > > > LONG_MAX) {
> > > > > +		struct drm_gpusvm_range *range, *__next;
> > > > > +
> > > > > +		/*
> > > > > +		 * Remove notifier first to avoid racing
> > > > > with
> > > > > any
> > > > > invalidation
> > > > > +		 */
> > > > > +		mmu_interval_notifier_remove(&notifier-
> > > > > > notifier);
> > > > > +		notifier->flags.removed = true;
> > > > > +
> > > > > +		drm_gpusvm_for_each_range_safe(range,
> > > > > __next,
> > > > > notifier, 0,
> > > > > +					       LONG_MAX)
> > > > > +			drm_gpusvm_range_remove(gpusvm,
> > > > > range);
> > > > > +	}
> > > > > +
> > > > > +	mmdrop(gpusvm->mm);
> > > > > +	WARN_ON(!RB_EMPTY_ROOT(&gpusvm->root.rb_root));
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_notifier_alloc - Allocate GPU SVM notifier
> > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > + * @fault_addr: Fault address
> > > > > + *
> > > > > + * This function allocates and initializes the GPU SVM
> > > > > notifier
> > > > > structure.
> > > > > + *
> > > > > + * Returns:
> > > > > + * Pointer to the allocated GPU SVM notifier on success,
> > > > > ERR_PTR()
> > > > > on failure.
> > > > > + */
> > > > > +static struct drm_gpusvm_notifier *
> > > > > +drm_gpusvm_notifier_alloc(struct drm_gpusvm *gpusvm, u64
> > > > > fault_addr)
> > > > > +{
> > > > > +	struct drm_gpusvm_notifier *notifier;
> > > > > +
> > > > > +	if (gpusvm->ops->notifier_alloc)
> > > > > +		notifier = gpusvm->ops->notifier_alloc();
> > > > > +	else
> > > > > +		notifier = kzalloc(sizeof(*notifier),
> > > > > GFP_KERNEL);
> > > > > +
> > > > > +	if (!notifier)
> > > > > +		return ERR_PTR(-ENOMEM);
> > > > > +
> > > > > +	notifier->gpusvm = gpusvm;
> > > > > +	notifier->interval.start = ALIGN_DOWN(fault_addr,
> > > > > gpusvm-
> > > > > > notifier_size);
> > > > > +	notifier->interval.end = ALIGN(fault_addr + 1,
> > > > > gpusvm-
> > > > > > notifier_size);
> > > > > +	INIT_LIST_HEAD(&notifier->rb.entry);
> > > > > +	notifier->root = RB_ROOT_CACHED;
> > > > > +	INIT_LIST_HEAD(&notifier->range_list);
> > > > > +
> > > > > +	return notifier;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_notifier_free - Free GPU SVM notifier
> > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > > > + *
> > > > > + * This function frees the GPU SVM notifier structure.
> > > > > + */
> > > > > +static void drm_gpusvm_notifier_free(struct drm_gpusvm
> > > > > *gpusvm,
> > > > > +				     struct
> > > > > drm_gpusvm_notifier
> > > > > *notifier)
> > > > > +{
> > > > > +	WARN_ON(!RB_EMPTY_ROOT(&notifier->root.rb_root));
> > > > > +
> > > > > +	if (gpusvm->ops->notifier_free)
> > > > > +		gpusvm->ops->notifier_free(notifier);
> > > > > +	else
> > > > > +		kfree(notifier);
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * to_drm_gpusvm_range - retrieve the container struct for a
> > > > > given
> > > > > rbtree node
> > > > > + * @node__: a pointer to the rbtree node embedded within a
> > > > > drm_gpusvm_range struct
> > > > > + *
> > > > > + * Return: A pointer to the containing drm_gpusvm_range
> > > > > structure.
> > > > > + */
> > > > > +#define to_drm_gpusvm_range(node__)	\
> > > > > +	container_of((node__), struct drm_gpusvm_range,
> > > > > rb.node)
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_range_insert - Insert GPU SVM range
> > > > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > > > + * @range: Pointer to the GPU SVM range structure
> > > > > + *
> > > > > + * This function inserts the GPU SVM range into the notifier
> > > > > RB
> > > > > tree
> > > > > and list.
> > > > > + */
> > > > > +static void drm_gpusvm_range_insert(struct
> > > > > drm_gpusvm_notifier
> > > > > *notifier,
> > > > > +				    struct drm_gpusvm_range
> > > > > *range)
> > > > > +{
> > > > > +	struct rb_node *node;
> > > > > +	struct list_head *head;
> > > > > +
> > > > > +	drm_gpusvm_notifier_lock(notifier->gpusvm);
> > > > > +	range_insert(range, &notifier->root);
> > > > > +
> > > > > +	node = rb_prev(&range->rb.node);
> > > > > +	if (node)
> > > > > +		head = &(to_drm_gpusvm_range(node))-
> > > > > >rb.entry;
> > > > > +	else
> > > > > +		head = &notifier->range_list;
> > > > > +
> > > > > +	list_add(&range->rb.entry, head);
> > > > > +	drm_gpusvm_notifier_unlock(notifier->gpusvm);
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * __drm_gpusvm_range_remove - Remove GPU SVM range
> > > > > + * @notifier__: Pointer to the GPU SVM notifier structure
> > > > > + * @range__: Pointer to the GPU SVM range structure
> > > > > + *
> > > > > + * This macro removes the GPU SVM range from the notifier RB
> > > > > tree
> > > > > and list.
> > > > > + */
> > > > > > +#define __drm_gpusvm_range_remove(notifier__, range__)		\
> > > > > > +	range_remove((range__), &(notifier__)->root);		\
> > > > > > +	list_del(&(range__)->rb.entry)
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_range_alloc - Allocate GPU SVM range
> > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > > > + * @fault_addr: Fault address
> > > > > + * @chunk_size: Chunk size
> > > > > + * @migrate_vram: Flag indicating whether to migrate VRAM
> > > > > + *
> > > > > + * This function allocates and initializes the GPU SVM range
> > > > > structure.
> > > > > + *
> > > > > + * Returns:
> > > > > + * Pointer to the allocated GPU SVM range on success,
> > > > > ERR_PTR()
> > > > > on
> > > > > failure.
> > > > > + */
> > > > > +static struct drm_gpusvm_range *
> > > > > +drm_gpusvm_range_alloc(struct drm_gpusvm *gpusvm,
> > > > > +		       struct drm_gpusvm_notifier *notifier,
> > > > > +		       u64 fault_addr, u64 chunk_size, bool
> > > > > migrate_vram)
> > > > > +{
> > > > > +	struct drm_gpusvm_range *range;
> > > > > +
> > > > > +	if (gpusvm->ops->range_alloc)
> > > > > +		range = gpusvm->ops->range_alloc(gpusvm);
> > > > > +	else
> > > > > +		range = kzalloc(sizeof(*range), GFP_KERNEL);
> > > > > +
> > > > > +	if (!range)
> > > > > +		return ERR_PTR(-ENOMEM);
> > > > > +
> > > > > +	kref_init(&range->refcount);
> > > > > +	range->gpusvm = gpusvm;
> > > > > +	range->notifier = notifier;
> > > > > +	range->va.start = ALIGN_DOWN(fault_addr,
> > > > > chunk_size);
> > > > > +	range->va.end = ALIGN(fault_addr + 1, chunk_size);
> > > > > +	INIT_LIST_HEAD(&range->rb.entry);
> > > > > +	range->notifier_seq = LONG_MAX;
> > > > > +	range->flags.migrate_vram = migrate_vram ? 1 : 0;
> > > > > +
> > > > > +	return range;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_check_pages - Check pages
> > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > > > + * @start: Start address
> > > > > + * @end: End address
> > > > > + *
> > > > > + * Check if pages between start and end have been faulted in
> > > > > on
> > > > > the
> > > > > CPU. Use to
> > > > > + * prevent migration of pages without CPU backing store.
> > > > > + *
> > > > > + * Returns:
> > > > > + * True if pages have been faulted into CPU, False otherwise
> > > > > + */
> > > > > +static bool drm_gpusvm_check_pages(struct drm_gpusvm
> > > > > *gpusvm,
> > > > > +				   struct
> > > > > drm_gpusvm_notifier
> > > > > *notifier,
> > > > > +				   u64 start, u64 end)
> > > > > +{
> > > > > +	struct hmm_range hmm_range = {
> > > > > +		.default_flags = 0,
> > > > > +		.notifier = &notifier->notifier,
> > > > > +		.start = start,
> > > > > +		.end = end,
> > > > > +		.dev_private_owner = gpusvm-
> > > > > > device_private_page_owner,
> > > > > +	};
> > > > > +	unsigned long timeout =
> > > > > +		jiffies +
> > > > > msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > > > > +	unsigned long *pfns;
> > > > > +	unsigned long npages = npages_in_range(start, end);
> > > > > +	int err, i;
> > > > > +
> > > > > +	mmap_assert_locked(gpusvm->mm);
> > > > > +
> > > > > +	pfns = kvmalloc_array(npages, sizeof(*pfns),
> > > > > GFP_KERNEL);
> > > > > +	if (!pfns)
> > > > > +		return false;
> > > > > +
> > > > > +	hmm_range.notifier_seq =
> > > > > mmu_interval_read_begin(&notifier-
> > > > > > notifier);
> > > > > +	hmm_range.hmm_pfns = pfns;
> > > > > +
> > > > > +	while (true) {
> > > > > +		err = hmm_range_fault(&hmm_range);
> > > > > +		if (err == -EBUSY) {
> > > > > +			if (time_after(jiffies, timeout))
> > > > > +				break;
> > > > > +
> > > > > +			hmm_range.notifier_seq =
> > > > > mmu_interval_read_begin(&notifier->notifier);
> > > > > +			continue;
> > > > > +		}
> > > > > +		break;
> > > > > +	}
> > > > > +	if (err)
> > > > > +		goto err_free;
> > > > > +
> > > > > +	for (i = 0; i < npages; ++i) {
> > > > > +		if (!(pfns[i] & HMM_PFN_VALID)) {
> > > > > +			err = -EFAULT;
> > > > > +			goto err_free;
> > > > > +		}
> > > > > +	}
> > > > > +
> > > > > +err_free:
> > > > > +	kvfree(pfns);
> > > > > +	return err ? false : true;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_range_chunk_size - Determine chunk size for
> > > > > GPU
> > > > > SVM
> > > > > range
> > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > > > + * @vas: Pointer to the virtual memory area structure
> > > > > + * @fault_addr: Fault address
> > > > > + * @gpuva_start: Start address of GPUVA which mirrors CPU
> > > > > + * @gpuva_end: End address of GPUVA which mirrors CPU
> > > > > + * @check_pages: Flag indicating whether to check pages
> > > > > + *
> > > > > + * This function determines the chunk size for the GPU SVM
> > > > > range
> > > > > based on the
> > > > > + * fault address, GPU SVM chunk sizes, existing GPU SVM
> > > > > ranges,
> > > > > and
> > > > > the virtual
> > > > > + * memory area boundaries.
> > > > > + *
> > > > > + * Returns:
> > > > > + * Chunk size on success, LONG_MAX on failure.
> > > > > + */
> > > > > +static u64 drm_gpusvm_range_chunk_size(struct drm_gpusvm
> > > > > *gpusvm,
> > > > > +				       struct
> > > > > drm_gpusvm_notifier
> > > > > *notifier,
> > > > > +				       struct vm_area_struct
> > > > > *vas,
> > > > > +				       u64 fault_addr, u64
> > > > > gpuva_start,
> > > > > +				       u64 gpuva_end, bool
> > > > > check_pages)
> > > > > +{
> > > > > +	u64 start, end;
> > > > > +	int i = 0;
> > > > > +
> > > > > +retry:
> > > > > +	for (; i < gpusvm->num_chunks; ++i) {
> > > > > +		start = ALIGN_DOWN(fault_addr, gpusvm-
> > > > > > chunk_sizes[i]);
> > > > > +		end = ALIGN(fault_addr + 1, gpusvm-
> > > > > > chunk_sizes[i]);
> > > > > +
> > > > > +		if (start >= vas->vm_start && end <= vas-
> > > > > >vm_end
> > > > > &&
> > > > > +		    start >= notifier->interval.start &&
> > > > > +		    end <= notifier->interval.end &&
> > > > > +		    start >= gpuva_start && end <=
> > > > > gpuva_end)
> > > > > +			break;
> > > > > +	}
> > > > > +
> > > > > +	if (i == gpusvm->num_chunks)
> > > > > +		return LONG_MAX;
> > > > > +
> > > > > +	/*
> > > > > +	 * If allocation more than page, ensure not to
> > > > > overlap
> > > > > with
> > > > > existing
> > > > > +	 * ranges.
> > > > > +	 */
> > > > > +	if (end - start != SZ_4K) {
> > > > > +		struct drm_gpusvm_range *range;
> > > > > +
> > > > > +		range = drm_gpusvm_range_find(notifier,
> > > > > start,
> > > > > end);
> > > > > +		if (range) {
> > > > > +			++i;
> > > > > +			goto retry;
> > > > > +		}
> > > > > +
> > > > > +		/*
> > > > > +		 * XXX: Only create range on pages CPU has
> > > > > faulted
> > > > > in. Without
> > > > > +		 * this check, or prefault, on BMG
> > > > > 'xe_exec_system_allocator --r
> > > > > +		 * process-many-malloc' fails. In the
> > > > > failure
> > > > > case,
> > > > > each process
> > > > > +		 * mallocs 16k but the CPU VMA is ~128k
> > > > > which
> > > > > results in 64k SVM
> > > > > +		 * ranges. When migrating the SVM ranges,
> > > > > some
> > > > > processes fail in
> > > > > +		 * drm_gpusvm_migrate_to_vram with
> > > > > 'migrate.cpages
> > > > > != npages'
> > > > > +		 * and then upon drm_gpusvm_range_get_pages
> > > > > device
> > > > > pages from
> > > > > +		 * other processes are collected + faulted
> > > > > in
> > > > > which
> > > > > creates all
> > > > > +		 * sorts of problems. Unsure exactly how
> > > > > this
> > > > > happening, also
> > > > > +		 * problem goes away if
> > > > > 'xe_exec_system_allocator --
> > > > > r
> > > > > +		 * process-many-malloc' mallocs at least 64k
> > > > > at
> > > > > a
> > > > > time.
> > > > > +		 */
> > > > > +		if (check_pages &&
> > > > > +		    !drm_gpusvm_check_pages(gpusvm,
> > > > > notifier,
> > > > > start,
> > > > > end)) {
> > > > > +			++i;
> > > > > +			goto retry;
> > > > > +		}
> > > > > +	}
> > > > > +
> > > > > +	return end - start;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_range_find_or_insert - Find or insert GPU SVM
> > > > > range
> > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > + * @fault_addr: Fault address
> > > > > + * @gpuva_start: Start address of GPUVA which mirrors CPU
> > > > > + * @gpuva_end: End address of GPUVA which mirrors CPU
> > > > > + * @ctx: GPU SVM context
> > > > > + *
> > > > > + * This function finds or inserts a newly allocated a GPU
> > > > > SVM
> > > > > range
> > > > > based on the
> > > > > + * fault address. Caller must hold a lock to protect range
> > > > > lookup
> > > > > and insertion.
> > > > > + *
> > > > > + * Returns:
> > > > > + * Pointer to the GPU SVM range on success, ERR_PTR() on
> > > > > failure.
> > > > > + */
> > > > > +struct drm_gpusvm_range *
> > > > > +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm,
> > > > > u64
> > > > > fault_addr,
> > > > > +				u64 gpuva_start, u64
> > > > > gpuva_end,
> > > > > +				const struct drm_gpusvm_ctx
> > > > > *ctx)
> > > > > +{
> > > > > +	struct drm_gpusvm_notifier *notifier;
> > > > > +	struct drm_gpusvm_range *range;
> > > > > +	struct mm_struct *mm = gpusvm->mm;
> > > > > +	struct vm_area_struct *vas;
> > > > > +	bool notifier_alloc = false;
> > > > > +	u64 chunk_size;
> > > > > +	int err;
> > > > > +	bool migrate_vram;
> > > > > +
> > > > > +	if (fault_addr < gpusvm->mm_start ||
> > > > > +	    fault_addr > gpusvm->mm_start + gpusvm-
> > > > > >mm_range) {
> > > > > +		err = -EINVAL;
> > > > > +		goto err_out;
> > > > > +	}
> > > > > +
> > > > > +	if (!ctx->mmap_locked) {
> > > > > +		if (!mmget_not_zero(mm)) {
> > > > > +			err = -EFAULT;
> > > > > +			goto err_out;
> > > > > +		}
> > > > > +		mmap_write_lock(mm);
> > > > > +	}
> > > > > +
> > > > > +	mmap_assert_write_locked(mm);
> > > > > +
> > > > > +	notifier = drm_gpusvm_notifier_find(gpusvm,
> > > > > fault_addr);
> > > > > +	if (!notifier) {
> > > > > +		notifier = drm_gpusvm_notifier_alloc(gpusvm,
> > > > > fault_addr);
> > > > > +		if (IS_ERR(notifier)) {
> > > > > +			err = PTR_ERR(notifier);
> > > > > +			goto err_mmunlock;
> > > > > +		}
> > > > > +		notifier_alloc = true;
> > > > > +		err =
> > > > > mmu_interval_notifier_insert_locked(&notifier-
> > > > > > notifier,
> > > > > +							 
> > > > > mm,
> > > > > notifier->interval.start,
> > > > > +							 
> > > > > notifier-
> > > > > > interval.end -
> > > > > +							 
> > > > > notifier-
> > > > > > interval.start,
> > > > > +							 
> > > > > &drm_gpusvm_notifier_ops);
> > > > > +		if (err)
> > > > > +			goto err_notifier;
> > > > > +	}
> > > > > +
> > > > > +	vas = vma_lookup(mm, fault_addr);
> > > > > +	if (!vas) {
> > > > > +		err = -ENOENT;
> > > > > +		goto err_notifier_remove;
> > > > > +	}
> > > > > +
> > > > > +	if (!ctx->read_only && !(vas->vm_flags & VM_WRITE))
> > > > > {
> > > > > +		err = -EPERM;
> > > > > +		goto err_notifier_remove;
> > > > > +	}
> > > > > +
> > > > > +	range = drm_gpusvm_range_find(notifier, fault_addr,
> > > > > fault_addr + 1);
> > > > > +	if (range)
> > > > > +		goto out_mmunlock;
> > > > > +	/*
> > > > > +	 * XXX: Short-circuiting migration based on
> > > > > migrate_vma_*
> > > > > current
> > > > > +	 * limitations. If/when migrate_vma_* add more
> > > > > support,
> > > > > this
> > > > > logic will
> > > > > +	 * have to change.
> > > > > +	 */
> > > > > +	migrate_vram = ctx->vram_possible &&
> > > > > +		vma_is_anonymous(vas) &&
> > > > > !is_vm_hugetlb_page(vas);
> > > > > +
> > > > > +	chunk_size = drm_gpusvm_range_chunk_size(gpusvm,
> > > > > notifier,
> > > > > vas,
> > > > > +						 fault_addr,
> > > > > gpuva_start,
> > > > > +						 gpuva_end,
> > > > > migrate_vram &&
> > > > > +						 !ctx-
> > > > > > prefault);
> > > > > +	if (chunk_size == LONG_MAX) {
> > > > > +		err = -EINVAL;
> > > > > +		goto err_notifier_remove;
> > > > > +	}
> > > > > +
> > > > > +	range = drm_gpusvm_range_alloc(gpusvm, notifier,
> > > > > fault_addr,
> > > > > chunk_size,
> > > > > +				       migrate_vram);
> > > > > +	if (IS_ERR(range)) {
> > > > > +		err = PTR_ERR(range);
> > > > > +		goto err_notifier_remove;
> > > > > +	}
> > > > > +
> > > > > +	drm_gpusvm_range_insert(notifier, range);
> > > > > +	if (notifier_alloc)
> > > > > +		drm_gpusvm_notifier_insert(gpusvm,
> > > > > notifier);
> > > > > +
> > > > > +	if (ctx->prefault) {
> > > > > +		struct drm_gpusvm_ctx __ctx = *ctx;
> > > > > +
> > > > > +		__ctx.mmap_locked = true;
> > > > > +		err = drm_gpusvm_range_get_pages(gpusvm,
> > > > > range,
> > > > > &__ctx);
> > > > > +		if (err)
> > > > > +			goto err_range_remove;
> > > > > +	}
> > > > > +
> > > > > +out_mmunlock:
> > > > > +	if (!ctx->mmap_locked) {
> > > > > +		mmap_write_unlock(mm);
> > > > > +		mmput(mm);
> > > > > +	}
> > > > > +
> > > > > +	return range;
> > > > > +
> > > > > +err_range_remove:
> > > > > +	__drm_gpusvm_range_remove(notifier, range);
> > > > > +err_notifier_remove:
> > > > > +	if (notifier_alloc)
> > > > > +		mmu_interval_notifier_remove(&notifier-
> > > > > > notifier);
> > > > > +err_notifier:
> > > > > +	if (notifier_alloc)
> > > > > +		drm_gpusvm_notifier_free(gpusvm, notifier);
> > > > > +err_mmunlock:
> > > > > +	if (!ctx->mmap_locked) {
> > > > > +		mmap_write_unlock(mm);
> > > > > +		mmput(mm);
> > > > > +	}
> > > > > +err_out:
> > > > > +	return ERR_PTR(err);
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * for_each_dma_page - iterate over pages in a DMA regio`n
> > > > > + * @i__: the current page index in the iteration
> > > > > + * @j__: the current page index, log order, in the iteration
> > > > > + * @npages__: the total number of pages in the DMA region
> > > > > + * @order__: the order of the pages in the DMA region
> > > > > + *
> > > > > + * This macro iterates over each page in a DMA region. The
> > > > > DMA
> > > > > region
> > > > > + * is assumed to be composed of 2^@order__ pages, and the
> > > > > macro
> > > > > will
> > > > > + * step through the region one block of 2^@order__ pages at
> > > > > a
> > > > > time.
> > > > > + */
> > > > > +#define for_each_dma_page(i__, j__, npages__, order__)	\
> > > > > +	for ((i__) = 0, (j__) = 0; (i__) < (npages__);	\
> > > > > +	     (j__)++, (i__) += 0x1 << (order__))
> > > > > +
> > > > > +/**
> > > > > + * __drm_gpusvm_range_unmap_pages - Unmap pages associated
> > > > > with
> > > > > a
> > > > > GPU SVM range (internal)
> > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > + * @range: Pointer to the GPU SVM range structure
> > > > > + *
> > > > > + * This function unmap pages associated with a GPU SVM
> > > > > range.
> > > > > Assumes and
> > > > > + * asserts correct locking is in place when called.
> > > > > + */
> > > > > +static void __drm_gpusvm_range_unmap_pages(struct drm_gpusvm
> > > > > *gpusvm,
> > > > > +					   struct
> > > > > drm_gpusvm_range
> > > > > *range)
> > > > > +{
> > > > > +	lockdep_assert_held(&gpusvm->notifier_lock);
> > > > > +
> > > > > +	if (range->pages) {
> > > > > +		unsigned long i, j, npages =
> > > > > npages_in_range(range-
> > > > > > va.start,
> > > > > +							    
> > > > > range-
> > > > > > va.end);
> > > > > +
> > > > > +		if (range->flags.has_dma_mapping) {
> > > > > +			for_each_dma_page(i, j, npages,
> > > > > range-
> > > > > > order)
> > > > > +				dma_unmap_page(gpusvm->drm-
> > > > > >dev,
> > > > > +					       range-
> > > > > > dma_addr[j],
> > > > > +					       PAGE_SIZE <<
> > > > > range-
> > > > > > order,
> > > > > +					      
> > > > > DMA_BIDIRECTIONAL);
> > > > > +		}
> > > > > +
> > > > > +		range->flags.has_vram_pages = false;
> > > > > +		range->flags.has_dma_mapping = false;
> > > > > +	}
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_range_free_pages - Free pages associated with
> > > > > a
> > > > > GPU
> > > > > SVM range
> > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > + * @range: Pointer to the GPU SVM range structure
> > > > > + *
> > > > > + * This function free pages associated with a GPU SVM range.
> > > > > + */
> > > > > +static void drm_gpusvm_range_free_pages(struct drm_gpusvm
> > > > > *gpusvm,
> > > > > +					struct
> > > > > drm_gpusvm_range
> > > > > *range)
> > > > > +{
> > > > > +	lockdep_assert_held(&gpusvm->notifier_lock);
> > > > > +
> > > > > +	if (range->pages) {
> > > > > +		if (range->flags.kfree_mapping) {
> > > > > +			kfree(range->dma_addr);
> > > > > +			range->flags.kfree_mapping = false;
> > > > > +			range->pages = NULL;
> > > > > +		} else {
> > > > > +			kvfree(range->pages);
> > > > > +			range->pages = NULL;
> > > > > +		}
> > > > > +	}
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_range_remove - Remove GPU SVM range
> > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > + * @range: Pointer to the GPU SVM range to be removed
> > > > > + *
> > > > > + * This function removes the specified GPU SVM range and
> > > > > also
> > > > > removes the parent
> > > > > + * GPU SVM notifier if no more ranges remain in the
> > > > > notifier.
> > > > > The
> > > > > caller must
> > > > > + * hold a lock to protect range and notifier removal.
> > > > > + */
> > > > > +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> > > > > +			     struct drm_gpusvm_range *range)
> > > > > +{
> > > > > +	struct drm_gpusvm_notifier *notifier;
> > > > > +
> > > > > +	notifier = drm_gpusvm_notifier_find(gpusvm, range-
> > > > > > va.start);
> > > > > +	if (WARN_ON_ONCE(!notifier))
> > > > > +		return;
> > > > > +
> > > > > +	drm_gpusvm_notifier_lock(gpusvm);
> > > > > +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > > > > +	drm_gpusvm_range_free_pages(gpusvm, range);
> > > > > +	__drm_gpusvm_range_remove(notifier, range);
> > > > > +	drm_gpusvm_notifier_unlock(gpusvm);
> > > > > +
> > > > > +	drm_gpusvm_range_put(range);
> > > > > +
> > > > > +	if (RB_EMPTY_ROOT(&notifier->root.rb_root)) {
> > > > > +		if (!notifier->flags.removed)
> > > > > +			mmu_interval_notifier_remove(&notifi
> > > > > er-
> > > > > > notifier);
> > > > > +		drm_gpusvm_notifier_remove(gpusvm,
> > > > > notifier);
> > > > > +		drm_gpusvm_notifier_free(gpusvm, notifier);
> > > > > +	}
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_range_get - Get a reference to GPU SVM range
> > > > > + * @range: Pointer to the GPU SVM range
> > > > > + *
> > > > > + * This function increments the reference count of the
> > > > > specified
> > > > > GPU
> > > > > SVM range.
> > > > > + *
> > > > > + * Returns:
> > > > > + * Pointer to the GPU SVM range.
> > > > > + */
> > > > > +struct drm_gpusvm_range *
> > > > > +drm_gpusvm_range_get(struct drm_gpusvm_range *range)
> > > > > +{
> > > > > +	kref_get(&range->refcount);
> > > > > +
> > > > > +	return range;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_range_destroy - Destroy GPU SVM range
> > > > > + * @refcount: Pointer to the reference counter embedded in
> > > > > the
> > > > > GPU
> > > > > SVM range
> > > > > + *
> > > > > + * This function destroys the specified GPU SVM range when
> > > > > its
> > > > > reference count
> > > > > + * reaches zero. If a custom range-free function is
> > > > > provided, it
> > > > > is
> > > > > invoked to
> > > > > + * free the range; otherwise, the range is deallocated using
> > > > > kfree().
> > > > > + */
> > > > > +static void drm_gpusvm_range_destroy(struct kref *refcount)
> > > > > +{
> > > > > +	struct drm_gpusvm_range *range =
> > > > > +		container_of(refcount, struct
> > > > > drm_gpusvm_range,
> > > > > refcount);
> > > > > +	struct drm_gpusvm *gpusvm = range->gpusvm;
> > > > > +
> > > > > +	if (gpusvm->ops->range_free)
> > > > > +		gpusvm->ops->range_free(range);
> > > > > +	else
> > > > > +		kfree(range);
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_range_put - Put a reference to GPU SVM range
> > > > > + * @range: Pointer to the GPU SVM range
> > > > > + *
> > > > > + * This function decrements the reference count of the
> > > > > specified
> > > > > GPU
> > > > > SVM range
> > > > > + * and frees it when the count reaches zero.
> > > > > + */
> > > > > +void drm_gpusvm_range_put(struct drm_gpusvm_range *range)
> > > > > +{
> > > > > +	kref_put(&range->refcount,
> > > > > drm_gpusvm_range_destroy);
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_range_pages_valid - GPU SVM range pages valid
> > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > + * @range: Pointer to the GPU SVM range structure
> > > > > + *
> > > > > + * This function determines if a GPU SVM range pages are
> > > > > valid.
> > > > > Expected be
> > > > > + * called holding gpusvm->notifier_lock and as the last step
> > > > > before
> > > > > commiting a
> > > > > + * GPU binding.
> > > > > + *
> > > > > + * Returns:
> > > > > + * True if GPU SVM range has valid pages, False otherwise
> > > > > + */
> > > > > +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> > > > > +				  struct drm_gpusvm_range
> > > > > *range)
> > > > > +{
> > > > > +	lockdep_assert_held(&gpusvm->notifier_lock);
> > > > > +
> > > > > +	return range->flags.has_vram_pages || range-
> > > > > > flags.has_dma_mapping;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_range_pages_valid_unlocked - GPU SVM range
> > > > > pages
> > > > > valid
> > > > > unlocked
> > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > + * @range: Pointer to the GPU SVM range structure
> > > > > + *
> > > > > + * This function determines if a GPU SVM range pages are
> > > > > valid.
> > > > > Expected be
> > > > > + * called without holding gpusvm->notifier_lock.
> > > > > + *
> > > > > + * Returns:
> > > > > + * True if GPU SVM range has valid pages, False otherwise
> > > > > + */
> > > > > +static bool
> > > > > +drm_gpusvm_range_pages_valid_unlocked(struct drm_gpusvm
> > > > > *gpusvm,
> > > > > +				      struct
> > > > > drm_gpusvm_range
> > > > > *range)
> > > > > +{
> > > > > +	bool pages_valid;
> > > > > +
> > > > > +	if (!range->pages)
> > > > > +		return false;
> > > > > +
> > > > > +	drm_gpusvm_notifier_lock(gpusvm);
> > > > > +	pages_valid = drm_gpusvm_range_pages_valid(gpusvm,
> > > > > range);
> > > > > +	if (!pages_valid && range->flags.kfree_mapping) {
> > > > > +		kfree(range->dma_addr);
> > > > > +		range->flags.kfree_mapping = false;
> > > > > +		range->pages = NULL;
> > > > > +	}
> > > > > +	drm_gpusvm_notifier_unlock(gpusvm);
> > > > > +
> > > > > +	return pages_valid;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_range_get_pages - Get pages for a GPU SVM
> > > > > range
> > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > + * @range: Pointer to the GPU SVM range structure
> > > > > + * @ctx: GPU SVM context
> > > > > + *
> > > > > + * This function gets pages for a GPU SVM range and ensures
> > > > > they
> > > > > are
> > > > > mapped for
> > > > > + * DMA access.
> > > > > + *
> > > > > + * Returns:
> > > > > + * 0 on success, negative error code on failure.
> > > > > + */
> > > > > +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> > > > > +			       struct drm_gpusvm_range
> > > > > *range,
> > > > > +			       const struct drm_gpusvm_ctx
> > > > > *ctx)
> > > > > +{
> > > > > +	struct mmu_interval_notifier *notifier = &range-
> > > > > > notifier-
> > > > > > notifier;
> > > > > +	struct hmm_range hmm_range = {
> > > > > +		.default_flags = HMM_PFN_REQ_FAULT | (ctx-
> > > > > > read_only
> > > > > ? 0 :
> > > > > +			HMM_PFN_REQ_WRITE),
> > > > > +		.notifier = notifier,
> > > > > +		.start = range->va.start,
> > > > > +		.end = range->va.end,
> > > > > +		.dev_private_owner = gpusvm-
> > > > > > device_private_page_owner,
> > > > > +	};
> > > > > +	struct mm_struct *mm = gpusvm->mm;
> > > > > +	unsigned long timeout =
> > > > > +		jiffies +
> > > > > msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > > > > +	unsigned long i, j;
> > > > > +	unsigned long npages = npages_in_range(range-
> > > > > >va.start,
> > > > > range->va.end);
> > > > > +	unsigned int order = 0;
> > > > > +	unsigned long *pfns;
> > > > > +	struct page **pages;
> > > > > +	int err = 0;
> > > > > +	bool vram_pages = !!range->flags.migrate_vram;
> > > > > +	bool alloc_pfns = false, kfree_mapping;
> > > > > +
> > > > > +retry:
> > > > > +	kfree_mapping = false;
> > > > > +	hmm_range.notifier_seq =
> > > > > mmu_interval_read_begin(notifier);
> > > > > +	if (drm_gpusvm_range_pages_valid_unlocked(gpusvm,
> > > > > range))
> > > > > +		return 0;
> > > > > +
> > > > > +	if (range->notifier_seq == hmm_range.notifier_seq &&
> > > > > range-
> > > > > > pages) {
> > > > > +		if (ctx->prefault)
> > > > > +			return 0;
> > > > > +
> > > > > +		pfns = (unsigned long *)range->pages;
> > > > > +		pages = range->pages;
> > > > > +		goto map_pages;
> > > > > +	}
> > > > > +
> > > > > +	if (!range->pages) {
> > > > > +		pfns = kvmalloc_array(npages, sizeof(*pfns),
> > > > > GFP_KERNEL);
> > > > > +		if (!pfns)
> > > > > +			return -ENOMEM;
> > > > > +		alloc_pfns = true;
> > > > > +	} else {
> > > > > +		pfns = (unsigned long *)range->pages;
> > > > > +	}
> > > > > +
> > > > > +	if (!ctx->mmap_locked) {
> > > > > +		if (!mmget_not_zero(mm)) {
> > > > > +			err = -EFAULT;
> > > > > +			goto err_out;
> > > > > +		}
> > > > > +	}
> > > > > +
> > > > > +	hmm_range.hmm_pfns = pfns;
> > > > > +	while (true) {
> > > > > +		/* Must be checked after
> > > > > mmu_interval_read_begin
> > > > > */
> > > > > +		if (range->flags.unmapped) {
> > > > > +			err = -EFAULT;
> > > > > +			break;
> > > > > +		}
> > > > > +
> > > > > +		if (!ctx->mmap_locked) {
> > > > > +			/*
> > > > > +			 * XXX: HMM locking document
> > > > > indicates
> > > > > only
> > > > > a read-lock
> > > > > +			 * is required but there apears to
> > > > > be a
> > > > > window between
> > > > > +			 * the MMU_NOTIFY_MIGRATE event
> > > > > triggered in
> > > > > a CPU fault
> > > > > +			 * via migrate_vma_setup and the
> > > > > pages
> > > > > actually moving
> > > > > +			 * in migrate_vma_finalize in which
> > > > > this
> > > > > code can grab
> > > > > +			 * garbage pages. Grabbing the
> > > > > write-
> > > > > lock if
> > > > > the range
> > > > > +			 * is attached to vram appears to
> > > > > protect
> > > > > against this
> > > > > +			 * race.
> > > > > +			 */
> > > > > +			if (vram_pages)
> > > > > +				mmap_write_lock(mm);
> > > > > +			else
> > > > > +				mmap_read_lock(mm);
> > > > > +		}
> > > > > +		err = hmm_range_fault(&hmm_range);
> > > > > +		if (!ctx->mmap_locked) {
> > > > > +			if (vram_pages)
> > > > > +				mmap_write_unlock(mm);
> > > > > +			else
> > > > > +				mmap_read_unlock(mm);
> > > > > +		}
> > > > > +
> > > > > +		if (err == -EBUSY) {
> > > > > +			if (time_after(jiffies, timeout))
> > > > > +				break;
> > > > > +
> > > > > +			hmm_range.notifier_seq =
> > > > > mmu_interval_read_begin(notifier);
> > > > > +			continue;
> > > > > +		}
> > > > > +		break;
> > > > > +	}
> > > > > +	if (!ctx->mmap_locked)
> > > > > +		mmput(mm);
> > > > > +	if (err)
> > > > > +		goto err_free;
> > > > > +
> > > > > +	pages = (struct page **)pfns;
> > > > > +
> > > > > +	if (ctx->prefault) {
> > > > > +		range->pages = pages;
> > > > > +		goto set_seqno;
> > > > > +	}
> > > > > +
> > > > > +map_pages:
> > > > > +	if
> > > > > (is_device_private_page(hmm_pfn_to_page(pfns[0]))) {
> > > > > +		WARN_ON_ONCE(!range->vram_allocation);
> > > > > +
> > > > > +		for (i = 0; i < npages; ++i) {
> > > > > +			pages[i] = hmm_pfn_to_page(pfns[i]);
> > > > > +
> > > > > +			if
> > > > > (WARN_ON_ONCE(!is_device_private_page(pages[i]))) {
> > > > > +				err = -EOPNOTSUPP;
> > > > > +				goto err_free;
> > > > > +			}
> > > > > +		}
> > > > > +
> > > > > +		/* Do not race with notifier unmapping pages
> > > > > */
> > > > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > > > +		range->flags.has_vram_pages = true;
> > > > > +		range->pages = pages;
> > > > > +		if (mmu_interval_read_retry(notifier,
> > > > > hmm_range.notifier_seq)) {
> > > > > +			err = -EAGAIN;
> > > > > +			__drm_gpusvm_range_unmap_pages(gpusv
> > > > > m,
> > > > > range);
> > > > > +		}
> > > > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > > > +	} else {
> > > > > +		dma_addr_t *dma_addr = (dma_addr_t *)pfns;
> > > > > +
> > > > > +		for_each_dma_page(i, j, npages, order) {
> > > > > +			if (WARN_ON_ONCE(i && order !=
> > > > > +					
> > > > > hmm_pfn_to_map_order(pfns[i]))) {
> > > > > +				err = -EOPNOTSUPP;
> > > > > +				npages = i;
> > > > > +				goto err_unmap;
> > > > > +			}
> > > > > +			order =
> > > > > hmm_pfn_to_map_order(pfns[i]);
> > > > > +
> > > > > +			pages[j] = hmm_pfn_to_page(pfns[i]);
> > > > > +			if
> > > > > (WARN_ON_ONCE(is_zone_device_page(pages[j]))) {
> > > > > +				err = -EOPNOTSUPP;
> > > > > +				npages = i;
> > > > > +				goto err_unmap;
> > > > > +			}
> > > > > +
> > > > > +			set_page_dirty_lock(pages[j]);
> > > > > +			mark_page_accessed(pages[j]);
> > > > > +
> > > > > +			dma_addr[j] = dma_map_page(gpusvm-
> > > > > >drm-
> > > > > > dev,
> > > > > +						   pages[j],
> > > > > 0,
> > > > > +						   PAGE_SIZE
> > > > > <<
> > > > > order,
> > > > > +						  
> > > > > DMA_BIDIRECTIONAL);
> > > > > +			if (dma_mapping_error(gpusvm->drm-
> > > > > >dev,
> > > > > dma_addr[j])) {
> > > > > +				err = -EFAULT;
> > > > > +				npages = i;
> > > > > +				goto err_unmap;
> > > > > +			}
> > > > > +		}
> > > > > +
> > > > > +		/* Huge pages, reduce memory footprint */
> > > > > +		if (order) {
> > > > > +			dma_addr = kmalloc_array(j,
> > > > > sizeof(*dma_addr),
> > > > > +						
> > > > > GFP_KERNEL);
> > > > > +			if (dma_addr) {
> > > > > +				for (i = 0; i < j; ++i)
> > > > > +					dma_addr[i] =
> > > > > (dma_addr_t)pfns[i];
> > > > > +				kvfree(pfns);
> > > > > +				kfree_mapping = true;
> > > > > +			} else {
> > > > > +				dma_addr = (dma_addr_t
> > > > > *)pfns;
> > > > > +			}
> > > > > +		}
> > > > > +
> > > > > +		/* Do not race with notifier unmapping pages
> > > > > */
> > > > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > > > +		range->order = order;
> > > > > +		range->flags.kfree_mapping = kfree_mapping;
> > > > > +		range->flags.has_dma_mapping = true;
> > > > > +		range->dma_addr = dma_addr;
> > > > > +		range->vram_allocation = NULL;
> > > > > +		if (mmu_interval_read_retry(notifier,
> > > > > hmm_range.notifier_seq)) {
> > > > > +			err = -EAGAIN;
> > > > > +			__drm_gpusvm_range_unmap_pages(gpusv
> > > > > m,
> > > > > range);
> > > > > +		}
> > > > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > > > +	}
> > > > > +
> > > > > +	if (err == -EAGAIN)
> > > > > +		goto retry;
> > > > > +set_seqno:
> > > > > +	range->notifier_seq = hmm_range.notifier_seq;
> > > > > +
> > > > > +	return 0;
> > > > > +
> > > > > +err_unmap:
> > > > > +	for_each_dma_page(i, j, npages, order)
> > > > > +		dma_unmap_page(gpusvm->drm->dev,
> > > > > +			       (dma_addr_t)pfns[j],
> > > > > +			       PAGE_SIZE << order,
> > > > > DMA_BIDIRECTIONAL);
> > > > > +err_free:
> > > > > +	if (alloc_pfns)
> > > > > +		kvfree(pfns);
> > > > > +err_out:
> > > > > +	return err;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_range_unmap_pages - Unmap pages associated
> > > > > with a
> > > > > GPU
> > > > > SVM range
> > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > + * @range: Pointer to the GPU SVM range structure
> > > > > + * @ctx: GPU SVM context
> > > > > + *
> > > > > + * This function unmaps pages associated with a GPU SVM
> > > > > range.
> > > > > If
> > > > > @in_notifier
> > > > > + * is set, it is assumed that gpusvm->notifier_lock is held
> > > > > in
> > > > > write
> > > > > mode; if it
> > > > > + * is clear, it acquires gpusvm->notifier_lock in read mode.
> > > > > Must be
> > > > > called on
> > > > > + * each GPU SVM range attached to notifier in gpusvm->ops-
> > > > > > invalidate for IOMMU
> > > > > + * security model.
> > > > > + */
> > > > > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> > > > > +				  struct drm_gpusvm_range
> > > > > *range,
> > > > > +				  const struct
> > > > > drm_gpusvm_ctx
> > > > > *ctx)
> > > > > +{
> > > > > +	if (ctx->in_notifier)
> > > > > +		lockdep_assert_held_write(&gpusvm-
> > > > > > notifier_lock);
> > > > > +	else
> > > > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > > > +
> > > > > +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > > > > +
> > > > > +	if (!ctx->in_notifier)
> > > > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_migration_put_page - Put a migration page
> > > > > + * @page: Pointer to the page to put
> > > > > + *
> > > > > + * This function unlocks and puts a page.
> > > > > + */
> > > > > +static void drm_gpusvm_migration_put_page(struct page *page)
> > > > > +{
> > > > > +	unlock_page(page);
> > > > > +	put_page(page);
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_migration_put_pages - Put migration pages
> > > > > + * @npages: Number of pages
> > > > > + * @migrate_pfn: Array of migrate page frame numbers
> > > > > + *
> > > > > + * This function puts an array of pages.
> > > > > + */
> > > > > +static void drm_gpusvm_migration_put_pages(unsigned long
> > > > > npages,
> > > > > +					   unsigned long
> > > > > *migrate_pfn)
> > > > > +{
> > > > > +	unsigned long i;
> > > > > +
> > > > > +	for (i = 0; i < npages; ++i) {
> > > > > +		if (!migrate_pfn[i])
> > > > > +			continue;
> > > > > +
> > > > > +		drm_gpusvm_migration_put_page(migrate_pfn_to
> > > > > _pag
> > > > > e(mi
> > > > > grate_pfn[i]));
> > > > > +		migrate_pfn[i] = 0;
> > > > > +	}
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_get_vram_page - Get a reference to a VRAM page
> > > > > + * @page: Pointer to the page
> > > > > + * @zdd: Pointer to the GPU SVM zone device data
> > > > > + *
> > > > > + * This function associates the given page with the
> > > > > specified
> > > > > GPU
> > > > > SVM zone
> > > > > + * device data and initializes it for zone device usage.
> > > > > + */
> > > > > +static void drm_gpusvm_get_vram_page(struct page *page,
> > > > > +				     struct drm_gpusvm_zdd
> > > > > *zdd)
> > > > > +{
> > > > > +	page->zone_device_data = drm_gpusvm_zdd_get(zdd);
> > > > > +	zone_device_page_init(page);
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_migrate_map_pages() - Map migration pages for
> > > > > GPU
> > > > > SVM
> > > > > migration
> > > > > + * @dev: The device for which the pages are being mapped
> > > > > + * @dma_addr: Array to store DMA addresses corresponding to
> > > > > mapped
> > > > > pages
> > > > > + * @migrate_pfn: Array of migrate page frame numbers to map
> > > > > + * @npages: Number of pages to map
> > > > > + * @dir: Direction of data transfer (e.g.,
> > > > > DMA_BIDIRECTIONAL)
> > > > > + *
> > > > > + * This function maps pages of memory for migration usage in
> > > > > GPU
> > > > > SVM. It
> > > > > + * iterates over each page frame number provided in
> > > > > @migrate_pfn,
> > > > > maps the
> > > > > + * corresponding page, and stores the DMA address in the
> > > > > provided
> > > > > @dma_addr
> > > > > + * array.
> > > > > + *
> > > > > + * Return: 0 on success, -EFAULT if an error occurs during
> > > > > mapping.
> > > > > + */
> > > > > +static int drm_gpusvm_migrate_map_pages(struct device *dev,
> > > > > +					dma_addr_t
> > > > > *dma_addr,
> > > > > +					long unsigned int
> > > > > *migrate_pfn,
> > > > > +					unsigned long
> > > > > npages,
> > > > > +					enum
> > > > > dma_data_direction
> > > > > dir)
> > > > > +{
> > > > > +	unsigned long i;
> > > > > +
> > > > > +	for (i = 0; i < npages; ++i) {
> > > > > +		struct page *page =
> > > > > migrate_pfn_to_page(migrate_pfn[i]);
> > > > > +
> > > > > +		if (!page)
> > > > > +			continue;
> > > > > +
> > > > > +		if (WARN_ON_ONCE(is_zone_device_page(page)))
> > > > > +			return -EFAULT;
> > > > > +
> > > > > +		dma_addr[i] = dma_map_page(dev, page, 0,
> > > > > PAGE_SIZE,
> > > > > dir);
> > > > > +		if (dma_mapping_error(dev, dma_addr[i]))
> > > > > +			return -EFAULT;
> > > > > +	}
> > > > > +
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_migrate_unmap_pages() - Unmap pages previously
> > > > > mapped
> > > > > for GPU SVM migration
> > > > > + * @dev: The device for which the pages were mapped
> > > > > + * @dma_addr: Array of DMA addresses corresponding to mapped
> > > > > pages
> > > > > + * @npages: Number of pages to unmap
> > > > > + * @dir: Direction of data transfer (e.g.,
> > > > > DMA_BIDIRECTIONAL)
> > > > > + *
> > > > > + * This function unmaps previously mapped pages of memory
> > > > > for
> > > > > GPU
> > > > > Shared Virtual
> > > > > + * Memory (SVM). It iterates over each DMA address provided
> > > > > in
> > > > > @dma_addr, checks
> > > > > + * if it's valid and not already unmapped, and unmaps the
> > > > > corresponding page.
> > > > > + */
> > > > > +static void drm_gpusvm_migrate_unmap_pages(struct device
> > > > > *dev,
> > > > > +					   dma_addr_t
> > > > > *dma_addr,
> > > > > +					   unsigned long
> > > > > npages,
> > > > > +					   enum
> > > > > dma_data_direction
> > > > > dir)
> > > > > +{
> > > > > +	unsigned long i;
> > > > > +
> > > > > +	for (i = 0; i < npages; ++i) {
> > > > > +		if (!dma_addr[i] || dma_mapping_error(dev,
> > > > > dma_addr[i]))
> > > > > +			continue;
> > > > > +
> > > > > +		dma_unmap_page(dev, dma_addr[i], PAGE_SIZE,
> > > > > dir);
> > > > > +	}
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_migrate_to_vram - Migrate GPU SVM range to VRAM
> > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > + * @range: Pointer to the GPU SVM range structure
> > > > > + * @vram_allocation: Driver-private pointer to the VRAM allocation. The caller
> > > > > + *                   should hold a reference to the VRAM allocation, which
> > > > > + *                   should be dropped via ops->vram_release or upon the
> > > > > + *                   failure of this function.
> > > > > + * @ctx: GPU SVM context
> > > > > + *
> > > > > + * This function migrates the specified GPU SVM range to VRAM. It performs the
> > > > > + * necessary setup and invokes the driver-specific operations for migration to
> > > > > + * VRAM. Upon successful return, @vram_allocation can safely reference @range
> > > > > + * until ops->vram_release is called, which only happens upon successful return.
> > > > > + *
> > > > > + * Returns:
> > > > > + * 0 on success, negative error code on failure.
> > > > > + */
> > > > > +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> > > > > +			       struct drm_gpusvm_range
> > > > > *range,
> > > > > +			       void *vram_allocation,
> > > > > +			       const struct drm_gpusvm_ctx
> > > > > *ctx)
> > > > > +{
> > > > > +	u64 start = range->va.start, end = range->va.end;
> > > > > +	struct migrate_vma migrate = {
> > > > > +		.start		= start,
> > > > > +		.end		= end,
> > > > > +		.pgmap_owner	= gpusvm->device_private_page_owner,
> > > > > +		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
> > > > > +	};
> > > > > +	struct mm_struct *mm = gpusvm->mm;
> > > > > +	unsigned long i, npages = npages_in_range(start,
> > > > > end);
> > > > > +	struct vm_area_struct *vas;
> > > > > +	struct drm_gpusvm_zdd *zdd = NULL;
> > > > > +	struct page **pages;
> > > > > +	dma_addr_t *dma_addr;
> > > > > +	void *buf;
> > > > > +	int err;
> > > > > +
> > > > > +	if (!range->flags.migrate_vram)
> > > > > +		return -EINVAL;
> > > > > +
> > > > > +	if (!gpusvm->ops->populate_vram_pfn || !gpusvm->ops->copy_to_vram ||
> > > > > +	    !gpusvm->ops->copy_to_sram)
> > > > > +		return -EOPNOTSUPP;
> > > > > +
> > > > > +	if (!ctx->mmap_locked) {
> > > > > +		if (!mmget_not_zero(mm)) {
> > > > > +			err = -EFAULT;
> > > > > +			goto err_out;
> > > > > +		}
> > > > > +		mmap_write_lock(mm);
> > > > > +	}
> > > > > +
> > > > > +	mmap_assert_locked(mm);
> > > > > +
> > > > > +	vas = vma_lookup(mm, start);
> > > > > +	if (!vas) {
> > > > > +		err = -ENOENT;
> > > > > +		goto err_mmunlock;
> > > > > +	}
> > > > > +
> > > > > +	if (end > vas->vm_end || start < vas->vm_start) {
> > > > > +		err = -EINVAL;
> > > > > +		goto err_mmunlock;
> > > > > +	}
> > > > > +
> > > > > +	if (!vma_is_anonymous(vas)) {
> > > > > +		err = -EBUSY;
> > > > > +		goto err_mmunlock;
> > > > > +	}
> > > > > +
> > > > > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) +
> > > > > sizeof(*dma_addr) +
> > > > > +		       sizeof(*pages), GFP_KERNEL);
> > > > > +	if (!buf) {
> > > > > +		err = -ENOMEM;
> > > > > +		goto err_mmunlock;
> > > > > +	}
> > > > > +	dma_addr = buf + (2 * sizeof(*migrate.src) *
> > > > > npages);
> > > > > +	pages = buf + (2 * sizeof(*migrate.src) +
> > > > > sizeof(*dma_addr))
> > > > > * npages;
> > > > > +
> > > > > +	zdd = drm_gpusvm_zdd_alloc(range);
> > > > > +	if (!zdd) {
> > > > > +		err = -ENOMEM;
> > > > > +		goto err_free;
> > > > > +	}
> > > > > +
> > > > > +	migrate.vma = vas;
> > > > > +	migrate.src = buf;
> > > > > +	migrate.dst = migrate.src + npages;
> > > > > +
> > > > > +	err = migrate_vma_setup(&migrate);
> > > > > +	if (err)
> > > > > +		goto err_free;
> > > > > +
> > > > > +	/*
> > > > > +	 * FIXME: Below cases, !migrate.cpages and
> > > > > migrate.cpages !=
> > > > > npages, not
> > > > > +	 * always an error. Need to revisit possible cases
> > > > > and
> > > > > how
> > > > > to handle. We
> > > > > +	 * could prefault on migrate.cpages != npages via
> > > > > hmm_range_fault.
> > > > > +	 */
> > > > > +
> > > > > +	if (!migrate.cpages) {
> > > > > +		err = -EFAULT;
> > > > > +		goto err_free;
> > > > > +	}
> > > > > +
> > > > > +	if (migrate.cpages != npages) {
> > > > > +		err = -EBUSY;
> > > > > +		goto err_finalize;
> > > > > +	}
> > > > > +
> > > > > +	err = gpusvm->ops->populate_vram_pfn(gpusvm,
> > > > > vram_allocation, npages,
> > > > > +					     migrate.dst);
> > > > > +	if (err)
> > > > > +		goto err_finalize;
> > > > > +
> > > > > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> > > > > dma_addr,
> > > > > +					   migrate.src,
> > > > > npages,
> > > > > DMA_TO_DEVICE);
> > > > > +	if (err)
> > > > > +		goto err_finalize;
> > > > > +
> > > > > +	for (i = 0; i < npages; ++i) {
> > > > > +		struct page *page =
> > > > > pfn_to_page(migrate.dst[i]);
> > > > > +
> > > > > +		pages[i] = page;
> > > > > +		migrate.dst[i] =
> > > > > migrate_pfn(migrate.dst[i]);
> > > > > +		drm_gpusvm_get_vram_page(page, zdd);
> > > > > +	}
> > > > > +
> > > > > +	err = gpusvm->ops->copy_to_vram(gpusvm, pages,
> > > > > dma_addr,
> > > > > npages);
> > > > > +	if (err)
> > > > > +		goto err_finalize;
> > > > > +
> > > > > +	/* Upon success bind vram allocation to range and zdd */
> > > > > +	range->vram_allocation = vram_allocation;
> > > > > +	WRITE_ONCE(zdd->vram_allocation, vram_allocation);	/* Owns ref */
> > > > > +
> > > > > +err_finalize:
> > > > > +	if (err)
> > > > > +		drm_gpusvm_migration_put_pages(npages,
> > > > > migrate.dst);
> > > > > +	migrate_vma_pages(&migrate);
> > > > > +	migrate_vma_finalize(&migrate);
> > > > > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev,
> > > > > dma_addr,
> > > > > npages,
> > > > > +				       DMA_TO_DEVICE);
> > > > > +err_free:
> > > > > +	if (zdd)
> > > > > +		drm_gpusvm_zdd_put(zdd);
> > > > > +	kvfree(buf);
> > > > > +err_mmunlock:
> > > > > +	if (!ctx->mmap_locked) {
> > > > > +		mmap_write_unlock(mm);
> > > > > +		mmput(mm);
> > > > > +	}
> > > > > +err_out:
> > > > > +	return err;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_migrate_populate_sram_pfn - Populate SRAM PFNs
> > > > > for
> > > > > a
> > > > > VM area
> > > > > + * @vas: Pointer to the VM area structure, can be NULL
> > > > > + * @npages: Number of pages to populate
> > > > > + * @src_mpfn: Source array of migrate PFNs
> > > > > + * @mpfn: Array of migrate PFNs to populate
> > > > > + * @addr: Start address for PFN allocation
> > > > > + *
> > > > > + * This function populates the SRAM migrate page frame
> > > > > numbers
> > > > > (PFNs) for the
> > > > > + * specified VM area structure. It allocates and locks pages
> > > > > in
> > > > > the
> > > > > VM area for
> > > > > + * SRAM usage. If vas is non-NULL use alloc_page_vma for
> > > > > allocation,
> > > > > if NULL use
> > > > > + * alloc_page for allocation.
> > > > > + *
> > > > > + * Returns:
> > > > > + * 0 on success, negative error code on failure.
> > > > > + */
> > > > > +static int drm_gpusvm_migrate_populate_sram_pfn(struct
> > > > > vm_area_struct *vas,
> > > > > +						unsigned
> > > > > long
> > > > > npages,
> > > > > +						unsigned
> > > > > long
> > > > > *src_mpfn,
> > > > > +						unsigned
> > > > > long
> > > > > *mpfn,
> > > > > u64 addr)
> > > > > +{
> > > > > +	unsigned long i;
> > > > > +
> > > > > +	for (i = 0; i < npages; ++i, addr += PAGE_SIZE) {
> > > > > +		struct page *page;
> > > > > +
> > > > > +		if (!(src_mpfn[i] & MIGRATE_PFN_MIGRATE))
> > > > > +			continue;
> > > > > +
> > > > > +		if (vas)
> > > > > +			page = alloc_page_vma(GFP_HIGHUSER,
> > > > > vas,
> > > > > addr);
> > > > > +		else
> > > > > +			page = alloc_page(GFP_HIGHUSER);
> > > > > +
> > > > > +		if (!page)
> > > > > +			return -ENOMEM;
> > > > > +
> > > > > +		lock_page(page);
> > > > > +		mpfn[i] = migrate_pfn(page_to_pfn(page));
> > > > > +	}
> > > > > +
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_evict_to_sram - Evict GPU SVM range to SRAM
> > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > + * @range: Pointer to the GPU SVM range structure
> > > > > + *
> > > > > + * Similar to __drm_gpusvm_migrate_to_sram but does not require the mmap lock
> > > > > + * and migration is done via the migrate_device_* functions. Fallback path, as
> > > > > + * it is preferred to issue migrations with the mmap lock held.
> > > > > + *
> > > > > + * Returns:
> > > > > + * 0 on success, negative error code on failure.
> > > > > + */
> > > > > +static int drm_gpusvm_evict_to_sram(struct drm_gpusvm
> > > > > *gpusvm,
> > > > > +				    struct drm_gpusvm_range
> > > > > *range)
> > > > > +{
> > > > > +	unsigned long npages;
> > > > > +	struct page **pages;
> > > > > +	unsigned long *src, *dst;
> > > > > +	dma_addr_t *dma_addr;
> > > > > +	void *buf;
> > > > > +	int i, err = 0;
> > > > > +
> > > > > +	npages = npages_in_range(range->va.start, range->va.end);
> > > > > +
> > > > > +	buf = kvcalloc(npages, 2 * sizeof(*src) +
> > > > > sizeof(*dma_addr)
> > > > > +
> > > > > +		       sizeof(*pages), GFP_KERNEL);
> > > > > +	if (!buf) {
> > > > > +		err = -ENOMEM;
> > > > > +		goto err_out;
> > > > > +	}
> > > > > +	src = buf;
> > > > > +	dst = buf + (sizeof(*src) * npages);
> > > > > +	dma_addr = buf + (2 * sizeof(*src) * npages);
> > > > > +	pages = buf + (2 * sizeof(*src) + sizeof(*dma_addr))
> > > > > *
> > > > > npages;
> > > > > +
> > > > > +	err = gpusvm->ops->populate_vram_pfn(gpusvm, range->vram_allocation,
> > > > > +					     npages, src);
> > > > > +	if (err)
> > > > > +		goto err_free;
> > > > > +
> > > > > +	err = migrate_device_vma_range(gpusvm->mm,
> > > > > +				       gpusvm->device_private_page_owner,
> > > > > +				       src, npages, range->va.start);
> > > > > +	if (err)
> > > > > +		goto err_free;
> > > > > +
> > > > > +	err = drm_gpusvm_migrate_populate_sram_pfn(NULL,
> > > > > npages,
> > > > > src, dst, 0);
> > > > > +	if (err)
> > > > > +		goto err_finalize;
> > > > > +
> > > > > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> > > > > dma_addr,
> > > > > +					   dst, npages,
> > > > > DMA_BIDIRECTIONAL);
> > > > > +	if (err)
> > > > > +		goto err_finalize;
> > > > > +
> > > > > +	for (i = 0; i < npages; ++i)
> > > > > +		pages[i] = migrate_pfn_to_page(src[i]);
> > > > > +
> > > > > +	err = gpusvm->ops->copy_to_sram(gpusvm, pages,
> > > > > dma_addr,
> > > > > npages);
> > > > > +	if (err)
> > > > > +		goto err_finalize;
> > > > > +
> > > > > +err_finalize:
> > > > > +	if (err)
> > > > > +		drm_gpusvm_migration_put_pages(npages, dst);
> > > > > +	migrate_device_pages(src, dst, npages);
> > > > > +	migrate_device_finalize(src, dst, npages);
> > > > > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev,
> > > > > dma_addr,
> > > > > npages,
> > > > > +				       DMA_BIDIRECTIONAL);
> > > > > +err_free:
> > > > > +	kvfree(buf);
> > > > > +err_out:
> > > > > +
> > > > > +	return err;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * __drm_gpusvm_migrate_to_sram - Migrate GPU SVM range to
> > > > > SRAM
> > > > > (internal)
> > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > + * @vas: Pointer to the VM area structure
> > > > > + * @page: Pointer to the page for fault handling (can be
> > > > > NULL)
> > > > > + * @start: Start address of the migration range
> > > > > + * @end: End address of the migration range
> > > > > + *
> > > > > + * This internal function performs the migration of the
> > > > > specified
> > > > > GPU SVM range
> > > > > + * to SRAM. It sets up the migration, populates + dma maps
> > > > > SRAM
> > > > > PFNs, and
> > > > > + * invokes the driver-specific operations for migration to
> > > > > SRAM.
> > > > > + *
> > > > > + * Returns:
> > > > > + * 0 on success, negative error code on failure.
> > > > > + */
> > > > > +static int __drm_gpusvm_migrate_to_sram(struct drm_gpusvm
> > > > > *gpusvm,
> > > > > +					struct
> > > > > vm_area_struct
> > > > > *vas,
> > > > > +					struct page *page,
> > > > > +					u64 start, u64 end)
> > > > > +{
> > > > > +	struct migrate_vma migrate = {
> > > > > +		.vma		= vas,
> > > > > +		.pgmap_owner	= gpusvm->device_private_page_owner,
> > > > > +		.flags		=
> > > > > MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> > > > > +		.fault_page	= page,
> > > > > +	};
> > > > > +	unsigned long npages;
> > > > > +	struct page **pages;
> > > > > +	dma_addr_t *dma_addr;
> > > > > +	void *buf;
> > > > > +	int i, err = 0;
> > > > > +
> > > > > +	mmap_assert_locked(gpusvm->mm);
> > > > > +
> > > > > +	/* Corner case where the VM area struct has been partially unmapped */
> > > > > +	if (start < vas->vm_start)
> > > > > +		start = vas->vm_start;
> > > > > +	if (end > vas->vm_end)
> > > > > +		end = vas->vm_end;
> > > > > +
> > > > > +	migrate.start = start;
> > > > > +	migrate.end = end;
> > > > > +	npages = npages_in_range(start, end);
> > > > > +
> > > > > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) +
> > > > > sizeof(*dma_addr) +
> > > > > +		       sizeof(*pages), GFP_KERNEL);
> > > > > +	if (!buf) {
> > > > > +		err = -ENOMEM;
> > > > > +		goto err_out;
> > > > > +	}
> > > > > +	dma_addr = buf + (2 * sizeof(*migrate.src) *
> > > > > npages);
> > > > > +	pages = buf + (2 * sizeof(*migrate.src) +
> > > > > sizeof(*dma_addr))
> > > > > * npages;
> > > > > +
> > > > > +	migrate.vma = vas;
> > > > > +	migrate.src = buf;
> > > > > +	migrate.dst = migrate.src + npages;
> > > > > +
> > > > > +	err = migrate_vma_setup(&migrate);
> > > > > +	if (err)
> > > > > +		goto err_free;
> > > > > +
> > > > > +	/* Raced with another CPU fault, nothing to do */
> > > > > +	if (!migrate.cpages)
> > > > > +		goto err_free;
> > > > > +
> > > > > +	err = drm_gpusvm_migrate_populate_sram_pfn(vas, npages,
> > > > > +						   migrate.src, migrate.dst,
> > > > > +						   start);
> > > > > +	if (err)
> > > > > +		goto err_finalize;
> > > > > +
> > > > > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev, dma_addr,
> > > > > +					   migrate.dst, npages,
> > > > > +					   DMA_BIDIRECTIONAL);
> > > > > +	if (err)
> > > > > +		goto err_finalize;
> > > > > +
> > > > > +	for (i = 0; i < npages; ++i)
> > > > > +		pages[i] =
> > > > > migrate_pfn_to_page(migrate.src[i]);
> > > > > +
> > > > > +	err = gpusvm->ops->copy_to_sram(gpusvm, pages,
> > > > > dma_addr,
> > > > > npages);
> > > > > +	if (err)
> > > > > +		goto err_finalize;
> > > > > +
> > > > > +err_finalize:
> > > > > +	if (err)
> > > > > +		drm_gpusvm_migration_put_pages(npages,
> > > > > migrate.dst);
> > > > > +	migrate_vma_pages(&migrate);
> > > > > +	migrate_vma_finalize(&migrate);
> > > > > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev,
> > > > > dma_addr,
> > > > > npages,
> > > > > +				       DMA_BIDIRECTIONAL);
> > > > > +err_free:
> > > > > +	kvfree(buf);
> > > > > +err_out:
> > > > > +	mmap_assert_locked(gpusvm->mm);
> > > > > +
> > > > > +	return err;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_migrate_to_sram - Migrate (evict) GPU SVM
> > > > > range to
> > > > > SRAM
> > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > + * @range: Pointer to the GPU SVM range structure
> > > > > + * @ctx: GPU SVM context
> > > > > + *
> > > > > + * This function initiates the migration of the specified
> > > > > GPU
> > > > > SVM
> > > > > range to
> > > > > + * SRAM. It performs necessary checks and invokes the
> > > > > internal
> > > > > migration
> > > > > + * function for actual migration.
> > > > > + *
> > > > > + * Returns:
> > > > > + * 0 on success, negative error code on failure.
> > > > > + */
> > > > > +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > > > > +			       struct drm_gpusvm_range
> > > > > *range,
> > > > > +			       const struct drm_gpusvm_ctx
> > > > > *ctx)
> > > > > +{
> > > > > +	u64 start = range->va.start, end = range->va.end;
> > > > > +	struct mm_struct *mm = gpusvm->mm;
> > > > > +	struct vm_area_struct *vas;
> > > > > +	int err;
> > > > > +	bool retry = false;
> > > > > +
> > > > > +	if (!ctx->mmap_locked) {
> > > > > +		if (!mmget_not_zero(mm)) {
> > > > > +			err = -EFAULT;
> > > > > +			goto err_out;
> > > > > +		}
> > > > > +		if (ctx->trylock_mmap) {
> > > > > +			if (!mmap_read_trylock(mm))  {
> > > > > +				err =
> > > > > drm_gpusvm_evict_to_sram(gpusvm, range);
> > > > > +				goto err_mmput;
> > > > > +			}
> > > > > +		} else {
> > > > > +			mmap_read_lock(mm);
> > > > > +		}
> > > > > +	}
> > > > > +
> > > > > +	mmap_assert_locked(mm);
> > > > > +
> > > > > +	/*
> > > > > +	 * Loop required to find all VMA area structs for
> > > > > the
> > > > > corner
> > > > > case when
> > > > > +	 * VRAM backing has been partially unmapped from
> > > > > MM's
> > > > > address space.
> > > > > +	 */
> > > > > +again:
> > > > > +	vas = find_vma(mm, start);
> > > > > +	if (!vas) {
> > > > > +		if (!retry)
> > > > > +			err = -ENOENT;
> > > > > +		goto err_mmunlock;
> > > > > +	}
> > > > > +
> > > > > +	if (end <= vas->vm_start || start >= vas->vm_end) {
> > > > > +		if (!retry)
> > > > > +			err = -EINVAL;
> > > > > +		goto err_mmunlock;
> > > > > +	}
> > > > > +
> > > > > +	err = __drm_gpusvm_migrate_to_sram(gpusvm, vas,
> > > > > NULL,
> > > > > start,
> > > > > end);
> > > > > +	if (err)
> > > > > +		goto err_mmunlock;
> > > > > +
> > > > > +	if (vas->vm_end < end) {
> > > > > +		retry = true;
> > > > > +		start = vas->vm_end;
> > > > > +		goto again;
> > > > > +	}
> > > > > +
> > > > > +	if (!ctx->mmap_locked) {
> > > > > +		mmap_read_unlock(mm);
> > > > > +		/*
> > > > > +		 * Using mmput_async as this function can be
> > > > > called
> > > > > while
> > > > > +		 * holding a dma-resv lock, and a final put
> > > > > can
> > > > > grab
> > > > > the mmap
> > > > > +		 * lock, causing a lock inversion.
> > > > > +		 */
> > > > > +		mmput_async(mm);
> > > > > +	}
> > > > > +
> > > > > +	return 0;
> > > > > +
> > > > > +err_mmunlock:
> > > > > +	if (!ctx->mmap_locked)
> > > > > +		mmap_read_unlock(mm);
> > > > > +err_mmput:
> > > > > +	if (!ctx->mmap_locked)
> > > > > +		mmput_async(mm);
> > > > > +err_out:
> > > > > +	return err;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_page_free - Put GPU SVM zone device data
> > > > > associated
> > > > > with a page
> > > > > + * @page: Pointer to the page
> > > > > + *
> > > > > + * This function is a callback used to put the GPU SVM zone
> > > > > device
> > > > > data
> > > > > + * associated with a page when it is being released.
> > > > > + */
> > > > > +static void drm_gpusvm_page_free(struct page *page)
> > > > > +{
> > > > > +	drm_gpusvm_zdd_put(page->zone_device_data);
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_migrate_to_ram - Migrate GPU SVM range to RAM
> > > > > (page
> > > > > fault handler)
> > > > > + * @vmf: Pointer to the fault information structure
> > > > > + *
> > > > > + * This function is a page fault handler used to migrate a
> > > > > GPU
> > > > > SVM
> > > > > range to RAM.
> > > > > + * It retrieves the GPU SVM range information from the
> > > > > faulting
> > > > > page
> > > > > and invokes
> > > > > + * the internal migration function to migrate the range back
> > > > > to
> > > > > RAM.
> > > > > + *
> > > > > + * Returns:
> > > > > + * VM_FAULT_SIGBUS on failure, 0 on success.
> > > > > + */
> > > > > +static vm_fault_t drm_gpusvm_migrate_to_ram(struct vm_fault
> > > > > *vmf)
> > > > > +{
> > > > > +	struct drm_gpusvm_zdd *zdd = vmf->page->zone_device_data;
> > > > > +	int err;
> > > > > +
> > > > > +	err = __drm_gpusvm_migrate_to_sram(zdd->range->gpusvm,
> > > > > +					   vmf->vma, vmf->page,
> > > > > +					   zdd->range->va.start,
> > > > > +					   zdd->range->va.end);
> > > > > +
> > > > > +	return err ? VM_FAULT_SIGBUS : 0;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_pagemap_ops - Device page map operations for
> > > > > GPU
> > > > > SVM
> > > > > + */
> > > > > +static const struct dev_pagemap_ops drm_gpusvm_pagemap_ops =
> > > > > {
> > > > > +	.page_free = drm_gpusvm_page_free,
> > > > > +	.migrate_to_ram = drm_gpusvm_migrate_to_ram,
> > > > > +};
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_pagemap_ops_get - Retrieve GPU SVM device page
> > > > > map
> > > > > operations
> > > > > + *
> > > > > + * Returns:
> > > > > + * Pointer to the GPU SVM device page map operations
> > > > > structure.
> > > > > + */
> > > > > +const struct dev_pagemap_ops
> > > > > *drm_gpusvm_pagemap_ops_get(void)
> > > > > +{
> > > > > +	return &drm_gpusvm_pagemap_ops;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_has_mapping - Check if GPU SVM has mapping for
> > > > > the
> > > > > given address range
> > > > > + * @gpusvm: Pointer to the GPU SVM structure.
> > > > > + * @start: Start address
> > > > > + * @end: End address
> > > > > + *
> > > > > + * Returns:
> > > > > + * True if GPU SVM has mapping, False otherwise
> > > > > + */
> > > > > +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64
> > > > > start,
> > > > > u64 end)
> > > > > +{
> > > > > +	struct drm_gpusvm_notifier *notifier;
> > > > > +
> > > > > +	drm_gpusvm_for_each_notifier(notifier, gpusvm,
> > > > > start,
> > > > > end) {
> > > > > +		struct drm_gpusvm_range *range = NULL;
> > > > > +
> > > > > +		drm_gpusvm_for_each_range(range, notifier,
> > > > > start,
> > > > > end)
> > > > > +			return true;
> > > > > +	}
> > > > > +
> > > > > +	return false;
> > > > > +}
> > > > > diff --git a/drivers/gpu/drm/xe/drm_gpusvm.h
> > > > > b/drivers/gpu/drm/xe/drm_gpusvm.h
> > > > > new file mode 100644
> > > > > index 000000000000..0ea70f8534a8
> > > > > --- /dev/null
> > > > > +++ b/drivers/gpu/drm/xe/drm_gpusvm.h
> > > > > @@ -0,0 +1,415 @@
> > > > > +/* SPDX-License-Identifier: MIT */
> > > > > +/*
> > > > > + * Copyright © 2024 Intel Corporation
> > > > > + */
> > > > > +
> > > > > +#ifndef __DRM_GPUSVM_H__
> > > > > +#define __DRM_GPUSVM_H__
> > > > > +
> > > > > +#include <linux/kref.h>
> > > > > +#include <linux/mmu_notifier.h>
> > > > > +#include <linux/workqueue.h>
> > > > > +
> > > > > +struct dev_pagemap_ops;
> > > > > +struct drm_device;
> > > > > +struct drm_gpusvm;
> > > > > +struct drm_gpusvm_notifier;
> > > > > +struct drm_gpusvm_ops;
> > > > > +struct drm_gpusvm_range;
> > > > > +
> > > > > +/**
> > > > > + * struct drm_gpusvm_ops - Operations structure for GPU SVM
> > > > > + *
> > > > > + * This structure defines the operations for GPU Shared
> > > > > Virtual
> > > > > Memory (SVM).
> > > > > + * These operations are provided by the GPU driver to manage
> > > > > SVM
> > > > > ranges and
> > > > > + * perform operations such as migration between VRAM and
> > > > > system
> > > > > RAM.
> > > > > + */
> > > > > +struct drm_gpusvm_ops {
> > > > > +	/**
> > > > > +	 * @notifier_alloc: Allocate a GPU SVM notifier
> > > > > (optional)
> > > > > +	 *
> > > > > +	 * This function shall allocate a GPU SVM notifier.
> > > > > +	 *
> > > > > +	 * Returns:
> > > > > +	 * Pointer to the allocated GPU SVM notifier on
> > > > > success,
> > > > > NULL on failure.
> > > > > +	 */
> > > > > +	struct drm_gpusvm_notifier *(*notifier_alloc)(void);
> > > > > +
> > > > > +	/**
> > > > > +	 * @notifier_free: Free a GPU SVM notifier
> > > > > (optional)
> > > > > +	 * @notifier: Pointer to the GPU SVM notifier to be
> > > > > freed
> > > > > +	 *
> > > > > +	 * This function shall free a GPU SVM notifier.
> > > > > +	 */
> > > > > +	void (*notifier_free)(struct drm_gpusvm_notifier
> > > > > *notifier);
> > > > > +
> > > > > +	/**
> > > > > +	 * @range_alloc: Allocate a GPU SVM range (optional)
> > > > > +	 * @gpusvm: Pointer to the GPU SVM
> > > > > +	 *
> > > > > +	 * This function shall allocate a GPU SVM range.
> > > > > +	 *
> > > > > +	 * Returns:
> > > > > +	 * Pointer to the allocated GPU SVM range on
> > > > > success,
> > > > > NULL
> > > > > on failure.
> > > > > +	 */
> > > > > +	struct drm_gpusvm_range *(*range_alloc)(struct
> > > > > drm_gpusvm
> > > > > *gpusvm);
> > > > > +
> > > > > +	/**
> > > > > +	 * @range_free: Free a GPU SVM range (optional)
> > > > > +	 * @range: Pointer to the GPU SVM range to be freed
> > > > > +	 *
> > > > > +	 * This function shall free a GPU SVM range.
> > > > > +	 */
> > > > > +	void (*range_free)(struct drm_gpusvm_range *range);
> > > > > +
> > > > > +	/**
> > > > > +	 * @vram_release: Release VRAM allocation (optional)
> > > > > +	 * @vram_allocation: Driver-private pointer to the
> > > > > VRAM
> > > > > allocation
> > > > > +	 *
> > > > > +	 * This function shall release VRAM allocation and
> > > > > expects
> > > > > to drop a
> > > > > +	 * reference to VRAM allocation.
> > > > > +	 */
> > > > > +	void (*vram_release)(void *vram_allocation);
> > > > > +
> > > > > +	/**
> > > > > +	 * @populate_vram_pfn: Populate VRAM PFN (required
> > > > > for
> > > > > migration)
> > > > > +	 * @gpusvm: Pointer to the GPU SVM
> > > > > +	 * @vram_allocation: Driver-private pointer to the
> > > > > VRAM
> > > > > allocation
> > > > > +	 * @npages: Number of pages to populate
> > > > > +	 * @pfn: Array of page frame numbers to populate
> > > > > +	 *
> > > > > +	 * This function shall populate VRAM page frame
> > > > > numbers
> > > > > (PFN).
> > > > > +	 *
> > > > > +	 * Returns:
> > > > > +	 * 0 on success, a negative error code on failure.
> > > > > +	 */
> > > > > +	int (*populate_vram_pfn)(struct drm_gpusvm *gpusvm,
> > > > > +				 void *vram_allocation,
> > > > > +				 unsigned long npages,
> > > > > +				 unsigned long *pfn);
> > > > > +
> > > > > +	/**
> > > > > +	 * @copy_to_vram: Copy to VRAM (required for
> > > > > migration)
> > > > > +	 * @gpusvm: Pointer to the GPU SVM
> > > > > +	 * @pages: Pointer to array of VRAM pages
> > > > > (destination)
> > > > > +	 * @dma_addr: Pointer to array of DMA addresses
> > > > > (source)
> > > > > +	 * @npages: Number of pages to copy
> > > > > +	 *
> > > > > +	 * This function shall copy pages to VRAM.
> > > > > +	 *
> > > > > +	 * Returns:
> > > > > +	 * 0 on success, a negative error code on failure.
> > > > > +	 */
> > > > > +	int (*copy_to_vram)(struct drm_gpusvm *gpusvm,
> > > > > +			    struct page **pages,
> > > > > +			    dma_addr_t *dma_addr,
> > > > > +			    unsigned long npages);
> > > > > +
> > > > > +	/**
> > > > > +	 * @copy_to_sram: Copy to system RAM (required for
> > > > > migration)
> > > > > +	 * @gpusvm: Pointer to the GPU SVM
> > > > > +	 * @pages: Pointer to array of VRAM pages (source)
> > > > > +	 * @dma_addr: Pointer to array of DMA addresses
> > > > > (destination)
> > > > > +	 * @npages: Number of pages to copy
> > > > > +	 *
> > > > > +	 * This function shall copy pages to system RAM.
> > > > > +	 *
> > > > > +	 * Returns:
> > > > > +	 * 0 on success, a negative error code on failure.
> > > > > +	 */
> > > > > +	int (*copy_to_sram)(struct drm_gpusvm *gpusvm,
> > > > > +			    struct page **pages,
> > > > > +			    dma_addr_t *dma_addr,
> > > > > +			    unsigned long npages);
> > > > > +
> > > > > +	/**
> > > > > +	 * @invalidate: Invalidate GPU SVM notifier
> > > > > (required)
> > > > > +	 * @gpusvm: Pointer to the GPU SVM
> > > > > +	 * @notifier: Pointer to the GPU SVM notifier
> > > > > +	 * @mmu_range: Pointer to the mmu_notifier_range
> > > > > structure
> > > > > +	 *
> > > > > +	 * This function shall invalidate the GPU page
> > > > > tables.
> > > > > It
> > > > > can safely
> > > > > +	 * walk the notifier range RB tree/list in this
> > > > > function.
> > > > > Called while
> > > > > +	 * holding the notifier lock.
> > > > > +	 */
> > > > > +	void (*invalidate)(struct drm_gpusvm *gpusvm,
> > > > > +			   struct drm_gpusvm_notifier
> > > > > *notifier,
> > > > > +			   const struct mmu_notifier_range
> > > > > *mmu_range);
> > > > > +};
> > > > > +
> > > > > +/**
> > > > > + * struct drm_gpusvm_notifier - Structure representing a GPU
> > > > > SVM
> > > > > notifier
> > > > > + *
> > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > + * @notifier: MMU interval notifier
> > > > > + * @interval: Interval for the notifier
> > > > > + * @rb: Red-black tree node for the parent GPU SVM structure
> > > > > notifier tree
> > > > > + * @root: Cached root node of the RB tree containing ranges
> > > > > + * @range_list: List head containing ranges in the same
> > > > > order
> > > > > they appear in
> > > > > + *              interval tree. This is useful to keep
> > > > > iterating
> > > > > ranges while
> > > > > + *              doing modifications to RB tree.
> > > > > + * @flags.removed: Flag indicating whether the MMU interval
> > > > > notifier
> > > > > has been
> > > > > + *                 removed
> > > > > + *
> > > > > + * This structure represents a GPU SVM notifier.
> > > > > + */
> > > > > +struct drm_gpusvm_notifier {
> > > > > +	struct drm_gpusvm *gpusvm;
> > > > > +	struct mmu_interval_notifier notifier;
> > > > > +	struct {
> > > > > +		u64 start;
> > > > > +		u64 end;
> > > > > +	} interval;
> > > > > +	struct {
> > > > > +		struct rb_node node;
> > > > > +		struct list_head entry;
> > > > > +		u64 __subtree_last;
> > > > > +	} rb;
> > > > > +	struct rb_root_cached root;
> > > > > +	struct list_head range_list;
> > > > > +	struct {
> > > > > +		u32 removed : 1;
> > > > > +	} flags;
> > > > > +};
> > > > > +
> > > > > +/**
> > > > > + * struct drm_gpusvm_range - Structure representing a GPU
> > > > > SVM
> > > > > range
> > > > > + *
> > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > + * @notifier: Pointer to the GPU SVM notifier
> > > > > + * @refcount: Reference count for the range
> > > > > + * @rb: Red-black tree node for the parent GPU SVM notifier
> > > > > structure range tree
> > > > > + * @va: Virtual address range
> > > > > + * @notifier_seq: Notifier sequence number of the range's
> > > > > pages
> > > > > + * @pages: Pointer to the array of pages (if backing store
> > > > > is in
> > > > > VRAM)
> > > > > + * @dma_addr: DMA address array (if backing store is SRAM
> > > > > and
> > > > > DMA
> > > > > mapped)
> > > > > + * @vram_allocation: Driver-private pointer to the VRAM
> > > > > allocation
> > > > > + * @order: Order of dma mapping. i.e. PAGE_SIZE << order is
> > > > > mapping
> > > > > size
> > > > > + * @flags.migrate_vram: Flag indicating whether the range
> > > > > can be
> > > > > migrated to VRAM
> > > > > + * @flags.unmapped: Flag indicating if the range has been
> > > > > unmapped
> > > > > + * @flags.partial_unmap: Flag indicating if the range has
> > > > > been
> > > > > partially unmapped
> > > > > + * @flags.has_vram_pages: Flag indicating if the range has
> > > > > vram
> > > > > pages
> > > > > + * @flags.has_dma_mapping: Flag indicating if the range has
> > > > > a
> > > > > DMA
> > > > > mapping
> > > > > + * @flags.kfree_mapping: Flag indicating @dma_addr is a
> > > > > compact
> > > > > allocation based
> > > > > + *                       on @order which releases via kfree
> > > > > + *
> > > > > + * This structure represents a GPU SVM range used for
> > > > > tracking
> > > > > memory ranges
> > > > > + * mapped in a DRM device.
> > > > > + */
> > > > > +struct drm_gpusvm_range {
> > > > > +	struct drm_gpusvm *gpusvm;
> > > > > +	struct drm_gpusvm_notifier *notifier;
> > > > > +	struct kref refcount;
> > > > > +	struct {
> > > > > +		struct rb_node node;
> > > > > +		struct list_head entry;
> > > > > +		u64 __subtree_last;
> > > > > +	} rb;
> > > > > +	struct {
> > > > > +		u64 start;
> > > > > +		u64 end;
> > > > > +	} va;
> > > > > +	unsigned long notifier_seq;
> > > > > +	union {
> > > > > +		struct page **pages;
> > > > > +		dma_addr_t *dma_addr;
> > > > > +	};
> > > > > +	void *vram_allocation;
> > > > > +	u16 order;
> > > > > +	struct {
> > > > > +		/* All flags below must be set upon creation
> > > > > */
> > > > > +		u16 migrate_vram : 1;
> > > > > +		/* All flags below must be set / cleared
> > > > > under
> > > > > notifier lock */
> > > > > +		u16 unmapped : 1;
> > > > > +		u16 partial_unmap : 1;
> > > > > +		u16 has_vram_pages : 1;
> > > > > +		u16 has_dma_mapping : 1;
> > > > > +		u16 kfree_mapping : 1;
> > > > > +	} flags;
> > > > > +};
> > > > > +
> > > > > +/**
> > > > > + * struct drm_gpusvm - GPU SVM structure
> > > > > + *
> > > > > + * @name: Name of the GPU SVM
> > > > > + * @drm: Pointer to the DRM device structure
> > > > > + * @mm: Pointer to the mm_struct for the address space
> > > > > + * @device_private_page_owner: Device private pages owner
> > > > > + * @mm_start: Start address of GPU SVM
> > > > > + * @mm_range: Range of the GPU SVM
> > > > > + * @notifier_size: Size of individual notifiers
> > > > > + * @ops: Pointer to the operations structure for GPU SVM
> > > > > + * @chunk_sizes: Pointer to the array of chunk sizes used in
> > > > > range
> > > > > allocation.
> > > > > + *               Entries should be powers of 2 in descending
> > > > > order.
> > > > > + * @num_chunks: Number of chunks
> > > > > + * @notifier_lock: Read-write semaphore for protecting
> > > > > notifier
> > > > > operations
> > > > > + * @zdd_wq: Workqueue for deferred work on zdd destruction
> > > > > + * @root: Cached root node of the Red-Black tree containing
> > > > > GPU
> > > > > SVM
> > > > > notifiers
> > > > > + * @notifier_list: List head containing notifiers in the
> > > > > same
> > > > > order they
> > > > > + *                 appear in interval tree. This is useful
> > > > > to
> > > > > keep
> > > > > iterating
> > > > > + *                 notifiers while doing modifications to RB
> > > > > tree.
> > > > > + *
> > > > > + * This structure represents a GPU SVM (Shared Virtual
> > > > > Memory)
> > > > > used
> > > > > for tracking
> > > > > + * memory ranges mapped in a DRM (Direct Rendering Manager)
> > > > > device.
> > > > > + *
> > > > > + * No reference counting is provided, as this is expected to
> > > > > be
> > > > > embedded in the
> > > > > + * driver VM structure along with the struct drm_gpuvm,
> > > > > which
> > > > > handles reference
> > > > > + * counting.
> > > > > + */
> > > > > +struct drm_gpusvm {
> > > > > +	const char *name;
> > > > > +	struct drm_device *drm;
> > > > > +	struct mm_struct *mm;
> > > > > +	void *device_private_page_owner;
> > > > > +	u64 mm_start;
> > > > > +	u64 mm_range;
> > > > > +	u64 notifier_size;
> > > > > +	const struct drm_gpusvm_ops *ops;
> > > > > +	const u64 *chunk_sizes;
> > > > > +	int num_chunks;
> > > > > +	struct rw_semaphore notifier_lock;
> > > > > +	struct workqueue_struct *zdd_wq;
> > > > > +	struct rb_root_cached root;
> > > > > +	struct list_head notifier_list;
> > > > > +};
> > > > > +
> > > > > +/**
> > > > > + * struct drm_gpusvm_ctx - DRM GPU SVM context
> > > > > + *
> > > > > + * @mmap_locked: mmap lock is locked
> > > > > + * @trylock_mmap: trylock mmap lock, used to avoid locking
> > > > > inversions
> > > > > + *                (e.g. dma-resv -> mmap lock)
> > > > > + * @in_notifier: entering from a MMU notifier
> > > > > + * @read_only: operating on read-only memory
> > > > > + * @vram_possible: possible to use VRAM
> > > > > + * @prefault: prefault pages
> > > > > + *
> > > > > + * Context that DRM GPUSVM is operating in (i.e. user
> > > > > arguments).
> > > > > + */
> > > > > +struct drm_gpusvm_ctx {
> > > > > +	u32 mmap_locked :1;
> > > > > +	u32 trylock_mmap :1;
> > > > > +	u32 in_notifier :1;
> > > > > +	u32 read_only :1;
> > > > > +	u32 vram_possible :1;
> > > > > +	u32 prefault :1;
> > > > > +};
> > > > > +
> > > > > +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> > > > > +		    const char *name, struct drm_device
> > > > > *drm,
> > > > > +		    struct mm_struct *mm, void
> > > > > *device_private_page_owner,
> > > > > +		    u64 mm_start, u64 mm_range, u64
> > > > > notifier_size,
> > > > > +		    const struct drm_gpusvm_ops *ops,
> > > > > +		    const u64 *chunk_sizes, int num_chunks);
> > > > > +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm);
> > > > > +void drm_gpusvm_free(struct drm_gpusvm *gpusvm);
> > > > > +
> > > > > +struct drm_gpusvm_range *
> > > > > +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm,
> > > > > u64
> > > > > fault_addr,
> > > > > +				u64 gpuva_start, u64
> > > > > gpuva_end,
> > > > > +				const struct drm_gpusvm_ctx
> > > > > *ctx);
> > > > > +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> > > > > +			     struct drm_gpusvm_range
> > > > > *range);
> > > > > +
> > > > > +struct drm_gpusvm_range *
> > > > > +drm_gpusvm_range_get(struct drm_gpusvm_range *range);
> > > > > +void drm_gpusvm_range_put(struct drm_gpusvm_range *range);
> > > > > +
> > > > > +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> > > > > +				  struct drm_gpusvm_range
> > > > > *range);
> > > > > +
> > > > > +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> > > > > +			       struct drm_gpusvm_range
> > > > > *range,
> > > > > +			       const struct drm_gpusvm_ctx
> > > > > *ctx);
> > > > > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> > > > > +				  struct drm_gpusvm_range
> > > > > *range,
> > > > > +				  const struct
> > > > > drm_gpusvm_ctx
> > > > > *ctx);
> > > > > +
> > > > > +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> > > > > +			       struct drm_gpusvm_range
> > > > > *range,
> > > > > +			       void *vram_allocation,
> > > > > +			       const struct drm_gpusvm_ctx
> > > > > *ctx);
> > > > > +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > > > > +			       struct drm_gpusvm_range
> > > > > *range,
> > > > > +			       const struct drm_gpusvm_ctx
> > > > > *ctx);
> > > > > +
> > > > > +const struct dev_pagemap_ops
> > > > > *drm_gpusvm_pagemap_ops_get(void);
> > > > > +
> > > > > +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64
> > > > > start,
> > > > > u64 end);
> > > > > +
> > > > > +struct drm_gpusvm_range *
> > > > > +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier,
> > > > > u64
> > > > > start, u64 end);
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_notifier_lock - Lock GPU SVM notifier
> > > > > + * @gpusvm__: Pointer to the GPU SVM structure.
> > > > > + *
> > > > > + * Abstract client usage GPU SVM notifier lock, take lock
> > > > > + */
> > > > > +#define drm_gpusvm_notifier_lock(gpusvm__)	\
> > > > > +	down_read(&(gpusvm__)->notifier_lock)
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_notifier_unlock - Unlock GPU SVM notifier
> > > > > + * @gpusvm__: Pointer to the GPU SVM structure.
> > > > > + *
> > > > > + * Abstract client usage GPU SVM notifier lock, drop lock
> > > > > + */
> > > > > +#define drm_gpusvm_notifier_unlock(gpusvm__)	\
> > > > > +	up_read(&(gpusvm__)->notifier_lock)
> > > > > +
> > > > > +/**
> > > > > + * __drm_gpusvm_range_next - Get the next GPU SVM range in
> > > > > the
> > > > > list
> > > > > + * @range: a pointer to the current GPU SVM range
> > > > > + *
> > > > > + * Return: A pointer to the next drm_gpusvm_range if
> > > > > available,
> > > > > or
> > > > > NULL if the
> > > > > + *         current range is the last one or if the input
> > > > > range
> > > > > is
> > > > > NULL.
> > > > > + */
> > > > > +static inline struct drm_gpusvm_range *
> > > > > +__drm_gpusvm_range_next(struct drm_gpusvm_range *range)
> > > > > +{
> > > > > +	if (range && !list_is_last(&range->rb.entry,
> > > > > +				   &range->notifier->range_list))
> > > > > +		return list_next_entry(range, rb.entry);
> > > > > +
> > > > > +	return NULL;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_for_each_range - Iterate over GPU SVM ranges
> > > > > in a
> > > > > notifier
> > > > > + * @range__: Iterator variable for the ranges. If set, it
> > > > > indicates
> > > > > the start of
> > > > > + *	     the iterator. If NULL, call
> > > > > drm_gpusvm_range_find()
> > > > > to
> > > > > get the range.
> > > > > + * @notifier__: Pointer to the GPU SVM notifier
> > > > > + * @start__: Start address of the range
> > > > > + * @end__: End address of the range
> > > > > + *
> > > > > + * This macro is used to iterate over GPU SVM ranges in a
> > > > > notifier.
> > > > > It is safe
> > > > > + * to use while holding the driver SVM lock or the notifier
> > > > > lock.
> > > > > + */
> > > > > +#define drm_gpusvm_for_each_range(range__, notifier__,
> > > > > start__,
> > > > > end__)	\
> > > > > +	for ((range__) = (range__)
> > > > > ?:					\
> > > > > +	     drm_gpusvm_range_find((notifier__), (start__),
> > > > > (end__));	\
> > > > > +	     (range__) && (range__->va.start <
> > > > > (end__));		\
> > > > > +	     (range__) = __drm_gpusvm_range_next(range__))
> > > > > +
> > > > > +/**
> > > > > + * drm_gpusvm_range_set_unmapped - Mark a GPU SVM range as
> > > > > unmapped
> > > > > + * @range: Pointer to the GPU SVM range structure.
> > > > > + * @mmu_range: Pointer to the MMU notifier range structure.
> > > > > + *
> > > > > + * This function marks a GPU SVM range as unmapped and sets
> > > > > the
> > > > > partial_unmap flag
> > > > > + * if the range partially falls within the provided MMU
> > > > > notifier
> > > > > range.
> > > > > + */
> > > > > +static inline void
> > > > > +drm_gpusvm_range_set_unmapped(struct drm_gpusvm_range
> > > > > *range,
> > > > > +			      const struct
> > > > > mmu_notifier_range
> > > > > *mmu_range)
> > > > > +{
> > > > > +	lockdep_assert_held_write(&range->gpusvm->notifier_lock);
> > > > > +
> > > > > +	range->flags.unmapped = true;
> > > > > +	if (range->va.start < mmu_range->start ||
> > > > > +	    range->va.end > mmu_range->end)
> > > > > +		range->flags.partial_unmap = true;
> > > > > +}
> > > > > +
> > > > > +#endif /* __DRM_GPUSVM_H__ */
> > > > 
> > 


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory
  2024-08-28  2:48 ` [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory Matthew Brost
                     ` (3 preceding siblings ...)
  2024-08-29  9:45   ` Daniel Vetter
@ 2024-08-30  9:16   ` Thomas Hellström
  2024-09-02 12:20     ` Daniel Vetter
  2024-09-06 18:41   ` Zeng, Oak
                     ` (2 subsequent siblings)
  7 siblings, 1 reply; 100+ messages in thread
From: Thomas Hellström @ 2024-08-30  9:16 UTC (permalink / raw)
  To: Matthew Brost, intel-xe, dri-devel
  Cc: airlied, christian.koenig, matthew.auld, daniel

Hi, Matthew

On Tue, 2024-08-27 at 19:48 -0700, Matthew Brost wrote:
> +/**
> + * DOC: Overview
> + *
> + * GPU Shared Virtual Memory (GPU SVM) layer for the Direct
> Rendering Manager (DRM)
> + *
> + * The GPU SVM layer is a component of the DRM framework designed to
> manage shared
> + * virtual memory between the CPU and GPU. It enables efficient data
> exchange and
> + * processing for GPU-accelerated applications by allowing memory
> sharing and
> + * synchronization between the CPU's and GPU's virtual address
> spaces.
> + *
> + * Key GPU SVM Components:
> + * - Notifiers: Used for tracking memory intervals and notifying the
> + *		GPU of changes, notifiers are sized based on a GPU SVM
> + *		initialization parameter, with a recommendation of 512M or
> + *		larger. They maintain a Red-Black tree and a list of ranges
> + *		that fall within the notifier interval. Notifiers are tracked
> + *		within a GPU SVM Red-Black tree and list and are
> dynamically inserted
> + *		or removed as ranges within the interval are created
> or
> + *		destroyed.
> + * - Ranges: Represent memory ranges mapped in a DRM device and
> managed
> + *	     by GPU SVM. They are sized based on an array of chunk
> sizes, which
> + *	     is a GPU SVM initialization parameter, and the CPU
> address space.
> + *	     Upon GPU fault, the largest aligned chunk that fits
> within the
> + *	     faulting CPU address space is chosen for the range
> size. Ranges are
> + *	     expected to be dynamically allocated on GPU fault and
> removed on an
> + *	     MMU notifier UNMAP event. As mentioned above, ranges
> are tracked in
> + *	     a notifier's Red-Black tree.
> + * - Operations: Define the interface for driver-specific SVM
> operations such as
> + *		 allocation, page collection, migration,
> invalidations, and VRAM
> + *		 release.
> + *

Another question: since ranges, as I understand it, are per gpuvm and
per cpu_mm, whereas migration is per device and per cpu_mm (we might
have multiple gpuvms mapping the same cpu_mm), I figure the gpu_svm is
per gpuvm, but that makes migration currently inconsistent, right?

/Thomas


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory
  2024-08-29 20:56         ` Matthew Brost
  2024-08-30  8:18           ` Thomas Hellström
@ 2024-08-30  9:57           ` Thomas Hellström
  2024-08-30 13:47             ` Matthew Brost
  2024-09-02 12:33           ` Daniel Vetter
  2 siblings, 1 reply; 100+ messages in thread
From: Thomas Hellström @ 2024-08-30  9:57 UTC (permalink / raw)
  To: Matthew Brost
  Cc: intel-xe, dri-devel, airlied, christian.koenig, matthew.auld,
	daniel

Hi, Matthew,

Agreed the below might not be important just now, but some ideas:

On Thu, 2024-08-29 at 20:56 +0000, Matthew Brost wrote:
> Issues with removing a SVM range:
> 
> - Xe bind code stores invalidation / present state in VMA, this would
>   need to be moved to the radix tree. I have Jira open for that work
>   which I believe other developers are going to own.

Yeah, although we shouldn't *design* around xe bind-code and page-table
code shortcomings.


> - Where would the dma mapping / device pages be stored?
> 	- In the radix tree? What if ATS is enabled? We don't have a
> 	  driver owned radix tree. How do we reasonably connect a
> driver
> 	  owned radix to a common GPUSVM layer?

With ATS you mean IOMMU SVA, right? I think we could assume that any
user of this code also has a gpu page-table since otherwise they
couldn't be using VRAM and a simpler solution would be in place. 

But to that specific question, drm_gpusvm state would live in a
drm_gpusvm radix tree and driver-specific stuff in the driver tree. A
helper based approach would then call drm_gpusvm_unmap_dma(range),
whereas a middle layer would just traverse the tree and unmap.
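
To make the two flavours concrete, a minimal sketch of the helper-based
variant; drm_gpusvm_unmap_dma() and xe_svm_range are made-up names for
this illustration, not code from the series:

/* Hypothetical driver-side range embedding the common GPUSVM state. */
struct xe_svm_range {
	struct drm_gpusvm_range base;	/* common state, owned by drm_gpusvm */
	/* driver-only bookkeeping (GPU PTE state, invalidation flags, ...) */
};

static void xe_svm_range_unbind(struct drm_gpusvm *gpusvm,
				struct xe_svm_range *vr)
{
	/* Driver tears down its own page-table state first ... */

	/*
	 * ... then asks the common layer to drop the dma mappings for the
	 * range. drm_gpusvm_unmap_dma() is the hypothetical helper discussed
	 * above; a middle-layer design would instead walk the drm_gpusvm
	 * tree itself and unmap everything it finds.
	 */
	drm_gpusvm_unmap_dma(gpusvm, &vr->base);
}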

> 	- In the notifier? What if the notifier is sparsely populated?
> 	  We would be wasting huge amounts of memory. What if the
> 	  notifier is configured to span the entire virtual address
> 	  space?

Let's assume you use a fake page-table like in xe_pt_walk.c as your
"radix tree", adapted to relevant page-sizes, sparsity is not a
problem.
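
For illustration, a sketch of what I mean by sparsity falling out
naturally; directory levels are only allocated where state actually
exists (all names invented):

#define SVM_PT_SHIFT	21		/* track state at 2M granularity */
#define SVM_PT_ENTRIES	512		/* one directory covers 1G of VA  */

struct svm_pt_leaf {			/* per-2M state, allocated on demand */
	unsigned long notifier_seq;
	dma_addr_t *dma_addr;
};

struct svm_pt_dir {			/* sparse: NULL where nothing is mapped */
	struct svm_pt_leaf *leaf[SVM_PT_ENTRIES];
};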

> - How does the garbage collector work? We can't allocate memory in the
>   notifier so we don't have anything to add to the garbage collector. We
>   can't directly modify page tables given you need locks in the path of
>   reclaim.

The garbage collector would operate on the whole invalidated range. In
the case of xe, upon zapping under reclaim you mark individual page-
table bos that are to be removed as "invalid"; the garbage collector
walks the range removing the "invalid" entries. Subsequent (re-)binding
avoids the "invalid" entries (perhaps even helps removing them) and
can thus race with the garbage collector. Hence, any ranges implied by
the page-table code are eliminated.
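
A minimal sketch of how I picture that split, where the "invalid" flag,
the iterator and the helpers are made-up names just to show how the
reclaim-safe zap and the later walk fit together:

/* Invalidation under reclaim: no allocations, only mark what must go. */
static void gpu_pt_zap_range(struct gpu_pt *pt, u64 start, u64 end)
{
	struct gpu_pt_entry *e;

	for_each_gpu_pt_entry(e, pt, start, end) {	/* placeholder iterator */
		gpu_pt_write_pte(e, 0);			/* zap the HW PTE      */
		e->invalid = true;			/* defer the teardown  */
	}
}

/* Garbage collector / re-bind: may allocate, removes "invalid" entries. */
static void gpu_pt_collect_range(struct gpu_pt *pt, u64 start, u64 end)
{
	struct gpu_pt_entry *e;

	for_each_gpu_pt_entry(e, pt, start, end)
		if (e->invalid)
			gpu_pt_free_entry(pt, e);
}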

> - How do we deal with fault storms (e.g. tons of faults hitting the same
>   SVM range in a row)? Without an SVM range there is no way to know if a
>   mapping is valid so that the GPU page fault handler can be short
>   circuited.

Perhaps look at page-table tree and check whether the gpu_pte causing
the fault is valid.
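
i.e. something like the below at the top of the fault handler, where
gpu_pte_valid() and the other names stand in for whatever page-table walk
the driver has (sketch only):

/* Short-circuit a fault storm without an SVM range struct. */
static int gpu_handle_pagefault(struct gpu_vm *vm, u64 fault_addr)
{
	/* Another fault for the same page may already have been serviced. */
	if (gpu_pte_valid(vm, fault_addr))	/* placeholder PT walk */
		return 0;

	return gpu_fault_and_bind(vm, fault_addr);	/* slow path */
}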

> - Do we have notifier seqno for every PTE?

I'd say no. With this approach it makes sense to have a wide notifier.
The seqno now only affects binding of new gpu_ptes, so the problem with
a wide notifier becomes that if invalidation occurs to *any* part of
the notifier while we're in the read section during binding, we need to
rerun the binding. Adding more notifiers to mitigate that would be to
optimize faulting performance over core invalidation performance which
Jason asked us to avoid.
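
So the bind path would keep the usual mmu_interval_notifier seqno pattern,
just against one wide notifier. Rough sketch below; bind_gpu_ptes() is a
stand-in for the driver bind, the rest follows the existing helpers:

static int bind_range_wide_notifier(struct drm_gpusvm *gpusvm,
				    struct drm_gpusvm_notifier *notifier,
				    struct drm_gpusvm_range *range)
{
	unsigned long seq;
	int err;

again:
	seq = mmu_interval_read_begin(&notifier->notifier);

	/* collect pages / dma-map outside the notifier lock ... */

	drm_gpusvm_notifier_lock(gpusvm);
	if (mmu_interval_read_retry(&notifier->notifier, seq)) {
		/* an invalidation anywhere in the wide interval => redo */
		drm_gpusvm_notifier_unlock(gpusvm);
		goto again;
	}
	err = bind_gpu_ptes(range);		/* placeholder driver bind */
	drm_gpusvm_notifier_unlock(gpusvm);

	return err;
}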

/Thomas




^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory
  2024-08-30  9:57           ` Thomas Hellström
@ 2024-08-30 13:47             ` Matthew Brost
  2024-09-02  9:45               ` Thomas Hellström
  0 siblings, 1 reply; 100+ messages in thread
From: Matthew Brost @ 2024-08-30 13:47 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: intel-xe, dri-devel, airlied, christian.koenig, matthew.auld,
	daniel

On Fri, Aug 30, 2024 at 11:57:33AM +0200, Thomas Hellström wrote:
> Hi, Matthew,
> 
> Agreed the below might not be important just now, but some ideas:
> 
> On Thu, 2024-08-29 at 20:56 +0000, Matthew Brost wrote:
> > Issues with removing a SVM range:
> > 
> > - Xe bind code stores invalidation / present state in VMA, this would
> >   need to be moved to the radix tree. I have Jira open for that work
> >   which I believe other developers are going to own.
> 
> Yeah, although we shouldn't *design* around xe bind-code and page-table
> code shortcomings.
> 

I'm thinking this one certainly should be fixed sooner rather than
later which would be helpful.

But let's also consider the case where we get a bunch of individual page
invalidates serially for an entire range (I can't remember when this
happens but I have seen it in my testing, will look into this more to
figure out exactly when). If we invalidate 1 page at a time in the radix
tree, each invalidation could potentially result in a TLB invalidation
interaction with the hardware in cases where larger GPU pages are not
being used. The TLB invalidation is going to be vastly slower than any CPU
operation (e.g. RB search, radix tree walk). If we key on a range and
invalidate the entire range once on the first invalidation, this may end
up being significantly faster.
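
In other words the hypothetical difference is between per-page and
per-range invalidation; the names and the "invalidated" flag below are
invented for the comparison, not existing code:

/* Per-page: potentially one TLB invalidation round trip per CPU event. */
gpu_pt_zap(vm, addr, addr + PAGE_SIZE);
gpu_tlb_inval(vm, addr, addr + PAGE_SIZE);

/*
 * Keyed on the range: the first invalidation zaps and flushes the whole
 * range, later events hitting the same range become cheap no-ops.
 */
if (!range->invalidated) {
	gpu_pt_zap(vm, range->va.start, range->va.end);
	gpu_tlb_inval(vm, range->va.start, range->va.end);
	range->invalidated = true;
}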

Above is pure speculation though, a lot of what both of us are saying
is... So another reason I'd like to get apps running to do profiling. It
would be nice to make design decisions based on data not speculation.

> 
> > - Where would the dma mapping / device pages be stored?
> > 	- In the radix tree? What if ATS is enabled? We don't have a
> > 	  driver owned radix tree. How do we reasonably connect a
> > driver
> > 	  owned radix to a common GPUSVM layer?
> 
> With ATS you mean IOMMU SVA, right? I think we could assume that any
> user of this code also has a gpu page-table since otherwise they
> couldn't be using VRAM and a simpler solution would be in place. 
>

Fair point.

> But to that specific question, drm_gpusvm state would live in a
> drm_gpusvm radix tree and driver-specific stuff in the driver tree. A
> helper based approach would then call drm_gpusvm_unmap_dma(range),
> whereas a middle layer would just traverse the tree and unmap.
> 

Let me consider this. Open to all options.

> > 	- In the notifier? What if the notifier is sparsely populated?
> > 	  We would be wasting huge amounts of memory. What if the
> > 	  notifier is configured to span the entire virtual address
> > 	  space?
> 
> Let's assume you use a fake page-table like in xe_pt_walk.c as your
> "radix tree", adapted to relevant page-sizes, sparsity is not a
> problem.
>

Ok, makes sense I think.

> > - How does the garbage collector work? We can't allocate memory in the
> >   notifier so we don't have anything to add to the garbage collector. We
> >   can't directly modify page tables given you need locks in the path of
> >   reclaim.
> 
> The garbage collector would operate on the whole invalidated range. In
> the case of xe, upon zapping under reclaim you mark individual page-
> table bos that are to be removed as "invalid"; the garbage collector
> walks the range removing the "invalid" entries. Subsequent (re-)binding
> avoids the "invalid" entries (perhaps even helps removing them) and
> can thus race with the garbage collector. Hence, any ranges implied by
> the page-table code are eliminated.
> 

This is pretty much what I came up with too for the case where we don't
have an SVM range.

> > - How do we deal with fault storms (e.g. tons of faults hitting the same
> >   SVM range in a row)? Without an SVM range there is no way to know if a
> >   mapping is valid so that the GPU page fault handler can be short
> >   circuited.
> 
> Perhaps look at page-table tree and check whether the gpu_pte causing
> the fault is valid.
> 

Came up with the same thing.

> > - Do we have notifier seqno for every PTE?
> 
> I'd say no. With this approach it makes sense to have a wide notifier.
> The seqno now only affects binding of new gpu_ptes, so the problem with
> a wide notifier becomes that if invalidation occurs to *any* part of
> the notifier while we're in the read section during binding, we need to

I have avoided this with drm_gpusvm_range_pages_valid. This isn't just
an optimization, it is actually required for the 2 tile case to be able
to safely know when dma pages can be unmapped (i.e. you can't dma unmap
pages if either tile has a valid mapping).
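
Roughly, the gate looks like this; the unmap helper is a placeholder and
the real wiring across two tiles is more involved, this is just the shape
of the check:

static void try_unmap_range_dma(struct drm_gpusvm *gpusvm,
				struct drm_gpusvm_range *range)
{
	drm_gpusvm_notifier_lock(gpusvm);
	/*
	 * DMA pages may only be torn down once no GPU VM still has a valid
	 * binding backed by them; the common-layer check answers that under
	 * the notifier lock.
	 */
	if (!drm_gpusvm_range_pages_valid(gpusvm, range))
		__unmap_and_free_range_dma(range);	/* placeholder unmap path */
	drm_gpusvm_notifier_unlock(gpusvm);
}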

Matt

> rerun the binding. Adding more notifiers to mitigate that would be to
> optimize faulting performance over core invalidation performance which
> Jason asked us to avoid.
> 
> /Thomas
> 
> 
> 

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory
  2024-08-30  8:18           ` Thomas Hellström
@ 2024-08-30 13:58             ` Matthew Brost
  2024-09-02  9:57               ` Thomas Hellström
  0 siblings, 1 reply; 100+ messages in thread
From: Matthew Brost @ 2024-08-30 13:58 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: intel-xe, dri-devel, airlied, christian.koenig, matthew.auld,
	daniel

On Fri, Aug 30, 2024 at 10:18:58AM +0200, Thomas Hellström wrote:
> Hi, Matthew,
> 
> On Thu, 2024-08-29 at 20:56 +0000, Matthew Brost wrote:
> > On Thu, Aug 29, 2024 at 09:18:29PM +0200, Thomas Hellström wrote:
> > > Hi, Matthew,
> > > 
> > > On Thu, 2024-08-29 at 17:45 +0000, Matthew Brost wrote:
> > > > On Thu, Aug 29, 2024 at 11:16:49AM +0200, Thomas Hellström wrote:
> > > > > Hi, Matt. 
> > > > > 
> > > > > Some initial design comments / questions:
> > > > > 
> > > > > On Tue, 2024-08-27 at 19:48 -0700, Matthew Brost wrote:
> > > > > > This patch introduces support for GPU Shared Virtual Memory
> > > > > > (SVM)
> > > > > > in
> > > > > > the
> > > > > > Direct Rendering Manager (DRM) subsystem. SVM allows for
> > > > > > seamless
> > > > > > sharing of memory between the CPU and GPU, enhancing
> > > > > > performance
> > > > > > and
> > > > > > flexibility in GPU computing tasks.
> > > > > > 
> > > > > > The patch adds the necessary infrastructure for SVM,
> > > > > > including
> > > > > > data
> > > > > > structures and functions for managing SVM ranges and
> > > > > > notifiers.
> > > > > > It
> > > > > > also
> > > > > > provides mechanisms for allocating, deallocating, and
> > > > > > migrating
> > > > > > memory
> > > > > > regions between system RAM and GPU VRAM.
> > > > > > 
> > > > > > This mid-layer is largely inspired by GPUVM.
> > > > > > 
> > > > > > Cc: Dave Airlie <airlied@redhat.com>
> > > > > > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > > > > Cc: Christian König <christian.koenig@amd.com>
> > > > > > Cc: <dri-devel@lists.freedesktop.org>
> > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > > > ---
> > > > > >  drivers/gpu/drm/xe/Makefile     |    3 +-
> > > > > >  drivers/gpu/drm/xe/drm_gpusvm.c | 2174
> > > > > > +++++++++++++++++++++++++++++++
> > > > > >  drivers/gpu/drm/xe/drm_gpusvm.h |  415 ++++++
> > > > > >  3 files changed, 2591 insertions(+), 1 deletion(-)
> > > > > >  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.c
> > > > > >  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.h
> > > > > > 
> > > > > > diff --git a/drivers/gpu/drm/xe/Makefile
> > > > > > b/drivers/gpu/drm/xe/Makefile
> > > > > > index b9670ae09a9e..b8fc2ee58f1a 100644
> > > > > > --- a/drivers/gpu/drm/xe/Makefile
> > > > > > +++ b/drivers/gpu/drm/xe/Makefile
> > > > > > @@ -25,7 +25,8 @@ $(obj)/generated/%_wa_oob.c
> > > > > > $(obj)/generated/%_wa_oob.h: $(obj)/xe_gen_wa_oob \
> > > > > >  
> > > > > >  # core driver code
> > > > > >  
> > > > > > -xe-y += xe_bb.o \
> > > > > > +xe-y += drm_gpusvm.o \
> > > > > > +	xe_bb.o \
> > > > > >  	xe_bo.o \
> > > > > >  	xe_bo_evict.o \
> > > > > >  	xe_devcoredump.o \
> > > > > > diff --git a/drivers/gpu/drm/xe/drm_gpusvm.c
> > > > > > b/drivers/gpu/drm/xe/drm_gpusvm.c
> > > > > > new file mode 100644
> > > > > > index 000000000000..fc1e44e6ae72
> > > > > > --- /dev/null
> > > > > > +++ b/drivers/gpu/drm/xe/drm_gpusvm.c
> > > > > > @@ -0,0 +1,2174 @@
> > > > > > +// SPDX-License-Identifier: MIT
> > > > > > +/*
> > > > > > + * Copyright © 2024 Intel Corporation
> > > > > > + *
> > > > > > + * Authors:
> > > > > > + *     Matthew Brost <matthew.brost@intel.com>
> > > > > > + */
> > > > > > +
> > > > > > +#include <linux/dma-mapping.h>
> > > > > > +#include <linux/interval_tree_generic.h>
> > > > > > +#include <linux/hmm.h>
> > > > > > +#include <linux/memremap.h>
> > > > > > +#include <linux/migrate.h>
> > > > > > +#include <linux/mm_types.h>
> > > > > > +#include <linux/pagemap.h>
> > > > > > +#include <linux/slab.h>
> > > > > > +
> > > > > > +#include <drm/drm_device.h>
> > > > > > +#include "drm_gpusvm.h"
> > > > > > +
> > > > > > +/**
> > > > > > + * DOC: Overview
> > > > > > + *
> > > > > > + * GPU Shared Virtual Memory (GPU SVM) layer for the Direct
> > > > > > Rendering Manager (DRM)
> > > > > > + *
> > > > > > + * The GPU SVM layer is a component of the DRM framework
> > > > > > designed to
> > > > > > manage shared
> > > > > > + * virtual memory between the CPU and GPU. It enables
> > > > > > efficient
> > > > > > data
> > > > > > exchange and
> > > > > > + * processing for GPU-accelerated applications by allowing
> > > > > > memory
> > > > > > sharing and
> > > > > > + * synchronization between the CPU's and GPU's virtual
> > > > > > address
> > > > > > spaces.
> > > > > > + *
> > > > > > + * Key GPU SVM Components:
> > > > > > + * - Notifiers: Used for tracking memory
> > > > > > intervals
> > > > > > and
> > > > > > notifying the
> > > > > > + *		GPU of changes, notifiers are sized based on
> > > > > > a
> > > > > > GPU
> > > > > > SVM
> > > > > > + *		initialization parameter, with a
> > > > > > recommendation
> > > > > > of
> > > > > > 512M or
> > > > > > + *		larger. They maintain a Red-Black tree and a
> > > > > > list of
> > > > > > ranges that
> > > > > > + *		fall within the notifier interval. Notifiers
> > > > > > are
> > > > > > tracked within
> > > > > > + *		a GPU SVM Red-Black tree and list and are
> > > > > > dynamically inserted
> > > > > > + *		or removed as ranges within the interval are
> > > > > > created
> > > > > > or
> > > > > > + *		destroyed.
> > > > > 
> > > > > What is the benefit of this extra layer compared to direct
> > > > > insertion of
> > > > > ranges using mmu_interval_notifier_insert?
> > > > > 
> > > > > IIRC the argument made previously about having wide notifiers
> > > > > was
> > > > > that
> > > > > the rb tree lookups inside the core were costly and if there
> > > > > were
> > > > > only
> > > > > a few, then the rb tree lookups within a notifier range could
> > > > > be
> > > > > replaced with the page-table radix-tree-like lookup, so each
> > > > > lookup
> > > > > complexity would be O(log(n_notifiers) + page_table_depth).
> > > > > 
> > > > > But now we have first an rb-tree lookup in the core and then an
> > > > > rb-
> > > > > tree
> > > > > lookup within each notifier yielding O(log(n_ranges))
> > > > > 
> > > > > I can see a small benefit in that inserting directly into the
> > > > > core
> > > > > rb-
> > > > > tree will block pending ongoing invalidations, but at a cost of
> > > > > an
> > > > > extra multiplexing layer.
> > > > > 
> > > > 
> > > > So when the notifier is triggered the search is a smaller range.
> > > > In a
> > > > perfect world eventually I'd like to drop the SVM range
> > > > completely.
> > > > There are a lot of changes required in Xe to make that possible,
> > > > and I'm not entirely convinced it is possible or that the ROI is
> > > > worth it
> > > > (additional
> > > > complexity vs. perf benefit). For now, this was a relatively
> > > > simple
> > > > way
> > > > to get SVM working (mirrors both AMD's and Nvidia's implementations
> > > > wrt
> > > > having a range concept) but also is flexible in the sense the
> > > > notifier
> > > > size can be easily tweaked via a modparam [1] following Jason's
> > > > suggestion of larger notifiers.
> > > > 
> > > > [1]
> > > > https://patchwork.freedesktop.org/patch/611007/?series=137870&rev=1
> > > 
> > > What I meant was the core is already implementing the "one notifier
> > > for
> > > the whole range", since your notifier duplicates the
> > > mmu_interval_notifier functionality.
> > > 
> > > The mmu_interval_notifier first does an rbtree search to get to the
> > > notifier, and then drm_gpusvm does an rbtree search to get to the
> > > range.
> > 
> > Yes.
> > 
> > > 
> > > If the svm notifier layer is skipped, mmu_interval_notifier has to
> > > perform a wider rbtree search to get to the range. The point is,
> > > the
> > > complexity is the same for both approaches so there is no point in
> > > adding a svm notifier layer for that reason. The width of the
> > > notifier
> > > just adjust the relative size of the two rbtree searches, so from
> > > that
> > > point of view the drm_gpusvm does not offer any benefit from
> > > inserting
> > > the ranges into the mmu_interval_notifier directly (except that the
> > > mmu_interval_notifier is slightly more heavyweight).
> > > 
> > 
> > I think a large part of it was to avoid inserting / removing many
> > notifiers as that was expensive. Agree the search is not
> > fundamentally
> > faster the way I have this coded. It just avoids heavy inserting /
> > removing of notifiers.
> 
> So I specifically asked Jason about the performance problem about using
> many notifiers vs using a single one, and he responded that the problem
> is slowing down the core mm on invalidations, if the RB tree gets too
> large to walk. He also mentioned that we should consider core
> invalidation performance before faulting performance because the latter
> is so slow anyway we must have the driver stack avoid gpu faults using
> user-space prefetching and similar techniques.
> 
> In particular inserting and removing into the mmu_interval tree is not
> costly in terms of locking but because of correctness requirements
> insertion might block on ongoing validations.
> 
> So basically what I'm trying to say is that as long as we're using SVM
> ranges in the way we do (I'm not saying that is wrong at this point,

If you have been following the mmap write discussions at all, one
potential fix for removing that hack is a per-range migrate mutex [1].
This also needs to be considered when / if we try to drop the range
concept.

[1] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1111296
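
Roughly (sketch; the xe_svm_range embedding below is made up, the point is
just that both the GPU migrate-to-VRAM path and the CPU fault
migrate-to-SRAM path would take the same per-range mutex):

struct xe_svm_range {
	struct drm_gpusvm_range base;
	struct mutex migrate_lock;	/* serializes VRAM <-> SRAM migration */
};

static int xe_svm_range_migrate_to_vram(struct drm_gpusvm *gpusvm,
					struct xe_svm_range *range,
					void *vram_allocation,
					struct drm_gpusvm_ctx *ctx)
{
	int err;

	mutex_lock(&range->migrate_lock);
	err = drm_gpusvm_migrate_to_vram(gpusvm, &range->base,
					 vram_allocation, ctx);
	mutex_unlock(&range->migrate_lock);

	return err;
}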

> and I agree that could be fine-tuned later), The benefit of an extra
> notifier layer is questionable compared to directly inserting the
> ranges into the mmu_interval_tree. So hence my questions, given those
> considerations why this additional layer?
> 

One thing we could do fairly easily, if you think this is questionable, is
add an option to size the notifier to the range size and wire this to the
notifier size modparam [2]. Again, once we have apps running it would be
fairly easy to profile this and see if there is a benefit to the large
notifier scheme. If there really is none, perhaps then we consider ripping
this out.

[2] https://patchwork.freedesktop.org/patch/611007/?series=137870&rev=1
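
Something like the below is what I mean (sketch; the modparam name and the
fallback value are made up here, the real knob is the one in [2]). When the
modparam is 0, size each notifier to the largest chunk so a notifier
effectively covers a single range; the result feeds the notifier_size
argument of drm_gpusvm_init():

static uint xe_svm_notifier_size_mb = 512;
module_param_named(svm_notifier_size_mb, xe_svm_notifier_size_mb, uint, 0600);

static u64 xe_svm_notifier_size(void)
{
	if (!xe_svm_notifier_size_mb)
		return SZ_2M;	/* largest chunk size in Xe */

	return (u64)xe_svm_notifier_size_mb * SZ_1M;
}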

Matt

> Anyway, a more detailed review of the code perhaps helps clear this
> out.
> 
> > 
> > > As I understand it, Jasons comments were based on the assumption
> > > that
> > > the drm_gpusvm search would be radix tree based, and hence with
> > > less
> > > complexity than the rbtree search, and therefore providing a clear
> > > benefit the larger they could be.
> > > 
> > > I.e. just calling something similar to xe_vm_invalidate_xxx over
> > > the
> > > whole range, which will just skip subranges that are not populated.
> > > 
> > 
> > As stated, I think eventually removing the SVM range is a good
> > longterm
> > goal.
> > 
> > I almost coded that in this initial series but ran into a number of
> > issues which make this complex, so to get something working in the
> > simplest way possible (to enable further test development, start
> > constructive upstream discussions which appear to be happening, UMD /
> > application development, and other upper-layer KMD development) I
> > stuck with this approach.
> > 
> > I think for any solution which requires a SVM range (fwiw both AMD
> > and
> > Nvidia have a similar concept), attaching the ranges to a larger
> > notifier makes sense and is better than 1 notifier per range.
> > 
> > Issues with removing a SVM range:
> > 
> > - Xe bind code stores invalidation / present state in the VMA; this
> >   would need to be moved to the radix tree. I have a Jira open for
> >   that work which I believe other developers are going to own.
> > - Where would the dma mapping / device pages be stored?
> > 	- In the radix tree? What if ATS is enabled? We don't have a
> > 	  driver owned radix tree. How do we reasonably connect a
> > driver
> > 	  owned radix to a common GPUSVM layer?
> > 	- In the notifier? What if the notifier is sparsely
> > 	  populated?
> > 	  We would be wasting huge amounts of memory. What if the
> > 	  notifier is configured to span the entire virtual address
> > 	  space?
> > - How does the garbage collector work? We can't allocate memory in
> >   the notifier so we don't have anything to add to the garbage
> >   collector. We can't directly modify page tables given you need a
> >   lock in the path of reclaim.
> > - How do we deal with fault storms (e.g. tons of faults hitting the
> >   same SVM range in a row)? Without an SVM range there is no easy way
> >   to know if the mapping is valid and the GPU page fault handler can
> >   be short-circuited.
> > - Do we have notifier seqno for every PTE?
> > 
> > I feel like I'm missing a few, and likely more issues would arise
> > when implementing this too.
> > 
> > To be clear, I'm saying we shouldn't try to do this upfront; all of
> > the above issues are likely workable, but doing all this now is akin
> > to running before we can walk. I'd rather solve the fundamental
> > locking issues first, have robust testing in place and passing, and
> > UMDs / apps running, before trying to rework this one. Performance
> > numbers for this would also be helpful.
> 
> 
> 
> 
> 
> 
> > 
> > Matt
> > 
> > > /Thomas
> > > 
> > > > 
> > > > > > + * - Ranges: Represent memory ranges mapped in a DRM device
> > > > > > and
> > > > > > managed
> > > > > > + *	     by GPU SVM. They are sized based on an array of
> > > > > > chunk
> > > > > > sizes, which
> > > > > > + *	     is a GPU SVM initialization parameter, and the
> > > > > > CPU
> > > > > > address space.
> > > > > > + *	     Upon GPU fault, the largest aligned chunk that
> > > > > > fits
> > > > > > within the
> > > > > > + *	     faulting CPU address space is chosen for the
> > > > > > range
> > > > > > size. Ranges are
> > > > > > + *	     expected to be dynamically allocated on GPU
> > > > > > fault
> > > > > > and
> > > > > > removed on an
> > > > > > + *	     MMU notifier UNMAP event. As mentioned above,
> > > > > > ranges
> > > > > > are tracked in
> > > > > > + *	     a notifier's Red-Black tree.
> > > > > 
> > > > > How do ranges and chunks map to
> > > > >  
> > > > > a) Prefaulting granularity
> > > > > b) Migration granularity?
> > > > > 
> > > > > > + * - Operations: Define the interface for driver-specific
> > > > > > SVM
> > > > > > operations such as
> > > > > > + *		 allocation, page collection, migration,
> > > > > > invalidations, and VRAM
> > > > > > + *		 release.
> > > > > > + *
> > > > > > + * This layer provides interfaces for allocating, mapping,
> > > > > > migrating, and
> > > > > > + * releasing memory ranges between the CPU and GPU. It
> > > > > > handles
> > > > > > all
> > > > > > core memory
> > > > > > + * management interactions (DMA mapping, HMM, and migration)
> > > > > > and
> > > > > > provides
> > > > > > + * driver-specific virtual functions (vfuncs). This
> > > > > > infrastructure
> > > > > > is sufficient
> > > > > > + * to build the expected driver components for an SVM
> > > > > > implementation
> > > > > > as detailed
> > > > > > + * below.
> > > > > > + *
> > > > > > + * Expected Driver Components:
> > > > > > + * - GPU page fault handler: Used to create ranges and
> > > > > > notifiers
> > > > > > based on the
> > > > > > + *			     fault address, optionally
> > > > > > migrate
> > > > > > the
> > > > > > range to
> > > > > > + *			     VRAM, and create GPU bindings.
> > > > > > + * - Garbage collector: Used to destroy GPU bindings for
> > > > > > ranges.
> > > > > > Ranges are
> > > > > > + *			expected to be added to the garbage
> > > > > > collector upon
> > > > > > + *			MMU_NOTIFY_UNMAP event.
> > > > > > + */
> > > > > > +
> > > > > > +/**
> > > > > > + * DOC: Locking
> > > > > > + *
> > > > > > + * GPU SVM handles locking for core MM interactions, i.e.,
> > > > > > it
> > > > > > locks/unlocks the
> > > > > > + * mmap lock as needed. Alternatively, if the driver prefers
> > > > > > to
> > > > > > handle the mmap
> > > > > > + * lock itself, a 'locked' argument is provided to the
> > > > > > functions
> > > > > > that require
> > > > > > + * the mmap lock. This option may be useful for drivers that
> > > > > > need to
> > > > > > call into
> > > > > > + * GPU SVM while also holding a dma-resv lock, thus
> > > > > > preventing
> > > > > > locking
> > > > > > + * inversions between the mmap and dma-resv locks.
> > > > > > + *
> > > > > > + * GPU SVM introduces a global notifier lock, which
> > > > > > safeguards
> > > > > > the
> > > > > > notifier's
> > > > > > + * range RB tree and list, as well as the range's DMA
> > > > > > mappings
> > > > > > and
> > > > > > sequence
> > > > > > + * number. GPU SVM manages all necessary locking and
> > > > > > unlocking
> > > > > > operations,
> > > > > > + * except for the recheck of the range's sequence number
> > > > > > + * (mmu_interval_read_retry) when the driver is committing
> > > > > > GPU
> > > > > > bindings. This
> > > > > > + * lock corresponds to the 'driver->update' lock mentioned
> > > > > > in
> > > > > > the
> > > > > > HMM
> > > > > > + * documentation (TODO: Link). Future revisions may
> > > > > > transition
> > > > > > from
> > > > > > a GPU SVM
> > > > > > + * global lock to a per-notifier lock if finer-grained
> > > > > > locking
> > > > > > is
> > > > > > deemed
> > > > > > + * necessary.
> > > > > > + *
> > > > > > + * In addition to the locking mentioned above, the driver
> > > > > > should
> > > > > > implement a
> > > > > > + * lock to safeguard core GPU SVM function calls that modify
> > > > > > state,
> > > > > > such as
> > > > > > + * drm_gpusvm_range_find_or_insert and
> > > > > > drm_gpusvm_range_remove.
> > > > > > Alternatively,
> > > > > > + * these core functions can be called within a single kernel
> > > > > > thread,
> > > > > > for
> > > > > > + * instance, using an ordered work queue. This lock is
> > > > > > denoted
> > > > > > as
> > > > > > + * 'driver_svm_lock' in code examples.
> > > > > > + */
> > > > > > +
> > > > > > +/**
> > > > > > + * DOC: Migration
> > > > > > + *
> > > > > > + * The migration support is quite simple, allowing migration
> > > > > > between
> > > > > > SRAM and
> > > > > > + * VRAM at the range granularity. For example, GPU SVM
> > > > > > currently
> > > > > > does not
> > > > > > + * support mixing SRAM and VRAM pages within a range. This
> > > > > > means
> > > > > > that upon GPU
> > > > > > + * fault, the entire range can be migrated to VRAM, and upon
> > > > > > CPU
> > > > > > fault, the
> > > > > > + * entire range is migrated to SRAM.
> > > > > > + *
> > > > > > + * The reasoning for only supporting range granularity is as
> > > > > > follows: it
> > > > > > + * simplifies the implementation, and range sizes are
> > > > > > driver-
> > > > > > defined
> > > > > > and should
> > > > > > + * be relatively small.
> > > > > > + */
> > > > > > +
> > > > > > +/**
> > > > > > + * DOC: Partial Unmapping of Ranges
> > > > > > + *
> > > > > > + * Partial unmapping of ranges (e.g., 1M out of 2M is
> > > > > > unmapped
> > > > > > by
> > > > > > CPU resulting
> > > > > > + * in MMU_NOTIFY_UNMAP event) presents several challenges,
> > > > > > with
> > > > > > the
> > > > > > main one
> > > > > > + * being that a subset of the range still has CPU and GPU
> > > > > > mappings.
> > > > > > If the
> > > > > > + * backing store for the range is in VRAM, a subset of the
> > > > > > backing
> > > > > > store has
> > > > > > + * references. One option would be to split the range and
> > > > > > VRAM
> > > > > > backing store,
> > > > > > + * but the implementation for this would be quite
> > > > > > complicated.
> > > > > > Given
> > > > > > that
> > > > > > + * partial unmappings are rare and driver-defined range
> > > > > > sizes
> > > > > > are
> > > > > > relatively
> > > > > > + * small, GPU SVM does not support splitting of ranges.
> > > > > > + *
> > > > > > + * With no support for range splitting, upon partial
> > > > > > unmapping
> > > > > > of a
> > > > > > range, the
> > > > > > + * driver is expected to invalidate and destroy the entire
> > > > > > range. If
> > > > > > the range
> > > > > > + * has VRAM as its backing, the driver is also expected to
> > > > > > migrate
> > > > > > any remaining
> > > > > > + * pages back to SRAM.
> > > > > 
> > > > > So what happens if we get a one-page invalidation, say
> > > > > protection
> > > > > change event, or NUMA accounting event, in the middle of a
> > > > > range?
> > > > > Can
> > > > > we unmap just that single gpu pte covering that range, that is,
> > > > > how
> > > > > do
> > > > > the ranges map to invalidation granularity? Does this differ
> > > > > between
> > > > > igfx an dgfx?
> > > > 
> > > > Well the idea of chunks is ranges should be 1 GPU page (the chunk
> > > > array
> > > > in Xe is 4k, 64k, and 2M). The design is flexible enough that this
> > > > doesn't have to be true, but it is optimized for the expectation
> > > > that each range is most likely 1 GPU page. If this isn't true, then
> > > > all GPU pages in the range are invalidated, which isn't ideal but
> > > > keeps it simple, which IMO far outweighs the potential benefits. In
> > > > theory a driver could implement splitting / partial invalidations
> > > > too with a couple of updates to GPUSVM, but it would likely largely
> > > > be a driver implementation rather than
> > > > GPUSVM.
> > > > 
> > > > No difference between igfx an dgfx.
> > > > 
> > > > You bring up a good point about protection changes; I likely
> > > > haven't fully gotten that part of the implementation correct
> > > > either. I can add this to my TODO list and also update my IGTs to
> > > > do things like this.
> > > > 
> > > > Matt
> > > > 
> > > > > 
> > > > > Thanks,
> > > > > Thomas
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > > + */
> > > > > > +
> > > > > > +/**
> > > > > > + * DOC: Examples
> > > > > > + *
> > > > > > + * This section provides two examples of how to build the
> > > > > > expected
> > > > > > driver
> > > > > > + * components: the GPU page fault handler and the garbage
> > > > > > collector.
> > > > > > A third
> > > > > > + * example demonstrates a sample invalidation driver vfunc.
> > > > > > + *
> > > > > > + * The generic code provided does not include logic for
> > > > > > complex
> > > > > > migration
> > > > > > + * policies, optimized invalidations, or other potentially
> > > > > > required
> > > > > > driver
> > > > > > + * locking (e.g., DMA-resv locks).
> > > > > > + *
> > > > > > + * 1) GPU page fault handler
> > > > > > + *
> > > > > > + *	int driver_bind_range(struct drm_gpusvm *gpusvm,
> > > > > > struct
> > > > > > drm_gpusvm_range *range)
> > > > > > + *	{
> > > > > > + *		int err = 0;
> > > > > > + *
> > > > > > +
> > > > > > *		driver_alloc_and_setup_memory_for_bind(gpusvm,
> > > > > > range);
> > > > > > + *
> > > > > > + *		drm_gpusvm_notifier_lock(gpusvm);
> > > > > > + *		if (drm_gpusvm_range_pages_valid(range))
> > > > > > + *			driver_commit_bind(gpusvm, range);
> > > > > > + *		else
> > > > > > + *			err = -EAGAIN;
> > > > > > + *		drm_gpusvm_notifier_unlock(gpusvm);
> > > > > > + *
> > > > > > + *		return err;
> > > > > > + *	}
> > > > > > + *
> > > > > > + *	int driver_gpu_fault(struct drm_gpusvm *gpusvm, u64
> > > > > > fault_addr,
> > > > > > + *			     u64 gpuva_start, u64 gpuva_end)
> > > > > > + *	{
> > > > > > + *		struct drm_gpusvm_ctx ctx = {};
> > > > > > + *		int err;
> > > > > > + *
> > > > > > + *		driver_svm_lock();
> > > > > > + *	retry:
> > > > > > + *		// Always process UNMAPs first so view of
> > > > > > GPU
> > > > > > SVM
> > > > > > ranges is current
> > > > > > + *		driver_garbage_collector(gpusvm);
> > > > > > + *
> > > > > > + *		range =
> > > > > > drm_gpusvm_range_find_or_insert(gpusvm,
> > > > > > fault_addr,
> > > > > > +
> > > > > > *							gpuv
> > > > > > a_start,
> > > > > > gpuva_end,
> > > > > > + *						       
> > > > > > &ctx);
> > > > > > + *		if (IS_ERR(range)) {
> > > > > > + *			err = PTR_ERR(range);
> > > > > > + *			goto unlock;
> > > > > > + *		}
> > > > > > + *
> > > > > > + *		if (driver_migration_policy(range)) {
> > > > > > + *			bo = driver_alloc_bo();
> > > > > > + *			err =
> > > > > > drm_gpusvm_migrate_to_vram(gpusvm,
> > > > > > range, bo, &ctx);
> > > > > > + *			if (err)	// CPU mappings may
> > > > > > have
> > > > > > changed
> > > > > > + *				goto retry;
> > > > > > + *		}
> > > > > > + *
> > > > > > + *		err = drm_gpusvm_range_get_pages(gpusvm,
> > > > > > range,
> > > > > > &ctx);
> > > > > > + *		if (err == -EFAULT || err == -EPERM)	//
> > > > > > CPU
> > > > > > mappings changed
> > > > > > + *			goto retry;
> > > > > > + *		else if (err)
> > > > > > + *			goto unlock;
> > > > > > + *
> > > > > > + *		err = driver_bind_range(gpusvm, range);
> > > > > > + *		if (err == -EAGAIN)	// CPU mappings
> > > > > > changed
> > > > > > + *			goto retry
> > > > > > + *
> > > > > > + *	unlock:
> > > > > > + *		driver_svm_unlock();
> > > > > > + *		return err;
> > > > > > + *	}
> > > > > > + *
> > > > > > + * 2) Garbage Collector.
> > > > > > + *
> > > > > > + *	void __driver_garbage_collector(struct drm_gpusvm
> > > > > > *gpusvm,
> > > > > > + *					struct
> > > > > > drm_gpusvm_range
> > > > > > *range)
> > > > > > + *	{
> > > > > > + *		struct drm_gpusvm_ctx ctx = {};
> > > > > > + *
> > > > > > + *		assert_driver_svm_locked(gpusvm);
> > > > > > + *
> > > > > > + *		// Partial unmap, migrate any remaining VRAM
> > > > > > pages
> > > > > > back to SRAM
> > > > > > + *		if (range->flags.partial_unmap)
> > > > > > + *			drm_gpusvm_migrate_to_sram(gpusvm,
> > > > > > range,
> > > > > > &ctx);
> > > > > > + *
> > > > > > + *		driver_unbind_range(range);
> > > > > > + *		drm_gpusvm_range_remove(gpusvm, range);
> > > > > > + *	}
> > > > > > + *
> > > > > > + *	void driver_garbage_collector(struct drm_gpusvm
> > > > > > *gpusvm)
> > > > > > + *	{
> > > > > > + *		assert_driver_svm_locked(gpusvm);
> > > > > > + *
> > > > > > + *		for_each_range_in_garbage_collector(gpusvm,
> > > > > > range)
> > > > > > + *			__driver_garbage_collector(gpusvm,
> > > > > > range);
> > > > > > + *	}
> > > > > > + *
> > > > > > + * 3) Invalidation driver vfunc.
> > > > > > + *
> > > > > > + *	void driver_invalidation(struct drm_gpusvm *gpusvm,
> > > > > > + *				 struct drm_gpusvm_notifier
> > > > > > *notifier,
> > > > > > + *				 const struct
> > > > > > mmu_notifier_range
> > > > > > *mmu_range)
> > > > > > + *	{
> > > > > > + *		struct drm_gpusvm_ctx ctx = { .in_notifier =
> > > > > > true,
> > > > > > };
> > > > > > + *		struct drm_gpusvm_range *range = NULL;
> > > > > > + *
> > > > > > + *		driver_invalidate_device_tlb(gpusvm,
> > > > > > mmu_range-
> > > > > > > start, mmu_range->end);
> > > > > > + *
> > > > > > + *		drm_gpusvm_for_each_range(range, notifier,
> > > > > > mmu_range->start,
> > > > > > + *					  mmu_range->end) {
> > > > > > + *			drm_gpusvm_range_unmap_pages(gpusvm,
> > > > > > range,
> > > > > > &ctx);
> > > > > > + *
> > > > > > + *			if (mmu_range->event !=
> > > > > > MMU_NOTIFY_UNMAP)
> > > > > > + *				continue;
> > > > > > + *
> > > > > > + *			drm_gpusvm_range_set_unmapped(range,
> > > > > > mmu_range);
> > > > > > + *			driver_garbage_collector_add(gpusvm,
> > > > > > range);
> > > > > > + *		}
> > > > > > + *	}
> > > > > > + */
> > > > > > +
> > > > > > +#define DRM_GPUSVM_RANGE_START(_range)	((_range)->va.start)
> > > > > > +#define DRM_GPUSVM_RANGE_END(_range)	((_range)->va.end -
> > > > > > 1)
> > > > > > +INTERVAL_TREE_DEFINE(struct drm_gpusvm_range, rb.node, u64,
> > > > > > rb.__subtree_last,
> > > > > > +		     DRM_GPUSVM_RANGE_START,
> > > > > > DRM_GPUSVM_RANGE_END,
> > > > > > +		     static __maybe_unused, range);
> > > > > > +
> > > > > > +#define
> > > > > > DRM_GPUSVM_NOTIFIER_START(_notifier)	((_notifier)-
> > > > > > > interval.start)
> > > > > > +#define
> > > > > > DRM_GPUSVM_NOTIFIER_END(_notifier)	((_notifier)-
> > > > > > > interval.end - 1)
> > > > > > +INTERVAL_TREE_DEFINE(struct drm_gpusvm_notifier, rb.node,
> > > > > > u64,
> > > > > > +		     rb.__subtree_last,
> > > > > > DRM_GPUSVM_NOTIFIER_START,
> > > > > > +		     DRM_GPUSVM_NOTIFIER_END, static
> > > > > > __maybe_unused,
> > > > > > notifier);
> > > > > > +
> > > > > > +/**
> > > > > > + * npages_in_range() - Calculate the number of pages in a
> > > > > > given
> > > > > > range
> > > > > > + * @start__: The start address of the range
> > > > > > + * @end__: The end address of the range
> > > > > > + *
> > > > > > + * This macro calculates the number of pages in a given
> > > > > > memory
> > > > > > range,
> > > > > > + * specified by the start and end addresses. It divides the
> > > > > > difference
> > > > > > + * between the end and start addresses by the page size
> > > > > > (PAGE_SIZE)
> > > > > > to
> > > > > > + * determine the number of pages in the range.
> > > > > > + *
> > > > > > + * Return: The number of pages in the specified range.
> > > > > > + */
> > > > > > +#define npages_in_range(start__, end__)	\
> > > > > > +	(((end__) - (start__)) >> PAGE_SHIFT)
> > > > > > +
> > > > > > +/**
> > > > > > + * struct drm_gpusvm_zdd - GPU SVM zone device data
> > > > > > + *
> > > > > > + * @refcount: Reference count for the zdd
> > > > > > + * @destroy_work: Work structure for asynchronous zdd
> > > > > > destruction
> > > > > > + * @range: Pointer to the GPU SVM range
> > > > > > + * @vram_allocation: Driver-private pointer to the VRAM
> > > > > > allocation
> > > > > > + *
> > > > > > + * This structure serves as a generic wrapper installed in
> > > > > > + * page->zone_device_data. It provides infrastructure for
> > > > > > looking up
> > > > > > a range
> > > > > > + * upon CPU page fault and asynchronously releasing VRAM
> > > > > > once
> > > > > > the
> > > > > > CPU has no
> > > > > > + * page references. Asynchronous release is useful because
> > > > > > CPU
> > > > > > page
> > > > > > references
> > > > > > + * can be dropped in IRQ contexts, while releasing VRAM
> > > > > > likely
> > > > > > requires sleeping
> > > > > > + * locks.
> > > > > > + */
> > > > > > +struct drm_gpusvm_zdd {
> > > > > > +	struct kref refcount;
> > > > > > +	struct work_struct destroy_work;
> > > > > > +	struct drm_gpusvm_range *range;
> > > > > > +	void *vram_allocation;
> > > > > > +};
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_zdd_destroy_work_func - Work function for
> > > > > > destroying a
> > > > > > zdd
> > > > > > + * @w: Pointer to the work_struct
> > > > > > + *
> > > > > > + * This function releases VRAM, puts GPU SVM range, and
> > > > > > frees
> > > > > > zdd.
> > > > > > + */
> > > > > > +static void drm_gpusvm_zdd_destroy_work_func(struct
> > > > > > work_struct
> > > > > > *w)
> > > > > > +{
> > > > > > +	struct drm_gpusvm_zdd *zdd =
> > > > > > +		container_of(w, struct drm_gpusvm_zdd,
> > > > > > destroy_work);
> > > > > > +	struct drm_gpusvm_range *range = zdd->range;
> > > > > > +	struct drm_gpusvm *gpusvm = range->gpusvm;
> > > > > > +
> > > > > > +	if (gpusvm->ops->vram_release && zdd-
> > > > > > >vram_allocation)
> > > > > > +		gpusvm->ops->vram_release(zdd-
> > > > > > >vram_allocation);
> > > > > > +	drm_gpusvm_range_put(range);
> > > > > > +	kfree(zdd);
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_zdd_alloc - Allocate a zdd structure.
> > > > > > + * @range: Pointer to the GPU SVM range.
> > > > > > + *
> > > > > > + * This function allocates and initializes a new zdd
> > > > > > structure.
> > > > > > It
> > > > > > sets up the
> > > > > > + * reference count, initializes the destroy work, and links
> > > > > > the
> > > > > > provided GPU SVM
> > > > > > + * range.
> > > > > > + *
> > > > > > + * Returns:
> > > > > > + * Pointer to the allocated zdd on success, ERR_PTR() on
> > > > > > failure.
> > > > > > + */
> > > > > > +static struct drm_gpusvm_zdd *
> > > > > > +drm_gpusvm_zdd_alloc(struct drm_gpusvm_range *range)
> > > > > > +{
> > > > > > +	struct drm_gpusvm_zdd *zdd;
> > > > > > +
> > > > > > +	zdd = kmalloc(sizeof(*zdd), GFP_KERNEL);
> > > > > > +	if (!zdd)
> > > > > > +		return NULL;
> > > > > > +
> > > > > > +	kref_init(&zdd->refcount);
> > > > > > +	INIT_WORK(&zdd->destroy_work,
> > > > > > drm_gpusvm_zdd_destroy_work_func);
> > > > > > +	zdd->range = drm_gpusvm_range_get(range);
> > > > > > +	zdd->vram_allocation = NULL;
> > > > > > +
> > > > > > +	return zdd;
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_zdd_get - Get a reference to a zdd structure.
> > > > > > + * @zdd: Pointer to the zdd structure.
> > > > > > + *
> > > > > > + * This function increments the reference count of the
> > > > > > provided
> > > > > > zdd
> > > > > > structure.
> > > > > > + *
> > > > > > + * Returns: Pointer to the zdd structure.
> > > > > > + */
> > > > > > +static struct drm_gpusvm_zdd *drm_gpusvm_zdd_get(struct
> > > > > > drm_gpusvm_zdd *zdd)
> > > > > > +{
> > > > > > +	kref_get(&zdd->refcount);
> > > > > > +	return zdd;
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_zdd_destroy - Destroy a zdd structure.
> > > > > > + * @ref: Pointer to the reference count structure.
> > > > > > + *
> > > > > > + * This function queues the destroy_work of the zdd for
> > > > > > asynchronous
> > > > > > destruction.
> > > > > > + */
> > > > > > +static void drm_gpusvm_zdd_destroy(struct kref *ref)
> > > > > > +{
> > > > > > +	struct drm_gpusvm_zdd *zdd =
> > > > > > +		container_of(ref, struct drm_gpusvm_zdd,
> > > > > > refcount);
> > > > > > +	struct drm_gpusvm *gpusvm = zdd->range->gpusvm;
> > > > > > +
> > > > > > +	queue_work(gpusvm->zdd_wq, &zdd->destroy_work);
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_zdd_put - Put a zdd reference.
> > > > > > + * @zdd: Pointer to the zdd structure.
> > > > > > + *
> > > > > > + * This function decrements the reference count of the
> > > > > > provided
> > > > > > zdd
> > > > > > structure
> > > > > > + * and schedules its destruction if the count drops to zero.
> > > > > > + */
> > > > > > +static void drm_gpusvm_zdd_put(struct drm_gpusvm_zdd *zdd)
> > > > > > +{
> > > > > > +	kref_put(&zdd->refcount, drm_gpusvm_zdd_destroy);
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_range_find - Find GPU SVM range from GPU SVM
> > > > > > notifier
> > > > > > + * @notifier: Pointer to the GPU SVM notifier structure.
> > > > > > + * @start: Start address of the range
> > > > > > + * @end: End address of the range
> > > > > > + *
> > > > > > + * Return: A pointer to the drm_gpusvm_range if found or
> > > > > > NULL
> > > > > > + */
> > > > > > +struct drm_gpusvm_range *
> > > > > > +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier,
> > > > > > u64
> > > > > > start, u64 end)
> > > > > > +{
> > > > > > +	return range_iter_first(&notifier->root, start, end
> > > > > > -
> > > > > > 1);
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_for_each_range_safe - Safely iterate over GPU
> > > > > > SVM
> > > > > > ranges in a notifier
> > > > > > + * @range__: Iterator variable for the ranges
> > > > > > + * @next__: Iterator variable for the ranges temporary
> > > > > > storage
> > > > > > + * @notifier__: Pointer to the GPU SVM notifier
> > > > > > + * @start__: Start address of the range
> > > > > > + * @end__: End address of the range
> > > > > > + *
> > > > > > + * This macro is used to iterate over GPU SVM ranges in a
> > > > > > notifier
> > > > > > while
> > > > > > + * removing ranges from it.
> > > > > > + */
> > > > > > +#define drm_gpusvm_for_each_range_safe(range__, next__,
> > > > > > notifier__,
> > > > > > start__, end__)	\
> > > > > > +	for ((range__) = drm_gpusvm_range_find((notifier__),
> > > > > > (start__), (end__)),	\
> > > > > > +	     (next__) =
> > > > > > __drm_gpusvm_range_next(range__);			
> > > > > > 	\
> > > > > > +	     (range__) && (range__->va.start <
> > > > > > (end__));				\
> > > > > > +	     (range__) = (next__), (next__) =
> > > > > > __drm_gpusvm_range_next(range__))
> > > > > > +
> > > > > > +/**
> > > > > > + * __drm_gpusvm_notifier_next - get the next
> > > > > > drm_gpusvm_notifier
> > > > > > in
> > > > > > the list
> > > > > > + * @notifier: a pointer to the current drm_gpusvm_notifier
> > > > > > + *
> > > > > > + * Return: A pointer to the next drm_gpusvm_notifier if
> > > > > > available,
> > > > > > or NULL if
> > > > > > + *         the current notifier is the last one or if the
> > > > > > input
> > > > > > notifier is
> > > > > > + *         NULL.
> > > > > > + */
> > > > > > +static struct drm_gpusvm_notifier *
> > > > > > +__drm_gpusvm_notifier_next(struct drm_gpusvm_notifier
> > > > > > *notifier)
> > > > > > +{
> > > > > > +	if (notifier && !list_is_last(&notifier->rb.entry,
> > > > > > +				      &notifier->gpusvm-
> > > > > > > notifier_list))
> > > > > > +		return list_next_entry(notifier, rb.entry);
> > > > > > +
> > > > > > +	return NULL;
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_for_each_notifier - Iterate over GPU SVM
> > > > > > notifiers
> > > > > > in
> > > > > > a gpusvm
> > > > > > + * @notifier__: Iterator variable for the notifiers
> > > > > > + * @gpusvm__: Pointer to the GPU SVM structure
> > > > > > + * @start__: Start address of the notifier
> > > > > > + * @end__: End address of the notifier
> > > > > > + *
> > > > > > + * This macro is used to iterate over GPU SVM notifiers in a
> > > > > > gpusvm.
> > > > > > + */
> > > > > > +#define drm_gpusvm_for_each_notifier(notifier__, gpusvm__,
> > > > > > start__,
> > > > > > end__)		\
> > > > > > +	for ((notifier__) = notifier_iter_first(&(gpusvm__)-
> > > > > > > root,
> > > > > > (start__), (end__) - 1);	\
> > > > > > +	     (notifier__) && (notifier__->interval.start <
> > > > > > (end__));			\
> > > > > > +	     (notifier__) =
> > > > > > __drm_gpusvm_notifier_next(notifier__))
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_for_each_notifier_safe - Safely iterate over
> > > > > > GPU
> > > > > > SVM
> > > > > > notifiers in a gpusvm
> > > > > > + * @notifier__: Iterator variable for the notifiers
> > > > > > + * @next__: Iterator variable for the notifiers temporary
> > > > > > storage
> > > > > > + * @gpusvm__: Pointer to the GPU SVM structure
> > > > > > + * @start__: Start address of the notifier
> > > > > > + * @end__: End address of the notifier
> > > > > > + *
> > > > > > + * This macro is used to iterate over GPU SVM notifiers in a
> > > > > > gpusvm
> > > > > > while
> > > > > > + * removing notifiers from it.
> > > > > > + */
> > > > > > +#define drm_gpusvm_for_each_notifier_safe(notifier__,
> > > > > > next__,
> > > > > > gpusvm__, start__, end__)	\
> > > > > > +	for ((notifier__) = notifier_iter_first(&(gpusvm__)-
> > > > > > > root,
> > > > > > (start__), (end__) - 1),	\
> > > > > > +	     (next__) =
> > > > > > __drm_gpusvm_notifier_next(notifier__);			
> > > > > > 	\
> > > > > > +	     (notifier__) && (notifier__->interval.start <
> > > > > > (end__));			\
> > > > > > +	     (notifier__) = (next__), (next__) =
> > > > > > __drm_gpusvm_notifier_next(notifier__))
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_notifier_invalidate - Invalidate a GPU SVM
> > > > > > notifier.
> > > > > > + * @mni: Pointer to the mmu_interval_notifier structure.
> > > > > > + * @mmu_range: Pointer to the mmu_notifier_range structure.
> > > > > > + * @cur_seq: Current sequence number.
> > > > > > + *
> > > > > > + * This function serves as a generic MMU notifier for GPU
> > > > > > SVM.
> > > > > > It
> > > > > > sets the MMU
> > > > > > + * notifier sequence number and calls the driver invalidate
> > > > > > vfunc
> > > > > > under
> > > > > > + * gpusvm->notifier_lock.
> > > > > > + *
> > > > > > + * Returns:
> > > > > > + * true if the operation succeeds, false otherwise.
> > > > > > + */
> > > > > > +static bool
> > > > > > +drm_gpusvm_notifier_invalidate(struct mmu_interval_notifier
> > > > > > *mni,
> > > > > > +			       const struct
> > > > > > mmu_notifier_range
> > > > > > *mmu_range,
> > > > > > +			       unsigned long cur_seq)
> > > > > > +{
> > > > > > +	struct drm_gpusvm_notifier *notifier =
> > > > > > +		container_of(mni, typeof(*notifier),
> > > > > > notifier);
> > > > > > +	struct drm_gpusvm *gpusvm = notifier->gpusvm;
> > > > > > +
> > > > > > +	if (!mmu_notifier_range_blockable(mmu_range))
> > > > > > +		return false;
> > > > > > +
> > > > > > +	down_write(&gpusvm->notifier_lock);
> > > > > > +	mmu_interval_set_seq(mni, cur_seq);
> > > > > > +	gpusvm->ops->invalidate(gpusvm, notifier,
> > > > > > mmu_range);
> > > > > > +	up_write(&gpusvm->notifier_lock);
> > > > > > +
> > > > > > +	return true;
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_notifier_ops - MMU interval notifier
> > > > > > operations
> > > > > > for
> > > > > > GPU SVM
> > > > > > + */
> > > > > > +static const struct mmu_interval_notifier_ops
> > > > > > drm_gpusvm_notifier_ops = {
> > > > > > +	.invalidate = drm_gpusvm_notifier_invalidate,
> > > > > > +};
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_init - Initialize the GPU SVM.
> > > > > > + * @gpusvm: Pointer to the GPU SVM structure.
> > > > > > + * @name: Name of the GPU SVM.
> > > > > > + * @drm: Pointer to the DRM device structure.
> > > > > > + * @mm: Pointer to the mm_struct for the address space.
> > > > > > + * @device_private_page_owner: Device private pages owner.
> > > > > > + * @mm_start: Start address of GPU SVM.
> > > > > > + * @mm_range: Range of the GPU SVM.
> > > > > > + * @notifier_size: Size of individual notifiers.
> > > > > > + * @ops: Pointer to the operations structure for GPU SVM.
> > > > > > + * @chunk_sizes: Pointer to the array of chunk sizes used in
> > > > > > range
> > > > > > allocation.
> > > > > > + *               Entries should be powers of 2 in descending
> > > > > > order
> > > > > > with last
> > > > > > + *               entry being SZ_4K.
> > > > > > + * @num_chunks: Number of chunks.
> > > > > > + *
> > > > > > + * This function initializes the GPU SVM.
> > > > > > + *
> > > > > > + * Returns:
> > > > > > + * 0 on success, a negative error code on failure.
> > > > > > + */
> > > > > > +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> > > > > > +		    const char *name, struct drm_device
> > > > > > *drm,
> > > > > > +		    struct mm_struct *mm, void
> > > > > > *device_private_page_owner,
> > > > > > +		    u64 mm_start, u64 mm_range, u64
> > > > > > notifier_size,
> > > > > > +		    const struct drm_gpusvm_ops *ops,
> > > > > > +		    const u64 *chunk_sizes, int num_chunks)
> > > > > > +{
> > > > > > +	if (!ops->invalidate || !num_chunks)
> > > > > > +		return -EINVAL;
> > > > > > +
> > > > > > +	gpusvm->name = name;
> > > > > > +	gpusvm->drm = drm;
> > > > > > +	gpusvm->mm = mm;
> > > > > > +	gpusvm->device_private_page_owner =
> > > > > > device_private_page_owner;
> > > > > > +	gpusvm->mm_start = mm_start;
> > > > > > +	gpusvm->mm_range = mm_range;
> > > > > > +	gpusvm->notifier_size = notifier_size;
> > > > > > +	gpusvm->ops = ops;
> > > > > > +	gpusvm->chunk_sizes = chunk_sizes;
> > > > > > +	gpusvm->num_chunks = num_chunks;
> > > > > > +	gpusvm->zdd_wq = system_wq;
> > > > > > +
> > > > > > +	mmgrab(mm);
> > > > > > +	gpusvm->root = RB_ROOT_CACHED;
> > > > > > +	INIT_LIST_HEAD(&gpusvm->notifier_list);
> > > > > > +
> > > > > > +	init_rwsem(&gpusvm->notifier_lock);
> > > > > > +
> > > > > > +	fs_reclaim_acquire(GFP_KERNEL);
> > > > > > +	might_lock(&gpusvm->notifier_lock);
> > > > > > +	fs_reclaim_release(GFP_KERNEL);
> > > > > > +
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_notifier_find - Find GPU SVM notifier
> > > > > > + * @gpusvm__: Pointer to the GPU SVM structure
> > > > > > + * @fault_addr__: Fault address
> > > > > > + *
> > > > > > + * This macro finds the GPU SVM notifier associated with the
> > > > > > fault
> > > > > > address.
> > > > > > + *
> > > > > > + * Returns:
> > > > > > + * Pointer to the GPU SVM notifier on success, NULL
> > > > > > otherwise.
> > > > > > + */
> > > > > > +#define drm_gpusvm_notifier_find(gpusvm__,
> > > > > > fault_addr__)	\
> > > > > > +	notifier_iter_first(&(gpusvm__)->root,
> > > > > > (fault_addr__),	\
> > > > > > +			    (fault_addr__ + 1))
> > > > > > +
> > > > > > +/**
> > > > > > + * to_drm_gpusvm_notifier - retrieve the container struct
> > > > > > for a
> > > > > > given rbtree node
> > > > > > + * @node__: a pointer to the rbtree node embedded within a
> > > > > > drm_gpusvm_notifier struct
> > > > > > + *
> > > > > > + * Return: A pointer to the containing drm_gpusvm_notifier
> > > > > > structure.
> > > > > > + */
> > > > > > +#define
> > > > > > to_drm_gpusvm_notifier(__node)				\
> > > > > > +	container_of((__node), struct drm_gpusvm_notifier,
> > > > > > rb.node)
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_notifier_insert - Insert GPU SVM notifier
> > > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > > > > + *
> > > > > > + * This function inserts the GPU SVM notifier into the GPU
> > > > > > SVM
> > > > > > RB
> > > > > > tree and list.
> > > > > > + */
> > > > > > +static void drm_gpusvm_notifier_insert(struct drm_gpusvm
> > > > > > *gpusvm,
> > > > > > +				       struct
> > > > > > drm_gpusvm_notifier
> > > > > > *notifier)
> > > > > > +{
> > > > > > +	struct rb_node *node;
> > > > > > +	struct list_head *head;
> > > > > > +
> > > > > > +	notifier_insert(notifier, &gpusvm->root);
> > > > > > +
> > > > > > +	node = rb_prev(&notifier->rb.node);
> > > > > > +	if (node)
> > > > > > +		head = &(to_drm_gpusvm_notifier(node))-
> > > > > > > rb.entry;
> > > > > > +	else
> > > > > > +		head = &gpusvm->notifier_list;
> > > > > > +
> > > > > > +	list_add(&notifier->rb.entry, head);
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_notifier_remove - Remove GPU SVM notifier
> > > > > > + * @gpusvm__: Pointer to the GPU SVM structure
> > > > > > + * @notifier__: Pointer to the GPU SVM notifier structure
> > > > > > + *
> > > > > > + * This macro removes the GPU SVM notifier from the GPU SVM
> > > > > > RB
> > > > > > tree
> > > > > > and list.
> > > > > > + */
> > > > > > +#define drm_gpusvm_notifier_remove(gpusvm__,
> > > > > > notifier__)	\
> > > > > > +	notifier_remove((notifier__), &(gpusvm__)-
> > > > > > > root);	\
> > > > > > +	list_del(&(notifier__)->rb.entry)
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_fini - Finalize the GPU SVM.
> > > > > > + * @gpusvm: Pointer to the GPU SVM structure.
> > > > > > + *
> > > > > > + * This function finalizes the GPU SVM by cleaning up any
> > > > > > remaining
> > > > > > ranges and
> > > > > > + * notifiers, and dropping a reference to struct MM.
> > > > > > + */
> > > > > > +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm)
> > > > > > +{
> > > > > > +	struct drm_gpusvm_notifier *notifier, *next;
> > > > > > +
> > > > > > +	drm_gpusvm_for_each_notifier_safe(notifier, next,
> > > > > > gpusvm, 0,
> > > > > > LONG_MAX) {
> > > > > > +		struct drm_gpusvm_range *range, *__next;
> > > > > > +
> > > > > > +		/*
> > > > > > +		 * Remove notifier first to avoid racing
> > > > > > with
> > > > > > any
> > > > > > invalidation
> > > > > > +		 */
> > > > > > +		mmu_interval_notifier_remove(&notifier-
> > > > > > > notifier);
> > > > > > +		notifier->flags.removed = true;
> > > > > > +
> > > > > > +		drm_gpusvm_for_each_range_safe(range,
> > > > > > __next,
> > > > > > notifier, 0,
> > > > > > +					       LONG_MAX)
> > > > > > +			drm_gpusvm_range_remove(gpusvm,
> > > > > > range);
> > > > > > +	}
> > > > > > +
> > > > > > +	mmdrop(gpusvm->mm);
> > > > > > +	WARN_ON(!RB_EMPTY_ROOT(&gpusvm->root.rb_root));
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_notifier_alloc - Allocate GPU SVM notifier
> > > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > > + * @fault_addr: Fault address
> > > > > > + *
> > > > > > + * This function allocates and initializes the GPU SVM
> > > > > > notifier
> > > > > > structure.
> > > > > > + *
> > > > > > + * Returns:
> > > > > > + * Pointer to the allocated GPU SVM notifier on success,
> > > > > > ERR_PTR()
> > > > > > on failure.
> > > > > > + */
> > > > > > +static struct drm_gpusvm_notifier *
> > > > > > +drm_gpusvm_notifier_alloc(struct drm_gpusvm *gpusvm, u64
> > > > > > fault_addr)
> > > > > > +{
> > > > > > +	struct drm_gpusvm_notifier *notifier;
> > > > > > +
> > > > > > +	if (gpusvm->ops->notifier_alloc)
> > > > > > +		notifier = gpusvm->ops->notifier_alloc();
> > > > > > +	else
> > > > > > +		notifier = kzalloc(sizeof(*notifier),
> > > > > > GFP_KERNEL);
> > > > > > +
> > > > > > +	if (!notifier)
> > > > > > +		return ERR_PTR(-ENOMEM);
> > > > > > +
> > > > > > +	notifier->gpusvm = gpusvm;
> > > > > > +	notifier->interval.start = ALIGN_DOWN(fault_addr,
> > > > > > gpusvm-
> > > > > > > notifier_size);
> > > > > > +	notifier->interval.end = ALIGN(fault_addr + 1,
> > > > > > gpusvm-
> > > > > > > notifier_size);
> > > > > > +	INIT_LIST_HEAD(&notifier->rb.entry);
> > > > > > +	notifier->root = RB_ROOT_CACHED;
> > > > > > +	INIT_LIST_HEAD(&notifier->range_list);
> > > > > > +
> > > > > > +	return notifier;
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_notifier_free - Free GPU SVM notifier
> > > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > > > > + *
> > > > > > + * This function frees the GPU SVM notifier structure.
> > > > > > + */
> > > > > > +static void drm_gpusvm_notifier_free(struct drm_gpusvm
> > > > > > *gpusvm,
> > > > > > +				     struct
> > > > > > drm_gpusvm_notifier
> > > > > > *notifier)
> > > > > > +{
> > > > > > +	WARN_ON(!RB_EMPTY_ROOT(&notifier->root.rb_root));
> > > > > > +
> > > > > > +	if (gpusvm->ops->notifier_free)
> > > > > > +		gpusvm->ops->notifier_free(notifier);
> > > > > > +	else
> > > > > > +		kfree(notifier);
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * to_drm_gpusvm_range - retrieve the container struct for a
> > > > > > given
> > > > > > rbtree node
> > > > > > + * @node__: a pointer to the rbtree node embedded within a
> > > > > > drm_gpusvm_range struct
> > > > > > + *
> > > > > > + * Return: A pointer to the containing drm_gpusvm_range
> > > > > > structure.
> > > > > > + */
> > > > > > +#define to_drm_gpusvm_range(node__)	\
> > > > > > +	container_of((node__), struct drm_gpusvm_range,
> > > > > > rb.node)
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_range_insert - Insert GPU SVM range
> > > > > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > > > > + * @range: Pointer to the GPU SVM range structure
> > > > > > + *
> > > > > > + * This function inserts the GPU SVM range into the notifier
> > > > > > RB
> > > > > > tree
> > > > > > and list.
> > > > > > + */
> > > > > > +static void drm_gpusvm_range_insert(struct
> > > > > > drm_gpusvm_notifier
> > > > > > *notifier,
> > > > > > +				    struct drm_gpusvm_range
> > > > > > *range)
> > > > > > +{
> > > > > > +	struct rb_node *node;
> > > > > > +	struct list_head *head;
> > > > > > +
> > > > > > +	drm_gpusvm_notifier_lock(notifier->gpusvm);
> > > > > > +	range_insert(range, &notifier->root);
> > > > > > +
> > > > > > +	node = rb_prev(&range->rb.node);
> > > > > > +	if (node)
> > > > > > +		head = &(to_drm_gpusvm_range(node))-
> > > > > > >rb.entry;
> > > > > > +	else
> > > > > > +		head = &notifier->range_list;
> > > > > > +
> > > > > > +	list_add(&range->rb.entry, head);
> > > > > > +	drm_gpusvm_notifier_unlock(notifier->gpusvm);
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * __drm_gpusvm_range_remove - Remove GPU SVM range
> > > > > > + * @notifier__: Pointer to the GPU SVM notifier structure
> > > > > > + * @range__: Pointer to the GPU SVM range structure
> > > > > > + *
> > > > > > + * This macro removes the GPU SVM range from the notifier RB
> > > > > > tree
> > > > > > and list.
> > > > > > + */
> > > > > > +#define __drm_gpusvm_range_remove(notifier__,
> > > > > > range__)		\
> > > > > > +	range_remove((range__), &(notifier__)-
> > > > > > > root);		\
> > > > > > +	list_del(&(range__)->rb.entry)
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_range_alloc - Allocate GPU SVM range
> > > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > > > > + * @fault_addr: Fault address
> > > > > > + * @chunk_size: Chunk size
> > > > > > + * @migrate_vram: Flag indicating whether to migrate VRAM
> > > > > > + *
> > > > > > + * This function allocates and initializes the GPU SVM range
> > > > > > structure.
> > > > > > + *
> > > > > > + * Returns:
> > > > > > + * Pointer to the allocated GPU SVM range on success,
> > > > > > ERR_PTR()
> > > > > > on
> > > > > > failure.
> > > > > > + */
> > > > > > +static struct drm_gpusvm_range *
> > > > > > +drm_gpusvm_range_alloc(struct drm_gpusvm *gpusvm,
> > > > > > +		       struct drm_gpusvm_notifier *notifier,
> > > > > > +		       u64 fault_addr, u64 chunk_size, bool
> > > > > > migrate_vram)
> > > > > > +{
> > > > > > +	struct drm_gpusvm_range *range;
> > > > > > +
> > > > > > +	if (gpusvm->ops->range_alloc)
> > > > > > +		range = gpusvm->ops->range_alloc(gpusvm);
> > > > > > +	else
> > > > > > +		range = kzalloc(sizeof(*range), GFP_KERNEL);
> > > > > > +
> > > > > > +	if (!range)
> > > > > > +		return ERR_PTR(-ENOMEM);
> > > > > > +
> > > > > > +	kref_init(&range->refcount);
> > > > > > +	range->gpusvm = gpusvm;
> > > > > > +	range->notifier = notifier;
> > > > > > +	range->va.start = ALIGN_DOWN(fault_addr,
> > > > > > chunk_size);
> > > > > > +	range->va.end = ALIGN(fault_addr + 1, chunk_size);
> > > > > > +	INIT_LIST_HEAD(&range->rb.entry);
> > > > > > +	range->notifier_seq = LONG_MAX;
> > > > > > +	range->flags.migrate_vram = migrate_vram ? 1 : 0;
> > > > > > +
> > > > > > +	return range;
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_check_pages - Check pages
> > > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > > > > + * @start: Start address
> > > > > > + * @end: End address
> > > > > > + *
> > > > > > + * Check if pages between start and end have been faulted in
> > > > > > on
> > > > > > the
> > > > > > CPU. Use to
> > > > > > + * prevent migration of pages without CPU backing store.
> > > > > > + *
> > > > > > + * Returns:
> > > > > > + * True if pages have been faulted into CPU, False otherwise
> > > > > > + */
> > > > > > +static bool drm_gpusvm_check_pages(struct drm_gpusvm
> > > > > > *gpusvm,
> > > > > > +				   struct
> > > > > > drm_gpusvm_notifier
> > > > > > *notifier,
> > > > > > +				   u64 start, u64 end)
> > > > > > +{
> > > > > > +	struct hmm_range hmm_range = {
> > > > > > +		.default_flags = 0,
> > > > > > +		.notifier = &notifier->notifier,
> > > > > > +		.start = start,
> > > > > > +		.end = end,
> > > > > > +		.dev_private_owner = gpusvm-
> > > > > > > device_private_page_owner,
> > > > > > +	};
> > > > > > +	unsigned long timeout =
> > > > > > +		jiffies +
> > > > > > msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > > > > > +	unsigned long *pfns;
> > > > > > +	unsigned long npages = npages_in_range(start, end);
> > > > > > +	int err, i;
> > > > > > +
> > > > > > +	mmap_assert_locked(gpusvm->mm);
> > > > > > +
> > > > > > +	pfns = kvmalloc_array(npages, sizeof(*pfns),
> > > > > > GFP_KERNEL);
> > > > > > +	if (!pfns)
> > > > > > +		return false;
> > > > > > +
> > > > > > +	hmm_range.notifier_seq =
> > > > > > mmu_interval_read_begin(&notifier-
> > > > > > > notifier);
> > > > > > +	hmm_range.hmm_pfns = pfns;
> > > > > > +
> > > > > > +	while (true) {
> > > > > > +		err = hmm_range_fault(&hmm_range);
> > > > > > +		if (err == -EBUSY) {
> > > > > > +			if (time_after(jiffies, timeout))
> > > > > > +				break;
> > > > > > +
> > > > > > +			hmm_range.notifier_seq =
> > > > > > mmu_interval_read_begin(&notifier->notifier);
> > > > > > +			continue;
> > > > > > +		}
> > > > > > +		break;
> > > > > > +	}
> > > > > > +	if (err)
> > > > > > +		goto err_free;
> > > > > > +
> > > > > > +	for (i = 0; i < npages; ++i) {
> > > > > > +		if (!(pfns[i] & HMM_PFN_VALID)) {
> > > > > > +			err = -EFAULT;
> > > > > > +			goto err_free;
> > > > > > +		}
> > > > > > +	}
> > > > > > +
> > > > > > +err_free:
> > > > > > +	kvfree(pfns);
> > > > > > +	return err ? false : true;
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_range_chunk_size - Determine chunk size for GPU SVM range
> > > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > > > > + * @vas: Pointer to the virtual memory area structure
> > > > > > + * @fault_addr: Fault address
> > > > > > + * @gpuva_start: Start address of GPUVA which mirrors CPU
> > > > > > + * @gpuva_end: End address of GPUVA which mirrors CPU
> > > > > > + * @check_pages: Flag indicating whether to check pages
> > > > > > + *
> > > > > > + * This function determines the chunk size for the GPU SVM range based on the
> > > > > > + * fault address, GPU SVM chunk sizes, existing GPU SVM ranges, and the
> > > > > > + * virtual memory area boundaries.
> > > > > > + *
> > > > > > + * Returns:
> > > > > > + * Chunk size on success, LONG_MAX on failure.
> > > > > > + */
> > > > > > +static u64 drm_gpusvm_range_chunk_size(struct drm_gpusvm
> > > > > > *gpusvm,
> > > > > > +				       struct
> > > > > > drm_gpusvm_notifier
> > > > > > *notifier,
> > > > > > +				       struct vm_area_struct
> > > > > > *vas,
> > > > > > +				       u64 fault_addr, u64
> > > > > > gpuva_start,
> > > > > > +				       u64 gpuva_end, bool
> > > > > > check_pages)
> > > > > > +{
> > > > > > +	u64 start, end;
> > > > > > +	int i = 0;
> > > > > > +
> > > > > > +retry:
> > > > > > +	for (; i < gpusvm->num_chunks; ++i) {
> > > > > > +		start = ALIGN_DOWN(fault_addr, gpusvm-
> > > > > > > chunk_sizes[i]);
> > > > > > +		end = ALIGN(fault_addr + 1, gpusvm-
> > > > > > > chunk_sizes[i]);
> > > > > > +
> > > > > > +		if (start >= vas->vm_start && end <= vas-
> > > > > > >vm_end
> > > > > > &&
> > > > > > +		    start >= notifier->interval.start &&
> > > > > > +		    end <= notifier->interval.end &&
> > > > > > +		    start >= gpuva_start && end <=
> > > > > > gpuva_end)
> > > > > > +			break;
> > > > > > +	}
> > > > > > +
> > > > > > +	if (i == gpusvm->num_chunks)
> > > > > > +		return LONG_MAX;
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * If the allocation is more than a page, ensure it does not overlap
> > > > > > +	 * with existing ranges.
> > > > > > +	 */
> > > > > > +	if (end - start != SZ_4K) {
> > > > > > +		struct drm_gpusvm_range *range;
> > > > > > +
> > > > > > +		range = drm_gpusvm_range_find(notifier,
> > > > > > start,
> > > > > > end);
> > > > > > +		if (range) {
> > > > > > +			++i;
> > > > > > +			goto retry;
> > > > > > +		}
> > > > > > +
> > > > > > +		/*
> > > > > > +		 * XXX: Only create range on pages CPU has faulted in. Without
> > > > > > +		 * this check, or prefault, on BMG 'xe_exec_system_allocator --r
> > > > > > +		 * process-many-malloc' fails. In the failure case, each process
> > > > > > +		 * mallocs 16k but the CPU VMA is ~128k which results in 64k SVM
> > > > > > +		 * ranges. When migrating the SVM ranges, some processes fail in
> > > > > > +		 * drm_gpusvm_migrate_to_vram with 'migrate.cpages != npages'
> > > > > > +		 * and then upon drm_gpusvm_range_get_pages device pages from
> > > > > > +		 * other processes are collected + faulted in which creates all
> > > > > > +		 * sorts of problems. Unsure exactly how this is happening; the
> > > > > > +		 * problem also goes away if 'xe_exec_system_allocator --r
> > > > > > +		 * process-many-malloc' mallocs at least 64k at a time.
> > > > > > +		 */
> > > > > > +		if (check_pages &&
> > > > > > +		    !drm_gpusvm_check_pages(gpusvm,
> > > > > > notifier,
> > > > > > start,
> > > > > > end)) {
> > > > > > +			++i;
> > > > > > +			goto retry;
> > > > > > +		}
> > > > > > +	}
> > > > > > +
> > > > > > +	return end - start;
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_range_find_or_insert - Find or insert GPU SVM range
> > > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > > + * @fault_addr: Fault address
> > > > > > + * @gpuva_start: Start address of GPUVA which mirrors CPU
> > > > > > + * @gpuva_end: End address of GPUVA which mirrors CPU
> > > > > > + * @ctx: GPU SVM context
> > > > > > + *
> > > > > > + * This function finds or inserts a newly allocated GPU SVM range based on the
> > > > > > + * fault address. Caller must hold a lock to protect range lookup and insertion.
> > > > > > + *
> > > > > > + * Returns:
> > > > > > + * Pointer to the GPU SVM range on success, ERR_PTR() on failure.
> > > > > > + */
> > > > > > +struct drm_gpusvm_range *
> > > > > > +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm,
> > > > > > u64
> > > > > > fault_addr,
> > > > > > +				u64 gpuva_start, u64
> > > > > > gpuva_end,
> > > > > > +				const struct drm_gpusvm_ctx
> > > > > > *ctx)
> > > > > > +{
> > > > > > +	struct drm_gpusvm_notifier *notifier;
> > > > > > +	struct drm_gpusvm_range *range;
> > > > > > +	struct mm_struct *mm = gpusvm->mm;
> > > > > > +	struct vm_area_struct *vas;
> > > > > > +	bool notifier_alloc = false;
> > > > > > +	u64 chunk_size;
> > > > > > +	int err;
> > > > > > +	bool migrate_vram;
> > > > > > +
> > > > > > +	if (fault_addr < gpusvm->mm_start ||
> > > > > > +	    fault_addr > gpusvm->mm_start + gpusvm-
> > > > > > >mm_range) {
> > > > > > +		err = -EINVAL;
> > > > > > +		goto err_out;
> > > > > > +	}
> > > > > > +
> > > > > > +	if (!ctx->mmap_locked) {
> > > > > > +		if (!mmget_not_zero(mm)) {
> > > > > > +			err = -EFAULT;
> > > > > > +			goto err_out;
> > > > > > +		}
> > > > > > +		mmap_write_lock(mm);
> > > > > > +	}
> > > > > > +
> > > > > > +	mmap_assert_write_locked(mm);
> > > > > > +
> > > > > > +	notifier = drm_gpusvm_notifier_find(gpusvm,
> > > > > > fault_addr);
> > > > > > +	if (!notifier) {
> > > > > > +		notifier = drm_gpusvm_notifier_alloc(gpusvm,
> > > > > > fault_addr);
> > > > > > +		if (IS_ERR(notifier)) {
> > > > > > +			err = PTR_ERR(notifier);
> > > > > > +			goto err_mmunlock;
> > > > > > +		}
> > > > > > +		notifier_alloc = true;
> > > > > > +		err =
> > > > > > mmu_interval_notifier_insert_locked(&notifier-
> > > > > > > notifier,
> > > > > > +							 
> > > > > > mm,
> > > > > > notifier->interval.start,
> > > > > > +							 
> > > > > > notifier-
> > > > > > > interval.end -
> > > > > > +							 
> > > > > > notifier-
> > > > > > > interval.start,
> > > > > > +							 
> > > > > > &drm_gpusvm_notifier_ops);
> > > > > > +		if (err)
> > > > > > +			goto err_notifier;
> > > > > > +	}
> > > > > > +
> > > > > > +	vas = vma_lookup(mm, fault_addr);
> > > > > > +	if (!vas) {
> > > > > > +		err = -ENOENT;
> > > > > > +		goto err_notifier_remove;
> > > > > > +	}
> > > > > > +
> > > > > > +	if (!ctx->read_only && !(vas->vm_flags & VM_WRITE))
> > > > > > {
> > > > > > +		err = -EPERM;
> > > > > > +		goto err_notifier_remove;
> > > > > > +	}
> > > > > > +
> > > > > > +	range = drm_gpusvm_range_find(notifier, fault_addr,
> > > > > > fault_addr + 1);
> > > > > > +	if (range)
> > > > > > +		goto out_mmunlock;
> > > > > > +	/*
> > > > > > +	 * XXX: Short-circuiting migration based on migrate_vma_* current
> > > > > > +	 * limitations. If/when migrate_vma_* add more support, this logic will
> > > > > > +	 * have to change.
> > > > > > +	 */
> > > > > > +	migrate_vram = ctx->vram_possible &&
> > > > > > +		vma_is_anonymous(vas) &&
> > > > > > !is_vm_hugetlb_page(vas);
> > > > > > +
> > > > > > +	chunk_size = drm_gpusvm_range_chunk_size(gpusvm,
> > > > > > notifier,
> > > > > > vas,
> > > > > > +						 fault_addr,
> > > > > > gpuva_start,
> > > > > > +						 gpuva_end,
> > > > > > migrate_vram &&
> > > > > > +						 !ctx-
> > > > > > > prefault);
> > > > > > +	if (chunk_size == LONG_MAX) {
> > > > > > +		err = -EINVAL;
> > > > > > +		goto err_notifier_remove;
> > > > > > +	}
> > > > > > +
> > > > > > +	range = drm_gpusvm_range_alloc(gpusvm, notifier,
> > > > > > fault_addr,
> > > > > > chunk_size,
> > > > > > +				       migrate_vram);
> > > > > > +	if (IS_ERR(range)) {
> > > > > > +		err = PTR_ERR(range);
> > > > > > +		goto err_notifier_remove;
> > > > > > +	}
> > > > > > +
> > > > > > +	drm_gpusvm_range_insert(notifier, range);
> > > > > > +	if (notifier_alloc)
> > > > > > +		drm_gpusvm_notifier_insert(gpusvm,
> > > > > > notifier);
> > > > > > +
> > > > > > +	if (ctx->prefault) {
> > > > > > +		struct drm_gpusvm_ctx __ctx = *ctx;
> > > > > > +
> > > > > > +		__ctx.mmap_locked = true;
> > > > > > +		err = drm_gpusvm_range_get_pages(gpusvm,
> > > > > > range,
> > > > > > &__ctx);
> > > > > > +		if (err)
> > > > > > +			goto err_range_remove;
> > > > > > +	}
> > > > > > +
> > > > > > +out_mmunlock:
> > > > > > +	if (!ctx->mmap_locked) {
> > > > > > +		mmap_write_unlock(mm);
> > > > > > +		mmput(mm);
> > > > > > +	}
> > > > > > +
> > > > > > +	return range;
> > > > > > +
> > > > > > +err_range_remove:
> > > > > > +	__drm_gpusvm_range_remove(notifier, range);
> > > > > > +err_notifier_remove:
> > > > > > +	if (notifier_alloc)
> > > > > > +		mmu_interval_notifier_remove(&notifier-
> > > > > > > notifier);
> > > > > > +err_notifier:
> > > > > > +	if (notifier_alloc)
> > > > > > +		drm_gpusvm_notifier_free(gpusvm, notifier);
> > > > > > +err_mmunlock:
> > > > > > +	if (!ctx->mmap_locked) {
> > > > > > +		mmap_write_unlock(mm);
> > > > > > +		mmput(mm);
> > > > > > +	}
> > > > > > +err_out:
> > > > > > +	return ERR_PTR(err);
> > > > > > +}
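
For reference, a minimal sketch of how a driver GPU page fault handler might
drive this (the xe_svm_handle_fault() name, the locking comments, and the bind
step are assumptions for illustration, not part of this patch):

static int xe_svm_handle_fault(struct drm_gpusvm *gpusvm, u64 fault_addr,
			       u64 gpuva_start, u64 gpuva_end)
{
	struct drm_gpusvm_ctx ctx = { .vram_possible = true };
	struct drm_gpusvm_range *range;
	int err;

	/* Caller-side lock protecting range lookup/insertion is assumed held */
	range = drm_gpusvm_range_find_or_insert(gpusvm, fault_addr,
						gpuva_start, gpuva_end, &ctx);
	if (IS_ERR(range))
		return PTR_ERR(range);

	/* Fault in CPU pages and DMA-map them for GPU access */
	err = drm_gpusvm_range_get_pages(gpusvm, range, &ctx);
	if (err)
		return err;

	/* Driver-specific: build and commit GPU page tables for the range */
	return 0;
}
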
> > > > > > +
> > > > > > +/**
> > > > > > + * for_each_dma_page - iterate over pages in a DMA region
> > > > > > + * @i__: the current page index in the iteration
> > > > > > + * @j__: the current page index, log order, in the iteration
> > > > > > + * @npages__: the total number of pages in the DMA region
> > > > > > + * @order__: the order of the pages in the DMA region
> > > > > > + *
> > > > > > + * This macro iterates over each page in a DMA region. The DMA region
> > > > > > + * is assumed to be composed of blocks of 2^@order__ pages, and the macro
> > > > > > + * will step through the region one block of 2^@order__ pages at a time.
> > > > > > + */
> > > > > > +#define for_each_dma_page(i__, j__, npages__, order__)	\
> > > > > > +	for ((i__) = 0, (j__) = 0; (i__) < (npages__);	\
> > > > > > +	     (j__)++, (i__) += 0x1 << (order__))
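
A short usage sketch of the macro (dev, dma_addr, npages and order are
placeholders): with npages = 8 and order = 2, the loop body runs twice with
(i, j) = (0, 0) and (4, 1), i.e. one iteration per 4-page block:

	for_each_dma_page(i, j, npages, order)
		dma_unmap_page(dev, dma_addr[j], PAGE_SIZE << order,
			       DMA_BIDIRECTIONAL);
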
> > > > > > +
> > > > > > +/**
> > > > > > + * __drm_gpusvm_range_unmap_pages - Unmap pages associated with a GPU SVM range (internal)
> > > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > > + * @range: Pointer to the GPU SVM range structure
> > > > > > + *
> > > > > > + * This function unmaps pages associated with a GPU SVM range. Assumes and
> > > > > > + * asserts correct locking is in place when called.
> > > > > > + */
> > > > > > +static void __drm_gpusvm_range_unmap_pages(struct drm_gpusvm
> > > > > > *gpusvm,
> > > > > > +					   struct
> > > > > > drm_gpusvm_range
> > > > > > *range)
> > > > > > +{
> > > > > > +	lockdep_assert_held(&gpusvm->notifier_lock);
> > > > > > +
> > > > > > +	if (range->pages) {
> > > > > > +		unsigned long i, j, npages =
> > > > > > npages_in_range(range-
> > > > > > > va.start,
> > > > > > +							    
> > > > > > range-
> > > > > > > va.end);
> > > > > > +
> > > > > > +		if (range->flags.has_dma_mapping) {
> > > > > > +			for_each_dma_page(i, j, npages,
> > > > > > range-
> > > > > > > order)
> > > > > > +				dma_unmap_page(gpusvm->drm-
> > > > > > >dev,
> > > > > > +					       range-
> > > > > > > dma_addr[j],
> > > > > > +					       PAGE_SIZE <<
> > > > > > range-
> > > > > > > order,
> > > > > > +					      
> > > > > > DMA_BIDIRECTIONAL);
> > > > > > +		}
> > > > > > +
> > > > > > +		range->flags.has_vram_pages = false;
> > > > > > +		range->flags.has_dma_mapping = false;
> > > > > > +	}
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_range_free_pages - Free pages associated with a GPU SVM range
> > > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > > + * @range: Pointer to the GPU SVM range structure
> > > > > > + *
> > > > > > + * This function frees pages associated with a GPU SVM range.
> > > > > > + */
> > > > > > +static void drm_gpusvm_range_free_pages(struct drm_gpusvm
> > > > > > *gpusvm,
> > > > > > +					struct
> > > > > > drm_gpusvm_range
> > > > > > *range)
> > > > > > +{
> > > > > > +	lockdep_assert_held(&gpusvm->notifier_lock);
> > > > > > +
> > > > > > +	if (range->pages) {
> > > > > > +		if (range->flags.kfree_mapping) {
> > > > > > +			kfree(range->dma_addr);
> > > > > > +			range->flags.kfree_mapping = false;
> > > > > > +			range->pages = NULL;
> > > > > > +		} else {
> > > > > > +			kvfree(range->pages);
> > > > > > +			range->pages = NULL;
> > > > > > +		}
> > > > > > +	}
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_range_remove - Remove GPU SVM range
> > > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > > + * @range: Pointer to the GPU SVM range to be removed
> > > > > > + *
> > > > > > + * This function removes the specified GPU SVM range and also removes the parent
> > > > > > + * GPU SVM notifier if no more ranges remain in the notifier. The caller must
> > > > > > + * hold a lock to protect range and notifier removal.
> > > > > > + */
> > > > > > +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> > > > > > +			     struct drm_gpusvm_range *range)
> > > > > > +{
> > > > > > +	struct drm_gpusvm_notifier *notifier;
> > > > > > +
> > > > > > +	notifier = drm_gpusvm_notifier_find(gpusvm, range-
> > > > > > > va.start);
> > > > > > +	if (WARN_ON_ONCE(!notifier))
> > > > > > +		return;
> > > > > > +
> > > > > > +	drm_gpusvm_notifier_lock(gpusvm);
> > > > > > +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > > > > > +	drm_gpusvm_range_free_pages(gpusvm, range);
> > > > > > +	__drm_gpusvm_range_remove(notifier, range);
> > > > > > +	drm_gpusvm_notifier_unlock(gpusvm);
> > > > > > +
> > > > > > +	drm_gpusvm_range_put(range);
> > > > > > +
> > > > > > +	if (RB_EMPTY_ROOT(&notifier->root.rb_root)) {
> > > > > > +		if (!notifier->flags.removed)
> > > > > > +			mmu_interval_notifier_remove(&notifi
> > > > > > er-
> > > > > > > notifier);
> > > > > > +		drm_gpusvm_notifier_remove(gpusvm,
> > > > > > notifier);
> > > > > > +		drm_gpusvm_notifier_free(gpusvm, notifier);
> > > > > > +	}
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_range_get - Get a reference to GPU SVM range
> > > > > > + * @range: Pointer to the GPU SVM range
> > > > > > + *
> > > > > > + * This function increments the reference count of the specified GPU SVM range.
> > > > > > + *
> > > > > > + * Returns:
> > > > > > + * Pointer to the GPU SVM range.
> > > > > > + */
> > > > > > +struct drm_gpusvm_range *
> > > > > > +drm_gpusvm_range_get(struct drm_gpusvm_range *range)
> > > > > > +{
> > > > > > +	kref_get(&range->refcount);
> > > > > > +
> > > > > > +	return range;
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_range_destroy - Destroy GPU SVM range
> > > > > > + * @refcount: Pointer to the reference counter embedded in the GPU SVM range
> > > > > > + *
> > > > > > + * This function destroys the specified GPU SVM range when its reference count
> > > > > > + * reaches zero. If a custom range-free function is provided, it is invoked to
> > > > > > + * free the range; otherwise, the range is deallocated using kfree().
> > > > > > + */
> > > > > > +static void drm_gpusvm_range_destroy(struct kref *refcount)
> > > > > > +{
> > > > > > +	struct drm_gpusvm_range *range =
> > > > > > +		container_of(refcount, struct
> > > > > > drm_gpusvm_range,
> > > > > > refcount);
> > > > > > +	struct drm_gpusvm *gpusvm = range->gpusvm;
> > > > > > +
> > > > > > +	if (gpusvm->ops->range_free)
> > > > > > +		gpusvm->ops->range_free(range);
> > > > > > +	else
> > > > > > +		kfree(range);
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_range_put - Put a reference to GPU SVM range
> > > > > > + * @range: Pointer to the GPU SVM range
> > > > > > + *
> > > > > > + * This function decrements the reference count of the specified GPU SVM range
> > > > > > + * and frees it when the count reaches zero.
> > > > > > + */
> > > > > > +void drm_gpusvm_range_put(struct drm_gpusvm_range *range)
> > > > > > +{
> > > > > > +	kref_put(&range->refcount,
> > > > > > drm_gpusvm_range_destroy);
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_range_pages_valid - GPU SVM range pages valid
> > > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > > + * @range: Pointer to the GPU SVM range structure
> > > > > > + *
> > > > > > + * This function determines if a GPU SVM range's pages are valid. Expected to be
> > > > > > + * called holding gpusvm->notifier_lock and as the last step before committing a
> > > > > > + * GPU binding.
> > > > > > + *
> > > > > > + * Returns:
> > > > > > + * True if GPU SVM range has valid pages, False otherwise
> > > > > > + */
> > > > > > +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> > > > > > +				  struct drm_gpusvm_range
> > > > > > *range)
> > > > > > +{
> > > > > > +	lockdep_assert_held(&gpusvm->notifier_lock);
> > > > > > +
> > > > > > +	return range->flags.has_vram_pages || range-
> > > > > > > flags.has_dma_mapping;
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_range_pages_valid_unlocked - GPU SVM range pages valid unlocked
> > > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > > + * @range: Pointer to the GPU SVM range structure
> > > > > > + *
> > > > > > + * This function determines if a GPU SVM range's pages are valid. Expected to be
> > > > > > + * called without holding gpusvm->notifier_lock.
> > > > > > + *
> > > > > > + * Returns:
> > > > > > + * True if GPU SVM range has valid pages, False otherwise
> > > > > > + */
> > > > > > +static bool
> > > > > > +drm_gpusvm_range_pages_valid_unlocked(struct drm_gpusvm
> > > > > > *gpusvm,
> > > > > > +				      struct
> > > > > > drm_gpusvm_range
> > > > > > *range)
> > > > > > +{
> > > > > > +	bool pages_valid;
> > > > > > +
> > > > > > +	if (!range->pages)
> > > > > > +		return false;
> > > > > > +
> > > > > > +	drm_gpusvm_notifier_lock(gpusvm);
> > > > > > +	pages_valid = drm_gpusvm_range_pages_valid(gpusvm,
> > > > > > range);
> > > > > > +	if (!pages_valid && range->flags.kfree_mapping) {
> > > > > > +		kfree(range->dma_addr);
> > > > > > +		range->flags.kfree_mapping = false;
> > > > > > +		range->pages = NULL;
> > > > > > +	}
> > > > > > +	drm_gpusvm_notifier_unlock(gpusvm);
> > > > > > +
> > > > > > +	return pages_valid;
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_range_get_pages - Get pages for a GPU SVM range
> > > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > > + * @range: Pointer to the GPU SVM range structure
> > > > > > + * @ctx: GPU SVM context
> > > > > > + *
> > > > > > + * This function gets pages for a GPU SVM range and ensures they are mapped for
> > > > > > + * DMA access.
> > > > > > + *
> > > > > > + * Returns:
> > > > > > + * 0 on success, negative error code on failure.
> > > > > > + */
> > > > > > +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> > > > > > +			       struct drm_gpusvm_range
> > > > > > *range,
> > > > > > +			       const struct drm_gpusvm_ctx
> > > > > > *ctx)
> > > > > > +{
> > > > > > +	struct mmu_interval_notifier *notifier = &range-
> > > > > > > notifier-
> > > > > > > notifier;
> > > > > > +	struct hmm_range hmm_range = {
> > > > > > +		.default_flags = HMM_PFN_REQ_FAULT | (ctx-
> > > > > > > read_only
> > > > > > ? 0 :
> > > > > > +			HMM_PFN_REQ_WRITE),
> > > > > > +		.notifier = notifier,
> > > > > > +		.start = range->va.start,
> > > > > > +		.end = range->va.end,
> > > > > > +		.dev_private_owner = gpusvm-
> > > > > > > device_private_page_owner,
> > > > > > +	};
> > > > > > +	struct mm_struct *mm = gpusvm->mm;
> > > > > > +	unsigned long timeout =
> > > > > > +		jiffies +
> > > > > > msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > > > > > +	unsigned long i, j;
> > > > > > +	unsigned long npages = npages_in_range(range-
> > > > > > >va.start,
> > > > > > range->va.end);
> > > > > > +	unsigned int order = 0;
> > > > > > +	unsigned long *pfns;
> > > > > > +	struct page **pages;
> > > > > > +	int err = 0;
> > > > > > +	bool vram_pages = !!range->flags.migrate_vram;
> > > > > > +	bool alloc_pfns = false, kfree_mapping;
> > > > > > +
> > > > > > +retry:
> > > > > > +	kfree_mapping = false;
> > > > > > +	hmm_range.notifier_seq =
> > > > > > mmu_interval_read_begin(notifier);
> > > > > > +	if (drm_gpusvm_range_pages_valid_unlocked(gpusvm,
> > > > > > range))
> > > > > > +		return 0;
> > > > > > +
> > > > > > +	if (range->notifier_seq == hmm_range.notifier_seq &&
> > > > > > range-
> > > > > > > pages) {
> > > > > > +		if (ctx->prefault)
> > > > > > +			return 0;
> > > > > > +
> > > > > > +		pfns = (unsigned long *)range->pages;
> > > > > > +		pages = range->pages;
> > > > > > +		goto map_pages;
> > > > > > +	}
> > > > > > +
> > > > > > +	if (!range->pages) {
> > > > > > +		pfns = kvmalloc_array(npages, sizeof(*pfns),
> > > > > > GFP_KERNEL);
> > > > > > +		if (!pfns)
> > > > > > +			return -ENOMEM;
> > > > > > +		alloc_pfns = true;
> > > > > > +	} else {
> > > > > > +		pfns = (unsigned long *)range->pages;
> > > > > > +	}
> > > > > > +
> > > > > > +	if (!ctx->mmap_locked) {
> > > > > > +		if (!mmget_not_zero(mm)) {
> > > > > > +			err = -EFAULT;
> > > > > > +			goto err_out;
> > > > > > +		}
> > > > > > +	}
> > > > > > +
> > > > > > +	hmm_range.hmm_pfns = pfns;
> > > > > > +	while (true) {
> > > > > > +		/* Must be checked after mmu_interval_read_begin */
> > > > > > +		if (range->flags.unmapped) {
> > > > > > +			err = -EFAULT;
> > > > > > +			break;
> > > > > > +		}
> > > > > > +
> > > > > > +		if (!ctx->mmap_locked) {
> > > > > > +			/*
> > > > > > +			 * XXX: The HMM locking document indicates only a read-lock
> > > > > > +			 * is required but there appears to be a window between
> > > > > > +			 * the MMU_NOTIFY_MIGRATE event triggered in a CPU fault
> > > > > > +			 * via migrate_vma_setup and the pages actually moving
> > > > > > +			 * in migrate_vma_finalize in which this code can grab
> > > > > > +			 * garbage pages. Grabbing the write-lock if the range
> > > > > > +			 * is attached to vram appears to protect against this
> > > > > > +			 * race.
> > > > > > +			 */
> > > > > > +			if (vram_pages)
> > > > > > +				mmap_write_lock(mm);
> > > > > > +			else
> > > > > > +				mmap_read_lock(mm);
> > > > > > +		}
> > > > > > +		err = hmm_range_fault(&hmm_range);
> > > > > > +		if (!ctx->mmap_locked) {
> > > > > > +			if (vram_pages)
> > > > > > +				mmap_write_unlock(mm);
> > > > > > +			else
> > > > > > +				mmap_read_unlock(mm);
> > > > > > +		}
> > > > > > +
> > > > > > +		if (err == -EBUSY) {
> > > > > > +			if (time_after(jiffies, timeout))
> > > > > > +				break;
> > > > > > +
> > > > > > +			hmm_range.notifier_seq =
> > > > > > mmu_interval_read_begin(notifier);
> > > > > > +			continue;
> > > > > > +		}
> > > > > > +		break;
> > > > > > +	}
> > > > > > +	if (!ctx->mmap_locked)
> > > > > > +		mmput(mm);
> > > > > > +	if (err)
> > > > > > +		goto err_free;
> > > > > > +
> > > > > > +	pages = (struct page **)pfns;
> > > > > > +
> > > > > > +	if (ctx->prefault) {
> > > > > > +		range->pages = pages;
> > > > > > +		goto set_seqno;
> > > > > > +	}
> > > > > > +
> > > > > > +map_pages:
> > > > > > +	if
> > > > > > (is_device_private_page(hmm_pfn_to_page(pfns[0]))) {
> > > > > > +		WARN_ON_ONCE(!range->vram_allocation);
> > > > > > +
> > > > > > +		for (i = 0; i < npages; ++i) {
> > > > > > +			pages[i] = hmm_pfn_to_page(pfns[i]);
> > > > > > +
> > > > > > +			if
> > > > > > (WARN_ON_ONCE(!is_device_private_page(pages[i]))) {
> > > > > > +				err = -EOPNOTSUPP;
> > > > > > +				goto err_free;
> > > > > > +			}
> > > > > > +		}
> > > > > > +
> > > > > > +		/* Do not race with notifier unmapping pages
> > > > > > */
> > > > > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > > > > +		range->flags.has_vram_pages = true;
> > > > > > +		range->pages = pages;
> > > > > > +		if (mmu_interval_read_retry(notifier,
> > > > > > hmm_range.notifier_seq)) {
> > > > > > +			err = -EAGAIN;
> > > > > > +			__drm_gpusvm_range_unmap_pages(gpusv
> > > > > > m,
> > > > > > range);
> > > > > > +		}
> > > > > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > > > > +	} else {
> > > > > > +		dma_addr_t *dma_addr = (dma_addr_t *)pfns;
> > > > > > +
> > > > > > +		for_each_dma_page(i, j, npages, order) {
> > > > > > +			if (WARN_ON_ONCE(i && order !=
> > > > > > +					
> > > > > > hmm_pfn_to_map_order(pfns[i]))) {
> > > > > > +				err = -EOPNOTSUPP;
> > > > > > +				npages = i;
> > > > > > +				goto err_unmap;
> > > > > > +			}
> > > > > > +			order =
> > > > > > hmm_pfn_to_map_order(pfns[i]);
> > > > > > +
> > > > > > +			pages[j] = hmm_pfn_to_page(pfns[i]);
> > > > > > +			if
> > > > > > (WARN_ON_ONCE(is_zone_device_page(pages[j]))) {
> > > > > > +				err = -EOPNOTSUPP;
> > > > > > +				npages = i;
> > > > > > +				goto err_unmap;
> > > > > > +			}
> > > > > > +
> > > > > > +			set_page_dirty_lock(pages[j]);
> > > > > > +			mark_page_accessed(pages[j]);
> > > > > > +
> > > > > > +			dma_addr[j] = dma_map_page(gpusvm-
> > > > > > >drm-
> > > > > > > dev,
> > > > > > +						   pages[j],
> > > > > > 0,
> > > > > > +						   PAGE_SIZE
> > > > > > <<
> > > > > > order,
> > > > > > +						  
> > > > > > DMA_BIDIRECTIONAL);
> > > > > > +			if (dma_mapping_error(gpusvm->drm-
> > > > > > >dev,
> > > > > > dma_addr[j])) {
> > > > > > +				err = -EFAULT;
> > > > > > +				npages = i;
> > > > > > +				goto err_unmap;
> > > > > > +			}
> > > > > > +		}
> > > > > > +
> > > > > > +		/* Huge pages, reduce memory footprint */
> > > > > > +		if (order) {
> > > > > > +			dma_addr = kmalloc_array(j,
> > > > > > sizeof(*dma_addr),
> > > > > > +						
> > > > > > GFP_KERNEL);
> > > > > > +			if (dma_addr) {
> > > > > > +				for (i = 0; i < j; ++i)
> > > > > > +					dma_addr[i] =
> > > > > > (dma_addr_t)pfns[i];
> > > > > > +				kvfree(pfns);
> > > > > > +				kfree_mapping = true;
> > > > > > +			} else {
> > > > > > +				dma_addr = (dma_addr_t
> > > > > > *)pfns;
> > > > > > +			}
> > > > > > +		}
> > > > > > +
> > > > > > +		/* Do not race with notifier unmapping pages
> > > > > > */
> > > > > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > > > > +		range->order = order;
> > > > > > +		range->flags.kfree_mapping = kfree_mapping;
> > > > > > +		range->flags.has_dma_mapping = true;
> > > > > > +		range->dma_addr = dma_addr;
> > > > > > +		range->vram_allocation = NULL;
> > > > > > +		if (mmu_interval_read_retry(notifier,
> > > > > > hmm_range.notifier_seq)) {
> > > > > > +			err = -EAGAIN;
> > > > > > +			__drm_gpusvm_range_unmap_pages(gpusv
> > > > > > m,
> > > > > > range);
> > > > > > +		}
> > > > > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > > > > +	}
> > > > > > +
> > > > > > +	if (err == -EAGAIN)
> > > > > > +		goto retry;
> > > > > > +set_seqno:
> > > > > > +	range->notifier_seq = hmm_range.notifier_seq;
> > > > > > +
> > > > > > +	return 0;
> > > > > > +
> > > > > > +err_unmap:
> > > > > > +	for_each_dma_page(i, j, npages, order)
> > > > > > +		dma_unmap_page(gpusvm->drm->dev,
> > > > > > +			       (dma_addr_t)pfns[j],
> > > > > > +			       PAGE_SIZE << order,
> > > > > > DMA_BIDIRECTIONAL);
> > > > > > +err_free:
> > > > > > +	if (alloc_pfns)
> > > > > > +		kvfree(pfns);
> > > > > > +err_out:
> > > > > > +	return err;
> > > > > > +}
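
A sketch of the bind flow this is meant to fit into, following the
drm_gpusvm_range_pages_valid() kernel doc above (xe_bind_range_locked() is a
hypothetical driver helper; the retry policy is an assumption):

retry:
	err = drm_gpusvm_range_get_pages(gpusvm, range, &ctx);
	if (err)
		return err;

	drm_gpusvm_notifier_lock(gpusvm);
	if (!drm_gpusvm_range_pages_valid(gpusvm, range)) {
		/* Invalidated between get_pages and bind, start over */
		drm_gpusvm_notifier_unlock(gpusvm);
		goto retry;
	}
	err = xe_bind_range_locked(range);	/* driver-specific commit */
	drm_gpusvm_notifier_unlock(gpusvm);
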
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_range_unmap_pages - Unmap pages associated with a GPU SVM range
> > > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > > + * @range: Pointer to the GPU SVM range structure
> > > > > > + * @ctx: GPU SVM context
> > > > > > + *
> > > > > > + * This function unmaps pages associated with a GPU SVM range. If @in_notifier
> > > > > > + * is set, it is assumed that gpusvm->notifier_lock is held in write mode; if it
> > > > > > + * is clear, it acquires gpusvm->notifier_lock in read mode. Must be called on
> > > > > > + * each GPU SVM range attached to notifier in gpusvm->ops->invalidate for IOMMU
> > > > > > + * security model.
> > > > > > + */
> > > > > > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> > > > > > +				  struct drm_gpusvm_range
> > > > > > *range,
> > > > > > +				  const struct
> > > > > > drm_gpusvm_ctx
> > > > > > *ctx)
> > > > > > +{
> > > > > > +	if (ctx->in_notifier)
> > > > > > +		lockdep_assert_held_write(&gpusvm-
> > > > > > > notifier_lock);
> > > > > > +	else
> > > > > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > > > > +
> > > > > > +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > > > > > +
> > > > > > +	if (!ctx->in_notifier)
> > > > > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > > > > +}
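
For illustration, a sketch of the invalidate path the doc above requires (the
callback signature and the xe_svm_invalidate() name are assumptions; the real
driver hook comes with the Xe patches later in the series):

static void xe_svm_invalidate(struct drm_gpusvm *gpusvm,
			      struct drm_gpusvm_notifier *notifier,
			      const struct mmu_notifier_range *mmu_range)
{
	struct drm_gpusvm_ctx ctx = { .in_notifier = true };
	struct drm_gpusvm_range *range = NULL;

	/* Driver-specific: zap GPU page tables first, then drop the mappings */
	drm_gpusvm_for_each_range(range, notifier, mmu_range->start,
				  mmu_range->end)
		drm_gpusvm_range_unmap_pages(gpusvm, range, &ctx);
}
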
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_migration_put_page - Put a migration page
> > > > > > + * @page: Pointer to the page to put
> > > > > > + *
> > > > > > + * This function unlocks and puts a page.
> > > > > > + */
> > > > > > +static void drm_gpusvm_migration_put_page(struct page *page)
> > > > > > +{
> > > > > > +	unlock_page(page);
> > > > > > +	put_page(page);
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_migration_put_pages - Put migration pages
> > > > > > + * @npages: Number of pages
> > > > > > + * @migrate_pfn: Array of migrate page frame numbers
> > > > > > + *
> > > > > > + * This function puts an array of pages.
> > > > > > + */
> > > > > > +static void drm_gpusvm_migration_put_pages(unsigned long
> > > > > > npages,
> > > > > > +					   unsigned long
> > > > > > *migrate_pfn)
> > > > > > +{
> > > > > > +	unsigned long i;
> > > > > > +
> > > > > > +	for (i = 0; i < npages; ++i) {
> > > > > > +		if (!migrate_pfn[i])
> > > > > > +			continue;
> > > > > > +
> > > > > > +		drm_gpusvm_migration_put_page(migrate_pfn_to
> > > > > > _pag
> > > > > > e(mi
> > > > > > grate_pfn[i]));
> > > > > > +		migrate_pfn[i] = 0;
> > > > > > +	}
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_get_vram_page - Get a reference to a VRAM page
> > > > > > + * @page: Pointer to the page
> > > > > > + * @zdd: Pointer to the GPU SVM zone device data
> > > > > > + *
> > > > > > + * This function associates the given page with the specified GPU SVM zone
> > > > > > + * device data and initializes it for zone device usage.
> > > > > > + */
> > > > > > +static void drm_gpusvm_get_vram_page(struct page *page,
> > > > > > +				     struct drm_gpusvm_zdd
> > > > > > *zdd)
> > > > > > +{
> > > > > > +	page->zone_device_data = drm_gpusvm_zdd_get(zdd);
> > > > > > +	zone_device_page_init(page);
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_migrate_map_pages() - Map migration pages for GPU SVM migration
> > > > > > + * @dev: The device for which the pages are being mapped
> > > > > > + * @dma_addr: Array to store DMA addresses corresponding to mapped pages
> > > > > > + * @migrate_pfn: Array of migrate page frame numbers to map
> > > > > > + * @npages: Number of pages to map
> > > > > > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > > > > > + *
> > > > > > + * This function maps pages of memory for migration usage in GPU SVM. It
> > > > > > + * iterates over each page frame number provided in @migrate_pfn, maps the
> > > > > > + * corresponding page, and stores the DMA address in the provided @dma_addr
> > > > > > + * array.
> > > > > > + *
> > > > > > + * Return: 0 on success, -EFAULT if an error occurs during mapping.
> > > > > > + */
> > > > > > +static int drm_gpusvm_migrate_map_pages(struct device *dev,
> > > > > > +					dma_addr_t
> > > > > > *dma_addr,
> > > > > > +					long unsigned int
> > > > > > *migrate_pfn,
> > > > > > +					unsigned long
> > > > > > npages,
> > > > > > +					enum
> > > > > > dma_data_direction
> > > > > > dir)
> > > > > > +{
> > > > > > +	unsigned long i;
> > > > > > +
> > > > > > +	for (i = 0; i < npages; ++i) {
> > > > > > +		struct page *page =
> > > > > > migrate_pfn_to_page(migrate_pfn[i]);
> > > > > > +
> > > > > > +		if (!page)
> > > > > > +			continue;
> > > > > > +
> > > > > > +		if (WARN_ON_ONCE(is_zone_device_page(page)))
> > > > > > +			return -EFAULT;
> > > > > > +
> > > > > > +		dma_addr[i] = dma_map_page(dev, page, 0,
> > > > > > PAGE_SIZE,
> > > > > > dir);
> > > > > > +		if (dma_mapping_error(dev, dma_addr[i]))
> > > > > > +			return -EFAULT;
> > > > > > +	}
> > > > > > +
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_migrate_unmap_pages() - Unmap pages previously mapped for GPU SVM migration
> > > > > > + * @dev: The device for which the pages were mapped
> > > > > > + * @dma_addr: Array of DMA addresses corresponding to mapped pages
> > > > > > + * @npages: Number of pages to unmap
> > > > > > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > > > > > + *
> > > > > > + * This function unmaps previously mapped pages of memory for GPU Shared Virtual
> > > > > > + * Memory (SVM). It iterates over each DMA address provided in @dma_addr, checks
> > > > > > + * if it's valid and not already unmapped, and unmaps the corresponding page.
> > > > > > + */
> > > > > > +static void drm_gpusvm_migrate_unmap_pages(struct device
> > > > > > *dev,
> > > > > > +					   dma_addr_t
> > > > > > *dma_addr,
> > > > > > +					   unsigned long
> > > > > > npages,
> > > > > > +					   enum
> > > > > > dma_data_direction
> > > > > > dir)
> > > > > > +{
> > > > > > +	unsigned long i;
> > > > > > +
> > > > > > +	for (i = 0; i < npages; ++i) {
> > > > > > +		if (!dma_addr[i] || dma_mapping_error(dev,
> > > > > > dma_addr[i]))
> > > > > > +			continue;
> > > > > > +
> > > > > > +		dma_unmap_page(dev, dma_addr[i], PAGE_SIZE,
> > > > > > dir);
> > > > > > +	}
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_migrate_to_vram - Migrate GPU SVM range to VRAM
> > > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > > + * @range: Pointer to the GPU SVM range structure
> > > > > > + * @vram_allocation: Driver-private pointer to the VRAM allocation. The caller
> > > > > > + *                   should hold a reference to the VRAM allocation, which
> > > > > > + *                   should be dropped via ops->vram_release or upon the
> > > > > > + *                   failure of this function.
> > > > > > + * @ctx: GPU SVM context
> > > > > > + *
> > > > > > + * This function migrates the specified GPU SVM range to VRAM. It performs the
> > > > > > + * necessary setup and invokes the driver-specific operations for migration to
> > > > > > + * VRAM. Upon successful return, @vram_allocation can safely reference @range
> > > > > > + * until ops->vram_release is called, which only happens upon successful return.
> > > > > > + *
> > > > > > + * Returns:
> > > > > > + * 0 on success, negative error code on failure.
> > > > > > + */
> > > > > > +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> > > > > > +			       struct drm_gpusvm_range
> > > > > > *range,
> > > > > > +			       void *vram_allocation,
> > > > > > +			       const struct drm_gpusvm_ctx
> > > > > > *ctx)
> > > > > > +{
> > > > > > +	u64 start = range->va.start, end = range->va.end;
> > > > > > +	struct migrate_vma migrate = {
> > > > > > +		.start		= start,
> > > > > > +		.end		= end,
> > > > > > +		.pgmap_owner	= gpusvm-
> > > > > > > device_private_page_owner,
> > > > > > +		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
> > > > > > +	};
> > > > > > +	struct mm_struct *mm = gpusvm->mm;
> > > > > > +	unsigned long i, npages = npages_in_range(start,
> > > > > > end);
> > > > > > +	struct vm_area_struct *vas;
> > > > > > +	struct drm_gpusvm_zdd *zdd = NULL;
> > > > > > +	struct page **pages;
> > > > > > +	dma_addr_t *dma_addr;
> > > > > > +	void *buf;
> > > > > > +	int err;
> > > > > > +
> > > > > > +	if (!range->flags.migrate_vram)
> > > > > > +		return -EINVAL;
> > > > > > +
> > > > > > +	if (!gpusvm->ops->populate_vram_pfn || !gpusvm->ops-
> > > > > > > copy_to_vram ||
> > > > > > +	    !gpusvm->ops->copy_to_sram)
> > > > > > +		return -EOPNOTSUPP;
> > > > > > +
> > > > > > +	if (!ctx->mmap_locked) {
> > > > > > +		if (!mmget_not_zero(mm)) {
> > > > > > +			err = -EFAULT;
> > > > > > +			goto err_out;
> > > > > > +		}
> > > > > > +		mmap_write_lock(mm);
> > > > > > +	}
> > > > > > +
> > > > > > +	mmap_assert_locked(mm);
> > > > > > +
> > > > > > +	vas = vma_lookup(mm, start);
> > > > > > +	if (!vas) {
> > > > > > +		err = -ENOENT;
> > > > > > +		goto err_mmunlock;
> > > > > > +	}
> > > > > > +
> > > > > > +	if (end > vas->vm_end || start < vas->vm_start) {
> > > > > > +		err = -EINVAL;
> > > > > > +		goto err_mmunlock;
> > > > > > +	}
> > > > > > +
> > > > > > +	if (!vma_is_anonymous(vas)) {
> > > > > > +		err = -EBUSY;
> > > > > > +		goto err_mmunlock;
> > > > > > +	}
> > > > > > +
> > > > > > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) +
> > > > > > sizeof(*dma_addr) +
> > > > > > +		       sizeof(*pages), GFP_KERNEL);
> > > > > > +	if (!buf) {
> > > > > > +		err = -ENOMEM;
> > > > > > +		goto err_mmunlock;
> > > > > > +	}
> > > > > > +	dma_addr = buf + (2 * sizeof(*migrate.src) *
> > > > > > npages);
> > > > > > +	pages = buf + (2 * sizeof(*migrate.src) +
> > > > > > sizeof(*dma_addr))
> > > > > > * npages;
> > > > > > +
> > > > > > +	zdd = drm_gpusvm_zdd_alloc(range);
> > > > > > +	if (!zdd) {
> > > > > > +		err = -ENOMEM;
> > > > > > +		goto err_free;
> > > > > > +	}
> > > > > > +
> > > > > > +	migrate.vma = vas;
> > > > > > +	migrate.src = buf;
> > > > > > +	migrate.dst = migrate.src + npages;
> > > > > > +
> > > > > > +	err = migrate_vma_setup(&migrate);
> > > > > > +	if (err)
> > > > > > +		goto err_free;
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * FIXME: The cases below, !migrate.cpages and migrate.cpages != npages, are
> > > > > > +	 * not always an error. Need to revisit possible cases and how to handle
> > > > > > +	 * them. We could prefault on migrate.cpages != npages via hmm_range_fault.
> > > > > > +	 */
> > > > > > +
> > > > > > +	if (!migrate.cpages) {
> > > > > > +		err = -EFAULT;
> > > > > > +		goto err_free;
> > > > > > +	}
> > > > > > +
> > > > > > +	if (migrate.cpages != npages) {
> > > > > > +		err = -EBUSY;
> > > > > > +		goto err_finalize;
> > > > > > +	}
> > > > > > +
> > > > > > +	err = gpusvm->ops->populate_vram_pfn(gpusvm,
> > > > > > vram_allocation, npages,
> > > > > > +					     migrate.dst);
> > > > > > +	if (err)
> > > > > > +		goto err_finalize;
> > > > > > +
> > > > > > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> > > > > > dma_addr,
> > > > > > +					   migrate.src,
> > > > > > npages,
> > > > > > DMA_TO_DEVICE);
> > > > > > +	if (err)
> > > > > > +		goto err_finalize;
> > > > > > +
> > > > > > +	for (i = 0; i < npages; ++i) {
> > > > > > +		struct page *page =
> > > > > > pfn_to_page(migrate.dst[i]);
> > > > > > +
> > > > > > +		pages[i] = page;
> > > > > > +		migrate.dst[i] =
> > > > > > migrate_pfn(migrate.dst[i]);
> > > > > > +		drm_gpusvm_get_vram_page(page, zdd);
> > > > > > +	}
> > > > > > +
> > > > > > +	err = gpusvm->ops->copy_to_vram(gpusvm, pages,
> > > > > > dma_addr,
> > > > > > npages);
> > > > > > +	if (err)
> > > > > > +		goto err_finalize;
> > > > > > +
> > > > > > +	/* Upon success bind vram allocation to range and zdd */
> > > > > > +	range->vram_allocation = vram_allocation;
> > > > > > +	WRITE_ONCE(zdd->vram_allocation,
> > > > > > vram_allocation);	/*
> > > > > > Owns ref */
> > > > > > +
> > > > > > +err_finalize:
> > > > > > +	if (err)
> > > > > > +		drm_gpusvm_migration_put_pages(npages,
> > > > > > migrate.dst);
> > > > > > +	migrate_vma_pages(&migrate);
> > > > > > +	migrate_vma_finalize(&migrate);
> > > > > > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev,
> > > > > > dma_addr,
> > > > > > npages,
> > > > > > +				       DMA_TO_DEVICE);
> > > > > > +err_free:
> > > > > > +	if (zdd)
> > > > > > +		drm_gpusvm_zdd_put(zdd);
> > > > > > +	kvfree(buf);
> > > > > > +err_mmunlock:
> > > > > > +	if (!ctx->mmap_locked) {
> > > > > > +		mmap_write_unlock(mm);
> > > > > > +		mmput(mm);
> > > > > > +	}
> > > > > > +err_out:
> > > > > > +	return err;
> > > > > > +}
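
A minimal caller sketch (xe_vram_alloc()/xe_vram_free() are hypothetical
driver helpers and the fallback policy is an assumption): a failed migration
is treated as non-fatal here, since the range can still be mapped from SRAM
pages.

	vram = xe_vram_alloc(gpusvm, range);		/* driver-specific */
	if (!IS_ERR(vram)) {
		err = drm_gpusvm_migrate_to_vram(gpusvm, range, vram, &ctx);
		if (err)
			xe_vram_free(vram);		/* keep SRAM backing */
	}
	err = drm_gpusvm_range_get_pages(gpusvm, range, &ctx);
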
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_migrate_populate_sram_pfn - Populate SRAM PFNs for a VM area
> > > > > > + * @vas: Pointer to the VM area structure, can be NULL
> > > > > > + * @npages: Number of pages to populate
> > > > > > + * @src_mpfn: Source array of migrate PFNs
> > > > > > + * @mpfn: Array of migrate PFNs to populate
> > > > > > + * @addr: Start address for PFN allocation
> > > > > > + *
> > > > > > + * This function populates the SRAM migrate page frame numbers (PFNs) for the
> > > > > > + * specified VM area structure. It allocates and locks pages in the VM area for
> > > > > > + * SRAM usage. If @vas is non-NULL, alloc_page_vma is used for allocation;
> > > > > > + * if it is NULL, alloc_page is used.
> > > > > > + *
> > > > > > + * Returns:
> > > > > > + * 0 on success, negative error code on failure.
> > > > > > + */
> > > > > > +static int drm_gpusvm_migrate_populate_sram_pfn(struct
> > > > > > vm_area_struct *vas,
> > > > > > +						unsigned
> > > > > > long
> > > > > > npages,
> > > > > > +						unsigned
> > > > > > long
> > > > > > *src_mpfn,
> > > > > > +						unsigned
> > > > > > long
> > > > > > *mpfn,
> > > > > > u64 addr)
> > > > > > +{
> > > > > > +	unsigned long i;
> > > > > > +
> > > > > > +	for (i = 0; i < npages; ++i, addr += PAGE_SIZE) {
> > > > > > +		struct page *page;
> > > > > > +
> > > > > > +		if (!(src_mpfn[i] & MIGRATE_PFN_MIGRATE))
> > > > > > +			continue;
> > > > > > +
> > > > > > +		if (vas)
> > > > > > +			page = alloc_page_vma(GFP_HIGHUSER,
> > > > > > vas,
> > > > > > addr);
> > > > > > +		else
> > > > > > +			page = alloc_page(GFP_HIGHUSER);
> > > > > > +
> > > > > > +		if (!page)
> > > > > > +			return -ENOMEM;
> > > > > > +
> > > > > > +		lock_page(page);
> > > > > > +		mpfn[i] = migrate_pfn(page_to_pfn(page));
> > > > > > +	}
> > > > > > +
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_evict_to_sram - Evict GPU SVM range to SRAM
> > > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > > + * @range: Pointer to the GPU SVM range structure
> > > > > > + *
> > > > > > + * Similar to __drm_gpusvm_migrate_to_sram but does not require the mmap lock;
> > > > > > + * migration is done via the migrate_device_* functions. Fallback path, as it is
> > > > > > + * preferred to issue migrations with the mmap lock held.
> > > > > > + *
> > > > > > + * Returns:
> > > > > > + * 0 on success, negative error code on failure.
> > > > > > + */
> > > > > > +static int drm_gpusvm_evict_to_sram(struct drm_gpusvm
> > > > > > *gpusvm,
> > > > > > +				    struct drm_gpusvm_range
> > > > > > *range)
> > > > > > +{
> > > > > > +	unsigned long npages;
> > > > > > +	struct page **pages;
> > > > > > +	unsigned long *src, *dst;
> > > > > > +	dma_addr_t *dma_addr;
> > > > > > +	void *buf;
> > > > > > +	int i, err = 0;
> > > > > > +
> > > > > > +	npages = npages_in_range(range->va.start, range-
> > > > > > > va.end);
> > > > > > +
> > > > > > +	buf = kvcalloc(npages, 2 * sizeof(*src) +
> > > > > > sizeof(*dma_addr)
> > > > > > +
> > > > > > +		       sizeof(*pages), GFP_KERNEL);
> > > > > > +	if (!buf) {
> > > > > > +		err = -ENOMEM;
> > > > > > +		goto err_out;
> > > > > > +	}
> > > > > > +	src = buf;
> > > > > > +	dst = buf + (sizeof(*src) * npages);
> > > > > > +	dma_addr = buf + (2 * sizeof(*src) * npages);
> > > > > > +	pages = buf + (2 * sizeof(*src) + sizeof(*dma_addr))
> > > > > > *
> > > > > > npages;
> > > > > > +
> > > > > > +	err = gpusvm->ops->populate_vram_pfn(gpusvm, range-
> > > > > > > vram_allocation,
> > > > > > +					     npages, src);
> > > > > > +	if (err)
> > > > > > +		goto err_free;
> > > > > > +
> > > > > > +	err = migrate_device_vma_range(gpusvm->mm,
> > > > > > +				       gpusvm-
> > > > > > > device_private_page_owner, src,
> > > > > > +				       npages, range-
> > > > > > >va.start);
> > > > > > +	if (err)
> > > > > > +		goto err_free;
> > > > > > +
> > > > > > +	err = drm_gpusvm_migrate_populate_sram_pfn(NULL,
> > > > > > npages,
> > > > > > src, dst, 0);
> > > > > > +	if (err)
> > > > > > +		goto err_finalize;
> > > > > > +
> > > > > > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> > > > > > dma_addr,
> > > > > > +					   dst, npages,
> > > > > > DMA_BIDIRECTIONAL);
> > > > > > +	if (err)
> > > > > > +		goto err_finalize;
> > > > > > +
> > > > > > +	for (i = 0; i < npages; ++i)
> > > > > > +		pages[i] = migrate_pfn_to_page(src[i]);
> > > > > > +
> > > > > > +	err = gpusvm->ops->copy_to_sram(gpusvm, pages,
> > > > > > dma_addr,
> > > > > > npages);
> > > > > > +	if (err)
> > > > > > +		goto err_finalize;
> > > > > > +
> > > > > > +err_finalize:
> > > > > > +	if (err)
> > > > > > +		drm_gpusvm_migration_put_pages(npages, dst);
> > > > > > +	migrate_device_pages(src, dst, npages);
> > > > > > +	migrate_device_finalize(src, dst, npages);
> > > > > > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev,
> > > > > > dma_addr,
> > > > > > npages,
> > > > > > +				       DMA_BIDIRECTIONAL);
> > > > > > +err_free:
> > > > > > +	kvfree(buf);
> > > > > > +err_out:
> > > > > > +
> > > > > > +	return err;
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * __drm_gpusvm_migrate_to_sram - Migrate GPU SVM range to SRAM (internal)
> > > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > > + * @vas: Pointer to the VM area structure
> > > > > > + * @page: Pointer to the page for fault handling (can be NULL)
> > > > > > + * @start: Start address of the migration range
> > > > > > + * @end: End address of the migration range
> > > > > > + *
> > > > > > + * This internal function performs the migration of the specified GPU SVM range
> > > > > > + * to SRAM. It sets up the migration, populates + dma maps SRAM PFNs, and
> > > > > > + * invokes the driver-specific operations for migration to SRAM.
> > > > > > + *
> > > > > > + * Returns:
> > > > > > + * 0 on success, negative error code on failure.
> > > > > > + */
> > > > > > +static int __drm_gpusvm_migrate_to_sram(struct drm_gpusvm
> > > > > > *gpusvm,
> > > > > > +					struct
> > > > > > vm_area_struct
> > > > > > *vas,
> > > > > > +					struct page *page,
> > > > > > +					u64 start, u64 end)
> > > > > > +{
> > > > > > +	struct migrate_vma migrate = {
> > > > > > +		.vma		= vas,
> > > > > > +		.pgmap_owner	= gpusvm-
> > > > > > > device_private_page_owner,
> > > > > > +		.flags		=
> > > > > > MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> > > > > > +		.fault_page	= page,
> > > > > > +	};
> > > > > > +	unsigned long npages;
> > > > > > +	struct page **pages;
> > > > > > +	dma_addr_t *dma_addr;
> > > > > > +	void *buf;
> > > > > > +	int i, err = 0;
> > > > > > +
> > > > > > +	mmap_assert_locked(gpusvm->mm);
> > > > > > +
> > > > > > +	/* Corner case where the VM area struct has been partially unmapped */
> > > > > > +	if (start < vas->vm_start)
> > > > > > +		start = vas->vm_start;
> > > > > > +	if (end > vas->vm_end)
> > > > > > +		end = vas->vm_end;
> > > > > > +
> > > > > > +	migrate.start = start;
> > > > > > +	migrate.end = end;
> > > > > > +	npages = npages_in_range(start, end);
> > > > > > +
> > > > > > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) +
> > > > > > sizeof(*dma_addr) +
> > > > > > +		       sizeof(*pages), GFP_KERNEL);
> > > > > > +	if (!buf) {
> > > > > > +		err = -ENOMEM;
> > > > > > +		goto err_out;
> > > > > > +	}
> > > > > > +	dma_addr = buf + (2 * sizeof(*migrate.src) *
> > > > > > npages);
> > > > > > +	pages = buf + (2 * sizeof(*migrate.src) +
> > > > > > sizeof(*dma_addr))
> > > > > > * npages;
> > > > > > +
> > > > > > +	migrate.vma = vas;
> > > > > > +	migrate.src = buf;
> > > > > > +	migrate.dst = migrate.src + npages;
> > > > > > +
> > > > > > +	err = migrate_vma_setup(&migrate);
> > > > > > +	if (err)
> > > > > > +		goto err_free;
> > > > > > +
> > > > > > +	/* Raced with another CPU fault, nothing to do */
> > > > > > +	if (!migrate.cpages)
> > > > > > +		goto err_free;
> > > > > > +
> > > > > > +	err = drm_gpusvm_migrate_populate_sram_pfn(vas,
> > > > > > npages,
> > > > > > +						  
> > > > > > migrate.src,
> > > > > > migrate.dst,
> > > > > > +						   start);
> > > > > > +	if (err)
> > > > > > +		goto err_finalize;
> > > > > > +
> > > > > > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> > > > > > dma_addr,
> > > > > > +					   migrate.dst,
> > > > > > npages,
> > > > > > +					  
> > > > > > DMA_BIDIRECTIONAL);
> > > > > > +	if (err)
> > > > > > +		goto err_finalize;
> > > > > > +
> > > > > > +	for (i = 0; i < npages; ++i)
> > > > > > +		pages[i] =
> > > > > > migrate_pfn_to_page(migrate.src[i]);
> > > > > > +
> > > > > > +	err = gpusvm->ops->copy_to_sram(gpusvm, pages,
> > > > > > dma_addr,
> > > > > > npages);
> > > > > > +	if (err)
> > > > > > +		goto err_finalize;
> > > > > > +
> > > > > > +err_finalize:
> > > > > > +	if (err)
> > > > > > +		drm_gpusvm_migration_put_pages(npages,
> > > > > > migrate.dst);
> > > > > > +	migrate_vma_pages(&migrate);
> > > > > > +	migrate_vma_finalize(&migrate);
> > > > > > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev,
> > > > > > dma_addr,
> > > > > > npages,
> > > > > > +				       DMA_BIDIRECTIONAL);
> > > > > > +err_free:
> > > > > > +	kvfree(buf);
> > > > > > +err_out:
> > > > > > +	mmap_assert_locked(gpusvm->mm);
> > > > > > +
> > > > > > +	return err;
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_migrate_to_sram - Migrate (evict) GPU SVM range to SRAM
> > > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > > + * @range: Pointer to the GPU SVM range structure
> > > > > > + * @ctx: GPU SVM context
> > > > > > + *
> > > > > > + * This function initiates the migration of the specified GPU SVM range to
> > > > > > + * SRAM. It performs necessary checks and invokes the internal migration
> > > > > > + * function for actual migration.
> > > > > > + *
> > > > > > + * Returns:
> > > > > > + * 0 on success, negative error code on failure.
> > > > > > + */
> > > > > > +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > > > > > +			       struct drm_gpusvm_range
> > > > > > *range,
> > > > > > +			       const struct drm_gpusvm_ctx
> > > > > > *ctx)
> > > > > > +{
> > > > > > +	u64 start = range->va.start, end = range->va.end;
> > > > > > +	struct mm_struct *mm = gpusvm->mm;
> > > > > > +	struct vm_area_struct *vas;
> > > > > > +	int err;
> > > > > > +	bool retry = false;
> > > > > > +
> > > > > > +	if (!ctx->mmap_locked) {
> > > > > > +		if (!mmget_not_zero(mm)) {
> > > > > > +			err = -EFAULT;
> > > > > > +			goto err_out;
> > > > > > +		}
> > > > > > +		if (ctx->trylock_mmap) {
> > > > > > +			if (!mmap_read_trylock(mm))  {
> > > > > > +				err =
> > > > > > drm_gpusvm_evict_to_sram(gpusvm, range);
> > > > > > +				goto err_mmput;
> > > > > > +			}
> > > > > > +		} else {
> > > > > > +			mmap_read_lock(mm);
> > > > > > +		}
> > > > > > +	}
> > > > > > +
> > > > > > +	mmap_assert_locked(mm);
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * Loop required to find all VM area structs for the corner case when
> > > > > > +	 * VRAM backing has been partially unmapped from the MM's address space.
> > > > > > +	 */
> > > > > > +again:
> > > > > > +	vas = find_vma(mm, start);
> > > > > > +	if (!vas) {
> > > > > > +		if (!retry)
> > > > > > +			err = -ENOENT;
> > > > > > +		goto err_mmunlock;
> > > > > > +	}
> > > > > > +
> > > > > > +	if (end <= vas->vm_start || start >= vas->vm_end) {
> > > > > > +		if (!retry)
> > > > > > +			err = -EINVAL;
> > > > > > +		goto err_mmunlock;
> > > > > > +	}
> > > > > > +
> > > > > > +	err = __drm_gpusvm_migrate_to_sram(gpusvm, vas,
> > > > > > NULL,
> > > > > > start,
> > > > > > end);
> > > > > > +	if (err)
> > > > > > +		goto err_mmunlock;
> > > > > > +
> > > > > > +	if (vas->vm_end < end) {
> > > > > > +		retry = true;
> > > > > > +		start = vas->vm_end;
> > > > > > +		goto again;
> > > > > > +	}
> > > > > > +
> > > > > > +	if (!ctx->mmap_locked) {
> > > > > > +		mmap_read_unlock(mm);
> > > > > > +		/*
> > > > > > +		 * Using mmput_async as this function can be called while
> > > > > > +		 * holding a dma-resv lock, and a final put can grab the mmap
> > > > > > +		 * lock, causing a lock inversion.
> > > > > > +		 */
> > > > > > +		mmput_async(mm);
> > > > > > +	}
> > > > > > +
> > > > > > +	return 0;
> > > > > > +
> > > > > > +err_mmunlock:
> > > > > > +	if (!ctx->mmap_locked)
> > > > > > +		mmap_read_unlock(mm);
> > > > > > +err_mmput:
> > > > > > +	if (!ctx->mmap_locked)
> > > > > > +		mmput_async(mm);
> > > > > > +err_out:
> > > > > > +	return err;
> > > > > > +}
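
A sketch of the eviction-side caller (hypothetical): trylock_mmap selects the
migrate_device_* fallback above when the mmap lock cannot be taken, e.g. when
eviction is invoked while a dma-resv lock is held.

	struct drm_gpusvm_ctx ctx = { .trylock_mmap = true };

	err = drm_gpusvm_migrate_to_sram(gpusvm, range, &ctx);
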
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_page_free - Put GPU SVM zone device data associated with a page
> > > > > > + * @page: Pointer to the page
> > > > > > + *
> > > > > > + * This function is a callback used to put the GPU SVM zone device data
> > > > > > + * associated with a page when it is being released.
> > > > > > + */
> > > > > > +static void drm_gpusvm_page_free(struct page *page)
> > > > > > +{
> > > > > > +	drm_gpusvm_zdd_put(page->zone_device_data);
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_migrate_to_ram - Migrate GPU SVM range to RAM (page fault handler)
> > > > > > + * @vmf: Pointer to the fault information structure
> > > > > > + *
> > > > > > + * This function is a page fault handler used to migrate a GPU SVM range to RAM.
> > > > > > + * It retrieves the GPU SVM range information from the faulting page and invokes
> > > > > > + * the internal migration function to migrate the range back to RAM.
> > > > > > + *
> > > > > > + * Returns:
> > > > > > + * VM_FAULT_SIGBUS on failure, 0 on success.
> > > > > > + */
> > > > > > +static vm_fault_t drm_gpusvm_migrate_to_ram(struct vm_fault
> > > > > > *vmf)
> > > > > > +{
> > > > > > +	struct drm_gpusvm_zdd *zdd = vmf->page-
> > > > > > > zone_device_data;
> > > > > > +	int err;
> > > > > > +
> > > > > > +	err = __drm_gpusvm_migrate_to_sram(zdd->range-
> > > > > > >gpusvm,
> > > > > > +					   vmf->vma, vmf-
> > > > > > >page,
> > > > > > +					   zdd->range-
> > > > > > >va.start,
> > > > > > +					   zdd->range-
> > > > > > >va.end);
> > > > > > +
> > > > > > +	return err ? VM_FAULT_SIGBUS : 0;
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_pagemap_ops - Device page map operations for
> > > > > > GPU
> > > > > > SVM
> > > > > > + */
> > > > > > +static const struct dev_pagemap_ops drm_gpusvm_pagemap_ops =
> > > > > > {
> > > > > > +	.page_free = drm_gpusvm_page_free,
> > > > > > +	.migrate_to_ram = drm_gpusvm_migrate_to_ram,
> > > > > > +};
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_pagemap_ops_get - Retrieve GPU SVM device page
> > > > > > map
> > > > > > operations
> > > > > > + *
> > > > > > + * Returns:
> > > > > > + * Pointer to the GPU SVM device page map operations
> > > > > > structure.
> > > > > > + */
> > > > > > +const struct dev_pagemap_ops
> > > > > > *drm_gpusvm_pagemap_ops_get(void)
> > > > > > +{
> > > > > > +	return &drm_gpusvm_pagemap_ops;
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_has_mapping - Check if GPU SVM has mapping for
> > > > > > the
> > > > > > given address range
> > > > > > + * @gpusvm: Pointer to the GPU SVM structure.
> > > > > > + * @start: Start address
> > > > > > + * @end: End address
> > > > > > + *
> > > > > > + * Returns:
> > > > > > + * True if GPU SVM has mapping, False otherwise
> > > > > > + */
> > > > > > +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64
> > > > > > start,
> > > > > > u64 end)
> > > > > > +{
> > > > > > +	struct drm_gpusvm_notifier *notifier;
> > > > > > +
> > > > > > +	drm_gpusvm_for_each_notifier(notifier, gpusvm,
> > > > > > start,
> > > > > > end) {
> > > > > > +		struct drm_gpusvm_range *range = NULL;
> > > > > > +
> > > > > > +		drm_gpusvm_for_each_range(range, notifier,
> > > > > > start,
> > > > > > end)
> > > > > > +			return true;
> > > > > > +	}
> > > > > > +
> > > > > > +	return false;
> > > > > > +}
> > > > > > diff --git a/drivers/gpu/drm/xe/drm_gpusvm.h b/drivers/gpu/drm/xe/drm_gpusvm.h
> > > > > > new file mode 100644
> > > > > > index 000000000000..0ea70f8534a8
> > > > > > --- /dev/null
> > > > > > +++ b/drivers/gpu/drm/xe/drm_gpusvm.h
> > > > > > @@ -0,0 +1,415 @@
> > > > > > +/* SPDX-License-Identifier: MIT */
> > > > > > +/*
> > > > > > + * Copyright © 2024 Intel Corporation
> > > > > > + */
> > > > > > +
> > > > > > +#ifndef __DRM_GPUSVM_H__
> > > > > > +#define __DRM_GPUSVM_H__
> > > > > > +
> > > > > > +#include <linux/kref.h>
> > > > > > +#include <linux/mmu_notifier.h>
> > > > > > +#include <linux/workqueue.h>
> > > > > > +
> > > > > > +struct dev_pagemap_ops;
> > > > > > +struct drm_device;
> > > > > > +struct drm_gpusvm;
> > > > > > +struct drm_gpusvm_notifier;
> > > > > > +struct drm_gpusvm_ops;
> > > > > > +struct drm_gpusvm_range;
> > > > > > +
> > > > > > +/**
> > > > > > + * struct drm_gpusvm_ops - Operations structure for GPU SVM
> > > > > > + *
> > > > > > + * This structure defines the operations for GPU Shared Virtual Memory (SVM).
> > > > > > + * These operations are provided by the GPU driver to manage SVM ranges and
> > > > > > + * perform operations such as migration between VRAM and system RAM.
> > > > > > + */
> > > > > > +struct drm_gpusvm_ops {
> > > > > > +	/**
> > > > > > +	 * @notifier_alloc: Allocate a GPU SVM notifier (optional)
> > > > > > +	 *
> > > > > > +	 * This function shall allocate a GPU SVM notifier.
> > > > > > +	 *
> > > > > > +	 * Returns:
> > > > > > +	 * Pointer to the allocated GPU SVM notifier on success, NULL on failure.
> > > > > > +	 */
> > > > > > +	struct drm_gpusvm_notifier *(*notifier_alloc)(void);
> > > > > > +
> > > > > > +	/**
> > > > > > +	 * @notifier_free: Free a GPU SVM notifier (optional)
> > > > > > +	 * @notifier: Pointer to the GPU SVM notifier to be freed
> > > > > > +	 *
> > > > > > +	 * This function shall free a GPU SVM notifier.
> > > > > > +	 */
> > > > > > +	void (*notifier_free)(struct drm_gpusvm_notifier *notifier);
> > > > > > +
> > > > > > +	/**
> > > > > > +	 * @range_alloc: Allocate a GPU SVM range (optional)
> > > > > > +	 * @gpusvm: Pointer to the GPU SVM
> > > > > > +	 *
> > > > > > +	 * This function shall allocate a GPU SVM range.
> > > > > > +	 *
> > > > > > +	 * Returns:
> > > > > > +	 * Pointer to the allocated GPU SVM range on success, NULL on failure.
> > > > > > +	 */
> > > > > > +	struct drm_gpusvm_range *(*range_alloc)(struct drm_gpusvm *gpusvm);
> > > > > > +
> > > > > > +	/**
> > > > > > +	 * @range_free: Free a GPU SVM range (optional)
> > > > > > +	 * @range: Pointer to the GPU SVM range to be freed
> > > > > > +	 *
> > > > > > +	 * This function shall free a GPU SVM range.
> > > > > > +	 */
> > > > > > +	void (*range_free)(struct drm_gpusvm_range *range);
> > > > > > +
> > > > > > +	/**
> > > > > > +	 * @vram_release: Release VRAM allocation (optional)
> > > > > > +	 * @vram_allocation: Driver-private pointer to the VRAM allocation
> > > > > > +	 *
> > > > > > +	 * This function shall release VRAM allocation and expects to drop a
> > > > > > +	 * reference to VRAM allocation.
> > > > > > +	 */
> > > > > > +	void (*vram_release)(void *vram_allocation);
> > > > > > +
> > > > > > +	/**
> > > > > > +	 * @populate_vram_pfn: Populate VRAM PFN (required for migration)
> > > > > > +	 * @gpusvm: Pointer to the GPU SVM
> > > > > > +	 * @vram_allocation: Driver-private pointer to the VRAM allocation
> > > > > > +	 * @npages: Number of pages to populate
> > > > > > +	 * @pfn: Array of page frame numbers to populate
> > > > > > +	 *
> > > > > > +	 * This function shall populate VRAM page frame numbers (PFN).
> > > > > > +	 *
> > > > > > +	 * Returns:
> > > > > > +	 * 0 on success, a negative error code on failure.
> > > > > > +	 */
> > > > > > +	int (*populate_vram_pfn)(struct drm_gpusvm *gpusvm,
> > > > > > +				 void *vram_allocation,
> > > > > > +				 unsigned long npages,
> > > > > > +				 unsigned long *pfn);
> > > > > > +
> > > > > > +	/**
> > > > > > +	 * @copy_to_vram: Copy to VRAM (required for migration)
> > > > > > +	 * @gpusvm: Pointer to the GPU SVM
> > > > > > +	 * @pages: Pointer to array of VRAM pages (destination)
> > > > > > +	 * @dma_addr: Pointer to array of DMA addresses (source)
> > > > > > +	 * @npages: Number of pages to copy
> > > > > > +	 *
> > > > > > +	 * This function shall copy pages to VRAM.
> > > > > > +	 *
> > > > > > +	 * Returns:
> > > > > > +	 * 0 on success, a negative error code on failure.
> > > > > > +	 */
> > > > > > +	int (*copy_to_vram)(struct drm_gpusvm *gpusvm,
> > > > > > +			    struct page **pages,
> > > > > > +			    dma_addr_t *dma_addr,
> > > > > > +			    unsigned long npages);
> > > > > > +
> > > > > > +	/**
> > > > > > +	 * @copy_to_sram: Copy to system RAM (required for migration)
> > > > > > +	 * @gpusvm: Pointer to the GPU SVM
> > > > > > +	 * @pages: Pointer to array of VRAM pages (source)
> > > > > > +	 * @dma_addr: Pointer to array of DMA addresses (destination)
> > > > > > +	 * @npages: Number of pages to copy
> > > > > > +	 *
> > > > > > +	 * This function shall copy pages to system RAM.
> > > > > > +	 *
> > > > > > +	 * Returns:
> > > > > > +	 * 0 on success, a negative error code on failure.
> > > > > > +	 */
> > > > > > +	int (*copy_to_sram)(struct drm_gpusvm *gpusvm,
> > > > > > +			    struct page **pages,
> > > > > > +			    dma_addr_t *dma_addr,
> > > > > > +			    unsigned long npages);
> > > > > > +
> > > > > > +	/**
> > > > > > +	 * @invalidate: Invalidate GPU SVM notifier (required)
> > > > > > +	 * @gpusvm: Pointer to the GPU SVM
> > > > > > +	 * @notifier: Pointer to the GPU SVM notifier
> > > > > > +	 * @mmu_range: Pointer to the mmu_notifier_range structure
> > > > > > +	 *
> > > > > > +	 * This function shall invalidate the GPU page tables. It can safely
> > > > > > +	 * walk the notifier range RB tree/list in this function. Called while
> > > > > > +	 * holding the notifier lock.
> > > > > > +	 */
> > > > > > +	void (*invalidate)(struct drm_gpusvm *gpusvm,
> > > > > > +			   struct drm_gpusvm_notifier *notifier,
> > > > > > +			   const struct mmu_notifier_range *mmu_range);
> > > > > > +};
> > > > > > +
> > > > > > +/**
> > > > > > + * struct drm_gpusvm_notifier - Structure representing a GPU SVM notifier
> > > > > > + *
> > > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > > + * @notifier: MMU interval notifier
> > > > > > + * @interval: Interval for the notifier
> > > > > > + * @rb: Red-black tree node for the parent GPU SVM structure notifier tree
> > > > > > + * @root: Cached root node of the RB tree containing ranges
> > > > > > + * @range_list: List head containing of ranges in the same order they appear in
> > > > > > + *              interval tree. This is useful to keep iterating ranges while
> > > > > > + *              doing modifications to RB tree.
> > > > > > + * @flags.removed: Flag indicating whether the MMU interval notifier has been
> > > > > > + *                 removed
> > > > > > + *
> > > > > > + * This structure represents a GPU SVM notifier.
> > > > > > + */
> > > > > > +struct drm_gpusvm_notifier {
> > > > > > +	struct drm_gpusvm *gpusvm;
> > > > > > +	struct mmu_interval_notifier notifier;
> > > > > > +	struct {
> > > > > > +		u64 start;
> > > > > > +		u64 end;
> > > > > > +	} interval;
> > > > > > +	struct {
> > > > > > +		struct rb_node node;
> > > > > > +		struct list_head entry;
> > > > > > +		u64 __subtree_last;
> > > > > > +	} rb;
> > > > > > +	struct rb_root_cached root;
> > > > > > +	struct list_head range_list;
> > > > > > +	struct {
> > > > > > +		u32 removed : 1;
> > > > > > +	} flags;
> > > > > > +};
> > > > > > +
> > > > > > +/**
> > > > > > + * struct drm_gpusvm_range - Structure representing a GPU SVM range
> > > > > > + *
> > > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > > + * @notifier: Pointer to the GPU SVM notifier
> > > > > > + * @refcount: Reference count for the range
> > > > > > + * @rb: Red-black tree node for the parent GPU SVM notifier structure range tree
> > > > > > + * @va: Virtual address range
> > > > > > + * @notifier_seq: Notifier sequence number of the range's pages
> > > > > > + * @pages: Pointer to the array of pages (if backing store is in VRAM)
> > > > > > + * @dma_addr: DMA address array (if backing store is SRAM and DMA mapped)
> > > > > > + * @vram_allocation: Driver-private pointer to the VRAM allocation
> > > > > > + * @order: Order of dma mapping. i.e. PAGE_SIZE << order is mapping size
> > > > > > + * @flags.migrate_vram: Flag indicating whether the range can be migrated to VRAM
> > > > > > + * @flags.unmapped: Flag indicating if the range has been unmapped
> > > > > > + * @flags.partial_unmap: Flag indicating if the range has been partially unmapped
> > > > > > + * @flags.has_vram_pages: Flag indicating if the range has vram pages
> > > > > > + * @flags.has_dma_mapping: Flag indicating if the range has a DMA mapping
> > > > > > + * @flags.kfree_mapping: Flag indicating @dma_addr is a compact allocation based
> > > > > > + *                       on @order which releases via kfree
> > > > > > + *
> > > > > > + * This structure represents a GPU SVM range used for tracking memory ranges
> > > > > > + * mapped in a DRM device.
> > > > > > + */
> > > > > > +struct drm_gpusvm_range {
> > > > > > +	struct drm_gpusvm *gpusvm;
> > > > > > +	struct drm_gpusvm_notifier *notifier;
> > > > > > +	struct kref refcount;
> > > > > > +	struct {
> > > > > > +		struct rb_node node;
> > > > > > +		struct list_head entry;
> > > > > > +		u64 __subtree_last;
> > > > > > +	} rb;
> > > > > > +	struct {
> > > > > > +		u64 start;
> > > > > > +		u64 end;
> > > > > > +	} va;
> > > > > > +	unsigned long notifier_seq;
> > > > > > +	union {
> > > > > > +		struct page **pages;
> > > > > > +		dma_addr_t *dma_addr;
> > > > > > +	};
> > > > > > +	void *vram_allocation;
> > > > > > +	u16 order;
> > > > > > +	struct {
> > > > > > +		/* All flags below must be set upon creation */
> > > > > > +		u16 migrate_vram : 1;
> > > > > > +		/* All flags below must be set / cleared under notifier lock */
> > > > > > +		u16 unmapped : 1;
> > > > > > +		u16 partial_unmap : 1;
> > > > > > +		u16 has_vram_pages : 1;
> > > > > > +		u16 has_dma_mapping : 1;
> > > > > > +		u16 kfree_mapping : 1;
> > > > > > +	} flags;
> > > > > > +};
> > > > > > +
> > > > > > +/**
> > > > > > + * struct drm_gpusvm - GPU SVM structure
> > > > > > + *
> > > > > > + * @name: Name of the GPU SVM
> > > > > > + * @drm: Pointer to the DRM device structure
> > > > > > + * @mm: Pointer to the mm_struct for the address space
> > > > > > + * @device_private_page_owner: Device private pages owner
> > > > > > + * @mm_start: Start address of GPU SVM
> > > > > > + * @mm_range: Range of the GPU SVM
> > > > > > + * @notifier_size: Size of individual notifiers
> > > > > > + * @ops: Pointer to the operations structure for GPU SVM
> > > > > > + * @chunk_sizes: Pointer to the array of chunk sizes used in range allocation.
> > > > > > + *               Entries should be powers of 2 in descending order.
> > > > > > + * @num_chunks: Number of chunks
> > > > > > + * @notifier_lock: Read-write semaphore for protecting notifier operations
> > > > > > + * @zdd_wq: Workqueue for deferred work on zdd destruction
> > > > > > + * @root: Cached root node of the Red-Black tree containing GPU SVM notifiers
> > > > > > + * @notifier_list: list head containing of notifiers in the same order they
> > > > > > + *                 appear in interval tree. This is useful to keep iterating
> > > > > > + *                 notifiers while doing modifications to RB tree.
> > > > > > + *
> > > > > > + * This structure represents a GPU SVM (Shared Virtual Memory) used for tracking
> > > > > > + * memory ranges mapped in a DRM (Direct Rendering Manager) device.
> > > > > > + *
> > > > > > + * No reference counting is provided, as this is expected to be embedded in the
> > > > > > + * driver VM structure along with the struct drm_gpuvm, which handles reference
> > > > > > + * counting.
> > > > > > + */
> > > > > > +struct drm_gpusvm {
> > > > > > +	const char *name;
> > > > > > +	struct drm_device *drm;
> > > > > > +	struct mm_struct *mm;
> > > > > > +	void *device_private_page_owner;
> > > > > > +	u64 mm_start;
> > > > > > +	u64 mm_range;
> > > > > > +	u64 notifier_size;
> > > > > > +	const struct drm_gpusvm_ops *ops;
> > > > > > +	const u64 *chunk_sizes;
> > > > > > +	int num_chunks;
> > > > > > +	struct rw_semaphore notifier_lock;
> > > > > > +	struct workqueue_struct *zdd_wq;
> > > > > > +	struct rb_root_cached root;
> > > > > > +	struct list_head notifier_list;
> > > > > > +};
> > > > > > +
> > > > > > +/**
> > > > > > + * struct drm_gpusvm_ctx - DRM GPU SVM context
> > > > > > + *
> > > > > > + * @mmap_locked: mmap lock is locked
> > > > > > + * @trylock_mmap: trylock mmap lock, used to avoid locking inversions
> > > > > > + *                (e.g.dma-revs -> mmap lock)
> > > > > > + * @in_notifier: entering from a MMU notifier
> > > > > > + * @read_only: operating on read-only memory
> > > > > > + * @vram_possible: possible to use VRAM
> > > > > > + * @prefault: prefault pages
> > > > > > + *
> > > > > > + * Context that is DRM GPUSVM is operating in (i.e. user arguments).
> > > > > > + */
> > > > > > +struct drm_gpusvm_ctx {
> > > > > > +	u32 mmap_locked :1;
> > > > > > +	u32 trylock_mmap :1;
> > > > > > +	u32 in_notifier :1;
> > > > > > +	u32 read_only :1;
> > > > > > +	u32 vram_possible :1;
> > > > > > +	u32 prefault :1;
> > > > > > +};
> > > > > > +
> > > > > > +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> > > > > > +		    const char *name, struct drm_device *drm,
> > > > > > +		    struct mm_struct *mm, void *device_private_page_owner,
> > > > > > +		    u64 mm_start, u64 mm_range, u64 notifier_size,
> > > > > > +		    const struct drm_gpusvm_ops *ops,
> > > > > > +		    const u64 *chunk_sizes, int num_chunks);
> > > > > > +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm);
> > > > > > +void drm_gpusvm_free(struct drm_gpusvm *gpusvm);
> > > > > > +
> > > > > > +struct drm_gpusvm_range *
> > > > > > +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64 fault_addr,
> > > > > > +				u64 gpuva_start, u64 gpuva_end,
> > > > > > +				const struct drm_gpusvm_ctx *ctx);
> > > > > > +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> > > > > > +			     struct drm_gpusvm_range *range);
> > > > > > +
> > > > > > +struct drm_gpusvm_range *
> > > > > > +drm_gpusvm_range_get(struct drm_gpusvm_range *range);
> > > > > > +void drm_gpusvm_range_put(struct drm_gpusvm_range *range);
> > > > > > +
> > > > > > +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> > > > > > +				  struct drm_gpusvm_range *range);
> > > > > > +
> > > > > > +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> > > > > > +			       struct drm_gpusvm_range *range,
> > > > > > +			       const struct drm_gpusvm_ctx *ctx);
> > > > > > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> > > > > > +				  struct drm_gpusvm_range *range,
> > > > > > +				  const struct drm_gpusvm_ctx *ctx);
> > > > > > +
> > > > > > +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> > > > > > +			       struct drm_gpusvm_range *range,
> > > > > > +			       void *vram_allocation,
> > > > > > +			       const struct drm_gpusvm_ctx *ctx);
> > > > > > +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > > > > > +			       struct drm_gpusvm_range *range,
> > > > > > +			       const struct drm_gpusvm_ctx *ctx);
> > > > > > +
> > > > > > +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void);
> > > > > > +
> > > > > > +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start, u64 end);
> > > > > > +
> > > > > > +struct drm_gpusvm_range *
> > > > > > +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64 start, u64 end);
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_notifier_lock - Lock GPU SVM notifier
> > > > > > + * @gpusvm__: Pointer to the GPU SVM structure.
> > > > > > + *
> > > > > > + * Abstract client usage GPU SVM notifier lock, take lock
> > > > > > + */
> > > > > > +#define drm_gpusvm_notifier_lock(gpusvm__)	\
> > > > > > +	down_read(&(gpusvm__)->notifier_lock)
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_notifier_unlock - Unlock GPU SVM notifier
> > > > > > + * @gpusvm__: Pointer to the GPU SVM structure.
> > > > > > + *
> > > > > > + * Abstract client usage GPU SVM notifier lock, drop lock
> > > > > > + */
> > > > > > +#define drm_gpusvm_notifier_unlock(gpusvm__)	\
> > > > > > +	up_read(&(gpusvm__)->notifier_lock)
> > > > > > +
> > > > > > +/**
> > > > > > + * __drm_gpusvm_range_next - Get the next GPU SVM range in the list
> > > > > > + * @range: a pointer to the current GPU SVM range
> > > > > > + *
> > > > > > + * Return: A pointer to the next drm_gpusvm_range if available, or NULL if the
> > > > > > + *         current range is the last one or if the input range is NULL.
> > > > > > + */
> > > > > > +static inline struct drm_gpusvm_range *
> > > > > > +__drm_gpusvm_range_next(struct drm_gpusvm_range *range)
> > > > > > +{
> > > > > > +	if (range && !list_is_last(&range->rb.entry,
> > > > > > +				   &range->notifier->range_list))
> > > > > > +		return list_next_entry(range, rb.entry);
> > > > > > +
> > > > > > +	return NULL;
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_for_each_range - Iterate over GPU SVM ranges in a notifier
> > > > > > + * @range__: Iterator variable for the ranges. If set, it indicates the start of
> > > > > > + *	     the iterator. If NULL, call drm_gpusvm_range_find() to get the range.
> > > > > > + * @notifier__: Pointer to the GPU SVM notifier
> > > > > > + * @start__: Start address of the range
> > > > > > + * @end__: End address of the range
> > > > > > + *
> > > > > > + * This macro is used to iterate over GPU SVM ranges in a notifier. It is safe
> > > > > > + * to use while holding the driver SVM lock or the notifier lock.
> > > > > > + */
> > > > > > +#define drm_gpusvm_for_each_range(range__, notifier__, start__, end__)	\
> > > > > > +	for ((range__) = (range__) ?:					\
> > > > > > +	     drm_gpusvm_range_find((notifier__), (start__), (end__));	\
> > > > > > +	     (range__) && (range__->va.start < (end__));		\
> > > > > > +	     (range__) = __drm_gpusvm_range_next(range__))
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_range_set_unmapped - Mark a GPU SVM range as unmapped
> > > > > > + * @range: Pointer to the GPU SVM range structure.
> > > > > > + * @mmu_range: Pointer to the MMU notifier range structure.
> > > > > > + *
> > > > > > + * This function marks a GPU SVM range as unmapped and sets the partial_unmap flag
> > > > > > + * if the range partially falls within the provided MMU notifier range.
> > > > > > + */
> > > > > > +static inline void
> > > > > > +drm_gpusvm_range_set_unmapped(struct drm_gpusvm_range *range,
> > > > > > +			      const struct mmu_notifier_range *mmu_range)
> > > > > > +{
> > > > > > +	lockdep_assert_held_write(&range->gpusvm->notifier_lock);
> > > > > > +
> > > > > > +	range->flags.unmapped = true;
> > > > > > +	if (range->va.start < mmu_range->start ||
> > > > > > +	    range->va.end > mmu_range->end)
> > > > > > +		range->flags.partial_unmap = true;
> > > > > > +}
> > > > > > +
> > > > > > +#endif /* __DRM_GPUSVM_H__ */
> > > > > 
> > > 
> 

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory
  2024-08-30 13:47             ` Matthew Brost
@ 2024-09-02  9:45               ` Thomas Hellström
  0 siblings, 0 replies; 100+ messages in thread
From: Thomas Hellström @ 2024-09-02  9:45 UTC (permalink / raw)
  To: Matthew Brost
  Cc: intel-xe, dri-devel, airlied, christian.koenig, matthew.auld,
	daniel

Hi, Matt.

On Fri, 2024-08-30 at 13:47 +0000, Matthew Brost wrote:
> On Fri, Aug 30, 2024 at 11:57:33AM +0200, Thomas Hellström wrote:
> > Hi, Matthew,
> > 
> > Agreed the below might not be important just now, but some ideas:
> > 
> > On Thu, 2024-08-29 at 20:56 +0000, Matthew Brost wrote:
> > > Issues with removing a SVM range:
> > > 
> > > - Xe bind code stores invalidation / present state in VMA, this
> > > would
> > >   need to be moved to the radix tree. I have Jira open for that
> > > work
> > >   which I believe other developers are going to own.
> > 
> > Yeah, although we shouldn't *design* around xe bind-code and page-
> > table
> > code shortcomings.
> > 
> 
> I'm thinking this one certainly should be fixed sooner rather than
> later which would be helpful.
> 
> But let's also consider the case where we get a bunch of individual
> page
> invalidates serially for an entire range (I can't remember when this
> happens but I have seen it in my testing, will look into this more to
> figure exactly when). If we invalidate 1 page at a time in radix
> tree,
> each invalidation could potentially results in TLB invalidation
> interaction with the hardware in cases where a larger GPU pages are
> not
> being used. The TLB invalidation is going to vastly slower than any
> CPU
> operation (e.g. RB search, radix tree walk). If we key on a range
> invalidate the entire once on the first invalidation this may end up
> being significantly faster.
> 
> Above is pure speculation though, a lot of what both of us is saying
> is... So another reason I'd like to get apps running to do profiling.
> It
> would be nice to make design decisions based on data not speculation.

Well nothing would stop you from adding a configurable invalidation
granularity, even with a radix-tree based approach. You'd just pad the
invalidation range to match the granularity.
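
For illustration, a minimal sketch of that padding (the helper and the
granularity parameter are made up for this example, not something from
the series):

static void my_svm_pad_invalidation(u64 granularity, /* power of two */
				    u64 *start, u64 *end)
{
	/*
	 * Widen the invalidated span so the GPU PTE zap / TLB flush is
	 * always issued at the configured granularity, regardless of how
	 * fine-grained the incoming mmu_notifier_range is.
	 */
	*start = ALIGN_DOWN(*start, granularity);
	*end = ALIGN(*end, granularity);
}

The driver's ->invalidate() callback would then operate on the padded
span rather than the raw mmu_notifier_range.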

> 
> > 
> > > - Where would the dma mapping / device pages be stored?
> > > 	- In the radix tree? What if ATS is enabled? We don't
> > > have a
> > > 	  driver owned radix tree. How do we reasonably connect
> > > a
> > > driver
> > > 	  owned radix to a common GPUSVM layer?
> > 
> > With ATS you mean IOMMU SVA, right? I think we could assume that
> > any
> > user of this code also has a gpu page-table since otherwise they
> > couldn't be using VRAM and a simpler solution would be in place. 
> > 
> 
> Fair point.
> 
> > But to that specific question, drm_gpusvm state would live in a
> > drm_gpusvm radix tree and driver-specific stuff in the driver tree.
> > A
> > helper based approach would then call drm_gpusvm_unmap_dma(range),
> > whereas a middle layer would just traverse the tree and unmap.
> > 
> 
> Let me consider this. Open to all options.
> 
> > > 	- In the notifier? What is the notifier is sparsely
> > > populated?
> > > 	  We would be wasting huge amounts of memory. What is
> > > the
> > > 	  notifier is configured to span the entire virtual
> > > address
> > > 	  space?
> > 
> > Let's assume you use a fake page-table like in xe_pt_walk.c as your
> > "radix tree", adapted to relevant page-sizes, sparsity is not a
> > problem.
> > 
> 
> Ok, makes sense I think.
> 
> > > - How does the garbage collector work? We can't allocate memory
> > > in
> > > the
> > >   notifier so we don't anything to add to the garbage collector.
> > > We
> > >   can't directly modify page tables given you need lock in the
> > > path
> > > of
> > >   reclaim.
> > 
> > The garbage collector would operate on the whole invalidated range.
> > In
> > the case of xe, upon zapping under reclaim you mark individual
> > page-
> > table bos that are to be removed as "invalid", the garbage
> > collector
> > walks the range removing the "invalid" entries. Subsequent (re-
> > binding)
> > avoids the "invalid" entries, (perhaps even helps removing them)
> > and
> > can thus race with the garbage collector. Hence, any ranges implied
> > by
> > the page-table code are elimitated.
> > 
> 
> This is pretty much with what I came up with too if we didn't have a
> SVM
> range.
> 
> > > - How do we deal with fault storms (e.g. tons of faults hitting
> > > the
> > > same
> > >   SVM range in a row)? Without a SVM range no every to know if
> > > mapping
> > >   is valid and GPU page handler can be short circuited.
> > 
> > Perhaps look at page-table tree and check whether the gpu_pte
> > causing
> > the fault is valid.
> > 
> 
> Came up with the same thing.
> 
> > > - Do we have notifier seqno for every PTE?
> > 
> > I'd say no. With this approach it makes sense to have a wide
> > notifier.
> > The seqno now only affects binding of new gpu_ptes, so the problem
> > with
> > a wide notifier becomes that if invalidation occurs to *any* part
> > of
> > the notifier while we're in the read section during binding, we
> > need to
> 
> I have avoided this by the drm_gpusvm_range_pages_valid. This isn't
> just
> an optimization is actually required for the 2 tile case to be able
> to
> safely know when dma pages can be unmapped (i.e. you can't dma unmap
> pages if either tile has a valid mapping).

OK, I still need to read up on that..

Thanks,
Thomas


> 
> Matt
> 
> > rerun the binding. Adding more notifiers to mitigate that would be
> > to
> > optimize faulting performance over core invalidation performance
> > which
> > Jason asked us to avoid.
> > 
> > /Thomas
> > 
> > 
> > 


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory
  2024-08-30 13:58             ` Matthew Brost
@ 2024-09-02  9:57               ` Thomas Hellström
  0 siblings, 0 replies; 100+ messages in thread
From: Thomas Hellström @ 2024-09-02  9:57 UTC (permalink / raw)
  To: Matthew Brost
  Cc: intel-xe, dri-devel, airlied, christian.koenig, matthew.auld,
	daniel

Hi, Matt

On Fri, 2024-08-30 at 13:58 +0000, Matthew Brost wrote:
> > 
> > So I specifically asked Jason about the performance problem about
> > using
> > many notifiers vs using a single one, and he responded that the
> > problem
> > is slowing down the core mm on invalidations, if the RB tree gets
> > too
> > large to walk. He also mentioned that we should consider core
> > invalidation performance before faulting performance because the
> > latter
> > is so slow anyway we must have the driver stack avoid gpu faults
> > using
> > user-space prefetching and similar techniques.
> > 
> > In particular inserting and removing into the mmu_interval tree is
> > not
> > costly in terms of locking but because of correctness requirements
> > insertion might block on ongoing validations.
> > 
> > So basically what I'm trying to say is that as long as we're using
> > SVM
> > ranges in the way we do (I'm not saying that is wrong at this
> > point,
> 
> If you have been following the mmap write discussions at all, one
> potential fix for removing that hack is a per range migrate mutex
> [1].
> This also need to be considered when / if we try to drop a raneg
> concept.

Still need to read up on that, and for migration I think the situation
is a bit different, pls see below.

> 
> [1]
> https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1111296
> 
> > and I agree that could be fine-tuned later), The benefit of an
> > extra
> > notifier layer is questionable compared to directly inserting the
> > ranges into the mmu_interval_tree. So hence my questions, given
> > those
> > considerations why this additional layer?
> > 
> 
> One we do fairly easily if you think this questionable is have an
> option
> to size the notifier to range size and wire this the notifier size
> modparam [2]. Again once we have apps running it would be fairly to
> profile this and see if there is benefit to this large notifier
> scheme.
> If there really is none, perhaps then we consider ripping this out.
> 
> [2]
> https://patchwork.freedesktop.org/patch/611007/?series=137870&rev=1
> 
> Matt

At this point I'm mostly trying to understand the reasoning behind the
various design choices and why data structures look like they do.

But also, considering that the page-table mapping and invalidation is
per (vm, gpu_vm) pair while migration is per (vm, device (device group))
pair, I have really been advocating for sorting out the page-table
mapping and invalidation first, ending up with something that is
lightweight and sufficient for igpu systems, and avoiding conflating
possible page-table range requirements with migration range
requirements, which might be completely different.

I think the former can be done completely without ranges, having
configurable prefaulting-, invalidation- and notifier granularity,
whereas the latter also introduces migration granularity.
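
As a rough sketch of what such knobs could look like (names invented
here, not taken from the series):

/* Hypothetical per-GPUSVM tuning; all values in bytes, powers of two. */
struct my_svm_granularity {
	u64 prefault;	/* span faulted around the faulting address */
	u64 invalidate;	/* minimum span zapped on a CPU-side invalidation */
	u64 notifier;	/* size of each mmu_interval_notifier */
};

Migration granularity would then be a separate, possibly larger, value
chosen by the migration policy rather than by the page-table code.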

/Thomas





^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 23/28] drm/xe: Add SVM VRAM migration
  2024-08-29 22:12           ` Matthew Brost
  2024-08-29 22:23             ` Matthew Brost
@ 2024-09-02 11:01             ` Christian König
  2024-09-02 12:50               ` Daniel Vetter
  2024-09-02 12:48             ` Daniel Vetter
  2 siblings, 1 reply; 100+ messages in thread
From: Christian König @ 2024-09-02 11:01 UTC (permalink / raw)
  To: Matthew Brost, Daniel Vetter
  Cc: Thomas Hellström, intel-xe, dri-devel, airlied, matthew.auld,
	daniel, Paneer Selvam, Arunpravin

On 30.08.24 at 00:12, Matthew Brost wrote:
> On Thu, Aug 29, 2024 at 01:02:54PM +0200, Daniel Vetter wrote:
>> On Thu, Aug 29, 2024 at 11:53:58AM +0200, Thomas Hellström wrote:
>>> But as Sima pointed out in private communication, exhaustive eviction
>>> is not really needed for faulting to make (crawling) progress.
>>> Watermarks and VRAM trylock shrinking should suffice, since we're
>>> strictly only required to service a single gpu page granule at a time.
>>>
>>> However, ordinary bo-based jobs would still like to be able to
>>> completely evict SVM vram. Whether that is important enough to strive
>>> for is ofc up for discussion.
>> My take is that you don't win anything for exhaustive eviction by having
>> the dma_resv somewhere in there for svm allocations. Roughly for split lru
>> world, where svm ignores bo/dma_resv:
>>
>> When evicting vram from the ttm side we'll fairly switch between selecting
>> bo and throwing out svm pages. With drm_exec/ww_acquire_ctx selecting bo
>> will eventually succeed in vacuuming up everything (with a few retries
>> perhaps, if we're not yet at the head of the ww ticket queue).
>>
>> svm pages we need to try to evict anyway - there's no guarantee, becaue
>> the core mm might be holding temporary page references (which block
> Yea, but think you can could kill the app then - not suggesting we
> should but could. To me this is akin to a CPU fault and not being able
> to migrate the device pages - the migration layer doc says when this
> happens kick this to user space and segfault the app.

That's most likely a bad idea. The core holding a temporary page
reference can happen at any time without any wrongdoing on the
application's part, e.g. for direct I/O, swapping etc...

So you can't punish the application with a segfault if you happen to not 
be able to migrate a page because it has a reference.

Regards,
Christian.

>
> My last patch in the series adds some asserts to see if this ever
> happens, thus far never. If it occurs we could gracefully handle it by
> aborting the migration I guess... I think the user really needs to
> something a bit crazy to trigger this condition - I don't think the core
> randomly grabs refs to device pages but could be wrong.
>
>> migration) or have the page locked (which also block the migration). But
>> as long as those two steps succeed, we'll win and get the pages. There
>> might be some thrashing against concurrent svm faults stealing them again,
>> but they have a disadvantage since they can't steal dma_resv_locked bo.
>> And if it's still too much we can stall them in the page allocator.
>>
>> So it's not entirely reliable, but should be close enough.
>>
>> Now for bo based svm the picture isn't any different, because holding
>> dma_resv is not actually enough to migrate svm mappings. We still need to
>> hope there's no temporary page references around, and we still need to
>> succeed at locking the page. And the migration code only does trylocks,
>> because that's it's deadlock prevent algorithm if different migrations
>> needing the same set of pages, but acquiring them in a different order. So
>> we win nothing.
> Ok, maybe my statement above is false...
>
> Wouldn't be the only time this falls is if another migration is in
> flight (e.g. CPU fault) and they race? Then the eviction will naturally
> happen via refcount being dropped from the other migration. I guess I
> likely need to update my eviction code to not free the TTM resource if
> all pages are not migrated.
>
>> Worse, if dma_resv does actually hold up svm migration and reclaim, then
>> we potentially deadlock because that lock is for a bigger range than
>> individual pages (or folios). And the core mm assumes that it can get out
>> of a deadlock bind by (at least stochastically) eventually succeeding in
>> acquiring/locking down a single page.
>>
>> This means we cannot use dma_resv tricks to give the ttm world an
>> advantage in exhaustive eviction against concurrent svm faults. Or at
>> least not more than we can do without by just stalling svm faults that
>> need to allocate gpu memory (but that must happen without holding locks or
>> we're busted).
>>
> I'm a little lost here on the deadlock case. Do you mean when we try to
> evict SVM BO we trigger reclaim by allocating system pages and can
> deadlock? Doesn't TTM already have this dependency when evicting non-SVM
> BOs?
>
>> So the only benefit I'm seeing is the unified lru, which I'm not sure is
>> worth it. There's also a bit a lru design tension here, because for the bo
> Well also not rewriting the world...
>
> Matt
>
>> world we want objects that are locked to stay on the lru, so that the
>> competing processes can figure out who has the winning ww ticket. The core
>> mm design otoh does isolate pages and remove them from the lru when
>> they're acquired, so that they don't gunk up other processes from trying
>> to make forward progress and are better hidden. Which reduces temporary
>> page references (from lru walk) preventing migration and stuff like that.
>>
>> Cheers, Sima
>> -- 
>> Daniel Vetter
>> Software Engineer, Intel Corporation
>> http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory
  2024-08-29 16:40           ` Matthew Brost
@ 2024-09-02 11:29             ` Daniel Vetter
  0 siblings, 0 replies; 100+ messages in thread
From: Daniel Vetter @ 2024-09-02 11:29 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Daniel Vetter, Christian König, intel-xe, dri-devel, airlied,
	thomas.hellstrom, matthew.auld, daniel

On Thu, Aug 29, 2024 at 04:40:47PM +0000, Matthew Brost wrote:
> On Wed, Aug 28, 2024 at 06:25:18PM +0200, Daniel Vetter wrote:
> > On Wed, Aug 28, 2024 at 03:43:48PM +0000, Matthew Brost wrote:
> > > On Wed, Aug 28, 2024 at 04:46:24PM +0200, Christian König wrote:
> > > > Am 28.08.24 um 16:31 schrieb Daniel Vetter:
> > > > > On Tue, Aug 27, 2024 at 07:48:38PM -0700, Matthew Brost wrote:
> > > > > > +		if (!ctx->mmap_locked) {
> > > > > > +			/*
> > > > > > +			 * XXX: HMM locking document indicates only a read-lock
> > > > > > +			 * is required but there apears to be a window between
> > > > > > +			 * the MMU_NOTIFY_MIGRATE event triggered in a CPU fault
> > > > > > +			 * via migrate_vma_setup and the pages actually moving
> > > > > > +			 * in migrate_vma_finalize in which this code can grab
> > > > > > +			 * garbage pages. Grabbing the write-lock if the range
> > > > > > +			 * is attached to vram appears to protect against this
> > > > > > +			 * race.
> > > > > > +			 */
> > > 
> > > Thanks the comments, replying to both of you inline.
> > > 
> > > > > This one is really scary, since it means the entire migrate pte trickery
> > > > > is essentially completely busted. Grabbing the mmap write lock just means
> > > > > you block out pretty much everything interesting from concurrently
> > > > > happening.
> > > > > 
> > > > > My gut feeling says we need to figure out what's happening here, because
> > > > > this looks a bit too fundamental to me.
> > > 
> > > I agree. I haven’t looked into this issue for a couple of months but
> > > really need to understand what is going on.
> > > 
> > > I should have mentioned this in the cover letter: the goal of this
> > > series was to produce something for review that is stable and supports
> > > UMDs/user applications. It was not intended to be presented as a final
> > > solution. This issue certainly falls into the category of "needs to be
> > > understood and requires a proper fix."
> > > 
> > > One open question I have is whether the test case that triggers this
> > > issue is even defined behavior. The test creates concurrent access
> > > between the GPU and CPU to the same memory address, resulting in GPU and
> > > CPU faults racing against each other. It’s possible that this is
> > > undefined behavior, so data corruption might be acceptable—i.e., the
> > > kernel can’t crash, but incorrect results might be permissible.
> > 
> > Yes this is supposed to be defined, at least from an hmm pov. And core mm/
> > is ridiculous in how many races it allows, especially around concurrent
> > fault handling.
> > 
> > It is ofc really slow if every fault results in a migration, but that's a
> > matter of the application setting stupid memory migration hints for the
> > gpu.
> > 
> > > e.g. This is the only defined usage model:
> > > 
> > > alloc_memory();
> > > start_compute_kernel();
> > > sync_on_compute_kernel_completion();
> > > read_memory();
> > > 
> > > Hopefully, in the next week or so, I'll be heavily engaging with the UMD
> > > teams. Development can then start, and applications will be running soon
> > > after. This will allow us to address issues like this, collect data on
> > > memory usage, and verify some of the assumptions I've made, such as
> > > optimizing for 2M+ allocations.
> > > 
> > > > 
> > > > I think I have at least a high level understanding what's going on here,
> > > > Felix and especially Philip should know more of the details.
> > > > 
> > > 
> > > I meant to reach out to AMD for issues like this. So, Felix
> > > (felix.kuehling@amd.com) and Philip (Philip.Yang@amd.com) would be good
> > > contacts?
> > > 
> > > > In general grabbing the mm_lock to protect PTEs from changing is completely
> > > > nonsense. The mm_lock is to protect the VMAs and *not* the PTEs!
> > > > 
> > > 
> > > Thanks for the hint. I believe that in the AMD implementation, I noticed
> > > some additional locks for migration, which might be how you mitigated
> > > this issue.
> > 
> > Yeah, so in general hold mmap_reading is indeed pure magic thinking for
> > preventing pte changes, like Christian points out. It doesn't stop
> > invalidates, and with the per vma locking it also doesn't stop new valid
> 
> Invalidations happening to parallel to migrations, get pages, or
> bindings should be fine. The notifier lock usage should make all of this
> safe.
> 
> > ptes from being inserted at least for anon memory.
> > 
> > Except migration pte entries that point at vram pages are special, and are
> > _only_ resolved while holding mmap_read. Which means holding mmap_write
> > for the case of looking up our own vram pages with hmm_range_fault
> > actually prevents issues. And so this duct-tape of holding mmap_write very
> > much looks like a working hack to plug any races against concurrently
> > ongoing migrations to system memory due to cpu faults.
> > 
> 
> Agree holding mmap_write is a hack. Looking at AMD 'To serialize concurrent
> migrations or validations of the same range, the prange->migrate_mutex
> must be held.', seemly I could drop mmap write lock abuse and use
> something like this here. The would like be an inner lock of the mmap
> lock.
> 
> Does this seem like a reasonable thing to explore?

Meta-comment: since I learned things as I typed the replies, there's a
bit of a mess in my suggestions/questions ...

See the other replies; I think prange->migrate_mutex doesn't work because
we need a lock on the physical storage, not on the virtual range. Thanks
to mremap (if I understand that thing right) and fork, the two do not
need to line up, or be unique.

The other thing I only slowly realized is that the code in migrate.c
forces full page/folio lock semantics on us already, like a migration
done by core mm code between sram and sram. So I /think/ that if we follow
the rules correctly, the page/folio lock should be enough to sufficiently
serialize migrations.
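
As a purely illustrative sketch of leaning on the page lock alone for
serialization (not what the posted series does):

static int my_migrate_one_device_page(struct page *page)
{
	/*
	 * The page lock is the lock on the physical storage; whoever
	 * holds it wins, and concurrent migrations simply back off.
	 */
	if (!trylock_page(page))
		return -EBUSY;

	/* ... collect, copy and remap while the lock is held ... */

	unlock_page(page);
	return 0;
}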

But then there's amd's migration_mutex, and your migration vs fault
troubles, and I'm honestly not really understanding those races ..

So my current thinking is that page/folio lock should be enough, or we
have bugs or wrong understanding of how this should work.

> > An even more fun corner case is multiple concurrent cpu faults on the same
> > vram page. fork gets you that, or maybe a bit more reasonable mremap with
> 
> My understanding is memory shared between processes cannot migrated due
> to current limitations migrate layer.
> 
> e.g. mmap called with MAP_SHARED is not eligible for migration.

Hm, where is that limitation? All the code I've seen suggests it all
should work, and the only memory not eligible for hmm/zone_device
migration is pagecache. But any kind of anon memory, whether private (i.e.
cow on fork) or shared should work?

If that's not the case then yeah a _lot_ of what I said is just plain
wrong I think ...

> Unsure what the behavior is fork() is called on a process with memory in
> VRAM and the child tries to access it. Maybe fork() is different than
> MAP_SHARED where as parent / child processes can share memory in VRAM? 
> 
> Also really I'm unsure what would happen if user space calls fork() and
> has an Xe VM open and tries to use it too. Before commenting more on
> this, I need play around with test cases like this educate myself.
> 
> My current test doesn't use mremap, agree that would be to good add.
> Again before commenting more here, let add more test cases to educate
> myself.

Yeah I definitely need more learning too.
> 
> > MREMAP_DONTUNMAP | MREMAP_MAYMOVE. I think just hammer the same va with
> > multiple threads along isn't enough, it's better to have a private va for
> 
> I do have test cases where multiple CPU faults from threads hammer the
> same memory. Found some bugs in my initial code but as far as I can tell
> multiple CPU faults in parallel occur in my testing and do work.
> 
> > each thread pointing at the same anon memory page, so that you can get
> 
> You are losing me here - 'private va for each thread pointing at the
> same anon memory page'. This is a fork() case where the parent allocates
> memory and then all children try to read in parallel?

shared anon memory should be a thing, at least to my knowledge. Which
might be wrong ...

> > more parallel faults due to finely grained pte locking.
> > 
> > Would be a good testcase to add, if you don't have it yet.
> >
> 
> See above, agree these are good test cases which I haven't considered and
> will expand my suite to include these. Thanks for the tip - IMO testing
> is as important or even more important than the KMD design and need to
> ensure I have all possible uses covered.
> 
> > > I must say it is a bit unfortunate that the HMM locking documentation
> > > doesn’t mention this. I believe the documentation needs additional
> > > information, which I can add once we finalize the solution.
> > 
> > Yeah, at least from my very cursory lock you don't have enough locking.
> > I've written an in-depth reply to patch 23 with the high-level summary of
> > my thoughts.
> >
> 
> Will look and reply there.

Cheers, Sima

> 
> Matt
> 
> > Cheers, Sima
> > 
> > > 
> > > Matt 
> > > 
> > > > Even with the write side of the mm_lock taken it is perfectly possible that
> > > > PTE change. It's just less likely.
> > > > 
> > > > We run into multiple issues before we figured out this important distinction
> > > > as well.
> > > > 
> > > > Christian.
> > > > 
> > > > > -Sima
> > > > > 
> > > > > 
> > > > > > +			if (vram_pages)
> > > > > > +				mmap_write_lock(mm);
> > > > > > +			else
> > > > > > +				mmap_read_lock(mm);
> > > > > > +		}
> > > > > > +		err = hmm_range_fault(&hmm_range);
> > > > > > +		if (!ctx->mmap_locked) {
> > > > > > +			if (vram_pages)
> > > > > > +				mmap_write_unlock(mm);
> > > > > > +			else
> > > > > > +				mmap_read_unlock(mm);
> > > > > > +		}
> > > > > > +
> > > > > > +		if (err == -EBUSY) {
> > > > > > +			if (time_after(jiffies, timeout))
> > > > > > +				break;
> > > > > > +
> > > > > > +			hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
> > > > > > +			continue;
> > > > > > +		}
> > > > > > +		break;
> > > > > > +	}
> > > > > > +	if (!ctx->mmap_locked)
> > > > > > +		mmput(mm);
> > > > > > +	if (err)
> > > > > > +		goto err_free;
> > > > > > +
> > > > > > +	pages = (struct page **)pfns;
> > > > > > +
> > > > > > +	if (ctx->prefault) {
> > > > > > +		range->pages = pages;
> > > > > > +		goto set_seqno;
> > > > > > +	}
> > > > > > +
> > > > > > +map_pages:
> > > > > > +	if (is_device_private_page(hmm_pfn_to_page(pfns[0]))) {
> > > > > > +		WARN_ON_ONCE(!range->vram_allocation);
> > > > > > +
> > > > > > +		for (i = 0; i < npages; ++i) {
> > > > > > +			pages[i] = hmm_pfn_to_page(pfns[i]);
> > > > > > +
> > > > > > +			if (WARN_ON_ONCE(!is_device_private_page(pages[i]))) {
> > > > > > +				err = -EOPNOTSUPP;
> > > > > > +				goto err_free;
> > > > > > +			}
> > > > > > +		}
> > > > > > +
> > > > > > +		/* Do not race with notifier unmapping pages */
> > > > > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > > > > +		range->flags.has_vram_pages = true;
> > > > > > +		range->pages = pages;
> > > > > > +		if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
> > > > > > +			err = -EAGAIN;
> > > > > > +			__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > > > > > +		}
> > > > > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > > > > +	} else {
> > > > > > +		dma_addr_t *dma_addr = (dma_addr_t *)pfns;
> > > > > > +
> > > > > > +		for_each_dma_page(i, j, npages, order) {
> > > > > > +			if (WARN_ON_ONCE(i && order !=
> > > > > > +					 hmm_pfn_to_map_order(pfns[i]))) {
> > > > > > +				err = -EOPNOTSUPP;
> > > > > > +				npages = i;
> > > > > > +				goto err_unmap;
> > > > > > +			}
> > > > > > +			order = hmm_pfn_to_map_order(pfns[i]);
> > > > > > +
> > > > > > +			pages[j] = hmm_pfn_to_page(pfns[i]);
> > > > > > +			if (WARN_ON_ONCE(is_zone_device_page(pages[j]))) {
> > > > > > +				err = -EOPNOTSUPP;
> > > > > > +				npages = i;
> > > > > > +				goto err_unmap;
> > > > > > +			}
> > > > > > +
> > > > > > +			set_page_dirty_lock(pages[j]);
> > > > > > +			mark_page_accessed(pages[j]);
> > > > > > +
> > > > > > +			dma_addr[j] = dma_map_page(gpusvm->drm->dev,
> > > > > > +						   pages[j], 0,
> > > > > > +						   PAGE_SIZE << order,
> > > > > > +						   DMA_BIDIRECTIONAL);
> > > > > > +			if (dma_mapping_error(gpusvm->drm->dev, dma_addr[j])) {
> > > > > > +				err = -EFAULT;
> > > > > > +				npages = i;
> > > > > > +				goto err_unmap;
> > > > > > +			}
> > > > > > +		}
> > > > > > +
> > > > > > +		/* Huge pages, reduce memory footprint */
> > > > > > +		if (order) {
> > > > > > +			dma_addr = kmalloc_array(j, sizeof(*dma_addr),
> > > > > > +						 GFP_KERNEL);
> > > > > > +			if (dma_addr) {
> > > > > > +				for (i = 0; i < j; ++i)
> > > > > > +					dma_addr[i] = (dma_addr_t)pfns[i];
> > > > > > +				kvfree(pfns);
> > > > > > +				kfree_mapping = true;
> > > > > > +			} else {
> > > > > > +				dma_addr = (dma_addr_t *)pfns;
> > > > > > +			}
> > > > > > +		}
> > > > > > +
> > > > > > +		/* Do not race with notifier unmapping pages */
> > > > > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > > > > +		range->order = order;
> > > > > > +		range->flags.kfree_mapping = kfree_mapping;
> > > > > > +		range->flags.has_dma_mapping = true;
> > > > > > +		range->dma_addr = dma_addr;
> > > > > > +		range->vram_allocation = NULL;
> > > > > > +		if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
> > > > > > +			err = -EAGAIN;
> > > > > > +			__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > > > > > +		}
> > > > > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > > > > +	}
> > > > > > +
> > > > > > +	if (err == -EAGAIN)
> > > > > > +		goto retry;
> > > > > > +set_seqno:
> > > > > > +	range->notifier_seq = hmm_range.notifier_seq;
> > > > > > +
> > > > > > +	return 0;
> > > > > > +
> > > > > > +err_unmap:
> > > > > > +	for_each_dma_page(i, j, npages, order)
> > > > > > +		dma_unmap_page(gpusvm->drm->dev,
> > > > > > +			       (dma_addr_t)pfns[j],
> > > > > > +			       PAGE_SIZE << order, DMA_BIDIRECTIONAL);
> > > > > > +err_free:
> > > > > > +	if (alloc_pfns)
> > > > > > +		kvfree(pfns);
> > > > > > +err_out:
> > > > > > +	return err;
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_range_unmap_pages - Unmap pages associated with a GPU SVM range
> > > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > > + * @range: Pointer to the GPU SVM range structure
> > > > > > + * @ctx: GPU SVM context
> > > > > > + *
> > > > > > + * This function unmaps pages associated with a GPU SVM range. If @in_notifier
> > > > > > + * is set, it is assumed that gpusvm->notifier_lock is held in write mode; if it
> > > > > > + * is clear, it acquires gpusvm->notifier_lock in read mode. Must be called on
> > > > > > + * each GPU SVM range attached to notifier in gpusvm->ops->invalidate for IOMMU
> > > > > > + * security model.
> > > > > > + */
> > > > > > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> > > > > > +				  struct drm_gpusvm_range *range,
> > > > > > +				  const struct drm_gpusvm_ctx *ctx)
> > > > > > +{
> > > > > > +	if (ctx->in_notifier)
> > > > > > +		lockdep_assert_held_write(&gpusvm->notifier_lock);
> > > > > > +	else
> > > > > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > > > > +
> > > > > > +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > > > > > +
> > > > > > +	if (!ctx->in_notifier)
> > > > > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_migration_put_page - Put a migration page
> > > > > > + * @page: Pointer to the page to put
> > > > > > + *
> > > > > > + * This function unlocks and puts a page.
> > > > > > + */
> > > > > > +static void drm_gpusvm_migration_put_page(struct page *page)
> > > > > > +{
> > > > > > +	unlock_page(page);
> > > > > > +	put_page(page);
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_migration_put_pages - Put migration pages
> > > > > > + * @npages: Number of pages
> > > > > > + * @migrate_pfn: Array of migrate page frame numbers
> > > > > > + *
> > > > > > + * This function puts an array of pages.
> > > > > > + */
> > > > > > +static void drm_gpusvm_migration_put_pages(unsigned long npages,
> > > > > > +					   unsigned long *migrate_pfn)
> > > > > > +{
> > > > > > +	unsigned long i;
> > > > > > +
> > > > > > +	for (i = 0; i < npages; ++i) {
> > > > > > +		if (!migrate_pfn[i])
> > > > > > +			continue;
> > > > > > +
> > > > > > +		drm_gpusvm_migration_put_page(migrate_pfn_to_page(migrate_pfn[i]));
> > > > > > +		migrate_pfn[i] = 0;
> > > > > > +	}
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_get_vram_page - Get a reference to a VRAM page
> > > > > > + * @page: Pointer to the page
> > > > > > + * @zdd: Pointer to the GPU SVM zone device data
> > > > > > + *
> > > > > > + * This function associates the given page with the specified GPU SVM zone
> > > > > > + * device data and initializes it for zone device usage.
> > > > > > + */
> > > > > > +static void drm_gpusvm_get_vram_page(struct page *page,
> > > > > > +				     struct drm_gpusvm_zdd *zdd)
> > > > > > +{
> > > > > > +	page->zone_device_data = drm_gpusvm_zdd_get(zdd);
> > > > > > +	zone_device_page_init(page);
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_migrate_map_pages() - Map migration pages for GPU SVM migration
> > > > > > + * @dev: The device for which the pages are being mapped
> > > > > > + * @dma_addr: Array to store DMA addresses corresponding to mapped pages
> > > > > > + * @migrate_pfn: Array of migrate page frame numbers to map
> > > > > > + * @npages: Number of pages to map
> > > > > > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > > > > > + *
> > > > > > + * This function maps pages of memory for migration usage in GPU SVM. It
> > > > > > + * iterates over each page frame number provided in @migrate_pfn, maps the
> > > > > > + * corresponding page, and stores the DMA address in the provided @dma_addr
> > > > > > + * array.
> > > > > > + *
> > > > > > + * Return: 0 on success, -EFAULT if an error occurs during mapping.
> > > > > > + */
> > > > > > +static int drm_gpusvm_migrate_map_pages(struct device *dev,
> > > > > > +					dma_addr_t *dma_addr,
> > > > > > +					unsigned long *migrate_pfn,
> > > > > > +					unsigned long npages,
> > > > > > +					enum dma_data_direction dir)
> > > > > > +{
> > > > > > +	unsigned long i;
> > > > > > +
> > > > > > +	for (i = 0; i < npages; ++i) {
> > > > > > +		struct page *page = migrate_pfn_to_page(migrate_pfn[i]);
> > > > > > +
> > > > > > +		if (!page)
> > > > > > +			continue;
> > > > > > +
> > > > > > +		if (WARN_ON_ONCE(is_zone_device_page(page)))
> > > > > > +			return -EFAULT;
> > > > > > +
> > > > > > +		dma_addr[i] = dma_map_page(dev, page, 0, PAGE_SIZE, dir);
> > > > > > +		if (dma_mapping_error(dev, dma_addr[i]))
> > > > > > +			return -EFAULT;
> > > > > > +	}
> > > > > > +
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_migrate_unmap_pages() - Unmap pages previously mapped for GPU SVM migration
> > > > > > + * @dev: The device for which the pages were mapped
> > > > > > + * @dma_addr: Array of DMA addresses corresponding to mapped pages
> > > > > > + * @npages: Number of pages to unmap
> > > > > > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > > > > > + *
> > > > > > + * This function unmaps previously mapped pages of memory for GPU Shared Virtual
> > > > > > + * Memory (SVM). It iterates over each DMA address provided in @dma_addr, checks
> > > > > > + * if it's valid and not already unmapped, and unmaps the corresponding page.
> > > > > > + */
> > > > > > +static void drm_gpusvm_migrate_unmap_pages(struct device *dev,
> > > > > > +					   dma_addr_t *dma_addr,
> > > > > > +					   unsigned long npages,
> > > > > > +					   enum dma_data_direction dir)
> > > > > > +{
> > > > > > +	unsigned long i;
> > > > > > +
> > > > > > +	for (i = 0; i < npages; ++i) {
> > > > > > +		if (!dma_addr[i] || dma_mapping_error(dev, dma_addr[i]))
> > > > > > +			continue;
> > > > > > +
> > > > > > +		dma_unmap_page(dev, dma_addr[i], PAGE_SIZE, dir);
> > > > > > +	}
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_migrate_to_vram - Migrate GPU SVM range to VRAM
> > > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > > + * @range: Pointer to the GPU SVM range structure
> > > > > > + * @vram_allocation: Driver-private pointer to the VRAM allocation. The caller
> > > > > > + *                   should hold a reference to the VRAM allocation, which
> > > > > > + *                   should be dropped via ops->vram_release or upon the
> > > > > > + *                   failure of this function.
> > > > > > + * @ctx: GPU SVM context
> > > > > > + *
> > > > > > + * This function migrates the specified GPU SVM range to VRAM. It performs the
> > > > > > + * necessary setup and invokes the driver-specific operations for migration to
> > > > > > + * VRAM. Upon successful return, @vram_allocation can safely reference @range
> > > > > > + * until ops->vram_release is called, which only happens upon successful return.
> > > > > > + *
> > > > > > + * Returns:
> > > > > > + * 0 on success, negative error code on failure.
> > > > > > + */
> > > > > > +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> > > > > > +			       struct drm_gpusvm_range *range,
> > > > > > +			       void *vram_allocation,
> > > > > > +			       const struct drm_gpusvm_ctx *ctx)
> > > > > > +{
> > > > > > +	u64 start = range->va.start, end = range->va.end;
> > > > > > +	struct migrate_vma migrate = {
> > > > > > +		.start		= start,
> > > > > > +		.end		= end,
> > > > > > +		.pgmap_owner	= gpusvm->device_private_page_owner,
> > > > > > +		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
> > > > > > +	};
> > > > > > +	struct mm_struct *mm = gpusvm->mm;
> > > > > > +	unsigned long i, npages = npages_in_range(start, end);
> > > > > > +	struct vm_area_struct *vas;
> > > > > > +	struct drm_gpusvm_zdd *zdd = NULL;
> > > > > > +	struct page **pages;
> > > > > > +	dma_addr_t *dma_addr;
> > > > > > +	void *buf;
> > > > > > +	int err;
> > > > > > +
> > > > > > +	if (!range->flags.migrate_vram)
> > > > > > +		return -EINVAL;
> > > > > > +
> > > > > > +	if (!gpusvm->ops->populate_vram_pfn || !gpusvm->ops->copy_to_vram ||
> > > > > > +	    !gpusvm->ops->copy_to_sram)
> > > > > > +		return -EOPNOTSUPP;
> > > > > > +
> > > > > > +	if (!ctx->mmap_locked) {
> > > > > > +		if (!mmget_not_zero(mm)) {
> > > > > > +			err = -EFAULT;
> > > > > > +			goto err_out;
> > > > > > +		}
> > > > > > +		mmap_write_lock(mm);
> > > > > > +	}
> > > > > > +
> > > > > > +	mmap_assert_locked(mm);
> > > > > > +
> > > > > > +	vas = vma_lookup(mm, start);
> > > > > > +	if (!vas) {
> > > > > > +		err = -ENOENT;
> > > > > > +		goto err_mmunlock;
> > > > > > +	}
> > > > > > +
> > > > > > +	if (end > vas->vm_end || start < vas->vm_start) {
> > > > > > +		err = -EINVAL;
> > > > > > +		goto err_mmunlock;
> > > > > > +	}
> > > > > > +
> > > > > > +	if (!vma_is_anonymous(vas)) {
> > > > > > +		err = -EBUSY;
> > > > > > +		goto err_mmunlock;
> > > > > > +	}
> > > > > > +
> > > > > > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) + sizeof(*dma_addr) +
> > > > > > +		       sizeof(*pages), GFP_KERNEL);
> > > > > > +	if (!buf) {
> > > > > > +		err = -ENOMEM;
> > > > > > +		goto err_mmunlock;
> > > > > > +	}
> > > > > > +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > > > > > +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr)) * npages;
> > > > > > +
> > > > > > +	zdd = drm_gpusvm_zdd_alloc(range);
> > > > > > +	if (!zdd) {
> > > > > > +		err = -ENOMEM;
> > > > > > +		goto err_free;
> > > > > > +	}
> > > > > > +
> > > > > > +	migrate.vma = vas;
> > > > > > +	migrate.src = buf;
> > > > > > +	migrate.dst = migrate.src + npages;
> > > > > > +
> > > > > > +	err = migrate_vma_setup(&migrate);
> > > > > > +	if (err)
> > > > > > +		goto err_free;
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * FIXME: The cases below, !migrate.cpages and migrate.cpages != npages,
> > > > > > +	 * are not always an error. Need to revisit the possible cases and how to
> > > > > > +	 * handle them. We could prefault on migrate.cpages != npages via
> > > > > > +	 * hmm_range_fault.
> > > > > > +	 */
> > > > > > +
> > > > > > +	if (!migrate.cpages) {
> > > > > > +		err = -EFAULT;
> > > > > > +		goto err_free;
> > > > > > +	}
> > > > > > +
> > > > > > +	if (migrate.cpages != npages) {
> > > > > > +		err = -EBUSY;
> > > > > > +		goto err_finalize;
> > > > > > +	}
> > > > > > +
> > > > > > +	err = gpusvm->ops->populate_vram_pfn(gpusvm, vram_allocation, npages,
> > > > > > +					     migrate.dst);
> > > > > > +	if (err)
> > > > > > +		goto err_finalize;
> > > > > > +
> > > > > > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev, dma_addr,
> > > > > > +					   migrate.src, npages, DMA_TO_DEVICE);
> > > > > > +	if (err)
> > > > > > +		goto err_finalize;
> > > > > > +
> > > > > > +	for (i = 0; i < npages; ++i) {
> > > > > > +		struct page *page = pfn_to_page(migrate.dst[i]);
> > > > > > +
> > > > > > +		pages[i] = page;
> > > > > > +		migrate.dst[i] = migrate_pfn(migrate.dst[i]);
> > > > > > +		drm_gpusvm_get_vram_page(page, zdd);
> > > > > > +	}
> > > > > > +
> > > > > > +	err = gpusvm->ops->copy_to_vram(gpusvm, pages, dma_addr, npages);
> > > > > > +	if (err)
> > > > > > +		goto err_finalize;
> > > > > > +
> > > > > > +	/* Upon success bind vram allocation to range and zdd */
> > > > > > +	range->vram_allocation = vram_allocation;
> > > > > > +	WRITE_ONCE(zdd->vram_allocation, vram_allocation);	/* Owns ref */
> > > > > > +
> > > > > > +err_finalize:
> > > > > > +	if (err)
> > > > > > +		drm_gpusvm_migration_put_pages(npages, migrate.dst);
> > > > > > +	migrate_vma_pages(&migrate);
> > > > > > +	migrate_vma_finalize(&migrate);
> > > > > > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr, npages,
> > > > > > +				       DMA_TO_DEVICE);
> > > > > > +err_free:
> > > > > > +	if (zdd)
> > > > > > +		drm_gpusvm_zdd_put(zdd);
> > > > > > +	kvfree(buf);
> > > > > > +err_mmunlock:
> > > > > > +	if (!ctx->mmap_locked) {
> > > > > > +		mmap_write_unlock(mm);
> > > > > > +		mmput(mm);
> > > > > > +	}
> > > > > > +err_out:
> > > > > > +	return err;
> > > > > > +}
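
For reference, the GPU fault handler flow this slots into is roughly the
following (a sketch only; error and retry handling is omitted and every name
outside the drm_gpusvm_* API is made up):

	struct drm_gpusvm_ctx ctx = { .vram_possible = true, };
	struct drm_gpusvm_range *range;
	int err;

	range = drm_gpusvm_range_find_or_insert(gpusvm, fault_addr,
						gpuva_start, gpuva_end, &ctx);
	/* handle lookup/insert failure per its return convention */

	/* A VRAM migration failure can fall back to SRAM pages */
	if (range->flags.migrate_vram)
		err = drm_gpusvm_migrate_to_vram(gpusvm, range,
						 example_vram_alloc, &ctx);

	err = drm_gpusvm_range_get_pages(gpusvm, range, &ctx);
	/* on success, bind range->dma_addr / range->pages into the GPU VM */
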
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_migrate_populate_sram_pfn - Populate SRAM PFNs for a VM area
> > > > > > + * @vas: Pointer to the VM area structure, can be NULL
> > > > > > + * @npages: Number of pages to populate
> > > > > > + * @src_mpfn: Source array of migrate PFNs
> > > > > > + * @mpfn: Array of migrate PFNs to populate
> > > > > > + * @addr: Start address for PFN allocation
> > > > > > + *
> > > > > > + * This function populates the SRAM migrate page frame numbers (PFNs) for the
> > > > > > + * specified VM area structure. It allocates and locks pages in the VM area for
> > > > > > + * SRAM usage. If @vas is non-NULL, alloc_page_vma() is used for allocation;
> > > > > > + * if NULL, alloc_page() is used.
> > > > > > + *
> > > > > > + * Returns:
> > > > > > + * 0 on success, negative error code on failure.
> > > > > > + */
> > > > > > +static int drm_gpusvm_migrate_populate_sram_pfn(struct vm_area_struct *vas,
> > > > > > +						unsigned long npages,
> > > > > > +						unsigned long *src_mpfn,
> > > > > > +						unsigned long *mpfn, u64 addr)
> > > > > > +{
> > > > > > +	unsigned long i;
> > > > > > +
> > > > > > +	for (i = 0; i < npages; ++i, addr += PAGE_SIZE) {
> > > > > > +		struct page *page;
> > > > > > +
> > > > > > +		if (!(src_mpfn[i] & MIGRATE_PFN_MIGRATE))
> > > > > > +			continue;
> > > > > > +
> > > > > > +		if (vas)
> > > > > > +			page = alloc_page_vma(GFP_HIGHUSER, vas, addr);
> > > > > > +		else
> > > > > > +			page = alloc_page(GFP_HIGHUSER);
> > > > > > +
> > > > > > +		if (!page)
> > > > > > +			return -ENOMEM;
> > > > > > +
> > > > > > +		lock_page(page);
> > > > > > +		mpfn[i] = migrate_pfn(page_to_pfn(page));
> > > > > > +	}
> > > > > > +
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_evict_to_sram - Evict GPU SVM range to SRAM
> > > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > > + * @range: Pointer to the GPU SVM range structure
> > > > > > + *
> > > > > > + * Similar to __drm_gpusvm_migrate_to_sram but does not require the mmap lock;
> > > > > > + * migration is done via the migrate_device_* functions. This is a fallback
> > > > > > + * path, as it is preferred to issue migrations while holding the mmap lock.
> > > > > > + *
> > > > > > + * Returns:
> > > > > > + * 0 on success, negative error code on failure.
> > > > > > + */
> > > > > > +static int drm_gpusvm_evict_to_sram(struct drm_gpusvm *gpusvm,
> > > > > > +				    struct drm_gpusvm_range *range)
> > > > > > +{
> > > > > > +	unsigned long npages;
> > > > > > +	struct page **pages;
> > > > > > +	unsigned long *src, *dst;
> > > > > > +	dma_addr_t *dma_addr;
> > > > > > +	void *buf;
> > > > > > +	int i, err = 0;
> > > > > > +
> > > > > > +	npages = npages_in_range(range->va.start, range->va.end);
> > > > > > +
> > > > > > +	buf = kvcalloc(npages, 2 * sizeof(*src) + sizeof(*dma_addr) +
> > > > > > +		       sizeof(*pages), GFP_KERNEL);
> > > > > > +	if (!buf) {
> > > > > > +		err = -ENOMEM;
> > > > > > +		goto err_out;
> > > > > > +	}
> > > > > > +	src = buf;
> > > > > > +	dst = buf + (sizeof(*src) * npages);
> > > > > > +	dma_addr = buf + (2 * sizeof(*src) * npages);
> > > > > > +	pages = buf + (2 * sizeof(*src) + sizeof(*dma_addr)) * npages;
> > > > > > +
> > > > > > +	err = gpusvm->ops->populate_vram_pfn(gpusvm, range->vram_allocation,
> > > > > > +					     npages, src);
> > > > > > +	if (err)
> > > > > > +		goto err_free;
> > > > > > +
> > > > > > +	err = migrate_device_vma_range(gpusvm->mm,
> > > > > > +				       gpusvm->device_private_page_owner, src,
> > > > > > +				       npages, range->va.start);
> > > > > > +	if (err)
> > > > > > +		goto err_free;
> > > > > > +
> > > > > > +	err = drm_gpusvm_migrate_populate_sram_pfn(NULL, npages, src, dst, 0);
> > > > > > +	if (err)
> > > > > > +		goto err_finalize;
> > > > > > +
> > > > > > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev, dma_addr,
> > > > > > +					   dst, npages, DMA_BIDIRECTIONAL);
> > > > > > +	if (err)
> > > > > > +		goto err_finalize;
> > > > > > +
> > > > > > +	for (i = 0; i < npages; ++i)
> > > > > > +		pages[i] = migrate_pfn_to_page(src[i]);
> > > > > > +
> > > > > > +	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr, npages);
> > > > > > +	if (err)
> > > > > > +		goto err_finalize;
> > > > > > +
> > > > > > +err_finalize:
> > > > > > +	if (err)
> > > > > > +		drm_gpusvm_migration_put_pages(npages, dst);
> > > > > > +	migrate_device_pages(src, dst, npages);
> > > > > > +	migrate_device_finalize(src, dst, npages);
> > > > > > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr, npages,
> > > > > > +				       DMA_BIDIRECTIONAL);
> > > > > > +err_free:
> > > > > > +	kvfree(buf);
> > > > > > +err_out:
> > > > > > +
> > > > > > +	return err;
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * __drm_gpusvm_migrate_to_sram - Migrate GPU SVM range to SRAM (internal)
> > > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > > + * @vas: Pointer to the VM area structure
> > > > > > + * @page: Pointer to the page for fault handling (can be NULL)
> > > > > > + * @start: Start address of the migration range
> > > > > > + * @end: End address of the migration range
> > > > > > + *
> > > > > > + * This internal function performs the migration of the specified GPU SVM range
> > > > > > + * to SRAM. It sets up the migration, populates + dma maps SRAM PFNs, and
> > > > > > + * invokes the driver-specific operations for migration to SRAM.
> > > > > > + *
> > > > > > + * Returns:
> > > > > > + * 0 on success, negative error code on failure.
> > > > > > + */
> > > > > > +static int __drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > > > > > +					struct vm_area_struct *vas,
> > > > > > +					struct page *page,
> > > > > > +					u64 start, u64 end)
> > > > > > +{
> > > > > > +	struct migrate_vma migrate = {
> > > > > > +		.vma		= vas,
> > > > > > +		.pgmap_owner	= gpusvm->device_private_page_owner,
> > > > > > +		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> > > > > > +		.fault_page	= page,
> > > > > > +	};
> > > > > > +	unsigned long npages;
> > > > > > +	struct page **pages;
> > > > > > +	dma_addr_t *dma_addr;
> > > > > > +	void *buf;
> > > > > > +	int i, err = 0;
> > > > > > +
> > > > > > +	mmap_assert_locked(gpusvm->mm);
> > > > > > +
> > > > > > +	/* Corner case where the VMA has been partially unmapped */
> > > > > > +	if (start < vas->vm_start)
> > > > > > +		start = vas->vm_start;
> > > > > > +	if (end > vas->vm_end)
> > > > > > +		end = vas->vm_end;
> > > > > > +
> > > > > > +	migrate.start = start;
> > > > > > +	migrate.end = end;
> > > > > > +	npages = npages_in_range(start, end);
> > > > > > +
> > > > > > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) + sizeof(*dma_addr) +
> > > > > > +		       sizeof(*pages), GFP_KERNEL);
> > > > > > +	if (!buf) {
> > > > > > +		err = -ENOMEM;
> > > > > > +		goto err_out;
> > > > > > +	}
> > > > > > +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > > > > > +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr)) * npages;
> > > > > > +
> > > > > > +	migrate.vma = vas;
> > > > > > +	migrate.src = buf;
> > > > > > +	migrate.dst = migrate.src + npages;
> > > > > > +
> > > > > > +	err = migrate_vma_setup(&migrate);
> > > > > > +	if (err)
> > > > > > +		goto err_free;
> > > > > > +
> > > > > > +	/* Raced with another CPU fault, nothing to do */
> > > > > > +	if (!migrate.cpages)
> > > > > > +		goto err_free;
> > > > > > +
> > > > > > +	err = drm_gpusvm_migrate_populate_sram_pfn(vas, npages,
> > > > > > +						   migrate.src, migrate.dst,
> > > > > > +						   start);
> > > > > > +	if (err)
> > > > > > +		goto err_finalize;
> > > > > > +
> > > > > > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev, dma_addr,
> > > > > > +					   migrate.dst, npages,
> > > > > > +					   DMA_BIDIRECTIONAL);
> > > > > > +	if (err)
> > > > > > +		goto err_finalize;
> > > > > > +
> > > > > > +	for (i = 0; i < npages; ++i)
> > > > > > +		pages[i] = migrate_pfn_to_page(migrate.src[i]);
> > > > > > +
> > > > > > +	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr, npages);
> > > > > > +	if (err)
> > > > > > +		goto err_finalize;
> > > > > > +
> > > > > > +err_finalize:
> > > > > > +	if (err)
> > > > > > +		drm_gpusvm_migration_put_pages(npages, migrate.dst);
> > > > > > +	migrate_vma_pages(&migrate);
> > > > > > +	migrate_vma_finalize(&migrate);
> > > > > > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr, npages,
> > > > > > +				       DMA_BIDIRECTIONAL);
> > > > > > +err_free:
> > > > > > +	kvfree(buf);
> > > > > > +err_out:
> > > > > > +	mmap_assert_locked(gpusvm->mm);
> > > > > > +
> > > > > > +	return err;
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_migrate_to_sram - Migrate (evict) GPU SVM range to SRAM
> > > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > > + * @range: Pointer to the GPU SVM range structure
> > > > > > + * @ctx: GPU SVM context
> > > > > > + *
> > > > > > + * This function initiates the migration of the specified GPU SVM range to
> > > > > > + * SRAM. It performs necessary checks and invokes the internal migration
> > > > > > + * function for actual migration.
> > > > > > + *
> > > > > > + * Returns:
> > > > > > + * 0 on success, negative error code on failure.
> > > > > > + */
> > > > > > +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > > > > > +			       struct drm_gpusvm_range *range,
> > > > > > +			       const struct drm_gpusvm_ctx *ctx)
> > > > > > +{
> > > > > > +	u64 start = range->va.start, end = range->va.end;
> > > > > > +	struct mm_struct *mm = gpusvm->mm;
> > > > > > +	struct vm_area_struct *vas;
> > > > > > +	int err;
> > > > > > +	bool retry = false;
> > > > > > +
> > > > > > +	if (!ctx->mmap_locked) {
> > > > > > +		if (!mmget_not_zero(mm)) {
> > > > > > +			err = -EFAULT;
> > > > > > +			goto err_out;
> > > > > > +		}
> > > > > > +		if (ctx->trylock_mmap) {
> > > > > > +			if (!mmap_read_trylock(mm))  {
> > > > > > +				err = drm_gpusvm_evict_to_sram(gpusvm, range);
> > > > > > +				goto err_mmput;
> > > > > > +			}
> > > > > > +		} else {
> > > > > > +			mmap_read_lock(mm);
> > > > > > +		}
> > > > > > +	}
> > > > > > +
> > > > > > +	mmap_assert_locked(mm);
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * Loop required to find all VMAs for the corner case when the VRAM
> > > > > > +	 * backing has been partially unmapped from the MM's address space.
> > > > > > +	 */
> > > > > > +again:
> > > > > > +	vas = find_vma(mm, start);
> > > > > > +	if (!vas) {
> > > > > > +		if (!retry)
> > > > > > +			err = -ENOENT;
> > > > > > +		goto err_mmunlock;
> > > > > > +	}
> > > > > > +
> > > > > > +	if (end <= vas->vm_start || start >= vas->vm_end) {
> > > > > > +		if (!retry)
> > > > > > +			err = -EINVAL;
> > > > > > +		goto err_mmunlock;
> > > > > > +	}
> > > > > > +
> > > > > > +	err = __drm_gpusvm_migrate_to_sram(gpusvm, vas, NULL, start, end);
> > > > > > +	if (err)
> > > > > > +		goto err_mmunlock;
> > > > > > +
> > > > > > +	if (vas->vm_end < end) {
> > > > > > +		retry = true;
> > > > > > +		start = vas->vm_end;
> > > > > > +		goto again;
> > > > > > +	}
> > > > > > +
> > > > > > +	if (!ctx->mmap_locked) {
> > > > > > +		mmap_read_unlock(mm);
> > > > > > +		/*
> > > > > > +		 * Using mmput_async as this function can be called while
> > > > > > +		 * holding a dma-resv lock, and a final put can grab the mmap
> > > > > > +		 * lock, causing a lock inversion.
> > > > > > +		 */
> > > > > > +		mmput_async(mm);
> > > > > > +	}
> > > > > > +
> > > > > > +	return 0;
> > > > > > +
> > > > > > +err_mmunlock:
> > > > > > +	if (!ctx->mmap_locked)
> > > > > > +		mmap_read_unlock(mm);
> > > > > > +err_mmput:
> > > > > > +	if (!ctx->mmap_locked)
> > > > > > +		mmput_async(mm);
> > > > > > +err_out:
> > > > > > +	return err;
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_page_free - Put GPU SVM zone device data associated with a page
> > > > > > + * @page: Pointer to the page
> > > > > > + *
> > > > > > + * This function is a callback used to put the GPU SVM zone device data
> > > > > > + * associated with a page when it is being released.
> > > > > > + */
> > > > > > +static void drm_gpusvm_page_free(struct page *page)
> > > > > > +{
> > > > > > +	drm_gpusvm_zdd_put(page->zone_device_data);
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_migrate_to_ram - Migrate GPU SVM range to RAM (page fault handler)
> > > > > > + * @vmf: Pointer to the fault information structure
> > > > > > + *
> > > > > > + * This function is a page fault handler used to migrate a GPU SVM range to RAM.
> > > > > > + * It retrieves the GPU SVM range information from the faulting page and invokes
> > > > > > + * the internal migration function to migrate the range back to RAM.
> > > > > > + *
> > > > > > + * Returns:
> > > > > > + * VM_FAULT_SIGBUS on failure, 0 on success.
> > > > > > + */
> > > > > > +static vm_fault_t drm_gpusvm_migrate_to_ram(struct vm_fault *vmf)
> > > > > > +{
> > > > > > +	struct drm_gpusvm_zdd *zdd = vmf->page->zone_device_data;
> > > > > > +	int err;
> > > > > > +
> > > > > > +	err = __drm_gpusvm_migrate_to_sram(zdd->range->gpusvm,
> > > > > > +					   vmf->vma, vmf->page,
> > > > > > +					   zdd->range->va.start,
> > > > > > +					   zdd->range->va.end);
> > > > > > +
> > > > > > +	return err ? VM_FAULT_SIGBUS : 0;
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_pagemap_ops - Device page map operations for GPU SVM
> > > > > > + */
> > > > > > +static const struct dev_pagemap_ops drm_gpusvm_pagemap_ops = {
> > > > > > +	.page_free = drm_gpusvm_page_free,
> > > > > > +	.migrate_to_ram = drm_gpusvm_migrate_to_ram,
> > > > > > +};
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_pagemap_ops_get - Retrieve GPU SVM device page map operations
> > > > > > + *
> > > > > > + * Returns:
> > > > > > + * Pointer to the GPU SVM device page map operations structure.
> > > > > > + */
> > > > > > +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void)
> > > > > > +{
> > > > > > +	return &drm_gpusvm_pagemap_ops;
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_has_mapping - Check if GPU SVM has mapping for the given address range
> > > > > > + * @gpusvm: Pointer to the GPU SVM structure.
> > > > > > + * @start: Start address
> > > > > > + * @end: End address
> > > > > > + *
> > > > > > + * Returns:
> > > > > > + * True if GPU SVM has mapping, False otherwise
> > > > > > + */
> > > > > > +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start, u64 end)
> > > > > > +{
> > > > > > +	struct drm_gpusvm_notifier *notifier;
> > > > > > +
> > > > > > +	drm_gpusvm_for_each_notifier(notifier, gpusvm, start, end) {
> > > > > > +		struct drm_gpusvm_range *range = NULL;
> > > > > > +
> > > > > > +		drm_gpusvm_for_each_range(range, notifier, start, end)
> > > > > > +			return true;
> > > > > > +	}
> > > > > > +
> > > > > > +	return false;
> > > > > > +}
> > > > > > diff --git a/drivers/gpu/drm/xe/drm_gpusvm.h b/drivers/gpu/drm/xe/drm_gpusvm.h
> > > > > > new file mode 100644
> > > > > > index 000000000000..0ea70f8534a8
> > > > > > --- /dev/null
> > > > > > +++ b/drivers/gpu/drm/xe/drm_gpusvm.h
> > > > > > @@ -0,0 +1,415 @@
> > > > > > +/* SPDX-License-Identifier: MIT */
> > > > > > +/*
> > > > > > + * Copyright © 2024 Intel Corporation
> > > > > > + */
> > > > > > +
> > > > > > +#ifndef __DRM_GPUSVM_H__
> > > > > > +#define __DRM_GPUSVM_H__
> > > > > > +
> > > > > > +#include <linux/kref.h>
> > > > > > +#include <linux/mmu_notifier.h>
> > > > > > +#include <linux/workqueue.h>
> > > > > > +
> > > > > > +struct dev_pagemap_ops;
> > > > > > +struct drm_device;
> > > > > > +struct drm_gpusvm;
> > > > > > +struct drm_gpusvm_notifier;
> > > > > > +struct drm_gpusvm_ops;
> > > > > > +struct drm_gpusvm_range;
> > > > > > +
> > > > > > +/**
> > > > > > + * struct drm_gpusvm_ops - Operations structure for GPU SVM
> > > > > > + *
> > > > > > + * This structure defines the operations for GPU Shared Virtual Memory (SVM).
> > > > > > + * These operations are provided by the GPU driver to manage SVM ranges and
> > > > > > + * perform operations such as migration between VRAM and system RAM.
> > > > > > + */
> > > > > > +struct drm_gpusvm_ops {
> > > > > > +	/**
> > > > > > +	 * @notifier_alloc: Allocate a GPU SVM notifier (optional)
> > > > > > +	 *
> > > > > > +	 * This function shall allocate a GPU SVM notifier.
> > > > > > +	 *
> > > > > > +	 * Returns:
> > > > > > +	 * Pointer to the allocated GPU SVM notifier on success, NULL on failure.
> > > > > > +	 */
> > > > > > +	struct drm_gpusvm_notifier *(*notifier_alloc)(void);
> > > > > > +
> > > > > > +	/**
> > > > > > +	 * @notifier_free: Free a GPU SVM notifier (optional)
> > > > > > +	 * @notifier: Pointer to the GPU SVM notifier to be freed
> > > > > > +	 *
> > > > > > +	 * This function shall free a GPU SVM notifier.
> > > > > > +	 */
> > > > > > +	void (*notifier_free)(struct drm_gpusvm_notifier *notifier);
> > > > > > +
> > > > > > +	/**
> > > > > > +	 * @range_alloc: Allocate a GPU SVM range (optional)
> > > > > > +	 * @gpusvm: Pointer to the GPU SVM
> > > > > > +	 *
> > > > > > +	 * This function shall allocate a GPU SVM range.
> > > > > > +	 *
> > > > > > +	 * Returns:
> > > > > > +	 * Pointer to the allocated GPU SVM range on success, NULL on failure.
> > > > > > +	 */
> > > > > > +	struct drm_gpusvm_range *(*range_alloc)(struct drm_gpusvm *gpusvm);
> > > > > > +
> > > > > > +	/**
> > > > > > +	 * @range_free: Free a GPU SVM range (optional)
> > > > > > +	 * @range: Pointer to the GPU SVM range to be freed
> > > > > > +	 *
> > > > > > +	 * This function shall free a GPU SVM range.
> > > > > > +	 */
> > > > > > +	void (*range_free)(struct drm_gpusvm_range *range);
> > > > > > +
> > > > > > +	/**
> > > > > > +	 * @vram_release: Release VRAM allocation (optional)
> > > > > > +	 * @vram_allocation: Driver-private pointer to the VRAM allocation
> > > > > > +	 *
> > > > > > +	 * This function shall release VRAM allocation and expects to drop a
> > > > > > +	 * reference to VRAM allocation.
> > > > > > +	 */
> > > > > > +	void (*vram_release)(void *vram_allocation);
> > > > > > +
> > > > > > +	/**
> > > > > > +	 * @populate_vram_pfn: Populate VRAM PFN (required for migration)
> > > > > > +	 * @gpusvm: Pointer to the GPU SVM
> > > > > > +	 * @vram_allocation: Driver-private pointer to the VRAM allocation
> > > > > > +	 * @npages: Number of pages to populate
> > > > > > +	 * @pfn: Array of page frame numbers to populate
> > > > > > +	 *
> > > > > > +	 * This function shall populate VRAM page frame numbers (PFN).
> > > > > > +	 *
> > > > > > +	 * Returns:
> > > > > > +	 * 0 on success, a negative error code on failure.
> > > > > > +	 */
> > > > > > +	int (*populate_vram_pfn)(struct drm_gpusvm *gpusvm,
> > > > > > +				 void *vram_allocation,
> > > > > > +				 unsigned long npages,
> > > > > > +				 unsigned long *pfn);
> > > > > > +
> > > > > > +	/**
> > > > > > +	 * @copy_to_vram: Copy to VRAM (required for migration)
> > > > > > +	 * @gpusvm: Pointer to the GPU SVM
> > > > > > +	 * @pages: Pointer to array of VRAM pages (destination)
> > > > > > +	 * @dma_addr: Pointer to array of DMA addresses (source)
> > > > > > +	 * @npages: Number of pages to copy
> > > > > > +	 *
> > > > > > +	 * This function shall copy pages to VRAM.
> > > > > > +	 *
> > > > > > +	 * Returns:
> > > > > > +	 * 0 on success, a negative error code on failure.
> > > > > > +	 */
> > > > > > +	int (*copy_to_vram)(struct drm_gpusvm *gpusvm,
> > > > > > +			    struct page **pages,
> > > > > > +			    dma_addr_t *dma_addr,
> > > > > > +			    unsigned long npages);
> > > > > > +
> > > > > > +	/**
> > > > > > +	 * @copy_to_sram: Copy to system RAM (required for migration)
> > > > > > +	 * @gpusvm: Pointer to the GPU SVM
> > > > > > +	 * @pages: Pointer to array of VRAM pages (source)
> > > > > > +	 * @dma_addr: Pointer to array of DMA addresses (destination)
> > > > > > +	 * @npages: Number of pages to copy
> > > > > > +	 *
> > > > > > +	 * This function shall copy pages to system RAM.
> > > > > > +	 *
> > > > > > +	 * Returns:
> > > > > > +	 * 0 on success, a negative error code on failure.
> > > > > > +	 */
> > > > > > +	int (*copy_to_sram)(struct drm_gpusvm *gpusvm,
> > > > > > +			    struct page **pages,
> > > > > > +			    dma_addr_t *dma_addr,
> > > > > > +			    unsigned long npages);
> > > > > > +
> > > > > > +	/**
> > > > > > +	 * @invalidate: Invalidate GPU SVM notifier (required)
> > > > > > +	 * @gpusvm: Pointer to the GPU SVM
> > > > > > +	 * @notifier: Pointer to the GPU SVM notifier
> > > > > > +	 * @mmu_range: Pointer to the mmu_notifier_range structure
> > > > > > +	 *
> > > > > > +	 * This function shall invalidate the GPU page tables. It can safely
> > > > > > +	 * walk the notifier range RB tree/list in this function. Called while
> > > > > > +	 * holding the notifier lock.
> > > > > > +	 */
> > > > > > +	void (*invalidate)(struct drm_gpusvm *gpusvm,
> > > > > > +			   struct drm_gpusvm_notifier *notifier,
> > > > > > +			   const struct mmu_notifier_range *mmu_range);
> > > > > > +};
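
So a driver supporting migration ends up wiring these together roughly as
follows (a sketch only; the function names are made up, .invalidate is the
only hard requirement, and the three migration hooks are needed only when
migration is used):

	static const struct drm_gpusvm_ops example_gpusvm_ops = {
		/* VRAM allocator hands out device PFNs here */
		.populate_vram_pfn = example_populate_vram_pfn,
		/* Driver blit/copy engine behind these two */
		.copy_to_vram = example_copy_to_vram,
		.copy_to_sram = example_copy_to_sram,
		/* GPU page-table zap, see the example_invalidate sketch below */
		.invalidate = example_invalidate,
	};
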
> > > > > > +
> > > > > > +/**
> > > > > > + * struct drm_gpusvm_notifier - Structure representing a GPU SVM notifier
> > > > > > + *
> > > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > > + * @notifier: MMU interval notifier
> > > > > > + * @interval: Interval for the notifier
> > > > > > + * @rb: Red-black tree node for the parent GPU SVM structure notifier tree
> > > > > > + * @root: Cached root node of the RB tree containing ranges
> > > > > > + * @range_list: List head of ranges in the same order they appear in the
> > > > > > + *              interval tree. This is useful to keep iterating ranges while
> > > > > > + *              doing modifications to the RB tree.
> > > > > > + * @flags.removed: Flag indicating whether the MMU interval notifier has been
> > > > > > + *                 removed
> > > > > > + *
> > > > > > + * This structure represents a GPU SVM notifier.
> > > > > > + */
> > > > > > +struct drm_gpusvm_notifier {
> > > > > > +	struct drm_gpusvm *gpusvm;
> > > > > > +	struct mmu_interval_notifier notifier;
> > > > > > +	struct {
> > > > > > +		u64 start;
> > > > > > +		u64 end;
> > > > > > +	} interval;
> > > > > > +	struct {
> > > > > > +		struct rb_node node;
> > > > > > +		struct list_head entry;
> > > > > > +		u64 __subtree_last;
> > > > > > +	} rb;
> > > > > > +	struct rb_root_cached root;
> > > > > > +	struct list_head range_list;
> > > > > > +	struct {
> > > > > > +		u32 removed : 1;
> > > > > > +	} flags;
> > > > > > +};
> > > > > > +
> > > > > > +/**
> > > > > > + * struct drm_gpusvm_range - Structure representing a GPU SVM range
> > > > > > + *
> > > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > > + * @notifier: Pointer to the GPU SVM notifier
> > > > > > + * @refcount: Reference count for the range
> > > > > > + * @rb: Red-black tree node for the parent GPU SVM notifier structure range tree
> > > > > > + * @va: Virtual address range
> > > > > > + * @notifier_seq: Notifier sequence number of the range's pages
> > > > > > + * @pages: Pointer to the array of pages (if backing store is in VRAM)
> > > > > > + * @dma_addr: DMA address array (if backing store is SRAM and DMA mapped)
> > > > > > + * @vram_allocation: Driver-private pointer to the VRAM allocation
> > > > > > + * @order: Order of dma mapping, i.e. PAGE_SIZE << order is the mapping size
> > > > > > + * @flags.migrate_vram: Flag indicating whether the range can be migrated to VRAM
> > > > > > + * @flags.unmapped: Flag indicating if the range has been unmapped
> > > > > > + * @flags.partial_unmap: Flag indicating if the range has been partially unmapped
> > > > > > + * @flags.has_vram_pages: Flag indicating if the range has vram pages
> > > > > > + * @flags.has_dma_mapping: Flag indicating if the range has a DMA mapping
> > > > > > + * @flags.kfree_mapping: Flag indicating @dma_addr is a compact allocation based
> > > > > > + *                       on @order which is released via kfree()
> > > > > > + *
> > > > > > + * This structure represents a GPU SVM range used for tracking memory ranges
> > > > > > + * mapped in a DRM device.
> > > > > > + */
> > > > > > +struct drm_gpusvm_range {
> > > > > > +	struct drm_gpusvm *gpusvm;
> > > > > > +	struct drm_gpusvm_notifier *notifier;
> > > > > > +	struct kref refcount;
> > > > > > +	struct {
> > > > > > +		struct rb_node node;
> > > > > > +		struct list_head entry;
> > > > > > +		u64 __subtree_last;
> > > > > > +	} rb;
> > > > > > +	struct {
> > > > > > +		u64 start;
> > > > > > +		u64 end;
> > > > > > +	} va;
> > > > > > +	unsigned long notifier_seq;
> > > > > > +	union {
> > > > > > +		struct page **pages;
> > > > > > +		dma_addr_t *dma_addr;
> > > > > > +	};
> > > > > > +	void *vram_allocation;
> > > > > > +	u16 order;
> > > > > > +	struct {
> > > > > > +		/* All flags below must be set upon creation */
> > > > > > +		u16 migrate_vram : 1;
> > > > > > +		/* All flags below must be set / cleared under notifier lock */
> > > > > > +		u16 unmapped : 1;
> > > > > > +		u16 partial_unmap : 1;
> > > > > > +		u16 has_vram_pages : 1;
> > > > > > +		u16 has_dma_mapping : 1;
> > > > > > +		u16 kfree_mapping : 1;
> > > > > > +	} flags;
> > > > > > +};
> > > > > > +
> > > > > > +/**
> > > > > > + * struct drm_gpusvm - GPU SVM structure
> > > > > > + *
> > > > > > + * @name: Name of the GPU SVM
> > > > > > + * @drm: Pointer to the DRM device structure
> > > > > > + * @mm: Pointer to the mm_struct for the address space
> > > > > > + * @device_private_page_owner: Device private pages owner
> > > > > > + * @mm_start: Start address of GPU SVM
> > > > > > + * @mm_range: Range of the GPU SVM
> > > > > > + * @notifier_size: Size of individual notifiers
> > > > > > + * @ops: Pointer to the operations structure for GPU SVM
> > > > > > + * @chunk_sizes: Pointer to the array of chunk sizes used in range allocation.
> > > > > > + *               Entries should be powers of 2 in descending order.
> > > > > > + * @num_chunks: Number of chunks
> > > > > > + * @notifier_lock: Read-write semaphore for protecting notifier operations
> > > > > > + * @zdd_wq: Workqueue for deferred work on zdd destruction
> > > > > > + * @root: Cached root node of the Red-Black tree containing GPU SVM notifiers
> > > > > > + * @notifier_list: List head of notifiers in the same order they appear in the
> > > > > > + *                 interval tree. This is useful to keep iterating notifiers
> > > > > > + *                 while doing modifications to the RB tree.
> > > > > > + *
> > > > > > + * This structure represents a GPU SVM (Shared Virtual Memory) used for tracking
> > > > > > + * memory ranges mapped in a DRM (Direct Rendering Manager) device.
> > > > > > + *
> > > > > > + * No reference counting is provided, as this is expected to be embedded in the
> > > > > > + * driver VM structure along with the struct drm_gpuvm, which handles reference
> > > > > > + * counting.
> > > > > > + */
> > > > > > +struct drm_gpusvm {
> > > > > > +	const char *name;
> > > > > > +	struct drm_device *drm;
> > > > > > +	struct mm_struct *mm;
> > > > > > +	void *device_private_page_owner;
> > > > > > +	u64 mm_start;
> > > > > > +	u64 mm_range;
> > > > > > +	u64 notifier_size;
> > > > > > +	const struct drm_gpusvm_ops *ops;
> > > > > > +	const u64 *chunk_sizes;
> > > > > > +	int num_chunks;
> > > > > > +	struct rw_semaphore notifier_lock;
> > > > > > +	struct workqueue_struct *zdd_wq;
> > > > > > +	struct rb_root_cached root;
> > > > > > +	struct list_head notifier_list;
> > > > > > +};
> > > > > > +
> > > > > > +/**
> > > > > > + * struct drm_gpusvm_ctx - DRM GPU SVM context
> > > > > > + *
> > > > > > + * @mmap_locked: mmap lock is locked
> > > > > > + * @trylock_mmap: trylock mmap lock, used to avoid locking inversions
> > > > > > + *                (e.g. dma-resv -> mmap lock)
> > > > > > + * @in_notifier: entering from an MMU notifier
> > > > > > + * @read_only: operating on read-only memory
> > > > > > + * @vram_possible: possible to use VRAM
> > > > > > + * @prefault: prefault pages
> > > > > > + *
> > > > > > + * Context that DRM GPU SVM is operating in (i.e. user arguments).
> > > > > > + */
> > > > > > +struct drm_gpusvm_ctx {
> > > > > > +	u32 mmap_locked :1;
> > > > > > +	u32 trylock_mmap :1;
> > > > > > +	u32 in_notifier :1;
> > > > > > +	u32 read_only :1;
> > > > > > +	u32 vram_possible :1;
> > > > > > +	u32 prefault :1;
> > > > > > +};
> > > > > > +
> > > > > > +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> > > > > > +		    const char *name, struct drm_device *drm,
> > > > > > +		    struct mm_struct *mm, void *device_private_page_owner,
> > > > > > +		    u64 mm_start, u64 mm_range, u64 notifier_size,
> > > > > > +		    const struct drm_gpusvm_ops *ops,
> > > > > > +		    const u64 *chunk_sizes, int num_chunks);
> > > > > > +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm);
> > > > > > +void drm_gpusvm_free(struct drm_gpusvm *gpusvm);
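
For completeness, instantiating this per GPU VM looks roughly like the sketch
below (every name and value here is illustrative rather than taken from the
Xe patches later in the series):

	static const u64 example_chunk_sizes[] = { SZ_2M, SZ_64K, SZ_4K };

	err = drm_gpusvm_init(&vm->svm, "example", drm, current->mm,
			      example_pgmap_owner, 0 /* mm_start */,
			      1ull << 47 /* mm_range */, SZ_512M /* notifier */,
			      &example_gpusvm_ops, example_chunk_sizes,
			      ARRAY_SIZE(example_chunk_sizes));
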
> > > > > > +
> > > > > > +struct drm_gpusvm_range *
> > > > > > +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64 fault_addr,
> > > > > > +				u64 gpuva_start, u64 gpuva_end,
> > > > > > +				const struct drm_gpusvm_ctx *ctx);
> > > > > > +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> > > > > > +			     struct drm_gpusvm_range *range);
> > > > > > +
> > > > > > +struct drm_gpusvm_range *
> > > > > > +drm_gpusvm_range_get(struct drm_gpusvm_range *range);
> > > > > > +void drm_gpusvm_range_put(struct drm_gpusvm_range *range);
> > > > > > +
> > > > > > +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> > > > > > +				  struct drm_gpusvm_range *range);
> > > > > > +
> > > > > > +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> > > > > > +			       struct drm_gpusvm_range *range,
> > > > > > +			       const struct drm_gpusvm_ctx *ctx);
> > > > > > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> > > > > > +				  struct drm_gpusvm_range *range,
> > > > > > +				  const struct drm_gpusvm_ctx *ctx);
> > > > > > +
> > > > > > +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> > > > > > +			       struct drm_gpusvm_range *range,
> > > > > > +			       void *vram_allocation,
> > > > > > +			       const struct drm_gpusvm_ctx *ctx);
> > > > > > +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > > > > > +			       struct drm_gpusvm_range *range,
> > > > > > +			       const struct drm_gpusvm_ctx *ctx);
> > > > > > +
> > > > > > +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void);
> > > > > > +
> > > > > > +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start, u64 end);
> > > > > > +
> > > > > > +struct drm_gpusvm_range *
> > > > > > +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64 start, u64 end);
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_notifier_lock - Lock GPU SVM notifier
> > > > > > + * @gpusvm__: Pointer to the GPU SVM structure.
> > > > > > + *
> > > > > > + * Abstract client usage of the GPU SVM notifier lock; take the lock
> > > > > > + */
> > > > > > +#define drm_gpusvm_notifier_lock(gpusvm__)	\
> > > > > > +	down_read(&(gpusvm__)->notifier_lock)
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_notifier_unlock - Unlock GPU SVM notifier
> > > > > > + * @gpusvm__: Pointer to the GPU SVM structure.
> > > > > > + *
> > > > > > + * Abstract client usage of the GPU SVM notifier lock; drop the lock
> > > > > > + */
> > > > > > +#define drm_gpusvm_notifier_unlock(gpusvm__)	\
> > > > > > +	up_read(&(gpusvm__)->notifier_lock)
> > > > > > +
> > > > > > +/**
> > > > > > + * __drm_gpusvm_range_next - Get the next GPU SVM range in the list
> > > > > > + * @range: a pointer to the current GPU SVM range
> > > > > > + *
> > > > > > + * Return: A pointer to the next drm_gpusvm_range if available, or NULL if the
> > > > > > + *         current range is the last one or if the input range is NULL.
> > > > > > + */
> > > > > > +static inline struct drm_gpusvm_range *
> > > > > > +__drm_gpusvm_range_next(struct drm_gpusvm_range *range)
> > > > > > +{
> > > > > > +	if (range && !list_is_last(&range->rb.entry,
> > > > > > +				   &range->notifier->range_list))
> > > > > > +		return list_next_entry(range, rb.entry);
> > > > > > +
> > > > > > +	return NULL;
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_for_each_range - Iterate over GPU SVM ranges in a notifier
> > > > > > + * @range__: Iterator variable for the ranges. If set, it indicates the start of
> > > > > > + *	     the iterator. If NULL, call drm_gpusvm_range_find() to get the range.
> > > > > > + * @notifier__: Pointer to the GPU SVM notifier
> > > > > > + * @start__: Start address of the range
> > > > > > + * @end__: End address of the range
> > > > > > + *
> > > > > > + * This macro is used to iterate over GPU SVM ranges in a notifier. It is safe
> > > > > > + * to use while holding the driver SVM lock or the notifier lock.
> > > > > > + */
> > > > > > +#define drm_gpusvm_for_each_range(range__, notifier__, start__, end__)	\
> > > > > > +	for ((range__) = (range__) ?:					\
> > > > > > +	     drm_gpusvm_range_find((notifier__), (start__), (end__));	\
> > > > > > +	     (range__) && ((range__)->va.start < (end__));		\
> > > > > > +	     (range__) = __drm_gpusvm_range_next(range__))
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpusvm_range_set_unmapped - Mark a GPU SVM range as unmapped
> > > > > > + * @range: Pointer to the GPU SVM range structure.
> > > > > > + * @mmu_range: Pointer to the MMU notifier range structure.
> > > > > > + *
> > > > > > + * This function marks a GPU SVM range as unmapped and sets the partial_unmap flag
> > > > > > + * if the range partially falls within the provided MMU notifier range.
> > > > > > + */
> > > > > > +static inline void
> > > > > > +drm_gpusvm_range_set_unmapped(struct drm_gpusvm_range *range,
> > > > > > +			      const struct mmu_notifier_range *mmu_range)
> > > > > > +{
> > > > > > +	lockdep_assert_held_write(&range->gpusvm->notifier_lock);
> > > > > > +
> > > > > > +	range->flags.unmapped = true;
> > > > > > +	if (range->va.start < mmu_range->start ||
> > > > > > +	    range->va.end > mmu_range->end)
> > > > > > +		range->flags.partial_unmap = true;
> > > > > > +}
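
Putting the iteration helpers and the unmap contract together, a driver's
->invalidate hook ends up looking something like this (a sketch only; the GPU
page-table zap itself is driver specific and the function name is made up):

	static void example_invalidate(struct drm_gpusvm *gpusvm,
				       struct drm_gpusvm_notifier *notifier,
				       const struct mmu_notifier_range *mmu_range)
	{
		struct drm_gpusvm_ctx ctx = { .in_notifier = true, };
		struct drm_gpusvm_range *range = NULL;

		/* Driver-specific GPU page-table invalidation goes here */

		drm_gpusvm_for_each_range(range, notifier, mmu_range->start,
					  mmu_range->end) {
			if (mmu_range->event == MMU_NOTIFY_UNMAP)
				drm_gpusvm_range_set_unmapped(range, mmu_range);
			drm_gpusvm_range_unmap_pages(gpusvm, range, &ctx);
		}
	}
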
> > > > > > +
> > > > > > +#endif /* __DRM_GPUSVM_H__ */
> > > > > > -- 
> > > > > > 2.34.1
> > > > > > 
> > > > 
> > 
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory
  2024-08-30  5:00     ` Matthew Brost
@ 2024-09-02 11:36       ` Daniel Vetter
  0 siblings, 0 replies; 100+ messages in thread
From: Daniel Vetter @ 2024-09-02 11:36 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Daniel Vetter, intel-xe, dri-devel, airlied, christian.koenig,
	thomas.hellstrom, matthew.auld, daniel

On Fri, Aug 30, 2024 at 05:00:11AM +0000, Matthew Brost wrote:
> On Wed, Aug 28, 2024 at 04:31:19PM +0200, Daniel Vetter wrote:
> > On Tue, Aug 27, 2024 at 07:48:38PM -0700, Matthew Brost wrote:
> > > +		if (!ctx->mmap_locked) {
> > > +			/*
> > > +			 * XXX: HMM locking document indicates only a read-lock
> > > +			 * is required but there appears to be a window between
> > > +			 * the MMU_NOTIFY_MIGRATE event triggered in a CPU fault
> > > +			 * via migrate_vma_setup and the pages actually moving
> > > +			 * in migrate_vma_finalize in which this code can grab
> > > +			 * garbage pages. Grabbing the write-lock if the range
> > > +			 * is attached to vram appears to protect against this
> > > +			 * race.
> > > +			 */
> > 
> > This one is really scary, since it means the entire migrate pte trickery
> > is essentially completely busted. Grabbing the mmap write lock just means
> > you block out pretty much everything interesting from concurrently
> > happening.
> > 
> > My gut feeling says we need to figure out what's happening here, because
> > this looks a bit too fundamental to me.
> > -Sima
> > 
> 
> Sima,
> 
> I’ve already replied to this.
> 
> We’ve discussed the mmap write hack extensively, so I’m not quite sure
> where to put this. The reply chain is quickly becoming a mess. However,
> I’ve looked into this and collected some data points based on your
> feedback.
> 
> I’ve pushed a branch [1] with the updated code.
> 
> The first new commit [2] removes the mmap write lock hack and addresses
> an issue related to VRAM migrations, which couldn’t collect all VRAM
> pages without this hack.
> 
> With this commit [2], xe_exec_system_allocator --r twice*race* fails
> quite regularly, perhaps 25% of the time. This test is a
> single-thread/process test that races CPU and GPU faults with migration.
> 
> It fails with the following dmesg:
> 
> [   68.473007] WARNING: CPU: 12 PID: 1643 at drivers/gpu/drm/xe/drm_gpusvm.c:1407 drm_gpusvm_range_get_pages+0xbda/0x1480 [xe]
> ...
> [   68.473836] xe 0000:03:00.0: [drm:pf_queue_work_func [xe]] Fault response: Unsuccessful -95
> [   68.474024] xe 0000:03:00.0: [drm:xe_guc_exec_queue_memory_cat_error_handler [xe]] GT1: Engine memory cat error: engine_class=vecs, logical_mask: 0x2, guc_id=0
> [   68.474163] xe 0000:03:00.0: [drm] exec queue reset detected
> [   68.474696] xe 0000:03:00.0: [drm] GT1: Engine reset: engine_class=vecs, logical_mask: 0x2, guc_id=0
> 
> This means hmm_range_fault collects a mix of SRAM and VRAM pages, which
> my design aims to avoid. Perhaps allowing a mix of SRAM and VRAM pages
> in my design might work, but I highly doubt it based on AMD's
> range->migration_mutex and my inspection of the migration layer.
> Allowing mixed mappings would introduce significant complexity, so I’d
> prefer to avoid this if possible. Additionally, allowing mixed mappings
> would eliminate the use of huge GPU pages when a race like this occurs.

Ah, if the issue is just that you get a mix of sram and vram pages from
hmm_range_fault, then I think that answers all my questions. From the
discussion we had and your comment it sounded like you're getting complete
nonsense, or missing an invalidation, or something else equally scary.

Thanks a lot for these details, I'm a lot less worried now here.

> I also implemented a retry loop to see if the system stabilizes with
> either only SRAM or VRAM pages. Unfortunately, it results in a
> continuous loop of drm_gpusvm_range_get_pages / hmm_range_fault until
> the test case kills the MM due to a timeout.

Yeah, the core mm makes no guarantees about forward progress for groups of
pages/folios. So if we go with core mm locking rules, then you have to
deal with individual pages/folios and anything bigger is just a
performance optimisation that must fall back to a per-page/folio approach.

> Next, I added a lock similar to AMD's range->migration_lock, but using
> an rwsem [3]. The semantics are to allow read access for CPU access and
> write access for GPU access, thus enabling parallel CPU page faults for
> the same range, which matches existing core semantics. This provides
> finer granularity compared to using the mmap write lock; it only
> disallows CPU and GPU servicing in parallel for a given range, rather
> than the entire MM. It also aligns with AMD’s approach. I haven’t
> checked Nvidia’s approach wrt this locking but can do so if you think it
> would be helpful.

Yeah, minus the entire question whether VA-based locking is ok, something
like amd's migration_mutex should also close the race you're having here.
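
Concretely, the rwsem semantics described above boil down to roughly the
following per-range usage (sketch only, not lifted from the branch, and the
field name is made up):

	/* CPU fault / migrate_to_ram path: concurrent CPU faults allowed */
	down_read(&range->example_migration_lock);
	/* ... migrate the range back to SRAM ... */
	up_read(&range->example_migration_lock);

	/* GPU fault path: exclusive against any CPU fault on the same range */
	down_write(&range->example_migration_lock);
	/* ... migrate to VRAM and/or collect pages ... */
	up_write(&range->example_migration_lock);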

Cheers, Sima

> 
> Matt
> 
> [1] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-svm-post-8-27-24/-/commits/mmap_write_lock
> [2] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-svm-post-8-27-24/-/commit/6cf67d98c719ffbb4ac6124a7cb81d797a5bad9f
> [3] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-svm-post-8-27-24/-/commit/2b62075d193265b2c1634ecfd0497dffd2e18c13
> 
> > 
> > > +			if (vram_pages)
> > > +				mmap_write_lock(mm);
> > > +			else
> > > +				mmap_read_lock(mm);
> > > +		}
> > > +		err = hmm_range_fault(&hmm_range);
> > > +		if (!ctx->mmap_locked) {
> > > +			if (vram_pages)
> > > +				mmap_write_unlock(mm);
> > > +			else
> > > +				mmap_read_unlock(mm);
> > > +		}
> > > +
> > > +		if (err == -EBUSY) {
> > > +			if (time_after(jiffies, timeout))
> > > +				break;
> > > +
> > > +			hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
> > > +			continue;
> > > +		}
> > > +		break;
> > > +	}
> > > +	if (!ctx->mmap_locked)
> > > +		mmput(mm);
> > > +	if (err)
> > > +		goto err_free;
> > > +
> > > +	pages = (struct page **)pfns;
> > > +
> > > +	if (ctx->prefault) {
> > > +		range->pages = pages;
> > > +		goto set_seqno;
> > > +	}
> > > +
> > > +map_pages:
> > > +	if (is_device_private_page(hmm_pfn_to_page(pfns[0]))) {
> > > +		WARN_ON_ONCE(!range->vram_allocation);
> > > +
> > > +		for (i = 0; i < npages; ++i) {
> > > +			pages[i] = hmm_pfn_to_page(pfns[i]);
> > > +
> > > +			if (WARN_ON_ONCE(!is_device_private_page(pages[i]))) {
> > > +				err = -EOPNOTSUPP;
> > > +				goto err_free;
> > > +			}
> > > +		}
> > > +
> > > +		/* Do not race with notifier unmapping pages */
> > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > +		range->flags.has_vram_pages = true;
> > > +		range->pages = pages;
> > > +		if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
> > > +			err = -EAGAIN;
> > > +			__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > > +		}
> > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > +	} else {
> > > +		dma_addr_t *dma_addr = (dma_addr_t *)pfns;
> > > +
> > > +		for_each_dma_page(i, j, npages, order) {
> > > +			if (WARN_ON_ONCE(i && order !=
> > > +					 hmm_pfn_to_map_order(pfns[i]))) {
> > > +				err = -EOPNOTSUPP;
> > > +				npages = i;
> > > +				goto err_unmap;
> > > +			}
> > > +			order = hmm_pfn_to_map_order(pfns[i]);
> > > +
> > > +			pages[j] = hmm_pfn_to_page(pfns[i]);
> > > +			if (WARN_ON_ONCE(is_zone_device_page(pages[j]))) {
> > > +				err = -EOPNOTSUPP;
> > > +				npages = i;
> > > +				goto err_unmap;
> > > +			}
> > > +
> > > +			set_page_dirty_lock(pages[j]);
> > > +			mark_page_accessed(pages[j]);
> > > +
> > > +			dma_addr[j] = dma_map_page(gpusvm->drm->dev,
> > > +						   pages[j], 0,
> > > +						   PAGE_SIZE << order,
> > > +						   DMA_BIDIRECTIONAL);
> > > +			if (dma_mapping_error(gpusvm->drm->dev, dma_addr[j])) {
> > > +				err = -EFAULT;
> > > +				npages = i;
> > > +				goto err_unmap;
> > > +			}
> > > +		}
> > > +
> > > +		/* Huge pages, reduce memory footprint */
> > > +		if (order) {
> > > +			dma_addr = kmalloc_array(j, sizeof(*dma_addr),
> > > +						 GFP_KERNEL);
> > > +			if (dma_addr) {
> > > +				for (i = 0; i < j; ++i)
> > > +					dma_addr[i] = (dma_addr_t)pfns[i];
> > > +				kvfree(pfns);
> > > +				kfree_mapping = true;
> > > +			} else {
> > > +				dma_addr = (dma_addr_t *)pfns;
> > > +			}
> > > +		}
> > > +
> > > +		/* Do not race with notifier unmapping pages */
> > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > +		range->order = order;
> > > +		range->flags.kfree_mapping = kfree_mapping;
> > > +		range->flags.has_dma_mapping = true;
> > > +		range->dma_addr = dma_addr;
> > > +		range->vram_allocation = NULL;
> > > +		if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
> > > +			err = -EAGAIN;
> > > +			__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > > +		}
> > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > +	}
> > > +
> > > +	if (err == -EAGAIN)
> > > +		goto retry;
> > > +set_seqno:
> > > +	range->notifier_seq = hmm_range.notifier_seq;
> > > +
> > > +	return 0;
> > > +
> > > +err_unmap:
> > > +	for_each_dma_page(i, j, npages, order)
> > > +		dma_unmap_page(gpusvm->drm->dev,
> > > +			       (dma_addr_t)pfns[j],
> > > +			       PAGE_SIZE << order, DMA_BIDIRECTIONAL);
> > > +err_free:
> > > +	if (alloc_pfns)
> > > +		kvfree(pfns);
> > > +err_out:
> > > +	return err;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_unmap_pages - Unmap pages associated with a GPU SVM range
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + * @ctx: GPU SVM context
> > > + *
> > > + * This function unmaps pages associated with a GPU SVM range. If @in_notifier
> > > + * is set, it is assumed that gpusvm->notifier_lock is held in write mode; if it
> > > + * is clear, it acquires gpusvm->notifier_lock in read mode. Must be called on
> > > + * each GPU SVM range attached to notifier in gpusvm->ops->invalidate for IOMMU
> > > + * security model.
> > > + */
> > > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> > > +				  struct drm_gpusvm_range *range,
> > > +				  const struct drm_gpusvm_ctx *ctx)
> > > +{
> > > +	if (ctx->in_notifier)
> > > +		lockdep_assert_held_write(&gpusvm->notifier_lock);
> > > +	else
> > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > +
> > > +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > > +
> > > +	if (!ctx->in_notifier)
> > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migration_put_page - Put a migration page
> > > + * @page: Pointer to the page to put
> > > + *
> > > + * This function unlocks and puts a page.
> > > + */
> > > +static void drm_gpusvm_migration_put_page(struct page *page)
> > > +{
> > > +	unlock_page(page);
> > > +	put_page(page);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migration_put_pages - Put migration pages
> > > + * @npages: Number of pages
> > > + * @migrate_pfn: Array of migrate page frame numbers
> > > + *
> > > + * This function puts an array of pages.
> > > + */
> > > +static void drm_gpusvm_migration_put_pages(unsigned long npages,
> > > +					   unsigned long *migrate_pfn)
> > > +{
> > > +	unsigned long i;
> > > +
> > > +	for (i = 0; i < npages; ++i) {
> > > +		if (!migrate_pfn[i])
> > > +			continue;
> > > +
> > > +		drm_gpusvm_migration_put_page(migrate_pfn_to_page(migrate_pfn[i]));
> > > +		migrate_pfn[i] = 0;
> > > +	}
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_get_vram_page - Get a reference to a VRAM page
> > > + * @page: Pointer to the page
> > > + * @zdd: Pointer to the GPU SVM zone device data
> > > + *
> > > + * This function associates the given page with the specified GPU SVM zone
> > > + * device data and initializes it for zone device usage.
> > > + */
> > > +static void drm_gpusvm_get_vram_page(struct page *page,
> > > +				     struct drm_gpusvm_zdd *zdd)
> > > +{
> > > +	page->zone_device_data = drm_gpusvm_zdd_get(zdd);
> > > +	zone_device_page_init(page);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migrate_map_pages() - Map migration pages for GPU SVM migration
> > > + * @dev: The device for which the pages are being mapped
> > > + * @dma_addr: Array to store DMA addresses corresponding to mapped pages
> > > + * @migrate_pfn: Array of migrate page frame numbers to map
> > > + * @npages: Number of pages to map
> > > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > > + *
> > > + * This function maps pages of memory for migration usage in GPU SVM. It
> > > + * iterates over each page frame number provided in @migrate_pfn, maps the
> > > + * corresponding page, and stores the DMA address in the provided @dma_addr
> > > + * array.
> > > + *
> > > + * Return: 0 on success, -EFAULT if an error occurs during mapping.
> > > + */
> > > +static int drm_gpusvm_migrate_map_pages(struct device *dev,
> > > +					dma_addr_t *dma_addr,
> > > +					unsigned long *migrate_pfn,
> > > +					unsigned long npages,
> > > +					enum dma_data_direction dir)
> > > +{
> > > +	unsigned long i;
> > > +
> > > +	for (i = 0; i < npages; ++i) {
> > > +		struct page *page = migrate_pfn_to_page(migrate_pfn[i]);
> > > +
> > > +		if (!page)
> > > +			continue;
> > > +
> > > +		if (WARN_ON_ONCE(is_zone_device_page(page)))
> > > +			return -EFAULT;
> > > +
> > > +		dma_addr[i] = dma_map_page(dev, page, 0, PAGE_SIZE, dir);
> > > +		if (dma_mapping_error(dev, dma_addr[i]))
> > > +			return -EFAULT;
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migrate_unmap_pages() - Unmap pages previously mapped for GPU SVM migration
> > > + * @dev: The device for which the pages were mapped
> > > + * @dma_addr: Array of DMA addresses corresponding to mapped pages
> > > + * @npages: Number of pages to unmap
> > > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > > + *
> > > + * This function unmaps previously mapped pages of memory for GPU Shared Virtual
> > > + * Memory (SVM). It iterates over each DMA address provided in @dma_addr, checks
> > > + * if it's valid and not already unmapped, and unmaps the corresponding page.
> > > + */
> > > +static void drm_gpusvm_migrate_unmap_pages(struct device *dev,
> > > +					   dma_addr_t *dma_addr,
> > > +					   unsigned long npages,
> > > +					   enum dma_data_direction dir)
> > > +{
> > > +	unsigned long i;
> > > +
> > > +	for (i = 0; i < npages; ++i) {
> > > +		if (!dma_addr[i] || dma_mapping_error(dev, dma_addr[i]))
> > > +			continue;
> > > +
> > > +		dma_unmap_page(dev, dma_addr[i], PAGE_SIZE, dir);
> > > +	}
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migrate_to_vram - Migrate GPU SVM range to VRAM
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + * @vram_allocation: Driver-private pointer to the VRAM allocation. The caller
> > > + *                   should hold a reference to the VRAM allocation, which
> > > + *                   should be dropped via ops->vram_release or upon the
> > > + *                   failure of this function.
> > > + * @ctx: GPU SVM context
> > > + *
> > > + * This function migrates the specified GPU SVM range to VRAM. It performs the
> > > + * necessary setup and invokes the driver-specific operations for migration to
> > > + * VRAM. Upon successful return, @vram_allocation can safely reference @range
> > > + * until ops->vram_release is called, which only happens after a successful
> > > + * return of this function.
> > > + *
> > > + * Returns:
> > > + * 0 on success, negative error code on failure.
> > > + */
> > > +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> > > +			       struct drm_gpusvm_range *range,
> > > +			       void *vram_allocation,
> > > +			       const struct drm_gpusvm_ctx *ctx)
> > > +{
> > > +	u64 start = range->va.start, end = range->va.end;
> > > +	struct migrate_vma migrate = {
> > > +		.start		= start,
> > > +		.end		= end,
> > > +		.pgmap_owner	= gpusvm->device_private_page_owner,
> > > +		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
> > > +	};
> > > +	struct mm_struct *mm = gpusvm->mm;
> > > +	unsigned long i, npages = npages_in_range(start, end);
> > > +	struct vm_area_struct *vas;
> > > +	struct drm_gpusvm_zdd *zdd = NULL;
> > > +	struct page **pages;
> > > +	dma_addr_t *dma_addr;
> > > +	void *buf;
> > > +	int err;
> > > +
> > > +	if (!range->flags.migrate_vram)
> > > +		return -EINVAL;
> > > +
> > > +	if (!gpusvm->ops->populate_vram_pfn || !gpusvm->ops->copy_to_vram ||
> > > +	    !gpusvm->ops->copy_to_sram)
> > > +		return -EOPNOTSUPP;
> > > +
> > > +	if (!ctx->mmap_locked) {
> > > +		if (!mmget_not_zero(mm)) {
> > > +			err = -EFAULT;
> > > +			goto err_out;
> > > +		}
> > > +		mmap_write_lock(mm);
> > > +	}
> > > +
> > > +	mmap_assert_locked(mm);
> > > +
> > > +	vas = vma_lookup(mm, start);
> > > +	if (!vas) {
> > > +		err = -ENOENT;
> > > +		goto err_mmunlock;
> > > +	}
> > > +
> > > +	if (end > vas->vm_end || start < vas->vm_start) {
> > > +		err = -EINVAL;
> > > +		goto err_mmunlock;
> > > +	}
> > > +
> > > +	if (!vma_is_anonymous(vas)) {
> > > +		err = -EBUSY;
> > > +		goto err_mmunlock;
> > > +	}
> > > +
> > > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) + sizeof(*dma_addr) +
> > > +		       sizeof(*pages), GFP_KERNEL);
> > > +	if (!buf) {
> > > +		err = -ENOMEM;
> > > +		goto err_mmunlock;
> > > +	}
> > > +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > > +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr)) * npages;
> > > +
> > > +	zdd = drm_gpusvm_zdd_alloc(range);
> > > +	if (!zdd) {
> > > +		err = -ENOMEM;
> > > +		goto err_free;
> > > +	}
> > > +
> > > +	migrate.vma = vas;
> > > +	migrate.src = buf;
> > > +	migrate.dst = migrate.src + npages;
> > > +
> > > +	err = migrate_vma_setup(&migrate);
> > > +	if (err)
> > > +		goto err_free;
> > > +
> > > +	/*
> > > +	 * FIXME: Below cases, !migrate.cpages and migrate.cpages != npages, not
> > > +	 * always an error. Need to revisit possible cases and how to handle. We
> > > +	 * could prefault on migrate.cpages != npages via hmm_range_fault.
> > > +	 */
> > > +
> > > +	if (!migrate.cpages) {
> > > +		err = -EFAULT;
> > > +		goto err_free;
> > > +	}
> > > +
> > > +	if (migrate.cpages != npages) {
> > > +		err = -EBUSY;
> > > +		goto err_finalize;
> > > +	}
> > > +
> > > +	err = gpusvm->ops->populate_vram_pfn(gpusvm, vram_allocation, npages,
> > > +					     migrate.dst);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev, dma_addr,
> > > +					   migrate.src, npages, DMA_TO_DEVICE);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	for (i = 0; i < npages; ++i) {
> > > +		struct page *page = pfn_to_page(migrate.dst[i]);
> > > +
> > > +		pages[i] = page;
> > > +		migrate.dst[i] = migrate_pfn(migrate.dst[i]);
> > > +		drm_gpusvm_get_vram_page(page, zdd);
> > > +	}
> > > +
> > > +	err = gpusvm->ops->copy_to_vram(gpusvm, pages, dma_addr, npages);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	/* Upon success bind vram allocation to range and zdd */
> > > +	range->vram_allocation = vram_allocation;
> > > +	WRITE_ONCE(zdd->vram_allocation, vram_allocation);	/* Owns ref */
> > > +
> > > +err_finalize:
> > > +	if (err)
> > > +		drm_gpusvm_migration_put_pages(npages, migrate.dst);
> > > +	migrate_vma_pages(&migrate);
> > > +	migrate_vma_finalize(&migrate);
> > > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr, npages,
> > > +				       DMA_TO_DEVICE);
> > > +err_free:
> > > +	if (zdd)
> > > +		drm_gpusvm_zdd_put(zdd);
> > > +	kvfree(buf);
> > > +err_mmunlock:
> > > +	if (!ctx->mmap_locked) {
> > > +		mmap_write_unlock(mm);
> > > +		mmput(mm);
> > > +	}
> > > +err_out:
> > > +	return err;
> > > +}
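
Maybe worth spelling out the reference flow with a quick sketch (illustrative
only, not part of the patch, my_driver_* names invented): the caller allocates
a driver-private VRAM allocation, hands a reference to GPU SVM, and only drops
that reference itself if the migration fails; on success it is dropped later
via ops->vram_release.

        static int my_driver_migrate_range_to_vram(struct drm_gpusvm *gpusvm,
                                                   struct drm_gpusvm_range *range)
        {
                struct drm_gpusvm_ctx ctx = {};
                void *vram;
                int err;

                vram = my_driver_vram_alloc(range->va.end - range->va.start);
                if (!vram)
                        return -ENOMEM;

                err = drm_gpusvm_migrate_to_vram(gpusvm, range, vram, &ctx);
                if (err)
                        my_driver_vram_free(vram); /* ref still ours on failure */

                return err;
        }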
> > > +
> > > +/**
> > > + * drm_gpusvm_migrate_populate_sram_pfn - Populate SRAM PFNs for a VM area
> > > + * @vas: Pointer to the VM area structure, can be NULL
> > > + * @npages: Number of pages to populate
> > > + * @src_mpfn: Source array of migrate PFNs
> > > + * @mpfn: Array of migrate PFNs to populate
> > > + * @addr: Start address for PFN allocation
> > > + *
> > > + * This function populates the SRAM migrate page frame numbers (PFNs) for the
> > > + * specified VM area structure. It allocates and locks pages in the VM area for
> > > + * SRAM usage. If @vas is non-NULL, alloc_page_vma() is used for allocation;
> > > + * otherwise alloc_page() is used.
> > > + *
> > > + * Returns:
> > > + * 0 on success, negative error code on failure.
> > > + */
> > > +static int drm_gpusvm_migrate_populate_sram_pfn(struct vm_area_struct *vas,
> > > +						unsigned long npages,
> > > +						unsigned long *src_mpfn,
> > > +						unsigned long *mpfn, u64 addr)
> > > +{
> > > +	unsigned long i;
> > > +
> > > +	for (i = 0; i < npages; ++i, addr += PAGE_SIZE) {
> > > +		struct page *page;
> > > +
> > > +		if (!(src_mpfn[i] & MIGRATE_PFN_MIGRATE))
> > > +			continue;
> > > +
> > > +		if (vas)
> > > +			page = alloc_page_vma(GFP_HIGHUSER, vas, addr);
> > > +		else
> > > +			page = alloc_page(GFP_HIGHUSER);
> > > +
> > > +		if (!page)
> > > +			return -ENOMEM;
> > > +
> > > +		lock_page(page);
> > > +		mpfn[i] = migrate_pfn(page_to_pfn(page));
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_evict_to_sram - Evict GPU SVM range to SRAM
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + *
> > > + * Similar to __drm_gpusvm_migrate_to_sram but does not require the mmap lock;
> > > + * migration is done via the migrate_device_* functions. This is a fallback
> > > + * path, as it is preferred to issue migrations with the mmap lock held.
> > > + *
> > > + * Returns:
> > > + * 0 on success, negative error code on failure.
> > > + */
> > > +static int drm_gpusvm_evict_to_sram(struct drm_gpusvm *gpusvm,
> > > +				    struct drm_gpusvm_range *range)
> > > +{
> > > +	unsigned long npages;
> > > +	struct page **pages;
> > > +	unsigned long *src, *dst;
> > > +	dma_addr_t *dma_addr;
> > > +	void *buf;
> > > +	int i, err = 0;
> > > +
> > > +	npages = npages_in_range(range->va.start, range->va.end);
> > > +
> > > +	buf = kvcalloc(npages, 2 * sizeof(*src) + sizeof(*dma_addr) +
> > > +		       sizeof(*pages), GFP_KERNEL);
> > > +	if (!buf) {
> > > +		err = -ENOMEM;
> > > +		goto err_out;
> > > +	}
> > > +	src = buf;
> > > +	dst = buf + (sizeof(*src) * npages);
> > > +	dma_addr = buf + (2 * sizeof(*src) * npages);
> > > +	pages = buf + (2 * sizeof(*src) + sizeof(*dma_addr)) * npages;
> > > +
> > > +	err = gpusvm->ops->populate_vram_pfn(gpusvm, range->vram_allocation,
> > > +					     npages, src);
> > > +	if (err)
> > > +		goto err_free;
> > > +
> > > +	err = migrate_device_vma_range(gpusvm->mm,
> > > +				       gpusvm->device_private_page_owner, src,
> > > +				       npages, range->va.start);
> > > +	if (err)
> > > +		goto err_free;
> > > +
> > > +	err = drm_gpusvm_migrate_populate_sram_pfn(NULL, npages, src, dst, 0);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev, dma_addr,
> > > +					   dst, npages, DMA_BIDIRECTIONAL);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	for (i = 0; i < npages; ++i)
> > > +		pages[i] = migrate_pfn_to_page(src[i]);
> > > +
> > > +	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr, npages);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +err_finalize:
> > > +	if (err)
> > > +		drm_gpusvm_migration_put_pages(npages, dst);
> > > +	migrate_device_pages(src, dst, npages);
> > > +	migrate_device_finalize(src, dst, npages);
> > > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr, npages,
> > > +				       DMA_BIDIRECTIONAL);
> > > +err_free:
> > > +	kvfree(buf);
> > > +err_out:
> > > +
> > > +	return err;
> > > +}
> > > +
> > > +/**
> > > + * __drm_gpusvm_migrate_to_sram - Migrate GPU SVM range to SRAM (internal)
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @vas: Pointer to the VM area structure
> > > + * @page: Pointer to the page for fault handling (can be NULL)
> > > + * @start: Start address of the migration range
> > > + * @end: End address of the migration range
> > > + *
> > > + * This internal function performs the migration of the specified GPU SVM range
> > > + * to SRAM. It sets up the migration, populates + dma maps SRAM PFNs, and
> > > + * invokes the driver-specific operations for migration to SRAM.
> > > + *
> > > + * Returns:
> > > + * 0 on success, negative error code on failure.
> > > + */
> > > +static int __drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > > +					struct vm_area_struct *vas,
> > > +					struct page *page,
> > > +					u64 start, u64 end)
> > > +{
> > > +	struct migrate_vma migrate = {
> > > +		.vma		= vas,
> > > +		.pgmap_owner	= gpusvm->device_private_page_owner,
> > > +		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> > > +		.fault_page	= page,
> > > +	};
> > > +	unsigned long npages;
> > > +	struct page **pages;
> > > +	dma_addr_t *dma_addr;
> > > +	void *buf;
> > > +	int i, err = 0;
> > > +
> > > +	mmap_assert_locked(gpusvm->mm);
> > > +
> > > +	/* Corner case where the VM area struct has been partially unmapped */
> > > +	if (start < vas->vm_start)
> > > +		start = vas->vm_start;
> > > +	if (end > vas->vm_end)
> > > +		end = vas->vm_end;
> > > +
> > > +	migrate.start = start;
> > > +	migrate.end = end;
> > > +	npages = npages_in_range(start, end);
> > > +
> > > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) + sizeof(*dma_addr) +
> > > +		       sizeof(*pages), GFP_KERNEL);
> > > +	if (!buf) {
> > > +		err = -ENOMEM;
> > > +		goto err_out;
> > > +	}
> > > +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > > +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr)) * npages;
> > > +
> > > +	migrate.vma = vas;
> > > +	migrate.src = buf;
> > > +	migrate.dst = migrate.src + npages;
> > > +
> > > +	err = migrate_vma_setup(&migrate);
> > > +	if (err)
> > > +		goto err_free;
> > > +
> > > +	/* Raced with another CPU fault, nothing to do */
> > > +	if (!migrate.cpages)
> > > +		goto err_free;
> > > +
> > > +	err = drm_gpusvm_migrate_populate_sram_pfn(vas, npages,
> > > +						   migrate.src, migrate.dst,
> > > +						   start);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev, dma_addr,
> > > +					   migrate.dst, npages,
> > > +					   DMA_BIDIRECTIONAL);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	for (i = 0; i < npages; ++i)
> > > +		pages[i] = migrate_pfn_to_page(migrate.src[i]);
> > > +
> > > +	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr, npages);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +err_finalize:
> > > +	if (err)
> > > +		drm_gpusvm_migration_put_pages(npages, migrate.dst);
> > > +	migrate_vma_pages(&migrate);
> > > +	migrate_vma_finalize(&migrate);
> > > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr, npages,
> > > +				       DMA_BIDIRECTIONAL);
> > > +err_free:
> > > +	kvfree(buf);
> > > +err_out:
> > > +	mmap_assert_locked(gpusvm->mm);
> > > +
> > > +	return err;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migrate_to_sram - Migrate (evict) GPU SVM range to SRAM
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + * @ctx: GPU SVM context
> > > + *
> > > + * This function initiates the migration of the specified GPU SVM range to
> > > + * SRAM. It performs necessary checks and invokes the internal migration
> > > + * function for actual migration.
> > > + *
> > > + * Returns:
> > > + * 0 on success, negative error code on failure.
> > > + */
> > > +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > > +			       struct drm_gpusvm_range *range,
> > > +			       const struct drm_gpusvm_ctx *ctx)
> > > +{
> > > +	u64 start = range->va.start, end = range->va.end;
> > > +	struct mm_struct *mm = gpusvm->mm;
> > > +	struct vm_area_struct *vas;
> > > +	int err;
> > > +	bool retry = false;
> > > +
> > > +	if (!ctx->mmap_locked) {
> > > +		if (!mmget_not_zero(mm)) {
> > > +			err = -EFAULT;
> > > +			goto err_out;
> > > +		}
> > > +		if (ctx->trylock_mmap) {
> > > +			if (!mmap_read_trylock(mm))  {
> > > +				err = drm_gpusvm_evict_to_sram(gpusvm, range);
> > > +				goto err_mmput;
> > > +			}
> > > +		} else {
> > > +			mmap_read_lock(mm);
> > > +		}
> > > +	}
> > > +
> > > +	mmap_assert_locked(mm);
> > > +
> > > +	/*
> > > +	 * Loop required to find all VMA area structs for the corner case when
> > > +	 * VRAM backing has been partially unmapped from MM's address space.
> > > +	 */
> > > +again:
> > > +	vas = find_vma(mm, start);
> > > +	if (!vas) {
> > > +		if (!retry)
> > > +			err = -ENOENT;
> > > +		goto err_mmunlock;
> > > +	}
> > > +
> > > +	if (end <= vas->vm_start || start >= vas->vm_end) {
> > > +		if (!retry)
> > > +			err = -EINVAL;
> > > +		goto err_mmunlock;
> > > +	}
> > > +
> > > +	err = __drm_gpusvm_migrate_to_sram(gpusvm, vas, NULL, start, end);
> > > +	if (err)
> > > +		goto err_mmunlock;
> > > +
> > > +	if (vas->vm_end < end) {
> > > +		retry = true;
> > > +		start = vas->vm_end;
> > > +		goto again;
> > > +	}
> > > +
> > > +	if (!ctx->mmap_locked) {
> > > +		mmap_read_unlock(mm);
> > > +		/*
> > > +		 * Using mmput_async as this function can be called while
> > > +		 * holding a dma-resv lock, and a final put can grab the mmap
> > > +		 * lock, causing a lock inversion.
> > > +		 */
> > > +		mmput_async(mm);
> > > +	}
> > > +
> > > +	return 0;
> > > +
> > > +err_mmunlock:
> > > +	if (!ctx->mmap_locked)
> > > +		mmap_read_unlock(mm);
> > > +err_mmput:
> > > +	if (!ctx->mmap_locked)
> > > +		mmput_async(mm);
> > > +err_out:
> > > +	return err;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_page_free - Put GPU SVM zone device data associated with a page
> > > + * @page: Pointer to the page
> > > + *
> > > + * This function is a callback used to put the GPU SVM zone device data
> > > + * associated with a page when it is being released.
> > > + */
> > > +static void drm_gpusvm_page_free(struct page *page)
> > > +{
> > > +	drm_gpusvm_zdd_put(page->zone_device_data);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migrate_to_ram - Migrate GPU SVM range to RAM (page fault handler)
> > > + * @vmf: Pointer to the fault information structure
> > > + *
> > > + * This function is a page fault handler used to migrate a GPU SVM range to RAM.
> > > + * It retrieves the GPU SVM range information from the faulting page and invokes
> > > + * the internal migration function to migrate the range back to RAM.
> > > + *
> > > + * Returns:
> > > + * VM_FAULT_SIGBUS on failure, 0 on success.
> > > + */
> > > +static vm_fault_t drm_gpusvm_migrate_to_ram(struct vm_fault *vmf)
> > > +{
> > > +	struct drm_gpusvm_zdd *zdd = vmf->page->zone_device_data;
> > > +	int err;
> > > +
> > > +	err = __drm_gpusvm_migrate_to_sram(zdd->range->gpusvm,
> > > +					   vmf->vma, vmf->page,
> > > +					   zdd->range->va.start,
> > > +					   zdd->range->va.end);
> > > +
> > > +	return err ? VM_FAULT_SIGBUS : 0;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_pagemap_ops - Device page map operations for GPU SVM
> > > + */
> > > +static const struct dev_pagemap_ops drm_gpusvm_pagemap_ops = {
> > > +	.page_free = drm_gpusvm_page_free,
> > > +	.migrate_to_ram = drm_gpusvm_migrate_to_ram,
> > > +};
> > > +
> > > +/**
> > > + * drm_gpusvm_pagemap_ops_get - Retrieve GPU SVM device page map operations
> > > + *
> > > + * Returns:
> > > + * Pointer to the GPU SVM device page map operations structure.
> > > + */
> > > +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void)
> > > +{
> > > +	return &drm_gpusvm_pagemap_ops;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_has_mapping - Check if GPU SVM has mapping for the given address range
> > > + * @gpusvm: Pointer to the GPU SVM structure.
> > > + * @start: Start address
> > > + * @end: End address
> > > + *
> > > + * Returns:
> > > + * True if GPU SVM has mapping, False otherwise
> > > + */
> > > +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start, u64 end)
> > > +{
> > > +	struct drm_gpusvm_notifier *notifier;
> > > +
> > > +	drm_gpusvm_for_each_notifier(notifier, gpusvm, start, end) {
> > > +		struct drm_gpusvm_range *range = NULL;
> > > +
> > > +		drm_gpusvm_for_each_range(range, notifier, start, end)
> > > +			return true;
> > > +	}
> > > +
> > > +	return false;
> > > +}
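
(For readers skimming: drm_gpusvm_has_mapping() above only answers "is there
any range at all in [start, end)"; the first range found short-circuits the
walk, which is why the inner loop can simply return true.)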
> > > diff --git a/drivers/gpu/drm/xe/drm_gpusvm.h b/drivers/gpu/drm/xe/drm_gpusvm.h
> > > new file mode 100644
> > > index 000000000000..0ea70f8534a8
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/xe/drm_gpusvm.h
> > > @@ -0,0 +1,415 @@
> > > +/* SPDX-License-Identifier: MIT */
> > > +/*
> > > + * Copyright © 2024 Intel Corporation
> > > + */
> > > +
> > > +#ifndef __DRM_GPUSVM_H__
> > > +#define __DRM_GPUSVM_H__
> > > +
> > > +#include <linux/kref.h>
> > > +#include <linux/mmu_notifier.h>
> > > +#include <linux/workqueue.h>
> > > +
> > > +struct dev_pagemap_ops;
> > > +struct drm_device;
> > > +struct drm_gpusvm;
> > > +struct drm_gpusvm_notifier;
> > > +struct drm_gpusvm_ops;
> > > +struct drm_gpusvm_range;
> > > +
> > > +/**
> > > + * struct drm_gpusvm_ops - Operations structure for GPU SVM
> > > + *
> > > + * This structure defines the operations for GPU Shared Virtual Memory (SVM).
> > > + * These operations are provided by the GPU driver to manage SVM ranges and
> > > + * perform operations such as migration between VRAM and system RAM.
> > > + */
> > > +struct drm_gpusvm_ops {
> > > +	/**
> > > +	 * @notifier_alloc: Allocate a GPU SVM notifier (optional)
> > > +	 *
> > > +	 * This function shall allocate a GPU SVM notifier.
> > > +	 *
> > > +	 * Returns:
> > > +	 * Pointer to the allocated GPU SVM notifier on success, NULL on failure.
> > > +	 */
> > > +	struct drm_gpusvm_notifier *(*notifier_alloc)(void);
> > > +
> > > +	/**
> > > +	 * @notifier_free: Free a GPU SVM notifier (optional)
> > > +	 * @notifier: Pointer to the GPU SVM notifier to be freed
> > > +	 *
> > > +	 * This function shall free a GPU SVM notifier.
> > > +	 */
> > > +	void (*notifier_free)(struct drm_gpusvm_notifier *notifier);
> > > +
> > > +	/**
> > > +	 * @range_alloc: Allocate a GPU SVM range (optional)
> > > +	 * @gpusvm: Pointer to the GPU SVM
> > > +	 *
> > > +	 * This function shall allocate a GPU SVM range.
> > > +	 *
> > > +	 * Returns:
> > > +	 * Pointer to the allocated GPU SVM range on success, NULL on failure.
> > > +	 */
> > > +	struct drm_gpusvm_range *(*range_alloc)(struct drm_gpusvm *gpusvm);
> > > +
> > > +	/**
> > > +	 * @range_free: Free a GPU SVM range (optional)
> > > +	 * @range: Pointer to the GPU SVM range to be freed
> > > +	 *
> > > +	 * This function shall free a GPU SVM range.
> > > +	 */
> > > +	void (*range_free)(struct drm_gpusvm_range *range);
> > > +
> > > +	/**
> > > +	 * @vram_release: Release VRAM allocation (optional)
> > > +	 * @vram_allocation: Driver-private pointer to the VRAM allocation
> > > +	 *
> > > +	 * This function shall release VRAM allocation and expects to drop a
> > > +	 * reference to VRAM allocation.
> > > +	 */
> > > +	void (*vram_release)(void *vram_allocation);
> > > +
> > > +	/**
> > > +	 * @populate_vram_pfn: Populate VRAM PFN (required for migration)
> > > +	 * @gpusvm: Pointer to the GPU SVM
> > > +	 * @vram_allocation: Driver-private pointer to the VRAM allocation
> > > +	 * @npages: Number of pages to populate
> > > +	 * @pfn: Array of page frame numbers to populate
> > > +	 *
> > > +	 * This function shall populate VRAM page frame numbers (PFN).
> > > +	 *
> > > +	 * Returns:
> > > +	 * 0 on success, a negative error code on failure.
> > > +	 */
> > > +	int (*populate_vram_pfn)(struct drm_gpusvm *gpusvm,
> > > +				 void *vram_allocation,
> > > +				 unsigned long npages,
> > > +				 unsigned long *pfn);
> > > +
> > > +	/**
> > > +	 * @copy_to_vram: Copy to VRAM (required for migration)
> > > +	 * @gpusvm: Pointer to the GPU SVM
> > > +	 * @pages: Pointer to array of VRAM pages (destination)
> > > +	 * @dma_addr: Pointer to array of DMA addresses (source)
> > > +	 * @npages: Number of pages to copy
> > > +	 *
> > > +	 * This function shall copy pages to VRAM.
> > > +	 *
> > > +	 * Returns:
> > > +	 * 0 on success, a negative error code on failure.
> > > +	 */
> > > +	int (*copy_to_vram)(struct drm_gpusvm *gpusvm,
> > > +			    struct page **pages,
> > > +			    dma_addr_t *dma_addr,
> > > +			    unsigned long npages);
> > > +
> > > +	/**
> > > +	 * @copy_to_sram: Copy to system RAM (required for migration)
> > > +	 * @gpusvm: Pointer to the GPU SVM
> > > +	 * @pages: Pointer to array of VRAM pages (source)
> > > +	 * @dma_addr: Pointer to array of DMA addresses (destination)
> > > +	 * @npages: Number of pages to copy
> > > +	 *
> > > +	 * This function shall copy pages to system RAM.
> > > +	 *
> > > +	 * Returns:
> > > +	 * 0 on success, a negative error code on failure.
> > > +	 */
> > > +	int (*copy_to_sram)(struct drm_gpusvm *gpusvm,
> > > +			    struct page **pages,
> > > +			    dma_addr_t *dma_addr,
> > > +			    unsigned long npages);
> > > +
> > > +	/**
> > > +	 * @invalidate: Invalidate GPU SVM notifier (required)
> > > +	 * @gpusvm: Pointer to the GPU SVM
> > > +	 * @notifier: Pointer to the GPU SVM notifier
> > > +	 * @mmu_range: Pointer to the mmu_notifier_range structure
> > > +	 *
> > > +	 * This function shall invalidate the GPU page tables. It can safely
> > > +	 * walk the notifier range RB tree/list in this function. Called while
> > > +	 * holding the notifier lock.
> > > +	 */
> > > +	void (*invalidate)(struct drm_gpusvm *gpusvm,
> > > +			   struct drm_gpusvm_notifier *notifier,
> > > +			   const struct mmu_notifier_range *mmu_range);
> > > +};
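
To make the required vs. optional split concrete, a driver that wants VRAM
migration would end up wiring up something along these lines (sketch only,
the my_driver_* functions are hypothetical):

        static const struct drm_gpusvm_ops my_driver_gpusvm_ops = {
                .vram_release = my_driver_vram_release,
                .populate_vram_pfn = my_driver_populate_vram_pfn,
                .copy_to_vram = my_driver_copy_to_vram,
                .copy_to_sram = my_driver_copy_to_sram,
                .invalidate = my_driver_invalidate,     /* required in all configs */
        };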
> > > +
> > > +/**
> > > + * struct drm_gpusvm_notifier - Structure representing a GPU SVM notifier
> > > + *
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @notifier: MMU interval notifier
> > > + * @interval: Interval for the notifier
> > > + * @rb: Red-black tree node for the parent GPU SVM structure notifier tree
> > > + * @root: Cached root node of the RB tree containing ranges
> > > + * @range_list: List head of ranges in the same order they appear in the
> > > + *              interval tree. This is useful to keep iterating over ranges
> > > + *              while doing modifications to the RB tree.
> > > + * @flags.removed: Flag indicating whether the MMU interval notifier has been
> > > + *                 removed
> > > + *
> > > + * This structure represents a GPU SVM notifier.
> > > + */
> > > +struct drm_gpusvm_notifier {
> > > +	struct drm_gpusvm *gpusvm;
> > > +	struct mmu_interval_notifier notifier;
> > > +	struct {
> > > +		u64 start;
> > > +		u64 end;
> > > +	} interval;
> > > +	struct {
> > > +		struct rb_node node;
> > > +		struct list_head entry;
> > > +		u64 __subtree_last;
> > > +	} rb;
> > > +	struct rb_root_cached root;
> > > +	struct list_head range_list;
> > > +	struct {
> > > +		u32 removed : 1;
> > > +	} flags;
> > > +};
> > > +
> > > +/**
> > > + * struct drm_gpusvm_range - Structure representing a GPU SVM range
> > > + *
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @notifier: Pointer to the GPU SVM notifier
> > > + * @refcount: Reference count for the range
> > > + * @rb: Red-black tree node for the parent GPU SVM notifier structure range tree
> > > + * @va: Virtual address range
> > > + * @notifier_seq: Notifier sequence number of the range's pages
> > > + * @pages: Pointer to the array of pages (if backing store is in VRAM)
> > > + * @dma_addr: DMA address array (if backing store is SRAM and DMA mapped)
> > > + * @vram_allocation: Driver-private pointer to the VRAM allocation
> > > + * @order: Order of dma mapping. i.e. PAGE_SIZE << order is mapping size
> > > + * @flags.migrate_vram: Flag indicating whether the range can be migrated to VRAM
> > > + * @flags.unmapped: Flag indicating if the range has been unmapped
> > > + * @flags.partial_unmap: Flag indicating if the range has been partially unmapped
> > > + * @flags.has_vram_pages: Flag indicating if the range has vram pages
> > > + * @flags.has_dma_mapping: Flag indicating if the range has a DMA mapping
> > > + * @flags.kfree_mapping: Flag indicating @dma_addr is a compact allocation based
> > > + *                       on @order, which is released via kfree()
> > > + *
> > > + * This structure represents a GPU SVM range used for tracking memory ranges
> > > + * mapped in a DRM device.
> > > + */
> > > +struct drm_gpusvm_range {
> > > +	struct drm_gpusvm *gpusvm;
> > > +	struct drm_gpusvm_notifier *notifier;
> > > +	struct kref refcount;
> > > +	struct {
> > > +		struct rb_node node;
> > > +		struct list_head entry;
> > > +		u64 __subtree_last;
> > > +	} rb;
> > > +	struct {
> > > +		u64 start;
> > > +		u64 end;
> > > +	} va;
> > > +	unsigned long notifier_seq;
> > > +	union {
> > > +		struct page **pages;
> > > +		dma_addr_t *dma_addr;
> > > +	};
> > > +	void *vram_allocation;
> > > +	u16 order;
> > > +	struct {
> > > +		/* All flags below must be set upon creation */
> > > +		u16 migrate_vram : 1;
> > > +		/* All flags below must be set / cleared under notifier lock */
> > > +		u16 unmapped : 1;
> > > +		u16 partial_unmap : 1;
> > > +		u16 has_vram_pages : 1;
> > > +		u16 has_dma_mapping : 1;
> > > +		u16 kfree_mapping : 1;
> > > +	} flags;
> > > +};
> > > +
> > > +/**
> > > + * struct drm_gpusvm - GPU SVM structure
> > > + *
> > > + * @name: Name of the GPU SVM
> > > + * @drm: Pointer to the DRM device structure
> > > + * @mm: Pointer to the mm_struct for the address space
> > > + * @device_private_page_owner: Device private pages owner
> > > + * @mm_start: Start address of GPU SVM
> > > + * @mm_range: Range of the GPU SVM
> > > + * @notifier_size: Size of individual notifiers
> > > + * @ops: Pointer to the operations structure for GPU SVM
> > > + * @chunk_sizes: Pointer to the array of chunk sizes used in range allocation.
> > > + *               Entries should be powers of 2 in descending order.
> > > + * @num_chunks: Number of chunks
> > > + * @notifier_lock: Read-write semaphore for protecting notifier operations
> > > + * @zdd_wq: Workqueue for deferred work on zdd destruction
> > > + * @root: Cached root node of the Red-Black tree containing GPU SVM notifiers
> > > + * @notifier_list: List head of notifiers in the same order they appear in the
> > > + *                 interval tree. This is useful to keep iterating over
> > > + *                 notifiers while doing modifications to the RB tree.
> > > + *
> > > + * This structure represents a GPU SVM (Shared Virtual Memory) used for tracking
> > > + * memory ranges mapped in a DRM (Direct Rendering Manager) device.
> > > + *
> > > + * No reference counting is provided, as this is expected to be embedded in the
> > > + * driver VM structure along with the struct drm_gpuvm, which handles reference
> > > + * counting.
> > > + */
> > > +struct drm_gpusvm {
> > > +	const char *name;
> > > +	struct drm_device *drm;
> > > +	struct mm_struct *mm;
> > > +	void *device_private_page_owner;
> > > +	u64 mm_start;
> > > +	u64 mm_range;
> > > +	u64 notifier_size;
> > > +	const struct drm_gpusvm_ops *ops;
> > > +	const u64 *chunk_sizes;
> > > +	int num_chunks;
> > > +	struct rw_semaphore notifier_lock;
> > > +	struct workqueue_struct *zdd_wq;
> > > +	struct rb_root_cached root;
> > > +	struct list_head notifier_list;
> > > +};
> > > +
> > > +/**
> > > + * struct drm_gpusvm_ctx - DRM GPU SVM context
> > > + *
> > > + * @mmap_locked: mmap lock is locked
> > > + * @trylock_mmap: trylock mmap lock, used to avoid locking inversions
> > > + *                (e.g. dma-resv -> mmap lock)
> > > + * @in_notifier: entering from a MMU notifier
> > > + * @read_only: operating on read-only memory
> > > + * @vram_possible: possible to use VRAM
> > > + * @prefault: prefault pages
> > > + *
> > > + * Context that DRM GPUSVM is operating in (i.e. user arguments).
> > > + */
> > > +struct drm_gpusvm_ctx {
> > > +	u32 mmap_locked :1;
> > > +	u32 trylock_mmap :1;
> > > +	u32 in_notifier :1;
> > > +	u32 read_only :1;
> > > +	u32 vram_possible :1;
> > > +	u32 prefault :1;
> > > +};
> > > +
> > > +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> > > +		    const char *name, struct drm_device *drm,
> > > +		    struct mm_struct *mm, void *device_private_page_owner,
> > > +		    u64 mm_start, u64 mm_range, u64 notifier_size,
> > > +		    const struct drm_gpusvm_ops *ops,
> > > +		    const u64 *chunk_sizes, int num_chunks);
> > > +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm);
> > > +void drm_gpusvm_free(struct drm_gpusvm *gpusvm);
> > > +
> > > +struct drm_gpusvm_range *
> > > +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64 fault_addr,
> > > +				u64 gpuva_start, u64 gpuva_end,
> > > +				const struct drm_gpusvm_ctx *ctx);
> > > +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> > > +			     struct drm_gpusvm_range *range);
> > > +
> > > +struct drm_gpusvm_range *
> > > +drm_gpusvm_range_get(struct drm_gpusvm_range *range);
> > > +void drm_gpusvm_range_put(struct drm_gpusvm_range *range);
> > > +
> > > +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> > > +				  struct drm_gpusvm_range *range);
> > > +
> > > +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> > > +			       struct drm_gpusvm_range *range,
> > > +			       const struct drm_gpusvm_ctx *ctx);
> > > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> > > +				  struct drm_gpusvm_range *range,
> > > +				  const struct drm_gpusvm_ctx *ctx);
> > > +
> > > +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> > > +			       struct drm_gpusvm_range *range,
> > > +			       void *vram_allocation,
> > > +			       const struct drm_gpusvm_ctx *ctx);
> > > +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > > +			       struct drm_gpusvm_range *range,
> > > +			       const struct drm_gpusvm_ctx *ctx);
> > > +
> > > +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void);
> > > +
> > > +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start, u64 end);
> > > +
> > > +struct drm_gpusvm_range *
> > > +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64 start, u64 end);
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_lock - Lock GPU SVM notifier
> > > + * @gpusvm__: Pointer to the GPU SVM structure.
> > > + *
> > > + * Abstract client usage of the GPU SVM notifier lock; take the lock.
> > > + */
> > > +#define drm_gpusvm_notifier_lock(gpusvm__)	\
> > > +	down_read(&(gpusvm__)->notifier_lock)
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_unlock - Unlock GPU SVM notifier
> > > + * @gpusvm__: Pointer to the GPU SVM structure.
> > > + *
> > > + * Abstract client usage of the GPU SVM notifier lock; drop the lock.
> > > + */
> > > +#define drm_gpusvm_notifier_unlock(gpusvm__)	\
> > > +	up_read(&(gpusvm__)->notifier_lock)
> > > +
> > > +/**
> > > + * __drm_gpusvm_range_next - Get the next GPU SVM range in the list
> > > + * @range: a pointer to the current GPU SVM range
> > > + *
> > > + * Return: A pointer to the next drm_gpusvm_range if available, or NULL if the
> > > + *         current range is the last one or if the input range is NULL.
> > > + */
> > > +static inline struct drm_gpusvm_range *
> > > +__drm_gpusvm_range_next(struct drm_gpusvm_range *range)
> > > +{
> > > +	if (range && !list_is_last(&range->rb.entry,
> > > +				   &range->notifier->range_list))
> > > +		return list_next_entry(range, rb.entry);
> > > +
> > > +	return NULL;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_for_each_range - Iterate over GPU SVM ranges in a notifier
> > > + * @range__: Iterator variable for the ranges. If set, it indicates the start of
> > > + *	     the iterator. If NULL, call drm_gpusvm_range_find() to get the range.
> > > + * @notifier__: Pointer to the GPU SVM notifier
> > > + * @start__: Start address of the range
> > > + * @end__: End address of the range
> > > + *
> > > + * This macro is used to iterate over GPU SVM ranges in a notifier. It is safe
> > > + * to use while holding the driver SVM lock or the notifier lock.
> > > + */
> > > +#define drm_gpusvm_for_each_range(range__, notifier__, start__, end__)	\
> > > +	for ((range__) = (range__) ?:					\
> > > +	     drm_gpusvm_range_find((notifier__), (start__), (end__));	\
> > > +	     (range__) && ((range__)->va.start < (end__));		\
> > > +	     (range__) = __drm_gpusvm_range_next(range__))
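
For illustration (not part of the patch), a typical walk under the notifier
lock, assuming drm_gpusvm_range_pages_valid() expects that lock to be held;
the my_driver_* name is invented:

        static bool my_driver_ranges_valid(struct drm_gpusvm *gpusvm,
                                           struct drm_gpusvm_notifier *notifier,
                                           u64 start, u64 end)
        {
                struct drm_gpusvm_range *range = NULL;
                bool valid = true;

                drm_gpusvm_notifier_lock(gpusvm);
                drm_gpusvm_for_each_range(range, notifier, start, end)
                        valid &= drm_gpusvm_range_pages_valid(gpusvm, range);
                drm_gpusvm_notifier_unlock(gpusvm);

                return valid;
        }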
> > > +
> > > +/**
> > > + * drm_gpusvm_range_set_unmapped - Mark a GPU SVM range as unmapped
> > > + * @range: Pointer to the GPU SVM range structure.
> > > + * @mmu_range: Pointer to the MMU notifier range structure.
> > > + *
> > > + * This function marks a GPU SVM range as unmapped and sets the partial_unmap flag
> > > + * if the range partially falls within the provided MMU notifier range.
> > > + */
> > > +static inline void
> > > +drm_gpusvm_range_set_unmapped(struct drm_gpusvm_range *range,
> > > +			      const struct mmu_notifier_range *mmu_range)
> > > +{
> > > +	lockdep_assert_held_write(&range->gpusvm->notifier_lock);
> > > +
> > > +	range->flags.unmapped = true;
> > > +	if (range->va.start < mmu_range->start ||
> > > +	    range->va.end > mmu_range->end)
> > > +		range->flags.partial_unmap = true;
> > > +}
> > > +
> > > +#endif /* __DRM_GPUSVM_H__ */
> > > -- 
> > > 2.34.1
> > > 
> > 
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory
  2024-08-29 16:49     ` Matthew Brost
@ 2024-09-02 11:40       ` Daniel Vetter
  0 siblings, 0 replies; 100+ messages in thread
From: Daniel Vetter @ 2024-09-02 11:40 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Daniel Vetter, intel-xe, dri-devel, airlied, christian.koenig,
	thomas.hellstrom, matthew.auld, daniel

On Thu, Aug 29, 2024 at 04:49:15PM +0000, Matthew Brost wrote:
> On Wed, Aug 28, 2024 at 08:50:02PM +0200, Daniel Vetter wrote:
> > On Tue, Aug 27, 2024 at 07:48:38PM -0700, Matthew Brost wrote:
> > > +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > > +			       struct drm_gpusvm_range *range,
> > > +			       const struct drm_gpusvm_ctx *ctx)
> > > +{
> > > +	u64 start = range->va.start, end = range->va.end;
> > > +	struct mm_struct *mm = gpusvm->mm;
> > > +	struct vm_area_struct *vas;
> > > +	int err;
> > > +	bool retry = false;
> > > +
> > > +	if (!ctx->mmap_locked) {
> > > +		if (!mmget_not_zero(mm)) {
> > > +			err = -EFAULT;
> > > +			goto err_out;
> > > +		}
> > > +		if (ctx->trylock_mmap) {
> > > +			if (!mmap_read_trylock(mm))  {
> > > +				err = drm_gpusvm_evict_to_sram(gpusvm, range);
> > > +				goto err_mmput;
> > > +			}
> > > +		} else {
> > > +			mmap_read_lock(mm);
> > > +		}
> > > +	}
> > > +
> > > +	mmap_assert_locked(mm);
> > > +
> > > +	/*
> > > +	 * Loop required to find all VMA area structs for the corner case when
> > > +	 * VRAM backing has been partially unmapped from MM's address space.
> > > +	 */
> > > +again:
> > > +	vas = find_vma(mm, start);
> > 
> > So a hilarious case that amdkfd handles a bit better, but still not
> > entirely, is that the original vma might be entirely gone, even when you
> > can still get at the mm of that process. This happens with cow (or shared
> > too I think) mappings in forked child processes, or also if you play fun
> > mremap games.
> > 
> > I think that outside of the ->migrate_to_ram callback migration/eviction
> > to sram cannot assume there's any reasonable vma around and has to
> > unconditionally go with the drm_gpusvm_evict_to_sram path.
> > 
> 
> See my response here [1]. Let me drop the whole trylock thing and
> convert to an 'evict' flag which calls drm_gpusvm_evict_to_sram in
> places where Xe needs to evict VRAM. Or maybe just export that function
> and call it directly. That way the only place the VMA is looked up for
> SRAM -> VRAM is upon CPU page fault.

Yeah I think a dedicated path for migrate_to_ram hook that goes directly
into your evict_to_sram path is the design-clean approach here imo.
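
Roughly the shape I have in mind, as a very hand-wavy sketch (assumes the
evict helper gets exported, glosses over fault_page handling and partial
migration, names illustrative):

        static vm_fault_t my_driver_migrate_to_ram(struct vm_fault *vmf)
        {
                struct drm_gpusvm_zdd *zdd = vmf->page->zone_device_data;
                int err;

                /*
                 * No vma lookup, no mmap lock games: go straight to the
                 * device-side eviction path, which only needs the physical
                 * storage tracked via the zdd.
                 */
                err = drm_gpusvm_evict_to_sram(zdd->range->gpusvm, zdd->range);

                return err ? VM_FAULT_SIGBUS : 0;
        }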

> [1] https://patchwork.freedesktop.org/patch/610955/?series=137870&rev=1#comment_1111164
> 
> > Also in the migrate_to_ram case the vma is essentially nothing more than
> > informational about which ranges we might need if we prefault a bit (in
> > case the child changed the vma compared to the original one). So it's good
> > as a parameter for migrate_vma_setup, but absolutely nothing else.
> > 
> > amdkfd almost gets this right by being entirely based on their svm_range
> > structures, except they still have the lingering check that the original mm
> > is still alive. Of course you cannot ever use that memory on the gpu
> > anymore, but the child process could get very pissed if their memory is
> > suddenly gone. Also the eviction code has the same issue as yours and
> > limits itself to vma that still exist in the original mm, leaving anything
> > that's orphaned in children or remaps stuck in vram. At least that's my
> > understanding, I might very well be wrong.
> > 
> > So probably want a bunch of these testcases too to make sure that all
> > works, and we're not stuck with memory allocations in vram that we can't
> > move out.
> 
> When writing some additional test cases, let me add hooks in my IGTs to
> be able to verify we are not orphaning VRAM too.

So maybe apply caution, I'm honestly not sure whether core mm makes any
guarantees about not orphaning stuff, at least for a little bit.

Over the w/e my brain tossed me the "so are we sure we can tear down our
zone_device data, the page array specifically" brain teaser. And I think
the answer is that we have to wait until all page references disappear,
which might take a long time. Core mm makes no guarantee about elevated
page references disappearing in a timely manner, at least as far as I
know. Which is also why migration is a best effort thing only.

Cheers, Sima
-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory
  2024-08-29 17:27     ` Matthew Brost
@ 2024-09-02 11:53       ` Daniel Vetter
  2024-09-02 17:03         ` Matthew Brost
  0 siblings, 1 reply; 100+ messages in thread
From: Daniel Vetter @ 2024-09-02 11:53 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Daniel Vetter, intel-xe, dri-devel, airlied, christian.koenig,
	thomas.hellstrom, matthew.auld, daniel

On Thu, Aug 29, 2024 at 05:27:13PM +0000, Matthew Brost wrote:
> On Thu, Aug 29, 2024 at 11:45:08AM +0200, Daniel Vetter wrote:
> > On Tue, Aug 27, 2024 at 07:48:38PM -0700, Matthew Brost wrote:
> > > This patch introduces support for GPU Shared Virtual Memory (SVM) in the
> > > Direct Rendering Manager (DRM) subsystem. SVM allows for seamless
> > > sharing of memory between the CPU and GPU, enhancing performance and
> > > flexibility in GPU computing tasks.
> > > 
> > > The patch adds the necessary infrastructure for SVM, including data
> > > structures and functions for managing SVM ranges and notifiers. It also
> > > provides mechanisms for allocating, deallocating, and migrating memory
> > > regions between system RAM and GPU VRAM.
> > > 
> > > This mid-layer is largely inspired by GPUVM.
> > > 
> > > Cc: Dave Airlie <airlied@redhat.com>
> > > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > Cc: Christian König <christian.koenig@amd.com>
> > > Cc: <dri-devel@lists.freedesktop.org>
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > 
> > Still not sure I've got the right race that you paper over with
> > mmap_write_lock, but I spotted a few things, comments inline.
> > 
> 
> I've replied to this issue several times, let's table the
> mmap_write_lock issue in this reply - a lot of other things to get
> through. Current thinking is try to add a range->migrate_lock like AMD
> which I state here [1]. Let's continue discussing the mmap lock issue
> there if possible.

Yeah I wrote replies as I read code, so there's a bit of a mess from my side
here. Apologies for that.

> [1] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1111169

Some more replies below that I think we haven't covered anywhere else yet.

> > > + * 2) Garbage Collector.
> > > + *
> > > + *	void __driver_garbage_collector(struct drm_gpusvm *gpusvm,
> > > + *					struct drm_gpusvm_range *range)
> > > + *	{
> > > + *		struct drm_gpusvm_ctx ctx = {};
> > > + *
> > > + *		assert_driver_svm_locked(gpusvm);
> > > + *
> > > + *		// Partial unmap, migrate any remaining VRAM pages back to SRAM
> > > + *		if (range->flags.partial_unmap)
> > > + *			drm_gpusvm_migrate_to_sram(gpusvm, range, &ctx);
> > 
> > Note that the migration back to sram isn't guaranteed to succeed, so you
> > might still be stuck with a partially migrated range. This might be a case
> > where hmm gives you vram pfns, but the range you have doesn't have any
> > vram allocation anymore because you dropped it here. Not sure tbh.
> >
> 
> Hmm, the mm isn't in the picture here, nor will a VMA be, once the
> drm_gpusvm_evict_to_sram path is always taken as discussed here [2]. I
> might have a corner case BO refcounting / TTM resource lookup bug
> somewhere in here which needs to be resolved though (e.g. eviction
> racing with this code path), will try to close on that.
> 
> [2] https://patchwork.freedesktop.org/patch/610955/?series=137870&rev=1#comment_1111164

So maybe my understanding is wrong, but from my reading of the device
migration code the exact same non-guarantees as for the sram2sram
migration code apply:

- There's no guarantee the page/folio doesn't have an elevated refcount,
  which makes the migration fail (in try_to_migrate, where it checks for
  surplus refcounts).

- There's no guarantee you'll get the page/folio lock, which makes the
  migration fail. Worse, the core mm seems to use a fallback to per-page
  locking as its extremely crude "get out of deadlocks due to acquiring
  multiple page locks" card.

> > > +map_pages:
> > > +	if (is_device_private_page(hmm_pfn_to_page(pfns[0]))) {
> > > +		WARN_ON_ONCE(!range->vram_allocation);
> > > +
> > > +		for (i = 0; i < npages; ++i) {
> > > +			pages[i] = hmm_pfn_to_page(pfns[i]);
> > > +
> > > +			if (WARN_ON_ONCE(!is_device_private_page(pages[i]))) {
> > > +				err = -EOPNOTSUPP;
> > > +				goto err_free;
> > > +			}
> > > +		}
> > 
> > You can't do the above, because the pfn you get from hmm come with zero
> > guarantees, you neither hold a page reference nor the page lock. The only
> > thing you can do is grab the pagetable lock (or mmu notifier locks) and
> > check it's still valid, before you can touch any state. I think the
> > range->vram_allocation is probably always valid since you clean that up
> > under the same lock/thread, but there's good chances the vram allocation
> > is otherwise already gone for good. Or you get an inconsistent snapshot.
> > 
> 
> I haven't seen this pop in my testing yet, which is fairly thorough. My
> thinking was that, with migration always being enforced at range
> granularity, we'd never get mixed mappings from the core as migration is
> completely under control of the driver. Maybe I'm not understanding what
> you are saying here...

So one scenario is that you race (without the mmap write lock or the
migration_mutex design ofc) with another invalidate, and get a partial
view here of mixed vram and sram pages. Until you acquire the mmu notifier
lock and have made sure your pages are still valid, there's essentially no
guarantee.
> 
> > > +
> > > +		/* Do not race with notifier unmapping pages */
> > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > +		range->flags.has_vram_pages = true;
> > > +		range->pages = pages;
> > > +		if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
> > > +			err = -EAGAIN;
> > > +			__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > > +		}
> > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > +	} else {
> > > +		dma_addr_t *dma_addr = (dma_addr_t *)pfns;
> > > +
> > > +		for_each_dma_page(i, j, npages, order) {
> > > +			if (WARN_ON_ONCE(i && order !=
> > > +					 hmm_pfn_to_map_order(pfns[i]))) {
> > > +				err = -EOPNOTSUPP;
> > > +				npages = i;
> > > +				goto err_unmap;
> > > +			}
> > > +			order = hmm_pfn_to_map_order(pfns[i]);
> > > +
> > > +			pages[j] = hmm_pfn_to_page(pfns[i]);
> > > +			if (WARN_ON_ONCE(is_zone_device_page(pages[j]))) {
> > > +				err = -EOPNOTSUPP;
> > > +				npages = i;
> > > +				goto err_unmap;
> > > +			}
> > > +
> > > +			set_page_dirty_lock(pages[j]);
> > > +			mark_page_accessed(pages[j]);
> > 
> > You can't do these, because you don't hold a page reference. They're also
> > not needed because hmm_range_fault goes thorugh the full mkwrite dance,
> > which takes care of these, unlike the gup family of functions.
> >
> 
> This is a leftover from our existing userptr code and it does appear to
> be incorrect. Let me remove this and fixup our userptr code while I'm at
> it.

Ack.

> > > +	vas = vma_lookup(mm, start);
> > > +	if (!vas) {
> > > +		err = -ENOENT;
> > > +		goto err_mmunlock;
> > > +	}
> > > +
> > > +	if (end > vas->vm_end || start < vas->vm_start) {
> > > +		err = -EINVAL;
> > > +		goto err_mmunlock;
> > > +	}
> > > +
> > > +	if (!vma_is_anonymous(vas)) {
> > > +		err = -EBUSY;
> > > +		goto err_mmunlock;
> > > +	}
> > > +
> > > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) + sizeof(*dma_addr) +
> > > +		       sizeof(*pages), GFP_KERNEL);
> > > +	if (!buf) {
> > > +		err = -ENOMEM;
> > > +		goto err_mmunlock;
> > > +	}
> > > +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > > +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr)) * npages;
> > > +
> > > +	zdd = drm_gpusvm_zdd_alloc(range);
> > > +	if (!zdd) {
> > > +		err = -ENOMEM;
> > > +		goto err_free;
> > > +	}
> > > +
> > > +	migrate.vma = vas;
> > > +	migrate.src = buf;
> > > +	migrate.dst = migrate.src + npages;
> > > +
> > > +	err = migrate_vma_setup(&migrate);
> > > +	if (err)
> > > +		goto err_free;
> > > +
> > > +	/*
> > > +	 * FIXME: Below cases, !migrate.cpages and migrate.cpages != npages, not
> > > +	 * always an error. Need to revisit possible cases and how to handle. We
> > > +	 * could prefault on migrate.cpages != npages via hmm_range_fault.
> 
> This is a bit stale, can update this comment.
> 
> > > +	 */
> > 
> > Yeah I think especially under contention partial migrations, at least back
> > to sram due to cpu faults, are pretty much expected. And you need to cope
> > somehow.
> > 
> 
> I have seen these pop if the IGT calls mlock on the memory. My thinking
> is that migration to VRAM is basically optional and we fall back to
> leaving the range in SRAM if an error occurs rather than doing a partial
> migration. This is what currently happens, so it is coped with.
> 
> If the memory is marked as must be in VRAM (NIY), well then the user
> program has done something wrong and can kill the app (akin to
> segfault).

Yeah SIGBUS for "must be in VRAM" sounds like ok semantics.

> > > +
> > > +	if (!migrate.cpages) {
> > > +		err = -EFAULT;
> > > +		goto err_free;
> > > +	}
> > > +
> > > +	if (migrate.cpages != npages) {
> > > +		err = -EBUSY;
> > > +		goto err_finalize;
> > > +	}

What I think is more fundamental is that I think this one here doesn't
work. For migrate_to_ram you cannot assume that you can always migrate the
entire block, I think to uphold the core mm forward progress rules we need
to allow partial migrations there. And I think your current code allows
that.

But that then means you also are stuck with partial migration state here.
That was the point I tried to make.

> > > +/**
> > > + * __drm_gpusvm_migrate_to_sram - Migrate GPU SVM range to SRAM (internal)
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @vas: Pointer to the VM area structure
> > > + * @page: Pointer to the page for fault handling (can be NULL)
> > > + * @start: Start address of the migration range
> > > + * @end: End address of the migration range
> > > + *
> > > + * This internal function performs the migration of the specified GPU SVM range
> > > + * to SRAM. It sets up the migration, populates + dma maps SRAM PFNs, and
> > > + * invokes the driver-specific operations for migration to SRAM.
> > > + *
> > > + * Returns:
> > > + * 0 on success, negative error code on failure.
> > > + */
> > > +static int __drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > > +					struct vm_area_struct *vas,
> > > +					struct page *page,
> > > +					u64 start, u64 end)
> > > +{
> > > +	struct migrate_vma migrate = {
> > > +		.vma		= vas,
> > > +		.pgmap_owner	= gpusvm->device_private_page_owner,
> > > +		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> > > +		.fault_page	= page,
> > > +	};
> > > +	unsigned long npages;
> > > +	struct page **pages;
> > > +	dma_addr_t *dma_addr;
> > > +	void *buf;
> > > +	int i, err = 0;
> > > +
> > > +	mmap_assert_locked(gpusvm->mm);
> > 
> > That's the wrong mm, at least for the ->migrate_to_ram path. You might be
> > called on an anon mapping from a child process. That also means that the
> > vma you're looking at might have no relationship with anything you're
> > tracking in your gpusvm.
> >
> 
> Hmm, as discussed [3] I haven't added tests with child processes yet.
> Let me do that and update the design as needed. This likely isn't
> correct as you say.
> 
> [3] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1111169 

Ack. More tests should definitely help here to figure out what's up, and
what's just me being confused.

> > > +/**
> > > + * drm_gpusvm_migrate_to_ram - Migrate GPU SVM range to RAM (page fault handler)
> > > + * @vmf: Pointer to the fault information structure
> > > + *
> > > + * This function is a page fault handler used to migrate a GPU SVM range to RAM.
> > > + * It retrieves the GPU SVM range information from the faulting page and invokes
> > > + * the internal migration function to migrate the range back to RAM.
> > > + *
> > > + * Returns:
> > > + * VM_FAULT_SIGBUS on failure, 0 on success.
> > > + */
> > > +static vm_fault_t drm_gpusvm_migrate_to_ram(struct vm_fault *vmf)
> > > +{
> > > +	struct drm_gpusvm_zdd *zdd = vmf->page->zone_device_data;
> > > +	int err;
> > > +
> > > +	err = __drm_gpusvm_migrate_to_sram(zdd->range->gpusvm,
> > 
> > So I think zdd->range doesn't work, because even within a single mm the
> > vma mapping a given piece of anon memory does not need to be unique, you
> > can duplicate them with mremap.
> > 
> 
> This is attached to a page, not a VMA. Both AMD and Nvidia drivers use a
> similar lookup mechanism.

Yeah the page->zone_device_data is fine. It's the zone_device_data->range
which I think isn't ok.

> > So all you have here is the physical memory and the vma, which might or
> > might not be from the same process as gpusvm->mm.
> > 
> > Also the child process scenario means that using mmap_write on the fault
> > side doesn't stop all cpu faults migrating stuff back.
> > 
> > Somewhat aside, but I think that means amdkfd's svm_range->migration_mutex
> > is busted, because it's va based and so misses concurrently ongoing
> > different mappings moving physical storage around underneath.
> >
> 
> I think all of the above falls into the fork() + child process
> issues which you have raised. Until I test this out I can't speak to this
> with any level of confidence, so I won't. Thanks for raising this issue and
> let me write test cases as discussed and educate myself. Once I do that,
> we can engage in further discussions.

I think fork + children will still result in zdd->range being unique (albeit
confused about which mm). You need mremap of some of these mappings to
change the addresses and really cause confusion, which I /think/ (but
didn't test) is doable with a single process even and duplicating anon
memory mappings with mremap.
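
For the fork side at least, the scenario is easy enough to sketch in
userspace (illustrative only, the GPU submission step is driver-specific and
omitted here):

        #include <stdlib.h>
        #include <string.h>
        #include <sys/types.h>
        #include <sys/wait.h>
        #include <unistd.h>

        int main(void)
        {
                size_t sz = 2 << 20;
                char *buf = malloc(sz);
                pid_t pid;

                memset(buf, 0xaa, sz);  /* fault in anon pages */
                /* ... submit GPU work on buf so the range migrates to VRAM ... */

                pid = fork();
                if (pid == 0) {
                        /*
                         * Child: assuming the migration happened, CPU access
                         * hits device-private entries inherited from the
                         * parent, so ->migrate_to_ram() runs against an mm
                         * that is not the gpusvm->mm.
                         */
                        memset(buf, 0x55, sz);
                        _exit(0);
                }
                waitpid(pid, NULL, 0);
                free(buf);
                return 0;
        }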

Cheers, Sima
-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory
  2024-08-30  9:16   ` Thomas Hellström
@ 2024-09-02 12:20     ` Daniel Vetter
  0 siblings, 0 replies; 100+ messages in thread
From: Daniel Vetter @ 2024-09-02 12:20 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: Matthew Brost, intel-xe, dri-devel, airlied, christian.koenig,
	matthew.auld, daniel

On Fri, Aug 30, 2024 at 11:16:53AM +0200, Thomas Hellström wrote:
> Hi, Matthew
> 
> On Tue, 2024-08-27 at 19:48 -0700, Matthew Brost wrote:
> > +/**
> > + * DOC: Overview
> > + *
> > + * GPU Shared Virtual Memory (GPU SVM) layer for the Direct Rendering Manager (DRM)
> > + *
> > + * The GPU SVM layer is a component of the DRM framework designed to manage shared
> > + * virtual memory between the CPU and GPU. It enables efficient data exchange and
> > + * processing for GPU-accelerated applications by allowing memory sharing and
> > + * synchronization between the CPU's and GPU's virtual address spaces.
> > + *
> > + * Key GPU SVM Components:
> > + * - Notifiers: Used for tracking memory intervals and notifying the
> > + *		GPU of changes, notifiers are sized based on a GPU SVM
> > + *		initialization parameter, with a recommendation of 512M or
> > + *		larger. They maintain a Red-Black tree and a list of ranges that
> > + *		fall within the notifier interval. Notifiers are tracked within
> > + *		a GPU SVM Red-Black tree and list and are dynamically inserted
> > + *		or removed as ranges within the interval are created or
> > + *		destroyed.
> > + * - Ranges: Represent memory ranges mapped in a DRM device and managed
> > + *	     by GPU SVM. They are sized based on an array of chunk sizes, which
> > + *	     is a GPU SVM initialization parameter, and the CPU address space.
> > + *	     Upon GPU fault, the largest aligned chunk that fits within the
> > + *	     faulting CPU address space is chosen for the range size. Ranges are
> > + *	     expected to be dynamically allocated on GPU fault and removed on an
> > + *	     MMU notifier UNMAP event. As mentioned above, ranges are tracked in
> > + *	     a notifier's Red-Black tree.
> > + * - Operations: Define the interface for driver-specific SVM operations such as
> > + *		 allocation, page collection, migration, invalidations, and VRAM
> > + *		 release.
> > + *
> 
> Another question, since ranges, as I understand it, are per gpuvm and
> per cpu mm, whereas migration is per device and per cpu_mm, (we might
> have multiple gpuvms mapping the same cpu_mm), I figure the gpu_svm is
> per gpuvm, but that makes migration currently inconsistent, right?

I think anything that tracks va must be 1:1 tied to the single specific
cpu mm that we use for hmm/svm. So I think that's ok.

There's a pile of paths where that 1:1 mapping doesn't capture the entire
picture, but I think there the right choice is to just completely ignore
any cpu/gpu mm/vma stuff and de facto rely on the core mm rmap
datastructure to make sure we find them all (e.g. to update/invalidate
ptes during migration).
-Sima
-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory
  2024-08-29 20:56         ` Matthew Brost
  2024-08-30  8:18           ` Thomas Hellström
  2024-08-30  9:57           ` Thomas Hellström
@ 2024-09-02 12:33           ` Daniel Vetter
  2024-09-04 12:27             ` Thomas Hellström
  2 siblings, 1 reply; 100+ messages in thread
From: Daniel Vetter @ 2024-09-02 12:33 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Thomas Hellström, intel-xe, dri-devel, airlied,
	christian.koenig, matthew.auld, daniel

Jumping in here in the middle, since I think it's a solid place to drop my
idea of "align with core mm" gpusvm locking ...

On Thu, Aug 29, 2024 at 08:56:23PM +0000, Matthew Brost wrote:
> On Thu, Aug 29, 2024 at 09:18:29PM +0200, Thomas Hellström wrote:
> Issues with removing a SVM range:
> 
> - Xe bind code stores invalidation / present state in VMA, this would
>   need to be moved to the radix tree. I have a Jira open for that work
>   which I believe other developers are going to own.
> - Where would the dma mapping / device pages be stored?
> 	- In the radix tree? What if ATS is enabled? We don't have a
> 	  driver owned radix tree. How do we reasonably connect a driver
> 	  owned radix to a common GPUSVM layer?

Yeah this one is really annoying, because the core mm gets away with
storing nothing extra since it can just put the pfn in the pte. And it doesn't need
anything else. So we probably still need something, unfortunately ...

> 	- In the notifier? What if the notifier is sparsely populated?
> 	  We would be wasting huge amounts of memory. What if the
> 	  notifier is configured to span the entire virtual address
> 	  space?

So if we go with the radix idea, we could model the radix to exactly match
the gpu pagetables. That's essentially what the core mm does. Then each
pagetable at each level has a spinlock for essentially a range lock.
notifier seqno would be stored into each pagetable (not the individual
entries, that's probably too much), which should allow us to very
efficiently check whether an entire arbitrary va range is still valid on
the fault side.

On the notifier side we can also very efficiently walk arbitrary ranges,
because the locking is really fine-grained and in an adaptive way.

> - How does the garbage collector work? We can't allocate memory in the
>   notifier so we don't have anything to add to the garbage collector. We
>   can't directly modify page tables given the locks you need are in the path of
>   reclaim.

Probably no more garbage collector, you deal with pages/folios like the
core mm expects.

> - How do we deal with fault storms (e.g. tons of faults hitting the same
>   SVM range in a row)? Without an SVM range there is no way to know if a mapping
>   is valid and the GPU page fault handler can be short-circuited.

So the core mm sorts this out by allowing faults to be handled in
parallel, without any lock. Essentially:
- you get a fault (or prefault)
- you hold just enough read locks to make sure stuff doesn't disappear.
  Currently that's mmap_read_lock, but strictly speaking we only need the
  new-ish per-vma lock.
- you allocate memory, dma_map, everything else you need
- you grab that very fine-grained radix tree lock (pagetable locks on the
  cpu side) and recheck whether you've raced: mmu notifier seqno and the
  pte must still be non-present. If that check fails, you bail out and
  release all the vram/dma_maps you've created.
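
In rough pseudo-C that sequence would look something like the sketch
below; struct svm_pt_node and the svm_* helpers are made-up names, only
the mmap/spin locks and mmu_interval_read_begin()/_retry() are real
interfaces:

	/* Hypothetical sketch, not a real API. */
	int svm_handle_fault(struct svm_pt_node *node, unsigned long addr)
	{
		unsigned long seq = mmu_interval_read_begin(&node->notifier);
		struct page *page;

		/* only read locks, so faults can be handled in parallel */
		mmap_read_lock(node->mm);
		page = svm_alloc_and_dma_map(node, addr);
		mmap_read_unlock(node->mm);
		if (IS_ERR(page))
			return PTR_ERR(page);

		/* fine-grained per-pagetable lock, like the cpu pagetable lock */
		spin_lock(&node->lock);
		if (mmu_interval_read_retry(&node->notifier, seq) ||
		    svm_pte_present(node, addr)) {
			/* raced with an invalidation or another fault: bail */
			spin_unlock(&node->lock);
			svm_release_page(node, page);
			return -EAGAIN;
		}
		svm_install_pte(node, addr, page);
		spin_unlock(&node->lock);

		return 0;
	}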

> - Do we have notifier seqno for every PTE?

I think per-pagetable, so every node in the radix tree, would make sense.
If we go with also one lock per pagetable like the cpu mm then tracking
notifier seqno to match makes the most sense imo.
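
Shape-wise, something like this (hypothetical struct, sizes made up):

	/* One node per gpu pagetable, mirroring the gpu pagetable layout. */
	struct svm_pt_node {
		spinlock_t lock;		/* range lock for this subtree */
		unsigned long notifier_seq;	/* seqno at the last (re)fill */
		void *entries[512];		/* child nodes or leaf ptes */
	};

Checking whether an arbitrary va range is still valid then just means
walking the nodes covering that range and comparing their notifier_seq
against the current seqno, without any big lock.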

Again, this is entirely aside from the discussion in this subthread about
understanding the current design and tradeoffs/reasons. Just figured this
is a good spot to drop this.
-Sima
-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 23/28] drm/xe: Add SVM VRAM migration
  2024-08-29 22:12           ` Matthew Brost
  2024-08-29 22:23             ` Matthew Brost
  2024-09-02 11:01             ` Christian König
@ 2024-09-02 12:48             ` Daniel Vetter
  2024-09-02 22:20               ` Matthew Brost
  2 siblings, 1 reply; 100+ messages in thread
From: Daniel Vetter @ 2024-09-02 12:48 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Daniel Vetter, Thomas Hellström, Christian König,
	intel-xe, dri-devel, airlied, matthew.auld, daniel,
	Paneer Selvam, Arunpravin

On Thu, Aug 29, 2024 at 10:12:53PM +0000, Matthew Brost wrote:
> On Thu, Aug 29, 2024 at 01:02:54PM +0200, Daniel Vetter wrote:
> > On Thu, Aug 29, 2024 at 11:53:58AM +0200, Thomas Hellström wrote:
> > > But as Sima pointed out in private communication, exhaustive eviction
> > > is not really needed for faulting to make (crawling) progress.
> > > Watermarks and VRAM trylock shrinking should suffice, since we're
> > > strictly only required to service a single gpu page granule at a time.
> > > 
> > > However, ordinary bo-based jobs would still like to be able to
> > > completely evict SVM vram. Whether that is important enough to strive
> > > for is ofc up for discussion.
> > 
> > My take is that you don't win anything for exhaustive eviction by having
> > the dma_resv somewhere in there for svm allocations. Roughly for split lru
> > world, where svm ignores bo/dma_resv:
> > 
> > When evicting vram from the ttm side we'll fairly switch between selecting
> > bo and throwing out svm pages. With drm_exec/ww_acquire_ctx selecting bo
> > will eventually succeed in vacuuming up everything (with a few retries
> > perhaps, if we're not yet at the head of the ww ticket queue).
> > 
> > svm pages we need to try to evict anyway - there's no guarantee, because
> > the core mm might be holding temporary page references (which block
> 
> Yea, but I think you could kill the app then - not suggesting we
> should but could. To me this is akin to a CPU fault and not being able
> to migrate the device pages - the migration layer doc says when this
> happens kick this to user space and segfault the app.
> 
> My last patch in the series adds some asserts to see if this ever
> happens, thus far never. If it occurs we could gracefully handle it by
> aborting the migration I guess... I think the user really needs to do
> something a bit crazy to trigger this condition - I don't think the core
> randomly grabs refs to device pages but could be wrong.

I think it does :-/

If you read do_swap_page around ->migrate_to_ram:


	get_page(vmf->page);
	pte_unmap_unlock(vmf->pte, vmf->ptl);
	ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
	put_page(vmf->page);

Also the migrate code itself does lock pages. So unless we toss in
additional locking on top of what core mm does (which I think should be
enough to cover migration), migrations will temporarily fail. And this is
just for multiple threads trying to get the same page back to sram, which
I think is a case we should support because the application did nothing
wrong.

> > migration) or have the page locked (which also block the migration). But
> > as long as those two steps succeed, we'll win and get the pages. There
> > might be some thrashing against concurrent svm faults stealing them again,
> > but they have a disadvantage since they can't steal dma_resv_locked bo.
> > And if it's still too much we can stall them in the page allocator.
> > 
> > So it's not entirely reliable, but should be close enough.
> > 
> > Now for bo based svm the picture isn't any different, because holding
> > dma_resv is not actually enough to migrate svm mappings. We still need to
> > hope there's no temporary page references around, and we still need to
> > succeed at locking the page. And the migration code only does trylocks,
> > because that's its deadlock prevention algorithm when different migrations
> > need the same set of pages, but acquire them in a different order. So
> > we win nothing.
> 
> Ok, maybe my statement above is false...
> 
> Wouldn't the only time this fails be if another migration is in
> flight (e.g. CPU fault) and they race? Then the eviction will naturally
> happen via refcount being dropped from the other migration. I guess I
> likely need to update my eviction code to not free the TTM resource if
> all pages are not migrated.

Yeah. And additionally core mm relies on some amount of Good Luck here,
plus the assumption that at least falling back to a single page/folio will
work out. At least eventually ...

The trouble is if your design assumes you can migrate an entire block,
because then if threads hammer that range in different orders you'll never
make forward progress. Because the core mm code doesn't have a fancy ww
locking scheme to get out of this, but only uses trylock, plus the
assumption that falling back to a single page will work out eventually.

Wrt TTM resource refcounting, I think that all looks ok. But maybe I
checked the wrong things.

> > Worse, if dma_resv does actually hold up svm migration and reclaim, then
> > we potentially deadlock because that lock is for a bigger range than
> > individual pages (or folios). And the core mm assumes that it can get out
> > of a deadlock bind by (at least stochastically) eventually succeeding in
> > acquiring/locking down a single page.
> > 
> > This means we cannot use dma_resv tricks to give the ttm world an
> > advantage in exhaustive eviction against concurrent svm faults. Or at
> > least not more than we can do without by just stalling svm faults that
> > need to allocate gpu memory (but that must happen without holding locks or
> > we're busted).
> > 
> 
> I'm a little lost here on the deadlock case. Do you mean when we try to
> evict SVM BO we trigger reclaim by allocating system pages and can
> deadlock? Doesn't TTM already have this dependency when evicting non-SVM
> BOs?

So you can have multiple cpu threads hammering a given svm range. And
thanks to the lols of mremap and fork each of them can have a different
view of that range (they are all obviously different processes from the
one that has created the gpusvm binding). And if you try to migrate, they
might all grab the pages in different orders, which can deadlock.

That's why there's so much retrying and also why core mm only does trylock
on pages if it grabs an entire pile.

Now if you have a lock that nests within the page lock you need to trylock
it, or it deadlocks. Which kinda defeats the point of having a bigger lock
and moving the entire bo as a unit.

But if that is outside of the page lock (like amdgpu), you still have the
issue of the elevated page reference from do_swap_page. Which also blocks
migration.

Note that neither is a hard deadlock, as in lockdep complaints, because
they're all retrying anyway. They're more like livelocks, and the bigger
your pile of pages the more likely that you'll always have a failed page
and need to abort and retry. Which results in threads spinning forever.
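
As pseudocode, to make the contrast concrete (migrate_single_page is a
placeholder, the rest is just the trylock pattern):

	/* Block-granular attempt: any one busy page aborts the whole pile,
	 * which under contention can spin forever. */
	static bool try_migrate_block(struct page **pages, unsigned long npages)
	{
		unsigned long i;

		for (i = 0; i < npages; ++i) {
			if (!trylock_page(pages[i])) {
				/* another thread holds one page: undo and retry */
				while (i--)
					unlock_page(pages[i]);
				return false;
			}
		}
		return true;	/* caller migrates the whole block */
	}

	/* Core mm style fallback: only the page the fault actually needs,
	 * so each thread eventually makes forward progress. */
	static void migrate_faulting_page_only(struct page *fault_page)
	{
		if (trylock_page(fault_page)) {
			migrate_single_page(fault_page);	/* placeholder */
			unlock_page(fault_page);
		}
	}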

> > So the only benefit I'm seeing is the unified lru, which I'm not sure is
> > worth it. There's also a bit a lru design tension here, because for the bo
> 
> Well also not rewriting the world...

Yeah it's tough. I'm still at the "understanding all the tradeoffs" stage,
just to make that clear.
-Sima

> Matt
> 
> > world we want objects that are locked to stay on the lru, so that the
> > competing processes can figure out who has the winning ww ticket. The core
> > mm design otoh does isolate pages and remove them from the lru when
> > they're acquired, so that they don't gunk up other processes from trying
> > to make forward progress and are better hidden. Which reduces temporary
> > page references (from lru walk) preventing migration and stuff like that.
> > 
> > Cheers, Sima
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 23/28] drm/xe: Add SVM VRAM migration
  2024-09-02 11:01             ` Christian König
@ 2024-09-02 12:50               ` Daniel Vetter
  0 siblings, 0 replies; 100+ messages in thread
From: Daniel Vetter @ 2024-09-02 12:50 UTC (permalink / raw)
  To: Christian König
  Cc: Matthew Brost, Daniel Vetter, Thomas Hellström, intel-xe,
	dri-devel, airlied, matthew.auld, daniel,
	Paneer Selvam, Arunpravin

On Mon, Sep 02, 2024 at 01:01:45PM +0200, Christian König wrote:
> On 30.08.24 at 00:12, Matthew Brost wrote:
> > On Thu, Aug 29, 2024 at 01:02:54PM +0200, Daniel Vetter wrote:
> > > On Thu, Aug 29, 2024 at 11:53:58AM +0200, Thomas Hellström wrote:
> > > > But as Sima pointed out in private communication, exhaustive eviction
> > > > is not really needed for faulting to make (crawling) progress.
> > > > Watermarks and VRAM trylock shrinking should suffice, since we're
> > > > strictly only required to service a single gpu page granule at a time.
> > > > 
> > > > However, ordinary bo-based jobs would still like to be able to
> > > > completely evict SVM vram. Whether that is important enough to strive
> > > > for is ofc up for discussion.
> > > My take is that you don't win anything for exhaustive eviction by having
> > > the dma_resv somewhere in there for svm allocations. Roughly for split lru
> > > world, where svm ignores bo/dma_resv:
> > > 
> > > When evicting vram from the ttm side we'll fairly switch between selecting
> > > bo and throwing out svm pages. With drm_exec/ww_acquire_ctx selecting bo
> > > will eventually succeed in vacuuming up everything (with a few retries
> > > perhaps, if we're not yet at the head of the ww ticket queue).
> > > 
> > > svm pages we need to try to evict anyway - there's no guarantee, because
> > > the core mm might be holding temporary page references (which block
> > Yea, but I think you could kill the app then - not suggesting we
> > should but could. To me this is akin to a CPU fault and not being able
> > to migrate the device pages - the migration layer doc says when this
> > happens kick this to user space and segfault the app.
> 
> That's most likely a bad idea. That the core holds a temporary page
> reference can happen any time without any wrongdoing by the application.
> E.g. for direct I/O, swapping etc...
> 
> So you can't punish the application with a segfault if you happen to not be
> able to migrate a page because it has a reference.

See my other reply, it even happens as a direct consequence of a 2nd
thread trying to migrate the exact same page from vram to sram. And that
really is a core use case.

So yeah, we really can't SIGBUS in this case.
-Sima
-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 23/28] drm/xe: Add SVM VRAM migration
  2024-08-29 21:48       ` Matthew Brost
@ 2024-09-02 13:02         ` Daniel Vetter
  0 siblings, 0 replies; 100+ messages in thread
From: Daniel Vetter @ 2024-09-02 13:02 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Christian König, Daniel Vetter, intel-xe, dri-devel, airlied,
	thomas.hellstrom, matthew.auld, daniel, Paneer Selvam, Arunpravin

On Thu, Aug 29, 2024 at 09:48:11PM +0000, Matthew Brost wrote:
> On Thu, Aug 29, 2024 at 11:24:26AM +0200, Christian König wrote:
> >    On 28.08.24 at 18:06, Daniel Vetter wrote:
> > 
> 
> A lot to unpack here. Will try to address as much as I can in this
> single reply to both of you (Daniel, Christian).
> 
> > On Tue, Aug 27, 2024 at 07:48:56PM -0700, Matthew Brost wrote:
> > 
> > Migration is implemented with range granularity, with VRAM backing being
> > a VM private TTM BO (i.e., shares dma-resv with VM). The lifetime of the
> > TTM BO is limited to when the SVM range is in VRAM (i.e., when a VRAM
> > SVM range is migrated to SRAM, the TTM BO is destroyed).
> > 
> > The design choice for using TTM BO for VRAM backing store, as opposed to
> > direct buddy allocation, is as follows:
> > 
> > - DRM buddy allocations are not at page granularity, offering no
> >   advantage over a BO.
> > 
> > This one I'm not understanding.
> >
> >    Adding Arun as well. I couldn't understand it fully either, but maybe
> >    it's because the buddy allocator is more optimized for higher orders of
> >    allocations?
> > 
> 
> As currently written BO VRAM allocation resolves to a DRM buddy
> allocation. Unless there is memory pressure this is likely 1 buddy block
> if the allocation is aligned (SVM should basically also be doing aligned
> allocations, with the common case being 2M at a time).
> 
> It was suggested in earlier revs by a colleague of mine that allocating
> directly from the DRM buddy pool provided a benefit wrt freeing a page at
> a time. It doesn't, given that even if you bypass a BO you are most likely
> going to get 1 buddy block which is larger than a page. In either case
> you need to ref count the allocation or do some wild splitting algorithm
> (I don't want to do that). Or alternatively write a new buddy allocator
> which can easily cope with freeing a page at a time (I don't want to do that).
> 
> Lastly, the common case for getting dev_pagemap_ops.page_free is going
> to be consecutive calls spanning the entire allocation (e.g. eviction or
> CPU fault which triggers migration).
> 
> > 
> > 
> > - DRM buddy allocations do not solve locking inversion problems between
> >   mmap lock and dma-resv locks.
> > 
> > Which mmap -> dma_resv inversion? I've seen a lot ... I guess it also
> > matters hugely which migration path we're in, i.e. opportunistic
> > migration, cpu fault where we have to migrate or die, or when we run out
> > of vram and need to evict stuff to make space.
> > 
> >    Mhm I think the locking order between mmap lock and dma-resv lock is
> >    well defined since dma_resv_lockdep() was added.
> >
> 
> Yes. Also solved the inversion issue by using migrate_device_*. At one
> point I had trylocking of the mmap lock (still kinda there) but have agreed
> based on Daniel's feedback to rip that out.
>  
> > - Unified eviction is required (SVM VRAM and TTM BOs need to be able to
> >   evict each other).
> > 
> > So core mm handles this by just roughly equally shrinking everything.
> > Seems to work, and it has a pile of object shrinkers, and the page lru is
> > also split into page cache and anon memory.
> > 
> > I think you need to put in more justification that unified eviction is
> > required than just stating it, because a look at mm/ gives a very well
> > established counterexample.
> > 
> > 
> > - For exhaustive eviction [1], SVM VRAM allocations will almost certainly
> >   require a dma-resv.
> > 
> > So from the TTM side we need exhaustive eviction, or at least something a
> > bit more exhaustive than what ttm currently has. Note that i915-gem also
> > never really got to perfect exhaustive eviction, it's just a pile better
> > than ttm right now.
> > 
> >    Please define what exhaustive eviction should mean? I think I know what
> >    it is and I have been pushing TTM into the direction of solving this
> >    for years.
> >    The last missing puzzle piece is to use drm_exec for TTM evictions, but
> >    apart from that everything should work now.
> >    Regards,
> >    Christian.
> 
> I think Thomas has defined this in his replies. He also touches on how our
> SVM design allows mixing user BO mappings and SVM mappings within the
> same VM. These need to be able to fairly evict each other. A dma-resv
> lock provides a level of fairness and ensures forward progress once a
> flavor of his series lands.
> 
> Also worth noting, in addition to user BOs, we have kernel BOs (for page
> tables, user exec queues, etc...) in Xe which absolutely need to be able
> to evict something or the application dies.
> 
> > 
> > Now if there's also SVM VRAM managed on a page lru, TTM exhaustive
> 
> Page LRU isn't used for device pages from my understanding.

Yeah, we'd need to manage that ourselves. We could use exactly what the
core mm is doing, I haven't found anything that prohibits that. I think
core mm simply doesn't maintain zone device lru because it's not involved
in any device side access.

> > eviction is going to win because the shrinkers can only trylock dma_resv.
> > So this part works. It actually works so well on the system memory side
> > that if we're not careful we can trigger oom, because we're too good at
> > getting at all the memory.
> > 
> > SVM VRAM allocations otoh do not need exhaustive evictions. Or at least I
> > don't see why, because the idea is that thanks to gpu and cpu page faults,
> > you can always get out of a pinch by just trashing everything for a while
> > and migrating the handful of available pages a lot.
> > 
> > 
> > - Likely allocation size is 2M which makes the size of a BO (872)
> >   acceptable per allocation (872 / 2M == .0004158).
> > 
> > With this, using TTM BO for VRAM backing store seems to be an obvious
> > choice as it allows leveraging of the TTM eviction code.
> > 
> > Except it requires that you hold dma_resv, which brings in all kinds of
> > pain. And for eviction we really don't need a lot of synchronization, so a
> 
> Yes, but I think I have solved all those issues wrt dma-resv.
> 
> What is really the alternative here? Teaching TTM to evict non-BO SVM
> allocations? Writing an SVM VRAM allocator which ends up looking also
> exactly like TTM and teaching it to evict TTM BOs? In the latter case we'd
> still need to grab the dma-resv lock...

Yup.

> Do we write a new page based buddy allocator and wire that to TTM if SVM
> could possibly be used?

Well, rebasing it on top of a page array, like the buddy allocator in core mm,
would be my first idea.

> This would be tons of code and I'm not really sure what the ROI is here.
> 
> > lot of that locking is not needed, unlike the case where we have a cpu
> > fault, where we absolutely need mmap_lock and all that to make sure we
> > fault in the right page.
> > 
> > But for eviction we only need to throw out some pages, if we're not
> > entirely precise with picking the right ones (or have no idea into which
> > vma they're all currently mapped into) it doesn't matter. That's why
> > migrate_device_pages doesn't care about any of that at all, it doesn't
> > need to by design. But by bo backing memory you drag in all that stuff
> > that's causing headaches for eviction.
> > 
> > The only thing migration tries to do is remove all pte, and if that
> > succeeds, move the page. Specialized for the gpusvm case, looking at mm/
> > code as cheat sheet, we need roughly:
> > 
> > - reverse mapping structure like anon_vma. Except gpusvm can assume that
> >   there's currently only one gpu side mapping, so we can just stuff the
> >   gpusvm and va_address into the page, and protect it with the page lock.
> > 
> > - we need pagetable locks, so that we can manipulate pagetables (well
> >   specifically make ptes invalid) without taking any other locks.
> > 
> > - everyone else inserting or removing ptes for svm mappings also needs to
> >   lock the page, or we have races. This might be the hmm_range_fault races
> >   you're seeing when allowing vram pages, since I don't think there's
> >   anything else stopping the page lookup otherwise from succeeding.
> 
> AMD looks to take the range->migration_mutex to prevent races.
> 
> > 
> > - we might also need to stuff migrate ptes into the gpu side, like the cpu
> >   does, to hold up refaults before the migration has finished. But I think
> >   those are only needed for anon memory in sram because there's no other
> >   way to find the right page than swap pte entries, of which migration
> >   entries are a special case.
> > 
> > - core code also expects us to handle the page refcount correctly for svm
> >   device memory, so we can't free the pages like normal bo pages either
> >   directly to drm_buddy.
> > 
> > Now typing this all up will look an awful lot like what you have, with the
> > dma_resv lock serving as the page lock and the pagetable lock. The only
> 
> dma_resv is indeed one of the locks we need for page table updates (binds)
> as we allocate TTM BOs for page tables and we install fences for binds
> in dma-resv slots (certainly for non-SVM, might be able to drop that for
> SVM).

So the way this is solved on the core mm side is with two tricks:

- Page faults race entirely, and races are only resolved at pte insertion
  time when you acquire the pagetable lock. If there's anything else than
  a page-not-present pte, you've raced and bail out.

- Pagetables are allocated upfront, with the same trick: If someone else
  was faster, you bail out. Pagetables are never reclaimed for core mm
  code, so that avoids someone else nuking it meanwhile. At least while
  the vma mapping stays valid.

  I'm not sure we can entirely emulate that design with gpusvm.
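
Spelled out as a sketch (svm_pt_node_alloc/free and the parent/entries
layout are all made up), the second trick is basically:

	/* Allocate the pagetable upfront, then either install it under the
	 * parent lock or throw it away if we raced. */
	struct svm_pt_node *new = svm_pt_node_alloc();
	struct svm_pt_node *node;

	if (!new)
		return -ENOMEM;

	spin_lock(&parent->lock);
	if (parent->entries[idx]) {
		/* someone else was faster; pagetables are never reclaimed
		 * here, so the existing one can simply be reused */
		node = parent->entries[idx];
		spin_unlock(&parent->lock);
		svm_pt_node_free(new);
	} else {
		node = parent->entries[idx] = new;
		spin_unlock(&parent->lock);
	}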

And yeah there would be a substantial difference in code between the bo
and the svm world.

> > reason is that these locks are much smaller and nest within all the other
> > stuff going on and so avoid the inversion issues.
> > 
> > So one annoying part is that this is a lot of pointless-looking typing.
> > The other is that it's full of races, because core mm really is yolo all
> > the way down. So lots of ways you lock the wrong page and fun stuff like
> > that, but the few cases that matter work:
> > 
> > - svm fault handling with hmm_range fault retries with mmu notifiers. Note
> >   that we need to have vram pages locked and the notifier retry needs to
> >   be under the pagetable lock, or there's room to escape. At least that's
> >   what I came up with last time I thought it all through.

Correction: at fault time core mm does not lock pages. Just elevated
refcount is enough.

> We grab the gpusvm->notifier lock just before committing the bind and check
> for retry. If we need to retry we completely unwind all locks and
> restart the GPU fault.

Yeah core mm does the same.
  
> > - migrate_to_ram: it will hold a page reference which we know was the
> >   valid vram page when the cpu pte was locked, but it might not be it
> >   anymore. So we have to lock the page and check whether it's still gpu
> >   mapped, and if not retry the entire fault since most likely another
> >   migrate_to_ram has succeeded meanwhile in parallel.
> > 
> > - for eviction we don't care, we might actually be migrating a page no one
> >   even wants anymore.
> > 
> > Now I think you can get all this done with the dma_resv lock and maybe the
> > bo refcount. But it does involve a tremendous amount of headaches and
> 
> I don't think the headaches are too bad...
> 
> > impedance mismatch, because that's not how page faults and migrations
> > work in core mm.
> 
> Agree there is a bit of impedance mismatch but see above - I can't really
> think of a better solution without thousands of lines of new code and
> invasive changes across the subsystem.
> 
> What I have in place appears to work with very little code changes to Xe
> and none to TTM. AMD also landed on a BO likely for similar reasons I
> have laid out.

My understanding of the history is that large chunks of the gpusvm
> features were retrofitted, without updating the design. So I'm not
putting that much weight on amdkfd as a good solution, it's just the
obvious incremental solution.

Cheers, Sima
-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 24/28] drm/xe: Basic SVM BO eviction
  2024-08-29 15:55     ` Matthew Brost
@ 2024-09-02 13:05       ` Daniel Vetter
  0 siblings, 0 replies; 100+ messages in thread
From: Daniel Vetter @ 2024-09-02 13:05 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Daniel Vetter, intel-xe, dri-devel, airlied, christian.koenig,
	thomas.hellstrom, matthew.auld, daniel

On Thu, Aug 29, 2024 at 03:55:56PM +0000, Matthew Brost wrote:
> On Thu, Aug 29, 2024 at 12:14:53PM +0200, Daniel Vetter wrote:
> > On Tue, Aug 27, 2024 at 07:48:57PM -0700, Matthew Brost wrote:
> > > Wire xe_bo_move to GPUSVM migration to SRAM with trylocking of mmap
> > > lock.
> > > 
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/xe_bo.c       | 35 +++++++++++++++++++++++++++++++-
> > >  drivers/gpu/drm/xe/xe_bo_types.h |  3 +++
> > >  drivers/gpu/drm/xe/xe_svm.c      |  2 ++
> > >  drivers/gpu/drm/xe/xe_svm.h      | 13 ++++++++++++
> > >  4 files changed, 52 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
> > > index ad804b6f9e84..ae71fcbe5380 100644
> > > --- a/drivers/gpu/drm/xe/xe_bo.c
> > > +++ b/drivers/gpu/drm/xe/xe_bo.c
> > > @@ -25,6 +25,7 @@
> > >  #include "xe_pm.h"
> > >  #include "xe_preempt_fence.h"
> > >  #include "xe_res_cursor.h"
> > > +#include "xe_svm.h"
> > >  #include "xe_trace_bo.h"
> > >  #include "xe_ttm_stolen_mgr.h"
> > >  #include "xe_vm.h"
> > > @@ -250,6 +251,8 @@ int xe_bo_placement_for_flags(struct xe_device *xe, struct xe_bo *bo,
> > >  static void xe_evict_flags(struct ttm_buffer_object *tbo,
> > >  			   struct ttm_placement *placement)
> > >  {
> > > +	struct xe_bo *bo;
> > > +
> > >  	if (!xe_bo_is_xe_bo(tbo)) {
> > >  		/* Don't handle scatter gather BOs */
> > >  		if (tbo->type == ttm_bo_type_sg) {
> > > @@ -261,6 +264,12 @@ static void xe_evict_flags(struct ttm_buffer_object *tbo,
> > >  		return;
> > >  	}
> > >  
> > > +	bo = ttm_to_xe_bo(tbo);
> > > +	if (bo->flags & XE_BO_FLAG_SYSTEM_ALLOC) {
> > > +		*placement = sys_placement;
> > > +		return;
> > > +	}
> > > +
> > >  	/*
> > >  	 * For xe, sg bos that are evicted to system just triggers a
> > >  	 * rebind of the sg list upon subsequent validation to XE_PL_TT.
> > > @@ -758,6 +767,17 @@ static int xe_bo_move(struct ttm_buffer_object *ttm_bo, bool evict,
> > >  		}
> > >  	}
> > >  
> > > +	if (!move_lacks_source && (bo->flags & XE_BO_FLAG_SYSTEM_ALLOC) &&
> > > +	    new_mem->mem_type == XE_PL_SYSTEM) {
> > > +		ret = xe_svm_range_evict(bo->range);
> > > +		if (!ret) {
> > > +			drm_dbg(&xe->drm, "Evict system allocator BO success\n");
> > > +			ttm_bo_move_null(ttm_bo, new_mem);
> > > +		}
> > > +
> > > +		goto out;
> > > +	}
> > > +
> > >  	if (!move_lacks_source &&
> > >  	    ((old_mem_type == XE_PL_SYSTEM && resource_is_vram(new_mem)) ||
> > >  	     (mem_type_is_vram(old_mem_type) &&
> > > @@ -1096,6 +1116,19 @@ static void xe_ttm_bo_delete_mem_notify(struct ttm_buffer_object *ttm_bo)
> > >  	}
> > >  }
> > >  
> > > +static bool xe_bo_eviction_valuable(struct ttm_buffer_object *ttm_bo,
> > > +				    const struct ttm_place *place)
> > > +{
> > > +	struct xe_bo *bo = ttm_to_xe_bo(ttm_bo);
> > > +
> > > +	/* Do not evict SVMs before having a binding */
> > > +	if (bo->flags & XE_BO_FLAG_SYSTEM_ALLOC &&
> > > +	    !xe_svm_range_has_vram_binding(bo->range))
> > > +		return false;
> > > +
> > > +	return ttm_bo_eviction_valuable(ttm_bo, place);
> > > +}
> > > +
> > >  const struct ttm_device_funcs xe_ttm_funcs = {
> > >  	.ttm_tt_create = xe_ttm_tt_create,
> > >  	.ttm_tt_populate = xe_ttm_tt_populate,
> > > @@ -1106,7 +1139,7 @@ const struct ttm_device_funcs xe_ttm_funcs = {
> > >  	.io_mem_reserve = xe_ttm_io_mem_reserve,
> > >  	.io_mem_pfn = xe_ttm_io_mem_pfn,
> > >  	.release_notify = xe_ttm_bo_release_notify,
> > > -	.eviction_valuable = ttm_bo_eviction_valuable,
> > > +	.eviction_valuable = xe_bo_eviction_valuable,
> > >  	.delete_mem_notify = xe_ttm_bo_delete_mem_notify,
> > >  };
> > >  
> > > diff --git a/drivers/gpu/drm/xe/xe_bo_types.h b/drivers/gpu/drm/xe/xe_bo_types.h
> > > index 2ed558ac2264..4523b033417c 100644
> > > --- a/drivers/gpu/drm/xe/xe_bo_types.h
> > > +++ b/drivers/gpu/drm/xe/xe_bo_types.h
> > > @@ -16,6 +16,7 @@
> > >  #include "xe_ggtt_types.h"
> > >  
> > >  struct xe_device;
> > > +struct xe_svm_range;
> > >  struct xe_vm;
> > >  
> > >  #define XE_BO_MAX_PLACEMENTS	3
> > > @@ -47,6 +48,8 @@ struct xe_bo {
> > >  	struct ttm_bo_kmap_obj kmap;
> > >  	/** @pinned_link: link to present / evicted list of pinned BO */
> > >  	struct list_head pinned_link;
> > > +	/** @range: SVM range for BO */
> > > +	struct xe_svm_range *range;
> > >  #ifdef CONFIG_PROC_FS
> > >  	/**
> > >  	 * @client: @xe_drm_client which created the bo
> > > diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
> > > index fd8987e0a506..dc9810828c0a 100644
> > > --- a/drivers/gpu/drm/xe/xe_svm.c
> > > +++ b/drivers/gpu/drm/xe/xe_svm.c
> > > @@ -531,6 +531,8 @@ static struct xe_bo *xe_svm_alloc_vram(struct xe_vm *vm, struct xe_tile *tile,
> > >  			  range->base.va.start, ttm_bo_type_device,
> > >  			  XE_BO_FLAG_VRAM_IF_DGFX(tile) |
> > >  			  XE_BO_FLAG_SYSTEM_ALLOC | XE_BO_FLAG_SKIP_CLEAR);
> > > +	if (!IS_ERR(bo))
> > > +		bo->range = range;
> > >  	xe_vm_unlock(vm);
> > >  	if (IS_ERR(bo)) {
> > >  		err = PTR_ERR(bo);
> > > diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> > > index 3f432483a230..b9cf0e2500da 100644
> > > --- a/drivers/gpu/drm/xe/xe_svm.h
> > > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > > @@ -46,6 +46,19 @@ static inline bool xe_svm_range_has_dma_mapping(struct xe_svm_range *range)
> > >  	return range->base.flags.has_dma_mapping;
> > >  }
> > >  
> > > +static inline bool xe_svm_range_has_vram_binding(struct xe_svm_range *range)
> > > +{
> > > +	return xe_svm_range_in_vram(range) && range->tile_present;
> > > +}
> > > +
> > > +static inline int xe_svm_range_evict(struct xe_svm_range *range)
> > > +{
> > > +	struct drm_gpusvm_ctx ctx = { .trylock_mmap = true, };
> > 
> > So even trying to acquire an mmap lock for eviction is I think a design
> > bug for svm memory ranges. It's a bunch of physical memory, you have no
> > idea how many mm/vma map it and which one you pick as the special one is
> > fairly arbitrary.
> > 
> 
> Let me drop this whole trylock thing and just evict via
> drm_gpusvm_evict_to_sram / migrate_device_vma_range which does not
> require the mmap lock. I added this code very recently per Matt Auld's
> suggestion and agree it makes the trylocking unnecessary.
> 
> > So don't, eviction should entirely ignore va/mm issues at the top level
> > like the migrate_device_range function does (maybe we need a
> > scatter-gather version of that instead of just a range).
> > 
> 
> I needed to add migrate_device_vma_range (might be a bad name) as VRAM
> may be non-contiguous pfns when memory pressure exists, whereas
> migrate_device_range only supports contiguous pfns.

Ah, I think that's another fallout of tying vram allocations and
management too closely to the gpusvm->mm va layout. Makes sense under the
assumptions of your design at least.

So I think we can file that under the large discussion item of per
page/folio or per-range gpusvm design.
-Sima

> > That function internally makes sure you're in sync with any vma/vm by:
> > - installing migration ptes everywhere, which does the mmu_notifier dance
> > - locking the pages to prevent other concurrent migration or other fun
> >   stuff from happening
> > - then restore ptes to something sensible when it's all done
> > 
> > And it does that by looping over _all_ possible mappings of a page with
> > the rmap_walk infrastructure.
> > 
> > The only case where we need the mmap lock (or vma lock or whatever) is if
> > we need to be coherent with other concurrent mm updates of a specific mm.
> > That should only be the case when migrating to vram, where the gpusvm->mm
> > is the special one, and when migrating to sram due to cpu faults, where
> > the vmf->vma->mm is special (and might at best have a tenuous relationship
> > to the gpusvm->mm). But those are the only cases where a specific mm and vma
> > have any relevance to svm vram allocations.
> > 
> 
> Thanks for the info.
> 
> Matt
> 
> > -Sima
> > 
> > > +
> > > +	return drm_gpusvm_migrate_to_sram(range->base.gpusvm, &range->base,
> > > +					  &ctx);
> > > +}
> > > +
> > >  #define xe_svm_notifier_lock(vm__)	\
> > >  	drm_gpusvm_notifier_lock(&(vm__)->svm.gpusvm)
> > >  
> > > -- 
> > > 2.34.1
> > > 
> > 
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory
  2024-09-02 11:53       ` Daniel Vetter
@ 2024-09-02 17:03         ` Matthew Brost
  2024-09-11 16:06           ` Matthew Brost
  0 siblings, 1 reply; 100+ messages in thread
From: Matthew Brost @ 2024-09-02 17:03 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: intel-xe, dri-devel, airlied, christian.koenig, thomas.hellstrom,
	matthew.auld, daniel

On Mon, Sep 02, 2024 at 01:53:14PM +0200, Daniel Vetter wrote:
> On Thu, Aug 29, 2024 at 05:27:13PM +0000, Matthew Brost wrote:
> > On Thu, Aug 29, 2024 at 11:45:08AM +0200, Daniel Vetter wrote:
> > > On Tue, Aug 27, 2024 at 07:48:38PM -0700, Matthew Brost wrote:
> > > > This patch introduces support for GPU Shared Virtual Memory (SVM) in the
> > > > Direct Rendering Manager (DRM) subsystem. SVM allows for seamless
> > > > sharing of memory between the CPU and GPU, enhancing performance and
> > > > flexibility in GPU computing tasks.
> > > > 
> > > > The patch adds the necessary infrastructure for SVM, including data
> > > > structures and functions for managing SVM ranges and notifiers. It also
> > > > provides mechanisms for allocating, deallocating, and migrating memory
> > > > regions between system RAM and GPU VRAM.
> > > > 
> > > > This mid-layer is largely inspired by GPUVM.
> > > > 
> > > > Cc: Dave Airlie <airlied@redhat.com>
> > > > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > > Cc: Christian König <christian.koenig@amd.com>
> > > > Cc: <dri-devel@lists.freedesktop.org>
> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > 
> > > Still not sure I've got the right race that you paper over with
> > > mmap_write_lock, but I spotted a few things, comments inline.
> > > 
> > 
> > I've replied to this issue several times, let's table the
> > mmap_write_lock issue in this reply - a lot of other things to get
> > through. Current thinking is to try to add a range->migrate_lock like AMD
> > which I state here [1]. Let's continue discussing the mmap lock issue
> > there if possible.
> 
> Yeah I wrote replies as I read code, so there's a bit a mess from my side
> here. Apologies for that.
> 

All good, has been quite helpful thus far.

> > [1] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1111169
> 
> Some more replies below that I think we haven't covered anywhere else yet.
> 
> > > > + * 2) Garbage Collector.
> > > > + *
> > > > + *	void __driver_garbage_collector(struct drm_gpusvm *gpusvm,
> > > > + *					struct drm_gpusvm_range *range)
> > > > + *	{
> > > > + *		struct drm_gpusvm_ctx ctx = {};
> > > > + *
> > > > + *		assert_driver_svm_locked(gpusvm);
> > > > + *
> > > > + *		// Partial unmap, migrate any remaining VRAM pages back to SRAM
> > > > + *		if (range->flags.partial_unmap)
> > > > + *			drm_gpusvm_migrate_to_sram(gpusvm, range, &ctx);
> > > 
> > > Note that the migration back to sram isn't guaranteed to succeed, so you
> > > might still be stuck with a partially migrated range. This might be a case
> > > where hmm gives you vram pfns, but the range you have doesn't have any
> > > vram allocation anymore because you dropped it here. Not sure tbh.
> > >
> > 
> > Hmm isn't in the picture here, nor is a VMA, once the
> > drm_gpusvm_evict_to_sram path is always taken as discussed here [2]. I
> > might have a corner case BO refcounting / TTM resource lookup bug
> > somewhere in here which needs to be resolved though (e.g. eviction
> > racing with this code path), will try to close on that.
> > 
> > [2] https://patchwork.freedesktop.org/patch/610955/?series=137870&rev=1#comment_1111164
> 
> So maybe my understanding is wrong, but from my reading of the device
> migration code the exact same non-guarantees as for the sram2sram
> migration code apply:
> 
> - There's no guarantee the page/folio doesn't have an elevated refcount,
>   which makes the migration fail (in try_to_migrate, where it checks for
>   surplus refcounts).
> 
> - There's no guarantee you'll get the page/folio lock, which makes the
>   migration fail. Worse the core mm seems to use a fallback to per-page
>   locking as it's extremely crude "get out of deadlocks due to acquiring
>   multiple page locks" card.
>

I think this circles back to the basic point that the design must be able to move
VRAM -> SRAM because the host can't access VRAM. Certainly in the CPU
page fault path this can't fail on the faulting page at least, or if it
does the app gets segfaulted. I'll investigate more here but that is
still my current thinking. If VRAM -> SRAM can fail / make partial
progress in eviction paths, then mixed mappings likely need to be
supported, which shouldn't be all that painful - basically we just need a
cursor in the bind code which can walk mixed mappings.

SRAM -> VRAM certainly can fail which is handled by just aborting the
migration.

> > > > +map_pages:
> > > > +	if (is_device_private_page(hmm_pfn_to_page(pfns[0]))) {
> > > > +		WARN_ON_ONCE(!range->vram_allocation);
> > > > +
> > > > +		for (i = 0; i < npages; ++i) {
> > > > +			pages[i] = hmm_pfn_to_page(pfns[i]);
> > > > +
> > > > +			if (WARN_ON_ONCE(!is_device_private_page(pages[i]))) {
> > > > +				err = -EOPNOTSUPP;
> > > > +				goto err_free;
> > > > +			}
> > > > +		}
> > > 
> > > You can't do the above, because the pfns you get from hmm come with zero
> > > guarantees, you neither hold a page reference nor the page lock. The only
> > > thing you can do is grab the pagetable lock (or mmu notifier locks) and
> > > check it's still valid, before you can touch any state. I think the
> > > range->vram_allocation is probably always valid since you clean that up
> > > under the same lock/thread, but there's a good chance the vram allocation
> > > is otherwise already gone for good. Or you get an inconsistent snapshot.
> > > 
> > 
> > I haven't seen this pop in my testing yet, which is fairly thorough. My
> > thinking was that with migration always being enforced at range granularity we'd
> > never get mixed mappings from the core as migration is completely under
> > control of the driver. Maybe I'm not understanding what you are saying
> > here...
> 
> So one scenario is that you race (without the mmap write lock or the
> migration_mutex design ofc) with another invalidate, and get a partial
> view here of mixed vram and sram pages. Until you acquire the mmu notifier
> lock and have made sure your pages are still valid, there's essentially no
> guarantee.

The pages are collected in a notifier-stable state via the hmm locking +
seqno begin and recheck. Before they can be used (e.g. to program a bind), yes,
the notifier lock needs to be taken to ensure they haven't changed
between collection and use - at least this is my understanding.
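
Roughly, as a paraphrase of that flow rather than the actual code (the
hmm_range setup is abbreviated and the surrounding gpusvm / notifier / mm
variables are assumed from context):

	struct hmm_range hmm_range = {
		.notifier = &notifier->notifier,
		.start = range_start,
		.end = range_end,
		.hmm_pfns = pfns,
		.default_flags = HMM_PFN_REQ_FAULT,
	};
	unsigned long seq;
	int err;

retry:
	seq = mmu_interval_read_begin(hmm_range.notifier);
	hmm_range.notifier_seq = seq;

	mmap_read_lock(mm);
	err = hmm_range_fault(&hmm_range);	/* collect pfns */
	mmap_read_unlock(mm);
	if (err == -EBUSY)
		goto retry;
	if (err)
		return err;

	/* ... dma map the pages / build the bind ... */

	drm_gpusvm_notifier_lock(gpusvm);
	if (mmu_interval_read_retry(hmm_range.notifier, seq)) {
		drm_gpusvm_notifier_unlock(gpusvm);
		/* unwind the dma mappings and restart the whole gpu fault */
		goto retry;
	}
	/* commit the bind (program ptes) while holding the notifier lock */
	drm_gpusvm_notifier_unlock(gpusvm);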

> > 
> > > > +
> > > > +		/* Do not race with notifier unmapping pages */
> > > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > > +		range->flags.has_vram_pages = true;
> > > > +		range->pages = pages;
> > > > +		if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
> > > > +			err = -EAGAIN;
> > > > +			__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > > > +		}
> > > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > > +	} else {
> > > > +		dma_addr_t *dma_addr = (dma_addr_t *)pfns;
> > > > +
> > > > +		for_each_dma_page(i, j, npages, order) {
> > > > +			if (WARN_ON_ONCE(i && order !=
> > > > +					 hmm_pfn_to_map_order(pfns[i]))) {
> > > > +				err = -EOPNOTSUPP;
> > > > +				npages = i;
> > > > +				goto err_unmap;
> > > > +			}
> > > > +			order = hmm_pfn_to_map_order(pfns[i]);
> > > > +
> > > > +			pages[j] = hmm_pfn_to_page(pfns[i]);
> > > > +			if (WARN_ON_ONCE(is_zone_device_page(pages[j]))) {
> > > > +				err = -EOPNOTSUPP;
> > > > +				npages = i;
> > > > +				goto err_unmap;
> > > > +			}
> > > > +
> > > > +			set_page_dirty_lock(pages[j]);
> > > > +			mark_page_accessed(pages[j]);
> > > 
> > > You can't do these, because you don't hold a page reference. They're also
> > > not needed because hmm_range_fault goes through the full mkwrite dance,
> > > which takes care of these, unlike the gup family of functions.
> > >
> > 
> > This is a leftover from our existing userptr code and it does appear to
> > be incorrect. Let me remove this and fixup our userptr code while I'm at
> > it.
> 
> Ack.
> 
> > > > +	vas = vma_lookup(mm, start);
> > > > +	if (!vas) {
> > > > +		err = -ENOENT;
> > > > +		goto err_mmunlock;
> > > > +	}
> > > > +
> > > > +	if (end > vas->vm_end || start < vas->vm_start) {
> > > > +		err = -EINVAL;
> > > > +		goto err_mmunlock;
> > > > +	}
> > > > +
> > > > +	if (!vma_is_anonymous(vas)) {
> > > > +		err = -EBUSY;
> > > > +		goto err_mmunlock;
> > > > +	}
> > > > +
> > > > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) + sizeof(*dma_addr) +
> > > > +		       sizeof(*pages), GFP_KERNEL);
> > > > +	if (!buf) {
> > > > +		err = -ENOMEM;
> > > > +		goto err_mmunlock;
> > > > +	}
> > > > +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > > > +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr)) * npages;
> > > > +
> > > > +	zdd = drm_gpusvm_zdd_alloc(range);
> > > > +	if (!zdd) {
> > > > +		err = -ENOMEM;
> > > > +		goto err_free;
> > > > +	}
> > > > +
> > > > +	migrate.vma = vas;
> > > > +	migrate.src = buf;
> > > > +	migrate.dst = migrate.src + npages;
> > > > +
> > > > +	err = migrate_vma_setup(&migrate);
> > > > +	if (err)
> > > > +		goto err_free;
> > > > +
> > > > +	/*
> > > > +	 * FIXME: Below cases, !migrate.cpages and migrate.cpages != npages, not
> > > > +	 * always an error. Need to revisit possible cases and how to handle. We
> > > > +	 * could prefault on migrate.cpages != npages via hmm_range_fault.
> > 
> > This is a bit stale, can update this comment.
> > 
> > > > +	 */
> > > 
> > > Yeah I think especially under contention partial migrations, at least back
> > > to sram due to cpu faults, are pretty much expected. And you need to cope
> > > somehow.
> > > 
> > 
> > I have seen these pop if the IGT calls mlock on the memory. My thinking
> > is migration to VRAM is basically optional and we fall back to leaving the range
> > in SRAM if an error occurs rather than doing a partial migration. This
> > is what currently happens so it is coped with.
> > 
> > If the memory is marked as must be in VRAM (NIY), well then the user
> > program has done something wrong and we can kill the app (akin to
> > segfault).
> 
> Yeah SIGBUS for "must be in VRAM" sounds like ok semantics.
> 
> > > > +
> > > > +	if (!migrate.cpages) {
> > > > +		err = -EFAULT;
> > > > +		goto err_free;
> > > > +	}
> > > > +
> > > > +	if (migrate.cpages != npages) {
> > > > +		err = -EBUSY;
> > > > +		goto err_finalize;
> > > > +	}
> 
> What I think is more fundamental is that I think this one here doesn't
> work. For migrate_to_ram you cannot assume that you can always migrate the
> entire block, I think to uphold the core mm forward progress rules we need
> to allow partial migrations there. And I think your current code allows
> that.
>

Yes. I had similar checks in migrate_to_ram at one point and that did
not work when multiple CPU faults from different threads occurred in
parallel. Each thread can grab a random set of VRAM pages to migrate I
think.
 
> But that then means you also are stuck with partial migration state here.
> That was the point I tried to make.
>

The error path with migrate_vma_pages/finalize safely unwinds the
migration in these cases leaving all pages in SRAM.
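
For reference, the abort pattern is roughly the below; copy_and_fill_dst_pfns
is a placeholder for the actual copy step, the rest is the stock migrate_vma
API:

	struct migrate_vma migrate = {
		.vma	= vas,
		.src	= src_pfns,
		.dst	= dst_pfns,
		.start	= start,
		.end	= end,
		.flags	= MIGRATE_VMA_SELECT_SYSTEM,
	};
	int err;

	err = migrate_vma_setup(&migrate);
	if (err)
		return err;

	if (migrate.cpages != npages) {
		/* Abort: leave every dst entry empty. migrate_vma_pages()
		 * then migrates nothing and migrate_vma_finalize() restores
		 * the original ptes, leaving all pages in SRAM. */
		memset(migrate.dst, 0, sizeof(*migrate.dst) * npages);
		goto finalize;
	}

	copy_and_fill_dst_pfns(&migrate);	/* placeholder */

finalize:
	migrate_vma_pages(&migrate);
	migrate_vma_finalize(&migrate);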

> > > > +/**
> > > > + * __drm_gpusvm_migrate_to_sram - Migrate GPU SVM range to SRAM (internal)
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @vas: Pointer to the VM area structure
> > > > + * @page: Pointer to the page for fault handling (can be NULL)
> > > > + * @start: Start address of the migration range
> > > > + * @end: End address of the migration range
> > > > + *
> > > > + * This internal function performs the migration of the specified GPU SVM range
> > > > + * to SRAM. It sets up the migration, populates + dma maps SRAM PFNs, and
> > > > + * invokes the driver-specific operations for migration to SRAM.
> > > > + *
> > > > + * Returns:
> > > > + * 0 on success, negative error code on failure.
> > > > + */
> > > > +static int __drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > > > +					struct vm_area_struct *vas,
> > > > +					struct page *page,
> > > > +					u64 start, u64 end)
> > > > +{
> > > > +	struct migrate_vma migrate = {
> > > > +		.vma		= vas,
> > > > +		.pgmap_owner	= gpusvm->device_private_page_owner,
> > > > +		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> > > > +		.fault_page	= page,
> > > > +	};
> > > > +	unsigned long npages;
> > > > +	struct page **pages;
> > > > +	dma_addr_t *dma_addr;
> > > > +	void *buf;
> > > > +	int i, err = 0;
> > > > +
> > > > +	mmap_assert_locked(gpusvm->mm);
> > > 
> > > That's the wrong mm, at least for the ->migrate_to_ram path. You might be
> > > called on an anon mapping from a child process. That also means that the
> > > vma you're looking at might have no relationship with anything you're
> > > tracking in your gpusvm.
> > >
> > 
> > Hmm, as discussed [3] I haven't added tests with child processes yet.
> > Let me do that and update the design as needed. This likely isn't
> > correct as you say.
> > 
> > [3] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1111169 
> 
> Ack. More tests should definitely help here to figure out what's up, and
> what's just me being confused.
> 

Starting to add tests; this fork() case appears to work after dropping these
asserts. More thorough testing is needed though.

> > > > +/**
> > > > + * drm_gpusvm_migrate_to_ram - Migrate GPU SVM range to RAM (page fault handler)
> > > > + * @vmf: Pointer to the fault information structure
> > > > + *
> > > > + * This function is a page fault handler used to migrate a GPU SVM range to RAM.
> > > > + * It retrieves the GPU SVM range information from the faulting page and invokes
> > > > + * the internal migration function to migrate the range back to RAM.
> > > > + *
> > > > + * Returns:
> > > > + * VM_FAULT_SIGBUS on failure, 0 on success.
> > > > + */
> > > > +static vm_fault_t drm_gpusvm_migrate_to_ram(struct vm_fault *vmf)
> > > > +{
> > > > +	struct drm_gpusvm_zdd *zdd = vmf->page->zone_device_data;
> > > > +	int err;
> > > > +
> > > > +	err = __drm_gpusvm_migrate_to_sram(zdd->range->gpusvm,
> > > 
> > > So I think zdd->range doesn't work, because even within a single mm the
> > > vma mapping a given piece of anon memory does not need to be unique, you
> > > can duplicate them with mremap.
> > > 
> > 
> > This is attached to a page, not a VMA. Both AMD and Nvidia drivers use a
> > similar lookup mechanism.
> 
> Yeah the page->zone_device_data is fine. It's the zone_device_data->range
> which I think isn't ok.
> 

Yes, this gets a little confusing with fork() and mremap. The range's
start / end can be nonsense in the remap case. Also, as you mention, a
range->migrate_mutex doesn't seem correct either. I can make it work but
maybe not worth even typing out why here (I can provide a little more
detail in another reply). New thinking is the zdd stores a size field and
has the locking - I think it is then akin to a VRAM folio.
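
i.e. something roughly like this, with purely hypothetical field names to
sketch the idea:

	/* "VRAM folio"-ish: describes only the physical allocation, no va. */
	struct drm_gpusvm_zdd {
		struct kref refcount;		/* one reference per device page */
		struct mutex migrate_lock;	/* serializes migration of this allocation */
		unsigned long size;		/* size of the backing VRAM allocation */
		void *vram_allocation;		/* driver backing store, e.g. a TTM BO */
	};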

> > > So all you have here is the physical memory and the vma, which might or
> > > might not be from the same process as gpusvm->mm.
> > > 
> > > Also the child process scenario means you using mmap_write on the fault
> > > side doesn't stop all cpu faults migrating stuff back.
> > > 
> > > Somewhat aside, but I think that means amdkfd's svm_range->migration_mutex
> > > is busted, because it's va based and so misses concurrently ongoing
> > > different mappings moving physical storage around underneath.
> > >
> > 
> > I think all of the above which falls into the fork() + child process
> > issues which you have raise. Until I test this out I can't speak to this
> > any level of confidence so I won't. Thanks for raising this issue and
> > let me write test cases as discussed and educate myself. Once I do that,
> > we can engage in further discussions.
> 
> I think fork + children will still result in zdd->range being unique (albeit
> confused about which mm). You need mremap of some of these mappings to

Agree for fork + child based on initial testing.

> change the addresses and really cause confusion, which I /think/ (but
> didn't test) is doable with a single process even and duplicating anon

Yep, remap changes the address so the range is confusing, and really a size is
sufficient, aligning within the VMA's start / end upon CPU fault. AMD does
this but with a VMA search which I think is a bit overkill.

Matt

> memory mappings with mremap.
> 
> Cheers, Sima
> -- 
> Simona Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 23/28] drm/xe: Add SVM VRAM migration
  2024-09-02 12:48             ` Daniel Vetter
@ 2024-09-02 22:20               ` Matthew Brost
  2024-09-03  8:07                 ` Simona Vetter
  0 siblings, 1 reply; 100+ messages in thread
From: Matthew Brost @ 2024-09-02 22:20 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Thomas Hellström, Christian König, intel-xe, dri-devel,
	airlied, matthew.auld, daniel, Paneer Selvam, Arunpravin

On Mon, Sep 02, 2024 at 02:48:55PM +0200, Daniel Vetter wrote:
> On Thu, Aug 29, 2024 at 10:12:53PM +0000, Matthew Brost wrote:
> > On Thu, Aug 29, 2024 at 01:02:54PM +0200, Daniel Vetter wrote:
> > > On Thu, Aug 29, 2024 at 11:53:58AM +0200, Thomas Hellström wrote:
> > > > But as Sima pointed out in private communication, exhaustive eviction
> > > > is not really needed for faulting to make (crawling) progress.
> > > > Watermarks and VRAM trylock shrinking should suffice, since we're
> > > > strictly only required to service a single gpu page granule at a time.
> > > > 
> > > > However, ordinary bo-based jobs would still like to be able to
> > > > completely evict SVM vram. Whether that is important enough to strive
> > > > for is ofc up for discussion.
> > > 
> > > My take is that you don't win anything for exhaustive eviction by having
> > > the dma_resv somewhere in there for svm allocations. Roughly for split lru
> > > world, where svm ignores bo/dma_resv:
> > > 
> > > When evicting vram from the ttm side we'll fairly switch between selecting
> > > bo and throwing out svm pages. With drm_exec/ww_acquire_ctx selecting bo
> > > will eventually succeed in vacuuming up everything (with a few retries
> > > perhaps, if we're not yet at the head of the ww ticket queue).
> > > 
> > > svm pages we need to try to evict anyway - there's no guarantee, because
> > > the core mm might be holding temporary page references (which block
> > 
> > Yea, but I think you could kill the app then - not suggesting we
> > should but could. To me this is akin to a CPU fault and not being able
> > to migrate the device pages - the migration layer doc says when this
> > happens kick this to user space and segfault the app.
> > 
> > My last patch in the series adds some asserts to see if this ever
> > happens, thus far never. If it occurs we could gracefully handle it by
> > aborting the migration I guess... I think the user really needs to do
> > something a bit crazy to trigger this condition - I don't think the core
> > randomly grabs refs to device pages but could be wrong.
> 
> I think it does :-/
> 
> If you read do_swap_page around ->migrate_to_ram:
> 
> 
> 	get_page(vmf->page);
> 	pte_unmap_unlock(vmf->pte, vmf->ptl);
> 	ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
> 	put_page(vmf->page);
> 

Yep, I've seen this show up in some of my local rework getting MREMAP
working. The common case is a test with a 2M mapping which I call MREMAP
on and then immediately read from the new mapping (the one from MREMAP).
Since MREMAP results in an UNMAP notifier event, one of the possible
solutions I have is to just evict the VRAM via drm_gpusvm_evict_to_sram
upon the UNMAP event. This case races with the fault from user space, so
the evict only moves 511 of the pages while the CPU fault moves 1 page.
This actually seems to be fine though, as the entire VRAM allocation ends
up migrated.
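
For reference, the repro is basically (user space sketch, not the exact
IGT; error handling omitted):

#define _GNU_SOURCE
#include <sys/mman.h>

#define SZ_2M (2ul << 20)

/* buf: 2M buffer already migrated to VRAM by earlier GPU access */
static char mremap_then_read(char *buf)
{
	/* Reserve a destination VA, then force a move onto it */
	void *dst = mmap(NULL, SZ_2M, PROT_NONE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *remapped = mremap(buf, SZ_2M, SZ_2M,
				MREMAP_MAYMOVE | MREMAP_FIXED, dst);

	/*
	 * The move generates an UNMAP notifier event on the old range;
	 * this read then races the eviction that event kicks off.
	 */
	return remapped[0];
}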

> Also the migrate code itself does lock pages. So unless we toss in
> additional locking on top of what core mm does (which I think should be
> enough to cover migration), migrations will temporarily fail. And this is
> just for multiple threads trying to get the same page back to sram, which
> I think is a case we should support because the application did nothing
> wrong.

Yes, I think I've mentioned this already. Multiple CPU faults from
different threads targeting the same range / allocation can race, but
again this actually seems fine too. Each thread gets a semi-random set
of VRAM pages which it migrates, but the end result is that the entire
VRAM allocation is migrated after all racing faults are serviced. I
think the only guarantee when CPU faults race is that the faulting page
for each thread is migrated by that thread.

I have a threaded test which hammers reads on a single 2M migration and
checks every 4k page's data integrity, and it passes reliably. Working on
updating this to a fork version now too.
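
Rough shape of that test (sketch, not the actual IGT):

#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

#define SZ_2M		(2ul << 20)
#define SZ_4K		(4ul << 10)
#define NTHREADS	8

static uint8_t *buf;	/* 2M buffer the GPU has migrated to VRAM */

static void *reader(void *arg)
{
	/* Each read may CPU fault and migrate one page back to SRAM */
	for (size_t page = 0; page < SZ_2M / SZ_4K; page++)
		for (size_t i = 0; i < SZ_4K; i++)
			assert(buf[page * SZ_4K + i] == (uint8_t)page);
	return NULL;
}

int main(void)
{
	pthread_t threads[NTHREADS];

	buf = aligned_alloc(SZ_2M, SZ_2M);
	for (size_t page = 0; page < SZ_2M / SZ_4K; page++)
		for (size_t i = 0; i < SZ_4K; i++)
			buf[page * SZ_4K + i] = (uint8_t)page;

	/* ... GPU reads buf here so the allocation migrates to VRAM ... */

	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&threads[i], NULL, reader, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(threads[i], NULL);
	return 0;
}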

> 
> > > migration) or have the page locked (which also block the migration). But
> > > as long as those two steps succeed, we'll win and get the pages. There
> > > might be some thrashing against concurrent svm faults stealing them again,
> > > but they have a disadvantage since they can't steal dma_resv_locked bo.
> > > And if it's still too much we can stall them in the page allocator.
> > > 
> > > So it's not entirely reliable, but should be close enough.
> > > 
> > > Now for bo based svm the picture isn't any different, because holding
> > > dma_resv is not actually enough to migrate svm mappings. We still need to
> > > hope there's no temporary page references around, and we still need to
> > > succeed at locking the page. And the migration code only does trylocks,
> > > because that's it's deadlock prevent algorithm if different migrations
> > > needing the same set of pages, but acquiring them in a different order. So
> > > we win nothing.
> > 
> > Ok, maybe my statement above is false...
> > 
> > Wouldn't be the only time this falls is if another migration is in
> > flight (e.g. CPU fault) and they race? Then the eviction will naturally
> > happen via refcount being dropped from the other migration. I guess I
> > likely need to update my eviction code to not free the TTM resource if
> > all pages are not migrated.
> 
> Yeah. And additionally core mm relies on some amount of Good Luck here,
> plus the assumption that at least falling back to a single page/folio will
> work out. At least eventually ...
> 
> The trouble is if your design assumes you can migrate an entire block,
> because then if threads hammer that range in different orders you'll never
> make forward progress. Because the core mm code doesn't have a fancy ww
> locking scheme to get out of this, but only uses trylock, plus the
> assumption that falling back to a single page will work out eventually.
> 

Hmm, see above. I think forward progress is made unless I'm completely
missing something.

> Wrt TTM resource refcounting, I think that all looks ok. But maybe I
> checked the wrong things.
> 
> > > Worse, if dma_resv does actually hold up svm migration and reclaim, then
> > > we potentially deadlock because that lock is for a bigger range than
> > > individual pages (or folios). And the core mm assumes that it can get out
> > > of a deadlock bind by (at least stochastically) eventually succeeding in
> > > acquiring/locking down a single page.
> > > 
> > > This means we cannot use dma_resv tricks to give the ttm world an
> > > advantage in exhaustive eviction against concurrent svm faults. Or at
> > > least not more than we can do without by just stalling svm faults that
> > > need to allocate gpu memory (but that must happen without holding locks or
> > > we're busted).
> > > 
> > 
> > I'm a little lost here on the deadlock case. Do you mean when we try to
> > evict SVM BO we trigger reclaim by allocating system pages and can
> > deadlock? Doesn't TTM already have this dependency when evicting non-SVM
> > BOs?
> 
> So you can have multiple cpu threads hammering a given svm range. And
> thanks to the lols of mremap and fork each of them can have a different
> view of that range (they are all obviously different processes from the
> one that has created the gpusvm binding). And if you try to migrate, they
> might all grab the pages in different orders, which can deadlock.
> 

Yes, grabbing locks in different orders would be bad and could deadlock.
But I don't think that will happen; even with a range lock I believe I
have this working, as the range in the zdd points to the originally
allocated range. The MM and start / end can be wrong (with fork /
mremap), so that has to be worked around, which isn't ideal. If the zdd
or VRAM allocation holds the lock and we remove the range from the
migration view, this conceptually makes more sense, which is kinda where
I'm trending if we agree to roughly keep what I have in place, at least
initially.

Touched on this here too [1].

[1] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1112527 
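
FWIW, the CPU fault side lookup is roughly the following (sketch only;
the real vfunc and error handling differ a bit). The point is the
faulting mm / VA is never used to find the range:

static vm_fault_t driver_migrate_to_ram(struct vm_fault *vmf)
{
	struct drm_gpusvm_zdd *zdd = vmf->page->zone_device_data;
	struct drm_gpusvm_range *range = zdd->range;
	struct drm_gpusvm_ctx ctx = {};

	/*
	 * zdd points at the originally allocated range, so fork / mremap
	 * changing the mm or the addresses doesn't matter here.
	 */
	if (drm_gpusvm_migrate_to_sram(range->gpusvm, range, &ctx))
		return VM_FAULT_SIGBUS;

	return 0;
}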

> That's why there's so much retrying and also why core mm only does trylock
> on pages if it grabs an entire pile.
> 
> Now if you have a lock that nests within the page lock you need to trylock
> it, or it deadlocks. Which kinda defeats the point of having a bigger lock
> and moving the entire bo as a unit.
> 
> But if that is outside of the page lock (like amdgpu), you still have the
> issue of the elevated page reference from do_swap_page. Which also blocks
> migration.
> 

See above, it doesn't actually block migration, as each thread still
makes forward progress and collectively they all complete the migration,
at least that is what I'm observing.

> Note that neither is a hard deadlock, as in lockdep complaints, because
> they're all retrying anyway. They're more like lifelocks, and the bigger
> your pile of pages the more likely that you'll always have a failed page
> and need to abort and retry. Which results in threads spinning forever.
> 
> > > So the only benefit I'm seeing is the unified lru, which I'm not sure is
> > > worth it. There's also a bit a lru design tension here, because for the bo
> > 
> > Well also not rewriting the world...
> 
> Yeah it's tough. I'm still at the "understanding all the tradeoffs" stage,
> just to make that clear.

That's basically where I'm at too. Trying to balance between as simple as
possible vs. the dream solution. I wrote this series fairly quickly to
get something working and to help me understand how all of this works.

I've also said this a few times throughout my replies: I really want
UMD / application data to help understand how SVM will be used. I feel
like that information will also help determine some design choices (e.g.,
what to optimize for).

Matt

> -Sima
> 
> > Matt
> > 
> > > world we want objects that are locked to stay on the lru, so that the
> > > competing processes can figure out who has the winning ww ticket. The core
> > > mm design otoh does isolate pages and remove them from the lru when
> > > they're acquired, so that they don't gunk up other processes from trying
> > > to make forward progress and are better hidden. Which reduces temporary
> > > page references (from lru walk) preventing migration and stuff like that.
> > > 
> > > Cheers, Sima
> > > -- 
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
> 
> -- 
> Simona Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 23/28] drm/xe: Add SVM VRAM migration
  2024-09-02 22:20               ` Matthew Brost
@ 2024-09-03  8:07                 ` Simona Vetter
  0 siblings, 0 replies; 100+ messages in thread
From: Simona Vetter @ 2024-09-03  8:07 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Daniel Vetter, Thomas Hellström, Christian König,
	intel-xe, dri-devel, airlied, matthew.auld, daniel,
	Paneer Selvam, Arunpravin

On Mon, Sep 02, 2024 at 10:20:07PM +0000, Matthew Brost wrote:
> On Mon, Sep 02, 2024 at 02:48:55PM +0200, Daniel Vetter wrote:
> > On Thu, Aug 29, 2024 at 10:12:53PM +0000, Matthew Brost wrote:
> > > On Thu, Aug 29, 2024 at 01:02:54PM +0200, Daniel Vetter wrote:
> > > > On Thu, Aug 29, 2024 at 11:53:58AM +0200, Thomas Hellström wrote:
> > > > > But as Sima pointed out in private communication, exhaustive eviction
> > > > > is not really needed for faulting to make (crawling) progress.
> > > > > Watermarks and VRAM trylock shrinking should suffice, since we're
> > > > > strictly only required to service a single gpu page granule at a time.
> > > > > 
> > > > > However, ordinary bo-based jobs would still like to be able to
> > > > > completely evict SVM vram. Whether that is important enough to strive
> > > > > for is ofc up for discussion.
> > > > 
> > > > My take is that you don't win anything for exhaustive eviction by having
> > > > the dma_resv somewhere in there for svm allocations. Roughly for split lru
> > > > world, where svm ignores bo/dma_resv:
> > > > 
> > > > When evicting vram from the ttm side we'll fairly switch between selecting
> > > > bo and throwing out svm pages. With drm_exec/ww_acquire_ctx selecting bo
> > > > will eventually succeed in vacuuming up everything (with a few retries
> > > > perhaps, if we're not yet at the head of the ww ticket queue).
> > > > 
> > > > svm pages we need to try to evict anyway - there's no guarantee, becaue
> > > > the core mm might be holding temporary page references (which block
> > > 
> > > Yea, but think you can could kill the app then - not suggesting we
> > > should but could. To me this is akin to a CPU fault and not being able
> > > to migrate the device pages - the migration layer doc says when this
> > > happens kick this to user space and segfault the app.
> > > 
> > > My last patch in the series adds some asserts to see if this ever
> > > happens, thus far never. If it occurs we could gracefully handle it by
> > > aborting the migration I guess... I think the user really needs to
> > > something a bit crazy to trigger this condition - I don't think the core
> > > randomly grabs refs to device pages but could be wrong.
> > 
> > I think it does :-/
> > 
> > If you read do_swap_page around ->migrate_to_ram:
> > 
> > 
> > 	get_page(vmf->page);
> > 	pte_unmap_unlock(vmf->pte, vmf->ptl);
> > 	ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
> > 	put_page(vmf->page);
> > 
> 
> Yep, I've seen this show in some of local rework getting MREMAP working.
> The common case is I have test with 2M mapping which I call MREMAP on
> and then immediately read from the new mapping (the one from MREMAP).
> Since a MREMAP results in a UNMAP notifier event one of the possible
> solutions I have just evict the VRAM via drm_gpusvm_evict_to_sram upon
> the UNMAP event. This case race with fault from the user space so the
> evict only moves 511 of the pages while the CPU fault moves 1 page. This
> actually seems to be fine though as the entire VRAM allocation is
> migrated.

There's also MREMAP_DONTUNMAP for added fun. So you cannot rely on the
unmap happening I think.

> > Also the migrate code itself does lock pages. So unless we toss in
> > additional locking on top of what core mm does (which I think should be
> > enough to cover migration), migrations will temporarily fail. And this is
> > just for multiple threads trying to get the same page back to sram, which
> > I think is a case we should support because the application did nothing
> > wrong.
> 
> Yes, I think I've mentioned this already. Multiple CPU faults from
> different threads targeting the same range / allocation can race but
> again this actually seems fine too. Each thread gets a semi-random set
> of VRAM pages which it migrates but again the end result the entire
> VRAM allocation is migrated after all racing faults are serviced. I
> think the only guarantee when CPU faults race is the faulting page per
> each thread is migrated in that thread.
> 
> I have threaded test which hammers reads on a single 2M migration which
> checks every 4k page's data integrity that passes reliably. Working on
> updated this to fork version now too.
> 
> > 
> > > > migration) or have the page locked (which also block the migration). But
> > > > as long as those two steps succeed, we'll win and get the pages. There
> > > > might be some thrashing against concurrent svm faults stealing them again,
> > > > but they have a disadvantage since they can't steal dma_resv_locked bo.
> > > > And if it's still too much we can stall them in the page allocator.
> > > > 
> > > > So it's not entirely reliable, but should be close enough.
> > > > 
> > > > Now for bo based svm the picture isn't any different, because holding
> > > > dma_resv is not actually enough to migrate svm mappings. We still need to
> > > > hope there's no temporary page references around, and we still need to
> > > > succeed at locking the page. And the migration code only does trylocks,
> > > > because that's it's deadlock prevent algorithm if different migrations
> > > > needing the same set of pages, but acquiring them in a different order. So
> > > > we win nothing.
> > > 
> > > Ok, maybe my statement above is false...
> > > 
> > > Wouldn't be the only time this falls is if another migration is in
> > > flight (e.g. CPU fault) and they race? Then the eviction will naturally
> > > happen via refcount being dropped from the other migration. I guess I
> > > likely need to update my eviction code to not free the TTM resource if
> > > all pages are not migrated.
> > 
> > Yeah. And additionally core mm relies on some amount of Good Luck here,
> > plus the assumption that at least falling back to a single page/folio will
> > work out. At least eventually ...
> > 
> > The trouble is if your design assumes you can migrate an entire block,
> > because then if threads hammer that range in different orders you'll never
> > make forward progress. Because the core mm code doesn't have a fancy ww
> > locking scheme to get out of this, but only uses trylock, plus the
> > assumption that falling back to a single page will work out eventually.
> > 
> 
> Hmm, see above. I think forward progess is made unless I'm completely
> missing something. 
> 
> > Wrt TTM resource refcounting, I think that all looks ok. But maybe I
> > checked the wrong things.
> > 
> > > > Worse, if dma_resv does actually hold up svm migration and reclaim, then
> > > > we potentially deadlock because that lock is for a bigger range than
> > > > individual pages (or folios). And the core mm assumes that it can get out
> > > > of a deadlock bind by (at least stochastically) eventually succeeding in
> > > > acquiring/locking down a single page.
> > > > 
> > > > This means we cannot use dma_resv tricks to give the ttm world an
> > > > advantage in exhaustive eviction against concurrent svm faults. Or at
> > > > least not more than we can do without by just stalling svm faults that
> > > > need to allocate gpu memory (but that must happen without holding locks or
> > > > we're busted).
> > > > 
> > > 
> > > I'm a little lost here on the deadlock case. Do you mean when we try to
> > > evict SVM BO we trigger reclaim by allocating system pages and can
> > > deadlock? Doesn't TTM already have this dependency when evicting non-SVM
> > > BOs?
> > 
> > So you can have multiple cpu threads hammering a given svm range. And
> > thanks to the lols of mremap and fork each of them can have a different
> > view of that range (they are all obviously different processes from the
> > one that has created the gpusvm binding). And if you try to migrate, they
> > might all grab the pages in different orders, which can deadlock.
> > 
> 
> Yes, grabbing locks in different orders would be bad and that could
> deadlock. But I don't that that will happen, even with a range lock I
> believe I have this working as the range in zdd is pointing the
> originally allocated range. The MM and start / end can be wrong (with
> fork / mremap) so that has to worked around which isn't ideal. If zdd or
> vram allocation has the lock and we remove the range from migration view
> this conceptually makes more sense which kinda where I'm trending if we
> agree to roughly keep what I have in place at least initially.
> 
> Touch on this here too [1].
> 
> [1] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1112527 

So I think there's multiple scenarios:

- You use your mmap_write hack, and don't have forks. There's only one
  migration ever, the write mmap lock will block any concurrent faults. No
  troubles.

- With the migration_mutex, at least how I understand the amdgpu design I
  think there can be trouble, if you throw enough threads at the issue.

  1. Lots of threads fault on different pages in the same range. They all
  grab a page reference from do_swap_page around calling ->migrate_to_ram.

  2. First thing they do is mutex_lock(range->migration_mutex). That
  serializes everything. It also means that the other threads will not
  wait on the migration pte, because that's not visible outside of holding
  the migration_mutex, instead they pile up against that lock. So you have
  pages in that range you cannot migrate, and you need partial migration
  support or you're stuck.

  Unless you force a full eviction for anything partially migrated before
  you try to handle gpu faults, you also need to handle partial migrations
  on the fault side or things won't make forward progress. As soon as you
  allow faults to fully race with migrate_to_ram (with or without
  migration_mutex) you need to support partial migration state because the
  gpu fault side is not in any way better at winning races than the cpu
  fault migration.

  Or you fall back to sram, but for "must be in vram" memory that's not
  going to work too well.

- Next up is migration_mutex with partial migrations, but instead of
  threads hammering different pages in parallel, they hammer the same page
  in parallel. They will all pile up in the mutex_lock(migration_mutex),
  because the migration pte is never visible outside of that lock. And the
  migration won't ever work, because there's other threads that have an
  elevated page reference.

  You can fix that by making the migration_mutex a trylock. At that point
  you've pretty much exactly reinvented the page lock semantics, except
  the lock is for a pile of pages instead of each individually.

  If you're not yet seeing this I think there's not yet enough concurrent
  faulting in your tests going on. And this is a legit use case:
  1. allocate range, move it to gpu
  2. do stuff on gpu
  3. when gpu finishes a thundering herd of cpu threads want to look at
  the result

  Note that even the trylock isn't good enough, because there's a window
  between when do_swap_page elevates the page refcount and when we
  trylock. But if we _only_ use the page lock we could move that trylock
  into do_swap_page, while we hold the cpu pagetable lock, which would
  close the race. Cpu threads that lost the race will simply busy-loop for
  a bit until the migration pte is installed, at which point they'll block
  and wait for the migration to finish.

So I think even if we limit us to legit use-cases the migration_mutex (or
any other physical storage lock that's not the page lock) will just get in
the way eventually. And the core mm page lock really should be enough to
make sure migration doesn't race.

> > That's why there's so much retrying and also why core mm only does trylock
> > on pages if it grabs an entire pile.
> > 
> > Now if you have a lock that nests within the page lock you need to trylock
> > it, or it deadlocks. Which kinda defeats the point of having a bigger lock
> > and moving the entire bo as a unit.
> > 
> > But if that is outside of the page lock (like amdgpu), you still have the
> > issue of the elevated page reference from do_swap_page. Which also blocks
> > migration.
> > 
> 
> See above, it doesn't actually block migration as each thread still make
> forward progress and collectively all complete the migration, at least
> that is what I'm observing.
> 
> > Note that neither is a hard deadlock, as in lockdep complaints, because
> > they're all retrying anyway. They're more like lifelocks, and the bigger
> > your pile of pages the more likely that you'll always have a failed page
> > and need to abort and retry. Which results in threads spinning forever.
> > 
> > > > So the only benefit I'm seeing is the unified lru, which I'm not sure is
> > > > worth it. There's also a bit a lru design tension here, because for the bo
> > > 
> > > Well also not rewriting the world...
> > 
> > Yeah it's tough. I'm still at the "understanding all the tradeoffs" stage,
> > just to make that clear.
> 
> That's basically where I'm at too. Trying balance between simple as
> possible vs. dream solution. Wrote this series fairly quickly to what I
> could get working and help me understand how all of this works. 
> 
> I've also said this a few time throughout my replies, also really want
> UMD / application data to help understand how SVM will be used too. Feel
> like that information will also help determine some design choices (e.g.
> what to optimize for).

My thinking is a bit different: the ttm bo world is, in a lot of its
semantics, fundamentally at odds with core mm. And I didn't realize this fully until
this recent thread, but migrate_device.c and hmm.c really inflict most of
the bonkers "we love all the races" mm semantics onto drivers. Which means
we have a huge semantic design conflict, and the question is where should
we solve it. There's a range, but the two extremes are roughly:

- Try to make svm look as much as possible like the bo+userptr world. This
  means lots of tension within our code, and the risk that we design
  ourselves into a corner we cannot fix. Like we cannot trylock the
  range->migration_mutex while holding the cpu pagetable lock in
  do_swap_page, so we'd need a different fix for that, and I haven't come
  up with one yet.

- Just adopt core mm design, and end up with a bunch of duplicated code.
  This means tension between these two worlds, but there's a clear design
  to address that from core mm (shrinkers for ttm bo, page reclaim for
  svm, running in a loop applying equal eviction pressure to both). So
  much cleaner separation, and really well structured interaction between
  the two worlds like we have on the igpu side already for managing sram.
  But it comes at the cost of more code.

Cheers, Sima
-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory
  2024-09-02 12:33           ` Daniel Vetter
@ 2024-09-04 12:27             ` Thomas Hellström
  2024-09-24  8:41               ` Simona Vetter
  0 siblings, 1 reply; 100+ messages in thread
From: Thomas Hellström @ 2024-09-04 12:27 UTC (permalink / raw)
  To: Daniel Vetter, Matthew Brost
  Cc: intel-xe, dri-devel, airlied, christian.koenig, matthew.auld,
	daniel

Hi, Sima,

On Mon, 2024-09-02 at 14:33 +0200, Daniel Vetter wrote:
> Jumping in here in the middle, since I think it's a solid place to
> drop my
> idea of "align with core mm" gpusvm locking ...
> 
> On Thu, Aug 29, 2024 at 08:56:23PM +0000, Matthew Brost wrote:
> > On Thu, Aug 29, 2024 at 09:18:29PM +0200, Thomas Hellström wrote:
> > Issues with removing a SVM range:
> > 
> > - Xe bind code stores invalidation / present state in VMA, this
> > would
> >   need to be moved to the radix tree. I have Jira open for that
> > work
> >   which I believe other developers are going to own.
> > - Where would the dma mapping / device pages be stored?
> > 	- In the radix tree? What if ATS is enabled? We don't have
> > a
> > 	  driver owned radix tree. How do we reasonably connect a
> > driver
> > 	  owned radix to a common GPUSVM layer?
> 
> Yeah this one is really annoying, because the core mm gets away with
> nothing because it can just store the pfn in the pte. And it doesn't
> need
> anything else. So we probably still need something unfortuantely ...
> 
> > 	- In the notifier? What is the notifier is sparsely
> > populated?
> > 	  We would be wasting huge amounts of memory. What is the
> > 	  notifier is configured to span the entire virtual
> > address
> > 	  space?
> 
> So if we go with the radix idea, we could model the radix to exactly
> match
> the gpu pagetables. That's essentially what the core mm does. Then
> each
> pagetable at each level has a spinlock for essentially a range lock.
> notifier seqno would be stored into each pagetable (not the
> endividual
> entries, that's probably too much), which should allow us to very
> effeciently check whether an entire arbitrary va range is still valid
> on
> the fault side.

I still wonder whether this should be owned by the driver, though. And
if we were optimizing for multiple simultaneous fault processing with a
small granularity, I would agree, but given that gpu pagefaults are
considered so slow they should be avoided, I wonder whether xe's
current approach of a single page-table lock wouldn't suffice, in
addition to a semi-global seqno?

For invalidations, I think we actually currently allow simultaneous
overlapping invalidations that are only protected by the write-side of
the notifier seqno.

> 
> On the notifier side we can also very efficiently walk arbitrary
> ranges,
> because the locking is really fine-grained and in an adaptive way.
> 
> > - How does the garbage collector work? We can't allocate memory in
> > the
> >   notifier so we don't anything to add to the garbage collector. We
> >   can't directly modify page tables given you need lock in the path
> > of
> >   reclaim.
> 
> Probably no more garbage collector, you deal with pages/folios like
> the
> core mm expects.

Yeah, if the page-table locks are reclaim-safe, no more garbage
collector, but OTOH, IIRC even in core-mm the invalidation
counterpart, unmap_mapping_range(), can't and doesn't remove page-table
subtrees when called from the address-space side, whereas zapping when
called from the mm side, like madvise(MADV_DONTNEED), can.

/Thomas



> 
> > - How do we deal with fault storms (e.g. tons of faults hitting the
> > same
> >   SVM range in a row)? Without a SVM range no every to know if
> > mapping
> >   is valid and GPU page handler can be short circuited.
> 
> So the core mm sorts this out by allowing faults to be handled in
> parallel, without any lock. Essentially:
> - you get a fault (or prefault)
> - you hold just enough read locks to make sure stuff doesn't
> disappear.
>   Currently that's mmap_read_lock, but strictly speaking we only need
> the
>   new-ish per-vma lock.
> - you allocate memory, dma_map, everything else you need
> - you grab that very fine-grained radix tree lock (pagetable locks on
> the
>   cpu side) and recheck whether you've raced: mmu notifier seqno and
> the
>   pte must still be non-present. If that check fails, you bail out
> and
>   release all the vram/dma_maps you've created.
> 
> > - Do we have notifier seqno for every PTE?
> 
> I think per-pagetable, so every node in the radix tree, would make
> sense.
> If we go with also one lock per pagetable like the cpu mm then
> tracking
> notifier seqno to match makes the most sense imo.
> 
> Again, this is entirely aside from the discussion in this subthread
> about
> understanding the current design and tradeoffs/reasons. Just figured
> this
> is a good spot to drop this.
> -Sima


^ permalink raw reply	[flat|nested] 100+ messages in thread

* RE: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory
  2024-08-28  2:48 ` [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory Matthew Brost
                     ` (4 preceding siblings ...)
  2024-08-30  9:16   ` Thomas Hellström
@ 2024-09-06 18:41   ` Zeng, Oak
  2024-09-24  9:25     ` Simona Vetter
  2024-09-24 10:42   ` Thomas Hellström
  2024-10-09 10:50   ` Thomas Hellström
  7 siblings, 1 reply; 100+ messages in thread
From: Zeng, Oak @ 2024-09-06 18:41 UTC (permalink / raw)
  To: Brost, Matthew, intel-xe@lists.freedesktop.org,
	dri-devel@lists.freedesktop.org, thomas.hellstrom@linux.intel.com,
	Auld,  Matthew, daniel@ffwll.ch
  Cc: airlied@gmail.com, christian.koenig@amd.com,
	thomas.hellstrom@linux.intel.com, Auld, Matthew, daniel@ffwll.ch

There are fundamental design conflicts with what we have aligned on; see inline.

> -----Original Message-----
> From: Intel-xe <intel-xe-bounces@lists.freedesktop.org> On Behalf
> Of Matthew Brost
> Sent: Tuesday, August 27, 2024 10:49 PM
> To: intel-xe@lists.freedesktop.org; dri-devel@lists.freedesktop.org
> Cc: airlied@gmail.com; christian.koenig@amd.com;
> thomas.hellstrom@linux.intel.com; Auld, Matthew
> <matthew.auld@intel.com>; daniel@ffwll.ch
> Subject: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU
> Shared Virtual Memory
> 
> This patch introduces support for GPU Shared Virtual Memory (SVM)
> in the
> Direct Rendering Manager (DRM) subsystem. SVM allows for
> seamless
> sharing of memory between the CPU and GPU, enhancing
> performance and
> flexibility in GPU computing tasks.
> 
> The patch adds the necessary infrastructure for SVM, including data
> structures and functions for managing SVM ranges and notifiers. It
> also
> provides mechanisms for allocating, deallocating, and migrating
> memory
> regions between system RAM and GPU VRAM.
> 
> This mid-layer is largely inspired by GPUVM.
> 
> Cc: Dave Airlie <airlied@redhat.com>
> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: <dri-devel@lists.freedesktop.org>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/xe/Makefile     |    3 +-
>  drivers/gpu/drm/xe/drm_gpusvm.c | 2174
> +++++++++++++++++++++++++++++++
>  drivers/gpu/drm/xe/drm_gpusvm.h |  415 ++++++
>  3 files changed, 2591 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.c
>  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.h
> 
> diff --git a/drivers/gpu/drm/xe/Makefile
> b/drivers/gpu/drm/xe/Makefile
> index b9670ae09a9e..b8fc2ee58f1a 100644
> --- a/drivers/gpu/drm/xe/Makefile
> +++ b/drivers/gpu/drm/xe/Makefile
> @@ -25,7 +25,8 @@ $(obj)/generated/%_wa_oob.c
> $(obj)/generated/%_wa_oob.h: $(obj)/xe_gen_wa_oob \
> 
>  # core driver code
> 
> -xe-y += xe_bb.o \
> +xe-y += drm_gpusvm.o \
> +	xe_bb.o \
>  	xe_bo.o \
>  	xe_bo_evict.o \
>  	xe_devcoredump.o \
> diff --git a/drivers/gpu/drm/xe/drm_gpusvm.c
> b/drivers/gpu/drm/xe/drm_gpusvm.c
> new file mode 100644
> index 000000000000..fc1e44e6ae72
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/drm_gpusvm.c
> @@ -0,0 +1,2174 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2024 Intel Corporation
> + *
> + * Authors:
> + *     Matthew Brost <matthew.brost@intel.com>
> + */
> +
> +#include <linux/dma-mapping.h>
> +#include <linux/interval_tree_generic.h>
> +#include <linux/hmm.h>
> +#include <linux/memremap.h>
> +#include <linux/migrate.h>
> +#include <linux/mm_types.h>
> +#include <linux/pagemap.h>
> +#include <linux/slab.h>
> +
> +#include <drm/drm_device.h>
> +#include "drm_gpusvm.h"
> +
> +/**
> + * DOC: Overview
> + *
> + * GPU Shared Virtual Memory (GPU SVM) layer for the Direct
> Rendering Manager (DRM)
> + *
> + * The GPU SVM layer is a component of the DRM framework
> designed to manage shared
> + * virtual memory between the CPU and GPU. It enables efficient
> data exchange and
> + * processing for GPU-accelerated applications by allowing memory
> sharing and
> + * synchronization between the CPU's and GPU's virtual address
> spaces.
> + *
> + * Key GPU SVM Components:
> + * - Notifiers: Notifiers: Used for tracking memory intervals and
> notifying the
> + *		GPU of changes, notifiers are sized based on a GPU
> SVM
> + *		initialization parameter, with a recommendation of
> 512M or
> + *		larger. They maintain a Red-BlacK tree and a list of
> ranges that
> + *		fall within the notifier interval. Notifiers are tracked
> within
> + *		a GPU SVM Red-BlacK tree and list and are
> dynamically inserted
> + *		or removed as ranges within the interval are created
> or
> + *		destroyed.
> + * - Ranges: Represent memory ranges mapped in a DRM device and
> managed
> + *	     by GPU SVM. 


This svm_range concept has introduced a lot of code duplication in xekmd,
indicating that this is a wrong design. I think one of the design
principles is to reuse, not to duplicate.

Look at patches 9 and 11: a bunch of duplicated code for page table
updates, invalidation, and the page fault handler.

I had this range concept in v1 [1], but after we agreed to unify the svm
and userptr code during review, I dropped this concept, and the xe_svm
concept, which ends up with much less duplicated code in v2 [2]. I will
say more below on why I think the svm concept can also be removed.

Conceptually a vma represents a range. Why duplicate?

[1] https://patchwork.freedesktop.org/patch/574898/?series=128910&rev=1
[2] https://patchwork.freedesktop.org/series/132229/


They are sized based on an array of chunk
> sizes, which
> + *	     is a GPU SVM initialization parameter, and the CPU address
> space.
> + *	     Upon GPU fault, the largest aligned chunk that fits within
> the
> + *	     faulting CPU address space is chosen for the range size.
> Ranges are
> + *	     expected to be dynamically allocated on GPU fault and
> removed on an
> + *	     MMU notifier UNMAP event. As mentioned above, ranges
> are tracked in
> + *	     a notifier's Red-Black tree.
> + * - Operations: Define the interface for driver-specific SVM
> operations such as
> + *		 allocation, page collection, migration, invalidations,
> and VRAM
> + *		 release.
> + *
> + * This layer provides interfaces for allocating, mapping, migrating,
> and
> + * releasing memory ranges between the CPU and GPU. It handles
> all core memory
> + * management interactions (DMA mapping, HMM, and migration)
> and provides
> + * driver-specific virtual functions (vfuncs). This infrastructure is
> sufficient
> + * to build the expected driver components for an SVM
> implementation as detailed
> + * below.
> + *
> + * Expected Driver Components:
> + * - GPU page fault handler: Used to create ranges and notifiers
> based on the
> + *			     fault address, optionally migrate the range
> to
> + *			     VRAM, and create GPU bindings.
> + * - Garbage collector: Used to destroy GPU bindings for ranges.
> Ranges are
> + *			expected to be added to the garbage collector
> upon
> + *			MMU_NOTIFY_UNMAP event.
> + */
> +
> +/**
> + * DOC: Locking
> + *
> + * GPU SVM handles locking for core MM interactions, i.e., it
> locks/unlocks the
> + * mmap lock as needed. Alternatively, if the driver prefers to
> handle the mmap
> + * lock itself, a 'locked' argument is provided to the functions that
> require
> + * the mmap lock. This option may be useful for drivers that need to
> call into
> + * GPU SVM while also holding a dma-resv lock, thus preventing
> locking
> + * inversions between the mmap and dma-resv locks.
> + *
> + * GPU SVM introduces a global notifier lock, which safeguards the
> notifier's
> + * range RB tree and list, as well as the range's DMA mappings and
> sequence
> + * number. GPU SVM manages all necessary locking and unlocking
> operations,
> + * except for the recheck of the range's sequence number
> + * (mmu_interval_read_retry) when the driver is committing GPU
> bindings. This
> + * lock corresponds to the 'driver->update' lock mentioned in the
> HMM
> + * documentation (TODO: Link). Future revisions may transition from
> a GPU SVM
> + * global lock to a per-notifier lock if finer-grained locking is deemed
> + * necessary.
> + *
> + * In addition to the locking mentioned above, the driver should
> implement a
> + * lock to safeguard core GPU SVM function calls that modify state,
> such as
> + * drm_gpusvm_range_find_or_insert and
> drm_gpusvm_range_remove. Alternatively,
> + * these core functions can be called within a single kernel thread,
> for
> + * instance, using an ordered work queue. This lock is denoted as
> + * 'driver_svm_lock' in code examples.
> + */
> +
> +/**
> + * DOC: Migrataion
> + *
> + * The migration support is quite simple, allowing migration between
> SRAM and
> + * VRAM at the range granularity. For example, GPU SVM currently
> does not
> + * support mixing SRAM and VRAM pages within a range. This means
> that upon GPU
> + * fault, the entire range can be migrated to VRAM, and upon CPU
> fault, the
> + * entire range is migrated to SRAM.
> + *
> + * The reasoning for only supporting range granularity is as follows: it
> + * simplifies the implementation, and range sizes are driver-defined
> and should
> + * be relatively small.

Migration at range granularity just couples the physical world with the
virtual world, which is against the fundamental page-centric design we
aligned on before.

Looking at core mm behavior, shrinking/swapping doesn't operate at vma or
any virtual range granularity. This way we swap out the less frequently
used pages and keep the more frequently used pages in ram.

A similar thing should be done for vram-to-sram migration.

> + */
> +
> +/**
> + * DOC: Partial Unmapping of Ranges
> + *
> + * Partial unmapping of ranges (e.g., 1M out of 2M is unmapped by
> CPU resulting
> + * in MMU_NOTIFY_UNMAP event) presents several challenges,

As said above, the challenge comes from a design choice. In a
page-centric design, the challenges don't exist at all.



> with the main one
> + * being that a subset of the range still has CPU and GPU mappings.
> If the
> + * backing store for the range is in VRAM, a subset of the backing
> store has
> + * references. One option would be to split the range and VRAM
> backing store,
> + * but the implementation for this would be quite complicated.
> Given that
> + * partial unmappings are rare and driver-defined range sizes are
> relatively
> + * small, GPU SVM does not support splitting of ranges.
> + *
> + * With no support for range splitting, upon partial unmapping of a
> range, the
> + * driver is expected to invalidate and destroy the entire range. If
> the range
> + * has VRAM as its backing, the driver is also expected to migrate any
> remaining
> + * pages back to SRAM.
> + */
> +
> +/**
> + * DOC: Examples
> + *
> + * This section provides two examples of how to build the expected
> driver
> + * components: the GPU page fault handler and the garbage
> collector. A third
> + * example demonstrates a sample invalidation driver vfunc.
> + *
> + * The generic code provided does not include logic for complex
> migration
> + * policies, optimized invalidations, or other potentially required
> driver
> + * locking (e.g., DMA-resv locks).
> + *
> + * 1) GPU page fault handler
> + *
> + *	int driver_bind_range(struct drm_gpusvm *gpusvm, struct
> drm_gpusvm_range *range)
> + *	{
> + *		int err = 0;
> + *
> + *		driver_alloc_and_setup_memory_for_bind(gpusvm,
> range);
> + *
> + *		drm_gpusvm_notifier_lock(gpusvm);
> + *		if (drm_gpusvm_range_pages_valid(range))
> + *			driver_commit_bind(gpusvm, range);
> + *		else
> + *			err = -EAGAIN;
> + *		drm_gpusvm_notifier_unlock(gpusvm);
> + *
> + *		return err;
> + *	}
> + *
> + *	int driver_gpu_fault(struct drm_gpusvm *gpusvm, u64
> fault_addr,
> + *			     u64 gpuva_start, u64 gpuva_end)
> + *	{
> + *		struct drm_gpusvm_ctx ctx = {};
> + *		int err;
> + *
> + *		driver_svm_lock();
> + *	retry:
> + *		// Always process UNMAPs first so view of GPU SVM
> ranges is current
> + *		driver_garbage_collector(gpusvm);
> + *
> + *		range = drm_gpusvm_range_find_or_insert(gpusvm,
> fault_addr,
> + *							gpuva_start,
> gpuva_end,
> + *						        &ctx);
> + *		if (IS_ERR(range)) {
> + *			err = PTR_ERR(range);
> + *			goto unlock;
> + *		}
> + *
> + *		if (driver_migration_policy(range)) {
> + *			bo = driver_alloc_bo();
> + *			err = drm_gpusvm_migrate_to_vram(gpusvm,
> range, bo, &ctx);
> + *			if (err)	// CPU mappings may have changed
> + *				goto retry;
> + *		}
> + *
> + *		err = drm_gpusvm_range_get_pages(gpusvm, range,
> &ctx);
> + *		if (err == -EFAULT || err == -EPERM)	// CPU
> mappings changed
> + *			goto retry;
> + *		else if (err)
> + *			goto unlock;
> + *
> + *		err = driver_bind_range(gpusvm, range);
> + *		if (err == -EAGAIN)	// CPU mappings changed
> + *			goto retry
> + *
> + *	unlock:
> + *		driver_svm_unlock();
> + *		return err;
> + *	}
> + *
> + * 2) Garbage Collector.
> + *
> + *	void __driver_garbage_collector(struct drm_gpusvm
> *gpusvm,
> + *					struct drm_gpusvm_range
> *range)
> + *	{
> + *		struct drm_gpusvm_ctx ctx = {};
> + *
> + *		assert_driver_svm_locked(gpusvm);
> + *
> + *		// Partial unmap, migrate any remaining VRAM pages
> back to SRAM
> + *		if (range->flags.partial_unmap)
> + *			drm_gpusvm_migrate_to_sram(gpusvm,
> range, &ctx);
> + *
> + *		driver_unbind_range(range);
> + *		drm_gpusvm_range_remove(gpusvm, range);
> + *	}
> + *
> + *	void driver_garbage_collector(struct drm_gpusvm *gpusvm)
> + *	{
> + *		assert_driver_svm_locked(gpusvm);
> + *
> + *		for_each_range_in_garbage_collector(gpusvm, range)
> + *			__driver_garbage_collector(gpusvm, range);
> + *	}
> + *
> + * 3) Invalidation driver vfunc.
> + *
> + *	void driver_invalidation(struct drm_gpusvm *gpusvm,
> + *				 struct drm_gpusvm_notifier *notifier,
> + *				 const struct mmu_notifier_range
> *mmu_range)
> + *	{
> + *		struct drm_gpusvm_ctx ctx = { .in_notifier = true, };
> + *		struct drm_gpusvm_range *range = NULL;
> + *
> + *		driver_invalidate_device_tlb(gpusvm, mmu_range-
> >start, mmu_range->end);
> + *
> + *		drm_gpusvm_for_each_range(range, notifier,
> mmu_range->start,
> + *					  mmu_range->end) {
> + *			drm_gpusvm_range_unmap_pages(gpusvm,
> range, &ctx);
> + *
> + *			if (mmu_range->event !=
> MMU_NOTIFY_UNMAP)
> + *				continue;
> + *
> + *			drm_gpusvm_range_set_unmapped(range,
> mmu_range);
> + *			driver_garbage_collector_add(gpusvm,
> range);
> + *		}
> + *	}
> + */
> +
> +#define DRM_GPUSVM_RANGE_START(_range)	((_range)-
> >va.start)
> +#define DRM_GPUSVM_RANGE_END(_range)	((_range)-
> >va.end - 1)
> +INTERVAL_TREE_DEFINE(struct drm_gpusvm_range, rb.node, u64,
> rb.__subtree_last,
> +		     DRM_GPUSVM_RANGE_START,
> DRM_GPUSVM_RANGE_END,
> +		     static __maybe_unused, range);
> +
> +#define DRM_GPUSVM_NOTIFIER_START(_notifier)	((_notifier)-
> >interval.start)
> +#define DRM_GPUSVM_NOTIFIER_END(_notifier)	((_notifier)-
> >interval.end - 1)
> +INTERVAL_TREE_DEFINE(struct drm_gpusvm_notifier, rb.node, u64,
> +		     rb.__subtree_last,
> DRM_GPUSVM_NOTIFIER_START,
> +		     DRM_GPUSVM_NOTIFIER_END, static
> __maybe_unused, notifier);
> +
> +/**
> + * npages_in_range() - Calculate the number of pages in a given
> range
> + * @start__: The start address of the range
> + * @end__: The end address of the range
> + *
> + * This macro calculates the number of pages in a given memory
> range,
> + * specified by the start and end addresses. It divides the difference
> + * between the end and start addresses by the page size
> (PAGE_SIZE) to
> + * determine the number of pages in the range.
> + *
> + * Return: The number of pages in the specified range.
> + */
> +#define npages_in_range(start__, end__)	\
> +	(((end__) - (start__)) >> PAGE_SHIFT)
> +
> +/**
> + * struct drm_gpusvm_zdd - GPU SVM zone device data
> + *
> + * @refcount: Reference count for the zdd
> + * @destroy_work: Work structure for asynchronous zdd
> destruction
> + * @range: Pointer to the GPU SVM range
> + * @vram_allocation: Driver-private pointer to the VRAM allocation
> + *
> + * This structure serves as a generic wrapper installed in
> + * page->zone_device_data. It provides infrastructure for looking up
> a range
> + * upon CPU page fault and asynchronously releasing VRAM once
> the CPU has no
> + * page references. Asynchronous release is useful because CPU
> page references
> + * can be dropped in IRQ contexts, while releasing VRAM likely
> requires sleeping
> + * locks.
> + */
> +struct drm_gpusvm_zdd {
> +	struct kref refcount;
> +	struct work_struct destroy_work;
> +	struct drm_gpusvm_range *range;
> +	void *vram_allocation;
> +};
> +
> +/**
> + * drm_gpusvm_zdd_destroy_work_func - Work function for
> destroying a zdd
> + * @w: Pointer to the work_struct
> + *
> + * This function releases VRAM, puts GPU SVM range, and frees zdd.
> + */
> +static void drm_gpusvm_zdd_destroy_work_func(struct
> work_struct *w)
> +{
> +	struct drm_gpusvm_zdd *zdd =
> +		container_of(w, struct drm_gpusvm_zdd,
> destroy_work);
> +	struct drm_gpusvm_range *range = zdd->range;
> +	struct drm_gpusvm *gpusvm = range->gpusvm;
> +
> +	if (gpusvm->ops->vram_release && zdd->vram_allocation)
> +		gpusvm->ops->vram_release(zdd->vram_allocation);
> +	drm_gpusvm_range_put(range);
> +	kfree(zdd);
> +}
> +
> +/**
> + * drm_gpusvm_zdd_alloc - Allocate a zdd structure.
> + * @range: Pointer to the GPU SVM range.
> + *
> + * This function allocates and initializes a new zdd structure. It sets
> up the
> + * reference count, initializes the destroy work, and links the
> provided GPU SVM
> + * range.
> + *
> + * Returns:
> + * Pointer to the allocated zdd on success, ERR_PTR() on failure.
> + */
> +static struct drm_gpusvm_zdd *
> +drm_gpusvm_zdd_alloc(struct drm_gpusvm_range *range)
> +{
> +	struct drm_gpusvm_zdd *zdd;
> +
> +	zdd = kmalloc(sizeof(*zdd), GFP_KERNEL);
> +	if (!zdd)
> +		return NULL;
> +
> +	kref_init(&zdd->refcount);
> +	INIT_WORK(&zdd->destroy_work,
> drm_gpusvm_zdd_destroy_work_func);
> +	zdd->range = drm_gpusvm_range_get(range);
> +	zdd->vram_allocation = NULL;
> +
> +	return zdd;
> +}
> +
> +/**
> + * drm_gpusvm_zdd_get - Get a reference to a zdd structure.
> + * @zdd: Pointer to the zdd structure.
> + *
> + * This function increments the reference count of the provided zdd
> structure.
> + *
> + * Returns: Pointer to the zdd structure.
> + */
> +static struct drm_gpusvm_zdd *drm_gpusvm_zdd_get(struct
> drm_gpusvm_zdd *zdd)
> +{
> +	kref_get(&zdd->refcount);
> +	return zdd;
> +}
> +
> +/**
> + * drm_gpusvm_zdd_destroy - Destroy a zdd structure.
> + * @ref: Pointer to the reference count structure.
> + *
> + * This function queues the destroy_work of the zdd for
> asynchronous destruction.
> + */
> +static void drm_gpusvm_zdd_destroy(struct kref *ref)
> +{
> +	struct drm_gpusvm_zdd *zdd =
> +		container_of(ref, struct drm_gpusvm_zdd, refcount);
> +	struct drm_gpusvm *gpusvm = zdd->range->gpusvm;
> +
> +	queue_work(gpusvm->zdd_wq, &zdd->destroy_work);
> +}
> +
> +/**
> + * drm_gpusvm_zdd_put - Put a zdd reference.
> + * @zdd: Pointer to the zdd structure.
> + *
> + * This function decrements the reference count of the provided zdd
> structure
> + * and schedules its destruction if the count drops to zero.
> + */
> +static void drm_gpusvm_zdd_put(struct drm_gpusvm_zdd *zdd)
> +{
> +	kref_put(&zdd->refcount, drm_gpusvm_zdd_destroy);
> +}
> +
> +/**
> + * drm_gpusvm_range_find - Find GPU SVM range from GPU SVM
> notifier
> + * @notifier: Pointer to the GPU SVM notifier structure.
> + * @start: Start address of the range
> + * @end: End address of the range
> + *
> + * Return: A pointer to the drm_gpusvm_range if found or NULL
> + */
> +struct drm_gpusvm_range *
> +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64
> start, u64 end)
> +{
> +	return range_iter_first(&notifier->root, start, end - 1);
> +}
> +
> +/**
> + * drm_gpusvm_for_each_range_safe - Safely iterate over GPU
> SVM ranges in a notifier
> + * @range__: Iterator variable for the ranges
> + * @next__: Iterator variable for the ranges temporay storage
> + * @notifier__: Pointer to the GPU SVM notifier
> + * @start__: Start address of the range
> + * @end__: End address of the range
> + *
> + * This macro is used to iterate over GPU SVM ranges in a notifier
> while
> + * removing ranges from it.
> + */
> +#define drm_gpusvm_for_each_range_safe(range__, next__,
> notifier__, start__, end__)	\
> +	for ((range__) = drm_gpusvm_range_find((notifier__),
> (start__), (end__)),	\
> +	     (next__) = __drm_gpusvm_range_next(range__);
> 			\
> +	     (range__) && (range__->va.start < (end__));
> 			\
> +	     (range__) = (next__), (next__) =
> __drm_gpusvm_range_next(range__))
> +
> +/**
> + * __drm_gpusvm_notifier_next - get the next
> drm_gpusvm_notifier in the list
> + * @notifier: a pointer to the current drm_gpusvm_notifier
> + *
> + * Return: A pointer to the next drm_gpusvm_notifier if available, or
> NULL if
> + *         the current notifier is the last one or if the input notifier is
> + *         NULL.
> + */
> +static struct drm_gpusvm_notifier *
> +__drm_gpusvm_notifier_next(struct drm_gpusvm_notifier
> *notifier)
> +{
> +	if (notifier && !list_is_last(&notifier->rb.entry,
> +				      &notifier->gpusvm->notifier_list))
> +		return list_next_entry(notifier, rb.entry);
> +
> +	return NULL;
> +}
> +
> +/**
> + * drm_gpusvm_for_each_notifier - Iterate over GPU SVM notifiers
> in a gpusvm
> + * @notifier__: Iterator variable for the notifiers
> + * @notifier__: Pointer to the GPU SVM notifier
> + * @start__: Start address of the notifier
> + * @end__: End address of the notifier
> + *
> + * This macro is used to iterate over GPU SVM notifiers in a gpusvm.
> + */
> +#define drm_gpusvm_for_each_notifier(notifier__, gpusvm__,
> start__, end__)		\
> +	for ((notifier__) = notifier_iter_first(&(gpusvm__)->root,
> (start__), (end__) - 1);	\
> +	     (notifier__) && (notifier__->interval.start < (end__));
> 			\
> +	     (notifier__) = __drm_gpusvm_notifier_next(notifier__))
> +
> +/**
> + * drm_gpusvm_for_each_notifier_safe - Safely iterate over GPU
> SVM notifiers in a gpusvm
> + * @notifier__: Iterator variable for the notifiers
> + * @next__: Iterator variable for the notifiers temporay storage
> + * @notifier__: Pointer to the GPU SVM notifier
> + * @start__: Start address of the notifier
> + * @end__: End address of the notifier
> + *
> + * This macro is used to iterate over GPU SVM notifiers in a gpusvm
> while
> + * removing notifiers from it.
> + */
> +#define drm_gpusvm_for_each_notifier_safe(notifier__, next__,
> gpusvm__, start__, end__)	\
> +	for ((notifier__) = notifier_iter_first(&(gpusvm__)->root,
> (start__), (end__) - 1),	\
> +	     (next__) = __drm_gpusvm_notifier_next(notifier__);
> 				\
> +	     (notifier__) && (notifier__->interval.start < (end__));
> 			\
> +	     (notifier__) = (next__), (next__) =
> __drm_gpusvm_notifier_next(notifier__))
> +
> +/**
> + * drm_gpusvm_notifier_invalidate - Invalidate a GPU SVM notifier.
> + * @mni: Pointer to the mmu_interval_notifier structure.
> + * @mmu_range: Pointer to the mmu_notifier_range structure.
> + * @cur_seq: Current sequence number.
> + *
> + * This function serves as a generic MMU notifier for GPU SVM. It
> sets the MMU
> + * notifier sequence number and calls the driver invalidate vfunc
> under
> + * gpusvm->notifier_lock.
> + *
> + * Returns:
> + * true if the operation succeeds, false otherwise.
> + */
> +static bool
> +drm_gpusvm_notifier_invalidate(struct mmu_interval_notifier *mni,
> +			       const struct mmu_notifier_range
> *mmu_range,
> +			       unsigned long cur_seq)
> +{
> +	struct drm_gpusvm_notifier *notifier =
> +		container_of(mni, typeof(*notifier), notifier);
> +	struct drm_gpusvm *gpusvm = notifier->gpusvm;
> +
> +	if (!mmu_notifier_range_blockable(mmu_range))
> +		return false;
> +
> +	down_write(&gpusvm->notifier_lock);
> +	mmu_interval_set_seq(mni, cur_seq);
> +	gpusvm->ops->invalidate(gpusvm, notifier, mmu_range);
> +	up_write(&gpusvm->notifier_lock);
> +
> +	return true;
> +}
> +
> +/**
> + * drm_gpusvm_notifier_ops - MMU interval notifier operations for
> GPU SVM
> + */
> +static const struct mmu_interval_notifier_ops
> drm_gpusvm_notifier_ops = {
> +	.invalidate = drm_gpusvm_notifier_invalidate,
> +};
> +
> +/**
> + * drm_gpusvm_init - Initialize the GPU SVM.
> + * @gpusvm: Pointer to the GPU SVM structure.
> + * @name: Name of the GPU SVM.
> + * @drm: Pointer to the DRM device structure.
> + * @mm: Pointer to the mm_struct for the address space.
> + * @device_private_page_owner: Device private pages owner.
> + * @mm_start: Start address of GPU SVM.
> + * @mm_range: Range of the GPU SVM.
> + * @notifier_size: Size of individual notifiers.
> + * @ops: Pointer to the operations structure for GPU SVM.
> + * @chunk_sizes: Pointer to the array of chunk sizes used in range
> allocation.
> + *               Entries should be powers of 2 in descending order with last
> + *               entry being SZ_4K.
> + * @num_chunks: Number of chunks.
> + *
> + * This function initializes the GPU SVM.
> + *
> + * Returns:
> + * 0 on success, a negative error code on failure.
> + */
> +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> +		    const char *name, struct drm_device *drm,
> +		    struct mm_struct *mm, void
> *device_private_page_owner,
> +		    u64 mm_start, u64 mm_range, u64 notifier_size,
> +		    const struct drm_gpusvm_ops *ops,
> +		    const u64 *chunk_sizes, int num_chunks)
> +{
> +	if (!ops->invalidate || !num_chunks)
> +		return -EINVAL;
> +
> +	gpusvm->name = name;
> +	gpusvm->drm = drm;
> +	gpusvm->mm = mm;
> +	gpusvm->device_private_page_owner =
> device_private_page_owner;
> +	gpusvm->mm_start = mm_start;
> +	gpusvm->mm_range = mm_range;
> +	gpusvm->notifier_size = notifier_size;
> +	gpusvm->ops = ops;
> +	gpusvm->chunk_sizes = chunk_sizes;
> +	gpusvm->num_chunks = num_chunks;
> +	gpusvm->zdd_wq = system_wq;
> +
> +	mmgrab(mm);
> +	gpusvm->root = RB_ROOT_CACHED;
> +	INIT_LIST_HEAD(&gpusvm->notifier_list);
> +
> +	init_rwsem(&gpusvm->notifier_lock);
> +
> +	fs_reclaim_acquire(GFP_KERNEL);
> +	might_lock(&gpusvm->notifier_lock);
> +	fs_reclaim_release(GFP_KERNEL);
> +
> +	return 0;
> +}
> +
> +/**
> + * drm_gpusvm_notifier_find - Find GPU SVM notifier
> + * @gpusvm__: Pointer to the GPU SVM structure
> + * @fault_addr__: Fault address
> + *
> + * This macro finds the GPU SVM notifier associated with the fault
> address.
> + *
> + * Returns:
> + * Pointer to the GPU SVM notifier on success, NULL otherwise.
> + */
> +#define drm_gpusvm_notifier_find(gpusvm__, fault_addr__)
> 	\
> +	notifier_iter_first(&(gpusvm__)->root, (fault_addr__),
> 	\
> +			    (fault_addr__ + 1))
> +
> +/**
> + * to_drm_gpusvm_notifier - retrieve the container struct for a
> given rbtree node
> + * @node__: a pointer to the rbtree node embedded within a
> drm_gpusvm_notifier struct
> + *
> + * Return: A pointer to the containing drm_gpusvm_notifier
> structure.
> + */
> +#define to_drm_gpusvm_notifier(__node)
> 	\
> +	container_of((__node), struct drm_gpusvm_notifier, rb.node)
> +
> +/**
> + * drm_gpusvm_notifier_insert - Insert GPU SVM notifier
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: Pointer to the GPU SVM notifier structure
> + *
> + * This function inserts the GPU SVM notifier into the GPU SVM RB
> tree and list.
> + */
> +static void drm_gpusvm_notifier_insert(struct drm_gpusvm
> *gpusvm,
> +				       struct drm_gpusvm_notifier
> *notifier)
> +{
> +	struct rb_node *node;
> +	struct list_head *head;
> +
> +	notifier_insert(notifier, &gpusvm->root);
> +
> +	node = rb_prev(&notifier->rb.node);
> +	if (node)
> +		head = &(to_drm_gpusvm_notifier(node))->rb.entry;
> +	else
> +		head = &gpusvm->notifier_list;
> +
> +	list_add(&notifier->rb.entry, head);
> +}
> +
> +/**
> + * drm_gpusvm_notifier_remove - Remove GPU SVM notifier
> + * @gpusvm__: Pointer to the GPU SVM tructure
> + * @notifier__: Pointer to the GPU SVM notifier structure
> + *
> + * This macro removes the GPU SVM notifier from the GPU SVM RB
> tree and list.
> + */
> +#define drm_gpusvm_notifier_remove(gpusvm__, notifier__)
> 	\
> +	notifier_remove((notifier__), &(gpusvm__)->root);	\
> +	list_del(&(notifier__)->rb.entry)
> +
> +/**
> + * drm_gpusvm_fini - Finalize the GPU SVM.
> + * @gpusvm: Pointer to the GPU SVM structure.
> + *
> + * This function finalizes the GPU SVM by cleaning up any remaining
> ranges and
> + * notifiers, and dropping a reference to struct MM.
> + */
> +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm)
> +{
> +	struct drm_gpusvm_notifier *notifier, *next;
> +
> +	drm_gpusvm_for_each_notifier_safe(notifier, next, gpusvm,
> 0, LONG_MAX) {
> +		struct drm_gpusvm_range *range, *__next;
> +
> +		/*
> +		 * Remove notifier first to avoid racing with any
> invalidation
> +		 */
> +		mmu_interval_notifier_remove(&notifier->notifier);
> +		notifier->flags.removed = true;
> +
> +		drm_gpusvm_for_each_range_safe(range, __next,
> notifier, 0,
> +					       LONG_MAX)
> +			drm_gpusvm_range_remove(gpusvm, range);
> +	}
> +
> +	mmdrop(gpusvm->mm);
> +	WARN_ON(!RB_EMPTY_ROOT(&gpusvm->root.rb_root));
> +}
> +
> +/**
> + * drm_gpusvm_notifier_alloc - Allocate GPU SVM notifier
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @fault_addr: Fault address
> + *
> + * This function allocates and initializes the GPU SVM notifier
> structure.
> + *
> + * Returns:
> + * Pointer to the allocated GPU SVM notifier on success, ERR_PTR()
> on failure.
> + */
> +static struct drm_gpusvm_notifier *
> +drm_gpusvm_notifier_alloc(struct drm_gpusvm *gpusvm, u64
> fault_addr)
> +{
> +	struct drm_gpusvm_notifier *notifier;
> +
> +	if (gpusvm->ops->notifier_alloc)
> +		notifier = gpusvm->ops->notifier_alloc();
> +	else
> +		notifier = kzalloc(sizeof(*notifier), GFP_KERNEL);
> +
> +	if (!notifier)
> +		return ERR_PTR(-ENOMEM);
> +
> +	notifier->gpusvm = gpusvm;
> +	notifier->interval.start = ALIGN_DOWN(fault_addr, gpusvm-
> >notifier_size);
> +	notifier->interval.end = ALIGN(fault_addr + 1, gpusvm-
> >notifier_size);
> +	INIT_LIST_HEAD(&notifier->rb.entry);
> +	notifier->root = RB_ROOT_CACHED;
> +	INIT_LIST_HEAD(&notifier->range_list);
> +
> +	return notifier;
> +}
> +
> +/**
> + * drm_gpusvm_notifier_free - Free GPU SVM notifier
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: Pointer to the GPU SVM notifier structure
> + *
> + * This function frees the GPU SVM notifier structure.
> + */
> +static void drm_gpusvm_notifier_free(struct drm_gpusvm *gpusvm,
> +				     struct drm_gpusvm_notifier
> *notifier)
> +{
> +	WARN_ON(!RB_EMPTY_ROOT(&notifier->root.rb_root));
> +
> +	if (gpusvm->ops->notifier_free)
> +		gpusvm->ops->notifier_free(notifier);
> +	else
> +		kfree(notifier);
> +}
> +
> +/**
> + * to_drm_gpusvm_range - retrieve the container struct for a given
> rbtree node
> + * @node__: a pointer to the rbtree node embedded within a
> drm_gpusvm_range struct
> + *
> + * Return: A pointer to the containing drm_gpusvm_range structure.
> + */
> +#define to_drm_gpusvm_range(node__)	\
> +	container_of((node__), struct drm_gpusvm_range, rb.node)
> +
> +/**
> + * drm_gpusvm_range_insert - Insert GPU SVM range
> + * @notifier: Pointer to the GPU SVM notifier structure
> + * @range: Pointer to the GPU SVM range structure
> + *
> + * This function inserts the GPU SVM range into the notifier RB tree
> and list.
> + */
> +static void drm_gpusvm_range_insert(struct drm_gpusvm_notifier
> *notifier,
> +				    struct drm_gpusvm_range *range)
> +{
> +	struct rb_node *node;
> +	struct list_head *head;
> +
> +	drm_gpusvm_notifier_lock(notifier->gpusvm);
> +	range_insert(range, &notifier->root);
> +
> +	node = rb_prev(&range->rb.node);
> +	if (node)
> +		head = &(to_drm_gpusvm_range(node))->rb.entry;
> +	else
> +		head = &notifier->range_list;
> +
> +	list_add(&range->rb.entry, head);
> +	drm_gpusvm_notifier_unlock(notifier->gpusvm);
> +}
> +
> +/**
> + * __drm_gpusvm_range_remove - Remove GPU SVM range
> + * @notifier__: Pointer to the GPU SVM notifier structure
> + * @range__: Pointer to the GPU SVM range structure
> + *
> + * This macro removes the GPU SVM range from the notifier RB tree
> and list.
> + */
> +#define __drm_gpusvm_range_remove(notifier__, range__)		\
> +	range_remove((range__), &(notifier__)->root);	\
> +	list_del(&(range__)->rb.entry)
> +
> +/**
> + * drm_gpusvm_range_alloc - Allocate GPU SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: Pointer to the GPU SVM notifier structure
> + * @fault_addr: Fault address
> + * @chunk_size: Chunk size
> + * @migrate_vram: Flag indicating whether to migrate VRAM
> + *
> + * This function allocates and initializes the GPU SVM range structure.
> + *
> + * Returns:
> + * Pointer to the allocated GPU SVM range on success, ERR_PTR() on
> failure.
> + */
> +static struct drm_gpusvm_range *
> +drm_gpusvm_range_alloc(struct drm_gpusvm *gpusvm,
> +		       struct drm_gpusvm_notifier *notifier,
> +		       u64 fault_addr, u64 chunk_size, bool
> migrate_vram)
> +{
> +	struct drm_gpusvm_range *range;
> +
> +	if (gpusvm->ops->range_alloc)
> +		range = gpusvm->ops->range_alloc(gpusvm);
> +	else
> +		range = kzalloc(sizeof(*range), GFP_KERNEL);
> +
> +	if (!range)
> +		return ERR_PTR(-ENOMEM);
> +
> +	kref_init(&range->refcount);
> +	range->gpusvm = gpusvm;
> +	range->notifier = notifier;
> +	range->va.start = ALIGN_DOWN(fault_addr, chunk_size);
> +	range->va.end = ALIGN(fault_addr + 1, chunk_size);
> +	INIT_LIST_HEAD(&range->rb.entry);
> +	range->notifier_seq = LONG_MAX;
> +	range->flags.migrate_vram = migrate_vram ? 1 : 0;
> +
> +	return range;
> +}
> +
> +/**
> + * drm_gpusvm_check_pages - Check pages
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: Pointer to the GPU SVM notifier structure
> + * @start: Start address
> + * @end: End address
> + *
> + * Check if pages between start and end have been faulted in on the
> CPU. Used to
> + * prevent migration of pages without CPU backing store.
> + *
> + * Returns:
> + * True if pages have been faulted into CPU, False otherwise
> + */
> +static bool drm_gpusvm_check_pages(struct drm_gpusvm *gpusvm,
> +				   struct drm_gpusvm_notifier
> *notifier,
> +				   u64 start, u64 end)
> +{
> +	struct hmm_range hmm_range = {
> +		.default_flags = 0,
> +		.notifier = &notifier->notifier,
> +		.start = start,
> +		.end = end,
> +		.dev_private_owner = gpusvm-
> >device_private_page_owner,
> +	};
> +	unsigned long timeout =
> +		jiffies +
> msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> +	unsigned long *pfns;
> +	unsigned long npages = npages_in_range(start, end);
> +	int err, i;
> +
> +	mmap_assert_locked(gpusvm->mm);
> +
> +	pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
> +	if (!pfns)
> +		return false;
> +
> +	hmm_range.notifier_seq =
> mmu_interval_read_begin(&notifier->notifier);
> +	hmm_range.hmm_pfns = pfns;
> +
> +	while (true) {
> +		err = hmm_range_fault(&hmm_range);
> +		if (err == -EBUSY) {
> +			if (time_after(jiffies, timeout))
> +				break;
> +
> +			hmm_range.notifier_seq =
> mmu_interval_read_begin(&notifier->notifier);
> +			continue;
> +		}
> +		break;
> +	}
> +	if (err)
> +		goto err_free;
> +
> +	for (i = 0; i < npages; ++i) {
> +		if (!(pfns[i] & HMM_PFN_VALID)) {
> +			err = -EFAULT;
> +			goto err_free;
> +		}
> +	}
> +
> +err_free:
> +	kvfree(pfns);
> +	return err ? false : true;
> +}
> +
> +/**
> + * drm_gpusvm_range_chunk_size - Determine chunk size for GPU
> SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: Pointer to the GPU SVM notifier structure
> + * @vas: Pointer to the virtual memory area structure
> + * @fault_addr: Fault address
> + * @gpuva_start: Start address of GPUVA which mirrors CPU
> + * @gpuva_end: End address of GPUVA which mirrors CPU
> + * @check_pages: Flag indicating whether to check pages
> + *
> + * This function determines the chunk size for the GPU SVM range
> based on the
> + * fault address, GPU SVM chunk sizes, existing GPU SVM ranges,
> and the virtual
> + * memory area boundaries.
> + *
> + * Returns:
> + * Chunk size on success, LONG_MAX on failure.
> + */
> +static u64 drm_gpusvm_range_chunk_size(struct drm_gpusvm
> *gpusvm,
> +				       struct drm_gpusvm_notifier
> *notifier,
> +				       struct vm_area_struct *vas,
> +				       u64 fault_addr, u64 gpuva_start,
> +				       u64 gpuva_end, bool check_pages)
> +{
> +	u64 start, end;
> +	int i = 0;
> +
> +retry:
> +	for (; i < gpusvm->num_chunks; ++i) {
> +		start = ALIGN_DOWN(fault_addr, gpusvm-
> >chunk_sizes[i]);
> +		end = ALIGN(fault_addr + 1, gpusvm->chunk_sizes[i]);
> +
> +		if (start >= vas->vm_start && end <= vas->vm_end
> &&
> +		    start >= notifier->interval.start &&
> +		    end <= notifier->interval.end &&
> +		    start >= gpuva_start && end <= gpuva_end)
> +			break;
> +	}
> +
> +	if (i == gpusvm->num_chunks)
> +		return LONG_MAX;
> +
> +	/*
> +	 * If the allocation is more than a page, ensure it does not overlap
> +	 * with existing ranges.
> +	 */
> +	if (end - start != SZ_4K) {
> +		struct drm_gpusvm_range *range;
> +
> +		range = drm_gpusvm_range_find(notifier, start, end);
> +		if (range) {
> +			++i;
> +			goto retry;
> +		}
> +
> +		/*
> +		 * XXX: Only create range on pages CPU has faulted in.
> Without
> +		 * this check, or prefault, on BMG
> 'xe_exec_system_allocator --r
> +		 * process-many-malloc' fails. In the failure case, each
> process
> +		 * mallocs 16k but the CPU VMA is ~128k which results
> in 64k SVM
> +		 * ranges. When migrating the SVM ranges, some
> processes fail in
> +		 * drm_gpusvm_migrate_to_vram with
> 'migrate.cpages != npages'
> +		 * and then upon drm_gpusvm_range_get_pages
> device pages from
> +		 * other processes are collected + faulted in which
> creates all
> +		 * sorts of problems. Unsure exactly how this is happening; the
> +		 * problem also goes away if 'xe_exec_system_allocator --r
> +		 * process-many-malloc' mallocs at least 64k at a time.
> +		 */
> +		if (check_pages &&
> +		    !drm_gpusvm_check_pages(gpusvm, notifier, start,
> end)) {
> +			++i;
> +			goto retry;
> +		}
> +	}
> +
> +	return end - start;
> +}
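
To make the selection concrete: with chunk sizes {2M, 64K, 4K} and a fault at
0x7f1234567000 inside a 128K VMA spanning [0x7f1234560000, 0x7f1234580000),
the 2M candidate [0x7f1234400000, 0x7f1234600000) exceeds the VMA, while the
64K candidate [0x7f1234560000, 0x7f1234570000) fits, so 64K is returned
(assuming the notifier interval and GPUVA range also cover it and no existing
range overlaps).
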
> +
> +/**
> + * drm_gpusvm_range_find_or_insert - Find or insert GPU SVM
> range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @fault_addr: Fault address
> + * @gpuva_start: Start address of GPUVA which mirrors CPU
> + * @gpuva_end: End address of GPUVA which mirrors CPU
> + * @ctx: GPU SVM context
> + *
> + * This function finds or inserts a newly allocated GPU SVM range
> based on the
> + * fault address. Caller must hold a lock to protect range lookup and
> insertion.
> + *
> + * Returns:
> + * Pointer to the GPU SVM range on success, ERR_PTR() on failure.
> + */
> +struct drm_gpusvm_range *
> +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm,
> u64 fault_addr,
> +				u64 gpuva_start, u64 gpuva_end,
> +				const struct drm_gpusvm_ctx *ctx)
> +{
> +	struct drm_gpusvm_notifier *notifier;
> +	struct drm_gpusvm_range *range;
> +	struct mm_struct *mm = gpusvm->mm;
> +	struct vm_area_struct *vas;
> +	bool notifier_alloc = false;
> +	u64 chunk_size;
> +	int err;
> +	bool migrate_vram;
> +
> +	if (fault_addr < gpusvm->mm_start ||
> +	    fault_addr > gpusvm->mm_start + gpusvm->mm_range) {
> +		err = -EINVAL;
> +		goto err_out;
> +	}
> +
> +	if (!ctx->mmap_locked) {
> +		if (!mmget_not_zero(mm)) {
> +			err = -EFAULT;
> +			goto err_out;
> +		}
> +		mmap_write_lock(mm);
> +	}
> +
> +	mmap_assert_write_locked(mm);
> +
> +	notifier = drm_gpusvm_notifier_find(gpusvm, fault_addr);
> +	if (!notifier) {
> +		notifier = drm_gpusvm_notifier_alloc(gpusvm,
> fault_addr);
> +		if (IS_ERR(notifier)) {
> +			err = PTR_ERR(notifier);
> +			goto err_mmunlock;
> +		}
> +		notifier_alloc = true;
> +		err = mmu_interval_notifier_insert_locked(&notifier-
> >notifier,
> +							  mm, notifier-
> >interval.start,
> +							  notifier-
> >interval.end -
> +							  notifier-
> >interval.start,
> +
> &drm_gpusvm_notifier_ops);
> +		if (err)
> +			goto err_notifier;
> +	}
> +
> +	vas = vma_lookup(mm, fault_addr);
> +	if (!vas) {
> +		err = -ENOENT;
> +		goto err_notifier_remove;
> +	}
> +
> +	if (!ctx->read_only && !(vas->vm_flags & VM_WRITE)) {
> +		err = -EPERM;
> +		goto err_notifier_remove;
> +	}
> +
> +	range = drm_gpusvm_range_find(notifier, fault_addr,
> fault_addr + 1);
> +	if (range)
> +		goto out_mmunlock;
> +	/*
> +	 * XXX: Short-circuiting migration based on migrate_vma_*
> current
> +	 * limitations. If/when migrate_vma_* add more support, this
> logic will
> +	 * have to change.
> +	 */
> +	migrate_vram = ctx->vram_possible &&
> +		vma_is_anonymous(vas)
> && !is_vm_hugetlb_page(vas);
> +
> +	chunk_size = drm_gpusvm_range_chunk_size(gpusvm,
> notifier, vas,
> +						 fault_addr,
> gpuva_start,
> +						 gpuva_end,
> migrate_vram &&
> +						 !ctx->prefault);
> +	if (chunk_size == LONG_MAX) {
> +		err = -EINVAL;
> +		goto err_notifier_remove;
> +	}
> +
> +	range = drm_gpusvm_range_alloc(gpusvm, notifier,
> fault_addr, chunk_size,
> +				       migrate_vram);
> +	if (IS_ERR(range)) {
> +		err = PTR_ERR(range);
> +		goto err_notifier_remove;
> +	}
> +
> +	drm_gpusvm_range_insert(notifier, range);
> +	if (notifier_alloc)
> +		drm_gpusvm_notifier_insert(gpusvm, notifier);
> +
> +	if (ctx->prefault) {
> +		struct drm_gpusvm_ctx __ctx = *ctx;
> +
> +		__ctx.mmap_locked = true;
> +		err = drm_gpusvm_range_get_pages(gpusvm, range,
> &__ctx);
> +		if (err)
> +			goto err_range_remove;
> +	}
> +
> +out_mmunlock:
> +	if (!ctx->mmap_locked) {
> +		mmap_write_unlock(mm);
> +		mmput(mm);
> +	}
> +
> +	return range;
> +
> +err_range_remove:
> +	__drm_gpusvm_range_remove(notifier, range);
> +err_notifier_remove:
> +	if (notifier_alloc)
> +		mmu_interval_notifier_remove(&notifier->notifier);
> +err_notifier:
> +	if (notifier_alloc)
> +		drm_gpusvm_notifier_free(gpusvm, notifier);
> +err_mmunlock:
> +	if (!ctx->mmap_locked) {
> +		mmap_write_unlock(mm);
> +		mmput(mm);
> +	}
> +err_out:
> +	return ERR_PTR(err);
> +}
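
A rough sketch of how a GPU page-fault handler would drive this (example_*
names are placeholders and error handling is trimmed; the bind helper is
sketched further down, after drm_gpusvm_range_get_pages()):

static int example_handle_pagefault(struct example_vm *vm, u64 fault_addr,
				    u64 gpuva_start, u64 gpuva_end)
{
	struct drm_gpusvm_ctx ctx = { .vram_possible = true };
	struct drm_gpusvm_range *range;

	/* Per-VM fault serialization protects range lookup/insertion here. */
	range = drm_gpusvm_range_find_or_insert(&vm->svm, fault_addr,
						gpuva_start, gpuva_end, &ctx);
	if (IS_ERR(range))
		return PTR_ERR(range);

	return example_get_pages_and_bind(vm, range, &ctx);
}
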
> +
> +/**
> + * for_each_dma_page - iterate over pages in a DMA region
> + * @i__: the current page index in the iteration
> + * @j__: the current page index, log order, in the iteration
> + * @npages__: the total number of pages in the DMA region
> + * @order__: the order of the pages in the DMA region
> + *
> + * This macro iterates over each page in a DMA region. The DMA
> region
> + * is assumed to be composed of 2^@order__ pages, and the macro
> will
> + * step through the region one block of 2^@order__ pages at a time.
> + */
> +#define for_each_dma_page(i__, j__, npages__, order__)	\
> +	for ((i__) = 0, (j__) = 0; (i__) < (npages__);	\
> +	     (j__)++, (i__) += 0x1 << (order__))
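
For example, with npages = 512 and order = 9 the loop body runs exactly once
(i = 0, j = 0) since i advances by 512; with order = 0 it visits every page
and i and j advance together.
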
> +
> +/**
> + * __drm_gpusvm_range_unmap_pages - Unmap pages associated
> with a GPU SVM range (internal)
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + *
> + * This function unmaps pages associated with a GPU SVM range.
> Assumes and
> + * asserts correct locking is in place when called.
> + */
> +static void __drm_gpusvm_range_unmap_pages(struct
> drm_gpusvm *gpusvm,
> +					   struct drm_gpusvm_range
> *range)
> +{
> +	lockdep_assert_held(&gpusvm->notifier_lock);
> +
> +	if (range->pages) {
> +		unsigned long i, j, npages = npages_in_range(range-
> >va.start,
> +							     range-
> >va.end);
> +
> +		if (range->flags.has_dma_mapping) {
> +			for_each_dma_page(i, j, npages, range-
> >order)
> +				dma_unmap_page(gpusvm->drm-
> >dev,
> +					       range->dma_addr[j],
> +					       PAGE_SIZE << range-
> >order,
> +					       DMA_BIDIRECTIONAL);
> +		}
> +
> +		range->flags.has_vram_pages = false;
> +		range->flags.has_dma_mapping = false;
> +	}
> +}
> +
> +/**
> + * drm_gpusvm_range_free_pages - Free pages associated with a
> GPU SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + *
> + * This function frees pages associated with a GPU SVM range.
> + */
> +static void drm_gpusvm_range_free_pages(struct drm_gpusvm
> *gpusvm,
> +					struct drm_gpusvm_range
> *range)
> +{
> +	lockdep_assert_held(&gpusvm->notifier_lock);
> +
> +	if (range->pages) {
> +		if (range->flags.kfree_mapping) {
> +			kfree(range->dma_addr);
> +			range->flags.kfree_mapping = false;
> +			range->pages = NULL;
> +		} else {
> +			kvfree(range->pages);
> +			range->pages = NULL;
> +		}
> +	}
> +}
> +
> +/**
> + * drm_gpusvm_range_remove - Remove GPU SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range to be removed
> + *
> + * This function removes the specified GPU SVM range and also
> removes the parent
> + * GPU SVM notifier if no more ranges remain in the notifier. The
> caller must
> + * hold a lock to protect range and notifier removal.
> + */
> +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> +			     struct drm_gpusvm_range *range)
> +{
> +	struct drm_gpusvm_notifier *notifier;
> +
> +	notifier = drm_gpusvm_notifier_find(gpusvm, range-
> >va.start);
> +	if (WARN_ON_ONCE(!notifier))
> +		return;
> +
> +	drm_gpusvm_notifier_lock(gpusvm);
> +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> +	drm_gpusvm_range_free_pages(gpusvm, range);
> +	__drm_gpusvm_range_remove(notifier, range);
> +	drm_gpusvm_notifier_unlock(gpusvm);
> +
> +	drm_gpusvm_range_put(range);
> +
> +	if (RB_EMPTY_ROOT(&notifier->root.rb_root)) {
> +		if (!notifier->flags.removed)
> +			mmu_interval_notifier_remove(&notifier-
> >notifier);
> +		drm_gpusvm_notifier_remove(gpusvm, notifier);
> +		drm_gpusvm_notifier_free(gpusvm, notifier);
> +	}
> +}
> +
> +/**
> + * drm_gpusvm_range_get - Get a reference to GPU SVM range
> + * @range: Pointer to the GPU SVM range
> + *
> + * This function increments the reference count of the specified
> GPU SVM range.
> + *
> + * Returns:
> + * Pointer to the GPU SVM range.
> + */
> +struct drm_gpusvm_range *
> +drm_gpusvm_range_get(struct drm_gpusvm_range *range)
> +{
> +	kref_get(&range->refcount);
> +
> +	return range;
> +}
> +
> +/**
> + * drm_gpusvm_range_destroy - Destroy GPU SVM range
> + * @refcount: Pointer to the reference counter embedded in the
> GPU SVM range
> + *
> + * This function destroys the specified GPU SVM range when its
> reference count
> + * reaches zero. If a custom range-free function is provided, it is
> invoked to
> + * free the range; otherwise, the range is deallocated using kfree().
> + */
> +static void drm_gpusvm_range_destroy(struct kref *refcount)
> +{
> +	struct drm_gpusvm_range *range =
> +		container_of(refcount, struct drm_gpusvm_range,
> refcount);
> +	struct drm_gpusvm *gpusvm = range->gpusvm;
> +
> +	if (gpusvm->ops->range_free)
> +		gpusvm->ops->range_free(range);
> +	else
> +		kfree(range);
> +}
> +
> +/**
> + * drm_gpusvm_range_put - Put a reference to GPU SVM range
> + * @range: Pointer to the GPU SVM range
> + *
> + * This function decrements the reference count of the specified
> GPU SVM range
> + * and frees it when the count reaches zero.
> + */
> +void drm_gpusvm_range_put(struct drm_gpusvm_range *range)
> +{
> +	kref_put(&range->refcount, drm_gpusvm_range_destroy);
> +}
> +
> +/**
> + * drm_gpusvm_range_pages_valid - GPU SVM range pages valid
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + *
> + * This function determines if a GPU SVM range's pages are valid. Expected
> + * to be called holding gpusvm->notifier_lock and as the last step before
> + * committing a GPU binding.
> + *
> + * Returns:
> + * True if GPU SVM range has valid pages, False otherwise
> + */
> +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> +				  struct drm_gpusvm_range *range)
> +{
> +	lockdep_assert_held(&gpusvm->notifier_lock);
> +
> +	return range->flags.has_vram_pages || range-
> >flags.has_dma_mapping;
> +}
> +
> +/**
> + * drm_gpusvm_range_pages_valid_unlocked - GPU SVM range
> pages valid unlocked
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + *
> + * This function determines if a GPU SVM range's pages are valid. Expected
> + * to be called without holding gpusvm->notifier_lock.
> + *
> + * Returns:
> + * True if GPU SVM range has valid pages, False otherwise
> + */
> +static bool
> +drm_gpusvm_range_pages_valid_unlocked(struct drm_gpusvm
> *gpusvm,
> +				      struct drm_gpusvm_range *range)
> +{
> +	bool pages_valid;
> +
> +	if (!range->pages)
> +		return false;
> +
> +	drm_gpusvm_notifier_lock(gpusvm);
> +	pages_valid = drm_gpusvm_range_pages_valid(gpusvm,
> range);
> +	if (!pages_valid && range->flags.kfree_mapping) {
> +		kfree(range->dma_addr);
> +		range->flags.kfree_mapping = false;
> +		range->pages = NULL;
> +	}
> +	drm_gpusvm_notifier_unlock(gpusvm);
> +
> +	return pages_valid;
> +}
> +
> +/**
> + * drm_gpusvm_range_get_pages - Get pages for a GPU SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + * @ctx: GPU SVM context
> + *
> + * This function gets pages for a GPU SVM range and ensures they
> are mapped for
> + * DMA access.
> + *
> + * Returns:
> + * 0 on success, negative error code on failure.
> + */
> +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> +			       struct drm_gpusvm_range *range,
> +			       const struct drm_gpusvm_ctx *ctx)
> +{
> +	struct mmu_interval_notifier *notifier = &range->notifier-
> >notifier;
> +	struct hmm_range hmm_range = {
> +		.default_flags = HMM_PFN_REQ_FAULT | (ctx-
> >read_only ? 0 :
> +			HMM_PFN_REQ_WRITE),
> +		.notifier = notifier,
> +		.start = range->va.start,
> +		.end = range->va.end,
> +		.dev_private_owner = gpusvm-
> >device_private_page_owner,
> +	};
> +	struct mm_struct *mm = gpusvm->mm;
> +	unsigned long timeout =
> +		jiffies +
> msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> +	unsigned long i, j;
> +	unsigned long npages = npages_in_range(range->va.start,
> range->va.end);
> +	unsigned int order = 0;
> +	unsigned long *pfns;
> +	struct page **pages;
> +	int err = 0;
> +	bool vram_pages = !!range->flags.migrate_vram;
> +	bool alloc_pfns = false, kfree_mapping;
> +
> +retry:
> +	kfree_mapping = false;
> +	hmm_range.notifier_seq =
> mmu_interval_read_begin(notifier);
> +	if (drm_gpusvm_range_pages_valid_unlocked(gpusvm,
> range))
> +		return 0;
> +
> +	if (range->notifier_seq == hmm_range.notifier_seq &&
> range->pages) {
> +		if (ctx->prefault)
> +			return 0;
> +
> +		pfns = (unsigned long *)range->pages;
> +		pages = range->pages;
> +		goto map_pages;
> +	}
> +
> +	if (!range->pages) {
> +		pfns = kvmalloc_array(npages, sizeof(*pfns),
> GFP_KERNEL);
> +		if (!pfns)
> +			return -ENOMEM;
> +		alloc_pfns = true;
> +	} else {
> +		pfns = (unsigned long *)range->pages;
> +	}
> +
> +	if (!ctx->mmap_locked) {
> +		if (!mmget_not_zero(mm)) {
> +			err = -EFAULT;
> +			goto err_out;
> +		}
> +	}
> +
> +	hmm_range.hmm_pfns = pfns;
> +	while (true) {
> +		/* Must be checked after mmu_interval_read_begin
> */
> +		if (range->flags.unmapped) {
> +			err = -EFAULT;
> +			break;
> +		}
> +
> +		if (!ctx->mmap_locked) {
> +			/*
> +			 * XXX: HMM locking document indicates only
> a read-lock
> +			 * is required but there appears to be a window
> between
> +			 * the MMU_NOTIFY_MIGRATE event
> triggered in a CPU fault
> +			 * via migrate_vma_setup and the pages
> actually moving
> +			 * in migrate_vma_finalize in which this code
> can grab
> +			 * garbage pages. Grabbing the write-lock if
> the range
> +			 * is attached to vram appears to protect
> against this
> +			 * race.
> +			 */
> +			if (vram_pages)
> +				mmap_write_lock(mm);
> +			else
> +				mmap_read_lock(mm);
> +		}
> +		err = hmm_range_fault(&hmm_range);
> +		if (!ctx->mmap_locked) {
> +			if (vram_pages)
> +				mmap_write_unlock(mm);
> +			else
> +				mmap_read_unlock(mm);
> +		}
> +
> +		if (err == -EBUSY) {
> +			if (time_after(jiffies, timeout))
> +				break;
> +
> +			hmm_range.notifier_seq =
> mmu_interval_read_begin(notifier);
> +			continue;
> +		}
> +		break;
> +	}
> +	if (!ctx->mmap_locked)
> +		mmput(mm);
> +	if (err)
> +		goto err_free;
> +
> +	pages = (struct page **)pfns;
> +
> +	if (ctx->prefault) {
> +		range->pages = pages;
> +		goto set_seqno;
> +	}
> +
> +map_pages:
> +	if (is_device_private_page(hmm_pfn_to_page(pfns[0]))) {
> +		WARN_ON_ONCE(!range->vram_allocation);
> +
> +		for (i = 0; i < npages; ++i) {
> +			pages[i] = hmm_pfn_to_page(pfns[i]);
> +
> +			if
> (WARN_ON_ONCE(!is_device_private_page(pages[i]))) {
> +				err = -EOPNOTSUPP;
> +				goto err_free;
> +			}
> +		}
> +
> +		/* Do not race with notifier unmapping pages */
> +		drm_gpusvm_notifier_lock(gpusvm);
> +		range->flags.has_vram_pages = true;
> +		range->pages = pages;
> +		if (mmu_interval_read_retry(notifier,
> hmm_range.notifier_seq)) {
> +			err = -EAGAIN;
> +
> 	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> +		}
> +		drm_gpusvm_notifier_unlock(gpusvm);
> +	} else {
> +		dma_addr_t *dma_addr = (dma_addr_t *)pfns;
> +
> +		for_each_dma_page(i, j, npages, order) {
> +			if (WARN_ON_ONCE(i && order !=
> +
> hmm_pfn_to_map_order(pfns[i]))) {
> +				err = -EOPNOTSUPP;
> +				npages = i;
> +				goto err_unmap;
> +			}
> +			order = hmm_pfn_to_map_order(pfns[i]);
> +
> +			pages[j] = hmm_pfn_to_page(pfns[i]);
> +			if
> (WARN_ON_ONCE(is_zone_device_page(pages[j]))) {
> +				err = -EOPNOTSUPP;
> +				npages = i;
> +				goto err_unmap;
> +			}
> +
> +			set_page_dirty_lock(pages[j]);
> +			mark_page_accessed(pages[j]);
> +
> +			dma_addr[j] = dma_map_page(gpusvm-
> >drm->dev,
> +						   pages[j], 0,
> +						   PAGE_SIZE << order,
> +
> DMA_BIDIRECTIONAL);
> +			if (dma_mapping_error(gpusvm->drm->dev,
> dma_addr[j])) {
> +				err = -EFAULT;
> +				npages = i;
> +				goto err_unmap;
> +			}
> +		}
> +
> +		/* Huge pages, reduce memory footprint */
> +		if (order) {
> +			dma_addr = kmalloc_array(j,
> sizeof(*dma_addr),
> +						 GFP_KERNEL);
> +			if (dma_addr) {
> +				for (i = 0; i < j; ++i)
> +					dma_addr[i] =
> (dma_addr_t)pfns[i];
> +				kvfree(pfns);
> +				kfree_mapping = true;
> +			} else {
> +				dma_addr = (dma_addr_t *)pfns;
> +			}
> +		}
> +
> +		/* Do not race with notifier unmapping pages */
> +		drm_gpusvm_notifier_lock(gpusvm);
> +		range->order = order;
> +		range->flags.kfree_mapping = kfree_mapping;
> +		range->flags.has_dma_mapping = true;
> +		range->dma_addr = dma_addr;
> +		range->vram_allocation = NULL;
> +		if (mmu_interval_read_retry(notifier,
> hmm_range.notifier_seq)) {
> +			err = -EAGAIN;
> +
> 	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> +		}
> +		drm_gpusvm_notifier_unlock(gpusvm);
> +	}
> +
> +	if (err == -EAGAIN)
> +		goto retry;
> +set_seqno:
> +	range->notifier_seq = hmm_range.notifier_seq;
> +
> +	return 0;
> +
> +err_unmap:
> +	for_each_dma_page(i, j, npages, order)
> +		dma_unmap_page(gpusvm->drm->dev,
> +			       (dma_addr_t)pfns[j],
> +			       PAGE_SIZE << order,
> DMA_BIDIRECTIONAL);
> +err_free:
> +	if (alloc_pfns)
> +		kvfree(pfns);
> +err_out:
> +	return err;
> +}
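
Tying the above together, a sketch of the bind sequence a driver would use
(example_program_ptes() is a placeholder for the driver's PTE programming):

static int example_get_pages_and_bind(struct example_vm *vm,
				      struct drm_gpusvm_range *range,
				      const struct drm_gpusvm_ctx *ctx)
{
	int err;

again:
	err = drm_gpusvm_range_get_pages(&vm->svm, range, ctx);
	if (err)
		return err;

	drm_gpusvm_notifier_lock(&vm->svm);
	if (!drm_gpusvm_range_pages_valid(&vm->svm, range)) {
		/* Raced with an invalidation; fault the pages in again. */
		drm_gpusvm_notifier_unlock(&vm->svm);
		goto again;
	}
	/* Validity check is the last step before committing the bind. */
	err = example_program_ptes(vm, range);
	drm_gpusvm_notifier_unlock(&vm->svm);

	return err;
}
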
> +
> +/**
> + * drm_gpusvm_range_unmap_pages - Unmap pages associated
> with a GPU SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + * @ctx: GPU SVM context
> + *
> + * This function unmaps pages associated with a GPU SVM range. If
> @in_notifier
> + * is set, it is assumed that gpusvm->notifier_lock is held in write
> mode; if it
> + * is clear, it acquires gpusvm->notifier_lock in read mode. Must be
> called on
> + * each GPU SVM range attached to notifier in gpusvm->ops-
> >invalidate for IOMMU
> + * security model.
> + */
> +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm
> *gpusvm,
> +				  struct drm_gpusvm_range *range,
> +				  const struct drm_gpusvm_ctx *ctx)
> +{
> +	if (ctx->in_notifier)
> +		lockdep_assert_held_write(&gpusvm->notifier_lock);
> +	else
> +		drm_gpusvm_notifier_lock(gpusvm);
> +
> +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> +
> +	if (!ctx->in_notifier)
> +		drm_gpusvm_notifier_unlock(gpusvm);
> +}
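
Given the IOMMU requirement above, a driver's invalidate hook would look
roughly like this (example_zap_ptes() is a placeholder; GPU TLB invalidation
is omitted):

static void example_invalidate(struct drm_gpusvm *gpusvm,
			       struct drm_gpusvm_notifier *notifier,
			       const struct mmu_notifier_range *mmu_range)
{
	struct drm_gpusvm_ctx ctx = { .in_notifier = true };
	struct drm_gpusvm_range *range = NULL;

	/* Called with gpusvm->notifier_lock held in write mode. */
	example_zap_ptes(gpusvm, mmu_range->start, mmu_range->end);

	drm_gpusvm_for_each_range(range, notifier, mmu_range->start,
				  mmu_range->end)
		drm_gpusvm_range_unmap_pages(gpusvm, range, &ctx);
}
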
> +
> +/**
> + * drm_gpusvm_migration_put_page - Put a migration page
> + * @page: Pointer to the page to put
> + *
> + * This function unlocks and puts a page.
> + */
> +static void drm_gpusvm_migration_put_page(struct page *page)
> +{
> +	unlock_page(page);
> +	put_page(page);
> +}
> +
> +/**
> + * drm_gpusvm_migration_put_pages - Put migration pages
> + * @npages: Number of pages
> + * @migrate_pfn: Array of migrate page frame numbers
> + *
> + * This function puts an array of pages.
> + */
> +static void drm_gpusvm_migration_put_pages(unsigned long
> npages,
> +					   unsigned long *migrate_pfn)
> +{
> +	unsigned long i;
> +
> +	for (i = 0; i < npages; ++i) {
> +		if (!migrate_pfn[i])
> +			continue;
> +
> +
> 	drm_gpusvm_migration_put_page(migrate_pfn_to_page(mi
> grate_pfn[i]));
> +		migrate_pfn[i] = 0;
> +	}
> +}
> +
> +/**
> + * drm_gpusvm_get_vram_page - Get a reference to a VRAM page
> + * @page: Pointer to the page
> + * @zdd: Pointer to the GPU SVM zone device data
> + *
> + * This function associates the given page with the specified GPU
> SVM zone
> + * device data and initializes it for zone device usage.
> + */
> +static void drm_gpusvm_get_vram_page(struct page *page,
> +				     struct drm_gpusvm_zdd *zdd)
> +{
> +	page->zone_device_data = drm_gpusvm_zdd_get(zdd);
> +	zone_device_page_init(page);
> +}
> +
> +/**
> + * drm_gpusvm_migrate_map_pages() - Map migration pages for
> GPU SVM migration
> + * @dev: The device for which the pages are being mapped
> + * @dma_addr: Array to store DMA addresses corresponding to
> mapped pages
> + * @migrate_pfn: Array of migrate page frame numbers to map
> + * @npages: Number of pages to map
> + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> + *
> + * This function maps pages of memory for migration usage in GPU
> SVM. It
> + * iterates over each page frame number provided in @migrate_pfn,
> maps the
> + * corresponding page, and stores the DMA address in the provided
> @dma_addr
> + * array.
> + *
> + * Return: 0 on success, -EFAULT if an error occurs during mapping.
> + */
> +static int drm_gpusvm_migrate_map_pages(struct device *dev,
> +					dma_addr_t *dma_addr,
> +					long unsigned int
> *migrate_pfn,
> +					unsigned long npages,
> +					enum dma_data_direction dir)
> +{
> +	unsigned long i;
> +
> +	for (i = 0; i < npages; ++i) {
> +		struct page *page =
> migrate_pfn_to_page(migrate_pfn[i]);
> +
> +		if (!page)
> +			continue;
> +
> +		if (WARN_ON_ONCE(is_zone_device_page(page)))
> +			return -EFAULT;
> +
> +		dma_addr[i] = dma_map_page(dev, page, 0,
> PAGE_SIZE, dir);
> +		if (dma_mapping_error(dev, dma_addr[i]))
> +			return -EFAULT;
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * drm_gpusvm_migrate_unmap_pages() - Unmap pages previously
> mapped for GPU SVM migration
> + * @dev: The device for which the pages were mapped
> + * @dma_addr: Array of DMA addresses corresponding to mapped
> pages
> + * @npages: Number of pages to unmap
> + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> + *
> + * This function unmaps previously mapped pages of memory for
> GPU Shared Virtual
> + * Memory (SVM). It iterates over each DMA address provided in
> @dma_addr, checks
> + * if it's valid and not already unmapped, and unmaps the
> corresponding page.
> + */
> +static void drm_gpusvm_migrate_unmap_pages(struct device *dev,
> +					   dma_addr_t *dma_addr,
> +					   unsigned long npages,
> +					   enum dma_data_direction
> dir)
> +{
> +	unsigned long i;
> +
> +	for (i = 0; i < npages; ++i) {
> +		if (!dma_addr[i] || dma_mapping_error(dev,
> dma_addr[i]))
> +			continue;
> +
> +		dma_unmap_page(dev, dma_addr[i], PAGE_SIZE, dir);
> +	}
> +}
> +
> +/**
> + * drm_gpusvm_migrate_to_vram - Migrate GPU SVM range to
> VRAM
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + * @vram_allocation: Driver-private pointer to the VRAM allocation.
> The caller
> + *                   should hold a reference to the VRAM allocation, which
> + *                   should be dropped via ops->vram_release or upon the
> + *                   failure of this function.
> + * @ctx: GPU SVM context
> + *
> + * This function migrates the specified GPU SVM range to VRAM. It
> performs the
> + * necessary setup and invokes the driver-specific operations for
> migration to
> + * VRAM. Upon successful return, @vram_allocation can safely reference
> + * @range until ops->vram_release is called, which only happens after a
> + * successful return from this function.
> + *
> + * Returns:
> + * 0 on success, negative error code on failure.
> + */
> +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> +			       struct drm_gpusvm_range *range,
> +			       void *vram_allocation,
> +			       const struct drm_gpusvm_ctx *ctx)
> +{
> +	u64 start = range->va.start, end = range->va.end;
> +	struct migrate_vma migrate = {
> +		.start		= start,
> +		.end		= end,
> +		.pgmap_owner	= gpusvm-
> >device_private_page_owner,
> +		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
> +	};
> +	struct mm_struct *mm = gpusvm->mm;
> +	unsigned long i, npages = npages_in_range(start, end);
> +	struct vm_area_struct *vas;
> +	struct drm_gpusvm_zdd *zdd = NULL;
> +	struct page **pages;
> +	dma_addr_t *dma_addr;
> +	void *buf;
> +	int err;
> +
> +	if (!range->flags.migrate_vram)
> +		return -EINVAL;
> +
> +	if (!gpusvm->ops->populate_vram_pfn || !gpusvm->ops-
> >copy_to_vram ||
> +	    !gpusvm->ops->copy_to_sram)
> +		return -EOPNOTSUPP;
> +
> +	if (!ctx->mmap_locked) {
> +		if (!mmget_not_zero(mm)) {
> +			err = -EFAULT;
> +			goto err_out;
> +		}
> +		mmap_write_lock(mm);
> +	}
> +
> +	mmap_assert_locked(mm);
> +
> +	vas = vma_lookup(mm, start);
> +	if (!vas) {
> +		err = -ENOENT;
> +		goto err_mmunlock;
> +	}
> +
> +	if (end > vas->vm_end || start < vas->vm_start) {
> +		err = -EINVAL;
> +		goto err_mmunlock;
> +	}
> +
> +	if (!vma_is_anonymous(vas)) {
> +		err = -EBUSY;
> +		goto err_mmunlock;
> +	}
> +
> +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) +
> sizeof(*dma_addr) +
> +		       sizeof(*pages), GFP_KERNEL);
> +	if (!buf) {
> +		err = -ENOMEM;
> +		goto err_mmunlock;
> +	}
> +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr))
> * npages;
> +
> +	zdd = drm_gpusvm_zdd_alloc(range);
> +	if (!zdd) {
> +		err = -ENOMEM;
> +		goto err_free;
> +	}
> +
> +	migrate.vma = vas;
> +	migrate.src = buf;
> +	migrate.dst = migrate.src + npages;
> +
> +	err = migrate_vma_setup(&migrate);
> +	if (err)
> +		goto err_free;
> +
> +	/*
> +	 * FIXME: Below cases, !migrate.cpages and migrate.cpages !=
> npages, not
> +	 * always an error. Need to revisit possible cases and how to
> handle. We
> +	 * could prefault on migrate.cpages != npages via
> hmm_range_fault.
> +	 */
> +
> +	if (!migrate.cpages) {
> +		err = -EFAULT;
> +		goto err_free;
> +	}
> +
> +	if (migrate.cpages != npages) {
> +		err = -EBUSY;
> +		goto err_finalize;
> +	}
> +
> +	err = gpusvm->ops->populate_vram_pfn(gpusvm,
> vram_allocation, npages,
> +					     migrate.dst);
> +	if (err)
> +		goto err_finalize;
> +
> +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> dma_addr,
> +					   migrate.src, npages,
> DMA_TO_DEVICE);
> +	if (err)
> +		goto err_finalize;
> +
> +	for (i = 0; i < npages; ++i) {
> +		struct page *page = pfn_to_page(migrate.dst[i]);
> +
> +		pages[i] = page;
> +		migrate.dst[i] = migrate_pfn(migrate.dst[i]);
> +		drm_gpusvm_get_vram_page(page, zdd);
> +	}
> +
> +	err = gpusvm->ops->copy_to_vram(gpusvm, pages,
> dma_addr, npages);
> +	if (err)
> +		goto err_finalize;
> +
> +	/* Upon success bind vram allocation to range and zdd */
> +	range->vram_allocation = vram_allocation;
> +	WRITE_ONCE(zdd->vram_allocation, vram_allocation);	/*
> Owns ref */
> +
> +err_finalize:
> +	if (err)
> +		drm_gpusvm_migration_put_pages(npages,
> migrate.dst);
> +	migrate_vma_pages(&migrate);
> +	migrate_vma_finalize(&migrate);
> +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev,
> dma_addr, npages,
> +				       DMA_TO_DEVICE);
> +err_free:
> +	if (zdd)
> +		drm_gpusvm_zdd_put(zdd);
> +	kvfree(buf);
> +err_mmunlock:
> +	if (!ctx->mmap_locked) {
> +		mmap_write_unlock(mm);
> +		mmput(mm);
> +	}
> +err_out:
> +	return err;
> +}
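
For completeness, a sketch of how the fault path might attempt the migration
(the example_bo_* helpers are made up; per the kernel-doc above, the caller
keeps its reference on failure and drops it itself):

static void example_try_migrate_vram(struct example_vm *vm,
				     struct drm_gpusvm_range *range,
				     const struct drm_gpusvm_ctx *ctx)
{
	struct example_bo *bo;

	bo = example_bo_create(vm, range->va.end - range->va.start);
	if (IS_ERR(bo))
		return;	/* Non-fatal: serve the fault from system pages. */

	if (drm_gpusvm_migrate_to_vram(&vm->svm, range, bo, ctx))
		example_bo_put(bo);	/* Migration failed, drop our ref. */
}
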
> +
> +/**
> + * drm_gpusvm_migrate_populate_sram_pfn - Populate SRAM
> PFNs for a VM area
> + * @vas: Pointer to the VM area structure, can be NULL
> + * @npages: Number of pages to populate
> + * @src_mpfn: Source array of migrate PFNs
> + * @mpfn: Array of migrate PFNs to populate
> + * @addr: Start address for PFN allocation
> + *
> + * This function populates the SRAM migrate page frame numbers
> (PFNs) for the
> + * specified VM area structure. It allocates and locks pages in the VM
> area for
> + * SRAM usage. If @vas is non-NULL, alloc_page_vma() is used for
> + * allocation; otherwise alloc_page() is used.
> + *
> + * Returns:
> + * 0 on success, negative error code on failure.
> + */
> +static int drm_gpusvm_migrate_populate_sram_pfn(struct
> vm_area_struct *vas,
> +						unsigned long npages,
> +						unsigned long
> *src_mpfn,
> +						unsigned long *mpfn,
> u64 addr)
> +{
> +	unsigned long i;
> +
> +	for (i = 0; i < npages; ++i, addr += PAGE_SIZE) {
> +		struct page *page;
> +
> +		if (!(src_mpfn[i] & MIGRATE_PFN_MIGRATE))
> +			continue;
> +
> +		if (vas)
> +			page = alloc_page_vma(GFP_HIGHUSER, vas,
> addr);
> +		else
> +			page = alloc_page(GFP_HIGHUSER);
> +
> +		if (!page)
> +			return -ENOMEM;
> +
> +		lock_page(page);
> +		mpfn[i] = migrate_pfn(page_to_pfn(page));
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * drm_gpusvm_evict_to_sram - Evict GPU SVM range to SRAM
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + *
> + * Similar to __drm_gpusvm_migrate_to_sram but does not require the mmap
> + * lock; migration is done via the migrate_device_* functions. Fallback
> + * path, as it is preferred to issue migrations with the mmap lock held.
> + *
> + * Returns:
> + * 0 on success, negative error code on failure.
> + */
> +static int drm_gpusvm_evict_to_sram(struct drm_gpusvm *gpusvm,
> +				    struct drm_gpusvm_range *range)
> +{
> +	unsigned long npages;
> +	struct page **pages;
> +	unsigned long *src, *dst;
> +	dma_addr_t *dma_addr;
> +	void *buf;
> +	int i, err = 0;
> +
> +	npages = npages_in_range(range->va.start, range->va.end);
> +
> +	buf = kvcalloc(npages, 2 * sizeof(*src) + sizeof(*dma_addr) +
> +		       sizeof(*pages), GFP_KERNEL);
> +	if (!buf) {
> +		err = -ENOMEM;
> +		goto err_out;
> +	}
> +	src = buf;
> +	dst = buf + (sizeof(*src) * npages);
> +	dma_addr = buf + (2 * sizeof(*src) * npages);
> +	pages = buf + (2 * sizeof(*src) + sizeof(*dma_addr)) * npages;
> +
> +	err = gpusvm->ops->populate_vram_pfn(gpusvm, range-
> >vram_allocation,
> +					     npages, src);
> +	if (err)
> +		goto err_free;
> +
> +	err = migrate_device_vma_range(gpusvm->mm,
> +				       gpusvm-
> >device_private_page_owner, src,
> +				       npages, range->va.start);
> +	if (err)
> +		goto err_free;
> +
> +	err = drm_gpusvm_migrate_populate_sram_pfn(NULL,
> npages, src, dst, 0);
> +	if (err)
> +		goto err_finalize;
> +
> +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> dma_addr,
> +					   dst, npages,
> DMA_BIDIRECTIONAL);
> +	if (err)
> +		goto err_finalize;
> +
> +	for (i = 0; i < npages; ++i)
> +		pages[i] = migrate_pfn_to_page(src[i]);
> +
> +	err = gpusvm->ops->copy_to_sram(gpusvm, pages,
> dma_addr, npages);
> +	if (err)
> +		goto err_finalize;
> +
> +err_finalize:
> +	if (err)
> +		drm_gpusvm_migration_put_pages(npages, dst);
> +	migrate_device_pages(src, dst, npages);
> +	migrate_device_finalize(src, dst, npages);
> +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev,
> dma_addr, npages,
> +				       DMA_BIDIRECTIONAL);
> +err_free:
> +	kvfree(buf);
> +err_out:
> +
> +	return err;
> +}
> +
> +/**
> + * __drm_gpusvm_migrate_to_sram - Migrate GPU SVM range to
> SRAM (internal)
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @vas: Pointer to the VM area structure
> + * @page: Pointer to the page for fault handling (can be NULL)
> + * @start: Start address of the migration range
> + * @end: End address of the migration range
> + *
> + * This internal function performs the migration of the specified GPU
> SVM range
> + * to SRAM. It sets up the migration, populates + dma maps SRAM
> PFNs, and
> + * invokes the driver-specific operations for migration to SRAM.
> + *
> + * Returns:
> + * 0 on success, negative error code on failure.
> + */
> +static int __drm_gpusvm_migrate_to_sram(struct drm_gpusvm
> *gpusvm,
> +					struct vm_area_struct *vas,
> +					struct page *page,
> +					u64 start, u64 end)
> +{
> +	struct migrate_vma migrate = {
> +		.vma		= vas,
> +		.pgmap_owner	= gpusvm-
> >device_private_page_owner,
> +		.flags		=
> MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> +		.fault_page	= page,
> +	};
> +	unsigned long npages;
> +	struct page **pages;
> +	dma_addr_t *dma_addr;
> +	void *buf;
> +	int i, err = 0;
> +
> +	mmap_assert_locked(gpusvm->mm);
> +
> +	/* Corner case where the VMA has been partially unmapped */
> +	if (start < vas->vm_start)
> +		start = vas->vm_start;
> +	if (end > vas->vm_end)
> +		end = vas->vm_end;
> +
> +	migrate.start = start;
> +	migrate.end = end;
> +	npages = npages_in_range(start, end);
> +
> +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) +
> sizeof(*dma_addr) +
> +		       sizeof(*pages), GFP_KERNEL);
> +	if (!buf) {
> +		err = -ENOMEM;
> +		goto err_out;
> +	}
> +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr))
> * npages;
> +
> +	migrate.vma = vas;
> +	migrate.src = buf;
> +	migrate.dst = migrate.src + npages;
> +
> +	err = migrate_vma_setup(&migrate);
> +	if (err)
> +		goto err_free;
> +
> +	/* Raced with another CPU fault, nothing to do */
> +	if (!migrate.cpages)
> +		goto err_free;
> +
> +	err = drm_gpusvm_migrate_populate_sram_pfn(vas, npages,
> +						   migrate.src,
> migrate.dst,
> +						   start);
> +	if (err)
> +		goto err_finalize;
> +
> +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> dma_addr,
> +					   migrate.dst, npages,
> +					   DMA_BIDIRECTIONAL);
> +	if (err)
> +		goto err_finalize;
> +
> +	for (i = 0; i < npages; ++i)
> +		pages[i] = migrate_pfn_to_page(migrate.src[i]);
> +
> +	err = gpusvm->ops->copy_to_sram(gpusvm, pages,
> dma_addr, npages);
> +	if (err)
> +		goto err_finalize;
> +
> +err_finalize:
> +	if (err)
> +		drm_gpusvm_migration_put_pages(npages,
> migrate.dst);
> +	migrate_vma_pages(&migrate);
> +	migrate_vma_finalize(&migrate);
> +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev,
> dma_addr, npages,
> +				       DMA_BIDIRECTIONAL);
> +err_free:
> +	kvfree(buf);
> +err_out:
> +	mmap_assert_locked(gpusvm->mm);
> +
> +	return err;
> +}
> +
> +/**
> + * drm_gpusvm_migrate_to_sram - Migrate (evict) GPU SVM range
> to SRAM
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + * @ctx: GPU SVM context
> + *
> + * This function initiates the migration of the specified GPU SVM
> range to
> + * SRAM. It performs necessary checks and invokes the internal
> migration
> + * function for actual migration.
> + *
> + * Returns:
> + * 0 on success, negative error code on failure.
> + */
> +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> +			       struct drm_gpusvm_range *range,
> +			       const struct drm_gpusvm_ctx *ctx)
> +{
> +	u64 start = range->va.start, end = range->va.end;
> +	struct mm_struct *mm = gpusvm->mm;
> +	struct vm_area_struct *vas;
> +	int err;
> +	bool retry = false;
> +
> +	if (!ctx->mmap_locked) {
> +		if (!mmget_not_zero(mm)) {
> +			err = -EFAULT;
> +			goto err_out;
> +		}
> +		if (ctx->trylock_mmap) {
> +			if (!mmap_read_trylock(mm))  {
> +				err =
> drm_gpusvm_evict_to_sram(gpusvm, range);
> +				goto err_mmput;
> +			}
> +		} else {
> +			mmap_read_lock(mm);
> +		}
> +	}
> +
> +	mmap_assert_locked(mm);
> +
> +	/*
> +	 * Loop required to find all VMA area structs for the corner
> case when
> +	 * VRAM backing has been partially unmapped from MM's
> address space.
> +	 */
> +again:
> +	vas = find_vma(mm, start);
> +	if (!vas) {
> +		if (!retry)
> +			err = -ENOENT;
> +		goto err_mmunlock;
> +	}
> +
> +	if (end <= vas->vm_start || start >= vas->vm_end) {
> +		if (!retry)
> +			err = -EINVAL;
> +		goto err_mmunlock;
> +	}
> +
> +	err = __drm_gpusvm_migrate_to_sram(gpusvm, vas, NULL,
> start, end);
> +	if (err)
> +		goto err_mmunlock;
> +
> +	if (vas->vm_end < end) {
> +		retry = true;
> +		start = vas->vm_end;
> +		goto again;
> +	}
> +
> +	if (!ctx->mmap_locked) {
> +		mmap_read_unlock(mm);
> +		/*
> +		 * Using mmput_async as this function can be called
> while
> +		 * holding a dma-resv lock, and a final put can grab the
> mmap
> +		 * lock, causing a lock inversion.
> +		 */
> +		mmput_async(mm);
> +	}
> +
> +	return 0;
> +
> +err_mmunlock:
> +	if (!ctx->mmap_locked)
> +		mmap_read_unlock(mm);
> +err_mmput:
> +	if (!ctx->mmap_locked)
> +		mmput_async(mm);
> +err_out:
> +	return err;
> +}
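
For the eviction case the expected usage seems to be along these lines
(sketch; the example_* names are placeholders):

/* Called from the driver's eviction path, possibly under dma-resv. */
static int example_evict_range(struct example_vm *vm,
			       struct drm_gpusvm_range *range)
{
	struct drm_gpusvm_ctx ctx = { .trylock_mmap = true };

	/*
	 * Falls back to drm_gpusvm_evict_to_sram() internally when the mmap
	 * lock cannot be taken, avoiding a dma-resv -> mmap lock inversion.
	 */
	return drm_gpusvm_migrate_to_sram(&vm->svm, range, &ctx);
}
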
> +
> +/**
> + * drm_gpusvm_page_free - Put GPU SVM zone device data
> associated with a page
> + * @page: Pointer to the page
> + *
> + * This function is a callback used to put the GPU SVM zone device
> data
> + * associated with a page when it is being released.
> + */
> +static void drm_gpusvm_page_free(struct page *page)
> +{
> +	drm_gpusvm_zdd_put(page->zone_device_data);
> +}
> +
> +/**
> + * drm_gpusvm_migrate_to_ram - Migrate GPU SVM range to RAM
> (page fault handler)
> + * @vmf: Pointer to the fault information structure
> + *
> + * This function is a page fault handler used to migrate a GPU SVM
> range to RAM.
> + * It retrieves the GPU SVM range information from the faulting
> page and invokes
> + * the internal migration function to migrate the range back to RAM.
> + *
> + * Returns:
> + * VM_FAULT_SIGBUS on failure, 0 on success.
> + */
> +static vm_fault_t drm_gpusvm_migrate_to_ram(struct vm_fault
> *vmf)
> +{
> +	struct drm_gpusvm_zdd *zdd = vmf->page-
> >zone_device_data;
> +	int err;
> +
> +	err = __drm_gpusvm_migrate_to_sram(zdd->range-
> >gpusvm,
> +					   vmf->vma, vmf->page,
> +					   zdd->range->va.start,
> +					   zdd->range->va.end);
> +
> +	return err ? VM_FAULT_SIGBUS : 0;
> +}
> +
> +/**
> + * drm_gpusvm_pagemap_ops - Device page map operations for
> GPU SVM
> + */
> +static const struct dev_pagemap_ops drm_gpusvm_pagemap_ops =
> {
> +	.page_free = drm_gpusvm_page_free,
> +	.migrate_to_ram = drm_gpusvm_migrate_to_ram,
> +};
> +
> +/**
> + * drm_gpusvm_pagemap_ops_get - Retrieve GPU SVM device
> page map operations
> + *
> + * Returns:
> + * Pointer to the GPU SVM device page map operations structure.
> + */
> +const struct dev_pagemap_ops
> *drm_gpusvm_pagemap_ops_get(void)
> +{
> +	return &drm_gpusvm_pagemap_ops;
> +}
> +
> +/**
> + * drm_gpusvm_has_mapping - Check if GPU SVM has mapping for
> the given address range
> + * @gpusvm: Pointer to the GPU SVM structure.
> + * @start: Start address
> + * @end: End address
> + *
> + * Returns:
> + * True if GPU SVM has mapping, False otherwise
> + */
> +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64
> start, u64 end)
> +{
> +	struct drm_gpusvm_notifier *notifier;
> +
> +	drm_gpusvm_for_each_notifier(notifier, gpusvm, start, end)
> {
> +		struct drm_gpusvm_range *range = NULL;
> +
> +		drm_gpusvm_for_each_range(range, notifier, start,
> end)
> +			return true;
> +	}
> +
> +	return false;
> +}
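
As a usage note, this looks intended for the bind path, e.g. something like
the following (hypothetical check, the op fields are placeholders):

/* Reject a BO bind that overlaps an existing SVM mapping. */
if (drm_gpusvm_has_mapping(&vm->svm, op->addr, op->addr + op->range))
	return -EBUSY;
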
> diff --git a/drivers/gpu/drm/xe/drm_gpusvm.h
> b/drivers/gpu/drm/xe/drm_gpusvm.h
> new file mode 100644
> index 000000000000..0ea70f8534a8
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/drm_gpusvm.h
> @@ -0,0 +1,415 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2024 Intel Corporation
> + */
> +
> +#ifndef __DRM_GPUSVM_H__
> +#define __DRM_GPUSVM_H__
> +
> +#include <linux/kref.h>
> +#include <linux/mmu_notifier.h>
> +#include <linux/workqueue.h>
> +
> +struct dev_pagemap_ops;
> +struct drm_device;
> +struct drm_gpusvm;
> +struct drm_gpusvm_notifier;
> +struct drm_gpusvm_ops;
> +struct drm_gpusvm_range;
> +
> +/**
> + * struct drm_gpusvm_ops - Operations structure for GPU SVM
> + *
> + * This structure defines the operations for GPU Shared Virtual
> Memory (SVM).
> + * These operations are provided by the GPU driver to manage SVM
> ranges and
> + * perform operations such as migration between VRAM and system
> RAM.
> + */
> +struct drm_gpusvm_ops {
> +	/**
> +	 * @notifier_alloc: Allocate a GPU SVM notifier (optional)
> +	 *
> +	 * This function shall allocate a GPU SVM notifier.
> +	 *
> +	 * Returns:
> +	 * Pointer to the allocated GPU SVM notifier on success, NULL
> on failure.
> +	 */
> +	struct drm_gpusvm_notifier *(*notifier_alloc)(void);
> +
> +	/**
> +	 * @notifier_free: Free a GPU SVM notifier (optional)
> +	 * @notifier: Pointer to the GPU SVM notifier to be freed
> +	 *
> +	 * This function shall free a GPU SVM notifier.
> +	 */
> +	void (*notifier_free)(struct drm_gpusvm_notifier *notifier);
> +
> +	/**
> +	 * @range_alloc: Allocate a GPU SVM range (optional)
> +	 * @gpusvm: Pointer to the GPU SVM
> +	 *
> +	 * This function shall allocate a GPU SVM range.
> +	 *
> +	 * Returns:
> +	 * Pointer to the allocated GPU SVM range on success, NULL
> on failure.
> +	 */
> +	struct drm_gpusvm_range *(*range_alloc)(struct
> drm_gpusvm *gpusvm);
> +
> +	/**
> +	 * @range_free: Free a GPU SVM range (optional)
> +	 * @range: Pointer to the GPU SVM range to be freed
> +	 *
> +	 * This function shall free a GPU SVM range.
> +	 */
> +	void (*range_free)(struct drm_gpusvm_range *range);
> +
> +	/**
> +	 * @vram_release: Release VRAM allocation (optional)
> +	 * @vram_allocation: Driver-private pointer to the VRAM
> allocation
> +	 *
> +	 * This function shall release VRAM allocation and expects to
> drop a
> +	 * reference to VRAM allocation.
> +	 */
> +	void (*vram_release)(void *vram_allocation);
> +
> +	/**
> +	 * @populate_vram_pfn: Populate VRAM PFN (required for
> migration)
> +	 * @gpusvm: Pointer to the GPU SVM
> +	 * @vram_allocation: Driver-private pointer to the VRAM
> allocation
> +	 * @npages: Number of pages to populate
> +	 * @pfn: Array of page frame numbers to populate
> +	 *
> +	 * This function shall populate VRAM page frame numbers
> (PFN).
> +	 *
> +	 * Returns:
> +	 * 0 on success, a negative error code on failure.
> +	 */
> +	int (*populate_vram_pfn)(struct drm_gpusvm *gpusvm,
> +				 void *vram_allocation,
> +				 unsigned long npages,
> +				 unsigned long *pfn);
> +
> +	/**
> +	 * @copy_to_vram: Copy to VRAM (required for migration)
> +	 * @gpusvm: Pointer to the GPU SVM
> +	 * @pages: Pointer to array of VRAM pages (destination)
> +	 * @dma_addr: Pointer to array of DMA addresses (source)
> +	 * @npages: Number of pages to copy
> +	 *
> +	 * This function shall copy pages to VRAM.
> +	 *
> +	 * Returns:
> +	 * 0 on success, a negative error code on failure.
> +	 */
> +	int (*copy_to_vram)(struct drm_gpusvm *gpusvm,
> +			    struct page **pages,
> +			    dma_addr_t *dma_addr,
> +			    unsigned long npages);
> +
> +	/**
> +	 * @copy_to_sram: Copy to system RAM (required for
> migration)
> +	 * @gpusvm: Pointer to the GPU SVM
> +	 * @pages: Pointer to array of VRAM pages (source)
> +	 * @dma_addr: Pointer to array of DMA addresses
> (destination)
> +	 * @npages: Number of pages to copy
> +	 *
> +	 * This function shall copy pages to system RAM.
> +	 *
> +	 * Returns:
> +	 * 0 on success, a negative error code on failure.
> +	 */
> +	int (*copy_to_sram)(struct drm_gpusvm *gpusvm,
> +			    struct page **pages,
> +			    dma_addr_t *dma_addr,
> +			    unsigned long npages);
> +
> +	/**
> +	 * @invalidate: Invalidate GPU SVM notifier (required)
> +	 * @gpusvm: Pointer to the GPU SVM
> +	 * @notifier: Pointer to the GPU SVM notifier
> +	 * @mmu_range: Pointer to the mmu_notifier_range
> structure
> +	 *
> +	 * This function shall invalidate the GPU page tables. It can
> safely
> +	 * walk the notifier range RB tree/list in this function. Called
> while
> +	 * holding the notifier lock.
> +	 */
> +	void (*invalidate)(struct drm_gpusvm *gpusvm,
> +			   struct drm_gpusvm_notifier *notifier,
> +			   const struct mmu_notifier_range
> *mmu_range);
> +};
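
A VRAM-capable driver would then wire these up roughly as follows (sketch;
the example_* implementations are placeholders):

static const struct drm_gpusvm_ops example_gpusvm_ops = {
	.vram_release		= example_vram_release,
	.populate_vram_pfn	= example_populate_vram_pfn,
	.copy_to_vram		= example_copy_to_vram,
	.copy_to_sram		= example_copy_to_sram,
	.invalidate		= example_invalidate,	/* required */
};
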
> +
> +/**
> + * struct drm_gpusvm_notifier - Structure representing a GPU SVM
> notifier
> + *
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: MMU interval notifier
> + * @interval: Interval for the notifier
> + * @rb: Red-black tree node for the parent GPU SVM structure
> notifier tree
> + * @root: Cached root node of the RB tree containing ranges
> + * @range_list: List head of ranges in the same order they
> appear in
> + *              interval tree. This is useful to keep iterating ranges while
> + *              doing modifications to RB tree.
> + * @flags.removed: Flag indicating whether the MMU interval
> notifier has been
> + *                 removed
> + *
> + * This structure represents a GPU SVM notifier.
> + */
> +struct drm_gpusvm_notifier {
> +	struct drm_gpusvm *gpusvm;
> +	struct mmu_interval_notifier notifier;
> +	struct {
> +		u64 start;
> +		u64 end;
> +	} interval;
> +	struct {
> +		struct rb_node node;
> +		struct list_head entry;
> +		u64 __subtree_last;
> +	} rb;
> +	struct rb_root_cached root;
> +	struct list_head range_list;
> +	struct {
> +		u32 removed : 1;
> +	} flags;
> +};
> +
> +/**
> + * struct drm_gpusvm_range - Structure representing a GPU SVM
> range
> + *
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: Pointer to the GPU SVM notifier
> + * @refcount: Reference count for the range
> + * @rb: Red-black tree node for the parent GPU SVM notifier
> structure range tree
> + * @va: Virtual address range
> + * @notifier_seq: Notifier sequence number of the range's pages
> + * @pages: Pointer to the array of pages (if backing store is in VRAM)
> + * @dma_addr: DMA address array (if backing store is SRAM and
> DMA mapped)
> + * @vram_allocation: Driver-private pointer to the VRAM allocation
> + * @order: Order of dma mapping. i.e. PAGE_SIZE << order is
> mapping size
> + * @flags.migrate_vram: Flag indicating whether the range can be
> migrated to VRAM
> + * @flags.unmapped: Flag indicating if the range has been
> unmapped
> + * @flags.partial_unmap: Flag indicating if the range has been
> partially unmapped
> + * @flags.has_vram_pages: Flag indicating if the range has vram
> pages
> + * @flags.has_dma_mapping: Flag indicating if the range has a DMA
> mapping
> + * @flags.kfree_mapping: Flag indicating @dma_addr is a compact
> allocation based
> + *                       on @order which releases via kfree
> + *
> + * This structure represents a GPU SVM range used for tracking
> memory ranges
> + * mapped in a DRM device.
> + */
> +struct drm_gpusvm_range {
> +	struct drm_gpusvm *gpusvm;
> +	struct drm_gpusvm_notifier *notifier;
> +	struct kref refcount;
> +	struct {
> +		struct rb_node node;
> +		struct list_head entry;
> +		u64 __subtree_last;
> +	} rb;
> +	struct {
> +		u64 start;
> +		u64 end;
> +	} va;
> +	unsigned long notifier_seq;
> +	union {
> +		struct page **pages;
> +		dma_addr_t *dma_addr;
> +	};
> +	void *vram_allocation;
> +	u16 order;
> +	struct {
> +		/* All flags below must be set upon creation */
> +		u16 migrate_vram : 1;
> +		/* All flags below must be set / cleared under notifier
> lock */
> +		u16 unmapped : 1;
> +		u16 partial_unmap : 1;
> +		u16 has_vram_pages : 1;
> +		u16 has_dma_mapping : 1;
> +		u16 kfree_mapping : 1;
> +	} flags;
> +};
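
If I read the kfree_mapping/order handling in drm_gpusvm_range_get_pages()
right, a driver consuming @dma_addr for PTE setup would do something like
this (sketch, my reading only):

/* DMA address of the i-th PAGE_SIZE page within the range. */
static dma_addr_t example_range_dma_addr(struct drm_gpusvm_range *range,
					 unsigned long i)
{
	return range->dma_addr[i >> range->order] +
	       ((i & ((1ul << range->order) - 1)) << PAGE_SHIFT);
}
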
> +
> +/**
> + * struct drm_gpusvm - GPU SVM structure
> + *
> + * @name: Name of the GPU SVM
> + * @drm: Pointer to the DRM device structure
> + * @mm: Pointer to the mm_struct for the address space
> + * @device_private_page_owner: Device private pages owner
> + * @mm_start: Start address of GPU SVM
> + * @mm_range: Range of the GPU SVM
> + * @notifier_size: Size of individual notifiers
> + * @ops: Pointer to the operations structure for GPU SVM
> + * @chunk_sizes: Pointer to the array of chunk sizes used in range
> allocation.
> + *               Entries should be powers of 2 in descending order.
> + * @num_chunks: Number of chunks
> + * @notifier_lock: Read-write semaphore for protecting notifier
> operations
> + * @zdd_wq: Workqueue for deferred work on zdd destruction
> + * @root: Cached root node of the Red-Black tree containing GPU
> SVM notifiers
> + * @notifier_list: List head of notifiers in the same order
> they
> + *                 appear in interval tree. This is useful to keep iterating
> + *                 notifiers while doing modifications to RB tree.
> + *
> + * This structure represents a GPU SVM (Shared Virtual Memory)
> used for tracking
> + * memory ranges mapped in a DRM (Direct Rendering Manager)
> device.
> + *
> + * No reference counting is provided, as this is expected to be
> embedded in the
> + * driver VM structure along with the struct drm_gpuvm, which
> handles reference
> + * counting.
> + */
> +struct drm_gpusvm {
> +	const char *name;
> +	struct drm_device *drm;
> +	struct mm_struct *mm;
> +	void *device_private_page_owner;
> +	u64 mm_start;
> +	u64 mm_range;
> +	u64 notifier_size;
> +	const struct drm_gpusvm_ops *ops;
> +	const u64 *chunk_sizes;
> +	int num_chunks;
> +	struct rw_semaphore notifier_lock;
> +	struct workqueue_struct *zdd_wq;
> +	struct rb_root_cached root;
> +	struct list_head notifier_list;
> +};

I also think the gpusvm concept is a duplication of drm_gpuvm.
Look at the members here: mm_start, mm_range, the RB tree...

Maintaining a list of notifier at this layer is odd. Everybody else seems
Embed the notifier in a range...

Mm field is essential for svm though. I think what we can do is, introduce a
*mm field in drm_gpuvm and introduce uAPI to allow user to say one gpuvm
Participate svm. If one gpuvm participate svm, we set the mm field for this
Gpuvm.

Another benefit of the proposed way is, multiple gpuvms can share address space
With single cpu mm process.
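
Something along these lines, purely as an illustration of the idea (the
field and helper below are hypothetical, not existing drm_gpuvm code):

	struct drm_gpuvm {
		/* ... existing members ... */
		struct mm_struct *mm;	/* NULL unless this gpuvm participates in SVM */
	};

	/* Hypothetical helper called from the uAPI path that opts a VM into SVM */
	static void drm_gpuvm_enable_svm(struct drm_gpuvm *gpuvm)
	{
		mmgrab(current->mm);	/* keep the mm_struct itself alive */
		gpuvm->mm = current->mm;
	}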


Oak


> +
> +/**
> + * struct drm_gpusvm_ctx - DRM GPU SVM context
> + *
> + * @mmap_locked: mmap lock is locked
> + * @trylock_mmap: trylock mmap lock, used to avoid locking inversions
> + *                (e.g. dma-resv -> mmap lock)
> + * @in_notifier: entering from a MMU notifier
> + * @read_only: operating on read-only memory
> + * @vram_possible: possible to use VRAM
> + * @prefault: prefault pages
> + *
> + * Context that DRM GPU SVM is operating in (i.e. user arguments).
> + */
> +struct drm_gpusvm_ctx {
> +	u32 mmap_locked :1;
> +	u32 trylock_mmap :1;
> +	u32 in_notifier :1;
> +	u32 read_only :1;
> +	u32 vram_possible :1;
> +	u32 prefault :1;
> +};
> +
> +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> +		    const char *name, struct drm_device *drm,
> +		    struct mm_struct *mm, void *device_private_page_owner,
> +		    u64 mm_start, u64 mm_range, u64 notifier_size,
> +		    const struct drm_gpusvm_ops *ops,
> +		    const u64 *chunk_sizes, int num_chunks);
> +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm);
> +void drm_gpusvm_free(struct drm_gpusvm *gpusvm);
> +
> +struct drm_gpusvm_range *
> +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64 fault_addr,
> +				u64 gpuva_start, u64 gpuva_end,
> +				const struct drm_gpusvm_ctx *ctx);
> +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> +			     struct drm_gpusvm_range *range);
> +
> +struct drm_gpusvm_range *
> +drm_gpusvm_range_get(struct drm_gpusvm_range *range);
> +void drm_gpusvm_range_put(struct drm_gpusvm_range *range);
> +
> +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> +				  struct drm_gpusvm_range *range);
> +
> +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> +			       struct drm_gpusvm_range *range,
> +			       const struct drm_gpusvm_ctx *ctx);
> +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> +				  struct drm_gpusvm_range *range,
> +				  const struct drm_gpusvm_ctx *ctx);
> +
> +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> +			       struct drm_gpusvm_range *range,
> +			       void *vram_allocation,
> +			       const struct drm_gpusvm_ctx *ctx);
> +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> +			       struct drm_gpusvm_range *range,
> +			       const struct drm_gpusvm_ctx *ctx);
> +
> +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void);
> +
> +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start, u64 end);
> +
> +struct drm_gpusvm_range *
> +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64 start, u64 end);
> +
> +/**
> + * drm_gpusvm_notifier_lock - Lock GPU SVM notifier
> + * @gpusvm__: Pointer to the GPU SVM structure.
> + *
> + * Abstract client usage GPU SVM notifier lock, take lock
> + */
> +#define drm_gpusvm_notifier_lock(gpusvm__)	\
> +	down_read(&(gpusvm__)->notifier_lock)
> +
> +/**
> + * drm_gpusvm_notifier_unlock - Unlock GPU SVM notifier
> + * @gpusvm__: Pointer to the GPU SVM structure.
> + *
> + * Abstract client usage GPU SVM notifier lock, drop lock
> + */
> +#define drm_gpusvm_notifier_unlock(gpusvm__)	\
> +	up_read(&(gpusvm__)->notifier_lock)
> +
> +/**
> + * __drm_gpusvm_range_next - Get the next GPU SVM range in the list
> + * @range: a pointer to the current GPU SVM range
> + *
> + * Return: A pointer to the next drm_gpusvm_range if available, or NULL if the
> + *         current range is the last one or if the input range is NULL.
> + */
> +static inline struct drm_gpusvm_range *
> +__drm_gpusvm_range_next(struct drm_gpusvm_range *range)
> +{
> +	if (range && !list_is_last(&range->rb.entry,
> +				   &range->notifier->range_list))
> +		return list_next_entry(range, rb.entry);
> +
> +	return NULL;
> +}
> +
> +/**
> + * drm_gpusvm_for_each_range - Iterate over GPU SVM ranges in a notifier
> + * @range__: Iterator variable for the ranges. If set, it indicates the start of
> + *	     the iterator. If NULL, call drm_gpusvm_range_find() to get the range.
> + * @notifier__: Pointer to the GPU SVM notifier
> + * @start__: Start address of the range
> + * @end__: End address of the range
> + *
> + * This macro is used to iterate over GPU SVM ranges in a notifier. It is safe
> + * to use while holding the driver SVM lock or the notifier lock.
> + */
> +#define drm_gpusvm_for_each_range(range__, notifier__, start__, end__)	\
> +	for ((range__) = (range__) ?:					\
> +	     drm_gpusvm_range_find((notifier__), (start__), (end__));	\
> +	     (range__) && (range__->va.start < (end__));		\
> +	     (range__) = __drm_gpusvm_range_next(range__))
> +
> +/**
> + * drm_gpusvm_range_set_unmapped - Mark a GPU SVM range as unmapped
> + * @range: Pointer to the GPU SVM range structure.
> + * @mmu_range: Pointer to the MMU notifier range structure.
> + *
> + * This function marks a GPU SVM range as unmapped and sets the partial_unmap flag
> + * if the range partially falls within the provided MMU notifier range.
> + */
> +static inline void
> +drm_gpusvm_range_set_unmapped(struct drm_gpusvm_range *range,
> +			      const struct mmu_notifier_range *mmu_range)
> +{
> +	lockdep_assert_held_write(&range->gpusvm->notifier_lock);
> +
> +	range->flags.unmapped = true;
> +	if (range->va.start < mmu_range->start ||
> +	    range->va.end > mmu_range->end)
> +		range->flags.partial_unmap = true;
> +}
> +
> +#endif /* __DRM_GPUSVM_H__ */
> --
> 2.34.1


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory
  2024-09-02 17:03         ` Matthew Brost
@ 2024-09-11 16:06           ` Matthew Brost
  0 siblings, 0 replies; 100+ messages in thread
From: Matthew Brost @ 2024-09-11 16:06 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: intel-xe, dri-devel, airlied, christian.koenig, thomas.hellstrom,
	matthew.auld, daniel

On Mon, Sep 02, 2024 at 05:03:20PM +0000, Matthew Brost wrote:
> On Mon, Sep 02, 2024 at 01:53:14PM +0200, Daniel Vetter wrote:
> > On Thu, Aug 29, 2024 at 05:27:13PM +0000, Matthew Brost wrote:
> > > On Thu, Aug 29, 2024 at 11:45:08AM +0200, Daniel Vetter wrote:
> > > > On Tue, Aug 27, 2024 at 07:48:38PM -0700, Matthew Brost wrote:
> > > > > This patch introduces support for GPU Shared Virtual Memory (SVM) in the
> > > > > Direct Rendering Manager (DRM) subsystem. SVM allows for seamless
> > > > > sharing of memory between the CPU and GPU, enhancing performance and
> > > > > flexibility in GPU computing tasks.
> > > > > 
> > > > > The patch adds the necessary infrastructure for SVM, including data
> > > > > structures and functions for managing SVM ranges and notifiers. It also
> > > > > provides mechanisms for allocating, deallocating, and migrating memory
> > > > > regions between system RAM and GPU VRAM.
> > > > > 
> > > > > This mid-layer is largely inspired by GPUVM.
> > > > > 
> > > > > Cc: Dave Airlie <airlied@redhat.com>
> > > > > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > > > Cc: Christian König <christian.koenig@amd.com>
> > > > > Cc: <dri-devel@lists.freedesktop.org>
> > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > 
> > > > Still not sure I've got the right race that you paper over with
> > > > mmap_write_lock, but I spotted a few things, comments inline.
> > > > 
> > > 
> > > I've replied to this issue several times, let's table the
> > > mmap_write_lock issue in this reply - a lot of other things to get
> > > through. Current thinking is try to add a range->migrate_lock like AMD
> > > which I state here [1]. Let's continue discussing the mmap lock issue
> > > there if possible.
> > 
> > Yeah I wrote replies as I read code, so there's a bit a mess from my side
> > here. Apologies for that.
> > 
> 
> All good, has been quite helpful thus far.
> 
> > > [1] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1111169
> > 
> > Some more replies below that I think we haven't covered anywhere else yet.
> > 
> > > > > + * 2) Garbage Collector.
> > > > > + *
> > > > > + *	void __driver_garbage_collector(struct drm_gpusvm *gpusvm,
> > > > > + *					struct drm_gpusvm_range *range)
> > > > > + *	{
> > > > > + *		struct drm_gpusvm_ctx ctx = {};
> > > > > + *
> > > > > + *		assert_driver_svm_locked(gpusvm);
> > > > > + *
> > > > > + *		// Partial unmap, migrate any remaining VRAM pages back to SRAM
> > > > > + *		if (range->flags.partial_unmap)
> > > > > + *			drm_gpusvm_migrate_to_sram(gpusvm, range, &ctx);
> > > > 
> > > > Note that the migration back to sram isn't guaranteed to succeed, so you
> > > > might be still stuck with partially migrated range. This might be a case
> > > > where hmm gives you vram pfns, but the range you have doesn't have any
> > > > vram allocation anymore because you droppped it here. Not sure tbh.
> > > >
> > > 
> > > HMM isn't in the picture here, nor will a VMA be, once the
> > > drm_gpusvm_evict_to_sram path is always taken as discussed here [2]. I
> > > might have a corner case BO refcounting / TTM resource lookup bug
> > > somewhere in here which needs to be resolved though (e.g. eviction
> > > racing with this code path); will try to close on that.
> > > 
> > > [2] https://patchwork.freedesktop.org/patch/610955/?series=137870&rev=1#comment_1111164
> > 
> > So maybe my understanding is wrong, but from my reading of the device
> > migration code the exact same non-guarantees as for the sram2sram
> > migration code apply:
> > 
> > - There's no guarantee the page/folio doesn't have an elevated refcount,
> >   which makes the migration fail (in try_to_migrate, where it checks for
> >   surplus refcounts).
> > 
> > - There's no guarantee you'll get the page/folio lock, which makes the
> >   migration fail. Worse the core mm seems to use a fallback to per-page
> >   locking as it's extremely crude "get out of deadlocks due to acquiring
> >   multiple page locks" card.
> >
> 
> I think this circles back to basically the design must be able to move
> VRAM -> SRAM because the host can't access VRAM. Certainly in the CPU
> page fault path this can't fail on the faulting page at least or if it
> does the app gets segfaulted. I'll investigate more here but that is
> still my current thinking. If VRAM -> SRAM can fail / make partial
> progress in eviction paths, then mixed mappings likely need to be
> supported which shouldn't be all that painful - basically just need
> cursor in the bind code which can walk mixed mappings.
> 
> SRAM -> VRAM certainly can fail which is handled by just aborting the
> migration.
> 
> > > > > +map_pages:
> > > > > +	if (is_device_private_page(hmm_pfn_to_page(pfns[0]))) {
> > > > > +		WARN_ON_ONCE(!range->vram_allocation);
> > > > > +
> > > > > +		for (i = 0; i < npages; ++i) {
> > > > > +			pages[i] = hmm_pfn_to_page(pfns[i]);
> > > > > +
> > > > > +			if (WARN_ON_ONCE(!is_device_private_page(pages[i]))) {
> > > > > +				err = -EOPNOTSUPP;
> > > > > +				goto err_free;
> > > > > +			}
> > > > > +		}
> > > > 
> > > > You can't do the above, because the pfn you get from hmm come with zero
> > > > guarantees, you neither hold a page reference nor the page lock. The only
> > > > thing you can do is grab the pagetable lock (or mmu notifier locks) and
> > > > check it's still valid, before you can touch any state. I think the
> > > > range->vram_allocation is probably always valid since you clean that up
> > > > under the same lock/thread, but there's good chances the vram allocation
> > > > is otherwise already gone for good. Or you get an inconsistent snapshot.
> > > > 
> > > 
> > > I haven't seen this pop in my testing yet which is fairly thorough. My
> > > thinking was with migration always being enforced at range granularity we'd
> > > never get mixed mappings from the core as migration is completely under
> > > control of the driver. Maybe I'm not understanding what you are saying
> > > here...
> > 
> > So one scenario is that you race (without the mmap write lock or the
> > migration_mutex design ofc) with another invalidate, and get a partial
> > view here of mixed vram and sram pages. Until you acquire the mmu notifier
> > lock and have made sure your pages are still valid, there's essentially no
> > guarantee.
> 
> The pages are collected in notifier stable state via the hmm locking +
> seqno begin and recheck. Before they can be used (e.g. program a bind) yes
> the notifier lock needs to be taken to ensure they haven't changed
> between collection and use - at least this is my understanding.
> 
> > > 
> > > > > +
> > > > > +		/* Do not race with notifier unmapping pages */
> > > > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > > > +		range->flags.has_vram_pages = true;
> > > > > +		range->pages = pages;
> > > > > +		if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
> > > > > +			err = -EAGAIN;
> > > > > +			__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > > > > +		}
> > > > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > > > +	} else {
> > > > > +		dma_addr_t *dma_addr = (dma_addr_t *)pfns;
> > > > > +
> > > > > +		for_each_dma_page(i, j, npages, order) {
> > > > > +			if (WARN_ON_ONCE(i && order !=
> > > > > +					 hmm_pfn_to_map_order(pfns[i]))) {
> > > > > +				err = -EOPNOTSUPP;
> > > > > +				npages = i;
> > > > > +				goto err_unmap;
> > > > > +			}
> > > > > +			order = hmm_pfn_to_map_order(pfns[i]);
> > > > > +
> > > > > +			pages[j] = hmm_pfn_to_page(pfns[i]);
> > > > > +			if (WARN_ON_ONCE(is_zone_device_page(pages[j]))) {
> > > > > +				err = -EOPNOTSUPP;
> > > > > +				npages = i;
> > > > > +				goto err_unmap;
> > > > > +			}
> > > > > +
> > > > > +			set_page_dirty_lock(pages[j]);
> > > > > +			mark_page_accessed(pages[j]);
> > > > 
> > > > You can't do these, because you don't hold a page reference. They're also
> > > > not needed because hmm_range_fault goes thorugh the full mkwrite dance,
> > > > which takes care of these, unlike the gup family of functions.
> > > >
> > > 
> > > This is a leftover from our existing userptr code and it does appear to
> > > be incorrect. Let me remove this and fixup our userptr code while I'm at
> > > it.
> > 
> > Ack.
> > 
> > > > > +	vas = vma_lookup(mm, start);
> > > > > +	if (!vas) {
> > > > > +		err = -ENOENT;
> > > > > +		goto err_mmunlock;
> > > > > +	}
> > > > > +
> > > > > +	if (end > vas->vm_end || start < vas->vm_start) {
> > > > > +		err = -EINVAL;
> > > > > +		goto err_mmunlock;
> > > > > +	}
> > > > > +
> > > > > +	if (!vma_is_anonymous(vas)) {
> > > > > +		err = -EBUSY;
> > > > > +		goto err_mmunlock;
> > > > > +	}
> > > > > +
> > > > > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) + sizeof(*dma_addr) +
> > > > > +		       sizeof(*pages), GFP_KERNEL);
> > > > > +	if (!buf) {
> > > > > +		err = -ENOMEM;
> > > > > +		goto err_mmunlock;
> > > > > +	}
> > > > > +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > > > > +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr)) * npages;
> > > > > +
> > > > > +	zdd = drm_gpusvm_zdd_alloc(range);
> > > > > +	if (!zdd) {
> > > > > +		err = -ENOMEM;
> > > > > +		goto err_free;
> > > > > +	}
> > > > > +
> > > > > +	migrate.vma = vas;
> > > > > +	migrate.src = buf;
> > > > > +	migrate.dst = migrate.src + npages;
> > > > > +
> > > > > +	err = migrate_vma_setup(&migrate);
> > > > > +	if (err)
> > > > > +		goto err_free;
> > > > > +
> > > > > +	/*
> > > > > +	 * FIXME: Below cases, !migrate.cpages and migrate.cpages != npages, not
> > > > > +	 * always an error. Need to revisit possible cases and how to handle. We
> > > > > +	 * could prefault on migrate.cpages != npages via hmm_range_fault.
> > > 
> > > This is a bit stale, can update this comment.
> > > 
> > > > > +	 */
> > > > 
> > > > Yeah I think especially under contention partial migrations, at least back
> > > > to sram due to cpu faults, are pretty much expected. And you need to cope
> > > > somehow.
> > > > 
> > > 
> > > I have seen these pop if the IGT calls mlock on the memory. My thinking
> > > is migration to VRAM is basically optional and fallback to leaving range
> > > in SRAM if an error occurs rather than doing a partial migration. This
> > > is what currently happens so it is coped with.
> > > 
> > > If the memory is marked as must be in VRAM (NIY), well then the user
> > > program has done something wrong and can kill the app (akin to
> > > segfault).
> > 
> > Yeah SIGBUS for "must be in VRAM" sounds like ok semantics.
> > 
> > > > > +
> > > > > +	if (!migrate.cpages) {
> > > > > +		err = -EFAULT;
> > > > > +		goto err_free;
> > > > > +	}
> > > > > +
> > > > > +	if (migrate.cpages != npages) {
> > > > > +		err = -EBUSY;
> > > > > +		goto err_finalize;
> > > > > +	}
> > 
> > What I think is more fundamental is that I think this one here doesn't
> > work. For migrate_to_ram you cannot assume that you can always migrate the
> > entire block, I think to uphold the core mm forward progress rules we need
> > to allow partial migrations there. And I think your current code allows
> > that.
> >
> 
> Yes. I had similar checks in migrate_to_ram at one point and that did
> not work when multiple CPU faults from different threads occurred in
> parallel. Each thread can grab a random set of VRAM pages to migrate I
> think.
>  
> > But that then means you also are stuck with partial migration state here.
> > That was the point I tried to make.
> >
> 
> The error path with migrate_vma_pages/finalize safely unwinds the
> migration in these cases leaving all pages in SRAM.
> 
> > > > > +/**
> > > > > + * __drm_gpusvm_migrate_to_sram - Migrate GPU SVM range to SRAM (internal)
> > > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > > + * @vas: Pointer to the VM area structure
> > > > > + * @page: Pointer to the page for fault handling (can be NULL)
> > > > > + * @start: Start address of the migration range
> > > > > + * @end: End address of the migration range
> > > > > + *
> > > > > + * This internal function performs the migration of the specified GPU SVM range
> > > > > + * to SRAM. It sets up the migration, populates + dma maps SRAM PFNs, and
> > > > > + * invokes the driver-specific operations for migration to SRAM.
> > > > > + *
> > > > > + * Returns:
> > > > > + * 0 on success, negative error code on failure.
> > > > > + */
> > > > > +static int __drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > > > > +					struct vm_area_struct *vas,
> > > > > +					struct page *page,
> > > > > +					u64 start, u64 end)
> > > > > +{
> > > > > +	struct migrate_vma migrate = {
> > > > > +		.vma		= vas,
> > > > > +		.pgmap_owner	= gpusvm->device_private_page_owner,
> > > > > +		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> > > > > +		.fault_page	= page,
> > > > > +	};
> > > > > +	unsigned long npages;
> > > > > +	struct page **pages;
> > > > > +	dma_addr_t *dma_addr;
> > > > > +	void *buf;
> > > > > +	int i, err = 0;
> > > > > +
> > > > > +	mmap_assert_locked(gpusvm->mm);
> > > > 
> > > > That's the wrong mm, at least for the ->migrate_to_ram path. You might be
> > > > called on a anon mapping from a child process. That also means that the
> > > > vma you're looking at might have no relationship with anythign you're
> > > > tracking in your gpusvm.
> > > >
> > > 
> > > Hmm, as discussed [3] I haven't added tests with child processes yet.
> > > Let me do that and update the design as needed. This likely isn't
> > > correct as you say.
> > > 
> > > [3] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1111169 
> > 
> > Ack. More tests should definitely help here to figure out what's up, and
> > what's just me being confused.
> > 
> 
> Starting to add tests this fork() appears to work after dropping these
> asserts. More thorough testing is needed though.
> 
> > > > > +/**
> > > > > + * drm_gpusvm_migrate_to_ram - Migrate GPU SVM range to RAM (page fault handler)
> > > > > + * @vmf: Pointer to the fault information structure
> > > > > + *
> > > > > + * This function is a page fault handler used to migrate a GPU SVM range to RAM.
> > > > > + * It retrieves the GPU SVM range information from the faulting page and invokes
> > > > > + * the internal migration function to migrate the range back to RAM.
> > > > > + *
> > > > > + * Returns:
> > > > > + * VM_FAULT_SIGBUS on failure, 0 on success.
> > > > > + */
> > > > > +static vm_fault_t drm_gpusvm_migrate_to_ram(struct vm_fault *vmf)
> > > > > +{
> > > > > +	struct drm_gpusvm_zdd *zdd = vmf->page->zone_device_data;
> > > > > +	int err;
> > > > > +
> > > > > +	err = __drm_gpusvm_migrate_to_sram(zdd->range->gpusvm,
> > > > 
> > > > So I think zdd->range doesn't work, because even within a single mm the
> > > > vma mapping a given piece of anon memory does not need to be unique, you
> > > > can duplicate them with mremap.
> > > > 
> > > 
> > > This is attached to a page, not a VMA. Both AMD and Nvidia drivers use a
> > > similar lookup mechanism.
> > 
> > Yeah the page->zone_device_data is fine. It's the zone_device_rage->range
> > which I think isn't ok.
> > 
> 
> Yes, this gets a little confusing with fork() and mremap. The range's
> start / end can be nonsense in the remap case. Also as you mention a
> range->migrate_mutex doesn't seem correct either. I can make it work but
> maybe not worth even typing out why here (I can provide a little more
> detail in another reply). New thinking is that the zdd stores a size field
> and has the locking - I think that is akin to a VRAM folio then.
> 
> > > > So all you have here is the physical memory and the vma, which might or
> > > > might not be from the same process as gpusvm->mm.
> > > > 
> > > > Also the child process scenario means you using mmap_write on the fault
> > > > side doesn't stop all cpu faults migrating stuff back.
> > > > 
> > > > Somewhat aside, but I think that means amdkfd's svm_range->migration_mutex
> > > > is busted, because it's va based and so misses concurrently ongoing
> > > > different mappings moving physical storage around underneath.
> > > >
> > > 
> > > I think all of the above which falls into the fork() + child process
> > > issues which you have raise. Until I test this out I can't speak to this
> > > any level of confidence so I won't. Thanks for raising this issue and
> > > let me write test cases as discussed and educate myself. Once I do that,
> > > we can engage in further discussions.
> > 
> > I think fork + childs will still result in zdd->range being unique (albeit
> > confused about which mm). You need mremap of some of these mappings to
> 
> Agree for fork + child based on initial testing.
> 
> > change the addresses and really cause confusion, which I /think/ (but
> > didn't test) is doable with a single process even and duplicating anon
> 
> Yep, mremap changes the address, so storing a range is confusing; really a
> size is sufficient, aligning within the VMA's start / end upon CPU fault.
> AMD does this but with a VMA search which I think is a bit overkill.
> 

Sima gave me something to investigate over the past week or so and asked
me to write up my findings and share them with the list. I'm replying here
because this seems as good a place as any.

A. Investigate possible livelock with do_swap_page taking a device page
reference and folio_migrate_mapping aborting in migrate_vma_* if
multiple references are held.

	Sima was correct in identifying this livelock. I was able to reproduce a
	stable livelock with a test where multiple CPU threads faulted the same
	device page in parallel, and an exclusive lock was taken in
	migrate_to_ram. Without an exclusive lock, forward progress is made, but
	on average, there were ~32k calls to migrate_to_ram before a thread
	succeeded. This issue appears to affect all implementations that use
	device pages.

	I have posted a patch with Sima’s suggested core MM fix on the list [1]
	and verified in the local Xe branch that this patch resolves the
	livelock and reduces multiple calls to migrate_to_ram on the same
	faulting page. It would be helpful to get AMD's input and testing on
	this patch.
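
	For reference, the shape of the livelock is roughly the following
	(illustrative only, the real code paths differ in detail):

		Thread 0                          Thread 1
		do_swap_page()
		  get_page(device page)           do_swap_page()
		  ->migrate_to_ram()                get_page(device page)
		    take exclusive driver lock      ->migrate_to_ram()
		    migrate_vma_setup() /             blocks on the driver lock
		      try_to_migrate() sees the
		      extra reference from T1
		      and skips the page
		    nothing migrated, CPU fault
		    retries, and vice versa

	With the exclusive lock neither thread ever observes the reference
	count the migration core expects, so the faults keep retrying.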

B. Test out fork

	I added a few test sections to my IGT.

	This basically worked due to the COW (Copy-On-Write) semantics of fork.
	Both the parent and child processes fault on their first CPU access,
	getting their own new copy of any memory allocated before the fork.

	I believe the only change needed was dropping a lockdep assert in
	migrate_to_ram, as the MM can change.
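
	Roughly what the new sections exercise, in plain C (not the actual
	IGT code; the real tests also touch the buffer from the GPU first):

		#include <stdlib.h>
		#include <string.h>
		#include <sys/wait.h>
		#include <unistd.h>

		int main(void)
		{
			size_t sz = 2 << 20;
			char *buf = malloc(sz);

			memset(buf, 0xaa, sz);	/* may end up in VRAM after GPU use */

			if (fork() == 0) {
				/* Child: first CPU write faults in its own copy. */
				buf[0] = 0x55;
				_exit(0);
			}

			buf[0] = 0x11;		/* Parent: writes its own copy. */
			wait(NULL);
			free(buf);
			return 0;
		}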

C. MMAP shared with anonymous memory

	I found that this is actually not anonymous memory but rather
	shmem-backed [2] [3]. Only anonymous memory is available for migration,
	so the corner cases discussed related to multiple CPU mappings for
	device pages do not exist.

D. MREMAP Behavior

	I added a few test sections to my IGT to explore possible MREMAP cases.

	In all cases (e.g., DONTUNMAP, DONTUNMAP with read-only...), the old
	memory generates an MMU_NOTIFY_UNMAP event. This aligns well with the
	design, as the old range is simply unmapped. Subsequent CPU or GPU
	access to the old memory has zero-fill semantics.

	The new memory can point to previously allocated device pages. With a
	simple update to the design, the next GPU fault can find these pages and
	map them.
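
	On the userspace side the DONTUNMAP case boils down to something like
	this (again not the actual IGT code):

		#define _GNU_SOURCE
		#include <sys/mman.h>

		/* The old range stays mapped but loses its pages: the GPU sees
		 * MMU_NOTIFY_UNMAP for it and further access is zero-fill. The
		 * returned range may still be backed by the previously
		 * allocated device pages. */
		void *move_buf(void *old, size_t sz)
		{
			return mremap(old, sz, sz,
				      MREMAP_MAYMOVE | MREMAP_DONTUNMAP);
		}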

	MREMAP did reveal some issues with zdd (device page zone_device_data)
	pointing to a range. It was pointed out that having something physical
	pointing to something virtual is nonsensical. I have fixed this locally
	and agree that all references from physical to virtual will be removed
	in the common layer.

E. Locking issues

	Sima strongly suggested not inventing locks for migration to avoid
	races, but rather to accept the core MM races. I removed all locks
	except for the existing Xe locks and eliminated mmap write abuse. With a
	more robust retry loop in the GPU page fault handler, I was able to
	successfully avoid mixed mappings. Whether mixed mappings will be
	supported is a different topic, but in my opinion, this demonstrates
	that a design can work with minimal locking. This will be the design
	moving forward.
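
	As a sketch, the retry loop looks roughly like this (the helpers and
	the notifier field are placeholders, not the actual Xe code):

	again:
		seq = mmu_interval_read_begin(&notifier->notifier);

		/* Collect pages (hmm_range_fault) and optionally migrate to
		 * VRAM, without holding any driver migration locks. */
		err = driver_get_pages_and_migrate(range);
		if (err == -EBUSY)
			goto again;

		drm_gpusvm_notifier_lock(gpusvm);
		if (mmu_interval_read_retry(&notifier->notifier, seq)) {
			/* Raced with invalidation or migrate_to_ram: retry. */
			drm_gpusvm_notifier_unlock(gpusvm);
			goto again;
		}
		driver_commit_bind(range);	/* program the GPU PTEs */
		drm_gpusvm_notifier_unlock(gpusvm);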

Matt

[1] https://patchwork.freedesktop.org/series/138497/
[2] https://elixir.bootlin.com/linux/v6.10.7/source/mm/mmap.c#L2934
[3] https://elixir.bootlin.com/linux/v6.10.7/source/mm/shmem.c#L4941

> Matt
> 
> > memory mappings with mremap.
> > 
> > Cheers, Sima
> > -- 
> > Simona Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory
  2024-09-04 12:27             ` Thomas Hellström
@ 2024-09-24  8:41               ` Simona Vetter
  0 siblings, 0 replies; 100+ messages in thread
From: Simona Vetter @ 2024-09-24  8:41 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: Daniel Vetter, Matthew Brost, intel-xe, dri-devel, airlied,
	christian.koenig, matthew.auld, daniel

On Wed, Sep 04, 2024 at 02:27:15PM +0200, Thomas Hellström wrote:
> Hi, Sima,
> 
> On Mon, 2024-09-02 at 14:33 +0200, Daniel Vetter wrote:
> > Jumping in here in the middle, since I think it's a solid place to
> > drop my
> > idea of "align with core mm" gpusvm locking ...
> > 
> > On Thu, Aug 29, 2024 at 08:56:23PM +0000, Matthew Brost wrote:
> > > On Thu, Aug 29, 2024 at 09:18:29PM +0200, Thomas Hellström wrote:
> > > Issues with removing a SVM range:
> > > 
> > > - Xe bind code stores invalidation / present state in VMA, this
> > > would
> > >   need to be moved to the radix tree. I have Jira open for that
> > > work
> > >   which I believe other developers are going to own.
> > > - Where would the dma mapping / device pages be stored?
> > > 	- In the radix tree? What if ATS is enabled? We don't have a
> > > 	  driver owned radix tree. How do we reasonably connect a driver
> > > 	  owned radix to a common GPUSVM layer?
> > 
> > Yeah this one is really annoying, because the core mm gets away with
> > nothing because it can just store the pfn in the pte. And it doesn't
> > need
> > anything else. So we probably still need something unfortunately ...
> > 
> > > 	- In the notifier? What if the notifier is sparsely populated?
> > > 	  We would be wasting huge amounts of memory. What if the
> > > 	  notifier is configured to span the entire virtual address
> > > 	  space?
> > 
> > So if we go with the radix idea, we could model the radix to exactly
> > match
> > the gpu pagetables. That's essentially what the core mm does. Then
> > each
> > pagetable at each level has a spinlock for essentially a range lock.
> > notifier seqno would be stored into each pagetable (not the
> > endividual
> > entries, that's probably too much), which should allow us to very
> > effeciently check whether an entire arbitrary va range is still valid
> > on
> > the fault side.
> 
> I still wonder whether this should be owned by the driver, though. And
> if we were optimizing for multiple simultaneous fault processing with a
> small granularity, I would agree, but given that gpu pagefaults are
> considered so slow they should be avoided, I wonder whether xe's
> current approach of a single page-table lock wouldn't suffice, in
> addition to a semi-global seqno?
> 
> For invalidations, I think we actually currently allow simultaneous
> overlapping invalidations that are only protected by the write-side of
> the notifier seqno.

Yeah I think this is just a long-term design point: As long as the
pagetable locking is conceptually a range thing I agree it doesn't matter
what we start out with, as long as it's somewhere on the line between a
global lock and the over-the-top scalable radix tree per-pagetable node
approach core mm has.
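
To make that line concrete, the per-pagetable-node end of it would look
roughly like this (purely illustrative, nothing in the series does this):

	struct driver_pt_dir {
		spinlock_t lock;		/* range lock for this subtree */
		unsigned long notifier_seq;	/* seqno this subtree was last valid at */
		/* ... entries / child pointers ... */
	};

The fault side then only needs to check that every node covering the
faulted va range carries a current seqno; the single global notifier_lock
in this series sits at the other end of the same line.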

> > On the notifier side we can also very efficiently walk arbitrary ranges,
> > because the locking is really fine-grained and in an adaptive way.
> > 
> > > - How does the garbage collector work? We can't allocate memory in the
> > >   notifier so we don't have anything to add to the garbage collector. We
> > >   can't directly modify page tables given you need locks in the path of
> > >   reclaim.
> > 
> > Probably no more garbage collector, you deal with pages/folios like the
> > core mm expects.
> 
> Yeah, if the page-table locks are reclaim-safe no more garbage
> collector, but OTOH, IIRC even in core-mm, the invalidation
> counterpart, unmap_mapping_range() can't and doesn't remove page-table
> subtrees when called from the address-space side, whereas zapping when
> called from the mm side, like madvise(WONTNEED), can.

Yeah we might need to mark up entirely empty pagetables and pass that up
the radix tree, so that on the next gpu bind we can zap those if needed.
Since we have the pagetables already it should be doable to add them to a
"needs garbage collecting" list of some sorts for entirely empty
pagetables, unlike the garbage collector that tosses out partial ranges
and so needs more stuff.

But also, future problem for post-merge I think.
-Sima
-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 00/28] Introduce GPU SVM and Xe SVM implementation
  2024-08-28  2:48 [RFC PATCH 00/28] Introduce GPU SVM and Xe SVM implementation Matthew Brost
                   ` (30 preceding siblings ...)
  2024-08-28  2:56 ` ✗ CI.KUnit: failure " Patchwork
@ 2024-09-24  9:16 ` Simona Vetter
  2024-09-24 19:36   ` Matthew Brost
  31 siblings, 1 reply; 100+ messages in thread
From: Simona Vetter @ 2024-09-24  9:16 UTC (permalink / raw)
  To: Matthew Brost
  Cc: intel-xe, dri-devel, airlied, christian.koenig, thomas.hellstrom,
	matthew.auld, daniel

On Tue, Aug 27, 2024 at 07:48:33PM -0700, Matthew Brost wrote:
> Continuation of SVM work by Oak Zeng [1][2] based on community feedback.
> Introduces GPU SVM layer and new Xe uAPI. Supports GPU page faults for
> system allocations (e.g., malloc), runtime allocations (e.g., binding a
> BO), migration to and from VRAM, and unified eviction (BO and SVM VRAM
> allocations can evict each other). Fully tested; more on this below.
> 
> The patch breakdown is as follows:
> 
> 1. Preparation patches already on the list [3].
> 	- Patches 1-3.
> 	- Please refrain from reviewing these here.	
> 2. New migrate layer functionality
> 	- Patch 4.
> 	- Required for eviction to avoid locking inversion between
> 	  dma-resv and mmap lock.
> 3. GPU SVM.
> 	- Patch 5.
> 	- This is what needs community review.
> 	- Inspired by GPUVM.
> 	- Kernel doc should explain design principles.
> 	- There is certainly room for optimization of the implementation
> 	  and improvements with existing core MM interaction. Pulling in
> 	  pending DMA mapping work [4] and additional core MM support
> 	  for SVM is also likely desired. However, this serves as a good
> 	  starting point for any SVM discussions and could be used as a
> 	  stepping stone to future core MM work.
> 3. Basic SVM support in Xe (i.e., SRAM backing only).
> 	- Patches 6-15.
> 	- The uAPI in the patch could benefit from community input.
> 4. SVM VRAM migration support in Xe.
> 	- Patches 16-23.
> 	- Using TMM BOs for SVM VRAM allocations could use community
> 	  input. Patch 23 has a detailed explaination of this design
> 	  choice in the commit message.
> 5. SVM eviction support in Xe.
> 	- Patch 24.
> 	- Should work with exhaustive eviction [5] when it merges.
> 6. Xe SVM debug / tuning.
> 	- Patch 25-28.
> 
> Kernel documentation and commit messages are relatively light, aside
> from GPU SVM and uAPI patches as this is an RFC.
> 
> Testing has been conducted quite thoroughly with new IGT [6]. Various
> system allocation types (malloc, mmap, mmap flags, huge pages, different
> sizes, different alignments), mixing runtime allocations, unmapping
> corners, invalid faults, and eviction have been tested. Testing scales
> from single thread to multiple threads and multiple processes. Tests
> pass on LNL, BMG, PVC 1 tile, and PVC 2 tile.
> 
> 1. Multiple GPU support.
> 	- This is likely to follow or occur in parallel to this work.
> 2. Userptr unification with GPU SVM.
> 	- This is essentially designed in my head (likely involving a
> 	  few new GPU SVM layer functions) but would require some fairly
> 	  invasive changes to Xe KMD to test out. Therefore, I would
> 	  like GPU SVM to be reviewed first before proceeding with these
> 	  changes.
> 3. Madvise and prefetch IOCTLs
> 	- This is likely to follow or occur in parallel to this work.
> 
> Given the size of the series, I have pushed a GitLab branch for
> reference [7].
> 
> Matt
> 
> [1] https://patchwork.freedesktop.org/series/128910/
> [2] https://patchwork.freedesktop.org/series/132229/
> [3] https://patchwork.freedesktop.org/series/137805/
> [4] https://lore.kernel.org/linux-rdma/cover.1709635535.git.leon@kernel.org/
> [5] https://patchwork.freedesktop.org/series/133643/
> [6] https://patchwork.freedesktop.org/patch/610942/?series=137545&rev=2
> [7] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-svm-post-8-27-24/-/tree/post?ref_type=heads

Ok rather late, I wanted to type this up 2 weeks ago or so, but alas here
it is finally. I think with all the experiments and discussions I have
fairly clear understanding of the really tricky parts of svm (thanks a lot
to Matt for all the work done). From my side the key points, sorted
roughly in how important I think they are:

1. migrate_to_ram path

I think this is the absolute center piece of making sure we're aligned
with core mm and don't suffer from deadlocks or livelocks fundamental to
the gpusvm library design. So this part imo needs to be solid for the
first version we merge. But of course any core mm changes Matt prototyped
shouldn't gate merging the drm side since they're nicely decoupled, we
only need those to validate the design in testing.

I think the key points are:

- we rely on the migration pte, temporary page references and page lock
  only, which with the core mm changes Matt worked on gives us guaranteed
  reliable migration back to system memory. And we need that, or svm
  essentially becomes unusable as a concept.

- we need to support partial migration, including the worst case fallback
  of only migrating that single page core mm managed to trylock for us
  while holding the pagetable lock.

  Since we have guaranteed migration back to system memory we can make the
  assumption on the gpu fault handling side that we will only ever handle
  ranges that are entirely in vram (by throwing any partial migrations
  out). Needs a retry loop for that in the gpu fault side, but I no longer
  see an issue with that assumption on the gpu fault side otherwise, so
  not needed for merging or even later until we have a driver that
  requires partial vram support.

- no other driver locks related to that memory range in any way are
  allowed, and ideally we also test with the forced fallback to mmap_read
  lock in do_swap_page removed, i.e. calling migrate_to_ram with only
  holding the read vma lock. Of course driver locks for blitting are
  allowed, it's only locks related to managing physical memory which are
  problematic and could result in deadlocks.

- the drm side must uphold the guarantee of not having elevated page
  references without holding the page lock. Otherwise there's a race and
  we cannot guarantee migration to sram.

- also through the try_to_migrate maze we'll hit our own gpu pte
  invalidate paths, so there's some requirements there too. But I've put
  the discussion for that at the very bottom, since most of the gpu pte
  locking questions are imo not that important, and definitely not
  important for the first version we merge.

Everything else below I think we can sort out post merge and just need
rough alignment on the design.

2. eviction

Requirements much like migrate_to_ram, because otherwise we break the
migration guarantee:

- Only looking at physical memory datastructures and locks, no looking at
  mm/vma structs or relying on those being locked. We rely entirely on
  reverse maps from try_to_migrate to find all the mappings on both cpu
  and gpu side (cpu only zone device swap or migration pte entries ofc).

- Partial migration needs to work to make sure we can get out of any
  low memory bind.
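
A rough sketch of what this physical-only path looks like with the existing
pfn-based core mm helpers (allocation of the system pages and the copy are
elided, the function name is a placeholder):

	static int driver_evict_vram(unsigned long start_pfn, unsigned long npages)
	{
		unsigned long *src, *dst;
		int err;

		src = kvcalloc(npages, 2 * sizeof(*src), GFP_KERNEL);
		if (!src)
			return -ENOMEM;
		dst = src + npages;

		/* Walks the device pages and their reverse maps; no mm or vma
		 * locks involved. Pages that cannot be locked are skipped,
		 * which is the partial migration case. */
		err = migrate_device_range(src, start_pfn, npages);
		if (err)
			goto out;

		/* ... allocate system pages into dst[], copy the data ... */

		migrate_device_pages(src, dst, npages);
		migrate_device_finalize(src, dst, npages);
	out:
		kvfree(src);
		return err;
	}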

3. gpu fault side

- We can only rely on mmap_read for hmm_range_fault. And ideally should
  assume that's not held anywhere else since with per-vma locking I'd
  expect the mm/vma locking will move into hmm_range_fault. This also
  means no looking at vma beyond just passing it around as currently
  needed by core mm functions.

- Big retry loop to handle all races with the mmu notifier under the gpu
  pagetable locks/mmu notifier range lock/whatever we end up calling
  those. Races (especially against concurrent eviction/migrate_to_ram)
  should _not_ be handled on the fault side by trying to hold locks
  instead.

- Long term I think we need to be able to handle concurrent faults, even
  on hw where there's only one gpu fault handling queue. For performance
  we absolutely want to prefault aggressively, and that likely happens
  through cpu ioctls that are entirely independent from the gpu fault
  handling.

  Short term, enough (driver-side) locking to make sure this doesn't go
  boom will do; I think just some design goal documentation here on how
  to achieve that is all we need.
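
Concretely, the first point means the collection step only ever does
something along these lines (helper name is a placeholder):

	static int driver_collect_pages(struct drm_gpusvm *gpusvm,
					struct hmm_range *hmm_range)
	{
		int err;

		mmap_read_lock(gpusvm->mm);
		err = hmm_range_fault(hmm_range);	/* -EBUSY: notifier fired, retry */
		mmap_read_unlock(gpusvm->mm);

		return err;
	}

No vma is inspected beyond what hmm_range_fault does internally.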

4. physical memory to virtual backpointer

No. Doesn't work :-) Also it's only used in the eviction/migrate_to_ram
path and I think Matt already fixed this all anyway.

5. gpu pagetable locking

Or mmu notifier range locking or whatever you want to call it. Like on the
cpu side this should _only_ protect the pagetable entries and additional
for us mmu notifier seqno tracking, nothing else.

Any races due to concurrent eviction/migrate_to_ram/gpu fault/prefault
need to be handled by retrying outside of holding the pagetable locks. If
we try to impose additional consistency guarantees we'll fall over and
have a livelock/deadlock fight with core mm in migrate_to_ram. This part
is required I think for the first version, but we already need that anyway
to make migrate_to_ram work properly.

For the actual data structure/locking design I think anything on the
design line between a single global lock and the radix tree over-the-top
scalable per-pagetable (spin)lock design of the core mm is fine.

The design here with 3 levels (mmu notifer, range, struct page) wouldn't
be my first choice, but clearly fits on that line so imo is fine for
initial merging. We might want to make sure that the range locking (I
guess mostly relevant for the invalidate side, drivers don't see much
else) is somewhat abstracted so we can easily change that post-merge, but
not required imo at all.

For consensus documentation I'd recommend a todo or design documentation
patch, where we put down both the current design and why it's like that,
and some of the longer term goals. Then get that acked (imo needs at least
one other driver that's seriously interested in this, plus I think an ack
from Danilo for gpuvm interactions), then merge that. SVM is tricky enough
that I think this would be really useful to make sure we're not
unnecessarily stuck in limbo.

From my side again I think the only part we really have to get right from
the start is migrate_to_ram. And I'm confident we've got that now really
solid.

Oh also you need userspace ofc :-)

Cheers, Sima

> Matthew Brost (28):
>   dma-buf: Split out dma fence array create into alloc and arm functions
>   drm/xe: Invalidate media_gt TLBs in PT code
>   drm/xe: Retry BO allocation
>   mm/migrate: Add migrate_device_vma_range
>   drm/gpusvm: Add support for GPU Shared Virtual Memory
>   drm/xe/uapi: Add DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATON flag
>   drm/xe: Add SVM init / fini to faulting VMs
>   drm/xe: Add dma_addr res cursor
>   drm/xe: Add SVM range invalidation
>   drm/gpuvm: Add DRM_GPUVA_OP_USER
>   drm/xe: Add (re)bind to SVM page fault handler
>   drm/xe: Add SVM garbage collector
>   drm/xe: Add unbind to SVM garbage collector
>   drm/xe: Do not allow system allocator VMA unbind if the GPU has
>     bindings
>   drm/xe: Enable system allocator uAPI
>   drm/xe: Add migrate layer functions for SVM support
>   drm/xe: Add SVM device memory mirroring
>   drm/xe: Add GPUSVM copy SRAM / VRAM vfunc functions
>   drm/xe: Update PT layer to understand ranges in VRAM
>   drm/xe: Add Xe SVM populate_vram_pfn vfunc
>   drm/xe: Add Xe SVM vram_release vfunc
>   drm/xe: Add BO flags required for SVM
>   drm/xe: Add SVM VRAM migration
>   drm/xe: Basic SVM BO eviction
>   drm/xe: Add SVM debug
>   drm/xe: Add modparam for SVM notifier size
>   drm/xe: Add modparam for SVM prefault
>   drm/gpusvm: Ensure all pages migrated upon eviction
> 
>  drivers/dma-buf/dma-fence-array.c    |   78 +-
>  drivers/gpu/drm/xe/Makefile          |    4 +-
>  drivers/gpu/drm/xe/drm_gpusvm.c      | 2213 ++++++++++++++++++++++++++
>  drivers/gpu/drm/xe/drm_gpusvm.h      |  415 +++++
>  drivers/gpu/drm/xe/xe_bo.c           |   54 +-
>  drivers/gpu/drm/xe/xe_bo.h           |    2 +
>  drivers/gpu/drm/xe/xe_bo_types.h     |    3 +
>  drivers/gpu/drm/xe/xe_device_types.h |    8 +
>  drivers/gpu/drm/xe/xe_gt_pagefault.c |   17 +-
>  drivers/gpu/drm/xe/xe_migrate.c      |  150 ++
>  drivers/gpu/drm/xe/xe_migrate.h      |   10 +
>  drivers/gpu/drm/xe/xe_module.c       |    7 +
>  drivers/gpu/drm/xe/xe_module.h       |    2 +
>  drivers/gpu/drm/xe/xe_pt.c           |  456 +++++-
>  drivers/gpu/drm/xe/xe_pt.h           |    3 +
>  drivers/gpu/drm/xe/xe_pt_types.h     |    2 +
>  drivers/gpu/drm/xe/xe_res_cursor.h   |   50 +-
>  drivers/gpu/drm/xe/xe_svm.c          |  775 +++++++++
>  drivers/gpu/drm/xe/xe_svm.h          |   70 +
>  drivers/gpu/drm/xe/xe_tile.c         |    5 +
>  drivers/gpu/drm/xe/xe_vm.c           |  286 +++-
>  drivers/gpu/drm/xe/xe_vm.h           |   15 +-
>  drivers/gpu/drm/xe/xe_vm_types.h     |   44 +
>  include/drm/drm_gpuvm.h              |    5 +
>  include/linux/dma-fence-array.h      |    6 +
>  include/linux/migrate.h              |    3 +
>  include/uapi/drm/xe_drm.h            |   19 +-
>  mm/migrate_device.c                  |   53 +
>  28 files changed, 4615 insertions(+), 140 deletions(-)
>  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.c
>  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.h
>  create mode 100644 drivers/gpu/drm/xe/xe_svm.c
>  create mode 100644 drivers/gpu/drm/xe/xe_svm.h
> 
> -- 
> 2.34.1
> 

-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory
  2024-09-06 18:41   ` Zeng, Oak
@ 2024-09-24  9:25     ` Simona Vetter
  2024-09-25 16:34       ` Zeng, Oak
  0 siblings, 1 reply; 100+ messages in thread
From: Simona Vetter @ 2024-09-24  9:25 UTC (permalink / raw)
  To: Zeng, Oak
  Cc: Brost, Matthew, intel-xe@lists.freedesktop.org,
	dri-devel@lists.freedesktop.org, thomas.hellstrom@linux.intel.com,
	Auld, Matthew, daniel@ffwll.ch, airlied@gmail.com,
	christian.koenig@amd.com

On Fri, Sep 06, 2024 at 06:41:18PM +0000, Zeng, Oak wrote:
> There are fundamental design conflicts with what we have aligned, see inline.
> 
> > -----Original Message-----
> > From: Intel-xe <intel-xe-bounces@lists.freedesktop.org> On Behalf
> > Of Matthew Brost
> > Sent: Tuesday, August 27, 2024 10:49 PM
> > To: intel-xe@lists.freedesktop.org; dri-devel@lists.freedesktop.org
> > Cc: airlied@gmail.com; christian.koenig@amd.com;
> > thomas.hellstrom@linux.intel.com; Auld, Matthew
> > <matthew.auld@intel.com>; daniel@ffwll.ch
> > Subject: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU
> > Shared Virtual Memory
> > 
> > This patch introduces support for GPU Shared Virtual Memory (SVM)
> > in the
> > Direct Rendering Manager (DRM) subsystem. SVM allows for
> > seamless
> > sharing of memory between the CPU and GPU, enhancing
> > performance and
> > flexibility in GPU computing tasks.
> > 
> > The patch adds the necessary infrastructure for SVM, including data
> > structures and functions for managing SVM ranges and notifiers. It
> > also
> > provides mechanisms for allocating, deallocating, and migrating
> > memory
> > regions between system RAM and GPU VRAM.
> > 
> > This mid-layer is largely inspired by GPUVM.
> > 
> > Cc: Dave Airlie <airlied@redhat.com>
> > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > Cc: Christian König <christian.koenig@amd.com>
> > Cc: <dri-devel@lists.freedesktop.org>
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/xe/Makefile     |    3 +-
> >  drivers/gpu/drm/xe/drm_gpusvm.c | 2174
> > +++++++++++++++++++++++++++++++
> >  drivers/gpu/drm/xe/drm_gpusvm.h |  415 ++++++
> >  3 files changed, 2591 insertions(+), 1 deletion(-)
> >  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.c
> >  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.h
> > 
> > diff --git a/drivers/gpu/drm/xe/Makefile
> > b/drivers/gpu/drm/xe/Makefile
> > index b9670ae09a9e..b8fc2ee58f1a 100644
> > --- a/drivers/gpu/drm/xe/Makefile
> > +++ b/drivers/gpu/drm/xe/Makefile
> > @@ -25,7 +25,8 @@ $(obj)/generated/%_wa_oob.c
> > $(obj)/generated/%_wa_oob.h: $(obj)/xe_gen_wa_oob \
> > 
> >  # core driver code
> > 
> > -xe-y += xe_bb.o \
> > +xe-y += drm_gpusvm.o \
> > +	xe_bb.o \
> >  	xe_bo.o \
> >  	xe_bo_evict.o \
> >  	xe_devcoredump.o \
> > diff --git a/drivers/gpu/drm/xe/drm_gpusvm.c
> > b/drivers/gpu/drm/xe/drm_gpusvm.c
> > new file mode 100644
> > index 000000000000..fc1e44e6ae72
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/drm_gpusvm.c
> > @@ -0,0 +1,2174 @@
> > +// SPDX-License-Identifier: MIT
> > +/*
> > + * Copyright © 2024 Intel Corporation
> > + *
> > + * Authors:
> > + *     Matthew Brost <matthew.brost@intel.com>
> > + */
> > +
> > +#include <linux/dma-mapping.h>
> > +#include <linux/interval_tree_generic.h>
> > +#include <linux/hmm.h>
> > +#include <linux/memremap.h>
> > +#include <linux/migrate.h>
> > +#include <linux/mm_types.h>
> > +#include <linux/pagemap.h>
> > +#include <linux/slab.h>
> > +
> > +#include <drm/drm_device.h>
> > +#include "drm_gpusvm.h"
> > +
> > +/**
> > + * DOC: Overview
> > + *
> > + * GPU Shared Virtual Memory (GPU SVM) layer for the Direct
> > Rendering Manager (DRM)
> > + *
> > + * The GPU SVM layer is a component of the DRM framework
> > designed to manage shared
> > + * virtual memory between the CPU and GPU. It enables efficient
> > data exchange and
> > + * processing for GPU-accelerated applications by allowing memory
> > sharing and
> > + * synchronization between the CPU's and GPU's virtual address
> > spaces.
> > + *
> > + * Key GPU SVM Components:
> > + * - Notifiers: Notifiers: Used for tracking memory intervals and
> > notifying the
> > + *		GPU of changes, notifiers are sized based on a GPU
> > SVM
> > + *		initialization parameter, with a recommendation of
> > 512M or
> > + *		larger. They maintain a Red-BlacK tree and a list of
> > ranges that
> > + *		fall within the notifier interval. Notifiers are tracked
> > within
> > + *		a GPU SVM Red-BlacK tree and list and are
> > dynamically inserted
> > + *		or removed as ranges within the interval are created
> > or
> > + *		destroyed.
> > + * - Ranges: Represent memory ranges mapped in a DRM device and
> > managed
> > + *	     by GPU SVM. 
> 
> 
> This svm_range concept has introduced a lot of code duplication in xekmd,
> indicating that this is a wrong design. I think one of the design principles
> is to reuse, not to duplicate.
> 
> Look at patches 9 and 11: a bunch of duplicated code for page table update,
> invalidate, and the page fault handler.
> 
> I had this range concept in v1 [1], but after we agreed to unify the svm and
> userptr code during review, I dropped this concept, and the xe_svm concept,
> which ends up with much less duplicated code in v2 [2]. I will say more below
> about why I thought the svm concept can also be removed.
> 
> Conceptually a vma represents a range. Why duplicate?

Because we cannot rely on mmap_read/write locks or vma_read/write locks
without causing headaches. They are core mm datastructures that the gpu
driver does not own, so for better or worse we have to do a bit of
duplication.

Duplication for no reason is bad, but trying to avoid necessary
duplication that's inherent to the design challenge we face is much worse.


> [1] https://patchwork.freedesktop.org/patch/574898/?series=128910&rev=1
> [2] https://patchwork.freedesktop.org/series/132229/
> 
> 
> They are sized based on an array of chunk
> > sizes, which
> > + *	     is a GPU SVM initialization parameter, and the CPU address
> > space.
> > + *	     Upon GPU fault, the largest aligned chunk that fits within
> > the
> > + *	     faulting CPU address space is chosen for the range size.
> > Ranges are
> > + *	     expected to be dynamically allocated on GPU fault and
> > removed on an
> > + *	     MMU notifier UNMAP event. As mentioned above, ranges
> > are tracked in
> > + *	     a notifier's Red-Black tree.
> > + * - Operations: Define the interface for driver-specific SVM
> > operations such as
> > + *		 allocation, page collection, migration, invalidations,
> > and VRAM
> > + *		 release.
> > + *
> > + * This layer provides interfaces for allocating, mapping, migrating,
> > and
> > + * releasing memory ranges between the CPU and GPU. It handles
> > all core memory
> > + * management interactions (DMA mapping, HMM, and migration)
> > and provides
> > + * driver-specific virtual functions (vfuncs). This infrastructure is
> > sufficient
> > + * to build the expected driver components for an SVM
> > implementation as detailed
> > + * below.
> > + *
> > + * Expected Driver Components:
> > + * - GPU page fault handler: Used to create ranges and notifiers
> > based on the
> > + *			     fault address, optionally migrate the range
> > to
> > + *			     VRAM, and create GPU bindings.
> > + * - Garbage collector: Used to destroy GPU bindings for ranges.
> > Ranges are
> > + *			expected to be added to the garbage collector
> > upon
> > + *			MMU_NOTIFY_UNMAP event.
> > + */
> > +
> > +/**
> > + * DOC: Locking
> > + *
> > + * GPU SVM handles locking for core MM interactions, i.e., it
> > locks/unlocks the
> > + * mmap lock as needed. Alternatively, if the driver prefers to
> > handle the mmap
> > + * lock itself, a 'locked' argument is provided to the functions that
> > require
> > + * the mmap lock. This option may be useful for drivers that need to
> > call into
> > + * GPU SVM while also holding a dma-resv lock, thus preventing
> > locking
> > + * inversions between the mmap and dma-resv locks.
> > + *
> > + * GPU SVM introduces a global notifier lock, which safeguards the
> > notifier's
> > + * range RB tree and list, as well as the range's DMA mappings and
> > sequence
> > + * number. GPU SVM manages all necessary locking and unlocking
> > operations,
> > + * except for the recheck of the range's sequence number
> > + * (mmu_interval_read_retry) when the driver is committing GPU
> > bindings. This
> > + * lock corresponds to the 'driver->update' lock mentioned in the
> > HMM
> > + * documentation (TODO: Link). Future revisions may transition from
> > a GPU SVM
> > + * global lock to a per-notifier lock if finer-grained locking is deemed
> > + * necessary.
> > + *
> > + * In addition to the locking mentioned above, the driver should
> > implement a
> > + * lock to safeguard core GPU SVM function calls that modify state,
> > such as
> > + * drm_gpusvm_range_find_or_insert and
> > drm_gpusvm_range_remove. Alternatively,
> > + * these core functions can be called within a single kernel thread,
> > for
> > + * instance, using an ordered work queue. This lock is denoted as
> > + * 'driver_svm_lock' in code examples.
> > + */
> > +
> > +/**
> > + * DOC: Migrataion
> > + *
> > + * The migration support is quite simple, allowing migration between
> > SRAM and
> > + * VRAM at the range granularity. For example, GPU SVM currently
> > does not
> > + * support mixing SRAM and VRAM pages within a range. This means
> > that upon GPU
> > + * fault, the entire range can be migrated to VRAM, and upon CPU
> > fault, the
> > + * entire range is migrated to SRAM.
> > + *
> > + * The reasoning for only supporting range granularity is as follows: it
> > + * simplifies the implementation, and range sizes are driver-defined
> > and should
> > + * be relatively small.
> 
> Migration at range granularity just couples the physical world with the
> virtual world, which is against the fundamental page-centric design we
> aligned on before.
> 
> Looking at core mm behavior, shrinking/swapping doesn't operate at vma or
> any virtual range granularity. This way we swap out the less frequently
> used pages and keep the more frequently used pages in RAM.
> 
> A similar thing should be done for vram to sram migration.
> 
> > + */
> > +
> > +/**
> > + * DOC: Partial Unmapping of Ranges
> > + *
> > + * Partial unmapping of ranges (e.g., 1M out of 2M is unmapped by
> > CPU resulting
> > + * in MMU_NOTIFY_UNMAP event) presents several challenges,
> 
> As said above, the challenge comes from a design choice. In a
> page-centric design, the challenges don't exist at all.

See my other reply, as long as migrate_to_ram is entirely page centric
we're fine. And I think Matt fixed that now.

The other aspect of being page centric is gpu pagetable locking, and there
I also gained a lot of clarity on what exactly matters, and what doesn't.
The mmu_notifier -> range -> page design wouldn't be my personal first
choice, but it is a perfectly ok one I think. As long as we follow all the
other rules we need to follow about page-centric locking/refcounting/pte
invaliation that migrate_to_ram requires.

Cheers, Sima


> > with the main one
> > + * being that a subset of the range still has CPU and GPU mappings.
> > If the
> > + * backing store for the range is in VRAM, a subset of the backing
> > store has
> > + * references. One option would be to split the range and VRAM
> > backing store,
> > + * but the implementation for this would be quite complicated.
> > Given that
> > + * partial unmappings are rare and driver-defined range sizes are
> > relatively
> > + * small, GPU SVM does not support splitting of ranges.
> > + *
> > + * With no support for range splitting, upon partial unmapping of a
> > range, the
> > + * driver is expected to invalidate and destroy the entire range. If
> > the range
> > + * has VRAM as its backing, the driver is also expected to migrate any
> > remaining
> > + * pages back to SRAM.
> > + */
> > +
> > +/**
> > + * DOC: Examples
> > + *
> > + * This section provides two examples of how to build the expected
> > driver
> > + * components: the GPU page fault handler and the garbage
> > collector. A third
> > + * example demonstrates a sample invalidation driver vfunc.
> > + *
> > + * The generic code provided does not include logic for complex
> > migration
> > + * policies, optimized invalidations, or other potentially required
> > driver
> > + * locking (e.g., DMA-resv locks).
> > + *
> > + * 1) GPU page fault handler
> > + *
> > + *	int driver_bind_range(struct drm_gpusvm *gpusvm, struct
> > drm_gpusvm_range *range)
> > + *	{
> > + *		int err = 0;
> > + *
> > + *		driver_alloc_and_setup_memory_for_bind(gpusvm,
> > range);
> > + *
> > + *		drm_gpusvm_notifier_lock(gpusvm);
> > + *		if (drm_gpusvm_range_pages_valid(range))
> > + *			driver_commit_bind(gpusvm, range);
> > + *		else
> > + *			err = -EAGAIN;
> > + *		drm_gpusvm_notifier_unlock(gpusvm);
> > + *
> > + *		return err;
> > + *	}
> > + *
> > + *	int driver_gpu_fault(struct drm_gpusvm *gpusvm, u64
> > fault_addr,
> > + *			     u64 gpuva_start, u64 gpuva_end)
> > + *	{
> > + *		struct drm_gpusvm_ctx ctx = {};
> > + *		int err;
> > + *
> > + *		driver_svm_lock();
> > + *	retry:
> > + *		// Always process UNMAPs first so view of GPU SVM
> > ranges is current
> > + *		driver_garbage_collector(gpusvm);
> > + *
> > + *		range = drm_gpusvm_range_find_or_insert(gpusvm,
> > fault_addr,
> > + *							gpuva_start,
> > gpuva_end,
> > + *						        &ctx);
> > + *		if (IS_ERR(range)) {
> > + *			err = PTR_ERR(range);
> > + *			goto unlock;
> > + *		}
> > + *
> > + *		if (driver_migration_policy(range)) {
> > + *			bo = driver_alloc_bo();
> > + *			err = drm_gpusvm_migrate_to_vram(gpusvm,
> > range, bo, &ctx);
> > + *			if (err)	// CPU mappings may have changed
> > + *				goto retry;
> > + *		}
> > + *
> > + *		err = drm_gpusvm_range_get_pages(gpusvm, range,
> > &ctx);
> > + *		if (err == -EFAULT || err == -EPERM)	// CPU
> > mappings changed
> > + *			goto retry;
> > + *		else if (err)
> > + *			goto unlock;
> > + *
> > + *		err = driver_bind_range(gpusvm, range);
> > + *		if (err == -EAGAIN)	// CPU mappings changed
> > + *			goto retry;
> > + *
> > + *	unlock:
> > + *		driver_svm_unlock();
> > + *		return err;
> > + *	}
> > + *
> > + * 2) Garbage Collector.
> > + *
> > + *	void __driver_garbage_collector(struct drm_gpusvm
> > *gpusvm,
> > + *					struct drm_gpusvm_range
> > *range)
> > + *	{
> > + *		struct drm_gpusvm_ctx ctx = {};
> > + *
> > + *		assert_driver_svm_locked(gpusvm);
> > + *
> > + *		// Partial unmap, migrate any remaining VRAM pages
> > back to SRAM
> > + *		if (range->flags.partial_unmap)
> > + *			drm_gpusvm_migrate_to_sram(gpusvm,
> > range, &ctx);
> > + *
> > + *		driver_unbind_range(range);
> > + *		drm_gpusvm_range_remove(gpusvm, range);
> > + *	}
> > + *
> > + *	void driver_garbage_collector(struct drm_gpusvm *gpusvm)
> > + *	{
> > + *		assert_driver_svm_locked(gpusvm);
> > + *
> > + *		for_each_range_in_garbage_collector(gpusvm, range)
> > + *			__driver_garbage_collector(gpusvm, range);
> > + *	}
> > + *
> > + * 3) Invalidation driver vfunc.
> > + *
> > + *	void driver_invalidation(struct drm_gpusvm *gpusvm,
> > + *				 struct drm_gpusvm_notifier *notifier,
> > + *				 const struct mmu_notifier_range
> > *mmu_range)
> > + *	{
> > + *		struct drm_gpusvm_ctx ctx = { .in_notifier = true, };
> > + *		struct drm_gpusvm_range *range = NULL;
> > + *
> > + *		driver_invalidate_device_tlb(gpusvm, mmu_range-
> > >start, mmu_range->end);
> > + *
> > + *		drm_gpusvm_for_each_range(range, notifier,
> > mmu_range->start,
> > + *					  mmu_range->end) {
> > + *			drm_gpusvm_range_unmap_pages(gpusvm,
> > range, &ctx);
> > + *
> > + *			if (mmu_range->event !=
> > MMU_NOTIFY_UNMAP)
> > + *				continue;
> > + *
> > + *			drm_gpusvm_range_set_unmapped(range,
> > mmu_range);
> > + *			driver_garbage_collector_add(gpusvm,
> > range);
> > + *		}
> > + *	}
> > + */
> > +
> > +#define DRM_GPUSVM_RANGE_START(_range)	((_range)->va.start)
> > +#define DRM_GPUSVM_RANGE_END(_range)	((_range)->va.end - 1)
> > +INTERVAL_TREE_DEFINE(struct drm_gpusvm_range, rb.node, u64,
> > rb.__subtree_last,
> > +		     DRM_GPUSVM_RANGE_START,
> > DRM_GPUSVM_RANGE_END,
> > +		     static __maybe_unused, range);
> > +
> > +#define DRM_GPUSVM_NOTIFIER_START(_notifier)	((_notifier)->interval.start)
> > +#define DRM_GPUSVM_NOTIFIER_END(_notifier)	((_notifier)->interval.end - 1)
> > +INTERVAL_TREE_DEFINE(struct drm_gpusvm_notifier, rb.node, u64,
> > +		     rb.__subtree_last,
> > DRM_GPUSVM_NOTIFIER_START,
> > +		     DRM_GPUSVM_NOTIFIER_END, static
> > __maybe_unused, notifier);
> > +
> > +/**
> > + * npages_in_range() - Calculate the number of pages in a given
> > range
> > + * @start__: The start address of the range
> > + * @end__: The end address of the range
> > + *
> > + * This macro calculates the number of pages in a given memory
> > range,
> > + * specified by the start and end addresses. It divides the difference
> > + * between the end and start addresses by the page size
> > (PAGE_SIZE) to
> > + * determine the number of pages in the range.
> > + *
> > + * Return: The number of pages in the specified range.
> > + */
> > +#define npages_in_range(start__, end__)	\
> > +	(((end__) - (start__)) >> PAGE_SHIFT)
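
(For example, with 4K pages npages_in_range(0x1000, 0x5000) evaluates to 4.)
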
> > +
> > +/**
> > + * struct drm_gpusvm_zdd - GPU SVM zone device data
> > + *
> > + * @refcount: Reference count for the zdd
> > + * @destroy_work: Work structure for asynchronous zdd
> > destruction
> > + * @range: Pointer to the GPU SVM range
> > + * @vram_allocation: Driver-private pointer to the VRAM allocation
> > + *
> > + * This structure serves as a generic wrapper installed in
> > + * page->zone_device_data. It provides infrastructure for looking up
> > a range
> > + * upon CPU page fault and asynchronously releasing VRAM once
> > the CPU has no
> > + * page references. Asynchronous release is useful because CPU
> > page references
> > + * can be dropped in IRQ contexts, while releasing VRAM likely
> > requires sleeping
> > + * locks.
> > + */
> > +struct drm_gpusvm_zdd {
> > +	struct kref refcount;
> > +	struct work_struct destroy_work;
> > +	struct drm_gpusvm_range *range;
> > +	void *vram_allocation;
> > +};
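
A hypothetical sketch of how the CPU fault side could use this wrapper to get
back from a faulting device-private page to the range; the real
dev_pagemap_ops.migrate_to_ram callback lives elsewhere in the series, so the
eviction helper below is purely illustrative:

	static vm_fault_t driver_migrate_to_ram(struct vm_fault *vmf)
	{
		struct drm_gpusvm_zdd *zdd = vmf->page->zone_device_data;
		struct drm_gpusvm_range *range = zdd->range;
		struct drm_gpusvm *gpusvm = range->gpusvm;

		/* Migrate the backing store back to SRAM, details omitted. */
		return driver_evict_range_to_sram(gpusvm, range) ?
			VM_FAULT_SIGBUS : 0;
	}
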
> > +
> > +/**
> > + * drm_gpusvm_zdd_destroy_work_func - Work function for
> > destroying a zdd
> > + * @w: Pointer to the work_struct
> > + *
> > + * This function releases VRAM, puts GPU SVM range, and frees zdd.
> > + */
> > +static void drm_gpusvm_zdd_destroy_work_func(struct
> > work_struct *w)
> > +{
> > +	struct drm_gpusvm_zdd *zdd =
> > +		container_of(w, struct drm_gpusvm_zdd,
> > destroy_work);
> > +	struct drm_gpusvm_range *range = zdd->range;
> > +	struct drm_gpusvm *gpusvm = range->gpusvm;
> > +
> > +	if (gpusvm->ops->vram_release && zdd->vram_allocation)
> > +		gpusvm->ops->vram_release(zdd->vram_allocation);
> > +	drm_gpusvm_range_put(range);
> > +	kfree(zdd);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_zdd_alloc - Allocate a zdd structure.
> > + * @range: Pointer to the GPU SVM range.
> > + *
> > + * This function allocates and initializes a new zdd structure. It sets
> > up the
> > + * reference count, initializes the destroy work, and links the
> > provided GPU SVM
> > + * range.
> > + *
> > + * Returns:
> > + * Pointer to the allocated zdd on success, ERR_PTR() on failure.
> > + */
> > +static struct drm_gpusvm_zdd *
> > +drm_gpusvm_zdd_alloc(struct drm_gpusvm_range *range)
> > +{
> > +	struct drm_gpusvm_zdd *zdd;
> > +
> > +	zdd = kmalloc(sizeof(*zdd), GFP_KERNEL);
> > +	if (!zdd)
> > +		return NULL;
> > +
> > +	kref_init(&zdd->refcount);
> > +	INIT_WORK(&zdd->destroy_work,
> > drm_gpusvm_zdd_destroy_work_func);
> > +	zdd->range = drm_gpusvm_range_get(range);
> > +	zdd->vram_allocation = NULL;
> > +
> > +	return zdd;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_zdd_get - Get a reference to a zdd structure.
> > + * @zdd: Pointer to the zdd structure.
> > + *
> > + * This function increments the reference count of the provided zdd
> > structure.
> > + *
> > + * Returns: Pointer to the zdd structure.
> > + */
> > +static struct drm_gpusvm_zdd *drm_gpusvm_zdd_get(struct
> > drm_gpusvm_zdd *zdd)
> > +{
> > +	kref_get(&zdd->refcount);
> > +	return zdd;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_zdd_destroy - Destroy a zdd structure.
> > + * @ref: Pointer to the reference count structure.
> > + *
> > + * This function queues the destroy_work of the zdd for
> > asynchronous destruction.
> > + */
> > +static void drm_gpusvm_zdd_destroy(struct kref *ref)
> > +{
> > +	struct drm_gpusvm_zdd *zdd =
> > +		container_of(ref, struct drm_gpusvm_zdd, refcount);
> > +	struct drm_gpusvm *gpusvm = zdd->range->gpusvm;
> > +
> > +	queue_work(gpusvm->zdd_wq, &zdd->destroy_work);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_zdd_put - Put a zdd reference.
> > + * @zdd: Pointer to the zdd structure.
> > + *
> > + * This function decrements the reference count of the provided zdd
> > structure
> > + * and schedules its destruction if the count drops to zero.
> > + */
> > +static void drm_gpusvm_zdd_put(struct drm_gpusvm_zdd *zdd)
> > +{
> > +	kref_put(&zdd->refcount, drm_gpusvm_zdd_destroy);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_find - Find GPU SVM range from GPU SVM
> > notifier
> > + * @notifier: Pointer to the GPU SVM notifier structure.
> > + * @start: Start address of the range
> > + * @end: End address of the range
> > + *
> > + * Return: A pointer to the drm_gpusvm_range if found or NULL
> > + */
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64
> > start, u64 end)
> > +{
> > +	return range_iter_first(&notifier->root, start, end - 1);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_for_each_range_safe - Safely iterate over GPU
> > SVM ranges in a notifier
> > + * @range__: Iterator variable for the ranges
> > + * @next__: Iterator variable for temporary storage of the next range
> > + * @notifier__: Pointer to the GPU SVM notifier
> > + * @start__: Start address of the range
> > + * @end__: End address of the range
> > + *
> > + * This macro is used to iterate over GPU SVM ranges in a notifier
> > while
> > + * removing ranges from it.
> > + */
> > +#define drm_gpusvm_for_each_range_safe(range__, next__,
> > notifier__, start__, end__)	\
> > +	for ((range__) = drm_gpusvm_range_find((notifier__),
> > (start__), (end__)),	\
> > +	     (next__) = __drm_gpusvm_range_next(range__);
> > 			\
> > +	     (range__) && (range__->va.start < (end__));
> > 			\
> > +	     (range__) = (next__), (next__) =
> > __drm_gpusvm_range_next(range__))
> > +
> > +/**
> > + * __drm_gpusvm_notifier_next - get the next
> > drm_gpusvm_notifier in the list
> > + * @notifier: a pointer to the current drm_gpusvm_notifier
> > + *
> > + * Return: A pointer to the next drm_gpusvm_notifier if available, or
> > NULL if
> > + *         the current notifier is the last one or if the input notifier is
> > + *         NULL.
> > + */
> > +static struct drm_gpusvm_notifier *
> > +__drm_gpusvm_notifier_next(struct drm_gpusvm_notifier
> > *notifier)
> > +{
> > +	if (notifier && !list_is_last(&notifier->rb.entry,
> > +				      &notifier->gpusvm->notifier_list))
> > +		return list_next_entry(notifier, rb.entry);
> > +
> > +	return NULL;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_for_each_notifier - Iterate over GPU SVM notifiers
> > in a gpusvm
> > + * @notifier__: Iterator variable for the notifiers
> > + * @gpusvm__: Pointer to the GPU SVM structure
> > + * @start__: Start address of the notifier
> > + * @end__: End address of the notifier
> > + *
> > + * This macro is used to iterate over GPU SVM notifiers in a gpusvm.
> > + */
> > +#define drm_gpusvm_for_each_notifier(notifier__, gpusvm__,
> > start__, end__)		\
> > +	for ((notifier__) = notifier_iter_first(&(gpusvm__)->root,
> > (start__), (end__) - 1);	\
> > +	     (notifier__) && (notifier__->interval.start < (end__));
> > 			\
> > +	     (notifier__) = __drm_gpusvm_notifier_next(notifier__))
> > +
> > +/**
> > + * drm_gpusvm_for_each_notifier_safe - Safely iterate over GPU
> > SVM notifiers in a gpusvm
> > + * @notifier__: Iterator variable for the notifiers
> > + * @next__: Iterator variable for temporary storage of the next notifier
> > + * @gpusvm__: Pointer to the GPU SVM structure
> > + * @start__: Start address of the notifier
> > + * @end__: End address of the notifier
> > + *
> > + * This macro is used to iterate over GPU SVM notifiers in a gpusvm
> > while
> > + * removing notifiers from it.
> > + */
> > +#define drm_gpusvm_for_each_notifier_safe(notifier__, next__,
> > gpusvm__, start__, end__)	\
> > +	for ((notifier__) = notifier_iter_first(&(gpusvm__)->root,
> > (start__), (end__) - 1),	\
> > +	     (next__) = __drm_gpusvm_notifier_next(notifier__);
> > 				\
> > +	     (notifier__) && (notifier__->interval.start < (end__));
> > 			\
> > +	     (notifier__) = (next__), (next__) =
> > __drm_gpusvm_notifier_next(notifier__))
> > +
> > +/**
> > + * drm_gpusvm_notifier_invalidate - Invalidate a GPU SVM notifier.
> > + * @mni: Pointer to the mmu_interval_notifier structure.
> > + * @mmu_range: Pointer to the mmu_notifier_range structure.
> > + * @cur_seq: Current sequence number.
> > + *
> > + * This function serves as a generic MMU notifier for GPU SVM. It
> > sets the MMU
> > + * notifier sequence number and calls the driver invalidate vfunc
> > under
> > + * gpusvm->notifier_lock.
> > + *
> > + * Returns:
> > + * true if the operation succeeds, false otherwise.
> > + */
> > +static bool
> > +drm_gpusvm_notifier_invalidate(struct mmu_interval_notifier *mni,
> > +			       const struct mmu_notifier_range
> > *mmu_range,
> > +			       unsigned long cur_seq)
> > +{
> > +	struct drm_gpusvm_notifier *notifier =
> > +		container_of(mni, typeof(*notifier), notifier);
> > +	struct drm_gpusvm *gpusvm = notifier->gpusvm;
> > +
> > +	if (!mmu_notifier_range_blockable(mmu_range))
> > +		return false;
> > +
> > +	down_write(&gpusvm->notifier_lock);
> > +	mmu_interval_set_seq(mni, cur_seq);
> > +	gpusvm->ops->invalidate(gpusvm, notifier, mmu_range);
> > +	up_write(&gpusvm->notifier_lock);
> > +
> > +	return true;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_notifier_ops - MMU interval notifier operations for
> > GPU SVM
> > + */
> > +static const struct mmu_interval_notifier_ops
> > drm_gpusvm_notifier_ops = {
> > +	.invalidate = drm_gpusvm_notifier_invalidate,
> > +};
> > +
> > +/**
> > + * drm_gpusvm_init - Initialize the GPU SVM.
> > + * @gpusvm: Pointer to the GPU SVM structure.
> > + * @name: Name of the GPU SVM.
> > + * @drm: Pointer to the DRM device structure.
> > + * @mm: Pointer to the mm_struct for the address space.
> > + * @device_private_page_owner: Device private pages owner.
> > + * @mm_start: Start address of GPU SVM.
> > + * @mm_range: Range of the GPU SVM.
> > + * @notifier_size: Size of individual notifiers.
> > + * @ops: Pointer to the operations structure for GPU SVM.
> > + * @chunk_sizes: Pointer to the array of chunk sizes used in range
> > allocation.
> > + *               Entries should be powers of 2 in descending order with last
> > + *               entry being SZ_4K.
> > + * @num_chunks: Number of chunks.
> > + *
> > + * This function initializes the GPU SVM.
> > + *
> > + * Returns:
> > + * 0 on success, a negative error code on failure.
> > + */
> > +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> > +		    const char *name, struct drm_device *drm,
> > +		    struct mm_struct *mm, void
> > *device_private_page_owner,
> > +		    u64 mm_start, u64 mm_range, u64 notifier_size,
> > +		    const struct drm_gpusvm_ops *ops,
> > +		    const u64 *chunk_sizes, int num_chunks)
> > +{
> > +	if (!ops->invalidate || !num_chunks)
> > +		return -EINVAL;
> > +
> > +	gpusvm->name = name;
> > +	gpusvm->drm = drm;
> > +	gpusvm->mm = mm;
> > +	gpusvm->device_private_page_owner =
> > device_private_page_owner;
> > +	gpusvm->mm_start = mm_start;
> > +	gpusvm->mm_range = mm_range;
> > +	gpusvm->notifier_size = notifier_size;
> > +	gpusvm->ops = ops;
> > +	gpusvm->chunk_sizes = chunk_sizes;
> > +	gpusvm->num_chunks = num_chunks;
> > +	gpusvm->zdd_wq = system_wq;
> > +
> > +	mmgrab(mm);
> > +	gpusvm->root = RB_ROOT_CACHED;
> > +	INIT_LIST_HEAD(&gpusvm->notifier_list);
> > +
> > +	init_rwsem(&gpusvm->notifier_lock);
> > +
> > +	fs_reclaim_acquire(GFP_KERNEL);
> > +	might_lock(&gpusvm->notifier_lock);
> > +	fs_reclaim_release(GFP_KERNEL);
> > +
> > +	return 0;
> > +}
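
A minimal usage sketch for the init call; the chunk sizes follow the
descending powers-of-2 rule ending at SZ_4K, while the embedding structure,
ops table, and address-space sizes are made-up placeholders:

	static const u64 driver_chunk_sizes[] = { SZ_2M, SZ_64K, SZ_4K };

	int driver_vm_init_svm(struct driver_vm *vm, struct drm_device *drm)
	{
		return drm_gpusvm_init(&vm->svm, "driver-svm", drm,
				       current->mm, vm /* pagemap owner */,
				       0, TASK_SIZE /* mm_range */,
				       SZ_512M /* notifier_size */,
				       &driver_gpusvm_ops,
				       driver_chunk_sizes,
				       ARRAY_SIZE(driver_chunk_sizes));
	}
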
> > +
> > +/**
> > + * drm_gpusvm_notifier_find - Find GPU SVM notifier
> > + * @gpusvm__: Pointer to the GPU SVM structure
> > + * @fault_addr__: Fault address
> > + *
> > + * This macro finds the GPU SVM notifier associated with the fault
> > address.
> > + *
> > + * Returns:
> > + * Pointer to the GPU SVM notifier on success, NULL otherwise.
> > + */
> > +#define drm_gpusvm_notifier_find(gpusvm__, fault_addr__)
> > 	\
> > +	notifier_iter_first(&(gpusvm__)->root, (fault_addr__),
> > 	\
> > +			    (fault_addr__ + 1))
> > +
> > +/**
> > + * to_drm_gpusvm_notifier - retrieve the container struct for a
> > given rbtree node
> > + * @node__: a pointer to the rbtree node embedded within a
> > drm_gpusvm_notifier struct
> > + *
> > + * Return: A pointer to the containing drm_gpusvm_notifier
> > structure.
> > + */
> > +#define to_drm_gpusvm_notifier(node__)	\
> > +	container_of((node__), struct drm_gpusvm_notifier, rb.node)
> > +
> > +/**
> > + * drm_gpusvm_notifier_insert - Insert GPU SVM notifier
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + *
> > + * This function inserts the GPU SVM notifier into the GPU SVM RB
> > tree and list.
> > + */
> > +static void drm_gpusvm_notifier_insert(struct drm_gpusvm
> > *gpusvm,
> > +				       struct drm_gpusvm_notifier
> > *notifier)
> > +{
> > +	struct rb_node *node;
> > +	struct list_head *head;
> > +
> > +	notifier_insert(notifier, &gpusvm->root);
> > +
> > +	node = rb_prev(&notifier->rb.node);
> > +	if (node)
> > +		head = &(to_drm_gpusvm_notifier(node))->rb.entry;
> > +	else
> > +		head = &gpusvm->notifier_list;
> > +
> > +	list_add(&notifier->rb.entry, head);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_notifier_remove - Remove GPU SVM notifier
> > + * @gpusvm__: Pointer to the GPU SVM structure
> > + * @notifier__: Pointer to the GPU SVM notifier structure
> > + *
> > + * This macro removes the GPU SVM notifier from the GPU SVM RB
> > tree and list.
> > + */
> > +#define drm_gpusvm_notifier_remove(gpusvm__, notifier__)
> > 	\
> > +	notifier_remove((notifier__), &(gpusvm__)->root);	\
> > +	list_del(&(notifier__)->rb.entry)
> > +
> > +/**
> > + * drm_gpusvm_fini - Finalize the GPU SVM.
> > + * @gpusvm: Pointer to the GPU SVM structure.
> > + *
> > + * This function finalizes the GPU SVM by cleaning up any remaining
> > ranges and
> > + * notifiers, and dropping a reference to struct MM.
> > + */
> > +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm)
> > +{
> > +	struct drm_gpusvm_notifier *notifier, *next;
> > +
> > +	drm_gpusvm_for_each_notifier_safe(notifier, next, gpusvm,
> > 0, LONG_MAX) {
> > +		struct drm_gpusvm_range *range, *__next;
> > +
> > +		/*
> > +		 * Remove notifier first to avoid racing with any
> > invalidation
> > +		 */
> > +		mmu_interval_notifier_remove(&notifier->notifier);
> > +		notifier->flags.removed = true;
> > +
> > +		drm_gpusvm_for_each_range_safe(range, __next,
> > notifier, 0,
> > +					       LONG_MAX)
> > +			drm_gpusvm_range_remove(gpusvm, range);
> > +	}
> > +
> > +	mmdrop(gpusvm->mm);
> > +	WARN_ON(!RB_EMPTY_ROOT(&gpusvm->root.rb_root));
> > +}
> > +
> > +/**
> > + * drm_gpusvm_notifier_alloc - Allocate GPU SVM notifier
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @fault_addr: Fault address
> > + *
> > + * This function allocates and initializes the GPU SVM notifier
> > structure.
> > + *
> > + * Returns:
> > + * Pointer to the allocated GPU SVM notifier on success, ERR_PTR()
> > on failure.
> > + */
> > +static struct drm_gpusvm_notifier *
> > +drm_gpusvm_notifier_alloc(struct drm_gpusvm *gpusvm, u64
> > fault_addr)
> > +{
> > +	struct drm_gpusvm_notifier *notifier;
> > +
> > +	if (gpusvm->ops->notifier_alloc)
> > +		notifier = gpusvm->ops->notifier_alloc();
> > +	else
> > +		notifier = kzalloc(sizeof(*notifier), GFP_KERNEL);
> > +
> > +	if (!notifier)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	notifier->gpusvm = gpusvm;
> > +	notifier->interval.start = ALIGN_DOWN(fault_addr, gpusvm-
> > >notifier_size);
> > +	notifier->interval.end = ALIGN(fault_addr + 1, gpusvm-
> > >notifier_size);
> > +	INIT_LIST_HEAD(&notifier->rb.entry);
> > +	notifier->root = RB_ROOT_CACHED;
> > +	INIT_LIST_HEAD(&notifier->range_list);
> > +
> > +	return notifier;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_notifier_free - Free GPU SVM notifier
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + *
> > + * This function frees the GPU SVM notifier structure.
> > + */
> > +static void drm_gpusvm_notifier_free(struct drm_gpusvm *gpusvm,
> > +				     struct drm_gpusvm_notifier
> > *notifier)
> > +{
> > +	WARN_ON(!RB_EMPTY_ROOT(&notifier->root.rb_root));
> > +
> > +	if (gpusvm->ops->notifier_free)
> > +		gpusvm->ops->notifier_free(notifier);
> > +	else
> > +		kfree(notifier);
> > +}
> > +
> > +/**
> > + * to_drm_gpusvm_range - retrieve the container struct for a given
> > rbtree node
> > + * @node__: a pointer to the rbtree node embedded within a
> > drm_gpusvm_range struct
> > + *
> > + * Return: A pointer to the containing drm_gpusvm_range structure.
> > + */
> > +#define to_drm_gpusvm_range(node__)	\
> > +	container_of((node__), struct drm_gpusvm_range, rb.node)
> > +
> > +/**
> > + * drm_gpusvm_range_insert - Insert GPU SVM range
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + * @range: Pointer to the GPU SVM range structure
> > + *
> > + * This function inserts the GPU SVM range into the notifier RB tree
> > and list.
> > + */
> > +static void drm_gpusvm_range_insert(struct drm_gpusvm_notifier
> > *notifier,
> > +				    struct drm_gpusvm_range *range)
> > +{
> > +	struct rb_node *node;
> > +	struct list_head *head;
> > +
> > +	drm_gpusvm_notifier_lock(notifier->gpusvm);
> > +	range_insert(range, &notifier->root);
> > +
> > +	node = rb_prev(&range->rb.node);
> > +	if (node)
> > +		head = &(to_drm_gpusvm_range(node))->rb.entry;
> > +	else
> > +		head = &notifier->range_list;
> > +
> > +	list_add(&range->rb.entry, head);
> > +	drm_gpusvm_notifier_unlock(notifier->gpusvm);
> > +}
> > +
> > +/**
> > + * __drm_gpusvm_range_remove - Remove GPU SVM range
> > + * @notifier__: Pointer to the GPU SVM notifier structure
> > + * @range__: Pointer to the GPU SVM range structure
> > + *
> > + * This macro removes the GPU SVM range from the notifier RB tree
> > and list.
> > + */
> > +#define __drm_gpusvm_range_remove(notifier__, range__)
> > 		\
> > +	range_remove((range__), &(notifier__)->root);
> > 	\
> > +	list_del(&(range__)->rb.entry)
> > +
> > +/**
> > + * drm_gpusvm_range_alloc - Allocate GPU SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + * @fault_addr: Fault address
> > + * @chunk_size: Chunk size
> > + * @migrate_vram: Flag indicating whether to migrate VRAM
> > + *
> > + * This function allocates and initializes the GPU SVM range structure.
> > + *
> > + * Returns:
> > + * Pointer to the allocated GPU SVM range on success, ERR_PTR() on
> > failure.
> > + */
> > +static struct drm_gpusvm_range *
> > +drm_gpusvm_range_alloc(struct drm_gpusvm *gpusvm,
> > +		       struct drm_gpusvm_notifier *notifier,
> > +		       u64 fault_addr, u64 chunk_size, bool
> > migrate_vram)
> > +{
> > +	struct drm_gpusvm_range *range;
> > +
> > +	if (gpusvm->ops->range_alloc)
> > +		range = gpusvm->ops->range_alloc(gpusvm);
> > +	else
> > +		range = kzalloc(sizeof(*range), GFP_KERNEL);
> > +
> > +	if (!range)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	kref_init(&range->refcount);
> > +	range->gpusvm = gpusvm;
> > +	range->notifier = notifier;
> > +	range->va.start = ALIGN_DOWN(fault_addr, chunk_size);
> > +	range->va.end = ALIGN(fault_addr + 1, chunk_size);
> > +	INIT_LIST_HEAD(&range->rb.entry);
> > +	range->notifier_seq = LONG_MAX;
> > +	range->flags.migrate_vram = migrate_vram ? 1 : 0;
> > +
> > +	return range;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_check_pages - Check pages
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + * @start: Start address
> > + * @end: End address
> > + *
> > + * Check if pages between start and end have been faulted in on the
> > + * CPU. Used to prevent migration of pages without a CPU backing store.
> > + *
> > + * Returns:
> > + * True if pages have been faulted into CPU, False otherwise
> > + */
> > +static bool drm_gpusvm_check_pages(struct drm_gpusvm *gpusvm,
> > +				   struct drm_gpusvm_notifier
> > *notifier,
> > +				   u64 start, u64 end)
> > +{
> > +	struct hmm_range hmm_range = {
> > +		.default_flags = 0,
> > +		.notifier = &notifier->notifier,
> > +		.start = start,
> > +		.end = end,
> > +		.dev_private_owner = gpusvm-
> > >device_private_page_owner,
> > +	};
> > +	unsigned long timeout =
> > +		jiffies +
> > msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > +	unsigned long *pfns;
> > +	unsigned long npages = npages_in_range(start, end);
> > +	int err, i;
> > +
> > +	mmap_assert_locked(gpusvm->mm);
> > +
> > +	pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
> > +	if (!pfns)
> > +		return false;
> > +
> > +	hmm_range.notifier_seq =
> > mmu_interval_read_begin(&notifier->notifier);
> > +	hmm_range.hmm_pfns = pfns;
> > +
> > +	while (true) {
> > +		err = hmm_range_fault(&hmm_range);
> > +		if (err == -EBUSY) {
> > +			if (time_after(jiffies, timeout))
> > +				break;
> > +
> > +			hmm_range.notifier_seq =
> > mmu_interval_read_begin(&notifier->notifier);
> > +			continue;
> > +		}
> > +		break;
> > +	}
> > +	if (err)
> > +		goto err_free;
> > +
> > +	for (i = 0; i < npages; ++i) {
> > +		if (!(pfns[i] & HMM_PFN_VALID)) {
> > +			err = -EFAULT;
> > +			goto err_free;
> > +		}
> > +	}
> > +
> > +err_free:
> > +	kvfree(pfns);
> > +	return err ? false : true;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_chunk_size - Determine chunk size for GPU
> > SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + * @vas: Pointer to the virtual memory area structure
> > + * @fault_addr: Fault address
> > + * @gpuva_start: Start address of GPUVA which mirrors CPU
> > + * @gpuva_end: End address of GPUVA which mirrors CPU
> > + * @check_pages: Flag indicating whether to check pages
> > + *
> > + * This function determines the chunk size for the GPU SVM range
> > based on the
> > + * fault address, GPU SVM chunk sizes, existing GPU SVM ranges,
> > and the virtual
> > + * memory area boundaries.
> > + *
> > + * Returns:
> > + * Chunk size on success, LONG_MAX on failure.
> > + */
> > +static u64 drm_gpusvm_range_chunk_size(struct drm_gpusvm
> > *gpusvm,
> > +				       struct drm_gpusvm_notifier
> > *notifier,
> > +				       struct vm_area_struct *vas,
> > +				       u64 fault_addr, u64 gpuva_start,
> > +				       u64 gpuva_end, bool check_pages)
> > +{
> > +	u64 start, end;
> > +	int i = 0;
> > +
> > +retry:
> > +	for (; i < gpusvm->num_chunks; ++i) {
> > +		start = ALIGN_DOWN(fault_addr, gpusvm-
> > >chunk_sizes[i]);
> > +		end = ALIGN(fault_addr + 1, gpusvm->chunk_sizes[i]);
> > +
> > +		if (start >= vas->vm_start && end <= vas->vm_end
> > &&
> > +		    start >= notifier->interval.start &&
> > +		    end <= notifier->interval.end &&
> > +		    start >= gpuva_start && end <= gpuva_end)
> > +			break;
> > +	}
> > +
> > +	if (i == gpusvm->num_chunks)
> > +		return LONG_MAX;
> > +
> > +	/*
> > +	 * If the allocation is larger than a page, ensure it does not
> > +	 * overlap with existing ranges.
> > +	 */
> > +	if (end - start != SZ_4K) {
> > +		struct drm_gpusvm_range *range;
> > +
> > +		range = drm_gpusvm_range_find(notifier, start, end);
> > +		if (range) {
> > +			++i;
> > +			goto retry;
> > +		}
> > +
> > +		/*
> > +		 * XXX: Only create range on pages CPU has faulted in.
> > Without
> > +		 * this check, or prefault, on BMG
> > 'xe_exec_system_allocator --r
> > +		 * process-many-malloc' fails. In the failure case, each
> > process
> > +		 * mallocs 16k but the CPU VMA is ~128k which results
> > in 64k SVM
> > +		 * ranges. When migrating the SVM ranges, some
> > processes fail in
> > +		 * drm_gpusvm_migrate_to_vram with
> > 'migrate.cpages != npages'
> > +		 * and then upon drm_gpusvm_range_get_pages
> > device pages from
> > +		 * other processes are collected + faulted in which
> > creates all
> > +		 * sorts of problems. Unsure exactly how this is
> > +		 * happening; the problem also goes away if
> > +		 * 'xe_exec_system_allocator --r process-many-malloc'
> > +		 * mallocs at least 64k at a time.
> > +		 */
> > +		if (check_pages &&
> > +		    !drm_gpusvm_check_pages(gpusvm, notifier, start,
> > end)) {
> > +			++i;
> > +			goto retry;
> > +		}
> > +	}
> > +
> > +	return end - start;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_find_or_insert - Find or insert GPU SVM
> > range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @fault_addr: Fault address
> > + * @gpuva_start: Start address of GPUVA which mirrors CPU
> > + * @gpuva_end: End address of GPUVA which mirrors CPU
> > + * @ctx: GPU SVM context
> > + *
> > + * This function finds an existing GPU SVM range or inserts a newly
> > + * allocated one based on the fault address. The caller must hold a lock
> > + * to protect range lookup and insertion.
> > + *
> > + * Returns:
> > + * Pointer to the GPU SVM range on success, ERR_PTR() on failure.
> > + */
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm,
> > u64 fault_addr,
> > +				u64 gpuva_start, u64 gpuva_end,
> > +				const struct drm_gpusvm_ctx *ctx)
> > +{
> > +	struct drm_gpusvm_notifier *notifier;
> > +	struct drm_gpusvm_range *range;
> > +	struct mm_struct *mm = gpusvm->mm;
> > +	struct vm_area_struct *vas;
> > +	bool notifier_alloc = false;
> > +	u64 chunk_size;
> > +	int err;
> > +	bool migrate_vram;
> > +
> > +	if (fault_addr < gpusvm->mm_start ||
> > +	    fault_addr > gpusvm->mm_start + gpusvm->mm_range) {
> > +		err = -EINVAL;
> > +		goto err_out;
> > +	}
> > +
> > +	if (!ctx->mmap_locked) {
> > +		if (!mmget_not_zero(mm)) {
> > +			err = -EFAULT;
> > +			goto err_out;
> > +		}
> > +		mmap_write_lock(mm);
> > +	}
> > +
> > +	mmap_assert_write_locked(mm);
> > +
> > +	notifier = drm_gpusvm_notifier_find(gpusvm, fault_addr);
> > +	if (!notifier) {
> > +		notifier = drm_gpusvm_notifier_alloc(gpusvm,
> > fault_addr);
> > +		if (IS_ERR(notifier)) {
> > +			err = PTR_ERR(notifier);
> > +			goto err_mmunlock;
> > +		}
> > +		notifier_alloc = true;
> > +		err = mmu_interval_notifier_insert_locked(&notifier-
> > >notifier,
> > +							  mm, notifier-
> > >interval.start,
> > +							  notifier-
> > >interval.end -
> > +							  notifier-
> > >interval.start,
> > +
> > &drm_gpusvm_notifier_ops);
> > +		if (err)
> > +			goto err_notifier;
> > +	}
> > +
> > +	vas = vma_lookup(mm, fault_addr);
> > +	if (!vas) {
> > +		err = -ENOENT;
> > +		goto err_notifier_remove;
> > +	}
> > +
> > +	if (!ctx->read_only && !(vas->vm_flags & VM_WRITE)) {
> > +		err = -EPERM;
> > +		goto err_notifier_remove;
> > +	}
> > +
> > +	range = drm_gpusvm_range_find(notifier, fault_addr,
> > fault_addr + 1);
> > +	if (range)
> > +		goto out_mmunlock;
> > +	/*
> > +	 * XXX: Short-circuiting migration based on migrate_vma_*
> > current
> > +	 * limitations. If/when migrate_vma_* add more support, this
> > logic will
> > +	 * have to change.
> > +	 */
> > +	migrate_vram = ctx->vram_possible &&
> > +		vma_is_anonymous(vas)
> > && !is_vm_hugetlb_page(vas);
> > +
> > +	chunk_size = drm_gpusvm_range_chunk_size(gpusvm,
> > notifier, vas,
> > +						 fault_addr,
> > gpuva_start,
> > +						 gpuva_end,
> > migrate_vram &&
> > +						 !ctx->prefault);
> > +	if (chunk_size == LONG_MAX) {
> > +		err = -EINVAL;
> > +		goto err_notifier_remove;
> > +	}
> > +
> > +	range = drm_gpusvm_range_alloc(gpusvm, notifier,
> > fault_addr, chunk_size,
> > +				       migrate_vram);
> > +	if (IS_ERR(range)) {
> > +		err = PTR_ERR(range);
> > +		goto err_notifier_remove;
> > +	}
> > +
> > +	drm_gpusvm_range_insert(notifier, range);
> > +	if (notifier_alloc)
> > +		drm_gpusvm_notifier_insert(gpusvm, notifier);
> > +
> > +	if (ctx->prefault) {
> > +		struct drm_gpusvm_ctx __ctx = *ctx;
> > +
> > +		__ctx.mmap_locked = true;
> > +		err = drm_gpusvm_range_get_pages(gpusvm, range,
> > &__ctx);
> > +		if (err)
> > +			goto err_range_remove;
> > +	}
> > +
> > +out_mmunlock:
> > +	if (!ctx->mmap_locked) {
> > +		mmap_write_unlock(mm);
> > +		mmput(mm);
> > +	}
> > +
> > +	return range;
> > +
> > +err_range_remove:
> > +	__drm_gpusvm_range_remove(notifier, range);
> > +err_notifier_remove:
> > +	if (notifier_alloc)
> > +		mmu_interval_notifier_remove(&notifier->notifier);
> > +err_notifier:
> > +	if (notifier_alloc)
> > +		drm_gpusvm_notifier_free(gpusvm, notifier);
> > +err_mmunlock:
> > +	if (!ctx->mmap_locked) {
> > +		mmap_write_unlock(mm);
> > +		mmput(mm);
> > +	}
> > +err_out:
> > +	return ERR_PTR(err);
> > +}
> > +
> > +/**
> > + * for_each_dma_page - iterate over pages in a DMA region
> > + * @i__: the current page index in the iteration
> > + * @j__: the current page index, log order, in the iteration
> > + * @npages__: the total number of pages in the DMA region
> > + * @order__: the order of the pages in the DMA region
> > + *
> > + * This macro iterates over each page in a DMA region. The DMA
> > region
> > + * is assumed to be composed of 2^@order__ pages, and the macro
> > will
> > + * step through the region one block of 2^@order__ pages at a time.
> > + */
> > +#define for_each_dma_page(i__, j__, npages__, order__)	\
> > +	for ((i__) = 0, (j__) = 0; (i__) < (npages__);	\
> > +	     (j__)++, (i__) += 0x1 << (order__))
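
To make the stride concrete: with npages__ = 16 and order__ = 2 the loop
visits (i, j) = (0, 0), (4, 1), (8, 2), (12, 3); i walks the 4K-page index in
steps of 2^order__ while j advances by one per 2^order__ block, which is the
index used into the compressed dma_addr[] array below.
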
> > +
> > +/**
> > + * __drm_gpusvm_range_unmap_pages - Unmap pages associated
> > with a GPU SVM range (internal)
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + *
> > + * This function unmaps pages associated with a GPU SVM range.
> > Assumes and
> > + * asserts correct locking is in place when called.
> > + */
> > +static void __drm_gpusvm_range_unmap_pages(struct
> > drm_gpusvm *gpusvm,
> > +					   struct drm_gpusvm_range
> > *range)
> > +{
> > +	lockdep_assert_held(&gpusvm->notifier_lock);
> > +
> > +	if (range->pages) {
> > +		unsigned long i, j, npages = npages_in_range(range-
> > >va.start,
> > +							     range-
> > >va.end);
> > +
> > +		if (range->flags.has_dma_mapping) {
> > +			for_each_dma_page(i, j, npages, range-
> > >order)
> > +				dma_unmap_page(gpusvm->drm-
> > >dev,
> > +					       range->dma_addr[j],
> > +					       PAGE_SIZE << range-
> > >order,
> > +					       DMA_BIDIRECTIONAL);
> > +		}
> > +
> > +		range->flags.has_vram_pages = false;
> > +		range->flags.has_dma_mapping = false;
> > +	}
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_free_pages - Free pages associated with a
> > GPU SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + *
> > + * This function frees pages associated with a GPU SVM range.
> > + */
> > +static void drm_gpusvm_range_free_pages(struct drm_gpusvm
> > *gpusvm,
> > +					struct drm_gpusvm_range
> > *range)
> > +{
> > +	lockdep_assert_held(&gpusvm->notifier_lock);
> > +
> > +	if (range->pages) {
> > +		if (range->flags.kfree_mapping) {
> > +			kfree(range->dma_addr);
> > +			range->flags.kfree_mapping = false;
> > +			range->pages = NULL;
> > +		} else {
> > +			kvfree(range->pages);
> > +			range->pages = NULL;
> > +		}
> > +	}
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_remove - Remove GPU SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range to be removed
> > + *
> > + * This function removes the specified GPU SVM range and also
> > removes the parent
> > + * GPU SVM notifier if no more ranges remain in the notifier. The
> > caller must
> > + * hold a lock to protect range and notifier removal.
> > + */
> > +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> > +			     struct drm_gpusvm_range *range)
> > +{
> > +	struct drm_gpusvm_notifier *notifier;
> > +
> > +	notifier = drm_gpusvm_notifier_find(gpusvm, range-
> > >va.start);
> > +	if (WARN_ON_ONCE(!notifier))
> > +		return;
> > +
> > +	drm_gpusvm_notifier_lock(gpusvm);
> > +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > +	drm_gpusvm_range_free_pages(gpusvm, range);
> > +	__drm_gpusvm_range_remove(notifier, range);
> > +	drm_gpusvm_notifier_unlock(gpusvm);
> > +
> > +	drm_gpusvm_range_put(range);
> > +
> > +	if (RB_EMPTY_ROOT(&notifier->root.rb_root)) {
> > +		if (!notifier->flags.removed)
> > +			mmu_interval_notifier_remove(&notifier-
> > >notifier);
> > +		drm_gpusvm_notifier_remove(gpusvm, notifier);
> > +		drm_gpusvm_notifier_free(gpusvm, notifier);
> > +	}
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_get - Get a reference to GPU SVM range
> > + * @range: Pointer to the GPU SVM range
> > + *
> > + * This function increments the reference count of the specified
> > GPU SVM range.
> > + *
> > + * Returns:
> > + * Pointer to the GPU SVM range.
> > + */
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_get(struct drm_gpusvm_range *range)
> > +{
> > +	kref_get(&range->refcount);
> > +
> > +	return range;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_destroy - Destroy GPU SVM range
> > + * @refcount: Pointer to the reference counter embedded in the
> > GPU SVM range
> > + *
> > + * This function destroys the specified GPU SVM range when its
> > reference count
> > + * reaches zero. If a custom range-free function is provided, it is
> > invoked to
> > + * free the range; otherwise, the range is deallocated using kfree().
> > + */
> > +static void drm_gpusvm_range_destroy(struct kref *refcount)
> > +{
> > +	struct drm_gpusvm_range *range =
> > +		container_of(refcount, struct drm_gpusvm_range,
> > refcount);
> > +	struct drm_gpusvm *gpusvm = range->gpusvm;
> > +
> > +	if (gpusvm->ops->range_free)
> > +		gpusvm->ops->range_free(range);
> > +	else
> > +		kfree(range);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_put - Put a reference to GPU SVM range
> > + * @range: Pointer to the GPU SVM range
> > + *
> > + * This function decrements the reference count of the specified
> > GPU SVM range
> > + * and frees it when the count reaches zero.
> > + */
> > +void drm_gpusvm_range_put(struct drm_gpusvm_range *range)
> > +{
> > +	kref_put(&range->refcount, drm_gpusvm_range_destroy);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_pages_valid - GPU SVM range pages valid
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + *
> > + * This function determines if a GPU SVM range's pages are valid. It is
> > + * expected to be called holding gpusvm->notifier_lock and as the last
> > + * step before committing a GPU binding.
> > + *
> > + * Returns:
> > + * True if GPU SVM range has valid pages, False otherwise
> > + */
> > +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> > +				  struct drm_gpusvm_range *range)
> > +{
> > +	lockdep_assert_held(&gpusvm->notifier_lock);
> > +
> > +	return range->flags.has_vram_pages || range-
> > >flags.has_dma_mapping;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_pages_valid_unlocked - GPU SVM range
> > pages valid unlocked
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + *
> > + * This function determines if a GPU SVM range's pages are valid. It is
> > + * expected to be called without holding gpusvm->notifier_lock.
> > + *
> > + * Returns:
> > + * True if GPU SVM range has valid pages, False otherwise
> > + */
> > +static bool
> > +drm_gpusvm_range_pages_valid_unlocked(struct drm_gpusvm
> > *gpusvm,
> > +				      struct drm_gpusvm_range *range)
> > +{
> > +	bool pages_valid;
> > +
> > +	if (!range->pages)
> > +		return false;
> > +
> > +	drm_gpusvm_notifier_lock(gpusvm);
> > +	pages_valid = drm_gpusvm_range_pages_valid(gpusvm,
> > range);
> > +	if (!pages_valid && range->flags.kfree_mapping) {
> > +		kfree(range->dma_addr);
> > +		range->flags.kfree_mapping = false;
> > +		range->pages = NULL;
> > +	}
> > +	drm_gpusvm_notifier_unlock(gpusvm);
> > +
> > +	return pages_valid;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_get_pages - Get pages for a GPU SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + * @ctx: GPU SVM context
> > + *
> > + * This function gets pages for a GPU SVM range and ensures they
> > are mapped for
> > + * DMA access.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> > +			       struct drm_gpusvm_range *range,
> > +			       const struct drm_gpusvm_ctx *ctx)
> > +{
> > +	struct mmu_interval_notifier *notifier = &range->notifier-
> > >notifier;
> > +	struct hmm_range hmm_range = {
> > +		.default_flags = HMM_PFN_REQ_FAULT | (ctx-
> > >read_only ? 0 :
> > +			HMM_PFN_REQ_WRITE),
> > +		.notifier = notifier,
> > +		.start = range->va.start,
> > +		.end = range->va.end,
> > +		.dev_private_owner = gpusvm-
> > >device_private_page_owner,
> > +	};
> > +	struct mm_struct *mm = gpusvm->mm;
> > +	unsigned long timeout =
> > +		jiffies +
> > msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > +	unsigned long i, j;
> > +	unsigned long npages = npages_in_range(range->va.start,
> > range->va.end);
> > +	unsigned int order = 0;
> > +	unsigned long *pfns;
> > +	struct page **pages;
> > +	int err = 0;
> > +	bool vram_pages = !!range->flags.migrate_vram;
> > +	bool alloc_pfns = false, kfree_mapping;
> > +
> > +retry:
> > +	kfree_mapping = false;
> > +	hmm_range.notifier_seq =
> > mmu_interval_read_begin(notifier);
> > +	if (drm_gpusvm_range_pages_valid_unlocked(gpusvm,
> > range))
> > +		return 0;
> > +
> > +	if (range->notifier_seq == hmm_range.notifier_seq &&
> > range->pages) {
> > +		if (ctx->prefault)
> > +			return 0;
> > +
> > +		pfns = (unsigned long *)range->pages;
> > +		pages = range->pages;
> > +		goto map_pages;
> > +	}
> > +
> > +	if (!range->pages) {
> > +		pfns = kvmalloc_array(npages, sizeof(*pfns),
> > GFP_KERNEL);
> > +		if (!pfns)
> > +			return -ENOMEM;
> > +		alloc_pfns = true;
> > +	} else {
> > +		pfns = (unsigned long *)range->pages;
> > +	}
> > +
> > +	if (!ctx->mmap_locked) {
> > +		if (!mmget_not_zero(mm)) {
> > +			err = -EFAULT;
> > +			goto err_out;
> > +		}
> > +	}
> > +
> > +	hmm_range.hmm_pfns = pfns;
> > +	while (true) {
> > +		/* Must be checked after mmu_interval_read_begin
> > */
> > +		if (range->flags.unmapped) {
> > +			err = -EFAULT;
> > +			break;
> > +		}
> > +
> > +		if (!ctx->mmap_locked) {
> > +			/*
> > +			 * XXX: HMM locking document indicates only
> > a read-lock
> > +			 * is required but there appears to be a window
> > between
> > +			 * the MMU_NOTIFY_MIGRATE event
> > triggered in a CPU fault
> > +			 * via migrate_vma_setup and the pages
> > actually moving
> > +			 * in migrate_vma_finalize in which this code
> > can grab
> > +			 * garbage pages. Grabbing the write-lock if
> > the range
> > +			 * is attached to vram appears to protect
> > against this
> > +			 * race.
> > +			 */
> > +			if (vram_pages)
> > +				mmap_write_lock(mm);
> > +			else
> > +				mmap_read_lock(mm);
> > +		}
> > +		err = hmm_range_fault(&hmm_range);
> > +		if (!ctx->mmap_locked) {
> > +			if (vram_pages)
> > +				mmap_write_unlock(mm);
> > +			else
> > +				mmap_read_unlock(mm);
> > +		}
> > +
> > +		if (err == -EBUSY) {
> > +			if (time_after(jiffies, timeout))
> > +				break;
> > +
> > +			hmm_range.notifier_seq =
> > mmu_interval_read_begin(notifier);
> > +			continue;
> > +		}
> > +		break;
> > +	}
> > +	if (!ctx->mmap_locked)
> > +		mmput(mm);
> > +	if (err)
> > +		goto err_free;
> > +
> > +	pages = (struct page **)pfns;
> > +
> > +	if (ctx->prefault) {
> > +		range->pages = pages;
> > +		goto set_seqno;
> > +	}
> > +
> > +map_pages:
> > +	if (is_device_private_page(hmm_pfn_to_page(pfns[0]))) {
> > +		WARN_ON_ONCE(!range->vram_allocation);
> > +
> > +		for (i = 0; i < npages; ++i) {
> > +			pages[i] = hmm_pfn_to_page(pfns[i]);
> > +
> > +			if
> > (WARN_ON_ONCE(!is_device_private_page(pages[i]))) {
> > +				err = -EOPNOTSUPP;
> > +				goto err_free;
> > +			}
> > +		}
> > +
> > +		/* Do not race with notifier unmapping pages */
> > +		drm_gpusvm_notifier_lock(gpusvm);
> > +		range->flags.has_vram_pages = true;
> > +		range->pages = pages;
> > +		if (mmu_interval_read_retry(notifier,
> > hmm_range.notifier_seq)) {
> > +			err = -EAGAIN;
> > +
> > 	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > +		}
> > +		drm_gpusvm_notifier_unlock(gpusvm);
> > +	} else {
> > +		dma_addr_t *dma_addr = (dma_addr_t *)pfns;
> > +
> > +		for_each_dma_page(i, j, npages, order) {
> > +			if (WARN_ON_ONCE(i && order !=
> > +
> > hmm_pfn_to_map_order(pfns[i]))) {
> > +				err = -EOPNOTSUPP;
> > +				npages = i;
> > +				goto err_unmap;
> > +			}
> > +			order = hmm_pfn_to_map_order(pfns[i]);
> > +
> > +			pages[j] = hmm_pfn_to_page(pfns[i]);
> > +			if
> > (WARN_ON_ONCE(is_zone_device_page(pages[j]))) {
> > +				err = -EOPNOTSUPP;
> > +				npages = i;
> > +				goto err_unmap;
> > +			}
> > +
> > +			set_page_dirty_lock(pages[j]);
> > +			mark_page_accessed(pages[j]);
> > +
> > +			dma_addr[j] = dma_map_page(gpusvm-
> > >drm->dev,
> > +						   pages[j], 0,
> > +						   PAGE_SIZE << order,
> > +
> > DMA_BIDIRECTIONAL);
> > +			if (dma_mapping_error(gpusvm->drm->dev,
> > dma_addr[j])) {
> > +				err = -EFAULT;
> > +				npages = i;
> > +				goto err_unmap;
> > +			}
> > +		}
> > +
> > +		/* Huge pages, reduce memory footprint */
> > +		if (order) {
> > +			dma_addr = kmalloc_array(j,
> > sizeof(*dma_addr),
> > +						 GFP_KERNEL);
> > +			if (dma_addr) {
> > +				for (i = 0; i < j; ++i)
> > +					dma_addr[i] =
> > (dma_addr_t)pfns[i];
> > +				kvfree(pfns);
> > +				kfree_mapping = true;
> > +			} else {
> > +				dma_addr = (dma_addr_t *)pfns;
> > +			}
> > +		}
> > +
> > +		/* Do not race with notifier unmapping pages */
> > +		drm_gpusvm_notifier_lock(gpusvm);
> > +		range->order = order;
> > +		range->flags.kfree_mapping = kfree_mapping;
> > +		range->flags.has_dma_mapping = true;
> > +		range->dma_addr = dma_addr;
> > +		range->vram_allocation = NULL;
> > +		if (mmu_interval_read_retry(notifier,
> > hmm_range.notifier_seq)) {
> > +			err = -EAGAIN;
> > +
> > 	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > +		}
> > +		drm_gpusvm_notifier_unlock(gpusvm);
> > +	}
> > +
> > +	if (err == -EAGAIN)
> > +		goto retry;
> > +set_seqno:
> > +	range->notifier_seq = hmm_range.notifier_seq;
> > +
> > +	return 0;
> > +
> > +err_unmap:
> > +	for_each_dma_page(i, j, npages, order)
> > +		dma_unmap_page(gpusvm->drm->dev,
> > +			       (dma_addr_t)pfns[j],
> > +			       PAGE_SIZE << order,
> > DMA_BIDIRECTIONAL);
> > +err_free:
> > +	if (alloc_pfns)
> > +		kvfree(pfns);
> > +err_out:
> > +	return err;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_unmap_pages - Unmap pages associated
> > with a GPU SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + * @ctx: GPU SVM context
> > + *
> > + * This function unmaps pages associated with a GPU SVM range. If
> > @in_notifier
> > + * is set, it is assumed that gpusvm->notifier_lock is held in write
> > mode; if it
> > + * is clear, it acquires gpusvm->notifier_lock in read mode. Must be
> > called on
> > + * each GPU SVM range attached to notifier in gpusvm->ops-
> > >invalidate for IOMMU
> > + * security model.
> > + */
> > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm
> > *gpusvm,
> > +				  struct drm_gpusvm_range *range,
> > +				  const struct drm_gpusvm_ctx *ctx)
> > +{
> > +	if (ctx->in_notifier)
> > +		lockdep_assert_held_write(&gpusvm->notifier_lock);
> > +	else
> > +		drm_gpusvm_notifier_lock(gpusvm);
> > +
> > +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > +
> > +	if (!ctx->in_notifier)
> > +		drm_gpusvm_notifier_unlock(gpusvm);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migration_put_page - Put a migration page
> > + * @page: Pointer to the page to put
> > + *
> > + * This function unlocks and puts a page.
> > + */
> > +static void drm_gpusvm_migration_put_page(struct page *page)
> > +{
> > +	unlock_page(page);
> > +	put_page(page);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migration_put_pages - Put migration pages
> > + * @npages: Number of pages
> > + * @migrate_pfn: Array of migrate page frame numbers
> > + *
> > + * This function puts an array of pages.
> > + */
> > +static void drm_gpusvm_migration_put_pages(unsigned long
> > npages,
> > +					   unsigned long *migrate_pfn)
> > +{
> > +	unsigned long i;
> > +
> > +	for (i = 0; i < npages; ++i) {
> > +		if (!migrate_pfn[i])
> > +			continue;
> > +
> > +		drm_gpusvm_migration_put_page(migrate_pfn_to_page(migrate_pfn[i]));
> > +		migrate_pfn[i] = 0;
> > +	}
> > +}
> > +
> > +/**
> > + * drm_gpusvm_get_vram_page - Get a reference to a VRAM page
> > + * @page: Pointer to the page
> > + * @zdd: Pointer to the GPU SVM zone device data
> > + *
> > + * This function associates the given page with the specified GPU
> > SVM zone
> > + * device data and initializes it for zone device usage.
> > + */
> > +static void drm_gpusvm_get_vram_page(struct page *page,
> > +				     struct drm_gpusvm_zdd *zdd)
> > +{
> > +	page->zone_device_data = drm_gpusvm_zdd_get(zdd);
> > +	zone_device_page_init(page);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_map_pages() - Map migration pages for
> > GPU SVM migration
> > + * @dev: The device for which the pages are being mapped
> > + * @dma_addr: Array to store DMA addresses corresponding to
> > mapped pages
> > + * @migrate_pfn: Array of migrate page frame numbers to map
> > + * @npages: Number of pages to map
> > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > + *
> > + * This function maps pages of memory for migration usage in GPU
> > SVM. It
> > + * iterates over each page frame number provided in @migrate_pfn,
> > maps the
> > + * corresponding page, and stores the DMA address in the provided
> > @dma_addr
> > + * array.
> > + *
> > + * Return: 0 on success, -EFAULT if an error occurs during mapping.
> > + */
> > +static int drm_gpusvm_migrate_map_pages(struct device *dev,
> > +					dma_addr_t *dma_addr,
> > +					long unsigned int
> > *migrate_pfn,
> > +					unsigned long npages,
> > +					enum dma_data_direction dir)
> > +{
> > +	unsigned long i;
> > +
> > +	for (i = 0; i < npages; ++i) {
> > +		struct page *page =
> > migrate_pfn_to_page(migrate_pfn[i]);
> > +
> > +		if (!page)
> > +			continue;
> > +
> > +		if (WARN_ON_ONCE(is_zone_device_page(page)))
> > +			return -EFAULT;
> > +
> > +		dma_addr[i] = dma_map_page(dev, page, 0,
> > PAGE_SIZE, dir);
> > +		if (dma_mapping_error(dev, dma_addr[i]))
> > +			return -EFAULT;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_unmap_pages() - Unmap pages previously
> > mapped for GPU SVM migration
> > + * @dev: The device for which the pages were mapped
> > + * @dma_addr: Array of DMA addresses corresponding to mapped
> > pages
> > + * @npages: Number of pages to unmap
> > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > + *
> > + * This function unmaps previously mapped pages of memory for
> > GPU Shared Virtual
> > + * Memory (SVM). It iterates over each DMA address provided in
> > @dma_addr, checks
> > + * if it's valid and not already unmapped, and unmaps the
> > corresponding page.
> > + */
> > +static void drm_gpusvm_migrate_unmap_pages(struct device *dev,
> > +					   dma_addr_t *dma_addr,
> > +					   unsigned long npages,
> > +					   enum dma_data_direction
> > dir)
> > +{
> > +	unsigned long i;
> > +
> > +	for (i = 0; i < npages; ++i) {
> > +		if (!dma_addr[i] || dma_mapping_error(dev,
> > dma_addr[i]))
> > +			continue;
> > +
> > +		dma_unmap_page(dev, dma_addr[i], PAGE_SIZE, dir);
> > +	}
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_to_vram - Migrate GPU SVM range to
> > VRAM
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + * @vram_allocation: Driver-private pointer to the VRAM allocation. The
> > + *                   caller should hold a reference to the VRAM
> > + *                   allocation, which should be dropped via
> > + *                   ops->vram_release or upon the failure of this
> > + *                   function.
> > + * @ctx: GPU SVM context
> > + *
> > + * This function migrates the specified GPU SVM range to VRAM. It
> > performs the
> > + * necessary setup and invokes the driver-specific operations for
> > migration to
> > + * VRAM. Upon successful return, @vram_allocation can safely reference
> > + * @range until ops->vram_release is called; ops->vram_release is only
> > + * called after a successful return of this function.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> > +			       struct drm_gpusvm_range *range,
> > +			       void *vram_allocation,
> > +			       const struct drm_gpusvm_ctx *ctx)
> > +{
> > +	u64 start = range->va.start, end = range->va.end;
> > +	struct migrate_vma migrate = {
> > +		.start		= start,
> > +		.end		= end,
> > +		.pgmap_owner	= gpusvm-
> > >device_private_page_owner,
> > +		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
> > +	};
> > +	struct mm_struct *mm = gpusvm->mm;
> > +	unsigned long i, npages = npages_in_range(start, end);
> > +	struct vm_area_struct *vas;
> > +	struct drm_gpusvm_zdd *zdd = NULL;
> > +	struct page **pages;
> > +	dma_addr_t *dma_addr;
> > +	void *buf;
> > +	int err;
> > +
> > +	if (!range->flags.migrate_vram)
> > +		return -EINVAL;
> > +
> > +	if (!gpusvm->ops->populate_vram_pfn || !gpusvm->ops-
> > >copy_to_vram ||
> > +	    !gpusvm->ops->copy_to_sram)
> > +		return -EOPNOTSUPP;
> > +
> > +	if (!ctx->mmap_locked) {
> > +		if (!mmget_not_zero(mm)) {
> > +			err = -EFAULT;
> > +			goto err_out;
> > +		}
> > +		mmap_write_lock(mm);
> > +	}
> > +
> > +	mmap_assert_locked(mm);
> > +
> > +	vas = vma_lookup(mm, start);
> > +	if (!vas) {
> > +		err = -ENOENT;
> > +		goto err_mmunlock;
> > +	}
> > +
> > +	if (end > vas->vm_end || start < vas->vm_start) {
> > +		err = -EINVAL;
> > +		goto err_mmunlock;
> > +	}
> > +
> > +	if (!vma_is_anonymous(vas)) {
> > +		err = -EBUSY;
> > +		goto err_mmunlock;
> > +	}
> > +
> > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) +
> > sizeof(*dma_addr) +
> > +		       sizeof(*pages), GFP_KERNEL);
> > +	if (!buf) {
> > +		err = -ENOMEM;
> > +		goto err_mmunlock;
> > +	}
> > +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr))
> > * npages;
> > +
> > +	zdd = drm_gpusvm_zdd_alloc(range);
> > +	if (!zdd) {
> > +		err = -ENOMEM;
> > +		goto err_free;
> > +	}
> > +
> > +	migrate.vma = vas;
> > +	migrate.src = buf;
> > +	migrate.dst = migrate.src + npages;
> > +
> > +	err = migrate_vma_setup(&migrate);
> > +	if (err)
> > +		goto err_free;
> > +
> > +	/*
> > +	 * FIXME: Below cases, !migrate.cpages and migrate.cpages !=
> > npages, not
> > +	 * always an error. Need to revisit possible cases and how to
> > handle. We
> > +	 * could prefault on migrate.cpages != npages via
> > hmm_range_fault.
> > +	 */
> > +
> > +	if (!migrate.cpages) {
> > +		err = -EFAULT;
> > +		goto err_free;
> > +	}
> > +
> > +	if (migrate.cpages != npages) {
> > +		err = -EBUSY;
> > +		goto err_finalize;
> > +	}
> > +
> > +	err = gpusvm->ops->populate_vram_pfn(gpusvm,
> > vram_allocation, npages,
> > +					     migrate.dst);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> > dma_addr,
> > +					   migrate.src, npages,
> > DMA_TO_DEVICE);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	for (i = 0; i < npages; ++i) {
> > +		struct page *page = pfn_to_page(migrate.dst[i]);
> > +
> > +		pages[i] = page;
> > +		migrate.dst[i] = migrate_pfn(migrate.dst[i]);
> > +		drm_gpusvm_get_vram_page(page, zdd);
> > +	}
> > +
> > +	err = gpusvm->ops->copy_to_vram(gpusvm, pages,
> > dma_addr, npages);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	/* Upon success bind vram allocation to range and zdd */
> > +	range->vram_allocation = vram_allocation;
> > +	WRITE_ONCE(zdd->vram_allocation, vram_allocation);	/*
> > Owns ref */
> > +
> > +err_finalize:
> > +	if (err)
> > +		drm_gpusvm_migration_put_pages(npages,
> > migrate.dst);
> > +	migrate_vma_pages(&migrate);
> > +	migrate_vma_finalize(&migrate);
> > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev,
> > dma_addr, npages,
> > +				       DMA_TO_DEVICE);
> > +err_free:
> > +	if (zdd)
> > +		drm_gpusvm_zdd_put(zdd);
> > +	kvfree(buf);
> > +err_mmunlock:
> > +	if (!ctx->mmap_locked) {
> > +		mmap_write_unlock(mm);
> > +		mmput(mm);
> > +	}
> > +err_out:
> > +	return err;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_populate_sram_pfn - Populate SRAM
> > PFNs for a VM area
> > + * @vas: Pointer to the VM area structure, can be NULL
> > + * @npages: Number of pages to populate
> > + * @src_mpfn: Source array of migrate PFNs
> > + * @mpfn: Array of migrate PFNs to populate
> > + * @addr: Start address for PFN allocation
> > + *
> > + * This function populates the SRAM migrate page frame numbers
> > (PFNs) for the
> > + * specified VM area structure. It allocates and locks pages in the VM
> > area for
> > + * SRAM usage. If @vas is non-NULL, alloc_page_vma() is used for allocation;
> > + * otherwise, alloc_page() is used.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +static int drm_gpusvm_migrate_populate_sram_pfn(struct
> > vm_area_struct *vas,
> > +						unsigned long npages,
> > +						unsigned long
> > *src_mpfn,
> > +						unsigned long *mpfn,
> > u64 addr)
> > +{
> > +	unsigned long i;
> > +
> > +	for (i = 0; i < npages; ++i, addr += PAGE_SIZE) {
> > +		struct page *page;
> > +
> > +		if (!(src_mpfn[i] & MIGRATE_PFN_MIGRATE))
> > +			continue;
> > +
> > +		if (vas)
> > +			page = alloc_page_vma(GFP_HIGHUSER, vas,
> > addr);
> > +		else
> > +			page = alloc_page(GFP_HIGHUSER);
> > +
> > +		if (!page)
> > +			return -ENOMEM;
> > +
> > +		lock_page(page);
> > +		mpfn[i] = migrate_pfn(page_to_pfn(page));
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_evict_to_sram - Evict GPU SVM range to SRAM
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + *
> > + * Similar to __drm_gpusvm_migrate_to_sram but does not require the mmap lock;
> > + * migration is done via the migrate_device_* functions. This is a fallback
> > + * path, as it is preferred to issue migrations with the mmap lock held.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +static int drm_gpusvm_evict_to_sram(struct drm_gpusvm *gpusvm,
> > +				    struct drm_gpusvm_range *range)
> > +{
> > +	unsigned long npages;
> > +	struct page **pages;
> > +	unsigned long *src, *dst;
> > +	dma_addr_t *dma_addr;
> > +	void *buf;
> > +	int i, err = 0;
> > +
> > +	npages = npages_in_range(range->va.start, range->va.end);
> > +
> > +	buf = kvcalloc(npages, 2 * sizeof(*src) + sizeof(*dma_addr) +
> > +		       sizeof(*pages), GFP_KERNEL);
> > +	if (!buf) {
> > +		err = -ENOMEM;
> > +		goto err_out;
> > +	}
> > +	src = buf;
> > +	dst = buf + (sizeof(*src) * npages);
> > +	dma_addr = buf + (2 * sizeof(*src) * npages);
> > +	pages = buf + (2 * sizeof(*src) + sizeof(*dma_addr)) * npages;
> > +
> > +	err = gpusvm->ops->populate_vram_pfn(gpusvm, range->vram_allocation,
> > +					     npages, src);
> > +	if (err)
> > +		goto err_free;
> > +
> > +	err = migrate_device_vma_range(gpusvm->mm,
> > +				       gpusvm->device_private_page_owner, src,
> > +				       npages, range->va.start);
> > +	if (err)
> > +		goto err_free;
> > +
> > +	err = drm_gpusvm_migrate_populate_sram_pfn(NULL,
> > npages, src, dst, 0);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> > dma_addr,
> > +					   dst, npages,
> > DMA_BIDIRECTIONAL);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	for (i = 0; i < npages; ++i)
> > +		pages[i] = migrate_pfn_to_page(src[i]);
> > +
> > +	err = gpusvm->ops->copy_to_sram(gpusvm, pages,
> > dma_addr, npages);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +err_finalize:
> > +	if (err)
> > +		drm_gpusvm_migration_put_pages(npages, dst);
> > +	migrate_device_pages(src, dst, npages);
> > +	migrate_device_finalize(src, dst, npages);
> > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev,
> > dma_addr, npages,
> > +				       DMA_BIDIRECTIONAL);
> > +err_free:
> > +	kvfree(buf);
> > +err_out:
> > +
> > +	return err;
> > +}
> > +
> > +/**
> > + * __drm_gpusvm_migrate_to_sram - Migrate GPU SVM range to
> > SRAM (internal)
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @vas: Pointer to the VM area structure
> > + * @page: Pointer to the page for fault handling (can be NULL)
> > + * @start: Start address of the migration range
> > + * @end: End address of the migration range
> > + *
> > + * This internal function performs the migration of the specified GPU SVM range
> > + * to SRAM. It sets up the migration, populates and DMA maps the SRAM PFNs, and
> > + * invokes the driver-specific operations for migration to SRAM.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +static int __drm_gpusvm_migrate_to_sram(struct drm_gpusvm
> > *gpusvm,
> > +					struct vm_area_struct *vas,
> > +					struct page *page,
> > +					u64 start, u64 end)
> > +{
> > +	struct migrate_vma migrate = {
> > +		.vma		= vas,
> > +		.pgmap_owner	= gpusvm->device_private_page_owner,
> > +		.flags		=
> > MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> > +		.fault_page	= page,
> > +	};
> > +	unsigned long npages;
> > +	struct page **pages;
> > +	dma_addr_t *dma_addr;
> > +	void *buf;
> > +	int i, err = 0;
> > +
> > +	mmap_assert_locked(gpusvm->mm);
> > +
> > +	/* Corner case where the VM area struct has been partially unmapped */
> > +	if (start < vas->vm_start)
> > +		start = vas->vm_start;
> > +	if (end > vas->vm_end)
> > +		end = vas->vm_end;
> > +
> > +	migrate.start = start;
> > +	migrate.end = end;
> > +	npages = npages_in_range(start, end);
> > +
> > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) +
> > sizeof(*dma_addr) +
> > +		       sizeof(*pages), GFP_KERNEL);
> > +	if (!buf) {
> > +		err = -ENOMEM;
> > +		goto err_out;
> > +	}
> > +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr))
> > * npages;
> > +
> > +	migrate.vma = vas;
> > +	migrate.src = buf;
> > +	migrate.dst = migrate.src + npages;
> > +
> > +	err = migrate_vma_setup(&migrate);
> > +	if (err)
> > +		goto err_free;
> > +
> > +	/* Raced with another CPU fault, nothing to do */
> > +	if (!migrate.cpages)
> > +		goto err_free;
> > +
> > +	err = drm_gpusvm_migrate_populate_sram_pfn(vas, npages,
> > +						   migrate.src,
> > migrate.dst,
> > +						   start);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> > dma_addr,
> > +					   migrate.dst, npages,
> > +					   DMA_BIDIRECTIONAL);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	for (i = 0; i < npages; ++i)
> > +		pages[i] = migrate_pfn_to_page(migrate.src[i]);
> > +
> > +	err = gpusvm->ops->copy_to_sram(gpusvm, pages,
> > dma_addr, npages);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +err_finalize:
> > +	if (err)
> > +		drm_gpusvm_migration_put_pages(npages,
> > migrate.dst);
> > +	migrate_vma_pages(&migrate);
> > +	migrate_vma_finalize(&migrate);
> > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev,
> > dma_addr, npages,
> > +				       DMA_BIDIRECTIONAL);
> > +err_free:
> > +	kvfree(buf);
> > +err_out:
> > +	mmap_assert_locked(gpusvm->mm);
> > +
> > +	return err;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_to_sram - Migrate (evict) GPU SVM range
> > to SRAM
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + * @ctx: GPU SVM context
> > + *
> > + * This function initiates the migration of the specified GPU SVM
> > range to
> > + * SRAM. It performs necessary checks and invokes the internal
> > migration
> > + * function for actual migration.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > +			       struct drm_gpusvm_range *range,
> > +			       const struct drm_gpusvm_ctx *ctx)
> > +{
> > +	u64 start = range->va.start, end = range->va.end;
> > +	struct mm_struct *mm = gpusvm->mm;
> > +	struct vm_area_struct *vas;
> > +	int err;
> > +	bool retry = false;
> > +
> > +	if (!ctx->mmap_locked) {
> > +		if (!mmget_not_zero(mm)) {
> > +			err = -EFAULT;
> > +			goto err_out;
> > +		}
> > +		if (ctx->trylock_mmap) {
> > +			if (!mmap_read_trylock(mm))  {
> > +				err =
> > drm_gpusvm_evict_to_sram(gpusvm, range);
> > +				goto err_mmput;
> > +			}
> > +		} else {
> > +			mmap_read_lock(mm);
> > +		}
> > +	}
> > +
> > +	mmap_assert_locked(mm);
> > +
> > +	/*
> > +	 * Loop required to find all VM area structs for the corner case when
> > +	 * the VRAM backing has been partially unmapped from the MM's address
> > +	 * space.
> > +	 */
> > +again:
> > +	vas = find_vma(mm, start);
> > +	if (!vas) {
> > +		if (!retry)
> > +			err = -ENOENT;
> > +		goto err_mmunlock;
> > +	}
> > +
> > +	if (end <= vas->vm_start || start >= vas->vm_end) {
> > +		if (!retry)
> > +			err = -EINVAL;
> > +		goto err_mmunlock;
> > +	}
> > +
> > +	err = __drm_gpusvm_migrate_to_sram(gpusvm, vas, NULL,
> > start, end);
> > +	if (err)
> > +		goto err_mmunlock;
> > +
> > +	if (vas->vm_end < end) {
> > +		retry = true;
> > +		start = vas->vm_end;
> > +		goto again;
> > +	}
> > +
> > +	if (!ctx->mmap_locked) {
> > +		mmap_read_unlock(mm);
> > +		/*
> > +		 * Using mmput_async as this function can be called
> > while
> > +		 * holding a dma-resv lock, and a final put can grab the
> > mmap
> > +		 * lock, causing a lock inversion.
> > +		 */
> > +		mmput_async(mm);
> > +	}
> > +
> > +	return 0;
> > +
> > +err_mmunlock:
> > +	if (!ctx->mmap_locked)
> > +		mmap_read_unlock(mm);
> > +err_mmput:
> > +	if (!ctx->mmap_locked)
> > +		mmput_async(mm);
> > +err_out:
> > +	return err;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_page_free - Put GPU SVM zone device data
> > associated with a page
> > + * @page: Pointer to the page
> > + *
> > + * This function is a callback used to put the GPU SVM zone device
> > data
> > + * associated with a page when it is being released.
> > + */
> > +static void drm_gpusvm_page_free(struct page *page)
> > +{
> > +	drm_gpusvm_zdd_put(page->zone_device_data);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_to_ram - Migrate GPU SVM range to RAM
> > (page fault handler)
> > + * @vmf: Pointer to the fault information structure
> > + *
> > + * This function is a page fault handler used to migrate a GPU SVM
> > range to RAM.
> > + * It retrieves the GPU SVM range information from the faulting
> > page and invokes
> > + * the internal migration function to migrate the range back to RAM.
> > + *
> > + * Returns:
> > + * VM_FAULT_SIGBUS on failure, 0 on success.
> > + */
> > +static vm_fault_t drm_gpusvm_migrate_to_ram(struct vm_fault
> > *vmf)
> > +{
> > +	struct drm_gpusvm_zdd *zdd = vmf->page->zone_device_data;
> > +	int err;
> > +
> > +	err = __drm_gpusvm_migrate_to_sram(zdd->range->gpusvm,
> > +					   vmf->vma, vmf->page,
> > +					   zdd->range->va.start,
> > +					   zdd->range->va.end);
> > +
> > +	return err ? VM_FAULT_SIGBUS : 0;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_pagemap_ops - Device page map operations for
> > GPU SVM
> > + */
> > +static const struct dev_pagemap_ops drm_gpusvm_pagemap_ops =
> > {
> > +	.page_free = drm_gpusvm_page_free,
> > +	.migrate_to_ram = drm_gpusvm_migrate_to_ram,
> > +};
> > +
> > +/**
> > + * drm_gpusvm_pagemap_ops_get - Retrieve GPU SVM device
> > page map operations
> > + *
> > + * Returns:
> > + * Pointer to the GPU SVM device page map operations structure.
> > + */
> > +const struct dev_pagemap_ops
> > *drm_gpusvm_pagemap_ops_get(void)
> > +{
> > +	return &drm_gpusvm_pagemap_ops;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_has_mapping - Check if GPU SVM has mapping for
> > the given address range
> > + * @gpusvm: Pointer to the GPU SVM structure.
> > + * @start: Start address
> > + * @end: End address
> > + *
> > + * Returns:
> > + * True if GPU SVM has mapping, False otherwise
> > + */
> > +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64
> > start, u64 end)
> > +{
> > +	struct drm_gpusvm_notifier *notifier;
> > +
> > +	drm_gpusvm_for_each_notifier(notifier, gpusvm, start, end)
> > {
> > +		struct drm_gpusvm_range *range = NULL;
> > +
> > +		drm_gpusvm_for_each_range(range, notifier, start,
> > end)
> > +			return true;
> > +	}
> > +
> > +	return false;
> > +}
> > diff --git a/drivers/gpu/drm/xe/drm_gpusvm.h
> > b/drivers/gpu/drm/xe/drm_gpusvm.h
> > new file mode 100644
> > index 000000000000..0ea70f8534a8
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/drm_gpusvm.h
> > @@ -0,0 +1,415 @@
> > +/* SPDX-License-Identifier: MIT */
> > +/*
> > + * Copyright © 2024 Intel Corporation
> > + */
> > +
> > +#ifndef __DRM_GPUSVM_H__
> > +#define __DRM_GPUSVM_H__
> > +
> > +#include <linux/kref.h>
> > +#include <linux/mmu_notifier.h>
> > +#include <linux/workqueue.h>
> > +
> > +struct dev_pagemap_ops;
> > +struct drm_device;
> > +struct drm_gpusvm;
> > +struct drm_gpusvm_notifier;
> > +struct drm_gpusvm_ops;
> > +struct drm_gpusvm_range;
> > +
> > +/**
> > + * struct drm_gpusvm_ops - Operations structure for GPU SVM
> > + *
> > + * This structure defines the operations for GPU Shared Virtual
> > Memory (SVM).
> > + * These operations are provided by the GPU driver to manage SVM
> > ranges and
> > + * perform operations such as migration between VRAM and system
> > RAM.
> > + */
> > +struct drm_gpusvm_ops {
> > +	/**
> > +	 * @notifier_alloc: Allocate a GPU SVM notifier (optional)
> > +	 *
> > +	 * This function shall allocate a GPU SVM notifier.
> > +	 *
> > +	 * Returns:
> > +	 * Pointer to the allocated GPU SVM notifier on success, NULL
> > on failure.
> > +	 */
> > +	struct drm_gpusvm_notifier *(*notifier_alloc)(void);
> > +
> > +	/**
> > +	 * @notifier_free: Free a GPU SVM notifier (optional)
> > +	 * @notifier: Pointer to the GPU SVM notifier to be freed
> > +	 *
> > +	 * This function shall free a GPU SVM notifier.
> > +	 */
> > +	void (*notifier_free)(struct drm_gpusvm_notifier *notifier);
> > +
> > +	/**
> > +	 * @range_alloc: Allocate a GPU SVM range (optional)
> > +	 * @gpusvm: Pointer to the GPU SVM
> > +	 *
> > +	 * This function shall allocate a GPU SVM range.
> > +	 *
> > +	 * Returns:
> > +	 * Pointer to the allocated GPU SVM range on success, NULL
> > on failure.
> > +	 */
> > +	struct drm_gpusvm_range *(*range_alloc)(struct
> > drm_gpusvm *gpusvm);
> > +
> > +	/**
> > +	 * @range_free: Free a GPU SVM range (optional)
> > +	 * @range: Pointer to the GPU SVM range to be freed
> > +	 *
> > +	 * This function shall free a GPU SVM range.
> > +	 */
> > +	void (*range_free)(struct drm_gpusvm_range *range);
> > +
> > +	/**
> > +	 * @vram_release: Release VRAM allocation (optional)
> > +	 * @vram_allocation: Driver-private pointer to the VRAM
> > allocation
> > +	 *
> > +	 * This function shall release VRAM allocation and expects to
> > drop a
> > +	 * reference to VRAM allocation.
> > +	 */
> > +	void (*vram_release)(void *vram_allocation);
> > +
> > +	/**
> > +	 * @populate_vram_pfn: Populate VRAM PFN (required for
> > migration)
> > +	 * @gpusvm: Pointer to the GPU SVM
> > +	 * @vram_allocation: Driver-private pointer to the VRAM
> > allocation
> > +	 * @npages: Number of pages to populate
> > +	 * @pfn: Array of page frame numbers to populate
> > +	 *
> > +	 * This function shall populate VRAM page frame numbers
> > (PFN).
> > +	 *
> > +	 * Returns:
> > +	 * 0 on success, a negative error code on failure.
> > +	 */
> > +	int (*populate_vram_pfn)(struct drm_gpusvm *gpusvm,
> > +				 void *vram_allocation,
> > +				 unsigned long npages,
> > +				 unsigned long *pfn);
> > +
> > +	/**
> > +	 * @copy_to_vram: Copy to VRAM (required for migration)
> > +	 * @gpusvm: Pointer to the GPU SVM
> > +	 * @pages: Pointer to array of VRAM pages (destination)
> > +	 * @dma_addr: Pointer to array of DMA addresses (source)
> > +	 * @npages: Number of pages to copy
> > +	 *
> > +	 * This function shall copy pages to VRAM.
> > +	 *
> > +	 * Returns:
> > +	 * 0 on success, a negative error code on failure.
> > +	 */
> > +	int (*copy_to_vram)(struct drm_gpusvm *gpusvm,
> > +			    struct page **pages,
> > +			    dma_addr_t *dma_addr,
> > +			    unsigned long npages);
> > +
> > +	/**
> > +	 * @copy_to_sram: Copy to system RAM (required for
> > migration)
> > +	 * @gpusvm: Pointer to the GPU SVM
> > +	 * @pages: Pointer to array of VRAM pages (source)
> > +	 * @dma_addr: Pointer to array of DMA addresses
> > (destination)
> > +	 * @npages: Number of pages to copy
> > +	 *
> > +	 * This function shall copy pages to system RAM.
> > +	 *
> > +	 * Returns:
> > +	 * 0 on success, a negative error code on failure.
> > +	 */
> > +	int (*copy_to_sram)(struct drm_gpusvm *gpusvm,
> > +			    struct page **pages,
> > +			    dma_addr_t *dma_addr,
> > +			    unsigned long npages);
> > +
> > +	/**
> > +	 * @invalidate: Invalidate GPU SVM notifier (required)
> > +	 * @gpusvm: Pointer to the GPU SVM
> > +	 * @notifier: Pointer to the GPU SVM notifier
> > +	 * @mmu_range: Pointer to the mmu_notifier_range
> > structure
> > +	 *
> > +	 * This function shall invalidate the GPU page tables. It can
> > safely
> > +	 * walk the notifier range RB tree/list in this function. Called
> > while
> > +	 * holding the notifier lock.
> > +	 */
> > +	void (*invalidate)(struct drm_gpusvm *gpusvm,
> > +			   struct drm_gpusvm_notifier *notifier,
> > +			   const struct mmu_notifier_range
> > *mmu_range);
> > +};
> > +
> > +/**
> > + * struct drm_gpusvm_notifier - Structure representing a GPU SVM
> > notifier
> > + *
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: MMU interval notifier
> > + * @interval: Interval for the notifier
> > + * @rb: Red-black tree node for the parent GPU SVM structure
> > notifier tree
> > + * @root: Cached root node of the RB tree containing ranges
> > + * @range_list: List head of ranges in the same order they appear in the
> > + *              interval tree. This is useful to keep iterating over ranges
> > + *              while doing modifications to the RB tree.
> > + * @flags.removed: Flag indicating whether the MMU interval
> > notifier has been
> > + *                 removed
> > + *
> > + * This structure represents a GPU SVM notifier.
> > + */
> > +struct drm_gpusvm_notifier {
> > +	struct drm_gpusvm *gpusvm;
> > +	struct mmu_interval_notifier notifier;
> > +	struct {
> > +		u64 start;
> > +		u64 end;
> > +	} interval;
> > +	struct {
> > +		struct rb_node node;
> > +		struct list_head entry;
> > +		u64 __subtree_last;
> > +	} rb;
> > +	struct rb_root_cached root;
> > +	struct list_head range_list;
> > +	struct {
> > +		u32 removed : 1;
> > +	} flags;
> > +};
> > +
> > +/**
> > + * struct drm_gpusvm_range - Structure representing a GPU SVM
> > range
> > + *
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier
> > + * @refcount: Reference count for the range
> > + * @rb: Red-black tree node for the parent GPU SVM notifier
> > structure range tree
> > + * @va: Virtual address range
> > + * @notifier_seq: Notifier sequence number of the range's pages
> > + * @pages: Pointer to the array of pages (if backing store is in VRAM)
> > + * @dma_addr: DMA address array (if backing store is SRAM and
> > DMA mapped)
> > + * @vram_allocation: Driver-private pointer to the VRAM allocation
> > + * @order: Order of dma mapping. i.e. PAGE_SIZE << order is
> > mapping size
> > + * @flags.migrate_vram: Flag indicating whether the range can be
> > migrated to VRAM
> > + * @flags.unmapped: Flag indicating if the range has been
> > unmapped
> > + * @flags.partial_unmap: Flag indicating if the range has been
> > partially unmapped
> > + * @flags.has_vram_pages: Flag indicating if the range has vram
> > pages
> > + * @flags.has_dma_mapping: Flag indicating if the range has a DMA
> > mapping
> > + * @flags.kfree_mapping: Flag indicating @dma_addr is a compact
> > allocation based
> > + *                       on @order which releases via kfree
> > + *
> > + * This structure represents a GPU SVM range used for tracking
> > memory ranges
> > + * mapped in a DRM device.
> > + */
> > +struct drm_gpusvm_range {
> > +	struct drm_gpusvm *gpusvm;
> > +	struct drm_gpusvm_notifier *notifier;
> > +	struct kref refcount;
> > +	struct {
> > +		struct rb_node node;
> > +		struct list_head entry;
> > +		u64 __subtree_last;
> > +	} rb;
> > +	struct {
> > +		u64 start;
> > +		u64 end;
> > +	} va;
> > +	unsigned long notifier_seq;
> > +	union {
> > +		struct page **pages;
> > +		dma_addr_t *dma_addr;
> > +	};
> > +	void *vram_allocation;
> > +	u16 order;
> > +	struct {
> > +		/* All flags below must be set upon creation */
> > +		u16 migrate_vram : 1;
> > +		/* All flags below must be set / cleared under notifier
> > lock */
> > +		u16 unmapped : 1;
> > +		u16 partial_unmap : 1;
> > +		u16 has_vram_pages : 1;
> > +		u16 has_dma_mapping : 1;
> > +		u16 kfree_mapping : 1;
> > +	} flags;
> > +};
> > +
> > +/**
> > + * struct drm_gpusvm - GPU SVM structure
> > + *
> > + * @name: Name of the GPU SVM
> > + * @drm: Pointer to the DRM device structure
> > + * @mm: Pointer to the mm_struct for the address space
> > + * @device_private_page_owner: Device private pages owner
> > + * @mm_start: Start address of GPU SVM
> > + * @mm_range: Range of the GPU SVM
> > + * @notifier_size: Size of individual notifiers
> > + * @ops: Pointer to the operations structure for GPU SVM
> > + * @chunk_sizes: Pointer to the array of chunk sizes used in range
> > allocation.
> > + *               Entries should be powers of 2 in descending order.
> > + * @num_chunks: Number of chunks
> > + * @notifier_lock: Read-write semaphore for protecting notifier
> > operations
> > + * @zdd_wq: Workqueue for deferred work on zdd destruction
> > + * @root: Cached root node of the Red-Black tree containing GPU
> > SVM notifiers
> > + * @notifier_list: List head of notifiers in the same order they appear in
> > + *                 the interval tree. This is useful to keep iterating over
> > + *                 notifiers while doing modifications to the RB tree.
> > + *
> > + * This structure represents a GPU SVM (Shared Virtual Memory)
> > used for tracking
> > + * memory ranges mapped in a DRM (Direct Rendering Manager)
> > device.
> > + *
> > + * No reference counting is provided, as this is expected to be
> > embedded in the
> > + * driver VM structure along with the struct drm_gpuvm, which
> > handles reference
> > + * counting.
> > + */
> > +struct drm_gpusvm {
> > +	const char *name;
> > +	struct drm_device *drm;
> > +	struct mm_struct *mm;
> > +	void *device_private_page_owner;
> > +	u64 mm_start;
> > +	u64 mm_range;
> > +	u64 notifier_size;
> > +	const struct drm_gpusvm_ops *ops;
> > +	const u64 *chunk_sizes;
> > +	int num_chunks;
> > +	struct rw_semaphore notifier_lock;
> > +	struct workqueue_struct *zdd_wq;
> > +	struct rb_root_cached root;
> > +	struct list_head notifier_list;
> > +};
> 
> I also think the gpusvm concept is a duplication of the drm_gpuvm.
> Look at the members here, mm_start, mm_range, rb_tree...
> 
> Maintaining a list of notifiers at this layer is odd. Everybody else seems
> to embed the notifier in a range...
> 
> The mm field is essential for SVM though. I think what we can do is introduce
> an *mm field in drm_gpuvm and introduce uAPI to allow the user to say that a
> gpuvm participates in SVM. If a gpuvm participates in SVM, we set the mm field
> for that gpuvm.
> 
> Another benefit of the proposed way is that multiple gpuvms can share an
> address space with a single CPU mm process.
> 
> 
> Oak
> 
> 
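For illustration only, the drm_gpuvm extension described above might look
roughly like this; the member name and the opt-in mechanism are hypothetical
and not part of this series or of drm_gpuvm today:

	struct drm_gpuvm {
		/* ... existing drm_gpuvm members ... */

		/*
		 * CPU address space this GPU VM participates in. Only set
		 * when userspace opts the VM into SVM at creation time
		 * (e.g. via a new VM-create flag); NULL otherwise.
		 */
		struct mm_struct *mm;
	};

With such an opt-in, several gpuvms created by the same process could carry
the same mm pointer and thus share a single CPU address space.
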
> > +
> > +/**
> > + * struct drm_gpusvm_ctx - DRM GPU SVM context
> > + *
> > + * @mmap_locked: mmap lock is locked
> > + * @trylock_mmap: trylock mmap lock, used to avoid locking inversions
> > + *                (e.g., dma-resv -> mmap lock)
> > + * @in_notifier: entering from an MMU notifier
> > + * @read_only: operating on read-only memory
> > + * @vram_possible: possible to use VRAM
> > + * @prefault: prefault pages
> > + *
> > + * Context that DRM GPU SVM is operating in (i.e., user arguments).
> > + */
> > +struct drm_gpusvm_ctx {
> > +	u32 mmap_locked :1;
> > +	u32 trylock_mmap :1;
> > +	u32 in_notifier :1;
> > +	u32 read_only :1;
> > +	u32 vram_possible :1;
> > +	u32 prefault :1;
> > +};
> > +
> > +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> > +		    const char *name, struct drm_device *drm,
> > +		    struct mm_struct *mm, void
> > *device_private_page_owner,
> > +		    u64 mm_start, u64 mm_range, u64 notifier_size,
> > +		    const struct drm_gpusvm_ops *ops,
> > +		    const u64 *chunk_sizes, int num_chunks);
> > +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm);
> > +void drm_gpusvm_free(struct drm_gpusvm *gpusvm);
> > +
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm,
> > u64 fault_addr,
> > +				u64 gpuva_start, u64 gpuva_end,
> > +				const struct drm_gpusvm_ctx *ctx);
> > +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> > +			     struct drm_gpusvm_range *range);
> > +
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_get(struct drm_gpusvm_range *range);
> > +void drm_gpusvm_range_put(struct drm_gpusvm_range *range);
> > +
> > +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> > +				  struct drm_gpusvm_range *range);
> > +
> > +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> > +			       struct drm_gpusvm_range *range,
> > +			       const struct drm_gpusvm_ctx *ctx);
> > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm
> > *gpusvm,
> > +				  struct drm_gpusvm_range *range,
> > +				  const struct drm_gpusvm_ctx *ctx);
> > +
> > +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> > +			       struct drm_gpusvm_range *range,
> > +			       void *vram_allocation,
> > +			       const struct drm_gpusvm_ctx *ctx);
> > +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > +			       struct drm_gpusvm_range *range,
> > +			       const struct drm_gpusvm_ctx *ctx);
> > +
> > +const struct dev_pagemap_ops
> > *drm_gpusvm_pagemap_ops_get(void);
> > +
> > +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64
> > start, u64 end);
> > +
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64
> > start, u64 end);
> > +
> > +/**
> > + * drm_gpusvm_notifier_lock - Lock GPU SVM notifier
> > + * @gpusvm__: Pointer to the GPU SVM structure.
> > + *
> > + * Abstraction over client usage of the GPU SVM notifier lock; takes the lock.
> > + */
> > +#define drm_gpusvm_notifier_lock(gpusvm__)	\
> > +	down_read(&(gpusvm__)->notifier_lock)
> > +
> > +/**
> > + * drm_gpusvm_notifier_unlock - Unlock GPU SVM notifier
> > + * @gpusvm__: Pointer to the GPU SVM structure.
> > + *
> > + * Abstraction over client usage of the GPU SVM notifier lock; drops the lock.
> > + */
> > +#define drm_gpusvm_notifier_unlock(gpusvm__)	\
> > +	up_read(&(gpusvm__)->notifier_lock)
> > +
> > +/**
> > + * __drm_gpusvm_range_next - Get the next GPU SVM range in the
> > list
> > + * @range: a pointer to the current GPU SVM range
> > + *
> > + * Return: A pointer to the next drm_gpusvm_range if available, or
> > NULL if the
> > + *         current range is the last one or if the input range is NULL.
> > + */
> > +static inline struct drm_gpusvm_range *
> > +__drm_gpusvm_range_next(struct drm_gpusvm_range *range)
> > +{
> > +	if (range && !list_is_last(&range->rb.entry,
> > +				   &range->notifier->range_list))
> > +		return list_next_entry(range, rb.entry);
> > +
> > +	return NULL;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_for_each_range - Iterate over GPU SVM ranges in a
> > notifier
> > + * @range__: Iterator variable for the ranges. If set, it indicates the
> > start of
> > + *	     the iterator. If NULL, call drm_gpusvm_range_find() to get
> > the range.
> > + * @notifier__: Pointer to the GPU SVM notifier
> > + * @start__: Start address of the range
> > + * @end__: End address of the range
> > + *
> > + * This macro is used to iterate over GPU SVM ranges in a notifier. It
> > is safe
> > + * to use while holding the driver SVM lock or the notifier lock.
> > + */
> > +#define drm_gpusvm_for_each_range(range__, notifier__, start__, end__)	\
> > +	for ((range__) = (range__) ?:					\
> > +	     drm_gpusvm_range_find((notifier__), (start__), (end__));	\
> > +	     (range__) && (range__->va.start < (end__));		\
> > +	     (range__) = __drm_gpusvm_range_next(range__))
> > +
> > +/**
> > + * drm_gpusvm_range_set_unmapped - Mark a GPU SVM range as
> > unmapped
> > + * @range: Pointer to the GPU SVM range structure.
> > + * @mmu_range: Pointer to the MMU notifier range structure.
> > + *
> > + * This function marks a GPU SVM range as unmapped and sets the
> > partial_unmap flag
> > + * if the range partially falls within the provided MMU notifier range.
> > + */
> > +static inline void
> > +drm_gpusvm_range_set_unmapped(struct drm_gpusvm_range
> > *range,
> > +			      const struct mmu_notifier_range
> > *mmu_range)
> > +{
> > +	lockdep_assert_held_write(&range->gpusvm->notifier_lock);
> > +
> > +	range->flags.unmapped = true;
> > +	if (range->va.start < mmu_range->start ||
> > +	    range->va.end > mmu_range->end)
> > +		range->flags.partial_unmap = true;
> > +}
> > +
> > +#endif /* __DRM_GPUSVM_H__ */
> > --
> > 2.34.1
> 

-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory
  2024-08-28  2:48 ` [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory Matthew Brost
                     ` (5 preceding siblings ...)
  2024-09-06 18:41   ` Zeng, Oak
@ 2024-09-24 10:42   ` Thomas Hellström
  2024-09-24 16:30     ` Matthew Brost
  2024-10-09 10:50   ` Thomas Hellström
  7 siblings, 1 reply; 100+ messages in thread
From: Thomas Hellström @ 2024-09-24 10:42 UTC (permalink / raw)
  To: Matthew Brost, intel-xe, dri-devel
  Cc: airlied, christian.koenig, matthew.auld, daniel

Hi, Matt,

Some random review comments on this patch I came across while looking
at multi-device.

Thanks,
Thomas


On Tue, 2024-08-27 at 19:48 -0700, Matthew Brost wrote:
> This patch introduces support for GPU Shared Virtual Memory (SVM) in
> the
> Direct Rendering Manager (DRM) subsystem. SVM allows for seamless
> sharing of memory between the CPU and GPU, enhancing performance and
> flexibility in GPU computing tasks.
> 
> The patch adds the necessary infrastructure for SVM, including data
> structures and functions for managing SVM ranges and notifiers. It
> also
> provides mechanisms for allocating, deallocating, and migrating
> memory
> regions between system RAM and GPU VRAM.
> 
> This mid-layer is largely inspired by GPUVM.

NIT: Naming. Should it be drm_svm rather than drm_gpusvm? For the
drm_gpuvm component, gpuvm clearly distinguished a gpu_vm from a
mm_struct but here we don't have the same need.

> 
> Cc: Dave Airlie <airlied@redhat.com>
> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: <dri-devel@lists.freedesktop.org>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/xe/Makefile     |    3 +-
>  drivers/gpu/drm/xe/drm_gpusvm.c | 2174
> +++++++++++++++++++++++++++++++
>  drivers/gpu/drm/xe/drm_gpusvm.h |  415 ++++++
>  3 files changed, 2591 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.c
>  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.h
> 
> diff --git a/drivers/gpu/drm/xe/Makefile
> b/drivers/gpu/drm/xe/Makefile
> index b9670ae09a9e..b8fc2ee58f1a 100644
> --- a/drivers/gpu/drm/xe/Makefile
> +++ b/drivers/gpu/drm/xe/Makefile
> @@ -25,7 +25,8 @@ $(obj)/generated/%_wa_oob.c
> $(obj)/generated/%_wa_oob.h: $(obj)/xe_gen_wa_oob \
>  
>  # core driver code
>  
> -xe-y += xe_bb.o \
> +xe-y += drm_gpusvm.o \
> +	xe_bb.o \
>  	xe_bo.o \
>  	xe_bo_evict.o \
>  	xe_devcoredump.o \
> diff --git a/drivers/gpu/drm/xe/drm_gpusvm.c
> b/drivers/gpu/drm/xe/drm_gpusvm.c
> new file mode 100644
> index 000000000000..fc1e44e6ae72
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/drm_gpusvm.c
> @@ -0,0 +1,2174 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2024 Intel Corporation
> + *
> + * Authors:
> + *     Matthew Brost <matthew.brost@intel.com>
> + */
> +
> +#include <linux/dma-mapping.h>
> +#include <linux/interval_tree_generic.h>
> +#include <linux/hmm.h>
> +#include <linux/memremap.h>
> +#include <linux/migrate.h>
> +#include <linux/mm_types.h>
> +#include <linux/pagemap.h>
> +#include <linux/slab.h>
> +
> +#include <drm/drm_device.h>
> +#include "drm_gpusvm.h"
> +
> +/**
> + * DOC: Overview
> + *
> + * GPU Shared Virtual Memory (GPU SVM) layer for the Direct
> Rendering Manager (DRM)
> + *
> + * The GPU SVM layer is a component of the DRM framework designed to
> manage shared
> + * virtual memory between the CPU and GPU. It enables efficient data
> exchange and
> + * processing for GPU-accelerated applications by allowing memory
> sharing and
> + * synchronization between the CPU's and GPU's virtual address
> spaces.
> + *
> + * Key GPU SVM Components:
> + * - Notifiers: Used for tracking memory intervals and notifying the
> + *		GPU of changes. Notifiers are sized based on a GPU SVM
> + *		initialization parameter, with a recommendation of 512M or
> + *		larger. They maintain a Red-Black tree and a list of ranges
> + *		that fall within the notifier interval. Notifiers are tracked
> + *		within a GPU SVM Red-Black tree and list and are dynamically
> + *		inserted or removed as ranges within the interval are created
> + *		or destroyed.
> + * - Ranges: Represent memory ranges mapped in a DRM device and
> managed
> + *	     by GPU SVM. They are sized based on an array of chunk
> sizes, which
> + *	     is a GPU SVM initialization parameter, and the CPU
> address space.
> + *	     Upon GPU fault, the largest aligned chunk that fits
> within the
> + *	     faulting CPU address space is chosen for the range
> size. Ranges are
> + *	     expected to be dynamically allocated on GPU fault and
> removed on an
> + *	     MMU notifier UNMAP event. As mentioned above, ranges
> are tracked in
> + *	     a notifier's Red-Black tree.
> + * - Operations: Define the interface for driver-specific SVM
> operations such as
> + *		 allocation, page collection, migration,
> invalidations, and VRAM
> + *		 release.
> + *
> + * This layer provides interfaces for allocating, mapping,
> migrating, and
> + * releasing memory ranges between the CPU and GPU. It handles all
> core memory
> + * management interactions (DMA mapping, HMM, and migration) and
> provides
> + * driver-specific virtual functions (vfuncs). This infrastructure
> is sufficient
> + * to build the expected driver components for an SVM implementation
> as detailed
> + * below.
> + *
> + * Expected Driver Components:
> + * - GPU page fault handler: Used to create ranges and notifiers
> based on the
> + *			     fault address, optionally migrate the
> range to
> + *			     VRAM, and create GPU bindings.
> + * - Garbage collector: Used to destroy GPU bindings for ranges.
> Ranges are
> + *			expected to be added to the garbage
> collector upon
> + *			MMU_NOTIFY_UNMAP event.
> + */
> +
> +/**
> + * DOC: Locking
> + *
> + * GPU SVM handles locking for core MM interactions, i.e., it
> locks/unlocks the
> + * mmap lock as needed. Alternatively, if the driver prefers to
> handle the mmap
> + * lock itself, a 'locked' argument is provided to the functions
> that require
> + * the mmap lock. This option may be useful for drivers that need to
> call into
> + * GPU SVM while also holding a dma-resv lock, thus preventing
> locking
> + * inversions between the mmap and dma-resv locks.
> + *
> + * GPU SVM introduces a global notifier lock, which safeguards the
> notifier's
> + * range RB tree and list, as well as the range's DMA mappings and
> sequence
> + * number. GPU SVM manages all necessary locking and unlocking
> operations,
> + * except for the recheck of the range's sequence number
> + * (mmu_interval_read_retry) when the driver is committing GPU
> bindings. This
> + * lock corresponds to the 'driver->update' lock mentioned in the
> HMM
> + * documentation (TODO: Link). Future revisions may transition from
> a GPU SVM
> + * global lock to a per-notifier lock if finer-grained locking is
> deemed
> + * necessary.
> + *
> + * In addition to the locking mentioned above, the driver should
> implement a
> + * lock to safeguard core GPU SVM function calls that modify state,
> such as
> + * drm_gpusvm_range_find_or_insert and drm_gpusvm_range_remove.
> Alternatively,
> + * these core functions can be called within a single kernel thread,
> for
> + * instance, using an ordered work queue. This lock is denoted as
> + * 'driver_svm_lock' in code examples.
> + */
> +
> +/**
> + * DOC: Migration
> + *
> + * The migration support is quite simple, allowing migration between
> SRAM and
> + * VRAM at the range granularity. For example, GPU SVM currently
> does not
> + * support mixing SRAM and VRAM pages within a range. This means
> that upon GPU
> + * fault, the entire range can be migrated to VRAM, and upon CPU
> fault, the
> + * entire range is migrated to SRAM.
> + *
> + * The reasoning for only supporting range granularity is as
> follows: it
> + * simplifies the implementation, and range sizes are driver-defined
> and should
> + * be relatively small.
> + */
> +
> +/**
> + * DOC: Partial Unmapping of Ranges
> + *
> + * Partial unmapping of ranges (e.g., 1M out of 2M is unmapped by
> CPU resulting
> + * in MMU_NOTIFY_UNMAP event) presents several challenges, with the
> main one
> + * being that a subset of the range still has CPU and GPU mappings.
> If the
> + * backing store for the range is in VRAM, a subset of the backing
> store has
> + * references. One option would be to split the range and VRAM
> backing store,
> + * but the implementation for this would be quite complicated. Given
> that
> + * partial unmappings are rare and driver-defined range sizes are
> relatively
> + * small, GPU SVM does not support splitting of ranges.
> + *
> + * With no support for range splitting, upon partial unmapping of a
> range, the
> + * driver is expected to invalidate and destroy the entire range. If
> the range
> + * has VRAM as its backing, the driver is also expected to migrate
> any remaining
> + * pages back to SRAM.
> + */
> +
> +/**
> + * DOC: Examples
> + *
> + * This section provides two examples of how to build the expected
> driver
> + * components: the GPU page fault handler and the garbage collector.
> A third
> + * example demonstrates a sample invalidation driver vfunc.
> + *
> + * The generic code provided does not include logic for complex
> migration
> + * policies, optimized invalidations, or other potentially required
> driver
> + * locking (e.g., DMA-resv locks).
> + *
> + * 1) GPU page fault handler
> + *
> + *	int driver_bind_range(struct drm_gpusvm *gpusvm, struct
> drm_gpusvm_range *range)
> + *	{
> + *		int err = 0;
> + *
> + *		driver_alloc_and_setup_memory_for_bind(gpusvm,
> range);
> + *
> + *		drm_gpusvm_notifier_lock(gpusvm);
> + *		if (drm_gpusvm_range_pages_valid(range))
> + *			driver_commit_bind(gpusvm, range);
> + *		else
> + *			err = -EAGAIN;
> + *		drm_gpusvm_notifier_unlock(gpusvm);
> + *
> + *		return err;
> + *	}
> + *
> + *	int driver_gpu_fault(struct drm_gpusvm *gpusvm, u64
> fault_addr,
> + *			     u64 gpuva_start, u64 gpuva_end)
> + *	{
> + *		struct drm_gpusvm_ctx ctx = {};
> + *		int err;
> + *
> + *		driver_svm_lock();
> + *	retry:
> + *		// Always process UNMAPs first so view of GPU SVM
> ranges is current
> + *		driver_garbage_collector(gpusvm);
> + *
> + *		range = drm_gpusvm_range_find_or_insert(gpusvm,
> fault_addr,
> + *							gpuva_start,
> gpuva_end,
> + *						        &ctx);
> + *		if (IS_ERR(range)) {
> + *			err = PTR_ERR(range);
> + *			goto unlock;
> + *		}
> + *
> + *		if (driver_migration_policy(range)) {
> + *			bo = driver_alloc_bo();
> + *			err = drm_gpusvm_migrate_to_vram(gpusvm,
> range, bo, &ctx);
> + *			if (err)	// CPU mappings may have
> changed
> + *				goto retry;
> + *		}
> + *
> + *		err = drm_gpusvm_range_get_pages(gpusvm, range,
> &ctx);
> + *		if (err == -EFAULT || err == -EPERM)	// CPU
> mappings changed
> + *			goto retry;
> + *		else if (err)
> + *			goto unlock;
> + *
> + *		err = driver_bind_range(gpusvm, range);
> + *		if (err == -EAGAIN)	// CPU mappings changed
> + *			goto retry
> + *
> + *	unlock:
> + *		driver_svm_unlock();
> + *		return err;
> + *	}
> + *
> + * 2) Garbage Collector.
> + *
> + *	void __driver_garbage_collector(struct drm_gpusvm *gpusvm,
> + *					struct drm_gpusvm_range
> *range)
> + *	{
> + *		struct drm_gpusvm_ctx ctx = {};
> + *
> + *		assert_driver_svm_locked(gpusvm);
> + *
> + *		// Partial unmap, migrate any remaining VRAM pages
> back to SRAM
> + *		if (range->flags.partial_unmap)
> + *			drm_gpusvm_migrate_to_sram(gpusvm, range,
> &ctx);
> + *
> + *		driver_unbind_range(range);
> + *		drm_gpusvm_range_remove(gpusvm, range);
> + *	}
> + *
> + *	void driver_garbage_collector(struct drm_gpusvm *gpusvm)
> + *	{
> + *		assert_driver_svm_locked(gpusvm);
> + *
> + *		for_each_range_in_garbage_collector(gpusvm, range)
> + *			__driver_garbage_collector(gpusvm, range);
> + *	}
> + *
> + * 3) Invalidation driver vfunc.
> + *
> + *	void driver_invalidation(struct drm_gpusvm *gpusvm,
> + *				 struct drm_gpusvm_notifier
> *notifier,
> + *				 const struct mmu_notifier_range
> *mmu_range)
> + *	{
> + *		struct drm_gpusvm_ctx ctx = { .in_notifier = true,
> };
> + *		struct drm_gpusvm_range *range = NULL;
> + *
> + *		driver_invalidate_device_tlb(gpusvm, mmu_range->start,
> + *					     mmu_range->end);
> + *
> + *		drm_gpusvm_for_each_range(range, notifier,
> mmu_range->start,
> + *					  mmu_range->end) {
> + *			drm_gpusvm_range_unmap_pages(gpusvm, range,
> &ctx);
> + *
> + *			if (mmu_range->event != MMU_NOTIFY_UNMAP)
> + *				continue;
> + *
> + *			drm_gpusvm_range_set_unmapped(range,
> mmu_range);
> + *			driver_garbage_collector_add(gpusvm, range);
> + *		}
> + *	}
> + */
> +
> +#define DRM_GPUSVM_RANGE_START(_range)	((_range)->va.start)
> +#define DRM_GPUSVM_RANGE_END(_range)	((_range)->va.end - 1)
> +INTERVAL_TREE_DEFINE(struct drm_gpusvm_range, rb.node, u64,
> rb.__subtree_last,
> +		     DRM_GPUSVM_RANGE_START, DRM_GPUSVM_RANGE_END,
> +		     static __maybe_unused, range);
> +
> +#define DRM_GPUSVM_NOTIFIER_START(_notifier)	((_notifier)->interval.start)
> +#define DRM_GPUSVM_NOTIFIER_END(_notifier)	((_notifier)->interval.end - 1)
> +INTERVAL_TREE_DEFINE(struct drm_gpusvm_notifier, rb.node, u64,
> +		     rb.__subtree_last, DRM_GPUSVM_NOTIFIER_START,
> +		     DRM_GPUSVM_NOTIFIER_END, static __maybe_unused,
> notifier);
> +

Since these trees span the struct mm_struct address space, which should fit
in an unsigned long, can we use the generic version (interval_tree.h)
rather than instantiating two new versions? I figure both may contain
overlapping ranges so we can't use maple trees?
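
For reference, a minimal, untested sketch of what the generic helpers could
look like here (names are illustrative only):

	#include <linux/interval_tree.h>

	struct drm_gpusvm_range {
		/* ... */
		struct interval_tree_node itree;  /* start/last of the VA range */
		/* ... */
	};

	static void drm_gpusvm_range_tree_insert(struct drm_gpusvm_notifier *notifier,
						 struct drm_gpusvm_range *range)
	{
		range->itree.start = range->va.start;
		range->itree.last = range->va.end - 1;
		interval_tree_insert(&range->itree, &notifier->root);
	}

	static struct drm_gpusvm_range *
	drm_gpusvm_range_tree_first(struct drm_gpusvm_notifier *notifier,
				    u64 start, u64 end)
	{
		struct interval_tree_node *node =
			interval_tree_iter_first(&notifier->root, start, end - 1);

		return node ? container_of(node, struct drm_gpusvm_range, itree) : NULL;
	}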

> +/**
> + * npages_in_range() - Calculate the number of pages in a given
> range
> + * @start__: The start address of the range
> + * @end__: The end address of the range
> + *
> + * This macro calculates the number of pages in a given memory
> range,
> + * specified by the start and end addresses. It divides the
> difference
> + * between the end and start addresses by the page size (PAGE_SIZE)
> to
> + * determine the number of pages in the range.
> + *
> + * Return: The number of pages in the specified range.
> + */
> +#define npages_in_range(start__, end__)	\
> +	(((end__) - (start__)) >> PAGE_SHIFT)
> +
> +/**
> + * struct drm_gpusvm_zdd - GPU SVM zone device data
> + *
> + * @refcount: Reference count for the zdd
> + * @destroy_work: Work structure for asynchronous zdd destruction
> + * @range: Pointer to the GPU SVM range
> + * @vram_allocation: Driver-private pointer to the VRAM allocation
> + *
> + * This structure serves as a generic wrapper installed in
> + * page->zone_device_data. It provides infrastructure for looking up
> a range
> + * upon CPU page fault and asynchronously releasing VRAM once the
> CPU has no
> + * page references. Asynchronous release is useful because CPU page
> references
> + * can be dropped in IRQ contexts, while releasing VRAM likely
> requires sleeping
> + * locks.
> + */
> +struct drm_gpusvm_zdd {
> +	struct kref refcount;
> +	struct work_struct destroy_work;
> +	struct drm_gpusvm_range *range;
 
I still believe previous review comments are valid here, considering we
do have multiple drm_gpusvm per struct mm_struct, potentially all
mapping the above page.

> +	void *vram_allocation;

NIT: Naming. The core uses "device memory" or "devmem". Should we
follow suit?

Also, rather than using a void *, could we use an embeddable struct with its
own ops instead of going through the gpusvm-wide ops for this? A rough sketch
follows below.

> +};
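
For illustration, the embeddable struct suggested above might look roughly
like this (hypothetical sketch, names made up; not part of this series):

	struct drm_gpusvm_devmem;

	struct drm_gpusvm_devmem_ops {
		/* Called to release the allocation once CPU page refs are gone. */
		void (*release)(struct drm_gpusvm_devmem *devmem);
		/* Populate device-private PFNs backing this allocation. */
		int (*populate_pfn)(struct drm_gpusvm_devmem *devmem,
				    unsigned long npages, unsigned long *pfn);
	};

	struct drm_gpusvm_devmem {
		const struct drm_gpusvm_devmem_ops *ops;
	};

A driver would embed it, e.g. in Xe something like:

	struct xe_svm_devmem {
		struct drm_gpusvm_devmem base;
		struct xe_bo *bo;
	};

and resolve it back with container_of() in the ops callbacks, instead of
passing a void * around and funneling everything through the gpusvm ops.
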
> +
> +/**
> + * drm_gpusvm_zdd_destroy_work_func - Work function for destroying a
> zdd
> + * @w: Pointer to the work_struct
> + *
> + * This function releases VRAM, puts GPU SVM range, and frees zdd.
> + */
> +static void drm_gpusvm_zdd_destroy_work_func(struct work_struct *w)
> +{
> +	struct drm_gpusvm_zdd *zdd =
> +		container_of(w, struct drm_gpusvm_zdd,
> destroy_work);
> +	struct drm_gpusvm_range *range = zdd->range;
> +	struct drm_gpusvm *gpusvm = range->gpusvm;
> +
> +	if (gpusvm->ops->vram_release && zdd->vram_allocation)
> +		gpusvm->ops->vram_release(zdd->vram_allocation);
> +	drm_gpusvm_range_put(range);
> +	kfree(zdd);
> +}
> +
> +/**
> + * drm_gpusvm_zdd_alloc - Allocate a zdd structure.
> + * @range: Pointer to the GPU SVM range.
> + *
> + * This function allocates and initializes a new zdd structure. It
> sets up the
> + * reference count, initializes the destroy work, and links the
> provided GPU SVM
> + * range.
> + *
> + * Returns:
> + * Pointer to the allocated zdd on success, ERR_PTR() on failure.
> + */
> +static struct drm_gpusvm_zdd *
> +drm_gpusvm_zdd_alloc(struct drm_gpusvm_range *range)
> +{
> +	struct drm_gpusvm_zdd *zdd;
> +
> +	zdd = kmalloc(sizeof(*zdd), GFP_KERNEL);
> +	if (!zdd)
> +		return NULL;
> +
> +	kref_init(&zdd->refcount);
> +	INIT_WORK(&zdd->destroy_work,
> drm_gpusvm_zdd_destroy_work_func);
> +	zdd->range = drm_gpusvm_range_get(range);
> +	zdd->vram_allocation = NULL;
> +
> +	return zdd;
> +}
> +
> +/**
> + * drm_gpusvm_zdd_get - Get a reference to a zdd structure.
> + * @zdd: Pointer to the zdd structure.
> + *
> + * This function increments the reference count of the provided zdd
> structure.
> + *
> + * Returns: Pointer to the zdd structure.
> + */
> +static struct drm_gpusvm_zdd *drm_gpusvm_zdd_get(struct
> drm_gpusvm_zdd *zdd)
> +{
> +	kref_get(&zdd->refcount);
> +	return zdd;
> +}
> +
> +/**
> + * drm_gpusvm_zdd_destroy - Destroy a zdd structure.
> + * @ref: Pointer to the reference count structure.
> + *
> + * This function queues the destroy_work of the zdd for asynchronous
> destruction.
> + */
> +static void drm_gpusvm_zdd_destroy(struct kref *ref)
> +{
> +	struct drm_gpusvm_zdd *zdd =
> +		container_of(ref, struct drm_gpusvm_zdd, refcount);
> +	struct drm_gpusvm *gpusvm = zdd->range->gpusvm;
> +
> +	queue_work(gpusvm->zdd_wq, &zdd->destroy_work);
> +}
> +
> +/**
> + * drm_gpusvm_zdd_put - Put a zdd reference.
> + * @zdd: Pointer to the zdd structure.
> + *
> + * This function decrements the reference count of the provided zdd
> structure
> + * and schedules its destruction if the count drops to zero.
> + */
> +static void drm_gpusvm_zdd_put(struct drm_gpusvm_zdd *zdd)
> +{
> +	kref_put(&zdd->refcount, drm_gpusvm_zdd_destroy);
> +}
> +
> +/**
> + * drm_gpusvm_range_find - Find GPU SVM range from GPU SVM notifier
> + * @notifier: Pointer to the GPU SVM notifier structure.
> + * @start: Start address of the range
> + * @end: End address of the range
> + *
> + * Return: A pointer to the drm_gpusvm_range if found or NULL
> + */
> +struct drm_gpusvm_range *
> +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64
> start, u64 end)
> +{
> +	return range_iter_first(&notifier->root, start, end - 1);
> +}
> +
> +/**
> + * drm_gpusvm_for_each_range_safe - Safely iterate over GPU SVM
> ranges in a notifier
> + * @range__: Iterator variable for the ranges
> + * @next__: Iterator variable for the ranges' temporary storage
> + * @notifier__: Pointer to the GPU SVM notifier
> + * @start__: Start address of the range
> + * @end__: End address of the range
> + *
> + * This macro is used to iterate over GPU SVM ranges in a notifier
> while
> + * removing ranges from it.
> + */
> +#define drm_gpusvm_for_each_range_safe(range__, next__, notifier__,
> start__, end__)	\
> +	for ((range__) = drm_gpusvm_range_find((notifier__),
> (start__), (end__)),	\
> +	     (next__) =
> __drm_gpusvm_range_next(range__);				\
> +	     (range__) && (range__->va.start <
> (end__));				\
> +	     (range__) = (next__), (next__) =
> __drm_gpusvm_range_next(range__))
> +
> +/**
> + * __drm_gpusvm_notifier_next - get the next drm_gpusvm_notifier in
> the list
> + * @notifier: a pointer to the current drm_gpusvm_notifier
> + *
> + * Return: A pointer to the next drm_gpusvm_notifier if available,
> or NULL if
> + *         the current notifier is the last one or if the input
> notifier is
> + *         NULL.
> + */
> +static struct drm_gpusvm_notifier *
> +__drm_gpusvm_notifier_next(struct drm_gpusvm_notifier *notifier)
> +{
> +	if (notifier && !list_is_last(&notifier->rb.entry,
> +				      &notifier->gpusvm->notifier_list))
> +		return list_next_entry(notifier, rb.entry);
> +
> +	return NULL;
> +}
> +
> +/**
> + * drm_gpusvm_for_each_notifier - Iterate over GPU SVM notifiers in
> a gpusvm
> + * @notifier__: Iterator variable for the notifiers
> + * @gpusvm__: Pointer to the GPU SVM structure
> + * @start__: Start address of the notifier
> + * @end__: End address of the notifier
> + *
> + * This macro is used to iterate over GPU SVM notifiers in a gpusvm.
> + */
> +#define drm_gpusvm_for_each_notifier(notifier__, gpusvm__, start__,
> end__)		\
> +	for ((notifier__) = notifier_iter_first(&(gpusvm__)->root,
> (start__), (end__) - 1);	\
> +	     (notifier__) && (notifier__->interval.start <
> (end__));			\
> +	     (notifier__) = __drm_gpusvm_notifier_next(notifier__))
> +
> +/**
> + * drm_gpusvm_for_each_notifier_safe - Safely iterate over GPU SVM
> notifiers in a gpusvm
> + * @notifier__: Iterator variable for the notifiers
> + * @next__: Iterator variable for the notifiers' temporary storage
> + * @gpusvm__: Pointer to the GPU SVM structure
> + * @start__: Start address of the notifier
> + * @end__: End address of the notifier
> + *
> + * This macro is used to iterate over GPU SVM notifiers in a gpusvm
> while
> + * removing notifiers from it.
> + */
> +#define drm_gpusvm_for_each_notifier_safe(notifier__, next__,
> gpusvm__, start__, end__)	\
> +	for ((notifier__) = notifier_iter_first(&(gpusvm__)->root,
> (start__), (end__) - 1),	\
> +	     (next__) =
> __drm_gpusvm_notifier_next(notifier__);				\
> +	     (notifier__) && (notifier__->interval.start <
> (end__));			\
> +	     (notifier__) = (next__), (next__) =
> __drm_gpusvm_notifier_next(notifier__))
> +
> +/**
> + * drm_gpusvm_notifier_invalidate - Invalidate a GPU SVM notifier.
> + * @mni: Pointer to the mmu_interval_notifier structure.
> + * @mmu_range: Pointer to the mmu_notifier_range structure.
> + * @cur_seq: Current sequence number.
> + *
> + * This function serves as a generic MMU notifier for GPU SVM. It
> sets the MMU
> + * notifier sequence number and calls the driver invalidate vfunc
> under
> + * gpusvm->notifier_lock.
> + *
> + * Returns:
> + * true if the operation succeeds, false otherwise.
> + */
> +static bool
> +drm_gpusvm_notifier_invalidate(struct mmu_interval_notifier *mni,
> +			       const struct mmu_notifier_range
> *mmu_range,
> +			       unsigned long cur_seq)
> +{
> +	struct drm_gpusvm_notifier *notifier =
> +		container_of(mni, typeof(*notifier), notifier);
> +	struct drm_gpusvm *gpusvm = notifier->gpusvm;
> +
> +	if (!mmu_notifier_range_blockable(mmu_range))
> +		return false;
> +
> +	down_write(&gpusvm->notifier_lock);
> +	mmu_interval_set_seq(mni, cur_seq);
> +	gpusvm->ops->invalidate(gpusvm, notifier, mmu_range);
> +	up_write(&gpusvm->notifier_lock);
> +
> +	return true;
> +}
> +
> +/**
> + * drm_gpusvm_notifier_ops - MMU interval notifier operations for
> GPU SVM
> + */
> +static const struct mmu_interval_notifier_ops
> drm_gpusvm_notifier_ops = {
> +	.invalidate = drm_gpusvm_notifier_invalidate,
> +};
> +
> +/**
> + * drm_gpusvm_init - Initialize the GPU SVM.
> + * @gpusvm: Pointer to the GPU SVM structure.
> + * @name: Name of the GPU SVM.
> + * @drm: Pointer to the DRM device structure.
> + * @mm: Pointer to the mm_struct for the address space.
> + * @device_private_page_owner: Device private pages owner.
> + * @mm_start: Start address of GPU SVM.
> + * @mm_range: Range of the GPU SVM.
> + * @notifier_size: Size of individual notifiers.
> + * @ops: Pointer to the operations structure for GPU SVM.
> + * @chunk_sizes: Pointer to the array of chunk sizes used in range
> allocation.
> + *               Entries should be powers of 2 in descending order
> with last
> + *               entry being SZ_4K.
> + * @num_chunks: Number of chunks.
> + *
> + * This function initializes the GPU SVM.
> + *
> + * Returns:
> + * 0 on success, a negative error code on failure.
> + */
> +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> +		    const char *name, struct drm_device *drm,
> +		    struct mm_struct *mm, void
> *device_private_page_owner,
> +		    u64 mm_start, u64 mm_range, u64 notifier_size,
> +		    const struct drm_gpusvm_ops *ops,
> +		    const u64 *chunk_sizes, int num_chunks)
> +{
> +	if (!ops->invalidate || !num_chunks)
> +		return -EINVAL;
> +
> +	gpusvm->name = name;
> +	gpusvm->drm = drm;
> +	gpusvm->mm = mm;
> +	gpusvm->device_private_page_owner =
> device_private_page_owner;
> +	gpusvm->mm_start = mm_start;
> +	gpusvm->mm_range = mm_range;
> +	gpusvm->notifier_size = notifier_size;
> +	gpusvm->ops = ops;
> +	gpusvm->chunk_sizes = chunk_sizes;
> +	gpusvm->num_chunks = num_chunks;
> +	gpusvm->zdd_wq = system_wq;
> +
> +	mmgrab(mm);
> +	gpusvm->root = RB_ROOT_CACHED;
> +	INIT_LIST_HEAD(&gpusvm->notifier_list);
> +
> +	init_rwsem(&gpusvm->notifier_lock);
> +
> +	fs_reclaim_acquire(GFP_KERNEL);
> +	might_lock(&gpusvm->notifier_lock);
> +	fs_reclaim_release(GFP_KERNEL);
> +
> +	return 0;
> +}
> +
> +/**
> + * drm_gpusvm_notifier_find - Find GPU SVM notifier
> + * @gpusvm__: Pointer to the GPU SVM structure
> + * @fault_addr__: Fault address
> + *
> + * This macro finds the GPU SVM notifier associated with the fault
> address.
> + *
> + * Returns:
> + * Pointer to the GPU SVM notifier on success, NULL otherwise.
> + */
> +#define drm_gpusvm_notifier_find(gpusvm__, fault_addr__)	\
> +	notifier_iter_first(&(gpusvm__)->root, (fault_addr__),	\
> +			    (fault_addr__ + 1))
> +
> +/**
> + * to_drm_gpusvm_notifier - retrieve the container struct for a
> given rbtree node
> + * @node__: a pointer to the rbtree node embedded within a
> drm_gpusvm_notifier struct
> + *
> + * Return: A pointer to the containing drm_gpusvm_notifier
> structure.
> + */
> +#define to_drm_gpusvm_notifier(__node)				\
> +	container_of((__node), struct drm_gpusvm_notifier, rb.node)
> +
> +/**
> + * drm_gpusvm_notifier_insert - Insert GPU SVM notifier
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: Pointer to the GPU SVM notifier structure
> + *
> + * This function inserts the GPU SVM notifier into the GPU SVM RB
> tree and list.
> + */
> +static void drm_gpusvm_notifier_insert(struct drm_gpusvm *gpusvm,
> +				       struct drm_gpusvm_notifier
> *notifier)
> +{
> +	struct rb_node *node;
> +	struct list_head *head;
> +
> +	notifier_insert(notifier, &gpusvm->root);
> +
> +	node = rb_prev(&notifier->rb.node);
> +	if (node)
> +		head = &(to_drm_gpusvm_notifier(node))->rb.entry;
> +	else
> +		head = &gpusvm->notifier_list;
> +
> +	list_add(&notifier->rb.entry, head);
> +}
> +
> +/**
> + * drm_gpusvm_notifier_remove - Remove GPU SVM notifier
> + * @gpusvm__: Pointer to the GPU SVM structure
> + * @notifier__: Pointer to the GPU SVM notifier structure
> + *
> + * This macro removes the GPU SVM notifier from the GPU SVM RB tree
> and list.
> + */
> +#define drm_gpusvm_notifier_remove(gpusvm__, notifier__)	\
> +	notifier_remove((notifier__), &(gpusvm__)->root);	\
> +	list_del(&(notifier__)->rb.entry)
> +
> +/**
> + * drm_gpusvm_fini - Finalize the GPU SVM.
> + * @gpusvm: Pointer to the GPU SVM structure.
> + *
> + * This function finalizes the GPU SVM by cleaning up any remaining
> ranges and
> + * notifiers, and dropping a reference to struct MM.
> + */
> +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm)
> +{
> +	struct drm_gpusvm_notifier *notifier, *next;
> +
> +	drm_gpusvm_for_each_notifier_safe(notifier, next, gpusvm, 0,
> LONG_MAX) {
> +		struct drm_gpusvm_range *range, *__next;
> +
> +		/*
> +		 * Remove notifier first to avoid racing with any
> invalidation
> +		 */
> +		mmu_interval_notifier_remove(&notifier->notifier);
> +		notifier->flags.removed = true;
> +
> +		drm_gpusvm_for_each_range_safe(range, __next,
> notifier, 0,
> +					       LONG_MAX)
> +			drm_gpusvm_range_remove(gpusvm, range);
> +	}
> +
> +	mmdrop(gpusvm->mm);
> +	WARN_ON(!RB_EMPTY_ROOT(&gpusvm->root.rb_root));
> +}
> +
> +/**
> + * drm_gpusvm_notifier_alloc - Allocate GPU SVM notifier
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @fault_addr: Fault address
> + *
> + * This function allocates and initializes the GPU SVM notifier
> structure.
> + *
> + * Returns:
> + * Pointer to the allocated GPU SVM notifier on success, ERR_PTR()
> on failure.
> + */
> +static struct drm_gpusvm_notifier *
> +drm_gpusvm_notifier_alloc(struct drm_gpusvm *gpusvm, u64 fault_addr)
> +{
> +	struct drm_gpusvm_notifier *notifier;
> +
> +	if (gpusvm->ops->notifier_alloc)
> +		notifier = gpusvm->ops->notifier_alloc();
> +	else
> +		notifier = kzalloc(sizeof(*notifier), GFP_KERNEL);
> +
> +	if (!notifier)
> +		return ERR_PTR(-ENOMEM);
> +
> +	notifier->gpusvm = gpusvm;
> +	notifier->interval.start = ALIGN_DOWN(fault_addr, gpusvm-
> >notifier_size);
> +	notifier->interval.end = ALIGN(fault_addr + 1, gpusvm-
> >notifier_size);
> +	INIT_LIST_HEAD(&notifier->rb.entry);
> +	notifier->root = RB_ROOT_CACHED;
> +	INIT_LIST_HEAD(&notifier->range_list);
> +
> +	return notifier;
> +}
> +
> +/**
> + * drm_gpusvm_notifier_free - Free GPU SVM notifier
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: Pointer to the GPU SVM notifier structure
> + *
> + * This function frees the GPU SVM notifier structure.
> + */
> +static void drm_gpusvm_notifier_free(struct drm_gpusvm *gpusvm,
> +				     struct drm_gpusvm_notifier
> *notifier)
> +{
> +	WARN_ON(!RB_EMPTY_ROOT(&notifier->root.rb_root));
> +
> +	if (gpusvm->ops->notifier_free)
> +		gpusvm->ops->notifier_free(notifier);
> +	else
> +		kfree(notifier);
> +}
> +
> +/**
> + * to_drm_gpusvm_range - retrieve the container struct for a given
> rbtree node
> + * @node__: a pointer to the rbtree node embedded within a
> drm_gpusvm_range struct
> + *
> + * Return: A pointer to the containing drm_gpusvm_range structure.
> + */
> +#define to_drm_gpusvm_range(node__)	\
> +	container_of((node__), struct drm_gpusvm_range, rb.node)
> +
> +/**
> + * drm_gpusvm_range_insert - Insert GPU SVM range
> + * @notifier: Pointer to the GPU SVM notifier structure
> + * @range: Pointer to the GPU SVM range structure
> + *
> + * This function inserts the GPU SVM range into the notifier RB tree
> and list.
> + */
> +static void drm_gpusvm_range_insert(struct drm_gpusvm_notifier
> *notifier,
> +				    struct drm_gpusvm_range *range)
> +{
> +	struct rb_node *node;
> +	struct list_head *head;
> +
> +	drm_gpusvm_notifier_lock(notifier->gpusvm);
> +	range_insert(range, &notifier->root);
> +
> +	node = rb_prev(&range->rb.node);
> +	if (node)
> +		head = &(to_drm_gpusvm_range(node))->rb.entry;
> +	else
> +		head = &notifier->range_list;
> +
> +	list_add(&range->rb.entry, head);
> +	drm_gpusvm_notifier_unlock(notifier->gpusvm);
> +}
> +
> +/**
> + * __drm_gpusvm_range_remove - Remove GPU SVM range
> + * @notifier__: Pointer to the GPU SVM notifier structure
> + * @range__: Pointer to the GPU SVM range structure
> + *
> + * This macro removes the GPU SVM range from the notifier RB tree
> and list.
> + */
> +#define __drm_gpusvm_range_remove(notifier__, range__)		\
> +	range_remove((range__), &(notifier__)->root);		\
> +	list_del(&(range__)->rb.entry)
> +
> +/**
> + * drm_gpusvm_range_alloc - Allocate GPU SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: Pointer to the GPU SVM notifier structure
> + * @fault_addr: Fault address
> + * @chunk_size: Chunk size
> + * @migrate_vram: Flag indicating whether to migrate VRAM
> + *
> + * This function allocates and initializes the GPU SVM range
> structure.
> + *
> + * Returns:
> + * Pointer to the allocated GPU SVM range on success, ERR_PTR() on
> failure.
> + */
> +static struct drm_gpusvm_range *
> +drm_gpusvm_range_alloc(struct drm_gpusvm *gpusvm,
> +		       struct drm_gpusvm_notifier *notifier,
> +		       u64 fault_addr, u64 chunk_size, bool
> migrate_vram)
> +{
> +	struct drm_gpusvm_range *range;
> +
> +	if (gpusvm->ops->range_alloc)
> +		range = gpusvm->ops->range_alloc(gpusvm);
> +	else
> +		range = kzalloc(sizeof(*range), GFP_KERNEL);
> +
> +	if (!range)
> +		return ERR_PTR(-ENOMEM);
> +
> +	kref_init(&range->refcount);
> +	range->gpusvm = gpusvm;
> +	range->notifier = notifier;
> +	range->va.start = ALIGN_DOWN(fault_addr, chunk_size);
> +	range->va.end = ALIGN(fault_addr + 1, chunk_size);
> +	INIT_LIST_HEAD(&range->rb.entry);
> +	range->notifier_seq = LONG_MAX;
> +	range->flags.migrate_vram = migrate_vram ? 1 : 0;
> +
> +	return range;
> +}
> +
> +/**
> + * drm_gpusvm_check_pages - Check pages
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: Pointer to the GPU SVM notifier structure
> + * @start: Start address
> + * @end: End address
> + *
> + * Check if pages between start and end have been faulted in on the
> CPU. Used to
> + * prevent migration of pages without CPU backing store.
> + *
> + * Returns:
> + * True if pages have been faulted into CPU, False otherwise
> + */
> +static bool drm_gpusvm_check_pages(struct drm_gpusvm *gpusvm,
> +				   struct drm_gpusvm_notifier
> *notifier,
> +				   u64 start, u64 end)
> +{
> +	struct hmm_range hmm_range = {
> +		.default_flags = 0,
> +		.notifier = &notifier->notifier,
> +		.start = start,
> +		.end = end,
> +		.dev_private_owner = gpusvm-
> >device_private_page_owner,
> +	};
> +	unsigned long timeout =
> +		jiffies +
> msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> +	unsigned long *pfns;
> +	unsigned long npages = npages_in_range(start, end);
> +	int err, i;
> +
> +	mmap_assert_locked(gpusvm->mm);
> +
> +	pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
> +	if (!pfns)
> +		return false;
> +
> +	hmm_range.notifier_seq = mmu_interval_read_begin(&notifier-
> >notifier);
> +	hmm_range.hmm_pfns = pfns;
> +
> +	while (true) {
> +		err = hmm_range_fault(&hmm_range);
> +		if (err == -EBUSY) {
> +			if (time_after(jiffies, timeout))
> +				break;
> +
> +			hmm_range.notifier_seq =
> mmu_interval_read_begin(&notifier->notifier);
> +			continue;
> +		}
> +		break;
> +	}
> +	if (err)
> +		goto err_free;
> +
> +	for (i = 0; i < npages; ++i) {
> +		if (!(pfns[i] & HMM_PFN_VALID)) {
> +			err = -EFAULT;
> +			goto err_free;
> +		}
> +	}
> +
> +err_free:
> +	kvfree(pfns);
> +	return err ? false : true;
> +}
> +
> +/**
> + * drm_gpusvm_range_chunk_size - Determine chunk size for GPU SVM
> range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: Pointer to the GPU SVM notifier structure
> + * @vas: Pointer to the virtual memory area structure
> + * @fault_addr: Fault address
> + * @gpuva_start: Start address of GPUVA which mirrors CPU
> + * @gpuva_end: End address of GPUVA which mirrors CPU
> + * @check_pages: Flag indicating whether to check pages
> + *
> + * This function determines the chunk size for the GPU SVM range
> based on the
> + * fault address, GPU SVM chunk sizes, existing GPU SVM ranges, and
> the virtual
> + * memory area boundaries.
> + *
> + * Returns:
> + * Chunk size on success, LONG_MAX on failure.
> + */
> +static u64 drm_gpusvm_range_chunk_size(struct drm_gpusvm *gpusvm,
> +				       struct drm_gpusvm_notifier
> *notifier,
> +				       struct vm_area_struct *vas,
> +				       u64 fault_addr, u64
> gpuva_start,
> +				       u64 gpuva_end, bool
> check_pages)
> +{
> +	u64 start, end;
> +	int i = 0;
> +
> +retry:
> +	for (; i < gpusvm->num_chunks; ++i) {
> +		start = ALIGN_DOWN(fault_addr, gpusvm-
> >chunk_sizes[i]);
> +		end = ALIGN(fault_addr + 1, gpusvm->chunk_sizes[i]);
> +
> +		if (start >= vas->vm_start && end <= vas->vm_end &&
> +		    start >= notifier->interval.start &&
> +		    end <= notifier->interval.end &&
> +		    start >= gpuva_start && end <= gpuva_end)
> +			break;
> +	}
> +
> +	if (i == gpusvm->num_chunks)
> +		return LONG_MAX;
> +
> +	/*
> +	 * If allocating more than a page, ensure not to overlap with
> existing
> +	 * ranges.
> +	 */
> +	if (end - start != SZ_4K) {
> +		struct drm_gpusvm_range *range;
> +
> +		range = drm_gpusvm_range_find(notifier, start, end);
> +		if (range) {
> +			++i;
> +			goto retry;
> +		}
> +
> +		/*
> +		 * XXX: Only create range on pages CPU has faulted
> in. Without
> +		 * this check, or prefault, on BMG
> 'xe_exec_system_allocator --r
> +		 * process-many-malloc' fails. In the failure case,
> each process
> +		 * mallocs 16k but the CPU VMA is ~128k which
> results in 64k SVM
> +		 * ranges. When migrating the SVM ranges, some
> processes fail in
> +		 * drm_gpusvm_migrate_to_vram with 'migrate.cpages
> != npages'
> +		 * and then upon drm_gpusvm_range_get_pages device
> pages from
> +		 * other processes are collected + faulted in which
> creates all
> +		 * sorts of problems. Unsure exactly how this is happening;
> +		 * the problem also goes away if 'xe_exec_system_allocator
> +		 * --r process-many-malloc' mallocs at least 64k at a time.
> +		 */
> +		if (check_pages &&
> +		    !drm_gpusvm_check_pages(gpusvm, notifier, start,
> end)) {
> +			++i;
> +			goto retry;
> +		}
> +	}
> +
> +	return end - start;
> +}
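
A concrete example for other reviewers, assuming chunk_sizes =
{SZ_2M, SZ_64K, SZ_4K}: a fault at 0x201000 inside a VMA spanning
[0x200000, 0x220000) rejects the 2M chunk (its end, 0x400000, exceeds
vm_end) and settles on the 64K chunk [0x200000, 0x210000), so SZ_64K is
returned, provided no existing range overlaps it and check_pages passes.
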
> +
> +/**
> + * drm_gpusvm_range_find_or_insert - Find or insert GPU SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @fault_addr: Fault address
> + * @gpuva_start: Start address of GPUVA which mirrors CPU
> + * @gpuva_end: End address of GPUVA which mirrors CPU
> + * @ctx: GPU SVM context
> + *
> + * This function finds or inserts a newly allocated GPU SVM range
> based on the
> + * fault address. Caller must hold a lock to protect range lookup
> and insertion.
> + *
> + * Returns:
> + * Pointer to the GPU SVM range on success, ERR_PTR() on failure.
> + */
> +struct drm_gpusvm_range *
> +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64
> fault_addr,
> +				u64 gpuva_start, u64 gpuva_end,
> +				const struct drm_gpusvm_ctx *ctx)
> +{
> +	struct drm_gpusvm_notifier *notifier;
> +	struct drm_gpusvm_range *range;
> +	struct mm_struct *mm = gpusvm->mm;
> +	struct vm_area_struct *vas;
> +	bool notifier_alloc = false;
> +	u64 chunk_size;
> +	int err;
> +	bool migrate_vram;
> +
> +	if (fault_addr < gpusvm->mm_start ||
> +	    fault_addr > gpusvm->mm_start + gpusvm->mm_range) {
> +		err = -EINVAL;
> +		goto err_out;
> +	}
> +
> +	if (!ctx->mmap_locked) {
> +		if (!mmget_not_zero(mm)) {
> +			err = -EFAULT;
> +			goto err_out;
> +		}
> +		mmap_write_lock(mm);
> +	}
> +
> +	mmap_assert_write_locked(mm);
> +
> +	notifier = drm_gpusvm_notifier_find(gpusvm, fault_addr);
> +	if (!notifier) {
> +		notifier = drm_gpusvm_notifier_alloc(gpusvm,
> fault_addr);
> +		if (IS_ERR(notifier)) {
> +			err = PTR_ERR(notifier);
> +			goto err_mmunlock;
> +		}
> +		notifier_alloc = true;
> +		err = mmu_interval_notifier_insert_locked(&notifier-
> >notifier,
> +							  mm,
> notifier->interval.start,
> +							  notifier-
> >interval.end -
> +							  notifier-
> >interval.start,
> +							 
> &drm_gpusvm_notifier_ops);
> +		if (err)
> +			goto err_notifier;
> +	}
> +
> +	vas = vma_lookup(mm, fault_addr);
> +	if (!vas) {
> +		err = -ENOENT;
> +		goto err_notifier_remove;
> +	}
> +
> +	if (!ctx->read_only && !(vas->vm_flags & VM_WRITE)) {
> +		err = -EPERM;
> +		goto err_notifier_remove;
> +	}
> +
> +	range = drm_gpusvm_range_find(notifier, fault_addr,
> fault_addr + 1);
> +	if (range)
> +		goto out_mmunlock;
> +	/*
> +	 * XXX: Short-circuiting migration based on migrate_vma_*
> current
> +	 * limitations. If/when migrate_vma_* add more support, this
> logic will
> +	 * have to change.
> +	 */
> +	migrate_vram = ctx->vram_possible &&
> +		vma_is_anonymous(vas) && !is_vm_hugetlb_page(vas);
> +
> +	chunk_size = drm_gpusvm_range_chunk_size(gpusvm, notifier,
> vas,
> +						 fault_addr,
> gpuva_start,
> +						 gpuva_end,
> migrate_vram &&
> +						 !ctx->prefault);
> +	if (chunk_size == LONG_MAX) {
> +		err = -EINVAL;
> +		goto err_notifier_remove;
> +	}
> +
> +	range = drm_gpusvm_range_alloc(gpusvm, notifier, fault_addr,
> chunk_size,
> +				       migrate_vram);
> +	if (IS_ERR(range)) {
> +		err = PTR_ERR(range);
> +		goto err_notifier_remove;
> +	}
> +
> +	drm_gpusvm_range_insert(notifier, range);
> +	if (notifier_alloc)
> +		drm_gpusvm_notifier_insert(gpusvm, notifier);
> +
> +	if (ctx->prefault) {
> +		struct drm_gpusvm_ctx __ctx = *ctx;
> +
> +		__ctx.mmap_locked = true;
> +		err = drm_gpusvm_range_get_pages(gpusvm, range,
> &__ctx);
> +		if (err)
> +			goto err_range_remove;
> +	}
> +
> +out_mmunlock:
> +	if (!ctx->mmap_locked) {
> +		mmap_write_unlock(mm);
> +		mmput(mm);
> +	}
> +
> +	return range;
> +
> +err_range_remove:
> +	__drm_gpusvm_range_remove(notifier, range);
> +err_notifier_remove:
> +	if (notifier_alloc)
> +		mmu_interval_notifier_remove(&notifier->notifier);
> +err_notifier:
> +	if (notifier_alloc)
> +		drm_gpusvm_notifier_free(gpusvm, notifier);
> +err_mmunlock:
> +	if (!ctx->mmap_locked) {
> +		mmap_write_unlock(mm);
> +		mmput(mm);
> +	}
> +err_out:
> +	return ERR_PTR(err);
> +}
> +
> +/**
> + * for_each_dma_page - iterate over pages in a DMA region
> + * @i__: the current page index in the iteration
> + * @j__: the current page index, log order, in the iteration
> + * @npages__: the total number of pages in the DMA region
> + * @order__: the order of the pages in the DMA region
> + *
> + * This macro iterates over each page in a DMA region. The DMA
> region
> + * is assumed to be composed of 2^@order__ pages, and the macro will
> + * step through the region one block of 2^@order__ pages at a time.
> + */
> +#define for_each_dma_page(i__, j__, npages__, order__)	\
> +	for ((i__) = 0, (j__) = 0; (i__) < (npages__);	\
> +	     (j__)++, (i__) += 0x1 << (order__))
> +
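
A tiny worked example of the iteration, in case it helps: with
npages = 8 and order = 1, i visits 0, 2, 4, 6 (the PFN array index)
while j counts the 2-page blocks 0, 1, 2, 3 (the DMA address index).
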
> +/**
> + * __drm_gpusvm_range_unmap_pages - Unmap pages associated with a
> GPU SVM range (internal)
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + *
> + * This function unmaps pages associated with a GPU SVM range.
> Assumes and
> + * asserts correct locking is in place when called.
> + */
> +static void __drm_gpusvm_range_unmap_pages(struct drm_gpusvm
> *gpusvm,
> +					   struct drm_gpusvm_range
> *range)
> +{
> +	lockdep_assert_held(&gpusvm->notifier_lock);
> +
> +	if (range->pages) {
> +		unsigned long i, j, npages = npages_in_range(range-
> >va.start,
> +							     range-
> >va.end);
> +
> +		if (range->flags.has_dma_mapping) {
> +			for_each_dma_page(i, j, npages, range-
> >order)
> +				dma_unmap_page(gpusvm->drm->dev,
> +					       range->dma_addr[j],
> +					       PAGE_SIZE << range-
> >order,
> +					       DMA_BIDIRECTIONAL);
> +		}
> +
> +		range->flags.has_vram_pages = false;
> +		range->flags.has_dma_mapping = false;
> +	}
> +}
> +
> +/**
> + * drm_gpusvm_range_free_pages - Free pages associated with a GPU
> SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + *
> + * This function frees pages associated with a GPU SVM range.
> + */
> +static void drm_gpusvm_range_free_pages(struct drm_gpusvm *gpusvm,
> +					struct drm_gpusvm_range
> *range)
> +{
> +	lockdep_assert_held(&gpusvm->notifier_lock);
> +
> +	if (range->pages) {
> +		if (range->flags.kfree_mapping) {
> +			kfree(range->dma_addr);
> +			range->flags.kfree_mapping = false;
> +			range->pages = NULL;
> +		} else {
> +			kvfree(range->pages);
> +			range->pages = NULL;
> +		}
> +	}
> +}
> +
> +/**
> + * drm_gpusvm_range_remove - Remove GPU SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range to be removed
> + *
> + * This function removes the specified GPU SVM range and also
> removes the parent
> + * GPU SVM notifier if no more ranges remain in the notifier. The
> caller must
> + * hold a lock to protect range and notifier removal.
> + */
> +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> +			     struct drm_gpusvm_range *range)
> +{
> +	struct drm_gpusvm_notifier *notifier;
> +
> +	notifier = drm_gpusvm_notifier_find(gpusvm, range-
> >va.start);
> +	if (WARN_ON_ONCE(!notifier))
> +		return;
> +
> +	drm_gpusvm_notifier_lock(gpusvm);
> +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> +	drm_gpusvm_range_free_pages(gpusvm, range);
> +	__drm_gpusvm_range_remove(notifier, range);
> +	drm_gpusvm_notifier_unlock(gpusvm);
> +
> +	drm_gpusvm_range_put(range);
> +
> +	if (RB_EMPTY_ROOT(&notifier->root.rb_root)) {
> +		if (!notifier->flags.removed)
> +			mmu_interval_notifier_remove(&notifier-
> >notifier);
> +		drm_gpusvm_notifier_remove(gpusvm, notifier);
> +		drm_gpusvm_notifier_free(gpusvm, notifier);
> +	}
> +}
> +
> +/**
> + * drm_gpusvm_range_get - Get a reference to GPU SVM range
> + * @range: Pointer to the GPU SVM range
> + *
> + * This function increments the reference count of the specified GPU
> SVM range.
> + *
> + * Returns:
> + * Pointer to the GPU SVM range.
> + */
> +struct drm_gpusvm_range *
> +drm_gpusvm_range_get(struct drm_gpusvm_range *range)
> +{
> +	kref_get(&range->refcount);
> +
> +	return range;
> +}
> +
> +/**
> + * drm_gpusvm_range_destroy - Destroy GPU SVM range
> + * @refcount: Pointer to the reference counter embedded in the GPU
> SVM range
> + *
> + * This function destroys the specified GPU SVM range when its
> reference count
> + * reaches zero. If a custom range-free function is provided, it is
> invoked to
> + * free the range; otherwise, the range is deallocated using
> kfree().
> + */
> +static void drm_gpusvm_range_destroy(struct kref *refcount)
> +{
> +	struct drm_gpusvm_range *range =
> +		container_of(refcount, struct drm_gpusvm_range,
> refcount);
> +	struct drm_gpusvm *gpusvm = range->gpusvm;
> +
> +	if (gpusvm->ops->range_free)
> +		gpusvm->ops->range_free(range);
> +	else
> +		kfree(range);
> +}
> +
> +/**
> + * drm_gpusvm_range_put - Put a reference to GPU SVM range
> + * @range: Pointer to the GPU SVM range
> + *
> + * This function decrements the reference count of the specified GPU
> SVM range
> + * and frees it when the count reaches zero.
> + */
> +void drm_gpusvm_range_put(struct drm_gpusvm_range *range)
> +{
> +	kref_put(&range->refcount, drm_gpusvm_range_destroy);
> +}
> +
> +/**
> + * drm_gpusvm_range_pages_valid - GPU SVM range pages valid
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + *
> + * This function determines if a GPU SVM range's pages are valid.
> + * Expected to be called holding gpusvm->notifier_lock and as the last
> + * step before committing a GPU binding.
> + *
> + * Returns:
> + * True if GPU SVM range has valid pages, False otherwise
> + */
> +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> +				  struct drm_gpusvm_range *range)
> +{
> +	lockdep_assert_held(&gpusvm->notifier_lock);
> +
> +	return range->flags.has_vram_pages || range-
> >flags.has_dma_mapping;
> +}
> +
> +/**
> + * drm_gpusvm_range_pages_valid_unlocked - GPU SVM range pages valid
> unlocked
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + *
> + * This function determines if a GPU SVM range's pages are valid.
> + * Expected to be called without holding gpusvm->notifier_lock.
> + *
> + * Returns:
> + * True if GPU SVM range has valid pages, False otherwise
> + */
> +static bool
> +drm_gpusvm_range_pages_valid_unlocked(struct drm_gpusvm *gpusvm,
> +				      struct drm_gpusvm_range
> *range)
> +{
> +	bool pages_valid;
> +
> +	if (!range->pages)
> +		return false;
> +
> +	drm_gpusvm_notifier_lock(gpusvm);
> +	pages_valid = drm_gpusvm_range_pages_valid(gpusvm, range);
> +	if (!pages_valid && range->flags.kfree_mapping) {
> +		kfree(range->dma_addr);
> +		range->flags.kfree_mapping = false;
> +		range->pages = NULL;
> +	}
> +	drm_gpusvm_notifier_unlock(gpusvm);
> +
> +	return pages_valid;
> +}
> +
> +/**
> + * drm_gpusvm_range_get_pages - Get pages for a GPU SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + * @ctx: GPU SVM context
> + *
> + * This function gets pages for a GPU SVM range and ensures they are
> mapped for
> + * DMA access.
> + *
> + * Returns:
> + * 0 on success, negative error code on failure.
> + */
> +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> +			       struct drm_gpusvm_range *range,
> +			       const struct drm_gpusvm_ctx *ctx)
> +{

Is it possible to split this function up to make it look neater?


> +	struct mmu_interval_notifier *notifier = &range->notifier-
> >notifier;
> +	struct hmm_range hmm_range = {
> +		.default_flags = HMM_PFN_REQ_FAULT | (ctx->read_only
> ? 0 :
> +			HMM_PFN_REQ_WRITE),
> +		.notifier = notifier,
> +		.start = range->va.start,
> +		.end = range->va.end,
> +		.dev_private_owner = gpusvm-
> >device_private_page_owner,
> +	};
> +	struct mm_struct *mm = gpusvm->mm;
> +	unsigned long timeout =
> +		jiffies +
> msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> +	unsigned long i, j;
> +	unsigned long npages = npages_in_range(range->va.start,
> range->va.end);
> +	unsigned int order = 0;
> +	unsigned long *pfns;
> +	struct page **pages;
> +	int err = 0;
> +	bool vram_pages = !!range->flags.migrate_vram;
> +	bool alloc_pfns = false, kfree_mapping;
> +
> +retry:
> +	kfree_mapping = false;
> +	hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
> +	if (drm_gpusvm_range_pages_valid_unlocked(gpusvm, range))
> +		return 0;
> +
> +	if (range->notifier_seq == hmm_range.notifier_seq && range-
> >pages) {
> +		if (ctx->prefault)
> +			return 0;
> +
> +		pfns = (unsigned long *)range->pages;
> +		pages = range->pages;
> +		goto map_pages;
> +	}
> +
> +	if (!range->pages) {
> +		pfns = kvmalloc_array(npages, sizeof(*pfns),
> GFP_KERNEL);
> +		if (!pfns)
> +			return -ENOMEM;
> +		alloc_pfns = true;
> +	} else {
> +		pfns = (unsigned long *)range->pages;
> +	}
> +
> +	if (!ctx->mmap_locked) {
> +		if (!mmget_not_zero(mm)) {
> +			err = -EFAULT;
> +			goto err_out;
> +		}
> +	}
> +
> +	hmm_range.hmm_pfns = pfns;
> +	while (true) {
> +		/* Must be checked after mmu_interval_read_begin */
> +		if (range->flags.unmapped) {
> +			err = -EFAULT;
> +			break;
> +		}
> +
> +		if (!ctx->mmap_locked) {
> +			/*
> +			 * XXX: HMM locking document indicates only
> a read-lock
> +			 * is required but there appears to be a
> window between
> +			 * the MMU_NOTIFY_MIGRATE event triggered in
> a CPU fault
> +			 * via migrate_vma_setup and the pages
> actually moving
> +			 * in migrate_vma_finalize in which this
> code can grab
> +			 * garbage pages. Grabbing the write-lock if
> the range
> +			 * is attached to vram appears to protect
> against this
> +			 * race.
> +			 */
> +			if (vram_pages)
> +				mmap_write_lock(mm);
> +			else
> +				mmap_read_lock(mm);
> +		}
> +		err = hmm_range_fault(&hmm_range);
> +		if (!ctx->mmap_locked) {
> +			if (vram_pages)
> +				mmap_write_unlock(mm);
> +			else
> +				mmap_read_unlock(mm);
> +		}
> +
> +		if (err == -EBUSY) {
> +			if (time_after(jiffies, timeout))
> +				break;
> +
> +			hmm_range.notifier_seq =
> mmu_interval_read_begin(notifier);
> +			continue;
> +		}
> +		break;
> +	}
> +	if (!ctx->mmap_locked)
> +		mmput(mm);
> +	if (err)
> +		goto err_free;
> +
> +	pages = (struct page **)pfns;
> +
> +	if (ctx->prefault) {
> +		range->pages = pages;
> +		goto set_seqno;
> +	}
> +
> +map_pages:
> +	if (is_device_private_page(hmm_pfn_to_page(pfns[0]))) {
> +		WARN_ON_ONCE(!range->vram_allocation);
> +
> +		for (i = 0; i < npages; ++i) {
> +			pages[i] = hmm_pfn_to_page(pfns[i]);
> +
> +			if
> (WARN_ON_ONCE(!is_device_private_page(pages[i]))) {
> +				err = -EOPNOTSUPP;
> +				goto err_free;
> +			}
> +		}
> +
> +		/* Do not race with notifier unmapping pages */
> +		drm_gpusvm_notifier_lock(gpusvm);
> +		range->flags.has_vram_pages = true;
> +		range->pages = pages;
> +		if (mmu_interval_read_retry(notifier,
> hmm_range.notifier_seq)) {
> +			err = -EAGAIN;
> +			__drm_gpusvm_range_unmap_pages(gpusvm,
> range);
> +		}
> +		drm_gpusvm_notifier_unlock(gpusvm);
> +	} else {
> +		dma_addr_t *dma_addr = (dma_addr_t *)pfns;
> +
> +		for_each_dma_page(i, j, npages, order) {

Here it looks like you're assuming that all pages are the same order?
With THP that's definitely not the case (unless hmm somehow thinks
they are 4K pages). This probably works because we only end up here in
the hugeTLB case, where all pages are forced to the same order.
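
If mixed orders ever need to be supported here, a rough sketch (purely
hypothetical, with a per-block range->orders[] array replacing the
single range->order) might look like:

	for (i = 0, j = 0; i < npages; j++, i += 1UL << order) {
		order = hmm_pfn_to_map_order(pfns[i]);
		range->orders[j] = order;
		pages[j] = hmm_pfn_to_page(pfns[i]);
		dma_addr[j] = dma_map_page(gpusvm->drm->dev, pages[j], 0,
					   PAGE_SIZE << order,
					   DMA_BIDIRECTIONAL);
	}

The unmap path would then need to walk the same per-block orders rather
than assume a single order for the whole range.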

> +			if (WARN_ON_ONCE(i && order !=
> +					
> hmm_pfn_to_map_order(pfns[i]))) {
> +				err = -EOPNOTSUPP;
> +				npages = i;
> +				goto err_unmap;
> +			}
> +			order = hmm_pfn_to_map_order(pfns[i]);
> +
> +			pages[j] = hmm_pfn_to_page(pfns[i]);
> +			if
> (WARN_ON_ONCE(is_zone_device_page(pages[j]))) {
> +				err = -EOPNOTSUPP;
> +				npages = i;
> +				goto err_unmap;
> +			}
> +
> +			set_page_dirty_lock(pages[j]);
> +			mark_page_accessed(pages[j]);
> +
> +			dma_addr[j] = dma_map_page(gpusvm->drm->dev,
> +						   pages[j], 0,
> +						   PAGE_SIZE <<
> order,
> +						  
> DMA_BIDIRECTIONAL);
> +			if (dma_mapping_error(gpusvm->drm->dev,
> dma_addr[j])) {
> +				err = -EFAULT;
> +				npages = i;
> +				goto err_unmap;
> +			}
> +		}
> +
> +		/* Huge pages, reduce memory footprint */
> +		if (order) {
> +			dma_addr = kmalloc_array(j,
> sizeof(*dma_addr),
> +						 GFP_KERNEL);
> +			if (dma_addr) {
> +				for (i = 0; i < j; ++i)
> +					dma_addr[i] =
> (dma_addr_t)pfns[i];
> +				kvfree(pfns);
> +				kfree_mapping = true;
> +			} else {
> +				dma_addr = (dma_addr_t *)pfns;
> +			}
> +		}
> +
> +		/* Do not race with notifier unmapping pages */
> +		drm_gpusvm_notifier_lock(gpusvm);
> +		range->order = order;
> +		range->flags.kfree_mapping = kfree_mapping;
> +		range->flags.has_dma_mapping = true;
> +		range->dma_addr = dma_addr;
> +		range->vram_allocation = NULL;
> +		if (mmu_interval_read_retry(notifier,
> hmm_range.notifier_seq)) {
> +			err = -EAGAIN;
> +			__drm_gpusvm_range_unmap_pages(gpusvm,
> range);
> +		}
> +		drm_gpusvm_notifier_unlock(gpusvm);
> +	}
> +
> +	if (err == -EAGAIN)
> +		goto retry;
> +set_seqno:
> +	range->notifier_seq = hmm_range.notifier_seq;
> +
> +	return 0;
> +
> +err_unmap:
> +	for_each_dma_page(i, j, npages, order)
> +		dma_unmap_page(gpusvm->drm->dev,
> +			       (dma_addr_t)pfns[j],
> +			       PAGE_SIZE << order,
> DMA_BIDIRECTIONAL);
> +err_free:
> +	if (alloc_pfns)
> +		kvfree(pfns);
> +err_out:
> +	return err;
> +}
> +
> +/**
> + * drm_gpusvm_range_unmap_pages - Unmap pages associated with a GPU
> SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + * @ctx: GPU SVM context
> + *
> + * This function unmaps pages associated with a GPU SVM range. If
> @in_notifier
> + * is set, it is assumed that gpusvm->notifier_lock is held in write
> mode; if it
> + * is clear, it acquires gpusvm->notifier_lock in read mode. Must be
> called on
> + * each GPU SVM range attached to notifier in gpusvm->ops-
> >invalidate for IOMMU
> + * security model.
> + */
> +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> +				  struct drm_gpusvm_range *range,
> +				  const struct drm_gpusvm_ctx *ctx)
> +{
> +	if (ctx->in_notifier)
> +		lockdep_assert_held_write(&gpusvm->notifier_lock);
> +	else
> +		drm_gpusvm_notifier_lock(gpusvm);
> +
> +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> +
> +	if (!ctx->in_notifier)
> +		drm_gpusvm_notifier_unlock(gpusvm);
> +}
> +
> +/**
> + * drm_gpusvm_migration_put_page - Put a migration page
> + * @page: Pointer to the page to put
> + *
> + * This function unlocks and puts a page.
> + */
> +static void drm_gpusvm_migration_put_page(struct page *page)
> +{
> +	unlock_page(page);
> +	put_page(page);
> +}
> +
> +/**
> + * drm_gpusvm_migration_put_pages - Put migration pages
> + * @npages: Number of pages
> + * @migrate_pfn: Array of migrate page frame numbers
> + *
> + * This function puts an array of pages.
> + */
> +static void drm_gpusvm_migration_put_pages(unsigned long npages,
> +					   unsigned long
> *migrate_pfn)
> +{
> +	unsigned long i;
> +
> +	for (i = 0; i < npages; ++i) {
> +		if (!migrate_pfn[i])
> +			continue;
> +
> +		drm_gpusvm_migration_put_page(migrate_pfn_to_page(mi
> grate_pfn[i]));
> +		migrate_pfn[i] = 0;
> +	}
> +}
> +
> +/**
> + * drm_gpusvm_get_vram_page - Get a reference to a VRAM page
> + * @page: Pointer to the page
> + * @zdd: Pointer to the GPU SVM zone device data
> + *
> + * This function associates the given page with the specified GPU
> SVM zone
> + * device data and initializes it for zone device usage.
> + */
> +static void drm_gpusvm_get_vram_page(struct page *page,
> +				     struct drm_gpusvm_zdd *zdd)
> +{
> +	page->zone_device_data = drm_gpusvm_zdd_get(zdd);
> +	zone_device_page_init(page);
> +}
> +
> +/**
> + * drm_gpusvm_migrate_map_pages() - Map migration pages for GPU SVM
> migration
> + * @dev: The device for which the pages are being mapped
> + * @dma_addr: Array to store DMA addresses corresponding to mapped
> pages
> + * @migrate_pfn: Array of migrate page frame numbers to map
> + * @npages: Number of pages to map
> + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> + *
> + * This function maps pages of memory for migration usage in GPU
> SVM. It
> + * iterates over each page frame number provided in @migrate_pfn,
> maps the
> + * corresponding page, and stores the DMA address in the provided
> @dma_addr
> + * array.
> + *
> + * Return: 0 on success, -EFAULT if an error occurs during mapping.
> + */
> +static int drm_gpusvm_migrate_map_pages(struct device *dev,
> +					dma_addr_t *dma_addr,
> +					long unsigned int
> *migrate_pfn,
> +					unsigned long npages,
> +					enum dma_data_direction dir)
> +{
> +	unsigned long i;
> +
> +	for (i = 0; i < npages; ++i) {
> +		struct page *page =
> migrate_pfn_to_page(migrate_pfn[i]);
> +
> +		if (!page)
> +			continue;
> +
> +		if (WARN_ON_ONCE(is_zone_device_page(page)))
> +			return -EFAULT;
> +
> +		dma_addr[i] = dma_map_page(dev, page, 0, PAGE_SIZE,
> dir);
> +		if (dma_mapping_error(dev, dma_addr[i]))
> +			return -EFAULT;
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * drm_gpusvm_migrate_unmap_pages() - Unmap pages previously mapped
> for GPU SVM migration
> + * @dev: The device for which the pages were mapped
> + * @dma_addr: Array of DMA addresses corresponding to mapped pages
> + * @npages: Number of pages to unmap
> + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> + *
> + * This function unmaps previously mapped pages of memory for GPU
> Shared Virtual
> + * Memory (SVM). It iterates over each DMA address provided in
> @dma_addr, checks
> + * if it's valid and not already unmapped, and unmaps the
> corresponding page.
> + */
> +static void drm_gpusvm_migrate_unmap_pages(struct device *dev,
> +					   dma_addr_t *dma_addr,
> +					   unsigned long npages,
> +					   enum dma_data_direction
> dir)
> +{
> +	unsigned long i;
> +
> +	for (i = 0; i < npages; ++i) {
> +		if (!dma_addr[i] || dma_mapping_error(dev,
> dma_addr[i]))
> +			continue;
> +
> +		dma_unmap_page(dev, dma_addr[i], PAGE_SIZE, dir);
> +	}
> +}
> +
> +/**
> + * drm_gpusvm_migrate_to_vram - Migrate GPU SVM range to VRAM
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + * @vram_allocation: Driver-private pointer to the VRAM allocation.
> The caller
> + *                   should hold a reference to the VRAM allocation,
> which
> + *                   should be dropped via ops->vram_allocation or
> upon the
> + *                   failure of this function.
> + * @ctx: GPU SVM context
> + *
> + * This function migrates the specified GPU SVM range to VRAM. It
> performs the
> + * necessary setup and invokes the driver-specific operations for
> migration to
> + * VRAM. Upon successful return, @vram_allocation can safely
> reference @range
> + * until ops->vram_release is called, which is only called after this
> + * function has returned successfully.
> + *
> + * Returns:
> + * 0 on success, negative error code on failure.
> + */
> +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> +			       struct drm_gpusvm_range *range,
> +			       void *vram_allocation,
> +			       const struct drm_gpusvm_ctx *ctx)
> +{
> +	u64 start = range->va.start, end = range->va.end;
> +	struct migrate_vma migrate = {
> +		.start		= start,
> +		.end		= end,
> +		.pgmap_owner	= gpusvm->device_private_page_owner,
> +		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
> +	};
> +	struct mm_struct *mm = gpusvm->mm;
> +	unsigned long i, npages = npages_in_range(start, end);
> +	struct vm_area_struct *vas;
> +	struct drm_gpusvm_zdd *zdd = NULL;
> +	struct page **pages;
> +	dma_addr_t *dma_addr;
> +	void *buf;
> +	int err;
> +
> +	if (!range->flags.migrate_vram)
> +		return -EINVAL;
> +
> +	if (!gpusvm->ops->populate_vram_pfn || !gpusvm->ops-
> >copy_to_vram ||
> +	    !gpusvm->ops->copy_to_sram)
> +		return -EOPNOTSUPP;
> +
> +	if (!ctx->mmap_locked) {
> +		if (!mmget_not_zero(mm)) {
> +			err = -EFAULT;
> +			goto err_out;
> +		}
> +		mmap_write_lock(mm);
> +	}
> +
> +	mmap_assert_locked(mm);
> +
> +	vas = vma_lookup(mm, start);
> +	if (!vas) {
> +		err = -ENOENT;
> +		goto err_mmunlock;
> +	}
> +
> +	if (end > vas->vm_end || start < vas->vm_start) {
> +		err = -EINVAL;
> +		goto err_mmunlock;
> +	}
> +
> +	if (!vma_is_anonymous(vas)) {
> +		err = -EBUSY;
> +		goto err_mmunlock;
> +	}
> +
> +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) +
> sizeof(*dma_addr) +
> +		       sizeof(*pages), GFP_KERNEL);
> +	if (!buf) {
> +		err = -ENOMEM;
> +		goto err_mmunlock;
> +	}
> +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr))
> * npages;
> +
> +	zdd = drm_gpusvm_zdd_alloc(range);
> +	if (!zdd) {
> +		err = -ENOMEM;
> +		goto err_free;
> +	}
> +
> +	migrate.vma = vas;
> +	migrate.src = buf;
> +	migrate.dst = migrate.src + npages;
> +
> +	err = migrate_vma_setup(&migrate);
> +	if (err)
> +		goto err_free;
> +
> +	/*
> +	 * FIXME: Below cases, !migrate.cpages and migrate.cpages !=
> npages, not
> +	 * always an error. Need to revisit possible cases and how
> to handle. We
> +	 * could prefault on migrate.cpages != npages via
> hmm_range_fault.
> +	 */
> +
> +	if (!migrate.cpages) {
> +		err = -EFAULT;
> +		goto err_free;
> +	}
> +
> +	if (migrate.cpages != npages) {
> +		err = -EBUSY;
> +		goto err_finalize;
> +	}
> +
> +	err = gpusvm->ops->populate_vram_pfn(gpusvm,
> vram_allocation, npages,
> +					     migrate.dst);
> +	if (err)
> +		goto err_finalize;
> +
> +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> dma_addr,
> +					   migrate.src, npages,
> DMA_TO_DEVICE);
> +	if (err)
> +		goto err_finalize;
> +
> +	for (i = 0; i < npages; ++i) {
> +		struct page *page = pfn_to_page(migrate.dst[i]);
> +
> +		pages[i] = page;
> +		migrate.dst[i] = migrate_pfn(migrate.dst[i]);
> +		drm_gpusvm_get_vram_page(page, zdd);
> +	}
> +
> +	err = gpusvm->ops->copy_to_vram(gpusvm, pages, dma_addr,
> npages);
> +	if (err)
> +		goto err_finalize;
> +
> +	/* Upon success bind vram allocation to range and zdd */
> +	range->vram_allocation = vram_allocation;
> +	WRITE_ONCE(zdd->vram_allocation, vram_allocation);	/*
> Owns ref */
> +
> +err_finalize:
> +	if (err)
> +		drm_gpusvm_migration_put_pages(npages, migrate.dst);
> +	migrate_vma_pages(&migrate);
> +	migrate_vma_finalize(&migrate);
> +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr,
> npages,
> +				       DMA_TO_DEVICE);
> +err_free:
> +	if (zdd)
> +		drm_gpusvm_zdd_put(zdd);
> +	kvfree(buf);
> +err_mmunlock:
> +	if (!ctx->mmap_locked) {
> +		mmap_write_unlock(mm);
> +		mmput(mm);
> +	}
> +err_out:
> +	return err;
> +}
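
To sanity-check how the pieces fit together, this is the GPU fault
handler flow I believe the API expects (driver-side names such as vm,
alloc_vram() and the page-table programming step are assumptions, not
from this patch):

	struct drm_gpusvm_ctx ctx = { .vram_possible = true };
	struct drm_gpusvm_range *range;
	int err;

	range = drm_gpusvm_range_find_or_insert(&vm->svm, fault_addr,
						gpuva_start, gpuva_end,
						&ctx);
	if (IS_ERR(range))
		return PTR_ERR(range);

	if (range->flags.migrate_vram) {
		void *vram = alloc_vram(vm, range); /* driver specific */

		/* Best effort; fall back to SRAM pages on failure */
		drm_gpusvm_migrate_to_vram(&vm->svm, range, vram, &ctx);
	}

	err = drm_gpusvm_range_get_pages(&vm->svm, range, &ctx);
	if (!err) {
		/*
		 * Program GPU page tables here, checking
		 * drm_gpusvm_range_pages_valid() under the notifier lock
		 * as the last step before committing the bind.
		 */
	}

	return err;

Is that roughly the intended calling sequence, or is migrate_to_vram
meant to be driven from somewhere else (e.g. a prefetch IOCTL)?
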
> +
> +/**
> + * drm_gpusvm_migrate_populate_sram_pfn - Populate SRAM PFNs for a
> VM area
> + * @vas: Pointer to the VM area structure, can be NULL
> + * @npages: Number of pages to populate
> + * @src_mpfn: Source array of migrate PFNs
> + * @mpfn: Array of migrate PFNs to populate
> + * @addr: Start address for PFN allocation
> + *
> + * This function populates the SRAM migrate page frame numbers
> (PFNs) for the
> + * specified VM area structure. It allocates and locks pages in the
> VM area for
> + * SRAM usage. If @vas is non-NULL, alloc_page_vma() is used for the
> + * allocation; otherwise alloc_page() is used.
> + *
> + * Returns:
> + * 0 on success, negative error code on failure.
> + */
> +static int drm_gpusvm_migrate_populate_sram_pfn(struct
> vm_area_struct *vas,
> +						unsigned long
> npages,
> +						unsigned long
> *src_mpfn,
> +						unsigned long *mpfn,
> u64 addr)
> +{
> +	unsigned long i;
> +
> +	for (i = 0; i < npages; ++i, addr += PAGE_SIZE) {
> +		struct page *page;
> +
> +		if (!(src_mpfn[i] & MIGRATE_PFN_MIGRATE))
> +			continue;
> +
> +		if (vas)
> +			page = alloc_page_vma(GFP_HIGHUSER, vas,
> addr);
> +		else
> +			page = alloc_page(GFP_HIGHUSER);
> +
> +		if (!page)
> +			return -ENOMEM;
> +
> +		lock_page(page);
> +		mpfn[i] = migrate_pfn(page_to_pfn(page));
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * drm_gpusvm_evict_to_sram - Evict GPU SVM range to SRAM
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + *
> + * Similar to __drm_gpusvm_migrate_to_sram but does not require mmap
> lock and
> + * migration is done via the migrate_device_* functions. Fallback path,
> + * as it is preferred to issue migrations with the mmap lock held.
> + *
> + * Returns:
> + * 0 on success, negative error code on failure.
> + */
> +static int drm_gpusvm_evict_to_sram(struct drm_gpusvm *gpusvm,
> +				    struct drm_gpusvm_range *range)
> +{
> +	unsigned long npages;
> +	struct page **pages;
> +	unsigned long *src, *dst;
> +	dma_addr_t *dma_addr;
> +	void *buf;
> +	int i, err = 0;
> +
> +	npages = npages_in_range(range->va.start, range->va.end);
> +
> +	buf = kvcalloc(npages, 2 * sizeof(*src) + sizeof(*dma_addr)
> +
> +		       sizeof(*pages), GFP_KERNEL);
> +	if (!buf) {
> +		err = -ENOMEM;
> +		goto err_out;
> +	}
> +	src = buf;
> +	dst = buf + (sizeof(*src) * npages);
> +	dma_addr = buf + (2 * sizeof(*src) * npages);
> +	pages = buf + (2 * sizeof(*src) + sizeof(*dma_addr)) *
> npages;
> +
> +	err = gpusvm->ops->populate_vram_pfn(gpusvm, range-
> >vram_allocation,
> +					     npages, src);
> +	if (err)
> +		goto err_free;
> +
> +	err = migrate_device_vma_range(gpusvm->mm,
> +				       gpusvm-
> >device_private_page_owner, src,
> +				       npages, range->va.start);
> +	if (err)
> +		goto err_free;
> +
> +	err = drm_gpusvm_migrate_populate_sram_pfn(NULL, npages,
> src, dst, 0);
> +	if (err)
> +		goto err_finalize;
> +
> +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> dma_addr,
> +					   dst, npages,
> DMA_BIDIRECTIONAL);
> +	if (err)
> +		goto err_finalize;
> +
> +	for (i = 0; i < npages; ++i)
> +		pages[i] = migrate_pfn_to_page(src[i]);
> +
> +	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr,
> npages);
> +	if (err)
> +		goto err_finalize;
> +
> +err_finalize:
> +	if (err)
> +		drm_gpusvm_migration_put_pages(npages, dst);
> +	migrate_device_pages(src, dst, npages);
> +	migrate_device_finalize(src, dst, npages);
> +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr,
> npages,
> +				       DMA_BIDIRECTIONAL);
> +err_free:
> +	kvfree(buf);
> +err_out:
> +
> +	return err;
> +}
> +
> +/**
> + * __drm_gpusvm_migrate_to_sram - Migrate GPU SVM range to SRAM
> (internal)
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @vas: Pointer to the VM area structure
> + * @page: Pointer to the page for fault handling (can be NULL)
> + * @start: Start address of the migration range
> + * @end: End address of the migration range
> + *
> + * This internal function performs the migration of the specified
> GPU SVM range
> + * to SRAM. It sets up the migration, populates + dma maps SRAM
> PFNs, and
> + * invokes the driver-specific operations for migration to SRAM.
> + *
> + * Returns:
> + * 0 on success, negative error code on failure.
> + */
> +static int __drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> +					struct vm_area_struct *vas,
> +					struct page *page,
> +					u64 start, u64 end)
> +{
> +	struct migrate_vma migrate = {
> +		.vma		= vas,
> +		.pgmap_owner	= gpusvm->device_private_page_owner,
> +		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> +		.fault_page	= page,
> +	};
> +	unsigned long npages;
> +	struct page **pages;
> +	dma_addr_t *dma_addr;
> +	void *buf;
> +	int i, err = 0;
> +
> +	mmap_assert_locked(gpusvm->mm);
> +
> +	/* Corner case where the VMA has been partially unmapped */
> +	if (start < vas->vm_start)
> +		start = vas->vm_start;
> +	if (end > vas->vm_end)
> +		end = vas->vm_end;
> +
> +	migrate.start = start;
> +	migrate.end = end;
> +	npages = npages_in_range(start, end);
> +
> +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) +
> sizeof(*dma_addr) +
> +		       sizeof(*pages), GFP_KERNEL);
> +	if (!buf) {
> +		err = -ENOMEM;
> +		goto err_out;
> +	}
> +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr))
> * npages;
> +
> +	migrate.vma = vas;
> +	migrate.src = buf;
> +	migrate.dst = migrate.src + npages;
> +
> +	err = migrate_vma_setup(&migrate);
> +	if (err)
> +		goto err_free;
> +
> +	/* Raced with another CPU fault, nothing to do */
> +	if (!migrate.cpages)
> +		goto err_free;
> +
> +	err = drm_gpusvm_migrate_populate_sram_pfn(vas, npages,
> +						   migrate.src,
> migrate.dst,
> +						   start);
> +	if (err)
> +		goto err_finalize;
> +
> +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> dma_addr,
> +					   migrate.dst, npages,
> +					   DMA_BIDIRECTIONAL);
> +	if (err)
> +		goto err_finalize;
> +
> +	for (i = 0; i < npages; ++i)
> +		pages[i] = migrate_pfn_to_page(migrate.src[i]);
> +
> +	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr,
> npages);
> +	if (err)
> +		goto err_finalize;
> +
> +err_finalize:
> +	if (err)
> +		drm_gpusvm_migration_put_pages(npages, migrate.dst);
> +	migrate_vma_pages(&migrate);
> +	migrate_vma_finalize(&migrate);
> +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr,
> npages,
> +				       DMA_BIDIRECTIONAL);
> +err_free:
> +	kvfree(buf);
> +err_out:
> +	mmap_assert_locked(gpusvm->mm);
> +
> +	return err;
> +}
> +
> +/**
> + * drm_gpusvm_migrate_to_sram - Migrate (evict) GPU SVM range to
> SRAM
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + * @ctx: GPU SVM context
> + *
> + * This function initiates the migration of the specified GPU SVM
> range to
> + * SRAM. It performs necessary checks and invokes the internal
> migration
> + * function for actual migration.
> + *
> + * Returns:
> + * 0 on success, negative error code on failure.
> + */
> +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> +			       struct drm_gpusvm_range *range,
> +			       const struct drm_gpusvm_ctx *ctx)
> +{
> +	u64 start = range->va.start, end = range->va.end;
> +	struct mm_struct *mm = gpusvm->mm;
> +	struct vm_area_struct *vas;
> +	int err;
> +	bool retry = false;
> +
> +	if (!ctx->mmap_locked) {
> +		if (!mmget_not_zero(mm)) {
> +			err = -EFAULT;
> +			goto err_out;
> +		}
> +		if (ctx->trylock_mmap) {
> +			if (!mmap_read_trylock(mm))  {
> +				err =
> drm_gpusvm_evict_to_sram(gpusvm, range);
> +				goto err_mmput;
> +			}
> +		} else {
> +			mmap_read_lock(mm);
> +		}
> +	}
> +
> +	mmap_assert_locked(mm);
> +
> +	/*
> +	 * Loop required to find all VMA area structs for the corner
> case when
> +	 * VRAM backing has been partially unmapped from MM's
> address space.
> +	 */
> +again:
> +	vas = find_vma(mm, start);
> +	if (!vas) {
> +		if (!retry)
> +			err = -ENOENT;
> +		goto err_mmunlock;
> +	}
> +
> +	if (end <= vas->vm_start || start >= vas->vm_end) {
> +		if (!retry)
> +			err = -EINVAL;
> +		goto err_mmunlock;
> +	}
> +
> +	err = __drm_gpusvm_migrate_to_sram(gpusvm, vas, NULL, start,
> end);
> +	if (err)
> +		goto err_mmunlock;
> +
> +	if (vas->vm_end < end) {
> +		retry = true;
> +		start = vas->vm_end;
> +		goto again;
> +	}
> +
> +	if (!ctx->mmap_locked) {
> +		mmap_read_unlock(mm);
> +		/*
> +		 * Using mmput_async as this function can be called
> while
> +		 * holding a dma-resv lock, and a final put can grab
> the mmap
> +		 * lock, causing a lock inversion.
> +		 */
> +		mmput_async(mm);
> +	}
> +
> +	return 0;
> +
> +err_mmunlock:
> +	if (!ctx->mmap_locked)
> +		mmap_read_unlock(mm);
> +err_mmput:
> +	if (!ctx->mmap_locked)
> +		mmput_async(mm);
> +err_out:
> +	return err;
> +}
> +
> +/**
> + * drm_gpusvm_page_free - Put GPU SVM zone device data associated
> with a page
> + * @page: Pointer to the page
> + *
> + * This function is a callback used to put the GPU SVM zone device
> data
> + * associated with a page when it is being released.
> + */
> +static void drm_gpusvm_page_free(struct page *page)
> +{
> +	drm_gpusvm_zdd_put(page->zone_device_data);
> +}
> +
> +/**
> + * drm_gpusvm_migrate_to_ram - Migrate GPU SVM range to RAM (page
> fault handler)
> + * @vmf: Pointer to the fault information structure
> + *
> + * This function is a page fault handler used to migrate a GPU SVM
> range to RAM.
> + * It retrieves the GPU SVM range information from the faulting page
> and invokes
> + * the internal migration function to migrate the range back to RAM.
> + *
> + * Returns:
> + * VM_FAULT_SIGBUS on failure, 0 on success.
> + */
> +static vm_fault_t drm_gpusvm_migrate_to_ram(struct vm_fault *vmf)
> +{
> +	struct drm_gpusvm_zdd *zdd = vmf->page->zone_device_data;
> +	int err;
> +
> +	err = __drm_gpusvm_migrate_to_sram(zdd->range->gpusvm,
> +					   vmf->vma, vmf->page,
> +					   zdd->range->va.start,
> +					   zdd->range->va.end);
> +
> +	return err ? VM_FAULT_SIGBUS : 0;
> +}
> +
> +/**
> + * drm_gpusvm_pagemap_ops - Device page map operations for GPU SVM
> + */
> +static const struct dev_pagemap_ops drm_gpusvm_pagemap_ops = {
> +	.page_free = drm_gpusvm_page_free,
> +	.migrate_to_ram = drm_gpusvm_migrate_to_ram,
> +};
> +
> +/**
> + * drm_gpusvm_pagemap_ops_get - Retrieve GPU SVM device page map
> operations
> + *
> + * Returns:
> + * Pointer to the GPU SVM device page map operations structure.
> + */
> +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void)
> +{
> +	return &drm_gpusvm_pagemap_ops;
> +}
> +
> +/**
> + * drm_gpusvm_has_mapping - Check if GPU SVM has mapping for the
> given address range
> + * @gpusvm: Pointer to the GPU SVM structure.
> + * @start: Start address
> + * @end: End address
> + *
> + * Returns:
> + * True if GPU SVM has mapping, False otherwise
> + */
> +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start,
> u64 end)
> +{
> +	struct drm_gpusvm_notifier *notifier;
> +
> +	drm_gpusvm_for_each_notifier(notifier, gpusvm, start, end) {
> +		struct drm_gpusvm_range *range = NULL;
> +
> +		drm_gpusvm_for_each_range(range, notifier, start,
> end)
> +			return true;
> +	}
> +
> +	return false;
> +}
> diff --git a/drivers/gpu/drm/xe/drm_gpusvm.h
> b/drivers/gpu/drm/xe/drm_gpusvm.h
> new file mode 100644
> index 000000000000..0ea70f8534a8
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/drm_gpusvm.h
> @@ -0,0 +1,415 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2024 Intel Corporation
> + */
> +
> +#ifndef __DRM_GPUSVM_H__
> +#define __DRM_GPUSVM_H__
> +
> +#include <linux/kref.h>
> +#include <linux/mmu_notifier.h>
> +#include <linux/workqueue.h>
> +
> +struct dev_pagemap_ops;
> +struct drm_device;
> +struct drm_gpusvm;
> +struct drm_gpusvm_notifier;
> +struct drm_gpusvm_ops;
> +struct drm_gpusvm_range;
> +
> +/**
> + * struct drm_gpusvm_ops - Operations structure for GPU SVM
> + *
> + * This structure defines the operations for GPU Shared Virtual
> Memory (SVM).
> + * These operations are provided by the GPU driver to manage SVM
> ranges and
> + * perform operations such as migration between VRAM and system RAM.
> + */
> +struct drm_gpusvm_ops {
> +	/**
> +	 * @notifier_alloc: Allocate a GPU SVM notifier (optional)
> +	 *
> +	 * This function shall allocate a GPU SVM notifier.
> +	 *
> +	 * Returns:
> +	 * Pointer to the allocated GPU SVM notifier on success,
> NULL on failure.
> +	 */
> +	struct drm_gpusvm_notifier *(*notifier_alloc)(void);
> +
> +	/**
> +	 * @notifier_free: Free a GPU SVM notifier (optional)
> +	 * @notifier: Pointer to the GPU SVM notifier to be freed
> +	 *
> +	 * This function shall free a GPU SVM notifier.
> +	 */
> +	void (*notifier_free)(struct drm_gpusvm_notifier *notifier);
> +
> +	/**
> +	 * @range_alloc: Allocate a GPU SVM range (optional)
> +	 * @gpusvm: Pointer to the GPU SVM
> +	 *
> +	 * This function shall allocate a GPU SVM range.
> +	 *
> +	 * Returns:
> +	 * Pointer to the allocated GPU SVM range on success, NULL
> on failure.
> +	 */
> +	struct drm_gpusvm_range *(*range_alloc)(struct drm_gpusvm
> *gpusvm);
> +
> +	/**
> +	 * @range_free: Free a GPU SVM range (optional)
> +	 * @range: Pointer to the GPU SVM range to be freed
> +	 *
> +	 * This function shall free a GPU SVM range.
> +	 */
> +	void (*range_free)(struct drm_gpusvm_range *range);
> +
> +	/**
> +	 * @vram_release: Release VRAM allocation (optional)
> +	 * @vram_allocation: Driver-private pointer to the VRAM
> allocation
> +	 *
> +	 * This function shall release VRAM allocation and expects
> to drop a
> +	 * reference to VRAM allocation.
> +	 */
> +	void (*vram_release)(void *vram_allocation);
> +
> +	/**
> +	 * @populate_vram_pfn: Populate VRAM PFN (required for
> migration)
> +	 * @gpusvm: Pointer to the GPU SVM
> +	 * @vram_allocation: Driver-private pointer to the VRAM
> allocation
> +	 * @npages: Number of pages to populate
> +	 * @pfn: Array of page frame numbers to populate
> +	 *
> +	 * This function shall populate VRAM page frame numbers
> (PFN).
> +	 *
> +	 * Returns:
> +	 * 0 on success, a negative error code on failure.
> +	 */
> +	int (*populate_vram_pfn)(struct drm_gpusvm *gpusvm,
> +				 void *vram_allocation,
> +				 unsigned long npages,
> +				 unsigned long *pfn);
> +
> +	/**
> +	 * @copy_to_vram: Copy to VRAM (required for migration)
> +	 * @gpusvm: Pointer to the GPU SVM
> +	 * @pages: Pointer to array of VRAM pages (destination)
> +	 * @dma_addr: Pointer to array of DMA addresses (source)
> +	 * @npages: Number of pages to copy
> +	 *
> +	 * This function shall copy pages to VRAM.
> +	 *
> +	 * Returns:
> +	 * 0 on success, a negative error code on failure.
> +	 */
> +	int (*copy_to_vram)(struct drm_gpusvm *gpusvm,
> +			    struct page **pages,
> +			    dma_addr_t *dma_addr,
> +			    unsigned long npages);
> +
> +	/**
> +	 * @copy_to_sram: Copy to system RAM (required for
> migration)
> +	 * @gpusvm: Pointer to the GPU SVM
> +	 * @pages: Pointer to array of VRAM pages (source)
> +	 * @dma_addr: Pointer to array of DMA addresses
> (destination)
> +	 * @npages: Number of pages to copy
> +	 *
> +	 * This function shall copy pages to system RAM.
> +	 *
> +	 * Returns:
> +	 * 0 on success, a negative error code on failure.
> +	 */
> +	int (*copy_to_sram)(struct drm_gpusvm *gpusvm,
> +			    struct page **pages,
> +			    dma_addr_t *dma_addr,
> +			    unsigned long npages);
> +
> +	/**
> +	 * @invalidate: Invalidate GPU SVM notifier (required)
> +	 * @gpusvm: Pointer to the GPU SVM
> +	 * @notifier: Pointer to the GPU SVM notifier
> +	 * @mmu_range: Pointer to the mmu_notifier_range structure
> +	 *
> +	 * This function shall invalidate the GPU page tables. It
> can safely
> +	 * walk the notifier range RB tree/list in this function.
> Called while
> +	 * holding the notifier lock.
> +	 */
> +	void (*invalidate)(struct drm_gpusvm *gpusvm,
> +			   struct drm_gpusvm_notifier *notifier,
> +			   const struct mmu_notifier_range
> *mmu_range);
> +};
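
Since @invalidate is the only required hook, it might be worth spelling
out the minimal SRAM-only implementation in the kernel-doc. My
understanding (a sketch only; the example_* names are placeholders) is
that it boils down to:

	static void example_invalidate(struct drm_gpusvm *gpusvm,
				       struct drm_gpusvm_notifier *notifier,
				       const struct mmu_notifier_range *mmu_range)
	{
		struct drm_gpusvm_ctx ctx = { .in_notifier = true };
		struct drm_gpusvm_range *range = NULL;

		/* Driver-specific GPU TLB / page-table zap goes here. */

		drm_gpusvm_for_each_range(range, notifier, mmu_range->start,
					  mmu_range->end)
			drm_gpusvm_range_unmap_pages(gpusvm, range, &ctx);
	}

	static const struct drm_gpusvm_ops example_gpusvm_ops = {
		.invalidate = example_invalidate,
	};
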
> +
> +/**
> + * struct drm_gpusvm_notifier - Structure representing a GPU SVM
> notifier
> + *
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: MMU interval notifier
> + * @interval: Interval for the notifier
> + * @rb: Red-black tree node for the parent GPU SVM structure
> notifier tree
> + * @root: Cached root node of the RB tree containing ranges
> + * @range_list: List head of ranges in the same order they appear in
> + *              interval tree. This is useful to keep iterating
> ranges while
> + *              doing modifications to RB tree.
> + * @flags.removed: Flag indicating whether the MMU interval notifier
> has been
> + *                 removed
> + *
> + * This structure represents a GPU SVM notifier.
> + */
> +struct drm_gpusvm_notifier {
> +	struct drm_gpusvm *gpusvm;
> +	struct mmu_interval_notifier notifier;
> +	struct {
> +		u64 start;
> +		u64 end;
> +	} interval;
> +	struct {
> +		struct rb_node node;
> +		struct list_head entry;
> +		u64 __subtree_last;
> +	} rb;
> +	struct rb_root_cached root;
> +	struct list_head range_list;
> +	struct {
> +		u32 removed : 1;
> +	} flags;
> +};
> +
> +/**
> + * struct drm_gpusvm_range - Structure representing a GPU SVM range
> + *
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: Pointer to the GPU SVM notifier
> + * @refcount: Reference count for the range
> + * @rb: Red-black tree node for the parent GPU SVM notifier
> structure range tree
> + * @va: Virtual address range
> + * @notifier_seq: Notifier sequence number of the range's pages
> + * @pages: Pointer to the array of pages (if backing store is in
> VRAM)
> + * @dma_addr: DMA address array (if backing store is SRAM and DMA
> mapped)
> + * @vram_allocation: Driver-private pointer to the VRAM allocation
> + * @order: Order of dma mapping. i.e. PAGE_SIZE << order is mapping
> size
> + * @flags.migrate_vram: Flag indicating whether the range can be
> migrated to VRAM
> + * @flags.unmapped: Flag indicating if the range has been unmapped
> + * @flags.partial_unmap: Flag indicating if the range has been
> partially unmapped
> + * @flags.has_vram_pages: Flag indicating if the range has vram
> pages
> + * @flags.has_dma_mapping: Flag indicating if the range has a DMA
> mapping
> + * @flags.kfree_mapping: Flag indicating @dma_addr is a compact
> allocation based
> + *                       on @order which releases via kfree
> + *
> + * This structure represents a GPU SVM range used for tracking
> memory ranges
> + * mapped in a DRM device.
> + */
> +struct drm_gpusvm_range {
> +	struct drm_gpusvm *gpusvm;
> +	struct drm_gpusvm_notifier *notifier;
> +	struct kref refcount;
> +	struct {
> +		struct rb_node node;
> +		struct list_head entry;
> +		u64 __subtree_last;
> +	} rb;
> +	struct {
> +		u64 start;
> +		u64 end;
> +	} va;
> +	unsigned long notifier_seq;
> +	union {
> +		struct page **pages;
> +		dma_addr_t *dma_addr;
> +	};
> +	void *vram_allocation;
> +	u16 order;
> +	struct {
> +		/* All flags below must be set upon creation */
> +		u16 migrate_vram : 1;
> +		/* All flags below must be set / cleared under
> notifier lock */
> +		u16 unmapped : 1;
> +		u16 partial_unmap : 1;
> +		u16 has_vram_pages : 1;
> +		u16 has_dma_mapping : 1;
> +		u16 kfree_mapping : 1;
> +	} flags;
> +};
> +
> +/**
> + * struct drm_gpusvm - GPU SVM structure
> + *
> + * @name: Name of the GPU SVM
> + * @drm: Pointer to the DRM device structure
> + * @mm: Pointer to the mm_struct for the address space
> + * @device_private_page_owner: Device private pages owner
> + * @mm_start: Start address of GPU SVM
> + * @mm_range: Range of the GPU SVM
> + * @notifier_size: Size of individual notifiers
> + * @ops: Pointer to the operations structure for GPU SVM
> + * @chunk_sizes: Pointer to the array of chunk sizes used in range
> allocation.
> + *               Entries should be powers of 2 in descending order.
> + * @num_chunks: Number of chunks
> + * @notifier_lock: Read-write semaphore for protecting notifier
> operations
> + * @zdd_wq: Workqueue for deferred work on zdd destruction
> + * @root: Cached root node of the Red-Black tree containing GPU SVM
> notifiers
> + * @notifier_list: List head of notifiers in the same
> order they
> + *                 appear in interval tree. This is useful to keep
> iterating
> + *                 notifiers while doing modifications to RB tree.
> + *
> + * This structure represents a GPU SVM (Shared Virtual Memory) used
> for tracking
> + * memory ranges mapped in a DRM (Direct Rendering Manager) device.
> + *
> + * No reference counting is provided, as this is expected to be
> embedded in the
> + * driver VM structure along with the struct drm_gpuvm, which
> handles reference
> + * counting.
> + */
> +struct drm_gpusvm {
> +	const char *name;
> +	struct drm_device *drm;
> +	struct mm_struct *mm;
> +	void *device_private_page_owner;
> +	u64 mm_start;
> +	u64 mm_range;
> +	u64 notifier_size;
> +	const struct drm_gpusvm_ops *ops;
> +	const u64 *chunk_sizes;
> +	int num_chunks;
> +	struct rw_semaphore notifier_lock;
> +	struct workqueue_struct *zdd_wq;
> +	struct rb_root_cached root;
> +	struct list_head notifier_list;
> +};
> +
> +/**
> + * struct drm_gpusvm_ctx - DRM GPU SVM context
> + *
> + * @mmap_locked: mmap lock is locked
> + * @trylock_mmap: trylock mmap lock, used to avoid locking
> inversions
> + *                (e.g. dma-resv -> mmap lock)
> + * @in_notifier: entering from a MMU notifier
> + * @read_only: operating on read-only memory
> + * @vram_possible: possible to use VRAM
> + * @prefault: prefault pages
> + *
> + * Context that DRM GPUSVM is operating in (i.e. user arguments).
> + */
> +struct drm_gpusvm_ctx {
> +	u32 mmap_locked :1;
> +	u32 trylock_mmap :1;
> +	u32 in_notifier :1;
> +	u32 read_only :1;
> +	u32 vram_possible :1;
> +	u32 prefault :1;
> +};
> +
> +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> +		    const char *name, struct drm_device *drm,
> +		    struct mm_struct *mm, void
> *device_private_page_owner,
> +		    u64 mm_start, u64 mm_range, u64 notifier_size,
> +		    const struct drm_gpusvm_ops *ops,
> +		    const u64 *chunk_sizes, int num_chunks);
> +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm);
> +void drm_gpusvm_free(struct drm_gpusvm *gpusvm);
> +
> +struct drm_gpusvm_range *
> +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64
> fault_addr,
> +				u64 gpuva_start, u64 gpuva_end,
> +				const struct drm_gpusvm_ctx *ctx);
> +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> +			     struct drm_gpusvm_range *range);
> +
> +struct drm_gpusvm_range *
> +drm_gpusvm_range_get(struct drm_gpusvm_range *range);
> +void drm_gpusvm_range_put(struct drm_gpusvm_range *range);
> +
> +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> +				  struct drm_gpusvm_range *range);
> +
> +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> +			       struct drm_gpusvm_range *range,
> +			       const struct drm_gpusvm_ctx *ctx);
> +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> +				  struct drm_gpusvm_range *range,
> +				  const struct drm_gpusvm_ctx *ctx);
> +
> +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> +			       struct drm_gpusvm_range *range,
> +			       void *vram_allocation,
> +			       const struct drm_gpusvm_ctx *ctx);
> +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> +			       struct drm_gpusvm_range *range,
> +			       const struct drm_gpusvm_ctx *ctx);
> +
> +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void);
> +
> +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start,
> u64 end);
> +
> +struct drm_gpusvm_range *
> +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64
> start, u64 end);
> +
> +/**
> + * drm_gpusvm_notifier_lock - Lock GPU SVM notifier
> + * @gpusvm__: Pointer to the GPU SVM structure.
> + *
> + * Abstract client usage GPU SVM notifier lock, take lock
> + */
> +#define drm_gpusvm_notifier_lock(gpusvm__)	\
> +	down_read(&(gpusvm__)->notifier_lock)
> +
> +/**
> + * drm_gpusvm_notifier_unlock - Unlock GPU SVM notifier
> + * @gpusvm__: Pointer to the GPU SVM structure.
> + *
> + * Abstract client usage GPU SVM notifier lock, drop lock
> + */
> +#define drm_gpusvm_notifier_unlock(gpusvm__)	\
> +	up_read(&(gpusvm__)->notifier_lock)
> +
> +/**
> + * __drm_gpusvm_range_next - Get the next GPU SVM range in the list
> + * @range: a pointer to the current GPU SVM range
> + *
> + * Return: A pointer to the next drm_gpusvm_range if available, or
> NULL if the
> + *         current range is the last one or if the input range is
> NULL.
> + */
> +static inline struct drm_gpusvm_range *
> +__drm_gpusvm_range_next(struct drm_gpusvm_range *range)
> +{
> +	if (range && !list_is_last(&range->rb.entry,
> +				   &range->notifier->range_list))
> +		return list_next_entry(range, rb.entry);
> +
> +	return NULL;
> +}
> +
> +/**
> + * drm_gpusvm_for_each_range - Iterate over GPU SVM ranges in a
> notifier
> + * @range__: Iterator variable for the ranges. If set, it indicates
> the start of
> + *	     the iterator. If NULL, call drm_gpusvm_range_find() to
> get the range.
> + * @notifier__: Pointer to the GPU SVM notifier
> + * @start__: Start address of the range
> + * @end__: End address of the range
> + *
> + * This macro is used to iterate over GPU SVM ranges in a notifier.
> It is safe
> + * to use while holding the driver SVM lock or the notifier lock.
> + */
> +#define drm_gpusvm_for_each_range(range__, notifier__, start__,
> end__)	\
> +	for ((range__) = (range__)
> ?:					\
> +	     drm_gpusvm_range_find((notifier__), (start__),
> (end__));	\
> +	     (range__) && (range__->va.start <
> (end__));		\
> +	     (range__) = __drm_gpusvm_range_next(range__))
> +
> +/**
> + * drm_gpusvm_range_set_unmapped - Mark a GPU SVM range as unmapped
> + * @range: Pointer to the GPU SVM range structure.
> + * @mmu_range: Pointer to the MMU notifier range structure.
> + *
> + * This function marks a GPU SVM range as unmapped and sets the
> partial_unmap flag
> + * if the range partially falls within the provided MMU notifier
> range.
> + */
> +static inline void
> +drm_gpusvm_range_set_unmapped(struct drm_gpusvm_range *range,
> +			      const struct mmu_notifier_range
> *mmu_range)
> +{
> +	lockdep_assert_held_write(&range->gpusvm->notifier_lock);
> +
> +	range->flags.unmapped = true;
> +	if (range->va.start < mmu_range->start ||
> +	    range->va.end > mmu_range->end)
> +		range->flags.partial_unmap = true;
> +}
> +
> +#endif /* __DRM_GPUSVM_H__ */
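
Putting the pieces together, a minimal sketch of how a driver might wire the
above up. The chunk sizes, VA range, and the driver_vm / driver_invalidation
names below are illustrative guesses only, not taken from the series:

	/* Illustrative only: names and values are assumptions, not from the patch */
	static const u64 fault_chunk_sizes[] = {
		SZ_2M, SZ_64K, SZ_4K,	/* powers of 2, descending, last entry SZ_4K */
	};

	static const struct drm_gpusvm_ops gpusvm_ops = {
		.invalidate = driver_invalidation,	/* required vfunc */
	};

	int driver_svm_init(struct driver_vm *vm)
	{
		return drm_gpusvm_init(&vm->svm, "driver SVM", vm->drm,
				       current->mm,
				       vm,		/* device private page owner */
				       0, 1ull << 47,	/* mm_start, mm_range */
				       SZ_512M,		/* notifier_size */
				       &gpusvm_ops,
				       fault_chunk_sizes,
				       ARRAY_SIZE(fault_chunk_sizes));
	}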


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory
  2024-09-24 10:42   ` Thomas Hellström
@ 2024-09-24 16:30     ` Matthew Brost
  2024-09-25 21:12       ` Matthew Brost
  0 siblings, 1 reply; 100+ messages in thread
From: Matthew Brost @ 2024-09-24 16:30 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: intel-xe, dri-devel, airlied, christian.koenig, matthew.auld,
	daniel

On Tue, Sep 24, 2024 at 12:42:56PM +0200, Thomas Hellström wrote:
> Hi, Matt,
> 
> Some random review comments on this patch I came across while looking
> at multi-device.
> 
> Thanks,
> Thomas
> 
> 
> On Tue, 2024-08-27 at 19:48 -0700, Matthew Brost wrote:
> > This patch introduces support for GPU Shared Virtual Memory (SVM) in
> > the
> > Direct Rendering Manager (DRM) subsystem. SVM allows for seamless
> > sharing of memory between the CPU and GPU, enhancing performance and
> > flexibility in GPU computing tasks.
> > 
> > The patch adds the necessary infrastructure for SVM, including data
> > structures and functions for managing SVM ranges and notifiers. It
> > also
> > provides mechanisms for allocating, deallocating, and migrating
> > memory
> > regions between system RAM and GPU VRAM.
> > 
> > This mid-layer is largely inspired by GPUVM.
> 
> NIT: Naming. Should it be drm_svm rather than drm_gpusvm? For the
> drm_gpuvm component, gpuvm clearly distinguished a gpu_vm from a
> mm_struct but here we don't have the same need.
> 

Can rename.

> > 
> > Cc: Dave Airlie <airlied@redhat.com>
> > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > Cc: Christian König <christian.koenig@amd.com>
> > Cc: <dri-devel@lists.freedesktop.org>
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/xe/Makefile     |    3 +-
> >  drivers/gpu/drm/xe/drm_gpusvm.c | 2174
> > +++++++++++++++++++++++++++++++
> >  drivers/gpu/drm/xe/drm_gpusvm.h |  415 ++++++
> >  3 files changed, 2591 insertions(+), 1 deletion(-)
> >  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.c
> >  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.h
> > 
> > diff --git a/drivers/gpu/drm/xe/Makefile
> > b/drivers/gpu/drm/xe/Makefile
> > index b9670ae09a9e..b8fc2ee58f1a 100644
> > --- a/drivers/gpu/drm/xe/Makefile
> > +++ b/drivers/gpu/drm/xe/Makefile
> > @@ -25,7 +25,8 @@ $(obj)/generated/%_wa_oob.c
> > $(obj)/generated/%_wa_oob.h: $(obj)/xe_gen_wa_oob \
> >  
> >  # core driver code
> >  
> > -xe-y += xe_bb.o \
> > +xe-y += drm_gpusvm.o \
> > +	xe_bb.o \
> >  	xe_bo.o \
> >  	xe_bo_evict.o \
> >  	xe_devcoredump.o \
> > diff --git a/drivers/gpu/drm/xe/drm_gpusvm.c
> > b/drivers/gpu/drm/xe/drm_gpusvm.c
> > new file mode 100644
> > index 000000000000..fc1e44e6ae72
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/drm_gpusvm.c
> > @@ -0,0 +1,2174 @@
> > +// SPDX-License-Identifier: MIT
> > +/*
> > + * Copyright © 2024 Intel Corporation
> > + *
> > + * Authors:
> > + *     Matthew Brost <matthew.brost@intel.com>
> > + */
> > +
> > +#include <linux/dma-mapping.h>
> > +#include <linux/interval_tree_generic.h>
> > +#include <linux/hmm.h>
> > +#include <linux/memremap.h>
> > +#include <linux/migrate.h>
> > +#include <linux/mm_types.h>
> > +#include <linux/pagemap.h>
> > +#include <linux/slab.h>
> > +
> > +#include <drm/drm_device.h>
> > +#include "drm_gpusvm.h"
> > +
> > +/**
> > + * DOC: Overview
> > + *
> > + * GPU Shared Virtual Memory (GPU SVM) layer for the Direct
> > Rendering Manager (DRM)
> > + *
> > + * The GPU SVM layer is a component of the DRM framework designed to
> > manage shared
> > + * virtual memory between the CPU and GPU. It enables efficient data
> > exchange and
> > + * processing for GPU-accelerated applications by allowing memory
> > sharing and
> > + * synchronization between the CPU's and GPU's virtual address
> > spaces.
> > + *
> > + * Key GPU SVM Components:
> > + * - Notifiers: Used for tracking memory intervals and
> > notifying the
> > + *		GPU of changes, notifiers are sized based on a GPU
> > SVM
> > + *		initialization parameter, with a recommendation of
> > 512M or
> > + *		larger. They maintain a Red-Black tree and a list of
> > ranges that
> > + *		fall within the notifier interval. Notifiers are
> > tracked within
> > + *		a GPU SVM Red-Black tree and list and are
> > dynamically inserted
> > + *		or removed as ranges within the interval are created
> > or
> > + *		destroyed.
> > + * - Ranges: Represent memory ranges mapped in a DRM device and
> > managed
> > + *	     by GPU SVM. They are sized based on an array of chunk
> > sizes, which
> > + *	     is a GPU SVM initialization parameter, and the CPU
> > address space.
> > + *	     Upon GPU fault, the largest aligned chunk that fits
> > within the
> > + *	     faulting CPU address space is chosen for the range
> > size. Ranges are
> > + *	     expected to be dynamically allocated on GPU fault and
> > removed on an
> > + *	     MMU notifier UNMAP event. As mentioned above, ranges
> > are tracked in
> > + *	     a notifier's Red-Black tree.
> > + * - Operations: Define the interface for driver-specific SVM
> > operations such as
> > + *		 allocation, page collection, migration,
> > invalidations, and VRAM
> > + *		 release.
> > + *
> > + * This layer provides interfaces for allocating, mapping,
> > migrating, and
> > + * releasing memory ranges between the CPU and GPU. It handles all
> > core memory
> > + * management interactions (DMA mapping, HMM, and migration) and
> > provides
> > + * driver-specific virtual functions (vfuncs). This infrastructure
> > is sufficient
> > + * to build the expected driver components for an SVM implementation
> > as detailed
> > + * below.
> > + *
> > + * Expected Driver Components:
> > + * - GPU page fault handler: Used to create ranges and notifiers
> > based on the
> > + *			     fault address, optionally migrate the
> > range to
> > + *			     VRAM, and create GPU bindings.
> > + * - Garbage collector: Used to destroy GPU bindings for ranges.
> > Ranges are
> > + *			expected to be added to the garbage
> > collector upon
> > + *			MMU_NOTIFY_UNMAP event.
> > + */
> > +
> > +/**
> > + * DOC: Locking
> > + *
> > + * GPU SVM handles locking for core MM interactions, i.e., it
> > locks/unlocks the
> > + * mmap lock as needed. Alternatively, if the driver prefers to
> > handle the mmap
> > + * lock itself, a 'locked' argument is provided to the functions
> > that require
> > + * the mmap lock. This option may be useful for drivers that need to
> > call into
> > + * GPU SVM while also holding a dma-resv lock, thus preventing
> > locking
> > + * inversions between the mmap and dma-resv locks.
> > + *
> > + * GPU SVM introduces a global notifier lock, which safeguards the
> > notifier's
> > + * range RB tree and list, as well as the range's DMA mappings and
> > sequence
> > + * number. GPU SVM manages all necessary locking and unlocking
> > operations,
> > + * except for the recheck of the range's sequence number
> > + * (mmu_interval_read_retry) when the driver is committing GPU
> > bindings. This
> > + * lock corresponds to the 'driver->update' lock mentioned in the
> > HMM
> > + * documentation (TODO: Link). Future revisions may transition from
> > a GPU SVM
> > + * global lock to a per-notifier lock if finer-grained locking is
> > deemed
> > + * necessary.
> > + *
> > + * In addition to the locking mentioned above, the driver should
> > implement a
> > + * lock to safeguard core GPU SVM function calls that modify state,
> > such as
> > + * drm_gpusvm_range_find_or_insert and drm_gpusvm_range_remove.
> > Alternatively,
> > + * these core functions can be called within a single kernel thread,
> > for
> > + * instance, using an ordered work queue. This lock is denoted as
> > + * 'driver_svm_lock' in code examples.
> > + */
> > +
> > +/**
> > + * DOC: Migration
> > + *
> > + * The migration support is quite simple, allowing migration between
> > SRAM and
> > + * VRAM at the range granularity. For example, GPU SVM currently
> > does not
> > + * support mixing SRAM and VRAM pages within a range. This means
> > that upon GPU
> > + * fault, the entire range can be migrated to VRAM, and upon CPU
> > fault, the
> > + * entire range is migrated to SRAM.
> > + *
> > + * The reasoning for only supporting range granularity is as
> > follows: it
> > + * simplifies the implementation, and range sizes are driver-defined
> > and should
> > + * be relatively small.
> > + */
> > +
> > +/**
> > + * DOC: Partial Unmapping of Ranges
> > + *
> > + * Partial unmapping of ranges (e.g., 1M out of 2M is unmapped by
> > CPU resulting
> > + * in MMU_NOTIFY_UNMAP event) presents several challenges, with the
> > main one
> > + * being that a subset of the range still has CPU and GPU mappings.
> > If the
> > + * backing store for the range is in VRAM, a subset of the backing
> > store has
> > + * references. One option would be to split the range and VRAM
> > backing store,
> > + * but the implementation for this would be quite complicated. Given
> > that
> > + * partial unmappings are rare and driver-defined range sizes are
> > relatively
> > + * small, GPU SVM does not support splitting of ranges.
> > + *
> > + * With no support for range splitting, upon partial unmapping of a
> > range, the
> > + * driver is expected to invalidate and destroy the entire range. If
> > the range
> > + * has VRAM as its backing, the driver is also expected to migrate
> > any remaining
> > + * pages back to SRAM.
> > + */
> > +
> > +/**
> > + * DOC: Examples
> > + *
> > + * This section provides two examples of how to build the expected
> > driver
> > + * components: the GPU page fault handler and the garbage collector.
> > A third
> > + * example demonstrates a sample invalidation driver vfunc.
> > + *
> > + * The generic code provided does not include logic for complex
> > migration
> > + * policies, optimized invalidations, or other potentially required
> > driver
> > + * locking (e.g., DMA-resv locks).
> > + *
> > + * 1) GPU page fault handler
> > + *
> > + *	int driver_bind_range(struct drm_gpusvm *gpusvm, struct
> > drm_gpusvm_range *range)
> > + *	{
> > + *		int err = 0;
> > + *
> > + *		driver_alloc_and_setup_memory_for_bind(gpusvm,
> > range);
> > + *
> > + *		drm_gpusvm_notifier_lock(gpusvm);
> > + *		if (drm_gpusvm_range_pages_valid(range))
> > + *			driver_commit_bind(gpusvm, range);
> > + *		else
> > + *			err = -EAGAIN;
> > + *		drm_gpusvm_notifier_unlock(gpusvm);
> > + *
> > + *		return err;
> > + *	}
> > + *
> > + *	int driver_gpu_fault(struct drm_gpusvm *gpusvm, u64
> > fault_addr,
> > + *			     u64 gpuva_start, u64 gpuva_end)
> > + *	{
> > + *		struct drm_gpusvm_ctx ctx = {};
> > + *		int err;
> > + *
> > + *		driver_svm_lock();
> > + *	retry:
> > + *		// Always process UNMAPs first so view of GPU SVM
> > ranges is current
> > + *		driver_garbage_collector(gpusvm);
> > + *
> > + *		range = drm_gpusvm_range_find_or_insert(gpusvm,
> > fault_addr,
> > + *							gpuva_start,
> > gpuva_end,
> > + *						        &ctx);
> > + *		if (IS_ERR(range)) {
> > + *			err = PTR_ERR(range);
> > + *			goto unlock;
> > + *		}
> > + *
> > + *		if (driver_migration_policy(range)) {
> > + *			bo = driver_alloc_bo();
> > + *			err = drm_gpusvm_migrate_to_vram(gpusvm,
> > range, bo, &ctx);
> > + *			if (err)	// CPU mappings may have
> > changed
> > + *				goto retry;
> > + *		}
> > + *
> > + *		err = drm_gpusvm_range_get_pages(gpusvm, range,
> > &ctx);
> > + *		if (err == -EFAULT || err == -EPERM)	// CPU
> > mappings changed
> > + *			goto retry;
> > + *		else if (err)
> > + *			goto unlock;
> > + *
> > + *		err = driver_bind_range(gpusvm, range);
> > + *		if (err == -EAGAIN)	// CPU mappings changed
> > + *			goto retry
> > + *
> > + *	unlock:
> > + *		driver_svm_unlock();
> > + *		return err;
> > + *	}
> > + *
> > + * 2) Garbage Collector.
> > + *
> > + *	void __driver_garbage_collector(struct drm_gpusvm *gpusvm,
> > + *					struct drm_gpusvm_range
> > *range)
> > + *	{
> > + *		struct drm_gpusvm_ctx ctx = {};
> > + *
> > + *		assert_driver_svm_locked(gpusvm);
> > + *
> > + *		// Partial unmap, migrate any remaining VRAM pages
> > back to SRAM
> > + *		if (range->flags.partial_unmap)
> > + *			drm_gpusvm_migrate_to_sram(gpusvm, range,
> > &ctx);
> > + *
> > + *		driver_unbind_range(range);
> > + *		drm_gpusvm_range_remove(gpusvm, range);
> > + *	}
> > + *
> > + *	void driver_garbage_collector(struct drm_gpusvm *gpusvm)
> > + *	{
> > + *		assert_driver_svm_locked(gpusvm);
> > + *
> > + *		for_each_range_in_garbage_collector(gpusvm, range)
> > + *			__driver_garbage_collector(gpusvm, range);
> > + *	}
> > + *
> > + * 3) Invalidation driver vfunc.
> > + *
> > + *	void driver_invalidation(struct drm_gpusvm *gpusvm,
> > + *				 struct drm_gpusvm_notifier
> > *notifier,
> > + *				 const struct mmu_notifier_range
> > *mmu_range)
> > + *	{
> > + *		struct drm_gpusvm_ctx ctx = { .in_notifier = true,
> > };
> > + *		struct drm_gpusvm_range *range = NULL;
> > + *
> > + *		driver_invalidate_device_tlb(gpusvm, mmu_range-
> > >start, mmu_range->end);
> > + *
> > + *		drm_gpusvm_for_each_range(range, notifier,
> > mmu_range->start,
> > + *					  mmu_range->end) {
> > + *			drm_gpusvm_range_unmap_pages(gpusvm, range,
> > &ctx);
> > + *
> > + *			if (mmu_range->event != MMU_NOTIFY_UNMAP)
> > + *				continue;
> > + *
> > + *			drm_gpusvm_range_set_unmapped(range,
> > mmu_range);
> > + *			driver_garbage_collector_add(gpusvm, range);
> > + *		}
> > + *	}
> > + */
> > +
> > +#define DRM_GPUSVM_RANGE_START(_range)	((_range)->va.start)
> > +#define DRM_GPUSVM_RANGE_END(_range)	((_range)->va.end - 1)
> > +INTERVAL_TREE_DEFINE(struct drm_gpusvm_range, rb.node, u64,
> > rb.__subtree_last,
> > +		     DRM_GPUSVM_RANGE_START, DRM_GPUSVM_RANGE_END,
> > +		     static __maybe_unused, range);
> > +
> > +#define DRM_GPUSVM_NOTIFIER_START(_notifier)	((_notifier)-
> > >interval.start)
> > +#define DRM_GPUSVM_NOTIFIER_END(_notifier)	((_notifier)-
> > >interval.end - 1)
> > +INTERVAL_TREE_DEFINE(struct drm_gpusvm_notifier, rb.node, u64,
> > +		     rb.__subtree_last, DRM_GPUSVM_NOTIFIER_START,
> > +		     DRM_GPUSVM_NOTIFIER_END, static __maybe_unused,
> > notifier);
> > +
> 
> Since these trees span struct mm_struct address space which should fit
> in an unsigned long, can we use the generic version (interval_tree.h)
> rather than instantiating two new versions? I figure both contain
> overlapping ranges so we can't use maple trees?
> 

I can look into using a generic version, but actually I don't think we
allow overlapping, so a maple tree might work here too. I'll likely use
a generic version in the next rev, but if the consensus is a maple tree
we can switch over to that fairly easily at any point in time, as the
tree interaction is completely encapsulated in the DRM SVM layer.
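
Roughly, I'd expect the generic version to look something like this (sketch
only; assumes unsigned long is wide enough for the CPU address space, and the
field names are illustrative):

	#include <linux/interval_tree.h>

	struct drm_gpusvm_range {
		...
		struct {
			struct interval_tree_node itree;	/* holds start / last */
			struct list_head entry;
		} rb;
		...
	};

	/* insert -- the generic tree takes an inclusive 'last' */
	range->rb.itree.start = range->va.start;
	range->rb.itree.last = range->va.end - 1;
	interval_tree_insert(&range->rb.itree, &notifier->root);

	/* lookup, replacing range_iter_first() */
	struct interval_tree_node *node;

	node = interval_tree_iter_first(&notifier->root, start, end - 1);
	range = node ? container_of(node, struct drm_gpusvm_range, rb.itree) :
		NULL;

Same idea for the notifier tree hanging off the gpusvm root.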

> > +/**
> > + * npages_in_range() - Calculate the number of pages in a given
> > range
> > + * @start__: The start address of the range
> > + * @end__: The end address of the range
> > + *
> > + * This macro calculates the number of pages in a given memory
> > range,
> > + * specified by the start and end addresses. It divides the
> > difference
> > + * between the end and start addresses by the page size (PAGE_SIZE)
> > to
> > + * determine the number of pages in the range.
> > + *
> > + * Return: The number of pages in the specified range.
> > + */
> > +#define npages_in_range(start__, end__)	\
> > +	(((end__) - (start__)) >> PAGE_SHIFT)
> > +
> > +/**
> > + * struct drm_gpusvm_zdd - GPU SVM zone device data
> > + *
> > + * @refcount: Reference count for the zdd
> > + * @destroy_work: Work structure for asynchronous zdd destruction
> > + * @range: Pointer to the GPU SVM range
> > + * @vram_allocation: Driver-private pointer to the VRAM allocation
> > + *
> > + * This structure serves as a generic wrapper installed in
> > + * page->zone_device_data. It provides infrastructure for looking up
> > a range
> > + * upon CPU page fault and asynchronously releasing VRAM once the
> > CPU has no
> > + * page references. Asynchronous release is useful because CPU page
> > references
> > + * can be dropped in IRQ contexts, while releasing VRAM likely
> > requires sleeping
> > + * locks.
> > + */
> > +struct drm_gpusvm_zdd {
> > +	struct kref refcount;
> > +	struct work_struct destroy_work;
> > +	struct drm_gpusvm_range *range;
>  
> I still believe previous review comments are valid here, considering we
> do have multiple drm_gpusvm per struct mm_struct, potentially all
> mapping the above page.
> 

Exactly which comments?

If it is related to the range pointer, that is going to be dropped. All
virtual references from zdd will be dropped (i.e. no pointer to even a
DRM SVM).
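
Roughly, the zdd would shrink to something like this (sketch only, not
final):

	struct drm_gpusvm_zdd {
		struct kref refcount;
		struct work_struct destroy_work;
		void *vram_allocation;	/* to be renamed devmem, see below */
	};

with the release path going through the allocation itself rather than
through range->gpusvm->ops, which ties into your embeddable struct
suggestion below.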

> > +	void *vram_allocation;
> 
> NIT: Naming. The core is using device memory or devmem. Should we
> follow?
>

I like devmem. Will change.
 
> Also could we, rather than using a void *, use an embeddable struct
> with its own ops rather than using the gpusvm ops for this?
> 

Can you give me a code snippet example of what you think this should look
like? Not opposed to this.
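
Something like this, maybe? Just guessing at what you have in mind here,
names purely illustrative:

	struct drm_gpusvm_devmem;

	struct drm_gpusvm_devmem_ops {
		void (*release)(struct drm_gpusvm_devmem *devmem);
	};

	struct drm_gpusvm_devmem {
		const struct drm_gpusvm_devmem_ops *ops;
	};

	/* driver embeds it in its allocation object, e.g. */
	struct xe_bo {
		...
		struct drm_gpusvm_devmem devmem;
	};

The zdd would then hold a struct drm_gpusvm_devmem * instead of a void *,
and the destroy worker could call devmem->ops->release(devmem) directly
without reaching back through range->gpusvm.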

> > +};
> > +
> > +/**
> > + * drm_gpusvm_zdd_destroy_work_func - Work function for destroying a
> > zdd
> > + * @w: Pointer to the work_struct
> > + *
> > + * This function releases VRAM, puts GPU SVM range, and frees zdd.
> > + */
> > +static void drm_gpusvm_zdd_destroy_work_func(struct work_struct *w)
> > +{
> > +	struct drm_gpusvm_zdd *zdd =
> > +		container_of(w, struct drm_gpusvm_zdd,
> > destroy_work);
> > +	struct drm_gpusvm_range *range = zdd->range;
> > +	struct drm_gpusvm *gpusvm = range->gpusvm;
> > +
> > +	if (gpusvm->ops->vram_release && zdd->vram_allocation)
> > +		gpusvm->ops->vram_release(zdd->vram_allocation);
> > +	drm_gpusvm_range_put(range);
> > +	kfree(zdd);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_zdd_alloc - Allocate a zdd structure.
> > + * @range: Pointer to the GPU SVM range.
> > + *
> > + * This function allocates and initializes a new zdd structure. It
> > sets up the
> > + * reference count, initializes the destroy work, and links the
> > provided GPU SVM
> > + * range.
> > + *
> > + * Returns:
> > + * Pointer to the allocated zdd on success, ERR_PTR() on failure.
> > + */
> > +static struct drm_gpusvm_zdd *
> > +drm_gpusvm_zdd_alloc(struct drm_gpusvm_range *range)
> > +{
> > +	struct drm_gpusvm_zdd *zdd;
> > +
> > +	zdd = kmalloc(sizeof(*zdd), GFP_KERNEL);
> > +	if (!zdd)
> > +		return NULL;
> > +
> > +	kref_init(&zdd->refcount);
> > +	INIT_WORK(&zdd->destroy_work,
> > drm_gpusvm_zdd_destroy_work_func);
> > +	zdd->range = drm_gpusvm_range_get(range);
> > +	zdd->vram_allocation = NULL;
> > +
> > +	return zdd;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_zdd_get - Get a reference to a zdd structure.
> > + * @zdd: Pointer to the zdd structure.
> > + *
> > + * This function increments the reference count of the provided zdd
> > structure.
> > + *
> > + * Returns: Pointer to the zdd structure.
> > + */
> > +static struct drm_gpusvm_zdd *drm_gpusvm_zdd_get(struct
> > drm_gpusvm_zdd *zdd)
> > +{
> > +	kref_get(&zdd->refcount);
> > +	return zdd;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_zdd_destroy - Destroy a zdd structure.
> > + * @ref: Pointer to the reference count structure.
> > + *
> > + * This function queues the destroy_work of the zdd for asynchronous
> > destruction.
> > + */
> > +static void drm_gpusvm_zdd_destroy(struct kref *ref)
> > +{
> > +	struct drm_gpusvm_zdd *zdd =
> > +		container_of(ref, struct drm_gpusvm_zdd, refcount);
> > +	struct drm_gpusvm *gpusvm = zdd->range->gpusvm;
> > +
> > +	queue_work(gpusvm->zdd_wq, &zdd->destroy_work);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_zdd_put - Put a zdd reference.
> > + * @zdd: Pointer to the zdd structure.
> > + *
> > + * This function decrements the reference count of the provided zdd
> > structure
> > + * and schedules its destruction if the count drops to zero.
> > + */
> > +static void drm_gpusvm_zdd_put(struct drm_gpusvm_zdd *zdd)
> > +{
> > +	kref_put(&zdd->refcount, drm_gpusvm_zdd_destroy);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_find - Find GPU SVM range from GPU SVM notifier
> > + * @notifier: Pointer to the GPU SVM notifier structure.
> > + * @start: Start address of the range
> > + * @end: End address of the range
> > + *
> > + * Return: A pointer to the drm_gpusvm_range if found or NULL
> > + */
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64
> > start, u64 end)
> > +{
> > +	return range_iter_first(&notifier->root, start, end - 1);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_for_each_range_safe - Safely iterate over GPU SVM
> > ranges in a notifier
> > + * @range__: Iterator variable for the ranges
> > + * @next__: Iterator variable for the ranges' temporary storage
> > + * @notifier__: Pointer to the GPU SVM notifier
> > + * @start__: Start address of the range
> > + * @end__: End address of the range
> > + *
> > + * This macro is used to iterate over GPU SVM ranges in a notifier
> > while
> > + * removing ranges from it.
> > + */
> > +#define drm_gpusvm_for_each_range_safe(range__, next__, notifier__,
> > start__, end__)	\
> > +	for ((range__) = drm_gpusvm_range_find((notifier__),
> > (start__), (end__)),	\
> > +	     (next__) =
> > __drm_gpusvm_range_next(range__);				\
> > +	     (range__) && (range__->va.start <
> > (end__));				\
> > +	     (range__) = (next__), (next__) =
> > __drm_gpusvm_range_next(range__))
> > +
> > +/**
> > + * __drm_gpusvm_notifier_next - get the next drm_gpusvm_notifier in
> > the list
> > + * @notifier: a pointer to the current drm_gpusvm_notifier
> > + *
> > + * Return: A pointer to the next drm_gpusvm_notifier if available,
> > or NULL if
> > + *         the current notifier is the last one or if the input
> > notifier is
> > + *         NULL.
> > + */
> > +static struct drm_gpusvm_notifier *
> > +__drm_gpusvm_notifier_next(struct drm_gpusvm_notifier *notifier)
> > +{
> > +	if (notifier && !list_is_last(&notifier->rb.entry,
> > +				      &notifier->gpusvm-
> > >notifier_list))
> > +		return list_next_entry(notifier, rb.entry);
> > +
> > +	return NULL;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_for_each_notifier - Iterate over GPU SVM notifiers in
> > a gpusvm
> > + * @notifier__: Iterator variable for the notifiers
> > + * @gpusvm__: Pointer to the GPU SVM structure
> > + * @start__: Start address of the notifier
> > + * @end__: End address of the notifier
> > + *
> > + * This macro is used to iterate over GPU SVM notifiers in a gpusvm.
> > + */
> > +#define drm_gpusvm_for_each_notifier(notifier__, gpusvm__, start__,
> > end__)		\
> > +	for ((notifier__) = notifier_iter_first(&(gpusvm__)->root,
> > (start__), (end__) - 1);	\
> > +	     (notifier__) && (notifier__->interval.start <
> > (end__));			\
> > +	     (notifier__) = __drm_gpusvm_notifier_next(notifier__))
> > +
> > +/**
> > + * drm_gpusvm_for_each_notifier_safe - Safely iterate over GPU SVM
> > notifiers in a gpusvm
> > + * @notifier__: Iterator variable for the notifiers
> > + * @next__: Iterator variable for the notifiers' temporary storage
> > + * @gpusvm__: Pointer to the GPU SVM structure
> > + * @start__: Start address of the notifier
> > + * @end__: End address of the notifier
> > + *
> > + * This macro is used to iterate over GPU SVM notifiers in a gpusvm
> > while
> > + * removing notifiers from it.
> > + */
> > +#define drm_gpusvm_for_each_notifier_safe(notifier__, next__,
> > gpusvm__, start__, end__)	\
> > +	for ((notifier__) = notifier_iter_first(&(gpusvm__)->root,
> > (start__), (end__) - 1),	\
> > +	     (next__) =
> > __drm_gpusvm_notifier_next(notifier__);				\
> > +	     (notifier__) && (notifier__->interval.start <
> > (end__));			\
> > +	     (notifier__) = (next__), (next__) =
> > __drm_gpusvm_notifier_next(notifier__))
> > +
> > +/**
> > + * drm_gpusvm_notifier_invalidate - Invalidate a GPU SVM notifier.
> > + * @mni: Pointer to the mmu_interval_notifier structure.
> > + * @mmu_range: Pointer to the mmu_notifier_range structure.
> > + * @cur_seq: Current sequence number.
> > + *
> > + * This function serves as a generic MMU notifier for GPU SVM. It
> > sets the MMU
> > + * notifier sequence number and calls the driver invalidate vfunc
> > under
> > + * gpusvm->notifier_lock.
> > + *
> > + * Returns:
> > + * true if the operation succeeds, false otherwise.
> > + */
> > +static bool
> > +drm_gpusvm_notifier_invalidate(struct mmu_interval_notifier *mni,
> > +			       const struct mmu_notifier_range
> > *mmu_range,
> > +			       unsigned long cur_seq)
> > +{
> > +	struct drm_gpusvm_notifier *notifier =
> > +		container_of(mni, typeof(*notifier), notifier);
> > +	struct drm_gpusvm *gpusvm = notifier->gpusvm;
> > +
> > +	if (!mmu_notifier_range_blockable(mmu_range))
> > +		return false;
> > +
> > +	down_write(&gpusvm->notifier_lock);
> > +	mmu_interval_set_seq(mni, cur_seq);
> > +	gpusvm->ops->invalidate(gpusvm, notifier, mmu_range);
> > +	up_write(&gpusvm->notifier_lock);
> > +
> > +	return true;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_notifier_ops - MMU interval notifier operations for
> > GPU SVM
> > + */
> > +static const struct mmu_interval_notifier_ops
> > drm_gpusvm_notifier_ops = {
> > +	.invalidate = drm_gpusvm_notifier_invalidate,
> > +};
> > +
> > +/**
> > + * drm_gpusvm_init - Initialize the GPU SVM.
> > + * @gpusvm: Pointer to the GPU SVM structure.
> > + * @name: Name of the GPU SVM.
> > + * @drm: Pointer to the DRM device structure.
> > + * @mm: Pointer to the mm_struct for the address space.
> > + * @device_private_page_owner: Device private pages owner.
> > + * @mm_start: Start address of GPU SVM.
> > + * @mm_range: Range of the GPU SVM.
> > + * @notifier_size: Size of individual notifiers.
> > + * @ops: Pointer to the operations structure for GPU SVM.
> > + * @chunk_sizes: Pointer to the array of chunk sizes used in range
> > allocation.
> > + *               Entries should be powers of 2 in descending order
> > with last
> > + *               entry being SZ_4K.
> > + * @num_chunks: Number of chunks.
> > + *
> > + * This function initializes the GPU SVM.
> > + *
> > + * Returns:
> > + * 0 on success, a negative error code on failure.
> > + */
> > +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> > +		    const char *name, struct drm_device *drm,
> > +		    struct mm_struct *mm, void
> > *device_private_page_owner,
> > +		    u64 mm_start, u64 mm_range, u64 notifier_size,
> > +		    const struct drm_gpusvm_ops *ops,
> > +		    const u64 *chunk_sizes, int num_chunks)
> > +{
> > +	if (!ops->invalidate || !num_chunks)
> > +		return -EINVAL;
> > +
> > +	gpusvm->name = name;
> > +	gpusvm->drm = drm;
> > +	gpusvm->mm = mm;
> > +	gpusvm->device_private_page_owner =
> > device_private_page_owner;
> > +	gpusvm->mm_start = mm_start;
> > +	gpusvm->mm_range = mm_range;
> > +	gpusvm->notifier_size = notifier_size;
> > +	gpusvm->ops = ops;
> > +	gpusvm->chunk_sizes = chunk_sizes;
> > +	gpusvm->num_chunks = num_chunks;
> > +	gpusvm->zdd_wq = system_wq;
> > +
> > +	mmgrab(mm);
> > +	gpusvm->root = RB_ROOT_CACHED;
> > +	INIT_LIST_HEAD(&gpusvm->notifier_list);
> > +
> > +	init_rwsem(&gpusvm->notifier_lock);
> > +
> > +	fs_reclaim_acquire(GFP_KERNEL);
> > +	might_lock(&gpusvm->notifier_lock);
> > +	fs_reclaim_release(GFP_KERNEL);
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_notifier_find - Find GPU SVM notifier
> > + * @gpusvm__: Pointer to the GPU SVM structure
> > + * @fault_addr__: Fault address
> > + *
> > + * This macro finds the GPU SVM notifier associated with the fault
> > address.
> > + *
> > + * Returns:
> > + * Pointer to the GPU SVM notifier on success, NULL otherwise.
> > + */
> > +#define drm_gpusvm_notifier_find(gpusvm__, fault_addr__)	\
> > +	notifier_iter_first(&(gpusvm__)->root, (fault_addr__),	\
> > +			    (fault_addr__ + 1))
> > +
> > +/**
> > + * to_drm_gpusvm_notifier - retrieve the container struct for a
> > given rbtree node
> > + * @node__: a pointer to the rbtree node embedded within a
> > drm_gpusvm_notifier struct
> > + *
> > + * Return: A pointer to the containing drm_gpusvm_notifier
> > structure.
> > + */
> > +#define to_drm_gpusvm_notifier(__node)				\
> > +	container_of((__node), struct drm_gpusvm_notifier, rb.node)
> > +
> > +/**
> > + * drm_gpusvm_notifier_insert - Insert GPU SVM notifier
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + *
> > + * This function inserts the GPU SVM notifier into the GPU SVM RB
> > tree and list.
> > + */
> > +static void drm_gpusvm_notifier_insert(struct drm_gpusvm *gpusvm,
> > +				       struct drm_gpusvm_notifier
> > *notifier)
> > +{
> > +	struct rb_node *node;
> > +	struct list_head *head;
> > +
> > +	notifier_insert(notifier, &gpusvm->root);
> > +
> > +	node = rb_prev(&notifier->rb.node);
> > +	if (node)
> > +		head = &(to_drm_gpusvm_notifier(node))->rb.entry;
> > +	else
> > +		head = &gpusvm->notifier_list;
> > +
> > +	list_add(&notifier->rb.entry, head);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_notifier_remove - Remove GPU SVM notifier
> > + * @gpusvm__: Pointer to the GPU SVM structure
> > + * @notifier__: Pointer to the GPU SVM notifier structure
> > + *
> > + * This macro removes the GPU SVM notifier from the GPU SVM RB tree
> > and list.
> > + */
> > +#define drm_gpusvm_notifier_remove(gpusvm__, notifier__)	\
> > +	notifier_remove((notifier__), &(gpusvm__)->root);	\
> > +	list_del(&(notifier__)->rb.entry)
> > +
> > +/**
> > + * drm_gpusvm_fini - Finalize the GPU SVM.
> > + * @gpusvm: Pointer to the GPU SVM structure.
> > + *
> > + * This function finalizes the GPU SVM by cleaning up any remaining
> > ranges and
> > + * notifiers, and dropping a reference to struct MM.
> > + */
> > +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm)
> > +{
> > +	struct drm_gpusvm_notifier *notifier, *next;
> > +
> > +	drm_gpusvm_for_each_notifier_safe(notifier, next, gpusvm, 0,
> > LONG_MAX) {
> > +		struct drm_gpusvm_range *range, *__next;
> > +
> > +		/*
> > +		 * Remove notifier first to avoid racing with any
> > invalidation
> > +		 */
> > +		mmu_interval_notifier_remove(&notifier->notifier);
> > +		notifier->flags.removed = true;
> > +
> > +		drm_gpusvm_for_each_range_safe(range, __next,
> > notifier, 0,
> > +					       LONG_MAX)
> > +			drm_gpusvm_range_remove(gpusvm, range);
> > +	}
> > +
> > +	mmdrop(gpusvm->mm);
> > +	WARN_ON(!RB_EMPTY_ROOT(&gpusvm->root.rb_root));
> > +}
> > +
> > +/**
> > + * drm_gpusvm_notifier_alloc - Allocate GPU SVM notifier
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @fault_addr: Fault address
> > + *
> > + * This function allocates and initializes the GPU SVM notifier
> > structure.
> > + *
> > + * Returns:
> > + * Pointer to the allocated GPU SVM notifier on success, ERR_PTR()
> > on failure.
> > + */
> > +static struct drm_gpusvm_notifier *
> > +drm_gpusvm_notifier_alloc(struct drm_gpusvm *gpusvm, u64 fault_addr)
> > +{
> > +	struct drm_gpusvm_notifier *notifier;
> > +
> > +	if (gpusvm->ops->notifier_alloc)
> > +		notifier = gpusvm->ops->notifier_alloc();
> > +	else
> > +		notifier = kzalloc(sizeof(*notifier), GFP_KERNEL);
> > +
> > +	if (!notifier)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	notifier->gpusvm = gpusvm;
> > +	notifier->interval.start = ALIGN_DOWN(fault_addr, gpusvm-
> > >notifier_size);
> > +	notifier->interval.end = ALIGN(fault_addr + 1, gpusvm-
> > >notifier_size);
> > +	INIT_LIST_HEAD(&notifier->rb.entry);
> > +	notifier->root = RB_ROOT_CACHED;
> > +	INIT_LIST_HEAD(&notifier->range_list);
> > +
> > +	return notifier;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_notifier_free - Free GPU SVM notifier
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + *
> > + * This function frees the GPU SVM notifier structure.
> > + */
> > +static void drm_gpusvm_notifier_free(struct drm_gpusvm *gpusvm,
> > +				     struct drm_gpusvm_notifier
> > *notifier)
> > +{
> > +	WARN_ON(!RB_EMPTY_ROOT(&notifier->root.rb_root));
> > +
> > +	if (gpusvm->ops->notifier_free)
> > +		gpusvm->ops->notifier_free(notifier);
> > +	else
> > +		kfree(notifier);
> > +}
> > +
> > +/**
> > + * to_drm_gpusvm_range - retrieve the container struct for a given
> > rbtree node
> > + * @node__: a pointer to the rbtree node embedded within a
> > drm_gpusvm_range struct
> > + *
> > + * Return: A pointer to the containing drm_gpusvm_range structure.
> > + */
> > +#define to_drm_gpusvm_range(node__)	\
> > +	container_of((node__), struct drm_gpusvm_range, rb.node)
> > +
> > +/**
> > + * drm_gpusvm_range_insert - Insert GPU SVM range
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + * @range: Pointer to the GPU SVM range structure
> > + *
> > + * This function inserts the GPU SVM range into the notifier RB tree
> > and list.
> > + */
> > +static void drm_gpusvm_range_insert(struct drm_gpusvm_notifier
> > *notifier,
> > +				    struct drm_gpusvm_range *range)
> > +{
> > +	struct rb_node *node;
> > +	struct list_head *head;
> > +
> > +	drm_gpusvm_notifier_lock(notifier->gpusvm);
> > +	range_insert(range, &notifier->root);
> > +
> > +	node = rb_prev(&range->rb.node);
> > +	if (node)
> > +		head = &(to_drm_gpusvm_range(node))->rb.entry;
> > +	else
> > +		head = &notifier->range_list;
> > +
> > +	list_add(&range->rb.entry, head);
> > +	drm_gpusvm_notifier_unlock(notifier->gpusvm);
> > +}
> > +
> > +/**
> > + * __drm_gpusvm_range_remove - Remove GPU SVM range
> > + * @notifier__: Pointer to the GPU SVM notifier structure
> > + * @range__: Pointer to the GPU SVM range structure
> > + *
> > + * This macro removes the GPU SVM range from the notifier RB tree
> > and list.
> > + */
> > +#define __drm_gpusvm_range_remove(notifier__, range__)		\
> > +	range_remove((range__), &(notifier__)->root);		\
> > +	list_del(&(range__)->rb.entry)
> > +
> > +/**
> > + * drm_gpusvm_range_alloc - Allocate GPU SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + * @fault_addr: Fault address
> > + * @chunk_size: Chunk size
> > + * @migrate_vram: Flag indicating whether to migrate VRAM
> > + *
> > + * This function allocates and initializes the GPU SVM range
> > structure.
> > + *
> > + * Returns:
> > + * Pointer to the allocated GPU SVM range on success, ERR_PTR() on
> > failure.
> > + */
> > +static struct drm_gpusvm_range *
> > +drm_gpusvm_range_alloc(struct drm_gpusvm *gpusvm,
> > +		       struct drm_gpusvm_notifier *notifier,
> > +		       u64 fault_addr, u64 chunk_size, bool
> > migrate_vram)
> > +{
> > +	struct drm_gpusvm_range *range;
> > +
> > +	if (gpusvm->ops->range_alloc)
> > +		range = gpusvm->ops->range_alloc(gpusvm);
> > +	else
> > +		range = kzalloc(sizeof(*range), GFP_KERNEL);
> > +
> > +	if (!range)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	kref_init(&range->refcount);
> > +	range->gpusvm = gpusvm;
> > +	range->notifier = notifier;
> > +	range->va.start = ALIGN_DOWN(fault_addr, chunk_size);
> > +	range->va.end = ALIGN(fault_addr + 1, chunk_size);
> > +	INIT_LIST_HEAD(&range->rb.entry);
> > +	range->notifier_seq = LONG_MAX;
> > +	range->flags.migrate_vram = migrate_vram ? 1 : 0;
> > +
> > +	return range;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_check_pages - Check pages
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + * @start: Start address
> > + * @end: End address
> > + *
> > + * Check if pages between start and end have been faulted in on the
> > CPU. Used to
> > + * prevent migration of pages without CPU backing store.
> > + *
> > + * Returns:
> > + * True if pages have been faulted into CPU, False otherwise
> > + */
> > +static bool drm_gpusvm_check_pages(struct drm_gpusvm *gpusvm,
> > +				   struct drm_gpusvm_notifier
> > *notifier,
> > +				   u64 start, u64 end)
> > +{
> > +	struct hmm_range hmm_range = {
> > +		.default_flags = 0,
> > +		.notifier = &notifier->notifier,
> > +		.start = start,
> > +		.end = end,
> > +		.dev_private_owner = gpusvm-
> > >device_private_page_owner,
> > +	};
> > +	unsigned long timeout =
> > +		jiffies +
> > msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > +	unsigned long *pfns;
> > +	unsigned long npages = npages_in_range(start, end);
> > +	int err, i;
> > +
> > +	mmap_assert_locked(gpusvm->mm);
> > +
> > +	pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
> > +	if (!pfns)
> > +		return false;
> > +
> > +	hmm_range.notifier_seq = mmu_interval_read_begin(&notifier-
> > >notifier);
> > +	hmm_range.hmm_pfns = pfns;
> > +
> > +	while (true) {
> > +		err = hmm_range_fault(&hmm_range);
> > +		if (err == -EBUSY) {
> > +			if (time_after(jiffies, timeout))
> > +				break;
> > +
> > +			hmm_range.notifier_seq =
> > mmu_interval_read_begin(&notifier->notifier);
> > +			continue;
> > +		}
> > +		break;
> > +	}
> > +	if (err)
> > +		goto err_free;
> > +
> > +	for (i = 0; i < npages; ++i) {
> > +		if (!(pfns[i] & HMM_PFN_VALID)) {
> > +			err = -EFAULT;
> > +			goto err_free;
> > +		}
> > +	}
> > +
> > +err_free:
> > +	kvfree(pfns);
> > +	return err ? false : true;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_chunk_size - Determine chunk size for GPU SVM
> > range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + * @vas: Pointer to the virtual memory area structure
> > + * @fault_addr: Fault address
> > + * @gpuva_start: Start address of GPUVA which mirrors CPU
> > + * @gpuva_end: End address of GPUVA which mirrors CPU
> > + * @check_pages: Flag indicating whether to check pages
> > + *
> > + * This function determines the chunk size for the GPU SVM range
> > based on the
> > + * fault address, GPU SVM chunk sizes, existing GPU SVM ranges, and
> > the virtual
> > + * memory area boundaries.
> > + *
> > + * Returns:
> > + * Chunk size on success, LONG_MAX on failure.
> > + */
> > +static u64 drm_gpusvm_range_chunk_size(struct drm_gpusvm *gpusvm,
> > +				       struct drm_gpusvm_notifier
> > *notifier,
> > +				       struct vm_area_struct *vas,
> > +				       u64 fault_addr, u64
> > gpuva_start,
> > +				       u64 gpuva_end, bool
> > check_pages)
> > +{
> > +	u64 start, end;
> > +	int i = 0;
> > +
> > +retry:
> > +	for (; i < gpusvm->num_chunks; ++i) {
> > +		start = ALIGN_DOWN(fault_addr, gpusvm-
> > >chunk_sizes[i]);
> > +		end = ALIGN(fault_addr + 1, gpusvm->chunk_sizes[i]);
> > +
> > +		if (start >= vas->vm_start && end <= vas->vm_end &&
> > +		    start >= notifier->interval.start &&
> > +		    end <= notifier->interval.end &&
> > +		    start >= gpuva_start && end <= gpuva_end)
> > +			break;
> > +	}
> > +
> > +	if (i == gpusvm->num_chunks)
> > +		return LONG_MAX;
> > +
> > +	/*
> > +	 * If allocating more than a page, ensure not to overlap with
> > existing
> > +	 * ranges.
> > +	 */
> > +	if (end - start != SZ_4K) {
> > +		struct drm_gpusvm_range *range;
> > +
> > +		range = drm_gpusvm_range_find(notifier, start, end);
> > +		if (range) {
> > +			++i;
> > +			goto retry;
> > +		}
> > +
> > +		/*
> > +		 * XXX: Only create range on pages CPU has faulted
> > in. Without
> > +		 * this check, or prefault, on BMG
> > 'xe_exec_system_allocator --r
> > +		 * process-many-malloc' fails. In the failure case,
> > each process
> > +		 * mallocs 16k but the CPU VMA is ~128k which
> > results in 64k SVM
> > +		 * ranges. When migrating the SVM ranges, some
> > processes fail in
> > +		 * drm_gpusvm_migrate_to_vram with 'migrate.cpages
> > != npages'
> > +		 * and then upon drm_gpusvm_range_get_pages device
> > pages from
> > +		 * other processes are collected + faulted in which
> > creates all
> > +		 * sorts of problems. Unsure exactly how this is
> > happening, also
> > +		 * problem goes away if 'xe_exec_system_allocator --
> > r
> > +		 * process-many-malloc' mallocs at least 64k at a
> > time.
> > +		 */
> > +		if (check_pages &&
> > +		    !drm_gpusvm_check_pages(gpusvm, notifier, start,
> > end)) {
> > +			++i;
> > +			goto retry;
> > +		}
> > +	}
> > +
> > +	return end - start;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_find_or_insert - Find or insert GPU SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @fault_addr: Fault address
> > + * @gpuva_start: Start address of GPUVA which mirrors CPU
> > + * @gpuva_end: End address of GPUVA which mirrors CPU
> > + * @ctx: GPU SVM context
> > + *
> > + * This function finds or inserts a newly allocated GPU SVM range
> > based on the
> > + * fault address. Caller must hold a lock to protect range lookup
> > and insertion.
> > + *
> > + * Returns:
> > + * Pointer to the GPU SVM range on success, ERR_PTR() on failure.
> > + */
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64
> > fault_addr,
> > +				u64 gpuva_start, u64 gpuva_end,
> > +				const struct drm_gpusvm_ctx *ctx)
> > +{
> > +	struct drm_gpusvm_notifier *notifier;
> > +	struct drm_gpusvm_range *range;
> > +	struct mm_struct *mm = gpusvm->mm;
> > +	struct vm_area_struct *vas;
> > +	bool notifier_alloc = false;
> > +	u64 chunk_size;
> > +	int err;
> > +	bool migrate_vram;
> > +
> > +	if (fault_addr < gpusvm->mm_start ||
> > +	    fault_addr > gpusvm->mm_start + gpusvm->mm_range) {
> > +		err = -EINVAL;
> > +		goto err_out;
> > +	}
> > +
> > +	if (!ctx->mmap_locked) {
> > +		if (!mmget_not_zero(mm)) {
> > +			err = -EFAULT;
> > +			goto err_out;
> > +		}
> > +		mmap_write_lock(mm);
> > +	}
> > +
> > +	mmap_assert_write_locked(mm);
> > +
> > +	notifier = drm_gpusvm_notifier_find(gpusvm, fault_addr);
> > +	if (!notifier) {
> > +		notifier = drm_gpusvm_notifier_alloc(gpusvm,
> > fault_addr);
> > +		if (IS_ERR(notifier)) {
> > +			err = PTR_ERR(notifier);
> > +			goto err_mmunlock;
> > +		}
> > +		notifier_alloc = true;
> > +		err = mmu_interval_notifier_insert_locked(&notifier-
> > >notifier,
> > +							  mm,
> > notifier->interval.start,
> > +							  notifier-
> > >interval.end -
> > +							  notifier-
> > >interval.start,
> > +							 
> > &drm_gpusvm_notifier_ops);
> > +		if (err)
> > +			goto err_notifier;
> > +	}
> > +
> > +	vas = vma_lookup(mm, fault_addr);
> > +	if (!vas) {
> > +		err = -ENOENT;
> > +		goto err_notifier_remove;
> > +	}
> > +
> > +	if (!ctx->read_only && !(vas->vm_flags & VM_WRITE)) {
> > +		err = -EPERM;
> > +		goto err_notifier_remove;
> > +	}
> > +
> > +	range = drm_gpusvm_range_find(notifier, fault_addr,
> > fault_addr + 1);
> > +	if (range)
> > +		goto out_mmunlock;
> > +	/*
> > +	 * XXX: Short-circuiting migration based on migrate_vma_*
> > current
> > +	 * limitations. If/when migrate_vma_* add more support, this
> > logic will
> > +	 * have to change.
> > +	 */
> > +	migrate_vram = ctx->vram_possible &&
> > +		vma_is_anonymous(vas) && !is_vm_hugetlb_page(vas);
> > +
> > +	chunk_size = drm_gpusvm_range_chunk_size(gpusvm, notifier,
> > vas,
> > +						 fault_addr,
> > gpuva_start,
> > +						 gpuva_end,
> > migrate_vram &&
> > +						 !ctx->prefault);
> > +	if (chunk_size == LONG_MAX) {
> > +		err = -EINVAL;
> > +		goto err_notifier_remove;
> > +	}
> > +
> > +	range = drm_gpusvm_range_alloc(gpusvm, notifier, fault_addr,
> > chunk_size,
> > +				       migrate_vram);
> > +	if (IS_ERR(range)) {
> > +		err = PTR_ERR(range);
> > +		goto err_notifier_remove;
> > +	}
> > +
> > +	drm_gpusvm_range_insert(notifier, range);
> > +	if (notifier_alloc)
> > +		drm_gpusvm_notifier_insert(gpusvm, notifier);
> > +
> > +	if (ctx->prefault) {
> > +		struct drm_gpusvm_ctx __ctx = *ctx;
> > +
> > +		__ctx.mmap_locked = true;
> > +		err = drm_gpusvm_range_get_pages(gpusvm, range,
> > &__ctx);
> > +		if (err)
> > +			goto err_range_remove;
> > +	}
> > +
> > +out_mmunlock:
> > +	if (!ctx->mmap_locked) {
> > +		mmap_write_unlock(mm);
> > +		mmput(mm);
> > +	}
> > +
> > +	return range;
> > +
> > +err_range_remove:
> > +	__drm_gpusvm_range_remove(notifier, range);
> > +err_notifier_remove:
> > +	if (notifier_alloc)
> > +		mmu_interval_notifier_remove(&notifier->notifier);
> > +err_notifier:
> > +	if (notifier_alloc)
> > +		drm_gpusvm_notifier_free(gpusvm, notifier);
> > +err_mmunlock:
> > +	if (!ctx->mmap_locked) {
> > +		mmap_write_unlock(mm);
> > +		mmput(mm);
> > +	}
> > +err_out:
> > +	return ERR_PTR(err);
> > +}
> > +
> > +/**
> > + * for_each_dma_page - iterate over pages in a DMA region
> > + * @i__: the current page index in the iteration
> > + * @j__: the current page index, log order, in the iteration
> > + * @npages__: the total number of pages in the DMA region
> > + * @order__: the order of the pages in the DMA region
> > + *
> > + * This macro iterates over each page in a DMA region. The DMA
> > region
> > + * is assumed to be composed of 2^@order__ pages, and the macro will
> > + * step through the region one block of 2^@order__ pages at a time.
> > + */
> > +#define for_each_dma_page(i__, j__, npages__, order__)	\
> > +	for ((i__) = 0, (j__) = 0; (i__) < (npages__);	\
> > +	     (j__)++, (i__) += 0x1 << (order__))
> > +
> > +/**
> > + * __drm_gpusvm_range_unmap_pages - Unmap pages associated with a
> > GPU SVM range (internal)
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + *
> > + * This function unmaps pages associated with a GPU SVM range.
> > Assumes and
> > + * asserts correct locking is in place when called.
> > + */
> > +static void __drm_gpusvm_range_unmap_pages(struct drm_gpusvm
> > *gpusvm,
> > +					   struct drm_gpusvm_range
> > *range)
> > +{
> > +	lockdep_assert_held(&gpusvm->notifier_lock);
> > +
> > +	if (range->pages) {
> > +		unsigned long i, j, npages = npages_in_range(range-
> > >va.start,
> > +							     range-
> > >va.end);
> > +
> > +		if (range->flags.has_dma_mapping) {
> > +			for_each_dma_page(i, j, npages, range-
> > >order)
> > +				dma_unmap_page(gpusvm->drm->dev,
> > +					       range->dma_addr[j],
> > +					       PAGE_SIZE << range-
> > >order,
> > +					       DMA_BIDIRECTIONAL);
> > +		}
> > +
> > +		range->flags.has_vram_pages = false;
> > +		range->flags.has_dma_mapping = false;
> > +	}
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_free_pages - Free pages associated with a GPU
> > SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + *
> > + * This function frees pages associated with a GPU SVM range.
> > + */
> > +static void drm_gpusvm_range_free_pages(struct drm_gpusvm *gpusvm,
> > +					struct drm_gpusvm_range
> > *range)
> > +{
> > +	lockdep_assert_held(&gpusvm->notifier_lock);
> > +
> > +	if (range->pages) {
> > +		if (range->flags.kfree_mapping) {
> > +			kfree(range->dma_addr);
> > +			range->flags.kfree_mapping = false;
> > +			range->pages = NULL;
> > +		} else {
> > +			kvfree(range->pages);
> > +			range->pages = NULL;
> > +		}
> > +	}
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_remove - Remove GPU SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range to be removed
> > + *
> > + * This function removes the specified GPU SVM range and also
> > removes the parent
> > + * GPU SVM notifier if no more ranges remain in the notifier. The
> > caller must
> > + * hold a lock to protect range and notifier removal.
> > + */
> > +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> > +			     struct drm_gpusvm_range *range)
> > +{
> > +	struct drm_gpusvm_notifier *notifier;
> > +
> > +	notifier = drm_gpusvm_notifier_find(gpusvm, range-
> > >va.start);
> > +	if (WARN_ON_ONCE(!notifier))
> > +		return;
> > +
> > +	drm_gpusvm_notifier_lock(gpusvm);
> > +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > +	drm_gpusvm_range_free_pages(gpusvm, range);
> > +	__drm_gpusvm_range_remove(notifier, range);
> > +	drm_gpusvm_notifier_unlock(gpusvm);
> > +
> > +	drm_gpusvm_range_put(range);
> > +
> > +	if (RB_EMPTY_ROOT(&notifier->root.rb_root)) {
> > +		if (!notifier->flags.removed)
> > +			mmu_interval_notifier_remove(&notifier-
> > >notifier);
> > +		drm_gpusvm_notifier_remove(gpusvm, notifier);
> > +		drm_gpusvm_notifier_free(gpusvm, notifier);
> > +	}
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_get - Get a reference to GPU SVM range
> > + * @range: Pointer to the GPU SVM range
> > + *
> > + * This function increments the reference count of the specified GPU
> > SVM range.
> > + *
> > + * Returns:
> > + * Pointer to the GPU SVM range.
> > + */
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_get(struct drm_gpusvm_range *range)
> > +{
> > +	kref_get(&range->refcount);
> > +
> > +	return range;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_destroy - Destroy GPU SVM range
> > + * @refcount: Pointer to the reference counter embedded in the GPU
> > SVM range
> > + *
> > + * This function destroys the specified GPU SVM range when its
> > reference count
> > + * reaches zero. If a custom range-free function is provided, it is
> > invoked to
> > + * free the range; otherwise, the range is deallocated using
> > kfree().
> > + */
> > +static void drm_gpusvm_range_destroy(struct kref *refcount)
> > +{
> > +	struct drm_gpusvm_range *range =
> > +		container_of(refcount, struct drm_gpusvm_range,
> > refcount);
> > +	struct drm_gpusvm *gpusvm = range->gpusvm;
> > +
> > +	if (gpusvm->ops->range_free)
> > +		gpusvm->ops->range_free(range);
> > +	else
> > +		kfree(range);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_put - Put a reference to GPU SVM range
> > + * @range: Pointer to the GPU SVM range
> > + *
> > + * This function decrements the reference count of the specified GPU
> > SVM range
> > + * and frees it when the count reaches zero.
> > + */
> > +void drm_gpusvm_range_put(struct drm_gpusvm_range *range)
> > +{
> > +	kref_put(&range->refcount, drm_gpusvm_range_destroy);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_pages_valid - GPU SVM range pages valid
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + *
> > + * This function determines if a GPU SVM range's pages are valid. Expected
> > + * to be called holding gpusvm->notifier_lock and as the last step before
> > + * committing a GPU binding.
> > + *
> > + * Returns:
> > + * True if GPU SVM range has valid pages, False otherwise
> > + */
> > +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> > +				  struct drm_gpusvm_range *range)
> > +{
> > +	lockdep_assert_held(&gpusvm->notifier_lock);
> > +
> > +	return range->flags.has_vram_pages || range-
> > >flags.has_dma_mapping;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_pages_valid_unlocked - GPU SVM range pages valid
> > unlocked
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + *
> > + * This function determines if a GPU SVM range's pages are valid. Expected
> > + * to be called without holding gpusvm->notifier_lock.
> > + *
> > + * Returns:
> > + * True if GPU SVM range has valid pages, False otherwise
> > + */
> > +static bool
> > +drm_gpusvm_range_pages_valid_unlocked(struct drm_gpusvm *gpusvm,
> > +				      struct drm_gpusvm_range
> > *range)
> > +{
> > +	bool pages_valid;
> > +
> > +	if (!range->pages)
> > +		return false;
> > +
> > +	drm_gpusvm_notifier_lock(gpusvm);
> > +	pages_valid = drm_gpusvm_range_pages_valid(gpusvm, range);
> > +	if (!pages_valid && range->flags.kfree_mapping) {
> > +		kfree(range->dma_addr);
> > +		range->flags.kfree_mapping = false;
> > +		range->pages = NULL;
> > +	}
> > +	drm_gpusvm_notifier_unlock(gpusvm);
> > +
> > +	return pages_valid;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_get_pages - Get pages for a GPU SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + * @ctx: GPU SVM context
> > + *
> > + * This function gets pages for a GPU SVM range and ensures they are
> > mapped for
> > + * DMA access.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> > +			       struct drm_gpusvm_range *range,
> > +			       const struct drm_gpusvm_ctx *ctx)
> > +{
> 
> Is it possible to split this function up to make it look more neat?
> 
> 
> > +	struct mmu_interval_notifier *notifier = &range->notifier-
> > >notifier;
> > +	struct hmm_range hmm_range = {
> > +		.default_flags = HMM_PFN_REQ_FAULT | (ctx->read_only
> > ? 0 :
> > +			HMM_PFN_REQ_WRITE),
> > +		.notifier = notifier,
> > +		.start = range->va.start,
> > +		.end = range->va.end,
> > +		.dev_private_owner = gpusvm-
> > >device_private_page_owner,
> > +	};
> > +	struct mm_struct *mm = gpusvm->mm;
> > +	unsigned long timeout =
> > +		jiffies +
> > msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > +	unsigned long i, j;
> > +	unsigned long npages = npages_in_range(range->va.start,
> > range->va.end);
> > +	unsigned int order = 0;
> > +	unsigned long *pfns;
> > +	struct page **pages;
> > +	int err = 0;
> > +	bool vram_pages = !!range->flags.migrate_vram;
> > +	bool alloc_pfns = false, kfree_mapping;
> > +
> > +retry:
> > +	kfree_mapping = false;
> > +	hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
> > +	if (drm_gpusvm_range_pages_valid_unlocked(gpusvm, range))
> > +		return 0;
> > +
> > +	if (range->notifier_seq == hmm_range.notifier_seq && range-
> > >pages) {
> > +		if (ctx->prefault)
> > +			return 0;
> > +
> > +		pfns = (unsigned long *)range->pages;
> > +		pages = range->pages;
> > +		goto map_pages;
> > +	}
> > +
> > +	if (!range->pages) {
> > +		pfns = kvmalloc_array(npages, sizeof(*pfns),
> > GFP_KERNEL);
> > +		if (!pfns)
> > +			return -ENOMEM;
> > +		alloc_pfns = true;
> > +	} else {
> > +		pfns = (unsigned long *)range->pages;
> > +	}
> > +
> > +	if (!ctx->mmap_locked) {
> > +		if (!mmget_not_zero(mm)) {
> > +			err = -EFAULT;
> > +			goto err_out;
> > +		}
> > +	}
> > +
> > +	hmm_range.hmm_pfns = pfns;
> > +	while (true) {
> > +		/* Must be checked after mmu_interval_read_begin */
> > +		if (range->flags.unmapped) {
> > +			err = -EFAULT;
> > +			break;
> > +		}
> > +
> > +		if (!ctx->mmap_locked) {
> > +			/*
> > +			 * XXX: HMM locking document indicates only
> > a read-lock
> > +			 * is required but there appears to be a
> > window between
> > +			 * the MMU_NOTIFY_MIGRATE event triggered in
> > a CPU fault
> > +			 * via migrate_vma_setup and the pages
> > actually moving
> > +			 * in migrate_vma_finalize in which this
> > code can grab
> > +			 * garbage pages. Grabbing the write-lock if
> > the range
> > +			 * is attached to vram appears to protect
> > against this
> > +			 * race.
> > +			 */
> > +			if (vram_pages)
> > +				mmap_write_lock(mm);
> > +			else
> > +				mmap_read_lock(mm);
> > +		}
> > +		err = hmm_range_fault(&hmm_range);
> > +		if (!ctx->mmap_locked) {
> > +			if (vram_pages)
> > +				mmap_write_unlock(mm);
> > +			else
> > +				mmap_read_unlock(mm);
> > +		}
> > +
> > +		if (err == -EBUSY) {
> > +			if (time_after(jiffies, timeout))
> > +				break;
> > +
> > +			hmm_range.notifier_seq =
> > mmu_interval_read_begin(notifier);
> > +			continue;
> > +		}
> > +		break;
> > +	}
> > +	if (!ctx->mmap_locked)
> > +		mmput(mm);
> > +	if (err)
> > +		goto err_free;
> > +
> > +	pages = (struct page **)pfns;
> > +
> > +	if (ctx->prefault) {
> > +		range->pages = pages;
> > +		goto set_seqno;
> > +	}
> > +
> > +map_pages:
> > +	if (is_device_private_page(hmm_pfn_to_page(pfns[0]))) {
> > +		WARN_ON_ONCE(!range->vram_allocation);
> > +
> > +		for (i = 0; i < npages; ++i) {
> > +			pages[i] = hmm_pfn_to_page(pfns[i]);
> > +
> > +			if
> > (WARN_ON_ONCE(!is_device_private_page(pages[i]))) {
> > +				err = -EOPNOTSUPP;
> > +				goto err_free;
> > +			}
> > +		}
> > +
> > +		/* Do not race with notifier unmapping pages */
> > +		drm_gpusvm_notifier_lock(gpusvm);
> > +		range->flags.has_vram_pages = true;
> > +		range->pages = pages;
> > +		if (mmu_interval_read_retry(notifier,
> > hmm_range.notifier_seq)) {
> > +			err = -EAGAIN;
> > +			__drm_gpusvm_range_unmap_pages(gpusvm,
> > range);
> > +		}
> > +		drm_gpusvm_notifier_unlock(gpusvm);
> > +	} else {
> > +		dma_addr_t *dma_addr = (dma_addr_t *)pfns;
> > +
> > +		for_each_dma_page(i, j, npages, order) {
> 
> Here it looks like you're assuming that all pages are the same order?
> With THP that's definitely not the case (unless hmm somehow thinks
> they are 4K pages). This probably works because we only end up here in
> the HugeTLB case where all pages are forced to the same order.
> 

It assumes the order within a chunk (range size) is all the same. I
thought the THP page order would always be 9 (2M). THP tests
(*-large-malloc) seem to work on LNL.

This falls apart if chunks are larger than 2M, as the first 2M could be a
THP and the second could not. We discussed that you were changing the dma
addr to support mixed mappings and encode the order. That is likely the
correct approach and would fix this limitation of only supporting one
order size per chunk.

I may not get this into this rev but agree it should be fixed. Would
deferring this fix be ok with you?

fwiw I haven't seen any ROI on chunks being larger than 2M so Xe likely
won't have chunks larger than that but agree the design should support
this.
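
For reference, something along the lines of the sketch below is what I'd
picture the encoding looking like (not part of this series; helper names
are made up). It assumes DMA-mapped addresses are at least PAGE_SIZE
aligned so the low bits are free to carry the per-entry map order:

#define DRM_GPUSVM_ORDER_MASK	((dma_addr_t)(PAGE_SIZE - 1))

/* Pack the HMM map order into the low bits of a DMA address. */
static inline dma_addr_t drm_gpusvm_encode_dma_addr(dma_addr_t addr,
						    unsigned int order)
{
	WARN_ON_ONCE(addr & DRM_GPUSVM_ORDER_MASK);
	return addr | order;
}

static inline dma_addr_t drm_gpusvm_decode_dma_addr(dma_addr_t entry)
{
	return entry & ~DRM_GPUSVM_ORDER_MASK;
}

static inline unsigned int drm_gpusvm_decode_order(dma_addr_t entry)
{
	return entry & DRM_GPUSVM_ORDER_MASK;
}

With something like that, the mapping loop no longer needs every entry in
a chunk to share a single order; each entry carries its own mapping size.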

Matt

> > +			if (WARN_ON_ONCE(i && order !=
> > +					
> > hmm_pfn_to_map_order(pfns[i]))) {
> > +				err = -EOPNOTSUPP;
> > +				npages = i;
> > +				goto err_unmap;
> > +			}
> > +			order = hmm_pfn_to_map_order(pfns[i]);
> > +
> > +			pages[j] = hmm_pfn_to_page(pfns[i]);
> > +			if
> > (WARN_ON_ONCE(is_zone_device_page(pages[j]))) {
> > +				err = -EOPNOTSUPP;
> > +				npages = i;
> > +				goto err_unmap;
> > +			}
> > +
> > +			set_page_dirty_lock(pages[j]);
> > +			mark_page_accessed(pages[j]);
> > +
> > +			dma_addr[j] = dma_map_page(gpusvm->drm->dev,
> > +						   pages[j], 0,
> > +						   PAGE_SIZE <<
> > order,
> > +						  
> > DMA_BIDIRECTIONAL);
> > +			if (dma_mapping_error(gpusvm->drm->dev,
> > dma_addr[j])) {
> > +				err = -EFAULT;
> > +				npages = i;
> > +				goto err_unmap;
> > +			}
> > +		}
> > +
> > +		/* Huge pages, reduce memory footprint */
> > +		if (order) {
> > +			dma_addr = kmalloc_array(j,
> > sizeof(*dma_addr),
> > +						 GFP_KERNEL);
> > +			if (dma_addr) {
> > +				for (i = 0; i < j; ++i)
> > +					dma_addr[i] =
> > (dma_addr_t)pfns[i];
> > +				kvfree(pfns);
> > +				kfree_mapping = true;
> > +			} else {
> > +				dma_addr = (dma_addr_t *)pfns;
> > +			}
> > +		}
> > +
> > +		/* Do not race with notifier unmapping pages */
> > +		drm_gpusvm_notifier_lock(gpusvm);
> > +		range->order = order;
> > +		range->flags.kfree_mapping = kfree_mapping;
> > +		range->flags.has_dma_mapping = true;
> > +		range->dma_addr = dma_addr;
> > +		range->vram_allocation = NULL;
> > +		if (mmu_interval_read_retry(notifier,
> > hmm_range.notifier_seq)) {
> > +			err = -EAGAIN;
> > +			__drm_gpusvm_range_unmap_pages(gpusvm,
> > range);
> > +		}
> > +		drm_gpusvm_notifier_unlock(gpusvm);
> > +	}
> > +
> > +	if (err == -EAGAIN)
> > +		goto retry;
> > +set_seqno:
> > +	range->notifier_seq = hmm_range.notifier_seq;
> > +
> > +	return 0;
> > +
> > +err_unmap:
> > +	for_each_dma_page(i, j, npages, order)
> > +		dma_unmap_page(gpusvm->drm->dev,
> > +			       (dma_addr_t)pfns[j],
> > +			       PAGE_SIZE << order,
> > DMA_BIDIRECTIONAL);
> > +err_free:
> > +	if (alloc_pfns)
> > +		kvfree(pfns);
> > +err_out:
> > +	return err;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_unmap_pages - Unmap pages associated with a GPU
> > SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + * @ctx: GPU SVM context
> > + *
> > + * This function unmaps pages associated with a GPU SVM range. If
> > @in_notifier
> > + * is set, it is assumed that gpusvm->notifier_lock is held in write
> > mode; if it
> > + * is clear, it acquires gpusvm->notifier_lock in read mode. Must be
> > called on
> > + * each GPU SVM range attached to notifier in gpusvm->ops-
> > >invalidate for IOMMU
> > + * security model.
> > + */
> > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> > +				  struct drm_gpusvm_range *range,
> > +				  const struct drm_gpusvm_ctx *ctx)
> > +{
> > +	if (ctx->in_notifier)
> > +		lockdep_assert_held_write(&gpusvm->notifier_lock);
> > +	else
> > +		drm_gpusvm_notifier_lock(gpusvm);
> > +
> > +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > +
> > +	if (!ctx->in_notifier)
> > +		drm_gpusvm_notifier_unlock(gpusvm);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migration_put_page - Put a migration page
> > + * @page: Pointer to the page to put
> > + *
> > + * This function unlocks and puts a page.
> > + */
> > +static void drm_gpusvm_migration_put_page(struct page *page)
> > +{
> > +	unlock_page(page);
> > +	put_page(page);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migration_put_pages - Put migration pages
> > + * @npages: Number of pages
> > + * @migrate_pfn: Array of migrate page frame numbers
> > + *
> > + * This function puts an array of pages.
> > + */
> > +static void drm_gpusvm_migration_put_pages(unsigned long npages,
> > +					   unsigned long
> > *migrate_pfn)
> > +{
> > +	unsigned long i;
> > +
> > +	for (i = 0; i < npages; ++i) {
> > +		if (!migrate_pfn[i])
> > +			continue;
> > +
> > +		drm_gpusvm_migration_put_page(migrate_pfn_to_page(mi
> > grate_pfn[i]));
> > +		migrate_pfn[i] = 0;
> > +	}
> > +}
> > +
> > +/**
> > + * drm_gpusvm_get_vram_page - Get a reference to a VRAM page
> > + * @page: Pointer to the page
> > + * @zdd: Pointer to the GPU SVM zone device data
> > + *
> > + * This function associates the given page with the specified GPU
> > SVM zone
> > + * device data and initializes it for zone device usage.
> > + */
> > +static void drm_gpusvm_get_vram_page(struct page *page,
> > +				     struct drm_gpusvm_zdd *zdd)
> > +{
> > +	page->zone_device_data = drm_gpusvm_zdd_get(zdd);
> > +	zone_device_page_init(page);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_map_pages() - Map migration pages for GPU SVM
> > migration
> > + * @dev: The device for which the pages are being mapped
> > + * @dma_addr: Array to store DMA addresses corresponding to mapped
> > pages
> > + * @migrate_pfn: Array of migrate page frame numbers to map
> > + * @npages: Number of pages to map
> > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > + *
> > + * This function maps pages of memory for migration usage in GPU
> > SVM. It
> > + * iterates over each page frame number provided in @migrate_pfn,
> > maps the
> > + * corresponding page, and stores the DMA address in the provided
> > @dma_addr
> > + * array.
> > + *
> > + * Return: 0 on success, -EFAULT if an error occurs during mapping.
> > + */
> > +static int drm_gpusvm_migrate_map_pages(struct device *dev,
> > +					dma_addr_t *dma_addr,
> > +					long unsigned int
> > *migrate_pfn,
> > +					unsigned long npages,
> > +					enum dma_data_direction dir)
> > +{
> > +	unsigned long i;
> > +
> > +	for (i = 0; i < npages; ++i) {
> > +		struct page *page =
> > migrate_pfn_to_page(migrate_pfn[i]);
> > +
> > +		if (!page)
> > +			continue;
> > +
> > +		if (WARN_ON_ONCE(is_zone_device_page(page)))
> > +			return -EFAULT;
> > +
> > +		dma_addr[i] = dma_map_page(dev, page, 0, PAGE_SIZE,
> > dir);
> > +		if (dma_mapping_error(dev, dma_addr[i]))
> > +			return -EFAULT;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_unmap_pages() - Unmap pages previously mapped
> > for GPU SVM migration
> > + * @dev: The device for which the pages were mapped
> > + * @dma_addr: Array of DMA addresses corresponding to mapped pages
> > + * @npages: Number of pages to unmap
> > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > + *
> > + * This function unmaps previously mapped pages of memory for GPU
> > Shared Virtual
> > + * Memory (SVM). It iterates over each DMA address provided in
> > @dma_addr, checks
> > + * if it's valid and not already unmapped, and unmaps the
> > corresponding page.
> > + */
> > +static void drm_gpusvm_migrate_unmap_pages(struct device *dev,
> > +					   dma_addr_t *dma_addr,
> > +					   unsigned long npages,
> > +					   enum dma_data_direction
> > dir)
> > +{
> > +	unsigned long i;
> > +
> > +	for (i = 0; i < npages; ++i) {
> > +		if (!dma_addr[i] || dma_mapping_error(dev,
> > dma_addr[i]))
> > +			continue;
> > +
> > +		dma_unmap_page(dev, dma_addr[i], PAGE_SIZE, dir);
> > +	}
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_to_vram - Migrate GPU SVM range to VRAM
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + * @vram_allocation: Driver-private pointer to the VRAM allocation.
> > The caller
> > + *                   should hold a reference to the VRAM allocation,
> > which
> > + *                   should be dropped via ops->vram_allocation or
> > upon the
> > + *                   failure of this function.
> > + * @ctx: GPU SVM context
> > + *
> > + * This function migrates the specified GPU SVM range to VRAM. It
> > performs the
> > + * necessary setup and invokes the driver-specific operations for
> > migration to
> > + * VRAM. Upon successful return, @vram_allocation can safely reference
> > + * @range until ops->vram_release is called, which only occurs upon
> > + * successful return of this function.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> > +			       struct drm_gpusvm_range *range,
> > +			       void *vram_allocation,
> > +			       const struct drm_gpusvm_ctx *ctx)
> > +{
> > +	u64 start = range->va.start, end = range->va.end;
> > +	struct migrate_vma migrate = {
> > +		.start		= start,
> > +		.end		= end,
> > +		.pgmap_owner	= gpusvm->device_private_page_owner,
> > +		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
> > +	};
> > +	struct mm_struct *mm = gpusvm->mm;
> > +	unsigned long i, npages = npages_in_range(start, end);
> > +	struct vm_area_struct *vas;
> > +	struct drm_gpusvm_zdd *zdd = NULL;
> > +	struct page **pages;
> > +	dma_addr_t *dma_addr;
> > +	void *buf;
> > +	int err;
> > +
> > +	if (!range->flags.migrate_vram)
> > +		return -EINVAL;
> > +
> > +	if (!gpusvm->ops->populate_vram_pfn || !gpusvm->ops-
> > >copy_to_vram ||
> > +	    !gpusvm->ops->copy_to_sram)
> > +		return -EOPNOTSUPP;
> > +
> > +	if (!ctx->mmap_locked) {
> > +		if (!mmget_not_zero(mm)) {
> > +			err = -EFAULT;
> > +			goto err_out;
> > +		}
> > +		mmap_write_lock(mm);
> > +	}
> > +
> > +	mmap_assert_locked(mm);
> > +
> > +	vas = vma_lookup(mm, start);
> > +	if (!vas) {
> > +		err = -ENOENT;
> > +		goto err_mmunlock;
> > +	}
> > +
> > +	if (end > vas->vm_end || start < vas->vm_start) {
> > +		err = -EINVAL;
> > +		goto err_mmunlock;
> > +	}
> > +
> > +	if (!vma_is_anonymous(vas)) {
> > +		err = -EBUSY;
> > +		goto err_mmunlock;
> > +	}
> > +
> > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) +
> > sizeof(*dma_addr) +
> > +		       sizeof(*pages), GFP_KERNEL);
> > +	if (!buf) {
> > +		err = -ENOMEM;
> > +		goto err_mmunlock;
> > +	}
> > +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr))
> > * npages;
> > +
> > +	zdd = drm_gpusvm_zdd_alloc(range);
> > +	if (!zdd) {
> > +		err = -ENOMEM;
> > +		goto err_free;
> > +	}
> > +
> > +	migrate.vma = vas;
> > +	migrate.src = buf;
> > +	migrate.dst = migrate.src + npages;
> > +
> > +	err = migrate_vma_setup(&migrate);
> > +	if (err)
> > +		goto err_free;
> > +
> > +	/*
> > +	 * FIXME: The cases below, !migrate.cpages and migrate.cpages !=
> > +	 * npages, are not always an error. Need to revisit the possible
> > +	 * cases and how to handle them. We could prefault on
> > +	 * migrate.cpages != npages via hmm_range_fault.
> > +	 */
> > +
> > +	if (!migrate.cpages) {
> > +		err = -EFAULT;
> > +		goto err_free;
> > +	}
> > +
> > +	if (migrate.cpages != npages) {
> > +		err = -EBUSY;
> > +		goto err_finalize;
> > +	}
> > +
> > +	err = gpusvm->ops->populate_vram_pfn(gpusvm,
> > vram_allocation, npages,
> > +					     migrate.dst);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> > dma_addr,
> > +					   migrate.src, npages,
> > DMA_TO_DEVICE);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	for (i = 0; i < npages; ++i) {
> > +		struct page *page = pfn_to_page(migrate.dst[i]);
> > +
> > +		pages[i] = page;
> > +		migrate.dst[i] = migrate_pfn(migrate.dst[i]);
> > +		drm_gpusvm_get_vram_page(page, zdd);
> > +	}
> > +
> > +	err = gpusvm->ops->copy_to_vram(gpusvm, pages, dma_addr,
> > npages);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	/* Upon success bind vram allocation to range and zdd */
> > +	range->vram_allocation = vram_allocation;
> > +	WRITE_ONCE(zdd->vram_allocation, vram_allocation);	/*
> > Owns ref */
> > +
> > +err_finalize:
> > +	if (err)
> > +		drm_gpusvm_migration_put_pages(npages, migrate.dst);
> > +	migrate_vma_pages(&migrate);
> > +	migrate_vma_finalize(&migrate);
> > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr,
> > npages,
> > +				       DMA_TO_DEVICE);
> > +err_free:
> > +	if (zdd)
> > +		drm_gpusvm_zdd_put(zdd);
> > +	kvfree(buf);
> > +err_mmunlock:
> > +	if (!ctx->mmap_locked) {
> > +		mmap_write_unlock(mm);
> > +		mmput(mm);
> > +	}
> > +err_out:
> > +	return err;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_populate_sram_pfn - Populate SRAM PFNs for a
> > VM area
> > + * @vas: Pointer to the VM area structure, can be NULL
> > + * @npages: Number of pages to populate
> > + * @src_mpfn: Source array of migrate PFNs
> > + * @mpfn: Array of migrate PFNs to populate
> > + * @addr: Start address for PFN allocation
> > + *
> > + * This function populates the SRAM migrate page frame numbers
> > (PFNs) for the
> > + * specified VM area structure. It allocates and locks pages in the
> > VM area for
> > + * SRAM usage. If @vas is non-NULL, alloc_page_vma() is used for
> > + * allocation; if NULL, alloc_page() is used.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +static int drm_gpusvm_migrate_populate_sram_pfn(struct
> > vm_area_struct *vas,
> > +						unsigned long
> > npages,
> > +						unsigned long
> > *src_mpfn,
> > +						unsigned long *mpfn,
> > u64 addr)
> > +{
> > +	unsigned long i;
> > +
> > +	for (i = 0; i < npages; ++i, addr += PAGE_SIZE) {
> > +		struct page *page;
> > +
> > +		if (!(src_mpfn[i] & MIGRATE_PFN_MIGRATE))
> > +			continue;
> > +
> > +		if (vas)
> > +			page = alloc_page_vma(GFP_HIGHUSER, vas,
> > addr);
> > +		else
> > +			page = alloc_page(GFP_HIGHUSER);
> > +
> > +		if (!page)
> > +			return -ENOMEM;
> > +
> > +		lock_page(page);
> > +		mpfn[i] = migrate_pfn(page_to_pfn(page));
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_evict_to_sram - Evict GPU SVM range to SRAM
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + *
> > + * Similar to __drm_gpusvm_migrate_to_sram but does not require the mmap
> > + * lock; migration is done via the migrate_device_* functions. This is a
> > + * fallback path, as it is preferred to issue migrations with the mmap
> > + * lock held.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +static int drm_gpusvm_evict_to_sram(struct drm_gpusvm *gpusvm,
> > +				    struct drm_gpusvm_range *range)
> > +{
> > +	unsigned long npages;
> > +	struct page **pages;
> > +	unsigned long *src, *dst;
> > +	dma_addr_t *dma_addr;
> > +	void *buf;
> > +	int i, err = 0;
> > +
> > +	npages = npages_in_range(range->va.start, range->va.end);
> > +
> > +	buf = kvcalloc(npages, 2 * sizeof(*src) + sizeof(*dma_addr)
> > +
> > +		       sizeof(*pages), GFP_KERNEL);
> > +	if (!buf) {
> > +		err = -ENOMEM;
> > +		goto err_out;
> > +	}
> > +	src = buf;
> > +	dst = buf + (sizeof(*src) * npages);
> > +	dma_addr = buf + (2 * sizeof(*src) * npages);
> > +	pages = buf + (2 * sizeof(*src) + sizeof(*dma_addr)) *
> > npages;
> > +
> > +	err = gpusvm->ops->populate_vram_pfn(gpusvm, range-
> > >vram_allocation,
> > +					     npages, src);
> > +	if (err)
> > +		goto err_free;
> > +
> > +	err = migrate_device_vma_range(gpusvm->mm,
> > +				       gpusvm-
> > >device_private_page_owner, src,
> > +				       npages, range->va.start);
> > +	if (err)
> > +		goto err_free;
> > +
> > +	err = drm_gpusvm_migrate_populate_sram_pfn(NULL, npages,
> > src, dst, 0);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> > dma_addr,
> > +					   dst, npages,
> > DMA_BIDIRECTIONAL);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	for (i = 0; i < npages; ++i)
> > +		pages[i] = migrate_pfn_to_page(src[i]);
> > +
> > +	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr,
> > npages);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +err_finalize:
> > +	if (err)
> > +		drm_gpusvm_migration_put_pages(npages, dst);
> > +	migrate_device_pages(src, dst, npages);
> > +	migrate_device_finalize(src, dst, npages);
> > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr,
> > npages,
> > +				       DMA_BIDIRECTIONAL);
> > +err_free:
> > +	kvfree(buf);
> > +err_out:
> > +
> > +	return err;
> > +}
> > +
> > +/**
> > + * __drm_gpusvm_migrate_to_sram - Migrate GPU SVM range to SRAM
> > (internal)
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @vas: Pointer to the VM area structure
> > + * @page: Pointer to the page for fault handling (can be NULL)
> > + * @start: Start address of the migration range
> > + * @end: End address of the migration range
> > + *
> > + * This internal function performs the migration of the specified
> > GPU SVM range
> > + * to SRAM. It sets up the migration, populates + dma maps SRAM
> > PFNs, and
> > + * invokes the driver-specific operations for migration to SRAM.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +static int __drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > +					struct vm_area_struct *vas,
> > +					struct page *page,
> > +					u64 start, u64 end)
> > +{
> > +	struct migrate_vma migrate = {
> > +		.vma		= vas,
> > +		.pgmap_owner	= gpusvm->device_private_page_owner,
> > +		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> > +		.fault_page	= page,
> > +	};
> > +	unsigned long npages;
> > +	struct page **pages;
> > +	dma_addr_t *dma_addr;
> > +	void *buf;
> > +	int i, err = 0;
> > +
> > +	mmap_assert_locked(gpusvm->mm);
> > +
> > +	/* Corner case where the VM area struct has been partially unmapped */
> > +	if (start < vas->vm_start)
> > +		start = vas->vm_start;
> > +	if (end > vas->vm_end)
> > +		end = vas->vm_end;
> > +
> > +	migrate.start = start;
> > +	migrate.end = end;
> > +	npages = npages_in_range(start, end);
> > +
> > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) +
> > sizeof(*dma_addr) +
> > +		       sizeof(*pages), GFP_KERNEL);
> > +	if (!buf) {
> > +		err = -ENOMEM;
> > +		goto err_out;
> > +	}
> > +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr))
> > * npages;
> > +
> > +	migrate.vma = vas;
> > +	migrate.src = buf;
> > +	migrate.dst = migrate.src + npages;
> > +
> > +	err = migrate_vma_setup(&migrate);
> > +	if (err)
> > +		goto err_free;
> > +
> > +	/* Raced with another CPU fault, nothing to do */
> > +	if (!migrate.cpages)
> > +		goto err_free;
> > +
> > +	err = drm_gpusvm_migrate_populate_sram_pfn(vas, npages,
> > +						   migrate.src,
> > migrate.dst,
> > +						   start);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> > dma_addr,
> > +					   migrate.dst, npages,
> > +					   DMA_BIDIRECTIONAL);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	for (i = 0; i < npages; ++i)
> > +		pages[i] = migrate_pfn_to_page(migrate.src[i]);
> > +
> > +	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr,
> > npages);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +err_finalize:
> > +	if (err)
> > +		drm_gpusvm_migration_put_pages(npages, migrate.dst);
> > +	migrate_vma_pages(&migrate);
> > +	migrate_vma_finalize(&migrate);
> > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr,
> > npages,
> > +				       DMA_BIDIRECTIONAL);
> > +err_free:
> > +	kvfree(buf);
> > +err_out:
> > +	mmap_assert_locked(gpusvm->mm);
> > +
> > +	return err;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_to_sram - Migrate (evict) GPU SVM range to
> > SRAM
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + * @ctx: GPU SVM context
> > + *
> > + * This function initiates the migration of the specified GPU SVM
> > range to
> > + * SRAM. It performs necessary checks and invokes the internal
> > migration
> > + * function for actual migration.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > +			       struct drm_gpusvm_range *range,
> > +			       const struct drm_gpusvm_ctx *ctx)
> > +{
> > +	u64 start = range->va.start, end = range->va.end;
> > +	struct mm_struct *mm = gpusvm->mm;
> > +	struct vm_area_struct *vas;
> > +	int err;
> > +	bool retry = false;
> > +
> > +	if (!ctx->mmap_locked) {
> > +		if (!mmget_not_zero(mm)) {
> > +			err = -EFAULT;
> > +			goto err_out;
> > +		}
> > +		if (ctx->trylock_mmap) {
> > +			if (!mmap_read_trylock(mm))  {
> > +				err =
> > drm_gpusvm_evict_to_sram(gpusvm, range);
> > +				goto err_mmput;
> > +			}
> > +		} else {
> > +			mmap_read_lock(mm);
> > +		}
> > +	}
> > +
> > +	mmap_assert_locked(mm);
> > +
> > +	/*
> > +	 * Loop required to find all VM area structs for the corner case
> > +	 * when the VRAM backing has been partially unmapped from the MM's
> > +	 * address space.
> > +	 */
> > +again:
> > +	vas = find_vma(mm, start);
> > +	if (!vas) {
> > +		if (!retry)
> > +			err = -ENOENT;
> > +		goto err_mmunlock;
> > +	}
> > +
> > +	if (end <= vas->vm_start || start >= vas->vm_end) {
> > +		if (!retry)
> > +			err = -EINVAL;
> > +		goto err_mmunlock;
> > +	}
> > +
> > +	err = __drm_gpusvm_migrate_to_sram(gpusvm, vas, NULL, start,
> > end);
> > +	if (err)
> > +		goto err_mmunlock;
> > +
> > +	if (vas->vm_end < end) {
> > +		retry = true;
> > +		start = vas->vm_end;
> > +		goto again;
> > +	}
> > +
> > +	if (!ctx->mmap_locked) {
> > +		mmap_read_unlock(mm);
> > +		/*
> > +		 * Using mmput_async as this function can be called
> > while
> > +		 * holding a dma-resv lock, and a final put can grab
> > the mmap
> > +		 * lock, causing a lock inversion.
> > +		 */
> > +		mmput_async(mm);
> > +	}
> > +
> > +	return 0;
> > +
> > +err_mmunlock:
> > +	if (!ctx->mmap_locked)
> > +		mmap_read_unlock(mm);
> > +err_mmput:
> > +	if (!ctx->mmap_locked)
> > +		mmput_async(mm);
> > +err_out:
> > +	return err;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_page_free - Put GPU SVM zone device data associated
> > with a page
> > + * @page: Pointer to the page
> > + *
> > + * This function is a callback used to put the GPU SVM zone device
> > data
> > + * associated with a page when it is being released.
> > + */
> > +static void drm_gpusvm_page_free(struct page *page)
> > +{
> > +	drm_gpusvm_zdd_put(page->zone_device_data);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_to_ram - Migrate GPU SVM range to RAM (page
> > fault handler)
> > + * @vmf: Pointer to the fault information structure
> > + *
> > + * This function is a page fault handler used to migrate a GPU SVM
> > range to RAM.
> > + * It retrieves the GPU SVM range information from the faulting page
> > and invokes
> > + * the internal migration function to migrate the range back to RAM.
> > + *
> > + * Returns:
> > + * VM_FAULT_SIGBUS on failure, 0 on success.
> > + */
> > +static vm_fault_t drm_gpusvm_migrate_to_ram(struct vm_fault *vmf)
> > +{
> > +	struct drm_gpusvm_zdd *zdd = vmf->page->zone_device_data;
> > +	int err;
> > +
> > +	err = __drm_gpusvm_migrate_to_sram(zdd->range->gpusvm,
> > +					   vmf->vma, vmf->page,
> > +					   zdd->range->va.start,
> > +					   zdd->range->va.end);
> > +
> > +	return err ? VM_FAULT_SIGBUS : 0;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_pagemap_ops - Device page map operations for GPU SVM
> > + */
> > +static const struct dev_pagemap_ops drm_gpusvm_pagemap_ops = {
> > +	.page_free = drm_gpusvm_page_free,
> > +	.migrate_to_ram = drm_gpusvm_migrate_to_ram,
> > +};
> > +
> > +/**
> > + * drm_gpusvm_pagemap_ops_get - Retrieve GPU SVM device page map
> > operations
> > + *
> > + * Returns:
> > + * Pointer to the GPU SVM device page map operations structure.
> > + */
> > +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void)
> > +{
> > +	return &drm_gpusvm_pagemap_ops;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_has_mapping - Check if GPU SVM has mapping for the
> > given address range
> > + * @gpusvm: Pointer to the GPU SVM structure.
> > + * @start: Start address
> > + * @end: End address
> > + *
> > + * Returns:
> > + * True if GPU SVM has mapping, False otherwise
> > + */
> > +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start,
> > u64 end)
> > +{
> > +	struct drm_gpusvm_notifier *notifier;
> > +
> > +	drm_gpusvm_for_each_notifier(notifier, gpusvm, start, end) {
> > +		struct drm_gpusvm_range *range = NULL;
> > +
> > +		drm_gpusvm_for_each_range(range, notifier, start,
> > end)
> > +			return true;
> > +	}
> > +
> > +	return false;
> > +}
> > diff --git a/drivers/gpu/drm/xe/drm_gpusvm.h
> > b/drivers/gpu/drm/xe/drm_gpusvm.h
> > new file mode 100644
> > index 000000000000..0ea70f8534a8
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/drm_gpusvm.h
> > @@ -0,0 +1,415 @@
> > +/* SPDX-License-Identifier: MIT */
> > +/*
> > + * Copyright © 2024 Intel Corporation
> > + */
> > +
> > +#ifndef __DRM_GPUSVM_H__
> > +#define __DRM_GPUSVM_H__
> > +
> > +#include <linux/kref.h>
> > +#include <linux/mmu_notifier.h>
> > +#include <linux/workqueue.h>
> > +
> > +struct dev_pagemap_ops;
> > +struct drm_device;
> > +struct drm_gpusvm;
> > +struct drm_gpusvm_notifier;
> > +struct drm_gpusvm_ops;
> > +struct drm_gpusvm_range;
> > +
> > +/**
> > + * struct drm_gpusvm_ops - Operations structure for GPU SVM
> > + *
> > + * This structure defines the operations for GPU Shared Virtual
> > Memory (SVM).
> > + * These operations are provided by the GPU driver to manage SVM
> > ranges and
> > + * perform operations such as migration between VRAM and system RAM.
> > + */
> > +struct drm_gpusvm_ops {
> > +	/**
> > +	 * @notifier_alloc: Allocate a GPU SVM notifier (optional)
> > +	 *
> > +	 * This function shall allocate a GPU SVM notifier.
> > +	 *
> > +	 * Returns:
> > +	 * Pointer to the allocated GPU SVM notifier on success,
> > NULL on failure.
> > +	 */
> > +	struct drm_gpusvm_notifier *(*notifier_alloc)(void);
> > +
> > +	/**
> > +	 * @notifier_free: Free a GPU SVM notifier (optional)
> > +	 * @notifier: Pointer to the GPU SVM notifier to be freed
> > +	 *
> > +	 * This function shall free a GPU SVM notifier.
> > +	 */
> > +	void (*notifier_free)(struct drm_gpusvm_notifier *notifier);
> > +
> > +	/**
> > +	 * @range_alloc: Allocate a GPU SVM range (optional)
> > +	 * @gpusvm: Pointer to the GPU SVM
> > +	 *
> > +	 * This function shall allocate a GPU SVM range.
> > +	 *
> > +	 * Returns:
> > +	 * Pointer to the allocated GPU SVM range on success, NULL
> > on failure.
> > +	 */
> > +	struct drm_gpusvm_range *(*range_alloc)(struct drm_gpusvm
> > *gpusvm);
> > +
> > +	/**
> > +	 * @range_free: Free a GPU SVM range (optional)
> > +	 * @range: Pointer to the GPU SVM range to be freed
> > +	 *
> > +	 * This function shall free a GPU SVM range.
> > +	 */
> > +	void (*range_free)(struct drm_gpusvm_range *range);
> > +
> > +	/**
> > +	 * @vram_release: Release VRAM allocation (optional)
> > +	 * @vram_allocation: Driver-private pointer to the VRAM
> > allocation
> > +	 *
> > +	 * This function shall release VRAM allocation and expects
> > to drop a
> > +	 * reference to VRAM allocation.
> > +	 */
> > +	void (*vram_release)(void *vram_allocation);
> > +
> > +	/**
> > +	 * @populate_vram_pfn: Populate VRAM PFN (required for
> > migration)
> > +	 * @gpusvm: Pointer to the GPU SVM
> > +	 * @vram_allocation: Driver-private pointer to the VRAM
> > allocation
> > +	 * @npages: Number of pages to populate
> > +	 * @pfn: Array of page frame numbers to populate
> > +	 *
> > +	 * This function shall populate VRAM page frame numbers
> > (PFN).
> > +	 *
> > +	 * Returns:
> > +	 * 0 on success, a negative error code on failure.
> > +	 */
> > +	int (*populate_vram_pfn)(struct drm_gpusvm *gpusvm,
> > +				 void *vram_allocation,
> > +				 unsigned long npages,
> > +				 unsigned long *pfn);
> > +
> > +	/**
> > +	 * @copy_to_vram: Copy to VRAM (required for migration)
> > +	 * @gpusvm: Pointer to the GPU SVM
> > +	 * @pages: Pointer to array of VRAM pages (destination)
> > +	 * @dma_addr: Pointer to array of DMA addresses (source)
> > +	 * @npages: Number of pages to copy
> > +	 *
> > +	 * This function shall copy pages to VRAM.
> > +	 *
> > +	 * Returns:
> > +	 * 0 on success, a negative error code on failure.
> > +	 */
> > +	int (*copy_to_vram)(struct drm_gpusvm *gpusvm,
> > +			    struct page **pages,
> > +			    dma_addr_t *dma_addr,
> > +			    unsigned long npages);
> > +
> > +	/**
> > +	 * @copy_to_sram: Copy to system RAM (required for
> > migration)
> > +	 * @gpusvm: Pointer to the GPU SVM
> > +	 * @pages: Pointer to array of VRAM pages (source)
> > +	 * @dma_addr: Pointer to array of DMA addresses
> > (destination)
> > +	 * @npages: Number of pages to copy
> > +	 *
> > +	 * This function shall copy pages to system RAM.
> > +	 *
> > +	 * Returns:
> > +	 * 0 on success, a negative error code on failure.
> > +	 */
> > +	int (*copy_to_sram)(struct drm_gpusvm *gpusvm,
> > +			    struct page **pages,
> > +			    dma_addr_t *dma_addr,
> > +			    unsigned long npages);
> > +
> > +	/**
> > +	 * @invalidate: Invalidate GPU SVM notifier (required)
> > +	 * @gpusvm: Pointer to the GPU SVM
> > +	 * @notifier: Pointer to the GPU SVM notifier
> > +	 * @mmu_range: Pointer to the mmu_notifier_range structure
> > +	 *
> > +	 * This function shall invalidate the GPU page tables. It
> > can safely
> > +	 * walk the notifier range RB tree/list in this function.
> > Called while
> > +	 * holding the notifier lock.
> > +	 */
> > +	void (*invalidate)(struct drm_gpusvm *gpusvm,
> > +			   struct drm_gpusvm_notifier *notifier,
> > +			   const struct mmu_notifier_range
> > *mmu_range);
> > +};
> > +
> > +/**
> > + * struct drm_gpusvm_notifier - Structure representing a GPU SVM
> > notifier
> > + *
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: MMU interval notifier
> > + * @interval: Interval for the notifier
> > + * @rb: Red-black tree node for the parent GPU SVM structure
> > notifier tree
> > + * @root: Cached root node of the RB tree containing ranges
> > + * @range_list: List head of ranges in the same order they appear in the
> > + *              interval tree. This is useful to keep iterating ranges
> > + *              while doing modifications to the RB tree.
> > + * @flags.removed: Flag indicating whether the MMU interval notifier
> > has been
> > + *                 removed
> > + *
> > + * This structure represents a GPU SVM notifier.
> > + */
> > +struct drm_gpusvm_notifier {
> > +	struct drm_gpusvm *gpusvm;
> > +	struct mmu_interval_notifier notifier;
> > +	struct {
> > +		u64 start;
> > +		u64 end;
> > +	} interval;
> > +	struct {
> > +		struct rb_node node;
> > +		struct list_head entry;
> > +		u64 __subtree_last;
> > +	} rb;
> > +	struct rb_root_cached root;
> > +	struct list_head range_list;
> > +	struct {
> > +		u32 removed : 1;
> > +	} flags;
> > +};
> > +
> > +/**
> > + * struct drm_gpusvm_range - Structure representing a GPU SVM range
> > + *
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier
> > + * @refcount: Reference count for the range
> > + * @rb: Red-black tree node for the parent GPU SVM notifier
> > structure range tree
> > + * @va: Virtual address range
> > + * @notifier_seq: Notifier sequence number of the range's pages
> > + * @pages: Pointer to the array of pages (if backing store is in
> > VRAM)
> > + * @dma_addr: DMA address array (if backing store is SRAM and DMA
> > mapped)
> > + * @vram_allocation: Driver-private pointer to the VRAM allocation
> > + * @order: Order of dma mapping. i.e. PAGE_SIZE << order is mapping
> > size
> > + * @flags.migrate_vram: Flag indicating whether the range can be
> > migrated to VRAM
> > + * @flags.unmapped: Flag indicating if the range has been unmapped
> > + * @flags.partial_unmap: Flag indicating if the range has been
> > partially unmapped
> > + * @flags.has_vram_pages: Flag indicating if the range has vram
> > pages
> > + * @flags.has_dma_mapping: Flag indicating if the range has a DMA
> > mapping
> > + * @flags.kfree_mapping: Flag indicating @dma_addr is a compact
> > allocation based
> > + *                       on @order which releases via kfree
> > + *
> > + * This structure represents a GPU SVM range used for tracking
> > memory ranges
> > + * mapped in a DRM device.
> > + */
> > +struct drm_gpusvm_range {
> > +	struct drm_gpusvm *gpusvm;
> > +	struct drm_gpusvm_notifier *notifier;
> > +	struct kref refcount;
> > +	struct {
> > +		struct rb_node node;
> > +		struct list_head entry;
> > +		u64 __subtree_last;
> > +	} rb;
> > +	struct {
> > +		u64 start;
> > +		u64 end;
> > +	} va;
> > +	unsigned long notifier_seq;
> > +	union {
> > +		struct page **pages;
> > +		dma_addr_t *dma_addr;
> > +	};
> > +	void *vram_allocation;
> > +	u16 order;
> > +	struct {
> > +		/* All flags below must be set upon creation */
> > +		u16 migrate_vram : 1;
> > +		/* All flags below must be set / cleared under
> > notifier lock */
> > +		u16 unmapped : 1;
> > +		u16 partial_unmap : 1;
> > +		u16 has_vram_pages : 1;
> > +		u16 has_dma_mapping : 1;
> > +		u16 kfree_mapping : 1;
> > +	} flags;
> > +};
> > +
> > +/**
> > + * struct drm_gpusvm - GPU SVM structure
> > + *
> > + * @name: Name of the GPU SVM
> > + * @drm: Pointer to the DRM device structure
> > + * @mm: Pointer to the mm_struct for the address space
> > + * @device_private_page_owner: Device private pages owner
> > + * @mm_start: Start address of GPU SVM
> > + * @mm_range: Range of the GPU SVM
> > + * @notifier_size: Size of individual notifiers
> > + * @ops: Pointer to the operations structure for GPU SVM
> > + * @chunk_sizes: Pointer to the array of chunk sizes used in range
> > allocation.
> > + *               Entries should be powers of 2 in descending order.
> > + * @num_chunks: Number of chunks
> > + * @notifier_lock: Read-write semaphore for protecting notifier
> > operations
> > + * @zdd_wq: Workqueue for deferred work on zdd destruction
> > + * @root: Cached root node of the Red-Black tree containing GPU SVM
> > notifiers
> > + * @notifier_list: List head of notifiers in the same order they appear
> > + *                 in the interval tree. This is useful to keep iterating
> > + *                 notifiers while doing modifications to the RB tree.
> > + *
> > + * This structure represents a GPU SVM (Shared Virtual Memory) used
> > for tracking
> > + * memory ranges mapped in a DRM (Direct Rendering Manager) device.
> > + *
> > + * No reference counting is provided, as this is expected to be
> > embedded in the
> > + * driver VM structure along with the struct drm_gpuvm, which
> > handles reference
> > + * counting.
> > + */
> > +struct drm_gpusvm {
> > +	const char *name;
> > +	struct drm_device *drm;
> > +	struct mm_struct *mm;
> > +	void *device_private_page_owner;
> > +	u64 mm_start;
> > +	u64 mm_range;
> > +	u64 notifier_size;
> > +	const struct drm_gpusvm_ops *ops;
> > +	const u64 *chunk_sizes;
> > +	int num_chunks;
> > +	struct rw_semaphore notifier_lock;
> > +	struct workqueue_struct *zdd_wq;
> > +	struct rb_root_cached root;
> > +	struct list_head notifier_list;
> > +};
> > +
> > +/**
> > + * struct drm_gpusvm_ctx - DRM GPU SVM context
> > + *
> > + * @mmap_locked: mmap lock is locked
> > + * @trylock_mmap: trylock mmap lock, used to avoid locking
> > inversions
> > + *                (e.g. dma-resv -> mmap lock)
> > + * @in_notifier: entering from a MMU notifier
> > + * @read_only: operating on read-only memory
> > + * @vram_possible: possible to use VRAM
> > + * @prefault: prefault pages
> > + *
> > + * Context that DRM GPU SVM is operating in (i.e. user arguments).
> > + */
> > +struct drm_gpusvm_ctx {
> > +	u32 mmap_locked :1;
> > +	u32 trylock_mmap :1;
> > +	u32 in_notifier :1;
> > +	u32 read_only :1;
> > +	u32 vram_possible :1;
> > +	u32 prefault :1;
> > +};
> > +
> > +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> > +		    const char *name, struct drm_device *drm,
> > +		    struct mm_struct *mm, void
> > *device_private_page_owner,
> > +		    u64 mm_start, u64 mm_range, u64 notifier_size,
> > +		    const struct drm_gpusvm_ops *ops,
> > +		    const u64 *chunk_sizes, int num_chunks);
> > +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm);
> > +void drm_gpusvm_free(struct drm_gpusvm *gpusvm);
> > +
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64
> > fault_addr,
> > +				u64 gpuva_start, u64 gpuva_end,
> > +				const struct drm_gpusvm_ctx *ctx);
> > +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> > +			     struct drm_gpusvm_range *range);
> > +
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_get(struct drm_gpusvm_range *range);
> > +void drm_gpusvm_range_put(struct drm_gpusvm_range *range);
> > +
> > +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> > +				  struct drm_gpusvm_range *range);
> > +
> > +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> > +			       struct drm_gpusvm_range *range,
> > +			       const struct drm_gpusvm_ctx *ctx);
> > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> > +				  struct drm_gpusvm_range *range,
> > +				  const struct drm_gpusvm_ctx *ctx);
> > +
> > +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> > +			       struct drm_gpusvm_range *range,
> > +			       void *vram_allocation,
> > +			       const struct drm_gpusvm_ctx *ctx);
> > +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > +			       struct drm_gpusvm_range *range,
> > +			       const struct drm_gpusvm_ctx *ctx);
> > +
> > +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void);
> > +
> > +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start,
> > u64 end);
> > +
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64
> > start, u64 end);
> > +
> > +/**
> > + * drm_gpusvm_notifier_lock - Lock GPU SVM notifier
> > + * @gpusvm__: Pointer to the GPU SVM structure.
> > + *
> > + * Abstracts client usage of the GPU SVM notifier lock; takes the lock.
> > + */
> > +#define drm_gpusvm_notifier_lock(gpusvm__)	\
> > +	down_read(&(gpusvm__)->notifier_lock)
> > +
> > +/**
> > + * drm_gpusvm_notifier_unlock - Unlock GPU SVM notifier
> > + * @gpusvm__: Pointer to the GPU SVM structure.
> > + *
> > + * Abstracts client usage of the GPU SVM notifier lock; drops the lock.
> > + */
> > +#define drm_gpusvm_notifier_unlock(gpusvm__)	\
> > +	up_read(&(gpusvm__)->notifier_lock)
> > +
> > +/**
> > + * __drm_gpusvm_range_next - Get the next GPU SVM range in the list
> > + * @range: a pointer to the current GPU SVM range
> > + *
> > + * Return: A pointer to the next drm_gpusvm_range if available, or
> > NULL if the
> > + *         current range is the last one or if the input range is
> > NULL.
> > + */
> > +static inline struct drm_gpusvm_range *
> > +__drm_gpusvm_range_next(struct drm_gpusvm_range *range)
> > +{
> > +	if (range && !list_is_last(&range->rb.entry,
> > +				   &range->notifier->range_list))
> > +		return list_next_entry(range, rb.entry);
> > +
> > +	return NULL;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_for_each_range - Iterate over GPU SVM ranges in a
> > notifier
> > + * @range__: Iterator variable for the ranges. If set, it indicates
> > the start of
> > + *	     the iterator. If NULL, call drm_gpusvm_range_find() to
> > get the range.
> > + * @notifier__: Pointer to the GPU SVM notifier
> > + * @start__: Start address of the range
> > + * @end__: End address of the range
> > + *
> > + * This macro is used to iterate over GPU SVM ranges in a notifier.
> > It is safe
> > + * to use while holding the driver SVM lock or the notifier lock.
> > + */
> > +#define drm_gpusvm_for_each_range(range__, notifier__, start__,
> > end__)	\
> > +	for ((range__) = (range__)
> > ?:					\
> > +	     drm_gpusvm_range_find((notifier__), (start__),
> > (end__));	\
> > +	     (range__) && (range__->va.start <
> > (end__));		\
> > +	     (range__) = __drm_gpusvm_range_next(range__))
> > +
> > +/**
> > + * drm_gpusvm_range_set_unmapped - Mark a GPU SVM range as unmapped
> > + * @range: Pointer to the GPU SVM range structure.
> > + * @mmu_range: Pointer to the MMU notifier range structure.
> > + *
> > + * This function marks a GPU SVM range as unmapped and sets the
> > partial_unmap flag
> > + * if the range partially falls within the provided MMU notifier
> > range.
> > + */
> > +static inline void
> > +drm_gpusvm_range_set_unmapped(struct drm_gpusvm_range *range,
> > +			      const struct mmu_notifier_range
> > *mmu_range)
> > +{
> > +	lockdep_assert_held_write(&range->gpusvm->notifier_lock);
> > +
> > +	range->flags.unmapped = true;
> > +	if (range->va.start < mmu_range->start ||
> > +	    range->va.end > mmu_range->end)
> > +		range->flags.partial_unmap = true;
> > +}
> > +
> > +#endif /* __DRM_GPUSVM_H__ */
> 

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 00/28] Introduce GPU SVM and Xe SVM implementation
  2024-09-24  9:16 ` [RFC PATCH 00/28] " Simona Vetter
@ 2024-09-24 19:36   ` Matthew Brost
  2024-09-25 11:41     ` Simona Vetter
  0 siblings, 1 reply; 100+ messages in thread
From: Matthew Brost @ 2024-09-24 19:36 UTC (permalink / raw)
  To: Simona Vetter
  Cc: intel-xe, dri-devel, airlied, christian.koenig, thomas.hellstrom,
	matthew.auld, daniel

On Tue, Sep 24, 2024 at 11:16:01AM +0200, Simona Vetter wrote:
> On Tue, Aug 27, 2024 at 07:48:33PM -0700, Matthew Brost wrote:
> > Continuation of SVM work by Oak Zeng [1][2] based on community feedback.
> > Introduces GPU SVM layer and new Xe uAPI. Supports GPU page faults for
> > system allocations (e.g., malloc), runtime allocations (e.g., binding a
> > BO), migration to and from VRAM, and unified eviction (BO and SVM VRAM
> > allocations can evict each other). Fully tested; more on this below.
> > 
> > The patch breakdown is as follows:
> > 
> > 1. Preparation patches already on the list [3].
> > 	- Patches 1-3.
> > 	- Please refrain from reviewing these here.	
> > 2. New migrate layer functionality
> > 	- Patch 4.
> > 	- Required for eviction to avoid locking inversion between
> > 	  dma-resv and mmap lock.
> > 3. GPU SVM.
> > 	- Patch 5.
> > 	- This is what needs community review.
> > 	- Inspired by GPUVM.
> > 	- Kernel doc should explain design principles.
> > 	- There is certainly room for optimization of the implementation
> > 	  and improvements with existing core MM interaction. Pulling in
> > 	  pending DMA mapping work [4] and additional core MM support
> > 	  for SVM is also likely desired. However, this serves as a good
> > 	  starting point for any SVM discussions and could be used as a
> > 	  stepping stone to future core MM work.
> > 3. Basic SVM support in Xe (i.e., SRAM backing only).
> > 	- Patches 6-15.
> > 	- The uAPI in the patch could benefit from community input.
> > 4. SVM VRAM migration support in Xe.
> > 	- Patches 16-23.
> > 	- Using TMM BOs for SVM VRAM allocations could use community
> > 	  input. Patch 23 has a detailed explaination of this design
> > 	  choice in the commit message.
> > 5. SVM eviction support in Xe.
> > 	- Patch 24.
> > 	- Should work with exhaustive eviction [5] when it merges.
> > 6. Xe SVM debug / tuning.
> > 	- Patch 25-28.
> > 
> > Kernel documentation and commit messages are relatively light, aside
> > from GPU SVM and uAPI patches as this is an RFC.
> > 
> > Testing has been conducted quite thoroughly with new IGT [6]. Various
> > system allocation types (malloc, mmap, mmap flags, huge pages, different
> > sizes, different alignments), mixing runtime allocations, unmapping
> > corners, invalid faults, and eviction have been tested. Testing scales
> > from single thread to multiple threads and multiple processes. Tests
> > pass on LNL, BMG, PVC 1 tile, and PVC 2 tile.
> > 
> > 1. Multiple GPU support.
> > 	- This is likely to follow or occur in parallel to this work.
> > 2. Userptr unification with GPU SVM.
> > 	- This is essentially designed in my head (likely involving a
> > 	  few new GPU SVM layer functions) but would require some fairly
> > 	  invasive changes to Xe KMD to test out. Therefore, I would
> > 	  like GPU SVM to be reviewed first before proceeding with these
> > 	  changes.
> > 3. Madvise and prefetch IOCTLs
> > 	- This is likely to follow or occur in parallel to this work.
> > 
> > Given the size of the series, I have pushed a GitLab branch for
> > reference [7].
> > 
> > Matt
> > 
> > [1] https://patchwork.freedesktop.org/series/128910/
> > [2] https://patchwork.freedesktop.org/series/132229/
> > [3] https://patchwork.freedesktop.org/series/137805/
> > [4] https://lore.kernel.org/linux-rdma/cover.1709635535.git.leon@kernel.org/
> > [5] https://patchwork.freedesktop.org/series/133643/
> > [6] https://patchwork.freedesktop.org/patch/610942/?series=137545&rev=2
> > [7] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-svm-post-8-27-24/-/tree/post?ref_type=heads
> 
> Ok rather late, I wanted to type this up 2 weeks ago or so, but alas here
> it is finally. I think with all the experiments and discussions I have

Thanks for putting this together and all the initial reviews.

> fairly clear understanding of the really tricky parts of svm (thanks a lot
> to Matt for all the work done). From my side the key points, sorted
> roughly in how important I think they are:
> 

I've read this through and pretty much agree with everything here, so I
won't have a detailed response to every point as there isn't much to say
beyond that I agree. A few minor comments below.

> 1. migrate_to_ram path
> 
> I think this is the absolute center piece of making sure we're aligned
> with core mm and don't suffer from deadlocks or livelocks fundamental to
> the gpusvm library design. So this part imo needs to be solid for the
> first version we merge. But of course any core mm changes Matt prototyped
> shouldn't gate merging the drm side since they're nicely decoupled, we
> only need those to validate the design in testing.
> 
> I think the key points are:
> 
> - we rely on the migration pte, temporary page references and page lock
>   only, which with the core mm changes Matt worked on gives us guaranteed
>   reliable migration back to system memory. And we need that, or svm
>   essentially becomes unusable as a concept.
> 
> - we need to support partial migration, including the worst case fallback
>   of only migrating that single page core mm managed to trylock for us
>   while holding the pagetable lock.
> 
>   Since we have guaranteed migration back to system memory we can make the
>   assumption on the gpu fault handling side that we will only ever handle
>   ranges that are entirely in vram (by throwing any partial migrations
>   out). Needs a retry loop for that in the gpu fault side, but I no longer
>   see an issue with that assumption on the gpu fault side otherwise, so
>   not needed for merging or even later until we have a driver that
>   requires partial vram support.
> 

I think pretty quickly we will add partial vram support / mixed mappings,
but it likely will not be in the initial merge.
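
Until mixed mappings land, the fault side simply throws out a partial
migration and maps whatever is resident, roughly like the sketch below
(illustrative only, error handling simplified):

	if (ctx->vram_possible) {
		err = drm_gpusvm_migrate_to_vram(gpusvm, range,
						 vram_allocation, ctx);
		/*
		 * A partial migration (e.g. -EBUSY when cpages != npages)
		 * is not fatal; fall back to SRAM-backed pages and let a
		 * later fault or prefetch retry the migration to VRAM.
		 */
		if (err)
			err = 0;
	}

	err = drm_gpusvm_range_get_pages(gpusvm, range, ctx);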

> - no other driver locks related to that memory range in any way are
>   allowed, and ideally we also test with the forced fallback to mmap_read
>   lock in do_swap_page removed, i.e. calling migrate_to_ram with only
>   holding the read vma lock. Of course driver locks for blitting are
>   allowed, it's only locks related to managing physical memory which are
>   problematic and could result in deadlocks.
> 
> - the drm side must uphold the guarantee of not having elevated page
>   references without holding the page lock. Otherwise there's a race and
>   we cannot guarantee migration to sram.
> 
> - also through the try_to_migrate maze we'll hit our own gpu pte
>   invalidate paths, so there's some requirements there too. But I've put
>   the discussion for that at the very bottom, since most of the gpu pte
>   locking questions are imo not that important, and definitely not
>   important for the first version we merge.
> 
> Everything else below I think we can sort out post merge and just need
> rough alignment on the design.
> 
> 2. eviction
> 
> Requirements much like migrate_to_ram, because otherwise we break the
> migration guarantee:
> 
> - Only looking at physical memory datastructures and locks, no looking at
>   mm/vma structs or relying on those being locked. We rely entirely on
>   reverse maps from try_to_migrate to find all the mappings on both cpu
>   and gpu side (cpu only zone device swap or migration pte entries ofc).
> 
> - Partial migration needs to work to make sure we can get out of any
>   low memory bind.
> 
> 3. gpu fault side
> 
> - We can only rely on mmap_read for hmm_range_fault. And ideally should
>   assume that's not held anywhere else since with per-vma locking I'd
>   expect the mm/vma locking will move into hmm_range_fault. This also
>   means no looking at vma beyond just passing it around as currently
>   needed by core mm functions.
> 
> - Big retry loop to handle all races with the mmu notifier under the gpu
>   pagetable locks/mmu notifier range lock/whatever we end up calling
>   those. Races (especially against concurrent eviction/migrate_to_ram)
>   should _not_ be handled on the fault side by trying to hold locks
>   instead.
> 
> - Long term I think we need to be able to handle concurrent faults, even
>   on hw where there's only one gpu fault handling queue. For performance
>   we absolutely want to prefault aggressively, and that likely happens
>   through cpu ioctls that are entirely independent from the gpu fault
>   handling.
> 

I agree the long term goal is to handle concurrent GPU faults, and with a
bit of finer-grained locking in Xe I have already made this work. The
biggest part which needs to be parallel is the migration code, which takes
up roughly 98% of the time in the GPU fault handler for a 2M fault, with a
split of 90% in the migrate_vma_* functions and 8% in the copy job. I've
seen tests which mirror multiple EUs from the same process taking
concurrent GPU faults show large gains in perf. Also the CPU fault
handler is concurrent so it makes a bit of sense that GPU faults are
concurrent too.

GPU faults being concurrent should also enable concurrent prefetches
from CPU IOCTLs, which is likely desired as well.
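
Purely as an illustration of the shape this could take (hypothetical driver
code, none of it in this series): each fault or prefetch request could be
queued as its own work item on an unbound workqueue so the expensive
migrate_vma_* and copy-job phases of separate faults can overlap:

	/* Hypothetical per-fault work item; all names here are made up. */
	static struct workqueue_struct *drv_svm_wq;	/* alloc_workqueue("drv-svm", WQ_UNBOUND, 0) at init */

	struct drv_fault_work {
		struct work_struct work;
		struct drm_gpusvm *gpusvm;
		u64 fault_addr;
	};

	static void drv_fault_work_fn(struct work_struct *w)
	{
		struct drv_fault_work *fw = container_of(w, typeof(*fw), work);

		/* the expensive migrate_vma_* + copy-job phase runs here and can
		 * overlap with other faults / prefetches */
		drv_handle_gpu_fault(fw->gpusvm, fw->fault_addr);	/* placeholder */
		kfree(fw);
	}

	static void drv_queue_gpu_fault(struct drm_gpusvm *gpusvm, u64 fault_addr)
	{
		struct drv_fault_work *fw = kzalloc(sizeof(*fw), GFP_KERNEL);

		if (!fw)
			return;	/* real code would fall back to synchronous handling */

		fw->gpusvm = gpusvm;
		fw->fault_addr = fault_addr;
		INIT_WORK(&fw->work, drv_fault_work_fn);
		queue_work(drv_svm_wq, &fw->work);
	}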

I'm not going to include this or any of the other optimizations I have
worked on in what I try to merge initially, though, as I want to keep this
as simple as possible and also don't want to throw more code at the list
until a working baseline is in.

>   Short term, enough (driver-side) locking to make sure this doesn't go
>   boom is fine, I think just some design goal documentation here on how
>   to achieve that is all we need.
> 
> 4. physical memory to virtual backpointer
> 
> No. Doesn't work :-) Also it's only used in the eviction/migrate_to_ram
> path and I think Matt already fixed this all anyway.
> 
> 5. gpu pagetable locking
> 
> Or mmu notifier range locking or whatever you want to call it. Like on the
> cpu side this should _only_ protect the pagetable entries and additionally
> for us the mmu notifier seqno tracking, nothing else.
> 
> Any races due to concurrent eviction/migrate_to_ram/gpu fault/prefault
> need to be handled by retrying outside of holding the pagetable locks. If
> we try to impose additional consistency guarantees we'll fall over and
> have a livelock/deadlock fight with core mm in migrate_to_ram. This part
> is required I think for the first version, but we already need that anyway
> to make migrate_to_ram work properly.
> 
> For the actual data structure/locking design I think anything on the
> design line between a single global lock and the radix tree over-the-top
> scalable per-pagetable (spin)lock design of the core mm is fine.
> 

I've seen the bind step in servicing GPU faults take barely any time, so
having the GPU page tables protected by the VM's dma-resv lock seems fine
in Xe. This really is up to each driver to decide how it wants to handle
this, too.
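
For reference, the retry against the notifier seqno mentioned in points 3
and 5 is the standard hmm pattern; a minimal sketch, assuming pfns, mm and
a driver_commit_bind() placeholder exist in the caller (not the actual Xe
code; needs linux/hmm.h and linux/mmu_notifier.h):

	static int driver_get_pages_and_bind(struct drm_gpusvm *gpusvm,
					     struct drm_gpusvm_notifier *notifier,
					     struct drm_gpusvm_range *range,
					     unsigned long *pfns,
					     struct mm_struct *mm)
	{
		struct hmm_range hrange = {
			.notifier = &notifier->notifier,
			.start = range->va.start,
			.end = range->va.end,
			.hmm_pfns = pfns,
			.default_flags = HMM_PFN_REQ_FAULT,
		};
		int err;

	again:
		hrange.notifier_seq = mmu_interval_read_begin(hrange.notifier);

		mmap_read_lock(mm);
		err = hmm_range_fault(&hrange);
		mmap_read_unlock(mm);
		if (err == -EBUSY)
			goto again;
		if (err)
			return err;

		drm_gpusvm_notifier_lock(gpusvm);
		if (mmu_interval_read_retry(hrange.notifier, hrange.notifier_seq)) {
			drm_gpusvm_notifier_unlock(gpusvm);
			goto again;	/* raced with an invalidation, retry */
		}
		err = driver_commit_bind(gpusvm, range);	/* placeholder bind step */
		drm_gpusvm_notifier_unlock(gpusvm);

		return err;
	}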

> The design here with 3 levels (mmu notifier, range, struct page) wouldn't
> be my first choice, but clearly fits on that line so imo is fine for
> initial merging. We might want to make sure that the range locking (I
> guess mostly relevant for the invalidate side, drivers don't see much
> else) is somewhat abstracted so we can easily change that post-merge, but
> not required imo at all.
> 
> For consensus documentation I'd recommend a todo or design documentation
> patch, where we put down both the current design and why it's like that,
> and some of the longer term goals. Then get that acked (imo needs at least
> one other driver that's seriously interested in this, plus I think an ack
> from Danilo for gpuvm interactions), then merge that. SVM is tricky enough
> that I think this would be really useful to make sure we're not
> unnecessarily stuck in limbo.
>

I'll include a TODO or design documentation in the next rev.
 
> From my side again I think the only part we really have to get right from
> the start is migrate_to_ram. And I'm confident we've got that now really
> solid.
> 

I think most of all 5 points will be addressed in my next rev. Anything
that isn't falls into an 'optimization we can do later' category, and the
design should be coded in a way that these optimizations can easily be
added.

Matt

> Oh also you need userspace ofc :-)
> 
> Cheers, Sima
> 
> > Matthew Brost (28):
> >   dma-buf: Split out dma fence array create into alloc and arm functions
> >   drm/xe: Invalidate media_gt TLBs in PT code
> >   drm/xe: Retry BO allocation
> >   mm/migrate: Add migrate_device_vma_range
> >   drm/gpusvm: Add support for GPU Shared Virtual Memory
> >   drm/xe/uapi: Add DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATON flag
> >   drm/xe: Add SVM init / fini to faulting VMs
> >   drm/xe: Add dma_addr res cursor
> >   drm/xe: Add SVM range invalidation
> >   drm/gpuvm: Add DRM_GPUVA_OP_USER
> >   drm/xe: Add (re)bind to SVM page fault handler
> >   drm/xe: Add SVM garbage collector
> >   drm/xe: Add unbind to SVM garbage collector
> >   drm/xe: Do not allow system allocator VMA unbind if the GPU has
> >     bindings
> >   drm/xe: Enable system allocator uAPI
> >   drm/xe: Add migrate layer functions for SVM support
> >   drm/xe: Add SVM device memory mirroring
> >   drm/xe: Add GPUSVM copy SRAM / VRAM vfunc functions
> >   drm/xe: Update PT layer to understand ranges in VRAM
> >   drm/xe: Add Xe SVM populate_vram_pfn vfunc
> >   drm/xe: Add Xe SVM vram_release vfunc
> >   drm/xe: Add BO flags required for SVM
> >   drm/xe: Add SVM VRAM migration
> >   drm/xe: Basic SVM BO eviction
> >   drm/xe: Add SVM debug
> >   drm/xe: Add modparam for SVM notifier size
> >   drm/xe: Add modparam for SVM prefault
> >   drm/gpusvm: Ensure all pages migrated upon eviction
> > 
> >  drivers/dma-buf/dma-fence-array.c    |   78 +-
> >  drivers/gpu/drm/xe/Makefile          |    4 +-
> >  drivers/gpu/drm/xe/drm_gpusvm.c      | 2213 ++++++++++++++++++++++++++
> >  drivers/gpu/drm/xe/drm_gpusvm.h      |  415 +++++
> >  drivers/gpu/drm/xe/xe_bo.c           |   54 +-
> >  drivers/gpu/drm/xe/xe_bo.h           |    2 +
> >  drivers/gpu/drm/xe/xe_bo_types.h     |    3 +
> >  drivers/gpu/drm/xe/xe_device_types.h |    8 +
> >  drivers/gpu/drm/xe/xe_gt_pagefault.c |   17 +-
> >  drivers/gpu/drm/xe/xe_migrate.c      |  150 ++
> >  drivers/gpu/drm/xe/xe_migrate.h      |   10 +
> >  drivers/gpu/drm/xe/xe_module.c       |    7 +
> >  drivers/gpu/drm/xe/xe_module.h       |    2 +
> >  drivers/gpu/drm/xe/xe_pt.c           |  456 +++++-
> >  drivers/gpu/drm/xe/xe_pt.h           |    3 +
> >  drivers/gpu/drm/xe/xe_pt_types.h     |    2 +
> >  drivers/gpu/drm/xe/xe_res_cursor.h   |   50 +-
> >  drivers/gpu/drm/xe/xe_svm.c          |  775 +++++++++
> >  drivers/gpu/drm/xe/xe_svm.h          |   70 +
> >  drivers/gpu/drm/xe/xe_tile.c         |    5 +
> >  drivers/gpu/drm/xe/xe_vm.c           |  286 +++-
> >  drivers/gpu/drm/xe/xe_vm.h           |   15 +-
> >  drivers/gpu/drm/xe/xe_vm_types.h     |   44 +
> >  include/drm/drm_gpuvm.h              |    5 +
> >  include/linux/dma-fence-array.h      |    6 +
> >  include/linux/migrate.h              |    3 +
> >  include/uapi/drm/xe_drm.h            |   19 +-
> >  mm/migrate_device.c                  |   53 +
> >  28 files changed, 4615 insertions(+), 140 deletions(-)
> >  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.c
> >  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.h
> >  create mode 100644 drivers/gpu/drm/xe/xe_svm.c
> >  create mode 100644 drivers/gpu/drm/xe/xe_svm.h
> > 
> > -- 
> > 2.34.1
> > 
> 
> -- 
> Simona Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 00/28] Introduce GPU SVM and Xe SVM implementation
  2024-09-24 19:36   ` Matthew Brost
@ 2024-09-25 11:41     ` Simona Vetter
  0 siblings, 0 replies; 100+ messages in thread
From: Simona Vetter @ 2024-09-25 11:41 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Simona Vetter, intel-xe, dri-devel, airlied, christian.koenig,
	thomas.hellstrom, matthew.auld, daniel

On Tue, Sep 24, 2024 at 07:36:21PM +0000, Matthew Brost wrote:
> On Tue, Sep 24, 2024 at 11:16:01AM +0200, Simona Vetter wrote:
> > On Tue, Aug 27, 2024 at 07:48:33PM -0700, Matthew Brost wrote:
> > > Continuation of SVM work by Oak Zeng [1][2] based on community feedback.
> > > Introduces GPU SVM layer and new Xe uAPI. Supports GPU page faults for
> > > system allocations (e.g., malloc), runtime allocations (e.g., binding a
> > > BO), migration to and from VRAM, and unified eviction (BO and SVM VRAM
> > > allocations can evict each other). Fully tested; more on this below.
> > > 
> > > The patch breakdown is as follows:
> > > 
> > > 1. Preparation patches already on the list [3].
> > > 	- Patches 1-3.
> > > 	- Please refrain from reviewing these here.	
> > > 2. New migrate layer functionality
> > > 	- Patch 4.
> > > 	- Required for eviction to avoid locking inversion between
> > > 	  dma-resv and mmap lock.
> > > 3. GPU SVM.
> > > 	- Patch 5.
> > > 	- This is what needs community review.
> > > 	- Inspired by GPUVM.
> > > 	- Kernel doc should explain design principles.
> > > 	- There is certainly room for optimization of the implementation
> > > 	  and improvements with existing core MM interaction. Pulling in
> > > 	  pending DMA mapping work [4] and additional core MM support
> > > 	  for SVM is also likely desired. However, this serves as a good
> > > 	  starting point for any SVM discussions and could be used as a
> > > 	  stepping stone to future core MM work.
> > > 3. Basic SVM support in Xe (i.e., SRAM backing only).
> > > 	- Patches 6-15.
> > > 	- The uAPI in the patch could benefit from community input.
> > > 4. SVM VRAM migration support in Xe.
> > > 	- Patches 16-23.
> > > 	- Using TMM BOs for SVM VRAM allocations could use community
> > > 	  input. Patch 23 has a detailed explaination of this design
> > > 	  choice in the commit message.
> > > 5. SVM eviction support in Xe.
> > > 	- Patch 24.
> > > 	- Should work with exhaustive eviction [5] when it merges.
> > > 6. Xe SVM debug / tuning.
> > > 	- Patch 25-28.
> > > 
> > > Kernel documentation and commit messages are relatively light, aside
> > > from GPU SVM and uAPI patches as this is an RFC.
> > > 
> > > Testing has been conducted quite thoroughly with new IGT [6]. Various
> > > system allocation types (malloc, mmap, mmap flags, huge pages, different
> > > sizes, different alignments), mixing runtime allocations, unmapping
> > > corners, invalid faults, and eviction have been tested. Testing scales
> > > from single thread to multiple threads and multiple processes. Tests
> > > pass on LNL, BMG, PVC 1 tile, and PVC 2 tile.
> > > 
> > > 1. Multiple GPU support.
> > > 	- This is likely to follow or occur in parallel to this work.
> > > 2. Userptr unification with GPU SVM.
> > > 	- This is essentially designed in my head (likely involving a
> > > 	  few new GPU SVM layer functions) but would require some fairly
> > > 	  invasive changes to Xe KMD to test out. Therefore, I would
> > > 	  like GPU SVM to be reviewed first before proceeding with these
> > > 	  changes.
> > > 3. Madvise and prefetch IOCTLs
> > > 	- This is likely to follow or occur in parallel to this work.
> > > 
> > > Given the size of the series, I have pushed a GitLab branch for
> > > reference [7].
> > > 
> > > Matt
> > > 
> > > [1] https://patchwork.freedesktop.org/series/128910/
> > > [2] https://patchwork.freedesktop.org/series/132229/
> > > [3] https://patchwork.freedesktop.org/series/137805/
> > > [4] https://lore.kernel.org/linux-rdma/cover.1709635535.git.leon@kernel.org/
> > > [5] https://patchwork.freedesktop.org/series/133643/
> > > [6] https://patchwork.freedesktop.org/patch/610942/?series=137545&rev=2
> > > [7] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-svm-post-8-27-24/-/tree/post?ref_type=heads
> > 
> > Ok rather late, I wanted to type this up 2 weeks ago or so, but alas here
> > it is finally. I think with all the experiments and discussions I have
> 
> Thanks for putting this together and all the initial reviews.
> 
> > fairly clear understanding of the really tricky parts of svm (thanks a lot
> > to Matt for all the work done). From my side the key points, sorted
> > roughly in how important I think they are:
> > 
> 
> I've read this through and pretty much agree with everything here 
> so won't have a detailed response to everything as there isn't much to
> say aside from I agree. A few minor comments below.
> 
> > 1. migrate_to_ram path
> > 
> > I think this is the absolute center piece of making sure we're aligned
> > with core mm and don't suffer from deadlocks or livelocks fundamental to
> > the gpusvm library design. So this part imo needs to be solid for the
> > first version we merge. But of course any core mm changes Matt prototyped
> > shouldn't gate merging the drm side since they're nicely decoupled, we
> > only need those to validate the design in testing.
> > 
> > I think the key points are:
> > 
> > - we rely on the migration pte, temporary page references and page lock
> >   only, which with the core mm changes Matt worked on gives us guaranteed
> >   reliable migration back to system memory. And we need that, or svm
> >   essentially becomes unusable as a concept.
> > 
> > - we need to support partial migration, including the worst case fallback
> >   of only migrating that single page core mm managed to trylock for us
> >   while holding the pagetable lock.
> > 
> >   Since we have guaranteed migration back to system memory we can make the
> >   assumption on the gpu fault handling side that we will only ever handle
> >   ranges that are entirely in vram (by throwing any partial migrations
> >   out). Needs a retry loop for that in the gpu fault side, but I no longer
> >   see an issue with that assumption on the gpu fault side otherwise, so
> >   not needed for merging or even later until we have a driver that
> >   requires partial vram support.
> > 
> 
> I think we will add partial vram support / mixed mappings pretty quickly,
> but it likely will not be in the initial merge.
> 
> > - no other driver locks related to that memory range in any way are
> >   allowed, and ideally we also test with the forced fallback to mmap_read
> >   lock in do_swap_page removed, i.e. calling migrate_to_ram with only
> >   holding the read vma lock. Of course driver locks for blitting are
> >   allowed, it's only locks related to managing physical memory which are
> >   problematic and could result in deadlocks.
> > 
> > - the drm side must uphold the guarantee of not having elevated page
> >   references without holding the page lock. Otherwise there's a race and
> >   we cannot guarantee migration to sram.
> > 
> > - also through the try_to_migrate maze we'll hit our own gpu pte
> >   invalidate paths, so there's some requirements there too. But I've put
> >   the discussion for that at the very bottom, since most of the gpu pte
> >   locking questions are imo not that important, and definitely not
> >   important for the first version we merge.
> > 
> > Everything else below I think we can sort out post merge and just need
> > rough alignment on the design.
> > 
> > 2. eviction
> > 
> > Requirements much like migrate_to_ram, because otherwise we break the
> > migration guarantee:
> > 
> > - Only looking at physical memory datastructures and locks, no looking at
> >   mm/vma structs or relying on those being locked. We rely entirely on
> >   reverse maps from try_to_migrate to find all the mappings on both cpu
> >   and gpu side (cpu only zone device swap or migration pte entries ofc).
> > 
> > - Partial migration needs to work to make sure we can get out of any
> >   low memory bind.
> > 
> > 3. gpu fault side
> > 
> > - We can only rely on mmap_read for hmm_range_fault. And ideally should
> >   assume that's not held anywhere else since with per-vma locking I'd
> >   expect the mm/vma locking will move into hmm_range_fault. This also
> >   means no looking at vma beyond just passing it around as currently
> >   needed by core mm functions.
> > 
> > - Big retry loop to handle all races with the mmu notifier under the gpu
> >   pagetable locks/mmu notifier range lock/whatever we end up calling
> >   those. Races (especially against concurrent eviction/migrate_to_ram)
> >   should _not_ be handled on the fault side by trying to hold locks
> >   instead.
> > 
> > - Long term I think we need to be able to handle concurrent faults, even
> >   on hw where there's only one gpu fault handling queue. For performance
> >   we absolutely want to prefault aggressively, and that likely happens
> >   through cpu ioctls that are entirely independent from the gpu fault
> >   handling.
> > 
> 
> I agree the long term goal is to handle concurrent GPU faults, and with a
> bit of finer-grained locking in Xe I have already made this work. The
> biggest part that needs to be parallel is the migration code, which takes
> up roughly 98% of the time in the GPU fault handler for a 2M fault, with a
> split of 90% in the migrate_vma_* functions and 8% in the copy job. I've
> seen tests which mirror multiple EUs from the same process taking
> concurrent GPU faults show large gains in perf. Also, the CPU fault
> handler is concurrent, so it makes sense for GPU faults to be concurrent
> too.
> 
> GPU faults being concurrent should also enable concurrent prefetches
> from CPU IOCTLs, which is likely desired as well.
> 
> I'm not going to include this or any of the other optimizations I have
> worked on in what I try to merge initially, though, as I want to keep this
> as simple as possible and also don't want to throw more code at the list
> until a working baseline is in.
> 
> >   Short term, enough (driver-side) locking to make sure this doesn't go
> >   boom is fine, I think just some design goal documentation here on how
> >   to achieve that is all we need.
> > 
> > 4. physical memory to virtual backpointer
> > 
> > No. Doesn't work :-) Also it's only used in the eviction/migrate_to_ram
> > path and I think Matt already fixed this all anyway.
> > 
> > 5. gpu pagetable locking
> > 
> > Or mmu notifier range locking or whatever you want to call it. Like on the
> > cpu side this should _only_ protect the pagetable entries and additionally
> > for us the mmu notifier seqno tracking, nothing else.
> > 
> > Any races due to concurrent eviction/migrate_to_ram/gpu fault/prefault
> > need to be handled by retrying outside of holding the pagetable locks. If
> > we try to impose additional consistency guarantees we'll fall over and
> > have a livelock/deadlock fight with core mm in migrate_to_ram. This part
> > is required I think for the first version, but we already need that anyway
> > to make migrate_to_ram work properly.
> > 
> > For the actual data structure/locking design I think anything on the
> > design line between a single global lock and the radix tree over-the-top
> > scalable per-pagetable (spin)lock design of the core mm is fine.
> > 
> 
> I've seen the bind step in servicing GPU faults take barely any time, so
> having the GPU page tables protected by the VM's dma-resv lock seems fine
> in Xe. This really is up to each driver to decide how it wants to handle
> this, too.

I think if we go to a more fine-grained approach it'd make sense to have
this in common code, especially if we go with something more radix-tree
based. It's quite tricky to get right in the more extreme cases (*shudders
looking at cpu pgtable walking code), and with mmu notifier seqno handling
there's some additional complexity. And for drivers which don't need that
fine-grained approach it shouldn't hurt.

Also with pagetable lock here I meant the one you're also taking from the
mmu notifier invalidate callback, so unless I'm completely lost that's not
the dma_resv lock of the vm. That is iirc what you call "driver lock" and
I think for initial merging your current approach is entirely fine. It
would also need some changes for concurrent gpu pte (pre-)fault handling,
but that's not what I'm talking about here directly.
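
To make the terminology concrete, a minimal sketch of what that lock does
(mirroring the notifier-lock usage in patch 5; driver_zap_gpu_ptes() is a
placeholder, and this is not the actual Xe code):

	static bool driver_invalidate(struct mmu_interval_notifier *mni,
				      const struct mmu_notifier_range *mmu_range,
				      unsigned long cur_seq)
	{
		struct drm_gpusvm_notifier *notifier =
			container_of(mni, typeof(*notifier), notifier);
		struct drm_gpusvm *gpusvm = notifier->gpusvm;

		if (!mmu_notifier_range_blockable(mmu_range))
			return false;

		/* The "pagetable lock": protects the GPU PTEs and the mmu notifier
		 * seqno only; it is not the VM's dma-resv lock. */
		down_write(&gpusvm->notifier_lock);
		mmu_interval_set_seq(mni, cur_seq);
		driver_zap_gpu_ptes(gpusvm, mmu_range->start, mmu_range->end);
		up_write(&gpusvm->notifier_lock);

		return true;
	}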

> > The design here with 3 levels (mmu notifier, range, struct page) wouldn't
> > be my first choice, but clearly fits on that line so imo is fine for
> > initial merging. We might want to make sure that the range locking (I
> > guess mostly relevant for the invalidate side, drivers don't see much
> > else) is somewhat abstracted so we can easily change that post-merge, but
> > not required imo at all.
> > 
> > For consensus documentation I'd recommend a todo or design documentation
> > patch, where we put down both the current design and why it's like that,
> > and some of the longer term goals. Then get that acked (imo needs at least
> > one other driver that's seriously interested in this, plus I think an ack
> > from Danilo for gpuvm interactions), then merge that. SVM is tricky enough
> > that I think this would be really useful to make sure we're not
> > unnecessarily stuck in limbo.
> >
> 
> I'll include a TODO or design documentation in the next rev.
>  
> > From my side again I think the only part we really have to get right from
> > the start is migrate_to_ram. And I'm confident we've got that now really
> > solid.
> > 
> 
> I think most of all 5 points will be addressed in my next rev. Anything
> that isn't falls into an 'optimization we can do later' category, and the
> design should be coded in a way that these optimizations can easily be
> added.

Just quickly read through your comments, and aside from the pagetable
locking one (which I think is just a bit of unclarity, not disagreement) I
agree with all of them.
-Sima

> Matt
> 
> > Oh also you need userspace ofc :-)
> > 
> > Cheers, Sima
> > 
> > > Matthew Brost (28):
> > >   dma-buf: Split out dma fence array create into alloc and arm functions
> > >   drm/xe: Invalidate media_gt TLBs in PT code
> > >   drm/xe: Retry BO allocation
> > >   mm/migrate: Add migrate_device_vma_range
> > >   drm/gpusvm: Add support for GPU Shared Virtual Memory
> > >   drm/xe/uapi: Add DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATON flag
> > >   drm/xe: Add SVM init / fini to faulting VMs
> > >   drm/xe: Add dma_addr res cursor
> > >   drm/xe: Add SVM range invalidation
> > >   drm/gpuvm: Add DRM_GPUVA_OP_USER
> > >   drm/xe: Add (re)bind to SVM page fault handler
> > >   drm/xe: Add SVM garbage collector
> > >   drm/xe: Add unbind to SVM garbage collector
> > >   drm/xe: Do not allow system allocator VMA unbind if the GPU has
> > >     bindings
> > >   drm/xe: Enable system allocator uAPI
> > >   drm/xe: Add migrate layer functions for SVM support
> > >   drm/xe: Add SVM device memory mirroring
> > >   drm/xe: Add GPUSVM copy SRAM / VRAM vfunc functions
> > >   drm/xe: Update PT layer to understand ranges in VRAM
> > >   drm/xe: Add Xe SVM populate_vram_pfn vfunc
> > >   drm/xe: Add Xe SVM vram_release vfunc
> > >   drm/xe: Add BO flags required for SVM
> > >   drm/xe: Add SVM VRAM migration
> > >   drm/xe: Basic SVM BO eviction
> > >   drm/xe: Add SVM debug
> > >   drm/xe: Add modparam for SVM notifier size
> > >   drm/xe: Add modparam for SVM prefault
> > >   drm/gpusvm: Ensure all pages migrated upon eviction
> > > 
> > >  drivers/dma-buf/dma-fence-array.c    |   78 +-
> > >  drivers/gpu/drm/xe/Makefile          |    4 +-
> > >  drivers/gpu/drm/xe/drm_gpusvm.c      | 2213 ++++++++++++++++++++++++++
> > >  drivers/gpu/drm/xe/drm_gpusvm.h      |  415 +++++
> > >  drivers/gpu/drm/xe/xe_bo.c           |   54 +-
> > >  drivers/gpu/drm/xe/xe_bo.h           |    2 +
> > >  drivers/gpu/drm/xe/xe_bo_types.h     |    3 +
> > >  drivers/gpu/drm/xe/xe_device_types.h |    8 +
> > >  drivers/gpu/drm/xe/xe_gt_pagefault.c |   17 +-
> > >  drivers/gpu/drm/xe/xe_migrate.c      |  150 ++
> > >  drivers/gpu/drm/xe/xe_migrate.h      |   10 +
> > >  drivers/gpu/drm/xe/xe_module.c       |    7 +
> > >  drivers/gpu/drm/xe/xe_module.h       |    2 +
> > >  drivers/gpu/drm/xe/xe_pt.c           |  456 +++++-
> > >  drivers/gpu/drm/xe/xe_pt.h           |    3 +
> > >  drivers/gpu/drm/xe/xe_pt_types.h     |    2 +
> > >  drivers/gpu/drm/xe/xe_res_cursor.h   |   50 +-
> > >  drivers/gpu/drm/xe/xe_svm.c          |  775 +++++++++
> > >  drivers/gpu/drm/xe/xe_svm.h          |   70 +
> > >  drivers/gpu/drm/xe/xe_tile.c         |    5 +
> > >  drivers/gpu/drm/xe/xe_vm.c           |  286 +++-
> > >  drivers/gpu/drm/xe/xe_vm.h           |   15 +-
> > >  drivers/gpu/drm/xe/xe_vm_types.h     |   44 +
> > >  include/drm/drm_gpuvm.h              |    5 +
> > >  include/linux/dma-fence-array.h      |    6 +
> > >  include/linux/migrate.h              |    3 +
> > >  include/uapi/drm/xe_drm.h            |   19 +-
> > >  mm/migrate_device.c                  |   53 +
> > >  28 files changed, 4615 insertions(+), 140 deletions(-)
> > >  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.c
> > >  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.h
> > >  create mode 100644 drivers/gpu/drm/xe/xe_svm.c
> > >  create mode 100644 drivers/gpu/drm/xe/xe_svm.h
> > > 
> > > -- 
> > > 2.34.1
> > > 
> > 
> > -- 
> > Simona Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 100+ messages in thread

* RE: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory
  2024-09-24  9:25     ` Simona Vetter
@ 2024-09-25 16:34       ` Zeng, Oak
  0 siblings, 0 replies; 100+ messages in thread
From: Zeng, Oak @ 2024-09-25 16:34 UTC (permalink / raw)
  To: Simona Vetter
  Cc: Brost, Matthew, intel-xe@lists.freedesktop.org,
	dri-devel@lists.freedesktop.org, thomas.hellstrom@linux.intel.com,
	Auld,  Matthew, daniel@ffwll.ch, airlied@gmail.com,
	christian.koenig@amd.com

Hi Sima,

> -----Original Message-----
> From: Simona Vetter <simona.vetter@ffwll.ch>
> Sent: Tuesday, September 24, 2024 5:25 AM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: Brost, Matthew <matthew.brost@intel.com>; intel-
> xe@lists.freedesktop.org; dri-devel@lists.freedesktop.org;
> thomas.hellstrom@linux.intel.com; Auld, Matthew
> <matthew.auld@intel.com>; daniel@ffwll.ch; airlied@gmail.com;
> christian.koenig@amd.com
> Subject: Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU
> Shared Virtual Memory
> 
> On Fri, Sep 06, 2024 at 06:41:18PM +0000, Zeng, Oak wrote:
> > There are fundamental design conflicts with what we have aligned, see inline.
> >
> > > -----Original Message-----
> > > From: Intel-xe <intel-xe-bounces@lists.freedesktop.org> On
> Behalf
> > > Of Matthew Brost
> > > Sent: Tuesday, August 27, 2024 10:49 PM
> > > To: intel-xe@lists.freedesktop.org; dri-
> devel@lists.freedesktop.org
> > > Cc: airlied@gmail.com; christian.koenig@amd.com;
> > > thomas.hellstrom@linux.intel.com; Auld, Matthew
> > > <matthew.auld@intel.com>; daniel@ffwll.ch
> > > Subject: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU
> > > Shared Virtual Memory
> > >
> > > This patch introduces support for GPU Shared Virtual Memory
> (SVM)
> > > in the
> > > Direct Rendering Manager (DRM) subsystem. SVM allows for
> > > seamless
> > > sharing of memory between the CPU and GPU, enhancing
> > > performance and
> > > flexibility in GPU computing tasks.
> > >
> > > The patch adds the necessary infrastructure for SVM, including
> data
> > > structures and functions for managing SVM ranges and notifiers. It
> > > also
> > > provides mechanisms for allocating, deallocating, and migrating
> > > memory
> > > regions between system RAM and GPU VRAM.
> > >
> > > This mid-layer is largely inspired by GPUVM.
> > >
> > > Cc: Dave Airlie <airlied@redhat.com>
> > > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > Cc: Christian König <christian.koenig@amd.com>
> > > Cc: <dri-devel@lists.freedesktop.org>
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/Makefile     |    3 +-
> > >  drivers/gpu/drm/xe/drm_gpusvm.c | 2174
> > > +++++++++++++++++++++++++++++++
> > >  drivers/gpu/drm/xe/drm_gpusvm.h |  415 ++++++
> > >  3 files changed, 2591 insertions(+), 1 deletion(-)
> > >  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.c
> > >  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.h
> > >
> > > diff --git a/drivers/gpu/drm/xe/Makefile
> > > b/drivers/gpu/drm/xe/Makefile
> > > index b9670ae09a9e..b8fc2ee58f1a 100644
> > > --- a/drivers/gpu/drm/xe/Makefile
> > > +++ b/drivers/gpu/drm/xe/Makefile
> > > @@ -25,7 +25,8 @@ $(obj)/generated/%_wa_oob.c
> > > $(obj)/generated/%_wa_oob.h: $(obj)/xe_gen_wa_oob \
> > >
> > >  # core driver code
> > >
> > > -xe-y += xe_bb.o \
> > > +xe-y += drm_gpusvm.o \
> > > +	xe_bb.o \
> > >  	xe_bo.o \
> > >  	xe_bo_evict.o \
> > >  	xe_devcoredump.o \
> > > diff --git a/drivers/gpu/drm/xe/drm_gpusvm.c
> > > b/drivers/gpu/drm/xe/drm_gpusvm.c
> > > new file mode 100644
> > > index 000000000000..fc1e44e6ae72
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/xe/drm_gpusvm.c
> > > @@ -0,0 +1,2174 @@
> > > +// SPDX-License-Identifier: MIT
> > > +/*
> > > + * Copyright © 2024 Intel Corporation
> > > + *
> > > + * Authors:
> > > + *     Matthew Brost <matthew.brost@intel.com>
> > > + */
> > > +
> > > +#include <linux/dma-mapping.h>
> > > +#include <linux/interval_tree_generic.h>
> > > +#include <linux/hmm.h>
> > > +#include <linux/memremap.h>
> > > +#include <linux/migrate.h>
> > > +#include <linux/mm_types.h>
> > > +#include <linux/pagemap.h>
> > > +#include <linux/slab.h>
> > > +
> > > +#include <drm/drm_device.h>
> > > +#include "drm_gpusvm.h"
> > > +
> > > +/**
> > > + * DOC: Overview
> > > + *
> > > + * GPU Shared Virtual Memory (GPU SVM) layer for the Direct
> > > Rendering Manager (DRM)
> > > + *
> > > + * The GPU SVM layer is a component of the DRM framework
> > > designed to manage shared
> > > + * virtual memory between the CPU and GPU. It enables
> efficient
> > > data exchange and
> > > + * processing for GPU-accelerated applications by allowing
> memory
> > > sharing and
> > > + * synchronization between the CPU's and GPU's virtual address
> > > spaces.
> > > + *
> > > + * Key GPU SVM Components:
> > > + * - Notifiers: Notifiers: Used for tracking memory intervals and
> > > notifying the
> > > + *		GPU of changes, notifiers are sized based on a GPU
> > > SVM
> > > + *		initialization parameter, with a recommendation of
> > > 512M or
> > > + *		larger. They maintain a Red-BlacK tree and a list of
> > > ranges that
> > > + *		fall within the notifier interval. Notifiers are tracked
> > > within
> > > + *		a GPU SVM Red-BlacK tree and list and are
> > > dynamically inserted
> > > + *		or removed as ranges within the interval are created
> > > or
> > > + *		destroyed.
> > > + * - Ranges: Represent memory ranges mapped in a DRM device
> and
> > > managed
> > > + *	     by GPU SVM.
> >
> >
> > This svm_range concept has introduced a lot of code duplication in xekmd,
> > indicating that this is a wrong design. I think one of the design principles
> > is to reuse, not to duplicate.
> >
> > Look at patches 9 and 11: a bunch of duplicated code for page table update,
> > invalidate, and the page fault handler.
> >
> > I had this range concept in v1 [1], but after we agreed to unify the svm and
> > userptr code during review, I dropped this concept, and the xe_svm concept,
> > which ends up with much less duplicated code in v2 [2]. I will say more below
> > why I thought the svm concept can also be removed.
> >
> > Conceptually a vma represents a range. Why duplicate?
> 
> Because we cannot rely on mmap_read/write locks or vma_read/write locks
> without causing headaches. They are core mm datastructures that the gpu
> driver does not own, so for better or worse we have to do a bit of
> duplication.

Seems there is a misunderstanding here. By vma I meant a data structure
in the driver representing a range, such as xe_vma, not the core mm vma
(struct vm_area_struct). Sorry, I should have been clearer.

The point I tried to make was that the svm_range concept is pretty much a
duplication of the xe_vma concept. If you look at the definitions of those
two data structures, they are very similar. This further ends up with code
duplication in the page table update code:

- xe_pt_zap_ptes_range duplicates xe_pt_zap_ptes
- xe_svm_invalidate duplicates xe_userptr_invalidate
- xe_vm_range_rebind/unbind and many other functions are all duplicated
- the rb-tree in drm_gpusvm duplicates the rb-tree in drm_gpuvm


> 
> Duplication for no reason is bad, but trying to avoid necessary
> duplication that's inherent to the design challenge we face is much
> worse.

I agree some duplication is necessary. But let's discuss whether we can
avoid the duplication in this design, and whether it is reasonable.

In some PoC code, I was able to avoid the duplication w/o breaking the
design. The idea is to unify the userptr code and the svm code:

So userptr (xe_userptr_vma in xekmd) is a sub-class of xe_vma, which is a
subclass of drm_gpuva. We just move the userptr concept up to the drm layer
and rename it to hmmptr.

This way we can reuse most of the xekmd userptr code, and reuse the drm_gpuvm rb-tree.

Also there is no need for the drm_gpusvm concept, similar to the core mm
design: there is only mm_struct, but there isn't any shared mm concept.

To mark that a drm_gpuvm participates in svm, we only need to introduce a
*mm member pointing to the core mm_struct that this gpuvm participates in.
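
A rough sketch of what I mean (illustrative only, not existing drm_gpuvm code):

	struct drm_gpuvm {
		/* ... existing drm_gpuvm members ... */

		/*
		 * Non-NULL when this GPU VM mirrors a CPU address space (svm):
		 * points to the mm_struct it participates in, analogous to how
		 * core mm only has mm_struct and no separate "shared mm" object.
		 */
		struct mm_struct *mm;
	};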

As said, those ideas didn't originate from me. In my v1, it was also
xe_svm/svm_range; I learned those ideas during review. Even today, I still
think this is a reasonable design. The key reason is that the svm design is
nothing more than userptr with the capability of migrating to vram. We
already have a working userptr built on top of drm_gpuvm, drm_gpuva and
xe_vma, so why should we re-invent all those concepts?

More details: https://gitlab.freedesktop.org/oak/xe-kernel-driver-svm/-/commits/drm-tip-svm-drm-generic-page-centric-Sep05

Btw, the above code not only supports page-centric migration to sram (such
as partially migrating a range to sram, worst case one page), it does the
same thing for migration to vram. The key idea is that it introduces a
migration vector concept: the migration vector collects all the migratable
pages (through migrate_vma_setup) and aggregates those pages into a
migration job, regardless of whether that is all pages in a range or only a
subset of pages. The lock design there is also "coincidentally" aligned with
what you outlined in your previous email; see the "Doc: lock design" section
of https://gitlab.freedesktop.org/oak/xe-kernel-driver-svm/-/commit/10d1576533f549b0d521dfa997b7087d1926e6ed


Oak

> 
> 
> > [1]
> https://patchwork.freedesktop.org/patch/574898/?series=128910&r
> ev=1
> > [2] https://patchwork.freedesktop.org/series/132229/
> >
> >
> > They are sized based on an array of chunk
> > > sizes, which
> > > + *	     is a GPU SVM initialization parameter, and the CPU address
> > > space.
> > > + *	     Upon GPU fault, the largest aligned chunk that fits within
> > > the
> > > + *	     faulting CPU address space is chosen for the range size.
> > > Ranges are
> > > + *	     expected to be dynamically allocated on GPU fault and
> > > removed on an
> > > + *	     MMU notifier UNMAP event. As mentioned above, ranges
> > > are tracked in
> > > + *	     a notifier's Red-Black tree.
> > > + * - Operations: Define the interface for driver-specific SVM
> > > operations such as
> > > + *		 allocation, page collection, migration, invalidations,
> > > and VRAM
> > > + *		 release.
> > > + *
> > > + * This layer provides interfaces for allocating, mapping, migrating,
> > > and
> > > + * releasing memory ranges between the CPU and GPU. It
> handles
> > > all core memory
> > > + * management interactions (DMA mapping, HMM, and
> migration)
> > > and provides
> > > + * driver-specific virtual functions (vfuncs). This infrastructure is
> > > sufficient
> > > + * to build the expected driver components for an SVM
> > > implementation as detailed
> > > + * below.
> > > + *
> > > + * Expected Driver Components:
> > > + * - GPU page fault handler: Used to create ranges and notifiers
> > > based on the
> > > + *			     fault address, optionally migrate the range
> > > to
> > > + *			     VRAM, and create GPU bindings.
> > > + * - Garbage collector: Used to destroy GPU bindings for ranges.
> > > Ranges are
> > > + *			expected to be added to the garbage collector
> > > upon
> > > + *			MMU_NOTIFY_UNMAP event.
> > > + */
> > > +
> > > +/**
> > > + * DOC: Locking
> > > + *
> > > + * GPU SVM handles locking for core MM interactions, i.e., it
> > > locks/unlocks the
> > > + * mmap lock as needed. Alternatively, if the driver prefers to
> > > handle the mmap
> > > + * lock itself, a 'locked' argument is provided to the functions that
> > > require
> > > + * the mmap lock. This option may be useful for drivers that need
> to
> > > call into
> > > + * GPU SVM while also holding a dma-resv lock, thus preventing
> > > locking
> > > + * inversions between the mmap and dma-resv locks.
> > > + *
> > > + * GPU SVM introduces a global notifier lock, which safeguards
> the
> > > notifier's
> > > + * range RB tree and list, as well as the range's DMA mappings
> and
> > > sequence
> > > + * number. GPU SVM manages all necessary locking and
> unlocking
> > > operations,
> > > + * except for the recheck of the range's sequence number
> > > + * (mmu_interval_read_retry) when the driver is committing
> GPU
> > > bindings. This
> > > + * lock corresponds to the 'driver->update' lock mentioned in the
> > > HMM
> > > + * documentation (TODO: Link). Future revisions may transition
> from
> > > a GPU SVM
> > > + * global lock to a per-notifier lock if finer-grained locking is
> deemed
> > > + * necessary.
> > > + *
> > > + * In addition to the locking mentioned above, the driver should
> > > implement a
> > > + * lock to safeguard core GPU SVM function calls that modify
> state,
> > > such as
> > > + * drm_gpusvm_range_find_or_insert and
> > > drm_gpusvm_range_remove. Alternatively,
> > > + * these core functions can be called within a single kernel thread,
> > > for
> > > + * instance, using an ordered work queue. This lock is denoted as
> > > + * 'driver_svm_lock' in code examples.
> > > + */
> > > +
> > > +/**
> > > + * DOC: Migrataion
> > > + *
> > > + * The migration support is quite simple, allowing migration
> between
> > > SRAM and
> > > + * VRAM at the range granularity. For example, GPU SVM
> currently
> > > does not
> > > + * support mixing SRAM and VRAM pages within a range. This
> means
> > > that upon GPU
> > > + * fault, the entire range can be migrated to VRAM, and upon
> CPU
> > > fault, the
> > > + * entire range is migrated to SRAM.
> > > + *
> > > + * The reasoning for only supporting range granularity is as
> follows: it
> > > + * simplifies the implementation, and range sizes are driver-
> defined
> > > and should
> > > + * be relatively small.
> >
> > Migration at range granularity just couples the physical world with the
> > virtual world, which is against the fundamental page-centric design we
> > aligned on before.
> >
> > Looking at core mm behavior, the shrinking/swapping doesn't operate at vma
> > or any virtual range granularity. This way we swap out the less frequently
> > used pages and keep the more frequently used pages in ram.
> >
> > Similar thing should be done for vram migration to sram.
> >
> > > + */
> > > +
> > > +/**
> > > + * DOC: Partial Unmapping of Ranges
> > > + *
> > > + * Partial unmapping of ranges (e.g., 1M out of 2M is unmapped
> by
> > > CPU resulting
> > > + * in MMU_NOTIFY_UNMAP event) presents several challenges,
> >
> > As said above, the challenge comes from a design choice. In a
> > page-centric design, the challenges don't exist at all.
> 
> See my other reply, as long as migrate_to_ram is entirely page centric
> we're fine. And I think Matt fixed that now.
> 
> The other aspect of being page centric is gpu pagetable locking, and there
> I also gained a lot of clarity on what exactly matters, and what doesn't.
> The mmu_notifier -> range -> page design wouldn't be my personal first
> choice, but it is a perfectly ok one I think. As long as we follow all the
> other rules we need to follow about page-centric locking/refcounting/pte
> invalidation that migrate_to_ram requires.
> 
> Cheers, Sima
> 
> 
> > > with the main one
> > > + * being that a subset of the range still has CPU and GPU
> mappings.
> > > If the
> > > + * backing store for the range is in VRAM, a subset of the backing
> > > store has
> > > + * references. One option would be to split the range and VRAM
> > > backing store,
> > > + * but the implementation for this would be quite complicated.
> > > Given that
> > > + * partial unmappings are rare and driver-defined range sizes are
> > > relatively
> > > + * small, GPU SVM does not support splitting of ranges.
> > > + *
> > > + * With no support for range splitting, upon partial unmapping of
> a
> > > range, the
> > > + * driver is expected to invalidate and destroy the entire range. If
> > > the range
> > > + * has VRAM as its backing, the driver is also expected to migrate
> any
> > > remaining
> > > + * pages back to SRAM.
> > > + */
> > > +
> > > +/**
> > > + * DOC: Examples
> > > + *
> > > + * This section provides two examples of how to build the
> expected
> > > driver
> > > + * components: the GPU page fault handler and the garbage
> > > collector. A third
> > > + * example demonstrates a sample invalidation driver vfunc.
> > > + *
> > > + * The generic code provided does not include logic for complex
> > > migration
> > > + * policies, optimized invalidations, or other potentially required
> > > driver
> > > + * locking (e.g., DMA-resv locks).
> > > + *
> > > + * 1) GPU page fault handler
> > > + *
> > > + *	int driver_bind_range(struct drm_gpusvm *gpusvm, struct
> > > drm_gpusvm_range *range)
> > > + *	{
> > > + *		int err = 0;
> > > + *
> > > + *		driver_alloc_and_setup_memory_for_bind(gpusvm,
> > > range);
> > > + *
> > > + *		drm_gpusvm_notifier_lock(gpusvm);
> > > + *		if (drm_gpusvm_range_pages_valid(range))
> > > + *			driver_commit_bind(gpusvm, range);
> > > + *		else
> > > + *			err = -EAGAIN;
> > > + *		drm_gpusvm_notifier_unlock(gpusvm);
> > > + *
> > > + *		return err;
> > > + *	}
> > > + *
> > > + *	int driver_gpu_fault(struct drm_gpusvm *gpusvm, u64
> > > fault_addr,
> > > + *			     u64 gpuva_start, u64 gpuva_end)
> > > + *	{
> > > + *		struct drm_gpusvm_ctx ctx = {};
> > > + *		int err;
> > > + *
> > > + *		driver_svm_lock();
> > > + *	retry:
> > > + *		// Always process UNMAPs first so view of GPU SVM
> > > ranges is current
> > > + *		driver_garbage_collector(gpusvm);
> > > + *
> > > + *		range = drm_gpusvm_range_find_or_insert(gpusvm,
> > > fault_addr,
> > > + *							gpuva_start,
> > > gpuva_end,
> > > + *						        &ctx);
> > > + *		if (IS_ERR(range)) {
> > > + *			err = PTR_ERR(range);
> > > + *			goto unlock;
> > > + *		}
> > > + *
> > > + *		if (driver_migration_policy(range)) {
> > > + *			bo = driver_alloc_bo();
> > > + *			err = drm_gpusvm_migrate_to_vram(gpusvm,
> > > range, bo, &ctx);
> > > + *			if (err)	// CPU mappings may have changed
> > > + *				goto retry;
> > > + *		}
> > > + *
> > > + *		err = drm_gpusvm_range_get_pages(gpusvm, range,
> > > &ctx);
> > > + *		if (err == -EFAULT || err == -EPERM)	// CPU
> > > mappings changed
> > > + *			goto retry;
> > > + *		else if (err)
> > > + *			goto unlock;
> > > + *
> > > + *		err = driver_bind_range(gpusvm, range);
> > > + *		if (err == -EAGAIN)	// CPU mappings changed
> > > + *			goto retry
> > > + *
> > > + *	unlock:
> > > + *		driver_svm_unlock();
> > > + *		return err;
> > > + *	}
> > > + *
> > > + * 2) Garbage Collector.
> > > + *
> > > + *	void __driver_garbage_collector(struct drm_gpusvm
> > > *gpusvm,
> > > + *					struct drm_gpusvm_range
> > > *range)
> > > + *	{
> > > + *		struct drm_gpusvm_ctx ctx = {};
> > > + *
> > > + *		assert_driver_svm_locked(gpusvm);
> > > + *
> > > + *		// Partial unmap, migrate any remaining VRAM pages
> > > back to SRAM
> > > + *		if (range->flags.partial_unmap)
> > > + *			drm_gpusvm_migrate_to_sram(gpusvm,
> > > range, &ctx);
> > > + *
> > > + *		driver_unbind_range(range);
> > > + *		drm_gpusvm_range_remove(gpusvm, range);
> > > + *	}
> > > + *
> > > + *	void driver_garbage_collector(struct drm_gpusvm *gpusvm)
> > > + *	{
> > > + *		assert_driver_svm_locked(gpusvm);
> > > + *
> > > + *		for_each_range_in_garbage_collector(gpusvm, range)
> > > + *			__driver_garbage_collector(gpusvm, range);
> > > + *	}
> > > + *
> > > + * 3) Invalidation driver vfunc.
> > > + *
> > > + *	void driver_invalidation(struct drm_gpusvm *gpusvm,
> > > + *				 struct drm_gpusvm_notifier *notifier,
> > > + *				 const struct mmu_notifier_range
> > > *mmu_range)
> > > + *	{
> > > + *		struct drm_gpusvm_ctx ctx = { .in_notifier = true, };
> > > + *		struct drm_gpusvm_range *range = NULL;
> > > + *
> > > + *		driver_invalidate_device_tlb(gpusvm, mmu_range-
> > > >start, mmu_range->end);
> > > + *
> > > + *		drm_gpusvm_for_each_range(range, notifier,
> > > mmu_range->start,
> > > + *					  mmu_range->end) {
> > > + *			drm_gpusvm_range_unmap_pages(gpusvm,
> > > range, &ctx);
> > > + *
> > > + *			if (mmu_range->event !=
> > > MMU_NOTIFY_UNMAP)
> > > + *				continue;
> > > + *
> > > + *			drm_gpusvm_range_set_unmapped(range,
> > > mmu_range);
> > > + *			driver_garbage_collector_add(gpusvm,
> > > range);
> > > + *		}
> > > + *	}
> > > + */
> > > +
> > > +#define DRM_GPUSVM_RANGE_START(_range)	((_range)-
> > > >va.start)
> > > +#define DRM_GPUSVM_RANGE_END(_range)	((_range)-
> > > >va.end - 1)
> > > +INTERVAL_TREE_DEFINE(struct drm_gpusvm_range, rb.node,
> u64,
> > > rb.__subtree_last,
> > > +		     DRM_GPUSVM_RANGE_START,
> > > DRM_GPUSVM_RANGE_END,
> > > +		     static __maybe_unused, range);
> > > +
> > > +#define DRM_GPUSVM_NOTIFIER_START(_notifier)
> 	((_notifier)-
> > > >interval.start)
> > > +#define DRM_GPUSVM_NOTIFIER_END(_notifier)
> 	((_notifier)-
> > > >interval.end - 1)
> > > +INTERVAL_TREE_DEFINE(struct drm_gpusvm_notifier, rb.node,
> u64,
> > > +		     rb.__subtree_last,
> > > DRM_GPUSVM_NOTIFIER_START,
> > > +		     DRM_GPUSVM_NOTIFIER_END, static
> > > __maybe_unused, notifier);
> > > +
> > > +/**
> > > + * npages_in_range() - Calculate the number of pages in a given
> > > range
> > > + * @start__: The start address of the range
> > > + * @end__: The end address of the range
> > > + *
> > > + * This macro calculates the number of pages in a given memory
> > > range,
> > > + * specified by the start and end addresses. It divides the
> difference
> > > + * between the end and start addresses by the page size
> > > (PAGE_SIZE) to
> > > + * determine the number of pages in the range.
> > > + *
> > > + * Return: The number of pages in the specified range.
> > > + */
> > > +#define npages_in_range(start__, end__)	\
> > > +	(((end__) - (start__)) >> PAGE_SHIFT)
> > > +
> > > +/**
> > > + * struct drm_gpusvm_zdd - GPU SVM zone device data
> > > + *
> > > + * @refcount: Reference count for the zdd
> > > + * @destroy_work: Work structure for asynchronous zdd
> > > destruction
> > > + * @range: Pointer to the GPU SVM range
> > > + * @vram_allocation: Driver-private pointer to the VRAM
> allocation
> > > + *
> > > + * This structure serves as a generic wrapper installed in
> > > + * page->zone_device_data. It provides infrastructure for
> looking up
> > > a range
> > > + * upon CPU page fault and asynchronously releasing VRAM once
> > > the CPU has no
> > > + * page references. Asynchronous release is useful because CPU
> > > page references
> > > + * can be dropped in IRQ contexts, while releasing VRAM likely
> > > requires sleeping
> > > + * locks.
> > > + */
> > > +struct drm_gpusvm_zdd {
> > > +	struct kref refcount;
> > > +	struct work_struct destroy_work;
> > > +	struct drm_gpusvm_range *range;
> > > +	void *vram_allocation;
> > > +};
> > > +
> > > +/**
> > > + * drm_gpusvm_zdd_destroy_work_func - Work function for
> > > destroying a zdd
> > > + * @w: Pointer to the work_struct
> > > + *
> > > + * This function releases VRAM, puts GPU SVM range, and frees
> zdd.
> > > + */
> > > +static void drm_gpusvm_zdd_destroy_work_func(struct
> > > work_struct *w)
> > > +{
> > > +	struct drm_gpusvm_zdd *zdd =
> > > +		container_of(w, struct drm_gpusvm_zdd,
> > > destroy_work);
> > > +	struct drm_gpusvm_range *range = zdd->range;
> > > +	struct drm_gpusvm *gpusvm = range->gpusvm;
> > > +
> > > +	if (gpusvm->ops->vram_release && zdd->vram_allocation)
> > > +		gpusvm->ops->vram_release(zdd->vram_allocation);
> > > +	drm_gpusvm_range_put(range);
> > > +	kfree(zdd);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_zdd_alloc - Allocate a zdd structure.
> > > + * @range: Pointer to the GPU SVM range.
> > > + *
> > > + * This function allocates and initializes a new zdd structure. It
> sets
> > > up the
> > > + * reference count, initializes the destroy work, and links the
> > > provided GPU SVM
> > > + * range.
> > > + *
> > > + * Returns:
> > > + * Pointer to the allocated zdd on success, ERR_PTR() on failure.
> > > + */
> > > +static struct drm_gpusvm_zdd *
> > > +drm_gpusvm_zdd_alloc(struct drm_gpusvm_range *range)
> > > +{
> > > +	struct drm_gpusvm_zdd *zdd;
> > > +
> > > +	zdd = kmalloc(sizeof(*zdd), GFP_KERNEL);
> > > +	if (!zdd)
> > > +		return NULL;
> > > +
> > > +	kref_init(&zdd->refcount);
> > > +	INIT_WORK(&zdd->destroy_work,
> > > drm_gpusvm_zdd_destroy_work_func);
> > > +	zdd->range = drm_gpusvm_range_get(range);
> > > +	zdd->vram_allocation = NULL;
> > > +
> > > +	return zdd;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_zdd_get - Get a reference to a zdd structure.
> > > + * @zdd: Pointer to the zdd structure.
> > > + *
> > > + * This function increments the reference count of the provided
> zdd
> > > structure.
> > > + *
> > > + * Returns: Pointer to the zdd structure.
> > > + */
> > > +static struct drm_gpusvm_zdd *drm_gpusvm_zdd_get(struct
> > > drm_gpusvm_zdd *zdd)
> > > +{
> > > +	kref_get(&zdd->refcount);
> > > +	return zdd;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_zdd_destroy - Destroy a zdd structure.
> > > + * @ref: Pointer to the reference count structure.
> > > + *
> > > + * This function queues the destroy_work of the zdd for
> > > asynchronous destruction.
> > > + */
> > > +static void drm_gpusvm_zdd_destroy(struct kref *ref)
> > > +{
> > > +	struct drm_gpusvm_zdd *zdd =
> > > +		container_of(ref, struct drm_gpusvm_zdd, refcount);
> > > +	struct drm_gpusvm *gpusvm = zdd->range->gpusvm;
> > > +
> > > +	queue_work(gpusvm->zdd_wq, &zdd->destroy_work);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_zdd_put - Put a zdd reference.
> > > + * @zdd: Pointer to the zdd structure.
> > > + *
> > > + * This function decrements the reference count of the provided
> zdd
> > > structure
> > > + * and schedules its destruction if the count drops to zero.
> > > + */
> > > +static void drm_gpusvm_zdd_put(struct drm_gpusvm_zdd *zdd)
> > > +{
> > > +	kref_put(&zdd->refcount, drm_gpusvm_zdd_destroy);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_find - Find GPU SVM range from GPU
> SVM
> > > notifier
> > > + * @notifier: Pointer to the GPU SVM notifier structure.
> > > + * @start: Start address of the range
> > > + * @end: End address of the range
> > > + *
> > > + * Return: A pointer to the drm_gpusvm_range if found or NULL
> > > + */
> > > +struct drm_gpusvm_range *
> > > +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier,
> u64
> > > start, u64 end)
> > > +{
> > > +	return range_iter_first(&notifier->root, start, end - 1);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_for_each_range_safe - Safely iterate over GPU
> > > SVM ranges in a notifier
> > > + * @range__: Iterator variable for the ranges
> > > + * @next__: Iterator variable for the ranges temporay storage
> > > + * @notifier__: Pointer to the GPU SVM notifier
> > > + * @start__: Start address of the range
> > > + * @end__: End address of the range
> > > + *
> > > + * This macro is used to iterate over GPU SVM ranges in a notifier
> > > while
> > > + * removing ranges from it.
> > > + */
> > > +#define drm_gpusvm_for_each_range_safe(range__, next__,
> > > notifier__, start__, end__)	\
> > > +	for ((range__) = drm_gpusvm_range_find((notifier__),
> > > (start__), (end__)),	\
> > > +	     (next__) = __drm_gpusvm_range_next(range__);
> > > 			\
> > > +	     (range__) && (range__->va.start < (end__));
> > > 			\
> > > +	     (range__) = (next__), (next__) =
> > > __drm_gpusvm_range_next(range__))
> > > +
> > > +/**
> > > + * __drm_gpusvm_notifier_next - get the next
> > > drm_gpusvm_notifier in the list
> > > + * @notifier: a pointer to the current drm_gpusvm_notifier
> > > + *
> > > + * Return: A pointer to the next drm_gpusvm_notifier if available,
> or
> > > NULL if
> > > + *         the current notifier is the last one or if the input notifier is
> > > + *         NULL.
> > > + */
> > > +static struct drm_gpusvm_notifier *
> > > +__drm_gpusvm_notifier_next(struct drm_gpusvm_notifier
> > > *notifier)
> > > +{
> > > +	if (notifier && !list_is_last(&notifier->rb.entry,
> > > +				      &notifier->gpusvm->notifier_list))
> > > +		return list_next_entry(notifier, rb.entry);
> > > +
> > > +	return NULL;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_for_each_notifier - Iterate over GPU SVM
> notifiers
> > > in a gpusvm
> > > + * @notifier__: Iterator variable for the notifiers
> > > + * @notifier__: Pointer to the GPU SVM notifier
> > > + * @start__: Start address of the notifier
> > > + * @end__: End address of the notifier
> > > + *
> > > + * This macro is used to iterate over GPU SVM notifiers in a
> gpusvm.
> > > + */
> > > +#define drm_gpusvm_for_each_notifier(notifier__, gpusvm__,
> > > start__, end__)		\
> > > +	for ((notifier__) = notifier_iter_first(&(gpusvm__)->root,
> > > (start__), (end__) - 1);	\
> > > +	     (notifier__) && (notifier__->interval.start < (end__));
> > > 			\
> > > +	     (notifier__) = __drm_gpusvm_notifier_next(notifier__))
> > > +
> > > +/**
> > > + * drm_gpusvm_for_each_notifier_safe - Safely iterate over
> GPU
> > > SVM notifiers in a gpusvm
> > > + * @notifier__: Iterator variable for the notifiers
> > > + * @next__: Iterator variable for the notifiers temporay storage
> > > + * @notifier__: Pointer to the GPU SVM notifier
> > > + * @start__: Start address of the notifier
> > > + * @end__: End address of the notifier
> > > + *
> > > + * This macro is used to iterate over GPU SVM notifiers in a
> gpusvm
> > > while
> > > + * removing notifiers from it.
> > > + */
> > > +#define drm_gpusvm_for_each_notifier_safe(notifier__,
> next__,
> > > gpusvm__, start__, end__)	\
> > > +	for ((notifier__) = notifier_iter_first(&(gpusvm__)->root,
> > > (start__), (end__) - 1),	\
> > > +	     (next__) = __drm_gpusvm_notifier_next(notifier__);
> > > 				\
> > > +	     (notifier__) && (notifier__->interval.start < (end__));
> > > 			\
> > > +	     (notifier__) = (next__), (next__) =
> > > __drm_gpusvm_notifier_next(notifier__))
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_invalidate - Invalidate a GPU SVM
> notifier.
> > > + * @mni: Pointer to the mmu_interval_notifier structure.
> > > + * @mmu_range: Pointer to the mmu_notifier_range structure.
> > > + * @cur_seq: Current sequence number.
> > > + *
> > > + * This function serves as a generic MMU notifier for GPU SVM. It
> > > sets the MMU
> > > + * notifier sequence number and calls the driver invalidate vfunc
> > > under
> > > + * gpusvm->notifier_lock.
> > > + *
> > > + * Returns:
> > > + * true if the operation succeeds, false otherwise.
> > > + */
> > > +static bool
> > > +drm_gpusvm_notifier_invalidate(struct mmu_interval_notifier
> *mni,
> > > +			       const struct mmu_notifier_range
> > > *mmu_range,
> > > +			       unsigned long cur_seq)
> > > +{
> > > +	struct drm_gpusvm_notifier *notifier =
> > > +		container_of(mni, typeof(*notifier), notifier);
> > > +	struct drm_gpusvm *gpusvm = notifier->gpusvm;
> > > +
> > > +	if (!mmu_notifier_range_blockable(mmu_range))
> > > +		return false;
> > > +
> > > +	down_write(&gpusvm->notifier_lock);
> > > +	mmu_interval_set_seq(mni, cur_seq);
> > > +	gpusvm->ops->invalidate(gpusvm, notifier, mmu_range);
> > > +	up_write(&gpusvm->notifier_lock);
> > > +
> > > +	return true;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_ops - MMU interval notifier operations
> for
> > > GPU SVM
> > > + */
> > > +static const struct mmu_interval_notifier_ops
> > > drm_gpusvm_notifier_ops = {
> > > +	.invalidate = drm_gpusvm_notifier_invalidate,
> > > +};
> > > +
> > > +/**
> > > + * drm_gpusvm_init - Initialize the GPU SVM.
> > > + * @gpusvm: Pointer to the GPU SVM structure.
> > > + * @name: Name of the GPU SVM.
> > > + * @drm: Pointer to the DRM device structure.
> > > + * @mm: Pointer to the mm_struct for the address space.
> > > + * @device_private_page_owner: Device private pages owner.
> > > + * @mm_start: Start address of GPU SVM.
> > > + * @mm_range: Range of the GPU SVM.
> > > + * @notifier_size: Size of individual notifiers.
> > > + * @ops: Pointer to the operations structure for GPU SVM.
> > > + * @chunk_sizes: Pointer to the array of chunk sizes used in
> range
> > > allocation.
> > > + *               Entries should be powers of 2 in descending order with
> last
> > > + *               entry being SZ_4K.
> > > + * @num_chunks: Number of chunks.
> > > + *
> > > + * This function initializes the GPU SVM.
> > > + *
> > > + * Returns:
> > > + * 0 on success, a negative error code on failure.
> > > + */
> > > +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> > > +		    const char *name, struct drm_device *drm,
> > > +		    struct mm_struct *mm, void
> > > *device_private_page_owner,
> > > +		    u64 mm_start, u64 mm_range, u64 notifier_size,
> > > +		    const struct drm_gpusvm_ops *ops,
> > > +		    const u64 *chunk_sizes, int num_chunks)
> > > +{
> > > +	if (!ops->invalidate || !num_chunks)
> > > +		return -EINVAL;
> > > +
> > > +	gpusvm->name = name;
> > > +	gpusvm->drm = drm;
> > > +	gpusvm->mm = mm;
> > > +	gpusvm->device_private_page_owner =
> > > device_private_page_owner;
> > > +	gpusvm->mm_start = mm_start;
> > > +	gpusvm->mm_range = mm_range;
> > > +	gpusvm->notifier_size = notifier_size;
> > > +	gpusvm->ops = ops;
> > > +	gpusvm->chunk_sizes = chunk_sizes;
> > > +	gpusvm->num_chunks = num_chunks;
> > > +	gpusvm->zdd_wq = system_wq;
> > > +
> > > +	mmgrab(mm);
> > > +	gpusvm->root = RB_ROOT_CACHED;
> > > +	INIT_LIST_HEAD(&gpusvm->notifier_list);
> > > +
> > > +	init_rwsem(&gpusvm->notifier_lock);
> > > +
> > > +	fs_reclaim_acquire(GFP_KERNEL);
> > > +	might_lock(&gpusvm->notifier_lock);
> > > +	fs_reclaim_release(GFP_KERNEL);
> > > +
> > > +	return 0;
> > > +}
> > > +
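
To make the intended setup concrete, here is a minimal sketch of a driver-side init call. The names vm, xe, gpusvm_ops and fault_chunk_sizes are placeholders only; the actual Xe wiring lives in the later patches of the series.

	static const u64 fault_chunk_sizes[] = {
		SZ_2M, SZ_64K, SZ_4K,	/* descending powers of 2, last entry SZ_4K */
	};

	/* Hypothetical driver setup: one GPU SVM instance per VM */
	err = drm_gpusvm_init(&vm->svm, "Xe SVM", &xe->drm, current->mm,
			      xe,		/* device private page owner */
			      0, vm->size,	/* mm_start, mm_range */
			      SZ_512M,		/* notifier_size */
			      &gpusvm_ops, fault_chunk_sizes,
			      ARRAY_SIZE(fault_chunk_sizes));
	if (err)
		return err;
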
> > > +/**
> > > + * drm_gpusvm_notifier_find - Find GPU SVM notifier
> > > + * @gpusvm__: Pointer to the GPU SVM structure
> > > + * @fault_addr__: Fault address
> > > + *
> > > + * This macro finds the GPU SVM notifier associated with the
> fault
> > > address.
> > > + *
> > > + * Returns:
> > > + * Pointer to the GPU SVM notifier on success, NULL otherwise.
> > > + */
> > > +#define drm_gpusvm_notifier_find(gpusvm__, fault_addr__)
> > > 	\
> > > +	notifier_iter_first(&(gpusvm__)->root, (fault_addr__),
> > > 	\
> > > +			    (fault_addr__ + 1))
> > > +
> > > +/**
> > > + * to_drm_gpusvm_notifier - retrieve the container struct for a
> > > given rbtree node
> > > + * @node__: a pointer to the rbtree node embedded within a
> > > drm_gpusvm_notifier struct
> > > + *
> > > + * Return: A pointer to the containing drm_gpusvm_notifier
> > > structure.
> > > + */
> > > +#define to_drm_gpusvm_notifier(__node)
> > > 	\
> > > +	container_of((__node), struct drm_gpusvm_notifier, rb.node)
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_insert - Insert GPU SVM notifier
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > + *
> > > + * This function inserts the GPU SVM notifier into the GPU SVM
> RB
> > > tree and list.
> > > + */
> > > +static void drm_gpusvm_notifier_insert(struct drm_gpusvm
> > > *gpusvm,
> > > +				       struct drm_gpusvm_notifier
> > > *notifier)
> > > +{
> > > +	struct rb_node *node;
> > > +	struct list_head *head;
> > > +
> > > +	notifier_insert(notifier, &gpusvm->root);
> > > +
> > > +	node = rb_prev(&notifier->rb.node);
> > > +	if (node)
> > > +		head = &(to_drm_gpusvm_notifier(node))->rb.entry;
> > > +	else
> > > +		head = &gpusvm->notifier_list;
> > > +
> > > +	list_add(&notifier->rb.entry, head);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_remove - Remove GPU SVM notifier
> > > + * @gpusvm__: Pointer to the GPU SVM structure
> > > + * @notifier__: Pointer to the GPU SVM notifier structure
> > > + *
> > > + * This macro removes the GPU SVM notifier from the GPU SVM
> RB
> > > tree and list.
> > > + */
> > > +#define drm_gpusvm_notifier_remove(gpusvm__, notifier__)
> > > 	\
> > > +	notifier_remove((notifier__), &(gpusvm__)->root);	\
> > > +	list_del(&(notifier__)->rb.entry)
> > > +
> > > +/**
> > > + * drm_gpusvm_fini - Finalize the GPU SVM.
> > > + * @gpusvm: Pointer to the GPU SVM structure.
> > > + *
> > > + * This function finalizes the GPU SVM by cleaning up any
> remaining
> > > ranges and
> > > + * notifiers, and dropping a reference to struct MM.
> > > + */
> > > +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm)
> > > +{
> > > +	struct drm_gpusvm_notifier *notifier, *next;
> > > +
> > > +	drm_gpusvm_for_each_notifier_safe(notifier, next, gpusvm,
> > > 0, LONG_MAX) {
> > > +		struct drm_gpusvm_range *range, *__next;
> > > +
> > > +		/*
> > > +		 * Remove notifier first to avoid racing with any
> > > invalidation
> > > +		 */
> > > +		mmu_interval_notifier_remove(&notifier->notifier);
> > > +		notifier->flags.removed = true;
> > > +
> > > +		drm_gpusvm_for_each_range_safe(range, __next,
> > > notifier, 0,
> > > +					       LONG_MAX)
> > > +			drm_gpusvm_range_remove(gpusvm, range);
> > > +	}
> > > +
> > > +	mmdrop(gpusvm->mm);
> > > +	WARN_ON(!RB_EMPTY_ROOT(&gpusvm->root.rb_root));
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_alloc - Allocate GPU SVM notifier
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @fault_addr: Fault address
> > > + *
> > > + * This function allocates and initializes the GPU SVM notifier
> > > structure.
> > > + *
> > > + * Returns:
> > > + * Pointer to the allocated GPU SVM notifier on success,
> ERR_PTR()
> > > on failure.
> > > + */
> > > +static struct drm_gpusvm_notifier *
> > > +drm_gpusvm_notifier_alloc(struct drm_gpusvm *gpusvm, u64
> > > fault_addr)
> > > +{
> > > +	struct drm_gpusvm_notifier *notifier;
> > > +
> > > +	if (gpusvm->ops->notifier_alloc)
> > > +		notifier = gpusvm->ops->notifier_alloc();
> > > +	else
> > > +		notifier = kzalloc(sizeof(*notifier), GFP_KERNEL);
> > > +
> > > +	if (!notifier)
> > > +		return ERR_PTR(-ENOMEM);
> > > +
> > > +	notifier->gpusvm = gpusvm;
> > > +	notifier->interval.start = ALIGN_DOWN(fault_addr, gpusvm-
> > > >notifier_size);
> > > +	notifier->interval.end = ALIGN(fault_addr + 1, gpusvm-
> > > >notifier_size);
> > > +	INIT_LIST_HEAD(&notifier->rb.entry);
> > > +	notifier->root = RB_ROOT_CACHED;
> > > +	INIT_LIST_HEAD(&notifier->range_list);
> > > +
> > > +	return notifier;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_free - Free GPU SVM notifier
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > + *
> > > + * This function frees the GPU SVM notifier structure.
> > > + */
> > > +static void drm_gpusvm_notifier_free(struct drm_gpusvm
> *gpusvm,
> > > +				     struct drm_gpusvm_notifier
> > > *notifier)
> > > +{
> > > +	WARN_ON(!RB_EMPTY_ROOT(&notifier->root.rb_root));
> > > +
> > > +	if (gpusvm->ops->notifier_free)
> > > +		gpusvm->ops->notifier_free(notifier);
> > > +	else
> > > +		kfree(notifier);
> > > +}
> > > +
> > > +/**
> > > + * to_drm_gpusvm_range - retrieve the container struct for a
> given
> > > rbtree node
> > > + * @node__: a pointer to the rbtree node embedded within a
> > > drm_gpusvm_range struct
> > > + *
> > > + * Return: A pointer to the containing drm_gpusvm_range
> structure.
> > > + */
> > > +#define to_drm_gpusvm_range(node__)	\
> > > +	container_of((node__), struct drm_gpusvm_range, rb.node)
> > > +
> > > +/**
> > > + * drm_gpusvm_range_insert - Insert GPU SVM range
> > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + *
> > > + * This function inserts the GPU SVM range into the notifier RB
> tree
> > > and list.
> > > + */
> > > +static void drm_gpusvm_range_insert(struct
> drm_gpusvm_notifier
> > > *notifier,
> > > +				    struct drm_gpusvm_range *range)
> > > +{
> > > +	struct rb_node *node;
> > > +	struct list_head *head;
> > > +
> > > +	drm_gpusvm_notifier_lock(notifier->gpusvm);
> > > +	range_insert(range, &notifier->root);
> > > +
> > > +	node = rb_prev(&range->rb.node);
> > > +	if (node)
> > > +		head = &(to_drm_gpusvm_range(node))->rb.entry;
> > > +	else
> > > +		head = &notifier->range_list;
> > > +
> > > +	list_add(&range->rb.entry, head);
> > > +	drm_gpusvm_notifier_unlock(notifier->gpusvm);
> > > +}
> > > +
> > > +/**
> > > + * __drm_gpusvm_range_remove - Remove GPU SVM range
> > > + * @notifier__: Pointer to the GPU SVM notifier structure
> > > + * @range__: Pointer to the GPU SVM range structure
> > > + *
> > > + * This macro removes the GPU SVM range from the notifier RB
> tree
> > > and list.
> > > + */
> > > +#define __drm_gpusvm_range_remove(notifier__, range__)
> > > 		\
> > > +	range_remove((range__), &(notifier__)->root);
> > > 	\
> > > +	list_del(&(range__)->rb.entry)
> > > +
> > > +/**
> > > + * drm_gpusvm_range_alloc - Allocate GPU SVM range
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > + * @fault_addr: Fault address
> > > + * @chunk_size: Chunk size
> > > + * @migrate_vram: Flag indicating whether to migrate VRAM
> > > + *
> > > + * This function allocates and initializes the GPU SVM range
> structure.
> > > + *
> > > + * Returns:
> > > + * Pointer to the allocated GPU SVM range on success, ERR_PTR()
> on
> > > failure.
> > > + */
> > > +static struct drm_gpusvm_range *
> > > +drm_gpusvm_range_alloc(struct drm_gpusvm *gpusvm,
> > > +		       struct drm_gpusvm_notifier *notifier,
> > > +		       u64 fault_addr, u64 chunk_size, bool
> > > migrate_vram)
> > > +{
> > > +	struct drm_gpusvm_range *range;
> > > +
> > > +	if (gpusvm->ops->range_alloc)
> > > +		range = gpusvm->ops->range_alloc(gpusvm);
> > > +	else
> > > +		range = kzalloc(sizeof(*range), GFP_KERNEL);
> > > +
> > > +	if (!range)
> > > +		return ERR_PTR(-ENOMEM);
> > > +
> > > +	kref_init(&range->refcount);
> > > +	range->gpusvm = gpusvm;
> > > +	range->notifier = notifier;
> > > +	range->va.start = ALIGN_DOWN(fault_addr, chunk_size);
> > > +	range->va.end = ALIGN(fault_addr + 1, chunk_size);
> > > +	INIT_LIST_HEAD(&range->rb.entry);
> > > +	range->notifier_seq = LONG_MAX;
> > > +	range->flags.migrate_vram = migrate_vram ? 1 : 0;
> > > +
> > > +	return range;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_check_pages - Check pages
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > + * @start: Start address
> > > + * @end: End address
> > > + *
> > > + * Check if pages between start and end have been faulted in on the
> > > + * CPU. Used to prevent migration of pages without a CPU backing store.
> > > + *
> > > + * Returns:
> > > + * True if pages have been faulted into CPU, False otherwise
> > > + */
> > > +static bool drm_gpusvm_check_pages(struct drm_gpusvm
> *gpusvm,
> > > +				   struct drm_gpusvm_notifier
> > > *notifier,
> > > +				   u64 start, u64 end)
> > > +{
> > > +	struct hmm_range hmm_range = {
> > > +		.default_flags = 0,
> > > +		.notifier = &notifier->notifier,
> > > +		.start = start,
> > > +		.end = end,
> > > +		.dev_private_owner = gpusvm-
> > > >device_private_page_owner,
> > > +	};
> > > +	unsigned long timeout =
> > > +		jiffies +
> > > msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > > +	unsigned long *pfns;
> > > +	unsigned long npages = npages_in_range(start, end);
> > > +	int err, i;
> > > +
> > > +	mmap_assert_locked(gpusvm->mm);
> > > +
> > > +	pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
> > > +	if (!pfns)
> > > +		return false;
> > > +
> > > +	hmm_range.notifier_seq =
> > > mmu_interval_read_begin(&notifier->notifier);
> > > +	hmm_range.hmm_pfns = pfns;
> > > +
> > > +	while (true) {
> > > +		err = hmm_range_fault(&hmm_range);
> > > +		if (err == -EBUSY) {
> > > +			if (time_after(jiffies, timeout))
> > > +				break;
> > > +
> > > +			hmm_range.notifier_seq =
> > > mmu_interval_read_begin(&notifier->notifier);
> > > +			continue;
> > > +		}
> > > +		break;
> > > +	}
> > > +	if (err)
> > > +		goto err_free;
> > > +
> > > +	for (i = 0; i < npages; ++i) {
> > > +		if (!(pfns[i] & HMM_PFN_VALID)) {
> > > +			err = -EFAULT;
> > > +			goto err_free;
> > > +		}
> > > +	}
> > > +
> > > +err_free:
> > > +	kvfree(pfns);
> > > +	return err ? false : true;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_chunk_size - Determine chunk size for
> GPU
> > > SVM range
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > + * @vas: Pointer to the virtual memory area structure
> > > + * @fault_addr: Fault address
> > > + * @gpuva_start: Start address of GPUVA which mirrors CPU
> > > + * @gpuva_end: End address of GPUVA which mirrors CPU
> > > + * @check_pages: Flag indicating whether to check pages
> > > + *
> > > + * This function determines the chunk size for the GPU SVM
> range
> > > based on the
> > > + * fault address, GPU SVM chunk sizes, existing GPU SVM ranges,
> > > and the virtual
> > > + * memory area boundaries.
> > > + *
> > > + * Returns:
> > > + * Chunk size on success, LONG_MAX on failure.
> > > + */
> > > +static u64 drm_gpusvm_range_chunk_size(struct drm_gpusvm
> > > *gpusvm,
> > > +				       struct drm_gpusvm_notifier
> > > *notifier,
> > > +				       struct vm_area_struct *vas,
> > > +				       u64 fault_addr, u64 gpuva_start,
> > > +				       u64 gpuva_end, bool check_pages)
> > > +{
> > > +	u64 start, end;
> > > +	int i = 0;
> > > +
> > > +retry:
> > > +	for (; i < gpusvm->num_chunks; ++i) {
> > > +		start = ALIGN_DOWN(fault_addr, gpusvm-
> > > >chunk_sizes[i]);
> > > +		end = ALIGN(fault_addr + 1, gpusvm->chunk_sizes[i]);
> > > +
> > > +		if (start >= vas->vm_start && end <= vas->vm_end
> > > &&
> > > +		    start >= notifier->interval.start &&
> > > +		    end <= notifier->interval.end &&
> > > +		    start >= gpuva_start && end <= gpuva_end)
> > > +			break;
> > > +	}
> > > +
> > > +	if (i == gpusvm->num_chunks)
> > > +		return LONG_MAX;
> > > +
> > > +	/*
> > > +	 * If the allocation is more than one page, ensure it does not
> > > +	 * overlap with existing ranges.
> > > +	 */
> > > +	if (end - start != SZ_4K) {
> > > +		struct drm_gpusvm_range *range;
> > > +
> > > +		range = drm_gpusvm_range_find(notifier, start, end);
> > > +		if (range) {
> > > +			++i;
> > > +			goto retry;
> > > +		}
> > > +
> > > +		/*
> > > +		 * XXX: Only create range on pages CPU has faulted in.
> > > Without
> > > +		 * this check, or prefault, on BMG
> > > 'xe_exec_system_allocator --r
> > > +		 * process-many-malloc' fails. In the failure case, each
> > > process
> > > +		 * mallocs 16k but the CPU VMA is ~128k which results
> > > in 64k SVM
> > > +		 * ranges. When migrating the SVM ranges, some
> > > processes fail in
> > > +		 * drm_gpusvm_migrate_to_vram with
> > > 'migrate.cpages != npages'
> > > +		 * and then upon drm_gpusvm_range_get_pages
> > > device pages from
> > > +		 * other processes are collected + faulted in which
> > > creates all
> > > +		 * sorts of problems. Unsure exactly how this is happening;
> > > +		 * the problem also goes away if 'xe_exec_system_allocator
> > > +		 * --r process-many-malloc' mallocs at least 64k at a time.
> > > +		 */
> > > +		if (check_pages &&
> > > +		    !drm_gpusvm_check_pages(gpusvm, notifier, start,
> > > end)) {
> > > +			++i;
> > > +			goto retry;
> > > +		}
> > > +	}
> > > +
> > > +	return end - start;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_find_or_insert - Find or insert GPU SVM
> > > range
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @fault_addr: Fault address
> > > + * @gpuva_start: Start address of GPUVA which mirrors CPU
> > > + * @gpuva_end: End address of GPUVA which mirrors CPU
> > > + * @ctx: GPU SVM context
> > > + *
> > > + * This function finds an existing GPU SVM range or inserts a newly
> > > + * allocated one based on the fault address. The caller must hold a lock
> > > + * to protect range lookup and insertion.
> > > + *
> > > + * Returns:
> > > + * Pointer to the GPU SVM range on success, ERR_PTR() on
> failure.
> > > + */
> > > +struct drm_gpusvm_range *
> > > +drm_gpusvm_range_find_or_insert(struct drm_gpusvm
> *gpusvm,
> > > u64 fault_addr,
> > > +				u64 gpuva_start, u64 gpuva_end,
> > > +				const struct drm_gpusvm_ctx *ctx)
> > > +{
> > > +	struct drm_gpusvm_notifier *notifier;
> > > +	struct drm_gpusvm_range *range;
> > > +	struct mm_struct *mm = gpusvm->mm;
> > > +	struct vm_area_struct *vas;
> > > +	bool notifier_alloc = false;
> > > +	u64 chunk_size;
> > > +	int err;
> > > +	bool migrate_vram;
> > > +
> > > +	if (fault_addr < gpusvm->mm_start ||
> > > +	    fault_addr > gpusvm->mm_start + gpusvm->mm_range) {
> > > +		err = -EINVAL;
> > > +		goto err_out;
> > > +	}
> > > +
> > > +	if (!ctx->mmap_locked) {
> > > +		if (!mmget_not_zero(mm)) {
> > > +			err = -EFAULT;
> > > +			goto err_out;
> > > +		}
> > > +		mmap_write_lock(mm);
> > > +	}
> > > +
> > > +	mmap_assert_write_locked(mm);
> > > +
> > > +	notifier = drm_gpusvm_notifier_find(gpusvm, fault_addr);
> > > +	if (!notifier) {
> > > +		notifier = drm_gpusvm_notifier_alloc(gpusvm,
> > > fault_addr);
> > > +		if (IS_ERR(notifier)) {
> > > +			err = PTR_ERR(notifier);
> > > +			goto err_mmunlock;
> > > +		}
> > > +		notifier_alloc = true;
> > > +		err = mmu_interval_notifier_insert_locked(&notifier-
> > > >notifier,
> > > +							  mm, notifier-
> > > >interval.start,
> > > +							  notifier-
> > > >interval.end -
> > > +							  notifier-
> > > >interval.start,
> > > +
> > > &drm_gpusvm_notifier_ops);
> > > +		if (err)
> > > +			goto err_notifier;
> > > +	}
> > > +
> > > +	vas = vma_lookup(mm, fault_addr);
> > > +	if (!vas) {
> > > +		err = -ENOENT;
> > > +		goto err_notifier_remove;
> > > +	}
> > > +
> > > +	if (!ctx->read_only && !(vas->vm_flags & VM_WRITE)) {
> > > +		err = -EPERM;
> > > +		goto err_notifier_remove;
> > > +	}
> > > +
> > > +	range = drm_gpusvm_range_find(notifier, fault_addr,
> > > fault_addr + 1);
> > > +	if (range)
> > > +		goto out_mmunlock;
> > > +	/*
> > > +	 * XXX: Short-circuiting migration based on migrate_vma_*
> > > current
> > > +	 * limitations. If/when migrate_vma_* add more support, this
> > > logic will
> > > +	 * have to change.
> > > +	 */
> > > +	migrate_vram = ctx->vram_possible &&
> > > +		vma_is_anonymous(vas)
> > > && !is_vm_hugetlb_page(vas);
> > > +
> > > +	chunk_size = drm_gpusvm_range_chunk_size(gpusvm,
> > > notifier, vas,
> > > +						 fault_addr,
> > > gpuva_start,
> > > +						 gpuva_end,
> > > migrate_vram &&
> > > +						 !ctx->prefault);
> > > +	if (chunk_size == LONG_MAX) {
> > > +		err = -EINVAL;
> > > +		goto err_notifier_remove;
> > > +	}
> > > +
> > > +	range = drm_gpusvm_range_alloc(gpusvm, notifier,
> > > fault_addr, chunk_size,
> > > +				       migrate_vram);
> > > +	if (IS_ERR(range)) {
> > > +		err = PTR_ERR(range);
> > > +		goto err_notifier_remove;
> > > +	}
> > > +
> > > +	drm_gpusvm_range_insert(notifier, range);
> > > +	if (notifier_alloc)
> > > +		drm_gpusvm_notifier_insert(gpusvm, notifier);
> > > +
> > > +	if (ctx->prefault) {
> > > +		struct drm_gpusvm_ctx __ctx = *ctx;
> > > +
> > > +		__ctx.mmap_locked = true;
> > > +		err = drm_gpusvm_range_get_pages(gpusvm, range,
> > > &__ctx);
> > > +		if (err)
> > > +			goto err_range_remove;
> > > +	}
> > > +
> > > +out_mmunlock:
> > > +	if (!ctx->mmap_locked) {
> > > +		mmap_write_unlock(mm);
> > > +		mmput(mm);
> > > +	}
> > > +
> > > +	return range;
> > > +
> > > +err_range_remove:
> > > +	__drm_gpusvm_range_remove(notifier, range);
> > > +err_notifier_remove:
> > > +	if (notifier_alloc)
> > > +		mmu_interval_notifier_remove(&notifier->notifier);
> > > +err_notifier:
> > > +	if (notifier_alloc)
> > > +		drm_gpusvm_notifier_free(gpusvm, notifier);
> > > +err_mmunlock:
> > > +	if (!ctx->mmap_locked) {
> > > +		mmap_write_unlock(mm);
> > > +		mmput(mm);
> > > +	}
> > > +err_out:
> > > +	return ERR_PTR(err);
> > > +}
> > > +
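
A sketch of the GPU page fault flow this is meant to back (error handling trimmed; vm->svm, vram_possible and vram_allocation are illustrative names, not part of this patch):

	struct drm_gpusvm_ctx ctx = {
		.vram_possible = vram_possible,	/* hypothetical device capability flag */
	};
	struct drm_gpusvm_range *range;
	int err;

	range = drm_gpusvm_range_find_or_insert(&vm->svm, fault_addr,
						gpuva_start, gpuva_end, &ctx);
	if (IS_ERR(range))
		return PTR_ERR(range);

	if (range->flags.migrate_vram)
		/* best effort; SRAM pages are used if migration fails */
		drm_gpusvm_migrate_to_vram(&vm->svm, range, vram_allocation, &ctx);

	err = drm_gpusvm_range_get_pages(&vm->svm, range, &ctx);
	if (err)
		return err;
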
> > > +/**
> > > + * for_each_dma_page - iterate over pages in a DMA region
> > > + * @i__: the current page index in the iteration
> > > + * @j__: the current page index, log order, in the iteration
> > > + * @npages__: the total number of pages in the DMA region
> > > + * @order__: the order of the pages in the DMA region
> > > + *
> > > + * This macro iterates over each page in a DMA region. The DMA
> > > region
> > > + * is assumed to be composed of 2^@order__ pages, and the
> macro
> > > will
> > > + * step through the region one block of 2^@order__ pages at a
> time.
> > > + */
> > > +#define for_each_dma_page(i__, j__, npages__, order__)
> 	\
> > > +	for ((i__) = 0, (j__) = 0; (i__) < (npages__);	\
> > > +	     (j__)++, (i__) += 0x1 << (order__))
> > > +
> > > +/**
> > > + * __drm_gpusvm_range_unmap_pages - Unmap pages
> associated
> > > with a GPU SVM range (internal)
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + *
> > > + * This function unmaps pages associated with a GPU SVM range. Assumes
> > > + * and asserts that the correct locking is in place when called.
> > > + */
> > > +static void __drm_gpusvm_range_unmap_pages(struct
> > > drm_gpusvm *gpusvm,
> > > +					   struct drm_gpusvm_range
> > > *range)
> > > +{
> > > +	lockdep_assert_held(&gpusvm->notifier_lock);
> > > +
> > > +	if (range->pages) {
> > > +		unsigned long i, j, npages = npages_in_range(range-
> > > >va.start,
> > > +							     range-
> > > >va.end);
> > > +
> > > +		if (range->flags.has_dma_mapping) {
> > > +			for_each_dma_page(i, j, npages, range-
> > > >order)
> > > +				dma_unmap_page(gpusvm->drm-
> > > >dev,
> > > +					       range->dma_addr[j],
> > > +					       PAGE_SIZE << range-
> > > >order,
> > > +					       DMA_BIDIRECTIONAL);
> > > +		}
> > > +
> > > +		range->flags.has_vram_pages = false;
> > > +		range->flags.has_dma_mapping = false;
> > > +	}
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_free_pages - Free pages associated with
> a
> > > GPU SVM range
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + *
> > > + * This function frees pages associated with a GPU SVM range.
> > > + */
> > > +static void drm_gpusvm_range_free_pages(struct drm_gpusvm
> > > *gpusvm,
> > > +					struct drm_gpusvm_range
> > > *range)
> > > +{
> > > +	lockdep_assert_held(&gpusvm->notifier_lock);
> > > +
> > > +	if (range->pages) {
> > > +		if (range->flags.kfree_mapping) {
> > > +			kfree(range->dma_addr);
> > > +			range->flags.kfree_mapping = false;
> > > +			range->pages = NULL;
> > > +		} else {
> > > +			kvfree(range->pages);
> > > +			range->pages = NULL;
> > > +		}
> > > +	}
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_remove - Remove GPU SVM range
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range to be removed
> > > + *
> > > + * This function removes the specified GPU SVM range and also
> > > removes the parent
> > > + * GPU SVM notifier if no more ranges remain in the notifier. The
> > > caller must
> > > + * hold a lock to protect range and notifier removal.
> > > + */
> > > +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> > > +			     struct drm_gpusvm_range *range)
> > > +{
> > > +	struct drm_gpusvm_notifier *notifier;
> > > +
> > > +	notifier = drm_gpusvm_notifier_find(gpusvm, range-
> > > >va.start);
> > > +	if (WARN_ON_ONCE(!notifier))
> > > +		return;
> > > +
> > > +	drm_gpusvm_notifier_lock(gpusvm);
> > > +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > > +	drm_gpusvm_range_free_pages(gpusvm, range);
> > > +	__drm_gpusvm_range_remove(notifier, range);
> > > +	drm_gpusvm_notifier_unlock(gpusvm);
> > > +
> > > +	drm_gpusvm_range_put(range);
> > > +
> > > +	if (RB_EMPTY_ROOT(&notifier->root.rb_root)) {
> > > +		if (!notifier->flags.removed)
> > > +			mmu_interval_notifier_remove(&notifier-
> > > >notifier);
> > > +		drm_gpusvm_notifier_remove(gpusvm, notifier);
> > > +		drm_gpusvm_notifier_free(gpusvm, notifier);
> > > +	}
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_get - Get a reference to GPU SVM range
> > > + * @range: Pointer to the GPU SVM range
> > > + *
> > > + * This function increments the reference count of the specified
> > > GPU SVM range.
> > > + *
> > > + * Returns:
> > > + * Pointer to the GPU SVM range.
> > > + */
> > > +struct drm_gpusvm_range *
> > > +drm_gpusvm_range_get(struct drm_gpusvm_range *range)
> > > +{
> > > +	kref_get(&range->refcount);
> > > +
> > > +	return range;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_destroy - Destroy GPU SVM range
> > > + * @refcount: Pointer to the reference counter embedded in the
> > > GPU SVM range
> > > + *
> > > + * This function destroys the specified GPU SVM range when its
> > > reference count
> > > + * reaches zero. If a custom range-free function is provided, it is
> > > invoked to
> > > + * free the range; otherwise, the range is deallocated using
> kfree().
> > > + */
> > > +static void drm_gpusvm_range_destroy(struct kref *refcount)
> > > +{
> > > +	struct drm_gpusvm_range *range =
> > > +		container_of(refcount, struct drm_gpusvm_range,
> > > refcount);
> > > +	struct drm_gpusvm *gpusvm = range->gpusvm;
> > > +
> > > +	if (gpusvm->ops->range_free)
> > > +		gpusvm->ops->range_free(range);
> > > +	else
> > > +		kfree(range);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_put - Put a reference to GPU SVM range
> > > + * @range: Pointer to the GPU SVM range
> > > + *
> > > + * This function decrements the reference count of the specified
> > > GPU SVM range
> > > + * and frees it when the count reaches zero.
> > > + */
> > > +void drm_gpusvm_range_put(struct drm_gpusvm_range *range)
> > > +{
> > > +	kref_put(&range->refcount, drm_gpusvm_range_destroy);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_pages_valid - GPU SVM range pages
> valid
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + *
> > > + * This function determines whether a GPU SVM range's pages are valid.
> > > + * Expected to be called holding gpusvm->notifier_lock and as the last
> > > + * step before committing a GPU binding.
> > > + *
> > > + * Returns:
> > > + * True if GPU SVM range has valid pages, False otherwise
> > > + */
> > > +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm
> *gpusvm,
> > > +				  struct drm_gpusvm_range *range)
> > > +{
> > > +	lockdep_assert_held(&gpusvm->notifier_lock);
> > > +
> > > +	return range->flags.has_vram_pages || range-
> > > >flags.has_dma_mapping;
> > > +}
> > > +
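
The intended pattern, as I read it, is the usual hmm-style commit check: collect pages, then take the notifier lock and only issue the bind if the range is still valid, retrying otherwise. Roughly (xe_pt_bind_range() is a hypothetical driver bind helper):

	again:
	err = drm_gpusvm_range_get_pages(&vm->svm, range, &ctx);
	if (err)
		return err;

	drm_gpusvm_notifier_lock(&vm->svm);
	if (!drm_gpusvm_range_pages_valid(&vm->svm, range)) {
		/* invalidated between get_pages and bind, try again */
		drm_gpusvm_notifier_unlock(&vm->svm);
		goto again;
	}
	err = xe_pt_bind_range(vm, range);
	drm_gpusvm_notifier_unlock(&vm->svm);
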
> > > +/**
> > > + * drm_gpusvm_range_pages_valid_unlocked - GPU SVM range
> > > pages valid unlocked
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + *
> > > + * This function determines whether a GPU SVM range's pages are valid.
> > > + * Expected to be called without holding gpusvm->notifier_lock.
> > > + *
> > > + * Returns:
> > > + * True if GPU SVM range has valid pages, False otherwise
> > > + */
> > > +static bool
> > > +drm_gpusvm_range_pages_valid_unlocked(struct drm_gpusvm
> > > *gpusvm,
> > > +				      struct drm_gpusvm_range *range)
> > > +{
> > > +	bool pages_valid;
> > > +
> > > +	if (!range->pages)
> > > +		return false;
> > > +
> > > +	drm_gpusvm_notifier_lock(gpusvm);
> > > +	pages_valid = drm_gpusvm_range_pages_valid(gpusvm,
> > > range);
> > > +	if (!pages_valid && range->flags.kfree_mapping) {
> > > +		kfree(range->dma_addr);
> > > +		range->flags.kfree_mapping = false;
> > > +		range->pages = NULL;
> > > +	}
> > > +	drm_gpusvm_notifier_unlock(gpusvm);
> > > +
> > > +	return pages_valid;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_get_pages - Get pages for a GPU SVM
> range
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + * @ctx: GPU SVM context
> > > + *
> > > + * This function gets pages for a GPU SVM range and ensures
> they
> > > are mapped for
> > > + * DMA access.
> > > + *
> > > + * Returns:
> > > + * 0 on success, negative error code on failure.
> > > + */
> > > +int drm_gpusvm_range_get_pages(struct drm_gpusvm
> *gpusvm,
> > > +			       struct drm_gpusvm_range *range,
> > > +			       const struct drm_gpusvm_ctx *ctx)
> > > +{
> > > +	struct mmu_interval_notifier *notifier = &range->notifier-
> > > >notifier;
> > > +	struct hmm_range hmm_range = {
> > > +		.default_flags = HMM_PFN_REQ_FAULT | (ctx-
> > > >read_only ? 0 :
> > > +			HMM_PFN_REQ_WRITE),
> > > +		.notifier = notifier,
> > > +		.start = range->va.start,
> > > +		.end = range->va.end,
> > > +		.dev_private_owner = gpusvm-
> > > >device_private_page_owner,
> > > +	};
> > > +	struct mm_struct *mm = gpusvm->mm;
> > > +	unsigned long timeout =
> > > +		jiffies +
> > > msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > > +	unsigned long i, j;
> > > +	unsigned long npages = npages_in_range(range->va.start,
> > > range->va.end);
> > > +	unsigned int order = 0;
> > > +	unsigned long *pfns;
> > > +	struct page **pages;
> > > +	int err = 0;
> > > +	bool vram_pages = !!range->flags.migrate_vram;
> > > +	bool alloc_pfns = false, kfree_mapping;
> > > +
> > > +retry:
> > > +	kfree_mapping = false;
> > > +	hmm_range.notifier_seq =
> > > mmu_interval_read_begin(notifier);
> > > +	if (drm_gpusvm_range_pages_valid_unlocked(gpusvm,
> > > range))
> > > +		return 0;
> > > +
> > > +	if (range->notifier_seq == hmm_range.notifier_seq &&
> > > range->pages) {
> > > +		if (ctx->prefault)
> > > +			return 0;
> > > +
> > > +		pfns = (unsigned long *)range->pages;
> > > +		pages = range->pages;
> > > +		goto map_pages;
> > > +	}
> > > +
> > > +	if (!range->pages) {
> > > +		pfns = kvmalloc_array(npages, sizeof(*pfns),
> > > GFP_KERNEL);
> > > +		if (!pfns)
> > > +			return -ENOMEM;
> > > +		alloc_pfns = true;
> > > +	} else {
> > > +		pfns = (unsigned long *)range->pages;
> > > +	}
> > > +
> > > +	if (!ctx->mmap_locked) {
> > > +		if (!mmget_not_zero(mm)) {
> > > +			err = -EFAULT;
> > > +			goto err_out;
> > > +		}
> > > +	}
> > > +
> > > +	hmm_range.hmm_pfns = pfns;
> > > +	while (true) {
> > > +		/* Must be checked after mmu_interval_read_begin
> > > */
> > > +		if (range->flags.unmapped) {
> > > +			err = -EFAULT;
> > > +			break;
> > > +		}
> > > +
> > > +		if (!ctx->mmap_locked) {
> > > +			/*
> > > +			 * XXX: HMM locking document indicates only
> > > a read-lock
> > > +			 * is required but there appears to be a window
> > > between
> > > +			 * the MMU_NOTIFY_MIGRATE event
> > > triggered in a CPU fault
> > > +			 * via migrate_vma_setup and the pages
> > > actually moving
> > > +			 * in migrate_vma_finalize in which this code
> > > can grab
> > > +			 * garbage pages. Grabbing the write-lock if
> > > the range
> > > +			 * is attached to vram appears to protect
> > > against this
> > > +			 * race.
> > > +			 */
> > > +			if (vram_pages)
> > > +				mmap_write_lock(mm);
> > > +			else
> > > +				mmap_read_lock(mm);
> > > +		}
> > > +		err = hmm_range_fault(&hmm_range);
> > > +		if (!ctx->mmap_locked) {
> > > +			if (vram_pages)
> > > +				mmap_write_unlock(mm);
> > > +			else
> > > +				mmap_read_unlock(mm);
> > > +		}
> > > +
> > > +		if (err == -EBUSY) {
> > > +			if (time_after(jiffies, timeout))
> > > +				break;
> > > +
> > > +			hmm_range.notifier_seq =
> > > mmu_interval_read_begin(notifier);
> > > +			continue;
> > > +		}
> > > +		break;
> > > +	}
> > > +	if (!ctx->mmap_locked)
> > > +		mmput(mm);
> > > +	if (err)
> > > +		goto err_free;
> > > +
> > > +	pages = (struct page **)pfns;
> > > +
> > > +	if (ctx->prefault) {
> > > +		range->pages = pages;
> > > +		goto set_seqno;
> > > +	}
> > > +
> > > +map_pages:
> > > +	if (is_device_private_page(hmm_pfn_to_page(pfns[0]))) {
> > > +		WARN_ON_ONCE(!range->vram_allocation);
> > > +
> > > +		for (i = 0; i < npages; ++i) {
> > > +			pages[i] = hmm_pfn_to_page(pfns[i]);
> > > +
> > > +			if
> > > (WARN_ON_ONCE(!is_device_private_page(pages[i]))) {
> > > +				err = -EOPNOTSUPP;
> > > +				goto err_free;
> > > +			}
> > > +		}
> > > +
> > > +		/* Do not race with notifier unmapping pages */
> > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > +		range->flags.has_vram_pages = true;
> > > +		range->pages = pages;
> > > +		if (mmu_interval_read_retry(notifier,
> > > hmm_range.notifier_seq)) {
> > > +			err = -EAGAIN;
> > > +
> > > 	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > > +		}
> > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > +	} else {
> > > +		dma_addr_t *dma_addr = (dma_addr_t *)pfns;
> > > +
> > > +		for_each_dma_page(i, j, npages, order) {
> > > +			if (WARN_ON_ONCE(i && order !=
> > > +
> > > hmm_pfn_to_map_order(pfns[i]))) {
> > > +				err = -EOPNOTSUPP;
> > > +				npages = i;
> > > +				goto err_unmap;
> > > +			}
> > > +			order = hmm_pfn_to_map_order(pfns[i]);
> > > +
> > > +			pages[j] = hmm_pfn_to_page(pfns[i]);
> > > +			if
> > > (WARN_ON_ONCE(is_zone_device_page(pages[j]))) {
> > > +				err = -EOPNOTSUPP;
> > > +				npages = i;
> > > +				goto err_unmap;
> > > +			}
> > > +
> > > +			set_page_dirty_lock(pages[j]);
> > > +			mark_page_accessed(pages[j]);
> > > +
> > > +			dma_addr[j] = dma_map_page(gpusvm-
> > > >drm->dev,
> > > +						   pages[j], 0,
> > > +						   PAGE_SIZE << order,
> > > +
> > > DMA_BIDIRECTIONAL);
> > > +			if (dma_mapping_error(gpusvm->drm->dev,
> > > dma_addr[j])) {
> > > +				err = -EFAULT;
> > > +				npages = i;
> > > +				goto err_unmap;
> > > +			}
> > > +		}
> > > +
> > > +		/* Huge pages, reduce memory footprint */
> > > +		if (order) {
> > > +			dma_addr = kmalloc_array(j,
> > > sizeof(*dma_addr),
> > > +						 GFP_KERNEL);
> > > +			if (dma_addr) {
> > > +				for (i = 0; i < j; ++i)
> > > +					dma_addr[i] =
> > > (dma_addr_t)pfns[i];
> > > +				kvfree(pfns);
> > > +				kfree_mapping = true;
> > > +			} else {
> > > +				dma_addr = (dma_addr_t *)pfns;
> > > +			}
> > > +		}
> > > +
> > > +		/* Do not race with notifier unmapping pages */
> > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > +		range->order = order;
> > > +		range->flags.kfree_mapping = kfree_mapping;
> > > +		range->flags.has_dma_mapping = true;
> > > +		range->dma_addr = dma_addr;
> > > +		range->vram_allocation = NULL;
> > > +		if (mmu_interval_read_retry(notifier,
> > > hmm_range.notifier_seq)) {
> > > +			err = -EAGAIN;
> > > +
> > > 	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > > +		}
> > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > +	}
> > > +
> > > +	if (err == -EAGAIN)
> > > +		goto retry;
> > > +set_seqno:
> > > +	range->notifier_seq = hmm_range.notifier_seq;
> > > +
> > > +	return 0;
> > > +
> > > +err_unmap:
> > > +	for_each_dma_page(i, j, npages, order)
> > > +		dma_unmap_page(gpusvm->drm->dev,
> > > +			       (dma_addr_t)pfns[j],
> > > +			       PAGE_SIZE << order,
> > > DMA_BIDIRECTIONAL);
> > > +err_free:
> > > +	if (alloc_pfns)
> > > +		kvfree(pfns);
> > > +err_out:
> > > +	return err;
> > > +}
> > > +
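
Once this returns 0, in the non-VRAM case the driver consumes range->dma_addr, which holds one entry per 2^range->order pages. A rough sketch of walking it when building PTEs (xe_pt_write_entry() is a hypothetical helper, mirroring the stride of the internal for_each_dma_page() macro):

	unsigned long i, j;
	unsigned long npages = npages_in_range(range->va.start, range->va.end);

	/* system-page (has_dma_mapping) case only */
	for (i = 0, j = 0; i < npages; j++, i += 1UL << range->order)
		/* each entry covers PAGE_SIZE << range->order bytes of VA */
		xe_pt_write_entry(vm, range->va.start + i * PAGE_SIZE,
				  range->dma_addr[j],
				  PAGE_SIZE << range->order);
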
> > > +/**
> > > + * drm_gpusvm_range_unmap_pages - Unmap pages associated
> > > with a GPU SVM range
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + * @ctx: GPU SVM context
> > > + *
> > > + * This function unmaps pages associated with a GPU SVM range.
> If
> > > @in_notifier
> > > + * is set, it is assumed that gpusvm->notifier_lock is held in write
> > > mode; if it
> > > + * is clear, it acquires gpusvm->notifier_lock in read mode. Must
> be
> > > called on
> > > + * each GPU SVM range attached to notifier in gpusvm->ops-
> > > >invalidate for IOMMU
> > > + * security model.
> > > + */
> > > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm
> > > *gpusvm,
> > > +				  struct drm_gpusvm_range *range,
> > > +				  const struct drm_gpusvm_ctx *ctx)
> > > +{
> > > +	if (ctx->in_notifier)
> > > +		lockdep_assert_held_write(&gpusvm->notifier_lock);
> > > +	else
> > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > +
> > > +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > > +
> > > +	if (!ctx->in_notifier)
> > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > +}
> > > +
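
This is the call the kernel-doc above expects from the driver's invalidate vfunc; something along these lines (illustrative only, the GPU PTE zap is left as a comment):

	static void gpusvm_invalidate(struct drm_gpusvm *gpusvm,
				      struct drm_gpusvm_notifier *notifier,
				      const struct mmu_notifier_range *mmu_range)
	{
		struct drm_gpusvm_ctx ctx = { .in_notifier = true, };
		struct drm_gpusvm_range *range = NULL;

		/* ... zap GPU PTEs covering the invalidated VA window ... */

		drm_gpusvm_for_each_range(range, notifier, mmu_range->start,
					  mmu_range->end)
			drm_gpusvm_range_unmap_pages(gpusvm, range, &ctx);
	}
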
> > > +/**
> > > + * drm_gpusvm_migration_put_page - Put a migration page
> > > + * @page: Pointer to the page to put
> > > + *
> > > + * This function unlocks and puts a page.
> > > + */
> > > +static void drm_gpusvm_migration_put_page(struct page *page)
> > > +{
> > > +	unlock_page(page);
> > > +	put_page(page);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migration_put_pages - Put migration pages
> > > + * @npages: Number of pages
> > > + * @migrate_pfn: Array of migrate page frame numbers
> > > + *
> > > + * This function puts an array of pages.
> > > + */
> > > +static void drm_gpusvm_migration_put_pages(unsigned long
> > > npages,
> > > +					   unsigned long *migrate_pfn)
> > > +{
> > > +	unsigned long i;
> > > +
> > > +	for (i = 0; i < npages; ++i) {
> > > +		if (!migrate_pfn[i])
> > > +			continue;
> > > +
> > > +
> > > 	drm_gpusvm_migration_put_page(migrate_pfn_to_page(mi
> > > grate_pfn[i]));
> > > +		migrate_pfn[i] = 0;
> > > +	}
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_get_vram_page - Get a reference to a VRAM
> page
> > > + * @page: Pointer to the page
> > > + * @zdd: Pointer to the GPU SVM zone device data
> > > + *
> > > + * This function associates the given page with the specified GPU
> > > SVM zone
> > > + * device data and initializes it for zone device usage.
> > > + */
> > > +static void drm_gpusvm_get_vram_page(struct page *page,
> > > +				     struct drm_gpusvm_zdd *zdd)
> > > +{
> > > +	page->zone_device_data = drm_gpusvm_zdd_get(zdd);
> > > +	zone_device_page_init(page);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migrate_map_pages() - Map migration pages
> for
> > > GPU SVM migration
> > > + * @dev: The device for which the pages are being mapped
> > > + * @dma_addr: Array to store DMA addresses corresponding to
> > > mapped pages
> > > + * @migrate_pfn: Array of migrate page frame numbers to map
> > > + * @npages: Number of pages to map
> > > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > > + *
> > > + * This function maps pages of memory for migration usage in
> GPU
> > > SVM. It
> > > + * iterates over each page frame number provided in
> @migrate_pfn,
> > > maps the
> > > + * corresponding page, and stores the DMA address in the
> provided
> > > @dma_addr
> > > + * array.
> > > + *
> > > + * Return: 0 on success, -EFAULT if an error occurs during
> mapping.
> > > + */
> > > +static int drm_gpusvm_migrate_map_pages(struct device *dev,
> > > +					dma_addr_t *dma_addr,
> > > +					long unsigned int
> > > *migrate_pfn,
> > > +					unsigned long npages,
> > > +					enum dma_data_direction dir)
> > > +{
> > > +	unsigned long i;
> > > +
> > > +	for (i = 0; i < npages; ++i) {
> > > +		struct page *page =
> > > migrate_pfn_to_page(migrate_pfn[i]);
> > > +
> > > +		if (!page)
> > > +			continue;
> > > +
> > > +		if (WARN_ON_ONCE(is_zone_device_page(page)))
> > > +			return -EFAULT;
> > > +
> > > +		dma_addr[i] = dma_map_page(dev, page, 0,
> > > PAGE_SIZE, dir);
> > > +		if (dma_mapping_error(dev, dma_addr[i]))
> > > +			return -EFAULT;
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migrate_unmap_pages() - Unmap pages
> previously
> > > mapped for GPU SVM migration
> > > + * @dev: The device for which the pages were mapped
> > > + * @dma_addr: Array of DMA addresses corresponding to
> mapped
> > > pages
> > > + * @npages: Number of pages to unmap
> > > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > > + *
> > > + * This function unmaps previously mapped pages of memory for
> > > GPU Shared Virtual
> > > + * Memory (SVM). It iterates over each DMA address provided in
> > > @dma_addr, checks
> > > + * if it's valid and not already unmapped, and unmaps the
> > > corresponding page.
> > > + */
> > > +static void drm_gpusvm_migrate_unmap_pages(struct device
> *dev,
> > > +					   dma_addr_t *dma_addr,
> > > +					   unsigned long npages,
> > > +					   enum dma_data_direction
> > > dir)
> > > +{
> > > +	unsigned long i;
> > > +
> > > +	for (i = 0; i < npages; ++i) {
> > > +		if (!dma_addr[i] || dma_mapping_error(dev,
> > > dma_addr[i]))
> > > +			continue;
> > > +
> > > +		dma_unmap_page(dev, dma_addr[i], PAGE_SIZE, dir);
> > > +	}
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migrate_to_vram - Migrate GPU SVM range to
> > > VRAM
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + * @vram_allocation: Driver-private pointer to the VRAM allocation. The
> > > + *                   caller should hold a reference to the VRAM
> > > + *                   allocation, which should be dropped via
> > > + *                   ops->vram_release or upon the failure of this
> > > + *                   function.
> > > + * @ctx: GPU SVM context
> > > + *
> > > + * This function migrates the specified GPU SVM range to VRAM.
> It
> > > performs the
> > > + * necessary setup and invokes the driver-specific operations for
> > > migration to
> > > + * VRAM. Upon successful return, @vram_allocation can safely reference
> > > + * @range until ops->vram_release is called, which only happens upon
> > > + * successful return.
> > > + *
> > > + * Returns:
> > > + * 0 on success, negative error code on failure.
> > > + */
> > > +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm
> *gpusvm,
> > > +			       struct drm_gpusvm_range *range,
> > > +			       void *vram_allocation,
> > > +			       const struct drm_gpusvm_ctx *ctx)
> > > +{
> > > +	u64 start = range->va.start, end = range->va.end;
> > > +	struct migrate_vma migrate = {
> > > +		.start		= start,
> > > +		.end		= end,
> > > +		.pgmap_owner	= gpusvm-
> > > >device_private_page_owner,
> > > +		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
> > > +	};
> > > +	struct mm_struct *mm = gpusvm->mm;
> > > +	unsigned long i, npages = npages_in_range(start, end);
> > > +	struct vm_area_struct *vas;
> > > +	struct drm_gpusvm_zdd *zdd = NULL;
> > > +	struct page **pages;
> > > +	dma_addr_t *dma_addr;
> > > +	void *buf;
> > > +	int err;
> > > +
> > > +	if (!range->flags.migrate_vram)
> > > +		return -EINVAL;
> > > +
> > > +	if (!gpusvm->ops->populate_vram_pfn || !gpusvm->ops-
> > > >copy_to_vram ||
> > > +	    !gpusvm->ops->copy_to_sram)
> > > +		return -EOPNOTSUPP;
> > > +
> > > +	if (!ctx->mmap_locked) {
> > > +		if (!mmget_not_zero(mm)) {
> > > +			err = -EFAULT;
> > > +			goto err_out;
> > > +		}
> > > +		mmap_write_lock(mm);
> > > +	}
> > > +
> > > +	mmap_assert_locked(mm);
> > > +
> > > +	vas = vma_lookup(mm, start);
> > > +	if (!vas) {
> > > +		err = -ENOENT;
> > > +		goto err_mmunlock;
> > > +	}
> > > +
> > > +	if (end > vas->vm_end || start < vas->vm_start) {
> > > +		err = -EINVAL;
> > > +		goto err_mmunlock;
> > > +	}
> > > +
> > > +	if (!vma_is_anonymous(vas)) {
> > > +		err = -EBUSY;
> > > +		goto err_mmunlock;
> > > +	}
> > > +
> > > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) +
> > > sizeof(*dma_addr) +
> > > +		       sizeof(*pages), GFP_KERNEL);
> > > +	if (!buf) {
> > > +		err = -ENOMEM;
> > > +		goto err_mmunlock;
> > > +	}
> > > +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > > +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr))
> > > * npages;
> > > +
> > > +	zdd = drm_gpusvm_zdd_alloc(range);
> > > +	if (!zdd) {
> > > +		err = -ENOMEM;
> > > +		goto err_free;
> > > +	}
> > > +
> > > +	migrate.vma = vas;
> > > +	migrate.src = buf;
> > > +	migrate.dst = migrate.src + npages;
> > > +
> > > +	err = migrate_vma_setup(&migrate);
> > > +	if (err)
> > > +		goto err_free;
> > > +
> > > +	/*
> > > +	 * FIXME: The cases below, !migrate.cpages and migrate.cpages !=
> > > +	 * npages, are not always errors. Need to revisit the possible cases
> > > +	 * and how to handle them. We could prefault on migrate.cpages !=
> > > +	 * npages via hmm_range_fault.
> > > +	 */
> > > +
> > > +	if (!migrate.cpages) {
> > > +		err = -EFAULT;
> > > +		goto err_free;
> > > +	}
> > > +
> > > +	if (migrate.cpages != npages) {
> > > +		err = -EBUSY;
> > > +		goto err_finalize;
> > > +	}
> > > +
> > > +	err = gpusvm->ops->populate_vram_pfn(gpusvm,
> > > vram_allocation, npages,
> > > +					     migrate.dst);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> > > dma_addr,
> > > +					   migrate.src, npages,
> > > DMA_TO_DEVICE);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	for (i = 0; i < npages; ++i) {
> > > +		struct page *page = pfn_to_page(migrate.dst[i]);
> > > +
> > > +		pages[i] = page;
> > > +		migrate.dst[i] = migrate_pfn(migrate.dst[i]);
> > > +		drm_gpusvm_get_vram_page(page, zdd);
> > > +	}
> > > +
> > > +	err = gpusvm->ops->copy_to_vram(gpusvm, pages,
> > > dma_addr, npages);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	/* Upon success bind vram allocation to range and zdd */
> > > +	range->vram_allocation = vram_allocation;
> > > +	WRITE_ONCE(zdd->vram_allocation, vram_allocation);	/*
> > > Owns ref */
> > > +
> > > +err_finalize:
> > > +	if (err)
> > > +		drm_gpusvm_migration_put_pages(npages,
> > > migrate.dst);
> > > +	migrate_vma_pages(&migrate);
> > > +	migrate_vma_finalize(&migrate);
> > > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev,
> > > dma_addr, npages,
> > > +				       DMA_TO_DEVICE);
> > > +err_free:
> > > +	if (zdd)
> > > +		drm_gpusvm_zdd_put(zdd);
> > > +	kvfree(buf);
> > > +err_mmunlock:
> > > +	if (!ctx->mmap_locked) {
> > > +		mmap_write_unlock(mm);
> > > +		mmput(mm);
> > > +	}
> > > +err_out:
> > > +	return err;
> > > +}
> > > +
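
For context, in the Xe patches later in the series the VRAM allocation is a TTM BO; from GPU SVM's perspective it is an opaque cookie handed back through ops->populate_vram_pfn and released via ops->vram_release. A hedged caller sketch, with bo and xe_bo_put() standing in for whatever the driver actually uses:

	/* Caller holds a reference on the driver VRAM allocation (here: a BO) */
	err = drm_gpusvm_migrate_to_vram(&vm->svm, range, bo, &ctx);
	if (err) {
		/* migration failed: the range keeps its SRAM backing */
		xe_bo_put(bo);
		return err;
	}
	/* on success the ref is owned by the zdd and dropped via ops->vram_release */
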
> > > +/**
> > > + * drm_gpusvm_migrate_populate_sram_pfn - Populate SRAM
> > > PFNs for a VM area
> > > + * @vas: Pointer to the VM area structure, can be NULL
> > > + * @npages: Number of pages to populate
> > > + * @src_mpfn: Source array of migrate PFNs
> > > + * @mpfn: Array of migrate PFNs to populate
> > > + * @addr: Start address for PFN allocation
> > > + *
> > > + * This function populates the SRAM migrate page frame numbers (PFNs)
> > > + * for the specified VM area structure. It allocates and locks pages in
> > > + * the VM area for SRAM usage. If @vas is non-NULL, alloc_page_vma() is
> > > + * used for allocation; if NULL, alloc_page() is used.
> > > + *
> > > + * Returns:
> > > + * 0 on success, negative error code on failure.
> > > + */
> > > +static int drm_gpusvm_migrate_populate_sram_pfn(struct
> > > vm_area_struct *vas,
> > > +						unsigned long npages,
> > > +						unsigned long
> > > *src_mpfn,
> > > +						unsigned long *mpfn,
> > > u64 addr)
> > > +{
> > > +	unsigned long i;
> > > +
> > > +	for (i = 0; i < npages; ++i, addr += PAGE_SIZE) {
> > > +		struct page *page;
> > > +
> > > +		if (!(src_mpfn[i] & MIGRATE_PFN_MIGRATE))
> > > +			continue;
> > > +
> > > +		if (vas)
> > > +			page = alloc_page_vma(GFP_HIGHUSER, vas,
> > > addr);
> > > +		else
> > > +			page = alloc_page(GFP_HIGHUSER);
> > > +
> > > +		if (!page)
> > > +			return -ENOMEM;
> > > +
> > > +		lock_page(page);
> > > +		mpfn[i] = migrate_pfn(page_to_pfn(page));
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_evict_to_sram - Evict GPU SVM range to SRAM
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + *
> > > + * Similar to __drm_gpusvm_migrate_to_sram but does not require the
> > > + * mmap lock; migration is done via the migrate_device_* functions. This
> > > + * is a fallback path, as it is preferred to issue migrations with the
> > > + * mmap lock held.
> > > + *
> > > + * Returns:
> > > + * 0 on success, negative error code on failure.
> > > + */
> > > +static int drm_gpusvm_evict_to_sram(struct drm_gpusvm
> *gpusvm,
> > > +				    struct drm_gpusvm_range *range)
> > > +{
> > > +	unsigned long npages;
> > > +	struct page **pages;
> > > +	unsigned long *src, *dst;
> > > +	dma_addr_t *dma_addr;
> > > +	void *buf;
> > > +	int i, err = 0;
> > > +
> > > +	npages = npages_in_range(range->va.start, range->va.end);
> > > +
> > > +	buf = kvcalloc(npages, 2 * sizeof(*src) + sizeof(*dma_addr) +
> > > +		       sizeof(*pages), GFP_KERNEL);
> > > +	if (!buf) {
> > > +		err = -ENOMEM;
> > > +		goto err_out;
> > > +	}
> > > +	src = buf;
> > > +	dst = buf + (sizeof(*src) * npages);
> > > +	dma_addr = buf + (2 * sizeof(*src) * npages);
> > > +	pages = buf + (2 * sizeof(*src) + sizeof(*dma_addr)) * npages;
> > > +
> > > +	err = gpusvm->ops->populate_vram_pfn(gpusvm, range-
> > > >vram_allocation,
> > > +					     npages, src);
> > > +	if (err)
> > > +		goto err_free;
> > > +
> > > +	err = migrate_device_vma_range(gpusvm->mm,
> > > +				       gpusvm-
> > > >device_private_page_owner, src,
> > > +				       npages, range->va.start);
> > > +	if (err)
> > > +		goto err_free;
> > > +
> > > +	err = drm_gpusvm_migrate_populate_sram_pfn(NULL,
> > > npages, src, dst, 0);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> > > dma_addr,
> > > +					   dst, npages,
> > > DMA_BIDIRECTIONAL);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	for (i = 0; i < npages; ++i)
> > > +		pages[i] = migrate_pfn_to_page(src[i]);
> > > +
> > > +	err = gpusvm->ops->copy_to_sram(gpusvm, pages,
> > > dma_addr, npages);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +err_finalize:
> > > +	if (err)
> > > +		drm_gpusvm_migration_put_pages(npages, dst);
> > > +	migrate_device_pages(src, dst, npages);
> > > +	migrate_device_finalize(src, dst, npages);
> > > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev,
> > > dma_addr, npages,
> > > +				       DMA_BIDIRECTIONAL);
> > > +err_free:
> > > +	kvfree(buf);
> > > +err_out:
> > > +
> > > +	return err;
> > > +}
> > > +
> > > +/**
> > > + * __drm_gpusvm_migrate_to_sram - Migrate GPU SVM range
> to
> > > SRAM (internal)
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @vas: Pointer to the VM area structure
> > > + * @page: Pointer to the page for fault handling (can be NULL)
> > > + * @start: Start address of the migration range
> > > + * @end: End address of the migration range
> > > + *
> > > + * This internal function performs the migration of the specified
> GPU
> > > SVM range
> > > + * to SRAM. It sets up the migration, populates + dma maps
> SRAM
> > > PFNs, and
> > > + * invokes the driver-specific operations for migration to SRAM.
> > > + *
> > > + * Returns:
> > > + * 0 on success, negative error code on failure.
> > > + */
> > > +static int __drm_gpusvm_migrate_to_sram(struct drm_gpusvm
> > > *gpusvm,
> > > +					struct vm_area_struct *vas,
> > > +					struct page *page,
> > > +					u64 start, u64 end)
> > > +{
> > > +	struct migrate_vma migrate = {
> > > +		.vma		= vas,
> > > +		.pgmap_owner	= gpusvm-
> > > >device_private_page_owner,
> > > +		.flags		=
> > > MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> > > +		.fault_page	= page,
> > > +	};
> > > +	unsigned long npages;
> > > +	struct page **pages;
> > > +	dma_addr_t *dma_addr;
> > > +	void *buf;
> > > +	int i, err = 0;
> > > +
> > > +	mmap_assert_locked(gpusvm->mm);
> > > +
> > > +	/* Corner case where the VMA has been partially unmapped */
> > > +	if (start < vas->vm_start)
> > > +		start = vas->vm_start;
> > > +	if (end > vas->vm_end)
> > > +		end = vas->vm_end;
> > > +
> > > +	migrate.start = start;
> > > +	migrate.end = end;
> > > +	npages = npages_in_range(start, end);
> > > +
> > > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) +
> > > sizeof(*dma_addr) +
> > > +		       sizeof(*pages), GFP_KERNEL);
> > > +	if (!buf) {
> > > +		err = -ENOMEM;
> > > +		goto err_out;
> > > +	}
> > > +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > > +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr))
> > > * npages;
> > > +
> > > +	migrate.vma = vas;
> > > +	migrate.src = buf;
> > > +	migrate.dst = migrate.src + npages;
> > > +
> > > +	err = migrate_vma_setup(&migrate);
> > > +	if (err)
> > > +		goto err_free;
> > > +
> > > +	/* Raced with another CPU fault, nothing to do */
> > > +	if (!migrate.cpages)
> > > +		goto err_free;
> > > +
> > > +	err = drm_gpusvm_migrate_populate_sram_pfn(vas, npages,
> > > +						   migrate.src,
> > > migrate.dst,
> > > +						   start);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> > > dma_addr,
> > > +					   migrate.dst, npages,
> > > +					   DMA_BIDIRECTIONAL);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	for (i = 0; i < npages; ++i)
> > > +		pages[i] = migrate_pfn_to_page(migrate.src[i]);
> > > +
> > > +	err = gpusvm->ops->copy_to_sram(gpusvm, pages,
> > > dma_addr, npages);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +err_finalize:
> > > +	if (err)
> > > +		drm_gpusvm_migration_put_pages(npages,
> > > migrate.dst);
> > > +	migrate_vma_pages(&migrate);
> > > +	migrate_vma_finalize(&migrate);
> > > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev,
> > > dma_addr, npages,
> > > +				       DMA_BIDIRECTIONAL);
> > > +err_free:
> > > +	kvfree(buf);
> > > +err_out:
> > > +	mmap_assert_locked(gpusvm->mm);
> > > +
> > > +	return err;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migrate_to_sram - Migrate (evict) GPU SVM
> range
> > > to SRAM
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + * @ctx: GPU SVM context
> > > + *
> > > + * This function initiates the migration of the specified GPU SVM
> > > range to
> > > + * SRAM. It performs necessary checks and invokes the internal
> > > migration
> > > + * function for actual migration.
> > > + *
> > > + * Returns:
> > > + * 0 on success, negative error code on failure.
> > > + */
> > > +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm
> *gpusvm,
> > > +			       struct drm_gpusvm_range *range,
> > > +			       const struct drm_gpusvm_ctx *ctx)
> > > +{
> > > +	u64 start = range->va.start, end = range->va.end;
> > > +	struct mm_struct *mm = gpusvm->mm;
> > > +	struct vm_area_struct *vas;
> > > +	int err;
> > > +	bool retry = false;
> > > +
> > > +	if (!ctx->mmap_locked) {
> > > +		if (!mmget_not_zero(mm)) {
> > > +			err = -EFAULT;
> > > +			goto err_out;
> > > +		}
> > > +		if (ctx->trylock_mmap) {
> > > +			if (!mmap_read_trylock(mm))  {
> > > +				err =
> > > drm_gpusvm_evict_to_sram(gpusvm, range);
> > > +				goto err_mmput;
> > > +			}
> > > +		} else {
> > > +			mmap_read_lock(mm);
> > > +		}
> > > +	}
> > > +
> > > +	mmap_assert_locked(mm);
> > > +
> > > +	/*
> > > +	 * Loop required to find all VMA area structs for the corner
> > > case when
> > > +	 * VRAM backing has been partially unmapped from MM's
> > > address space.
> > > +	 */
> > > +again:
> > > +	vas = find_vma(mm, start);
> > > +	if (!vas) {
> > > +		if (!retry)
> > > +			err = -ENOENT;
> > > +		goto err_mmunlock;
> > > +	}
> > > +
> > > +	if (end <= vas->vm_start || start >= vas->vm_end) {
> > > +		if (!retry)
> > > +			err = -EINVAL;
> > > +		goto err_mmunlock;
> > > +	}
> > > +
> > > +	err = __drm_gpusvm_migrate_to_sram(gpusvm, vas, NULL,
> > > start, end);
> > > +	if (err)
> > > +		goto err_mmunlock;
> > > +
> > > +	if (vas->vm_end < end) {
> > > +		retry = true;
> > > +		start = vas->vm_end;
> > > +		goto again;
> > > +	}
> > > +
> > > +	if (!ctx->mmap_locked) {
> > > +		mmap_read_unlock(mm);
> > > +		/*
> > > +		 * Using mmput_async as this function can be called
> > > while
> > > +		 * holding a dma-resv lock, and a final put can grab the
> > > mmap
> > > +		 * lock, causing a lock inversion.
> > > +		 */
> > > +		mmput_async(mm);
> > > +	}
> > > +
> > > +	return 0;
> > > +
> > > +err_mmunlock:
> > > +	if (!ctx->mmap_locked)
> > > +		mmap_read_unlock(mm);
> > > +err_mmput:
> > > +	if (!ctx->mmap_locked)
> > > +		mmput_async(mm);
> > > +err_out:
> > > +	return err;
> > > +}
> > > +
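
Eviction callers (e.g. a TTM eviction path already holding a dma-resv lock) would presumably go through here with trylock semantics so the mmap-lock-free fallback above gets used when contended; roughly:

	struct drm_gpusvm_ctx ctx = {
		/* fall back to drm_gpusvm_evict_to_sram() if mmap lock is contended */
		.trylock_mmap = true,
	};

	err = drm_gpusvm_migrate_to_sram(&vm->svm, range, &ctx);
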
> > > +/**
> > > + * drm_gpusvm_page_free - Put GPU SVM zone device data
> > > associated with a page
> > > + * @page: Pointer to the page
> > > + *
> > > + * This function is a callback used to put the GPU SVM zone
> device
> > > data
> > > + * associated with a page when it is being released.
> > > + */
> > > +static void drm_gpusvm_page_free(struct page *page)
> > > +{
> > > +	drm_gpusvm_zdd_put(page->zone_device_data);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migrate_to_ram - Migrate GPU SVM range to
> RAM
> > > (page fault handler)
> > > + * @vmf: Pointer to the fault information structure
> > > + *
> > > + * This function is a page fault handler used to migrate a GPU
> SVM
> > > range to RAM.
> > > + * It retrieves the GPU SVM range information from the faulting
> > > page and invokes
> > > + * the internal migration function to migrate the range back to
> RAM.
> > > + *
> > > + * Returns:
> > > + * VM_FAULT_SIGBUS on failure, 0 on success.
> > > + */
> > > +static vm_fault_t drm_gpusvm_migrate_to_ram(struct vm_fault
> > > *vmf)
> > > +{
> > > +	struct drm_gpusvm_zdd *zdd = vmf->page-
> > > >zone_device_data;
> > > +	int err;
> > > +
> > > +	err = __drm_gpusvm_migrate_to_sram(zdd->range-
> > > >gpusvm,
> > > +					   vmf->vma, vmf->page,
> > > +					   zdd->range->va.start,
> > > +					   zdd->range->va.end);
> > > +
> > > +	return err ? VM_FAULT_SIGBUS : 0;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_pagemap_ops - Device page map operations for
> > > GPU SVM
> > > + */
> > > +static const struct dev_pagemap_ops
> drm_gpusvm_pagemap_ops =
> > > {
> > > +	.page_free = drm_gpusvm_page_free,
> > > +	.migrate_to_ram = drm_gpusvm_migrate_to_ram,
> > > +};
> > > +
> > > +/**
> > > + * drm_gpusvm_pagemap_ops_get - Retrieve GPU SVM device page map operations
> > > + *
> > > + * Returns:
> > > + * Pointer to the GPU SVM device page map operations structure.
> > > + */
> > > +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void)
> > > +{
> > > +	return &drm_gpusvm_pagemap_ops;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_has_mapping - Check if GPU SVM has mapping for the given address range
> > > + * @gpusvm: Pointer to the GPU SVM structure.
> > > + * @start: Start address
> > > + * @end: End address
> > > + *
> > > + * Returns:
> > > + * True if GPU SVM has mapping, False otherwise
> > > + */
> > > +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start, u64 end)
> > > +{
> > > +	struct drm_gpusvm_notifier *notifier;
> > > +
> > > +	drm_gpusvm_for_each_notifier(notifier, gpusvm, start, end) {
> > > +		struct drm_gpusvm_range *range = NULL;
> > > +
> > > +		drm_gpusvm_for_each_range(range, notifier, start, end)
> > > +			return true;
> > > +	}
> > > +
> > > +	return false;
> > > +}
> > > diff --git a/drivers/gpu/drm/xe/drm_gpusvm.h
> > > b/drivers/gpu/drm/xe/drm_gpusvm.h
> > > new file mode 100644
> > > index 000000000000..0ea70f8534a8
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/xe/drm_gpusvm.h
> > > @@ -0,0 +1,415 @@
> > > +/* SPDX-License-Identifier: MIT */
> > > +/*
> > > + * Copyright © 2024 Intel Corporation
> > > + */
> > > +
> > > +#ifndef __DRM_GPUSVM_H__
> > > +#define __DRM_GPUSVM_H__
> > > +
> > > +#include <linux/kref.h>
> > > +#include <linux/mmu_notifier.h>
> > > +#include <linux/workqueue.h>
> > > +
> > > +struct dev_pagemap_ops;
> > > +struct drm_device;
> > > +struct drm_gpusvm;
> > > +struct drm_gpusvm_notifier;
> > > +struct drm_gpusvm_ops;
> > > +struct drm_gpusvm_range;
> > > +
> > > +/**
> > > + * struct drm_gpusvm_ops - Operations structure for GPU SVM
> > > + *
> > > + * This structure defines the operations for GPU Shared Virtual
> > > Memory (SVM).
> > > + * These operations are provided by the GPU driver to manage
> SVM
> > > ranges and
> > > + * perform operations such as migration between VRAM and
> system
> > > RAM.
> > > + */
> > > +struct drm_gpusvm_ops {
> > > +	/**
> > > +	 * @notifier_alloc: Allocate a GPU SVM notifier (optional)
> > > +	 *
> > > +	 * This function shall allocate a GPU SVM notifier.
> > > +	 *
> > > +	 * Returns:
> > > +	 * Pointer to the allocated GPU SVM notifier on success, NULL
> > > on failure.
> > > +	 */
> > > +	struct drm_gpusvm_notifier *(*notifier_alloc)(void);
> > > +
> > > +	/**
> > > +	 * @notifier_free: Free a GPU SVM notifier (optional)
> > > +	 * @notifier: Pointer to the GPU SVM notifier to be freed
> > > +	 *
> > > +	 * This function shall free a GPU SVM notifier.
> > > +	 */
> > > +	void (*notifier_free)(struct drm_gpusvm_notifier *notifier);
> > > +
> > > +	/**
> > > +	 * @range_alloc: Allocate a GPU SVM range (optional)
> > > +	 * @gpusvm: Pointer to the GPU SVM
> > > +	 *
> > > +	 * This function shall allocate a GPU SVM range.
> > > +	 *
> > > +	 * Returns:
> > > +	 * Pointer to the allocated GPU SVM range on success, NULL
> > > on failure.
> > > +	 */
> > > +	struct drm_gpusvm_range *(*range_alloc)(struct
> > > drm_gpusvm *gpusvm);
> > > +
> > > +	/**
> > > +	 * @range_free: Free a GPU SVM range (optional)
> > > +	 * @range: Pointer to the GPU SVM range to be freed
> > > +	 *
> > > +	 * This function shall free a GPU SVM range.
> > > +	 */
> > > +	void (*range_free)(struct drm_gpusvm_range *range);
> > > +
> > > +	/**
> > > +	 * @vram_release: Release VRAM allocation (optional)
> > > +	 * @vram_allocation: Driver-private pointer to the VRAM
> > > allocation
> > > +	 *
> > > +	 * This function shall release VRAM allocation and expects to
> > > drop a
> > > +	 * reference to VRAM allocation.
> > > +	 */
> > > +	void (*vram_release)(void *vram_allocation);
> > > +
> > > +	/**
> > > +	 * @populate_vram_pfn: Populate VRAM PFN (required for
> > > migration)
> > > +	 * @gpusvm: Pointer to the GPU SVM
> > > +	 * @vram_allocation: Driver-private pointer to the VRAM
> > > allocation
> > > +	 * @npages: Number of pages to populate
> > > +	 * @pfn: Array of page frame numbers to populate
> > > +	 *
> > > +	 * This function shall populate VRAM page frame numbers
> > > (PFN).
> > > +	 *
> > > +	 * Returns:
> > > +	 * 0 on success, a negative error code on failure.
> > > +	 */
> > > +	int (*populate_vram_pfn)(struct drm_gpusvm *gpusvm,
> > > +				 void *vram_allocation,
> > > +				 unsigned long npages,
> > > +				 unsigned long *pfn);
> > > +
> > > +	/**
> > > +	 * @copy_to_vram: Copy to VRAM (required for migration)
> > > +	 * @gpusvm: Pointer to the GPU SVM
> > > +	 * @pages: Pointer to array of VRAM pages (destination)
> > > +	 * @dma_addr: Pointer to array of DMA addresses (source)
> > > +	 * @npages: Number of pages to copy
> > > +	 *
> > > +	 * This function shall copy pages to VRAM.
> > > +	 *
> > > +	 * Returns:
> > > +	 * 0 on success, a negative error code on failure.
> > > +	 */
> > > +	int (*copy_to_vram)(struct drm_gpusvm *gpusvm,
> > > +			    struct page **pages,
> > > +			    dma_addr_t *dma_addr,
> > > +			    unsigned long npages);
> > > +
> > > +	/**
> > > +	 * @copy_to_sram: Copy to system RAM (required for
> > > migration)
> > > +	 * @gpusvm: Pointer to the GPU SVM
> > > +	 * @pages: Pointer to array of VRAM pages (source)
> > > +	 * @dma_addr: Pointer to array of DMA addresses
> > > (destination)
> > > +	 * @npages: Number of pages to copy
> > > +	 *
> > > +	 * This function shall copy pages to system RAM.
> > > +	 *
> > > +	 * Returns:
> > > +	 * 0 on success, a negative error code on failure.
> > > +	 */
> > > +	int (*copy_to_sram)(struct drm_gpusvm *gpusvm,
> > > +			    struct page **pages,
> > > +			    dma_addr_t *dma_addr,
> > > +			    unsigned long npages);
> > > +
> > > +	/**
> > > +	 * @invalidate: Invalidate GPU SVM notifier (required)
> > > +	 * @gpusvm: Pointer to the GPU SVM
> > > +	 * @notifier: Pointer to the GPU SVM notifier
> > > +	 * @mmu_range: Pointer to the mmu_notifier_range
> > > structure
> > > +	 *
> > > +	 * This function shall invalidate the GPU page tables. It can
> > > safely
> > > +	 * walk the notifier range RB tree/list in this function. Called
> > > while
> > > +	 * holding the notifier lock.
> > > +	 */
> > > +	void (*invalidate)(struct drm_gpusvm *gpusvm,
> > > +			   struct drm_gpusvm_notifier *notifier,
> > > +			   const struct mmu_notifier_range
> > > *mmu_range);
> > > +};
> > > +
> > > +/**
> > > + * struct drm_gpusvm_notifier - Structure representing a GPU
> SVM
> > > notifier
> > > + *
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @notifier: MMU interval notifier
> > > + * @interval: Interval for the notifier
> > > + * @rb: Red-black tree node for the parent GPU SVM structure
> > > notifier tree
> > > + * @root: Cached root node of the RB tree containing ranges
> > > + * @range_list: List head containing ranges in the same order they appear
> > > + *              in the interval tree. This is useful to keep iterating
> > > + *              ranges while doing modifications to the RB tree.
> > > + * @flags.removed: Flag indicating whether the MMU interval
> > > notifier has been
> > > + *                 removed
> > > + *
> > > + * This structure represents a GPU SVM notifier.
> > > + */
> > > +struct drm_gpusvm_notifier {
> > > +	struct drm_gpusvm *gpusvm;
> > > +	struct mmu_interval_notifier notifier;
> > > +	struct {
> > > +		u64 start;
> > > +		u64 end;
> > > +	} interval;
> > > +	struct {
> > > +		struct rb_node node;
> > > +		struct list_head entry;
> > > +		u64 __subtree_last;
> > > +	} rb;
> > > +	struct rb_root_cached root;
> > > +	struct list_head range_list;
> > > +	struct {
> > > +		u32 removed : 1;
> > > +	} flags;
> > > +};
> > > +
> > > +/**
> > > + * struct drm_gpusvm_range - Structure representing a GPU
> SVM
> > > range
> > > + *
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @notifier: Pointer to the GPU SVM notifier
> > > + * @refcount: Reference count for the range
> > > + * @rb: Red-black tree node for the parent GPU SVM notifier
> > > structure range tree
> > > + * @va: Virtual address range
> > > + * @notifier_seq: Notifier sequence number of the range's
> pages
> > > + * @pages: Pointer to the array of pages (if backing store is in
> VRAM)
> > > + * @dma_addr: DMA address array (if backing store is SRAM and
> > > DMA mapped)
> > > + * @vram_allocation: Driver-private pointer to the VRAM
> allocation
> > > + * @order: Order of dma mapping. i.e. PAGE_SIZE << order is
> > > mapping size
> > > + * @flags.migrate_vram: Flag indicating whether the range can
> be
> > > migrated to VRAM
> > > + * @flags.unmapped: Flag indicating if the range has been
> > > unmapped
> > > + * @flags.partial_unmap: Flag indicating if the range has been
> > > partially unmapped
> > > + * @flags.has_vram_pages: Flag indicating if the range has vram
> > > pages
> > > + * @flags.has_dma_mapping: Flag indicating if the range has a
> DMA
> > > mapping
> > > + * @flags.kfree_mapping: Flag indicating @dma_addr is a
> compact
> > > allocation based
> > > + *                       on @order which releases via kfree
> > > + *
> > > + * This structure represents a GPU SVM range used for tracking
> > > memory ranges
> > > + * mapped in a DRM device.
> > > + */
> > > +struct drm_gpusvm_range {
> > > +	struct drm_gpusvm *gpusvm;
> > > +	struct drm_gpusvm_notifier *notifier;
> > > +	struct kref refcount;
> > > +	struct {
> > > +		struct rb_node node;
> > > +		struct list_head entry;
> > > +		u64 __subtree_last;
> > > +	} rb;
> > > +	struct {
> > > +		u64 start;
> > > +		u64 end;
> > > +	} va;
> > > +	unsigned long notifier_seq;
> > > +	union {
> > > +		struct page **pages;
> > > +		dma_addr_t *dma_addr;
> > > +	};
> > > +	void *vram_allocation;
> > > +	u16 order;
> > > +	struct {
> > > +		/* All flags below must be set upon creation */
> > > +		u16 migrate_vram : 1;
> > > +		/* All flags below must be set / cleared under notifier
> > > lock */
> > > +		u16 unmapped : 1;
> > > +		u16 partial_unmap : 1;
> > > +		u16 has_vram_pages : 1;
> > > +		u16 has_dma_mapping : 1;
> > > +		u16 kfree_mapping : 1;
> > > +	} flags;
> > > +};
> > > +
> > > +/**
> > > + * struct drm_gpusvm - GPU SVM structure
> > > + *
> > > + * @name: Name of the GPU SVM
> > > + * @drm: Pointer to the DRM device structure
> > > + * @mm: Pointer to the mm_struct for the address space
> > > + * @device_private_page_owner: Device private pages owner
> > > + * @mm_start: Start address of GPU SVM
> > > + * @mm_range: Range of the GPU SVM
> > > + * @notifier_size: Size of individual notifiers
> > > + * @ops: Pointer to the operations structure for GPU SVM
> > > + * @chunk_sizes: Pointer to the array of chunk sizes used in
> range
> > > allocation.
> > > + *               Entries should be powers of 2 in descending order.
> > > + * @num_chunks: Number of chunks
> > > + * @notifier_lock: Read-write semaphore for protecting notifier
> > > operations
> > > + * @zdd_wq: Workqueue for deferred work on zdd destruction
> > > + * @root: Cached root node of the Red-Black tree containing
> GPU
> > > SVM notifiers
> > > + * @notifier_list: List head containing notifiers in the same order they
> > > + *                 appear in the interval tree. This is useful to keep
> > > + *                 iterating notifiers while doing modifications to the RB tree.
> > > + *
> > > + * This structure represents a GPU SVM (Shared Virtual Memory)
> > > used for tracking
> > > + * memory ranges mapped in a DRM (Direct Rendering Manager)
> > > device.
> > > + *
> > > + * No reference counting is provided, as this is expected to be
> > > embedded in the
> > > + * driver VM structure along with the struct drm_gpuvm, which
> > > handles reference
> > > + * counting.
> > > + */
> > > +struct drm_gpusvm {
> > > +	const char *name;
> > > +	struct drm_device *drm;
> > > +	struct mm_struct *mm;
> > > +	void *device_private_page_owner;
> > > +	u64 mm_start;
> > > +	u64 mm_range;
> > > +	u64 notifier_size;
> > > +	const struct drm_gpusvm_ops *ops;
> > > +	const u64 *chunk_sizes;
> > > +	int num_chunks;
> > > +	struct rw_semaphore notifier_lock;
> > > +	struct workqueue_struct *zdd_wq;
> > > +	struct rb_root_cached root;
> > > +	struct list_head notifier_list;
> > > +};
> >
> > I also think the gpusvm concept duplicates drm_gpuvm. Look at the
> > members here: mm_start, mm_range, rb_tree...
> >
> > Maintaining a list of notifiers at this layer is odd. Everybody else
> > seems to embed the notifier in a range...
> >
> > The mm field is essential for SVM though. I think what we can do is
> > introduce an *mm field in drm_gpuvm and introduce uAPI to allow the
> > user to say one gpuvm participates in SVM. If a gpuvm participates in
> > SVM, we set the mm field for that gpuvm.
> >
> > Another benefit of the proposed approach is that multiple gpuvms can
> > share an address space with a single CPU mm process.
> >
> >
> > Oak
> >
> >
> > > +
> > > +/**
> > > + * struct drm_gpusvm_ctx - DRM GPU SVM context
> > > + *
> > > + * @mmap_locked: mmap lock is locked
> > > + * @trylock_mmap: trylock mmap lock, used to avoid locking
> > > inversions
> > > + *                (e.g.dma-revs -> mmap lock)
> > > + * @in_notifier: entering from a MMU notifier
> > > + * @read_only: operating on read-only memory
> > > + * @vram_possible: possible to use VRAM
> > > + * @prefault: prefault pages
> > > + *
> > > + * Context that DRM GPU SVM is operating in (i.e. user arguments).
> > > + */
> > > +struct drm_gpusvm_ctx {
> > > +	u32 mmap_locked :1;
> > > +	u32 trylock_mmap :1;
> > > +	u32 in_notifier :1;
> > > +	u32 read_only :1;
> > > +	u32 vram_possible :1;
> > > +	u32 prefault :1;
> > > +};
> > > +
> > > +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> > > +		    const char *name, struct drm_device *drm,
> > > +		    struct mm_struct *mm, void
> > > *device_private_page_owner,
> > > +		    u64 mm_start, u64 mm_range, u64 notifier_size,
> > > +		    const struct drm_gpusvm_ops *ops,
> > > +		    const u64 *chunk_sizes, int num_chunks);
> > > +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm);
> > > +void drm_gpusvm_free(struct drm_gpusvm *gpusvm);
> > > +
> > > +struct drm_gpusvm_range *
> > > +drm_gpusvm_range_find_or_insert(struct drm_gpusvm
> *gpusvm,
> > > u64 fault_addr,
> > > +				u64 gpuva_start, u64 gpuva_end,
> > > +				const struct drm_gpusvm_ctx *ctx);
> > > +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> > > +			     struct drm_gpusvm_range *range);
> > > +
> > > +struct drm_gpusvm_range *
> > > +drm_gpusvm_range_get(struct drm_gpusvm_range *range);
> > > +void drm_gpusvm_range_put(struct drm_gpusvm_range
> *range);
> > > +
> > > +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm
> *gpusvm,
> > > +				  struct drm_gpusvm_range *range);
> > > +
> > > +int drm_gpusvm_range_get_pages(struct drm_gpusvm
> *gpusvm,
> > > +			       struct drm_gpusvm_range *range,
> > > +			       const struct drm_gpusvm_ctx *ctx);
> > > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm
> > > *gpusvm,
> > > +				  struct drm_gpusvm_range *range,
> > > +				  const struct drm_gpusvm_ctx *ctx);
> > > +
> > > +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm
> *gpusvm,
> > > +			       struct drm_gpusvm_range *range,
> > > +			       void *vram_allocation,
> > > +			       const struct drm_gpusvm_ctx *ctx);
> > > +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm
> *gpusvm,
> > > +			       struct drm_gpusvm_range *range,
> > > +			       const struct drm_gpusvm_ctx *ctx);
> > > +
> > > +const struct dev_pagemap_ops
> > > *drm_gpusvm_pagemap_ops_get(void);
> > > +
> > > +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm,
> u64
> > > start, u64 end);
> > > +
> > > +struct drm_gpusvm_range *
> > > +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier,
> u64
> > > start, u64 end);
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_lock - Lock GPU SVM notifier
> > > + * @gpusvm__: Pointer to the GPU SVM structure.
> > > + *
> > > + * Abstract client usage GPU SVM notifier lock, take lock
> > > + */
> > > +#define drm_gpusvm_notifier_lock(gpusvm__)	\
> > > +	down_read(&(gpusvm__)->notifier_lock)
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_unlock - Unlock GPU SVM notifier
> > > + * @gpusvm__: Pointer to the GPU SVM structure.
> > > + *
> > > + * Abstract client usage GPU SVM notifier lock, drop lock
> > > + */
> > > +#define drm_gpusvm_notifier_unlock(gpusvm__)	\
> > > +	up_read(&(gpusvm__)->notifier_lock)
> > > +
> > > +/**
> > > + * __drm_gpusvm_range_next - Get the next GPU SVM range in
> the
> > > list
> > > + * @range: a pointer to the current GPU SVM range
> > > + *
> > > + * Return: A pointer to the next drm_gpusvm_range if available,
> or
> > > NULL if the
> > > + *         current range is the last one or if the input range is NULL.
> > > + */
> > > +static inline struct drm_gpusvm_range *
> > > +__drm_gpusvm_range_next(struct drm_gpusvm_range *range)
> > > +{
> > > +	if (range && !list_is_last(&range->rb.entry,
> > > +				   &range->notifier->range_list))
> > > +		return list_next_entry(range, rb.entry);
> > > +
> > > +	return NULL;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_for_each_range - Iterate over GPU SVM ranges
> in a
> > > notifier
> > > + * @range__: Iterator variable for the ranges. If set, it indicates
> the
> > > start of
> > > + *	     the iterator. If NULL, call drm_gpusvm_range_find() to get
> > > the range.
> > > + * @notifier__: Pointer to the GPU SVM notifier
> > > + * @start__: Start address of the range
> > > + * @end__: End address of the range
> > > + *
> > > + * This macro is used to iterate over GPU SVM ranges in a notifier.
> It
> > > is safe
> > > + * to use while holding the driver SVM lock or the notifier lock.
> > > + */
> > > +#define drm_gpusvm_for_each_range(range__, notifier__,
> start__,
> > > end__)	\
> > > +	for ((range__) = (range__) ?:
> > > 	\
> > > +	     drm_gpusvm_range_find((notifier__), (start__), (end__));
> > > 	\
> > > +	     (range__) && (range__->va.start < (end__));
> > > 	\
> > > +	     (range__) = __drm_gpusvm_range_next(range__))
> > > +
> > > +/**
> > > + * drm_gpusvm_range_set_unmapped - Mark a GPU SVM range
> as
> > > unmapped
> > > + * @range: Pointer to the GPU SVM range structure.
> > > + * @mmu_range: Pointer to the MMU notifier range structure.
> > > + *
> > > + * This function marks a GPU SVM range as unmapped and sets
> the
> > > partial_unmap flag
> > > + * if the range partially falls within the provided MMU notifier
> range.
> > > + */
> > > +static inline void
> > > +drm_gpusvm_range_set_unmapped(struct drm_gpusvm_range
> > > *range,
> > > +			      const struct mmu_notifier_range
> > > *mmu_range)
> > > +{
> > > +	lockdep_assert_held_write(&range->gpusvm->notifier_lock);
> > > +
> > > +	range->flags.unmapped = true;
> > > +	if (range->va.start < mmu_range->start ||
> > > +	    range->va.end > mmu_range->end)
> > > +		range->flags.partial_unmap = true;
> > > +}
> > > +
> > > +#endif /* __DRM_GPUSVM_H__ */
> > > --
> > > 2.34.1
> >
> 
> --
> Simona Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory
  2024-09-24 16:30     ` Matthew Brost
@ 2024-09-25 21:12       ` Matthew Brost
  0 siblings, 0 replies; 100+ messages in thread
From: Matthew Brost @ 2024-09-25 21:12 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: intel-xe, dri-devel, airlied, christian.koenig, matthew.auld,
	daniel

On Tue, Sep 24, 2024 at 04:30:06PM +0000, Matthew Brost wrote:
> On Tue, Sep 24, 2024 at 12:42:56PM +0200, Thomas Hellström wrote:
> > Hi, Matt,
> > 
> > Some random review comments on this patch I came across while looking
> > at multi-device.
> > 
> > Thanks,
> > Thomas
> > 
> > 
> > On Tue, 2024-08-27 at 19:48 -0700, Matthew Brost wrote:
> > > This patch introduces support for GPU Shared Virtual Memory (SVM) in
> > > the
> > > Direct Rendering Manager (DRM) subsystem. SVM allows for seamless
> > > sharing of memory between the CPU and GPU, enhancing performance and
> > > flexibility in GPU computing tasks.
> > > 
> > > The patch adds the necessary infrastructure for SVM, including data
> > > structures and functions for managing SVM ranges and notifiers. It
> > > also
> > > provides mechanisms for allocating, deallocating, and migrating
> > > memory
> > > regions between system RAM and GPU VRAM.
> > > 
> > > This mid-layer is largely inspired by GPUVM.
> > 
> > NIT: Naming: should it be drm_svm rather than drm_gpusvm? For the
> > drm_gpuvm component, gpuvm clearly distinguished a gpu_vm from a
> > mm_struct but here we don't have the same need.
> > 
> 
> Can rename.
> 
> > > 
> > > Cc: Dave Airlie <airlied@redhat.com>
> > > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > Cc: Christian König <christian.koenig@amd.com>
> > > Cc: <dri-devel@lists.freedesktop.org>
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/Makefile     |    3 +-
> > >  drivers/gpu/drm/xe/drm_gpusvm.c | 2174
> > > +++++++++++++++++++++++++++++++
> > >  drivers/gpu/drm/xe/drm_gpusvm.h |  415 ++++++
> > >  3 files changed, 2591 insertions(+), 1 deletion(-)
> > >  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.c
> > >  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.h
> > > 
> > > diff --git a/drivers/gpu/drm/xe/Makefile
> > > b/drivers/gpu/drm/xe/Makefile
> > > index b9670ae09a9e..b8fc2ee58f1a 100644
> > > --- a/drivers/gpu/drm/xe/Makefile
> > > +++ b/drivers/gpu/drm/xe/Makefile
> > > @@ -25,7 +25,8 @@ $(obj)/generated/%_wa_oob.c
> > > $(obj)/generated/%_wa_oob.h: $(obj)/xe_gen_wa_oob \
> > >  
> > >  # core driver code
> > >  
> > > -xe-y += xe_bb.o \
> > > +xe-y += drm_gpusvm.o \
> > > +	xe_bb.o \
> > >  	xe_bo.o \
> > >  	xe_bo_evict.o \
> > >  	xe_devcoredump.o \
> > > diff --git a/drivers/gpu/drm/xe/drm_gpusvm.c
> > > b/drivers/gpu/drm/xe/drm_gpusvm.c
> > > new file mode 100644
> > > index 000000000000..fc1e44e6ae72
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/xe/drm_gpusvm.c
> > > @@ -0,0 +1,2174 @@
> > > +// SPDX-License-Identifier: MIT
> > > +/*
> > > + * Copyright © 2024 Intel Corporation
> > > + *
> > > + * Authors:
> > > + *     Matthew Brost <matthew.brost@intel.com>
> > > + */
> > > +
> > > +#include <linux/dma-mapping.h>
> > > +#include <linux/interval_tree_generic.h>
> > > +#include <linux/hmm.h>
> > > +#include <linux/memremap.h>
> > > +#include <linux/migrate.h>
> > > +#include <linux/mm_types.h>
> > > +#include <linux/pagemap.h>
> > > +#include <linux/slab.h>
> > > +
> > > +#include <drm/drm_device.h>
> > > +#include "drm_gpusvm.h"
> > > +
> > > +/**
> > > + * DOC: Overview
> > > + *
> > > + * GPU Shared Virtual Memory (GPU SVM) layer for the Direct
> > > Rendering Manager (DRM)
> > > + *
> > > + * The GPU SVM layer is a component of the DRM framework designed to
> > > manage shared
> > > + * virtual memory between the CPU and GPU. It enables efficient data
> > > exchange and
> > > + * processing for GPU-accelerated applications by allowing memory
> > > sharing and
> > > + * synchronization between the CPU's and GPU's virtual address
> > > spaces.
> > > + *
> > > + * Key GPU SVM Components:
> > > + * - Notifiers: Used for tracking memory intervals and notifying the
> > > + *		GPU of changes, notifiers are sized based on a GPU
> > > SVM
> > > + *		initialization parameter, with a recommendation of
> > > 512M or
> > > +		larger. They maintain a Red-Black tree and a list of
> > > ranges that
> > > + *		fall within the notifier interval. Notifiers are
> > > tracked within
> > > +		a GPU SVM Red-Black tree and list and are
> > > dynamically inserted
> > > + *		or removed as ranges within the interval are created
> > > or
> > > + *		destroyed.
> > > + * - Ranges: Represent memory ranges mapped in a DRM device and
> > > managed
> > > + *	     by GPU SVM. They are sized based on an array of chunk
> > > sizes, which
> > > + *	     is a GPU SVM initialization parameter, and the CPU
> > > address space.
> > > + *	     Upon GPU fault, the largest aligned chunk that fits
> > > within the
> > > + *	     faulting CPU address space is chosen for the range
> > > size. Ranges are
> > > + *	     expected to be dynamically allocated on GPU fault and
> > > removed on an
> > > + *	     MMU notifier UNMAP event. As mentioned above, ranges
> > > are tracked in
> > > + *	     a notifier's Red-Black tree.
> > > + * - Operations: Define the interface for driver-specific SVM
> > > operations such as
> > > + *		 allocation, page collection, migration,
> > > invalidations, and VRAM
> > > + *		 release.
> > > + *
> > > + * This layer provides interfaces for allocating, mapping,
> > > migrating, and
> > > + * releasing memory ranges between the CPU and GPU. It handles all
> > > core memory
> > > + * management interactions (DMA mapping, HMM, and migration) and
> > > provides
> > > + * driver-specific virtual functions (vfuncs). This infrastructure
> > > is sufficient
> > > + * to build the expected driver components for an SVM implementation
> > > as detailed
> > > + * below.
> > > + *
> > > + * Expected Driver Components:
> > > + * - GPU page fault handler: Used to create ranges and notifiers
> > > based on the
> > > + *			     fault address, optionally migrate the
> > > range to
> > > + *			     VRAM, and create GPU bindings.
> > > + * - Garbage collector: Used to destroy GPU bindings for ranges.
> > > Ranges are
> > > + *			expected to be added to the garbage
> > > collector upon
> > > + *			MMU_NOTIFY_UNMAP event.
> > > + */
> > > +
> > > +/**
> > > + * DOC: Locking
> > > + *
> > > + * GPU SVM handles locking for core MM interactions, i.e., it
> > > locks/unlocks the
> > > + * mmap lock as needed. Alternatively, if the driver prefers to
> > > handle the mmap
> > > + * lock itself, a 'locked' argument is provided to the functions
> > > that require
> > > + * the mmap lock. This option may be useful for drivers that need to
> > > call into
> > > + * GPU SVM while also holding a dma-resv lock, thus preventing
> > > locking
> > > + * inversions between the mmap and dma-resv locks.
> > > + *
> > > + * GPU SVM introduces a global notifier lock, which safeguards the
> > > notifier's
> > > + * range RB tree and list, as well as the range's DMA mappings and
> > > sequence
> > > + * number. GPU SVM manages all necessary locking and unlocking
> > > operations,
> > > + * except for the recheck of the range's sequence number
> > > + * (mmu_interval_read_retry) when the driver is committing GPU
> > > bindings. This
> > > + * lock corresponds to the 'driver->update' lock mentioned in the
> > > HMM
> > > + * documentation (TODO: Link). Future revisions may transition from
> > > a GPU SVM
> > > + * global lock to a per-notifier lock if finer-grained locking is
> > > deemed
> > > + * necessary.
> > > + *
> > > + * In addition to the locking mentioned above, the driver should
> > > implement a
> > > + * lock to safeguard core GPU SVM function calls that modify state,
> > > such as
> > > + * drm_gpusvm_range_find_or_insert and drm_gpusvm_range_remove.
> > > Alternatively,
> > > + * these core functions can be called within a single kernel thread,
> > > for
> > > + * instance, using an ordered work queue. This lock is denoted as
> > > + * 'driver_svm_lock' in code examples.
> > > + */
> > > +
> > > +/**
> > > + * DOC: Migration
> > > + *
> > > + * The migration support is quite simple, allowing migration between
> > > SRAM and
> > > + * VRAM at the range granularity. For example, GPU SVM currently
> > > does not
> > > + * support mixing SRAM and VRAM pages within a range. This means
> > > that upon GPU
> > > + * fault, the entire range can be migrated to VRAM, and upon CPU
> > > fault, the
> > > + * entire range is migrated to SRAM.
> > > + *
> > > + * The reasoning for only supporting range granularity is as
> > > follows: it
> > > + * simplifies the implementation, and range sizes are driver-defined
> > > and should
> > > + * be relatively small.
> > > + */
> > > +
> > > +/**
> > > + * DOC: Partial Unmapping of Ranges
> > > + *
> > > + * Partial unmapping of ranges (e.g., 1M out of 2M is unmapped by
> > > CPU resulting
> > > + * in MMU_NOTIFY_UNMAP event) presents several challenges, with the
> > > main one
> > > + * being that a subset of the range still has CPU and GPU mappings.
> > > If the
> > > + * backing store for the range is in VRAM, a subset of the backing
> > > store has
> > > + * references. One option would be to split the range and VRAM
> > > backing store,
> > > + * but the implementation for this would be quite complicated. Given
> > > that
> > > + * partial unmappings are rare and driver-defined range sizes are
> > > relatively
> > > + * small, GPU SVM does not support splitting of ranges.
> > > + *
> > > + * With no support for range splitting, upon partial unmapping of a
> > > range, the
> > > + * driver is expected to invalidate and destroy the entire range. If
> > > the range
> > > + * has VRAM as its backing, the driver is also expected to migrate
> > > any remaining
> > > + * pages back to SRAM.
> > > + */
> > > +
> > > +/**
> > > + * DOC: Examples
> > > + *
> > > + * This section provides two examples of how to build the expected
> > > driver
> > > + * components: the GPU page fault handler and the garbage collector.
> > > A third
> > > + * example demonstrates a sample invalidation driver vfunc.
> > > + *
> > > + * The generic code provided does not include logic for complex
> > > migration
> > > + * policies, optimized invalidations, or other potentially required
> > > driver
> > > + * locking (e.g., DMA-resv locks).
> > > + *
> > > + * 1) GPU page fault handler
> > > + *
> > > + *	int driver_bind_range(struct drm_gpusvm *gpusvm, struct
> > > drm_gpusvm_range *range)
> > > + *	{
> > > + *		int err = 0;
> > > + *
> > > + *		driver_alloc_and_setup_memory_for_bind(gpusvm,
> > > range);
> > > + *
> > > + *		drm_gpusvm_notifier_lock(gpusvm);
> > > + *		if (drm_gpusvm_range_pages_valid(range))
> > > + *			driver_commit_bind(gpusvm, range);
> > > + *		else
> > > + *			err = -EAGAIN;
> > > + *		drm_gpusvm_notifier_unlock(gpusvm);
> > > + *
> > > + *		return err;
> > > + *	}
> > > + *
> > > + *	int driver_gpu_fault(struct drm_gpusvm *gpusvm, u64
> > > fault_addr,
> > > + *			     u64 gpuva_start, u64 gpuva_end)
> > > + *	{
> > > + *		struct drm_gpusvm_ctx ctx = {};
> > > + *		int err;
> > > + *
> > > + *		driver_svm_lock();
> > > + *	retry:
> > > + *		// Always process UNMAPs first so view of GPU SVM
> > > ranges is current
> > > + *		driver_garbage_collector(gpusvm);
> > > + *
> > > + *		range = drm_gpusvm_range_find_or_insert(gpusvm,
> > > fault_addr,
> > > + *							gpuva_start,
> > > gpuva_end,
> > > + *						        &ctx);
> > > + *		if (IS_ERR(range)) {
> > > + *			err = PTR_ERR(range);
> > > + *			goto unlock;
> > > + *		}
> > > + *
> > > + *		if (driver_migration_policy(range)) {
> > > + *			bo = driver_alloc_bo();
> > > + *			err = drm_gpusvm_migrate_to_vram(gpusvm,
> > > range, bo, &ctx);
> > > + *			if (err)	// CPU mappings may have
> > > changed
> > > + *				goto retry;
> > > + *		}
> > > + *
> > > + *		err = drm_gpusvm_range_get_pages(gpusvm, range,
> > > &ctx);
> > > + *		if (err == -EFAULT || err == -EPERM)	// CPU
> > > mappings changed
> > > + *			goto retry;
> > > + *		else if (err)
> > > + *			goto unlock;
> > > + *
> > > + *		err = driver_bind_range(gpusvm, range);
> > > + *		if (err == -EAGAIN)	// CPU mappings changed
> > > + *			goto retry
> > > + *
> > > + *	unlock:
> > > + *		driver_svm_unlock();
> > > + *		return err;
> > > + *	}
> > > + *
> > > + * 2) Garbage Collector.
> > > + *
> > > + *	void __driver_garbage_collector(struct drm_gpusvm *gpusvm,
> > > + *					struct drm_gpusvm_range
> > > *range)
> > > + *	{
> > > + *		struct drm_gpusvm_ctx ctx = {};
> > > + *
> > > + *		assert_driver_svm_locked(gpusvm);
> > > + *
> > > + *		// Partial unmap, migrate any remaining VRAM pages
> > > back to SRAM
> > > + *		if (range->flags.partial_unmap)
> > > + *			drm_gpusvm_migrate_to_sram(gpusvm, range,
> > > &ctx);
> > > + *
> > > + *		driver_unbind_range(range);
> > > + *		drm_gpusvm_range_remove(gpusvm, range);
> > > + *	}
> > > + *
> > > + *	void driver_garbage_collector(struct drm_gpusvm *gpusvm)
> > > + *	{
> > > + *		assert_driver_svm_locked(gpusvm);
> > > + *
> > > + *		for_each_range_in_garbage_collector(gpusvm, range)
> > > + *			__driver_garbage_collector(gpusvm, range);
> > > + *	}
> > > + *
> > > + * 3) Invalidation driver vfunc.
> > > + *
> > > + *	void driver_invalidation(struct drm_gpusvm *gpusvm,
> > > + *				 struct drm_gpusvm_notifier
> > > *notifier,
> > > + *				 const struct mmu_notifier_range
> > > *mmu_range)
> > > + *	{
> > > + *		struct drm_gpusvm_ctx ctx = { .in_notifier = true,
> > > };
> > > + *		struct drm_gpusvm_range *range = NULL;
> > > + *
> > > + *		driver_invalidate_device_tlb(gpusvm, mmu_range-
> > > >start, mmu_range->end);
> > > + *
> > > + *		drm_gpusvm_for_each_range(range, notifier,
> > > mmu_range->start,
> > > + *					  mmu_range->end) {
> > > + *			drm_gpusvm_range_unmap_pages(gpusvm, range,
> > > &ctx);
> > > + *
> > > + *			if (mmu_range->event != MMU_NOTIFY_UNMAP)
> > > + *				continue;
> > > + *
> > > + *			drm_gpusvm_range_set_unmapped(range,
> > > mmu_range);
> > > + *			driver_garbage_collector_add(gpusvm, range);
> > > + *		}
> > > + *	}
> > > + */
> > > +
> > > +#define DRM_GPUSVM_RANGE_START(_range)	((_range)->va.start)
> > > +#define DRM_GPUSVM_RANGE_END(_range)	((_range)->va.end - 1)
> > > +INTERVAL_TREE_DEFINE(struct drm_gpusvm_range, rb.node, u64,
> > > rb.__subtree_last,
> > > +		     DRM_GPUSVM_RANGE_START, DRM_GPUSVM_RANGE_END,
> > > +		     static __maybe_unused, range);
> > > +
> > > +#define DRM_GPUSVM_NOTIFIER_START(_notifier)	((_notifier)-
> > > >interval.start)
> > > +#define DRM_GPUSVM_NOTIFIER_END(_notifier)	((_notifier)-
> > > >interval.end - 1)
> > > +INTERVAL_TREE_DEFINE(struct drm_gpusvm_notifier, rb.node, u64,
> > > +		     rb.__subtree_last, DRM_GPUSVM_NOTIFIER_START,
> > > +		     DRM_GPUSVM_NOTIFIER_END, static __maybe_unused,
> > > notifier);
> > > +
> > 
> > Since these trees span struct mm_struct address space which should fit
> > in an unsigned long, can we use the generic version (interval_tree.h)
> > rather than instantiating two new versions? I figure both contain
> > overlapping ranges so we can't use maple trees?
> > 
> 
> I can look into using a generic version, and actually I don't think we
> allow overlapping, so a maple tree might work here too. I'll likely stick
> a generic version in the next rev, but if the consensus is maple tree we
> can switch over to that fairly easily at any point in time as the tree
> interaction is completely encapsulated in the DRM SVM layer.
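
Rough sketch of what the generic interval tree version could look like,
for discussion only (untested; the 'itree' member name below is just a
placeholder, not what I'd necessarily land on):

	#include <linux/interval_tree.h>

	/*
	 * In struct drm_gpusvm_range, replacing rb.node / rb.__subtree_last;
	 * va.start / va.end would then live in itree.start / itree.last + 1.
	 */
	struct interval_tree_node itree;

	/* Insert - interval_tree_node uses an inclusive 'last' address */
	range->itree.start = ALIGN_DOWN(fault_addr, chunk_size);
	range->itree.last = ALIGN(fault_addr + 1, chunk_size) - 1;
	interval_tree_insert(&range->itree, &notifier->root);

	/* Lookup - replaces range_iter_first() */
	struct interval_tree_node *node =
		interval_tree_iter_first(&notifier->root, start, end - 1);
	struct drm_gpusvm_range *range =
		node ? container_of(node, struct drm_gpusvm_range, itree) : NULL;
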
> 
> > > +/**
> > > + * npages_in_range() - Calculate the number of pages in a given
> > > range
> > > + * @start__: The start address of the range
> > > + * @end__: The end address of the range
> > > + *
> > > + * This macro calculates the number of pages in a given memory
> > > range,
> > > + * specified by the start and end addresses. It divides the
> > > difference
> > > + * between the end and start addresses by the page size (PAGE_SIZE)
> > > to
> > > + * determine the number of pages in the range.
> > > + *
> > > + * Return: The number of pages in the specified range.
> > > + */
> > > +#define npages_in_range(start__, end__)	\
> > > +	(((end__) - (start__)) >> PAGE_SHIFT)
> > > +
> > > +/**
> > > + * struct drm_gpusvm_zdd - GPU SVM zone device data
> > > + *
> > > + * @refcount: Reference count for the zdd
> > > + * @destroy_work: Work structure for asynchronous zdd destruction
> > > + * @range: Pointer to the GPU SVM range
> > > + * @vram_allocation: Driver-private pointer to the VRAM allocation
> > > + *
> > > + * This structure serves as a generic wrapper installed in
> > > + * page->zone_device_data. It provides infrastructure for looking up
> > > a range
> > > + * upon CPU page fault and asynchronously releasing VRAM once the
> > > CPU has no
> > > + * page references. Asynchronous release is useful because CPU page
> > > references
> > > + * can be dropped in IRQ contexts, while releasing VRAM likely
> > > requires sleeping
> > > + * locks.
> > > + */
> > > +struct drm_gpusvm_zdd {
> > > +	struct kref refcount;
> > > +	struct work_struct destroy_work;
> > > +	struct drm_gpusvm_range *range;
> >  
> > I still believe previous review comments are valid here, considering we
> > do have multiple drm_gpusvm per struct mm_struct, potentially all
> > mapping the above page.
> > 
> 
> Exactly which comments?
> 
> If it is related to the range pointer, that is going to be dropped. All
> virtual references from zdd will be dropped (i.e. no pointer to even a
> DRM SVM).
> 
> > > +	void *vram_allocation;
> > 
> > NIT: Naming. The core is using device memory or devmem. Should we
> > follow?
> >
> 
> I like devmem. Will change.
>  
> > Also, rather than using a void *, could we use an embeddable struct
> > with its own ops instead of the gpusvm ops for this?
> > 
> 
> Can you give me a code snippet example of what you think this should look
> like? Not opposed to this.
> 

After reading your write-up on multi-device, yes, I think an embeddable
struct with its own ops makes sense.

I think the following ops should be moved to the embeddable struct:

vram_release
populate_vram_pfn
copy_to_vram
copy_to_sram

Also the local vram_attach, vram_detach, and vram_detached in the branch
I shared.

We likely want the device which owns the vram allocation in the embedded
struct too, so that it can be used for dma mapping when triggering
migration to a remote device or when a migrate_to_ram callback occurs.
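
Rough, untested sketch of what I'm thinking (names are placeholders, not
final):

	struct drm_gpusvm_devmem;

	struct drm_gpusvm_devmem_ops {
		/* formerly gpusvm->ops->vram_release */
		void (*release)(struct drm_gpusvm_devmem *devmem);
		/* formerly gpusvm->ops->populate_vram_pfn */
		int (*populate_pfn)(struct drm_gpusvm_devmem *devmem,
				    unsigned long npages, unsigned long *pfn);
		/* formerly gpusvm->ops->copy_to_vram */
		int (*copy_to_devmem)(struct drm_gpusvm_devmem *devmem,
				      struct page **pages, dma_addr_t *dma_addr,
				      unsigned long npages);
		/* formerly gpusvm->ops->copy_to_sram */
		int (*copy_to_ram)(struct drm_gpusvm_devmem *devmem,
				   struct page **pages, dma_addr_t *dma_addr,
				   unsigned long npages);
	};

	struct drm_gpusvm_devmem {
		struct device *dev;	/* owner, used for dma mapping */
		const struct drm_gpusvm_devmem_ops *ops;
	};

The driver would embed struct drm_gpusvm_devmem in its BO (or whatever
backs the allocation) and pass it to drm_gpusvm_migrate_to_vram in place
of the void *vram_allocation, with the zdd pointing at the embedded
struct rather than a range.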

Matt

> > > +};
> > > +
> > > +/**
> > > + * drm_gpusvm_zdd_destroy_work_func - Work function for destroying a
> > > zdd
> > > + * @w: Pointer to the work_struct
> > > + *
> > > + * This function releases VRAM, puts GPU SVM range, and frees zdd.
> > > + */
> > > +static void drm_gpusvm_zdd_destroy_work_func(struct work_struct *w)
> > > +{
> > > +	struct drm_gpusvm_zdd *zdd =
> > > +		container_of(w, struct drm_gpusvm_zdd,
> > > destroy_work);
> > > +	struct drm_gpusvm_range *range = zdd->range;
> > > +	struct drm_gpusvm *gpusvm = range->gpusvm;
> > > +
> > > +	if (gpusvm->ops->vram_release && zdd->vram_allocation)
> > > +		gpusvm->ops->vram_release(zdd->vram_allocation);
> > > +	drm_gpusvm_range_put(range);
> > > +	kfree(zdd);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_zdd_alloc - Allocate a zdd structure.
> > > + * @range: Pointer to the GPU SVM range.
> > > + *
> > > + * This function allocates and initializes a new zdd structure. It
> > > sets up the
> > > + * reference count, initializes the destroy work, and links the
> > > provided GPU SVM
> > > + * range.
> > > + *
> > > + * Returns:
> > > + * Pointer to the allocated zdd on success, ERR_PTR() on failure.
> > > + */
> > > +static struct drm_gpusvm_zdd *
> > > +drm_gpusvm_zdd_alloc(struct drm_gpusvm_range *range)
> > > +{
> > > +	struct drm_gpusvm_zdd *zdd;
> > > +
> > > +	zdd = kmalloc(sizeof(*zdd), GFP_KERNEL);
> > > +	if (!zdd)
> > > +		return NULL;
> > > +
> > > +	kref_init(&zdd->refcount);
> > > +	INIT_WORK(&zdd->destroy_work,
> > > drm_gpusvm_zdd_destroy_work_func);
> > > +	zdd->range = drm_gpusvm_range_get(range);
> > > +	zdd->vram_allocation = NULL;
> > > +
> > > +	return zdd;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_zdd_get - Get a reference to a zdd structure.
> > > + * @zdd: Pointer to the zdd structure.
> > > + *
> > > + * This function increments the reference count of the provided zdd
> > > structure.
> > > + *
> > > + * Returns: Pointer to the zdd structure.
> > > + */
> > > +static struct drm_gpusvm_zdd *drm_gpusvm_zdd_get(struct
> > > drm_gpusvm_zdd *zdd)
> > > +{
> > > +	kref_get(&zdd->refcount);
> > > +	return zdd;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_zdd_destroy - Destroy a zdd structure.
> > > + * @ref: Pointer to the reference count structure.
> > > + *
> > > + * This function queues the destroy_work of the zdd for asynchronous
> > > destruction.
> > > + */
> > > +static void drm_gpusvm_zdd_destroy(struct kref *ref)
> > > +{
> > > +	struct drm_gpusvm_zdd *zdd =
> > > +		container_of(ref, struct drm_gpusvm_zdd, refcount);
> > > +	struct drm_gpusvm *gpusvm = zdd->range->gpusvm;
> > > +
> > > +	queue_work(gpusvm->zdd_wq, &zdd->destroy_work);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_zdd_put - Put a zdd reference.
> > > + * @zdd: Pointer to the zdd structure.
> > > + *
> > > + * This function decrements the reference count of the provided zdd
> > > structure
> > > + * and schedules its destruction if the count drops to zero.
> > > + */
> > > +static void drm_gpusvm_zdd_put(struct drm_gpusvm_zdd *zdd)
> > > +{
> > > +	kref_put(&zdd->refcount, drm_gpusvm_zdd_destroy);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_find - Find GPU SVM range from GPU SVM notifier
> > > + * @notifier: Pointer to the GPU SVM notifier structure.
> > > + * @start: Start address of the range
> > > + * @end: End address of the range
> > > + *
> > > + * Return: A pointer to the drm_gpusvm_range if found or NULL
> > > + */
> > > +struct drm_gpusvm_range *
> > > +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64
> > > start, u64 end)
> > > +{
> > > +	return range_iter_first(&notifier->root, start, end - 1);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_for_each_range_safe - Safely iterate over GPU SVM
> > > ranges in a notifier
> > > + * @range__: Iterator variable for the ranges
> > > + * @next__: Iterator variable for the ranges temporary storage
> > > + * @notifier__: Pointer to the GPU SVM notifier
> > > + * @start__: Start address of the range
> > > + * @end__: End address of the range
> > > + *
> > > + * This macro is used to iterate over GPU SVM ranges in a notifier
> > > while
> > > + * removing ranges from it.
> > > + */
> > > +#define drm_gpusvm_for_each_range_safe(range__, next__, notifier__,
> > > start__, end__)	\
> > > +	for ((range__) = drm_gpusvm_range_find((notifier__),
> > > (start__), (end__)),	\
> > > +	     (next__) =
> > > __drm_gpusvm_range_next(range__);				\
> > > +	     (range__) && (range__->va.start <
> > > (end__));				\
> > > +	     (range__) = (next__), (next__) =
> > > __drm_gpusvm_range_next(range__))
> > > +
> > > +/**
> > > + * __drm_gpusvm_notifier_next - get the next drm_gpusvm_notifier in
> > > the list
> > > + * @notifier: a pointer to the current drm_gpusvm_notifier
> > > + *
> > > + * Return: A pointer to the next drm_gpusvm_notifier if available,
> > > or NULL if
> > > + *         the current notifier is the last one or if the input
> > > notifier is
> > > + *         NULL.
> > > + */
> > > +static struct drm_gpusvm_notifier *
> > > +__drm_gpusvm_notifier_next(struct drm_gpusvm_notifier *notifier)
> > > +{
> > > +	if (notifier && !list_is_last(&notifier->rb.entry,
> > > +				      &notifier->gpusvm-
> > > >notifier_list))
> > > +		return list_next_entry(notifier, rb.entry);
> > > +
> > > +	return NULL;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_for_each_notifier - Iterate over GPU SVM notifiers in
> > > a gpusvm
> > > + * @notifier__: Iterator variable for the notifiers
> > > + * @gpusvm__: Pointer to the GPU SVM structure
> > > + * @start__: Start address of the notifier
> > > + * @end__: End address of the notifier
> > > + *
> > > + * This macro is used to iterate over GPU SVM notifiers in a gpusvm.
> > > + */
> > > +#define drm_gpusvm_for_each_notifier(notifier__, gpusvm__, start__,
> > > end__)		\
> > > +	for ((notifier__) = notifier_iter_first(&(gpusvm__)->root,
> > > (start__), (end__) - 1);	\
> > > +	     (notifier__) && (notifier__->interval.start <
> > > (end__));			\
> > > +	     (notifier__) = __drm_gpusvm_notifier_next(notifier__))
> > > +
> > > +/**
> > > + * drm_gpusvm_for_each_notifier_safe - Safely iterate over GPU SVM
> > > notifiers in a gpusvm
> > > + * @notifier__: Iterator variable for the notifiers
> > > + * @next__: Iterator variable for the notifiers temporary storage
> > > + * @gpusvm__: Pointer to the GPU SVM structure
> > > + * @start__: Start address of the notifier
> > > + * @end__: End address of the notifier
> > > + *
> > > + * This macro is used to iterate over GPU SVM notifiers in a gpusvm
> > > while
> > > + * removing notifiers from it.
> > > + */
> > > +#define drm_gpusvm_for_each_notifier_safe(notifier__, next__,
> > > gpusvm__, start__, end__)	\
> > > +	for ((notifier__) = notifier_iter_first(&(gpusvm__)->root,
> > > (start__), (end__) - 1),	\
> > > +	     (next__) =
> > > __drm_gpusvm_notifier_next(notifier__);				\
> > > +	     (notifier__) && (notifier__->interval.start <
> > > (end__));			\
> > > +	     (notifier__) = (next__), (next__) =
> > > __drm_gpusvm_notifier_next(notifier__))
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_invalidate - Invalidate a GPU SVM notifier.
> > > + * @mni: Pointer to the mmu_interval_notifier structure.
> > > + * @mmu_range: Pointer to the mmu_notifier_range structure.
> > > + * @cur_seq: Current sequence number.
> > > + *
> > > + * This function serves as a generic MMU notifier for GPU SVM. It
> > > sets the MMU
> > > + * notifier sequence number and calls the driver invalidate vfunc
> > > under
> > > + * gpusvm->notifier_lock.
> > > + *
> > > + * Returns:
> > > + * true if the operation succeeds, false otherwise.
> > > + */
> > > +static bool
> > > +drm_gpusvm_notifier_invalidate(struct mmu_interval_notifier *mni,
> > > +			       const struct mmu_notifier_range
> > > *mmu_range,
> > > +			       unsigned long cur_seq)
> > > +{
> > > +	struct drm_gpusvm_notifier *notifier =
> > > +		container_of(mni, typeof(*notifier), notifier);
> > > +	struct drm_gpusvm *gpusvm = notifier->gpusvm;
> > > +
> > > +	if (!mmu_notifier_range_blockable(mmu_range))
> > > +		return false;
> > > +
> > > +	down_write(&gpusvm->notifier_lock);
> > > +	mmu_interval_set_seq(mni, cur_seq);
> > > +	gpusvm->ops->invalidate(gpusvm, notifier, mmu_range);
> > > +	up_write(&gpusvm->notifier_lock);
> > > +
> > > +	return true;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_ops - MMU interval notifier operations for
> > > GPU SVM
> > > + */
> > > +static const struct mmu_interval_notifier_ops
> > > drm_gpusvm_notifier_ops = {
> > > +	.invalidate = drm_gpusvm_notifier_invalidate,
> > > +};
> > > +
> > > +/**
> > > + * drm_gpusvm_init - Initialize the GPU SVM.
> > > + * @gpusvm: Pointer to the GPU SVM structure.
> > > + * @name: Name of the GPU SVM.
> > > + * @drm: Pointer to the DRM device structure.
> > > + * @mm: Pointer to the mm_struct for the address space.
> > > + * @device_private_page_owner: Device private pages owner.
> > > + * @mm_start: Start address of GPU SVM.
> > > + * @mm_range: Range of the GPU SVM.
> > > + * @notifier_size: Size of individual notifiers.
> > > + * @ops: Pointer to the operations structure for GPU SVM.
> > > + * @chunk_sizes: Pointer to the array of chunk sizes used in range
> > > allocation.
> > > + *               Entries should be powers of 2 in descending order
> > > with last
> > > + *               entry being SZ_4K.
> > > + * @num_chunks: Number of chunks.
> > > + *
> > > + * This function initializes the GPU SVM.
> > > + *
> > > + * Returns:
> > > + * 0 on success, a negative error code on failure.
> > > + */
> > > +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> > > +		    const char *name, struct drm_device *drm,
> > > +		    struct mm_struct *mm, void
> > > *device_private_page_owner,
> > > +		    u64 mm_start, u64 mm_range, u64 notifier_size,
> > > +		    const struct drm_gpusvm_ops *ops,
> > > +		    const u64 *chunk_sizes, int num_chunks)
> > > +{
> > > +	if (!ops->invalidate || !num_chunks)
> > > +		return -EINVAL;
> > > +
> > > +	gpusvm->name = name;
> > > +	gpusvm->drm = drm;
> > > +	gpusvm->mm = mm;
> > > +	gpusvm->device_private_page_owner =
> > > device_private_page_owner;
> > > +	gpusvm->mm_start = mm_start;
> > > +	gpusvm->mm_range = mm_range;
> > > +	gpusvm->notifier_size = notifier_size;
> > > +	gpusvm->ops = ops;
> > > +	gpusvm->chunk_sizes = chunk_sizes;
> > > +	gpusvm->num_chunks = num_chunks;
> > > +	gpusvm->zdd_wq = system_wq;
> > > +
> > > +	mmgrab(mm);
> > > +	gpusvm->root = RB_ROOT_CACHED;
> > > +	INIT_LIST_HEAD(&gpusvm->notifier_list);
> > > +
> > > +	init_rwsem(&gpusvm->notifier_lock);
> > > +
> > > +	fs_reclaim_acquire(GFP_KERNEL);
> > > +	might_lock(&gpusvm->notifier_lock);
> > > +	fs_reclaim_release(GFP_KERNEL);
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_find - Find GPU SVM notifier
> > > + * @gpusvm__: Pointer to the GPU SVM structure
> > > + * @fault_addr__: Fault address
> > > + *
> > > + * This macro finds the GPU SVM notifier associated with the fault
> > > address.
> > > + *
> > > + * Returns:
> > > + * Pointer to the GPU SVM notifier on success, NULL otherwise.
> > > + */
> > > +#define drm_gpusvm_notifier_find(gpusvm__, fault_addr__)	\
> > > +	notifier_iter_first(&(gpusvm__)->root, (fault_addr__),	\
> > > +			    (fault_addr__ + 1))
> > > +
> > > +/**
> > > + * to_drm_gpusvm_notifier - retrieve the container struct for a
> > > given rbtree node
> > > + * @node__: a pointer to the rbtree node embedded within a
> > > drm_gpusvm_notifier struct
> > > + *
> > > + * Return: A pointer to the containing drm_gpusvm_notifier
> > > structure.
> > > + */
> > > +#define to_drm_gpusvm_notifier(__node)				\
> > > +	container_of((__node), struct drm_gpusvm_notifier, rb.node)
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_insert - Insert GPU SVM notifier
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > + *
> > > + * This function inserts the GPU SVM notifier into the GPU SVM RB
> > > tree and list.
> > > + */
> > > +static void drm_gpusvm_notifier_insert(struct drm_gpusvm *gpusvm,
> > > +				       struct drm_gpusvm_notifier
> > > *notifier)
> > > +{
> > > +	struct rb_node *node;
> > > +	struct list_head *head;
> > > +
> > > +	notifier_insert(notifier, &gpusvm->root);
> > > +
> > > +	node = rb_prev(&notifier->rb.node);
> > > +	if (node)
> > > +		head = &(to_drm_gpusvm_notifier(node))->rb.entry;
> > > +	else
> > > +		head = &gpusvm->notifier_list;
> > > +
> > > +	list_add(&notifier->rb.entry, head);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_remove - Remove GPU SVM notifier
> > > + * @gpusvm__: Pointer to the GPU SVM structure
> > > + * @notifier__: Pointer to the GPU SVM notifier structure
> > > + *
> > > + * This macro removes the GPU SVM notifier from the GPU SVM RB tree
> > > and list.
> > > + */
> > > +#define drm_gpusvm_notifier_remove(gpusvm__, notifier__)	\
> > > +	notifier_remove((notifier__), &(gpusvm__)->root);	\
> > > +	list_del(&(notifier__)->rb.entry)
> > > +
> > > +/**
> > > + * drm_gpusvm_fini - Finalize the GPU SVM.
> > > + * @gpusvm: Pointer to the GPU SVM structure.
> > > + *
> > > + * This function finalizes the GPU SVM by cleaning up any remaining
> > > ranges and
> > > + * notifiers, and dropping a reference to struct MM.
> > > + */
> > > +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm)
> > > +{
> > > +	struct drm_gpusvm_notifier *notifier, *next;
> > > +
> > > +	drm_gpusvm_for_each_notifier_safe(notifier, next, gpusvm, 0,
> > > LONG_MAX) {
> > > +		struct drm_gpusvm_range *range, *__next;
> > > +
> > > +		/*
> > > +		 * Remove notifier first to avoid racing with any
> > > invalidation
> > > +		 */
> > > +		mmu_interval_notifier_remove(&notifier->notifier);
> > > +		notifier->flags.removed = true;
> > > +
> > > +		drm_gpusvm_for_each_range_safe(range, __next,
> > > notifier, 0,
> > > +					       LONG_MAX)
> > > +			drm_gpusvm_range_remove(gpusvm, range);
> > > +	}
> > > +
> > > +	mmdrop(gpusvm->mm);
> > > +	WARN_ON(!RB_EMPTY_ROOT(&gpusvm->root.rb_root));
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_alloc - Allocate GPU SVM notifier
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @fault_addr: Fault address
> > > + *
> > > + * This function allocates and initializes the GPU SVM notifier
> > > structure.
> > > + *
> > > + * Returns:
> > > + * Pointer to the allocated GPU SVM notifier on success, ERR_PTR()
> > > on failure.
> > > + */
> > > +static struct drm_gpusvm_notifier *
> > > +drm_gpusvm_notifier_alloc(struct drm_gpusvm *gpusvm, u64 fault_addr)
> > > +{
> > > +	struct drm_gpusvm_notifier *notifier;
> > > +
> > > +	if (gpusvm->ops->notifier_alloc)
> > > +		notifier = gpusvm->ops->notifier_alloc();
> > > +	else
> > > +		notifier = kzalloc(sizeof(*notifier), GFP_KERNEL);
> > > +
> > > +	if (!notifier)
> > > +		return ERR_PTR(-ENOMEM);
> > > +
> > > +	notifier->gpusvm = gpusvm;
> > > +	notifier->interval.start = ALIGN_DOWN(fault_addr, gpusvm-
> > > >notifier_size);
> > > +	notifier->interval.end = ALIGN(fault_addr + 1, gpusvm-
> > > >notifier_size);
> > > +	INIT_LIST_HEAD(&notifier->rb.entry);
> > > +	notifier->root = RB_ROOT_CACHED;
> > > +	INIT_LIST_HEAD(&notifier->range_list);
> > > +
> > > +	return notifier;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_free - Free GPU SVM notifier
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > + *
> > > + * This function frees the GPU SVM notifier structure.
> > > + */
> > > +static void drm_gpusvm_notifier_free(struct drm_gpusvm *gpusvm,
> > > +				     struct drm_gpusvm_notifier
> > > *notifier)
> > > +{
> > > +	WARN_ON(!RB_EMPTY_ROOT(&notifier->root.rb_root));
> > > +
> > > +	if (gpusvm->ops->notifier_free)
> > > +		gpusvm->ops->notifier_free(notifier);
> > > +	else
> > > +		kfree(notifier);
> > > +}
> > > +
> > > +/**
> > > + * to_drm_gpusvm_range - retrieve the container struct for a given
> > > rbtree node
> > > + * @node__: a pointer to the rbtree node embedded within a
> > > drm_gpusvm_range struct
> > > + *
> > > + * Return: A pointer to the containing drm_gpusvm_range structure.
> > > + */
> > > +#define to_drm_gpusvm_range(node__)	\
> > > +	container_of((node__), struct drm_gpusvm_range, rb.node)
> > > +
> > > +/**
> > > + * drm_gpusvm_range_insert - Insert GPU SVM range
> > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + *
> > > + * This function inserts the GPU SVM range into the notifier RB tree
> > > and list.
> > > + */
> > > +static void drm_gpusvm_range_insert(struct drm_gpusvm_notifier
> > > *notifier,
> > > +				    struct drm_gpusvm_range *range)
> > > +{
> > > +	struct rb_node *node;
> > > +	struct list_head *head;
> > > +
> > > +	drm_gpusvm_notifier_lock(notifier->gpusvm);
> > > +	range_insert(range, &notifier->root);
> > > +
> > > +	node = rb_prev(&range->rb.node);
> > > +	if (node)
> > > +		head = &(to_drm_gpusvm_range(node))->rb.entry;
> > > +	else
> > > +		head = &notifier->range_list;
> > > +
> > > +	list_add(&range->rb.entry, head);
> > > +	drm_gpusvm_notifier_unlock(notifier->gpusvm);
> > > +}
> > > +
> > > +/**
> > > + * __drm_gpusvm_range_remove - Remove GPU SVM range
> > > + * @notifier__: Pointer to the GPU SVM notifier structure
> > > + * @range__: Pointer to the GPU SVM range structure
> > > + *
> > > + * This macro removes the GPU SVM range from the notifier RB tree
> > > and list.
> > > + */
> > > +#define __drm_gpusvm_range_remove(notifier__, range__)		\
> > > +	range_remove((range__), &(notifier__)->root);		\
> > > +	list_del(&(range__)->rb.entry)
> > > +
> > > +/**
> > > + * drm_gpusvm_range_alloc - Allocate GPU SVM range
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > + * @fault_addr: Fault address
> > > + * @chunk_size: Chunk size
> > > + * @migrate_vram: Flag indicating whether to migrate VRAM
> > > + *
> > > + * This function allocates and initializes the GPU SVM range
> > > structure.
> > > + *
> > > + * Returns:
> > > + * Pointer to the allocated GPU SVM range on success, ERR_PTR() on
> > > failure.
> > > + */
> > > +static struct drm_gpusvm_range *
> > > +drm_gpusvm_range_alloc(struct drm_gpusvm *gpusvm,
> > > +		       struct drm_gpusvm_notifier *notifier,
> > > +		       u64 fault_addr, u64 chunk_size, bool
> > > migrate_vram)
> > > +{
> > > +	struct drm_gpusvm_range *range;
> > > +
> > > +	if (gpusvm->ops->range_alloc)
> > > +		range = gpusvm->ops->range_alloc(gpusvm);
> > > +	else
> > > +		range = kzalloc(sizeof(*range), GFP_KERNEL);
> > > +
> > > +	if (!range)
> > > +		return ERR_PTR(-ENOMEM);
> > > +
> > > +	kref_init(&range->refcount);
> > > +	range->gpusvm = gpusvm;
> > > +	range->notifier = notifier;
> > > +	range->va.start = ALIGN_DOWN(fault_addr, chunk_size);
> > > +	range->va.end = ALIGN(fault_addr + 1, chunk_size);
> > > +	INIT_LIST_HEAD(&range->rb.entry);
> > > +	range->notifier_seq = LONG_MAX;
> > > +	range->flags.migrate_vram = migrate_vram ? 1 : 0;
> > > +
> > > +	return range;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_check_pages - Check pages
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > + * @start: Start address
> > > + * @end: End address
> > > + *
> > > + * Check if pages between start and end have been faulted in on the
> > > CPU. Used to
> > > + * prevent migration of pages without CPU backing store.
> > > + *
> > > + * Returns:
> > > + * True if pages have been faulted into CPU, False otherwise
> > > + */
> > > +static bool drm_gpusvm_check_pages(struct drm_gpusvm *gpusvm,
> > > +				   struct drm_gpusvm_notifier
> > > *notifier,
> > > +				   u64 start, u64 end)
> > > +{
> > > +	struct hmm_range hmm_range = {
> > > +		.default_flags = 0,
> > > +		.notifier = &notifier->notifier,
> > > +		.start = start,
> > > +		.end = end,
> > > +		.dev_private_owner = gpusvm-
> > > >device_private_page_owner,
> > > +	};
> > > +	unsigned long timeout =
> > > +		jiffies +
> > > msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > > +	unsigned long *pfns;
> > > +	unsigned long npages = npages_in_range(start, end);
> > > +	int err, i;
> > > +
> > > +	mmap_assert_locked(gpusvm->mm);
> > > +
> > > +	pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
> > > +	if (!pfns)
> > > +		return false;
> > > +
> > > +	hmm_range.notifier_seq = mmu_interval_read_begin(&notifier-
> > > >notifier);
> > > +	hmm_range.hmm_pfns = pfns;
> > > +
> > > +	while (true) {
> > > +		err = hmm_range_fault(&hmm_range);
> > > +		if (err == -EBUSY) {
> > > +			if (time_after(jiffies, timeout))
> > > +				break;
> > > +
> > > +			hmm_range.notifier_seq =
> > > mmu_interval_read_begin(&notifier->notifier);
> > > +			continue;
> > > +		}
> > > +		break;
> > > +	}
> > > +	if (err)
> > > +		goto err_free;
> > > +
> > > +	for (i = 0; i < npages; ++i) {
> > > +		if (!(pfns[i] & HMM_PFN_VALID)) {
> > > +			err = -EFAULT;
> > > +			goto err_free;
> > > +		}
> > > +	}
> > > +
> > > +err_free:
> > > +	kvfree(pfns);
> > > +	return err ? false : true;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_chunk_size - Determine chunk size for GPU SVM
> > > range
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > + * @vas: Pointer to the virtual memory area structure
> > > + * @fault_addr: Fault address
> > > + * @gpuva_start: Start address of GPUVA which mirrors CPU
> > > + * @gpuva_end: End address of GPUVA which mirrors CPU
> > > + * @check_pages: Flag indicating whether to check pages
> > > + *
> > > + * This function determines the chunk size for the GPU SVM range
> > > based on the
> > > + * fault address, GPU SVM chunk sizes, existing GPU SVM ranges, and
> > > the virtual
> > > + * memory area boundaries.
> > > + *
> > > + * Returns:
> > > + * Chunk size on success, LONG_MAX on failure.
> > > + */
> > > +static u64 drm_gpusvm_range_chunk_size(struct drm_gpusvm *gpusvm,
> > > +				       struct drm_gpusvm_notifier
> > > *notifier,
> > > +				       struct vm_area_struct *vas,
> > > +				       u64 fault_addr, u64
> > > gpuva_start,
> > > +				       u64 gpuva_end, bool
> > > check_pages)
> > > +{
> > > +	u64 start, end;
> > > +	int i = 0;
> > > +
> > > +retry:
> > > +	for (; i < gpusvm->num_chunks; ++i) {
> > > +		start = ALIGN_DOWN(fault_addr, gpusvm-
> > > >chunk_sizes[i]);
> > > +		end = ALIGN(fault_addr + 1, gpusvm->chunk_sizes[i]);
> > > +
> > > +		if (start >= vas->vm_start && end <= vas->vm_end &&
> > > +		    start >= notifier->interval.start &&
> > > +		    end <= notifier->interval.end &&
> > > +		    start >= gpuva_start && end <= gpuva_end)
> > > +			break;
> > > +	}
> > > +
> > > +	if (i == gpusvm->num_chunks)
> > > +		return LONG_MAX;
> > > +
> > > +	/*
> > > +	 * If allocation more than page, ensure not to overlap with
> > > existing
> > > +	 * ranges.
> > > +	 */
> > > +	if (end - start != SZ_4K) {
> > > +		struct drm_gpusvm_range *range;
> > > +
> > > +		range = drm_gpusvm_range_find(notifier, start, end);
> > > +		if (range) {
> > > +			++i;
> > > +			goto retry;
> > > +		}
> > > +
> > > +		/*
> > > +		 * XXX: Only create range on pages CPU has faulted
> > > in. Without
> > > +		 * this check, or prefault, on BMG
> > > 'xe_exec_system_allocator --r
> > > +		 * process-many-malloc' fails. In the failure case,
> > > each process
> > > +		 * mallocs 16k but the CPU VMA is ~128k which
> > > results in 64k SVM
> > > +		 * ranges. When migrating the SVM ranges, some
> > > processes fail in
> > > +		 * drm_gpusvm_migrate_to_vram with 'migrate.cpages
> > > != npages'
> > > +		 * and then upon drm_gpusvm_range_get_pages device
> > > pages from
> > > +		 * other processes are collected + faulted in which
> > > creates all
> > > +		 * sorts of problems. Unsure exactly how this is
> > > happening; also, the
> > > +		 * problem goes away if 'xe_exec_system_allocator --
> > > r
> > > +		 * process-many-malloc' mallocs at least 64k at a
> > > time.
> > > +		 */
> > > +		if (check_pages &&
> > > +		    !drm_gpusvm_check_pages(gpusvm, notifier, start,
> > > end)) {
> > > +			++i;
> > > +			goto retry;
> > > +		}
> > > +	}
> > > +
> > > +	return end - start;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_find_or_insert - Find or insert GPU SVM range
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @fault_addr: Fault address
> > > + * @gpuva_start: Start address of GPUVA which mirrors CPU
> > > + * @gpuva_end: End address of GPUVA which mirrors CPU
> > > + * @ctx: GPU SVM context
> > > + *
> > > + * This function finds or inserts a newly allocated GPU SVM range
> > > based on the
> > > + * fault address. Caller must hold a lock to protect range lookup
> > > and insertion.
> > > + *
> > > + * Returns:
> > > + * Pointer to the GPU SVM range on success, ERR_PTR() on failure.
> > > + */
> > > +struct drm_gpusvm_range *
> > > +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64
> > > fault_addr,
> > > +				u64 gpuva_start, u64 gpuva_end,
> > > +				const struct drm_gpusvm_ctx *ctx)
> > > +{
> > > +	struct drm_gpusvm_notifier *notifier;
> > > +	struct drm_gpusvm_range *range;
> > > +	struct mm_struct *mm = gpusvm->mm;
> > > +	struct vm_area_struct *vas;
> > > +	bool notifier_alloc = false;
> > > +	u64 chunk_size;
> > > +	int err;
> > > +	bool migrate_vram;
> > > +
> > > +	if (fault_addr < gpusvm->mm_start ||
> > > +	    fault_addr > gpusvm->mm_start + gpusvm->mm_range) {
> > > +		err = -EINVAL;
> > > +		goto err_out;
> > > +	}
> > > +
> > > +	if (!ctx->mmap_locked) {
> > > +		if (!mmget_not_zero(mm)) {
> > > +			err = -EFAULT;
> > > +			goto err_out;
> > > +		}
> > > +		mmap_write_lock(mm);
> > > +	}
> > > +
> > > +	mmap_assert_write_locked(mm);
> > > +
> > > +	notifier = drm_gpusvm_notifier_find(gpusvm, fault_addr);
> > > +	if (!notifier) {
> > > +		notifier = drm_gpusvm_notifier_alloc(gpusvm,
> > > fault_addr);
> > > +		if (IS_ERR(notifier)) {
> > > +			err = PTR_ERR(notifier);
> > > +			goto err_mmunlock;
> > > +		}
> > > +		notifier_alloc = true;
> > > +		err = mmu_interval_notifier_insert_locked(&notifier-
> > > >notifier,
> > > +							  mm,
> > > notifier->interval.start,
> > > +							  notifier-
> > > >interval.end -
> > > +							  notifier-
> > > >interval.start,
> > > +							 
> > > &drm_gpusvm_notifier_ops);
> > > +		if (err)
> > > +			goto err_notifier;
> > > +	}
> > > +
> > > +	vas = vma_lookup(mm, fault_addr);
> > > +	if (!vas) {
> > > +		err = -ENOENT;
> > > +		goto err_notifier_remove;
> > > +	}
> > > +
> > > +	if (!ctx->read_only && !(vas->vm_flags & VM_WRITE)) {
> > > +		err = -EPERM;
> > > +		goto err_notifier_remove;
> > > +	}
> > > +
> > > +	range = drm_gpusvm_range_find(notifier, fault_addr,
> > > fault_addr + 1);
> > > +	if (range)
> > > +		goto out_mmunlock;
> > > +	/*
> > > +	 * XXX: Short-circuiting migration based on migrate_vma_*
> > > current
> > > +	 * limitations. If/when migrate_vma_* add more support, this
> > > logic will
> > > +	 * have to change.
> > > +	 */
> > > +	migrate_vram = ctx->vram_possible &&
> > > +		vma_is_anonymous(vas) && !is_vm_hugetlb_page(vas);
> > > +
> > > +	chunk_size = drm_gpusvm_range_chunk_size(gpusvm, notifier,
> > > vas,
> > > +						 fault_addr,
> > > gpuva_start,
> > > +						 gpuva_end,
> > > migrate_vram &&
> > > +						 !ctx->prefault);
> > > +	if (chunk_size == LONG_MAX) {
> > > +		err = -EINVAL;
> > > +		goto err_notifier_remove;
> > > +	}
> > > +
> > > +	range = drm_gpusvm_range_alloc(gpusvm, notifier, fault_addr,
> > > chunk_size,
> > > +				       migrate_vram);
> > > +	if (IS_ERR(range)) {
> > > +		err = PTR_ERR(range);
> > > +		goto err_notifier_remove;
> > > +	}
> > > +
> > > +	drm_gpusvm_range_insert(notifier, range);
> > > +	if (notifier_alloc)
> > > +		drm_gpusvm_notifier_insert(gpusvm, notifier);
> > > +
> > > +	if (ctx->prefault) {
> > > +		struct drm_gpusvm_ctx __ctx = *ctx;
> > > +
> > > +		__ctx.mmap_locked = true;
> > > +		err = drm_gpusvm_range_get_pages(gpusvm, range,
> > > &__ctx);
> > > +		if (err)
> > > +			goto err_range_remove;
> > > +	}
> > > +
> > > +out_mmunlock:
> > > +	if (!ctx->mmap_locked) {
> > > +		mmap_write_unlock(mm);
> > > +		mmput(mm);
> > > +	}
> > > +
> > > +	return range;
> > > +
> > > +err_range_remove:
> > > +	__drm_gpusvm_range_remove(notifier, range);
> > > +err_notifier_remove:
> > > +	if (notifier_alloc)
> > > +		mmu_interval_notifier_remove(&notifier->notifier);
> > > +err_notifier:
> > > +	if (notifier_alloc)
> > > +		drm_gpusvm_notifier_free(gpusvm, notifier);
> > > +err_mmunlock:
> > > +	if (!ctx->mmap_locked) {
> > > +		mmap_write_unlock(mm);
> > > +		mmput(mm);
> > > +	}
> > > +err_out:
> > > +	return ERR_PTR(err);
> > > +}
> > > +
> > > +/**
> > > + * for_each_dma_page - iterate over pages in a DMA region
> > > + * @i__: the current page index in the iteration
> > > + * @j__: the current page index, log order, in the iteration
> > > + * @npages__: the total number of pages in the DMA region
> > > + * @order__: the order of the pages in the DMA region
> > > + *
> > > + * This macro iterates over each page in a DMA region. The DMA
> > > region
> > > + * is assumed to be composed of 2^@order__ pages, and the macro will
> > > + * step through the region one block of 2^@order__ pages at a time.
> > > + */
> > > +#define for_each_dma_page(i__, j__, npages__, order__)	\
> > > +	for ((i__) = 0, (j__) = 0; (i__) < (npages__);	\
> > > +	     (j__)++, (i__) += 0x1 << (order__))
> > > +
> > > +/**
> > > + * __drm_gpusvm_range_unmap_pages - Unmap pages associated with a
> > > GPU SVM range (internal)
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + *
> > > + * This function unmaps pages associated with a GPU SVM range.
> > > Assumes and
> > > + * asserts correct locking is in place when called.
> > > + */
> > > +static void __drm_gpusvm_range_unmap_pages(struct drm_gpusvm
> > > *gpusvm,
> > > +					   struct drm_gpusvm_range
> > > *range)
> > > +{
> > > +	lockdep_assert_held(&gpusvm->notifier_lock);
> > > +
> > > +	if (range->pages) {
> > > +		unsigned long i, j, npages = npages_in_range(range-
> > > >va.start,
> > > +							     range-
> > > >va.end);
> > > +
> > > +		if (range->flags.has_dma_mapping) {
> > > +			for_each_dma_page(i, j, npages, range-
> > > >order)
> > > +				dma_unmap_page(gpusvm->drm->dev,
> > > +					       range->dma_addr[j],
> > > +					       PAGE_SIZE << range-
> > > >order,
> > > +					       DMA_BIDIRECTIONAL);
> > > +		}
> > > +
> > > +		range->flags.has_vram_pages = false;
> > > +		range->flags.has_dma_mapping = false;
> > > +	}
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_free_pages - Free pages associated with a GPU
> > > SVM range
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + *
> > > + * This function frees pages associated with a GPU SVM range.
> > > + */
> > > +static void drm_gpusvm_range_free_pages(struct drm_gpusvm *gpusvm,
> > > +					struct drm_gpusvm_range
> > > *range)
> > > +{
> > > +	lockdep_assert_held(&gpusvm->notifier_lock);
> > > +
> > > +	if (range->pages) {
> > > +		if (range->flags.kfree_mapping) {
> > > +			kfree(range->dma_addr);
> > > +			range->flags.kfree_mapping = false;
> > > +			range->pages = NULL;
> > > +		} else {
> > > +			kvfree(range->pages);
> > > +			range->pages = NULL;
> > > +		}
> > > +	}
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_remove - Remove GPU SVM range
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range to be removed
> > > + *
> > > + * This function removes the specified GPU SVM range and also
> > > removes the parent
> > > + * GPU SVM notifier if no more ranges remain in the notifier. The
> > > caller must
> > > + * hold a lock to protect range and notifier removal.
> > > + */
> > > +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> > > +			     struct drm_gpusvm_range *range)
> > > +{
> > > +	struct drm_gpusvm_notifier *notifier;
> > > +
> > > +	notifier = drm_gpusvm_notifier_find(gpusvm, range-
> > > >va.start);
> > > +	if (WARN_ON_ONCE(!notifier))
> > > +		return;
> > > +
> > > +	drm_gpusvm_notifier_lock(gpusvm);
> > > +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > > +	drm_gpusvm_range_free_pages(gpusvm, range);
> > > +	__drm_gpusvm_range_remove(notifier, range);
> > > +	drm_gpusvm_notifier_unlock(gpusvm);
> > > +
> > > +	drm_gpusvm_range_put(range);
> > > +
> > > +	if (RB_EMPTY_ROOT(&notifier->root.rb_root)) {
> > > +		if (!notifier->flags.removed)
> > > +			mmu_interval_notifier_remove(&notifier-
> > > >notifier);
> > > +		drm_gpusvm_notifier_remove(gpusvm, notifier);
> > > +		drm_gpusvm_notifier_free(gpusvm, notifier);
> > > +	}
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_get - Get a reference to GPU SVM range
> > > + * @range: Pointer to the GPU SVM range
> > > + *
> > > + * This function increments the reference count of the specified GPU
> > > SVM range.
> > > + *
> > > + * Returns:
> > > + * Pointer to the GPU SVM range.
> > > + */
> > > +struct drm_gpusvm_range *
> > > +drm_gpusvm_range_get(struct drm_gpusvm_range *range)
> > > +{
> > > +	kref_get(&range->refcount);
> > > +
> > > +	return range;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_destroy - Destroy GPU SVM range
> > > + * @refcount: Pointer to the reference counter embedded in the GPU
> > > SVM range
> > > + *
> > > + * This function destroys the specified GPU SVM range when its
> > > reference count
> > > + * reaches zero. If a custom range-free function is provided, it is
> > > invoked to
> > > + * free the range; otherwise, the range is deallocated using
> > > kfree().
> > > + */
> > > +static void drm_gpusvm_range_destroy(struct kref *refcount)
> > > +{
> > > +	struct drm_gpusvm_range *range =
> > > +		container_of(refcount, struct drm_gpusvm_range,
> > > refcount);
> > > +	struct drm_gpusvm *gpusvm = range->gpusvm;
> > > +
> > > +	if (gpusvm->ops->range_free)
> > > +		gpusvm->ops->range_free(range);
> > > +	else
> > > +		kfree(range);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_put - Put a reference to GPU SVM range
> > > + * @range: Pointer to the GPU SVM range
> > > + *
> > > + * This function decrements the reference count of the specified GPU
> > > SVM range
> > > + * and frees it when the count reaches zero.
> > > + */
> > > +void drm_gpusvm_range_put(struct drm_gpusvm_range *range)
> > > +{
> > > +	kref_put(&range->refcount, drm_gpusvm_range_destroy);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_pages_valid - GPU SVM range pages valid
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + *
> > > + * This function determines if a GPU SVM range's pages are valid.
> > > Expected to be
> > > + * called holding gpusvm->notifier_lock and as the last step before
> > > committing a
> > > + * GPU binding.
> > > + *
> > > + * Returns:
> > > + * True if GPU SVM range has valid pages, False otherwise
> > > + */
> > > +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> > > +				  struct drm_gpusvm_range *range)
> > > +{
> > > +	lockdep_assert_held(&gpusvm->notifier_lock);
> > > +
> > > +	return range->flags.has_vram_pages || range-
> > > >flags.has_dma_mapping;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_pages_valid_unlocked - GPU SVM range pages valid
> > > unlocked
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + *
> > > + * This function determines if a GPU SVM range's pages are valid.
> > > Expected to be
> > > + * called without holding gpusvm->notifier_lock.
> > > + *
> > > + * Returns:
> > > + * True if GPU SVM range has valid pages, False otherwise
> > > + */
> > > +static bool
> > > +drm_gpusvm_range_pages_valid_unlocked(struct drm_gpusvm *gpusvm,
> > > +				      struct drm_gpusvm_range
> > > *range)
> > > +{
> > > +	bool pages_valid;
> > > +
> > > +	if (!range->pages)
> > > +		return false;
> > > +
> > > +	drm_gpusvm_notifier_lock(gpusvm);
> > > +	pages_valid = drm_gpusvm_range_pages_valid(gpusvm, range);
> > > +	if (!pages_valid && range->flags.kfree_mapping) {
> > > +		kfree(range->dma_addr);
> > > +		range->flags.kfree_mapping = false;
> > > +		range->pages = NULL;
> > > +	}
> > > +	drm_gpusvm_notifier_unlock(gpusvm);
> > > +
> > > +	return pages_valid;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_get_pages - Get pages for a GPU SVM range
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + * @ctx: GPU SVM context
> > > + *
> > > + * This function gets pages for a GPU SVM range and ensures they are
> > > mapped for
> > > + * DMA access.
> > > + *
> > > + * Returns:
> > > + * 0 on success, negative error code on failure.
> > > + */
> > > +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> > > +			       struct drm_gpusvm_range *range,
> > > +			       const struct drm_gpusvm_ctx *ctx)
> > > +{
> > 
> > Is it possible to split this function up to make it look more neat?
> > 
> > 
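To expand on the question above: what I had in mind is roughly the split
below, so that the top-level function mostly deals with locking and the
retry loop. Helper names and exact boundaries are just a sketch, not a
requirement:

/* Sketch only, not a drop-in patch. */

/* Run the hmm_range_fault() retry loop and fill hmm_range->hmm_pfns. */
static int drm_gpusvm_range_hmm_fault(struct drm_gpusvm *gpusvm,
				      struct drm_gpusvm_range *range,
				      struct hmm_range *hmm_range,
				      const struct drm_gpusvm_ctx *ctx);

/* dma-map faulted SRAM pages and build the dma_addr array. */
static int drm_gpusvm_range_dma_map(struct drm_gpusvm *gpusvm,
				    struct drm_gpusvm_range *range,
				    unsigned long *pfns,
				    unsigned long npages);

/* Publish pages / dma_addr under the notifier lock, with
 * mmu_interval_read_retry() as the last step.
 */
static int drm_gpusvm_range_commit_pages(struct drm_gpusvm *gpusvm,
					 struct drm_gpusvm_range *range,
					 struct hmm_range *hmm_range);
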
> > > +	struct mmu_interval_notifier *notifier = &range->notifier-
> > > >notifier;
> > > +	struct hmm_range hmm_range = {
> > > +		.default_flags = HMM_PFN_REQ_FAULT | (ctx->read_only
> > > ? 0 :
> > > +			HMM_PFN_REQ_WRITE),
> > > +		.notifier = notifier,
> > > +		.start = range->va.start,
> > > +		.end = range->va.end,
> > > +		.dev_private_owner = gpusvm-
> > > >device_private_page_owner,
> > > +	};
> > > +	struct mm_struct *mm = gpusvm->mm;
> > > +	unsigned long timeout =
> > > +		jiffies +
> > > msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > > +	unsigned long i, j;
> > > +	unsigned long npages = npages_in_range(range->va.start,
> > > range->va.end);
> > > +	unsigned int order = 0;
> > > +	unsigned long *pfns;
> > > +	struct page **pages;
> > > +	int err = 0;
> > > +	bool vram_pages = !!range->flags.migrate_vram;
> > > +	bool alloc_pfns = false, kfree_mapping;
> > > +
> > > +retry:
> > > +	kfree_mapping = false;
> > > +	hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
> > > +	if (drm_gpusvm_range_pages_valid_unlocked(gpusvm, range))
> > > +		return 0;
> > > +
> > > +	if (range->notifier_seq == hmm_range.notifier_seq && range-
> > > >pages) {
> > > +		if (ctx->prefault)
> > > +			return 0;
> > > +
> > > +		pfns = (unsigned long *)range->pages;
> > > +		pages = range->pages;
> > > +		goto map_pages;
> > > +	}
> > > +
> > > +	if (!range->pages) {
> > > +		pfns = kvmalloc_array(npages, sizeof(*pfns),
> > > GFP_KERNEL);
> > > +		if (!pfns)
> > > +			return -ENOMEM;
> > > +		alloc_pfns = true;
> > > +	} else {
> > > +		pfns = (unsigned long *)range->pages;
> > > +	}
> > > +
> > > +	if (!ctx->mmap_locked) {
> > > +		if (!mmget_not_zero(mm)) {
> > > +			err = -EFAULT;
> > > +			goto err_out;
> > > +		}
> > > +	}
> > > +
> > > +	hmm_range.hmm_pfns = pfns;
> > > +	while (true) {
> > > +		/* Must be checked after mmu_interval_read_begin */
> > > +		if (range->flags.unmapped) {
> > > +			err = -EFAULT;
> > > +			break;
> > > +		}
> > > +
> > > +		if (!ctx->mmap_locked) {
> > > +			/*
> > > +			 * XXX: HMM locking document indicates only
> > > a read-lock
> > > +			 * is required but there appears to be a
> > > window between
> > > +			 * the MMU_NOTIFY_MIGRATE event triggered in
> > > a CPU fault
> > > +			 * via migrate_vma_setup and the pages
> > > actually moving
> > > +			 * in migrate_vma_finalize in which this
> > > code can grab
> > > +			 * garbage pages. Grabbing the write-lock if
> > > the range
> > > +			 * is attached to vram appears to protect
> > > against this
> > > +			 * race.
> > > +			 */
> > > +			if (vram_pages)
> > > +				mmap_write_lock(mm);
> > > +			else
> > > +				mmap_read_lock(mm);
> > > +		}
> > > +		err = hmm_range_fault(&hmm_range);
> > > +		if (!ctx->mmap_locked) {
> > > +			if (vram_pages)
> > > +				mmap_write_unlock(mm);
> > > +			else
> > > +				mmap_read_unlock(mm);
> > > +		}
> > > +
> > > +		if (err == -EBUSY) {
> > > +			if (time_after(jiffies, timeout))
> > > +				break;
> > > +
> > > +			hmm_range.notifier_seq =
> > > mmu_interval_read_begin(notifier);
> > > +			continue;
> > > +		}
> > > +		break;
> > > +	}
> > > +	if (!ctx->mmap_locked)
> > > +		mmput(mm);
> > > +	if (err)
> > > +		goto err_free;
> > > +
> > > +	pages = (struct page **)pfns;
> > > +
> > > +	if (ctx->prefault) {
> > > +		range->pages = pages;
> > > +		goto set_seqno;
> > > +	}
> > > +
> > > +map_pages:
> > > +	if (is_device_private_page(hmm_pfn_to_page(pfns[0]))) {
> > > +		WARN_ON_ONCE(!range->vram_allocation);
> > > +
> > > +		for (i = 0; i < npages; ++i) {
> > > +			pages[i] = hmm_pfn_to_page(pfns[i]);
> > > +
> > > +			if
> > > (WARN_ON_ONCE(!is_device_private_page(pages[i]))) {
> > > +				err = -EOPNOTSUPP;
> > > +				goto err_free;
> > > +			}
> > > +		}
> > > +
> > > +		/* Do not race with notifier unmapping pages */
> > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > +		range->flags.has_vram_pages = true;
> > > +		range->pages = pages;
> > > +		if (mmu_interval_read_retry(notifier,
> > > hmm_range.notifier_seq)) {
> > > +			err = -EAGAIN;
> > > +			__drm_gpusvm_range_unmap_pages(gpusvm,
> > > range);
> > > +		}
> > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > +	} else {
> > > +		dma_addr_t *dma_addr = (dma_addr_t *)pfns;
> > > +
> > > +		for_each_dma_page(i, j, npages, order) {
> > 
> > Here it looks like you're assuming that all pages are the same order?
> > With THP that's definitely not the case (unless hmm somehow thinks
> > they are 4K pages). This probably works because we only end up here in
> > the HugeTLB case where all pages are forced to the same order.
> > 
> 
> It assumes the order within a chunk (range size) is all the same. I
> thought THP page order would always be 9 (2M). THP tests
> (*-large-malloc) seem to work on LNL.
> 
> This falls apart if chunks are larger than 2M, as the first 2M could be a
> THP and the second could not. We discussed that you were changing the dma
> addr to support mixed mappings and encode the order. That is likely correct
> and would fix this limitation of only supporting one order size per chunk.
> 
> I may not get to this in this rev but agree it should be fixed. Would
> deferring the fix be ok with you?
> 
> fwiw I haven't seen any ROI on chunks larger than 2M, so Xe likely
> won't have chunks larger than that, but I agree the design should support
> this.
> 
> Matt
> 
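Deferring this is fine with me. For reference, what I had in mind for the
mixed-order case is roughly to keep the map order per dma entry instead of
a single range->order, along the lines of the sketch below (struct and
field names made up):

struct drm_gpusvm_dma_entry {
	dma_addr_t addr;
	unsigned int order;	/* PAGE_SIZE << order bytes mapped at addr */
};

Then for_each_dma_page() (or its replacement) would step by the per-entry
order rather than a single order for the whole range, which would also keep
the compact kfree() mapping path usable for mixed THP / 4K ranges.
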
> > > +			if (WARN_ON_ONCE(i && order !=
> > > +					
> > > hmm_pfn_to_map_order(pfns[i]))) {
> > > +				err = -EOPNOTSUPP;
> > > +				npages = i;
> > > +				goto err_unmap;
> > > +			}
> > > +			order = hmm_pfn_to_map_order(pfns[i]);
> > > +
> > > +			pages[j] = hmm_pfn_to_page(pfns[i]);
> > > +			if
> > > (WARN_ON_ONCE(is_zone_device_page(pages[j]))) {
> > > +				err = -EOPNOTSUPP;
> > > +				npages = i;
> > > +				goto err_unmap;
> > > +			}
> > > +
> > > +			set_page_dirty_lock(pages[j]);
> > > +			mark_page_accessed(pages[j]);
> > > +
> > > +			dma_addr[j] = dma_map_page(gpusvm->drm->dev,
> > > +						   pages[j], 0,
> > > +						   PAGE_SIZE <<
> > > order,
> > > +						  
> > > DMA_BIDIRECTIONAL);
> > > +			if (dma_mapping_error(gpusvm->drm->dev,
> > > dma_addr[j])) {
> > > +				err = -EFAULT;
> > > +				npages = i;
> > > +				goto err_unmap;
> > > +			}
> > > +		}
> > > +
> > > +		/* Huge pages, reduce memory footprint */
> > > +		if (order) {
> > > +			dma_addr = kmalloc_array(j,
> > > sizeof(*dma_addr),
> > > +						 GFP_KERNEL);
> > > +			if (dma_addr) {
> > > +				for (i = 0; i < j; ++i)
> > > +					dma_addr[i] =
> > > (dma_addr_t)pfns[i];
> > > +				kvfree(pfns);
> > > +				kfree_mapping = true;
> > > +			} else {
> > > +				dma_addr = (dma_addr_t *)pfns;
> > > +			}
> > > +		}
> > > +
> > > +		/* Do not race with notifier unmapping pages */
> > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > +		range->order = order;
> > > +		range->flags.kfree_mapping = kfree_mapping;
> > > +		range->flags.has_dma_mapping = true;
> > > +		range->dma_addr = dma_addr;
> > > +		range->vram_allocation = NULL;
> > > +		if (mmu_interval_read_retry(notifier,
> > > hmm_range.notifier_seq)) {
> > > +			err = -EAGAIN;
> > > +			__drm_gpusvm_range_unmap_pages(gpusvm,
> > > range);
> > > +		}
> > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > +	}
> > > +
> > > +	if (err == -EAGAIN)
> > > +		goto retry;
> > > +set_seqno:
> > > +	range->notifier_seq = hmm_range.notifier_seq;
> > > +
> > > +	return 0;
> > > +
> > > +err_unmap:
> > > +	for_each_dma_page(i, j, npages, order)
> > > +		dma_unmap_page(gpusvm->drm->dev,
> > > +			       (dma_addr_t)pfns[j],
> > > +			       PAGE_SIZE << order,
> > > DMA_BIDIRECTIONAL);
> > > +err_free:
> > > +	if (alloc_pfns)
> > > +		kvfree(pfns);
> > > +err_out:
> > > +	return err;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_unmap_pages - Unmap pages associated with a GPU
> > > SVM range
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + * @ctx: GPU SVM context
> > > + *
> > > + * This function unmaps pages associated with a GPU SVM range. If
> > > @in_notifier
> > > + * is set, it is assumed that gpusvm->notifier_lock is held in write
> > > mode; if it
> > > + * is clear, it acquires gpusvm->notifier_lock in read mode. Must be
> > > called on
> > > + * each GPU SVM range attached to notifier in gpusvm->ops-
> > > >invalidate for IOMMU
> > > + * security model.
> > > + */
> > > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> > > +				  struct drm_gpusvm_range *range,
> > > +				  const struct drm_gpusvm_ctx *ctx)
> > > +{
> > > +	if (ctx->in_notifier)
> > > +		lockdep_assert_held_write(&gpusvm->notifier_lock);
> > > +	else
> > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > +
> > > +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > > +
> > > +	if (!ctx->in_notifier)
> > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migration_put_page - Put a migration page
> > > + * @page: Pointer to the page to put
> > > + *
> > > + * This function unlocks and puts a page.
> > > + */
> > > +static void drm_gpusvm_migration_put_page(struct page *page)
> > > +{
> > > +	unlock_page(page);
> > > +	put_page(page);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migration_put_pages - Put migration pages
> > > + * @npages: Number of pages
> > > + * @migrate_pfn: Array of migrate page frame numbers
> > > + *
> > > + * This function puts an array of pages.
> > > + */
> > > +static void drm_gpusvm_migration_put_pages(unsigned long npages,
> > > +					   unsigned long
> > > *migrate_pfn)
> > > +{
> > > +	unsigned long i;
> > > +
> > > +	for (i = 0; i < npages; ++i) {
> > > +		if (!migrate_pfn[i])
> > > +			continue;
> > > +
> > > +		drm_gpusvm_migration_put_page(migrate_pfn_to_page(mi
> > > grate_pfn[i]));
> > > +		migrate_pfn[i] = 0;
> > > +	}
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_get_vram_page - Get a reference to a VRAM page
> > > + * @page: Pointer to the page
> > > + * @zdd: Pointer to the GPU SVM zone device data
> > > + *
> > > + * This function associates the given page with the specified GPU
> > > SVM zone
> > > + * device data and initializes it for zone device usage.
> > > + */
> > > +static void drm_gpusvm_get_vram_page(struct page *page,
> > > +				     struct drm_gpusvm_zdd *zdd)
> > > +{
> > > +	page->zone_device_data = drm_gpusvm_zdd_get(zdd);
> > > +	zone_device_page_init(page);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migrate_map_pages() - Map migration pages for GPU SVM
> > > migration
> > > + * @dev: The device for which the pages are being mapped
> > > + * @dma_addr: Array to store DMA addresses corresponding to mapped
> > > pages
> > > + * @migrate_pfn: Array of migrate page frame numbers to map
> > > + * @npages: Number of pages to map
> > > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > > + *
> > > + * This function maps pages of memory for migration usage in GPU
> > > SVM. It
> > > + * iterates over each page frame number provided in @migrate_pfn,
> > > maps the
> > > + * corresponding page, and stores the DMA address in the provided
> > > @dma_addr
> > > + * array.
> > > + *
> > > + * Return: 0 on success, -EFAULT if an error occurs during mapping.
> > > + */
> > > +static int drm_gpusvm_migrate_map_pages(struct device *dev,
> > > +					dma_addr_t *dma_addr,
> > > +					long unsigned int
> > > *migrate_pfn,
> > > +					unsigned long npages,
> > > +					enum dma_data_direction dir)
> > > +{
> > > +	unsigned long i;
> > > +
> > > +	for (i = 0; i < npages; ++i) {
> > > +		struct page *page =
> > > migrate_pfn_to_page(migrate_pfn[i]);
> > > +
> > > +		if (!page)
> > > +			continue;
> > > +
> > > +		if (WARN_ON_ONCE(is_zone_device_page(page)))
> > > +			return -EFAULT;
> > > +
> > > +		dma_addr[i] = dma_map_page(dev, page, 0, PAGE_SIZE,
> > > dir);
> > > +		if (dma_mapping_error(dev, dma_addr[i]))
> > > +			return -EFAULT;
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migrate_unmap_pages() - Unmap pages previously mapped
> > > for GPU SVM migration
> > > + * @dev: The device for which the pages were mapped
> > > + * @dma_addr: Array of DMA addresses corresponding to mapped pages
> > > + * @npages: Number of pages to unmap
> > > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > > + *
> > > + * This function unmaps previously mapped pages of memory for GPU
> > > Shared Virtual
> > > + * Memory (SVM). It iterates over each DMA address provided in
> > > @dma_addr, checks
> > > + * if it's valid and not already unmapped, and unmaps the
> > > corresponding page.
> > > + */
> > > +static void drm_gpusvm_migrate_unmap_pages(struct device *dev,
> > > +					   dma_addr_t *dma_addr,
> > > +					   unsigned long npages,
> > > +					   enum dma_data_direction
> > > dir)
> > > +{
> > > +	unsigned long i;
> > > +
> > > +	for (i = 0; i < npages; ++i) {
> > > +		if (!dma_addr[i] || dma_mapping_error(dev,
> > > dma_addr[i]))
> > > +			continue;
> > > +
> > > +		dma_unmap_page(dev, dma_addr[i], PAGE_SIZE, dir);
> > > +	}
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migrate_to_vram - Migrate GPU SVM range to VRAM
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + * @vram_allocation: Driver-private pointer to the VRAM allocation.
> > > The caller
> > > + *                   should hold a reference to the VRAM allocation,
> > > which
> > > + *                   should be dropped via ops->vram_release or
> > > upon the
> > > + *                   failure of this function.
> > > + * @ctx: GPU SVM context
> > > + *
> > > + * This function migrates the specified GPU SVM range to VRAM. It
> > > performs the
> > > + * necessary setup and invokes the driver-specific operations for
> > > migration to
> > > + * VRAM. Upon successful return, @vram_allocation can safely
> > > reference @range
> > > + * until ops->vram_release is called, which only happens upon successful
> > > return.
> > > + *
> > > + * Returns:
> > > + * 0 on success, negative error code on failure.
> > > + */
> > > +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> > > +			       struct drm_gpusvm_range *range,
> > > +			       void *vram_allocation,
> > > +			       const struct drm_gpusvm_ctx *ctx)
> > > +{
> > > +	u64 start = range->va.start, end = range->va.end;
> > > +	struct migrate_vma migrate = {
> > > +		.start		= start,
> > > +		.end		= end,
> > > +		.pgmap_owner	= gpusvm->device_private_page_owner,
> > > +		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
> > > +	};
> > > +	struct mm_struct *mm = gpusvm->mm;
> > > +	unsigned long i, npages = npages_in_range(start, end);
> > > +	struct vm_area_struct *vas;
> > > +	struct drm_gpusvm_zdd *zdd = NULL;
> > > +	struct page **pages;
> > > +	dma_addr_t *dma_addr;
> > > +	void *buf;
> > > +	int err;
> > > +
> > > +	if (!range->flags.migrate_vram)
> > > +		return -EINVAL;
> > > +
> > > +	if (!gpusvm->ops->populate_vram_pfn || !gpusvm->ops-
> > > >copy_to_vram ||
> > > +	    !gpusvm->ops->copy_to_sram)
> > > +		return -EOPNOTSUPP;
> > > +
> > > +	if (!ctx->mmap_locked) {
> > > +		if (!mmget_not_zero(mm)) {
> > > +			err = -EFAULT;
> > > +			goto err_out;
> > > +		}
> > > +		mmap_write_lock(mm);
> > > +	}
> > > +
> > > +	mmap_assert_locked(mm);
> > > +
> > > +	vas = vma_lookup(mm, start);
> > > +	if (!vas) {
> > > +		err = -ENOENT;
> > > +		goto err_mmunlock;
> > > +	}
> > > +
> > > +	if (end > vas->vm_end || start < vas->vm_start) {
> > > +		err = -EINVAL;
> > > +		goto err_mmunlock;
> > > +	}
> > > +
> > > +	if (!vma_is_anonymous(vas)) {
> > > +		err = -EBUSY;
> > > +		goto err_mmunlock;
> > > +	}
> > > +
> > > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) +
> > > sizeof(*dma_addr) +
> > > +		       sizeof(*pages), GFP_KERNEL);
> > > +	if (!buf) {
> > > +		err = -ENOMEM;
> > > +		goto err_mmunlock;
> > > +	}
> > > +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > > +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr))
> > > * npages;
> > > +
> > > +	zdd = drm_gpusvm_zdd_alloc(range);
> > > +	if (!zdd) {
> > > +		err = -ENOMEM;
> > > +		goto err_free;
> > > +	}
> > > +
> > > +	migrate.vma = vas;
> > > +	migrate.src = buf;
> > > +	migrate.dst = migrate.src + npages;
> > > +
> > > +	err = migrate_vma_setup(&migrate);
> > > +	if (err)
> > > +		goto err_free;
> > > +
> > > +	/*
> > > +	 * FIXME: Below cases, !migrate.cpages and migrate.cpages !=
> > > npages, not
> > > +	 * always an error. Need to revisit possible cases and how
> > > to handle. We
> > > +	 * could prefault on migrate.cpages != npages via
> > > hmm_range_fault.
> > > +	 */
> > > +
> > > +	if (!migrate.cpages) {
> > > +		err = -EFAULT;
> > > +		goto err_free;
> > > +	}
> > > +
> > > +	if (migrate.cpages != npages) {
> > > +		err = -EBUSY;
> > > +		goto err_finalize;
> > > +	}
> > > +
> > > +	err = gpusvm->ops->populate_vram_pfn(gpusvm,
> > > vram_allocation, npages,
> > > +					     migrate.dst);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> > > dma_addr,
> > > +					   migrate.src, npages,
> > > DMA_TO_DEVICE);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	for (i = 0; i < npages; ++i) {
> > > +		struct page *page = pfn_to_page(migrate.dst[i]);
> > > +
> > > +		pages[i] = page;
> > > +		migrate.dst[i] = migrate_pfn(migrate.dst[i]);
> > > +		drm_gpusvm_get_vram_page(page, zdd);
> > > +	}
> > > +
> > > +	err = gpusvm->ops->copy_to_vram(gpusvm, pages, dma_addr,
> > > npages);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	/* Upon success bind vram allocation to range and zdd */
> > > +	range->vram_allocation = vram_allocation;
> > > +	WRITE_ONCE(zdd->vram_allocation, vram_allocation);	/*
> > > Owns ref */
> > > +
> > > +err_finalize:
> > > +	if (err)
> > > +		drm_gpusvm_migration_put_pages(npages, migrate.dst);
> > > +	migrate_vma_pages(&migrate);
> > > +	migrate_vma_finalize(&migrate);
> > > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr,
> > > npages,
> > > +				       DMA_TO_DEVICE);
> > > +err_free:
> > > +	if (zdd)
> > > +		drm_gpusvm_zdd_put(zdd);
> > > +	kvfree(buf);
> > > +err_mmunlock:
> > > +	if (!ctx->mmap_locked) {
> > > +		mmap_write_unlock(mm);
> > > +		mmput(mm);
> > > +	}
> > > +err_out:
> > > +	return err;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migrate_populate_sram_pfn - Populate SRAM PFNs for a
> > > VM area
> > > + * @vas: Pointer to the VM area structure, can be NULL
> > > + * @npages: Number of pages to populate
> > > + * @src_mpfn: Source array of migrate PFNs
> > > + * @mpfn: Array of migrate PFNs to populate
> > > + * @addr: Start address for PFN allocation
> > > + *
> > > + * This function populates the SRAM migrate page frame numbers
> > > (PFNs) for the
> > > + * specified VM area structure. It allocates and locks pages in the
> > > VM area for
> > > + * SRAM usage. If @vas is non-NULL, use alloc_page_vma() for allocation;
> > > if NULL, use
> > > + * alloc_page().
> > > + *
> > > + * Returns:
> > > + * 0 on success, negative error code on failure.
> > > + */
> > > +static int drm_gpusvm_migrate_populate_sram_pfn(struct
> > > vm_area_struct *vas,
> > > +						unsigned long
> > > npages,
> > > +						unsigned long
> > > *src_mpfn,
> > > +						unsigned long *mpfn,
> > > u64 addr)
> > > +{
> > > +	unsigned long i;
> > > +
> > > +	for (i = 0; i < npages; ++i, addr += PAGE_SIZE) {
> > > +		struct page *page;
> > > +
> > > +		if (!(src_mpfn[i] & MIGRATE_PFN_MIGRATE))
> > > +			continue;
> > > +
> > > +		if (vas)
> > > +			page = alloc_page_vma(GFP_HIGHUSER, vas,
> > > addr);
> > > +		else
> > > +			page = alloc_page(GFP_HIGHUSER);
> > > +
> > > +		if (!page)
> > > +			return -ENOMEM;
> > > +
> > > +		lock_page(page);
> > > +		mpfn[i] = migrate_pfn(page_to_pfn(page));
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_evict_to_sram - Evict GPU SVM range to SRAM
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + *
> > > + * Similar to __drm_gpusvm_migrate_to_sram but does not require the mmap
> > > lock and
> > > + * migration is done via the migrate_device_* functions. Fallback path, as
> > > it is
> > > + * preferred to issue migrations with the mmap lock held.
> > > + *
> > > + * Returns:
> > > + * 0 on success, negative error code on failure.
> > > + */
> > > +static int drm_gpusvm_evict_to_sram(struct drm_gpusvm *gpusvm,
> > > +				    struct drm_gpusvm_range *range)
> > > +{
> > > +	unsigned long npages;
> > > +	struct page **pages;
> > > +	unsigned long *src, *dst;
> > > +	dma_addr_t *dma_addr;
> > > +	void *buf;
> > > +	int i, err = 0;
> > > +
> > > +	npages = npages_in_range(range->va.start, range->va.end);
> > > +
> > > +	buf = kvcalloc(npages, 2 * sizeof(*src) + sizeof(*dma_addr)
> > > +
> > > +		       sizeof(*pages), GFP_KERNEL);
> > > +	if (!buf) {
> > > +		err = -ENOMEM;
> > > +		goto err_out;
> > > +	}
> > > +	src = buf;
> > > +	dst = buf + (sizeof(*src) * npages);
> > > +	dma_addr = buf + (2 * sizeof(*src) * npages);
> > > +	pages = buf + (2 * sizeof(*src) + sizeof(*dma_addr)) *
> > > npages;
> > > +
> > > +	err = gpusvm->ops->populate_vram_pfn(gpusvm, range-
> > > >vram_allocation,
> > > +					     npages, src);
> > > +	if (err)
> > > +		goto err_free;
> > > +
> > > +	err = migrate_device_vma_range(gpusvm->mm,
> > > +				       gpusvm-
> > > >device_private_page_owner, src,
> > > +				       npages, range->va.start);
> > > +	if (err)
> > > +		goto err_free;
> > > +
> > > +	err = drm_gpusvm_migrate_populate_sram_pfn(NULL, npages,
> > > src, dst, 0);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> > > dma_addr,
> > > +					   dst, npages,
> > > DMA_BIDIRECTIONAL);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	for (i = 0; i < npages; ++i)
> > > +		pages[i] = migrate_pfn_to_page(src[i]);
> > > +
> > > +	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr,
> > > npages);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +err_finalize:
> > > +	if (err)
> > > +		drm_gpusvm_migration_put_pages(npages, dst);
> > > +	migrate_device_pages(src, dst, npages);
> > > +	migrate_device_finalize(src, dst, npages);
> > > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr,
> > > npages,
> > > +				       DMA_BIDIRECTIONAL);
> > > +err_free:
> > > +	kvfree(buf);
> > > +err_out:
> > > +
> > > +	return err;
> > > +}
> > > +
> > > +/**
> > > + * __drm_gpusvm_migrate_to_sram - Migrate GPU SVM range to SRAM
> > > (internal)
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @vas: Pointer to the VM area structure
> > > + * @page: Pointer to the page for fault handling (can be NULL)
> > > + * @start: Start address of the migration range
> > > + * @end: End address of the migration range
> > > + *
> > > + * This internal function performs the migration of the specified
> > > GPU SVM range
> > > + * to SRAM. It sets up the migration, populates + dma maps SRAM
> > > PFNs, and
> > > + * invokes the driver-specific operations for migration to SRAM.
> > > + *
> > > + * Returns:
> > > + * 0 on success, negative error code on failure.
> > > + */
> > > +static int __drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > > +					struct vm_area_struct *vas,
> > > +					struct page *page,
> > > +					u64 start, u64 end)
> > > +{
> > > +	struct migrate_vma migrate = {
> > > +		.vma		= vas,
> > > +		.pgmap_owner	= gpusvm->device_private_page_owner,
> > > +		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> > > +		.fault_page	= page,
> > > +	};
> > > +	unsigned long npages;
> > > +	struct page **pages;
> > > +	dma_addr_t *dma_addr;
> > > +	void *buf;
> > > +	int i, err = 0;
> > > +
> > > +	mmap_assert_locked(gpusvm->mm);
> > > +
> > > +	/* Corner case where the VM area struct has been partially unmapped
> > > */
> > > +	if (start < vas->vm_start)
> > > +		start = vas->vm_start;
> > > +	if (end > vas->vm_end)
> > > +		end = vas->vm_end;
> > > +
> > > +	migrate.start = start;
> > > +	migrate.end = end;
> > > +	npages = npages_in_range(start, end);
> > > +
> > > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) +
> > > sizeof(*dma_addr) +
> > > +		       sizeof(*pages), GFP_KERNEL);
> > > +	if (!buf) {
> > > +		err = -ENOMEM;
> > > +		goto err_out;
> > > +	}
> > > +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > > +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr))
> > > * npages;
> > > +
> > > +	migrate.vma = vas;
> > > +	migrate.src = buf;
> > > +	migrate.dst = migrate.src + npages;
> > > +
> > > +	err = migrate_vma_setup(&migrate);
> > > +	if (err)
> > > +		goto err_free;
> > > +
> > > +	/* Raced with another CPU fault, nothing to do */
> > > +	if (!migrate.cpages)
> > > +		goto err_free;
> > > +
> > > +	err = drm_gpusvm_migrate_populate_sram_pfn(vas, npages,
> > > +						   migrate.src,
> > > migrate.dst,
> > > +						   start);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> > > dma_addr,
> > > +					   migrate.dst, npages,
> > > +					   DMA_BIDIRECTIONAL);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	for (i = 0; i < npages; ++i)
> > > +		pages[i] = migrate_pfn_to_page(migrate.src[i]);
> > > +
> > > +	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr,
> > > npages);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +err_finalize:
> > > +	if (err)
> > > +		drm_gpusvm_migration_put_pages(npages, migrate.dst);
> > > +	migrate_vma_pages(&migrate);
> > > +	migrate_vma_finalize(&migrate);
> > > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr,
> > > npages,
> > > +				       DMA_BIDIRECTIONAL);
> > > +err_free:
> > > +	kvfree(buf);
> > > +err_out:
> > > +	mmap_assert_locked(gpusvm->mm);
> > > +
> > > +	return err;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migrate_to_sram - Migrate (evict) GPU SVM range to
> > > SRAM
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + * @ctx: GPU SVM context
> > > + *
> > > + * This function initiates the migration of the specified GPU SVM
> > > range to
> > > + * SRAM. It performs necessary checks and invokes the internal
> > > migration
> > > + * function for actual migration.
> > > + *
> > > + * Returns:
> > > + * 0 on success, negative error code on failure.
> > > + */
> > > +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > > +			       struct drm_gpusvm_range *range,
> > > +			       const struct drm_gpusvm_ctx *ctx)
> > > +{
> > > +	u64 start = range->va.start, end = range->va.end;
> > > +	struct mm_struct *mm = gpusvm->mm;
> > > +	struct vm_area_struct *vas;
> > > +	int err;
> > > +	bool retry = false;
> > > +
> > > +	if (!ctx->mmap_locked) {
> > > +		if (!mmget_not_zero(mm)) {
> > > +			err = -EFAULT;
> > > +			goto err_out;
> > > +		}
> > > +		if (ctx->trylock_mmap) {
> > > +			if (!mmap_read_trylock(mm))  {
> > > +				err =
> > > drm_gpusvm_evict_to_sram(gpusvm, range);
> > > +				goto err_mmput;
> > > +			}
> > > +		} else {
> > > +			mmap_read_lock(mm);
> > > +		}
> > > +	}
> > > +
> > > +	mmap_assert_locked(mm);
> > > +
> > > +	/*
> > > +	 * Loop required to find all VMA area structs for the corner
> > > case when
> > > +	 * VRAM backing has been partially unmapped from MM's
> > > address space.
> > > +	 */
> > > +again:
> > > +	vas = find_vma(mm, start);
> > > +	if (!vas) {
> > > +		if (!retry)
> > > +			err = -ENOENT;
> > > +		goto err_mmunlock;
> > > +	}
> > > +
> > > +	if (end <= vas->vm_start || start >= vas->vm_end) {
> > > +		if (!retry)
> > > +			err = -EINVAL;
> > > +		goto err_mmunlock;
> > > +	}
> > > +
> > > +	err = __drm_gpusvm_migrate_to_sram(gpusvm, vas, NULL, start,
> > > end);
> > > +	if (err)
> > > +		goto err_mmunlock;
> > > +
> > > +	if (vas->vm_end < end) {
> > > +		retry = true;
> > > +		start = vas->vm_end;
> > > +		goto again;
> > > +	}
> > > +
> > > +	if (!ctx->mmap_locked) {
> > > +		mmap_read_unlock(mm);
> > > +		/*
> > > +		 * Using mmput_async as this function can be called
> > > while
> > > +		 * holding a dma-resv lock, and a final put can grab
> > > the mmap
> > > +		 * lock, causing a lock inversion.
> > > +		 */
> > > +		mmput_async(mm);
> > > +	}
> > > +
> > > +	return 0;
> > > +
> > > +err_mmunlock:
> > > +	if (!ctx->mmap_locked)
> > > +		mmap_read_unlock(mm);
> > > +err_mmput:
> > > +	if (!ctx->mmap_locked)
> > > +		mmput_async(mm);
> > > +err_out:
> > > +	return err;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_page_free - Put GPU SVM zone device data associated
> > > with a page
> > > + * @page: Pointer to the page
> > > + *
> > > + * This function is a callback used to put the GPU SVM zone device
> > > data
> > > + * associated with a page when it is being released.
> > > + */
> > > +static void drm_gpusvm_page_free(struct page *page)
> > > +{
> > > +	drm_gpusvm_zdd_put(page->zone_device_data);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migrate_to_ram - Migrate GPU SVM range to RAM (page
> > > fault handler)
> > > + * @vmf: Pointer to the fault information structure
> > > + *
> > > + * This function is a page fault handler used to migrate a GPU SVM
> > > range to RAM.
> > > + * It retrieves the GPU SVM range information from the faulting page
> > > and invokes
> > > + * the internal migration function to migrate the range back to RAM.
> > > + *
> > > + * Returns:
> > > + * VM_FAULT_SIGBUS on failure, 0 on success.
> > > + */
> > > +static vm_fault_t drm_gpusvm_migrate_to_ram(struct vm_fault *vmf)
> > > +{
> > > +	struct drm_gpusvm_zdd *zdd = vmf->page->zone_device_data;
> > > +	int err;
> > > +
> > > +	err = __drm_gpusvm_migrate_to_sram(zdd->range->gpusvm,
> > > +					   vmf->vma, vmf->page,
> > > +					   zdd->range->va.start,
> > > +					   zdd->range->va.end);
> > > +
> > > +	return err ? VM_FAULT_SIGBUS : 0;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_pagemap_ops - Device page map operations for GPU SVM
> > > + */
> > > +static const struct dev_pagemap_ops drm_gpusvm_pagemap_ops = {
> > > +	.page_free = drm_gpusvm_page_free,
> > > +	.migrate_to_ram = drm_gpusvm_migrate_to_ram,
> > > +};
> > > +
> > > +/**
> > > + * drm_gpusvm_pagemap_ops_get - Retrieve GPU SVM device page map
> > > operations
> > > + *
> > > + * Returns:
> > > + * Pointer to the GPU SVM device page map operations structure.
> > > + */
> > > +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void)
> > > +{
> > > +	return &drm_gpusvm_pagemap_ops;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_has_mapping - Check if GPU SVM has mapping for the
> > > given address range
> > > + * @gpusvm: Pointer to the GPU SVM structure.
> > > + * @start: Start address
> > > + * @end: End address
> > > + *
> > > + * Returns:
> > > + * True if GPU SVM has mapping, False otherwise
> > > + */
> > > +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start,
> > > u64 end)
> > > +{
> > > +	struct drm_gpusvm_notifier *notifier;
> > > +
> > > +	drm_gpusvm_for_each_notifier(notifier, gpusvm, start, end) {
> > > +		struct drm_gpusvm_range *range = NULL;
> > > +
> > > +		drm_gpusvm_for_each_range(range, notifier, start,
> > > end)
> > > +			return true;
> > > +	}
> > > +
> > > +	return false;
> > > +}
> > > diff --git a/drivers/gpu/drm/xe/drm_gpusvm.h
> > > b/drivers/gpu/drm/xe/drm_gpusvm.h
> > > new file mode 100644
> > > index 000000000000..0ea70f8534a8
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/xe/drm_gpusvm.h
> > > @@ -0,0 +1,415 @@
> > > +/* SPDX-License-Identifier: MIT */
> > > +/*
> > > + * Copyright © 2024 Intel Corporation
> > > + */
> > > +
> > > +#ifndef __DRM_GPUSVM_H__
> > > +#define __DRM_GPUSVM_H__
> > > +
> > > +#include <linux/kref.h>
> > > +#include <linux/mmu_notifier.h>
> > > +#include <linux/workqueue.h>
> > > +
> > > +struct dev_pagemap_ops;
> > > +struct drm_device;
> > > +struct drm_gpusvm;
> > > +struct drm_gpusvm_notifier;
> > > +struct drm_gpusvm_ops;
> > > +struct drm_gpusvm_range;
> > > +
> > > +/**
> > > + * struct drm_gpusvm_ops - Operations structure for GPU SVM
> > > + *
> > > + * This structure defines the operations for GPU Shared Virtual
> > > Memory (SVM).
> > > + * These operations are provided by the GPU driver to manage SVM
> > > ranges and
> > > + * perform operations such as migration between VRAM and system RAM.
> > > + */
> > > +struct drm_gpusvm_ops {
> > > +	/**
> > > +	 * @notifier_alloc: Allocate a GPU SVM notifier (optional)
> > > +	 *
> > > +	 * This function shall allocate a GPU SVM notifier.
> > > +	 *
> > > +	 * Returns:
> > > +	 * Pointer to the allocated GPU SVM notifier on success,
> > > NULL on failure.
> > > +	 */
> > > +	struct drm_gpusvm_notifier *(*notifier_alloc)(void);
> > > +
> > > +	/**
> > > +	 * @notifier_free: Free a GPU SVM notifier (optional)
> > > +	 * @notifier: Pointer to the GPU SVM notifier to be freed
> > > +	 *
> > > +	 * This function shall free a GPU SVM notifier.
> > > +	 */
> > > +	void (*notifier_free)(struct drm_gpusvm_notifier *notifier);
> > > +
> > > +	/**
> > > +	 * @range_alloc: Allocate a GPU SVM range (optional)
> > > +	 * @gpusvm: Pointer to the GPU SVM
> > > +	 *
> > > +	 * This function shall allocate a GPU SVM range.
> > > +	 *
> > > +	 * Returns:
> > > +	 * Pointer to the allocated GPU SVM range on success, NULL
> > > on failure.
> > > +	 */
> > > +	struct drm_gpusvm_range *(*range_alloc)(struct drm_gpusvm
> > > *gpusvm);
> > > +
> > > +	/**
> > > +	 * @range_free: Free a GPU SVM range (optional)
> > > +	 * @range: Pointer to the GPU SVM range to be freed
> > > +	 *
> > > +	 * This function shall free a GPU SVM range.
> > > +	 */
> > > +	void (*range_free)(struct drm_gpusvm_range *range);
> > > +
> > > +	/**
> > > +	 * @vram_release: Release VRAM allocation (optional)
> > > +	 * @vram_allocation: Driver-private pointer to the VRAM
> > > allocation
> > > +	 *
> > > +	 * This function shall release VRAM allocation and expects
> > > to drop a
> > > +	 * reference to VRAM allocation.
> > > +	 */
> > > +	void (*vram_release)(void *vram_allocation);
> > > +
> > > +	/**
> > > +	 * @populate_vram_pfn: Populate VRAM PFN (required for
> > > migration)
> > > +	 * @gpusvm: Pointer to the GPU SVM
> > > +	 * @vram_allocation: Driver-private pointer to the VRAM
> > > allocation
> > > +	 * @npages: Number of pages to populate
> > > +	 * @pfn: Array of page frame numbers to populate
> > > +	 *
> > > +	 * This function shall populate VRAM page frame numbers
> > > (PFN).
> > > +	 *
> > > +	 * Returns:
> > > +	 * 0 on success, a negative error code on failure.
> > > +	 */
> > > +	int (*populate_vram_pfn)(struct drm_gpusvm *gpusvm,
> > > +				 void *vram_allocation,
> > > +				 unsigned long npages,
> > > +				 unsigned long *pfn);
> > > +
> > > +	/**
> > > +	 * @copy_to_vram: Copy to VRAM (required for migration)
> > > +	 * @gpusvm: Pointer to the GPU SVM
> > > +	 * @pages: Pointer to array of VRAM pages (destination)
> > > +	 * @dma_addr: Pointer to array of DMA addresses (source)
> > > +	 * @npages: Number of pages to copy
> > > +	 *
> > > +	 * This function shall copy pages to VRAM.
> > > +	 *
> > > +	 * Returns:
> > > +	 * 0 on success, a negative error code on failure.
> > > +	 */
> > > +	int (*copy_to_vram)(struct drm_gpusvm *gpusvm,
> > > +			    struct page **pages,
> > > +			    dma_addr_t *dma_addr,
> > > +			    unsigned long npages);
> > > +
> > > +	/**
> > > +	 * @copy_to_sram: Copy to system RAM (required for
> > > migration)
> > > +	 * @gpusvm: Pointer to the GPU SVM
> > > +	 * @pages: Pointer to array of VRAM pages (source)
> > > +	 * @dma_addr: Pointer to array of DMA addresses
> > > (destination)
> > > +	 * @npages: Number of pages to copy
> > > +	 *
> > > +	 * This function shall copy pages to system RAM.
> > > +	 *
> > > +	 * Returns:
> > > +	 * 0 on success, a negative error code on failure.
> > > +	 */
> > > +	int (*copy_to_sram)(struct drm_gpusvm *gpusvm,
> > > +			    struct page **pages,
> > > +			    dma_addr_t *dma_addr,
> > > +			    unsigned long npages);
> > > +
> > > +	/**
> > > +	 * @invalidate: Invalidate GPU SVM notifier (required)
> > > +	 * @gpusvm: Pointer to the GPU SVM
> > > +	 * @notifier: Pointer to the GPU SVM notifier
> > > +	 * @mmu_range: Pointer to the mmu_notifier_range structure
> > > +	 *
> > > +	 * This function shall invalidate the GPU page tables. It
> > > can safely
> > > +	 * walk the notifier range RB tree/list in this function.
> > > Called while
> > > +	 * holding the notifier lock.
> > > +	 */
> > > +	void (*invalidate)(struct drm_gpusvm *gpusvm,
> > > +			   struct drm_gpusvm_notifier *notifier,
> > > +			   const struct mmu_notifier_range
> > > *mmu_range);
> > > +};
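
A small note on the ops as a whole: it might be worth stating explicitly in
the kernel doc that only @invalidate is mandatory for SRAM-only use and that
@populate_vram_pfn, @copy_to_vram and @copy_to_sram come as a set for
migration, since drm_gpusvm_migrate_to_vram() returns -EOPNOTSUPP if any of
them is missing. I.e. a minimal SRAM-only driver would do something like
the sketch below (my_invalidate is a placeholder):

static const struct drm_gpusvm_ops my_gpusvm_ops = {
	/* Zap GPU PTEs covering the invalidated notifier range. */
	.invalidate = my_invalidate,
};
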
> > > +
> > > +/**
> > > + * struct drm_gpusvm_notifier - Structure representing a GPU SVM
> > > notifier
> > > + *
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @notifier: MMU interval notifier
> > > + * @interval: Interval for the notifier
> > > + * @rb: Red-black tree node for the parent GPU SVM structure
> > > notifier tree
> > > + * @root: Cached root node of the RB tree containing ranges
> > > + * @range_list: List head of ranges in the same order
> > > they appear in
> > > + *              interval tree. This is useful to keep iterating
> > > ranges while
> > > + *              doing modifications to RB tree.
> > > + * @flags.removed: Flag indicating whether the MMU interval notifier
> > > has been
> > > + *                 removed
> > > + *
> > > + * This structure represents a GPU SVM notifier.
> > > + */
> > > +struct drm_gpusvm_notifier {
> > > +	struct drm_gpusvm *gpusvm;
> > > +	struct mmu_interval_notifier notifier;
> > > +	struct {
> > > +		u64 start;
> > > +		u64 end;
> > > +	} interval;
> > > +	struct {
> > > +		struct rb_node node;
> > > +		struct list_head entry;
> > > +		u64 __subtree_last;
> > > +	} rb;
> > > +	struct rb_root_cached root;
> > > +	struct list_head range_list;
> > > +	struct {
> > > +		u32 removed : 1;
> > > +	} flags;
> > > +};
> > > +
> > > +/**
> > > + * struct drm_gpusvm_range - Structure representing a GPU SVM range
> > > + *
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @notifier: Pointer to the GPU SVM notifier
> > > + * @refcount: Reference count for the range
> > > + * @rb: Red-black tree node for the parent GPU SVM notifier
> > > structure range tree
> > > + * @va: Virtual address range
> > > + * @notifier_seq: Notifier sequence number of the range's pages
> > > + * @pages: Pointer to the array of pages (if backing store is in
> > > VRAM)
> > > + * @dma_addr: DMA address array (if backing store is SRAM and DMA
> > > mapped)
> > > + * @vram_allocation: Driver-private pointer to the VRAM allocation
> > > + * @order: Order of dma mapping. i.e. PAGE_SIZE << order is mapping
> > > size
> > > + * @flags.migrate_vram: Flag indicating whether the range can be
> > > migrated to VRAM
> > > + * @flags.unmapped: Flag indicating if the range has been unmapped
> > > + * @flags.partial_unmap: Flag indicating if the range has been
> > > partially unmapped
> > > + * @flags.has_vram_pages: Flag indicating if the range has vram
> > > pages
> > > + * @flags.has_dma_mapping: Flag indicating if the range has a DMA
> > > mapping
> > > + * @flags.kfree_mapping: Flag indicating @dma_addr is a compact
> > > allocation based
> > > + *                       on @order which is released via kfree()
> > > + *
> > > + * This structure represents a GPU SVM range used for tracking
> > > memory ranges
> > > + * mapped in a DRM device.
> > > + */
> > > +struct drm_gpusvm_range {
> > > +	struct drm_gpusvm *gpusvm;
> > > +	struct drm_gpusvm_notifier *notifier;
> > > +	struct kref refcount;
> > > +	struct {
> > > +		struct rb_node node;
> > > +		struct list_head entry;
> > > +		u64 __subtree_last;
> > > +	} rb;
> > > +	struct {
> > > +		u64 start;
> > > +		u64 end;
> > > +	} va;
> > > +	unsigned long notifier_seq;
> > > +	union {
> > > +		struct page **pages;
> > > +		dma_addr_t *dma_addr;
> > > +	};
> > > +	void *vram_allocation;
> > > +	u16 order;
> > > +	struct {
> > > +		/* All flags below must be set upon creation */
> > > +		u16 migrate_vram : 1;
> > > +		/* All flags below must be set / cleared under
> > > notifier lock */
> > > +		u16 unmapped : 1;
> > > +		u16 partial_unmap : 1;
> > > +		u16 has_vram_pages : 1;
> > > +		u16 has_dma_mapping : 1;
> > > +		u16 kfree_mapping : 1;
> > > +	} flags;
> > > +};
> > > +
> > > +/**
> > > + * struct drm_gpusvm - GPU SVM structure
> > > + *
> > > + * @name: Name of the GPU SVM
> > > + * @drm: Pointer to the DRM device structure
> > > + * @mm: Pointer to the mm_struct for the address space
> > > + * @device_private_page_owner: Device private pages owner
> > > + * @mm_start: Start address of GPU SVM
> > > + * @mm_range: Range of the GPU SVM
> > > + * @notifier_size: Size of individual notifiers
> > > + * @ops: Pointer to the operations structure for GPU SVM
> > > + * @chunk_sizes: Pointer to the array of chunk sizes used in range
> > > allocation.
> > > + *               Entries should be powers of 2 in descending order.
> > > + * @num_chunks: Number of chunks
> > > + * @notifier_lock: Read-write semaphore for protecting notifier
> > > operations
> > > + * @zdd_wq: Workqueue for deferred work on zdd destruction
> > > + * @root: Cached root node of the Red-Black tree containing GPU SVM
> > > notifiers
> > > + * @notifier_list: List head of notifiers in the same order they
> > > + *                 appear in the interval tree. This is useful to keep
> > > + *                 iterating notifiers while doing modifications to the
> > > + *                 RB tree.
> > > + *
> > > + * This structure represents a GPU SVM (Shared Virtual Memory) used
> > > for tracking
> > > + * memory ranges mapped in a DRM (Direct Rendering Manager) device.
> > > + *
> > > + * No reference counting is provided, as this is expected to be
> > > embedded in the
> > > + * driver VM structure along with the struct drm_gpuvm, which
> > > handles reference
> > > + * counting.
> > > + */
> > > +struct drm_gpusvm {
> > > +	const char *name;
> > > +	struct drm_device *drm;
> > > +	struct mm_struct *mm;
> > > +	void *device_private_page_owner;
> > > +	u64 mm_start;
> > > +	u64 mm_range;
> > > +	u64 notifier_size;
> > > +	const struct drm_gpusvm_ops *ops;
> > > +	const u64 *chunk_sizes;
> > > +	int num_chunks;
> > > +	struct rw_semaphore notifier_lock;
> > > +	struct workqueue_struct *zdd_wq;
> > > +	struct rb_root_cached root;
> > > +	struct list_head notifier_list;
> > > +};
> > > +
> > > +/**
> > > + * struct drm_gpusvm_ctx - DRM GPU SVM context
> > > + *
> > > + * @mmap_locked: mmap lock is locked
> > > + * @trylock_mmap: trylock mmap lock, used to avoid locking inversions
> > > + *                (e.g. dma-resv -> mmap lock)
> > > + * @in_notifier: entering from a MMU notifier
> > > + * @read_only: operating on read-only memory
> > > + * @vram_possible: possible to use VRAM
> > > + * @prefault: prefault pages
> > > + *
> > > + * Context in which DRM GPU SVM is operating (i.e. user arguments).
> > > + */
> > > +struct drm_gpusvm_ctx {
> > > +	u32 mmap_locked :1;
> > > +	u32 trylock_mmap :1;
> > > +	u32 in_notifier :1;
> > > +	u32 read_only :1;
> > > +	u32 vram_possible :1;
> > > +	u32 prefault :1;
> > > +};
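
These flags only describe the calling context; for instance, a hypothetical
prefetch path that already holds the mmap lock and wants pages resident up
front might pass something like:

	struct drm_gpusvm_ctx ctx = {
		.mmap_locked = true,	/* caller already holds mmap lock */
		.vram_possible = true,	/* allow migration to VRAM */
		.prefault = true,	/* fault pages in immediately */
	};

whereas the invalidation example later in the patch sets only .in_notifier.
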
> > > +
> > > +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> > > +		    const char *name, struct drm_device *drm,
> > > +		    struct mm_struct *mm, void
> > > *device_private_page_owner,
> > > +		    u64 mm_start, u64 mm_range, u64 notifier_size,
> > > +		    const struct drm_gpusvm_ops *ops,
> > > +		    const u64 *chunk_sizes, int num_chunks);
> > > +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm);
> > > +void drm_gpusvm_free(struct drm_gpusvm *gpusvm);
> > > +
> > > +struct drm_gpusvm_range *
> > > +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64
> > > fault_addr,
> > > +				u64 gpuva_start, u64 gpuva_end,
> > > +				const struct drm_gpusvm_ctx *ctx);
> > > +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> > > +			     struct drm_gpusvm_range *range);
> > > +
> > > +struct drm_gpusvm_range *
> > > +drm_gpusvm_range_get(struct drm_gpusvm_range *range);
> > > +void drm_gpusvm_range_put(struct drm_gpusvm_range *range);
> > > +
> > > +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> > > +				  struct drm_gpusvm_range *range);
> > > +
> > > +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> > > +			       struct drm_gpusvm_range *range,
> > > +			       const struct drm_gpusvm_ctx *ctx);
> > > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> > > +				  struct drm_gpusvm_range *range,
> > > +				  const struct drm_gpusvm_ctx *ctx);
> > > +
> > > +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> > > +			       struct drm_gpusvm_range *range,
> > > +			       void *vram_allocation,
> > > +			       const struct drm_gpusvm_ctx *ctx);
> > > +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > > +			       struct drm_gpusvm_range *range,
> > > +			       const struct drm_gpusvm_ctx *ctx);
> > > +
> > > +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void);
> > > +
> > > +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start,
> > > u64 end);
> > > +
> > > +struct drm_gpusvm_range *
> > > +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64
> > > start, u64 end);
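
Putting the declarations above together, a driver would typically embed
struct drm_gpusvm in its VM object and initialize it once at VM creation; a
rough sketch (struct driver_vm, driver_pagemap_owner() and the exact sizes
are illustrative only, not part of this patch) could be:

	static const u64 driver_chunk_sizes[] = { SZ_2M, SZ_64K, SZ_4K };

	int driver_vm_svm_init(struct driver_vm *vm)
	{
		/* Chunk sizes: powers of 2, descending, last entry SZ_4K */
		return drm_gpusvm_init(&vm->svm, "driver-svm", vm->drm,
				       current->mm, driver_pagemap_owner(),
				       0, 1ull << 47,	/* CPU VA span */
				       SZ_512M,		/* notifier size */
				       &driver_gpusvm_ops,
				       driver_chunk_sizes,
				       ARRAY_SIZE(driver_chunk_sizes));
	}
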
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_lock - Lock GPU SVM notifier
> > > + * @gpusvm__: Pointer to the GPU SVM structure.
> > > + *
> > > + * Abstraction over the GPU SVM notifier lock for client usage; takes the lock.
> > > + */
> > > +#define drm_gpusvm_notifier_lock(gpusvm__)	\
> > > +	down_read(&(gpusvm__)->notifier_lock)
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_unlock - Unlock GPU SVM notifier
> > > + * @gpusvm__: Pointer to the GPU SVM structure.
> > > + *
> > > + * Abstraction over the GPU SVM notifier lock for client usage; drops the lock.
> > > + */
> > > +#define drm_gpusvm_notifier_unlock(gpusvm__)	\
> > > +	up_read(&(gpusvm__)->notifier_lock)
> > > +
> > > +/**
> > > + * __drm_gpusvm_range_next - Get the next GPU SVM range in the list
> > > + * @range: a pointer to the current GPU SVM range
> > > + *
> > > + * Return: A pointer to the next drm_gpusvm_range if available, or
> > > NULL if the
> > > + *         current range is the last one or if the input range is
> > > NULL.
> > > + */
> > > +static inline struct drm_gpusvm_range *
> > > +__drm_gpusvm_range_next(struct drm_gpusvm_range *range)
> > > +{
> > > +	if (range && !list_is_last(&range->rb.entry,
> > > +				   &range->notifier->range_list))
> > > +		return list_next_entry(range, rb.entry);
> > > +
> > > +	return NULL;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_for_each_range - Iterate over GPU SVM ranges in a
> > > notifier
> > > + * @range__: Iterator variable for the ranges. If set, it indicates
> > > the start of
> > > + *	     the iterator. If NULL, call drm_gpusvm_range_find() to
> > > get the range.
> > > + * @notifier__: Pointer to the GPU SVM notifier
> > > + * @start__: Start address of the range
> > > + * @end__: End address of the range
> > > + *
> > > + * This macro is used to iterate over GPU SVM ranges in a notifier.
> > > It is safe
> > > + * to use while holding the driver SVM lock or the notifier lock.
> > > + */
> > > +#define drm_gpusvm_for_each_range(range__, notifier__, start__,
> > > end__)	\
> > > +	for ((range__) = (range__)
> > > ?:					\
> > > +	     drm_gpusvm_range_find((notifier__), (start__),
> > > (end__));	\
> > > +	     (range__) && (range__->va.start <
> > > (end__));		\
> > > +	     (range__) = __drm_gpusvm_range_next(range__))
> > > +
> > > +/**
> > > + * drm_gpusvm_range_set_unmapped - Mark a GPU SVM range as unmapped
> > > + * @range: Pointer to the GPU SVM range structure.
> > > + * @mmu_range: Pointer to the MMU notifier range structure.
> > > + *
> > > + * This function marks a GPU SVM range as unmapped and sets the
> > > partial_unmap flag
> > > + * if the range partially falls within the provided MMU notifier
> > > range.
> > > + */
> > > +static inline void
> > > +drm_gpusvm_range_set_unmapped(struct drm_gpusvm_range *range,
> > > +			      const struct mmu_notifier_range
> > > *mmu_range)
> > > +{
> > > +	lockdep_assert_held_write(&range->gpusvm->notifier_lock);
> > > +
> > > +	range->flags.unmapped = true;
> > > +	if (range->va.start < mmu_range->start ||
> > > +	    range->va.end > mmu_range->end)
> > > +		range->flags.partial_unmap = true;
> > > +}
> > > +
> > > +#endif /* __DRM_GPUSVM_H__ */
> > 

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory
  2024-08-28  2:48 ` [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory Matthew Brost
                     ` (6 preceding siblings ...)
  2024-09-24 10:42   ` Thomas Hellström
@ 2024-10-09 10:50   ` Thomas Hellström
  2024-10-16  3:18     ` Matthew Brost
  7 siblings, 1 reply; 100+ messages in thread
From: Thomas Hellström @ 2024-10-09 10:50 UTC (permalink / raw)
  To: Matthew Brost, intel-xe, dri-devel
  Cc: airlied, christian.koenig, matthew.auld, daniel

Hi, Matthew.

Some comments below around migrating to SRAM.


On Tue, 2024-08-27 at 19:48 -0700, Matthew Brost wrote:
> This patch introduces support for GPU Shared Virtual Memory (SVM) in
> the
> Direct Rendering Manager (DRM) subsystem. SVM allows for seamless
> sharing of memory between the CPU and GPU, enhancing performance and
> flexibility in GPU computing tasks.
> 
> The patch adds the necessary infrastructure for SVM, including data
> structures and functions for managing SVM ranges and notifiers. It
> also
> provides mechanisms for allocating, deallocating, and migrating
> memory
> regions between system RAM and GPU VRAM.
> 
> This mid-layer is largely inspired by GPUVM.
> 
> Cc: Dave Airlie <airlied@redhat.com>
> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: <dri-devel@lists.freedesktop.org>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/xe/Makefile     |    3 +-
>  drivers/gpu/drm/xe/drm_gpusvm.c | 2174
> +++++++++++++++++++++++++++++++
>  drivers/gpu/drm/xe/drm_gpusvm.h |  415 ++++++
>  3 files changed, 2591 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.c
>  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.h
> 
> diff --git a/drivers/gpu/drm/xe/Makefile
> b/drivers/gpu/drm/xe/Makefile
> index b9670ae09a9e..b8fc2ee58f1a 100644
> --- a/drivers/gpu/drm/xe/Makefile
> +++ b/drivers/gpu/drm/xe/Makefile
> @@ -25,7 +25,8 @@ $(obj)/generated/%_wa_oob.c
> $(obj)/generated/%_wa_oob.h: $(obj)/xe_gen_wa_oob \
>  
>  # core driver code
>  
> -xe-y += xe_bb.o \
> +xe-y += drm_gpusvm.o \
> +	xe_bb.o \
>  	xe_bo.o \
>  	xe_bo_evict.o \
>  	xe_devcoredump.o \
> diff --git a/drivers/gpu/drm/xe/drm_gpusvm.c
> b/drivers/gpu/drm/xe/drm_gpusvm.c
> new file mode 100644
> index 000000000000..fc1e44e6ae72
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/drm_gpusvm.c
> @@ -0,0 +1,2174 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2024 Intel Corporation
> + *
> + * Authors:
> + *     Matthew Brost <matthew.brost@intel.com>
> + */
> +
> +#include <linux/dma-mapping.h>
> +#include <linux/interval_tree_generic.h>
> +#include <linux/hmm.h>
> +#include <linux/memremap.h>
> +#include <linux/migrate.h>
> +#include <linux/mm_types.h>
> +#include <linux/pagemap.h>
> +#include <linux/slab.h>
> +
> +#include <drm/drm_device.h>
> +#include "drm_gpusvm.h"
> +
> +/**
> + * DOC: Overview
> + *
> + * GPU Shared Virtual Memory (GPU SVM) layer for the Direct
> Rendering Manager (DRM)
> + *
> + * The GPU SVM layer is a component of the DRM framework designed to
> manage shared
> + * virtual memory between the CPU and GPU. It enables efficient data
> exchange and
> + * processing for GPU-accelerated applications by allowing memory
> sharing and
> + * synchronization between the CPU's and GPU's virtual address
> spaces.
> + *
> + * Key GPU SVM Components:
> + * - Notifiers: Used for tracking memory intervals and
> notifying the
> + *		GPU of changes, notifiers are sized based on a GPU
> SVM
> + *		initialization parameter, with a recommendation of
> 512M or
> + *		larger. They maintain a Red-Black tree and a list of
> ranges that
> + *		fall within the notifier interval. Notifiers are
> tracked within
> + *		a GPU SVM Red-Black tree and list and are
> dynamically inserted
> + *		or removed as ranges within the interval are created
> or
> + *		destroyed.
> + * - Ranges: Represent memory ranges mapped in a DRM device and
> managed
> + *	     by GPU SVM. They are sized based on an array of chunk
> sizes, which
> + *	     is a GPU SVM initialization parameter, and the CPU
> address space.
> + *	     Upon GPU fault, the largest aligned chunk that fits
> within the
> + *	     faulting CPU address space is chosen for the range
> size. Ranges are
> + *	     expected to be dynamically allocated on GPU fault and
> removed on an
> + *	     MMU notifier UNMAP event. As mentioned above, ranges
> are tracked in
> + *	     a notifier's Red-Black tree.
> + * - Operations: Define the interface for driver-specific SVM
> operations such as
> + *		 allocation, page collection, migration,
> invalidations, and VRAM
> + *		 release.
> + *
> + * This layer provides interfaces for allocating, mapping,
> migrating, and
> + * releasing memory ranges between the CPU and GPU. It handles all
> core memory
> + * management interactions (DMA mapping, HMM, and migration) and
> provides
> + * driver-specific virtual functions (vfuncs). This infrastructure
> is sufficient
> + * to build the expected driver components for an SVM implementation
> as detailed
> + * below.
> + *
> + * Expected Driver Components:
> + * - GPU page fault handler: Used to create ranges and notifiers
> based on the
> + *			     fault address, optionally migrate the
> range to
> + *			     VRAM, and create GPU bindings.
> + * - Garbage collector: Used to destroy GPU bindings for ranges.
> Ranges are
> + *			expected to be added to the garbage
> collector upon
> + *			MMU_NOTIFY_UNMAP event.
> + */
> +
> +/**
> + * DOC: Locking
> + *
> + * GPU SVM handles locking for core MM interactions, i.e., it
> locks/unlocks the
> + * mmap lock as needed. Alternatively, if the driver prefers to
> handle the mmap
> + * lock itself, a 'locked' argument is provided to the functions
> that require
> + * the mmap lock. This option may be useful for drivers that need to
> call into
> + * GPU SVM while also holding a dma-resv lock, thus preventing
> locking
> + * inversions between the mmap and dma-resv locks.
> + *
> + * GPU SVM introduces a global notifier lock, which safeguards the
> notifier's
> + * range RB tree and list, as well as the range's DMA mappings and
> sequence
> + * number. GPU SVM manages all necessary locking and unlocking
> operations,
> + * except for the recheck of the range's sequence number
> + * (mmu_interval_read_retry) when the driver is committing GPU
> bindings. This
> + * lock corresponds to the 'driver->update' lock mentioned in the
> HMM
> + * documentation (TODO: Link). Future revisions may transition from
> a GPU SVM
> + * global lock to a per-notifier lock if finer-grained locking is
> deemed
> + * necessary.
> + *
> + * In addition to the locking mentioned above, the driver should
> implement a
> + * lock to safeguard core GPU SVM function calls that modify state,
> such as
> + * drm_gpusvm_range_find_or_insert and drm_gpusvm_range_remove.
> Alternatively,
> + * these core functions can be called within a single kernel thread,
> for
> + * instance, using an ordered work queue. This lock is denoted as
> + * 'driver_svm_lock' in code examples.
> + */
> +
> +/**
> + * DOC: Migration
> + *
> + * The migration support is quite simple, allowing migration between
> SRAM and
> + * VRAM at the range granularity. In particular, GPU SVM currently
> does not
> + * support mixing SRAM and VRAM pages within a range. This means
> that upon GPU
> + * fault, the entire range can be migrated to VRAM, and upon CPU
> fault, the
> + * entire range is migrated to SRAM.
> + *
> + * The reasoning for only supporting range granularity is as
> follows: it
> + * simplifies the implementation, and range sizes are driver-defined
> and should
> + * be relatively small.
> + */
> +
> +/**
> + * DOC: Partial Unmapping of Ranges
> + *
> + * Partial unmapping of ranges (e.g., 1M out of 2M is unmapped by
> CPU resulting
> + * in MMU_NOTIFY_UNMAP event) presents several challenges, with the
> main one
> + * being that a subset of the range still has CPU and GPU mappings.
> If the
> + * backing store for the range is in VRAM, a subset of the backing
> store has
> + * references. One option would be to split the range and VRAM
> backing store,
> + * but the implementation for this would be quite complicated. Given
> that
> + * partial unmappings are rare and driver-defined range sizes are
> relatively
> + * small, GPU SVM does not support splitting of ranges.
> + *
> + * With no support for range splitting, upon partial unmapping of a
> range, the
> + * driver is expected to invalidate and destroy the entire range. If
> the range
> + * has VRAM as its backing, the driver is also expected to migrate
> any remaining
> + * pages back to SRAM.
> + */
> +
> +/**
> + * DOC: Examples
> + *
> + * This section provides two examples of how to build the expected
> driver
> + * components: the GPU page fault handler and the garbage collector.
> A third
> + * example demonstrates a sample invalidation driver vfunc.
> + *
> + * The generic code provided does not include logic for complex
> migration
> + * policies, optimized invalidations, or other potentially required
> driver
> + * locking (e.g., DMA-resv locks).
> + *
> + * 1) GPU page fault handler
> + *
> + *	int driver_bind_range(struct drm_gpusvm *gpusvm, struct
> drm_gpusvm_range *range)
> + *	{
> + *		int err = 0;
> + *
> + *		driver_alloc_and_setup_memory_for_bind(gpusvm,
> range);
> + *
> + *		drm_gpusvm_notifier_lock(gpusvm);
> + *		if (drm_gpusvm_range_pages_valid(range))
> + *			driver_commit_bind(gpusvm, range);
> + *		else
> + *			err = -EAGAIN;
> + *		drm_gpusvm_notifier_unlock(gpusvm);
> + *
> + *		return err;
> + *	}
> + *
> + *	int driver_gpu_fault(struct drm_gpusvm *gpusvm, u64
> fault_addr,
> + *			     u64 gpuva_start, u64 gpuva_end)
> + *	{
> + *		struct drm_gpusvm_ctx ctx = {};
> + *		struct drm_gpusvm_range *range;
> + *		int err;
> + *
> + *		driver_svm_lock();
> + *	retry:
> + *		// Always process UNMAPs first so view of GPU SVM
> ranges is current
> + *		driver_garbage_collector(gpusvm);
> + *
> + *		range = drm_gpusvm_range_find_or_insert(gpusvm,
> fault_addr,
> + *							gpuva_start,
> gpuva_end,
> + *						        &ctx);
> + *		if (IS_ERR(range)) {
> + *			err = PTR_ERR(range);
> + *			goto unlock;
> + *		}
> + *
> + *		if (driver_migration_policy(range)) {
> + *			bo = driver_alloc_bo();
> + *			err = drm_gpusvm_migrate_to_vram(gpusvm,
> range, bo, &ctx);
> + *			if (err)	// CPU mappings may have
> changed
> + *				goto retry;
> + *		}
> + *
> + *		err = drm_gpusvm_range_get_pages(gpusvm, range,
> &ctx);
> + *		if (err == -EFAULT || err == -EPERM)	// CPU
> mappings changed
> + *			goto retry;
> + *		else if (err)
> + *			goto unlock;
> + *
> + *		err = driver_bind_range(gpusvm, range);
> + *		if (err == -EAGAIN)	// CPU mappings changed
> + *			goto retry;
> + *
> + *	unlock:
> + *		driver_svm_unlock();
> + *		return err;
> + *	}
> + *
> + * 2) Garbage Collector.
> + *
> + *	void __driver_garbage_collector(struct drm_gpusvm *gpusvm,
> + *					struct drm_gpusvm_range
> *range)
> + *	{
> + *		struct drm_gpusvm_ctx ctx = {};
> + *
> + *		assert_driver_svm_locked(gpusvm);
> + *
> + *		// Partial unmap, migrate any remaining VRAM pages
> back to SRAM
> + *		if (range->flags.partial_unmap)
> + *			drm_gpusvm_migrate_to_sram(gpusvm, range,
> &ctx);
> + *
> + *		driver_unbind_range(range);
> + *		drm_gpusvm_range_remove(gpusvm, range);
> + *	}
> + *
> + *	void driver_garbage_collector(struct drm_gpusvm *gpusvm)
> + *	{
> + *		assert_driver_svm_locked(gpusvm);
> + *
> + *		for_each_range_in_garbage_collector(gpusvm, range)
> + *			__driver_garbage_collector(gpusvm, range);
> + *	}
> + *
> + * 3) Invalidation driver vfunc.
> + *
> + *	void driver_invalidation(struct drm_gpusvm *gpusvm,
> + *				 struct drm_gpusvm_notifier
> *notifier,
> + *				 const struct mmu_notifier_range
> *mmu_range)
> + *	{
> + *		struct drm_gpusvm_ctx ctx = { .in_notifier = true,
> };
> + *		struct drm_gpusvm_range *range = NULL;
> + *
> + *		driver_invalidate_device_tlb(gpusvm, mmu_range-
> >start, mmu_range->end);
> + *
> + *		drm_gpusvm_for_each_range(range, notifier,
> mmu_range->start,
> + *					  mmu_range->end) {
> + *			drm_gpusvm_range_unmap_pages(gpusvm, range,
> &ctx);
> + *
> + *			if (mmu_range->event != MMU_NOTIFY_UNMAP)
> + *				continue;
> + *
> + *			drm_gpusvm_range_set_unmapped(range,
> mmu_range);
> + *			driver_garbage_collector_add(gpusvm, range);
> + *		}
> + *	}
> + */
> +
> +#define DRM_GPUSVM_RANGE_START(_range)	((_range)->va.start)
> +#define DRM_GPUSVM_RANGE_END(_range)	((_range)->va.end - 1)
> +INTERVAL_TREE_DEFINE(struct drm_gpusvm_range, rb.node, u64,
> rb.__subtree_last,
> +		     DRM_GPUSVM_RANGE_START, DRM_GPUSVM_RANGE_END,
> +		     static __maybe_unused, range);
> +
> +#define DRM_GPUSVM_NOTIFIER_START(_notifier)	((_notifier)-
> >interval.start)
> +#define DRM_GPUSVM_NOTIFIER_END(_notifier)	((_notifier)-
> >interval.end - 1)
> +INTERVAL_TREE_DEFINE(struct drm_gpusvm_notifier, rb.node, u64,
> +		     rb.__subtree_last, DRM_GPUSVM_NOTIFIER_START,
> +		     DRM_GPUSVM_NOTIFIER_END, static __maybe_unused,
> notifier);
> +
> +/**
> + * npages_in_range() - Calculate the number of pages in a given
> range
> + * @start__: The start address of the range
> + * @end__: The end address of the range
> + *
> + * This macro calculates the number of pages in a given memory
> range,
> + * specified by the start and end addresses. It divides the
> difference
> + * between the end and start addresses by the page size (PAGE_SIZE)
> to
> + * determine the number of pages in the range.
> + *
> + * Return: The number of pages in the specified range.
> + */
> +#define npages_in_range(start__, end__)	\
> +	(((end__) - (start__)) >> PAGE_SHIFT)
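
(As a quick sanity check, assuming 4K pages this evaluates
npages_in_range(0x1000, 0x5000) to 4.)
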
> +
> +/**
> + * struct drm_gpusvm_zdd - GPU SVM zone device data
> + *
> + * @refcount: Reference count for the zdd
> + * @destroy_work: Work structure for asynchronous zdd destruction
> + * @range: Pointer to the GPU SVM range
> + * @vram_allocation: Driver-private pointer to the VRAM allocation
> + *
> + * This structure serves as a generic wrapper installed in
> + * page->zone_device_data. It provides infrastructure for looking up
> a range
> + * upon CPU page fault and asynchronously releasing VRAM once the
> CPU has no
> + * page references. Asynchronous release is useful because CPU page
> references
> + * can be dropped in IRQ contexts, while releasing VRAM likely
> requires sleeping
> + * locks.
> + */
> +struct drm_gpusvm_zdd {
> +	struct kref refcount;
> +	struct work_struct destroy_work;
> +	struct drm_gpusvm_range *range;
> +	void *vram_allocation;
> +};
> +
> +/**
> + * drm_gpusvm_zdd_destroy_work_func - Work function for destroying a
> zdd
> + * @w: Pointer to the work_struct
> + *
> + * This function releases VRAM, puts GPU SVM range, and frees zdd.
> + */
> +static void drm_gpusvm_zdd_destroy_work_func(struct work_struct *w)
> +{
> +	struct drm_gpusvm_zdd *zdd =
> +		container_of(w, struct drm_gpusvm_zdd,
> destroy_work);
> +	struct drm_gpusvm_range *range = zdd->range;
> +	struct drm_gpusvm *gpusvm = range->gpusvm;
> +
> +	if (gpusvm->ops->vram_release && zdd->vram_allocation)
> +		gpusvm->ops->vram_release(zdd->vram_allocation);
> +	drm_gpusvm_range_put(range);
> +	kfree(zdd);
> +}
> +
> +/**
> + * drm_gpusvm_zdd_alloc - Allocate a zdd structure.
> + * @range: Pointer to the GPU SVM range.
> + *
> + * This function allocates and initializes a new zdd structure. It
> sets up the
> + * reference count, initializes the destroy work, and links the
> provided GPU SVM
> + * range.
> + *
> + * Returns:
> + * Pointer to the allocated zdd on success, ERR_PTR() on failure.
> + */
> +static struct drm_gpusvm_zdd *
> +drm_gpusvm_zdd_alloc(struct drm_gpusvm_range *range)
> +{
> +	struct drm_gpusvm_zdd *zdd;
> +
> +	zdd = kmalloc(sizeof(*zdd), GFP_KERNEL);
> +	if (!zdd)
> +		return NULL;
> +
> +	kref_init(&zdd->refcount);
> +	INIT_WORK(&zdd->destroy_work,
> drm_gpusvm_zdd_destroy_work_func);
> +	zdd->range = drm_gpusvm_range_get(range);
> +	zdd->vram_allocation = NULL;
> +
> +	return zdd;
> +}
> +
> +/**
> + * drm_gpusvm_zdd_get - Get a reference to a zdd structure.
> + * @zdd: Pointer to the zdd structure.
> + *
> + * This function increments the reference count of the provided zdd
> structure.
> + *
> + * Returns: Pointer to the zdd structure.
> + */
> +static struct drm_gpusvm_zdd *drm_gpusvm_zdd_get(struct
> drm_gpusvm_zdd *zdd)
> +{
> +	kref_get(&zdd->refcount);
> +	return zdd;
> +}
> +
> +/**
> + * drm_gpusvm_zdd_destroy - Destroy a zdd structure.
> + * @ref: Pointer to the reference count structure.
> + *
> + * This function queues the destroy_work of the zdd for asynchronous
> destruction.
> + */
> +static void drm_gpusvm_zdd_destroy(struct kref *ref)
> +{
> +	struct drm_gpusvm_zdd *zdd =
> +		container_of(ref, struct drm_gpusvm_zdd, refcount);
> +	struct drm_gpusvm *gpusvm = zdd->range->gpusvm;
> +
> +	queue_work(gpusvm->zdd_wq, &zdd->destroy_work);
> +}
> +
> +/**
> + * drm_gpusvm_zdd_put - Put a zdd reference.
> + * @zdd: Pointer to the zdd structure.
> + *
> + * This function decrements the reference count of the provided zdd
> structure
> + * and schedules its destruction if the count drops to zero.
> + */
> +static void drm_gpusvm_zdd_put(struct drm_gpusvm_zdd *zdd)
> +{
> +	kref_put(&zdd->refcount, drm_gpusvm_zdd_destroy);
> +}
> +
> +/**
> + * drm_gpusvm_range_find - Find GPU SVM range from GPU SVM notifier
> + * @notifier: Pointer to the GPU SVM notifier structure.
> + * @start: Start address of the range
> + * @end: End address of the range
> + *
> + * Return: A pointer to the drm_gpusvm_range if found or NULL
> + */
> +struct drm_gpusvm_range *
> +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64
> start, u64 end)
> +{
> +	return range_iter_first(&notifier->root, start, end - 1);
> +}
> +
> +/**
> + * drm_gpusvm_for_each_range_safe - Safely iterate over GPU SVM
> ranges in a notifier
> + * @range__: Iterator variable for the ranges
> + * @next__: Iterator variable for the ranges' temporary storage
> + * @notifier__: Pointer to the GPU SVM notifier
> + * @start__: Start address of the range
> + * @end__: End address of the range
> + *
> + * This macro is used to iterate over GPU SVM ranges in a notifier
> while
> + * removing ranges from it.
> + */
> +#define drm_gpusvm_for_each_range_safe(range__, next__, notifier__,
> start__, end__)	\
> +	for ((range__) = drm_gpusvm_range_find((notifier__),
> (start__), (end__)),	\
> +	     (next__) =
> __drm_gpusvm_range_next(range__);				\
> +	     (range__) && (range__->va.start <
> (end__));				\
> +	     (range__) = (next__), (next__) =
> __drm_gpusvm_range_next(range__))
> +
> +/**
> + * __drm_gpusvm_notifier_next - get the next drm_gpusvm_notifier in
> the list
> + * @notifier: a pointer to the current drm_gpusvm_notifier
> + *
> + * Return: A pointer to the next drm_gpusvm_notifier if available,
> or NULL if
> + *         the current notifier is the last one or if the input
> notifier is
> + *         NULL.
> + */
> +static struct drm_gpusvm_notifier *
> +__drm_gpusvm_notifier_next(struct drm_gpusvm_notifier *notifier)
> +{
> +	if (notifier && !list_is_last(&notifier->rb.entry,
> +				      &notifier->gpusvm-
> >notifier_list))
> +		return list_next_entry(notifier, rb.entry);
> +
> +	return NULL;
> +}
> +
> +/**
> + * drm_gpusvm_for_each_notifier - Iterate over GPU SVM notifiers in
> a gpusvm
> + * @notifier__: Iterator variable for the notifiers
> + * @gpusvm__: Pointer to the GPU SVM structure
> + * @start__: Start address of the notifier
> + * @end__: End address of the notifier
> + *
> + * This macro is used to iterate over GPU SVM notifiers in a gpusvm.
> + */
> +#define drm_gpusvm_for_each_notifier(notifier__, gpusvm__, start__,
> end__)		\
> +	for ((notifier__) = notifier_iter_first(&(gpusvm__)->root,
> (start__), (end__) - 1);	\
> +	     (notifier__) && (notifier__->interval.start <
> (end__));			\
> +	     (notifier__) = __drm_gpusvm_notifier_next(notifier__))
> +
> +/**
> + * drm_gpusvm_for_each_notifier_safe - Safely iterate over GPU SVM
> notifiers in a gpusvm
> + * @notifier__: Iterator variable for the notifiers
> + * @next__: Iterator variable for the notifiers' temporary storage
> + * @gpusvm__: Pointer to the GPU SVM structure
> + * @start__: Start address of the notifier
> + * @end__: End address of the notifier
> + *
> + * This macro is used to iterate over GPU SVM notifiers in a gpusvm
> while
> + * removing notifiers from it.
> + */
> +#define drm_gpusvm_for_each_notifier_safe(notifier__, next__,
> gpusvm__, start__, end__)	\
> +	for ((notifier__) = notifier_iter_first(&(gpusvm__)->root,
> (start__), (end__) - 1),	\
> +	     (next__) =
> __drm_gpusvm_notifier_next(notifier__);				\
> +	     (notifier__) && (notifier__->interval.start <
> (end__));			\
> +	     (notifier__) = (next__), (next__) =
> __drm_gpusvm_notifier_next(notifier__))
> +
> +/**
> + * drm_gpusvm_notifier_invalidate - Invalidate a GPU SVM notifier.
> + * @mni: Pointer to the mmu_interval_notifier structure.
> + * @mmu_range: Pointer to the mmu_notifier_range structure.
> + * @cur_seq: Current sequence number.
> + *
> + * This function serves as a generic MMU notifier for GPU SVM. It
> sets the MMU
> + * notifier sequence number and calls the driver invalidate vfunc
> under
> + * gpusvm->notifier_lock.
> + *
> + * Returns:
> + * true if the operation succeeds, false otherwise.
> + */
> +static bool
> +drm_gpusvm_notifier_invalidate(struct mmu_interval_notifier *mni,
> +			       const struct mmu_notifier_range
> *mmu_range,
> +			       unsigned long cur_seq)
> +{
> +	struct drm_gpusvm_notifier *notifier =
> +		container_of(mni, typeof(*notifier), notifier);
> +	struct drm_gpusvm *gpusvm = notifier->gpusvm;
> +
> +	if (!mmu_notifier_range_blockable(mmu_range))
> +		return false;
> +
> +	down_write(&gpusvm->notifier_lock);
> +	mmu_interval_set_seq(mni, cur_seq);
> +	gpusvm->ops->invalidate(gpusvm, notifier, mmu_range);
> +	up_write(&gpusvm->notifier_lock);
> +
> +	return true;
> +}
> +
> +/**
> + * drm_gpusvm_notifier_ops - MMU interval notifier operations for
> GPU SVM
> + */
> +static const struct mmu_interval_notifier_ops
> drm_gpusvm_notifier_ops = {
> +	.invalidate = drm_gpusvm_notifier_invalidate,
> +};
> +
> +/**
> + * drm_gpusvm_init - Initialize the GPU SVM.
> + * @gpusvm: Pointer to the GPU SVM structure.
> + * @name: Name of the GPU SVM.
> + * @drm: Pointer to the DRM device structure.
> + * @mm: Pointer to the mm_struct for the address space.
> + * @device_private_page_owner: Device private pages owner.
> + * @mm_start: Start address of GPU SVM.
> + * @mm_range: Range of the GPU SVM.
> + * @notifier_size: Size of individual notifiers.
> + * @ops: Pointer to the operations structure for GPU SVM.
> + * @chunk_sizes: Pointer to the array of chunk sizes used in range
> allocation.
> + *               Entries should be powers of 2 in descending order
> with last
> + *               entry being SZ_4K.
> + * @num_chunks: Number of chunks.
> + *
> + * This function initializes the GPU SVM.
> + *
> + * Returns:
> + * 0 on success, a negative error code on failure.
> + */
> +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> +		    const char *name, struct drm_device *drm,
> +		    struct mm_struct *mm, void
> *device_private_page_owner,
> +		    u64 mm_start, u64 mm_range, u64 notifier_size,
> +		    const struct drm_gpusvm_ops *ops,
> +		    const u64 *chunk_sizes, int num_chunks)
> +{
> +	if (!ops->invalidate || !num_chunks)
> +		return -EINVAL;
> +
> +	gpusvm->name = name;
> +	gpusvm->drm = drm;
> +	gpusvm->mm = mm;
> +	gpusvm->device_private_page_owner =
> device_private_page_owner;
> +	gpusvm->mm_start = mm_start;
> +	gpusvm->mm_range = mm_range;
> +	gpusvm->notifier_size = notifier_size;
> +	gpusvm->ops = ops;
> +	gpusvm->chunk_sizes = chunk_sizes;
> +	gpusvm->num_chunks = num_chunks;
> +	gpusvm->zdd_wq = system_wq;
> +
> +	mmgrab(mm);
> +	gpusvm->root = RB_ROOT_CACHED;
> +	INIT_LIST_HEAD(&gpusvm->notifier_list);
> +
> +	init_rwsem(&gpusvm->notifier_lock);
> +
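> +	/*
> +	 * Prime lockdep: notifier_lock can be taken from the reclaim path
> +	 * (MMU notifier invalidation), so flag allocations that might
> +	 * recurse into reclaim while it is held.
> +	 */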
> +	fs_reclaim_acquire(GFP_KERNEL);
> +	might_lock(&gpusvm->notifier_lock);
> +	fs_reclaim_release(GFP_KERNEL);
> +
> +	return 0;
> +}
> +
> +/**
> + * drm_gpusvm_notifier_find - Find GPU SVM notifier
> + * @gpusvm__: Pointer to the GPU SVM structure
> + * @fault_addr__: Fault address
> + *
> + * This macro finds the GPU SVM notifier associated with the fault
> address.
> + *
> + * Returns:
> + * Pointer to the GPU SVM notifier on success, NULL otherwise.
> + */
> +#define drm_gpusvm_notifier_find(gpusvm__, fault_addr__)	\
> +	notifier_iter_first(&(gpusvm__)->root, (fault_addr__),	\
> +			    (fault_addr__ + 1))
> +
> +/**
> + * to_drm_gpusvm_notifier - retrieve the container struct for a
> given rbtree node
> + * @node__: a pointer to the rbtree node embedded within a
> drm_gpusvm_notifier struct
> + *
> + * Return: A pointer to the containing drm_gpusvm_notifier
> structure.
> + */
> +#define to_drm_gpusvm_notifier(node__)				\
> +	container_of((node__), struct drm_gpusvm_notifier, rb.node)
> +
> +/**
> + * drm_gpusvm_notifier_insert - Insert GPU SVM notifier
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: Pointer to the GPU SVM notifier structure
> + *
> + * This function inserts the GPU SVM notifier into the GPU SVM RB
> tree and list.
> + */
> +static void drm_gpusvm_notifier_insert(struct drm_gpusvm *gpusvm,
> +				       struct drm_gpusvm_notifier
> *notifier)
> +{
> +	struct rb_node *node;
> +	struct list_head *head;
> +
> +	notifier_insert(notifier, &gpusvm->root);
> +
> +	node = rb_prev(&notifier->rb.node);
> +	if (node)
> +		head = &(to_drm_gpusvm_notifier(node))->rb.entry;
> +	else
> +		head = &gpusvm->notifier_list;
> +
> +	list_add(&notifier->rb.entry, head);
> +}
> +
> +/**
> + * drm_gpusvm_notifier_remove - Remove GPU SVM notifier
> + * @gpusvm__: Pointer to the GPU SVM structure
> + * @notifier__: Pointer to the GPU SVM notifier structure
> + *
> + * This macro removes the GPU SVM notifier from the GPU SVM RB tree
> and list.
> + */
> +#define drm_gpusvm_notifier_remove(gpusvm__, notifier__)	\
> +	notifier_remove((notifier__), &(gpusvm__)->root);	\
> +	list_del(&(notifier__)->rb.entry)
> +
> +/**
> + * drm_gpusvm_fini - Finalize the GPU SVM.
> + * @gpusvm: Pointer to the GPU SVM structure.
> + *
> + * This function finalizes the GPU SVM by cleaning up any remaining
> ranges and
> + * notifiers, and dropping a reference to struct MM.
> + */
> +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm)
> +{
> +	struct drm_gpusvm_notifier *notifier, *next;
> +
> +	drm_gpusvm_for_each_notifier_safe(notifier, next, gpusvm, 0,
> LONG_MAX) {
> +		struct drm_gpusvm_range *range, *__next;
> +
> +		/*
> +		 * Remove notifier first to avoid racing with any
> invalidation
> +		 */
> +		mmu_interval_notifier_remove(&notifier->notifier);
> +		notifier->flags.removed = true;
> +
> +		drm_gpusvm_for_each_range_safe(range, __next,
> notifier, 0,
> +					       LONG_MAX)
> +			drm_gpusvm_range_remove(gpusvm, range);
> +	}
> +
> +	mmdrop(gpusvm->mm);
> +	WARN_ON(!RB_EMPTY_ROOT(&gpusvm->root.rb_root));
> +}
> +
> +/**
> + * drm_gpusvm_notifier_alloc - Allocate GPU SVM notifier
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @fault_addr: Fault address
> + *
> + * This function allocates and initializes the GPU SVM notifier
> structure.
> + *
> + * Returns:
> + * Pointer to the allocated GPU SVM notifier on success, ERR_PTR()
> on failure.
> + */
> +static struct drm_gpusvm_notifier *
> +drm_gpusvm_notifier_alloc(struct drm_gpusvm *gpusvm, u64 fault_addr)
> +{
> +	struct drm_gpusvm_notifier *notifier;
> +
> +	if (gpusvm->ops->notifier_alloc)
> +		notifier = gpusvm->ops->notifier_alloc();
> +	else
> +		notifier = kzalloc(sizeof(*notifier), GFP_KERNEL);
> +
> +	if (!notifier)
> +		return ERR_PTR(-ENOMEM);
> +
> +	notifier->gpusvm = gpusvm;
> +	notifier->interval.start = ALIGN_DOWN(fault_addr, gpusvm-
> >notifier_size);
> +	notifier->interval.end = ALIGN(fault_addr + 1, gpusvm-
> >notifier_size);
> +	INIT_LIST_HEAD(&notifier->rb.entry);
> +	notifier->root = RB_ROOT_CACHED;
> +	INIT_LIST_HEAD(&notifier->range_list);
> +
> +	return notifier;
> +}
> +
> +/**
> + * drm_gpusvm_notifier_free - Free GPU SVM notifier
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: Pointer to the GPU SVM notifier structure
> + *
> + * This function frees the GPU SVM notifier structure.
> + */
> +static void drm_gpusvm_notifier_free(struct drm_gpusvm *gpusvm,
> +				     struct drm_gpusvm_notifier
> *notifier)
> +{
> +	WARN_ON(!RB_EMPTY_ROOT(&notifier->root.rb_root));
> +
> +	if (gpusvm->ops->notifier_free)
> +		gpusvm->ops->notifier_free(notifier);
> +	else
> +		kfree(notifier);
> +}
> +
> +/**
> + * to_drm_gpusvm_range - retrieve the container struct for a given
> rbtree node
> + * @node__: a pointer to the rbtree node embedded within a
> drm_gpusvm_range struct
> + *
> + * Return: A pointer to the containing drm_gpusvm_range structure.
> + */
> +#define to_drm_gpusvm_range(node__)	\
> +	container_of((node__), struct drm_gpusvm_range, rb.node)
> +
> +/**
> + * drm_gpusvm_range_insert - Insert GPU SVM range
> + * @notifier: Pointer to the GPU SVM notifier structure
> + * @range: Pointer to the GPU SVM range structure
> + *
> + * This function inserts the GPU SVM range into the notifier RB tree
> and list.
> + */
> +static void drm_gpusvm_range_insert(struct drm_gpusvm_notifier
> *notifier,
> +				    struct drm_gpusvm_range *range)
> +{
> +	struct rb_node *node;
> +	struct list_head *head;
> +
> +	drm_gpusvm_notifier_lock(notifier->gpusvm);
> +	range_insert(range, &notifier->root);
> +
> +	node = rb_prev(&range->rb.node);
> +	if (node)
> +		head = &(to_drm_gpusvm_range(node))->rb.entry;
> +	else
> +		head = &notifier->range_list;
> +
> +	list_add(&range->rb.entry, head);
> +	drm_gpusvm_notifier_unlock(notifier->gpusvm);
> +}
> +
> +/**
> + * __drm_gpusvm_range_remove - Remove GPU SVM range
> + * @notifier__: Pointer to the GPU SVM notifier structure
> + * @range__: Pointer to the GPU SVM range structure
> + *
> + * This macro removes the GPU SVM range from the notifier RB tree
> and list.
> + */
> +#define __drm_gpusvm_range_remove(notifier__, range__)		\
> +	range_remove((range__), &(notifier__)->root);		\
> +	list_del(&(range__)->rb.entry)
> +
> +/**
> + * drm_gpusvm_range_alloc - Allocate GPU SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: Pointer to the GPU SVM notifier structure
> + * @fault_addr: Fault address
> + * @chunk_size: Chunk size
> + * @migrate_vram: Flag indicating whether to migrate VRAM
> + *
> + * This function allocates and initializes the GPU SVM range
> structure.
> + *
> + * Returns:
> + * Pointer to the allocated GPU SVM range on success, ERR_PTR() on
> failure.
> + */
> +static struct drm_gpusvm_range *
> +drm_gpusvm_range_alloc(struct drm_gpusvm *gpusvm,
> +		       struct drm_gpusvm_notifier *notifier,
> +		       u64 fault_addr, u64 chunk_size, bool
> migrate_vram)
> +{
> +	struct drm_gpusvm_range *range;
> +
> +	if (gpusvm->ops->range_alloc)
> +		range = gpusvm->ops->range_alloc(gpusvm);
> +	else
> +		range = kzalloc(sizeof(*range), GFP_KERNEL);
> +
> +	if (!range)
> +		return ERR_PTR(-ENOMEM);
> +
> +	kref_init(&range->refcount);
> +	range->gpusvm = gpusvm;
> +	range->notifier = notifier;
> +	range->va.start = ALIGN_DOWN(fault_addr, chunk_size);
> +	range->va.end = ALIGN(fault_addr + 1, chunk_size);
> +	INIT_LIST_HEAD(&range->rb.entry);
> +	range->notifier_seq = LONG_MAX;
> +	range->flags.migrate_vram = migrate_vram ? 1 : 0;
> +
> +	return range;
> +}
> +
> +/**
> + * drm_gpusvm_check_pages - Check pages
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: Pointer to the GPU SVM notifier structure
> + * @start: Start address
> + * @end: End address
> + *
> + * Check if pages between start and end have been faulted in on the
> CPU. Use to
> + * prevent migration of pages without CPU backing store.
> + *
> + * Returns:
> + * True if pages have been faulted into CPU, False otherwise
> + */
> +static bool drm_gpusvm_check_pages(struct drm_gpusvm *gpusvm,
> +				   struct drm_gpusvm_notifier
> *notifier,
> +				   u64 start, u64 end)
> +{
> +	struct hmm_range hmm_range = {
> +		.default_flags = 0,
> +		.notifier = &notifier->notifier,
> +		.start = start,
> +		.end = end,
> +		.dev_private_owner = gpusvm-
> >device_private_page_owner,
> +	};
> +	unsigned long timeout =
> +		jiffies +
> msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> +	unsigned long *pfns;
> +	unsigned long npages = npages_in_range(start, end);
> +	int err, i;
> +
> +	mmap_assert_locked(gpusvm->mm);
> +
> +	pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
> +	if (!pfns)
> +		return false;
> +
> +	hmm_range.notifier_seq = mmu_interval_read_begin(&notifier-
> >notifier);
> +	hmm_range.hmm_pfns = pfns;
> +
> +	while (true) {
> +		err = hmm_range_fault(&hmm_range);
> +		if (err == -EBUSY) {
> +			if (time_after(jiffies, timeout))
> +				break;
> +
> +			hmm_range.notifier_seq =
> mmu_interval_read_begin(&notifier->notifier);
> +			continue;
> +		}
> +		break;
> +	}
> +	if (err)
> +		goto err_free;
> +
> +	for (i = 0; i < npages; ++i) {
> +		if (!(pfns[i] & HMM_PFN_VALID)) {
> +			err = -EFAULT;
> +			goto err_free;
> +		}
> +	}
> +
> +err_free:
> +	kvfree(pfns);
> +	return err ? false : true;
> +}
> +
> +/**
> + * drm_gpusvm_range_chunk_size - Determine chunk size for GPU SVM
> range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: Pointer to the GPU SVM notifier structure
> + * @vas: Pointer to the virtual memory area structure
> + * @fault_addr: Fault address
> + * @gpuva_start: Start address of GPUVA which mirrors CPU
> + * @gpuva_end: End address of GPUVA which mirrors CPU
> + * @check_pages: Flag indicating whether to check pages
> + *
> + * This function determines the chunk size for the GPU SVM range
> based on the
> + * fault address, GPU SVM chunk sizes, existing GPU SVM ranges, and
> the virtual
> + * memory area boundaries.
> + *
> + * Returns:
> + * Chunk size on success, LONG_MAX on failure.
> + */
> +static u64 drm_gpusvm_range_chunk_size(struct drm_gpusvm *gpusvm,
> +				       struct drm_gpusvm_notifier
> *notifier,
> +				       struct vm_area_struct *vas,
> +				       u64 fault_addr, u64
> gpuva_start,
> +				       u64 gpuva_end, bool
> check_pages)
> +{
> +	u64 start, end;
> +	int i = 0;
> +
> +retry:
> +	for (; i < gpusvm->num_chunks; ++i) {
> +		start = ALIGN_DOWN(fault_addr, gpusvm-
> >chunk_sizes[i]);
> +		end = ALIGN(fault_addr + 1, gpusvm->chunk_sizes[i]);
> +
> +		if (start >= vas->vm_start && end <= vas->vm_end &&
> +		    start >= notifier->interval.start &&
> +		    end <= notifier->interval.end &&
> +		    start >= gpuva_start && end <= gpuva_end)
> +			break;
> +	}
> +
> +	if (i == gpusvm->num_chunks)
> +		return LONG_MAX;
> +
> +	/*
> +	 * If allocating more than a page, ensure the range does not
> +	 * overlap with existing ranges.
> +	 */
> +	if (end - start != SZ_4K) {
> +		struct drm_gpusvm_range *range;
> +
> +		range = drm_gpusvm_range_find(notifier, start, end);
> +		if (range) {
> +			++i;
> +			goto retry;
> +		}
> +
> +		/*
> +		 * XXX: Only create range on pages CPU has faulted
> in. Without
> +		 * this check, or prefault, on BMG
> 'xe_exec_system_allocator --r
> +		 * process-many-malloc' fails. In the failure case,
> each process
> +		 * mallocs 16k but the CPU VMA is ~128k which
> results in 64k SVM
> +		 * ranges. When migrating the SVM ranges, some
> processes fail in
> +		 * drm_gpusvm_migrate_to_vram with 'migrate.cpages
> != npages'
> +		 * and then upon drm_gpusvm_range_get_pages device
> pages from
> +		 * other processes are collected + faulted in which
> creates all
> +		 * sorts of problems. Unsure exactly how this is
> +		 * happening; the problem also goes away if
> +		 * 'xe_exec_system_allocator --r process-many-malloc'
> +		 * mallocs at least 64k at a time.
> +		 */
> +		if (check_pages &&
> +		    !drm_gpusvm_check_pages(gpusvm, notifier, start,
> end)) {
> +			++i;
> +			goto retry;
> +		}
> +	}
> +
> +	return end - start;
> +}
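
To make the selection concrete: assuming chunk sizes of {SZ_2M, SZ_64K,
SZ_4K} and a fault at 0x7f0000410000 inside a 256K anonymous VMA starting
at 0x7f0000400000 (illustrative numbers only), the 2M candidate
[0x7f0000400000, 0x7f0000600000) overruns the VMA and is rejected, the 64K
candidate [0x7f0000410000, 0x7f0000420000) fits (provided the notifier and
GPUVA bounds also cover it), and only an overlap with an existing range or
missing CPU backing would push the loop further down to a single 4K page.
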
> +
> +/**
> + * drm_gpusvm_range_find_or_insert - Find or insert GPU SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @fault_addr: Fault address
> + * @gpuva_start: Start address of GPUVA which mirrors CPU
> + * @gpuva_end: End address of GPUVA which mirrors CPU
> + * @ctx: GPU SVM context
> + *
> + * This function finds or inserts a newly allocated GPU SVM range
> based on the
> + * fault address. Caller must hold a lock to protect range lookup
> and insertion.
> + *
> + * Returns:
> + * Pointer to the GPU SVM range on success, ERR_PTR() on failure.
> + */
> +struct drm_gpusvm_range *
> +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64
> fault_addr,
> +				u64 gpuva_start, u64 gpuva_end,
> +				const struct drm_gpusvm_ctx *ctx)
> +{
> +	struct drm_gpusvm_notifier *notifier;
> +	struct drm_gpusvm_range *range;
> +	struct mm_struct *mm = gpusvm->mm;
> +	struct vm_area_struct *vas;
> +	bool notifier_alloc = false;
> +	u64 chunk_size;
> +	int err;
> +	bool migrate_vram;
> +
> +	if (fault_addr < gpusvm->mm_start ||
> +	    fault_addr > gpusvm->mm_start + gpusvm->mm_range) {
> +		err = -EINVAL;
> +		goto err_out;
> +	}
> +
> +	if (!ctx->mmap_locked) {
> +		if (!mmget_not_zero(mm)) {
> +			err = -EFAULT;
> +			goto err_out;
> +		}
> +		mmap_write_lock(mm);
> +	}
> +
> +	mmap_assert_write_locked(mm);
> +
> +	notifier = drm_gpusvm_notifier_find(gpusvm, fault_addr);
> +	if (!notifier) {
> +		notifier = drm_gpusvm_notifier_alloc(gpusvm,
> fault_addr);
> +		if (IS_ERR(notifier)) {
> +			err = PTR_ERR(notifier);
> +			goto err_mmunlock;
> +		}
> +		notifier_alloc = true;
> +		err = mmu_interval_notifier_insert_locked(&notifier-
> >notifier,
> +							  mm,
> notifier->interval.start,
> +							  notifier-
> >interval.end -
> +							  notifier-
> >interval.start,
> +							 
> &drm_gpusvm_notifier_ops);
> +		if (err)
> +			goto err_notifier;
> +	}
> +
> +	vas = vma_lookup(mm, fault_addr);
> +	if (!vas) {
> +		err = -ENOENT;
> +		goto err_notifier_remove;
> +	}
> +
> +	if (!ctx->read_only && !(vas->vm_flags & VM_WRITE)) {
> +		err = -EPERM;
> +		goto err_notifier_remove;
> +	}
> +
> +	range = drm_gpusvm_range_find(notifier, fault_addr,
> fault_addr + 1);
> +	if (range)
> +		goto out_mmunlock;
> +	/*
> +	 * XXX: Short-circuiting migration based on migrate_vma_*
> current
> +	 * limitations. If/when migrate_vma_* add more support, this
> logic will
> +	 * have to change.
> +	 */
> +	migrate_vram = ctx->vram_possible &&
> +		vma_is_anonymous(vas) && !is_vm_hugetlb_page(vas);
> +
> +	chunk_size = drm_gpusvm_range_chunk_size(gpusvm, notifier,
> vas,
> +						 fault_addr,
> gpuva_start,
> +						 gpuva_end,
> migrate_vram &&
> +						 !ctx->prefault);
> +	if (chunk_size == LONG_MAX) {
> +		err = -EINVAL;
> +		goto err_notifier_remove;
> +	}
> +
> +	range = drm_gpusvm_range_alloc(gpusvm, notifier, fault_addr,
> chunk_size,
> +				       migrate_vram);
> +	if (IS_ERR(range)) {
> +		err = PTR_ERR(range);
> +		goto err_notifier_remove;
> +	}
> +
> +	drm_gpusvm_range_insert(notifier, range);
> +	if (notifier_alloc)
> +		drm_gpusvm_notifier_insert(gpusvm, notifier);
> +
> +	if (ctx->prefault) {
> +		struct drm_gpusvm_ctx __ctx = *ctx;
> +
> +		__ctx.mmap_locked = true;
> +		err = drm_gpusvm_range_get_pages(gpusvm, range,
> &__ctx);
> +		if (err)
> +			goto err_range_remove;
> +	}
> +
> +out_mmunlock:
> +	if (!ctx->mmap_locked) {
> +		mmap_write_unlock(mm);
> +		mmput(mm);
> +	}
> +
> +	return range;
> +
> +err_range_remove:
> +	__drm_gpusvm_range_remove(notifier, range);
> +err_notifier_remove:
> +	if (notifier_alloc)
> +		mmu_interval_notifier_remove(&notifier->notifier);
> +err_notifier:
> +	if (notifier_alloc)
> +		drm_gpusvm_notifier_free(gpusvm, notifier);
> +err_mmunlock:
> +	if (!ctx->mmap_locked) {
> +		mmap_write_unlock(mm);
> +		mmput(mm);
> +	}
> +err_out:
> +	return ERR_PTR(err);
> +}
> +
> +/**
> + * for_each_dma_page - iterate over pages in a DMA region
> + * @i__: the current page index in the iteration
> + * @j__: the current page index, log order, in the iteration
> + * @npages__: the total number of pages in the DMA region
> + * @order__: the order of the pages in the DMA region
> + *
> + * This macro iterates over each page in a DMA region. The DMA
> region
> + * is assumed to be composed of 2^@order__ pages, and the macro will
> + * step through the region one block of 2^@order__ pages at a time.
> + */
> +#define for_each_dma_page(i__, j__, npages__, order__)	\
> +	for ((i__) = 0, (j__) = 0; (i__) < (npages__);	\
> +	     (j__)++, (i__) += 0x1 << (order__))
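
(For example, with npages__ = 32 and order__ = 3 the loop body runs four
times with (i__, j__) = (0,0), (8,1), (16,2), (24,3); i__ walks the 4K page
index while j__ indexes the 2^order__-page blocks that were DMA-mapped.)
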
> +
> +/**
> + * __drm_gpusvm_range_unmap_pages - Unmap pages associated with a
> GPU SVM range (internal)
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + *
> + * This function unmaps pages associated with a GPU SVM range.
> Assumes and
> + * asserts correct locking is in place when called.
> + */
> +static void __drm_gpusvm_range_unmap_pages(struct drm_gpusvm
> *gpusvm,
> +					   struct drm_gpusvm_range
> *range)
> +{
> +	lockdep_assert_held(&gpusvm->notifier_lock);
> +
> +	if (range->pages) {
> +		unsigned long i, j, npages = npages_in_range(range-
> >va.start,
> +							     range-
> >va.end);
> +
> +		if (range->flags.has_dma_mapping) {
> +			for_each_dma_page(i, j, npages, range-
> >order)
> +				dma_unmap_page(gpusvm->drm->dev,
> +					       range->dma_addr[j],
> +					       PAGE_SIZE << range-
> >order,
> +					       DMA_BIDIRECTIONAL);
> +		}
> +
> +		range->flags.has_vram_pages = false;
> +		range->flags.has_dma_mapping = false;
> +	}
> +}
> +
> +/**
> + * drm_gpusvm_range_free_pages - Free pages associated with a GPU
> SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + *
> + * This function frees pages associated with a GPU SVM range.
> + */
> +static void drm_gpusvm_range_free_pages(struct drm_gpusvm *gpusvm,
> +					struct drm_gpusvm_range
> *range)
> +{
> +	lockdep_assert_held(&gpusvm->notifier_lock);
> +
> +	if (range->pages) {
> +		if (range->flags.kfree_mapping) {
> +			kfree(range->dma_addr);
> +			range->flags.kfree_mapping = false;
> +			range->pages = NULL;
> +		} else {
> +			kvfree(range->pages);
> +			range->pages = NULL;
> +		}
> +	}
> +}
> +
> +/**
> + * drm_gpusvm_range_remove - Remove GPU SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range to be removed
> + *
> + * This function removes the specified GPU SVM range and also
> removes the parent
> + * GPU SVM notifier if no more ranges remain in the notifier. The
> caller must
> + * hold a lock to protect range and notifier removal.
> + */
> +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> +			     struct drm_gpusvm_range *range)
> +{
> +	struct drm_gpusvm_notifier *notifier;
> +
> +	notifier = drm_gpusvm_notifier_find(gpusvm, range-
> >va.start);
> +	if (WARN_ON_ONCE(!notifier))
> +		return;
> +
> +	drm_gpusvm_notifier_lock(gpusvm);
> +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> +	drm_gpusvm_range_free_pages(gpusvm, range);
> +	__drm_gpusvm_range_remove(notifier, range);
> +	drm_gpusvm_notifier_unlock(gpusvm);
> +
> +	drm_gpusvm_range_put(range);
> +
> +	if (RB_EMPTY_ROOT(&notifier->root.rb_root)) {
> +		if (!notifier->flags.removed)
> +			mmu_interval_notifier_remove(&notifier-
> >notifier);
> +		drm_gpusvm_notifier_remove(gpusvm, notifier);
> +		drm_gpusvm_notifier_free(gpusvm, notifier);
> +	}
> +}
> +
> +/**
> + * drm_gpusvm_range_get - Get a reference to GPU SVM range
> + * @range: Pointer to the GPU SVM range
> + *
> + * This function increments the reference count of the specified GPU
> SVM range.
> + *
> + * Returns:
> + * Pointer to the GPU SVM range.
> + */
> +struct drm_gpusvm_range *
> +drm_gpusvm_range_get(struct drm_gpusvm_range *range)
> +{
> +	kref_get(&range->refcount);
> +
> +	return range;
> +}
> +
> +/**
> + * drm_gpusvm_range_destroy - Destroy GPU SVM range
> + * @refcount: Pointer to the reference counter embedded in the GPU
> SVM range
> + *
> + * This function destroys the specified GPU SVM range when its
> reference count
> + * reaches zero. If a custom range-free function is provided, it is
> invoked to
> + * free the range; otherwise, the range is deallocated using
> kfree().
> + */
> +static void drm_gpusvm_range_destroy(struct kref *refcount)
> +{
> +	struct drm_gpusvm_range *range =
> +		container_of(refcount, struct drm_gpusvm_range,
> refcount);
> +	struct drm_gpusvm *gpusvm = range->gpusvm;
> +
> +	if (gpusvm->ops->range_free)
> +		gpusvm->ops->range_free(range);
> +	else
> +		kfree(range);
> +}
> +
> +/**
> + * drm_gpusvm_range_put - Put a reference to GPU SVM range
> + * @range: Pointer to the GPU SVM range
> + *
> + * This function decrements the reference count of the specified GPU
> SVM range
> + * and frees it when the count reaches zero.
> + */
> +void drm_gpusvm_range_put(struct drm_gpusvm_range *range)
> +{
> +	kref_put(&range->refcount, drm_gpusvm_range_destroy);
> +}
> +
> +/**
> + * drm_gpusvm_range_pages_valid - GPU SVM range pages valid
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + *
> + * This function determines if a GPU SVM range's pages are valid.
> + * Expected to be called holding gpusvm->notifier_lock and as the last
> + * step before committing a GPU binding.
> + *
> + * Returns:
> + * True if GPU SVM range has valid pages, False otherwise
> + */
> +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> +				  struct drm_gpusvm_range *range)
> +{
> +	lockdep_assert_held(&gpusvm->notifier_lock);
> +
> +	return range->flags.has_vram_pages || range-
> >flags.has_dma_mapping;
> +}
> +
> +/**
> + * drm_gpusvm_range_pages_valid_unlocked - GPU SVM range pages valid
> unlocked
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + *
> + * This function determines if a GPU SVM range's pages are valid.
> + * Expected to be called without holding gpusvm->notifier_lock.
> + *
> + * Returns:
> + * True if GPU SVM range has valid pages, False otherwise
> + */
> +static bool
> +drm_gpusvm_range_pages_valid_unlocked(struct drm_gpusvm *gpusvm,
> +				      struct drm_gpusvm_range
> *range)
> +{
> +	bool pages_valid;
> +
> +	if (!range->pages)
> +		return false;
> +
> +	drm_gpusvm_notifier_lock(gpusvm);
> +	pages_valid = drm_gpusvm_range_pages_valid(gpusvm, range);
> +	if (!pages_valid && range->flags.kfree_mapping) {
> +		kfree(range->dma_addr);
> +		range->flags.kfree_mapping = false;
> +		range->pages = NULL;
> +	}
> +	drm_gpusvm_notifier_unlock(gpusvm);
> +
> +	return pages_valid;
> +}
> +
> +/**
> + * drm_gpusvm_range_get_pages - Get pages for a GPU SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + * @ctx: GPU SVM context
> + *
> + * This function gets pages for a GPU SVM range and ensures they are
> + * mapped for DMA access.
> + *
> + * Returns:
> + * 0 on success, negative error code on failure.
> + */
> +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> +			       struct drm_gpusvm_range *range,
> +			       const struct drm_gpusvm_ctx *ctx)
> +{
> +	struct mmu_interval_notifier *notifier = &range->notifier->notifier;
> +	struct hmm_range hmm_range = {
> +		.default_flags = HMM_PFN_REQ_FAULT |
> +				 (ctx->read_only ? 0 : HMM_PFN_REQ_WRITE),
> +		.notifier = notifier,
> +		.start = range->va.start,
> +		.end = range->va.end,
> +		.dev_private_owner = gpusvm->device_private_page_owner,
> +	};
> +	struct mm_struct *mm = gpusvm->mm;
> +	unsigned long timeout =
> +		jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> +	unsigned long i, j;
> +	unsigned long npages = npages_in_range(range->va.start,
> +					       range->va.end);
> +	unsigned int order = 0;
> +	unsigned long *pfns;
> +	struct page **pages;
> +	int err = 0;
> +	bool vram_pages = !!range->flags.migrate_vram;
> +	bool alloc_pfns = false, kfree_mapping;
> +
> +retry:
> +	kfree_mapping = false;
> +	hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
> +	if (drm_gpusvm_range_pages_valid_unlocked(gpusvm, range))
> +		return 0;
> +
> +	if (range->notifier_seq == hmm_range.notifier_seq && range->pages) {
> +		if (ctx->prefault)
> +			return 0;
> +
> +		pfns = (unsigned long *)range->pages;
> +		pages = range->pages;
> +		goto map_pages;
> +	}
> +
> +	if (!range->pages) {
> +		pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
> +		if (!pfns)
> +			return -ENOMEM;
> +		alloc_pfns = true;
> +	} else {
> +		pfns = (unsigned long *)range->pages;
> +	}
> +
> +	if (!ctx->mmap_locked) {
> +		if (!mmget_not_zero(mm)) {
> +			err = -EFAULT;
> +			goto err_out;
> +		}
> +	}
> +
> +	hmm_range.hmm_pfns = pfns;
> +	while (true) {
> +		/* Must be checked after mmu_interval_read_begin */
> +		if (range->flags.unmapped) {
> +			err = -EFAULT;
> +			break;
> +		}
> +
> +		if (!ctx->mmap_locked) {
> +			/*
> +			 * XXX: HMM locking document indicates only a
> +			 * read-lock is required but there appears to be
> +			 * a window between the MMU_NOTIFY_MIGRATE event
> +			 * triggered in a CPU fault via migrate_vma_setup
> +			 * and the pages actually moving in
> +			 * migrate_vma_finalize in which this code can
> +			 * grab garbage pages. Grabbing the write-lock if
> +			 * the range is attached to vram appears to
> +			 * protect against this race.
> +			 */
> +			if (vram_pages)
> +				mmap_write_lock(mm);
> +			else
> +				mmap_read_lock(mm);
> +		}
> +		err = hmm_range_fault(&hmm_range);
> +		if (!ctx->mmap_locked) {
> +			if (vram_pages)
> +				mmap_write_unlock(mm);
> +			else
> +				mmap_read_unlock(mm);
> +		}
> +
> +		if (err == -EBUSY) {
> +			if (time_after(jiffies, timeout))
> +				break;
> +
> +			hmm_range.notifier_seq =
> +				mmu_interval_read_begin(notifier);
> +			continue;
> +		}
> +		break;
> +	}
> +	if (!ctx->mmap_locked)
> +		mmput(mm);
> +	if (err)
> +		goto err_free;
> +
> +	pages = (struct page **)pfns;
> +
> +	if (ctx->prefault) {
> +		range->pages = pages;
> +		goto set_seqno;
> +	}
> +
> +map_pages:
> +	if (is_device_private_page(hmm_pfn_to_page(pfns[0]))) {
> +		WARN_ON_ONCE(!range->vram_allocation);
> +
> +		for (i = 0; i < npages; ++i) {
> +			pages[i] = hmm_pfn_to_page(pfns[i]);
> +
> +			if (WARN_ON_ONCE(!is_device_private_page(pages[i]))) {
> +				err = -EOPNOTSUPP;
> +				goto err_free;
> +			}
> +		}
> +
> +		/* Do not race with notifier unmapping pages */
> +		drm_gpusvm_notifier_lock(gpusvm);
> +		range->flags.has_vram_pages = true;
> +		range->pages = pages;
> +		if (mmu_interval_read_retry(notifier,
> +					    hmm_range.notifier_seq)) {
> +			err = -EAGAIN;
> +			__drm_gpusvm_range_unmap_pages(gpusvm, range);
> +		}
> +		drm_gpusvm_notifier_unlock(gpusvm);
> +	} else {
> +		dma_addr_t *dma_addr = (dma_addr_t *)pfns;
> +
> +		for_each_dma_page(i, j, npages, order) {
> +			if (WARN_ON_ONCE(i && order !=
> +					 hmm_pfn_to_map_order(pfns[i]))) {
> +				err = -EOPNOTSUPP;
> +				npages = i;
> +				goto err_unmap;
> +			}
> +			order = hmm_pfn_to_map_order(pfns[i]);
> +
> +			pages[j] = hmm_pfn_to_page(pfns[i]);
> +			if (WARN_ON_ONCE(is_zone_device_page(pages[j]))) {
> +				err = -EOPNOTSUPP;
> +				npages = i;
> +				goto err_unmap;
> +			}
> +
> +			set_page_dirty_lock(pages[j]);
> +			mark_page_accessed(pages[j]);
> +
> +			dma_addr[j] = dma_map_page(gpusvm->drm->dev,
> +						   pages[j], 0,
> +						   PAGE_SIZE << order,
> +						   DMA_BIDIRECTIONAL);
> +			if (dma_mapping_error(gpusvm->drm->dev,
> +					      dma_addr[j])) {
> +				err = -EFAULT;
> +				npages = i;
> +				goto err_unmap;
> +			}
> +		}
> +
> +		/* Huge pages, reduce memory footprint */
> +		if (order) {
> +			dma_addr = kmalloc_array(j, sizeof(*dma_addr),
> +						 GFP_KERNEL);
> +			if (dma_addr) {
> +				for (i = 0; i < j; ++i)
> +					dma_addr[i] = (dma_addr_t)pfns[i];
> +				kvfree(pfns);
> +				kfree_mapping = true;
> +			} else {
> +				dma_addr = (dma_addr_t *)pfns;
> +			}
> +		}
> +
> +		/* Do not race with notifier unmapping pages */
> +		drm_gpusvm_notifier_lock(gpusvm);
> +		range->order = order;
> +		range->flags.kfree_mapping = kfree_mapping;
> +		range->flags.has_dma_mapping = true;
> +		range->dma_addr = dma_addr;
> +		range->vram_allocation = NULL;
> +		if (mmu_interval_read_retry(notifier,
> +					    hmm_range.notifier_seq)) {
> +			err = -EAGAIN;
> +			__drm_gpusvm_range_unmap_pages(gpusvm, range);
> +		}
> +		drm_gpusvm_notifier_unlock(gpusvm);
> +	}
> +
> +	if (err == -EAGAIN)
> +		goto retry;
> +set_seqno:
> +	range->notifier_seq = hmm_range.notifier_seq;
> +
> +	return 0;
> +
> +err_unmap:
> +	for_each_dma_page(i, j, npages, order)
> +		dma_unmap_page(gpusvm->drm->dev,
> +			       (dma_addr_t)pfns[j],
> +			       PAGE_SIZE << order, DMA_BIDIRECTIONAL);
> +err_free:
> +	if (alloc_pfns)
> +		kvfree(pfns);
> +err_out:
> +	return err;
> +}
> +
> +/**
> + * drm_gpusvm_range_unmap_pages - Unmap pages associated with a GPU
> SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + * @ctx: GPU SVM context
> + *
> + * This function unmaps pages associated with a GPU SVM range. If
> + * @in_notifier is set, it is assumed that gpusvm->notifier_lock is held
> + * in write mode; if it is clear, it acquires gpusvm->notifier_lock in
> + * read mode. Must be called on each GPU SVM range attached to notifier
> + * in gpusvm->ops->invalidate for IOMMU security model.
> + */
> +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> +				  struct drm_gpusvm_range *range,
> +				  const struct drm_gpusvm_ctx *ctx)
> +{
> +	if (ctx->in_notifier)
> +		lockdep_assert_held_write(&gpusvm->notifier_lock);
> +	else
> +		drm_gpusvm_notifier_lock(gpusvm);
> +
> +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> +
> +	if (!ctx->in_notifier)
> +		drm_gpusvm_notifier_unlock(gpusvm);
> +}
> +
> +/**
> + * drm_gpusvm_migration_put_page - Put a migration page
> + * @page: Pointer to the page to put
> + *
> + * This function unlocks and puts a page.
> + */
> +static void drm_gpusvm_migration_put_page(struct page *page)
> +{
> +	unlock_page(page);
> +	put_page(page);
> +}
> +
> +/**
> + * drm_gpusvm_migration_put_pages - Put migration pages
> + * @npages: Number of pages
> + * @migrate_pfn: Array of migrate page frame numbers
> + *
> + * This function puts an array of pages.
> + */
> +static void drm_gpusvm_migration_put_pages(unsigned long npages,
> +					   unsigned long *migrate_pfn)
> +{
> +	unsigned long i;
> +
> +	for (i = 0; i < npages; ++i) {
> +		if (!migrate_pfn[i])
> +			continue;
> +
> +		drm_gpusvm_migration_put_page(migrate_pfn_to_page(migrate_pfn[i]));
> +		migrate_pfn[i] = 0;
> +	}
> +}
> +
> +/**
> + * drm_gpusvm_get_vram_page - Get a reference to a VRAM page
> + * @page: Pointer to the page
> + * @zdd: Pointer to the GPU SVM zone device data
> + *
> + * This function associates the given page with the specified GPU SVM
> + * zone device data and initializes it for zone device usage.
> + */
> +static void drm_gpusvm_get_vram_page(struct page *page,
> +				     struct drm_gpusvm_zdd *zdd)
> +{
> +	page->zone_device_data = drm_gpusvm_zdd_get(zdd);
> +	zone_device_page_init(page);
> +}
> +
> +/**
> + * drm_gpusvm_migrate_map_pages() - Map migration pages for GPU SVM
> + * migration
> + * @dev: The device for which the pages are being mapped
> + * @dma_addr: Array to store DMA addresses corresponding to mapped pages
> + * @migrate_pfn: Array of migrate page frame numbers to map
> + * @npages: Number of pages to map
> + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> + *
> + * This function maps pages of memory for migration usage in GPU SVM. It
> + * iterates over each page frame number provided in @migrate_pfn, maps
> + * the corresponding page, and stores the DMA address in the provided
> + * @dma_addr array.
> + *
> + * Return: 0 on success, -EFAULT if an error occurs during mapping.
> + */
> +static int drm_gpusvm_migrate_map_pages(struct device *dev,
> +					dma_addr_t *dma_addr,
> +					unsigned long *migrate_pfn,
> +					unsigned long npages,
> +					enum dma_data_direction dir)
> +{
> +	unsigned long i;
> +
> +	for (i = 0; i < npages; ++i) {
> +		struct page *page = migrate_pfn_to_page(migrate_pfn[i]);
> +
> +		if (!page)
> +			continue;
> +
> +		if (WARN_ON_ONCE(is_zone_device_page(page)))
> +			return -EFAULT;
> +
> +		dma_addr[i] = dma_map_page(dev, page, 0, PAGE_SIZE, dir);
> +		if (dma_mapping_error(dev, dma_addr[i]))
> +			return -EFAULT;
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * drm_gpusvm_migrate_unmap_pages() - Unmap pages previously mapped for
> + * GPU SVM migration
> + * @dev: The device for which the pages were mapped
> + * @dma_addr: Array of DMA addresses corresponding to mapped pages
> + * @npages: Number of pages to unmap
> + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> + *
> + * This function unmaps previously mapped pages of memory for GPU Shared
> + * Virtual Memory (SVM). It iterates over each DMA address provided in
> + * @dma_addr, checks if it's valid and not already unmapped, and unmaps
> + * the corresponding page.
> + */
> +static void drm_gpusvm_migrate_unmap_pages(struct device *dev,
> +					   dma_addr_t *dma_addr,
> +					   unsigned long npages,
> +					   enum dma_data_direction dir)
> +{
> +	unsigned long i;
> +
> +	for (i = 0; i < npages; ++i) {
> +		if (!dma_addr[i] || dma_mapping_error(dev, dma_addr[i]))
> +			continue;
> +
> +		dma_unmap_page(dev, dma_addr[i], PAGE_SIZE, dir);
> +	}
> +}
> +
> +/**
> + * drm_gpusvm_migrate_to_vram - Migrate GPU SVM range to VRAM
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + * @vram_allocation: Driver-private pointer to the VRAM allocation. The
> + *                   caller should hold a reference to the VRAM
> + *                   allocation, which should be dropped via
> + *                   ops->vram_release or upon the failure of this
> + *                   function.
> + * @ctx: GPU SVM context
> + *
> + * This function migrates the specified GPU SVM range to VRAM. It
> + * performs the necessary setup and invokes the driver-specific
> + * operations for migration to VRAM. Upon successful return,
> + * @vram_allocation can safely reference @range until ops->vram_release
> + * is called, which only happens after a successful return.
> + *
> + * Returns:
> + * 0 on success, negative error code on failure.
> + */
> +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> +			       struct drm_gpusvm_range *range,
> +			       void *vram_allocation,
> +			       const struct drm_gpusvm_ctx *ctx)
> +{
> +	u64 start = range->va.start, end = range->va.end;
> +	struct migrate_vma migrate = {
> +		.start		= start,
> +		.end		= end,
> +		.pgmap_owner	= gpusvm->device_private_page_owner,
> +		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
> +	};
> +	struct mm_struct *mm = gpusvm->mm;
> +	unsigned long i, npages = npages_in_range(start, end);
> +	struct vm_area_struct *vas;
> +	struct drm_gpusvm_zdd *zdd = NULL;
> +	struct page **pages;
> +	dma_addr_t *dma_addr;
> +	void *buf;
> +	int err;
> +
> +	if (!range->flags.migrate_vram)
> +		return -EINVAL;
> +
> +	if (!gpusvm->ops->populate_vram_pfn || !gpusvm->ops->copy_to_vram ||
> +	    !gpusvm->ops->copy_to_sram)
> +		return -EOPNOTSUPP;
> +
> +	if (!ctx->mmap_locked) {
> +		if (!mmget_not_zero(mm)) {
> +			err = -EFAULT;
> +			goto err_out;
> +		}
> +		mmap_write_lock(mm);
> +	}
> +
> +	mmap_assert_locked(mm);
> +
> +	vas = vma_lookup(mm, start);
> +	if (!vas) {
> +		err = -ENOENT;
> +		goto err_mmunlock;
> +	}
> +
> +	if (end > vas->vm_end || start < vas->vm_start) {
> +		err = -EINVAL;
> +		goto err_mmunlock;
> +	}
> +
> +	if (!vma_is_anonymous(vas)) {
> +		err = -EBUSY;
> +		goto err_mmunlock;
> +	}
> +
> +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) + sizeof(*dma_addr) +
> +		       sizeof(*pages), GFP_KERNEL);
> +	if (!buf) {
> +		err = -ENOMEM;
> +		goto err_mmunlock;
> +	}
> +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr)) * npages;
> +
> +	zdd = drm_gpusvm_zdd_alloc(range);
> +	if (!zdd) {
> +		err = -ENOMEM;
> +		goto err_free;
> +	}
> +
> +	migrate.vma = vas;
> +	migrate.src = buf;
> +	migrate.dst = migrate.src + npages;
> +
> +	err = migrate_vma_setup(&migrate);
> +	if (err)
> +		goto err_free;
> +
> +	/*
> +	 * FIXME: Below cases, !migrate.cpages and migrate.cpages != npages,
> +	 * are not always an error. Need to revisit possible cases and how
> +	 * to handle. We could prefault on migrate.cpages != npages via
> +	 * hmm_range_fault.
> +	 */
> +
> +	if (!migrate.cpages) {
> +		err = -EFAULT;
> +		goto err_free;
> +	}
> +
> +	if (migrate.cpages != npages) {
> +		err = -EBUSY;
> +		goto err_finalize;
> +	}
> +
> +	err = gpusvm->ops->populate_vram_pfn(gpusvm, vram_allocation,
> +					     npages, migrate.dst);
> +	if (err)
> +		goto err_finalize;
> +
> +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev, dma_addr,
> +					   migrate.src, npages,
> +					   DMA_TO_DEVICE);
> +	if (err)
> +		goto err_finalize;
> +
> +	for (i = 0; i < npages; ++i) {
> +		struct page *page = pfn_to_page(migrate.dst[i]);
> +
> +		pages[i] = page;
> +		migrate.dst[i] = migrate_pfn(migrate.dst[i]);
> +		drm_gpusvm_get_vram_page(page, zdd);
> +	}
> +
> +	err = gpusvm->ops->copy_to_vram(gpusvm, pages, dma_addr, npages);
> +	if (err)
> +		goto err_finalize;
> +
> +	/* Upon success bind vram allocation to range and zdd */
> +	range->vram_allocation = vram_allocation;
> +	WRITE_ONCE(zdd->vram_allocation, vram_allocation);	/* Owns ref */
> +
> +err_finalize:
> +	if (err)
> +		drm_gpusvm_migration_put_pages(npages, migrate.dst);
> +	migrate_vma_pages(&migrate);
> +	migrate_vma_finalize(&migrate);
> +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr, npages,
> +				       DMA_TO_DEVICE);
> +err_free:
> +	if (zdd)
> +		drm_gpusvm_zdd_put(zdd);
> +	kvfree(buf);
> +err_mmunlock:
> +	if (!ctx->mmap_locked) {
> +		mmap_write_unlock(mm);
> +		mmput(mm);
> +	}
> +err_out:
> +	return err;
> +}
> +
> +/**
> + * drm_gpusvm_migrate_populate_sram_pfn - Populate SRAM PFNs for a VM
> + * area
> + * @vas: Pointer to the VM area structure, can be NULL
> + * @npages: Number of pages to populate
> + * @src_mpfn: Source array of migrate PFNs
> + * @mpfn: Array of migrate PFNs to populate
> + * @addr: Start address for PFN allocation
> + *
> + * This function populates the SRAM migrate page frame numbers (PFNs)
> + * for the specified VM area structure. It allocates and locks pages in
> + * the VM area for SRAM usage. If @vas is non-NULL, alloc_page_vma() is
> + * used for allocation; if NULL, alloc_page() is used.
> + *
> + * Returns:
> + * 0 on success, negative error code on failure.
> + */
> +static int drm_gpusvm_migrate_populate_sram_pfn(struct vm_area_struct *vas,
> +						unsigned long npages,
> +						unsigned long *src_mpfn,
> +						unsigned long *mpfn,
> +						u64 addr)
> +{
> +	unsigned long i;
> +
> +	for (i = 0; i < npages; ++i, addr += PAGE_SIZE) {
> +		struct page *page;
> +
> +		if (!(src_mpfn[i] & MIGRATE_PFN_MIGRATE))
> +			continue;
> +
> +		if (vas)
> +			page = alloc_page_vma(GFP_HIGHUSER, vas, addr);
> +		else
> +			page = alloc_page(GFP_HIGHUSER);
> +
> +		if (!page)
> +			return -ENOMEM;
> +
> +		lock_page(page);
> +		mpfn[i] = migrate_pfn(page_to_pfn(page));
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * drm_gpusvm_evict_to_sram - Evict GPU SVM range to SRAM
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + *
> + * Similar to __drm_gpusvm_migrate_to_sram but does not require the mmap
> + * lock, and migration is done via the migrate_device_* functions. This
> + * is a fallback path, as it is preferred to issue migrations with the
> + * mmap lock held.
> + *
> + * Returns:
> + * 0 on success, negative error code on failure.
> + */
> +static int drm_gpusvm_evict_to_sram(struct drm_gpusvm *gpusvm,
> +				    struct drm_gpusvm_range *range)
> +{
> +	unsigned long npages;
> +	struct page **pages;
> +	unsigned long *src, *dst;
> +	dma_addr_t *dma_addr;
> +	void *buf;
> +	int i, err = 0;
> +
> +	npages = npages_in_range(range->va.start, range->va.end);
> +
> +	buf = kvcalloc(npages, 2 * sizeof(*src) + sizeof(*dma_addr) +
> +		       sizeof(*pages), GFP_KERNEL);
> +	if (!buf) {
> +		err = -ENOMEM;
> +		goto err_out;
> +	}
> +	src = buf;
> +	dst = buf + (sizeof(*src) * npages);
> +	dma_addr = buf + (2 * sizeof(*src) * npages);
> +	pages = buf + (2 * sizeof(*src) + sizeof(*dma_addr)) * npages;
> +
> +	err = gpusvm->ops->populate_vram_pfn(gpusvm, range->vram_allocation,
> +					     npages, src);
> +	if (err)
> +		goto err_free;
> +
> +	err = migrate_device_vma_range(gpusvm->mm,
> +				       gpusvm->device_private_page_owner,
> +				       src, npages, range->va.start);
> +	if (err)
> +		goto err_free;
> +
> +	err = drm_gpusvm_migrate_populate_sram_pfn(NULL, npages, src, dst, 0);
> +	if (err)
> +		goto err_finalize;
> +
> +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev, dma_addr,
> +					   dst, npages, DMA_BIDIRECTIONAL);
> +	if (err)
> +		goto err_finalize;
> +
> +	for (i = 0; i < npages; ++i)
> +		pages[i] = migrate_pfn_to_page(src[i]);
> +
> +	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr, npages);
> +	if (err)
> +		goto err_finalize;
> +
> +err_finalize:
> +	if (err)
> +		drm_gpusvm_migration_put_pages(npages, dst);
> +	migrate_device_pages(src, dst, npages);
> +	migrate_device_finalize(src, dst, npages);
> +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr, npages,
> +				       DMA_BIDIRECTIONAL);
> +err_free:
> +	kvfree(buf);
> +err_out:
> +
> +	return err;
> +}
> +
> +/**
> + * __drm_gpusvm_migrate_to_sram - Migrate GPU SVM range to SRAM
> + * (internal)
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @vas: Pointer to the VM area structure
> + * @page: Pointer to the page for fault handling (can be NULL)
> + * @start: Start address of the migration range
> + * @end: End address of the migration range
> + *
> + * This internal function performs the migration of the specified GPU
> + * SVM range to SRAM. It sets up the migration, populates and DMA maps
> + * the SRAM PFNs, and invokes the driver-specific operations for
> + * migration to SRAM.
> + *
> + * Returns:
> + * 0 on success, negative error code on failure.
> + */
> +static int __drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> +					struct vm_area_struct *vas,
> +					struct page *page,
> +					u64 start, u64 end)
> +{
> +	struct migrate_vma migrate = {
> +		.vma		= vas,
> +		.pgmap_owner	= gpusvm->device_private_page_owner,
> +		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> +		.fault_page	= page,
> +	};
> +	unsigned long npages;
> +	struct page **pages;
> +	dma_addr_t *dma_addr;
> +	void *buf;
> +	int i, err = 0;
> +
> +	mmap_assert_locked(gpusvm->mm);
> +
> +	/* Corner case where the VM area struct has been partially unmapped */
> +	if (start < vas->vm_start)
> +		start = vas->vm_start;
> +	if (end > vas->vm_end)
> +		end = vas->vm_end;
> +
> +	migrate.start = start;
> +	migrate.end = end;
> +	npages = npages_in_range(start, end);
> +
> +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) + sizeof(*dma_addr) +
> +		       sizeof(*pages), GFP_KERNEL);
> +	if (!buf) {
> +		err = -ENOMEM;
> +		goto err_out;
> +	}
> +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr)) * npages;
> +
> +	migrate.vma = vas;
> +	migrate.src = buf;
> +	migrate.dst = migrate.src + npages;
> +
> +	err = migrate_vma_setup(&migrate);
> +	if (err)
> +		goto err_free;
> +
> +	/* Raced with another CPU fault, nothing to do */
> +	if (!migrate.cpages)
> +		goto err_free;
> +
> +	err = drm_gpusvm_migrate_populate_sram_pfn(vas, npages,
> +						   migrate.src, migrate.dst,
> +						   start);
> +	if (err)
> +		goto err_finalize;
> +
> +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev, dma_addr,
> +					   migrate.dst, npages,
> +					   DMA_BIDIRECTIONAL);
> +	if (err)
> +		goto err_finalize;
> +
> +	for (i = 0; i < npages; ++i)
> +		pages[i] = migrate_pfn_to_page(migrate.src[i]);

See comments below which pages we actually want to migrate.


> +
> +	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr, npages);
> +	if (err)
> +		goto err_finalize;
> +
> +err_finalize:
> +	if (err)
> +		drm_gpusvm_migration_put_pages(npages, migrate.dst);
> +	migrate_vma_pages(&migrate);
> +	migrate_vma_finalize(&migrate);
> +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr, npages,
> +				       DMA_BIDIRECTIONAL);
> +err_free:
> +	kvfree(buf);
> +err_out:
> +	mmap_assert_locked(gpusvm->mm);
> +
> +	return err;
> +}
> +
> +/**
> + * drm_gpusvm_migrate_to_sram - Migrate (evict) GPU SVM range to SRAM
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + * @ctx: GPU SVM context
> + *
> + * This function initiates the migration of the specified GPU SVM range
> + * to SRAM. It performs necessary checks and invokes the internal
> + * migration function for actual migration.
> + *
> + * Returns:
> + * 0 on success, negative error code on failure.
> + */
> +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> +			       struct drm_gpusvm_range *range,
> +			       const struct drm_gpusvm_ctx *ctx)
> +{
> +	u64 start = range->va.start, end = range->va.end;
> +	struct mm_struct *mm = gpusvm->mm;
> +	struct vm_area_struct *vas;
> +	int err;
> +	bool retry = false;
> +
> +	if (!ctx->mmap_locked) {
> +		if (!mmget_not_zero(mm)) {
> +			err = -EFAULT;
> +			goto err_out;
> +		}
> +		if (ctx->trylock_mmap) {
> +			if (!mmap_read_trylock(mm)) {
> +				err = drm_gpusvm_evict_to_sram(gpusvm, range);
> +				goto err_mmput;
> +			}
> +		} else {
> +			mmap_read_lock(mm);
> +		}
> +	}
> +
> +	mmap_assert_locked(mm);
> +
> +	/*
> +	 * Loop required to find all VM area structs for the corner case
> +	 * when the VRAM backing has been partially unmapped from the MM's
> +	 * address space.
> +	 */
> +again:
> +	vas = find_vma(mm, start);
> +	if (!vas) {
> +		if (!retry)
> +			err = -ENOENT;
> +		goto err_mmunlock;
> +	}
> +
> +	if (end <= vas->vm_start || start >= vas->vm_end) {
> +		if (!retry)
> +			err = -EINVAL;
> +		goto err_mmunlock;
> +	}
> +
> +	err = __drm_gpusvm_migrate_to_sram(gpusvm, vas, NULL, start, end);

This function is typically called from the vm side to get a clean mm as
a last resort after get_pages() fail. As such should we have it evict
*everything*, even foreign device memory, and mismatching local device
pages. If so, we could use hmm_range_fault() with a NULL page owner +
faulting to do that.
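
For reference, a rough and untested sketch of that idea (the helper name,
locking, and -EBUSY retry handling are placeholders): with a NULL
dev_private_owner, hmm_range_fault() treats all device-private pages,
ours or any other device's, as faultable, so they get migrated back to
system RAM via their pagemap's migrate_to_ram() handler:

	static int evict_everything_to_sram(struct drm_gpusvm *gpusvm,
					    struct drm_gpusvm_range *range)
	{
		unsigned long npages = npages_in_range(range->va.start,
						       range->va.end);
		struct hmm_range hmm_range = {
			.default_flags = HMM_PFN_REQ_FAULT,
			.notifier = &range->notifier->notifier,
			.start = range->va.start,
			.end = range->va.end,
			.dev_private_owner = NULL,	/* treat nothing as ours */
		};
		unsigned long *pfns;
		int err;

		pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
		if (!pfns)
			return -ENOMEM;
		hmm_range.hmm_pfns = pfns;

		mmap_read_lock(gpusvm->mm);
		hmm_range.notifier_seq =
			mmu_interval_read_begin(hmm_range.notifier);
		err = hmm_range_fault(&hmm_range);	/* -EBUSY retry omitted */
		mmap_read_unlock(gpusvm->mm);

		kvfree(pfns);
		return err;
	}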

> +	if (err)
> +		goto err_mmunlock;
> +
> +	if (vas->vm_end < end) {
> +		retry = true;
> +		start = vas->vm_end;
> +		goto again;
> +	}
> +
> +	if (!ctx->mmap_locked) {
> +		mmap_read_unlock(mm);
> +		/*
> +		 * Using mmput_async as this function can be called while
> +		 * holding a dma-resv lock, and a final put can grab the
> +		 * mmap lock, causing a lock inversion.
> +		 */
> +		mmput_async(mm);
> +	}
> +
> +	return 0;
> +
> +err_mmunlock:
> +	if (!ctx->mmap_locked)
> +		mmap_read_unlock(mm);
> +err_mmput:
> +	if (!ctx->mmap_locked)
> +		mmput_async(mm);
> +err_out:
> +	return err;
> +}
> +
> +/**
> + * drm_gpusvm_page_free - Put GPU SVM zone device data associated with
> + * a page
> + * @page: Pointer to the page
> + *
> + * This function is a callback used to put the GPU SVM zone device data
> + * associated with a page when it is being released.
> + */
> +static void drm_gpusvm_page_free(struct page *page)
> +{
> +	drm_gpusvm_zdd_put(page->zone_device_data);
> +}
> +
> +/**
> + * drm_gpusvm_migrate_to_ram - Migrate GPU SVM range to RAM (page fault
> + * handler)
> + * @vmf: Pointer to the fault information structure
> + *
> + * This function is a page fault handler used to migrate a GPU SVM range
> + * to RAM. It retrieves the GPU SVM range information from the faulting
> + * page and invokes the internal migration function to migrate the range
> + * back to RAM.
> + *
> + * Returns:
> + * VM_FAULT_SIGBUS on failure, 0 on success.
> + */
> +static vm_fault_t drm_gpusvm_migrate_to_ram(struct vm_fault *vmf)
> +{
> +	struct drm_gpusvm_zdd *zdd = vmf->page->zone_device_data;
> +	int err;
> +
> +	err = __drm_gpusvm_migrate_to_sram(zdd->range->gpusvm,
> +					   vmf->vma, vmf->page,
> +					   zdd->range->va.start,
> +					   zdd->range->va.end);

When called from here, since this is a pagemap op, we should ensure we
only migrate our own pagemap to RAM?
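
For what it's worth, migrate_vma_setup() in __drm_gpusvm_migrate_to_sram()
is already called with MIGRATE_VMA_SELECT_DEVICE_PRIVATE and .pgmap_owner
= gpusvm->device_private_page_owner, so only our own device-private pages
should be collected there. An untested sketch of making that ownership
assumption explicit in the fault handler (page->pgmap->owner being the
assumed way to reach the owner):

	static vm_fault_t drm_gpusvm_migrate_to_ram(struct vm_fault *vmf)
	{
		struct drm_gpusvm_zdd *zdd = vmf->page->zone_device_data;
		struct drm_gpusvm *gpusvm = zdd->range->gpusvm;
		int err;

		/* A fault on our pagemap should only hand us our own page */
		if (WARN_ON_ONCE(vmf->page->pgmap->owner !=
				 gpusvm->device_private_page_owner))
			return VM_FAULT_SIGBUS;

		err = __drm_gpusvm_migrate_to_sram(gpusvm, vmf->vma, vmf->page,
						   zdd->range->va.start,
						   zdd->range->va.end);

		return err ? VM_FAULT_SIGBUS : 0;
	}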

/Thanks,
Thomas


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory
  2024-10-09 10:50   ` Thomas Hellström
@ 2024-10-16  3:18     ` Matthew Brost
  2024-10-16  6:27       ` Thomas Hellström
  0 siblings, 1 reply; 100+ messages in thread
From: Matthew Brost @ 2024-10-16  3:18 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: intel-xe, dri-devel, airlied, christian.koenig, matthew.auld,
	daniel

On Wed, Oct 09, 2024 at 12:50:42PM +0200, Thomas Hellström wrote:
> Hi, Matthew.
> 
> Some comments below around migrating to SRAM.
> 
> 
> On Tue, 2024-08-27 at 19:48 -0700, Matthew Brost wrote:
> > This patch introduces support for GPU Shared Virtual Memory (SVM) in
> > the
> > Direct Rendering Manager (DRM) subsystem. SVM allows for seamless
> > sharing of memory between the CPU and GPU, enhancing performance and
> > flexibility in GPU computing tasks.
> > 
> > The patch adds the necessary infrastructure for SVM, including data
> > structures and functions for managing SVM ranges and notifiers. It
> > also
> > provides mechanisms for allocating, deallocating, and migrating
> > memory
> > regions between system RAM and GPU VRAM.
> > 
> > This mid-layer is largely inspired by GPUVM.
> > 
> > Cc: Dave Airlie <airlied@redhat.com>
> > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > Cc: Christian König <christian.koenig@amd.com>
> > Cc: <dri-devel@lists.freedesktop.org>
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/xe/Makefile     |    3 +-
> >  drivers/gpu/drm/xe/drm_gpusvm.c | 2174
> > +++++++++++++++++++++++++++++++
> >  drivers/gpu/drm/xe/drm_gpusvm.h |  415 ++++++
> >  3 files changed, 2591 insertions(+), 1 deletion(-)
> >  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.c
> >  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.h
> > 
> > diff --git a/drivers/gpu/drm/xe/Makefile
> > b/drivers/gpu/drm/xe/Makefile
> > index b9670ae09a9e..b8fc2ee58f1a 100644
> > --- a/drivers/gpu/drm/xe/Makefile
> > +++ b/drivers/gpu/drm/xe/Makefile
> > @@ -25,7 +25,8 @@ $(obj)/generated/%_wa_oob.c
> > $(obj)/generated/%_wa_oob.h: $(obj)/xe_gen_wa_oob \
> >  
> >  # core driver code
> >  
> > -xe-y += xe_bb.o \
> > +xe-y += drm_gpusvm.o \
> > +	xe_bb.o \
> >  	xe_bo.o \
> >  	xe_bo_evict.o \
> >  	xe_devcoredump.o \
> > diff --git a/drivers/gpu/drm/xe/drm_gpusvm.c
> > b/drivers/gpu/drm/xe/drm_gpusvm.c
> > new file mode 100644
> > index 000000000000..fc1e44e6ae72
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/drm_gpusvm.c
> > @@ -0,0 +1,2174 @@
> > +// SPDX-License-Identifier: MIT
> > +/*
> > + * Copyright © 2024 Intel Corporation
> > + *
> > + * Authors:
> > + *     Matthew Brost <matthew.brost@intel.com>
> > + */
> > +
> > +#include <linux/dma-mapping.h>
> > +#include <linux/interval_tree_generic.h>
> > +#include <linux/hmm.h>
> > +#include <linux/memremap.h>
> > +#include <linux/migrate.h>
> > +#include <linux/mm_types.h>
> > +#include <linux/pagemap.h>
> > +#include <linux/slab.h>
> > +
> > +#include <drm/drm_device.h>
> > +#include "drm_gpusvm.h"
> > +
> > +/**
> > + * DOC: Overview
> > + *
> > + * GPU Shared Virtual Memory (GPU SVM) layer for the Direct
> > Rendering Manager (DRM)
> > + *
> > + * The GPU SVM layer is a component of the DRM framework designed to
> > manage shared
> > + * virtual memory between the CPU and GPU. It enables efficient data
> > exchange and
> > + * processing for GPU-accelerated applications by allowing memory
> > sharing and
> > + * synchronization between the CPU's and GPU's virtual address
> > spaces.
> > + *
> > + * Key GPU SVM Components:
> > + * - Notifiers: Notifiers: Used for tracking memory intervals and
> > notifying the
> > + *		GPU of changes, notifiers are sized based on a GPU
> > SVM
> > + *		initialization parameter, with a recommendation of
> > 512M or
> > + *		larger. They maintain a Red-Black tree and a list of
> > ranges that
> > + *		fall within the notifier interval. Notifiers are
> > tracked within
> > + *		a GPU SVM Red-Black tree and list and are
> > dynamically inserted
> > + *		or removed as ranges within the interval are created
> > or
> > + *		destroyed.
> > + * - Ranges: Represent memory ranges mapped in a DRM device and
> > managed
> > + *	     by GPU SVM. They are sized based on an array of chunk
> > sizes, which
> > + *	     is a GPU SVM initialization parameter, and the CPU
> > address space.
> > + *	     Upon GPU fault, the largest aligned chunk that fits
> > within the
> > + *	     faulting CPU address space is chosen for the range
> > size. Ranges are
> > + *	     expected to be dynamically allocated on GPU fault and
> > removed on an
> > + *	     MMU notifier UNMAP event. As mentioned above, ranges
> > are tracked in
> > + *	     a notifier's Red-Black tree.
> > + * - Operations: Define the interface for driver-specific SVM
> > operations such as
> > + *		 allocation, page collection, migration,
> > invalidations, and VRAM
> > + *		 release.
> > + *
> > + * This layer provides interfaces for allocating, mapping,
> > migrating, and
> > + * releasing memory ranges between the CPU and GPU. It handles all
> > core memory
> > + * management interactions (DMA mapping, HMM, and migration) and
> > provides
> > + * driver-specific virtual functions (vfuncs). This infrastructure
> > is sufficient
> > + * to build the expected driver components for an SVM implementation
> > as detailed
> > + * below.
> > + *
> > + * Expected Driver Components:
> > + * - GPU page fault handler: Used to create ranges and notifiers
> > based on the
> > + *			     fault address, optionally migrate the
> > range to
> > + *			     VRAM, and create GPU bindings.
> > + * - Garbage collector: Used to destroy GPU bindings for ranges.
> > Ranges are
> > + *			expected to be added to the garbage
> > collector upon
> > + *			MMU_NOTIFY_UNMAP event.
> > + */
> > +
> > +/**
> > + * DOC: Locking
> > + *
> > + * GPU SVM handles locking for core MM interactions, i.e., it
> > locks/unlocks the
> > + * mmap lock as needed. Alternatively, if the driver prefers to
> > handle the mmap
> > + * lock itself, a 'locked' argument is provided to the functions
> > that require
> > + * the mmap lock. This option may be useful for drivers that need to
> > call into
> > + * GPU SVM while also holding a dma-resv lock, thus preventing
> > locking
> > + * inversions between the mmap and dma-resv locks.
> > + *
> > + * GPU SVM introduces a global notifier lock, which safeguards the
> > notifier's
> > + * range RB tree and list, as well as the range's DMA mappings and
> > sequence
> > + * number. GPU SVM manages all necessary locking and unlocking
> > operations,
> > + * except for the recheck of the range's sequence number
> > + * (mmu_interval_read_retry) when the driver is committing GPU
> > bindings. This
> > + * lock corresponds to the 'driver->update' lock mentioned in the
> > HMM
> > + * documentation (TODO: Link). Future revisions may transition from
> > a GPU SVM
> > + * global lock to a per-notifier lock if finer-grained locking is
> > deemed
> > + * necessary.
> > + *
> > + * In addition to the locking mentioned above, the driver should
> > implement a
> > + * lock to safeguard core GPU SVM function calls that modify state,
> > such as
> > + * drm_gpusvm_range_find_or_insert and drm_gpusvm_range_remove.
> > Alternatively,
> > + * these core functions can be called within a single kernel thread,
> > for
> > + * instance, using an ordered work queue. This lock is denoted as
> > + * 'driver_svm_lock' in code examples.
> > + */
> > +
> > +/**
> > + * DOC: Migration
> > + *
> > + * The migration support is quite simple, allowing migration between
> > SRAM and
> > + * VRAM at the range granularity. For example, GPU SVM currently
> > does not
> > + * support mixing SRAM and VRAM pages within a range. This means
> > that upon GPU
> > + * fault, the entire range can be migrated to VRAM, and upon CPU
> > fault, the
> > + * entire range is migrated to SRAM.
> > + *
> > + * The reasoning for only supporting range granularity is as
> > follows: it
> > + * simplifies the implementation, and range sizes are driver-defined
> > and should
> > + * be relatively small.
> > + */
> > +
> > +/**
> > + * DOC: Partial Unmapping of Ranges
> > + *
> > + * Partial unmapping of ranges (e.g., 1M out of 2M is unmapped by
> > CPU resulting
> > + * in MMU_NOTIFY_UNMAP event) presents several challenges, with the
> > main one
> > + * being that a subset of the range still has CPU and GPU mappings.
> > If the
> > + * backing store for the range is in VRAM, a subset of the backing
> > store has
> > + * references. One option would be to split the range and VRAM
> > backing store,
> > + * but the implementation for this would be quite complicated. Given
> > that
> > + * partial unmappings are rare and driver-defined range sizes are
> > relatively
> > + * small, GPU SVM does not support splitting of ranges.
> > + *
> > + * With no support for range splitting, upon partial unmapping of a
> > range, the
> > + * driver is expected to invalidate and destroy the entire range. If
> > the range
> > + * has VRAM as its backing, the driver is also expected to migrate
> > any remaining
> > + * pages back to SRAM.
> > + */
> > +
> > +/**
> > + * DOC: Examples
> > + *
> > + * This section provides two examples of how to build the expected
> > driver
> > + * components: the GPU page fault handler and the garbage collector.
> > A third
> > + * example demonstrates a sample invalidation driver vfunc.
> > + *
> > + * The generic code provided does not include logic for complex
> > migration
> > + * policies, optimized invalidations, or other potentially required
> > driver
> > + * locking (e.g., DMA-resv locks).
> > + *
> > + * 1) GPU page fault handler
> > + *
> > + *	int driver_bind_range(struct drm_gpusvm *gpusvm, struct
> > drm_gpusvm_range *range)
> > + *	{
> > + *		int err = 0;
> > + *
> > + *		driver_alloc_and_setup_memory_for_bind(gpusvm,
> > range);
> > + *
> > + *		drm_gpusvm_notifier_lock(gpusvm);
> > + *		if (drm_gpusvm_range_pages_valid(range))
> > + *			driver_commit_bind(gpusvm, range);
> > + *		else
> > + *			err = -EAGAIN;
> > + *		drm_gpusvm_notifier_unlock(gpusvm);
> > + *
> > + *		return err;
> > + *	}
> > + *
> > + *	int driver_gpu_fault(struct drm_gpusvm *gpusvm, u64
> > fault_addr,
> > + *			     u64 gpuva_start, u64 gpuva_end)
> > + *	{
> > + *		struct drm_gpusvm_ctx ctx = {};
> > + *		int err;
> > + *
> > + *		driver_svm_lock();
> > + *	retry:
> > + *		// Always process UNMAPs first so view of GPU SVM
> > ranges is current
> > + *		driver_garbage_collector(gpusvm);
> > + *
> > + *		range = drm_gpusvm_range_find_or_insert(gpusvm,
> > fault_addr,
> > + *							gpuva_start,
> > gpuva_end,
> > + *						        &ctx);
> > + *		if (IS_ERR(range)) {
> > + *			err = PTR_ERR(range);
> > + *			goto unlock;
> > + *		}
> > + *
> > + *		if (driver_migration_policy(range)) {
> > + *			bo = driver_alloc_bo();
> > + *			err = drm_gpusvm_migrate_to_vram(gpusvm,
> > range, bo, &ctx);
> > + *			if (err)	// CPU mappings may have
> > changed
> > + *				goto retry;
> > + *		}
> > + *
> > + *		err = drm_gpusvm_range_get_pages(gpusvm, range,
> > &ctx);
> > + *		if (err == -EFAULT || err == -EPERM)	// CPU
> > mappings changed
> > + *			goto retry;
> > + *		else if (err)
> > + *			goto unlock;
> > + *
> > + *		err = driver_bind_range(gpusvm, range);
> > + *		if (err == -EAGAIN)	// CPU mappings changed
> > + *			goto retry
> > + *
> > + *	unlock:
> > + *		driver_svm_unlock();
> > + *		return err;
> > + *	}
> > + *
> > + * 2) Garbage Collector.
> > + *
> > + *	void __driver_garbage_collector(struct drm_gpusvm *gpusvm,
> > + *					struct drm_gpusvm_range
> > *range)
> > + *	{
> > + *		struct drm_gpusvm_ctx ctx = {};
> > + *
> > + *		assert_driver_svm_locked(gpusvm);
> > + *
> > + *		// Partial unmap, migrate any remaining VRAM pages
> > back to SRAM
> > + *		if (range->flags.partial_unmap)
> > + *			drm_gpusvm_migrate_to_sram(gpusvm, range,
> > &ctx);
> > + *
> > + *		driver_unbind_range(range);
> > + *		drm_gpusvm_range_remove(gpusvm, range);
> > + *	}
> > + *
> > + *	void driver_garbage_collector(struct drm_gpusvm *gpusvm)
> > + *	{
> > + *		assert_driver_svm_locked(gpusvm);
> > + *
> > + *		for_each_range_in_garbage_collector(gpusvm, range)
> > + *			__driver_garbage_collector(gpusvm, range);
> > + *	}
> > + *
> > + * 3) Invalidation driver vfunc.
> > + *
> > + *	void driver_invalidation(struct drm_gpusvm *gpusvm,
> > + *				 struct drm_gpusvm_notifier
> > *notifier,
> > + *				 const struct mmu_notifier_range
> > *mmu_range)
> > + *	{
> > + *		struct drm_gpusvm_ctx ctx = { .in_notifier = true,
> > };
> > + *		struct drm_gpusvm_range *range = NULL;
> > + *
> > + *		driver_invalidate_device_tlb(gpusvm, mmu_range-
> > >start, mmu_range->end);
> > + *
> > + *		drm_gpusvm_for_each_range(range, notifier,
> > mmu_range->start,
> > + *					  mmu_range->end) {
> > + *			drm_gpusvm_range_unmap_pages(gpusvm, range,
> > &ctx);
> > + *
> > + *			if (mmu_range->event != MMU_NOTIFY_UNMAP)
> > + *				continue;
> > + *
> > + *			drm_gpusvm_range_set_unmapped(range,
> > mmu_range);
> > + *			driver_garbage_collector_add(gpusvm, range);
> > + *		}
> > + *	}
> > + */
> > +
> > +#define DRM_GPUSVM_RANGE_START(_range)	((_range)->va.start)
> > +#define DRM_GPUSVM_RANGE_END(_range)	((_range)->va.end - 1)
> > +INTERVAL_TREE_DEFINE(struct drm_gpusvm_range, rb.node, u64,
> > rb.__subtree_last,
> > +		     DRM_GPUSVM_RANGE_START, DRM_GPUSVM_RANGE_END,
> > +		     static __maybe_unused, range);
> > +
> > +#define DRM_GPUSVM_NOTIFIER_START(_notifier)	((_notifier)-
> > >interval.start)
> > +#define DRM_GPUSVM_NOTIFIER_END(_notifier)	((_notifier)-
> > >interval.end - 1)
> > +INTERVAL_TREE_DEFINE(struct drm_gpusvm_notifier, rb.node, u64,
> > +		     rb.__subtree_last, DRM_GPUSVM_NOTIFIER_START,
> > +		     DRM_GPUSVM_NOTIFIER_END, static __maybe_unused,
> > notifier);
> > +
> > +/**
> > + * npages_in_range() - Calculate the number of pages in a given
> > range
> > + * @start__: The start address of the range
> > + * @end__: The end address of the range
> > + *
> > + * This macro calculates the number of pages in a given memory
> > range,
> > + * specified by the start and end addresses. It divides the
> > difference
> > + * between the end and start addresses by the page size (PAGE_SIZE)
> > to
> > + * determine the number of pages in the range.
> > + *
> > + * Return: The number of pages in the specified range.
> > + */
> > +#define npages_in_range(start__, end__)	\
> > +	(((end__) - (start__)) >> PAGE_SHIFT)
> > +
> > +/**
> > + * struct drm_gpusvm_zdd - GPU SVM zone device data
> > + *
> > + * @refcount: Reference count for the zdd
> > + * @destroy_work: Work structure for asynchronous zdd destruction
> > + * @range: Pointer to the GPU SVM range
> > + * @vram_allocation: Driver-private pointer to the VRAM allocation
> > + *
> > + * This structure serves as a generic wrapper installed in
> > + * page->zone_device_data. It provides infrastructure for looking up
> > a range
> > + * upon CPU page fault and asynchronously releasing VRAM once the
> > CPU has no
> > + * page references. Asynchronous release is useful because CPU page
> > references
> > + * can be dropped in IRQ contexts, while releasing VRAM likely
> > requires sleeping
> > + * locks.
> > + */
> > +struct drm_gpusvm_zdd {
> > +	struct kref refcount;
> > +	struct work_struct destroy_work;
> > +	struct drm_gpusvm_range *range;
> > +	void *vram_allocation;
> > +};
> > +
> > +/**
> > + * drm_gpusvm_zdd_destroy_work_func - Work function for destroying a
> > zdd
> > + * @w: Pointer to the work_struct
> > + *
> > + * This function releases VRAM, puts GPU SVM range, and frees zdd.
> > + */
> > +static void drm_gpusvm_zdd_destroy_work_func(struct work_struct *w)
> > +{
> > +	struct drm_gpusvm_zdd *zdd =
> > +		container_of(w, struct drm_gpusvm_zdd,
> > destroy_work);
> > +	struct drm_gpusvm_range *range = zdd->range;
> > +	struct drm_gpusvm *gpusvm = range->gpusvm;
> > +
> > +	if (gpusvm->ops->vram_release && zdd->vram_allocation)
> > +		gpusvm->ops->vram_release(zdd->vram_allocation);
> > +	drm_gpusvm_range_put(range);
> > +	kfree(zdd);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_zdd_alloc - Allocate a zdd structure.
> > + * @range: Pointer to the GPU SVM range.
> > + *
> > + * This function allocates and initializes a new zdd structure. It
> > sets up the
> > + * reference count, initializes the destroy work, and links the
> > provided GPU SVM
> > + * range.
> > + *
> > + * Returns:
> > + * Pointer to the allocated zdd on success, ERR_PTR() on failure.
> > + */
> > +static struct drm_gpusvm_zdd *
> > +drm_gpusvm_zdd_alloc(struct drm_gpusvm_range *range)
> > +{
> > +	struct drm_gpusvm_zdd *zdd;
> > +
> > +	zdd = kmalloc(sizeof(*zdd), GFP_KERNEL);
> > +	if (!zdd)
> > +		return NULL;
> > +
> > +	kref_init(&zdd->refcount);
> > +	INIT_WORK(&zdd->destroy_work,
> > drm_gpusvm_zdd_destroy_work_func);
> > +	zdd->range = drm_gpusvm_range_get(range);
> > +	zdd->vram_allocation = NULL;
> > +
> > +	return zdd;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_zdd_get - Get a reference to a zdd structure.
> > + * @zdd: Pointer to the zdd structure.
> > + *
> > + * This function increments the reference count of the provided zdd
> > structure.
> > + *
> > + * Returns: Pointer to the zdd structure.
> > + */
> > +static struct drm_gpusvm_zdd *drm_gpusvm_zdd_get(struct
> > drm_gpusvm_zdd *zdd)
> > +{
> > +	kref_get(&zdd->refcount);
> > +	return zdd;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_zdd_destroy - Destroy a zdd structure.
> > + * @ref: Pointer to the reference count structure.
> > + *
> > + * This function queues the destroy_work of the zdd for asynchronous
> > destruction.
> > + */
> > +static void drm_gpusvm_zdd_destroy(struct kref *ref)
> > +{
> > +	struct drm_gpusvm_zdd *zdd =
> > +		container_of(ref, struct drm_gpusvm_zdd, refcount);
> > +	struct drm_gpusvm *gpusvm = zdd->range->gpusvm;
> > +
> > +	queue_work(gpusvm->zdd_wq, &zdd->destroy_work);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_zdd_put - Put a zdd reference.
> > + * @zdd: Pointer to the zdd structure.
> > + *
> > + * This function decrements the reference count of the provided zdd
> > structure
> > + * and schedules its destruction if the count drops to zero.
> > + */
> > +static void drm_gpusvm_zdd_put(struct drm_gpusvm_zdd *zdd)
> > +{
> > +	kref_put(&zdd->refcount, drm_gpusvm_zdd_destroy);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_find - Find GPU SVM range from GPU SVM notifier
> > + * @notifier: Pointer to the GPU SVM notifier structure.
> > + * @start: Start address of the range
> > + * @end: End address of the range
> > + *
> > + * Return: A pointer to the drm_gpusvm_range if found or NULL
> > + */
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64
> > start, u64 end)
> > +{
> > +	return range_iter_first(&notifier->root, start, end - 1);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_for_each_range_safe - Safely iterate over GPU SVM
> > ranges in a notifier
> > + * @range__: Iterator variable for the ranges
> > + * @next__: Iterator variable for the ranges temporary storage
> > + * @notifier__: Pointer to the GPU SVM notifier
> > + * @start__: Start address of the range
> > + * @end__: End address of the range
> > + *
> > + * This macro is used to iterate over GPU SVM ranges in a notifier
> > while
> > + * removing ranges from it.
> > + */
> > +#define drm_gpusvm_for_each_range_safe(range__, next__, notifier__,
> > start__, end__)	\
> > +	for ((range__) = drm_gpusvm_range_find((notifier__),
> > (start__), (end__)),	\
> > +	     (next__) =
> > __drm_gpusvm_range_next(range__);				\
> > +	     (range__) && (range__->va.start <
> > (end__));				\
> > +	     (range__) = (next__), (next__) =
> > __drm_gpusvm_range_next(range__))
> > +
> > +/**
> > + * __drm_gpusvm_notifier_next - get the next drm_gpusvm_notifier in
> > the list
> > + * @notifier: a pointer to the current drm_gpusvm_notifier
> > + *
> > + * Return: A pointer to the next drm_gpusvm_notifier if available,
> > or NULL if
> > + *         the current notifier is the last one or if the input
> > notifier is
> > + *         NULL.
> > + */
> > +static struct drm_gpusvm_notifier *
> > +__drm_gpusvm_notifier_next(struct drm_gpusvm_notifier *notifier)
> > +{
> > +	if (notifier && !list_is_last(&notifier->rb.entry,
> > +				      &notifier->gpusvm-
> > >notifier_list))
> > +		return list_next_entry(notifier, rb.entry);
> > +
> > +	return NULL;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_for_each_notifier - Iterate over GPU SVM notifiers in
> > a gpusvm
> > + * @notifier__: Iterator variable for the notifiers
> > + * @gpusvm__: Pointer to the GPU SVM structure
> > + * @start__: Start address of the notifier
> > + * @end__: End address of the notifier
> > + *
> > + * This macro is used to iterate over GPU SVM notifiers in a gpusvm.
> > + */
> > +#define drm_gpusvm_for_each_notifier(notifier__, gpusvm__, start__,
> > end__)		\
> > +	for ((notifier__) = notifier_iter_first(&(gpusvm__)->root,
> > (start__), (end__) - 1);	\
> > +	     (notifier__) && (notifier__->interval.start <
> > (end__));			\
> > +	     (notifier__) = __drm_gpusvm_notifier_next(notifier__))
> > +
> > +/**
> > + * drm_gpusvm_for_each_notifier_safe - Safely iterate over GPU SVM
> > notifiers in a gpusvm
> > + * @notifier__: Iterator variable for the notifiers
> > + * @next__: Iterator variable for the notifiers temporary storage
> > + * @gpusvm__: Pointer to the GPU SVM structure
> > + * @start__: Start address of the notifier
> > + * @end__: End address of the notifier
> > + *
> > + * This macro is used to iterate over GPU SVM notifiers in a gpusvm
> > while
> > + * removing notifiers from it.
> > + */
> > +#define drm_gpusvm_for_each_notifier_safe(notifier__, next__,
> > gpusvm__, start__, end__)	\
> > +	for ((notifier__) = notifier_iter_first(&(gpusvm__)->root,
> > (start__), (end__) - 1),	\
> > +	     (next__) =
> > __drm_gpusvm_notifier_next(notifier__);				\
> > +	     (notifier__) && (notifier__->interval.start <
> > (end__));			\
> > +	     (notifier__) = (next__), (next__) =
> > __drm_gpusvm_notifier_next(notifier__))
> > +
> > +/**
> > + * drm_gpusvm_notifier_invalidate - Invalidate a GPU SVM notifier.
> > + * @mni: Pointer to the mmu_interval_notifier structure.
> > + * @mmu_range: Pointer to the mmu_notifier_range structure.
> > + * @cur_seq: Current sequence number.
> > + *
> > + * This function serves as a generic MMU notifier for GPU SVM. It
> > sets the MMU
> > + * notifier sequence number and calls the driver invalidate vfunc
> > under
> > + * gpusvm->notifier_lock.
> > + *
> > + * Returns:
> > + * true if the operation succeeds, false otherwise.
> > + */
> > +static bool
> > +drm_gpusvm_notifier_invalidate(struct mmu_interval_notifier *mni,
> > +			       const struct mmu_notifier_range
> > *mmu_range,
> > +			       unsigned long cur_seq)
> > +{
> > +	struct drm_gpusvm_notifier *notifier =
> > +		container_of(mni, typeof(*notifier), notifier);
> > +	struct drm_gpusvm *gpusvm = notifier->gpusvm;
> > +
> > +	if (!mmu_notifier_range_blockable(mmu_range))
> > +		return false;
> > +
> > +	down_write(&gpusvm->notifier_lock);
> > +	mmu_interval_set_seq(mni, cur_seq);
> > +	gpusvm->ops->invalidate(gpusvm, notifier, mmu_range);
> > +	up_write(&gpusvm->notifier_lock);
> > +
> > +	return true;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_notifier_ops - MMU interval notifier operations for
> > GPU SVM
> > + */
> > +static const struct mmu_interval_notifier_ops
> > drm_gpusvm_notifier_ops = {
> > +	.invalidate = drm_gpusvm_notifier_invalidate,
> > +};
> > +
> > +/**
> > + * drm_gpusvm_init - Initialize the GPU SVM.
> > + * @gpusvm: Pointer to the GPU SVM structure.
> > + * @name: Name of the GPU SVM.
> > + * @drm: Pointer to the DRM device structure.
> > + * @mm: Pointer to the mm_struct for the address space.
> > + * @device_private_page_owner: Device private pages owner.
> > + * @mm_start: Start address of GPU SVM.
> > + * @mm_range: Range of the GPU SVM.
> > + * @notifier_size: Size of individual notifiers.
> > + * @ops: Pointer to the operations structure for GPU SVM.
> > + * @chunk_sizes: Pointer to the array of chunk sizes used in range
> > allocation.
> > + *               Entries should be powers of 2 in descending order
> > with last
> > + *               entry being SZ_4K.
> > + * @num_chunks: Number of chunks.
> > + *
> > + * This function initializes the GPU SVM.
> > + *
> > + * Returns:
> > + * 0 on success, a negative error code on failure.
> > + */
> > +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> > +		    const char *name, struct drm_device *drm,
> > +		    struct mm_struct *mm, void
> > *device_private_page_owner,
> > +		    u64 mm_start, u64 mm_range, u64 notifier_size,
> > +		    const struct drm_gpusvm_ops *ops,
> > +		    const u64 *chunk_sizes, int num_chunks)
> > +{
> > +	if (!ops->invalidate || !num_chunks)
> > +		return -EINVAL;
> > +
> > +	gpusvm->name = name;
> > +	gpusvm->drm = drm;
> > +	gpusvm->mm = mm;
> > +	gpusvm->device_private_page_owner =
> > device_private_page_owner;
> > +	gpusvm->mm_start = mm_start;
> > +	gpusvm->mm_range = mm_range;
> > +	gpusvm->notifier_size = notifier_size;
> > +	gpusvm->ops = ops;
> > +	gpusvm->chunk_sizes = chunk_sizes;
> > +	gpusvm->num_chunks = num_chunks;
> > +	gpusvm->zdd_wq = system_wq;
> > +
> > +	mmgrab(mm);
> > +	gpusvm->root = RB_ROOT_CACHED;
> > +	INIT_LIST_HEAD(&gpusvm->notifier_list);
> > +
> > +	init_rwsem(&gpusvm->notifier_lock);
> > +
> > +	fs_reclaim_acquire(GFP_KERNEL);
> > +	might_lock(&gpusvm->notifier_lock);
> > +	fs_reclaim_release(GFP_KERNEL);
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_notifier_find - Find GPU SVM notifier
> > + * @gpusvm__: Pointer to the GPU SVM structure
> > + * @fault_addr__: Fault address
> > + *
> > + * This macro finds the GPU SVM notifier associated with the fault
> > address.
> > + *
> > + * Returns:
> > + * Pointer to the GPU SVM notifier on success, NULL otherwise.
> > + */
> > +#define drm_gpusvm_notifier_find(gpusvm__, fault_addr__)	\
> > +	notifier_iter_first(&(gpusvm__)->root, (fault_addr__),	\
> > +			    (fault_addr__ + 1))
> > +
> > +/**
> > + * to_drm_gpusvm_notifier - retrieve the container struct for a
> > given rbtree node
> > + * @node__: a pointer to the rbtree node embedded within a
> > drm_gpusvm_notifier struct
> > + *
> > + * Return: A pointer to the containing drm_gpusvm_notifier
> > structure.
> > + */
> > +#define to_drm_gpusvm_notifier(node__)				\
> > +	container_of((node__), struct drm_gpusvm_notifier, rb.node)
> > +
> > +/**
> > + * drm_gpusvm_notifier_insert - Insert GPU SVM notifier
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + *
> > + * This function inserts the GPU SVM notifier into the GPU SVM RB
> > tree and list.
> > + */
> > +static void drm_gpusvm_notifier_insert(struct drm_gpusvm *gpusvm,
> > +				       struct drm_gpusvm_notifier
> > *notifier)
> > +{
> > +	struct rb_node *node;
> > +	struct list_head *head;
> > +
> > +	notifier_insert(notifier, &gpusvm->root);
> > +
> > +	node = rb_prev(&notifier->rb.node);
> > +	if (node)
> > +		head = &(to_drm_gpusvm_notifier(node))->rb.entry;
> > +	else
> > +		head = &gpusvm->notifier_list;
> > +
> > +	list_add(&notifier->rb.entry, head);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_notifier_remove - Remove GPU SVM notifier
> > + * @gpusvm__: Pointer to the GPU SVM structure
> > + * @notifier__: Pointer to the GPU SVM notifier structure
> > + *
> > + * This macro removes the GPU SVM notifier from the GPU SVM RB tree
> > and list.
> > + */
> > +#define drm_gpusvm_notifier_remove(gpusvm__, notifier__)	\
> > +	notifier_remove((notifier__), &(gpusvm__)->root);	\
> > +	list_del(&(notifier__)->rb.entry)
> > +
> > +/**
> > + * drm_gpusvm_fini - Finalize the GPU SVM.
> > + * @gpusvm: Pointer to the GPU SVM structure.
> > + *
> > + * This function finalizes the GPU SVM by cleaning up any remaining
> > ranges and
> > + * notifiers, and dropping a reference to struct MM.
> > + */
> > +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm)
> > +{
> > +	struct drm_gpusvm_notifier *notifier, *next;
> > +
> > +	drm_gpusvm_for_each_notifier_safe(notifier, next, gpusvm, 0,
> > LONG_MAX) {
> > +		struct drm_gpusvm_range *range, *__next;
> > +
> > +		/*
> > +		 * Remove notifier first to avoid racing with any
> > invalidation
> > +		 */
> > +		mmu_interval_notifier_remove(&notifier->notifier);
> > +		notifier->flags.removed = true;
> > +
> > +		drm_gpusvm_for_each_range_safe(range, __next,
> > notifier, 0,
> > +					       LONG_MAX)
> > +			drm_gpusvm_range_remove(gpusvm, range);
> > +	}
> > +
> > +	mmdrop(gpusvm->mm);
> > +	WARN_ON(!RB_EMPTY_ROOT(&gpusvm->root.rb_root));
> > +}
> > +
> > +/**
> > + * drm_gpusvm_notifier_alloc - Allocate GPU SVM notifier
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @fault_addr: Fault address
> > + *
> > + * This function allocates and initializes the GPU SVM notifier
> > structure.
> > + *
> > + * Returns:
> > + * Pointer to the allocated GPU SVM notifier on success, ERR_PTR()
> > on failure.
> > + */
> > +static struct drm_gpusvm_notifier *
> > +drm_gpusvm_notifier_alloc(struct drm_gpusvm *gpusvm, u64 fault_addr)
> > +{
> > +	struct drm_gpusvm_notifier *notifier;
> > +
> > +	if (gpusvm->ops->notifier_alloc)
> > +		notifier = gpusvm->ops->notifier_alloc();
> > +	else
> > +		notifier = kzalloc(sizeof(*notifier), GFP_KERNEL);
> > +
> > +	if (!notifier)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	notifier->gpusvm = gpusvm;
> > +	notifier->interval.start = ALIGN_DOWN(fault_addr, gpusvm-
> > >notifier_size);
> > +	notifier->interval.end = ALIGN(fault_addr + 1, gpusvm-
> > >notifier_size);
> > +	INIT_LIST_HEAD(&notifier->rb.entry);
> > +	notifier->root = RB_ROOT_CACHED;
> > +	INIT_LIST_HEAD(&notifier->range_list);
> > +
> > +	return notifier;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_notifier_free - Free GPU SVM notifier
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + *
> > + * This function frees the GPU SVM notifier structure.
> > + */
> > +static void drm_gpusvm_notifier_free(struct drm_gpusvm *gpusvm,
> > +				     struct drm_gpusvm_notifier
> > *notifier)
> > +{
> > +	WARN_ON(!RB_EMPTY_ROOT(&notifier->root.rb_root));
> > +
> > +	if (gpusvm->ops->notifier_free)
> > +		gpusvm->ops->notifier_free(notifier);
> > +	else
> > +		kfree(notifier);
> > +}
> > +
> > +/**
> > + * to_drm_gpusvm_range - retrieve the container struct for a given
> > rbtree node
> > + * @node__: a pointer to the rbtree node embedded within a
> > drm_gpusvm_range struct
> > + *
> > + * Return: A pointer to the containing drm_gpusvm_range structure.
> > + */
> > +#define to_drm_gpusvm_range(node__)	\
> > +	container_of((node__), struct drm_gpusvm_range, rb.node)
> > +
> > +/**
> > + * drm_gpusvm_range_insert - Insert GPU SVM range
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + * @range: Pointer to the GPU SVM range structure
> > + *
> > + * This function inserts the GPU SVM range into the notifier RB tree
> > and list.
> > + */
> > +static void drm_gpusvm_range_insert(struct drm_gpusvm_notifier
> > *notifier,
> > +				    struct drm_gpusvm_range *range)
> > +{
> > +	struct rb_node *node;
> > +	struct list_head *head;
> > +
> > +	drm_gpusvm_notifier_lock(notifier->gpusvm);
> > +	range_insert(range, &notifier->root);
> > +
> > +	node = rb_prev(&range->rb.node);
> > +	if (node)
> > +		head = &(to_drm_gpusvm_range(node))->rb.entry;
> > +	else
> > +		head = &notifier->range_list;
> > +
> > +	list_add(&range->rb.entry, head);
> > +	drm_gpusvm_notifier_unlock(notifier->gpusvm);
> > +}
> > +
> > +/**
> > + * __drm_gpusvm_range_remove - Remove GPU SVM range
> > + * @notifier__: Pointer to the GPU SVM notifier structure
> > + * @range__: Pointer to the GPU SVM range structure
> > + *
> > + * This macro removes the GPU SVM range from the notifier RB tree
> > and list.
> > + */
> > +#define __drm_gpusvm_range_remove(notifier__, range__)		\
> > +	range_remove((range__), &(notifier__)->root);		\
> > +	list_del(&(range__)->rb.entry)
> > +
> > +/**
> > + * drm_gpusvm_range_alloc - Allocate GPU SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + * @fault_addr: Fault address
> > + * @chunk_size: Chunk size
> > + * @migrate_vram: Flag indicating whether to migrate VRAM
> > + *
> > + * This function allocates and initializes the GPU SVM range
> > structure.
> > + *
> > + * Returns:
> > + * Pointer to the allocated GPU SVM range on success, ERR_PTR() on
> > failure.
> > + */
> > +static struct drm_gpusvm_range *
> > +drm_gpusvm_range_alloc(struct drm_gpusvm *gpusvm,
> > +		       struct drm_gpusvm_notifier *notifier,
> > +		       u64 fault_addr, u64 chunk_size, bool
> > migrate_vram)
> > +{
> > +	struct drm_gpusvm_range *range;
> > +
> > +	if (gpusvm->ops->range_alloc)
> > +		range = gpusvm->ops->range_alloc(gpusvm);
> > +	else
> > +		range = kzalloc(sizeof(*range), GFP_KERNEL);
> > +
> > +	if (!range)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	kref_init(&range->refcount);
> > +	range->gpusvm = gpusvm;
> > +	range->notifier = notifier;
> > +	range->va.start = ALIGN_DOWN(fault_addr, chunk_size);
> > +	range->va.end = ALIGN(fault_addr + 1, chunk_size);
> > +	INIT_LIST_HEAD(&range->rb.entry);
> > +	range->notifier_seq = LONG_MAX;
> > +	range->flags.migrate_vram = migrate_vram ? 1 : 0;
> > +
> > +	return range;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_check_pages - Check pages
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + * @start: Start address
> > + * @end: End address
> > + *
> > + * Check if pages between start and end have been faulted in on the
> > CPU. Used to
> > + * prevent migration of pages without CPU backing store.
> > + *
> > + * Returns:
> > + * True if pages have been faulted into CPU, False otherwise
> > + */
> > +static bool drm_gpusvm_check_pages(struct drm_gpusvm *gpusvm,
> > +				   struct drm_gpusvm_notifier
> > *notifier,
> > +				   u64 start, u64 end)
> > +{
> > +	struct hmm_range hmm_range = {
> > +		.default_flags = 0,
> > +		.notifier = &notifier->notifier,
> > +		.start = start,
> > +		.end = end,
> > +		.dev_private_owner = gpusvm-
> > >device_private_page_owner,
> > +	};
> > +	unsigned long timeout =
> > +		jiffies +
> > msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > +	unsigned long *pfns;
> > +	unsigned long npages = npages_in_range(start, end);
> > +	int err, i;
> > +
> > +	mmap_assert_locked(gpusvm->mm);
> > +
> > +	pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
> > +	if (!pfns)
> > +		return false;
> > +
> > +	hmm_range.notifier_seq = mmu_interval_read_begin(&notifier-
> > >notifier);
> > +	hmm_range.hmm_pfns = pfns;
> > +
> > +	while (true) {
> > +		err = hmm_range_fault(&hmm_range);
> > +		if (err == -EBUSY) {
> > +			if (time_after(jiffies, timeout))
> > +				break;
> > +
> > +			hmm_range.notifier_seq =
> > mmu_interval_read_begin(&notifier->notifier);
> > +			continue;
> > +		}
> > +		break;
> > +	}
> > +	if (err)
> > +		goto err_free;
> > +
> > +	for (i = 0; i < npages; ++i) {
> > +		if (!(pfns[i] & HMM_PFN_VALID)) {
> > +			err = -EFAULT;
> > +			goto err_free;
> > +		}
> > +	}
> > +
> > +err_free:
> > +	kvfree(pfns);
> > +	return err ? false : true;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_chunk_size - Determine chunk size for GPU SVM
> > range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + * @vas: Pointer to the virtual memory area structure
> > + * @fault_addr: Fault address
> > + * @gpuva_start: Start address of GPUVA which mirrors CPU
> > + * @gpuva_end: End address of GPUVA which mirrors CPU
> > + * @check_pages: Flag indicating whether to check pages
> > + *
> > + * This function determines the chunk size for the GPU SVM range
> > based on the
> > + * fault address, GPU SVM chunk sizes, existing GPU SVM ranges, and
> > the virtual
> > + * memory area boundaries.
> > + *
> > + * Returns:
> > + * Chunk size on success, LONG_MAX on failure.
> > + */
> > +static u64 drm_gpusvm_range_chunk_size(struct drm_gpusvm *gpusvm,
> > +				       struct drm_gpusvm_notifier
> > *notifier,
> > +				       struct vm_area_struct *vas,
> > +				       u64 fault_addr, u64
> > gpuva_start,
> > +				       u64 gpuva_end, bool
> > check_pages)
> > +{
> > +	u64 start, end;
> > +	int i = 0;
> > +
> > +retry:
> > +	for (; i < gpusvm->num_chunks; ++i) {
> > +		start = ALIGN_DOWN(fault_addr, gpusvm-
> > >chunk_sizes[i]);
> > +		end = ALIGN(fault_addr + 1, gpusvm->chunk_sizes[i]);
> > +
> > +		if (start >= vas->vm_start && end <= vas->vm_end &&
> > +		    start >= notifier->interval.start &&
> > +		    end <= notifier->interval.end &&
> > +		    start >= gpuva_start && end <= gpuva_end)
> > +			break;
> > +	}
> > +
> > +	if (i == gpusvm->num_chunks)
> > +		return LONG_MAX;
> > +
> > +	/*
> > +	 * If the allocation spans more than one page, ensure it does not
> > +	 * overlap with existing ranges.
> > +	 */
> > +	if (end - start != SZ_4K) {
> > +		struct drm_gpusvm_range *range;
> > +
> > +		range = drm_gpusvm_range_find(notifier, start, end);
> > +		if (range) {
> > +			++i;
> > +			goto retry;
> > +		}
> > +
> > +		/*
> > +		 * XXX: Only create range on pages CPU has faulted
> > in. Without
> > +		 * this check, or prefault, on BMG
> > 'xe_exec_system_allocator --r
> > +		 * process-many-malloc' fails. In the failure case,
> > each process
> > +		 * mallocs 16k but the CPU VMA is ~128k which
> > results in 64k SVM
> > +		 * ranges. When migrating the SVM ranges, some
> > processes fail in
> > +		 * drm_gpusvm_migrate_to_vram with 'migrate.cpages
> > != npages'
> > +		 * and then upon drm_gpusvm_range_get_pages device
> > pages from
> > +		 * other processes are collected + faulted in which
> > creates all
> > +		 * sorts of problems. Unsure exactly how this is
> > +		 * happening; the problem also goes away if
> > +		 * 'xe_exec_system_allocator --r process-many-malloc'
> > +		 * mallocs at least 64k at a time.
> > +		 */
> > +		if (check_pages &&
> > +		    !drm_gpusvm_check_pages(gpusvm, notifier, start,
> > end)) {
> > +			++i;
> > +			goto retry;
> > +		}
> > +	}
> > +
> > +	return end - start;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_find_or_insert - Find or insert GPU SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @fault_addr: Fault address
> > + * @gpuva_start: Start address of GPUVA which mirrors CPU
> > + * @gpuva_end: End address of GPUVA which mirrors CPU
> > + * @ctx: GPU SVM context
> > + *
> > + * This function finds or inserts a newly allocated GPU SVM range
> > based on the
> > + * fault address. Caller must hold a lock to protect range lookup
> > and insertion.
> > + *
> > + * Returns:
> > + * Pointer to the GPU SVM range on success, ERR_PTR() on failure.
> > + */
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64
> > fault_addr,
> > +				u64 gpuva_start, u64 gpuva_end,
> > +				const struct drm_gpusvm_ctx *ctx)
> > +{
> > +	struct drm_gpusvm_notifier *notifier;
> > +	struct drm_gpusvm_range *range;
> > +	struct mm_struct *mm = gpusvm->mm;
> > +	struct vm_area_struct *vas;
> > +	bool notifier_alloc = false;
> > +	u64 chunk_size;
> > +	int err;
> > +	bool migrate_vram;
> > +
> > +	if (fault_addr < gpusvm->mm_start ||
> > +	    fault_addr > gpusvm->mm_start + gpusvm->mm_range) {
> > +		err = -EINVAL;
> > +		goto err_out;
> > +	}
> > +
> > +	if (!ctx->mmap_locked) {
> > +		if (!mmget_not_zero(mm)) {
> > +			err = -EFAULT;
> > +			goto err_out;
> > +		}
> > +		mmap_write_lock(mm);
> > +	}
> > +
> > +	mmap_assert_write_locked(mm);
> > +
> > +	notifier = drm_gpusvm_notifier_find(gpusvm, fault_addr);
> > +	if (!notifier) {
> > +		notifier = drm_gpusvm_notifier_alloc(gpusvm,
> > fault_addr);
> > +		if (IS_ERR(notifier)) {
> > +			err = PTR_ERR(notifier);
> > +			goto err_mmunlock;
> > +		}
> > +		notifier_alloc = true;
> > +		err = mmu_interval_notifier_insert_locked(&notifier-
> > >notifier,
> > +							  mm,
> > notifier->interval.start,
> > +							  notifier-
> > >interval.end -
> > +							  notifier-
> > >interval.start,
> > +							 
> > &drm_gpusvm_notifier_ops);
> > +		if (err)
> > +			goto err_notifier;
> > +	}
> > +
> > +	vas = vma_lookup(mm, fault_addr);
> > +	if (!vas) {
> > +		err = -ENOENT;
> > +		goto err_notifier_remove;
> > +	}
> > +
> > +	if (!ctx->read_only && !(vas->vm_flags & VM_WRITE)) {
> > +		err = -EPERM;
> > +		goto err_notifier_remove;
> > +	}
> > +
> > +	range = drm_gpusvm_range_find(notifier, fault_addr,
> > fault_addr + 1);
> > +	if (range)
> > +		goto out_mmunlock;
> > +	/*
> > +	 * XXX: Short-circuiting migration based on migrate_vma_*
> > current
> > +	 * limitations. If/when migrate_vma_* add more support, this
> > logic will
> > +	 * have to change.
> > +	 */
> > +	migrate_vram = ctx->vram_possible &&
> > +		vma_is_anonymous(vas) && !is_vm_hugetlb_page(vas);
> > +
> > +	chunk_size = drm_gpusvm_range_chunk_size(gpusvm, notifier,
> > vas,
> > +						 fault_addr,
> > gpuva_start,
> > +						 gpuva_end,
> > migrate_vram &&
> > +						 !ctx->prefault);
> > +	if (chunk_size == LONG_MAX) {
> > +		err = -EINVAL;
> > +		goto err_notifier_remove;
> > +	}
> > +
> > +	range = drm_gpusvm_range_alloc(gpusvm, notifier, fault_addr,
> > chunk_size,
> > +				       migrate_vram);
> > +	if (IS_ERR(range)) {
> > +		err = PTR_ERR(range);
> > +		goto err_notifier_remove;
> > +	}
> > +
> > +	drm_gpusvm_range_insert(notifier, range);
> > +	if (notifier_alloc)
> > +		drm_gpusvm_notifier_insert(gpusvm, notifier);
> > +
> > +	if (ctx->prefault) {
> > +		struct drm_gpusvm_ctx __ctx = *ctx;
> > +
> > +		__ctx.mmap_locked = true;
> > +		err = drm_gpusvm_range_get_pages(gpusvm, range,
> > &__ctx);
> > +		if (err)
> > +			goto err_range_remove;
> > +	}
> > +
> > +out_mmunlock:
> > +	if (!ctx->mmap_locked) {
> > +		mmap_write_unlock(mm);
> > +		mmput(mm);
> > +	}
> > +
> > +	return range;
> > +
> > +err_range_remove:
> > +	__drm_gpusvm_range_remove(notifier, range);
> > +err_notifier_remove:
> > +	if (notifier_alloc)
> > +		mmu_interval_notifier_remove(&notifier->notifier);
> > +err_notifier:
> > +	if (notifier_alloc)
> > +		drm_gpusvm_notifier_free(gpusvm, notifier);
> > +err_mmunlock:
> > +	if (!ctx->mmap_locked) {
> > +		mmap_write_unlock(mm);
> > +		mmput(mm);
> > +	}
> > +err_out:
> > +	return ERR_PTR(err);
> > +}
> > +
> > +/**
> > + * for_each_dma_page - iterate over pages in a DMA region
> > + * @i__: the current page index in the iteration
> > + * @j__: the current page index, log order, in the iteration
> > + * @npages__: the total number of pages in the DMA region
> > + * @order__: the order of the pages in the DMA region
> > + *
> > + * This macro iterates over each page in a DMA region. The DMA
> > region
> > + * is assumed to be composed of 2^@order__ pages, and the macro will
> > + * step through the region one block of 2^@order__ pages at a time.
> > + */
> > +#define for_each_dma_page(i__, j__, npages__, order__)	\
> > +	for ((i__) = 0, (j__) = 0; (i__) < (npages__);	\
> > +	     (j__)++, (i__) += 0x1 << (order__))
> > +
> > +/**
> > + * __drm_gpusvm_range_unmap_pages - Unmap pages associated with a
> > GPU SVM range (internal)
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + *
> > + * This function unmaps pages associated with a GPU SVM range.
> > Assumes and
> > + * asserts correct locking is in place when called.
> > + */
> > +static void __drm_gpusvm_range_unmap_pages(struct drm_gpusvm
> > *gpusvm,
> > +					   struct drm_gpusvm_range
> > *range)
> > +{
> > +	lockdep_assert_held(&gpusvm->notifier_lock);
> > +
> > +	if (range->pages) {
> > +		unsigned long i, j, npages = npages_in_range(range-
> > >va.start,
> > +							     range-
> > >va.end);
> > +
> > +		if (range->flags.has_dma_mapping) {
> > +			for_each_dma_page(i, j, npages, range-
> > >order)
> > +				dma_unmap_page(gpusvm->drm->dev,
> > +					       range->dma_addr[j],
> > +					       PAGE_SIZE << range-
> > >order,
> > +					       DMA_BIDIRECTIONAL);
> > +		}
> > +
> > +		range->flags.has_vram_pages = false;
> > +		range->flags.has_dma_mapping = false;
> > +	}
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_free_pages - Free pages associated with a GPU
> > SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + *
> > + * This function frees pages associated with a GPU SVM range.
> > + */
> > +static void drm_gpusvm_range_free_pages(struct drm_gpusvm *gpusvm,
> > +					struct drm_gpusvm_range
> > *range)
> > +{
> > +	lockdep_assert_held(&gpusvm->notifier_lock);
> > +
> > +	if (range->pages) {
> > +		if (range->flags.kfree_mapping) {
> > +			kfree(range->dma_addr);
> > +			range->flags.kfree_mapping = false;
> > +			range->pages = NULL;
> > +		} else {
> > +			kvfree(range->pages);
> > +			range->pages = NULL;
> > +		}
> > +	}
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_remove - Remove GPU SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range to be removed
> > + *
> > + * This function removes the specified GPU SVM range and also
> > removes the parent
> > + * GPU SVM notifier if no more ranges remain in the notifier. The
> > caller must
> > + * hold a lock to protect range and notifier removal.
> > + */
> > +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> > +			     struct drm_gpusvm_range *range)
> > +{
> > +	struct drm_gpusvm_notifier *notifier;
> > +
> > +	notifier = drm_gpusvm_notifier_find(gpusvm, range-
> > >va.start);
> > +	if (WARN_ON_ONCE(!notifier))
> > +		return;
> > +
> > +	drm_gpusvm_notifier_lock(gpusvm);
> > +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > +	drm_gpusvm_range_free_pages(gpusvm, range);
> > +	__drm_gpusvm_range_remove(notifier, range);
> > +	drm_gpusvm_notifier_unlock(gpusvm);
> > +
> > +	drm_gpusvm_range_put(range);
> > +
> > +	if (RB_EMPTY_ROOT(&notifier->root.rb_root)) {
> > +		if (!notifier->flags.removed)
> > +			mmu_interval_notifier_remove(&notifier-
> > >notifier);
> > +		drm_gpusvm_notifier_remove(gpusvm, notifier);
> > +		drm_gpusvm_notifier_free(gpusvm, notifier);
> > +	}
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_get - Get a reference to GPU SVM range
> > + * @range: Pointer to the GPU SVM range
> > + *
> > + * This function increments the reference count of the specified GPU
> > SVM range.
> > + *
> > + * Returns:
> > + * Pointer to the GPU SVM range.
> > + */
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_get(struct drm_gpusvm_range *range)
> > +{
> > +	kref_get(&range->refcount);
> > +
> > +	return range;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_destroy - Destroy GPU SVM range
> > + * @refcount: Pointer to the reference counter embedded in the GPU
> > SVM range
> > + *
> > + * This function destroys the specified GPU SVM range when its
> > reference count
> > + * reaches zero. If a custom range-free function is provided, it is
> > invoked to
> > + * free the range; otherwise, the range is deallocated using
> > kfree().
> > + */
> > +static void drm_gpusvm_range_destroy(struct kref *refcount)
> > +{
> > +	struct drm_gpusvm_range *range =
> > +		container_of(refcount, struct drm_gpusvm_range,
> > refcount);
> > +	struct drm_gpusvm *gpusvm = range->gpusvm;
> > +
> > +	if (gpusvm->ops->range_free)
> > +		gpusvm->ops->range_free(range);
> > +	else
> > +		kfree(range);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_put - Put a reference to GPU SVM range
> > + * @range: Pointer to the GPU SVM range
> > + *
> > + * This function decrements the reference count of the specified GPU
> > SVM range
> > + * and frees it when the count reaches zero.
> > + */
> > +void drm_gpusvm_range_put(struct drm_gpusvm_range *range)
> > +{
> > +	kref_put(&range->refcount, drm_gpusvm_range_destroy);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_pages_valid - GPU SVM range pages valid
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + *
> > + * This function determines if a GPU SVM range's pages are valid.
> > + * Expected to be called while holding gpusvm->notifier_lock and as the
> > + * last step before committing a GPU binding.
> > + *
> > + * Returns:
> > + * True if GPU SVM range has valid pages, False otherwise
> > + */
> > +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> > +				  struct drm_gpusvm_range *range)
> > +{
> > +	lockdep_assert_held(&gpusvm->notifier_lock);
> > +
> > +	return range->flags.has_vram_pages || range-
> > >flags.has_dma_mapping;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_pages_valid_unlocked - GPU SVM range pages valid
> > unlocked
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + *
> > + * This function determines if a GPU SVM range's pages are valid.
> > + * Expected to be called without holding gpusvm->notifier_lock.
> > + *
> > + * Returns:
> > + * True if GPU SVM range has valid pages, False otherwise
> > + */
> > +static bool
> > +drm_gpusvm_range_pages_valid_unlocked(struct drm_gpusvm *gpusvm,
> > +				      struct drm_gpusvm_range
> > *range)
> > +{
> > +	bool pages_valid;
> > +
> > +	if (!range->pages)
> > +		return false;
> > +
> > +	drm_gpusvm_notifier_lock(gpusvm);
> > +	pages_valid = drm_gpusvm_range_pages_valid(gpusvm, range);
> > +	if (!pages_valid && range->flags.kfree_mapping) {
> > +		kfree(range->dma_addr);
> > +		range->flags.kfree_mapping = false;
> > +		range->pages = NULL;
> > +	}
> > +	drm_gpusvm_notifier_unlock(gpusvm);
> > +
> > +	return pages_valid;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_get_pages - Get pages for a GPU SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + * @ctx: GPU SVM context
> > + *
> > + * This function gets pages for a GPU SVM range and ensures they are
> > mapped for
> > + * DMA access.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> > +			       struct drm_gpusvm_range *range,
> > +			       const struct drm_gpusvm_ctx *ctx)
> > +{
> > +	struct mmu_interval_notifier *notifier = &range->notifier-
> > >notifier;
> > +	struct hmm_range hmm_range = {
> > +		.default_flags = HMM_PFN_REQ_FAULT | (ctx->read_only
> > ? 0 :
> > +			HMM_PFN_REQ_WRITE),
> > +		.notifier = notifier,
> > +		.start = range->va.start,
> > +		.end = range->va.end,
> > +		.dev_private_owner = gpusvm-
> > >device_private_page_owner,
> > +	};
> > +	struct mm_struct *mm = gpusvm->mm;
> > +	unsigned long timeout =
> > +		jiffies +
> > msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > +	unsigned long i, j;
> > +	unsigned long npages = npages_in_range(range->va.start,
> > range->va.end);
> > +	unsigned int order = 0;
> > +	unsigned long *pfns;
> > +	struct page **pages;
> > +	int err = 0;
> > +	bool vram_pages = !!range->flags.migrate_vram;
> > +	bool alloc_pfns = false, kfree_mapping;
> > +
> > +retry:
> > +	kfree_mapping = false;
> > +	hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
> > +	if (drm_gpusvm_range_pages_valid_unlocked(gpusvm, range))
> > +		return 0;
> > +
> > +	if (range->notifier_seq == hmm_range.notifier_seq && range-
> > >pages) {
> > +		if (ctx->prefault)
> > +			return 0;
> > +
> > +		pfns = (unsigned long *)range->pages;
> > +		pages = range->pages;
> > +		goto map_pages;
> > +	}
> > +
> > +	if (!range->pages) {
> > +		pfns = kvmalloc_array(npages, sizeof(*pfns),
> > GFP_KERNEL);
> > +		if (!pfns)
> > +			return -ENOMEM;
> > +		alloc_pfns = true;
> > +	} else {
> > +		pfns = (unsigned long *)range->pages;
> > +	}
> > +
> > +	if (!ctx->mmap_locked) {
> > +		if (!mmget_not_zero(mm)) {
> > +			err = -EFAULT;
> > +			goto err_out;
> > +		}
> > +	}
> > +
> > +	hmm_range.hmm_pfns = pfns;
> > +	while (true) {
> > +		/* Must be checked after mmu_interval_read_begin */
> > +		if (range->flags.unmapped) {
> > +			err = -EFAULT;
> > +			break;
> > +		}
> > +
> > +		if (!ctx->mmap_locked) {
> > +			/*
> > +			 * XXX: HMM locking document indicates only
> > a read-lock
> > +			 * is required but there appears to be a
> > window between
> > +			 * the MMU_NOTIFY_MIGRATE event triggered in
> > a CPU fault
> > +			 * via migrate_vma_setup and the pages
> > actually moving
> > +			 * in migrate_vma_finalize in which this
> > code can grab
> > +			 * garbage pages. Grabbing the write-lock if
> > the range
> > +			 * is attached to vram appears to protect
> > against this
> > +			 * race.
> > +			 */
> > +			if (vram_pages)
> > +				mmap_write_lock(mm);
> > +			else
> > +				mmap_read_lock(mm);
> > +		}
> > +		err = hmm_range_fault(&hmm_range);
> > +		if (!ctx->mmap_locked) {
> > +			if (vram_pages)
> > +				mmap_write_unlock(mm);
> > +			else
> > +				mmap_read_unlock(mm);
> > +		}
> > +
> > +		if (err == -EBUSY) {
> > +			if (time_after(jiffies, timeout))
> > +				break;
> > +
> > +			hmm_range.notifier_seq =
> > mmu_interval_read_begin(notifier);
> > +			continue;
> > +		}
> > +		break;
> > +	}
> > +	if (!ctx->mmap_locked)
> > +		mmput(mm);
> > +	if (err)
> > +		goto err_free;
> > +
> > +	pages = (struct page **)pfns;
> > +
> > +	if (ctx->prefault) {
> > +		range->pages = pages;
> > +		goto set_seqno;
> > +	}
> > +
> > +map_pages:
> > +	if (is_device_private_page(hmm_pfn_to_page(pfns[0]))) {
> > +		WARN_ON_ONCE(!range->vram_allocation);
> > +
> > +		for (i = 0; i < npages; ++i) {
> > +			pages[i] = hmm_pfn_to_page(pfns[i]);
> > +
> > +			if
> > (WARN_ON_ONCE(!is_device_private_page(pages[i]))) {
> > +				err = -EOPNOTSUPP;
> > +				goto err_free;
> > +			}
> > +		}
> > +
> > +		/* Do not race with notifier unmapping pages */
> > +		drm_gpusvm_notifier_lock(gpusvm);
> > +		range->flags.has_vram_pages = true;
> > +		range->pages = pages;
> > +		if (mmu_interval_read_retry(notifier,
> > hmm_range.notifier_seq)) {
> > +			err = -EAGAIN;
> > +			__drm_gpusvm_range_unmap_pages(gpusvm,
> > range);
> > +		}
> > +		drm_gpusvm_notifier_unlock(gpusvm);
> > +	} else {
> > +		dma_addr_t *dma_addr = (dma_addr_t *)pfns;
> > +
> > +		for_each_dma_page(i, j, npages, order) {
> > +			if (WARN_ON_ONCE(i && order !=
> > +					
> > hmm_pfn_to_map_order(pfns[i]))) {
> > +				err = -EOPNOTSUPP;
> > +				npages = i;
> > +				goto err_unmap;
> > +			}
> > +			order = hmm_pfn_to_map_order(pfns[i]);
> > +
> > +			pages[j] = hmm_pfn_to_page(pfns[i]);
> > +			if
> > (WARN_ON_ONCE(is_zone_device_page(pages[j]))) {
> > +				err = -EOPNOTSUPP;
> > +				npages = i;
> > +				goto err_unmap;
> > +			}
> > +
> > +			set_page_dirty_lock(pages[j]);
> > +			mark_page_accessed(pages[j]);
> > +
> > +			dma_addr[j] = dma_map_page(gpusvm->drm->dev,
> > +						   pages[j], 0,
> > +						   PAGE_SIZE <<
> > order,
> > +						  
> > DMA_BIDIRECTIONAL);
> > +			if (dma_mapping_error(gpusvm->drm->dev,
> > dma_addr[j])) {
> > +				err = -EFAULT;
> > +				npages = i;
> > +				goto err_unmap;
> > +			}
> > +		}
> > +
> > +		/* Huge pages, reduce memory footprint */
> > +		if (order) {
> > +			dma_addr = kmalloc_array(j,
> > sizeof(*dma_addr),
> > +						 GFP_KERNEL);
> > +			if (dma_addr) {
> > +				for (i = 0; i < j; ++i)
> > +					dma_addr[i] =
> > (dma_addr_t)pfns[i];
> > +				kvfree(pfns);
> > +				kfree_mapping = true;
> > +			} else {
> > +				dma_addr = (dma_addr_t *)pfns;
> > +			}
> > +		}
> > +
> > +		/* Do not race with notifier unmapping pages */
> > +		drm_gpusvm_notifier_lock(gpusvm);
> > +		range->order = order;
> > +		range->flags.kfree_mapping = kfree_mapping;
> > +		range->flags.has_dma_mapping = true;
> > +		range->dma_addr = dma_addr;
> > +		range->vram_allocation = NULL;
> > +		if (mmu_interval_read_retry(notifier,
> > hmm_range.notifier_seq)) {
> > +			err = -EAGAIN;
> > +			__drm_gpusvm_range_unmap_pages(gpusvm,
> > range);
> > +		}
> > +		drm_gpusvm_notifier_unlock(gpusvm);
> > +	}
> > +
> > +	if (err == -EAGAIN)
> > +		goto retry;
> > +set_seqno:
> > +	range->notifier_seq = hmm_range.notifier_seq;
> > +
> > +	return 0;
> > +
> > +err_unmap:
> > +	for_each_dma_page(i, j, npages, order)
> > +		dma_unmap_page(gpusvm->drm->dev,
> > +			       (dma_addr_t)pfns[j],
> > +			       PAGE_SIZE << order,
> > DMA_BIDIRECTIONAL);
> > +err_free:
> > +	if (alloc_pfns)
> > +		kvfree(pfns);
> > +err_out:
> > +	return err;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_unmap_pages - Unmap pages associated with a GPU
> > SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + * @ctx: GPU SVM context
> > + *
> > + * This function unmaps pages associated with a GPU SVM range. If
> > @in_notifier
> > + * is set, it is assumed that gpusvm->notifier_lock is held in write
> > mode; if it
> > + * is clear, it acquires gpusvm->notifier_lock in read mode. Must be
> > called on
> > + * each GPU SVM range attached to notifier in gpusvm->ops-
> > >invalidate for IOMMU
> > + * security model.
> > + */
> > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> > +				  struct drm_gpusvm_range *range,
> > +				  const struct drm_gpusvm_ctx *ctx)
> > +{
> > +	if (ctx->in_notifier)
> > +		lockdep_assert_held_write(&gpusvm->notifier_lock);
> > +	else
> > +		drm_gpusvm_notifier_lock(gpusvm);
> > +
> > +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > +
> > +	if (!ctx->in_notifier)
> > +		drm_gpusvm_notifier_unlock(gpusvm);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migration_put_page - Put a migration page
> > + * @page: Pointer to the page to put
> > + *
> > + * This function unlocks and puts a page.
> > + */
> > +static void drm_gpusvm_migration_put_page(struct page *page)
> > +{
> > +	unlock_page(page);
> > +	put_page(page);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migration_put_pages - Put migration pages
> > + * @npages: Number of pages
> > + * @migrate_pfn: Array of migrate page frame numbers
> > + *
> > + * This function puts an array of pages.
> > + */
> > +static void drm_gpusvm_migration_put_pages(unsigned long npages,
> > +					   unsigned long
> > *migrate_pfn)
> > +{
> > +	unsigned long i;
> > +
> > +	for (i = 0; i < npages; ++i) {
> > +		if (!migrate_pfn[i])
> > +			continue;
> > +
> > +		drm_gpusvm_migration_put_page(migrate_pfn_to_page(mi
> > grate_pfn[i]));
> > +		migrate_pfn[i] = 0;
> > +	}
> > +}
> > +
> > +/**
> > + * drm_gpusvm_get_vram_page - Get a reference to a VRAM page
> > + * @page: Pointer to the page
> > + * @zdd: Pointer to the GPU SVM zone device data
> > + *
> > + * This function associates the given page with the specified GPU
> > SVM zone
> > + * device data and initializes it for zone device usage.
> > + */
> > +static void drm_gpusvm_get_vram_page(struct page *page,
> > +				     struct drm_gpusvm_zdd *zdd)
> > +{
> > +	page->zone_device_data = drm_gpusvm_zdd_get(zdd);
> > +	zone_device_page_init(page);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_map_pages() - Map migration pages for GPU SVM
> > migration
> > + * @dev: The device for which the pages are being mapped
> > + * @dma_addr: Array to store DMA addresses corresponding to mapped
> > pages
> > + * @migrate_pfn: Array of migrate page frame numbers to map
> > + * @npages: Number of pages to map
> > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > + *
> > + * This function maps pages of memory for migration usage in GPU
> > SVM. It
> > + * iterates over each page frame number provided in @migrate_pfn,
> > maps the
> > + * corresponding page, and stores the DMA address in the provided
> > @dma_addr
> > + * array.
> > + *
> > + * Return: 0 on success, -EFAULT if an error occurs during mapping.
> > + */
> > +static int drm_gpusvm_migrate_map_pages(struct device *dev,
> > +					dma_addr_t *dma_addr,
> > +					long unsigned int
> > *migrate_pfn,
> > +					unsigned long npages,
> > +					enum dma_data_direction dir)
> > +{
> > +	unsigned long i;
> > +
> > +	for (i = 0; i < npages; ++i) {
> > +		struct page *page =
> > migrate_pfn_to_page(migrate_pfn[i]);
> > +
> > +		if (!page)
> > +			continue;
> > +
> > +		if (WARN_ON_ONCE(is_zone_device_page(page)))
> > +			return -EFAULT;
> > +
> > +		dma_addr[i] = dma_map_page(dev, page, 0, PAGE_SIZE,
> > dir);
> > +		if (dma_mapping_error(dev, dma_addr[i]))
> > +			return -EFAULT;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_unmap_pages() - Unmap pages previously mapped
> > for GPU SVM migration
> > + * @dev: The device for which the pages were mapped
> > + * @dma_addr: Array of DMA addresses corresponding to mapped pages
> > + * @npages: Number of pages to unmap
> > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > + *
> > + * This function unmaps previously mapped pages of memory for GPU
> > Shared Virtual
> > + * Memory (SVM). It iterates over each DMA address provided in
> > @dma_addr, checks
> > + * if it's valid and not already unmapped, and unmaps the
> > corresponding page.
> > + */
> > +static void drm_gpusvm_migrate_unmap_pages(struct device *dev,
> > +					   dma_addr_t *dma_addr,
> > +					   unsigned long npages,
> > +					   enum dma_data_direction
> > dir)
> > +{
> > +	unsigned long i;
> > +
> > +	for (i = 0; i < npages; ++i) {
> > +		if (!dma_addr[i] || dma_mapping_error(dev,
> > dma_addr[i]))
> > +			continue;
> > +
> > +		dma_unmap_page(dev, dma_addr[i], PAGE_SIZE, dir);
> > +	}
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_to_vram - Migrate GPU SVM range to VRAM
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + * @vram_allocation: Driver-private pointer to the VRAM allocation.
> > The caller
> > + *                   should hold a reference to the VRAM allocation,
> > which
> > + *                   should be dropped via ops->vram_release or
> > upon the
> > + *                   failure of this function.
> > + * @ctx: GPU SVM context
> > + *
> > + * This function migrates the specified GPU SVM range to VRAM. It
> > performs the
> > + * necessary setup and invokes the driver-specific operations for
> > migration to
> > + * VRAM. Upon successful return, @vram_allocation can safely
> > reference @range
> > + * until ops->vram_release is called, which only happens upon
> > + * successful return.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> > +			       struct drm_gpusvm_range *range,
> > +			       void *vram_allocation,
> > +			       const struct drm_gpusvm_ctx *ctx)
> > +{
> > +	u64 start = range->va.start, end = range->va.end;
> > +	struct migrate_vma migrate = {
> > +		.start		= start,
> > +		.end		= end,
> > +		.pgmap_owner	= gpusvm->device_private_page_owner,
> > +		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
> > +	};
> > +	struct mm_struct *mm = gpusvm->mm;
> > +	unsigned long i, npages = npages_in_range(start, end);
> > +	struct vm_area_struct *vas;
> > +	struct drm_gpusvm_zdd *zdd = NULL;
> > +	struct page **pages;
> > +	dma_addr_t *dma_addr;
> > +	void *buf;
> > +	int err;
> > +
> > +	if (!range->flags.migrate_vram)
> > +		return -EINVAL;
> > +
> > +	if (!gpusvm->ops->populate_vram_pfn || !gpusvm->ops-
> > >copy_to_vram ||
> > +	    !gpusvm->ops->copy_to_sram)
> > +		return -EOPNOTSUPP;
> > +
> > +	if (!ctx->mmap_locked) {
> > +		if (!mmget_not_zero(mm)) {
> > +			err = -EFAULT;
> > +			goto err_out;
> > +		}
> > +		mmap_write_lock(mm);
> > +	}
> > +
> > +	mmap_assert_locked(mm);
> > +
> > +	vas = vma_lookup(mm, start);
> > +	if (!vas) {
> > +		err = -ENOENT;
> > +		goto err_mmunlock;
> > +	}
> > +
> > +	if (end > vas->vm_end || start < vas->vm_start) {
> > +		err = -EINVAL;
> > +		goto err_mmunlock;
> > +	}
> > +
> > +	if (!vma_is_anonymous(vas)) {
> > +		err = -EBUSY;
> > +		goto err_mmunlock;
> > +	}
> > +
> > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) +
> > sizeof(*dma_addr) +
> > +		       sizeof(*pages), GFP_KERNEL);
> > +	if (!buf) {
> > +		err = -ENOMEM;
> > +		goto err_mmunlock;
> > +	}
> > +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr))
> > * npages;
> > +
> > +	zdd = drm_gpusvm_zdd_alloc(range);
> > +	if (!zdd) {
> > +		err = -ENOMEM;
> > +		goto err_free;
> > +	}
> > +
> > +	migrate.vma = vas;
> > +	migrate.src = buf;
> > +	migrate.dst = migrate.src + npages;
> > +
> > +	err = migrate_vma_setup(&migrate);
> > +	if (err)
> > +		goto err_free;
> > +
> > +	/*
> > +	 * FIXME: The cases below, !migrate.cpages and migrate.cpages !=
> > +	 * npages, are not always an error. Need to revisit possible cases
> > +	 * and how to handle them. We
> > +	 * could prefault on migrate.cpages != npages via
> > hmm_range_fault.
> > +	 */
> > +
> > +	if (!migrate.cpages) {
> > +		err = -EFAULT;
> > +		goto err_free;
> > +	}
> > +
> > +	if (migrate.cpages != npages) {
> > +		err = -EBUSY;
> > +		goto err_finalize;
> > +	}
> > +
> > +	err = gpusvm->ops->populate_vram_pfn(gpusvm,
> > vram_allocation, npages,
> > +					     migrate.dst);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> > dma_addr,
> > +					   migrate.src, npages,
> > DMA_TO_DEVICE);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	for (i = 0; i < npages; ++i) {
> > +		struct page *page = pfn_to_page(migrate.dst[i]);
> > +
> > +		pages[i] = page;
> > +		migrate.dst[i] = migrate_pfn(migrate.dst[i]);
> > +		drm_gpusvm_get_vram_page(page, zdd);
> > +	}
> > +
> > +	err = gpusvm->ops->copy_to_vram(gpusvm, pages, dma_addr,
> > npages);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	/* Upon success bind vram allocation to range and zdd */
> > +	range->vram_allocation = vram_allocation;
> > +	WRITE_ONCE(zdd->vram_allocation, vram_allocation);	/*
> > Owns ref */
> > +
> > +err_finalize:
> > +	if (err)
> > +		drm_gpusvm_migration_put_pages(npages, migrate.dst);
> > +	migrate_vma_pages(&migrate);
> > +	migrate_vma_finalize(&migrate);
> > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr,
> > npages,
> > +				       DMA_TO_DEVICE);
> > +err_free:
> > +	if (zdd)
> > +		drm_gpusvm_zdd_put(zdd);
> > +	kvfree(buf);
> > +err_mmunlock:
> > +	if (!ctx->mmap_locked) {
> > +		mmap_write_unlock(mm);
> > +		mmput(mm);
> > +	}
> > +err_out:
> > +	return err;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_populate_sram_pfn - Populate SRAM PFNs for a
> > VM area
> > + * @vas: Pointer to the VM area structure, can be NULL
> > + * @npages: Number of pages to populate
> > + * @src_mpfn: Source array of migrate PFNs
> > + * @mpfn: Array of migrate PFNs to populate
> > + * @addr: Start address for PFN allocation
> > + *
> > + * This function populates the SRAM migrate page frame numbers
> > (PFNs) for the
> > + * specified VM area structure. It allocates and locks pages in the
> > VM area for
> > + * SRAM usage. If @vas is non-NULL, alloc_page_vma() is used for
> > + * allocation; otherwise alloc_page() is used.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +static int drm_gpusvm_migrate_populate_sram_pfn(struct
> > vm_area_struct *vas,
> > +						unsigned long
> > npages,
> > +						unsigned long
> > *src_mpfn,
> > +						unsigned long *mpfn,
> > u64 addr)
> > +{
> > +	unsigned long i;
> > +
> > +	for (i = 0; i < npages; ++i, addr += PAGE_SIZE) {
> > +		struct page *page;
> > +
> > +		if (!(src_mpfn[i] & MIGRATE_PFN_MIGRATE))
> > +			continue;
> > +
> > +		if (vas)
> > +			page = alloc_page_vma(GFP_HIGHUSER, vas,
> > addr);
> > +		else
> > +			page = alloc_page(GFP_HIGHUSER);
> > +
> > +		if (!page)
> > +			return -ENOMEM;
> > +
> > +		lock_page(page);
> > +		mpfn[i] = migrate_pfn(page_to_pfn(page));
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_evict_to_sram - Evict GPU SVM range to SRAM
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + *
> > + * Similar to __drm_gpusvm_migrate_to_sram but does not require the mmap
> > + * lock; migration is done via the migrate_device_* functions. This is a
> > + * fallback path, as it is preferred to issue migrations with the mmap
> > + * lock held.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +static int drm_gpusvm_evict_to_sram(struct drm_gpusvm *gpusvm,
> > +				    struct drm_gpusvm_range *range)
> > +{
> > +	unsigned long npages;
> > +	struct page **pages;
> > +	unsigned long *src, *dst;
> > +	dma_addr_t *dma_addr;
> > +	void *buf;
> > +	int i, err = 0;
> > +
> > +	npages = npages_in_range(range->va.start, range->va.end);
> > +
> > +	buf = kvcalloc(npages, 2 * sizeof(*src) + sizeof(*dma_addr)
> > +
> > +		       sizeof(*pages), GFP_KERNEL);
> > +	if (!buf) {
> > +		err = -ENOMEM;
> > +		goto err_out;
> > +	}
> > +	src = buf;
> > +	dst = buf + (sizeof(*src) * npages);
> > +	dma_addr = buf + (2 * sizeof(*src) * npages);
> > +	pages = buf + (2 * sizeof(*src) + sizeof(*dma_addr)) *
> > npages;
> > +
> > +	err = gpusvm->ops->populate_vram_pfn(gpusvm, range-
> > >vram_allocation,
> > +					     npages, src);
> > +	if (err)
> > +		goto err_free;
> > +
> > +	err = migrate_device_vma_range(gpusvm->mm,
> > +				       gpusvm-
> > >device_private_page_owner, src,
> > +				       npages, range->va.start);
> > +	if (err)
> > +		goto err_free;
> > +
> > +	err = drm_gpusvm_migrate_populate_sram_pfn(NULL, npages,
> > src, dst, 0);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> > dma_addr,
> > +					   dst, npages,
> > DMA_BIDIRECTIONAL);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	for (i = 0; i < npages; ++i)
> > +		pages[i] = migrate_pfn_to_page(src[i]);
> > +
> > +	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr,
> > npages);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +err_finalize:
> > +	if (err)
> > +		drm_gpusvm_migration_put_pages(npages, dst);
> > +	migrate_device_pages(src, dst, npages);
> > +	migrate_device_finalize(src, dst, npages);
> > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr,
> > npages,
> > +				       DMA_BIDIRECTIONAL);
> > +err_free:
> > +	kvfree(buf);
> > +err_out:
> > +
> > +	return err;
> > +}
> > +
> > +/**
> > + * __drm_gpusvm_migrate_to_sram - Migrate GPU SVM range to SRAM
> > (internal)
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @vas: Pointer to the VM area structure
> > + * @page: Pointer to the page for fault handling (can be NULL)
> > + * @start: Start address of the migration range
> > + * @end: End address of the migration range
> > + *
> > + * This internal function performs the migration of the specified
> > GPU SVM range
> > + * to SRAM. It sets up the migration, populates and DMA maps SRAM
> > + * PFNs, and
> > + * invokes the driver-specific operations for migration to SRAM.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +static int __drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > +					struct vm_area_struct *vas,
> > +					struct page *page,
> > +					u64 start, u64 end)
> > +{
> > +	struct migrate_vma migrate = {
> > +		.vma		= vas,
> > +		.pgmap_owner	= gpusvm->device_private_page_owner,
> > +		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> > +		.fault_page	= page,
> > +	};
> > +	unsigned long npages;
> > +	struct page **pages;
> > +	dma_addr_t *dma_addr;
> > +	void *buf;
> > +	int i, err = 0;
> > +
> > +	mmap_assert_locked(gpusvm->mm);
> > +
> > +	/* Corner case where the VMA has been partially unmapped */
> > +	if (start < vas->vm_start)
> > +		start = vas->vm_start;
> > +	if (end > vas->vm_end)
> > +		end = vas->vm_end;
> > +
> > +	migrate.start = start;
> > +	migrate.end = end;
> > +	npages = npages_in_range(start, end);
> > +
> > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) +
> > sizeof(*dma_addr) +
> > +		       sizeof(*pages), GFP_KERNEL);
> > +	if (!buf) {
> > +		err = -ENOMEM;
> > +		goto err_out;
> > +	}
> > +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr))
> > * npages;
> > +
> > +	migrate.vma = vas;
> > +	migrate.src = buf;
> > +	migrate.dst = migrate.src + npages;
> > +
> > +	err = migrate_vma_setup(&migrate);
> > +	if (err)
> > +		goto err_free;
> > +
> > +	/* Raced with another CPU fault, nothing to do */
> > +	if (!migrate.cpages)
> > +		goto err_free;
> > +
> > +	err = drm_gpusvm_migrate_populate_sram_pfn(vas, npages,
> > +						   migrate.src,
> > migrate.dst,
> > +						   start);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> > dma_addr,
> > +					   migrate.dst, npages,
> > +					   DMA_BIDIRECTIONAL);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +	for (i = 0; i < npages; ++i)
> > +		pages[i] = migrate_pfn_to_page(migrate.src[i]);
> 
> See comments below which pages we actually want to migrate.
> 
> 
> > +
> > +	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr,
> > npages);
> > +	if (err)
> > +		goto err_finalize;
> > +
> > +err_finalize:
> > +	if (err)
> > +		drm_gpusvm_migration_put_pages(npages, migrate.dst);
> > +	migrate_vma_pages(&migrate);
> > +	migrate_vma_finalize(&migrate);
> > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev, dma_addr,
> > npages,
> > +				       DMA_BIDIRECTIONAL);
> > +err_free:
> > +	kvfree(buf);
> > +err_out:
> > +	mmap_assert_locked(gpusvm->mm);
> > +
> > +	return err;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_to_sram - Migrate (evict) GPU SVM range to
> > SRAM
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + * @ctx: GPU SVM context
> > + *
> > + * This function initiates the migration of the specified GPU SVM
> > range to
> > + * SRAM. It performs necessary checks and invokes the internal
> > migration
> > + * function for actual migration.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > +			       struct drm_gpusvm_range *range,
> > +			       const struct drm_gpusvm_ctx *ctx)
> > +{
> > +	u64 start = range->va.start, end = range->va.end;
> > +	struct mm_struct *mm = gpusvm->mm;
> > +	struct vm_area_struct *vas;
> > +	int err;
> > +	bool retry = false;
> > +
> > +	if (!ctx->mmap_locked) {
> > +		if (!mmget_not_zero(mm)) {
> > +			err = -EFAULT;
> > +			goto err_out;
> > +		}
> > +		if (ctx->trylock_mmap) {
> > +			if (!mmap_read_trylock(mm))  {
> > +				err =
> > drm_gpusvm_evict_to_sram(gpusvm, range);
> > +				goto err_mmput;
> > +			}
> > +		} else {
> > +			mmap_read_lock(mm);
> > +		}
> > +	}
> > +
> > +	mmap_assert_locked(mm);
> > +
> > +	/*
> > +	 * Loop required to find all VMAs for the corner case when
> > +	 * VRAM backing has been partially unmapped from MM's
> > address space.
> > +	 */
> > +again:
> > +	vas = find_vma(mm, start);
> > +	if (!vas) {
> > +		if (!retry)
> > +			err = -ENOENT;
> > +		goto err_mmunlock;
> > +	}
> > +
> > +	if (end <= vas->vm_start || start >= vas->vm_end) {
> > +		if (!retry)
> > +			err = -EINVAL;
> > +		goto err_mmunlock;
> > +	}
> > +
> > +	err = __drm_gpusvm_migrate_to_sram(gpusvm, vas, NULL, start,
> > end);
> 
> This function is typically called from the vm side to get a clean mm as
> a last resort after get_pages() fails. As such, should we have it evict
> *everything*, even foreign device memory, and mismatching local device
> pages? If so, we could use hmm_range_fault() with a NULL page owner +
> faulting to do that.
> 

I've actually tried that and it seemed to mostly work well; it would be my
preference as it avoids a VMA lookup in GPU SVM.

I think it is a problem, though, if some of the pages are partially
unmapped, as hmm_range_fault will abort if a fault cannot be resolved.
Maybe I'm mistaken on this. I won't get this into rev2 but will put it on
my list to continue to play around with.
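
For reference, the shape of what I tried looked roughly like the below
(an untested sketch from memory, not code from this series; the function
name is a placeholder and the retry/timeout handling is simplified):

static int drm_gpusvm_evict_via_hmm(struct drm_gpusvm *gpusvm,
				    struct drm_gpusvm_range *range)
{
	/*
	 * NULL dev_private_owner: any device-private page, ours or
	 * foreign, is faulted back to system memory via its driver's
	 * migrate_to_ram() handler.
	 */
	struct hmm_range hmm_range = {
		.default_flags = HMM_PFN_REQ_FAULT,
		.notifier = &range->notifier->notifier,
		.start = range->va.start,
		.end = range->va.end,
		.dev_private_owner = NULL,
	};
	unsigned long npages = npages_in_range(range->va.start, range->va.end);
	unsigned long *pfns;
	int err;

	pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
	if (!pfns)
		return -ENOMEM;

	hmm_range.hmm_pfns = pfns;
	mmap_read_lock(gpusvm->mm);	/* caller holds a reference on mm */
	do {
		hmm_range.notifier_seq =
			mmu_interval_read_begin(hmm_range.notifier);
		err = hmm_range_fault(&hmm_range);
	} while (err == -EBUSY);
	mmap_read_unlock(gpusvm->mm);

	kvfree(pfns);
	return err;
}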

> > +	if (err)
> > +		goto err_mmunlock;
> > +
> > +	if (vas->vm_end < end) {
> > +		retry = true;
> > +		start = vas->vm_end;
> > +		goto again;
> > +	}
> > +
> > +	if (!ctx->mmap_locked) {
> > +		mmap_read_unlock(mm);
> > +		/*
> > +		 * Using mmput_async as this function can be called
> > while
> > +		 * holding a dma-resv lock, and a final put can grab
> > the mmap
> > +		 * lock, causing a lock inversion.
> > +		 */
> > +		mmput_async(mm);
> > +	}
> > +
> > +	return 0;
> > +
> > +err_mmunlock:
> > +	if (!ctx->mmap_locked)
> > +		mmap_read_unlock(mm);
> > +err_mmput:
> > +	if (!ctx->mmap_locked)
> > +		mmput_async(mm);
> > +err_out:
> > +	return err;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_page_free - Put GPU SVM zone device data associated
> > with a page
> > + * @page: Pointer to the page
> > + *
> > + * This function is a callback used to put the GPU SVM zone device
> > data
> > + * associated with a page when it is being released.
> > + */
> > +static void drm_gpusvm_page_free(struct page *page)
> > +{
> > +	drm_gpusvm_zdd_put(page->zone_device_data);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_to_ram - Migrate GPU SVM range to RAM (page
> > fault handler)
> > + * @vmf: Pointer to the fault information structure
> > + *
> > + * This function is a page fault handler used to migrate a GPU SVM
> > range to RAM.
> > + * It retrieves the GPU SVM range information from the faulting page
> > and invokes
> > + * the internal migration function to migrate the range back to RAM.
> > + *
> > + * Returns:
> > + * VM_FAULT_SIGBUS on failure, 0 on success.
> > + */
> > +static vm_fault_t drm_gpusvm_migrate_to_ram(struct vm_fault *vmf)
> > +{
> > +	struct drm_gpusvm_zdd *zdd = vmf->page->zone_device_data;
> > +	int err;
> > +
> > +	err = __drm_gpusvm_migrate_to_sram(zdd->range->gpusvm,
> > +					   vmf->vma, vmf->page,
> > +					   zdd->range->va.start,
> > +					   zdd->range->va.end);
> 
> When called from here, since this is a pagemap op, we should ensure we
> only migrate our own pagemap to RAM?
> 

I think you resolve this with the following patch [1], right? I think I
agree.
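
Something like the below is the kind of check I have in mind (a hedged
sketch only, not code from that series; the helper name is made up):

static bool drm_gpusvm_owns_page(struct drm_gpusvm *gpusvm,
				 struct page *page)
{
	/* Only migrate device-private pages backed by our own pagemap */
	return is_device_private_page(page) &&
	       page->pgmap->owner == gpusvm->device_private_page_owner;
}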

Matt

[1] https://patchwork.freedesktop.org/series/139994/

> /Thanks,
> Thomas
> 

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory
  2024-10-16  3:18     ` Matthew Brost
@ 2024-10-16  6:27       ` Thomas Hellström
  2024-10-16  8:24         ` Matthew Brost
  0 siblings, 1 reply; 100+ messages in thread
From: Thomas Hellström @ 2024-10-16  6:27 UTC (permalink / raw)
  To: Matthew Brost
  Cc: intel-xe, dri-devel, airlied, christian.koenig, matthew.auld,
	daniel

On Wed, 2024-10-16 at 03:18 +0000, Matthew Brost wrote:
> On Wed, Oct 09, 2024 at 12:50:42PM +0200, Thomas Hellström wrote:
> > Hi, Matthew.
> > 
> > Some comments below around migrating to SRAM.
> > 
> > 
> > On Tue, 2024-08-27 at 19:48 -0700, Matthew Brost wrote:
> > > This patch introduces support for GPU Shared Virtual Memory (SVM)
> > > in
> > > the
> > > Direct Rendering Manager (DRM) subsystem. SVM allows for seamless
> > > sharing of memory between the CPU and GPU, enhancing performance
> > > and
> > > flexibility in GPU computing tasks.
> > > 
> > > The patch adds the necessary infrastructure for SVM, including
> > > data
> > > structures and functions for managing SVM ranges and notifiers.
> > > It
> > > also
> > > provides mechanisms for allocating, deallocating, and migrating
> > > memory
> > > regions between system RAM and GPU VRAM.
> > > 
> > > This mid-layer is largely inspired by GPUVM.
> > > 
> > > Cc: Dave Airlie <airlied@redhat.com>
> > > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > Cc: Christian König <christian.koenig@amd.com>
> > > Cc: <dri-devel@lists.freedesktop.org>
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/Makefile     |    3 +-
> > >  drivers/gpu/drm/xe/drm_gpusvm.c | 2174
> > > +++++++++++++++++++++++++++++++
> > >  drivers/gpu/drm/xe/drm_gpusvm.h |  415 ++++++
> > >  3 files changed, 2591 insertions(+), 1 deletion(-)
> > >  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.c
> > >  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.h
> > > 
> > > diff --git a/drivers/gpu/drm/xe/Makefile
> > > b/drivers/gpu/drm/xe/Makefile
> > > index b9670ae09a9e..b8fc2ee58f1a 100644
> > > --- a/drivers/gpu/drm/xe/Makefile
> > > +++ b/drivers/gpu/drm/xe/Makefile
> > > @@ -25,7 +25,8 @@ $(obj)/generated/%_wa_oob.c
> > > $(obj)/generated/%_wa_oob.h: $(obj)/xe_gen_wa_oob \
> > >  
> > >  # core driver code
> > >  
> > > -xe-y += xe_bb.o \
> > > +xe-y += drm_gpusvm.o \
> > > +	xe_bb.o \
> > >  	xe_bo.o \
> > >  	xe_bo_evict.o \
> > >  	xe_devcoredump.o \
> > > diff --git a/drivers/gpu/drm/xe/drm_gpusvm.c
> > > b/drivers/gpu/drm/xe/drm_gpusvm.c
> > > new file mode 100644
> > > index 000000000000..fc1e44e6ae72
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/xe/drm_gpusvm.c
> > > @@ -0,0 +1,2174 @@
> > > +// SPDX-License-Identifier: MIT
> > > +/*
> > > + * Copyright © 2024 Intel Corporation
> > > + *
> > > + * Authors:
> > > + *     Matthew Brost <matthew.brost@intel.com>
> > > + */
> > > +
> > > +#include <linux/dma-mapping.h>
> > > +#include <linux/interval_tree_generic.h>
> > > +#include <linux/hmm.h>
> > > +#include <linux/memremap.h>
> > > +#include <linux/migrate.h>
> > > +#include <linux/mm_types.h>
> > > +#include <linux/pagemap.h>
> > > +#include <linux/slab.h>
> > > +
> > > +#include <drm/drm_device.h>
> > > +#include "drm_gpusvm.h"
> > > +
> > > +/**
> > > + * DOC: Overview
> > > + *
> > > + * GPU Shared Virtual Memory (GPU SVM) layer for the Direct
> > > Rendering Manager (DRM)
> > > + *
> > > + * The GPU SVM layer is a component of the DRM framework
> > > designed to
> > > manage shared
> > > + * virtual memory between the CPU and GPU. It enables efficient
> > > data
> > > exchange and
> > > + * processing for GPU-accelerated applications by allowing
> > > memory
> > > sharing and
> > > + * synchronization between the CPU's and GPU's virtual address
> > > spaces.
> > > + *
> > > + * Key GPU SVM Components:
> > > + * - Notifiers: Used for tracking memory intervals
> > > and
> > > notifying the
> > > + *		GPU of changes, notifiers are sized based on a
> > > GPU
> > > SVM
> > > + *		initialization parameter, with a recommendation
> > > of
> > > 512M or
> > > + *		larger. They maintain a Red-Black tree and a
> > > list of
> > > ranges that
> > > + *		fall within the notifier interval. Notifiers are
> > > tracked within
> > > + *		a GPU SVM Red-Black tree and list and are
> > > dynamically inserted
> > > + *		or removed as ranges within the interval are
> > > created
> > > or
> > > + *		destroyed.
> > > + * - Ranges: Represent memory ranges mapped in a DRM device and
> > > managed
> > > + *	     by GPU SVM. They are sized based on an array of
> > > chunk
> > > sizes, which
> > > + *	     is a GPU SVM initialization parameter, and the CPU
> > > address space.
> > > + *	     Upon GPU fault, the largest aligned chunk that fits
> > > within the
> > > + *	     faulting CPU address space is chosen for the range
> > > size. Ranges are
> > > + *	     expected to be dynamically allocated on GPU fault
> > > and
> > > removed on an
> > > + *	     MMU notifier UNMAP event. As mentioned above,
> > > ranges
> > > are tracked in
> > > + *	     a notifier's Red-Black tree.
> > > + * - Operations: Define the interface for driver-specific SVM
> > > operations such as
> > > + *		 allocation, page collection, migration,
> > > invalidations, and VRAM
> > > + *		 release.
> > > + *
> > > + * This layer provides interfaces for allocating, mapping,
> > > migrating, and
> > > + * releasing memory ranges between the CPU and GPU. It handles
> > > all
> > > core memory
> > > + * management interactions (DMA mapping, HMM, and migration) and
> > > provides
> > > + * driver-specific virtual functions (vfuncs). This
> > > infrastructure
> > > is sufficient
> > > + * to build the expected driver components for an SVM
> > > implementation
> > > as detailed
> > > + * below.
> > > + *
> > > + * Expected Driver Components:
> > > + * - GPU page fault handler: Used to create ranges and notifiers
> > > based on the
> > > + *			     fault address, optionally migrate
> > > the
> > > range to
> > > + *			     VRAM, and create GPU bindings.
> > > + * - Garbage collector: Used to destroy GPU bindings for ranges.
> > > Ranges are
> > > + *			expected to be added to the garbage
> > > collector upon
> > > + *			MMU_NOTIFY_UNMAP event.
> > > + */
> > > +
> > > +/**
> > > + * DOC: Locking
> > > + *
> > > + * GPU SVM handles locking for core MM interactions, i.e., it
> > > locks/unlocks the
> > > + * mmap lock as needed. Alternatively, if the driver prefers to
> > > handle the mmap
> > > + * lock itself, a 'locked' argument is provided to the functions
> > > that require
> > > + * the mmap lock. This option may be useful for drivers that
> > > need to
> > > call into
> > > + * GPU SVM while also holding a dma-resv lock, thus preventing
> > > locking
> > > + * inversions between the mmap and dma-resv locks.
> > > + *
> > > + * GPU SVM introduces a global notifier lock, which safeguards
> > > the
> > > notifier's
> > > + * range RB tree and list, as well as the range's DMA mappings
> > > and
> > > sequence
> > > + * number. GPU SVM manages all necessary locking and unlocking
> > > operations,
> > > + * except for the recheck of the range's sequence number
> > > + * (mmu_interval_read_retry) when the driver is committing GPU
> > > bindings. This
> > > + * lock corresponds to the 'driver->update' lock mentioned in
> > > the
> > > HMM
> > > + * documentation (TODO: Link). Future revisions may transition
> > > from
> > > a GPU SVM
> > > + * global lock to a per-notifier lock if finer-grained locking
> > > is
> > > deemed
> > > + * necessary.
> > > + *
> > > + * In addition to the locking mentioned above, the driver should
> > > implement a
> > > + * lock to safeguard core GPU SVM function calls that modify
> > > state,
> > > such as
> > > + * drm_gpusvm_range_find_or_insert and drm_gpusvm_range_remove.
> > > Alternatively,
> > > + * these core functions can be called within a single kernel
> > > thread,
> > > for
> > > + * instance, using an ordered work queue. This lock is denoted
> > > as
> > > + * 'driver_svm_lock' in code examples.
> > > + */
> > > +
> > > +/**
> > > + * DOC: Migration
> > > + *
> > > + * The migration support is quite simple, allowing migration
> > > between
> > > SRAM and
> > > + * VRAM at the range granularity. For example, GPU SVM currently
> > > does not
> > > + * support mixing SRAM and VRAM pages within a range. This means
> > > that upon GPU
> > > + * fault, the entire range can be migrated to VRAM, and upon CPU
> > > fault, the
> > > + * entire range is migrated to SRAM.
> > > + *
> > > + * The reasoning for only supporting range granularity is as
> > > follows: it
> > > + * simplifies the implementation, and range sizes are driver-
> > > defined
> > > and should
> > > + * be relatively small.
> > > + */
> > > +
> > > +/**
> > > + * DOC: Partial Unmapping of Ranges
> > > + *
> > > + * Partial unmapping of ranges (e.g., 1M out of 2M is unmapped
> > > by
> > > CPU resulting
> > > + * in MMU_NOTIFY_UNMAP event) presents several challenges, with
> > > the
> > > main one
> > > + * being that a subset of the range still has CPU and GPU
> > > mappings.
> > > If the
> > > + * backing store for the range is in VRAM, a subset of the
> > > backing
> > > store has
> > > + * references. One option would be to split the range and VRAM
> > > backing store,
> > > + * but the implementation for this would be quite complicated.
> > > Given
> > > that
> > > + * partial unmappings are rare and driver-defined range sizes
> > > are
> > > relatively
> > > + * small, GPU SVM does not support splitting of ranges.
> > > + *
> > > + * With no support for range splitting, upon partial unmapping
> > > of a
> > > range, the
> > > + * driver is expected to invalidate and destroy the entire
> > > range. If
> > > the range
> > > + * has VRAM as its backing, the driver is also expected to
> > > migrate
> > > any remaining
> > > + * pages back to SRAM.
> > > + */
> > > +
> > > +/**
> > > + * DOC: Examples
> > > + *
> > > + * This section provides two examples of how to build the
> > > expected
> > > driver
> > > + * components: the GPU page fault handler and the garbage
> > > collector.
> > > A third
> > > + * example demonstrates a sample invalidation driver vfunc.
> > > + *
> > > + * The generic code provided does not include logic for complex
> > > migration
> > > + * policies, optimized invalidations, or other potentially
> > > required
> > > driver
> > > + * locking (e.g., DMA-resv locks).
> > > + *
> > > + * 1) GPU page fault handler
> > > + *
> > > + *	int driver_bind_range(struct drm_gpusvm *gpusvm, struct
> > > drm_gpusvm_range *range)
> > > + *	{
> > > + *		int err = 0;
> > > + *
> > > + *		driver_alloc_and_setup_memory_for_bind(gpusvm,
> > > range);
> > > + *
> > > + *		drm_gpusvm_notifier_lock(gpusvm);
> > > + *		if (drm_gpusvm_range_pages_valid(range))
> > > + *			driver_commit_bind(gpusvm, range);
> > > + *		else
> > > + *			err = -EAGAIN;
> > > + *		drm_gpusvm_notifier_unlock(gpusvm);
> > > + *
> > > + *		return err;
> > > + *	}
> > > + *
> > > + *	int driver_gpu_fault(struct drm_gpusvm *gpusvm, u64
> > > fault_addr,
> > > + *			     u64 gpuva_start, u64 gpuva_end)
> > > + *	{
> > > + *		struct drm_gpusvm_ctx ctx = {};
> > > + *		int err;
> > > + *
> > > + *		driver_svm_lock();
> > > + *	retry:
> > > + *		// Always process UNMAPs first so view of GPU
> > > SVM
> > > ranges is current
> > > + *		driver_garbage_collector(gpusvm);
> > > + *
> > > + *		range = drm_gpusvm_range_find_or_insert(gpusvm, fault_addr,
> > > + *							gpuva_start, gpuva_end,
> > > + *							&ctx);
> > > + *		if (IS_ERR(range)) {
> > > + *			err = PTR_ERR(range);
> > > + *			goto unlock;
> > > + *		}
> > > + *
> > > + *		if (driver_migration_policy(range)) {
> > > + *			bo = driver_alloc_bo();
> > > + *			err = drm_gpusvm_migrate_to_vram(gpusvm,
> > > range, bo, &ctx);
> > > + *			if (err)	// CPU mappings may have
> > > changed
> > > + *				goto retry;
> > > + *		}
> > > + *
> > > + *		err = drm_gpusvm_range_get_pages(gpusvm, range,
> > > &ctx);
> > > + *		if (err == -EFAULT || err == -EPERM)	// CPU
> > > mappings changed
> > > + *			goto retry;
> > > + *		else if (err)
> > > + *			goto unlock;
> > > + *
> > > + *		err = driver_bind_range(gpusvm, range);
> > > + *		if (err == -EAGAIN)	// CPU mappings changed
> > > + *			goto retry;
> > > + *
> > > + *	unlock:
> > > + *		driver_svm_unlock();
> > > + *		return err;
> > > + *	}
> > > + *
> > > + * 2) Garbage Collector.
> > > + *
> > > + *	void __driver_garbage_collector(struct drm_gpusvm
> > > *gpusvm,
> > > + *					struct drm_gpusvm_range
> > > *range)
> > > + *	{
> > > + *		struct drm_gpusvm_ctx ctx = {};
> > > + *
> > > + *		assert_driver_svm_locked(gpusvm);
> > > + *
> > > + *		// Partial unmap, migrate any remaining VRAM
> > > pages
> > > back to SRAM
> > > + *		if (range->flags.partial_unmap)
> > > + *			drm_gpusvm_migrate_to_sram(gpusvm,
> > > range,
> > > &ctx);
> > > + *
> > > + *		driver_unbind_range(range);
> > > + *		drm_gpusvm_range_remove(gpusvm, range);
> > > + *	}
> > > + *
> > > + *	void driver_garbage_collector(struct drm_gpusvm *gpusvm)
> > > + *	{
> > > + *		assert_driver_svm_locked(gpusvm);
> > > + *
> > > + *		for_each_range_in_garbage_collector(gpusvm,
> > > range)
> > > + *			__driver_garbage_collector(gpusvm,
> > > range);
> > > + *	}
> > > + *
> > > + * 3) Invalidation driver vfunc.
> > > + *
> > > + *	void driver_invalidation(struct drm_gpusvm *gpusvm,
> > > + *				 struct drm_gpusvm_notifier
> > > *notifier,
> > > + *				 const struct mmu_notifier_range
> > > *mmu_range)
> > > + *	{
> > > + *		struct drm_gpusvm_ctx ctx = { .in_notifier =
> > > true,
> > > };
> > > + *		struct drm_gpusvm_range *range = NULL;
> > > + *
> > > + *		driver_invalidate_device_tlb(gpusvm, mmu_range->start,
> > > + *					     mmu_range->end);
> > > + *
> > > + *		drm_gpusvm_for_each_range(range, notifier,
> > > mmu_range->start,
> > > + *					  mmu_range->end) {
> > > + *			drm_gpusvm_range_unmap_pages(gpusvm,
> > > range,
> > > &ctx);
> > > + *
> > > + *			if (mmu_range->event !=
> > > MMU_NOTIFY_UNMAP)
> > > + *				continue;
> > > + *
> > > + *			drm_gpusvm_range_set_unmapped(range,
> > > mmu_range);
> > > + *			driver_garbage_collector_add(gpusvm,
> > > range);
> > > + *		}
> > > + *	}
> > > + */
> > > +
> > > +#define DRM_GPUSVM_RANGE_START(_range)	((_range)->va.start)
> > > +#define DRM_GPUSVM_RANGE_END(_range)	((_range)->va.end - 1)
> > > +INTERVAL_TREE_DEFINE(struct drm_gpusvm_range, rb.node, u64,
> > > rb.__subtree_last,
> > > +		     DRM_GPUSVM_RANGE_START,
> > > DRM_GPUSVM_RANGE_END,
> > > +		     static __maybe_unused, range);
> > > +
> > > +#define DRM_GPUSVM_NOTIFIER_START(_notifier)	((_notifier)->interval.start)
> > > +#define DRM_GPUSVM_NOTIFIER_END(_notifier)	((_notifier)->interval.end - 1)
> > > +INTERVAL_TREE_DEFINE(struct drm_gpusvm_notifier, rb.node, u64,
> > > +		     rb.__subtree_last,
> > > DRM_GPUSVM_NOTIFIER_START,
> > > +		     DRM_GPUSVM_NOTIFIER_END, static
> > > __maybe_unused,
> > > notifier);
> > > +
> > > +/**
> > > + * npages_in_range() - Calculate the number of pages in a given
> > > range
> > > + * @start__: The start address of the range
> > > + * @end__: The end address of the range
> > > + *
> > > + * This macro calculates the number of pages in a given memory
> > > range,
> > > + * specified by the start and end addresses. It divides the
> > > difference
> > > + * between the end and start addresses by the page size
> > > (PAGE_SIZE)
> > > to
> > > + * determine the number of pages in the range.
> > > + *
> > > + * Return: The number of pages in the specified range.
> > > + */
> > > +#define npages_in_range(start__, end__)	\
> > > +	(((end__) - (start__)) >> PAGE_SHIFT)
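
(For example, with 4K pages npages_in_range(0x200000, 0x204000) is
0x4000 >> 12 = 4.)
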
> > > +
> > > +/**
> > > + * struct drm_gpusvm_zdd - GPU SVM zone device data
> > > + *
> > > + * @refcount: Reference count for the zdd
> > > + * @destroy_work: Work structure for asynchronous zdd
> > > destruction
> > > + * @range: Pointer to the GPU SVM range
> > > + * @vram_allocation: Driver-private pointer to the VRAM
> > > allocation
> > > + *
> > > + * This structure serves as a generic wrapper installed in
> > > + * page->zone_device_data. It provides infrastructure for
> > > looking up
> > > a range
> > > + * upon CPU page fault and asynchronously releasing VRAM once
> > > the
> > > CPU has no
> > > + * page references. Asynchronous release is useful because CPU
> > > page
> > > references
> > > + * can be dropped in IRQ contexts, while releasing VRAM likely
> > > requires sleeping
> > > + * locks.
> > > + */
> > > +struct drm_gpusvm_zdd {
> > > +	struct kref refcount;
> > > +	struct work_struct destroy_work;
> > > +	struct drm_gpusvm_range *range;
> > > +	void *vram_allocation;
> > > +};
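
To make the CPU-fault lookup path concrete: whatever ends up implementing the
dev_pagemap_ops fault handler only needs the faulting page to get back to the
range (sketch only, 'vmf' being the usual struct vm_fault pointer):

	struct drm_gpusvm_zdd *zdd = vmf->page->zone_device_data;
	struct drm_gpusvm_range *range = zdd->range;

which is what the zone_device_data wrapper described above enables.
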
> > > +
> > > +/**
> > > + * drm_gpusvm_zdd_destroy_work_func - Work function for
> > > destroying a
> > > zdd
> > > + * @w: Pointer to the work_struct
> > > + *
> > > + * This function releases VRAM, puts GPU SVM range, and frees
> > > zdd.
> > > + */
> > > +static void drm_gpusvm_zdd_destroy_work_func(struct work_struct
> > > *w)
> > > +{
> > > +	struct drm_gpusvm_zdd *zdd =
> > > +		container_of(w, struct drm_gpusvm_zdd,
> > > destroy_work);
> > > +	struct drm_gpusvm_range *range = zdd->range;
> > > +	struct drm_gpusvm *gpusvm = range->gpusvm;
> > > +
> > > +	if (gpusvm->ops->vram_release && zdd->vram_allocation)
> > > +		gpusvm->ops->vram_release(zdd->vram_allocation);
> > > +	drm_gpusvm_range_put(range);
> > > +	kfree(zdd);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_zdd_alloc - Allocate a zdd structure.
> > > + * @range: Pointer to the GPU SVM range.
> > > + *
> > > + * This function allocates and initializes a new zdd structure.
> > > It
> > > sets up the
> > > + * reference count, initializes the destroy work, and links the
> > > provided GPU SVM
> > > + * range.
> > > + *
> > > + * Returns:
> > > + * Pointer to the allocated zdd on success, ERR_PTR() on
> > > failure.
> > > + */
> > > +static struct drm_gpusvm_zdd *
> > > +drm_gpusvm_zdd_alloc(struct drm_gpusvm_range *range)
> > > +{
> > > +	struct drm_gpusvm_zdd *zdd;
> > > +
> > > +	zdd = kmalloc(sizeof(*zdd), GFP_KERNEL);
> > > +	if (!zdd)
> > > +		return NULL;
> > > +
> > > +	kref_init(&zdd->refcount);
> > > +	INIT_WORK(&zdd->destroy_work,
> > > drm_gpusvm_zdd_destroy_work_func);
> > > +	zdd->range = drm_gpusvm_range_get(range);
> > > +	zdd->vram_allocation = NULL;
> > > +
> > > +	return zdd;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_zdd_get - Get a reference to a zdd structure.
> > > + * @zdd: Pointer to the zdd structure.
> > > + *
> > > + * This function increments the reference count of the provided
> > > zdd
> > > structure.
> > > + *
> > > + * Returns: Pointer to the zdd structure.
> > > + */
> > > +static struct drm_gpusvm_zdd *drm_gpusvm_zdd_get(struct
> > > drm_gpusvm_zdd *zdd)
> > > +{
> > > +	kref_get(&zdd->refcount);
> > > +	return zdd;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_zdd_destroy - Destroy a zdd structure.
> > > + * @ref: Pointer to the reference count structure.
> > > + *
> > > + * This function queues the destroy_work of the zdd for
> > > asynchronous
> > > destruction.
> > > + */
> > > +static void drm_gpusvm_zdd_destroy(struct kref *ref)
> > > +{
> > > +	struct drm_gpusvm_zdd *zdd =
> > > +		container_of(ref, struct drm_gpusvm_zdd,
> > > refcount);
> > > +	struct drm_gpusvm *gpusvm = zdd->range->gpusvm;
> > > +
> > > +	queue_work(gpusvm->zdd_wq, &zdd->destroy_work);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_zdd_put - Put a zdd reference.
> > > + * @zdd: Pointer to the zdd structure.
> > > + *
> > > + * This function decrements the reference count of the provided
> > > zdd
> > > structure
> > > + * and schedules its destruction if the count drops to zero.
> > > + */
> > > +static void drm_gpusvm_zdd_put(struct drm_gpusvm_zdd *zdd)
> > > +{
> > > +	kref_put(&zdd->refcount, drm_gpusvm_zdd_destroy);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_find - Find GPU SVM range from GPU SVM
> > > notifier
> > > + * @notifier: Pointer to the GPU SVM notifier structure.
> > > + * @start: Start address of the range
> > > + * @end: End address of the range
> > > + *
> > > + * Return: A pointer to the drm_gpusvm_range if found or NULL
> > > + */
> > > +struct drm_gpusvm_range *
> > > +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64
> > > start, u64 end)
> > > +{
> > > +	return range_iter_first(&notifier->root, start, end -
> > > 1);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_for_each_range_safe - Safely iterate over GPU SVM
> > > ranges in a notifier
> > > + * @range__: Iterator variable for the ranges
> > > + * @next__: Iterator variable for the ranges temporary storage
> > > + * @notifier__: Pointer to the GPU SVM notifier
> > > + * @start__: Start address of the range
> > > + * @end__: End address of the range
> > > + *
> > > + * This macro is used to iterate over GPU SVM ranges in a
> > > notifier
> > > while
> > > + * removing ranges from it.
> > > + */
> > > +#define drm_gpusvm_for_each_range_safe(range__, next__, notifier__, start__, end__)	\
> > > +	for ((range__) = drm_gpusvm_range_find((notifier__), (start__), (end__)),	\
> > > +	     (next__) = __drm_gpusvm_range_next(range__);				\
> > > +	     (range__) && (range__->va.start < (end__));				\
> > > +	     (range__) = (next__), (next__) = __drm_gpusvm_range_next(range__))
> > > +
> > > +/**
> > > + * __drm_gpusvm_notifier_next - get the next drm_gpusvm_notifier
> > > in
> > > the list
> > > + * @notifier: a pointer to the current drm_gpusvm_notifier
> > > + *
> > > + * Return: A pointer to the next drm_gpusvm_notifier if
> > > available,
> > > or NULL if
> > > + *         the current notifier is the last one or if the input
> > > notifier is
> > > + *         NULL.
> > > + */
> > > +static struct drm_gpusvm_notifier *
> > > +__drm_gpusvm_notifier_next(struct drm_gpusvm_notifier *notifier)
> > > +{
> > > +	if (notifier && !list_is_last(&notifier->rb.entry,
> > > +				      &notifier->gpusvm->notifier_list))
> > > +		return list_next_entry(notifier, rb.entry);
> > > +
> > > +	return NULL;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_for_each_notifier - Iterate over GPU SVM notifiers
> > > in
> > > a gpusvm
> > > + * @notifier__: Iterator variable for the notifiers
> > > + * @gpusvm__: Pointer to the GPU SVM structure
> > > + * @start__: Start address of the notifier
> > > + * @end__: End address of the notifier
> > > + *
> > > + * This macro is used to iterate over GPU SVM notifiers in a
> > > gpusvm.
> > > + */
> > > +#define drm_gpusvm_for_each_notifier(notifier__, gpusvm__, start__, end__)		\
> > > +	for ((notifier__) = notifier_iter_first(&(gpusvm__)->root, (start__), (end__) - 1);	\
> > > +	     (notifier__) && (notifier__->interval.start < (end__));			\
> > > +	     (notifier__) = __drm_gpusvm_notifier_next(notifier__))
> > > +
> > > +/**
> > > + * drm_gpusvm_for_each_notifier_safe - Safely iterate over GPU
> > > SVM
> > > notifiers in a gpusvm
> > > + * @notifier__: Iterator variable for the notifiers
> > > + * @next__: Iterator variable for the notifiers temporary storage
> > > + * @gpusvm__: Pointer to the GPU SVM structure
> > > + * @start__: Start address of the notifier
> > > + * @end__: End address of the notifier
> > > + *
> > > + * This macro is used to iterate over GPU SVM notifiers in a
> > > gpusvm
> > > while
> > > + * removing notifiers from it.
> > > + */
> > > +#define drm_gpusvm_for_each_notifier_safe(notifier__, next__, gpusvm__, start__, end__)	\
> > > +	for ((notifier__) = notifier_iter_first(&(gpusvm__)->root, (start__), (end__) - 1),	\
> > > +	     (next__) = __drm_gpusvm_notifier_next(notifier__);				\
> > > +	     (notifier__) && (notifier__->interval.start < (end__));			\
> > > +	     (notifier__) = (next__), (next__) = __drm_gpusvm_notifier_next(notifier__))
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_invalidate - Invalidate a GPU SVM
> > > notifier.
> > > + * @mni: Pointer to the mmu_interval_notifier structure.
> > > + * @mmu_range: Pointer to the mmu_notifier_range structure.
> > > + * @cur_seq: Current sequence number.
> > > + *
> > > + * This function serves as a generic MMU notifier for GPU SVM.
> > > It
> > > sets the MMU
> > > + * notifier sequence number and calls the driver invalidate
> > > vfunc
> > > under
> > > + * gpusvm->notifier_lock.
> > > + *
> > > + * Returns:
> > > + * true if the operation succeeds, false otherwise.
> > > + */
> > > +static bool
> > > +drm_gpusvm_notifier_invalidate(struct mmu_interval_notifier
> > > *mni,
> > > +			       const struct mmu_notifier_range
> > > *mmu_range,
> > > +			       unsigned long cur_seq)
> > > +{
> > > +	struct drm_gpusvm_notifier *notifier =
> > > +		container_of(mni, typeof(*notifier), notifier);
> > > +	struct drm_gpusvm *gpusvm = notifier->gpusvm;
> > > +
> > > +	if (!mmu_notifier_range_blockable(mmu_range))
> > > +		return false;
> > > +
> > > +	down_write(&gpusvm->notifier_lock);
> > > +	mmu_interval_set_seq(mni, cur_seq);
> > > +	gpusvm->ops->invalidate(gpusvm, notifier, mmu_range);
> > > +	up_write(&gpusvm->notifier_lock);
> > > +
> > > +	return true;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_ops - MMU interval notifier operations
> > > for
> > > GPU SVM
> > > + */
> > > +static const struct mmu_interval_notifier_ops
> > > drm_gpusvm_notifier_ops = {
> > > +	.invalidate = drm_gpusvm_notifier_invalidate,
> > > +};
> > > +
> > > +/**
> > > + * drm_gpusvm_init - Initialize the GPU SVM.
> > > + * @gpusvm: Pointer to the GPU SVM structure.
> > > + * @name: Name of the GPU SVM.
> > > + * @drm: Pointer to the DRM device structure.
> > > + * @mm: Pointer to the mm_struct for the address space.
> > > + * @device_private_page_owner: Device private pages owner.
> > > + * @mm_start: Start address of GPU SVM.
> > > + * @mm_range: Range of the GPU SVM.
> > > + * @notifier_size: Size of individual notifiers.
> > > + * @ops: Pointer to the operations structure for GPU SVM.
> > > + * @chunk_sizes: Pointer to the array of chunk sizes used in
> > > range
> > > allocation.
> > > + *               Entries should be powers of 2 in descending
> > > order
> > > with last
> > > + *               entry being SZ_4K.
> > > + * @num_chunks: Number of chunks.
> > > + *
> > > + * This function initializes the GPU SVM.
> > > + *
> > > + * Returns:
> > > + * 0 on success, a negative error code on failure.
> > > + */
> > > +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> > > +		    const char *name, struct drm_device *drm,
> > > +		    struct mm_struct *mm, void
> > > *device_private_page_owner,
> > > +		    u64 mm_start, u64 mm_range, u64
> > > notifier_size,
> > > +		    const struct drm_gpusvm_ops *ops,
> > > +		    const u64 *chunk_sizes, int num_chunks)
> > > +{
> > > +	if (!ops->invalidate || !num_chunks)
> > > +		return -EINVAL;
> > > +
> > > +	gpusvm->name = name;
> > > +	gpusvm->drm = drm;
> > > +	gpusvm->mm = mm;
> > > +	gpusvm->device_private_page_owner =
> > > device_private_page_owner;
> > > +	gpusvm->mm_start = mm_start;
> > > +	gpusvm->mm_range = mm_range;
> > > +	gpusvm->notifier_size = notifier_size;
> > > +	gpusvm->ops = ops;
> > > +	gpusvm->chunk_sizes = chunk_sizes;
> > > +	gpusvm->num_chunks = num_chunks;
> > > +	gpusvm->zdd_wq = system_wq;
> > > +
> > > +	mmgrab(mm);
> > > +	gpusvm->root = RB_ROOT_CACHED;
> > > +	INIT_LIST_HEAD(&gpusvm->notifier_list);
> > > +
> > > +	init_rwsem(&gpusvm->notifier_lock);
> > > +
> > > +	fs_reclaim_acquire(GFP_KERNEL);
> > > +	might_lock(&gpusvm->notifier_lock);
> > > +	fs_reclaim_release(GFP_KERNEL);
> > > +
> > > +	return 0;
> > > +}
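
For reference, a call site would look roughly like the below. The driver names,
the VA range and the 2M notifier size are invented for illustration (and NULL
stands in for the device-private page owner just to keep the sketch short);
only the chunk_sizes ordering, descending powers of two ending in SZ_4K, is
mandated by the documentation above:

	static const u64 driver_chunk_sizes[] = { SZ_2M, SZ_64K, SZ_4K };

	err = drm_gpusvm_init(&vm->svm, "driver-svm", &driver->drm,
			      current->mm, NULL, 0, 1ull << 47, SZ_2M,
			      &driver_gpusvm_ops, driver_chunk_sizes,
			      ARRAY_SIZE(driver_chunk_sizes));
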
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_find - Find GPU SVM notifier
> > > + * @gpusvm__: Pointer to the GPU SVM structure
> > > + * @fault_addr__: Fault address
> > > + *
> > > + * This macro finds the GPU SVM notifier associated with the
> > > fault
> > > address.
> > > + *
> > > + * Returns:
> > > + * Pointer to the GPU SVM notifier on success, NULL otherwise.
> > > + */
> > > +#define drm_gpusvm_notifier_find(gpusvm__, fault_addr__)	\
> > > +	notifier_iter_first(&(gpusvm__)->root, (fault_addr__),	\
> > > +			    (fault_addr__ + 1))
> > > +
> > > +/**
> > > + * to_drm_gpusvm_notifier - retrieve the container struct for a
> > > given rbtree node
> > > + * @node__: a pointer to the rbtree node embedded within a
> > > drm_gpusvm_notifier struct
> > > + *
> > > + * Return: A pointer to the containing drm_gpusvm_notifier
> > > structure.
> > > + */
> > > +#define to_drm_gpusvm_notifier(__node)				\
> > > +	container_of((__node), struct drm_gpusvm_notifier, rb.node)
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_insert - Insert GPU SVM notifier
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > + *
> > > + * This function inserts the GPU SVM notifier into the GPU SVM
> > > RB
> > > tree and list.
> > > + */
> > > +static void drm_gpusvm_notifier_insert(struct drm_gpusvm *gpusvm,
> > > +				       struct drm_gpusvm_notifier *notifier)
> > > +{
> > > +	struct rb_node *node;
> > > +	struct list_head *head;
> > > +
> > > +	notifier_insert(notifier, &gpusvm->root);
> > > +
> > > +	node = rb_prev(&notifier->rb.node);
> > > +	if (node)
> > > +		head = &(to_drm_gpusvm_notifier(node))->rb.entry;
> > > +	else
> > > +		head = &gpusvm->notifier_list;
> > > +
> > > +	list_add(&notifier->rb.entry, head);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_remove - Remove GPU SVM notifier
> > > + * @gpusvm__: Pointer to the GPU SVM structure
> > > + * @notifier__: Pointer to the GPU SVM notifier structure
> > > + *
> > > + * This macro removes the GPU SVM notifier from the GPU SVM RB
> > > tree
> > > and list.
> > > + */
> > > +#define drm_gpusvm_notifier_remove(gpusvm__, notifier__)	\
> > > +	notifier_remove((notifier__), &(gpusvm__)->root);	\
> > > +	list_del(&(notifier__)->rb.entry)
> > > +
> > > +/**
> > > + * drm_gpusvm_fini - Finalize the GPU SVM.
> > > + * @gpusvm: Pointer to the GPU SVM structure.
> > > + *
> > > + * This function finalizes the GPU SVM by cleaning up any
> > > remaining
> > > ranges and
> > > + * notifiers, and dropping a reference to struct MM.
> > > + */
> > > +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm)
> > > +{
> > > +	struct drm_gpusvm_notifier *notifier, *next;
> > > +
> > > +	drm_gpusvm_for_each_notifier_safe(notifier, next,
> > > gpusvm, 0,
> > > LONG_MAX) {
> > > +		struct drm_gpusvm_range *range, *__next;
> > > +
> > > +		/*
> > > +		 * Remove notifier first to avoid racing with
> > > any
> > > invalidation
> > > +		 */
> > > +		mmu_interval_notifier_remove(&notifier->notifier);
> > > +		notifier->flags.removed = true;
> > > +
> > > +		drm_gpusvm_for_each_range_safe(range, __next,
> > > notifier, 0,
> > > +					       LONG_MAX)
> > > +			drm_gpusvm_range_remove(gpusvm, range);
> > > +	}
> > > +
> > > +	mmdrop(gpusvm->mm);
> > > +	WARN_ON(!RB_EMPTY_ROOT(&gpusvm->root.rb_root));
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_alloc - Allocate GPU SVM notifier
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @fault_addr: Fault address
> > > + *
> > > + * This function allocates and initializes the GPU SVM notifier
> > > structure.
> > > + *
> > > + * Returns:
> > > + * Pointer to the allocated GPU SVM notifier on success,
> > > ERR_PTR()
> > > on failure.
> > > + */
> > > +static struct drm_gpusvm_notifier *
> > > +drm_gpusvm_notifier_alloc(struct drm_gpusvm *gpusvm, u64
> > > fault_addr)
> > > +{
> > > +	struct drm_gpusvm_notifier *notifier;
> > > +
> > > +	if (gpusvm->ops->notifier_alloc)
> > > +		notifier = gpusvm->ops->notifier_alloc();
> > > +	else
> > > +		notifier = kzalloc(sizeof(*notifier),
> > > GFP_KERNEL);
> > > +
> > > +	if (!notifier)
> > > +		return ERR_PTR(-ENOMEM);
> > > +
> > > +	notifier->gpusvm = gpusvm;
> > > +	notifier->interval.start = ALIGN_DOWN(fault_addr,
> > > +					      gpusvm->notifier_size);
> > > +	notifier->interval.end = ALIGN(fault_addr + 1, gpusvm->notifier_size);
> > > +	INIT_LIST_HEAD(&notifier->rb.entry);
> > > +	notifier->root = RB_ROOT_CACHED;
> > > +	INIT_LIST_HEAD(&notifier->range_list);
> > > +
> > > +	return notifier;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_free - Free GPU SVM notifier
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > + *
> > > + * This function frees the GPU SVM notifier structure.
> > > + */
> > > +static void drm_gpusvm_notifier_free(struct drm_gpusvm *gpusvm,
> > > +				     struct drm_gpusvm_notifier
> > > *notifier)
> > > +{
> > > +	WARN_ON(!RB_EMPTY_ROOT(&notifier->root.rb_root));
> > > +
> > > +	if (gpusvm->ops->notifier_free)
> > > +		gpusvm->ops->notifier_free(notifier);
> > > +	else
> > > +		kfree(notifier);
> > > +}
> > > +
> > > +/**
> > > + * to_drm_gpusvm_range - retrieve the container struct for a
> > > given
> > > rbtree node
> > > + * @node__: a pointer to the rbtree node embedded within a
> > > drm_gpusvm_range struct
> > > + *
> > > + * Return: A pointer to the containing drm_gpusvm_range
> > > structure.
> > > + */
> > > +#define to_drm_gpusvm_range(node__)	\
> > > +	container_of((node__), struct drm_gpusvm_range, rb.node)
> > > +
> > > +/**
> > > + * drm_gpusvm_range_insert - Insert GPU SVM range
> > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + *
> > > + * This function inserts the GPU SVM range into the notifier RB
> > > tree
> > > and list.
> > > + */
> > > +static void drm_gpusvm_range_insert(struct drm_gpusvm_notifier
> > > *notifier,
> > > +				    struct drm_gpusvm_range
> > > *range)
> > > +{
> > > +	struct rb_node *node;
> > > +	struct list_head *head;
> > > +
> > > +	drm_gpusvm_notifier_lock(notifier->gpusvm);
> > > +	range_insert(range, &notifier->root);
> > > +
> > > +	node = rb_prev(&range->rb.node);
> > > +	if (node)
> > > +		head = &(to_drm_gpusvm_range(node))->rb.entry;
> > > +	else
> > > +		head = &notifier->range_list;
> > > +
> > > +	list_add(&range->rb.entry, head);
> > > +	drm_gpusvm_notifier_unlock(notifier->gpusvm);
> > > +}
> > > +
> > > +/**
> > > + * __drm_gpusvm_range_remove - Remove GPU SVM range
> > > + * @notifier__: Pointer to the GPU SVM notifier structure
> > > + * @range__: Pointer to the GPU SVM range structure
> > > + *
> > > + * This macro removes the GPU SVM range from the notifier RB
> > > tree
> > > and list.
> > > + */
> > > +#define __drm_gpusvm_range_remove(notifier__, range__)		\
> > > +	range_remove((range__), &(notifier__)->root);		\
> > > +	list_del(&(range__)->rb.entry)
> > > +
> > > +/**
> > > + * drm_gpusvm_range_alloc - Allocate GPU SVM range
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > + * @fault_addr: Fault address
> > > + * @chunk_size: Chunk size
> > > + * @migrate_vram: Flag indicating whether to migrate VRAM
> > > + *
> > > + * This function allocates and initializes the GPU SVM range
> > > structure.
> > > + *
> > > + * Returns:
> > > + * Pointer to the allocated GPU SVM range on success, ERR_PTR()
> > > on
> > > failure.
> > > + */
> > > +static struct drm_gpusvm_range *
> > > +drm_gpusvm_range_alloc(struct drm_gpusvm *gpusvm,
> > > +		       struct drm_gpusvm_notifier *notifier,
> > > +		       u64 fault_addr, u64 chunk_size, bool
> > > migrate_vram)
> > > +{
> > > +	struct drm_gpusvm_range *range;
> > > +
> > > +	if (gpusvm->ops->range_alloc)
> > > +		range = gpusvm->ops->range_alloc(gpusvm);
> > > +	else
> > > +		range = kzalloc(sizeof(*range), GFP_KERNEL);
> > > +
> > > +	if (!range)
> > > +		return ERR_PTR(-ENOMEM);
> > > +
> > > +	kref_init(&range->refcount);
> > > +	range->gpusvm = gpusvm;
> > > +	range->notifier = notifier;
> > > +	range->va.start = ALIGN_DOWN(fault_addr, chunk_size);
> > > +	range->va.end = ALIGN(fault_addr + 1, chunk_size);
> > > +	INIT_LIST_HEAD(&range->rb.entry);
> > > +	range->notifier_seq = LONG_MAX;
> > > +	range->flags.migrate_vram = migrate_vram ? 1 : 0;
> > > +
> > > +	return range;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_check_pages - Check pages
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > + * @start: Start address
> > > + * @end: End address
> > > + *
> > > + * Check if pages between start and end have been faulted in on
> > > the
> > > CPU. Use to
> > > + * prevent migration of pages without CPU backing store.
> > > + *
> > > + * Returns:
> > > + * True if pages have been faulted into CPU, False otherwise
> > > + */
> > > +static bool drm_gpusvm_check_pages(struct drm_gpusvm *gpusvm,
> > > +				   struct drm_gpusvm_notifier
> > > *notifier,
> > > +				   u64 start, u64 end)
> > > +{
> > > +	struct hmm_range hmm_range = {
> > > +		.default_flags = 0,
> > > +		.notifier = &notifier->notifier,
> > > +		.start = start,
> > > +		.end = end,
> > > +		.dev_private_owner = gpusvm->device_private_page_owner,
> > > +	};
> > > +	unsigned long timeout =
> > > +		jiffies +
> > > msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > > +	unsigned long *pfns;
> > > +	unsigned long npages = npages_in_range(start, end);
> > > +	int err, i;
> > > +
> > > +	mmap_assert_locked(gpusvm->mm);
> > > +
> > > +	pfns = kvmalloc_array(npages, sizeof(*pfns),
> > > GFP_KERNEL);
> > > +	if (!pfns)
> > > +		return false;
> > > +
> > > +	hmm_range.notifier_seq = mmu_interval_read_begin(&notifier->notifier);
> > > +	hmm_range.hmm_pfns = pfns;
> > > +
> > > +	while (true) {
> > > +		err = hmm_range_fault(&hmm_range);
> > > +		if (err == -EBUSY) {
> > > +			if (time_after(jiffies, timeout))
> > > +				break;
> > > +
> > > +			hmm_range.notifier_seq =
> > > mmu_interval_read_begin(&notifier->notifier);
> > > +			continue;
> > > +		}
> > > +		break;
> > > +	}
> > > +	if (err)
> > > +		goto err_free;
> > > +
> > > +	for (i = 0; i < npages; ++i) {
> > > +		if (!(pfns[i] & HMM_PFN_VALID)) {
> > > +			err = -EFAULT;
> > > +			goto err_free;
> > > +		}
> > > +	}
> > > +
> > > +err_free:
> > > +	kvfree(pfns);
> > > +	return err ? false : true;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_chunk_size - Determine chunk size for GPU
> > > SVM
> > > range
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > + * @vas: Pointer to the virtual memory area structure
> > > + * @fault_addr: Fault address
> > > + * @gpuva_start: Start address of GPUVA which mirrors CPU
> > > + * @gpuva_end: End address of GPUVA which mirrors CPU
> > > + * @check_pages: Flag indicating whether to check pages
> > > + *
> > > + * This function determines the chunk size for the GPU SVM range
> > > based on the
> > > + * fault address, GPU SVM chunk sizes, existing GPU SVM ranges,
> > > and
> > > the virtual
> > > + * memory area boundaries.
> > > + *
> > > + * Returns:
> > > + * Chunk size on success, LONG_MAX on failure.
> > > + */
> > > +static u64 drm_gpusvm_range_chunk_size(struct drm_gpusvm *gpusvm,
> > > +				       struct drm_gpusvm_notifier *notifier,
> > > +				       struct vm_area_struct *vas,
> > > +				       u64 fault_addr, u64 gpuva_start,
> > > +				       u64 gpuva_end, bool check_pages)
> > > +{
> > > +	u64 start, end;
> > > +	int i = 0;
> > > +
> > > +retry:
> > > +	for (; i < gpusvm->num_chunks; ++i) {
> > > +		start = ALIGN_DOWN(fault_addr, gpusvm->chunk_sizes[i]);
> > > +		end = ALIGN(fault_addr + 1, gpusvm->chunk_sizes[i]);
> > > +
> > > +		if (start >= vas->vm_start && end <= vas->vm_end
> > > &&
> > > +		    start >= notifier->interval.start &&
> > > +		    end <= notifier->interval.end &&
> > > +		    start >= gpuva_start && end <= gpuva_end)
> > > +			break;
> > > +	}
> > > +
> > > +	if (i == gpusvm->num_chunks)
> > > +		return LONG_MAX;
> > > +
> > > +	/*
> > > +	 * If allocating more than a page, ensure not to overlap with
> > > +	 * existing ranges.
> > > +	 */
> > > +	if (end - start != SZ_4K) {
> > > +		struct drm_gpusvm_range *range;
> > > +
> > > +		range = drm_gpusvm_range_find(notifier, start,
> > > end);
> > > +		if (range) {
> > > +			++i;
> > > +			goto retry;
> > > +		}
> > > +
> > > +		/*
> > > +		 * XXX: Only create range on pages CPU has
> > > faulted
> > > in. Without
> > > +		 * this check, or prefault, on BMG
> > > 'xe_exec_system_allocator --r
> > > +		 * process-many-malloc' fails. In the failure
> > > case,
> > > each process
> > > +		 * mallocs 16k but the CPU VMA is ~128k which
> > > results in 64k SVM
> > > +		 * ranges. When migrating the SVM ranges, some
> > > processes fail in
> > > +		 * drm_gpusvm_migrate_to_vram with
> > > 'migrate.cpages
> > > != npages'
> > > +		 * and then upon drm_gpusvm_range_get_pages
> > > device
> > > pages from
> > > +		 * other processes are collected + faulted in
> > > which
> > > creates all
> > > +		 * sorts of problems. Unsure exactly how this is
> > > +		 * happening; the problem also goes away if
> > > +		 * 'xe_exec_system_allocator --r process-many-malloc'
> > > +		 * mallocs at least 64k at a time.
> > > +		 */
> > > +		if (check_pages &&
> > > +		    !drm_gpusvm_check_pages(gpusvm, notifier,
> > > start,
> > > end)) {
> > > +			++i;
> > > +			goto retry;
> > > +		}
> > > +	}
> > > +
> > > +	return end - start;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_find_or_insert - Find or insert GPU SVM
> > > range
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @fault_addr: Fault address
> > > + * @gpuva_start: Start address of GPUVA which mirrors CPU
> > > + * @gpuva_end: End address of GPUVA which mirrors CPU
> > > + * @ctx: GPU SVM context
> > > + *
> > > + * This function finds or inserts a newly allocated GPU SVM
> > > range
> > > based on the
> > > + * fault address. Caller must hold a lock to protect range
> > > lookup
> > > and insertion.
> > > + *
> > > + * Returns:
> > > + * Pointer to the GPU SVM range on success, ERR_PTR() on
> > > failure.
> > > + */
> > > +struct drm_gpusvm_range *
> > > +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64
> > > fault_addr,
> > > +				u64 gpuva_start, u64 gpuva_end,
> > > +				const struct drm_gpusvm_ctx
> > > *ctx)
> > > +{
> > > +	struct drm_gpusvm_notifier *notifier;
> > > +	struct drm_gpusvm_range *range;
> > > +	struct mm_struct *mm = gpusvm->mm;
> > > +	struct vm_area_struct *vas;
> > > +	bool notifier_alloc = false;
> > > +	u64 chunk_size;
> > > +	int err;
> > > +	bool migrate_vram;
> > > +
> > > +	if (fault_addr < gpusvm->mm_start ||
> > > +	    fault_addr > gpusvm->mm_start + gpusvm->mm_range) {
> > > +		err = -EINVAL;
> > > +		goto err_out;
> > > +	}
> > > +
> > > +	if (!ctx->mmap_locked) {
> > > +		if (!mmget_not_zero(mm)) {
> > > +			err = -EFAULT;
> > > +			goto err_out;
> > > +		}
> > > +		mmap_write_lock(mm);
> > > +	}
> > > +
> > > +	mmap_assert_write_locked(mm);
> > > +
> > > +	notifier = drm_gpusvm_notifier_find(gpusvm, fault_addr);
> > > +	if (!notifier) {
> > > +		notifier = drm_gpusvm_notifier_alloc(gpusvm,
> > > fault_addr);
> > > +		if (IS_ERR(notifier)) {
> > > +			err = PTR_ERR(notifier);
> > > +			goto err_mmunlock;
> > > +		}
> > > +		notifier_alloc = true;
> > > +		err = mmu_interval_notifier_insert_locked(&notifier->notifier,
> > > +							   mm, notifier->interval.start,
> > > +							   notifier->interval.end -
> > > +							   notifier->interval.start,
> > > +							   &drm_gpusvm_notifier_ops);
> > > +		if (err)
> > > +			goto err_notifier;
> > > +	}
> > > +
> > > +	vas = vma_lookup(mm, fault_addr);
> > > +	if (!vas) {
> > > +		err = -ENOENT;
> > > +		goto err_notifier_remove;
> > > +	}
> > > +
> > > +	if (!ctx->read_only && !(vas->vm_flags & VM_WRITE)) {
> > > +		err = -EPERM;
> > > +		goto err_notifier_remove;
> > > +	}
> > > +
> > > +	range = drm_gpusvm_range_find(notifier, fault_addr,
> > > fault_addr + 1);
> > > +	if (range)
> > > +		goto out_mmunlock;
> > > +	/*
> > > +	 * XXX: Short-circuiting migration based on
> > > migrate_vma_*
> > > current
> > > +	 * limitations. If/when migrate_vma_* add more support,
> > > this
> > > logic will
> > > +	 * have to change.
> > > +	 */
> > > +	migrate_vram = ctx->vram_possible &&
> > > +		vma_is_anonymous(vas) &&
> > > !is_vm_hugetlb_page(vas);
> > > +
> > > +	chunk_size = drm_gpusvm_range_chunk_size(gpusvm, notifier, vas,
> > > +						 fault_addr, gpuva_start,
> > > +						 gpuva_end, migrate_vram &&
> > > +						 !ctx->prefault);
> > > +	if (chunk_size == LONG_MAX) {
> > > +		err = -EINVAL;
> > > +		goto err_notifier_remove;
> > > +	}
> > > +
> > > +	range = drm_gpusvm_range_alloc(gpusvm, notifier,
> > > fault_addr,
> > > chunk_size,
> > > +				       migrate_vram);
> > > +	if (IS_ERR(range)) {
> > > +		err = PTR_ERR(range);
> > > +		goto err_notifier_remove;
> > > +	}
> > > +
> > > +	drm_gpusvm_range_insert(notifier, range);
> > > +	if (notifier_alloc)
> > > +		drm_gpusvm_notifier_insert(gpusvm, notifier);
> > > +
> > > +	if (ctx->prefault) {
> > > +		struct drm_gpusvm_ctx __ctx = *ctx;
> > > +
> > > +		__ctx.mmap_locked = true;
> > > +		err = drm_gpusvm_range_get_pages(gpusvm, range,
> > > &__ctx);
> > > +		if (err)
> > > +			goto err_range_remove;
> > > +	}
> > > +
> > > +out_mmunlock:
> > > +	if (!ctx->mmap_locked) {
> > > +		mmap_write_unlock(mm);
> > > +		mmput(mm);
> > > +	}
> > > +
> > > +	return range;
> > > +
> > > +err_range_remove:
> > > +	__drm_gpusvm_range_remove(notifier, range);
> > > +err_notifier_remove:
> > > +	if (notifier_alloc)
> > > +		mmu_interval_notifier_remove(&notifier->notifier);
> > > +err_notifier:
> > > +	if (notifier_alloc)
> > > +		drm_gpusvm_notifier_free(gpusvm, notifier);
> > > +err_mmunlock:
> > > +	if (!ctx->mmap_locked) {
> > > +		mmap_write_unlock(mm);
> > > +		mmput(mm);
> > > +	}
> > > +err_out:
> > > +	return ERR_PTR(err);
> > > +}
> > > +
> > > +/**
> > > + * for_each_dma_page - iterate over pages in a DMA region
> > > + * @i__: the current page index in the iteration
> > > + * @j__: the current page index, log order, in the iteration
> > > + * @npages__: the total number of pages in the DMA region
> > > + * @order__: the order of the pages in the DMA region
> > > + *
> > > + * This macro iterates over each page in a DMA region. The DMA
> > > region
> > > + * is assumed to be composed of 2^@order__ pages, and the macro
> > > will
> > > + * step through the region one block of 2^@order__ pages at a
> > > time.
> > > + */
> > > +#define for_each_dma_page(i__, j__, npages__, order__)	\
> > > +	for ((i__) = 0, (j__) = 0; (i__) < (npages__);	\
> > > +	     (j__)++, (i__) += 0x1 << (order__))
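
As a concrete example, with npages__ = 8 and order__ = 2 the body runs for
(i__, j__) = (0, 0) and (4, 1), i.e. two blocks of four pages each. Note that
order__ is re-evaluated on every step, so updating the order variable inside
the body (as drm_gpusvm_range_get_pages below does) changes the stride from
that point on.
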
> > > +
> > > +/**
> > > + * __drm_gpusvm_range_unmap_pages - Unmap pages associated with
> > > a
> > > GPU SVM range (internal)
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + *
> > > + * This function unmaps pages associated with a GPU SVM range.
> > > Assumes and
> > > + * asserts correct locking is in place when called.
> > > + */
> > > +static void __drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> > > +					   struct drm_gpusvm_range *range)
> > > +{
> > > +	lockdep_assert_held(&gpusvm->notifier_lock);
> > > +
> > > +	if (range->pages) {
> > > +		unsigned long i, j, npages = npages_in_range(range->va.start,
> > > +							     range->va.end);
> > > +
> > > +		if (range->flags.has_dma_mapping) {
> > > +			for_each_dma_page(i, j, npages, range->order)
> > > +				dma_unmap_page(gpusvm->drm->dev,
> > > +					       range->dma_addr[j],
> > > +					       PAGE_SIZE << range->order,
> > > +					       DMA_BIDIRECTIONAL);
> > > +		}
> > > +
> > > +		range->flags.has_vram_pages = false;
> > > +		range->flags.has_dma_mapping = false;
> > > +	}
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_free_pages - Free pages associated with a
> > > GPU
> > > SVM range
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + *
> > > + * This function frees pages associated with a GPU SVM range.
> > > + */
> > > +static void drm_gpusvm_range_free_pages(struct drm_gpusvm
> > > *gpusvm,
> > > +					struct drm_gpusvm_range
> > > *range)
> > > +{
> > > +	lockdep_assert_held(&gpusvm->notifier_lock);
> > > +
> > > +	if (range->pages) {
> > > +		if (range->flags.kfree_mapping) {
> > > +			kfree(range->dma_addr);
> > > +			range->flags.kfree_mapping = false;
> > > +			range->pages = NULL;
> > > +		} else {
> > > +			kvfree(range->pages);
> > > +			range->pages = NULL;
> > > +		}
> > > +	}
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_remove - Remove GPU SVM range
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range to be removed
> > > + *
> > > + * This function removes the specified GPU SVM range and also
> > > removes the parent
> > > + * GPU SVM notifier if no more ranges remain in the notifier.
> > > The
> > > caller must
> > > + * hold a lock to protect range and notifier removal.
> > > + */
> > > +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> > > +			     struct drm_gpusvm_range *range)
> > > +{
> > > +	struct drm_gpusvm_notifier *notifier;
> > > +
> > > +	notifier = drm_gpusvm_notifier_find(gpusvm, range->va.start);
> > > +	if (WARN_ON_ONCE(!notifier))
> > > +		return;
> > > +
> > > +	drm_gpusvm_notifier_lock(gpusvm);
> > > +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > > +	drm_gpusvm_range_free_pages(gpusvm, range);
> > > +	__drm_gpusvm_range_remove(notifier, range);
> > > +	drm_gpusvm_notifier_unlock(gpusvm);
> > > +
> > > +	drm_gpusvm_range_put(range);
> > > +
> > > +	if (RB_EMPTY_ROOT(&notifier->root.rb_root)) {
> > > +		if (!notifier->flags.removed)
> > > +			mmu_interval_notifier_remove(&notifier->notifier);
> > > +		drm_gpusvm_notifier_remove(gpusvm, notifier);
> > > +		drm_gpusvm_notifier_free(gpusvm, notifier);
> > > +	}
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_get - Get a reference to GPU SVM range
> > > + * @range: Pointer to the GPU SVM range
> > > + *
> > > + * This function increments the reference count of the specified
> > > GPU
> > > SVM range.
> > > + *
> > > + * Returns:
> > > + * Pointer to the GPU SVM range.
> > > + */
> > > +struct drm_gpusvm_range *
> > > +drm_gpusvm_range_get(struct drm_gpusvm_range *range)
> > > +{
> > > +	kref_get(&range->refcount);
> > > +
> > > +	return range;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_destroy - Destroy GPU SVM range
> > > + * @refcount: Pointer to the reference counter embedded in the
> > > GPU
> > > SVM range
> > > + *
> > > + * This function destroys the specified GPU SVM range when its
> > > reference count
> > > + * reaches zero. If a custom range-free function is provided, it
> > > is
> > > invoked to
> > > + * free the range; otherwise, the range is deallocated using
> > > kfree().
> > > + */
> > > +static void drm_gpusvm_range_destroy(struct kref *refcount)
> > > +{
> > > +	struct drm_gpusvm_range *range =
> > > +		container_of(refcount, struct drm_gpusvm_range,
> > > refcount);
> > > +	struct drm_gpusvm *gpusvm = range->gpusvm;
> > > +
> > > +	if (gpusvm->ops->range_free)
> > > +		gpusvm->ops->range_free(range);
> > > +	else
> > > +		kfree(range);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_put - Put a reference to GPU SVM range
> > > + * @range: Pointer to the GPU SVM range
> > > + *
> > > + * This function decrements the reference count of the specified
> > > GPU
> > > SVM range
> > > + * and frees it when the count reaches zero.
> > > + */
> > > +void drm_gpusvm_range_put(struct drm_gpusvm_range *range)
> > > +{
> > > +	kref_put(&range->refcount, drm_gpusvm_range_destroy);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_pages_valid - GPU SVM range pages valid
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + *
> > > + * This function determines if a GPU SVM range's pages are valid. It is
> > > + * expected to be called while holding gpusvm->notifier_lock and as the last
> > > + * step before committing a GPU binding.
> > > + *
> > > + * Returns:
> > > + * True if GPU SVM range has valid pages, False otherwise
> > > + */
> > > +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> > > +				  struct drm_gpusvm_range
> > > *range)
> > > +{
> > > +	lockdep_assert_held(&gpusvm->notifier_lock);
> > > +
> > > +	return range->flags.has_vram_pages || range->flags.has_dma_mapping;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_pages_valid_unlocked - GPU SVM range pages
> > > valid
> > > unlocked
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + *
> > > + * This function determines if a GPU SVM range's pages are valid. It is
> > > + * expected to be called without holding gpusvm->notifier_lock.
> > > + *
> > > + * Returns:
> > > + * True if GPU SVM range has valid pages, False otherwise
> > > + */
> > > +static bool
> > > +drm_gpusvm_range_pages_valid_unlocked(struct drm_gpusvm *gpusvm,
> > > +				      struct drm_gpusvm_range
> > > *range)
> > > +{
> > > +	bool pages_valid;
> > > +
> > > +	if (!range->pages)
> > > +		return false;
> > > +
> > > +	drm_gpusvm_notifier_lock(gpusvm);
> > > +	pages_valid = drm_gpusvm_range_pages_valid(gpusvm,
> > > range);
> > > +	if (!pages_valid && range->flags.kfree_mapping) {
> > > +		kfree(range->dma_addr);
> > > +		range->flags.kfree_mapping = false;
> > > +		range->pages = NULL;
> > > +	}
> > > +	drm_gpusvm_notifier_unlock(gpusvm);
> > > +
> > > +	return pages_valid;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_get_pages - Get pages for a GPU SVM range
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + * @ctx: GPU SVM context
> > > + *
> > > + * This function gets pages for a GPU SVM range and ensures they
> > > are
> > > mapped for
> > > + * DMA access.
> > > + *
> > > + * Returns:
> > > + * 0 on success, negative error code on failure.
> > > + */
> > > +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> > > +			       struct drm_gpusvm_range *range,
> > > +			       const struct drm_gpusvm_ctx *ctx)
> > > +{
> > > +	struct mmu_interval_notifier *notifier = &range->notifier->notifier;
> > > +	struct hmm_range hmm_range = {
> > > +		.default_flags = HMM_PFN_REQ_FAULT | (ctx->read_only ? 0 :
> > > +			HMM_PFN_REQ_WRITE),
> > > +		.notifier = notifier,
> > > +		.start = range->va.start,
> > > +		.end = range->va.end,
> > > +		.dev_private_owner = gpusvm->device_private_page_owner,
> > > +	};
> > > +	struct mm_struct *mm = gpusvm->mm;
> > > +	unsigned long timeout =
> > > +		jiffies +
> > > msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > > +	unsigned long i, j;
> > > +	unsigned long npages = npages_in_range(range->va.start,
> > > range->va.end);
> > > +	unsigned int order = 0;
> > > +	unsigned long *pfns;
> > > +	struct page **pages;
> > > +	int err = 0;
> > > +	bool vram_pages = !!range->flags.migrate_vram;
> > > +	bool alloc_pfns = false, kfree_mapping;
> > > +
> > > +retry:
> > > +	kfree_mapping = false;
> > > +	hmm_range.notifier_seq =
> > > mmu_interval_read_begin(notifier);
> > > +	if (drm_gpusvm_range_pages_valid_unlocked(gpusvm,
> > > range))
> > > +		return 0;
> > > +
> > > +	if (range->notifier_seq == hmm_range.notifier_seq && range->pages) {
> > > +		if (ctx->prefault)
> > > +			return 0;
> > > +
> > > +		pfns = (unsigned long *)range->pages;
> > > +		pages = range->pages;
> > > +		goto map_pages;
> > > +	}
> > > +
> > > +	if (!range->pages) {
> > > +		pfns = kvmalloc_array(npages, sizeof(*pfns),
> > > GFP_KERNEL);
> > > +		if (!pfns)
> > > +			return -ENOMEM;
> > > +		alloc_pfns = true;
> > > +	} else {
> > > +		pfns = (unsigned long *)range->pages;
> > > +	}
> > > +
> > > +	if (!ctx->mmap_locked) {
> > > +		if (!mmget_not_zero(mm)) {
> > > +			err = -EFAULT;
> > > +			goto err_out;
> > > +		}
> > > +	}
> > > +
> > > +	hmm_range.hmm_pfns = pfns;
> > > +	while (true) {
> > > +		/* Must be checked after mmu_interval_read_begin
> > > */
> > > +		if (range->flags.unmapped) {
> > > +			err = -EFAULT;
> > > +			break;
> > > +		}
> > > +
> > > +		if (!ctx->mmap_locked) {
> > > +			/*
> > > +			 * XXX: HMM locking document indicates
> > > only
> > > a read-lock
> > > +			 * is required but there appears to be a
> > > window between
> > > +			 * the MMU_NOTIFY_MIGRATE event
> > > triggered in
> > > a CPU fault
> > > +			 * via migrate_vma_setup and the pages
> > > actually moving
> > > +			 * in migrate_vma_finalize in which this
> > > code can grab
> > > +			 * garbage pages. Grabbing the write-
> > > lock if
> > > the range
> > > +			 * is attached to vram appears to
> > > protect
> > > against this
> > > +			 * race.
> > > +			 */
> > > +			if (vram_pages)
> > > +				mmap_write_lock(mm);
> > > +			else
> > > +				mmap_read_lock(mm);
> > > +		}
> > > +		err = hmm_range_fault(&hmm_range);
> > > +		if (!ctx->mmap_locked) {
> > > +			if (vram_pages)
> > > +				mmap_write_unlock(mm);
> > > +			else
> > > +				mmap_read_unlock(mm);
> > > +		}
> > > +
> > > +		if (err == -EBUSY) {
> > > +			if (time_after(jiffies, timeout))
> > > +				break;
> > > +
> > > +			hmm_range.notifier_seq =
> > > mmu_interval_read_begin(notifier);
> > > +			continue;
> > > +		}
> > > +		break;
> > > +	}
> > > +	if (!ctx->mmap_locked)
> > > +		mmput(mm);
> > > +	if (err)
> > > +		goto err_free;
> > > +
> > > +	pages = (struct page **)pfns;
> > > +
> > > +	if (ctx->prefault) {
> > > +		range->pages = pages;
> > > +		goto set_seqno;
> > > +	}
> > > +
> > > +map_pages:
> > > +	if (is_device_private_page(hmm_pfn_to_page(pfns[0]))) {
> > > +		WARN_ON_ONCE(!range->vram_allocation);
> > > +
> > > +		for (i = 0; i < npages; ++i) {
> > > +			pages[i] = hmm_pfn_to_page(pfns[i]);
> > > +
> > > +			if
> > > (WARN_ON_ONCE(!is_device_private_page(pages[i]))) {
> > > +				err = -EOPNOTSUPP;
> > > +				goto err_free;
> > > +			}
> > > +		}
> > > +
> > > +		/* Do not race with notifier unmapping pages */
> > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > +		range->flags.has_vram_pages = true;
> > > +		range->pages = pages;
> > > +		if (mmu_interval_read_retry(notifier,
> > > hmm_range.notifier_seq)) {
> > > +			err = -EAGAIN;
> > > +			__drm_gpusvm_range_unmap_pages(gpusvm,
> > > range);
> > > +		}
> > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > +	} else {
> > > +		dma_addr_t *dma_addr = (dma_addr_t *)pfns;
> > > +
> > > +		for_each_dma_page(i, j, npages, order) {
> > > +			if (WARN_ON_ONCE(i && order !=
> > > +					
> > > hmm_pfn_to_map_order(pfns[i]))) {
> > > +				err = -EOPNOTSUPP;
> > > +				npages = i;
> > > +				goto err_unmap;
> > > +			}
> > > +			order = hmm_pfn_to_map_order(pfns[i]);
> > > +
> > > +			pages[j] = hmm_pfn_to_page(pfns[i]);
> > > +			if
> > > (WARN_ON_ONCE(is_zone_device_page(pages[j]))) {
> > > +				err = -EOPNOTSUPP;
> > > +				npages = i;
> > > +				goto err_unmap;
> > > +			}
> > > +
> > > +			set_page_dirty_lock(pages[j]);
> > > +			mark_page_accessed(pages[j]);
> > > +
> > > +			dma_addr[j] = dma_map_page(gpusvm->drm->dev,
> > > +						   pages[j], 0,
> > > +						   PAGE_SIZE << order,
> > > +						   DMA_BIDIRECTIONAL);
> > > +			if (dma_mapping_error(gpusvm->drm->dev,
> > > +					      dma_addr[j])) {
> > > +				err = -EFAULT;
> > > +				npages = i;
> > > +				goto err_unmap;
> > > +			}
> > > +		}
> > > +
> > > +		/* Huge pages, reduce memory footprint */
> > > +		if (order) {
> > > +			dma_addr = kmalloc_array(j,
> > > sizeof(*dma_addr),
> > > +						 GFP_KERNEL);
> > > +			if (dma_addr) {
> > > +				for (i = 0; i < j; ++i)
> > > +					dma_addr[i] =
> > > (dma_addr_t)pfns[i];
> > > +				kvfree(pfns);
> > > +				kfree_mapping = true;
> > > +			} else {
> > > +				dma_addr = (dma_addr_t *)pfns;
> > > +			}
> > > +		}
> > > +
> > > +		/* Do not race with notifier unmapping pages */
> > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > +		range->order = order;
> > > +		range->flags.kfree_mapping = kfree_mapping;
> > > +		range->flags.has_dma_mapping = true;
> > > +		range->dma_addr = dma_addr;
> > > +		range->vram_allocation = NULL;
> > > +		if (mmu_interval_read_retry(notifier,
> > > hmm_range.notifier_seq)) {
> > > +			err = -EAGAIN;
> > > +			__drm_gpusvm_range_unmap_pages(gpusvm,
> > > range);
> > > +		}
> > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > +	}
> > > +
> > > +	if (err == -EAGAIN)
> > > +		goto retry;
> > > +set_seqno:
> > > +	range->notifier_seq = hmm_range.notifier_seq;
> > > +
> > > +	return 0;
> > > +
> > > +err_unmap:
> > > +	for_each_dma_page(i, j, npages, order)
> > > +		dma_unmap_page(gpusvm->drm->dev,
> > > +			       (dma_addr_t)pfns[j],
> > > +			       PAGE_SIZE << order,
> > > DMA_BIDIRECTIONAL);
> > > +err_free:
> > > +	if (alloc_pfns)
> > > +		kvfree(pfns);
> > > +err_out:
> > > +	return err;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_unmap_pages - Unmap pages associated with a
> > > GPU
> > > SVM range
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + * @ctx: GPU SVM context
> > > + *
> > > + * This function unmaps pages associated with a GPU SVM range.
> > > If
> > > @in_notifier
> > > + * is set, it is assumed that gpusvm->notifier_lock is held in
> > > write
> > > mode; if it
> > > + * is clear, it acquires gpusvm->notifier_lock in read mode.
> > > Must be
> > > called on
> > > + * each GPU SVM range attached to notifier in gpusvm->ops-
> > > > invalidate for IOMMU
> > > + * security model.
> > > + */
> > > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> > > +				  struct drm_gpusvm_range
> > > *range,
> > > +				  const struct drm_gpusvm_ctx
> > > *ctx)
> > > +{
> > > +	if (ctx->in_notifier)
> > > +		lockdep_assert_held_write(&gpusvm->notifier_lock);
> > > +	else
> > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > +
> > > +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > > +
> > > +	if (!ctx->in_notifier)
> > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migration_put_page - Put a migration page
> > > + * @page: Pointer to the page to put
> > > + *
> > > + * This function unlocks and puts a page.
> > > + */
> > > +static void drm_gpusvm_migration_put_page(struct page *page)
> > > +{
> > > +	unlock_page(page);
> > > +	put_page(page);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migration_put_pages - Put migration pages
> > > + * @npages: Number of pages
> > > + * @migrate_pfn: Array of migrate page frame numbers
> > > + *
> > > + * This function puts an array of pages.
> > > + */
> > > +static void drm_gpusvm_migration_put_pages(unsigned long npages,
> > > +					   unsigned long
> > > *migrate_pfn)
> > > +{
> > > +	unsigned long i;
> > > +
> > > +	for (i = 0; i < npages; ++i) {
> > > +		if (!migrate_pfn[i])
> > > +			continue;
> > > +
> > > +		drm_gpusvm_migration_put_page(migrate_pfn_to_page(migrate_pfn[i]));
> > > +		migrate_pfn[i] = 0;
> > > +	}
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_get_vram_page - Get a reference to a VRAM page
> > > + * @page: Pointer to the page
> > > + * @zdd: Pointer to the GPU SVM zone device data
> > > + *
> > > + * This function associates the given page with the specified
> > > GPU
> > > SVM zone
> > > + * device data and initializes it for zone device usage.
> > > + */
> > > +static void drm_gpusvm_get_vram_page(struct page *page,
> > > +				     struct drm_gpusvm_zdd *zdd)
> > > +{
> > > +	page->zone_device_data = drm_gpusvm_zdd_get(zdd);
> > > +	zone_device_page_init(page);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migrate_map_pages() - Map migration pages for GPU
> > > SVM
> > > migration
> > > + * @dev: The device for which the pages are being mapped
> > > + * @dma_addr: Array to store DMA addresses corresponding to
> > > mapped
> > > pages
> > > + * @migrate_pfn: Array of migrate page frame numbers to map
> > > + * @npages: Number of pages to map
> > > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > > + *
> > > + * This function maps pages of memory for migration usage in GPU
> > > SVM. It
> > > + * iterates over each page frame number provided in
> > > @migrate_pfn,
> > > maps the
> > > + * corresponding page, and stores the DMA address in the
> > > provided
> > > @dma_addr
> > > + * array.
> > > + *
> > > + * Return: 0 on success, -EFAULT if an error occurs during
> > > mapping.
> > > + */
> > > +static int drm_gpusvm_migrate_map_pages(struct device *dev,
> > > +					dma_addr_t *dma_addr,
> > > +					long unsigned int
> > > *migrate_pfn,
> > > +					unsigned long npages,
> > > +					enum dma_data_direction
> > > dir)
> > > +{
> > > +	unsigned long i;
> > > +
> > > +	for (i = 0; i < npages; ++i) {
> > > +		struct page *page =
> > > migrate_pfn_to_page(migrate_pfn[i]);
> > > +
> > > +		if (!page)
> > > +			continue;
> > > +
> > > +		if (WARN_ON_ONCE(is_zone_device_page(page)))
> > > +			return -EFAULT;
> > > +
> > > +		dma_addr[i] = dma_map_page(dev, page, 0,
> > > PAGE_SIZE,
> > > dir);
> > > +		if (dma_mapping_error(dev, dma_addr[i]))
> > > +			return -EFAULT;
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migrate_unmap_pages() - Unmap pages previously
> > > mapped
> > > for GPU SVM migration
> > > + * @dev: The device for which the pages were mapped
> > > + * @dma_addr: Array of DMA addresses corresponding to mapped
> > > pages
> > > + * @npages: Number of pages to unmap
> > > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > > + *
> > > + * This function unmaps previously mapped pages of memory for
> > > GPU
> > > Shared Virtual
> > > + * Memory (SVM). It iterates over each DMA address provided in
> > > @dma_addr, checks
> > > + * if it's valid and not already unmapped, and unmaps the
> > > corresponding page.
> > > + */
> > > +static void drm_gpusvm_migrate_unmap_pages(struct device *dev,
> > > +					   dma_addr_t *dma_addr,
> > > +					   unsigned long npages,
> > > +					   enum dma_data_direction dir)
> > > +{
> > > +	unsigned long i;
> > > +
> > > +	for (i = 0; i < npages; ++i) {
> > > +		if (!dma_addr[i] || dma_mapping_error(dev, dma_addr[i]))
> > > +			continue;
> > > +
> > > +		dma_unmap_page(dev, dma_addr[i], PAGE_SIZE, dir);
> > > +	}
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migrate_to_vram - Migrate GPU SVM range to VRAM
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + * @vram_allocation: Driver-private pointer to the VRAM allocation. The caller
> > > + *                   should hold a reference to the VRAM allocation, which
> > > + *                   should be dropped via ops->vram_release or upon the
> > > + *                   failure of this function.
> > > + * @ctx: GPU SVM context
> > > + *
> > > + * This function migrates the specified GPU SVM range to VRAM. It performs the
> > > + * necessary setup and invokes the driver-specific operations for migration to
> > > + * VRAM. Upon successful return, @vram_allocation can safely reference @range
> > > + * until ops->vram_release is called, which only happens if this function
> > > + * returns successfully.
> > > + *
> > > + * Returns:
> > > + * 0 on success, negative error code on failure.
> > > + */
> > > +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> > > +			       struct drm_gpusvm_range *range,
> > > +			       void *vram_allocation,
> > > +			       const struct drm_gpusvm_ctx *ctx)
> > > +{
> > > +	u64 start = range->va.start, end = range->va.end;
> > > +	struct migrate_vma migrate = {
> > > +		.start		= start,
> > > +		.end		= end,
> > > +		.pgmap_owner	= gpusvm->device_private_page_owner,
> > > +		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
> > > +	};
> > > +	struct mm_struct *mm = gpusvm->mm;
> > > +	unsigned long i, npages = npages_in_range(start, end);
> > > +	struct vm_area_struct *vas;
> > > +	struct drm_gpusvm_zdd *zdd = NULL;
> > > +	struct page **pages;
> > > +	dma_addr_t *dma_addr;
> > > +	void *buf;
> > > +	int err;
> > > +
> > > +	if (!range->flags.migrate_vram)
> > > +		return -EINVAL;
> > > +
> > > +	if (!gpusvm->ops->populate_vram_pfn || !gpusvm->ops->copy_to_vram ||
> > > +	    !gpusvm->ops->copy_to_sram)
> > > +		return -EOPNOTSUPP;
> > > +
> > > +	if (!ctx->mmap_locked) {
> > > +		if (!mmget_not_zero(mm)) {
> > > +			err = -EFAULT;
> > > +			goto err_out;
> > > +		}
> > > +		mmap_write_lock(mm);
> > > +	}
> > > +
> > > +	mmap_assert_locked(mm);
> > > +
> > > +	vas = vma_lookup(mm, start);
> > > +	if (!vas) {
> > > +		err = -ENOENT;
> > > +		goto err_mmunlock;
> > > +	}
> > > +
> > > +	if (end > vas->vm_end || start < vas->vm_start) {
> > > +		err = -EINVAL;
> > > +		goto err_mmunlock;
> > > +	}
> > > +
> > > +	if (!vma_is_anonymous(vas)) {
> > > +		err = -EBUSY;
> > > +		goto err_mmunlock;
> > > +	}
> > > +
> > > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) + sizeof(*dma_addr) +
> > > +		       sizeof(*pages), GFP_KERNEL);
> > > +	if (!buf) {
> > > +		err = -ENOMEM;
> > > +		goto err_mmunlock;
> > > +	}
> > > +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > > +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr)) * npages;
> > > +
> > > +	zdd = drm_gpusvm_zdd_alloc(range);
> > > +	if (!zdd) {
> > > +		err = -ENOMEM;
> > > +		goto err_free;
> > > +	}
> > > +
> > > +	migrate.vma = vas;
> > > +	migrate.src = buf;
> > > +	migrate.dst = migrate.src + npages;
> > > +
> > > +	err = migrate_vma_setup(&migrate);
> > > +	if (err)
> > > +		goto err_free;
> > > +
> > > +	/*
> > > +	 * FIXME: The cases below, !migrate.cpages and migrate.cpages != npages,
> > > +	 * are not always an error. Need to revisit possible cases and how to
> > > +	 * handle them. We could prefault on migrate.cpages != npages via
> > > +	 * hmm_range_fault.
> > > +	 */
> > > +
> > > +	if (!migrate.cpages) {
> > > +		err = -EFAULT;
> > > +		goto err_free;
> > > +	}
> > > +
> > > +	if (migrate.cpages != npages) {
> > > +		err = -EBUSY;
> > > +		goto err_finalize;
> > > +	}
> > > +
> > > +	err = gpusvm->ops->populate_vram_pfn(gpusvm,
> > > vram_allocation, npages,
> > > +					     migrate.dst);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> > > dma_addr,
> > > +					   migrate.src, npages,
> > > DMA_TO_DEVICE);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	for (i = 0; i < npages; ++i) {
> > > +		struct page *page = pfn_to_page(migrate.dst[i]);
> > > +
> > > +		pages[i] = page;
> > > +		migrate.dst[i] = migrate_pfn(migrate.dst[i]);
> > > +		drm_gpusvm_get_vram_page(page, zdd);
> > > +	}
> > > +
> > > +	err = gpusvm->ops->copy_to_vram(gpusvm, pages, dma_addr,
> > > npages);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	/* Upon success bind vram allocation to range and zdd */
> > > +	range->vram_allocation = vram_allocation;
> > > +	WRITE_ONCE(zdd->vram_allocation, vram_allocation);	/* Owns ref */
> > > +
> > > +err_finalize:
> > > +	if (err)
> > > +		drm_gpusvm_migration_put_pages(npages,
> > > migrate.dst);
> > > +	migrate_vma_pages(&migrate);
> > > +	migrate_vma_finalize(&migrate);
> > > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev,
> > > dma_addr,
> > > npages,
> > > +				       DMA_TO_DEVICE);
> > > +err_free:
> > > +	if (zdd)
> > > +		drm_gpusvm_zdd_put(zdd);
> > > +	kvfree(buf);
> > > +err_mmunlock:
> > > +	if (!ctx->mmap_locked) {
> > > +		mmap_write_unlock(mm);
> > > +		mmput(mm);
> > > +	}
> > > +err_out:
> > > +	return err;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migrate_populate_sram_pfn - Populate SRAM PFNs for
> > > a
> > > VM area
> > > + * @vas: Pointer to the VM area structure, can be NULL
> > > + * @npages: Number of pages to populate
> > > + * @src_mpfn: Source array of migrate PFNs
> > > + * @mpfn: Array of migrate PFNs to populate
> > > + * @addr: Start address for PFN allocation
> > > + *
> > > + * This function populates the SRAM migrate page frame numbers
> > > (PFNs) for the
> > > + * specified VM area structure. It allocates and locks pages in
> > > the
> > > VM area for
> > > + * SRAM usage. If vas is non-NULL use alloc_page_vma for
> > > allocation,
> > > if NULL use
> > > + * alloc_page for allocation.
> > > + *
> > > + * Returns:
> > > + * 0 on success, negative error code on failure.
> > > + */
> > > +static int drm_gpusvm_migrate_populate_sram_pfn(struct vm_area_struct *vas,
> > > +						unsigned long npages,
> > > +						unsigned long *src_mpfn,
> > > +						unsigned long *mpfn, u64 addr)
> > > +{
> > > +	unsigned long i;
> > > +
> > > +	for (i = 0; i < npages; ++i, addr += PAGE_SIZE) {
> > > +		struct page *page;
> > > +
> > > +		if (!(src_mpfn[i] & MIGRATE_PFN_MIGRATE))
> > > +			continue;
> > > +
> > > +		if (vas)
> > > +			page = alloc_page_vma(GFP_HIGHUSER, vas,
> > > addr);
> > > +		else
> > > +			page = alloc_page(GFP_HIGHUSER);
> > > +
> > > +		if (!page)
> > > +			return -ENOMEM;
> > > +
> > > +		lock_page(page);
> > > +		mpfn[i] = migrate_pfn(page_to_pfn(page));
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_evict_to_sram - Evict GPU SVM range to SRAM
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + *
> > > + * Similar to __drm_gpusvm_migrate_to_sram but does not require the mmap
> > > + * lock; migration is done via the migrate_device_* functions. This is a
> > > + * fallback path, as it is preferred to issue migrations with the mmap lock
> > > + * held.
> > > + *
> > > + * Returns:
> > > + * 0 on success, negative error code on failure.
> > > + */
> > > +static int drm_gpusvm_evict_to_sram(struct drm_gpusvm *gpusvm,
> > > +				    struct drm_gpusvm_range
> > > *range)
> > > +{
> > > +	unsigned long npages;
> > > +	struct page **pages;
> > > +	unsigned long *src, *dst;
> > > +	dma_addr_t *dma_addr;
> > > +	void *buf;
> > > +	int i, err = 0;
> > > +
> > > +	npages = npages_in_range(range->va.start, range->va.end);
> > > +
> > > +	buf = kvcalloc(npages, 2 * sizeof(*src) + sizeof(*dma_addr) +
> > > +		       sizeof(*pages), GFP_KERNEL);
> > > +	if (!buf) {
> > > +		err = -ENOMEM;
> > > +		goto err_out;
> > > +	}
> > > +	src = buf;
> > > +	dst = buf + (sizeof(*src) * npages);
> > > +	dma_addr = buf + (2 * sizeof(*src) * npages);
> > > +	pages = buf + (2 * sizeof(*src) + sizeof(*dma_addr)) * npages;
> > > +
> > > +	err = gpusvm->ops->populate_vram_pfn(gpusvm, range->vram_allocation,
> > > +					     npages, src);
> > > +	if (err)
> > > +		goto err_free;
> > > +
> > > +	err = migrate_device_vma_range(gpusvm->mm,
> > > +				       gpusvm->device_private_page_owner, src,
> > > +				       npages, range->va.start);
> > > +	if (err)
> > > +		goto err_free;
> > > +
> > > +	err = drm_gpusvm_migrate_populate_sram_pfn(NULL, npages,
> > > src, dst, 0);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> > > dma_addr,
> > > +					   dst, npages,
> > > DMA_BIDIRECTIONAL);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	for (i = 0; i < npages; ++i)
> > > +		pages[i] = migrate_pfn_to_page(src[i]);
> > > +
> > > +	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr,
> > > npages);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +err_finalize:
> > > +	if (err)
> > > +		drm_gpusvm_migration_put_pages(npages, dst);
> > > +	migrate_device_pages(src, dst, npages);
> > > +	migrate_device_finalize(src, dst, npages);
> > > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev,
> > > dma_addr,
> > > npages,
> > > +				       DMA_BIDIRECTIONAL);
> > > +err_free:
> > > +	kvfree(buf);
> > > +err_out:
> > > +
> > > +	return err;
> > > +}
> > > +
> > > +/**
> > > + * __drm_gpusvm_migrate_to_sram - Migrate GPU SVM range to SRAM
> > > (internal)
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @vas: Pointer to the VM area structure
> > > + * @page: Pointer to the page for fault handling (can be NULL)
> > > + * @start: Start address of the migration range
> > > + * @end: End address of the migration range
> > > + *
> > > + * This internal function performs the migration of the
> > > specified
> > > GPU SVM range
> > > + * to SRAM. It sets up the migration, populates + dma maps SRAM
> > > PFNs, and
> > > + * invokes the driver-specific operations for migration to SRAM.
> > > + *
> > > + * Returns:
> > > + * 0 on success, negative error code on failure.
> > > + */
> > > +static int __drm_gpusvm_migrate_to_sram(struct drm_gpusvm
> > > *gpusvm,
> > > +					struct vm_area_struct
> > > *vas,
> > > +					struct page *page,
> > > +					u64 start, u64 end)
> > > +{
> > > +	struct migrate_vma migrate = {
> > > +		.vma		= vas,
> > > +		.pgmap_owner	= gpusvm->device_private_page_owner,
> > > +		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> > > +		.fault_page	= page,
> > > +	};
> > > +	unsigned long npages;
> > > +	struct page **pages;
> > > +	dma_addr_t *dma_addr;
> > > +	void *buf;
> > > +	int i, err = 0;
> > > +
> > > +	mmap_assert_locked(gpusvm->mm);
> > > +
> > > +	/* Corner where VMA area struct has been partially unmapped */
> > > +	if (start < vas->vm_start)
> > > +		start = vas->vm_start;
> > > +	if (end > vas->vm_end)
> > > +		end = vas->vm_end;
> > > +
> > > +	migrate.start = start;
> > > +	migrate.end = end;
> > > +	npages = npages_in_range(start, end);
> > > +
> > > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) + sizeof(*dma_addr) +
> > > +		       sizeof(*pages), GFP_KERNEL);
> > > +	if (!buf) {
> > > +		err = -ENOMEM;
> > > +		goto err_out;
> > > +	}
> > > +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > > +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr)) * npages;
> > > +
> > > +	migrate.vma = vas;
> > > +	migrate.src = buf;
> > > +	migrate.dst = migrate.src + npages;
> > > +
> > > +	err = migrate_vma_setup(&migrate);
> > > +	if (err)
> > > +		goto err_free;
> > > +
> > > +	/* Raced with another CPU fault, nothing to do */
> > > +	if (!migrate.cpages)
> > > +		goto err_free;
> > > +
> > > +	err = drm_gpusvm_migrate_populate_sram_pfn(vas, npages,
> > > +						   migrate.src,
> > > migrate.dst,
> > > +						   start);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> > > dma_addr,
> > > +					   migrate.dst, npages,
> > > +					   DMA_BIDIRECTIONAL);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +	for (i = 0; i < npages; ++i)
> > > +		pages[i] = migrate_pfn_to_page(migrate.src[i]);
> > 
> > See comments below which pages we actually want to migrate.
> > 
> > 
> > > +
> > > +	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr,
> > > npages);
> > > +	if (err)
> > > +		goto err_finalize;
> > > +
> > > +err_finalize:
> > > +	if (err)
> > > +		drm_gpusvm_migration_put_pages(npages,
> > > migrate.dst);
> > > +	migrate_vma_pages(&migrate);
> > > +	migrate_vma_finalize(&migrate);
> > > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev,
> > > dma_addr,
> > > npages,
> > > +				       DMA_BIDIRECTIONAL);
> > > +err_free:
> > > +	kvfree(buf);
> > > +err_out:
> > > +	mmap_assert_locked(gpusvm->mm);
> > > +
> > > +	return err;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migrate_to_sram - Migrate (evict) GPU SVM range to
> > > SRAM
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + * @ctx: GPU SVM context
> > > + *
> > > + * This function initiates the migration of the specified GPU
> > > SVM
> > > range to
> > > + * SRAM. It performs necessary checks and invokes the internal
> > > migration
> > > + * function for actual migration.
> > > + *
> > > + * Returns:
> > > + * 0 on success, negative error code on failure.
> > > + */
> > > +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > > +			       struct drm_gpusvm_range *range,
> > > +			       const struct drm_gpusvm_ctx *ctx)
> > > +{
> > > +	u64 start = range->va.start, end = range->va.end;
> > > +	struct mm_struct *mm = gpusvm->mm;
> > > +	struct vm_area_struct *vas;
> > > +	int err;
> > > +	bool retry = false;
> > > +
> > > +	if (!ctx->mmap_locked) {
> > > +		if (!mmget_not_zero(mm)) {
> > > +			err = -EFAULT;
> > > +			goto err_out;
> > > +		}
> > > +		if (ctx->trylock_mmap) {
> > > +			if (!mmap_read_trylock(mm))  {
> > > +				err = drm_gpusvm_evict_to_sram(gpusvm, range);
> > > +				goto err_mmput;
> > > +			}
> > > +		} else {
> > > +			mmap_read_lock(mm);
> > > +		}
> > > +	}
> > > +
> > > +	mmap_assert_locked(mm);
> > > +
> > > +	/*
> > > +	 * Loop required to find all VMA area structs for the
> > > corner
> > > case when
> > > +	 * VRAM backing has been partially unmapped from MM's
> > > address space.
> > > +	 */
> > > +again:
> > > +	vas = find_vma(mm, start);
> > > +	if (!vas) {
> > > +		if (!retry)
> > > +			err = -ENOENT;
> > > +		goto err_mmunlock;
> > > +	}
> > > +
> > > +	if (end <= vas->vm_start || start >= vas->vm_end) {
> > > +		if (!retry)
> > > +			err = -EINVAL;
> > > +		goto err_mmunlock;
> > > +	}
> > > +
> > > +	err = __drm_gpusvm_migrate_to_sram(gpusvm, vas, NULL,
> > > start,
> > > end);
> > 
> > This function is typically called from the vm side to get a clean
> > mm as
> > a last resort after get_pages() fail. As such should we have it
> > evict
> > *everything*, even foreign device memory, and mismatching local
> > device
> > pages. If so, we could use hmm_range_fault() with a NULL page owner
> > +
> > faulting to do that.
> > 
> 
> I've actually tried that and it seemed to mostly work well; it would
> also be my preference as this avoids a VMA lookup in GPU SVM.
> 
> I think it is a problem, though, if some of the pages are partially
> unmapped, as hmm_range_fault will abort if a fault cannot be resolved.
> Maybe I'm mistaken on this. I won't get this in rev2 but will put this on
> my list to continue to play around with.

OK. Presumably if faulting fails we should try a narrower range unless
the page actually hitting the gpu pagefault is unmapped, to ensure we
make progress rather than aborting?
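
Roughly something like this is what I have in mind (sketch only, untested;
drm_gpusvm_evict_chunk() below is a made-up placeholder for however the
narrower-range eviction ends up being implemented, e.g. on top of
hmm_range_fault() with a NULL owner):

/*
 * Sketch: retry with progressively narrower, page-aligned ranges around the
 * faulting address so the fault that triggered us always makes progress.
 * Assumes range boundaries are page aligned, as they are for GPU SVM ranges.
 */
static int drm_gpusvm_evict_narrowing(struct drm_gpusvm *gpusvm,
				      struct drm_gpusvm_range *range,
				      u64 fault_addr)
{
	u64 start = range->va.start, end = range->va.end;
	int err;

	for (;;) {
		err = drm_gpusvm_evict_chunk(gpusvm, start, end);
		if (err != -EFAULT)
			return err;

		/* Down to the faulting page itself, nothing left to narrow */
		if (end - start <= PAGE_SIZE)
			return err;

		/* Halve the range, keeping the half containing fault_addr */
		if (fault_addr < start + (end - start) / 2)
			end = start + (end - start) / 2;
		else
			start = start + (end - start) / 2;
	}
}

Whether the extra complexity is worth it over just falling back to evicting
the faulting page is of course up for debate.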


> 
> > > +	if (err)
> > > +		goto err_mmunlock;
> > > +
> > > +	if (vas->vm_end < end) {
> > > +		retry = true;
> > > +		start = vas->vm_end;
> > > +		goto again;
> > > +	}
> > > +
> > > +	if (!ctx->mmap_locked) {
> > > +		mmap_read_unlock(mm);
> > > +		/*
> > > +		 * Using mmput_async as this function can be
> > > called
> > > while
> > > +		 * holding a dma-resv lock, and a final put can
> > > grab
> > > the mmap
> > > +		 * lock, causing a lock inversion.
> > > +		 */
> > > +		mmput_async(mm);
> > > +	}
> > > +
> > > +	return 0;
> > > +
> > > +err_mmunlock:
> > > +	if (!ctx->mmap_locked)
> > > +		mmap_read_unlock(mm);
> > > +err_mmput:
> > > +	if (!ctx->mmap_locked)
> > > +		mmput_async(mm);
> > > +err_out:
> > > +	return err;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_page_free - Put GPU SVM zone device data
> > > associated
> > > with a page
> > > + * @page: Pointer to the page
> > > + *
> > > + * This function is a callback used to put the GPU SVM zone
> > > device
> > > data
> > > + * associated with a page when it is being released.
> > > + */
> > > +static void drm_gpusvm_page_free(struct page *page)
> > > +{
> > > +	drm_gpusvm_zdd_put(page->zone_device_data);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migrate_to_ram - Migrate GPU SVM range to RAM
> > > (page
> > > fault handler)
> > > + * @vmf: Pointer to the fault information structure
> > > + *
> > > + * This function is a page fault handler used to migrate a GPU
> > > SVM
> > > range to RAM.
> > > + * It retrieves the GPU SVM range information from the faulting
> > > page
> > > and invokes
> > > + * the internal migration function to migrate the range back to
> > > RAM.
> > > + *
> > > + * Returns:
> > > + * VM_FAULT_SIGBUS on failure, 0 on success.
> > > + */
> > > +static vm_fault_t drm_gpusvm_migrate_to_ram(struct vm_fault
> > > *vmf)
> > > +{
> > > +	struct drm_gpusvm_zdd *zdd = vmf->page->zone_device_data;
> > > +	int err;
> > > +
> > > +	err = __drm_gpusvm_migrate_to_sram(zdd->range->gpusvm,
> > > +					   vmf->vma, vmf->page,
> > > +					   zdd->range->va.start,
> > > +					   zdd->range->va.end);
> > 
> > When called from here, since this is a pagemap op, we should ensure
> > we
> > only migrate our own pagemap to RAM?
> > 
> 
> I think you resolve this with the following the patch [1], right? I
> think I agree.

It doesn't fully resolve it, but adds the capability to do more
specific filtering. Another option would be to use the pagemap ptr
rather than the device ptr as device_private owner, but that would OTOH
require a wider filtering in hmm_range_fault() so that (or a similar)
patch would be needed anyway.
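
For illustration, the migrate_to_ram() side of that alternative would look
roughly like below (sketch only, not something I'm proposing for this series;
the helper name is made up):

/*
 * Sketch: with the pagemap ptr as the device-private owner, the CPU fault
 * handler selects only pages belonging to the faulting page's pagemap, so
 * foreign device-private pages are left untouched.
 */
static int drm_gpusvm_setup_migrate_own_pagemap(struct vm_fault *vmf,
						struct migrate_vma *migrate,
						unsigned long start,
						unsigned long end)
{
	migrate->vma = vmf->vma;
	migrate->start = start;
	migrate->end = end;
	/* Under the pagemap-as-owner convention this is the faulting pagemap */
	migrate->pgmap_owner = vmf->page->pgmap->owner;
	migrate->flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
	migrate->fault_page = vmf->page;

	return migrate_vma_setup(migrate);
}

The get_pages() / hmm_range_fault() side would then need the wider filtering,
which is where that series (or something like it) comes in.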

Thanks,
Thomas

> 
> Matt
> 
> [1] https://patchwork.freedesktop.org/series/139994/
> 
> > /Thanks,
> > Thomas
> > 


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory
  2024-10-16  6:27       ` Thomas Hellström
@ 2024-10-16  8:24         ` Matthew Brost
  0 siblings, 0 replies; 100+ messages in thread
From: Matthew Brost @ 2024-10-16  8:24 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: intel-xe, dri-devel, airlied, christian.koenig, matthew.auld,
	daniel

On Wed, Oct 16, 2024 at 08:27:51AM +0200, Thomas Hellström wrote:
> On Wed, 2024-10-16 at 03:18 +0000, Matthew Brost wrote:
> > On Wed, Oct 09, 2024 at 12:50:42PM +0200, Thomas Hellström wrote:
> > > Hi, Matthew.
> > > 
> > > Some comments below around migrating to SRAM.
> > > 
> > > 
> > > On Tue, 2024-08-27 at 19:48 -0700, Matthew Brost wrote:
> > > > This patch introduces support for GPU Shared Virtual Memory (SVM)
> > > > in
> > > > the
> > > > Direct Rendering Manager (DRM) subsystem. SVM allows for seamless
> > > > sharing of memory between the CPU and GPU, enhancing performance
> > > > and
> > > > flexibility in GPU computing tasks.
> > > > 
> > > > The patch adds the necessary infrastructure for SVM, including
> > > > data
> > > > structures and functions for managing SVM ranges and notifiers.
> > > > It
> > > > also
> > > > provides mechanisms for allocating, deallocating, and migrating
> > > > memory
> > > > regions between system RAM and GPU VRAM.
> > > > 
> > > > This mid-layer is largely inspired by GPUVM.
> > > > 
> > > > Cc: Dave Airlie <airlied@redhat.com>
> > > > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > > Cc: Christian König <christian.koenig@amd.com>
> > > > Cc: <dri-devel@lists.freedesktop.org>
> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > ---
> > > >  drivers/gpu/drm/xe/Makefile     |    3 +-
> > > >  drivers/gpu/drm/xe/drm_gpusvm.c | 2174
> > > > +++++++++++++++++++++++++++++++
> > > >  drivers/gpu/drm/xe/drm_gpusvm.h |  415 ++++++
> > > >  3 files changed, 2591 insertions(+), 1 deletion(-)
> > > >  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.c
> > > >  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.h
> > > > 
> > > > diff --git a/drivers/gpu/drm/xe/Makefile
> > > > b/drivers/gpu/drm/xe/Makefile
> > > > index b9670ae09a9e..b8fc2ee58f1a 100644
> > > > --- a/drivers/gpu/drm/xe/Makefile
> > > > +++ b/drivers/gpu/drm/xe/Makefile
> > > > @@ -25,7 +25,8 @@ $(obj)/generated/%_wa_oob.c
> > > > $(obj)/generated/%_wa_oob.h: $(obj)/xe_gen_wa_oob \
> > > >  
> > > >  # core driver code
> > > >  
> > > > -xe-y += xe_bb.o \
> > > > +xe-y += drm_gpusvm.o \
> > > > +	xe_bb.o \
> > > >  	xe_bo.o \
> > > >  	xe_bo_evict.o \
> > > >  	xe_devcoredump.o \
> > > > diff --git a/drivers/gpu/drm/xe/drm_gpusvm.c
> > > > b/drivers/gpu/drm/xe/drm_gpusvm.c
> > > > new file mode 100644
> > > > index 000000000000..fc1e44e6ae72
> > > > --- /dev/null
> > > > +++ b/drivers/gpu/drm/xe/drm_gpusvm.c
> > > > @@ -0,0 +1,2174 @@
> > > > +// SPDX-License-Identifier: MIT
> > > > +/*
> > > > + * Copyright © 2024 Intel Corporation
> > > > + *
> > > > + * Authors:
> > > > + *     Matthew Brost <matthew.brost@intel.com>
> > > > + */
> > > > +
> > > > +#include <linux/dma-mapping.h>
> > > > +#include <linux/interval_tree_generic.h>
> > > > +#include <linux/hmm.h>
> > > > +#include <linux/memremap.h>
> > > > +#include <linux/migrate.h>
> > > > +#include <linux/mm_types.h>
> > > > +#include <linux/pagemap.h>
> > > > +#include <linux/slab.h>
> > > > +
> > > > +#include <drm/drm_device.h>
> > > > +#include "drm_gpusvm.h"
> > > > +
> > > > +/**
> > > > + * DOC: Overview
> > > > + *
> > > > + * GPU Shared Virtual Memory (GPU SVM) layer for the Direct
> > > > Rendering Manager (DRM)
> > > > + *
> > > > + * The GPU SVM layer is a component of the DRM framework
> > > > designed to
> > > > manage shared
> > > > + * virtual memory between the CPU and GPU. It enables efficient
> > > > data
> > > > exchange and
> > > > + * processing for GPU-accelerated applications by allowing
> > > > memory
> > > > sharing and
> > > > + * synchronization between the CPU's and GPU's virtual address
> > > > spaces.
> > > > + *
> > > > + * Key GPU SVM Components:
> > > > + * - Notifiers: Used for tracking memory intervals and notifying the
> > > > + *		GPU of changes, notifiers are sized based on a GPU SVM
> > > > + *		initialization parameter, with a recommendation of 512M or
> > > > + *		larger. They maintain a Red-Black tree and a list of ranges that
> > > > + *		fall within the notifier interval. Notifiers are tracked within
> > > > + *		a GPU SVM Red-Black tree and list and are dynamically inserted
> > > > + *		or removed as ranges within the interval are created or
> > > > + *		destroyed.
> > > > + * - Ranges: Represent memory ranges mapped in a DRM device and
> > > > managed
> > > > + *	     by GPU SVM. They are sized based on an array of
> > > > chunk
> > > > sizes, which
> > > > + *	     is a GPU SVM initialization parameter, and the CPU
> > > > address space.
> > > > + *	     Upon GPU fault, the largest aligned chunk that fits
> > > > within the
> > > > + *	     faulting CPU address space is chosen for the range
> > > > size. Ranges are
> > > > + *	     expected to be dynamically allocated on GPU fault
> > > > and
> > > > removed on an
> > > > + *	     MMU notifier UNMAP event. As mentioned above,
> > > > ranges
> > > > are tracked in
> > > > + *	     a notifier's Red-Black tree.
> > > > + * - Operations: Define the interface for driver-specific SVM
> > > > operations such as
> > > > + *		 allocation, page collection, migration,
> > > > invalidations, and VRAM
> > > > + *		 release.
> > > > + *
> > > > + * This layer provides interfaces for allocating, mapping,
> > > > migrating, and
> > > > + * releasing memory ranges between the CPU and GPU. It handles
> > > > all
> > > > core memory
> > > > + * management interactions (DMA mapping, HMM, and migration) and
> > > > provides
> > > > + * driver-specific virtual functions (vfuncs). This
> > > > infrastructure
> > > > is sufficient
> > > > + * to build the expected driver components for an SVM
> > > > implementation
> > > > as detailed
> > > > + * below.
> > > > + *
> > > > + * Expected Driver Components:
> > > > + * - GPU page fault handler: Used to create ranges and notifiers
> > > > based on the
> > > > + *			     fault address, optionally migrate
> > > > the
> > > > range to
> > > > + *			     VRAM, and create GPU bindings.
> > > > + * - Garbage collector: Used to destroy GPU bindings for ranges.
> > > > Ranges are
> > > > + *			expected to be added to the garbage
> > > > collector upon
> > > > + *			MMU_NOTIFY_UNMAP event.
> > > > + */
> > > > +
> > > > +/**
> > > > + * DOC: Locking
> > > > + *
> > > > + * GPU SVM handles locking for core MM interactions, i.e., it
> > > > locks/unlocks the
> > > > + * mmap lock as needed. Alternatively, if the driver prefers to
> > > > handle the mmap
> > > > + * lock itself, a 'locked' argument is provided to the functions
> > > > that require
> > > > + * the mmap lock. This option may be useful for drivers that
> > > > need to
> > > > call into
> > > > + * GPU SVM while also holding a dma-resv lock, thus preventing
> > > > locking
> > > > + * inversions between the mmap and dma-resv locks.
> > > > + *
> > > > + * GPU SVM introduces a global notifier lock, which safeguards
> > > > the
> > > > notifier's
> > > > + * range RB tree and list, as well as the range's DMA mappings
> > > > and
> > > > sequence
> > > > + * number. GPU SVM manages all necessary locking and unlocking
> > > > operations,
> > > > + * except for the recheck of the range's sequence number
> > > > + * (mmu_interval_read_retry) when the driver is committing GPU
> > > > bindings. This
> > > > + * lock corresponds to the 'driver->update' lock mentioned in
> > > > the
> > > > HMM
> > > > + * documentation (TODO: Link). Future revisions may transition
> > > > from
> > > > a GPU SVM
> > > > + * global lock to a per-notifier lock if finer-grained locking
> > > > is
> > > > deemed
> > > > + * necessary.
> > > > + *
> > > > + * In addition to the locking mentioned above, the driver should
> > > > implement a
> > > > + * lock to safeguard core GPU SVM function calls that modify
> > > > state,
> > > > such as
> > > > + * drm_gpusvm_range_find_or_insert and drm_gpusvm_range_remove.
> > > > Alternatively,
> > > > + * these core functions can be called within a single kernel
> > > > thread,
> > > > for
> > > > + * instance, using an ordered work queue. This lock is denoted
> > > > as
> > > > + * 'driver_svm_lock' in code examples.
> > > > + */
> > > > +
> > > > +/**
> > > > + * DOC: Migration
> > > > + *
> > > > + * The migration support is quite simple, allowing migration
> > > > between
> > > > SRAM and
> > > > + * VRAM at the range granularity. For example, GPU SVM currently
> > > > does not
> > > > + * support mixing SRAM and VRAM pages within a range. This means
> > > > that upon GPU
> > > > + * fault, the entire range can be migrated to VRAM, and upon CPU
> > > > fault, the
> > > > + * entire range is migrated to SRAM.
> > > > + *
> > > > + * The reasoning for only supporting range granularity is as
> > > > follows: it
> > > > + * simplifies the implementation, and range sizes are driver-
> > > > defined
> > > > and should
> > > > + * be relatively small.
> > > > + */
> > > > +
> > > > +/**
> > > > + * DOC: Partial Unmapping of Ranges
> > > > + *
> > > > + * Partial unmapping of ranges (e.g., 1M out of 2M is unmapped
> > > > by
> > > > CPU resulting
> > > > + * in MMU_NOTIFY_UNMAP event) presents several challenges, with
> > > > the
> > > > main one
> > > > + * being that a subset of the range still has CPU and GPU
> > > > mappings.
> > > > If the
> > > > + * backing store for the range is in VRAM, a subset of the
> > > > backing
> > > > store has
> > > > + * references. One option would be to split the range and VRAM
> > > > backing store,
> > > > + * but the implementation for this would be quite complicated.
> > > > Given
> > > > that
> > > > + * partial unmappings are rare and driver-defined range sizes
> > > > are
> > > > relatively
> > > > + * small, GPU SVM does not support splitting of ranges.
> > > > + *
> > > > + * With no support for range splitting, upon partial unmapping
> > > > of a
> > > > range, the
> > > > + * driver is expected to invalidate and destroy the entire
> > > > range. If
> > > > the range
> > > > + * has VRAM as its backing, the driver is also expected to
> > > > migrate
> > > > any remaining
> > > > + * pages back to SRAM.
> > > > + */
> > > > +
> > > > +/**
> > > > + * DOC: Examples
> > > > + *
> > > > + * This section provides two examples of how to build the
> > > > expected
> > > > driver
> > > > + * components: the GPU page fault handler and the garbage
> > > > collector.
> > > > A third
> > > > + * example demonstrates a sample invalidation driver vfunc.
> > > > + *
> > > > + * The generic code provided does not include logic for complex
> > > > migration
> > > > + * policies, optimized invalidations, or other potentially
> > > > required
> > > > driver
> > > > + * locking (e.g., DMA-resv locks).
> > > > + *
> > > > + * 1) GPU page fault handler
> > > > + *
> > > > + *	int driver_bind_range(struct drm_gpusvm *gpusvm, struct
> > > > drm_gpusvm_range *range)
> > > > + *	{
> > > > + *		int err = 0;
> > > > + *
> > > > + *		driver_alloc_and_setup_memory_for_bind(gpusvm,
> > > > range);
> > > > + *
> > > > + *		drm_gpusvm_notifier_lock(gpusvm);
> > > > + *		if (drm_gpusvm_range_pages_valid(range))
> > > > + *			driver_commit_bind(gpusvm, range);
> > > > + *		else
> > > > + *			err = -EAGAIN;
> > > > + *		drm_gpusvm_notifier_unlock(gpusvm);
> > > > + *
> > > > + *		return err;
> > > > + *	}
> > > > + *
> > > > + *	int driver_gpu_fault(struct drm_gpusvm *gpusvm, u64
> > > > fault_addr,
> > > > + *			     u64 gpuva_start, u64 gpuva_end)
> > > > + *	{
> > > > + *		struct drm_gpusvm_ctx ctx = {};
> > > > + *		int err;
> > > > + *
> > > > + *		driver_svm_lock();
> > > > + *	retry:
> > > > + *		// Always process UNMAPs first so view of GPU
> > > > SVM
> > > > ranges is current
> > > > + *		driver_garbage_collector(gpusvm);
> > > > + *
> > > > + *		range = drm_gpusvm_range_find_or_insert(gpusvm, fault_addr,
> > > > + *							gpuva_start, gpuva_end,
> > > > + *						        &ctx);
> > > > + *		if (IS_ERR(range)) {
> > > > + *			err = PTR_ERR(range);
> > > > + *			goto unlock;
> > > > + *		}
> > > > + *
> > > > + *		if (driver_migration_policy(range)) {
> > > > + *			bo = driver_alloc_bo();
> > > > + *			err = drm_gpusvm_migrate_to_vram(gpusvm,
> > > > range, bo, &ctx);
> > > > + *			if (err)	// CPU mappings may have
> > > > changed
> > > > + *				goto retry;
> > > > + *		}
> > > > + *
> > > > + *		err = drm_gpusvm_range_get_pages(gpusvm, range,
> > > > &ctx);
> > > > + *		if (err == -EFAULT || err == -EPERM)	// CPU
> > > > mappings changed
> > > > + *			goto retry;
> > > > + *		else if (err)
> > > > + *			goto unlock;
> > > > + *
> > > > + *		err = driver_bind_range(gpusvm, range);
> > > > + *		if (err == -EAGAIN)	// CPU mappings changed
> > > > + *			goto retry
> > > > + *
> > > > + *	unlock:
> > > > + *		driver_svm_unlock();
> > > > + *		return err;
> > > > + *	}
> > > > + *
> > > > + * 2) Garbage Collector.
> > > > + *
> > > > + *	void __driver_garbage_collector(struct drm_gpusvm
> > > > *gpusvm,
> > > > + *					struct drm_gpusvm_range
> > > > *range)
> > > > + *	{
> > > > + *		struct drm_gpusvm_ctx ctx = {};
> > > > + *
> > > > + *		assert_driver_svm_locked(gpusvm);
> > > > + *
> > > > + *		// Partial unmap, migrate any remaining VRAM
> > > > pages
> > > > back to SRAM
> > > > + *		if (range->flags.partial_unmap)
> > > > + *			drm_gpusvm_migrate_to_sram(gpusvm,
> > > > range,
> > > > &ctx);
> > > > + *
> > > > + *		driver_unbind_range(range);
> > > > + *		drm_gpusvm_range_remove(gpusvm, range);
> > > > + *	}
> > > > + *
> > > > + *	void driver_garbage_collector(struct drm_gpusvm *gpusvm)
> > > > + *	{
> > > > + *		assert_driver_svm_locked(gpusvm);
> > > > + *
> > > > + *		for_each_range_in_garbage_collector(gpusvm,
> > > > range)
> > > > + *			__driver_garbage_collector(gpusvm,
> > > > range);
> > > > + *	}
> > > > + *
> > > > + * 3) Invalidation driver vfunc.
> > > > + *
> > > > + *	void driver_invalidation(struct drm_gpusvm *gpusvm,
> > > > + *				 struct drm_gpusvm_notifier
> > > > *notifier,
> > > > + *				 const struct mmu_notifier_range
> > > > *mmu_range)
> > > > + *	{
> > > > + *		struct drm_gpusvm_ctx ctx = { .in_notifier =
> > > > true,
> > > > };
> > > > + *		struct drm_gpusvm_range *range = NULL;
> > > > + *
> > > > + *		driver_invalidate_device_tlb(gpusvm, mmu_range->start, mmu_range->end);
> > > > + *
> > > > + *		drm_gpusvm_for_each_range(range, notifier,
> > > > mmu_range->start,
> > > > + *					  mmu_range->end) {
> > > > + *			drm_gpusvm_range_unmap_pages(gpusvm,
> > > > range,
> > > > &ctx);
> > > > + *
> > > > + *			if (mmu_range->event !=
> > > > MMU_NOTIFY_UNMAP)
> > > > + *				continue;
> > > > + *
> > > > + *			drm_gpusvm_range_set_unmapped(range,
> > > > mmu_range);
> > > > + *			driver_garbage_collector_add(gpusvm,
> > > > range);
> > > > + *		}
> > > > + *	}
> > > > + */
> > > > +
> > > > +#define DRM_GPUSVM_RANGE_START(_range)	((_range)->va.start)
> > > > +#define DRM_GPUSVM_RANGE_END(_range)	((_range)->va.end - 1)
> > > > +INTERVAL_TREE_DEFINE(struct drm_gpusvm_range, rb.node, u64,
> > > > rb.__subtree_last,
> > > > +		     DRM_GPUSVM_RANGE_START,
> > > > DRM_GPUSVM_RANGE_END,
> > > > +		     static __maybe_unused, range);
> > > > +
> > > > +#define DRM_GPUSVM_NOTIFIER_START(_notifier)	((_notifier)->interval.start)
> > > > +#define DRM_GPUSVM_NOTIFIER_END(_notifier)	((_notifier)->interval.end - 1)
> > > > +INTERVAL_TREE_DEFINE(struct drm_gpusvm_notifier, rb.node, u64,
> > > > +		     rb.__subtree_last,
> > > > DRM_GPUSVM_NOTIFIER_START,
> > > > +		     DRM_GPUSVM_NOTIFIER_END, static
> > > > __maybe_unused,
> > > > notifier);
> > > > +
> > > > +/**
> > > > + * npages_in_range() - Calculate the number of pages in a given
> > > > range
> > > > + * @start__: The start address of the range
> > > > + * @end__: The end address of the range
> > > > + *
> > > > + * This macro calculates the number of pages in a given memory
> > > > range,
> > > > + * specified by the start and end addresses. It divides the
> > > > difference
> > > > + * between the end and start addresses by the page size
> > > > (PAGE_SIZE)
> > > > to
> > > > + * determine the number of pages in the range.
> > > > + *
> > > > + * Return: The number of pages in the specified range.
> > > > + */
> > > > +#define npages_in_range(start__, end__)	\
> > > > +	(((end__) - (start__)) >> PAGE_SHIFT)
> > > > +
> > > > +/**
> > > > + * struct drm_gpusvm_zdd - GPU SVM zone device data
> > > > + *
> > > > + * @refcount: Reference count for the zdd
> > > > + * @destroy_work: Work structure for asynchronous zdd
> > > > destruction
> > > > + * @range: Pointer to the GPU SVM range
> > > > + * @vram_allocation: Driver-private pointer to the VRAM
> > > > allocation
> > > > + *
> > > > + * This structure serves as a generic wrapper installed in
> > > > + * page->zone_device_data. It provides infrastructure for
> > > > looking up
> > > > a range
> > > > + * upon CPU page fault and asynchronously releasing VRAM once
> > > > the
> > > > CPU has no
> > > > + * page references. Asynchronous release is useful because CPU
> > > > page
> > > > references
> > > > + * can be dropped in IRQ contexts, while releasing VRAM likely
> > > > requires sleeping
> > > > + * locks.
> > > > + */
> > > > +struct drm_gpusvm_zdd {
> > > > +	struct kref refcount;
> > > > +	struct work_struct destroy_work;
> > > > +	struct drm_gpusvm_range *range;
> > > > +	void *vram_allocation;
> > > > +};
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_zdd_destroy_work_func - Work function for
> > > > destroying a
> > > > zdd
> > > > + * @w: Pointer to the work_struct
> > > > + *
> > > > + * This function releases VRAM, puts GPU SVM range, and frees
> > > > zdd.
> > > > + */
> > > > +static void drm_gpusvm_zdd_destroy_work_func(struct work_struct
> > > > *w)
> > > > +{
> > > > +	struct drm_gpusvm_zdd *zdd =
> > > > +		container_of(w, struct drm_gpusvm_zdd,
> > > > destroy_work);
> > > > +	struct drm_gpusvm_range *range = zdd->range;
> > > > +	struct drm_gpusvm *gpusvm = range->gpusvm;
> > > > +
> > > > +	if (gpusvm->ops->vram_release && zdd->vram_allocation)
> > > > +		gpusvm->ops->vram_release(zdd->vram_allocation);
> > > > +	drm_gpusvm_range_put(range);
> > > > +	kfree(zdd);
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_zdd_alloc - Allocate a zdd structure.
> > > > + * @range: Pointer to the GPU SVM range.
> > > > + *
> > > > + * This function allocates and initializes a new zdd structure.
> > > > It
> > > > sets up the
> > > > + * reference count, initializes the destroy work, and links the
> > > > provided GPU SVM
> > > > + * range.
> > > > + *
> > > > + * Returns:
> > > > + * Pointer to the allocated zdd on success, ERR_PTR() on
> > > > failure.
> > > > + */
> > > > +static struct drm_gpusvm_zdd *
> > > > +drm_gpusvm_zdd_alloc(struct drm_gpusvm_range *range)
> > > > +{
> > > > +	struct drm_gpusvm_zdd *zdd;
> > > > +
> > > > +	zdd = kmalloc(sizeof(*zdd), GFP_KERNEL);
> > > > +	if (!zdd)
> > > > +		return NULL;
> > > > +
> > > > +	kref_init(&zdd->refcount);
> > > > +	INIT_WORK(&zdd->destroy_work,
> > > > drm_gpusvm_zdd_destroy_work_func);
> > > > +	zdd->range = drm_gpusvm_range_get(range);
> > > > +	zdd->vram_allocation = NULL;
> > > > +
> > > > +	return zdd;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_zdd_get - Get a reference to a zdd structure.
> > > > + * @zdd: Pointer to the zdd structure.
> > > > + *
> > > > + * This function increments the reference count of the provided
> > > > zdd
> > > > structure.
> > > > + *
> > > > + * Returns: Pointer to the zdd structure.
> > > > + */
> > > > +static struct drm_gpusvm_zdd *drm_gpusvm_zdd_get(struct
> > > > drm_gpusvm_zdd *zdd)
> > > > +{
> > > > +	kref_get(&zdd->refcount);
> > > > +	return zdd;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_zdd_destroy - Destroy a zdd structure.
> > > > + * @ref: Pointer to the reference count structure.
> > > > + *
> > > > + * This function queues the destroy_work of the zdd for
> > > > asynchronous
> > > > destruction.
> > > > + */
> > > > +static void drm_gpusvm_zdd_destroy(struct kref *ref)
> > > > +{
> > > > +	struct drm_gpusvm_zdd *zdd =
> > > > +		container_of(ref, struct drm_gpusvm_zdd,
> > > > refcount);
> > > > +	struct drm_gpusvm *gpusvm = zdd->range->gpusvm;
> > > > +
> > > > +	queue_work(gpusvm->zdd_wq, &zdd->destroy_work);
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_zdd_put - Put a zdd reference.
> > > > + * @zdd: Pointer to the zdd structure.
> > > > + *
> > > > + * This function decrements the reference count of the provided
> > > > zdd
> > > > structure
> > > > + * and schedules its destruction if the count drops to zero.
> > > > + */
> > > > +static void drm_gpusvm_zdd_put(struct drm_gpusvm_zdd *zdd)
> > > > +{
> > > > +	kref_put(&zdd->refcount, drm_gpusvm_zdd_destroy);
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_find - Find GPU SVM range from GPU SVM
> > > > notifier
> > > > + * @notifier: Pointer to the GPU SVM notifier structure.
> > > > + * @start: Start address of the range
> > > > + * @end: End address of the range
> > > > + *
> > > > + * Return: A pointer to the drm_gpusvm_range if found or NULL
> > > > + */
> > > > +struct drm_gpusvm_range *
> > > > +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64
> > > > start, u64 end)
> > > > +{
> > > > +	return range_iter_first(&notifier->root, start, end -
> > > > 1);
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_for_each_range_safe - Safely iterate over GPU SVM
> > > > ranges in a notifier
> > > > + * @range__: Iterator variable for the ranges
> > > > + * @next__: Iterator variable for the ranges temporary storage
> > > > + * @notifier__: Pointer to the GPU SVM notifier
> > > > + * @start__: Start address of the range
> > > > + * @end__: End address of the range
> > > > + *
> > > > + * This macro is used to iterate over GPU SVM ranges in a
> > > > notifier
> > > > while
> > > > + * removing ranges from it.
> > > > + */
> > > > +#define drm_gpusvm_for_each_range_safe(range__, next__, notifier__, start__, end__)	\
> > > > +	for ((range__) = drm_gpusvm_range_find((notifier__), (start__), (end__)),	\
> > > > +	     (next__) = __drm_gpusvm_range_next(range__);				\
> > > > +	     (range__) && (range__->va.start < (end__));				\
> > > > +	     (range__) = (next__), (next__) = __drm_gpusvm_range_next(range__))
> > > > +
> > > > +/**
> > > > + * __drm_gpusvm_notifier_next - get the next drm_gpusvm_notifier
> > > > in
> > > > the list
> > > > + * @notifier: a pointer to the current drm_gpusvm_notifier
> > > > + *
> > > > + * Return: A pointer to the next drm_gpusvm_notifier if
> > > > available,
> > > > or NULL if
> > > > + *         the current notifier is the last one or if the input
> > > > notifier is
> > > > + *         NULL.
> > > > + */
> > > > +static struct drm_gpusvm_notifier *
> > > > +__drm_gpusvm_notifier_next(struct drm_gpusvm_notifier *notifier)
> > > > +{
> > > > +	if (notifier && !list_is_last(&notifier->rb.entry,
> > > > +				      &notifier->gpusvm-
> > > > > notifier_list))
> > > > +		return list_next_entry(notifier, rb.entry);
> > > > +
> > > > +	return NULL;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_for_each_notifier - Iterate over GPU SVM notifiers
> > > > in
> > > > a gpusvm
> > > > + * @notifier__: Iterator variable for the notifiers
> > > > + * @gpusvm__: Pointer to the GPU SVM structure
> > > > + * @start__: Start address of the notifier
> > > > + * @end__: End address of the notifier
> > > > + *
> > > > + * This macro is used to iterate over GPU SVM notifiers in a
> > > > gpusvm.
> > > > + */
> > > > +#define drm_gpusvm_for_each_notifier(notifier__, gpusvm__, start__, end__)		\
> > > > +	for ((notifier__) = notifier_iter_first(&(gpusvm__)->root, (start__), (end__) - 1);	\
> > > > +	     (notifier__) && (notifier__->interval.start < (end__));			\
> > > > +	     (notifier__) = __drm_gpusvm_notifier_next(notifier__))
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_for_each_notifier_safe - Safely iterate over GPU
> > > > SVM
> > > > notifiers in a gpusvm
> > > > + * @notifier__: Iterator variable for the notifiers
> > > > + * @next__: Iterator variable for the notifiers temporary storage
> > > > + * @gpusvm__: Pointer to the GPU SVM structure
> > > > + * @start__: Start address of the notifier
> > > > + * @end__: End address of the notifier
> > > > + *
> > > > + * This macro is used to iterate over GPU SVM notifiers in a
> > > > gpusvm
> > > > while
> > > > + * removing notifiers from it.
> > > > + */
> > > > +#define drm_gpusvm_for_each_notifier_safe(notifier__, next__, gpusvm__, start__, end__)	\
> > > > +	for ((notifier__) = notifier_iter_first(&(gpusvm__)->root, (start__), (end__) - 1),	\
> > > > +	     (next__) = __drm_gpusvm_notifier_next(notifier__);				\
> > > > +	     (notifier__) && (notifier__->interval.start < (end__));			\
> > > > +	     (notifier__) = (next__), (next__) = __drm_gpusvm_notifier_next(notifier__))
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_notifier_invalidate - Invalidate a GPU SVM
> > > > notifier.
> > > > + * @mni: Pointer to the mmu_interval_notifier structure.
> > > > + * @mmu_range: Pointer to the mmu_notifier_range structure.
> > > > + * @cur_seq: Current sequence number.
> > > > + *
> > > > + * This function serves as a generic MMU notifier for GPU SVM.
> > > > It
> > > > sets the MMU
> > > > + * notifier sequence number and calls the driver invalidate
> > > > vfunc
> > > > under
> > > > + * gpusvm->notifier_lock.
> > > > + *
> > > > + * Returns:
> > > > + * true if the operation succeeds, false otherwise.
> > > > + */
> > > > +static bool
> > > > +drm_gpusvm_notifier_invalidate(struct mmu_interval_notifier
> > > > *mni,
> > > > +			       const struct mmu_notifier_range
> > > > *mmu_range,
> > > > +			       unsigned long cur_seq)
> > > > +{
> > > > +	struct drm_gpusvm_notifier *notifier =
> > > > +		container_of(mni, typeof(*notifier), notifier);
> > > > +	struct drm_gpusvm *gpusvm = notifier->gpusvm;
> > > > +
> > > > +	if (!mmu_notifier_range_blockable(mmu_range))
> > > > +		return false;
> > > > +
> > > > +	down_write(&gpusvm->notifier_lock);
> > > > +	mmu_interval_set_seq(mni, cur_seq);
> > > > +	gpusvm->ops->invalidate(gpusvm, notifier, mmu_range);
> > > > +	up_write(&gpusvm->notifier_lock);
> > > > +
> > > > +	return true;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_notifier_ops - MMU interval notifier operations
> > > > for
> > > > GPU SVM
> > > > + */
> > > > +static const struct mmu_interval_notifier_ops
> > > > drm_gpusvm_notifier_ops = {
> > > > +	.invalidate = drm_gpusvm_notifier_invalidate,
> > > > +};
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_init - Initialize the GPU SVM.
> > > > + * @gpusvm: Pointer to the GPU SVM structure.
> > > > + * @name: Name of the GPU SVM.
> > > > + * @drm: Pointer to the DRM device structure.
> > > > + * @mm: Pointer to the mm_struct for the address space.
> > > > + * @device_private_page_owner: Device private pages owner.
> > > > + * @mm_start: Start address of GPU SVM.
> > > > + * @mm_range: Range of the GPU SVM.
> > > > + * @notifier_size: Size of individual notifiers.
> > > > + * @ops: Pointer to the operations structure for GPU SVM.
> > > > + * @chunk_sizes: Pointer to the array of chunk sizes used in
> > > > range
> > > > allocation.
> > > > + *               Entries should be powers of 2 in descending
> > > > order
> > > > with last
> > > > + *               entry being SZ_4K.
> > > > + * @num_chunks: Number of chunks.
> > > > + *
> > > > + * This function initializes the GPU SVM.
> > > > + *
> > > > + * Returns:
> > > > + * 0 on success, a negative error code on failure.
> > > > + */
> > > > +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> > > > +		    const char *name, struct drm_device *drm,
> > > > +		    struct mm_struct *mm, void
> > > > *device_private_page_owner,
> > > > +		    u64 mm_start, u64 mm_range, u64
> > > > notifier_size,
> > > > +		    const struct drm_gpusvm_ops *ops,
> > > > +		    const u64 *chunk_sizes, int num_chunks)
> > > > +{
> > > > +	if (!ops->invalidate || !num_chunks)
> > > > +		return -EINVAL;
> > > > +
> > > > +	gpusvm->name = name;
> > > > +	gpusvm->drm = drm;
> > > > +	gpusvm->mm = mm;
> > > > +	gpusvm->device_private_page_owner =
> > > > device_private_page_owner;
> > > > +	gpusvm->mm_start = mm_start;
> > > > +	gpusvm->mm_range = mm_range;
> > > > +	gpusvm->notifier_size = notifier_size;
> > > > +	gpusvm->ops = ops;
> > > > +	gpusvm->chunk_sizes = chunk_sizes;
> > > > +	gpusvm->num_chunks = num_chunks;
> > > > +	gpusvm->zdd_wq = system_wq;
> > > > +
> > > > +	mmgrab(mm);
> > > > +	gpusvm->root = RB_ROOT_CACHED;
> > > > +	INIT_LIST_HEAD(&gpusvm->notifier_list);
> > > > +
> > > > +	init_rwsem(&gpusvm->notifier_lock);
> > > > +
> > > > +	fs_reclaim_acquire(GFP_KERNEL);
> > > > +	might_lock(&gpusvm->notifier_lock);
> > > > +	fs_reclaim_release(GFP_KERNEL);
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_notifier_find - Find GPU SVM notifier
> > > > + * @gpusvm__: Pointer to the GPU SVM structure
> > > > + * @fault_addr__: Fault address
> > > > + *
> > > > + * This macro finds the GPU SVM notifier associated with the
> > > > fault
> > > > address.
> > > > + *
> > > > + * Returns:
> > > > + * Pointer to the GPU SVM notifier on success, NULL otherwise.
> > > > + */
> > > > +#define drm_gpusvm_notifier_find(gpusvm__, fault_addr__)	\
> > > > +	notifier_iter_first(&(gpusvm__)->root, (fault_addr__),	\
> > > > +			    (fault_addr__ + 1))
> > > > +
> > > > +/**
> > > > + * to_drm_gpusvm_notifier - retrieve the container struct for a
> > > > given rbtree node
> > > > + * @node__: a pointer to the rbtree node embedded within a
> > > > drm_gpusvm_notifier struct
> > > > + *
> > > > + * Return: A pointer to the containing drm_gpusvm_notifier
> > > > structure.
> > > > + */
> > > > +#define to_drm_gpusvm_notifier(__node)				\
> > > > +	container_of((__node), struct drm_gpusvm_notifier, rb.node)
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_notifier_insert - Insert GPU SVM notifier
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > > + *
> > > > + * This function inserts the GPU SVM notifier into the GPU SVM
> > > > RB
> > > > tree and list.
> > > > + */
> > > > +static void drm_gpusvm_notifier_insert(struct drm_gpusvm
> > > > *gpusvm,
> > > > +				       struct
> > > > drm_gpusvm_notifier
> > > > *notifier)
> > > > +{
> > > > +	struct rb_node *node;
> > > > +	struct list_head *head;
> > > > +
> > > > +	notifier_insert(notifier, &gpusvm->root);
> > > > +
> > > > +	node = rb_prev(&notifier->rb.node);
> > > > +	if (node)
> > > > +		head = &(to_drm_gpusvm_notifier(node))->rb.entry;
> > > > +	else
> > > > +		head = &gpusvm->notifier_list;
> > > > +
> > > > +	list_add(&notifier->rb.entry, head);
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_notifier_remove - Remove GPU SVM notifier
> > > > + * @gpusvm__: Pointer to the GPU SVM structure
> > > > + * @notifier__: Pointer to the GPU SVM notifier structure
> > > > + *
> > > > + * This macro removes the GPU SVM notifier from the GPU SVM RB
> > > > tree
> > > > and list.
> > > > + */
> > > > +#define drm_gpusvm_notifier_remove(gpusvm__, notifier__)	\
> > > > +	notifier_remove((notifier__), &(gpusvm__)->root);	\
> > > > +	list_del(&(notifier__)->rb.entry)
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_fini - Finalize the GPU SVM.
> > > > + * @gpusvm: Pointer to the GPU SVM structure.
> > > > + *
> > > > + * This function finalizes the GPU SVM by cleaning up any
> > > > remaining
> > > > ranges and
> > > > + * notifiers, and dropping a reference to struct MM.
> > > > + */
> > > > +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm)
> > > > +{
> > > > +	struct drm_gpusvm_notifier *notifier, *next;
> > > > +
> > > > +	drm_gpusvm_for_each_notifier_safe(notifier, next,
> > > > gpusvm, 0,
> > > > LONG_MAX) {
> > > > +		struct drm_gpusvm_range *range, *__next;
> > > > +
> > > > +		/*
> > > > +		 * Remove notifier first to avoid racing with
> > > > any
> > > > invalidation
> > > > +		 */
> > > > +		mmu_interval_notifier_remove(&notifier->notifier);
> > > > +		notifier->flags.removed = true;
> > > > +
> > > > +		drm_gpusvm_for_each_range_safe(range, __next,
> > > > notifier, 0,
> > > > +					       LONG_MAX)
> > > > +			drm_gpusvm_range_remove(gpusvm, range);
> > > > +	}
> > > > +
> > > > +	mmdrop(gpusvm->mm);
> > > > +	WARN_ON(!RB_EMPTY_ROOT(&gpusvm->root.rb_root));
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_notifier_alloc - Allocate GPU SVM notifier
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @fault_addr: Fault address
> > > > + *
> > > > + * This function allocates and initializes the GPU SVM notifier
> > > > structure.
> > > > + *
> > > > + * Returns:
> > > > + * Pointer to the allocated GPU SVM notifier on success,
> > > > ERR_PTR()
> > > > on failure.
> > > > + */
> > > > +static struct drm_gpusvm_notifier *
> > > > +drm_gpusvm_notifier_alloc(struct drm_gpusvm *gpusvm, u64
> > > > fault_addr)
> > > > +{
> > > > +	struct drm_gpusvm_notifier *notifier;
> > > > +
> > > > +	if (gpusvm->ops->notifier_alloc)
> > > > +		notifier = gpusvm->ops->notifier_alloc();
> > > > +	else
> > > > +		notifier = kzalloc(sizeof(*notifier),
> > > > GFP_KERNEL);
> > > > +
> > > > +	if (!notifier)
> > > > +		return ERR_PTR(-ENOMEM);
> > > > +
> > > > +	notifier->gpusvm = gpusvm;
> > > > +	notifier->interval.start = ALIGN_DOWN(fault_addr, gpusvm->notifier_size);
> > > > +	notifier->interval.end = ALIGN(fault_addr + 1, gpusvm->notifier_size);
> > > > +	INIT_LIST_HEAD(&notifier->rb.entry);
> > > > +	notifier->root = RB_ROOT_CACHED;
> > > > +	INIT_LIST_HEAD(&notifier->range_list);
> > > > +
> > > > +	return notifier;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_notifier_free - Free GPU SVM notifier
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > > + *
> > > > + * This function frees the GPU SVM notifier structure.
> > > > + */
> > > > +static void drm_gpusvm_notifier_free(struct drm_gpusvm *gpusvm,
> > > > +				     struct drm_gpusvm_notifier
> > > > *notifier)
> > > > +{
> > > > +	WARN_ON(!RB_EMPTY_ROOT(&notifier->root.rb_root));
> > > > +
> > > > +	if (gpusvm->ops->notifier_free)
> > > > +		gpusvm->ops->notifier_free(notifier);
> > > > +	else
> > > > +		kfree(notifier);
> > > > +}
> > > > +
> > > > +/**
> > > > + * to_drm_gpusvm_range - retrieve the container struct for a
> > > > given
> > > > rbtree node
> > > > + * @node__: a pointer to the rbtree node embedded within a
> > > > drm_gpusvm_range struct
> > > > + *
> > > > + * Return: A pointer to the containing drm_gpusvm_range
> > > > structure.
> > > > + */
> > > > +#define to_drm_gpusvm_range(node__)	\
> > > > +	container_of((node__), struct drm_gpusvm_range, rb.node)
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_insert - Insert GPU SVM range
> > > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > > + * @range: Pointer to the GPU SVM range structure
> > > > + *
> > > > + * This function inserts the GPU SVM range into the notifier RB
> > > > tree
> > > > and list.
> > > > + */
> > > > +static void drm_gpusvm_range_insert(struct drm_gpusvm_notifier
> > > > *notifier,
> > > > +				    struct drm_gpusvm_range
> > > > *range)
> > > > +{
> > > > +	struct rb_node *node;
> > > > +	struct list_head *head;
> > > > +
> > > > +	drm_gpusvm_notifier_lock(notifier->gpusvm);
> > > > +	range_insert(range, &notifier->root);
> > > > +
> > > > +	node = rb_prev(&range->rb.node);
> > > > +	if (node)
> > > > +		head = &(to_drm_gpusvm_range(node))->rb.entry;
> > > > +	else
> > > > +		head = &notifier->range_list;
> > > > +
> > > > +	list_add(&range->rb.entry, head);
> > > > +	drm_gpusvm_notifier_unlock(notifier->gpusvm);
> > > > +}
> > > > +
> > > > +/**
> > > > + * __drm_gpusvm_range_remove - Remove GPU SVM range
> > > > + * @notifier__: Pointer to the GPU SVM notifier structure
> > > > + * @range__: Pointer to the GPU SVM range structure
> > > > + *
> > > > + * This macro removes the GPU SVM range from the notifier RB
> > > > tree
> > > > and list.
> > > > + */
> > > > +#define __drm_gpusvm_range_remove(notifier__, range__)		\
> > > > +	range_remove((range__), &(notifier__)->root);		\
> > > > +	list_del(&(range__)->rb.entry)
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_alloc - Allocate GPU SVM range
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > > + * @fault_addr: Fault address
> > > > + * @chunk_size: Chunk size
> > > > + * @migrate_vram: Flag indicating whether to migrate VRAM
> > > > + *
> > > > + * This function allocates and initializes the GPU SVM range
> > > > structure.
> > > > + *
> > > > + * Returns:
> > > > + * Pointer to the allocated GPU SVM range on success, ERR_PTR()
> > > > on
> > > > failure.
> > > > + */
> > > > +static struct drm_gpusvm_range *
> > > > +drm_gpusvm_range_alloc(struct drm_gpusvm *gpusvm,
> > > > +		       struct drm_gpusvm_notifier *notifier,
> > > > +		       u64 fault_addr, u64 chunk_size, bool
> > > > migrate_vram)
> > > > +{
> > > > +	struct drm_gpusvm_range *range;
> > > > +
> > > > +	if (gpusvm->ops->range_alloc)
> > > > +		range = gpusvm->ops->range_alloc(gpusvm);
> > > > +	else
> > > > +		range = kzalloc(sizeof(*range), GFP_KERNEL);
> > > > +
> > > > +	if (!range)
> > > > +		return ERR_PTR(-ENOMEM);
> > > > +
> > > > +	kref_init(&range->refcount);
> > > > +	range->gpusvm = gpusvm;
> > > > +	range->notifier = notifier;
> > > > +	range->va.start = ALIGN_DOWN(fault_addr, chunk_size);
> > > > +	range->va.end = ALIGN(fault_addr + 1, chunk_size);
> > > > +	INIT_LIST_HEAD(&range->rb.entry);
> > > > +	range->notifier_seq = LONG_MAX;
> > > > +	range->flags.migrate_vram = migrate_vram ? 1 : 0;
> > > > +
> > > > +	return range;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_check_pages - Check pages
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > > + * @start: Start address
> > > > + * @end: End address
> > > > + *
> > > > + * Check if pages between start and end have been faulted in on the CPU. Used
> > > > + * to prevent migration of pages without CPU backing store.
> > > > + *
> > > > + * Returns:
> > > > + * True if pages have been faulted into CPU, False otherwise
> > > > + */
> > > > +static bool drm_gpusvm_check_pages(struct drm_gpusvm *gpusvm,
> > > > +				   struct drm_gpusvm_notifier
> > > > *notifier,
> > > > +				   u64 start, u64 end)
> > > > +{
> > > > +	struct hmm_range hmm_range = {
> > > > +		.default_flags = 0,
> > > > +		.notifier = &notifier->notifier,
> > > > +		.start = start,
> > > > +		.end = end,
> > > > +		.dev_private_owner = gpusvm->device_private_page_owner,
> > > > +	};
> > > > +	unsigned long timeout =
> > > > +		jiffies +
> > > > msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > > > +	unsigned long *pfns;
> > > > +	unsigned long npages = npages_in_range(start, end);
> > > > +	int err, i;
> > > > +
> > > > +	mmap_assert_locked(gpusvm->mm);
> > > > +
> > > > +	pfns = kvmalloc_array(npages, sizeof(*pfns),
> > > > GFP_KERNEL);
> > > > +	if (!pfns)
> > > > +		return false;
> > > > +
> > > > +	hmm_range.notifier_seq = mmu_interval_read_begin(&notifier->notifier);
> > > > +	hmm_range.hmm_pfns = pfns;
> > > > +
> > > > +	while (true) {
> > > > +		err = hmm_range_fault(&hmm_range);
> > > > +		if (err == -EBUSY) {
> > > > +			if (time_after(jiffies, timeout))
> > > > +				break;
> > > > +
> > > > +			hmm_range.notifier_seq =
> > > > mmu_interval_read_begin(&notifier->notifier);
> > > > +			continue;
> > > > +		}
> > > > +		break;
> > > > +	}
> > > > +	if (err)
> > > > +		goto err_free;
> > > > +
> > > > +	for (i = 0; i < npages; ++i) {
> > > > +		if (!(pfns[i] & HMM_PFN_VALID)) {
> > > > +			err = -EFAULT;
> > > > +			goto err_free;
> > > > +		}
> > > > +	}
> > > > +
> > > > +err_free:
> > > > +	kvfree(pfns);
> > > > +	return err ? false : true;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_chunk_size - Determine chunk size for GPU
> > > > SVM
> > > > range
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > > + * @vas: Pointer to the virtual memory area structure
> > > > + * @fault_addr: Fault address
> > > > + * @gpuva_start: Start address of GPUVA which mirrors CPU
> > > > + * @gpuva_end: End address of GPUVA which mirrors CPU
> > > > + * @check_pages: Flag indicating whether to check pages
> > > > + *
> > > > + * This function determines the chunk size for the GPU SVM range
> > > > based on the
> > > > + * fault address, GPU SVM chunk sizes, existing GPU SVM ranges,
> > > > and
> > > > the virtual
> > > > + * memory area boundaries.
> > > > + *
> > > > + * Returns:
> > > > + * Chunk size on success, LONG_MAX on failure.
> > > > + */
> > > > +static u64 drm_gpusvm_range_chunk_size(struct drm_gpusvm
> > > > *gpusvm,
> > > > +				       struct
> > > > drm_gpusvm_notifier
> > > > *notifier,
> > > > +				       struct vm_area_struct
> > > > *vas,
> > > > +				       u64 fault_addr, u64
> > > > gpuva_start,
> > > > +				       u64 gpuva_end, bool
> > > > check_pages)
> > > > +{
> > > > +	u64 start, end;
> > > > +	int i = 0;
> > > > +
> > > > +retry:
> > > > +	for (; i < gpusvm->num_chunks; ++i) {
> > > > +		start = ALIGN_DOWN(fault_addr, gpusvm->chunk_sizes[i]);
> > > > +		end = ALIGN(fault_addr + 1, gpusvm->chunk_sizes[i]);
> > > > +
> > > > +		if (start >= vas->vm_start && end <= vas->vm_end
> > > > &&
> > > > +		    start >= notifier->interval.start &&
> > > > +		    end <= notifier->interval.end &&
> > > > +		    start >= gpuva_start && end <= gpuva_end)
> > > > +			break;
> > > > +	}
> > > > +
> > > > +	if (i == gpusvm->num_chunks)
> > > > +		return LONG_MAX;
> > > > +
> > > > +	/*
> > > > +	 * If the allocation is larger than a page, ensure it does not
> > > > +	 * overlap with existing ranges.
> > > > +	 */
> > > > +	if (end - start != SZ_4K) {
> > > > +		struct drm_gpusvm_range *range;
> > > > +
> > > > +		range = drm_gpusvm_range_find(notifier, start,
> > > > end);
> > > > +		if (range) {
> > > > +			++i;
> > > > +			goto retry;
> > > > +		}
> > > > +
> > > > +		/*
> > > > +		 * XXX: Only create range on pages CPU has faulted in. Without
> > > > +		 * this check, or prefault, on BMG 'xe_exec_system_allocator --r
> > > > +		 * process-many-malloc' fails. In the failure case, each process
> > > > +		 * mallocs 16k but the CPU VMA is ~128k which results in 64k SVM
> > > > +		 * ranges. When migrating the SVM ranges, some processes fail in
> > > > +		 * drm_gpusvm_migrate_to_vram with 'migrate.cpages != npages'
> > > > +		 * and then upon drm_gpusvm_range_get_pages device pages from
> > > > +		 * other processes are collected + faulted in, which creates all
> > > > +		 * sorts of problems. Unsure exactly how this is happening; the
> > > > +		 * problem goes away if 'xe_exec_system_allocator --r
> > > > +		 * process-many-malloc' mallocs at least 64k at a time.
> > > > +		 */
> > > > +		if (check_pages &&
> > > > +		    !drm_gpusvm_check_pages(gpusvm, notifier,
> > > > start,
> > > > end)) {
> > > > +			++i;
> > > > +			goto retry;
> > > > +		}
> > > > +	}
> > > > +
> > > > +	return end - start;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_find_or_insert - Find or insert GPU SVM
> > > > range
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @fault_addr: Fault address
> > > > + * @gpuva_start: Start address of GPUVA which mirrors CPU
> > > > + * @gpuva_end: End address of GPUVA which mirrors CPU
> > > > + * @ctx: GPU SVM context
> > > > + *
> > > > + * This function finds or inserts a newly allocated GPU SVM range based on the
> > > > + * fault address. The caller must hold a lock to protect range lookup and
> > > > + * insertion.
> > > > + *
> > > > + * Returns:
> > > > + * Pointer to the GPU SVM range on success, ERR_PTR() on
> > > > failure.
> > > > + */
> > > > +struct drm_gpusvm_range *
> > > > +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64
> > > > fault_addr,
> > > > +				u64 gpuva_start, u64 gpuva_end,
> > > > +				const struct drm_gpusvm_ctx
> > > > *ctx)
> > > > +{
> > > > +	struct drm_gpusvm_notifier *notifier;
> > > > +	struct drm_gpusvm_range *range;
> > > > +	struct mm_struct *mm = gpusvm->mm;
> > > > +	struct vm_area_struct *vas;
> > > > +	bool notifier_alloc = false;
> > > > +	u64 chunk_size;
> > > > +	int err;
> > > > +	bool migrate_vram;
> > > > +
> > > > +	if (fault_addr < gpusvm->mm_start ||
> > > > +	    fault_addr > gpusvm->mm_start + gpusvm->mm_range) {
> > > > +		err = -EINVAL;
> > > > +		goto err_out;
> > > > +	}
> > > > +
> > > > +	if (!ctx->mmap_locked) {
> > > > +		if (!mmget_not_zero(mm)) {
> > > > +			err = -EFAULT;
> > > > +			goto err_out;
> > > > +		}
> > > > +		mmap_write_lock(mm);
> > > > +	}
> > > > +
> > > > +	mmap_assert_write_locked(mm);
> > > > +
> > > > +	notifier = drm_gpusvm_notifier_find(gpusvm, fault_addr);
> > > > +	if (!notifier) {
> > > > +		notifier = drm_gpusvm_notifier_alloc(gpusvm,
> > > > fault_addr);
> > > > +		if (IS_ERR(notifier)) {
> > > > +			err = PTR_ERR(notifier);
> > > > +			goto err_mmunlock;
> > > > +		}
> > > > +		notifier_alloc = true;
> > > > +		err = mmu_interval_notifier_insert_locked(&notifier->notifier,
> > > > +							   mm, notifier->interval.start,
> > > > +							   notifier->interval.end -
> > > > +							   notifier->interval.start,
> > > > +							   &drm_gpusvm_notifier_ops);
> > > > +		if (err)
> > > > +			goto err_notifier;
> > > > +	}
> > > > +
> > > > +	vas = vma_lookup(mm, fault_addr);
> > > > +	if (!vas) {
> > > > +		err = -ENOENT;
> > > > +		goto err_notifier_remove;
> > > > +	}
> > > > +
> > > > +	if (!ctx->read_only && !(vas->vm_flags & VM_WRITE)) {
> > > > +		err = -EPERM;
> > > > +		goto err_notifier_remove;
> > > > +	}
> > > > +
> > > > +	range = drm_gpusvm_range_find(notifier, fault_addr,
> > > > fault_addr + 1);
> > > > +	if (range)
> > > > +		goto out_mmunlock;
> > > > +	/*
> > > > +	 * XXX: Short-circuiting migration based on
> > > > migrate_vma_*
> > > > current
> > > > +	 * limitations. If/when migrate_vma_* add more support,
> > > > this
> > > > logic will
> > > > +	 * have to change.
> > > > +	 */
> > > > +	migrate_vram = ctx->vram_possible &&
> > > > +		vma_is_anonymous(vas) &&
> > > > !is_vm_hugetlb_page(vas);
> > > > +
> > > > +	chunk_size = drm_gpusvm_range_chunk_size(gpusvm, notifier, vas,
> > > > +						 fault_addr, gpuva_start,
> > > > +						 gpuva_end, migrate_vram &&
> > > > +						 !ctx->prefault);
> > > > +	if (chunk_size == LONG_MAX) {
> > > > +		err = -EINVAL;
> > > > +		goto err_notifier_remove;
> > > > +	}
> > > > +
> > > > +	range = drm_gpusvm_range_alloc(gpusvm, notifier,
> > > > fault_addr,
> > > > chunk_size,
> > > > +				       migrate_vram);
> > > > +	if (IS_ERR(range)) {
> > > > +		err = PTR_ERR(range);
> > > > +		goto err_notifier_remove;
> > > > +	}
> > > > +
> > > > +	drm_gpusvm_range_insert(notifier, range);
> > > > +	if (notifier_alloc)
> > > > +		drm_gpusvm_notifier_insert(gpusvm, notifier);
> > > > +
> > > > +	if (ctx->prefault) {
> > > > +		struct drm_gpusvm_ctx __ctx = *ctx;
> > > > +
> > > > +		__ctx.mmap_locked = true;
> > > > +		err = drm_gpusvm_range_get_pages(gpusvm, range,
> > > > &__ctx);
> > > > +		if (err)
> > > > +			goto err_range_remove;
> > > > +	}
> > > > +
> > > > +out_mmunlock:
> > > > +	if (!ctx->mmap_locked) {
> > > > +		mmap_write_unlock(mm);
> > > > +		mmput(mm);
> > > > +	}
> > > > +
> > > > +	return range;
> > > > +
> > > > +err_range_remove:
> > > > +	__drm_gpusvm_range_remove(notifier, range);
> > > > +err_notifier_remove:
> > > > +	if (notifier_alloc)
> > > > +		mmu_interval_notifier_remove(&notifier->notifier);
> > > > +err_notifier:
> > > > +	if (notifier_alloc)
> > > > +		drm_gpusvm_notifier_free(gpusvm, notifier);
> > > > +err_mmunlock:
> > > > +	if (!ctx->mmap_locked) {
> > > > +		mmap_write_unlock(mm);
> > > > +		mmput(mm);
> > > > +	}
> > > > +err_out:
> > > > +	return ERR_PTR(err);
> > > > +}
> > > > +
> > > > +/**
> > > > + * for_each_dma_page - iterate over pages in a DMA region
> > > > + * @i__: the current page index in the iteration
> > > > + * @j__: the current page index, log order, in the iteration
> > > > + * @npages__: the total number of pages in the DMA region
> > > > + * @order__: the order of the pages in the DMA region
> > > > + *
> > > > + * This macro iterates over each page in a DMA region. The DMA
> > > > region
> > > > + * is assumed to be composed of 2^@order__ pages, and the macro
> > > > will
> > > > + * step through the region one block of 2^@order__ pages at a
> > > > time.
> > > > + */
> > > > +#define for_each_dma_page(i__, j__, npages__, order__)	\
> > > > +	for ((i__) = 0, (j__) = 0; (i__) < (npages__);	\
> > > > +	     (j__)++, (i__) += 0x1 << (order__))
> > > > +
> > > > +/**
> > > > + * __drm_gpusvm_range_unmap_pages - Unmap pages associated with
> > > > a
> > > > GPU SVM range (internal)
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @range: Pointer to the GPU SVM range structure
> > > > + *
> > > > + * This function unmaps pages associated with a GPU SVM range. Assumes and
> > > > + * asserts correct locking is in place when called.
> > > > + */
> > > > +static void __drm_gpusvm_range_unmap_pages(struct drm_gpusvm
> > > > *gpusvm,
> > > > +					   struct
> > > > drm_gpusvm_range
> > > > *range)
> > > > +{
> > > > +	lockdep_assert_held(&gpusvm->notifier_lock);
> > > > +
> > > > +	if (range->pages) {
> > > > +		unsigned long i, j, npages = npages_in_range(range->va.start,
> > > > +							     range->va.end);
> > > > +
> > > > +		if (range->flags.has_dma_mapping) {
> > > > +			for_each_dma_page(i, j, npages, range->order)
> > > > +				dma_unmap_page(gpusvm->drm->dev,
> > > > +					       range->dma_addr[j],
> > > > +					       PAGE_SIZE << range->order,
> > > > +					       DMA_BIDIRECTIONAL);
> > > > +		}
> > > > +
> > > > +		range->flags.has_vram_pages = false;
> > > > +		range->flags.has_dma_mapping = false;
> > > > +	}
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_free_pages - Free pages associated with a
> > > > GPU
> > > > SVM range
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @range: Pointer to the GPU SVM range structure
> > > > + *
> > > > + * This function frees pages associated with a GPU SVM range.
> > > > + */
> > > > +static void drm_gpusvm_range_free_pages(struct drm_gpusvm
> > > > *gpusvm,
> > > > +					struct drm_gpusvm_range
> > > > *range)
> > > > +{
> > > > +	lockdep_assert_held(&gpusvm->notifier_lock);
> > > > +
> > > > +	if (range->pages) {
> > > > +		if (range->flags.kfree_mapping) {
> > > > +			kfree(range->dma_addr);
> > > > +			range->flags.kfree_mapping = false;
> > > > +			range->pages = NULL;
> > > > +		} else {
> > > > +			kvfree(range->pages);
> > > > +			range->pages = NULL;
> > > > +		}
> > > > +	}
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_remove - Remove GPU SVM range
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @range: Pointer to the GPU SVM range to be removed
> > > > + *
> > > > + * This function removes the specified GPU SVM range and also
> > > > removes the parent
> > > > + * GPU SVM notifier if no more ranges remain in the notifier.
> > > > The
> > > > caller must
> > > > + * hold a lock to protect range and notifier removal.
> > > > + */
> > > > +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> > > > +			     struct drm_gpusvm_range *range)
> > > > +{
> > > > +	struct drm_gpusvm_notifier *notifier;
> > > > +
> > > > +	notifier = drm_gpusvm_notifier_find(gpusvm, range->va.start);
> > > > +	if (WARN_ON_ONCE(!notifier))
> > > > +		return;
> > > > +
> > > > +	drm_gpusvm_notifier_lock(gpusvm);
> > > > +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > > > +	drm_gpusvm_range_free_pages(gpusvm, range);
> > > > +	__drm_gpusvm_range_remove(notifier, range);
> > > > +	drm_gpusvm_notifier_unlock(gpusvm);
> > > > +
> > > > +	drm_gpusvm_range_put(range);
> > > > +
> > > > +	if (RB_EMPTY_ROOT(&notifier->root.rb_root)) {
> > > > +		if (!notifier->flags.removed)
> > > > +			mmu_interval_notifier_remove(&notifier->notifier);
> > > > +		drm_gpusvm_notifier_remove(gpusvm, notifier);
> > > > +		drm_gpusvm_notifier_free(gpusvm, notifier);
> > > > +	}
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_get - Get a reference to GPU SVM range
> > > > + * @range: Pointer to the GPU SVM range
> > > > + *
> > > > + * This function increments the reference count of the specified
> > > > GPU
> > > > SVM range.
> > > > + *
> > > > + * Returns:
> > > > + * Pointer to the GPU SVM range.
> > > > + */
> > > > +struct drm_gpusvm_range *
> > > > +drm_gpusvm_range_get(struct drm_gpusvm_range *range)
> > > > +{
> > > > +	kref_get(&range->refcount);
> > > > +
> > > > +	return range;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_destroy - Destroy GPU SVM range
> > > > + * @refcount: Pointer to the reference counter embedded in the
> > > > GPU
> > > > SVM range
> > > > + *
> > > > + * This function destroys the specified GPU SVM range when its
> > > > reference count
> > > > + * reaches zero. If a custom range-free function is provided, it
> > > > is
> > > > invoked to
> > > > + * free the range; otherwise, the range is deallocated using
> > > > kfree().
> > > > + */
> > > > +static void drm_gpusvm_range_destroy(struct kref *refcount)
> > > > +{
> > > > +	struct drm_gpusvm_range *range =
> > > > +		container_of(refcount, struct drm_gpusvm_range,
> > > > refcount);
> > > > +	struct drm_gpusvm *gpusvm = range->gpusvm;
> > > > +
> > > > +	if (gpusvm->ops->range_free)
> > > > +		gpusvm->ops->range_free(range);
> > > > +	else
> > > > +		kfree(range);
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_put - Put a reference to GPU SVM range
> > > > + * @range: Pointer to the GPU SVM range
> > > > + *
> > > > + * This function decrements the reference count of the specified
> > > > GPU
> > > > SVM range
> > > > + * and frees it when the count reaches zero.
> > > > + */
> > > > +void drm_gpusvm_range_put(struct drm_gpusvm_range *range)
> > > > +{
> > > > +	kref_put(&range->refcount, drm_gpusvm_range_destroy);
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_pages_valid - GPU SVM range pages valid
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @range: Pointer to the GPU SVM range structure
> > > > + *
> > > > + * This function determines if a GPU SVM range's pages are valid. Expected to be
> > > > + * called holding gpusvm->notifier_lock and as the last step before committing a
> > > > + * GPU binding.
> > > > + *
> > > > + * Returns:
> > > > + * True if GPU SVM range has valid pages, False otherwise
> > > > + */
> > > > +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> > > > +				  struct drm_gpusvm_range
> > > > *range)
> > > > +{
> > > > +	lockdep_assert_held(&gpusvm->notifier_lock);
> > > > +
> > > > +	return range->flags.has_vram_pages || range->flags.has_dma_mapping;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_pages_valid_unlocked - GPU SVM range pages
> > > > valid
> > > > unlocked
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @range: Pointer to the GPU SVM range structure
> > > > + *
> > > > + * This function determines if a GPU SVM range's pages are valid. Expected to be
> > > > + * called without holding gpusvm->notifier_lock.
> > > > + *
> > > > + * Returns:
> > > > + * True if GPU SVM range has valid pages, False otherwise
> > > > + */
> > > > +static bool
> > > > +drm_gpusvm_range_pages_valid_unlocked(struct drm_gpusvm *gpusvm,
> > > > +				      struct drm_gpusvm_range
> > > > *range)
> > > > +{
> > > > +	bool pages_valid;
> > > > +
> > > > +	if (!range->pages)
> > > > +		return false;
> > > > +
> > > > +	drm_gpusvm_notifier_lock(gpusvm);
> > > > +	pages_valid = drm_gpusvm_range_pages_valid(gpusvm,
> > > > range);
> > > > +	if (!pages_valid && range->flags.kfree_mapping) {
> > > > +		kfree(range->dma_addr);
> > > > +		range->flags.kfree_mapping = false;
> > > > +		range->pages = NULL;
> > > > +	}
> > > > +	drm_gpusvm_notifier_unlock(gpusvm);
> > > > +
> > > > +	return pages_valid;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_get_pages - Get pages for a GPU SVM range
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @range: Pointer to the GPU SVM range structure
> > > > + * @ctx: GPU SVM context
> > > > + *
> > > > + * This function gets pages for a GPU SVM range and ensures they
> > > > are
> > > > mapped for
> > > > + * DMA access.
> > > > + *
> > > > + * Returns:
> > > > + * 0 on success, negative error code on failure.
> > > > + */
> > > > +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> > > > +			       struct drm_gpusvm_range *range,
> > > > +			       const struct drm_gpusvm_ctx *ctx)
> > > > +{
> > > > +	struct mmu_interval_notifier *notifier = &range->notifier->notifier;
> > > > +	struct hmm_range hmm_range = {
> > > > +		.default_flags = HMM_PFN_REQ_FAULT | (ctx->read_only ? 0 :
> > > > +			HMM_PFN_REQ_WRITE),
> > > > +		.notifier = notifier,
> > > > +		.start = range->va.start,
> > > > +		.end = range->va.end,
> > > > +		.dev_private_owner = gpusvm->device_private_page_owner,
> > > > +	};
> > > > +	struct mm_struct *mm = gpusvm->mm;
> > > > +	unsigned long timeout =
> > > > +		jiffies +
> > > > msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > > > +	unsigned long i, j;
> > > > +	unsigned long npages = npages_in_range(range->va.start,
> > > > range->va.end);
> > > > +	unsigned int order = 0;
> > > > +	unsigned long *pfns;
> > > > +	struct page **pages;
> > > > +	int err = 0;
> > > > +	bool vram_pages = !!range->flags.migrate_vram;
> > > > +	bool alloc_pfns = false, kfree_mapping;
> > > > +
> > > > +retry:
> > > > +	kfree_mapping = false;
> > > > +	hmm_range.notifier_seq =
> > > > mmu_interval_read_begin(notifier);
> > > > +	if (drm_gpusvm_range_pages_valid_unlocked(gpusvm,
> > > > range))
> > > > +		return 0;
> > > > +
> > > > +	if (range->notifier_seq == hmm_range.notifier_seq && range->pages) {
> > > > +		if (ctx->prefault)
> > > > +			return 0;
> > > > +
> > > > +		pfns = (unsigned long *)range->pages;
> > > > +		pages = range->pages;
> > > > +		goto map_pages;
> > > > +	}
> > > > +
> > > > +	if (!range->pages) {
> > > > +		pfns = kvmalloc_array(npages, sizeof(*pfns),
> > > > GFP_KERNEL);
> > > > +		if (!pfns)
> > > > +			return -ENOMEM;
> > > > +		alloc_pfns = true;
> > > > +	} else {
> > > > +		pfns = (unsigned long *)range->pages;
> > > > +	}
> > > > +
> > > > +	if (!ctx->mmap_locked) {
> > > > +		if (!mmget_not_zero(mm)) {
> > > > +			err = -EFAULT;
> > > > +			goto err_out;
> > > > +		}
> > > > +	}
> > > > +
> > > > +	hmm_range.hmm_pfns = pfns;
> > > > +	while (true) {
> > > > +		/* Must be checked after mmu_interval_read_begin
> > > > */
> > > > +		if (range->flags.unmapped) {
> > > > +			err = -EFAULT;
> > > > +			break;
> > > > +		}
> > > > +
> > > > +		if (!ctx->mmap_locked) {
> > > > +			/*
> > > > +			 * XXX: The HMM locking document indicates only a read-lock
> > > > +			 * is required, but there appears to be a window between
> > > > +			 * the MMU_NOTIFY_MIGRATE event triggered in a CPU fault
> > > > +			 * via migrate_vma_setup and the pages actually moving
> > > > +			 * in migrate_vma_finalize in which this code can grab
> > > > +			 * garbage pages. Grabbing the write-lock if the range
> > > > +			 * is attached to vram appears to protect against this
> > > > +			 * race.
> > > > +			 */
> > > > +			if (vram_pages)
> > > > +				mmap_write_lock(mm);
> > > > +			else
> > > > +				mmap_read_lock(mm);
> > > > +		}
> > > > +		err = hmm_range_fault(&hmm_range);
> > > > +		if (!ctx->mmap_locked) {
> > > > +			if (vram_pages)
> > > > +				mmap_write_unlock(mm);
> > > > +			else
> > > > +				mmap_read_unlock(mm);
> > > > +		}
> > > > +
> > > > +		if (err == -EBUSY) {
> > > > +			if (time_after(jiffies, timeout))
> > > > +				break;
> > > > +
> > > > +			hmm_range.notifier_seq =
> > > > mmu_interval_read_begin(notifier);
> > > > +			continue;
> > > > +		}
> > > > +		break;
> > > > +	}
> > > > +	if (!ctx->mmap_locked)
> > > > +		mmput(mm);
> > > > +	if (err)
> > > > +		goto err_free;
> > > > +
> > > > +	pages = (struct page **)pfns;
> > > > +
> > > > +	if (ctx->prefault) {
> > > > +		range->pages = pages;
> > > > +		goto set_seqno;
> > > > +	}
> > > > +
> > > > +map_pages:
> > > > +	if (is_device_private_page(hmm_pfn_to_page(pfns[0]))) {
> > > > +		WARN_ON_ONCE(!range->vram_allocation);
> > > > +
> > > > +		for (i = 0; i < npages; ++i) {
> > > > +			pages[i] = hmm_pfn_to_page(pfns[i]);
> > > > +
> > > > +			if
> > > > (WARN_ON_ONCE(!is_device_private_page(pages[i]))) {
> > > > +				err = -EOPNOTSUPP;
> > > > +				goto err_free;
> > > > +			}
> > > > +		}
> > > > +
> > > > +		/* Do not race with notifier unmapping pages */
> > > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > > +		range->flags.has_vram_pages = true;
> > > > +		range->pages = pages;
> > > > +		if (mmu_interval_read_retry(notifier,
> > > > hmm_range.notifier_seq)) {
> > > > +			err = -EAGAIN;
> > > > +			__drm_gpusvm_range_unmap_pages(gpusvm,
> > > > range);
> > > > +		}
> > > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > > +	} else {
> > > > +		dma_addr_t *dma_addr = (dma_addr_t *)pfns;
> > > > +
> > > > +		for_each_dma_page(i, j, npages, order) {
> > > > +			if (WARN_ON_ONCE(i && order !=
> > > > +					
> > > > hmm_pfn_to_map_order(pfns[i]))) {
> > > > +				err = -EOPNOTSUPP;
> > > > +				npages = i;
> > > > +				goto err_unmap;
> > > > +			}
> > > > +			order = hmm_pfn_to_map_order(pfns[i]);
> > > > +
> > > > +			pages[j] = hmm_pfn_to_page(pfns[i]);
> > > > +			if
> > > > (WARN_ON_ONCE(is_zone_device_page(pages[j]))) {
> > > > +				err = -EOPNOTSUPP;
> > > > +				npages = i;
> > > > +				goto err_unmap;
> > > > +			}
> > > > +
> > > > +			set_page_dirty_lock(pages[j]);
> > > > +			mark_page_accessed(pages[j]);
> > > > +
> > > > +			dma_addr[j] = dma_map_page(gpusvm->drm->dev,
> > > > +						   pages[j], 0,
> > > > +						   PAGE_SIZE << order,
> > > > +						   DMA_BIDIRECTIONAL);
> > > > +			if (dma_mapping_error(gpusvm->drm->dev,
> > > > dma_addr[j])) {
> > > > +				err = -EFAULT;
> > > > +				npages = i;
> > > > +				goto err_unmap;
> > > > +			}
> > > > +		}
> > > > +
> > > > +		/* Huge pages, reduce memory footprint */
> > > > +		if (order) {
> > > > +			dma_addr = kmalloc_array(j,
> > > > sizeof(*dma_addr),
> > > > +						 GFP_KERNEL);
> > > > +			if (dma_addr) {
> > > > +				for (i = 0; i < j; ++i)
> > > > +					dma_addr[i] =
> > > > (dma_addr_t)pfns[i];
> > > > +				kvfree(pfns);
> > > > +				kfree_mapping = true;
> > > > +			} else {
> > > > +				dma_addr = (dma_addr_t *)pfns;
> > > > +			}
> > > > +		}
> > > > +
> > > > +		/* Do not race with notifier unmapping pages */
> > > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > > +		range->order = order;
> > > > +		range->flags.kfree_mapping = kfree_mapping;
> > > > +		range->flags.has_dma_mapping = true;
> > > > +		range->dma_addr = dma_addr;
> > > > +		range->vram_allocation = NULL;
> > > > +		if (mmu_interval_read_retry(notifier,
> > > > hmm_range.notifier_seq)) {
> > > > +			err = -EAGAIN;
> > > > +			__drm_gpusvm_range_unmap_pages(gpusvm,
> > > > range);
> > > > +		}
> > > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > > +	}
> > > > +
> > > > +	if (err == -EAGAIN)
> > > > +		goto retry;
> > > > +set_seqno:
> > > > +	range->notifier_seq = hmm_range.notifier_seq;
> > > > +
> > > > +	return 0;
> > > > +
> > > > +err_unmap:
> > > > +	for_each_dma_page(i, j, npages, order)
> > > > +		dma_unmap_page(gpusvm->drm->dev,
> > > > +			       (dma_addr_t)pfns[j],
> > > > +			       PAGE_SIZE << order,
> > > > DMA_BIDIRECTIONAL);
> > > > +err_free:
> > > > +	if (alloc_pfns)
> > > > +		kvfree(pfns);
> > > > +err_out:
> > > > +	return err;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_unmap_pages - Unmap pages associated with a
> > > > GPU
> > > > SVM range
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @range: Pointer to the GPU SVM range structure
> > > > + * @ctx: GPU SVM context
> > > > + *
> > > > + * This function unmaps pages associated with a GPU SVM range. If @in_notifier
> > > > + * is set, it is assumed that gpusvm->notifier_lock is held in write mode; if it
> > > > + * is clear, it acquires gpusvm->notifier_lock in read mode. Must be called on
> > > > + * each GPU SVM range attached to the notifier in gpusvm->ops->invalidate for the
> > > > + * IOMMU security model.
> > > > + */
> > > > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> > > > +				  struct drm_gpusvm_range
> > > > *range,
> > > > +				  const struct drm_gpusvm_ctx
> > > > *ctx)
> > > > +{
> > > > +	if (ctx->in_notifier)
> > > > +		lockdep_assert_held_write(&gpusvm->notifier_lock);
> > > > +	else
> > > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > > +
> > > > +	__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > > > +
> > > > +	if (!ctx->in_notifier)
> > > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_migration_put_page - Put a migration page
> > > > + * @page: Pointer to the page to put
> > > > + *
> > > > + * This function unlocks and puts a page.
> > > > + */
> > > > +static void drm_gpusvm_migration_put_page(struct page *page)
> > > > +{
> > > > +	unlock_page(page);
> > > > +	put_page(page);
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_migration_put_pages - Put migration pages
> > > > + * @npages: Number of pages
> > > > + * @migrate_pfn: Array of migrate page frame numbers
> > > > + *
> > > > + * This function puts an array of pages.
> > > > + */
> > > > +static void drm_gpusvm_migration_put_pages(unsigned long npages,
> > > > +					   unsigned long
> > > > *migrate_pfn)
> > > > +{
> > > > +	unsigned long i;
> > > > +
> > > > +	for (i = 0; i < npages; ++i) {
> > > > +		if (!migrate_pfn[i])
> > > > +			continue;
> > > > +
> > > > +		drm_gpusvm_migration_put_page(migrate_pfn_to_page(migrate_pfn[i]));
> > > > +		migrate_pfn[i] = 0;
> > > > +	}
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_get_vram_page - Get a reference to a VRAM page
> > > > + * @page: Pointer to the page
> > > > + * @zdd: Pointer to the GPU SVM zone device data
> > > > + *
> > > > + * This function associates the given page with the specified
> > > > GPU
> > > > SVM zone
> > > > + * device data and initializes it for zone device usage.
> > > > + */
> > > > +static void drm_gpusvm_get_vram_page(struct page *page,
> > > > +				     struct drm_gpusvm_zdd *zdd)
> > > > +{
> > > > +	page->zone_device_data = drm_gpusvm_zdd_get(zdd);
> > > > +	zone_device_page_init(page);
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_migrate_map_pages() - Map migration pages for GPU
> > > > SVM
> > > > migration
> > > > + * @dev: The device for which the pages are being mapped
> > > > + * @dma_addr: Array to store DMA addresses corresponding to
> > > > mapped
> > > > pages
> > > > + * @migrate_pfn: Array of migrate page frame numbers to map
> > > > + * @npages: Number of pages to map
> > > > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > > > + *
> > > > + * This function maps pages of memory for migration usage in GPU
> > > > SVM. It
> > > > + * iterates over each page frame number provided in
> > > > @migrate_pfn,
> > > > maps the
> > > > + * corresponding page, and stores the DMA address in the
> > > > provided
> > > > @dma_addr
> > > > + * array.
> > > > + *
> > > > + * Return: 0 on success, -EFAULT if an error occurs during
> > > > mapping.
> > > > + */
> > > > +static int drm_gpusvm_migrate_map_pages(struct device *dev,
> > > > +					dma_addr_t *dma_addr,
> > > > +					long unsigned int
> > > > *migrate_pfn,
> > > > +					unsigned long npages,
> > > > +					enum dma_data_direction
> > > > dir)
> > > > +{
> > > > +	unsigned long i;
> > > > +
> > > > +	for (i = 0; i < npages; ++i) {
> > > > +		struct page *page =
> > > > migrate_pfn_to_page(migrate_pfn[i]);
> > > > +
> > > > +		if (!page)
> > > > +			continue;
> > > > +
> > > > +		if (WARN_ON_ONCE(is_zone_device_page(page)))
> > > > +			return -EFAULT;
> > > > +
> > > > +		dma_addr[i] = dma_map_page(dev, page, 0,
> > > > PAGE_SIZE,
> > > > dir);
> > > > +		if (dma_mapping_error(dev, dma_addr[i]))
> > > > +			return -EFAULT;
> > > > +	}
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_migrate_unmap_pages() - Unmap pages previously
> > > > mapped
> > > > for GPU SVM migration
> > > > + * @dev: The device for which the pages were mapped
> > > > + * @dma_addr: Array of DMA addresses corresponding to mapped
> > > > pages
> > > > + * @npages: Number of pages to unmap
> > > > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > > > + *
> > > > + * This function unmaps previously mapped pages of memory for
> > > > GPU
> > > > Shared Virtual
> > > > + * Memory (SVM). It iterates over each DMA address provided in
> > > > @dma_addr, checks
> > > > + * if it's valid and not already unmapped, and unmaps the
> > > > corresponding page.
> > > > + */
> > > > +static void drm_gpusvm_migrate_unmap_pages(struct device *dev,
> > > > +					   dma_addr_t *dma_addr,
> > > > +					   unsigned long npages,
> > > > +					   enum
> > > > dma_data_direction
> > > > dir)
> > > > +{
> > > > +	unsigned long i;
> > > > +
> > > > +	for (i = 0; i < npages; ++i) {
> > > > +		if (!dma_addr[i] || dma_mapping_error(dev,
> > > > dma_addr[i]))
> > > > +			continue;
> > > > +
> > > > +		dma_unmap_page(dev, dma_addr[i], PAGE_SIZE,
> > > > dir);
> > > > +	}
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_migrate_to_vram - Migrate GPU SVM range to VRAM
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @range: Pointer to the GPU SVM range structure
> > > > + * @vram_allocation: Driver-private pointer to the VRAM allocation. The caller
> > > > + *                   should hold a reference to the VRAM allocation, which
> > > > + *                   should be dropped via ops->vram_release or upon the
> > > > + *                   failure of this function.
> > > > + * @ctx: GPU SVM context
> > > > + *
> > > > + * This function migrates the specified GPU SVM range to VRAM. It performs the
> > > > + * necessary setup and invokes the driver-specific operations for migration to
> > > > + * VRAM. Upon successful return, @vram_allocation can safely reference @range
> > > > + * until ops->vram_release is called, which only happens upon successful return.
> > > > + *
> > > > + * Returns:
> > > > + * 0 on success, negative error code on failure.
> > > > + */
> > > > +int drm_gpusvm_migrate_to_vram(struct drm_gpusvm *gpusvm,
> > > > +			       struct drm_gpusvm_range *range,
> > > > +			       void *vram_allocation,
> > > > +			       const struct drm_gpusvm_ctx *ctx)
> > > > +{
> > > > +	u64 start = range->va.start, end = range->va.end;
> > > > +	struct migrate_vma migrate = {
> > > > +		.start		= start,
> > > > +		.end		= end,
> > > > +		.pgmap_owner	= gpusvm->device_private_page_owner,
> > > > +		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
> > > > +	};
> > > > +	struct mm_struct *mm = gpusvm->mm;
> > > > +	unsigned long i, npages = npages_in_range(start, end);
> > > > +	struct vm_area_struct *vas;
> > > > +	struct drm_gpusvm_zdd *zdd = NULL;
> > > > +	struct page **pages;
> > > > +	dma_addr_t *dma_addr;
> > > > +	void *buf;
> > > > +	int err;
> > > > +
> > > > +	if (!range->flags.migrate_vram)
> > > > +		return -EINVAL;
> > > > +
> > > > +	if (!gpusvm->ops->populate_vram_pfn || !gpusvm->ops->copy_to_vram ||
> > > > +	    !gpusvm->ops->copy_to_sram)
> > > > +		return -EOPNOTSUPP;
> > > > +
> > > > +	if (!ctx->mmap_locked) {
> > > > +		if (!mmget_not_zero(mm)) {
> > > > +			err = -EFAULT;
> > > > +			goto err_out;
> > > > +		}
> > > > +		mmap_write_lock(mm);
> > > > +	}
> > > > +
> > > > +	mmap_assert_locked(mm);
> > > > +
> > > > +	vas = vma_lookup(mm, start);
> > > > +	if (!vas) {
> > > > +		err = -ENOENT;
> > > > +		goto err_mmunlock;
> > > > +	}
> > > > +
> > > > +	if (end > vas->vm_end || start < vas->vm_start) {
> > > > +		err = -EINVAL;
> > > > +		goto err_mmunlock;
> > > > +	}
> > > > +
> > > > +	if (!vma_is_anonymous(vas)) {
> > > > +		err = -EBUSY;
> > > > +		goto err_mmunlock;
> > > > +	}
> > > > +
> > > > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) +
> > > > sizeof(*dma_addr) +
> > > > +		       sizeof(*pages), GFP_KERNEL);
> > > > +	if (!buf) {
> > > > +		err = -ENOMEM;
> > > > +		goto err_mmunlock;
> > > > +	}
> > > > +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > > > +	pages = buf + (2 * sizeof(*migrate.src) +
> > > > sizeof(*dma_addr))
> > > > * npages;
> > > > +
> > > > +	zdd = drm_gpusvm_zdd_alloc(range);
> > > > +	if (!zdd) {
> > > > +		err = -ENOMEM;
> > > > +		goto err_free;
> > > > +	}
> > > > +
> > > > +	migrate.vma = vas;
> > > > +	migrate.src = buf;
> > > > +	migrate.dst = migrate.src + npages;
> > > > +
> > > > +	err = migrate_vma_setup(&migrate);
> > > > +	if (err)
> > > > +		goto err_free;
> > > > +
> > > > +	/*
> > > > +	 * FIXME: The cases below, !migrate.cpages and migrate.cpages != npages,
> > > > +	 * are not always an error. Need to revisit possible cases and how to
> > > > +	 * handle them. We could prefault on migrate.cpages != npages via
> > > > +	 * hmm_range_fault.
> > > > +	 */
> > > > +
> > > > +	if (!migrate.cpages) {
> > > > +		err = -EFAULT;
> > > > +		goto err_free;
> > > > +	}
> > > > +
> > > > +	if (migrate.cpages != npages) {
> > > > +		err = -EBUSY;
> > > > +		goto err_finalize;
> > > > +	}
> > > > +
> > > > +	err = gpusvm->ops->populate_vram_pfn(gpusvm,
> > > > vram_allocation, npages,
> > > > +					     migrate.dst);
> > > > +	if (err)
> > > > +		goto err_finalize;
> > > > +
> > > > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> > > > dma_addr,
> > > > +					   migrate.src, npages,
> > > > DMA_TO_DEVICE);
> > > > +	if (err)
> > > > +		goto err_finalize;
> > > > +
> > > > +	for (i = 0; i < npages; ++i) {
> > > > +		struct page *page = pfn_to_page(migrate.dst[i]);
> > > > +
> > > > +		pages[i] = page;
> > > > +		migrate.dst[i] = migrate_pfn(migrate.dst[i]);
> > > > +		drm_gpusvm_get_vram_page(page, zdd);
> > > > +	}
> > > > +
> > > > +	err = gpusvm->ops->copy_to_vram(gpusvm, pages, dma_addr,
> > > > npages);
> > > > +	if (err)
> > > > +		goto err_finalize;
> > > > +
> > > > +	/* Upon success bind vram allocation to range and zdd */
> > > > +	range->vram_allocation = vram_allocation;
> > > > +	WRITE_ONCE(zdd->vram_allocation, vram_allocation);	/* Owns ref */
> > > > +
> > > > +err_finalize:
> > > > +	if (err)
> > > > +		drm_gpusvm_migration_put_pages(npages,
> > > > migrate.dst);
> > > > +	migrate_vma_pages(&migrate);
> > > > +	migrate_vma_finalize(&migrate);
> > > > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev,
> > > > dma_addr,
> > > > npages,
> > > > +				       DMA_TO_DEVICE);
> > > > +err_free:
> > > > +	if (zdd)
> > > > +		drm_gpusvm_zdd_put(zdd);
> > > > +	kvfree(buf);
> > > > +err_mmunlock:
> > > > +	if (!ctx->mmap_locked) {
> > > > +		mmap_write_unlock(mm);
> > > > +		mmput(mm);
> > > > +	}
> > > > +err_out:
> > > > +	return err;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_migrate_populate_sram_pfn - Populate SRAM PFNs for
> > > > a
> > > > VM area
> > > > + * @vas: Pointer to the VM area structure, can be NULL
> > > > + * @npages: Number of pages to populate
> > > > + * @src_mpfn: Source array of migrate PFNs
> > > > + * @mpfn: Array of migrate PFNs to populate
> > > > + * @addr: Start address for PFN allocation
> > > > + *
> > > > + * This function populates the SRAM migrate page frame numbers (PFNs) for the
> > > > + * specified VM area structure. It allocates and locks pages in the VM area for
> > > > + * SRAM usage. If @vas is non-NULL, alloc_page_vma() is used for allocation;
> > > > + * otherwise alloc_page() is used.
> > > > + *
> > > > + * Returns:
> > > > + * 0 on success, negative error code on failure.
> > > > + */
> > > > +static int drm_gpusvm_migrate_populate_sram_pfn(struct
> > > > vm_area_struct *vas,
> > > > +						unsigned long
> > > > npages,
> > > > +						unsigned long
> > > > *src_mpfn,
> > > > +						unsigned long
> > > > *mpfn,
> > > > u64 addr)
> > > > +{
> > > > +	unsigned long i;
> > > > +
> > > > +	for (i = 0; i < npages; ++i, addr += PAGE_SIZE) {
> > > > +		struct page *page;
> > > > +
> > > > +		if (!(src_mpfn[i] & MIGRATE_PFN_MIGRATE))
> > > > +			continue;
> > > > +
> > > > +		if (vas)
> > > > +			page = alloc_page_vma(GFP_HIGHUSER, vas,
> > > > addr);
> > > > +		else
> > > > +			page = alloc_page(GFP_HIGHUSER);
> > > > +
> > > > +		if (!page)
> > > > +			return -ENOMEM;
> > > > +
> > > > +		lock_page(page);
> > > > +		mpfn[i] = migrate_pfn(page_to_pfn(page));
> > > > +	}
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_evict_to_sram - Evict GPU SVM range to SRAM
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @range: Pointer to the GPU SVM range structure
> > > > + *
> > > > + * Similar to __drm_gpusvm_migrate_to_sram but does not require the mmap lock,
> > > > + * and migration is done via the migrate_device_* functions. Fallback path, as it
> > > > + * is preferred to issue migrations with the mmap lock held.
> > > > + *
> > > > + * Returns:
> > > > + * 0 on success, negative error code on failure.
> > > > + */
> > > > +static int drm_gpusvm_evict_to_sram(struct drm_gpusvm *gpusvm,
> > > > +				    struct drm_gpusvm_range
> > > > *range)
> > > > +{
> > > > +	unsigned long npages;
> > > > +	struct page **pages;
> > > > +	unsigned long *src, *dst;
> > > > +	dma_addr_t *dma_addr;
> > > > +	void *buf;
> > > > +	int i, err = 0;
> > > > +
> > > > +	npages = npages_in_range(range->va.start, range->va.end);
> > > > +
> > > > +	buf = kvcalloc(npages, 2 * sizeof(*src) + sizeof(*dma_addr) +
> > > > +		       sizeof(*pages), GFP_KERNEL);
> > > > +	if (!buf) {
> > > > +		err = -ENOMEM;
> > > > +		goto err_out;
> > > > +	}
> > > > +	src = buf;
> > > > +	dst = buf + (sizeof(*src) * npages);
> > > > +	dma_addr = buf + (2 * sizeof(*src) * npages);
> > > > +	pages = buf + (2 * sizeof(*src) + sizeof(*dma_addr)) *
> > > > npages;
> > > > +
> > > > +	err = gpusvm->ops->populate_vram_pfn(gpusvm, range->vram_allocation,
> > > > +					     npages, src);
> > > > +	if (err)
> > > > +		goto err_free;
> > > > +
> > > > +	err = migrate_device_vma_range(gpusvm->mm,
> > > > +				       gpusvm->device_private_page_owner, src,
> > > > +				       npages, range->va.start);
> > > > +	if (err)
> > > > +		goto err_free;
> > > > +
> > > > +	err = drm_gpusvm_migrate_populate_sram_pfn(NULL, npages,
> > > > src, dst, 0);
> > > > +	if (err)
> > > > +		goto err_finalize;
> > > > +
> > > > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> > > > dma_addr,
> > > > +					   dst, npages,
> > > > DMA_BIDIRECTIONAL);
> > > > +	if (err)
> > > > +		goto err_finalize;
> > > > +
> > > > +	for (i = 0; i < npages; ++i)
> > > > +		pages[i] = migrate_pfn_to_page(src[i]);
> > > > +
> > > > +	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr,
> > > > npages);
> > > > +	if (err)
> > > > +		goto err_finalize;
> > > > +
> > > > +err_finalize:
> > > > +	if (err)
> > > > +		drm_gpusvm_migration_put_pages(npages, dst);
> > > > +	migrate_device_pages(src, dst, npages);
> > > > +	migrate_device_finalize(src, dst, npages);
> > > > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev,
> > > > dma_addr,
> > > > npages,
> > > > +				       DMA_BIDIRECTIONAL);
> > > > +err_free:
> > > > +	kvfree(buf);
> > > > +err_out:
> > > > +
> > > > +	return err;
> > > > +}
> > > > +
> > > > +/**
> > > > + * __drm_gpusvm_migrate_to_sram - Migrate GPU SVM range to SRAM
> > > > (internal)
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @vas: Pointer to the VM area structure
> > > > + * @page: Pointer to the page for fault handling (can be NULL)
> > > > + * @start: Start address of the migration range
> > > > + * @end: End address of the migration range
> > > > + *
> > > > + * This internal function performs the migration of the
> > > > specified
> > > > GPU SVM range
> > > > + * to SRAM. It sets up the migration, populates + dma maps SRAM
> > > > PFNs, and
> > > > + * invokes the driver-specific operations for migration to SRAM.
> > > > + *
> > > > + * Returns:
> > > > + * 0 on success, negative error code on failure.
> > > > + */
> > > > +static int __drm_gpusvm_migrate_to_sram(struct drm_gpusvm
> > > > *gpusvm,
> > > > +					struct vm_area_struct
> > > > *vas,
> > > > +					struct page *page,
> > > > +					u64 start, u64 end)
> > > > +{
> > > > +	struct migrate_vma migrate = {
> > > > +		.vma		= vas,
> > > > +		.pgmap_owner	= gpusvm->device_private_page_owner,
> > > > +		.flags		=
> > > > MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> > > > +		.fault_page	= page,
> > > > +	};
> > > > +	unsigned long npages;
> > > > +	struct page **pages;
> > > > +	dma_addr_t *dma_addr;
> > > > +	void *buf;
> > > > +	int i, err = 0;
> > > > +
> > > > +	mmap_assert_locked(gpusvm->mm);
> > > > +
> > > > +	/* Corner case where the VMA area struct has been partially unmapped */
> > > > +	if (start < vas->vm_start)
> > > > +		start = vas->vm_start;
> > > > +	if (end > vas->vm_end)
> > > > +		end = vas->vm_end;
> > > > +
> > > > +	migrate.start = start;
> > > > +	migrate.end = end;
> > > > +	npages = npages_in_range(start, end);
> > > > +
> > > > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) +
> > > > sizeof(*dma_addr) +
> > > > +		       sizeof(*pages), GFP_KERNEL);
> > > > +	if (!buf) {
> > > > +		err = -ENOMEM;
> > > > +		goto err_out;
> > > > +	}
> > > > +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > > > +	pages = buf + (2 * sizeof(*migrate.src) +
> > > > sizeof(*dma_addr))
> > > > * npages;
> > > > +
> > > > +	migrate.vma = vas;
> > > > +	migrate.src = buf;
> > > > +	migrate.dst = migrate.src + npages;
> > > > +
> > > > +	err = migrate_vma_setup(&migrate);
> > > > +	if (err)
> > > > +		goto err_free;
> > > > +
> > > > +	/* Raced with another CPU fault, nothing to do */
> > > > +	if (!migrate.cpages)
> > > > +		goto err_free;
> > > > +
> > > > +	err = drm_gpusvm_migrate_populate_sram_pfn(vas, npages,
> > > > +						   migrate.src,
> > > > migrate.dst,
> > > > +						   start);
> > > > +	if (err)
> > > > +		goto err_finalize;
> > > > +
> > > > +	err = drm_gpusvm_migrate_map_pages(gpusvm->drm->dev,
> > > > dma_addr,
> > > > +					   migrate.dst, npages,
> > > > +					   DMA_BIDIRECTIONAL);
> > > > +	if (err)
> > > > +		goto err_finalize;
> > > > +
> > > > +	for (i = 0; i < npages; ++i)
> > > > +		pages[i] = migrate_pfn_to_page(migrate.src[i]);
> > > 
> > > See comments below which pages we actually want to migrate.
> > > 
> > > 
> > > > +
> > > > +	err = gpusvm->ops->copy_to_sram(gpusvm, pages, dma_addr,
> > > > npages);
> > > > +	if (err)
> > > > +		goto err_finalize;
> > > > +
> > > > +err_finalize:
> > > > +	if (err)
> > > > +		drm_gpusvm_migration_put_pages(npages,
> > > > migrate.dst);
> > > > +	migrate_vma_pages(&migrate);
> > > > +	migrate_vma_finalize(&migrate);
> > > > +	drm_gpusvm_migrate_unmap_pages(gpusvm->drm->dev,
> > > > dma_addr,
> > > > npages,
> > > > +				       DMA_BIDIRECTIONAL);
> > > > +err_free:
> > > > +	kvfree(buf);
> > > > +err_out:
> > > > +	mmap_assert_locked(gpusvm->mm);
> > > > +
> > > > +	return err;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_migrate_to_sram - Migrate (evict) GPU SVM range to
> > > > SRAM
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @range: Pointer to the GPU SVM range structure
> > > > + * @ctx: GPU SVM context
> > > > + *
> > > > + * This function initiates the migration of the specified GPU
> > > > SVM
> > > > range to
> > > > + * SRAM. It performs necessary checks and invokes the internal
> > > > migration
> > > > + * function for actual migration.
> > > > + *
> > > > + * Returns:
> > > > + * 0 on success, negative error code on failure.
> > > > + */
> > > > +int drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > > > +			       struct drm_gpusvm_range *range,
> > > > +			       const struct drm_gpusvm_ctx *ctx)
> > > > +{
> > > > +	u64 start = range->va.start, end = range->va.end;
> > > > +	struct mm_struct *mm = gpusvm->mm;
> > > > +	struct vm_area_struct *vas;
> > > > +	int err;
> > > > +	bool retry = false;
> > > > +
> > > > +	if (!ctx->mmap_locked) {
> > > > +		if (!mmget_not_zero(mm)) {
> > > > +			err = -EFAULT;
> > > > +			goto err_out;
> > > > +		}
> > > > +		if (ctx->trylock_mmap) {
> > > > +			if (!mmap_read_trylock(mm)) {
> > > > +				err = drm_gpusvm_evict_to_sram(gpusvm, range);
> > > > +				goto err_mmput;
> > > > +			}
> > > > +		} else {
> > > > +			mmap_read_lock(mm);
> > > > +		}
> > > > +	}
> > > > +
> > > > +	mmap_assert_locked(mm);
> > > > +
> > > > +	/*
> > > > +	 * Loop required to find all VMA area structs for the corner case when
> > > > +	 * VRAM backing has been partially unmapped from the MM's address space.
> > > > +	 */
> > > > +again:
> > > > +	vas = find_vma(mm, start);
> > > > +	if (!vas) {
> > > > +		if (!retry)
> > > > +			err = -ENOENT;
> > > > +		goto err_mmunlock;
> > > > +	}
> > > > +
> > > > +	if (end <= vas->vm_start || start >= vas->vm_end) {
> > > > +		if (!retry)
> > > > +			err = -EINVAL;
> > > > +		goto err_mmunlock;
> > > > +	}
> > > > +
> > > > +	err = __drm_gpusvm_migrate_to_sram(gpusvm, vas, NULL,
> > > > start,
> > > > end);
> > > 
> > > This function is typically called from the vm side to get a clean mm as
> > > a last resort after get_pages() fails. As such, should we have it evict
> > > *everything*, even foreign device memory and mismatching local device
> > > pages? If so, we could use hmm_range_fault() with a NULL page owner +
> > > faulting to do that.
> > > 
> > 
> > I've actually tried that and it seemed to mostly work well; it would
> > be my preference as it avoids a VMA lookup in GPU SVM.
> > 
> > I think it is a problem though if some of the pages are partially
> > unmapped, as hmm_range_fault will abort if a fault cannot be resolved.
> > Maybe I'm mistaken on this. I won't get this in rev2 but will put it on
> > my list to continue to play around with.
> 
> OK. Presumably if faulting fails we should try a narrower range unless
> the page actually hitting the gpu pagefault is unmapped, to ensure we
> make progress rather than aborting?
> 

I think the easiest thing would be add a flag to HMM that says continue
on fault failure. Now I remember another issue, hmm_range_fault doesn't
work for coherent pages if we ever decide to use them.

Maybe we can do something like hmm_range_fault without fault bit set to
collect device pages and then use migrate_device_* functions to evict.
Think drm_gpusvm_evict_to_ram in v2 (just posted) with
populate_devmem_pfn replaced with hmm_range_fault. It seems like that
would work. Maybe I'm missing a race here though; it likely gets racier
with multi-GPU too but seems workable.
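
Roughly the shape I have in mind -- an untested sketch only, reusing
migrate_device_vma_range() from patch 4 and glossing over the mmap /
notifier locking and the -EBUSY retry loop:

static int evict_via_hmm_sketch(struct drm_gpusvm *gpusvm,
				struct drm_gpusvm_range *range)
{
	unsigned long npages = npages_in_range(range->va.start, range->va.end);
	struct hmm_range hmm_range = {
		.default_flags = 0,	/* no HMM_PFN_REQ_FAULT, collect only */
		.notifier = &range->notifier->notifier,
		.start = range->va.start,
		.end = range->va.end,
		.dev_private_owner = gpusvm->device_private_page_owner,
	};
	unsigned long *src, i;
	int err;

	src = kvcalloc(npages, sizeof(*src), GFP_KERNEL);
	if (!src)
		return -ENOMEM;

	hmm_range.hmm_pfns = src;
	hmm_range.notifier_seq = mmu_interval_read_begin(hmm_range.notifier);
	err = hmm_range_fault(&hmm_range);
	if (err)
		goto out;

	/* Keep only our own device-private pages as eviction candidates */
	for (i = 0; i < npages; ++i) {
		struct page *page;

		if (!(src[i] & HMM_PFN_VALID)) {
			src[i] = 0;
			continue;
		}
		page = hmm_pfn_to_page(src[i]);
		src[i] = is_device_private_page(page) ? page_to_pfn(page) : 0;
	}

	/* Collect + lock the device pages for migration */
	err = migrate_device_vma_range(gpusvm->mm,
				       gpusvm->device_private_page_owner,
				       src, npages, range->va.start);
	/*
	 * From here on identical to drm_gpusvm_evict_to_sram: populate dst
	 * with system pages, dma map them, ops->copy_to_sram, then
	 * migrate_device_pages() + migrate_device_finalize().
	 */
out:
	kvfree(src);
	return err;
}

The part I'm least sure about is doing the hmm_range_fault collection
without the mmap lock in the pure eviction path; that may need the same
trylock dance as above.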

> 
> > 
> > > > +	if (err)
> > > > +		goto err_mmunlock;
> > > > +
> > > > +	if (vas->vm_end < end) {
> > > > +		retry = true;
> > > > +		start = vas->vm_end;
> > > > +		goto again;
> > > > +	}
> > > > +
> > > > +	if (!ctx->mmap_locked) {
> > > > +		mmap_read_unlock(mm);
> > > > +		/*
> > > > +		 * Using mmput_async as this function can be
> > > > called
> > > > while
> > > > +		 * holding a dma-resv lock, and a final put can
> > > > grab
> > > > the mmap
> > > > +		 * lock, causing a lock inversion.
> > > > +		 */
> > > > +		mmput_async(mm);
> > > > +	}
> > > > +
> > > > +	return 0;
> > > > +
> > > > +err_mmunlock:
> > > > +	if (!ctx->mmap_locked)
> > > > +		mmap_read_unlock(mm);
> > > > +err_mmput:
> > > > +	if (!ctx->mmap_locked)
> > > > +		mmput_async(mm);
> > > > +err_out:
> > > > +	return err;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_page_free - Put GPU SVM zone device data
> > > > associated
> > > > with a page
> > > > + * @page: Pointer to the page
> > > > + *
> > > > + * This function is a callback used to put the GPU SVM zone
> > > > device
> > > > data
> > > > + * associated with a page when it is being released.
> > > > + */
> > > > +static void drm_gpusvm_page_free(struct page *page)
> > > > +{
> > > > +	drm_gpusvm_zdd_put(page->zone_device_data);
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_migrate_to_ram - Migrate GPU SVM range to RAM
> > > > (page
> > > > fault handler)
> > > > + * @vmf: Pointer to the fault information structure
> > > > + *
> > > > + * This function is a page fault handler used to migrate a GPU
> > > > SVM
> > > > range to RAM.
> > > > + * It retrieves the GPU SVM range information from the faulting
> > > > page
> > > > and invokes
> > > > + * the internal migration function to migrate the range back to
> > > > RAM.
> > > > + *
> > > > + * Returns:
> > > > + * VM_FAULT_SIGBUS on failure, 0 on success.
> > > > + */
> > > > +static vm_fault_t drm_gpusvm_migrate_to_ram(struct vm_fault
> > > > *vmf)
> > > > +{
> > > > +	struct drm_gpusvm_zdd *zdd = vmf->page->zone_device_data;
> > > > +	int err;
> > > > +
> > > > +	err = __drm_gpusvm_migrate_to_sram(zdd->range->gpusvm,
> > > > +					   vmf->vma, vmf->page,
> > > > +					   zdd->range->va.start,
> > > > +					   zdd->range->va.end);
> > > 
> > > When called from here, since this is a pagemap op, we should ensure
> > > we
> > > only migrate our own pagemap to RAM?
> > > 
> > 
> > I think you resolve this with the following patch [1], right? I
> > think I agree.
> 
> It doesn't fully resolve it, but adds the capability to do more
> specific filtering. Another option would be to use the pagemap ptr
> rather than the device ptr as device_private owner, but that would OTOH
> require a wider filtering in hmm_range_fault() so that (or a similar)
> patch would be needed anyway.
>

Yeah, a pagemap group is likely a better device_private_owner. Then I think
we'd drop the gpusvm->device_private_page_owner pointer too, which is likely
a good idea anyway.
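
Something like this on the fault side, purely as a sketch (assumes we make
each pagemap its own private owner at creation time instead of using a
gpusvm-wide pointer; the names here are made up):

	/* at pagemap creation: the pagemap (or its group) owns itself */
	pgmap->owner = pgmap;

	/* in the CPU fault handler: filter on the faulting page's pagemap */
	struct migrate_vma migrate = {
		.vma		= vmf->vma,
		.pgmap_owner	= vmf->page->pgmap->owner,
		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
		.fault_page	= vmf->page,
	};

That way a CPU fault only pulls back pages belonging to the pagemap that
actually faulted, and foreign device pages in the same VMA are left alone.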

Matt
 
> Thanks,
> Thomas
> 
> > 
> > Matt
> > 
> > [1] https://patchwork.freedesktop.org/series/139994/
> > 
> > > /Thanks,
> > > Thomas
> > > 
> 


end of thread, other threads:[~2024-10-16  8:25 UTC | newest]

Thread overview: 100+ messages
2024-08-28  2:48 [RFC PATCH 00/28] Introduce GPU SVM and Xe SVM implementation Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 01/28] dma-buf: Split out dma fence array create into alloc and arm functions Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 02/28] drm/xe: Invalidate media_gt TLBs in PT code Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 03/28] drm/xe: Retry BO allocation Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 04/28] mm/migrate: Add migrate_device_vma_range Matthew Brost
2024-08-29  9:03   ` Daniel Vetter
2024-08-29 15:58     ` Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory Matthew Brost
2024-08-28 14:31   ` Daniel Vetter
2024-08-28 14:46     ` Christian König
2024-08-28 15:43       ` Matthew Brost
2024-08-28 16:06         ` Alex Deucher
2024-08-28 16:25         ` Daniel Vetter
2024-08-29 16:40           ` Matthew Brost
2024-09-02 11:29             ` Daniel Vetter
2024-08-30  5:00     ` Matthew Brost
2024-09-02 11:36       ` Daniel Vetter
2024-08-28 18:50   ` Daniel Vetter
2024-08-29 16:49     ` Matthew Brost
2024-09-02 11:40       ` Daniel Vetter
2024-08-29  9:16   ` Thomas Hellström
2024-08-29 17:45     ` Matthew Brost
2024-08-29 18:13       ` Matthew Brost
2024-08-29 19:18       ` Thomas Hellström
2024-08-29 20:56         ` Matthew Brost
2024-08-30  8:18           ` Thomas Hellström
2024-08-30 13:58             ` Matthew Brost
2024-09-02  9:57               ` Thomas Hellström
2024-08-30  9:57           ` Thomas Hellström
2024-08-30 13:47             ` Matthew Brost
2024-09-02  9:45               ` Thomas Hellström
2024-09-02 12:33           ` Daniel Vetter
2024-09-04 12:27             ` Thomas Hellström
2024-09-24  8:41               ` Simona Vetter
2024-08-30  1:35     ` Matthew Brost
2024-08-29  9:45   ` Daniel Vetter
2024-08-29 17:27     ` Matthew Brost
2024-09-02 11:53       ` Daniel Vetter
2024-09-02 17:03         ` Matthew Brost
2024-09-11 16:06           ` Matthew Brost
2024-08-30  9:16   ` Thomas Hellström
2024-09-02 12:20     ` Daniel Vetter
2024-09-06 18:41   ` Zeng, Oak
2024-09-24  9:25     ` Simona Vetter
2024-09-25 16:34       ` Zeng, Oak
2024-09-24 10:42   ` Thomas Hellström
2024-09-24 16:30     ` Matthew Brost
2024-09-25 21:12       ` Matthew Brost
2024-10-09 10:50   ` Thomas Hellström
2024-10-16  3:18     ` Matthew Brost
2024-10-16  6:27       ` Thomas Hellström
2024-10-16  8:24         ` Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 06/28] drm/xe/uapi: Add DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATON flag Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 07/28] drm/xe: Add SVM init / fini to faulting VMs Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 08/28] drm/xe: Add dma_addr res cursor Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 09/28] drm/xe: Add SVM range invalidation Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 10/28] drm/gpuvm: Add DRM_GPUVA_OP_USER Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 11/28] drm/xe: Add (re)bind to SVM page fault handler Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 12/28] drm/xe: Add SVM garbage collector Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 13/28] drm/xe: Add unbind to " Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 14/28] drm/xe: Do not allow system allocator VMA unbind if the GPU has bindings Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 15/28] drm/xe: Enable system allocator uAPI Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 16/28] drm/xe: Add migrate layer functions for SVM support Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 17/28] drm/xe: Add SVM device memory mirroring Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 18/28] drm/xe: Add GPUSVM copy SRAM / VRAM vfunc functions Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 19/28] drm/xe: Update PT layer to understand ranges in VRAM Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 20/28] drm/xe: Add Xe SVM populate_vram_pfn vfunc Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 21/28] drm/xe: Add Xe SVM vram_release vfunc Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 22/28] drm/xe: Add BO flags required for SVM Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 23/28] drm/xe: Add SVM VRAM migration Matthew Brost
2024-08-28 16:06   ` Daniel Vetter
2024-08-28 18:22     ` Daniel Vetter
2024-08-29  9:24     ` Christian König
2024-08-29  9:53       ` Thomas Hellström
2024-08-29 11:02         ` Daniel Vetter
2024-08-29 22:12           ` Matthew Brost
2024-08-29 22:23             ` Matthew Brost
2024-09-02 11:01             ` Christian König
2024-09-02 12:50               ` Daniel Vetter
2024-09-02 12:48             ` Daniel Vetter
2024-09-02 22:20               ` Matthew Brost
2024-09-03  8:07                 ` Simona Vetter
2024-08-29 14:30         ` Christian König
2024-08-29 21:53           ` Matthew Brost
2024-08-29 21:48       ` Matthew Brost
2024-09-02 13:02         ` Daniel Vetter
2024-08-28  2:48 ` [RFC PATCH 24/28] drm/xe: Basic SVM BO eviction Matthew Brost
2024-08-29 10:14   ` Daniel Vetter
2024-08-29 15:55     ` Matthew Brost
2024-09-02 13:05       ` Daniel Vetter
2024-08-28  2:48 ` [RFC PATCH 25/28] drm/xe: Add SVM debug Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 26/28] drm/xe: Add modparam for SVM notifier size Matthew Brost
2024-08-28  2:49 ` [RFC PATCH 27/28] drm/xe: Add modparam for SVM prefault Matthew Brost
2024-08-28  2:49 ` [RFC PATCH 28/28] drm/gpusvm: Ensure all pages migrated upon eviction Matthew Brost
2024-08-28  2:55 ` ✓ CI.Patch_applied: success for Introduce GPU SVM and Xe SVM implementation Patchwork
2024-08-28  2:55 ` ✗ CI.checkpatch: warning " Patchwork
2024-08-28  2:56 ` ✗ CI.KUnit: failure " Patchwork
2024-09-24  9:16 ` [RFC PATCH 00/28] " Simona Vetter
2024-09-24 19:36   ` Matthew Brost
2024-09-25 11:41     ` Simona Vetter
