Intel-XE Archive on lore.kernel.org
 help / color / mirror / Atom feed
* ✗ CI.Patch_applied: failure for series starting with [CI,01/42] mm/hmm: let users to tag specific PFNs
  2024-06-13  4:23 [CI 01/42] mm/hmm: let users to tag specific PFNs Oak Zeng
@ 2024-06-13  4:20 ` Patchwork
  2024-06-13  4:23 ` [CI 02/42] dma-mapping: provide an interface to allocate IOVA Oak Zeng
                   ` (40 subsequent siblings)
  41 siblings, 0 replies; 44+ messages in thread
From: Patchwork @ 2024-06-13  4:20 UTC (permalink / raw)
  To: Oak Zeng; +Cc: intel-xe

== Series Details ==

Series: series starting with [CI,01/42] mm/hmm: let users to tag specific PFNs
URL   : https://patchwork.freedesktop.org/series/134798/
State : failure

== Summary ==

=== Applying kernel patches on branch 'drm-tip' with base: ===
Base commit: 19bd824e89ec drm-tip: 2024y-06m-13d-04h-13m-41s UTC integration manifest
=== git am output follows ===
error: patch failed: drivers/gpu/drm/xe/xe_trace.h:426
error: drivers/gpu/drm/xe/xe_trace.h: patch does not apply
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Applying: mm/hmm: let users to tag specific PFNs
Applying: dma-mapping: provide an interface to allocate IOVA
Applying: dma-mapping: provide callbacks to link/unlink pages to specific IOVA
Applying: iommu/dma: Provide an interface to allow preallocate IOVA
Applying: iommu/dma: Prepare map/unmap page functions to receive IOVA
Applying: iommu/dma: Implement link/unlink page callbacks
Applying: drm: Move GPUVA_START/LAST to drm_gpuvm.h
Applying: drm/svm: Mark drm_gpuvm to participate SVM
Applying: drm/svm: introduce drm_mem_region concept
Applying: drm/svm: introduce hmmptr and helper functions
Applying: drm/svm: Introduce helper to remap drm memory region
Applying: drm/svm: handle CPU page fault
Applying: drm/svm: Migrate a range of hmmptr to vram
Applying: drm/svm: Add DRM SVM documentation
Applying: drm/xe: s/xe_tile_migrate_engine/xe_tile_migrate_exec_queue
Applying: drm/xe: Add xe_vm_pgtable_update_op to xe_vma_ops
Applying: drm/xe: Convert multiple bind ops into single job
Applying: drm/xe: Update VM trace events
Patch failed at 0018 drm/xe: Update VM trace events
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".



^ permalink raw reply	[flat|nested] 44+ messages in thread

* [CI 01/42] mm/hmm: let users to tag specific PFNs
@ 2024-06-13  4:23 Oak Zeng
  2024-06-13  4:20 ` ✗ CI.Patch_applied: failure for series starting with [CI,01/42] " Patchwork
                   ` (41 more replies)
  0 siblings, 42 replies; 44+ messages in thread
From: Oak Zeng @ 2024-06-13  4:23 UTC (permalink / raw)
  To: intel-xe

From: Leon Romanovsky <leonro@nvidia.com>

Introduce a new sticky flag, which isn't overwritten by HMM range fault.
Such a flag allows users to tag specific PFNs with extra data in addition
to what is already filled in by HMM.
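
For illustration only (not part of this patch), a caller could pre-tag the
PFN array before faulting a range; the HMM_PFN_STICKY bit added here survives
hmm_range_fault(). A minimal sketch, with the usual retry-on-EBUSY loop elided:

#include <linux/hmm.h>
#include <linux/mm.h>

static int tagged_range_fault(struct mmu_interval_notifier *notifier,
			      unsigned long start, unsigned long end,
			      unsigned long *pfns, unsigned long npages)
{
	struct hmm_range range = {
		.notifier = notifier,
		.start = start,
		.end = end,
		.hmm_pfns = pfns,
		.default_flags = HMM_PFN_REQ_FAULT,
	};
	unsigned long i;
	int ret;

	/* Driver-private tag; preserved across the fault below. */
	for (i = 0; i < npages; i++)
		pfns[i] |= HMM_PFN_STICKY;

	range.notifier_seq = mmu_interval_read_begin(notifier);
	mmap_read_lock(notifier->mm);
	ret = hmm_range_fault(&range);
	mmap_read_unlock(notifier->mm);

	/* On success, pfns[] still carries HMM_PFN_STICKY. */
	return ret;
}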

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 include/linux/hmm.h |  3 +++
 mm/hmm.c            | 34 +++++++++++++++++++++-------------
 2 files changed, 24 insertions(+), 13 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 126a36571667..b90902baa593 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -23,6 +23,7 @@ struct mmu_interval_notifier;
  * HMM_PFN_WRITE - if the page memory can be written to (requires HMM_PFN_VALID)
  * HMM_PFN_ERROR - accessing the pfn is impossible and the device should
  *                 fail. ie poisoned memory, special pages, no vma, etc
+ * HMM_PFN_STICKY - Flag preserved on input-to-output transformation
  *
  * On input:
  * 0                 - Return the current state of the page, do not fault it.
@@ -36,6 +37,8 @@ enum hmm_pfn_flags {
 	HMM_PFN_VALID = 1UL << (BITS_PER_LONG - 1),
 	HMM_PFN_WRITE = 1UL << (BITS_PER_LONG - 2),
 	HMM_PFN_ERROR = 1UL << (BITS_PER_LONG - 3),
+	/* Sticky flag, carried from Input to Output */
+	HMM_PFN_STICKY = 1UL << (BITS_PER_LONG - 7),
 	HMM_PFN_ORDER_SHIFT = (BITS_PER_LONG - 8),
 
 	/* Input flags */
diff --git a/mm/hmm.c b/mm/hmm.c
index 93aebd9cc130..a244071085b1 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -44,8 +44,10 @@ static int hmm_pfns_fill(unsigned long addr, unsigned long end,
 {
 	unsigned long i = (addr - range->start) >> PAGE_SHIFT;
 
-	for (; addr < end; addr += PAGE_SIZE, i++)
-		range->hmm_pfns[i] = cpu_flags;
+	for (; addr < end; addr += PAGE_SIZE, i++) {
+		range->hmm_pfns[i] &= HMM_PFN_STICKY;
+		range->hmm_pfns[i] |= cpu_flags;
+	}
 	return 0;
 }
 
@@ -202,8 +204,10 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk, unsigned long addr,
 		return hmm_vma_fault(addr, end, required_fault, walk);
 
 	pfn = pmd_pfn(pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
-	for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++)
-		hmm_pfns[i] = pfn | cpu_flags;
+	for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
+		hmm_pfns[i] &= HMM_PFN_STICKY;
+		hmm_pfns[i] |= pfn | cpu_flags;
+	}
 	return 0;
 }
 #else /* CONFIG_TRANSPARENT_HUGEPAGE */
@@ -236,7 +240,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 			hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0);
 		if (required_fault)
 			goto fault;
-		*hmm_pfn = 0;
+		*hmm_pfn = *hmm_pfn & HMM_PFN_STICKY;
 		return 0;
 	}
 
@@ -253,14 +257,14 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 			cpu_flags = HMM_PFN_VALID;
 			if (is_writable_device_private_entry(entry))
 				cpu_flags |= HMM_PFN_WRITE;
-			*hmm_pfn = swp_offset_pfn(entry) | cpu_flags;
+			*hmm_pfn = (*hmm_pfn & HMM_PFN_STICKY) | swp_offset_pfn(entry) | cpu_flags;
 			return 0;
 		}
 
 		required_fault =
 			hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0);
 		if (!required_fault) {
-			*hmm_pfn = 0;
+			*hmm_pfn = *hmm_pfn & HMM_PFN_STICKY;
 			return 0;
 		}
 
@@ -304,11 +308,11 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 			pte_unmap(ptep);
 			return -EFAULT;
 		}
-		*hmm_pfn = HMM_PFN_ERROR;
+		*hmm_pfn = (*hmm_pfn & HMM_PFN_STICKY) | HMM_PFN_ERROR;
 		return 0;
 	}
 
-	*hmm_pfn = pte_pfn(pte) | cpu_flags;
+	*hmm_pfn = (*hmm_pfn & HMM_PFN_STICKY) | pte_pfn(pte) | cpu_flags;
 	return 0;
 
 fault:
@@ -448,8 +452,10 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
 		}
 
 		pfn = pud_pfn(pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
-		for (i = 0; i < npages; ++i, ++pfn)
-			hmm_pfns[i] = pfn | cpu_flags;
+		for (i = 0; i < npages; ++i, ++pfn) {
+			hmm_pfns[i] &= HMM_PFN_STICKY;
+			hmm_pfns[i] |= pfn | cpu_flags;
+		}
 		goto out_unlock;
 	}
 
@@ -507,8 +513,10 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
 	}
 
 	pfn = pte_pfn(entry) + ((start & ~hmask) >> PAGE_SHIFT);
-	for (; addr < end; addr += PAGE_SIZE, i++, pfn++)
-		range->hmm_pfns[i] = pfn | cpu_flags;
+	for (; addr < end; addr += PAGE_SIZE, i++, pfn++) {
+		range->hmm_pfns[i] &= HMM_PFN_STICKY;
+		range->hmm_pfns[i] |= pfn | cpu_flags;
+	}
 
 	spin_unlock(ptl);
 	return 0;
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [CI 02/42] dma-mapping: provide an interface to allocate IOVA
  2024-06-13  4:23 [CI 01/42] mm/hmm: let users to tag specific PFNs Oak Zeng
  2024-06-13  4:20 ` ✗ CI.Patch_applied: failure for series starting with [CI,01/42] " Patchwork
@ 2024-06-13  4:23 ` Oak Zeng
  2024-06-13  4:23 ` [CI 03/42] dma-mapping: provide callbacks to link/unlink pages to specific IOVA Oak Zeng
                   ` (39 subsequent siblings)
  41 siblings, 0 replies; 44+ messages in thread
From: Oak Zeng @ 2024-06-13  4:23 UTC (permalink / raw)
  To: intel-xe

From: Leon Romanovsky <leonro@nvidia.com>

The existing .map_page() callback provides two things at the same time:
it allocates IOVA and links DMA pages. That combination works great for
most of the callers who use it in control paths, but is less effective
in fast paths.

These advanced callers already manage their data in some sort of
database and can perform IOVA allocation in advance, leaving the range
linkage operation in the fast path.

Provide an interface to allocate/deallocate IOVA; the next patch adds
link/unlink of DMA ranges to that specific IOVA.
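
A hedged sketch of the intended calling pattern (driver-side; dev and npages
are assumed to come from the caller):

	struct dma_iova_attrs iova = {
		.dev  = dev,
		.size = npages << PAGE_SHIFT,
		.dir  = DMA_BIDIRECTIONAL,
	};
	int ret;

	/* Slow path: reserve the IOVA range once. */
	ret = dma_alloc_iova(&iova);
	if (ret)
		return ret;

	/* Fast path: link/unlink pages against iova.addr (next patch). */

	/* Teardown: release the IOVA range. */
	dma_free_iova(&iova);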

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 include/linux/dma-map-ops.h |  3 +++
 include/linux/dma-mapping.h | 20 ++++++++++++++++++++
 kernel/dma/mapping.c        | 30 ++++++++++++++++++++++++++++++
 3 files changed, 53 insertions(+)

diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 02a1c825896b..23e5e2f63a1c 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -86,6 +86,9 @@ struct dma_map_ops {
 	size_t (*max_mapping_size)(struct device *dev);
 	size_t (*opt_mapping_size)(void);
 	unsigned long (*get_merge_boundary)(struct device *dev);
+
+	dma_addr_t (*alloc_iova)(struct device *dev, size_t size);
+	void (*free_iova)(struct device *dev, dma_addr_t dma_addr, size_t size);
 };
 
 #ifdef CONFIG_DMA_OPS
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index f693aafe221f..34a3b6420606 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -91,6 +91,16 @@ static inline void debug_dma_map_single(struct device *dev, const void *addr,
 }
 #endif /* CONFIG_DMA_API_DEBUG */
 
+struct dma_iova_attrs {
+	/* OUT field */
+	dma_addr_t addr;
+	/* IN fields */
+	struct device *dev;
+	size_t size;
+	enum dma_data_direction dir;
+	unsigned long attrs;
+};
+
 #ifdef CONFIG_HAS_DMA
 static inline int dma_mapping_error(struct device *dev, dma_addr_t dma_addr)
 {
@@ -101,6 +111,9 @@ static inline int dma_mapping_error(struct device *dev, dma_addr_t dma_addr)
 	return 0;
 }
 
+int dma_alloc_iova(struct dma_iova_attrs *iova);
+void dma_free_iova(struct dma_iova_attrs *iova);
+
 dma_addr_t dma_map_page_attrs(struct device *dev, struct page *page,
 		size_t offset, size_t size, enum dma_data_direction dir,
 		unsigned long attrs);
@@ -150,6 +163,13 @@ void dma_vunmap_noncontiguous(struct device *dev, void *vaddr);
 int dma_mmap_noncontiguous(struct device *dev, struct vm_area_struct *vma,
 		size_t size, struct sg_table *sgt);
 #else /* CONFIG_HAS_DMA */
+static inline int dma_alloc_iova(struct dma_iova_attrs *iova)
+{
+	return -EOPNOTSUPP;
+}
+static inline void dma_free_iova(struct dma_iova_attrs *iova)
+{
+}
 static inline dma_addr_t dma_map_page_attrs(struct device *dev,
 		struct page *page, size_t offset, size_t size,
 		enum dma_data_direction dir, unsigned long attrs)
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 81de84318ccc..4d14637d186b 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -183,6 +183,36 @@ void dma_unmap_page_attrs(struct device *dev, dma_addr_t addr, size_t size,
 }
 EXPORT_SYMBOL(dma_unmap_page_attrs);
 
+int dma_alloc_iova(struct dma_iova_attrs *iova)
+{
+	struct device *dev = iova->dev;
+	const struct dma_map_ops *ops = get_dma_ops(dev);
+
+	if (dma_map_direct(dev, ops) || !ops->alloc_iova) {
+		iova->addr = 0;
+		return 0;
+	}
+
+	iova->addr = ops->alloc_iova(dev, iova->size);
+	if (dma_mapping_error(dev, iova->addr))
+		return -ENOMEM;
+
+	return 0;
+}
+EXPORT_SYMBOL(dma_alloc_iova);
+
+void dma_free_iova(struct dma_iova_attrs *iova)
+{
+	struct device *dev = iova->dev;
+	const struct dma_map_ops *ops = get_dma_ops(dev);
+
+	if (dma_map_direct(dev, ops) || !ops->free_iova)
+		return;
+
+	ops->free_iova(dev, iova->addr, iova->size);
+}
+EXPORT_SYMBOL(dma_free_iova);
+
 static int __dma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
 	 int nents, enum dma_data_direction dir, unsigned long attrs)
 {
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [CI 03/42] dma-mapping: provide callbacks to link/unlink pages to specific IOVA
  2024-06-13  4:23 [CI 01/42] mm/hmm: let users to tag specific PFNs Oak Zeng
  2024-06-13  4:20 ` ✗ CI.Patch_applied: failure for series starting with [CI,01/42] " Patchwork
  2024-06-13  4:23 ` [CI 02/42] dma-mapping: provide an interface to allocate IOVA Oak Zeng
@ 2024-06-13  4:23 ` Oak Zeng
  2024-06-13  4:23 ` [CI 04/42] iommu/dma: Provide an interface to allow preallocate IOVA Oak Zeng
                   ` (38 subsequent siblings)
  41 siblings, 0 replies; 44+ messages in thread
From: Oak Zeng @ 2024-06-13  4:23 UTC (permalink / raw)
  To: intel-xe

From: Leon Romanovsky <leonro@nvidia.com>

Introduce a new DMA link/unlink API to provide a way for advanced users
to directly map/unmap pages without the need to allocate IOVA on every
map call.
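
A hedged usage sketch (the helper name is hypothetical), advancing @dma_offset
by one page per call as the dma_link_range() kernel-doc below describes:

static int link_page_array(struct dma_iova_attrs *iova, struct page **pages,
			   unsigned long npages, dma_addr_t *dma_addrs)
{
	dma_addr_t dma_offset = 0;
	unsigned long i;

	for (i = 0; i < npages; i++) {
		dma_addrs[i] = dma_link_range(pages[i], 0, iova, dma_offset);
		if (dma_mapping_error(iova->dev, dma_addrs[i]))
			return -ENOMEM;
		dma_offset += PAGE_SIZE;
	}
	return 0;
}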

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 include/linux/dma-map-ops.h | 10 +++++++
 include/linux/dma-mapping.h | 13 +++++++++
 kernel/dma/debug.h          |  2 ++
 kernel/dma/direct.h         |  3 ++
 kernel/dma/mapping.c        | 57 +++++++++++++++++++++++++++++++++++++
 5 files changed, 85 insertions(+)

diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 23e5e2f63a1c..292326ac5a12 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -89,6 +89,13 @@ struct dma_map_ops {
 
 	dma_addr_t (*alloc_iova)(struct device *dev, size_t size);
 	void (*free_iova)(struct device *dev, dma_addr_t dma_addr, size_t size);
+	dma_addr_t (*link_range)(struct device *dev, struct page *page,
+				 unsigned long offset, dma_addr_t addr,
+				 size_t size, enum dma_data_direction dir,
+				 unsigned long attrs);
+	void (*unlink_range)(struct device *dev, dma_addr_t dma_handle,
+			     size_t size, enum dma_data_direction dir,
+			     unsigned long attrs);
 };
 
 #ifdef CONFIG_DMA_OPS
@@ -440,6 +447,9 @@ bool arch_dma_unmap_sg_direct(struct device *dev, struct scatterlist *sg,
 #define arch_dma_unmap_sg_direct(d, s, n)	(false)
 #endif
 
+#define arch_dma_link_range_direct arch_dma_map_page_direct
+#define arch_dma_unlink_range_direct arch_dma_unmap_page_direct
+
 #ifdef CONFIG_ARCH_HAS_SETUP_DMA_OPS
 void arch_setup_dma_ops(struct device *dev, bool coherent);
 #else
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 34a3b6420606..223b5477e36e 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -113,6 +113,9 @@ static inline int dma_mapping_error(struct device *dev, dma_addr_t dma_addr)
 
 int dma_alloc_iova(struct dma_iova_attrs *iova);
 void dma_free_iova(struct dma_iova_attrs *iova);
+dma_addr_t dma_link_range(struct page *page, unsigned long offset,
+			  struct dma_iova_attrs *iova, dma_addr_t dma_offset);
+void dma_unlink_range(struct dma_iova_attrs *iova, dma_addr_t dma_offset);
 
 dma_addr_t dma_map_page_attrs(struct device *dev, struct page *page,
 		size_t offset, size_t size, enum dma_data_direction dir,
@@ -170,6 +173,16 @@ static inline int dma_alloc_iova(struct dma_iova_attrs *iova)
 static inline void dma_free_iova(struct dma_iova_attrs *iova)
 {
 }
+static inline dma_addr_t dma_link_range(struct page *page, unsigned long offset,
+					struct dma_iova_attrs *iova,
+					dma_addr_t dma_offset)
+{
+	return DMA_MAPPING_ERROR;
+}
+static inline void dma_unlink_range(struct dma_iova_attrs *iova,
+				    dma_addr_t dma_offset)
+{
+}
 static inline dma_addr_t dma_map_page_attrs(struct device *dev,
 		struct page *page, size_t offset, size_t size,
 		enum dma_data_direction dir, unsigned long attrs)
diff --git a/kernel/dma/debug.h b/kernel/dma/debug.h
index f525197d3cae..3d529f355c6d 100644
--- a/kernel/dma/debug.h
+++ b/kernel/dma/debug.h
@@ -127,4 +127,6 @@ static inline void debug_dma_sync_sg_for_device(struct device *dev,
 {
 }
 #endif /* CONFIG_DMA_API_DEBUG */
+#define debug_dma_link_range debug_dma_map_page
+#define debug_dma_unlink_range debug_dma_unmap_page
 #endif /* _KERNEL_DMA_DEBUG_H */
diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h
index 18d346118fe8..1c30e1cd607a 100644
--- a/kernel/dma/direct.h
+++ b/kernel/dma/direct.h
@@ -125,4 +125,7 @@ static inline void dma_direct_unmap_page(struct device *dev, dma_addr_t addr,
 		swiotlb_tbl_unmap_single(dev, phys, size, dir,
 					 attrs | DMA_ATTR_SKIP_CPU_SYNC);
 }
+
+#define dma_direct_link_range dma_direct_map_page
+#define dma_direct_unlink_range dma_direct_unmap_page
 #endif /* _KERNEL_DMA_DIRECT_H */
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 4d14637d186b..787d7516434f 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -213,6 +213,63 @@ void dma_free_iova(struct dma_iova_attrs *iova)
 }
 EXPORT_SYMBOL(dma_free_iova);
 
+/**
+ * dma_link_range - Link a physical page to DMA address
+ * @page: The page to be mapped
+ * @offset: The offset within the page
+ * @iova: Preallocated IOVA attributes
+ * @dma_offset: DMA offset from which this page needs to be linked
+ *
+ * dma_alloc_iova() allocates IOVA based on the size specified by the user in
+ * iova->size. Call this function after IOVA allocation to link @page from
+ * @offset to get the DMA address. Note that the very first call to this
+ * function will have @dma_offset set to 0 in the IOVA space allocated from
+ * dma_alloc_iova(). For subsequent calls to this function on the same @iova,
+ * @dma_offset needs to be advanced by the caller by the size of the previous
+ * page that was linked, so that each linked page follows the pages linked
+ * by earlier calls to this function.
+ */
+dma_addr_t dma_link_range(struct page *page, unsigned long offset,
+			  struct dma_iova_attrs *iova, dma_addr_t dma_offset)
+{
+	struct device *dev = iova->dev;
+	size_t size = iova->size;
+	enum dma_data_direction dir = iova->dir;
+	unsigned long attrs = iova->attrs;
+	dma_addr_t addr = iova->addr + dma_offset;
+	const struct dma_map_ops *ops = get_dma_ops(dev);
+
+	if (dma_map_direct(dev, ops) ||
+	    arch_dma_link_range_direct(dev, page_to_phys(page) + offset + size))
+		addr = dma_direct_link_range(dev, page, offset, size, dir, attrs);
+	else if (ops->link_range)
+		addr = ops->link_range(dev, page, offset, addr, size, dir, attrs);
+
+	kmsan_handle_dma(page, offset, size, dir);
+	debug_dma_link_range(dev, page, offset, size, dir, addr, attrs);
+	return addr;
+}
+EXPORT_SYMBOL(dma_link_range);
+
+void dma_unlink_range(struct dma_iova_attrs *iova, dma_addr_t dma_offset)
+{
+	struct device *dev = iova->dev;
+	size_t size = iova->size;
+	enum dma_data_direction dir = iova->dir;
+	unsigned long attrs = iova->attrs;
+	dma_addr_t addr = iova->addr + dma_offset;
+	const struct dma_map_ops *ops = get_dma_ops(dev);
+
+	if (dma_map_direct(dev, ops) ||
+	    arch_dma_unlink_range_direct(dev, addr + size))
+		dma_direct_unlink_range(dev, addr, size, dir, attrs);
+	else if (ops->unlink_range)
+		ops->unlink_range(dev, addr, size, dir, attrs);
+
+	debug_dma_unlink_range(dev, addr, size, dir);
+}
+EXPORT_SYMBOL(dma_unlink_range);
+
 static int __dma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
 	 int nents, enum dma_data_direction dir, unsigned long attrs)
 {
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [CI 04/42] iommu/dma: Provide an interface to allow preallocate IOVA
  2024-06-13  4:23 [CI 01/42] mm/hmm: let users to tag specific PFNs Oak Zeng
                   ` (2 preceding siblings ...)
  2024-06-13  4:23 ` [CI 03/42] dma-mapping: provide callbacks to link/unlink pages to specific IOVA Oak Zeng
@ 2024-06-13  4:23 ` Oak Zeng
  2024-06-13  4:23 ` [CI 05/42] iommu/dma: Prepare map/unmap page functions to receive IOVA Oak Zeng
                   ` (37 subsequent siblings)
  41 siblings, 0 replies; 44+ messages in thread
From: Oak Zeng @ 2024-06-13  4:23 UTC (permalink / raw)
  To: intel-xe

From: Leon Romanovsky <leonro@nvidia.com>

Separate IOVA allocation into a dedicated callback so that IOVA can be
cached and reused in fast paths for devices which support the ODP
(on-demand-paging) mechanism.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/iommu/dma-iommu.c | 50 +++++++++++++++++++++++++++++----------
 1 file changed, 38 insertions(+), 12 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 43520e7275cc..9ce8298047f5 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -357,7 +357,7 @@ int iommu_dma_init_fq(struct iommu_domain *domain)
 	atomic_set(&cookie->fq_timer_on, 0);
 	/*
 	 * Prevent incomplete fq state being observable. Pairs with path from
-	 * __iommu_dma_unmap() through iommu_dma_free_iova() to queue_iova()
+	 * __iommu_dma_unmap() through __iommu_dma_free_iova() to queue_iova()
 	 */
 	smp_wmb();
 	WRITE_ONCE(cookie->fq_domain, domain);
@@ -758,7 +758,7 @@ static int dma_info_to_prot(enum dma_data_direction dir, bool coherent,
 	}
 }
 
-static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain *domain,
+static dma_addr_t __iommu_dma_alloc_iova(struct iommu_domain *domain,
 		size_t size, u64 dma_limit, struct device *dev)
 {
 	struct iommu_dma_cookie *cookie = domain->iova_cookie;
@@ -804,7 +804,7 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain *domain,
 	return (dma_addr_t)iova << shift;
 }
 
-static void iommu_dma_free_iova(struct iommu_dma_cookie *cookie,
+static void __iommu_dma_free_iova(struct iommu_dma_cookie *cookie,
 		dma_addr_t iova, size_t size, struct iommu_iotlb_gather *gather)
 {
 	struct iova_domain *iovad = &cookie->iovad;
@@ -841,7 +841,7 @@ static void __iommu_dma_unmap(struct device *dev, dma_addr_t dma_addr,
 
 	if (!iotlb_gather.queued)
 		iommu_iotlb_sync(domain, &iotlb_gather);
-	iommu_dma_free_iova(cookie, dma_addr, size, &iotlb_gather);
+	__iommu_dma_free_iova(cookie, dma_addr, size, &iotlb_gather);
 }
 
 static dma_addr_t __iommu_dma_map(struct device *dev, phys_addr_t phys,
@@ -864,12 +864,12 @@ static dma_addr_t __iommu_dma_map(struct device *dev, phys_addr_t phys,
 
 	size = iova_align(iovad, size + iova_off);
 
-	iova = iommu_dma_alloc_iova(domain, size, dma_mask, dev);
+	iova = __iommu_dma_alloc_iova(domain, size, dma_mask, dev);
 	if (!iova)
 		return DMA_MAPPING_ERROR;
 
 	if (iommu_map(domain, iova, phys - iova_off, size, prot, GFP_ATOMIC)) {
-		iommu_dma_free_iova(cookie, iova, size, NULL);
+		__iommu_dma_free_iova(cookie, iova, size, NULL);
 		return DMA_MAPPING_ERROR;
 	}
 	return iova + iova_off;
@@ -973,7 +973,7 @@ static struct page **__iommu_dma_alloc_noncontiguous(struct device *dev,
 		return NULL;
 
 	size = iova_align(iovad, size);
-	iova = iommu_dma_alloc_iova(domain, size, dev->coherent_dma_mask, dev);
+	iova = __iommu_dma_alloc_iova(domain, size, dev->coherent_dma_mask, dev);
 	if (!iova)
 		goto out_free_pages;
 
@@ -1007,7 +1007,7 @@ static struct page **__iommu_dma_alloc_noncontiguous(struct device *dev,
 out_free_sg:
 	sg_free_table(sgt);
 out_free_iova:
-	iommu_dma_free_iova(cookie, iova, size, NULL);
+	__iommu_dma_free_iova(cookie, iova, size, NULL);
 out_free_pages:
 	__iommu_dma_free_pages(pages, count);
 	return NULL;
@@ -1442,7 +1442,7 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
 	if (!iova_len)
 		return __finalise_sg(dev, sg, nents, 0);
 
-	iova = iommu_dma_alloc_iova(domain, iova_len, dma_get_mask(dev), dev);
+	iova = __iommu_dma_alloc_iova(domain, iova_len, dma_get_mask(dev), dev);
 	if (!iova) {
 		ret = -ENOMEM;
 		goto out_restore_sg;
@@ -1459,7 +1459,7 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
 	return __finalise_sg(dev, sg, nents, iova);
 
 out_free_iova:
-	iommu_dma_free_iova(cookie, iova, iova_len, NULL);
+	__iommu_dma_free_iova(cookie, iova, iova_len, NULL);
 out_restore_sg:
 	__invalidate_sg(sg, nents);
 out:
@@ -1720,6 +1720,30 @@ static size_t iommu_dma_max_mapping_size(struct device *dev)
 	return SIZE_MAX;
 }
 
+static dma_addr_t iommu_dma_alloc_iova(struct device *dev, size_t size)
+{
+	struct iommu_domain *domain = iommu_get_dma_domain(dev);
+	struct iommu_dma_cookie *cookie = domain->iova_cookie;
+	struct iova_domain *iovad = &cookie->iovad;
+	dma_addr_t dma_mask = dma_get_mask(dev);
+
+	size = iova_align(iovad, size);
+	return __iommu_dma_alloc_iova(domain, size, dma_mask, dev);
+}
+
+static void iommu_dma_free_iova(struct device *dev, dma_addr_t iova,
+				size_t size)
+{
+	struct iommu_domain *domain = iommu_get_dma_domain(dev);
+	struct iommu_dma_cookie *cookie = domain->iova_cookie;
+	struct iova_domain *iovad = &cookie->iovad;
+	struct iommu_iotlb_gather iotlb_gather;
+
+	size = iova_align(iovad, size);
+	iommu_iotlb_gather_init(&iotlb_gather);
+	__iommu_dma_free_iova(cookie, iova, size, &iotlb_gather);
+}
+
 static const struct dma_map_ops iommu_dma_ops = {
 	.flags			= DMA_F_PCI_P2PDMA_SUPPORTED |
 				  DMA_F_CAN_SKIP_SYNC,
@@ -1744,6 +1768,8 @@ static const struct dma_map_ops iommu_dma_ops = {
 	.get_merge_boundary	= iommu_dma_get_merge_boundary,
 	.opt_mapping_size	= iommu_dma_opt_mapping_size,
 	.max_mapping_size       = iommu_dma_max_mapping_size,
+	.alloc_iova		= iommu_dma_alloc_iova,
+	.free_iova		= iommu_dma_free_iova,
 };
 
 void iommu_setup_dma_ops(struct device *dev)
@@ -1786,7 +1812,7 @@ static struct iommu_dma_msi_page *iommu_dma_get_msi_page(struct device *dev,
 	if (!msi_page)
 		return NULL;
 
-	iova = iommu_dma_alloc_iova(domain, size, dma_get_mask(dev), dev);
+	iova = __iommu_dma_alloc_iova(domain, size, dma_get_mask(dev), dev);
 	if (!iova)
 		goto out_free_page;
 
@@ -1800,7 +1826,7 @@ static struct iommu_dma_msi_page *iommu_dma_get_msi_page(struct device *dev,
 	return msi_page;
 
 out_free_iova:
-	iommu_dma_free_iova(cookie, iova, size, NULL);
+	__iommu_dma_free_iova(cookie, iova, size, NULL);
 out_free_page:
 	kfree(msi_page);
 	return NULL;
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [CI 05/42] iommu/dma: Prepare map/unmap page functions to receive IOVA
  2024-06-13  4:23 [CI 01/42] mm/hmm: let users to tag specific PFNs Oak Zeng
                   ` (3 preceding siblings ...)
  2024-06-13  4:23 ` [CI 04/42] iommu/dma: Provide an interface to allow preallocate IOVA Oak Zeng
@ 2024-06-13  4:23 ` Oak Zeng
  2024-06-13  4:23 ` [CI 06/42] iommu/dma: Implement link/unlink page callbacks Oak Zeng
                   ` (36 subsequent siblings)
  41 siblings, 0 replies; 44+ messages in thread
From: Oak Zeng @ 2024-06-13  4:23 UTC (permalink / raw)
  To: intel-xe

From: Leon Romanovsky <leonro@nvidia.com>

Extend the existing map_page/unmap_page function implementations to accept
a preallocated IOVA. In that case, the IOVA allocation needs to be
skipped, but the rest of the code stays the same.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/iommu/dma-iommu.c | 68 ++++++++++++++++++++++++++-------------
 1 file changed, 45 insertions(+), 23 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 9ce8298047f5..dbef2581a98c 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -822,7 +822,7 @@ static void __iommu_dma_free_iova(struct iommu_dma_cookie *cookie,
 }
 
 static void __iommu_dma_unmap(struct device *dev, dma_addr_t dma_addr,
-		size_t size)
+			      size_t size, bool free_iova)
 {
 	struct iommu_domain *domain = iommu_get_dma_domain(dev);
 	struct iommu_dma_cookie *cookie = domain->iova_cookie;
@@ -841,17 +841,19 @@ static void __iommu_dma_unmap(struct device *dev, dma_addr_t dma_addr,
 
 	if (!iotlb_gather.queued)
 		iommu_iotlb_sync(domain, &iotlb_gather);
-	__iommu_dma_free_iova(cookie, dma_addr, size, &iotlb_gather);
+	if (free_iova)
+		__iommu_dma_free_iova(cookie, dma_addr, size, &iotlb_gather);
 }
 
 static dma_addr_t __iommu_dma_map(struct device *dev, phys_addr_t phys,
-		size_t size, int prot, u64 dma_mask)
+				  dma_addr_t iova, size_t size, int prot,
+				  u64 dma_mask)
 {
 	struct iommu_domain *domain = iommu_get_dma_domain(dev);
 	struct iommu_dma_cookie *cookie = domain->iova_cookie;
 	struct iova_domain *iovad = &cookie->iovad;
 	size_t iova_off = iova_offset(iovad, phys);
-	dma_addr_t iova;
+	bool no_iova = !iova;
 
 	if (static_branch_unlikely(&iommu_deferred_attach_enabled) &&
 	    iommu_deferred_attach(dev, domain))
@@ -864,12 +866,14 @@ static dma_addr_t __iommu_dma_map(struct device *dev, phys_addr_t phys,
 
 	size = iova_align(iovad, size + iova_off);
 
-	iova = __iommu_dma_alloc_iova(domain, size, dma_mask, dev);
+	if (no_iova)
+		iova = __iommu_dma_alloc_iova(domain, size, dma_mask, dev);
 	if (!iova)
 		return DMA_MAPPING_ERROR;
 
 	if (iommu_map(domain, iova, phys - iova_off, size, prot, GFP_ATOMIC)) {
-		__iommu_dma_free_iova(cookie, iova, size, NULL);
+		if (no_iova)
+			__iommu_dma_free_iova(cookie, iova, size, NULL);
 		return DMA_MAPPING_ERROR;
 	}
 	return iova + iova_off;
@@ -1034,7 +1038,7 @@ static void *iommu_dma_alloc_remap(struct device *dev, size_t size,
 	return vaddr;
 
 out_unmap:
-	__iommu_dma_unmap(dev, *dma_handle, size);
+	__iommu_dma_unmap(dev, *dma_handle, size, true);
 	__iommu_dma_free_pages(pages, PAGE_ALIGN(size) >> PAGE_SHIFT);
 	return NULL;
 }
@@ -1063,7 +1067,7 @@ static void iommu_dma_free_noncontiguous(struct device *dev, size_t size,
 {
 	struct dma_sgt_handle *sh = sgt_handle(sgt);
 
-	__iommu_dma_unmap(dev, sgt->sgl->dma_address, size);
+	__iommu_dma_unmap(dev, sgt->sgl->dma_address, size, true);
 	__iommu_dma_free_pages(sh->pages, PAGE_ALIGN(size) >> PAGE_SHIFT);
 	sg_free_table(&sh->sgt);
 	kfree(sh);
@@ -1134,9 +1138,11 @@ static void iommu_dma_sync_sg_for_device(struct device *dev,
 			arch_sync_dma_for_device(sg_phys(sg), sg->length, dir);
 }
 
-static dma_addr_t iommu_dma_map_page(struct device *dev, struct page *page,
-		unsigned long offset, size_t size, enum dma_data_direction dir,
-		unsigned long attrs)
+static dma_addr_t __iommu_dma_map_pages(struct device *dev, struct page *page,
+					unsigned long offset, dma_addr_t iova,
+					size_t size,
+					enum dma_data_direction dir,
+					unsigned long attrs)
 {
 	phys_addr_t phys = page_to_phys(page) + offset;
 	bool coherent = dev_is_dma_coherent(dev);
@@ -1144,7 +1150,7 @@ static dma_addr_t iommu_dma_map_page(struct device *dev, struct page *page,
 	struct iommu_domain *domain = iommu_get_dma_domain(dev);
 	struct iommu_dma_cookie *cookie = domain->iova_cookie;
 	struct iova_domain *iovad = &cookie->iovad;
-	dma_addr_t iova, dma_mask = dma_get_mask(dev);
+	dma_addr_t addr, dma_mask = dma_get_mask(dev);
 
 	/*
 	 * If both the physical buffer start address and size are
@@ -1188,14 +1194,23 @@ static dma_addr_t iommu_dma_map_page(struct device *dev, struct page *page,
 	if (!coherent && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
 		arch_sync_dma_for_device(phys, size, dir);
 
-	iova = __iommu_dma_map(dev, phys, size, prot, dma_mask);
-	if (iova == DMA_MAPPING_ERROR && is_swiotlb_buffer(dev, phys))
+	addr = __iommu_dma_map(dev, phys, iova, size, prot, dma_mask);
+	if (addr == DMA_MAPPING_ERROR && is_swiotlb_buffer(dev, phys))
 		swiotlb_tbl_unmap_single(dev, phys, size, dir, attrs);
-	return iova;
+	return addr;
 }
 
-static void iommu_dma_unmap_page(struct device *dev, dma_addr_t dma_handle,
-		size_t size, enum dma_data_direction dir, unsigned long attrs)
+static dma_addr_t iommu_dma_map_page(struct device *dev, struct page *page,
+				     unsigned long offset, size_t size,
+				     enum dma_data_direction dir,
+				     unsigned long attrs)
+{
+	return __iommu_dma_map_pages(dev, page, offset, 0, size, dir, attrs);
+}
+
+static void __iommu_dma_unmap_pages(struct device *dev, dma_addr_t dma_handle,
+				    size_t size, enum dma_data_direction dir,
+				    unsigned long attrs, bool free_iova)
 {
 	struct iommu_domain *domain = iommu_get_dma_domain(dev);
 	phys_addr_t phys;
@@ -1207,12 +1222,19 @@ static void iommu_dma_unmap_page(struct device *dev, dma_addr_t dma_handle,
 	if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) && !dev_is_dma_coherent(dev))
 		arch_sync_dma_for_cpu(phys, size, dir);
 
-	__iommu_dma_unmap(dev, dma_handle, size);
+	__iommu_dma_unmap(dev, dma_handle, size, free_iova);
 
 	if (unlikely(is_swiotlb_buffer(dev, phys)))
 		swiotlb_tbl_unmap_single(dev, phys, size, dir, attrs);
 }
 
+static void iommu_dma_unmap_page(struct device *dev, dma_addr_t dma_handle,
+				 size_t size, enum dma_data_direction dir,
+				 unsigned long attrs)
+{
+	__iommu_dma_unmap_pages(dev, dma_handle, size, dir, attrs, true);
+}
+
 /*
  * Prepare a successfully-mapped scatterlist to give back to the caller.
  *
@@ -1515,13 +1537,13 @@ static void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
 	}
 
 	if (end)
-		__iommu_dma_unmap(dev, start, end - start);
+		__iommu_dma_unmap(dev, start, end - start, true);
 }
 
 static dma_addr_t iommu_dma_map_resource(struct device *dev, phys_addr_t phys,
 		size_t size, enum dma_data_direction dir, unsigned long attrs)
 {
-	return __iommu_dma_map(dev, phys, size,
+	return __iommu_dma_map(dev, phys, 0, size,
 			dma_info_to_prot(dir, false, attrs) | IOMMU_MMIO,
 			dma_get_mask(dev));
 }
@@ -1529,7 +1551,7 @@ static dma_addr_t iommu_dma_map_resource(struct device *dev, phys_addr_t phys,
 static void iommu_dma_unmap_resource(struct device *dev, dma_addr_t handle,
 		size_t size, enum dma_data_direction dir, unsigned long attrs)
 {
-	__iommu_dma_unmap(dev, handle, size);
+	__iommu_dma_unmap(dev, handle, size, true);
 }
 
 static void __iommu_dma_free(struct device *dev, size_t size, void *cpu_addr)
@@ -1566,7 +1588,7 @@ static void __iommu_dma_free(struct device *dev, size_t size, void *cpu_addr)
 static void iommu_dma_free(struct device *dev, size_t size, void *cpu_addr,
 		dma_addr_t handle, unsigned long attrs)
 {
-	__iommu_dma_unmap(dev, handle, size);
+	__iommu_dma_unmap(dev, handle, size, true);
 	__iommu_dma_free(dev, size, cpu_addr);
 }
 
@@ -1632,7 +1654,7 @@ static void *iommu_dma_alloc(struct device *dev, size_t size,
 	if (!cpu_addr)
 		return NULL;
 
-	*handle = __iommu_dma_map(dev, page_to_phys(page), size, ioprot,
+	*handle = __iommu_dma_map(dev, page_to_phys(page), 0, size, ioprot,
 			dev->coherent_dma_mask);
 	if (*handle == DMA_MAPPING_ERROR) {
 		__iommu_dma_free(dev, size, cpu_addr);
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [CI 06/42] iommu/dma: Implement link/unlink page callbacks
  2024-06-13  4:23 [CI 01/42] mm/hmm: let users to tag specific PFNs Oak Zeng
                   ` (4 preceding siblings ...)
  2024-06-13  4:23 ` [CI 05/42] iommu/dma: Prepare map/unmap page functions to receive IOVA Oak Zeng
@ 2024-06-13  4:23 ` Oak Zeng
  2024-06-13  4:23 ` [CI 07/42] drm: Move GPUVA_START/LAST to drm_gpuvm.h Oak Zeng
                   ` (35 subsequent siblings)
  41 siblings, 0 replies; 44+ messages in thread
From: Oak Zeng @ 2024-06-13  4:23 UTC (permalink / raw)
  To: intel-xe

From: Leon Romanovsky <leonro@nvidia.com>

Add an implementation of the link/unlink interface to map/unmap
pages in the fast path for a pre-allocated IOVA.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/iommu/dma-iommu.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index dbef2581a98c..ddf402e7c8a3 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1766,6 +1766,21 @@ static void iommu_dma_free_iova(struct device *dev, dma_addr_t iova,
 	__iommu_dma_free_iova(cookie, iova, size, &iotlb_gather);
 }
 
+static dma_addr_t iommu_dma_link_range(struct device *dev, struct page *page,
+				       unsigned long offset, dma_addr_t iova,
+				       size_t size, enum dma_data_direction dir,
+				       unsigned long attrs)
+{
+	return __iommu_dma_map_pages(dev, page, offset, iova, size, dir, attrs);
+}
+
+static void iommu_dma_unlink_range(struct device *dev, dma_addr_t addr,
+				   size_t size, enum dma_data_direction dir,
+				   unsigned long attrs)
+{
+	__iommu_dma_unmap_pages(dev, addr, size, dir, attrs, false);
+}
+
 static const struct dma_map_ops iommu_dma_ops = {
 	.flags			= DMA_F_PCI_P2PDMA_SUPPORTED |
 				  DMA_F_CAN_SKIP_SYNC,
@@ -1792,6 +1807,8 @@ static const struct dma_map_ops iommu_dma_ops = {
 	.max_mapping_size       = iommu_dma_max_mapping_size,
 	.alloc_iova		= iommu_dma_alloc_iova,
 	.free_iova		= iommu_dma_free_iova,
+	.link_range		= iommu_dma_link_range,
+	.unlink_range		= iommu_dma_unlink_range,
 };
 
 void iommu_setup_dma_ops(struct device *dev)
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [CI 07/42] drm: Move GPUVA_START/LAST to drm_gpuvm.h
  2024-06-13  4:23 [CI 01/42] mm/hmm: let users to tag specific PFNs Oak Zeng
                   ` (5 preceding siblings ...)
  2024-06-13  4:23 ` [CI 06/42] iommu/dma: Implement link/unlink page callbacks Oak Zeng
@ 2024-06-13  4:23 ` Oak Zeng
  2024-06-13  4:23 ` [CI 08/42] drm/svm: Mark drm_gpuvm to participate SVM Oak Zeng
                   ` (34 subsequent siblings)
  41 siblings, 0 replies; 44+ messages in thread
From: Oak Zeng @ 2024-06-13  4:23 UTC (permalink / raw)
  To: intel-xe

Move them to the .h file so they can be used outside of drm_gpuvm.c.
Also add a GPUVA_END macro.
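
For illustration (not part of this patch), code outside drm_gpuvm.c can now
do interval checks such as:

static bool gpuva_contains(const struct drm_gpuva *va, u64 addr)
{
	return addr >= GPUVA_START(va) && addr < GPUVA_END(va);
}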

Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/drm_gpuvm.c | 3 ---
 include/drm/drm_gpuvm.h     | 4 ++++
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/drm_gpuvm.c b/drivers/gpu/drm/drm_gpuvm.c
index f9eb56f24bef..7402ed6f1d33 100644
--- a/drivers/gpu/drm/drm_gpuvm.c
+++ b/drivers/gpu/drm/drm_gpuvm.c
@@ -867,9 +867,6 @@ __drm_gpuvm_bo_list_del(struct drm_gpuvm *gpuvm, spinlock_t *lock,
 
 #define to_drm_gpuva(__node)	container_of((__node), struct drm_gpuva, rb.node)
 
-#define GPUVA_START(node) ((node)->va.addr)
-#define GPUVA_LAST(node) ((node)->va.addr + (node)->va.range - 1)
-
 /* We do not actually use drm_gpuva_it_next(), tell the compiler to not complain
  * about this.
  */
diff --git a/include/drm/drm_gpuvm.h b/include/drm/drm_gpuvm.h
index 00d4e43b76b6..429dc0d82eba 100644
--- a/include/drm/drm_gpuvm.h
+++ b/include/drm/drm_gpuvm.h
@@ -38,6 +38,10 @@ struct drm_gpuvm;
 struct drm_gpuvm_bo;
 struct drm_gpuvm_ops;
 
+#define GPUVA_START(node) ((node)->va.addr)
+#define GPUVA_LAST(node) ((node)->va.addr + (node)->va.range - 1)
+#define GPUVA_END(node) ((node)->va.addr + (node)->va.range)
+
 /**
  * enum drm_gpuva_flags - flags for struct drm_gpuva
  */
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [CI 08/42] drm/svm: Mark drm_gpuvm to participate SVM
  2024-06-13  4:23 [CI 01/42] mm/hmm: let users to tag specific PFNs Oak Zeng
                   ` (6 preceding siblings ...)
  2024-06-13  4:23 ` [CI 07/42] drm: Move GPUVA_START/LAST to drm_gpuvm.h Oak Zeng
@ 2024-06-13  4:23 ` Oak Zeng
  2024-06-13  4:23 ` [CI 09/42] drm/svm: introduce drm_mem_region concept Oak Zeng
                   ` (33 subsequent siblings)
  41 siblings, 0 replies; 44+ messages in thread
From: Oak Zeng @ 2024-06-13  4:23 UTC (permalink / raw)
  To: intel-xe

A mm_struct field is added to drm_gpuvm. Also add a parameter to
drm_gpuvm_init to say whether this gpuvm participates in SVM (shared
virtual memory with the CPU process). Under SVM, the CPU program and GPU
program share one process virtual address space.
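
A hypothetical SVM-capable caller (not added by this patch) would pass true,
while the existing callers converted below keep passing false:

	drm_gpuvm_init(&vm->gpuvm, "Xe SVM VM", DRM_GPUVM_RESV_PROTECTED,
		       &xe->drm, vm_resv_obj, 0, vm->size, 0, 0,
		       &gpuvm_ops, true);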

Cc: Dave Airlie <airlied@redhat.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
Cc: <dri-devel@lists.freedesktop.org>
Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/drm_gpuvm.c            | 7 ++++++-
 drivers/gpu/drm/nouveau/nouveau_uvmm.c | 2 +-
 drivers/gpu/drm/xe/xe_vm.c             | 2 +-
 include/drm/drm_gpuvm.h                | 8 +++++++-
 4 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/drm_gpuvm.c b/drivers/gpu/drm/drm_gpuvm.c
index 7402ed6f1d33..5f246ca472a8 100644
--- a/drivers/gpu/drm/drm_gpuvm.c
+++ b/drivers/gpu/drm/drm_gpuvm.c
@@ -984,6 +984,8 @@ EXPORT_SYMBOL_GPL(drm_gpuvm_resv_object_alloc);
  * @reserve_offset: the start of the kernel reserved GPU VA area
  * @reserve_range: the size of the kernel reserved GPU VA area
  * @ops: &drm_gpuvm_ops called on &drm_gpuvm_sm_map / &drm_gpuvm_sm_unmap
+ * @participate_svm: whether this gpuvm participates in shared virtual
+ * memory with the CPU mm
  *
  * The &drm_gpuvm must be initialized with this function before use.
  *
@@ -997,7 +999,7 @@ drm_gpuvm_init(struct drm_gpuvm *gpuvm, const char *name,
 	       struct drm_gem_object *r_obj,
 	       u64 start_offset, u64 range,
 	       u64 reserve_offset, u64 reserve_range,
-	       const struct drm_gpuvm_ops *ops)
+	       const struct drm_gpuvm_ops *ops, bool participate_svm)
 {
 	gpuvm->rb.tree = RB_ROOT_CACHED;
 	INIT_LIST_HEAD(&gpuvm->rb.list);
@@ -1016,6 +1018,9 @@ drm_gpuvm_init(struct drm_gpuvm *gpuvm, const char *name,
 	gpuvm->drm = drm;
 	gpuvm->r_obj = r_obj;
 
+	if (participate_svm)
+		gpuvm->mm = current->mm;
+
 	drm_gem_object_get(r_obj);
 
 	drm_gpuvm_warn_check_overflow(gpuvm, start_offset, range);
diff --git a/drivers/gpu/drm/nouveau/nouveau_uvmm.c b/drivers/gpu/drm/nouveau/nouveau_uvmm.c
index ee02cd833c5e..0d11f1733e29 100644
--- a/drivers/gpu/drm/nouveau/nouveau_uvmm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_uvmm.c
@@ -1861,7 +1861,7 @@ nouveau_uvmm_ioctl_vm_init(struct drm_device *dev,
 		       NOUVEAU_VA_SPACE_END,
 		       init->kernel_managed_addr,
 		       init->kernel_managed_size,
-		       &gpuvm_ops);
+		       &gpuvm_ops, false);
 	/* GPUVM takes care from here on. */
 	drm_gem_object_put(r_obj);
 
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 99bf7412475c..14069fc9d820 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -1363,7 +1363,7 @@ struct xe_vm *xe_vm_create(struct xe_device *xe, u32 flags)
 	}
 
 	drm_gpuvm_init(&vm->gpuvm, "Xe VM", DRM_GPUVM_RESV_PROTECTED, &xe->drm,
-		       vm_resv_obj, 0, vm->size, 0, 0, &gpuvm_ops);
+		       vm_resv_obj, 0, vm->size, 0, 0, &gpuvm_ops, false);
 
 	drm_gem_object_put(vm_resv_obj);
 
diff --git a/include/drm/drm_gpuvm.h b/include/drm/drm_gpuvm.h
index 429dc0d82eba..838dd7137f07 100644
--- a/include/drm/drm_gpuvm.h
+++ b/include/drm/drm_gpuvm.h
@@ -242,6 +242,12 @@ struct drm_gpuvm {
 	 * @drm: the &drm_device this VM lives in
 	 */
 	struct drm_device *drm;
+	/**
+	 * @mm: the process &mm_struct which created this gpuvm.
+	 * This is only used for shared virtual memory where the virtual
+	 * address space is shared between CPU and GPU programs.
+	 */
+	struct mm_struct *mm;
 
 	/**
 	 * @mm_start: start of the VA space
@@ -342,7 +348,7 @@ void drm_gpuvm_init(struct drm_gpuvm *gpuvm, const char *name,
 		    struct drm_gem_object *r_obj,
 		    u64 start_offset, u64 range,
 		    u64 reserve_offset, u64 reserve_range,
-		    const struct drm_gpuvm_ops *ops);
+		    const struct drm_gpuvm_ops *ops, bool participate_svm);
 
 /**
  * drm_gpuvm_get() - acquire a struct drm_gpuvm reference
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [CI 09/42] drm/svm: introduce drm_mem_region concept
  2024-06-13  4:23 [CI 01/42] mm/hmm: let users to tag specific PFNs Oak Zeng
                   ` (7 preceding siblings ...)
  2024-06-13  4:23 ` [CI 08/42] drm/svm: Mark drm_gpuvm to participate SVM Oak Zeng
@ 2024-06-13  4:23 ` Oak Zeng
  2024-06-13  4:23 ` [CI 10/42] drm/svm: introduce hmmptr and helper functions Oak Zeng
                   ` (32 subsequent siblings)
  41 siblings, 0 replies; 44+ messages in thread
From: Oak Zeng @ 2024-06-13  4:23 UTC (permalink / raw)
  To: intel-xe

As its name indicates, a drm_mem_region represents a memory region on
a drm device, e.g., a GPU's HBM memory.

A memory region carries the address information of the region from both
the CPU (hpa_base) and GPU (dpa_base) perspectives. It has a pagemap
member which is used to memremap the memory region to ZONE_DEVICE. It also
has interfaces for drm to call back into the driver to allocate/free memory
from this region, migrate data to/from this memory region, get the pagemap
owner of a memory region, and get the drm device which owns the memory
region.

This is introduced for the system allocator implementation, so the memory
allocation and free interfaces are page based.

A few helper functions are also introduced (a short usage sketch follows
the list):
1) drm_mem_region_page_to_dpa: calculate the device physical address of
   a page
2) drm_page_to_mem_region: retrieve the drm memory region that a page
   resides in
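
A minimal usage sketch of the two helpers (assumed driver-side code, not part
of this patch):

static u64 page_to_gpu_addr(struct page *page)
{
	struct drm_mem_region *mr = drm_page_to_mem_region(page);

	/* page must be a ZONE_DEVICE page backed by this memory region */
	return drm_mem_region_page_to_dpa(mr, page);
}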

v1: use page parameter instead of pfn (Matt)
Add BUG_ON(mr != drm_page_to_mem_region(page)) (Matt)

Cc: Dave Airlie <airlied@redhat.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
Cc: <dri-devel@lists.freedesktop.org>
Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 include/drm/drm_svm.h | 162 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 162 insertions(+)
 create mode 100644 include/drm/drm_svm.h

diff --git a/include/drm/drm_svm.h b/include/drm/drm_svm.h
new file mode 100644
index 000000000000..a383c7251e2b
--- /dev/null
+++ b/include/drm/drm_svm.h
@@ -0,0 +1,162 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2024 Intel Corporation
+ */
+
+#ifndef _DRM_SVM__
+#define _DRM_SVM__
+
+#include <linux/compiler_types.h>
+#include <linux/memremap.h>
+#include <linux/types.h>
+
+struct dma_fence;
+struct drm_mem_region;
+
+/**
+ * struct migrate_vec - a migration vector is an array of addresses,
+ * each of which represents one page.
+ * When it is system memory page, the address is a dmap-mapped address.
+ * When it is vram page, the address is device physical address.
+ */
+struct migrate_vec {
+	/**
+	 * @mr: the memory region that pages reside.
+	 * When it is system memory page, mr is NULL.
+	 */
+	struct drm_mem_region *mr;
+	/** @npages: number of pages */
+	u64 npages;
+	/**
+	 * @addr_vec: address vector
+	 * each item in addr_vec is an address of a page
+	 */
+	union {
+		/** @dma_addr: dma-mapped address of the page, only valid for system pages*/
+		dma_addr_t dma_addr;
+		/** @dpa: device physical address of the page, only valid for vram page*/
+		phys_addr_t dpa;
+	} addr_vec[1];
+};
+
+/**
+ * struct drm_mem_region_ops - memory region operations such as memory allocation
+ * and migration etc. Driver is supposed to implement those operations.
+ */
+struct drm_mem_region_ops {
+	/**
+	 * @drm_mem_region_alloc_pages: Called from drm to driver to allocate
+	 * device VRAM memory
+	 * @mr: The memory region where to allocate device VRAM memory from
+	 * @npages: number of pages to allocate
+	 * @pfns: Used to return pfns of each page
+	 */
+	int (*drm_mem_region_alloc_pages) (struct drm_mem_region *mr,
+						unsigned long npages, unsigned long *pfns);
+	/**
+	 * @drm_mem_region_free_page: Called from drm to free one page of device memory
+	 * @page: pointer to the page to free
+	 */
+	void (*drm_mem_region_free_page) (struct page *page);
+	/**
+	 * @drm_mem_region_migrate: Called from drm to migrate memory from src to
+	 * dst. Driver is supposed to implement this function using device hardware
+	 * accelerators such as DMA. The DRM subsystem calls this function to migrate
+	 * memory between system memory and a device memory region
+	 *
+	 * @src_vec: source migration vector
+	 * @dst_vec: destination migration vector
+	 */
+	struct dma_fence* (*drm_mem_region_migrate)(struct migrate_vec *src_vec,
+			struct migrate_vec *dst_vec);
+	/**
+	 * @drm_mem_region_pagemap_owner: Return the pagemap owner of a memory
+	 * region. Pagemap owner is the owner of the device memory. It is
+	 * defined by device driver and opaque to drm layer. Drm uses pagemap
+	 * owner to set up page migrations (see hmm function migrate_vma_setup)
+	 * and range population (see hmm function hmm_range_fault). Driver has
+	 * the freedom to choose the right pagemap owner.
+	 *
+	 * @mr: the memory region which we want to get the page map owner
+	 */
+	void* (*drm_mem_region_pagemap_owner)(struct drm_mem_region *mr);
+	/**
+	 * @drm_mem_region_get_device: Return the drm device which has this memory
+	 * region.
+	 *
+	 * @mr: the memory region
+	 */
+	struct drm_device* (*drm_mem_region_get_device)(struct drm_mem_region *mr);
+};
+
+/**
+ * struct drm_mem_region - memory region structure
+ * This is used to describe a memory region in drm
+ * device, such as HBM memory or CXL extension memory.
+ *
+ * drm_mem_region is converted from the xe_mem_region
+ * concept. xe_mem_region is moved to drm layer and renamed
+ * as drm_mem_region.
+ *
+ * drm_mem_region is supposed to be embedded in some driver struct such as
+ * "struct xe_tile" or "struct amdgpu_device"
+ */
+struct drm_mem_region {
+	/** @dev_private: device private data which is opaque to drm layer */
+	void *dev_private;
+	/** @dpa_base: This memory region's DPA (device physical address) base */
+	resource_size_t dpa_base;
+	/**
+	 * @usable_size: usable size of VRAM
+	 *
+	 * Usable size of VRAM excluding reserved portions
+	 * (e.g stolen mem)
+	 */
+	resource_size_t usable_size;
+	/** @pagemap: Used to remap device memory as ZONE_DEVICE */
+	struct dev_pagemap pagemap;
+	/**
+	 * @hpa_base: base host physical address
+	 *
+	 * This is generated when remap device memory as ZONE_DEVICE
+	 */
+	resource_size_t hpa_base;
+	/**
+	 * @mr_ops: memory region operation function pointers
+	 */
+	struct drm_mem_region_ops mr_ops;
+};
+
+/**
+ * drm_page_to_mem_region() - Get a page's memory region
+ *
+ * @page: a struct page pointer pointing to a page in vram memory region
+ */
+static inline struct drm_mem_region *drm_page_to_mem_region(struct page *page)
+{
+	return container_of(page->pgmap, struct drm_mem_region, pagemap);
+}
+
+/**
+ * drm_mem_region_page_to_dpa() - Calculate page's device physical address
+ *
+ * @mr: The memory region that page resides in
+ * @page: page to calculate dpa for
+ *
+ * Returns: the device physical address of the page
+ */
+static inline u64 drm_mem_region_page_to_dpa(struct drm_mem_region *mr, struct page *page)
+{
+	u64 pfn = page_to_pfn(page);
+	u64 offset;
+	u64 dpa;
+
+	BUG_ON(mr != drm_page_to_mem_region(page));
+	BUG_ON((pfn << PAGE_SHIFT) < mr->hpa_base);
+	BUG_ON((pfn << PAGE_SHIFT) >= mr->hpa_base + mr->usable_size);
+	offset = (pfn << PAGE_SHIFT) - mr->hpa_base;
+	dpa = mr->dpa_base + offset;
+
+	return dpa;
+}
+#endif
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [CI 10/42] drm/svm: introduce hmmptr and helper functions
  2024-06-13  4:23 [CI 01/42] mm/hmm: let users to tag specific PFNs Oak Zeng
                   ` (8 preceding siblings ...)
  2024-06-13  4:23 ` [CI 09/42] drm/svm: introduce drm_mem_region concept Oak Zeng
@ 2024-06-13  4:23 ` Oak Zeng
  2024-06-13  4:23 ` [CI 11/42] drm/svm: Introduce helper to remap drm memory region Oak Zeng
                   ` (31 subsequent siblings)
  41 siblings, 0 replies; 44+ messages in thread
From: Oak Zeng @ 2024-06-13  4:23 UTC (permalink / raw)
  To: intel-xe

A hmmptr is a pointer in a CPU program, like a userptr, but unlike
a userptr, a hmmptr can also be migrated to device local memory. Another
way to look at it: a userptr is a special hmmptr without the capability
of migration - a userptr's backing store is always in system memory.

This is built on top of the kernel HMM infrastructure and is thus called
hmmptr.

This is the key concept used to implement SVM (shared virtual memory) in
drm drivers. With SVM, every valid virtual address in a CPU program is
also valid for the GPU program, if the GPUVM participates in SVM. This is
implemented through the hmmptr concept.

Helper functions are introduced to init, release, populate and dma-map
/unmap a hmmptr. A helper will also be introduced to migrate a range of
a hmmptr to device memory. With those helpers, a driver can easily
implement the SVM address space mirroring and migration functionality.
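
A hedged lifecycle sketch using only the helpers added here (the driver-side
function is hypothetical; error handling and the populate step are elided):

static int driver_bind_hmmptr(struct drm_hmmptr *hmmptr,
			      const struct mmu_interval_notifier_ops *ops,
			      u64 npages)
{
	int ret;

	ret = drm_svm_hmmptr_init(hmmptr, ops);
	if (ret)
		return ret;

	/* drm_svm_hmmptr_populate() fills hmmptr->pfn for the range here. */

	drm_svm_hmmptr_map_dma_pages(hmmptr, 0, npages);
	/* Program GPU page tables from hmmptr->dma_addr[]. */
	return 0;
}

On teardown, drm_svm_hmmptr_release() unmaps the dma pages, removes the
notifier and frees the IOVA.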

Cc: Daniel Vetter <daniel.vetter@intel.com>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Leon Romanovsky <leonro@nvidia.com>
Cc: Brian Welty <brian.welty@intel.com>
Cc: Krishna Bommu <krishnaiah.bommu@intel.com>
Cc: <dri-devel@lists.freedesktop.org>
Suggested-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Co-developed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/Kconfig   |   1 +
 drivers/gpu/drm/Makefile  |   1 +
 drivers/gpu/drm/drm_svm.c | 316 ++++++++++++++++++++++++++++++++++++++
 include/drm/drm_svm.h     |  62 ++++++++
 4 files changed, 380 insertions(+)
 create mode 100644 drivers/gpu/drm/drm_svm.c

diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
index 981f43d4ca8c..2c18c8d88444 100644
--- a/drivers/gpu/drm/Kconfig
+++ b/drivers/gpu/drm/Kconfig
@@ -20,6 +20,7 @@ menuconfig DRM
 # device and dmabuf fd. Let's make sure that is available for our userspace.
 	select KCMP
 	select VIDEO
+	select HMM_MIRROR
 	help
 	  Kernel-level support for the Direct Rendering Infrastructure (DRI)
 	  introduced in XFree86 4.0. If you say Y here, you need to select
diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
index 68cc9258ffc4..0006a37c662a 100644
--- a/drivers/gpu/drm/Makefile
+++ b/drivers/gpu/drm/Makefile
@@ -89,6 +89,7 @@ drm-$(CONFIG_DRM_PRIVACY_SCREEN) += \
 	drm_privacy_screen_x86.o
 drm-$(CONFIG_DRM_ACCEL) += ../../accel/drm_accel.o
 drm-$(CONFIG_DRM_PANIC) += drm_panic.o
+drm-$(CONFIG_HMM_MIRROR) += ./drm_svm.o
 obj-$(CONFIG_DRM)	+= drm.o
 
 obj-$(CONFIG_DRM_PANEL_ORIENTATION_QUIRKS) += drm_panel_orientation_quirks.o
diff --git a/drivers/gpu/drm/drm_svm.c b/drivers/gpu/drm/drm_svm.c
new file mode 100644
index 000000000000..9a164615f866
--- /dev/null
+++ b/drivers/gpu/drm/drm_svm.c
@@ -0,0 +1,316 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2024 Intel Corporation
+ */
+
+
+#include <linux/scatterlist.h>
+#include <linux/mmu_notifier.h>
+#include <linux/dma-mapping.h>
+#include <linux/memremap.h>
+#include <drm/drm_gem_dma_helper.h>
+#include <drm/drm_svm.h>
+#include <linux/swap.h>
+#include <linux/bug.h>
+#include <linux/hmm.h>
+#include <linux/mm.h>
+
+static u64 __npages_in_range(unsigned long start, unsigned long end)
+{
+	return (PAGE_ALIGN(end) - PAGE_ALIGN_DOWN(start)) >> PAGE_SHIFT;
+}
+
+/**
+ * __mark_range_accessed() - mark a range as accessed, so the core mm
+ * has such information for memory eviction or write back to
+ * hard disk
+ *
+ * @hmm_pfn: hmm_pfn array to mark
+ * @npages: how many pages to mark
+ * @write: if writing to this range, we mark pages in this range
+ * as dirty
+ */
+static void __mark_range_accessed(unsigned long *hmm_pfn, int npages, bool write)
+{
+	struct page *page;
+	u64 i;
+
+	for (i = 0; i < npages; i++) {
+		page = hmm_pfn_to_page(hmm_pfn[i]);
+		if (write)
+			set_page_dirty_lock(page);
+
+		mark_page_accessed(page);
+	}
+}
+
+static inline u64 __hmmptr_start(struct drm_hmmptr *hmmptr)
+{
+	struct drm_gpuva *gpuva = hmmptr->get_gpuva(hmmptr);
+	u64 start = GPUVA_START(gpuva);
+
+	return start;
+}
+
+static inline u64 __hmmptr_end(struct drm_hmmptr *hmmptr)
+{
+	struct drm_gpuva *gpuva = hmmptr->get_gpuva(hmmptr);
+	u64 end = GPUVA_END(gpuva);
+
+	return end;
+}
+
+static inline u64 __hmmptr_cpu_start(struct drm_hmmptr *hmmptr)
+{
+	struct drm_gpuva *gpuva = hmmptr->get_gpuva(hmmptr);
+
+	/**
+	 * FIXME: xekmd right now use gem.offset for userptr
+	 * FIXME: xekmd right now uses gem.offset for userptr.
+	 * Maybe this needs to be reconsidered.
+	return gpuva->gem.offset;
+}
+
+static inline u64 __hmmptr_cpu_end(struct drm_hmmptr *hmmptr)
+{
+	return __hmmptr_cpu_start(hmmptr) +
+			(__hmmptr_end(hmmptr) - __hmmptr_start(hmmptr));
+}
+
+/**
+ * drm_svm_hmmptr_unmap_dma_pages() - dma unmap a section (must be page boundary) of
+ * hmmptr from iova space
+ *
+ * @hmmptr: hmmptr to dma unmap
+ * @page_idx: from which page to start the unmapping
+ * @npages: how many pages to unmap
+ */
+void drm_svm_hmmptr_unmap_dma_pages(struct drm_hmmptr *hmmptr, u64 page_idx, u64 npages)
+{
+	u64 tpages = __npages_in_range(__hmmptr_start(hmmptr), __hmmptr_end(hmmptr));
+	unsigned long *hmm_pfn = hmmptr->pfn;
+	struct page *page;
+	u64 i;
+
+	DRM_MM_BUG_ON(page_idx + npages > tpages);
+	for (i = 0; i < npages; i++) {
+		page = hmm_pfn_to_page(hmm_pfn[i + page_idx]);
+		if (!page)
+			continue;
+
+		if (!is_device_private_page(page))
+			dma_unlink_range(&hmmptr->iova, (i + page_idx) << PAGE_SHIFT);
+	}
+}
+EXPORT_SYMBOL_GPL(drm_svm_hmmptr_unmap_dma_pages);
+
+/**
+ * drm_svm_hmmptr_map_dma_pages() - dma map a section (must be at a page boundary)
+ * of a hmmptr to the iova space
+ *
+ * @hmmptr: hmmptr to dma map
+ * @page_idx: from which page to start the mapping
+ * @npages: how many pages to map
+ */
+void drm_svm_hmmptr_map_dma_pages(struct drm_hmmptr *hmmptr, u64 page_idx, u64 npages)
+{
+	u64 tpages = __npages_in_range(__hmmptr_start(hmmptr), __hmmptr_end(hmmptr));
+	unsigned long *hmm_pfn = hmmptr->pfn;
+	struct drm_gpuva *gpuva = hmmptr->get_gpuva(hmmptr);
+	struct drm_gpuvm *gpuvm = gpuva->vm;
+	struct drm_device *drm = gpuvm->drm;
+	struct page *page;
+	bool range_is_device_pages = false;
+	u64 i;
+
+	DRM_MM_BUG_ON(page_idx + npages > tpages);
+	for (i = page_idx; i < page_idx + npages; i++) {
+		page = hmm_pfn_to_page(hmm_pfn[i]);
+		DRM_MM_BUG_ON(!page);
+		if (i == page_idx)
+			range_is_device_pages = is_device_private_page(page);
+
+		if (range_is_device_pages != is_device_private_page(page))
+			drm_warn_once(drm, "Found mixed system and device pages placement\n");
+
+		if (!is_device_private_page(page))
+			hmmptr->dma_addr[i] = dma_link_range(page, 0, &hmmptr->iova, i << PAGE_SHIFT);
+	}
+}
+EXPORT_SYMBOL_GPL(drm_svm_hmmptr_map_dma_pages);
+
+/**
+ * drm_svm_hmmptr_init() - initialize a hmmptr
+ *
+ * @hmmptr: the hmmptr to initialize
+ * @ops: the mmu interval notifier ops used to invalidate hmmptr
+ */
+int drm_svm_hmmptr_init(struct drm_hmmptr *hmmptr,
+		const struct mmu_interval_notifier_ops *ops)
+{
+	struct drm_gpuva *gpuva = hmmptr->get_gpuva(hmmptr);
+	struct dma_iova_attrs *iova = &hmmptr->iova;
+	struct drm_gpuvm *gpuvm = gpuva->vm;
+	struct drm_device *drm = gpuvm->drm;
+	u64 cpu_va_start = __hmmptr_cpu_start(hmmptr);
+	u64 start = GPUVA_START(gpuva);
+	u64 end = GPUVA_END(gpuva);
+	size_t npages;
+	int ret;
+
+	start = ALIGN_DOWN(start, PAGE_SIZE);
+	end = ALIGN(end, PAGE_SIZE);
+	npages = __npages_in_range(start, end);
+	hmmptr->pfn = kvcalloc(npages, sizeof(*hmmptr->pfn), GFP_KERNEL);
+	if (!hmmptr->pfn)
+		return -ENOMEM;
+
+	hmmptr->dma_addr = kvcalloc(npages, sizeof(*hmmptr->dma_addr), GFP_KERNEL);
+	if (!hmmptr->dma_addr) {
+		ret = -ENOMEM;
+		goto free_pfn;
+	}
+
+	iova->dev = drm->dev;
+	iova->size = end - start;
+	iova->dir = DMA_BIDIRECTIONAL;
+	ret = dma_alloc_iova(iova);
+	if (ret)
+		goto free_dma_addr;
+
+	ret = mmu_interval_notifier_insert(&hmmptr->notifier, current->mm,
+		cpu_va_start, end - start, ops);
+	if (ret)
+		goto free_iova;
+
+	hmmptr->notifier_seq = LONG_MAX;
+	return 0;
+
+free_iova:
+	dma_free_iova(iova);
+free_dma_addr:
+	kvfree(hmmptr->dma_addr);
+free_pfn:
+	kvfree(hmmptr->pfn);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(drm_svm_hmmptr_init);
+
+/**
+ * drm_svm_hmmptr_release() - release a hmmptr
+ *
+ * @hmmptr: the hmmptr to release
+ */
+void drm_svm_hmmptr_release(struct drm_hmmptr *hmmptr)
+{
+	u64 npages = __npages_in_range(__hmmptr_start(hmmptr), __hmmptr_end(hmmptr));
+
+	drm_svm_hmmptr_unmap_dma_pages(hmmptr, 0, npages);
+	mmu_interval_notifier_remove(&hmmptr->notifier);
+	dma_free_iova(&hmmptr->iova);
+	kvfree(hmmptr->pfn);
+	kvfree(hmmptr->dma_addr);
+}
+EXPORT_SYMBOL_GPL(drm_svm_hmmptr_release);
+
+/**
+ * drm_svm_hmmptr_populate() - Populate physical pages of the range of hmmptr
+ *
+ * @hmmptr: hmmptr to populate
+ * @owner: avoid fault for pages owned by owner, only report the current pfn.
+ * @start: start CPU VA of the range
+ * @end: end CPU VA of the range
+ * @write: Populate range for write purpose
+ * @is_mmap_locked: Whether the caller hold mmap lock
+ *
+ * This function populates the physical pages of a hmmptr range. The
+ * populated physical pages are saved in the hmmptr's pfn array.
+ * It is similar to get_user_pages but calls hmm_range_fault.
+ *
+ * There are two usage models of this API:
+ *
+ * 1) use it for legacy userptr code: pass owner as NULL, fault in the
+ * range with system pages
+ *
+ * 2) use it for svm: Usually the caller would first migrate a range to
+ * device pages, then call this function with owner set to the device
+ * pages owner. This way this function won't cause a fault, it only
+ * reports the range's backing pfns which are already in device memory.
+ *
+ * This function also reads the mmu notifier sequence number (through
+ * mmu_interval_read_begin), for the purpose of a later comparison (through
+ * mmu_interval_read_retry). The usage model is: the driver first calls this
+ * function to populate a range of a hmmptr, then calls
+ * mmu_interval_read_retry to see whether it needs to retry before programming
+ * the GPU page table. Since we only populate a sub-range of the whole hmmptr
+ * here, even if the recorded hmmptr->notifier_seq equals the notifier's
+ * current sequence number, it doesn't mean the whole hmmptr is up to date.
+ * The driver is *required* to always call this function before checking for a retry.
+ *
+ * This must be called with the mmap read or write lock held.
+ *
+ * Return: 0 on success; negative error code on failure
+ */
+int drm_svm_hmmptr_populate(struct drm_hmmptr *hmmptr, void *owner, u64 start, u64 end,
+            bool write, bool is_mmap_locked)
+{
+	unsigned long timeout =
+		jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
+	struct hmm_range hmm_range;
+	struct mm_struct *mm = hmmptr->notifier.mm;
+	int pfn_index, npages;
+	int ret;
+
+	DRM_MM_BUG_ON(start < __hmmptr_cpu_start(hmmptr));
+	DRM_MM_BUG_ON(end > __hmmptr_cpu_end(hmmptr));
+	if (!PAGE_ALIGNED(start) || !PAGE_ALIGNED(end))
+		pr_warn("drm svm populate unaligned range [%llx~%llx)\n", start, end);
+
+	if (is_mmap_locked)
+		mmap_assert_locked(mm);
+
+	if (!mmget_not_zero(mm))
+		return -EFAULT;
+
+	hmm_range.notifier = &hmmptr->notifier;
+	hmm_range.start = start;
+	hmm_range.end = end;
+	npages = __npages_in_range(start, end);
+	pfn_index = (start - __hmmptr_cpu_start(hmmptr)) >> PAGE_SHIFT;
+	hmm_range.hmm_pfns = hmmptr->pfn + pfn_index;
+	hmm_range.default_flags = HMM_PFN_REQ_FAULT;
+	if (write)
+		hmm_range.default_flags |= HMM_PFN_REQ_WRITE;
+	hmm_range.dev_private_owner = owner;
+
+	while (true) {
+		hmm_range.notifier_seq = mmu_interval_read_begin(&hmmptr->notifier);
+
+		if (!is_mmap_locked)
+			mmap_read_lock(mm);
+
+		ret = hmm_range_fault(&hmm_range);
+
+		if (!is_mmap_locked)
+			mmap_read_unlock(mm);
+
+		if (ret == -EBUSY) {
+			if (time_after(jiffies, timeout))
+				break;
+
+			continue;
+		}
+		break;
+	}
+
+	mmput(mm);
+
+	if (ret)
+		return ret;
+
+	__mark_range_accessed(hmm_range.hmm_pfns, npages, write);
+	hmmptr->notifier_seq = hmm_range.notifier_seq;
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(drm_svm_hmmptr_populate);
diff --git a/include/drm/drm_svm.h b/include/drm/drm_svm.h
index a383c7251e2b..d443f20b5510 100644
--- a/include/drm/drm_svm.h
+++ b/include/drm/drm_svm.h
@@ -7,9 +7,13 @@
 #define _DRM_SVM__
 
 #include <linux/compiler_types.h>
+#include <linux/mmu_notifier.h>
+#include <linux/dma-mapping.h>
 #include <linux/memremap.h>
+#include <drm/drm_gpuvm.h>
 #include <linux/types.h>
 
+
 struct dma_fence;
 struct drm_mem_region;
 
@@ -159,4 +163,62 @@ static inline u64 drm_mem_region_page_to_dpa(struct drm_mem_region *mr, struct p
 
 	return dpa;
 }
+
+/**
+ * struct drm_hmmptr- hmmptr pointer
+ *
+ * A hmmptr is a pointer in a CPU program that can be access by GPU program
+ * also, like a userptr. but unlike a userptr, a hmmptr can also be migrated
+ * to device local memory. The other way to look at is, userptr is a special
+ * hmmptr without the capability of migration - userptr's backing store is
+ * always in system memory.
+ *
+ * A hmmptr can have mixed backing pages in system and GPU vram.
+ *
+ * hmmptr is supposed to be embedded in driver's GPU virtual range management
+ * struct such as xe_vma etc. hmmptr itself doesn't have a range. hmmptr
+ * depends on driver's data structure (such as xe_vma) to live in a gpuvm's
+ * process space and RB-tree.
+ *
+ * With hmmptr concept, SVM and traditional userptr can share codes around
+ * mmu notifier, backing store population etc.
+ *
+ * This is built on top of kernel HMM infrastructure thus is called hmmptr.
+ */
+struct drm_hmmptr {
+	/**
+	 * @notifier: MMU notifier for hmmptr
+	 */
+	struct mmu_interval_notifier notifier;
+	/** @notifier_seq: notifier sequence number */
+	unsigned long notifier_seq;
+	/**
+	 * @pfn: An array of pfn used for page population
+	 * Note this is hmm_pfn, not normal core mm pfn
+	 */
+	unsigned long *pfn;
+	/**
+	 * @dma_addr: An array to hold the dma mapped address
+	 * of each page, only used when page is in sram.
+	 */
+	dma_addr_t *dma_addr;
+	/**
+	 * @iova: iova hold the dma-address of this hmmptr.
+	 * iova is only used when the backing pages are in sram.
+	 */
+	struct dma_iova_attrs iova;
+	/**
+	 * @get_gpuva: callback function to get gpuva of this hmmptr
+	 * FIXME: Probably have direct gpuva member in hmmptr
+	 */
+	struct drm_gpuva * (*get_gpuva) (struct drm_hmmptr *hmmptr);
+};
+
+int drm_svm_hmmptr_init(struct drm_hmmptr *hmmptr,
+		const struct mmu_interval_notifier_ops *ops);
+void drm_svm_hmmptr_release(struct drm_hmmptr *hmmptr);
+void drm_svm_hmmptr_map_dma_pages(struct drm_hmmptr *hmmptr, u64 page_idx, u64 npages);
+void drm_svm_hmmptr_unmap_dma_pages(struct drm_hmmptr *hmmptr, u64 page_idx, u64 npages);
+int drm_svm_hmmptr_populate(struct drm_hmmptr *hmmptr, void *owner, u64 start, u64 end,
+            bool write, bool is_mmap_locked);
 #endif
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [CI 11/42] drm/svm: Introduce helper to remap drm memory region
  2024-06-13  4:23 [CI 01/42] mm/hmm: let users to tag specific PFNs Oak Zeng
                   ` (9 preceding siblings ...)
  2024-06-13  4:23 ` [CI 10/42] drm/svm: introduce hmmptr and helper functions Oak Zeng
@ 2024-06-13  4:23 ` Oak Zeng
  2024-06-13  4:23 ` [CI 12/42] drm/svm: handle CPU page fault Oak Zeng
                   ` (30 subsequent siblings)
  41 siblings, 0 replies; 44+ messages in thread
From: Oak Zeng @ 2024-06-13  4:23 UTC (permalink / raw)
  To: intel-xe

Add a helper function drm_svm_register_mem_region to remap GPU vram
using devm_memremap_pages, so each GPU vram page is backed by a
struct page.

Those struct pages are created to allow hmm to migrate buffers b/t
GPU vram and CPU system memory using the existing Linux migration
mechanism (i.e., migrating b/t CPU system memory and hard disk).

This is preparatory work to enable svm (shared virtual memory) through
the Linux kernel hmm framework. The memory remap's page map type is set
to MEMORY_DEVICE_PRIVATE for now. This means that even though each GPU
vram page gets a struct page and can be mapped in the CPU page table,
such pages are treated as the GPU's private resource, so the CPU can't
access them. If the CPU accesses such a page, a page fault is triggered
and the page will be migrated to system memory.

For GPU devices which support a coherent memory protocol b/t CPU and
GPU (such as the CXL and CAPI protocols), we can remap device memory as
MEMORY_DEVICE_COHERENT. This is TBD.

v1: Support a memory type interface for register_mem_region (Himal)
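
For reference, a minimal driver-side usage sketch (the xe-style names
below are hypothetical and only for illustration; the mr_ops callbacks
are assumed to be filled in by the driver beforehand):

static int xe_svm_init_mem_region(struct xe_device *xe, struct xe_tile *tile)
{
	struct drm_mem_region *mr = &tile->mem.vram.mr;		/* hypothetical field */

	mr->usable_size = tile->mem.vram.usable_size;		/* hypothetical field */
	mr->dpa_base = tile->mem.vram.dpa_base;			/* hypothetical field */
	/* mr->mr_ops (free_page, pagemap_owner, migrate, ...) set by the driver */

	/* remap this tile's vram so every vram page gets a struct page */
	return drm_svm_register_mem_region(&xe->drm, mr, MEMORY_DEVICE_PRIVATE);
}

On success, mr->hpa_base holds the start of the remapped host physical
address range.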

Cc: Daniel Vetter <daniel.vetter@intel.com>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Leon Romanovsky <leonro@nvidia.com>
Cc: Brian Welty <brian.welty@intel.com>
Cc: <dri-devel@lists.freedesktop.org>
Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
---
 drivers/gpu/drm/drm_svm.c | 56 +++++++++++++++++++++++++++++++++++++++
 include/drm/drm_svm.h     |  3 +++
 2 files changed, 59 insertions(+)

diff --git a/drivers/gpu/drm/drm_svm.c b/drivers/gpu/drm/drm_svm.c
index 9a164615f866..741874689e32 100644
--- a/drivers/gpu/drm/drm_svm.c
+++ b/drivers/gpu/drm/drm_svm.c
@@ -13,6 +13,7 @@
 #include <linux/swap.h>
 #include <linux/bug.h>
 #include <linux/hmm.h>
+#include <linux/pci.h>
 #include <linux/mm.h>
 
 static u64 __npages_in_range(unsigned long start, unsigned long end)
@@ -314,3 +315,58 @@ int drm_svm_hmmptr_populate(struct drm_hmmptr *hmmptr, void *owner, u64 start, u
 	return ret;
 }
 EXPORT_SYMBOL_GPL(drm_svm_hmmptr_populate);
+
+static struct dev_pagemap_ops drm_devm_pagemap_ops;
+
+/**
+ * drm_svm_register_mem_region() - Remap and provide memmap backing for device memory
+ * @drm: drm device that wants to register a memory region
+ * @mr: memory region to register
+ * @type: ZONE_DEVICE memory type
+ *
+ * This remaps device memory into the host physical address space and creates
+ * struct pages to back the device memory.
+ *
+ * Return: 0 on success, standard error code otherwise
+ */
+int drm_svm_register_mem_region(const struct drm_device *drm,
+		struct drm_mem_region *mr, enum memory_type type)
+{
+	struct device *dev = &to_pci_dev(drm->dev)->dev;
+	struct resource *res;
+	void *addr;
+	int ret;
+
+	/**FIXME: support MEMORY_DEVICE_COHERENT in the future*/
+	if (type != MEMORY_DEVICE_PRIVATE)
+		return -EINVAL;
+
+	res = devm_request_free_mem_region(dev, &iomem_resource,
+					   mr->usable_size);
+	if (IS_ERR(res)) {
+		ret = PTR_ERR(res);
+		return ret;
+	}
+
+	drm_devm_pagemap_ops.page_free = mr->mr_ops.drm_mem_region_free_page;
+	mr->pagemap.type = type;
+	mr->pagemap.range.start = res->start;
+	mr->pagemap.range.end = res->end;
+	mr->pagemap.nr_range = 1;
+	mr->pagemap.ops = &drm_devm_pagemap_ops;
+	mr->pagemap.owner = mr->mr_ops.drm_mem_region_pagemap_owner(mr);
+	addr = devm_memremap_pages(dev, &mr->pagemap);
+	if (IS_ERR(addr)) {
+		devm_release_mem_region(dev, res->start, resource_size(res));
+		ret = PTR_ERR(addr);
+		drm_err(drm, "Failed to remap memory region %p, errno %d\n",
+				mr, ret);
+		return ret;
+	}
+	mr->hpa_base = res->start;
+
+	drm_info(drm, "Registered device memory [%llx-%llx] to devm, remapped to %pr\n",
+			mr->dpa_base, mr->dpa_base + mr->usable_size, res);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(drm_svm_register_mem_region);
diff --git a/include/drm/drm_svm.h b/include/drm/drm_svm.h
index d443f20b5510..04552cc1c67f 100644
--- a/include/drm/drm_svm.h
+++ b/include/drm/drm_svm.h
@@ -164,6 +164,9 @@ static inline u64 drm_mem_region_page_to_dpa(struct drm_mem_region *mr, struct p
 	return dpa;
 }
 
+int drm_svm_register_mem_region(const struct drm_device *drm,
+		struct drm_mem_region *mr, enum memory_type type);
+
 /**
  * struct drm_hmmptr- hmmptr pointer
  *
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [CI 12/42] drm/svm: handle CPU page fault
  2024-06-13  4:23 [CI 01/42] mm/hmm: let users to tag specific PFNs Oak Zeng
                   ` (10 preceding siblings ...)
  2024-06-13  4:23 ` [CI 11/42] drm/svm: Introduce helper to remap drm memory region Oak Zeng
@ 2024-06-13  4:23 ` Oak Zeng
  2024-06-13  4:24 ` [CI 13/42] drm/svm: Migrate a range of hmmptr to vram Oak Zeng
                   ` (29 subsequent siblings)
  41 siblings, 0 replies; 44+ messages in thread
From: Oak Zeng @ 2024-06-13  4:23 UTC (permalink / raw)
  To: intel-xe

Under SVM, the CPU and GPU programs share the same virtual address
space. The backing store of this virtual address space can be either
in system memory or device memory. Since GPU device memory is
remapped as DEVICE_PRIVATE, the CPU can't access it. Any CPU access
to device memory causes a page fault. Implement a page fault handler
to migrate memory back to system memory and map it to the CPU page
table so the CPU program can proceed.
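
The DRM layer resolves the fault into a source and a destination
migrate_vec and hands them to the memory region's migrate callback. A
rough sketch of such a driver callback (the copy helper is made up;
only the migrate_vec fields and the callback's role come from this
series):

static struct dma_fence *xe_mem_region_migrate(struct migrate_vec *src,
					       struct migrate_vec *dst)
{
	struct dma_fence *fence = NULL;
	u64 i;

	/* a vec with mr set is vram (addr_vec[].dpa), otherwise it is
	 * sram (addr_vec[].dma_addr) */
	for (i = 0; i < min_t(u64, src->npages, dst->npages); i++) {
		u64 from = src->mr ? src->addr_vec[i].dpa : src->addr_vec[i].dma_addr;
		u64 to = dst->mr ? dst->addr_vec[i].dpa : dst->addr_vec[i].dma_addr;

		/* xe_migrate_copy_page() is a hypothetical blitter helper */
		fence = xe_migrate_copy_page(from, to, PAGE_SIZE, fence);
	}

	return fence;	/* the DRM core waits on this before migrate_vma_pages() */
}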

Cc: Daniel Vetter <daniel.vetter@intel.com>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Leon Romanovsky <leonro@nvidia.com>
Cc: Brian Welty <brian.welty@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Cc: <dri-devel@lists.freedesktop.org>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/drm_svm.c | 272 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 271 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/drm_svm.c b/drivers/gpu/drm/drm_svm.c
index 741874689e32..80c5b7146ba6 100644
--- a/drivers/gpu/drm/drm_svm.c
+++ b/drivers/gpu/drm/drm_svm.c
@@ -8,6 +8,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/dma-mapping.h>
 #include <linux/memremap.h>
+#include <linux/migrate.h>
 #include <drm/drm_gem_dma_helper.h>
 #include <drm/drm_svm.h>
 #include <linux/swap.h>
@@ -316,7 +317,276 @@ int drm_svm_hmmptr_populate(struct drm_hmmptr *hmmptr, void *owner, u64 start, u
 }
 EXPORT_SYMBOL_GPL(drm_svm_hmmptr_populate);
 
-static struct dev_pagemap_ops drm_devm_pagemap_ops;
+static void __drm_svm_free_pages(unsigned long *mpfn, unsigned long npages)
+{
+	struct page *page;
+	int j;
+
+	for (j = 0; j < npages; j++) {
+		page = migrate_pfn_to_page(mpfn[j]);
+		mpfn[j] = 0;
+		if (page) {
+			unlock_page(page);
+			put_page(page);
+		}
+	}
+}
+
+/**
+ * __drm_svm_alloc_host_pages() - allocate host pages for the fault vma
+ *
+ * @vma: the fault vma that we need to allocate pages for
+ * @addr: the virtual address that we allocate pages for
+ * @mpfn: used to output the migration pfns of the allocated pages
+ * @npages: number of pages to allocate
+ *
+ * This function allocates host pages for a specified vma.
+ *
+ * When this function returns, the pages are locked.
+ *
+ * Return: 0 on success,
+ * error code otherwise
+ */
+static int __drm_svm_alloc_host_pages(struct vm_area_struct *vma,
+					unsigned long addr,
+					u64 npages,
+					unsigned long *mpfn)
+{
+	struct page *page;
+	int i;
+
+	for (i = 0; i < npages; i++) {
+		page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+		if (unlikely(!page))
+			goto free_allocated;
+
+		/**Lock page per hmm requirement, see hmm.rst*/
+		lock_page(page);
+
+		mpfn[i] = migrate_pfn(page_to_pfn(page));
+		addr += PAGE_SIZE;
+	}
+	return 0;
+
+free_allocated:
+	__drm_svm_free_pages(mpfn, i);
+	return -ENOMEM;
+}
+
+static struct migrate_vec *__generate_migrate_vec_vram(unsigned long *mpfn, bool is_migrate_src, unsigned long npages)
+{
+	struct migrate_vec *vec;
+	int size = sizeof(*vec) + sizeof(vec->addr_vec[1]) * (npages - 1);
+	struct drm_mem_region *mr;
+	struct page *page;
+	u64 dpa;
+	int i, j;
+
+	BUG_ON(is_migrate_src && npages != 1);
+	vec = kzalloc(size, GFP_KERNEL);
+	if (!vec)
+		return NULL;
+
+	mr = drm_page_to_mem_region(migrate_pfn_to_page(mpfn[0]));
+	for (i = 0, j = 0; i < npages; i++) {
+		/**
+		 * We only migrate one page from vram to sram on CPU page fault today.
+		 * If this source page is not marked with _MIGRATE flag, something is
+		 * wrong. We need to report error to core mm.
+		 *
+		 * If we move to multiple pages migration, below logic need a revisit,
+		 * as it is fine to only migrate some pages (but not all) as indicated
+		 * by hmm.
+		 */
+		if (is_migrate_src && !(mpfn[i] & MIGRATE_PFN_MIGRATE)) {
+			pr_err("Migrate from vram to sram: MIGRATE_PFN_MIGRATE flag is not set\n");
+			kfree(vec);
+			return NULL;
+		}
+
+		page = migrate_pfn_to_page(mpfn[i]);
+		if (!page || !is_device_private_page(page)) {
+			pr_err("No page or wrong page zone in %s\n", __func__);
+			kfree(vec);
+			return NULL;
+		}
+
+		dpa = drm_mem_region_page_to_dpa(mr, page);
+		vec->addr_vec[j++].dpa = dpa;
+	}
+	vec->mr = mr;
+	vec->npages = j;
+	return vec;
+}
+
+static struct migrate_vec *__generate_migrate_vec_sram(struct device *dev, unsigned long *mpfn,
+		bool is_migrate_src, unsigned long npages)
+{
+	enum dma_data_direction dir = is_migrate_src ? DMA_TO_DEVICE : DMA_FROM_DEVICE;
+	struct migrate_vec *vec;
+	int size = sizeof(*vec) + sizeof(vec->addr_vec[1]) * (npages - 1);
+	dma_addr_t dma_addr;
+	struct page *page;
+	int i, j, k;
+
+	page = migrate_pfn_to_page(mpfn[0]);
+	if (unlikely(!page))
+		return NULL;
+
+	vec = kzalloc(size, GFP_KERNEL);
+	if (!vec)
+		return NULL;
+
+	for (i = 0, k = 0; i < npages; i++) {
+		if (is_migrate_src && !(mpfn[i] & MIGRATE_PFN_MIGRATE))
+			continue;
+
+		page = migrate_pfn_to_page(mpfn[i]);
+		if (!page || is_device_private_page(page)) {
+			pr_err("No page or wrong page zone in %s\n", __func__);
+			goto undo_dma_mapping;
+		}
+
+		dma_addr = dma_map_page(dev, page, 0, PAGE_SIZE, dir);
+		if (unlikely(dma_mapping_error(dev, dma_addr)))
+			goto undo_dma_mapping;
+
+		vec->addr_vec[k++].dma_addr = dma_addr;
+	}
+
+	vec->mr = NULL;
+	vec->npages = k;
+	return vec;
+
+undo_dma_mapping:
+	for (j = 0; j < k; j++) {
+		if (vec->addr_vec[j].dma_addr)
+			dma_unmap_page(dev, vec->addr_vec[j].dma_addr, PAGE_SIZE, dir);
+	}
+	kfree(vec);
+	return NULL;
+}
+
+static void __free_migrate_vec_sram(struct device *dev, struct migrate_vec *vec, bool is_migrate_src)
+{
+	enum dma_data_direction dir = is_migrate_src ? DMA_TO_DEVICE : DMA_FROM_DEVICE;
+	int i;
+
+	BUG_ON(vec->mr != NULL);
+
+	for (i = 0; i < vec->npages; i++) {
+		dma_unmap_page(dev, vec->addr_vec[i].dma_addr, PAGE_SIZE, dir);
+		/** No need to free host pages. migrate_vma_finalize take care */
+	}
+	kfree(vec);
+}
+
+static void __free_migrate_vec_vram(struct migrate_vec *vec)
+{
+	BUG_ON(vec->mr == NULL);
+
+	kfree(vec);
+}
+/**
+ * drm_svm_migrate_to_sram() - Migrate memory back to sram on CPU page fault
+ *
+ * @vmf: cpu vm fault structure, contains fault information such as vma etc.
+ *
+ * Note, this is in CPU's vm fault handler, caller holds the mmap read lock.
+ *
+ * This function migrates one page at the fault address. This is the normal
+ * core mm page fault scheme. Linux doesn't aggressively prefault at CPU page
+ * fault time. It only faults in one page to recover the fault address. Even
+ * if we migrate more than one page, core mm still only programs one pte entry
+ * (covering one page). See the logic in function handle_pte_fault. do_swap_page
+ * eventually calls into drm_svm_migrate_to_sram in our case.
+ *
+ * We call migrate_vma_setup to set up the migration. During migrate_vma_setup,
+ * the device page table is invalidated before migration (by calling the
+ * driver-registered mmu notifier).
+ *
+ * We call migrate_vma_finalize to finalize the migration. During
+ * migrate_vma_finalize, device pages of the source buffer are freed (by calling
+ * the memory region's drm_mem_region_free_page callback function).
+ *
+ * Return:
+ * 0 on success
+ * VM_FAULT_SIGBUS: failed to migrate the page to system memory, the application
+ * will be signaled a SIGBUS
+ */
+static vm_fault_t drm_svm_migrate_to_sram(struct vm_fault *vmf)
+{
+	struct drm_mem_region *mr = drm_page_to_mem_region(vmf->page);
+	struct drm_device *drm = mr->mr_ops.drm_mem_region_get_device(mr);
+	unsigned long src_pfn = 0, dst_pfn = 0;
+	struct device *dev = drm->dev;
+	struct vm_area_struct *vma = vmf->vma;
+	struct migrate_vec *src;
+	struct migrate_vec *dst;
+	struct dma_fence *fence;
+	vm_fault_t ret = 0, r;
+
+	struct migrate_vma migrate_vma = {
+		.vma		= vma,
+		.start		= ALIGN_DOWN(vmf->address, PAGE_SIZE),
+		.end		= ALIGN_DOWN(vmf->address, PAGE_SIZE) + PAGE_SIZE,
+		.src		= &src_pfn,
+		.dst		= &dst_pfn,
+		.pgmap_owner	= mr->mr_ops.drm_mem_region_pagemap_owner(mr),
+		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
+		.fault_page = vmf->page,
+	};
+
+	if (migrate_vma_setup(&migrate_vma) < 0)
+		return VM_FAULT_SIGBUS;
+
+	if (!migrate_vma.cpages)
+		return 0;
+
+	r = __drm_svm_alloc_host_pages(vma, migrate_vma.start, 1, migrate_vma.dst);
+	if (r) {
+		ret = VM_FAULT_OOM;
+		goto migrate_finalize;
+	}
+
+	src = __generate_migrate_vec_vram(migrate_vma.src, true, 1);
+	if (!src) {
+		ret = VM_FAULT_SIGBUS;
+		goto free_host_pages;
+	}
+
+	dst = __generate_migrate_vec_sram(dev, migrate_vma.dst, false, 1);
+	if (!dst) {
+		ret = VM_FAULT_SIGBUS;
+		goto free_migrate_src;
+	}
+
+	fence = mr->mr_ops.drm_mem_region_migrate(src, dst);
+	if (IS_ERR(fence)) {
+		ret = VM_FAULT_SIGBUS;
+		goto free_migrate_dst;
+	}
+	dma_fence_wait(fence, false);
+	dma_fence_put(fence);
+
+	migrate_vma_pages(&migrate_vma);
+
+free_migrate_dst:
+	__free_migrate_vec_sram(dev, dst, false);
+free_migrate_src:
+	__free_migrate_vec_vram(src);
+free_host_pages:
+	if (ret)
+		__drm_svm_free_pages(migrate_vma.dst, 1);
+migrate_finalize:
+	if (ret)
+		memset(migrate_vma.dst, 0, sizeof(*migrate_vma.dst));
+	migrate_vma_finalize(&migrate_vma);
+	return ret;
+}
+static struct dev_pagemap_ops drm_devm_pagemap_ops = {
+	.migrate_to_ram = drm_svm_migrate_to_sram,
+};
 
 /**
  * drm_svm_register_mem_region: Remap and provide memmap backing for device memory
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [CI 13/42] drm/svm: Migrate a range of hmmptr to vram
  2024-06-13  4:23 [CI 01/42] mm/hmm: let users to tag specific PFNs Oak Zeng
                   ` (11 preceding siblings ...)
  2024-06-13  4:23 ` [CI 12/42] drm/svm: handle CPU page fault Oak Zeng
@ 2024-06-13  4:24 ` Oak Zeng
  2024-06-13  4:24 ` [CI 14/42] drm/svm: Add DRM SVM documentation Oak Zeng
                   ` (28 subsequent siblings)
  41 siblings, 0 replies; 44+ messages in thread
From: Oak Zeng @ 2024-06-13  4:24 UTC (permalink / raw)
  To: intel-xe

Introduce a helper function drm_svm_migrate_hmmptr_to_vram to migrate
any sub-range of a hmmptr to vram. The range has to be at a page
boundary. This is supposed to be called by the driver to migrate a
hmmptr to vram.
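
A rough sketch of the intended call site in a driver's GPU page fault
handler (the xe-style wrapper below is hypothetical; the two exported
drm_svm_* helpers are the ones added by this series):

static int xe_svm_fault_in_range(struct xe_vm *vm, struct drm_mem_region *mr,
				 struct drm_hmmptr *hmmptr,
				 unsigned long start, unsigned long end, bool write)
{
	struct mm_struct *mm = hmmptr->notifier.mm;
	int err;

	mmap_read_lock(mm);
	/* place the faulting range in vram first (a policy decision) */
	err = drm_svm_migrate_hmmptr_to_vram(&vm->gpuvm, mr, hmmptr, start, end);
	if (err)
		goto out_unlock;	/* or fall back to system memory */

	/* mirror the (now vram backed) range into the hmmptr's pfn array,
	 * without faulting this device's pages back to sram */
	err = drm_svm_hmmptr_populate(hmmptr,
				      mr->mr_ops.drm_mem_region_pagemap_owner(mr),
				      start, end, write, true);
out_unlock:
	mmap_read_unlock(mm);
	return err;
}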

Cc: Daniel Vetter <daniel.vetter@intel.com>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Brian Welty <brian.welty@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Cc: <dri-devel@lists.freedesktop.org>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/drm_svm.c | 121 ++++++++++++++++++++++++++++++++++++++
 include/drm/drm_svm.h     |   3 +
 2 files changed, 124 insertions(+)

diff --git a/drivers/gpu/drm/drm_svm.c b/drivers/gpu/drm/drm_svm.c
index 80c5b7146ba6..d34809b2bad6 100644
--- a/drivers/gpu/drm/drm_svm.c
+++ b/drivers/gpu/drm/drm_svm.c
@@ -640,3 +640,124 @@ int drm_svm_register_mem_region(const struct drm_device *drm,
 	return 0;
 }
 EXPORT_SYMBOL_GPL(drm_svm_register_mem_region);
+
+static void __drm_svm_init_device_pages(unsigned long *pfn, unsigned long npages)
+{
+	struct page *page;
+	int i;
+
+	for (i = 0; i < npages; i++) {
+		page = pfn_to_page(pfn[i]);
+		zone_device_page_init(page);
+		pfn[i] = migrate_pfn(pfn[i]);
+	}
+}
+
+/**
+ * drm_svm_migrate_hmmptr_to_vram() - migrate a sub-range of a hmmptr to vram
+ * Must be called with mmap_read_lock held.
+ *
+ * @vm: the vm that the hmmptr belongs to
+ * @mr: the destination memory region we want to migrate to
+ * @hmmptr: the hmmptr to migrate.
+ * @start: start (CPU virtual address, inclusive) of the range to migrate
+ * @end: end (CPU virtual address, exclusive) of the range to migrate
+ *
+ * Returns: negative errno on failure, 0 on success
+ */
+int drm_svm_migrate_hmmptr_to_vram(struct drm_gpuvm *vm,
+		struct drm_mem_region *mr,
+		struct drm_hmmptr *hmmptr, unsigned long start, unsigned long end)
+{
+	struct drm_device *drm = mr->mr_ops.drm_mem_region_get_device(mr);
+	struct mm_struct *mm = vm->mm;
+	unsigned long npages = __npages_in_range(start, end);
+	struct vm_area_struct *vas;
+	struct migrate_vma migrate = {
+		.start		= ALIGN_DOWN(start, PAGE_SIZE),
+		.end		= ALIGN(end, PAGE_SIZE),
+		.pgmap_owner	= mr->mr_ops.drm_mem_region_pagemap_owner(mr),
+		.flags          = MIGRATE_VMA_SELECT_SYSTEM,
+	};
+	struct device *dev = drm->dev;
+	struct dma_fence *fence;
+	struct migrate_vec *src;
+	struct migrate_vec *dst;
+	int ret = 0;
+	void *buf;
+
+	mmap_assert_locked(mm);
+
+	BUG_ON(start < __hmmptr_cpu_start(hmmptr));
+	BUG_ON(end > __hmmptr_cpu_end(hmmptr));
+
+	vas = find_vma_intersection(mm, start, end);
+	if (!vas)
+		return -ENOENT;
+
+	migrate.vma = vas;
+	buf = kvcalloc(npages, 2 * sizeof(*migrate.src), GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	migrate.src = buf;
+	migrate.dst = migrate.src + npages;
+	ret = migrate_vma_setup(&migrate);
+	if (ret) {
+		drm_warn(drm, "vma setup returned %d for range [0x%lx - 0x%lx]\n",
+				ret, start, end);
+		goto free_buf;
+	}
+
+	/**
+	 * Partial migration is just normal. Print a message for now.
+	 * Once this behavior is verified, delete this warning.
+	 */
+	if (migrate.cpages != npages)
+		drm_warn(drm, "Partial migration for range [0x%lx - 0x%lx], range is %ld pages, migrate only %ld pages\n",
+				start, end, npages, migrate.cpages);
+
+	ret = mr->mr_ops.drm_mem_region_alloc_pages(mr, migrate.cpages, migrate.dst);
+	if (ret)
+		goto migrate_finalize;
+
+	__drm_svm_init_device_pages(migrate.dst, migrate.cpages);
+
+	src = __generate_migrate_vec_sram(dev, migrate.src, true, npages);
+	if (!src) {
+		ret = -EFAULT;
+		goto free_device_pages;
+	}
+
+	dst = __generate_migrate_vec_vram(migrate.dst, false, migrate.cpages);
+	if (!dst) {
+		ret = -EFAULT;
+		goto free_migrate_src;
+	}
+
+	fence = mr->mr_ops.drm_mem_region_migrate(src, dst);
+	if (IS_ERR(fence)) {
+		ret = -EIO;
+		goto free_migrate_dst;
+	}
+	dma_fence_wait(fence, false);
+	dma_fence_put(fence);
+
+	migrate_vma_pages(&migrate);
+
+free_migrate_dst:
+	__free_migrate_vec_vram(dst);
+free_migrate_src:
+	__free_migrate_vec_sram(dev, src, true);
+free_device_pages:
+	if (ret)
+		__drm_svm_free_pages(migrate.dst, migrate.cpages);
+migrate_finalize:
+	if (ret)
+		memset(migrate.dst, 0, sizeof(*migrate.dst)*migrate.cpages);
+	migrate_vma_finalize(&migrate);
+free_buf:
+	kvfree(buf);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(drm_svm_migrate_hmmptr_to_vram);
diff --git a/include/drm/drm_svm.h b/include/drm/drm_svm.h
index 04552cc1c67f..1cbf83427f2a 100644
--- a/include/drm/drm_svm.h
+++ b/include/drm/drm_svm.h
@@ -224,4 +224,7 @@ void drm_svm_hmmptr_map_dma_pages(struct drm_hmmptr *hmmptr, u64 page_idx, u64 n
 void drm_svm_hmmptr_unmap_dma_pages(struct drm_hmmptr *hmmptr, u64 page_idx, u64 npages);
 int drm_svm_hmmptr_populate(struct drm_hmmptr *hmmptr, void *owner, u64 start, u64 end,
             bool write, bool is_mmap_locked);
+int drm_svm_migrate_hmmptr_to_vram(struct drm_gpuvm *vm,
+		struct drm_mem_region *mr,
+		struct drm_hmmptr *hmmptr, unsigned long start, unsigned long end);
 #endif
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [CI 14/42] drm/svm: Add DRM SVM documentation
  2024-06-13  4:23 [CI 01/42] mm/hmm: let users to tag specific PFNs Oak Zeng
                   ` (12 preceding siblings ...)
  2024-06-13  4:24 ` [CI 13/42] drm/svm: Migrate a range of hmmptr to vram Oak Zeng
@ 2024-06-13  4:24 ` Oak Zeng
  2024-06-13  4:24 ` [CI 15/42] drm/xe: s/xe_tile_migrate_engine/xe_tile_migrate_exec_queue Oak Zeng
                   ` (27 subsequent siblings)
  41 siblings, 0 replies; 44+ messages in thread
From: Oak Zeng @ 2024-06-13  4:24 UTC (permalink / raw)
  To: intel-xe

The purpose of the DRM SVM design is to provide helpers and
infrastructure to facilitate driver implementation of SVM (shared
virtual memory) functionality.

The overview part describes the helpers that the DRM layer provides and
the virtual function interfaces that a driver has to implement if it
uses the drm SVM layer to implement SVM functionality.

Unlike the legacy DRM BO (buffer object) based physical memory
management, DRM SVM uses completely page-centric memory management.
This has a fundamental impact on the memory allocation, free, eviction
and migration logic. It also requires the driver to support range based
(where ranges are required to be at page boundaries) GPU page table
update and invalidation. The page centric design section describes this
design.

Lock design is one of the keys to the design, and is described in the
lock design section.

Other sections describe the memory attributes API design and the memory
eviction design.

Create a DRM SVM section in drm-mm.rst and generate the SVM documents
under this document section.
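
The lock design section also documents the invalidation side of the
protocol. As a rough sketch, a driver's hmmptr mmu interval notifier
callback is expected to look roughly like this (the page_table_lock,
the vm lookup and the GPU zap helper are hypothetical driver objects;
the mmu_interval_*/mmu_notifier_* calls are the core mm API):

static bool xe_hmmptr_invalidate(struct mmu_interval_notifier *mni,
				 const struct mmu_notifier_range *range,
				 unsigned long cur_seq)
{
	struct drm_hmmptr *hmmptr = container_of(mni, struct drm_hmmptr, notifier);
	struct xe_vm *vm = xe_hmmptr_to_vm(hmmptr);	/* hypothetical lookup */

	if (!mmu_notifier_range_blockable(range))
		return false;

	down_write(&vm->page_table_lock);		/* hypothetical per-vm lock */
	mmu_interval_set_seq(mni, cur_seq);
	/* zap the GPU PTEs covering [range->start, range->end) */
	xe_invalidate_gpu_range(hmmptr, range->start, range->end);
	up_write(&vm->page_table_lock);

	return true;
}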

Cc: Daniel Vetter <daniel.vetter@intel.com>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Brian Welty <brian.welty@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Cc: <dri-devel@lists.freedesktop.org>
Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 Documentation/gpu/drm-mm.rst |  42 +++++
 drivers/gpu/drm/drm_svm.c    | 309 +++++++++++++++++++++++++++++++++++
 2 files changed, 351 insertions(+)

diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
index d55751cad67c..fa968abb9387 100644
--- a/Documentation/gpu/drm-mm.rst
+++ b/Documentation/gpu/drm-mm.rst
@@ -573,3 +573,45 @@ Scheduler Function References
 
 .. kernel-doc:: drivers/gpu/drm/scheduler/sched_entity.c
    :export:
+
+DRM SVM
+=======
+
+Overview
+--------
+
+.. kernel-doc:: drivers/gpu/drm/drm_svm.c
+   :doc: Overview
+
+Page centric design
+-------------------
+
+.. kernel-doc:: drivers/gpu/drm/drm_svm.c
+   :doc: page centric design
+
+Lock design
+-----------
+
+.. kernel-doc:: drivers/gpu/drm/drm_svm.c
+   :doc: lock design
+
+Memory hints
+------------
+
+.. kernel-doc:: drivers/gpu/drm/drm_svm.c
+   :doc: Memory hints
+
+Memory eviction
+---------------
+
+.. kernel-doc:: drivers/gpu/drm/drm_svm.c
+   :doc: Page granularity eviction
+
+Function References
+-----------------------------
+
+.. kernel-doc:: include/drm/drm_svm.h
+   :internal:
+
+.. kernel-doc:: drivers/gpu/drm/drm_svm.c
+   :export:
diff --git a/drivers/gpu/drm/drm_svm.c b/drivers/gpu/drm/drm_svm.c
index d34809b2bad6..967e7addae54 100644
--- a/drivers/gpu/drm/drm_svm.c
+++ b/drivers/gpu/drm/drm_svm.c
@@ -17,6 +17,315 @@
 #include <linux/pci.h>
 #include <linux/mm.h>
 
+/**
+ * DOC: Overview
+ *
+ * Shared Virtual Memory (SVM) allows the programmer to use a shared virtual
+ * address space between threads executing on CPUs and GPUs. It abstracts
+ * away from the user the location of the backing memory, and hence simplifies
+ * the user programming model. In a non-SVM memory model, the user needs to
+ * explicitly decide memory placement, such as device or system memory, and
+ * also needs to explicitly migrate memory between device and system memory.
+ * Under SVM, the KMD takes care of memory migration implicitly.
+ *
+ * SVM makes use of the default OS memory allocation and mapping interfaces
+ * such as malloc() and mmap(). The pointer returned from malloc() and mmap()
+ * can be used directly in both CPU and GPU programs. Any CPU program local
+ * variables or global variables can also be used by the GPU program.
+ *
+ * SVM also provides API to set virtual address range based memory attributes
+ * such as preferred memory location, memory migration granularity, and memory
+ * atomic attributes etc. This is similar to Linux madvise API.
+ *
+ * DRM SVM implementation is based on Linux kernel Heterogeneous Memory Management
+ * (HMM) framework. HMM provides address space mirroring and data migration helpers.
+ *
+ * This is the DRM layer abstraction of the SVM implementation. The target of this
+ * abstraction is to provide common SVM functions which can be shared between all
+ * DRM drivers. The key functionality of the DRM layer is to provide an
+ * infrastructure for drivers to achieve process space mirroring and data migration.
+ *
+ * The DRM abstraction is based on two concepts:
+ *
+ * 1) DRM memory region: represents physical memory on GPU devices. Memory region
+ * has operations(callback functions) such as allocate/free memory, memory copy etc.
+ *
+ * 2) hmmptr: similar to userptr with the exception that a hmmptr can be migrated
+ * b/t system memory and device memories. Currently hmmptr is a very thin layer
+ * dealing with mmu notifier register/unregister and page population. The hmmptr
+ * invalidation and gpu page table update are left to the driver.
+ *
+ * See more details of those concepts in the kernel DOC in drm_svm.h.
+ *
+ * The DRM SVM layer provides helper functions for driver writer to:
+ *
+ * 1) register a DRM memory region, along with driver callback functions to
+ * allocate/free device memory, migrate memory b/t system memory and GPU device memory
+ * etc.
+ *
+ * 2) create a :c:type:`drm_hmmptr`: Create a hmmptr and register mmu interval notifier
+ * for this hmmptr.
+ *
+ * 3) populate a :c:type:`drm_hmmptr`: Populate the CPU page table, for the purpose of
+ * address space mirroring.
+ *
+ * 4) migrate a range of :c:type:`drm_hmmptr` to GPU device memory
+ *
+ * Note migration of hmmptr (or partial of hmmptr) is DRM internal, triggered by CPU
+ * page fault. This is not exposed to driver.
+ *
+ * With the above DRM facilities, it is very simple for vendors to implement an
+ * SVM system allocator driver:
+ *
+ * 1) Implement GPU device memory allocation and free callback functions
+ *
+ * 2) Implement a vendor specific data migration callback function using GPU HW such as DMA
+ *
+ * 3) Implement a vendor specific hmmptr GPU page table invalidation callback function
+ *
+ * 4) On device initialization, register GPU device memory to DRM using the DRM memory
+ * region registration helper
+ *
+ * 5) In the GPU page fault handler, call drm helpers to create a hmmptr if necessary, migrate
+ * range of hmmptr to GPU device memory per migration policy, and populate range of hmmptr
+ * to program/mirror the hmmptr range in GPU page table. Resume GPU HW execution.
+ *
+ *
+ * There are 3 events which can trigger the SVM subsystem into action:
+ *
+ * 1. A mmu notifier callback
+ *
+ * Since SVM needs to mirror the program's CPU virtual address space on the GPU
+ * side, when the program's CPU address space changes, SVM needs to make an
+ * identical change on the GPU side. SVM/hmm use a mmu interval notifier to
+ * achieve this. SVM registers a mmu interval notifier callback function with
+ * core mm, and whenever the CPU side virtual address space changes (i.e., when
+ * a virtual address range is unmapped by the CPU calling munmap), the registered
+ * callback function is called from core mm. SVM then mirrors the change on the
+ * GPU side, i.e., unmaps or invalidates the virtual address range from the GPU page table.
+ *
+ * This part of the work is mainly left to the driver. The driver needs to implement
+ * the GPU page table invalidation function. The invalidation has to be virtual address
+ * range based, instead of invalidating the whole hmmptr on each callback.
+ *
+ * 2. A GPU page fault
+ *
+ * At the very beginning of a process's life, no virtual address of the process
+ * is mapped in the GPU page table. So when the GPU accesses any virtual address of
+ * the process, a GPU page fault is triggered. SVM then decides the best memory
+ * location for the fault address (mainly from a performance consideration, sometimes
+ * also from a correctness requirement such as whether the GPU can perform atomic
+ * operations on a certain memory location), migrates memory if necessary, and maps
+ * the fault address in the GPU page table.
+ *
+ * This part of the work is also mainly left to the driver implementor for flexibility.
+ * The driver can call the DRM SVM layer helper to migrate a sub-range of a hmmptr to
+ * device memory, and call the DRM helper to populate the physical pages of a hmmptr.
+ *
+ * 3. A CPU page fault
+ *
+ * A CPU page fault is usually managed by Linux core mm. But in a mixed CPU and
+ * GPU programming environment, the backing store of a virtual address range
+ * can be in the GPU's local memory which is not visible to the CPU
+ * (DEVICE_PRIVATE), so the CPU page fault handler needs to migrate such pages
+ * to system memory for the CPU to be able to access them. Such memory migration
+ * is device specific. HMM has a callback function (the migrate_to_ram function
+ * of the dev_pagemap_ops) for the device driver to implement.
+ *
+ * This part of the work is mainly done at the DRM abstraction layer. The DRM
+ * layer calls the vendor-specific memory region data migration callback
+ * function to migrate data from GPU memory to system memory, if needed. The
+ * whole CPU page fault processing is transparent to the driver implementation.
+ *
+ */
+
+/**
+ * DOC: page centric design
+ *
+ * Unlike the non-SVM memory allocators (such as gem_create, vm_bind etc.), there
+ * is no buffer object (BO, such as struct ttm_buffer_object or struct drm_gem_object)
+ * in the DRM SVM design. We deliberately chose this implementation option to achieve
+ * true page granularity memory placement, allocation, free, validation, eviction and
+ * migration. In a BO centric world, all of the above operate at buffer object
+ * granularity, e.g., you can't evict or migrate part of a BO.
+ *
+ * The drm buddy allocator essentially works at buddy block granularity, e.g., the user
+ * can't return a few pages back to the buddy allocator (so other users can use the
+ * freed pages) until the whole buddy block is unused. The minimal free size is a buddy block.
+ *
+ * To achieve this goal, a drm page cache layer is introduced. The DRM page cache
+ * layer provides an interface for users to allocate and free memory at page
+ * granularity. The DRM page cache layer allocates memory from the DRM buddy if it
+ * can't meet the user's memory allocation request with the cached pages. The DRM
+ * page cache frees a buddy block when all the pages in a block are freed by users.
+ *
+ * Both TTM-based drivers and SVM are now required to allocate memory from the DRM
+ * page cache layer instead of directly from the DRM buddy. This is necessary to
+ * achieve the goal because, in the direct DRM buddy scheme, pages freed by a TTM-based
+ * driver and by SVM are kept locally until the whole DRM buddy block is freed. So SVM
+ * can't use TTM-freed pages until the whole DRM buddy block is freed, and vice versa.
+ *
+ * To support a true page granularity design, SVM subsystem implemented page granularity
+ * migration, eviction, page table update and invalidation.
+ *
+ * A legacy driver can stay with the BO-based scheme, but a lot of driver internal
+ * interfaces such as the page table code are modified to support the page granularity
+ * scheme so those interfaces can be shared between SVM and a BO-based driver.
+ *
+ * A TTM-based driver can still free memory at a coarse granularity, e.g., only when
+ * the whole BO is finished, while the system allocator applies the page granularity
+ * memory free.
+ *
+ * Currently the page cache layer tracks the usage of each page using a simple bitmap
+ * scheme. When all pages in a block are not used anymore, it returns this block back
+ * to the drm buddy subsystem. This scheme can be fine-tuned.
+ *
+ * See more in :doc: Page granularity eviction
+ *
+ */
+
+/**
+ * DOC: lock design
+ *
+ * Since under SVM the CPU program and the GPU program share one virtual
+ * address space, we need a rather complex lock design to avoid race conditions.
+ * For example, while the GPU is accessing a virtual address range, the same range
+ * could be munmapped from the CPU side, or the backing store of the same range
+ * could be moved to another physical place due to memory pressure. The lock
+ * design essentially needs to serialize the GPU page table invalidation, data
+ * migration, CPU page table population and GPU page table re-validation.
+ *
+ * There are multiple locks involved in this scheme:
+ *
+ * 1. mm mmap_lock
+ * mm's mmap_lock is a read-write semaphore used to protect mm's address
+ * space. Linux core mm holds this lock whenever it needs to change a
+ * process's memory mapping, for example, during a user munmap, or
+ * during a page migration triggered by kernel NUMA balancing. The MMU
+ * interval notifier callback function is called with the mmap_lock held.
+ *
+ * Driver is required to hold mmap_read_lock when it calls the DRM hmmptr
+ * population helper to populate CPU page tables.
+ *
+ * 2. driver's gpuvm dma-resv
+ * The driver's dma reservation object is used to protect GPU page table updates.
+ * This is usually per GPUVM based, e.g., for xekmd this is xe_vm's dma-resv.
+ * For BO type vma, dma resv is enough for page table update. For userptr
+ * and hmmptr, besides dma resv, we need an extra page_table_lock to avoid
+ * page table update collision with userptr invalidation. See below.
+ *
+ * 3. driver's page_table_lock
+ * page_table_lock is used to protect userptr/hmmptr GPU page table update,
+ * to avoid an update collision with userptr invalidation. So page_table_lock
+ * is required in the userptr invalidate callback function. Notifier_lock
+ * is the "user_lock" in the documentation of mmu_interval_read_begin(),
+ * which is a hard requirement per mmu notifier design. In xekmd design,
+ * this lock is xe_vm::userptr::notifier_lock.
+ *
+ * Note currently the page_table_lock is a coarse grain lock design which
+ * is whole GPU VM based. This coarse grain design can be fine-tuned to
+ * a per page table based fine grain lock, similar to the core mm split
+ * page table lock.
+ *
+ * Lock order
+ * Acquiring locks in the same order avoids deadlocks. The locking
+ * order of the above locks is:
+ *
+ * mmap_lock => gpuvm::dma-resv => page_table_lock
+ *
+ * This lock order is a hard requirement per the existing code. For example,
+ * the mmap_lock and dma-resv order is explicitly defined in dma_resv_lockdep().
+ *
+ *
+ * Use case, pseudo code:
+ *
+ * Take the concurrent GPU page fault handler and user munmap as an example,
+ * the locking code looks broadly like below:
+ *
+ * In GPU page fault handler:                                                                           munmap from user:
+ *
+ * again:
+ * Mmap_read_lock()                                                                                     mmap_write_lock()
+ * call drm_svm_migrate_hmmptr_to_vram to migrate hmmptr if needed                                      mmu_notifier_invalidate_range_start(), call hmmptr->invalidate()
+ * hmmptr.notifier_seq = mmu_interval_read_begin(&hmmptr.notifier)                                          down_write(page_table_lock);
+ * call drm_svm_populate_hmmptr to populate hmmptr                                                          mmu_interval_set_seq();
+ * Mmap_read_unlock                                                                                         invalidate hmmptr from GPU page table
+ *                                                                                                          up_write(page_table_lock);
+ * dma_resv_lock()                                                                                      mmu_notifier_invalidate_range_end()
+ * down_read(page_table_lock);                                                                            mmap_write_unlock();
+ * if (mmu_interval_read_retry(&hmmptr.notifier, hmmptr.notifier_seq)) {
+ *     //collision with userptr invalidation, retry
+ *     up_read(page_table_lock);
+ *     dma_resv_unlock()
+ *     goto again;
+ * }
+ *
+ * update_gpu_page_table()
+ * up_read(page_table_lock);
+ * dma_resv_unlock()
+ *
+ *
+ * In the above code, we hold the mmap_read_lock to populate the hmmptr because
+ * we need to walk the CPU page table. Since munmap holds the mmap_write_lock,
+ * the hmmptr population can't collide with munmap.
+ *
+ * After that, we update the GPU page table with the hmmptr pfns/pages populated
+ * above. We hold both the dma-resv object and the page_table_lock to protect the
+ * GPU page table update.
+ *
+ * Since we don't hold the mmap_lock during GPU page table update, user
+ * might perform munmap simultaneously which can cause userptr invalidation.
+ * If such collision happens, we will have to retry hmmptr population.
+ *
+ * Once we have confirmed there is no need to retry, we can update the GPU page
+ * table safely without having to worry about munmap/hmmptr->invalidate(), because
+ * during this retry confirmation and GPU page table update we hold the
+ * page_table_lock. The same page_table_lock is also held in the hmmptr->invalidate
+ * callback.
+ */
+
+/**
+ * DOC: Memory hints: TBD
+ */
+
+/**
+ * DOC: Page granularity eviction
+ *
+ * One of the DRM SVM design requirements is to evict memory at true page granularity.
+ * To achieve this goal, we removed buffer object concept in DRM SVM design. This
+ * means we won't build SVM upon TTM infrastructure like amdgpu or xekmd does.
+ *
+ * Even though this approach is true page granularity, it does create a problem for
+ * memory eviction when there is memory pressure: memory allocated by SVM and memory
+ * allocated by TTM should be able to mutually evict from each other. TTM's resource
+ * manager maintains a LRU list for each memory type and this list is used to pick up
+ * the memory eviction victim. Since we don't use TTM for SVM implementation, SVM
+ * allocated memory can't be added to TTM resource manager's LRU list. Thus SVM
+ * allocated memory and TTM allocated memory are not mutually evictable.
+ *
+ * We solve this problem by creating a shared LRU list b/t SVM and TTM, or any other
+ * resource manager. This is named drm evictable LRU.
+ *
+ * The basic idea is, abstract a drm_lru_entity structure which is supposed to be
+ * embedded in ttm_resource structure, or any other resource manager such as used
+ * in SVM implementation. So a LRU entity can represent memory resource used by
+ * both TTM and SVM.
+ *
+ * The resource LRU list is a list of drm_lru_entity. drm_lru_entity has eviction
+ * function pointers which can be used to call back drivers' specific eviction
+ * function to evict a memory resource.
+ *
+ * Driver specific eviction function from both TTM and SVM should follow the same
+ * lock order to avoid any lock inversion. The current plan is, SVM subsystem to
+ * follow exactly the same locking order used in TTM-based eviction process.
+ *
+ * A device-wide global drm_lru_manager is introduced to manage a shared LRU list
+ * between TTM and SVM. The drm lru manager provides an evict_first function to evict
+ * the first memory resource from LRU list. This function can be called from TTM,
+ * SVM, thus all the memory allocated in the drm sub-system can be mutually evicted.
+ *
+ */
+
 static u64 __npages_in_range(unsigned long start, unsigned long end)
 {
 	return (PAGE_ALIGN(end) - PAGE_ALIGN_DOWN(start)) >> PAGE_SHIFT;
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [CI 15/42] drm/xe: s/xe_tile_migrate_engine/xe_tile_migrate_exec_queue
  2024-06-13  4:23 [CI 01/42] mm/hmm: let users to tag specific PFNs Oak Zeng
                   ` (13 preceding siblings ...)
  2024-06-13  4:24 ` [CI 14/42] drm/svm: Add DRM SVM documentation Oak Zeng
@ 2024-06-13  4:24 ` Oak Zeng
  2024-06-13  4:24 ` [CI 16/42] drm/xe: Add xe_vm_pgtable_update_op to xe_vma_ops Oak Zeng
                   ` (26 subsequent siblings)
  41 siblings, 0 replies; 44+ messages in thread
From: Oak Zeng @ 2024-06-13  4:24 UTC (permalink / raw)
  To: intel-xe

From: Matthew Brost <matthew.brost@intel.com>

Engine is old nomenclature, replace with exec queue.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_migrate.c | 9 ++++-----
 drivers/gpu/drm/xe/xe_migrate.h | 2 +-
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
index 7e3fb33110d9..5eb039acfcf0 100644
--- a/drivers/gpu/drm/xe/xe_migrate.c
+++ b/drivers/gpu/drm/xe/xe_migrate.c
@@ -84,15 +84,14 @@ struct xe_migrate {
 #define MAX_PTE_PER_SDI 0x1FE
 
 /**
- * xe_tile_migrate_engine() - Get this tile's migrate engine.
+ * xe_tile_migrate_exec_queue() - Get this tile's migrate exec queue.
  * @tile: The tile.
  *
- * Returns the default migrate engine of this tile.
- * TODO: Perhaps this function is slightly misplaced, and even unneeded?
+ * Returns the default migrate exec queue of this tile.
  *
- * Return: The default migrate engine
+ * Return: The default migrate exec queue
  */
-struct xe_exec_queue *xe_tile_migrate_engine(struct xe_tile *tile)
+struct xe_exec_queue *xe_tile_migrate_exec_queue(struct xe_tile *tile)
 {
 	return tile->migrate->q;
 }
diff --git a/drivers/gpu/drm/xe/xe_migrate.h b/drivers/gpu/drm/xe/xe_migrate.h
index 951f19318ea4..a5bcaafe4a99 100644
--- a/drivers/gpu/drm/xe/xe_migrate.h
+++ b/drivers/gpu/drm/xe/xe_migrate.h
@@ -106,5 +106,5 @@ xe_migrate_update_pgtables(struct xe_migrate *m,
 
 void xe_migrate_wait(struct xe_migrate *m);
 
-struct xe_exec_queue *xe_tile_migrate_engine(struct xe_tile *tile);
+struct xe_exec_queue *xe_tile_migrate_exec_queue(struct xe_tile *tile);
 #endif
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [CI 16/42] drm/xe: Add xe_vm_pgtable_update_op to xe_vma_ops
  2024-06-13  4:23 [CI 01/42] mm/hmm: let users to tag specific PFNs Oak Zeng
                   ` (14 preceding siblings ...)
  2024-06-13  4:24 ` [CI 15/42] drm/xe: s/xe_tile_migrate_engine/xe_tile_migrate_exec_queue Oak Zeng
@ 2024-06-13  4:24 ` Oak Zeng
  2024-06-13  4:24 ` [CI 17/42] drm/xe: Convert multiple bind ops into single job Oak Zeng
                   ` (25 subsequent siblings)
  41 siblings, 0 replies; 44+ messages in thread
From: Oak Zeng @ 2024-06-13  4:24 UTC (permalink / raw)
  To: intel-xe

From: Matthew Brost <matthew.brost@intel.com>

Each xe_vma_op resolves to 0-3 pt_ops. Add storage for the pt_ops to
xe_vma_ops, which is dynamically allocated based on the number and types
of xe_vma_op in the xe_vma_ops list. Only the allocation is implemented
in this patch.

This will help with converting xe_vma_ops (multiple xe_vma_op) into an
atomic update unit.
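
As a condensed view of how the helpers added here fit together (taken
from the diff below, with surrounding code and error unwinding elided):

	/* while parsing ops, count how many pt ops each tile will need */
	xe_vma_ops_incr_pt_update_ops(&vops, op->tile_mask);

	/* before execution, allocate the per-tile pt_update_ops storage */
	err = xe_vma_ops_alloc(&vops);
	if (err)
		goto unwind_ops;

	fence = ops_execute(vm, &vops);

	/* after execution (or on error), free the storage again */
	xe_vma_ops_fini(&vops);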

Cc: Oak Zeng <oak.zeng@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_pt_types.h | 12 ++++++
 drivers/gpu/drm/xe/xe_vm.c       | 66 +++++++++++++++++++++++++++++++-
 drivers/gpu/drm/xe/xe_vm_types.h |  8 ++++
 3 files changed, 84 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_pt_types.h b/drivers/gpu/drm/xe/xe_pt_types.h
index cee70cb0f014..2093150f461e 100644
--- a/drivers/gpu/drm/xe/xe_pt_types.h
+++ b/drivers/gpu/drm/xe/xe_pt_types.h
@@ -74,4 +74,16 @@ struct xe_vm_pgtable_update {
 	u32 flags;
 };
 
+/** struct xe_vm_pgtable_update_op - Page table update operation */
+struct xe_vm_pgtable_update_op {
+	/** @entries: entries to update for this operation */
+	struct xe_vm_pgtable_update entries[XE_VM_MAX_LEVEL * 2 + 1];
+	/** @num_entries: number of entries for this update operation */
+	u32 num_entries;
+	/** @bind: is a bind */
+	bool bind;
+	/** @rebind: is a rebind */
+	bool rebind;
+};
+
 #endif
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 14069fc9d820..78cbc87c37fa 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -712,6 +712,42 @@ int xe_vm_userptr_check_repin(struct xe_vm *vm)
 		list_empty_careful(&vm->userptr.invalidated)) ? 0 : -EAGAIN;
 }
 
+static int xe_vma_ops_alloc(struct xe_vma_ops *vops)
+{
+	int i;
+
+	for (i = 0; i < XE_MAX_TILES_PER_DEVICE; ++i) {
+		if (!vops->pt_update_ops[i].num_ops)
+			continue;
+
+		vops->pt_update_ops[i].ops =
+			kmalloc_array(vops->pt_update_ops[i].num_ops,
+				      sizeof(*vops->pt_update_ops[i].ops),
+				      GFP_KERNEL);
+		if (!vops->pt_update_ops[i].ops)
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static void xe_vma_ops_fini(struct xe_vma_ops *vops)
+{
+	int i;
+
+	for (i = 0; i < XE_MAX_TILES_PER_DEVICE; ++i)
+		kfree(vops->pt_update_ops[i].ops);
+}
+
+static void xe_vma_ops_incr_pt_update_ops(struct xe_vma_ops *vops, u8 tile_mask)
+{
+	int i;
+
+	for (i = 0; i < XE_MAX_TILES_PER_DEVICE; ++i)
+		if (BIT(i) & tile_mask)
+			++vops->pt_update_ops[i].num_ops;
+}
+
 static void xe_vm_populate_rebind(struct xe_vma_op *op, struct xe_vma *vma,
 				  u8 tile_mask)
 {
@@ -739,6 +775,7 @@ static int xe_vm_ops_add_rebind(struct xe_vma_ops *vops, struct xe_vma *vma,
 
 	xe_vm_populate_rebind(op, vma, tile_mask);
 	list_add_tail(&op->link, &vops->list);
+	xe_vma_ops_incr_pt_update_ops(vops, tile_mask);
 
 	return 0;
 }
@@ -779,6 +816,10 @@ int xe_vm_rebind(struct xe_vm *vm, bool rebind_worker)
 			goto free_ops;
 	}
 
+	err = xe_vma_ops_alloc(&vops);
+	if (err)
+		goto free_ops;
+
 	fence = ops_execute(vm, &vops);
 	if (IS_ERR(fence)) {
 		err = PTR_ERR(fence);
@@ -793,6 +834,7 @@ int xe_vm_rebind(struct xe_vm *vm, bool rebind_worker)
 		list_del(&op->link);
 		kfree(op);
 	}
+	xe_vma_ops_fini(&vops);
 
 	return err;
 }
@@ -814,12 +856,20 @@ struct dma_fence *xe_vma_rebind(struct xe_vm *vm, struct xe_vma *vma, u8 tile_ma
 	if (err)
 		return ERR_PTR(err);
 
+	err = xe_vma_ops_alloc(&vops);
+	if (err) {
+		fence = ERR_PTR(err);
+		goto free_ops;
+	}
+
 	fence = ops_execute(vm, &vops);
 
+free_ops:
 	list_for_each_entry_safe(op, next_op, &vops.list, link) {
 		list_del(&op->link);
 		kfree(op);
 	}
+	xe_vma_ops_fini(&vops);
 
 	return fence;
 }
@@ -2282,7 +2332,6 @@ static int xe_vma_op_commit(struct xe_vm *vm, struct xe_vma_op *op)
 	return err;
 }
 
-
 static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
 				   struct drm_gpuva_ops *ops,
 				   struct xe_sync_entry *syncs, u32 num_syncs,
@@ -2334,6 +2383,9 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
 				return PTR_ERR(vma);
 
 			op->map.vma = vma;
+			if (op->map.immediate || !xe_vm_in_fault_mode(vm))
+				xe_vma_ops_incr_pt_update_ops(vops,
+							      op->tile_mask);
 			break;
 		}
 		case DRM_GPUVA_OP_REMAP:
@@ -2378,6 +2430,8 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
 					vm_dbg(&xe->drm, "REMAP:SKIP_PREV: addr=0x%016llx, range=0x%016llx",
 					       (ULL)op->remap.start,
 					       (ULL)op->remap.range);
+				} else {
+					xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
 				}
 			}
 
@@ -2414,13 +2468,16 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
 					vm_dbg(&xe->drm, "REMAP:SKIP_NEXT: addr=0x%016llx, range=0x%016llx",
 					       (ULL)op->remap.start,
 					       (ULL)op->remap.range);
+				} else {
+					xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
 				}
 			}
+			xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
 			break;
 		}
 		case DRM_GPUVA_OP_UNMAP:
 		case DRM_GPUVA_OP_PREFETCH:
-			/* Nothing to do */
+			xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
 			break;
 		default:
 			drm_warn(&vm->xe->drm, "NOT POSSIBLE");
@@ -3267,11 +3324,16 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 		goto unwind_ops;
 	}
 
+	err = xe_vma_ops_alloc(&vops);
+	if (err)
+		goto unwind_ops;
+
 	err = vm_bind_ioctl_ops_execute(vm, &vops);
 
 unwind_ops:
 	if (err && err != -ENODATA)
 		vm_bind_ioctl_ops_unwind(vm, ops, args->num_binds);
+	xe_vma_ops_fini(&vops);
 	for (i = args->num_binds - 1; i >= 0; --i)
 		if (ops[i])
 			drm_gpuva_ops_free(&vm->gpuvm, ops[i]);
diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
index ce1a63a5e3e7..211c88801182 100644
--- a/drivers/gpu/drm/xe/xe_vm_types.h
+++ b/drivers/gpu/drm/xe/xe_vm_types.h
@@ -21,6 +21,7 @@ struct xe_bo;
 struct xe_sync_entry;
 struct xe_user_fence;
 struct xe_vm;
+struct xe_vm_pgtable_update_op;
 
 #define XE_VMA_READ_ONLY	DRM_GPUVA_USERBITS
 #define XE_VMA_DESTROYED	(DRM_GPUVA_USERBITS << 1)
@@ -368,6 +369,13 @@ struct xe_vma_ops {
 	struct xe_sync_entry *syncs;
 	/** @num_syncs: number of syncs */
 	u32 num_syncs;
+	/** @pt_update_ops: page table update operations */
+	struct {
+		/** @ops: operations */
+		struct xe_vm_pgtable_update_op *ops;
+		/** @num_ops: number of operations */
+		u32 num_ops;
+	} pt_update_ops[XE_MAX_TILES_PER_DEVICE];
 };
 
 #endif
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [CI 17/42] drm/xe: Convert multiple bind ops into single job
  2024-06-13  4:23 [CI 01/42] mm/hmm: let users to tag specific PFNs Oak Zeng
                   ` (15 preceding siblings ...)
  2024-06-13  4:24 ` [CI 16/42] drm/xe: Add xe_vm_pgtable_update_op to xe_vma_ops Oak Zeng
@ 2024-06-13  4:24 ` Oak Zeng
  2024-06-13  4:24 ` [CI 18/42] drm/xe: Update VM trace events Oak Zeng
                   ` (24 subsequent siblings)
  41 siblings, 0 replies; 44+ messages in thread
From: Oak Zeng @ 2024-06-13  4:24 UTC (permalink / raw)
  To: intel-xe

From: Matthew Brost <matthew.brost@intel.com>

This aligns with the uAPI, where an array of binds or a single bind that
results in multiple GPUVA ops is considered a single atomic operation.

The implementation is roughly:
- xe_vma_ops is a list of xe_vma_op (GPUVA op)
- each xe_vma_op resolves to 0-3 PT ops
- xe_vma_ops creates a single job
- if at any point during binding a failure occurs, xe_vma_ops contains
  the information necessary to unwind the PT and VMA (GPUVA) state
  (a rough per-tile flow is sketched below)
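
For illustration only (not part of the diff below), the per-tile flow this
enables looks roughly like the sketch that follows; the xe_pt_update_ops_*
helpers are introduced in this patch, while the wrapper name and the exact
error handling are hypothetical:

static struct dma_fence *example_tile_bind(struct xe_tile *tile,
					   struct xe_vma_ops *vops)
{
	struct dma_fence *fence;
	int err;

	/* Stage PT state and build the per-tile PT update operations. */
	err = xe_pt_update_ops_prepare(tile, vops);
	if (err) {
		/* Unwind staged internal PT state on failure. */
		xe_pt_update_ops_abort(tile, vops);
		return ERR_PTR(err);
	}

	/* Submit all PT ops for this tile as a single job. */
	fence = xe_pt_update_ops_run(tile, vops);
	if (IS_ERR(fence)) {
		xe_pt_update_ops_abort(tile, vops);
		return fence;
	}

	/* Commit the deferred destruction of replaced page-table memory. */
	xe_pt_update_ops_fini(tile, vops);

	return fence;
}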

v2:
 - add missing dma-resv slot reservation (CI, testing)

Cc: Oak Zeng <oak.zeng@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_bo_types.h |    2 +
 drivers/gpu/drm/xe/xe_migrate.c  |  301 ++++----
 drivers/gpu/drm/xe/xe_migrate.h  |   32 +-
 drivers/gpu/drm/xe/xe_pt.c       | 1108 +++++++++++++++++++-----------
 drivers/gpu/drm/xe/xe_pt.h       |   14 +-
 drivers/gpu/drm/xe/xe_pt_types.h |   36 +
 drivers/gpu/drm/xe/xe_vm.c       |  519 +++-----------
 drivers/gpu/drm/xe/xe_vm.h       |    2 +
 drivers/gpu/drm/xe/xe_vm_types.h |   45 +-
 9 files changed, 1036 insertions(+), 1023 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_bo_types.h b/drivers/gpu/drm/xe/xe_bo_types.h
index 86422e113d39..02d68873558a 100644
--- a/drivers/gpu/drm/xe/xe_bo_types.h
+++ b/drivers/gpu/drm/xe/xe_bo_types.h
@@ -58,6 +58,8 @@ struct xe_bo {
 #endif
 	/** @freed: List node for delayed put. */
 	struct llist_node freed;
+	/** @update_index: Update index if PT BO */
+	int update_index;
 	/** @created: Whether the bo has passed initial creation */
 	bool created;
 
diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
index 5eb039acfcf0..0457693315e6 100644
--- a/drivers/gpu/drm/xe/xe_migrate.c
+++ b/drivers/gpu/drm/xe/xe_migrate.c
@@ -1131,6 +1131,7 @@ struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
 }
 
 static void write_pgtable(struct xe_tile *tile, struct xe_bb *bb, u64 ppgtt_ofs,
+			  const struct xe_vm_pgtable_update_op *pt_op,
 			  const struct xe_vm_pgtable_update *update,
 			  struct xe_migrate_pt_update *pt_update)
 {
@@ -1165,8 +1166,12 @@ static void write_pgtable(struct xe_tile *tile, struct xe_bb *bb, u64 ppgtt_ofs,
 		bb->cs[bb->len++] = MI_STORE_DATA_IMM | MI_SDI_NUM_QW(chunk);
 		bb->cs[bb->len++] = lower_32_bits(addr);
 		bb->cs[bb->len++] = upper_32_bits(addr);
-		ops->populate(pt_update, tile, NULL, bb->cs + bb->len, ofs, chunk,
-			      update);
+		if (pt_op->bind)
+			ops->populate(pt_update, tile, NULL, bb->cs + bb->len,
+				      ofs, chunk, update);
+		else
+			ops->clear(pt_update, tile, NULL, bb->cs + bb->len,
+				   ofs, chunk, update);
 
 		bb->len += chunk * 2;
 		ofs += chunk;
@@ -1191,114 +1196,58 @@ struct migrate_test_params {
 
 static struct dma_fence *
 xe_migrate_update_pgtables_cpu(struct xe_migrate *m,
-			       struct xe_vm *vm, struct xe_bo *bo,
-			       const struct  xe_vm_pgtable_update *updates,
-			       u32 num_updates, bool wait_vm,
 			       struct xe_migrate_pt_update *pt_update)
 {
 	XE_TEST_DECLARE(struct migrate_test_params *test =
 			to_migrate_test_params
 			(xe_cur_kunit_priv(XE_TEST_LIVE_MIGRATE));)
 	const struct xe_migrate_pt_update_ops *ops = pt_update->ops;
-	struct dma_fence *fence;
+	struct xe_vm *vm = pt_update->vops->vm;
+	struct xe_vm_pgtable_update_ops *pt_update_ops =
+		&pt_update->vops->pt_update_ops[pt_update->tile_id];
 	int err;
-	u32 i;
+	u32 j, i;
 
 	if (XE_TEST_ONLY(test && test->force_gpu))
 		return ERR_PTR(-ETIME);
 
-	if (bo && !dma_resv_test_signaled(bo->ttm.base.resv,
-					  DMA_RESV_USAGE_KERNEL))
-		return ERR_PTR(-ETIME);
-
-	if (wait_vm && !dma_resv_test_signaled(xe_vm_resv(vm),
-					       DMA_RESV_USAGE_BOOKKEEP))
-		return ERR_PTR(-ETIME);
-
 	if (ops->pre_commit) {
 		pt_update->job = NULL;
 		err = ops->pre_commit(pt_update);
 		if (err)
 			return ERR_PTR(err);
 	}
-	for (i = 0; i < num_updates; i++) {
-		const struct xe_vm_pgtable_update *update = &updates[i];
-
-		ops->populate(pt_update, m->tile, &update->pt_bo->vmap, NULL,
-			      update->ofs, update->qwords, update);
-	}
-
-	if (vm) {
-		trace_xe_vm_cpu_bind(vm);
-		xe_device_wmb(vm->xe);
-	}
-
-	fence = dma_fence_get_stub();
-
-	return fence;
-}
 
-static bool no_in_syncs(struct xe_vm *vm, struct xe_exec_queue *q,
-			struct xe_sync_entry *syncs, u32 num_syncs)
-{
-	struct dma_fence *fence;
-	int i;
-
-	for (i = 0; i < num_syncs; i++) {
-		fence = syncs[i].fence;
-
-		if (fence && !test_bit(DMA_FENCE_FLAG_SIGNALED_BIT,
-				       &fence->flags))
-			return false;
-	}
-	if (q) {
-		fence = xe_exec_queue_last_fence_get(q, vm);
-		if (!test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags)) {
-			dma_fence_put(fence);
-			return false;
+	for (j = 0; j < pt_update_ops->num_ops; ++j) {
+		const struct xe_vm_pgtable_update_op *pt_op =
+			&pt_update_ops->ops[j];
+
+		for (i = 0; i < pt_op->num_entries; i++) {
+			const struct xe_vm_pgtable_update *update =
+				&pt_op->entries[i];
+
+			if (pt_op->bind)
+				ops->populate(pt_update, m->tile,
+					      &update->pt_bo->vmap, NULL,
+					      update->ofs, update->qwords,
+					      update);
+			else
+				ops->clear(pt_update, m->tile,
+					   &update->pt_bo->vmap, NULL,
+					   update->ofs, update->qwords, update);
 		}
-		dma_fence_put(fence);
 	}
 
-	return true;
+	trace_xe_vm_cpu_bind(vm);
+	xe_device_wmb(vm->xe);
+
+	return dma_fence_get_stub();
 }
 
-/**
- * xe_migrate_update_pgtables() - Pipelined page-table update
- * @m: The migrate context.
- * @vm: The vm we'll be updating.
- * @bo: The bo whose dma-resv we will await before updating, or NULL if userptr.
- * @q: The exec queue to be used for the update or NULL if the default
- * migration engine is to be used.
- * @updates: An array of update descriptors.
- * @num_updates: Number of descriptors in @updates.
- * @syncs: Array of xe_sync_entry to await before updating. Note that waits
- * will block the engine timeline.
- * @num_syncs: Number of entries in @syncs.
- * @pt_update: Pointer to a struct xe_migrate_pt_update, which contains
- * pointers to callback functions and, if subclassed, private arguments to
- * those.
- *
- * Perform a pipelined page-table update. The update descriptors are typically
- * built under the same lock critical section as a call to this function. If
- * using the default engine for the updates, they will be performed in the
- * order they grab the job_mutex. If different engines are used, external
- * synchronization is needed for overlapping updates to maintain page-table
- * consistency. Note that the meaing of "overlapping" is that the updates
- * touch the same page-table, which might be a higher-level page-directory.
- * If no pipelining is needed, then updates may be performed by the cpu.
- *
- * Return: A dma_fence that, when signaled, indicates the update completion.
- */
-struct dma_fence *
-xe_migrate_update_pgtables(struct xe_migrate *m,
-			   struct xe_vm *vm,
-			   struct xe_bo *bo,
-			   struct xe_exec_queue *q,
-			   const struct xe_vm_pgtable_update *updates,
-			   u32 num_updates,
-			   struct xe_sync_entry *syncs, u32 num_syncs,
-			   struct xe_migrate_pt_update *pt_update)
+static struct dma_fence *
+__xe_migrate_update_pgtables(struct xe_migrate *m,
+			     struct xe_migrate_pt_update *pt_update,
+			     struct xe_vm_pgtable_update_ops *pt_update_ops)
 {
 	const struct xe_migrate_pt_update_ops *ops = pt_update->ops;
 	struct xe_tile *tile = m->tile;
@@ -1307,59 +1256,45 @@ xe_migrate_update_pgtables(struct xe_migrate *m,
 	struct xe_sched_job *job;
 	struct dma_fence *fence;
 	struct drm_suballoc *sa_bo = NULL;
-	struct xe_vma *vma = pt_update->vma;
 	struct xe_bb *bb;
-	u32 i, batch_size, ppgtt_ofs, update_idx, page_ofs = 0;
+	u32 i, j, batch_size = 0, ppgtt_ofs, update_idx, page_ofs = 0;
+	u32 num_updates = 0, current_update = 0;
 	u64 addr;
 	int err = 0;
-	bool usm = !q && xe->info.has_usm;
-	bool first_munmap_rebind = vma &&
-		vma->gpuva.flags & XE_VMA_FIRST_REBIND;
-	struct xe_exec_queue *q_override = !q ? m->q : q;
-	u16 pat_index = xe->pat.idx[XE_CACHE_WB];
+	bool is_migrate = pt_update_ops->q == m->q;
+	bool usm = is_migrate && xe->info.has_usm;
+
+	for (i = 0; i < pt_update_ops->num_ops; ++i) {
+		struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[i];
+		struct xe_vm_pgtable_update *updates = pt_op->entries;
+
+		num_updates += pt_op->num_entries;
+		for (j = 0; j < pt_op->num_entries; ++j) {
+			u32 num_cmds = DIV_ROUND_UP(updates[j].qwords, 0x1ff);
 
-	/* Use the CPU if no in syncs and engine is idle */
-	if (no_in_syncs(vm, q, syncs, num_syncs) && xe_exec_queue_is_idle(q_override)) {
-		fence =  xe_migrate_update_pgtables_cpu(m, vm, bo, updates,
-							num_updates,
-							first_munmap_rebind,
-							pt_update);
-		if (!IS_ERR(fence) || fence == ERR_PTR(-EAGAIN))
-			return fence;
+			/* align noop + MI_STORE_DATA_IMM cmd prefix */
+			batch_size += 4 * num_cmds + updates[j].qwords * 2;
+		}
 	}
 
 	/* fixed + PTE entries */
 	if (IS_DGFX(xe))
-		batch_size = 2;
+		batch_size += 2;
 	else
-		batch_size = 6 + num_updates * 2;
-
-	for (i = 0; i < num_updates; i++) {
-		u32 num_cmds = DIV_ROUND_UP(updates[i].qwords, MAX_PTE_PER_SDI);
+		batch_size += 6 + num_updates * 2;
 
-		/* align noop + MI_STORE_DATA_IMM cmd prefix */
-		batch_size += 4 * num_cmds + updates[i].qwords * 2;
-	}
-
-	/*
-	 * XXX: Create temp bo to copy from, if batch_size becomes too big?
-	 *
-	 * Worst case: Sum(2 * (each lower level page size) + (top level page size))
-	 * Should be reasonably bound..
-	 */
-	xe_tile_assert(tile, batch_size < SZ_128K);
-
-	bb = xe_bb_new(gt, batch_size, !q && xe->info.has_usm);
+	bb = xe_bb_new(gt, batch_size, usm);
 	if (IS_ERR(bb))
 		return ERR_CAST(bb);
 
 	/* For sysmem PTE's, need to map them in our hole.. */
 	if (!IS_DGFX(xe)) {
 		ppgtt_ofs = NUM_KERNEL_PDE - 1;
-		if (q) {
-			xe_tile_assert(tile, num_updates <= NUM_VMUSA_WRITES_PER_UNIT);
+		if (!is_migrate) {
+			u32 num_units = DIV_ROUND_UP(num_updates,
+						     NUM_VMUSA_WRITES_PER_UNIT);
 
-			sa_bo = drm_suballoc_new(&m->vm_update_sa, 1,
+			sa_bo = drm_suballoc_new(&m->vm_update_sa, num_units,
 						 GFP_KERNEL, true, 0);
 			if (IS_ERR(sa_bo)) {
 				err = PTR_ERR(sa_bo);
@@ -1379,14 +1314,26 @@ xe_migrate_update_pgtables(struct xe_migrate *m,
 		bb->cs[bb->len++] = ppgtt_ofs * XE_PAGE_SIZE + page_ofs;
 		bb->cs[bb->len++] = 0; /* upper_32_bits */
 
-		for (i = 0; i < num_updates; i++) {
-			struct xe_bo *pt_bo = updates[i].pt_bo;
+		for (i = 0; i < pt_update_ops->num_ops; ++i) {
+			struct xe_vm_pgtable_update_op *pt_op =
+				&pt_update_ops->ops[i];
+			struct xe_vm_pgtable_update *updates = pt_op->entries;
 
-			xe_tile_assert(tile, pt_bo->size == SZ_4K);
+			for (j = 0; j < pt_op->num_entries; ++j, ++current_update) {
+				struct xe_vm *vm = pt_update->vops->vm;
+				struct xe_bo *pt_bo = updates[j].pt_bo;
 
-			addr = vm->pt_ops->pte_encode_bo(pt_bo, 0, pat_index, 0);
-			bb->cs[bb->len++] = lower_32_bits(addr);
-			bb->cs[bb->len++] = upper_32_bits(addr);
+				xe_tile_assert(tile, pt_bo->size == SZ_4K);
+
+				/* Map a PT at most once */
+				if (pt_bo->update_index < 0)
+					pt_bo->update_index = current_update;
+
+				addr = vm->pt_ops->pte_encode_bo(pt_bo, 0,
+								 XE_CACHE_WB, 0);
+				bb->cs[bb->len++] = lower_32_bits(addr);
+				bb->cs[bb->len++] = upper_32_bits(addr);
+			}
 		}
 
 		bb->cs[bb->len++] = MI_BATCH_BUFFER_END;
@@ -1394,19 +1341,39 @@ xe_migrate_update_pgtables(struct xe_migrate *m,
 
 		addr = xe_migrate_vm_addr(ppgtt_ofs, 0) +
 			(page_ofs / sizeof(u64)) * XE_PAGE_SIZE;
-		for (i = 0; i < num_updates; i++)
-			write_pgtable(tile, bb, addr + i * XE_PAGE_SIZE,
-				      &updates[i], pt_update);
+		for (i = 0; i < pt_update_ops->num_ops; ++i) {
+			struct xe_vm_pgtable_update_op *pt_op =
+				&pt_update_ops->ops[i];
+			struct xe_vm_pgtable_update *updates = pt_op->entries;
+
+			for (j = 0; j < pt_op->num_entries; ++j) {
+				struct xe_bo *pt_bo = updates[j].pt_bo;
+
+				write_pgtable(tile, bb, addr +
+					      pt_bo->update_index * XE_PAGE_SIZE,
+					      pt_op, &updates[j], pt_update);
+			}
+		}
 	} else {
 		/* phys pages, no preamble required */
 		bb->cs[bb->len++] = MI_BATCH_BUFFER_END;
 		update_idx = bb->len;
 
-		for (i = 0; i < num_updates; i++)
-			write_pgtable(tile, bb, 0, &updates[i], pt_update);
+		for (i = 0; i < pt_update_ops->num_ops; ++i) {
+			struct xe_vm_pgtable_update_op *pt_op =
+				&pt_update_ops->ops[i];
+			struct xe_vm_pgtable_update *updates = pt_op->entries;
+
+			for (j = 0; j < pt_op->num_entries; ++j)
+				write_pgtable(tile, bb, 0, pt_op, &updates[j],
+					      pt_update);
+		}
 	}
 
-	job = xe_bb_create_migration_job(q ?: m->q, bb,
+	if (is_migrate)
+		mutex_lock(&m->job_mutex);
+
+	job = xe_bb_create_migration_job(pt_update_ops->q, bb,
 					 xe_migrate_batch_base(m, usm),
 					 update_idx);
 	if (IS_ERR(job)) {
@@ -1414,46 +1381,18 @@ xe_migrate_update_pgtables(struct xe_migrate *m,
 		goto err_bb;
 	}
 
-	/* Wait on BO move */
-	if (bo) {
-		err = job_add_deps(job, bo->ttm.base.resv,
-				   DMA_RESV_USAGE_KERNEL);
-		if (err)
-			goto err_job;
-	}
-
-	/*
-	 * Munmap style VM unbind, need to wait for all jobs to be complete /
-	 * trigger preempts before moving forward
-	 */
-	if (first_munmap_rebind) {
-		err = job_add_deps(job, xe_vm_resv(vm),
-				   DMA_RESV_USAGE_BOOKKEEP);
-		if (err)
-			goto err_job;
-	}
-
-	err = xe_sched_job_last_fence_add_dep(job, vm);
-	for (i = 0; !err && i < num_syncs; i++)
-		err = xe_sync_entry_add_deps(&syncs[i], job);
-
-	if (err)
-		goto err_job;
-
 	if (ops->pre_commit) {
 		pt_update->job = job;
 		err = ops->pre_commit(pt_update);
 		if (err)
 			goto err_job;
 	}
-	if (!q)
-		mutex_lock(&m->job_mutex);
 
 	xe_sched_job_arm(job);
 	fence = dma_fence_get(&job->drm.s_fence->finished);
 	xe_sched_job_push(job);
 
-	if (!q)
+	if (is_migrate)
 		mutex_unlock(&m->job_mutex);
 
 	xe_bb_free(bb, fence);
@@ -1464,12 +1403,46 @@ xe_migrate_update_pgtables(struct xe_migrate *m,
 err_job:
 	xe_sched_job_put(job);
 err_bb:
+	if (is_migrate)
+		mutex_unlock(&m->job_mutex);
 	xe_bb_free(bb, NULL);
 err:
 	drm_suballoc_free(sa_bo, NULL);
 	return ERR_PTR(err);
 }
 
+/**
+ * xe_migrate_update_pgtables() - Pipelined page-table update
+ * @m: The migrate context.
+ * @pt_update: PT update arguments
+ *
+ * Perform a pipelined page-table update. The update descriptors are typically
+ * built under the same lock critical section as a call to this function. If
+ * using the default engine for the updates, they will be performed in the
+ * order they grab the job_mutex. If different engines are used, external
+ * synchronization is needed for overlapping updates to maintain page-table
+ * consistency. Note that the meaning of "overlapping" is that the updates
+ * touch the same page-table, which might be a higher-level page-directory.
+ * If no pipelining is needed, then updates may be performed by the cpu.
+ *
+ * Return: A dma_fence that, when signaled, indicates the update completion.
+ */
+struct dma_fence *
+xe_migrate_update_pgtables(struct xe_migrate *m,
+			   struct xe_migrate_pt_update *pt_update)
+
+{
+	struct xe_vm_pgtable_update_ops *pt_update_ops =
+		&pt_update->vops->pt_update_ops[pt_update->tile_id];
+	struct dma_fence *fence;
+
+	fence =  xe_migrate_update_pgtables_cpu(m, pt_update);
+	if (!IS_ERR(fence))
+		return fence;
+
+	return __xe_migrate_update_pgtables(m, pt_update, pt_update_ops);
+}
+
 /**
  * xe_migrate_wait() - Complete all operations using the xe_migrate context
  * @m: Migrate context to wait for.
diff --git a/drivers/gpu/drm/xe/xe_migrate.h b/drivers/gpu/drm/xe/xe_migrate.h
index a5bcaafe4a99..453e0ecf5034 100644
--- a/drivers/gpu/drm/xe/xe_migrate.h
+++ b/drivers/gpu/drm/xe/xe_migrate.h
@@ -47,6 +47,24 @@ struct xe_migrate_pt_update_ops {
 			 struct xe_tile *tile, struct iosys_map *map,
 			 void *pos, u32 ofs, u32 num_qwords,
 			 const struct xe_vm_pgtable_update *update);
+	/**
+	 * @clear: Clear a command buffer or page-table with ptes.
+	 * @pt_update: Embeddable callback argument.
+	 * @tile: The tile for the current operation.
+	 * @map: struct iosys_map into the memory to be populated.
+	 * @pos: If @map is NULL, map into the memory to be populated.
+	 * @ofs: qword offset into @map, unused if @map is NULL.
+	 * @num_qwords: Number of qwords to write.
+	 * @update: Information about the PTEs to be inserted.
+	 *
+	 * This interface is intended to be used as a callback into the
+	 * page-table system to populate command buffers or shared
+	 * page-tables with PTEs.
+	 */
+	void (*clear)(struct xe_migrate_pt_update *pt_update,
+		      struct xe_tile *tile, struct iosys_map *map,
+		      void *pos, u32 ofs, u32 num_qwords,
+		      const struct xe_vm_pgtable_update *update);
 
 	/**
 	 * @pre_commit: Callback to be called just before arming the
@@ -67,14 +85,10 @@ struct xe_migrate_pt_update_ops {
 struct xe_migrate_pt_update {
 	/** @ops: Pointer to the struct xe_migrate_pt_update_ops callbacks */
 	const struct xe_migrate_pt_update_ops *ops;
-	/** @vma: The vma we're updating the pagetable for. */
-	struct xe_vma *vma;
+	/** @vops: VMA operations */
+	struct xe_vma_ops *vops;
 	/** @job: The job if a GPU page-table update. NULL otherwise */
 	struct xe_sched_job *job;
-	/** @start: Start of update for the range fence */
-	u64 start;
-	/** @last: Last of update for the range fence */
-	u64 last;
 	/** @tile_id: Tile ID of the update */
 	u8 tile_id;
 };
@@ -96,12 +110,6 @@ struct xe_vm *xe_migrate_get_vm(struct xe_migrate *m);
 
 struct dma_fence *
 xe_migrate_update_pgtables(struct xe_migrate *m,
-			   struct xe_vm *vm,
-			   struct xe_bo *bo,
-			   struct xe_exec_queue *q,
-			   const struct xe_vm_pgtable_update *updates,
-			   u32 num_updates,
-			   struct xe_sync_entry *syncs, u32 num_syncs,
 			   struct xe_migrate_pt_update *pt_update);
 
 void xe_migrate_wait(struct xe_migrate *m);
diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
index cd60c009b679..4353a3bdf6c8 100644
--- a/drivers/gpu/drm/xe/xe_pt.c
+++ b/drivers/gpu/drm/xe/xe_pt.c
@@ -9,12 +9,14 @@
 #include "xe_bo.h"
 #include "xe_device.h"
 #include "xe_drm_client.h"
+#include "xe_exec_queue.h"
 #include "xe_gt.h"
 #include "xe_gt_tlb_invalidation.h"
 #include "xe_migrate.h"
 #include "xe_pt_types.h"
 #include "xe_pt_walk.h"
 #include "xe_res_cursor.h"
+#include "xe_sync.h"
 #include "xe_trace.h"
 #include "xe_ttm_stolen_mgr.h"
 #include "xe_vm.h"
@@ -325,6 +327,7 @@ xe_pt_new_shared(struct xe_walk_update *wupd, struct xe_pt *parent,
 	entry->pt = parent;
 	entry->flags = 0;
 	entry->qwords = 0;
+	entry->pt_bo->update_index = -1;
 
 	if (alloc_entries) {
 		entry->pt_entries = kmalloc_array(XE_PDES,
@@ -864,9 +867,7 @@ static void xe_pt_commit_locks_assert(struct xe_vma *vma)
 
 	lockdep_assert_held(&vm->lock);
 
-	if (xe_vma_is_userptr(vma))
-		lockdep_assert_held_read(&vm->userptr.notifier_lock);
-	else if (!xe_vma_is_null(vma))
+	if (!xe_vma_is_userptr(vma) && !xe_vma_is_null(vma))
 		dma_resv_assert_held(xe_vma_bo(vma)->ttm.base.resv);
 
 	xe_vm_assert_held(vm);
@@ -888,10 +889,8 @@ static void xe_pt_commit_bind(struct xe_vma *vma,
 		if (!rebind)
 			pt->num_live += entries[i].qwords;
 
-		if (!pt->level) {
-			kfree(entries[i].pt_entries);
+		if (!pt->level)
 			continue;
-		}
 
 		pt_dir = as_xe_pt_dir(pt);
 		for (j = 0; j < entries[i].qwords; j++) {
@@ -904,10 +903,18 @@ static void xe_pt_commit_bind(struct xe_vma *vma,
 
 			pt_dir->children[j_] = &newpte->base;
 		}
-		kfree(entries[i].pt_entries);
 	}
 }
 
+static void xe_pt_free_bind(struct xe_vm_pgtable_update *entries,
+			    u32 num_entries)
+{
+	u32 i;
+
+	for (i = 0; i < num_entries; i++)
+		kfree(entries[i].pt_entries);
+}
+
 static int
 xe_pt_prepare_bind(struct xe_tile *tile, struct xe_vma *vma,
 		   struct xe_vm_pgtable_update *entries, u32 *num_entries)
@@ -926,12 +933,13 @@ xe_pt_prepare_bind(struct xe_tile *tile, struct xe_vma *vma,
 
 static void xe_vm_dbg_print_entries(struct xe_device *xe,
 				    const struct xe_vm_pgtable_update *entries,
-				    unsigned int num_entries)
+				    unsigned int num_entries, bool bind)
 #if (IS_ENABLED(CONFIG_DRM_XE_DEBUG_VM))
 {
 	unsigned int i;
 
-	vm_dbg(&xe->drm, "%u entries to update\n", num_entries);
+	vm_dbg(&xe->drm, "%s: %u entries to update\n", bind ? "bind" : "unbind",
+	       num_entries);
 	for (i = 0; i < num_entries; i++) {
 		const struct xe_vm_pgtable_update *entry = &entries[i];
 		struct xe_pt *xe_pt = entry->pt;
@@ -952,66 +960,122 @@ static void xe_vm_dbg_print_entries(struct xe_device *xe,
 {}
 #endif
 
-#ifdef CONFIG_DRM_XE_USERPTR_INVAL_INJECT
+static int job_add_deps(struct xe_sched_job *job, struct dma_resv *resv,
+			enum dma_resv_usage usage)
+{
+	return drm_sched_job_add_resv_dependencies(&job->drm, resv, usage);
+}
 
-static int xe_pt_userptr_inject_eagain(struct xe_userptr_vma *uvma)
+static bool no_in_syncs(struct xe_sync_entry *syncs, u32 num_syncs)
 {
-	u32 divisor = uvma->userptr.divisor ? uvma->userptr.divisor : 2;
-	static u32 count;
+	int i;
 
-	if (count++ % divisor == divisor - 1) {
-		struct xe_vm *vm = xe_vma_vm(&uvma->vma);
+	for (i = 0; i < num_syncs; i++) {
+		struct dma_fence *fence = syncs[i].fence;
 
-		uvma->userptr.divisor = divisor << 1;
-		spin_lock(&vm->userptr.invalidated_lock);
-		list_move_tail(&uvma->userptr.invalidate_link,
-			       &vm->userptr.invalidated);
-		spin_unlock(&vm->userptr.invalidated_lock);
-		return true;
+		if (fence && !test_bit(DMA_FENCE_FLAG_SIGNALED_BIT,
+				       &fence->flags))
+			return false;
 	}
 
-	return false;
+	return true;
 }
 
-#else
-
-static bool xe_pt_userptr_inject_eagain(struct xe_userptr_vma *uvma)
+static int vma_add_deps(struct xe_vma *vma, struct xe_sched_job *job)
 {
-	return false;
+	struct xe_bo *bo = xe_vma_bo(vma);
+
+	xe_bo_assert_held(bo);
+
+	if (bo && !bo->vm) {
+		if (!job) {
+			if (!dma_resv_test_signaled(bo->ttm.base.resv,
+						    DMA_RESV_USAGE_KERNEL))
+				return -ETIME;
+		} else {
+			return job_add_deps(job, bo->ttm.base.resv,
+					    DMA_RESV_USAGE_KERNEL);
+		}
+	}
+
+	return 0;
 }
 
-#endif
+static int op_add_deps(struct xe_vm *vm, struct xe_vma_op *op,
+		       struct xe_sched_job *job)
+{
+	int err = 0;
 
-/**
- * struct xe_pt_migrate_pt_update - Callback argument for pre-commit callbacks
- * @base: Base we derive from.
- * @bind: Whether this is a bind or an unbind operation. A bind operation
- *        makes the pre-commit callback error with -EAGAIN if it detects a
- *        pending invalidation.
- * @locked: Whether the pre-commit callback locked the userptr notifier lock
- *          and it needs unlocking.
- */
-struct xe_pt_migrate_pt_update {
-	struct xe_migrate_pt_update base;
-	bool bind;
-	bool locked;
-};
+	switch (op->base.op) {
+	case DRM_GPUVA_OP_MAP:
+		if (!op->map.immediate && xe_vm_in_fault_mode(vm))
+			break;
+
+		err = vma_add_deps(op->map.vma, job);
+		break;
+	case DRM_GPUVA_OP_REMAP:
+		if (op->remap.prev)
+			err = vma_add_deps(op->remap.prev, job);
+		if (!err && op->remap.next)
+			err = vma_add_deps(op->remap.next, job);
+		break;
+	case DRM_GPUVA_OP_UNMAP:
+		break;
+	case DRM_GPUVA_OP_PREFETCH:
+		err = vma_add_deps(gpuva_to_vma(op->base.prefetch.va), job);
+		break;
+	default:
+		drm_warn(&vm->xe->drm, "NOT POSSIBLE");
+	}
+
+	return err;
+}
 
-/*
- * This function adds the needed dependencies to a page-table update job
- * to make sure racing jobs for separate bind engines don't race writing
- * to the same page-table range, wreaking havoc. Initially use a single
- * fence for the entire VM. An optimization would use smaller granularity.
- */
 static int xe_pt_vm_dependencies(struct xe_sched_job *job,
-				 struct xe_range_fence_tree *rftree,
-				 u64 start, u64 last)
+				 struct xe_vm *vm,
+				 struct xe_vma_ops *vops,
+				 struct xe_vm_pgtable_update_ops *pt_update_ops,
+				 struct xe_range_fence_tree *rftree)
 {
 	struct xe_range_fence *rtfence;
 	struct dma_fence *fence;
-	int err;
+	struct xe_vma_op *op;
+	int err = 0, i;
 
-	rtfence = xe_range_fence_tree_first(rftree, start, last);
+	xe_vm_assert_held(vm);
+
+	if (!job && !no_in_syncs(vops->syncs, vops->num_syncs))
+		return -ETIME;
+
+	if (!job && !xe_exec_queue_is_idle(pt_update_ops->q))
+		return -ETIME;
+
+	if (pt_update_ops->wait_vm_bookkeep) {
+		if (!job) {
+			if (!dma_resv_test_signaled(xe_vm_resv(vm),
+						    DMA_RESV_USAGE_BOOKKEEP))
+				return -ETIME;
+		} else {
+			err = job_add_deps(job, xe_vm_resv(vm),
+					   DMA_RESV_USAGE_BOOKKEEP);
+			if (err)
+				return err;
+		}
+	} else if (pt_update_ops->wait_vm_kernel) {
+		if (!job) {
+			if (!dma_resv_test_signaled(xe_vm_resv(vm),
+						    DMA_RESV_USAGE_KERNEL))
+				return -ETIME;
+		} else {
+			err = job_add_deps(job, xe_vm_resv(vm),
+					   DMA_RESV_USAGE_KERNEL);
+			if (err)
+				return err;
+		}
+	}
+
+	rtfence = xe_range_fence_tree_first(rftree, pt_update_ops->start,
+					    pt_update_ops->last);
 	while (rtfence) {
 		fence = rtfence->fence;
 
@@ -1029,80 +1093,168 @@ static int xe_pt_vm_dependencies(struct xe_sched_job *job,
 				return err;
 		}
 
-		rtfence = xe_range_fence_tree_next(rtfence, start, last);
+		rtfence = xe_range_fence_tree_next(rtfence,
+						   pt_update_ops->start,
+						   pt_update_ops->last);
 	}
 
-	return 0;
+	list_for_each_entry(op, &vops->list, link) {
+		err = op_add_deps(vm, op, job);
+		if (err)
+			return err;
+	}
+
+	for (i = 0; job && !err && i < vops->num_syncs; i++)
+		err = xe_sync_entry_add_deps(&vops->syncs[i], job);
+
+	return err;
 }
 
 static int xe_pt_pre_commit(struct xe_migrate_pt_update *pt_update)
 {
-	struct xe_range_fence_tree *rftree =
-		&xe_vma_vm(pt_update->vma)->rftree[pt_update->tile_id];
+	struct xe_vma_ops *vops = pt_update->vops;
+	struct xe_vm *vm = vops->vm;
+	struct xe_range_fence_tree *rftree = &vm->rftree[pt_update->tile_id];
+	struct xe_vm_pgtable_update_ops *pt_update_ops =
+		&vops->pt_update_ops[pt_update->tile_id];
+
+	return xe_pt_vm_dependencies(pt_update->job, vm, pt_update->vops,
+				     pt_update_ops, rftree);
+}
+
+#ifdef CONFIG_DRM_XE_USERPTR_INVAL_INJECT
+
+static bool xe_pt_userptr_inject_eagain(struct xe_userptr_vma *uvma)
+{
+	u32 divisor = uvma->userptr.divisor ? uvma->userptr.divisor : 2;
+	static u32 count;
 
-	return xe_pt_vm_dependencies(pt_update->job, rftree,
-				     pt_update->start, pt_update->last);
+	if (count++ % divisor == divisor - 1) {
+		uvma->userptr.divisor = divisor << 1;
+		return true;
+	}
+
+	return false;
 }
 
-static int xe_pt_userptr_pre_commit(struct xe_migrate_pt_update *pt_update)
+#else
+
+static bool xe_pt_userptr_inject_eagain(struct xe_userptr_vma *uvma)
 {
-	struct xe_pt_migrate_pt_update *userptr_update =
-		container_of(pt_update, typeof(*userptr_update), base);
-	struct xe_userptr_vma *uvma = to_userptr_vma(pt_update->vma);
-	unsigned long notifier_seq = uvma->userptr.notifier_seq;
-	struct xe_vm *vm = xe_vma_vm(&uvma->vma);
-	int err = xe_pt_vm_dependencies(pt_update->job,
-					&vm->rftree[pt_update->tile_id],
-					pt_update->start,
-					pt_update->last);
+	return false;
+}
 
-	if (err)
-		return err;
+#endif
 
-	userptr_update->locked = false;
+static int vma_check_userptr(struct xe_vm *vm, struct xe_vma *vma,
+			     struct xe_vm_pgtable_update_ops *pt_update)
+{
+	struct xe_userptr_vma *uvma;
+	unsigned long notifier_seq;
 
-	/*
-	 * Wait until nobody is running the invalidation notifier, and
-	 * since we're exiting the loop holding the notifier lock,
-	 * nobody can proceed invalidating either.
-	 *
-	 * Note that we don't update the vma->userptr.notifier_seq since
-	 * we don't update the userptr pages.
-	 */
-	do {
-		down_read(&vm->userptr.notifier_lock);
-		if (!mmu_interval_read_retry(&uvma->userptr.notifier,
-					     notifier_seq))
-			break;
+	lockdep_assert_held_read(&vm->userptr.notifier_lock);
 
-		up_read(&vm->userptr.notifier_lock);
+	if (!xe_vma_is_userptr(vma))
+		return 0;
 
-		if (userptr_update->bind)
-			return -EAGAIN;
+	uvma = to_userptr_vma(vma);
+	notifier_seq = uvma->userptr.notifier_seq;
 
-		notifier_seq = mmu_interval_read_begin(&uvma->userptr.notifier);
-	} while (true);
+	if (uvma->userptr.initial_bind && !xe_vm_in_fault_mode(vm))
+		return 0;
 
-	/* Inject errors to test_whether they are handled correctly */
-	if (userptr_update->bind && xe_pt_userptr_inject_eagain(uvma)) {
-		up_read(&vm->userptr.notifier_lock);
+	if (!mmu_interval_read_retry(&uvma->userptr.notifier,
+				     notifier_seq) &&
+	    !xe_pt_userptr_inject_eagain(uvma))
+		return 0;
+
+	if (xe_vm_in_fault_mode(vm)) {
 		return -EAGAIN;
-	}
+	} else {
+		spin_lock(&vm->userptr.invalidated_lock);
+		list_move_tail(&uvma->userptr.invalidate_link,
+			       &vm->userptr.invalidated);
+		spin_unlock(&vm->userptr.invalidated_lock);
 
-	userptr_update->locked = true;
+		if (xe_vm_in_preempt_fence_mode(vm)) {
+			struct dma_resv_iter cursor;
+			struct dma_fence *fence;
+			long err;
+
+			dma_resv_iter_begin(&cursor, xe_vm_resv(vm),
+					    DMA_RESV_USAGE_BOOKKEEP);
+			dma_resv_for_each_fence_unlocked(&cursor, fence)
+				dma_fence_enable_sw_signaling(fence);
+			dma_resv_iter_end(&cursor);
+
+			err = dma_resv_wait_timeout(xe_vm_resv(vm),
+						    DMA_RESV_USAGE_BOOKKEEP,
+						    false, MAX_SCHEDULE_TIMEOUT);
+			XE_WARN_ON(err <= 0);
+		}
+	}
 
 	return 0;
 }
 
-static const struct xe_migrate_pt_update_ops bind_ops = {
-	.populate = xe_vm_populate_pgtable,
-	.pre_commit = xe_pt_pre_commit,
-};
+static int op_check_userptr(struct xe_vm *vm, struct xe_vma_op *op,
+			    struct xe_vm_pgtable_update_ops *pt_update)
+{
+	int err = 0;
 
-static const struct xe_migrate_pt_update_ops userptr_bind_ops = {
-	.populate = xe_vm_populate_pgtable,
-	.pre_commit = xe_pt_userptr_pre_commit,
-};
+	lockdep_assert_held_read(&vm->userptr.notifier_lock);
+
+	switch (op->base.op) {
+	case DRM_GPUVA_OP_MAP:
+		if (!op->map.immediate && xe_vm_in_fault_mode(vm))
+			break;
+
+		err = vma_check_userptr(vm, op->map.vma, pt_update);
+		break;
+	case DRM_GPUVA_OP_REMAP:
+		if (op->remap.prev)
+			err = vma_check_userptr(vm, op->remap.prev, pt_update);
+		if (!err && op->remap.next)
+			err = vma_check_userptr(vm, op->remap.next, pt_update);
+		break;
+	case DRM_GPUVA_OP_UNMAP:
+		break;
+	case DRM_GPUVA_OP_PREFETCH:
+		err = vma_check_userptr(vm, gpuva_to_vma(op->base.prefetch.va),
+					pt_update);
+		break;
+	default:
+		drm_warn(&vm->xe->drm, "NOT POSSIBLE");
+	}
+
+	return err;
+}
+
+static int xe_pt_userptr_pre_commit(struct xe_migrate_pt_update *pt_update)
+{
+	struct xe_vm *vm = pt_update->vops->vm;
+	struct xe_vma_ops *vops = pt_update->vops;
+	struct xe_vm_pgtable_update_ops *pt_update_ops =
+		&vops->pt_update_ops[pt_update->tile_id];
+	struct xe_vma_op *op;
+	int err;
+
+	err = xe_pt_pre_commit(pt_update);
+	if (err)
+		return err;
+
+	down_read(&vm->userptr.notifier_lock);
+
+	list_for_each_entry(op, &vops->list, link) {
+		err = op_check_userptr(vm, op, pt_update_ops);
+		if (err) {
+			up_read(&vm->userptr.notifier_lock);
+			break;
+		}
+	}
+
+	return err;
+}
 
 struct invalidation_fence {
 	struct xe_gt_tlb_invalidation_fence base;
@@ -1198,190 +1350,6 @@ static int invalidation_fence_init(struct xe_gt *gt,
 	return ret && ret != -ENOENT ? ret : 0;
 }
 
-static void xe_pt_calc_rfence_interval(struct xe_vma *vma,
-				       struct xe_pt_migrate_pt_update *update,
-				       struct xe_vm_pgtable_update *entries,
-				       u32 num_entries)
-{
-	int i, level = 0;
-
-	for (i = 0; i < num_entries; i++) {
-		const struct xe_vm_pgtable_update *entry = &entries[i];
-
-		if (entry->pt->level > level)
-			level = entry->pt->level;
-	}
-
-	/* Greedy (non-optimal) calculation but simple */
-	update->base.start = ALIGN_DOWN(xe_vma_start(vma),
-					0x1ull << xe_pt_shift(level));
-	update->base.last = ALIGN(xe_vma_end(vma),
-				  0x1ull << xe_pt_shift(level)) - 1;
-}
-
-/**
- * __xe_pt_bind_vma() - Build and connect a page-table tree for the vma
- * address range.
- * @tile: The tile to bind for.
- * @vma: The vma to bind.
- * @q: The exec_queue with which to do pipelined page-table updates.
- * @syncs: Entries to sync on before binding the built tree to the live vm tree.
- * @num_syncs: Number of @sync entries.
- * @rebind: Whether we're rebinding this vma to the same address range without
- * an unbind in-between.
- *
- * This function builds a page-table tree (see xe_pt_stage_bind() for more
- * information on page-table building), and the xe_vm_pgtable_update entries
- * abstracting the operations needed to attach it to the main vm tree. It
- * then takes the relevant locks and updates the metadata side of the main
- * vm tree and submits the operations for pipelined attachment of the
- * gpu page-table to the vm main tree, (which can be done either by the
- * cpu and the GPU).
- *
- * Return: A valid dma-fence representing the pipelined attachment operation
- * on success, an error pointer on error.
- */
-struct dma_fence *
-__xe_pt_bind_vma(struct xe_tile *tile, struct xe_vma *vma, struct xe_exec_queue *q,
-		 struct xe_sync_entry *syncs, u32 num_syncs,
-		 bool rebind)
-{
-	struct xe_vm_pgtable_update entries[XE_VM_MAX_LEVEL * 2 + 1];
-	struct xe_pt_migrate_pt_update bind_pt_update = {
-		.base = {
-			.ops = xe_vma_is_userptr(vma) ? &userptr_bind_ops : &bind_ops,
-			.vma = vma,
-			.tile_id = tile->id,
-		},
-		.bind = true,
-	};
-	struct xe_vm *vm = xe_vma_vm(vma);
-	u32 num_entries;
-	struct dma_fence *fence;
-	struct invalidation_fence *ifence = NULL;
-	struct xe_range_fence *rfence;
-	int err;
-
-	bind_pt_update.locked = false;
-	xe_bo_assert_held(xe_vma_bo(vma));
-	xe_vm_assert_held(vm);
-
-	vm_dbg(&xe_vma_vm(vma)->xe->drm,
-	       "Preparing bind, with range [%llx...%llx) engine %p.\n",
-	       xe_vma_start(vma), xe_vma_end(vma), q);
-
-	err = xe_pt_prepare_bind(tile, vma, entries, &num_entries);
-	if (err)
-		goto err;
-
-	err = dma_resv_reserve_fences(xe_vm_resv(vm), 1);
-	if (!err && !xe_vma_has_no_bo(vma) && !xe_vma_bo(vma)->vm)
-		err = dma_resv_reserve_fences(xe_vma_bo(vma)->ttm.base.resv, 1);
-	if (err)
-		goto err;
-
-	xe_tile_assert(tile, num_entries <= ARRAY_SIZE(entries));
-
-	xe_vm_dbg_print_entries(tile_to_xe(tile), entries, num_entries);
-	xe_pt_calc_rfence_interval(vma, &bind_pt_update, entries,
-				   num_entries);
-
-	/*
-	 * If rebind, we have to invalidate TLB on !LR vms to invalidate
-	 * cached PTEs point to freed memory. on LR vms this is done
-	 * automatically when the context is re-enabled by the rebind worker,
-	 * or in fault mode it was invalidated on PTE zapping.
-	 *
-	 * If !rebind, and scratch enabled VMs, there is a chance the scratch
-	 * PTE is already cached in the TLB so it needs to be invalidated.
-	 * on !LR VMs this is done in the ring ops preceding a batch, but on
-	 * non-faulting LR, in particular on user-space batch buffer chaining,
-	 * it needs to be done here.
-	 */
-	if ((!rebind && xe_vm_has_scratch(vm) && xe_vm_in_preempt_fence_mode(vm))) {
-		ifence = kzalloc(sizeof(*ifence), GFP_KERNEL);
-		if (!ifence)
-			return ERR_PTR(-ENOMEM);
-	} else if (rebind && !xe_vm_in_lr_mode(vm)) {
-		/* We bump also if batch_invalidate_tlb is true */
-		vm->tlb_flush_seqno++;
-	}
-
-	rfence = kzalloc(sizeof(*rfence), GFP_KERNEL);
-	if (!rfence) {
-		kfree(ifence);
-		return ERR_PTR(-ENOMEM);
-	}
-
-	fence = xe_migrate_update_pgtables(tile->migrate,
-					   vm, xe_vma_bo(vma), q,
-					   entries, num_entries,
-					   syncs, num_syncs,
-					   &bind_pt_update.base);
-	if (!IS_ERR(fence)) {
-		bool last_munmap_rebind = vma->gpuva.flags & XE_VMA_LAST_REBIND;
-		LLIST_HEAD(deferred);
-		int err;
-
-		err = xe_range_fence_insert(&vm->rftree[tile->id], rfence,
-					    &xe_range_fence_kfree_ops,
-					    bind_pt_update.base.start,
-					    bind_pt_update.base.last, fence);
-		if (err)
-			dma_fence_wait(fence, false);
-
-		/* TLB invalidation must be done before signaling rebind */
-		if (ifence) {
-			int err = invalidation_fence_init(tile->primary_gt,
-							  ifence, fence,
-							  xe_vma_start(vma),
-							  xe_vma_end(vma),
-							  xe_vma_vm(vma)->usm.asid);
-			if (err) {
-				dma_fence_put(fence);
-				kfree(ifence);
-				return ERR_PTR(err);
-			}
-			fence = &ifence->base.base;
-		}
-
-		/* add shared fence now for pagetable delayed destroy */
-		dma_resv_add_fence(xe_vm_resv(vm), fence, rebind ||
-				   last_munmap_rebind ?
-				   DMA_RESV_USAGE_KERNEL :
-				   DMA_RESV_USAGE_BOOKKEEP);
-
-		if (!xe_vma_has_no_bo(vma) && !xe_vma_bo(vma)->vm)
-			dma_resv_add_fence(xe_vma_bo(vma)->ttm.base.resv, fence,
-					   DMA_RESV_USAGE_BOOKKEEP);
-		xe_pt_commit_bind(vma, entries, num_entries, rebind,
-				  bind_pt_update.locked ? &deferred : NULL);
-
-		/* This vma is live (again?) now */
-		vma->tile_present |= BIT(tile->id);
-
-		if (bind_pt_update.locked) {
-			to_userptr_vma(vma)->userptr.initial_bind = true;
-			up_read(&vm->userptr.notifier_lock);
-			xe_bo_put_commit(&deferred);
-		}
-		if (!rebind && last_munmap_rebind &&
-		    xe_vm_in_preempt_fence_mode(vm))
-			xe_vm_queue_rebind_worker(vm);
-	} else {
-		kfree(rfence);
-		kfree(ifence);
-		if (bind_pt_update.locked)
-			up_read(&vm->userptr.notifier_lock);
-		xe_pt_abort_bind(vma, entries, num_entries);
-	}
-
-	return fence;
-
-err:
-	return ERR_PTR(err);
-}
-
 struct xe_pt_stage_unbind_walk {
 	/** @base: The pagewalk base-class. */
 	struct xe_pt_walk base;
@@ -1532,8 +1500,8 @@ xe_migrate_clear_pgtable_callback(struct xe_migrate_pt_update *pt_update,
 				  void *ptr, u32 qword_ofs, u32 num_qwords,
 				  const struct xe_vm_pgtable_update *update)
 {
-	struct xe_vma *vma = pt_update->vma;
-	u64 empty = __xe_pt_empty_pte(tile, xe_vma_vm(vma), update->pt->level);
+	struct xe_vm *vm = pt_update->vops->vm;
+	u64 empty = __xe_pt_empty_pte(tile, vm, update->pt->level);
 	int i;
 
 	if (map && map->is_iomem)
@@ -1577,151 +1545,487 @@ xe_pt_commit_unbind(struct xe_vma *vma,
 	}
 }
 
-static const struct xe_migrate_pt_update_ops unbind_ops = {
-	.populate = xe_migrate_clear_pgtable_callback,
-	.pre_commit = xe_pt_pre_commit,
-};
+static void
+xe_pt_update_ops_rfence_interval(struct xe_vm_pgtable_update_ops *pt_update_ops,
+				 struct xe_vma *vma)
+{
+	u32 current_op = pt_update_ops->current_op;
+	struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[current_op];
+	int i, level = 0;
+	u64 start, last;
 
-static const struct xe_migrate_pt_update_ops userptr_unbind_ops = {
-	.populate = xe_migrate_clear_pgtable_callback,
-	.pre_commit = xe_pt_userptr_pre_commit,
-};
+	for (i = 0; i < pt_op->num_entries; i++) {
+		const struct xe_vm_pgtable_update *entry = &pt_op->entries[i];
 
-/**
- * __xe_pt_unbind_vma() - Disconnect and free a page-table tree for the vma
- * address range.
- * @tile: The tile to unbind for.
- * @vma: The vma to unbind.
- * @q: The exec_queue with which to do pipelined page-table updates.
- * @syncs: Entries to sync on before disconnecting the tree to be destroyed.
- * @num_syncs: Number of @sync entries.
- *
- * This function builds a the xe_vm_pgtable_update entries abstracting the
- * operations needed to detach the page-table tree to be destroyed from the
- * man vm tree.
- * It then takes the relevant locks and submits the operations for
- * pipelined detachment of the gpu page-table from  the vm main tree,
- * (which can be done either by the cpu and the GPU), Finally it frees the
- * detached page-table tree.
- *
- * Return: A valid dma-fence representing the pipelined detachment operation
- * on success, an error pointer on error.
- */
-struct dma_fence *
-__xe_pt_unbind_vma(struct xe_tile *tile, struct xe_vma *vma, struct xe_exec_queue *q,
-		   struct xe_sync_entry *syncs, u32 num_syncs)
+		if (entry->pt->level > level)
+			level = entry->pt->level;
+	}
+
+	/* Greedy (non-optimal) calculation but simple */
+	start = ALIGN_DOWN(xe_vma_start(vma), 0x1ull << xe_pt_shift(level));
+	last = ALIGN(xe_vma_end(vma), 0x1ull << xe_pt_shift(level)) - 1;
+
+	if (start < pt_update_ops->start)
+		pt_update_ops->start = start;
+	if (last > pt_update_ops->last)
+		pt_update_ops->last = last;
+}
+
+static int vma_reserve_fences(struct xe_device *xe, struct xe_vma *vma)
 {
-	struct xe_vm_pgtable_update entries[XE_VM_MAX_LEVEL * 2 + 1];
-	struct xe_pt_migrate_pt_update unbind_pt_update = {
-		.base = {
-			.ops = xe_vma_is_userptr(vma) ? &userptr_unbind_ops :
-			&unbind_ops,
-			.vma = vma,
-			.tile_id = tile->id,
-		},
-	};
-	struct xe_vm *vm = xe_vma_vm(vma);
-	u32 num_entries;
-	struct dma_fence *fence = NULL;
-	struct invalidation_fence *ifence;
-	struct xe_range_fence *rfence;
-	int err;
+	if (!xe_vma_has_no_bo(vma) && !xe_vma_bo(vma)->vm)
+		return dma_resv_reserve_fences(xe_vma_bo(vma)->ttm.base.resv,
+					       xe->info.tile_count);
 
-	LLIST_HEAD(deferred);
+	return 0;
+}
+
+static int bind_op_prepare(struct xe_vm *vm, struct xe_tile *tile,
+			   struct xe_vm_pgtable_update_ops *pt_update_ops,
+			   struct xe_vma *vma)
+{
+	u32 current_op = pt_update_ops->current_op;
+	struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[current_op];
+	struct llist_head *deferred = &pt_update_ops->deferred;
+	int err;
 
 	xe_bo_assert_held(xe_vma_bo(vma));
-	xe_vm_assert_held(vm);
 
 	vm_dbg(&xe_vma_vm(vma)->xe->drm,
-	       "Preparing unbind, with range [%llx...%llx) engine %p.\n",
-	       xe_vma_start(vma), xe_vma_end(vma), q);
-
-	num_entries = xe_pt_stage_unbind(tile, vma, entries);
-	xe_tile_assert(tile, num_entries <= ARRAY_SIZE(entries));
+	       "Preparing bind, with range [%llx...%llx)\n",
+	       xe_vma_start(vma), xe_vma_end(vma) - 1);
 
-	xe_vm_dbg_print_entries(tile_to_xe(tile), entries, num_entries);
-	xe_pt_calc_rfence_interval(vma, &unbind_pt_update, entries,
-				   num_entries);
+	pt_op->vma = NULL;
+	pt_op->bind = true;
+	pt_op->rebind = BIT(tile->id) & vma->tile_present;
 
-	err = dma_resv_reserve_fences(xe_vm_resv(vm), 1);
-	if (!err && !xe_vma_has_no_bo(vma) && !xe_vma_bo(vma)->vm)
-		err = dma_resv_reserve_fences(xe_vma_bo(vma)->ttm.base.resv, 1);
+	err = vma_reserve_fences(tile_to_xe(tile), vma);
 	if (err)
-		return ERR_PTR(err);
+		return err;
 
-	ifence = kzalloc(sizeof(*ifence), GFP_KERNEL);
-	if (!ifence)
-		return ERR_PTR(-ENOMEM);
+	err = xe_pt_prepare_bind(tile, vma, pt_op->entries,
+				 &pt_op->num_entries);
+	if (!err) {
+		xe_tile_assert(tile, pt_op->num_entries <=
+			       ARRAY_SIZE(pt_op->entries));
+		xe_vm_dbg_print_entries(tile_to_xe(tile), pt_op->entries,
+					pt_op->num_entries, true);
 
-	rfence = kzalloc(sizeof(*rfence), GFP_KERNEL);
-	if (!rfence) {
-		kfree(ifence);
-		return ERR_PTR(-ENOMEM);
+		xe_pt_update_ops_rfence_interval(pt_update_ops, vma);
+		++pt_update_ops->current_op;
+		pt_update_ops->needs_userptr_lock |= xe_vma_is_userptr(vma);
+
+		/*
+		 * If rebind, we have to invalidate TLB on !LR vms to invalidate
+		 * cached PTEs point to freed memory. on LR vms this is done
+		 * automatically when the context is re-enabled by the rebind
+		 * worker, or in fault mode it was invalidated on PTE zapping.
+		 *
+		 * If !rebind, and scratch enabled VMs, there is a chance the
+		 * scratch PTE is already cached in the TLB so it needs to be
+		 * invalidated. on !LR VMs this is done in the ring ops
+		 * preceding a batch, but on non-faulting LR, in particular on
+		 * user-space batch buffer chaining, it needs to be done here.
+		 */
+		pt_update_ops->needs_invalidation |=
+			(pt_op->rebind && !xe_vm_in_lr_mode(vm) &&
+			!vm->batch_invalidate_tlb) ||
+			(!pt_op->rebind && vm->scratch_pt[tile->id] &&
+			 xe_vm_in_preempt_fence_mode(vm));
+
+		/* FIXME: Don't commit right away */
+		vma->tile_staged |= BIT(tile->id);
+		pt_op->vma = vma;
+		xe_pt_commit_bind(vma, pt_op->entries, pt_op->num_entries,
+				  pt_op->rebind, deferred);
 	}
 
+	return err;
+}
+
+static int unbind_op_prepare(struct xe_tile *tile,
+			     struct xe_vm_pgtable_update_ops *pt_update_ops,
+			     struct xe_vma *vma)
+{
+	u32 current_op = pt_update_ops->current_op;
+	struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[current_op];
+	struct llist_head *deferred = &pt_update_ops->deferred;
+	int err;
+
+	if (!((vma->tile_present | vma->tile_staged) & BIT(tile->id)))
+		return 0;
+
+	xe_bo_assert_held(xe_vma_bo(vma));
+
+	vm_dbg(&xe_vma_vm(vma)->xe->drm,
+	       "Preparing unbind, with range [%llx...%llx)\n",
+	       xe_vma_start(vma), xe_vma_end(vma) - 1);
+
 	/*
-	 * Even if we were already evicted and unbind to destroy, we need to
-	 * clear again here. The eviction may have updated pagetables at a
-	 * lower level, because it needs to be more conservative.
+	 * Wait for invalidation to complete. Can corrupt internal page table
+	 * state if an invalidation is running while preparing an unbind.
 	 */
-	fence = xe_migrate_update_pgtables(tile->migrate,
-					   vm, NULL, q ? q :
-					   vm->q[tile->id],
-					   entries, num_entries,
-					   syncs, num_syncs,
-					   &unbind_pt_update.base);
-	if (!IS_ERR(fence)) {
-		int err;
-
-		err = xe_range_fence_insert(&vm->rftree[tile->id], rfence,
-					    &xe_range_fence_kfree_ops,
-					    unbind_pt_update.base.start,
-					    unbind_pt_update.base.last, fence);
-		if (err)
-			dma_fence_wait(fence, false);
+	if (xe_vma_is_userptr(vma) && xe_vm_in_fault_mode(xe_vma_vm(vma)))
+		mmu_interval_read_begin(&to_userptr_vma(vma)->userptr.notifier);
 
-		/* TLB invalidation must be done before signaling unbind */
-		err = invalidation_fence_init(tile->primary_gt, ifence, fence,
-					      xe_vma_start(vma),
-					      xe_vma_end(vma),
-					      xe_vma_vm(vma)->usm.asid);
-		if (err) {
-			dma_fence_put(fence);
-			kfree(ifence);
-			return ERR_PTR(err);
+	pt_op->vma = vma;
+	pt_op->bind = false;
+	pt_op->rebind = false;
+
+	err = vma_reserve_fences(tile_to_xe(tile), vma);
+	if (err)
+		return err;
+
+	pt_op->num_entries = xe_pt_stage_unbind(tile, vma, pt_op->entries);
+
+	xe_vm_dbg_print_entries(tile_to_xe(tile), pt_op->entries,
+				pt_op->num_entries, false);
+	xe_pt_update_ops_rfence_interval(pt_update_ops, vma);
+	++pt_update_ops->current_op;
+	pt_update_ops->needs_userptr_lock |= xe_vma_is_userptr(vma);
+	pt_update_ops->needs_invalidation = true;
+
+	/* FIXME: Don't commit right away */
+	xe_pt_commit_unbind(vma, pt_op->entries, pt_op->num_entries,
+			    deferred);
+
+	return 0;
+}
+
+static int op_prepare(struct xe_vm *vm,
+		      struct xe_tile *tile,
+		      struct xe_vm_pgtable_update_ops *pt_update_ops,
+		      struct xe_vma_op *op)
+{
+	int err = 0;
+
+	xe_vm_assert_held(vm);
+
+	switch (op->base.op) {
+	case DRM_GPUVA_OP_MAP:
+		if (!op->map.immediate && xe_vm_in_fault_mode(vm))
+			break;
+
+		err = bind_op_prepare(vm, tile, pt_update_ops, op->map.vma);
+		pt_update_ops->wait_vm_kernel = true;
+		break;
+	case DRM_GPUVA_OP_REMAP:
+		err = unbind_op_prepare(tile, pt_update_ops,
+					gpuva_to_vma(op->base.remap.unmap->va));
+
+		if (!err && op->remap.prev) {
+			err = bind_op_prepare(vm, tile, pt_update_ops,
+					      op->remap.prev);
+			pt_update_ops->wait_vm_bookkeep = true;
 		}
-		fence = &ifence->base.base;
+		if (!err && op->remap.next) {
+			err = bind_op_prepare(vm, tile, pt_update_ops,
+					      op->remap.next);
+			pt_update_ops->wait_vm_bookkeep = true;
+		}
+		break;
+	case DRM_GPUVA_OP_UNMAP:
+		err = unbind_op_prepare(tile, pt_update_ops,
+					gpuva_to_vma(op->base.unmap.va));
+		break;
+	case DRM_GPUVA_OP_PREFETCH:
+		err = bind_op_prepare(vm, tile, pt_update_ops,
+				      gpuva_to_vma(op->base.prefetch.va));
+		pt_update_ops->wait_vm_kernel = true;
+		break;
+	default:
+		drm_warn(&vm->xe->drm, "NOT POSSIBLE");
+	}
 
-		/* add shared fence now for pagetable delayed destroy */
-		dma_resv_add_fence(xe_vm_resv(vm), fence,
-				   DMA_RESV_USAGE_BOOKKEEP);
+	return err;
+}
 
-		/* This fence will be installed by caller when doing eviction */
-		if (!xe_vma_has_no_bo(vma) && !xe_vma_bo(vma)->vm)
-			dma_resv_add_fence(xe_vma_bo(vma)->ttm.base.resv, fence,
-					   DMA_RESV_USAGE_BOOKKEEP);
-		xe_pt_commit_unbind(vma, entries, num_entries,
-				    unbind_pt_update.locked ? &deferred : NULL);
-		vma->tile_present &= ~BIT(tile->id);
-	} else {
-		kfree(rfence);
-		kfree(ifence);
+static void
+xe_pt_update_ops_init(struct xe_vm_pgtable_update_ops *pt_update_ops)
+{
+	init_llist_head(&pt_update_ops->deferred);
+	pt_update_ops->start = ~0x0ull;
+	pt_update_ops->last = 0x0ull;
+}
+
+/**
+ * xe_pt_update_ops_prepare() - Prepare PT update operations
+ * @tile: Tile of PT update operations
+ * @vops: VMA operations
+ *
+ * Prepare PT update operations, which includes updating internal PT state,
+ * allocating memory for page tables, populating page tables being pruned in,
+ * and creating PT update operations for leaf insertion / removal.
+ *
+ * Return: 0 on success, negative error code on error.
+ */
+int xe_pt_update_ops_prepare(struct xe_tile *tile, struct xe_vma_ops *vops)
+{
+	struct xe_vm_pgtable_update_ops *pt_update_ops =
+		&vops->pt_update_ops[tile->id];
+	struct xe_vma_op *op;
+	int err;
+
+	lockdep_assert_held(&vops->vm->lock);
+	xe_vm_assert_held(vops->vm);
+
+	xe_pt_update_ops_init(pt_update_ops);
+
+	err = dma_resv_reserve_fences(xe_vm_resv(vops->vm),
+				      tile_to_xe(tile)->info.tile_count);
+	if (err)
+		return err;
+
+	list_for_each_entry(op, &vops->list, link) {
+		err = op_prepare(vops->vm, tile, pt_update_ops, op);
+
+		if (err)
+			return err;
 	}
 
-	if (!vma->tile_present)
-		list_del_init(&vma->combined_links.rebind);
+	xe_tile_assert(tile, pt_update_ops->current_op <=
+		       pt_update_ops->num_ops);
+
+	return 0;
+}
+
+static void bind_op_commit(struct xe_vm *vm, struct xe_tile *tile,
+			   struct xe_vm_pgtable_update_ops *pt_update_ops,
+			   struct xe_vma *vma, struct dma_fence *fence)
+{
+	if (!xe_vma_has_no_bo(vma) && !xe_vma_bo(vma)->vm)
+		dma_resv_add_fence(xe_vma_bo(vma)->ttm.base.resv, fence,
+				   pt_update_ops->wait_vm_bookkeep ?
+				   DMA_RESV_USAGE_KERNEL :
+				   DMA_RESV_USAGE_BOOKKEEP);
+	vma->tile_present |= BIT(tile->id);
+	vma->tile_staged &= ~BIT(tile->id);
+	if (xe_vma_is_userptr(vma)) {
+		lockdep_assert_held_read(&vm->userptr.notifier_lock);
+		to_userptr_vma(vma)->userptr.initial_bind = true;
+	}
 
-	if (unbind_pt_update.locked) {
-		xe_tile_assert(tile, xe_vma_is_userptr(vma));
+	/*
+	 * Kick rebind worker if this bind triggers preempt fences and not in
+	 * the rebind worker
+	 */
+	if (pt_update_ops->wait_vm_bookkeep &&
+	    xe_vm_in_preempt_fence_mode(vm) &&
+	    !current->mm)
+		xe_vm_queue_rebind_worker(vm);
+}
+
+static void unbind_op_commit(struct xe_vm *vm, struct xe_tile *tile,
+			     struct xe_vm_pgtable_update_ops *pt_update_ops,
+			     struct xe_vma *vma, struct dma_fence *fence)
+{
+	if (!xe_vma_has_no_bo(vma) && !xe_vma_bo(vma)->vm)
+		dma_resv_add_fence(xe_vma_bo(vma)->ttm.base.resv, fence,
+				   pt_update_ops->wait_vm_bookkeep ?
+				   DMA_RESV_USAGE_KERNEL :
+				   DMA_RESV_USAGE_BOOKKEEP);
+	vma->tile_present &= ~BIT(tile->id);
+	if (!vma->tile_present) {
+		list_del_init(&vma->combined_links.rebind);
+		if (xe_vma_is_userptr(vma)) {
+			lockdep_assert_held_read(&vm->userptr.notifier_lock);
 
-		if (!vma->tile_present) {
 			spin_lock(&vm->userptr.invalidated_lock);
 			list_del_init(&to_userptr_vma(vma)->userptr.invalidate_link);
 			spin_unlock(&vm->userptr.invalidated_lock);
 		}
-		up_read(&vm->userptr.notifier_lock);
-		xe_bo_put_commit(&deferred);
 	}
+}
+
+static void op_commit(struct xe_vm *vm,
+		      struct xe_tile *tile,
+		      struct xe_vm_pgtable_update_ops *pt_update_ops,
+		      struct xe_vma_op *op, struct dma_fence *fence)
+{
+	xe_vm_assert_held(vm);
+
+	switch (op->base.op) {
+	case DRM_GPUVA_OP_MAP:
+		if (!op->map.immediate && xe_vm_in_fault_mode(vm))
+			break;
+
+		bind_op_commit(vm, tile, pt_update_ops, op->map.vma, fence);
+		break;
+	case DRM_GPUVA_OP_REMAP:
+		unbind_op_commit(vm, tile, pt_update_ops,
+				 gpuva_to_vma(op->base.remap.unmap->va), fence);
+
+		if (op->remap.prev)
+			bind_op_commit(vm, tile, pt_update_ops, op->remap.prev,
+				       fence);
+		if (op->remap.next)
+			bind_op_commit(vm, tile, pt_update_ops, op->remap.next,
+				       fence);
+		break;
+	case DRM_GPUVA_OP_UNMAP:
+		unbind_op_commit(vm, tile, pt_update_ops,
+				 gpuva_to_vma(op->base.unmap.va), fence);
+		break;
+	case DRM_GPUVA_OP_PREFETCH:
+		bind_op_commit(vm, tile, pt_update_ops,
+			       gpuva_to_vma(op->base.prefetch.va), fence);
+		break;
+	default:
+		drm_warn(&vm->xe->drm, "NOT POSSIBLE");
+	}
+}
+
+static const struct xe_migrate_pt_update_ops migrate_ops = {
+	.populate = xe_vm_populate_pgtable,
+	.clear = xe_migrate_clear_pgtable_callback,
+	.pre_commit = xe_pt_pre_commit,
+};
+
+static const struct xe_migrate_pt_update_ops userptr_migrate_ops = {
+	.populate = xe_vm_populate_pgtable,
+	.clear = xe_migrate_clear_pgtable_callback,
+	.pre_commit = xe_pt_userptr_pre_commit,
+};
+
+/**
+ * xe_pt_update_ops_run() - Run PT update operations
+ * @tile: Tile of PT update operations
+ * @vops: VMA operations
+ *
+ * Run PT update operations, which includes committing internal PT state
+ * changes, creating a job for the PT update operations for leaf insertion /
+ * removal, and installing the job fence in various places.
+ *
+ * Return: fence on success, negative ERR_PTR on error.
+ */
+struct dma_fence *
+xe_pt_update_ops_run(struct xe_tile *tile, struct xe_vma_ops *vops)
+{
+	struct xe_vm *vm = vops->vm;
+	struct xe_vm_pgtable_update_ops *pt_update_ops =
+		&vops->pt_update_ops[tile->id];
+	struct dma_fence *fence;
+	struct invalidation_fence *ifence = NULL;
+	struct xe_range_fence *rfence;
+	struct xe_vma_op *op;
+	int err = 0;
+	struct xe_migrate_pt_update update = {
+		.ops = pt_update_ops->needs_userptr_lock ?
+			&userptr_migrate_ops :
+			&migrate_ops,
+		.vops = vops,
+		.tile_id = tile->id
+	};
+
+	lockdep_assert_held(&vm->lock);
+	xe_vm_assert_held(vm);
+
+	if (!pt_update_ops->current_op) {
+		xe_tile_assert(tile, xe_vm_in_fault_mode(vm));
+
+		return dma_fence_get_stub();
+	}
+
+	if (pt_update_ops->needs_invalidation) {
+		ifence = kzalloc(sizeof(*ifence), GFP_KERNEL);
+		if (!ifence)
+			return ERR_PTR(-ENOMEM);
+	}
+
+	rfence = kzalloc(sizeof(*rfence), GFP_KERNEL);
+	if (!rfence) {
+		err = -ENOMEM;
+		goto free_ifence;
+	}
+
+	fence = xe_migrate_update_pgtables(tile->migrate, &update);
+	if (IS_ERR(fence)) {
+		err = PTR_ERR(fence);
+		goto free_rfence;
+	}
+
+	err = xe_range_fence_insert(&vm->rftree[tile->id], rfence,
+				    &xe_range_fence_kfree_ops,
+				    pt_update_ops->start,
+				    pt_update_ops->last, fence);
+	if (err)
+		dma_fence_wait(fence, false);
+
+	/* tlb invalidation must be done before signaling rebind */
+	if (ifence) {
+		err = invalidation_fence_init(tile->primary_gt, ifence, fence,
+					      pt_update_ops->start,
+					      pt_update_ops->last,
+					      vm->usm.asid);
+		if (err)
+			goto put_fence;
+		fence = &ifence->base.base;
+	}
+
+	dma_resv_add_fence(xe_vm_resv(vm), fence,
+			   pt_update_ops->wait_vm_bookkeep ?
+			   DMA_RESV_USAGE_KERNEL :
+			   DMA_RESV_USAGE_BOOKKEEP);
+
+	list_for_each_entry(op, &vops->list, link)
+		op_commit(vops->vm, tile, pt_update_ops, op, fence);
+
+	if (pt_update_ops->needs_userptr_lock)
+		up_read(&vm->userptr.notifier_lock);
 
 	return fence;
+
+put_fence:
+	if (pt_update_ops->needs_userptr_lock)
+		up_read(&vm->userptr.notifier_lock);
+	dma_fence_put(fence);
+free_rfence:
+	kfree(rfence);
+free_ifence:
+	kfree(ifence);
+
+	return ERR_PTR(err);
 }
+
+/**
+ * xe_pt_update_ops_fini() - Finish PT update operations
+ * @tile: Tile of PT update operations
+ * @vops: VMA operations
+ *
+ * Finish PT update operations by committing to destroy page table memory
+ */
+void xe_pt_update_ops_fini(struct xe_tile *tile, struct xe_vma_ops *vops)
+{
+	struct xe_vm_pgtable_update_ops *pt_update_ops =
+		&vops->pt_update_ops[tile->id];
+	int i;
+
+	lockdep_assert_held(&vops->vm->lock);
+	xe_vm_assert_held(vops->vm);
+
+	/* FIXME: Not 100% correct */
+	for (i = 0; i < pt_update_ops->num_ops; ++i) {
+		struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[i];
+
+		if (pt_op->bind)
+			xe_pt_free_bind(pt_op->entries, pt_op->num_entries);
+	}
+	xe_bo_put_commit(&vops->pt_update_ops[tile->id].deferred);
+}
+
+/**
+ * xe_pt_update_ops_abort() - Abort PT update operations
+ * @tile: Tile of PT update operations
+ * @vops: VMA operations
+ *
+ * Abort PT update operations by unwinding internal PT state
+ */
+void xe_pt_update_ops_abort(struct xe_tile *tile, struct xe_vma_ops *vops)
+{
+	lockdep_assert_held(&vops->vm->lock);
+	xe_vm_assert_held(vops->vm);
+
+	/* FIXME: Just kill VM for now + cleanup PTs */
+	xe_bo_put_commit(&vops->pt_update_ops[tile->id].deferred);
+	xe_vm_kill(vops->vm, false);
+ }
diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h
index 71a4fbfcff43..9ab386431cad 100644
--- a/drivers/gpu/drm/xe/xe_pt.h
+++ b/drivers/gpu/drm/xe/xe_pt.h
@@ -17,6 +17,7 @@ struct xe_sync_entry;
 struct xe_tile;
 struct xe_vm;
 struct xe_vma;
+struct xe_vma_ops;
 
 /* Largest huge pte is currently 1GiB. May become device dependent. */
 #define MAX_HUGEPTE_LEVEL 2
@@ -34,14 +35,11 @@ void xe_pt_populate_empty(struct xe_tile *tile, struct xe_vm *vm,
 
 void xe_pt_destroy(struct xe_pt *pt, u32 flags, struct llist_head *deferred);
 
-struct dma_fence *
-__xe_pt_bind_vma(struct xe_tile *tile, struct xe_vma *vma, struct xe_exec_queue *q,
-		 struct xe_sync_entry *syncs, u32 num_syncs,
-		 bool rebind);
-
-struct dma_fence *
-__xe_pt_unbind_vma(struct xe_tile *tile, struct xe_vma *vma, struct xe_exec_queue *q,
-		   struct xe_sync_entry *syncs, u32 num_syncs);
+int xe_pt_update_ops_prepare(struct xe_tile *tile, struct xe_vma_ops *vops);
+struct dma_fence *xe_pt_update_ops_run(struct xe_tile *tile,
+				       struct xe_vma_ops *vops);
+void xe_pt_update_ops_fini(struct xe_tile *tile, struct xe_vma_ops *vops);
+void xe_pt_update_ops_abort(struct xe_tile *tile, struct xe_vma_ops *vops);
 
 bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
 
diff --git a/drivers/gpu/drm/xe/xe_pt_types.h b/drivers/gpu/drm/xe/xe_pt_types.h
index 2093150f461e..384cc04de719 100644
--- a/drivers/gpu/drm/xe/xe_pt_types.h
+++ b/drivers/gpu/drm/xe/xe_pt_types.h
@@ -78,6 +78,8 @@ struct xe_vm_pgtable_update {
 struct xe_vm_pgtable_update_op {
 	/** @entries: entries to update for this operation */
 	struct xe_vm_pgtable_update entries[XE_VM_MAX_LEVEL * 2 + 1];
+	/** @vma: VMA for operation, operation not valid if NULL */
+	struct xe_vma *vma;
 	/** @num_entries: number of entries for this update operation */
 	u32 num_entries;
 	/** @bind: is a bind */
@@ -86,4 +88,38 @@ struct xe_vm_pgtable_update_op {
 	bool rebind;
 };
 
+/** struct xe_vm_pgtable_update_ops: page table update operations */
+struct xe_vm_pgtable_update_ops {
+	/** @ops: operations */
+	struct xe_vm_pgtable_update_op *ops;
+	/** @deferred: deferred list to destroy PT entries */
+	struct llist_head deferred;
+	/** @q: exec queue for PT operations */
+	struct xe_exec_queue *q;
+	/** @start: start address of ops */
+	u64 start;
+	/** @last: last address of ops */
+	u64 last;
+	/** @num_ops: number of operations */
+	u32 num_ops;
+	/** @current_op: current operation */
+	u32 current_op;
+	/** @needs_userptr_lock: Needs userptr lock */
+	bool needs_userptr_lock;
+	/** @needs_invalidation: Needs invalidation */
+	bool needs_invalidation;
+	/**
+	 * @wait_vm_bookkeep: PT operations need to wait until VM is idle
+	 * (bookkeep dma-resv slots are idle) and stage all future VM activity
+	 * behind these operations (install PT operations into VM kernel
+	 * dma-resv slot).
+	 */
+	bool wait_vm_bookkeep;
+	/**
+	 * @wait_vm_kernel: PT operations need to wait until VM kernel dma-resv
+	 * slots are idle.
+	 */
+	bool wait_vm_kernel;
+};
+
 #endif
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 78cbc87c37fa..ba7df582dc0b 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -315,7 +315,7 @@ int __xe_vm_userptr_needs_repin(struct xe_vm *vm)
 
 #define XE_VM_REBIND_RETRY_TIMEOUT_MS 1000
 
-static void xe_vm_kill(struct xe_vm *vm, bool unlocked)
+void xe_vm_kill(struct xe_vm *vm, bool unlocked)
 {
 	struct xe_exec_queue *q;
 
@@ -792,7 +792,7 @@ int xe_vm_rebind(struct xe_vm *vm, bool rebind_worker)
 	struct xe_vma *vma, *next;
 	struct xe_vma_ops vops;
 	struct xe_vma_op *op, *next_op;
-	int err;
+	int err, i;
 
 	lockdep_assert_held(&vm->lock);
 	if ((xe_vm_in_lr_mode(vm) && !rebind_worker) ||
@@ -800,6 +800,8 @@ int xe_vm_rebind(struct xe_vm *vm, bool rebind_worker)
 		return 0;
 
 	xe_vma_ops_init(&vops, vm, NULL, NULL, 0);
+	for (i = 0; i < XE_MAX_TILES_PER_DEVICE; ++i)
+		vops.pt_update_ops[i].wait_vm_bookkeep = true;
 
 	xe_vm_assert_held(vm);
 	list_for_each_entry(vma, &vm->rebind_list, combined_links.rebind) {
@@ -844,6 +846,8 @@ struct dma_fence *xe_vma_rebind(struct xe_vm *vm, struct xe_vma *vma, u8 tile_ma
 	struct dma_fence *fence = NULL;
 	struct xe_vma_ops vops;
 	struct xe_vma_op *op, *next_op;
+	struct xe_tile *tile;
+	u8 id;
 	int err;
 
 	lockdep_assert_held(&vm->lock);
@@ -851,6 +855,11 @@ struct dma_fence *xe_vma_rebind(struct xe_vm *vm, struct xe_vma *vma, u8 tile_ma
 	xe_assert(vm->xe, xe_vm_in_fault_mode(vm));
 
 	xe_vma_ops_init(&vops, vm, NULL, NULL, 0);
+	for_each_tile(tile, vm->xe, id) {
+		vops.pt_update_ops[id].wait_vm_bookkeep = true;
+		vops.pt_update_ops[tile->id].q =
+			xe_tile_migrate_exec_queue(tile);
+	}
 
 	err = xe_vm_ops_add_rebind(&vops, vma, tile_mask);
 	if (err)
@@ -1691,147 +1700,6 @@ to_wait_exec_queue(struct xe_vm *vm, struct xe_exec_queue *q)
 	return q ? q : vm->q[0];
 }
 
-static struct dma_fence *
-xe_vm_unbind_vma(struct xe_vma *vma, struct xe_exec_queue *q,
-		 struct xe_sync_entry *syncs, u32 num_syncs,
-		 bool first_op, bool last_op)
-{
-	struct xe_vm *vm = xe_vma_vm(vma);
-	struct xe_exec_queue *wait_exec_queue = to_wait_exec_queue(vm, q);
-	struct xe_tile *tile;
-	struct dma_fence *fence = NULL;
-	struct dma_fence **fences = NULL;
-	struct dma_fence_array *cf = NULL;
-	int cur_fence = 0;
-	int number_tiles = hweight8(vma->tile_present);
-	int err;
-	u8 id;
-
-	trace_xe_vma_unbind(vma);
-
-	if (number_tiles > 1) {
-		fences = kmalloc_array(number_tiles, sizeof(*fences),
-				       GFP_KERNEL);
-		if (!fences)
-			return ERR_PTR(-ENOMEM);
-	}
-
-	for_each_tile(tile, vm->xe, id) {
-		if (!(vma->tile_present & BIT(id)))
-			goto next;
-
-		fence = __xe_pt_unbind_vma(tile, vma, q ? q : vm->q[id],
-					   first_op ? syncs : NULL,
-					   first_op ? num_syncs : 0);
-		if (IS_ERR(fence)) {
-			err = PTR_ERR(fence);
-			goto err_fences;
-		}
-
-		if (fences)
-			fences[cur_fence++] = fence;
-
-next:
-		if (q && vm->pt_root[id] && !list_empty(&q->multi_gt_list))
-			q = list_next_entry(q, multi_gt_list);
-	}
-
-	if (fences) {
-		cf = dma_fence_array_create(number_tiles, fences,
-					    vm->composite_fence_ctx,
-					    vm->composite_fence_seqno++,
-					    false);
-		if (!cf) {
-			--vm->composite_fence_seqno;
-			err = -ENOMEM;
-			goto err_fences;
-		}
-	}
-
-	fence = cf ? &cf->base : !fence ?
-		xe_exec_queue_last_fence_get(wait_exec_queue, vm) : fence;
-
-	return fence;
-
-err_fences:
-	if (fences) {
-		while (cur_fence)
-			dma_fence_put(fences[--cur_fence]);
-		kfree(fences);
-	}
-
-	return ERR_PTR(err);
-}
-
-static struct dma_fence *
-xe_vm_bind_vma(struct xe_vma *vma, struct xe_exec_queue *q,
-	       struct xe_sync_entry *syncs, u32 num_syncs,
-	       u8 tile_mask, bool first_op, bool last_op)
-{
-	struct xe_tile *tile;
-	struct dma_fence *fence;
-	struct dma_fence **fences = NULL;
-	struct dma_fence_array *cf = NULL;
-	struct xe_vm *vm = xe_vma_vm(vma);
-	int cur_fence = 0;
-	int number_tiles = hweight8(tile_mask);
-	int err;
-	u8 id;
-
-	trace_xe_vma_bind(vma);
-
-	if (number_tiles > 1) {
-		fences = kmalloc_array(number_tiles, sizeof(*fences),
-				       GFP_KERNEL);
-		if (!fences)
-			return ERR_PTR(-ENOMEM);
-	}
-
-	for_each_tile(tile, vm->xe, id) {
-		if (!(tile_mask & BIT(id)))
-			goto next;
-
-		fence = __xe_pt_bind_vma(tile, vma, q ? q : vm->q[id],
-					 first_op ? syncs : NULL,
-					 first_op ? num_syncs : 0,
-					 vma->tile_present & BIT(id));
-		if (IS_ERR(fence)) {
-			err = PTR_ERR(fence);
-			goto err_fences;
-		}
-
-		if (fences)
-			fences[cur_fence++] = fence;
-
-next:
-		if (q && vm->pt_root[id] && !list_empty(&q->multi_gt_list))
-			q = list_next_entry(q, multi_gt_list);
-	}
-
-	if (fences) {
-		cf = dma_fence_array_create(number_tiles, fences,
-					    vm->composite_fence_ctx,
-					    vm->composite_fence_seqno++,
-					    false);
-		if (!cf) {
-			--vm->composite_fence_seqno;
-			err = -ENOMEM;
-			goto err_fences;
-		}
-	}
-
-	return cf ? &cf->base : fence;
-
-err_fences:
-	if (fences) {
-		while (cur_fence)
-			dma_fence_put(fences[--cur_fence]);
-		kfree(fences);
-	}
-
-	return ERR_PTR(err);
-}
-
 static struct xe_user_fence *
 find_ufence_get(struct xe_sync_entry *syncs, u32 num_syncs)
 {
@@ -1847,48 +1715,6 @@ find_ufence_get(struct xe_sync_entry *syncs, u32 num_syncs)
 	return NULL;
 }
 
-static struct dma_fence *
-xe_vm_bind(struct xe_vm *vm, struct xe_vma *vma, struct xe_exec_queue *q,
-	   struct xe_bo *bo, struct xe_sync_entry *syncs, u32 num_syncs,
-	   u8 tile_mask, bool immediate, bool first_op, bool last_op)
-{
-	struct dma_fence *fence;
-	struct xe_exec_queue *wait_exec_queue = to_wait_exec_queue(vm, q);
-
-	xe_vm_assert_held(vm);
-	xe_bo_assert_held(bo);
-
-	if (immediate) {
-		fence = xe_vm_bind_vma(vma, q, syncs, num_syncs, tile_mask,
-				       first_op, last_op);
-		if (IS_ERR(fence))
-			return fence;
-	} else {
-		xe_assert(vm->xe, xe_vm_in_fault_mode(vm));
-
-		fence = xe_exec_queue_last_fence_get(wait_exec_queue, vm);
-	}
-
-	return fence;
-}
-
-static struct dma_fence *
-xe_vm_unbind(struct xe_vm *vm, struct xe_vma *vma,
-	     struct xe_exec_queue *q, struct xe_sync_entry *syncs,
-	     u32 num_syncs, bool first_op, bool last_op)
-{
-	struct dma_fence *fence;
-
-	xe_vm_assert_held(vm);
-	xe_bo_assert_held(xe_vma_bo(vma));
-
-	fence = xe_vm_unbind_vma(vma, q, syncs, num_syncs, first_op, last_op);
-	if (IS_ERR(fence))
-		return fence;
-
-	return fence;
-}
-
 #define ALL_DRM_XE_VM_CREATE_FLAGS (DRM_XE_VM_CREATE_FLAG_SCRATCH_PAGE | \
 				    DRM_XE_VM_CREATE_FLAG_LR_MODE | \
 				    DRM_XE_VM_CREATE_FLAG_FAULT_MODE)
@@ -2029,21 +1855,6 @@ static const u32 region_to_mem_type[] = {
 	XE_PL_VRAM1,
 };
 
-static struct dma_fence *
-xe_vm_prefetch(struct xe_vm *vm, struct xe_vma *vma,
-	       struct xe_exec_queue *q, struct xe_sync_entry *syncs,
-	       u32 num_syncs, bool first_op, bool last_op)
-{
-	struct xe_exec_queue *wait_exec_queue = to_wait_exec_queue(vm, q);
-
-	if (vma->tile_mask != (vma->tile_present & ~vma->tile_invalidated)) {
-		return xe_vm_bind(vm, vma, q, xe_vma_bo(vma), syncs, num_syncs,
-				  vma->tile_mask, true, first_op, last_op);
-	} else {
-		return xe_exec_queue_last_fence_get(wait_exec_queue, vm);
-	}
-}
-
 static void prep_vma_destroy(struct xe_vm *vm, struct xe_vma *vma,
 			     bool post_commit)
 {
@@ -2332,13 +2143,10 @@ static int xe_vma_op_commit(struct xe_vm *vm, struct xe_vma_op *op)
 	return err;
 }
 
-static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
-				   struct drm_gpuva_ops *ops,
-				   struct xe_sync_entry *syncs, u32 num_syncs,
-				   struct xe_vma_ops *vops, bool last)
+static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
+				   struct xe_vma_ops *vops)
 {
 	struct xe_device *xe = vm->xe;
-	struct xe_vma_op *last_op = NULL;
 	struct drm_gpuva_op *__op;
 	struct xe_tile *tile;
 	u8 id, tile_mask = 0;
@@ -2352,19 +2160,10 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
 	drm_gpuva_for_each_op(__op, ops) {
 		struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
 		struct xe_vma *vma;
-		bool first = list_empty(&vops->list);
 		unsigned int flags = 0;
 
 		INIT_LIST_HEAD(&op->link);
 		list_add_tail(&op->link, &vops->list);
-
-		if (first) {
-			op->flags |= XE_VMA_OP_FIRST;
-			op->num_syncs = num_syncs;
-			op->syncs = syncs;
-		}
-
-		op->q = q;
 		op->tile_mask = tile_mask;
 
 		switch (op->base.op) {
@@ -2477,197 +2276,21 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
 		}
 		case DRM_GPUVA_OP_UNMAP:
 		case DRM_GPUVA_OP_PREFETCH:
+			/* FIXME: Need to skip some prefetch ops */
 			xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
 			break;
 		default:
 			drm_warn(&vm->xe->drm, "NOT POSSIBLE");
 		}
 
-		last_op = op;
-
 		err = xe_vma_op_commit(vm, op);
 		if (err)
 			return err;
 	}
 
-	/* FIXME: Unhandled corner case */
-	XE_WARN_ON(!last_op && last && !list_empty(&vops->list));
-
-	if (!last_op)
-		return 0;
-
-	if (last) {
-		last_op->flags |= XE_VMA_OP_LAST;
-		last_op->num_syncs = num_syncs;
-		last_op->syncs = syncs;
-	}
-
 	return 0;
 }
 
-static struct dma_fence *op_execute(struct xe_vm *vm, struct xe_vma *vma,
-				    struct xe_vma_op *op)
-{
-	struct dma_fence *fence = NULL;
-
-	lockdep_assert_held(&vm->lock);
-
-	xe_vm_assert_held(vm);
-	xe_bo_assert_held(xe_vma_bo(vma));
-
-	switch (op->base.op) {
-	case DRM_GPUVA_OP_MAP:
-		fence = xe_vm_bind(vm, vma, op->q, xe_vma_bo(vma),
-				   op->syncs, op->num_syncs,
-				   op->tile_mask,
-				   op->map.immediate || !xe_vm_in_fault_mode(vm),
-				   op->flags & XE_VMA_OP_FIRST,
-				   op->flags & XE_VMA_OP_LAST);
-		break;
-	case DRM_GPUVA_OP_REMAP:
-	{
-		bool prev = !!op->remap.prev;
-		bool next = !!op->remap.next;
-
-		if (!op->remap.unmap_done) {
-			if (prev || next)
-				vma->gpuva.flags |= XE_VMA_FIRST_REBIND;
-			fence = xe_vm_unbind(vm, vma, op->q, op->syncs,
-					     op->num_syncs,
-					     op->flags & XE_VMA_OP_FIRST,
-					     op->flags & XE_VMA_OP_LAST &&
-					     !prev && !next);
-			if (IS_ERR(fence))
-				break;
-			op->remap.unmap_done = true;
-		}
-
-		if (prev) {
-			op->remap.prev->gpuva.flags |= XE_VMA_LAST_REBIND;
-			dma_fence_put(fence);
-			fence = xe_vm_bind(vm, op->remap.prev, op->q,
-					   xe_vma_bo(op->remap.prev), op->syncs,
-					   op->num_syncs,
-					   op->remap.prev->tile_mask, true,
-					   false,
-					   op->flags & XE_VMA_OP_LAST && !next);
-			op->remap.prev->gpuva.flags &= ~XE_VMA_LAST_REBIND;
-			if (IS_ERR(fence))
-				break;
-			op->remap.prev = NULL;
-		}
-
-		if (next) {
-			op->remap.next->gpuva.flags |= XE_VMA_LAST_REBIND;
-			dma_fence_put(fence);
-			fence = xe_vm_bind(vm, op->remap.next, op->q,
-					   xe_vma_bo(op->remap.next),
-					   op->syncs, op->num_syncs,
-					   op->remap.next->tile_mask, true,
-					   false, op->flags & XE_VMA_OP_LAST);
-			op->remap.next->gpuva.flags &= ~XE_VMA_LAST_REBIND;
-			if (IS_ERR(fence))
-				break;
-			op->remap.next = NULL;
-		}
-
-		break;
-	}
-	case DRM_GPUVA_OP_UNMAP:
-		fence = xe_vm_unbind(vm, vma, op->q, op->syncs,
-				     op->num_syncs, op->flags & XE_VMA_OP_FIRST,
-				     op->flags & XE_VMA_OP_LAST);
-		break;
-	case DRM_GPUVA_OP_PREFETCH:
-		fence = xe_vm_prefetch(vm, vma, op->q, op->syncs, op->num_syncs,
-				       op->flags & XE_VMA_OP_FIRST,
-				       op->flags & XE_VMA_OP_LAST);
-		break;
-	default:
-		drm_warn(&vm->xe->drm, "NOT POSSIBLE");
-	}
-
-	if (IS_ERR(fence))
-		trace_xe_vma_fail(vma);
-
-	return fence;
-}
-
-static struct dma_fence *
-__xe_vma_op_execute(struct xe_vm *vm, struct xe_vma *vma,
-		    struct xe_vma_op *op)
-{
-	struct dma_fence *fence;
-	int err;
-
-retry_userptr:
-	fence = op_execute(vm, vma, op);
-	if (IS_ERR(fence) && PTR_ERR(fence) == -EAGAIN) {
-		lockdep_assert_held_write(&vm->lock);
-
-		if (op->base.op == DRM_GPUVA_OP_REMAP) {
-			if (!op->remap.unmap_done)
-				vma = gpuva_to_vma(op->base.remap.unmap->va);
-			else if (op->remap.prev)
-				vma = op->remap.prev;
-			else
-				vma = op->remap.next;
-		}
-
-		if (xe_vma_is_userptr(vma)) {
-			err = xe_vma_userptr_pin_pages(to_userptr_vma(vma));
-			if (!err)
-				goto retry_userptr;
-
-			fence = ERR_PTR(err);
-			trace_xe_vma_fail(vma);
-		}
-	}
-
-	return fence;
-}
-
-static struct dma_fence *
-xe_vma_op_execute(struct xe_vm *vm, struct xe_vma_op *op)
-{
-	struct dma_fence *fence = ERR_PTR(-ENOMEM);
-
-	lockdep_assert_held(&vm->lock);
-
-	switch (op->base.op) {
-	case DRM_GPUVA_OP_MAP:
-		fence = __xe_vma_op_execute(vm, op->map.vma, op);
-		break;
-	case DRM_GPUVA_OP_REMAP:
-	{
-		struct xe_vma *vma;
-
-		if (!op->remap.unmap_done)
-			vma = gpuva_to_vma(op->base.remap.unmap->va);
-		else if (op->remap.prev)
-			vma = op->remap.prev;
-		else
-			vma = op->remap.next;
-
-		fence = __xe_vma_op_execute(vm, vma, op);
-		break;
-	}
-	case DRM_GPUVA_OP_UNMAP:
-		fence = __xe_vma_op_execute(vm, gpuva_to_vma(op->base.unmap.va),
-					    op);
-		break;
-	case DRM_GPUVA_OP_PREFETCH:
-		fence = __xe_vma_op_execute(vm,
-					    gpuva_to_vma(op->base.prefetch.va),
-					    op);
-		break;
-	default:
-		drm_warn(&vm->xe->drm, "NOT POSSIBLE");
-	}
-
-	return fence;
-}
-
 static void xe_vma_op_unwind(struct xe_vm *vm, struct xe_vma_op *op,
 			     bool post_commit, bool prev_post_commit,
 			     bool next_post_commit)
@@ -2853,23 +2476,110 @@ static int vm_bind_ioctl_ops_lock_and_prep(struct drm_exec *exec,
 	return 0;
 }
 
+static int vm_ops_setup_tile_args(struct xe_vm *vm, struct xe_vma_ops *vops)
+{
+	struct xe_exec_queue *q = vops->q;
+	struct xe_tile *tile;
+	int number_tiles = 0;
+	u8 id;
+
+	for_each_tile(tile, vm->xe, id) {
+		if (vops->pt_update_ops[id].num_ops)
+			++number_tiles;
+
+		if (vops->pt_update_ops[id].q)
+			continue;
+
+		if (q) {
+			vops->pt_update_ops[id].q = q;
+			if (vm->pt_root[id] && !list_empty(&q->multi_gt_list))
+				q = list_next_entry(q, multi_gt_list);
+		} else {
+			vops->pt_update_ops[id].q = vm->q[id];
+		}
+	}
+
+	return number_tiles;
+}
+
 static struct dma_fence *ops_execute(struct xe_vm *vm,
 				     struct xe_vma_ops *vops)
 {
-	struct xe_vma_op *op, *next;
+	struct xe_tile *tile;
 	struct dma_fence *fence = NULL;
+	struct dma_fence **fences = NULL;
+	struct dma_fence_array *cf = NULL;
+	int number_tiles = 0, current_fence = 0, err;
+	u8 id;
 
-	list_for_each_entry_safe(op, next, &vops->list, link) {
-		dma_fence_put(fence);
-		fence = xe_vma_op_execute(vm, op);
-		if (IS_ERR(fence)) {
-			drm_warn(&vm->xe->drm, "VM op(%d) failed with %ld",
-				 op->base.op, PTR_ERR(fence));
-			fence = ERR_PTR(-ENOSPC);
-			break;
+	number_tiles = vm_ops_setup_tile_args(vm, vops);
+	if (number_tiles == 0)
+		return ERR_PTR(-ENODATA);
+
+	if (number_tiles > 1) {
+		fences = kmalloc_array(number_tiles, sizeof(*fences),
+				       GFP_KERNEL);
+		if (!fences)
+			return ERR_PTR(-ENOMEM);
+	}
+
+	for_each_tile(tile, vm->xe, id) {
+		if (!vops->pt_update_ops[id].num_ops)
+			continue;
+
+		err = xe_pt_update_ops_prepare(tile, vops);
+		if (err) {
+			fence = ERR_PTR(err);
+			goto err_out;
 		}
 	}
 
+	for_each_tile(tile, vm->xe, id) {
+		if (!vops->pt_update_ops[id].num_ops)
+			continue;
+
+		fence = xe_pt_update_ops_run(tile, vops);
+		if (IS_ERR(fence))
+			goto err_out;
+
+		if (fences)
+			fences[current_fence++] = fence;
+	}
+
+	if (fences) {
+		cf = dma_fence_array_create(number_tiles, fences,
+					    vm->composite_fence_ctx,
+					    vm->composite_fence_seqno++,
+					    false);
+		if (!cf) {
+			--vm->composite_fence_seqno;
+			fence = ERR_PTR(-ENOMEM);
+			goto err_out;
+		}
+		fence = &cf->base;
+	}
+
+	for_each_tile(tile, vm->xe, id) {
+		if (!vops->pt_update_ops[id].num_ops)
+			continue;
+
+		xe_pt_update_ops_fini(tile, vops);
+	}
+
+	return fence;
+
+err_out:
+	for_each_tile(tile, vm->xe, id) {
+		if (!vops->pt_update_ops[id].num_ops)
+			continue;
+
+		xe_pt_update_ops_abort(tile, vops);
+	}
+	while (current_fence)
+		dma_fence_put(fences[--current_fence]);
+	kfree(fences);
+	kfree(cf);
+
 	return fence;
 }
 
@@ -2950,12 +2660,10 @@ static int vm_bind_ioctl_ops_execute(struct xe_vm *vm,
 		fence = ops_execute(vm, vops);
 		if (IS_ERR(fence)) {
 			err = PTR_ERR(fence);
-			/* FIXME: Killing VM rather than proper error handling */
-			xe_vm_kill(vm, false);
 			goto unlock;
-		} else {
-			vm_bind_ioctl_ops_fini(vm, vops, fence);
 		}
+
+		vm_bind_ioctl_ops_fini(vm, vops, fence);
 	}
 
 unlock:
@@ -3312,8 +3020,7 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 			goto unwind_ops;
 		}
 
-		err = vm_bind_ioctl_ops_parse(vm, q, ops[i], syncs, num_syncs,
-					      &vops, i == args->num_binds - 1);
+		err = vm_bind_ioctl_ops_parse(vm, ops[i], &vops);
 		if (err)
 			goto unwind_ops;
 	}
diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
index b481608b12f1..c864dba35e1d 100644
--- a/drivers/gpu/drm/xe/xe_vm.h
+++ b/drivers/gpu/drm/xe/xe_vm.h
@@ -259,6 +259,8 @@ static inline struct dma_resv *xe_vm_resv(struct xe_vm *vm)
 	return drm_gpuvm_resv(&vm->gpuvm);
 }
 
+void xe_vm_kill(struct xe_vm *vm, bool unlocked);
+
 /**
  * xe_vm_assert_held(vm) - Assert that the vm's reservation object is held.
  * @vm: The vm
diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
index 211c88801182..27d651093d30 100644
--- a/drivers/gpu/drm/xe/xe_vm_types.h
+++ b/drivers/gpu/drm/xe/xe_vm_types.h
@@ -26,14 +26,12 @@ struct xe_vm_pgtable_update_op;
 #define XE_VMA_READ_ONLY	DRM_GPUVA_USERBITS
 #define XE_VMA_DESTROYED	(DRM_GPUVA_USERBITS << 1)
 #define XE_VMA_ATOMIC_PTE_BIT	(DRM_GPUVA_USERBITS << 2)
-#define XE_VMA_FIRST_REBIND	(DRM_GPUVA_USERBITS << 3)
-#define XE_VMA_LAST_REBIND	(DRM_GPUVA_USERBITS << 4)
-#define XE_VMA_PTE_4K		(DRM_GPUVA_USERBITS << 5)
-#define XE_VMA_PTE_2M		(DRM_GPUVA_USERBITS << 6)
-#define XE_VMA_PTE_1G		(DRM_GPUVA_USERBITS << 7)
-#define XE_VMA_PTE_64K		(DRM_GPUVA_USERBITS << 8)
-#define XE_VMA_PTE_COMPACT	(DRM_GPUVA_USERBITS << 9)
-#define XE_VMA_DUMPABLE		(DRM_GPUVA_USERBITS << 10)
+#define XE_VMA_PTE_4K		(DRM_GPUVA_USERBITS << 3)
+#define XE_VMA_PTE_2M		(DRM_GPUVA_USERBITS << 4)
+#define XE_VMA_PTE_1G		(DRM_GPUVA_USERBITS << 5)
+#define XE_VMA_PTE_64K		(DRM_GPUVA_USERBITS << 6)
+#define XE_VMA_PTE_COMPACT	(DRM_GPUVA_USERBITS << 7)
+#define XE_VMA_DUMPABLE		(DRM_GPUVA_USERBITS << 8)
 
 /** struct xe_userptr - User pointer */
 struct xe_userptr {
@@ -100,6 +98,9 @@ struct xe_vma {
 	 */
 	u8 tile_present;
 
+	/** @tile_staged: bind is staged for this VMA */
+	u8 tile_staged;
+
 	/**
 	 * @pat_index: The pat index to use when encoding the PTEs for this vma.
 	 */
@@ -315,31 +316,18 @@ struct xe_vma_op_prefetch {
 
 /** enum xe_vma_op_flags - flags for VMA operation */
 enum xe_vma_op_flags {
-	/** @XE_VMA_OP_FIRST: first VMA operation for a set of syncs */
-	XE_VMA_OP_FIRST			= BIT(0),
-	/** @XE_VMA_OP_LAST: last VMA operation for a set of syncs */
-	XE_VMA_OP_LAST			= BIT(1),
 	/** @XE_VMA_OP_COMMITTED: VMA operation committed */
-	XE_VMA_OP_COMMITTED		= BIT(2),
+	XE_VMA_OP_COMMITTED		= BIT(0),
 	/** @XE_VMA_OP_PREV_COMMITTED: Previous VMA operation committed */
-	XE_VMA_OP_PREV_COMMITTED	= BIT(3),
+	XE_VMA_OP_PREV_COMMITTED	= BIT(1),
 	/** @XE_VMA_OP_NEXT_COMMITTED: Next VMA operation committed */
-	XE_VMA_OP_NEXT_COMMITTED	= BIT(4),
+	XE_VMA_OP_NEXT_COMMITTED	= BIT(2),
 };
 
 /** struct xe_vma_op - VMA operation */
 struct xe_vma_op {
 	/** @base: GPUVA base operation */
 	struct drm_gpuva_op base;
-	/** @q: exec queue for this operation */
-	struct xe_exec_queue *q;
-	/**
-	 * @syncs: syncs for this operation, only used on first and last
-	 * operation
-	 */
-	struct xe_sync_entry *syncs;
-	/** @num_syncs: number of syncs */
-	u32 num_syncs;
 	/** @link: async operation link */
 	struct list_head link;
 	/** @flags: operation flags */
@@ -363,19 +351,14 @@ struct xe_vma_ops {
 	struct list_head list;
 	/** @vm: VM */
 	struct xe_vm *vm;
-	/** @q: exec queue these operations */
+	/** @q: exec queue for VMA operations */
 	struct xe_exec_queue *q;
 	/** @syncs: syncs these operation */
 	struct xe_sync_entry *syncs;
 	/** @num_syncs: number of syncs */
 	u32 num_syncs;
 	/** @pt_update_ops: page table update operations */
-	struct {
-		/** @ops: operations */
-		struct xe_vm_pgtable_update_op *ops;
-		/** @num_ops: number of operations */
-		u32 num_ops;
-	} pt_update_ops[XE_MAX_TILES_PER_DEVICE];
+	struct xe_vm_pgtable_update_ops pt_update_ops[XE_MAX_TILES_PER_DEVICE];
 };
 
 #endif
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [CI 18/42] drm/xe: Update VM trace events
  2024-06-13  4:23 [CI 01/42] mm/hmm: let users to tag specific PFNs Oak Zeng
                   ` (16 preceding siblings ...)
  2024-06-13  4:24 ` [CI 17/42] drm/xe: Convert multiple bind ops into single job Oak Zeng
@ 2024-06-13  4:24 ` Oak Zeng
  2024-06-13  4:24 ` [CI 19/42] drm/xe: Update PT layer with better error handling Oak Zeng
                   ` (23 subsequent siblings)
  41 siblings, 0 replies; 44+ messages in thread
From: Oak Zeng @ 2024-06-13  4:24 UTC (permalink / raw)
  To: intel-xe

From: Matthew Brost <matthew.brost@intel.com>

The trace events have changed with the move to a single job per VM bind
IOCTL; update the trace events to align with the old behavior as much as
possible.

Cc: Oak Zeng <oak.zeng@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_trace.h | 10 ++++-----
 drivers/gpu/drm/xe/xe_vm.c    | 42 +++++++++++++++++++++++++++++++++--
 2 files changed, 45 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_trace.h b/drivers/gpu/drm/xe/xe_trace.h
index e4cba64474e6..3d7949d26de3 100644
--- a/drivers/gpu/drm/xe/xe_trace.h
+++ b/drivers/gpu/drm/xe/xe_trace.h
@@ -426,11 +426,6 @@ DEFINE_EVENT(xe_vma, xe_vma_acc,
 	     TP_ARGS(vma)
 );
 
-DEFINE_EVENT(xe_vma, xe_vma_fail,
-	     TP_PROTO(struct xe_vma *vma),
-	     TP_ARGS(vma)
-);
-
 DEFINE_EVENT(xe_vma, xe_vma_bind,
 	     TP_PROTO(struct xe_vma *vma),
 	     TP_ARGS(vma)
@@ -544,6 +539,11 @@ DEFINE_EVENT(xe_vm, xe_vm_rebind_worker_exit,
 	     TP_ARGS(vm)
 );
 
+DEFINE_EVENT(xe_vm, xe_vm_ops_fail,
+	     TP_PROTO(struct xe_vm *vm),
+	     TP_ARGS(vm)
+);
+
 /* GuC */
 DECLARE_EVENT_CLASS(xe_guc_ct_flow_control,
 		    TP_PROTO(u32 _head, u32 _tail, u32 size, u32 space, u32 len),
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index ba7df582dc0b..b4fc8cf3f6a8 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -2476,6 +2476,38 @@ static int vm_bind_ioctl_ops_lock_and_prep(struct drm_exec *exec,
 	return 0;
 }
 
+static void op_trace(struct xe_vma_op *op)
+{
+	switch (op->base.op) {
+	case DRM_GPUVA_OP_MAP:
+		trace_xe_vma_bind(op->map.vma);
+		break;
+	case DRM_GPUVA_OP_REMAP:
+		trace_xe_vma_unbind(gpuva_to_vma(op->base.remap.unmap->va));
+		if (op->remap.prev)
+			trace_xe_vma_bind(op->remap.prev);
+		if (op->remap.next)
+			trace_xe_vma_bind(op->remap.next);
+		break;
+	case DRM_GPUVA_OP_UNMAP:
+		trace_xe_vma_unbind(gpuva_to_vma(op->base.unmap.va));
+		break;
+	case DRM_GPUVA_OP_PREFETCH:
+		trace_xe_vma_bind(gpuva_to_vma(op->base.prefetch.va));
+		break;
+	default:
+		XE_WARN_ON("NOT POSSIBLE");
+	}
+}
+
+static void trace_xe_vm_ops_execute(struct xe_vma_ops *vops)
+{
+	struct xe_vma_op *op;
+
+	list_for_each_entry(op, &vops->list, link)
+		op_trace(op);
+}
+
 static int vm_ops_setup_tile_args(struct xe_vm *vm, struct xe_vma_ops *vops)
 {
 	struct xe_exec_queue *q = vops->q;
@@ -2519,8 +2551,10 @@ static struct dma_fence *ops_execute(struct xe_vm *vm,
 	if (number_tiles > 1) {
 		fences = kmalloc_array(number_tiles, sizeof(*fences),
 				       GFP_KERNEL);
-		if (!fences)
-			return ERR_PTR(-ENOMEM);
+		if (!fences) {
+			fence = ERR_PTR(-ENOMEM);
+			goto err_trace;
+		}
 	}
 
 	for_each_tile(tile, vm->xe, id) {
@@ -2534,6 +2568,8 @@ static struct dma_fence *ops_execute(struct xe_vm *vm,
 		}
 	}
 
+	trace_xe_vm_ops_execute(vops);
+
 	for_each_tile(tile, vm->xe, id) {
 		if (!vops->pt_update_ops[id].num_ops)
 			continue;
@@ -2580,6 +2616,8 @@ static struct dma_fence *ops_execute(struct xe_vm *vm,
 	kfree(fences);
 	kfree(cf);
 
+err_trace:
+	trace_xe_vm_ops_fail(vm);
 	return fence;
 }
 
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [CI 19/42] drm/xe: Update PT layer with better error handling
  2024-06-13  4:23 [CI 01/42] mm/hmm: let users to tag specific PFNs Oak Zeng
                   ` (17 preceding siblings ...)
  2024-06-13  4:24 ` [CI 18/42] drm/xe: Update VM trace events Oak Zeng
@ 2024-06-13  4:24 ` Oak Zeng
  2024-06-13  4:24 ` [CI 20/42] drm/xe: Retry BO allocation Oak Zeng
                   ` (22 subsequent siblings)
  41 siblings, 0 replies; 44+ messages in thread
From: Oak Zeng @ 2024-06-13  4:24 UTC (permalink / raw)
  To: intel-xe

From: Matthew Brost <matthew.brost@intel.com>

Update the PT layer so that if a memory allocation for a PTE fails, the
error can be propagated to the user without requiring the VM to be
killed.
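
A rough outline of the reworked flow (a sketch derived from the diff
below, not an exact call trace):

	xe_pt_update_ops_prepare()   /* stage the new PTEs, remember the old ones */
	xe_pt_update_ops_run()
	    /* allocate fences / create the PT update job - may still fail cleanly */
	    xe_pt_commit()           /* point of no return, old PT entries released */
	xe_pt_update_ops_abort()     /* on failure before the commit, put the old
	                                PTEs back via xe_pt_abort_bind() /
	                                xe_pt_abort_unbind() */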

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_pt.c | 208 ++++++++++++++++++++++++++++---------
 1 file changed, 160 insertions(+), 48 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
index 4353a3bdf6c8..e4dd385ee954 100644
--- a/drivers/gpu/drm/xe/xe_pt.c
+++ b/drivers/gpu/drm/xe/xe_pt.c
@@ -845,19 +845,27 @@ xe_vm_populate_pgtable(struct xe_migrate_pt_update *pt_update, struct xe_tile *t
 	}
 }
 
-static void xe_pt_abort_bind(struct xe_vma *vma,
-			     struct xe_vm_pgtable_update *entries,
-			     u32 num_entries)
+static void xe_pt_cancel_bind(struct xe_vma *vma,
+			      struct xe_vm_pgtable_update *entries,
+			      u32 num_entries)
 {
 	u32 i, j;
 
 	for (i = 0; i < num_entries; i++) {
-		if (!entries[i].pt_entries)
+		struct xe_pt *pt = entries[i].pt;
+
+		if (!pt)
 			continue;
 
-		for (j = 0; j < entries[i].qwords; j++)
-			xe_pt_destroy(entries[i].pt_entries[j].pt, xe_vma_vm(vma)->flags, NULL);
+		if (pt->level) {
+			for (j = 0; j < entries[i].qwords; j++)
+				xe_pt_destroy(entries[i].pt_entries[j].pt,
+					      xe_vma_vm(vma)->flags, NULL);
+		}
+
 		kfree(entries[i].pt_entries);
+		entries[i].pt_entries = NULL;
+		entries[i].qwords = 0;
 	}
 }
 
@@ -873,10 +881,61 @@ static void xe_pt_commit_locks_assert(struct xe_vma *vma)
 	xe_vm_assert_held(vm);
 }
 
-static void xe_pt_commit_bind(struct xe_vma *vma,
-			      struct xe_vm_pgtable_update *entries,
-			      u32 num_entries, bool rebind,
-			      struct llist_head *deferred)
+static void xe_pt_commit(struct xe_vma *vma,
+			 struct xe_vm_pgtable_update *entries,
+			 u32 num_entries, struct llist_head *deferred)
+{
+	u32 i, j;
+
+	xe_pt_commit_locks_assert(vma);
+
+	for (i = 0; i < num_entries; i++) {
+		struct xe_pt *pt = entries[i].pt;
+
+		if (!pt->level)
+			continue;
+
+		for (j = 0; j < entries[i].qwords; j++) {
+			struct xe_pt *oldpte = entries[i].pt_entries[j].pt;
+
+			xe_pt_destroy(oldpte, xe_vma_vm(vma)->flags, deferred);
+		}
+	}
+}
+
+static void xe_pt_abort_bind(struct xe_vma *vma,
+			     struct xe_vm_pgtable_update *entries,
+			     u32 num_entries, bool rebind)
+{
+	int i, j;
+
+	xe_pt_commit_locks_assert(vma);
+
+	for (i = num_entries - 1; i >= 0; --i) {
+		struct xe_pt *pt = entries[i].pt;
+		struct xe_pt_dir *pt_dir;
+
+		if (!rebind)
+			pt->num_live -= entries[i].qwords;
+
+		if (!pt->level)
+			continue;
+
+		pt_dir = as_xe_pt_dir(pt);
+		for (j = 0; j < entries[i].qwords; j++) {
+			u32 j_ = j + entries[i].ofs;
+			struct xe_pt *newpte = xe_pt_entry(pt_dir, j_);
+			struct xe_pt *oldpte = entries[i].pt_entries[j].pt;
+
+			pt_dir->children[j_] = oldpte ? &oldpte->base : 0;
+			xe_pt_destroy(newpte, xe_vma_vm(vma)->flags, NULL);
+		}
+	}
+}
+
+static void xe_pt_commit_prepare_bind(struct xe_vma *vma,
+				      struct xe_vm_pgtable_update *entries,
+				      u32 num_entries, bool rebind)
 {
 	u32 i, j;
 
@@ -896,12 +955,13 @@ static void xe_pt_commit_bind(struct xe_vma *vma,
 		for (j = 0; j < entries[i].qwords; j++) {
 			u32 j_ = j + entries[i].ofs;
 			struct xe_pt *newpte = entries[i].pt_entries[j].pt;
+			struct xe_pt *oldpte = NULL;
 
 			if (xe_pt_entry(pt_dir, j_))
-				xe_pt_destroy(xe_pt_entry(pt_dir, j_),
-					      xe_vma_vm(vma)->flags, deferred);
+				oldpte = xe_pt_entry(pt_dir, j_);
 
 			pt_dir->children[j_] = &newpte->base;
+			entries[i].pt_entries[j].pt = oldpte;
 		}
 	}
 }
@@ -925,8 +985,6 @@ xe_pt_prepare_bind(struct xe_tile *tile, struct xe_vma *vma,
 	err = xe_pt_stage_bind(tile, vma, entries, num_entries);
 	if (!err)
 		xe_tile_assert(tile, *num_entries);
-	else /* abort! */
-		xe_pt_abort_bind(vma, entries, *num_entries);
 
 	return err;
 }
@@ -1447,7 +1505,7 @@ xe_pt_stage_unbind_post_descend(struct xe_ptw *parent, pgoff_t offset,
 				     &end_offset))
 		return 0;
 
-	(void)xe_pt_new_shared(&xe_walk->wupd, xe_child, offset, false);
+	(void)xe_pt_new_shared(&xe_walk->wupd, xe_child, offset, true);
 	xe_walk->wupd.updates[level].update->qwords = end_offset - offset;
 
 	return 0;
@@ -1515,32 +1573,57 @@ xe_migrate_clear_pgtable_callback(struct xe_migrate_pt_update *pt_update,
 		memset64(ptr, empty, num_qwords);
 }
 
+static void xe_pt_abort_unbind(struct xe_vma *vma,
+			       struct xe_vm_pgtable_update *entries,
+			       u32 num_entries)
+{
+	int j, i;
+
+	xe_pt_commit_locks_assert(vma);
+
+	for (j = num_entries - 1; j >= 0; --j) {
+		struct xe_vm_pgtable_update *entry = &entries[j];
+		struct xe_pt *pt = entry->pt;
+		struct xe_pt_dir *pt_dir = as_xe_pt_dir(pt);
+
+		pt->num_live += entry->qwords;
+
+		if (!pt->level)
+			continue;
+
+		for (i = entry->ofs; i < entry->ofs + entry->qwords; i++)
+			pt_dir->children[i] =
+				entries[j].pt_entries[i - entry->ofs].pt ?
+				&entries[j].pt_entries[i - entry->ofs].pt->base : 0;
+	}
+}
+
 static void
-xe_pt_commit_unbind(struct xe_vma *vma,
-		    struct xe_vm_pgtable_update *entries, u32 num_entries,
-		    struct llist_head *deferred)
+xe_pt_commit_prepare_unbind(struct xe_vma *vma,
+			    struct xe_vm_pgtable_update *entries,
+			    u32 num_entries)
 {
-	u32 j;
+	int j, i;
 
 	xe_pt_commit_locks_assert(vma);
 
 	for (j = 0; j < num_entries; ++j) {
 		struct xe_vm_pgtable_update *entry = &entries[j];
 		struct xe_pt *pt = entry->pt;
+		struct xe_pt_dir *pt_dir;
 
 		pt->num_live -= entry->qwords;
-		if (pt->level) {
-			struct xe_pt_dir *pt_dir = as_xe_pt_dir(pt);
-			u32 i;
-
-			for (i = entry->ofs; i < entry->ofs + entry->qwords;
-			     i++) {
-				if (xe_pt_entry(pt_dir, i))
-					xe_pt_destroy(xe_pt_entry(pt_dir, i),
-						      xe_vma_vm(vma)->flags, deferred);
+		if (!pt->level)
+			continue;
 
-				pt_dir->children[i] = NULL;
-			}
+		pt_dir = as_xe_pt_dir(pt);
+		for (i = entry->ofs; i < entry->ofs + entry->qwords; i++) {
+			if (xe_pt_entry(pt_dir, i))
+				entries[j].pt_entries[i - entry->ofs].pt =
+					xe_pt_entry(pt_dir, i);
+			else
+				entries[j].pt_entries[i - entry->ofs].pt = NULL;
+			pt_dir->children[i] = NULL;
 		}
 	}
 }
@@ -1586,7 +1669,6 @@ static int bind_op_prepare(struct xe_vm *vm, struct xe_tile *tile,
 {
 	u32 current_op = pt_update_ops->current_op;
 	struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[current_op];
-	struct llist_head *deferred = &pt_update_ops->deferred;
 	int err;
 
 	xe_bo_assert_held(xe_vma_bo(vma));
@@ -1633,11 +1715,12 @@ static int bind_op_prepare(struct xe_vm *vm, struct xe_tile *tile,
 			(!pt_op->rebind && vm->scratch_pt[tile->id] &&
 			 xe_vm_in_preempt_fence_mode(vm));
 
-		/* FIXME: Don't commit right away */
 		vma->tile_staged |= BIT(tile->id);
 		pt_op->vma = vma;
-		xe_pt_commit_bind(vma, pt_op->entries, pt_op->num_entries,
-				  pt_op->rebind, deferred);
+		xe_pt_commit_prepare_bind(vma, pt_op->entries,
+					  pt_op->num_entries, pt_op->rebind);
+	} else {
+		xe_pt_cancel_bind(vma, pt_op->entries, pt_op->num_entries);
 	}
 
 	return err;
@@ -1649,7 +1732,6 @@ static int unbind_op_prepare(struct xe_tile *tile,
 {
 	u32 current_op = pt_update_ops->current_op;
 	struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[current_op];
-	struct llist_head *deferred = &pt_update_ops->deferred;
 	int err;
 
 	if (!((vma->tile_present | vma->tile_staged) & BIT(tile->id)))
@@ -1685,9 +1767,7 @@ static int unbind_op_prepare(struct xe_tile *tile,
 	pt_update_ops->needs_userptr_lock |= xe_vma_is_userptr(vma);
 	pt_update_ops->needs_invalidation = true;
 
-	/* FIXME: Don't commit right away */
-	xe_pt_commit_unbind(vma, pt_op->entries, pt_op->num_entries,
-			    deferred);
+	xe_pt_commit_prepare_unbind(vma, pt_op->entries, pt_op->num_entries);
 
 	return 0;
 }
@@ -1908,7 +1988,7 @@ xe_pt_update_ops_run(struct xe_tile *tile, struct xe_vma_ops *vops)
 	struct invalidation_fence *ifence = NULL;
 	struct xe_range_fence *rfence;
 	struct xe_vma_op *op;
-	int err = 0;
+	int err = 0, i;
 	struct xe_migrate_pt_update update = {
 		.ops = pt_update_ops->needs_userptr_lock ?
 			&userptr_migrate_ops :
@@ -1928,8 +2008,10 @@ xe_pt_update_ops_run(struct xe_tile *tile, struct xe_vma_ops *vops)
 
 	if (pt_update_ops->needs_invalidation) {
 		ifence = kzalloc(sizeof(*ifence), GFP_KERNEL);
-		if (!ifence)
-			return ERR_PTR(-ENOMEM);
+		if (!ifence) {
+			err = -ENOMEM;
+			goto kill_vm_tile1;
+		}
 	}
 
 	rfence = kzalloc(sizeof(*rfence), GFP_KERNEL);
@@ -1944,6 +2026,15 @@ xe_pt_update_ops_run(struct xe_tile *tile, struct xe_vma_ops *vops)
 		goto free_rfence;
 	}
 
+	/* Point of no return - VM killed if failure after this */
+	for (i = 0; i < pt_update_ops->current_op; ++i) {
+		struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[i];
+
+		xe_pt_commit(pt_op->vma, pt_op->entries,
+			     pt_op->num_entries, &pt_update_ops->deferred);
+		pt_op->vma = NULL;	/* skip in xe_pt_update_ops_abort */
+	}
+
 	err = xe_range_fence_insert(&vm->rftree[tile->id], rfence,
 				    &xe_range_fence_kfree_ops,
 				    pt_update_ops->start,
@@ -1979,10 +2070,15 @@ xe_pt_update_ops_run(struct xe_tile *tile, struct xe_vma_ops *vops)
 	if (pt_update_ops->needs_userptr_lock)
 		up_read(&vm->userptr.notifier_lock);
 	dma_fence_put(fence);
+	if (!tile->id)
+		xe_vm_kill(vops->vm, false);
 free_rfence:
 	kfree(rfence);
 free_ifence:
 	kfree(ifence);
+kill_vm_tile1:
+	if (err != -EAGAIN && tile->id)
+		xe_vm_kill(vops->vm, false);
 
 	return ERR_PTR(err);
 }
@@ -2003,12 +2099,10 @@ void xe_pt_update_ops_fini(struct xe_tile *tile, struct xe_vma_ops *vops)
 	lockdep_assert_held(&vops->vm->lock);
 	xe_vm_assert_held(vops->vm);
 
-	/* FIXME: Not 100% correct */
-	for (i = 0; i < pt_update_ops->num_ops; ++i) {
+	for (i = 0; i < pt_update_ops->current_op; ++i) {
 		struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[i];
 
-		if (pt_op->bind)
-			xe_pt_free_bind(pt_op->entries, pt_op->num_entries);
+		xe_pt_free_bind(pt_op->entries, pt_op->num_entries);
 	}
 	xe_bo_put_commit(&vops->pt_update_ops[tile->id].deferred);
 }
@@ -2022,10 +2116,28 @@ void xe_pt_update_ops_fini(struct xe_tile *tile, struct xe_vma_ops *vops)
  */
 void xe_pt_update_ops_abort(struct xe_tile *tile, struct xe_vma_ops *vops)
 {
+	struct xe_vm_pgtable_update_ops *pt_update_ops =
+		&vops->pt_update_ops[tile->id];
+	int i;
+
 	lockdep_assert_held(&vops->vm->lock);
 	xe_vm_assert_held(vops->vm);
 
-	/* FIXME: Just kill VM for now + cleanup PTs */
+	for (i = pt_update_ops->num_ops - 1; i >= 0; --i) {
+		struct xe_vm_pgtable_update_op *pt_op =
+			&pt_update_ops->ops[i];
+
+		if (!pt_op->vma || i >= pt_update_ops->current_op)
+			continue;
+
+		if (pt_op->bind)
+			xe_pt_abort_bind(pt_op->vma, pt_op->entries,
+					 pt_op->num_entries,
+					 pt_op->rebind);
+		else
+			xe_pt_abort_unbind(pt_op->vma, pt_op->entries,
+					   pt_op->num_entries);
+	}
+
 	xe_bo_put_commit(&vops->pt_update_ops[tile->id].deferred);
-	xe_vm_kill(vops->vm, false);
- }
+}
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [CI 20/42] drm/xe: Retry BO allocation
  2024-06-13  4:23 [CI 01/42] mm/hmm: let users to tag specific PFNs Oak Zeng
                   ` (18 preceding siblings ...)
  2024-06-13  4:24 ` [CI 19/42] drm/xe: Update PT layer with better error handling Oak Zeng
@ 2024-06-13  4:24 ` Oak Zeng
  2024-06-13  4:24 ` [CI 21/42] drm/xe/uapi: Add DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR flag Oak Zeng
                   ` (21 subsequent siblings)
  41 siblings, 0 replies; 44+ messages in thread
From: Oak Zeng @ 2024-06-13  4:24 UTC (permalink / raw)
  To: intel-xe

From: Matthew Brost <matthew.brost@intel.com>

TTM doesn't support fair eviction via WW locking; this is mitigated by
using retry loops in exec and the preempt rebind worker. Extend this retry
loop to BO allocation. Once TTM supports fair eviction this patch can be
reverted.
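
Condensed, the resulting path in xe_gem_create_ioctl looks roughly like
the following (sketch only; the VM unlock between allocation and the
error check, and the final error unwinding, are elided):

	ktime_t end = 0;

retry:
	if (vm) {
		err = xe_vm_lock(vm, true);
		if (err)
			goto out_vm;
	}
	bo = xe_bo_create_user(xe, NULL, vm, args->size, args->cpu_caching,
			       ttm_bo_type_device, bo_flags);
	/* ... VM is unlocked again here ... */
	if (IS_ERR(bo)) {
		err = PTR_ERR(bo);
		/* eviction contention: back off and retry until a timeout */
		if (xe_vm_validate_should_retry(NULL, err, &end))
			goto retry;
		goto out_vm;
	}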

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_bo.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
index 2bae01ce4e5b..63d7fa4c3db9 100644
--- a/drivers/gpu/drm/xe/xe_bo.c
+++ b/drivers/gpu/drm/xe/xe_bo.c
@@ -1929,6 +1929,7 @@ int xe_gem_create_ioctl(struct drm_device *dev, void *data,
 	struct xe_file *xef = to_xe_file(file);
 	struct drm_xe_gem_create *args = data;
 	struct xe_vm *vm = NULL;
+	ktime_t end = 0;
 	struct xe_bo *bo;
 	unsigned int bo_flags;
 	u32 handle;
@@ -1994,11 +1995,14 @@ int xe_gem_create_ioctl(struct drm_device *dev, void *data,
 		vm = xe_vm_lookup(xef, args->vm_id);
 		if (XE_IOCTL_DBG(xe, !vm))
 			return -ENOENT;
+	}
+
+retry:
+	if (vm) {
 		err = xe_vm_lock(vm, true);
 		if (err)
 			goto out_vm;
 	}
-
 	bo = xe_bo_create_user(xe, NULL, vm, args->size, args->cpu_caching,
 			       ttm_bo_type_device, bo_flags);
 
@@ -2007,6 +2011,8 @@ int xe_gem_create_ioctl(struct drm_device *dev, void *data,
 
 	if (IS_ERR(bo)) {
 		err = PTR_ERR(bo);
+		if (xe_vm_validate_should_retry(NULL, err, &end))
+			goto retry;
 		goto out_vm;
 	}
 
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [CI 21/42] drm/xe/uapi: Add DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR flag
  2024-06-13  4:23 [CI 01/42] mm/hmm: let users to tag specific PFNs Oak Zeng
                   ` (19 preceding siblings ...)
  2024-06-13  4:24 ` [CI 20/42] drm/xe: Retry BO allocation Oak Zeng
@ 2024-06-13  4:24 ` Oak Zeng
  2024-06-13  4:24 ` [CI 22/42] drm/xe: Add a helper to calculate userptr end address Oak Zeng
                   ` (20 subsequent siblings)
  41 siblings, 0 replies; 44+ messages in thread
From: Oak Zeng @ 2024-06-13  4:24 UTC (permalink / raw)
  To: intel-xe

From: Matthew Brost <matthew.brost@intel.com>

Add the DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR flag, which is used to
create unpopulated virtual memory areas (VMAs) without memory backing or
GPU page tables. These VMAs are referred to as system allocator VMAs.
The idea is that upon a page fault, the memory backing and GPU page
tables will be populated.

System allocator VMAs only update GPUVM state; they do not have an
internal page table (PT) state, nor do they have GPU mappings.

It is expected that system allocator VMAs will be mixed with buffer
object (BO) VMAs within a single VM. In other words, system allocations
and runtime allocations can be mixed within a single user-mode driver
(UMD) program.

Expected usage:

- Bind the entire virtual address (VA) space upon program load using the
  DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR flag and DRM_XE_VM_BIND_OP_MAP op.
- If a buffer object (BO) requires GPU mapping, reserve an address range
  using mmap, unbind the address range from system allocator using
  DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR flag and DRM_XE_VM_BIND_OP_UNMAP op,
  and bind the BO to the reserved address using existing bind IOCTLs
  (runtime allocation).
- If a BO no longer requires GPU mapping, bind the mapping address with
  the DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR flag.
- Any malloc'd address accessed by the GPU will be faulted in via the
  SVM implementation (system allocation).
- Upon freeing any malloc'd data, the SVM implementation will remove GPU
  mappings.
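
As a rough illustration of the flow above (not part of this patch;
assumes <xe_drm.h> and an open DRM fd, while vm_id, va_size, bo_addr and
bo_size are placeholders; syncs and error handling omitted), the expected
usage from a UMD might look like:

	struct drm_xe_vm_bind bind = {
		.vm_id = vm_id,
		.num_binds = 1,
		.bind = {
			.addr = 0x0,
			.range = va_size,	/* entire VA space at program load */
			.op = DRM_XE_VM_BIND_OP_MAP,
			.flags = DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR,
		},
	};
	ioctl(fd, DRM_IOCTL_XE_VM_BIND, &bind);

	/* Later, carve a range reserved with mmap out of the system
	 * allocator mapping so a BO can be bound there. */
	bind.bind.addr = bo_addr;
	bind.bind.range = bo_size;
	bind.bind.op = DRM_XE_VM_BIND_OP_UNMAP;
	bind.bind.flags = DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR;
	ioctl(fd, DRM_IOCTL_XE_VM_BIND, &bind);
	/* ...then bind the BO at bo_addr with a normal DRM_XE_VM_BIND_OP_MAP. */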

v1:
  fix commit message (Oak)
  rebase with latest xekmd, fix conflict in xe_analyze_vm (Oak)
  allow UNMAP operation for system allocator (Krishna)

FIXME: Not enforcing DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR for fault mode VMs

FIXME: Only supporting 1 to 1 mapping between user address space and
GPU address space

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Signed-off-by: Krishna Bommu <krishnaiah.bommu@intel.com>
---
 drivers/gpu/drm/xe/xe_pt.c       |  73 +++++++++++++++++----
 drivers/gpu/drm/xe/xe_vm.c       | 107 ++++++++++++++++++++-----------
 drivers/gpu/drm/xe/xe_vm.h       |   8 ++-
 drivers/gpu/drm/xe/xe_vm_types.h |   3 +
 include/uapi/drm/xe_drm.h        |  11 +++-
 5 files changed, 148 insertions(+), 54 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
index e4dd385ee954..3f6b1d8f437b 100644
--- a/drivers/gpu/drm/xe/xe_pt.c
+++ b/drivers/gpu/drm/xe/xe_pt.c
@@ -1064,6 +1064,11 @@ static int op_add_deps(struct xe_vm *vm, struct xe_vma_op *op,
 {
 	int err = 0;
 
+	/*
+	 * No need to check for is_system_allocator here as vma_add_deps is a
+	 * NOP if VMA is_system_allocator
+	 */
+
 	switch (op->base.op) {
 	case DRM_GPUVA_OP_MAP:
 		if (!op->map.immediate && xe_vm_in_fault_mode(vm))
@@ -1671,6 +1676,7 @@ static int bind_op_prepare(struct xe_vm *vm, struct xe_tile *tile,
 	struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[current_op];
 	int err;
 
+	xe_tile_assert(tile, !xe_vma_is_system_allocator(vma));
 	xe_bo_assert_held(xe_vma_bo(vma));
 
 	vm_dbg(&xe_vma_vm(vma)->xe->drm,
@@ -1737,6 +1743,7 @@ static int unbind_op_prepare(struct xe_tile *tile,
 	if (!((vma->tile_present | vma->tile_staged) & BIT(tile->id)))
 		return 0;
 
+	xe_tile_assert(tile, !xe_vma_is_system_allocator(vma));
 	xe_bo_assert_held(xe_vma_bo(vma));
 
 	vm_dbg(&xe_vma_vm(vma)->xe->drm,
@@ -1783,15 +1790,21 @@ static int op_prepare(struct xe_vm *vm,
 
 	switch (op->base.op) {
 	case DRM_GPUVA_OP_MAP:
-		if (!op->map.immediate && xe_vm_in_fault_mode(vm))
+		if ((!op->map.immediate && xe_vm_in_fault_mode(vm)) ||
+		    op->map.is_system_allocator)
 			break;
 
 		err = bind_op_prepare(vm, tile, pt_update_ops, op->map.vma);
 		pt_update_ops->wait_vm_kernel = true;
 		break;
 	case DRM_GPUVA_OP_REMAP:
-		err = unbind_op_prepare(tile, pt_update_ops,
-					gpuva_to_vma(op->base.remap.unmap->va));
+	{
+		struct xe_vma *old = gpuva_to_vma(op->base.remap.unmap->va);
+
+		if (xe_vma_is_system_allocator(old))
+			break;
+
+		err = unbind_op_prepare(tile, pt_update_ops, old);
 
 		if (!err && op->remap.prev) {
 			err = bind_op_prepare(vm, tile, pt_update_ops,
@@ -1804,15 +1817,28 @@ static int op_prepare(struct xe_vm *vm,
 			pt_update_ops->wait_vm_bookkeep = true;
 		}
 		break;
+	}
 	case DRM_GPUVA_OP_UNMAP:
-		err = unbind_op_prepare(tile, pt_update_ops,
-					gpuva_to_vma(op->base.unmap.va));
+	{
+		struct xe_vma *vma = gpuva_to_vma(op->base.unmap.va);
+
+		if (xe_vma_is_system_allocator(vma))
+			break;
+
+		err = unbind_op_prepare(tile, pt_update_ops, vma);
 		break;
+	}
 	case DRM_GPUVA_OP_PREFETCH:
-		err = bind_op_prepare(vm, tile, pt_update_ops,
-				      gpuva_to_vma(op->base.prefetch.va));
+	{
+		struct xe_vma *vma = gpuva_to_vma(op->base.prefetch.va);
+
+		if (xe_vma_is_system_allocator(vma))
+			break;
+
+		err = bind_op_prepare(vm, tile, pt_update_ops, vma);
 		pt_update_ops->wait_vm_kernel = true;
 		break;
+	}
 	default:
 		drm_warn(&vm->xe->drm, "NOT POSSIBLE");
 	}
@@ -1873,6 +1899,8 @@ static void bind_op_commit(struct xe_vm *vm, struct xe_tile *tile,
 			   struct xe_vm_pgtable_update_ops *pt_update_ops,
 			   struct xe_vma *vma, struct dma_fence *fence)
 {
+	xe_tile_assert(tile, !xe_vma_is_system_allocator(vma));
+
 	if (!xe_vma_has_no_bo(vma) && !xe_vma_bo(vma)->vm)
 		dma_resv_add_fence(xe_vma_bo(vma)->ttm.base.resv, fence,
 				   pt_update_ops->wait_vm_bookkeep ?
@@ -1899,6 +1927,8 @@ static void unbind_op_commit(struct xe_vm *vm, struct xe_tile *tile,
 			     struct xe_vm_pgtable_update_ops *pt_update_ops,
 			     struct xe_vma *vma, struct dma_fence *fence)
 {
+	xe_tile_assert(tile, !xe_vma_is_system_allocator(vma));
+
 	if (!xe_vma_has_no_bo(vma) && !xe_vma_bo(vma)->vm)
 		dma_resv_add_fence(xe_vma_bo(vma)->ttm.base.resv, fence,
 				   pt_update_ops->wait_vm_bookkeep ?
@@ -1926,14 +1956,20 @@ static void op_commit(struct xe_vm *vm,
 
 	switch (op->base.op) {
 	case DRM_GPUVA_OP_MAP:
-		if (!op->map.immediate && xe_vm_in_fault_mode(vm))
+		if ((!op->map.immediate && xe_vm_in_fault_mode(vm)) ||
+		    op->map.is_system_allocator)
 			break;
 
 		bind_op_commit(vm, tile, pt_update_ops, op->map.vma, fence);
 		break;
 	case DRM_GPUVA_OP_REMAP:
-		unbind_op_commit(vm, tile, pt_update_ops,
-				 gpuva_to_vma(op->base.remap.unmap->va), fence);
+	{
+		struct xe_vma *old = gpuva_to_vma(op->base.remap.unmap->va);
+
+		if (xe_vma_is_system_allocator(old))
+			break;
+
+		unbind_op_commit(vm, tile, pt_update_ops, old, fence);
 
 		if (op->remap.prev)
 			bind_op_commit(vm, tile, pt_update_ops, op->remap.prev,
@@ -1942,14 +1978,23 @@ static void op_commit(struct xe_vm *vm,
 			bind_op_commit(vm, tile, pt_update_ops, op->remap.next,
 				       fence);
 		break;
+	}
 	case DRM_GPUVA_OP_UNMAP:
-		unbind_op_commit(vm, tile, pt_update_ops,
-				 gpuva_to_vma(op->base.unmap.va), fence);
+	{
+		struct xe_vma *vma = gpuva_to_vma(op->base.unmap.va);
+
+		if (!xe_vma_is_system_allocator(vma))
+			unbind_op_commit(vm, tile, pt_update_ops, vma, fence);
 		break;
+	}
 	case DRM_GPUVA_OP_PREFETCH:
-		bind_op_commit(vm, tile, pt_update_ops,
-			       gpuva_to_vma(op->base.prefetch.va), fence);
+	{
+		struct xe_vma *vma = gpuva_to_vma(op->base.prefetch.va);
+
+		if (!xe_vma_is_system_allocator(vma))
+			bind_op_commit(vm, tile, pt_update_ops, vma, fence);
 		break;
+	}
 	default:
 		drm_warn(&vm->xe->drm, "NOT POSSIBLE");
 	}
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index b4fc8cf3f6a8..00d383b0acb7 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -891,9 +891,10 @@ static void xe_vma_free(struct xe_vma *vma)
 		kfree(vma);
 }
 
-#define VMA_CREATE_FLAG_READ_ONLY	BIT(0)
-#define VMA_CREATE_FLAG_IS_NULL		BIT(1)
-#define VMA_CREATE_FLAG_DUMPABLE	BIT(2)
+#define VMA_CREATE_FLAG_READ_ONLY		BIT(0)
+#define VMA_CREATE_FLAG_IS_NULL			BIT(1)
+#define VMA_CREATE_FLAG_DUMPABLE		BIT(2)
+#define VMA_CREATE_FLAG_IS_SYSTEM_ALLOCATOR	BIT(3)
 
 static struct xe_vma *xe_vma_create(struct xe_vm *vm,
 				    struct xe_bo *bo,
@@ -907,6 +908,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
 	bool read_only = (flags & VMA_CREATE_FLAG_READ_ONLY);
 	bool is_null = (flags & VMA_CREATE_FLAG_IS_NULL);
 	bool dumpable = (flags & VMA_CREATE_FLAG_DUMPABLE);
+	bool is_system_allocator =
+		(flags & VMA_CREATE_FLAG_IS_SYSTEM_ALLOCATOR);
 
 	xe_assert(vm->xe, start < end);
 	xe_assert(vm->xe, end < vm->size);
@@ -915,7 +918,7 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
 	 * Allocate and ensure that the xe_vma_is_userptr() return
 	 * matches what was allocated.
 	 */
-	if (!bo && !is_null) {
+	if (!bo && !is_null && !is_system_allocator) {
 		struct xe_userptr_vma *uvma = kzalloc(sizeof(*uvma), GFP_KERNEL);
 
 		if (!uvma)
@@ -927,6 +930,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
 		if (!vma)
 			return ERR_PTR(-ENOMEM);
 
+		if (is_system_allocator)
+			vma->gpuva.flags |= XE_VMA_SYSTEM_ALLOCATOR;
 		if (is_null)
 			vma->gpuva.flags |= DRM_GPUVA_SPARSE;
 		if (bo)
@@ -969,7 +974,7 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
 		drm_gpuva_link(&vma->gpuva, vm_bo);
 		drm_gpuvm_bo_put(vm_bo);
 	} else /* userptr or null */ {
-		if (!is_null) {
+		if (!is_null && !is_system_allocator) {
 			struct xe_userptr *userptr = &to_userptr_vma(vma)->userptr;
 			u64 size = end - start + 1;
 			int err;
@@ -1019,7 +1024,7 @@ static void xe_vma_destroy_late(struct xe_vma *vma)
 		 */
 		mmu_interval_notifier_remove(&userptr->notifier);
 		xe_vm_put(vm);
-	} else if (xe_vma_is_null(vma)) {
+	} else if (xe_vma_is_null(vma) || xe_vma_is_system_allocator(vma)) {
 		xe_vm_put(vm);
 	} else {
 		xe_bo_put(xe_vma_bo(vma));
@@ -1058,7 +1063,7 @@ static void xe_vma_destroy(struct xe_vma *vma, struct dma_fence *fence)
 		spin_lock(&vm->userptr.invalidated_lock);
 		list_del(&to_userptr_vma(vma)->userptr.invalidate_link);
 		spin_unlock(&vm->userptr.invalidated_lock);
-	} else if (!xe_vma_is_null(vma)) {
+	} else if (!xe_vma_is_null(vma) && !xe_vma_is_system_allocator(vma)) {
 		xe_bo_assert_held(xe_vma_bo(vma));
 
 		drm_gpuva_unlink(&vma->gpuva);
@@ -1983,6 +1988,8 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
 			op->map.read_only =
 				flags & DRM_XE_VM_BIND_FLAG_READONLY;
 			op->map.is_null = flags & DRM_XE_VM_BIND_FLAG_NULL;
+			op->map.is_system_allocator = flags &
+				DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR;
 			op->map.dumpable = flags & DRM_XE_VM_BIND_FLAG_DUMPABLE;
 			op->map.pat_index = pat_index;
 		} else if (__op->op == DRM_GPUVA_OP_PREFETCH) {
@@ -2175,6 +2182,8 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
 				VMA_CREATE_FLAG_IS_NULL : 0;
 			flags |= op->map.dumpable ?
 				VMA_CREATE_FLAG_DUMPABLE : 0;
+			flags |= op->map.is_system_allocator ?
+				VMA_CREATE_FLAG_IS_SYSTEM_ALLOCATOR : 0;
 
 			vma = new_vma(vm, &op->base.map, op->map.pat_index,
 				      flags);
@@ -2182,7 +2191,8 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
 				return PTR_ERR(vma);
 
 			op->map.vma = vma;
-			if (op->map.immediate || !xe_vm_in_fault_mode(vm))
+			if ((op->map.immediate || !xe_vm_in_fault_mode(vm)) &&
+			    !op->map.is_system_allocator)
 				xe_vma_ops_incr_pt_update_ops(vops,
 							      op->tile_mask);
 			break;
@@ -2191,21 +2201,24 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
 		{
 			struct xe_vma *old =
 				gpuva_to_vma(op->base.remap.unmap->va);
+			bool skip = xe_vma_is_system_allocator(old);
 
 			op->remap.start = xe_vma_start(old);
 			op->remap.range = xe_vma_size(old);
 
-			if (op->base.remap.prev) {
-				flags |= op->base.remap.unmap->va->flags &
-					XE_VMA_READ_ONLY ?
-					VMA_CREATE_FLAG_READ_ONLY : 0;
-				flags |= op->base.remap.unmap->va->flags &
-					DRM_GPUVA_SPARSE ?
-					VMA_CREATE_FLAG_IS_NULL : 0;
-				flags |= op->base.remap.unmap->va->flags &
-					XE_VMA_DUMPABLE ?
-					VMA_CREATE_FLAG_DUMPABLE : 0;
+			flags |= op->base.remap.unmap->va->flags &
+				XE_VMA_READ_ONLY ?
+				VMA_CREATE_FLAG_READ_ONLY : 0;
+			flags |= op->base.remap.unmap->va->flags &
+				DRM_GPUVA_SPARSE ?
+				VMA_CREATE_FLAG_IS_NULL : 0;
+			flags |= op->base.remap.unmap->va->flags &
+				XE_VMA_DUMPABLE ?
+				VMA_CREATE_FLAG_DUMPABLE : 0;
+			flags |= xe_vma_is_system_allocator(old) ?
+				VMA_CREATE_FLAG_IS_SYSTEM_ALLOCATOR : 0;
 
+			if (op->base.remap.prev) {
 				vma = new_vma(vm, op->base.remap.prev,
 					      old->pat_index, flags);
 				if (IS_ERR(vma))
@@ -2217,9 +2230,10 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
 				 * Userptr creates a new SG mapping so
 				 * we must also rebind.
 				 */
-				op->remap.skip_prev = !xe_vma_is_userptr(old) &&
+				op->remap.skip_prev = skip ||
+					(!xe_vma_is_userptr(old) &&
 					IS_ALIGNED(xe_vma_end(vma),
-						   xe_vma_max_pte_size(old));
+						   xe_vma_max_pte_size(old)));
 				if (op->remap.skip_prev) {
 					xe_vma_set_pte_size(vma, xe_vma_max_pte_size(old));
 					op->remap.range -=
@@ -2235,16 +2249,6 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
 			}
 
 			if (op->base.remap.next) {
-				flags |= op->base.remap.unmap->va->flags &
-					XE_VMA_READ_ONLY ?
-					VMA_CREATE_FLAG_READ_ONLY : 0;
-				flags |= op->base.remap.unmap->va->flags &
-					DRM_GPUVA_SPARSE ?
-					VMA_CREATE_FLAG_IS_NULL : 0;
-				flags |= op->base.remap.unmap->va->flags &
-					XE_VMA_DUMPABLE ?
-					VMA_CREATE_FLAG_DUMPABLE : 0;
-
 				vma = new_vma(vm, op->base.remap.next,
 					      old->pat_index, flags);
 				if (IS_ERR(vma))
@@ -2256,9 +2260,10 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
 				 * Userptr creates a new SG mapping so
 				 * we must also rebind.
 				 */
-				op->remap.skip_next = !xe_vma_is_userptr(old) &&
+				op->remap.skip_next = skip ||
+					(!xe_vma_is_userptr(old) &&
 					IS_ALIGNED(xe_vma_start(vma),
-						   xe_vma_max_pte_size(old));
+						   xe_vma_max_pte_size(old)));
 				if (op->remap.skip_next) {
 					xe_vma_set_pte_size(vma, xe_vma_max_pte_size(old));
 					op->remap.range -=
@@ -2271,14 +2276,27 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
 					xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
 				}
 			}
-			xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
+			if (!skip)
+				xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
 			break;
 		}
 		case DRM_GPUVA_OP_UNMAP:
+		{
+			struct xe_vma *vma = gpuva_to_vma(op->base.unmap.va);
+
+			if (!xe_vma_is_system_allocator(vma))
+				xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
+			break;
+		}
 		case DRM_GPUVA_OP_PREFETCH:
+		{
+			struct xe_vma *vma = gpuva_to_vma(op->base.prefetch.va);
+
 			/* FIXME: Need to skip some prefetch ops */
-			xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
+			if (!xe_vma_is_system_allocator(vma))
+				xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
 			break;
+		}
 		default:
 			drm_warn(&vm->xe->drm, "NOT POSSIBLE");
 		}
@@ -2713,7 +2731,8 @@ static int vm_bind_ioctl_ops_execute(struct xe_vm *vm,
 	(DRM_XE_VM_BIND_FLAG_READONLY | \
 	 DRM_XE_VM_BIND_FLAG_IMMEDIATE | \
 	 DRM_XE_VM_BIND_FLAG_NULL | \
-	 DRM_XE_VM_BIND_FLAG_DUMPABLE)
+	 DRM_XE_VM_BIND_FLAG_DUMPABLE | \
+	 DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR)
 #define XE_64K_PAGE_MASK 0xffffull
 #define ALL_DRM_XE_SYNCS_FLAGS (DRM_XE_SYNCS_FLAG_WAIT_FOR_OP)
 
@@ -2761,9 +2780,17 @@ static int vm_bind_ioctl_check_args(struct xe_device *xe,
 		u64 obj_offset = (*bind_ops)[i].obj_offset;
 		u32 prefetch_region = (*bind_ops)[i].prefetch_mem_region_instance;
 		bool is_null = flags & DRM_XE_VM_BIND_FLAG_NULL;
+		bool is_system_allocator = flags &
+			DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR;
 		u16 pat_index = (*bind_ops)[i].pat_index;
 		u16 coh_mode;
 
+		/* FIXME: Disabling system allocator for now */
+		if (XE_IOCTL_DBG(xe, is_system_allocator)) {
+			err = -EOPNOTSUPP;
+			goto free_bind_ops;
+		}
+
 		if (XE_IOCTL_DBG(xe, pat_index >= xe->pat.n_entries)) {
 			err = -EINVAL;
 			goto free_bind_ops;
@@ -2784,13 +2811,16 @@ static int vm_bind_ioctl_check_args(struct xe_device *xe,
 
 		if (XE_IOCTL_DBG(xe, op > DRM_XE_VM_BIND_OP_PREFETCH) ||
 		    XE_IOCTL_DBG(xe, flags & ~SUPPORTED_FLAGS) ||
-		    XE_IOCTL_DBG(xe, obj && is_null) ||
-		    XE_IOCTL_DBG(xe, obj_offset && is_null) ||
+		    XE_IOCTL_DBG(xe, obj && (is_null || is_system_allocator)) ||
+		    XE_IOCTL_DBG(xe, obj_offset &&
+				 (is_null || is_system_allocator)) ||
 		    XE_IOCTL_DBG(xe, op != DRM_XE_VM_BIND_OP_MAP &&
 				 is_null) ||
+		    XE_IOCTL_DBG(xe, op != DRM_XE_VM_BIND_OP_MAP &&
+				 op != DRM_XE_VM_BIND_OP_UNMAP && is_system_allocator) ||
 		    XE_IOCTL_DBG(xe, !obj &&
 				 op == DRM_XE_VM_BIND_OP_MAP &&
-				 !is_null) ||
+				 !is_null && !is_system_allocator) ||
 		    XE_IOCTL_DBG(xe, !obj &&
 				 op == DRM_XE_VM_BIND_OP_UNMAP_ALL) ||
 		    XE_IOCTL_DBG(xe, addr &&
@@ -3155,6 +3185,7 @@ int xe_vm_invalidate_vma(struct xe_vma *vma)
 	int ret;
 
 	xe_assert(xe, !xe_vma_is_null(vma));
+	xe_assert(xe, !xe_vma_is_system_allocator(vma));
 	trace_xe_vma_invalidate(vma);
 
 	vm_dbg(&xe_vma_vm(vma)->xe->drm,
diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
index c864dba35e1d..1a5aed678214 100644
--- a/drivers/gpu/drm/xe/xe_vm.h
+++ b/drivers/gpu/drm/xe/xe_vm.h
@@ -151,6 +151,11 @@ static inline bool xe_vma_is_null(struct xe_vma *vma)
 	return vma->gpuva.flags & DRM_GPUVA_SPARSE;
 }
 
+static inline bool xe_vma_is_system_allocator(struct xe_vma *vma)
+{
+	return vma->gpuva.flags & XE_VMA_SYSTEM_ALLOCATOR;
+}
+
 static inline bool xe_vma_has_no_bo(struct xe_vma *vma)
 {
 	return !xe_vma_bo(vma);
@@ -158,7 +163,8 @@ static inline bool xe_vma_has_no_bo(struct xe_vma *vma)
 
 static inline bool xe_vma_is_userptr(struct xe_vma *vma)
 {
-	return xe_vma_has_no_bo(vma) && !xe_vma_is_null(vma);
+	return xe_vma_has_no_bo(vma) && !xe_vma_is_null(vma) &&
+		!xe_vma_is_system_allocator(vma);
 }
 
 /**
diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
index 27d651093d30..e1d3dd699380 100644
--- a/drivers/gpu/drm/xe/xe_vm_types.h
+++ b/drivers/gpu/drm/xe/xe_vm_types.h
@@ -32,6 +32,7 @@ struct xe_vm_pgtable_update_op;
 #define XE_VMA_PTE_64K		(DRM_GPUVA_USERBITS << 6)
 #define XE_VMA_PTE_COMPACT	(DRM_GPUVA_USERBITS << 7)
 #define XE_VMA_DUMPABLE		(DRM_GPUVA_USERBITS << 8)
+#define XE_VMA_SYSTEM_ALLOCATOR	(DRM_GPUVA_USERBITS << 9)
 
 /** struct xe_userptr - User pointer */
 struct xe_userptr {
@@ -284,6 +285,8 @@ struct xe_vma_op_map {
 	bool read_only;
 	/** @is_null: is NULL binding */
 	bool is_null;
+	/** @is_system_allocator: is system allocator binding */
+	bool is_system_allocator;
 	/** @dumpable: whether BO is dumped on GPU hang */
 	bool dumpable;
 	/** @pat_index: The pat index to use for this operation. */
diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
index d7b0903c22b2..af8ee6f929d4 100644
--- a/include/uapi/drm/xe_drm.h
+++ b/include/uapi/drm/xe_drm.h
@@ -885,6 +885,12 @@ struct drm_xe_vm_destroy {
  *    will only be valid for DRM_XE_VM_BIND_OP_MAP operations, the BO
  *    handle MBZ, and the BO offset MBZ. This flag is intended to
  *    implement VK sparse bindings.
+ *  - %DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR - When the system allocator flag is
+ *    set, no mappings are created; rather, the range is reserved for system
+ *    allocations which will be populated on GPU page faults. Only valid on VMs
+ *    with DRM_XE_VM_CREATE_FLAG_FAULT_MODE set. The system allocator flag is
+ *    only valid for DRM_XE_VM_BIND_OP_MAP operations, the BO handle MBZ, and
+ *    the BO offset MBZ.
  */
 struct drm_xe_vm_bind_op {
 	/** @extensions: Pointer to the first extension struct, if any */
@@ -937,7 +943,9 @@ struct drm_xe_vm_bind_op {
 	 * on the @pat_index. For such mappings there is no actual memory being
 	 * mapped (the address in the PTE is invalid), so the various PAT memory
 	 * attributes likely do not apply.  Simply leaving as zero is one
-	 * option (still a valid pat_index).
+	 * option (still a valid pat_index). Same applies to
+	 * DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR bindings as for such mapping
+	 * there is no actual memory being mapped.
 	 */
 	__u16 pat_index;
 
@@ -975,6 +983,7 @@ struct drm_xe_vm_bind_op {
 #define DRM_XE_VM_BIND_FLAG_IMMEDIATE	(1 << 1)
 #define DRM_XE_VM_BIND_FLAG_NULL	(1 << 2)
 #define DRM_XE_VM_BIND_FLAG_DUMPABLE	(1 << 3)
+#define DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR	(1 << 4)
 	/** @flags: Bind flags */
 	__u32 flags;
 
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread
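
For illustration, a minimal userspace sketch of how the new bind flag is meant to be used once enabled (this revision still rejects it with -EOPNOTSUPP). Everything around the new flag, including the ioctl number and the surrounding struct fields, follows the existing xe uapi and is assumed here rather than defined by this patch:

#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/xe_drm.h>

/*
 * Sketch only: reserve a malloc'ed CPU range for system-allocator use on a
 * fault-mode VM. No BO handle and no BO offset may be supplied with this flag.
 */
static int xe_bind_system_allocator(int fd, uint32_t vm_id,
				    void *cpu_va, uint64_t size)
{
	struct drm_xe_vm_bind args = {
		.vm_id = vm_id,
		.num_binds = 1,
		.bind = {
			.op = DRM_XE_VM_BIND_OP_MAP,
			.flags = DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR,
			.obj = 0,		/* BO handle must be zero */
			.obj_offset = 0,	/* BO offset must be zero */
			.addr = (uint64_t)(uintptr_t)cpu_va,
			.range = size,
			.pat_index = 0,		/* nothing is mapped, 0 is valid */
		},
	};

	/* Only legal on a VM created with DRM_XE_VM_CREATE_FLAG_FAULT_MODE. */
	return ioctl(fd, DRM_IOCTL_XE_VM_BIND, &args);
}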

* [CI 22/42] drm/xe: Add a helper to calculate userptr end address
  2024-06-13  4:23 [CI 01/42] mm/hmm: let users to tag specific PFNs Oak Zeng
                   ` (20 preceding siblings ...)
  2024-06-13  4:24 ` [CI 21/42] drm/xe/uapi: Add DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR flag Oak Zeng
@ 2024-06-13  4:24 ` Oak Zeng
  2024-06-13  4:24 ` [CI 23/42] drm/xe: Add dma_addr res cursor Oak Zeng
                   ` (19 subsequent siblings)
  41 siblings, 0 replies; 44+ messages in thread
From: Oak Zeng @ 2024-06-13  4:24 UTC (permalink / raw)
  To: intel-xe

xe_vma_userptr_end() is added to calculate the end address of a userptr VMA.
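
For illustration, a minimal sketch of how the helper pairs with xe_vma_userptr(); the helper name below is hypothetical:

/* Sketch: number of CPU pages spanned by a userptr VMA, using the new helper. */
static u64 xe_vma_userptr_num_pages(struct xe_vma *vma)
{
	return (xe_vma_userptr_end(vma) - xe_vma_userptr(vma)) >> PAGE_SHIFT;
}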

Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_vm.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
index 1a5aed678214..89f3306561ad 100644
--- a/drivers/gpu/drm/xe/xe_vm.h
+++ b/drivers/gpu/drm/xe/xe_vm.h
@@ -146,6 +146,11 @@ static inline u64 xe_vma_userptr(struct xe_vma *vma)
 	return vma->gpuva.gem.offset;
 }
 
+static inline u64 xe_vma_userptr_end(struct xe_vma *vma)
+{
+	return vma->gpuva.gem.offset + xe_vma_size(vma);
+}
+
 static inline bool xe_vma_is_null(struct xe_vma *vma)
 {
 	return vma->gpuva.flags & DRM_GPUVA_SPARSE;
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [CI 23/42] drm/xe: Add dma_addr res cursor
  2024-06-13  4:23 [CI 01/42] mm/hmm: let users to tag specific PFNs Oak Zeng
                   ` (21 preceding siblings ...)
  2024-06-13  4:24 ` [CI 22/42] drm/xe: Add a helper to calculate userptr end address Oak Zeng
@ 2024-06-13  4:24 ` Oak Zeng
  2024-06-13  4:24 ` [CI 24/42] drm/xe: Use drm_mem_region for xe Oak Zeng
                   ` (18 subsequent siblings)
  41 siblings, 0 replies; 44+ messages in thread
From: Oak Zeng @ 2024-06-13  4:24 UTC (permalink / raw)
  To: intel-xe

From: Matthew Brost <matthew.brost@intel.com>

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_res_cursor.h | 50 +++++++++++++++++++++++++++++-
 1 file changed, 49 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/xe_res_cursor.h b/drivers/gpu/drm/xe/xe_res_cursor.h
index 655af89b31a9..b17b3375f6d9 100644
--- a/drivers/gpu/drm/xe/xe_res_cursor.h
+++ b/drivers/gpu/drm/xe/xe_res_cursor.h
@@ -44,7 +44,9 @@ struct xe_res_cursor {
 	u64 remaining;
 	void *node;
 	u32 mem_type;
+	unsigned int order;
 	struct scatterlist *sgl;
+	const dma_addr_t *dma_addr;
 	struct drm_buddy *mm;
 };
 
@@ -71,6 +73,7 @@ static inline void xe_res_first(struct ttm_resource *res,
 				struct xe_res_cursor *cur)
 {
 	cur->sgl = NULL;
+	cur->dma_addr = NULL;
 	if (!res)
 		goto fallback;
 
@@ -161,11 +164,43 @@ static inline void xe_res_first_sg(const struct sg_table *sg,
 	cur->start = start;
 	cur->remaining = size;
 	cur->size = 0;
+	cur->dma_addr = NULL;
 	cur->sgl = sg->sgl;
 	cur->mem_type = XE_PL_TT;
 	__xe_res_sg_next(cur);
 }
 
+/**
+ * xe_res_first_dma - initialize a xe_res_cursor with dma_addr array
+ *
+ * @dma_addr: dma_addr array to walk
+ * @start: Start of the range
+ * @size: Size of the range
+ * @order: Order of dma mapping. i.e. PAGE_SIZE << order is mapping size
+ * @cur: cursor object to initialize
+ *
+ * Start walking over the range of allocations between @start and @size.
+ */
+static inline void xe_res_first_dma(const dma_addr_t *dma_addr,
+				    u64 start, u64 size,
+				    unsigned int order,
+				    struct xe_res_cursor *cur)
+{
+	XE_WARN_ON(start);
+	XE_WARN_ON(!dma_addr);
+	XE_WARN_ON(!IS_ALIGNED(start, PAGE_SIZE) ||
+		   !IS_ALIGNED(size, PAGE_SIZE));
+
+	cur->node = NULL;
+	cur->start = start;
+	cur->remaining = size;
+	cur->size = PAGE_SIZE << order;
+	cur->dma_addr = dma_addr;
+	cur->order = order;
+	cur->sgl = NULL;
+	cur->mem_type = XE_PL_TT;
+}
+
 /**
  * xe_res_next - advance the cursor
  *
@@ -192,6 +227,13 @@ static inline void xe_res_next(struct xe_res_cursor *cur, u64 size)
 		return;
 	}
 
+	if (cur->dma_addr) {
+		cur->size = (PAGE_SIZE << cur->order) -
+			(size - cur->size);
+		cur->start += size;
+		return;
+	}
+
 	if (cur->sgl) {
 		cur->start += size;
 		__xe_res_sg_next(cur);
@@ -233,6 +275,12 @@ static inline void xe_res_next(struct xe_res_cursor *cur, u64 size)
  */
 static inline u64 xe_res_dma(const struct xe_res_cursor *cur)
 {
-	return cur->sgl ? sg_dma_address(cur->sgl) + cur->start : cur->start;
+	if (cur->dma_addr)
+		return cur->dma_addr[cur->start >> (PAGE_SHIFT + cur->order)] +
+			(cur->start & ((PAGE_SIZE << cur->order) - 1));
+	else if (cur->sgl)
+		return sg_dma_address(cur->sgl) + cur->start;
+	else
+		return cur->start;
 }
 #endif
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread
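
For reference, a minimal sketch of how the dma_addr mode of the cursor is expected to be walked; the PTE-programming step is only indicated by a comment and everything except the cursor helpers is illustrative:

/* Sketch: walk a dma_addr[] array in PAGE_SIZE steps with the new cursor mode. */
static void walk_dma_array(const dma_addr_t *dma_addr, u64 size)
{
	struct xe_res_cursor cur;

	/* order 0: each dma_addr[] entry covers exactly PAGE_SIZE bytes */
	xe_res_first_dma(dma_addr, 0, size, 0, &cur);
	while (cur.remaining) {
		dma_addr_t addr = xe_res_dma(&cur);

		/* program one PTE with 'addr' here */
		xe_res_next(&cur, PAGE_SIZE);
	}
}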

* [CI 24/42] drm/xe: Use drm_mem_region for xe
  2024-06-13  4:23 [CI 01/42] mm/hmm: let users to tag specific PFNs Oak Zeng
                   ` (22 preceding siblings ...)
  2024-06-13  4:24 ` [CI 23/42] drm/xe: Add dma_addr res cursor Oak Zeng
@ 2024-06-13  4:24 ` Oak Zeng
  2024-06-13  4:24 ` [CI 25/42] drm/xe: use drm_hmmptr in xe Oak Zeng
                   ` (17 subsequent siblings)
  41 siblings, 0 replies; 44+ messages in thread
From: Oak Zeng @ 2024-06-13  4:24 UTC (permalink / raw)
  To: intel-xe

drm_mem_region was introduced to move some memory management
code to the DRM layer so it can be shared between different
vendor drivers. This patch applies the drm_mem_region concept
to the xekmd driver.

drm_mem_region is the parent class of xe_mem_region. Some
xe_mem_region members, such as dpa_base and usable_size, are
deleted as they are already present in the parent class.
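
For context, the drm_mr fields dereferenced below come from the drm/svm patches earlier in this series; a rough sketch of the assumed parent-class layout (the exact definition is not part of this patch):

/* Assumed shape of the parent class provided by drm_svm.h. */
struct drm_mem_region {
	/* @dpa_base: device physical address base of this region */
	resource_size_t dpa_base;
	/* @usable_size: usable size, excluding reserved portions (e.g. stolen) */
	resource_size_t usable_size;
	/* ... other members introduced by the drm/svm patches ... */
};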

Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/display/xe_fb_pin.c        |  2 +-
 drivers/gpu/drm/xe/display/xe_plane_initial.c |  2 +-
 drivers/gpu/drm/xe/xe_bo.c                    |  6 +++---
 drivers/gpu/drm/xe/xe_device_types.h          | 11 ++---------
 drivers/gpu/drm/xe/xe_migrate.c               |  8 ++++----
 drivers/gpu/drm/xe/xe_query.c                 |  2 +-
 drivers/gpu/drm/xe/xe_tile.c                  |  2 +-
 drivers/gpu/drm/xe/xe_ttm_vram_mgr.c          |  2 +-
 drivers/gpu/drm/xe/xe_vram.c                  | 12 ++++++------
 9 files changed, 20 insertions(+), 27 deletions(-)

diff --git a/drivers/gpu/drm/xe/display/xe_fb_pin.c b/drivers/gpu/drm/xe/display/xe_fb_pin.c
index a2f417209124..5c4590c62c08 100644
--- a/drivers/gpu/drm/xe/display/xe_fb_pin.c
+++ b/drivers/gpu/drm/xe/display/xe_fb_pin.c
@@ -272,7 +272,7 @@ static struct i915_vma *__xe_pin_fb_vma(const struct intel_framebuffer *fb,
 		 * accessible.  This is important on small-bar systems where
 		 * only some subset of VRAM is CPU accessible.
 		 */
-		if (tile->mem.vram.io_size < tile->mem.vram.usable_size) {
+		if (tile->mem.vram.io_size < tile->mem.vram.drm_mr.usable_size) {
 			ret = -EINVAL;
 			goto err;
 		}
diff --git a/drivers/gpu/drm/xe/display/xe_plane_initial.c b/drivers/gpu/drm/xe/display/xe_plane_initial.c
index e135b20962d9..c2c079a2b133 100644
--- a/drivers/gpu/drm/xe/display/xe_plane_initial.c
+++ b/drivers/gpu/drm/xe/display/xe_plane_initial.c
@@ -86,7 +86,7 @@ initial_plane_bo(struct xe_device *xe,
 		 * We don't currently expect this to ever be placed in the
 		 * stolen portion.
 		 */
-		if (phys_base >= tile0->mem.vram.usable_size) {
+		if (phys_base >= tile0->mem.vram.drm_mr.usable_size) {
 			drm_err(&xe->drm,
 				"Initial plane programming using invalid range, phys_base=%pa\n",
 				&phys_base);
diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
index 63d7fa4c3db9..1a048fcb1496 100644
--- a/drivers/gpu/drm/xe/xe_bo.c
+++ b/drivers/gpu/drm/xe/xe_bo.c
@@ -172,7 +172,7 @@ static void add_vram(struct xe_device *xe, struct xe_bo *bo,
 	xe_assert(xe, *c < ARRAY_SIZE(bo->placements));
 
 	vram = to_xe_ttm_vram_mgr(ttm_manager_type(&xe->ttm, mem_type))->vram;
-	xe_assert(xe, vram && vram->usable_size);
+	xe_assert(xe, vram && vram->drm_mr.usable_size);
 	io_size = vram->io_size;
 
 	/*
@@ -183,7 +183,7 @@ static void add_vram(struct xe_device *xe, struct xe_bo *bo,
 			XE_BO_FLAG_GGTT))
 		place.flags |= TTM_PL_FLAG_CONTIGUOUS;
 
-	if (io_size < vram->usable_size) {
+	if (io_size < vram->drm_mr.usable_size) {
 		if (bo_flags & XE_BO_FLAG_NEEDS_CPU_ACCESS) {
 			place.fpfn = 0;
 			place.lpfn = io_size >> PAGE_SHIFT;
@@ -1637,7 +1637,7 @@ uint64_t vram_region_gpu_offset(struct ttm_resource *res)
 	if (res->mem_type == XE_PL_STOLEN)
 		return xe_ttm_stolen_gpu_offset(xe);
 
-	return res_to_mem_region(res)->dpa_base;
+	return res_to_mem_region(res)->drm_mr.dpa_base;
 }
 
 /**
diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index f1c09824b145..cf61b52e6d84 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -11,6 +11,7 @@
 #include <drm/drm_device.h>
 #include <drm/drm_file.h>
 #include <drm/ttm/ttm_device.h>
+#include <drm/drm_svm.h>
 
 #include "xe_devcoredump_types.h"
 #include "xe_heci_gsc.h"
@@ -69,6 +70,7 @@ struct xe_pat_ops;
  * device, such as HBM memory or CXL extension memory.
  */
 struct xe_mem_region {
+	struct drm_mem_region drm_mr;
 	/** @io_start: IO start address of this VRAM instance */
 	resource_size_t io_start;
 	/**
@@ -81,15 +83,6 @@ struct xe_mem_region {
 	 * configuration is known as small-bar.
 	 */
 	resource_size_t io_size;
-	/** @dpa_base: This memory regions's DPA (device physical address) base */
-	resource_size_t dpa_base;
-	/**
-	 * @usable_size: usable size of VRAM
-	 *
-	 * Usable size of VRAM excluding reserved portions
-	 * (e.g stolen mem)
-	 */
-	resource_size_t usable_size;
 	/**
 	 * @actual_physical_size: Actual VRAM size
 	 *
diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
index 0457693315e6..cc8455daa2bb 100644
--- a/drivers/gpu/drm/xe/xe_migrate.c
+++ b/drivers/gpu/drm/xe/xe_migrate.c
@@ -126,7 +126,7 @@ static u64 xe_migrate_vram_ofs(struct xe_device *xe, u64 addr)
 	 * Remove the DPA to get a correct offset into identity table for the
 	 * migrate offset
 	 */
-	addr -= xe->mem.vram.dpa_base;
+	addr -= xe->mem.vram.drm_mr.dpa_base;
 	return addr + (256ULL << xe_pt_shift(2));
 }
 
@@ -256,21 +256,21 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
 		u64 pos, ofs, flags;
 		/* XXX: Unclear if this should be usable_size? */
 		u64 vram_limit =  xe->mem.vram.actual_physical_size +
-			xe->mem.vram.dpa_base;
+			xe->mem.vram.drm_mr.dpa_base;
 
 		level = 2;
 		ofs = map_ofs + XE_PAGE_SIZE * level + 256 * 8;
 		flags = vm->pt_ops->pte_encode_addr(xe, 0, pat_index, level,
 						    true, 0);
 
-		xe_assert(xe, IS_ALIGNED(xe->mem.vram.usable_size, SZ_2M));
+		xe_assert(xe, IS_ALIGNED(xe->mem.vram.drm_mr.usable_size, SZ_2M));
 
 		/*
 		 * Use 1GB pages when possible, last chunk always use 2M
 		 * pages as mixing reserved memory (stolen, WOCPM) with a single
 		 * mapping is not allowed on certain platforms.
 		 */
-		for (pos = xe->mem.vram.dpa_base; pos < vram_limit;
+		for (pos = xe->mem.vram.drm_mr.dpa_base; pos < vram_limit;
 		     pos += SZ_1G, ofs += 8) {
 			if (pos + SZ_1G >= vram_limit) {
 				u64 pt31_ofs = bo->size - XE_PAGE_SIZE;
diff --git a/drivers/gpu/drm/xe/xe_query.c b/drivers/gpu/drm/xe/xe_query.c
index 995effcb904b..8b3d63420cef 100644
--- a/drivers/gpu/drm/xe/xe_query.c
+++ b/drivers/gpu/drm/xe/xe_query.c
@@ -334,7 +334,7 @@ static int query_config(struct xe_device *xe, struct drm_xe_device_query *query)
 	config->num_params = num_params;
 	config->info[DRM_XE_QUERY_CONFIG_REV_AND_DEVICE_ID] =
 		xe->info.devid | (xe->info.revid << 16);
-	if (xe_device_get_root_tile(xe)->mem.vram.usable_size)
+	if (xe_device_get_root_tile(xe)->mem.vram.drm_mr.usable_size)
 		config->info[DRM_XE_QUERY_CONFIG_FLAGS] =
 			DRM_XE_QUERY_CONFIG_FLAG_HAS_VRAM;
 	config->info[DRM_XE_QUERY_CONFIG_MIN_ALIGNMENT] =
diff --git a/drivers/gpu/drm/xe/xe_tile.c b/drivers/gpu/drm/xe/xe_tile.c
index 15ea0a942f67..109f3118e821 100644
--- a/drivers/gpu/drm/xe/xe_tile.c
+++ b/drivers/gpu/drm/xe/xe_tile.c
@@ -132,7 +132,7 @@ static int tile_ttm_mgr_init(struct xe_tile *tile)
 	struct xe_device *xe = tile_to_xe(tile);
 	int err;
 
-	if (tile->mem.vram.usable_size) {
+	if (tile->mem.vram.drm_mr.usable_size) {
 		err = xe_ttm_vram_mgr_init(tile, tile->mem.vram_mgr);
 		if (err)
 			return err;
diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
index fe3779fdba2c..dd31b24fb07d 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
@@ -364,7 +364,7 @@ int xe_ttm_vram_mgr_init(struct xe_tile *tile, struct xe_ttm_vram_mgr *mgr)
 
 	mgr->vram = vram;
 	return __xe_ttm_vram_mgr_init(xe, mgr, XE_PL_VRAM0 + tile->id,
-				      vram->usable_size, vram->io_size,
+				      vram->drm_mr.usable_size, vram->io_size,
 				      PAGE_SIZE);
 }
 
diff --git a/drivers/gpu/drm/xe/xe_vram.c b/drivers/gpu/drm/xe/xe_vram.c
index 5bcd59190353..fff18517a9f9 100644
--- a/drivers/gpu/drm/xe/xe_vram.c
+++ b/drivers/gpu/drm/xe/xe_vram.c
@@ -150,7 +150,7 @@ static int determine_lmem_bar_size(struct xe_device *xe)
 		return -EIO;
 
 	/* XXX: Need to change when xe link code is ready */
-	xe->mem.vram.dpa_base = 0;
+	xe->mem.vram.drm_mr.dpa_base = 0;
 
 	/* set up a map to the total memory area. */
 	xe->mem.vram.mapping = ioremap_wc(xe->mem.vram.io_start, xe->mem.vram.io_size);
@@ -333,16 +333,16 @@ int xe_vram_probe(struct xe_device *xe)
 			return -ENODEV;
 		}
 
-		tile->mem.vram.dpa_base = xe->mem.vram.dpa_base + tile_offset;
-		tile->mem.vram.usable_size = vram_size;
+		tile->mem.vram.drm_mr.dpa_base = xe->mem.vram.drm_mr.dpa_base + tile_offset;
+		tile->mem.vram.drm_mr.usable_size = vram_size;
 		tile->mem.vram.mapping = xe->mem.vram.mapping + tile_offset;
 
-		if (tile->mem.vram.io_size < tile->mem.vram.usable_size)
+		if (tile->mem.vram.io_size < tile->mem.vram.drm_mr.usable_size)
 			drm_info(&xe->drm, "Small BAR device\n");
 		drm_info(&xe->drm, "VRAM[%u, %u]: Actual physical size %pa, usable size exclude stolen %pa, CPU accessible size %pa\n", id,
-			 tile->id, &tile->mem.vram.actual_physical_size, &tile->mem.vram.usable_size, &tile->mem.vram.io_size);
+			 tile->id, &tile->mem.vram.actual_physical_size, &tile->mem.vram.drm_mr.usable_size, &tile->mem.vram.io_size);
 		drm_info(&xe->drm, "VRAM[%u, %u]: DPA range: [%pa-%llx], io range: [%pa-%llx]\n", id, tile->id,
-			 &tile->mem.vram.dpa_base, tile->mem.vram.dpa_base + (u64)tile->mem.vram.actual_physical_size,
+			 &tile->mem.vram.drm_mr.dpa_base, tile->mem.vram.drm_mr.dpa_base + (u64)tile->mem.vram.actual_physical_size,
 			 &tile->mem.vram.io_start, tile->mem.vram.io_start + (u64)tile->mem.vram.io_size);
 
 		/* calculate total size using tile size to get the correct HW sizing */
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [CI 25/42] drm/xe: use drm_hmmptr in xe
  2024-06-13  4:23 [CI 01/42] mm/hmm: let users to tag specific PFNs Oak Zeng
                   ` (23 preceding siblings ...)
  2024-06-13  4:24 ` [CI 24/42] drm/xe: Use drm_mem_region for xe Oak Zeng
@ 2024-06-13  4:24 ` Oak Zeng
  2024-06-13  4:24 ` [CI 26/42] drm/xe: Moving to range based vma invalidation Oak Zeng
                   ` (16 subsequent siblings)
  41 siblings, 0 replies; 44+ messages in thread
From: Oak Zeng @ 2024-06-13  4:24 UTC (permalink / raw)
  To: intel-xe

drm_hmmptr was created to move some userptr and SVM logic,
such as mmu notifier registration and range population, to
the DRM layer so that it can be shared between vendor drivers.

This patch applies drm_hmmptr to the xekmd driver. drm_hmmptr
is the parent class of xe_userptr.

Most of the changes are straightforward: some xe_userptr
members are moved into drm_hmmptr. Since drm_hmmptr no longer
uses a scatter-gather table for DMA addresses, page table
building now uses the dma-address cursor instead of the sg
cursor.

xe_hmm.c and xe_hmm.h are removed, as their logic now lives
in drm_svm.c.
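
For orientation, the drm_hmmptr members dereferenced below (notifier, notifier_seq, dma_addr, get_gpuva) are provided by the drm/svm patches earlier in the series; a rough sketch of the assumed layout:

/* Assumed shape of drm_hmmptr as consumed by this patch. */
struct drm_hmmptr {
	/* @notifier: mmu interval notifier driving CPU-side invalidation */
	struct mmu_interval_notifier notifier;
	/* @notifier_seq: sequence number of the last population */
	unsigned long notifier_seq;
	/* @dma_addr: per-page DMA addresses, replacing the old sg table */
	dma_addr_t *dma_addr;
	/* @get_gpuva: resolve this hmmptr back to its drm_gpuva */
	struct drm_gpuva *(*get_gpuva)(struct drm_hmmptr *hmmptr);
};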

Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Krishna Bommu <krishnaiah.bommu@intel.com>
Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/Makefile      |   2 -
 drivers/gpu/drm/xe/xe_hmm.c      | 253 -------------------------------
 drivers/gpu/drm/xe/xe_hmm.h      |  11 --
 drivers/gpu/drm/xe/xe_pt.c       |  10 +-
 drivers/gpu/drm/xe/xe_vm.c       |  89 ++++++++---
 drivers/gpu/drm/xe/xe_vm_types.h |  11 +-
 6 files changed, 71 insertions(+), 305 deletions(-)
 delete mode 100644 drivers/gpu/drm/xe/xe_hmm.c
 delete mode 100644 drivers/gpu/drm/xe/xe_hmm.h

diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
index 478acc94a71c..80bfa5741f26 100644
--- a/drivers/gpu/drm/xe/Makefile
+++ b/drivers/gpu/drm/xe/Makefile
@@ -126,8 +126,6 @@ xe-y += xe_bb.o \
 	xe_wa.o \
 	xe_wopcm.o
 
-xe-$(CONFIG_HMM_MIRROR) += xe_hmm.o
-
 # graphics hardware monitoring (HWMON) support
 xe-$(CONFIG_HWMON) += xe_hwmon.o
 
diff --git a/drivers/gpu/drm/xe/xe_hmm.c b/drivers/gpu/drm/xe/xe_hmm.c
deleted file mode 100644
index 2c32dc46f7d4..000000000000
--- a/drivers/gpu/drm/xe/xe_hmm.c
+++ /dev/null
@@ -1,253 +0,0 @@
-// SPDX-License-Identifier: MIT
-/*
- * Copyright © 2024 Intel Corporation
- */
-
-#include <linux/scatterlist.h>
-#include <linux/mmu_notifier.h>
-#include <linux/dma-mapping.h>
-#include <linux/memremap.h>
-#include <linux/swap.h>
-#include <linux/hmm.h>
-#include <linux/mm.h>
-#include "xe_hmm.h"
-#include "xe_vm.h"
-#include "xe_bo.h"
-
-static u64 xe_npages_in_range(unsigned long start, unsigned long end)
-{
-	return (end - start) >> PAGE_SHIFT;
-}
-
-/*
- * xe_mark_range_accessed() - mark a range is accessed, so core mm
- * have such information for memory eviction or write back to
- * hard disk
- *
- * @range: the range to mark
- * @write: if write to this range, we mark pages in this range
- * as dirty
- */
-static void xe_mark_range_accessed(struct hmm_range *range, bool write)
-{
-	struct page *page;
-	u64 i, npages;
-
-	npages = xe_npages_in_range(range->start, range->end);
-	for (i = 0; i < npages; i++) {
-		page = hmm_pfn_to_page(range->hmm_pfns[i]);
-		if (write)
-			set_page_dirty_lock(page);
-
-		mark_page_accessed(page);
-	}
-}
-
-/*
- * xe_build_sg() - build a scatter gather table for all the physical pages/pfn
- * in a hmm_range. dma-map pages if necessary. dma-address is save in sg table
- * and will be used to program GPU page table later.
- *
- * @xe: the xe device who will access the dma-address in sg table
- * @range: the hmm range that we build the sg table from. range->hmm_pfns[]
- * has the pfn numbers of pages that back up this hmm address range.
- * @st: pointer to the sg table.
- * @write: whether we write to this range. This decides dma map direction
- * for system pages. If write we map it bi-diretional; otherwise
- * DMA_TO_DEVICE
- *
- * All the contiguous pfns will be collapsed into one entry in
- * the scatter gather table. This is for the purpose of efficiently
- * programming GPU page table.
- *
- * The dma_address in the sg table will later be used by GPU to
- * access memory. So if the memory is system memory, we need to
- * do a dma-mapping so it can be accessed by GPU/DMA.
- *
- * FIXME: This function currently only support pages in system
- * memory. If the memory is GPU local memory (of the GPU who
- * is going to access memory), we need gpu dpa (device physical
- * address), and there is no need of dma-mapping. This is TBD.
- *
- * FIXME: dma-mapping for peer gpu device to access remote gpu's
- * memory. Add this when you support p2p
- *
- * This function allocates the storage of the sg table. It is
- * caller's responsibility to free it calling sg_free_table.
- *
- * Returns 0 if successful; -ENOMEM if fails to allocate memory
- */
-static int xe_build_sg(struct xe_device *xe, struct hmm_range *range,
-		       struct sg_table *st, bool write)
-{
-	struct device *dev = xe->drm.dev;
-	struct page **pages;
-	u64 i, npages;
-	int ret;
-
-	npages = xe_npages_in_range(range->start, range->end);
-	pages = kvmalloc_array(npages, sizeof(*pages), GFP_KERNEL);
-	if (!pages)
-		return -ENOMEM;
-
-	for (i = 0; i < npages; i++) {
-		pages[i] = hmm_pfn_to_page(range->hmm_pfns[i]);
-		xe_assert(xe, !is_device_private_page(pages[i]));
-	}
-
-	ret = sg_alloc_table_from_pages_segment(st, pages, npages, 0, npages << PAGE_SHIFT,
-						xe_sg_segment_size(dev), GFP_KERNEL);
-	if (ret)
-		goto free_pages;
-
-	ret = dma_map_sgtable(dev, st, write ? DMA_BIDIRECTIONAL : DMA_TO_DEVICE,
-			      DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_NO_KERNEL_MAPPING);
-	if (ret) {
-		sg_free_table(st);
-		st = NULL;
-	}
-
-free_pages:
-	kvfree(pages);
-	return ret;
-}
-
-/*
- * xe_hmm_userptr_free_sg() - Free the scatter gather table of userptr
- *
- * @uvma: the userptr vma which hold the scatter gather table
- *
- * With function xe_userptr_populate_range, we allocate storage of
- * the userptr sg table. This is a helper function to free this
- * sg table, and dma unmap the address in the table.
- */
-void xe_hmm_userptr_free_sg(struct xe_userptr_vma *uvma)
-{
-	struct xe_userptr *userptr = &uvma->userptr;
-	struct xe_vma *vma = &uvma->vma;
-	bool write = !xe_vma_read_only(vma);
-	struct xe_vm *vm = xe_vma_vm(vma);
-	struct xe_device *xe = vm->xe;
-	struct device *dev = xe->drm.dev;
-
-	xe_assert(xe, userptr->sg);
-	dma_unmap_sgtable(dev, userptr->sg,
-			  write ? DMA_BIDIRECTIONAL : DMA_TO_DEVICE, 0);
-
-	sg_free_table(userptr->sg);
-	userptr->sg = NULL;
-}
-
-/**
- * xe_hmm_userptr_populate_range() - Populate physical pages of a virtual
- * address range
- *
- * @uvma: userptr vma which has information of the range to populate.
- * @is_mm_mmap_locked: True if mmap_read_lock is already acquired by caller.
- *
- * This function populate the physical pages of a virtual
- * address range. The populated physical pages is saved in
- * userptr's sg table. It is similar to get_user_pages but call
- * hmm_range_fault.
- *
- * This function also read mmu notifier sequence # (
- * mmu_interval_read_begin), for the purpose of later
- * comparison (through mmu_interval_read_retry).
- *
- * This must be called with mmap read or write lock held.
- *
- * This function allocates the storage of the userptr sg table.
- * It is caller's responsibility to free it calling sg_free_table.
- *
- * returns: 0 for succuss; negative error no on failure
- */
-int xe_hmm_userptr_populate_range(struct xe_userptr_vma *uvma,
-				  bool is_mm_mmap_locked)
-{
-	unsigned long timeout =
-		jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
-	unsigned long *pfns, flags = HMM_PFN_REQ_FAULT;
-	struct xe_userptr *userptr;
-	struct xe_vma *vma = &uvma->vma;
-	u64 userptr_start = xe_vma_userptr(vma);
-	u64 userptr_end = userptr_start + xe_vma_size(vma);
-	struct xe_vm *vm = xe_vma_vm(vma);
-	struct hmm_range hmm_range;
-	bool write = !xe_vma_read_only(vma);
-	unsigned long notifier_seq;
-	u64 npages;
-	int ret;
-
-	userptr = &uvma->userptr;
-
-	if (is_mm_mmap_locked)
-		mmap_assert_locked(userptr->notifier.mm);
-
-	if (vma->gpuva.flags & XE_VMA_DESTROYED)
-		return 0;
-
-	notifier_seq = mmu_interval_read_begin(&userptr->notifier);
-	if (notifier_seq == userptr->notifier_seq)
-		return 0;
-
-	if (userptr->sg)
-		xe_hmm_userptr_free_sg(uvma);
-
-	npages = xe_npages_in_range(userptr_start, userptr_end);
-	pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
-	if (unlikely(!pfns))
-		return -ENOMEM;
-
-	if (write)
-		flags |= HMM_PFN_REQ_WRITE;
-
-	if (!mmget_not_zero(userptr->notifier.mm)) {
-		ret = -EFAULT;
-		goto free_pfns;
-	}
-
-	hmm_range.default_flags = flags;
-	hmm_range.hmm_pfns = pfns;
-	hmm_range.notifier = &userptr->notifier;
-	hmm_range.start = userptr_start;
-	hmm_range.end = userptr_end;
-	hmm_range.dev_private_owner = vm->xe;
-
-	while (true) {
-		hmm_range.notifier_seq = mmu_interval_read_begin(&userptr->notifier);
-
-		if (!is_mm_mmap_locked)
-			mmap_read_lock(userptr->notifier.mm);
-
-		ret = hmm_range_fault(&hmm_range);
-
-		if (!is_mm_mmap_locked)
-			mmap_read_unlock(userptr->notifier.mm);
-
-		if (ret == -EBUSY) {
-			if (time_after(jiffies, timeout))
-				break;
-
-			continue;
-		}
-		break;
-	}
-
-	mmput(userptr->notifier.mm);
-
-	if (ret)
-		goto free_pfns;
-
-	ret = xe_build_sg(vm->xe, &hmm_range, &userptr->sgt, write);
-	if (ret)
-		goto free_pfns;
-
-	xe_mark_range_accessed(&hmm_range, write);
-	userptr->sg = &userptr->sgt;
-	userptr->notifier_seq = hmm_range.notifier_seq;
-
-free_pfns:
-	kvfree(pfns);
-	return ret;
-}
-
diff --git a/drivers/gpu/drm/xe/xe_hmm.h b/drivers/gpu/drm/xe/xe_hmm.h
deleted file mode 100644
index 909dc2bdcd97..000000000000
--- a/drivers/gpu/drm/xe/xe_hmm.h
+++ /dev/null
@@ -1,11 +0,0 @@
-/* SPDX-License-Identifier: MIT
- *
- * Copyright © 2024 Intel Corporation
- */
-
-#include <linux/types.h>
-
-struct xe_userptr_vma;
-
-int xe_hmm_userptr_populate_range(struct xe_userptr_vma *uvma, bool is_mm_mmap_locked);
-void xe_hmm_userptr_free_sg(struct xe_userptr_vma *uvma);
diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
index 3f6b1d8f437b..b30fc855147d 100644
--- a/drivers/gpu/drm/xe/xe_pt.c
+++ b/drivers/gpu/drm/xe/xe_pt.c
@@ -669,8 +669,8 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
 
 	if (!xe_vma_is_null(vma)) {
 		if (xe_vma_is_userptr(vma))
-			xe_res_first_sg(to_userptr_vma(vma)->userptr.sg, 0,
-					xe_vma_size(vma), &curs);
+			xe_res_first_dma(to_userptr_vma(vma)->userptr.hmmptr.dma_addr,
+					0, xe_vma_size(vma), 0, &curs);
 		else if (xe_bo_is_vram(bo) || xe_bo_is_stolen(bo))
 			xe_res_first(bo->ttm.resource, xe_vma_bo_offset(vma),
 				     xe_vma_size(vma), &curs);
@@ -1221,12 +1221,12 @@ static int vma_check_userptr(struct xe_vm *vm, struct xe_vma *vma,
 		return 0;
 
 	uvma = to_userptr_vma(vma);
-	notifier_seq = uvma->userptr.notifier_seq;
+	notifier_seq = uvma->userptr.hmmptr.notifier_seq;
 
 	if (uvma->userptr.initial_bind && !xe_vm_in_fault_mode(vm))
                return 0;
 
-	if (!mmu_interval_read_retry(&uvma->userptr.notifier,
+	if (!mmu_interval_read_retry(&uvma->userptr.hmmptr.notifier,
 				     notifier_seq) &&
 	    !xe_pt_userptr_inject_eagain(uvma))
 		return 0;
@@ -1755,7 +1755,7 @@ static int unbind_op_prepare(struct xe_tile *tile,
 	 * state if an invalidation is running while preparing an unbind.
 	 */
 	if (xe_vma_is_userptr(vma) && xe_vm_in_fault_mode(xe_vma_vm(vma)))
-		mmu_interval_read_begin(&to_userptr_vma(vma)->userptr.notifier);
+		mmu_interval_read_begin(&to_userptr_vma(vma)->userptr.hmmptr.notifier);
 
 	pt_op->vma = vma;
 	pt_op->bind = false;
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 00d383b0acb7..3b853306208b 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -38,7 +38,6 @@
 #include "xe_sync.h"
 #include "xe_trace.h"
 #include "xe_wa.h"
-#include "xe_hmm.h"
 
 static struct drm_gem_object *xe_vm_obj(struct xe_vm *vm)
 {
@@ -59,21 +58,63 @@ static struct drm_gem_object *xe_vm_obj(struct xe_vm *vm)
  */
 int xe_vma_userptr_check_repin(struct xe_userptr_vma *uvma)
 {
-	return mmu_interval_check_retry(&uvma->userptr.notifier,
-					uvma->userptr.notifier_seq) ?
+	return mmu_interval_check_retry(&uvma->userptr.hmmptr.notifier,
+					uvma->userptr.hmmptr.notifier_seq) ?
 		-EAGAIN : 0;
 }
 
+static void xe_vma_userptr_dma_map_pages(struct xe_userptr_vma *uvma,
+		u64 start, u64 end)
+{
+	struct drm_hmmptr *hmmptr = &uvma->userptr.hmmptr;
+	struct xe_vma *vma = &uvma->vma;
+	struct xe_vm *vm = xe_vma_vm(vma);
+	struct xe_device *xe = vm->xe;
+	u64 npages = (end - start) >> PAGE_SHIFT;
+	u64 page_idx = (start - xe_vma_userptr(vma)) >> PAGE_SHIFT;
+
+	xe_assert(xe, xe_vma_is_userptr(vma));
+	xe_assert(xe, start >= xe_vma_userptr(vma));
+	xe_assert(xe, end <= xe_vma_userptr_end(vma));
+	drm_svm_hmmptr_map_dma_pages(hmmptr, page_idx, npages);
+}
+
+static void xe_vma_userptr_dma_unmap_pages(struct xe_userptr_vma *uvma,
+		u64 start, u64 end)
+{
+	struct drm_hmmptr *hmmptr = &uvma->userptr.hmmptr;
+	struct xe_vma *vma = &uvma->vma;
+	struct xe_vm *vm = xe_vma_vm(vma);
+	struct xe_device *xe = vm->xe;
+	u64 npages = (end - start) >> PAGE_SHIFT;
+	u64 page_idx = (start - xe_vma_userptr(vma)) >> PAGE_SHIFT;
+
+	xe_assert(xe, xe_vma_is_userptr(vma));
+	xe_assert(xe, start >= xe_vma_userptr(vma));
+	xe_assert(xe, end <= xe_vma_userptr_end(vma));
+
+	drm_svm_hmmptr_unmap_dma_pages(hmmptr, page_idx, npages);
+}
+
 int xe_vma_userptr_pin_pages(struct xe_userptr_vma *uvma)
 {
+	struct drm_hmmptr *hmmptr = &uvma->userptr.hmmptr;
 	struct xe_vma *vma = &uvma->vma;
 	struct xe_vm *vm = xe_vma_vm(vma);
 	struct xe_device *xe = vm->xe;
+	int ret;
 
 	lockdep_assert_held(&vm->lock);
 	xe_assert(xe, xe_vma_is_userptr(vma));
 
-	return xe_hmm_userptr_populate_range(uvma, false);
+	ret = drm_svm_hmmptr_populate(hmmptr, NULL, xe_vma_userptr(vma),
+			 xe_vma_userptr(vma) + xe_vma_size(vma),
+			 !xe_vma_read_only(vma), false);
+	if (ret)
+		return ret;
+
+	xe_vma_userptr_dma_map_pages(uvma, xe_vma_userptr(vma), xe_vma_userptr_end(vma));
+	return 0;
 }
 
 static bool preempt_fences_waiting(struct xe_vm *vm)
@@ -574,7 +615,7 @@ static bool vma_userptr_invalidate(struct mmu_interval_notifier *mni,
 				   const struct mmu_notifier_range *range,
 				   unsigned long cur_seq)
 {
-	struct xe_userptr *userptr = container_of(mni, typeof(*userptr), notifier);
+	struct xe_userptr *userptr = container_of(mni, typeof(*userptr), hmmptr.notifier);
 	struct xe_userptr_vma *uvma = container_of(userptr, typeof(*uvma), userptr);
 	struct xe_vma *vma = &uvma->vma;
 	struct xe_vm *vm = xe_vma_vm(vma);
@@ -637,6 +678,8 @@ static bool vma_userptr_invalidate(struct mmu_interval_notifier *mni,
 		XE_WARN_ON(err);
 	}
 
+	xe_vma_userptr_dma_unmap_pages(uvma, xe_vma_userptr(vma), xe_vma_userptr_end(vma));
+
 	trace_xe_vma_userptr_invalidate_complete(vma);
 
 	return true;
@@ -891,6 +934,16 @@ static void xe_vma_free(struct xe_vma *vma)
 		kfree(vma);
 }
 
+static struct drm_gpuva *xe_hmmptr_get_gpuva(struct drm_hmmptr *hmmptr)
+{
+	struct xe_userptr_vma *uvma;
+	struct xe_vma *vma;
+
+	uvma = container_of(hmmptr, typeof(*uvma), userptr.hmmptr);
+	vma = &uvma->vma;
+	return &vma->gpuva;
+}
+
 #define VMA_CREATE_FLAG_READ_ONLY		BIT(0)
 #define VMA_CREATE_FLAG_IS_NULL			BIT(1)
 #define VMA_CREATE_FLAG_DUMPABLE		BIT(2)
@@ -976,23 +1029,19 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
 	} else /* userptr or null */ {
 		if (!is_null && !is_system_allocator) {
 			struct xe_userptr *userptr = &to_userptr_vma(vma)->userptr;
-			u64 size = end - start + 1;
 			int err;
 
 			INIT_LIST_HEAD(&userptr->invalidate_link);
 			INIT_LIST_HEAD(&userptr->repin_link);
 			vma->gpuva.gem.offset = bo_offset_or_userptr;
 
-			err = mmu_interval_notifier_insert(&userptr->notifier,
-							   current->mm,
-							   xe_vma_userptr(vma), size,
-							   &vma_userptr_notifier_ops);
+			userptr->hmmptr.get_gpuva = &xe_hmmptr_get_gpuva;
+			err = drm_svm_hmmptr_init(&userptr->hmmptr,
+				 &vma_userptr_notifier_ops);
 			if (err) {
 				xe_vma_free(vma);
 				return ERR_PTR(err);
 			}
-
-			userptr->notifier_seq = LONG_MAX;
 		}
 
 		xe_vm_get(vm);
@@ -1014,15 +1063,7 @@ static void xe_vma_destroy_late(struct xe_vma *vma)
 		struct xe_userptr_vma *uvma = to_userptr_vma(vma);
 		struct xe_userptr *userptr = &uvma->userptr;
 
-		if (userptr->sg)
-			xe_hmm_userptr_free_sg(uvma);
-
-		/*
-		 * Since userptr pages are not pinned, we can't remove
-		 * the notifer until we're sure the GPU is not accessing
-		 * them anymore
-		 */
-		mmu_interval_notifier_remove(&userptr->notifier);
+		drm_svm_hmmptr_release(&userptr->hmmptr);
 		xe_vm_put(vm);
 	} else if (xe_vma_is_null(vma) || xe_vma_is_system_allocator(vma)) {
 		xe_vm_put(vm);
@@ -3196,8 +3237,8 @@ int xe_vm_invalidate_vma(struct xe_vma *vma)
 	if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
 		if (xe_vma_is_userptr(vma)) {
 			WARN_ON_ONCE(!mmu_interval_check_retry
-				     (&to_userptr_vma(vma)->userptr.notifier,
-				      to_userptr_vma(vma)->userptr.notifier_seq));
+				     (&to_userptr_vma(vma)->userptr.hmmptr.notifier,
+				      to_userptr_vma(vma)->userptr.hmmptr.notifier_seq));
 			WARN_ON_ONCE(!dma_resv_test_signaled(xe_vm_resv(xe_vma_vm(vma)),
 							     DMA_RESV_USAGE_BOOKKEEP));
 
@@ -3283,7 +3324,7 @@ struct xe_vm_snapshot *xe_vm_snapshot_capture(struct xe_vm *vm)
 			snap->snap[i].bo_ofs = xe_vma_bo_offset(vma);
 		} else if (xe_vma_is_userptr(vma)) {
 			struct mm_struct *mm =
-				to_userptr_vma(vma)->userptr.notifier.mm;
+				to_userptr_vma(vma)->userptr.hmmptr.notifier.mm;
 
 			if (mmget_not_zero(mm))
 				snap->snap[i].mm = mm;
diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
index e1d3dd699380..976982972a06 100644
--- a/drivers/gpu/drm/xe/xe_vm_types.h
+++ b/drivers/gpu/drm/xe/xe_vm_types.h
@@ -36,20 +36,11 @@ struct xe_vm_pgtable_update_op;
 
 /** struct xe_userptr - User pointer */
 struct xe_userptr {
+	struct drm_hmmptr hmmptr;
 	/** @invalidate_link: Link for the vm::userptr.invalidated list */
 	struct list_head invalidate_link;
 	/** @userptr: link into VM repin list if userptr. */
 	struct list_head repin_link;
-	/**
-	 * @notifier: MMU notifier for user pointer (invalidation call back)
-	 */
-	struct mmu_interval_notifier notifier;
-	/** @sgt: storage for a scatter gather table */
-	struct sg_table sgt;
-	/** @sg: allocated scatter gather table */
-	struct sg_table *sg;
-	/** @notifier_seq: notifier sequence number */
-	unsigned long notifier_seq;
 	/**
 	 * @initial_bind: user pointer has been bound at least once.
 	 * write: vm->userptr.notifier_lock in read mode and vm->resv held.
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [CI 26/42] drm/xe: Moving to range based vma invalidation
  2024-06-13  4:23 [CI 01/42] mm/hmm: let users to tag specific PFNs Oak Zeng
                   ` (24 preceding siblings ...)
  2024-06-13  4:24 ` [CI 25/42] drm/xe: use drm_hmmptr in xe Oak Zeng
@ 2024-06-13  4:24 ` Oak Zeng
  2024-06-13  4:24 ` [CI 27/42] drm/xe: Support range based page table update Oak Zeng
                   ` (15 subsequent siblings)
  41 siblings, 0 replies; 44+ messages in thread
From: Oak Zeng @ 2024-06-13  4:24 UTC (permalink / raw)
  To: intel-xe

New parameters are introduced to the VMA invalidation function to allow
partial invalidation of a VMA.

For userptr invalidation, we now only invalidate the mmu-notified range
instead of the whole VMA, which is more precise than whole-VMA
invalidation.

For other cases, the whole-VMA invalidation scheme is kept.

One consequence of this change is that we no longer know whether a VMA
is fully invalidated, because a VMA can now be partially invalidated.
The tile_invalidated member is deleted for this reason.

This is preparatory work for the system allocator, where we want to
apply range-based VMA invalidation. It is also reasonable to apply the
same scheme to userptr invalidation.
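
The notifier below clamps the CPU range reported by the mmu notifier to the userptr span and then translates it into the VMA's GPU VA range; as a standalone sketch of that arithmetic (the helper name is illustrative):

/* Sketch: clamp an mmu-notifier range and translate it to GPU VA. */
static void userptr_range_to_gpu_va(struct xe_vma *vma,
				    const struct mmu_notifier_range *range,
				    u64 *gpu_start, u64 *gpu_end)
{
	u64 cpu_start = max_t(u64, xe_vma_userptr(vma), range->start);
	u64 cpu_end = min_t(u64, xe_vma_userptr_end(vma), range->end);
	u64 offset = cpu_start - xe_vma_userptr(vma);

	*gpu_start = xe_vma_start(vma) + offset;
	*gpu_end = *gpu_start + (cpu_end - cpu_start);
}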

Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_bo.c           |  2 +-
 drivers/gpu/drm/xe/xe_gt_pagefault.c | 11 ---------
 drivers/gpu/drm/xe/xe_pt.c           | 16 +++++++++----
 drivers/gpu/drm/xe/xe_pt.h           |  2 +-
 drivers/gpu/drm/xe/xe_vm.c           | 36 +++++++++++++++++++---------
 drivers/gpu/drm/xe/xe_vm.h           |  2 +-
 drivers/gpu/drm/xe/xe_vm_types.h     |  3 ---
 7 files changed, 40 insertions(+), 32 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
index 1a048fcb1496..10d66bb5bad2 100644
--- a/drivers/gpu/drm/xe/xe_bo.c
+++ b/drivers/gpu/drm/xe/xe_bo.c
@@ -514,7 +514,7 @@ static int xe_bo_trigger_rebind(struct xe_device *xe, struct xe_bo *bo,
 			struct xe_vma *vma = gpuva_to_vma(gpuva);
 
 			trace_xe_vma_evict(vma);
-			ret = xe_vm_invalidate_vma(vma);
+			ret = xe_vm_invalidate_vma(vma, xe_vma_start(vma), xe_vma_end(vma));
 			if (XE_WARN_ON(ret))
 				return ret;
 		}
diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c b/drivers/gpu/drm/xe/xe_gt_pagefault.c
index eaf68f0135c1..c3e9331cf1b6 100644
--- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
+++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
@@ -65,12 +65,6 @@ static bool access_is_atomic(enum access_type access_type)
 	return access_type == ACCESS_TYPE_ATOMIC;
 }
 
-static bool vma_is_valid(struct xe_tile *tile, struct xe_vma *vma)
-{
-	return BIT(tile->id) & vma->tile_present &&
-		!(BIT(tile->id) & vma->tile_invalidated);
-}
-
 static bool vma_matches(struct xe_vma *vma, u64 page_addr)
 {
 	if (page_addr > xe_vma_end(vma) - 1 ||
@@ -138,10 +132,6 @@ static int handle_vma_pagefault(struct xe_tile *tile, struct pagefault *pf,
 	trace_xe_vma_pagefault(vma);
 	atomic = access_is_atomic(pf->access_type);
 
-	/* Check if VMA is valid */
-	if (vma_is_valid(tile, vma) && !atomic)
-		return 0;
-
 retry_userptr:
 	if (xe_vma_is_userptr(vma) &&
 	    xe_vma_userptr_check_repin(to_userptr_vma(vma))) {
@@ -175,7 +165,6 @@ static int handle_vma_pagefault(struct xe_tile *tile, struct pagefault *pf,
 
 	dma_fence_wait(fence, false);
 	dma_fence_put(fence);
-	vma->tile_invalidated &= ~BIT(tile->id);
 
 unlock_dma_resv:
 	drm_exec_fini(&exec);
diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
index b30fc855147d..96600ba9e100 100644
--- a/drivers/gpu/drm/xe/xe_pt.c
+++ b/drivers/gpu/drm/xe/xe_pt.c
@@ -792,6 +792,8 @@ static const struct xe_pt_walk_ops xe_pt_zap_ptes_ops = {
  * xe_pt_zap_ptes() - Zap (zero) gpu ptes of an address range
  * @tile: The tile we're zapping for.
  * @vma: GPU VMA detailing address range.
+ * @start: start of the range.
+ * @end: end of the range.
  *
  * Eviction and Userptr invalidation needs to be able to zap the
  * gpu ptes of a given address range in pagefaulting mode.
@@ -803,8 +805,11 @@ static const struct xe_pt_walk_ops xe_pt_zap_ptes_ops = {
  *
  * Return: Whether ptes were actually updated and a TLB invalidation is
  * required.
+ *
+ * FIXME: double-check that xe_pt_walk_shared supports walking a sub-range
+ * of a vma (vs the whole vma)
  */
-bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma)
+bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma, u64 start, u64 end)
 {
 	struct xe_pt_zap_ptes_walk xe_walk = {
 		.base = {
@@ -815,13 +820,16 @@ bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma)
 		.tile = tile,
 	};
 	struct xe_pt *pt = xe_vma_vm(vma)->pt_root[tile->id];
-	u8 pt_mask = (vma->tile_present & ~vma->tile_invalidated);
+	u8 pt_mask = vma->tile_present;
+
+	xe_assert(tile_to_xe(tile), start >= xe_vma_start(vma));
+	xe_assert(tile_to_xe(tile), end <= xe_vma_end(vma));
 
 	if (!(pt_mask & BIT(tile->id)))
 		return false;
 
-	(void)xe_pt_walk_shared(&pt->base, pt->level, xe_vma_start(vma),
-				xe_vma_end(vma), &xe_walk.base);
+	(void)xe_pt_walk_shared(&pt->base, pt->level, start,
+				end, &xe_walk.base);
 
 	return xe_walk.needs_invalidate;
 }
diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h
index 9ab386431cad..aa8ff28e75b0 100644
--- a/drivers/gpu/drm/xe/xe_pt.h
+++ b/drivers/gpu/drm/xe/xe_pt.h
@@ -41,6 +41,6 @@ struct dma_fence *xe_pt_update_ops_run(struct xe_tile *tile,
 void xe_pt_update_ops_fini(struct xe_tile *tile, struct xe_vma_ops *vops);
 void xe_pt_update_ops_abort(struct xe_tile *tile, struct xe_vma_ops *vops);
 
-bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
+bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma, u64 start, u64 end);
 
 #endif
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 3b853306208b..5ae5b556ec1b 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -617,21 +617,31 @@ static bool vma_userptr_invalidate(struct mmu_interval_notifier *mni,
 {
 	struct xe_userptr *userptr = container_of(mni, typeof(*userptr), hmmptr.notifier);
 	struct xe_userptr_vma *uvma = container_of(userptr, typeof(*uvma), userptr);
+	u64 range_start, range_end, range_size, range_offset;
 	struct xe_vma *vma = &uvma->vma;
 	struct xe_vm *vm = xe_vma_vm(vma);
 	struct dma_resv_iter cursor;
 	struct dma_fence *fence;
+	u64 start, end;
 	long err;
 
+	range_start = max_t(u64, xe_vma_userptr(vma), range->start);
+	range_end = min_t(u64, xe_vma_userptr_end(vma), range->end);
+	range_size = range_end - range_start;
+	range_offset = range_start - xe_vma_userptr(vma);
+
 	xe_assert(vm->xe, xe_vma_is_userptr(vma));
 	trace_xe_vma_userptr_invalidate(vma);
 
+	start = xe_vma_start(vma) + range_offset;
+	end = start + range_size;
+
 	if (!mmu_notifier_range_blockable(range))
 		return false;
 
 	vm_dbg(&xe_vma_vm(vma)->xe->drm,
 	       "NOTIFIER: addr=0x%016llx, range=0x%016llx",
-		xe_vma_start(vma), xe_vma_size(vma));
+		start, range_size);
 
 	down_write(&vm->userptr.notifier_lock);
 	mmu_interval_set_seq(mni, cur_seq);
@@ -674,11 +684,11 @@ static bool vma_userptr_invalidate(struct mmu_interval_notifier *mni,
 	XE_WARN_ON(err <= 0);
 
 	if (xe_vm_in_fault_mode(vm)) {
-		err = xe_vm_invalidate_vma(vma);
+		err = xe_vm_invalidate_vma(vma, start, end);
 		XE_WARN_ON(err);
 	}
 
-	xe_vma_userptr_dma_unmap_pages(uvma, xe_vma_userptr(vma), xe_vma_userptr_end(vma));
+	xe_vma_userptr_dma_unmap_pages(uvma, range_start, range_end);
 
 	trace_xe_vma_userptr_invalidate_complete(vma);
 
@@ -721,7 +731,8 @@ int xe_vm_userptr_pin(struct xe_vm *vm)
 					      DMA_RESV_USAGE_BOOKKEEP,
 					      false, MAX_SCHEDULE_TIMEOUT);
 
-			err = xe_vm_invalidate_vma(&uvma->vma);
+			err = xe_vm_invalidate_vma(&uvma->vma, xe_vma_start(&uvma->vma),
+				     xe_vma_end(&uvma->vma));
 			xe_vm_unlock(vm);
 			if (err)
 				return err;
@@ -3207,8 +3218,10 @@ void xe_vm_unlock(struct xe_vm *vm)
 }
 
 /**
- * xe_vm_invalidate_vma - invalidate GPU mappings for VMA without a lock
+ * xe_vm_invalidate_vma - invalidate GPU mappings for a range of VMA without a lock
  * @vma: VMA to invalidate
+ * @start: start of the range.
+ * @end: end of the range.
  *
  * Walks a list of page tables leaves which it memset the entries owned by this
  * VMA to zero, invalidates the TLBs, and block until TLBs invalidation is
@@ -3216,7 +3229,7 @@ void xe_vm_unlock(struct xe_vm *vm)
  *
  * Returns 0 for success, negative error code otherwise.
  */
-int xe_vm_invalidate_vma(struct xe_vma *vma)
+int xe_vm_invalidate_vma(struct xe_vma *vma, u64 start, u64 end)
 {
 	struct xe_device *xe = xe_vma_vm(vma)->xe;
 	struct xe_tile *tile;
@@ -3227,11 +3240,13 @@ int xe_vm_invalidate_vma(struct xe_vma *vma)
 
 	xe_assert(xe, !xe_vma_is_null(vma));
 	xe_assert(xe, !xe_vma_is_system_allocator(vma));
+	xe_assert(xe, start >= xe_vma_start(vma));
+	xe_assert(xe, end <= xe_vma_end(vma));
 	trace_xe_vma_invalidate(vma);
 
 	vm_dbg(&xe_vma_vm(vma)->xe->drm,
 	       "INVALIDATE: addr=0x%016llx, range=0x%016llx",
-		xe_vma_start(vma), xe_vma_size(vma));
+		start, end - start);
 
 	/* Check that we don't race with page-table updates */
 	if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
@@ -3248,14 +3263,15 @@ int xe_vm_invalidate_vma(struct xe_vma *vma)
 	}
 
 	for_each_tile(tile, xe, id) {
-		if (xe_pt_zap_ptes(tile, vma)) {
+		if (xe_pt_zap_ptes(tile, vma, start, end)) {
 			tile_needs_invalidate |= BIT(id);
 			xe_device_wmb(xe);
 			/*
 			 * FIXME: We potentially need to invalidate multiple
 			 * GTs within the tile
 			 */
-			seqno[id] = xe_gt_tlb_invalidation_vma(tile->primary_gt, NULL, vma);
+			seqno[id] = xe_gt_tlb_invalidation_range(tile->primary_gt, NULL,
+					     start, end, xe_vma_vm(vma)->usm.asid);
 			if (seqno[id] < 0)
 				return seqno[id];
 		}
@@ -3269,8 +3285,6 @@ int xe_vm_invalidate_vma(struct xe_vma *vma)
 		}
 	}
 
-	vma->tile_invalidated = vma->tile_mask;
-
 	return 0;
 }
 
diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
index 89f3306561ad..7c10f6c60b63 100644
--- a/drivers/gpu/drm/xe/xe_vm.h
+++ b/drivers/gpu/drm/xe/xe_vm.h
@@ -223,7 +223,7 @@ int xe_vm_rebind(struct xe_vm *vm, bool rebind_worker);
 struct dma_fence *xe_vma_rebind(struct xe_vm *vm, struct xe_vma *vma,
 				u8 tile_mask);
 
-int xe_vm_invalidate_vma(struct xe_vma *vma);
+int xe_vm_invalidate_vma(struct xe_vma *vma, u64 start, u64 end);
 
 static inline void xe_vm_queue_rebind_worker(struct xe_vm *vm)
 {
diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
index 976982972a06..4d9707c19031 100644
--- a/drivers/gpu/drm/xe/xe_vm_types.h
+++ b/drivers/gpu/drm/xe/xe_vm_types.h
@@ -75,9 +75,6 @@ struct xe_vma {
 		struct work_struct destroy_work;
 	};
 
-	/** @tile_invalidated: VMA has been invalidated */
-	u8 tile_invalidated;
-
 	/** @tile_mask: Tile mask of where to create binding for this VMA */
 	u8 tile_mask;
 
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [CI 27/42] drm/xe: Support range based page table update
  2024-06-13  4:23 [CI 01/42] mm/hmm: let users to tag specific PFNs Oak Zeng
                   ` (25 preceding siblings ...)
  2024-06-13  4:24 ` [CI 26/42] drm/xe: Moving to range based vma invalidation Oak Zeng
@ 2024-06-13  4:24 ` Oak Zeng
  2024-06-13  4:24 ` [CI 28/42] drm/xe/uapi: Add DRM_XE_VM_CREATE_FLAG_PARTICIPATE_SVM flag Oak Zeng
                   ` (14 subsequent siblings)
  41 siblings, 0 replies; 44+ messages in thread
From: Oak Zeng @ 2024-06-13  4:24 UTC (permalink / raw)
  To: intel-xe

Currently the page table update interface only supports binding a whole
xe_vma. This works fine for a BO-based driver, but for the system
allocator we need to partially bind a VMA to the GPU page table.

GPU page table update interfaces such as xe_vma_rebind are modified to
support partial VMA binds: binding range (start, end) parameters are
added to a few VMA bind and page table update functions.

VMA unbind is still whole-xe_vma based, as there is no requirement to
unbind a VMA partially.

There is no functional change in this patch; it is only an interface
change in preparation for the upcoming system allocator code.
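
As an illustration of the new interface (the chunking policy below is hypothetical and not part of this patch), a page fault handler could bind just an aligned chunk around the faulting address rather than the whole VMA:

#define FAULT_CHUNK_SIZE SZ_2M	/* hypothetical binding granule */

/* Sketch: partially bind a VMA around the faulting address on one tile. */
static struct dma_fence *bind_fault_chunk(struct xe_vm *vm, struct xe_vma *vma,
					  u64 fault_addr, u8 tile_mask)
{
	u64 start = max_t(u64, xe_vma_start(vma),
			  ALIGN_DOWN(fault_addr, FAULT_CHUNK_SIZE));
	u64 end = min_t(u64, xe_vma_end(vma), start + FAULT_CHUNK_SIZE);

	return xe_vma_rebind(vm, vma, start, end, tile_mask);
}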

Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_gt_pagefault.c |  3 +-
 drivers/gpu/drm/xe/xe_pt.c           | 60 ++++++++++++++++------------
 drivers/gpu/drm/xe/xe_vm.c           | 23 ++++++-----
 drivers/gpu/drm/xe/xe_vm.h           |  2 +-
 4 files changed, 51 insertions(+), 37 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c b/drivers/gpu/drm/xe/xe_gt_pagefault.c
index c3e9331cf1b6..3b98499ad614 100644
--- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
+++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
@@ -154,7 +154,8 @@ static int handle_vma_pagefault(struct xe_tile *tile, struct pagefault *pf,
 
 		/* Bind VMA only to the GT that has faulted */
 		trace_xe_vma_pf_bind(vma);
-		fence = xe_vma_rebind(vm, vma, BIT(tile->id));
+		fence = xe_vma_rebind(vm, vma, xe_vma_start(vma),
+			    xe_vma_end(vma), BIT(tile->id));
 		if (IS_ERR(fence)) {
 			err = PTR_ERR(fence);
 			if (xe_vm_validate_should_retry(&exec, err, &end))
diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
index 96600ba9e100..91b61fa80acb 100644
--- a/drivers/gpu/drm/xe/xe_pt.c
+++ b/drivers/gpu/drm/xe/xe_pt.c
@@ -583,6 +583,8 @@ static const struct xe_pt_walk_ops xe_pt_stage_bind_ops = {
  * range.
  * @tile: The tile we're building for.
  * @vma: The vma indicating the address range.
+ * @start: start of the address range to bind, must be inside vma's va range
+ * @end: end of the address range, must be inside vma's va range
  * @entries: Storage for the update entries used for connecting the tree to
  * the main tree at commit time.
  * @num_entries: On output contains the number of @entries used.
@@ -597,7 +599,7 @@ static const struct xe_pt_walk_ops xe_pt_stage_bind_ops = {
  * Return 0 on success, negative error code on error.
  */
 static int
-xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
+xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma, u64 start, u64 end,
 		 struct xe_vm_pgtable_update *entries, u32 *num_entries)
 {
 	struct xe_device *xe = tile_to_xe(tile);
@@ -614,7 +616,7 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
 		.vm = xe_vma_vm(vma),
 		.tile = tile,
 		.curs = &curs,
-		.va_curs_start = xe_vma_start(vma),
+		.va_curs_start = start,
 		.vma = vma,
 		.wupd.entries = entries,
 		.needs_64K = (xe_vma_vm(vma)->flags & XE_VM_FLAG_64K) && is_devmem,
@@ -622,6 +624,8 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
 	struct xe_pt *pt = xe_vma_vm(vma)->pt_root[tile->id];
 	int ret;
 
+	xe_assert(xe, start >= xe_vma_start(vma));
+	xe_assert(xe, end <= xe_vma_end(vma));
 	/**
 	 * Default atomic expectations for different allocation scenarios are as follows:
 	 *
@@ -668,21 +672,24 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
 	xe_bo_assert_held(bo);
 
 	if (!xe_vma_is_null(vma)) {
+		u64 size = end - start;
+		u64 offset = start - xe_vma_start(vma);
+		u64 page_idx = offset >> PAGE_SHIFT;
 		if (xe_vma_is_userptr(vma))
-			xe_res_first_dma(to_userptr_vma(vma)->userptr.hmmptr.dma_addr,
+			xe_res_first_dma(to_userptr_vma(vma)->userptr.hmmptr.dma_addr + page_idx,
 					0, xe_vma_size(vma), 0, &curs);
 		else if (xe_bo_is_vram(bo) || xe_bo_is_stolen(bo))
-			xe_res_first(bo->ttm.resource, xe_vma_bo_offset(vma),
-				     xe_vma_size(vma), &curs);
+			xe_res_first(bo->ttm.resource, xe_vma_bo_offset(vma) + offset,
+				     size, &curs);
 		else
-			xe_res_first_sg(xe_bo_sg(bo), xe_vma_bo_offset(vma),
-					xe_vma_size(vma), &curs);
+			xe_res_first_sg(xe_bo_sg(bo), xe_vma_bo_offset(vma) + offset,
+					size, &curs);
 	} else {
 		curs.size = xe_vma_size(vma);
 	}
 
-	ret = xe_pt_walk_range(&pt->base, pt->level, xe_vma_start(vma),
-			       xe_vma_end(vma), &xe_walk.base);
+	ret = xe_pt_walk_range(&pt->base, pt->level, start,
+			       end, &xe_walk.base);
 
 	*num_entries = xe_walk.wupd.num_used_entries;
 	return ret;
@@ -984,13 +991,13 @@ static void xe_pt_free_bind(struct xe_vm_pgtable_update *entries,
 }
 
 static int
-xe_pt_prepare_bind(struct xe_tile *tile, struct xe_vma *vma,
+xe_pt_prepare_bind(struct xe_tile *tile, struct xe_vma *vma, u64 start, u64 end,
 		   struct xe_vm_pgtable_update *entries, u32 *num_entries)
 {
 	int err;
 
 	*num_entries = 0;
-	err = xe_pt_stage_bind(tile, vma, entries, num_entries);
+	err = xe_pt_stage_bind(tile, vma, start, end, entries, num_entries);
 	if (!err)
 		xe_tile_assert(tile, *num_entries);
 
@@ -1643,7 +1650,7 @@ xe_pt_commit_prepare_unbind(struct xe_vma *vma,
 
 static void
 xe_pt_update_ops_rfence_interval(struct xe_vm_pgtable_update_ops *pt_update_ops,
-				 struct xe_vma *vma)
+				 u64 start_va, u64 end_va)
 {
 	u32 current_op = pt_update_ops->current_op;
 	struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[current_op];
@@ -1658,8 +1665,8 @@ xe_pt_update_ops_rfence_interval(struct xe_vm_pgtable_update_ops *pt_update_ops,
 	}
 
 	/* Greedy (non-optimal) calculation but simple */
-	start = ALIGN_DOWN(xe_vma_start(vma), 0x1ull << xe_pt_shift(level));
-	last = ALIGN(xe_vma_end(vma), 0x1ull << xe_pt_shift(level)) - 1;
+	start = ALIGN_DOWN(start_va, 0x1ull << xe_pt_shift(level));
+	last = ALIGN(end_va, 0x1ull << xe_pt_shift(level)) - 1;
 
 	if (start < pt_update_ops->start)
 		pt_update_ops->start = start;
@@ -1678,7 +1685,7 @@ static int vma_reserve_fences(struct xe_device *xe, struct xe_vma *vma)
 
 static int bind_op_prepare(struct xe_vm *vm, struct xe_tile *tile,
 			   struct xe_vm_pgtable_update_ops *pt_update_ops,
-			   struct xe_vma *vma)
+			   struct xe_vma *vma, u64 start, u64 end)
 {
 	u32 current_op = pt_update_ops->current_op;
 	struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[current_op];
@@ -1686,10 +1693,12 @@ static int bind_op_prepare(struct xe_vm *vm, struct xe_tile *tile,
 
 	xe_tile_assert(tile, !xe_vma_is_system_allocator(vma));
 	xe_bo_assert_held(xe_vma_bo(vma));
+	xe_assert(vm->xe, start >= xe_vma_start(vma));
+	xe_assert(vm->xe, end <= xe_vma_end(vma));
 
 	vm_dbg(&xe_vma_vm(vma)->xe->drm,
 	       "Preparing bind, with range [%llx...%llx)\n",
-	       xe_vma_start(vma), xe_vma_end(vma) - 1);
+	       start, end - 1);
 
 	pt_op->vma = NULL;
 	pt_op->bind = true;
@@ -1699,7 +1708,7 @@ static int bind_op_prepare(struct xe_vm *vm, struct xe_tile *tile,
 	if (err)
 		return err;
 
-	err = xe_pt_prepare_bind(tile, vma, pt_op->entries,
+	err = xe_pt_prepare_bind(tile, vma, start, end, pt_op->entries,
 				 &pt_op->num_entries);
 	if (!err) {
 		xe_tile_assert(tile, pt_op->num_entries <=
@@ -1707,7 +1716,7 @@ static int bind_op_prepare(struct xe_vm *vm, struct xe_tile *tile,
 		xe_vm_dbg_print_entries(tile_to_xe(tile), pt_op->entries,
 					pt_op->num_entries, true);
 
-		xe_pt_update_ops_rfence_interval(pt_update_ops, vma);
+		xe_pt_update_ops_rfence_interval(pt_update_ops, start, end);
 		++pt_update_ops->current_op;
 		pt_update_ops->needs_userptr_lock |= xe_vma_is_userptr(vma);
 
@@ -1777,7 +1786,7 @@ static int unbind_op_prepare(struct xe_tile *tile,
 
 	xe_vm_dbg_print_entries(tile_to_xe(tile), pt_op->entries,
 				pt_op->num_entries, false);
-	xe_pt_update_ops_rfence_interval(pt_update_ops, vma);
+	xe_pt_update_ops_rfence_interval(pt_update_ops, xe_vma_start(vma), xe_vma_end(vma));
 	++pt_update_ops->current_op;
 	pt_update_ops->needs_userptr_lock |= xe_vma_is_userptr(vma);
 	pt_update_ops->needs_invalidation = true;
@@ -1802,7 +1811,8 @@ static int op_prepare(struct xe_vm *vm,
 		    op->map.is_system_allocator)
 			break;
 
-		err = bind_op_prepare(vm, tile, pt_update_ops, op->map.vma);
+		err = bind_op_prepare(vm, tile, pt_update_ops, op->map.vma,
+			op->base.map.va.addr, op->base.map.va.addr + op->base.map.va.range);
 		pt_update_ops->wait_vm_kernel = true;
 		break;
 	case DRM_GPUVA_OP_REMAP:
@@ -1815,13 +1825,13 @@ static int op_prepare(struct xe_vm *vm,
 		err = unbind_op_prepare(tile, pt_update_ops, old);
 
 		if (!err && op->remap.prev) {
-			err = bind_op_prepare(vm, tile, pt_update_ops,
-					      op->remap.prev);
+			err = bind_op_prepare(vm, tile, pt_update_ops, op->remap.prev,
+				xe_vma_start(op->remap.prev), xe_vma_end(op->remap.prev));
 			pt_update_ops->wait_vm_bookkeep = true;
 		}
 		if (!err && op->remap.next) {
-			err = bind_op_prepare(vm, tile, pt_update_ops,
-					      op->remap.next);
+			err = bind_op_prepare(vm, tile, pt_update_ops, op->remap.next,
+				xe_vma_start(op->remap.next), xe_vma_end(op->remap.next));
 			pt_update_ops->wait_vm_bookkeep = true;
 		}
 		break;
@@ -1843,7 +1853,7 @@ static int op_prepare(struct xe_vm *vm,
 		if (xe_vma_is_system_allocator(vma))
 			break;
 
-		err = bind_op_prepare(vm, tile, pt_update_ops, vma);
+		err = bind_op_prepare(vm, tile, pt_update_ops, vma, xe_vma_start(vma), xe_vma_end(vma));
 		pt_update_ops->wait_vm_kernel = true;
 		break;
 	}
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 5ae5b556ec1b..cac538de5a9d 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -803,15 +803,15 @@ static void xe_vma_ops_incr_pt_update_ops(struct xe_vma_ops *vops, u8 tile_mask)
 }
 
 static void xe_vm_populate_rebind(struct xe_vma_op *op, struct xe_vma *vma,
-				  u8 tile_mask)
+				  u64 start, u64 end, u8 tile_mask)
 {
 	INIT_LIST_HEAD(&op->link);
 	op->tile_mask = tile_mask;
 	op->base.op = DRM_GPUVA_OP_MAP;
-	op->base.map.va.addr = vma->gpuva.va.addr;
-	op->base.map.va.range = vma->gpuva.va.range;
+	op->base.map.va.addr = start;
+	op->base.map.va.range = end - start;
 	op->base.map.gem.obj = vma->gpuva.gem.obj;
-	op->base.map.gem.offset = vma->gpuva.gem.offset;
+	op->base.map.gem.offset = vma->gpuva.gem.offset + (start - xe_vma_start(vma));
 	op->map.vma = vma;
 	op->map.immediate = true;
 	op->map.dumpable = vma->gpuva.flags & XE_VMA_DUMPABLE;
@@ -819,7 +819,7 @@ static void xe_vm_populate_rebind(struct xe_vma_op *op, struct xe_vma *vma,
 }
 
 static int xe_vm_ops_add_rebind(struct xe_vma_ops *vops, struct xe_vma *vma,
-				u8 tile_mask)
+				u64 start, u64 end, u8 tile_mask)
 {
 	struct xe_vma_op *op;
 
@@ -827,7 +827,7 @@ static int xe_vm_ops_add_rebind(struct xe_vma_ops *vops, struct xe_vma *vma,
 	if (!op)
 		return -ENOMEM;
 
-	xe_vm_populate_rebind(op, vma, tile_mask);
+	xe_vm_populate_rebind(op, vma, start, end, tile_mask);
 	list_add_tail(&op->link, &vops->list);
 	xe_vma_ops_incr_pt_update_ops(vops, tile_mask);
 
@@ -866,8 +866,8 @@ int xe_vm_rebind(struct xe_vm *vm, bool rebind_worker)
 		else
 			trace_xe_vma_rebind_exec(vma);
 
-		err = xe_vm_ops_add_rebind(&vops, vma,
-					   vma->tile_present);
+		err = xe_vm_ops_add_rebind(&vops, vma, xe_vma_start(vma),
+					   xe_vma_end(vma), vma->tile_present);
 		if (err)
 			goto free_ops;
 	}
@@ -895,7 +895,8 @@ int xe_vm_rebind(struct xe_vm *vm, bool rebind_worker)
 	return err;
 }
 
-struct dma_fence *xe_vma_rebind(struct xe_vm *vm, struct xe_vma *vma, u8 tile_mask)
+struct dma_fence *xe_vma_rebind(struct xe_vm *vm, struct xe_vma *vma,
+					    u64 start, u64 end, u8 tile_mask)
 {
 	struct dma_fence *fence = NULL;
 	struct xe_vma_ops vops;
@@ -907,6 +908,8 @@ struct dma_fence *xe_vma_rebind(struct xe_vm *vm, struct xe_vma *vma, u8 tile_ma
 	lockdep_assert_held(&vm->lock);
 	xe_vm_assert_held(vm);
 	xe_assert(vm->xe, xe_vm_in_fault_mode(vm));
+	xe_assert(vm->xe, start >= xe_vma_start(vma));
+	xe_assert(vm->xe, end <= xe_vma_end(vma));
 
 	xe_vma_ops_init(&vops, vm, NULL, NULL, 0);
 	for_each_tile(tile, vm->xe, id) {
@@ -915,7 +918,7 @@ struct dma_fence *xe_vma_rebind(struct xe_vm *vm, struct xe_vma *vma, u8 tile_ma
 			xe_tile_migrate_exec_queue(tile);
 	}
 
-	err = xe_vm_ops_add_rebind(&vops, vma, tile_mask);
+	err = xe_vm_ops_add_rebind(&vops, vma, start, end, tile_mask);
 	if (err)
 		return ERR_PTR(err);
 
diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
index 7c10f6c60b63..680c0c49b2f4 100644
--- a/drivers/gpu/drm/xe/xe_vm.h
+++ b/drivers/gpu/drm/xe/xe_vm.h
@@ -221,7 +221,7 @@ int xe_vm_userptr_check_repin(struct xe_vm *vm);
 
 int xe_vm_rebind(struct xe_vm *vm, bool rebind_worker);
 struct dma_fence *xe_vma_rebind(struct xe_vm *vm, struct xe_vma *vma,
-				u8 tile_mask);
+				u64 start, u64 end, u8 tile_mask);
 
 int xe_vm_invalidate_vma(struct xe_vma *vma, u64 start, u64 end);
 
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [CI 28/42] drm/xe/uapi: Add DRM_XE_VM_CREATE_FLAG_PARTICIPATE_SVM flag
  2024-06-13  4:23 [CI 01/42] mm/hmm: let users to tag specific PFNs Oak Zeng
                   ` (26 preceding siblings ...)
  2024-06-13  4:24 ` [CI 27/42] drm/xe: Support range based page table update Oak Zeng
@ 2024-06-13  4:24 ` Oak Zeng
  2024-06-13  4:24 ` [CI 29/42] drm/xe/svm: Create userptr if page fault occurs on system_allocator VMA Oak Zeng
                   ` (13 subsequent siblings)
  41 siblings, 0 replies; 44+ messages in thread
From: Oak Zeng @ 2024-06-13  4:24 UTC (permalink / raw)
  To: intel-xe

When an xe_vm is created with DRM_XE_VM_CREATE_FLAG_PARTICIPATE_SVM,
the xe_vm shares the same process address space as the CPU process
that created it.
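
For illustration only (not part of the patch), userspace would request
this at VM creation time roughly as follows. The flag and field names
come from this series; fd is assumed to be an open xe render node, the
usual uapi/ioctl/errno includes are assumed, and combining the flag
with LR/fault mode is an assumption rather than a documented rule:

	/* Illustrative sketch: create a VM that participates in SVM */
	int create_svm_vm(int fd, __u32 *vm_id)
	{
		struct drm_xe_vm_create create = {
			.flags = DRM_XE_VM_CREATE_FLAG_LR_MODE |
				 DRM_XE_VM_CREATE_FLAG_FAULT_MODE |
				 DRM_XE_VM_CREATE_FLAG_PARTICIPATE_SVM,
		};

		if (ioctl(fd, DRM_IOCTL_XE_VM_CREATE, &create))
			return -errno;

		*vm_id = create.vm_id;	/* VM now shares the CPU address space */
		return 0;
	}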

Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Suggested-by: Christian Konig <Christian.koenig@amd.com>
Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_vm.c       | 2 ++
 drivers/gpu/drm/xe/xe_vm_types.h | 1 +
 include/uapi/drm/xe_drm.h        | 4 ++++
 3 files changed, 7 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index cac538de5a9d..499b95f6422b 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -1832,6 +1832,8 @@ int xe_vm_create_ioctl(struct drm_device *dev, void *data,
 		flags |= XE_VM_FLAG_LR_MODE;
 	if (args->flags & DRM_XE_VM_CREATE_FLAG_FAULT_MODE)
 		flags |= XE_VM_FLAG_FAULT_MODE;
+	if (args->flags & DRM_XE_VM_CREATE_FLAG_PARTICIPATE_SVM)
+		flags |= XE_VM_FLAG_PARTICIPATE_SVM;
 
 	vm = xe_vm_create(xe, flags);
 	if (IS_ERR(vm))
diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
index 4d9707c19031..165c2ca258de 100644
--- a/drivers/gpu/drm/xe/xe_vm_types.h
+++ b/drivers/gpu/drm/xe/xe_vm_types.h
@@ -143,6 +143,7 @@ struct xe_vm {
 #define XE_VM_FLAG_BANNED		BIT(5)
 #define XE_VM_FLAG_TILE_ID(flags)	FIELD_GET(GENMASK(7, 6), flags)
 #define XE_VM_FLAG_SET_TILE_ID(tile)	FIELD_PREP(GENMASK(7, 6), (tile)->id)
+#define XE_VM_FLAG_PARTICIPATE_SVM		BIT(8)
 	unsigned long flags;
 
 	/** @composite_fence_ctx: context composite fence */
diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
index af8ee6f929d4..ee09ed1ea3c2 100644
--- a/include/uapi/drm/xe_drm.h
+++ b/include/uapi/drm/xe_drm.h
@@ -830,6 +830,9 @@ struct drm_xe_gem_mmap_offset {
  *    demand when accessed, and also allows per-VM overcommit of memory.
  *    The xe driver internally uses recoverable pagefaults to implement
  *    this.
+ *  - %DRM_XE_VM_CREATE_FLAG_PARTICIPATE_SVM - This vm participates in
+ *    SVM (shared virtual memory), which means this vm shares one process
+ *    address space with the CPU process that created this xe_vm.
  */
 struct drm_xe_vm_create {
 	/** @extensions: Pointer to the first extension struct, if any */
@@ -838,6 +841,7 @@ struct drm_xe_vm_create {
 #define DRM_XE_VM_CREATE_FLAG_SCRATCH_PAGE	(1 << 0)
 #define DRM_XE_VM_CREATE_FLAG_LR_MODE	        (1 << 1)
 #define DRM_XE_VM_CREATE_FLAG_FAULT_MODE	(1 << 2)
+#define DRM_XE_VM_CREATE_FLAG_PARTICIPATE_SVM	(1 << 3)
 	/** @flags: Flags */
 	__u32 flags;
 
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [CI 29/42] drm/xe/svm: Create userptr if page fault occurs on system_allocator VMA
  2024-06-13  4:23 [CI 01/42] mm/hmm: let users to tag specific PFNs Oak Zeng
                   ` (27 preceding siblings ...)
  2024-06-13  4:24 ` [CI 28/42] drm/xe/uapi: Add DRM_XE_VM_CREATE_FLAG_PARTICIPATE_SVM flag Oak Zeng
@ 2024-06-13  4:24 ` Oak Zeng
  2024-06-13  4:24 ` [CI 30/42] drm/xe/svm: Add faulted userptr VMA garbage collector Oak Zeng
                   ` (12 subsequent siblings)
  41 siblings, 0 replies; 44+ messages in thread
From: Oak Zeng @ 2024-06-13  4:24 UTC (permalink / raw)
  To: intel-xe

From: Matthew Brost <matthew.brost@intel.com>

If a page fault occurs on a system_allocator VMA, create a userptr VMA
to replace the faulted region and map it to the GPU.

v1: Pass the userptr to the req_offset of the sm_map_ops_create
function. This fixes a failure with malloc'd memory (Oak)
Drop the xe_vma_is_userptr(vma) condition. XE_VMA_FAULT_USERPTR
is enough to determine a fault userptr (Matt)

v2: Don't pin the fault userptr during fault userptr creation. A fault
userptr is created during a GPU page fault. After creation, we decide
whether it needs to be migrated to GPU vram, so pinning the fault
userptr after migration is enough; a pin before migration is
redundant. (Oak)

v3: Remove the mm field from xe_vm as mm is now added to drm_gpuvm. Rebase (Oak)

Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Brian Welty <brian.welty@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_gt_pagefault.c |  12 +++
 drivers/gpu/drm/xe/xe_vm.c           | 118 +++++++++++++++++++++++++--
 drivers/gpu/drm/xe/xe_vm.h           |   7 ++
 drivers/gpu/drm/xe/xe_vm_types.h     |   1 +
 4 files changed, 133 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c b/drivers/gpu/drm/xe/xe_gt_pagefault.c
index 3b98499ad614..0bdde3164cd3 100644
--- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
+++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
@@ -132,6 +132,18 @@ static int handle_vma_pagefault(struct xe_tile *tile, struct pagefault *pf,
 	trace_xe_vma_pagefault(vma);
 	atomic = access_is_atomic(pf->access_type);
 
+	/*
+	 * Create userptr VMA if fault occurs in a range reserved for system
+	 * allocator.
+	 */
+	if (xe_vma_is_system_allocator(vma)) {
+		vma = xe_vm_fault_userptr(vm, pf->page_addr);
+		if (IS_ERR(vma)) {
+			xe_vm_kill(vm, true);
+			return PTR_ERR(vma);
+		}
+	}
+
 retry_userptr:
 	if (xe_vma_is_userptr(vma) &&
 	    xe_vma_userptr_check_repin(to_userptr_vma(vma))) {
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 499b95f6422b..dba7a64075e9 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -1482,7 +1482,8 @@ struct xe_vm *xe_vm_create(struct xe_device *xe, u32 flags)
 	}
 
 	drm_gpuvm_init(&vm->gpuvm, "Xe VM", DRM_GPUVM_RESV_PROTECTED, &xe->drm,
-		       vm_resv_obj, 0, vm->size, 0, 0, &gpuvm_ops, false);
+	           vm_resv_obj, 0, vm->size, 0, 0, &gpuvm_ops,
+	           flags & XE_VM_FLAG_PARTICIPATE_SVM);
 
 	drm_gem_object_put(vm_resv_obj);
 
@@ -2093,7 +2094,7 @@ static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
 	if (bo)
 		drm_exec_fini(&exec);
 
-	if (xe_vma_is_userptr(vma)) {
+	if (xe_vma_is_userptr(vma) && !xe_vma_is_fault_userptr(vma)) {
 		err = xe_vma_userptr_pin_pages(to_userptr_vma(vma));
 		if (err) {
 			prep_vma_destroy(vm, vma, false);
@@ -2207,8 +2208,11 @@ static int xe_vma_op_commit(struct xe_vm *vm, struct xe_vma_op *op)
 	return err;
 }
 
-static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
-				   struct xe_vma_ops *vops)
+static int vm_bind_ioctl_ops_update_gpuvm_state(struct xe_vm *vm,
+						struct drm_gpuva_ops *ops,
+						struct xe_sync_entry *syncs,
+						u32 num_syncs,
+						struct xe_vma_ops *vops)
 {
 	struct xe_device *xe = vm->xe;
 	struct drm_gpuva_op *__op;
@@ -3145,7 +3149,8 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 			goto unwind_ops;
 		}
 
-		err = vm_bind_ioctl_ops_parse(vm, ops[i], &vops);
+		err = vm_bind_ioctl_ops_update_gpuvm_state(vm, ops[i], syncs,
+							   num_syncs, &vops);
 		if (err)
 			goto unwind_ops;
 	}
@@ -3464,3 +3469,106 @@ void xe_vm_snapshot_free(struct xe_vm_snapshot *snap)
 	}
 	kvfree(snap);
 }
+
+/**
+ * xe_vm_fault_userptr() - VM fault userptr
+ * @vm: VM
+ * @fault_addr: fault address
+ *
+ * Create userptr VMA from fault address
+ *
+ * Return: newly created userptr VMA on success, ERR_PTR on failure
+ */
+struct xe_vma *xe_vm_fault_userptr(struct xe_vm *vm, u64 fault_addr)
+{
+	struct vm_area_struct *vas;
+	struct mm_struct *mm = vm->gpuvm.mm;
+	struct xe_vma_ops vops;
+	struct drm_gpuva_ops *ops = NULL;
+	struct drm_gpuva_op *__op;
+	struct xe_vma *vma = NULL;
+	u64 start, range;
+	int err;
+
+	vm_dbg(&vm->xe->drm, "FAULT: addr=0x%016llx", fault_addr);
+
+	if (!mmget_not_zero(mm))
+		return ERR_PTR(-EFAULT);
+
+	kthread_use_mm(mm);
+
+	mmap_read_lock(mm);
+	vas = find_vma_intersection(mm, fault_addr, fault_addr + 4);
+	if (!vas) {
+		err = -ENOENT;
+		goto err_unlock;
+	}
+
+	vm_dbg(&vm->xe->drm, "FOUND VAS: vm_start=0x%016lx, vm_end=0x%016lx",
+	       vas->vm_start, vas->vm_end);
+
+	start = vas->vm_start;
+	range = vas->vm_end - vas->vm_start;
+	mmap_read_unlock(mm);
+
+	ops = drm_gpuvm_sm_map_ops_create(&vm->gpuvm, start, range, 0, start);
+	if (IS_ERR(ops)) {
+		err = PTR_ERR(ops);
+		goto err_kthread;
+	}
+
+	drm_gpuva_for_each_op(__op, ops)
+		print_op(vm->xe, __op);
+
+	xe_vma_ops_init(&vops, vm, NULL, NULL, 0);
+	err = vm_bind_ioctl_ops_update_gpuvm_state(vm, ops, NULL, 0, &vops);
+	if (err)
+		goto err_kthread;
+
+	/*
+	 * No need to execute ops as we just want to update GPUVM state, page
+	 * fault handler will update GPU page tables. Find VMA that needs GPU
+	 * mapping and return to page fault handler.
+	 */
+	xe_vm_lock(vm, false);
+	drm_gpuva_for_each_op(__op, ops) {
+		struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
+
+		if (__op->op == DRM_GPUVA_OP_MAP) {
+			xe_assert(vm->xe, !vma);
+			vma = op->map.vma;
+			vma->gpuva.flags |= XE_VMA_FAULT_USERPTR;
+		} else if (__op->op == DRM_GPUVA_OP_UNMAP) {
+			xe_vma_destroy(gpuva_to_vma(op->base.unmap.va), NULL);
+		} else if (__op->op == DRM_GPUVA_OP_REMAP) {
+			xe_vma_destroy(gpuva_to_vma(op->base.remap.unmap->va),
+				       NULL);
+		}
+	}
+	xe_vm_unlock(vm);
+
+	kthread_unuse_mm(mm);
+	mmput(mm);
+	drm_gpuva_ops_free(&vm->gpuvm, ops);
+
+	return vma;
+
+err_unlock:
+	mmap_read_unlock(mm);
+err_kthread:
+	kthread_unuse_mm(mm);
+	mmput(mm);
+	if (ops) {
+		drm_gpuva_for_each_op_reverse(__op, ops) {
+			struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
+
+			xe_vma_op_unwind(vm, op,
+					 op->flags & XE_VMA_OP_COMMITTED,
+					 op->flags & XE_VMA_OP_PREV_COMMITTED,
+					 op->flags & XE_VMA_OP_NEXT_COMMITTED);
+		}
+		drm_gpuva_ops_free(&vm->gpuvm, ops);
+	}
+
+	return ERR_PTR(err);
+}
diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
index 680c0c49b2f4..a31409b87b8a 100644
--- a/drivers/gpu/drm/xe/xe_vm.h
+++ b/drivers/gpu/drm/xe/xe_vm.h
@@ -172,6 +172,11 @@ static inline bool xe_vma_is_userptr(struct xe_vma *vma)
 		!xe_vma_is_system_allocator(vma);
 }
 
+static inline bool xe_vma_is_fault_userptr(struct xe_vma *vma)
+{
+	return vma->gpuva.flags & XE_VMA_FAULT_USERPTR;
+}
+
 /**
  * to_userptr_vma() - Return a pointer to an embedding userptr vma
  * @vma: Pointer to the embedded struct xe_vma
@@ -252,6 +257,8 @@ int xe_vma_userptr_pin_pages(struct xe_userptr_vma *uvma);
 
 int xe_vma_userptr_check_repin(struct xe_userptr_vma *uvma);
 
+struct xe_vma *xe_vm_fault_userptr(struct xe_vm *vm, u64 fault_addr);
+
 bool xe_vm_validate_should_retry(struct drm_exec *exec, int err, ktime_t *end);
 
 int xe_vm_lock_vma(struct drm_exec *exec, struct xe_vma *vma);
diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
index 165c2ca258de..c1bffa60cefc 100644
--- a/drivers/gpu/drm/xe/xe_vm_types.h
+++ b/drivers/gpu/drm/xe/xe_vm_types.h
@@ -33,6 +33,7 @@ struct xe_vm_pgtable_update_op;
 #define XE_VMA_PTE_COMPACT	(DRM_GPUVA_USERBITS << 7)
 #define XE_VMA_DUMPABLE		(DRM_GPUVA_USERBITS << 8)
 #define XE_VMA_SYSTEM_ALLOCATOR	(DRM_GPUVA_USERBITS << 9)
+#define XE_VMA_FAULT_USERPTR	(DRM_GPUVA_USERBITS << 10)
 
 /** struct xe_userptr - User pointer */
 struct xe_userptr {
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [CI 30/42] drm/xe/svm: Add faulted userptr VMA garbage collector
  2024-06-13  4:23 [CI 01/42] mm/hmm: let users to tag specific PFNs Oak Zeng
                   ` (28 preceding siblings ...)
  2024-06-13  4:24 ` [CI 29/42] drm/xe/svm: Create userptr if page fault occurs on system_allocator VMA Oak Zeng
@ 2024-06-13  4:24 ` Oak Zeng
  2024-06-13  4:24 ` [CI 31/42] drm/xe: Introduce helper to get tile from memory region Oak Zeng
                   ` (11 subsequent siblings)
  41 siblings, 0 replies; 44+ messages in thread
From: Oak Zeng @ 2024-06-13  4:24 UTC (permalink / raw)
  To: intel-xe

From: Matthew Brost <matthew.brost@intel.com>

When a faulted userptr VMA (allocated by the page fault handler) is
invalidated, add it to a list from which a garbage collector will unmap
it from the GPU, destroy the faulted userptr VMA, and replace it with a
system_allocator VMA.

v1: Run the garbage collector only on the MMU_NOTIFY_UNMAP event. For
    other events, we just invalidate the GPU page table but keep the
    vma because the userptr still exists. On the next GPU access, we
    will revalidate and rebind this userptr to the GPU (Oak)
v2: rebase
    Support range based userptr invalidation in the garbage collector.
    Allow part of a userptr to be invalidated (such as triggered by a
    partial munmap of a userptr) (Oak)
    Fix vm->lock recursive lock issue (Oak)

Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Brian Welty <brian.welty@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_gt_pagefault.c |   1 +
 drivers/gpu/drm/xe/xe_vm.c           | 162 +++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_vm.h           |   8 ++
 drivers/gpu/drm/xe/xe_vm_types.h     |  15 +++
 4 files changed, 186 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c b/drivers/gpu/drm/xe/xe_gt_pagefault.c
index 0bdde3164cd3..9e47e0c97a1b 100644
--- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
+++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
@@ -145,6 +145,7 @@ static int handle_vma_pagefault(struct xe_tile *tile, struct pagefault *pf,
 	}
 
 retry_userptr:
+	xe_vm_userptr_garbage_collector(vm);
 	if (xe_vma_is_userptr(vma) &&
 	    xe_vma_userptr_check_repin(to_userptr_vma(vma))) {
 		struct xe_userptr_vma *uvma = to_userptr_vma(vma);
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index dba7a64075e9..b1ab86890226 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -690,6 +690,21 @@ static bool vma_userptr_invalidate(struct mmu_interval_notifier *mni,
 
 	xe_vma_userptr_dma_unmap_pages(uvma, range_start, range_end);
 
+	if (range->event == MMU_NOTIFY_UNMAP &&
+	    vma->gpuva.flags & XE_VMA_FAULT_USERPTR &&
+	    !xe_vm_is_closed(vm) && !xe_vm_is_banned(vm) &&
+	    !(vma->gpuva.flags & XE_VMA_DESTROYED) && vma->tile_present) {
+		xe_assert(vm->xe, xe_vm_in_fault_mode(vm));
+		userptr->invalidate_start = start;
+		userptr->invalidate_range = range_size;
+		spin_lock(&vm->userptr.invalidated_lock);
+		list_move_tail(&userptr->invalidate_link,
+			       &vm->userptr.fault_invalidated);
+		spin_unlock(&vm->userptr.invalidated_lock);
+
+		queue_work(system_wq, &vm->userptr.garbage_collector);
+	}
+
 	trace_xe_vma_userptr_invalidate_complete(vma);
 
 	return true;
@@ -1428,6 +1443,8 @@ static void xe_vm_free_scratch(struct xe_vm *vm)
 	}
 }
 
+static void vm_userptr_garbage_collector(struct work_struct *w);
+
 struct xe_vm *xe_vm_create(struct xe_device *xe, u32 flags)
 {
 	struct drm_gem_object *vm_resv_obj;
@@ -1453,8 +1470,10 @@ struct xe_vm *xe_vm_create(struct xe_device *xe, u32 flags)
 
 	INIT_LIST_HEAD(&vm->userptr.repin_list);
 	INIT_LIST_HEAD(&vm->userptr.invalidated);
+	INIT_LIST_HEAD(&vm->userptr.fault_invalidated);
 	init_rwsem(&vm->userptr.notifier_lock);
 	spin_lock_init(&vm->userptr.invalidated_lock);
+	INIT_WORK(&vm->userptr.garbage_collector, vm_userptr_garbage_collector);
 
 	INIT_WORK(&vm->destroy_work, vm_destroy_work_func);
 
@@ -1609,6 +1628,8 @@ void xe_vm_close_and_put(struct xe_vm *vm)
 	xe_vm_close(vm);
 	if (xe_vm_in_preempt_fence_mode(vm))
 		flush_work(&vm->preempt.rebind_work);
+	if (xe_vm_in_fault_mode(vm))
+		flush_work(&vm->userptr.garbage_collector);
 
 	down_write(&vm->lock);
 	for_each_tile(tile, xe, id) {
@@ -3572,3 +3593,144 @@ struct xe_vma *xe_vm_fault_userptr(struct xe_vm *vm, u64 fault_addr)
 
 	return ERR_PTR(err);
 }
+
+static int
+vm_userptr_garbage_collector_destroy_uvma(struct xe_vm *vm,
+					  struct xe_userptr_vma *uvma)
+{
+	struct xe_userptr *userptr = &uvma->userptr;
+	struct mm_struct *mm = vm->gpuvm.mm;
+	struct xe_vma_ops vops;
+	struct drm_gpuva_ops *ops = NULL;
+	struct drm_gpuva_op *__op;
+	struct xe_tile *tile;
+	u8 id;
+	int err;
+
+	vm_dbg(&vm->xe->drm, "GARBAGE COLLECTOR: addr=0x%016llx, range=0x%016llx",
+	       userptr->invalidate_start, userptr->invalidate_range);
+
+	xe_assert(vm->xe, uvma->vma.gpuva.flags & XE_VMA_FAULT_USERPTR);
+	lockdep_assert_held_write(&vm->lock);
+
+	if (!mmget_not_zero(mm))
+		return -EFAULT;
+
+	kthread_use_mm(mm);
+
+	/* Replace xe_userptr_vma sub-range with system_allocator VMA */
+	ops = drm_gpuvm_sm_map_ops_create(&vm->gpuvm,
+					  userptr->invalidate_start,
+					  userptr->invalidate_range, 0, 0);
+	if (IS_ERR(ops)) {
+		err = PTR_ERR(ops);
+		goto err_kthread;
+	}
+
+	drm_gpuva_for_each_op(__op, ops) {
+		struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
+
+		if (__op->op == DRM_GPUVA_OP_MAP) {
+			op->map.immediate = true;
+			op->map.is_system_allocator = true;
+		}
+
+		print_op(vm->xe, __op);
+	}
+
+	xe_vma_ops_init(&vops, vm, NULL, NULL, 0);
+	err = vm_bind_ioctl_ops_update_gpuvm_state(vm, ops, NULL, 0, &vops);
+	if (err)
+		goto err_kthread;
+
+	/*
+	 * Order behind any user operations and use same exec queue as page
+	 * fault handler.
+	 */
+	for_each_tile(tile, vm->xe, id) {
+		vops.pt_update_ops[tile->id].wait_vm_bookkeep = true;
+		vops.pt_update_ops[tile->id].q =
+			xe_tile_migrate_exec_queue(tile);
+	}
+
+	err = xe_vma_ops_alloc(&vops);
+	if (err)
+		goto err_kthread;
+
+	err = vm_bind_ioctl_ops_execute(vm, &vops);
+
+	xe_vma_ops_fini(&vops);
+	kthread_unuse_mm(mm);
+	mmput(mm);
+	drm_gpuva_ops_free(&vm->gpuvm, ops);
+
+	return err;
+
+err_kthread:
+	kthread_unuse_mm(mm);
+	mmput(mm);
+	if (ops)
+		drm_gpuva_ops_free(&vm->gpuvm, ops);
+
+	return err;
+}
+
+static void vm_userptr_garbage_collector_locked(struct xe_vm *vm)
+{
+	struct xe_userptr_vma *uvma, *next;
+	int err;
+
+	xe_assert(vm->xe, xe_vm_in_fault_mode(vm));
+
+	if (xe_vm_is_closed_or_banned(vm))
+		return;
+
+	/*
+	 * FIXME: Could create 1 set of VMA ops for all VMAs on
+	 * fault_invalidated list
+	 */
+	spin_lock(&vm->userptr.invalidated_lock);
+	list_for_each_entry_safe(uvma, next, &vm->userptr.fault_invalidated,
+				 userptr.invalidate_link) {
+		list_del_init(&uvma->userptr.invalidate_link);
+		spin_unlock(&vm->userptr.invalidated_lock);
+
+		err = vm_userptr_garbage_collector_destroy_uvma(vm, uvma);
+		if (err) {
+			XE_WARN_ON("Garbage collection failed, killing VM");
+			xe_vm_kill(vm, true);
+		}
+
+		spin_lock(&vm->userptr.invalidated_lock);
+	}
+	spin_unlock(&vm->userptr.invalidated_lock);
+}
+
+static void vm_userptr_garbage_collector(struct work_struct *w)
+{
+	struct xe_vm *vm =
+		container_of(w, struct xe_vm, userptr.garbage_collector);
+
+	down_write(&vm->lock);
+
+	if (xe_vm_is_closed_or_banned(vm))
+		goto unlock;
+
+	vm_userptr_garbage_collector_locked(vm);
+
+unlock:
+	up_write(&vm->lock);
+}
+
+/**
+ * xe_vm_userptr_garbage_collector() - VM userptr garbage collector
+ * @vm: VM
+ *
+ * For all invalidated faulted userptr VMAs (created by page fault handler)
+ * unmap from GPU, destroy faulted userptr VMA, and replace with
+ * system_allocator VMA.
+ */
+void xe_vm_userptr_garbage_collector(struct xe_vm *vm)
+{
+	vm_userptr_garbage_collector_locked(vm);
+}
diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
index a31409b87b8a..b3b6ceec39ba 100644
--- a/drivers/gpu/drm/xe/xe_vm.h
+++ b/drivers/gpu/drm/xe/xe_vm.h
@@ -285,6 +285,14 @@ void xe_vm_kill(struct xe_vm *vm, bool unlocked);
  */
 #define xe_vm_assert_held(vm) dma_resv_assert_held(xe_vm_resv(vm))
 
+int xe_vm_populate_dummy_rebind(struct xe_vm *vm, struct xe_vma *vma,
+				u8 tile_mask);
+void xe_vma_ops_free(struct xe_vma_ops *vops);
+struct dma_fence *xe_vm_ops_execute(struct xe_vm *vm, struct xe_vma_ops *vops);
+
+void xe_vm_kill(struct xe_vm *vm, bool unlocked);
+void xe_vm_userptr_garbage_collector(struct xe_vm *vm);
+
 #if IS_ENABLED(CONFIG_DRM_XE_DEBUG_VM)
 #define vm_dbg drm_dbg
 #else
diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
index c1bffa60cefc..6ebe05242997 100644
--- a/drivers/gpu/drm/xe/xe_vm_types.h
+++ b/drivers/gpu/drm/xe/xe_vm_types.h
@@ -40,6 +40,10 @@ struct xe_userptr {
 	struct drm_hmmptr hmmptr;
 	/** @invalidate_link: Link for the vm::userptr.invalidated list */
 	struct list_head invalidate_link;
+	/** invalidation start address */
+	u64 invalidate_start;
+	/** invalidation range */
+	u64 invalidate_range;
 	/** @userptr: link into VM repin list if userptr. */
 	struct list_head repin_link;
 	/**
@@ -212,6 +216,17 @@ struct xe_vm {
 		 * write mode.
 		 */
 		struct list_head invalidated;
+		/**
+		 * @userptr.fault_invalidated: List of invalidated userptrs,
+		 * created by page fault, which will be destroyed by the garbage
+		 * collector. Protected from access with the @invalidated_lock.
+		 */
+		struct list_head fault_invalidated;
+		/**
+		 * @userptr.garbage_collector: worker to implement destroying of
+		 * userptrs on @userptr.fault_invalidated list.
+		 */
+		struct work_struct garbage_collector;
 	} userptr;
 
 	/** @preempt: preempt state */
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [CI 31/42] drm/xe: Introduce helper to get tile from memory region
  2024-06-13  4:23 [CI 01/42] mm/hmm: let users to tag specific PFNs Oak Zeng
                   ` (29 preceding siblings ...)
  2024-06-13  4:24 ` [CI 30/42] drm/xe/svm: Add faulted userptr VMA garbage collector Oak Zeng
@ 2024-06-13  4:24 ` Oak Zeng
  2024-06-13  4:24 ` [CI 32/42] drm/xe/svm: implement functions to allocate and free device memory Oak Zeng
                   ` (10 subsequent siblings)
  41 siblings, 0 replies; 44+ messages in thread
From: Oak Zeng @ 2024-06-13  4:24 UTC (permalink / raw)
  To: intel-xe

Introduce a simple helper to retrieve the tile from a memory region.

v1: move the function to xe_device.h (Matt)
    improve commit message, add kerneldoc (Thomas)
v2: move the function to xe_tile.h (Matt)

Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_tile.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_tile.h b/drivers/gpu/drm/xe/xe_tile.h
index 1c9e42ade6b0..091c5a98110a 100644
--- a/drivers/gpu/drm/xe/xe_tile.h
+++ b/drivers/gpu/drm/xe/xe_tile.h
@@ -15,4 +15,12 @@ int xe_tile_init_noalloc(struct xe_tile *tile);
 
 void xe_tile_migrate_wait(struct xe_tile *tile);
 
+/**
+ * xe_mem_region_to_tile() - retrieve tile from memory region
+ * @mr: the memory region we retrieve tile from
+ */
+static inline struct xe_tile *xe_mem_region_to_tile(struct xe_mem_region *mr)
+{
+	return container_of(mr, struct xe_tile, mem.vram);
+}
 #endif
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [CI 32/42] drm/xe/svm: implement functions to allocate and free device memory
  2024-06-13  4:23 [CI 01/42] mm/hmm: let users to tag specific PFNs Oak Zeng
                   ` (30 preceding siblings ...)
  2024-06-13  4:24 ` [CI 31/42] drm/xe: Introduce helper to get tile from memory region Oak Zeng
@ 2024-06-13  4:24 ` Oak Zeng
  2024-06-13  4:24 ` [CI 33/42] drm/xe/svm: Get drm device from drm memory region Oak Zeng
                   ` (9 subsequent siblings)
  41 siblings, 0 replies; 44+ messages in thread
From: Oak Zeng @ 2024-06-13  4:24 UTC (permalink / raw)
  To: intel-xe

Function xe_svm_alloc_pages allocates pages from the drm buddy
allocator and performs housekeeping work for all the pages allocated,
such as taking a page refcount and setting each page's zone_device_data
to point to the drm buddy block that the page belongs to.

Function xe_svm_free_page frees one page back to the drm buddy
allocator.

v1: Drop buddy block meta data (Matt)
    Take vram_mgr->lock during buddy operations (Matt)
    Move to page granularity memory free (Oak)
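
For illustration only (not part of the patch), a caller in the memory
region layer could use these helpers roughly as follows. The sketch
follows the signatures added here; the helper name, the pfn array
sizing, and the error handling are assumptions:

	/* Illustrative sketch: back npages device pages with buddy memory */
	static int fill_device_pfns(struct drm_mem_region *mr,
				    unsigned long npages, unsigned long *pfns)
	{
		int err;

		/* pfns must have room for at least npages entries */
		err = xe_svm_alloc_pages(mr, npages, pfns);
		if (err)
			return err;

		/* ... use the pages; later return each one individually ... */
		xe_svm_free_page(pfn_to_page(pfns[0]));
		return 0;
	}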

Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/drm_svm.c   |   1 -
 drivers/gpu/drm/xe/Makefile |   1 +
 drivers/gpu/drm/xe/xe_svm.c | 148 ++++++++++++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_svm.h |  16 ++++
 4 files changed, 165 insertions(+), 1 deletion(-)
 create mode 100644 drivers/gpu/drm/xe/xe_svm.c
 create mode 100644 drivers/gpu/drm/xe/xe_svm.h

diff --git a/drivers/gpu/drm/drm_svm.c b/drivers/gpu/drm/drm_svm.c
index 967e7addae54..a751cc2c9c91 100644
--- a/drivers/gpu/drm/drm_svm.c
+++ b/drivers/gpu/drm/drm_svm.c
@@ -3,7 +3,6 @@
  * Copyright © 2024 Intel Corporation
  */
 
-
 #include <linux/scatterlist.h>
 #include <linux/mmu_notifier.h>
 #include <linux/dma-mapping.h>
diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
index 80bfa5741f26..221b7ddd5287 100644
--- a/drivers/gpu/drm/xe/Makefile
+++ b/drivers/gpu/drm/xe/Makefile
@@ -108,6 +108,7 @@ xe-y += xe_bb.o \
 	xe_sa.o \
 	xe_sched_job.o \
 	xe_step.o \
+	xe_svm.o \
 	xe_sync.o \
 	xe_tile.o \
 	xe_tile_sysfs.o \
diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
new file mode 100644
index 000000000000..a2c52f205ea0
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -0,0 +1,148 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2023 Intel Corporation
+ */
+
+#include <linux/mm_types.h>
+#include <linux/sched/mm.h>
+#include <linux/gfp.h>
+#include <drm/drm_buddy.h>
+#include "xe_ttm_vram_mgr_types.h"
+#include "xe_device_types.h"
+#include "xe_device.h"
+#include "xe_tile.h"
+#include "xe_svm.h"
+#include "xe_assert.h"
+
+/** This is a placeholder for compilation. Will introduce function to drm buddy */
+static void drm_buddy_block_free_subrange(struct drm_buddy_block *block, u64 start,
+                        u64 end, struct list_head *block_list)
+{
+}
+
+static u64 block_offset_to_pfn(struct xe_mem_region *mr, u64 offset)
+{
+	/** DRM buddy's block offset is 0-based*/
+	offset += mr->drm_mr.hpa_base;
+
+	return PHYS_PFN(offset);
+}
+
+void xe_svm_free_page(struct page *page)
+{
+	struct drm_buddy_block *block =
+					(struct drm_buddy_block *)page->zone_device_data;
+	struct drm_mem_region *mr = drm_page_to_mem_region(page);
+	struct xe_mem_region *xe_mr = container_of(mr, struct xe_mem_region, drm_mr);
+	struct xe_tile *tile = xe_mem_region_to_tile(xe_mr);
+	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
+	u64 size = drm_buddy_block_size(mm, block);
+	u64 pages_per_block = size >> PAGE_SHIFT;
+	u64 block_pfn_first =
+					block_offset_to_pfn(xe_mr, drm_buddy_block_offset(block));
+	u64 page_pfn = page_to_pfn(page);
+	u64 page_idx = page_pfn - block_pfn_first;
+	struct list_head blocks = LIST_HEAD_INIT(blocks);
+	struct drm_buddy_block *tmp;
+	int i;
+
+	xe_assert(tile->xe, page_idx < pages_per_block);
+	drm_buddy_block_free_subrange(block, page_idx << PAGE_SHIFT,
+            (page_idx << PAGE_SHIFT) + PAGE_SIZE, &blocks);
+
+	/**
+	 * page's buddy_block (record in page's zone_device_data) is changed
+	 * due to above partial free. update zone_device_data.
+	 */
+	list_for_each_entry_safe(block, tmp, &blocks, link) {
+		size = drm_buddy_block_size(mm, block);
+		pages_per_block = size >> PAGE_SHIFT;
+		block_pfn_first =
+					block_offset_to_pfn(xe_mr, drm_buddy_block_offset(block));
+		for(i = 0; i < pages_per_block; i++) {
+			struct page *p;
+
+			p = pfn_to_page(block_pfn_first + i);
+			p->zone_device_data = block;
+		}
+	}
+}
+
+/**
+ * __free_blocks() - free all memory blocks
+ *
+ * @mm: drm_buddy that the blocks belongs to
+ * @blocks: memory blocks list head
+ */
+static void __free_blocks(struct drm_buddy *mm, struct list_head *blocks)
+{
+	struct drm_buddy_block *block, *tmp;
+
+	list_for_each_entry_safe(block, tmp, blocks, link)
+		drm_buddy_free_block(mm, block);
+}
+
+/**
+ * xe_svm_alloc_pages() - allocate device pages from buddy allocator
+ *
+ * @mr: which memory region to allocate device memory from
+ * @npages: how many pages to allocate
+ * @pfn: used to return the pfn of all allocated pages. Must be big enough
+ * to hold at least @npages entries.
+ *
+ * This function allocates blocks of memory from the drm buddy allocator, and
+ * performs initialization work: set struct page::zone_device_data to point
+ * to the memory block; zone_device_page_init each page allocated; add pages
+ * to the lru manager's lru list for eviction purposes - this is TBD.
+ *
+ * return: 0 on success
+ * error code otherwise
+ */
+int xe_svm_alloc_pages(struct drm_mem_region *mr,
+            unsigned long npages, unsigned long *pfn)
+{
+	struct xe_mem_region *xe_mr = container_of(mr, struct xe_mem_region, drm_mr);
+	struct xe_tile *tile = xe_mem_region_to_tile(xe_mr);
+	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
+	struct drm_buddy_block *block, *tmp;
+	u64 size = npages << PAGE_SHIFT;
+	int ret = 0, i, j = 0;
+	struct list_head blocks;
+
+	mutex_lock(&tile->mem.vram_mgr->lock);
+	ret = drm_buddy_alloc_blocks(mm, 0, mm->size, size, PAGE_SIZE,
+						&blocks, DRM_BUDDY_CONTIGUOUS_ALLOCATION);
+
+	if (unlikely(ret)) {
+		mutex_unlock(&tile->mem.vram_mgr->lock);
+		return ret;
+	}
+
+	ret = drm_buddy_block_trim(mm, size, &blocks);
+	if (ret) {
+		__free_blocks(mm, &blocks);
+		mutex_unlock(&tile->mem.vram_mgr->lock);
+		return ret;
+	}
+	mutex_unlock(&tile->mem.vram_mgr->lock);
+
+	list_for_each_entry_safe(block, tmp, &blocks, link) {
+		u64 block_pfn_first, pages_per_block;
+
+		size = drm_buddy_block_size(mm, block);
+		pages_per_block = size >> PAGE_SHIFT;
+		block_pfn_first =
+					block_offset_to_pfn(xe_mr, drm_buddy_block_offset(block));
+		for(i = 0; i < pages_per_block; i++) {
+			struct page *page;
+
+			pfn[j++] = block_pfn_first + i;
+			page = pfn_to_page(block_pfn_first + i);
+			/**Lock page per hmm requirement, see hmm.rst.*/
+			zone_device_page_init(page);
+			page->zone_device_data = block;
+		}
+	}
+
+	return ret;
+}
diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
new file mode 100644
index 000000000000..09fef8ea97f3
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -0,0 +1,16 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2024 Intel Corporation
+ */
+
+#ifndef _XE_SVM_
+#define _XE_SVM_
+
+struct drm_mem_region;
+struct page;
+
+int xe_svm_alloc_pages(struct drm_mem_region *mr,
+            unsigned long npages, unsigned long *pfn);
+void xe_svm_free_page(struct page *page);
+
+#endif
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [CI 33/42] drm/xe/svm: Get drm device from drm memory region
  2024-06-13  4:23 [CI 01/42] mm/hmm: let users to tag specific PFNs Oak Zeng
                   ` (31 preceding siblings ...)
  2024-06-13  4:24 ` [CI 32/42] drm/xe/svm: implement functions to allocate and free device memory Oak Zeng
@ 2024-06-13  4:24 ` Oak Zeng
  2024-06-13  4:24 ` [CI 34/42] drm/xe/svm: Get page map owner of a " Oak Zeng
                   ` (8 subsequent siblings)
  41 siblings, 0 replies; 44+ messages in thread
From: Oak Zeng @ 2024-06-13  4:24 UTC (permalink / raw)
  To: intel-xe

Implement a function xe_mem_region_to_device to
get the drm device that owns a specific memory region.

Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_device.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
index bb07f5669dbb..bae8ebeda20c 100644
--- a/drivers/gpu/drm/xe/xe_device.h
+++ b/drivers/gpu/drm/xe/xe_device.h
@@ -9,6 +9,7 @@
 #include <drm/drm_util.h>
 
 #include "xe_device_types.h"
+#include "xe_tile.h"
 
 static inline struct xe_device *to_xe_device(const struct drm_device *dev)
 {
@@ -170,4 +171,11 @@ static inline bool xe_device_wedged(struct xe_device *xe)
 
 void xe_device_declare_wedged(struct xe_device *xe);
 
+static inline struct drm_device *xe_mem_region_to_device(struct drm_mem_region *mr)
+{
+	struct xe_mem_region *xe_mr = container_of(mr, struct xe_mem_region, drm_mr);
+	struct xe_tile *tile = xe_mem_region_to_tile(xe_mr);
+
+	return &tile->xe->drm;
+}
 #endif
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [CI 34/42] drm/xe/svm: Get page map owner of a memory region
  2024-06-13  4:23 [CI 01/42] mm/hmm: let users to tag specific PFNs Oak Zeng
                   ` (32 preceding siblings ...)
  2024-06-13  4:24 ` [CI 33/42] drm/xe/svm: Get drm device from drm memory region Oak Zeng
@ 2024-06-13  4:24 ` Oak Zeng
  2024-06-13  4:24 ` [CI 35/42] drm/xe/svm: Add migrate layer functions for SVM support Oak Zeng
                   ` (7 subsequent siblings)
  41 siblings, 0 replies; 44+ messages in thread
From: Oak Zeng @ 2024-06-13  4:24 UTC (permalink / raw)
  To: intel-xe

Introduce the function xe_svm_mem_region_to_pagemap_owner to
get a memory region's pagemap owner.
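
For illustration only (not part of the patch), this owner value is the
kind of cookie HMM uses to recognize device-private pages it may hand
back to the caller; a hedged sketch of plugging it into an hmm_range
(other fields elided, exact usage in this series is an assumption):

	/* Illustrative sketch: let HMM treat this region's pages as ours */
	struct hmm_range range = {
		/* ... notifier, start, end, hmm_pfns, flags ... */
		.dev_private_owner = xe_svm_mem_region_to_pagemap_owner(mr),
	};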

Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_svm.c | 8 ++++++++
 drivers/gpu/drm/xe/xe_svm.h | 1 +
 2 files changed, 9 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index a2c52f205ea0..d2838ce46eaf 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -146,3 +146,11 @@ int xe_svm_alloc_pages(struct drm_mem_region *mr,
 
 	return ret;
 }
+
+void* xe_svm_mem_region_to_pagemap_owner(struct drm_mem_region *mr)
+{
+	struct xe_mem_region *xe_mr = container_of(mr, struct xe_mem_region, drm_mr);
+	struct xe_tile *tile = xe_mem_region_to_tile(xe_mr);
+
+	return tile->xe;
+}
diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index 09fef8ea97f3..b70551d12578 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -12,5 +12,6 @@ struct page;
 int xe_svm_alloc_pages(struct drm_mem_region *mr,
             unsigned long npages, unsigned long *pfn);
 void xe_svm_free_page(struct page *page);
+void* xe_svm_mem_region_to_pagemap_owner(struct drm_mem_region *mr);
 
 #endif
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [CI 35/42] drm/xe/svm: Add migrate layer functions for SVM support
  2024-06-13  4:23 [CI 01/42] mm/hmm: let users to tag specific PFNs Oak Zeng
                   ` (33 preceding siblings ...)
  2024-06-13  4:24 ` [CI 34/42] drm/xe/svm: Get page map owner of a " Oak Zeng
@ 2024-06-13  4:24 ` Oak Zeng
  2024-06-13  4:24 ` [CI 36/42] drm/xe/svm: introduce svm migration function Oak Zeng
                   ` (6 subsequent siblings)
  41 siblings, 0 replies; 44+ messages in thread
From: Oak Zeng @ 2024-06-13  4:24 UTC (permalink / raw)
  To: intel-xe

From: Matthew Brost <matthew.brost@intel.com>

Add functions which migrate to / from VRAM, accepting a single DPA
argument (VRAM) and an array of DMA addresses (SRAM).

FIXME: Support non-contiguous VRAM DPA. The VRAM DPA could be an
array, and we could dynamically map the DPAs into a contiguous device
virtual address space, as is done for SRAM, and still use a single
blitter command for the migration.
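
For illustration only (not part of the patch), a migration helper built
on top of this interface might be invoked roughly as follows; the
helper name is made up, and the caller is assumed to have already DMA
mapped the system pages and allocated the (contiguous) VRAM DPA:

	/* Illustrative sketch: copy npages from system memory into VRAM */
	static int copy_to_vram(struct xe_tile *tile, unsigned long npages,
				dma_addr_t *sram_addr, u64 vram_addr)
	{
		struct dma_fence *fence;

		fence = xe_migrate_vram(tile->migrate, npages, sram_addr,
					vram_addr, true /* dst is VRAM */);
		if (IS_ERR(fence))
			return PTR_ERR(fence);

		dma_fence_wait(fence, false);
		dma_fence_put(fence);
		return 0;
	}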

Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Brian Welty <brian.welty@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_migrate.c | 126 ++++++++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_migrate.h |   5 ++
 2 files changed, 131 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
index cc8455daa2bb..fa70cf912e60 100644
--- a/drivers/gpu/drm/xe/xe_migrate.c
+++ b/drivers/gpu/drm/xe/xe_migrate.c
@@ -1457,6 +1457,132 @@ void xe_migrate_wait(struct xe_migrate *m)
 		dma_fence_wait(m->fence, false);
 }
 
+static u32 pte_update_cmd_size(u64 size)
+{
+	u32 dword;
+	u64 entries = DIV_ROUND_UP(size, XE_PAGE_SIZE);
+
+	XE_WARN_ON(size > MAX_PREEMPTDISABLE_TRANSFER);
+	/*
+	 * MI_STORE_DATA_IMM command is used to update page table. Each
+	 * instruction can update at most 0x1ff pte entries. To update
+	 * n (n <= 0x1ff) pte entries, we need:
+	 * 1 dword for the MI_STORE_DATA_IMM command header (opcode etc)
+	 * 2 dword for the page table's physical location
+	 * 2*n dword for value of pte to fill (each pte entry is 2 dwords)
+	 */
+	dword = (1 + 2) * DIV_ROUND_UP(entries, 0x1ff);
+	dword += entries * 2;
+
+	return dword;
+}
+
+static void build_pt_update_batch_sram(struct xe_migrate *m,
+				       struct xe_bb *bb, u32 pt_offset,
+				       dma_addr_t *sram_addr, u32 size)
+{
+	u16 pat_index = tile_to_xe(m->tile)->pat.idx[XE_CACHE_WB];
+	u32 ptes;
+	int i = 0;
+
+	ptes = DIV_ROUND_UP(size, XE_PAGE_SIZE);
+	while (ptes) {
+		u32 chunk = min(0x1ffU, ptes);
+
+		bb->cs[bb->len++] = MI_STORE_DATA_IMM | MI_SDI_NUM_QW(chunk);
+		bb->cs[bb->len++] = pt_offset;
+		bb->cs[bb->len++] = 0;
+
+		pt_offset += chunk * 8;
+		ptes -= chunk;
+
+		while (chunk--) {
+			u64 addr = sram_addr[i++] & PAGE_MASK;
+
+			xe_tile_assert(m->tile, addr);
+			addr = m->q->vm->pt_ops->pte_encode_addr(m->tile->xe,
+								 addr, pat_index,
+								 0, false, 0);
+			bb->cs[bb->len++] = lower_32_bits(addr);
+			bb->cs[bb->len++] = upper_32_bits(addr);
+		}
+	}
+}
+
+struct dma_fence *xe_migrate_vram(struct xe_migrate *m,
+					 unsigned long npages,
+					 dma_addr_t *sram_addr, u64 vram_addr,
+					 bool dst_vram)
+{
+	struct xe_gt *gt = m->tile->primary_gt;
+	struct xe_device *xe = gt_to_xe(gt);
+	struct dma_fence *fence = NULL;
+	u32 batch_size = 2;
+	u64 src_L0_ofs, dst_L0_ofs;
+	u64 round_update_size;
+	struct xe_sched_job *job;
+	struct xe_bb *bb;
+	u32 update_idx, pt_slot = 0;
+	int err;
+
+	round_update_size = min_t(u64, npages * PAGE_SIZE,
+				  MAX_PREEMPTDISABLE_TRANSFER);
+	batch_size += pte_update_cmd_size(round_update_size);
+	batch_size += EMIT_COPY_DW;
+
+	bb = xe_bb_new(gt, batch_size, true);
+	if (IS_ERR(bb)) {
+		err = PTR_ERR(bb);
+		return ERR_PTR(err);
+	}
+
+	build_pt_update_batch_sram(m, bb, pt_slot * XE_PAGE_SIZE,
+				   sram_addr, round_update_size);
+
+	if (dst_vram) {
+		src_L0_ofs = xe_migrate_vm_addr(pt_slot, 0);
+		dst_L0_ofs = xe_migrate_vram_ofs(xe, vram_addr);
+
+	} else {
+		src_L0_ofs = xe_migrate_vram_ofs(xe, vram_addr);
+		dst_L0_ofs = xe_migrate_vm_addr(pt_slot, 0);
+	}
+
+	bb->cs[bb->len++] = MI_BATCH_BUFFER_END;
+	update_idx = bb->len;
+
+	emit_copy(gt, bb, src_L0_ofs, dst_L0_ofs, round_update_size,
+		  XE_PAGE_SIZE);
+
+	mutex_lock(&m->job_mutex);
+	job = xe_bb_create_migration_job(m->q, bb,
+					 xe_migrate_batch_base(m, true),
+					 update_idx);
+	if (IS_ERR(job)) {
+		err = PTR_ERR(job);
+		goto err;
+	}
+
+	xe_sched_job_add_migrate_flush(job, 0);
+	xe_sched_job_arm(job);
+	fence = dma_fence_get(&job->drm.s_fence->finished);
+	xe_sched_job_push(job);
+
+	dma_fence_put(m->fence);
+	m->fence = dma_fence_get(fence);
+	mutex_unlock(&m->job_mutex);
+
+	xe_bb_free(bb, fence);
+
+	return fence;
+
+err:
+	mutex_unlock(&m->job_mutex);
+	xe_bb_free(bb, NULL);
+
+	return ERR_PTR(err);
+}
+
 #if IS_ENABLED(CONFIG_DRM_XE_KUNIT_TEST)
 #include "tests/xe_migrate.c"
 #endif
diff --git a/drivers/gpu/drm/xe/xe_migrate.h b/drivers/gpu/drm/xe/xe_migrate.h
index 453e0ecf5034..c6a18be1373f 100644
--- a/drivers/gpu/drm/xe/xe_migrate.h
+++ b/drivers/gpu/drm/xe/xe_migrate.h
@@ -115,4 +115,9 @@ xe_migrate_update_pgtables(struct xe_migrate *m,
 void xe_migrate_wait(struct xe_migrate *m);
 
 struct xe_exec_queue *xe_tile_migrate_exec_queue(struct xe_tile *tile);
+
+struct dma_fence *xe_migrate_vram(struct xe_migrate *m,
+					 unsigned long npages,
+					 dma_addr_t *sram_addr, u64 vram_addr,
+					 bool dst_vram);
 #endif
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [CI 36/42] drm/xe/svm: introduce svm migration function
  2024-06-13  4:23 [CI 01/42] mm/hmm: let users to tag specific PFNs Oak Zeng
                   ` (34 preceding siblings ...)
  2024-06-13  4:24 ` [CI 35/42] drm/xe/svm: Add migrate layer functions for SVM support Oak Zeng
@ 2024-06-13  4:24 ` Oak Zeng
  2024-06-13  4:24 ` [CI 37/42] drm/xe/svm: Register xe memory region to drm layer Oak Zeng
                   ` (5 subsequent siblings)
  41 siblings, 0 replies; 44+ messages in thread
From: Oak Zeng @ 2024-06-13  4:24 UTC (permalink / raw)
  To: intel-xe

xe_svm_migrate is introduced to migrate between two migration
vectors. It will be registered with the drm layer as a callback
for SVM memory migration.

Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_svm.c | 46 +++++++++++++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_svm.h |  3 +++
 2 files changed, 49 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index d2838ce46eaf..014b70250abd 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -7,8 +7,10 @@
 #include <linux/sched/mm.h>
 #include <linux/gfp.h>
 #include <drm/drm_buddy.h>
+#include <drm/drm_svm.h>
 #include "xe_ttm_vram_mgr_types.h"
 #include "xe_device_types.h"
+#include "xe_migrate.h"
 #include "xe_device.h"
 #include "xe_tile.h"
 #include "xe_svm.h"
@@ -154,3 +156,47 @@ void* xe_svm_mem_region_to_pagemap_owner(struct drm_mem_region *mr)
 
 	return tile->xe;
 }
+
+/**
+ * xe_svm_migrate() - migrate b/t two migration vectors.
+ * @src_vec: source migration vector
+ * @dst_vec: dst migration vector
+ *
+ * For now this function only supports migration between system memory
+ * and device memory. It can be extended to support migration between
+ * two device memory regions.
+ *
+ * FIXME: right now it requires vram dpa to be physically contiguous.
+ * This is guaranteed by the DRM_BUDDY_CONTIGUOUS_ALLOCATION in
+ * xe_svm_alloc_pages. In the future we can support migration with
+ * non-contiguous vram. This requires change to xe_migrate_vram
+ * to map non-contiguous vram to contiguous device virtual address
+ * space (aka not using identity vram migrate vm mapping)
+ */
+struct dma_fence* xe_svm_migrate(struct migrate_vec *src_vec,
+            struct migrate_vec *dst_vec)
+{
+	unsigned long npages = src_vec->npages;
+	struct xe_mem_region *xe_mr;
+	struct drm_mem_region *mr;
+	bool dst_vram = false;
+	struct xe_tile *tile;
+	dma_addr_t *sram_addr;
+	u64 vram_addr;
+
+	if (dst_vec->mr) {
+		mr = dst_vec->mr;
+		dst_vram = true;
+		vram_addr = dst_vec->addr_vec[0].dpa;
+		sram_addr = &src_vec->addr_vec[0].dma_addr;
+	} else {
+		mr = src_vec->mr;
+		vram_addr = src_vec->addr_vec[0].dpa;
+		sram_addr = &dst_vec->addr_vec[0].dma_addr;
+	}
+
+	xe_mr = container_of(mr, struct xe_mem_region, drm_mr);
+	tile = xe_mem_region_to_tile(xe_mr);
+
+	return xe_migrate_vram(tile->migrate, npages, sram_addr, vram_addr, dst_vram);
+}
diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index b70551d12578..1b891b2a7587 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -8,10 +8,13 @@
 
 struct drm_mem_region;
 struct page;
+struct migrate_vec;
 
 int xe_svm_alloc_pages(struct drm_mem_region *mr,
             unsigned long npages, unsigned long *pfn);
 void xe_svm_free_page(struct page *page);
 void* xe_svm_mem_region_to_pagemap_owner(struct drm_mem_region *mr);
+struct dma_fence* xe_svm_migrate(struct migrate_vec *src_vec,
+            struct migrate_vec *dst_vec);
 
 #endif
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [CI 37/42] drm/xe/svm: Register xe memory region to drm layer
  2024-06-13  4:23 [CI 01/42] mm/hmm: let users to tag specific PFNs Oak Zeng
                   ` (35 preceding siblings ...)
  2024-06-13  4:24 ` [CI 36/42] drm/xe/svm: introduce svm migration function Oak Zeng
@ 2024-06-13  4:24 ` Oak Zeng
  2024-06-13  4:24 ` [CI 38/42] drm/xe/svm: Introduce DRM_XE_SVM kernel config Oak Zeng
                   ` (4 subsequent siblings)
  41 siblings, 0 replies; 44+ messages in thread
From: Oak Zeng @ 2024-06-13  4:24 UTC (permalink / raw)
  To: intel-xe

Register the xe memory region with the drm layer for SVM purposes.
This essentially creates struct page backing for xe device memory,
which will be used by the SVM functions.

Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_tile.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_tile.c b/drivers/gpu/drm/xe/xe_tile.c
index 109f3118e821..6c50df281b11 100644
--- a/drivers/gpu/drm/xe/xe_tile.c
+++ b/drivers/gpu/drm/xe/xe_tile.c
@@ -3,7 +3,9 @@
  * Copyright © 2023 Intel Corporation
  */
 
+#include <linux/memremap.h>
 #include <drm/drm_managed.h>
+#include <drm/drm_svm.h>
 
 #include "xe_device.h"
 #include "xe_ggtt.h"
@@ -14,6 +16,7 @@
 #include "xe_tile_sysfs.h"
 #include "xe_ttm_vram_mgr.h"
 #include "xe_wa.h"
+#include "xe_svm.h"
 
 /**
  * DOC: Multi-tile Design
@@ -142,6 +145,15 @@ static int tile_ttm_mgr_init(struct xe_tile *tile)
 	return 0;
 }
 
+static void __init_drm_mem_region_ops(struct drm_mem_region_ops *ops)
+{
+	ops->drm_mem_region_alloc_pages = &xe_svm_alloc_pages;
+	ops->drm_mem_region_free_page = &xe_svm_free_page;
+	ops->drm_mem_region_migrate = &xe_svm_migrate;
+	ops->drm_mem_region_pagemap_owner = &xe_svm_mem_region_to_pagemap_owner;
+	ops->drm_mem_region_get_device = &xe_mem_region_to_device;
+}
+
 /**
  * xe_tile_init_noalloc - Init tile up to the point where allocations can happen.
  * @tile: The tile to initialize.
@@ -159,6 +171,8 @@ static int tile_ttm_mgr_init(struct xe_tile *tile)
 int xe_tile_init_noalloc(struct xe_tile *tile)
 {
 	int err;
+	struct xe_device *xe = tile_to_xe(tile);
+	struct drm_device *drm = &xe->drm;
 
 	err = tile_ttm_mgr_init(tile);
 	if (err)
@@ -172,6 +186,11 @@ int xe_tile_init_noalloc(struct xe_tile *tile)
 
 	err = xe_tile_sysfs_init(tile);
 
+	if (xe->info.has_usm) {
+		__init_drm_mem_region_ops(&tile->mem.vram.drm_mr.mr_ops);
+		drm_svm_register_mem_region(drm, &tile->mem.vram.drm_mr, MEMORY_DEVICE_PRIVATE);
+	}
+
 	return 0;
 }
 
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [CI 38/42] drm/xe/svm: Introduce DRM_XE_SVM kernel config
  2024-06-13  4:23 [CI 01/42] mm/hmm: let users to tag specific PFNs Oak Zeng
                   ` (36 preceding siblings ...)
  2024-06-13  4:24 ` [CI 37/42] drm/xe/svm: Register xe memory region to drm layer Oak Zeng
@ 2024-06-13  4:24 ` Oak Zeng
  2024-06-13  4:24 ` [CI 39/42] drm/xe/svm: Migration from sram to vram for system allocator Oak Zeng
                   ` (3 subsequent siblings)
  41 siblings, 0 replies; 44+ messages in thread
From: Oak Zeng @ 2024-06-13  4:24 UTC (permalink / raw)
  To: intel-xe

Introduce a DRM_XE_SVM kernel config entry for the xe SVM feature.
SVM allows the virtual address space to be shared between CPU and
GPU programs.

v1: Improve commit message (Thomas)
    Avoid using #if directive (Thomas)

Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
---
 drivers/gpu/drm/xe/Kconfig   | 21 +++++++++++++++++++++
 drivers/gpu/drm/xe/xe_tile.c | 11 ++++++++---
 2 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/xe/Kconfig b/drivers/gpu/drm/xe/Kconfig
index 7bbe46a98ff1..27d6145cefa7 100644
--- a/drivers/gpu/drm/xe/Kconfig
+++ b/drivers/gpu/drm/xe/Kconfig
@@ -84,6 +84,27 @@ config DRM_XE_FORCE_PROBE
 	  4571.
 
 	  Use "!*" to block the probe of the driver for all known devices.
+config DRM_XE_SVM
+	bool "Enable Shared Virtual Memory support in xe"
+	depends on DRM_XE
+	depends on ARCH_ENABLE_MEMORY_HOTPLUG
+	depends on ARCH_ENABLE_MEMORY_HOTREMOVE
+	depends on MEMORY_HOTPLUG
+	depends on MEMORY_HOTREMOVE
+	depends on ARCH_HAS_PTE_DEVMAP
+	depends on SPARSEMEM_VMEMMAP
+	depends on ZONE_DEVICE
+	depends on DEVICE_PRIVATE
+	depends on MMU
+	select HMM_MIRROR
+	select MMU_NOTIFIER
+	default y
+	help
+	  Choose this option if you want Shared Virtual Memory (SVM)
+	  support in xe. With SVM, virtual address space is shared
+	  between CPU and GPU. This means any virtual address such
+	  as malloc or mmap returns, variables on stack, or global
+	  memory pointers, can be used for GPU transparently.
 
 menu "drm/Xe Debugging"
 depends on DRM_XE
diff --git a/drivers/gpu/drm/xe/xe_tile.c b/drivers/gpu/drm/xe/xe_tile.c
index 6c50df281b11..1f938d001094 100644
--- a/drivers/gpu/drm/xe/xe_tile.c
+++ b/drivers/gpu/drm/xe/xe_tile.c
@@ -171,8 +171,13 @@ static void __init_drm_mem_region_ops(struct drm_mem_region_ops *ops)
 int xe_tile_init_noalloc(struct xe_tile *tile)
 {
 	int err;
-	struct xe_device *xe = tile_to_xe(tile);
-	struct drm_device *drm = &xe->drm;
+	struct xe_device __maybe_unused *xe;
+	struct drm_device __maybe_unused *drm;
+
+	if (IS_ENABLED(CONFIG_DRM_XE_SVM)) {
+		xe = tile_to_xe(tile);
+		drm = &xe->drm;
+	}
 
 	err = tile_ttm_mgr_init(tile);
 	if (err)
@@ -186,7 +191,7 @@ int xe_tile_init_noalloc(struct xe_tile *tile)
 
 	err = xe_tile_sysfs_init(tile);
 
-	if (xe->info.has_usm) {
+	if (IS_ENABLED(CONFIG_DRM_XE_SVM) && xe->info.has_usm) {
 		__init_drm_mem_region_ops(&tile->mem.vram.drm_mr.mr_ops);
 		drm_svm_register_mem_region(drm, &tile->mem.vram.drm_mr, MEMORY_DEVICE_PRIVATE);
 	}
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [CI 39/42] drm/xe/svm: Migration from sram to vram for system allocator
  2024-06-13  4:23 [CI 01/42] mm/hmm: let users to tag specific PFNs Oak Zeng
                   ` (37 preceding siblings ...)
  2024-06-13  4:24 ` [CI 38/42] drm/xe/svm: Introduce DRM_XE_SVM kernel config Oak Zeng
@ 2024-06-13  4:24 ` Oak Zeng
  2024-06-13  4:24 ` [CI 40/42] drm/xe/svm: Determine a vma is backed by device memory Oak Zeng
                   ` (2 subsequent siblings)
  41 siblings, 0 replies; 44+ messages in thread
From: Oak Zeng @ 2024-06-13  4:24 UTC (permalink / raw)
  To: intel-xe

If applicable, migrate a range of a hmmptr from sram to vram for the
system allocator. Traditional userptrs are not migrated; only a
userptr created during a fault (i.e. a userptr split from a system
allocator vma, called a fault userptr in the code) can be migrated.

Instead of the whole userptr, only a sub-range of the userptr is
populated and bound to the GPU page table. Right now the range
granularity is 2MiB; this will later be overridden by the memory
attribute API.

FIXME: The migration should be conditional on the user's memory
attribute settings. Add this logic once memory attributes are supported.

v1: Use a non-NULL owner when calling drm_svm_hmmptr_populate (Himal)
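
As an illustration of the clamping described above (not part of the
patch; it simply mirrors the arithmetic added to handle_vma_pagefault()):

#include <linux/align.h>
#include <linux/minmax.h>
#include <linux/sizes.h>
#include <linux/types.h>

/*
 * Clamp the 2MiB-aligned chunk around the faulting CPU address to the
 * userptr range [uptr_start, uptr_end). Illustrative helper only.
 */
static void example_fault_chunk(u64 uptr_start, u64 uptr_end,
				u64 fault_cpu_addr,
				u64 *chunk_start, u64 *chunk_end)
{
	u64 aligned_start = ALIGN_DOWN(fault_cpu_addr, SZ_2M);

	*chunk_start = max_t(u64, uptr_start, aligned_start);
	*chunk_end = min_t(u64, uptr_end, aligned_start + SZ_2M);
}

/*
 * E.g. a fault at 0x7f0000305000 inside a userptr covering
 * [0x7f0000300000, 0x7f0000500000) selects the chunk
 * [0x7f0000300000, 0x7f0000400000): only 1MiB is populated and bound
 * because the userptr start is not 2MiB aligned.
 */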

Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_gt_pagefault.c | 41 ++++++++++++++++++++++++++--
 drivers/gpu/drm/xe/xe_vm.c           | 18 +++++++-----
 drivers/gpu/drm/xe/xe_vm.h           |  2 +-
 3 files changed, 50 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c b/drivers/gpu/drm/xe/xe_gt_pagefault.c
index 9e47e0c97a1b..004cd4d8e1a0 100644
--- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
+++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
@@ -128,6 +128,12 @@ static int handle_vma_pagefault(struct xe_tile *tile, struct pagefault *pf,
 	ktime_t end = 0;
 	int err;
 	bool atomic;
+	/**FIXME: use migration granularity from memory attributes*/
+	u64 migrate_granularity = SZ_2M;
+	u64 fault_addr = pf->page_addr;
+	u64 fault_offset, fault_cpu_addr;
+	u64 aligned_cpu_fault_start, aligned_cpu_fault_end;
+	u64 cpu_va_start, cpu_va_end, gpu_va_start, gpu_va_end;
 
 	trace_xe_vma_pagefault(vma);
 	atomic = access_is_atomic(pf->access_type);
@@ -144,13 +150,42 @@ static int handle_vma_pagefault(struct xe_tile *tile, struct pagefault *pf,
 		}
 	}
 
+	fault_offset = fault_addr - xe_vma_start(vma);
+	fault_cpu_addr = xe_vma_userptr(vma) + fault_offset;
+	aligned_cpu_fault_start = ALIGN_DOWN(fault_cpu_addr, migrate_granularity);
+	aligned_cpu_fault_end  = aligned_cpu_fault_start + migrate_granularity;
+
+	if (xe_vma_is_userptr(vma)) {
+		cpu_va_start = max_t(u64, xe_vma_userptr(vma), aligned_cpu_fault_start);
+		cpu_va_end = min_t(u64, xe_vma_userptr_end(vma), aligned_cpu_fault_end);
+		gpu_va_start = xe_vma_start(vma) + (cpu_va_start - xe_vma_userptr(vma));
+		gpu_va_end = xe_vma_end(vma) - (xe_vma_userptr_end(vma) - cpu_va_end);
+	} else {
+		gpu_va_start = xe_vma_start(vma);
+		gpu_va_end = xe_vma_end(vma);
+	}
+
 retry_userptr:
 	xe_vm_userptr_garbage_collector(vm);
 	if (xe_vma_is_userptr(vma) &&
 	    xe_vma_userptr_check_repin(to_userptr_vma(vma))) {
 		struct xe_userptr_vma *uvma = to_userptr_vma(vma);
+		struct xe_userptr *userptr = &uvma->userptr;
+		struct drm_hmmptr *hmmptr = &userptr->hmmptr;
+
+		mmap_read_lock(hmmptr->notifier.mm);
+		if (xe_vma_is_fault_userptr(vma)) {
+			/**FIXME: add migration policy here*/
+			err = drm_svm_migrate_hmmptr_to_vram(&vm->gpuvm, &tile->mem.vram.drm_mr,
+                hmmptr, cpu_va_start, cpu_va_end);
+			if (err) {
+				mmap_read_unlock(hmmptr->notifier.mm);
+				return err;
+			}
 
-		err = xe_vma_userptr_pin_pages(uvma);
+		}
+		err = xe_vma_userptr_pin_pages(uvma, cpu_va_start, cpu_va_end, true);
+		mmap_read_unlock(hmmptr->notifier.mm);
 		if (err)
 			return err;
 	}
@@ -167,8 +202,8 @@ static int handle_vma_pagefault(struct xe_tile *tile, struct pagefault *pf,
 
 		/* Bind VMA only to the GT that has faulted */
 		trace_xe_vma_pf_bind(vma);
-		fence = xe_vma_rebind(vm, vma, xe_vma_start(vma),
-			    xe_vma_end(vma), BIT(tile->id));
+		fence = xe_vma_rebind(vm, vma, gpu_va_start,
+			    gpu_va_end, BIT(tile->id));
 		if (IS_ERR(fence)) {
 			err = PTR_ERR(fence);
 			if (xe_vm_validate_should_retry(&exec, err, &end))
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index b1ab86890226..617949313490 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -76,6 +76,7 @@ static void xe_vma_userptr_dma_map_pages(struct xe_userptr_vma *uvma,
 	xe_assert(xe, xe_vma_is_userptr(vma));
 	xe_assert(xe, start >= xe_vma_userptr(vma));
 	xe_assert(xe, end <= xe_vma_userptr_end(vma));
+
 	drm_svm_hmmptr_map_dma_pages(hmmptr, page_idx, npages);
 }
 
@@ -96,7 +97,7 @@ static void xe_vma_userptr_dma_unmap_pages(struct xe_userptr_vma *uvma,
 	drm_svm_hmmptr_unmap_dma_pages(hmmptr, page_idx, npages);
 }
 
-int xe_vma_userptr_pin_pages(struct xe_userptr_vma *uvma)
+int xe_vma_userptr_pin_pages(struct xe_userptr_vma *uvma, u64 start, u64 end, bool mmap_locked)
 {
 	struct drm_hmmptr *hmmptr = &uvma->userptr.hmmptr;
 	struct xe_vma *vma = &uvma->vma;
@@ -106,14 +107,15 @@ int xe_vma_userptr_pin_pages(struct xe_userptr_vma *uvma)
 
 	lockdep_assert_held(&vm->lock);
 	xe_assert(xe, xe_vma_is_userptr(vma));
+	xe_assert(xe, start >= xe_vma_userptr(vma));
+	xe_assert(xe, end <= xe_vma_userptr_end(vma));
 
-	ret = drm_svm_hmmptr_populate(hmmptr, NULL, xe_vma_userptr(vma),
-			 xe_vma_userptr(vma) + xe_vma_size(vma),
-			 !xe_vma_read_only(vma), false);
+	ret = drm_svm_hmmptr_populate(hmmptr, xe, start, end,
+			 !xe_vma_read_only(vma), mmap_locked);
 	if (ret)
 		return ret;
 
-	xe_vma_userptr_dma_map_pages(uvma, xe_vma_userptr(vma), xe_vma_userptr_end(vma));
+	xe_vma_userptr_dma_map_pages(uvma, start, end);
 	return 0;
 }
 
@@ -736,7 +738,8 @@ int xe_vm_userptr_pin(struct xe_vm *vm)
 	/* Pin and move to temporary list */
 	list_for_each_entry_safe(uvma, next, &vm->userptr.repin_list,
 				 userptr.repin_link) {
-		err = xe_vma_userptr_pin_pages(uvma);
+		err = xe_vma_userptr_pin_pages(uvma, xe_vma_userptr(&uvma->vma),
+			xe_vma_userptr_end(&uvma->vma), false);
 		if (err == -EFAULT) {
 			list_del_init(&uvma->userptr.repin_link);
 
@@ -2116,7 +2119,8 @@ static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
 		drm_exec_fini(&exec);
 
 	if (xe_vma_is_userptr(vma) && !xe_vma_is_fault_userptr(vma)) {
-		err = xe_vma_userptr_pin_pages(to_userptr_vma(vma));
+		err = xe_vma_userptr_pin_pages(to_userptr_vma(vma), xe_vma_userptr(vma),
+				xe_vma_userptr_end(vma), false);
 		if (err) {
 			prep_vma_destroy(vm, vma, false);
 			xe_vma_destroy_unlocked(vma);
diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
index b3b6ceec39ba..f24891cb1fcb 100644
--- a/drivers/gpu/drm/xe/xe_vm.h
+++ b/drivers/gpu/drm/xe/xe_vm.h
@@ -253,7 +253,7 @@ static inline void xe_vm_reactivate_rebind(struct xe_vm *vm)
 	}
 }
 
-int xe_vma_userptr_pin_pages(struct xe_userptr_vma *uvma);
+int xe_vma_userptr_pin_pages(struct xe_userptr_vma *uvma, u64 start, u64 end, bool mmap_locked);
 
 int xe_vma_userptr_check_repin(struct xe_userptr_vma *uvma);
 
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [CI 40/42] drm/xe/svm: Determine a vma is backed by device memory
  2024-06-13  4:23 [CI 01/42] mm/hmm: let users to tag specific PFNs Oak Zeng
                   ` (38 preceding siblings ...)
  2024-06-13  4:24 ` [CI 39/42] drm/xe/svm: Migration from sram to vram for system allocator Oak Zeng
@ 2024-06-13  4:24 ` Oak Zeng
  2024-06-13  4:24 ` [CI 41/42] drm/xe/svm: Introduce hmm_pfn array based resource cursor Oak Zeng
  2024-06-13  4:24 ` [CI 42/42] drm/xe: Enable system allocator uAPI Oak Zeng
  41 siblings, 0 replies; 44+ messages in thread
From: Oak Zeng @ 2024-06-13  4:24 UTC (permalink / raw)
  To: intel-xe

With the system allocator, a userptr can now also be backed by device
memory. Introduce a helper function, xe_vma_is_devmem, to determine
whether a range of a vma is backed by device memory.

Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_pt.c | 26 ++++++++++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
index 91b61fa80acb..691ede2faec5 100644
--- a/drivers/gpu/drm/xe/xe_pt.c
+++ b/drivers/gpu/drm/xe/xe_pt.c
@@ -3,6 +3,7 @@
  * Copyright © 2022 Intel Corporation
  */
 
+#include <linux/hmm.h>
 #include "xe_pt.h"
 
 #include "regs/xe_gtt_defs.h"
@@ -578,6 +579,28 @@ static const struct xe_pt_walk_ops xe_pt_stage_bind_ops = {
 	.pt_entry = xe_pt_stage_bind_entry,
 };
 
+static bool xe_vma_is_devmem(struct xe_vma *vma, u64 start)
+{
+	if (xe_vma_is_userptr(vma)) {
+		struct xe_userptr_vma *uvma = to_userptr_vma(vma);
+		struct drm_hmmptr *hmmptr = &uvma->userptr.hmmptr;
+		u64 offset = start - xe_vma_start(vma);
+		u64 page_idx = offset >> PAGE_SHIFT;
+		u64 hmm_pfn = hmmptr->pfn[page_idx];
+		struct page *page = hmm_pfn_to_page(hmm_pfn);
+
+		/**
+		 * FIXME: Assume there is no mixture system memory and device
+		 * memory placement in the [start, end) range of vma. We might
+		 * need to relook at this in the future.
+		 */
+		return is_device_private_page(page);
+	} else {
+		struct xe_bo *bo = xe_vma_bo(vma);
+		return bo && (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo));
+	}
+}
+
 /**
  * xe_pt_stage_bind() - Build a disconnected page-table tree for a given address
  * range.
@@ -604,8 +627,7 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma, u64 start, u64 end,
 {
 	struct xe_device *xe = tile_to_xe(tile);
 	struct xe_bo *bo = xe_vma_bo(vma);
-	bool is_devmem = !xe_vma_is_userptr(vma) && bo &&
-		(xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo));
+	bool is_devmem = xe_vma_is_devmem(vma, start);
 	struct xe_res_cursor curs;
 	struct xe_pt_stage_bind_walk xe_walk = {
 		.base = {
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [CI 41/42] drm/xe/svm: Introduce hmm_pfn array based resource cursor
  2024-06-13  4:23 [CI 01/42] mm/hmm: let users to tag specific PFNs Oak Zeng
                   ` (39 preceding siblings ...)
  2024-06-13  4:24 ` [CI 40/42] drm/xe/svm: Determine a vma is backed by device memory Oak Zeng
@ 2024-06-13  4:24 ` Oak Zeng
  2024-06-13  4:24 ` [CI 42/42] drm/xe: Enable system allocator uAPI Oak Zeng
  41 siblings, 0 replies; 44+ messages in thread
From: Oak Zeng @ 2024-06-13  4:24 UTC (permalink / raw)
  To: intel-xe

This type of resource cursor will be used by the system allocator and
by userptrs. With the system allocator, all backing resources are
backed by struct page; the resource can be in system memory or in gpu
device memory. For a userptr, the backing resource is always in system
memory.

For system memory, the page is already dma-mapped and the dma-mapped
address is in the dma_addr array.

For gpu device memory, the device physical address is calculated from
the hmm_pfn.

Note that mixed placement of system memory and device memory is
supported: within the resource range, some pages can be backed by
system memory and some by device memory.
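
A minimal usage sketch of the new cursor, assuming only the
xe_res_first_hmmptr(), xe_res_dma() and xe_res_next() interfaces added
below; the surrounding helper and its arguments are illustrative:

#include "xe_res_cursor.h"

/*
 * Walk an hmmptr-backed range in PAGE_SIZE steps. The hmmptr, memory
 * region and page count are assumed to come from the bind code.
 */
static void example_walk_hmmptr(struct drm_hmmptr *hmmptr,
				struct drm_mem_region *mr,
				unsigned long npages)
{
	struct xe_res_cursor curs;
	u64 size = (u64)npages << PAGE_SHIFT;

	xe_res_first_hmmptr(hmmptr->dma_addr, hmmptr->pfn, mr, size, &curs);
	while (curs.remaining) {
		/*
		 * Either the dma-mapped system address or the device
		 * physical address, depending on whether the current
		 * page is device-private.
		 */
		u64 addr = xe_res_dma(&curs);

		/* ... program a PTE for addr ... */
		(void)addr;
		xe_res_next(&curs, PAGE_SIZE);
	}
}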

Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_pt.c         |  5 +-
 drivers/gpu/drm/xe/xe_res_cursor.h | 80 ++++++++++++++++++++----------
 2 files changed, 57 insertions(+), 28 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
index 691ede2faec5..76c4cbe4e984 100644
--- a/drivers/gpu/drm/xe/xe_pt.c
+++ b/drivers/gpu/drm/xe/xe_pt.c
@@ -698,8 +698,9 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma, u64 start, u64 end,
 		u64 offset = start - xe_vma_start(vma);
 		u64 page_idx = offset >> PAGE_SHIFT;
 		if (xe_vma_is_userptr(vma))
-			xe_res_first_dma(to_userptr_vma(vma)->userptr.hmmptr.dma_addr + page_idx,
-					0, xe_vma_size(vma), 0, &curs);
+			xe_res_first_hmmptr(to_userptr_vma(vma)->userptr.hmmptr.dma_addr + page_idx,
+					to_userptr_vma(vma)->userptr.hmmptr.pfn + page_idx,
+					&tile->mem.vram.drm_mr, size, &curs);
 		else if (xe_bo_is_vram(bo) || xe_bo_is_stolen(bo))
 			xe_res_first(bo->ttm.resource, xe_vma_bo_offset(vma) + offset,
 				     size, &curs);
diff --git a/drivers/gpu/drm/xe/xe_res_cursor.h b/drivers/gpu/drm/xe/xe_res_cursor.h
index b17b3375f6d9..fae53c093a57 100644
--- a/drivers/gpu/drm/xe/xe_res_cursor.h
+++ b/drivers/gpu/drm/xe/xe_res_cursor.h
@@ -25,12 +25,16 @@
 #define _XE_RES_CURSOR_H_
 
 #include <linux/scatterlist.h>
+#include <linux/mm_types.h>
+#include <linux/memremap.h>
+#include <linux/hmm.h>
 
 #include <drm/drm_mm.h>
 #include <drm/ttm/ttm_placement.h>
 #include <drm/ttm/ttm_range_manager.h>
 #include <drm/ttm/ttm_resource.h>
 #include <drm/ttm/ttm_tt.h>
+#include <drm/drm_svm.h>
 
 #include "xe_bo.h"
 #include "xe_device.h"
@@ -44,10 +48,11 @@ struct xe_res_cursor {
 	u64 remaining;
 	void *node;
 	u32 mem_type;
-	unsigned int order;
 	struct scatterlist *sgl;
-	const dma_addr_t *dma_addr;
+	dma_addr_t *dma_addr;
 	struct drm_buddy *mm;
+	unsigned long *hmm_pfn;
+	struct drm_mem_region *mr;
 };
 
 static struct drm_buddy *xe_res_get_buddy(struct ttm_resource *res)
@@ -80,6 +85,8 @@ static inline void xe_res_first(struct ttm_resource *res,
 	XE_WARN_ON(start + size > res->size);
 
 	cur->mem_type = res->mem_type;
+	cur->hmm_pfn = NULL;
+	cur->mr = NULL;
 
 	switch (cur->mem_type) {
 	case XE_PL_STOLEN:
@@ -160,6 +167,8 @@ static inline void xe_res_first_sg(const struct sg_table *sg,
 				   struct xe_res_cursor *cur)
 {
 	XE_WARN_ON(!sg);
+	cur->hmm_pfn = NULL;
+	cur->mr = NULL;
 	cur->node = NULL;
 	cur->start = start;
 	cur->remaining = size;
@@ -171,34 +180,43 @@ static inline void xe_res_first_sg(const struct sg_table *sg,
 }
 
 /**
- * xe_res_first_dma - initialize a xe_res_cursor with dma_addr array
+ * xe_res_first_hmmptr - initialize a xe_res_cursor for hmmptr
  *
- * @dma_addr: dma_addr array to walk
- * @start: Start of the range
+ * @dma_addr: dma_addr array to walk, valid when resource is in system mem.
+ * @hmm_pfn: a hmm_pfn array, each item contains hmm_pfn of a 4k page.
+ * @mr: memory region that the resource belongs to
  * @size: Size of the range
- * @order: Order of dma mapping. i.e. PAGE_SIZE << order is mapping size
  * @cur: cursor object to initialize
  *
- * Start walking over the range of allocations between @start and @size.
+ * Start walking over the resources used by a hmmptr. For a hmmptr,
+ * the backing resources are all backed by struct page. The resource
+ * can be in system memory or in gpu device memory.
+ *
+ * For system memory, the page is already dma-mapped and the dma-mapped
+ * address is in the dma_addr array.
+ *
+ * For gpu device memory, the device physical address is calculated
+ * from the hmm_pfn.
+ *
+ * Note that mixed placement of system memory and device memory is
+ * supported: within the resource range, some pages can be backed by
+ * system memory and some by device memory.
  */
-static inline void xe_res_first_dma(const dma_addr_t *dma_addr,
-				    u64 start, u64 size,
-				    unsigned int order,
-				    struct xe_res_cursor *cur)
+static inline void xe_res_first_hmmptr(dma_addr_t *dma_addr,
+                    unsigned long *hmm_pfn, struct drm_mem_region *mr,
+				    u64 size, struct xe_res_cursor *cur)
 {
-	XE_WARN_ON(start);
 	XE_WARN_ON(!dma_addr);
-	XE_WARN_ON(!IS_ALIGNED(start, PAGE_SIZE) ||
-		   !IS_ALIGNED(size, PAGE_SIZE));
+	XE_WARN_ON(!IS_ALIGNED(size, PAGE_SIZE));
 
 	cur->node = NULL;
-	cur->start = start;
 	cur->remaining = size;
-	cur->size = PAGE_SIZE << order;
+	cur->size = 0;
 	cur->dma_addr = dma_addr;
-	cur->order = order;
 	cur->sgl = NULL;
 	cur->mem_type = XE_PL_TT;
+	cur->hmm_pfn = hmm_pfn;
+	cur->mr = mr;
 }
 
 /**
@@ -221,15 +239,19 @@ static inline void xe_res_next(struct xe_res_cursor *cur, u64 size)
 	if (!cur->remaining)
 		return;
 
-	if (cur->size > size) {
-		cur->size -= size;
-		cur->start += size;
+	if (cur->mr) {
+		int npages;
+
+		XE_WARN_ON(!IS_ALIGNED(size, PAGE_SIZE));
+
+		npages = size >> PAGE_SHIFT;
+		cur->hmm_pfn += npages;
+		cur->dma_addr += npages;
 		return;
 	}
 
-	if (cur->dma_addr) {
-		cur->size = (PAGE_SIZE << cur->order) -
-			(size - cur->size);
+	if (cur->size > size) {
+		cur->size -= size;
 		cur->start += size;
 		return;
 	}
@@ -275,9 +297,15 @@ static inline void xe_res_next(struct xe_res_cursor *cur, u64 size)
  */
 static inline u64 xe_res_dma(const struct xe_res_cursor *cur)
 {
-	if (cur->dma_addr)
-		return cur->dma_addr[cur->start >> (PAGE_SHIFT + cur->order)] +
-			(cur->start & ((PAGE_SIZE << cur->order) - 1));
+	if (cur->mr) {
+		u64 hmm_pfn = *cur->hmm_pfn;
+		struct page *page = hmm_pfn_to_page(hmm_pfn);
+
+		if (is_device_private_page(page))
+			return drm_mem_region_page_to_dpa(cur->mr, page);
+		else
+			return *cur->dma_addr;
+	}
 	else if (cur->sgl)
 		return sg_dma_address(cur->sgl) + cur->start;
 	else
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [CI 42/42] drm/xe: Enable system allocator uAPI
  2024-06-13  4:23 [CI 01/42] mm/hmm: let users to tag specific PFNs Oak Zeng
                   ` (40 preceding siblings ...)
  2024-06-13  4:24 ` [CI 41/42] drm/xe/svm: Introduce hmm_pfn array based resource cursor Oak Zeng
@ 2024-06-13  4:24 ` Oak Zeng
  41 siblings, 0 replies; 44+ messages in thread
From: Oak Zeng @ 2024-06-13  4:24 UTC (permalink / raw)
  To: intel-xe

From: Matthew Brost <matthew.brost@intel.com>

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_vm.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 617949313490..28ee7bf5d4c5 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -2871,12 +2871,6 @@ static int vm_bind_ioctl_check_args(struct xe_device *xe,
 		u16 pat_index = (*bind_ops)[i].pat_index;
 		u16 coh_mode;
 
-		/* FIXME: Disabling system allocator for now */
-		if (XE_IOCTL_DBG(xe, is_system_allocator)) {
-			err = -EOPNOTSUPP;
-			goto free_bind_ops;
-		}
-
 		if (XE_IOCTL_DBG(xe, pat_index >= xe->pat.n_entries)) {
 			err = -EINVAL;
 			goto free_bind_ops;
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [CI 32/42] drm/xe/svm: implement functions to allocate and free device memory
  2024-06-13 15:30 [CI 01/42] mm/hmm: let users to tag specific PFNs Oak Zeng
@ 2024-06-13 15:31 ` Oak Zeng
  0 siblings, 0 replies; 44+ messages in thread
From: Oak Zeng @ 2024-06-13 15:31 UTC (permalink / raw)
  To: intel-xe

Function xe_svm_alloc_pages allocates pages from the drm buddy
allocator and performs housekeeping work for all the pages allocated,
such as taking a page refcount and setting each page's
zone_device_data to point to the drm buddy block the page belongs to.

Function xe_svm_free_page frees one page back to the drm buddy allocator.

v1: Drop buddy block meta data (Matt)
    Take vram_mgr->lock during buddy operation (Matt)
    Moving to page granularity memory free (Oak)
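
A minimal usage sketch of the two new entry points (prototypes as added
in this patch). The pfn bookkeeping around them is illustrative, and in
practice xe_svm_free_page() is reached through the pagemap's page_free
callback rather than called directly:

#include <linux/mm.h>
#include <linux/slab.h>
#include "xe_svm.h"

/* Illustrative caller: allocate npages of VRAM backing, then release it. */
static int example_alloc_and_free(struct drm_mem_region *mr,
				  unsigned long npages)
{
	unsigned long i, *pfn;
	int err;

	pfn = kvcalloc(npages, sizeof(*pfn), GFP_KERNEL);
	if (!pfn)
		return -ENOMEM;

	/* Fills pfn[0..npages-1]; each page is zone_device_page_init()ed */
	err = xe_svm_alloc_pages(mr, npages, pfn);
	if (!err)
		for (i = 0; i < npages; i++)
			xe_svm_free_page(pfn_to_page(pfn[i]));

	kvfree(pfn);
	return err;
}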

Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/drm_svm.c   |   1 -
 drivers/gpu/drm/xe/Makefile |   1 +
 drivers/gpu/drm/xe/xe_svm.c | 148 ++++++++++++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_svm.h |  16 ++++
 4 files changed, 165 insertions(+), 1 deletion(-)
 create mode 100644 drivers/gpu/drm/xe/xe_svm.c
 create mode 100644 drivers/gpu/drm/xe/xe_svm.h

diff --git a/drivers/gpu/drm/drm_svm.c b/drivers/gpu/drm/drm_svm.c
index 967e7addae54..a751cc2c9c91 100644
--- a/drivers/gpu/drm/drm_svm.c
+++ b/drivers/gpu/drm/drm_svm.c
@@ -3,7 +3,6 @@
  * Copyright © 2024 Intel Corporation
  */
 
-
 #include <linux/scatterlist.h>
 #include <linux/mmu_notifier.h>
 #include <linux/dma-mapping.h>
diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
index 6457f1b74cd8..cf5ce447648c 100644
--- a/drivers/gpu/drm/xe/Makefile
+++ b/drivers/gpu/drm/xe/Makefile
@@ -108,6 +108,7 @@ xe-y += xe_bb.o \
 	xe_sa.o \
 	xe_sched_job.o \
 	xe_step.o \
+	xe_svm.o \
 	xe_sync.o \
 	xe_tile.o \
 	xe_tile_sysfs.o \
diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
new file mode 100644
index 000000000000..a2c52f205ea0
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -0,0 +1,148 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2023 Intel Corporation
+ */
+
+#include <linux/mm_types.h>
+#include <linux/sched/mm.h>
+#include <linux/gfp.h>
+#include <drm/drm_buddy.h>
+#include "xe_ttm_vram_mgr_types.h"
+#include "xe_device_types.h"
+#include "xe_device.h"
+#include "xe_tile.h"
+#include "xe_svm.h"
+#include "xe_assert.h"
+
+/* This is a placeholder for compilation. The real function will be introduced in drm buddy. */
+static void drm_buddy_block_free_subrange(struct drm_buddy_block *block, u64 start,
+                        u64 end, struct list_head *block_list)
+{
+}
+
+static u64 block_offset_to_pfn(struct xe_mem_region *mr, u64 offset)
+{
+	/** DRM buddy's block offset is 0-based*/
+	offset += mr->drm_mr.hpa_base;
+
+	return PHYS_PFN(offset);
+}
+
+void xe_svm_free_page(struct page *page)
+{
+	struct drm_buddy_block *block =
+					(struct drm_buddy_block *)page->zone_device_data;
+	struct drm_mem_region *mr = drm_page_to_mem_region(page);
+	struct xe_mem_region *xe_mr = container_of(mr, struct xe_mem_region, drm_mr);
+	struct xe_tile *tile = xe_mem_region_to_tile(xe_mr);
+	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
+	u64 size = drm_buddy_block_size(mm, block);
+	u64 pages_per_block = size >> PAGE_SHIFT;
+	u64 block_pfn_first =
+					block_offset_to_pfn(xe_mr, drm_buddy_block_offset(block));
+	u64 page_pfn = page_to_pfn(page);
+	u64 page_idx = page_pfn - block_pfn_first;
+	struct list_head blocks = LIST_HEAD_INIT(blocks);
+	struct drm_buddy_block *tmp;
+	int i;
+
+	xe_assert(tile->xe, page_idx < pages_per_block);
+	drm_buddy_block_free_subrange(block, page_idx << PAGE_SHIFT,
+            (page_idx << PAGE_SHIFT) + PAGE_SIZE, &blocks);
+
+	/**
+	 * page's buddy_block (record in page's zone_device_data) is changed
+	 * due to above partial free. update zone_device_data.
+	 */
+	list_for_each_entry_safe(block, tmp, &blocks, link) {
+		size = drm_buddy_block_size(mm, block);
+		pages_per_block = size >> PAGE_SHIFT;
+		block_pfn_first =
+					block_offset_to_pfn(xe_mr, drm_buddy_block_offset(block));
+		for(i = 0; i < pages_per_block; i++) {
+			struct page *p;
+
+			p = pfn_to_page(block_pfn_first + i);
+			p->zone_device_data = block;
+		}
+	}
+}
+
+/**
+ * __free_blocks() - free all memory blocks
+ *
+ * @mm: drm_buddy that the blocks belongs to
+ * @blocks: memory blocks list head
+ */
+static void __free_blocks(struct drm_buddy *mm, struct list_head *blocks)
+{
+	struct drm_buddy_block *block, *tmp;
+
+	list_for_each_entry_safe(block, tmp, blocks, link)
+		drm_buddy_free_block(mm, block);
+}
+
+/**
+ * xe_svm_alloc_pages() - allocate device pages from buddy allocator
+ *
+ * @mr: which memory region to allocate device memory from
+ * @npages: how many pages to allocate
+ * @pfn: used to return the pfn of each allocated page. Must be big enough
+ * to hold at least @npages entries.
+ *
+ * This function allocates blocks of memory from the drm buddy allocator and
+ * performs initialization work: set struct page::zone_device_data to point
+ * to the memory block; zone_device_page_init() each page allocated; add pages
+ * to the lru manager's lru list for eviction purposes - this is TBD.
+ *
+ * Return: 0 on success,
+ * error code otherwise
+ */
+int xe_svm_alloc_pages(struct drm_mem_region *mr,
+            unsigned long npages, unsigned long *pfn)
+{
+	struct xe_mem_region *xe_mr = container_of(mr, struct xe_mem_region, drm_mr);
+	struct xe_tile *tile = xe_mem_region_to_tile(xe_mr);
+	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
+	struct drm_buddy_block *block, *tmp;
+	u64 size = npages << PAGE_SHIFT;
+	int ret = 0, i, j = 0;
+	struct list_head blocks;
+
+	mutex_lock(&tile->mem.vram_mgr->lock);
+	ret = drm_buddy_alloc_blocks(mm, 0, mm->size, size, PAGE_SIZE,
+						&blocks, DRM_BUDDY_CONTIGUOUS_ALLOCATION);
+
+	if (unlikely(ret)) {
+		mutex_unlock(&tile->mem.vram_mgr->lock);
+		return ret;
+	}
+
+	ret = drm_buddy_block_trim(mm, size, &blocks);
+	if (ret) {
+		__free_blocks(mm, &blocks);
+		mutex_unlock(&tile->mem.vram_mgr->lock);
+		return ret;
+	}
+	mutex_unlock(&tile->mem.vram_mgr->lock);
+
+	list_for_each_entry_safe(block, tmp, &blocks, link) {
+		u64 block_pfn_first, pages_per_block;
+
+		size = drm_buddy_block_size(mm, block);
+		pages_per_block = size >> PAGE_SHIFT;
+		block_pfn_first =
+					block_offset_to_pfn(xe_mr, drm_buddy_block_offset(block));
+		for(i = 0; i < pages_per_block; i++) {
+			struct page *page;
+
+			pfn[j++] = block_pfn_first + i;
+			page = pfn_to_page(block_pfn_first + i);
+			/**Lock page per hmm requirement, see hmm.rst.*/
+			zone_device_page_init(page);
+			page->zone_device_data = block;
+		}
+	}
+
+	return ret;
+}
diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
new file mode 100644
index 000000000000..09fef8ea97f3
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -0,0 +1,16 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2024 Intel Corporation
+ */
+
+#ifndef _XE_SVM_
+#define _XE_SVM_
+
+struct drm_mem_region;
+struct page;
+
+int xe_svm_alloc_pages(struct drm_mem_region *mr,
+            unsigned long npages, unsigned long *pfn);
+void xe_svm_free_page(struct page *page);
+
+#endif
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2024-06-13 15:21 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-06-13  4:23 [CI 01/42] mm/hmm: let users to tag specific PFNs Oak Zeng
2024-06-13  4:20 ` ✗ CI.Patch_applied: failure for series starting with [CI,01/42] " Patchwork
2024-06-13  4:23 ` [CI 02/42] dma-mapping: provide an interface to allocate IOVA Oak Zeng
2024-06-13  4:23 ` [CI 03/42] dma-mapping: provide callbacks to link/unlink pages to specific IOVA Oak Zeng
2024-06-13  4:23 ` [CI 04/42] iommu/dma: Provide an interface to allow preallocate IOVA Oak Zeng
2024-06-13  4:23 ` [CI 05/42] iommu/dma: Prepare map/unmap page functions to receive IOVA Oak Zeng
2024-06-13  4:23 ` [CI 06/42] iommu/dma: Implement link/unlink page callbacks Oak Zeng
2024-06-13  4:23 ` [CI 07/42] drm: Move GPUVA_START/LAST to drm_gpuvm.h Oak Zeng
2024-06-13  4:23 ` [CI 08/42] drm/svm: Mark drm_gpuvm to participate SVM Oak Zeng
2024-06-13  4:23 ` [CI 09/42] drm/svm: introduce drm_mem_region concept Oak Zeng
2024-06-13  4:23 ` [CI 10/42] drm/svm: introduce hmmptr and helper functions Oak Zeng
2024-06-13  4:23 ` [CI 11/42] drm/svm: Introduce helper to remap drm memory region Oak Zeng
2024-06-13  4:23 ` [CI 12/42] drm/svm: handle CPU page fault Oak Zeng
2024-06-13  4:24 ` [CI 13/42] drm/svm: Migrate a range of hmmptr to vram Oak Zeng
2024-06-13  4:24 ` [CI 14/42] drm/svm: Add DRM SVM documentation Oak Zeng
2024-06-13  4:24 ` [CI 15/42] drm/xe: s/xe_tile_migrate_engine/xe_tile_migrate_exec_queue Oak Zeng
2024-06-13  4:24 ` [CI 16/42] drm/xe: Add xe_vm_pgtable_update_op to xe_vma_ops Oak Zeng
2024-06-13  4:24 ` [CI 17/42] drm/xe: Convert multiple bind ops into single job Oak Zeng
2024-06-13  4:24 ` [CI 18/42] drm/xe: Update VM trace events Oak Zeng
2024-06-13  4:24 ` [CI 19/42] drm/xe: Update PT layer with better error handling Oak Zeng
2024-06-13  4:24 ` [CI 20/42] drm/xe: Retry BO allocation Oak Zeng
2024-06-13  4:24 ` [CI 21/42] drm/xe/uapi: Add DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR flag Oak Zeng
2024-06-13  4:24 ` [CI 22/42] drm/xe: Add a helper to calculate userptr end address Oak Zeng
2024-06-13  4:24 ` [CI 23/42] drm/xe: Add dma_addr res cursor Oak Zeng
2024-06-13  4:24 ` [CI 24/42] drm/xe: Use drm_mem_region for xe Oak Zeng
2024-06-13  4:24 ` [CI 25/42] drm/xe: use drm_hmmptr in xe Oak Zeng
2024-06-13  4:24 ` [CI 26/42] drm/xe: Moving to range based vma invalidation Oak Zeng
2024-06-13  4:24 ` [CI 27/42] drm/xe: Support range based page table update Oak Zeng
2024-06-13  4:24 ` [CI 28/42] drm/xe/uapi: Add DRM_XE_VM_CREATE_FLAG_PARTICIPATE_SVM flag Oak Zeng
2024-06-13  4:24 ` [CI 29/42] drm/xe/svm: Create userptr if page fault occurs on system_allocator VMA Oak Zeng
2024-06-13  4:24 ` [CI 30/42] drm/xe/svm: Add faulted userptr VMA garbage collector Oak Zeng
2024-06-13  4:24 ` [CI 31/42] drm/xe: Introduce helper to get tile from memory region Oak Zeng
2024-06-13  4:24 ` [CI 32/42] drm/xe/svm: implement functions to allocate and free device memory Oak Zeng
2024-06-13  4:24 ` [CI 33/42] drm/xe/svm: Get drm device from drm memory region Oak Zeng
2024-06-13  4:24 ` [CI 34/42] drm/xe/svm: Get page map owner of a " Oak Zeng
2024-06-13  4:24 ` [CI 35/42] drm/xe/svm: Add migrate layer functions for SVM support Oak Zeng
2024-06-13  4:24 ` [CI 36/42] drm/xe/svm: introduce svm migration function Oak Zeng
2024-06-13  4:24 ` [CI 37/42] drm/xe/svm: Register xe memory region to drm layer Oak Zeng
2024-06-13  4:24 ` [CI 38/42] drm/xe/svm: Introduce DRM_XE_SVM kernel config Oak Zeng
2024-06-13  4:24 ` [CI 39/42] drm/xe/svm: Migration from sram to vram for system allocator Oak Zeng
2024-06-13  4:24 ` [CI 40/42] drm/xe/svm: Determine a vma is backed by device memory Oak Zeng
2024-06-13  4:24 ` [CI 41/42] drm/xe/svm: Introduce hmm_pfn array based resource cursor Oak Zeng
2024-06-13  4:24 ` [CI 42/42] drm/xe: Enable system allocator uAPI Oak Zeng
  -- strict thread matches above, loose matches on Subject: below --
2024-06-13 15:30 [CI 01/42] mm/hmm: let users to tag specific PFNs Oak Zeng
2024-06-13 15:31 ` [CI 32/42] drm/xe/svm: implement functions to allocate and free device memory Oak Zeng

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox