* [PATCH 1/3] drm/xe/hmm: Style- and include fixes
[not found] <20250228104418.44313-1-thomas.hellstrom@linux.intel.com>
@ 2025-02-28 10:44 ` Thomas Hellström
2025-02-28 12:56 ` Matthew Auld
2025-02-28 10:44 ` [PATCH 2/3] drm/xe/hmm: Don't dereference struct page pointers without notifier lock Thomas Hellström
2025-02-28 10:44 ` [PATCH 3/3] drm/xe/userptr: Unmap userptrs in the mmu notifier Thomas Hellström
2 siblings, 1 reply; 8+ messages in thread
From: Thomas Hellström @ 2025-02-28 10:44 UTC (permalink / raw)
To: intel-xe; +Cc: Thomas Hellström, Oak Zeng, stable
Add a proper #ifndef guard around the xe_hmm.h header and proper
spacing, and since the documentation mostly follows the kerneldoc
format, convert it to kerneldoc. Also prepare for upcoming -stable
fixes.
Fixes: 81e058a3e7fd ("drm/xe: Introduce helper to populate userptr")
Cc: Oak Zeng <oak.zeng@intel.com>
Cc: <stable@vger.kernel.org> # v6.10+
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
drivers/gpu/drm/xe/xe_hmm.c | 9 +++------
drivers/gpu/drm/xe/xe_hmm.h | 5 +++++
2 files changed, 8 insertions(+), 6 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_hmm.c b/drivers/gpu/drm/xe/xe_hmm.c
index 089834467880..c56738fa713b 100644
--- a/drivers/gpu/drm/xe/xe_hmm.c
+++ b/drivers/gpu/drm/xe/xe_hmm.c
@@ -19,11 +19,10 @@ static u64 xe_npages_in_range(unsigned long start, unsigned long end)
return (end - start) >> PAGE_SHIFT;
}
-/*
+/**
* xe_mark_range_accessed() - mark a range is accessed, so core mm
* have such information for memory eviction or write back to
* hard disk
- *
* @range: the range to mark
* @write: if write to this range, we mark pages in this range
* as dirty
@@ -43,11 +42,10 @@ static void xe_mark_range_accessed(struct hmm_range *range, bool write)
}
}
-/*
+/**
* xe_build_sg() - build a scatter gather table for all the physical pages/pfn
* in a hmm_range. dma-map pages if necessary. dma-address is save in sg table
* and will be used to program GPU page table later.
- *
* @xe: the xe device who will access the dma-address in sg table
* @range: the hmm range that we build the sg table from. range->hmm_pfns[]
* has the pfn numbers of pages that back up this hmm address range.
@@ -112,9 +110,8 @@ static int xe_build_sg(struct xe_device *xe, struct hmm_range *range,
return ret;
}
-/*
+/**
* xe_hmm_userptr_free_sg() - Free the scatter gather table of userptr
- *
* @uvma: the userptr vma which hold the scatter gather table
*
* With function xe_userptr_populate_range, we allocate storage of
diff --git a/drivers/gpu/drm/xe/xe_hmm.h b/drivers/gpu/drm/xe/xe_hmm.h
index 909dc2bdcd97..9602cb7d976d 100644
--- a/drivers/gpu/drm/xe/xe_hmm.h
+++ b/drivers/gpu/drm/xe/xe_hmm.h
@@ -3,9 +3,14 @@
* Copyright © 2024 Intel Corporation
*/
+#ifndef _XE_HMM_H_
+#define _XE_HMM_H_
+
#include <linux/types.h>
struct xe_userptr_vma;
int xe_hmm_userptr_populate_range(struct xe_userptr_vma *uvma, bool is_mm_mmap_locked);
+
void xe_hmm_userptr_free_sg(struct xe_userptr_vma *uvma);
+#endif
--
2.48.1
* [PATCH 2/3] drm/xe/hmm: Don't dereference struct page pointers without notifier lock
[not found] <20250228104418.44313-1-thomas.hellstrom@linux.intel.com>
2025-02-28 10:44 ` [PATCH 1/3] drm/xe/hmm: Style- and include fixes Thomas Hellström
@ 2025-02-28 10:44 ` Thomas Hellström
2025-02-28 12:55 ` Matthew Auld
2025-02-28 10:44 ` [PATCH 3/3] drm/xe/userptr: Unmap userptrs in the mmu notifier Thomas Hellström
2 siblings, 1 reply; 8+ messages in thread
From: Thomas Hellström @ 2025-02-28 10:44 UTC (permalink / raw)
To: intel-xe; +Cc: Thomas Hellström, Oak Zeng, stable
The pfns that we obtain from hmm_range_fault() point to pages that
we don't have a reference on, and they are only guaranteed to still
be present in the cpu page-tables while the notifier lock is held and
the notifier seqno is still valid.
So while building the sg table and marking the pages accessed / dirty
we need to hold this lock with a validated seqno.
However, the lock is reclaim tainted, which makes
sg_alloc_table_from_pages_segment() unusable, since it internally
allocates memory.
Instead build the sg-table manually. For the non-iommu case
this might lead to fewer coalesces, but if that's a problem it can
be fixed up later in the resource cursor code. For the iommu case,
the whole sg-table may still be coalesced to a single contiguous
device va region.
This avoids marking pages that we don't own dirty and accessed, and
it also avoids dereferencing struct pages that we don't own.
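For illustration, a minimal sketch of the pattern this relies on (not the
patch itself; includes, error paths and pfn-validity handling are trimmed,
and xe_npages_in_range() is the helper already present in xe_hmm.c):
chunks are counted under the notifier lock with a validated seqno, while
the actual sg_table allocation can then run outside the reclaim-tainted
lock.

static int sketch_count_chunks(struct hmm_range *range,
			       struct rw_semaphore *notifier_sem,
			       unsigned long *num_chunks)
{
	unsigned long i, npages, n = 0;
	int ret;

	/* The pfns may only be interpreted while the notifier lock is
	 * held and the notifier seqno is still valid. */
	ret = down_read_interruptible(notifier_sem);
	if (ret)
		return ret;
	if (mmu_interval_read_retry(range->notifier, range->notifier_seq)) {
		up_read(notifier_sem);
		return -EAGAIN;
	}
	npages = xe_npages_in_range(range->start, range->end);
	for (i = 0; i < npages; n++)
		i += 1UL << hmm_pfn_to_map_order(range->hmm_pfns[i]);
	up_read(notifier_sem);

	/* sg_alloc_table(st, n, GFP_KERNEL) may now be called outside the
	 * reclaim-tainted lock; the entries are filled in later, again
	 * under the lock and after re-validating the seqno. */
	*num_chunks = n;
	return 0;
}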
Fixes: 81e058a3e7fd ("drm/xe: Introduce helper to populate userptr")
Cc: Oak Zeng <oak.zeng@intel.com>
Cc: <stable@vger.kernel.org> # v6.10+
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
drivers/gpu/drm/xe/xe_hmm.c | 115 ++++++++++++++++++++++++++----------
1 file changed, 85 insertions(+), 30 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_hmm.c b/drivers/gpu/drm/xe/xe_hmm.c
index c56738fa713b..d3b5551496d0 100644
--- a/drivers/gpu/drm/xe/xe_hmm.c
+++ b/drivers/gpu/drm/xe/xe_hmm.c
@@ -42,6 +42,36 @@ static void xe_mark_range_accessed(struct hmm_range *range, bool write)
}
}
+static int xe_alloc_sg(struct sg_table *st, struct hmm_range *range,
+ struct rw_semaphore *notifier_sem)
+{
+ unsigned long i, npages, hmm_pfn;
+ unsigned long num_chunks = 0;
+ int ret;
+
+ /* HMM docs says this is needed. */
+ ret = down_read_interruptible(notifier_sem);
+ if (ret)
+ return ret;
+
+ if (mmu_interval_read_retry(range->notifier, range->notifier_seq))
+ return -EAGAIN;
+
+ npages = xe_npages_in_range(range->start, range->end);
+ for (i = 0; i < npages;) {
+ hmm_pfn = range->hmm_pfns[i];
+ if (!(hmm_pfn & HMM_PFN_VALID)) {
+ up_read(notifier_sem);
+ return -EFAULT;
+ }
+ num_chunks++;
+ i += 1UL << hmm_pfn_to_map_order(hmm_pfn);
+ }
+ up_read(notifier_sem);
+
+ return sg_alloc_table(st, num_chunks, GFP_KERNEL);
+}
+
/**
* xe_build_sg() - build a scatter gather table for all the physical pages/pfn
* in a hmm_range. dma-map pages if necessary. dma-address is save in sg table
@@ -50,6 +80,7 @@ static void xe_mark_range_accessed(struct hmm_range *range, bool write)
* @range: the hmm range that we build the sg table from. range->hmm_pfns[]
* has the pfn numbers of pages that back up this hmm address range.
* @st: pointer to the sg table.
+ * @notifier_sem: The xe notifier lock.
* @write: whether we write to this range. This decides dma map direction
* for system pages. If write we map it bi-diretional; otherwise
* DMA_TO_DEVICE
@@ -76,38 +107,33 @@ static void xe_mark_range_accessed(struct hmm_range *range, bool write)
* Returns 0 if successful; -ENOMEM if fails to allocate memory
*/
static int xe_build_sg(struct xe_device *xe, struct hmm_range *range,
- struct sg_table *st, bool write)
+ struct sg_table *st,
+ struct rw_semaphore *notifier_sem,
+ bool write)
{
struct device *dev = xe->drm.dev;
- struct page **pages;
- u64 i, npages;
- int ret;
-
- npages = xe_npages_in_range(range->start, range->end);
- pages = kvmalloc_array(npages, sizeof(*pages), GFP_KERNEL);
- if (!pages)
- return -ENOMEM;
-
- for (i = 0; i < npages; i++) {
- pages[i] = hmm_pfn_to_page(range->hmm_pfns[i]);
- xe_assert(xe, !is_device_private_page(pages[i]));
- }
-
- ret = sg_alloc_table_from_pages_segment(st, pages, npages, 0, npages << PAGE_SHIFT,
- xe_sg_segment_size(dev), GFP_KERNEL);
- if (ret)
- goto free_pages;
-
- ret = dma_map_sgtable(dev, st, write ? DMA_BIDIRECTIONAL : DMA_TO_DEVICE,
- DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_NO_KERNEL_MAPPING);
- if (ret) {
- sg_free_table(st);
- st = NULL;
+ unsigned long hmm_pfn, size;
+ struct scatterlist *sgl;
+ struct page *page;
+ unsigned long i, j;
+
+ lockdep_assert_held(notifier_sem);
+
+ i = 0;
+ for_each_sg(st->sgl, sgl, st->nents, j) {
+ hmm_pfn = range->hmm_pfns[i];
+ page = hmm_pfn_to_page(hmm_pfn);
+ xe_assert(xe, !is_device_private_page(page));
+ size = 1UL << hmm_pfn_to_map_order(hmm_pfn);
+ sg_set_page(sgl, page, size << PAGE_SHIFT, 0);
+ if (unlikely(j == st->nents - 1))
+ sg_mark_end(sgl);
+ i += size;
}
+ xe_assert(xe, i == xe_npages_in_range(range->start, range->end));
-free_pages:
- kvfree(pages);
- return ret;
+ return dma_map_sgtable(dev, st, write ? DMA_BIDIRECTIONAL : DMA_TO_DEVICE,
+ DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_NO_KERNEL_MAPPING);
}
/**
@@ -235,16 +261,45 @@ int xe_hmm_userptr_populate_range(struct xe_userptr_vma *uvma,
if (ret)
goto free_pfns;
- ret = xe_build_sg(vm->xe, &hmm_range, &userptr->sgt, write);
+ if (unlikely(userptr->sg)) {
+ ret = down_write_killable(&vm->userptr.notifier_lock);
+ if (ret)
+ goto free_pfns;
+
+ xe_hmm_userptr_free_sg(uvma);
+ up_write(&vm->userptr.notifier_lock);
+ }
+
+ ret = xe_alloc_sg(&userptr->sgt, &hmm_range, &vm->userptr.notifier_lock);
if (ret)
goto free_pfns;
+ ret = down_read_interruptible(&vm->userptr.notifier_lock);
+ if (ret)
+ goto free_st;
+
+ if (mmu_interval_read_retry(hmm_range.notifier, hmm_range.notifier_seq)) {
+ ret = -EAGAIN;
+ goto out_unlock;
+ }
+
+ ret = xe_build_sg(vm->xe, &hmm_range, &userptr->sgt,
+ &vm->userptr.notifier_lock, write);
+ if (ret)
+ goto out_unlock;
+
xe_mark_range_accessed(&hmm_range, write);
userptr->sg = &userptr->sgt;
userptr->notifier_seq = hmm_range.notifier_seq;
+ up_read(&vm->userptr.notifier_lock);
+ kvfree(pfns);
+ return 0;
+out_unlock:
+ up_read(&vm->userptr.notifier_lock);
+free_st:
+ sg_free_table(&userptr->sgt);
free_pfns:
kvfree(pfns);
return ret;
}
-
--
2.48.1
* [PATCH 3/3] drm/xe/userptr: Unmap userptrs in the mmu notifier
[not found] <20250228104418.44313-1-thomas.hellstrom@linux.intel.com>
2025-02-28 10:44 ` [PATCH 1/3] drm/xe/hmm: Style- and include fixes Thomas Hellström
2025-02-28 10:44 ` [PATCH 2/3] drm/xe/hmm: Don't dereference struct page pointers without notifier lock Thomas Hellström
@ 2025-02-28 10:44 ` Thomas Hellström
2 siblings, 0 replies; 8+ messages in thread
From: Thomas Hellström @ 2025-02-28 10:44 UTC (permalink / raw)
To: intel-xe; +Cc: Thomas Hellström, Oak Zeng, stable
If userptr pages are freed after a call to the xe mmu notifier,
the device will not be blocked out from theoretically accessing
these pages unless they are also unmapped from the iommu, and
this violates some aspects of the iommu-imposed security.
Ensure that userptrs are unmapped in the mmu notifier to
mitigate this. A naive attempt would try to free the sg table, but
the sg table itself may be accessed by a concurrent bind
operation, so settle for only unmapping.
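For illustration, a minimal sketch of the unmap-only idea (the actual
implementation follows in the diff below; unmap_mutex and mapped are the
new fields this patch introduces):

	/* Unmap for the device, but keep the sg table itself alive since
	 * a concurrent bind may still be walking it. */
	mutex_lock(&userptr->unmap_mutex);
	if (userptr->sg && userptr->mapped)
		dma_unmap_sgtable(xe->drm.dev, userptr->sg,
				  write ? DMA_BIDIRECTIONAL : DMA_TO_DEVICE, 0);
	userptr->mapped = false;
	mutex_unlock(&userptr->unmap_mutex);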
Fixes: 81e058a3e7fd ("drm/xe: Introduce helper to populate userptr")
Cc: Oak Zeng <oak.zeng@intel.com>
Cc: <stable@vger.kernel.org> # v6.10+
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
drivers/gpu/drm/xe/xe_hmm.c | 49 ++++++++++++++++++++++++++------
drivers/gpu/drm/xe/xe_hmm.h | 2 ++
drivers/gpu/drm/xe/xe_vm.c | 4 +++
drivers/gpu/drm/xe/xe_vm_types.h | 4 +++
4 files changed, 50 insertions(+), 9 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_hmm.c b/drivers/gpu/drm/xe/xe_hmm.c
index d3b5551496d0..35d257d4680f 100644
--- a/drivers/gpu/drm/xe/xe_hmm.c
+++ b/drivers/gpu/drm/xe/xe_hmm.c
@@ -136,6 +136,43 @@ static int xe_build_sg(struct xe_device *xe, struct hmm_range *range,
DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_NO_KERNEL_MAPPING);
}
+static void xe_hmm_userptr_set_mapped(struct xe_userptr_vma *uvma)
+{
+ struct xe_userptr *userptr = &uvma->userptr;
+ struct xe_vm *vm = xe_vma_vm(&uvma->vma);
+
+ lockdep_assert_held_write(&vm->lock);
+
+ mutex_lock(&userptr->unmap_mutex);
+ xe_assert(vm->xe, !userptr->mapped);
+ userptr->mapped = true;
+ mutex_unlock(&userptr->unmap_mutex);
+}
+
+void xe_hmm_userptr_unmap(struct xe_userptr_vma *uvma)
+{
+ struct xe_userptr *userptr = &uvma->userptr;
+ struct xe_vma *vma = &uvma->vma;
+ bool write = !xe_vma_read_only(vma);
+ struct xe_vm *vm = xe_vma_vm(vma);
+ struct xe_device *xe = vm->xe;
+
+ if (!lockdep_is_held_type(&vm->userptr.notifier_lock, 0) &&
+ !lockdep_is_held_type(&vm->lock, 0) &&
+ !(vma->gpuva.flags & XE_VMA_DESTROYED)) {
+ xe_vm_assert_held(vm);
+ lockdep_assert_held(&vm->userptr.notifier_lock);
+ lockdep_assert_held(&vm->lock);
+ }
+
+ mutex_lock(&userptr->unmap_mutex);
+ if (userptr->sg && userptr->mapped)
+ dma_unmap_sgtable(xe->drm.dev, userptr->sg,
+ write ? DMA_BIDIRECTIONAL : DMA_TO_DEVICE, 0);
+ userptr->mapped = false;
+ mutex_unlock(&userptr->unmap_mutex);
+}
+
/**
* xe_hmm_userptr_free_sg() - Free the scatter gather table of userptr
* @uvma: the userptr vma which hold the scatter gather table
@@ -147,16 +184,9 @@ static int xe_build_sg(struct xe_device *xe, struct hmm_range *range,
void xe_hmm_userptr_free_sg(struct xe_userptr_vma *uvma)
{
struct xe_userptr *userptr = &uvma->userptr;
- struct xe_vma *vma = &uvma->vma;
- bool write = !xe_vma_read_only(vma);
- struct xe_vm *vm = xe_vma_vm(vma);
- struct xe_device *xe = vm->xe;
- struct device *dev = xe->drm.dev;
-
- xe_assert(xe, userptr->sg);
- dma_unmap_sgtable(dev, userptr->sg,
- write ? DMA_BIDIRECTIONAL : DMA_TO_DEVICE, 0);
+ xe_assert(xe_vma_vm(&uvma->vma)->xe, userptr->sg);
+ xe_hmm_userptr_unmap(uvma);
sg_free_table(userptr->sg);
userptr->sg = NULL;
}
@@ -290,6 +320,7 @@ int xe_hmm_userptr_populate_range(struct xe_userptr_vma *uvma,
xe_mark_range_accessed(&hmm_range, write);
userptr->sg = &userptr->sgt;
+ xe_hmm_userptr_set_mapped(uvma);
userptr->notifier_seq = hmm_range.notifier_seq;
up_read(&vm->userptr.notifier_lock);
kvfree(pfns);
diff --git a/drivers/gpu/drm/xe/xe_hmm.h b/drivers/gpu/drm/xe/xe_hmm.h
index 9602cb7d976d..0ea98d8e7bbc 100644
--- a/drivers/gpu/drm/xe/xe_hmm.h
+++ b/drivers/gpu/drm/xe/xe_hmm.h
@@ -13,4 +13,6 @@ struct xe_userptr_vma;
int xe_hmm_userptr_populate_range(struct xe_userptr_vma *uvma, bool is_mm_mmap_locked);
void xe_hmm_userptr_free_sg(struct xe_userptr_vma *uvma);
+
+void xe_hmm_userptr_unmap(struct xe_userptr_vma *uvma);
#endif
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index dd422ac95dc0..3dbd3d38008a 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -621,6 +621,8 @@ static void __vma_userptr_invalidate(struct xe_vm *vm, struct xe_userptr_vma *uv
err = xe_vm_invalidate_vma(vma);
XE_WARN_ON(err);
}
+
+ xe_hmm_userptr_unmap(uvma);
}
static bool vma_userptr_invalidate(struct mmu_interval_notifier *mni,
@@ -1039,6 +1041,7 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
INIT_LIST_HEAD(&userptr->invalidate_link);
INIT_LIST_HEAD(&userptr->repin_link);
vma->gpuva.gem.offset = bo_offset_or_userptr;
+ mutex_init(&userptr->unmap_mutex);
err = mmu_interval_notifier_insert(&userptr->notifier,
current->mm,
@@ -1080,6 +1083,7 @@ static void xe_vma_destroy_late(struct xe_vma *vma)
* them anymore
*/
mmu_interval_notifier_remove(&userptr->notifier);
+ mutex_destroy(&userptr->unmap_mutex);
xe_vm_put(vm);
} else if (xe_vma_is_null(vma)) {
xe_vm_put(vm);
diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
index 1fe79bf23b6b..eca73c4197d4 100644
--- a/drivers/gpu/drm/xe/xe_vm_types.h
+++ b/drivers/gpu/drm/xe/xe_vm_types.h
@@ -59,12 +59,16 @@ struct xe_userptr {
struct sg_table *sg;
/** @notifier_seq: notifier sequence number */
unsigned long notifier_seq;
+ /** @unmap_mutex: Mutex protecting dma-unmapping */
+ struct mutex unmap_mutex;
/**
* @initial_bind: user pointer has been bound at least once.
* write: vm->userptr.notifier_lock in read mode and vm->resv held.
* read: vm->userptr.notifier_lock in write mode or vm->resv held.
*/
bool initial_bind;
+ /** @mapped: Whether the @sgt sg-table is dma-mapped. Protected by @unmap_mutex. */
+ bool mapped;
#if IS_ENABLED(CONFIG_DRM_XE_USERPTR_INVAL_INJECT)
u32 divisor;
#endif
--
2.48.1
* Re: [PATCH 2/3] drm/xe/hmm: Don't dereference struct page pointers without notifier lock
2025-02-28 10:44 ` [PATCH 2/3] drm/xe/hmm: Don't dereference struct page pointers without notifier lock Thomas Hellström
@ 2025-02-28 12:55 ` Matthew Auld
2025-02-28 13:08 ` Thomas Hellström
0 siblings, 1 reply; 8+ messages in thread
From: Matthew Auld @ 2025-02-28 12:55 UTC (permalink / raw)
To: Thomas Hellström, intel-xe; +Cc: Oak Zeng, stable
On 28/02/2025 10:44, Thomas Hellström wrote:
> The pnfs that we obtain from hmm_range_fault() point to pages that
> we don't have a reference on, and the guarantee that they are still
> in the cpu page-tables is that the notifier lock must be held and the
> notifier seqno is still valid.
>
> So while building the sg table and marking the pages accesses / dirty
> we need to hold this lock with a validated seqno.
>
> However, the lock is reclaim tainted which makes
> sg_alloc_table_from_pages_segment() unusable, since it internally
> allocates memory.
>
> Instead build the sg-table manually. For the non-iommu case
> this might lead to fewer coalesces, but if that's a problem it can
> be fixed up later in the resource cursor code. For the iommu case,
> the whole sg-table may still be coalesced to a single contigous
> device va region.
>
> This avoids marking pages that we don't own dirty and accessed, and
> it also avoid dereferencing struct pages that we don't own.
>
> Fixes: 81e058a3e7fd ("drm/xe: Introduce helper to populate userptr")
> Cc: Oak Zeng <oak.zeng@intel.com>
> Cc: <stable@vger.kernel.org> # v6.10+
> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> ---
> drivers/gpu/drm/xe/xe_hmm.c | 115 ++++++++++++++++++++++++++----------
> 1 file changed, 85 insertions(+), 30 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_hmm.c b/drivers/gpu/drm/xe/xe_hmm.c
> index c56738fa713b..d3b5551496d0 100644
> --- a/drivers/gpu/drm/xe/xe_hmm.c
> +++ b/drivers/gpu/drm/xe/xe_hmm.c
> @@ -42,6 +42,36 @@ static void xe_mark_range_accessed(struct hmm_range *range, bool write)
> }
> }
>
> +static int xe_alloc_sg(struct sg_table *st, struct hmm_range *range,
> + struct rw_semaphore *notifier_sem)
> +{
> + unsigned long i, npages, hmm_pfn;
> + unsigned long num_chunks = 0;
> + int ret;
> +
> + /* HMM docs says this is needed. */
> + ret = down_read_interruptible(notifier_sem);
> + if (ret)
> + return ret;
> +
> + if (mmu_interval_read_retry(range->notifier, range->notifier_seq))
> + return -EAGAIN;
> +
> + npages = xe_npages_in_range(range->start, range->end);
> + for (i = 0; i < npages;) {
> + hmm_pfn = range->hmm_pfns[i];
> + if (!(hmm_pfn & HMM_PFN_VALID)) {
Is this possible? The default_flags are always REQ_FAULT, so that should
ensure PFN_VALID, otherwise hmm_range_fault() would have returned an error?
From the docs:
"HMM_PFN_REQ_FAULT - The output must have HMM_PFN_VALID or
hmm_range_fault() will fail"
Should this be an assert?
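For context, a rough sketch of the faulting setup the question assumes
(the exact code lives in xe_hmm_userptr_populate_range(), not shown in
this hunk):

	unsigned long flags = HMM_PFN_REQ_FAULT;

	if (write)
		flags |= HMM_PFN_REQ_WRITE;
	hmm_range.default_flags = flags;
	/* With HMM_PFN_REQ_FAULT set, a successful hmm_range_fault()
	 * should leave HMM_PFN_VALID set in every returned entry, which
	 * is what makes an assert (rather than returning -EFAULT)
	 * plausible here. */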
Also probably dumb question, but why do we need to hold the notifier
lock over this loop? What is it protecting?
> + up_read(notifier_sem);
> + return -EFAULT;
> + }
> + num_chunks++;
> + i += 1UL << hmm_pfn_to_map_order(hmm_pfn);
> + }
> + up_read(notifier_sem);
> +
> + return sg_alloc_table(st, num_chunks, GFP_KERNEL);
> +}
> +
> /**
> * xe_build_sg() - build a scatter gather table for all the physical pages/pfn
> * in a hmm_range. dma-map pages if necessary. dma-address is save in sg table
> @@ -50,6 +80,7 @@ static void xe_mark_range_accessed(struct hmm_range *range, bool write)
> * @range: the hmm range that we build the sg table from. range->hmm_pfns[]
> * has the pfn numbers of pages that back up this hmm address range.
> * @st: pointer to the sg table.
> + * @notifier_sem: The xe notifier lock.
> * @write: whether we write to this range. This decides dma map direction
> * for system pages. If write we map it bi-diretional; otherwise
> * DMA_TO_DEVICE
> @@ -76,38 +107,33 @@ static void xe_mark_range_accessed(struct hmm_range *range, bool write)
> * Returns 0 if successful; -ENOMEM if fails to allocate memory
> */
> static int xe_build_sg(struct xe_device *xe, struct hmm_range *range,
> - struct sg_table *st, bool write)
> + struct sg_table *st,
> + struct rw_semaphore *notifier_sem,
> + bool write)
> {
> struct device *dev = xe->drm.dev;
> - struct page **pages;
> - u64 i, npages;
> - int ret;
> -
> - npages = xe_npages_in_range(range->start, range->end);
> - pages = kvmalloc_array(npages, sizeof(*pages), GFP_KERNEL);
> - if (!pages)
> - return -ENOMEM;
> -
> - for (i = 0; i < npages; i++) {
> - pages[i] = hmm_pfn_to_page(range->hmm_pfns[i]);
> - xe_assert(xe, !is_device_private_page(pages[i]));
> - }
> -
> - ret = sg_alloc_table_from_pages_segment(st, pages, npages, 0, npages << PAGE_SHIFT,
> - xe_sg_segment_size(dev), GFP_KERNEL);
> - if (ret)
> - goto free_pages;
> -
> - ret = dma_map_sgtable(dev, st, write ? DMA_BIDIRECTIONAL : DMA_TO_DEVICE,
> - DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_NO_KERNEL_MAPPING);
> - if (ret) {
> - sg_free_table(st);
> - st = NULL;
> + unsigned long hmm_pfn, size;
> + struct scatterlist *sgl;
> + struct page *page;
> + unsigned long i, j;
> +
> + lockdep_assert_held(notifier_sem);
> +
> + i = 0;
> + for_each_sg(st->sgl, sgl, st->nents, j) {
> + hmm_pfn = range->hmm_pfns[i];
> + page = hmm_pfn_to_page(hmm_pfn);
> + xe_assert(xe, !is_device_private_page(page));
> + size = 1UL << hmm_pfn_to_map_order(hmm_pfn);
> + sg_set_page(sgl, page, size << PAGE_SHIFT, 0);
> + if (unlikely(j == st->nents - 1))
> + sg_mark_end(sgl);
> + i += size;
> }
> + xe_assert(xe, i == xe_npages_in_range(range->start, range->end));
>
> -free_pages:
> - kvfree(pages);
> - return ret;
> + return dma_map_sgtable(dev, st, write ? DMA_BIDIRECTIONAL : DMA_TO_DEVICE,
> + DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_NO_KERNEL_MAPPING);
> }
>
> /**
> @@ -235,16 +261,45 @@ int xe_hmm_userptr_populate_range(struct xe_userptr_vma *uvma,
> if (ret)
> goto free_pfns;
>
> - ret = xe_build_sg(vm->xe, &hmm_range, &userptr->sgt, write);
> + if (unlikely(userptr->sg)) {
> + ret = down_write_killable(&vm->userptr.notifier_lock);
> + if (ret)
> + goto free_pfns;
> +
> + xe_hmm_userptr_free_sg(uvma);
> + up_write(&vm->userptr.notifier_lock);
> + }
> +
> + ret = xe_alloc_sg(&userptr->sgt, &hmm_range, &vm->userptr.notifier_lock);
> if (ret)
> goto free_pfns;
>
> + ret = down_read_interruptible(&vm->userptr.notifier_lock);
> + if (ret)
> + goto free_st;
> +
> + if (mmu_interval_read_retry(hmm_range.notifier, hmm_range.notifier_seq)) {
> + ret = -EAGAIN;
> + goto out_unlock;
> + }
> +
> + ret = xe_build_sg(vm->xe, &hmm_range, &userptr->sgt,
> + &vm->userptr.notifier_lock, write);
> + if (ret)
> + goto out_unlock;
> +
> xe_mark_range_accessed(&hmm_range, write);
> userptr->sg = &userptr->sgt;
> userptr->notifier_seq = hmm_range.notifier_seq;
> + up_read(&vm->userptr.notifier_lock);
> + kvfree(pfns);
> + return 0;
>
> +out_unlock:
> + up_read(&vm->userptr.notifier_lock);
> +free_st:
> + sg_free_table(&userptr->sgt);
> free_pfns:
> kvfree(pfns);
> return ret;
> }
> -
* Re: [PATCH 1/3] drm/xe/hmm: Style- and include fixes
2025-02-28 10:44 ` [PATCH 1/3] drm/xe/hmm: Style- and include fixes Thomas Hellström
@ 2025-02-28 12:56 ` Matthew Auld
0 siblings, 0 replies; 8+ messages in thread
From: Matthew Auld @ 2025-02-28 12:56 UTC (permalink / raw)
To: Thomas Hellström, intel-xe; +Cc: Oak Zeng, stable
On 28/02/2025 10:44, Thomas Hellström wrote:
> Add proper #ifndef around the xe_hmm.h header, proper spacing
> and since the documentation mostly follows kerneldoc format,
> make it kerneldoc. Also prepare for upcoming -stable fixes.
>
> Fixes: 81e058a3e7fd ("drm/xe: Introduce helper to populate userptr")
> Cc: Oak Zeng <oak.zeng@intel.com>
> Cc: <stable@vger.kernel.org> # v6.10+
> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
* Re: [PATCH 2/3] drm/xe/hmm: Don't dereference struct page pointers without notifier lock
2025-02-28 12:55 ` Matthew Auld
@ 2025-02-28 13:08 ` Thomas Hellström
2025-02-28 18:32 ` Matthew Auld
0 siblings, 1 reply; 8+ messages in thread
From: Thomas Hellström @ 2025-02-28 13:08 UTC (permalink / raw)
To: Matthew Auld, intel-xe; +Cc: Oak Zeng, stable
On Fri, 2025-02-28 at 12:55 +0000, Matthew Auld wrote:
> On 28/02/2025 10:44, Thomas Hellström wrote:
> > The pnfs that we obtain from hmm_range_fault() point to pages that
> > we don't have a reference on, and the guarantee that they are still
> > in the cpu page-tables is that the notifier lock must be held and
> > the
> > notifier seqno is still valid.
> >
> > So while building the sg table and marking the pages accesses /
> > dirty
> > we need to hold this lock with a validated seqno.
> >
> > However, the lock is reclaim tainted which makes
> > sg_alloc_table_from_pages_segment() unusable, since it internally
> > allocates memory.
> >
> > Instead build the sg-table manually. For the non-iommu case
> > this might lead to fewer coalesces, but if that's a problem it can
> > be fixed up later in the resource cursor code. For the iommu case,
> > the whole sg-table may still be coalesced to a single contigous
> > device va region.
> >
> > This avoids marking pages that we don't own dirty and accessed, and
> > it also avoid dereferencing struct pages that we don't own.
> >
> > Fixes: 81e058a3e7fd ("drm/xe: Introduce helper to populate
> > userptr")
> > Cc: Oak Zeng <oak.zeng@intel.com>
> > Cc: <stable@vger.kernel.org> # v6.10+
> > Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > ---
> > drivers/gpu/drm/xe/xe_hmm.c | 115 ++++++++++++++++++++++++++-----
> > -----
> > 1 file changed, 85 insertions(+), 30 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_hmm.c
> > b/drivers/gpu/drm/xe/xe_hmm.c
> > index c56738fa713b..d3b5551496d0 100644
> > --- a/drivers/gpu/drm/xe/xe_hmm.c
> > +++ b/drivers/gpu/drm/xe/xe_hmm.c
> > @@ -42,6 +42,36 @@ static void xe_mark_range_accessed(struct
> > hmm_range *range, bool write)
> > }
> > }
> >
> > +static int xe_alloc_sg(struct sg_table *st, struct hmm_range
> > *range,
> > + struct rw_semaphore *notifier_sem)
> > +{
> > + unsigned long i, npages, hmm_pfn;
> > + unsigned long num_chunks = 0;
> > + int ret;
> > +
> > + /* HMM docs says this is needed. */
> > + ret = down_read_interruptible(notifier_sem);
> > + if (ret)
> > + return ret;
> > +
> > + if (mmu_interval_read_retry(range->notifier, range-
> > >notifier_seq))
> > + return -EAGAIN;
> > +
> > + npages = xe_npages_in_range(range->start, range->end);
> > + for (i = 0; i < npages;) {
> > + hmm_pfn = range->hmm_pfns[i];
> > + if (!(hmm_pfn & HMM_PFN_VALID)) {
>
> Is this possible? The default_flags are always REQ_FAULT, so that
> should
> ensure PFN_VALID, or the hmm_fault would have returned an error?
>
> From the docs:
>
> "HMM_PFN_REQ_FAULT - The output must have HMM_PFN_VALID or
> hmm_range_fault() will fail"
>
> Should this be an assert?
>
> Also probably dumb question, but why do we need to hold the notifier
> lock over this loop? What is it protecting?
Docs for hmm_pfn_to_map_order():
/*
* This must be called under the caller 'user_lock' after a successful
* mmu_interval_read_begin(). The caller must have tested for
HMM_PFN_VALID
* already.
*/
I'm fine with changing to an assert, and I agree that the lock is
pointless here: we're operating on thread-local data. But I also think
that not adhering to the doc requirements might cause problems in the
future, for example if the map order encoding were dropped and the
order was grabbed from the underlying page.
/Thomas
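A minimal sketch of the assert variant being discussed, for illustration
only and not the final code (fragment of the xe_alloc_sg() loop; assumes
the xe device is passed in so xe_assert() can be used):

	npages = xe_npages_in_range(range->start, range->end);
	for (i = 0; i < npages;) {
		hmm_pfn = range->hmm_pfns[i];
		/* HMM_PFN_REQ_FAULT means a successful fault implies VALID. */
		xe_assert(xe, hmm_pfn & HMM_PFN_VALID);
		num_chunks++;
		i += 1UL << hmm_pfn_to_map_order(hmm_pfn);
	}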
>
> > + up_read(notifier_sem);
> > + return -EFAULT;
> > + }
> > + num_chunks++;
> > + i += 1UL << hmm_pfn_to_map_order(hmm_pfn);
> > + }
> > + up_read(notifier_sem);
> > +
> > + return sg_alloc_table(st, num_chunks, GFP_KERNEL);
> > +}
> > +
> > /**
> > * xe_build_sg() - build a scatter gather table for all the
> > physical pages/pfn
> > * in a hmm_range. dma-map pages if necessary. dma-address is
> > save in sg table
> > @@ -50,6 +80,7 @@ static void xe_mark_range_accessed(struct
> > hmm_range *range, bool write)
> > * @range: the hmm range that we build the sg table from. range-
> > >hmm_pfns[]
> > * has the pfn numbers of pages that back up this hmm address
> > range.
> > * @st: pointer to the sg table.
> > + * @notifier_sem: The xe notifier lock.
> > * @write: whether we write to this range. This decides dma map
> > direction
> > * for system pages. If write we map it bi-diretional; otherwise
> > * DMA_TO_DEVICE
> > @@ -76,38 +107,33 @@ static void xe_mark_range_accessed(struct
> > hmm_range *range, bool write)
> > * Returns 0 if successful; -ENOMEM if fails to allocate memory
> > */
> > static int xe_build_sg(struct xe_device *xe, struct hmm_range
> > *range,
> > - struct sg_table *st, bool write)
> > + struct sg_table *st,
> > + struct rw_semaphore *notifier_sem,
> > + bool write)
> > {
> > struct device *dev = xe->drm.dev;
> > - struct page **pages;
> > - u64 i, npages;
> > - int ret;
> > -
> > - npages = xe_npages_in_range(range->start, range->end);
> > - pages = kvmalloc_array(npages, sizeof(*pages),
> > GFP_KERNEL);
> > - if (!pages)
> > - return -ENOMEM;
> > -
> > - for (i = 0; i < npages; i++) {
> > - pages[i] = hmm_pfn_to_page(range->hmm_pfns[i]);
> > - xe_assert(xe, !is_device_private_page(pages[i]));
> > - }
> > -
> > - ret = sg_alloc_table_from_pages_segment(st, pages, npages,
> > 0, npages << PAGE_SHIFT,
> > -
> > xe_sg_segment_size(dev), GFP_KERNEL);
> > - if (ret)
> > - goto free_pages;
> > -
> > - ret = dma_map_sgtable(dev, st, write ? DMA_BIDIRECTIONAL :
> > DMA_TO_DEVICE,
> > - DMA_ATTR_SKIP_CPU_SYNC |
> > DMA_ATTR_NO_KERNEL_MAPPING);
> > - if (ret) {
> > - sg_free_table(st);
> > - st = NULL;
> > + unsigned long hmm_pfn, size;
> > + struct scatterlist *sgl;
> > + struct page *page;
> > + unsigned long i, j;
> > +
> > + lockdep_assert_held(notifier_sem);
> > +
> > + i = 0;
> > + for_each_sg(st->sgl, sgl, st->nents, j) {
> > + hmm_pfn = range->hmm_pfns[i];
> > + page = hmm_pfn_to_page(hmm_pfn);
> > + xe_assert(xe, !is_device_private_page(page));
> > + size = 1UL << hmm_pfn_to_map_order(hmm_pfn);
> > + sg_set_page(sgl, page, size << PAGE_SHIFT, 0);
> > + if (unlikely(j == st->nents - 1))
> > + sg_mark_end(sgl);
> > + i += size;
> > }
> > + xe_assert(xe, i == xe_npages_in_range(range->start, range-
> > >end));
> >
> > -free_pages:
> > - kvfree(pages);
> > - return ret;
> > + return dma_map_sgtable(dev, st, write ? DMA_BIDIRECTIONAL
> > : DMA_TO_DEVICE,
> > + DMA_ATTR_SKIP_CPU_SYNC |
> > DMA_ATTR_NO_KERNEL_MAPPING);
> > }
> >
> > /**
> > @@ -235,16 +261,45 @@ int xe_hmm_userptr_populate_range(struct
> > xe_userptr_vma *uvma,
> > if (ret)
> > goto free_pfns;
> >
> > - ret = xe_build_sg(vm->xe, &hmm_range, &userptr->sgt,
> > write);
> > + if (unlikely(userptr->sg)) {
> > + ret = down_write_killable(&vm-
> > >userptr.notifier_lock);
> > + if (ret)
> > + goto free_pfns;
> > +
> > + xe_hmm_userptr_free_sg(uvma);
> > + up_write(&vm->userptr.notifier_lock);
> > + }
> > +
> > + ret = xe_alloc_sg(&userptr->sgt, &hmm_range, &vm-
> > >userptr.notifier_lock);
> > if (ret)
> > goto free_pfns;
> >
> > + ret = down_read_interruptible(&vm->userptr.notifier_lock);
> > + if (ret)
> > + goto free_st;
> > +
> > + if (mmu_interval_read_retry(hmm_range.notifier,
> > hmm_range.notifier_seq)) {
> > + ret = -EAGAIN;
> > + goto out_unlock;
> > + }
> > +
> > + ret = xe_build_sg(vm->xe, &hmm_range, &userptr->sgt,
> > + &vm->userptr.notifier_lock, write);
> > + if (ret)
> > + goto out_unlock;
> > +
> > xe_mark_range_accessed(&hmm_range, write);
> > userptr->sg = &userptr->sgt;
> > userptr->notifier_seq = hmm_range.notifier_seq;
> > + up_read(&vm->userptr.notifier_lock);
> > + kvfree(pfns);
> > + return 0;
> >
> > +out_unlock:
> > + up_read(&vm->userptr.notifier_lock);
> > +free_st:
> > + sg_free_table(&userptr->sgt);
> > free_pfns:
> > kvfree(pfns);
> > return ret;
> > }
> > -
>
* Re: [PATCH 2/3] drm/xe/hmm: Don't dereference struct page pointers without notifier lock
2025-02-28 13:08 ` Thomas Hellström
@ 2025-02-28 18:32 ` Matthew Auld
2025-03-04 11:28 ` Thomas Hellström
0 siblings, 1 reply; 8+ messages in thread
From: Matthew Auld @ 2025-02-28 18:32 UTC (permalink / raw)
To: Thomas Hellström, intel-xe; +Cc: Oak Zeng, stable
On 28/02/2025 13:08, Thomas Hellström wrote:
> On Fri, 2025-02-28 at 12:55 +0000, Matthew Auld wrote:
>> On 28/02/2025 10:44, Thomas Hellström wrote:
>>> The pnfs that we obtain from hmm_range_fault() point to pages that
>>> we don't have a reference on, and the guarantee that they are still
>>> in the cpu page-tables is that the notifier lock must be held and
>>> the
>>> notifier seqno is still valid.
>>>
>>> So while building the sg table and marking the pages accesses /
>>> dirty
>>> we need to hold this lock with a validated seqno.
>>>
>>> However, the lock is reclaim tainted which makes
>>> sg_alloc_table_from_pages_segment() unusable, since it internally
>>> allocates memory.
>>>
>>> Instead build the sg-table manually. For the non-iommu case
>>> this might lead to fewer coalesces, but if that's a problem it can
>>> be fixed up later in the resource cursor code. For the iommu case,
>>> the whole sg-table may still be coalesced to a single contigous
>>> device va region.
>>>
>>> This avoids marking pages that we don't own dirty and accessed, and
>>> it also avoid dereferencing struct pages that we don't own.
>>>
>>> Fixes: 81e058a3e7fd ("drm/xe: Introduce helper to populate
>>> userptr")
>>> Cc: Oak Zeng <oak.zeng@intel.com>
>>> Cc: <stable@vger.kernel.org> # v6.10+
>>> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>>> ---
>>> drivers/gpu/drm/xe/xe_hmm.c | 115 ++++++++++++++++++++++++++-----
>>> -----
>>> 1 file changed, 85 insertions(+), 30 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/xe/xe_hmm.c
>>> b/drivers/gpu/drm/xe/xe_hmm.c
>>> index c56738fa713b..d3b5551496d0 100644
>>> --- a/drivers/gpu/drm/xe/xe_hmm.c
>>> +++ b/drivers/gpu/drm/xe/xe_hmm.c
>>> @@ -42,6 +42,36 @@ static void xe_mark_range_accessed(struct
>>> hmm_range *range, bool write)
>>> }
>>> }
>>>
>>> +static int xe_alloc_sg(struct sg_table *st, struct hmm_range
>>> *range,
>>> + struct rw_semaphore *notifier_sem)
>>> +{
>>> + unsigned long i, npages, hmm_pfn;
>>> + unsigned long num_chunks = 0;
>>> + int ret;
>>> +
>>> + /* HMM docs says this is needed. */
>>> + ret = down_read_interruptible(notifier_sem);
>>> + if (ret)
>>> + return ret;
>>> +
>>> + if (mmu_interval_read_retry(range->notifier, range-
>>>> notifier_seq))
>>> + return -EAGAIN;
>>> +
>>> + npages = xe_npages_in_range(range->start, range->end);
>>> + for (i = 0; i < npages;) {
>>> + hmm_pfn = range->hmm_pfns[i];
>>> + if (!(hmm_pfn & HMM_PFN_VALID)) {
>>
>> Is this possible? The default_flags are always REQ_FAULT, so that
>> should
>> ensure PFN_VALID, or the hmm_fault would have returned an error?
>>
>> From the docs:
>>
>> "HMM_PFN_REQ_FAULT - The output must have HMM_PFN_VALID or
>> hmm_range_fault() will fail"
>>
>> Should this be an assert?
>>
>> Also probably dumb question, but why do we need to hold the notifier
>> lock over this loop? What is it protecting?
>
> Docs for hmm_pfn_to_map_order():
>
> /*
> * This must be called under the caller 'user_lock' after a successful
> * mmu_interval_read_begin(). The caller must have tested for
> HMM_PFN_VALID
> * already.
> */
>
> I'm fine with changing to an assert, and I agree that the lock is
> pointless: We're operating on thread local data, but I also think that
> not adhering to the doc requirements might cause problems in the
> future. Like if the map order encoding is dropped and the order was
> grabbed from the underlying page.
Makes sense. Keeping the lock indeed looks sensible.
Staring some more at hmm_pfn_to_map_order(), it also says:
"The caller must be careful with edge cases as the start and end VA of
the given page may extend past the range used with hmm_range_fault()."
I take that to mean it will still return a huge page order even if there
is, say, a 2M PTE but the hmm_range only covers a small sub-range of
pfns within the huge page, like say 8K, where the first 4K page sits
exactly at the end of the 2M page. Are there some potentially nasty
surprises with stuff like:
i += 1UL << hmm_pfn_to_map_order(hmm_pfn);
Since the increment here (512) could go past even npages (2), and then
num_chunks is too small (1)?
>
> /Thomas
>
>
>>
>>> + up_read(notifier_sem);
>>> + return -EFAULT;
>>> + }
>>> + num_chunks++;
>>> + i += 1UL << hmm_pfn_to_map_order(hmm_pfn);
>>> + }
>>> + up_read(notifier_sem);
>>> +
>>> + return sg_alloc_table(st, num_chunks, GFP_KERNEL);
>>> +}
>>> +
>>> /**
>>> * xe_build_sg() - build a scatter gather table for all the
>>> physical pages/pfn
>>> * in a hmm_range. dma-map pages if necessary. dma-address is
>>> save in sg table
>>> @@ -50,6 +80,7 @@ static void xe_mark_range_accessed(struct
>>> hmm_range *range, bool write)
>>> * @range: the hmm range that we build the sg table from. range-
>>>> hmm_pfns[]
>>> * has the pfn numbers of pages that back up this hmm address
>>> range.
>>> * @st: pointer to the sg table.
>>> + * @notifier_sem: The xe notifier lock.
>>> * @write: whether we write to this range. This decides dma map
>>> direction
>>> * for system pages. If write we map it bi-diretional; otherwise
>>> * DMA_TO_DEVICE
>>> @@ -76,38 +107,33 @@ static void xe_mark_range_accessed(struct
>>> hmm_range *range, bool write)
>>> * Returns 0 if successful; -ENOMEM if fails to allocate memory
>>> */
>>> static int xe_build_sg(struct xe_device *xe, struct hmm_range
>>> *range,
>>> - struct sg_table *st, bool write)
>>> + struct sg_table *st,
>>> + struct rw_semaphore *notifier_sem,
>>> + bool write)
>>> {
>>> struct device *dev = xe->drm.dev;
>>> - struct page **pages;
>>> - u64 i, npages;
>>> - int ret;
>>> -
>>> - npages = xe_npages_in_range(range->start, range->end);
>>> - pages = kvmalloc_array(npages, sizeof(*pages),
>>> GFP_KERNEL);
>>> - if (!pages)
>>> - return -ENOMEM;
>>> -
>>> - for (i = 0; i < npages; i++) {
>>> - pages[i] = hmm_pfn_to_page(range->hmm_pfns[i]);
>>> - xe_assert(xe, !is_device_private_page(pages[i]));
>>> - }
>>> -
>>> - ret = sg_alloc_table_from_pages_segment(st, pages, npages,
>>> 0, npages << PAGE_SHIFT,
>>> -
>>> xe_sg_segment_size(dev), GFP_KERNEL);
>>> - if (ret)
>>> - goto free_pages;
>>> -
>>> - ret = dma_map_sgtable(dev, st, write ? DMA_BIDIRECTIONAL :
>>> DMA_TO_DEVICE,
>>> - DMA_ATTR_SKIP_CPU_SYNC |
>>> DMA_ATTR_NO_KERNEL_MAPPING);
>>> - if (ret) {
>>> - sg_free_table(st);
>>> - st = NULL;
>>> + unsigned long hmm_pfn, size;
>>> + struct scatterlist *sgl;
>>> + struct page *page;
>>> + unsigned long i, j;
>>> +
>>> + lockdep_assert_held(notifier_sem);
>>> +
>>> + i = 0;
>>> + for_each_sg(st->sgl, sgl, st->nents, j) {
>>> + hmm_pfn = range->hmm_pfns[i];
>>> + page = hmm_pfn_to_page(hmm_pfn);
>>> + xe_assert(xe, !is_device_private_page(page));
>>> + size = 1UL << hmm_pfn_to_map_order(hmm_pfn);
>>> + sg_set_page(sgl, page, size << PAGE_SHIFT, 0);
>>> + if (unlikely(j == st->nents - 1))
>>> + sg_mark_end(sgl);
>>> + i += size;
>>> }
>>> + xe_assert(xe, i == xe_npages_in_range(range->start, range-
>>>> end));
>>>
>>> -free_pages:
>>> - kvfree(pages);
>>> - return ret;
>>> + return dma_map_sgtable(dev, st, write ? DMA_BIDIRECTIONAL
>>> : DMA_TO_DEVICE,
>>> + DMA_ATTR_SKIP_CPU_SYNC |
>>> DMA_ATTR_NO_KERNEL_MAPPING);
>>> }
>>>
>>> /**
>>> @@ -235,16 +261,45 @@ int xe_hmm_userptr_populate_range(struct
>>> xe_userptr_vma *uvma,
>>> if (ret)
>>> goto free_pfns;
>>>
>>> - ret = xe_build_sg(vm->xe, &hmm_range, &userptr->sgt,
>>> write);
>>> + if (unlikely(userptr->sg)) {
>>> + ret = down_write_killable(&vm-
>>>> userptr.notifier_lock);
>>> + if (ret)
>>> + goto free_pfns;
>>> +
>>> + xe_hmm_userptr_free_sg(uvma);
>>> + up_write(&vm->userptr.notifier_lock);
>>> + }
>>> +
>>> + ret = xe_alloc_sg(&userptr->sgt, &hmm_range, &vm-
>>>> userptr.notifier_lock);
>>> if (ret)
>>> goto free_pfns;
>>>
>>> + ret = down_read_interruptible(&vm->userptr.notifier_lock);
>>> + if (ret)
>>> + goto free_st;
>>> +
>>> + if (mmu_interval_read_retry(hmm_range.notifier,
>>> hmm_range.notifier_seq)) {
>>> + ret = -EAGAIN;
>>> + goto out_unlock;
>>> + }
>>> +
>>> + ret = xe_build_sg(vm->xe, &hmm_range, &userptr->sgt,
>>> + &vm->userptr.notifier_lock, write);
>>> + if (ret)
>>> + goto out_unlock;
>>> +
>>> xe_mark_range_accessed(&hmm_range, write);
>>> userptr->sg = &userptr->sgt;
>>> userptr->notifier_seq = hmm_range.notifier_seq;
>>> + up_read(&vm->userptr.notifier_lock);
>>> + kvfree(pfns);
>>> + return 0;
>>>
>>> +out_unlock:
>>> + up_read(&vm->userptr.notifier_lock);
>>> +free_st:
>>> + sg_free_table(&userptr->sgt);
>>> free_pfns:
>>> kvfree(pfns);
>>> return ret;
>>> }
>>> -
>>
>
* Re: [PATCH 2/3] drm/xe/hmm: Don't dereference struct page pointers without notifier lock
2025-02-28 18:32 ` Matthew Auld
@ 2025-03-04 11:28 ` Thomas Hellström
0 siblings, 0 replies; 8+ messages in thread
From: Thomas Hellström @ 2025-03-04 11:28 UTC (permalink / raw)
To: Matthew Auld, intel-xe; +Cc: Oak Zeng, stable
On Fri, 2025-02-28 at 18:32 +0000, Matthew Auld wrote:
> On 28/02/2025 13:08, Thomas Hellström wrote:
> > On Fri, 2025-02-28 at 12:55 +0000, Matthew Auld wrote:
> > > On 28/02/2025 10:44, Thomas Hellström wrote:
> > > > The pnfs that we obtain from hmm_range_fault() point to pages
> > > > that
> > > > we don't have a reference on, and the guarantee that they are
> > > > still
> > > > in the cpu page-tables is that the notifier lock must be held
> > > > and
> > > > the
> > > > notifier seqno is still valid.
> > > >
> > > > So while building the sg table and marking the pages accesses /
> > > > dirty
> > > > we need to hold this lock with a validated seqno.
> > > >
> > > > However, the lock is reclaim tainted which makes
> > > > sg_alloc_table_from_pages_segment() unusable, since it
> > > > internally
> > > > allocates memory.
> > > >
> > > > Instead build the sg-table manually. For the non-iommu case
> > > > this might lead to fewer coalesces, but if that's a problem it
> > > > can
> > > > be fixed up later in the resource cursor code. For the iommu
> > > > case,
> > > > the whole sg-table may still be coalesced to a single contigous
> > > > device va region.
> > > >
> > > > This avoids marking pages that we don't own dirty and accessed,
> > > > and
> > > > it also avoid dereferencing struct pages that we don't own.
> > > >
> > > > Fixes: 81e058a3e7fd ("drm/xe: Introduce helper to populate
> > > > userptr")
> > > > Cc: Oak Zeng <oak.zeng@intel.com>
> > > > Cc: <stable@vger.kernel.org> # v6.10+
> > > > Signed-off-by: Thomas Hellström
> > > > <thomas.hellstrom@linux.intel.com>
> > > > ---
> > > > drivers/gpu/drm/xe/xe_hmm.c | 115
> > > > ++++++++++++++++++++++++++-----
> > > > -----
> > > > 1 file changed, 85 insertions(+), 30 deletions(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/xe/xe_hmm.c
> > > > b/drivers/gpu/drm/xe/xe_hmm.c
> > > > index c56738fa713b..d3b5551496d0 100644
> > > > --- a/drivers/gpu/drm/xe/xe_hmm.c
> > > > +++ b/drivers/gpu/drm/xe/xe_hmm.c
> > > > @@ -42,6 +42,36 @@ static void xe_mark_range_accessed(struct
> > > > hmm_range *range, bool write)
> > > > }
> > > > }
> > > >
> > > > +static int xe_alloc_sg(struct sg_table *st, struct hmm_range
> > > > *range,
> > > > + struct rw_semaphore *notifier_sem)
> > > > +{
> > > > + unsigned long i, npages, hmm_pfn;
> > > > + unsigned long num_chunks = 0;
> > > > + int ret;
> > > > +
> > > > + /* HMM docs says this is needed. */
> > > > + ret = down_read_interruptible(notifier_sem);
> > > > + if (ret)
> > > > + return ret;
> > > > +
> > > > + if (mmu_interval_read_retry(range->notifier, range-
> > > > > notifier_seq))
> > > > + return -EAGAIN;
> > > > +
> > > > + npages = xe_npages_in_range(range->start, range->end);
> > > > + for (i = 0; i < npages;) {
> > > > + hmm_pfn = range->hmm_pfns[i];
> > > > + if (!(hmm_pfn & HMM_PFN_VALID)) {
> > >
> > > Is this possible? The default_flags are always REQ_FAULT, so that
> > > should
> > > ensure PFN_VALID, or the hmm_fault would have returned an error?
> > >
> > > From the docs:
> > >
> > > "HMM_PFN_REQ_FAULT - The output must have HMM_PFN_VALID or
> > > hmm_range_fault() will fail"
> > >
> > > Should this be an assert?
> > >
> > > Also probably dumb question, but why do we need to hold the
> > > notifier
> > > lock over this loop? What is it protecting?
> >
> > Docs for hmm_pfn_to_map_order():
> >
> > /*
> > * This must be called under the caller 'user_lock' after a
> > successful
> > * mmu_interval_read_begin(). The caller must have tested for
> > HMM_PFN_VALID
> > * already.
> > */
> >
> > I'm fine with changing to an assert, and I agree that the lock is
> > pointless: We're operating on thread local data, but I also think
> > that
> > not adhering to the doc requirements might cause problems in the
> > future. Like if the map order encoding is dropped and the order was
> > grabbed from the underlying page.
>
> Makes sense. Keeping the lock indeed looks sensible.
>
> Staring some more at hmm_pfn_to_map_order(), it also says:
>
> "The caller must be careful with edge cases as the start and end VA
> of
> the given page may extend past the range used with
> hmm_range_fault()."
>
> I take that to mean it will still return a huge page order even if
> there
> is say some 2M PTE, but the hmm_range is just some small sub range of
> pfn covering part of the huge page, like say 8K, where the first 4K
> page
> is exactly at the end of the 2M page. Are there some potentially
> nasty
> surprises with stuff like:
>
> i += 1UL << hmm_pfn_to_map_order(hmm_pfn);
>
> Since the increment here (512) could go past even npages (2), and
> then
> num_chunks is too small (1)?
Yes, you're right. Will update the patch to reflect this.
/Thomas
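One possible way to handle that edge case, sketched for illustration
(this assumes hmm_pfn & ~HMM_PFN_FLAGS yields the raw pfn, so its low
bits give the offset into the order-aligned block; whether the next
revision does exactly this is not implied):

	len = 1UL << hmm_pfn_to_map_order(hmm_pfn);
	/* A higher-order mapping may start before range->start and/or
	 * extend past range->end; only count the pfns the faulted range
	 * actually covers. */
	len -= (hmm_pfn & ~HMM_PFN_FLAGS) & (len - 1);
	len = min(len, npages - i);
	i += len;
	num_chunks++;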
>
> >
> > /Thomas
> >
> >
> > >
> > > > + up_read(notifier_sem);
> > > > + return -EFAULT;
> > > > + }
> > > > + num_chunks++;
> > > > + i += 1UL << hmm_pfn_to_map_order(hmm_pfn);
> > > > + }
> > > > + up_read(notifier_sem);
> > > > +
> > > > + return sg_alloc_table(st, num_chunks, GFP_KERNEL);
> > > > +}
> > > > +
> > > > /**
> > > > * xe_build_sg() - build a scatter gather table for all the
> > > > physical pages/pfn
> > > > * in a hmm_range. dma-map pages if necessary. dma-address
> > > > is
> > > > save in sg table
> > > > @@ -50,6 +80,7 @@ static void xe_mark_range_accessed(struct
> > > > hmm_range *range, bool write)
> > > > * @range: the hmm range that we build the sg table from.
> > > > range-
> > > > > hmm_pfns[]
> > > > * has the pfn numbers of pages that back up this hmm
> > > > address
> > > > range.
> > > > * @st: pointer to the sg table.
> > > > + * @notifier_sem: The xe notifier lock.
> > > > * @write: whether we write to this range. This decides dma
> > > > map
> > > > direction
> > > > * for system pages. If write we map it bi-diretional;
> > > > otherwise
> > > > * DMA_TO_DEVICE
> > > > @@ -76,38 +107,33 @@ static void xe_mark_range_accessed(struct
> > > > hmm_range *range, bool write)
> > > > * Returns 0 if successful; -ENOMEM if fails to allocate
> > > > memory
> > > > */
> > > > static int xe_build_sg(struct xe_device *xe, struct
> > > > hmm_range
> > > > *range,
> > > > - struct sg_table *st, bool write)
> > > > + struct sg_table *st,
> > > > + struct rw_semaphore *notifier_sem,
> > > > + bool write)
> > > > {
> > > > struct device *dev = xe->drm.dev;
> > > > - struct page **pages;
> > > > - u64 i, npages;
> > > > - int ret;
> > > > -
> > > > - npages = xe_npages_in_range(range->start, range->end);
> > > > - pages = kvmalloc_array(npages, sizeof(*pages),
> > > > GFP_KERNEL);
> > > > - if (!pages)
> > > > - return -ENOMEM;
> > > > -
> > > > - for (i = 0; i < npages; i++) {
> > > > - pages[i] = hmm_pfn_to_page(range-
> > > > >hmm_pfns[i]);
> > > > - xe_assert(xe,
> > > > !is_device_private_page(pages[i]));
> > > > - }
> > > > -
> > > > - ret = sg_alloc_table_from_pages_segment(st, pages,
> > > > npages,
> > > > 0, npages << PAGE_SHIFT,
> > > > -
> > > > xe_sg_segment_
> > > > size(dev), GFP_KERNEL);
> > > > - if (ret)
> > > > - goto free_pages;
> > > > -
> > > > - ret = dma_map_sgtable(dev, st, write ?
> > > > DMA_BIDIRECTIONAL :
> > > > DMA_TO_DEVICE,
> > > > - DMA_ATTR_SKIP_CPU_SYNC |
> > > > DMA_ATTR_NO_KERNEL_MAPPING);
> > > > - if (ret) {
> > > > - sg_free_table(st);
> > > > - st = NULL;
> > > > + unsigned long hmm_pfn, size;
> > > > + struct scatterlist *sgl;
> > > > + struct page *page;
> > > > + unsigned long i, j;
> > > > +
> > > > + lockdep_assert_held(notifier_sem);
> > > > +
> > > > + i = 0;
> > > > + for_each_sg(st->sgl, sgl, st->nents, j) {
> > > > + hmm_pfn = range->hmm_pfns[i];
> > > > + page = hmm_pfn_to_page(hmm_pfn);
> > > > + xe_assert(xe, !is_device_private_page(page));
> > > > + size = 1UL << hmm_pfn_to_map_order(hmm_pfn);
> > > > + sg_set_page(sgl, page, size << PAGE_SHIFT, 0);
> > > > + if (unlikely(j == st->nents - 1))
> > > > + sg_mark_end(sgl);
> > > > + i += size;
> > > > }
> > > > + xe_assert(xe, i == xe_npages_in_range(range->start,
> > > > range-
> > > > > end));
> > > >
> > > > -free_pages:
> > > > - kvfree(pages);
> > > > - return ret;
> > > > + return dma_map_sgtable(dev, st, write ?
> > > > DMA_BIDIRECTIONAL
> > > > : DMA_TO_DEVICE,
> > > > + DMA_ATTR_SKIP_CPU_SYNC |
> > > > DMA_ATTR_NO_KERNEL_MAPPING);
> > > > }
> > > >
> > > > /**
> > > > @@ -235,16 +261,45 @@ int xe_hmm_userptr_populate_range(struct
> > > > xe_userptr_vma *uvma,
> > > > if (ret)
> > > > goto free_pfns;
> > > >
> > > > - ret = xe_build_sg(vm->xe, &hmm_range, &userptr->sgt,
> > > > write);
> > > > + if (unlikely(userptr->sg)) {
> > > > + ret = down_write_killable(&vm-
> > > > > userptr.notifier_lock);
> > > > + if (ret)
> > > > + goto free_pfns;
> > > > +
> > > > + xe_hmm_userptr_free_sg(uvma);
> > > > + up_write(&vm->userptr.notifier_lock);
> > > > + }
> > > > +
> > > > + ret = xe_alloc_sg(&userptr->sgt, &hmm_range, &vm-
> > > > > userptr.notifier_lock);
> > > > if (ret)
> > > > goto free_pfns;
> > > >
> > > > + ret = down_read_interruptible(&vm-
> > > > >userptr.notifier_lock);
> > > > + if (ret)
> > > > + goto free_st;
> > > > +
> > > > + if (mmu_interval_read_retry(hmm_range.notifier,
> > > > hmm_range.notifier_seq)) {
> > > > + ret = -EAGAIN;
> > > > + goto out_unlock;
> > > > + }
> > > > +
> > > > + ret = xe_build_sg(vm->xe, &hmm_range, &userptr->sgt,
> > > > + &vm->userptr.notifier_lock, write);
> > > > + if (ret)
> > > > + goto out_unlock;
> > > > +
> > > > xe_mark_range_accessed(&hmm_range, write);
> > > > userptr->sg = &userptr->sgt;
> > > > userptr->notifier_seq = hmm_range.notifier_seq;
> > > > + up_read(&vm->userptr.notifier_lock);
> > > > + kvfree(pfns);
> > > > + return 0;
> > > >
> > > > +out_unlock:
> > > > + up_read(&vm->userptr.notifier_lock);
> > > > +free_st:
> > > > + sg_free_table(&userptr->sgt);
> > > > free_pfns:
> > > > kvfree(pfns);
> > > > return ret;
> > > > }
> > > > -
> > >
> >
>