* [PATCH v2 01/29] drm/xe: Retry BO allocation
2024-10-16 3:24 [PATCH v2 00/29] Introduce GPU SVM and Xe SVM implementation Matthew Brost
@ 2024-10-16 3:24 ` Matthew Brost
2024-10-16 3:24 ` [PATCH v2 02/29] mm/migrate: Add migrate_device_prepopulated_range Matthew Brost
` (30 subsequent siblings)
31 siblings, 0 replies; 129+ messages in thread
From: Matthew Brost @ 2024-10-16 3:24 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
TTM doesn't support fair eviction via WW locking; this is mitigated in part by
using retry loops in the exec and preempt rebind worker. Extend this retry
loop to BO allocation. Once TTM supports fair eviction this patch can be
reverted.
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
drivers/gpu/drm/xe/xe_bo.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
index 5b232f2951b1..a02d63e322ae 100644
--- a/drivers/gpu/drm/xe/xe_bo.c
+++ b/drivers/gpu/drm/xe/xe_bo.c
@@ -2011,6 +2011,7 @@ int xe_gem_create_ioctl(struct drm_device *dev, void *data,
struct xe_file *xef = to_xe_file(file);
struct drm_xe_gem_create *args = data;
struct xe_vm *vm = NULL;
+ ktime_t end = 0;
struct xe_bo *bo;
unsigned int bo_flags;
u32 handle;
@@ -2083,11 +2084,14 @@ int xe_gem_create_ioctl(struct drm_device *dev, void *data,
vm = xe_vm_lookup(xef, args->vm_id);
if (XE_IOCTL_DBG(xe, !vm))
return -ENOENT;
+ }
+
+retry:
+ if (vm) {
err = xe_vm_lock(vm, true);
if (err)
goto out_vm;
}
-
bo = xe_bo_create_user(xe, NULL, vm, args->size, args->cpu_caching,
bo_flags);
@@ -2096,6 +2100,8 @@ int xe_gem_create_ioctl(struct drm_device *dev, void *data,
if (IS_ERR(bo)) {
err = PTR_ERR(bo);
+ if (xe_vm_validate_should_retry(NULL, err, &end))
+ goto retry;
goto out_vm;
}
--
2.34.1
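For readers following the split hunks, the allocation path after this patch reads roughly as below. This is a condensed sketch rather than the full ioctl; the xe_vm_unlock() call comes from unchanged context that the hunks above do not show.

	ktime_t end = 0;
	...
retry:
	if (vm) {
		err = xe_vm_lock(vm, true);
		if (err)
			goto out_vm;
	}

	bo = xe_bo_create_user(xe, NULL, vm, args->size, args->cpu_caching,
			       bo_flags);

	if (vm)
		xe_vm_unlock(vm);

	if (IS_ERR(bo)) {
		err = PTR_ERR(bo);
		/* transient eviction failure: back off and retry until the
		 * deadline tracked in 'end' expires
		 */
		if (xe_vm_validate_should_retry(NULL, err, &end))
			goto retry;
		goto out_vm;
	}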
* [PATCH v2 02/29] mm/migrate: Add migrate_device_prepopulated_range
2024-10-16 3:24 [PATCH v2 00/29] Introduce GPU SVM and Xe SVM implementation Matthew Brost
2024-10-16 3:24 ` [PATCH v2 01/29] drm/xe: Retry BO allocation Matthew Brost
@ 2024-10-16 3:24 ` Matthew Brost
2024-10-16 4:04 ` Alistair Popple
2024-10-16 3:24 ` [PATCH v2 03/29] mm/migrate: Trylock device page in do_swap_page Matthew Brost
` (29 subsequent siblings)
31 siblings, 1 reply; 129+ messages in thread
From: Matthew Brost @ 2024-10-16 3:24 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
Add migrate_device_prepopulated_range(), which prepares an array of
pre-populated device pages for migration.
v2:
- s/migrate_device_vma_range/migrate_device_prepopulated_range
- Drop extra mmu invalidation (Vetter)
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
include/linux/migrate.h | 1 +
mm/migrate_device.c | 35 +++++++++++++++++++++++++++++++++++
2 files changed, 36 insertions(+)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 002e49b2ebd9..9146ed39a2a3 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -229,6 +229,7 @@ void migrate_vma_pages(struct migrate_vma *migrate);
void migrate_vma_finalize(struct migrate_vma *migrate);
int migrate_device_range(unsigned long *src_pfns, unsigned long start,
unsigned long npages);
+int migrate_device_prepopulated_range(unsigned long *src_pfns, unsigned long npages);
void migrate_device_pages(unsigned long *src_pfns, unsigned long *dst_pfns,
unsigned long npages);
void migrate_device_finalize(unsigned long *src_pfns,
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 9cf26592ac93..f163c2131022 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -924,6 +924,41 @@ int migrate_device_range(unsigned long *src_pfns, unsigned long start,
}
EXPORT_SYMBOL(migrate_device_range);
+/**
+ * migrate_device_prepopulated_range() - migrate device private pfns to normal memory.
+ * @src_pfns: pre-populated array of source device private pfns to migrate.
+ * @npages: number of pages to migrate.
+ *
+ * Similar to migrate_device_range() but supports a non-contiguous, pre-populated
+ * array of device pages to migrate.
+ */
+int migrate_device_prepopulated_range(unsigned long *src_pfns, unsigned long npages)
+{
+ unsigned long i;
+
+ for (i = 0; i < npages; i++) {
+ struct page *page = pfn_to_page(src_pfns[i]);
+
+ if (!get_page_unless_zero(page)) {
+ src_pfns[i] = 0;
+ continue;
+ }
+
+ if (!trylock_page(page)) {
+ src_pfns[i] = 0;
+ put_page(page);
+ continue;
+ }
+
+ src_pfns[i] = migrate_pfn(src_pfns[i]) | MIGRATE_PFN_MIGRATE;
+ }
+
+ migrate_device_unmap(src_pfns, npages, NULL);
+
+ return 0;
+}
+EXPORT_SYMBOL(migrate_device_prepopulated_range);
+
/*
* Migrate a device coherent folio back to normal memory. The caller should have
* a reference on folio which will be copied to the new folio if migration is
--
2.34.1
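As a rough illustration of how a driver might consume this new helper (not taken from the series; evict_pages_to_ram() and copy_to_ram() are hypothetical names, and allocation failure handling is omitted), the non-contiguous eviction flow would look something like:

	/* src_pfns[] already holds raw device-private pfns collected by the driver */
	static int evict_pages_to_ram(unsigned long *src_pfns,
				      unsigned long *dst_pfns,
				      unsigned long npages)
	{
		unsigned long i;

		/* take references, trylock and unmap the device pages */
		migrate_device_prepopulated_range(src_pfns, npages);

		for (i = 0; i < npages; i++) {
			struct page *dpage;

			if (!(src_pfns[i] & MIGRATE_PFN_MIGRATE))
				continue;

			dpage = alloc_page(GFP_HIGHUSER);
			dst_pfns[i] = migrate_pfn(page_to_pfn(dpage));
			/* device-specific copy, placeholder name */
			copy_to_ram(migrate_pfn_to_page(src_pfns[i]), dpage);
		}

		migrate_device_pages(src_pfns, dst_pfns, npages);
		migrate_device_finalize(src_pfns, dst_pfns, npages);

		return 0;
	}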
* Re: [PATCH v2 02/29] mm/migrate: Add migrate_device_prepopulated_range
2024-10-16 3:24 ` [PATCH v2 02/29] mm/migrate: Add migrate_device_prepopulated_range Matthew Brost
@ 2024-10-16 4:04 ` Alistair Popple
2024-10-16 4:46 ` Matthew Brost
0 siblings, 1 reply; 129+ messages in thread
From: Alistair Popple @ 2024-10-16 4:04 UTC (permalink / raw)
To: Matthew Brost
Cc: intel-xe, dri-devel, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
Matthew Brost <matthew.brost@intel.com> writes:
> Add migrate_device_prepoluated_range which prepares an array of
> pre-populated device pages for migration.
It would be nice if the commit message could also include an explanation
of why the existing migrate_device_range() is inadequate for your needs.
> v2:
> - s/migrate_device_vma_range/migrate_device_prepopulated_range
> - Drop extra mmu invalidation (Vetter)
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
> include/linux/migrate.h | 1 +
> mm/migrate_device.c | 35 +++++++++++++++++++++++++++++++++++
> 2 files changed, 36 insertions(+)
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 002e49b2ebd9..9146ed39a2a3 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -229,6 +229,7 @@ void migrate_vma_pages(struct migrate_vma *migrate);
> void migrate_vma_finalize(struct migrate_vma *migrate);
> int migrate_device_range(unsigned long *src_pfns, unsigned long start,
> unsigned long npages);
> +int migrate_device_prepopulated_range(unsigned long *src_pfns, unsigned long npages);
> void migrate_device_pages(unsigned long *src_pfns, unsigned long *dst_pfns,
> unsigned long npages);
> void migrate_device_finalize(unsigned long *src_pfns,
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 9cf26592ac93..f163c2131022 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -924,6 +924,41 @@ int migrate_device_range(unsigned long *src_pfns, unsigned long start,
> }
> EXPORT_SYMBOL(migrate_device_range);
>
> +/**
> + * migrate_device_prepopulated_range() - migrate device private pfns to normal memory.
> + * @src_pfns: pre-popluated array of source device private pfns to migrate.
> + * @npages: number of pages to migrate.
> + *
> + * Similar to migrate_device_range() but supports non-contiguous pre-popluated
> + * array of device pages to migrate.
> + */
> +int migrate_device_prepopulated_range(unsigned long *src_pfns, unsigned long npages)
I don't love the name, I think because it is quite long and makes me
think of something complicated like prefaulting. Perhaps
migrate_device_pfns()?
> +{
> + unsigned long i;
> +
> + for (i = 0; i < npages; i++) {
> + struct page *page = pfn_to_page(src_pfns[i]);
> +
> + if (!get_page_unless_zero(page)) {
> + src_pfns[i] = 0;
> + continue;
> + }
> +
> + if (!trylock_page(page)) {
> + src_pfns[i] = 0;
> + put_page(page);
> + continue;
> + }
> +
> + src_pfns[i] = migrate_pfn(src_pfns[i]) | MIGRATE_PFN_MIGRATE;
This needs to be converted to use a folio like
migrate_device_range(). But more importantly this should be split out as
a function that both migrate_device_range() and this function can call
given this bit is identical.
> + }
> +
> + migrate_device_unmap(src_pfns, npages, NULL);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL(migrate_device_prepopulated_range);
> +
> /*
> * Migrate a device coherent folio back to normal memory. The caller should have
> * a reference on folio which will be copied to the new folio if migration is
* Re: [PATCH v2 02/29] mm/migrate: Add migrate_device_prepopulated_range
2024-10-16 4:04 ` Alistair Popple
@ 2024-10-16 4:46 ` Matthew Brost
2024-10-17 0:56 ` Matthew Brost
0 siblings, 1 reply; 129+ messages in thread
From: Matthew Brost @ 2024-10-16 4:46 UTC (permalink / raw)
To: Alistair Popple
Cc: intel-xe, dri-devel, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
On Wed, Oct 16, 2024 at 03:04:06PM +1100, Alistair Popple wrote:
>
> Matthew Brost <matthew.brost@intel.com> writes:
>
> > Add migrate_device_prepoluated_range which prepares an array of
> > pre-populated device pages for migration.
>
> It would be nice if the commit message could also include an explanation
> of why the existing migrate_device_range() is inadequate for your needs.
>
Yeah, my bad. It should explain that this is required for non-contiguous
device pages. I suppose I could always iterate over contiguous regions
with migrate_device_range too, if you think that is better.
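For illustration, the alternative mentioned here (splitting a pre-populated pfn array into physically contiguous runs and feeding each run to the existing migrate_device_range()) could look roughly like the sketch below; the function and variable names are hypothetical.

	static void migrate_pfn_array(unsigned long *pfns, unsigned long *src,
				      unsigned long npages)
	{
		unsigned long i = 0;

		while (i < npages) {
			unsigned long begin = i++;

			/* extend the run while the pfns stay contiguous */
			while (i < npages && pfns[i] == pfns[i - 1] + 1)
				i++;

			migrate_device_range(&src[begin], pfns[begin], i - begin);
		}
	}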
> > v2:
> > - s/migrate_device_vma_range/migrate_device_prepopulated_range
> > - Drop extra mmu invalidation (Vetter)
> >
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> > include/linux/migrate.h | 1 +
> > mm/migrate_device.c | 35 +++++++++++++++++++++++++++++++++++
> > 2 files changed, 36 insertions(+)
> >
> > diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> > index 002e49b2ebd9..9146ed39a2a3 100644
> > --- a/include/linux/migrate.h
> > +++ b/include/linux/migrate.h
> > @@ -229,6 +229,7 @@ void migrate_vma_pages(struct migrate_vma *migrate);
> > void migrate_vma_finalize(struct migrate_vma *migrate);
> > int migrate_device_range(unsigned long *src_pfns, unsigned long start,
> > unsigned long npages);
> > +int migrate_device_prepopulated_range(unsigned long *src_pfns, unsigned long npages);
> > void migrate_device_pages(unsigned long *src_pfns, unsigned long *dst_pfns,
> > unsigned long npages);
> > void migrate_device_finalize(unsigned long *src_pfns,
> > diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> > index 9cf26592ac93..f163c2131022 100644
> > --- a/mm/migrate_device.c
> > +++ b/mm/migrate_device.c
> > @@ -924,6 +924,41 @@ int migrate_device_range(unsigned long *src_pfns, unsigned long start,
> > }
> > EXPORT_SYMBOL(migrate_device_range);
> >
> > +/**
> > + * migrate_device_prepopulated_range() - migrate device private pfns to normal memory.
> > + * @src_pfns: pre-popluated array of source device private pfns to migrate.
> > + * @npages: number of pages to migrate.
> > + *
> > + * Similar to migrate_device_range() but supports non-contiguous pre-popluated
> > + * array of device pages to migrate.
> > + */
> > +int migrate_device_prepopulated_range(unsigned long *src_pfns, unsigned long npages)
>
> I don't love the name, I think because it is quite long and makes me
> think of something complicated like prefaulting. Perhaps
> migrate_device_pfns()?
>
Sure.
> > +{
> > + unsigned long i;
> > +
> > + for (i = 0; i < npages; i++) {
> > + struct page *page = pfn_to_page(src_pfns[i]);
> > +
> > + if (!get_page_unless_zero(page)) {
> > + src_pfns[i] = 0;
> > + continue;
> > + }
> > +
> > + if (!trylock_page(page)) {
> > + src_pfns[i] = 0;
> > + put_page(page);
> > + continue;
> > + }
> > +
> > + src_pfns[i] = migrate_pfn(src_pfns[i]) | MIGRATE_PFN_MIGRATE;
>
> This needs to be converted to use a folio like
> migrate_device_range(). But more importantly this should be split out as
> a function that both migrate_device_range() and this function can call
> given this bit is identical.
>
Missed the folio conversion, and I agree a helper shared between this
function and migrate_device_range would be a good idea. Let me add that.
Matt
> > + }
> > +
> > + migrate_device_unmap(src_pfns, npages, NULL);
> > +
> > + return 0;
> > +}
> > +EXPORT_SYMBOL(migrate_device_prepopulated_range);
> > +
> > /*
> > * Migrate a device coherent folio back to normal memory. The caller should have
> > * a reference on folio which will be copied to the new folio if migration is
>
* Re: [PATCH v2 02/29] mm/migrate: Add migrate_device_prepopulated_range
2024-10-16 4:46 ` Matthew Brost
@ 2024-10-17 0:56 ` Matthew Brost
2024-10-17 1:49 ` Alistair Popple
0 siblings, 1 reply; 129+ messages in thread
From: Matthew Brost @ 2024-10-17 0:56 UTC (permalink / raw)
To: Alistair Popple
Cc: intel-xe, dri-devel, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
On Wed, Oct 16, 2024 at 04:46:52AM +0000, Matthew Brost wrote:
> On Wed, Oct 16, 2024 at 03:04:06PM +1100, Alistair Popple wrote:
> >
> > Matthew Brost <matthew.brost@intel.com> writes:
> >
> > > Add migrate_device_prepoluated_range which prepares an array of
> > > pre-populated device pages for migration.
> >
> > It would be nice if the commit message could also include an explanation
> > of why the existing migrate_device_range() is inadequate for your needs.
> >
>
> Yea, my bad. It should explain this is required for non-contiguous
> device pages. I suppose I could always iterate for contiguous regions
> with migrate_device_range too if you think that is better.
>
> > > v2:
> > > - s/migrate_device_vma_range/migrate_device_prepopulated_range
> > > - Drop extra mmu invalidation (Vetter)
> > >
> > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > ---
> > > include/linux/migrate.h | 1 +
> > > mm/migrate_device.c | 35 +++++++++++++++++++++++++++++++++++
> > > 2 files changed, 36 insertions(+)
> > >
> > > diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> > > index 002e49b2ebd9..9146ed39a2a3 100644
> > > --- a/include/linux/migrate.h
> > > +++ b/include/linux/migrate.h
> > > @@ -229,6 +229,7 @@ void migrate_vma_pages(struct migrate_vma *migrate);
> > > void migrate_vma_finalize(struct migrate_vma *migrate);
> > > int migrate_device_range(unsigned long *src_pfns, unsigned long start,
> > > unsigned long npages);
> > > +int migrate_device_prepopulated_range(unsigned long *src_pfns, unsigned long npages);
> > > void migrate_device_pages(unsigned long *src_pfns, unsigned long *dst_pfns,
> > > unsigned long npages);
> > > void migrate_device_finalize(unsigned long *src_pfns,
> > > diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> > > index 9cf26592ac93..f163c2131022 100644
> > > --- a/mm/migrate_device.c
> > > +++ b/mm/migrate_device.c
> > > @@ -924,6 +924,41 @@ int migrate_device_range(unsigned long *src_pfns, unsigned long start,
> > > }
> > > EXPORT_SYMBOL(migrate_device_range);
> > >
> > > +/**
> > > + * migrate_device_prepopulated_range() - migrate device private pfns to normal memory.
> > > + * @src_pfns: pre-popluated array of source device private pfns to migrate.
> > > + * @npages: number of pages to migrate.
> > > + *
> > > + * Similar to migrate_device_range() but supports non-contiguous pre-popluated
> > > + * array of device pages to migrate.
> > > + */
> > > +int migrate_device_prepopulated_range(unsigned long *src_pfns, unsigned long npages)
> >
> > I don't love the name, I think because it is quite long and makes me
> > think of something complicated like prefaulting. Perhaps
> > migrate_device_pfns()?
> >
>
> Sure.
>
> > > +{
> > > + unsigned long i;
> > > +
> > > + for (i = 0; i < npages; i++) {
> > > + struct page *page = pfn_to_page(src_pfns[i]);
> > > +
> > > + if (!get_page_unless_zero(page)) {
> > > + src_pfns[i] = 0;
> > > + continue;
> > > + }
> > > +
> > > + if (!trylock_page(page)) {
> > > + src_pfns[i] = 0;
> > > + put_page(page);
> > > + continue;
> > > + }
> > > +
> > > + src_pfns[i] = migrate_pfn(src_pfns[i]) | MIGRATE_PFN_MIGRATE;
> >
> > This needs to be converted to use a folio like
> > migrate_device_range(). But more importantly this should be split out as
> > a function that both migrate_device_range() and this function can call
> > given this bit is identical.
> >
>
> Missed the folio conversion and agree a helper shared between this
> function and migrate_device_range would be a good idea. Let add that.
>
Alistair,
Ok, I think now I want to go in a slightly different direction here to give
GPUSVM a bit more control over several eviction scenarios.
What if I exported the helper discussed above, e.g.,
905 unsigned long migrate_device_pfn_lock(unsigned long pfn)
906 {
907 struct folio *folio;
908
909 folio = folio_get_nontail_page(pfn_to_page(pfn));
910 if (!folio)
911 return 0;
912
913 if (!folio_trylock(folio)) {
914 folio_put(folio);
915 return 0;
916 }
917
918 return migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
919 }
920 EXPORT_SYMBOL(migrate_device_pfn_lock);
And then also export migrate_device_unmap.
The usage here would be to let a driver collect the device pages in a virtual
address range via hmm_range_fault, lock the device pages under the notifier
lock (ensuring the device pages are valid), drop the notifier lock, and call
migrate_device_unmap. Sima has strongly suggested avoiding a CPUVMA
lookup during eviction cases, and this would let me fix up
drm_gpusvm_range_evict in [1] to avoid this.
It would also make the function exported in this patch unnecessary, as
non-contiguous pfns can be set up on the driver side via
migrate_device_pfn_lock and then migrate_device_unmap can be called.
There is also another eviction usage of this in GPUSVM, see drm_gpusvm_evict_to_ram
in [1].
Do you see an issue exporting migrate_device_pfn_lock,
migrate_device_unmap?
Matt
[1] https://patchwork.freedesktop.org/patch/619809/?series=137870&rev=2
> Matt
>
> > > + }
> > > +
> > > + migrate_device_unmap(src_pfns, npages, NULL);
> > > +
> > > + return 0;
> > > +}
> > > +EXPORT_SYMBOL(migrate_device_prepopulated_range);
> > > +
> > > /*
> > > * Migrate a device coherent folio back to normal memory. The caller should have
> > > * a reference on folio which will be copied to the new folio if migration is
> >
* Re: [PATCH v2 02/29] mm/migrate: Add migrate_device_prepopulated_range
2024-10-17 0:56 ` Matthew Brost
@ 2024-10-17 1:49 ` Alistair Popple
2024-10-17 2:45 ` Matthew Brost
0 siblings, 1 reply; 129+ messages in thread
From: Alistair Popple @ 2024-10-17 1:49 UTC (permalink / raw)
To: Matthew Brost
Cc: intel-xe, dri-devel, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
Matthew Brost <matthew.brost@intel.com> writes:
> On Wed, Oct 16, 2024 at 04:46:52AM +0000, Matthew Brost wrote:
>> On Wed, Oct 16, 2024 at 03:04:06PM +1100, Alistair Popple wrote:
[...]
>> > > +{
>> > > + unsigned long i;
>> > > +
>> > > + for (i = 0; i < npages; i++) {
>> > > + struct page *page = pfn_to_page(src_pfns[i]);
>> > > +
>> > > + if (!get_page_unless_zero(page)) {
>> > > + src_pfns[i] = 0;
>> > > + continue;
>> > > + }
>> > > +
>> > > + if (!trylock_page(page)) {
>> > > + src_pfns[i] = 0;
>> > > + put_page(page);
>> > > + continue;
>> > > + }
>> > > +
>> > > + src_pfns[i] = migrate_pfn(src_pfns[i]) | MIGRATE_PFN_MIGRATE;
>> >
>> > This needs to be converted to use a folio like
>> > migrate_device_range(). But more importantly this should be split out as
>> > a function that both migrate_device_range() and this function can call
>> > given this bit is identical.
>> >
>>
>> Missed the folio conversion and agree a helper shared between this
>> function and migrate_device_range would be a good idea. Let add that.
>>
>
> Alistair,
>
> Ok, I think now I want to go slightly different direction here to give
> GPUSVM a bit more control over several eviction scenarios.
>
> What if I exported the helper discussed above, e.g.,
>
> 905 unsigned long migrate_device_pfn_lock(unsigned long pfn)
> 906 {
> 907 struct folio *folio;
> 908
> 909 folio = folio_get_nontail_page(pfn_to_page(pfn));
> 910 if (!folio)
> 911 return 0;
> 912
> 913 if (!folio_trylock(folio)) {
> 914 folio_put(folio);
> 915 return 0;
> 916 }
> 917
> 918 return migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
> 919 }
> 920 EXPORT_SYMBOL(migrate_device_pfn_lock);
>
> And then also export migrate_device_unmap.
>
> The usage here would be let a driver collect the device pages in virtual
> address range via hmm_range_fault, lock device pages under notifier
> lock ensuring device pages are valid, drop the notifier lock and call
> migrate_device_unmap.
I'm still working through this series but that seems a bit dubious, the
locking here is pretty subtle and easy to get wrong so seeing some code
would help me a lot in understanding what you're suggesting.
> Sima has strongly suggested avoiding a CPUVMA
> lookup during eviction cases and this would let me fixup
> drm_gpusvm_range_evict in [1] to avoid this.
That sounds reasonable but for context do you have a link to the
comments/discussion on this? I couldn't readily find it, but I may have
just missed it.
> It would also make the function exported in this patch unnecessary too
> as non-contiguous pfns can be setup on driver side via
> migrate_device_pfn_lock and then migrate_device_unmap can be called.
> This also another eviction usage in GPUSVM, see drm_gpusvm_evict_to_ram
> in [1].
>
> Do you see an issue exporting migrate_device_pfn_lock,
> migrate_device_unmap?
If there is a good justification for it I can't see a problem with
exporting it. That said I don't really understand why you would
want/need to split those steps up but I'll wait to see the code.
- Alistair
> Matt
>
> [1] https://patchwork.freedesktop.org/patch/619809/?series=137870&rev=2
>
>> Matt
>>
>> > > + }
>> > > +
>> > > + migrate_device_unmap(src_pfns, npages, NULL);
>> > > +
>> > > + return 0;
>> > > +}
>> > > +EXPORT_SYMBOL(migrate_device_prepopulated_range);
>> > > +
>> > > /*
>> > > * Migrate a device coherent folio back to normal memory. The caller should have
>> > > * a reference on folio which will be copied to the new folio if migration is
>> >
* Re: [PATCH v2 02/29] mm/migrate: Add migrate_device_prepopulated_range
2024-10-17 1:49 ` Alistair Popple
@ 2024-10-17 2:45 ` Matthew Brost
2024-10-17 3:21 ` Alistair Popple
0 siblings, 1 reply; 129+ messages in thread
From: Matthew Brost @ 2024-10-17 2:45 UTC (permalink / raw)
To: Alistair Popple
Cc: intel-xe, dri-devel, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
On Thu, Oct 17, 2024 at 12:49:55PM +1100, Alistair Popple wrote:
>
> Matthew Brost <matthew.brost@intel.com> writes:
>
> > On Wed, Oct 16, 2024 at 04:46:52AM +0000, Matthew Brost wrote:
> >> On Wed, Oct 16, 2024 at 03:04:06PM +1100, Alistair Popple wrote:
>
> [...]
>
> >> > > +{
> >> > > + unsigned long i;
> >> > > +
> >> > > + for (i = 0; i < npages; i++) {
> >> > > + struct page *page = pfn_to_page(src_pfns[i]);
> >> > > +
> >> > > + if (!get_page_unless_zero(page)) {
> >> > > + src_pfns[i] = 0;
> >> > > + continue;
> >> > > + }
> >> > > +
> >> > > + if (!trylock_page(page)) {
> >> > > + src_pfns[i] = 0;
> >> > > + put_page(page);
> >> > > + continue;
> >> > > + }
> >> > > +
> >> > > + src_pfns[i] = migrate_pfn(src_pfns[i]) | MIGRATE_PFN_MIGRATE;
> >> >
> >> > This needs to be converted to use a folio like
> >> > migrate_device_range(). But more importantly this should be split out as
> >> > a function that both migrate_device_range() and this function can call
> >> > given this bit is identical.
> >> >
> >>
> >> Missed the folio conversion and agree a helper shared between this
> >> function and migrate_device_range would be a good idea. Let add that.
> >>
> >
> > Alistair,
> >
> > Ok, I think now I want to go slightly different direction here to give
> > GPUSVM a bit more control over several eviction scenarios.
> >
> > What if I exported the helper discussed above, e.g.,
> >
> > 905 unsigned long migrate_device_pfn_lock(unsigned long pfn)
> > 906 {
> > 907 struct folio *folio;
> > 908
> > 909 folio = folio_get_nontail_page(pfn_to_page(pfn));
> > 910 if (!folio)
> > 911 return 0;
> > 912
> > 913 if (!folio_trylock(folio)) {
> > 914 folio_put(folio);
> > 915 return 0;
> > 916 }
> > 917
> > 918 return migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
> > 919 }
> > 920 EXPORT_SYMBOL(migrate_device_pfn_lock);
> >
> > And then also export migrate_device_unmap.
> >
> > The usage here would be let a driver collect the device pages in virtual
> > address range via hmm_range_fault, lock device pages under notifier
> > lock ensuring device pages are valid, drop the notifier lock and call
> > migrate_device_unmap.
>
> I'm still working through this series but that seems a bit dubious, the
> locking here is pretty subtle and easy to get wrong so seeing some code
> would help me a lot in understanding what you're suggesting.
>
For sure, locking is tricky; my mistake for not working through this
before sending out the next rev, but it came to mind after sending, plus
some late feedback from Thomas about using hmm for eviction
[2]. His suggestion of using hmm_range_fault to trigger migration
doesn't work for coherent pages (they are mapped with normal, present PTEs,
so faulting them never invokes migrate_to_ram()), while something like below does.
[2] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1125461
Here is a snippet I have locally which seems to work.
2024 retry:
2025 hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
2026 hmm_range.hmm_pfns = src;
2027
2028 while (true) {
2029 mmap_read_lock(mm);
2030 err = hmm_range_fault(&hmm_range);
2031 mmap_read_unlock(mm);
2032 if (err == -EBUSY) {
2033 if (time_after(jiffies, timeout))
2034 break;
2035
2036 hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
2037 continue;
2038 }
2039 break;
2040 }
2041 if (err)
2042 goto err_put;
2043
2044 drm_gpusvm_notifier_lock(gpusvm);
2045 if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
2046 drm_gpusvm_notifier_unlock(gpusvm);
2047 memset(src, 0, sizeof(*src) * npages);
2048 goto retry;
2049 }
2050 for (i = 0; i < npages; ++i) {
2051 struct page *page = hmm_pfn_to_page(src[i]);
2052
2053 if (page && (is_device_private_page(page) ||
2054 is_device_coherent_page(page)) && page->zone_device_data)
2055 src[i] = src[i] & ~HMM_PFN_FLAGS;
2056 else
2057 src[i] = 0;
2058 if (src[i])
2059 src[i] = migrate_device_pfn_lock(src[i]);
2060 }
2061 drm_gpusvm_notifier_unlock(gpusvm);
2062
2063 migrate_device_unmap(src, npages, NULL);
...
2101 migrate_device_pages(src, dst, npages);
2102 migrate_device_finalize(src, dst, npages);
> > Sima has strongly suggested avoiding a CPUVMA
> > lookup during eviction cases and this would let me fixup
> > drm_gpusvm_range_evict in [1] to avoid this.
>
> That sounds reasonable but for context do you have a link to the
> comments/discussion on this? I couldn't readily find it, but I may have
> just missed it.
>
See in [4], search for '2. eviction' comment from sima.
[3] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1110726
[4] https://lore.kernel.org/all/BYAPR11MB3159A304925168D8B6B4671292692@BYAPR11MB3159.namprd11.prod.outlook.com/T/#m89cd6a37778ba5271d5381ebeb03e1f963856a78
> > It would also make the function exported in this patch unnecessary too
> > as non-contiguous pfns can be setup on driver side via
> > migrate_device_pfn_lock and then migrate_device_unmap can be called.
> > This also another eviction usage in GPUSVM, see drm_gpusvm_evict_to_ram
> > in [1].
> >
> > Do you see an issue exporting migrate_device_pfn_lock,
> > migrate_device_unmap?
>
> If there is a good justification for it I can't see a problem with
> exporting it. That said I don't really understand why you would
> want/need to split those steps up but I'll wait to see the code.
>
It is so the device pages returned from hmm_range_fault, which are only
guaranteed to be valid under the notifier lock + a seqno check, can be
locked and have a reference taken for migration. migrate_device_unmap() can
trigger an MMU invalidation, which takes the notifier lock, so a function
which combines migrate_device_pfn_lock + migrate_device_unmap would deadlock
if called while holding that lock.
I think this flow makes sense and agree that in general this is likely better
than looking at a CPUVMA.
Matt
> - Alistair
>
> > Matt
> >
> > [1] https://patchwork.freedesktop.org/patch/619809/?series=137870&rev=2
> >
> >> Matt
> >>
> >> > > + }
> >> > > +
> >> > > + migrate_device_unmap(src_pfns, npages, NULL);
> >> > > +
> >> > > + return 0;
> >> > > +}
> >> > > +EXPORT_SYMBOL(migrate_device_prepopulated_range);
> >> > > +
> >> > > /*
> >> > > * Migrate a device coherent folio back to normal memory. The caller should have
> >> > > * a reference on folio which will be copied to the new folio if migration is
> >> >
>
* Re: [PATCH v2 02/29] mm/migrate: Add migrate_device_prepopulated_range
2024-10-17 2:45 ` Matthew Brost
@ 2024-10-17 3:21 ` Alistair Popple
2024-10-17 4:07 ` Matthew Brost
0 siblings, 1 reply; 129+ messages in thread
From: Alistair Popple @ 2024-10-17 3:21 UTC (permalink / raw)
To: Matthew Brost
Cc: intel-xe, dri-devel, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
Matthew Brost <matthew.brost@intel.com> writes:
> On Thu, Oct 17, 2024 at 12:49:55PM +1100, Alistair Popple wrote:
>>
>> Matthew Brost <matthew.brost@intel.com> writes:
>>
>> > On Wed, Oct 16, 2024 at 04:46:52AM +0000, Matthew Brost wrote:
>> >> On Wed, Oct 16, 2024 at 03:04:06PM +1100, Alistair Popple wrote:
>>
>> [...]
>>
>> >> > > +{
>> >> > > + unsigned long i;
>> >> > > +
>> >> > > + for (i = 0; i < npages; i++) {
>> >> > > + struct page *page = pfn_to_page(src_pfns[i]);
>> >> > > +
>> >> > > + if (!get_page_unless_zero(page)) {
>> >> > > + src_pfns[i] = 0;
>> >> > > + continue;
>> >> > > + }
>> >> > > +
>> >> > > + if (!trylock_page(page)) {
>> >> > > + src_pfns[i] = 0;
>> >> > > + put_page(page);
>> >> > > + continue;
>> >> > > + }
>> >> > > +
>> >> > > + src_pfns[i] = migrate_pfn(src_pfns[i]) | MIGRATE_PFN_MIGRATE;
>> >> >
>> >> > This needs to be converted to use a folio like
>> >> > migrate_device_range(). But more importantly this should be split out as
>> >> > a function that both migrate_device_range() and this function can call
>> >> > given this bit is identical.
>> >> >
>> >>
>> >> Missed the folio conversion and agree a helper shared between this
>> >> function and migrate_device_range would be a good idea. Let add that.
>> >>
>> >
>> > Alistair,
>> >
>> > Ok, I think now I want to go slightly different direction here to give
>> > GPUSVM a bit more control over several eviction scenarios.
>> >
>> > What if I exported the helper discussed above, e.g.,
>> >
>> > 905 unsigned long migrate_device_pfn_lock(unsigned long pfn)
>> > 906 {
>> > 907 struct folio *folio;
>> > 908
>> > 909 folio = folio_get_nontail_page(pfn_to_page(pfn));
>> > 910 if (!folio)
>> > 911 return 0;
>> > 912
>> > 913 if (!folio_trylock(folio)) {
>> > 914 folio_put(folio);
>> > 915 return 0;
>> > 916 }
>> > 917
>> > 918 return migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
>> > 919 }
>> > 920 EXPORT_SYMBOL(migrate_device_pfn_lock);
>> >
>> > And then also export migrate_device_unmap.
>> >
>> > The usage here would be let a driver collect the device pages in virtual
>> > address range via hmm_range_fault, lock device pages under notifier
>> > lock ensuring device pages are valid, drop the notifier lock and call
>> > migrate_device_unmap.
>>
>> I'm still working through this series but that seems a bit dubious, the
>> locking here is pretty subtle and easy to get wrong so seeing some code
>> would help me a lot in understanding what you're suggesting.
>>
>
> For sure locking in tricky, my mistake on not working through this
> before sending out the next rev but it came to mind after sending +
> regarding some late feedback from Thomas about using hmm for eviction
> [2]. His suggestion of using hmm_range_fault to trigger migration
> doesn't work for coherent pages, while something like below does.
>
> [2] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1125461
>
> Here is a snippet I have locally which seems to work.
>
> 2024 retry:
> 2025 hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
> 2026 hmm_range.hmm_pfns = src;
> 2027
> 2028 while (true) {
> 2029 mmap_read_lock(mm);
> 2030 err = hmm_range_fault(&hmm_range);
> 2031 mmap_read_unlock(mm);
> 2032 if (err == -EBUSY) {
> 2033 if (time_after(jiffies, timeout))
> 2034 break;
> 2035
> 2036 hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
> 2037 continue;
> 2038 }
> 2039 break;
> 2040 }
> 2041 if (err)
> 2042 goto err_put;
> 2043
> 2044 drm_gpusvm_notifier_lock(gpusvm);
> 2045 if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
> 2046 drm_gpusvm_notifier_unlock(gpusvm);
> 2047 memset(src, 0, sizeof(*src) * npages);
> 2048 goto retry;
> 2049 }
> 2050 for (i = 0; i < npages; ++i) {
> 2051 struct page *page = hmm_pfn_to_page(src[i]);
> 2052
> 2053 if (page && (is_device_private_page(page) ||
> 2054 is_device_coherent_page(page)) && page->zone_device_data)
> 2055 src[i] = src[i] & ~HMM_PFN_FLAGS;
> 2056 else
> 2057 src[i] = 0;
> 2058 if (src[i])
> 2059 src[i] = migrate_device_pfn_lock(src[i]);
> 2060 }
> 2061 drm_gpusvm_notifier_unlock(gpusvm);
Practically for eviction isn't this much the same as calling
migrate_vma_setup()? And also for eviction as Sima mentioned you
probably shouldn't be looking at mm/vma structs.
> 2063 migrate_device_unmap(src, npages, NULL);
> ...
> 2101 migrate_device_pages(src, dst, npages);
> 2102 migrate_device_finalize(src, dst, npages);
>
>
>> > Sima has strongly suggested avoiding a CPUVMA
>> > lookup during eviction cases and this would let me fixup
>> > drm_gpusvm_range_evict in [1] to avoid this.
>>
>> That sounds reasonable but for context do you have a link to the
>> comments/discussion on this? I couldn't readily find it, but I may have
>> just missed it.
>>
>
> See in [4], search for '2. eviction' comment from sima.
Thanks for pointing that out. For reference here's Sima's comment:
> 2. eviction
>
> Requirements much like migrate_to_ram, because otherwise we break the
> migration gurantee:
>
> - Only looking at physical memory datastructures and locks, no looking at
> mm/vma structs or relying on those being locked. We rely entirely on
> reverse maps from try_to_migrate to find all the mappings on both cpu
> and gpu side (cpu only zone device swap or migration pte entries ofc).
I also very much agree with this. That's basically why I added
migrate_device_range(), so that we can forcibly evict pages when the
driver needs them freed (e.g. driver unload, low memory, etc.). In
general it is impossible to guarantee eviction of all pages using just
hmm_range_fault().
> [3] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1110726
> [4] https://lore.kernel.org/all/BYAPR11MB3159A304925168D8B6B4671292692@BYAPR11MB3159.namprd11.prod.outlook.com/T/#m89cd6a37778ba5271d5381ebeb03e1f963856a78
>
>> > It would also make the function exported in this patch unnecessary too
>> > as non-contiguous pfns can be setup on driver side via
>> > migrate_device_pfn_lock and then migrate_device_unmap can be called.
>> > This also another eviction usage in GPUSVM, see drm_gpusvm_evict_to_ram
>> > in [1].
>> >
>> > Do you see an issue exporting migrate_device_pfn_lock,
>> > migrate_device_unmap?
>>
>> If there is a good justification for it I can't see a problem with
>> exporting it. That said I don't really understand why you would
>> want/need to split those steps up but I'll wait to see the code.
>>
>
> It is so the device pages returned from hmm_range_fault, which are only
> guaranteed to be valid under the notifier lock + a seqno check, to be
> locked and ref taken for migration. migrate_device_unmap() can trigger a
> MMU invalidation which takes the notifier lock thus calling the function
> which combines migrate_device_pfn_lock + migrate_device_unmap deadlocks.
>
> I think this flow makes sense and agree in general this likely better
> than looking at a CPUVMA.
I'm still a bit confused about what is better with this flow if you are
still calling hmm_range_fault(). How is it better than just calling
migrate_vma_setup()? Obviously it will fault the pages in, but it seems
we're talking about eviction here so I don't understand why that would
be relevant. And hmm_range_fault() still requires the VMA, although I
need to look at the patches more closely, probably CPUVMA is a DRM
specific concept?
Thanks.
- Alistair
> Matt
>
>> - Alistair
>>
>> > Matt
>> >
>> > [1] https://patchwork.freedesktop.org/patch/619809/?series=137870&rev=2
>> >
>> >> Matt
>> >>
>> >> > > + }
>> >> > > +
>> >> > > + migrate_device_unmap(src_pfns, npages, NULL);
>> >> > > +
>> >> > > + return 0;
>> >> > > +}
>> >> > > +EXPORT_SYMBOL(migrate_device_prepopulated_range);
>> >> > > +
>> >> > > /*
>> >> > > * Migrate a device coherent folio back to normal memory. The caller should have
>> >> > > * a reference on folio which will be copied to the new folio if migration is
>> >> >
>>
* Re: [PATCH v2 02/29] mm/migrate: Add migrate_device_prepopulated_range
2024-10-17 3:21 ` Alistair Popple
@ 2024-10-17 4:07 ` Matthew Brost
2024-10-17 5:49 ` Alistair Popple
0 siblings, 1 reply; 129+ messages in thread
From: Matthew Brost @ 2024-10-17 4:07 UTC (permalink / raw)
To: Alistair Popple
Cc: intel-xe, dri-devel, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
On Thu, Oct 17, 2024 at 02:21:13PM +1100, Alistair Popple wrote:
>
> Matthew Brost <matthew.brost@intel.com> writes:
>
> > On Thu, Oct 17, 2024 at 12:49:55PM +1100, Alistair Popple wrote:
> >>
> >> Matthew Brost <matthew.brost@intel.com> writes:
> >>
> >> > On Wed, Oct 16, 2024 at 04:46:52AM +0000, Matthew Brost wrote:
> >> >> On Wed, Oct 16, 2024 at 03:04:06PM +1100, Alistair Popple wrote:
> >>
> >> [...]
> >>
> >> >> > > +{
> >> >> > > + unsigned long i;
> >> >> > > +
> >> >> > > + for (i = 0; i < npages; i++) {
> >> >> > > + struct page *page = pfn_to_page(src_pfns[i]);
> >> >> > > +
> >> >> > > + if (!get_page_unless_zero(page)) {
> >> >> > > + src_pfns[i] = 0;
> >> >> > > + continue;
> >> >> > > + }
> >> >> > > +
> >> >> > > + if (!trylock_page(page)) {
> >> >> > > + src_pfns[i] = 0;
> >> >> > > + put_page(page);
> >> >> > > + continue;
> >> >> > > + }
> >> >> > > +
> >> >> > > + src_pfns[i] = migrate_pfn(src_pfns[i]) | MIGRATE_PFN_MIGRATE;
> >> >> >
> >> >> > This needs to be converted to use a folio like
> >> >> > migrate_device_range(). But more importantly this should be split out as
> >> >> > a function that both migrate_device_range() and this function can call
> >> >> > given this bit is identical.
> >> >> >
> >> >>
> >> >> Missed the folio conversion and agree a helper shared between this
> >> >> function and migrate_device_range would be a good idea. Let add that.
> >> >>
> >> >
> >> > Alistair,
> >> >
> >> > Ok, I think now I want to go slightly different direction here to give
> >> > GPUSVM a bit more control over several eviction scenarios.
> >> >
> >> > What if I exported the helper discussed above, e.g.,
> >> >
> >> > 905 unsigned long migrate_device_pfn_lock(unsigned long pfn)
> >> > 906 {
> >> > 907 struct folio *folio;
> >> > 908
> >> > 909 folio = folio_get_nontail_page(pfn_to_page(pfn));
> >> > 910 if (!folio)
> >> > 911 return 0;
> >> > 912
> >> > 913 if (!folio_trylock(folio)) {
> >> > 914 folio_put(folio);
> >> > 915 return 0;
> >> > 916 }
> >> > 917
> >> > 918 return migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
> >> > 919 }
> >> > 920 EXPORT_SYMBOL(migrate_device_pfn_lock);
> >> >
> >> > And then also export migrate_device_unmap.
> >> >
> >> > The usage here would be let a driver collect the device pages in virtual
> >> > address range via hmm_range_fault, lock device pages under notifier
> >> > lock ensuring device pages are valid, drop the notifier lock and call
> >> > migrate_device_unmap.
> >>
> >> I'm still working through this series but that seems a bit dubious, the
> >> locking here is pretty subtle and easy to get wrong so seeing some code
> >> would help me a lot in understanding what you're suggesting.
> >>
> >
> > For sure locking in tricky, my mistake on not working through this
> > before sending out the next rev but it came to mind after sending +
> > regarding some late feedback from Thomas about using hmm for eviction
> > [2]. His suggestion of using hmm_range_fault to trigger migration
> > doesn't work for coherent pages, while something like below does.
> >
> > [2] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1125461
> >
> > Here is a snippet I have locally which seems to work.
> >
> > 2024 retry:
> > 2025 hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
> > 2026 hmm_range.hmm_pfns = src;
> > 2027
> > 2028 while (true) {
> > 2029 mmap_read_lock(mm);
> > 2030 err = hmm_range_fault(&hmm_range);
> > 2031 mmap_read_unlock(mm);
> > 2032 if (err == -EBUSY) {
> > 2033 if (time_after(jiffies, timeout))
> > 2034 break;
> > 2035
> > 2036 hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
> > 2037 continue;
> > 2038 }
> > 2039 break;
> > 2040 }
> > 2041 if (err)
> > 2042 goto err_put;
> > 2043
> > 2044 drm_gpusvm_notifier_lock(gpusvm);
> > 2045 if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
> > 2046 drm_gpusvm_notifier_unlock(gpusvm);
> > 2047 memset(src, 0, sizeof(*src) * npages);
> > 2048 goto retry;
> > 2049 }
> > 2050 for (i = 0; i < npages; ++i) {
> > 2051 struct page *page = hmm_pfn_to_page(src[i]);
> > 2052
> > 2053 if (page && (is_device_private_page(page) ||
> > 2054 is_device_coherent_page(page)) && page->zone_device_data)
> > 2055 src[i] = src[i] & ~HMM_PFN_FLAGS;
> > 2056 else
> > 2057 src[i] = 0;
> > 2058 if (src[i])
> > 2059 src[i] = migrate_device_pfn_lock(src[i]);
> > 2060 }
> > 2061 drm_gpusvm_notifier_unlock(gpusvm);
>
> Practically for eviction isn't this much the same as calling
> migrate_vma_setup()? And also for eviction as Sima mentioned you
> probably shouldn't be looking at mm/vma structs.
>
hmm_range_fault is just collecting the pages; internally I suppose it
does look at a VMA (struct vm_area_struct), but I think the point is that
drivers should not be looking at VMAs here.
> > 2063 migrate_device_unmap(src, npages, NULL);
> > ...
> > 2101 migrate_device_pages(src, dst, npages);
> > 2102 migrate_device_finalize(src, dst, npages);
> >
> >
> >> > Sima has strongly suggested avoiding a CPUVMA
> >> > lookup during eviction cases and this would let me fixup
> >> > drm_gpusvm_range_evict in [1] to avoid this.
> >>
> >> That sounds reasonable but for context do you have a link to the
> >> comments/discussion on this? I couldn't readily find it, but I may have
> >> just missed it.
> >>
> >
> > See in [4], search for '2. eviction' comment from sima.
>
> Thanks for pointing that out. For reference here's Sima's comment:
>
> > 2. eviction
> >
> > Requirements much like migrate_to_ram, because otherwise we break the
> > migration gurantee:
> >
> > - Only looking at physical memory datastructures and locks, no looking at
> > mm/vma structs or relying on those being locked. We rely entirely on
> > reverse maps from try_to_migrate to find all the mappings on both cpu
> > and gpu side (cpu only zone device swap or migration pte entries ofc).
>
> I also very much agree with this. That's basically why I added
> migrate_device_range(), so that we can forcibly evict pages when the
> driver needs them freed (eg. driver unload, low memory, etc.). In
> general it is impossible to guarantee eviction og all pages using just
> hmm_range_fault().
>
In this code path we don't have the device pages available, hence the
proposed collection via hmm_range_fault.
> > [3] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1110726
> > [4] https://lore.kernel.org/all/BYAPR11MB3159A304925168D8B6B4671292692@BYAPR11MB3159.namprd11.prod.outlook.com/T/#m89cd6a37778ba5271d5381ebeb03e1f963856a78
> >
> >> > It would also make the function exported in this patch unnecessary too
> >> > as non-contiguous pfns can be setup on driver side via
> >> > migrate_device_pfn_lock and then migrate_device_unmap can be called.
> >> > This also another eviction usage in GPUSVM, see drm_gpusvm_evict_to_ram
> >> > in [1].
> >> >
> >> > Do you see an issue exporting migrate_device_pfn_lock,
> >> > migrate_device_unmap?
> >>
> >> If there is a good justification for it I can't see a problem with
> >> exporting it. That said I don't really understand why you would
> >> want/need to split those steps up but I'll wait to see the code.
> >>
> >
> > It is so the device pages returned from hmm_range_fault, which are only
> > guaranteed to be valid under the notifier lock + a seqno check, to be
> > locked and ref taken for migration. migrate_device_unmap() can trigger a
> > MMU invalidation which takes the notifier lock thus calling the function
> > which combines migrate_device_pfn_lock + migrate_device_unmap deadlocks.
> >
> > I think this flow makes sense and agree in general this likely better
> > than looking at a CPUVMA.
>
> I'm still a bit confused about what is better with this flow if you are
> still calling hmm_range_fault(). How is it better than just calling
> migrate_vma_setup()? Obviously it will fault the pages in, but it seems
The code in rev2 calls migrate_vma_setup, but that requires a struct
vm_area_struct argument whereas hmm_range_fault does not.
> we're talking about eviction here so I don't understand why that would
> be relevant. And hmm_range_fault() still requires the VMA, although I
> need to look at the patches more closely, probably CPUVMA is a DRM
'hmm_range_fault() still requires the VMA' internally, yes, but again not
as an argument. This is about avoiding a driver-side lookup of the VMA.
CPUVMA == struct vm_area_struct in this email.
Matt
> specific concept?
>
> Thanks.
>
> - Alistair
>
> > Matt
> >
> >> - Alistair
> >>
> >> > Matt
> >> >
> >> > [1] https://patchwork.freedesktop.org/patch/619809/?series=137870&rev=2
> >> >
> >> >> Matt
> >> >>
> >> >> > > + }
> >> >> > > +
> >> >> > > + migrate_device_unmap(src_pfns, npages, NULL);
> >> >> > > +
> >> >> > > + return 0;
> >> >> > > +}
> >> >> > > +EXPORT_SYMBOL(migrate_device_prepopulated_range);
> >> >> > > +
> >> >> > > /*
> >> >> > > * Migrate a device coherent folio back to normal memory. The caller should have
> >> >> > > * a reference on folio which will be copied to the new folio if migration is
> >> >> >
> >>
>
* Re: [PATCH v2 02/29] mm/migrate: Add migrate_device_prepopulated_range
2024-10-17 4:07 ` Matthew Brost
@ 2024-10-17 5:49 ` Alistair Popple
2024-10-17 15:40 ` Matthew Brost
0 siblings, 1 reply; 129+ messages in thread
From: Alistair Popple @ 2024-10-17 5:49 UTC (permalink / raw)
To: Matthew Brost
Cc: intel-xe, dri-devel, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
Matthew Brost <matthew.brost@intel.com> writes:
> On Thu, Oct 17, 2024 at 02:21:13PM +1100, Alistair Popple wrote:
>>
>> Matthew Brost <matthew.brost@intel.com> writes:
>>
>> > On Thu, Oct 17, 2024 at 12:49:55PM +1100, Alistair Popple wrote:
>> >>
>> >> Matthew Brost <matthew.brost@intel.com> writes:
>> >>
>> >> > On Wed, Oct 16, 2024 at 04:46:52AM +0000, Matthew Brost wrote:
>> >> >> On Wed, Oct 16, 2024 at 03:04:06PM +1100, Alistair Popple wrote:
>> >>
>> >> [...]
>> >>
>> >> >> > > +{
>> >> >> > > + unsigned long i;
>> >> >> > > +
>> >> >> > > + for (i = 0; i < npages; i++) {
>> >> >> > > + struct page *page = pfn_to_page(src_pfns[i]);
>> >> >> > > +
>> >> >> > > + if (!get_page_unless_zero(page)) {
>> >> >> > > + src_pfns[i] = 0;
>> >> >> > > + continue;
>> >> >> > > + }
>> >> >> > > +
>> >> >> > > + if (!trylock_page(page)) {
>> >> >> > > + src_pfns[i] = 0;
>> >> >> > > + put_page(page);
>> >> >> > > + continue;
>> >> >> > > + }
>> >> >> > > +
>> >> >> > > + src_pfns[i] = migrate_pfn(src_pfns[i]) | MIGRATE_PFN_MIGRATE;
>> >> >> >
>> >> >> > This needs to be converted to use a folio like
>> >> >> > migrate_device_range(). But more importantly this should be split out as
>> >> >> > a function that both migrate_device_range() and this function can call
>> >> >> > given this bit is identical.
>> >> >> >
>> >> >>
>> >> >> Missed the folio conversion and agree a helper shared between this
>> >> >> function and migrate_device_range would be a good idea. Let add that.
>> >> >>
>> >> >
>> >> > Alistair,
>> >> >
>> >> > Ok, I think now I want to go slightly different direction here to give
>> >> > GPUSVM a bit more control over several eviction scenarios.
>> >> >
>> >> > What if I exported the helper discussed above, e.g.,
>> >> >
>> >> > 905 unsigned long migrate_device_pfn_lock(unsigned long pfn)
>> >> > 906 {
>> >> > 907 struct folio *folio;
>> >> > 908
>> >> > 909 folio = folio_get_nontail_page(pfn_to_page(pfn));
>> >> > 910 if (!folio)
>> >> > 911 return 0;
>> >> > 912
>> >> > 913 if (!folio_trylock(folio)) {
>> >> > 914 folio_put(folio);
>> >> > 915 return 0;
>> >> > 916 }
>> >> > 917
>> >> > 918 return migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
>> >> > 919 }
>> >> > 920 EXPORT_SYMBOL(migrate_device_pfn_lock);
>> >> >
>> >> > And then also export migrate_device_unmap.
>> >> >
>> >> > The usage here would be let a driver collect the device pages in virtual
>> >> > address range via hmm_range_fault, lock device pages under notifier
>> >> > lock ensuring device pages are valid, drop the notifier lock and call
>> >> > migrate_device_unmap.
>> >>
>> >> I'm still working through this series but that seems a bit dubious, the
>> >> locking here is pretty subtle and easy to get wrong so seeing some code
>> >> would help me a lot in understanding what you're suggesting.
>> >>
>> >
>> > For sure locking in tricky, my mistake on not working through this
>> > before sending out the next rev but it came to mind after sending +
>> > regarding some late feedback from Thomas about using hmm for eviction
>> > [2]. His suggestion of using hmm_range_fault to trigger migration
>> > doesn't work for coherent pages, while something like below does.
>> >
>> > [2] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1125461
>> >
>> > Here is a snippet I have locally which seems to work.
>> >
>> > 2024 retry:
>> > 2025 hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
>> > 2026 hmm_range.hmm_pfns = src;
>> > 2027
>> > 2028 while (true) {
>> > 2029 mmap_read_lock(mm);
>> > 2030 err = hmm_range_fault(&hmm_range);
>> > 2031 mmap_read_unlock(mm);
>> > 2032 if (err == -EBUSY) {
>> > 2033 if (time_after(jiffies, timeout))
>> > 2034 break;
>> > 2035
>> > 2036 hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
>> > 2037 continue;
>> > 2038 }
>> > 2039 break;
>> > 2040 }
>> > 2041 if (err)
>> > 2042 goto err_put;
>> > 2043
>> > 2044 drm_gpusvm_notifier_lock(gpusvm);
>> > 2045 if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
>> > 2046 drm_gpusvm_notifier_unlock(gpusvm);
>> > 2047 memset(src, 0, sizeof(*src) * npages);
>> > 2048 goto retry;
>> > 2049 }
>> > 2050 for (i = 0; i < npages; ++i) {
>> > 2051 struct page *page = hmm_pfn_to_page(src[i]);
>> > 2052
>> > 2053 if (page && (is_device_private_page(page) ||
>> > 2054 is_device_coherent_page(page)) && page->zone_device_data)
>> > 2055 src[i] = src[i] & ~HMM_PFN_FLAGS;
>> > 2056 else
>> > 2057 src[i] = 0;
>> > 2058 if (src[i])
>> > 2059 src[i] = migrate_device_pfn_lock(src[i]);
>> > 2060 }
>> > 2061 drm_gpusvm_notifier_unlock(gpusvm);
>>
>> Practically for eviction isn't this much the same as calling
>> migrate_vma_setup()? And also for eviction as Sima mentioned you
>> probably shouldn't be looking at mm/vma structs.
>>
>
> hmm_range_fault is just collecting the pages, internally I suppose it
> does look at a VMA (struct vm_area_struct) but I think the point is
> drivers should not be looking at VMA here.
migrate_vma_setup() is designed to be called by drivers and needs a VMA,
so in general I don't see a problem with drivers looking up VMAs. The
problem arises specifically for eviction, and whether that happens
in the driver or in hmm_range_fault() is pretty irrelevant IMHO for the
issues there (see below).
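For context, the conventional VMA-based flow being compared against looks roughly like the following (illustrative only; the owner cookie, copy step and variable names are driver specific and not from this series):

	mmap_read_lock(mm);
	vma = vma_lookup(mm, start);		/* the VMA lookup under discussion */
	if (vma) {
		struct migrate_vma migrate = {
			.vma		= vma,
			.start		= start,
			.end		= end,
			.src		= src_pfns,
			.dst		= dst_pfns,
			.pgmap_owner	= owner,	/* driver-specific cookie */
			.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
		};

		err = migrate_vma_setup(&migrate); /* collect, lock and unmap pages */
		/* ... allocate destination pages and copy ... */
		migrate_vma_pages(&migrate);
		migrate_vma_finalize(&migrate);
	}
	mmap_read_unlock(mm);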
>> > 2063 migrate_device_unmap(src, npages, NULL);
>> > ...
>> > 2101 migrate_device_pages(src, dst, npages);
>> > 2102 migrate_device_finalize(src, dst, npages);
>> >
>> >
>> >> > Sima has strongly suggested avoiding a CPUVMA
>> >> > lookup during eviction cases and this would let me fixup
>> >> > drm_gpusvm_range_evict in [1] to avoid this.
>> >>
>> >> That sounds reasonable but for context do you have a link to the
>> >> comments/discussion on this? I couldn't readily find it, but I may have
>> >> just missed it.
>> >>
>> >
>> > See in [4], search for '2. eviction' comment from sima.
>>
>> Thanks for pointing that out. For reference here's Sima's comment:
>>
>> > 2. eviction
>> >
>> > Requirements much like migrate_to_ram, because otherwise we break the
>> > migration gurantee:
>> >
>> > - Only looking at physical memory datastructures and locks, no looking at
>> > mm/vma structs or relying on those being locked. We rely entirely on
>> > reverse maps from try_to_migrate to find all the mappings on both cpu
>> > and gpu side (cpu only zone device swap or migration pte entries ofc).
>>
>> I also very much agree with this. That's basically why I added
>> migrate_device_range(), so that we can forcibly evict pages when the
>> driver needs them freed (eg. driver unload, low memory, etc.). In
>> general it is impossible to guarantee eviction og all pages using just
>> hmm_range_fault().
>>
>
> In this code path we don't have device pages available, hence the
> purposed collection via hmm_range_fault.
Why don't you have the pfns requiring eviction available? I need to read
this series in more depth, but generally hmm_range_fault() can't
guarantee you will find every device page.
>> > [3] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1110726
>> > [4] https://lore.kernel.org/all/BYAPR11MB3159A304925168D8B6B4671292692@BYAPR11MB3159.namprd11.prod.outlook.com/T/#m89cd6a37778ba5271d5381ebeb03e1f963856a78
>> >
>> >> > It would also make the function exported in this patch unnecessary too
>> >> > as non-contiguous pfns can be setup on driver side via
>> >> > migrate_device_pfn_lock and then migrate_device_unmap can be called.
>> >> > This also another eviction usage in GPUSVM, see drm_gpusvm_evict_to_ram
>> >> > in [1].
>> >> >
>> >> > Do you see an issue exporting migrate_device_pfn_lock,
>> >> > migrate_device_unmap?
>> >>
>> >> If there is a good justification for it I can't see a problem with
>> >> exporting it. That said I don't really understand why you would
>> >> want/need to split those steps up but I'll wait to see the code.
>> >>
>> >
>> > It is so the device pages returned from hmm_range_fault, which are only
>> > guaranteed to be valid under the notifier lock + a seqno check, to be
>> > locked and ref taken for migration. migrate_device_unmap() can trigger a
>> > MMU invalidation which takes the notifier lock thus calling the function
>> > which combines migrate_device_pfn_lock + migrate_device_unmap deadlocks.
>> >
>> > I think this flow makes sense and agree in general this likely better
>> > than looking at a CPUVMA.
>>
>> I'm still a bit confused about what is better with this flow if you are
>> still calling hmm_range_fault(). How is it better than just calling
>> migrate_vma_setup()? Obviously it will fault the pages in, but it seems
>
> The code in rev2 calls migrate_vma_setup but the requires a struct
> vm_area_struct argument whereas hmm_range_fault does not.
I'm not sure that's a good enough justification, because the problem isn't
whether you're looking up VMAs in driver code or mm code. The problem
is you are looking up VMAs at all, and all that goes with that (mainly
taking the mmap lock, etc.).
And for eviction hmm_range_fault() won't even find all the pages because
their virtual address may have changed - consider what happens in cases
of mremap(), fork(), etc. So eviction really needs physical pages
(pfn's), not virtual addresses.
>> we're talking about eviction here so I don't understand why that would
>> be relevant. And hmm_range_fault() still requires the VMA, although I
>> need to look at the patches more closely, probably CPUVMA is a DRM
>
> 'hmm_range_fault() still requires the VMA' internal yes, but again not
> as argument. This is about avoiding a driver side lookup of the VMA.
>
> CPUVMA == struct vm_area_struct in this email.
Thanks for the clarification.
- Alistair
> Matt
>
>> specific concept?
>>
>> Thanks.
>>
>> - Alistair
>>
>> > Matt
>> >
>> >> - Alistair
>> >>
>> >> > Matt
>> >> >
>> >> > [1] https://patchwork.freedesktop.org/patch/619809/?series=137870&rev=2
>> >> >
>> >> >> Matt
>> >> >>
>> >> >> > > + }
>> >> >> > > +
>> >> >> > > + migrate_device_unmap(src_pfns, npages, NULL);
>> >> >> > > +
>> >> >> > > + return 0;
>> >> >> > > +}
>> >> >> > > +EXPORT_SYMBOL(migrate_device_prepopulated_range);
>> >> >> > > +
>> >> >> > > /*
>> >> >> > > * Migrate a device coherent folio back to normal memory. The caller should have
>> >> >> > > * a reference on folio which will be copied to the new folio if migration is
>> >> >> >
>> >>
>>
^ permalink raw reply [flat|nested] 129+ messages in thread* Re: [PATCH v2 02/29] mm/migrate: Add migrate_device_prepopulated_range
2024-10-17 5:49 ` Alistair Popple
@ 2024-10-17 15:40 ` Matthew Brost
2024-10-17 21:58 ` Alistair Popple
0 siblings, 1 reply; 129+ messages in thread
From: Matthew Brost @ 2024-10-17 15:40 UTC (permalink / raw)
To: Alistair Popple
Cc: intel-xe, dri-devel, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
On Thu, Oct 17, 2024 at 04:49:11PM +1100, Alistair Popple wrote:
>
> Matthew Brost <matthew.brost@intel.com> writes:
>
> > On Thu, Oct 17, 2024 at 02:21:13PM +1100, Alistair Popple wrote:
> >>
> >> Matthew Brost <matthew.brost@intel.com> writes:
> >>
> >> > On Thu, Oct 17, 2024 at 12:49:55PM +1100, Alistair Popple wrote:
> >> >>
> >> >> Matthew Brost <matthew.brost@intel.com> writes:
> >> >>
> >> >> > On Wed, Oct 16, 2024 at 04:46:52AM +0000, Matthew Brost wrote:
> >> >> >> On Wed, Oct 16, 2024 at 03:04:06PM +1100, Alistair Popple wrote:
> >> >>
> >> >> [...]
> >> >>
> >> >> >> > > +{
> >> >> >> > > + unsigned long i;
> >> >> >> > > +
> >> >> >> > > + for (i = 0; i < npages; i++) {
> >> >> >> > > + struct page *page = pfn_to_page(src_pfns[i]);
> >> >> >> > > +
> >> >> >> > > + if (!get_page_unless_zero(page)) {
> >> >> >> > > + src_pfns[i] = 0;
> >> >> >> > > + continue;
> >> >> >> > > + }
> >> >> >> > > +
> >> >> >> > > + if (!trylock_page(page)) {
> >> >> >> > > + src_pfns[i] = 0;
> >> >> >> > > + put_page(page);
> >> >> >> > > + continue;
> >> >> >> > > + }
> >> >> >> > > +
> >> >> >> > > + src_pfns[i] = migrate_pfn(src_pfns[i]) | MIGRATE_PFN_MIGRATE;
> >> >> >> >
> >> >> >> > This needs to be converted to use a folio like
> >> >> >> > migrate_device_range(). But more importantly this should be split out as
> >> >> >> > a function that both migrate_device_range() and this function can call
> >> >> >> > given this bit is identical.
> >> >> >> >
> >> >> >>
> >> >> >> Missed the folio conversion and agree a helper shared between this
> >> >> >> function and migrate_device_range would be a good idea. Let add that.
> >> >> >>
> >> >> >
> >> >> > Alistair,
> >> >> >
> >> >> > Ok, I think now I want to go slightly different direction here to give
> >> >> > GPUSVM a bit more control over several eviction scenarios.
> >> >> >
> >> >> > What if I exported the helper discussed above, e.g.,
> >> >> >
> >> >> > 905 unsigned long migrate_device_pfn_lock(unsigned long pfn)
> >> >> > 906 {
> >> >> > 907 struct folio *folio;
> >> >> > 908
> >> >> > 909 folio = folio_get_nontail_page(pfn_to_page(pfn));
> >> >> > 910 if (!folio)
> >> >> > 911 return 0;
> >> >> > 912
> >> >> > 913 if (!folio_trylock(folio)) {
> >> >> > 914 folio_put(folio);
> >> >> > 915 return 0;
> >> >> > 916 }
> >> >> > 917
> >> >> > 918 return migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
> >> >> > 919 }
> >> >> > 920 EXPORT_SYMBOL(migrate_device_pfn_lock);
> >> >> >
> >> >> > And then also export migrate_device_unmap.
> >> >> >
> >> >> > The usage here would be let a driver collect the device pages in virtual
> >> >> > address range via hmm_range_fault, lock device pages under notifier
> >> >> > lock ensuring device pages are valid, drop the notifier lock and call
> >> >> > migrate_device_unmap.
> >> >>
> >> >> I'm still working through this series but that seems a bit dubious, the
> >> >> locking here is pretty subtle and easy to get wrong so seeing some code
> >> >> would help me a lot in understanding what you're suggesting.
> >> >>
> >> >
> >> > For sure locking in tricky, my mistake on not working through this
> >> > before sending out the next rev but it came to mind after sending +
> >> > regarding some late feedback from Thomas about using hmm for eviction
> >> > [2]. His suggestion of using hmm_range_fault to trigger migration
> >> > doesn't work for coherent pages, while something like below does.
> >> >
> >> > [2] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1125461
> >> >
> >> > Here is a snippet I have locally which seems to work.
> >> >
> >> > 2024 retry:
> >> > 2025 hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
> >> > 2026 hmm_range.hmm_pfns = src;
> >> > 2027
> >> > 2028 while (true) {
> >> > 2029 mmap_read_lock(mm);
> >> > 2030 err = hmm_range_fault(&hmm_range);
> >> > 2031 mmap_read_unlock(mm);
> >> > 2032 if (err == -EBUSY) {
> >> > 2033 if (time_after(jiffies, timeout))
> >> > 2034 break;
> >> > 2035
> >> > 2036 hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
> >> > 2037 continue;
> >> > 2038 }
> >> > 2039 break;
> >> > 2040 }
> >> > 2041 if (err)
> >> > 2042 goto err_put;
> >> > 2043
> >> > 2044 drm_gpusvm_notifier_lock(gpusvm);
> >> > 2045 if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
> >> > 2046 drm_gpusvm_notifier_unlock(gpusvm);
> >> > 2047 memset(src, 0, sizeof(*src) * npages);
> >> > 2048 goto retry;
> >> > 2049 }
> >> > 2050 for (i = 0; i < npages; ++i) {
> >> > 2051 struct page *page = hmm_pfn_to_page(src[i]);
> >> > 2052
> >> > 2053 if (page && (is_device_private_page(page) ||
> >> > 2054 is_device_coherent_page(page)) && page->zone_device_data)
> >> > 2055 src[i] = src[i] & ~HMM_PFN_FLAGS;
> >> > 2056 else
> >> > 2057 src[i] = 0;
> >> > 2058 if (src[i])
> >> > 2059 src[i] = migrate_device_pfn_lock(src[i]);
> >> > 2060 }
> >> > 2061 drm_gpusvm_notifier_unlock(gpusvm);
> >>
> >> Practically for eviction isn't this much the same as calling
> >> migrate_vma_setup()? And also for eviction as Sima mentioned you
> >> probably shouldn't be looking at mm/vma structs.
> >>
> >
> > hmm_range_fault is just collecting the pages, internally I suppose it
> > does look at a VMA (struct vm_area_struct) but I think the point is
> > drivers should not be looking at VMA here.
>
> migrate_vma_setup() is designed to be called by drivers and needs a vma,
> so in general I don't see a problem with drivers looking up vma's. The
> problem arises specifically for eviction and whether or not that happens
> in the driver or hmm_range_fault() is pretty irrelevant IMHO for the
> issues there (see below).
>
Ok, if you think it is ok for drivers to look up the VMA, then the proposed
exporting of migrate_device_pfn_lock & migrate_device_unmap is not
needed, rather just the original function exported in this patch.
More below too.
> >> > 2063 migrate_device_unmap(src, npages, NULL);
> >> > ...
> >> > 2101 migrate_device_pages(src, dst, npages);
> >> > 2102 migrate_device_finalize(src, dst, npages);
> >> >
> >> >
> >> >> > Sima has strongly suggested avoiding a CPUVMA
> >> >> > lookup during eviction cases and this would let me fixup
> >> >> > drm_gpusvm_range_evict in [1] to avoid this.
> >> >>
> >> >> That sounds reasonable but for context do you have a link to the
> >> >> comments/discussion on this? I couldn't readily find it, but I may have
> >> >> just missed it.
> >> >>
> >> >
> >> > See in [4], search for '2. eviction' comment from sima.
> >>
> >> Thanks for pointing that out. For reference here's Sima's comment:
> >>
> >> > 2. eviction
> >> >
> >> > Requirements much like migrate_to_ram, because otherwise we break the
> >> > migration gurantee:
> >> >
> >> > - Only looking at physical memory datastructures and locks, no looking at
> >> > mm/vma structs or relying on those being locked. We rely entirely on
> >> > reverse maps from try_to_migrate to find all the mappings on both cpu
> >> > and gpu side (cpu only zone device swap or migration pte entries ofc).
> >>
> >> I also very much agree with this. That's basically why I added
> >> migrate_device_range(), so that we can forcibly evict pages when the
> >> driver needs them freed (eg. driver unload, low memory, etc.). In
> >> general it is impossible to guarantee eviction og all pages using just
> >> hmm_range_fault().
> >>
> >
> > In this code path we don't have device pages available, hence the
> > purposed collection via hmm_range_fault.
>
> Why don't you have the pfns requiring eviction available? I need to read
> this series in more depth, but generally hmm_range_fault() can't
> gurantee you will find every device page.
>
There are two cases for eviction in my series:
1. TTM decides it needs to move memory. This calls
drm_gpusvm_evict_to_ram. In this case the device pfns are available
directly from drm_gpusvm_devmem, so the migrate_device_* calls can be
used here, albeit with the new function added in this patch as the
device pfns may be non-contiguous. (See the sketch below.)
2. An inconsistent state for a VA range occurs (mixed system and device
pages, partial unmap of a range, etc...). Here we want to evict the range
to ram to make the state consistent. No device pages are available due to
an intentional disconnect between a virtual range and the physical
drm_gpusvm_devmem, thus the device pages have to be looked up. This is
the function drm_gpusvm_range_evict. Based on what you tell me, it is
likely fine the way it was originally coded in v2 (vma lookup +
migrate_vma_*) rather than using hmm_range_fault like I have suggested
here.
Note #2 may be removed or become unnecessary at some point if we decide
to add support for inconsistent state in GPU SVM and Xe. Keeping it
simple for now though. See 'Ranges with mixed system and device pages'
in [5].
[5] https://patchwork.freedesktop.org/patch/619819/?series=137870&rev=2
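To make #1 a bit more concrete, here is roughly the shape I have in mind
(untested sketch only - error handling and the actual device-to-ram copy
are omitted, and the function name plus the devmem->npages / devmem->pfns
fields are stand-ins for however drm_gpusvm_devmem ends up storing its
pfns):

static int gpusvm_evict_to_ram_sketch(struct drm_gpusvm_devmem *devmem)
{
	unsigned long i, npages = devmem->npages;	/* stand-in field */
	unsigned long *src, *dst;

	src = kvcalloc(npages, 2 * sizeof(*src), GFP_KERNEL);
	if (!src)
		return -ENOMEM;
	dst = src + npages;

	/* device pfns come straight from devmem and may be non-contiguous */
	for (i = 0; i < npages; ++i)
		src[i] = devmem->pfns[i];	/* stand-in field */

	/* lock + unmap the device pages (the function added in this patch) */
	migrate_device_prepopulated_range(src, npages);

	/* allocate system pages for everything that can migrate */
	for (i = 0; i < npages; ++i) {
		struct page *dpage;

		if (!(src[i] & MIGRATE_PFN_MIGRATE))
			continue;

		dpage = alloc_page(GFP_HIGHUSER);
		if (!dpage)
			continue;	/* dst[i] stays 0, page restored in finalize */
		lock_page(dpage);
		dst[i] = migrate_pfn(page_to_pfn(dpage));
	}

	/* device -> ram copy of the migrating pages goes here */

	migrate_device_pages(src, dst, npages);
	migrate_device_finalize(src, dst, npages);
	kvfree(src);

	return 0;
}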
> >> > [3] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1110726
> >> > [4] https://lore.kernel.org/all/BYAPR11MB3159A304925168D8B6B4671292692@BYAPR11MB3159.namprd11.prod.outlook.com/T/#m89cd6a37778ba5271d5381ebeb03e1f963856a78
> >> >
> >> >> > It would also make the function exported in this patch unnecessary too
> >> >> > as non-contiguous pfns can be setup on driver side via
> >> >> > migrate_device_pfn_lock and then migrate_device_unmap can be called.
> >> >> > This also another eviction usage in GPUSVM, see drm_gpusvm_evict_to_ram
> >> >> > in [1].
> >> >> >
> >> >> > Do you see an issue exporting migrate_device_pfn_lock,
> >> >> > migrate_device_unmap?
> >> >>
> >> >> If there is a good justification for it I can't see a problem with
> >> >> exporting it. That said I don't really understand why you would
> >> >> want/need to split those steps up but I'll wait to see the code.
> >> >>
> >> >
> >> > It is so the device pages returned from hmm_range_fault, which are only
> >> > guaranteed to be valid under the notifier lock + a seqno check, to be
> >> > locked and ref taken for migration. migrate_device_unmap() can trigger a
> >> > MMU invalidation which takes the notifier lock thus calling the function
> >> > which combines migrate_device_pfn_lock + migrate_device_unmap deadlocks.
> >> >
> >> > I think this flow makes sense and agree in general this likely better
> >> > than looking at a CPUVMA.
> >>
> >> I'm still a bit confused about what is better with this flow if you are
> >> still calling hmm_range_fault(). How is it better than just calling
> >> migrate_vma_setup()? Obviously it will fault the pages in, but it seems
> >
> > The code in rev2 calls migrate_vma_setup but the requires a struct
> > vm_area_struct argument whereas hmm_range_fault does not.
>
> I'm not sure that's a good enough justfication because the problem isn't
> whether you're looking up vma's in driver code or mm code. The problem
> is you are looking up vma's at all and all that goes with that (mainly
> taking mmap lock, etc.)
>
> And for eviction hmm_range_fault() won't even find all the pages because
> their virtual address may have changed - consider what happens in cases
> of mremap(), fork(), etc. So eviction really needs physical pages
> (pfn's), not virtual addresses.
>
See above: for #1, yes, we use physical pages. For #2 it is about making
the state consistent within a virtual address range.
Matt
> >> we're talking about eviction here so I don't understand why that would
> >> be relevant. And hmm_range_fault() still requires the VMA, although I
> >> need to look at the patches more closely, probably CPUVMA is a DRM
> >
> > 'hmm_range_fault() still requires the VMA' internal yes, but again not
> > as argument. This is about avoiding a driver side lookup of the VMA.
> >
> > CPUVMA == struct vm_area_struct in this email.
>
> Thanks for the clarification.
>
> - Alistair
>
> > Matt
> >
> >> specific concept?
> >>
> >> Thanks.
> >>
> >> - Alistair
> >>
> >> > Matt
> >> >
> >> >> - Alistair
> >> >>
> >> >> > Matt
> >> >> >
> >> >> > [1] https://patchwork.freedesktop.org/patch/619809/?series=137870&rev=2
> >> >> >
> >> >> >> Matt
> >> >> >>
> >> >> >> > > + }
> >> >> >> > > +
> >> >> >> > > + migrate_device_unmap(src_pfns, npages, NULL);
> >> >> >> > > +
> >> >> >> > > + return 0;
> >> >> >> > > +}
> >> >> >> > > +EXPORT_SYMBOL(migrate_device_prepopulated_range);
> >> >> >> > > +
> >> >> >> > > /*
> >> >> >> > > * Migrate a device coherent folio back to normal memory. The caller should have
> >> >> >> > > * a reference on folio which will be copied to the new folio if migration is
> >> >> >> >
> >> >>
> >>
>
^ permalink raw reply [flat|nested] 129+ messages in thread* Re: [PATCH v2 02/29] mm/migrate: Add migrate_device_prepopulated_range
2024-10-17 15:40 ` Matthew Brost
@ 2024-10-17 21:58 ` Alistair Popple
2024-10-18 0:54 ` Matthew Brost
2024-10-18 4:02 ` Mika Penttilä
0 siblings, 2 replies; 129+ messages in thread
From: Alistair Popple @ 2024-10-17 21:58 UTC (permalink / raw)
To: Matthew Brost
Cc: intel-xe, dri-devel, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
Matthew Brost <matthew.brost@intel.com> writes:
> On Thu, Oct 17, 2024 at 04:49:11PM +1100, Alistair Popple wrote:
>>
>> Matthew Brost <matthew.brost@intel.com> writes:
>>
>> > On Thu, Oct 17, 2024 at 02:21:13PM +1100, Alistair Popple wrote:
>> >>
>> >> Matthew Brost <matthew.brost@intel.com> writes:
>> >>
>> >> > On Thu, Oct 17, 2024 at 12:49:55PM +1100, Alistair Popple wrote:
>> >> >>
>> >> >> Matthew Brost <matthew.brost@intel.com> writes:
>> >> >>
>> >> >> > On Wed, Oct 16, 2024 at 04:46:52AM +0000, Matthew Brost wrote:
>> >> >> >> On Wed, Oct 16, 2024 at 03:04:06PM +1100, Alistair Popple wrote:
>> >> >>
>> >> >> [...]
>> >> >>
>> >> >> >> > > +{
>> >> >> >> > > + unsigned long i;
>> >> >> >> > > +
>> >> >> >> > > + for (i = 0; i < npages; i++) {
>> >> >> >> > > + struct page *page = pfn_to_page(src_pfns[i]);
>> >> >> >> > > +
>> >> >> >> > > + if (!get_page_unless_zero(page)) {
>> >> >> >> > > + src_pfns[i] = 0;
>> >> >> >> > > + continue;
>> >> >> >> > > + }
>> >> >> >> > > +
>> >> >> >> > > + if (!trylock_page(page)) {
>> >> >> >> > > + src_pfns[i] = 0;
>> >> >> >> > > + put_page(page);
>> >> >> >> > > + continue;
>> >> >> >> > > + }
>> >> >> >> > > +
>> >> >> >> > > + src_pfns[i] = migrate_pfn(src_pfns[i]) | MIGRATE_PFN_MIGRATE;
>> >> >> >> >
>> >> >> >> > This needs to be converted to use a folio like
>> >> >> >> > migrate_device_range(). But more importantly this should be split out as
>> >> >> >> > a function that both migrate_device_range() and this function can call
>> >> >> >> > given this bit is identical.
>> >> >> >> >
>> >> >> >>
>> >> >> >> Missed the folio conversion and agree a helper shared between this
>> >> >> >> function and migrate_device_range would be a good idea. Let add that.
>> >> >> >>
>> >> >> >
>> >> >> > Alistair,
>> >> >> >
>> >> >> > Ok, I think now I want to go slightly different direction here to give
>> >> >> > GPUSVM a bit more control over several eviction scenarios.
>> >> >> >
>> >> >> > What if I exported the helper discussed above, e.g.,
>> >> >> >
>> >> >> > 905 unsigned long migrate_device_pfn_lock(unsigned long pfn)
>> >> >> > 906 {
>> >> >> > 907 struct folio *folio;
>> >> >> > 908
>> >> >> > 909 folio = folio_get_nontail_page(pfn_to_page(pfn));
>> >> >> > 910 if (!folio)
>> >> >> > 911 return 0;
>> >> >> > 912
>> >> >> > 913 if (!folio_trylock(folio)) {
>> >> >> > 914 folio_put(folio);
>> >> >> > 915 return 0;
>> >> >> > 916 }
>> >> >> > 917
>> >> >> > 918 return migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
>> >> >> > 919 }
>> >> >> > 920 EXPORT_SYMBOL(migrate_device_pfn_lock);
>> >> >> >
>> >> >> > And then also export migrate_device_unmap.
>> >> >> >
>> >> >> > The usage here would be let a driver collect the device pages in virtual
>> >> >> > address range via hmm_range_fault, lock device pages under notifier
>> >> >> > lock ensuring device pages are valid, drop the notifier lock and call
>> >> >> > migrate_device_unmap.
>> >> >>
>> >> >> I'm still working through this series but that seems a bit dubious, the
>> >> >> locking here is pretty subtle and easy to get wrong so seeing some code
>> >> >> would help me a lot in understanding what you're suggesting.
>> >> >>
>> >> >
>> >> > For sure locking in tricky, my mistake on not working through this
>> >> > before sending out the next rev but it came to mind after sending +
>> >> > regarding some late feedback from Thomas about using hmm for eviction
>> >> > [2]. His suggestion of using hmm_range_fault to trigger migration
>> >> > doesn't work for coherent pages, while something like below does.
>> >> >
>> >> > [2] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1125461
>> >> >
>> >> > Here is a snippet I have locally which seems to work.
>> >> >
>> >> > 2024 retry:
>> >> > 2025 hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
>> >> > 2026 hmm_range.hmm_pfns = src;
>> >> > 2027
>> >> > 2028 while (true) {
>> >> > 2029 mmap_read_lock(mm);
>> >> > 2030 err = hmm_range_fault(&hmm_range);
>> >> > 2031 mmap_read_unlock(mm);
>> >> > 2032 if (err == -EBUSY) {
>> >> > 2033 if (time_after(jiffies, timeout))
>> >> > 2034 break;
>> >> > 2035
>> >> > 2036 hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
>> >> > 2037 continue;
>> >> > 2038 }
>> >> > 2039 break;
>> >> > 2040 }
>> >> > 2041 if (err)
>> >> > 2042 goto err_put;
>> >> > 2043
>> >> > 2044 drm_gpusvm_notifier_lock(gpusvm);
>> >> > 2045 if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
>> >> > 2046 drm_gpusvm_notifier_unlock(gpusvm);
>> >> > 2047 memset(src, 0, sizeof(*src) * npages);
>> >> > 2048 goto retry;
>> >> > 2049 }
>> >> > 2050 for (i = 0; i < npages; ++i) {
>> >> > 2051 struct page *page = hmm_pfn_to_page(src[i]);
>> >> > 2052
>> >> > 2053 if (page && (is_device_private_page(page) ||
>> >> > 2054 is_device_coherent_page(page)) && page->zone_device_data)
>> >> > 2055 src[i] = src[i] & ~HMM_PFN_FLAGS;
>> >> > 2056 else
>> >> > 2057 src[i] = 0;
>> >> > 2058 if (src[i])
>> >> > 2059 src[i] = migrate_device_pfn_lock(src[i]);
>> >> > 2060 }
>> >> > 2061 drm_gpusvm_notifier_unlock(gpusvm);
>> >>
>> >> Practically for eviction isn't this much the same as calling
>> >> migrate_vma_setup()? And also for eviction as Sima mentioned you
>> >> probably shouldn't be looking at mm/vma structs.
>> >>
>> >
>> > hmm_range_fault is just collecting the pages, internally I suppose it
>> > does look at a VMA (struct vm_area_struct) but I think the point is
>> > drivers should not be looking at VMA here.
>>
>> migrate_vma_setup() is designed to be called by drivers and needs a vma,
>> so in general I don't see a problem with drivers looking up vma's. The
>> problem arises specifically for eviction and whether or not that happens
>> in the driver or hmm_range_fault() is pretty irrelevant IMHO for the
>> issues there (see below).
>>
>
> Ok, if you think it ok for drivers to lookup the VMA then purposed
> exporting of migrate_device_pfn_lock & migrate_device_unmap is not
> needed, rather just the original function exported in the this patch.
>
> More below too.
>
>> >> > 2063 migrate_device_unmap(src, npages, NULL);
>> >> > ...
>> >> > 2101 migrate_device_pages(src, dst, npages);
>> >> > 2102 migrate_device_finalize(src, dst, npages);
>> >> >
>> >> >
>> >> >> > Sima has strongly suggested avoiding a CPUVMA
>> >> >> > lookup during eviction cases and this would let me fixup
>> >> >> > drm_gpusvm_range_evict in [1] to avoid this.
>> >> >>
>> >> >> That sounds reasonable but for context do you have a link to the
>> >> >> comments/discussion on this? I couldn't readily find it, but I may have
>> >> >> just missed it.
>> >> >>
>> >> >
>> >> > See in [4], search for '2. eviction' comment from sima.
>> >>
>> >> Thanks for pointing that out. For reference here's Sima's comment:
>> >>
>> >> > 2. eviction
>> >> >
>> >> > Requirements much like migrate_to_ram, because otherwise we break the
>> >> > migration gurantee:
>> >> >
>> >> > - Only looking at physical memory datastructures and locks, no looking at
>> >> > mm/vma structs or relying on those being locked. We rely entirely on
>> >> > reverse maps from try_to_migrate to find all the mappings on both cpu
>> >> > and gpu side (cpu only zone device swap or migration pte entries ofc).
>> >>
>> >> I also very much agree with this. That's basically why I added
>> >> migrate_device_range(), so that we can forcibly evict pages when the
>> >> driver needs them freed (eg. driver unload, low memory, etc.). In
>> >> general it is impossible to guarantee eviction og all pages using just
>> >> hmm_range_fault().
>> >>
>> >
>> > In this code path we don't have device pages available, hence the
>> > purposed collection via hmm_range_fault.
>>
>> Why don't you have the pfns requiring eviction available? I need to read
>> this series in more depth, but generally hmm_range_fault() can't
>> gurantee you will find every device page.
>>
>
> There are two cases for eviction in my series:
>
> 1. TTM decides it needs to move memory. This calls
> drm_gpusvm_evict_to_ram. In this case the device pfns are available
> directly from drm_gpusvm_devmem so the migrate_device_* calls be used
> here albiet with the new function added in this patch as device pfns may
> be non-contiguous.
That makes sense and is generally what I think of when I'm thinking of
eviction. The new function makes sense too - migrate_device_range() was
primarily introduced to allow a driver to evict all device-private pages
from a GPU, so it didn't consider non-contiguous cases, etc.
> 2. An inconsistent state for VA range occurs (mixed system and device pages,
> partial unmap of a range, etc...). Here we want to evict the range ram
> to make the state consistent. No device pages are available due to an
> intentional disconnect between a virtual range and physical
> drm_gpusvm_devmem, thus the device pages have to be looked up. This the
> function drm_gpusvm_range_evict. Based on what you tell me, it likely is
> fine the way originally coded in v2 (vma lookup + migrate_vma_*) vs
> using hmm_range_fault like I have suggested here.
Thanks for the explanation. I think vma lookup + migrate_vma_setup() is
fine for this usage and is exactly what you want - it was designed to
select either all the system memory pages or all the device-private pages
within a VA range and migrate them.
FWIW I have toyed with the idea of a combined
hmm_range_fault()/migrate_vma_setup() front-end to the rest of the
migrate_vma_*() process but haven't come up with something nice as
yet. I don't think mixing the two in an open-coded fashion is a good
idea though; I'd rather we come up with a new API that addresses the
shortcomings of migrate_vma_setup().
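Just to give a flavour of the sort of thing I mean - this is purely a
hypothetical signature, nothing that exists today - it would take an mm
plus a VA range rather than a single vma and do the fault/collect/lock/
unmap steps internally:

/*
 * Hypothetical only: fault in a VA range and prepare any pages matching
 * @flags for migration, walking vmas internally so the caller never has
 * to touch vm_area_struct. Name and arguments are illustrative.
 */
int migrate_vma_range_setup(struct mm_struct *mm, unsigned long start,
			    unsigned long npages, unsigned long *src_pfns,
			    void *pgmap_owner, unsigned long flags);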
> Note #2 may be removed or unnecessary at some point if we decide to add
> support for ininconsistent state in GPU SVM and Xe. Keeping it simple for
> now though. See 'Ranges with mixed system and device pages' in [5].
>
> [5] https://patchwork.freedesktop.org/patch/619819/?series=137870&rev=2
>
>> >> > [3] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1110726
>> >> > [4] https://lore.kernel.org/all/BYAPR11MB3159A304925168D8B6B4671292692@BYAPR11MB3159.namprd11.prod.outlook.com/T/#m89cd6a37778ba5271d5381ebeb03e1f963856a78
>> >> >
>> >> >> > It would also make the function exported in this patch unnecessary too
>> >> >> > as non-contiguous pfns can be setup on driver side via
>> >> >> > migrate_device_pfn_lock and then migrate_device_unmap can be called.
>> >> >> > This also another eviction usage in GPUSVM, see drm_gpusvm_evict_to_ram
>> >> >> > in [1].
>> >> >> >
>> >> >> > Do you see an issue exporting migrate_device_pfn_lock,
>> >> >> > migrate_device_unmap?
>> >> >>
>> >> >> If there is a good justification for it I can't see a problem with
>> >> >> exporting it. That said I don't really understand why you would
>> >> >> want/need to split those steps up but I'll wait to see the code.
>> >> >>
>> >> >
>> >> > It is so the device pages returned from hmm_range_fault, which are only
>> >> > guaranteed to be valid under the notifier lock + a seqno check, to be
>> >> > locked and ref taken for migration. migrate_device_unmap() can trigger a
>> >> > MMU invalidation which takes the notifier lock thus calling the function
>> >> > which combines migrate_device_pfn_lock + migrate_device_unmap deadlocks.
>> >> >
>> >> > I think this flow makes sense and agree in general this likely better
>> >> > than looking at a CPUVMA.
>> >>
>> >> I'm still a bit confused about what is better with this flow if you are
>> >> still calling hmm_range_fault(). How is it better than just calling
>> >> migrate_vma_setup()? Obviously it will fault the pages in, but it seems
>> >
>> > The code in rev2 calls migrate_vma_setup but the requires a struct
>> > vm_area_struct argument whereas hmm_range_fault does not.
>>
>> I'm not sure that's a good enough justfication because the problem isn't
>> whether you're looking up vma's in driver code or mm code. The problem
>> is you are looking up vma's at all and all that goes with that (mainly
>> taking mmap lock, etc.)
>>
>> And for eviction hmm_range_fault() won't even find all the pages because
>> their virtual address may have changed - consider what happens in cases
>> of mremap(), fork(), etc. So eviction really needs physical pages
>> (pfn's), not virtual addresses.
>>
>
> See above, #1 yes we use a physical pages. For #2 it is about making the
> state consistent within a virtual address range.
Yep, makes sense now. For migration of physical pages you want
migrate_device_*, and for virtual address ranges you want migrate_vma_*.
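i.e. just summarising the two call sequences (copy steps and error
handling elided):

	/* physical pfns (eviction, device teardown): */
	migrate_device_prepopulated_range(src_pfns, npages);	/* or migrate_device_range() */
	/* ... allocate + copy destination pages ... */
	migrate_device_pages(src_pfns, dst_pfns, npages);
	migrate_device_finalize(src_pfns, dst_pfns, npages);

	/* virtual address range (fault paths, making a VA range consistent): */
	migrate_vma_setup(&args);	/* args.vma/start/end/flags set by the caller */
	/* ... allocate + copy destination pages ... */
	migrate_vma_pages(&args);
	migrate_vma_finalize(&args);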
- Alistair
> Matt
>
>> >> we're talking about eviction here so I don't understand why that would
>> >> be relevant. And hmm_range_fault() still requires the VMA, although I
>> >> need to look at the patches more closely, probably CPUVMA is a DRM
>> >
>> > 'hmm_range_fault() still requires the VMA' internal yes, but again not
>> > as argument. This is about avoiding a driver side lookup of the VMA.
>> >
>> > CPUVMA == struct vm_area_struct in this email.
>>
>> Thanks for the clarification.
>>
>> - Alistair
>>
>> > Matt
>> >
>> >> specific concept?
>> >>
>> >> Thanks.
>> >>
>> >> - Alistair
>> >>
>> >> > Matt
>> >> >
>> >> >> - Alistair
>> >> >>
>> >> >> > Matt
>> >> >> >
>> >> >> > [1] https://patchwork.freedesktop.org/patch/619809/?series=137870&rev=2
>> >> >> >
>> >> >> >> Matt
>> >> >> >>
>> >> >> >> > > + }
>> >> >> >> > > +
>> >> >> >> > > + migrate_device_unmap(src_pfns, npages, NULL);
>> >> >> >> > > +
>> >> >> >> > > + return 0;
>> >> >> >> > > +}
>> >> >> >> > > +EXPORT_SYMBOL(migrate_device_prepopulated_range);
>> >> >> >> > > +
>> >> >> >> > > /*
>> >> >> >> > > * Migrate a device coherent folio back to normal memory. The caller should have
>> >> >> >> > > * a reference on folio which will be copied to the new folio if migration is
>> >> >> >> >
>> >> >>
>> >>
>>
^ permalink raw reply [flat|nested] 129+ messages in thread* Re: [PATCH v2 02/29] mm/migrate: Add migrate_device_prepopulated_range
2024-10-17 21:58 ` Alistair Popple
@ 2024-10-18 0:54 ` Matthew Brost
2024-10-18 5:59 ` Alistair Popple
2024-10-18 4:02 ` Mika Penttilä
1 sibling, 1 reply; 129+ messages in thread
From: Matthew Brost @ 2024-10-18 0:54 UTC (permalink / raw)
To: Alistair Popple
Cc: intel-xe, dri-devel, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
On Fri, Oct 18, 2024 at 08:58:02AM +1100, Alistair Popple wrote:
>
> Matthew Brost <matthew.brost@intel.com> writes:
>
> > On Thu, Oct 17, 2024 at 04:49:11PM +1100, Alistair Popple wrote:
> >>
> >> Matthew Brost <matthew.brost@intel.com> writes:
> >>
> >> > On Thu, Oct 17, 2024 at 02:21:13PM +1100, Alistair Popple wrote:
> >> >>
> >> >> Matthew Brost <matthew.brost@intel.com> writes:
> >> >>
> >> >> > On Thu, Oct 17, 2024 at 12:49:55PM +1100, Alistair Popple wrote:
> >> >> >>
> >> >> >> Matthew Brost <matthew.brost@intel.com> writes:
> >> >> >>
> >> >> >> > On Wed, Oct 16, 2024 at 04:46:52AM +0000, Matthew Brost wrote:
> >> >> >> >> On Wed, Oct 16, 2024 at 03:04:06PM +1100, Alistair Popple wrote:
> >> >> >>
> >> >> >> [...]
> >> >> >>
> >> >> >> >> > > +{
> >> >> >> >> > > + unsigned long i;
> >> >> >> >> > > +
> >> >> >> >> > > + for (i = 0; i < npages; i++) {
> >> >> >> >> > > + struct page *page = pfn_to_page(src_pfns[i]);
> >> >> >> >> > > +
> >> >> >> >> > > + if (!get_page_unless_zero(page)) {
> >> >> >> >> > > + src_pfns[i] = 0;
> >> >> >> >> > > + continue;
> >> >> >> >> > > + }
> >> >> >> >> > > +
> >> >> >> >> > > + if (!trylock_page(page)) {
> >> >> >> >> > > + src_pfns[i] = 0;
> >> >> >> >> > > + put_page(page);
> >> >> >> >> > > + continue;
> >> >> >> >> > > + }
> >> >> >> >> > > +
> >> >> >> >> > > + src_pfns[i] = migrate_pfn(src_pfns[i]) | MIGRATE_PFN_MIGRATE;
> >> >> >> >> >
> >> >> >> >> > This needs to be converted to use a folio like
> >> >> >> >> > migrate_device_range(). But more importantly this should be split out as
> >> >> >> >> > a function that both migrate_device_range() and this function can call
> >> >> >> >> > given this bit is identical.
> >> >> >> >> >
> >> >> >> >>
> >> >> >> >> Missed the folio conversion and agree a helper shared between this
> >> >> >> >> function and migrate_device_range would be a good idea. Let add that.
> >> >> >> >>
> >> >> >> >
> >> >> >> > Alistair,
> >> >> >> >
> >> >> >> > Ok, I think now I want to go slightly different direction here to give
> >> >> >> > GPUSVM a bit more control over several eviction scenarios.
> >> >> >> >
> >> >> >> > What if I exported the helper discussed above, e.g.,
> >> >> >> >
> >> >> >> > 905 unsigned long migrate_device_pfn_lock(unsigned long pfn)
> >> >> >> > 906 {
> >> >> >> > 907 struct folio *folio;
> >> >> >> > 908
> >> >> >> > 909 folio = folio_get_nontail_page(pfn_to_page(pfn));
> >> >> >> > 910 if (!folio)
> >> >> >> > 911 return 0;
> >> >> >> > 912
> >> >> >> > 913 if (!folio_trylock(folio)) {
> >> >> >> > 914 folio_put(folio);
> >> >> >> > 915 return 0;
> >> >> >> > 916 }
> >> >> >> > 917
> >> >> >> > 918 return migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
> >> >> >> > 919 }
> >> >> >> > 920 EXPORT_SYMBOL(migrate_device_pfn_lock);
> >> >> >> >
> >> >> >> > And then also export migrate_device_unmap.
> >> >> >> >
> >> >> >> > The usage here would be let a driver collect the device pages in virtual
> >> >> >> > address range via hmm_range_fault, lock device pages under notifier
> >> >> >> > lock ensuring device pages are valid, drop the notifier lock and call
> >> >> >> > migrate_device_unmap.
> >> >> >>
> >> >> >> I'm still working through this series but that seems a bit dubious, the
> >> >> >> locking here is pretty subtle and easy to get wrong so seeing some code
> >> >> >> would help me a lot in understanding what you're suggesting.
> >> >> >>
> >> >> >
> >> >> > For sure locking in tricky, my mistake on not working through this
> >> >> > before sending out the next rev but it came to mind after sending +
> >> >> > regarding some late feedback from Thomas about using hmm for eviction
> >> >> > [2]. His suggestion of using hmm_range_fault to trigger migration
> >> >> > doesn't work for coherent pages, while something like below does.
> >> >> >
> >> >> > [2] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1125461
> >> >> >
> >> >> > Here is a snippet I have locally which seems to work.
> >> >> >
> >> >> > 2024 retry:
> >> >> > 2025 hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
> >> >> > 2026 hmm_range.hmm_pfns = src;
> >> >> > 2027
> >> >> > 2028 while (true) {
> >> >> > 2029 mmap_read_lock(mm);
> >> >> > 2030 err = hmm_range_fault(&hmm_range);
> >> >> > 2031 mmap_read_unlock(mm);
> >> >> > 2032 if (err == -EBUSY) {
> >> >> > 2033 if (time_after(jiffies, timeout))
> >> >> > 2034 break;
> >> >> > 2035
> >> >> > 2036 hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
> >> >> > 2037 continue;
> >> >> > 2038 }
> >> >> > 2039 break;
> >> >> > 2040 }
> >> >> > 2041 if (err)
> >> >> > 2042 goto err_put;
> >> >> > 2043
> >> >> > 2044 drm_gpusvm_notifier_lock(gpusvm);
> >> >> > 2045 if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
> >> >> > 2046 drm_gpusvm_notifier_unlock(gpusvm);
> >> >> > 2047 memset(src, 0, sizeof(*src) * npages);
> >> >> > 2048 goto retry;
> >> >> > 2049 }
> >> >> > 2050 for (i = 0; i < npages; ++i) {
> >> >> > 2051 struct page *page = hmm_pfn_to_page(src[i]);
> >> >> > 2052
> >> >> > 2053 if (page && (is_device_private_page(page) ||
> >> >> > 2054 is_device_coherent_page(page)) && page->zone_device_data)
> >> >> > 2055 src[i] = src[i] & ~HMM_PFN_FLAGS;
> >> >> > 2056 else
> >> >> > 2057 src[i] = 0;
> >> >> > 2058 if (src[i])
> >> >> > 2059 src[i] = migrate_device_pfn_lock(src[i]);
> >> >> > 2060 }
> >> >> > 2061 drm_gpusvm_notifier_unlock(gpusvm);
> >> >>
> >> >> Practically for eviction isn't this much the same as calling
> >> >> migrate_vma_setup()? And also for eviction as Sima mentioned you
> >> >> probably shouldn't be looking at mm/vma structs.
> >> >>
> >> >
> >> > hmm_range_fault is just collecting the pages, internally I suppose it
> >> > does look at a VMA (struct vm_area_struct) but I think the point is
> >> > drivers should not be looking at VMA here.
> >>
> >> migrate_vma_setup() is designed to be called by drivers and needs a vma,
> >> so in general I don't see a problem with drivers looking up vma's. The
> >> problem arises specifically for eviction and whether or not that happens
> >> in the driver or hmm_range_fault() is pretty irrelevant IMHO for the
> >> issues there (see below).
> >>
> >
> > Ok, if you think it ok for drivers to lookup the VMA then purposed
> > exporting of migrate_device_pfn_lock & migrate_device_unmap is not
> > needed, rather just the original function exported in the this patch.
> >
> > More below too.
> >
> >> >> > 2063 migrate_device_unmap(src, npages, NULL);
> >> >> > ...
> >> >> > 2101 migrate_device_pages(src, dst, npages);
> >> >> > 2102 migrate_device_finalize(src, dst, npages);
> >> >> >
> >> >> >
> >> >> >> > Sima has strongly suggested avoiding a CPUVMA
> >> >> >> > lookup during eviction cases and this would let me fixup
> >> >> >> > drm_gpusvm_range_evict in [1] to avoid this.
> >> >> >>
> >> >> >> That sounds reasonable but for context do you have a link to the
> >> >> >> comments/discussion on this? I couldn't readily find it, but I may have
> >> >> >> just missed it.
> >> >> >>
> >> >> >
> >> >> > See in [4], search for '2. eviction' comment from sima.
> >> >>
> >> >> Thanks for pointing that out. For reference here's Sima's comment:
> >> >>
> >> >> > 2. eviction
> >> >> >
> >> >> > Requirements much like migrate_to_ram, because otherwise we break the
> >> >> > migration gurantee:
> >> >> >
> >> >> > - Only looking at physical memory datastructures and locks, no looking at
> >> >> > mm/vma structs or relying on those being locked. We rely entirely on
> >> >> > reverse maps from try_to_migrate to find all the mappings on both cpu
> >> >> > and gpu side (cpu only zone device swap or migration pte entries ofc).
> >> >>
> >> >> I also very much agree with this. That's basically why I added
> >> >> migrate_device_range(), so that we can forcibly evict pages when the
> >> >> driver needs them freed (eg. driver unload, low memory, etc.). In
> >> >> general it is impossible to guarantee eviction og all pages using just
> >> >> hmm_range_fault().
> >> >>
> >> >
> >> > In this code path we don't have device pages available, hence the
> >> > purposed collection via hmm_range_fault.
> >>
> >> Why don't you have the pfns requiring eviction available? I need to read
> >> this series in more depth, but generally hmm_range_fault() can't
> >> gurantee you will find every device page.
> >>
> >
> > There are two cases for eviction in my series:
> >
> > 1. TTM decides it needs to move memory. This calls
> > drm_gpusvm_evict_to_ram. In this case the device pfns are available
> > directly from drm_gpusvm_devmem so the migrate_device_* calls be used
> > here albiet with the new function added in this patch as device pfns may
> > be non-contiguous.
>
> That makes sense and is generally what I think of when I'm thinking of
> eviction. The new function makes sense too - migrate_device_range() was
> primarily introduced to allow a driver to evict all device-private pages
> from a GPU so didn't consider non-contiguous cases, etc.
>
> > 2. An inconsistent state for VA range occurs (mixed system and device pages,
> > partial unmap of a range, etc...). Here we want to evict the range ram
> > to make the state consistent. No device pages are available due to an
> > intentional disconnect between a virtual range and physical
> > drm_gpusvm_devmem, thus the device pages have to be looked up. This the
> > function drm_gpusvm_range_evict. Based on what you tell me, it likely is
> > fine the way originally coded in v2 (vma lookup + migrate_vma_*) vs
> > using hmm_range_fault like I have suggested here.
>
> Thanks for the explanation. I think vma lookup + migrate_vma_setup() is
> fine for this usage and is exactly what you want - it was designed to
> either select all the system memory pages or device-private pages within
> a VA range and migrate them.
>
> FWIW I have toyed with the idea of a combined
> hmm_range_fault()/migrate_vma_setup() front-end to the rest of the
> migrate_vma_*() process but haven't come up with something nice as
> yet. I don't think mixing the two in an open-coded fashion is a good
> idea though, I'd rather we come up with a new API that addresses the
> short-comings of migrate_vma_setup().
>
I think that would be good. Here we actually need to look up multiple VMAs
and have a sequence of migrate_vma_* calls, as it is possible for the VMAs
to have changed after the driver range was created. It might be nice to
hide the VMA lookup from the drivers with an API that collects and
migrates all pages of a type in a VA range, much like hmm_range_fault. If
the range spans multiple VMAs, that would be hidden from the caller.
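To illustrate, this is the driver-side shape I would rather not open-code
in every driver (sketch only - chunking of the src/dst arrays, holes
between VMAs and error unwinding are all ignored, and 'owner' is a
stand-in for whatever we pass as the pgmap owner):

	struct migrate_vma args = {};
	struct vm_area_struct *vas;
	unsigned long addr = start;
	int err = 0;

	mmap_read_lock(mm);
	while (addr < end) {
		vas = vma_lookup(mm, addr);
		if (!vas) {
			err = -EFAULT;	/* sketch: holes not handled */
			break;
		}

		args.vma = vas;
		args.start = addr;
		args.end = min(vas->vm_end, end);
		args.src = src;
		args.dst = dst;
		args.pgmap_owner = owner;
		args.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE |
			     MIGRATE_VMA_SELECT_DEVICE_COHERENT;

		err = migrate_vma_setup(&args);
		if (err)
			break;

		/* allocate + copy destination pages for args.cpages here */

		migrate_vma_pages(&args);
		migrate_vma_finalize(&args);

		addr = args.end;
	}
	mmap_read_unlock(mm);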
Matt
> > Note #2 may be removed or unnecessary at some point if we decide to add
> > support for ininconsistent state in GPU SVM and Xe. Keeping it simple for
> > now though. See 'Ranges with mixed system and device pages' in [5].
> >
> > [5] https://patchwork.freedesktop.org/patch/619819/?series=137870&rev=2
> >
> >> >> > [3] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1110726
> >> >> > [4] https://lore.kernel.org/all/BYAPR11MB3159A304925168D8B6B4671292692@BYAPR11MB3159.namprd11.prod.outlook.com/T/#m89cd6a37778ba5271d5381ebeb03e1f963856a78
> >> >> >
> >> >> >> > It would also make the function exported in this patch unnecessary too
> >> >> >> > as non-contiguous pfns can be setup on driver side via
> >> >> >> > migrate_device_pfn_lock and then migrate_device_unmap can be called.
> >> >> >> > This also another eviction usage in GPUSVM, see drm_gpusvm_evict_to_ram
> >> >> >> > in [1].
> >> >> >> >
> >> >> >> > Do you see an issue exporting migrate_device_pfn_lock,
> >> >> >> > migrate_device_unmap?
> >> >> >>
> >> >> >> If there is a good justification for it I can't see a problem with
> >> >> >> exporting it. That said I don't really understand why you would
> >> >> >> want/need to split those steps up but I'll wait to see the code.
> >> >> >>
> >> >> >
> >> >> > It is so the device pages returned from hmm_range_fault, which are only
> >> >> > guaranteed to be valid under the notifier lock + a seqno check, to be
> >> >> > locked and ref taken for migration. migrate_device_unmap() can trigger a
> >> >> > MMU invalidation which takes the notifier lock thus calling the function
> >> >> > which combines migrate_device_pfn_lock + migrate_device_unmap deadlocks.
> >> >> >
> >> >> > I think this flow makes sense and agree in general this likely better
> >> >> > than looking at a CPUVMA.
> >> >>
> >> >> I'm still a bit confused about what is better with this flow if you are
> >> >> still calling hmm_range_fault(). How is it better than just calling
> >> >> migrate_vma_setup()? Obviously it will fault the pages in, but it seems
> >> >
> >> > The code in rev2 calls migrate_vma_setup but the requires a struct
> >> > vm_area_struct argument whereas hmm_range_fault does not.
> >>
> >> I'm not sure that's a good enough justfication because the problem isn't
> >> whether you're looking up vma's in driver code or mm code. The problem
> >> is you are looking up vma's at all and all that goes with that (mainly
> >> taking mmap lock, etc.)
> >>
> >> And for eviction hmm_range_fault() won't even find all the pages because
> >> their virtual address may have changed - consider what happens in cases
> >> of mremap(), fork(), etc. So eviction really needs physical pages
> >> (pfn's), not virtual addresses.
> >>
> >
> > See above, #1 yes we use a physical pages. For #2 it is about making the
> > state consistent within a virtual address range.
>
> Yep, makes sense now. For migration of physical pages you want
> migrate_device_*, virtual address ranges want migrate_vma_*
>
> - Alistair
>
> > Matt
> >
> >> >> we're talking about eviction here so I don't understand why that would
> >> >> be relevant. And hmm_range_fault() still requires the VMA, although I
> >> >> need to look at the patches more closely, probably CPUVMA is a DRM
> >> >
> >> > 'hmm_range_fault() still requires the VMA' internal yes, but again not
> >> > as argument. This is about avoiding a driver side lookup of the VMA.
> >> >
> >> > CPUVMA == struct vm_area_struct in this email.
> >>
> >> Thanks for the clarification.
> >>
> >> - Alistair
> >>
> >> > Matt
> >> >
> >> >> specific concept?
> >> >>
> >> >> Thanks.
> >> >>
> >> >> - Alistair
> >> >>
> >> >> > Matt
> >> >> >
> >> >> >> - Alistair
> >> >> >>
> >> >> >> > Matt
> >> >> >> >
> >> >> >> > [1] https://patchwork.freedesktop.org/patch/619809/?series=137870&rev=2
> >> >> >> >
> >> >> >> >> Matt
> >> >> >> >>
> >> >> >> >> > > + }
> >> >> >> >> > > +
> >> >> >> >> > > + migrate_device_unmap(src_pfns, npages, NULL);
> >> >> >> >> > > +
> >> >> >> >> > > + return 0;
> >> >> >> >> > > +}
> >> >> >> >> > > +EXPORT_SYMBOL(migrate_device_prepopulated_range);
> >> >> >> >> > > +
> >> >> >> >> > > /*
> >> >> >> >> > > * Migrate a device coherent folio back to normal memory. The caller should have
> >> >> >> >> > > * a reference on folio which will be copied to the new folio if migration is
> >> >> >> >> >
> >> >> >>
> >> >>
> >>
>
^ permalink raw reply [flat|nested] 129+ messages in thread* Re: [PATCH v2 02/29] mm/migrate: Add migrate_device_prepopulated_range
2024-10-18 0:54 ` Matthew Brost
@ 2024-10-18 5:59 ` Alistair Popple
2024-10-18 6:39 ` Mika Penttilä
2024-10-18 7:16 ` Matthew Brost
0 siblings, 2 replies; 129+ messages in thread
From: Alistair Popple @ 2024-10-18 5:59 UTC (permalink / raw)
To: Matthew Brost
Cc: intel-xe, dri-devel, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
Matthew Brost <matthew.brost@intel.com> writes:
> On Fri, Oct 18, 2024 at 08:58:02AM +1100, Alistair Popple wrote:
>>
>> Matthew Brost <matthew.brost@intel.com> writes:
>>
>> > On Thu, Oct 17, 2024 at 04:49:11PM +1100, Alistair Popple wrote:
>> >>
>> >> Matthew Brost <matthew.brost@intel.com> writes:
>> >>
>> >> > On Thu, Oct 17, 2024 at 02:21:13PM +1100, Alistair Popple wrote:
>> >> >>
>> >> >> Matthew Brost <matthew.brost@intel.com> writes:
>> >> >>
>> >> >> > On Thu, Oct 17, 2024 at 12:49:55PM +1100, Alistair Popple wrote:
>> >> >> >>
>> >> >> >> Matthew Brost <matthew.brost@intel.com> writes:
>> >> >> >>
>> >> >> >> > On Wed, Oct 16, 2024 at 04:46:52AM +0000, Matthew Brost wrote:
>> >> >> >> >> On Wed, Oct 16, 2024 at 03:04:06PM +1100, Alistair Popple wrote:
>> >> >> >>
>> >> >> >> [...]
>> >> >> >>
>> >> >> >> >> > > +{
>> >> >> >> >> > > + unsigned long i;
>> >> >> >> >> > > +
>> >> >> >> >> > > + for (i = 0; i < npages; i++) {
>> >> >> >> >> > > + struct page *page = pfn_to_page(src_pfns[i]);
>> >> >> >> >> > > +
>> >> >> >> >> > > + if (!get_page_unless_zero(page)) {
>> >> >> >> >> > > + src_pfns[i] = 0;
>> >> >> >> >> > > + continue;
>> >> >> >> >> > > + }
>> >> >> >> >> > > +
>> >> >> >> >> > > + if (!trylock_page(page)) {
>> >> >> >> >> > > + src_pfns[i] = 0;
>> >> >> >> >> > > + put_page(page);
>> >> >> >> >> > > + continue;
>> >> >> >> >> > > + }
>> >> >> >> >> > > +
>> >> >> >> >> > > + src_pfns[i] = migrate_pfn(src_pfns[i]) | MIGRATE_PFN_MIGRATE;
>> >> >> >> >> >
>> >> >> >> >> > This needs to be converted to use a folio like
>> >> >> >> >> > migrate_device_range(). But more importantly this should be split out as
>> >> >> >> >> > a function that both migrate_device_range() and this function can call
>> >> >> >> >> > given this bit is identical.
>> >> >> >> >> >
>> >> >> >> >>
>> >> >> >> >> Missed the folio conversion and agree a helper shared between this
>> >> >> >> >> function and migrate_device_range would be a good idea. Let add that.
>> >> >> >> >>
>> >> >> >> >
>> >> >> >> > Alistair,
>> >> >> >> >
>> >> >> >> > Ok, I think now I want to go slightly different direction here to give
>> >> >> >> > GPUSVM a bit more control over several eviction scenarios.
>> >> >> >> >
>> >> >> >> > What if I exported the helper discussed above, e.g.,
>> >> >> >> >
>> >> >> >> > 905 unsigned long migrate_device_pfn_lock(unsigned long pfn)
>> >> >> >> > 906 {
>> >> >> >> > 907 struct folio *folio;
>> >> >> >> > 908
>> >> >> >> > 909 folio = folio_get_nontail_page(pfn_to_page(pfn));
>> >> >> >> > 910 if (!folio)
>> >> >> >> > 911 return 0;
>> >> >> >> > 912
>> >> >> >> > 913 if (!folio_trylock(folio)) {
>> >> >> >> > 914 folio_put(folio);
>> >> >> >> > 915 return 0;
>> >> >> >> > 916 }
>> >> >> >> > 917
>> >> >> >> > 918 return migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
>> >> >> >> > 919 }
>> >> >> >> > 920 EXPORT_SYMBOL(migrate_device_pfn_lock);
>> >> >> >> >
>> >> >> >> > And then also export migrate_device_unmap.
>> >> >> >> >
>> >> >> >> > The usage here would be let a driver collect the device pages in virtual
>> >> >> >> > address range via hmm_range_fault, lock device pages under notifier
>> >> >> >> > lock ensuring device pages are valid, drop the notifier lock and call
>> >> >> >> > migrate_device_unmap.
>> >> >> >>
>> >> >> >> I'm still working through this series but that seems a bit dubious, the
>> >> >> >> locking here is pretty subtle and easy to get wrong so seeing some code
>> >> >> >> would help me a lot in understanding what you're suggesting.
>> >> >> >>
>> >> >> >
>> >> >> > For sure locking in tricky, my mistake on not working through this
>> >> >> > before sending out the next rev but it came to mind after sending +
>> >> >> > regarding some late feedback from Thomas about using hmm for eviction
>> >> >> > [2]. His suggestion of using hmm_range_fault to trigger migration
>> >> >> > doesn't work for coherent pages, while something like below does.
>> >> >> >
>> >> >> > [2] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1125461
>> >> >> >
>> >> >> > Here is a snippet I have locally which seems to work.
>> >> >> >
>> >> >> > 2024 retry:
>> >> >> > 2025 hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
>> >> >> > 2026 hmm_range.hmm_pfns = src;
>> >> >> > 2027
>> >> >> > 2028 while (true) {
>> >> >> > 2029 mmap_read_lock(mm);
>> >> >> > 2030 err = hmm_range_fault(&hmm_range);
>> >> >> > 2031 mmap_read_unlock(mm);
>> >> >> > 2032 if (err == -EBUSY) {
>> >> >> > 2033 if (time_after(jiffies, timeout))
>> >> >> > 2034 break;
>> >> >> > 2035
>> >> >> > 2036 hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
>> >> >> > 2037 continue;
>> >> >> > 2038 }
>> >> >> > 2039 break;
>> >> >> > 2040 }
>> >> >> > 2041 if (err)
>> >> >> > 2042 goto err_put;
>> >> >> > 2043
>> >> >> > 2044 drm_gpusvm_notifier_lock(gpusvm);
>> >> >> > 2045 if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
>> >> >> > 2046 drm_gpusvm_notifier_unlock(gpusvm);
>> >> >> > 2047 memset(src, 0, sizeof(*src) * npages);
>> >> >> > 2048 goto retry;
>> >> >> > 2049 }
>> >> >> > 2050 for (i = 0; i < npages; ++i) {
>> >> >> > 2051 struct page *page = hmm_pfn_to_page(src[i]);
>> >> >> > 2052
>> >> >> > 2053 if (page && (is_device_private_page(page) ||
>> >> >> > 2054 is_device_coherent_page(page)) && page->zone_device_data)
>> >> >> > 2055 src[i] = src[i] & ~HMM_PFN_FLAGS;
>> >> >> > 2056 else
>> >> >> > 2057 src[i] = 0;
>> >> >> > 2058 if (src[i])
>> >> >> > 2059 src[i] = migrate_device_pfn_lock(src[i]);
>> >> >> > 2060 }
>> >> >> > 2061 drm_gpusvm_notifier_unlock(gpusvm);
>> >> >>
>> >> >> Practically for eviction isn't this much the same as calling
>> >> >> migrate_vma_setup()? And also for eviction as Sima mentioned you
>> >> >> probably shouldn't be looking at mm/vma structs.
>> >> >>
>> >> >
>> >> > hmm_range_fault is just collecting the pages, internally I suppose it
>> >> > does look at a VMA (struct vm_area_struct) but I think the point is
>> >> > drivers should not be looking at VMA here.
>> >>
>> >> migrate_vma_setup() is designed to be called by drivers and needs a vma,
>> >> so in general I don't see a problem with drivers looking up vma's. The
>> >> problem arises specifically for eviction and whether or not that happens
>> >> in the driver or hmm_range_fault() is pretty irrelevant IMHO for the
>> >> issues there (see below).
>> >>
>> >
>> > Ok, if you think it ok for drivers to lookup the VMA then purposed
>> > exporting of migrate_device_pfn_lock & migrate_device_unmap is not
>> > needed, rather just the original function exported in the this patch.
>> >
>> > More below too.
>> >
>> >> >> > 2063 migrate_device_unmap(src, npages, NULL);
>> >> >> > ...
>> >> >> > 2101 migrate_device_pages(src, dst, npages);
>> >> >> > 2102 migrate_device_finalize(src, dst, npages);
>> >> >> >
>> >> >> >
>> >> >> >> > Sima has strongly suggested avoiding a CPUVMA
>> >> >> >> > lookup during eviction cases and this would let me fixup
>> >> >> >> > drm_gpusvm_range_evict in [1] to avoid this.
>> >> >> >>
>> >> >> >> That sounds reasonable but for context do you have a link to the
>> >> >> >> comments/discussion on this? I couldn't readily find it, but I may have
>> >> >> >> just missed it.
>> >> >> >>
>> >> >> >
>> >> >> > See in [4], search for '2. eviction' comment from sima.
>> >> >>
>> >> >> Thanks for pointing that out. For reference here's Sima's comment:
>> >> >>
>> >> >> > 2. eviction
>> >> >> >
>> >> >> > Requirements much like migrate_to_ram, because otherwise we break the
>> >> >> > migration gurantee:
>> >> >> >
>> >> >> > - Only looking at physical memory datastructures and locks, no looking at
>> >> >> > mm/vma structs or relying on those being locked. We rely entirely on
>> >> >> > reverse maps from try_to_migrate to find all the mappings on both cpu
>> >> >> > and gpu side (cpu only zone device swap or migration pte entries ofc).
>> >> >>
>> >> >> I also very much agree with this. That's basically why I added
>> >> >> migrate_device_range(), so that we can forcibly evict pages when the
>> >> >> driver needs them freed (eg. driver unload, low memory, etc.). In
>> >> >> general it is impossible to guarantee eviction og all pages using just
>> >> >> hmm_range_fault().
>> >> >>
>> >> >
>> >> > In this code path we don't have device pages available, hence the
>> >> > purposed collection via hmm_range_fault.
>> >>
>> >> Why don't you have the pfns requiring eviction available? I need to read
>> >> this series in more depth, but generally hmm_range_fault() can't
>> >> gurantee you will find every device page.
>> >>
>> >
>> > There are two cases for eviction in my series:
>> >
>> > 1. TTM decides it needs to move memory. This calls
>> > drm_gpusvm_evict_to_ram. In this case the device pfns are available
>> > directly from drm_gpusvm_devmem so the migrate_device_* calls be used
>> > here albiet with the new function added in this patch as device pfns may
>> > be non-contiguous.
>>
>> That makes sense and is generally what I think of when I'm thinking of
>> eviction. The new function makes sense too - migrate_device_range() was
>> primarily introduced to allow a driver to evict all device-private pages
>> from a GPU so didn't consider non-contiguous cases, etc.
>>
>> > 2. An inconsistent state for VA range occurs (mixed system and device pages,
>> > partial unmap of a range, etc...). Here we want to evict the range ram
>> > to make the state consistent. No device pages are available due to an
>> > intentional disconnect between a virtual range and physical
>> > drm_gpusvm_devmem, thus the device pages have to be looked up. This the
>> > function drm_gpusvm_range_evict. Based on what you tell me, it likely is
>> > fine the way originally coded in v2 (vma lookup + migrate_vma_*) vs
>> > using hmm_range_fault like I have suggested here.
>>
>> Thanks for the explanation. I think vma lookup + migrate_vma_setup() is
>> fine for this usage and is exactly what you want - it was designed to
>> either select all the system memory pages or device-private pages within
>> a VA range and migrate them.
>>
>> FWIW I have toyed with the idea of a combined
>> hmm_range_fault()/migrate_vma_setup() front-end to the rest of the
>> migrate_vma_*() process but haven't come up with something nice as
>> yet. I don't think mixing the two in an open-coded fashion is a good
>> idea though, I'd rather we come up with a new API that addresses the
>> short-comings of migrate_vma_setup().
>>
>
> I think that would good. Here we actually need to lookup multiple VMAs
> and have a sequence of migrate_vma_* calls as it possible for VMAs to
> have changed after the driver range was created. It might be nice to
> hide the VMA lookup from the drivers with an API saying collect and
> migrate all pages of a type in a VA range much like hmm_range_fault. If
> the range spans multiple VMAs that would be hidden from the caller.
Ok. I wasn't really considering multiple VMAs. UVM and Nouveau don't
really have a requirement to migrate across multiple VMAs, but if that's
necessary I think an API that hides that specifically for working with
migrate_vma_*() might make sense.
> Matt
>
>> > Note #2 may be removed or unnecessary at some point if we decide to add
>> > support for ininconsistent state in GPU SVM and Xe. Keeping it simple for
>> > now though. See 'Ranges with mixed system and device pages' in [5].
As someone not very familiar with some of the DRM layers, can I ask why
having virtual address ranges with a mix of system and device pages is
hard to support? It seems to me that in practice it might be quite
difficult to keep a VMA range exclusively all in system memory or all
in device memory.
>> > [5] https://patchwork.freedesktop.org/patch/619819/?series=137870&rev=2
>> >
>> >> >> > [3] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1110726
>> >> >> > [4] https://lore.kernel.org/all/BYAPR11MB3159A304925168D8B6B4671292692@BYAPR11MB3159.namprd11.prod.outlook.com/T/#m89cd6a37778ba5271d5381ebeb03e1f963856a78
>> >> >> >
>> >> >> >> > It would also make the function exported in this patch unnecessary too
>> >> >> >> > as non-contiguous pfns can be setup on driver side via
>> >> >> >> > migrate_device_pfn_lock and then migrate_device_unmap can be called.
>> >> >> >> > This also another eviction usage in GPUSVM, see drm_gpusvm_evict_to_ram
>> >> >> >> > in [1].
>> >> >> >> >
>> >> >> >> > Do you see an issue exporting migrate_device_pfn_lock,
>> >> >> >> > migrate_device_unmap?
>> >> >> >>
>> >> >> >> If there is a good justification for it I can't see a problem with
>> >> >> >> exporting it. That said I don't really understand why you would
>> >> >> >> want/need to split those steps up but I'll wait to see the code.
>> >> >> >>
>> >> >> >
>> >> >> > It is so the device pages returned from hmm_range_fault, which are only
>> >> >> > guaranteed to be valid under the notifier lock + a seqno check, to be
>> >> >> > locked and ref taken for migration. migrate_device_unmap() can trigger a
>> >> >> > MMU invalidation which takes the notifier lock thus calling the function
>> >> >> > which combines migrate_device_pfn_lock + migrate_device_unmap deadlocks.
>> >> >> >
>> >> >> > I think this flow makes sense and agree in general this likely better
>> >> >> > than looking at a CPUVMA.
>> >> >>
>> >> >> I'm still a bit confused about what is better with this flow if you are
>> >> >> still calling hmm_range_fault(). How is it better than just calling
>> >> >> migrate_vma_setup()? Obviously it will fault the pages in, but it seems
>> >> >
>> >> > The code in rev2 calls migrate_vma_setup but the requires a struct
>> >> > vm_area_struct argument whereas hmm_range_fault does not.
>> >>
>> >> I'm not sure that's a good enough justfication because the problem isn't
>> >> whether you're looking up vma's in driver code or mm code. The problem
>> >> is you are looking up vma's at all and all that goes with that (mainly
>> >> taking mmap lock, etc.)
>> >>
>> >> And for eviction hmm_range_fault() won't even find all the pages because
>> >> their virtual address may have changed - consider what happens in cases
>> >> of mremap(), fork(), etc. So eviction really needs physical pages
>> >> (pfn's), not virtual addresses.
>> >>
>> >
>> > See above, #1 yes we use a physical pages. For #2 it is about making the
>> > state consistent within a virtual address range.
>>
>> Yep, makes sense now. For migration of physical pages you want
>> migrate_device_*, virtual address ranges want migrate_vma_*
>>
>> - Alistair
>>
>> > Matt
>> >
>> >> >> we're talking about eviction here so I don't understand why that would
>> >> >> be relevant. And hmm_range_fault() still requires the VMA, although I
>> >> >> need to look at the patches more closely, probably CPUVMA is a DRM
>> >> >
>> >> > 'hmm_range_fault() still requires the VMA' internal yes, but again not
>> >> > as argument. This is about avoiding a driver side lookup of the VMA.
>> >> >
>> >> > CPUVMA == struct vm_area_struct in this email.
>> >>
>> >> Thanks for the clarification.
>> >>
>> >> - Alistair
>> >>
>> >> > Matt
>> >> >
>> >> >> specific concept?
>> >> >>
>> >> >> Thanks.
>> >> >>
>> >> >> - Alistair
>> >> >>
>> >> >> > Matt
>> >> >> >
>> >> >> >> - Alistair
>> >> >> >>
>> >> >> >> > Matt
>> >> >> >> >
>> >> >> >> > [1] https://patchwork.freedesktop.org/patch/619809/?series=137870&rev=2
>> >> >> >> >
>> >> >> >> >> Matt
>> >> >> >> >>
>> >> >> >> >> > > + }
>> >> >> >> >> > > +
>> >> >> >> >> > > + migrate_device_unmap(src_pfns, npages, NULL);
>> >> >> >> >> > > +
>> >> >> >> >> > > + return 0;
>> >> >> >> >> > > +}
>> >> >> >> >> > > +EXPORT_SYMBOL(migrate_device_prepopulated_range);
>> >> >> >> >> > > +
>> >> >> >> >> > > /*
>> >> >> >> >> > > * Migrate a device coherent folio back to normal memory. The caller should have
>> >> >> >> >> > > * a reference on folio which will be copied to the new folio if migration is
>> >> >> >> >> >
>> >> >> >>
>> >> >>
>> >>
>>
^ permalink raw reply [flat|nested] 129+ messages in thread* Re: [PATCH v2 02/29] mm/migrate: Add migrate_device_prepopulated_range
2024-10-18 5:59 ` Alistair Popple
@ 2024-10-18 6:39 ` Mika Penttilä
2024-10-18 7:16 ` Matthew Brost
1 sibling, 0 replies; 129+ messages in thread
From: Mika Penttilä @ 2024-10-18 6:39 UTC (permalink / raw)
To: Alistair Popple, Matthew Brost
Cc: intel-xe, dri-devel, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
On 10/18/24 08:59, Alistair Popple wrote:
> Matthew Brost <matthew.brost@intel.com> writes:
>
>> On Fri, Oct 18, 2024 at 08:58:02AM +1100, Alistair Popple wrote:
>>> Matthew Brost <matthew.brost@intel.com> writes:
>>>
>>>> On Thu, Oct 17, 2024 at 04:49:11PM +1100, Alistair Popple wrote:
>>>>> Matthew Brost <matthew.brost@intel.com> writes:
>>>>>
>>>>>> On Thu, Oct 17, 2024 at 02:21:13PM +1100, Alistair Popple wrote:
>>>>>>> Matthew Brost <matthew.brost@intel.com> writes:
>>>>>>>
>>>>>>>> On Thu, Oct 17, 2024 at 12:49:55PM +1100, Alistair Popple wrote:
>>>>>>>>> Matthew Brost <matthew.brost@intel.com> writes:
>>>>>>>>>
>>>>>>>>>> On Wed, Oct 16, 2024 at 04:46:52AM +0000, Matthew Brost wrote:
>>>>>>>>>>> On Wed, Oct 16, 2024 at 03:04:06PM +1100, Alistair Popple wrote:
>>>>>>>>> [...]
>>>>>>>>>
>>>>>>>>>>>>> +{
>>>>>>>>>>>>> + unsigned long i;
>>>>>>>>>>>>> +
>>>>>>>>>>>>> + for (i = 0; i < npages; i++) {
>>>>>>>>>>>>> + struct page *page = pfn_to_page(src_pfns[i]);
>>>>>>>>>>>>> +
>>>>>>>>>>>>> + if (!get_page_unless_zero(page)) {
>>>>>>>>>>>>> + src_pfns[i] = 0;
>>>>>>>>>>>>> + continue;
>>>>>>>>>>>>> + }
>>>>>>>>>>>>> +
>>>>>>>>>>>>> + if (!trylock_page(page)) {
>>>>>>>>>>>>> + src_pfns[i] = 0;
>>>>>>>>>>>>> + put_page(page);
>>>>>>>>>>>>> + continue;
>>>>>>>>>>>>> + }
>>>>>>>>>>>>> +
>>>>>>>>>>>>> + src_pfns[i] = migrate_pfn(src_pfns[i]) | MIGRATE_PFN_MIGRATE;
>>>>>>>>>>>> This needs to be converted to use a folio like
>>>>>>>>>>>> migrate_device_range(). But more importantly this should be split out as
>>>>>>>>>>>> a function that both migrate_device_range() and this function can call
>>>>>>>>>>>> given this bit is identical.
>>>>>>>>>>>>
>>>>>>>>>>> Missed the folio conversion and agree a helper shared between this
>>>>>>>>>>> function and migrate_device_range would be a good idea. Let add that.
>>>>>>>>>>>
>>>>>>>>>> Alistair,
>>>>>>>>>>
>>>>>>>>>> Ok, I think now I want to go slightly different direction here to give
>>>>>>>>>> GPUSVM a bit more control over several eviction scenarios.
>>>>>>>>>>
>>>>>>>>>> What if I exported the helper discussed above, e.g.,
>>>>>>>>>>
>>>>>>>>>> 905 unsigned long migrate_device_pfn_lock(unsigned long pfn)
>>>>>>>>>> 906 {
>>>>>>>>>> 907 struct folio *folio;
>>>>>>>>>> 908
>>>>>>>>>> 909 folio = folio_get_nontail_page(pfn_to_page(pfn));
>>>>>>>>>> 910 if (!folio)
>>>>>>>>>> 911 return 0;
>>>>>>>>>> 912
>>>>>>>>>> 913 if (!folio_trylock(folio)) {
>>>>>>>>>> 914 folio_put(folio);
>>>>>>>>>> 915 return 0;
>>>>>>>>>> 916 }
>>>>>>>>>> 917
>>>>>>>>>> 918 return migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
>>>>>>>>>> 919 }
>>>>>>>>>> 920 EXPORT_SYMBOL(migrate_device_pfn_lock);
>>>>>>>>>>
>>>>>>>>>> And then also export migrate_device_unmap.
>>>>>>>>>>
>>>>>>>>>> The usage here would be let a driver collect the device pages in virtual
>>>>>>>>>> address range via hmm_range_fault, lock device pages under notifier
>>>>>>>>>> lock ensuring device pages are valid, drop the notifier lock and call
>>>>>>>>>> migrate_device_unmap.
>>>>>>>>> I'm still working through this series but that seems a bit dubious, the
>>>>>>>>> locking here is pretty subtle and easy to get wrong so seeing some code
>>>>>>>>> would help me a lot in understanding what you're suggesting.
>>>>>>>>>
>>>>>>>> For sure locking in tricky, my mistake on not working through this
>>>>>>>> before sending out the next rev but it came to mind after sending +
>>>>>>>> regarding some late feedback from Thomas about using hmm for eviction
>>>>>>>> [2]. His suggestion of using hmm_range_fault to trigger migration
>>>>>>>> doesn't work for coherent pages, while something like below does.
>>>>>>>>
>>>>>>>> [2] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1125461
>>>>>>>>
>>>>>>>> Here is a snippet I have locally which seems to work.
>>>>>>>>
>>>>>>>> 2024 retry:
>>>>>>>> 2025 hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
>>>>>>>> 2026 hmm_range.hmm_pfns = src;
>>>>>>>> 2027
>>>>>>>> 2028 while (true) {
>>>>>>>> 2029 mmap_read_lock(mm);
>>>>>>>> 2030 err = hmm_range_fault(&hmm_range);
>>>>>>>> 2031 mmap_read_unlock(mm);
>>>>>>>> 2032 if (err == -EBUSY) {
>>>>>>>> 2033 if (time_after(jiffies, timeout))
>>>>>>>> 2034 break;
>>>>>>>> 2035
>>>>>>>> 2036 hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
>>>>>>>> 2037 continue;
>>>>>>>> 2038 }
>>>>>>>> 2039 break;
>>>>>>>> 2040 }
>>>>>>>> 2041 if (err)
>>>>>>>> 2042 goto err_put;
>>>>>>>> 2043
>>>>>>>> 2044 drm_gpusvm_notifier_lock(gpusvm);
>>>>>>>> 2045 if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
>>>>>>>> 2046 drm_gpusvm_notifier_unlock(gpusvm);
>>>>>>>> 2047 memset(src, 0, sizeof(*src) * npages);
>>>>>>>> 2048 goto retry;
>>>>>>>> 2049 }
>>>>>>>> 2050 for (i = 0; i < npages; ++i) {
>>>>>>>> 2051 struct page *page = hmm_pfn_to_page(src[i]);
>>>>>>>> 2052
>>>>>>>> 2053 if (page && (is_device_private_page(page) ||
>>>>>>>> 2054 is_device_coherent_page(page)) && page->zone_device_data)
>>>>>>>> 2055 src[i] = src[i] & ~HMM_PFN_FLAGS;
>>>>>>>> 2056 else
>>>>>>>> 2057 src[i] = 0;
>>>>>>>> 2058 if (src[i])
>>>>>>>> 2059 src[i] = migrate_device_pfn_lock(src[i]);
>>>>>>>> 2060 }
>>>>>>>> 2061 drm_gpusvm_notifier_unlock(gpusvm);
>>>>>>> Practically for eviction isn't this much the same as calling
>>>>>>> migrate_vma_setup()? And also for eviction as Sima mentioned you
>>>>>>> probably shouldn't be looking at mm/vma structs.
>>>>>>>
>>>>>> hmm_range_fault is just collecting the pages, internally I suppose it
>>>>>> does look at a VMA (struct vm_area_struct) but I think the point is
>>>>>> drivers should not be looking at VMA here.
>>>>> migrate_vma_setup() is designed to be called by drivers and needs a vma,
>>>>> so in general I don't see a problem with drivers looking up vma's. The
>>>>> problem arises specifically for eviction and whether or not that happens
>>>>> in the driver or hmm_range_fault() is pretty irrelevant IMHO for the
>>>>> issues there (see below).
>>>>>
>>>> Ok, if you think it ok for drivers to lookup the VMA then purposed
>>>> exporting of migrate_device_pfn_lock & migrate_device_unmap is not
>>>> needed, rather just the original function exported in the this patch.
>>>>
>>>> More below too.
>>>>
>>>>>>>> 2063 migrate_device_unmap(src, npages, NULL);
>>>>>>>> ...
>>>>>>>> 2101 migrate_device_pages(src, dst, npages);
>>>>>>>> 2102 migrate_device_finalize(src, dst, npages);
>>>>>>>>
>>>>>>>>
>>>>>>>>>> Sima has strongly suggested avoiding a CPUVMA
>>>>>>>>>> lookup during eviction cases and this would let me fixup
>>>>>>>>>> drm_gpusvm_range_evict in [1] to avoid this.
>>>>>>>>> That sounds reasonable but for context do you have a link to the
>>>>>>>>> comments/discussion on this? I couldn't readily find it, but I may have
>>>>>>>>> just missed it.
>>>>>>>>>
>>>>>>>> See in [4], search for '2. eviction' comment from sima.
>>>>>>> Thanks for pointing that out. For reference here's Sima's comment:
>>>>>>>
>>>>>>>> 2. eviction
>>>>>>>>
>>>>>>>> Requirements much like migrate_to_ram, because otherwise we break the
>>>>>>>> migration gurantee:
>>>>>>>>
>>>>>>>> - Only looking at physical memory datastructures and locks, no looking at
>>>>>>>> mm/vma structs or relying on those being locked. We rely entirely on
>>>>>>>> reverse maps from try_to_migrate to find all the mappings on both cpu
>>>>>>>> and gpu side (cpu only zone device swap or migration pte entries ofc).
>>>>>>> I also very much agree with this. That's basically why I added
>>>>>>> migrate_device_range(), so that we can forcibly evict pages when the
>>>>>>> driver needs them freed (eg. driver unload, low memory, etc.). In
>>>>>>> general it is impossible to guarantee eviction og all pages using just
>>>>>>> hmm_range_fault().
>>>>>>>
>>>>>> In this code path we don't have device pages available, hence the
>>>>>> purposed collection via hmm_range_fault.
>>>>> Why don't you have the pfns requiring eviction available? I need to read
>>>>> this series in more depth, but generally hmm_range_fault() can't
>>>>> gurantee you will find every device page.
>>>>>
>>>> There are two cases for eviction in my series:
>>>>
>>>> 1. TTM decides it needs to move memory. This calls
>>>> drm_gpusvm_evict_to_ram. In this case the device pfns are available
>>>> directly from drm_gpusvm_devmem so the migrate_device_* calls be used
>>>> here albiet with the new function added in this patch as device pfns may
>>>> be non-contiguous.
>>> That makes sense and is generally what I think of when I'm thinking of
>>> eviction. The new function makes sense too - migrate_device_range() was
>>> primarily introduced to allow a driver to evict all device-private pages
>>> from a GPU so didn't consider non-contiguous cases, etc.
>>>
>>>> 2. An inconsistent state for VA range occurs (mixed system and device pages,
>>>> partial unmap of a range, etc...). Here we want to evict the range ram
>>>> to make the state consistent. No device pages are available due to an
>>>> intentional disconnect between a virtual range and physical
>>>> drm_gpusvm_devmem, thus the device pages have to be looked up. This the
>>>> function drm_gpusvm_range_evict. Based on what you tell me, it likely is
>>>> fine the way originally coded in v2 (vma lookup + migrate_vma_*) vs
>>>> using hmm_range_fault like I have suggested here.
>>> Thanks for the explanation. I think vma lookup + migrate_vma_setup() is
>>> fine for this usage and is exactly what you want - it was designed to
>>> either select all the system memory pages or device-private pages within
>>> a VA range and migrate them.
>>>
>>> FWIW I have toyed with the idea of a combined
>>> hmm_range_fault()/migrate_vma_setup() front-end to the rest of the
>>> migrate_vma_*() process but haven't come up with something nice as
>>> yet. I don't think mixing the two in an open-coded fashion is a good
>>> idea though, I'd rather we come up with a new API that addresses the
>>> short-comings of migrate_vma_setup().
>>>
>> I think that would good. Here we actually need to lookup multiple VMAs
>> and have a sequence of migrate_vma_* calls as it possible for VMAs to
>> have changed after the driver range was created. It might be nice to
>> hide the VMA lookup from the drivers with an API saying collect and
>> migrate all pages of a type in a VA range much like hmm_range_fault. If
>> the range spans multiple VMAs that would be hidden from the caller.
> Ok. I wasn't really considering multiple VMAs. UVM and Nouveau don't
> really have a requirement to migrate across multiple VMAs but if that's
> neccessary I think an API that hides that specifically for working with
> migrate_vma_*() might make sense.
Yes, that's what I'm currently doing. You call it in a loop: the
fault+migrate prepare part chunks the calls at vma boundaries, and you do
the migrations for each vma, looping until the whole range is done.
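Roughly, a minimal sketch of that loop (not the code from this series;
evict_chunk_copy_to_ram() is a hypothetical driver helper that fills dst[]
with system pages and copies the device data back):

static int evict_va_range_to_ram(struct mm_struct *mm, void *pgmap_owner,
                                 unsigned long start, unsigned long end)
{
        unsigned long addr = start;
        int err = 0;

        mmap_read_lock(mm);
        while (addr < end) {
                struct vm_area_struct *vma;
                unsigned long chunk_end, npages;
                unsigned long *src, *dst;
                struct migrate_vma args = {};

                /* Chunk the walk at VMA boundaries. */
                vma = find_vma_intersection(mm, addr, end);
                if (!vma)
                        break;
                addr = max(addr, vma->vm_start);
                chunk_end = min(end, vma->vm_end);
                npages = (chunk_end - addr) >> PAGE_SHIFT;

                src = kvcalloc(npages, 2 * sizeof(*src), GFP_KERNEL);
                if (!src) {
                        err = -ENOMEM;
                        break;
                }
                dst = src + npages;

                args.vma = vma;
                args.start = addr;
                args.end = chunk_end;
                args.src = src;
                args.dst = dst;
                args.pgmap_owner = pgmap_owner;
                args.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;

                /* One migrate_vma_* sequence per VMA chunk. */
                err = migrate_vma_setup(&args);
                if (!err && args.cpages) {
                        evict_chunk_copy_to_ram(&args); /* hypothetical */
                        migrate_vma_pages(&args);
                        migrate_vma_finalize(&args);
                }

                kvfree(src);
                if (err)
                        break;
                addr = chunk_end;
        }
        mmap_read_unlock(mm);

        return err;
}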
>
>> Matt
>>
>>>> Note #2 may be removed or unnecessary at some point if we decide to add
>>>> support for ininconsistent state in GPU SVM and Xe. Keeping it simple for
>>>> now though. See 'Ranges with mixed system and device pages' in [5].
> As someone not very familiar with some of the DRM layers can I ask why
> having virtual address ranges with a mix of system and device pages is
> hard to support? It seems to me that in practice it might be quite
> difficult to keep a VMA range as exclusively all in system memory or all
> in device memory.
>
>>>> [5] https://patchwork.freedesktop.org/patch/619819/?series=137870&rev=2
>>>>
>>>>>>>> [3] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1110726
>>>>>>>> [4] https://lore.kernel.org/all/BYAPR11MB3159A304925168D8B6B4671292692@BYAPR11MB3159.namprd11.prod.outlook.com/T/#m89cd6a37778ba5271d5381ebeb03e1f963856a78
>>>>>>>>
>>>>>>>>>> It would also make the function exported in this patch unnecessary too
>>>>>>>>>> as non-contiguous pfns can be setup on driver side via
>>>>>>>>>> migrate_device_pfn_lock and then migrate_device_unmap can be called.
>>>>>>>>>> This also another eviction usage in GPUSVM, see drm_gpusvm_evict_to_ram
>>>>>>>>>> in [1].
>>>>>>>>>>
>>>>>>>>>> Do you see an issue exporting migrate_device_pfn_lock,
>>>>>>>>>> migrate_device_unmap?
>>>>>>>>> If there is a good justification for it I can't see a problem with
>>>>>>>>> exporting it. That said I don't really understand why you would
>>>>>>>>> want/need to split those steps up but I'll wait to see the code.
>>>>>>>>>
>>>>>>>> It is so the device pages returned from hmm_range_fault, which are only
>>>>>>>> guaranteed to be valid under the notifier lock + a seqno check, to be
>>>>>>>> locked and ref taken for migration. migrate_device_unmap() can trigger a
>>>>>>>> MMU invalidation which takes the notifier lock thus calling the function
>>>>>>>> which combines migrate_device_pfn_lock + migrate_device_unmap deadlocks.
>>>>>>>>
>>>>>>>> I think this flow makes sense and agree in general this likely better
>>>>>>>> than looking at a CPUVMA.
>>>>>>> I'm still a bit confused about what is better with this flow if you are
>>>>>>> still calling hmm_range_fault(). How is it better than just calling
>>>>>>> migrate_vma_setup()? Obviously it will fault the pages in, but it seems
>>>>>> The code in rev2 calls migrate_vma_setup but the requires a struct
>>>>>> vm_area_struct argument whereas hmm_range_fault does not.
>>>>> I'm not sure that's a good enough justfication because the problem isn't
>>>>> whether you're looking up vma's in driver code or mm code. The problem
>>>>> is you are looking up vma's at all and all that goes with that (mainly
>>>>> taking mmap lock, etc.)
>>>>>
>>>>> And for eviction hmm_range_fault() won't even find all the pages because
>>>>> their virtual address may have changed - consider what happens in cases
>>>>> of mremap(), fork(), etc. So eviction really needs physical pages
>>>>> (pfn's), not virtual addresses.
>>>>>
>>>> See above, #1 yes we use a physical pages. For #2 it is about making the
>>>> state consistent within a virtual address range.
>>> Yep, makes sense now. For migration of physical pages you want
>>> migrate_device_*, virtual address ranges want migrate_vma_*
>>>
>>> - Alistair
>>>
>>>> Matt
>>>>
>>>>>>> we're talking about eviction here so I don't understand why that would
>>>>>>> be relevant. And hmm_range_fault() still requires the VMA, although I
>>>>>>> need to look at the patches more closely, probably CPUVMA is a DRM
>>>>>> 'hmm_range_fault() still requires the VMA' internal yes, but again not
>>>>>> as argument. This is about avoiding a driver side lookup of the VMA.
>>>>>>
>>>>>> CPUVMA == struct vm_area_struct in this email.
>>>>> Thanks for the clarification.
>>>>>
>>>>> - Alistair
>>>>>
>>>>>> Matt
>>>>>>
>>>>>>> specific concept?
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>> - Alistair
>>>>>>>
>>>>>>>> Matt
>>>>>>>>
>>>>>>>>> - Alistair
>>>>>>>>>
>>>>>>>>>> Matt
>>>>>>>>>>
>>>>>>>>>> [1] https://patchwork.freedesktop.org/patch/619809/?series=137870&rev=2
>>>>>>>>>>
>>>>>>>>>>> Matt
>>>>>>>>>>>
>>>>>>>>>>>>> + }
>>>>>>>>>>>>> +
>>>>>>>>>>>>> + migrate_device_unmap(src_pfns, npages, NULL);
>>>>>>>>>>>>> +
>>>>>>>>>>>>> + return 0;
>>>>>>>>>>>>> +}
>>>>>>>>>>>>> +EXPORT_SYMBOL(migrate_device_prepopulated_range);
>>>>>>>>>>>>> +
>>>>>>>>>>>>> /*
>>>>>>>>>>>>> * Migrate a device coherent folio back to normal memory. The caller should have
>>>>>>>>>>>>> * a reference on folio which will be copied to the new folio if migration is
^ permalink raw reply [flat|nested] 129+ messages in thread* Re: [PATCH v2 02/29] mm/migrate: Add migrate_device_prepopulated_range
2024-10-18 5:59 ` Alistair Popple
2024-10-18 6:39 ` Mika Penttilä
@ 2024-10-18 7:16 ` Matthew Brost
2024-10-18 7:33 ` Matthew Brost
2024-10-18 7:34 ` Alistair Popple
1 sibling, 2 replies; 129+ messages in thread
From: Matthew Brost @ 2024-10-18 7:16 UTC (permalink / raw)
To: Alistair Popple
Cc: intel-xe, dri-devel, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
On Fri, Oct 18, 2024 at 04:59:05PM +1100, Alistair Popple wrote:
>
> Matthew Brost <matthew.brost@intel.com> writes:
>
> > On Fri, Oct 18, 2024 at 08:58:02AM +1100, Alistair Popple wrote:
> >>
> >> Matthew Brost <matthew.brost@intel.com> writes:
> >>
> >> > On Thu, Oct 17, 2024 at 04:49:11PM +1100, Alistair Popple wrote:
> >> >>
> >> >> Matthew Brost <matthew.brost@intel.com> writes:
> >> >>
> >> >> > On Thu, Oct 17, 2024 at 02:21:13PM +1100, Alistair Popple wrote:
> >> >> >>
> >> >> >> Matthew Brost <matthew.brost@intel.com> writes:
> >> >> >>
> >> >> >> > On Thu, Oct 17, 2024 at 12:49:55PM +1100, Alistair Popple wrote:
> >> >> >> >>
> >> >> >> >> Matthew Brost <matthew.brost@intel.com> writes:
> >> >> >> >>
> >> >> >> >> > On Wed, Oct 16, 2024 at 04:46:52AM +0000, Matthew Brost wrote:
> >> >> >> >> >> On Wed, Oct 16, 2024 at 03:04:06PM +1100, Alistair Popple wrote:
> >> >> >> >>
> >> >> >> >> [...]
> >> >> >> >>
> >> >> >> >> >> > > +{
> >> >> >> >> >> > > + unsigned long i;
> >> >> >> >> >> > > +
> >> >> >> >> >> > > + for (i = 0; i < npages; i++) {
> >> >> >> >> >> > > + struct page *page = pfn_to_page(src_pfns[i]);
> >> >> >> >> >> > > +
> >> >> >> >> >> > > + if (!get_page_unless_zero(page)) {
> >> >> >> >> >> > > + src_pfns[i] = 0;
> >> >> >> >> >> > > + continue;
> >> >> >> >> >> > > + }
> >> >> >> >> >> > > +
> >> >> >> >> >> > > + if (!trylock_page(page)) {
> >> >> >> >> >> > > + src_pfns[i] = 0;
> >> >> >> >> >> > > + put_page(page);
> >> >> >> >> >> > > + continue;
> >> >> >> >> >> > > + }
> >> >> >> >> >> > > +
> >> >> >> >> >> > > + src_pfns[i] = migrate_pfn(src_pfns[i]) | MIGRATE_PFN_MIGRATE;
> >> >> >> >> >> >
> >> >> >> >> >> > This needs to be converted to use a folio like
> >> >> >> >> >> > migrate_device_range(). But more importantly this should be split out as
> >> >> >> >> >> > a function that both migrate_device_range() and this function can call
> >> >> >> >> >> > given this bit is identical.
> >> >> >> >> >> >
> >> >> >> >> >>
> >> >> >> >> >> Missed the folio conversion and agree a helper shared between this
> >> >> >> >> >> function and migrate_device_range would be a good idea. Let add that.
> >> >> >> >> >>
> >> >> >> >> >
> >> >> >> >> > Alistair,
> >> >> >> >> >
> >> >> >> >> > Ok, I think now I want to go slightly different direction here to give
> >> >> >> >> > GPUSVM a bit more control over several eviction scenarios.
> >> >> >> >> >
> >> >> >> >> > What if I exported the helper discussed above, e.g.,
> >> >> >> >> >
> >> >> >> >> > 905 unsigned long migrate_device_pfn_lock(unsigned long pfn)
> >> >> >> >> > 906 {
> >> >> >> >> > 907 struct folio *folio;
> >> >> >> >> > 908
> >> >> >> >> > 909 folio = folio_get_nontail_page(pfn_to_page(pfn));
> >> >> >> >> > 910 if (!folio)
> >> >> >> >> > 911 return 0;
> >> >> >> >> > 912
> >> >> >> >> > 913 if (!folio_trylock(folio)) {
> >> >> >> >> > 914 folio_put(folio);
> >> >> >> >> > 915 return 0;
> >> >> >> >> > 916 }
> >> >> >> >> > 917
> >> >> >> >> > 918 return migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
> >> >> >> >> > 919 }
> >> >> >> >> > 920 EXPORT_SYMBOL(migrate_device_pfn_lock);
> >> >> >> >> >
> >> >> >> >> > And then also export migrate_device_unmap.
> >> >> >> >> >
> >> >> >> >> > The usage here would be let a driver collect the device pages in virtual
> >> >> >> >> > address range via hmm_range_fault, lock device pages under notifier
> >> >> >> >> > lock ensuring device pages are valid, drop the notifier lock and call
> >> >> >> >> > migrate_device_unmap.
> >> >> >> >>
> >> >> >> >> I'm still working through this series but that seems a bit dubious, the
> >> >> >> >> locking here is pretty subtle and easy to get wrong so seeing some code
> >> >> >> >> would help me a lot in understanding what you're suggesting.
> >> >> >> >>
> >> >> >> >
> >> >> >> > For sure locking in tricky, my mistake on not working through this
> >> >> >> > before sending out the next rev but it came to mind after sending +
> >> >> >> > regarding some late feedback from Thomas about using hmm for eviction
> >> >> >> > [2]. His suggestion of using hmm_range_fault to trigger migration
> >> >> >> > doesn't work for coherent pages, while something like below does.
> >> >> >> >
> >> >> >> > [2] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1125461
> >> >> >> >
> >> >> >> > Here is a snippet I have locally which seems to work.
> >> >> >> >
> >> >> >> > 2024 retry:
> >> >> >> > 2025 hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
> >> >> >> > 2026 hmm_range.hmm_pfns = src;
> >> >> >> > 2027
> >> >> >> > 2028 while (true) {
> >> >> >> > 2029 mmap_read_lock(mm);
> >> >> >> > 2030 err = hmm_range_fault(&hmm_range);
> >> >> >> > 2031 mmap_read_unlock(mm);
> >> >> >> > 2032 if (err == -EBUSY) {
> >> >> >> > 2033 if (time_after(jiffies, timeout))
> >> >> >> > 2034 break;
> >> >> >> > 2035
> >> >> >> > 2036 hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
> >> >> >> > 2037 continue;
> >> >> >> > 2038 }
> >> >> >> > 2039 break;
> >> >> >> > 2040 }
> >> >> >> > 2041 if (err)
> >> >> >> > 2042 goto err_put;
> >> >> >> > 2043
> >> >> >> > 2044 drm_gpusvm_notifier_lock(gpusvm);
> >> >> >> > 2045 if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
> >> >> >> > 2046 drm_gpusvm_notifier_unlock(gpusvm);
> >> >> >> > 2047 memset(src, 0, sizeof(*src) * npages);
> >> >> >> > 2048 goto retry;
> >> >> >> > 2049 }
> >> >> >> > 2050 for (i = 0; i < npages; ++i) {
> >> >> >> > 2051 struct page *page = hmm_pfn_to_page(src[i]);
> >> >> >> > 2052
> >> >> >> > 2053 if (page && (is_device_private_page(page) ||
> >> >> >> > 2054 is_device_coherent_page(page)) && page->zone_device_data)
> >> >> >> > 2055 src[i] = src[i] & ~HMM_PFN_FLAGS;
> >> >> >> > 2056 else
> >> >> >> > 2057 src[i] = 0;
> >> >> >> > 2058 if (src[i])
> >> >> >> > 2059 src[i] = migrate_device_pfn_lock(src[i]);
> >> >> >> > 2060 }
> >> >> >> > 2061 drm_gpusvm_notifier_unlock(gpusvm);
> >> >> >>
> >> >> >> Practically for eviction isn't this much the same as calling
> >> >> >> migrate_vma_setup()? And also for eviction as Sima mentioned you
> >> >> >> probably shouldn't be looking at mm/vma structs.
> >> >> >>
> >> >> >
> >> >> > hmm_range_fault is just collecting the pages, internally I suppose it
> >> >> > does look at a VMA (struct vm_area_struct) but I think the point is
> >> >> > drivers should not be looking at VMA here.
> >> >>
> >> >> migrate_vma_setup() is designed to be called by drivers and needs a vma,
> >> >> so in general I don't see a problem with drivers looking up vma's. The
> >> >> problem arises specifically for eviction and whether or not that happens
> >> >> in the driver or hmm_range_fault() is pretty irrelevant IMHO for the
> >> >> issues there (see below).
> >> >>
> >> >
> >> > Ok, if you think it ok for drivers to lookup the VMA then purposed
> >> > exporting of migrate_device_pfn_lock & migrate_device_unmap is not
> >> > needed, rather just the original function exported in the this patch.
> >> >
> >> > More below too.
> >> >
> >> >> >> > 2063 migrate_device_unmap(src, npages, NULL);
> >> >> >> > ...
> >> >> >> > 2101 migrate_device_pages(src, dst, npages);
> >> >> >> > 2102 migrate_device_finalize(src, dst, npages);
> >> >> >> >
> >> >> >> >
> >> >> >> >> > Sima has strongly suggested avoiding a CPUVMA
> >> >> >> >> > lookup during eviction cases and this would let me fixup
> >> >> >> >> > drm_gpusvm_range_evict in [1] to avoid this.
> >> >> >> >>
> >> >> >> >> That sounds reasonable but for context do you have a link to the
> >> >> >> >> comments/discussion on this? I couldn't readily find it, but I may have
> >> >> >> >> just missed it.
> >> >> >> >>
> >> >> >> >
> >> >> >> > See in [4], search for '2. eviction' comment from sima.
> >> >> >>
> >> >> >> Thanks for pointing that out. For reference here's Sima's comment:
> >> >> >>
> >> >> >> > 2. eviction
> >> >> >> >
> >> >> >> > Requirements much like migrate_to_ram, because otherwise we break the
> >> >> >> > migration gurantee:
> >> >> >> >
> >> >> >> > - Only looking at physical memory datastructures and locks, no looking at
> >> >> >> > mm/vma structs or relying on those being locked. We rely entirely on
> >> >> >> > reverse maps from try_to_migrate to find all the mappings on both cpu
> >> >> >> > and gpu side (cpu only zone device swap or migration pte entries ofc).
> >> >> >>
> >> >> >> I also very much agree with this. That's basically why I added
> >> >> >> migrate_device_range(), so that we can forcibly evict pages when the
> >> >> >> driver needs them freed (eg. driver unload, low memory, etc.). In
> >> >> >> general it is impossible to guarantee eviction og all pages using just
> >> >> >> hmm_range_fault().
> >> >> >>
> >> >> >
> >> >> > In this code path we don't have device pages available, hence the
> >> >> > purposed collection via hmm_range_fault.
> >> >>
> >> >> Why don't you have the pfns requiring eviction available? I need to read
> >> >> this series in more depth, but generally hmm_range_fault() can't
> >> >> gurantee you will find every device page.
> >> >>
> >> >
> >> > There are two cases for eviction in my series:
> >> >
> >> > 1. TTM decides it needs to move memory. This calls
> >> > drm_gpusvm_evict_to_ram. In this case the device pfns are available
> >> > directly from drm_gpusvm_devmem so the migrate_device_* calls be used
> >> > here albiet with the new function added in this patch as device pfns may
> >> > be non-contiguous.
> >>
> >> That makes sense and is generally what I think of when I'm thinking of
> >> eviction. The new function makes sense too - migrate_device_range() was
> >> primarily introduced to allow a driver to evict all device-private pages
> >> from a GPU so didn't consider non-contiguous cases, etc.
> >>
> >> > 2. An inconsistent state for VA range occurs (mixed system and device pages,
> >> > partial unmap of a range, etc...). Here we want to evict the range ram
> >> > to make the state consistent. No device pages are available due to an
> >> > intentional disconnect between a virtual range and physical
> >> > drm_gpusvm_devmem, thus the device pages have to be looked up. This the
> >> > function drm_gpusvm_range_evict. Based on what you tell me, it likely is
> >> > fine the way originally coded in v2 (vma lookup + migrate_vma_*) vs
> >> > using hmm_range_fault like I have suggested here.
> >>
> >> Thanks for the explanation. I think vma lookup + migrate_vma_setup() is
> >> fine for this usage and is exactly what you want - it was designed to
> >> either select all the system memory pages or device-private pages within
> >> a VA range and migrate them.
> >>
> >> FWIW I have toyed with the idea of a combined
> >> hmm_range_fault()/migrate_vma_setup() front-end to the rest of the
> >> migrate_vma_*() process but haven't come up with something nice as
> >> yet. I don't think mixing the two in an open-coded fashion is a good
> >> idea though, I'd rather we come up with a new API that addresses the
> >> short-comings of migrate_vma_setup().
> >>
> >
> > I think that would good. Here we actually need to lookup multiple VMAs
> > and have a sequence of migrate_vma_* calls as it possible for VMAs to
> > have changed after the driver range was created. It might be nice to
> > hide the VMA lookup from the drivers with an API saying collect and
> > migrate all pages of a type in a VA range much like hmm_range_fault. If
> > the range spans multiple VMAs that would be hidden from the caller.
>
> Ok. I wasn't really considering multiple VMAs. UVM and Nouveau don't
> really have a requirement to migrate across multiple VMAs but if that's
> neccessary I think an API that hides that specifically for working with
> migrate_vma_*() might make sense.
>
We can run into multiple VMA scenarios if a user does something rude
like this:
mmap 0x000000...0x1fffff -> fault migrates 2M to VRAM and creates an internal range to track
munmap 0x080000...0x17ffff -> now we have two VMAs instead of one and the range has a hole in it
In this scenario, which we believe to be rare / unusual, we just evict the
remaining VRAM pages to SRAM, destroy the range, and fix up on the next GPU
fault.
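For what it's worth, the user-space side of that sequence is just the
following (sizes as in the example above; nothing driver-specific is
assumed here):

#include <stddef.h>
#include <sys/mman.h>

int main(void)
{
        /* Map 2M anonymous memory; a GPU fault would migrate it to VRAM. */
        char *p = mmap(NULL, 2ul << 20, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
                return 1;

        /*
         * Unmap 1M in the middle: the kernel splits the VMA in two and the
         * driver's 2M SVM range now has a hole in it.
         */
        munmap(p + (512ul << 10), 1ul << 20);

        return 0;
}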
> > Matt
> >
> >> > Note #2 may be removed or unnecessary at some point if we decide to add
> >> > support for ininconsistent state in GPU SVM and Xe. Keeping it simple for
> >> > now though. See 'Ranges with mixed system and device pages' in [5].
>
> As someone not very familiar with some of the DRM layers can I ask why
> having virtual address ranges with a mix of system and device pages is
> hard to support? It seems to me that in practice it might be quite
> difficult to keep a VMA range as exclusively all in system memory or all
> in device memory.
>
A few things that make this difficult are:
- Our (Xe) bind code would need to be updated to support this
- TTM / DRM buddy allocator doesn't support freeing / reallocation of
individual pages, only aligned chunks of the initial allocation size
(e.g., 2M would be a common allocation size).
- Splitting ranges would add complications
All workable problems, but since we are writing a new common
implementation we are trying to keep it as simple as possible for the
initial merge of the design. Almost certainly at some point we will add
support for mixed ranges to the common GPU SVM layer, with a driver
choosing whether it wants mixed or non-mixed ranges via a flag to
function calls.
wrt keeping a range exclusively in system memory or VRAM being difficult:
in addition to the above case, the only other case I have found in which
this occurs is CPU and GPU faults to the same address range racing. This
can cause hmm_range_fault to grab a set of mixed pages. In this case we
again evict the remaining pages and restart the GPU fault (roughly as
sketched below).
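The mixed-page check itself is along these lines (a sketch, not the exact
GPU SVM code, assuming hmm_pfns[] was just filled by hmm_range_fault under
the usual notifier seqno protocol):

static bool range_pages_are_mixed(const unsigned long *hmm_pfns,
                                  unsigned long npages)
{
        bool have_device = false, have_system = false;
        unsigned long i;

        for (i = 0; i < npages; i++) {
                struct page *page;

                if (!(hmm_pfns[i] & HMM_PFN_VALID))
                        continue;

                page = hmm_pfn_to_page(hmm_pfns[i]);
                if (is_device_private_page(page) ||
                    is_device_coherent_page(page))
                        have_device = true;
                else
                        have_system = true;
        }

        /* Mixed range: evict what is in VRAM and restart the GPU fault. */
        return have_device && have_system;
}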
I don't have real workloads yet, but I do have a very aggressive test
case that intentionally does things which could break the design in a
highly parallel manner, and the design has held up. Is it ideal? Maybe
not. But getting in a simple design which we can build upon is the
current goal.
Matt
> >> > [5] https://patchwork.freedesktop.org/patch/619819/?series=137870&rev=2
> >> >
> >> >> >> > [3] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1110726
> >> >> >> > [4] https://lore.kernel.org/all/BYAPR11MB3159A304925168D8B6B4671292692@BYAPR11MB3159.namprd11.prod.outlook.com/T/#m89cd6a37778ba5271d5381ebeb03e1f963856a78
> >> >> >> >
> >> >> >> >> > It would also make the function exported in this patch unnecessary too
> >> >> >> >> > as non-contiguous pfns can be setup on driver side via
> >> >> >> >> > migrate_device_pfn_lock and then migrate_device_unmap can be called.
> >> >> >> >> > This also another eviction usage in GPUSVM, see drm_gpusvm_evict_to_ram
> >> >> >> >> > in [1].
> >> >> >> >> >
> >> >> >> >> > Do you see an issue exporting migrate_device_pfn_lock,
> >> >> >> >> > migrate_device_unmap?
> >> >> >> >>
> >> >> >> >> If there is a good justification for it I can't see a problem with
> >> >> >> >> exporting it. That said I don't really understand why you would
> >> >> >> >> want/need to split those steps up but I'll wait to see the code.
> >> >> >> >>
> >> >> >> >
> >> >> >> > It is so the device pages returned from hmm_range_fault, which are only
> >> >> >> > guaranteed to be valid under the notifier lock + a seqno check, to be
> >> >> >> > locked and ref taken for migration. migrate_device_unmap() can trigger a
> >> >> >> > MMU invalidation which takes the notifier lock thus calling the function
> >> >> >> > which combines migrate_device_pfn_lock + migrate_device_unmap deadlocks.
> >> >> >> >
> >> >> >> > I think this flow makes sense and agree in general this likely better
> >> >> >> > than looking at a CPUVMA.
> >> >> >>
> >> >> >> I'm still a bit confused about what is better with this flow if you are
> >> >> >> still calling hmm_range_fault(). How is it better than just calling
> >> >> >> migrate_vma_setup()? Obviously it will fault the pages in, but it seems
> >> >> >
> >> >> > The code in rev2 calls migrate_vma_setup but the requires a struct
> >> >> > vm_area_struct argument whereas hmm_range_fault does not.
> >> >>
> >> >> I'm not sure that's a good enough justfication because the problem isn't
> >> >> whether you're looking up vma's in driver code or mm code. The problem
> >> >> is you are looking up vma's at all and all that goes with that (mainly
> >> >> taking mmap lock, etc.)
> >> >>
> >> >> And for eviction hmm_range_fault() won't even find all the pages because
> >> >> their virtual address may have changed - consider what happens in cases
> >> >> of mremap(), fork(), etc. So eviction really needs physical pages
> >> >> (pfn's), not virtual addresses.
> >> >>
> >> >
> >> > See above, #1 yes we use a physical pages. For #2 it is about making the
> >> > state consistent within a virtual address range.
> >>
> >> Yep, makes sense now. For migration of physical pages you want
> >> migrate_device_*, virtual address ranges want migrate_vma_*
> >>
> >> - Alistair
> >>
> >> > Matt
> >> >
> >> >> >> we're talking about eviction here so I don't understand why that would
> >> >> >> be relevant. And hmm_range_fault() still requires the VMA, although I
> >> >> >> need to look at the patches more closely, probably CPUVMA is a DRM
> >> >> >
> >> >> > 'hmm_range_fault() still requires the VMA' internal yes, but again not
> >> >> > as argument. This is about avoiding a driver side lookup of the VMA.
> >> >> >
> >> >> > CPUVMA == struct vm_area_struct in this email.
> >> >>
> >> >> Thanks for the clarification.
> >> >>
> >> >> - Alistair
> >> >>
> >> >> > Matt
> >> >> >
> >> >> >> specific concept?
> >> >> >>
> >> >> >> Thanks.
> >> >> >>
> >> >> >> - Alistair
> >> >> >>
> >> >> >> > Matt
> >> >> >> >
> >> >> >> >> - Alistair
> >> >> >> >>
> >> >> >> >> > Matt
> >> >> >> >> >
> >> >> >> >> > [1] https://patchwork.freedesktop.org/patch/619809/?series=137870&rev=2
> >> >> >> >> >
> >> >> >> >> >> Matt
> >> >> >> >> >>
> >> >> >> >> >> > > + }
> >> >> >> >> >> > > +
> >> >> >> >> >> > > + migrate_device_unmap(src_pfns, npages, NULL);
> >> >> >> >> >> > > +
> >> >> >> >> >> > > + return 0;
> >> >> >> >> >> > > +}
> >> >> >> >> >> > > +EXPORT_SYMBOL(migrate_device_prepopulated_range);
> >> >> >> >> >> > > +
> >> >> >> >> >> > > /*
> >> >> >> >> >> > > * Migrate a device coherent folio back to normal memory. The caller should have
> >> >> >> >> >> > > * a reference on folio which will be copied to the new folio if migration is
> >> >> >> >> >> >
> >> >> >> >>
> >> >> >>
> >> >>
> >>
>
^ permalink raw reply [flat|nested] 129+ messages in thread* Re: [PATCH v2 02/29] mm/migrate: Add migrate_device_prepopulated_range
2024-10-18 7:16 ` Matthew Brost
@ 2024-10-18 7:33 ` Matthew Brost
2024-10-18 7:34 ` Alistair Popple
1 sibling, 0 replies; 129+ messages in thread
From: Matthew Brost @ 2024-10-18 7:33 UTC (permalink / raw)
To: Alistair Popple
Cc: intel-xe, dri-devel, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
On Fri, Oct 18, 2024 at 07:16:15AM +0000, Matthew Brost wrote:
> On Fri, Oct 18, 2024 at 04:59:05PM +1100, Alistair Popple wrote:
> >
> > Matthew Brost <matthew.brost@intel.com> writes:
> >
> > > On Fri, Oct 18, 2024 at 08:58:02AM +1100, Alistair Popple wrote:
> > >>
> > >> Matthew Brost <matthew.brost@intel.com> writes:
> > >>
> > >> > On Thu, Oct 17, 2024 at 04:49:11PM +1100, Alistair Popple wrote:
> > >> >>
> > >> >> Matthew Brost <matthew.brost@intel.com> writes:
> > >> >>
> > >> >> > On Thu, Oct 17, 2024 at 02:21:13PM +1100, Alistair Popple wrote:
> > >> >> >>
> > >> >> >> Matthew Brost <matthew.brost@intel.com> writes:
> > >> >> >>
> > >> >> >> > On Thu, Oct 17, 2024 at 12:49:55PM +1100, Alistair Popple wrote:
> > >> >> >> >>
> > >> >> >> >> Matthew Brost <matthew.brost@intel.com> writes:
> > >> >> >> >>
> > >> >> >> >> > On Wed, Oct 16, 2024 at 04:46:52AM +0000, Matthew Brost wrote:
> > >> >> >> >> >> On Wed, Oct 16, 2024 at 03:04:06PM +1100, Alistair Popple wrote:
> > >> >> >> >>
> > >> >> >> >> [...]
> > >> >> >> >>
> > >> >> >> >> >> > > +{
> > >> >> >> >> >> > > + unsigned long i;
> > >> >> >> >> >> > > +
> > >> >> >> >> >> > > + for (i = 0; i < npages; i++) {
> > >> >> >> >> >> > > + struct page *page = pfn_to_page(src_pfns[i]);
> > >> >> >> >> >> > > +
> > >> >> >> >> >> > > + if (!get_page_unless_zero(page)) {
> > >> >> >> >> >> > > + src_pfns[i] = 0;
> > >> >> >> >> >> > > + continue;
> > >> >> >> >> >> > > + }
> > >> >> >> >> >> > > +
> > >> >> >> >> >> > > + if (!trylock_page(page)) {
> > >> >> >> >> >> > > + src_pfns[i] = 0;
> > >> >> >> >> >> > > + put_page(page);
> > >> >> >> >> >> > > + continue;
> > >> >> >> >> >> > > + }
> > >> >> >> >> >> > > +
> > >> >> >> >> >> > > + src_pfns[i] = migrate_pfn(src_pfns[i]) | MIGRATE_PFN_MIGRATE;
> > >> >> >> >> >> >
> > >> >> >> >> >> > This needs to be converted to use a folio like
> > >> >> >> >> >> > migrate_device_range(). But more importantly this should be split out as
> > >> >> >> >> >> > a function that both migrate_device_range() and this function can call
> > >> >> >> >> >> > given this bit is identical.
> > >> >> >> >> >> >
> > >> >> >> >> >>
> > >> >> >> >> >> Missed the folio conversion and agree a helper shared between this
> > >> >> >> >> >> function and migrate_device_range would be a good idea. Let add that.
> > >> >> >> >> >>
> > >> >> >> >> >
> > >> >> >> >> > Alistair,
> > >> >> >> >> >
> > >> >> >> >> > Ok, I think now I want to go slightly different direction here to give
> > >> >> >> >> > GPUSVM a bit more control over several eviction scenarios.
> > >> >> >> >> >
> > >> >> >> >> > What if I exported the helper discussed above, e.g.,
> > >> >> >> >> >
> > >> >> >> >> > 905 unsigned long migrate_device_pfn_lock(unsigned long pfn)
> > >> >> >> >> > 906 {
> > >> >> >> >> > 907 struct folio *folio;
> > >> >> >> >> > 908
> > >> >> >> >> > 909 folio = folio_get_nontail_page(pfn_to_page(pfn));
> > >> >> >> >> > 910 if (!folio)
> > >> >> >> >> > 911 return 0;
> > >> >> >> >> > 912
> > >> >> >> >> > 913 if (!folio_trylock(folio)) {
> > >> >> >> >> > 914 folio_put(folio);
> > >> >> >> >> > 915 return 0;
> > >> >> >> >> > 916 }
> > >> >> >> >> > 917
> > >> >> >> >> > 918 return migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
> > >> >> >> >> > 919 }
> > >> >> >> >> > 920 EXPORT_SYMBOL(migrate_device_pfn_lock);
> > >> >> >> >> >
> > >> >> >> >> > And then also export migrate_device_unmap.
> > >> >> >> >> >
> > >> >> >> >> > The usage here would be let a driver collect the device pages in virtual
> > >> >> >> >> > address range via hmm_range_fault, lock device pages under notifier
> > >> >> >> >> > lock ensuring device pages are valid, drop the notifier lock and call
> > >> >> >> >> > migrate_device_unmap.
> > >> >> >> >>
> > >> >> >> >> I'm still working through this series but that seems a bit dubious, the
> > >> >> >> >> locking here is pretty subtle and easy to get wrong so seeing some code
> > >> >> >> >> would help me a lot in understanding what you're suggesting.
> > >> >> >> >>
> > >> >> >> >
> > >> >> >> > For sure locking in tricky, my mistake on not working through this
> > >> >> >> > before sending out the next rev but it came to mind after sending +
> > >> >> >> > regarding some late feedback from Thomas about using hmm for eviction
> > >> >> >> > [2]. His suggestion of using hmm_range_fault to trigger migration
> > >> >> >> > doesn't work for coherent pages, while something like below does.
> > >> >> >> >
> > >> >> >> > [2] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1125461
> > >> >> >> >
> > >> >> >> > Here is a snippet I have locally which seems to work.
> > >> >> >> >
> > >> >> >> > 2024 retry:
> > >> >> >> > 2025 hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
> > >> >> >> > 2026 hmm_range.hmm_pfns = src;
> > >> >> >> > 2027
> > >> >> >> > 2028 while (true) {
> > >> >> >> > 2029 mmap_read_lock(mm);
> > >> >> >> > 2030 err = hmm_range_fault(&hmm_range);
> > >> >> >> > 2031 mmap_read_unlock(mm);
> > >> >> >> > 2032 if (err == -EBUSY) {
> > >> >> >> > 2033 if (time_after(jiffies, timeout))
> > >> >> >> > 2034 break;
> > >> >> >> > 2035
> > >> >> >> > 2036 hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
> > >> >> >> > 2037 continue;
> > >> >> >> > 2038 }
> > >> >> >> > 2039 break;
> > >> >> >> > 2040 }
> > >> >> >> > 2041 if (err)
> > >> >> >> > 2042 goto err_put;
> > >> >> >> > 2043
> > >> >> >> > 2044 drm_gpusvm_notifier_lock(gpusvm);
> > >> >> >> > 2045 if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
> > >> >> >> > 2046 drm_gpusvm_notifier_unlock(gpusvm);
> > >> >> >> > 2047 memset(src, 0, sizeof(*src) * npages);
> > >> >> >> > 2048 goto retry;
> > >> >> >> > 2049 }
> > >> >> >> > 2050 for (i = 0; i < npages; ++i) {
> > >> >> >> > 2051 struct page *page = hmm_pfn_to_page(src[i]);
> > >> >> >> > 2052
> > >> >> >> > 2053 if (page && (is_device_private_page(page) ||
> > >> >> >> > 2054 is_device_coherent_page(page)) && page->zone_device_data)
> > >> >> >> > 2055 src[i] = src[i] & ~HMM_PFN_FLAGS;
> > >> >> >> > 2056 else
> > >> >> >> > 2057 src[i] = 0;
> > >> >> >> > 2058 if (src[i])
> > >> >> >> > 2059 src[i] = migrate_device_pfn_lock(src[i]);
> > >> >> >> > 2060 }
> > >> >> >> > 2061 drm_gpusvm_notifier_unlock(gpusvm);
> > >> >> >>
> > >> >> >> Practically for eviction isn't this much the same as calling
> > >> >> >> migrate_vma_setup()? And also for eviction as Sima mentioned you
> > >> >> >> probably shouldn't be looking at mm/vma structs.
> > >> >> >>
> > >> >> >
> > >> >> > hmm_range_fault is just collecting the pages, internally I suppose it
> > >> >> > does look at a VMA (struct vm_area_struct) but I think the point is
> > >> >> > drivers should not be looking at VMA here.
> > >> >>
> > >> >> migrate_vma_setup() is designed to be called by drivers and needs a vma,
> > >> >> so in general I don't see a problem with drivers looking up vma's. The
> > >> >> problem arises specifically for eviction and whether or not that happens
> > >> >> in the driver or hmm_range_fault() is pretty irrelevant IMHO for the
> > >> >> issues there (see below).
> > >> >>
> > >> >
> > >> > Ok, if you think it ok for drivers to lookup the VMA then purposed
> > >> > exporting of migrate_device_pfn_lock & migrate_device_unmap is not
> > >> > needed, rather just the original function exported in the this patch.
> > >> >
> > >> > More below too.
> > >> >
> > >> >> >> > 2063 migrate_device_unmap(src, npages, NULL);
> > >> >> >> > ...
> > >> >> >> > 2101 migrate_device_pages(src, dst, npages);
> > >> >> >> > 2102 migrate_device_finalize(src, dst, npages);
> > >> >> >> >
> > >> >> >> >
> > >> >> >> >> > Sima has strongly suggested avoiding a CPUVMA
> > >> >> >> >> > lookup during eviction cases and this would let me fixup
> > >> >> >> >> > drm_gpusvm_range_evict in [1] to avoid this.
> > >> >> >> >>
> > >> >> >> >> That sounds reasonable but for context do you have a link to the
> > >> >> >> >> comments/discussion on this? I couldn't readily find it, but I may have
> > >> >> >> >> just missed it.
> > >> >> >> >>
> > >> >> >> >
> > >> >> >> > See in [4], search for '2. eviction' comment from sima.
> > >> >> >>
> > >> >> >> Thanks for pointing that out. For reference here's Sima's comment:
> > >> >> >>
> > >> >> >> > 2. eviction
> > >> >> >> >
> > >> >> >> > Requirements much like migrate_to_ram, because otherwise we break the
> > >> >> >> > migration gurantee:
> > >> >> >> >
> > >> >> >> > - Only looking at physical memory datastructures and locks, no looking at
> > >> >> >> > mm/vma structs or relying on those being locked. We rely entirely on
> > >> >> >> > reverse maps from try_to_migrate to find all the mappings on both cpu
> > >> >> >> > and gpu side (cpu only zone device swap or migration pte entries ofc).
> > >> >> >>
> > >> >> >> I also very much agree with this. That's basically why I added
> > >> >> >> migrate_device_range(), so that we can forcibly evict pages when the
> > >> >> >> driver needs them freed (eg. driver unload, low memory, etc.). In
> > >> >> >> general it is impossible to guarantee eviction og all pages using just
> > >> >> >> hmm_range_fault().
> > >> >> >>
> > >> >> >
> > >> >> > In this code path we don't have device pages available, hence the
> > >> >> > purposed collection via hmm_range_fault.
> > >> >>
> > >> >> Why don't you have the pfns requiring eviction available? I need to read
> > >> >> this series in more depth, but generally hmm_range_fault() can't
> > >> >> gurantee you will find every device page.
> > >> >>
> > >> >
> > >> > There are two cases for eviction in my series:
> > >> >
> > >> > 1. TTM decides it needs to move memory. This calls
> > >> > drm_gpusvm_evict_to_ram. In this case the device pfns are available
> > >> > directly from drm_gpusvm_devmem so the migrate_device_* calls be used
> > >> > here albiet with the new function added in this patch as device pfns may
> > >> > be non-contiguous.
> > >>
> > >> That makes sense and is generally what I think of when I'm thinking of
> > >> eviction. The new function makes sense too - migrate_device_range() was
> > >> primarily introduced to allow a driver to evict all device-private pages
> > >> from a GPU so didn't consider non-contiguous cases, etc.
> > >>
> > >> > 2. An inconsistent state for VA range occurs (mixed system and device pages,
> > >> > partial unmap of a range, etc...). Here we want to evict the range ram
> > >> > to make the state consistent. No device pages are available due to an
> > >> > intentional disconnect between a virtual range and physical
> > >> > drm_gpusvm_devmem, thus the device pages have to be looked up. This the
> > >> > function drm_gpusvm_range_evict. Based on what you tell me, it likely is
> > >> > fine the way originally coded in v2 (vma lookup + migrate_vma_*) vs
> > >> > using hmm_range_fault like I have suggested here.
> > >>
> > >> Thanks for the explanation. I think vma lookup + migrate_vma_setup() is
> > >> fine for this usage and is exactly what you want - it was designed to
> > >> either select all the system memory pages or device-private pages within
> > >> a VA range and migrate them.
> > >>
> > >> FWIW I have toyed with the idea of a combined
> > >> hmm_range_fault()/migrate_vma_setup() front-end to the rest of the
> > >> migrate_vma_*() process but haven't come up with something nice as
> > >> yet. I don't think mixing the two in an open-coded fashion is a good
> > >> idea though, I'd rather we come up with a new API that addresses the
> > >> short-comings of migrate_vma_setup().
> > >>
> > >
> > > I think that would good. Here we actually need to lookup multiple VMAs
> > > and have a sequence of migrate_vma_* calls as it possible for VMAs to
> > > have changed after the driver range was created. It might be nice to
> > > hide the VMA lookup from the drivers with an API saying collect and
> > > migrate all pages of a type in a VA range much like hmm_range_fault. If
> > > the range spans multiple VMAs that would be hidden from the caller.
> >
> > Ok. I wasn't really considering multiple VMAs. UVM and Nouveau don't
> > really have a requirement to migrate across multiple VMAs but if that's
> > neccessary I think an API that hides that specifically for working with
> > migrate_vma_*() might make sense.
> >
>
> We can run into multiple VMA scenarios if a user does something rude
> like this:
>
> mmap 0x000000...0x1fffff -> fault migrates 2M to VRAM and creates an internal range to track
> munmap 0x080000...0x17ffff -> now we have two VMAs instead of one and the range has a hole in it
>
> In this scenario, which we believe to rare / unsual, we just evict
> remaining VRAM pages to SRAM, destroy range, and fixup on next GPU
> fault.
>
> > > Matt
> > >
> > >> > Note #2 may be removed or unnecessary at some point if we decide to add
> > >> > support for ininconsistent state in GPU SVM and Xe. Keeping it simple for
> > >> > now though. See 'Ranges with mixed system and device pages' in [5].
> >
> > As someone not very familiar with some of the DRM layers can I ask why
> > having virtual address ranges with a mix of system and device pages is
> > hard to support? It seems to me that in practice it might be quite
> > difficult to keep a VMA range as exclusively all in system memory or all
> > in device memory.
> >
Sorry for the double reply - I missed the 'VMA range as exclusively' comment.
I'm referring to a DRM GPU SVM range being exclusively in system or device
memory, not a VMA. A DRM GPU SVM range is relatively small and controlled
by the driver, via a table, and VMA boundaries. In Xe, we currently
support 4k, 64k, or 2M DRM GPU SVM ranges. So say a VMA is 1 GB; we
can have individual 2M ranges in either system or device memory based on
access patterns. Whatever the range allocation size is, that is the
migration granularity, with all pages in a range being exclusively in
system or device memory.
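To illustrate the granularity point, the bounds of a range are just the
chunk-aligned window around the faulting address clamped to the VMA - a
sketch with made-up names, not the GPU SVM API:

static void svm_range_bounds(unsigned long fault_addr, unsigned long chunk,
                             unsigned long vma_start, unsigned long vma_end,
                             unsigned long *start, unsigned long *end)
{
        /* chunk is the driver-chosen range size: SZ_4K, SZ_64K or SZ_2M. */
        *start = max(ALIGN_DOWN(fault_addr, chunk), vma_start);
        *end = min(ALIGN(fault_addr + 1, chunk), vma_end);
}

Migration is then done for that whole window at once, which is what keeps
every page in a range on the same side.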
Matt
>
> A few things that make this difficult are:
>
> - Our (Xe) bind code would need to be updated to support this
> - TTM / DRM buddy allocator doesn't support freeing / reallocation of
> individual pages rather aligned chunks of initial allocation size
> (e.g., 2M would be common allocation size).
> - Spliting ranges would add complications
>
> All workable problems but since we are writing a new common
> implementation trying to keep it as simple as possible for initial merge
> of the design. Almost certainly at some point we will add support for
> mixed ranges to the common GPU SVM layer with a driver choosing if it
> wants mixed or non-mixed ranges via a flag to function calls.
>
> wrt to being difficult keeping exclusively in system or vram, in
> addition to the above case the only other case I have found in which
> this occurs is CPU and GPU faults to same address range racing. This can
> cause hmm_range_fault to grab a set mixed pages. In this case again we
> do an eviction remaining page and restart the GPU fault.
>
> I don't have real workloads yet but I do have a very aggressive test
> case that intentionally does things which could break the design in a
> highly parallel manner and the design as work. Is it ideal? Maybe not.
> But getting in a simple design which we can build upon is the current
> goal.
>
> Matt
>
> > >> > [5] https://patchwork.freedesktop.org/patch/619819/?series=137870&rev=2
> > >> >
> > >> >> >> > [3] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1110726
> > >> >> >> > [4] https://lore.kernel.org/all/BYAPR11MB3159A304925168D8B6B4671292692@BYAPR11MB3159.namprd11.prod.outlook.com/T/#m89cd6a37778ba5271d5381ebeb03e1f963856a78
> > >> >> >> >
> > >> >> >> >> > It would also make the function exported in this patch unnecessary too
> > >> >> >> >> > as non-contiguous pfns can be setup on driver side via
> > >> >> >> >> > migrate_device_pfn_lock and then migrate_device_unmap can be called.
> > >> >> >> >> > This also another eviction usage in GPUSVM, see drm_gpusvm_evict_to_ram
> > >> >> >> >> > in [1].
> > >> >> >> >> >
> > >> >> >> >> > Do you see an issue exporting migrate_device_pfn_lock,
> > >> >> >> >> > migrate_device_unmap?
> > >> >> >> >>
> > >> >> >> >> If there is a good justification for it I can't see a problem with
> > >> >> >> >> exporting it. That said I don't really understand why you would
> > >> >> >> >> want/need to split those steps up but I'll wait to see the code.
> > >> >> >> >>
> > >> >> >> >
> > >> >> >> > It is so the device pages returned from hmm_range_fault, which are only
> > >> >> >> > guaranteed to be valid under the notifier lock + a seqno check, to be
> > >> >> >> > locked and ref taken for migration. migrate_device_unmap() can trigger a
> > >> >> >> > MMU invalidation which takes the notifier lock thus calling the function
> > >> >> >> > which combines migrate_device_pfn_lock + migrate_device_unmap deadlocks.
> > >> >> >> >
> > >> >> >> > I think this flow makes sense and agree in general this likely better
> > >> >> >> > than looking at a CPUVMA.
> > >> >> >>
> > >> >> >> I'm still a bit confused about what is better with this flow if you are
> > >> >> >> still calling hmm_range_fault(). How is it better than just calling
> > >> >> >> migrate_vma_setup()? Obviously it will fault the pages in, but it seems
> > >> >> >
> > >> >> > The code in rev2 calls migrate_vma_setup but the requires a struct
> > >> >> > vm_area_struct argument whereas hmm_range_fault does not.
> > >> >>
> > >> >> I'm not sure that's a good enough justfication because the problem isn't
> > >> >> whether you're looking up vma's in driver code or mm code. The problem
> > >> >> is you are looking up vma's at all and all that goes with that (mainly
> > >> >> taking mmap lock, etc.)
> > >> >>
> > >> >> And for eviction hmm_range_fault() won't even find all the pages because
> > >> >> their virtual address may have changed - consider what happens in cases
> > >> >> of mremap(), fork(), etc. So eviction really needs physical pages
> > >> >> (pfn's), not virtual addresses.
> > >> >>
> > >> >
> > >> > See above, #1 yes we use a physical pages. For #2 it is about making the
> > >> > state consistent within a virtual address range.
> > >>
> > >> Yep, makes sense now. For migration of physical pages you want
> > >> migrate_device_*, virtual address ranges want migrate_vma_*
> > >>
> > >> - Alistair
> > >>
> > >> > Matt
> > >> >
> > >> >> >> we're talking about eviction here so I don't understand why that would
> > >> >> >> be relevant. And hmm_range_fault() still requires the VMA, although I
> > >> >> >> need to look at the patches more closely, probably CPUVMA is a DRM
> > >> >> >
> > >> >> > 'hmm_range_fault() still requires the VMA' internal yes, but again not
> > >> >> > as argument. This is about avoiding a driver side lookup of the VMA.
> > >> >> >
> > >> >> > CPUVMA == struct vm_area_struct in this email.
> > >> >>
> > >> >> Thanks for the clarification.
> > >> >>
> > >> >> - Alistair
> > >> >>
> > >> >> > Matt
> > >> >> >
> > >> >> >> specific concept?
> > >> >> >>
> > >> >> >> Thanks.
> > >> >> >>
> > >> >> >> - Alistair
> > >> >> >>
> > >> >> >> > Matt
> > >> >> >> >
> > >> >> >> >> - Alistair
> > >> >> >> >>
> > >> >> >> >> > Matt
> > >> >> >> >> >
> > >> >> >> >> > [1] https://patchwork.freedesktop.org/patch/619809/?series=137870&rev=2
> > >> >> >> >> >
> > >> >> >> >> >> Matt
> > >> >> >> >> >>
> > >> >> >> >> >> > > + }
> > >> >> >> >> >> > > +
> > >> >> >> >> >> > > + migrate_device_unmap(src_pfns, npages, NULL);
> > >> >> >> >> >> > > +
> > >> >> >> >> >> > > + return 0;
> > >> >> >> >> >> > > +}
> > >> >> >> >> >> > > +EXPORT_SYMBOL(migrate_device_prepopulated_range);
> > >> >> >> >> >> > > +
> > >> >> >> >> >> > > /*
> > >> >> >> >> >> > > * Migrate a device coherent folio back to normal memory. The caller should have
> > >> >> >> >> >> > > * a reference on folio which will be copied to the new folio if migration is
> > >> >> >> >> >> >
> > >> >> >> >>
> > >> >> >>
> > >> >>
> > >>
> >
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [PATCH v2 02/29] mm/migrate: Add migrate_device_prepopulated_range
2024-10-18 7:16 ` Matthew Brost
2024-10-18 7:33 ` Matthew Brost
@ 2024-10-18 7:34 ` Alistair Popple
2024-10-18 7:57 ` Matthew Brost
1 sibling, 1 reply; 129+ messages in thread
From: Alistair Popple @ 2024-10-18 7:34 UTC (permalink / raw)
To: Matthew Brost
Cc: intel-xe, dri-devel, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
Matthew Brost <matthew.brost@intel.com> writes:
> On Fri, Oct 18, 2024 at 04:59:05PM +1100, Alistair Popple wrote:
>>
>> Matthew Brost <matthew.brost@intel.com> writes:
>>
>> > On Fri, Oct 18, 2024 at 08:58:02AM +1100, Alistair Popple wrote:
>> >>
>> >> Matthew Brost <matthew.brost@intel.com> writes:
>> >>
>> >> > On Thu, Oct 17, 2024 at 04:49:11PM +1100, Alistair Popple wrote:
>> >> >>
>> >> >> Matthew Brost <matthew.brost@intel.com> writes:
>> >> >>
>> >> >> > On Thu, Oct 17, 2024 at 02:21:13PM +1100, Alistair Popple wrote:
>> >> >> >>
>> >> >> >> Matthew Brost <matthew.brost@intel.com> writes:
>> >> >> >>
>> >> >> >> > On Thu, Oct 17, 2024 at 12:49:55PM +1100, Alistair Popple wrote:
>> >> >> >> >>
>> >> >> >> >> Matthew Brost <matthew.brost@intel.com> writes:
>> >> >> >> >>
>> >> >> >> >> > On Wed, Oct 16, 2024 at 04:46:52AM +0000, Matthew Brost wrote:
>> >> >> >> >> >> On Wed, Oct 16, 2024 at 03:04:06PM +1100, Alistair Popple wrote:
>> >> >> >> >>
>> >> >> >> >> [...]
>> >> >> >> >>
>> >> >> >> >> >> > > +{
>> >> >> >> >> >> > > + unsigned long i;
>> >> >> >> >> >> > > +
>> >> >> >> >> >> > > + for (i = 0; i < npages; i++) {
>> >> >> >> >> >> > > + struct page *page = pfn_to_page(src_pfns[i]);
>> >> >> >> >> >> > > +
>> >> >> >> >> >> > > + if (!get_page_unless_zero(page)) {
>> >> >> >> >> >> > > + src_pfns[i] = 0;
>> >> >> >> >> >> > > + continue;
>> >> >> >> >> >> > > + }
>> >> >> >> >> >> > > +
>> >> >> >> >> >> > > + if (!trylock_page(page)) {
>> >> >> >> >> >> > > + src_pfns[i] = 0;
>> >> >> >> >> >> > > + put_page(page);
>> >> >> >> >> >> > > + continue;
>> >> >> >> >> >> > > + }
>> >> >> >> >> >> > > +
>> >> >> >> >> >> > > + src_pfns[i] = migrate_pfn(src_pfns[i]) | MIGRATE_PFN_MIGRATE;
>> >> >> >> >> >> >
>> >> >> >> >> >> > This needs to be converted to use a folio like
>> >> >> >> >> >> > migrate_device_range(). But more importantly this should be split out as
>> >> >> >> >> >> > a function that both migrate_device_range() and this function can call
>> >> >> >> >> >> > given this bit is identical.
>> >> >> >> >> >> >
>> >> >> >> >> >>
>> >> >> >> >> >> Missed the folio conversion and agree a helper shared between this
>> >> >> >> >> >> function and migrate_device_range would be a good idea. Let add that.
>> >> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> > Alistair,
>> >> >> >> >> >
>> >> >> >> >> > Ok, I think now I want to go slightly different direction here to give
>> >> >> >> >> > GPUSVM a bit more control over several eviction scenarios.
>> >> >> >> >> >
>> >> >> >> >> > What if I exported the helper discussed above, e.g.,
>> >> >> >> >> >
>> >> >> >> >> > 905 unsigned long migrate_device_pfn_lock(unsigned long pfn)
>> >> >> >> >> > 906 {
>> >> >> >> >> > 907 struct folio *folio;
>> >> >> >> >> > 908
>> >> >> >> >> > 909 folio = folio_get_nontail_page(pfn_to_page(pfn));
>> >> >> >> >> > 910 if (!folio)
>> >> >> >> >> > 911 return 0;
>> >> >> >> >> > 912
>> >> >> >> >> > 913 if (!folio_trylock(folio)) {
>> >> >> >> >> > 914 folio_put(folio);
>> >> >> >> >> > 915 return 0;
>> >> >> >> >> > 916 }
>> >> >> >> >> > 917
>> >> >> >> >> > 918 return migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
>> >> >> >> >> > 919 }
>> >> >> >> >> > 920 EXPORT_SYMBOL(migrate_device_pfn_lock);
>> >> >> >> >> >
>> >> >> >> >> > And then also export migrate_device_unmap.
>> >> >> >> >> >
>> >> >> >> >> > The usage here would be let a driver collect the device pages in virtual
>> >> >> >> >> > address range via hmm_range_fault, lock device pages under notifier
>> >> >> >> >> > lock ensuring device pages are valid, drop the notifier lock and call
>> >> >> >> >> > migrate_device_unmap.
>> >> >> >> >>
>> >> >> >> >> I'm still working through this series but that seems a bit dubious, the
>> >> >> >> >> locking here is pretty subtle and easy to get wrong so seeing some code
>> >> >> >> >> would help me a lot in understanding what you're suggesting.
>> >> >> >> >>
>> >> >> >> >
>> >> >> >> > For sure locking in tricky, my mistake on not working through this
>> >> >> >> > before sending out the next rev but it came to mind after sending +
>> >> >> >> > regarding some late feedback from Thomas about using hmm for eviction
>> >> >> >> > [2]. His suggestion of using hmm_range_fault to trigger migration
>> >> >> >> > doesn't work for coherent pages, while something like below does.
>> >> >> >> >
>> >> >> >> > [2] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1125461
>> >> >> >> >
>> >> >> >> > Here is a snippet I have locally which seems to work.
>> >> >> >> >
>> >> >> >> > 2024 retry:
>> >> >> >> > 2025 hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
>> >> >> >> > 2026 hmm_range.hmm_pfns = src;
>> >> >> >> > 2027
>> >> >> >> > 2028 while (true) {
>> >> >> >> > 2029 mmap_read_lock(mm);
>> >> >> >> > 2030 err = hmm_range_fault(&hmm_range);
>> >> >> >> > 2031 mmap_read_unlock(mm);
>> >> >> >> > 2032 if (err == -EBUSY) {
>> >> >> >> > 2033 if (time_after(jiffies, timeout))
>> >> >> >> > 2034 break;
>> >> >> >> > 2035
>> >> >> >> > 2036 hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
>> >> >> >> > 2037 continue;
>> >> >> >> > 2038 }
>> >> >> >> > 2039 break;
>> >> >> >> > 2040 }
>> >> >> >> > 2041 if (err)
>> >> >> >> > 2042 goto err_put;
>> >> >> >> > 2043
>> >> >> >> > 2044 drm_gpusvm_notifier_lock(gpusvm);
>> >> >> >> > 2045 if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
>> >> >> >> > 2046 drm_gpusvm_notifier_unlock(gpusvm);
>> >> >> >> > 2047 memset(src, 0, sizeof(*src) * npages);
>> >> >> >> > 2048 goto retry;
>> >> >> >> > 2049 }
>> >> >> >> > 2050 for (i = 0; i < npages; ++i) {
>> >> >> >> > 2051 struct page *page = hmm_pfn_to_page(src[i]);
>> >> >> >> > 2052
>> >> >> >> > 2053 if (page && (is_device_private_page(page) ||
>> >> >> >> > 2054 is_device_coherent_page(page)) && page->zone_device_data)
>> >> >> >> > 2055 src[i] = src[i] & ~HMM_PFN_FLAGS;
>> >> >> >> > 2056 else
>> >> >> >> > 2057 src[i] = 0;
>> >> >> >> > 2058 if (src[i])
>> >> >> >> > 2059 src[i] = migrate_device_pfn_lock(src[i]);
>> >> >> >> > 2060 }
>> >> >> >> > 2061 drm_gpusvm_notifier_unlock(gpusvm);
>> >> >> >>
>> >> >> >> Practically for eviction isn't this much the same as calling
>> >> >> >> migrate_vma_setup()? And also for eviction as Sima mentioned you
>> >> >> >> probably shouldn't be looking at mm/vma structs.
>> >> >> >>
>> >> >> >
>> >> >> > hmm_range_fault is just collecting the pages, internally I suppose it
>> >> >> > does look at a VMA (struct vm_area_struct) but I think the point is
>> >> >> > drivers should not be looking at VMA here.
>> >> >>
>> >> >> migrate_vma_setup() is designed to be called by drivers and needs a vma,
>> >> >> so in general I don't see a problem with drivers looking up vma's. The
>> >> >> problem arises specifically for eviction and whether or not that happens
>> >> >> in the driver or hmm_range_fault() is pretty irrelevant IMHO for the
>> >> >> issues there (see below).
>> >> >>
>> >> >
> >> >> > Ok, if you think it ok for drivers to lookup the VMA then the proposed
> >> >> > exporting of migrate_device_pfn_lock & migrate_device_unmap is not
> >> >> > needed, rather just the original function exported in this patch.
>> >> >
>> >> > More below too.
>> >> >
>> >> >> >> > 2063 migrate_device_unmap(src, npages, NULL);
>> >> >> >> > ...
>> >> >> >> > 2101 migrate_device_pages(src, dst, npages);
>> >> >> >> > 2102 migrate_device_finalize(src, dst, npages);
>> >> >> >> >
>> >> >> >> >
>> >> >> >> >> > Sima has strongly suggested avoiding a CPUVMA
>> >> >> >> >> > lookup during eviction cases and this would let me fixup
>> >> >> >> >> > drm_gpusvm_range_evict in [1] to avoid this.
>> >> >> >> >>
>> >> >> >> >> That sounds reasonable but for context do you have a link to the
>> >> >> >> >> comments/discussion on this? I couldn't readily find it, but I may have
>> >> >> >> >> just missed it.
>> >> >> >> >>
>> >> >> >> >
>> >> >> >> > See in [4], search for '2. eviction' comment from sima.
>> >> >> >>
>> >> >> >> Thanks for pointing that out. For reference here's Sima's comment:
>> >> >> >>
>> >> >> >> > 2. eviction
>> >> >> >> >
>> >> >> >> > Requirements much like migrate_to_ram, because otherwise we break the
> >> >> >> >> > migration guarantee:
>> >> >> >> >
>> >> >> >> > - Only looking at physical memory datastructures and locks, no looking at
>> >> >> >> > mm/vma structs or relying on those being locked. We rely entirely on
>> >> >> >> > reverse maps from try_to_migrate to find all the mappings on both cpu
>> >> >> >> > and gpu side (cpu only zone device swap or migration pte entries ofc).
>> >> >> >>
>> >> >> >> I also very much agree with this. That's basically why I added
>> >> >> >> migrate_device_range(), so that we can forcibly evict pages when the
>> >> >> >> driver needs them freed (eg. driver unload, low memory, etc.). In
> >> >> >> >> general it is impossible to guarantee eviction of all pages using just
>> >> >> >> hmm_range_fault().
>> >> >> >>
>> >> >> >
>> >> >> > In this code path we don't have device pages available, hence the
> >> >> >> > proposed collection via hmm_range_fault.
>> >> >>
>> >> >> Why don't you have the pfns requiring eviction available? I need to read
>> >> >> this series in more depth, but generally hmm_range_fault() can't
> >> >> >> guarantee you will find every device page.
>> >> >>
>> >> >
>> >> > There are two cases for eviction in my series:
>> >> >
>> >> > 1. TTM decides it needs to move memory. This calls
>> >> > drm_gpusvm_evict_to_ram. In this case the device pfns are available
> >> >> > directly from drm_gpusvm_devmem so the migrate_device_* calls can be used
> >> >> > here, albeit with the new function added in this patch as device pfns may
>> >> > be non-contiguous.
>> >>
>> >> That makes sense and is generally what I think of when I'm thinking of
>> >> eviction. The new function makes sense too - migrate_device_range() was
>> >> primarily introduced to allow a driver to evict all device-private pages
>> >> from a GPU so didn't consider non-contiguous cases, etc.
>> >>
>> >> > 2. An inconsistent state for VA range occurs (mixed system and device pages,
> >> >> > partial unmap of a range, etc...). Here we want to evict the range to ram
>> >> > to make the state consistent. No device pages are available due to an
>> >> > intentional disconnect between a virtual range and physical
> >> >> > drm_gpusvm_devmem, thus the device pages have to be looked up. This is the
>> >> > function drm_gpusvm_range_evict. Based on what you tell me, it likely is
>> >> > fine the way originally coded in v2 (vma lookup + migrate_vma_*) vs
>> >> > using hmm_range_fault like I have suggested here.
>> >>
>> >> Thanks for the explanation. I think vma lookup + migrate_vma_setup() is
>> >> fine for this usage and is exactly what you want - it was designed to
>> >> either select all the system memory pages or device-private pages within
>> >> a VA range and migrate them.
>> >>
>> >> FWIW I have toyed with the idea of a combined
>> >> hmm_range_fault()/migrate_vma_setup() front-end to the rest of the
>> >> migrate_vma_*() process but haven't come up with something nice as
>> >> yet. I don't think mixing the two in an open-coded fashion is a good
>> >> idea though, I'd rather we come up with a new API that addresses the
>> >> short-comings of migrate_vma_setup().
>> >>
>> >
> >> > I think that would be good. Here we actually need to lookup multiple VMAs
> >> > and have a sequence of migrate_vma_* calls, as it is possible for VMAs to
>> > have changed after the driver range was created. It might be nice to
>> > hide the VMA lookup from the drivers with an API saying collect and
>> > migrate all pages of a type in a VA range much like hmm_range_fault. If
>> > the range spans multiple VMAs that would be hidden from the caller.
>>
>> Ok. I wasn't really considering multiple VMAs. UVM and Nouveau don't
>> really have a requirement to migrate across multiple VMAs but if that's
>> neccessary I think an API that hides that specifically for working with
>> migrate_vma_*() might make sense.
>>
>
> We can run into multiple VMA scenarios if a user does something rude
> like this:
fork and mremap were the other "rude" things we've had fun with. They
basically mean you can get references to device pages which a driver
can't track with virtual addresses.
> mmap 0x000000...0x1fffff -> fault migrates 2M to VRAM and creates an internal range to track
> munmap 0x080000...0x17ffff -> now we have two VMAs instead of one and the range has a hole in it
>
> In this scenario, which we believe to be rare / unusual, we just evict
> remaining VRAM pages to SRAM, destroy range, and fixup on next GPU
> fault.
>
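For illustration, the userspace side of that sequence is only a few syscalls.
The sketch below is purely illustrative (not driver code), and the 2M size and
offsets are just the example values quoted above:

/*
 * Illustrative only: split one 2M mapping into two VMAs with a hole,
 * then fork(). It only shows why VA-based range tracking in the driver
 * can stop covering every reference to the migrated pages.
 */
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t sz = 2 * 1024 * 1024;	/* the 2M range from the example above */
	char *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;

	/* In the SVM flow, a GPU fault here would migrate the 2M to VRAM. */
	memset(p, 1, sz);

	/* munmap the middle chunk: one VMA becomes two, the range has a hole. */
	munmap(p + 0x80000, 0x100000);

	/*
	 * fork() leaves copy-on-write references to the remaining pages in the
	 * child, i.e. references the driver cannot find from the parent's
	 * virtual addresses alone (mremap() similarly moves pages to virtual
	 * addresses the driver never saw).
	 */
	if (fork() == 0)
		_exit(0);

	return 0;
}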
>> > Matt
>> >
>> >> > Note #2 may be removed or unnecessary at some point if we decide to add
> >> >> > support for inconsistent state in GPU SVM and Xe. Keeping it simple for
>> >> > now though. See 'Ranges with mixed system and device pages' in [5].
>>
>> As someone not very familiar with some of the DRM layers can I ask why
>> having virtual address ranges with a mix of system and device pages is
>> hard to support? It seems to me that in practice it might be quite
>> difficult to keep a VMA range as exclusively all in system memory or all
>> in device memory.
>>
>
> A few things that make this difficult are:
>
> - Our (Xe) bind code would need to be updated to support this
> - TTM / DRM buddy allocator doesn't support freeing / reallocation of
> individual pages, rather than aligned chunks of the initial allocation size
> (e.g., 2M would be a common allocation size).
> - Splitting ranges would add complications
>
> All workable problems, but since we are writing a new common
> implementation we are trying to keep it as simple as possible for the initial
> merge of the design. Almost certainly at some point we will add support for
> mixed ranges to the common GPU SVM layer with a driver choosing if it
> wants mixed or non-mixed ranges via a flag to function calls.
>
> wrt it being difficult to keep a range exclusively in system memory or vram, in
> addition to the above case, the only other case I have found in which
> this occurs is CPU and GPU faults to the same address range racing. This can
> cause hmm_range_fault to grab a set of mixed pages. In this case again we
> do an eviction of the remaining pages and restart the GPU fault.
>
> I don't have real workloads yet but I do have a very aggressive test
> case that intentionally does things which could break the design in a
> highly parallel manner, and the design has worked. Is it ideal? Maybe not.
> But getting in a simple design which we can build upon is the current
> goal.
Taking a simple approach first definitely sounds like the right approach
to me. I was just interested in the background because it wasn't
something I'd run into (though we built on top of something quite
different to the DRM layer). But I have often thought that the
interfaces we have between core mm and GPU drivers are still a bit too
low level at the moment and are calling out for a slightly higher level
common implementation in the middle, so I am very interested to see where
this all goes. Thanks.
- Alistair
> Matt
>
>> >> > [5] https://patchwork.freedesktop.org/patch/619819/?series=137870&rev=2
>> >> >
>> >> >> >> > [3] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1110726
>> >> >> >> > [4] https://lore.kernel.org/all/BYAPR11MB3159A304925168D8B6B4671292692@BYAPR11MB3159.namprd11.prod.outlook.com/T/#m89cd6a37778ba5271d5381ebeb03e1f963856a78
>> >> >> >> >
>> >> >> >> >> > It would also make the function exported in this patch unnecessary too
>> >> >> >> >> > as non-contiguous pfns can be setup on driver side via
>> >> >> >> >> > migrate_device_pfn_lock and then migrate_device_unmap can be called.
>> >> >> >> >> > This also another eviction usage in GPUSVM, see drm_gpusvm_evict_to_ram
>> >> >> >> >> > in [1].
>> >> >> >> >> >
>> >> >> >> >> > Do you see an issue exporting migrate_device_pfn_lock,
>> >> >> >> >> > migrate_device_unmap?
>> >> >> >> >>
>> >> >> >> >> If there is a good justification for it I can't see a problem with
>> >> >> >> >> exporting it. That said I don't really understand why you would
>> >> >> >> >> want/need to split those steps up but I'll wait to see the code.
>> >> >> >> >>
>> >> >> >> >
>> >> >> >> > It is so the device pages returned from hmm_range_fault, which are only
>> >> >> >> > guaranteed to be valid under the notifier lock + a seqno check, to be
>> >> >> >> > locked and ref taken for migration. migrate_device_unmap() can trigger a
>> >> >> >> > MMU invalidation which takes the notifier lock thus calling the function
>> >> >> >> > which combines migrate_device_pfn_lock + migrate_device_unmap deadlocks.
>> >> >> >> >
>> >> >> >> > I think this flow makes sense and agree in general this likely better
>> >> >> >> > than looking at a CPUVMA.
>> >> >> >>
>> >> >> >> I'm still a bit confused about what is better with this flow if you are
>> >> >> >> still calling hmm_range_fault(). How is it better than just calling
>> >> >> >> migrate_vma_setup()? Obviously it will fault the pages in, but it seems
>> >> >> >
>> >> >> > The code in rev2 calls migrate_vma_setup but the requires a struct
>> >> >> > vm_area_struct argument whereas hmm_range_fault does not.
>> >> >>
> >> >> >> I'm not sure that's a good enough justification because the problem isn't
>> >> >> whether you're looking up vma's in driver code or mm code. The problem
>> >> >> is you are looking up vma's at all and all that goes with that (mainly
>> >> >> taking mmap lock, etc.)
>> >> >>
>> >> >> And for eviction hmm_range_fault() won't even find all the pages because
>> >> >> their virtual address may have changed - consider what happens in cases
>> >> >> of mremap(), fork(), etc. So eviction really needs physical pages
>> >> >> (pfn's), not virtual addresses.
>> >> >>
>> >> >
>> >> > See above, #1 yes we use a physical pages. For #2 it is about making the
>> >> > state consistent within a virtual address range.
>> >>
>> >> Yep, makes sense now. For migration of physical pages you want
>> >> migrate_device_*, virtual address ranges want migrate_vma_*
>> >>
>> >> - Alistair
>> >>
>> >> > Matt
>> >> >
>> >> >> >> we're talking about eviction here so I don't understand why that would
>> >> >> >> be relevant. And hmm_range_fault() still requires the VMA, although I
>> >> >> >> need to look at the patches more closely, probably CPUVMA is a DRM
>> >> >> >
>> >> >> > 'hmm_range_fault() still requires the VMA' internal yes, but again not
>> >> >> > as argument. This is about avoiding a driver side lookup of the VMA.
>> >> >> >
>> >> >> > CPUVMA == struct vm_area_struct in this email.
>> >> >>
>> >> >> Thanks for the clarification.
>> >> >>
>> >> >> - Alistair
>> >> >>
>> >> >> > Matt
>> >> >> >
>> >> >> >> specific concept?
>> >> >> >>
>> >> >> >> Thanks.
>> >> >> >>
>> >> >> >> - Alistair
>> >> >> >>
>> >> >> >> > Matt
>> >> >> >> >
>> >> >> >> >> - Alistair
>> >> >> >> >>
>> >> >> >> >> > Matt
>> >> >> >> >> >
>> >> >> >> >> > [1] https://patchwork.freedesktop.org/patch/619809/?series=137870&rev=2
>> >> >> >> >> >
>> >> >> >> >> >> Matt
>> >> >> >> >> >>
>> >> >> >> >> >> > > + }
>> >> >> >> >> >> > > +
>> >> >> >> >> >> > > + migrate_device_unmap(src_pfns, npages, NULL);
>> >> >> >> >> >> > > +
>> >> >> >> >> >> > > + return 0;
>> >> >> >> >> >> > > +}
>> >> >> >> >> >> > > +EXPORT_SYMBOL(migrate_device_prepopulated_range);
>> >> >> >> >> >> > > +
>> >> >> >> >> >> > > /*
>> >> >> >> >> >> > > * Migrate a device coherent folio back to normal memory. The caller should have
>> >> >> >> >> >> > > * a reference on folio which will be copied to the new folio if migration is
>> >> >> >> >> >> >
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>>
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [PATCH v2 02/29] mm/migrate: Add migrate_device_prepopulated_range
2024-10-18 7:34 ` Alistair Popple
@ 2024-10-18 7:57 ` Matthew Brost
0 siblings, 0 replies; 129+ messages in thread
From: Matthew Brost @ 2024-10-18 7:57 UTC (permalink / raw)
To: Alistair Popple
Cc: intel-xe, dri-devel, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
On Fri, Oct 18, 2024 at 06:34:13PM +1100, Alistair Popple wrote:
>
> Matthew Brost <matthew.brost@intel.com> writes:
>
> > On Fri, Oct 18, 2024 at 04:59:05PM +1100, Alistair Popple wrote:
> >>
> >> Matthew Brost <matthew.brost@intel.com> writes:
> >>
> >> > On Fri, Oct 18, 2024 at 08:58:02AM +1100, Alistair Popple wrote:
> >> >>
> >> >> Matthew Brost <matthew.brost@intel.com> writes:
> >> >>
> >> >> > On Thu, Oct 17, 2024 at 04:49:11PM +1100, Alistair Popple wrote:
> >> >> >>
> >> >> >> Matthew Brost <matthew.brost@intel.com> writes:
> >> >> >>
> >> >> >> > On Thu, Oct 17, 2024 at 02:21:13PM +1100, Alistair Popple wrote:
> >> >> >> >>
> >> >> >> >> Matthew Brost <matthew.brost@intel.com> writes:
> >> >> >> >>
> >> >> >> >> > On Thu, Oct 17, 2024 at 12:49:55PM +1100, Alistair Popple wrote:
> >> >> >> >> >>
> >> >> >> >> >> Matthew Brost <matthew.brost@intel.com> writes:
> >> >> >> >> >>
> >> >> >> >> >> > On Wed, Oct 16, 2024 at 04:46:52AM +0000, Matthew Brost wrote:
> >> >> >> >> >> >> On Wed, Oct 16, 2024 at 03:04:06PM +1100, Alistair Popple wrote:
> >> >> >> >> >>
> >> >> >> >> >> [...]
> >> >> >> >> >>
> >> >> >> >> >> >> > > +{
> >> >> >> >> >> >> > > + unsigned long i;
> >> >> >> >> >> >> > > +
> >> >> >> >> >> >> > > + for (i = 0; i < npages; i++) {
> >> >> >> >> >> >> > > + struct page *page = pfn_to_page(src_pfns[i]);
> >> >> >> >> >> >> > > +
> >> >> >> >> >> >> > > + if (!get_page_unless_zero(page)) {
> >> >> >> >> >> >> > > + src_pfns[i] = 0;
> >> >> >> >> >> >> > > + continue;
> >> >> >> >> >> >> > > + }
> >> >> >> >> >> >> > > +
> >> >> >> >> >> >> > > + if (!trylock_page(page)) {
> >> >> >> >> >> >> > > + src_pfns[i] = 0;
> >> >> >> >> >> >> > > + put_page(page);
> >> >> >> >> >> >> > > + continue;
> >> >> >> >> >> >> > > + }
> >> >> >> >> >> >> > > +
> >> >> >> >> >> >> > > + src_pfns[i] = migrate_pfn(src_pfns[i]) | MIGRATE_PFN_MIGRATE;
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > This needs to be converted to use a folio like
> >> >> >> >> >> >> > migrate_device_range(). But more importantly this should be split out as
> >> >> >> >> >> >> > a function that both migrate_device_range() and this function can call
> >> >> >> >> >> >> > given this bit is identical.
> >> >> >> >> >> >> >
> >> >> >> >> >> >>
> >> >> >> >> >> >> Missed the folio conversion and agree a helper shared between this
> >> >> >> >> >> >> function and migrate_device_range would be a good idea. Let add that.
> >> >> >> >> >> >>
> >> >> >> >> >> >
> >> >> >> >> >> > Alistair,
> >> >> >> >> >> >
> >> >> >> >> >> > Ok, I think now I want to go slightly different direction here to give
> >> >> >> >> >> > GPUSVM a bit more control over several eviction scenarios.
> >> >> >> >> >> >
> >> >> >> >> >> > What if I exported the helper discussed above, e.g.,
> >> >> >> >> >> >
> >> >> >> >> >> > 905 unsigned long migrate_device_pfn_lock(unsigned long pfn)
> >> >> >> >> >> > 906 {
> >> >> >> >> >> > 907 struct folio *folio;
> >> >> >> >> >> > 908
> >> >> >> >> >> > 909 folio = folio_get_nontail_page(pfn_to_page(pfn));
> >> >> >> >> >> > 910 if (!folio)
> >> >> >> >> >> > 911 return 0;
> >> >> >> >> >> > 912
> >> >> >> >> >> > 913 if (!folio_trylock(folio)) {
> >> >> >> >> >> > 914 folio_put(folio);
> >> >> >> >> >> > 915 return 0;
> >> >> >> >> >> > 916 }
> >> >> >> >> >> > 917
> >> >> >> >> >> > 918 return migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
> >> >> >> >> >> > 919 }
> >> >> >> >> >> > 920 EXPORT_SYMBOL(migrate_device_pfn_lock);
> >> >> >> >> >> >
> >> >> >> >> >> > And then also export migrate_device_unmap.
> >> >> >> >> >> >
> >> >> >> >> >> > The usage here would be let a driver collect the device pages in virtual
> >> >> >> >> >> > address range via hmm_range_fault, lock device pages under notifier
> >> >> >> >> >> > lock ensuring device pages are valid, drop the notifier lock and call
> >> >> >> >> >> > migrate_device_unmap.
> >> >> >> >> >>
> >> >> >> >> >> I'm still working through this series but that seems a bit dubious, the
> >> >> >> >> >> locking here is pretty subtle and easy to get wrong so seeing some code
> >> >> >> >> >> would help me a lot in understanding what you're suggesting.
> >> >> >> >> >>
> >> >> >> >> >
> >> >> >> >> > For sure locking in tricky, my mistake on not working through this
> >> >> >> >> > before sending out the next rev but it came to mind after sending +
> >> >> >> >> > regarding some late feedback from Thomas about using hmm for eviction
> >> >> >> >> > [2]. His suggestion of using hmm_range_fault to trigger migration
> >> >> >> >> > doesn't work for coherent pages, while something like below does.
> >> >> >> >> >
> >> >> >> >> > [2] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1125461
> >> >> >> >> >
> >> >> >> >> > Here is a snippet I have locally which seems to work.
> >> >> >> >> >
> >> >> >> >> > 2024 retry:
> >> >> >> >> > 2025 hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
> >> >> >> >> > 2026 hmm_range.hmm_pfns = src;
> >> >> >> >> > 2027
> >> >> >> >> > 2028 while (true) {
> >> >> >> >> > 2029 mmap_read_lock(mm);
> >> >> >> >> > 2030 err = hmm_range_fault(&hmm_range);
> >> >> >> >> > 2031 mmap_read_unlock(mm);
> >> >> >> >> > 2032 if (err == -EBUSY) {
> >> >> >> >> > 2033 if (time_after(jiffies, timeout))
> >> >> >> >> > 2034 break;
> >> >> >> >> > 2035
> >> >> >> >> > 2036 hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
> >> >> >> >> > 2037 continue;
> >> >> >> >> > 2038 }
> >> >> >> >> > 2039 break;
> >> >> >> >> > 2040 }
> >> >> >> >> > 2041 if (err)
> >> >> >> >> > 2042 goto err_put;
> >> >> >> >> > 2043
> >> >> >> >> > 2044 drm_gpusvm_notifier_lock(gpusvm);
> >> >> >> >> > 2045 if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
> >> >> >> >> > 2046 drm_gpusvm_notifier_unlock(gpusvm);
> >> >> >> >> > 2047 memset(src, 0, sizeof(*src) * npages);
> >> >> >> >> > 2048 goto retry;
> >> >> >> >> > 2049 }
> >> >> >> >> > 2050 for (i = 0; i < npages; ++i) {
> >> >> >> >> > 2051 struct page *page = hmm_pfn_to_page(src[i]);
> >> >> >> >> > 2052
> >> >> >> >> > 2053 if (page && (is_device_private_page(page) ||
> >> >> >> >> > 2054 is_device_coherent_page(page)) && page->zone_device_data)
> >> >> >> >> > 2055 src[i] = src[i] & ~HMM_PFN_FLAGS;
> >> >> >> >> > 2056 else
> >> >> >> >> > 2057 src[i] = 0;
> >> >> >> >> > 2058 if (src[i])
> >> >> >> >> > 2059 src[i] = migrate_device_pfn_lock(src[i]);
> >> >> >> >> > 2060 }
> >> >> >> >> > 2061 drm_gpusvm_notifier_unlock(gpusvm);
> >> >> >> >>
> >> >> >> >> Practically for eviction isn't this much the same as calling
> >> >> >> >> migrate_vma_setup()? And also for eviction as Sima mentioned you
> >> >> >> >> probably shouldn't be looking at mm/vma structs.
> >> >> >> >>
> >> >> >> >
> >> >> >> > hmm_range_fault is just collecting the pages, internally I suppose it
> >> >> >> > does look at a VMA (struct vm_area_struct) but I think the point is
> >> >> >> > drivers should not be looking at VMA here.
> >> >> >>
> >> >> >> migrate_vma_setup() is designed to be called by drivers and needs a vma,
> >> >> >> so in general I don't see a problem with drivers looking up vma's. The
> >> >> >> problem arises specifically for eviction and whether or not that happens
> >> >> >> in the driver or hmm_range_fault() is pretty irrelevant IMHO for the
> >> >> >> issues there (see below).
> >> >> >>
> >> >> >
> >> >> > Ok, if you think it ok for drivers to lookup the VMA then the proposed
> >> >> > exporting of migrate_device_pfn_lock & migrate_device_unmap is not
> >> >> > needed, rather just the original function exported in this patch.
> >> >> >
> >> >> > More below too.
> >> >> >
> >> >> >> >> > 2063 migrate_device_unmap(src, npages, NULL);
> >> >> >> >> > ...
> >> >> >> >> > 2101 migrate_device_pages(src, dst, npages);
> >> >> >> >> > 2102 migrate_device_finalize(src, dst, npages);
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> >> > Sima has strongly suggested avoiding a CPUVMA
> >> >> >> >> >> > lookup during eviction cases and this would let me fixup
> >> >> >> >> >> > drm_gpusvm_range_evict in [1] to avoid this.
> >> >> >> >> >>
> >> >> >> >> >> That sounds reasonable but for context do you have a link to the
> >> >> >> >> >> comments/discussion on this? I couldn't readily find it, but I may have
> >> >> >> >> >> just missed it.
> >> >> >> >> >>
> >> >> >> >> >
> >> >> >> >> > See in [4], search for '2. eviction' comment from sima.
> >> >> >> >>
> >> >> >> >> Thanks for pointing that out. For reference here's Sima's comment:
> >> >> >> >>
> >> >> >> >> > 2. eviction
> >> >> >> >> >
> >> >> >> >> > Requirements much like migrate_to_ram, because otherwise we break the
> >> >> >> >> > migration guarantee:
> >> >> >> >> >
> >> >> >> >> > - Only looking at physical memory datastructures and locks, no looking at
> >> >> >> >> > mm/vma structs or relying on those being locked. We rely entirely on
> >> >> >> >> > reverse maps from try_to_migrate to find all the mappings on both cpu
> >> >> >> >> > and gpu side (cpu only zone device swap or migration pte entries ofc).
> >> >> >> >>
> >> >> >> >> I also very much agree with this. That's basically why I added
> >> >> >> >> migrate_device_range(), so that we can forcibly evict pages when the
> >> >> >> >> driver needs them freed (eg. driver unload, low memory, etc.). In
> >> >> >> >> general it is impossible to guarantee eviction of all pages using just
> >> >> >> >> hmm_range_fault().
> >> >> >> >>
> >> >> >> >
> >> >> >> > In this code path we don't have device pages available, hence the
> >> >> >> > proposed collection via hmm_range_fault.
> >> >> >>
> >> >> >> Why don't you have the pfns requiring eviction available? I need to read
> >> >> >> this series in more depth, but generally hmm_range_fault() can't
> >> >> >> guarantee you will find every device page.
> >> >> >>
> >> >> >
> >> >> > There are two cases for eviction in my series:
> >> >> >
> >> >> > 1. TTM decides it needs to move memory. This calls
> >> >> > drm_gpusvm_evict_to_ram. In this case the device pfns are available
> >> >> > directly from drm_gpusvm_devmem so the migrate_device_* calls can be used
> >> >> > here, albeit with the new function added in this patch as device pfns may
> >> >> > be non-contiguous.
> >> >>
> >> >> That makes sense and is generally what I think of when I'm thinking of
> >> >> eviction. The new function makes sense too - migrate_device_range() was
> >> >> primarily introduced to allow a driver to evict all device-private pages
> >> >> from a GPU so didn't consider non-contiguous cases, etc.
> >> >>
> >> >> > 2. An inconsistent state for VA range occurs (mixed system and device pages,
> >> >> > partial unmap of a range, etc...). Here we want to evict the range to ram
> >> >> > to make the state consistent. No device pages are available due to an
> >> >> > intentional disconnect between a virtual range and physical
> >> >> > drm_gpusvm_devmem, thus the device pages have to be looked up. This is the
> >> >> > function drm_gpusvm_range_evict. Based on what you tell me, it likely is
> >> >> > fine the way originally coded in v2 (vma lookup + migrate_vma_*) vs
> >> >> > using hmm_range_fault like I have suggested here.
> >> >>
> >> >> Thanks for the explanation. I think vma lookup + migrate_vma_setup() is
> >> >> fine for this usage and is exactly what you want - it was designed to
> >> >> either select all the system memory pages or device-private pages within
> >> >> a VA range and migrate them.
> >> >>
> >> >> FWIW I have toyed with the idea of a combined
> >> >> hmm_range_fault()/migrate_vma_setup() front-end to the rest of the
> >> >> migrate_vma_*() process but haven't come up with something nice as
> >> >> yet. I don't think mixing the two in an open-coded fashion is a good
> >> >> idea though, I'd rather we come up with a new API that addresses the
> >> >> short-comings of migrate_vma_setup().
> >> >>
> >> >
> >> > I think that would be good. Here we actually need to lookup multiple VMAs
> >> > and have a sequence of migrate_vma_* calls, as it is possible for VMAs to
> >> > have changed after the driver range was created. It might be nice to
> >> > hide the VMA lookup from the drivers with an API saying collect and
> >> > migrate all pages of a type in a VA range much like hmm_range_fault. If
> >> > the range spans multiple VMAs that would be hidden from the caller.
> >>
> >> Ok. I wasn't really considering multiple VMAs. UVM and Nouveau don't
> >> really have a requirement to migrate across multiple VMAs but if that's
> >> neccessary I think an API that hides that specifically for working with
> >> migrate_vma_*() might make sense.
> >>
> >
> > We can run into multiple VMA scenarios if a user does something rude
> > like this:
>
> fork and mremap were the other "rude" things we've had fun with. They
> basically mean you can get references to device pages which a driver
> can't track with virtual addresses.
>
Yes, I've tested those two and they are fun as well. But both are COW
cases, which haven't turned out to be too difficult.
> > mmap 0x000000...0x1fffff -> fault migrates 2M to VRAM and creates an internal range to track
> > munmap 0x080000...0x17ffff -> now we have two VMAs instead of one and the range has a hole in it
> >
> > In this scenario, which we believe to be rare / unusual, we just evict
> > remaining VRAM pages to SRAM, destroy range, and fixup on next GPU
> > fault.
> >
> >> > Matt
> >> >
> >> >> > Note #2 may be removed or unnecessary at some point if we decide to add
> >> >> > support for ininconsistent state in GPU SVM and Xe. Keeping it simple for
> >> >> > now though. See 'Ranges with mixed system and device pages' in [5].
> >>
> >> As someone not very familiar with some of the DRM layers can I ask why
> >> having virtual address ranges with a mix of system and device pages is
> >> hard to support? It seems to me that in practice it might be quite
> >> difficult to keep a VMA range as exclusively all in system memory or all
> >> in device memory.
> >>
> >
> > A few things that make this difficult are:
> >
> > - Our (Xe) bind code would need to be updated to support this
> > - TTM / DRM buddy allocator doesn't support freeing / reallocation of
> > individual pages, rather than aligned chunks of the initial allocation size
> > (e.g., 2M would be a common allocation size).
> > - Splitting ranges would add complications
> >
> > All workable problems, but since we are writing a new common
> > implementation we are trying to keep it as simple as possible for the initial
> > merge of the design. Almost certainly at some point we will add support for
> > mixed ranges to the common GPU SVM layer with a driver choosing if it
> > wants mixed or non-mixed ranges via a flag to function calls.
> >
> > wrt it being difficult to keep a range exclusively in system memory or vram, in
> > addition to the above case, the only other case I have found in which
> > this occurs is CPU and GPU faults to the same address range racing. This can
> > cause hmm_range_fault to grab a set of mixed pages. In this case again we
> > do an eviction of the remaining pages and restart the GPU fault.
> >
> > I don't have real workloads yet but I do have a very aggressive test
> > case that intentionally does things which could break the design in a
> > highly parallel manner, and the design has worked. Is it ideal? Maybe not.
> > But getting in a simple design which we can build upon is the current
> > goal.
>
> Taking a simple approach first definitely sounds like the right approach
> to me. I was just interested in the background because it wasn't
> something I'd run into (though we built on top of something quite
> different to the DRM layer). But I have often thought that the
> interfaces we have between core mm and GPU drivers are still a bit too
> low level at the moment and are calling out for a slightly higher level
> common implementation in the middle, so I am very interested to see where
> this all goes. Thanks.
The idea is to build something common which other DRM drivers can use, or
perhaps even pull out of the DRM layer eventually into a common device
layer for SVM.
Matt
>
> - Alistair
>
> > Matt
> >
> >> >> > [5] https://patchwork.freedesktop.org/patch/619819/?series=137870&rev=2
> >> >> >
> >> >> >> >> > [3] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1110726
> >> >> >> >> > [4] https://lore.kernel.org/all/BYAPR11MB3159A304925168D8B6B4671292692@BYAPR11MB3159.namprd11.prod.outlook.com/T/#m89cd6a37778ba5271d5381ebeb03e1f963856a78
> >> >> >> >> >
> >> >> >> >> >> > It would also make the function exported in this patch unnecessary too
> >> >> >> >> >> > as non-contiguous pfns can be setup on driver side via
> >> >> >> >> >> > migrate_device_pfn_lock and then migrate_device_unmap can be called.
> >> >> >> >> >> > This also another eviction usage in GPUSVM, see drm_gpusvm_evict_to_ram
> >> >> >> >> >> > in [1].
> >> >> >> >> >> >
> >> >> >> >> >> > Do you see an issue exporting migrate_device_pfn_lock,
> >> >> >> >> >> > migrate_device_unmap?
> >> >> >> >> >>
> >> >> >> >> >> If there is a good justification for it I can't see a problem with
> >> >> >> >> >> exporting it. That said I don't really understand why you would
> >> >> >> >> >> want/need to split those steps up but I'll wait to see the code.
> >> >> >> >> >>
> >> >> >> >> >
> >> >> >> >> > It is so the device pages returned from hmm_range_fault, which are only
> >> >> >> >> > guaranteed to be valid under the notifier lock + a seqno check, to be
> >> >> >> >> > locked and ref taken for migration. migrate_device_unmap() can trigger a
> >> >> >> >> > MMU invalidation which takes the notifier lock thus calling the function
> >> >> >> >> > which combines migrate_device_pfn_lock + migrate_device_unmap deadlocks.
> >> >> >> >> >
> >> >> >> >> > I think this flow makes sense and agree in general this likely better
> >> >> >> >> > than looking at a CPUVMA.
> >> >> >> >>
> >> >> >> >> I'm still a bit confused about what is better with this flow if you are
> >> >> >> >> still calling hmm_range_fault(). How is it better than just calling
> >> >> >> >> migrate_vma_setup()? Obviously it will fault the pages in, but it seems
> >> >> >> >
> >> >> >> > The code in rev2 calls migrate_vma_setup but the requires a struct
> >> >> >> > vm_area_struct argument whereas hmm_range_fault does not.
> >> >> >>
> >> >> >> I'm not sure that's a good enough justification because the problem isn't
> >> >> >> whether you're looking up vma's in driver code or mm code. The problem
> >> >> >> is you are looking up vma's at all and all that goes with that (mainly
> >> >> >> taking mmap lock, etc.)
> >> >> >>
> >> >> >> And for eviction hmm_range_fault() won't even find all the pages because
> >> >> >> their virtual address may have changed - consider what happens in cases
> >> >> >> of mremap(), fork(), etc. So eviction really needs physical pages
> >> >> >> (pfn's), not virtual addresses.
> >> >> >>
> >> >> >
> >> >> > See above, #1 yes we use a physical pages. For #2 it is about making the
> >> >> > state consistent within a virtual address range.
> >> >>
> >> >> Yep, makes sense now. For migration of physical pages you want
> >> >> migrate_device_*, virtual address ranges want migrate_vma_*
> >> >>
> >> >> - Alistair
> >> >>
> >> >> > Matt
> >> >> >
> >> >> >> >> we're talking about eviction here so I don't understand why that would
> >> >> >> >> be relevant. And hmm_range_fault() still requires the VMA, although I
> >> >> >> >> need to look at the patches more closely, probably CPUVMA is a DRM
> >> >> >> >
> >> >> >> > 'hmm_range_fault() still requires the VMA' internal yes, but again not
> >> >> >> > as argument. This is about avoiding a driver side lookup of the VMA.
> >> >> >> >
> >> >> >> > CPUVMA == struct vm_area_struct in this email.
> >> >> >>
> >> >> >> Thanks for the clarification.
> >> >> >>
> >> >> >> - Alistair
> >> >> >>
> >> >> >> > Matt
> >> >> >> >
> >> >> >> >> specific concept?
> >> >> >> >>
> >> >> >> >> Thanks.
> >> >> >> >>
> >> >> >> >> - Alistair
> >> >> >> >>
> >> >> >> >> > Matt
> >> >> >> >> >
> >> >> >> >> >> - Alistair
> >> >> >> >> >>
> >> >> >> >> >> > Matt
> >> >> >> >> >> >
> >> >> >> >> >> > [1] https://patchwork.freedesktop.org/patch/619809/?series=137870&rev=2
> >> >> >> >> >> >
> >> >> >> >> >> >> Matt
> >> >> >> >> >> >>
> >> >> >> >> >> >> > > + }
> >> >> >> >> >> >> > > +
> >> >> >> >> >> >> > > + migrate_device_unmap(src_pfns, npages, NULL);
> >> >> >> >> >> >> > > +
> >> >> >> >> >> >> > > + return 0;
> >> >> >> >> >> >> > > +}
> >> >> >> >> >> >> > > +EXPORT_SYMBOL(migrate_device_prepopulated_range);
> >> >> >> >> >> >> > > +
> >> >> >> >> >> >> > > /*
> >> >> >> >> >> >> > > * Migrate a device coherent folio back to normal memory. The caller should have
> >> >> >> >> >> >> > > * a reference on folio which will be copied to the new folio if migration is
> >> >> >> >> >> >> >
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
>
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [PATCH v2 02/29] mm/migrate: Add migrate_device_prepopulated_range
2024-10-17 21:58 ` Alistair Popple
2024-10-18 0:54 ` Matthew Brost
@ 2024-10-18 4:02 ` Mika Penttilä
2024-10-18 5:55 ` Alistair Popple
1 sibling, 1 reply; 129+ messages in thread
From: Mika Penttilä @ 2024-10-18 4:02 UTC (permalink / raw)
To: Alistair Popple, Matthew Brost
Cc: intel-xe, dri-devel, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
Hi,
On 10/18/24 00:58, Alistair Popple wrote:
> Matthew Brost <matthew.brost@intel.com> writes:
>
>> On Thu, Oct 17, 2024 at 04:49:11PM +1100, Alistair Popple wrote:
>>> Matthew Brost <matthew.brost@intel.com> writes:
>>>
>>>> On Thu, Oct 17, 2024 at 02:21:13PM +1100, Alistair Popple wrote:
>>>>> Matthew Brost <matthew.brost@intel.com> writes:
>>>>>
>>>>>> On Thu, Oct 17, 2024 at 12:49:55PM +1100, Alistair Popple wrote:
>>>>>>> Matthew Brost <matthew.brost@intel.com> writes:
>>>>>>>
>>>>>>>> On Wed, Oct 16, 2024 at 04:46:52AM +0000, Matthew Brost wrote:
>>>>>>>>> On Wed, Oct 16, 2024 at 03:04:06PM +1100, Alistair Popple wrote:
>>>>>>> [...]
>>>>>>>
>>>>>>>>>>> +{
>>>>>>>>>>> + unsigned long i;
>>>>>>>>>>> +
>>>>>>>>>>> + for (i = 0; i < npages; i++) {
>>>>>>>>>>> + struct page *page = pfn_to_page(src_pfns[i]);
>>>>>>>>>>> +
>>>>>>>>>>> + if (!get_page_unless_zero(page)) {
>>>>>>>>>>> + src_pfns[i] = 0;
>>>>>>>>>>> + continue;
>>>>>>>>>>> + }
>>>>>>>>>>> +
>>>>>>>>>>> + if (!trylock_page(page)) {
>>>>>>>>>>> + src_pfns[i] = 0;
>>>>>>>>>>> + put_page(page);
>>>>>>>>>>> + continue;
>>>>>>>>>>> + }
>>>>>>>>>>> +
>>>>>>>>>>> + src_pfns[i] = migrate_pfn(src_pfns[i]) | MIGRATE_PFN_MIGRATE;
>>>>>>>>>> This needs to be converted to use a folio like
>>>>>>>>>> migrate_device_range(). But more importantly this should be split out as
>>>>>>>>>> a function that both migrate_device_range() and this function can call
>>>>>>>>>> given this bit is identical.
>>>>>>>>>>
>>>>>>>>> Missed the folio conversion and agree a helper shared between this
>>>>>>>>> function and migrate_device_range would be a good idea. Let add that.
>>>>>>>>>
>>>>>>>> Alistair,
>>>>>>>>
>>>>>>>> Ok, I think now I want to go slightly different direction here to give
>>>>>>>> GPUSVM a bit more control over several eviction scenarios.
>>>>>>>>
>>>>>>>> What if I exported the helper discussed above, e.g.,
>>>>>>>>
>>>>>>>> 905 unsigned long migrate_device_pfn_lock(unsigned long pfn)
>>>>>>>> 906 {
>>>>>>>> 907 struct folio *folio;
>>>>>>>> 908
>>>>>>>> 909 folio = folio_get_nontail_page(pfn_to_page(pfn));
>>>>>>>> 910 if (!folio)
>>>>>>>> 911 return 0;
>>>>>>>> 912
>>>>>>>> 913 if (!folio_trylock(folio)) {
>>>>>>>> 914 folio_put(folio);
>>>>>>>> 915 return 0;
>>>>>>>> 916 }
>>>>>>>> 917
>>>>>>>> 918 return migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
>>>>>>>> 919 }
>>>>>>>> 920 EXPORT_SYMBOL(migrate_device_pfn_lock);
>>>>>>>>
>>>>>>>> And then also export migrate_device_unmap.
>>>>>>>>
>>>>>>>> The usage here would be let a driver collect the device pages in virtual
>>>>>>>> address range via hmm_range_fault, lock device pages under notifier
>>>>>>>> lock ensuring device pages are valid, drop the notifier lock and call
>>>>>>>> migrate_device_unmap.
>>>>>>> I'm still working through this series but that seems a bit dubious, the
>>>>>>> locking here is pretty subtle and easy to get wrong so seeing some code
>>>>>>> would help me a lot in understanding what you're suggesting.
>>>>>>>
>>>>>> For sure locking in tricky, my mistake on not working through this
>>>>>> before sending out the next rev but it came to mind after sending +
>>>>>> regarding some late feedback from Thomas about using hmm for eviction
>>>>>> [2]. His suggestion of using hmm_range_fault to trigger migration
>>>>>> doesn't work for coherent pages, while something like below does.
>>>>>>
>>>>>> [2] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1125461
>>>>>>
>>>>>> Here is a snippet I have locally which seems to work.
>>>>>>
>>>>>> 2024 retry:
>>>>>> 2025 hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
>>>>>> 2026 hmm_range.hmm_pfns = src;
>>>>>> 2027
>>>>>> 2028 while (true) {
>>>>>> 2029 mmap_read_lock(mm);
>>>>>> 2030 err = hmm_range_fault(&hmm_range);
>>>>>> 2031 mmap_read_unlock(mm);
>>>>>> 2032 if (err == -EBUSY) {
>>>>>> 2033 if (time_after(jiffies, timeout))
>>>>>> 2034 break;
>>>>>> 2035
>>>>>> 2036 hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
>>>>>> 2037 continue;
>>>>>> 2038 }
>>>>>> 2039 break;
>>>>>> 2040 }
>>>>>> 2041 if (err)
>>>>>> 2042 goto err_put;
>>>>>> 2043
>>>>>> 2044 drm_gpusvm_notifier_lock(gpusvm);
>>>>>> 2045 if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
>>>>>> 2046 drm_gpusvm_notifier_unlock(gpusvm);
>>>>>> 2047 memset(src, 0, sizeof(*src) * npages);
>>>>>> 2048 goto retry;
>>>>>> 2049 }
>>>>>> 2050 for (i = 0; i < npages; ++i) {
>>>>>> 2051 struct page *page = hmm_pfn_to_page(src[i]);
>>>>>> 2052
>>>>>> 2053 if (page && (is_device_private_page(page) ||
>>>>>> 2054 is_device_coherent_page(page)) && page->zone_device_data)
>>>>>> 2055 src[i] = src[i] & ~HMM_PFN_FLAGS;
>>>>>> 2056 else
>>>>>> 2057 src[i] = 0;
>>>>>> 2058 if (src[i])
>>>>>> 2059 src[i] = migrate_device_pfn_lock(src[i]);
>>>>>> 2060 }
>>>>>> 2061 drm_gpusvm_notifier_unlock(gpusvm);
>>>>> Practically for eviction isn't this much the same as calling
>>>>> migrate_vma_setup()? And also for eviction as Sima mentioned you
>>>>> probably shouldn't be looking at mm/vma structs.
>>>>>
>>>> hmm_range_fault is just collecting the pages, internally I suppose it
>>>> does look at a VMA (struct vm_area_struct) but I think the point is
>>>> drivers should not be looking at VMA here.
>>> migrate_vma_setup() is designed to be called by drivers and needs a vma,
>>> so in general I don't see a problem with drivers looking up vma's. The
>>> problem arises specifically for eviction and whether or not that happens
>>> in the driver or hmm_range_fault() is pretty irrelevant IMHO for the
>>> issues there (see below).
>>>
>> Ok, if you think it ok for drivers to lookup the VMA then the proposed
>> exporting of migrate_device_pfn_lock & migrate_device_unmap is not
>> needed, rather just the original function exported in this patch.
>>
>> More below too.
>>
>>>>>> 2063 migrate_device_unmap(src, npages, NULL);
>>>>>> ...
>>>>>> 2101 migrate_device_pages(src, dst, npages);
>>>>>> 2102 migrate_device_finalize(src, dst, npages);
>>>>>>
>>>>>>
>>>>>>>> Sima has strongly suggested avoiding a CPUVMA
>>>>>>>> lookup during eviction cases and this would let me fixup
>>>>>>>> drm_gpusvm_range_evict in [1] to avoid this.
>>>>>>> That sounds reasonable but for context do you have a link to the
>>>>>>> comments/discussion on this? I couldn't readily find it, but I may have
>>>>>>> just missed it.
>>>>>>>
>>>>>> See in [4], search for '2. eviction' comment from sima.
>>>>> Thanks for pointing that out. For reference here's Sima's comment:
>>>>>
>>>>>> 2. eviction
>>>>>>
>>>>>> Requirements much like migrate_to_ram, because otherwise we break the
>>>>>> migration guarantee:
>>>>>>
>>>>>> - Only looking at physical memory datastructures and locks, no looking at
>>>>>> mm/vma structs or relying on those being locked. We rely entirely on
>>>>>> reverse maps from try_to_migrate to find all the mappings on both cpu
>>>>>> and gpu side (cpu only zone device swap or migration pte entries ofc).
>>>>> I also very much agree with this. That's basically why I added
>>>>> migrate_device_range(), so that we can forcibly evict pages when the
>>>>> driver needs them freed (eg. driver unload, low memory, etc.). In
>>>>> general it is impossible to guarantee eviction of all pages using just
>>>>> hmm_range_fault().
>>>>>
>>>> In this code path we don't have device pages available, hence the
>>>> proposed collection via hmm_range_fault.
>>> Why don't you have the pfns requiring eviction available? I need to read
>>> this series in more depth, but generally hmm_range_fault() can't
>>> guarantee you will find every device page.
>>>
>> There are two cases for eviction in my series:
>>
>> 1. TTM decides it needs to move memory. This calls
>> drm_gpusvm_evict_to_ram. In this case the device pfns are available
>> directly from drm_gpusvm_devmem so the migrate_device_* calls can be used
>> here, albeit with the new function added in this patch as device pfns may
>> be non-contiguous.
> That makes sense and is generally what I think of when I'm thinking of
> eviction. The new function makes sense too - migrate_device_range() was
> primarily introduced to allow a driver to evict all device-private pages
> from a GPU so didn't consider non-contiguous cases, etc.
>
>> 2. An inconsistent state for VA range occurs (mixed system and device pages,
>> partial unmap of a range, etc...). Here we want to evict the range to ram
>> to make the state consistent. No device pages are available due to an
>> intentional disconnect between a virtual range and physical
>> drm_gpusvm_devmem, thus the device pages have to be looked up. This is the
>> function drm_gpusvm_range_evict. Based on what you tell me, it likely is
>> fine the way originally coded in v2 (vma lookup + migrate_vma_*) vs
>> using hmm_range_fault like I have suggested here.
> Thanks for the explanation. I think vma lookup + migrate_vma_setup() is
> fine for this usage and is exactly what you want - it was designed to
> either select all the system memory pages or device-private pages within
> a VA range and migrate them.
>
> FWIW I have toyed with the idea of a combined
> hmm_range_fault()/migrate_vma_setup() front-end to the rest of the
> migrate_vma_*() process but haven't come up with something nice as
> yet. I don't think mixing the two in an open-coded fashion is a good
> idea though, I'd rather we come up with a new API that addresses the
> short-comings of migrate_vma_setup().
This is what I have been implementing and I have a WIP version now; I will
clean it up, test it, and send it soon.
It installs the migration entries while faulting the pages in, and you then
continue the migration with the normal migrate_vma_*() flow.
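As a rough sketch of how a driver might consume such a front-end, something
like the following could work. The migrate_hmm_range_setup() name and its
argument handling are hypothetical placeholders for the WIP described above,
not an existing API; only the migrate_vma_pages()/migrate_vma_finalize() tail
is the current interface:

/*
 * Hypothetical usage sketch: migrate_hmm_range_setup() stands in for the
 * combined "fault pages + install migration entries" step; it does not
 * exist today. The rest is the normal migrate_vma_*() flow.
 */
static int svm_migrate_range_to_ram(void *pgmap_owner,
				    unsigned long start, unsigned long end,
				    unsigned long *src_pfns,
				    unsigned long *dst_pfns)
{
	struct migrate_vma args = {
		.start		= start,
		.end		= end,
		.src		= src_pfns,
		.dst		= dst_pfns,
		.pgmap_owner	= pgmap_owner,
		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE |
				  MIGRATE_VMA_SELECT_DEVICE_COHERENT,
	};
	int err;

	/* Hypothetical: fault the range in and install migration entries in
	 * one pass, instead of hmm_range_fault() + migrate_vma_setup(). */
	err = migrate_hmm_range_setup(&args);
	if (err)
		return err;

	/* The driver allocates destination system pages and copies the data
	 * here, exactly as it would after migrate_vma_setup(). */

	migrate_vma_pages(&args);
	migrate_vma_finalize(&args);
	return 0;
}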
>> Note #2 may be removed or unnecessary at some point if we decide to add
>> support for inconsistent state in GPU SVM and Xe. Keeping it simple for
>> now though. See 'Ranges with mixed system and device pages' in [5].
>>
>> [5] https://patchwork.freedesktop.org/patch/619819/?series=137870&rev=2
>>
>>>>>> [3] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1110726
>>>>>> [4] https://lore.kernel.org/all/BYAPR11MB3159A304925168D8B6B4671292692@BYAPR11MB3159.namprd11.prod.outlook.com/T/#m89cd6a37778ba5271d5381ebeb03e1f963856a78
>>>>>>
>>>>>>>> It would also make the function exported in this patch unnecessary too
>>>>>>>> as non-contiguous pfns can be setup on driver side via
>>>>>>>> migrate_device_pfn_lock and then migrate_device_unmap can be called.
>>>>>>>> This also another eviction usage in GPUSVM, see drm_gpusvm_evict_to_ram
>>>>>>>> in [1].
>>>>>>>>
>>>>>>>> Do you see an issue exporting migrate_device_pfn_lock,
>>>>>>>> migrate_device_unmap?
>>>>>>> If there is a good justification for it I can't see a problem with
>>>>>>> exporting it. That said I don't really understand why you would
>>>>>>> want/need to split those steps up but I'll wait to see the code.
>>>>>>>
>>>>>> It is so the device pages returned from hmm_range_fault, which are only
>>>>>> guaranteed to be valid under the notifier lock + a seqno check, to be
>>>>>> locked and ref taken for migration. migrate_device_unmap() can trigger a
>>>>>> MMU invalidation which takes the notifier lock thus calling the function
>>>>>> which combines migrate_device_pfn_lock + migrate_device_unmap deadlocks.
>>>>>>
>>>>>> I think this flow makes sense and agree in general this likely better
>>>>>> than looking at a CPUVMA.
>>>>> I'm still a bit confused about what is better with this flow if you are
>>>>> still calling hmm_range_fault(). How is it better than just calling
>>>>> migrate_vma_setup()? Obviously it will fault the pages in, but it seems
>>>> The code in rev2 calls migrate_vma_setup but the requires a struct
>>>> vm_area_struct argument whereas hmm_range_fault does not.
>>> I'm not sure that's a good enough justification because the problem isn't
>>> whether you're looking up vma's in driver code or mm code. The problem
>>> is you are looking up vma's at all and all that goes with that (mainly
>>> taking mmap lock, etc.)
>>>
>>> And for eviction hmm_range_fault() won't even find all the pages because
>>> their virtual address may have changed - consider what happens in cases
>>> of mremap(), fork(), etc. So eviction really needs physical pages
>>> (pfn's), not virtual addresses.
>>>
>> See above, #1 yes we use a physical pages. For #2 it is about making the
>> state consistent within a virtual address range.
> Yep, makes sense now. For migration of physical pages you want
> migrate_device_*, virtual address ranges want migrate_vma_*
>
> - Alistair
>
>> Matt
>>
>>>>> we're talking about eviction here so I don't understand why that would
>>>>> be relevant. And hmm_range_fault() still requires the VMA, although I
>>>>> need to look at the patches more closely, probably CPUVMA is a DRM
>>>> 'hmm_range_fault() still requires the VMA' internal yes, but again not
>>>> as argument. This is about avoiding a driver side lookup of the VMA.
>>>>
>>>> CPUVMA == struct vm_area_struct in this email.
>>> Thanks for the clarification.
>>>
>>> - Alistair
>>>
>>>> Matt
>>>>
>>>>> specific concept?
>>>>>
>>>>> Thanks.
>>>>>
>>>>> - Alistair
>>>>>
>>>>>> Matt
>>>>>>
>>>>>>> - Alistair
>>>>>>>
>>>>>>>> Matt
>>>>>>>>
>>>>>>>> [1] https://patchwork.freedesktop.org/patch/619809/?series=137870&rev=2
>>>>>>>>
>>>>>>>>> Matt
>>>>>>>>>
>>>>>>>>>>> + }
>>>>>>>>>>> +
>>>>>>>>>>> + migrate_device_unmap(src_pfns, npages, NULL);
>>>>>>>>>>> +
>>>>>>>>>>> + return 0;
>>>>>>>>>>> +}
>>>>>>>>>>> +EXPORT_SYMBOL(migrate_device_prepopulated_range);
>>>>>>>>>>> +
>>>>>>>>>>> /*
>>>>>>>>>>> * Migrate a device coherent folio back to normal memory. The caller should have
>>>>>>>>>>> * a reference on folio which will be copied to the new folio if migration is
--Mika
^ permalink raw reply [flat|nested] 129+ messages in thread* Re: [PATCH v2 02/29] mm/migrate: Add migrate_device_prepopulated_range
2024-10-18 4:02 ` Mika Penttilä
@ 2024-10-18 5:55 ` Alistair Popple
0 siblings, 0 replies; 129+ messages in thread
From: Alistair Popple @ 2024-10-18 5:55 UTC (permalink / raw)
To: Mika Penttilä
Cc: Matthew Brost, intel-xe, dri-devel, airlied, christian.koenig,
thomas.hellstrom, simona.vetter, felix.kuehling, dakr
Mika Penttilä <mpenttil@redhat.com> writes:
> Hi,
>
> On 10/18/24 00:58, Alistair Popple wrote:
>> Matthew Brost <matthew.brost@intel.com> writes:
>>
>>> On Thu, Oct 17, 2024 at 04:49:11PM +1100, Alistair Popple wrote:
>>>> Matthew Brost <matthew.brost@intel.com> writes:
>>>>
>>>>> On Thu, Oct 17, 2024 at 02:21:13PM +1100, Alistair Popple wrote:
>>>>>> Matthew Brost <matthew.brost@intel.com> writes:
>>>>>>
>>>>>>> On Thu, Oct 17, 2024 at 12:49:55PM +1100, Alistair Popple wrote:
>>>>>>>> Matthew Brost <matthew.brost@intel.com> writes:
>>>>>>>>
>>>>>>>>> On Wed, Oct 16, 2024 at 04:46:52AM +0000, Matthew Brost wrote:
>>>>>>>>>> On Wed, Oct 16, 2024 at 03:04:06PM +1100, Alistair Popple wrote:
>>>>>>>> [...]
>>>>>>>>
>>>>>>>>>>>> +{
>>>>>>>>>>>> + unsigned long i;
>>>>>>>>>>>> +
>>>>>>>>>>>> + for (i = 0; i < npages; i++) {
>>>>>>>>>>>> + struct page *page = pfn_to_page(src_pfns[i]);
>>>>>>>>>>>> +
>>>>>>>>>>>> + if (!get_page_unless_zero(page)) {
>>>>>>>>>>>> + src_pfns[i] = 0;
>>>>>>>>>>>> + continue;
>>>>>>>>>>>> + }
>>>>>>>>>>>> +
>>>>>>>>>>>> + if (!trylock_page(page)) {
>>>>>>>>>>>> + src_pfns[i] = 0;
>>>>>>>>>>>> + put_page(page);
>>>>>>>>>>>> + continue;
>>>>>>>>>>>> + }
>>>>>>>>>>>> +
>>>>>>>>>>>> + src_pfns[i] = migrate_pfn(src_pfns[i]) | MIGRATE_PFN_MIGRATE;
>>>>>>>>>>> This needs to be converted to use a folio like
>>>>>>>>>>> migrate_device_range(). But more importantly this should be split out as
>>>>>>>>>>> a function that both migrate_device_range() and this function can call
>>>>>>>>>>> given this bit is identical.
>>>>>>>>>>>
>>>>>>>>>> Missed the folio conversion and agree a helper shared between this
>>>>>>>>>> function and migrate_device_range would be a good idea. Let add that.
>>>>>>>>>>
>>>>>>>>> Alistair,
>>>>>>>>>
>>>>>>>>> Ok, I think now I want to go slightly different direction here to give
>>>>>>>>> GPUSVM a bit more control over several eviction scenarios.
>>>>>>>>>
>>>>>>>>> What if I exported the helper discussed above, e.g.,
>>>>>>>>>
>>>>>>>>> 905 unsigned long migrate_device_pfn_lock(unsigned long pfn)
>>>>>>>>> 906 {
>>>>>>>>> 907 struct folio *folio;
>>>>>>>>> 908
>>>>>>>>> 909 folio = folio_get_nontail_page(pfn_to_page(pfn));
>>>>>>>>> 910 if (!folio)
>>>>>>>>> 911 return 0;
>>>>>>>>> 912
>>>>>>>>> 913 if (!folio_trylock(folio)) {
>>>>>>>>> 914 folio_put(folio);
>>>>>>>>> 915 return 0;
>>>>>>>>> 916 }
>>>>>>>>> 917
>>>>>>>>> 918 return migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
>>>>>>>>> 919 }
>>>>>>>>> 920 EXPORT_SYMBOL(migrate_device_pfn_lock);
>>>>>>>>>
>>>>>>>>> And then also export migrate_device_unmap.
>>>>>>>>>
>>>>>>>>> The usage here would be to let a driver collect the device pages in virtual
>>>>>>>>> address range via hmm_range_fault, lock device pages under notifier
>>>>>>>>> lock ensuring device pages are valid, drop the notifier lock and call
>>>>>>>>> migrate_device_unmap.
>>>>>>>> I'm still working through this series but that seems a bit dubious, the
>>>>>>>> locking here is pretty subtle and easy to get wrong so seeing some code
>>>>>>>> would help me a lot in understanding what you're suggesting.
>>>>>>>>
>>>>>>> For sure, locking is tricky; my mistake for not working through this
>>>>>>> before sending out the next rev, but it came to mind after sending and
>>>>>>> relates to some late feedback from Thomas about using hmm for eviction
>>>>>>> [2]. His suggestion of using hmm_range_fault to trigger migration
>>>>>>> doesn't work for coherent pages, while something like below does.
>>>>>>>
>>>>>>> [2] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1125461
>>>>>>>
>>>>>>> Here is a snippet I have locally which seems to work.
>>>>>>>
>>>>>>> 2024 retry:
>>>>>>> 2025 hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
>>>>>>> 2026 hmm_range.hmm_pfns = src;
>>>>>>> 2027
>>>>>>> 2028 while (true) {
>>>>>>> 2029 mmap_read_lock(mm);
>>>>>>> 2030 err = hmm_range_fault(&hmm_range);
>>>>>>> 2031 mmap_read_unlock(mm);
>>>>>>> 2032 if (err == -EBUSY) {
>>>>>>> 2033 if (time_after(jiffies, timeout))
>>>>>>> 2034 break;
>>>>>>> 2035
>>>>>>> 2036 hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
>>>>>>> 2037 continue;
>>>>>>> 2038 }
>>>>>>> 2039 break;
>>>>>>> 2040 }
>>>>>>> 2041 if (err)
>>>>>>> 2042 goto err_put;
>>>>>>> 2043
>>>>>>> 2044 drm_gpusvm_notifier_lock(gpusvm);
>>>>>>> 2045 if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
>>>>>>> 2046 drm_gpusvm_notifier_unlock(gpusvm);
>>>>>>> 2047 memset(src, 0, sizeof(*src) * npages);
>>>>>>> 2048 goto retry;
>>>>>>> 2049 }
>>>>>>> 2050 for (i = 0; i < npages; ++i) {
>>>>>>> 2051 struct page *page = hmm_pfn_to_page(src[i]);
>>>>>>> 2052
>>>>>>> 2053 if (page && (is_device_private_page(page) ||
>>>>>>> 2054 is_device_coherent_page(page)) && page->zone_device_data)
>>>>>>> 2055 src[i] = src[i] & ~HMM_PFN_FLAGS;
>>>>>>> 2056 else
>>>>>>> 2057 src[i] = 0;
>>>>>>> 2058 if (src[i])
>>>>>>> 2059 src[i] = migrate_device_pfn_lock(src[i]);
>>>>>>> 2060 }
>>>>>>> 2061 drm_gpusvm_notifier_unlock(gpusvm);
>>>>>> Practically for eviction isn't this much the same as calling
>>>>>> migrate_vma_setup()? And also for eviction as Sima mentioned you
>>>>>> probably shouldn't be looking at mm/vma structs.
>>>>>>
>>>>> hmm_range_fault is just collecting the pages; internally I suppose it
>>>>> does look at a VMA (struct vm_area_struct), but I think the point is
>>>>> drivers should not be looking at the VMA here.
>>>> migrate_vma_setup() is designed to be called by drivers and needs a vma,
>>>> so in general I don't see a problem with drivers looking up vma's. The
>>>> problem arises specifically for eviction and whether or not that happens
>>>> in the driver or hmm_range_fault() is pretty irrelevant IMHO for the
>>>> issues there (see below).
>>>>
>>> Ok, if you think it is ok for drivers to look up the VMA then the proposed
>>> exporting of migrate_device_pfn_lock & migrate_device_unmap is not
>>> needed, rather just the original function exported in this patch.
>>>
>>> More below too.
>>>
>>>>>>> 2063 migrate_device_unmap(src, npages, NULL);
>>>>>>> ...
>>>>>>> 2101 migrate_device_pages(src, dst, npages);
>>>>>>> 2102 migrate_device_finalize(src, dst, npages);
>>>>>>>
>>>>>>>
>>>>>>>>> Sima has strongly suggested avoiding a CPUVMA
>>>>>>>>> lookup during eviction cases and this would let me fixup
>>>>>>>>> drm_gpusvm_range_evict in [1] to avoid this.
>>>>>>>> That sounds reasonable but for context do you have a link to the
>>>>>>>> comments/discussion on this? I couldn't readily find it, but I may have
>>>>>>>> just missed it.
>>>>>>>>
>>>>>>> See in [4], search for '2. eviction' comment from sima.
>>>>>> Thanks for pointing that out. For reference here's Sima's comment:
>>>>>>
>>>>>>> 2. eviction
>>>>>>>
>>>>>>> Requirements much like migrate_to_ram, because otherwise we break the
>>>>>>> migration guarantee:
>>>>>>>
>>>>>>> - Only looking at physical memory datastructures and locks, no looking at
>>>>>>> mm/vma structs or relying on those being locked. We rely entirely on
>>>>>>> reverse maps from try_to_migrate to find all the mappings on both cpu
>>>>>>> and gpu side (cpu only zone device swap or migration pte entries ofc).
>>>>>> I also very much agree with this. That's basically why I added
>>>>>> migrate_device_range(), so that we can forcibly evict pages when the
>>>>>> driver needs them freed (eg. driver unload, low memory, etc.). In
>>>>>> general it is impossible to guarantee eviction of all pages using just
>>>>>> hmm_range_fault().
>>>>>>
>>>>> In this code path we don't have device pages available, hence the
>>>>> proposed collection via hmm_range_fault.
>>>> Why don't you have the pfns requiring eviction available? I need to read
>>>> this series in more depth, but generally hmm_range_fault() can't
>>>> guarantee you will find every device page.
>>>>
>>> There are two cases for eviction in my series:
>>>
>>> 1. TTM decides it needs to move memory. This calls
>>> drm_gpusvm_evict_to_ram. In this case the device pfns are available
>>> directly from drm_gpusvm_devmem so the migrate_device_* calls can be used
>>> here, albeit with the new function added in this patch as device pfns may
>>> be non-contiguous.
>> That makes sense and is generally what I think of when I'm thinking of
>> eviction. The new function makes sense too - migrate_device_range() was
>> primarily introduced to allow a driver to evict all device-private pages
>> from a GPU so didn't consider non-contiguous cases, etc.
>>
>>> 2. An inconsistent state for VA range occurs (mixed system and device pages,
>>> partial unmap of a range, etc...). Here we want to evict the range to ram
>>> to make the state consistent. No device pages are available due to an
>>> intentional disconnect between a virtual range and physical
>>> drm_gpusvm_devmem, thus the device pages have to be looked up. This is the
>>> function drm_gpusvm_range_evict. Based on what you tell me, it likely is
>>> fine the way originally coded in v2 (vma lookup + migrate_vma_*) vs
>>> using hmm_range_fault like I have suggested here.
>> Thanks for the explanation. I think vma lookup + migrate_vma_setup() is
>> fine for this usage and is exactly what you want - it was designed to
>> either select all the system memory pages or device-private pages within
>> a VA range and migrate them.
>>
>> FWIW I have toyed with the idea of a combined
>> hmm_range_fault()/migrate_vma_setup() front-end to the rest of the
>> migrate_vma_*() process but haven't come up with something nice as
>> yet. I don't think mixing the two in an open-coded fashion is a good
>> idea though, I'd rather we come up with a new API that addresses the
>> short-comings of migrate_vma_setup().
>
> This is what I have been implementing and have a WIP version now; I will
> clean up, test and send it soon.
>
> It installs the migration entries while faulting the pages in, and you then
> continue the migration with the normal migrate_vma_*() flow.
Oh nice! Thanks for looking further into that idea; I'm looking forward
to seeing the results. For background, Mika and I had an offline
discussion about this a little while back, but I wasn't sure if it had
gone anywhere.
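For context, the "normal migrate_vma_*() flow" being referred to here is
roughly the sequence below. This is only a minimal sketch against the
interfaces in include/linux/migrate.h; the function and variable names are
illustrative rather than taken from this series, and the destination-page
allocation/copy step is driver specific:

#include <linux/mm.h>
#include <linux/migrate.h>

/* Illustrative sketch, not code from this series. */
static int example_evict_range_to_ram(struct vm_area_struct *vma,
				      unsigned long start, unsigned long end,
				      unsigned long *src_pfns,
				      unsigned long *dst_pfns,
				      void *pgmap_owner)
{
	struct migrate_vma args = {
		.vma		= vma,
		.start		= start,
		.end		= end,
		.src		= src_pfns,
		.dst		= dst_pfns,
		.pgmap_owner	= pgmap_owner,
		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
	};
	int err;

	/* Collect and unmap the pages, installing migration entries. */
	err = migrate_vma_setup(&args);
	if (err)
		return err;

	/*
	 * For each src entry with MIGRATE_PFN_MIGRATE set, the driver
	 * allocates and locks a destination page, copies the data over and
	 * fills in the matching dst entry (driver specific, omitted here).
	 */

	/* Switch the mappings over to the new pages, then drop the
	 * migration entries and release the pages. */
	migrate_vma_pages(&args);
	migrate_vma_finalize(&args);

	return 0;
}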
>>> Note #2 may be removed or unnecessary at some point if we decide to add
>>> support for inconsistent state in GPU SVM and Xe. Keeping it simple for
>>> now though. See 'Ranges with mixed system and device pages' in [5].
>>>
>>> [5] https://patchwork.freedesktop.org/patch/619819/?series=137870&rev=2
>>>
>>>>>>> [3] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1110726
>>>>>>> [4] https://lore.kernel.org/all/BYAPR11MB3159A304925168D8B6B4671292692@BYAPR11MB3159.namprd11.prod.outlook.com/T/#m89cd6a37778ba5271d5381ebeb03e1f963856a78
>>>>>>>
>>>>>>>>> It would also make the function exported in this patch unnecessary,
>>>>>>>>> as non-contiguous pfns can be set up on the driver side via
>>>>>>>>> migrate_device_pfn_lock and then migrate_device_unmap can be called.
>>>>>>>>> There is also another eviction usage in GPUSVM, see drm_gpusvm_evict_to_ram
>>>>>>>>> in [1].
>>>>>>>>>
>>>>>>>>> Do you see an issue exporting migrate_device_pfn_lock,
>>>>>>>>> migrate_device_unmap?
>>>>>>>> If there is a good justification for it I can't see a problem with
>>>>>>>> exporting it. That said I don't really understand why you would
>>>>>>>> want/need to split those steps up but I'll wait to see the code.
>>>>>>>>
>>>>>>> It is so the device pages returned from hmm_range_fault, which are only
>>>>>>> guaranteed to be valid under the notifier lock + a seqno check, can be
>>>>>>> locked and have a reference taken for migration. migrate_device_unmap()
>>>>>>> can trigger an MMU invalidation which takes the notifier lock, so calling
>>>>>>> a function which combines migrate_device_pfn_lock + migrate_device_unmap
>>>>>>> would deadlock.
>>>>>>>
>>>>>>> I think this flow makes sense and agree that in general this is likely
>>>>>>> better than looking at a CPUVMA.
>>>>>> I'm still a bit confused about what is better with this flow if you are
>>>>>> still calling hmm_range_fault(). How is it better than just calling
>>>>>> migrate_vma_setup()? Obviously it will fault the pages in, but it seems
>>>>> The code in rev2 calls migrate_vma_setup but that requires a struct
>>>>> vm_area_struct argument whereas hmm_range_fault does not.
>>>> I'm not sure that's a good enough justification because the problem isn't
>>>> whether you're looking up vma's in driver code or mm code. The problem
>>>> is you are looking up vma's at all and all that goes with that (mainly
>>>> taking mmap lock, etc.)
>>>>
>>>> And for eviction hmm_range_fault() won't even find all the pages because
>>>> their virtual address may have changed - consider what happens in cases
>>>> of mremap(), fork(), etc. So eviction really needs physical pages
>>>> (pfn's), not virtual addresses.
>>>>
>>> See above, for #1 yes we use physical pages. For #2 it is about making the
>>> state consistent within a virtual address range.
>> Yep, makes sense now. For migration of physical pages you want
>> migrate_device_*, virtual address ranges want migrate_vma_*
>>
>> - Alistair
>>
>>> Matt
>>>
>>>>>> we're talking about eviction here so I don't understand why that would
>>>>>> be relevant. And hmm_range_fault() still requires the VMA, although I
>>>>>> need to look at the patches more closely, probably CPUVMA is a DRM
>>>>> 'hmm_range_fault() still requires the VMA' internally yes, but again not
>>>>> as an argument. This is about avoiding a driver side lookup of the VMA.
>>>>>
>>>>> CPUVMA == struct vm_area_struct in this email.
>>>> Thanks for the clarification.
>>>>
>>>> - Alistair
>>>>
>>>>> Matt
>>>>>
>>>>>> specific concept?
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> - Alistair
>>>>>>
>>>>>>> Matt
>>>>>>>
>>>>>>>> - Alistair
>>>>>>>>
>>>>>>>>> Matt
>>>>>>>>>
>>>>>>>>> [1] https://patchwork.freedesktop.org/patch/619809/?series=137870&rev=2
>>>>>>>>>
>>>>>>>>>> Matt
>>>>>>>>>>
>>>>>>>>>>>> + }
>>>>>>>>>>>> +
>>>>>>>>>>>> + migrate_device_unmap(src_pfns, npages, NULL);
>>>>>>>>>>>> +
>>>>>>>>>>>> + return 0;
>>>>>>>>>>>> +}
>>>>>>>>>>>> +EXPORT_SYMBOL(migrate_device_prepopulated_range);
>>>>>>>>>>>> +
>>>>>>>>>>>> /*
>>>>>>>>>>>> * Migrate a device coherent folio back to normal memory. The caller should have
>>>>>>>>>>>> * a reference on folio which will be copied to the new folio if migration is
>
> --Mika
^ permalink raw reply [flat|nested] 129+ messages in thread
* [PATCH v2 03/29] mm/migrate: Trylock device page in do_swap_page
2024-10-16 3:24 [PATCH v2 00/29] Introduce GPU SVM and Xe SVM implementation Matthew Brost
2024-10-16 3:24 ` [PATCH v2 01/29] drm/xe: Retry BO allocation Matthew Brost
2024-10-16 3:24 ` [PATCH v2 02/29] mm/migrate: Add migrate_device_prepopulated_range Matthew Brost
@ 2024-10-16 3:24 ` Matthew Brost
2024-10-16 4:00 ` Alistair Popple
2024-11-28 23:31 ` Alistair Popple
2024-10-16 3:24 ` [PATCH v2 04/29] drm/pagemap: Add DRM pagemap Matthew Brost
` (28 subsequent siblings)
31 siblings, 2 replies; 129+ messages in thread
From: Matthew Brost @ 2024-10-16 3:24 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
Avoid multiple CPU page faults to the same device page racing by trying
to lock the page in do_swap_page before taking an extra reference to the
page. This prevents scenarios where multiple CPU page faults each take
an extra reference to a device page, which could abort migration in
folio_migrate_mapping. With the device page being locked in
do_swap_page, the migrate_vma_* functions need to be updated to avoid
locking the fault_page argument.
Prior to this change, a livelock scenario could occur in Xe's (Intel GPU
DRM driver) SVM implementation if enough threads faulted the same device
page.
Cc: Philip Yang <Philip.Yang@amd.com>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Suggested-by: Simona Vetter <simona.vetter@ffwll.ch>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
mm/memory.c | 13 ++++++---
mm/migrate_device.c | 69 ++++++++++++++++++++++++++++++---------------
2 files changed, 56 insertions(+), 26 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 2366578015ad..b72bde782611 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4252,10 +4252,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
* Get a page reference while we know the page can't be
* freed.
*/
- get_page(vmf->page);
- pte_unmap_unlock(vmf->pte, vmf->ptl);
- ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
- put_page(vmf->page);
+ if (trylock_page(vmf->page)) {
+ get_page(vmf->page);
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+ ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
+ put_page(vmf->page);
+ unlock_page(vmf->page);
+ } else {
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+ }
} else if (is_hwpoison_entry(entry)) {
ret = VM_FAULT_HWPOISON;
} else if (is_pte_marker_entry(entry)) {
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index f163c2131022..2477d39f57be 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -60,6 +60,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
struct mm_walk *walk)
{
struct migrate_vma *migrate = walk->private;
+ struct folio *fault_folio = migrate->fault_page ?
+ page_folio(migrate->fault_page) : NULL;
struct vm_area_struct *vma = walk->vma;
struct mm_struct *mm = vma->vm_mm;
unsigned long addr = start, unmapped = 0;
@@ -88,11 +90,13 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
folio_get(folio);
spin_unlock(ptl);
- if (unlikely(!folio_trylock(folio)))
+ if (unlikely(fault_folio != folio &&
+ !folio_trylock(folio)))
return migrate_vma_collect_skip(start, end,
walk);
ret = split_folio(folio);
- folio_unlock(folio);
+ if (fault_folio != folio)
+ folio_unlock(folio);
folio_put(folio);
if (ret)
return migrate_vma_collect_skip(start, end,
@@ -192,7 +196,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
* optimisation to avoid walking the rmap later with
* try_to_migrate().
*/
- if (folio_trylock(folio)) {
+ if (fault_folio == folio || folio_trylock(folio)) {
bool anon_exclusive;
pte_t swp_pte;
@@ -204,7 +208,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
if (folio_try_share_anon_rmap_pte(folio, page)) {
set_pte_at(mm, addr, ptep, pte);
- folio_unlock(folio);
+ if (fault_folio != folio)
+ folio_unlock(folio);
folio_put(folio);
mpfn = 0;
goto next;
@@ -363,6 +368,8 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
unsigned long npages,
struct page *fault_page)
{
+ struct folio *fault_folio = fault_page ?
+ page_folio(fault_page) : NULL;
unsigned long i, restore = 0;
bool allow_drain = true;
unsigned long unmapped = 0;
@@ -427,7 +434,8 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
remove_migration_ptes(folio, folio, 0);
src_pfns[i] = 0;
- folio_unlock(folio);
+ if (fault_folio != folio)
+ folio_unlock(folio);
folio_put(folio);
restore--;
}
@@ -536,6 +544,8 @@ int migrate_vma_setup(struct migrate_vma *args)
return -EINVAL;
if (args->fault_page && !is_device_private_page(args->fault_page))
return -EINVAL;
+ if (args->fault_page && !PageLocked(args->fault_page))
+ return -EINVAL;
memset(args->src, 0, sizeof(*args->src) * nr_pages);
args->cpages = 0;
@@ -799,19 +809,13 @@ void migrate_vma_pages(struct migrate_vma *migrate)
}
EXPORT_SYMBOL(migrate_vma_pages);
-/*
- * migrate_device_finalize() - complete page migration
- * @src_pfns: src_pfns returned from migrate_device_range()
- * @dst_pfns: array of pfns allocated by the driver to migrate memory to
- * @npages: number of pages in the range
- *
- * Completes migration of the page by removing special migration entries.
- * Drivers must ensure copying of page data is complete and visible to the CPU
- * before calling this.
- */
-void migrate_device_finalize(unsigned long *src_pfns,
- unsigned long *dst_pfns, unsigned long npages)
+static void __migrate_device_finalize(unsigned long *src_pfns,
+ unsigned long *dst_pfns,
+ unsigned long npages,
+ struct page *fault_page)
{
+ struct folio *fault_folio = fault_page ?
+ page_folio(fault_page) : NULL;
unsigned long i;
for (i = 0; i < npages; i++) {
@@ -824,7 +828,8 @@ void migrate_device_finalize(unsigned long *src_pfns,
if (!page) {
if (dst) {
- folio_unlock(dst);
+ if (fault_folio != dst)
+ folio_unlock(dst);
folio_put(dst);
}
continue;
@@ -834,14 +839,16 @@ void migrate_device_finalize(unsigned long *src_pfns,
if (!(src_pfns[i] & MIGRATE_PFN_MIGRATE) || !dst) {
if (dst) {
- folio_unlock(dst);
+ if (fault_folio != dst)
+ folio_unlock(dst);
folio_put(dst);
}
dst = src;
}
remove_migration_ptes(src, dst, 0);
- folio_unlock(src);
+ if (fault_folio != src)
+ folio_unlock(src);
if (folio_is_zone_device(src))
folio_put(src);
@@ -849,7 +856,8 @@ void migrate_device_finalize(unsigned long *src_pfns,
folio_putback_lru(src);
if (dst != src) {
- folio_unlock(dst);
+ if (fault_folio != dst)
+ folio_unlock(dst);
if (folio_is_zone_device(dst))
folio_put(dst);
else
@@ -857,6 +865,22 @@ void migrate_device_finalize(unsigned long *src_pfns,
}
}
}
+
+/*
+ * migrate_device_finalize() - complete page migration
+ * @src_pfns: src_pfns returned from migrate_device_range()
+ * @dst_pfns: array of pfns allocated by the driver to migrate memory to
+ * @npages: number of pages in the range
+ *
+ * Completes migration of the page by removing special migration entries.
+ * Drivers must ensure copying of page data is complete and visible to the CPU
+ * before calling this.
+ */
+void migrate_device_finalize(unsigned long *src_pfns,
+ unsigned long *dst_pfns, unsigned long npages)
+{
+ return __migrate_device_finalize(src_pfns, dst_pfns, npages, NULL);
+}
EXPORT_SYMBOL(migrate_device_finalize);
/**
@@ -872,7 +896,8 @@ EXPORT_SYMBOL(migrate_device_finalize);
*/
void migrate_vma_finalize(struct migrate_vma *migrate)
{
- migrate_device_finalize(migrate->src, migrate->dst, migrate->npages);
+ __migrate_device_finalize(migrate->src, migrate->dst, migrate->npages,
+ migrate->fault_page);
}
EXPORT_SYMBOL(migrate_vma_finalize);
--
2.34.1
^ permalink raw reply related [flat|nested] 129+ messages in thread* Re: [PATCH v2 03/29] mm/migrate: Trylock device page in do_swap_page
2024-10-16 3:24 ` [PATCH v2 03/29] mm/migrate: Trylock device page in do_swap_page Matthew Brost
@ 2024-10-16 4:00 ` Alistair Popple
2024-10-16 4:41 ` Matthew Brost
2024-11-28 23:31 ` Alistair Popple
1 sibling, 1 reply; 129+ messages in thread
From: Alistair Popple @ 2024-10-16 4:00 UTC (permalink / raw)
To: Matthew Brost
Cc: intel-xe, dri-devel, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
Matthew Brost <matthew.brost@intel.com> writes:
> Avoid multiple CPU page faults to the same device page racing by trying
> to lock the page in do_swap_page before taking an extra reference to the
> page. This prevents scenarios where multiple CPU page faults each take
> an extra reference to a device page, which could abort migration in
> folio_migrate_mapping. With the device page being locked in
> do_swap_page, the migrate_vma_* functions need to be updated to avoid
> locking the fault_page argument.
>
> Prior to this change, a livelock scenario could occur in Xe's (Intel GPU
> DRM driver) SVM implementation if enough threads faulted the same device
> page.
>
> Cc: Philip Yang <Philip.Yang@amd.com>
> Cc: Felix Kuehling <felix.kuehling@amd.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Suggested-by: Simona Vetter <simona.vetter@ffwll.ch>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
> mm/memory.c | 13 ++++++---
> mm/migrate_device.c | 69 ++++++++++++++++++++++++++++++---------------
> 2 files changed, 56 insertions(+), 26 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 2366578015ad..b72bde782611 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4252,10 +4252,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> * Get a page reference while we know the page can't be
> * freed.
> */
> - get_page(vmf->page);
> - pte_unmap_unlock(vmf->pte, vmf->ptl);
> - ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
> - put_page(vmf->page);
> + if (trylock_page(vmf->page)) {
> + get_page(vmf->page);
> + pte_unmap_unlock(vmf->pte, vmf->ptl);
> + ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
> + put_page(vmf->page);
> + unlock_page(vmf->page);
I don't think my previous review of this change has really been
addressed. Why don't we just install the migration entry here? Seems
like it would be a much simpler way of solving this.
> + } else {
> + pte_unmap_unlock(vmf->pte, vmf->ptl);
> + }
> } else if (is_hwpoison_entry(entry)) {
> ret = VM_FAULT_HWPOISON;
> } else if (is_pte_marker_entry(entry)) {
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index f163c2131022..2477d39f57be 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -60,6 +60,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> struct mm_walk *walk)
> {
> struct migrate_vma *migrate = walk->private;
> + struct folio *fault_folio = migrate->fault_page ?
> + page_folio(migrate->fault_page) : NULL;
> struct vm_area_struct *vma = walk->vma;
> struct mm_struct *mm = vma->vm_mm;
> unsigned long addr = start, unmapped = 0;
> @@ -88,11 +90,13 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>
> folio_get(folio);
> spin_unlock(ptl);
> - if (unlikely(!folio_trylock(folio)))
> + if (unlikely(fault_folio != folio &&
> + !folio_trylock(folio)))
> return migrate_vma_collect_skip(start, end,
> walk);
> ret = split_folio(folio);
> - folio_unlock(folio);
> + if (fault_folio != folio)
> + folio_unlock(folio);
> folio_put(folio);
> if (ret)
> return migrate_vma_collect_skip(start, end,
> @@ -192,7 +196,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> * optimisation to avoid walking the rmap later with
> * try_to_migrate().
> */
> - if (folio_trylock(folio)) {
> + if (fault_folio == folio || folio_trylock(folio)) {
> bool anon_exclusive;
> pte_t swp_pte;
>
> @@ -204,7 +208,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>
> if (folio_try_share_anon_rmap_pte(folio, page)) {
> set_pte_at(mm, addr, ptep, pte);
> - folio_unlock(folio);
> + if (fault_folio != folio)
> + folio_unlock(folio);
> folio_put(folio);
> mpfn = 0;
> goto next;
> @@ -363,6 +368,8 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
> unsigned long npages,
> struct page *fault_page)
> {
> + struct folio *fault_folio = fault_page ?
> + page_folio(fault_page) : NULL;
> unsigned long i, restore = 0;
> bool allow_drain = true;
> unsigned long unmapped = 0;
> @@ -427,7 +434,8 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
> remove_migration_ptes(folio, folio, 0);
>
> src_pfns[i] = 0;
> - folio_unlock(folio);
> + if (fault_folio != folio)
> + folio_unlock(folio);
> folio_put(folio);
> restore--;
> }
> @@ -536,6 +544,8 @@ int migrate_vma_setup(struct migrate_vma *args)
> return -EINVAL;
> if (args->fault_page && !is_device_private_page(args->fault_page))
> return -EINVAL;
> + if (args->fault_page && !PageLocked(args->fault_page))
> + return -EINVAL;
>
> memset(args->src, 0, sizeof(*args->src) * nr_pages);
> args->cpages = 0;
> @@ -799,19 +809,13 @@ void migrate_vma_pages(struct migrate_vma *migrate)
> }
> EXPORT_SYMBOL(migrate_vma_pages);
>
> -/*
> - * migrate_device_finalize() - complete page migration
> - * @src_pfns: src_pfns returned from migrate_device_range()
> - * @dst_pfns: array of pfns allocated by the driver to migrate memory to
> - * @npages: number of pages in the range
> - *
> - * Completes migration of the page by removing special migration entries.
> - * Drivers must ensure copying of page data is complete and visible to the CPU
> - * before calling this.
> - */
> -void migrate_device_finalize(unsigned long *src_pfns,
> - unsigned long *dst_pfns, unsigned long npages)
> +static void __migrate_device_finalize(unsigned long *src_pfns,
> + unsigned long *dst_pfns,
> + unsigned long npages,
> + struct page *fault_page)
> {
> + struct folio *fault_folio = fault_page ?
> + page_folio(fault_page) : NULL;
> unsigned long i;
>
> for (i = 0; i < npages; i++) {
> @@ -824,7 +828,8 @@ void migrate_device_finalize(unsigned long *src_pfns,
>
> if (!page) {
> if (dst) {
> - folio_unlock(dst);
> + if (fault_folio != dst)
> + folio_unlock(dst);
> folio_put(dst);
> }
> continue;
> @@ -834,14 +839,16 @@ void migrate_device_finalize(unsigned long *src_pfns,
>
> if (!(src_pfns[i] & MIGRATE_PFN_MIGRATE) || !dst) {
> if (dst) {
> - folio_unlock(dst);
> + if (fault_folio != dst)
> + folio_unlock(dst);
> folio_put(dst);
> }
> dst = src;
> }
>
> remove_migration_ptes(src, dst, 0);
> - folio_unlock(src);
> + if (fault_folio != src)
> + folio_unlock(src);
>
> if (folio_is_zone_device(src))
> folio_put(src);
> @@ -849,7 +856,8 @@ void migrate_device_finalize(unsigned long *src_pfns,
> folio_putback_lru(src);
>
> if (dst != src) {
> - folio_unlock(dst);
> + if (fault_folio != dst)
> + folio_unlock(dst);
> if (folio_is_zone_device(dst))
> folio_put(dst);
> else
> @@ -857,6 +865,22 @@ void migrate_device_finalize(unsigned long *src_pfns,
> }
> }
> }
> +
> +/*
> + * migrate_device_finalize() - complete page migration
> + * @src_pfns: src_pfns returned from migrate_device_range()
> + * @dst_pfns: array of pfns allocated by the driver to migrate memory to
> + * @npages: number of pages in the range
> + *
> + * Completes migration of the page by removing special migration entries.
> + * Drivers must ensure copying of page data is complete and visible to the CPU
> + * before calling this.
> + */
> +void migrate_device_finalize(unsigned long *src_pfns,
> + unsigned long *dst_pfns, unsigned long npages)
> +{
> + return __migrate_device_finalize(src_pfns, dst_pfns, npages, NULL);
> +}
> EXPORT_SYMBOL(migrate_device_finalize);
>
> /**
> @@ -872,7 +896,8 @@ EXPORT_SYMBOL(migrate_device_finalize);
> */
> void migrate_vma_finalize(struct migrate_vma *migrate)
> {
> - migrate_device_finalize(migrate->src, migrate->dst, migrate->npages);
> + __migrate_device_finalize(migrate->src, migrate->dst, migrate->npages,
> + migrate->fault_page);
> }
> EXPORT_SYMBOL(migrate_vma_finalize);
^ permalink raw reply [flat|nested] 129+ messages in thread* Re: [PATCH v2 03/29] mm/migrate: Trylock device page in do_swap_page
2024-10-16 4:00 ` Alistair Popple
@ 2024-10-16 4:41 ` Matthew Brost
2024-10-17 1:51 ` Alistair Popple
0 siblings, 1 reply; 129+ messages in thread
From: Matthew Brost @ 2024-10-16 4:41 UTC (permalink / raw)
To: Alistair Popple
Cc: intel-xe, dri-devel, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
On Wed, Oct 16, 2024 at 03:00:08PM +1100, Alistair Popple wrote:
>
> Matthew Brost <matthew.brost@intel.com> writes:
>
> > Avoid multiple CPU page faults to the same device page racing by trying
> > to lock the page in do_swap_page before taking an extra reference to the
> > page. This prevents scenarios where multiple CPU page faults each take
> > an extra reference to a device page, which could abort migration in
> > folio_migrate_mapping. With the device page being locked in
> > do_swap_page, the migrate_vma_* functions need to be updated to avoid
> > locking the fault_page argument.
> >
> > Prior to this change, a livelock scenario could occur in Xe's (Intel GPU
> > DRM driver) SVM implementation if enough threads faulted the same device
> > page.
> >
> > Cc: Philip Yang <Philip.Yang@amd.com>
> > Cc: Felix Kuehling <felix.kuehling@amd.com>
> > Cc: Christian König <christian.koenig@amd.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Suggested-by: Simona Vetter <simona.vetter@ffwll.ch>
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> > mm/memory.c | 13 ++++++---
> > mm/migrate_device.c | 69 ++++++++++++++++++++++++++++++---------------
> > 2 files changed, 56 insertions(+), 26 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 2366578015ad..b72bde782611 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4252,10 +4252,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > * Get a page reference while we know the page can't be
> > * freed.
> > */
> > - get_page(vmf->page);
> > - pte_unmap_unlock(vmf->pte, vmf->ptl);
> > - ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
> > - put_page(vmf->page);
> > + if (trylock_page(vmf->page)) {
> > + get_page(vmf->page);
> > + pte_unmap_unlock(vmf->pte, vmf->ptl);
> > + ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
> > + put_page(vmf->page);
> > + unlock_page(vmf->page);
>
> I don't think my previous review of this change has really been
> addressed. Why don't we just install the migration entry here? Seems
> like it would be a much simpler way of solving this.
>
I should have mentioned this in the cover letter: I haven't got around
to trying that out yet. I included this existing version for correctness,
but I also think this is not strictly required to merge this series, as
our locking in migrate_to_ram only relies on the core MM locks, so
some thread would eventually win the race and make forward progress.
So I guess just ignore this patch and will send an updated version
individually with installing a migration entry in do_swap_page. If for
some reason that doesn't work, I'll respond here explaining why.
Matt
> > + } else {
> > + pte_unmap_unlock(vmf->pte, vmf->ptl);
> > + }
> > } else if (is_hwpoison_entry(entry)) {
> > ret = VM_FAULT_HWPOISON;
> > } else if (is_pte_marker_entry(entry)) {
> > diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> > index f163c2131022..2477d39f57be 100644
> > --- a/mm/migrate_device.c
> > +++ b/mm/migrate_device.c
> > @@ -60,6 +60,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> > struct mm_walk *walk)
> > {
> > struct migrate_vma *migrate = walk->private;
> > + struct folio *fault_folio = migrate->fault_page ?
> > + page_folio(migrate->fault_page) : NULL;
> > struct vm_area_struct *vma = walk->vma;
> > struct mm_struct *mm = vma->vm_mm;
> > unsigned long addr = start, unmapped = 0;
> > @@ -88,11 +90,13 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >
> > folio_get(folio);
> > spin_unlock(ptl);
> > - if (unlikely(!folio_trylock(folio)))
> > + if (unlikely(fault_folio != folio &&
> > + !folio_trylock(folio)))
> > return migrate_vma_collect_skip(start, end,
> > walk);
> > ret = split_folio(folio);
> > - folio_unlock(folio);
> > + if (fault_folio != folio)
> > + folio_unlock(folio);
> > folio_put(folio);
> > if (ret)
> > return migrate_vma_collect_skip(start, end,
> > @@ -192,7 +196,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> > * optimisation to avoid walking the rmap later with
> > * try_to_migrate().
> > */
> > - if (folio_trylock(folio)) {
> > + if (fault_folio == folio || folio_trylock(folio)) {
> > bool anon_exclusive;
> > pte_t swp_pte;
> >
> > @@ -204,7 +208,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >
> > if (folio_try_share_anon_rmap_pte(folio, page)) {
> > set_pte_at(mm, addr, ptep, pte);
> > - folio_unlock(folio);
> > + if (fault_folio != folio)
> > + folio_unlock(folio);
> > folio_put(folio);
> > mpfn = 0;
> > goto next;
> > @@ -363,6 +368,8 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
> > unsigned long npages,
> > struct page *fault_page)
> > {
> > + struct folio *fault_folio = fault_page ?
> > + page_folio(fault_page) : NULL;
> > unsigned long i, restore = 0;
> > bool allow_drain = true;
> > unsigned long unmapped = 0;
> > @@ -427,7 +434,8 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
> > remove_migration_ptes(folio, folio, 0);
> >
> > src_pfns[i] = 0;
> > - folio_unlock(folio);
> > + if (fault_folio != folio)
> > + folio_unlock(folio);
> > folio_put(folio);
> > restore--;
> > }
> > @@ -536,6 +544,8 @@ int migrate_vma_setup(struct migrate_vma *args)
> > return -EINVAL;
> > if (args->fault_page && !is_device_private_page(args->fault_page))
> > return -EINVAL;
> > + if (args->fault_page && !PageLocked(args->fault_page))
> > + return -EINVAL;
> >
> > memset(args->src, 0, sizeof(*args->src) * nr_pages);
> > args->cpages = 0;
> > @@ -799,19 +809,13 @@ void migrate_vma_pages(struct migrate_vma *migrate)
> > }
> > EXPORT_SYMBOL(migrate_vma_pages);
> >
> > -/*
> > - * migrate_device_finalize() - complete page migration
> > - * @src_pfns: src_pfns returned from migrate_device_range()
> > - * @dst_pfns: array of pfns allocated by the driver to migrate memory to
> > - * @npages: number of pages in the range
> > - *
> > - * Completes migration of the page by removing special migration entries.
> > - * Drivers must ensure copying of page data is complete and visible to the CPU
> > - * before calling this.
> > - */
> > -void migrate_device_finalize(unsigned long *src_pfns,
> > - unsigned long *dst_pfns, unsigned long npages)
> > +static void __migrate_device_finalize(unsigned long *src_pfns,
> > + unsigned long *dst_pfns,
> > + unsigned long npages,
> > + struct page *fault_page)
> > {
> > + struct folio *fault_folio = fault_page ?
> > + page_folio(fault_page) : NULL;
> > unsigned long i;
> >
> > for (i = 0; i < npages; i++) {
> > @@ -824,7 +828,8 @@ void migrate_device_finalize(unsigned long *src_pfns,
> >
> > if (!page) {
> > if (dst) {
> > - folio_unlock(dst);
> > + if (fault_folio != dst)
> > + folio_unlock(dst);
> > folio_put(dst);
> > }
> > continue;
> > @@ -834,14 +839,16 @@ void migrate_device_finalize(unsigned long *src_pfns,
> >
> > if (!(src_pfns[i] & MIGRATE_PFN_MIGRATE) || !dst) {
> > if (dst) {
> > - folio_unlock(dst);
> > + if (fault_folio != dst)
> > + folio_unlock(dst);
> > folio_put(dst);
> > }
> > dst = src;
> > }
> >
> > remove_migration_ptes(src, dst, 0);
> > - folio_unlock(src);
> > + if (fault_folio != src)
> > + folio_unlock(src);
> >
> > if (folio_is_zone_device(src))
> > folio_put(src);
> > @@ -849,7 +856,8 @@ void migrate_device_finalize(unsigned long *src_pfns,
> > folio_putback_lru(src);
> >
> > if (dst != src) {
> > - folio_unlock(dst);
> > + if (fault_folio != dst)
> > + folio_unlock(dst);
> > if (folio_is_zone_device(dst))
> > folio_put(dst);
> > else
> > @@ -857,6 +865,22 @@ void migrate_device_finalize(unsigned long *src_pfns,
> > }
> > }
> > }
> > +
> > +/*
> > + * migrate_device_finalize() - complete page migration
> > + * @src_pfns: src_pfns returned from migrate_device_range()
> > + * @dst_pfns: array of pfns allocated by the driver to migrate memory to
> > + * @npages: number of pages in the range
> > + *
> > + * Completes migration of the page by removing special migration entries.
> > + * Drivers must ensure copying of page data is complete and visible to the CPU
> > + * before calling this.
> > + */
> > +void migrate_device_finalize(unsigned long *src_pfns,
> > + unsigned long *dst_pfns, unsigned long npages)
> > +{
> > + return __migrate_device_finalize(src_pfns, dst_pfns, npages, NULL);
> > +}
> > EXPORT_SYMBOL(migrate_device_finalize);
> >
> > /**
> > @@ -872,7 +896,8 @@ EXPORT_SYMBOL(migrate_device_finalize);
> > */
> > void migrate_vma_finalize(struct migrate_vma *migrate)
> > {
> > - migrate_device_finalize(migrate->src, migrate->dst, migrate->npages);
> > + __migrate_device_finalize(migrate->src, migrate->dst, migrate->npages,
> > + migrate->fault_page);
> > }
> > EXPORT_SYMBOL(migrate_vma_finalize);
>
^ permalink raw reply [flat|nested] 129+ messages in thread* Re: [PATCH v2 03/29] mm/migrate: Trylock device page in do_swap_page
2024-10-16 4:41 ` Matthew Brost
@ 2024-10-17 1:51 ` Alistair Popple
2024-10-25 0:31 ` Matthew Brost
0 siblings, 1 reply; 129+ messages in thread
From: Alistair Popple @ 2024-10-17 1:51 UTC (permalink / raw)
To: Matthew Brost
Cc: intel-xe, dri-devel, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
Matthew Brost <matthew.brost@intel.com> writes:
> On Wed, Oct 16, 2024 at 03:00:08PM +1100, Alistair Popple wrote:
>>
>> Matthew Brost <matthew.brost@intel.com> writes:
>>
>> > Avoid multiple CPU page faults to the same device page racing by trying
>> > to lock the page in do_swap_page before taking an extra reference to the
>> > page. This prevents scenarios where multiple CPU page faults each take
>> > an extra reference to a device page, which could abort migration in
>> > folio_migrate_mapping. With the device page being locked in
>> > do_swap_page, the migrate_vma_* functions need to be updated to avoid
>> > locking the fault_page argument.
>> >
>> > Prior to this change, a livelock scenario could occur in Xe's (Intel GPU
>> > DRM driver) SVM implementation if enough threads faulted the same device
>> > page.
>> >
>> > Cc: Philip Yang <Philip.Yang@amd.com>
>> > Cc: Felix Kuehling <felix.kuehling@amd.com>
>> > Cc: Christian König <christian.koenig@amd.com>
>> > Cc: Andrew Morton <akpm@linux-foundation.org>
>> > Suggested-by: Simona Vetter <simona.vetter@ffwll.ch>
>> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>> > ---
>> > mm/memory.c | 13 ++++++---
>> > mm/migrate_device.c | 69 ++++++++++++++++++++++++++++++---------------
>> > 2 files changed, 56 insertions(+), 26 deletions(-)
>> >
>> > diff --git a/mm/memory.c b/mm/memory.c
>> > index 2366578015ad..b72bde782611 100644
>> > --- a/mm/memory.c
>> > +++ b/mm/memory.c
>> > @@ -4252,10 +4252,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> > * Get a page reference while we know the page can't be
>> > * freed.
>> > */
>> > - get_page(vmf->page);
>> > - pte_unmap_unlock(vmf->pte, vmf->ptl);
>> > - ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
>> > - put_page(vmf->page);
>> > + if (trylock_page(vmf->page)) {
>> > + get_page(vmf->page);
>> > + pte_unmap_unlock(vmf->pte, vmf->ptl);
>> > + ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
>> > + put_page(vmf->page);
>> > + unlock_page(vmf->page);
>>
>> I don't think my previous review of this change has really been
>> addressed. Why don't we just install the migration entry here? Seems
>> like it would be a much simpler way of solving this.
>>
>
> I should have mentioned this in the cover letter: I haven't got around
> to trying that out yet. I included this existing version for correctness,
> but I also think this is not strictly required to merge this series, as
> our locking in migrate_to_ram only relies on the core MM locks, so
> some thread would eventually win the race and make forward progress.
>
> So I guess just ignore this patch and will send an updated version
> individually with installing a migration entry in do_swap_page. If for
> some reason that doesn't work, I'll respond here explaining why.
That would be great. I have a fairly strong preference for doing that
instead of adding more special cases for the fault page in the migration
code. And if we can't do that it would be good to understand
why. Thanks.
- Alistair
> Matt
>
>> > + } else {
>> > + pte_unmap_unlock(vmf->pte, vmf->ptl);
>> > + }
>> > } else if (is_hwpoison_entry(entry)) {
>> > ret = VM_FAULT_HWPOISON;
>> > } else if (is_pte_marker_entry(entry)) {
>> > diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>> > index f163c2131022..2477d39f57be 100644
>> > --- a/mm/migrate_device.c
>> > +++ b/mm/migrate_device.c
>> > @@ -60,6 +60,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>> > struct mm_walk *walk)
>> > {
>> > struct migrate_vma *migrate = walk->private;
>> > + struct folio *fault_folio = migrate->fault_page ?
>> > + page_folio(migrate->fault_page) : NULL;
>> > struct vm_area_struct *vma = walk->vma;
>> > struct mm_struct *mm = vma->vm_mm;
>> > unsigned long addr = start, unmapped = 0;
>> > @@ -88,11 +90,13 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>> >
>> > folio_get(folio);
>> > spin_unlock(ptl);
>> > - if (unlikely(!folio_trylock(folio)))
>> > + if (unlikely(fault_folio != folio &&
>> > + !folio_trylock(folio)))
>> > return migrate_vma_collect_skip(start, end,
>> > walk);
>> > ret = split_folio(folio);
>> > - folio_unlock(folio);
>> > + if (fault_folio != folio)
>> > + folio_unlock(folio);
>> > folio_put(folio);
>> > if (ret)
>> > return migrate_vma_collect_skip(start, end,
>> > @@ -192,7 +196,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>> > * optimisation to avoid walking the rmap later with
>> > * try_to_migrate().
>> > */
>> > - if (folio_trylock(folio)) {
>> > + if (fault_folio == folio || folio_trylock(folio)) {
>> > bool anon_exclusive;
>> > pte_t swp_pte;
>> >
>> > @@ -204,7 +208,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>> >
>> > if (folio_try_share_anon_rmap_pte(folio, page)) {
>> > set_pte_at(mm, addr, ptep, pte);
>> > - folio_unlock(folio);
>> > + if (fault_folio != folio)
>> > + folio_unlock(folio);
>> > folio_put(folio);
>> > mpfn = 0;
>> > goto next;
>> > @@ -363,6 +368,8 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
>> > unsigned long npages,
>> > struct page *fault_page)
>> > {
>> > + struct folio *fault_folio = fault_page ?
>> > + page_folio(fault_page) : NULL;
>> > unsigned long i, restore = 0;
>> > bool allow_drain = true;
>> > unsigned long unmapped = 0;
>> > @@ -427,7 +434,8 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
>> > remove_migration_ptes(folio, folio, 0);
>> >
>> > src_pfns[i] = 0;
>> > - folio_unlock(folio);
>> > + if (fault_folio != folio)
>> > + folio_unlock(folio);
>> > folio_put(folio);
>> > restore--;
>> > }
>> > @@ -536,6 +544,8 @@ int migrate_vma_setup(struct migrate_vma *args)
>> > return -EINVAL;
>> > if (args->fault_page && !is_device_private_page(args->fault_page))
>> > return -EINVAL;
>> > + if (args->fault_page && !PageLocked(args->fault_page))
>> > + return -EINVAL;
>> >
>> > memset(args->src, 0, sizeof(*args->src) * nr_pages);
>> > args->cpages = 0;
>> > @@ -799,19 +809,13 @@ void migrate_vma_pages(struct migrate_vma *migrate)
>> > }
>> > EXPORT_SYMBOL(migrate_vma_pages);
>> >
>> > -/*
>> > - * migrate_device_finalize() - complete page migration
>> > - * @src_pfns: src_pfns returned from migrate_device_range()
>> > - * @dst_pfns: array of pfns allocated by the driver to migrate memory to
>> > - * @npages: number of pages in the range
>> > - *
>> > - * Completes migration of the page by removing special migration entries.
>> > - * Drivers must ensure copying of page data is complete and visible to the CPU
>> > - * before calling this.
>> > - */
>> > -void migrate_device_finalize(unsigned long *src_pfns,
>> > - unsigned long *dst_pfns, unsigned long npages)
>> > +static void __migrate_device_finalize(unsigned long *src_pfns,
>> > + unsigned long *dst_pfns,
>> > + unsigned long npages,
>> > + struct page *fault_page)
>> > {
>> > + struct folio *fault_folio = fault_page ?
>> > + page_folio(fault_page) : NULL;
>> > unsigned long i;
>> >
>> > for (i = 0; i < npages; i++) {
>> > @@ -824,7 +828,8 @@ void migrate_device_finalize(unsigned long *src_pfns,
>> >
>> > if (!page) {
>> > if (dst) {
>> > - folio_unlock(dst);
>> > + if (fault_folio != dst)
>> > + folio_unlock(dst);
>> > folio_put(dst);
>> > }
>> > continue;
>> > @@ -834,14 +839,16 @@ void migrate_device_finalize(unsigned long *src_pfns,
>> >
>> > if (!(src_pfns[i] & MIGRATE_PFN_MIGRATE) || !dst) {
>> > if (dst) {
>> > - folio_unlock(dst);
>> > + if (fault_folio != dst)
>> > + folio_unlock(dst);
>> > folio_put(dst);
>> > }
>> > dst = src;
>> > }
>> >
>> > remove_migration_ptes(src, dst, 0);
>> > - folio_unlock(src);
>> > + if (fault_folio != src)
>> > + folio_unlock(src);
>> >
>> > if (folio_is_zone_device(src))
>> > folio_put(src);
>> > @@ -849,7 +856,8 @@ void migrate_device_finalize(unsigned long *src_pfns,
>> > folio_putback_lru(src);
>> >
>> > if (dst != src) {
>> > - folio_unlock(dst);
>> > + if (fault_folio != dst)
>> > + folio_unlock(dst);
>> > if (folio_is_zone_device(dst))
>> > folio_put(dst);
>> > else
>> > @@ -857,6 +865,22 @@ void migrate_device_finalize(unsigned long *src_pfns,
>> > }
>> > }
>> > }
>> > +
>> > +/*
>> > + * migrate_device_finalize() - complete page migration
>> > + * @src_pfns: src_pfns returned from migrate_device_range()
>> > + * @dst_pfns: array of pfns allocated by the driver to migrate memory to
>> > + * @npages: number of pages in the range
>> > + *
>> > + * Completes migration of the page by removing special migration entries.
>> > + * Drivers must ensure copying of page data is complete and visible to the CPU
>> > + * before calling this.
>> > + */
>> > +void migrate_device_finalize(unsigned long *src_pfns,
>> > + unsigned long *dst_pfns, unsigned long npages)
>> > +{
>> > + return __migrate_device_finalize(src_pfns, dst_pfns, npages, NULL);
>> > +}
>> > EXPORT_SYMBOL(migrate_device_finalize);
>> >
>> > /**
>> > @@ -872,7 +896,8 @@ EXPORT_SYMBOL(migrate_device_finalize);
>> > */
>> > void migrate_vma_finalize(struct migrate_vma *migrate)
>> > {
>> > - migrate_device_finalize(migrate->src, migrate->dst, migrate->npages);
>> > + __migrate_device_finalize(migrate->src, migrate->dst, migrate->npages,
>> > + migrate->fault_page);
>> > }
>> > EXPORT_SYMBOL(migrate_vma_finalize);
>>
^ permalink raw reply [flat|nested] 129+ messages in thread* Re: [PATCH v2 03/29] mm/migrate: Trylock device page in do_swap_page
2024-10-17 1:51 ` Alistair Popple
@ 2024-10-25 0:31 ` Matthew Brost
2024-10-29 6:37 ` Alistair Popple
0 siblings, 1 reply; 129+ messages in thread
From: Matthew Brost @ 2024-10-25 0:31 UTC (permalink / raw)
To: Alistair Popple
Cc: intel-xe, dri-devel, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
On Thu, Oct 17, 2024 at 12:51:08PM +1100, Alistair Popple wrote:
>
> Matthew Brost <matthew.brost@intel.com> writes:
>
> > On Wed, Oct 16, 2024 at 03:00:08PM +1100, Alistair Popple wrote:
> >>
> >> Matthew Brost <matthew.brost@intel.com> writes:
> >>
> >> > Avoid multiple CPU page faults to the same device page racing by trying
> >> > to lock the page in do_swap_page before taking an extra reference to the
> >> > page. This prevents scenarios where multiple CPU page faults each take
> >> > an extra reference to a device page, which could abort migration in
> >> > folio_migrate_mapping. With the device page being locked in
> >> > do_swap_page, the migrate_vma_* functions need to be updated to avoid
> >> > locking the fault_page argument.
> >> >
> >> > Prior to this change, a livelock scenario could occur in Xe's (Intel GPU
> >> > DRM driver) SVM implementation if enough threads faulted the same device
> >> > page.
> >> >
> >> > Cc: Philip Yang <Philip.Yang@amd.com>
> >> > Cc: Felix Kuehling <felix.kuehling@amd.com>
> >> > Cc: Christian König <christian.koenig@amd.com>
> >> > Cc: Andrew Morton <akpm@linux-foundation.org>
> >> > Suggested-by: Simona Vetter <simona.vetter@ffwll.ch>
> >> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> >> > ---
> >> > mm/memory.c | 13 ++++++---
> >> > mm/migrate_device.c | 69 ++++++++++++++++++++++++++++++---------------
> >> > 2 files changed, 56 insertions(+), 26 deletions(-)
> >> >
> >> > diff --git a/mm/memory.c b/mm/memory.c
> >> > index 2366578015ad..b72bde782611 100644
> >> > --- a/mm/memory.c
> >> > +++ b/mm/memory.c
> >> > @@ -4252,10 +4252,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >> > * Get a page reference while we know the page can't be
> >> > * freed.
> >> > */
> >> > - get_page(vmf->page);
> >> > - pte_unmap_unlock(vmf->pte, vmf->ptl);
> >> > - ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
> >> > - put_page(vmf->page);
> >> > + if (trylock_page(vmf->page)) {
> >> > + get_page(vmf->page);
> >> > + pte_unmap_unlock(vmf->pte, vmf->ptl);
> >> > + ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
> >> > + put_page(vmf->page);
> >> > + unlock_page(vmf->page);
> >>
> >> I don't think my previous review of this change has really been
> >> addressed. Why don't we just install the migration entry here? Seems
> >> like it would be a much simpler way of solving this.
> >>
> >
> > I should have mentioned this in the cover letter: I haven't got around
> > to trying that out yet. I included this existing version for correctness,
> > but I also think this is not strictly required to merge this series, as
> > our locking in migrate_to_ram only relies on the core MM locks, so
> > some thread would eventually win the race and make forward progress.
> >
> > So I guess just ignore this patch and will send an updated version
> > individually with installing a migration entry in do_swap_page. If for
> > some reason that doesn't work, I'll respond here explaining why.
>
> That would be great. I have a fairly strong preference for doing that
> instead of adding more special cases for the fault page in the migration
> code. And if we can't do that it would be good to understand
> why. Thanks.
>
I've looked into this and actually prefer the approach in this patch.
Consider the scenario where we install a migration entry, but
migrate_to_ram fails. How do we handle this?
We don't know where migrate_to_ram failed. Was migrate_device_finalize
called, removing the migration PTE? Do we need to special-case failures
in migrate_to_ram to prevent migrate_device_finalize from removing the
faulting page's migration entry? Should we check for a migration entry
after migrate_to_ram and remove it if it exists?
Now, if migrate_to_ram succeeds, it seems the migration entry should be
removed in migrate_device_finalize since the new page is only available
there. We could return the new page in migrate_to_ram, but that feels
messy.
Additionally, the page lock needs to be held across migrate_to_ram, as
this patch does, so we'll require some special handling in
migrate_device_finalize to avoid unlocking the faulting page.
Finally, installing a migration entry is non-trivial, while taking a
page reference under a lock is straightforward.
Given all this, I prefer to keep this patch as it is.
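To make that concrete, a driver-side migrate_to_ram() callback under this
patch would look roughly like the sketch below (loosely modelled on the
dmirror test driver in lib/test_hmm.c; the function name and details are
illustrative, not taken from this series). The key points are that vmf->page
arrives already locked, is passed as fault_page so the new check in
migrate_vma_setup() passes, and stays locked across the whole sequence so
that do_swap_page() can unlock it:

#include <linux/mm.h>
#include <linux/migrate.h>

/* Illustrative sketch, not code from this series. */
static vm_fault_t example_migrate_to_ram(struct vm_fault *vmf)
{
	unsigned long src = 0, dst = 0;
	struct migrate_vma args = {
		.vma		= vmf->vma,
		.start		= vmf->address & PAGE_MASK,
		.end		= (vmf->address & PAGE_MASK) + PAGE_SIZE,
		.src		= &src,
		.dst		= &dst,
		.pgmap_owner	= vmf->page->pgmap->owner,
		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
		/* Already locked by do_swap_page() with this patch applied. */
		.fault_page	= vmf->page,
	};
	struct page *dpage;

	if (migrate_vma_setup(&args))
		return VM_FAULT_SIGBUS;

	if (src & MIGRATE_PFN_MIGRATE) {
		dpage = alloc_page_vma(GFP_HIGHUSER, vmf->vma, vmf->address);
		if (dpage) {
			lock_page(dpage);
			dst = migrate_pfn(page_to_pfn(dpage));
			/* Driver copies the device page contents to dpage here. */
		}
	}

	migrate_vma_pages(&args);
	/* Finalize skips unlocking fault_page; do_swap_page() unlocks it. */
	migrate_vma_finalize(&args);

	return 0;
}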
Matt
> - Alistair
>
> > Matt
> >
> >> > + } else {
> >> > + pte_unmap_unlock(vmf->pte, vmf->ptl);
> >> > + }
> >> > } else if (is_hwpoison_entry(entry)) {
> >> > ret = VM_FAULT_HWPOISON;
> >> > } else if (is_pte_marker_entry(entry)) {
> >> > diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> >> > index f163c2131022..2477d39f57be 100644
> >> > --- a/mm/migrate_device.c
> >> > +++ b/mm/migrate_device.c
> >> > @@ -60,6 +60,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >> > struct mm_walk *walk)
> >> > {
> >> > struct migrate_vma *migrate = walk->private;
> >> > + struct folio *fault_folio = migrate->fault_page ?
> >> > + page_folio(migrate->fault_page) : NULL;
> >> > struct vm_area_struct *vma = walk->vma;
> >> > struct mm_struct *mm = vma->vm_mm;
> >> > unsigned long addr = start, unmapped = 0;
> >> > @@ -88,11 +90,13 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >> >
> >> > folio_get(folio);
> >> > spin_unlock(ptl);
> >> > - if (unlikely(!folio_trylock(folio)))
> >> > + if (unlikely(fault_folio != folio &&
> >> > + !folio_trylock(folio)))
> >> > return migrate_vma_collect_skip(start, end,
> >> > walk);
> >> > ret = split_folio(folio);
> >> > - folio_unlock(folio);
> >> > + if (fault_folio != folio)
> >> > + folio_unlock(folio);
> >> > folio_put(folio);
> >> > if (ret)
> >> > return migrate_vma_collect_skip(start, end,
> >> > @@ -192,7 +196,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >> > * optimisation to avoid walking the rmap later with
> >> > * try_to_migrate().
> >> > */
> >> > - if (folio_trylock(folio)) {
> >> > + if (fault_folio == folio || folio_trylock(folio)) {
> >> > bool anon_exclusive;
> >> > pte_t swp_pte;
> >> >
> >> > @@ -204,7 +208,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >> >
> >> > if (folio_try_share_anon_rmap_pte(folio, page)) {
> >> > set_pte_at(mm, addr, ptep, pte);
> >> > - folio_unlock(folio);
> >> > + if (fault_folio != folio)
> >> > + folio_unlock(folio);
> >> > folio_put(folio);
> >> > mpfn = 0;
> >> > goto next;
> >> > @@ -363,6 +368,8 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
> >> > unsigned long npages,
> >> > struct page *fault_page)
> >> > {
> >> > + struct folio *fault_folio = fault_page ?
> >> > + page_folio(fault_page) : NULL;
> >> > unsigned long i, restore = 0;
> >> > bool allow_drain = true;
> >> > unsigned long unmapped = 0;
> >> > @@ -427,7 +434,8 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
> >> > remove_migration_ptes(folio, folio, 0);
> >> >
> >> > src_pfns[i] = 0;
> >> > - folio_unlock(folio);
> >> > + if (fault_folio != folio)
> >> > + folio_unlock(folio);
> >> > folio_put(folio);
> >> > restore--;
> >> > }
> >> > @@ -536,6 +544,8 @@ int migrate_vma_setup(struct migrate_vma *args)
> >> > return -EINVAL;
> >> > if (args->fault_page && !is_device_private_page(args->fault_page))
> >> > return -EINVAL;
> >> > + if (args->fault_page && !PageLocked(args->fault_page))
> >> > + return -EINVAL;
> >> >
> >> > memset(args->src, 0, sizeof(*args->src) * nr_pages);
> >> > args->cpages = 0;
> >> > @@ -799,19 +809,13 @@ void migrate_vma_pages(struct migrate_vma *migrate)
> >> > }
> >> > EXPORT_SYMBOL(migrate_vma_pages);
> >> >
> >> > -/*
> >> > - * migrate_device_finalize() - complete page migration
> >> > - * @src_pfns: src_pfns returned from migrate_device_range()
> >> > - * @dst_pfns: array of pfns allocated by the driver to migrate memory to
> >> > - * @npages: number of pages in the range
> >> > - *
> >> > - * Completes migration of the page by removing special migration entries.
> >> > - * Drivers must ensure copying of page data is complete and visible to the CPU
> >> > - * before calling this.
> >> > - */
> >> > -void migrate_device_finalize(unsigned long *src_pfns,
> >> > - unsigned long *dst_pfns, unsigned long npages)
> >> > +static void __migrate_device_finalize(unsigned long *src_pfns,
> >> > + unsigned long *dst_pfns,
> >> > + unsigned long npages,
> >> > + struct page *fault_page)
> >> > {
> >> > + struct folio *fault_folio = fault_page ?
> >> > + page_folio(fault_page) : NULL;
> >> > unsigned long i;
> >> >
> >> > for (i = 0; i < npages; i++) {
> >> > @@ -824,7 +828,8 @@ void migrate_device_finalize(unsigned long *src_pfns,
> >> >
> >> > if (!page) {
> >> > if (dst) {
> >> > - folio_unlock(dst);
> >> > + if (fault_folio != dst)
> >> > + folio_unlock(dst);
> >> > folio_put(dst);
> >> > }
> >> > continue;
> >> > @@ -834,14 +839,16 @@ void migrate_device_finalize(unsigned long *src_pfns,
> >> >
> >> > if (!(src_pfns[i] & MIGRATE_PFN_MIGRATE) || !dst) {
> >> > if (dst) {
> >> > - folio_unlock(dst);
> >> > + if (fault_folio != dst)
> >> > + folio_unlock(dst);
> >> > folio_put(dst);
> >> > }
> >> > dst = src;
> >> > }
> >> >
> >> > remove_migration_ptes(src, dst, 0);
> >> > - folio_unlock(src);
> >> > + if (fault_folio != src)
> >> > + folio_unlock(src);
> >> >
> >> > if (folio_is_zone_device(src))
> >> > folio_put(src);
> >> > @@ -849,7 +856,8 @@ void migrate_device_finalize(unsigned long *src_pfns,
> >> > folio_putback_lru(src);
> >> >
> >> > if (dst != src) {
> >> > - folio_unlock(dst);
> >> > + if (fault_folio != dst)
> >> > + folio_unlock(dst);
> >> > if (folio_is_zone_device(dst))
> >> > folio_put(dst);
> >> > else
> >> > @@ -857,6 +865,22 @@ void migrate_device_finalize(unsigned long *src_pfns,
> >> > }
> >> > }
> >> > }
> >> > +
> >> > +/*
> >> > + * migrate_device_finalize() - complete page migration
> >> > + * @src_pfns: src_pfns returned from migrate_device_range()
> >> > + * @dst_pfns: array of pfns allocated by the driver to migrate memory to
> >> > + * @npages: number of pages in the range
> >> > + *
> >> > + * Completes migration of the page by removing special migration entries.
> >> > + * Drivers must ensure copying of page data is complete and visible to the CPU
> >> > + * before calling this.
> >> > + */
> >> > +void migrate_device_finalize(unsigned long *src_pfns,
> >> > + unsigned long *dst_pfns, unsigned long npages)
> >> > +{
> >> > + return __migrate_device_finalize(src_pfns, dst_pfns, npages, NULL);
> >> > +}
> >> > EXPORT_SYMBOL(migrate_device_finalize);
> >> >
> >> > /**
> >> > @@ -872,7 +896,8 @@ EXPORT_SYMBOL(migrate_device_finalize);
> >> > */
> >> > void migrate_vma_finalize(struct migrate_vma *migrate)
> >> > {
> >> > - migrate_device_finalize(migrate->src, migrate->dst, migrate->npages);
> >> > + __migrate_device_finalize(migrate->src, migrate->dst, migrate->npages,
> >> > + migrate->fault_page);
> >> > }
> >> > EXPORT_SYMBOL(migrate_vma_finalize);
> >>
>
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [PATCH v2 03/29] mm/migrate: Trylock device page in do_swap_page
2024-10-25 0:31 ` Matthew Brost
@ 2024-10-29 6:37 ` Alistair Popple
2024-11-01 17:19 ` Matthew Brost
0 siblings, 1 reply; 129+ messages in thread
From: Alistair Popple @ 2024-10-29 6:37 UTC (permalink / raw)
To: Matthew Brost
Cc: intel-xe, dri-devel, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
Matthew Brost <matthew.brost@intel.com> writes:
> On Thu, Oct 17, 2024 at 12:51:08PM +1100, Alistair Popple wrote:
>>
>> Matthew Brost <matthew.brost@intel.com> writes:
>>
>> > On Wed, Oct 16, 2024 at 03:00:08PM +1100, Alistair Popple wrote:
>> >>
>> >> Matthew Brost <matthew.brost@intel.com> writes:
>> >>
>> >> > Avoid multiple CPU page faults to the same device page racing by trying
>> >> > to lock the page in do_swap_page before taking an extra reference to the
>> >> > page. This prevents scenarios where multiple CPU page faults each take
>> >> > an extra reference to a device page, which could abort migration in
>> >> > folio_migrate_mapping. With the device page being locked in
>> >> > do_swap_page, the migrate_vma_* functions need to be updated to avoid
>> >> > locking the fault_page argument.
>> >> >
>> >> > Prior to this change, a livelock scenario could occur in Xe's (Intel GPU
>> >> > DRM driver) SVM implementation if enough threads faulted the same device
>> >> > page.
>> >> >
>> >> > Cc: Philip Yang <Philip.Yang@amd.com>
>> >> > Cc: Felix Kuehling <felix.kuehling@amd.com>
>> >> > Cc: Christian König <christian.koenig@amd.com>
>> >> > Cc: Andrew Morton <akpm@linux-foundation.org>
>> >> > Suggessted-by: Simona Vetter <simona.vetter@ffwll.ch>
>> >> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>> >> > ---
>> >> > mm/memory.c | 13 ++++++---
>> >> > mm/migrate_device.c | 69 ++++++++++++++++++++++++++++++---------------
>> >> > 2 files changed, 56 insertions(+), 26 deletions(-)
>> >> >
>> >> > diff --git a/mm/memory.c b/mm/memory.c
>> >> > index 2366578015ad..b72bde782611 100644
>> >> > --- a/mm/memory.c
>> >> > +++ b/mm/memory.c
>> >> > @@ -4252,10 +4252,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> >> > * Get a page reference while we know the page can't be
>> >> > * freed.
>> >> > */
>> >> > - get_page(vmf->page);
>> >> > - pte_unmap_unlock(vmf->pte, vmf->ptl);
>> >> > - ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
>> >> > - put_page(vmf->page);
>> >> > + if (trylock_page(vmf->page)) {
>> >> > + get_page(vmf->page);
>> >> > + pte_unmap_unlock(vmf->pte, vmf->ptl);
>> >> > + ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
>> >> > + put_page(vmf->page);
>> >> > + unlock_page(vmf->page);
>> >>
>> >> I don't think my previous review of this change has really been
>> >> addressed. Why don't we just install the migration entry here? Seems
>> >> like it would be a much simpler way of solving this.
>> >>
>> >
>> > I should have mentioned this in the cover-letter, I haven't got around
>> > to trying that out yet. Included this existing version for correctness
>> > but I also think this is not strickly required to merge this series as
>> > our locking in migrate_to_ram only relies on the core MM locks so
>> > some thread would eventually win the race and make forward progress.
>> >
>> > So I guess just ignore this patch and will send an updated version
>> > individually with installing a migration entry in do_swap_page. If for
>> > some reason that doesn't work, I'll respond here explaining why.
>>
>> That would be great. I have a fairly strong preference for doing that
>> instead of adding more special cases for the fault page in the migration
>> code. And if we can't do that it would be good to understand
>> why. Thanks.
>>
>
> I've looked into this and actually prefer the approach in this patch.
Thanks for looking into this.
> Consider the scenario where we install a migration entry, but
> migrate_to_ram fails. How do we handle this?
>
> We don't know where migrate_to_ram failed. Was migrate_device_finalize
> called, removing the migration PTE? Do we need to special-case failures
> in migrate_to_ram to prevent migrate_device_finalize from removing the
> faulting page's migration entry? Should we check for a migration entry
> after migrate_to_ram and remove it if it exists?
The driver should always call migrate_device_finalize(). On failure it
will remove the migration entry and remap the original device private
page. That obviously doesn't handle the fault but the process is about
to die anyway with a SIGBUS because migrate_to_ram() can't fail.
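In other words, even the error path in a driver's migrate_to_ram() is
expected to run the full setup/pages/finalize sequence (finalize via
migrate_vma_finalize() in the fault path), roughly like this - sketch
only, my_alloc_and_copy_to_ram() and my_pgmap_owner are driver-specific
placeholders:

        static vm_fault_t my_migrate_to_ram(struct vm_fault *vmf)
        {
                unsigned long src = 0, dst = 0;
                struct migrate_vma args = {
                        .vma            = vmf->vma,
                        .start          = vmf->address & PAGE_MASK,
                        .end            = (vmf->address & PAGE_MASK) + PAGE_SIZE,
                        .src            = &src,
                        .dst            = &dst,
                        .pgmap_owner    = my_pgmap_owner,       /* placeholder */
                        .flags          = MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
                        .fault_page     = vmf->page,
                };
                int err = 0;

                if (migrate_vma_setup(&args))
                        return VM_FAULT_SIGBUS;

                if (args.cpages) {
                        /* Placeholder: allocate a system page, copy, fill dst. */
                        err = my_alloc_and_copy_to_ram(&args);
                        if (err)
                                dst = 0; /* don't migrate; original page gets remapped */
                }

                /* Always run pages + finalize, even on error. */
                migrate_vma_pages(&args);
                migrate_vma_finalize(&args);

                return err ? VM_FAULT_SIGBUS : 0;
        }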
> Now, if migrate_to_ram succeeds, it seems the migration entry should be
> removed in migrate_device_finalize since the new page is only available
> there. We could return the new page in migrate_to_ram, but that feels
> messy.
Agreed - I would expect migrate_device_finalize() to always be called
and remove the migration entry.
> Additionally, the page lock needs to be held across migrate_to_ram, as
> this patch does, so we'll require some special handling in
> migrate_device_finalize to avoid unlocking the faulting page.
Or just unlock it in migrate_device_finalize(). I agree locking it in one
place and unlocking it in another is a bit ugly though.
> Finally, installing a migration entry is non-trivial, while taking a
> page reference under a lock is straightforward.
I didn't think it was that hard once you have the PTL - although there
is a bit of bookkeeping, the same as in migrate_vma_collect_pmd(), but
that could be abstracted into a common function. The advantage is that it
saves a page table walk, but I suppose that isn't that relevant if
you're migrating a group of pages.
> Given all this, I prefer to keep this patch as it is.
Ok, I will take a closer look at it. Thanks.
- Alistair
> Matt
>
>> - Alistair
>>
>> > Matt
>> >
>> >> > + } else {
>> >> > + pte_unmap_unlock(vmf->pte, vmf->ptl);
>> >> > + }
>> >> > } else if (is_hwpoison_entry(entry)) {
>> >> > ret = VM_FAULT_HWPOISON;
>> >> > } else if (is_pte_marker_entry(entry)) {
>> >> > diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>> >> > index f163c2131022..2477d39f57be 100644
>> >> > --- a/mm/migrate_device.c
>> >> > +++ b/mm/migrate_device.c
>> >> > @@ -60,6 +60,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>> >> > struct mm_walk *walk)
>> >> > {
>> >> > struct migrate_vma *migrate = walk->private;
>> >> > + struct folio *fault_folio = migrate->fault_page ?
>> >> > + page_folio(migrate->fault_page) : NULL;
>> >> > struct vm_area_struct *vma = walk->vma;
>> >> > struct mm_struct *mm = vma->vm_mm;
>> >> > unsigned long addr = start, unmapped = 0;
>> >> > @@ -88,11 +90,13 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>> >> >
>> >> > folio_get(folio);
>> >> > spin_unlock(ptl);
>> >> > - if (unlikely(!folio_trylock(folio)))
>> >> > + if (unlikely(fault_folio != folio &&
>> >> > + !folio_trylock(folio)))
>> >> > return migrate_vma_collect_skip(start, end,
>> >> > walk);
>> >> > ret = split_folio(folio);
>> >> > - folio_unlock(folio);
>> >> > + if (fault_folio != folio)
>> >> > + folio_unlock(folio);
>> >> > folio_put(folio);
>> >> > if (ret)
>> >> > return migrate_vma_collect_skip(start, end,
>> >> > @@ -192,7 +196,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>> >> > * optimisation to avoid walking the rmap later with
>> >> > * try_to_migrate().
>> >> > */
>> >> > - if (folio_trylock(folio)) {
>> >> > + if (fault_folio == folio || folio_trylock(folio)) {
>> >> > bool anon_exclusive;
>> >> > pte_t swp_pte;
>> >> >
>> >> > @@ -204,7 +208,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>> >> >
>> >> > if (folio_try_share_anon_rmap_pte(folio, page)) {
>> >> > set_pte_at(mm, addr, ptep, pte);
>> >> > - folio_unlock(folio);
>> >> > + if (fault_folio != folio)
>> >> > + folio_unlock(folio);
>> >> > folio_put(folio);
>> >> > mpfn = 0;
>> >> > goto next;
>> >> > @@ -363,6 +368,8 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
>> >> > unsigned long npages,
>> >> > struct page *fault_page)
>> >> > {
>> >> > + struct folio *fault_folio = fault_page ?
>> >> > + page_folio(fault_page) : NULL;
>> >> > unsigned long i, restore = 0;
>> >> > bool allow_drain = true;
>> >> > unsigned long unmapped = 0;
>> >> > @@ -427,7 +434,8 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
>> >> > remove_migration_ptes(folio, folio, 0);
>> >> >
>> >> > src_pfns[i] = 0;
>> >> > - folio_unlock(folio);
>> >> > + if (fault_folio != folio)
>> >> > + folio_unlock(folio);
>> >> > folio_put(folio);
>> >> > restore--;
>> >> > }
>> >> > @@ -536,6 +544,8 @@ int migrate_vma_setup(struct migrate_vma *args)
>> >> > return -EINVAL;
>> >> > if (args->fault_page && !is_device_private_page(args->fault_page))
>> >> > return -EINVAL;
>> >> > + if (args->fault_page && !PageLocked(args->fault_page))
>> >> > + return -EINVAL;
>> >> >
>> >> > memset(args->src, 0, sizeof(*args->src) * nr_pages);
>> >> > args->cpages = 0;
>> >> > @@ -799,19 +809,13 @@ void migrate_vma_pages(struct migrate_vma *migrate)
>> >> > }
>> >> > EXPORT_SYMBOL(migrate_vma_pages);
>> >> >
>> >> > -/*
>> >> > - * migrate_device_finalize() - complete page migration
>> >> > - * @src_pfns: src_pfns returned from migrate_device_range()
>> >> > - * @dst_pfns: array of pfns allocated by the driver to migrate memory to
>> >> > - * @npages: number of pages in the range
>> >> > - *
>> >> > - * Completes migration of the page by removing special migration entries.
>> >> > - * Drivers must ensure copying of page data is complete and visible to the CPU
>> >> > - * before calling this.
>> >> > - */
>> >> > -void migrate_device_finalize(unsigned long *src_pfns,
>> >> > - unsigned long *dst_pfns, unsigned long npages)
>> >> > +static void __migrate_device_finalize(unsigned long *src_pfns,
>> >> > + unsigned long *dst_pfns,
>> >> > + unsigned long npages,
>> >> > + struct page *fault_page)
>> >> > {
>> >> > + struct folio *fault_folio = fault_page ?
>> >> > + page_folio(fault_page) : NULL;
>> >> > unsigned long i;
>> >> >
>> >> > for (i = 0; i < npages; i++) {
>> >> > @@ -824,7 +828,8 @@ void migrate_device_finalize(unsigned long *src_pfns,
>> >> >
>> >> > if (!page) {
>> >> > if (dst) {
>> >> > - folio_unlock(dst);
>> >> > + if (fault_folio != dst)
>> >> > + folio_unlock(dst);
>> >> > folio_put(dst);
>> >> > }
>> >> > continue;
>> >> > @@ -834,14 +839,16 @@ void migrate_device_finalize(unsigned long *src_pfns,
>> >> >
>> >> > if (!(src_pfns[i] & MIGRATE_PFN_MIGRATE) || !dst) {
>> >> > if (dst) {
>> >> > - folio_unlock(dst);
>> >> > + if (fault_folio != dst)
>> >> > + folio_unlock(dst);
>> >> > folio_put(dst);
>> >> > }
>> >> > dst = src;
>> >> > }
>> >> >
>> >> > remove_migration_ptes(src, dst, 0);
>> >> > - folio_unlock(src);
>> >> > + if (fault_folio != src)
>> >> > + folio_unlock(src);
>> >> >
>> >> > if (folio_is_zone_device(src))
>> >> > folio_put(src);
>> >> > @@ -849,7 +856,8 @@ void migrate_device_finalize(unsigned long *src_pfns,
>> >> > folio_putback_lru(src);
>> >> >
>> >> > if (dst != src) {
>> >> > - folio_unlock(dst);
>> >> > + if (fault_folio != dst)
>> >> > + folio_unlock(dst);
>> >> > if (folio_is_zone_device(dst))
>> >> > folio_put(dst);
>> >> > else
>> >> > @@ -857,6 +865,22 @@ void migrate_device_finalize(unsigned long *src_pfns,
>> >> > }
>> >> > }
>> >> > }
>> >> > +
>> >> > +/*
>> >> > + * migrate_device_finalize() - complete page migration
>> >> > + * @src_pfns: src_pfns returned from migrate_device_range()
>> >> > + * @dst_pfns: array of pfns allocated by the driver to migrate memory to
>> >> > + * @npages: number of pages in the range
>> >> > + *
>> >> > + * Completes migration of the page by removing special migration entries.
>> >> > + * Drivers must ensure copying of page data is complete and visible to the CPU
>> >> > + * before calling this.
>> >> > + */
>> >> > +void migrate_device_finalize(unsigned long *src_pfns,
>> >> > + unsigned long *dst_pfns, unsigned long npages)
>> >> > +{
>> >> > + return __migrate_device_finalize(src_pfns, dst_pfns, npages, NULL);
>> >> > +}
>> >> > EXPORT_SYMBOL(migrate_device_finalize);
>> >> >
>> >> > /**
>> >> > @@ -872,7 +896,8 @@ EXPORT_SYMBOL(migrate_device_finalize);
>> >> > */
>> >> > void migrate_vma_finalize(struct migrate_vma *migrate)
>> >> > {
>> >> > - migrate_device_finalize(migrate->src, migrate->dst, migrate->npages);
>> >> > + __migrate_device_finalize(migrate->src, migrate->dst, migrate->npages,
>> >> > + migrate->fault_page);
>> >> > }
>> >> > EXPORT_SYMBOL(migrate_vma_finalize);
>> >>
>>
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [PATCH v2 03/29] mm/migrate: Trylock device page in do_swap_page
2024-10-29 6:37 ` Alistair Popple
@ 2024-11-01 17:19 ` Matthew Brost
0 siblings, 0 replies; 129+ messages in thread
From: Matthew Brost @ 2024-11-01 17:19 UTC (permalink / raw)
To: Alistair Popple
Cc: intel-xe, dri-devel, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
On Tue, Oct 29, 2024 at 05:37:45PM +1100, Alistair Popple wrote:
>
> Matthew Brost <matthew.brost@intel.com> writes:
>
> > On Thu, Oct 17, 2024 at 12:51:08PM +1100, Alistair Popple wrote:
> >>
> >> Matthew Brost <matthew.brost@intel.com> writes:
> >>
> >> > On Wed, Oct 16, 2024 at 03:00:08PM +1100, Alistair Popple wrote:
> >> >>
> >> >> Matthew Brost <matthew.brost@intel.com> writes:
> >> >>
> >> >> > Avoid multiple CPU page faults to the same device page racing by trying
> >> >> > to lock the page in do_swap_page before taking an extra reference to the
> >> >> > page. This prevents scenarios where multiple CPU page faults each take
> >> >> > an extra reference to a device page, which could abort migration in
> >> >> > folio_migrate_mapping. With the device page being locked in
> >> >> > do_swap_page, the migrate_vma_* functions need to be updated to avoid
> >> >> > locking the fault_page argument.
> >> >> >
> >> >> > Prior to this change, a livelock scenario could occur in Xe's (Intel GPU
> >> >> > DRM driver) SVM implementation if enough threads faulted the same device
> >> >> > page.
> >> >> >
> >> >> > Cc: Philip Yang <Philip.Yang@amd.com>
> >> >> > Cc: Felix Kuehling <felix.kuehling@amd.com>
> >> >> > Cc: Christian König <christian.koenig@amd.com>
> >> >> > Cc: Andrew Morton <akpm@linux-foundation.org>
> >> >> > Suggessted-by: Simona Vetter <simona.vetter@ffwll.ch>
> >> >> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> >> >> > ---
> >> >> > mm/memory.c | 13 ++++++---
> >> >> > mm/migrate_device.c | 69 ++++++++++++++++++++++++++++++---------------
> >> >> > 2 files changed, 56 insertions(+), 26 deletions(-)
> >> >> >
> >> >> > diff --git a/mm/memory.c b/mm/memory.c
> >> >> > index 2366578015ad..b72bde782611 100644
> >> >> > --- a/mm/memory.c
> >> >> > +++ b/mm/memory.c
> >> >> > @@ -4252,10 +4252,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >> >> > * Get a page reference while we know the page can't be
> >> >> > * freed.
> >> >> > */
> >> >> > - get_page(vmf->page);
> >> >> > - pte_unmap_unlock(vmf->pte, vmf->ptl);
> >> >> > - ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
> >> >> > - put_page(vmf->page);
> >> >> > + if (trylock_page(vmf->page)) {
> >> >> > + get_page(vmf->page);
> >> >> > + pte_unmap_unlock(vmf->pte, vmf->ptl);
> >> >> > + ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
> >> >> > + put_page(vmf->page);
> >> >> > + unlock_page(vmf->page);
> >> >>
> >> >> I don't think my previous review of this change has really been
> >> >> addressed. Why don't we just install the migration entry here? Seems
> >> >> like it would be a much simpler way of solving this.
> >> >>
> >> >
> >> > I should have mentioned this in the cover-letter, I haven't got around
> >> > to trying that out yet. Included this existing version for correctness
> >> > but I also think this is not strickly required to merge this series as
> >> > our locking in migrate_to_ram only relies on the core MM locks so
> >> > some thread would eventually win the race and make forward progress.
> >> >
> >> > So I guess just ignore this patch and will send an updated version
> >> > individually with installing a migration entry in do_swap_page. If for
> >> > some reason that doesn't work, I'll respond here explaining why.
> >>
> >> That would be great. I have a fairly strong preference for doing that
> >> instead of adding more special cases for the fault page in the migration
> >> code. And if we can't do that it would be good to understand
> >> why. Thanks.
> >>
> >
> > I've looked into this and actually prefer the approach in this patch.
>
> Thanks for looking into this.
>
> > Consider the scenario where we install a migration entry, but
> > migrate_to_ram fails. How do we handle this?
> >
> > We don't know where migrate_to_ram failed. Was migrate_device_finalize
> > called, removing the migration PTE? Do we need to special-case failures
> > in migrate_to_ram to prevent migrate_device_finalize from removing the
> > faulting page's migration entry? Should we check for a migration entry
> > after migrate_to_ram and remove it if it exists?
>
> The driver should always call migrate_device_finalize(). On failure it
> will remove the migration entry and remap the original device private
> page. That obviously doesn't handle the fault but the process is about
> to die anyway with a SIGBUS because migrate_to_ram() can't fail.
>
What if migrate_to_ram fails before calling migrate_vma_setup - e.g. a
kmalloc of the arguments fails? That is a very awkward situation.
> > Now, if migrate_to_ram succeeds, it seems the migration entry should be
> > removed in migrate_device_finalize since the new page is only available
> > there. We could return the new page in migrate_to_ram, but that feels
> > messy.
>
> Agreed - I would expect migrate_device_finalize() to always be called
> and remove the migration entry.
>
> > Additionally, the page lock needs to be held across migrate_to_ram, as
> > this patch does, so we'll require some special handling in
> > migrate_device_finalize to avoid unlocking the faulting page.
>
> Or just unlock it in migrate_device_finalize(). I agree locking it one
> place and unlocking it in another is a bit ugly though.
Agreed.
>
> > Finally, installing a migration entry is non-trivial, while taking a
> > page reference under a lock is straightforward.
>
> I didn't think it was that hard once you have the PTL - although there
> is a bit of account keeping the same as migrate_vma_collect_pmd() but
> that could be abstracted into a common function. The advantage is it
> saves a page table walk, but I suppose that isn't that relevant if
> you're migrating a group of pages.
>
A helper would definitely be required if we do this.
> > Given all this, I prefer to keep this patch as it is.
>
> Ok, I will take a closer look at it. Thanks.
>
+1. Let me know what you come up with. This patch doesn't strictly block
my work, but it would be good to have something merged to fix this problem soon.
Matt
> - Alistair
>
> > Matt
> >
> >> - Alistair
> >>
> >> > Matt
> >> >
> >> >> > + } else {
> >> >> > + pte_unmap_unlock(vmf->pte, vmf->ptl);
> >> >> > + }
> >> >> > } else if (is_hwpoison_entry(entry)) {
> >> >> > ret = VM_FAULT_HWPOISON;
> >> >> > } else if (is_pte_marker_entry(entry)) {
> >> >> > diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> >> >> > index f163c2131022..2477d39f57be 100644
> >> >> > --- a/mm/migrate_device.c
> >> >> > +++ b/mm/migrate_device.c
> >> >> > @@ -60,6 +60,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >> >> > struct mm_walk *walk)
> >> >> > {
> >> >> > struct migrate_vma *migrate = walk->private;
> >> >> > + struct folio *fault_folio = migrate->fault_page ?
> >> >> > + page_folio(migrate->fault_page) : NULL;
> >> >> > struct vm_area_struct *vma = walk->vma;
> >> >> > struct mm_struct *mm = vma->vm_mm;
> >> >> > unsigned long addr = start, unmapped = 0;
> >> >> > @@ -88,11 +90,13 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >> >> >
> >> >> > folio_get(folio);
> >> >> > spin_unlock(ptl);
> >> >> > - if (unlikely(!folio_trylock(folio)))
> >> >> > + if (unlikely(fault_folio != folio &&
> >> >> > + !folio_trylock(folio)))
> >> >> > return migrate_vma_collect_skip(start, end,
> >> >> > walk);
> >> >> > ret = split_folio(folio);
> >> >> > - folio_unlock(folio);
> >> >> > + if (fault_folio != folio)
> >> >> > + folio_unlock(folio);
> >> >> > folio_put(folio);
> >> >> > if (ret)
> >> >> > return migrate_vma_collect_skip(start, end,
> >> >> > @@ -192,7 +196,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >> >> > * optimisation to avoid walking the rmap later with
> >> >> > * try_to_migrate().
> >> >> > */
> >> >> > - if (folio_trylock(folio)) {
> >> >> > + if (fault_folio == folio || folio_trylock(folio)) {
> >> >> > bool anon_exclusive;
> >> >> > pte_t swp_pte;
> >> >> >
> >> >> > @@ -204,7 +208,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >> >> >
> >> >> > if (folio_try_share_anon_rmap_pte(folio, page)) {
> >> >> > set_pte_at(mm, addr, ptep, pte);
> >> >> > - folio_unlock(folio);
> >> >> > + if (fault_folio != folio)
> >> >> > + folio_unlock(folio);
> >> >> > folio_put(folio);
> >> >> > mpfn = 0;
> >> >> > goto next;
> >> >> > @@ -363,6 +368,8 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
> >> >> > unsigned long npages,
> >> >> > struct page *fault_page)
> >> >> > {
> >> >> > + struct folio *fault_folio = fault_page ?
> >> >> > + page_folio(fault_page) : NULL;
> >> >> > unsigned long i, restore = 0;
> >> >> > bool allow_drain = true;
> >> >> > unsigned long unmapped = 0;
> >> >> > @@ -427,7 +434,8 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
> >> >> > remove_migration_ptes(folio, folio, 0);
> >> >> >
> >> >> > src_pfns[i] = 0;
> >> >> > - folio_unlock(folio);
> >> >> > + if (fault_folio != folio)
> >> >> > + folio_unlock(folio);
> >> >> > folio_put(folio);
> >> >> > restore--;
> >> >> > }
> >> >> > @@ -536,6 +544,8 @@ int migrate_vma_setup(struct migrate_vma *args)
> >> >> > return -EINVAL;
> >> >> > if (args->fault_page && !is_device_private_page(args->fault_page))
> >> >> > return -EINVAL;
> >> >> > + if (args->fault_page && !PageLocked(args->fault_page))
> >> >> > + return -EINVAL;
> >> >> >
> >> >> > memset(args->src, 0, sizeof(*args->src) * nr_pages);
> >> >> > args->cpages = 0;
> >> >> > @@ -799,19 +809,13 @@ void migrate_vma_pages(struct migrate_vma *migrate)
> >> >> > }
> >> >> > EXPORT_SYMBOL(migrate_vma_pages);
> >> >> >
> >> >> > -/*
> >> >> > - * migrate_device_finalize() - complete page migration
> >> >> > - * @src_pfns: src_pfns returned from migrate_device_range()
> >> >> > - * @dst_pfns: array of pfns allocated by the driver to migrate memory to
> >> >> > - * @npages: number of pages in the range
> >> >> > - *
> >> >> > - * Completes migration of the page by removing special migration entries.
> >> >> > - * Drivers must ensure copying of page data is complete and visible to the CPU
> >> >> > - * before calling this.
> >> >> > - */
> >> >> > -void migrate_device_finalize(unsigned long *src_pfns,
> >> >> > - unsigned long *dst_pfns, unsigned long npages)
> >> >> > +static void __migrate_device_finalize(unsigned long *src_pfns,
> >> >> > + unsigned long *dst_pfns,
> >> >> > + unsigned long npages,
> >> >> > + struct page *fault_page)
> >> >> > {
> >> >> > + struct folio *fault_folio = fault_page ?
> >> >> > + page_folio(fault_page) : NULL;
> >> >> > unsigned long i;
> >> >> >
> >> >> > for (i = 0; i < npages; i++) {
> >> >> > @@ -824,7 +828,8 @@ void migrate_device_finalize(unsigned long *src_pfns,
> >> >> >
> >> >> > if (!page) {
> >> >> > if (dst) {
> >> >> > - folio_unlock(dst);
> >> >> > + if (fault_folio != dst)
> >> >> > + folio_unlock(dst);
> >> >> > folio_put(dst);
> >> >> > }
> >> >> > continue;
> >> >> > @@ -834,14 +839,16 @@ void migrate_device_finalize(unsigned long *src_pfns,
> >> >> >
> >> >> > if (!(src_pfns[i] & MIGRATE_PFN_MIGRATE) || !dst) {
> >> >> > if (dst) {
> >> >> > - folio_unlock(dst);
> >> >> > + if (fault_folio != dst)
> >> >> > + folio_unlock(dst);
> >> >> > folio_put(dst);
> >> >> > }
> >> >> > dst = src;
> >> >> > }
> >> >> >
> >> >> > remove_migration_ptes(src, dst, 0);
> >> >> > - folio_unlock(src);
> >> >> > + if (fault_folio != src)
> >> >> > + folio_unlock(src);
> >> >> >
> >> >> > if (folio_is_zone_device(src))
> >> >> > folio_put(src);
> >> >> > @@ -849,7 +856,8 @@ void migrate_device_finalize(unsigned long *src_pfns,
> >> >> > folio_putback_lru(src);
> >> >> >
> >> >> > if (dst != src) {
> >> >> > - folio_unlock(dst);
> >> >> > + if (fault_folio != dst)
> >> >> > + folio_unlock(dst);
> >> >> > if (folio_is_zone_device(dst))
> >> >> > folio_put(dst);
> >> >> > else
> >> >> > @@ -857,6 +865,22 @@ void migrate_device_finalize(unsigned long *src_pfns,
> >> >> > }
> >> >> > }
> >> >> > }
> >> >> > +
> >> >> > +/*
> >> >> > + * migrate_device_finalize() - complete page migration
> >> >> > + * @src_pfns: src_pfns returned from migrate_device_range()
> >> >> > + * @dst_pfns: array of pfns allocated by the driver to migrate memory to
> >> >> > + * @npages: number of pages in the range
> >> >> > + *
> >> >> > + * Completes migration of the page by removing special migration entries.
> >> >> > + * Drivers must ensure copying of page data is complete and visible to the CPU
> >> >> > + * before calling this.
> >> >> > + */
> >> >> > +void migrate_device_finalize(unsigned long *src_pfns,
> >> >> > + unsigned long *dst_pfns, unsigned long npages)
> >> >> > +{
> >> >> > + return __migrate_device_finalize(src_pfns, dst_pfns, npages, NULL);
> >> >> > +}
> >> >> > EXPORT_SYMBOL(migrate_device_finalize);
> >> >> >
> >> >> > /**
> >> >> > @@ -872,7 +896,8 @@ EXPORT_SYMBOL(migrate_device_finalize);
> >> >> > */
> >> >> > void migrate_vma_finalize(struct migrate_vma *migrate)
> >> >> > {
> >> >> > - migrate_device_finalize(migrate->src, migrate->dst, migrate->npages);
> >> >> > + __migrate_device_finalize(migrate->src, migrate->dst, migrate->npages,
> >> >> > + migrate->fault_page);
> >> >> > }
> >> >> > EXPORT_SYMBOL(migrate_vma_finalize);
> >> >>
> >>
>
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [PATCH v2 03/29] mm/migrate: Trylock device page in do_swap_page
2024-10-16 3:24 ` [PATCH v2 03/29] mm/migrate: Trylock device page in do_swap_page Matthew Brost
2024-10-16 4:00 ` Alistair Popple
@ 2024-11-28 23:31 ` Alistair Popple
2024-12-13 22:16 ` Matthew Brost
1 sibling, 1 reply; 129+ messages in thread
From: Alistair Popple @ 2024-11-28 23:31 UTC (permalink / raw)
To: Matthew Brost
Cc: intel-xe, dri-devel, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
Matthew Brost <matthew.brost@intel.com> writes:
> Avoid multiple CPU page faults to the same device page racing by trying
> to lock the page in do_swap_page before taking an extra reference to the
> page. This prevents scenarios where multiple CPU page faults each take
> an extra reference to a device page, which could abort migration in
> folio_migrate_mapping. With the device page being locked in
> do_swap_page, the migrate_vma_* functions need to be updated to avoid
> locking the fault_page argument.
>
> Prior to this change, a livelock scenario could occur in Xe's (Intel GPU
> DRM driver) SVM implementation if enough threads faulted the same device
> page.
>
> Cc: Philip Yang <Philip.Yang@amd.com>
> Cc: Felix Kuehling <felix.kuehling@amd.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Suggessted-by: Simona Vetter <simona.vetter@ffwll.ch>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
> mm/memory.c | 13 ++++++---
> mm/migrate_device.c | 69 ++++++++++++++++++++++++++++++---------------
> 2 files changed, 56 insertions(+), 26 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 2366578015ad..b72bde782611 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4252,10 +4252,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> * Get a page reference while we know the page can't be
> * freed.
> */
> - get_page(vmf->page);
> - pte_unmap_unlock(vmf->pte, vmf->ptl);
> - ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
> - put_page(vmf->page);
> + if (trylock_page(vmf->page)) {
> + get_page(vmf->page);
> + pte_unmap_unlock(vmf->pte, vmf->ptl);
> + ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
> + put_page(vmf->page);
> + unlock_page(vmf->page);
Isn't the order wrong here? In the common case put_page() will have just
dropped the last reference on the page and freed it so the unlock_page()
needs to happen first.
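I.e. the tail of that branch should presumably look like (sketch):

        ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
        unlock_page(vmf->page);
        put_page(vmf->page);    /* may drop the last reference and free the page */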
> + } else {
> + pte_unmap_unlock(vmf->pte, vmf->ptl);
> + }
> } else if (is_hwpoison_entry(entry)) {
> ret = VM_FAULT_HWPOISON;
> } else if (is_pte_marker_entry(entry)) {
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index f163c2131022..2477d39f57be 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -60,6 +60,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> struct mm_walk *walk)
> {
> struct migrate_vma *migrate = walk->private;
> + struct folio *fault_folio = migrate->fault_page ?
> + page_folio(migrate->fault_page) : NULL;
> struct vm_area_struct *vma = walk->vma;
> struct mm_struct *mm = vma->vm_mm;
> unsigned long addr = start, unmapped = 0;
> @@ -88,11 +90,13 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>
> folio_get(folio);
> spin_unlock(ptl);
> - if (unlikely(!folio_trylock(folio)))
> + if (unlikely(fault_folio != folio &&
We don't currently support large ZONE_DEVICE pages so we should never
get here. I think a WARN_ON_ONCE(fault_folio == folio) and bail would be
better.
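Something along these lines (sketch):

        /*
         * Large device-private folios aren't supported, so the locked
         * fault folio should never need splitting here.
         */
        if (WARN_ON_ONCE(fault_folio == folio))
                return migrate_vma_collect_skip(start, end, walk);
        if (unlikely(!folio_trylock(folio)))
                return migrate_vma_collect_skip(start, end, walk);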
> + !folio_trylock(folio)))
> return migrate_vma_collect_skip(start, end,
> walk);
> ret = split_folio(folio);
> - folio_unlock(folio);
> + if (fault_folio != folio)
> + folio_unlock(folio);
> folio_put(folio);
> if (ret)
> return migrate_vma_collect_skip(start, end,
> @@ -192,7 +196,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> * optimisation to avoid walking the rmap later with
> * try_to_migrate().
> */
> - if (folio_trylock(folio)) {
> + if (fault_folio == folio || folio_trylock(folio)) {
> bool anon_exclusive;
> pte_t swp_pte;
>
> @@ -204,7 +208,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>
> if (folio_try_share_anon_rmap_pte(folio, page)) {
> set_pte_at(mm, addr, ptep, pte);
> - folio_unlock(folio);
> + if (fault_folio != folio)
> + folio_unlock(folio);
> folio_put(folio);
> mpfn = 0;
> goto next;
> @@ -363,6 +368,8 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
> unsigned long npages,
> struct page *fault_page)
> {
> + struct folio *fault_folio = fault_page ?
> + page_folio(fault_page) : NULL;
> unsigned long i, restore = 0;
> bool allow_drain = true;
> unsigned long unmapped = 0;
> @@ -427,7 +434,8 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
> remove_migration_ptes(folio, folio, 0);
>
> src_pfns[i] = 0;
> - folio_unlock(folio);
> + if (fault_folio != folio)
> + folio_unlock(folio);
> folio_put(folio);
> restore--;
> }
> @@ -536,6 +544,8 @@ int migrate_vma_setup(struct migrate_vma *args)
> return -EINVAL;
> if (args->fault_page && !is_device_private_page(args->fault_page))
> return -EINVAL;
> + if (args->fault_page && !PageLocked(args->fault_page))
> + return -EINVAL;
>
> memset(args->src, 0, sizeof(*args->src) * nr_pages);
> args->cpages = 0;
> @@ -799,19 +809,13 @@ void migrate_vma_pages(struct migrate_vma *migrate)
> }
> EXPORT_SYMBOL(migrate_vma_pages);
>
> -/*
> - * migrate_device_finalize() - complete page migration
> - * @src_pfns: src_pfns returned from migrate_device_range()
> - * @dst_pfns: array of pfns allocated by the driver to migrate memory to
> - * @npages: number of pages in the range
> - *
> - * Completes migration of the page by removing special migration entries.
> - * Drivers must ensure copying of page data is complete and visible to the CPU
> - * before calling this.
> - */
> -void migrate_device_finalize(unsigned long *src_pfns,
> - unsigned long *dst_pfns, unsigned long npages)
> +static void __migrate_device_finalize(unsigned long *src_pfns,
> + unsigned long *dst_pfns,
> + unsigned long npages,
> + struct page *fault_page)
> {
> + struct folio *fault_folio = fault_page ?
> + page_folio(fault_page) : NULL;
> unsigned long i;
>
> for (i = 0; i < npages; i++) {
> @@ -824,7 +828,8 @@ void migrate_device_finalize(unsigned long *src_pfns,
>
> if (!page) {
> if (dst) {
> - folio_unlock(dst);
> + if (fault_folio != dst)
> + folio_unlock(dst);
How could the destination page be the faulting page? I think we can drop
this check.
> folio_put(dst);
> }
> continue;
> @@ -834,14 +839,16 @@ void migrate_device_finalize(unsigned long *src_pfns,
>
> if (!(src_pfns[i] & MIGRATE_PFN_MIGRATE) || !dst) {
> if (dst) {
> - folio_unlock(dst);
> + if (fault_folio != dst)
> + folio_unlock(dst);
Likewise.
> folio_put(dst);
> }
> dst = src;
> }
>
> remove_migration_ptes(src, dst, 0);
> - folio_unlock(src);
> + if (fault_folio != src)
> + folio_unlock(src);
>
> if (folio_is_zone_device(src))
> folio_put(src);
> @@ -849,7 +856,8 @@ void migrate_device_finalize(unsigned long *src_pfns,
> folio_putback_lru(src);
>
> if (dst != src) {
> - folio_unlock(dst);
> + if (fault_folio != dst)
> + folio_unlock(dst);
This one also seems unnecessary.
- Alistair
> if (folio_is_zone_device(dst))
> folio_put(dst);
> else
> @@ -857,6 +865,22 @@ void migrate_device_finalize(unsigned long *src_pfns,
> }
> }
> }
> +
> +/*
> + * migrate_device_finalize() - complete page migration
> + * @src_pfns: src_pfns returned from migrate_device_range()
> + * @dst_pfns: array of pfns allocated by the driver to migrate memory to
> + * @npages: number of pages in the range
> + *
> + * Completes migration of the page by removing special migration entries.
> + * Drivers must ensure copying of page data is complete and visible to the CPU
> + * before calling this.
> + */
> +void migrate_device_finalize(unsigned long *src_pfns,
> + unsigned long *dst_pfns, unsigned long npages)
> +{
> + return __migrate_device_finalize(src_pfns, dst_pfns, npages, NULL);
> +}
> EXPORT_SYMBOL(migrate_device_finalize);
>
> /**
> @@ -872,7 +896,8 @@ EXPORT_SYMBOL(migrate_device_finalize);
> */
> void migrate_vma_finalize(struct migrate_vma *migrate)
> {
> - migrate_device_finalize(migrate->src, migrate->dst, migrate->npages);
> + __migrate_device_finalize(migrate->src, migrate->dst, migrate->npages,
> + migrate->fault_page);
> }
> EXPORT_SYMBOL(migrate_vma_finalize);
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [PATCH v2 03/29] mm/migrate: Trylock device page in do_swap_page
2024-11-28 23:31 ` Alistair Popple
@ 2024-12-13 22:16 ` Matthew Brost
2024-12-14 5:59 ` Matthew Brost
0 siblings, 1 reply; 129+ messages in thread
From: Matthew Brost @ 2024-12-13 22:16 UTC (permalink / raw)
To: Alistair Popple
Cc: intel-xe, dri-devel, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
On Fri, Nov 29, 2024 at 10:31:32AM +1100, Alistair Popple wrote:
>
> Matthew Brost <matthew.brost@intel.com> writes:
>
> > Avoid multiple CPU page faults to the same device page racing by trying
> > to lock the page in do_swap_page before taking an extra reference to the
> > page. This prevents scenarios where multiple CPU page faults each take
> > an extra reference to a device page, which could abort migration in
> > folio_migrate_mapping. With the device page being locked in
> > do_swap_page, the migrate_vma_* functions need to be updated to avoid
> > locking the fault_page argument.
> >
> > Prior to this change, a livelock scenario could occur in Xe's (Intel GPU
> > DRM driver) SVM implementation if enough threads faulted the same device
> > page.
> >
> > Cc: Philip Yang <Philip.Yang@amd.com>
> > Cc: Felix Kuehling <felix.kuehling@amd.com>
> > Cc: Christian König <christian.koenig@amd.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Suggessted-by: Simona Vetter <simona.vetter@ffwll.ch>
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> > mm/memory.c | 13 ++++++---
> > mm/migrate_device.c | 69 ++++++++++++++++++++++++++++++---------------
> > 2 files changed, 56 insertions(+), 26 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 2366578015ad..b72bde782611 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4252,10 +4252,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > * Get a page reference while we know the page can't be
> > * freed.
> > */
> > - get_page(vmf->page);
> > - pte_unmap_unlock(vmf->pte, vmf->ptl);
> > - ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
> > - put_page(vmf->page);
> > + if (trylock_page(vmf->page)) {
> > + get_page(vmf->page);
> > + pte_unmap_unlock(vmf->pte, vmf->ptl);
> > + ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
> > + put_page(vmf->page);
> > + unlock_page(vmf->page);
>
> Isn't the order wrong here? In the common case put_page() will have just
> dropped the last reference on the page and freed it so the unlock_page()
> needs to happen first.
>
Yes, this appears wrong. I haven't seen this show up as a problem but
certainly a put should be done after dropping the lock.
> > + } else {
> > + pte_unmap_unlock(vmf->pte, vmf->ptl);
> > + }
> > } else if (is_hwpoison_entry(entry)) {
> > ret = VM_FAULT_HWPOISON;
> > } else if (is_pte_marker_entry(entry)) {
> > diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> > index f163c2131022..2477d39f57be 100644
> > --- a/mm/migrate_device.c
> > +++ b/mm/migrate_device.c
> > @@ -60,6 +60,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> > struct mm_walk *walk)
> > {
> > struct migrate_vma *migrate = walk->private;
> > + struct folio *fault_folio = migrate->fault_page ?
> > + page_folio(migrate->fault_page) : NULL;
> > struct vm_area_struct *vma = walk->vma;
> > struct mm_struct *mm = vma->vm_mm;
> > unsigned long addr = start, unmapped = 0;
> > @@ -88,11 +90,13 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >
> > folio_get(folio);
> > spin_unlock(ptl);
> > - if (unlikely(!folio_trylock(folio)))
> > + if (unlikely(fault_folio != folio &&
>
> We don't currently support large ZONE_DEVICE pages so we should never
> get here. I think a WARN_ON_ONCE(fault_folio == folio) and bail would be
> better.
>
Sure will fix.
> > + !folio_trylock(folio)))
> > return migrate_vma_collect_skip(start, end,
> > walk);
> > ret = split_folio(folio);
> > - folio_unlock(folio);
> > + if (fault_folio != folio)
> > + folio_unlock(folio);
> > folio_put(folio);
> > if (ret)
> > return migrate_vma_collect_skip(start, end,
> > @@ -192,7 +196,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> > * optimisation to avoid walking the rmap later with
> > * try_to_migrate().
> > */
> > - if (folio_trylock(folio)) {
> > + if (fault_folio == folio || folio_trylock(folio)) {
> > bool anon_exclusive;
> > pte_t swp_pte;
> >
> > @@ -204,7 +208,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >
> > if (folio_try_share_anon_rmap_pte(folio, page)) {
> > set_pte_at(mm, addr, ptep, pte);
> > - folio_unlock(folio);
> > + if (fault_folio != folio)
> > + folio_unlock(folio);
> > folio_put(folio);
> > mpfn = 0;
> > goto next;
> > @@ -363,6 +368,8 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
> > unsigned long npages,
> > struct page *fault_page)
> > {
> > + struct folio *fault_folio = fault_page ?
> > + page_folio(fault_page) : NULL;
> > unsigned long i, restore = 0;
> > bool allow_drain = true;
> > unsigned long unmapped = 0;
> > @@ -427,7 +434,8 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
> > remove_migration_ptes(folio, folio, 0);
> >
> > src_pfns[i] = 0;
> > - folio_unlock(folio);
> > + if (fault_folio != folio)
> > + folio_unlock(folio);
> > folio_put(folio);
> > restore--;
> > }
> > @@ -536,6 +544,8 @@ int migrate_vma_setup(struct migrate_vma *args)
> > return -EINVAL;
> > if (args->fault_page && !is_device_private_page(args->fault_page))
> > return -EINVAL;
> > + if (args->fault_page && !PageLocked(args->fault_page))
> > + return -EINVAL;
> >
> > memset(args->src, 0, sizeof(*args->src) * nr_pages);
> > args->cpages = 0;
> > @@ -799,19 +809,13 @@ void migrate_vma_pages(struct migrate_vma *migrate)
> > }
> > EXPORT_SYMBOL(migrate_vma_pages);
> >
> > -/*
> > - * migrate_device_finalize() - complete page migration
> > - * @src_pfns: src_pfns returned from migrate_device_range()
> > - * @dst_pfns: array of pfns allocated by the driver to migrate memory to
> > - * @npages: number of pages in the range
> > - *
> > - * Completes migration of the page by removing special migration entries.
> > - * Drivers must ensure copying of page data is complete and visible to the CPU
> > - * before calling this.
> > - */
> > -void migrate_device_finalize(unsigned long *src_pfns,
> > - unsigned long *dst_pfns, unsigned long npages)
> > +static void __migrate_device_finalize(unsigned long *src_pfns,
> > + unsigned long *dst_pfns,
> > + unsigned long npages,
> > + struct page *fault_page)
> > {
> > + struct folio *fault_folio = fault_page ?
> > + page_folio(fault_page) : NULL;
> > unsigned long i;
> >
> > for (i = 0; i < npages; i++) {
> > @@ -824,7 +828,8 @@ void migrate_device_finalize(unsigned long *src_pfns,
> >
> > if (!page) {
> > if (dst) {
> > - folio_unlock(dst);
> > + if (fault_folio != dst)
> > + folio_unlock(dst);
>
> How could the destination page be the faulting page? I think we can drop
> this check.
>
Yea, will drop.
> > folio_put(dst);
> > }
> > continue;
> > @@ -834,14 +839,16 @@ void migrate_device_finalize(unsigned long *src_pfns,
> >
> > if (!(src_pfns[i] & MIGRATE_PFN_MIGRATE) || !dst) {
> > if (dst) {
> > - folio_unlock(dst);
> > + if (fault_folio != dst)
> > + folio_unlock(dst);
>
> Likewise.
>
Same here.
> > folio_put(dst);
> > }
> > dst = src;
> > }
> >
> > remove_migration_ptes(src, dst, 0);
> > - folio_unlock(src);
> > + if (fault_folio != src)
> > + folio_unlock(src);
> >
> > if (folio_is_zone_device(src))
> > folio_put(src);
> > @@ -849,7 +856,8 @@ void migrate_device_finalize(unsigned long *src_pfns,
> > folio_putback_lru(src);
> >
> > if (dst != src) {
> > - folio_unlock(dst);
> > + if (fault_folio != dst)
> > + folio_unlock(dst);
>
> This one also seems unnecessary.
>
Same here.
Thanks - Matt
> - Alistair
>
> > if (folio_is_zone_device(dst))
> > folio_put(dst);
> > else
> > @@ -857,6 +865,22 @@ void migrate_device_finalize(unsigned long *src_pfns,
> > }
> > }
> > }
> > +
> > +/*
> > + * migrate_device_finalize() - complete page migration
> > + * @src_pfns: src_pfns returned from migrate_device_range()
> > + * @dst_pfns: array of pfns allocated by the driver to migrate memory to
> > + * @npages: number of pages in the range
> > + *
> > + * Completes migration of the page by removing special migration entries.
> > + * Drivers must ensure copying of page data is complete and visible to the CPU
> > + * before calling this.
> > + */
> > +void migrate_device_finalize(unsigned long *src_pfns,
> > + unsigned long *dst_pfns, unsigned long npages)
> > +{
> > + return __migrate_device_finalize(src_pfns, dst_pfns, npages, NULL);
> > +}
> > EXPORT_SYMBOL(migrate_device_finalize);
> >
> > /**
> > @@ -872,7 +896,8 @@ EXPORT_SYMBOL(migrate_device_finalize);
> > */
> > void migrate_vma_finalize(struct migrate_vma *migrate)
> > {
> > - migrate_device_finalize(migrate->src, migrate->dst, migrate->npages);
> > + __migrate_device_finalize(migrate->src, migrate->dst, migrate->npages,
> > + migrate->fault_page);
> > }
> > EXPORT_SYMBOL(migrate_vma_finalize);
>
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [PATCH v2 03/29] mm/migrate: Trylock device page in do_swap_page
2024-12-13 22:16 ` Matthew Brost
@ 2024-12-14 5:59 ` Matthew Brost
0 siblings, 0 replies; 129+ messages in thread
From: Matthew Brost @ 2024-12-14 5:59 UTC (permalink / raw)
To: Alistair Popple
Cc: intel-xe, dri-devel, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
On Fri, Dec 13, 2024 at 02:16:51PM -0800, Matthew Brost wrote:
> On Fri, Nov 29, 2024 at 10:31:32AM +1100, Alistair Popple wrote:
> >
> > Matthew Brost <matthew.brost@intel.com> writes:
> >
> > > Avoid multiple CPU page faults to the same device page racing by trying
> > > to lock the page in do_swap_page before taking an extra reference to the
> > > page. This prevents scenarios where multiple CPU page faults each take
> > > an extra reference to a device page, which could abort migration in
> > > folio_migrate_mapping. With the device page being locked in
> > > do_swap_page, the migrate_vma_* functions need to be updated to avoid
> > > locking the fault_page argument.
> > >
> > > Prior to this change, a livelock scenario could occur in Xe's (Intel GPU
> > > DRM driver) SVM implementation if enough threads faulted the same device
> > > page.
> > >
> > > Cc: Philip Yang <Philip.Yang@amd.com>
> > > Cc: Felix Kuehling <felix.kuehling@amd.com>
> > > Cc: Christian König <christian.koenig@amd.com>
> > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > Suggessted-by: Simona Vetter <simona.vetter@ffwll.ch>
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > ---
> > > mm/memory.c | 13 ++++++---
> > > mm/migrate_device.c | 69 ++++++++++++++++++++++++++++++---------------
> > > 2 files changed, 56 insertions(+), 26 deletions(-)
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 2366578015ad..b72bde782611 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -4252,10 +4252,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > * Get a page reference while we know the page can't be
> > > * freed.
> > > */
> > > - get_page(vmf->page);
> > > - pte_unmap_unlock(vmf->pte, vmf->ptl);
> > > - ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
> > > - put_page(vmf->page);
> > > + if (trylock_page(vmf->page)) {
> > > + get_page(vmf->page);
> > > + pte_unmap_unlock(vmf->pte, vmf->ptl);
> > > + ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
> > > + put_page(vmf->page);
> > > + unlock_page(vmf->page);
> >
> > Isn't the order wrong here? In the common case put_page() will have just
> > dropped the last reference on the page and freed it so the unlock_page()
> > needs to happen first.
> >
>
> Yes, this appears wrong. I haven't seen this show up as a problem but
> certainly a put should be done after dropping the lock.
>
> > > + } else {
> > > + pte_unmap_unlock(vmf->pte, vmf->ptl);
> > > + }
> > > } else if (is_hwpoison_entry(entry)) {
> > > ret = VM_FAULT_HWPOISON;
> > > } else if (is_pte_marker_entry(entry)) {
> > > diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> > > index f163c2131022..2477d39f57be 100644
> > > --- a/mm/migrate_device.c
> > > +++ b/mm/migrate_device.c
> > > @@ -60,6 +60,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> > > struct mm_walk *walk)
> > > {
> > > struct migrate_vma *migrate = walk->private;
> > > + struct folio *fault_folio = migrate->fault_page ?
> > > + page_folio(migrate->fault_page) : NULL;
> > > struct vm_area_struct *vma = walk->vma;
> > > struct mm_struct *mm = vma->vm_mm;
> > > unsigned long addr = start, unmapped = 0;
> > > @@ -88,11 +90,13 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> > >
> > > folio_get(folio);
> > > spin_unlock(ptl);
> > > - if (unlikely(!folio_trylock(folio)))
> > > + if (unlikely(fault_folio != folio &&
> >
> > We don't currently support large ZONE_DEVICE pages so we should never
> > get here. I think a WARN_ON_ONCE(fault_folio == folio) and bail would be
> > better.
> >
>
> Sure will fix.
>
> > > + !folio_trylock(folio)))
> > > return migrate_vma_collect_skip(start, end,
> > > walk);
> > > ret = split_folio(folio);
> > > - folio_unlock(folio);
> > > + if (fault_folio != folio)
> > > + folio_unlock(folio);
> > > folio_put(folio);
> > > if (ret)
> > > return migrate_vma_collect_skip(start, end,
> > > @@ -192,7 +196,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> > > * optimisation to avoid walking the rmap later with
> > > * try_to_migrate().
> > > */
> > > - if (folio_trylock(folio)) {
> > > + if (fault_folio == folio || folio_trylock(folio)) {
> > > bool anon_exclusive;
> > > pte_t swp_pte;
> > >
> > > @@ -204,7 +208,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> > >
> > > if (folio_try_share_anon_rmap_pte(folio, page)) {
> > > set_pte_at(mm, addr, ptep, pte);
> > > - folio_unlock(folio);
> > > + if (fault_folio != folio)
> > > + folio_unlock(folio);
> > > folio_put(folio);
> > > mpfn = 0;
> > > goto next;
> > > @@ -363,6 +368,8 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
> > > unsigned long npages,
> > > struct page *fault_page)
> > > {
> > > + struct folio *fault_folio = fault_page ?
> > > + page_folio(fault_page) : NULL;
> > > unsigned long i, restore = 0;
> > > bool allow_drain = true;
> > > unsigned long unmapped = 0;
> > > @@ -427,7 +434,8 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
> > > remove_migration_ptes(folio, folio, 0);
> > >
> > > src_pfns[i] = 0;
> > > - folio_unlock(folio);
> > > + if (fault_folio != folio)
> > > + folio_unlock(folio);
> > > folio_put(folio);
> > > restore--;
> > > }
> > > @@ -536,6 +544,8 @@ int migrate_vma_setup(struct migrate_vma *args)
> > > return -EINVAL;
> > > if (args->fault_page && !is_device_private_page(args->fault_page))
> > > return -EINVAL;
> > > + if (args->fault_page && !PageLocked(args->fault_page))
> > > + return -EINVAL;
> > >
> > > memset(args->src, 0, sizeof(*args->src) * nr_pages);
> > > args->cpages = 0;
> > > @@ -799,19 +809,13 @@ void migrate_vma_pages(struct migrate_vma *migrate)
> > > }
> > > EXPORT_SYMBOL(migrate_vma_pages);
> > >
> > > -/*
> > > - * migrate_device_finalize() - complete page migration
> > > - * @src_pfns: src_pfns returned from migrate_device_range()
> > > - * @dst_pfns: array of pfns allocated by the driver to migrate memory to
> > > - * @npages: number of pages in the range
> > > - *
> > > - * Completes migration of the page by removing special migration entries.
> > > - * Drivers must ensure copying of page data is complete and visible to the CPU
> > > - * before calling this.
> > > - */
> > > -void migrate_device_finalize(unsigned long *src_pfns,
> > > - unsigned long *dst_pfns, unsigned long npages)
> > > +static void __migrate_device_finalize(unsigned long *src_pfns,
> > > + unsigned long *dst_pfns,
> > > + unsigned long npages,
> > > + struct page *fault_page)
> > > {
> > > + struct folio *fault_folio = fault_page ?
> > > + page_folio(fault_page) : NULL;
> > > unsigned long i;
> > >
> > > for (i = 0; i < npages; i++) {
> > > @@ -824,7 +828,8 @@ void migrate_device_finalize(unsigned long *src_pfns,
> > >
> > > if (!page) {
> > > if (dst) {
> > > - folio_unlock(dst);
> > > + if (fault_folio != dst)
> > > + folio_unlock(dst);
> >
> > How could the destination page be the faulting page? I think we can drop
> > this check.
> >
>
> Yea, will drop.
>
Actually I'm landing on a WARN_ON_ONCE(fault_folio == dst) here to catch
potential bugs. Will include that in my next rev. If you disagree, let me
know and we can discuss ahead of the next rev.
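
For reference, the shape I have in mind is roughly (sketch):

        if (dst) {
                /* A freshly allocated destination can never be the fault page. */
                WARN_ON_ONCE(fault_folio == dst);
                folio_unlock(dst);
                folio_put(dst);
        }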
Matt
> > > folio_put(dst);
> > > }
> > > continue;
> > > @@ -834,14 +839,16 @@ void migrate_device_finalize(unsigned long *src_pfns,
> > >
> > > if (!(src_pfns[i] & MIGRATE_PFN_MIGRATE) || !dst) {
> > > if (dst) {
> > > - folio_unlock(dst);
> > > + if (fault_folio != dst)
> > > + folio_unlock(dst);
> >
> > Likewise.
> >
>
> Same here.
>
> > > folio_put(dst);
> > > }
> > > dst = src;
> > > }
> > >
> > > remove_migration_ptes(src, dst, 0);
> > > - folio_unlock(src);
> > > + if (fault_folio != src)
> > > + folio_unlock(src);
> > >
> > > if (folio_is_zone_device(src))
> > > folio_put(src);
> > > @@ -849,7 +856,8 @@ void migrate_device_finalize(unsigned long *src_pfns,
> > > folio_putback_lru(src);
> > >
> > > if (dst != src) {
> > > - folio_unlock(dst);
> > > + if (fault_folio != dst)
> > > + folio_unlock(dst);
> >
> > This one also seems unnecessary.
> >
>
> Same here.
>
> Thanks - Matt
>
> > - Alistair
> >
> > > if (folio_is_zone_device(dst))
> > > folio_put(dst);
> > > else
> > > @@ -857,6 +865,22 @@ void migrate_device_finalize(unsigned long *src_pfns,
> > > }
> > > }
> > > }
> > > +
> > > +/*
> > > + * migrate_device_finalize() - complete page migration
> > > + * @src_pfns: src_pfns returned from migrate_device_range()
> > > + * @dst_pfns: array of pfns allocated by the driver to migrate memory to
> > > + * @npages: number of pages in the range
> > > + *
> > > + * Completes migration of the page by removing special migration entries.
> > > + * Drivers must ensure copying of page data is complete and visible to the CPU
> > > + * before calling this.
> > > + */
> > > +void migrate_device_finalize(unsigned long *src_pfns,
> > > + unsigned long *dst_pfns, unsigned long npages)
> > > +{
> > > + return __migrate_device_finalize(src_pfns, dst_pfns, npages, NULL);
> > > +}
> > > EXPORT_SYMBOL(migrate_device_finalize);
> > >
> > > /**
> > > @@ -872,7 +896,8 @@ EXPORT_SYMBOL(migrate_device_finalize);
> > > */
> > > void migrate_vma_finalize(struct migrate_vma *migrate)
> > > {
> > > - migrate_device_finalize(migrate->src, migrate->dst, migrate->npages);
> > > + __migrate_device_finalize(migrate->src, migrate->dst, migrate->npages,
> > > + migrate->fault_page);
> > > }
> > > EXPORT_SYMBOL(migrate_vma_finalize);
> >
^ permalink raw reply [flat|nested] 129+ messages in thread
* [PATCH v2 04/29] drm/pagemap: Add DRM pagemap
2024-10-16 3:24 [PATCH v2 00/29] Introduce GPU SVM and Xe SVM implementation Matthew Brost
` (2 preceding siblings ...)
2024-10-16 3:24 ` [PATCH v2 03/29] mm/migrate: Trylock device page in do_swap_page Matthew Brost
@ 2024-10-16 3:24 ` Matthew Brost
2024-10-16 3:24 ` [PATCH v2 05/29] drm/gpusvm: Add support for GPU Shared Virtual Memory Matthew Brost
` (27 subsequent siblings)
31 siblings, 0 replies; 129+ messages in thread
From: Matthew Brost @ 2024-10-16 3:24 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
From: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Introduce drm_pagemap ops to DMA map and unmap VRAM resources. In the
local memory case it's merely a matter of providing an offset into the
device's physical address space. For future p2p the map and unmap functions
may encode the addresses as needed.
Similar to how dma-buf works, let the memory provider (drm_pagemap) provide
the mapping functionality.
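For illustration, a local-memory implementation of the map op could look
roughly like the sketch below (xe_devmem_map_dma and devmem_start_pfn are
made-up names for this sketch, not part of this patch):
	static struct drm_pagemap_dma_addr
	xe_devmem_map_dma(struct drm_pagemap *dpagemap, struct device *dev,
			  struct page *page, unsigned int order,
			  enum dma_data_direction dir)
	{
		/* Local VRAM: no IOMMU mapping, just an offset into the
		 * device's physical address range.
		 */
		u64 offset = (u64)(page_to_pfn(page) - devmem_start_pfn)
			     << PAGE_SHIFT;
		return drm_pagemap_dma_addr_encode(offset,
						   DRM_INTERCONNECT_DRIVER,
						   order, dir);
	}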
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
drivers/gpu/drm/xe/drm_pagemap.h | 103 +++++++++++++++++++++++++++++++
1 file changed, 103 insertions(+)
create mode 100644 drivers/gpu/drm/xe/drm_pagemap.h
diff --git a/drivers/gpu/drm/xe/drm_pagemap.h b/drivers/gpu/drm/xe/drm_pagemap.h
new file mode 100644
index 000000000000..b6b387b81479
--- /dev/null
+++ b/drivers/gpu/drm/xe/drm_pagemap.h
@@ -0,0 +1,103 @@
+/* SPDX-License-Identifier: MIT */
+#ifndef _DRM_PAGEMAP_H_
+#define _DRM_PAGEMAP_H_
+
+#include <linux/dma-direction.h>
+#include <linux/hmm.h>
+#include <linux/types.h>
+
+struct drm_pagemap;
+struct device;
+
+/**
+ * enum drm_interconnect_protocol - Used to identify an interconnect protocol.
+ */
+enum drm_interconnect_protocol {
+ DRM_INTERCONNECT_SYSTEM, /* DMA map is system pages. */
+ DRM_INTERCONNECT_PCIE_P2P, /* DMA map is PCIE P2P */
+ DRM_INTERCONNECT_DRIVER, /* DMA map is driver defined */
+ /* A driver can add private values beyond DRM_INTERCONNECT_DRIVER */
+};
+
+/**
+ * struct drm_pagemap_dma_addr - DMA address representation.
+ * @addr: The dma address or driver-defined address for driver private interconnects.
+ * @proto: The interconnect protocol.
+ * @order: The page order of the dma mapping. (Size is PAGE_SIZE << order).
+ * @dir: The DMA direction.
+ *
+ * Note: There is room for improvement here. We should be able to pack into
+ * 64 bits.
+ */
+struct drm_pagemap_dma_addr {
+ dma_addr_t addr;
+ u64 proto : 54;
+ u64 order : 8;
+ u64 dir : 2;
+};
+
+/**
+ * drm_pagemap_dma_addr_encode() - Encode a dma address with metadata
+ * @addr: The dma address or driver-defined address for driver private interconnects.
+ * @proto: The interconnect protocol.
+ * @order: The page order of the dma mapping. (Size is PAGE_SIZE << order).
+ * @dir: The DMA direction.
+ *
+ * Return: A struct drm_pagemap_dma_addr encoding the above information.
+ */
+static inline struct drm_pagemap_dma_addr
+drm_pagemap_dma_addr_encode(dma_addr_t addr,
+ enum drm_interconnect_protocol proto,
+ unsigned int order,
+ enum dma_data_direction dir)
+{
+ return (struct drm_pagemap_dma_addr) {
+ .addr = addr,
+ .proto = proto,
+ .order = order,
+ .dir = dir,
+ };
+}
+
+/**
+ * struct drm_pagemap_ops: Ops for a drm-pagemap.
+ */
+struct drm_pagemap_ops {
+ /**
+ * @map_dma: Map for dma access or provide a virtual address suitable for
+ * @dev.
+ * @dpagemap: The struct drm_pagemap for the page.
+ * @dev: The dma mapper.
+ * @page: The page to map.
+	 * @order: The page order of the dma mapping. (Size is PAGE_SIZE << order).
+	 * @dir: The transfer direction.
+ */
+ struct drm_pagemap_dma_addr (*map_dma)(struct drm_pagemap *dpagemap,
+ struct device *dev,
+ struct page *page,
+ unsigned int order,
+ enum dma_data_direction dir);
+
+ /**
+ * @unmap_dma: Unmap a dma address previously obtained using @map_dma.
+ * @dev: The dma unmapper.
+ * @addr: The dma address obtained when mapping.
+ */
+ void (*unmap_dma)(struct drm_pagemap *dpagemap,
+ struct device *dev,
+ struct drm_pagemap_dma_addr addr);
+
+};
+
+/**
+ * struct drm_pagemap: Additional information for a struct dev_pagemap
+ * used for device p2p handshaking.
+ * @ops: The struct drm_pagemap_ops.
+ * @dev: The struct device owning the device-private memory.
+ */
+struct drm_pagemap {
+ const struct drm_pagemap_ops *ops;
+ struct device *dev;
+};
+
+#endif
--
2.34.1
^ permalink raw reply related [flat|nested] 129+ messages in thread* [PATCH v2 05/29] drm/gpusvm: Add support for GPU Shared Virtual Memory
2024-10-16 3:24 [PATCH v2 00/29] Introduce GPU SVM and Xe SVM implementation Matthew Brost
` (3 preceding siblings ...)
2024-10-16 3:24 ` [PATCH v2 04/29] drm/pagemap: Add DRM pagemap Matthew Brost
@ 2024-10-16 3:24 ` Matthew Brost
2024-10-31 18:58 ` Thomas Hellström
` (4 more replies)
2024-10-16 3:24 ` [PATCH v2 06/29] drm/xe/uapi: Add DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATON flag Matthew Brost
` (26 subsequent siblings)
31 siblings, 5 replies; 129+ messages in thread
From: Matthew Brost @ 2024-10-16 3:24 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
This patch introduces support for GPU Shared Virtual Memory (SVM) in the
Direct Rendering Manager (DRM) subsystem. SVM allows for seamless
sharing of memory between the CPU and GPU, enhancing performance and
flexibility in GPU computing tasks.
The patch adds the necessary infrastructure for SVM, including data
structures and functions for managing SVM ranges and notifiers. It also
provides mechanisms for allocating, deallocating, and migrating memory
regions between system RAM and GPU VRAM.
This is largely inspired by GPUVM.
v2:
- Take order into account in check pages
- Clear range->pages in get pages error
- Drop setting dirty or accessed bit in get pages (Vetter)
- Remove mmap assert for cpu faults
- Drop mmap write lock abuse (Vetter, Christian)
- Decouple zdd from range (Vetter, Oak)
- Add drm_gpusvm_range_evict, make it work with coherent pages
- Export drm_gpusvm_evict_to_sram, only use in BO evict path (Vetter)
- mmget/put in drm_gpusvm_evict_to_sram
- Drop range->vram_allocation variable
- Don't return in drm_gpusvm_evict_to_sram until all pages detached
- Don't warn on mixing sram and device pages
- Update kernel doc
- Add coherent page support to get pages
- Use DMA_FROM_DEVICE rather than DMA_BIDIRECTIONAL
- Add struct drm_gpusvm_vram and ops (Thomas)
- Update the range's seqno if the range is valid (Thomas)
- Remove the is_unmapped check before hmm_range_fault (Thomas)
- Use drm_pagemap (Thomas)
- Drop kfree_mapping (Thomas)
- dma map pages under notifier lock (Thomas)
- Remove ctx.prefault
- Remove ctx.mmap_locked
- Add ctx.check_pages
- s/vram/devmem (Thomas)
Cc: Simona Vetter <simona.vetter@ffwll.ch>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: <dri-devel@lists.freedesktop.org>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
drivers/gpu/drm/xe/Makefile | 3 +-
drivers/gpu/drm/xe/drm_gpusvm.c | 2074 +++++++++++++++++++++++++++++++
drivers/gpu/drm/xe/drm_gpusvm.h | 447 +++++++
3 files changed, 2523 insertions(+), 1 deletion(-)
create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.c
create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.h
diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
index da80c29aa363..8d991d4a92a5 100644
--- a/drivers/gpu/drm/xe/Makefile
+++ b/drivers/gpu/drm/xe/Makefile
@@ -25,7 +25,8 @@ $(obj)/generated/%_wa_oob.c $(obj)/generated/%_wa_oob.h: $(obj)/xe_gen_wa_oob \
# core driver code
-xe-y += xe_bb.o \
+xe-y += drm_gpusvm.o \
+ xe_bb.o \
xe_bo.o \
xe_bo_evict.o \
xe_devcoredump.o \
diff --git a/drivers/gpu/drm/xe/drm_gpusvm.c b/drivers/gpu/drm/xe/drm_gpusvm.c
new file mode 100644
index 000000000000..1ff104d2b42c
--- /dev/null
+++ b/drivers/gpu/drm/xe/drm_gpusvm.c
@@ -0,0 +1,2074 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2024 Intel Corporation
+ *
+ * Authors:
+ * Matthew Brost <matthew.brost@intel.com>
+ */
+
+#include <linux/dma-mapping.h>
+#include <linux/interval_tree_generic.h>
+#include <linux/hmm.h>
+#include <linux/memremap.h>
+#include <linux/migrate.h>
+#include <linux/mm_types.h>
+#include <linux/pagemap.h>
+#include <linux/slab.h>
+
+#include <drm/drm_device.h>
+#include "drm/drm_print.h"
+#include "drm_gpusvm.h"
+#include "drm_pagemap.h"
+
+/**
+ * DOC: Overview
+ *
+ * GPU Shared Virtual Memory (GPU SVM) layer for the Direct Rendering Manager (DRM)
+ *
+ * The GPU SVM layer is a component of the DRM framework designed to manage shared
+ * virtual memory between the CPU and GPU. It enables efficient data exchange and
+ * processing for GPU-accelerated applications by allowing memory sharing and
+ * synchronization between the CPU's and GPU's virtual address spaces.
+ *
+ * Key GPU SVM Components:
+ * - Notifiers: Notifiers: Used for tracking memory intervals and notifying the
+ * GPU of changes, notifiers are sized based on a GPU SVM
+ * initialization parameter, with a recommendation of 512M or
+ * larger. They maintain a Red-BlacK tree and a list of ranges that
+ * fall within the notifier interval. Notifiers are tracked within
+ * a GPU SVM Red-BlacK tree and list and are dynamically inserted
+ * or removed as ranges within the interval are created or
+ * destroyed.
+ * - Ranges: Represent memory ranges mapped in a DRM device and managed
+ * by GPU SVM. They are sized based on an array of chunk sizes, which
+ * is a GPU SVM initialization parameter, and the CPU address space.
+ * Upon GPU fault, the largest aligned chunk that fits within the
+ * faulting CPU address space is chosen for the range size. Ranges are
+ * expected to be dynamically allocated on GPU fault and removed on an
+ * MMU notifier UNMAP event. As mentioned above, ranges are tracked in
+ * a notifier's Red-Black tree.
+ * - Operations: Define the interface for driver-specific GPU SVM operations
+ * such as range allocation, notifier allocation, and
+ * invalidations.
+ * - Device Memory Allocations: Embedded structure containing enough information
+ * for GPU SVM to migrate to / from device memory.
+ * - Device Memory Operations: Define the interface for driver-specific device
+ * memory operations, such as releasing memory,
+ * populating pfns, and copying to / from device memory.
+ *
+ * This layer provides interfaces for allocating, mapping, migrating, and
+ * releasing memory ranges between the CPU and GPU. It handles all core memory
+ * management interactions (DMA mapping, HMM, and migration) and provides
+ * driver-specific virtual functions (vfuncs). This infrastructure is sufficient
+ * to build the expected driver components for an SVM implementation as detailed
+ * below.
+ *
+ * Expected Driver Components:
+ * - GPU page fault handler: Used to create ranges and notifiers based on the
+ * fault address, optionally migrate the range to
+ * device memory, and create GPU bindings.
+ * - Garbage collector: Used to destroy GPU bindings for ranges. Ranges are
+ * expected to be added to the garbage collector upon
+ * MMU_NOTIFY_UNMAP event.
+ */
+
+/**
+ * DOC: Locking
+ *
+ * GPU SVM handles locking for core MM interactions, i.e., it locks/unlocks the
+ * mmap lock as needed.
+ *
+ * GPU SVM introduces a global notifier lock, which safeguards the notifier's
+ * range RB tree and list, as well as the range's DMA mappings and sequence
+ * number. GPU SVM manages all necessary locking and unlocking operations,
+ * except for the recheck of the range's sequence number
+ * (mmu_interval_read_retry) when the driver is committing GPU bindings. This
+ * lock corresponds to the 'driver->update' lock mentioned in the HMM
+ * documentation (TODO: Link). Future revisions may transition from a GPU SVM
+ * global lock to a per-notifier lock if finer-grained locking is deemed
+ * necessary.
+ *
+ * In addition to the locking mentioned above, the driver should implement a
+ * lock to safeguard core GPU SVM function calls that modify state, such as
+ * drm_gpusvm_range_find_or_insert and drm_gpusvm_range_remove. Alternatively,
+ * these core functions can be called within a single kernel thread, for
+ * instance, using an ordered work queue. This lock is denoted as
+ * 'driver_svm_lock' in code examples. Finer grained driver side locking should
+ * also be possible for concurrent GPU fault processing within a single GPU SVM.
+ */
+
+/**
+ * DOC: Migration
+ *
+ * The migration support is quite simple, allowing migration between RAM and
+ * device memory at the range granularity. For example, GPU SVM currently does not
+ * support mixing RAM and device memory pages within a range. This means that upon GPU
+ * fault, the entire range can be migrated to device memory, and upon CPU fault, the
+ * entire range is migrated to RAM. Mixed RAM and device memory storage within a range
+ * could be added in the future if required.
+ *
+ * The reasoning for only supporting range granularity is as follows: it
+ * simplifies the implementation, and range sizes are driver-defined and should
+ * be relatively small.
+ */
+
+/**
+ * DOC: Partial Unmapping of Ranges
+ *
+ * Partial unmapping of ranges (e.g., 1M out of 2M is unmapped by CPU resulting
+ * in MMU_NOTIFY_UNMAP event) presents several challenges, with the main one
+ * being that a subset of the range still has CPU and GPU mappings. If the
+ * backing store for the range is in device memory, a subset of the backing store has
+ * references. One option would be to split the range and device memory backing store,
+ * but the implementation for this would be quite complicated. Given that
+ * partial unmappings are rare and driver-defined range sizes are relatively
+ * small, GPU SVM does not support splitting of ranges.
+ *
+ * With no support for range splitting, upon partial unmapping of a range, the
+ * driver is expected to invalidate and destroy the entire range. If the range
+ * has device memory as its backing, the driver is also expected to migrate any
+ * remaining pages back to RAM.
+ */
+
+/**
+ * DOC: Examples
+ *
+ * This section provides two examples of how to build the expected driver
+ * components: the GPU page fault handler and the garbage collector. A third
+ * example demonstrates a sample invalidation driver vfunc.
+ *
+ * The generic code provided does not include logic for complex migration
+ * policies, optimized invalidations, fine-grained driver locking, or other
+ * potentially required driver locking (e.g., DMA-resv locks).
+ *
+ * 1) GPU page fault handler
+ *
+ * int driver_bind_range(struct drm_gpusvm *gpusvm, struct drm_gpusvm_range *range)
+ * {
+ * int err = 0;
+ *
+ * driver_alloc_and_setup_memory_for_bind(gpusvm, range);
+ *
+ * drm_gpusvm_notifier_lock(gpusvm);
+ * if (drm_gpusvm_range_pages_valid(range))
+ * driver_commit_bind(gpusvm, range);
+ * else
+ * err = -EAGAIN;
+ * drm_gpusvm_notifier_unlock(gpusvm);
+ *
+ * return err;
+ * }
+ *
+ * int driver_gpu_fault(struct drm_gpusvm *gpusvm, u64 fault_addr,
+ * u64 gpuva_start, u64 gpuva_end)
+ * {
+ * struct drm_gpusvm_ctx ctx = {};
+ * int err;
+ *
+ * driver_svm_lock();
+ * retry:
+ * // Always process UNMAPs first so view of GPU SVM ranges is current
+ * driver_garbage_collector(gpusvm);
+ *
+ * range = drm_gpusvm_range_find_or_insert(gpusvm, fault_addr,
+ * gpuva_start, gpuva_end,
+ * &ctx);
+ * if (IS_ERR(range)) {
+ * err = PTR_ERR(range);
+ * goto unlock;
+ * }
+ *
+ * if (driver_migration_policy(range)) {
+ * devmem = driver_alloc_devmem();
+ * err = drm_gpusvm_migrate_to_devmem(gpusvm, range,
+ * devmem_allocation,
+ * &ctx);
+ * if (err) // CPU mappings may have changed
+ * goto retry;
+ * }
+ *
+ * err = drm_gpusvm_range_get_pages(gpusvm, range, &ctx);
+ * if (err == -EOPNOTSUPP || err == -EFAULT || err == -EPERM) { // CPU mappings changed
+ * if (err == -EOPNOTSUPP)
+ * drm_gpusvm_range_evict(gpusvm, range);
+ * goto retry;
+ * } else if (err) {
+ * goto unlock;
+ * }
+ *
+ * err = driver_bind_range(gpusvm, range);
+ * if (err == -EAGAIN) // CPU mappings changed
+ * goto retry;
+ *
+ * unlock:
+ * driver_svm_unlock();
+ * return err;
+ * }
+ *
+ * 2) Garbage Collector.
+ *
+ * void __driver_garbage_collector(struct drm_gpusvm *gpusvm,
+ * struct drm_gpusvm_range *range)
+ * {
+ * assert_driver_svm_locked(gpusvm);
+ *
+ * // Partial unmap, migrate any remaining device memory pages back to RAM
+ * if (range->flags.partial_unmap)
+ * drm_gpusvm_range_evict(gpusvm, range);
+ *
+ * driver_unbind_range(range);
+ * drm_gpusvm_range_remove(gpusvm, range);
+ * }
+ *
+ * void driver_garbage_collector(struct drm_gpusvm *gpusvm)
+ * {
+ * assert_driver_svm_locked(gpusvm);
+ *
+ * for_each_range_in_garbage_collector(gpusvm, range)
+ * __driver_garbage_collector(gpusvm, range);
+ * }
+ *
+ * 3) Invalidation driver vfunc.
+ *
+ * void driver_invalidation(struct drm_gpusvm *gpusvm,
+ * struct drm_gpusvm_notifier *notifier,
+ * const struct mmu_notifier_range *mmu_range)
+ * {
+ * struct drm_gpusvm_ctx ctx = { .in_notifier = true, };
+ * struct drm_gpusvm_range *range = NULL;
+ *
+ * driver_invalidate_device_tlb(gpusvm, mmu_range->start, mmu_range->end);
+ *
+ * drm_gpusvm_for_each_range(range, notifier, mmu_range->start,
+ * mmu_range->end) {
+ * drm_gpusvm_range_unmap_pages(gpusvm, range, &ctx);
+ *
+ * if (mmu_range->event != MMU_NOTIFY_UNMAP)
+ * continue;
+ *
+ * drm_gpusvm_range_set_unmapped(range, mmu_range);
+ * driver_garbage_collector_add(gpusvm, range);
+ * }
+ * }
+ */
+
+#define DRM_GPUSVM_RANGE_START(_range) ((_range)->va.start)
+#define DRM_GPUSVM_RANGE_END(_range) ((_range)->va.end - 1)
+INTERVAL_TREE_DEFINE(struct drm_gpusvm_range, rb.node, u64, rb.__subtree_last,
+ DRM_GPUSVM_RANGE_START, DRM_GPUSVM_RANGE_END,
+ static __maybe_unused, range);
+
+#define DRM_GPUSVM_NOTIFIER_START(_notifier) ((_notifier)->interval.start)
+#define DRM_GPUSVM_NOTIFIER_END(_notifier) ((_notifier)->interval.end - 1)
+INTERVAL_TREE_DEFINE(struct drm_gpusvm_notifier, rb.node, u64,
+ rb.__subtree_last, DRM_GPUSVM_NOTIFIER_START,
+ DRM_GPUSVM_NOTIFIER_END, static __maybe_unused, notifier);
+
+/**
+ * npages_in_range() - Calculate the number of pages in a given range
+ * @start__: The start address of the range
+ * @end__: The end address of the range
+ *
+ * This macro calculates the number of pages in a given memory range,
+ * specified by the start and end addresses. It divides the difference
+ * between the end and start addresses by the page size (PAGE_SIZE) to
+ * determine the number of pages in the range.
+ *
+ * Return: The number of pages in the specified range.
+ */
+#define npages_in_range(start__, end__) \
+ (((end__) - (start__)) >> PAGE_SHIFT)
+
+/**
+ * struct drm_gpusvm_zdd - GPU SVM zone device data
+ *
+ * @refcount: Reference count for the zdd
+ * @destroy_work: Work structure for asynchronous zdd destruction
+ * @devmem_allocation: device memory allocation
+ * @device_private_page_owner: Device private pages owner
+ *
+ * This structure serves as a generic wrapper installed in
+ * page->zone_device_data. It provides infrastructure for looking up a device
+ * memory allocation upon CPU page fault and asynchronously releasing device
+ * memory once the CPU has no page references. Asynchronous release is useful
+ * because CPU page references can be dropped in IRQ contexts, while releasing
+ * device memory likely requires sleeping locks.
+ */
+struct drm_gpusvm_zdd {
+ struct kref refcount;
+ struct work_struct destroy_work;
+ struct drm_gpusvm_devmem *devmem_allocation;
+ void *device_private_page_owner;
+};
+
+/**
+ * drm_gpusvm_zdd_destroy_work_func - Work function for destroying a zdd
+ * @w: Pointer to the work_struct
+ *
+ * This function releases device memory, puts GPU SVM range, and frees zdd.
+ */
+static void drm_gpusvm_zdd_destroy_work_func(struct work_struct *w)
+{
+ struct drm_gpusvm_zdd *zdd =
+ container_of(w, struct drm_gpusvm_zdd, destroy_work);
+ const struct drm_gpusvm_devmem_ops *ops = zdd->devmem_allocation ?
+ zdd->devmem_allocation->ops : NULL;
+
+ if (zdd->devmem_allocation && ops->devmem_release)
+ ops->devmem_release(zdd->devmem_allocation);
+ kfree(zdd);
+}
+
+/**
+ * drm_gpusvm_zdd_alloc - Allocate a zdd structure.
+ * @device_private_page_owner: Device private pages owner
+ *
+ * This function allocates and initializes a new zdd structure. It sets up the
+ * reference count and initializes the destroy work.
+ *
+ * Returns:
+ * Pointer to the allocated zdd on success, NULL on failure.
+ */
+static struct drm_gpusvm_zdd *
+drm_gpusvm_zdd_alloc(void *device_private_page_owner)
+{
+ struct drm_gpusvm_zdd *zdd;
+
+ zdd = kmalloc(sizeof(*zdd), GFP_KERNEL);
+ if (!zdd)
+ return NULL;
+
+ kref_init(&zdd->refcount);
+ INIT_WORK(&zdd->destroy_work, drm_gpusvm_zdd_destroy_work_func);
+ zdd->devmem_allocation = NULL;
+ zdd->device_private_page_owner = device_private_page_owner;
+
+ return zdd;
+}
+
+/**
+ * drm_gpusvm_zdd_get - Get a reference to a zdd structure.
+ * @zdd: Pointer to the zdd structure.
+ *
+ * This function increments the reference count of the provided zdd structure.
+ *
+ * Returns: Pointer to the zdd structure.
+ */
+static struct drm_gpusvm_zdd *drm_gpusvm_zdd_get(struct drm_gpusvm_zdd *zdd)
+{
+ kref_get(&zdd->refcount);
+ return zdd;
+}
+
+/**
+ * drm_gpusvm_zdd_destroy - Destroy a zdd structure.
+ * @ref: Pointer to the reference count structure.
+ *
+ * This function queues the destroy_work of the zdd for asynchronous destruction.
+ */
+static void drm_gpusvm_zdd_destroy(struct kref *ref)
+{
+ struct drm_gpusvm_zdd *zdd =
+ container_of(ref, struct drm_gpusvm_zdd, refcount);
+
+ if (zdd->devmem_allocation)
+ WRITE_ONCE(zdd->devmem_allocation->detached, true);
+ schedule_work(&zdd->destroy_work);
+}
+
+/**
+ * drm_gpusvm_zdd_put - Put a zdd reference.
+ * @zdd: Pointer to the zdd structure.
+ *
+ * This function decrements the reference count of the provided zdd structure
+ * and schedules its destruction if the count drops to zero.
+ */
+static void drm_gpusvm_zdd_put(struct drm_gpusvm_zdd *zdd)
+{
+ kref_put(&zdd->refcount, drm_gpusvm_zdd_destroy);
+}
+
+/**
+ * drm_gpusvm_range_find - Find GPU SVM range from GPU SVM notifier
+ * @notifier: Pointer to the GPU SVM notifier structure.
+ * @start: Start address of the range
+ * @end: End address of the range
+ *
+ * Return: A pointer to the drm_gpusvm_range if found or NULL
+ */
+struct drm_gpusvm_range *
+drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64 start, u64 end)
+{
+ return range_iter_first(¬ifier->root, start, end - 1);
+}
+
+/**
+ * drm_gpusvm_for_each_range_safe - Safely iterate over GPU SVM ranges in a notifier
+ * @range__: Iterator variable for the ranges
+ * @next__: Iterator variable for the ranges temporary storage
+ * @notifier__: Pointer to the GPU SVM notifier
+ * @start__: Start address of the range
+ * @end__: End address of the range
+ *
+ * This macro is used to iterate over GPU SVM ranges in a notifier while
+ * removing ranges from it.
+ */
+#define drm_gpusvm_for_each_range_safe(range__, next__, notifier__, start__, end__) \
+ for ((range__) = drm_gpusvm_range_find((notifier__), (start__), (end__)), \
+ (next__) = __drm_gpusvm_range_next(range__); \
+ (range__) && (range__->va.start < (end__)); \
+ (range__) = (next__), (next__) = __drm_gpusvm_range_next(range__))
+
+/**
+ * __drm_gpusvm_notifier_next - get the next drm_gpusvm_notifier in the list
+ * @notifier: a pointer to the current drm_gpusvm_notifier
+ *
+ * Return: A pointer to the next drm_gpusvm_notifier if available, or NULL if
+ * the current notifier is the last one or if the input notifier is
+ * NULL.
+ */
+static struct drm_gpusvm_notifier *
+__drm_gpusvm_notifier_next(struct drm_gpusvm_notifier *notifier)
+{
+ if (notifier && !list_is_last(¬ifier->rb.entry,
+ ¬ifier->gpusvm->notifier_list))
+ return list_next_entry(notifier, rb.entry);
+
+ return NULL;
+}
+
+/**
+ * drm_gpusvm_for_each_notifier - Iterate over GPU SVM notifiers in a gpusvm
+ * @notifier__: Iterator variable for the notifiers
+ * @gpusvm__: Pointer to the GPU SVM structure
+ * @start__: Start address of the notifier
+ * @end__: End address of the notifier
+ *
+ * This macro is used to iterate over GPU SVM notifiers in a gpusvm.
+ */
+#define drm_gpusvm_for_each_notifier(notifier__, gpusvm__, start__, end__) \
+ for ((notifier__) = notifier_iter_first(&(gpusvm__)->root, (start__), (end__) - 1); \
+ (notifier__) && (notifier__->interval.start < (end__)); \
+ (notifier__) = __drm_gpusvm_notifier_next(notifier__))
+
+/**
+ * drm_gpusvm_for_each_notifier_safe - Safely iterate over GPU SVM notifiers in a gpusvm
+ * @notifier__: Iterator variable for the notifiers
+ * @next__: Iterator variable for the notifiers temporary storage
+ * @gpusvm__: Pointer to the GPU SVM structure
+ * @start__: Start address of the notifier
+ * @end__: End address of the notifier
+ *
+ * This macro is used to iterate over GPU SVM notifiers in a gpusvm while
+ * removing notifiers from it.
+ */
+#define drm_gpusvm_for_each_notifier_safe(notifier__, next__, gpusvm__, start__, end__) \
+ for ((notifier__) = notifier_iter_first(&(gpusvm__)->root, (start__), (end__) - 1), \
+ (next__) = __drm_gpusvm_notifier_next(notifier__); \
+ (notifier__) && (notifier__->interval.start < (end__)); \
+ (notifier__) = (next__), (next__) = __drm_gpusvm_notifier_next(notifier__))
+
+/**
+ * drm_gpusvm_notifier_invalidate - Invalidate a GPU SVM notifier.
+ * @mni: Pointer to the mmu_interval_notifier structure.
+ * @mmu_range: Pointer to the mmu_notifier_range structure.
+ * @cur_seq: Current sequence number.
+ *
+ * This function serves as a generic MMU notifier for GPU SVM. It sets the MMU
+ * notifier sequence number and calls the driver invalidate vfunc under
+ * gpusvm->notifier_lock.
+ *
+ * Returns:
+ * true if the operation succeeds, false otherwise.
+ */
+static bool
+drm_gpusvm_notifier_invalidate(struct mmu_interval_notifier *mni,
+ const struct mmu_notifier_range *mmu_range,
+ unsigned long cur_seq)
+{
+ struct drm_gpusvm_notifier *notifier =
+ container_of(mni, typeof(*notifier), notifier);
+ struct drm_gpusvm *gpusvm = notifier->gpusvm;
+
+ if (!mmu_notifier_range_blockable(mmu_range))
+ return false;
+
+ down_write(&gpusvm->notifier_lock);
+ mmu_interval_set_seq(mni, cur_seq);
+ gpusvm->ops->invalidate(gpusvm, notifier, mmu_range);
+ up_write(&gpusvm->notifier_lock);
+
+ return true;
+}
+
+/**
+ * drm_gpusvm_notifier_ops - MMU interval notifier operations for GPU SVM
+ */
+static const struct mmu_interval_notifier_ops drm_gpusvm_notifier_ops = {
+ .invalidate = drm_gpusvm_notifier_invalidate,
+};
+
+/**
+ * drm_gpusvm_init - Initialize the GPU SVM.
+ * @gpusvm: Pointer to the GPU SVM structure.
+ * @name: Name of the GPU SVM.
+ * @drm: Pointer to the DRM device structure.
+ * @mm: Pointer to the mm_struct for the address space.
+ * @device_private_page_owner: Device private pages owner.
+ * @mm_start: Start address of GPU SVM.
+ * @mm_range: Range of the GPU SVM.
+ * @notifier_size: Size of individual notifiers.
+ * @ops: Pointer to the operations structure for GPU SVM.
+ * @chunk_sizes: Pointer to the array of chunk sizes used in range allocation.
+ * Entries should be powers of 2 in descending order with last
+ * entry being SZ_4K.
+ * @num_chunks: Number of chunks.
+ *
+ * This function initializes the GPU SVM.
+ *
+ * Returns:
+ * 0 on success, a negative error code on failure.
+ */
+int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
+ const char *name, struct drm_device *drm,
+ struct mm_struct *mm, void *device_private_page_owner,
+ u64 mm_start, u64 mm_range, u64 notifier_size,
+ const struct drm_gpusvm_ops *ops,
+ const u64 *chunk_sizes, int num_chunks)
+{
+ if (!ops->invalidate || !num_chunks)
+ return -EINVAL;
+
+ gpusvm->name = name;
+ gpusvm->drm = drm;
+ gpusvm->mm = mm;
+ gpusvm->device_private_page_owner = device_private_page_owner;
+ gpusvm->mm_start = mm_start;
+ gpusvm->mm_range = mm_range;
+ gpusvm->notifier_size = notifier_size;
+ gpusvm->ops = ops;
+ gpusvm->chunk_sizes = chunk_sizes;
+ gpusvm->num_chunks = num_chunks;
+
+ mmgrab(mm);
+ gpusvm->root = RB_ROOT_CACHED;
+ INIT_LIST_HEAD(&gpusvm->notifier_list);
+
+ init_rwsem(&gpusvm->notifier_lock);
+
+ fs_reclaim_acquire(GFP_KERNEL);
+ might_lock(&gpusvm->notifier_lock);
+ fs_reclaim_release(GFP_KERNEL);
+
+ return 0;
+}
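+
+/*
+ * Example usage (illustrative only, the names below are hypothetical):
+ *
+ *	static const u64 chunk_sizes[] = { SZ_2M, SZ_64K, SZ_4K };
+ *
+ *	err = drm_gpusvm_init(&vm->svm.gpusvm, "Xe SVM", &xe->drm,
+ *			      current->mm, NULL, 0, 1ull << 47, SZ_512M,
+ *			      &gpusvm_ops, chunk_sizes,
+ *			      ARRAY_SIZE(chunk_sizes));
+ */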
+
+/**
+ * drm_gpusvm_notifier_find - Find GPU SVM notifier
+ * @gpusvm__: Pointer to the GPU SVM structure
+ * @fault_addr__: Fault address
+ *
+ * This macro finds the GPU SVM notifier associated with the fault address.
+ *
+ * Returns:
+ * Pointer to the GPU SVM notifier on success, NULL otherwise.
+ */
+#define drm_gpusvm_notifier_find(gpusvm__, fault_addr__) \
+ notifier_iter_first(&(gpusvm__)->root, (fault_addr__), \
+ (fault_addr__ + 1))
+
+/**
+ * to_drm_gpusvm_notifier - retrieve the container struct for a given rbtree node
+ * @node__: a pointer to the rbtree node embedded within a drm_gpusvm_notifier struct
+ *
+ * Return: A pointer to the containing drm_gpusvm_notifier structure.
+ */
+#define to_drm_gpusvm_notifier(node__) \
+ container_of((node__), struct drm_gpusvm_notifier, rb.node)
+
+/**
+ * drm_gpusvm_notifier_insert - Insert GPU SVM notifier
+ * @gpusvm: Pointer to the GPU SVM structure
+ * @notifier: Pointer to the GPU SVM notifier structure
+ *
+ * This function inserts the GPU SVM notifier into the GPU SVM RB tree and list.
+ */
+static void drm_gpusvm_notifier_insert(struct drm_gpusvm *gpusvm,
+ struct drm_gpusvm_notifier *notifier)
+{
+ struct rb_node *node;
+ struct list_head *head;
+
+ notifier_insert(notifier, &gpusvm->root);
+
+ node = rb_prev(¬ifier->rb.node);
+ if (node)
+ head = &(to_drm_gpusvm_notifier(node))->rb.entry;
+ else
+ head = &gpusvm->notifier_list;
+
+ list_add(¬ifier->rb.entry, head);
+}
+
+/**
+ * drm_gpusvm_notifier_remove - Remove GPU SVM notifier
+ * @gpusvm__: Pointer to the GPU SVM structure
+ * @notifier__: Pointer to the GPU SVM notifier structure
+ *
+ * This macro removes the GPU SVM notifier from the GPU SVM RB tree and list.
+ */
+#define drm_gpusvm_notifier_remove(gpusvm__, notifier__) \
+ notifier_remove((notifier__), &(gpusvm__)->root); \
+ list_del(&(notifier__)->rb.entry)
+
+/**
+ * drm_gpusvm_fini - Finalize the GPU SVM.
+ * @gpusvm: Pointer to the GPU SVM structure.
+ *
+ * This function finalizes the GPU SVM by cleaning up any remaining ranges and
+ * notifiers, and dropping a reference to struct MM.
+ */
+void drm_gpusvm_fini(struct drm_gpusvm *gpusvm)
+{
+ struct drm_gpusvm_notifier *notifier, *next;
+
+ drm_gpusvm_for_each_notifier_safe(notifier, next, gpusvm, 0, LONG_MAX) {
+ struct drm_gpusvm_range *range, *__next;
+
+ /*
+ * Remove notifier first to avoid racing with any invalidation
+ */
+ mmu_interval_notifier_remove(¬ifier->notifier);
+ notifier->flags.removed = true;
+
+ drm_gpusvm_for_each_range_safe(range, __next, notifier, 0,
+ LONG_MAX)
+ drm_gpusvm_range_remove(gpusvm, range);
+ }
+
+ mmdrop(gpusvm->mm);
+ WARN_ON(!RB_EMPTY_ROOT(&gpusvm->root.rb_root));
+}
+
+/**
+ * drm_gpusvm_notifier_alloc - Allocate GPU SVM notifier
+ * @gpusvm: Pointer to the GPU SVM structure
+ * @fault_addr: Fault address
+ *
+ * This function allocates and initializes the GPU SVM notifier structure.
+ *
+ * Returns:
+ * Pointer to the allocated GPU SVM notifier on success, ERR_PTR() on failure.
+ */
+static struct drm_gpusvm_notifier *
+drm_gpusvm_notifier_alloc(struct drm_gpusvm *gpusvm, u64 fault_addr)
+{
+ struct drm_gpusvm_notifier *notifier;
+
+ if (gpusvm->ops->notifier_alloc)
+ notifier = gpusvm->ops->notifier_alloc();
+ else
+ notifier = kzalloc(sizeof(*notifier), GFP_KERNEL);
+
+ if (!notifier)
+ return ERR_PTR(-ENOMEM);
+
+ notifier->gpusvm = gpusvm;
+ notifier->interval.start = ALIGN_DOWN(fault_addr, gpusvm->notifier_size);
+ notifier->interval.end = ALIGN(fault_addr + 1, gpusvm->notifier_size);
+ INIT_LIST_HEAD(¬ifier->rb.entry);
+ notifier->root = RB_ROOT_CACHED;
+ INIT_LIST_HEAD(¬ifier->range_list);
+
+ return notifier;
+}
+
+/**
+ * drm_gpusvm_notifier_free - Free GPU SVM notifier
+ * @gpusvm: Pointer to the GPU SVM structure
+ * @notifier: Pointer to the GPU SVM notifier structure
+ *
+ * This function frees the GPU SVM notifier structure.
+ */
+static void drm_gpusvm_notifier_free(struct drm_gpusvm *gpusvm,
+ struct drm_gpusvm_notifier *notifier)
+{
+ WARN_ON(!RB_EMPTY_ROOT(¬ifier->root.rb_root));
+
+ if (gpusvm->ops->notifier_free)
+ gpusvm->ops->notifier_free(notifier);
+ else
+ kfree(notifier);
+}
+
+/**
+ * to_drm_gpusvm_range - retrieve the container struct for a given rbtree node
+ * @node__: a pointer to the rbtree node embedded within a drm_gpusvm_range struct
+ *
+ * Return: A pointer to the containing drm_gpusvm_range structure.
+ */
+#define to_drm_gpusvm_range(node__) \
+ container_of((node__), struct drm_gpusvm_range, rb.node)
+
+/**
+ * drm_gpusvm_range_insert - Insert GPU SVM range
+ * @notifier: Pointer to the GPU SVM notifier structure
+ * @range: Pointer to the GPU SVM range structure
+ *
+ * This function inserts the GPU SVM range into the notifier RB tree and list.
+ */
+static void drm_gpusvm_range_insert(struct drm_gpusvm_notifier *notifier,
+ struct drm_gpusvm_range *range)
+{
+ struct rb_node *node;
+ struct list_head *head;
+
+ drm_gpusvm_notifier_lock(notifier->gpusvm);
+ range_insert(range, ¬ifier->root);
+
+ node = rb_prev(&range->rb.node);
+ if (node)
+ head = &(to_drm_gpusvm_range(node))->rb.entry;
+ else
+ head = ¬ifier->range_list;
+
+ list_add(&range->rb.entry, head);
+ drm_gpusvm_notifier_unlock(notifier->gpusvm);
+}
+
+/**
+ * __drm_gpusvm_range_remove - Remove GPU SVM range
+ * @notifier__: Pointer to the GPU SVM notifier structure
+ * @range__: Pointer to the GPU SVM range structure
+ *
+ * This macro removes the GPU SVM range from the notifier RB tree and list.
+ */
+#define __drm_gpusvm_range_remove(notifier__, range__) \
+ range_remove((range__), &(notifier__)->root); \
+ list_del(&(range__)->rb.entry)
+
+/**
+ * drm_gpusvm_range_alloc - Allocate GPU SVM range
+ * @gpusvm: Pointer to the GPU SVM structure
+ * @notifier: Pointer to the GPU SVM notifier structure
+ * @fault_addr: Fault address
+ * @chunk_size: Chunk size
+ * @migrate_devmem: Flag indicating whether to migrate device memory
+ *
+ * This function allocates and initializes the GPU SVM range structure.
+ *
+ * Returns:
+ * Pointer to the allocated GPU SVM range on success, ERR_PTR() on failure.
+ */
+static struct drm_gpusvm_range *
+drm_gpusvm_range_alloc(struct drm_gpusvm *gpusvm,
+ struct drm_gpusvm_notifier *notifier,
+ u64 fault_addr, u64 chunk_size, bool migrate_devmem)
+{
+ struct drm_gpusvm_range *range;
+
+ if (gpusvm->ops->range_alloc)
+ range = gpusvm->ops->range_alloc(gpusvm);
+ else
+ range = kzalloc(sizeof(*range), GFP_KERNEL);
+
+ if (!range)
+ return ERR_PTR(-ENOMEM);
+
+ kref_init(&range->refcount);
+ range->gpusvm = gpusvm;
+ range->notifier = notifier;
+ range->va.start = ALIGN_DOWN(fault_addr, chunk_size);
+ range->va.end = ALIGN(fault_addr + 1, chunk_size);
+ INIT_LIST_HEAD(&range->rb.entry);
+ range->notifier_seq = LONG_MAX;
+ range->flags.migrate_devmem = migrate_devmem ? 1 : 0;
+
+ return range;
+}
+
+/**
+ * drm_gpusvm_check_pages - Check pages
+ * @gpusvm: Pointer to the GPU SVM structure
+ * @notifier: Pointer to the GPU SVM notifier structure
+ * @start: Start address
+ * @end: End address
+ *
+ * Check if pages between start and end have been faulted in on the CPU. Used to
+ * prevent migration of pages without a CPU backing store.
+ *
+ * Returns:
+ * True if pages have been faulted into CPU, False otherwise
+ */
+static bool drm_gpusvm_check_pages(struct drm_gpusvm *gpusvm,
+ struct drm_gpusvm_notifier *notifier,
+ u64 start, u64 end)
+{
+ struct hmm_range hmm_range = {
+ .default_flags = 0,
+ .notifier = ¬ifier->notifier,
+ .start = start,
+ .end = end,
+ .dev_private_owner = gpusvm->device_private_page_owner,
+ };
+ unsigned long timeout =
+ jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
+ unsigned long *pfns;
+ unsigned long npages = npages_in_range(start, end);
+ int err, i;
+
+ mmap_assert_locked(gpusvm->mm);
+
+ pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
+ if (!pfns)
+ return false;
+
+ hmm_range.notifier_seq = mmu_interval_read_begin(¬ifier->notifier);
+ hmm_range.hmm_pfns = pfns;
+
+ while (true) {
+ err = hmm_range_fault(&hmm_range);
+ if (err == -EBUSY) {
+ if (time_after(jiffies, timeout))
+ break;
+
+ hmm_range.notifier_seq = mmu_interval_read_begin(¬ifier->notifier);
+ continue;
+ }
+ break;
+ }
+ if (err)
+ goto err_free;
+
+ for (i = 0; i < npages;) {
+ if (!(pfns[i] & HMM_PFN_VALID)) {
+ err = -EFAULT;
+ goto err_free;
+ }
+ i += 0x1 << hmm_pfn_to_map_order(pfns[i]);
+ }
+
+err_free:
+ kvfree(pfns);
+ return err ? false : true;
+}
+
+/**
+ * drm_gpusvm_range_chunk_size - Determine chunk size for GPU SVM range
+ * @gpusvm: Pointer to the GPU SVM structure
+ * @notifier: Pointer to the GPU SVM notifier structure
+ * @vas: Pointer to the virtual memory area structure
+ * @fault_addr: Fault address
+ * @gpuva_start: Start address of GPUVA which mirrors CPU
+ * @gpuva_end: End address of GPUVA which mirrors CPU
+ * @check_pages: Flag indicating whether to check pages
+ *
+ * This function determines the chunk size for the GPU SVM range based on the
+ * fault address, GPU SVM chunk sizes, existing GPU SVM ranges, and the virtual
+ * memory area boundaries.
+ *
+ * Returns:
+ * Chunk size on success, LONG_MAX on failure.
+ */
+static u64 drm_gpusvm_range_chunk_size(struct drm_gpusvm *gpusvm,
+ struct drm_gpusvm_notifier *notifier,
+ struct vm_area_struct *vas,
+ u64 fault_addr, u64 gpuva_start,
+ u64 gpuva_end, bool check_pages)
+{
+ u64 start, end;
+ int i = 0;
+
+retry:
+ for (; i < gpusvm->num_chunks; ++i) {
+ start = ALIGN_DOWN(fault_addr, gpusvm->chunk_sizes[i]);
+ end = ALIGN(fault_addr + 1, gpusvm->chunk_sizes[i]);
+
+ if (start >= vas->vm_start && end <= vas->vm_end &&
+ start >= notifier->interval.start &&
+ end <= notifier->interval.end &&
+ start >= gpuva_start && end <= gpuva_end)
+ break;
+ }
+
+ if (i == gpusvm->num_chunks)
+ return LONG_MAX;
+
+ /*
+ * If the allocation is more than a page, ensure it does not overlap with
+ * existing ranges.
+ */
+ if (end - start != SZ_4K) {
+ struct drm_gpusvm_range *range;
+
+ range = drm_gpusvm_range_find(notifier, start, end);
+ if (range) {
+ ++i;
+ goto retry;
+ }
+
+ /*
+ * XXX: Only create range on pages CPU has faulted in. Without
+ * this check, or prefault, on BMG 'xe_exec_system_allocator --r
+ * process-many-malloc' fails. In the failure case, each process
+ * mallocs 16k but the CPU VMA is ~128k which results in 64k SVM
+ * ranges. When migrating the SVM ranges, some processes fail in
+ * drm_gpusvm_migrate_to_devmem with 'migrate.cpages != npages'
+ * and then upon drm_gpusvm_range_get_pages device pages from
+ * other processes are collected + faulted in which creates all
+ * sorts of problems. Unsure exactly how this is happening; the
+ * problem also goes away if 'xe_exec_system_allocator --r
+ * process-many-malloc' mallocs at least 64k at a time.
+ */
+ if (check_pages &&
+ !drm_gpusvm_check_pages(gpusvm, notifier, start, end)) {
+ ++i;
+ goto retry;
+ }
+ }
+
+ return end - start;
+}
+
+/**
+ * drm_gpusvm_range_find_or_insert - Find or insert GPU SVM range
+ * @gpusvm: Pointer to the GPU SVM structure
+ * @fault_addr: Fault address
+ * @gpuva_start: Start address of GPUVA which mirrors CPU
+ * @gpuva_end: End address of GPUVA which mirrors CPU
+ * @ctx: GPU SVM context
+ *
+ * This function finds or inserts a newly allocated GPU SVM range based on the
+ * fault address. Caller must hold a lock to protect range lookup and insertion.
+ *
+ * Returns:
+ * Pointer to the GPU SVM range on success, ERR_PTR() on failure.
+ */
+struct drm_gpusvm_range *
+drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64 fault_addr,
+ u64 gpuva_start, u64 gpuva_end,
+ const struct drm_gpusvm_ctx *ctx)
+{
+ struct drm_gpusvm_notifier *notifier;
+ struct drm_gpusvm_range *range;
+ struct mm_struct *mm = gpusvm->mm;
+ struct vm_area_struct *vas;
+ bool notifier_alloc = false;
+ u64 chunk_size;
+ int err;
+ bool migrate_devmem;
+
+ if (fault_addr < gpusvm->mm_start ||
+ fault_addr > gpusvm->mm_start + gpusvm->mm_range) {
+ err = -EINVAL;
+ goto err_out;
+ }
+
+ if (!mmget_not_zero(mm)) {
+ err = -EFAULT;
+ goto err_out;
+ }
+
+ notifier = drm_gpusvm_notifier_find(gpusvm, fault_addr);
+ if (!notifier) {
+ notifier = drm_gpusvm_notifier_alloc(gpusvm, fault_addr);
+ if (IS_ERR(notifier)) {
+ err = PTR_ERR(notifier);
+ goto err_mmunlock;
+ }
+ notifier_alloc = true;
+ err = mmu_interval_notifier_insert(¬ifier->notifier,
+ mm, notifier->interval.start,
+ notifier->interval.end -
+ notifier->interval.start,
+ &drm_gpusvm_notifier_ops);
+ if (err)
+ goto err_notifier;
+ }
+
+ mmap_read_lock(mm);
+
+ vas = vma_lookup(mm, fault_addr);
+ if (!vas) {
+ err = -ENOENT;
+ goto err_notifier_remove;
+ }
+
+ if (!ctx->read_only && !(vas->vm_flags & VM_WRITE)) {
+ err = -EPERM;
+ goto err_notifier_remove;
+ }
+
+ range = drm_gpusvm_range_find(notifier, fault_addr, fault_addr + 1);
+ if (range)
+ goto out_mmunlock;
+ /*
+ * XXX: Short-circuiting migration based on migrate_vma_* current
+ * limitations. If/when migrate_vma_* add more support, this logic will
+ * have to change.
+ */
+ migrate_devmem = ctx->devmem_possible &&
+ vma_is_anonymous(vas) && !is_vm_hugetlb_page(vas);
+
+ chunk_size = drm_gpusvm_range_chunk_size(gpusvm, notifier, vas,
+ fault_addr, gpuva_start,
+ gpuva_end, migrate_devmem &&
+ ctx->check_pages);
+ if (chunk_size == LONG_MAX) {
+ err = -EINVAL;
+ goto err_notifier_remove;
+ }
+
+ range = drm_gpusvm_range_alloc(gpusvm, notifier, fault_addr, chunk_size,
+ migrate_devmem);
+ if (IS_ERR(range)) {
+ err = PTR_ERR(range);
+ goto err_notifier_remove;
+ }
+
+ drm_gpusvm_range_insert(notifier, range);
+ if (notifier_alloc)
+ drm_gpusvm_notifier_insert(gpusvm, notifier);
+
+out_mmunlock:
+ mmap_read_unlock(mm);
+ mmput(mm);
+
+ return range;
+
+err_notifier_remove:
+ mmap_read_unlock(mm);
+ if (notifier_alloc)
+ mmu_interval_notifier_remove(¬ifier->notifier);
+err_notifier:
+ if (notifier_alloc)
+ drm_gpusvm_notifier_free(gpusvm, notifier);
+err_mmunlock:
+ mmput(mm);
+err_out:
+ return ERR_PTR(err);
+}
+
+/**
+ * __drm_gpusvm_range_unmap_pages - Unmap pages associated with a GPU SVM range (internal)
+ * @gpusvm: Pointer to the GPU SVM structure
+ * @range: Pointer to the GPU SVM range structure
+ * @npages: Number of pages to unmap
+ *
+ * This function unmaps pages associated with a GPU SVM range. Assumes and
+ * asserts correct locking is in place when called.
+ */
+static void __drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
+ struct drm_gpusvm_range *range,
+ unsigned long npages)
+{
+ unsigned long i, j;
+ struct drm_pagemap *dpagemap = range->dpagemap;
+ struct device *dev = gpusvm->drm->dev;
+
+ lockdep_assert_held(&gpusvm->notifier_lock);
+
+ if (range->flags.has_dma_mapping) {
+ for (i = 0, j = 0; i < npages; j++) {
+ struct drm_pagemap_dma_addr *addr = &range->dma_addr[j];
+
+ if (addr->proto == DRM_INTERCONNECT_SYSTEM) {
+ dma_unmap_page(dev,
+ addr->addr,
+ PAGE_SIZE << addr->order,
+ addr->dir);
+ } else if (dpagemap && dpagemap->ops->unmap_dma) {
+ dpagemap->ops->unmap_dma(dpagemap,
+ dev,
+ *addr);
+ }
+ i += 1 << addr->order;
+ }
+ range->flags.has_devmem_pages = false;
+ range->flags.has_dma_mapping = false;
+ range->dpagemap = NULL;
+ }
+}
+
+/**
+ * drm_gpusvm_range_free_pages - Free pages associated with a GPU SVM range
+ * @gpusvm: Pointer to the GPU SVM structure
+ * @range: Pointer to the GPU SVM range structure
+ *
+ * This function frees pages associated with a GPU SVM range.
+ */
+static void drm_gpusvm_range_free_pages(struct drm_gpusvm *gpusvm,
+ struct drm_gpusvm_range *range)
+{
+ lockdep_assert_held(&gpusvm->notifier_lock);
+
+ if (range->dma_addr) {
+ kvfree(range->dma_addr);
+ range->dma_addr = NULL;
+ }
+}
+
+/**
+ * drm_gpusvm_range_remove - Remove GPU SVM range
+ * @gpusvm: Pointer to the GPU SVM structure
+ * @range: Pointer to the GPU SVM range to be removed
+ *
+ * This function removes the specified GPU SVM range and also removes the parent
+ * GPU SVM notifier if no more ranges remain in the notifier. The caller must
+ * hold a lock to protect range and notifier removal.
+ */
+void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
+ struct drm_gpusvm_range *range)
+{
+ unsigned long npages = npages_in_range(range->va.start, range->va.end);
+ struct drm_gpusvm_notifier *notifier;
+
+ notifier = drm_gpusvm_notifier_find(gpusvm, range->va.start);
+ if (WARN_ON_ONCE(!notifier))
+ return;
+
+ drm_gpusvm_notifier_lock(gpusvm);
+ __drm_gpusvm_range_unmap_pages(gpusvm, range, npages);
+ drm_gpusvm_range_free_pages(gpusvm, range);
+ __drm_gpusvm_range_remove(notifier, range);
+ drm_gpusvm_notifier_unlock(gpusvm);
+
+ drm_gpusvm_range_put(range);
+
+ if (RB_EMPTY_ROOT(¬ifier->root.rb_root)) {
+ if (!notifier->flags.removed)
+ mmu_interval_notifier_remove(¬ifier->notifier);
+ drm_gpusvm_notifier_remove(gpusvm, notifier);
+ drm_gpusvm_notifier_free(gpusvm, notifier);
+ }
+}
+
+/**
+ * drm_gpusvm_range_get - Get a reference to GPU SVM range
+ * @range: Pointer to the GPU SVM range
+ *
+ * This function increments the reference count of the specified GPU SVM range.
+ *
+ * Returns:
+ * Pointer to the GPU SVM range.
+ */
+struct drm_gpusvm_range *
+drm_gpusvm_range_get(struct drm_gpusvm_range *range)
+{
+ kref_get(&range->refcount);
+
+ return range;
+}
+
+/**
+ * drm_gpusvm_range_destroy - Destroy GPU SVM range
+ * @refcount: Pointer to the reference counter embedded in the GPU SVM range
+ *
+ * This function destroys the specified GPU SVM range when its reference count
+ * reaches zero. If a custom range-free function is provided, it is invoked to
+ * free the range; otherwise, the range is deallocated using kfree().
+ */
+static void drm_gpusvm_range_destroy(struct kref *refcount)
+{
+ struct drm_gpusvm_range *range =
+ container_of(refcount, struct drm_gpusvm_range, refcount);
+ struct drm_gpusvm *gpusvm = range->gpusvm;
+
+ if (gpusvm->ops->range_free)
+ gpusvm->ops->range_free(range);
+ else
+ kfree(range);
+}
+
+/**
+ * drm_gpusvm_range_put - Put a reference to GPU SVM range
+ * @range: Pointer to the GPU SVM range
+ *
+ * This function decrements the reference count of the specified GPU SVM range
+ * and frees it when the count reaches zero.
+ */
+void drm_gpusvm_range_put(struct drm_gpusvm_range *range)
+{
+ kref_put(&range->refcount, drm_gpusvm_range_destroy);
+}
+
+/**
+ * drm_gpusvm_range_pages_valid - GPU SVM range pages valid
+ * @gpusvm: Pointer to the GPU SVM structure
+ * @range: Pointer to the GPU SVM range structure
+ *
+ * This function determines if a GPU SVM range's pages are valid. Expected to be
+ * called holding gpusvm->notifier_lock and as the last step before committing a
+ * GPU binding.
+ *
+ * Returns:
+ * True if GPU SVM range has valid pages, False otherwise
+ */
+bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
+ struct drm_gpusvm_range *range)
+{
+ lockdep_assert_held(&gpusvm->notifier_lock);
+
+ return range->flags.has_devmem_pages || range->flags.has_dma_mapping;
+}
+
+/**
+ * drm_gpusvm_range_pages_valid_unlocked - GPU SVM range pages valid unlocked
+ * @gpusvm: Pointer to the GPU SVM structure
+ * @range: Pointer to the GPU SVM range structure
+ *
+ * This function determines if a GPU SVM range's pages are valid. Expected to be
+ * called without holding gpusvm->notifier_lock.
+ *
+ * Returns:
+ * True if GPU SVM range has valid pages, False otherwise
+ */
+static bool
+drm_gpusvm_range_pages_valid_unlocked(struct drm_gpusvm *gpusvm,
+ struct drm_gpusvm_range *range)
+{
+ bool pages_valid;
+
+ if (!range->dma_addr)
+ return false;
+
+ drm_gpusvm_notifier_lock(gpusvm);
+ pages_valid = drm_gpusvm_range_pages_valid(gpusvm, range);
+ if (!pages_valid)
+ drm_gpusvm_range_free_pages(gpusvm, range);
+ drm_gpusvm_notifier_unlock(gpusvm);
+
+ return pages_valid;
+}
+
+/**
+ * drm_gpusvm_range_get_pages - Get pages for a GPU SVM range
+ * @gpusvm: Pointer to the GPU SVM structure
+ * @range: Pointer to the GPU SVM range structure
+ * @ctx: GPU SVM context
+ *
+ * This function gets pages for a GPU SVM range and ensures they are mapped for
+ * DMA access.
+ *
+ * Returns:
+ * 0 on success, negative error code on failure.
+ */
+int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
+ struct drm_gpusvm_range *range,
+ const struct drm_gpusvm_ctx *ctx)
+{
+ struct mmu_interval_notifier *notifier = &range->notifier->notifier;
+ struct hmm_range hmm_range = {
+ .default_flags = HMM_PFN_REQ_FAULT | (ctx->read_only ? 0 :
+ HMM_PFN_REQ_WRITE),
+ .notifier = notifier,
+ .start = range->va.start,
+ .end = range->va.end,
+ .dev_private_owner = gpusvm->device_private_page_owner,
+ };
+ struct mm_struct *mm = gpusvm->mm;
+ struct drm_gpusvm_zdd *zdd;
+ unsigned long timeout =
+ jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
+ unsigned long i, j;
+ unsigned long npages = npages_in_range(range->va.start, range->va.end);
+ unsigned long num_dma_mapped;
+ unsigned int order = 0;
+ unsigned long *pfns;
+ struct page **pages;
+ int err = 0;
+ struct dev_pagemap *pagemap = NULL;
+ struct drm_pagemap *dpagemap = NULL;
+
+retry:
+ hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
+ if (drm_gpusvm_range_pages_valid_unlocked(gpusvm, range))
+ goto set_seqno;
+
+ pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
+ if (!pfns)
+ return -ENOMEM;
+
+ if (!mmget_not_zero(mm)) {
+ err = -EFAULT;
+ goto err_free;
+ }
+
+ hmm_range.hmm_pfns = pfns;
+ while (true) {
+ mmap_read_lock(mm);
+ err = hmm_range_fault(&hmm_range);
+ mmap_read_unlock(mm);
+
+ if (err == -EBUSY) {
+ if (time_after(jiffies, timeout))
+ break;
+
+ hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
+ continue;
+ }
+ break;
+ }
+ mmput(mm);
+ if (err)
+ goto err_free;
+
+ pages = (struct page **)pfns;
+map_pages:
+ /*
+ * Perform all dma mappings under the notifier lock to not
+ * access freed pages. A notifier will either block on
+ * the notifier lock or unmap dma.
+ */
+ drm_gpusvm_notifier_lock(gpusvm);
+ if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
+ drm_gpusvm_notifier_unlock(gpusvm);
+ kvfree(pfns);
+ goto retry;
+ }
+
+ if (!range->dma_addr) {
+ /* Unlock and restart mapping to allocate memory. */
+ drm_gpusvm_notifier_unlock(gpusvm);
+ range->dma_addr = kvmalloc_array(npages, sizeof(*range->dma_addr),
+ GFP_KERNEL);
+ if (!range->dma_addr) {
+ err = -ENOMEM;
+ goto err_free;
+ }
+ goto map_pages;
+ }
+
+ zdd = NULL;
+ num_dma_mapped = 0;
+ for (i = 0, j = 0; i < npages; ++j) {
+ struct page *page = hmm_pfn_to_page(pfns[i]);
+
+ order = hmm_pfn_to_map_order(pfns[i]);
+ if (is_device_private_page(page) || is_device_coherent_page(page)) {
+ if (zdd != page->zone_device_data && i > 0) {
+ err = -EOPNOTSUPP;
+ goto err_unmap;
+ }
+ zdd = page->zone_device_data;
+ if (pagemap != page->pgmap) {
+ if (i > 0) {
+ err = -EOPNOTSUPP;
+ goto err_unmap;
+ }
+
+ pagemap = page->pgmap;
+ dpagemap = zdd->devmem_allocation->dpagemap;
+ if (drm_WARN_ON(gpusvm->drm, !dpagemap)) {
+ /*
+ * Raced. This is not supposed to happen
+ * since hmm_range_fault() should've migrated
+ * this page to system.
+ */
+ err = -EAGAIN;
+ goto err_unmap;
+ }
+ }
+ range->dma_addr[j] =
+ dpagemap->ops->map_dma(dpagemap, gpusvm->drm->dev,
+ page, order,
+ DMA_BIDIRECTIONAL);
+ if (dma_mapping_error(gpusvm->drm->dev, range->dma_addr[j].addr)) {
+ err = -EFAULT;
+ goto err_unmap;
+ }
+
+ pages[i] = page;
+ } else {
+ dma_addr_t addr;
+
+ if (is_zone_device_page(page) || zdd) {
+ err = -EOPNOTSUPP;
+ goto err_unmap;
+ }
+
+ addr = dma_map_page(gpusvm->drm->dev,
+ page, 0,
+ PAGE_SIZE << order,
+ DMA_BIDIRECTIONAL);
+ if (dma_mapping_error(gpusvm->drm->dev, addr)) {
+ err = -EFAULT;
+ goto err_unmap;
+ }
+
+ range->dma_addr[j] = drm_pagemap_dma_addr_encode
+ (addr, DRM_INTERCONNECT_SYSTEM, order,
+ DMA_BIDIRECTIONAL);
+ }
+ i += 1 << order;
+ num_dma_mapped = i;
+ }
+
+ range->flags.has_dma_mapping = true;
+ if (zdd) {
+ range->flags.has_devmem_pages = true;
+ range->dpagemap = dpagemap;
+ }
+
+ drm_gpusvm_notifier_unlock(gpusvm);
+ kvfree(pfns);
+set_seqno:
+ range->notifier_seq = hmm_range.notifier_seq;
+
+ return 0;
+
+err_unmap:
+ __drm_gpusvm_range_unmap_pages(gpusvm, range, num_dma_mapped);
+ drm_gpusvm_notifier_unlock(gpusvm);
+err_free:
+ kvfree(pfns);
+err_out:
+ if (err == -EAGAIN)
+ goto retry;
+ return err;
+}
+
+/**
+ * drm_gpusvm_range_unmap_pages - Unmap pages associated with a GPU SVM range
+ * @gpusvm: Pointer to the GPU SVM structure
+ * @range: Pointer to the GPU SVM range structure
+ * @ctx: GPU SVM context
+ *
+ * This function unmaps pages associated with a GPU SVM range. If @in_notifier
+ * is set, it is assumed that gpusvm->notifier_lock is held in write mode; if it
+ * is clear, it acquires gpusvm->notifier_lock in read mode. Must be called on
+ * each GPU SVM range attached to the notifier in gpusvm->ops->invalidate for
+ * the IOMMU security model.
+ */
+void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
+ struct drm_gpusvm_range *range,
+ const struct drm_gpusvm_ctx *ctx)
+{
+ unsigned long npages = npages_in_range(range->va.start, range->va.end);
+
+ if (ctx->in_notifier)
+ lockdep_assert_held_write(&gpusvm->notifier_lock);
+ else
+ drm_gpusvm_notifier_lock(gpusvm);
+
+ __drm_gpusvm_range_unmap_pages(gpusvm, range, npages);
+
+ if (!ctx->in_notifier)
+ drm_gpusvm_notifier_unlock(gpusvm);
+}
+
+/**
+ * drm_gpusvm_migration_put_page - Put a migration page
+ * @page: Pointer to the page to put
+ *
+ * This function unlocks and puts a page.
+ */
+static void drm_gpusvm_migration_put_page(struct page *page)
+{
+ unlock_page(page);
+ put_page(page);
+}
+
+/**
+ * drm_gpusvm_migration_put_pages - Put migration pages
+ * @npages: Number of pages
+ * @migrate_pfn: Array of migrate page frame numbers
+ *
+ * This function puts an array of pages.
+ */
+static void drm_gpusvm_migration_put_pages(unsigned long npages,
+ unsigned long *migrate_pfn)
+{
+ unsigned long i;
+
+ for (i = 0; i < npages; ++i) {
+ if (!migrate_pfn[i])
+ continue;
+
+ drm_gpusvm_migration_put_page(migrate_pfn_to_page(migrate_pfn[i]));
+ migrate_pfn[i] = 0;
+ }
+}
+
+/**
+ * drm_gpusvm_get_devmem_page - Get a reference to a device memory page
+ * @page: Pointer to the page
+ * @zdd: Pointer to the GPU SVM zone device data
+ *
+ * This function associates the given page with the specified GPU SVM zone
+ * device data and initializes it for zone device usage.
+ */
+static void drm_gpusvm_get_devmem_page(struct page *page,
+ struct drm_gpusvm_zdd *zdd)
+{
+ page->zone_device_data = drm_gpusvm_zdd_get(zdd);
+ zone_device_page_init(page);
+}
+
+/**
+ * drm_gpusvm_migrate_map_pages() - Map migration pages for GPU SVM migration
+ * @dev: The device for which the pages are being mapped
+ * @dma_addr: Array to store DMA addresses corresponding to mapped pages
+ * @migrate_pfn: Array of migrate page frame numbers to map
+ * @npages: Number of pages to map
+ * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
+ *
+ * This function maps pages of memory for migration usage in GPU SVM. It
+ * iterates over each page frame number provided in @migrate_pfn, maps the
+ * corresponding page, and stores the DMA address in the provided @dma_addr
+ * array.
+ *
+ * Return: 0 on success, -EFAULT if an error occurs during mapping.
+ */
+static int drm_gpusvm_migrate_map_pages(struct device *dev,
+ dma_addr_t *dma_addr,
+ unsigned long *migrate_pfn,
+ unsigned long npages,
+ enum dma_data_direction dir)
+{
+ unsigned long i;
+
+ for (i = 0; i < npages; ++i) {
+ struct page *page = migrate_pfn_to_page(migrate_pfn[i]);
+
+ if (!page)
+ continue;
+
+ if (WARN_ON_ONCE(is_zone_device_page(page)))
+ return -EFAULT;
+
+ dma_addr[i] = dma_map_page(dev, page, 0, PAGE_SIZE, dir);
+ if (dma_mapping_error(dev, dma_addr[i]))
+ return -EFAULT;
+ }
+
+ return 0;
+}
+
+/**
+ * drm_gpusvm_migrate_unmap_pages() - Unmap pages previously mapped for GPU SVM migration
+ * @dev: The device for which the pages were mapped
+ * @dma_addr: Array of DMA addresses corresponding to mapped pages
+ * @npages: Number of pages to unmap
+ * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
+ *
+ * This function unmaps previously mapped pages of memory for GPU Shared Virtual
+ * Memory (SVM). It iterates over each DMA address provided in @dma_addr, checks
+ * if it's valid and not already unmapped, and unmaps the corresponding page.
+ */
+static void drm_gpusvm_migrate_unmap_pages(struct device *dev,
+ dma_addr_t *dma_addr,
+ unsigned long npages,
+ enum dma_data_direction dir)
+{
+ unsigned long i;
+
+ for (i = 0; i < npages; ++i) {
+ if (!dma_addr[i] || dma_mapping_error(dev, dma_addr[i]))
+ continue;
+
+ dma_unmap_page(dev, dma_addr[i], PAGE_SIZE, dir);
+ }
+}
+
+/**
+ * drm_gpusvm_migrate_to_devmem - Migrate GPU SVM range to device memory
+ * @gpusvm: Pointer to the GPU SVM structure
+ * @range: Pointer to the GPU SVM range structure
+ * @devmem_allocation: Pointer to the device memory allocation. The caller
+ * should hold a reference to the device memory allocation,
+ * which should be dropped via ops->devmem_release or upon
+ * the failure of this function.
+ * @ctx: GPU SVM context
+ *
+ * This function migrates the specified GPU SVM range to device memory. It performs the
+ * necessary setup and invokes the driver-specific operations for migration to
+ * device memory. Upon successful return, @devmem_allocation can safely reference @range
+ * until ops->devmem_release is called, which only happens after a successful return.
+ *
+ * Returns:
+ * 0 on success, negative error code on failure.
+ */
+int drm_gpusvm_migrate_to_devmem(struct drm_gpusvm *gpusvm,
+ struct drm_gpusvm_range *range,
+ struct drm_gpusvm_devmem *devmem_allocation,
+ const struct drm_gpusvm_ctx *ctx)
+{
+ const struct drm_gpusvm_devmem_ops *ops = devmem_allocation->ops;
+ u64 start = range->va.start, end = range->va.end;
+ struct migrate_vma migrate = {
+ .start = start,
+ .end = end,
+ .pgmap_owner = gpusvm->device_private_page_owner,
+ .flags = MIGRATE_VMA_SELECT_SYSTEM,
+ };
+ struct mm_struct *mm = gpusvm->mm;
+ unsigned long i, npages = npages_in_range(start, end);
+ struct vm_area_struct *vas;
+ struct drm_gpusvm_zdd *zdd = NULL;
+ struct page **pages;
+ dma_addr_t *dma_addr;
+ void *buf;
+ int err;
+
+ if (!range->flags.migrate_devmem)
+ return -EINVAL;
+
+ if (!ops->populate_devmem_pfn || !ops->copy_to_devmem || !ops->copy_to_ram)
+ return -EOPNOTSUPP;
+
+ if (!mmget_not_zero(mm)) {
+ err = -EFAULT;
+ goto err_out;
+ }
+ mmap_read_lock(mm);
+
+ vas = vma_lookup(mm, start);
+ if (!vas) {
+ err = -ENOENT;
+ goto err_mmunlock;
+ }
+
+ if (end > vas->vm_end || start < vas->vm_start) {
+ err = -EINVAL;
+ goto err_mmunlock;
+ }
+
+ if (!vma_is_anonymous(vas)) {
+ err = -EBUSY;
+ goto err_mmunlock;
+ }
+
+ buf = kvcalloc(npages, 2 * sizeof(*migrate.src) + sizeof(*dma_addr) +
+ sizeof(*pages), GFP_KERNEL);
+ if (!buf) {
+ err = -ENOMEM;
+ goto err_mmunlock;
+ }
+ dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
+ pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr)) * npages;
+
+ zdd = drm_gpusvm_zdd_alloc(gpusvm->device_private_page_owner);
+ if (!zdd) {
+ err = -ENOMEM;
+ goto err_free;
+ }
+
+ migrate.vma = vas;
+ migrate.src = buf;
+ migrate.dst = migrate.src + npages;
+
+ err = migrate_vma_setup(&migrate);
+ if (err)
+ goto err_free;
+
+ /*
+	 * FIXME: The cases below, !migrate.cpages and migrate.cpages != npages, are
+	 * not always an error. Need to revisit possible cases and how to handle. We
+ * could prefault on migrate.cpages != npages via hmm_range_fault.
+ */
+
+ if (!migrate.cpages) {
+ err = -EFAULT;
+ goto err_free;
+ }
+
+ if (migrate.cpages != npages) {
+ err = -EBUSY;
+ goto err_finalize;
+ }
+
+ err = ops->populate_devmem_pfn(devmem_allocation, npages, migrate.dst);
+ if (err)
+ goto err_finalize;
+
+ err = drm_gpusvm_migrate_map_pages(devmem_allocation->dev, dma_addr,
+ migrate.src, npages, DMA_TO_DEVICE);
+ if (err)
+ goto err_finalize;
+
+ for (i = 0; i < npages; ++i) {
+ struct page *page = pfn_to_page(migrate.dst[i]);
+
+ pages[i] = page;
+ migrate.dst[i] = migrate_pfn(migrate.dst[i]);
+ drm_gpusvm_get_devmem_page(page, zdd);
+ }
+
+ err = ops->copy_to_devmem(pages, dma_addr, npages);
+ if (err)
+ goto err_finalize;
+
+ /* Upon success bind devmem allocation to range and zdd */
+ WRITE_ONCE(zdd->devmem_allocation, devmem_allocation); /* Owns ref */
+
+err_finalize:
+ if (err)
+ drm_gpusvm_migration_put_pages(npages, migrate.dst);
+ migrate_vma_pages(&migrate);
+ migrate_vma_finalize(&migrate);
+ drm_gpusvm_migrate_unmap_pages(devmem_allocation->dev, dma_addr, npages,
+ DMA_TO_DEVICE);
+err_free:
+ if (zdd)
+ drm_gpusvm_zdd_put(zdd);
+ kvfree(buf);
+err_mmunlock:
+ mmap_read_unlock(mm);
+ mmput(mm);
+err_out:
+ return err;
+}
+
+/**
+ * drm_gpusvm_migrate_populate_ram_pfn - Populate RAM PFNs for a VM area
+ * @vas: Pointer to the VM area structure, can be NULL
+ * @npages: Number of pages to populate
+ * @mpages: Pointer to the count of pages migrated; incremented for each page populated
+ * @src_mpfn: Source array of migrate PFNs
+ * @mpfn: Array of migrate PFNs to populate
+ * @addr: Start address for PFN allocation
+ *
+ * This function populates the RAM migrate page frame numbers (PFNs) for the
+ * specified VM area structure. It allocates and locks pages in the VM area for
+ * RAM usage. If @vas is non-NULL, alloc_page_vma() is used for allocation; if it
+ * is NULL, alloc_page() is used.
+ *
+ * Returns:
+ * 0 on success, negative error code on failure.
+ */
+static int drm_gpusvm_migrate_populate_ram_pfn(struct vm_area_struct *vas,
+ unsigned long npages,
+ unsigned long *mpages,
+ unsigned long *src_mpfn,
+ unsigned long *mpfn, u64 addr)
+{
+ unsigned long i;
+
+ for (i = 0; i < npages; ++i, addr += PAGE_SIZE) {
+ struct page *page;
+
+ if (!(src_mpfn[i] & MIGRATE_PFN_MIGRATE))
+ continue;
+
+ if (vas)
+ page = alloc_page_vma(GFP_HIGHUSER, vas, addr);
+ else
+ page = alloc_page(GFP_HIGHUSER);
+
+ if (!page)
+ return -ENOMEM;
+
+ lock_page(page);
+ mpfn[i] = migrate_pfn(page_to_pfn(page));
+ ++*mpages;
+ }
+
+ return 0;
+}
+
+/**
+ * drm_gpusvm_evict_to_ram - Evict GPU SVM range to RAM
+ * @devmem_allocation: Pointer to the device memory allocation
+ *
+ * Similar to __drm_gpusvm_migrate_to_ram but does not require the mmap lock, and
+ * migration is done via the migrate_device_* functions.
+ *
+ * Returns:
+ * 0 on success, negative error code on failure.
+ */
+int drm_gpusvm_evict_to_ram(struct drm_gpusvm_devmem *devmem_allocation)
+{
+ const struct drm_gpusvm_devmem_ops *ops = devmem_allocation->ops;
+ unsigned long npages, mpages = 0;
+ struct page **pages;
+ unsigned long *src, *dst;
+ dma_addr_t *dma_addr;
+ void *buf;
+ int i, err = 0;
+
+ npages = devmem_allocation->size >> PAGE_SHIFT;
+
+retry:
+ if (!mmget_not_zero(devmem_allocation->mm))
+ return -EFAULT;
+
+ buf = kvcalloc(npages, 2 * sizeof(*src) + sizeof(*dma_addr) +
+ sizeof(*pages), GFP_KERNEL);
+ if (!buf) {
+ err = -ENOMEM;
+ goto err_out;
+ }
+ src = buf;
+ dst = buf + (sizeof(*src) * npages);
+ dma_addr = buf + (2 * sizeof(*src) * npages);
+ pages = buf + (2 * sizeof(*src) + sizeof(*dma_addr)) * npages;
+
+ err = ops->populate_devmem_pfn(devmem_allocation, npages, src);
+ if (err)
+ goto err_free;
+
+ err = migrate_device_prepopulated_range(src, npages);
+ if (err)
+ goto err_free;
+
+ err = drm_gpusvm_migrate_populate_ram_pfn(NULL, npages, &mpages, src,
+ dst, 0);
+ if (err || !mpages)
+ goto err_finalize;
+
+ err = drm_gpusvm_migrate_map_pages(devmem_allocation->dev, dma_addr,
+ dst, npages, DMA_FROM_DEVICE);
+ if (err)
+ goto err_finalize;
+
+ for (i = 0; i < npages; ++i)
+ pages[i] = migrate_pfn_to_page(src[i]);
+
+ err = ops->copy_to_ram(pages, dma_addr, npages);
+ if (err)
+ goto err_finalize;
+
+err_finalize:
+ if (err)
+ drm_gpusvm_migration_put_pages(npages, dst);
+ migrate_device_pages(src, dst, npages);
+ migrate_device_finalize(src, dst, npages);
+ drm_gpusvm_migrate_unmap_pages(devmem_allocation->dev, dma_addr, npages,
+ DMA_FROM_DEVICE);
+err_free:
+ kvfree(buf);
+err_out:
+ mmput_async(devmem_allocation->mm);
+ if (!err && !READ_ONCE(devmem_allocation->detached)) {
+ cond_resched();
+ goto retry;
+ }
+
+ return err;
+}
+
+/**
+ * __drm_gpusvm_migrate_to_ram - Migrate GPU SVM range to RAM (internal)
+ * @vas: Pointer to the VM area structure
+ * @device_private_page_owner: Device private pages owner
+ * @page: Pointer to the page for fault handling (can be NULL)
+ * @fault_addr: Fault address
+ * @size: Size of migration
+ *
+ * This internal function performs the migration of the specified GPU SVM range
+ * to RAM. It sets up the migration, populates and DMA maps the RAM PFNs, and
+ * invokes the driver-specific operations for migration to RAM.
+ *
+ * Returns:
+ * 0 on success, negative error code on failure.
+ */
+static int __drm_gpusvm_migrate_to_ram(struct vm_area_struct *vas,
+ void *device_private_page_owner,
+ struct page *page, u64 fault_addr,
+ u64 size)
+{
+ struct migrate_vma migrate = {
+ .vma = vas,
+ .pgmap_owner = device_private_page_owner,
+ .flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE |
+ MIGRATE_VMA_SELECT_DEVICE_COHERENT,
+ .fault_page = page,
+ };
+ struct drm_gpusvm_zdd *zdd;
+ const struct drm_gpusvm_devmem_ops *ops;
+ struct device *dev;
+ unsigned long npages, mpages = 0;
+ struct page **pages;
+ dma_addr_t *dma_addr;
+ u64 start, end;
+ void *buf;
+ int i, err = 0;
+
+ start = ALIGN_DOWN(fault_addr, size);
+ end = ALIGN(fault_addr + 1, size);
+
+	/* Corner case where the VMA has been partially unmapped */
+ if (start < vas->vm_start)
+ start = vas->vm_start;
+ if (end > vas->vm_end)
+ end = vas->vm_end;
+
+ migrate.start = start;
+ migrate.end = end;
+ npages = npages_in_range(start, end);
+
+ buf = kvcalloc(npages, 2 * sizeof(*migrate.src) + sizeof(*dma_addr) +
+ sizeof(*pages), GFP_KERNEL);
+ if (!buf) {
+ err = -ENOMEM;
+ goto err_out;
+ }
+ dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
+ pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr)) * npages;
+
+ migrate.vma = vas;
+ migrate.src = buf;
+ migrate.dst = migrate.src + npages;
+
+ err = migrate_vma_setup(&migrate);
+ if (err)
+ goto err_free;
+
+ /* Raced with another CPU fault, nothing to do */
+ if (!migrate.cpages)
+ goto err_free;
+
+ if (!page) {
+ for (i = 0; i < npages; ++i) {
+ if (!(migrate.src[i] & MIGRATE_PFN_MIGRATE))
+ continue;
+
+ page = migrate_pfn_to_page(migrate.src[i]);
+ break;
+ }
+
+ if (!page)
+ goto err_finalize;
+ }
+ zdd = page->zone_device_data;
+ ops = zdd->devmem_allocation->ops;
+ dev = zdd->devmem_allocation->dev;
+
+ err = drm_gpusvm_migrate_populate_ram_pfn(vas, npages, &mpages,
+ migrate.src, migrate.dst,
+ start);
+ if (err)
+ goto err_finalize;
+
+ err = drm_gpusvm_migrate_map_pages(dev, dma_addr, migrate.dst, npages,
+ DMA_FROM_DEVICE);
+ if (err)
+ goto err_finalize;
+
+ for (i = 0; i < npages; ++i)
+ pages[i] = migrate_pfn_to_page(migrate.src[i]);
+
+ err = ops->copy_to_ram(pages, dma_addr, npages);
+ if (err)
+ goto err_finalize;
+
+err_finalize:
+ if (err)
+ drm_gpusvm_migration_put_pages(npages, migrate.dst);
+ migrate_vma_pages(&migrate);
+ migrate_vma_finalize(&migrate);
+ drm_gpusvm_migrate_unmap_pages(dev, dma_addr, npages,
+ DMA_FROM_DEVICE);
+err_free:
+ kvfree(buf);
+err_out:
+
+ return err;
+}
+
+/**
+ * drm_gpusvm_range_evict - Evict GPU SVM range
+ * @gpusvm: Pointer to the GPU SVM structure
+ * @range: Pointer to the GPU SVM range to be evicted
+ *
+ * This function evicts the specified GPU SVM range.
+ */
+void drm_gpusvm_range_evict(struct drm_gpusvm *gpusvm,
+ struct drm_gpusvm_range *range)
+{
+ struct mm_struct *mm = gpusvm->mm;
+ struct vm_area_struct *vas;
+
+ if (!mmget_not_zero(mm))
+ return;
+
+ mmap_read_lock(mm);
+ vas = vma_lookup(mm, range->va.start);
+ if (!vas)
+ goto unlock;
+
+ __drm_gpusvm_migrate_to_ram(vas, gpusvm->device_private_page_owner,
+ NULL, range->va.start,
+ range->va.end - range->va.start);
+unlock:
+ mmap_read_unlock(mm);
+ mmput(mm);
+}
+
+/**
+ * drm_gpusvm_page_free - Put GPU SVM zone device data associated with a page
+ * @page: Pointer to the page
+ *
+ * This function is a callback used to put the GPU SVM zone device data
+ * associated with a page when it is being released.
+ */
+static void drm_gpusvm_page_free(struct page *page)
+{
+ drm_gpusvm_zdd_put(page->zone_device_data);
+}
+
+/**
+ * drm_gpusvm_migrate_to_ram - Migrate GPU SVM range to RAM (page fault handler)
+ * @vmf: Pointer to the fault information structure
+ *
+ * This function is a page fault handler used to migrate a GPU SVM range to RAM.
+ * It retrieves the GPU SVM range information from the faulting page and invokes
+ * the internal migration function to migrate the range back to RAM.
+ *
+ * Returns:
+ * VM_FAULT_SIGBUS on failure, 0 on success.
+ */
+static vm_fault_t drm_gpusvm_migrate_to_ram(struct vm_fault *vmf)
+{
+ struct drm_gpusvm_zdd *zdd = vmf->page->zone_device_data;
+ int err;
+
+ err = __drm_gpusvm_migrate_to_ram(vmf->vma,
+ zdd->device_private_page_owner,
+ vmf->page, vmf->address,
+ zdd->devmem_allocation->size);
+
+ return err ? VM_FAULT_SIGBUS : 0;
+}
+
+/**
+ * drm_gpusvm_pagemap_ops - Device page map operations for GPU SVM
+ */
+static const struct dev_pagemap_ops drm_gpusvm_pagemap_ops = {
+ .page_free = drm_gpusvm_page_free,
+ .migrate_to_ram = drm_gpusvm_migrate_to_ram,
+};
+
+/**
+ * drm_gpusvm_pagemap_ops_get - Retrieve GPU SVM device page map operations
+ *
+ * Returns:
+ * Pointer to the GPU SVM device page map operations structure.
+ */
+const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void)
+{
+ return &drm_gpusvm_pagemap_ops;
+}
+
+/**
+ * drm_gpusvm_has_mapping - Check if GPU SVM has mapping for the given address range
+ * @gpusvm: Pointer to the GPU SVM structure.
+ * @start: Start address
+ * @end: End address
+ *
+ * Returns:
+ * True if GPU SVM has mapping, False otherwise
+ */
+bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start, u64 end)
+{
+ struct drm_gpusvm_notifier *notifier;
+
+ drm_gpusvm_for_each_notifier(notifier, gpusvm, start, end) {
+ struct drm_gpusvm_range *range = NULL;
+
+ drm_gpusvm_for_each_range(range, notifier, start, end)
+ return true;
+ }
+
+ return false;
+}
diff --git a/drivers/gpu/drm/xe/drm_gpusvm.h b/drivers/gpu/drm/xe/drm_gpusvm.h
new file mode 100644
index 000000000000..15ec22d4f9a5
--- /dev/null
+++ b/drivers/gpu/drm/xe/drm_gpusvm.h
@@ -0,0 +1,447 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2024 Intel Corporation
+ */
+
+#ifndef __DRM_GPUSVM_H__
+#define __DRM_GPUSVM_H__
+
+#include <linux/kref.h>
+#include <linux/mmu_notifier.h>
+#include <linux/workqueue.h>
+
+struct dev_pagemap_ops;
+struct drm_device;
+struct drm_gpusvm;
+struct drm_gpusvm_notifier;
+struct drm_gpusvm_ops;
+struct drm_gpusvm_range;
+struct drm_gpusvm_devmem;
+struct drm_pagemap;
+struct drm_pagemap_dma_addr;
+
+/**
+ * struct drm_gpusvm_devmem_ops - Operations structure for GPU SVM device memory
+ *
+ * This structure defines the operations for GPU Shared Virtual Memory (SVM)
+ * device memory. These operations are provided by the GPU driver to manage device memory
+ * allocations and perform operations such as migration between device memory and system
+ * RAM.
+ */
+struct drm_gpusvm_devmem_ops {
+ /**
+ * @devmem_release: Release device memory allocation (optional)
+ * @devmem_allocation: device memory allocation
+ *
+	 * This function shall release the device memory allocation and is expected to
+	 * drop a reference to the device memory allocation.
+ */
+ void (*devmem_release)(struct drm_gpusvm_devmem *devmem_allocation);
+
+ /**
+ * @populate_devmem_pfn: Populate device memory PFN (required for migration)
+ * @devmem_allocation: device memory allocation
+ * @npages: Number of pages to populate
+ * @pfn: Array of page frame numbers to populate
+ *
+ * This function shall populate device memory page frame numbers (PFN).
+ *
+ * Returns:
+ * 0 on success, a negative error code on failure.
+ */
+ int (*populate_devmem_pfn)(struct drm_gpusvm_devmem *devmem_allocation,
+ unsigned long npages, unsigned long *pfn);
+
+ /**
+ * @copy_to_devmem: Copy to device memory (required for migration)
+ * @pages: Pointer to array of device memory pages (destination)
+ * @dma_addr: Pointer to array of DMA addresses (source)
+ * @npages: Number of pages to copy
+ *
+ * This function shall copy pages to device memory.
+ *
+ * Returns:
+ * 0 on success, a negative error code on failure.
+ */
+ int (*copy_to_devmem)(struct page **pages,
+ dma_addr_t *dma_addr,
+ unsigned long npages);
+
+ /**
+ * @copy_to_ram: Copy to system RAM (required for migration)
+ * @pages: Pointer to array of device memory pages (source)
+ * @dma_addr: Pointer to array of DMA addresses (destination)
+ * @npages: Number of pages to copy
+ *
+ * This function shall copy pages to system RAM.
+ *
+ * Returns:
+ * 0 on success, a negative error code on failure.
+ */
+ int (*copy_to_ram)(struct page **pages,
+ dma_addr_t *dma_addr,
+ unsigned long npages);
+};
+
+/**
+ * struct drm_gpusvm_devmem - Structure representing a GPU SVM device memory allocation
+ *
+ * @dev: Pointer to the device structure to which the device memory allocation belongs
+ * @mm: Pointer to the mm_struct for the address space
+ * @ops: Pointer to the operations structure for GPU SVM device memory
+ * @dpagemap: The struct drm_pagemap of the pages this allocation belongs to.
+ * @size: Size of device memory allocation
+ * @detached: device memory allocation is detached from device pages
+ */
+struct drm_gpusvm_devmem {
+ struct device *dev;
+ struct mm_struct *mm;
+ const struct drm_gpusvm_devmem_ops *ops;
+ struct drm_pagemap *dpagemap;
+ size_t size;
+ bool detached;
+};
+
+/**
+ * struct drm_gpusvm_ops - Operations structure for GPU SVM
+ *
+ * This structure defines the operations for GPU Shared Virtual Memory (SVM).
+ * These operations are provided by the GPU driver to manage SVM ranges and
+ * notifiers.
+ */
+struct drm_gpusvm_ops {
+ /**
+ * @notifier_alloc: Allocate a GPU SVM notifier (optional)
+ *
+ * This function shall allocate a GPU SVM notifier.
+ *
+ * Returns:
+ * Pointer to the allocated GPU SVM notifier on success, NULL on failure.
+ */
+ struct drm_gpusvm_notifier *(*notifier_alloc)(void);
+
+ /**
+ * @notifier_free: Free a GPU SVM notifier (optional)
+ * @notifier: Pointer to the GPU SVM notifier to be freed
+ *
+ * This function shall free a GPU SVM notifier.
+ */
+ void (*notifier_free)(struct drm_gpusvm_notifier *notifier);
+
+ /**
+ * @range_alloc: Allocate a GPU SVM range (optional)
+ * @gpusvm: Pointer to the GPU SVM
+ *
+ * This function shall allocate a GPU SVM range.
+ *
+ * Returns:
+ * Pointer to the allocated GPU SVM range on success, NULL on failure.
+ */
+ struct drm_gpusvm_range *(*range_alloc)(struct drm_gpusvm *gpusvm);
+
+ /**
+ * @range_free: Free a GPU SVM range (optional)
+ * @range: Pointer to the GPU SVM range to be freed
+ *
+ * This function shall free a GPU SVM range.
+ */
+ void (*range_free)(struct drm_gpusvm_range *range);
+
+ /**
+ * @invalidate: Invalidate GPU SVM notifier (required)
+ * @gpusvm: Pointer to the GPU SVM
+ * @notifier: Pointer to the GPU SVM notifier
+ * @mmu_range: Pointer to the mmu_notifier_range structure
+ *
+ * This function shall invalidate the GPU page tables. It can safely
+ * walk the notifier range RB tree/list in this function. Called while
+ * holding the notifier lock.
+ */
+ void (*invalidate)(struct drm_gpusvm *gpusvm,
+ struct drm_gpusvm_notifier *notifier,
+ const struct mmu_notifier_range *mmu_range);
+};
+
+/**
+ * struct drm_gpusvm_notifier - Structure representing a GPU SVM notifier
+ *
+ * @gpusvm: Pointer to the GPU SVM structure
+ * @notifier: MMU interval notifier
+ * @interval: Interval for the notifier
+ * @rb: Red-black tree node for the parent GPU SVM structure notifier tree
+ * @root: Cached root node of the RB tree containing ranges
+ * @range_list: List head of ranges in the same order they appear in the
+ *              interval tree. This is useful to keep iterating ranges while
+ *              doing modifications to the RB tree.
+ * @flags.removed: Flag indicating whether the MMU interval notifier has been
+ * removed
+ *
+ * This structure represents a GPU SVM notifier.
+ */
+struct drm_gpusvm_notifier {
+ struct drm_gpusvm *gpusvm;
+ struct mmu_interval_notifier notifier;
+ struct {
+ u64 start;
+ u64 end;
+ } interval;
+ struct {
+ struct rb_node node;
+ struct list_head entry;
+ u64 __subtree_last;
+ } rb;
+ struct rb_root_cached root;
+ struct list_head range_list;
+ struct {
+ u32 removed : 1;
+ } flags;
+};
+
+/**
+ * struct drm_gpusvm_range - Structure representing a GPU SVM range
+ *
+ * @gpusvm: Pointer to the GPU SVM structure
+ * @notifier: Pointer to the GPU SVM notifier
+ * @refcount: Reference count for the range
+ * @rb: Red-black tree node for the parent GPU SVM notifier structure range tree
+ * @va: Virtual address range
+ * @notifier_seq: Notifier sequence number of the range's pages
+ * @dma_addr: DMA address array
+ * @dpagemap: The struct drm_pagemap of the device pages we're dma-mapping.
+ * Note this is assuming only one drm_pagemap per range is allowed.
+ * @flags.migrate_devmem: Flag indicating whether the range can be migrated to device memory
+ * @flags.unmapped: Flag indicating if the range has been unmapped
+ * @flags.partial_unmap: Flag indicating if the range has been partially unmapped
+ * @flags.has_devmem_pages: Flag indicating if the range has devmem pages
+ * @flags.has_dma_mapping: Flag indicating if the range has a DMA mapping
+ *
+ * This structure represents a GPU SVM range used for tracking memory ranges
+ * mapped in a DRM device.
+ */
+struct drm_gpusvm_range {
+ struct drm_gpusvm *gpusvm;
+ struct drm_gpusvm_notifier *notifier;
+ struct kref refcount;
+ struct {
+ struct rb_node node;
+ struct list_head entry;
+ u64 __subtree_last;
+ } rb;
+ struct {
+ u64 start;
+ u64 end;
+ } va;
+ unsigned long notifier_seq;
+ struct drm_pagemap_dma_addr *dma_addr;
+ struct drm_pagemap *dpagemap;
+ struct {
+ /* All flags below must be set upon creation */
+ u16 migrate_devmem : 1;
+ /* All flags below must be set / cleared under notifier lock */
+ u16 unmapped : 1;
+ u16 partial_unmap : 1;
+ u16 has_devmem_pages : 1;
+ u16 has_dma_mapping : 1;
+ } flags;
+};
+
+/**
+ * struct drm_gpusvm - GPU SVM structure
+ *
+ * @name: Name of the GPU SVM
+ * @drm: Pointer to the DRM device structure
+ * @mm: Pointer to the mm_struct for the address space
+ * @device_private_page_owner: Device private pages owner
+ * @mm_start: Start address of GPU SVM
+ * @mm_range: Range of the GPU SVM
+ * @notifier_size: Size of individual notifiers
+ * @ops: Pointer to the operations structure for GPU SVM
+ * @chunk_sizes: Pointer to the array of chunk sizes used in range allocation.
+ * Entries should be powers of 2 in descending order.
+ * @num_chunks: Number of chunks
+ * @notifier_lock: Read-write semaphore for protecting notifier operations
+ * @root: Cached root node of the Red-Black tree containing GPU SVM notifiers
+ * @notifier_list: List head of notifiers in the same order they appear in the
+ *                 interval tree. This is useful to keep iterating
+ *                 notifiers while doing modifications to the RB tree.
+ *
+ * This structure represents a GPU SVM (Shared Virtual Memory) used for tracking
+ * memory ranges mapped in a DRM (Direct Rendering Manager) device.
+ *
+ * No reference counting is provided, as this is expected to be embedded in the
+ * driver VM structure along with the struct drm_gpuvm, which handles reference
+ * counting.
+ */
+struct drm_gpusvm {
+ const char *name;
+ struct drm_device *drm;
+ struct mm_struct *mm;
+ void *device_private_page_owner;
+ u64 mm_start;
+ u64 mm_range;
+ u64 notifier_size;
+ const struct drm_gpusvm_ops *ops;
+ const u64 *chunk_sizes;
+ int num_chunks;
+ struct rw_semaphore notifier_lock;
+ struct rb_root_cached root;
+ struct list_head notifier_list;
+};
+
+/**
+ * struct drm_gpusvm_ctx - DRM GPU SVM context
+ *
+ * @in_notifier: entering from a MMU notifier
+ * @read_only: operating on read-only memory
+ * @devmem_possible: possible to use device memory
+ * @check_pages: check pages and only create range for pages faulted in
+ *
+ * Context that DRM GPU SVM is operating in (i.e. user arguments).
+ */
+struct drm_gpusvm_ctx {
+ u32 in_notifier :1;
+ u32 read_only :1;
+ u32 devmem_possible :1;
+ u32 check_pages :1;
+};
+
+int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
+ const char *name, struct drm_device *drm,
+ struct mm_struct *mm, void *device_private_page_owner,
+ u64 mm_start, u64 mm_range, u64 notifier_size,
+ const struct drm_gpusvm_ops *ops,
+ const u64 *chunk_sizes, int num_chunks);
+void drm_gpusvm_fini(struct drm_gpusvm *gpusvm);
+void drm_gpusvm_free(struct drm_gpusvm *gpusvm);
+
+struct drm_gpusvm_range *
+drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64 fault_addr,
+ u64 gpuva_start, u64 gpuva_end,
+ const struct drm_gpusvm_ctx *ctx);
+void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
+ struct drm_gpusvm_range *range);
+void drm_gpusvm_range_evict(struct drm_gpusvm *gpusvm,
+ struct drm_gpusvm_range *range);
+
+struct drm_gpusvm_range *
+drm_gpusvm_range_get(struct drm_gpusvm_range *range);
+void drm_gpusvm_range_put(struct drm_gpusvm_range *range);
+
+bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
+ struct drm_gpusvm_range *range);
+
+int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
+ struct drm_gpusvm_range *range,
+ const struct drm_gpusvm_ctx *ctx);
+void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
+ struct drm_gpusvm_range *range,
+ const struct drm_gpusvm_ctx *ctx);
+
+int drm_gpusvm_migrate_to_devmem(struct drm_gpusvm *gpusvm,
+ struct drm_gpusvm_range *range,
+ struct drm_gpusvm_devmem *devmem_allocation,
+ const struct drm_gpusvm_ctx *ctx);
+int drm_gpusvm_evict_to_ram(struct drm_gpusvm_devmem *devmem_allocation);
+
+const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void);
+
+bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start, u64 end);
+
+struct drm_gpusvm_range *
+drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64 start, u64 end);
+
+/**
+ * drm_gpusvm_notifier_lock - Lock GPU SVM notifier
+ * @gpusvm__: Pointer to the GPU SVM structure.
+ *
+ * Abstracts client usage of the GPU SVM notifier lock; takes the lock in read mode.
+ */
+#define drm_gpusvm_notifier_lock(gpusvm__) \
+ down_read(&(gpusvm__)->notifier_lock)
+
+/**
+ * drm_gpusvm_notifier_unlock - Unlock GPU SVM notifier
+ * @gpusvm__: Pointer to the GPU SVM structure.
+ *
+ * Abstracts client usage of the GPU SVM notifier lock; drops the read lock.
+ */
+#define drm_gpusvm_notifier_unlock(gpusvm__) \
+ up_read(&(gpusvm__)->notifier_lock)
+
+/**
+ * __drm_gpusvm_range_next - Get the next GPU SVM range in the list
+ * @range: a pointer to the current GPU SVM range
+ *
+ * Return: A pointer to the next drm_gpusvm_range if available, or NULL if the
+ * current range is the last one or if the input range is NULL.
+ */
+static inline struct drm_gpusvm_range *
+__drm_gpusvm_range_next(struct drm_gpusvm_range *range)
+{
+ if (range && !list_is_last(&range->rb.entry,
+ &range->notifier->range_list))
+ return list_next_entry(range, rb.entry);
+
+ return NULL;
+}
+
+/**
+ * drm_gpusvm_for_each_range - Iterate over GPU SVM ranges in a notifier
+ * @range__: Iterator variable for the ranges. If set, it indicates the start of
+ *	      the iteration. If NULL, drm_gpusvm_range_find() is called to find the first range.
+ * @notifier__: Pointer to the GPU SVM notifier
+ * @start__: Start address of the range
+ * @end__: End address of the range
+ *
+ * This macro is used to iterate over GPU SVM ranges in a notifier. It is safe
+ * to use while holding the driver SVM lock or the notifier lock.
+ */
+#define drm_gpusvm_for_each_range(range__, notifier__, start__, end__) \
+ for ((range__) = (range__) ?: \
+ drm_gpusvm_range_find((notifier__), (start__), (end__)); \
+ (range__) && (range__->va.start < (end__)); \
+ (range__) = __drm_gpusvm_range_next(range__))
+
+/**
+ * drm_gpusvm_range_set_unmapped - Mark a GPU SVM range as unmapped
+ * @range: Pointer to the GPU SVM range structure.
+ * @mmu_range: Pointer to the MMU notifier range structure.
+ *
+ * This function marks a GPU SVM range as unmapped and sets the partial_unmap flag
+ * if the range partially falls within the provided MMU notifier range.
+ */
+static inline void
+drm_gpusvm_range_set_unmapped(struct drm_gpusvm_range *range,
+ const struct mmu_notifier_range *mmu_range)
+{
+ lockdep_assert_held_write(&range->gpusvm->notifier_lock);
+
+ range->flags.unmapped = true;
+ if (range->va.start < mmu_range->start ||
+ range->va.end > mmu_range->end)
+ range->flags.partial_unmap = true;
+}
+
+/**
+ * drm_gpusvm_devmem_init - Initialize a GPU SVM device memory allocation
+ * @devmem_allocation: The struct drm_gpusvm_devmem to initialize
+ * @dev: Pointer to the device structure to which the device memory allocation belongs
+ * @mm: Pointer to the mm_struct for the address space
+ * @ops: Pointer to the operations structure for GPU SVM device memory
+ * @dpagemap: The struct drm_pagemap we're allocating from.
+ * @size: Size of device memory allocation
+ */
+static inline void
+drm_gpusvm_devmem_init(struct drm_gpusvm_devmem *devmem_allocation,
+ struct device *dev, struct mm_struct *mm,
+ const struct drm_gpusvm_devmem_ops *ops,
+ struct drm_pagemap *dpagemap, size_t size)
+{
+ devmem_allocation->dev = dev;
+ devmem_allocation->mm = mm;
+ devmem_allocation->ops = ops;
+ devmem_allocation->dpagemap = dpagemap;
+ devmem_allocation->size = size;
+}
+
+#endif /* __DRM_GPUSVM_H__ */
--
2.34.1
^ permalink raw reply related [flat|nested] 129+ messages in thread* Re: [PATCH v2 05/29] drm/gpusvm: Add support for GPU Shared Virtual Memory
2024-10-16 3:24 ` [PATCH v2 05/29] drm/gpusvm: Add support for GPU Shared Virtual Memory Matthew Brost
@ 2024-10-31 18:58 ` Thomas Hellström
2024-11-04 22:53 ` Matthew Brost
2024-11-04 15:25 ` Thomas Hellström
` (3 subsequent siblings)
4 siblings, 1 reply; 129+ messages in thread
From: Thomas Hellström @ 2024-10-31 18:58 UTC (permalink / raw)
To: Matthew Brost, intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, simona.vetter, felix.kuehling,
dakr
On Tue, 2024-10-15 at 20:24 -0700, Matthew Brost wrote:
> This patch introduces support for GPU Shared Virtual Memory (SVM) in
> the
> Direct Rendering Manager (DRM) subsystem. SVM allows for seamless
> sharing of memory between the CPU and GPU, enhancing performance and
> flexibility in GPU computing tasks.
>
> The patch adds the necessary infrastructure for SVM, including data
> structures and functions for managing SVM ranges and notifiers. It
> also
> provides mechanisms for allocating, deallocating, and migrating
> memory
> regions between system RAM and GPU VRAM.
>
> This is largely inspired by GPUVM.
>
> v2:
> - Take order into account in check pages
> - Clear range->pages in get pages error
> - Drop setting dirty or accessed bit in get pages (Vetter)
> - Remove mmap assert for cpu faults
> - Drop mmap write lock abuse (Vetter, Christian)
> - Decouple zdd from range (Vetter, Oak)
> - Add drm_gpusvm_range_evict, make it work with coherent pages
> - Export drm_gpusvm_evict_to_sram, only use in BO evict path
> (Vetter)
> - mmget/put in drm_gpusvm_evict_to_sram
> - Drop range->vram_alloation variable
> - Don't return in drm_gpusvm_evict_to_sram until all pages detached
> - Don't warn on mixing sram and device pages
> - Update kernel doc
> - Add coherent page support to get pages
> - Use DMA_FROM_DEVICE rather than DMA_BIDIRECTIONAL
> - Add struct drm_gpusvm_vram and ops (Thomas)
> - Update the range's seqno if the range is valid (Thomas)
> - Remove the is_unmapped check before hmm_range_fault (Thomas)
> - Use drm_pagemap (Thomas)
> - Drop kfree_mapping (Thomas)
> - dma mapp pages under notifier lock (Thomas)
> - Remove ctx.prefault
> - Remove ctx.mmap_locked
> - Add ctx.check_pages
> - s/vram/devmem (Thomas)
>
> Cc: Simona Vetter <simona.vetter@ffwll.ch>
> Cc: Dave Airlie <airlied@redhat.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: <dri-devel@lists.freedesktop.org>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> ---
> drivers/gpu/drm/xe/Makefile | 3 +-
> drivers/gpu/drm/xe/drm_gpusvm.c | 2074
> +++++++++++++++++++++++++++++++
> drivers/gpu/drm/xe/drm_gpusvm.h | 447 +++++++
> 3 files changed, 2523 insertions(+), 1 deletion(-)
> create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.c
> create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.h
>
> diff --git a/drivers/gpu/drm/xe/Makefile
> b/drivers/gpu/drm/xe/Makefile
> index da80c29aa363..8d991d4a92a5 100644
> --- a/drivers/gpu/drm/xe/Makefile
> +++ b/drivers/gpu/drm/xe/Makefile
> @@ -25,7 +25,8 @@ $(obj)/generated/%_wa_oob.c
> $(obj)/generated/%_wa_oob.h: $(obj)/xe_gen_wa_oob \
>
> # core driver code
>
> -xe-y += xe_bb.o \
> +xe-y += drm_gpusvm.o \
> + xe_bb.o \
> xe_bo.o \
> xe_bo_evict.o \
> xe_devcoredump.o \
> diff --git a/drivers/gpu/drm/xe/drm_gpusvm.c
> b/drivers/gpu/drm/xe/drm_gpusvm.c
> new file mode 100644
> index 000000000000..1ff104d2b42c
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/drm_gpusvm.c
> @@ -0,0 +1,2074 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2024 Intel Corporation
> + *
> + * Authors:
> + * Matthew Brost <matthew.brost@intel.com>
> + */
> +
> +#include <linux/dma-mapping.h>
> +#include <linux/interval_tree_generic.h>
> +#include <linux/hmm.h>
> +#include <linux/memremap.h>
> +#include <linux/migrate.h>
> +#include <linux/mm_types.h>
> +#include <linux/pagemap.h>
> +#include <linux/slab.h>
> +
> +#include <drm/drm_device.h>
> +#include "drm/drm_print.h"
> +#include "drm_gpusvm.h"
> +#include "drm_pagemap.h"
> +
> +/**
> + * DOC: Overview
> + *
> + * GPU Shared Virtual Memory (GPU SVM) layer for the Direct
> Rendering Manager (DRM)
> + *
> + * The GPU SVM layer is a component of the DRM framework designed to
> manage shared
> + * virtual memory between the CPU and GPU. It enables efficient data
> exchange and
> + * processing for GPU-accelerated applications by allowing memory
> sharing and
> + * synchronization between the CPU's and GPU's virtual address
> spaces.
> + *
> + * Key GPU SVM Components:
> + * - Notifiers: Notifiers: Used for tracking memory intervals and
> notifying the
> + * GPU of changes, notifiers are sized based on a GPU
> SVM
> + * initialization parameter, with a recommendation of
> 512M or
> + * larger. They maintain a Red-BlacK tree and a list of
> ranges that
> + * fall within the notifier interval. Notifiers are
> tracked within
> + * a GPU SVM Red-BlacK tree and list and are
> dynamically inserted
> + * or removed as ranges within the interval are created
> or
> + * destroyed.
> + * - Ranges: Represent memory ranges mapped in a DRM device and
> managed
> + * by GPU SVM. They are sized based on an array of chunk
> sizes, which
> + * is a GPU SVM initialization parameter, and the CPU
> address space.
> + * Upon GPU fault, the largest aligned chunk that fits
> within the
> + * faulting CPU address space is chosen for the range
> size. Ranges are
> + * expected to be dynamically allocated on GPU fault and
> removed on an
> + * MMU notifier UNMAP event. As mentioned above, ranges
> are tracked in
> + * a notifier's Red-Black tree.
> + * - Operations: Define the interface for driver-specific GPU SVM
> operations
> + * such as range allocation, notifier allocation, and
> + * invalidations.
> + * - Device Memory Allocations: Embedded structure containing enough
> information
> + * for GPU SVM to migrate to / from
> device memory.
> + * - Device Memory Operations: Define the interface for driver-
> specific device
> + * memory operations release memory,
> populate pfns,
> + * and copy to / from device memory.
> + *
> + * This layer provides interfaces for allocating, mapping,
> migrating, and
> + * releasing memory ranges between the CPU and GPU. It handles all
> core memory
> + * management interactions (DMA mapping, HMM, and migration) and
> provides
> + * driver-specific virtual functions (vfuncs). This infrastructure
> is sufficient
> + * to build the expected driver components for an SVM implementation
> as detailed
> + * below.
> + *
> + * Expected Driver Components:
> + * - GPU page fault handler: Used to create ranges and notifiers
> based on the
> + * fault address, optionally migrate the
> range to
> + * device memory, and create GPU bindings.
> + * - Garbage collector: Used to destroy GPU bindings for ranges.
Perhaps "clean up GPU bindings for ranges" to differentiate from
unmapping GPU bindings for ranges which needs to be done in the
notifier callback?
> Ranges are
> + * expected to be added to the garbage
> collector upon
> + * MMU_NOTIFY_UNMAP event.
> + */
> +
- Notifier callback, to unmap GPU bindings for ranges.
> +/**
> + * DOC: Locking
> + *
> + * GPU SVM handles locking for core MM interactions, i.e., it
> locks/unlocks the
> + * mmap lock as needed.
> + *
> + * GPU SVM introduces a global notifier lock, which safeguards the
> notifier's
> + * range RB tree and list, as well as the range's DMA mappings and
> sequence
> + * number. GPU SVM manages all necessary locking and unlocking
> operations,
How difficult would it be to make this per-notifier?
One of the design comments we got from Jason was to prioritize avoiding
core slowdowns and any fault processing might block an unrelated
notifier during dma mapping and page-table commit.
I think this should at least be considered as a follow-up.
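Just to illustrate: assuming a notifier_lock rw_semaphore were added to
struct drm_gpusvm_notifier, the lock helpers would become per-notifier,
roughly (sketch only, not asking for this in the series):

	#define drm_gpusvm_notifier_lock(notifier__) \
		down_read(&(notifier__)->notifier_lock)

	#define drm_gpusvm_notifier_unlock(notifier__) \
		up_read(&(notifier__)->notifier_lock)

so that fault processing in one notifier interval doesn't serialize
against dma mapping or page-table commit in an unrelated one.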
> + * except for the recheck of the range's sequence number
> + * (mmu_interval_read_retry) when the driver is committing GPU
> bindings. This
Perhaps add a discussion on valid pages rather than valid sequence
number, since the sequence number might be bumped even if pages stay
valid for the range, as the sequence number spans the whole notifier.
> + * lock corresponds to the 'driver->update' lock mentioned in the
> HMM
> + * documentation (TODO: Link). Future revisions may transition from
> a GPU SVM
> + * global lock to a per-notifier lock if finer-grained locking is
> deemed
> + * necessary.
> + *
> + * In addition to the locking mentioned above, the driver should
> implement a
> + * lock to safeguard core GPU SVM function calls that modify state,
> such as
> + * drm_gpusvm_range_find_or_insert and drm_gpusvm_range_remove.
> Alternatively,
> + * these core functions can be called within a single kernel thread,
> for
> + * instance, using an ordered work queue.
I don't think we should encourage the use of ordered workqueues to
protect data / state? Locks should really be the preferred mechanism?
> This lock is denoted as
> + * 'driver_svm_lock' in code examples. Finer grained driver side
> locking should
> + * also be possible for concurrent GPU fault processing within a
> single GPU SVM.
GPUVM (or rather the gem part of GPUVM that resides in drm_gem.h)
allows for the driver to register a lock map which, if present, is used
in the code to assert locks are correctly taken.
In the xe example, if we're using the vm lock to protect, then we could
register that and thoroughly annotate the gpusvm code with lockdep
asserts. That will probably help making the code a lot easier to
maintain moving forward.
> + */
> +/**
> + * DOC: Migrataion
s/Migrataion/Migration/
> + *
> + * The migration support is quite simple, allowing migration between
> RAM and
> + * device memory at the range granularity. For example, GPU SVM
> currently does not
> + * support mixing RAM and device memory pages within a range. This
> means that upon GPU
> + * fault, the entire range can be migrated to device memory, and
> upon CPU fault, the
> + * entire range is migrated to RAM. Mixed RAM and device memory
> storage within a range
> + * could be added in the future if required.
> + *
> + * The reasoning for only supporting range granularity is as
> follows: it
> + * simplifies the implementation, and range sizes are driver-defined
> and should
> + * be relatively small.
> + */
> +
> +/**
> + * DOC: Partial Unmapping of Ranges
> + *
> + * Partial unmapping of ranges (e.g., 1M out of 2M is unmapped by
> CPU resulting
> + * in MMU_NOTIFY_UNMAP event) presents several challenges, with the
> main one
> + * being that a subset of the range still has CPU and GPU mappings.
> If the
> + * backing store for the range is in device memory, a subset of the
> backing store has
> + * references. One option would be to split the range and device
> memory backing store,
> + * but the implementation for this would be quite complicated. Given
> that
> + * partial unmappings are rare and driver-defined range sizes are
> relatively
> + * small, GPU SVM does not support splitting of ranges.
> + *
> + * With no support for range splitting, upon partial unmapping of a
> range, the
> + * driver is expected to invalidate and destroy the entire range. If
> the range
> + * has device memory as its backing, the driver is also expected to
> migrate any
> + * remaining pages back to RAM.
> + */
> +
> +/**
> + * DOC: Examples
> + *
> + * This section provides two examples of how to build the expected
> driver
> + * components: the GPU page fault handler and the garbage collector.
> A third
> + * example demonstrates a sample invalidation driver vfunc.
> + *
> + * The generic code provided does not include logic for complex
> migration
> + * policies, optimized invalidations, fined grained driver locking,
> or other
> + * potentially required driver locking (e.g., DMA-resv locks).
> + *
> + * 1) GPU page fault handler
> + *
> + * int driver_bind_range(struct drm_gpusvm *gpusvm, struct
> drm_gpusvm_range *range)
> + * {
> + * int err = 0;
> + *
> + * driver_alloc_and_setup_memory_for_bind(gpusvm,
> range);
> + *
> + * drm_gpusvm_notifier_lock(gpusvm);
> + * if (drm_gpusvm_range_pages_valid(range))
> + * driver_commit_bind(gpusvm, range);
> + * else
> + * err = -EAGAIN;
> + * drm_gpusvm_notifier_unlock(gpusvm);
> + *
> + * return err;
> + * }
> + *
> + * int driver_gpu_fault(struct drm_gpusvm *gpusvm, u64
> fault_addr,
> + * u64 gpuva_start, u64 gpuva_end)
> + * {
> + * struct drm_gpusvm_ctx ctx = {};
> + * int err;
> + *
> + * driver_svm_lock();
> + * retry:
> + * // Always process UNMAPs first so view of GPU SVM
> ranges is current
> + * driver_garbage_collector(gpusvm);
> + *
> + * range = drm_gpusvm_range_find_or_insert(gpusvm,
> fault_addr,
> + * gpuva_start,
> gpuva_end,
> + * &ctx);
> + * if (IS_ERR(range)) {
> + * err = PTR_ERR(range);
> + * goto unlock;
> + * }
> + *
> + * if (driver_migration_policy(range)) {
> + * devmem = driver_alloc_devmem();
> + * err = drm_gpusvm_migrate_to_devmem(gpusvm,
> range,
> + *
> devmem_allocation,
> + * &ctx);
> + * if (err) // CPU mappings may have
> changed
> + * goto retry;
> + * }
> + *
> + * err = drm_gpusvm_range_get_pages(gpusvm, range,
> &ctx);
> + * if (err == -EOPNOTSUPP || err == -EFAULT || err == -
> EPERM) { // CPU mappings changed
> + * if (err == -EOPNOTSUPP)
> + * drm_gpusvm_range_evict(gpusvm,
> range);
> + * goto retry;
> + * } else if (err) {
> + * goto unlock;
> + * }
> + *
> + * err = driver_bind_range(gpusvm, range);
> + * if (err == -EAGAIN) // CPU mappings changed
> + * goto retry
> + *
> + * unlock:
> + * driver_svm_unlock();
> + * return err;
> + * }
> + *
> + * 2) Garbage Collector.
> + *
> + * void __driver_garbage_collector(struct drm_gpusvm *gpusvm,
> + * struct drm_gpusvm_range
> *range)
> + * {
> + * assert_driver_svm_locked(gpusvm);
> + *
> + * // Partial unmap, migrate any remaining device
> memory pages back to RAM
> + * if (range->flags.partial_unmap)
> + * drm_gpusvm_range_evict(gpusvm, range);
> + *
> + * driver_unbind_range(range);
> + * drm_gpusvm_range_remove(gpusvm, range);
> + * }
> + *
> + * void driver_garbage_collector(struct drm_gpusvm *gpusvm)
> + * {
> + * assert_driver_svm_locked(gpusvm);
> + *
> + * for_each_range_in_garbage_collector(gpusvm, range)
> + * __driver_garbage_collector(gpusvm, range);
> + * }
> + *
> + * 3) Invalidation driver vfunc.
> + *
> + * void driver_invalidation(struct drm_gpusvm *gpusvm,
> + * struct drm_gpusvm_notifier
> *notifier,
> + * const struct mmu_notifier_range
> *mmu_range)
> + * {
> + * struct drm_gpusvm_ctx ctx = { .in_notifier = true,
> };
> + * struct drm_gpusvm_range *range = NULL;
> + *
> + * driver_invalidate_device_tlb(gpusvm, mmu_range-
> >start, mmu_range->end);
> + *
> + * drm_gpusvm_for_each_range(range, notifier,
> mmu_range->start,
> + * mmu_range->end) {
> + * drm_gpusvm_range_unmap_pages(gpusvm, range,
> &ctx);
> + *
> + * if (mmu_range->event != MMU_NOTIFY_UNMAP)
> + * continue;
> + *
> + * drm_gpusvm_range_set_unmapped(range,
> mmu_range);
> + * driver_garbage_collector_add(gpusvm, range);
> + * }
> + * }
> + */
> +
> +#define DRM_GPUSVM_RANGE_START(_range) ((_range)->va.start)
> +#define DRM_GPUSVM_RANGE_END(_range) ((_range)->va.end - 1)
> +INTERVAL_TREE_DEFINE(struct drm_gpusvm_range, rb.node, u64,
> rb.__subtree_last,
> + DRM_GPUSVM_RANGE_START, DRM_GPUSVM_RANGE_END,
> + static __maybe_unused, range);
> +
> +#define DRM_GPUSVM_NOTIFIER_START(_notifier) ((_notifier)-
> >interval.start)
> +#define DRM_GPUSVM_NOTIFIER_END(_notifier) ((_notifier)-
> >interval.end - 1)
> +INTERVAL_TREE_DEFINE(struct drm_gpusvm_notifier, rb.node, u64,
> + rb.__subtree_last, DRM_GPUSVM_NOTIFIER_START,
> + DRM_GPUSVM_NOTIFIER_END, static __maybe_unused,
> notifier);
Did you look at removing these instantiations after the RFC comments,
and instead embed a struct interval_tree_node?
Perhaps the notifier tree could use a maple tree?
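For reference, assuming a struct interval_tree_node member (say "itree")
embedded in struct drm_gpusvm_range, insert and lookup would look roughly
like this (untested sketch; note the generic interval tree keys on
unsigned long and "last" is inclusive):

	range->itree.start = range->va.start;
	range->itree.last = range->va.end - 1;
	interval_tree_insert(&range->itree, &notifier->root);

	node = interval_tree_iter_first(&notifier->root, start, end - 1);
	range = node ? container_of(node, struct drm_gpusvm_range, itree) : NULL;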
> +
> +/**
> + * npages_in_range() - Calculate the number of pages in a given
> range
> + * @start__: The start address of the range
> + * @end__: The end address of the range
> + *
> + * This macro calculates the number of pages in a given memory
> range,
> + * specified by the start and end addresses. It divides the
> difference
> + * between the end and start addresses by the page size (PAGE_SIZE)
> to
> + * determine the number of pages in the range.
> + *
> + * Return: The number of pages in the specified range.
> + */
> +#define npages_in_range(start__, end__) \
> + (((end__) - (start__)) >> PAGE_SHIFT)
Could use a static function?
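E.g. something along these lines, matching the current callers (u64
addresses, page count stored in an unsigned long):

	static unsigned long npages_in_range(u64 start, u64 end)
	{
		return (end - start) >> PAGE_SHIFT;
	}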
> +
> +/**
> + * struct drm_gpusvm_zdd - GPU SVM zone device data
> + *
> + * @refcount: Reference count for the zdd
> + * @destroy_work: Work structure for asynchronous zdd destruction
> + * @devmem_allocation: device memory allocation
> + * @device_private_page_owner: Device private pages owner
> + *
> + * This structure serves as a generic wrapper installed in
> + * page->zone_device_data. It provides infrastructure for looking up
> a device
> + * memory allocation upon CPU page fault and asynchronously
> releasing device
> + * memory once the CPU has no page references. Asynchronous release
> is useful
> + * because CPU page references can be dropped in IRQ contexts, while
> releasing
> + * device memory likely requires sleeping locks.
> + */
> +struct drm_gpusvm_zdd {
> + struct kref refcount;
> + struct work_struct destroy_work;
> + struct drm_gpusvm_devmem *devmem_allocation;
> + void *device_private_page_owner;
> +};
I think the zdd and migration helpers should be moved to drm_pagemap.c.
We should consider looking at that once patches for drm_pagemap
functionality are posted.
TBC
/Thomas
^ permalink raw reply [flat|nested] 129+ messages in thread* Re: [PATCH v2 05/29] drm/gpusvm: Add support for GPU Shared Virtual Memory
2024-10-31 18:58 ` Thomas Hellström
@ 2024-11-04 22:53 ` Matthew Brost
0 siblings, 0 replies; 129+ messages in thread
From: Matthew Brost @ 2024-11-04 22:53 UTC (permalink / raw)
To: Thomas Hellström
Cc: intel-xe, dri-devel, apopple, airlied, christian.koenig,
simona.vetter, felix.kuehling, dakr
On Thu, Oct 31, 2024 at 07:58:45PM +0100, Thomas Hellström wrote:
> On Tue, 2024-10-15 at 20:24 -0700, Matthew Brost wrote:
> > This patch introduces support for GPU Shared Virtual Memory (SVM) in
> > the
> > Direct Rendering Manager (DRM) subsystem. SVM allows for seamless
> > sharing of memory between the CPU and GPU, enhancing performance and
> > flexibility in GPU computing tasks.
> >
> > The patch adds the necessary infrastructure for SVM, including data
> > structures and functions for managing SVM ranges and notifiers. It
> > also
> > provides mechanisms for allocating, deallocating, and migrating
> > memory
> > regions between system RAM and GPU VRAM.
> >
> > This is largely inspired by GPUVM.
> >
> > v2:
> > - Take order into account in check pages
> > - Clear range->pages in get pages error
> > - Drop setting dirty or accessed bit in get pages (Vetter)
> > - Remove mmap assert for cpu faults
> > - Drop mmap write lock abuse (Vetter, Christian)
> > - Decouple zdd from range (Vetter, Oak)
> > - Add drm_gpusvm_range_evict, make it work with coherent pages
> > - Export drm_gpusvm_evict_to_sram, only use in BO evict path
> > (Vetter)
> > - mmget/put in drm_gpusvm_evict_to_sram
> > - Drop range->vram_alloation variable
> > - Don't return in drm_gpusvm_evict_to_sram until all pages detached
> > - Don't warn on mixing sram and device pages
> > - Update kernel doc
> > - Add coherent page support to get pages
> > - Use DMA_FROM_DEVICE rather than DMA_BIDIRECTIONAL
> > - Add struct drm_gpusvm_vram and ops (Thomas)
> > - Update the range's seqno if the range is valid (Thomas)
> > - Remove the is_unmapped check before hmm_range_fault (Thomas)
> > - Use drm_pagemap (Thomas)
> > - Drop kfree_mapping (Thomas)
> > - dma mapp pages under notifier lock (Thomas)
> > - Remove ctx.prefault
> > - Remove ctx.mmap_locked
> > - Add ctx.check_pages
> > - s/vram/devmem (Thomas)
> >
> > Cc: Simona Vetter <simona.vetter@ffwll.ch>
> > Cc: Dave Airlie <airlied@redhat.com>
> > Cc: Christian König <christian.koenig@amd.com>
> > Cc: <dri-devel@lists.freedesktop.org>
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > ---
> > drivers/gpu/drm/xe/Makefile | 3 +-
> > drivers/gpu/drm/xe/drm_gpusvm.c | 2074
> > +++++++++++++++++++++++++++++++
> > drivers/gpu/drm/xe/drm_gpusvm.h | 447 +++++++
> > 3 files changed, 2523 insertions(+), 1 deletion(-)
> > create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.c
> > create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.h
> >
> > diff --git a/drivers/gpu/drm/xe/Makefile
> > b/drivers/gpu/drm/xe/Makefile
> > index da80c29aa363..8d991d4a92a5 100644
> > --- a/drivers/gpu/drm/xe/Makefile
> > +++ b/drivers/gpu/drm/xe/Makefile
> > @@ -25,7 +25,8 @@ $(obj)/generated/%_wa_oob.c
> > $(obj)/generated/%_wa_oob.h: $(obj)/xe_gen_wa_oob \
> >
> > # core driver code
> >
> > -xe-y += xe_bb.o \
> > +xe-y += drm_gpusvm.o \
> > + xe_bb.o \
> > xe_bo.o \
> > xe_bo_evict.o \
> > xe_devcoredump.o \
> > diff --git a/drivers/gpu/drm/xe/drm_gpusvm.c
> > b/drivers/gpu/drm/xe/drm_gpusvm.c
> > new file mode 100644
> > index 000000000000..1ff104d2b42c
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/drm_gpusvm.c
> > @@ -0,0 +1,2074 @@
> > +// SPDX-License-Identifier: MIT
> > +/*
> > + * Copyright © 2024 Intel Corporation
> > + *
> > + * Authors:
> > + * Matthew Brost <matthew.brost@intel.com>
> > + */
> > +
> > +#include <linux/dma-mapping.h>
> > +#include <linux/interval_tree_generic.h>
> > +#include <linux/hmm.h>
> > +#include <linux/memremap.h>
> > +#include <linux/migrate.h>
> > +#include <linux/mm_types.h>
> > +#include <linux/pagemap.h>
> > +#include <linux/slab.h>
> > +
> > +#include <drm/drm_device.h>
> > +#include "drm/drm_print.h"
> > +#include "drm_gpusvm.h"
> > +#include "drm_pagemap.h"
> > +
> > +/**
> > + * DOC: Overview
> > + *
> > + * GPU Shared Virtual Memory (GPU SVM) layer for the Direct
> > Rendering Manager (DRM)
> > + *
> > + * The GPU SVM layer is a component of the DRM framework designed to
> > manage shared
> > + * virtual memory between the CPU and GPU. It enables efficient data
> > exchange and
> > + * processing for GPU-accelerated applications by allowing memory
> > sharing and
> > + * synchronization between the CPU's and GPU's virtual address
> > spaces.
> > + *
> > + * Key GPU SVM Components:
> > + * - Notifiers: Notifiers: Used for tracking memory intervals and
> > notifying the
> > + * GPU of changes, notifiers are sized based on a GPU
> > SVM
> > + * initialization parameter, with a recommendation of
> > 512M or
> > + * larger. They maintain a Red-BlacK tree and a list of
> > ranges that
> > + * fall within the notifier interval. Notifiers are
> > tracked within
> > + * a GPU SVM Red-BlacK tree and list and are
> > dynamically inserted
> > + * or removed as ranges within the interval are created
> > or
> > + * destroyed.
> > + * - Ranges: Represent memory ranges mapped in a DRM device and
> > managed
> > + * by GPU SVM. They are sized based on an array of chunk
> > sizes, which
> > + * is a GPU SVM initialization parameter, and the CPU
> > address space.
> > + * Upon GPU fault, the largest aligned chunk that fits
> > within the
> > + * faulting CPU address space is chosen for the range
> > size. Ranges are
> > + * expected to be dynamically allocated on GPU fault and
> > removed on an
> > + * MMU notifier UNMAP event. As mentioned above, ranges
> > are tracked in
> > + * a notifier's Red-Black tree.
> > + * - Operations: Define the interface for driver-specific GPU SVM
> > operations
> > + * such as range allocation, notifier allocation, and
> > + * invalidations.
> > + * - Device Memory Allocations: Embedded structure containing enough
> > information
> > + * for GPU SVM to migrate to / from
> > device memory.
> > + * - Device Memory Operations: Define the interface for driver-
> > specific device
> > + * memory operations release memory,
> > populate pfns,
> > + * and copy to / from device memory.
> > + *
> > + * This layer provides interfaces for allocating, mapping,
> > migrating, and
> > + * releasing memory ranges between the CPU and GPU. It handles all
> > core memory
> > + * management interactions (DMA mapping, HMM, and migration) and
> > provides
> > + * driver-specific virtual functions (vfuncs). This infrastructure
> > is sufficient
> > + * to build the expected driver components for an SVM implementation
> > as detailed
> > + * below.
> > + *
> > + * Expected Driver Components:
> > + * - GPU page fault handler: Used to create ranges and notifiers
> > based on the
> > + * fault address, optionally migrate the
> > range to
> > + * device memory, and create GPU bindings.
> > + * - Garbage collector: Used to destroy GPU bindings for ranges.
>
> Perhaps "clean up GPU bindings for ranges" to differentiate from
> unmapping GPU bindings for ranges which needs to be done in the
> notifier callback?
>
Let me make this more clear.
> > Ranges are
> > + * expected to be added to the garbage
> > collector upon
> > + * MMU_NOTIFY_UNMAP event.
> > + */
> > +
>
> - Notifier callback, to unmap GPU bindings for ranges.
>
Will add something.
> > +/**
> > + * DOC: Locking
> > + *
> > + * GPU SVM handles locking for core MM interactions, i.e., it
> > locks/unlocks the
> > + * mmap lock as needed.
> > + *
> > + * GPU SVM introduces a global notifier lock, which safeguards the
> > notifier's
> > + * range RB tree and list, as well as the range's DMA mappings and
> > sequence
> > + * number. GPU SVM manages all necessary locking and unlocking
> > operations,
>
> How difficult would it be to make this per-notifier?
> One of the design comments we got from Jason was to prioritize avoiding
> core slowdowns and any fault processing might block an unrelated
> notifier during dma mapping and page-table commit.
> I think this should at least be considered as a follow-up.
>
I don't think it is particularly hard for SVM, but if userptr is built on top of
this it gets tricky. The reason being, for userptr we can have an array of binds
with multiple notifiers, so taking multiple notifier locks will confuse lockdep
or, worse, we could deadlock.
I'd say let's keep it as is, see what the userptr rework looks like, and then
adjust based on that.
> > + * except for the recheck of the range's sequence number
> > + * (mmu_interval_read_retry) when the driver is committing GPU
> > bindings. This
>
> Perhaps add a discussion on valid pages rather than valid sequence
> number, since the sequence number might be bumped even if pages stay
> valid for the range, as the sequence number spans the whole notifier.
>
Yes. Good idea.
> > + * lock corresponds to the 'driver->update' lock mentioned in the
> > HMM
> > + * documentation (TODO: Link). Future revisions may transition from
> > a GPU SVM
> > + * global lock to a per-notifier lock if finer-grained locking is
> > deemed
> > + * necessary.
> > + *
> > + * In addition to the locking mentioned above, the driver should
> > implement a
> > + * lock to safeguard core GPU SVM function calls that modify state,
> > such as
> > + * drm_gpusvm_range_find_or_insert and drm_gpusvm_range_remove.
> > Alternatively,
> > + * these core functions can be called within a single kernel thread,
> > for
> > + * instance, using an ordered work queue.
>
> I don't think we should encourage the use of ordered workqueues to
> protect data / state? Locks should really be the preferred mechanism?
>
Let me drop this comment.
> > This lock is denoted as
> > + * 'driver_svm_lock' in code examples. Finer grained driver side
> > locking should
> > + * also be possible for concurrent GPU fault processing within a
> > single GPU SVM.
>
> GPUVM (or rather the gem part of GPUVM that resides in drm_gem.h)
> allows for the driver to register a lock map which, if present, is used
> in the code to assert locks are correctly taken.
>
So something like drm_gem_gpuva_set_lock? Sure. I think really the only
functions which need this assert are the range insert and removal functions,
though, IIRC from my fine-grained locking (concurrent GPU fault processing)
prototype. I guess I could just add it everywhere in this version and adjust
when we land finer-grained locking.
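Roughly what I have in mind, modeled on drm_gem_gpuva_set_lock() (untested
sketch; assumes we add a lock_dep_map pointer to struct drm_gpusvm, names
illustrative only, not final):

#if IS_ENABLED(CONFIG_LOCKDEP)
#define drm_gpusvm_driver_set_lock(gpusvm__, lock__) \
	do { \
		if (!WARN((gpusvm__)->lock_dep_map, \
			  "GPUSVM driver lock should be set only once.")) \
			(gpusvm__)->lock_dep_map = &(lock__)->dep_map; \
	} while (0)
#define drm_gpusvm_driver_lock_held(gpusvm__) \
	lockdep_assert((gpusvm__)->lock_dep_map ? \
		       lock_is_held((gpusvm__)->lock_dep_map) : 1)
#else
#define drm_gpusvm_driver_set_lock(gpusvm__, lock__) do {} while (0)
#define drm_gpusvm_driver_lock_held(gpusvm__) do {} while (0)
#endif

Then drm_gpusvm_range_find_or_insert() / drm_gpusvm_range_remove() would call
drm_gpusvm_driver_lock_held() on entry, and Xe would register its vm lock at
init with drm_gpusvm_driver_set_lock().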
> > + */
>
> In the xe example, if we're using the vm lock to protect, then we could
> register that and thoroughly annotate the gpusvm code with lockdep
> asserts. That will probably help making the code a lot easier to
> maintain moving forward.
See above, agree.
> > +/**
> > + * DOC: Migrataion
>
> s/Migrataion/Migration/
>
Oops. Will fix.
> > + *
> > + * The migration support is quite simple, allowing migration between
> > RAM and
> > + * device memory at the range granularity. For example, GPU SVM
> > currently does not
> > + * support mixing RAM and device memory pages within a range. This
> > means that upon GPU
> > + * fault, the entire range can be migrated to device memory, and
> > upon CPU fault, the
> > + * entire range is migrated to RAM. Mixed RAM and device memory
> > storage within a range
> > + * could be added in the future if required.
> > + *
> > + * The reasoning for only supporting range granularity is as
> > follows: it
> > + * simplifies the implementation, and range sizes are driver-defined
> > and should
> > + * be relatively small.
> > + */
> > +
> > +/**
> > + * DOC: Partial Unmapping of Ranges
> > + *
> > + * Partial unmapping of ranges (e.g., 1M out of 2M is unmapped by
> > CPU resulting
> > + * in MMU_NOTIFY_UNMAP event) presents several challenges, with the
> > main one
> > + * being that a subset of the range still has CPU and GPU mappings.
> > If the
> > + * backing store for the range is in device memory, a subset of the
> > backing store has
> > + * references. One option would be to split the range and device
> > memory backing store,
> > + * but the implementation for this would be quite complicated. Given
> > that
> > + * partial unmappings are rare and driver-defined range sizes are
> > relatively
> > + * small, GPU SVM does not support splitting of ranges.
> > + *
> > + * With no support for range splitting, upon partial unmapping of a
> > range, the
> > + * driver is expected to invalidate and destroy the entire range. If
> > the range
> > + * has device memory as its backing, the driver is also expected to
> > migrate any
> > + * remaining pages back to RAM.
> > + */
> > +
> > +/**
> > + * DOC: Examples
> > + *
> > + * This section provides two examples of how to build the expected
> > driver
> > + * components: the GPU page fault handler and the garbage collector.
> > A third
> > + * example demonstrates a sample invalidation driver vfunc.
> > + *
> > + * The generic code provided does not include logic for complex
> > migration
> > + * policies, optimized invalidations, fined grained driver locking,
> > or other
> > + * potentially required driver locking (e.g., DMA-resv locks).
> > + *
> > + * 1) GPU page fault handler
> > + *
> > + * int driver_bind_range(struct drm_gpusvm *gpusvm, struct
> > drm_gpusvm_range *range)
> > + * {
> > + * int err = 0;
> > + *
> > + * driver_alloc_and_setup_memory_for_bind(gpusvm,
> > range);
> > + *
> > + * drm_gpusvm_notifier_lock(gpusvm);
> > + * if (drm_gpusvm_range_pages_valid(range))
> > + * driver_commit_bind(gpusvm, range);
> > + * else
> > + * err = -EAGAIN;
> > + * drm_gpusvm_notifier_unlock(gpusvm);
> > + *
> > + * return err;
> > + * }
> > + *
> > + * int driver_gpu_fault(struct drm_gpusvm *gpusvm, u64
> > fault_addr,
> > + * u64 gpuva_start, u64 gpuva_end)
> > + * {
> > + * struct drm_gpusvm_ctx ctx = {};
> > + * int err;
> > + *
> > + * driver_svm_lock();
> > + * retry:
> > + * // Always process UNMAPs first so view of GPU SVM
> > ranges is current
> > + * driver_garbage_collector(gpusvm);
> > + *
> > + * range = drm_gpusvm_range_find_or_insert(gpusvm,
> > fault_addr,
> > + * gpuva_start,
> > gpuva_end,
> > + * &ctx);
> > + * if (IS_ERR(range)) {
> > + * err = PTR_ERR(range);
> > + * goto unlock;
> > + * }
> > + *
> > + * if (driver_migration_policy(range)) {
> > + * devmem = driver_alloc_devmem();
> > + * err = drm_gpusvm_migrate_to_devmem(gpusvm,
> > range,
> > + *
> > devmem_allocation,
> > + * &ctx);
> > + * if (err) // CPU mappings may have
> > changed
> > + * goto retry;
> > + * }
> > + *
> > + * err = drm_gpusvm_range_get_pages(gpusvm, range,
> > &ctx);
> > + * if (err == -EOPNOTSUPP || err == -EFAULT || err == -
> > EPERM) { // CPU mappings changed
> > + * if (err == -EOPNOTSUPP)
> > + * drm_gpusvm_range_evict(gpusvm,
> > range);
> > + * goto retry;
> > + * } else if (err) {
> > + * goto unlock;
> > + * }
> > + *
> > + * err = driver_bind_range(gpusvm, range);
> > + * if (err == -EAGAIN) // CPU mappings changed
> > + * goto retry;
> > + *
> > + * unlock:
> > + * driver_svm_unlock();
> > + * return err;
> > + * }
> > + *
> > + * 2) Garbage Collector.
> > + *
> > + * void __driver_garbage_collector(struct drm_gpusvm *gpusvm,
> > + * struct drm_gpusvm_range
> > *range)
> > + * {
> > + * assert_driver_svm_locked(gpusvm);
> > + *
> > + * // Partial unmap, migrate any remaining device
> > memory pages back to RAM
> > + * if (range->flags.partial_unmap)
> > + * drm_gpusvm_range_evict(gpusvm, range);
> > + *
> > + * driver_unbind_range(range);
> > + * drm_gpusvm_range_remove(gpusvm, range);
> > + * }
> > + *
> > + * void driver_garbage_collector(struct drm_gpusvm *gpusvm)
> > + * {
> > + * assert_driver_svm_locked(gpusvm);
> > + *
> > + * for_each_range_in_garbage_collector(gpusvm, range)
> > + * __driver_garbage_collector(gpusvm, range);
> > + * }
> > + *
> > + * 3) Invalidation driver vfunc.
> > + *
> > + * void driver_invalidation(struct drm_gpusvm *gpusvm,
> > + * struct drm_gpusvm_notifier
> > *notifier,
> > + * const struct mmu_notifier_range
> > *mmu_range)
> > + * {
> > + * struct drm_gpusvm_ctx ctx = { .in_notifier = true,
> > };
> > + * struct drm_gpusvm_range *range = NULL;
> > + *
> > + * driver_invalidate_device_tlb(gpusvm, mmu_range-
> > >start, mmu_range->end);
> > + *
> > + * drm_gpusvm_for_each_range(range, notifier,
> > mmu_range->start,
> > + * mmu_range->end) {
> > + * drm_gpusvm_range_unmap_pages(gpusvm, range,
> > &ctx);
> > + *
> > + * if (mmu_range->event != MMU_NOTIFY_UNMAP)
> > + * continue;
> > + *
> > + * drm_gpusvm_range_set_unmapped(range,
> > mmu_range);
> > + * driver_garbage_collector_add(gpusvm, range);
> > + * }
> > + * }
> > + */
> > +
> > +#define DRM_GPUSVM_RANGE_START(_range) ((_range)->va.start)
> > +#define DRM_GPUSVM_RANGE_END(_range) ((_range)->va.end - 1)
> > +INTERVAL_TREE_DEFINE(struct drm_gpusvm_range, rb.node, u64,
> > rb.__subtree_last,
> > + DRM_GPUSVM_RANGE_START, DRM_GPUSVM_RANGE_END,
> > + static __maybe_unused, range);
> > +
> > +#define DRM_GPUSVM_NOTIFIER_START(_notifier) ((_notifier)-
> > >interval.start)
> > +#define DRM_GPUSVM_NOTIFIER_END(_notifier) ((_notifier)-
> > >interval.end - 1)
> > +INTERVAL_TREE_DEFINE(struct drm_gpusvm_notifier, rb.node, u64,
> > + rb.__subtree_last, DRM_GPUSVM_NOTIFIER_START,
> > + DRM_GPUSVM_NOTIFIER_END, static __maybe_unused,
> > notifier);
>
> Did you look at removing these instantiations after the RFC comments,
> and instead embed a struct interval_tree_node?
>
Both GPU VM and xe_range_fence define an interval tree like this, so I figured
it was fine, but also some of this was laziness on my part.
> Perhaps the notifier tree could use a maple tree?
>
Let me follow up in earnest on using a maple tree. My understanding is a maple
tree is designed for exactly this (VA tracking), and the only reason GPU VM
doesn't use it is because it has memory allocations which break the dma-fencing
rules. I think there is a version of GPU VM out there with a maple tree, so I
don't have to think too hard about this.
> > +
> > +/**
> > + * npages_in_range() - Calculate the number of pages in a given
> > range
> > + * @start__: The start address of the range
> > + * @end__: The end address of the range
> > + *
> > + * This macro calculates the number of pages in a given memory
> > range,
> > + * specified by the start and end addresses. It divides the
> > difference
> > + * between the end and start addresses by the page size (PAGE_SIZE)
> > to
> > + * determine the number of pages in the range.
> > + *
> > + * Return: The number of pages in the specified range.
> > + */
> > +#define npages_in_range(start__, end__) \
> > + (((end__) - (start__)) >> PAGE_SHIFT)
>
> Could use a static function?
>
See my other reply [1]. Will change macros -> functions where it makes sense.
[1] https://patchwork.freedesktop.org/patch/619809/?series=137870&rev=2#comment_1133611
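For the simple ones it's a mechanical change, e.g. (sketch):

static unsigned long npages_in_range(u64 start, u64 end)
{
	return (end - start) >> PAGE_SHIFT;
}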
> > +
> > +/**
> > + * struct drm_gpusvm_zdd - GPU SVM zone device data
> > + *
> > + * @refcount: Reference count for the zdd
> > + * @destroy_work: Work structure for asynchronous zdd destruction
> > + * @devmem_allocation: device memory allocation
> > + * @device_private_page_owner: Device private pages owner
> > + *
> > + * This structure serves as a generic wrapper installed in
> > + * page->zone_device_data. It provides infrastructure for looking up
> > a device
> > + * memory allocation upon CPU page fault and asynchronously
> > releasing device
> > + * memory once the CPU has no page references. Asynchronous release
> > is useful
> > + * because CPU page references can be dropped in IRQ contexts, while
> > releasing
> > + * device memory likely requires sleeping locks.
> > + */
> > +struct drm_gpusvm_zdd {
> > + struct kref refcount;
> > + struct work_struct destroy_work;
> > + struct drm_gpusvm_devmem *devmem_allocation;
> > + void *device_private_page_owner;
> > +};
>
> I think the zdd and migration helpers should be moved to drm_pagemap.c
> We should consider looking at that once patches for drm_pagemap
> functionality are posted.
>
We have another thread discussing this. Let's continue this there.
Matt
>
> TBC
> /Thomas
>
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [PATCH v2 05/29] drm/gpusvm: Add support for GPU Shared Virtual Memory
2024-10-16 3:24 ` [PATCH v2 05/29] drm/gpusvm: Add support for GPU Shared Virtual Memory Matthew Brost
2024-10-31 18:58 ` Thomas Hellström
@ 2024-11-04 15:25 ` Thomas Hellström
2024-11-04 17:21 ` Matthew Brost
2024-11-05 14:48 ` Thomas Hellström
` (2 subsequent siblings)
4 siblings, 1 reply; 129+ messages in thread
From: Thomas Hellström @ 2024-11-04 15:25 UTC (permalink / raw)
To: Matthew Brost, intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, simona.vetter, felix.kuehling,
dakr
On Tue, 2024-10-15 at 20:24 -0700, Matthew Brost wrote:
Continued review.
>
> +/**
> + * struct drm_gpusvm_zdd - GPU SVM zone device data
> + *
> + * @refcount: Reference count for the zdd
> + * @destroy_work: Work structure for asynchronous zdd destruction
> + * @devmem_allocation: device memory allocation
> + * @device_private_page_owner: Device private pages owner
> + *
> + * This structure serves as a generic wrapper installed in
> + * page->zone_device_data. It provides infrastructure for looking up
> a device
> + * memory allocation upon CPU page fault and asynchronously
> releasing device
> + * memory once the CPU has no page references. Asynchronous release
> is useful
> + * because CPU page references can be dropped in IRQ contexts, while
> releasing
> + * device memory likely requires sleeping locks.
> + */
> +struct drm_gpusvm_zdd {
> + struct kref refcount;
> + struct work_struct destroy_work;
> + struct drm_gpusvm_devmem *devmem_allocation;
> + void *device_private_page_owner;
> +};
> +
> +/**
> + * drm_gpusvm_zdd_destroy_work_func - Work function for destroying a
> zdd
NIT: Even if the above kerneldoc format works, I keep trying to enforce
using () after function names and function-like macros, like described
here: https://docs.kernel.org/doc-guide/kernel-doc.html Could we
update? Also that doc calls for using "Return:" instead of "Returns:".
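I.e., just illustrating the () and "Return:" style, for the function below:

/**
 * drm_gpusvm_zdd_destroy_work_func() - Work function for destroying a zdd
 * @w: Pointer to the work_struct
 *
 * This function releases device memory, puts GPU SVM range, and frees zdd.
 */

and s/Returns:/Return:/ throughout for the functions that do return something.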
> + * @w: Pointer to the work_struct
> + *
> + * This function releases device memory, puts GPU SVM range, and
> frees zdd.
> + */
> +static void drm_gpusvm_zdd_destroy_work_func(struct work_struct *w)
> +{
> + struct drm_gpusvm_zdd *zdd =
> + container_of(w, struct drm_gpusvm_zdd,
> destroy_work);
> + const struct drm_gpusvm_devmem_ops *ops = zdd-
> >devmem_allocation ?
> + zdd->devmem_allocation->ops : NULL;
> +
> + if (zdd->devmem_allocation && ops->devmem_release)
> + ops->devmem_release(zdd->devmem_allocation);
> + kfree(zdd);
> +}
> +
> +/**
> + * drm_gpusvm_zdd_alloc - Allocate a zdd structure.
> + * @device_private_page_owner: Device private pages owner
> + *
> + * This function allocates and initializes a new zdd structure. It
> sets up the
> + * reference count and initializes the destroy work.
> + *
> + * Returns:
> + * Pointer to the allocated zdd on success, ERR_PTR() on failure.
> + */
> +static struct drm_gpusvm_zdd *
> +drm_gpusvm_zdd_alloc(void *device_private_page_owner)
> +{
> + struct drm_gpusvm_zdd *zdd;
> +
> + zdd = kmalloc(sizeof(*zdd), GFP_KERNEL);
> + if (!zdd)
> + return NULL;
> +
> + kref_init(&zdd->refcount);
> + INIT_WORK(&zdd->destroy_work,
> drm_gpusvm_zdd_destroy_work_func);
> + zdd->devmem_allocation = NULL;
> + zdd->device_private_page_owner = device_private_page_owner;
> +
> + return zdd;
> +}
> +
> +/**
> + * drm_gpusvm_zdd_get - Get a reference to a zdd structure.
> + * @zdd: Pointer to the zdd structure.
> + *
> + * This function increments the reference count of the provided zdd
> structure.
> + *
> + * Returns: Pointer to the zdd structure.
> + */
> +static struct drm_gpusvm_zdd *drm_gpusvm_zdd_get(struct
> drm_gpusvm_zdd *zdd)
> +{
> + kref_get(&zdd->refcount);
> + return zdd;
> +}
> +
> +/**
> + * drm_gpusvm_zdd_destroy - Destroy a zdd structure.
> + * @ref: Pointer to the reference count structure.
> + *
> + * This function queues the destroy_work of the zdd for asynchronous
> destruction.
> + */
> +static void drm_gpusvm_zdd_destroy(struct kref *ref)
> +{
> + struct drm_gpusvm_zdd *zdd =
> + container_of(ref, struct drm_gpusvm_zdd, refcount);
> +
> + if (zdd->devmem_allocation)
> + WRITE_ONCE(zdd->devmem_allocation->detached, true);
> + schedule_work(&zdd->destroy_work);
> +}
> +
> +/**
> + * drm_gpusvm_zdd_put - Put a zdd reference.
> + * @zdd: Pointer to the zdd structure.
> + *
> + * This function decrements the reference count of the provided zdd
> structure
> + * and schedules its destruction if the count drops to zero.
> + */
> +static void drm_gpusvm_zdd_put(struct drm_gpusvm_zdd *zdd)
> +{
> + kref_put(&zdd->refcount, drm_gpusvm_zdd_destroy);
> +}
As mentioned earlier, I think the above drm_gpusvm_zdd functions should
move to drm_pagemap.c. I don't think they are used in drm_gpusvm other
than to, at get_pages time, ensure all device private pages are from
the same pagemap?
> +
> +/**
> + * drm_gpusvm_range_find - Find GPU SVM range from GPU SVM notifier
> + * @notifier: Pointer to the GPU SVM notifier structure.
> + * @start: Start address of the range
> + * @end: End address of the range
> + *
> + * Return: A pointer to the drm_gpusvm_range if found or NULL
> + */
> +struct drm_gpusvm_range *
> +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64
> start, u64 end)
> +{
> + return range_iter_first(¬ifier->root, start, end - 1);
> +}
> +
> +/**
> + * drm_gpusvm_for_each_range_safe - Safely iterate over GPU SVM
> ranges in a notifier
> + * @range__: Iterator variable for the ranges
> + * @next__: Iterator variable for the ranges temporay storage
> + * @notifier__: Pointer to the GPU SVM notifier
> + * @start__: Start address of the range
> + * @end__: End address of the range
> + *
> + * This macro is used to iterate over GPU SVM ranges in a notifier
> while
> + * removing ranges from it.
> + */
> +#define drm_gpusvm_for_each_range_safe(range__, next__, notifier__,
> start__, end__) \
> + for ((range__) = drm_gpusvm_range_find((notifier__),
> (start__), (end__)), \
> + (next__) =
> __drm_gpusvm_range_next(range__); \
> + (range__) && (range__->va.start <
> (end__)); \
> + (range__) = (next__), (next__) =
> __drm_gpusvm_range_next(range__))
> +
> +/**
> + * __drm_gpusvm_notifier_next - get the next drm_gpusvm_notifier in
> the list
> + * @notifier: a pointer to the current drm_gpusvm_notifier
> + *
> + * Return: A pointer to the next drm_gpusvm_notifier if available,
> or NULL if
> + * the current notifier is the last one or if the input
> notifier is
> + * NULL.
> + */
> +static struct drm_gpusvm_notifier *
> +__drm_gpusvm_notifier_next(struct drm_gpusvm_notifier *notifier)
> +{
> + if (notifier && !list_is_last(¬ifier->rb.entry,
> + ¬ifier->gpusvm-
> >notifier_list))
> + return list_next_entry(notifier, rb.entry);
Why aren't we using notifier_iter_next() here? Then the linked list
could be skipped.
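I.e. something like (untested sketch, passing the interval through), which
would also address the end__ concern below:

static struct drm_gpusvm_notifier *
__drm_gpusvm_notifier_next(struct drm_gpusvm_notifier *notifier,
			   u64 start, u64 end)
{
	return notifier ? notifier_iter_next(notifier, start, end - 1) : NULL;
}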
> +
> + return NULL;
> +}
> +
> +/**
> + * drm_gpusvm_for_each_notifier - Iterate over GPU SVM notifiers in
> a gpusvm
> + * @notifier__: Iterator variable for the notifiers
> + * @notifier__: Pointer to the GPU SVM notifier
> + * @start__: Start address of the notifier
> + * @end__: End address of the notifier
> + *
> + * This macro is used to iterate over GPU SVM notifiers in a gpusvm.
> + */
> +#define drm_gpusvm_for_each_notifier(notifier__, gpusvm__, start__,
> end__) \
> + for ((notifier__) = notifier_iter_first(&(gpusvm__)->root,
> (start__), (end__) - 1); \
> + (notifier__) && (notifier__->interval.start <
> (end__)); \
> + (notifier__) = __drm_gpusvm_notifier_next(notifier__))
> +
Looks like end__ is not honored except for the first iteration. Relates
to the above question.
> +/**
> + * drm_gpusvm_for_each_notifier_safe - Safely iterate over GPU SVM
> notifiers in a gpusvm
> + * @notifier__: Iterator variable for the notifiers
> + * @next__: Iterator variable for the notifiers temporay storage
> + * @notifier__: Pointer to the GPU SVM notifier
> + * @start__: Start address of the notifier
> + * @end__: End address of the notifier
> + *
> + * This macro is used to iterate over GPU SVM notifiers in a gpusvm
> while
> + * removing notifiers from it.
> + */
> +#define drm_gpusvm_for_each_notifier_safe(notifier__, next__,
> gpusvm__, start__, end__) \
> + for ((notifier__) = notifier_iter_first(&(gpusvm__)->root,
> (start__), (end__) - 1), \
> + (next__) =
> __drm_gpusvm_notifier_next(notifier__); \
> + (notifier__) && (notifier__->interval.start <
> (end__)); \
> + (notifier__) = (next__), (next__) =
> __drm_gpusvm_notifier_next(notifier__))
Same here.
> +
> +/**
> + * drm_gpusvm_notifier_invalidate - Invalidate a GPU SVM notifier.
> + * @mni: Pointer to the mmu_interval_notifier structure.
> + * @mmu_range: Pointer to the mmu_notifier_range structure.
> + * @cur_seq: Current sequence number.
> + *
> + * This function serves as a generic MMU notifier for GPU SVM. It
> sets the MMU
> + * notifier sequence number and calls the driver invalidate vfunc
> under
> + * gpusvm->notifier_lock.
> + *
> + * Returns:
> + * true if the operation succeeds, false otherwise.
> + */
> +static bool
> +drm_gpusvm_notifier_invalidate(struct mmu_interval_notifier *mni,
> + const struct mmu_notifier_range
> *mmu_range,
> + unsigned long cur_seq)
> +{
> + struct drm_gpusvm_notifier *notifier =
> + container_of(mni, typeof(*notifier), notifier);
> + struct drm_gpusvm *gpusvm = notifier->gpusvm;
> +
> + if (!mmu_notifier_range_blockable(mmu_range))
> + return false;
> +
> + down_write(&gpusvm->notifier_lock);
> + mmu_interval_set_seq(mni, cur_seq);
> + gpusvm->ops->invalidate(gpusvm, notifier, mmu_range);
> + up_write(&gpusvm->notifier_lock);
> +
> + return true;
> +}
> +
> +/**
> + * drm_gpusvm_notifier_ops - MMU interval notifier operations for
> GPU SVM
> + */
> +static const struct mmu_interval_notifier_ops
> drm_gpusvm_notifier_ops = {
> + .invalidate = drm_gpusvm_notifier_invalidate,
> +};
> +
> +/**
> + * drm_gpusvm_init - Initialize the GPU SVM.
> + * @gpusvm: Pointer to the GPU SVM structure.
> + * @name: Name of the GPU SVM.
> + * @drm: Pointer to the DRM device structure.
> + * @mm: Pointer to the mm_struct for the address space.
> + * @device_private_page_owner: Device private pages owner.
> + * @mm_start: Start address of GPU SVM.
> + * @mm_range: Range of the GPU SVM.
> + * @notifier_size: Size of individual notifiers.
> + * @ops: Pointer to the operations structure for GPU SVM.
> + * @chunk_sizes: Pointer to the array of chunk sizes used in range
> allocation.
> + * Entries should be powers of 2 in descending order
> with last
> + * entry being SZ_4K.
> + * @num_chunks: Number of chunks.
> + *
> + * This function initializes the GPU SVM.
> + *
> + * Returns:
> + * 0 on success, a negative error code on failure.
> + */
> +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> + const char *name, struct drm_device *drm,
> + struct mm_struct *mm, void
> *device_private_page_owner,
> + u64 mm_start, u64 mm_range, u64 notifier_size,
> + const struct drm_gpusvm_ops *ops,
> + const u64 *chunk_sizes, int num_chunks)
> +{
> + if (!ops->invalidate || !num_chunks)
> + return -EINVAL;
> +
> + gpusvm->name = name;
> + gpusvm->drm = drm;
> + gpusvm->mm = mm;
> + gpusvm->device_private_page_owner =
> device_private_page_owner;
> + gpusvm->mm_start = mm_start;
> + gpusvm->mm_range = mm_range;
> + gpusvm->notifier_size = notifier_size;
> + gpusvm->ops = ops;
> + gpusvm->chunk_sizes = chunk_sizes;
> + gpusvm->num_chunks = num_chunks;
> +
> + mmgrab(mm);
> + gpusvm->root = RB_ROOT_CACHED;
> + INIT_LIST_HEAD(&gpusvm->notifier_list);
> +
> + init_rwsem(&gpusvm->notifier_lock);
> +
> + fs_reclaim_acquire(GFP_KERNEL);
> + might_lock(&gpusvm->notifier_lock);
> + fs_reclaim_release(GFP_KERNEL);
> +
> + return 0;
> +}
> +
> +/**
> + * drm_gpusvm_notifier_find - Find GPU SVM notifier
> + * @gpusvm__: Pointer to the GPU SVM structure
> + * @fault_addr__: Fault address
> + *
> + * This macro finds the GPU SVM notifier associated with the fault
> address.
> + *
> + * Returns:
> + * Pointer to the GPU SVM notifier on success, NULL otherwise.
> + */
> +#define drm_gpusvm_notifier_find(gpusvm__, fault_addr__) \
> + notifier_iter_first(&(gpusvm__)->root, (fault_addr__), \
> + (fault_addr__ + 1))
> +
> +/**
> + * to_drm_gpusvm_notifier - retrieve the container struct for a
> given rbtree node
> + * @node__: a pointer to the rbtree node embedded within a
> drm_gpusvm_notifier struct
> + *
> + * Return: A pointer to the containing drm_gpusvm_notifier
> structure.
> + */
> +#define to_drm_gpusvm_notifier(__node) \
> + container_of((__node), struct drm_gpusvm_notifier, rb.node)
> +
There appears to be a number of function-like macros in the code, which
look like they can be converted to functions. Linux prefers functions
over macros when possible:
https://www.kernel.org/doc/html/v5.8/process/coding-style.html#macros-enums-and-rtl
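E.g. drm_gpusvm_notifier_find() looks like it could simply be (sketch):

static struct drm_gpusvm_notifier *
drm_gpusvm_notifier_find(struct drm_gpusvm *gpusvm, u64 fault_addr)
{
	return notifier_iter_first(&gpusvm->root, fault_addr, fault_addr + 1);
}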
> +/**
> + * drm_gpusvm_notifier_insert - Insert GPU SVM notifier
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: Pointer to the GPU SVM notifier structure
> + *
> + * This function inserts the GPU SVM notifier into the GPU SVM RB
> tree and list.
> + */
> +static void drm_gpusvm_notifier_insert(struct drm_gpusvm *gpusvm,
> + struct drm_gpusvm_notifier
> *notifier)
> +{
> + struct rb_node *node;
> + struct list_head *head;
> +
> + notifier_insert(notifier, &gpusvm->root);
> +
> + node = rb_prev(¬ifier->rb.node);
> + if (node)
> + head = &(to_drm_gpusvm_notifier(node))->rb.entry;
> + else
> + head = &gpusvm->notifier_list;
> +
> + list_add(¬ifier->rb.entry, head);
> +}
> +
> +/**
> + * drm_gpusvm_notifier_remove - Remove GPU SVM notifier
> + * @gpusvm__: Pointer to the GPU SVM tructure
> + * @notifier__: Pointer to the GPU SVM notifier structure
> + *
> + * This macro removes the GPU SVM notifier from the GPU SVM RB tree
> and list.
> + */
> +#define drm_gpusvm_notifier_remove(gpusvm__, notifier__) \
> + notifier_remove((notifier__), &(gpusvm__)->root); \
> + list_del(&(notifier__)->rb.entry)
Unless this can be made a function, Pls use
do { } while (0)
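I.e. (sketch):

#define drm_gpusvm_notifier_remove(gpusvm__, notifier__)		\
	do {								\
		notifier_remove((notifier__), &(gpusvm__)->root);	\
		list_del(&(notifier__)->rb.entry);			\
	} while (0)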
> +
> +/**
> + * drm_gpusvm_fini - Finalize the GPU SVM.
> + * @gpusvm: Pointer to the GPU SVM structure.
> + *
> + * This function finalizes the GPU SVM by cleaning up any remaining
> ranges and
> + * notifiers, and dropping a reference to struct MM.
> + */
> +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm)
> +{
> + struct drm_gpusvm_notifier *notifier, *next;
> +
> + drm_gpusvm_for_each_notifier_safe(notifier, next, gpusvm, 0,
> LONG_MAX) {
> + struct drm_gpusvm_range *range, *__next;
> +
> + /*
> + * Remove notifier first to avoid racing with any
> invalidation
> + */
> + mmu_interval_notifier_remove(¬ifier->notifier);
> + notifier->flags.removed = true;
> +
> + drm_gpusvm_for_each_range_safe(range, __next,
> notifier, 0,
> + LONG_MAX)
> + drm_gpusvm_range_remove(gpusvm, range);
> + }
> +
> + mmdrop(gpusvm->mm);
> + WARN_ON(!RB_EMPTY_ROOT(&gpusvm->root.rb_root));
> +}
> +
> +/**
> + * drm_gpusvm_notifier_alloc - Allocate GPU SVM notifier
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @fault_addr: Fault address
> + *
> + * This function allocates and initializes the GPU SVM notifier
> structure.
> + *
> + * Returns:
> + * Pointer to the allocated GPU SVM notifier on success, ERR_PTR()
> on failure.
> + */
> +static struct drm_gpusvm_notifier *
> +drm_gpusvm_notifier_alloc(struct drm_gpusvm *gpusvm, u64 fault_addr)
> +{
> + struct drm_gpusvm_notifier *notifier;
> +
> + if (gpusvm->ops->notifier_alloc)
> + notifier = gpusvm->ops->notifier_alloc();
> + else
> + notifier = kzalloc(sizeof(*notifier), GFP_KERNEL);
> +
> + if (!notifier)
> + return ERR_PTR(-ENOMEM);
> +
> + notifier->gpusvm = gpusvm;
> + notifier->interval.start = ALIGN_DOWN(fault_addr, gpusvm-
> >notifier_size);
> + notifier->interval.end = ALIGN(fault_addr + 1, gpusvm-
> >notifier_size);
> + INIT_LIST_HEAD(¬ifier->rb.entry);
> + notifier->root = RB_ROOT_CACHED;
> + INIT_LIST_HEAD(¬ifier->range_list);
> +
> + return notifier;
> +}
> +
> +/**
> + * drm_gpusvm_notifier_free - Free GPU SVM notifier
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: Pointer to the GPU SVM notifier structure
> + *
> + * This function frees the GPU SVM notifier structure.
> + */
> +static void drm_gpusvm_notifier_free(struct drm_gpusvm *gpusvm,
> + struct drm_gpusvm_notifier
> *notifier)
> +{
> + WARN_ON(!RB_EMPTY_ROOT(¬ifier->root.rb_root));
> +
> + if (gpusvm->ops->notifier_free)
> + gpusvm->ops->notifier_free(notifier);
> + else
> + kfree(notifier);
> +}
> +
> +/**
> + * to_drm_gpusvm_range - retrieve the container struct for a given
> rbtree node
> + * @node__: a pointer to the rbtree node embedded within a
> drm_gpusvm_range struct
> + *
> + * Return: A pointer to the containing drm_gpusvm_range structure.
> + */
> +#define to_drm_gpusvm_range(node__) \
> + container_of((node__), struct drm_gpusvm_range, rb.node)
> +
> +/**
> + * drm_gpusvm_range_insert - Insert GPU SVM range
> + * @notifier: Pointer to the GPU SVM notifier structure
> + * @range: Pointer to the GPU SVM range structure
> + *
> + * This function inserts the GPU SVM range into the notifier RB tree
> and list.
> + */
> +static void drm_gpusvm_range_insert(struct drm_gpusvm_notifier
> *notifier,
> + struct drm_gpusvm_range *range)
> +{
> + struct rb_node *node;
> + struct list_head *head;
> +
> + drm_gpusvm_notifier_lock(notifier->gpusvm);
> + range_insert(range, ¬ifier->root);
> +
> + node = rb_prev(&range->rb.node);
> + if (node)
> + head = &(to_drm_gpusvm_range(node))->rb.entry;
> + else
> + head = ¬ifier->range_list;
> +
> + list_add(&range->rb.entry, head);
> + drm_gpusvm_notifier_unlock(notifier->gpusvm);
> +}
> +
> +/**
> + * __drm_gpusvm_range_remove - Remove GPU SVM range
> + * @notifier__: Pointer to the GPU SVM notifier structure
> + * @range__: Pointer to the GPU SVM range structure
> + *
> + * This macro removes the GPU SVM range from the notifier RB tree
> and list.
> + */
> +#define __drm_gpusvm_range_remove(notifier__, range__) \
> + range_remove((range__), &(notifier__)->root); \
> + list_del(&(range__)->rb.entry)
Same thing as for the notifier rb tree. And do we need the linked list?
> +
> +/**
> + * drm_gpusvm_range_alloc - Allocate GPU SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: Pointer to the GPU SVM notifier structure
> + * @fault_addr: Fault address
> + * @chunk_size: Chunk size
> + * @migrate_devmem: Flag indicating whether to migrate device memory
> + *
> + * This function allocates and initializes the GPU SVM range
> structure.
> + *
> + * Returns:
> + * Pointer to the allocated GPU SVM range on success, ERR_PTR() on
> failure.
> + */
> +static struct drm_gpusvm_range *
> +drm_gpusvm_range_alloc(struct drm_gpusvm *gpusvm,
> + struct drm_gpusvm_notifier *notifier,
> + u64 fault_addr, u64 chunk_size, bool
> migrate_devmem)
> +{
> + struct drm_gpusvm_range *range;
> +
> + if (gpusvm->ops->range_alloc)
> + range = gpusvm->ops->range_alloc(gpusvm);
> + else
> + range = kzalloc(sizeof(*range), GFP_KERNEL);
> +
> + if (!range)
> + return ERR_PTR(-ENOMEM);
> +
> + kref_init(&range->refcount);
> + range->gpusvm = gpusvm;
> + range->notifier = notifier;
> + range->va.start = ALIGN_DOWN(fault_addr, chunk_size);
> + range->va.end = ALIGN(fault_addr + 1, chunk_size);
> + INIT_LIST_HEAD(&range->rb.entry);
> + range->notifier_seq = LONG_MAX;
> + range->flags.migrate_devmem = migrate_devmem ? 1 : 0;
> +
> + return range;
> +}
> +
> +/**
> + * drm_gpusvm_check_pages - Check pages
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: Pointer to the GPU SVM notifier structure
> + * @start: Start address
> + * @end: End address
> + *
> + * Check if pages between start and end have been faulted in on the
> CPU. Use to
> + * prevent migration of pages without CPU backing store.
> + *
> + * Returns:
> + * True if pages have been faulted into CPU, False otherwise
> + */
> +static bool drm_gpusvm_check_pages(struct drm_gpusvm *gpusvm,
> + struct drm_gpusvm_notifier
> *notifier,
> + u64 start, u64 end)
> +{
> + struct hmm_range hmm_range = {
> + .default_flags = 0,
> + .notifier = ¬ifier->notifier,
> + .start = start,
> + .end = end,
> + .dev_private_owner = gpusvm-
> >device_private_page_owner,
> + };
> + unsigned long timeout =
> + jiffies +
> msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> + unsigned long *pfns;
> + unsigned long npages = npages_in_range(start, end);
> + int err, i;
> +
> + mmap_assert_locked(gpusvm->mm);
> +
> + pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
> + if (!pfns)
> + return false;
> +
> + hmm_range.notifier_seq = mmu_interval_read_begin(¬ifier-
> >notifier);
> + hmm_range.hmm_pfns = pfns;
> +
> + while (true) {
> + err = hmm_range_fault(&hmm_range);
> + if (err == -EBUSY) {
> + if (time_after(jiffies, timeout))
> + break;
> +
> + hmm_range.notifier_seq =
> mmu_interval_read_begin(¬ifier->notifier);
> + continue;
> + }
> + break;
> + }
> + if (err)
> + goto err_free;
> +
> + for (i = 0; i < npages;) {
> + if (!(pfns[i] & HMM_PFN_VALID)) {
> + err = -EFAULT;
> + goto err_free;
> + }
> + i += 0x1 << hmm_pfn_to_map_order(pfns[i]);
> + }
> +
> +err_free:
> + kvfree(pfns);
> + return err ? false : true;
> +}
> +
> +/**
> + * drm_gpusvm_range_chunk_size - Determine chunk size for GPU SVM
> range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: Pointer to the GPU SVM notifier structure
> + * @vas: Pointer to the virtual memory area structure
> + * @fault_addr: Fault address
> + * @gpuva_start: Start address of GPUVA which mirrors CPU
> + * @gpuva_end: End address of GPUVA which mirrors CPU
> + * @check_pages: Flag indicating whether to check pages
> + *
> + * This function determines the chunk size for the GPU SVM range
> based on the
> + * fault address, GPU SVM chunk sizes, existing GPU SVM ranges, and
> the virtual
> + * memory area boundaries.
> + *
> + * Returns:
> + * Chunk size on success, LONG_MAX on failure.
> + */
> +static u64 drm_gpusvm_range_chunk_size(struct drm_gpusvm *gpusvm,
> + struct drm_gpusvm_notifier
> *notifier,
> + struct vm_area_struct *vas,
> + u64 fault_addr, u64
> gpuva_start,
> + u64 gpuva_end, bool
> check_pages)
> +{
> + u64 start, end;
> + int i = 0;
> +
> +retry:
> + for (; i < gpusvm->num_chunks; ++i) {
> + start = ALIGN_DOWN(fault_addr, gpusvm-
> >chunk_sizes[i]);
> + end = ALIGN(fault_addr + 1, gpusvm->chunk_sizes[i]);
> +
> + if (start >= vas->vm_start && end <= vas->vm_end &&
> + start >= notifier->interval.start &&
> + end <= notifier->interval.end &&
> + start >= gpuva_start && end <= gpuva_end)
> + break;
> + }
> +
> + if (i == gpusvm->num_chunks)
> + return LONG_MAX;
> +
> + /*
> + * If allocation more than page, ensure not to overlap with
> existing
> + * ranges.
> + */
> + if (end - start != SZ_4K) {
> + struct drm_gpusvm_range *range;
> +
> + range = drm_gpusvm_range_find(notifier, start, end);
> + if (range) {
> + ++i;
> + goto retry;
> + }
> +
> + /*
> + * XXX: Only create range on pages CPU has faulted
> in. Without
> + * this check, or prefault, on BMG
> 'xe_exec_system_allocator --r
> + * process-many-malloc' fails. In the failure case,
> each process
> + * mallocs 16k but the CPU VMA is ~128k which
> results in 64k SVM
> + * ranges. When migrating the SVM ranges, some
> processes fail in
> + * drm_gpusvm_migrate_to_devmem with 'migrate.cpages
> != npages'
> + * and then upon drm_gpusvm_range_get_pages device
> pages from
> + * other processes are collected + faulted in which
> creates all
> + * sorts of problems. Unsure exactly how this
> happening, also
> + * problem goes away if 'xe_exec_system_allocator --
> r
> + * process-many-malloc' mallocs at least 64k at a
> time.
> + */
Needs to be figured out. I think even in the system allocator case, if
a user uses malloc() to allocate a GPU only buffer we'd need to support
that?
> + if (check_pages &&
> + !drm_gpusvm_check_pages(gpusvm, notifier, start,
> end)) {
> + ++i;
> + goto retry;
> + }
> + }
> +
> + return end - start;
> +}
> +
> +/**
> + * drm_gpusvm_range_find_or_insert - Find or insert GPU SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @fault_addr: Fault address
> + * @gpuva_start: Start address of GPUVA which mirrors CPU
> + * @gpuva_end: End address of GPUVA which mirrors CPU
> + * @ctx: GPU SVM context
> + *
> + * This function finds or inserts a newly allocated a GPU SVM range
> based on the
> + * fault address. Caller must hold a lock to protect range lookup
> and insertion.
> + *
> + * Returns:
> + * Pointer to the GPU SVM range on success, ERR_PTR() on failure.
> + */
> +struct drm_gpusvm_range *
> +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64
> fault_addr,
> + u64 gpuva_start, u64 gpuva_end,
> + const struct drm_gpusvm_ctx *ctx)
> +{
> + struct drm_gpusvm_notifier *notifier;
> + struct drm_gpusvm_range *range;
> + struct mm_struct *mm = gpusvm->mm;
> + struct vm_area_struct *vas;
> + bool notifier_alloc = false;
> + u64 chunk_size;
> + int err;
> + bool migrate_devmem;
> +
> + if (fault_addr < gpusvm->mm_start ||
> + fault_addr > gpusvm->mm_start + gpusvm->mm_range) {
return ERR_PTR(-EINVAL)?
> + err = -EINVAL;
> + goto err_out;
> + }
> +
> + if (!mmget_not_zero(mm)) {
> + err = -EFAULT;
> + goto err_out;
> + }
> +
> + notifier = drm_gpusvm_notifier_find(gpusvm, fault_addr);
> + if (!notifier) {
> + notifier = drm_gpusvm_notifier_alloc(gpusvm,
> fault_addr);
> + if (IS_ERR(notifier)) {
> + err = PTR_ERR(notifier);
> + goto err_mmunlock;
> + }
> + notifier_alloc = true;
> + err = mmu_interval_notifier_insert(¬ifier-
> >notifier,
> + mm, notifier-
> >interval.start,
> + notifier-
> >interval.end -
> + notifier-
> >interval.start,
> +
> &drm_gpusvm_notifier_ops);
> + if (err)
> + goto err_notifier;
> + }
> +
> + mmap_read_lock(mm);
> +
> + vas = vma_lookup(mm, fault_addr);
> + if (!vas) {
> + err = -ENOENT;
> + goto err_notifier_remove;
> + }
> +
> + if (!ctx->read_only && !(vas->vm_flags & VM_WRITE)) {
> + err = -EPERM;
> + goto err_notifier_remove;
> + }
> +
> + range = drm_gpusvm_range_find(notifier, fault_addr,
> fault_addr + 1);
> + if (range)
> + goto out_mmunlock;
> + /*
> + * XXX: Short-circuiting migration based on migrate_vma_*
> current
> + * limitations. If/when migrate_vma_* add more support, this
> logic will
> + * have to change.
> + */
> + migrate_devmem = ctx->devmem_possible &&
> + vma_is_anonymous(vas) && !is_vm_hugetlb_page(vas);
> +
> + chunk_size = drm_gpusvm_range_chunk_size(gpusvm, notifier,
> vas,
> + fault_addr,
> gpuva_start,
> + gpuva_end,
> migrate_devmem &&
> + ctx->check_pages);
> + if (chunk_size == LONG_MAX) {
> + err = -EINVAL;
> + goto err_notifier_remove;
> + }
> +
> + range = drm_gpusvm_range_alloc(gpusvm, notifier, fault_addr,
> chunk_size,
> + migrate_devmem);
> + if (IS_ERR(range)) {
> + err = PTR_ERR(range);
> + goto err_notifier_remove;
> + }
> +
> + drm_gpusvm_range_insert(notifier, range);
> + if (notifier_alloc)
> + drm_gpusvm_notifier_insert(gpusvm, notifier);
> +
> +out_mmunlock:
> + mmap_read_unlock(mm);
> + mmput(mm);
> +
> + return range;
> +
> +err_notifier_remove:
> + mmap_read_unlock(mm);
> + if (notifier_alloc)
> + mmu_interval_notifier_remove(¬ifier->notifier);
> +err_notifier:
> + if (notifier_alloc)
> + drm_gpusvm_notifier_free(gpusvm, notifier);
> +err_mmunlock:
> + mmput(mm);
> +err_out:
> + return ERR_PTR(err);
> +}
> +
> +/**
> + * __drm_gpusvm_range_unmap_pages - Unmap pages associated with a
> GPU SVM range (internal)
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + * @npages: Number of pages to unmap
> + *
> + * This function unmap pages associated with a GPU SVM range.
> Assumes and
> + * asserts correct locking is in place when called.
> + */
> +static void __drm_gpusvm_range_unmap_pages(struct drm_gpusvm
> *gpusvm,
> + struct drm_gpusvm_range
> *range,
> + unsigned long npages)
> +{
> + unsigned long i, j;
> + struct drm_pagemap *dpagemap = range->dpagemap;
> + struct device *dev = gpusvm->drm->dev;
> +
> + lockdep_assert_held(&gpusvm->notifier_lock);
> +
> + if (range->flags.has_dma_mapping) {
> + for (i = 0, j = 0; i < npages; j++) {
> + struct drm_pagemap_dma_addr *addr = &range-
> >dma_addr[j];
> +
> + if (addr->proto == DRM_INTERCONNECT_SYSTEM)
> {
> + dma_unmap_page(dev,
> + addr->addr,
> + PAGE_SIZE << addr-
> >order,
> + addr->dir);
> + } else if (dpagemap && dpagemap->ops-
> >unmap_dma) {
> + dpagemap->ops->unmap_dma(dpagemap,
> + dev,
> + *addr);
> + }
> + i += 1 << addr->order;
> + }
> + range->flags.has_devmem_pages = false;
> + range->flags.has_dma_mapping = false;
> + range->dpagemap = NULL;
> + }
> +}
> +
> +/**
> + * drm_gpusvm_range_free_pages - Free pages associated with a GPU
> SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + *
> + * This function free pages associated with a GPU SVM range.
Frees the dma address array
> + */
> +static void drm_gpusvm_range_free_pages(struct drm_gpusvm *gpusvm,
> + struct drm_gpusvm_range
> *range)
> +{
> + lockdep_assert_held(&gpusvm->notifier_lock);
> +
> + if (range->dma_addr) {
> + kvfree(range->dma_addr);
> + range->dma_addr = NULL;
> + }
> +}
> +
> +/**
> + * drm_gpusvm_range_remove - Remove GPU SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range to be removed
> + *
> + * This function removes the specified GPU SVM range and also
> removes the parent
> + * GPU SVM notifier if no more ranges remain in the notifier. The
> caller must
> + * hold a lock to protect range and notifier removal.
> + */
> +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> + struct drm_gpusvm_range *range)
> +{
> + unsigned long npages = npages_in_range(range->va.start,
> range->va.end);
> + struct drm_gpusvm_notifier *notifier;
> +
> + notifier = drm_gpusvm_notifier_find(gpusvm, range-
> >va.start);
> + if (WARN_ON_ONCE(!notifier))
> + return;
> +
> + drm_gpusvm_notifier_lock(gpusvm);
> + __drm_gpusvm_range_unmap_pages(gpusvm, range, npages);
> + drm_gpusvm_range_free_pages(gpusvm, range);
> + __drm_gpusvm_range_remove(notifier, range);
> + drm_gpusvm_notifier_unlock(gpusvm);
> +
> + drm_gpusvm_range_put(range);
> +
> + if (RB_EMPTY_ROOT(¬ifier->root.rb_root)) {
> + if (!notifier->flags.removed)
> + mmu_interval_notifier_remove(¬ifier-
> >notifier);
> + drm_gpusvm_notifier_remove(gpusvm, notifier);
> + drm_gpusvm_notifier_free(gpusvm, notifier);
> + }
> +}
> +
> +/**
> + * drm_gpusvm_range_get - Get a reference to GPU SVM range
> + * @range: Pointer to the GPU SVM range
> + *
> + * This function increments the reference count of the specified GPU
> SVM range.
> + *
> + * Returns:
> + * Pointer to the GPU SVM range.
> + */
> +struct drm_gpusvm_range *
> +drm_gpusvm_range_get(struct drm_gpusvm_range *range)
> +{
> + kref_get(&range->refcount);
> +
> + return range;
> +}
> +
> +/**
> + * drm_gpusvm_range_destroy - Destroy GPU SVM range
> + * @refcount: Pointer to the reference counter embedded in the GPU
> SVM range
> + *
> + * This function destroys the specified GPU SVM range when its
> reference count
> + * reaches zero. If a custom range-free function is provided, it is
> invoked to
> + * free the range; otherwise, the range is deallocated using
> kfree().
> + */
> +static void drm_gpusvm_range_destroy(struct kref *refcount)
> +{
> + struct drm_gpusvm_range *range =
> + container_of(refcount, struct drm_gpusvm_range,
> refcount);
> + struct drm_gpusvm *gpusvm = range->gpusvm;
> +
> + if (gpusvm->ops->range_free)
> + gpusvm->ops->range_free(range);
> + else
> + kfree(range);
> +}
> +
> +/**
> + * drm_gpusvm_range_put - Put a reference to GPU SVM range
> + * @range: Pointer to the GPU SVM range
> + *
> + * This function decrements the reference count of the specified GPU
> SVM range
> + * and frees it when the count reaches zero.
> + */
> +void drm_gpusvm_range_put(struct drm_gpusvm_range *range)
> +{
> + kref_put(&range->refcount, drm_gpusvm_range_destroy);
> +}
> +
> +/**
> + * drm_gpusvm_range_pages_valid - GPU SVM range pages valid
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + *
> + * This function determines if a GPU SVM range pages are valid.
> Expected be
> + * called holding gpusvm->notifier_lock and as the last step before
> commiting a
> + * GPU binding.
> + *
> + * Returns:
> + * True if GPU SVM range has valid pages, False otherwise
> + */
> +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> + struct drm_gpusvm_range *range)
> +{
> + lockdep_assert_held(&gpusvm->notifier_lock);
> +
> + return range->flags.has_devmem_pages || range-
> >flags.has_dma_mapping;
> +}
> +
> +/**
> + * drm_gpusvm_range_pages_valid_unlocked - GPU SVM range pages valid
> unlocked
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + *
> + * This function determines if a GPU SVM range pages are valid.
> Expected be
> + * called without holding gpusvm->notifier_lock.
> + *
> + * Returns:
> + * True if GPU SVM range has valid pages, False otherwise
> + */
> +static bool
> +drm_gpusvm_range_pages_valid_unlocked(struct drm_gpusvm *gpusvm,
> + struct drm_gpusvm_range
> *range)
> +{
> + bool pages_valid;
> +
> + if (!range->dma_addr)
> + return false;
> +
> + drm_gpusvm_notifier_lock(gpusvm);
> + pages_valid = drm_gpusvm_range_pages_valid(gpusvm, range);
> + if (!pages_valid)
> + drm_gpusvm_range_free_pages(gpusvm, range);
> + drm_gpusvm_notifier_unlock(gpusvm);
> +
> + return pages_valid;
> +}
> +
> +/**
> + * drm_gpusvm_range_get_pages - Get pages for a GPU SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + * @ctx: GPU SVM context
> + *
> + * This function gets pages for a GPU SVM range and ensures they are
> mapped for
> + * DMA access.
> + *
> + * Returns:
> + * 0 on success, negative error code on failure.
> + */
> +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> + struct drm_gpusvm_range *range,
> + const struct drm_gpusvm_ctx *ctx)
> +{
> + struct mmu_interval_notifier *notifier = &range->notifier-
> >notifier;
> + struct hmm_range hmm_range = {
> + .default_flags = HMM_PFN_REQ_FAULT | (ctx->read_only
> ? 0 :
> + HMM_PFN_REQ_WRITE),
> + .notifier = notifier,
> + .start = range->va.start,
> + .end = range->va.end,
> + .dev_private_owner = gpusvm-
> >device_private_page_owner,
> + };
> + struct mm_struct *mm = gpusvm->mm;
> + struct drm_gpusvm_zdd *zdd;
> + unsigned long timeout =
> + jiffies +
> msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> + unsigned long i, j;
> + unsigned long npages = npages_in_range(range->va.start,
> range->va.end);
> + unsigned long num_dma_mapped;
> + unsigned int order = 0;
> + unsigned long *pfns;
> + struct page **pages;
> + int err = 0;
> + struct dev_pagemap *pagemap;
> + struct drm_pagemap *dpagemap;
> +
> +retry:
> + hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
> + if (drm_gpusvm_range_pages_valid_unlocked(gpusvm, range))
> + goto set_seqno;
> +
> + pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
> + if (!pfns)
> + return -ENOMEM;
> +
> + if (!mmget_not_zero(mm)) {
> + err = -EFAULT;
> + goto err_out;
> + }
> +
> + hmm_range.hmm_pfns = pfns;
> + while (true) {
> + mmap_read_lock(mm);
> + err = hmm_range_fault(&hmm_range);
> + mmap_read_unlock(mm);
> +
> + if (err == -EBUSY) {
> + if (time_after(jiffies, timeout))
> + break;
> +
> + hmm_range.notifier_seq =
> mmu_interval_read_begin(notifier);
> + continue;
> + }
> + break;
> + }
> + mmput(mm);
> + if (err)
> + goto err_free;
> +
> + pages = (struct page **)pfns;
> +map_pages:
> + /*
> + * Perform all dma mappings under the notifier lock to not
> + * access freed pages. A notifier will either block on
> + * the notifier lock or unmap dma.
> + */
> + drm_gpusvm_notifier_lock(gpusvm);
> + if (mmu_interval_read_retry(notifier,
> hmm_range.notifier_seq)) {
> + drm_gpusvm_notifier_unlock(gpusvm);
> + goto retry;
> + }
> +
> + if (!range->dma_addr) {
> + /* Unlock and restart mapping to allocate memory. */
> + drm_gpusvm_notifier_unlock(gpusvm);
> + range->dma_addr = kvmalloc_array(npages,
> sizeof(*range->dma_addr),
> + GFP_KERNEL);
> + if (!range->dma_addr) {
> + err = -ENOMEM;
> + goto err_free;
> + }
> + goto map_pages;
> + }
> +
> + zdd = NULL;
> + num_dma_mapped = 0;
> + for (i = 0, j = 0; i < npages; ++j) {
> + struct page *page = hmm_pfn_to_page(pfns[i]);
> +
> + order = hmm_pfn_to_map_order(pfns[i]);
> + if (is_device_private_page(page) ||
> is_device_coherent_page(page)) {
> + if (zdd != page->zone_device_data && i > 0)
> {
> + err = -EOPNOTSUPP;
> + goto err_unmap;
> + }
> + zdd = page->zone_device_data;
> + if (pagemap != page->pgmap) {
> + if (i > 0) {
> + err = -EOPNOTSUPP;
> + goto err_unmap;
> + }
> +
> + pagemap = page->pgmap;
> + dpagemap = zdd->devmem_allocation-
> >dpagemap;
> + if (drm_WARN_ON(gpusvm->drm,
> !dpagemap)) {
> + /*
> + * Raced. This is not
> supposed to happen
> + * since hmm_range_fault()
> should've migrated
> + * this page to system.
> + */
> + err = -EAGAIN;
> + goto err_unmap;
> + }
> + }
> + range->dma_addr[j] =
> + dpagemap->ops->map_dma(dpagemap,
> gpusvm->drm->dev,
> + page, order,
> +
> DMA_BIDIRECTIONAL);
> + if (dma_mapping_error(gpusvm->drm->dev,
> range->dma_addr[j].addr)) {
> + err = -EFAULT;
> + goto err_unmap;
> + }
> +
> + pages[i] = page;
> + } else {
> + dma_addr_t addr;
> +
> + if (is_zone_device_page(page) || zdd) {
> + err = -EOPNOTSUPP;
I suppose before merging we want to support mixed ranges since
migration is best effort only, or what are the plans here?
> + goto err_unmap;
> + }
> +
> + addr = dma_map_page(gpusvm->drm->dev,
> + page, 0,
> + PAGE_SIZE << order,
> + DMA_BIDIRECTIONAL);
> + if (dma_mapping_error(gpusvm->drm->dev,
> addr)) {
> + err = -EFAULT;
> + goto err_unmap;
> + }
> +
> + range->dma_addr[j] =
> drm_pagemap_dma_addr_encode
> + (addr, DRM_INTERCONNECT_SYSTEM,
> order,
> + DMA_BIDIRECTIONAL);
> + }
> + i += 1 << order;
> + num_dma_mapped = i;
> + }
> +
> + range->flags.has_dma_mapping = true;
> + if (zdd) {
> + range->flags.has_devmem_pages = true;
> + range->dpagemap = dpagemap;
> + }
> +
> + drm_gpusvm_notifier_unlock(gpusvm);
> + kvfree(pfns);
> +set_seqno:
> + range->notifier_seq = hmm_range.notifier_seq;
> +
> + return 0;
> +
> +err_unmap:
> + __drm_gpusvm_range_unmap_pages(gpusvm, range,
> num_dma_mapped);
> + drm_gpusvm_notifier_unlock(gpusvm);
> +err_free:
> + kvfree(pfns);
> +err_out:
> + if (err == -EAGAIN)
> + goto retry;
> + return err;
> +}
> +
> +/**
> + * drm_gpusvm_range_unmap_pages - Unmap pages associated with a GPU
> SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + * @ctx: GPU SVM context
> + *
> + * This function unmaps pages associated with a GPU SVM range. If
> @in_notifier
> + * is set, it is assumed that gpusvm->notifier_lock is held in write
> mode; if it
> + * is clear, it acquires gpusvm->notifier_lock in read mode. Must be
> called on
> + * each GPU SVM range attached to notifier in gpusvm->ops-
> >invalidate for IOMMU
> + * security model.
> + */
> +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> + struct drm_gpusvm_range *range,
> + const struct drm_gpusvm_ctx *ctx)
> +{
> + unsigned long npages = npages_in_range(range->va.start,
> range->va.end);
> +
> + if (ctx->in_notifier)
> + lockdep_assert_held_write(&gpusvm->notifier_lock);
> + else
> + drm_gpusvm_notifier_lock(gpusvm);
> +
> + __drm_gpusvm_range_unmap_pages(gpusvm, range, npages);
> +
> + if (!ctx->in_notifier)
> + drm_gpusvm_notifier_unlock(gpusvm);
> +}
NIT: Separate functions for locked / unlocked make life easier for
static code analyzers.
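Something like (sketch, names made up, ctx handling elided):

/* In-notifier variant; gpusvm->notifier_lock held in write mode. */
void drm_gpusvm_range_unmap_pages_in_notifier(struct drm_gpusvm *gpusvm,
					      struct drm_gpusvm_range *range)
{
	unsigned long npages = npages_in_range(range->va.start, range->va.end);

	lockdep_assert_held_write(&gpusvm->notifier_lock);
	__drm_gpusvm_range_unmap_pages(gpusvm, range, npages);
}

/* Called outside the notifier; takes gpusvm->notifier_lock internally. */
void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
				  struct drm_gpusvm_range *range)
{
	unsigned long npages = npages_in_range(range->va.start, range->va.end);

	drm_gpusvm_notifier_lock(gpusvm);
	__drm_gpusvm_range_unmap_pages(gpusvm, range, npages);
	drm_gpusvm_notifier_unlock(gpusvm);
}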
I think the section below should belong in drm_pagemap.c.
> +
> +/**
> + * drm_gpusvm_migration_put_page - Put a migration page
> + * @page: Pointer to the page to put
> + *
> + * This function unlocks and puts a page.
> + */
> +static void drm_gpusvm_migration_put_page(struct page *page)
_unlock_put_page()?
> +{
> + unlock_page(page);
> + put_page(page);
> +}
> +
> +/**
> + * drm_gpusvm_migration_put_pages - Put migration pages
> + * @npages: Number of pages
> + * @migrate_pfn: Array of migrate page frame numbers
> + *
> + * This function puts an array of pages.
> + */
> +static void drm_gpusvm_migration_put_pages(unsigned long npages,
> + unsigned long
> *migrate_pfn)
> +{
> + unsigned long i;
> +
> + for (i = 0; i < npages; ++i) {
> + if (!migrate_pfn[i])
> + continue;
> +
> + drm_gpusvm_migration_put_page(migrate_pfn_to_page(mi
> grate_pfn[i]));
> + migrate_pfn[i] = 0;
> + }
> +}
> +
> +/**
> + * drm_gpusvm_get_devmem_page - Get a reference to a device memory
> page
> + * @page: Pointer to the page
> + * @zdd: Pointer to the GPU SVM zone device data
> + *
> + * This function associates the given page with the specified GPU
> SVM zone
> + * device data and initializes it for zone device usage.
> + */
> +static void drm_gpusvm_get_devmem_page(struct page *page,
> + struct drm_gpusvm_zdd *zdd)
> +{
> + page->zone_device_data = drm_gpusvm_zdd_get(zdd);
> + zone_device_page_init(page);
> +}
> +
> +/**
> + * drm_gpusvm_migrate_map_pages() - Map migration pages for GPU SVM
> migration
> + * @dev: The device for which the pages are being mapped
> + * @dma_addr: Array to store DMA addresses corresponding to mapped
> pages
> + * @migrate_pfn: Array of migrate page frame numbers to map
> + * @npages: Number of pages to map
> + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> + *
> + * This function maps pages of memory for migration usage in GPU
> SVM. It
> + * iterates over each page frame number provided in @migrate_pfn,
> maps the
> + * corresponding page, and stores the DMA address in the provided
> @dma_addr
> + * array.
> + *
> + * Return: 0 on success, -EFAULT if an error occurs during mapping.
> + */
> +static int drm_gpusvm_migrate_map_pages(struct device *dev,
> + dma_addr_t *dma_addr,
> + long unsigned int
> *migrate_pfn,
> + unsigned long npages,
> + enum dma_data_direction dir)
> +{
> + unsigned long i;
> +
> + for (i = 0; i < npages; ++i) {
> + struct page *page =
> migrate_pfn_to_page(migrate_pfn[i]);
> +
> + if (!page)
> + continue;
> +
> + if (WARN_ON_ONCE(is_zone_device_page(page)))
> + return -EFAULT;
> +
> + dma_addr[i] = dma_map_page(dev, page, 0, PAGE_SIZE,
> dir);
> + if (dma_mapping_error(dev, dma_addr[i]))
> + return -EFAULT;
> + }
> +
> + return 0;
> +}
TBC'd
/Thomas
^ permalink raw reply [flat|nested] 129+ messages in thread* Re: [PATCH v2 05/29] drm/gpusvm: Add support for GPU Shared Virtual Memory
2024-11-04 15:25 ` Thomas Hellström
@ 2024-11-04 17:21 ` Matthew Brost
2024-11-04 18:59 ` Thomas Hellström
0 siblings, 1 reply; 129+ messages in thread
From: Matthew Brost @ 2024-11-04 17:21 UTC (permalink / raw)
To: Thomas Hellström
Cc: intel-xe, dri-devel, apopple, airlied, christian.koenig,
simona.vetter, felix.kuehling, dakr
On Mon, Nov 04, 2024 at 04:25:38PM +0100, Thomas Hellström wrote:
> On Tue, 2024-10-15 at 20:24 -0700, Matthew Brost wrote:
>
> Continued review.
>
> >
> > +/**
> > + * struct drm_gpusvm_zdd - GPU SVM zone device data
> > + *
> > + * @refcount: Reference count for the zdd
> > + * @destroy_work: Work structure for asynchronous zdd destruction
> > + * @devmem_allocation: device memory allocation
> > + * @device_private_page_owner: Device private pages owner
> > + *
> > + * This structure serves as a generic wrapper installed in
> > + * page->zone_device_data. It provides infrastructure for looking up
> > a device
> > + * memory allocation upon CPU page fault and asynchronously
> > releasing device
> > + * memory once the CPU has no page references. Asynchronous release
> > is useful
> > + * because CPU page references can be dropped in IRQ contexts, while
> > releasing
> > + * device memory likely requires sleeping locks.
> > + */
> > +struct drm_gpusvm_zdd {
> > + struct kref refcount;
> > + struct work_struct destroy_work;
> > + struct drm_gpusvm_devmem *devmem_allocation;
> > + void *device_private_page_owner;
> > +};
> > +
> > +/**
> > + * drm_gpusvm_zdd_destroy_work_func - Work function for destroying a
> > zdd
>
> NIT: Even if the above kerneldoc format works, I keep trying to enforce
> using () after function names and function-like macros, like described
> here: https://docs.kernel.org/doc-guide/kernel-doc.html Could we
> update? Also that doc calls for using "Return:" instead of "Returns:".
>
>
Will fix up. Thanks for the ref.
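For example, the convention applied to one of the comments above (wording left as-is;
only the () after the name and "Return:" differ):

/**
 * drm_gpusvm_zdd_alloc() - Allocate a zdd structure.
 * @device_private_page_owner: Device private pages owner
 *
 * This function allocates and initializes a new zdd structure. It sets up the
 * reference count and initializes the destroy work.
 *
 * Return: Pointer to the allocated zdd on success, ERR_PTR() on failure.
 */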
> > + * @w: Pointer to the work_struct
> > + *
> > + * This function releases device memory, puts GPU SVM range, and
> > frees zdd.
> > + */
> > +static void drm_gpusvm_zdd_destroy_work_func(struct work_struct *w)
> > +{
> > + struct drm_gpusvm_zdd *zdd =
> > + container_of(w, struct drm_gpusvm_zdd,
> > destroy_work);
> > + const struct drm_gpusvm_devmem_ops *ops = zdd-
> > >devmem_allocation ?
> > + zdd->devmem_allocation->ops : NULL;
> > +
> > + if (zdd->devmem_allocation && ops->devmem_release)
> > + ops->devmem_release(zdd->devmem_allocation);
> > + kfree(zdd);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_zdd_alloc - Allocate a zdd structure.
> > + * @device_private_page_owner: Device private pages owner
> > + *
> > + * This function allocates and initializes a new zdd structure. It
> > sets up the
> > + * reference count and initializes the destroy work.
> > + *
> > + * Returns:
> > + * Pointer to the allocated zdd on success, ERR_PTR() on failure.
> > + */
> > +static struct drm_gpusvm_zdd *
> > +drm_gpusvm_zdd_alloc(void *device_private_page_owner)
> > +{
> > + struct drm_gpusvm_zdd *zdd;
> > +
> > + zdd = kmalloc(sizeof(*zdd), GFP_KERNEL);
> > + if (!zdd)
> > + return NULL;
> > +
> > + kref_init(&zdd->refcount);
> > + INIT_WORK(&zdd->destroy_work,
> > drm_gpusvm_zdd_destroy_work_func);
> > + zdd->devmem_allocation = NULL;
> > + zdd->device_private_page_owner = device_private_page_owner;
> > +
> > + return zdd;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_zdd_get - Get a reference to a zdd structure.
> > + * @zdd: Pointer to the zdd structure.
> > + *
> > + * This function increments the reference count of the provided zdd
> > structure.
> > + *
> > + * Returns: Pointer to the zdd structure.
> > + */
> > +static struct drm_gpusvm_zdd *drm_gpusvm_zdd_get(struct
> > drm_gpusvm_zdd *zdd)
> > +{
> > + kref_get(&zdd->refcount);
> > + return zdd;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_zdd_destroy - Destroy a zdd structure.
> > + * @ref: Pointer to the reference count structure.
> > + *
> > + * This function queues the destroy_work of the zdd for asynchronous
> > destruction.
> > + */
> > +static void drm_gpusvm_zdd_destroy(struct kref *ref)
> > +{
> > + struct drm_gpusvm_zdd *zdd =
> > + container_of(ref, struct drm_gpusvm_zdd, refcount);
> > +
> > + if (zdd->devmem_allocation)
> > + WRITE_ONCE(zdd->devmem_allocation->detached, true);
> > + schedule_work(&zdd->destroy_work);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_zdd_put - Put a zdd reference.
> > + * @zdd: Pointer to the zdd structure.
> > + *
> > + * This function decrements the reference count of the provided zdd
> > structure
> > + * and schedules its destruction if the count drops to zero.
> > + */
> > +static void drm_gpusvm_zdd_put(struct drm_gpusvm_zdd *zdd)
> > +{
> > + kref_put(&zdd->refcount, drm_gpusvm_zdd_destroy);
> > +}
>
> As mentioned earlier, I think the above drm_gpusvm_zdd functions should
> move to drm_pagemap.c. I don't think they are used in drm_gpusvm other
> than to, at get_pages time, ensure all device private pages are from
> the same pagemap?
>
They are used in __drm_gpusvm_migrate_to_ram to find the devmem_allocation and
its associated ops.
Also in drm_gpusvm_migrate_to_ram to find the size and
device_private_page_owner.
I think the placement here is correct for now but am open to shuffling this
around in the future if that makes sense.
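To make the dependency concrete, a minimal sketch (the helper name below is made
up, and the migrate-to-ram path itself isn't quoted in this hunk) of how the CPU
fault path gets from a faulting device private page back to the driver's devmem
ops via the zdd:

static const struct drm_gpusvm_devmem_ops *
drm_gpusvm_page_to_devmem_ops(struct page *page)
{
	struct drm_gpusvm_zdd *zdd = page->zone_device_data;

	/* zdd also carries device_private_page_owner for the fault path */
	return zdd->devmem_allocation ? zdd->devmem_allocation->ops : NULL;
}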
> > +
> > +/**
> > + * drm_gpusvm_range_find - Find GPU SVM range from GPU SVM notifier
> > + * @notifier: Pointer to the GPU SVM notifier structure.
> > + * @start: Start address of the range
> > + * @end: End address of the range
> > + *
> > + * Return: A pointer to the drm_gpusvm_range if found or NULL
> > + */
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64
> > start, u64 end)
> > +{
> > + return range_iter_first(¬ifier->root, start, end - 1);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_for_each_range_safe - Safely iterate over GPU SVM
> > ranges in a notifier
> > + * @range__: Iterator variable for the ranges
> > + * @next__: Iterator variable for the ranges temporay storage
> > + * @notifier__: Pointer to the GPU SVM notifier
> > + * @start__: Start address of the range
> > + * @end__: End address of the range
> > + *
> > + * This macro is used to iterate over GPU SVM ranges in a notifier
> > while
> > + * removing ranges from it.
> > + */
> > +#define drm_gpusvm_for_each_range_safe(range__, next__, notifier__,
> > start__, end__) \
> > + for ((range__) = drm_gpusvm_range_find((notifier__),
> > (start__), (end__)), \
> > + (next__) =
> > __drm_gpusvm_range_next(range__); \
> > + (range__) && (range__->va.start <
> > (end__)); \
> > + (range__) = (next__), (next__) =
> > __drm_gpusvm_range_next(range__))
> > +
> > +/**
> > + * __drm_gpusvm_notifier_next - get the next drm_gpusvm_notifier in
> > the list
> > + * @notifier: a pointer to the current drm_gpusvm_notifier
> > + *
> > + * Return: A pointer to the next drm_gpusvm_notifier if available,
> > or NULL if
> > + * the current notifier is the last one or if the input
> > notifier is
> > + * NULL.
> > + */
> > +static struct drm_gpusvm_notifier *
> > +__drm_gpusvm_notifier_next(struct drm_gpusvm_notifier *notifier)
> > +{
> > + if (notifier && !list_is_last(¬ifier->rb.entry,
> > + ¬ifier->gpusvm-
> > >notifier_list))
> > + return list_next_entry(notifier, rb.entry);
>
> Why aren't we using notifier_iter_next() here? Then the linked list
> could be skipped.
>
I shamelessly copied this from GPU VM. I think the list is useful for faster
iteration and for safe removal of items while walking.
> > +
> > + return NULL;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_for_each_notifier - Iterate over GPU SVM notifiers in
> > a gpusvm
> > + * @notifier__: Iterator variable for the notifiers
> > + * @notifier__: Pointer to the GPU SVM notifier
> > + * @start__: Start address of the notifier
> > + * @end__: End address of the notifier
> > + *
> > + * This macro is used to iterate over GPU SVM notifiers in a gpusvm.
> > + */
> > +#define drm_gpusvm_for_each_notifier(notifier__, gpusvm__, start__,
> > end__) \
> > + for ((notifier__) = notifier_iter_first(&(gpusvm__)->root,
> > (start__), (end__) - 1); \
> > + (notifier__) && (notifier__->interval.start <
> > (end__)); \
> > + (notifier__) = __drm_gpusvm_notifier_next(notifier__))
> > +
>
> Looks like end__ is not honored except for the first iteration. Relates
> to the above question.
>
Again a shameless copy from GPU VM... I'm missing what the problem is. The
condition that breaks the loop is:
'(notifier__) && (notifier__->interval.start < (end__)'
> > +/**
> > + * drm_gpusvm_for_each_notifier_safe - Safely iterate over GPU SVM
> > notifiers in a gpusvm
> > + * @notifier__: Iterator variable for the notifiers
> > + * @next__: Iterator variable for the notifiers temporay storage
> > + * @notifier__: Pointer to the GPU SVM notifier
> > + * @start__: Start address of the notifier
> > + * @end__: End address of the notifier
> > + *
> > + * This macro is used to iterate over GPU SVM notifiers in a gpusvm
> > while
> > + * removing notifiers from it.
> > + */
> > +#define drm_gpusvm_for_each_notifier_safe(notifier__, next__,
> > gpusvm__, start__, end__) \
> > + for ((notifier__) = notifier_iter_first(&(gpusvm__)->root,
> > (start__), (end__) - 1), \
> > + (next__) =
> > __drm_gpusvm_notifier_next(notifier__); \
> > + (notifier__) && (notifier__->interval.start <
> > (end__)); \
> > + (notifier__) = (next__), (next__) =
> > __drm_gpusvm_notifier_next(notifier__))
>
> Same here.
>
Also present here:
(notifier__) && (notifier__->interval.start < (end__)
> > +
> > +/**
> > + * drm_gpusvm_notifier_invalidate - Invalidate a GPU SVM notifier.
> > + * @mni: Pointer to the mmu_interval_notifier structure.
> > + * @mmu_range: Pointer to the mmu_notifier_range structure.
> > + * @cur_seq: Current sequence number.
> > + *
> > + * This function serves as a generic MMU notifier for GPU SVM. It
> > sets the MMU
> > + * notifier sequence number and calls the driver invalidate vfunc
> > under
> > + * gpusvm->notifier_lock.
> > + *
> > + * Returns:
> > + * true if the operation succeeds, false otherwise.
> > + */
> > +static bool
> > +drm_gpusvm_notifier_invalidate(struct mmu_interval_notifier *mni,
> > + const struct mmu_notifier_range
> > *mmu_range,
> > + unsigned long cur_seq)
> > +{
> > + struct drm_gpusvm_notifier *notifier =
> > + container_of(mni, typeof(*notifier), notifier);
> > + struct drm_gpusvm *gpusvm = notifier->gpusvm;
> > +
> > + if (!mmu_notifier_range_blockable(mmu_range))
> > + return false;
> > +
> > + down_write(&gpusvm->notifier_lock);
> > + mmu_interval_set_seq(mni, cur_seq);
> > + gpusvm->ops->invalidate(gpusvm, notifier, mmu_range);
> > + up_write(&gpusvm->notifier_lock);
> > +
> > + return true;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_notifier_ops - MMU interval notifier operations for
> > GPU SVM
> > + */
> > +static const struct mmu_interval_notifier_ops
> > drm_gpusvm_notifier_ops = {
> > + .invalidate = drm_gpusvm_notifier_invalidate,
> > +};
> > +
> > +/**
> > + * drm_gpusvm_init - Initialize the GPU SVM.
> > + * @gpusvm: Pointer to the GPU SVM structure.
> > + * @name: Name of the GPU SVM.
> > + * @drm: Pointer to the DRM device structure.
> > + * @mm: Pointer to the mm_struct for the address space.
> > + * @device_private_page_owner: Device private pages owner.
> > + * @mm_start: Start address of GPU SVM.
> > + * @mm_range: Range of the GPU SVM.
> > + * @notifier_size: Size of individual notifiers.
> > + * @ops: Pointer to the operations structure for GPU SVM.
> > + * @chunk_sizes: Pointer to the array of chunk sizes used in range
> > allocation.
> > + * Entries should be powers of 2 in descending order
> > with last
> > + * entry being SZ_4K.
> > + * @num_chunks: Number of chunks.
> > + *
> > + * This function initializes the GPU SVM.
> > + *
> > + * Returns:
> > + * 0 on success, a negative error code on failure.
> > + */
> > +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> > + const char *name, struct drm_device *drm,
> > + struct mm_struct *mm, void
> > *device_private_page_owner,
> > + u64 mm_start, u64 mm_range, u64 notifier_size,
> > + const struct drm_gpusvm_ops *ops,
> > + const u64 *chunk_sizes, int num_chunks)
> > +{
> > + if (!ops->invalidate || !num_chunks)
> > + return -EINVAL;
> > +
> > + gpusvm->name = name;
> > + gpusvm->drm = drm;
> > + gpusvm->mm = mm;
> > + gpusvm->device_private_page_owner =
> > device_private_page_owner;
> > + gpusvm->mm_start = mm_start;
> > + gpusvm->mm_range = mm_range;
> > + gpusvm->notifier_size = notifier_size;
> > + gpusvm->ops = ops;
> > + gpusvm->chunk_sizes = chunk_sizes;
> > + gpusvm->num_chunks = num_chunks;
> > +
> > + mmgrab(mm);
> > + gpusvm->root = RB_ROOT_CACHED;
> > + INIT_LIST_HEAD(&gpusvm->notifier_list);
> > +
> > + init_rwsem(&gpusvm->notifier_lock);
> > +
> > + fs_reclaim_acquire(GFP_KERNEL);
> > + might_lock(&gpusvm->notifier_lock);
> > + fs_reclaim_release(GFP_KERNEL);
> > +
> > + return 0;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_notifier_find - Find GPU SVM notifier
> > + * @gpusvm__: Pointer to the GPU SVM structure
> > + * @fault_addr__: Fault address
> > + *
> > + * This macro finds the GPU SVM notifier associated with the fault
> > address.
> > + *
> > + * Returns:
> > + * Pointer to the GPU SVM notifier on success, NULL otherwise.
> > + */
> > +#define drm_gpusvm_notifier_find(gpusvm__, fault_addr__) \
> > + notifier_iter_first(&(gpusvm__)->root, (fault_addr__), \
> > + (fault_addr__ + 1))
> > +
> > +/**
> > + * to_drm_gpusvm_notifier - retrieve the container struct for a
> > given rbtree node
> > + * @node__: a pointer to the rbtree node embedded within a
> > drm_gpusvm_notifier struct
> > + *
> > + * Return: A pointer to the containing drm_gpusvm_notifier
> > structure.
> > + */
> > +#define to_drm_gpusvm_notifier(__node) \
> > + container_of((__node), struct drm_gpusvm_notifier, rb.node)
> > +
>
> There appears to be a number of function-like macros in the code, which
> look like they can be converted to functions. Linux prefers functions
> over macros when possible:
>
> https://www.kernel.org/doc/html/v5.8/process/coding-style.html#macros-enums-and-rtl
>
Will convert all macros to functions where possible. Again, thanks for the ref.
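For instance, the lookup macro above could become something like (sketch):

static struct drm_gpusvm_notifier *
drm_gpusvm_notifier_find(struct drm_gpusvm *gpusvm, u64 fault_addr)
{
	return notifier_iter_first(&gpusvm->root, fault_addr, fault_addr + 1);
}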
>
> > +/**
> > + * drm_gpusvm_notifier_insert - Insert GPU SVM notifier
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + *
> > + * This function inserts the GPU SVM notifier into the GPU SVM RB
> > tree and list.
> > + */
> > +static void drm_gpusvm_notifier_insert(struct drm_gpusvm *gpusvm,
> > + struct drm_gpusvm_notifier
> > *notifier)
> > +{
> > + struct rb_node *node;
> > + struct list_head *head;
> > +
> > + notifier_insert(notifier, &gpusvm->root);
> > +
> > + node = rb_prev(¬ifier->rb.node);
> > + if (node)
> > + head = &(to_drm_gpusvm_notifier(node))->rb.entry;
> > + else
> > + head = &gpusvm->notifier_list;
> > +
> > + list_add(¬ifier->rb.entry, head);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_notifier_remove - Remove GPU SVM notifier
> > + * @gpusvm__: Pointer to the GPU SVM tructure
> > + * @notifier__: Pointer to the GPU SVM notifier structure
> > + *
> > + * This macro removes the GPU SVM notifier from the GPU SVM RB tree
> > and list.
> > + */
> > +#define drm_gpusvm_notifier_remove(gpusvm__, notifier__) \
> > + notifier_remove((notifier__), &(gpusvm__)->root); \
> > + list_del(&(notifier__)->rb.entry)
>
> Unless this can be made a function, Pls use
> do { } while (0)
>
I think it can be made a function; otherwise, yes, I will use do { } while (0).
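Sketch of the function version:

static void drm_gpusvm_notifier_remove(struct drm_gpusvm *gpusvm,
				       struct drm_gpusvm_notifier *notifier)
{
	notifier_remove(notifier, &gpusvm->root);
	list_del(&notifier->rb.entry);
}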
>
> > +
> > +/**
> > + * drm_gpusvm_fini - Finalize the GPU SVM.
> > + * @gpusvm: Pointer to the GPU SVM structure.
> > + *
> > + * This function finalizes the GPU SVM by cleaning up any remaining
> > ranges and
> > + * notifiers, and dropping a reference to struct MM.
> > + */
> > +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm)
> > +{
> > + struct drm_gpusvm_notifier *notifier, *next;
> > +
> > + drm_gpusvm_for_each_notifier_safe(notifier, next, gpusvm, 0,
> > LONG_MAX) {
> > + struct drm_gpusvm_range *range, *__next;
> > +
> > + /*
> > + * Remove notifier first to avoid racing with any
> > invalidation
> > + */
> > + mmu_interval_notifier_remove(¬ifier->notifier);
> > + notifier->flags.removed = true;
> > +
> > + drm_gpusvm_for_each_range_safe(range, __next,
> > notifier, 0,
> > + LONG_MAX)
> > + drm_gpusvm_range_remove(gpusvm, range);
> > + }
> > +
> > + mmdrop(gpusvm->mm);
> > + WARN_ON(!RB_EMPTY_ROOT(&gpusvm->root.rb_root));
> > +}
> > +
> > +/**
> > + * drm_gpusvm_notifier_alloc - Allocate GPU SVM notifier
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @fault_addr: Fault address
> > + *
> > + * This function allocates and initializes the GPU SVM notifier
> > structure.
> > + *
> > + * Returns:
> > + * Pointer to the allocated GPU SVM notifier on success, ERR_PTR()
> > on failure.
> > + */
> > +static struct drm_gpusvm_notifier *
> > +drm_gpusvm_notifier_alloc(struct drm_gpusvm *gpusvm, u64 fault_addr)
> > +{
> > + struct drm_gpusvm_notifier *notifier;
> > +
> > + if (gpusvm->ops->notifier_alloc)
> > + notifier = gpusvm->ops->notifier_alloc();
> > + else
> > + notifier = kzalloc(sizeof(*notifier), GFP_KERNEL);
> > +
> > + if (!notifier)
> > + return ERR_PTR(-ENOMEM);
> > +
> > + notifier->gpusvm = gpusvm;
> > + notifier->interval.start = ALIGN_DOWN(fault_addr, gpusvm-
> > >notifier_size);
> > + notifier->interval.end = ALIGN(fault_addr + 1, gpusvm-
> > >notifier_size);
> > + INIT_LIST_HEAD(¬ifier->rb.entry);
> > + notifier->root = RB_ROOT_CACHED;
> > + INIT_LIST_HEAD(¬ifier->range_list);
> > +
> > + return notifier;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_notifier_free - Free GPU SVM notifier
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + *
> > + * This function frees the GPU SVM notifier structure.
> > + */
> > +static void drm_gpusvm_notifier_free(struct drm_gpusvm *gpusvm,
> > + struct drm_gpusvm_notifier
> > *notifier)
> > +{
> > + WARN_ON(!RB_EMPTY_ROOT(¬ifier->root.rb_root));
> > +
> > + if (gpusvm->ops->notifier_free)
> > + gpusvm->ops->notifier_free(notifier);
> > + else
> > + kfree(notifier);
> > +}
> > +
> > +/**
> > + * to_drm_gpusvm_range - retrieve the container struct for a given
> > rbtree node
> > + * @node__: a pointer to the rbtree node embedded within a
> > drm_gpusvm_range struct
> > + *
> > + * Return: A pointer to the containing drm_gpusvm_range structure.
> > + */
> > +#define to_drm_gpusvm_range(node__) \
> > + container_of((node__), struct drm_gpusvm_range, rb.node)
> > +
> > +/**
> > + * drm_gpusvm_range_insert - Insert GPU SVM range
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + * @range: Pointer to the GPU SVM range structure
> > + *
> > + * This function inserts the GPU SVM range into the notifier RB tree
> > and list.
> > + */
> > +static void drm_gpusvm_range_insert(struct drm_gpusvm_notifier
> > *notifier,
> > + struct drm_gpusvm_range *range)
> > +{
> > + struct rb_node *node;
> > + struct list_head *head;
> > +
> > + drm_gpusvm_notifier_lock(notifier->gpusvm);
> > + range_insert(range, ¬ifier->root);
> > +
> > + node = rb_prev(&range->rb.node);
> > + if (node)
> > + head = &(to_drm_gpusvm_range(node))->rb.entry;
> > + else
> > + head = ¬ifier->range_list;
> > +
> > + list_add(&range->rb.entry, head);
> > + drm_gpusvm_notifier_unlock(notifier->gpusvm);
> > +}
> > +
> > +/**
> > + * __drm_gpusvm_range_remove - Remove GPU SVM range
> > + * @notifier__: Pointer to the GPU SVM notifier structure
> > + * @range__: Pointer to the GPU SVM range structure
> > + *
> > + * This macro removes the GPU SVM range from the notifier RB tree
> > and list.
> > + */
> > +#define __drm_gpusvm_range_remove(notifier__, range__) \
> > + range_remove((range__), &(notifier__)->root); \
> > + list_del(&(range__)->rb.entry)
>
> Same thing as for the notifier rb tree. And do we need the linked list?
>
Same answer.
>
> > +
> > +/**
> > + * drm_gpusvm_range_alloc - Allocate GPU SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + * @fault_addr: Fault address
> > + * @chunk_size: Chunk size
> > + * @migrate_devmem: Flag indicating whether to migrate device memory
> > + *
> > + * This function allocates and initializes the GPU SVM range
> > structure.
> > + *
> > + * Returns:
> > + * Pointer to the allocated GPU SVM range on success, ERR_PTR() on
> > failure.
> > + */
> > +static struct drm_gpusvm_range *
> > +drm_gpusvm_range_alloc(struct drm_gpusvm *gpusvm,
> > + struct drm_gpusvm_notifier *notifier,
> > + u64 fault_addr, u64 chunk_size, bool
> > migrate_devmem)
> > +{
> > + struct drm_gpusvm_range *range;
> > +
> > + if (gpusvm->ops->range_alloc)
> > + range = gpusvm->ops->range_alloc(gpusvm);
> > + else
> > + range = kzalloc(sizeof(*range), GFP_KERNEL);
> > +
> > + if (!range)
> > + return ERR_PTR(-ENOMEM);
> > +
> > + kref_init(&range->refcount);
> > + range->gpusvm = gpusvm;
> > + range->notifier = notifier;
> > + range->va.start = ALIGN_DOWN(fault_addr, chunk_size);
> > + range->va.end = ALIGN(fault_addr + 1, chunk_size);
> > + INIT_LIST_HEAD(&range->rb.entry);
> > + range->notifier_seq = LONG_MAX;
> > + range->flags.migrate_devmem = migrate_devmem ? 1 : 0;
> > +
> > + return range;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_check_pages - Check pages
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + * @start: Start address
> > + * @end: End address
> > + *
> > + * Check if pages between start and end have been faulted in on the
> > CPU. Use to
> > + * prevent migration of pages without CPU backing store.
> > + *
> > + * Returns:
> > + * True if pages have been faulted into CPU, False otherwise
> > + */
> > +static bool drm_gpusvm_check_pages(struct drm_gpusvm *gpusvm,
> > + struct drm_gpusvm_notifier
> > *notifier,
> > + u64 start, u64 end)
> > +{
> > + struct hmm_range hmm_range = {
> > + .default_flags = 0,
> > + .notifier = ¬ifier->notifier,
> > + .start = start,
> > + .end = end,
> > + .dev_private_owner = gpusvm-
> > >device_private_page_owner,
> > + };
> > + unsigned long timeout =
> > + jiffies +
> > msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > + unsigned long *pfns;
> > + unsigned long npages = npages_in_range(start, end);
> > + int err, i;
> > +
> > + mmap_assert_locked(gpusvm->mm);
> > +
> > + pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
> > + if (!pfns)
> > + return false;
> > +
> > + hmm_range.notifier_seq = mmu_interval_read_begin(¬ifier-
> > >notifier);
> > + hmm_range.hmm_pfns = pfns;
> > +
> > + while (true) {
> > + err = hmm_range_fault(&hmm_range);
> > + if (err == -EBUSY) {
> > + if (time_after(jiffies, timeout))
> > + break;
> > +
> > + hmm_range.notifier_seq =
> > mmu_interval_read_begin(¬ifier->notifier);
> > + continue;
> > + }
> > + break;
> > + }
> > + if (err)
> > + goto err_free;
> > +
> > + for (i = 0; i < npages;) {
> > + if (!(pfns[i] & HMM_PFN_VALID)) {
> > + err = -EFAULT;
> > + goto err_free;
> > + }
> > + i += 0x1 << hmm_pfn_to_map_order(pfns[i]);
> > + }
> > +
> > +err_free:
> > + kvfree(pfns);
> > + return err ? false : true;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_chunk_size - Determine chunk size for GPU SVM
> > range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + * @vas: Pointer to the virtual memory area structure
> > + * @fault_addr: Fault address
> > + * @gpuva_start: Start address of GPUVA which mirrors CPU
> > + * @gpuva_end: End address of GPUVA which mirrors CPU
> > + * @check_pages: Flag indicating whether to check pages
> > + *
> > + * This function determines the chunk size for the GPU SVM range
> > based on the
> > + * fault address, GPU SVM chunk sizes, existing GPU SVM ranges, and
> > the virtual
> > + * memory area boundaries.
> > + *
> > + * Returns:
> > + * Chunk size on success, LONG_MAX on failure.
> > + */
> > +static u64 drm_gpusvm_range_chunk_size(struct drm_gpusvm *gpusvm,
> > + struct drm_gpusvm_notifier
> > *notifier,
> > + struct vm_area_struct *vas,
> > + u64 fault_addr, u64
> > gpuva_start,
> > + u64 gpuva_end, bool
> > check_pages)
> > +{
> > + u64 start, end;
> > + int i = 0;
> > +
> > +retry:
> > + for (; i < gpusvm->num_chunks; ++i) {
> > + start = ALIGN_DOWN(fault_addr, gpusvm-
> > >chunk_sizes[i]);
> > + end = ALIGN(fault_addr + 1, gpusvm->chunk_sizes[i]);
> > +
> > + if (start >= vas->vm_start && end <= vas->vm_end &&
> > + start >= notifier->interval.start &&
> > + end <= notifier->interval.end &&
> > + start >= gpuva_start && end <= gpuva_end)
> > + break;
> > + }
> > +
> > + if (i == gpusvm->num_chunks)
> > + return LONG_MAX;
> > +
> > + /*
> > + * If allocation more than page, ensure not to overlap with
> > existing
> > + * ranges.
> > + */
> > + if (end - start != SZ_4K) {
> > + struct drm_gpusvm_range *range;
> > +
> > + range = drm_gpusvm_range_find(notifier, start, end);
> > + if (range) {
> > + ++i;
> > + goto retry;
> > + }
> > +
> > + /*
> > + * XXX: Only create range on pages CPU has faulted
> > in. Without
> > + * this check, or prefault, on BMG
> > 'xe_exec_system_allocator --r
> > + * process-many-malloc' fails. In the failure case,
> > each process
> > + * mallocs 16k but the CPU VMA is ~128k which
> > results in 64k SVM
> > + * ranges. When migrating the SVM ranges, some
> > processes fail in
> > + * drm_gpusvm_migrate_to_devmem with 'migrate.cpages
> > != npages'
> > + * and then upon drm_gpusvm_range_get_pages device
> > pages from
> > + * other processes are collected + faulted in which
> > creates all
> > + * sorts of problems. Unsure exactly how this
> > happening, also
> > + * problem goes away if 'xe_exec_system_allocator --
> > r
> > + * process-many-malloc' mallocs at least 64k at a
> > time.
> > + */
>
> Needs to be figured out. I think even in the system allocator case, if
> a user uses malloc() to allocate a GPU only buffer we'd need to support
> that?
>
I'm not quite following this comment, but I do agree that what is going on here
needs to be figured out.
The code comment is actually a bit stale - I think the above test case will pass
now if ctx.check_pages is false, with a retry loop triggered in the GPU fault
handler because of mixed pages. However, the test case still appears to find
device pages in hmm_range_fault mapped into a different process, which I think
should be impossible. I'm wondering whether there is an hmm / mm core bug here
that my test case hits. Let me page this information back in and dig in to see
if I can explain what is going on better. It will take some time, but I should be
able to focus on this during the week.
Also, I think leaving in the check_pages option is a good thing. A caller can
then choose between two things (see the caller-side sketch below):
1. Only create GPU mappings for CPU pages already faulted in (ctx.check_pages =
true)
2. Create GPU mappings for a VMA and fault in CPU pages (ctx.check_pages =
false)
If we support 2, then I think xe_svm_copy needs to be updated to clear VRAM for
pages which the CPU has not faulted in.
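Caller-side sketch of the two modes (only the check_pages flag differs; the
context fields are the ones already used in this patch):

	struct drm_gpusvm_ctx ctx = {
		.devmem_possible = true,
		.check_pages = true,	/* false selects behavior 2 */
	};
	struct drm_gpusvm_range *range;

	range = drm_gpusvm_range_find_or_insert(gpusvm, fault_addr,
						gpuva_start, gpuva_end, &ctx);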
>
> > + if (check_pages &&
> > + !drm_gpusvm_check_pages(gpusvm, notifier, start,
> > end)) {
> > + ++i;
> > + goto retry;
> > + }
> > + }
> > +
> > + return end - start;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_find_or_insert - Find or insert GPU SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @fault_addr: Fault address
> > + * @gpuva_start: Start address of GPUVA which mirrors CPU
> > + * @gpuva_end: End address of GPUVA which mirrors CPU
> > + * @ctx: GPU SVM context
> > + *
> > + * This function finds or inserts a newly allocated a GPU SVM range
> > based on the
> > + * fault address. Caller must hold a lock to protect range lookup
> > and insertion.
> > + *
> > + * Returns:
> > + * Pointer to the GPU SVM range on success, ERR_PTR() on failure.
> > + */
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64
> > fault_addr,
> > + u64 gpuva_start, u64 gpuva_end,
> > + const struct drm_gpusvm_ctx *ctx)
> > +{
> > + struct drm_gpusvm_notifier *notifier;
> > + struct drm_gpusvm_range *range;
> > + struct mm_struct *mm = gpusvm->mm;
> > + struct vm_area_struct *vas;
> > + bool notifier_alloc = false;
> > + u64 chunk_size;
> > + int err;
> > + bool migrate_devmem;
> > +
> > + if (fault_addr < gpusvm->mm_start ||
> > + fault_addr > gpusvm->mm_start + gpusvm->mm_range) {
>
> return ERR_PTR(-EINVAL)?
>
Sure.
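i.e. something like (sketch of the simplified entry checks):

	if (fault_addr < gpusvm->mm_start ||
	    fault_addr > gpusvm->mm_start + gpusvm->mm_range)
		return ERR_PTR(-EINVAL);

	if (!mmget_not_zero(mm))
		return ERR_PTR(-EFAULT);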
> > + err = -EINVAL;
> > + goto err_out;
> > + }
> > +
> > + if (!mmget_not_zero(mm)) {
> > + err = -EFAULT;
> > + goto err_out;
And here too.
> > + }
> > +
> > + notifier = drm_gpusvm_notifier_find(gpusvm, fault_addr);
> > + if (!notifier) {
> > + notifier = drm_gpusvm_notifier_alloc(gpusvm,
> > fault_addr);
> > + if (IS_ERR(notifier)) {
> > + err = PTR_ERR(notifier);
> > + goto err_mmunlock;
> > + }
> > + notifier_alloc = true;
> > + err = mmu_interval_notifier_insert(¬ifier-
> > >notifier,
> > + mm, notifier-
> > >interval.start,
> > + notifier-
> > >interval.end -
> > + notifier-
> > >interval.start,
> > +
> > &drm_gpusvm_notifier_ops);
> > + if (err)
> > + goto err_notifier;
> > + }
> > +
> > + mmap_read_lock(mm);
> > +
> > + vas = vma_lookup(mm, fault_addr);
> > + if (!vas) {
> > + err = -ENOENT;
> > + goto err_notifier_remove;
> > + }
> > +
> > + if (!ctx->read_only && !(vas->vm_flags & VM_WRITE)) {
> > + err = -EPERM;
> > + goto err_notifier_remove;
> > + }
> > +
> > + range = drm_gpusvm_range_find(notifier, fault_addr,
> > fault_addr + 1);
> > + if (range)
> > + goto out_mmunlock;
> > + /*
> > + * XXX: Short-circuiting migration based on migrate_vma_*
> > current
> > + * limitations. If/when migrate_vma_* add more support, this
> > logic will
> > + * have to change.
> > + */
> > + migrate_devmem = ctx->devmem_possible &&
> > + vma_is_anonymous(vas) && !is_vm_hugetlb_page(vas);
> > +
> > + chunk_size = drm_gpusvm_range_chunk_size(gpusvm, notifier,
> > vas,
> > + fault_addr,
> > gpuva_start,
> > + gpuva_end,
> > migrate_devmem &&
> > + ctx->check_pages);
> > + if (chunk_size == LONG_MAX) {
> > + err = -EINVAL;
> > + goto err_notifier_remove;
> > + }
> > +
> > + range = drm_gpusvm_range_alloc(gpusvm, notifier, fault_addr,
> > chunk_size,
> > + migrate_devmem);
> > + if (IS_ERR(range)) {
> > + err = PTR_ERR(range);
> > + goto err_notifier_remove;
> > + }
> > +
> > + drm_gpusvm_range_insert(notifier, range);
> > + if (notifier_alloc)
> > + drm_gpusvm_notifier_insert(gpusvm, notifier);
> > +
> > +out_mmunlock:
> > + mmap_read_unlock(mm);
> > + mmput(mm);
> > +
> > + return range;
> > +
> > +err_notifier_remove:
> > + mmap_read_unlock(mm);
> > + if (notifier_alloc)
> > + mmu_interval_notifier_remove(¬ifier->notifier);
> > +err_notifier:
> > + if (notifier_alloc)
> > + drm_gpusvm_notifier_free(gpusvm, notifier);
> > +err_mmunlock:
> > + mmput(mm);
> > +err_out:
> > + return ERR_PTR(err);
> > +}
> > +
> > +/**
> > + * __drm_gpusvm_range_unmap_pages - Unmap pages associated with a
> > GPU SVM range (internal)
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + * @npages: Number of pages to unmap
> > + *
> > + * This function unmap pages associated with a GPU SVM range.
> > Assumes and
> > + * asserts correct locking is in place when called.
> > + */
> > +static void __drm_gpusvm_range_unmap_pages(struct drm_gpusvm
> > *gpusvm,
> > + struct drm_gpusvm_range
> > *range,
> > + unsigned long npages)
> > +{
> > + unsigned long i, j;
> > + struct drm_pagemap *dpagemap = range->dpagemap;
> > + struct device *dev = gpusvm->drm->dev;
> > +
> > + lockdep_assert_held(&gpusvm->notifier_lock);
> > +
> > + if (range->flags.has_dma_mapping) {
> > + for (i = 0, j = 0; i < npages; j++) {
> > + struct drm_pagemap_dma_addr *addr = &range-
> > >dma_addr[j];
> > +
> > + if (addr->proto == DRM_INTERCONNECT_SYSTEM)
> > {
> > + dma_unmap_page(dev,
> > + addr->addr,
> > + PAGE_SIZE << addr-
> > >order,
> > + addr->dir);
> > + } else if (dpagemap && dpagemap->ops-
> > >unmap_dma) {
> > + dpagemap->ops->unmap_dma(dpagemap,
> > + dev,
> > + *addr);
> > + }
> > + i += 1 << addr->order;
> > + }
> > + range->flags.has_devmem_pages = false;
> > + range->flags.has_dma_mapping = false;
> > + range->dpagemap = NULL;
> > + }
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_free_pages - Free pages associated with a GPU
> > SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + *
> > + * This function free pages associated with a GPU SVM range.
>
> Frees the dma address array
>
Yes.
>
> > + */
> > +static void drm_gpusvm_range_free_pages(struct drm_gpusvm *gpusvm,
> > + struct drm_gpusvm_range
> > *range)
> > +{
> > + lockdep_assert_held(&gpusvm->notifier_lock);
> > +
> > + if (range->dma_addr) {
> > + kvfree(range->dma_addr);
> > + range->dma_addr = NULL;
> > + }
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_remove - Remove GPU SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range to be removed
> > + *
> > + * This function removes the specified GPU SVM range and also
> > removes the parent
> > + * GPU SVM notifier if no more ranges remain in the notifier. The
> > caller must
> > + * hold a lock to protect range and notifier removal.
> > + */
> > +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> > + struct drm_gpusvm_range *range)
> > +{
> > + unsigned long npages = npages_in_range(range->va.start,
> > range->va.end);
> > + struct drm_gpusvm_notifier *notifier;
> > +
> > + notifier = drm_gpusvm_notifier_find(gpusvm, range-
> > >va.start);
> > + if (WARN_ON_ONCE(!notifier))
> > + return;
> > +
> > + drm_gpusvm_notifier_lock(gpusvm);
> > + __drm_gpusvm_range_unmap_pages(gpusvm, range, npages);
> > + drm_gpusvm_range_free_pages(gpusvm, range);
> > + __drm_gpusvm_range_remove(notifier, range);
> > + drm_gpusvm_notifier_unlock(gpusvm);
> > +
> > + drm_gpusvm_range_put(range);
> > +
> > + if (RB_EMPTY_ROOT(¬ifier->root.rb_root)) {
> > + if (!notifier->flags.removed)
> > + mmu_interval_notifier_remove(¬ifier-
> > >notifier);
> > + drm_gpusvm_notifier_remove(gpusvm, notifier);
> > + drm_gpusvm_notifier_free(gpusvm, notifier);
> > + }
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_get - Get a reference to GPU SVM range
> > + * @range: Pointer to the GPU SVM range
> > + *
> > + * This function increments the reference count of the specified GPU
> > SVM range.
> > + *
> > + * Returns:
> > + * Pointer to the GPU SVM range.
> > + */
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_get(struct drm_gpusvm_range *range)
> > +{
> > + kref_get(&range->refcount);
> > +
> > + return range;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_destroy - Destroy GPU SVM range
> > + * @refcount: Pointer to the reference counter embedded in the GPU
> > SVM range
> > + *
> > + * This function destroys the specified GPU SVM range when its
> > reference count
> > + * reaches zero. If a custom range-free function is provided, it is
> > invoked to
> > + * free the range; otherwise, the range is deallocated using
> > kfree().
> > + */
> > +static void drm_gpusvm_range_destroy(struct kref *refcount)
> > +{
> > + struct drm_gpusvm_range *range =
> > + container_of(refcount, struct drm_gpusvm_range,
> > refcount);
> > + struct drm_gpusvm *gpusvm = range->gpusvm;
> > +
> > + if (gpusvm->ops->range_free)
> > + gpusvm->ops->range_free(range);
> > + else
> > + kfree(range);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_put - Put a reference to GPU SVM range
> > + * @range: Pointer to the GPU SVM range
> > + *
> > + * This function decrements the reference count of the specified GPU
> > SVM range
> > + * and frees it when the count reaches zero.
> > + */
> > +void drm_gpusvm_range_put(struct drm_gpusvm_range *range)
> > +{
> > + kref_put(&range->refcount, drm_gpusvm_range_destroy);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_pages_valid - GPU SVM range pages valid
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + *
> > + * This function determines if a GPU SVM range pages are valid.
> > Expected be
> > + * called holding gpusvm->notifier_lock and as the last step before
> > commiting a
> > + * GPU binding.
> > + *
> > + * Returns:
> > + * True if GPU SVM range has valid pages, False otherwise
> > + */
> > +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> > + struct drm_gpusvm_range *range)
> > +{
> > + lockdep_assert_held(&gpusvm->notifier_lock);
> > +
> > + return range->flags.has_devmem_pages || range-
> > >flags.has_dma_mapping;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_pages_valid_unlocked - GPU SVM range pages valid
> > unlocked
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + *
> > + * This function determines if a GPU SVM range pages are valid.
> > Expected be
> > + * called without holding gpusvm->notifier_lock.
> > + *
> > + * Returns:
> > + * True if GPU SVM range has valid pages, False otherwise
> > + */
> > +static bool
> > +drm_gpusvm_range_pages_valid_unlocked(struct drm_gpusvm *gpusvm,
> > + struct drm_gpusvm_range
> > *range)
> > +{
> > + bool pages_valid;
> > +
> > + if (!range->dma_addr)
> > + return false;
> > +
> > + drm_gpusvm_notifier_lock(gpusvm);
> > + pages_valid = drm_gpusvm_range_pages_valid(gpusvm, range);
> > + if (!pages_valid)
> > + drm_gpusvm_range_free_pages(gpusvm, range);
> > + drm_gpusvm_notifier_unlock(gpusvm);
> > +
> > + return pages_valid;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_get_pages - Get pages for a GPU SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + * @ctx: GPU SVM context
> > + *
> > + * This function gets pages for a GPU SVM range and ensures they are
> > mapped for
> > + * DMA access.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> > + struct drm_gpusvm_range *range,
> > + const struct drm_gpusvm_ctx *ctx)
> > +{
> > + struct mmu_interval_notifier *notifier = &range->notifier-
> > >notifier;
> > + struct hmm_range hmm_range = {
> > + .default_flags = HMM_PFN_REQ_FAULT | (ctx->read_only
> > ? 0 :
> > + HMM_PFN_REQ_WRITE),
> > + .notifier = notifier,
> > + .start = range->va.start,
> > + .end = range->va.end,
> > + .dev_private_owner = gpusvm-
> > >device_private_page_owner,
> > + };
> > + struct mm_struct *mm = gpusvm->mm;
> > + struct drm_gpusvm_zdd *zdd;
> > + unsigned long timeout =
> > + jiffies +
> > msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > + unsigned long i, j;
> > + unsigned long npages = npages_in_range(range->va.start,
> > range->va.end);
> > + unsigned long num_dma_mapped;
> > + unsigned int order = 0;
> > + unsigned long *pfns;
> > + struct page **pages;
> > + int err = 0;
> > + struct dev_pagemap *pagemap;
> > + struct drm_pagemap *dpagemap;
> > +
> > +retry:
> > + hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
> > + if (drm_gpusvm_range_pages_valid_unlocked(gpusvm, range))
> > + goto set_seqno;
> > +
> > + pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
> > + if (!pfns)
> > + return -ENOMEM;
> > +
> > + if (!mmget_not_zero(mm)) {
> > + err = -EFAULT;
> > + goto err_out;
> > + }
> > +
> > + hmm_range.hmm_pfns = pfns;
> > + while (true) {
> > + mmap_read_lock(mm);
> > + err = hmm_range_fault(&hmm_range);
> > + mmap_read_unlock(mm);
> > +
> > + if (err == -EBUSY) {
> > + if (time_after(jiffies, timeout))
> > + break;
> > +
> > + hmm_range.notifier_seq =
> > mmu_interval_read_begin(notifier);
> > + continue;
> > + }
> > + break;
> > + }
> > + mmput(mm);
> > + if (err)
> > + goto err_free;
> > +
> > + pages = (struct page **)pfns;
> > +map_pages:
> > + /*
> > + * Perform all dma mappings under the notifier lock to not
> > + * access freed pages. A notifier will either block on
> > + * the notifier lock or unmap dma.
> > + */
> > + drm_gpusvm_notifier_lock(gpusvm);
> > + if (mmu_interval_read_retry(notifier,
> > hmm_range.notifier_seq)) {
> > + drm_gpusvm_notifier_unlock(gpusvm);
> > + goto retry;
> > + }
> > +
> > + if (!range->dma_addr) {
> > + /* Unlock and restart mapping to allocate memory. */
> > + drm_gpusvm_notifier_unlock(gpusvm);
> > + range->dma_addr = kvmalloc_array(npages,
> > sizeof(*range->dma_addr),
> > + GFP_KERNEL);
> > + if (!range->dma_addr) {
> > + err = -ENOMEM;
> > + goto err_free;
> > + }
> > + goto map_pages;
> > + }
> > +
> > + zdd = NULL;
> > + num_dma_mapped = 0;
> > + for (i = 0, j = 0; i < npages; ++j) {
> > + struct page *page = hmm_pfn_to_page(pfns[i]);
> > +
> > + order = hmm_pfn_to_map_order(pfns[i]);
> > + if (is_device_private_page(page) ||
> > is_device_coherent_page(page)) {
> > + if (zdd != page->zone_device_data && i > 0)
> > {
> > + err = -EOPNOTSUPP;
> > + goto err_unmap;
> > + }
> > + zdd = page->zone_device_data;
> > + if (pagemap != page->pgmap) {
> > + if (i > 0) {
> > + err = -EOPNOTSUPP;
> > + goto err_unmap;
> > + }
> > +
> > + pagemap = page->pgmap;
> > + dpagemap = zdd->devmem_allocation-
> > >dpagemap;
> > + if (drm_WARN_ON(gpusvm->drm,
> > !dpagemap)) {
> > + /*
> > + * Raced. This is not
> > supposed to happen
> > + * since hmm_range_fault()
> > should've migrated
> > + * this page to system.
> > + */
> > + err = -EAGAIN;
> > + goto err_unmap;
> > + }
> > + }
> > + range->dma_addr[j] =
> > + dpagemap->ops->map_dma(dpagemap,
> > gpusvm->drm->dev,
> > + page, order,
> > +
> > DMA_BIDIRECTIONAL);
> > + if (dma_mapping_error(gpusvm->drm->dev,
> > range->dma_addr[j].addr)) {
> > + err = -EFAULT;
> > + goto err_unmap;
> > + }
> > +
> > + pages[i] = page;
> > + } else {
> > + dma_addr_t addr;
> > +
> > + if (is_zone_device_page(page) || zdd) {
> > + err = -EOPNOTSUPP;
>
> I suppose before merging we want to support mixed ranges since
> migration is best effort only, or what are the plans here?
>
I'd say no mixed support for the initial merge, given that it adds complexity and
the current code is very stable - i.e., get a simple and stable baseline in and
then build complexity on top incrementally. I have a lot of perf optimizations
I'd like to get in but am omitting them for now to stick to the aforementioned
plan.
Long term, I think a drm_gpusvm_ctx argument will control whether we want mixed
mappings within a range.
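Purely hypothetical sketch of what that could look like in
drm_gpusvm_range_get_pages() - the flag name is made up and nothing like it
exists in this series:

		/* Hypothetical ctx flag; not part of this series */
		if (!ctx->allow_mixed &&
		    (is_zone_device_page(page) || zdd)) {
			err = -EOPNOTSUPP;
			goto err_unmap;
		}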
> > + goto err_unmap;
> > + }
> > +
> > + addr = dma_map_page(gpusvm->drm->dev,
> > + page, 0,
> > + PAGE_SIZE << order,
> > + DMA_BIDIRECTIONAL);
> > + if (dma_mapping_error(gpusvm->drm->dev,
> > addr)) {
> > + err = -EFAULT;
> > + goto err_unmap;
> > + }
> > +
> > + range->dma_addr[j] =
> > drm_pagemap_dma_addr_encode
> > + (addr, DRM_INTERCONNECT_SYSTEM,
> > order,
> > + DMA_BIDIRECTIONAL);
> > + }
> > + i += 1 << order;
> > + num_dma_mapped = i;
> > + }
> > +
> > + range->flags.has_dma_mapping = true;
> > + if (zdd) {
> > + range->flags.has_devmem_pages = true;
> > + range->dpagemap = dpagemap;
> > + }
> > +
> > + drm_gpusvm_notifier_unlock(gpusvm);
> > + kvfree(pfns);
> > +set_seqno:
> > + range->notifier_seq = hmm_range.notifier_seq;
> > +
> > + return 0;
> > +
> > +err_unmap:
> > + __drm_gpusvm_range_unmap_pages(gpusvm, range,
> > num_dma_mapped);
> > + drm_gpusvm_notifier_unlock(gpusvm);
> > +err_free:
> > + kvfree(pfns);
> > +err_out:
> > + if (err == -EAGAIN)
> > + goto retry;
> > + return err;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_unmap_pages - Unmap pages associated with a GPU
> > SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + * @ctx: GPU SVM context
> > + *
> > + * This function unmaps pages associated with a GPU SVM range. If
> > @in_notifier
> > + * is set, it is assumed that gpusvm->notifier_lock is held in write
> > mode; if it
> > + * is clear, it acquires gpusvm->notifier_lock in read mode. Must be
> > called on
> > + * each GPU SVM range attached to notifier in gpusvm->ops-
> > >invalidate for IOMMU
> > + * security model.
> > + */
> > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> > + struct drm_gpusvm_range *range,
> > + const struct drm_gpusvm_ctx *ctx)
> > +{
> > + unsigned long npages = npages_in_range(range->va.start,
> > range->va.end);
> > +
> > + if (ctx->in_notifier)
> > + lockdep_assert_held_write(&gpusvm->notifier_lock);
> > + else
> > + drm_gpusvm_notifier_lock(gpusvm);
> > +
> > + __drm_gpusvm_range_unmap_pages(gpusvm, range, npages);
> > +
> > + if (!ctx->in_notifier)
> > + drm_gpusvm_notifier_unlock(gpusvm);
> > +}
>
> NIT: Separate functions for locked / unlocked makes life easier for
> static code analyzers.
>
Will do.
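Sketch of the split (naming is a guess):

static void drm_gpusvm_range_unmap_pages_locked(struct drm_gpusvm *gpusvm,
						struct drm_gpusvm_range *range,
						unsigned long npages)
{
	lockdep_assert_held_write(&gpusvm->notifier_lock);
	__drm_gpusvm_range_unmap_pages(gpusvm, range, npages);
}

static void drm_gpusvm_range_unmap_pages_unlocked(struct drm_gpusvm *gpusvm,
						  struct drm_gpusvm_range *range,
						  unsigned long npages)
{
	drm_gpusvm_notifier_lock(gpusvm);
	__drm_gpusvm_range_unmap_pages(gpusvm, range, npages);
	drm_gpusvm_notifier_unlock(gpusvm);
}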
>
> Section below I think should belong to drm_pagemap.c
>
Disagree. See my comments on zdd above. Also, drm_gpusvm_migration_put_pages uses
migration pfns, which definitely should not be in drm_pagemap.c.
> > +
> > +/**
> > + * drm_gpusvm_migration_put_page - Put a migration page
> > + * @page: Pointer to the page to put
> > + *
> > + * This function unlocks and puts a page.
> > + */
> > +static void drm_gpusvm_migration_put_page(struct page *page)
>
> _unlock_put_page()?
>
> > +{
> > + unlock_page(page);
> > + put_page(page);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migration_put_pages - Put migration pages
> > + * @npages: Number of pages
> > + * @migrate_pfn: Array of migrate page frame numbers
> > + *
> > + * This function puts an array of pages.
> > + */
> > +static void drm_gpusvm_migration_put_pages(unsigned long npages,
> > + unsigned long
> > *migrate_pfn)
> > +{
> > + unsigned long i;
> > +
> > + for (i = 0; i < npages; ++i) {
> > + if (!migrate_pfn[i])
> > + continue;
> > +
> > + drm_gpusvm_migration_put_page(migrate_pfn_to_page(mi
> > grate_pfn[i]));
> > + migrate_pfn[i] = 0;
> > + }
> > +}
> > +
> > +/**
> > + * drm_gpusvm_get_devmem_page - Get a reference to a device memory
> > page
> > + * @page: Pointer to the page
> > + * @zdd: Pointer to the GPU SVM zone device data
> > + *
> > + * This function associates the given page with the specified GPU
> > SVM zone
> > + * device data and initializes it for zone device usage.
> > + */
> > +static void drm_gpusvm_get_devmem_page(struct page *page,
> > + struct drm_gpusvm_zdd *zdd)
> > +{
> > + page->zone_device_data = drm_gpusvm_zdd_get(zdd);
> > + zone_device_page_init(page);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_map_pages() - Map migration pages for GPU SVM
> > migration
> > + * @dev: The device for which the pages are being mapped
> > + * @dma_addr: Array to store DMA addresses corresponding to mapped
> > pages
> > + * @migrate_pfn: Array of migrate page frame numbers to map
> > + * @npages: Number of pages to map
> > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > + *
> > + * This function maps pages of memory for migration usage in GPU
> > SVM. It
> > + * iterates over each page frame number provided in @migrate_pfn,
> > maps the
> > + * corresponding page, and stores the DMA address in the provided
> > @dma_addr
> > + * array.
> > + *
> > + * Return: 0 on success, -EFAULT if an error occurs during mapping.
> > + */
> > +static int drm_gpusvm_migrate_map_pages(struct device *dev,
> > + dma_addr_t *dma_addr,
> > + long unsigned int
> > *migrate_pfn,
> > + unsigned long npages,
> > + enum dma_data_direction dir)
> > +{
> > + unsigned long i;
> > +
> > + for (i = 0; i < npages; ++i) {
> > + struct page *page =
> > migrate_pfn_to_page(migrate_pfn[i]);
> > +
> > + if (!page)
> > + continue;
> > +
> > + if (WARN_ON_ONCE(is_zone_device_page(page)))
> > + return -EFAULT;
> > +
> > + dma_addr[i] = dma_map_page(dev, page, 0, PAGE_SIZE,
> > dir);
> > + if (dma_mapping_error(dev, dma_addr[i]))
> > + return -EFAULT;
> > + }
> > +
> > + return 0;
> > +}
>
> TBC'd
>
Thanks for the comments!
Matt
> /Thomas
>
^ permalink raw reply [flat|nested] 129+ messages in thread* Re: [PATCH v2 05/29] drm/gpusvm: Add support for GPU Shared Virtual Memory
2024-11-04 17:21 ` Matthew Brost
@ 2024-11-04 18:59 ` Thomas Hellström
2024-11-04 23:07 ` Matthew Brost
0 siblings, 1 reply; 129+ messages in thread
From: Thomas Hellström @ 2024-11-04 18:59 UTC (permalink / raw)
To: Matthew Brost
Cc: intel-xe, dri-devel, apopple, airlied, christian.koenig,
simona.vetter, felix.kuehling, dakr
On Mon, 2024-11-04 at 09:21 -0800, Matthew Brost wrote:
> On Mon, Nov 04, 2024 at 04:25:38PM +0100, Thomas Hellström wrote:
> > On Tue, 2024-10-15 at 20:24 -0700, Matthew Brost wrote:
> >
> > Continued review.
> >
> > >
> > > +/**
> > > + * struct drm_gpusvm_zdd - GPU SVM zone device data
> > > + *
> > > + * @refcount: Reference count for the zdd
> > > + * @destroy_work: Work structure for asynchronous zdd
> > > destruction
> > > + * @devmem_allocation: device memory allocation
> > > + * @device_private_page_owner: Device private pages owner
> > > + *
> > > + * This structure serves as a generic wrapper installed in
> > > + * page->zone_device_data. It provides infrastructure for
> > > looking up
> > > a device
> > > + * memory allocation upon CPU page fault and asynchronously
> > > releasing device
> > > + * memory once the CPU has no page references. Asynchronous
> > > release
> > > is useful
> > > + * because CPU page references can be dropped in IRQ contexts,
> > > while
> > > releasing
> > > + * device memory likely requires sleeping locks.
> > > + */
> > > +struct drm_gpusvm_zdd {
> > > + struct kref refcount;
> > > + struct work_struct destroy_work;
> > > + struct drm_gpusvm_devmem *devmem_allocation;
> > > + void *device_private_page_owner;
> > > +};
> > > +
> > > +/**
> > > + * drm_gpusvm_zdd_destroy_work_func - Work function for
> > > destroying a
> > > zdd
> >
> > NIT: Even if the above kerneldoc format works, I keep trying to
> > enforce
> > using () after function names and function-like macros, like
> > described
> > here: https://docs.kernel.org/doc-guide/kernel-doc.html Could we
> > update? Also that doc calls for using "Return:" instead of
> > "Returns:".
> >
> >
>
> Will fix up. Thanks for the ref.
>
> > > + * @w: Pointer to the work_struct
> > > + *
> > > + * This function releases device memory, puts GPU SVM range, and
> > > frees zdd.
> > > + */
> > > +static void drm_gpusvm_zdd_destroy_work_func(struct work_struct
> > > *w)
> > > +{
> > > + struct drm_gpusvm_zdd *zdd =
> > > + container_of(w, struct drm_gpusvm_zdd,
> > > destroy_work);
> > > + const struct drm_gpusvm_devmem_ops *ops = zdd-
> > > > devmem_allocation ?
> > > + zdd->devmem_allocation->ops : NULL;
> > > +
> > > + if (zdd->devmem_allocation && ops->devmem_release)
> > > + ops->devmem_release(zdd->devmem_allocation);
> > > + kfree(zdd);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_zdd_alloc - Allocate a zdd structure.
> > > + * @device_private_page_owner: Device private pages owner
> > > + *
> > > + * This function allocates and initializes a new zdd structure.
> > > It
> > > sets up the
> > > + * reference count and initializes the destroy work.
> > > + *
> > > + * Returns:
> > > + * Pointer to the allocated zdd on success, ERR_PTR() on
> > > failure.
> > > + */
> > > +static struct drm_gpusvm_zdd *
> > > +drm_gpusvm_zdd_alloc(void *device_private_page_owner)
> > > +{
> > > + struct drm_gpusvm_zdd *zdd;
> > > +
> > > + zdd = kmalloc(sizeof(*zdd), GFP_KERNEL);
> > > + if (!zdd)
> > > + return NULL;
> > > +
> > > + kref_init(&zdd->refcount);
> > > + INIT_WORK(&zdd->destroy_work,
> > > drm_gpusvm_zdd_destroy_work_func);
> > > + zdd->devmem_allocation = NULL;
> > > + zdd->device_private_page_owner =
> > > device_private_page_owner;
> > > +
> > > + return zdd;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_zdd_get - Get a reference to a zdd structure.
> > > + * @zdd: Pointer to the zdd structure.
> > > + *
> > > + * This function increments the reference count of the provided
> > > zdd
> > > structure.
> > > + *
> > > + * Returns: Pointer to the zdd structure.
> > > + */
> > > +static struct drm_gpusvm_zdd *drm_gpusvm_zdd_get(struct
> > > drm_gpusvm_zdd *zdd)
> > > +{
> > > + kref_get(&zdd->refcount);
> > > + return zdd;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_zdd_destroy - Destroy a zdd structure.
> > > + * @ref: Pointer to the reference count structure.
> > > + *
> > > + * This function queues the destroy_work of the zdd for
> > > asynchronous
> > > destruction.
> > > + */
> > > +static void drm_gpusvm_zdd_destroy(struct kref *ref)
> > > +{
> > > + struct drm_gpusvm_zdd *zdd =
> > > + container_of(ref, struct drm_gpusvm_zdd,
> > > refcount);
> > > +
> > > + if (zdd->devmem_allocation)
> > > + WRITE_ONCE(zdd->devmem_allocation->detached,
> > > true);
> > > + schedule_work(&zdd->destroy_work);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_zdd_put - Put a zdd reference.
> > > + * @zdd: Pointer to the zdd structure.
> > > + *
> > > + * This function decrements the reference count of the provided
> > > zdd
> > > structure
> > > + * and schedules its destruction if the count drops to zero.
> > > + */
> > > +static void drm_gpusvm_zdd_put(struct drm_gpusvm_zdd *zdd)
> > > +{
> > > + kref_put(&zdd->refcount, drm_gpusvm_zdd_destroy);
> > > +}
> >
> > As mentioned earlier, I think the above drm_gpusvm_zdd functions
> > should
> > move to drm_pagemap.c. I don't think they are used in drm_gpusvm
> > other
> > than to, at get_pages time, ensure all device private pages are
> > from
> > the same pagemap?
> >
>
> They are used in __drm_gpusvm_migrate_to_ram to find devmem_allocation
> and
> associated ops.
>
> Also in drm_gpusvm_migrate_to_ram to find the size and
> device_private_page_owner.
>
> I think the placement here is correct for now but open to shuffling
> this around
> in the future if this makes sense.
Yeah, I was thinking that with the split in the multi-device series (which
is yet to be posted, though), the drm_pagemap op (*populate_mm)() would
in effect handle all migration to vram, the dev_pagemap op would
handle migration to ram, and the ranges would no longer keep track of
the vram allocations.
This means that no low-level migration function would take drm_gpusvm
as an argument (well, drm_gpusvm_range_evict() would, but that would, as
we discussed before, probably want to evict *all* device private pages,
so it would likely need to be implemented with hmm_range_fault()?).
So this was mostly trying to avoid future shuffling around, but I agree
it depends on the dev_pagemap patches.
>
> > > +
> > > +/**
> > > + * drm_gpusvm_range_find - Find GPU SVM range from GPU SVM
> > > notifier
> > > + * @notifier: Pointer to the GPU SVM notifier structure.
> > > + * @start: Start address of the range
> > > + * @end: End address of the range
> > > + *
> > > + * Return: A pointer to the drm_gpusvm_range if found or NULL
> > > + */
> > > +struct drm_gpusvm_range *
> > > +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64
> > > start, u64 end)
> > > +{
> > > + return range_iter_first(¬ifier->root, start, end -
> > > 1);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_for_each_range_safe - Safely iterate over GPU SVM
> > > ranges in a notifier
> > > + * @range__: Iterator variable for the ranges
> > > + * @next__: Iterator variable for the ranges temporay storage
> > > + * @notifier__: Pointer to the GPU SVM notifier
> > > + * @start__: Start address of the range
> > > + * @end__: End address of the range
> > > + *
> > > + * This macro is used to iterate over GPU SVM ranges in a
> > > notifier
> > > while
> > > + * removing ranges from it.
> > > + */
> > > +#define drm_gpusvm_for_each_range_safe(range__, next__,
> > > notifier__,
> > > start__, end__) \
> > > + for ((range__) = drm_gpusvm_range_find((notifier__),
> > > (start__), (end__)), \
> > > + (next__) =
> > > __drm_gpusvm_range_next(range__); \
> > > + (range__) && (range__->va.start <
> > > (end__)); \
> > > + (range__) = (next__), (next__) =
> > > __drm_gpusvm_range_next(range__))
> > > +
> > > +/**
> > > + * __drm_gpusvm_notifier_next - get the next drm_gpusvm_notifier
> > > in
> > > the list
> > > + * @notifier: a pointer to the current drm_gpusvm_notifier
> > > + *
> > > + * Return: A pointer to the next drm_gpusvm_notifier if
> > > available,
> > > or NULL if
> > > + * the current notifier is the last one or if the input
> > > notifier is
> > > + * NULL.
> > > + */
> > > +static struct drm_gpusvm_notifier *
> > > +__drm_gpusvm_notifier_next(struct drm_gpusvm_notifier *notifier)
> > > +{
> > > + if (notifier && !list_is_last(¬ifier->rb.entry,
> > > + ¬ifier->gpusvm-
> > > > notifier_list))
> > > + return list_next_entry(notifier, rb.entry);
> >
> > Why aren't we using notifier_iter_next() here? Then the linked list
> > could be skipped.
> >
>
> I shamelessly copied this from GPU VM. I think the list is useful for
> faster
> iteration and safe removal of items while walking.
We have
https://elixir.bootlin.com/linux/v6.12-rc6/source/include/linux/interval_tree_generic.h#L24
to relate to. Now, GPUVM can't use the generic version since it needs
u64 intervals. These trees only need unsigned long, so it should be OK.
And safe removal - isn't that possible to implement without the list?
Then it's really only the linked list as a perf optimization, I guess,
but we have a lot of those pending...
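If the list went away, the next() helper would presumably just be the generated
iterator, roughly (sketch, assuming the usual interval_tree_generic naming; the
extra start/end arguments would have to be plumbed through the for_each macros):

static struct drm_gpusvm_notifier *
__drm_gpusvm_notifier_next(struct drm_gpusvm_notifier *notifier,
			   unsigned long start, unsigned long end)
{
	/* notifier_iter_next() as generated by INTERVAL_TREE_DEFINE() */
	return notifier ? notifier_iter_next(notifier, start, end - 1) : NULL;
}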
>
> > > +
> > > + return NULL;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_for_each_notifier - Iterate over GPU SVM notifiers
> > > in
> > > a gpusvm
> > > + * @notifier__: Iterator variable for the notifiers
> > > + * @notifier__: Pointer to the GPU SVM notifier
> > > + * @start__: Start address of the notifier
> > > + * @end__: End address of the notifier
> > > + *
> > > + * This macro is used to iterate over GPU SVM notifiers in a
> > > gpusvm.
> > > + */
> > > +#define drm_gpusvm_for_each_notifier(notifier__, gpusvm__,
> > > start__,
> > > end__) \
> > > + for ((notifier__) = notifier_iter_first(&(gpusvm__)-
> > > >root,
> > > (start__), (end__) - 1); \
> > > + (notifier__) && (notifier__->interval.start <
> > > (end__)); \
> > > + (notifier__) =
> > > __drm_gpusvm_notifier_next(notifier__))
> > > +
> >
> > Looks like end__ is not honored except for the first iteration.
> > Relates
> > to the above question.
> >
>
> Again shameless copy from GPU VM... Missing what the problem is. The
> condition
> to break the loop is:
>
> '(notifier__) && (notifier__->interval.start < (end__)'
Ah yes, you're right. I missed that.
>
> > > +/**
> > > + * drm_gpusvm_for_each_notifier_safe - Safely iterate over GPU
> > > SVM
> > > notifiers in a gpusvm
> > > + * @notifier__: Iterator variable for the notifiers
> > > + * @next__: Iterator variable for the notifiers temporay storage
> > > + * @notifier__: Pointer to the GPU SVM notifier
> > > + * @start__: Start address of the notifier
> > > + * @end__: End address of the notifier
> > > + *
> > > + * This macro is used to iterate over GPU SVM notifiers in a
> > > gpusvm
> > > while
> > > + * removing notifiers from it.
> > > + */
> > > +#define drm_gpusvm_for_each_notifier_safe(notifier__, next__,
> > > gpusvm__, start__, end__) \
> > > + for ((notifier__) = notifier_iter_first(&(gpusvm__)-
> > > >root,
> > > (start__), (end__) - 1), \
> > > + (next__) =
> > > __drm_gpusvm_notifier_next(notifier__); \
> > > + (notifier__) && (notifier__->interval.start <
> > > (end__)); \
> > > + (notifier__) = (next__), (next__) =
> > > __drm_gpusvm_notifier_next(notifier__))
> >
> > Same here.
> >
>
> Also present:
>
> (notifier__) && (notifier__->interval.start < (end__)
>
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_invalidate - Invalidate a GPU SVM
> > > notifier.
> > > + * @mni: Pointer to the mmu_interval_notifier structure.
> > > + * @mmu_range: Pointer to the mmu_notifier_range structure.
> > > + * @cur_seq: Current sequence number.
> > > + *
> > > + * This function serves as a generic MMU notifier for GPU SVM.
> > > It
> > > sets the MMU
> > > + * notifier sequence number and calls the driver invalidate
> > > vfunc
> > > under
> > > + * gpusvm->notifier_lock.
> > > + *
> > > + * Returns:
> > > + * true if the operation succeeds, false otherwise.
> > > + */
> > > +static bool
> > > +drm_gpusvm_notifier_invalidate(struct mmu_interval_notifier
> > > *mni,
> > > + const struct mmu_notifier_range
> > > *mmu_range,
> > > + unsigned long cur_seq)
> > > +{
> > > + struct drm_gpusvm_notifier *notifier =
> > > + container_of(mni, typeof(*notifier), notifier);
> > > + struct drm_gpusvm *gpusvm = notifier->gpusvm;
> > > +
> > > + if (!mmu_notifier_range_blockable(mmu_range))
> > > + return false;
> > > +
> > > + down_write(&gpusvm->notifier_lock);
> > > + mmu_interval_set_seq(mni, cur_seq);
> > > + gpusvm->ops->invalidate(gpusvm, notifier, mmu_range);
> > > + up_write(&gpusvm->notifier_lock);
> > > +
> > > + return true;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_ops - MMU interval notifier operations
> > > for
> > > GPU SVM
> > > + */
> > > +static const struct mmu_interval_notifier_ops
> > > drm_gpusvm_notifier_ops = {
> > > + .invalidate = drm_gpusvm_notifier_invalidate,
> > > +};
> > > +
> > > +/**
> > > + * drm_gpusvm_init - Initialize the GPU SVM.
> > > + * @gpusvm: Pointer to the GPU SVM structure.
> > > + * @name: Name of the GPU SVM.
> > > + * @drm: Pointer to the DRM device structure.
> > > + * @mm: Pointer to the mm_struct for the address space.
> > > + * @device_private_page_owner: Device private pages owner.
> > > + * @mm_start: Start address of GPU SVM.
> > > + * @mm_range: Range of the GPU SVM.
> > > + * @notifier_size: Size of individual notifiers.
> > > + * @ops: Pointer to the operations structure for GPU SVM.
> > > + * @chunk_sizes: Pointer to the array of chunk sizes used in
> > > range
> > > allocation.
> > > + * Entries should be powers of 2 in descending
> > > order
> > > with last
> > > + * entry being SZ_4K.
> > > + * @num_chunks: Number of chunks.
> > > + *
> > > + * This function initializes the GPU SVM.
> > > + *
> > > + * Returns:
> > > + * 0 on success, a negative error code on failure.
> > > + */
> > > +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> > > + const char *name, struct drm_device *drm,
> > > + struct mm_struct *mm, void
> > > *device_private_page_owner,
> > > + u64 mm_start, u64 mm_range, u64
> > > notifier_size,
> > > + const struct drm_gpusvm_ops *ops,
> > > + const u64 *chunk_sizes, int num_chunks)
> > > +{
> > > + if (!ops->invalidate || !num_chunks)
> > > + return -EINVAL;
> > > +
> > > + gpusvm->name = name;
> > > + gpusvm->drm = drm;
> > > + gpusvm->mm = mm;
> > > + gpusvm->device_private_page_owner =
> > > device_private_page_owner;
> > > + gpusvm->mm_start = mm_start;
> > > + gpusvm->mm_range = mm_range;
> > > + gpusvm->notifier_size = notifier_size;
> > > + gpusvm->ops = ops;
> > > + gpusvm->chunk_sizes = chunk_sizes;
> > > + gpusvm->num_chunks = num_chunks;
> > > +
> > > + mmgrab(mm);
> > > + gpusvm->root = RB_ROOT_CACHED;
> > > + INIT_LIST_HEAD(&gpusvm->notifier_list);
> > > +
> > > + init_rwsem(&gpusvm->notifier_lock);
> > > +
> > > + fs_reclaim_acquire(GFP_KERNEL);
> > > + might_lock(&gpusvm->notifier_lock);
> > > + fs_reclaim_release(GFP_KERNEL);
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_find - Find GPU SVM notifier
> > > + * @gpusvm__: Pointer to the GPU SVM structure
> > > + * @fault_addr__: Fault address
> > > + *
> > > + * This macro finds the GPU SVM notifier associated with the
> > > fault
> > > address.
> > > + *
> > > + * Returns:
> > > + * Pointer to the GPU SVM notifier on success, NULL otherwise.
> > > + */
> > > +#define drm_gpusvm_notifier_find(gpusvm__,
> > > fault_addr__) \
> > > + notifier_iter_first(&(gpusvm__)->root,
> > > (fault_addr__), \
> > > + (fault_addr__ + 1))
> > > +
> > > +/**
> > > + * to_drm_gpusvm_notifier - retrieve the container struct for a
> > > given rbtree node
> > > + * @node__: a pointer to the rbtree node embedded within a
> > > drm_gpusvm_notifier struct
> > > + *
> > > + * Return: A pointer to the containing drm_gpusvm_notifier
> > > structure.
> > > + */
> > > +#define
> > > to_drm_gpusvm_notifier(__node) \
> > > + container_of((__node), struct drm_gpusvm_notifier,
> > > rb.node)
> > > +
> >
> > There appears to be a number of function-like macros in the code,
> > which
> > look like they can be converted to functions. Linux prefers
> > functions
> > over macros when possible:
> >
> > https://www.kernel.org/doc/html/v5.8/process/coding-style.html#macros-enums-and-rtl
> >
>
> Will convert all macros to functions if possible. Again thanks for
> ref.
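
To illustrate, the find macro above converts pretty much mechanically
(untested, iterator name taken from the patch), and gives argument
type-checking for free:

static struct drm_gpusvm_notifier *
drm_gpusvm_notifier_find(struct drm_gpusvm *gpusvm, u64 fault_addr)
{
	/* Same semantics as the macro: first notifier overlapping fault_addr */
	return notifier_iter_first(&gpusvm->root, fault_addr, fault_addr + 1);
}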
>
> >
> > > +/**
> > > + * drm_gpusvm_notifier_insert - Insert GPU SVM notifier
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > + *
> > > + * This function inserts the GPU SVM notifier into the GPU SVM
> > > RB
> > > tree and list.
> > > + */
> > > +static void drm_gpusvm_notifier_insert(struct drm_gpusvm
> > > *gpusvm,
> > > + struct
> > > drm_gpusvm_notifier
> > > *notifier)
> > > +{
> > > + struct rb_node *node;
> > > + struct list_head *head;
> > > +
> > > + notifier_insert(notifier, &gpusvm->root);
> > > +
> > > + node = rb_prev(¬ifier->rb.node);
> > > + if (node)
> > > + head = &(to_drm_gpusvm_notifier(node))-
> > > >rb.entry;
> > > + else
> > > + head = &gpusvm->notifier_list;
> > > +
> > > + list_add(¬ifier->rb.entry, head);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_remove - Remove GPU SVM notifier
> > > + * @gpusvm__: Pointer to the GPU SVM tructure
> > > + * @notifier__: Pointer to the GPU SVM notifier structure
> > > + *
> > > + * This macro removes the GPU SVM notifier from the GPU SVM RB
> > > tree
> > > and list.
> > > + */
> > > +#define drm_gpusvm_notifier_remove(gpusvm__,
> > > notifier__) \
> > > + notifier_remove((notifier__), &(gpusvm__)-
> > > >root); \
> > > + list_del(&(notifier__)->rb.entry)
> >
> > Unless this can be made a function, Pls use
> > do { } while (0)
> >
>
> I think it can be made a function or otherwise yea will use do { }
> while (0).
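
FWIW the reason for the do { } while (0) idiom is that otherwise something
like "if (cond) drm_gpusvm_notifier_remove(gpusvm, n);" would only guard the
first statement. A sketch of the wrapped version, if it stays a macro:

#define drm_gpusvm_notifier_remove(gpusvm__, notifier__)		\
	do {								\
		notifier_remove((notifier__), &(gpusvm__)->root);	\
		list_del(&(notifier__)->rb.entry);			\
	} while (0)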
>
> >
> > > +
> > > +/**
> > > + * drm_gpusvm_fini - Finalize the GPU SVM.
> > > + * @gpusvm: Pointer to the GPU SVM structure.
> > > + *
> > > + * This function finalizes the GPU SVM by cleaning up any
> > > remaining
> > > ranges and
> > > + * notifiers, and dropping a reference to struct MM.
> > > + */
> > > +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm)
> > > +{
> > > + struct drm_gpusvm_notifier *notifier, *next;
> > > +
> > > + drm_gpusvm_for_each_notifier_safe(notifier, next,
> > > gpusvm, 0,
> > > LONG_MAX) {
> > > + struct drm_gpusvm_range *range, *__next;
> > > +
> > > + /*
> > > + * Remove notifier first to avoid racing with
> > > any
> > > invalidation
> > > + */
> > > + mmu_interval_notifier_remove(¬ifier-
> > > >notifier);
> > > + notifier->flags.removed = true;
> > > +
> > > + drm_gpusvm_for_each_range_safe(range, __next,
> > > notifier, 0,
> > > + LONG_MAX)
> > > + drm_gpusvm_range_remove(gpusvm, range);
> > > + }
> > > +
> > > + mmdrop(gpusvm->mm);
> > > + WARN_ON(!RB_EMPTY_ROOT(&gpusvm->root.rb_root));
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_alloc - Allocate GPU SVM notifier
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @fault_addr: Fault address
> > > + *
> > > + * This function allocates and initializes the GPU SVM notifier
> > > structure.
> > > + *
> > > + * Returns:
> > > + * Pointer to the allocated GPU SVM notifier on success,
> > > ERR_PTR()
> > > on failure.
> > > + */
> > > +static struct drm_gpusvm_notifier *
> > > +drm_gpusvm_notifier_alloc(struct drm_gpusvm *gpusvm, u64
> > > fault_addr)
> > > +{
> > > + struct drm_gpusvm_notifier *notifier;
> > > +
> > > + if (gpusvm->ops->notifier_alloc)
> > > + notifier = gpusvm->ops->notifier_alloc();
> > > + else
> > > + notifier = kzalloc(sizeof(*notifier),
> > > GFP_KERNEL);
> > > +
> > > + if (!notifier)
> > > + return ERR_PTR(-ENOMEM);
> > > +
> > > + notifier->gpusvm = gpusvm;
> > > + notifier->interval.start = ALIGN_DOWN(fault_addr,
> > > gpusvm-
> > > > notifier_size);
> > > + notifier->interval.end = ALIGN(fault_addr + 1, gpusvm-
> > > > notifier_size);
> > > + INIT_LIST_HEAD(¬ifier->rb.entry);
> > > + notifier->root = RB_ROOT_CACHED;
> > > + INIT_LIST_HEAD(¬ifier->range_list);
> > > +
> > > + return notifier;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_notifier_free - Free GPU SVM notifier
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > + *
> > > + * This function frees the GPU SVM notifier structure.
> > > + */
> > > +static void drm_gpusvm_notifier_free(struct drm_gpusvm *gpusvm,
> > > + struct drm_gpusvm_notifier
> > > *notifier)
> > > +{
> > > + WARN_ON(!RB_EMPTY_ROOT(¬ifier->root.rb_root));
> > > +
> > > + if (gpusvm->ops->notifier_free)
> > > + gpusvm->ops->notifier_free(notifier);
> > > + else
> > > + kfree(notifier);
> > > +}
> > > +
> > > +/**
> > > + * to_drm_gpusvm_range - retrieve the container struct for a
> > > given
> > > rbtree node
> > > + * @node__: a pointer to the rbtree node embedded within a
> > > drm_gpusvm_range struct
> > > + *
> > > + * Return: A pointer to the containing drm_gpusvm_range
> > > structure.
> > > + */
> > > +#define to_drm_gpusvm_range(node__) \
> > > + container_of((node__), struct drm_gpusvm_range, rb.node)
> > > +
> > > +/**
> > > + * drm_gpusvm_range_insert - Insert GPU SVM range
> > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + *
> > > + * This function inserts the GPU SVM range into the notifier RB
> > > tree
> > > and list.
> > > + */
> > > +static void drm_gpusvm_range_insert(struct drm_gpusvm_notifier
> > > *notifier,
> > > + struct drm_gpusvm_range
> > > *range)
> > > +{
> > > + struct rb_node *node;
> > > + struct list_head *head;
> > > +
> > > + drm_gpusvm_notifier_lock(notifier->gpusvm);
> > > + range_insert(range, ¬ifier->root);
> > > +
> > > + node = rb_prev(&range->rb.node);
> > > + if (node)
> > > + head = &(to_drm_gpusvm_range(node))->rb.entry;
> > > + else
> > > + head = ¬ifier->range_list;
> > > +
> > > + list_add(&range->rb.entry, head);
> > > + drm_gpusvm_notifier_unlock(notifier->gpusvm);
> > > +}
> > > +
> > > +/**
> > > + * __drm_gpusvm_range_remove - Remove GPU SVM range
> > > + * @notifier__: Pointer to the GPU SVM notifier structure
> > > + * @range__: Pointer to the GPU SVM range structure
> > > + *
> > > + * This macro removes the GPU SVM range from the notifier RB
> > > tree
> > > and list.
> > > + */
> > > +#define __drm_gpusvm_range_remove(notifier__,
> > > range__) \
> > > + range_remove((range__), &(notifier__)-
> > > >root); \
> > > + list_del(&(range__)->rb.entry)
> >
> > Same thing as for the notifier rb tree. And do we need the linked
> > list?
> >
>
> Same answer.
>
> >
> > > +
> > > +/**
> > > + * drm_gpusvm_range_alloc - Allocate GPU SVM range
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > + * @fault_addr: Fault address
> > > + * @chunk_size: Chunk size
> > > + * @migrate_devmem: Flag indicating whether to migrate device
> > > memory
> > > + *
> > > + * This function allocates and initializes the GPU SVM range
> > > structure.
> > > + *
> > > + * Returns:
> > > + * Pointer to the allocated GPU SVM range on success, ERR_PTR()
> > > on
> > > failure.
> > > + */
> > > +static struct drm_gpusvm_range *
> > > +drm_gpusvm_range_alloc(struct drm_gpusvm *gpusvm,
> > > + struct drm_gpusvm_notifier *notifier,
> > > + u64 fault_addr, u64 chunk_size, bool
> > > migrate_devmem)
> > > +{
> > > + struct drm_gpusvm_range *range;
> > > +
> > > + if (gpusvm->ops->range_alloc)
> > > + range = gpusvm->ops->range_alloc(gpusvm);
> > > + else
> > > + range = kzalloc(sizeof(*range), GFP_KERNEL);
> > > +
> > > + if (!range)
> > > + return ERR_PTR(-ENOMEM);
> > > +
> > > + kref_init(&range->refcount);
> > > + range->gpusvm = gpusvm;
> > > + range->notifier = notifier;
> > > + range->va.start = ALIGN_DOWN(fault_addr, chunk_size);
> > > + range->va.end = ALIGN(fault_addr + 1, chunk_size);
> > > + INIT_LIST_HEAD(&range->rb.entry);
> > > + range->notifier_seq = LONG_MAX;
> > > + range->flags.migrate_devmem = migrate_devmem ? 1 : 0;
> > > +
> > > + return range;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_check_pages - Check pages
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > + * @start: Start address
> > > + * @end: End address
> > > + *
> > > + * Check if pages between start and end have been faulted in on
> > > the
> > > CPU. Use to
> > > + * prevent migration of pages without CPU backing store.
> > > + *
> > > + * Returns:
> > > + * True if pages have been faulted into CPU, False otherwise
> > > + */
> > > +static bool drm_gpusvm_check_pages(struct drm_gpusvm *gpusvm,
> > > + struct drm_gpusvm_notifier
> > > *notifier,
> > > + u64 start, u64 end)
> > > +{
> > > + struct hmm_range hmm_range = {
> > > + .default_flags = 0,
> > > + .notifier = ¬ifier->notifier,
> > > + .start = start,
> > > + .end = end,
> > > + .dev_private_owner = gpusvm-
> > > > device_private_page_owner,
> > > + };
> > > + unsigned long timeout =
> > > + jiffies +
> > > msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > > + unsigned long *pfns;
> > > + unsigned long npages = npages_in_range(start, end);
> > > + int err, i;
> > > +
> > > + mmap_assert_locked(gpusvm->mm);
> > > +
> > > + pfns = kvmalloc_array(npages, sizeof(*pfns),
> > > GFP_KERNEL);
> > > + if (!pfns)
> > > + return false;
> > > +
> > > + hmm_range.notifier_seq =
> > > mmu_interval_read_begin(¬ifier-
> > > > notifier);
> > > + hmm_range.hmm_pfns = pfns;
> > > +
> > > + while (true) {
> > > + err = hmm_range_fault(&hmm_range);
> > > + if (err == -EBUSY) {
> > > + if (time_after(jiffies, timeout))
> > > + break;
> > > +
> > > + hmm_range.notifier_seq =
> > > mmu_interval_read_begin(¬ifier->notifier);
> > > + continue;
> > > + }
> > > + break;
> > > + }
> > > + if (err)
> > > + goto err_free;
> > > +
> > > + for (i = 0; i < npages;) {
> > > + if (!(pfns[i] & HMM_PFN_VALID)) {
> > > + err = -EFAULT;
> > > + goto err_free;
> > > + }
> > > + i += 0x1 << hmm_pfn_to_map_order(pfns[i]);
> > > + }
> > > +
> > > +err_free:
> > > + kvfree(pfns);
> > > + return err ? false : true;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_chunk_size - Determine chunk size for GPU
> > > SVM
> > > range
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > + * @vas: Pointer to the virtual memory area structure
> > > + * @fault_addr: Fault address
> > > + * @gpuva_start: Start address of GPUVA which mirrors CPU
> > > + * @gpuva_end: End address of GPUVA which mirrors CPU
> > > + * @check_pages: Flag indicating whether to check pages
> > > + *
> > > + * This function determines the chunk size for the GPU SVM range
> > > based on the
> > > + * fault address, GPU SVM chunk sizes, existing GPU SVM ranges,
> > > and
> > > the virtual
> > > + * memory area boundaries.
> > > + *
> > > + * Returns:
> > > + * Chunk size on success, LONG_MAX on failure.
> > > + */
> > > +static u64 drm_gpusvm_range_chunk_size(struct drm_gpusvm
> > > *gpusvm,
> > > + struct
> > > drm_gpusvm_notifier
> > > *notifier,
> > > + struct vm_area_struct
> > > *vas,
> > > + u64 fault_addr, u64
> > > gpuva_start,
> > > + u64 gpuva_end, bool
> > > check_pages)
> > > +{
> > > + u64 start, end;
> > > + int i = 0;
> > > +
> > > +retry:
> > > + for (; i < gpusvm->num_chunks; ++i) {
> > > + start = ALIGN_DOWN(fault_addr, gpusvm-
> > > > chunk_sizes[i]);
> > > + end = ALIGN(fault_addr + 1, gpusvm-
> > > >chunk_sizes[i]);
> > > +
> > > + if (start >= vas->vm_start && end <= vas->vm_end
> > > &&
> > > + start >= notifier->interval.start &&
> > > + end <= notifier->interval.end &&
> > > + start >= gpuva_start && end <= gpuva_end)
> > > + break;
> > > + }
> > > +
> > > + if (i == gpusvm->num_chunks)
> > > + return LONG_MAX;
> > > +
> > > + /*
> > > + * If allocation more than page, ensure not to overlap
> > > with
> > > existing
> > > + * ranges.
> > > + */
> > > + if (end - start != SZ_4K) {
> > > + struct drm_gpusvm_range *range;
> > > +
> > > + range = drm_gpusvm_range_find(notifier, start,
> > > end);
> > > + if (range) {
> > > + ++i;
> > > + goto retry;
> > > + }
> > > +
> > > + /*
> > > + * XXX: Only create range on pages CPU has
> > > faulted
> > > in. Without
> > > + * this check, or prefault, on BMG
> > > 'xe_exec_system_allocator --r
> > > + * process-many-malloc' fails. In the failure
> > > case,
> > > each process
> > > + * mallocs 16k but the CPU VMA is ~128k which
> > > results in 64k SVM
> > > + * ranges. When migrating the SVM ranges, some
> > > processes fail in
> > > + * drm_gpusvm_migrate_to_devmem with
> > > 'migrate.cpages
> > > != npages'
> > > + * and then upon drm_gpusvm_range_get_pages
> > > device
> > > pages from
> > > + * other processes are collected + faulted in
> > > which
> > > creates all
> > > + * sorts of problems. Unsure exactly how this
> > > happening, also
> > > + * problem goes away if
> > > 'xe_exec_system_allocator --
> > > r
> > > + * process-many-malloc' mallocs at least 64k at
> > > a
> > > time.
> > > + */
> >
> > Needs to be figured out. I think even in the system allocator case,
> > if
> > a user uses malloc() to allocate a GPU only buffer we'd need to
> > support
> > that?
> >
>
> I'm not understanding this comment but I do agree what is going on
> here needs to
> be figured out.
What I meant was: let's say the user mallocs a big buffer to be used by
the GPU only, which hence should ideally be in device memory, but it's
never faulted in by the CPU. I guess my question should be reformulated:
what would happen then? Wouldn't it remain in system RAM forever?
>
> This comment is actually a bit stale - I think the above test case
> will pass now
> if ctx.check_pages is false with a retry loop triggered in GPU fault
> handler
> because of mixed pages. However it does appear the test case still
> finds device
> pages in hmm_range_fault mapped into a different process which I
> think should be
> impossible. Wondering if there is hmm / mm core bug here my test case
> hits? Let
> me page this information back and dig in here to see if I can explain
> what is
> going on better. Will take some time but I should be able to focus on
> this during
> the week.
>
> Also I think leaving in the check_pages option is a good thing. A
> call then can
> choose between 2 things:
>
> 1. Only create GPU mappings for CPU pages faulted in (ctx.check_pages
> = true)
> 2. create GPU mappings for a VMA and fault in CPU pages
> (ctx.check_pages =
> false)
>
> If we support 2, then I think xe_svm_copy needs to be updated to
> clear VRAM for
> pages which the CPU has not faulted in.
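
Ok. So from the driver side the choice would then look roughly like this
(just a sketch; the field names are taken from this patch):

	struct drm_gpusvm_ctx ctx = {
		.devmem_possible = true,
		/* true: only mirror pages the CPU has already faulted in,
		 * false: map the whole chunk and fault CPU pages in as needed.
		 */
		.check_pages = true,
	};
	struct drm_gpusvm_range *range;

	range = drm_gpusvm_range_find_or_insert(gpusvm, fault_addr,
						gpuva_start, gpuva_end, &ctx);

and then the second mode indeed needs the VRAM-clearing you mention.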
>
> >
> > > + if (check_pages &&
> > > + !drm_gpusvm_check_pages(gpusvm, notifier,
> > > start,
> > > end)) {
> > > + ++i;
> > > + goto retry;
> > > + }
> > > + }
> > > +
> > > + return end - start;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_find_or_insert - Find or insert GPU SVM
> > > range
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @fault_addr: Fault address
> > > + * @gpuva_start: Start address of GPUVA which mirrors CPU
> > > + * @gpuva_end: End address of GPUVA which mirrors CPU
> > > + * @ctx: GPU SVM context
> > > + *
> > > + * This function finds or inserts a newly allocated a GPU SVM
> > > range
> > > based on the
> > > + * fault address. Caller must hold a lock to protect range
> > > lookup
> > > and insertion.
> > > + *
> > > + * Returns:
> > > + * Pointer to the GPU SVM range on success, ERR_PTR() on
> > > failure.
> > > + */
> > > +struct drm_gpusvm_range *
> > > +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64
> > > fault_addr,
> > > + u64 gpuva_start, u64 gpuva_end,
> > > + const struct drm_gpusvm_ctx
> > > *ctx)
> > > +{
> > > + struct drm_gpusvm_notifier *notifier;
> > > + struct drm_gpusvm_range *range;
> > > + struct mm_struct *mm = gpusvm->mm;
> > > + struct vm_area_struct *vas;
> > > + bool notifier_alloc = false;
> > > + u64 chunk_size;
> > > + int err;
> > > + bool migrate_devmem;
> > > +
> > > + if (fault_addr < gpusvm->mm_start ||
> > > + fault_addr > gpusvm->mm_start + gpusvm->mm_range) {
> >
> > return ERR_PTR(-EINVAL)?
> >
>
> Sure.
>
> > > + err = -EINVAL;
> > > + goto err_out;
> > > + }
> > > +
> > > + if (!mmget_not_zero(mm)) {
> > > + err = -EFAULT;
> > > + goto err_out;
>
> And here too.
>
> > > + }
> > > +
> > > + notifier = drm_gpusvm_notifier_find(gpusvm, fault_addr);
> > > + if (!notifier) {
> > > + notifier = drm_gpusvm_notifier_alloc(gpusvm,
> > > fault_addr);
> > > + if (IS_ERR(notifier)) {
> > > + err = PTR_ERR(notifier);
> > > + goto err_mmunlock;
> > > + }
> > > + notifier_alloc = true;
> > > + err = mmu_interval_notifier_insert(¬ifier-
> > > > notifier,
> > > + mm, notifier-
> > > > interval.start,
> > > + notifier-
> > > > interval.end -
> > > + notifier-
> > > > interval.start,
> > > +
> > > &drm_gpusvm_notifier_ops);
> > > + if (err)
> > > + goto err_notifier;
> > > + }
> > > +
> > > + mmap_read_lock(mm);
> > > +
> > > + vas = vma_lookup(mm, fault_addr);
> > > + if (!vas) {
> > > + err = -ENOENT;
> > > + goto err_notifier_remove;
> > > + }
> > > +
> > > + if (!ctx->read_only && !(vas->vm_flags & VM_WRITE)) {
> > > + err = -EPERM;
> > > + goto err_notifier_remove;
> > > + }
> > > +
> > > + range = drm_gpusvm_range_find(notifier, fault_addr,
> > > fault_addr + 1);
> > > + if (range)
> > > + goto out_mmunlock;
> > > + /*
> > > + * XXX: Short-circuiting migration based on
> > > migrate_vma_*
> > > current
> > > + * limitations. If/when migrate_vma_* add more support,
> > > this
> > > logic will
> > > + * have to change.
> > > + */
> > > + migrate_devmem = ctx->devmem_possible &&
> > > + vma_is_anonymous(vas) &&
> > > !is_vm_hugetlb_page(vas);
> > > +
> > > + chunk_size = drm_gpusvm_range_chunk_size(gpusvm,
> > > notifier,
> > > vas,
> > > + fault_addr,
> > > gpuva_start,
> > > + gpuva_end,
> > > migrate_devmem &&
> > > + ctx-
> > > >check_pages);
> > > + if (chunk_size == LONG_MAX) {
> > > + err = -EINVAL;
> > > + goto err_notifier_remove;
> > > + }
> > > +
> > > + range = drm_gpusvm_range_alloc(gpusvm, notifier,
> > > fault_addr,
> > > chunk_size,
> > > + migrate_devmem);
> > > + if (IS_ERR(range)) {
> > > + err = PTR_ERR(range);
> > > + goto err_notifier_remove;
> > > + }
> > > +
> > > + drm_gpusvm_range_insert(notifier, range);
> > > + if (notifier_alloc)
> > > + drm_gpusvm_notifier_insert(gpusvm, notifier);
> > > +
> > > +out_mmunlock:
> > > + mmap_read_unlock(mm);
> > > + mmput(mm);
> > > +
> > > + return range;
> > > +
> > > +err_notifier_remove:
> > > + mmap_read_unlock(mm);
> > > + if (notifier_alloc)
> > > + mmu_interval_notifier_remove(¬ifier-
> > > >notifier);
> > > +err_notifier:
> > > + if (notifier_alloc)
> > > + drm_gpusvm_notifier_free(gpusvm, notifier);
> > > +err_mmunlock:
> > > + mmput(mm);
> > > +err_out:
> > > + return ERR_PTR(err);
> > > +}
> > > +
> > > +/**
> > > + * __drm_gpusvm_range_unmap_pages - Unmap pages associated with
> > > a
> > > GPU SVM range (internal)
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + * @npages: Number of pages to unmap
> > > + *
> > > + * This function unmap pages associated with a GPU SVM range.
> > > Assumes and
> > > + * asserts correct locking is in place when called.
> > > + */
> > > +static void __drm_gpusvm_range_unmap_pages(struct drm_gpusvm
> > > *gpusvm,
> > > + struct
> > > drm_gpusvm_range
> > > *range,
> > > + unsigned long npages)
> > > +{
> > > + unsigned long i, j;
> > > + struct drm_pagemap *dpagemap = range->dpagemap;
> > > + struct device *dev = gpusvm->drm->dev;
> > > +
> > > + lockdep_assert_held(&gpusvm->notifier_lock);
> > > +
> > > + if (range->flags.has_dma_mapping) {
> > > + for (i = 0, j = 0; i < npages; j++) {
> > > + struct drm_pagemap_dma_addr *addr =
> > > &range-
> > > > dma_addr[j];
> > > +
> > > + if (addr->proto ==
> > > DRM_INTERCONNECT_SYSTEM)
> > > {
> > > + dma_unmap_page(dev,
> > > + addr->addr,
> > > + PAGE_SIZE <<
> > > addr-
> > > > order,
> > > + addr->dir);
> > > + } else if (dpagemap && dpagemap->ops-
> > > > unmap_dma) {
> > > + dpagemap->ops-
> > > >unmap_dma(dpagemap,
> > > + dev,
> > > + *addr);
> > > + }
> > > + i += 1 << addr->order;
> > > + }
> > > + range->flags.has_devmem_pages = false;
> > > + range->flags.has_dma_mapping = false;
> > > + range->dpagemap = NULL;
> > > + }
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_free_pages - Free pages associated with a
> > > GPU
> > > SVM range
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + *
> > > + * This function free pages associated with a GPU SVM range.
> >
> > Frees the dma address array
> >
>
> Yes.
>
> >
> > > + */
> > > +static void drm_gpusvm_range_free_pages(struct drm_gpusvm
> > > *gpusvm,
> > > + struct drm_gpusvm_range
> > > *range)
> > > +{
> > > + lockdep_assert_held(&gpusvm->notifier_lock);
> > > +
> > > + if (range->dma_addr) {
> > > + kvfree(range->dma_addr);
> > > + range->dma_addr = NULL;
> > > + }
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_remove - Remove GPU SVM range
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range to be removed
> > > + *
> > > + * This function removes the specified GPU SVM range and also
> > > removes the parent
> > > + * GPU SVM notifier if no more ranges remain in the notifier.
> > > The
> > > caller must
> > > + * hold a lock to protect range and notifier removal.
> > > + */
> > > +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> > > + struct drm_gpusvm_range *range)
> > > +{
> > > + unsigned long npages = npages_in_range(range->va.start,
> > > range->va.end);
> > > + struct drm_gpusvm_notifier *notifier;
> > > +
> > > + notifier = drm_gpusvm_notifier_find(gpusvm, range-
> > > > va.start);
> > > + if (WARN_ON_ONCE(!notifier))
> > > + return;
> > > +
> > > + drm_gpusvm_notifier_lock(gpusvm);
> > > + __drm_gpusvm_range_unmap_pages(gpusvm, range, npages);
> > > + drm_gpusvm_range_free_pages(gpusvm, range);
> > > + __drm_gpusvm_range_remove(notifier, range);
> > > + drm_gpusvm_notifier_unlock(gpusvm);
> > > +
> > > + drm_gpusvm_range_put(range);
> > > +
> > > + if (RB_EMPTY_ROOT(¬ifier->root.rb_root)) {
> > > + if (!notifier->flags.removed)
> > > + mmu_interval_notifier_remove(¬ifier-
> > > > notifier);
> > > + drm_gpusvm_notifier_remove(gpusvm, notifier);
> > > + drm_gpusvm_notifier_free(gpusvm, notifier);
> > > + }
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_get - Get a reference to GPU SVM range
> > > + * @range: Pointer to the GPU SVM range
> > > + *
> > > + * This function increments the reference count of the specified
> > > GPU
> > > SVM range.
> > > + *
> > > + * Returns:
> > > + * Pointer to the GPU SVM range.
> > > + */
> > > +struct drm_gpusvm_range *
> > > +drm_gpusvm_range_get(struct drm_gpusvm_range *range)
> > > +{
> > > + kref_get(&range->refcount);
> > > +
> > > + return range;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_destroy - Destroy GPU SVM range
> > > + * @refcount: Pointer to the reference counter embedded in the
> > > GPU
> > > SVM range
> > > + *
> > > + * This function destroys the specified GPU SVM range when its
> > > reference count
> > > + * reaches zero. If a custom range-free function is provided, it
> > > is
> > > invoked to
> > > + * free the range; otherwise, the range is deallocated using
> > > kfree().
> > > + */
> > > +static void drm_gpusvm_range_destroy(struct kref *refcount)
> > > +{
> > > + struct drm_gpusvm_range *range =
> > > + container_of(refcount, struct drm_gpusvm_range,
> > > refcount);
> > > + struct drm_gpusvm *gpusvm = range->gpusvm;
> > > +
> > > + if (gpusvm->ops->range_free)
> > > + gpusvm->ops->range_free(range);
> > > + else
> > > + kfree(range);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_put - Put a reference to GPU SVM range
> > > + * @range: Pointer to the GPU SVM range
> > > + *
> > > + * This function decrements the reference count of the specified
> > > GPU
> > > SVM range
> > > + * and frees it when the count reaches zero.
> > > + */
> > > +void drm_gpusvm_range_put(struct drm_gpusvm_range *range)
> > > +{
> > > + kref_put(&range->refcount, drm_gpusvm_range_destroy);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_pages_valid - GPU SVM range pages valid
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + *
> > > + * This function determines if a GPU SVM range pages are valid.
> > > Expected be
> > > + * called holding gpusvm->notifier_lock and as the last step
> > > before
> > > commiting a
> > > + * GPU binding.
> > > + *
> > > + * Returns:
> > > + * True if GPU SVM range has valid pages, False otherwise
> > > + */
> > > +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> > > + struct drm_gpusvm_range
> > > *range)
> > > +{
> > > + lockdep_assert_held(&gpusvm->notifier_lock);
> > > +
> > > + return range->flags.has_devmem_pages || range-
> > > > flags.has_dma_mapping;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_pages_valid_unlocked - GPU SVM range pages
> > > valid
> > > unlocked
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + *
> > > + * This function determines if a GPU SVM range pages are valid.
> > > Expected be
> > > + * called without holding gpusvm->notifier_lock.
> > > + *
> > > + * Returns:
> > > + * True if GPU SVM range has valid pages, False otherwise
> > > + */
> > > +static bool
> > > +drm_gpusvm_range_pages_valid_unlocked(struct drm_gpusvm *gpusvm,
> > > + struct drm_gpusvm_range
> > > *range)
> > > +{
> > > + bool pages_valid;
> > > +
> > > + if (!range->dma_addr)
> > > + return false;
> > > +
> > > + drm_gpusvm_notifier_lock(gpusvm);
> > > + pages_valid = drm_gpusvm_range_pages_valid(gpusvm,
> > > range);
> > > + if (!pages_valid)
> > > + drm_gpusvm_range_free_pages(gpusvm, range);
> > > + drm_gpusvm_notifier_unlock(gpusvm);
> > > +
> > > + return pages_valid;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_get_pages - Get pages for a GPU SVM range
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + * @ctx: GPU SVM context
> > > + *
> > > + * This function gets pages for a GPU SVM range and ensures they
> > > are
> > > mapped for
> > > + * DMA access.
> > > + *
> > > + * Returns:
> > > + * 0 on success, negative error code on failure.
> > > + */
> > > +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> > > + struct drm_gpusvm_range *range,
> > > + const struct drm_gpusvm_ctx *ctx)
> > > +{
> > > + struct mmu_interval_notifier *notifier = &range-
> > > >notifier-
> > > > notifier;
> > > + struct hmm_range hmm_range = {
> > > + .default_flags = HMM_PFN_REQ_FAULT | (ctx-
> > > >read_only
> > > ? 0 :
> > > + HMM_PFN_REQ_WRITE),
> > > + .notifier = notifier,
> > > + .start = range->va.start,
> > > + .end = range->va.end,
> > > + .dev_private_owner = gpusvm-
> > > > device_private_page_owner,
> > > + };
> > > + struct mm_struct *mm = gpusvm->mm;
> > > + struct drm_gpusvm_zdd *zdd;
> > > + unsigned long timeout =
> > > + jiffies +
> > > msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > > + unsigned long i, j;
> > > + unsigned long npages = npages_in_range(range->va.start,
> > > range->va.end);
> > > + unsigned long num_dma_mapped;
> > > + unsigned int order = 0;
> > > + unsigned long *pfns;
> > > + struct page **pages;
> > > + int err = 0;
> > > + struct dev_pagemap *pagemap;
> > > + struct drm_pagemap *dpagemap;
> > > +
> > > +retry:
> > > + hmm_range.notifier_seq =
> > > mmu_interval_read_begin(notifier);
> > > + if (drm_gpusvm_range_pages_valid_unlocked(gpusvm,
> > > range))
> > > + goto set_seqno;
> > > +
> > > + pfns = kvmalloc_array(npages, sizeof(*pfns),
> > > GFP_KERNEL);
> > > + if (!pfns)
> > > + return -ENOMEM;
> > > +
> > > + if (!mmget_not_zero(mm)) {
> > > + err = -EFAULT;
> > > + goto err_out;
> > > + }
> > > +
> > > + hmm_range.hmm_pfns = pfns;
> > > + while (true) {
> > > + mmap_read_lock(mm);
> > > + err = hmm_range_fault(&hmm_range);
> > > + mmap_read_unlock(mm);
> > > +
> > > + if (err == -EBUSY) {
> > > + if (time_after(jiffies, timeout))
> > > + break;
> > > +
> > > + hmm_range.notifier_seq =
> > > mmu_interval_read_begin(notifier);
> > > + continue;
> > > + }
> > > + break;
> > > + }
> > > + mmput(mm);
> > > + if (err)
> > > + goto err_free;
> > > +
> > > + pages = (struct page **)pfns;
> > > +map_pages:
> > > + /*
> > > + * Perform all dma mappings under the notifier lock to
> > > not
> > > + * access freed pages. A notifier will either block on
> > > + * the notifier lock or unmap dma.
> > > + */
> > > + drm_gpusvm_notifier_lock(gpusvm);
> > > + if (mmu_interval_read_retry(notifier,
> > > hmm_range.notifier_seq)) {
> > > + drm_gpusvm_notifier_unlock(gpusvm);
> > > + goto retry;
> > > + }
> > > +
> > > + if (!range->dma_addr) {
> > > + /* Unlock and restart mapping to allocate
> > > memory. */
> > > + drm_gpusvm_notifier_unlock(gpusvm);
> > > + range->dma_addr = kvmalloc_array(npages,
> > > sizeof(*range->dma_addr),
> > > + GFP_KERNEL);
> > > + if (!range->dma_addr) {
> > > + err = -ENOMEM;
> > > + goto err_free;
> > > + }
> > > + goto map_pages;
> > > + }
> > > +
> > > + zdd = NULL;
> > > + num_dma_mapped = 0;
> > > + for (i = 0, j = 0; i < npages; ++j) {
> > > + struct page *page = hmm_pfn_to_page(pfns[i]);
> > > +
> > > + order = hmm_pfn_to_map_order(pfns[i]);
> > > + if (is_device_private_page(page) ||
> > > is_device_coherent_page(page)) {
> > > + if (zdd != page->zone_device_data && i >
> > > 0)
> > > {
> > > + err = -EOPNOTSUPP;
> > > + goto err_unmap;
> > > + }
> > > + zdd = page->zone_device_data;
> > > + if (pagemap != page->pgmap) {
> > > + if (i > 0) {
> > > + err = -EOPNOTSUPP;
> > > + goto err_unmap;
> > > + }
> > > +
> > > + pagemap = page->pgmap;
> > > + dpagemap = zdd-
> > > >devmem_allocation-
> > > > dpagemap;
> > > + if (drm_WARN_ON(gpusvm->drm,
> > > !dpagemap)) {
> > > + /*
> > > + * Raced. This is not
> > > supposed to happen
> > > + * since
> > > hmm_range_fault()
> > > should've migrated
> > > + * this page to system.
> > > + */
> > > + err = -EAGAIN;
> > > + goto err_unmap;
> > > + }
> > > + }
> > > + range->dma_addr[j] =
> > > + dpagemap->ops->map_dma(dpagemap,
> > > gpusvm->drm->dev,
> > > + page,
> > > order,
> > > +
> > > DMA_BIDIRECTIONAL);
> > > + if (dma_mapping_error(gpusvm->drm->dev,
> > > range->dma_addr[j].addr)) {
> > > + err = -EFAULT;
> > > + goto err_unmap;
> > > + }
> > > +
> > > + pages[i] = page;
> > > + } else {
> > > + dma_addr_t addr;
> > > +
> > > + if (is_zone_device_page(page) || zdd) {
> > > + err = -EOPNOTSUPP;
> >
> > I suppose before merging we want to support mixed ranges since
> > migration is best effort only, or what are the plans here?
> >
>
> I'd say initial merge no mixed support given that adds complexity and
> the
> current code is very stable - i.e., get in a simple and stable
> baseline and then
> build complexity on top incrementally. I have a lot of perf
> optimizations I'd like
> to get in but am omitting for now to stick to the aforementioned plan.
>
> Long term I think a drm_gpusvm_ctx argument will control if we want
> mixed
> mappings within a range.
OK.
>
> > > + goto err_unmap;
> > > + }
> > > +
> > > + addr = dma_map_page(gpusvm->drm->dev,
> > > + page, 0,
> > > + PAGE_SIZE << order,
> > > + DMA_BIDIRECTIONAL);
> > > + if (dma_mapping_error(gpusvm->drm->dev,
> > > addr)) {
> > > + err = -EFAULT;
> > > + goto err_unmap;
> > > + }
> > > +
> > > + range->dma_addr[j] =
> > > drm_pagemap_dma_addr_encode
> > > + (addr, DRM_INTERCONNECT_SYSTEM,
> > > order,
> > > + DMA_BIDIRECTIONAL);
> > > + }
> > > + i += 1 << order;
> > > + num_dma_mapped = i;
> > > + }
> > > +
> > > + range->flags.has_dma_mapping = true;
> > > + if (zdd) {
> > > + range->flags.has_devmem_pages = true;
> > > + range->dpagemap = dpagemap;
> > > + }
> > > +
> > > + drm_gpusvm_notifier_unlock(gpusvm);
> > > + kvfree(pfns);
> > > +set_seqno:
> > > + range->notifier_seq = hmm_range.notifier_seq;
> > > +
> > > + return 0;
> > > +
> > > +err_unmap:
> > > + __drm_gpusvm_range_unmap_pages(gpusvm, range,
> > > num_dma_mapped);
> > > + drm_gpusvm_notifier_unlock(gpusvm);
> > > +err_free:
> > > + kvfree(pfns);
> > > +err_out:
> > > + if (err == -EAGAIN)
> > > + goto retry;
> > > + return err;
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_range_unmap_pages - Unmap pages associated with a
> > > GPU
> > > SVM range
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @range: Pointer to the GPU SVM range structure
> > > + * @ctx: GPU SVM context
> > > + *
> > > + * This function unmaps pages associated with a GPU SVM range.
> > > If
> > > @in_notifier
> > > + * is set, it is assumed that gpusvm->notifier_lock is held in
> > > write
> > > mode; if it
> > > + * is clear, it acquires gpusvm->notifier_lock in read mode.
> > > Must be
> > > called on
> > > + * each GPU SVM range attached to notifier in gpusvm->ops-
> > > > invalidate for IOMMU
> > > + * security model.
> > > + */
> > > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> > > + struct drm_gpusvm_range
> > > *range,
> > > + const struct drm_gpusvm_ctx
> > > *ctx)
> > > +{
> > > + unsigned long npages = npages_in_range(range->va.start,
> > > range->va.end);
> > > +
> > > + if (ctx->in_notifier)
> > > + lockdep_assert_held_write(&gpusvm-
> > > >notifier_lock);
> > > + else
> > > + drm_gpusvm_notifier_lock(gpusvm);
> > > +
> > > + __drm_gpusvm_range_unmap_pages(gpusvm, range, npages);
> > > +
> > > + if (!ctx->in_notifier)
> > > + drm_gpusvm_notifier_unlock(gpusvm);
> > > +}
> >
> > NIT: Separate functions for locked / unlocked makes life easier for
> > static code analyzers.
> >
>
> Will do.
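
Something along these lines is what I meant (names and simplified signatures
purely illustrative), so that each variant has an unconditional locking
pattern the tools can follow:

/* Caller already holds gpusvm->notifier_lock (the in_notifier case). */
static void drm_gpusvm_range_unmap_pages_locked(struct drm_gpusvm *gpusvm,
						struct drm_gpusvm_range *range,
						unsigned long npages)
{
	lockdep_assert_held_write(&gpusvm->notifier_lock);
	__drm_gpusvm_range_unmap_pages(gpusvm, range, npages);
}

/* Takes the notifier lock itself. */
static void drm_gpusvm_range_unmap_pages_unlocked(struct drm_gpusvm *gpusvm,
						  struct drm_gpusvm_range *range,
						  unsigned long npages)
{
	drm_gpusvm_notifier_lock(gpusvm);
	__drm_gpusvm_range_unmap_pages(gpusvm, range, npages);
	drm_gpusvm_notifier_unlock(gpusvm);
}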
>
> >
> > Section below I think should belong to drm_pagemap.c
> >
>
> Disagree. See my comments on zdd above. Also
> drm_gpusvm_migration_put_pages uses
> migration pfns which definitely should not be in drm_pagemap.c.
Like mentioned above, with the upcoming populate_mm() moving to being a
drm_pagemap op, and none of the low-level migration helpers taking
drm_gpusvm as an argument, I think it makes sense. But if you'd rather
look at shuffling things around when that is in place, that's OK.
Agree, though, that anything needing struct drm_gpusvm should
*not* be in drm_pagemap.c.
Thanks,
Thomas
>
> > > +
> > > +/**
> > > + * drm_gpusvm_migration_put_page - Put a migration page
> > > + * @page: Pointer to the page to put
> > > + *
> > > + * This function unlocks and puts a page.
> > > + */
> > > +static void drm_gpusvm_migration_put_page(struct page *page)
> >
> > _unlock_put_page()?
> >
> > > +{
> > > + unlock_page(page);
> > > + put_page(page);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migration_put_pages - Put migration pages
> > > + * @npages: Number of pages
> > > + * @migrate_pfn: Array of migrate page frame numbers
> > > + *
> > > + * This function puts an array of pages.
> > > + */
> > > +static void drm_gpusvm_migration_put_pages(unsigned long npages,
> > > + unsigned long
> > > *migrate_pfn)
> > > +{
> > > + unsigned long i;
> > > +
> > > + for (i = 0; i < npages; ++i) {
> > > + if (!migrate_pfn[i])
> > > + continue;
> > > +
> > > + drm_gpusvm_migration_put_page(migrate_pfn_to_page(migrate_pfn[i]));
> > > + migrate_pfn[i] = 0;
> > > + }
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_get_devmem_page - Get a reference to a device
> > > memory
> > > page
> > > + * @page: Pointer to the page
> > > + * @zdd: Pointer to the GPU SVM zone device data
> > > + *
> > > + * This function associates the given page with the specified
> > > GPU
> > > SVM zone
> > > + * device data and initializes it for zone device usage.
> > > + */
> > > +static void drm_gpusvm_get_devmem_page(struct page *page,
> > > + struct drm_gpusvm_zdd *zdd)
> > > +{
> > > + page->zone_device_data = drm_gpusvm_zdd_get(zdd);
> > > + zone_device_page_init(page);
> > > +}
> > > +
> > > +/**
> > > + * drm_gpusvm_migrate_map_pages() - Map migration pages for GPU
> > > SVM
> > > migration
> > > + * @dev: The device for which the pages are being mapped
> > > + * @dma_addr: Array to store DMA addresses corresponding to
> > > mapped
> > > pages
> > > + * @migrate_pfn: Array of migrate page frame numbers to map
> > > + * @npages: Number of pages to map
> > > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > > + *
> > > + * This function maps pages of memory for migration usage in GPU
> > > SVM. It
> > > + * iterates over each page frame number provided in
> > > @migrate_pfn,
> > > maps the
> > > + * corresponding page, and stores the DMA address in the
> > > provided
> > > @dma_addr
> > > + * array.
> > > + *
> > > + * Return: 0 on success, -EFAULT if an error occurs during
> > > mapping.
> > > + */
> > > +static int drm_gpusvm_migrate_map_pages(struct device *dev,
> > > + dma_addr_t *dma_addr,
> > > + long unsigned int
> > > *migrate_pfn,
> > > + unsigned long npages,
> > > + enum dma_data_direction
> > > dir)
> > > +{
> > > + unsigned long i;
> > > +
> > > + for (i = 0; i < npages; ++i) {
> > > + struct page *page =
> > > migrate_pfn_to_page(migrate_pfn[i]);
> > > +
> > > + if (!page)
> > > + continue;
> > > +
> > > + if (WARN_ON_ONCE(is_zone_device_page(page)))
> > > + return -EFAULT;
> > > +
> > > + dma_addr[i] = dma_map_page(dev, page, 0,
> > > PAGE_SIZE,
> > > dir);
> > > + if (dma_mapping_error(dev, dma_addr[i]))
> > > + return -EFAULT;
> > > + }
> > > +
> > > + return 0;
> > > +}
> >
> > TBC'd
> >
>
> Thanks for the comments!
>
> Matt
>
> > /Thomas
> >
^ permalink raw reply [flat|nested] 129+ messages in thread

* Re: [PATCH v2 05/29] drm/gpusvm: Add support for GPU Shared Virtual Memory
2024-11-04 18:59 ` Thomas Hellström
@ 2024-11-04 23:07 ` Matthew Brost
2024-11-05 10:22 ` Thomas Hellström
0 siblings, 1 reply; 129+ messages in thread
From: Matthew Brost @ 2024-11-04 23:07 UTC (permalink / raw)
To: Thomas Hellström
Cc: intel-xe, dri-devel, apopple, airlied, christian.koenig,
simona.vetter, felix.kuehling, dakr
On Mon, Nov 04, 2024 at 07:59:10PM +0100, Thomas Hellström wrote:
> On Mon, 2024-11-04 at 09:21 -0800, Matthew Brost wrote:
> > On Mon, Nov 04, 2024 at 04:25:38PM +0100, Thomas Hellström wrote:
> > > On Tue, 2024-10-15 at 20:24 -0700, Matthew Brost wrote:
> > >
> > > Continued review.
> > >
> > > >
> > > > +/**
> > > > + * struct drm_gpusvm_zdd - GPU SVM zone device data
> > > > + *
> > > > + * @refcount: Reference count for the zdd
> > > > + * @destroy_work: Work structure for asynchronous zdd
> > > > destruction
> > > > + * @devmem_allocation: device memory allocation
> > > > + * @device_private_page_owner: Device private pages owner
> > > > + *
> > > > + * This structure serves as a generic wrapper installed in
> > > > + * page->zone_device_data. It provides infrastructure for
> > > > looking up
> > > > a device
> > > > + * memory allocation upon CPU page fault and asynchronously
> > > > releasing device
> > > > + * memory once the CPU has no page references. Asynchronous
> > > > release
> > > > is useful
> > > > + * because CPU page references can be dropped in IRQ contexts,
> > > > while
> > > > releasing
> > > > + * device memory likely requires sleeping locks.
> > > > + */
> > > > +struct drm_gpusvm_zdd {
> > > > + struct kref refcount;
> > > > + struct work_struct destroy_work;
> > > > + struct drm_gpusvm_devmem *devmem_allocation;
> > > > + void *device_private_page_owner;
> > > > +};
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_zdd_destroy_work_func - Work function for
> > > > destroying a
> > > > zdd
> > >
> > > NIT: Even if the above kerneldoc format works, I keep trying to
> > > enforce
> > > using () after function names and function-like macros, like
> > > described
> > > here: https://docs.kernel.org/doc-guide/kernel-doc.html Could we
> > > update? Also that doc calls for using "Return:" instead of
> > > "Returns:".
> > >
> > >
> >
> > Will fix up. Thanks for the ref.
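
To make sure we're aligned, e.g. the drm_gpusvm_zdd_alloc() comment below
would become something like this, i.e. () after the name and "Return:"
instead of "Returns:" (rest of the text unchanged):

/**
 * drm_gpusvm_zdd_alloc() - Allocate a zdd structure.
 * @device_private_page_owner: Device private pages owner
 *
 * This function allocates and initializes a new zdd structure. It sets up
 * the reference count and initializes the destroy work.
 *
 * Return: Pointer to the allocated zdd on success, ERR_PTR() on failure.
 */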
> >
> > > > + * @w: Pointer to the work_struct
> > > > + *
> > > > + * This function releases device memory, puts GPU SVM range, and
> > > > frees zdd.
> > > > + */
> > > > +static void drm_gpusvm_zdd_destroy_work_func(struct work_struct
> > > > *w)
> > > > +{
> > > > + struct drm_gpusvm_zdd *zdd =
> > > > + container_of(w, struct drm_gpusvm_zdd,
> > > > destroy_work);
> > > > + const struct drm_gpusvm_devmem_ops *ops = zdd-
> > > > > devmem_allocation ?
> > > > + zdd->devmem_allocation->ops : NULL;
> > > > +
> > > > + if (zdd->devmem_allocation && ops->devmem_release)
> > > > + ops->devmem_release(zdd->devmem_allocation);
> > > > + kfree(zdd);
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_zdd_alloc - Allocate a zdd structure.
> > > > + * @device_private_page_owner: Device private pages owner
> > > > + *
> > > > + * This function allocates and initializes a new zdd structure.
> > > > It
> > > > sets up the
> > > > + * reference count and initializes the destroy work.
> > > > + *
> > > > + * Returns:
> > > > + * Pointer to the allocated zdd on success, ERR_PTR() on
> > > > failure.
> > > > + */
> > > > +static struct drm_gpusvm_zdd *
> > > > +drm_gpusvm_zdd_alloc(void *device_private_page_owner)
> > > > +{
> > > > + struct drm_gpusvm_zdd *zdd;
> > > > +
> > > > + zdd = kmalloc(sizeof(*zdd), GFP_KERNEL);
> > > > + if (!zdd)
> > > > + return NULL;
> > > > +
> > > > + kref_init(&zdd->refcount);
> > > > + INIT_WORK(&zdd->destroy_work,
> > > > drm_gpusvm_zdd_destroy_work_func);
> > > > + zdd->devmem_allocation = NULL;
> > > > + zdd->device_private_page_owner =
> > > > device_private_page_owner;
> > > > +
> > > > + return zdd;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_zdd_get - Get a reference to a zdd structure.
> > > > + * @zdd: Pointer to the zdd structure.
> > > > + *
> > > > + * This function increments the reference count of the provided
> > > > zdd
> > > > structure.
> > > > + *
> > > > + * Returns: Pointer to the zdd structure.
> > > > + */
> > > > +static struct drm_gpusvm_zdd *drm_gpusvm_zdd_get(struct
> > > > drm_gpusvm_zdd *zdd)
> > > > +{
> > > > + kref_get(&zdd->refcount);
> > > > + return zdd;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_zdd_destroy - Destroy a zdd structure.
> > > > + * @ref: Pointer to the reference count structure.
> > > > + *
> > > > + * This function queues the destroy_work of the zdd for
> > > > asynchronous
> > > > destruction.
> > > > + */
> > > > +static void drm_gpusvm_zdd_destroy(struct kref *ref)
> > > > +{
> > > > + struct drm_gpusvm_zdd *zdd =
> > > > + container_of(ref, struct drm_gpusvm_zdd,
> > > > refcount);
> > > > +
> > > > + if (zdd->devmem_allocation)
> > > > + WRITE_ONCE(zdd->devmem_allocation->detached,
> > > > true);
> > > > + schedule_work(&zdd->destroy_work);
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_zdd_put - Put a zdd reference.
> > > > + * @zdd: Pointer to the zdd structure.
> > > > + *
> > > > + * This function decrements the reference count of the provided
> > > > zdd
> > > > structure
> > > > + * and schedules its destruction if the count drops to zero.
> > > > + */
> > > > +static void drm_gpusvm_zdd_put(struct drm_gpusvm_zdd *zdd)
> > > > +{
> > > > + kref_put(&zdd->refcount, drm_gpusvm_zdd_destroy);
> > > > +}
> > >
> > > As mentioned earlier, I think the above drm_gpusvm_zdd functions
> > > should
> > > move to drm_pagemap.c. I don't think they are used in drm_gpusvm
> > > other
> > > than to, at get_pages time, ensure all device private pages are
> > > from
> > > the same pagemap?
> > >
> >
> > The are used in __drm_gpusvm_migrate_to_ram to find devmem_allocation
> > and
> > associated ops.
> >
> > Also in drm_gpusvm_migrate_to_ram to find the size and
> > device_private_page_owner.
> >
> > I think the placement here is correct for now but open to shuffling
> > this around
> > in the future if this makes sense.
>
> Yeah I was thinking that with the split in the multi-device series (which
> is yet to be posted, though), the drm_pagemap op (*populate_mm)() would
> in effect handle all migration to vram, and the dev_pagemap op would
> handle migration to ram, and the ranges would no longer keep track of
> the vram allocations.
>
I'd have to look at the multi-device stuff to fully understand this.
> This means that no low-level migration function would take drm_gpusvm
> as an argument (well, drm_gpusvm_range_evict() would, but that would, as
> we discussed before, probably want to evict *all* device private pages,
> so it would likely need to be implemented with hmm_range_fault()?)
>
hmm_range_fault() doesn't work when trying to evict coherent pages... otherwise
yeah, that would make drm_gpusvm_range_evict() easy to implement. At one point I
had it that way.
> So this was mostly trying to avoid future shuffling around, but agree
> it depends on dev_pagemap patches.
>
Churn isn't great but I think it's bound to happen as we implement more things /
do perf optimizations in Xe on top of this series. I think I'm ok with that as
long as it is more or less just moving code around, not large rewrites.
> >
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_find - Find GPU SVM range from GPU SVM
> > > > notifier
> > > > + * @notifier: Pointer to the GPU SVM notifier structure.
> > > > + * @start: Start address of the range
> > > > + * @end: End address of the range
> > > > + *
> > > > + * Return: A pointer to the drm_gpusvm_range if found or NULL
> > > > + */
> > > > +struct drm_gpusvm_range *
> > > > +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64
> > > > start, u64 end)
> > > > +{
> > > > + return range_iter_first(¬ifier->root, start, end -
> > > > 1);
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_for_each_range_safe - Safely iterate over GPU SVM
> > > > ranges in a notifier
> > > > + * @range__: Iterator variable for the ranges
> > > > + * @next__: Iterator variable for the ranges temporay storage
> > > > + * @notifier__: Pointer to the GPU SVM notifier
> > > > + * @start__: Start address of the range
> > > > + * @end__: End address of the range
> > > > + *
> > > > + * This macro is used to iterate over GPU SVM ranges in a
> > > > notifier
> > > > while
> > > > + * removing ranges from it.
> > > > + */
> > > > +#define drm_gpusvm_for_each_range_safe(range__, next__,
> > > > notifier__,
> > > > start__, end__) \
> > > > + for ((range__) = drm_gpusvm_range_find((notifier__),
> > > > (start__), (end__)), \
> > > > + (next__) =
> > > > __drm_gpusvm_range_next(range__); \
> > > > + (range__) && (range__->va.start <
> > > > (end__)); \
> > > > + (range__) = (next__), (next__) =
> > > > __drm_gpusvm_range_next(range__))
> > > > +
> > > > +/**
> > > > + * __drm_gpusvm_notifier_next - get the next drm_gpusvm_notifier
> > > > in
> > > > the list
> > > > + * @notifier: a pointer to the current drm_gpusvm_notifier
> > > > + *
> > > > + * Return: A pointer to the next drm_gpusvm_notifier if
> > > > available,
> > > > or NULL if
> > > > + * the current notifier is the last one or if the input
> > > > notifier is
> > > > + * NULL.
> > > > + */
> > > > +static struct drm_gpusvm_notifier *
> > > > +__drm_gpusvm_notifier_next(struct drm_gpusvm_notifier *notifier)
> > > > +{
> > > > + if (notifier && !list_is_last(¬ifier->rb.entry,
> > > > + ¬ifier->gpusvm-
> > > > > notifier_list))
> > > > + return list_next_entry(notifier, rb.entry);
> > >
> > > Why aren't we using notifier_iter_next() here? Then the linked list
> > > could be skipped.
> > >
> >
> > I shamelessly copied this from GPU VM. I think the list is useful for
> > faster
> > iteration and safe removal of items while walking.
>
> We have
> https://elixir.bootlin.com/linux/v6.12-rc6/source/include/linux/interval_tree_generic.h#L24
>
> to relate to. Now GPUVM can't use the generic version since it needs
> u64 intervals. These trees need unsigned long only so it should be ok.
> And safe removal, isn't that possible to implement without the list?
> Then it's really only the linked list as a perf optimization I guess,
> but we have a lot of those pending...
>
See my other comments. Let me just follow up on using a maple tree; perhaps a
list isn't required if we use that. Will have a definite answer in my next rev.
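
Roughly the direction I'm thinking of (completely untested sketch; the
gpusvm->notifier_mt field is made up here, and locking / GFP details are TBD).
A maple tree gives both the address lookup and an in-order walk, so the extra
linked list wouldn't be needed:

	/* once at init time */
	mt_init(&gpusvm->notifier_mt);

	/* insert: store the notifier over its address interval */
	err = mtree_store_range(&gpusvm->notifier_mt, notifier->interval.start,
				notifier->interval.end - 1, notifier, GFP_KERNEL);

	/* lookup by fault address */
	notifier = mtree_load(&gpusvm->notifier_mt, fault_addr);

	/* in-order walk over [start, end), no separate list needed */
	unsigned long index = start;

	mt_for_each(&gpusvm->notifier_mt, notifier, index, end - 1) {
		/* visit notifiers in address order */
	}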
>
>
> >
> > > > +
> > > > + return NULL;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_for_each_notifier - Iterate over GPU SVM notifiers
> > > > in
> > > > a gpusvm
> > > > + * @notifier__: Iterator variable for the notifiers
> > > > + * @notifier__: Pointer to the GPU SVM notifier
> > > > + * @start__: Start address of the notifier
> > > > + * @end__: End address of the notifier
> > > > + *
> > > > + * This macro is used to iterate over GPU SVM notifiers in a
> > > > gpusvm.
> > > > + */
> > > > +#define drm_gpusvm_for_each_notifier(notifier__, gpusvm__,
> > > > start__,
> > > > end__) \
> > > > + for ((notifier__) = notifier_iter_first(&(gpusvm__)-
> > > > >root,
> > > > (start__), (end__) - 1); \
> > > > + (notifier__) && (notifier__->interval.start <
> > > > (end__)); \
> > > > + (notifier__) =
> > > > __drm_gpusvm_notifier_next(notifier__))
> > > > +
> > >
> > > Looks like end__ is not honored except for the first iteration.
> > > Relates
> > > to the above question.
> > >
> >
> > Again shameless copy from GPU VM... Missing what the problem is. The
> > condition
> > to break the loop is:
> >
> > '(notifier__) && (notifier__->interval.start < (end__)'
>
> Ah yes, you're right. I missed that.
>
> >
> > > > +/**
> > > > + * drm_gpusvm_for_each_notifier_safe - Safely iterate over GPU
> > > > SVM
> > > > notifiers in a gpusvm
> > > > + * @notifier__: Iterator variable for the notifiers
> > > > + * @next__: Iterator variable for the notifiers temporay storage
> > > > + * @notifier__: Pointer to the GPU SVM notifier
> > > > + * @start__: Start address of the notifier
> > > > + * @end__: End address of the notifier
> > > > + *
> > > > + * This macro is used to iterate over GPU SVM notifiers in a
> > > > gpusvm
> > > > while
> > > > + * removing notifiers from it.
> > > > + */
> > > > +#define drm_gpusvm_for_each_notifier_safe(notifier__, next__,
> > > > gpusvm__, start__, end__) \
> > > > + for ((notifier__) = notifier_iter_first(&(gpusvm__)-
> > > > >root,
> > > > (start__), (end__) - 1), \
> > > > + (next__) =
> > > > __drm_gpusvm_notifier_next(notifier__); \
> > > > + (notifier__) && (notifier__->interval.start <
> > > > (end__)); \
> > > > + (notifier__) = (next__), (next__) =
> > > > __drm_gpusvm_notifier_next(notifier__))
> > >
> > > Same here.
> > >
> >
> > Also present:
> >
> > (notifier__) && (notifier__->interval.start < (end__)
> >
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_notifier_invalidate - Invalidate a GPU SVM
> > > > notifier.
> > > > + * @mni: Pointer to the mmu_interval_notifier structure.
> > > > + * @mmu_range: Pointer to the mmu_notifier_range structure.
> > > > + * @cur_seq: Current sequence number.
> > > > + *
> > > > + * This function serves as a generic MMU notifier for GPU SVM.
> > > > It
> > > > sets the MMU
> > > > + * notifier sequence number and calls the driver invalidate
> > > > vfunc
> > > > under
> > > > + * gpusvm->notifier_lock.
> > > > + *
> > > > + * Returns:
> > > > + * true if the operation succeeds, false otherwise.
> > > > + */
> > > > +static bool
> > > > +drm_gpusvm_notifier_invalidate(struct mmu_interval_notifier
> > > > *mni,
> > > > + const struct mmu_notifier_range
> > > > *mmu_range,
> > > > + unsigned long cur_seq)
> > > > +{
> > > > + struct drm_gpusvm_notifier *notifier =
> > > > + container_of(mni, typeof(*notifier), notifier);
> > > > + struct drm_gpusvm *gpusvm = notifier->gpusvm;
> > > > +
> > > > + if (!mmu_notifier_range_blockable(mmu_range))
> > > > + return false;
> > > > +
> > > > + down_write(&gpusvm->notifier_lock);
> > > > + mmu_interval_set_seq(mni, cur_seq);
> > > > + gpusvm->ops->invalidate(gpusvm, notifier, mmu_range);
> > > > + up_write(&gpusvm->notifier_lock);
> > > > +
> > > > + return true;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_notifier_ops - MMU interval notifier operations
> > > > for
> > > > GPU SVM
> > > > + */
> > > > +static const struct mmu_interval_notifier_ops
> > > > drm_gpusvm_notifier_ops = {
> > > > + .invalidate = drm_gpusvm_notifier_invalidate,
> > > > +};
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_init - Initialize the GPU SVM.
> > > > + * @gpusvm: Pointer to the GPU SVM structure.
> > > > + * @name: Name of the GPU SVM.
> > > > + * @drm: Pointer to the DRM device structure.
> > > > + * @mm: Pointer to the mm_struct for the address space.
> > > > + * @device_private_page_owner: Device private pages owner.
> > > > + * @mm_start: Start address of GPU SVM.
> > > > + * @mm_range: Range of the GPU SVM.
> > > > + * @notifier_size: Size of individual notifiers.
> > > > + * @ops: Pointer to the operations structure for GPU SVM.
> > > > + * @chunk_sizes: Pointer to the array of chunk sizes used in
> > > > range
> > > > allocation.
> > > > + * Entries should be powers of 2 in descending
> > > > order
> > > > with last
> > > > + * entry being SZ_4K.
> > > > + * @num_chunks: Number of chunks.
> > > > + *
> > > > + * This function initializes the GPU SVM.
> > > > + *
> > > > + * Returns:
> > > > + * 0 on success, a negative error code on failure.
> > > > + */
> > > > +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> > > > + const char *name, struct drm_device *drm,
> > > > + struct mm_struct *mm, void
> > > > *device_private_page_owner,
> > > > + u64 mm_start, u64 mm_range, u64
> > > > notifier_size,
> > > > + const struct drm_gpusvm_ops *ops,
> > > > + const u64 *chunk_sizes, int num_chunks)
> > > > +{
> > > > + if (!ops->invalidate || !num_chunks)
> > > > + return -EINVAL;
> > > > +
> > > > + gpusvm->name = name;
> > > > + gpusvm->drm = drm;
> > > > + gpusvm->mm = mm;
> > > > + gpusvm->device_private_page_owner =
> > > > device_private_page_owner;
> > > > + gpusvm->mm_start = mm_start;
> > > > + gpusvm->mm_range = mm_range;
> > > > + gpusvm->notifier_size = notifier_size;
> > > > + gpusvm->ops = ops;
> > > > + gpusvm->chunk_sizes = chunk_sizes;
> > > > + gpusvm->num_chunks = num_chunks;
> > > > +
> > > > + mmgrab(mm);
> > > > + gpusvm->root = RB_ROOT_CACHED;
> > > > + INIT_LIST_HEAD(&gpusvm->notifier_list);
> > > > +
> > > > + init_rwsem(&gpusvm->notifier_lock);
> > > > +
> > > > + fs_reclaim_acquire(GFP_KERNEL);
> > > > + might_lock(&gpusvm->notifier_lock);
> > > > + fs_reclaim_release(GFP_KERNEL);
> > > > +
> > > > + return 0;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_notifier_find - Find GPU SVM notifier
> > > > + * @gpusvm__: Pointer to the GPU SVM structure
> > > > + * @fault_addr__: Fault address
> > > > + *
> > > > + * This macro finds the GPU SVM notifier associated with the
> > > > fault
> > > > address.
> > > > + *
> > > > + * Returns:
> > > > + * Pointer to the GPU SVM notifier on success, NULL otherwise.
> > > > + */
> > > > +#define drm_gpusvm_notifier_find(gpusvm__,
> > > > fault_addr__) \
> > > > + notifier_iter_first(&(gpusvm__)->root,
> > > > (fault_addr__), \
> > > > + (fault_addr__ + 1))
> > > > +
> > > > +/**
> > > > + * to_drm_gpusvm_notifier - retrieve the container struct for a
> > > > given rbtree node
> > > > + * @node__: a pointer to the rbtree node embedded within a
> > > > drm_gpusvm_notifier struct
> > > > + *
> > > > + * Return: A pointer to the containing drm_gpusvm_notifier
> > > > structure.
> > > > + */
> > > > +#define
> > > > to_drm_gpusvm_notifier(__node) \
> > > > + container_of((__node), struct drm_gpusvm_notifier,
> > > > rb.node)
> > > > +
> > >
> > > There appears to be a number of function-like macros in the code,
> > > which
> > > look like they can be converted to functions. Linux prefers
> > > functions
> > > over macros when possible:
> > >
> > > https://www.kernel.org/doc/html/v5.8/process/coding-style.html#macros-enums-and-rtl
> > >
> >
> > Will convert all macros to functions where possible. Again, thanks for the
> > ref.
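
Something like this is what I have in mind for e.g. drm_gpusvm_notifier_find()
(sketch only, not final code; assumes the generated notifier_iter_first()):

static struct drm_gpusvm_notifier *
drm_gpusvm_notifier_find(struct drm_gpusvm *gpusvm, u64 fault_addr)
{
	return notifier_iter_first(&gpusvm->root, fault_addr, fault_addr + 1);
}

Same pattern for the other function-like macros that don't need to be macros.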
> >
> > >
> > > > +/**
> > > > + * drm_gpusvm_notifier_insert - Insert GPU SVM notifier
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > > + *
> > > > + * This function inserts the GPU SVM notifier into the GPU SVM
> > > > RB
> > > > tree and list.
> > > > + */
> > > > +static void drm_gpusvm_notifier_insert(struct drm_gpusvm
> > > > *gpusvm,
> > > > + struct
> > > > drm_gpusvm_notifier
> > > > *notifier)
> > > > +{
> > > > + struct rb_node *node;
> > > > + struct list_head *head;
> > > > +
> > > > + notifier_insert(notifier, &gpusvm->root);
> > > > +
> > > > + node = rb_prev(&notifier->rb.node);
> > > > + if (node)
> > > > + head = &(to_drm_gpusvm_notifier(node))->rb.entry;
> > > > + else
> > > > + head = &gpusvm->notifier_list;
> > > > +
> > > > + list_add(&notifier->rb.entry, head);
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_notifier_remove - Remove GPU SVM notifier
> > > > + * @gpusvm__: Pointer to the GPU SVM structure
> > > > + * @notifier__: Pointer to the GPU SVM notifier structure
> > > > + *
> > > > + * This macro removes the GPU SVM notifier from the GPU SVM RB
> > > > tree
> > > > and list.
> > > > + */
> > > > +#define drm_gpusvm_notifier_remove(gpusvm__,
> > > > notifier__) \
> > > > + notifier_remove((notifier__), &(gpusvm__)-
> > > > >root); \
> > > > + list_del(&(notifier__)->rb.entry)
> > >
> > > Unless this can be made a function, Pls use
> > > do { } while (0)
> > >
> >
> > I think it can be made a function or otherwise yea will use do { }
> > while (0).
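
i.e. either of these (sketch only, not final code):

static void drm_gpusvm_notifier_remove(struct drm_gpusvm *gpusvm,
				       struct drm_gpusvm_notifier *notifier)
{
	notifier_remove(notifier, &gpusvm->root);
	list_del(&notifier->rb.entry);
}

or, if it has to stay a macro:

#define drm_gpusvm_notifier_remove(gpusvm__, notifier__)		\
	do {								\
		notifier_remove((notifier__), &(gpusvm__)->root);	\
		list_del(&(notifier__)->rb.entry);			\
	} while (0)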
> >
> > >
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_fini - Finalize the GPU SVM.
> > > > + * @gpusvm: Pointer to the GPU SVM structure.
> > > > + *
> > > > + * This function finalizes the GPU SVM by cleaning up any
> > > > remaining
> > > > ranges and
> > > > + * notifiers, and dropping a reference to struct MM.
> > > > + */
> > > > +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm)
> > > > +{
> > > > + struct drm_gpusvm_notifier *notifier, *next;
> > > > +
> > > > + drm_gpusvm_for_each_notifier_safe(notifier, next,
> > > > gpusvm, 0,
> > > > LONG_MAX) {
> > > > + struct drm_gpusvm_range *range, *__next;
> > > > +
> > > > + /*
> > > > + * Remove notifier first to avoid racing with
> > > > any
> > > > invalidation
> > > > + */
> > > > + mmu_interval_notifier_remove(&notifier->notifier);
> > > > + notifier->flags.removed = true;
> > > > +
> > > > + drm_gpusvm_for_each_range_safe(range, __next,
> > > > notifier, 0,
> > > > + LONG_MAX)
> > > > + drm_gpusvm_range_remove(gpusvm, range);
> > > > + }
> > > > +
> > > > + mmdrop(gpusvm->mm);
> > > > + WARN_ON(!RB_EMPTY_ROOT(&gpusvm->root.rb_root));
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_notifier_alloc - Allocate GPU SVM notifier
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @fault_addr: Fault address
> > > > + *
> > > > + * This function allocates and initializes the GPU SVM notifier
> > > > structure.
> > > > + *
> > > > + * Returns:
> > > > + * Pointer to the allocated GPU SVM notifier on success,
> > > > ERR_PTR()
> > > > on failure.
> > > > + */
> > > > +static struct drm_gpusvm_notifier *
> > > > +drm_gpusvm_notifier_alloc(struct drm_gpusvm *gpusvm, u64
> > > > fault_addr)
> > > > +{
> > > > + struct drm_gpusvm_notifier *notifier;
> > > > +
> > > > + if (gpusvm->ops->notifier_alloc)
> > > > + notifier = gpusvm->ops->notifier_alloc();
> > > > + else
> > > > + notifier = kzalloc(sizeof(*notifier),
> > > > GFP_KERNEL);
> > > > +
> > > > + if (!notifier)
> > > > + return ERR_PTR(-ENOMEM);
> > > > +
> > > > + notifier->gpusvm = gpusvm;
> > > > + notifier->interval.start = ALIGN_DOWN(fault_addr, gpusvm->notifier_size);
> > > > + notifier->interval.end = ALIGN(fault_addr + 1, gpusvm->notifier_size);
> > > > + INIT_LIST_HEAD(&notifier->rb.entry);
> > > > + notifier->root = RB_ROOT_CACHED;
> > > > + INIT_LIST_HEAD(&notifier->range_list);
> > > > +
> > > > + return notifier;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_notifier_free - Free GPU SVM notifier
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > > + *
> > > > + * This function frees the GPU SVM notifier structure.
> > > > + */
> > > > +static void drm_gpusvm_notifier_free(struct drm_gpusvm *gpusvm,
> > > > + struct drm_gpusvm_notifier
> > > > *notifier)
> > > > +{
> > > > + WARN_ON(!RB_EMPTY_ROOT(&notifier->root.rb_root));
> > > > +
> > > > + if (gpusvm->ops->notifier_free)
> > > > + gpusvm->ops->notifier_free(notifier);
> > > > + else
> > > > + kfree(notifier);
> > > > +}
> > > > +
> > > > +/**
> > > > + * to_drm_gpusvm_range - retrieve the container struct for a
> > > > given
> > > > rbtree node
> > > > + * @node__: a pointer to the rbtree node embedded within a
> > > > drm_gpusvm_range struct
> > > > + *
> > > > + * Return: A pointer to the containing drm_gpusvm_range
> > > > structure.
> > > > + */
> > > > +#define to_drm_gpusvm_range(node__) \
> > > > + container_of((node__), struct drm_gpusvm_range, rb.node)
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_insert - Insert GPU SVM range
> > > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > > + * @range: Pointer to the GPU SVM range structure
> > > > + *
> > > > + * This function inserts the GPU SVM range into the notifier RB
> > > > tree
> > > > and list.
> > > > + */
> > > > +static void drm_gpusvm_range_insert(struct drm_gpusvm_notifier
> > > > *notifier,
> > > > + struct drm_gpusvm_range
> > > > *range)
> > > > +{
> > > > + struct rb_node *node;
> > > > + struct list_head *head;
> > > > +
> > > > + drm_gpusvm_notifier_lock(notifier->gpusvm);
> > > > + range_insert(range, &notifier->root);
> > > > +
> > > > + node = rb_prev(&range->rb.node);
> > > > + if (node)
> > > > + head = &(to_drm_gpusvm_range(node))->rb.entry;
> > > > + else
> > > > + head = &notifier->range_list;
> > > > +
> > > > + list_add(&range->rb.entry, head);
> > > > + drm_gpusvm_notifier_unlock(notifier->gpusvm);
> > > > +}
> > > > +
> > > > +/**
> > > > + * __drm_gpusvm_range_remove - Remove GPU SVM range
> > > > + * @notifier__: Pointer to the GPU SVM notifier structure
> > > > + * @range__: Pointer to the GPU SVM range structure
> > > > + *
> > > > + * This macro removes the GPU SVM range from the notifier RB
> > > > tree
> > > > and list.
> > > > + */
> > > > +#define __drm_gpusvm_range_remove(notifier__,
> > > > range__) \
> > > > + range_remove((range__), &(notifier__)-
> > > > >root); \
> > > > + list_del(&(range__)->rb.entry)
> > >
> > > Same thing as for the notifier rb tree. And do we need the linked
> > > list?
> > >
> >
> > Same answer.
> >
> > >
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_alloc - Allocate GPU SVM range
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > > + * @fault_addr: Fault address
> > > > + * @chunk_size: Chunk size
> > > > + * @migrate_devmem: Flag indicating whether to migrate device
> > > > memory
> > > > + *
> > > > + * This function allocates and initializes the GPU SVM range
> > > > structure.
> > > > + *
> > > > + * Returns:
> > > > + * Pointer to the allocated GPU SVM range on success, ERR_PTR()
> > > > on
> > > > failure.
> > > > + */
> > > > +static struct drm_gpusvm_range *
> > > > +drm_gpusvm_range_alloc(struct drm_gpusvm *gpusvm,
> > > > + struct drm_gpusvm_notifier *notifier,
> > > > + u64 fault_addr, u64 chunk_size, bool
> > > > migrate_devmem)
> > > > +{
> > > > + struct drm_gpusvm_range *range;
> > > > +
> > > > + if (gpusvm->ops->range_alloc)
> > > > + range = gpusvm->ops->range_alloc(gpusvm);
> > > > + else
> > > > + range = kzalloc(sizeof(*range), GFP_KERNEL);
> > > > +
> > > > + if (!range)
> > > > + return ERR_PTR(-ENOMEM);
> > > > +
> > > > + kref_init(&range->refcount);
> > > > + range->gpusvm = gpusvm;
> > > > + range->notifier = notifier;
> > > > + range->va.start = ALIGN_DOWN(fault_addr, chunk_size);
> > > > + range->va.end = ALIGN(fault_addr + 1, chunk_size);
> > > > + INIT_LIST_HEAD(&range->rb.entry);
> > > > + range->notifier_seq = LONG_MAX;
> > > > + range->flags.migrate_devmem = migrate_devmem ? 1 : 0;
> > > > +
> > > > + return range;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_check_pages - Check pages
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > > + * @start: Start address
> > > > + * @end: End address
> > > > + *
> > > > + * Check if pages between start and end have been faulted in on
> > > > the
> > > > CPU. Used to
> > > > + * prevent migration of pages without CPU backing store.
> > > > + *
> > > > + * Returns:
> > > > + * True if pages have been faulted into CPU, False otherwise
> > > > + */
> > > > +static bool drm_gpusvm_check_pages(struct drm_gpusvm *gpusvm,
> > > > + struct drm_gpusvm_notifier
> > > > *notifier,
> > > > + u64 start, u64 end)
> > > > +{
> > > > + struct hmm_range hmm_range = {
> > > > + .default_flags = 0,
> > > > + .notifier = &notifier->notifier,
> > > > + .start = start,
> > > > + .end = end,
> > > > + .dev_private_owner = gpusvm->device_private_page_owner,
> > > > + };
> > > > + unsigned long timeout =
> > > > + jiffies +
> > > > msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > > > + unsigned long *pfns;
> > > > + unsigned long npages = npages_in_range(start, end);
> > > > + int err, i;
> > > > +
> > > > + mmap_assert_locked(gpusvm->mm);
> > > > +
> > > > + pfns = kvmalloc_array(npages, sizeof(*pfns),
> > > > GFP_KERNEL);
> > > > + if (!pfns)
> > > > + return false;
> > > > +
> > > > + hmm_range.notifier_seq = mmu_interval_read_begin(&notifier->notifier);
> > > > + hmm_range.hmm_pfns = pfns;
> > > > +
> > > > + while (true) {
> > > > + err = hmm_range_fault(&hmm_range);
> > > > + if (err == -EBUSY) {
> > > > + if (time_after(jiffies, timeout))
> > > > + break;
> > > > +
> > > > + hmm_range.notifier_seq = mmu_interval_read_begin(&notifier->notifier);
> > > > + continue;
> > > > + }
> > > > + break;
> > > > + }
> > > > + if (err)
> > > > + goto err_free;
> > > > +
> > > > + for (i = 0; i < npages;) {
> > > > + if (!(pfns[i] & HMM_PFN_VALID)) {
> > > > + err = -EFAULT;
> > > > + goto err_free;
> > > > + }
> > > > + i += 0x1 << hmm_pfn_to_map_order(pfns[i]);
> > > > + }
> > > > +
> > > > +err_free:
> > > > + kvfree(pfns);
> > > > + return err ? false : true;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_chunk_size - Determine chunk size for GPU
> > > > SVM
> > > > range
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @notifier: Pointer to the GPU SVM notifier structure
> > > > + * @vas: Pointer to the virtual memory area structure
> > > > + * @fault_addr: Fault address
> > > > + * @gpuva_start: Start address of GPUVA which mirrors CPU
> > > > + * @gpuva_end: End address of GPUVA which mirrors CPU
> > > > + * @check_pages: Flag indicating whether to check pages
> > > > + *
> > > > + * This function determines the chunk size for the GPU SVM range
> > > > based on the
> > > > + * fault address, GPU SVM chunk sizes, existing GPU SVM ranges,
> > > > and
> > > > the virtual
> > > > + * memory area boundaries.
> > > > + *
> > > > + * Returns:
> > > > + * Chunk size on success, LONG_MAX on failure.
> > > > + */
> > > > +static u64 drm_gpusvm_range_chunk_size(struct drm_gpusvm
> > > > *gpusvm,
> > > > + struct
> > > > drm_gpusvm_notifier
> > > > *notifier,
> > > > + struct vm_area_struct
> > > > *vas,
> > > > + u64 fault_addr, u64
> > > > gpuva_start,
> > > > + u64 gpuva_end, bool
> > > > check_pages)
> > > > +{
> > > > + u64 start, end;
> > > > + int i = 0;
> > > > +
> > > > +retry:
> > > > + for (; i < gpusvm->num_chunks; ++i) {
> > > > + start = ALIGN_DOWN(fault_addr, gpusvm-
> > > > > chunk_sizes[i]);
> > > > + end = ALIGN(fault_addr + 1, gpusvm-
> > > > >chunk_sizes[i]);
> > > > +
> > > > + if (start >= vas->vm_start && end <= vas->vm_end
> > > > &&
> > > > + start >= notifier->interval.start &&
> > > > + end <= notifier->interval.end &&
> > > > + start >= gpuva_start && end <= gpuva_end)
> > > > + break;
> > > > + }
> > > > +
> > > > + if (i == gpusvm->num_chunks)
> > > > + return LONG_MAX;
> > > > +
> > > > + /*
> > > > + * If allocating more than a page, ensure not to overlap
> > > > with
> > > > existing
> > > > + * ranges.
> > > > + */
> > > > + if (end - start != SZ_4K) {
> > > > + struct drm_gpusvm_range *range;
> > > > +
> > > > + range = drm_gpusvm_range_find(notifier, start,
> > > > end);
> > > > + if (range) {
> > > > + ++i;
> > > > + goto retry;
> > > > + }
> > > > +
> > > > + /*
> > > > + * XXX: Only create range on pages CPU has
> > > > faulted
> > > > in. Without
> > > > + * this check, or prefault, on BMG
> > > > 'xe_exec_system_allocator --r
> > > > + * process-many-malloc' fails. In the failure
> > > > case,
> > > > each process
> > > > + * mallocs 16k but the CPU VMA is ~128k which
> > > > results in 64k SVM
> > > > + * ranges. When migrating the SVM ranges, some
> > > > processes fail in
> > > > + * drm_gpusvm_migrate_to_devmem with
> > > > 'migrate.cpages
> > > > != npages'
> > > > + * and then upon drm_gpusvm_range_get_pages
> > > > device
> > > > pages from
> > > > + * other processes are collected + faulted in
> > > > which
> > > > creates all
> > > > + * sorts of problems. Unsure exactly how this is
> > > > happening, also the
> > > > + * problem goes away if
> > > > 'xe_exec_system_allocator --
> > > > r
> > > > + * process-many-malloc' mallocs at least 64k at
> > > > a
> > > > time.
> > > > + */
> > >
> > > Needs to be figured out. I think even in the system allocator case,
> > > if
> > > a user uses malloc() to allocate a GPU only buffer we'd need to
> > > support
> > > that?
> > >
> >
> > I'm not understanding this comment, but I do agree that what is going on
> > here needs to
> > be figured out.
>
> What I meant was let's say the user mallocs a big buffer to be used by
> the gpu only, and hence should ideally be in device memory, but it's
> never faulted by the CPU. I guess my question should be reformulated,
> what would happen then? Wouldn't it remain in system ram forever?
>
Right now I think we'd fault in 1 CPU page at a time - yeah, not great. See
below; I will follow up on this to get a definite explanation of what is going
on here.
> >
> > This comment is actually a bit stale - I think the above test case
> > will pass now
> > if ctx.check_pages is false with a retry loop triggered in GPU fault
> > handler
> > because of mixed pages. However it does appear the test case still
> > finds device
> > pages in hmm_range_fault mapped into a different process which I
> > think should be
> > impossible. Wondering if there is an hmm / mm core bug here that my test
> > case hits? Let me page this information back and dig in here to see if I
> > can explain what is going on better. Will take some time but should be
> > able to focus on this during the week.
> >
> > Also I think leaving in the check_pages option is a good thing. A caller
> > then can choose between 2 things:
> >
> > 1. Only create GPU mappings for CPU pages faulted in (ctx.check_pages
> > = true)
> > 2. create GPU mappings for a VMA and fault in CPU pages
> > (ctx.check_pages =
> > false)
> >
> > If we support 2, then I think xe_svm_copy needs to be updated to
> > clear VRAM for
> > pages which the CPU has not faulted in.
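
For the two options above, roughly how a caller would choose (sketch; field
names per the current patch):

	struct drm_gpusvm_ctx ctx = {
		.devmem_possible = true,
		/* 1: only build GPU mappings over pages the CPU has faulted in */
		.check_pages = true,
	};
	struct drm_gpusvm_range *range;

	range = drm_gpusvm_range_find_or_insert(gpusvm, fault_addr,
						gpuva_start, gpuva_end, &ctx);

Setting .check_pages = false instead gives option 2, at which point
xe_svm_copy would need the VRAM clearing mentioned above.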
> >
> > >
> > > > + if (check_pages &&
> > > > + !drm_gpusvm_check_pages(gpusvm, notifier,
> > > > start,
> > > > end)) {
> > > > + ++i;
> > > > + goto retry;
> > > > + }
> > > > + }
> > > > +
> > > > + return end - start;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_find_or_insert - Find or insert GPU SVM
> > > > range
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @fault_addr: Fault address
> > > > + * @gpuva_start: Start address of GPUVA which mirrors CPU
> > > > + * @gpuva_end: End address of GPUVA which mirrors CPU
> > > > + * @ctx: GPU SVM context
> > > > + *
> > > > + * This function finds or inserts a newly allocated GPU SVM
> > > > range
> > > > based on the
> > > > + * fault address. Caller must hold a lock to protect range
> > > > lookup
> > > > and insertion.
> > > > + *
> > > > + * Returns:
> > > > + * Pointer to the GPU SVM range on success, ERR_PTR() on
> > > > failure.
> > > > + */
> > > > +struct drm_gpusvm_range *
> > > > +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64
> > > > fault_addr,
> > > > + u64 gpuva_start, u64 gpuva_end,
> > > > + const struct drm_gpusvm_ctx
> > > > *ctx)
> > > > +{
> > > > + struct drm_gpusvm_notifier *notifier;
> > > > + struct drm_gpusvm_range *range;
> > > > + struct mm_struct *mm = gpusvm->mm;
> > > > + struct vm_area_struct *vas;
> > > > + bool notifier_alloc = false;
> > > > + u64 chunk_size;
> > > > + int err;
> > > > + bool migrate_devmem;
> > > > +
> > > > + if (fault_addr < gpusvm->mm_start ||
> > > > + fault_addr > gpusvm->mm_start + gpusvm->mm_range) {
> > >
> > > return ERR_PTR(-EINVAL)?
> > >
> >
> > Sure.
> >
> > > > + err = -EINVAL;
> > > > + goto err_out;
> > > > + }
> > > > +
> > > > + if (!mmget_not_zero(mm)) {
> > > > + err = -EFAULT;
> > > > + goto err_out;
> >
> > And here too.
> >
> > > > + }
> > > > +
> > > > + notifier = drm_gpusvm_notifier_find(gpusvm, fault_addr);
> > > > + if (!notifier) {
> > > > + notifier = drm_gpusvm_notifier_alloc(gpusvm,
> > > > fault_addr);
> > > > + if (IS_ERR(notifier)) {
> > > > + err = PTR_ERR(notifier);
> > > > + goto err_mmunlock;
> > > > + }
> > > > + notifier_alloc = true;
> > > > + err = mmu_interval_notifier_insert(&notifier->notifier,
> > > > + mm, notifier->interval.start,
> > > > + notifier->interval.end -
> > > > + notifier->interval.start,
> > > > + &drm_gpusvm_notifier_ops);
> > > > + if (err)
> > > > + goto err_notifier;
> > > > + }
> > > > +
> > > > + mmap_read_lock(mm);
> > > > +
> > > > + vas = vma_lookup(mm, fault_addr);
> > > > + if (!vas) {
> > > > + err = -ENOENT;
> > > > + goto err_notifier_remove;
> > > > + }
> > > > +
> > > > + if (!ctx->read_only && !(vas->vm_flags & VM_WRITE)) {
> > > > + err = -EPERM;
> > > > + goto err_notifier_remove;
> > > > + }
> > > > +
> > > > + range = drm_gpusvm_range_find(notifier, fault_addr,
> > > > fault_addr + 1);
> > > > + if (range)
> > > > + goto out_mmunlock;
> > > > + /*
> > > > + * XXX: Short-circuiting migration based on
> > > > migrate_vma_*
> > > > current
> > > > + * limitations. If/when migrate_vma_* add more support,
> > > > this
> > > > logic will
> > > > + * have to change.
> > > > + */
> > > > + migrate_devmem = ctx->devmem_possible &&
> > > > + vma_is_anonymous(vas) &&
> > > > !is_vm_hugetlb_page(vas);
> > > > +
> > > > + chunk_size = drm_gpusvm_range_chunk_size(gpusvm,
> > > > notifier,
> > > > vas,
> > > > + fault_addr,
> > > > gpuva_start,
> > > > + gpuva_end,
> > > > migrate_devmem &&
> > > > + ctx-
> > > > >check_pages);
> > > > + if (chunk_size == LONG_MAX) {
> > > > + err = -EINVAL;
> > > > + goto err_notifier_remove;
> > > > + }
> > > > +
> > > > + range = drm_gpusvm_range_alloc(gpusvm, notifier,
> > > > fault_addr,
> > > > chunk_size,
> > > > + migrate_devmem);
> > > > + if (IS_ERR(range)) {
> > > > + err = PTR_ERR(range);
> > > > + goto err_notifier_remove;
> > > > + }
> > > > +
> > > > + drm_gpusvm_range_insert(notifier, range);
> > > > + if (notifier_alloc)
> > > > + drm_gpusvm_notifier_insert(gpusvm, notifier);
> > > > +
> > > > +out_mmunlock:
> > > > + mmap_read_unlock(mm);
> > > > + mmput(mm);
> > > > +
> > > > + return range;
> > > > +
> > > > +err_notifier_remove:
> > > > + mmap_read_unlock(mm);
> > > > + if (notifier_alloc)
> > > > + mmu_interval_notifier_remove(&notifier->notifier);
> > > > +err_notifier:
> > > > + if (notifier_alloc)
> > > > + drm_gpusvm_notifier_free(gpusvm, notifier);
> > > > +err_mmunlock:
> > > > + mmput(mm);
> > > > +err_out:
> > > > + return ERR_PTR(err);
> > > > +}
> > > > +
> > > > +/**
> > > > + * __drm_gpusvm_range_unmap_pages - Unmap pages associated with
> > > > a
> > > > GPU SVM range (internal)
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @range: Pointer to the GPU SVM range structure
> > > > + * @npages: Number of pages to unmap
> > > > + *
> > > > + * This function unmaps pages associated with a GPU SVM range.
> > > > Assumes and
> > > > + * asserts correct locking is in place when called.
> > > > + */
> > > > +static void __drm_gpusvm_range_unmap_pages(struct drm_gpusvm
> > > > *gpusvm,
> > > > + struct
> > > > drm_gpusvm_range
> > > > *range,
> > > > + unsigned long npages)
> > > > +{
> > > > + unsigned long i, j;
> > > > + struct drm_pagemap *dpagemap = range->dpagemap;
> > > > + struct device *dev = gpusvm->drm->dev;
> > > > +
> > > > + lockdep_assert_held(&gpusvm->notifier_lock);
> > > > +
> > > > + if (range->flags.has_dma_mapping) {
> > > > + for (i = 0, j = 0; i < npages; j++) {
> > > > + struct drm_pagemap_dma_addr *addr =
> > > > &range-
> > > > > dma_addr[j];
> > > > +
> > > > + if (addr->proto ==
> > > > DRM_INTERCONNECT_SYSTEM)
> > > > {
> > > > + dma_unmap_page(dev,
> > > > + addr->addr,
> > > > + PAGE_SIZE <<
> > > > addr-
> > > > > order,
> > > > + addr->dir);
> > > > + } else if (dpagemap && dpagemap->ops-
> > > > > unmap_dma) {
> > > > + dpagemap->ops-
> > > > >unmap_dma(dpagemap,
> > > > + dev,
> > > > + *addr);
> > > > + }
> > > > + i += 1 << addr->order;
> > > > + }
> > > > + range->flags.has_devmem_pages = false;
> > > > + range->flags.has_dma_mapping = false;
> > > > + range->dpagemap = NULL;
> > > > + }
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_free_pages - Free pages associated with a
> > > > GPU
> > > > SVM range
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @range: Pointer to the GPU SVM range structure
> > > > + *
> > > > + * This function frees pages associated with a GPU SVM range.
> > >
> > > Frees the dma address array
> > >
> >
> > Yes.
> >
> > >
> > > > + */
> > > > +static void drm_gpusvm_range_free_pages(struct drm_gpusvm
> > > > *gpusvm,
> > > > + struct drm_gpusvm_range
> > > > *range)
> > > > +{
> > > > + lockdep_assert_held(&gpusvm->notifier_lock);
> > > > +
> > > > + if (range->dma_addr) {
> > > > + kvfree(range->dma_addr);
> > > > + range->dma_addr = NULL;
> > > > + }
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_remove - Remove GPU SVM range
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @range: Pointer to the GPU SVM range to be removed
> > > > + *
> > > > + * This function removes the specified GPU SVM range and also
> > > > removes the parent
> > > > + * GPU SVM notifier if no more ranges remain in the notifier.
> > > > The
> > > > caller must
> > > > + * hold a lock to protect range and notifier removal.
> > > > + */
> > > > +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> > > > + struct drm_gpusvm_range *range)
> > > > +{
> > > > + unsigned long npages = npages_in_range(range->va.start,
> > > > range->va.end);
> > > > + struct drm_gpusvm_notifier *notifier;
> > > > +
> > > > + notifier = drm_gpusvm_notifier_find(gpusvm, range-
> > > > > va.start);
> > > > + if (WARN_ON_ONCE(!notifier))
> > > > + return;
> > > > +
> > > > + drm_gpusvm_notifier_lock(gpusvm);
> > > > + __drm_gpusvm_range_unmap_pages(gpusvm, range, npages);
> > > > + drm_gpusvm_range_free_pages(gpusvm, range);
> > > > + __drm_gpusvm_range_remove(notifier, range);
> > > > + drm_gpusvm_notifier_unlock(gpusvm);
> > > > +
> > > > + drm_gpusvm_range_put(range);
> > > > +
> > > > + if (RB_EMPTY_ROOT(&notifier->root.rb_root)) {
> > > > + if (!notifier->flags.removed)
> > > > + mmu_interval_notifier_remove(&notifier->notifier);
> > > > + drm_gpusvm_notifier_remove(gpusvm, notifier);
> > > > + drm_gpusvm_notifier_free(gpusvm, notifier);
> > > > + }
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_get - Get a reference to GPU SVM range
> > > > + * @range: Pointer to the GPU SVM range
> > > > + *
> > > > + * This function increments the reference count of the specified
> > > > GPU
> > > > SVM range.
> > > > + *
> > > > + * Returns:
> > > > + * Pointer to the GPU SVM range.
> > > > + */
> > > > +struct drm_gpusvm_range *
> > > > +drm_gpusvm_range_get(struct drm_gpusvm_range *range)
> > > > +{
> > > > + kref_get(&range->refcount);
> > > > +
> > > > + return range;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_destroy - Destroy GPU SVM range
> > > > + * @refcount: Pointer to the reference counter embedded in the
> > > > GPU
> > > > SVM range
> > > > + *
> > > > + * This function destroys the specified GPU SVM range when its
> > > > reference count
> > > > + * reaches zero. If a custom range-free function is provided, it
> > > > is
> > > > invoked to
> > > > + * free the range; otherwise, the range is deallocated using
> > > > kfree().
> > > > + */
> > > > +static void drm_gpusvm_range_destroy(struct kref *refcount)
> > > > +{
> > > > + struct drm_gpusvm_range *range =
> > > > + container_of(refcount, struct drm_gpusvm_range,
> > > > refcount);
> > > > + struct drm_gpusvm *gpusvm = range->gpusvm;
> > > > +
> > > > + if (gpusvm->ops->range_free)
> > > > + gpusvm->ops->range_free(range);
> > > > + else
> > > > + kfree(range);
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_put - Put a reference to GPU SVM range
> > > > + * @range: Pointer to the GPU SVM range
> > > > + *
> > > > + * This function decrements the reference count of the specified
> > > > GPU
> > > > SVM range
> > > > + * and frees it when the count reaches zero.
> > > > + */
> > > > +void drm_gpusvm_range_put(struct drm_gpusvm_range *range)
> > > > +{
> > > > + kref_put(&range->refcount, drm_gpusvm_range_destroy);
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_pages_valid - GPU SVM range pages valid
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @range: Pointer to the GPU SVM range structure
> > > > + *
> > > > + * This function determines if a GPU SVM range's pages are valid.
> > > > + * Expected to be called holding gpusvm->notifier_lock and as the last
> > > > + * step before committing a GPU binding.
> > > > + *
> > > > + * Returns:
> > > > + * True if GPU SVM range has valid pages, False otherwise
> > > > + */
> > > > +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> > > > + struct drm_gpusvm_range
> > > > *range)
> > > > +{
> > > > + lockdep_assert_held(&gpusvm->notifier_lock);
> > > > +
> > > > + return range->flags.has_devmem_pages || range-
> > > > > flags.has_dma_mapping;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_pages_valid_unlocked - GPU SVM range pages
> > > > valid
> > > > unlocked
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @range: Pointer to the GPU SVM range structure
> > > > + *
> > > > + * This function determines if a GPU SVM range's pages are valid.
> > > > + * Expected to be called without holding gpusvm->notifier_lock.
> > > > + *
> > > > + * Returns:
> > > > + * True if GPU SVM range has valid pages, False otherwise
> > > > + */
> > > > +static bool
> > > > +drm_gpusvm_range_pages_valid_unlocked(struct drm_gpusvm *gpusvm,
> > > > + struct drm_gpusvm_range
> > > > *range)
> > > > +{
> > > > + bool pages_valid;
> > > > +
> > > > + if (!range->dma_addr)
> > > > + return false;
> > > > +
> > > > + drm_gpusvm_notifier_lock(gpusvm);
> > > > + pages_valid = drm_gpusvm_range_pages_valid(gpusvm,
> > > > range);
> > > > + if (!pages_valid)
> > > > + drm_gpusvm_range_free_pages(gpusvm, range);
> > > > + drm_gpusvm_notifier_unlock(gpusvm);
> > > > +
> > > > + return pages_valid;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_get_pages - Get pages for a GPU SVM range
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @range: Pointer to the GPU SVM range structure
> > > > + * @ctx: GPU SVM context
> > > > + *
> > > > + * This function gets pages for a GPU SVM range and ensures they
> > > > are
> > > > mapped for
> > > > + * DMA access.
> > > > + *
> > > > + * Returns:
> > > > + * 0 on success, negative error code on failure.
> > > > + */
> > > > +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> > > > + struct drm_gpusvm_range *range,
> > > > + const struct drm_gpusvm_ctx *ctx)
> > > > +{
> > > > + struct mmu_interval_notifier *notifier = &range-
> > > > >notifier-
> > > > > notifier;
> > > > + struct hmm_range hmm_range = {
> > > > + .default_flags = HMM_PFN_REQ_FAULT | (ctx-
> > > > >read_only
> > > > ? 0 :
> > > > + HMM_PFN_REQ_WRITE),
> > > > + .notifier = notifier,
> > > > + .start = range->va.start,
> > > > + .end = range->va.end,
> > > > + .dev_private_owner = gpusvm-
> > > > > device_private_page_owner,
> > > > + };
> > > > + struct mm_struct *mm = gpusvm->mm;
> > > > + struct drm_gpusvm_zdd *zdd;
> > > > + unsigned long timeout =
> > > > + jiffies +
> > > > msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > > > + unsigned long i, j;
> > > > + unsigned long npages = npages_in_range(range->va.start,
> > > > range->va.end);
> > > > + unsigned long num_dma_mapped;
> > > > + unsigned int order = 0;
> > > > + unsigned long *pfns;
> > > > + struct page **pages;
> > > > + int err = 0;
> > > > + struct dev_pagemap *pagemap;
> > > > + struct drm_pagemap *dpagemap;
> > > > +
> > > > +retry:
> > > > + hmm_range.notifier_seq =
> > > > mmu_interval_read_begin(notifier);
> > > > + if (drm_gpusvm_range_pages_valid_unlocked(gpusvm,
> > > > range))
> > > > + goto set_seqno;
> > > > +
> > > > + pfns = kvmalloc_array(npages, sizeof(*pfns),
> > > > GFP_KERNEL);
> > > > + if (!pfns)
> > > > + return -ENOMEM;
> > > > +
> > > > + if (!mmget_not_zero(mm)) {
> > > > + err = -EFAULT;
> > > > + goto err_out;
> > > > + }
> > > > +
> > > > + hmm_range.hmm_pfns = pfns;
> > > > + while (true) {
> > > > + mmap_read_lock(mm);
> > > > + err = hmm_range_fault(&hmm_range);
> > > > + mmap_read_unlock(mm);
> > > > +
> > > > + if (err == -EBUSY) {
> > > > + if (time_after(jiffies, timeout))
> > > > + break;
> > > > +
> > > > + hmm_range.notifier_seq =
> > > > mmu_interval_read_begin(notifier);
> > > > + continue;
> > > > + }
> > > > + break;
> > > > + }
> > > > + mmput(mm);
> > > > + if (err)
> > > > + goto err_free;
> > > > +
> > > > + pages = (struct page **)pfns;
> > > > +map_pages:
> > > > + /*
> > > > + * Perform all dma mappings under the notifier lock to
> > > > not
> > > > + * access freed pages. A notifier will either block on
> > > > + * the notifier lock or unmap dma.
> > > > + */
> > > > + drm_gpusvm_notifier_lock(gpusvm);
> > > > + if (mmu_interval_read_retry(notifier,
> > > > hmm_range.notifier_seq)) {
> > > > + drm_gpusvm_notifier_unlock(gpusvm);
> > > > + goto retry;
> > > > + }
> > > > +
> > > > + if (!range->dma_addr) {
> > > > + /* Unlock and restart mapping to allocate
> > > > memory. */
> > > > + drm_gpusvm_notifier_unlock(gpusvm);
> > > > + range->dma_addr = kvmalloc_array(npages,
> > > > sizeof(*range->dma_addr),
> > > > + GFP_KERNEL);
> > > > + if (!range->dma_addr) {
> > > > + err = -ENOMEM;
> > > > + goto err_free;
> > > > + }
> > > > + goto map_pages;
> > > > + }
> > > > +
> > > > + zdd = NULL;
> > > > + num_dma_mapped = 0;
> > > > + for (i = 0, j = 0; i < npages; ++j) {
> > > > + struct page *page = hmm_pfn_to_page(pfns[i]);
> > > > +
> > > > + order = hmm_pfn_to_map_order(pfns[i]);
> > > > + if (is_device_private_page(page) ||
> > > > is_device_coherent_page(page)) {
> > > > + if (zdd != page->zone_device_data && i >
> > > > 0)
> > > > {
> > > > + err = -EOPNOTSUPP;
> > > > + goto err_unmap;
> > > > + }
> > > > + zdd = page->zone_device_data;
> > > > + if (pagemap != page->pgmap) {
> > > > + if (i > 0) {
> > > > + err = -EOPNOTSUPP;
> > > > + goto err_unmap;
> > > > + }
> > > > +
> > > > + pagemap = page->pgmap;
> > > > + dpagemap = zdd-
> > > > >devmem_allocation-
> > > > > dpagemap;
> > > > + if (drm_WARN_ON(gpusvm->drm,
> > > > !dpagemap)) {
> > > > + /*
> > > > + * Raced. This is not
> > > > supposed to happen
> > > > + * since
> > > > hmm_range_fault()
> > > > should've migrated
> > > > + * this page to system.
> > > > + */
> > > > + err = -EAGAIN;
> > > > + goto err_unmap;
> > > > + }
> > > > + }
> > > > + range->dma_addr[j] =
> > > > + dpagemap->ops->map_dma(dpagemap,
> > > > gpusvm->drm->dev,
> > > > + page,
> > > > order,
> > > > +
> > > > DMA_BIDIRECTIONAL);
> > > > + if (dma_mapping_error(gpusvm->drm->dev,
> > > > range->dma_addr[j].addr)) {
> > > > + err = -EFAULT;
> > > > + goto err_unmap;
> > > > + }
> > > > +
> > > > + pages[i] = page;
> > > > + } else {
> > > > + dma_addr_t addr;
> > > > +
> > > > + if (is_zone_device_page(page) || zdd) {
> > > > + err = -EOPNOTSUPP;
> > >
> > > I suppose before merging we want to support mixed ranges since
> > > migration is best effort only, or what are the plans here?
> > >
> >
> > I'd say for the initial merge, no mixed support, given that adds complexity
> > and the current code is very stable - i.e., get in a simple and stable
> > baseline and then build complexity on top incrementally. I have a lot of
> > perf optimizations I'd like to get in but am omitting them for now to stick
> > to the aforementioned plan.
> >
> > Long term, I think a drm_gpusvm_ctx argument will control if we want
> > mixed
> > mappings within a range.
>
> OK.
>
> >
> > > > + goto err_unmap;
> > > > + }
> > > > +
> > > > + addr = dma_map_page(gpusvm->drm->dev,
> > > > + page, 0,
> > > > + PAGE_SIZE << order,
> > > > + DMA_BIDIRECTIONAL);
> > > > + if (dma_mapping_error(gpusvm->drm->dev,
> > > > addr)) {
> > > > + err = -EFAULT;
> > > > + goto err_unmap;
> > > > + }
> > > > +
> > > > + range->dma_addr[j] =
> > > > drm_pagemap_dma_addr_encode
> > > > + (addr, DRM_INTERCONNECT_SYSTEM,
> > > > order,
> > > > + DMA_BIDIRECTIONAL);
> > > > + }
> > > > + i += 1 << order;
> > > > + num_dma_mapped = i;
> > > > + }
> > > > +
> > > > + range->flags.has_dma_mapping = true;
> > > > + if (zdd) {
> > > > + range->flags.has_devmem_pages = true;
> > > > + range->dpagemap = dpagemap;
> > > > + }
> > > > +
> > > > + drm_gpusvm_notifier_unlock(gpusvm);
> > > > + kvfree(pfns);
> > > > +set_seqno:
> > > > + range->notifier_seq = hmm_range.notifier_seq;
> > > > +
> > > > + return 0;
> > > > +
> > > > +err_unmap:
> > > > + __drm_gpusvm_range_unmap_pages(gpusvm, range,
> > > > num_dma_mapped);
> > > > + drm_gpusvm_notifier_unlock(gpusvm);
> > > > +err_free:
> > > > + kvfree(pfns);
> > > > +err_out:
> > > > + if (err == -EAGAIN)
> > > > + goto retry;
> > > > + return err;
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_range_unmap_pages - Unmap pages associated with a
> > > > GPU
> > > > SVM range
> > > > + * @gpusvm: Pointer to the GPU SVM structure
> > > > + * @range: Pointer to the GPU SVM range structure
> > > > + * @ctx: GPU SVM context
> > > > + *
> > > > + * This function unmaps pages associated with a GPU SVM range.
> > > > If
> > > > @in_notifier
> > > > + * is set, it is assumed that gpusvm->notifier_lock is held in
> > > > write
> > > > mode; if it
> > > > + * is clear, it acquires gpusvm->notifier_lock in read mode.
> > > > Must be
> > > > called on
> > > > + * each GPU SVM range attached to notifier in gpusvm->ops-
> > > > > invalidate for IOMMU
> > > > + * security model.
> > > > + */
> > > > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> > > > + struct drm_gpusvm_range
> > > > *range,
> > > > + const struct drm_gpusvm_ctx
> > > > *ctx)
> > > > +{
> > > > + unsigned long npages = npages_in_range(range->va.start,
> > > > range->va.end);
> > > > +
> > > > + if (ctx->in_notifier)
> > > > + lockdep_assert_held_write(&gpusvm-
> > > > >notifier_lock);
> > > > + else
> > > > + drm_gpusvm_notifier_lock(gpusvm);
> > > > +
> > > > + __drm_gpusvm_range_unmap_pages(gpusvm, range, npages);
> > > > +
> > > > + if (!ctx->in_notifier)
> > > > + drm_gpusvm_notifier_unlock(gpusvm);
> > > > +}
> > >
> > > NIT: Separate functions for locked / unlocked makes life easier for
> > > static code analyzers.
> > >
> >
> > Will do.
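
Roughly what that split would look like (sketch, names not final), keeping the
existing __drm_gpusvm_range_unmap_pages() as the common helper:

void drm_gpusvm_range_unmap_pages_in_notifier(struct drm_gpusvm *gpusvm,
					      struct drm_gpusvm_range *range)
{
	lockdep_assert_held_write(&gpusvm->notifier_lock);
	__drm_gpusvm_range_unmap_pages(gpusvm, range,
				       npages_in_range(range->va.start,
						       range->va.end));
}

void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
				  struct drm_gpusvm_range *range)
{
	drm_gpusvm_notifier_lock(gpusvm);
	__drm_gpusvm_range_unmap_pages(gpusvm, range,
				       npages_in_range(range->va.start,
						       range->va.end));
	drm_gpusvm_notifier_unlock(gpusvm);
}

i.e. drop ctx->in_notifier and let the caller pick the right entry point.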
> >
> > >
> > > Section below I think should belong to drm_pagemap.c
> > >
> >
> > Disagree. See my comments on zdd above. Also
> > drm_gpusvm_migration_put_pages uses
> > migration pfns which definitely should not be in drm_pagemap.c.
>
> Like mentioned above, with upcoming populate_mm() moving to being a
> drm_pagemap op, and none of the low level migration helpers taking
> drm_gpusvm as an argument, I think it makes sense, but if you rather
> want to look at shuffling things around when that is in place,
> that's ok. Agree, though that anything needing struct drm_gpusvm should
> *not* be in drm_pagemap.c
>
'but if you rather want to look at shuffling things around when that is in
place'
This sounds like a reasonable plan to me - land this first and work through any
reshuffling together in the drm_pagemap.c follow-up.
Matt
> Thanks,
> Thomas
>
>
>
> >
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_migration_put_page - Put a migration page
> > > > + * @page: Pointer to the page to put
> > > > + *
> > > > + * This function unlocks and puts a page.
> > > > + */
> > > > +static void drm_gpusvm_migration_put_page(struct page *page)
> > >
> > > _unlock_put_page()?
> > >
> > > > +{
> > > > + unlock_page(page);
> > > > + put_page(page);
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_migration_put_pages - Put migration pages
> > > > + * @npages: Number of pages
> > > > + * @migrate_pfn: Array of migrate page frame numbers
> > > > + *
> > > > + * This function puts an array of pages.
> > > > + */
> > > > +static void drm_gpusvm_migration_put_pages(unsigned long npages,
> > > > + unsigned long
> > > > *migrate_pfn)
> > > > +{
> > > > + unsigned long i;
> > > > +
> > > > + for (i = 0; i < npages; ++i) {
> > > > + if (!migrate_pfn[i])
> > > > + continue;
> > > > +
> > > > + drm_gpusvm_migration_put_page(migrate_pfn_to_pag
> > > > e(mi
> > > > grate_pfn[i]));
> > > > + migrate_pfn[i] = 0;
> > > > + }
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_get_devmem_page - Get a reference to a device
> > > > memory
> > > > page
> > > > + * @page: Pointer to the page
> > > > + * @zdd: Pointer to the GPU SVM zone device data
> > > > + *
> > > > + * This function associates the given page with the specified
> > > > GPU
> > > > SVM zone
> > > > + * device data and initializes it for zone device usage.
> > > > + */
> > > > +static void drm_gpusvm_get_devmem_page(struct page *page,
> > > > + struct drm_gpusvm_zdd *zdd)
> > > > +{
> > > > + page->zone_device_data = drm_gpusvm_zdd_get(zdd);
> > > > + zone_device_page_init(page);
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_gpusvm_migrate_map_pages() - Map migration pages for GPU
> > > > SVM
> > > > migration
> > > > + * @dev: The device for which the pages are being mapped
> > > > + * @dma_addr: Array to store DMA addresses corresponding to
> > > > mapped
> > > > pages
> > > > + * @migrate_pfn: Array of migrate page frame numbers to map
> > > > + * @npages: Number of pages to map
> > > > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > > > + *
> > > > + * This function maps pages of memory for migration usage in GPU
> > > > SVM. It
> > > > + * iterates over each page frame number provided in
> > > > @migrate_pfn,
> > > > maps the
> > > > + * corresponding page, and stores the DMA address in the
> > > > provided
> > > > @dma_addr
> > > > + * array.
> > > > + *
> > > > + * Return: 0 on success, -EFAULT if an error occurs during
> > > > mapping.
> > > > + */
> > > > +static int drm_gpusvm_migrate_map_pages(struct device *dev,
> > > > + dma_addr_t *dma_addr,
> > > > + long unsigned int
> > > > *migrate_pfn,
> > > > + unsigned long npages,
> > > > + enum dma_data_direction
> > > > dir)
> > > > +{
> > > > + unsigned long i;
> > > > +
> > > > + for (i = 0; i < npages; ++i) {
> > > > + struct page *page =
> > > > migrate_pfn_to_page(migrate_pfn[i]);
> > > > +
> > > > + if (!page)
> > > > + continue;
> > > > +
> > > > + if (WARN_ON_ONCE(is_zone_device_page(page)))
> > > > + return -EFAULT;
> > > > +
> > > > + dma_addr[i] = dma_map_page(dev, page, 0,
> > > > PAGE_SIZE,
> > > > dir);
> > > > + if (dma_mapping_error(dev, dma_addr[i]))
> > > > + return -EFAULT;
> > > > + }
> > > > +
> > > > + return 0;
> > > > +}
> > >
> > > TBC'd
> > >
> >
> > Thanks for the comments!
> >
> > Matt
> >
> > > /Thomas
> > >
>
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [PATCH v2 05/29] drm/gpusvm: Add support for GPU Shared Virtual Memory
2024-11-04 23:07 ` Matthew Brost
@ 2024-11-05 10:22 ` Thomas Hellström
2024-11-05 16:12 ` Matthew Brost
0 siblings, 1 reply; 129+ messages in thread
From: Thomas Hellström @ 2024-11-05 10:22 UTC (permalink / raw)
To: Matthew Brost
Cc: intel-xe, dri-devel, apopple, airlied, christian.koenig,
simona.vetter, felix.kuehling, dakr
On Mon, 2024-11-04 at 15:07 -0800, Matthew Brost wrote:
> > We
> > have
> > https://elixir.bootlin.com/linux/v6.12-rc6/source/include/linux/int
> > erval_tree_generic.h#L24
> >
> > to relate to. Now GPUVM can't use the generic version since it
> > needs
> > u64 intervals. These trees need unsigned long only so it should be
> > ok.
> > And safe removal, isn't that possible to implement without the
> > list?
> > Then it's really only the linked list as a perf optimization I
> > guess,
> > but we have a lot of those pending...
> >
>
> See my other comments. Let me just follow on using a maple tree and
> perhaps a
> list isn't required if we use that. Will have definite answer in my
> next rev.
Note, though, that IIRC maple trees do not allow overlapping ranges,
and if we need to support multiple SVM VMAs with different offsets,
like Christian suggests, we will likely have overlapping ranges for the
range tree but not for the notifier tree.
Thinking a bit more about this, my concern is mostly around needlessly
instantiating new interval trees instead of using the generic
instantiation, because that is clearly against recommended practice.
But the list could probably be added anyway if needed, and it does
indeed AFAICT reduce the traversal complexity from O(N ln N) to O(N).
/Thomas
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [PATCH v2 05/29] drm/gpusvm: Add support for GPU Shared Virtual Memory
2024-11-05 10:22 ` Thomas Hellström
@ 2024-11-05 16:12 ` Matthew Brost
2024-11-05 16:28 ` Thomas Hellström
0 siblings, 1 reply; 129+ messages in thread
From: Matthew Brost @ 2024-11-05 16:12 UTC (permalink / raw)
To: Thomas Hellström
Cc: intel-xe, dri-devel, apopple, airlied, christian.koenig,
simona.vetter, felix.kuehling, dakr
On Tue, Nov 05, 2024 at 11:22:12AM +0100, Thomas Hellström wrote:
> On Mon, 2024-11-04 at 15:07 -0800, Matthew Brost wrote:
> > > We
> > > have
> > > https://elixir.bootlin.com/linux/v6.12-rc6/source/include/linux/int
> > > erval_tree_generic.h#L24
> > >
> > > to relate to. Now GPUVM can't use the generic version since it
> > > needs
> > > u64 intervals. These trees need unsigned long only so it should be
> > > ok.
> > > And safe removal, isn't that possible to implement without the
> > > list?
> > > Then it's really only the linked list as a perf optimization I
> > > guess,
> > > but we have a lot of those pending...
> > >
> >
> > See my other comments. Let me just follow on using a maple tree and
> > perhaps a
> > list isn't required if we use that. Will have definite answer in my
> > next rev.
>
> Note, though, that IIRC maple trees do not allow overlapping ranges,
> and If we need to support multiple svm VMAs with different offsets,
> like Christian suggests, we will likely have overlapping ranges for the
> range tree but not for the notifier tree.
>
I don't think that is how overlapping ranges would look though. We'd
have multiple GPU VMAs / GPU PTEs pointing to the same SVM range. The
SVM ranges speak in the CPU address space - we'd attach multiple GPU
VMAs to the SVM so in the notifier we can find all the GPU pages to
invalidate. At least I think it would look this way - can cross that
bridge if / when we get to it though.
> Thinking a bit more about this, my concern is mostly around needlessly
> instantiating new interval trees instead of using the generic
> instantiation, because that is clearly against recommended practice.
>
Ok, so with this statement then I think both the interval trees in GPU
VM / xe_range_fence are going against the recommended practice too?
> But the list could probably be added anyway if needed, and it does
> indeed AFAICT reduce the traversal complexity from O(N ln N) to O(N).
>
I think this will show in the notifier. We currently walk the notifier's
range tree twice - once to do the invalidate, once to unmap the pages /
add to the garbage collector. I even optimize this so the 2nd walk
doesn't have to lookup first range again making the complexity O(ln N +
2 * N) vs (2 * N * ln N) without a list.
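
To make that walk concrete, the invalidate path does roughly this (simplified
sketch; driver_zap_range() is a placeholder, and the iterator name is assumed
to match the _safe variant in this patch):

	const struct drm_gpusvm_ctx ctx = { .in_notifier = true, };

	/* walk 1: zap GPU PTEs for every range overlapping the notifier range */
	drm_gpusvm_for_each_range(range, notifier, start, end)
		driver_zap_range(range);

	/* walk 2: unmap dma mappings + queue ranges for the garbage collector */
	drm_gpusvm_for_each_range(range, notifier, start, end)
		drm_gpusvm_range_unmap_pages(gpusvm, range, &ctx);

With the list, each iteration step above is O(1); iterating purely via the
interval tree would make each step an O(log N) lookup, hence the numbers above.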
Matt
> /Thomas
>
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [PATCH v2 05/29] drm/gpusvm: Add support for GPU Shared Virtual Memory
2024-11-05 16:12 ` Matthew Brost
@ 2024-11-05 16:28 ` Thomas Hellström
0 siblings, 0 replies; 129+ messages in thread
From: Thomas Hellström @ 2024-11-05 16:28 UTC (permalink / raw)
To: Matthew Brost
Cc: intel-xe, dri-devel, apopple, airlied, christian.koenig,
simona.vetter, felix.kuehling, dakr
On Tue, 2024-11-05 at 08:12 -0800, Matthew Brost wrote:
> On Tue, Nov 05, 2024 at 11:22:12AM +0100, Thomas Hellström wrote:
> > On Mon, 2024-11-04 at 15:07 -0800, Matthew Brost wrote:
> > > > We
> > > > have
> > > > https://elixir.bootlin.com/linux/v6.12-rc6/source/include/linux/int
> > > > erval_tree_generic.h#L24
> > > >
> > > > to relate to. Now GPUVM can't use the generic version since it
> > > > needs
> > > > u64 intervals. These trees need unsigned long only so it should
> > > > be
> > > > ok.
> > > > And safe removal, isn't that possible to implement without the
> > > > list?
> > > > Then it's really only the linked list as a perf optimization I
> > > > guess,
> > > > but we have a lot of those pending...
> > > >
> > >
> > > See my other comments. Let me just follow on using a maple tree
> > > and
> > > perhaps a
> > > list isn't required if we use that. Will have definite answer in
> > > my
> > > next rev.
> >
> > Note, though, that IIRC maple trees do not allow overlapping
> > ranges,
> > and If we need to support multiple svm VMAs with different offsets,
> > like Christian suggests, we will likely have overlapping ranges for
> > the
> > range tree but not for the notifier tree.
> >
>
> I don't think that is how overlapping ranges would look though. We'd
> have multiple GPU VMAs / GPU ptes to pointing the same SVM range. The
> SVM ranges speak in the CPU address space - we'd attach multiple GPU
> VMAs to the SVM so in the notifier we can find all the GPU pages to
> invalidate. At least I think it would look this way - can cross that
> bridge if / when we get to it though.
>
> > Thinking a bit more about this, my concern is mostly around
> > needlessly
> > instantiating new interval trees instead of using the generic
> > instantiation, because that is clearly against recommended
> > practice.
> >
>
> Ok, so with this statement then I think both the interval trees in
> GPU
> VM / xe_range_fence are going again the recommended practice too?
No, they work in GPU virtual address space with u64 integers, whereas
these work in CPU virtual address space with unsigned long, which is also
the type used for the generic instantiation. (I think maple trees also
use unsigned long). At least for the range fences that was the
motivation for a separate instantiation. Not sure what the reasoning
was with gpuvm.
/Thomas
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [PATCH v2 05/29] drm/gpusvm: Add support for GPU Shared Virtual Memory
2024-10-16 3:24 ` [PATCH v2 05/29] drm/gpusvm: Add support for GPU Shared Virtual Memory Matthew Brost
2024-10-31 18:58 ` Thomas Hellström
2024-11-04 15:25 ` Thomas Hellström
@ 2024-11-05 14:48 ` Thomas Hellström
2024-11-05 16:32 ` Matthew Brost
2024-11-20 3:00 ` Gwan-gyeong Mun
2024-11-29 0:00 ` Alistair Popple
4 siblings, 1 reply; 129+ messages in thread
From: Thomas Hellström @ 2024-11-05 14:48 UTC (permalink / raw)
To: Matthew Brost, intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, simona.vetter, felix.kuehling,
dakr
On Tue, 2024-10-15 at 20:24 -0700, Matthew Brost wrote:
Continued review:
> +/**
> + * drm_gpusvm_migrate_unmap_pages() - Unmap pages previously mapped
> for GPU SVM migration
> + * @dev: The device for which the pages were mapped
> + * @dma_addr: Array of DMA addresses corresponding to mapped pages
> + * @npages: Number of pages to unmap
> + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> + *
> + * This function unmaps previously mapped pages of memory for GPU
> Shared Virtual
> + * Memory (SVM). It iterates over each DMA address provided in
> @dma_addr, checks
> + * if it's valid and not already unmapped, and unmaps the
> corresponding page.
> + */
> +static void drm_gpusvm_migrate_unmap_pages(struct device *dev,
> + dma_addr_t *dma_addr,
> + unsigned long npages,
> + enum dma_data_direction
> dir)
> +{
> + unsigned long i;
> +
> + for (i = 0; i < npages; ++i) {
> + if (!dma_addr[i] || dma_mapping_error(dev,
> dma_addr[i]))
> + continue;
> +
> + dma_unmap_page(dev, dma_addr[i], PAGE_SIZE, dir);
> + }
> +}
> +
> +/**
> + * drm_gpusvm_migrate_to_devmem - Migrate GPU SVM range to device
> memory
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + * @devmem_allocation: Pointer to the device memory allocation. The
> caller
> + * should hold a reference to the device memory
> allocation,
> + * which should be dropped via ops-
> >devmem_release or upon
> + * the failure of this function.
> + * @ctx: GPU SVM context
> + *
> + * This function migrates the specified GPU SVM range to device
> memory. It performs the
> + * necessary setup and invokes the driver-specific operations for
> migration to
> + * device memory. Upon successful return, @devmem_allocation can
> safely reference @range
> + * until ops->devmem_release is called, which only happens upon a
> + * successful return.
> + *
> + * Returns:
> + * 0 on success, negative error code on failure.
> + */
> +int drm_gpusvm_migrate_to_devmem(struct drm_gpusvm *gpusvm,
> + struct drm_gpusvm_range *range,
> + struct drm_gpusvm_devmem
> *devmem_allocation,
> + const struct drm_gpusvm_ctx *ctx)
> +{
> + const struct drm_gpusvm_devmem_ops *ops = devmem_allocation-
> >ops;
> + u64 start = range->va.start, end = range->va.end;
> + struct migrate_vma migrate = {
> + .start = start,
> + .end = end,
> + .pgmap_owner = gpusvm->device_private_page_owner,
> + .flags = MIGRATE_VMA_SELECT_SYSTEM,
> + };
> + struct mm_struct *mm = gpusvm->mm;
> + unsigned long i, npages = npages_in_range(start, end);
> + struct vm_area_struct *vas;
> + struct drm_gpusvm_zdd *zdd = NULL;
> + struct page **pages;
> + dma_addr_t *dma_addr;
> + void *buf;
> + int err;
> +
> + if (!range->flags.migrate_devmem)
> + return -EINVAL;
> +
> + if (!ops->populate_devmem_pfn || !ops->copy_to_devmem ||
> !ops->copy_to_ram)
> + return -EOPNOTSUPP;
> +
> + if (!mmget_not_zero(mm)) {
> + err = -EFAULT;
> + goto err_out;
> + }
> + mmap_read_lock(mm);
> +
> + vas = vma_lookup(mm, start);
> + if (!vas) {
> + err = -ENOENT;
> + goto err_mmunlock;
> + }
> +
> + if (end > vas->vm_end || start < vas->vm_start) {
> + err = -EINVAL;
> + goto err_mmunlock;
> + }
> +
> + if (!vma_is_anonymous(vas)) {
> + err = -EBUSY;
> + goto err_mmunlock;
> + }
> +
> + buf = kvcalloc(npages, 2 * sizeof(*migrate.src) +
> sizeof(*dma_addr) +
> + sizeof(*pages), GFP_KERNEL);
> + if (!buf) {
> + err = -ENOMEM;
> + goto err_mmunlock;
> + }
> + dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> + pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr))
> * npages;
> +
> + zdd = drm_gpusvm_zdd_alloc(gpusvm-
> >device_private_page_owner);
> + if (!zdd) {
> + err = -ENOMEM;
> + goto err_free;
> + }
> +
> + migrate.vma = vas;
> + migrate.src = buf;
> + migrate.dst = migrate.src + npages;
> +
> + err = migrate_vma_setup(&migrate);
> + if (err)
> + goto err_free;
> +
> + /*
> + * FIXME: Below cases, !migrate.cpages and migrate.cpages !=
> npages, not
> + * always an error. Need to revisit possible cases and how
> to handle. We
> + * could prefault on migrate.cpages != npages via
> hmm_range_fault.
> + */
> +
> + if (!migrate.cpages) {
> + err = -EFAULT;
> + goto err_free;
> + }
> +
> + if (migrate.cpages != npages) {
> + err = -EBUSY;
> + goto err_finalize;
> + }
> +
> + err = ops->populate_devmem_pfn(devmem_allocation, npages,
> migrate.dst);
> + if (err)
> + goto err_finalize;
> +
> + err = drm_gpusvm_migrate_map_pages(devmem_allocation->dev,
> dma_addr,
> + migrate.src, npages,
> DMA_TO_DEVICE);
> + if (err)
> + goto err_finalize;
> +
> + for (i = 0; i < npages; ++i) {
> + struct page *page = pfn_to_page(migrate.dst[i]);
> +
> + pages[i] = page;
> + migrate.dst[i] = migrate_pfn(migrate.dst[i]);
> + drm_gpusvm_get_devmem_page(page, zdd);
> + }
> +
> + err = ops->copy_to_devmem(pages, dma_addr, npages);
> + if (err)
> + goto err_finalize;
> +
> + /* Upon success bind devmem allocation to range and zdd */
> + WRITE_ONCE(zdd->devmem_allocation, devmem_allocation); /*
> Owns ref */
> +
> +err_finalize:
> + if (err)
> + drm_gpusvm_migration_put_pages(npages, migrate.dst);
> + migrate_vma_pages(&migrate);
> + migrate_vma_finalize(&migrate);
> + drm_gpusvm_migrate_unmap_pages(devmem_allocation->dev,
> dma_addr, npages,
> + DMA_TO_DEVICE);
> +err_free:
> + if (zdd)
> + drm_gpusvm_zdd_put(zdd);
> + kvfree(buf);
> +err_mmunlock:
> + mmap_read_unlock(mm);
> + mmput(mm);
> +err_out:
> + return err;
> +}
> +
> +/**
> + * drm_gpusvm_migrate_populate_ram_pfn - Populate RAM PFNs for a VM
> area
> + * @vas: Pointer to the VM area structure, can be NULL
> + * @npages: Number of pages to populate
> + * @mpages: Number of pages to migrate
> + * @src_mpfn: Source array of migrate PFNs
> + * @mpfn: Array of migrate PFNs to populate
> + * @addr: Start address for PFN allocation
> + *
> + * This function populates the RAM migrate page frame numbers (PFNs)
> for the
> + * specified VM area structure. It allocates and locks pages in the
> VM area for
> + * RAM usage. If vas is non-NULL use alloc_page_vma for allocation,
> if NULL use
> + * alloc_page for allocation.
> + *
> + * Returns:
> + * 0 on success, negative error code on failure.
> + */
> +static int drm_gpusvm_migrate_populate_ram_pfn(struct vm_area_struct
> *vas,
> + unsigned long npages,
> + unsigned long
> *mpages,
> + unsigned long
> *src_mpfn,
> + unsigned long *mpfn,
> u64 addr)
> +{
> + unsigned long i;
> +
> + for (i = 0; i < npages; ++i, addr += PAGE_SIZE) {
> + struct page *page;
> +
> + if (!(src_mpfn[i] & MIGRATE_PFN_MIGRATE))
> + continue;
> +
> + if (vas)
> + page = alloc_page_vma(GFP_HIGHUSER, vas,
> addr);
> + else
> + page = alloc_page(GFP_HIGHUSER);
> +
> + if (!page)
> + return -ENOMEM;
> +
> + lock_page(page);
Allocating under page locks seems a bit scary, but OTOH we're
recursively taking page locks as well. Perhaps add a comment on why this
is allowed.
Allocating and then trylocking (with asserts) as separate steps would
otherwise guarantee that we don't hit a deadlock without noticing, but
the way it's currently coded seems to be common practice.
> + mpfn[i] = migrate_pfn(page_to_pfn(page));
> + ++*mpages;
> + }
> +
> + return 0;
> +}
> +
> +/**
> + * drm_gpusvm_evict_to_ram - Evict GPU SVM range to RAM
> + * @devmem_allocation: Pointer to the device memory allocation
> + *
> + * Similar to __drm_gpusvm_migrate_to_ram but does not require mmap
> lock and
> + * migration done via migrate_device_* functions.
> + *
> + * Returns:
> + * 0 on success, negative error code on failure.
> + */
> +int drm_gpusvm_evict_to_ram(struct drm_gpusvm_devmem
> *devmem_allocation)
> +{
> + const struct drm_gpusvm_devmem_ops *ops = devmem_allocation-
> >ops;
> + unsigned long npages, mpages = 0;
> + struct page **pages;
> + unsigned long *src, *dst;
> + dma_addr_t *dma_addr;
> + void *buf;
> + int i, err = 0;
> +
> + npages = devmem_allocation->size >> PAGE_SHIFT;
> +
> +retry:
> + if (!mmget_not_zero(devmem_allocation->mm))
> + return -EFAULT;
> +
> + buf = kvcalloc(npages, 2 * sizeof(*src) + sizeof(*dma_addr)
> +
> + sizeof(*pages), GFP_KERNEL);
> + if (!buf) {
> + err = -ENOMEM;
> + goto err_out;
> + }
> + src = buf;
> + dst = buf + (sizeof(*src) * npages);
> + dma_addr = buf + (2 * sizeof(*src) * npages);
> + pages = buf + (2 * sizeof(*src) + sizeof(*dma_addr)) *
> npages;
> +
> + err = ops->populate_devmem_pfn(devmem_allocation, npages,
> src);
> + if (err)
> + goto err_free;
> +
> + err = migrate_device_prepopulated_range(src, npages);
> + if (err)
> + goto err_free;
> +
> + err = drm_gpusvm_migrate_populate_ram_pfn(NULL, npages,
> &mpages, src,
> + dst, 0);
> + if (err || !mpages)
> + goto err_finalize;
> +
> + err = drm_gpusvm_migrate_map_pages(devmem_allocation->dev,
> dma_addr,
> + dst, npages,
> DMA_FROM_DEVICE);
> + if (err)
> + goto err_finalize;
> +
> + for (i = 0; i < npages; ++i)
> + pages[i] = migrate_pfn_to_page(src[i]);
> +
> + err = ops->copy_to_ram(pages, dma_addr, npages);
> + if (err)
> + goto err_finalize;
> +
> +err_finalize:
> + if (err)
> + drm_gpusvm_migration_put_pages(npages, dst);
> + migrate_device_pages(src, dst, npages);
> + migrate_device_finalize(src, dst, npages);
> + drm_gpusvm_migrate_unmap_pages(devmem_allocation->dev,
> dma_addr, npages,
> + DMA_FROM_DEVICE);
> +err_free:
> + kvfree(buf);
> +err_out:
> + mmput_async(devmem_allocation->mm);
> + if (!err && !READ_ONCE(devmem_allocation->detached)) {
> + cond_resched();
> + goto retry;
> + }
> +
> + return err;
> +}
> +
> +/**
> + * __drm_gpusvm_migrate_to_ram - Migrate GPU SVM range to RAM
> (internal)
> + * @vas: Pointer to the VM area structure
> + * @device_private_page_owner: Device private pages owner
> + * @page: Pointer to the page for fault handling (can be NULL)
> + * @fault_addr: Fault address
> + * @size: Size of migration
> + *
> + * This internal function performs the migration of the specified
> GPU SVM range
> + * to RAM. It sets up the migration, populates + dma maps RAM PFNs,
> and
> + * invokes the driver-specific operations for migration to RAM.
> + *
> + * Returns:
> + * 0 on success, negative error code on failure.
> + */
> +static int __drm_gpusvm_migrate_to_ram(struct vm_area_struct *vas,
> + void
> *device_private_page_owner,
> + struct page *page, u64
> fault_addr,
> + u64 size)
> +{
> + struct migrate_vma migrate = {
> + .vma = vas,
> + .pgmap_owner = device_private_page_owner,
> + .flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE
> |
> + MIGRATE_VMA_SELECT_DEVICE_COHERENT,
> + .fault_page = page,
> + };
> + struct drm_gpusvm_zdd *zdd;
> + const struct drm_gpusvm_devmem_ops *ops;
> + struct device *dev;
> + unsigned long npages, mpages = 0;
> + struct page **pages;
> + dma_addr_t *dma_addr;
> + u64 start, end;
> + void *buf;
> + int i, err = 0;
> +
> + start = ALIGN_DOWN(fault_addr, size);
> + end = ALIGN(fault_addr + 1, size);
> +
> + /* Corner where VMA area struct has been partially unmapped
> */
> + if (start < vas->vm_start)
> + start = vas->vm_start;
> + if (end > vas->vm_end)
> + end = vas->vm_end;
> +
> + migrate.start = start;
> + migrate.end = end;
> + npages = npages_in_range(start, end);
> +
> + buf = kvcalloc(npages, 2 * sizeof(*migrate.src) +
> sizeof(*dma_addr) +
> + sizeof(*pages), GFP_KERNEL);
> + if (!buf) {
> + err = -ENOMEM;
> + goto err_out;
> + }
> + dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> + pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr))
> * npages;
> +
> + migrate.vma = vas;
> + migrate.src = buf;
> + migrate.dst = migrate.src + npages;
> +
> + err = migrate_vma_setup(&migrate);
> + if (err)
> + goto err_free;
> +
> + /* Raced with another CPU fault, nothing to do */
> + if (!migrate.cpages)
> + goto err_free;
> +
> + if (!page) {
> + for (i = 0; i < npages; ++i) {
> + if (!(migrate.src[i] & MIGRATE_PFN_MIGRATE))
> + continue;
> +
> + page = migrate_pfn_to_page(migrate.src[i]);
> + break;
> + }
> +
> + if (!page)
> + goto err_finalize;
> + }
> + zdd = page->zone_device_data;
> + ops = zdd->devmem_allocation->ops;
> + dev = zdd->devmem_allocation->dev;
> +
> + err = drm_gpusvm_migrate_populate_ram_pfn(vas, npages,
> &mpages,
> + migrate.src,
> migrate.dst,
> + start);
> + if (err)
> + goto err_finalize;
> +
> + err = drm_gpusvm_migrate_map_pages(dev, dma_addr,
> migrate.dst, npages,
> + DMA_FROM_DEVICE);
> + if (err)
> + goto err_finalize;
> +
> + for (i = 0; i < npages; ++i)
> + pages[i] = migrate_pfn_to_page(migrate.src[i]);
> +
> + err = ops->copy_to_ram(pages, dma_addr, npages);
> + if (err)
> + goto err_finalize;
> +
> +err_finalize:
> + if (err)
> + drm_gpusvm_migration_put_pages(npages, migrate.dst);
> + migrate_vma_pages(&migrate);
> + migrate_vma_finalize(&migrate);
> + drm_gpusvm_migrate_unmap_pages(dev, dma_addr, npages,
> + DMA_FROM_DEVICE);
> +err_free:
> + kvfree(buf);
> +err_out:
> +
> + return err;
> +}
> +
> +/**
> + * drm_gpusvm_range_evict - Evict GPU SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range to be removed
> + *
> + * This function evicts the specified GPU SVM range.
> + */
> +void drm_gpusvm_range_evict(struct drm_gpusvm *gpusvm,
> + struct drm_gpusvm_range *range)
Although we discussed this one before. Ideally I think we'd want to be
able to migrate also other devices' pages. But need to consider device-
coherent pages.
> +{
> + struct mm_struct *mm = gpusvm->mm;
> + struct vm_area_struct *vas;
> +
> + if (!mmget_not_zero(mm))
> + return;
> +
> + mmap_read_lock(mm);
> + vas = vma_lookup(mm, range->va.start);
> + if (!vas)
> + goto unlock;
> +
> + __drm_gpusvm_migrate_to_ram(vas, gpusvm-
> >device_private_page_owner,
> + NULL, range->va.start,
> + range->va.end - range-
> >va.start);
> +unlock:
> + mmap_read_unlock(mm);
> + mmput(mm);
> +}
> +
> +/**
> + * drm_gpusvm_page_free - Put GPU SVM zone device data associated
> with a page
> + * @page: Pointer to the page
> + *
> + * This function is a callback used to put the GPU SVM zone device
> data
> + * associated with a page when it is being released.
> + */
> +static void drm_gpusvm_page_free(struct page *page)
> +{
> + drm_gpusvm_zdd_put(page->zone_device_data);
> +}
> +
> +/**
> + * drm_gpusvm_migrate_to_ram - Migrate GPU SVM range to RAM (page
> fault handler)
> + * @vmf: Pointer to the fault information structure
> + *
> + * This function is a page fault handler used to migrate a GPU SVM
> range to RAM.
> + * It retrieves the GPU SVM range information from the faulting page
> and invokes
> + * the internal migration function to migrate the range back to RAM.
> + *
> + * Returns:
> + * VM_FAULT_SIGBUS on failure, 0 on success.
> + */
> +static vm_fault_t drm_gpusvm_migrate_to_ram(struct vm_fault *vmf)
> +{
> + struct drm_gpusvm_zdd *zdd = vmf->page->zone_device_data;
> + int err;
... and this one only the pages belonging to the current pagemap.
> +
> + err = __drm_gpusvm_migrate_to_ram(vmf->vma,
> + zdd-
> >device_private_page_owner,
> + vmf->page, vmf->address,
> + zdd->devmem_allocation-
> >size);
> +
> + return err ? VM_FAULT_SIGBUS : 0;
> +}
> +
> +/**
> + * drm_gpusvm_pagemap_ops - Device page map operations for GPU SVM
> + */
> +static const struct dev_pagemap_ops drm_gpusvm_pagemap_ops = {
> + .page_free = drm_gpusvm_page_free,
> + .migrate_to_ram = drm_gpusvm_migrate_to_ram,
> +};
> +
> +/**
> + * drm_gpusvm_pagemap_ops_get - Retrieve GPU SVM device page map
> operations
> + *
> + * Returns:
> + * Pointer to the GPU SVM device page map operations structure.
> + */
> +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void)
> +{
> + return &drm_gpusvm_pagemap_ops;
> +}
> +
> +/**
> + * drm_gpusvm_has_mapping - Check if GPU SVM has mapping for the
> given address range
> + * @gpusvm: Pointer to the GPU SVM structure.
> + * @start: Start address
> + * @end: End address
> + *
> + * Returns:
> + * True if GPU SVM has mapping, False otherwise
> + */
> +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start,
> u64 end)
> +{
> + struct drm_gpusvm_notifier *notifier;
> +
> + drm_gpusvm_for_each_notifier(notifier, gpusvm, start, end) {
> + struct drm_gpusvm_range *range = NULL;
> +
> + drm_gpusvm_for_each_range(range, notifier, start,
> end)
> + return true;
> + }
> +
> + return false;
> +}
> diff --git a/drivers/gpu/drm/xe/drm_gpusvm.h
> b/drivers/gpu/drm/xe/drm_gpusvm.h
> new file mode 100644
> index 000000000000..15ec22d4f9a5
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/drm_gpusvm.h
> @@ -0,0 +1,447 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2024 Intel Corporation
> + */
> +
> +#ifndef __DRM_GPUSVM_H__
> +#define __DRM_GPUSVM_H__
> +
> +#include <linux/kref.h>
> +#include <linux/mmu_notifier.h>
> +#include <linux/workqueue.h>
> +
> +struct dev_pagemap_ops;
> +struct drm_device;
> +struct drm_gpusvm;
> +struct drm_gpusvm_notifier;
> +struct drm_gpusvm_ops;
> +struct drm_gpusvm_range;
> +struct drm_gpusvm_devmem;
> +struct drm_pagemap;
> +struct drm_pagemap_dma_addr;
> +
> +/**
> + * struct drm_gpusvm_devmem_ops - Operations structure for GPU SVM
> device memory
> + *
> + * This structure defines the operations for GPU Shared Virtual
> Memory (SVM)
> + * device memory. These operations are provided by the GPU driver to
> manage device memory
> + * allocations and perform operations such as migration between
> device memory and system
> + * RAM.
> + */
> +struct drm_gpusvm_devmem_ops {
> + /**
> + * @devmem_release: Release device memory allocation
> (optional)
> + * @devmem_allocation: device memory allocation
> + *
> + * This function shall release device memory allocation and
> expects to drop a
NIT: Consider "Release device memory..." rather than "This function
shall release..." (general comment).
> + * reference to device memory allocation.
> + */
> + void (*devmem_release)(struct drm_gpusvm_devmem
> *devmem_allocation);
> +
> + /**
> + * @populate_devmem_pfn: Populate device memory PFN
> (required for migration)
> + * @devmem_allocation: device memory allocation
> + * @npages: Number of pages to populate
> + * @pfn: Array of page frame numbers to populate
> + *
> + * This function shall populate device memory page frame
> numbers (PFN).
> + *
> + * Returns:
> + * 0 on success, a negative error code on failure.
> + */
> + int (*populate_devmem_pfn)(struct drm_gpusvm_devmem
> *devmem_allocation,
> + unsigned long npages, unsigned long
> *pfn);
> +
> + /**
> + * @copy_to_devmem: Copy to device memory (required for
> migration)
> + * @pages: Pointer to array of device memory pages
> (destination)
> + * @dma_addr: Pointer to array of DMA addresses (source)
> + * @npages: Number of pages to copy
> + *
> + * This function shall copy pages to device memory.
> + *
> + * Returns:
> + * 0 on success, a negative error code on failure.
> + */
> + int (*copy_to_devmem)(struct page **pages,
> + dma_addr_t *dma_addr,
> + unsigned long npages);
> +
> + /**
> + * @copy_to_ram: Copy to system RAM (required for migration)
> + * @pages: Pointer to array of device memory pages (source)
> + * @dma_addr: Pointer to array of DMA addresses
> (destination)
> + * @npages: Number of pages to copy
> + *
> + * This function shall copy pages to system RAM.
> + *
> + * Returns:
> + * 0 on success, a negative error code on failure.
> + */
> + int (*copy_to_ram)(struct page **pages,
> + dma_addr_t *dma_addr,
> + unsigned long npages);
> +};
> +
> +/**
> + * struct drm_gpusvm_devmem - Structure representing a GPU SVM
> device memory allocation
> + *
> + * @dev: Pointer to the device structure which device memory
> allocation belongs to
> + * @mm: Pointer to the mm_struct for the address space
> + * @ops: Pointer to the operations structure for GPU SVM device
> memory
> + * @dpagemap: The struct drm_pagemap of the pages this allocation
> belongs to.
> + * @size: Size of device memory allocation
> + * @detached: device memory allocations is detached from device
> pages
> + */
> +struct drm_gpusvm_devmem {
> + struct device *dev;
> + struct mm_struct *mm;
> + const struct drm_gpusvm_devmem_ops *ops;
> + struct drm_pagemap *dpagemap;
> + size_t size;
> + bool detached;
> +};
> +
> +/**
> + * struct drm_gpusvm_ops - Operations structure for GPU SVM
> + *
> + * This structure defines the operations for GPU Shared Virtual
> Memory (SVM).
> + * These operations are provided by the GPU driver to manage SVM
> ranges and
> + * notifiers.
> + */
> +struct drm_gpusvm_ops {
> + /**
> + * @notifier_alloc: Allocate a GPU SVM notifier (optional)
> + *
> + * This function shall allocate a GPU SVM notifier.
> + *
> + * Returns:
> + * Pointer to the allocated GPU SVM notifier on success,
> NULL on failure.
> + */
> + struct drm_gpusvm_notifier *(*notifier_alloc)(void);
> +
> + /**
> + * @notifier_free: Free a GPU SVM notifier (optional)
> + * @notifier: Pointer to the GPU SVM notifier to be freed
> + *
> + * This function shall free a GPU SVM notifier.
> + */
> + void (*notifier_free)(struct drm_gpusvm_notifier *notifier);
> +
> + /**
> + * @range_alloc: Allocate a GPU SVM range (optional)
> + * @gpusvm: Pointer to the GPU SVM
> + *
> + * This function shall allocate a GPU SVM range.
> + *
> + * Returns:
> + * Pointer to the allocated GPU SVM range on success, NULL
> on failure.
> + */
> + struct drm_gpusvm_range *(*range_alloc)(struct drm_gpusvm
> *gpusvm);
> +
> + /**
> + * @range_free: Free a GPU SVM range (optional)
> + * @range: Pointer to the GPU SVM range to be freed
> + *
> + * This function shall free a GPU SVM range.
> + */
> + void (*range_free)(struct drm_gpusvm_range *range);
> +
> + /**
> + * @invalidate: Invalidate GPU SVM notifier (required)
> + * @gpusvm: Pointer to the GPU SVM
> + * @notifier: Pointer to the GPU SVM notifier
> + * @mmu_range: Pointer to the mmu_notifier_range structure
> + *
> + * This function shall invalidate the GPU page tables. It
> can safely
> + * walk the notifier range RB tree/list in this function.
> Called while
> + * holding the notifier lock.
> + */
> + void (*invalidate)(struct drm_gpusvm *gpusvm,
> + struct drm_gpusvm_notifier *notifier,
> + const struct mmu_notifier_range
> *mmu_range);
> +};
> +
> +/**
> + * struct drm_gpusvm_notifier - Structure representing a GPU SVM
> notifier
> + *
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: MMU interval notifier
> + * @interval: Interval for the notifier
> + * @rb: Red-black tree node for the parent GPU SVM structure
> notifier tree
> + * @root: Cached root node of the RB tree containing ranges
> + * @range_list: List head containing of ranges in the same order
> they appear in
> + * interval tree. This is useful to keep iterating
> ranges while
> + * doing modifications to RB tree.
> + * @flags.removed: Flag indicating whether the MMU interval notifier
> has been
> + * removed
Please also document the nested fields.
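For example, something along these lines (just a sketch, adjust the wording
as you see fit):
 * @interval.start: Start address of the notifier's interval
 * @interval.end: End address of the notifier's interval
 * @rb.node: RB tree node for the parent GPU SVM structure's notifier tree
 * @rb.entry: List entry to keep notifiers iterable while modifying the RB tree
 * @rb.__subtree_last: Interval tree bookkeeping (last address in this subtree)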
> + *
> + * This structure represents a GPU SVM notifier.
> + */
> +struct drm_gpusvm_notifier {
> + struct drm_gpusvm *gpusvm;
> + struct mmu_interval_notifier notifier;
> + struct {
> + u64 start;
> + u64 end;
> + } interval;
> + struct {
> + struct rb_node node;
> + struct list_head entry;
> + u64 __subtree_last;
> + } rb;
> + struct rb_root_cached root;
> + struct list_head range_list;
> + struct {
> + u32 removed : 1;
> + } flags;
> +};
> +
> +/**
> + * struct drm_gpusvm_range - Structure representing a GPU SVM range
> + *
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: Pointer to the GPU SVM notifier
> + * @refcount: Reference count for the range
> + * @rb: Red-black tree node for the parent GPU SVM notifier
> structure range tree
> + * @va: Virtual address range
> + * @notifier_seq: Notifier sequence number of the range's pages
> + * @dma_addr: DMA address array
> + * @dpagemap: The struct drm_pagemap of the device pages we're dma-
> mapping.
> + * Note this is assuming only one drm_pagemap per range is allowed.
> + * @flags.migrate_devmem: Flag indicating whether the range can be
> migrated to device memory
> + * @flags.unmapped: Flag indicating if the range has been unmapped
> + * @flags.partial_unmap: Flag indicating if the range has been
> partially unmapped
> + * @flags.has_devmem_pages: Flag indicating if the range has devmem
> pages
> + * @flags.has_dma_mapping: Flag indicating if the range has a DMA
> mapping
> + *
> + * This structure represents a GPU SVM range used for tracking
> memory ranges
> + * mapped in a DRM device.
> + */
Also here.
> +struct drm_gpusvm_range {
> + struct drm_gpusvm *gpusvm;
> + struct drm_gpusvm_notifier *notifier;
> + struct kref refcount;
> + struct {
> + struct rb_node node;
> + struct list_head entry;
> + u64 __subtree_last;
> + } rb;
> + struct {
> + u64 start;
> + u64 end;
> + } va;
> + unsigned long notifier_seq;
> + struct drm_pagemap_dma_addr *dma_addr;
> + struct drm_pagemap *dpagemap;
> + struct {
> + /* All flags below must be set upon creation */
> + u16 migrate_devmem : 1;
> + /* All flags below must be set / cleared under
> notifier lock */
> + u16 unmapped : 1;
> + u16 partial_unmap : 1;
> + u16 has_devmem_pages : 1;
> + u16 has_dma_mapping : 1;
> + } flags;
> +};
> +
> +/**
> + * struct drm_gpusvm - GPU SVM structure
> + *
> + * @name: Name of the GPU SVM
> + * @drm: Pointer to the DRM device structure
> + * @mm: Pointer to the mm_struct for the address space
> + * @device_private_page_owner: Device private pages owner
> + * @mm_start: Start address of GPU SVM
> + * @mm_range: Range of the GPU SVM
> + * @notifier_size: Size of individual notifiers
> + * @ops: Pointer to the operations structure for GPU SVM
> + * @chunk_sizes: Pointer to the array of chunk sizes used in range
> allocation.
> + * Entries should be powers of 2 in descending order.
> + * @num_chunks: Number of chunks
> + * @notifier_lock: Read-write semaphore for protecting notifier
> operations
> + * @root: Cached root node of the Red-Black tree containing GPU SVM
> notifiers
> + * @notifier_list: list head containing of notifiers in the same
> order they
> + * appear in interval tree. This is useful to keep
> iterating
> + * notifiers while doing modifications to RB tree.
> + *
> + * This structure represents a GPU SVM (Shared Virtual Memory) used
> for tracking
> + * memory ranges mapped in a DRM (Direct Rendering Manager) device.
> + *
> + * No reference counting is provided, as this is expected to be
> embedded in the
> + * driver VM structure along with the struct drm_gpuvm, which
> handles reference
> + * counting.
> + */
> +struct drm_gpusvm {
> + const char *name;
> + struct drm_device *drm;
> + struct mm_struct *mm;
> + void *device_private_page_owner;
> + u64 mm_start;
> + u64 mm_range;
Possibly consider using unsigned long for cpu virtual addresses, since
that's typically done elsewhere.
GPU virtual addresses should remain 64-bit, though.
> + u64 notifier_size;
> + const struct drm_gpusvm_ops *ops;
> + const u64 *chunk_sizes;
> + int num_chunks;
> + struct rw_semaphore notifier_lock;
> + struct rb_root_cached root;
> + struct list_head notifier_list;
> +};
> +
> +/**
> + * struct drm_gpusvm_ctx - DRM GPU SVM context
> + *
> + * @in_notifier: entering from a MMU notifier
> + * @read_only: operating on read-only memory
> + * @devmem_possible: possible to use device memory
> + * @check_pages: check pages and only create range for pages faulted
> in
> + *
> + * Context that is DRM GPUSVM is operating in (i.e. user arguments).
> + */
> +struct drm_gpusvm_ctx {
> + u32 in_notifier :1;
> + u32 read_only :1;
> + u32 devmem_possible :1;
> + u32 check_pages :1;
> +};
> +
> +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> + const char *name, struct drm_device *drm,
> + struct mm_struct *mm, void
> *device_private_page_owner,
> + u64 mm_start, u64 mm_range, u64 notifier_size,
> + const struct drm_gpusvm_ops *ops,
> + const u64 *chunk_sizes, int num_chunks);
> +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm);
> +void drm_gpusvm_free(struct drm_gpusvm *gpusvm);
> +
> +struct drm_gpusvm_range *
> +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64
> fault_addr,
> + u64 gpuva_start, u64 gpuva_end,
> + const struct drm_gpusvm_ctx *ctx);
> +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> + struct drm_gpusvm_range *range);
> +void drm_gpusvm_range_evict(struct drm_gpusvm *gpusvm,
> + struct drm_gpusvm_range *range);
> +
> +struct drm_gpusvm_range *
> +drm_gpusvm_range_get(struct drm_gpusvm_range *range);
> +void drm_gpusvm_range_put(struct drm_gpusvm_range *range);
> +
> +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> + struct drm_gpusvm_range *range);
> +
> +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> + struct drm_gpusvm_range *range,
> + const struct drm_gpusvm_ctx *ctx);
> +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> + struct drm_gpusvm_range *range,
> + const struct drm_gpusvm_ctx *ctx);
There is some blank-line inconsistency between the declarations. I think the
recommended coding style is to separate them with a blank line.
> +
> +int drm_gpusvm_migrate_to_devmem(struct drm_gpusvm *gpusvm,
> + struct drm_gpusvm_range *range,
> + struct drm_gpusvm_devmem
> *devmem_allocation,
> + const struct drm_gpusvm_ctx *ctx);
> +int drm_gpusvm_evict_to_ram(struct drm_gpusvm_devmem
> *devmem_allocation);
> +
> +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void);
> +
> +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start,
> u64 end);
> +
> +struct drm_gpusvm_range *
> +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64
> start, u64 end);
> +
> +/**
> + * drm_gpusvm_notifier_lock - Lock GPU SVM notifier
> + * @gpusvm__: Pointer to the GPU SVM structure.
> + *
> + * Abstract client usage GPU SVM notifier lock, take lock
> + */
> +#define drm_gpusvm_notifier_lock(gpusvm__) \
> + down_read(&(gpusvm__)->notifier_lock)
> +
> +/**
> + * drm_gpusvm_notifier_unlock - Unlock GPU SVM notifier
> + * @gpusvm__: Pointer to the GPU SVM structure.
> + *
> + * Abstract client usage GPU SVM notifier lock, drop lock
> + */
> +#define drm_gpusvm_notifier_unlock(gpusvm__) \
> + up_read(&(gpusvm__)->notifier_lock)
> +
> +/**
> + * __drm_gpusvm_range_next - Get the next GPU SVM range in the list
> + * @range: a pointer to the current GPU SVM range
> + *
> + * Return: A pointer to the next drm_gpusvm_range if available, or
> NULL if the
> + * current range is the last one or if the input range is
> NULL.
> + */
> +static inline struct drm_gpusvm_range *
> +__drm_gpusvm_range_next(struct drm_gpusvm_range *range)
> +{
> + if (range && !list_is_last(&range->rb.entry,
> + &range->notifier->range_list))
> + return list_next_entry(range, rb.entry);
> +
> + return NULL;
> +}
> +
> +/**
> + * drm_gpusvm_for_each_range - Iterate over GPU SVM ranges in a
> notifier
> + * @range__: Iterator variable for the ranges. If set, it indicates
> the start of
> + * the iterator. If NULL, call drm_gpusvm_range_find() to
> get the range.
> + * @notifier__: Pointer to the GPU SVM notifier
> + * @start__: Start address of the range
> + * @end__: End address of the range
> + *
> + * This macro is used to iterate over GPU SVM ranges in a notifier.
> It is safe
> + * to use while holding the driver SVM lock or the notifier lock.
> + */
> +#define drm_gpusvm_for_each_range(range__, notifier__, start__,
> end__) \
> + for ((range__) = (range__)
> ?: \
> + drm_gpusvm_range_find((notifier__), (start__),
> (end__)); \
> + (range__) && (range__->va.start <
> (end__)); \
> + (range__) = __drm_gpusvm_range_next(range__))
> +
> +/**
> + * drm_gpusvm_range_set_unmapped - Mark a GPU SVM range as unmapped
> + * @range: Pointer to the GPU SVM range structure.
> + * @mmu_range: Pointer to the MMU notifier range structure.
> + *
> + * This function marks a GPU SVM range as unmapped and sets the
> partial_unmap flag
> + * if the range partially falls within the provided MMU notifier
> range.
> + */
> +static inline void
> +drm_gpusvm_range_set_unmapped(struct drm_gpusvm_range *range,
> + const struct mmu_notifier_range
> *mmu_range)
> +{
> + lockdep_assert_held_write(&range->gpusvm->notifier_lock);
> +
> + range->flags.unmapped = true;
> + if (range->va.start < mmu_range->start ||
> + range->va.end > mmu_range->end)
> + range->flags.partial_unmap = true;
> +}
Inlines are really only useful for performance reasons, and that's not
the case here.
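I.e. (sketch) keep only the declaration in the header:
void drm_gpusvm_range_set_unmapped(struct drm_gpusvm_range *range,
				   const struct mmu_notifier_range *mmu_range);
and move the body as-is into drm_gpusvm.c (plus an EXPORT_SYMBOL* if this ever
ends up as a separate module).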
> +
> +/**
> + * drm_gpusvm_devmem_init - Initialize a GPU SVM device memory
> allocation
> + *
> + * @dev: Pointer to the device structure which device memory
> allocation belongs to
> + * @mm: Pointer to the mm_struct for the address space
> + * @ops: Pointer to the operations structure for GPU SVM device
> memory
> + * @dpagemap: The struct drm_pagemap we're allocating from.
> + * @size: Size of device memory allocation
> + */
> +static inline void
> +drm_gpusvm_devmem_init(struct drm_gpusvm_devmem *devmem_allocation,
> + struct device *dev, struct mm_struct *mm,
> + const struct drm_gpusvm_devmem_ops *ops,
> + struct drm_pagemap *dpagemap, size_t size)
> +{
> + devmem_allocation->dev = dev;
> + devmem_allocation->mm = mm;
> + devmem_allocation->ops = ops;
> + devmem_allocation->dpagemap = dpagemap;
> + devmem_allocation->size = size;
> +}
Same here?
> +
> +#endif /* __DRM_GPUSVM_H__ */
> --
> 2.34.1
Thanks,
Thomas
^ permalink raw reply [flat|nested] 129+ messages in thread* Re: [PATCH v2 05/29] drm/gpusvm: Add support for GPU Shared Virtual Memory
2024-11-05 14:48 ` Thomas Hellström
@ 2024-11-05 16:32 ` Matthew Brost
0 siblings, 0 replies; 129+ messages in thread
From: Matthew Brost @ 2024-11-05 16:32 UTC (permalink / raw)
To: Thomas Hellström
Cc: intel-xe, dri-devel, apopple, airlied, christian.koenig,
simona.vetter, felix.kuehling, dakr
On Tue, Nov 05, 2024 at 03:48:24PM +0100, Thomas Hellström wrote:
> On Tue, 2024-10-15 at 20:24 -0700, Matthew Brost wrote:
>
>
> Continued review:
>
> > +/**
> > + * drm_gpusvm_migrate_unmap_pages() - Unmap pages previously mapped
> > for GPU SVM migration
> > + * @dev: The device for which the pages were mapped
> > + * @dma_addr: Array of DMA addresses corresponding to mapped pages
> > + * @npages: Number of pages to unmap
> > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > + *
> > + * This function unmaps previously mapped pages of memory for GPU
> > Shared Virtual
> > + * Memory (SVM). It iterates over each DMA address provided in
> > @dma_addr, checks
> > + * if it's valid and not already unmapped, and unmaps the
> > corresponding page.
> > + */
> > +static void drm_gpusvm_migrate_unmap_pages(struct device *dev,
> > + dma_addr_t *dma_addr,
> > + unsigned long npages,
> > + enum dma_data_direction
> > dir)
> > +{
> > + unsigned long i;
> > +
> > + for (i = 0; i < npages; ++i) {
> > + if (!dma_addr[i] || dma_mapping_error(dev,
> > dma_addr[i]))
> > + continue;
> > +
> > + dma_unmap_page(dev, dma_addr[i], PAGE_SIZE, dir);
> > + }
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_to_devmem - Migrate GPU SVM range to device
> > memory
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + * @devmem_allocation: Pointer to the device memory allocation. The
> > caller
> > + * should hold a reference to the device memory
> > allocation,
> > + * which should be dropped via ops-
> > >devmem_release or upon
> > + * the failure of this function.
> > + * @ctx: GPU SVM context
> > + *
> > + * This function migrates the specified GPU SVM range to device
> > memory. It performs the
> > + * necessary setup and invokes the driver-specific operations for
> > migration to
> > + * device memory. Upon successful return, @devmem_allocation can
> > safely reference @range
> > + * until ops->devmem_release is called which only upon successful
> > return.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +int drm_gpusvm_migrate_to_devmem(struct drm_gpusvm *gpusvm,
> > + struct drm_gpusvm_range *range,
> > + struct drm_gpusvm_devmem
> > *devmem_allocation,
> > + const struct drm_gpusvm_ctx *ctx)
> > +{
> > + const struct drm_gpusvm_devmem_ops *ops = devmem_allocation-
> > >ops;
> > + u64 start = range->va.start, end = range->va.end;
> > + struct migrate_vma migrate = {
> > + .start = start,
> > + .end = end,
> > + .pgmap_owner = gpusvm->device_private_page_owner,
> > + .flags = MIGRATE_VMA_SELECT_SYSTEM,
> > + };
> > + struct mm_struct *mm = gpusvm->mm;
> > + unsigned long i, npages = npages_in_range(start, end);
> > + struct vm_area_struct *vas;
> > + struct drm_gpusvm_zdd *zdd = NULL;
> > + struct page **pages;
> > + dma_addr_t *dma_addr;
> > + void *buf;
> > + int err;
> > +
> > + if (!range->flags.migrate_devmem)
> > + return -EINVAL;
> > +
> > + if (!ops->populate_devmem_pfn || !ops->copy_to_devmem ||
> > !ops->copy_to_ram)
> > + return -EOPNOTSUPP;
> > +
> > + if (!mmget_not_zero(mm)) {
> > + err = -EFAULT;
> > + goto err_out;
> > + }
> > + mmap_read_lock(mm);
> > +
> > + vas = vma_lookup(mm, start);
> > + if (!vas) {
> > + err = -ENOENT;
> > + goto err_mmunlock;
> > + }
> > +
> > + if (end > vas->vm_end || start < vas->vm_start) {
> > + err = -EINVAL;
> > + goto err_mmunlock;
> > + }
> > +
> > + if (!vma_is_anonymous(vas)) {
> > + err = -EBUSY;
> > + goto err_mmunlock;
> > + }
> > +
> > + buf = kvcalloc(npages, 2 * sizeof(*migrate.src) +
> > sizeof(*dma_addr) +
> > + sizeof(*pages), GFP_KERNEL);
> > + if (!buf) {
> > + err = -ENOMEM;
> > + goto err_mmunlock;
> > + }
> > + dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > + pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr))
> > * npages;
> > +
> > + zdd = drm_gpusvm_zdd_alloc(gpusvm-
> > >device_private_page_owner);
> > + if (!zdd) {
> > + err = -ENOMEM;
> > + goto err_free;
> > + }
> > +
> > + migrate.vma = vas;
> > + migrate.src = buf;
> > + migrate.dst = migrate.src + npages;
> > +
> > + err = migrate_vma_setup(&migrate);
> > + if (err)
> > + goto err_free;
> > +
> > + /*
> > + * FIXME: Below cases, !migrate.cpages and migrate.cpages !=
> > npages, not
> > + * always an error. Need to revisit possible cases and how
> > to handle. We
> > + * could prefault on migrate.cpages != npages via
> > hmm_range_fault.
> > + */
> > +
> > + if (!migrate.cpages) {
> > + err = -EFAULT;
> > + goto err_free;
> > + }
> > +
> > + if (migrate.cpages != npages) {
> > + err = -EBUSY;
> > + goto err_finalize;
> > + }
> > +
> > + err = ops->populate_devmem_pfn(devmem_allocation, npages,
> > migrate.dst);
> > + if (err)
> > + goto err_finalize;
> > +
> > + err = drm_gpusvm_migrate_map_pages(devmem_allocation->dev,
> > dma_addr,
> > + migrate.src, npages,
> > DMA_TO_DEVICE);
> > + if (err)
> > + goto err_finalize;
> > +
> > + for (i = 0; i < npages; ++i) {
> > + struct page *page = pfn_to_page(migrate.dst[i]);
> > +
> > + pages[i] = page;
> > + migrate.dst[i] = migrate_pfn(migrate.dst[i]);
> > + drm_gpusvm_get_devmem_page(page, zdd);
> > + }
> > +
> > + err = ops->copy_to_devmem(pages, dma_addr, npages);
> > + if (err)
> > + goto err_finalize;
> > +
> > + /* Upon success bind devmem allocation to range and zdd */
> > + WRITE_ONCE(zdd->devmem_allocation, devmem_allocation); /*
> > Owns ref */
> > +
> > +err_finalize:
> > + if (err)
> > + drm_gpusvm_migration_put_pages(npages, migrate.dst);
> > + migrate_vma_pages(&migrate);
> > + migrate_vma_finalize(&migrate);
> > + drm_gpusvm_migrate_unmap_pages(devmem_allocation->dev,
> > dma_addr, npages,
> > + DMA_TO_DEVICE);
> > +err_free:
> > + if (zdd)
> > + drm_gpusvm_zdd_put(zdd);
> > + kvfree(buf);
> > +err_mmunlock:
> > + mmap_read_unlock(mm);
> > + mmput(mm);
> > +err_out:
> > + return err;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_populate_ram_pfn - Populate RAM PFNs for a VM
> > area
> > + * @vas: Pointer to the VM area structure, can be NULL
> > + * @npages: Number of pages to populate
> > + * @mpages: Number of pages to migrate
> > + * @src_mpfn: Source array of migrate PFNs
> > + * @mpfn: Array of migrate PFNs to populate
> > + * @addr: Start address for PFN allocation
> > + *
> > + * This function populates the RAM migrate page frame numbers (PFNs)
> > for the
> > + * specified VM area structure. It allocates and locks pages in the
> > VM area for
> > + * RAM usage. If vas is non-NULL use alloc_page_vma for allocation,
> > if NULL use
> > + * alloc_page for allocation.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +static int drm_gpusvm_migrate_populate_ram_pfn(struct vm_area_struct
> > *vas,
> > + unsigned long npages,
> > + unsigned long
> > *mpages,
> > + unsigned long
> > *src_mpfn,
> > + unsigned long *mpfn,
> > u64 addr)
> > +{
> > + unsigned long i;
> > +
> > + for (i = 0; i < npages; ++i, addr += PAGE_SIZE) {
> > + struct page *page;
> > +
> > + if (!(src_mpfn[i] & MIGRATE_PFN_MIGRATE))
> > + continue;
> > +
> > + if (vas)
> > + page = alloc_page_vma(GFP_HIGHUSER, vas,
> > addr);
> > + else
> > + page = alloc_page(GFP_HIGHUSER);
> > +
> > + if (!page)
> > + return -ENOMEM;
> > +
> > + lock_page(page);
>
> Allocating under page locks seems a bit scary, but OTOH we're
> recursively taking page locks as well. Perhaps add a comment on why this
> is allowed.
>
> Allocating and then trylocking (with asserts) as separate steps would
> otherwise guarantee that we don't hit a deadlock without noticing, but
> the way it's currently coded seems to be common practice.
>
I think this works, as a freshly allocated page will clearly be unlocked and
no one else should be locking the page here either, but I agree this is a bit
scary and does expose a deadlock risk, I suppose. I can split this into a
two-step process with trylocks if you like, or just add a comment. I don't
have a strong opinion here.
Note this code will likely be updated to allocate 2M pages when possible so
that 2M DMA mappings can be used for the copy, and eventually to migrate at 2M
granularity too in migrate_vma*. In that case, breaking this out into two
steps will be less costly in the common 2M case.
AMD has similar code to this, FWIW.
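E.g. a rough, untested sketch of the two-step variant, replacing the
lock_page() call:
	/*
	 * A freshly allocated page cannot be locked by anyone else, so a
	 * trylock failure here would indicate a bug.
	 */
	WARN_ON_ONCE(!trylock_page(page));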
> > + mpfn[i] = migrate_pfn(page_to_pfn(page));
> > + ++*mpages;
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_evict_to_ram - Evict GPU SVM range to RAM
> > + * @devmem_allocation: Pointer to the device memory allocation
> > + *
> > + * Similar to __drm_gpusvm_migrate_to_ram but does not require mmap
> > lock and
> > + * migration done via migrate_device_* functions.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +int drm_gpusvm_evict_to_ram(struct drm_gpusvm_devmem
> > *devmem_allocation)
> > +{
> > + const struct drm_gpusvm_devmem_ops *ops = devmem_allocation-
> > >ops;
> > + unsigned long npages, mpages = 0;
> > + struct page **pages;
> > + unsigned long *src, *dst;
> > + dma_addr_t *dma_addr;
> > + void *buf;
> > + int i, err = 0;
> > +
> > + npages = devmem_allocation->size >> PAGE_SHIFT;
> > +
> > +retry:
> > + if (!mmget_not_zero(devmem_allocation->mm))
> > + return -EFAULT;
> > +
> > + buf = kvcalloc(npages, 2 * sizeof(*src) + sizeof(*dma_addr)
> > +
> > + sizeof(*pages), GFP_KERNEL);
> > + if (!buf) {
> > + err = -ENOMEM;
> > + goto err_out;
> > + }
> > + src = buf;
> > + dst = buf + (sizeof(*src) * npages);
> > + dma_addr = buf + (2 * sizeof(*src) * npages);
> > + pages = buf + (2 * sizeof(*src) + sizeof(*dma_addr)) *
> > npages;
> > +
> > + err = ops->populate_devmem_pfn(devmem_allocation, npages,
> > src);
> > + if (err)
> > + goto err_free;
> > +
> > + err = migrate_device_prepopulated_range(src, npages);
> > + if (err)
> > + goto err_free;
> > +
> > + err = drm_gpusvm_migrate_populate_ram_pfn(NULL, npages,
> > &mpages, src,
> > + dst, 0);
> > + if (err || !mpages)
> > + goto err_finalize;
> > +
> > + err = drm_gpusvm_migrate_map_pages(devmem_allocation->dev,
> > dma_addr,
> > + dst, npages,
> > DMA_FROM_DEVICE);
> > + if (err)
> > + goto err_finalize;
> > +
> > + for (i = 0; i < npages; ++i)
> > + pages[i] = migrate_pfn_to_page(src[i]);
> > +
> > + err = ops->copy_to_ram(pages, dma_addr, npages);
> > + if (err)
> > + goto err_finalize;
> > +
> > +err_finalize:
> > + if (err)
> > + drm_gpusvm_migration_put_pages(npages, dst);
> > + migrate_device_pages(src, dst, npages);
> > + migrate_device_finalize(src, dst, npages);
> > + drm_gpusvm_migrate_unmap_pages(devmem_allocation->dev,
> > dma_addr, npages,
> > + DMA_FROM_DEVICE);
> > +err_free:
> > + kvfree(buf);
> > +err_out:
> > + mmput_async(devmem_allocation->mm);
> > + if (!err && !READ_ONCE(devmem_allocation->detached)) {
> > + cond_resched();
> > + goto retry;
> > + }
> > +
> > + return err;
> > +}
> > +
> > +/**
> > + * __drm_gpusvm_migrate_to_ram - Migrate GPU SVM range to RAM
> > (internal)
> > + * @vas: Pointer to the VM area structure
> > + * @device_private_page_owner: Device private pages owner
> > + * @page: Pointer to the page for fault handling (can be NULL)
> > + * @fault_addr: Fault address
> > + * @size: Size of migration
> > + *
> > + * This internal function performs the migration of the specified
> > GPU SVM range
> > + * to RAM. It sets up the migration, populates + dma maps RAM PFNs,
> > and
> > + * invokes the driver-specific operations for migration to RAM.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +static int __drm_gpusvm_migrate_to_ram(struct vm_area_struct *vas,
> > + void
> > *device_private_page_owner,
> > + struct page *page, u64
> > fault_addr,
> > + u64 size)
> > +{
> > + struct migrate_vma migrate = {
> > + .vma = vas,
> > + .pgmap_owner = device_private_page_owner,
> > + .flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE
> > |
> > + MIGRATE_VMA_SELECT_DEVICE_COHERENT,
> > + .fault_page = page,
> > + };
> > + struct drm_gpusvm_zdd *zdd;
> > + const struct drm_gpusvm_devmem_ops *ops;
> > + struct device *dev;
> > + unsigned long npages, mpages = 0;
> > + struct page **pages;
> > + dma_addr_t *dma_addr;
> > + u64 start, end;
> > + void *buf;
> > + int i, err = 0;
> > +
> > + start = ALIGN_DOWN(fault_addr, size);
> > + end = ALIGN(fault_addr + 1, size);
> > +
> > + /* Corner where VMA area struct has been partially unmapped
> > */
> > + if (start < vas->vm_start)
> > + start = vas->vm_start;
> > + if (end > vas->vm_end)
> > + end = vas->vm_end;
> > +
> > + migrate.start = start;
> > + migrate.end = end;
> > + npages = npages_in_range(start, end);
> > +
> > + buf = kvcalloc(npages, 2 * sizeof(*migrate.src) +
> > sizeof(*dma_addr) +
> > + sizeof(*pages), GFP_KERNEL);
> > + if (!buf) {
> > + err = -ENOMEM;
> > + goto err_out;
> > + }
> > + dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > + pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr))
> > * npages;
> > +
> > + migrate.vma = vas;
> > + migrate.src = buf;
> > + migrate.dst = migrate.src + npages;
> > +
> > + err = migrate_vma_setup(&migrate);
> > + if (err)
> > + goto err_free;
> > +
> > + /* Raced with another CPU fault, nothing to do */
> > + if (!migrate.cpages)
> > + goto err_free;
> > +
> > + if (!page) {
> > + for (i = 0; i < npages; ++i) {
> > + if (!(migrate.src[i] & MIGRATE_PFN_MIGRATE))
> > + continue;
> > +
> > + page = migrate_pfn_to_page(migrate.src[i]);
> > + break;
> > + }
> > +
> > + if (!page)
> > + goto err_finalize;
> > + }
> > + zdd = page->zone_device_data;
> > + ops = zdd->devmem_allocation->ops;
> > + dev = zdd->devmem_allocation->dev;
> > +
> > + err = drm_gpusvm_migrate_populate_ram_pfn(vas, npages,
> > &mpages,
> > + migrate.src,
> > migrate.dst,
> > + start);
> > + if (err)
> > + goto err_finalize;
> > +
> > + err = drm_gpusvm_migrate_map_pages(dev, dma_addr,
> > migrate.dst, npages,
> > + DMA_FROM_DEVICE);
> > + if (err)
> > + goto err_finalize;
> > +
> > + for (i = 0; i < npages; ++i)
> > + pages[i] = migrate_pfn_to_page(migrate.src[i]);
> > +
> > + err = ops->copy_to_ram(pages, dma_addr, npages);
> > + if (err)
> > + goto err_finalize;
> > +
> > +err_finalize:
> > + if (err)
> > + drm_gpusvm_migration_put_pages(npages, migrate.dst);
> > + migrate_vma_pages(&migrate);
> > + migrate_vma_finalize(&migrate);
> > + drm_gpusvm_migrate_unmap_pages(dev, dma_addr, npages,
> > + DMA_FROM_DEVICE);
> > +err_free:
> > + kvfree(buf);
> > +err_out:
> > +
> > + return err;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_evict - Evict GPU SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range to be removed
> > + *
> > + * This function evicts the specified GPU SVM range.
> > + */
> > +void drm_gpusvm_range_evict(struct drm_gpusvm *gpusvm,
> > + struct drm_gpusvm_range *range)
>
> Although we discussed this one before. Ideally I think we'd want to be
> able to migrate also other devices' pages. But need to consider device-
> coherent pages.
>
I discussed this with Alistair a bit here [1]. I proposed an hmm_range_fault
lookup of device pages plus a new helper. I think we landed on that not being
such a great idea, and on just using the existing migrate_vma_* functions.
I think we actually do allow migrating other devices' pages in
__drm_gpusvm_migrate_to_ram, as we look up the drm GPU devmem struct from the
pages. A little more work may be needed in __drm_gpusvm_migrate_to_ram if the
pages found point to different drm GPU devmem structs, though.
[1] https://patchwork.freedesktop.org/patch/619814/?series=137870&rev=2#comment_1127790
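Something roughly like this (untested sketch) could reject the mixed case in
__drm_gpusvm_migrate_to_ram until that is sorted out:
	for (i = 0; i < npages; ++i) {
		struct page *p = migrate_pfn_to_page(migrate.src[i]);

		if (!(migrate.src[i] & MIGRATE_PFN_MIGRATE) || !p ||
		    !is_zone_device_page(p))
			continue;

		/* All device pages in one migration must share a devmem allocation */
		if (((struct drm_gpusvm_zdd *)p->zone_device_data)->devmem_allocation !=
		    zdd->devmem_allocation) {
			err = -EOPNOTSUPP;
			goto err_finalize;
		}
	}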
>
> > +{
> > + struct mm_struct *mm = gpusvm->mm;
> > + struct vm_area_struct *vas;
> > +
> > + if (!mmget_not_zero(mm))
> > + return;
> > +
> > + mmap_read_lock(mm);
> > + vas = vma_lookup(mm, range->va.start);
> > + if (!vas)
> > + goto unlock;
Note this version is missing a VMA loop here, which is required if the VMAs
have changed since the range was created. I have this locally in the latest
stable branch I have shared [2].
[2] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-svm-10-15-24/-/tree/svm-10-16.stable?ref_type=heads
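For reference, the shape of it is roughly (untested sketch; the stable branch
has the real version):
	u64 addr, end = range->va.end;

	mmap_read_lock(mm);
	for (addr = range->va.start; addr < end; addr = vas->vm_end) {
		vas = vma_lookup(mm, addr);
		if (!vas)
			break;

		__drm_gpusvm_migrate_to_ram(vas,
					    gpusvm->device_private_page_owner,
					    NULL, addr,
					    min_t(u64, end, vas->vm_end) - addr);
	}
	mmap_read_unlock(mm);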
> > +
> > + __drm_gpusvm_migrate_to_ram(vas, gpusvm-
> > >device_private_page_owner,
> > + NULL, range->va.start,
> > + range->va.end - range-
> > >va.start);
> > +unlock:
> > + mmap_read_unlock(mm);
> > + mmput(mm);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_page_free - Put GPU SVM zone device data associated
> > with a page
> > + * @page: Pointer to the page
> > + *
> > + * This function is a callback used to put the GPU SVM zone device
> > data
> > + * associated with a page when it is being released.
> > + */
> > +static void drm_gpusvm_page_free(struct page *page)
> > +{
> > + drm_gpusvm_zdd_put(page->zone_device_data);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_to_ram - Migrate GPU SVM range to RAM (page
> > fault handler)
> > + * @vmf: Pointer to the fault information structure
> > + *
> > + * This function is a page fault handler used to migrate a GPU SVM
> > range to RAM.
> > + * It retrieves the GPU SVM range information from the faulting page
> > and invokes
> > + * the internal migration function to migrate the range back to RAM.
> > + *
> > + * Returns:
> > + * VM_FAULT_SIGBUS on failure, 0 on success.
> > + */
> > +static vm_fault_t drm_gpusvm_migrate_to_ram(struct vm_fault *vmf)
> > +{
> > + struct drm_gpusvm_zdd *zdd = vmf->page->zone_device_data;
> > + int err;
>
> ... and this one only the pages belonging to the current pagemap.
>
> > +
> > + err = __drm_gpusvm_migrate_to_ram(vmf->vma,
> > + zdd-
> > >device_private_page_owner,
> > + vmf->page, vmf->address,
> > + zdd->devmem_allocation-
> > >size);
> > +
> > + return err ? VM_FAULT_SIGBUS : 0;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_pagemap_ops - Device page map operations for GPU SVM
> > + */
> > +static const struct dev_pagemap_ops drm_gpusvm_pagemap_ops = {
> > + .page_free = drm_gpusvm_page_free,
> > + .migrate_to_ram = drm_gpusvm_migrate_to_ram,
> > +};
> > +
> > +/**
> > + * drm_gpusvm_pagemap_ops_get - Retrieve GPU SVM device page map
> > operations
> > + *
> > + * Returns:
> > + * Pointer to the GPU SVM device page map operations structure.
> > + */
> > +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void)
> > +{
> > + return &drm_gpusvm_pagemap_ops;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_has_mapping - Check if GPU SVM has mapping for the
> > given address range
> > + * @gpusvm: Pointer to the GPU SVM structure.
> > + * @start: Start address
> > + * @end: End address
> > + *
> > + * Returns:
> > + * True if GPU SVM has mapping, False otherwise
> > + */
> > +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start,
> > u64 end)
> > +{
> > + struct drm_gpusvm_notifier *notifier;
> > +
> > + drm_gpusvm_for_each_notifier(notifier, gpusvm, start, end) {
> > + struct drm_gpusvm_range *range = NULL;
> > +
> > + drm_gpusvm_for_each_range(range, notifier, start,
> > end)
> > + return true;
> > + }
> > +
> > + return false;
> > +}
> > diff --git a/drivers/gpu/drm/xe/drm_gpusvm.h
> > b/drivers/gpu/drm/xe/drm_gpusvm.h
> > new file mode 100644
> > index 000000000000..15ec22d4f9a5
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/drm_gpusvm.h
> > @@ -0,0 +1,447 @@
> > +/* SPDX-License-Identifier: MIT */
> > +/*
> > + * Copyright © 2024 Intel Corporation
> > + */
> > +
> > +#ifndef __DRM_GPUSVM_H__
> > +#define __DRM_GPUSVM_H__
> > +
> > +#include <linux/kref.h>
> > +#include <linux/mmu_notifier.h>
> > +#include <linux/workqueue.h>
> > +
> > +struct dev_pagemap_ops;
> > +struct drm_device;
> > +struct drm_gpusvm;
> > +struct drm_gpusvm_notifier;
> > +struct drm_gpusvm_ops;
> > +struct drm_gpusvm_range;
> > +struct drm_gpusvm_devmem;
> > +struct drm_pagemap;
> > +struct drm_pagemap_dma_addr;
> > +
> > +/**
> > + * struct drm_gpusvm_devmem_ops - Operations structure for GPU SVM
> > device memory
> > + *
> > + * This structure defines the operations for GPU Shared Virtual
> > Memory (SVM)
> > + * device memory. These operations are provided by the GPU driver to
> > manage device memory
> > + * allocations and perform operations such as migration between
> > device memory and system
> > + * RAM.
> > + */
> > +struct drm_gpusvm_devmem_ops {
> > + /**
> > + * @devmem_release: Release device memory allocation
> > (optional)
> > + * @devmem_allocation: device memory allocation
> > + *
> > + * This function shall release device memory allocation and
> > expects to drop a
>
> NIT: Consider "Release device memory..." rather than "This function
> shall release..." (general comment).
>
Will do.
> > + * reference to device memory allocation.
> > + */
> > + void (*devmem_release)(struct drm_gpusvm_devmem
> > *devmem_allocation);
> > +
> > + /**
> > + * @populate_devmem_pfn: Populate device memory PFN
> > (required for migration)
> > + * @devmem_allocation: device memory allocation
> > + * @npages: Number of pages to populate
> > + * @pfn: Array of page frame numbers to populate
> > + *
> > + * This function shall populate device memory page frame
> > numbers (PFN).
> > + *
> > + * Returns:
> > + * 0 on success, a negative error code on failure.
> > + */
> > + int (*populate_devmem_pfn)(struct drm_gpusvm_devmem
> > *devmem_allocation,
> > + unsigned long npages, unsigned long
> > *pfn);
> > +
> > + /**
> > + * @copy_to_devmem: Copy to device memory (required for
> > migration)
> > + * @pages: Pointer to array of device memory pages
> > (destination)
> > + * @dma_addr: Pointer to array of DMA addresses (source)
> > + * @npages: Number of pages to copy
> > + *
> > + * This function shall copy pages to device memory.
> > + *
> > + * Returns:
> > + * 0 on success, a negative error code on failure.
> > + */
> > + int (*copy_to_devmem)(struct page **pages,
> > + dma_addr_t *dma_addr,
> > + unsigned long npages);
> > +
> > + /**
> > + * @copy_to_ram: Copy to system RAM (required for migration)
> > + * @pages: Pointer to array of device memory pages (source)
> > + * @dma_addr: Pointer to array of DMA addresses
> > (destination)
> > + * @npages: Number of pages to copy
> > + *
> > + * This function shall copy pages to system RAM.
> > + *
> > + * Returns:
> > + * 0 on success, a negative error code on failure.
> > + */
> > + int (*copy_to_ram)(struct page **pages,
> > + dma_addr_t *dma_addr,
> > + unsigned long npages);
> > +};
> > +
> > +/**
> > + * struct drm_gpusvm_devmem - Structure representing a GPU SVM
> > device memory allocation
> > + *
> > + * @dev: Pointer to the device structure which device memory
> > allocation belongs to
> > + * @mm: Pointer to the mm_struct for the address space
> > + * @ops: Pointer to the operations structure for GPU SVM device
> > memory
> > + * @dpagemap: The struct drm_pagemap of the pages this allocation
> > belongs to.
> > + * @size: Size of device memory allocation
> > + * @detached: device memory allocations is detached from device
> > pages
> > + */
> > +struct drm_gpusvm_devmem {
> > + struct device *dev;
> > + struct mm_struct *mm;
> > + const struct drm_gpusvm_devmem_ops *ops;
> > + struct drm_pagemap *dpagemap;
> > + size_t size;
> > + bool detached;
> > +};
> > +
> > +/**
> > + * struct drm_gpusvm_ops - Operations structure for GPU SVM
> > + *
> > + * This structure defines the operations for GPU Shared Virtual
> > Memory (SVM).
> > + * These operations are provided by the GPU driver to manage SVM
> > ranges and
> > + * notifiers.
> > + */
> > +struct drm_gpusvm_ops {
> > + /**
> > + * @notifier_alloc: Allocate a GPU SVM notifier (optional)
> > + *
> > + * This function shall allocate a GPU SVM notifier.
> > + *
> > + * Returns:
> > + * Pointer to the allocated GPU SVM notifier on success,
> > NULL on failure.
> > + */
> > + struct drm_gpusvm_notifier *(*notifier_alloc)(void);
> > +
> > + /**
> > + * @notifier_free: Free a GPU SVM notifier (optional)
> > + * @notifier: Pointer to the GPU SVM notifier to be freed
> > + *
> > + * This function shall free a GPU SVM notifier.
> > + */
> > + void (*notifier_free)(struct drm_gpusvm_notifier *notifier);
> > +
> > + /**
> > + * @range_alloc: Allocate a GPU SVM range (optional)
> > + * @gpusvm: Pointer to the GPU SVM
> > + *
> > + * This function shall allocate a GPU SVM range.
> > + *
> > + * Returns:
> > + * Pointer to the allocated GPU SVM range on success, NULL
> > on failure.
> > + */
> > + struct drm_gpusvm_range *(*range_alloc)(struct drm_gpusvm
> > *gpusvm);
> > +
> > + /**
> > + * @range_free: Free a GPU SVM range (optional)
> > + * @range: Pointer to the GPU SVM range to be freed
> > + *
> > + * This function shall free a GPU SVM range.
> > + */
> > + void (*range_free)(struct drm_gpusvm_range *range);
> > +
> > + /**
> > + * @invalidate: Invalidate GPU SVM notifier (required)
> > + * @gpusvm: Pointer to the GPU SVM
> > + * @notifier: Pointer to the GPU SVM notifier
> > + * @mmu_range: Pointer to the mmu_notifier_range structure
> > + *
> > + * This function shall invalidate the GPU page tables. It
> > can safely
> > + * walk the notifier range RB tree/list in this function.
> > Called while
> > + * holding the notifier lock.
> > + */
> > + void (*invalidate)(struct drm_gpusvm *gpusvm,
> > + struct drm_gpusvm_notifier *notifier,
> > + const struct mmu_notifier_range
> > *mmu_range);
> > +};
> > +
> > +/**
> > + * struct drm_gpusvm_notifier - Structure representing a GPU SVM
> > notifier
> > + *
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: MMU interval notifier
> > + * @interval: Interval for the notifier
> > + * @rb: Red-black tree node for the parent GPU SVM structure
> > notifier tree
> > + * @root: Cached root node of the RB tree containing ranges
> > + * @range_list: List head containing of ranges in the same order
> > they appear in
> > + * interval tree. This is useful to keep iterating
> > ranges while
> > + * doing modifications to RB tree.
> > + * @flags.removed: Flag indicating whether the MMU interval notifier
> > has been
> > + * removed
>
> Please also document the nested fields.
>
Ah yes, will do.
> > + *
> > + * This structure represents a GPU SVM notifier.
> > + */
> > +struct drm_gpusvm_notifier {
> > + struct drm_gpusvm *gpusvm;
> > + struct mmu_interval_notifier notifier;
> > + struct {
> > + u64 start;
> > + u64 end;
> > + } interval;
> > + struct {
> > + struct rb_node node;
> > + struct list_head entry;
> > + u64 __subtree_last;
> > + } rb;
> > + struct rb_root_cached root;
> > + struct list_head range_list;
> > + struct {
> > + u32 removed : 1;
> > + } flags;
> > +};
> > +
> > +/**
> > + * struct drm_gpusvm_range - Structure representing a GPU SVM range
> > + *
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier
> > + * @refcount: Reference count for the range
> > + * @rb: Red-black tree node for the parent GPU SVM notifier structure range tree
> > + * @va: Virtual address range
> > + * @notifier_seq: Notifier sequence number of the range's pages
> > + * @dma_addr: DMA address array
> > + * @dpagemap: The struct drm_pagemap of the device pages we're dma-mapping.
> > + * Note this is assuming only one drm_pagemap per range is allowed.
> > + * @flags.migrate_devmem: Flag indicating whether the range can be migrated to device memory
> > + * @flags.unmapped: Flag indicating if the range has been unmapped
> > + * @flags.partial_unmap: Flag indicating if the range has been partially unmapped
> > + * @flags.has_devmem_pages: Flag indicating if the range has devmem pages
> > + * @flags.has_dma_mapping: Flag indicating if the range has a DMA mapping
> > + *
> > + * This structure represents a GPU SVM range used for tracking memory ranges
> > + * mapped in a DRM device.
> > + */
> Also here.
>
+1
> > +struct drm_gpusvm_range {
> > + struct drm_gpusvm *gpusvm;
> > + struct drm_gpusvm_notifier *notifier;
> > + struct kref refcount;
> > + struct {
> > + struct rb_node node;
> > + struct list_head entry;
> > + u64 __subtree_last;
> > + } rb;
> > + struct {
> > + u64 start;
> > + u64 end;
> > + } va;
> > + unsigned long notifier_seq;
> > + struct drm_pagemap_dma_addr *dma_addr;
> > + struct drm_pagemap *dpagemap;
> > + struct {
> > + /* All flags below must be set upon creation */
> > + u16 migrate_devmem : 1;
> > + /* All flags below must be set / cleared under notifier lock */
> > + u16 unmapped : 1;
> > + u16 partial_unmap : 1;
> > + u16 has_devmem_pages : 1;
> > + u16 has_dma_mapping : 1;
> > + } flags;
> > +};
> > +
> > +/**
> > + * struct drm_gpusvm - GPU SVM structure
> > + *
> > + * @name: Name of the GPU SVM
> > + * @drm: Pointer to the DRM device structure
> > + * @mm: Pointer to the mm_struct for the address space
> > + * @device_private_page_owner: Device private pages owner
> > + * @mm_start: Start address of GPU SVM
> > + * @mm_range: Range of the GPU SVM
> > + * @notifier_size: Size of individual notifiers
> > + * @ops: Pointer to the operations structure for GPU SVM
> > + * @chunk_sizes: Pointer to the array of chunk sizes used in range allocation.
> > + * Entries should be powers of 2 in descending order.
> > + * @num_chunks: Number of chunks
> > + * @notifier_lock: Read-write semaphore for protecting notifier operations
> > + * @root: Cached root node of the Red-Black tree containing GPU SVM notifiers
> > + * @notifier_list: List head containing notifiers in the same order they
> > + * appear in interval tree. This is useful to keep iterating
> > + * notifiers while doing modifications to RB tree.
> > + *
> > + * This structure represents a GPU SVM (Shared Virtual Memory) used for tracking
> > + * memory ranges mapped in a DRM (Direct Rendering Manager) device.
> > + *
> > + * No reference counting is provided, as this is expected to be embedded in the
> > + * driver VM structure along with the struct drm_gpuvm, which handles reference
> > + * counting.
> > + */
> > +struct drm_gpusvm {
> > + const char *name;
> > + struct drm_device *drm;
> > + struct mm_struct *mm;
> > + void *device_private_page_owner;
> > + u64 mm_start;
> > + u64 mm_range;
>
> Possibly consider using unsigned long for cpu virtual addresses, since
> that's typically done elsewhere.
> GPU virtual addresses should remain 64-bit, though.
>
Ok, will adjust.
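For reference, a minimal sketch of the adjusted CPU-address fields
(assumption here: only the CPU virtual address members change type):

 unsigned long mm_start;
 unsigned long mm_range;
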
> > + u64 notifier_size;
> > + const struct drm_gpusvm_ops *ops;
> > + const u64 *chunk_sizes;
> > + int num_chunks;
> > + struct rw_semaphore notifier_lock;
> > + struct rb_root_cached root;
> > + struct list_head notifier_list;
> > +};
> > +
> > +/**
> > + * struct drm_gpusvm_ctx - DRM GPU SVM context
> > + *
> > + * @in_notifier: entering from a MMU notifier
> > + * @read_only: operating on read-only memory
> > + * @devmem_possible: possible to use device memory
> > + * @check_pages: check pages and only create range for pages faulted in
> > + *
> > + * Context that DRM GPU SVM is operating in (i.e. user arguments).
> > + */
> > +struct drm_gpusvm_ctx {
> > + u32 in_notifier :1;
> > + u32 read_only :1;
> > + u32 devmem_possible :1;
> > + u32 check_pages :1;
> > +};
> > +
> > +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> > + const char *name, struct drm_device *drm,
> > + struct mm_struct *mm, void *device_private_page_owner,
> > + u64 mm_start, u64 mm_range, u64 notifier_size,
> > + const struct drm_gpusvm_ops *ops,
> > + const u64 *chunk_sizes, int num_chunks);
> > +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm);
> > +void drm_gpusvm_free(struct drm_gpusvm *gpusvm);
> > +
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64 fault_addr,
> > + u64 gpuva_start, u64 gpuva_end,
> > + const struct drm_gpusvm_ctx *ctx);
> > +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> > + struct drm_gpusvm_range *range);
> > +void drm_gpusvm_range_evict(struct drm_gpusvm *gpusvm,
> > + struct drm_gpusvm_range *range);
> > +
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_get(struct drm_gpusvm_range *range);
> > +void drm_gpusvm_range_put(struct drm_gpusvm_range *range);
> > +
> > +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> > + struct drm_gpusvm_range *range);
> > +
> > +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> > + struct drm_gpusvm_range *range,
> > + const struct drm_gpusvm_ctx *ctx);
> > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> > + struct drm_gpusvm_range *range,
> > + const struct drm_gpusvm_ctx *ctx);
>
> There is some newline inconsistency between declarations. I think the
> recommended coding style is to use a newline in-between.
>
Ok, got it.
> > +
> > +int drm_gpusvm_migrate_to_devmem(struct drm_gpusvm *gpusvm,
> > + struct drm_gpusvm_range *range,
> > + struct drm_gpusvm_devmem *devmem_allocation,
> > + const struct drm_gpusvm_ctx *ctx);
> > +int drm_gpusvm_evict_to_ram(struct drm_gpusvm_devmem *devmem_allocation);
> > +
> > +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void);
> > +
> > +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start, u64 end);
> > +
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64 start, u64 end);
> > +
> > +/**
> > + * drm_gpusvm_notifier_lock - Lock GPU SVM notifier
> > + * @gpusvm__: Pointer to the GPU SVM structure.
> > + *
> > + * Abstract client usage GPU SVM notifier lock, take lock
> > + */
> > +#define drm_gpusvm_notifier_lock(gpusvm__) \
> > + down_read(&(gpusvm__)->notifier_lock)
> > +
> > +/**
> > + * drm_gpusvm_notifier_unlock - Unlock GPU SVM notifier
> > + * @gpusvm__: Pointer to the GPU SVM structure.
> > + *
> > + * Abstract client usage GPU SVM notifier lock, drop lock
> > + */
> > +#define drm_gpusvm_notifier_unlock(gpusvm__) \
> > + up_read(&(gpusvm__)->notifier_lock)
> > +
> > +/**
> > + * __drm_gpusvm_range_next - Get the next GPU SVM range in the list
> > + * @range: a pointer to the current GPU SVM range
> > + *
> > + * Return: A pointer to the next drm_gpusvm_range if available, or NULL if the
> > + * current range is the last one or if the input range is NULL.
> > + */
> > +static inline struct drm_gpusvm_range *
> > +__drm_gpusvm_range_next(struct drm_gpusvm_range *range)
> > +{
> > + if (range && !list_is_last(&range->rb.entry,
> > + &range->notifier->range_list))
> > + return list_next_entry(range, rb.entry);
> > +
> > + return NULL;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_for_each_range - Iterate over GPU SVM ranges in a notifier
> > + * @range__: Iterator variable for the ranges. If set, it indicates the start of
> > + * the iterator. If NULL, call drm_gpusvm_range_find() to get the range.
> > + * @notifier__: Pointer to the GPU SVM notifier
> > + * @start__: Start address of the range
> > + * @end__: End address of the range
> > + *
> > + * This macro is used to iterate over GPU SVM ranges in a notifier. It is safe
> > + * to use while holding the driver SVM lock or the notifier lock.
> > + */
> > +#define drm_gpusvm_for_each_range(range__, notifier__, start__, end__) \
> > + for ((range__) = (range__) ?: \
> > + drm_gpusvm_range_find((notifier__), (start__), (end__)); \
> > + (range__) && (range__->va.start < (end__)); \
> > + (range__) = __drm_gpusvm_range_next(range__))
> > +
> > +/**
> > + * drm_gpusvm_range_set_unmapped - Mark a GPU SVM range as unmapped
> > + * @range: Pointer to the GPU SVM range structure.
> > + * @mmu_range: Pointer to the MMU notifier range structure.
> > + *
> > + * This function marks a GPU SVM range as unmapped and sets the partial_unmap flag
> > + * if the range partially falls within the provided MMU notifier range.
> > + */
> > +static inline void
> > +drm_gpusvm_range_set_unmapped(struct drm_gpusvm_range *range,
> > + const struct mmu_notifier_range *mmu_range)
> > +{
> > + lockdep_assert_held_write(&range->gpusvm->notifier_lock);
> > +
> > + range->flags.unmapped = true;
> > + if (range->va.start < mmu_range->start ||
> > + range->va.end > mmu_range->end)
> > + range->flags.partial_unmap = true;
> > +}
>
> Inlines are really only useful for performance reasons, and that's not
> the case here.
>
Will move to drm_gpusvm.c
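i.e. keep only a declaration in the header and move the body (unchanged
from above) into drm_gpusvm.c, roughly (untested sketch):

 /* drm_gpusvm.h */
 void drm_gpusvm_range_set_unmapped(struct drm_gpusvm_range *range,
                                    const struct mmu_notifier_range *mmu_range);

 /* drm_gpusvm.c */
 void drm_gpusvm_range_set_unmapped(struct drm_gpusvm_range *range,
                                    const struct mmu_notifier_range *mmu_range)
 {
         lockdep_assert_held_write(&range->gpusvm->notifier_lock);

         range->flags.unmapped = true;
         if (range->va.start < mmu_range->start ||
             range->va.end > mmu_range->end)
                 range->flags.partial_unmap = true;
 }
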
> > +
> > +/**
> > + * drm_gpusvm_devmem_init - Initialize a GPU SVM device memory allocation
> > + *
> > + * @dev: Pointer to the device structure which device memory allocation belongs to
> > + * @mm: Pointer to the mm_struct for the address space
> > + * @ops: Pointer to the operations structure for GPU SVM device memory
> > + * @dpagemap: The struct drm_pagemap we're allocating from.
> > + * @size: Size of device memory allocation
> > + */
> > +static inline void
> > +drm_gpusvm_devmem_init(struct drm_gpusvm_devmem *devmem_allocation,
> > + struct device *dev, struct mm_struct *mm,
> > + const struct drm_gpusvm_devmem_ops *ops,
> > + struct drm_pagemap *dpagemap, size_t size)
> > +{
> > + devmem_allocation->dev = dev;
> > + devmem_allocation->mm = mm;
> > + devmem_allocation->ops = ops;
> > + devmem_allocation->dpagemap = dpagemap;
> > + devmem_allocation->size = size;
> > +}
>
> Same here?
>
Ok, will change too.
Matt
>
> > +
> > +#endif /* __DRM_GPUSVM_H__ */
> > --
> > 2.34.1
>
>
> Thanks,
> Thomas
>
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [PATCH v2 05/29] drm/gpusvm: Add support for GPU Shared Virtual Memory
2024-10-16 3:24 ` [PATCH v2 05/29] drm/gpusvm: Add support for GPU Shared Virtual Memory Matthew Brost
` (2 preceding siblings ...)
2024-11-05 14:48 ` Thomas Hellström
@ 2024-11-20 3:00 ` Gwan-gyeong Mun
2024-11-29 0:00 ` Alistair Popple
4 siblings, 0 replies; 129+ messages in thread
From: Gwan-gyeong Mun @ 2024-11-20 3:00 UTC (permalink / raw)
To: Matthew Brost, intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
On 10/16/24 6:24 AM, Matthew Brost wrote:
> +
> +/**
> + * drm_gpusvm_get_devmem_page - Get a reference to a device memory page
> + * @page: Pointer to the page
> + * @zdd: Pointer to the GPU SVM zone device data
> + *
> + * This function associates the given page with the specified GPU SVM zone
> + * device data and initializes it for zone device usage.
> + */
> +static void drm_gpusvm_get_devmem_page(struct page *page,
> + struct drm_gpusvm_zdd *zdd)
> +{
> + page->zone_device_data = drm_gpusvm_zdd_get(zdd);
> + zone_device_page_init(page);
> +}
> +
Using the zone_device_page_init() function requires CONFIG_ZONE_DEVICE=y
in the config. To resolve build issues, a CONFIG_ZONE_DEVICE dependency is
required in xe's Kconfig.
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [PATCH v2 05/29] drm/gpusvm: Add support for GPU Shared Virtual Memory
2024-10-16 3:24 ` [PATCH v2 05/29] drm/gpusvm: Add support for GPU Shared Virtual Memory Matthew Brost
` (3 preceding siblings ...)
2024-11-20 3:00 ` Gwan-gyeong Mun
@ 2024-11-29 0:00 ` Alistair Popple
2024-12-14 1:16 ` Matthew Brost
4 siblings, 1 reply; 129+ messages in thread
From: Alistair Popple @ 2024-11-29 0:00 UTC (permalink / raw)
To: Matthew Brost
Cc: intel-xe, dri-devel, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
Matthew Brost <matthew.brost@intel.com> writes:
[...]
> + * 3) Invalidation driver vfunc.
> + *
> + * void driver_invalidation(struct drm_gpusvm *gpusvm,
> + * struct drm_gpusvm_notifier *notifier,
> + * const struct mmu_notifier_range *mmu_range)
> + * {
> + * struct drm_gpusvm_ctx ctx = { .in_notifier = true, };
> + * struct drm_gpusvm_range *range = NULL;
> + *
> + * driver_invalidate_device_tlb(gpusvm, mmu_range->start, mmu_range->end);
> + *
> + * drm_gpusvm_for_each_range(range, notifier, mmu_range->start,
> + * mmu_range->end) {
> + * drm_gpusvm_range_unmap_pages(gpusvm, range, &ctx);
> + *
> + * if (mmu_range->event != MMU_NOTIFY_UNMAP)
I've only glanced at this series as an interested onlooker, so I
may have overlooked something obvious, but why is it ok to skip notifiers
other than MMU_NOTIFY_UNMAP? Wouldn't you also need to clear GPU PTEs
in other cases?
- Alistair
> + * continue;
> + *
> + * drm_gpusvm_range_set_unmapped(range, mmu_range);
> + * driver_garbage_collector_add(gpusvm, range);
> + * }
> + * }
> + */
> +
> +#define DRM_GPUSVM_RANGE_START(_range) ((_range)->va.start)
> +#define DRM_GPUSVM_RANGE_END(_range) ((_range)->va.end - 1)
> +INTERVAL_TREE_DEFINE(struct drm_gpusvm_range, rb.node, u64, rb.__subtree_last,
> + DRM_GPUSVM_RANGE_START, DRM_GPUSVM_RANGE_END,
> + static __maybe_unused, range);
> +
> +#define DRM_GPUSVM_NOTIFIER_START(_notifier) ((_notifier)->interval.start)
> +#define DRM_GPUSVM_NOTIFIER_END(_notifier) ((_notifier)->interval.end - 1)
> +INTERVAL_TREE_DEFINE(struct drm_gpusvm_notifier, rb.node, u64,
> + rb.__subtree_last, DRM_GPUSVM_NOTIFIER_START,
> + DRM_GPUSVM_NOTIFIER_END, static __maybe_unused, notifier);
> +
> +/**
> + * npages_in_range() - Calculate the number of pages in a given range
> + * @start__: The start address of the range
> + * @end__: The end address of the range
> + *
> + * This macro calculates the number of pages in a given memory range,
> + * specified by the start and end addresses. It divides the difference
> + * between the end and start addresses by the page size (PAGE_SIZE) to
> + * determine the number of pages in the range.
> + *
> + * Return: The number of pages in the specified range.
> + */
> +#define npages_in_range(start__, end__) \
> + (((end__) - (start__)) >> PAGE_SHIFT)
> +
> +/**
> + * struct drm_gpusvm_zdd - GPU SVM zone device data
> + *
> + * @refcount: Reference count for the zdd
> + * @destroy_work: Work structure for asynchronous zdd destruction
> + * @devmem_allocation: device memory allocation
> + * @device_private_page_owner: Device private pages owner
> + *
> + * This structure serves as a generic wrapper installed in
> + * page->zone_device_data. It provides infrastructure for looking up a device
> + * memory allocation upon CPU page fault and asynchronously releasing device
> + * memory once the CPU has no page references. Asynchronous release is useful
> + * because CPU page references can be dropped in IRQ contexts, while releasing
> + * device memory likely requires sleeping locks.
> + */
> +struct drm_gpusvm_zdd {
> + struct kref refcount;
> + struct work_struct destroy_work;
> + struct drm_gpusvm_devmem *devmem_allocation;
> + void *device_private_page_owner;
> +};
> +
> +/**
> + * drm_gpusvm_zdd_destroy_work_func - Work function for destroying a zdd
> + * @w: Pointer to the work_struct
> + *
> + * This function releases device memory, puts GPU SVM range, and frees zdd.
> + */
> +static void drm_gpusvm_zdd_destroy_work_func(struct work_struct *w)
> +{
> + struct drm_gpusvm_zdd *zdd =
> + container_of(w, struct drm_gpusvm_zdd, destroy_work);
> + const struct drm_gpusvm_devmem_ops *ops = zdd->devmem_allocation ?
> + zdd->devmem_allocation->ops : NULL;
> +
> + if (zdd->devmem_allocation && ops->devmem_release)
> + ops->devmem_release(zdd->devmem_allocation);
> + kfree(zdd);
> +}
> +
> +/**
> + * drm_gpusvm_zdd_alloc - Allocate a zdd structure.
> + * @device_private_page_owner: Device private pages owner
> + *
> + * This function allocates and initializes a new zdd structure. It sets up the
> + * reference count and initializes the destroy work.
> + *
> + * Returns:
> + * Pointer to the allocated zdd on success, NULL on failure.
> + */
> +static struct drm_gpusvm_zdd *
> +drm_gpusvm_zdd_alloc(void *device_private_page_owner)
> +{
> + struct drm_gpusvm_zdd *zdd;
> +
> + zdd = kmalloc(sizeof(*zdd), GFP_KERNEL);
> + if (!zdd)
> + return NULL;
> +
> + kref_init(&zdd->refcount);
> + INIT_WORK(&zdd->destroy_work, drm_gpusvm_zdd_destroy_work_func);
> + zdd->devmem_allocation = NULL;
> + zdd->device_private_page_owner = device_private_page_owner;
> +
> + return zdd;
> +}
> +
> +/**
> + * drm_gpusvm_zdd_get - Get a reference to a zdd structure.
> + * @zdd: Pointer to the zdd structure.
> + *
> + * This function increments the reference count of the provided zdd structure.
> + *
> + * Returns: Pointer to the zdd structure.
> + */
> +static struct drm_gpusvm_zdd *drm_gpusvm_zdd_get(struct drm_gpusvm_zdd *zdd)
> +{
> + kref_get(&zdd->refcount);
> + return zdd;
> +}
> +
> +/**
> + * drm_gpusvm_zdd_destroy - Destroy a zdd structure.
> + * @ref: Pointer to the reference count structure.
> + *
> + * This function queues the destroy_work of the zdd for asynchronous destruction.
> + */
> +static void drm_gpusvm_zdd_destroy(struct kref *ref)
> +{
> + struct drm_gpusvm_zdd *zdd =
> + container_of(ref, struct drm_gpusvm_zdd, refcount);
> +
> + if (zdd->devmem_allocation)
> + WRITE_ONCE(zdd->devmem_allocation->detached, true);
> + schedule_work(&zdd->destroy_work);
> +}
> +
> +/**
> + * drm_gpusvm_zdd_put - Put a zdd reference.
> + * @zdd: Pointer to the zdd structure.
> + *
> + * This function decrements the reference count of the provided zdd structure
> + * and schedules its destruction if the count drops to zero.
> + */
> +static void drm_gpusvm_zdd_put(struct drm_gpusvm_zdd *zdd)
> +{
> + kref_put(&zdd->refcount, drm_gpusvm_zdd_destroy);
> +}
> +
> +/**
> + * drm_gpusvm_range_find - Find GPU SVM range from GPU SVM notifier
> + * @notifier: Pointer to the GPU SVM notifier structure.
> + * @start: Start address of the range
> + * @end: End address of the range
> + *
> + * Return: A pointer to the drm_gpusvm_range if found or NULL
> + */
> +struct drm_gpusvm_range *
> +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64 start, u64 end)
> +{
> + return range_iter_first(&notifier->root, start, end - 1);
> +}
> +
> +/**
> + * drm_gpusvm_for_each_range_safe - Safely iterate over GPU SVM ranges in a notifier
> + * @range__: Iterator variable for the ranges
> + * @next__: Iterator variable for the ranges temporary storage
> + * @notifier__: Pointer to the GPU SVM notifier
> + * @start__: Start address of the range
> + * @end__: End address of the range
> + *
> + * This macro is used to iterate over GPU SVM ranges in a notifier while
> + * removing ranges from it.
> + */
> +#define drm_gpusvm_for_each_range_safe(range__, next__, notifier__, start__, end__) \
> + for ((range__) = drm_gpusvm_range_find((notifier__), (start__), (end__)), \
> + (next__) = __drm_gpusvm_range_next(range__); \
> + (range__) && (range__->va.start < (end__)); \
> + (range__) = (next__), (next__) = __drm_gpusvm_range_next(range__))
> +
> +/**
> + * __drm_gpusvm_notifier_next - get the next drm_gpusvm_notifier in the list
> + * @notifier: a pointer to the current drm_gpusvm_notifier
> + *
> + * Return: A pointer to the next drm_gpusvm_notifier if available, or NULL if
> + * the current notifier is the last one or if the input notifier is
> + * NULL.
> + */
> +static struct drm_gpusvm_notifier *
> +__drm_gpusvm_notifier_next(struct drm_gpusvm_notifier *notifier)
> +{
> + if (notifier && !list_is_last(&notifier->rb.entry,
> + &notifier->gpusvm->notifier_list))
> + return list_next_entry(notifier, rb.entry);
> +
> + return NULL;
> +}
> +
> +/**
> + * drm_gpusvm_for_each_notifier - Iterate over GPU SVM notifiers in a gpusvm
> + * @notifier__: Iterator variable for the notifiers
> + * @gpusvm__: Pointer to the GPU SVM structure
> + * @start__: Start address of the notifier
> + * @end__: End address of the notifier
> + *
> + * This macro is used to iterate over GPU SVM notifiers in a gpusvm.
> + */
> +#define drm_gpusvm_for_each_notifier(notifier__, gpusvm__, start__, end__) \
> + for ((notifier__) = notifier_iter_first(&(gpusvm__)->root, (start__), (end__) - 1); \
> + (notifier__) && (notifier__->interval.start < (end__)); \
> + (notifier__) = __drm_gpusvm_notifier_next(notifier__))
> +
> +/**
> + * drm_gpusvm_for_each_notifier_safe - Safely iterate over GPU SVM notifiers in a gpusvm
> + * @notifier__: Iterator variable for the notifiers
> + * @next__: Iterator variable for the notifiers temporary storage
> + * @gpusvm__: Pointer to the GPU SVM structure
> + * @start__: Start address of the notifier
> + * @end__: End address of the notifier
> + *
> + * This macro is used to iterate over GPU SVM notifiers in a gpusvm while
> + * removing notifiers from it.
> + */
> +#define drm_gpusvm_for_each_notifier_safe(notifier__, next__, gpusvm__, start__, end__) \
> + for ((notifier__) = notifier_iter_first(&(gpusvm__)->root, (start__), (end__) - 1), \
> + (next__) = __drm_gpusvm_notifier_next(notifier__); \
> + (notifier__) && (notifier__->interval.start < (end__)); \
> + (notifier__) = (next__), (next__) = __drm_gpusvm_notifier_next(notifier__))
> +
> +/**
> + * drm_gpusvm_notifier_invalidate - Invalidate a GPU SVM notifier.
> + * @mni: Pointer to the mmu_interval_notifier structure.
> + * @mmu_range: Pointer to the mmu_notifier_range structure.
> + * @cur_seq: Current sequence number.
> + *
> + * This function serves as a generic MMU notifier for GPU SVM. It sets the MMU
> + * notifier sequence number and calls the driver invalidate vfunc under
> + * gpusvm->notifier_lock.
> + *
> + * Returns:
> + * true if the operation succeeds, false otherwise.
> + */
> +static bool
> +drm_gpusvm_notifier_invalidate(struct mmu_interval_notifier *mni,
> + const struct mmu_notifier_range *mmu_range,
> + unsigned long cur_seq)
> +{
> + struct drm_gpusvm_notifier *notifier =
> + container_of(mni, typeof(*notifier), notifier);
> + struct drm_gpusvm *gpusvm = notifier->gpusvm;
> +
> + if (!mmu_notifier_range_blockable(mmu_range))
> + return false;
> +
> + down_write(&gpusvm->notifier_lock);
> + mmu_interval_set_seq(mni, cur_seq);
> + gpusvm->ops->invalidate(gpusvm, notifier, mmu_range);
> + up_write(&gpusvm->notifier_lock);
> +
> + return true;
> +}
> +
> +/**
> + * drm_gpusvm_notifier_ops - MMU interval notifier operations for GPU SVM
> + */
> +static const struct mmu_interval_notifier_ops drm_gpusvm_notifier_ops = {
> + .invalidate = drm_gpusvm_notifier_invalidate,
> +};
> +
> +/**
> + * drm_gpusvm_init - Initialize the GPU SVM.
> + * @gpusvm: Pointer to the GPU SVM structure.
> + * @name: Name of the GPU SVM.
> + * @drm: Pointer to the DRM device structure.
> + * @mm: Pointer to the mm_struct for the address space.
> + * @device_private_page_owner: Device private pages owner.
> + * @mm_start: Start address of GPU SVM.
> + * @mm_range: Range of the GPU SVM.
> + * @notifier_size: Size of individual notifiers.
> + * @ops: Pointer to the operations structure for GPU SVM.
> + * @chunk_sizes: Pointer to the array of chunk sizes used in range allocation.
> + * Entries should be powers of 2 in descending order with last
> + * entry being SZ_4K.
> + * @num_chunks: Number of chunks.
> + *
> + * This function initializes the GPU SVM.
> + *
> + * Returns:
> + * 0 on success, a negative error code on failure.
> + */
> +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> + const char *name, struct drm_device *drm,
> + struct mm_struct *mm, void *device_private_page_owner,
> + u64 mm_start, u64 mm_range, u64 notifier_size,
> + const struct drm_gpusvm_ops *ops,
> + const u64 *chunk_sizes, int num_chunks)
> +{
> + if (!ops->invalidate || !num_chunks)
> + return -EINVAL;
> +
> + gpusvm->name = name;
> + gpusvm->drm = drm;
> + gpusvm->mm = mm;
> + gpusvm->device_private_page_owner = device_private_page_owner;
> + gpusvm->mm_start = mm_start;
> + gpusvm->mm_range = mm_range;
> + gpusvm->notifier_size = notifier_size;
> + gpusvm->ops = ops;
> + gpusvm->chunk_sizes = chunk_sizes;
> + gpusvm->num_chunks = num_chunks;
> +
> + mmgrab(mm);
> + gpusvm->root = RB_ROOT_CACHED;
> + INIT_LIST_HEAD(&gpusvm->notifier_list);
> +
> + init_rwsem(&gpusvm->notifier_lock);
> +
> + fs_reclaim_acquire(GFP_KERNEL);
> + might_lock(&gpusvm->notifier_lock);
> + fs_reclaim_release(GFP_KERNEL);
> +
> + return 0;
> +}
> +
> +/**
> + * drm_gpusvm_notifier_find - Find GPU SVM notifier
> + * @gpusvm__: Pointer to the GPU SVM structure
> + * @fault_addr__: Fault address
> + *
> + * This macro finds the GPU SVM notifier associated with the fault address.
> + *
> + * Returns:
> + * Pointer to the GPU SVM notifier on success, NULL otherwise.
> + */
> +#define drm_gpusvm_notifier_find(gpusvm__, fault_addr__) \
> + notifier_iter_first(&(gpusvm__)->root, (fault_addr__), \
> + (fault_addr__ + 1))
> +
> +/**
> + * to_drm_gpusvm_notifier - retrieve the container struct for a given rbtree node
> + * @node__: a pointer to the rbtree node embedded within a drm_gpusvm_notifier struct
> + *
> + * Return: A pointer to the containing drm_gpusvm_notifier structure.
> + */
> +#define to_drm_gpusvm_notifier(__node) \
> + container_of((__node), struct drm_gpusvm_notifier, rb.node)
> +
> +/**
> + * drm_gpusvm_notifier_insert - Insert GPU SVM notifier
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: Pointer to the GPU SVM notifier structure
> + *
> + * This function inserts the GPU SVM notifier into the GPU SVM RB tree and list.
> + */
> +static void drm_gpusvm_notifier_insert(struct drm_gpusvm *gpusvm,
> + struct drm_gpusvm_notifier *notifier)
> +{
> + struct rb_node *node;
> + struct list_head *head;
> +
> + notifier_insert(notifier, &gpusvm->root);
> +
> + node = rb_prev(&notifier->rb.node);
> + if (node)
> + head = &(to_drm_gpusvm_notifier(node))->rb.entry;
> + else
> + head = &gpusvm->notifier_list;
> +
> + list_add(&notifier->rb.entry, head);
> +}
> +
> +/**
> + * drm_gpusvm_notifier_remove - Remove GPU SVM notifier
> + * @gpusvm__: Pointer to the GPU SVM structure
> + * @notifier__: Pointer to the GPU SVM notifier structure
> + *
> + * This macro removes the GPU SVM notifier from the GPU SVM RB tree and list.
> + */
> +#define drm_gpusvm_notifier_remove(gpusvm__, notifier__) \
> + notifier_remove((notifier__), &(gpusvm__)->root); \
> + list_del(&(notifier__)->rb.entry)
> +
> +/**
> + * drm_gpusvm_fini - Finalize the GPU SVM.
> + * @gpusvm: Pointer to the GPU SVM structure.
> + *
> + * This function finalizes the GPU SVM by cleaning up any remaining ranges and
> + * notifiers, and dropping a reference to struct MM.
> + */
> +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm)
> +{
> + struct drm_gpusvm_notifier *notifier, *next;
> +
> + drm_gpusvm_for_each_notifier_safe(notifier, next, gpusvm, 0, LONG_MAX) {
> + struct drm_gpusvm_range *range, *__next;
> +
> + /*
> + * Remove notifier first to avoid racing with any invalidation
> + */
> + mmu_interval_notifier_remove(&notifier->notifier);
> + notifier->flags.removed = true;
> +
> + drm_gpusvm_for_each_range_safe(range, __next, notifier, 0,
> + LONG_MAX)
> + drm_gpusvm_range_remove(gpusvm, range);
> + }
> +
> + mmdrop(gpusvm->mm);
> + WARN_ON(!RB_EMPTY_ROOT(&gpusvm->root.rb_root));
> +}
> +
> +/**
> + * drm_gpusvm_notifier_alloc - Allocate GPU SVM notifier
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @fault_addr: Fault address
> + *
> + * This function allocates and initializes the GPU SVM notifier structure.
> + *
> + * Returns:
> + * Pointer to the allocated GPU SVM notifier on success, ERR_PTR() on failure.
> + */
> +static struct drm_gpusvm_notifier *
> +drm_gpusvm_notifier_alloc(struct drm_gpusvm *gpusvm, u64 fault_addr)
> +{
> + struct drm_gpusvm_notifier *notifier;
> +
> + if (gpusvm->ops->notifier_alloc)
> + notifier = gpusvm->ops->notifier_alloc();
> + else
> + notifier = kzalloc(sizeof(*notifier), GFP_KERNEL);
> +
> + if (!notifier)
> + return ERR_PTR(-ENOMEM);
> +
> + notifier->gpusvm = gpusvm;
> + notifier->interval.start = ALIGN_DOWN(fault_addr, gpusvm->notifier_size);
> + notifier->interval.end = ALIGN(fault_addr + 1, gpusvm->notifier_size);
> + INIT_LIST_HEAD(&notifier->rb.entry);
> + notifier->root = RB_ROOT_CACHED;
> + INIT_LIST_HEAD(&notifier->range_list);
> +
> + return notifier;
> +}
> +
> +/**
> + * drm_gpusvm_notifier_free - Free GPU SVM notifier
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: Pointer to the GPU SVM notifier structure
> + *
> + * This function frees the GPU SVM notifier structure.
> + */
> +static void drm_gpusvm_notifier_free(struct drm_gpusvm *gpusvm,
> + struct drm_gpusvm_notifier *notifier)
> +{
> + WARN_ON(!RB_EMPTY_ROOT(&notifier->root.rb_root));
> +
> + if (gpusvm->ops->notifier_free)
> + gpusvm->ops->notifier_free(notifier);
> + else
> + kfree(notifier);
> +}
> +
> +/**
> + * to_drm_gpusvm_range - retrieve the container struct for a given rbtree node
> + * @node__: a pointer to the rbtree node embedded within a drm_gpusvm_range struct
> + *
> + * Return: A pointer to the containing drm_gpusvm_range structure.
> + */
> +#define to_drm_gpusvm_range(node__) \
> + container_of((node__), struct drm_gpusvm_range, rb.node)
> +
> +/**
> + * drm_gpusvm_range_insert - Insert GPU SVM range
> + * @notifier: Pointer to the GPU SVM notifier structure
> + * @range: Pointer to the GPU SVM range structure
> + *
> + * This function inserts the GPU SVM range into the notifier RB tree and list.
> + */
> +static void drm_gpusvm_range_insert(struct drm_gpusvm_notifier *notifier,
> + struct drm_gpusvm_range *range)
> +{
> + struct rb_node *node;
> + struct list_head *head;
> +
> + drm_gpusvm_notifier_lock(notifier->gpusvm);
> + range_insert(range, &notifier->root);
> +
> + node = rb_prev(&range->rb.node);
> + if (node)
> + head = &(to_drm_gpusvm_range(node))->rb.entry;
> + else
> + head = &notifier->range_list;
> +
> + list_add(&range->rb.entry, head);
> + drm_gpusvm_notifier_unlock(notifier->gpusvm);
> +}
> +
> +/**
> + * __drm_gpusvm_range_remove - Remove GPU SVM range
> + * @notifier__: Pointer to the GPU SVM notifier structure
> + * @range__: Pointer to the GPU SVM range structure
> + *
> + * This macro removes the GPU SVM range from the notifier RB tree and list.
> + */
> +#define __drm_gpusvm_range_remove(notifier__, range__) \
> + range_remove((range__), &(notifier__)->root); \
> + list_del(&(range__)->rb.entry)
> +
> +/**
> + * drm_gpusvm_range_alloc - Allocate GPU SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: Pointer to the GPU SVM notifier structure
> + * @fault_addr: Fault address
> + * @chunk_size: Chunk size
> + * @migrate_devmem: Flag indicating whether to migrate device memory
> + *
> + * This function allocates and initializes the GPU SVM range structure.
> + *
> + * Returns:
> + * Pointer to the allocated GPU SVM range on success, ERR_PTR() on failure.
> + */
> +static struct drm_gpusvm_range *
> +drm_gpusvm_range_alloc(struct drm_gpusvm *gpusvm,
> + struct drm_gpusvm_notifier *notifier,
> + u64 fault_addr, u64 chunk_size, bool migrate_devmem)
> +{
> + struct drm_gpusvm_range *range;
> +
> + if (gpusvm->ops->range_alloc)
> + range = gpusvm->ops->range_alloc(gpusvm);
> + else
> + range = kzalloc(sizeof(*range), GFP_KERNEL);
> +
> + if (!range)
> + return ERR_PTR(-ENOMEM);
> +
> + kref_init(&range->refcount);
> + range->gpusvm = gpusvm;
> + range->notifier = notifier;
> + range->va.start = ALIGN_DOWN(fault_addr, chunk_size);
> + range->va.end = ALIGN(fault_addr + 1, chunk_size);
> + INIT_LIST_HEAD(&range->rb.entry);
> + range->notifier_seq = LONG_MAX;
> + range->flags.migrate_devmem = migrate_devmem ? 1 : 0;
> +
> + return range;
> +}
> +
> +/**
> + * drm_gpusvm_check_pages - Check pages
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: Pointer to the GPU SVM notifier structure
> + * @start: Start address
> + * @end: End address
> + *
> + * Check if pages between start and end have been faulted in on the CPU. Used to
> + * prevent migration of pages without a CPU backing store.
> + *
> + * Returns:
> + * True if pages have been faulted into CPU, False otherwise
> + */
> +static bool drm_gpusvm_check_pages(struct drm_gpusvm *gpusvm,
> + struct drm_gpusvm_notifier *notifier,
> + u64 start, u64 end)
> +{
> + struct hmm_range hmm_range = {
> + .default_flags = 0,
> + .notifier = &notifier->notifier,
> + .start = start,
> + .end = end,
> + .dev_private_owner = gpusvm->device_private_page_owner,
> + };
> + unsigned long timeout =
> + jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> + unsigned long *pfns;
> + unsigned long npages = npages_in_range(start, end);
> + int err, i;
> +
> + mmap_assert_locked(gpusvm->mm);
> +
> + pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
> + if (!pfns)
> + return false;
> +
> + hmm_range.notifier_seq = mmu_interval_read_begin(&notifier->notifier);
> + hmm_range.hmm_pfns = pfns;
> +
> + while (true) {
> + err = hmm_range_fault(&hmm_range);
> + if (err == -EBUSY) {
> + if (time_after(jiffies, timeout))
> + break;
> +
> + hmm_range.notifier_seq = mmu_interval_read_begin(&notifier->notifier);
> + continue;
> + }
> + break;
> + }
> + if (err)
> + goto err_free;
> +
> + for (i = 0; i < npages;) {
> + if (!(pfns[i] & HMM_PFN_VALID)) {
> + err = -EFAULT;
> + goto err_free;
> + }
> + i += 0x1 << hmm_pfn_to_map_order(pfns[i]);
> + }
> +
> +err_free:
> + kvfree(pfns);
> + return err ? false : true;
> +}
> +
> +/**
> + * drm_gpusvm_range_chunk_size - Determine chunk size for GPU SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: Pointer to the GPU SVM notifier structure
> + * @vas: Pointer to the virtual memory area structure
> + * @fault_addr: Fault address
> + * @gpuva_start: Start address of GPUVA which mirrors CPU
> + * @gpuva_end: End address of GPUVA which mirrors CPU
> + * @check_pages: Flag indicating whether to check pages
> + *
> + * This function determines the chunk size for the GPU SVM range based on the
> + * fault address, GPU SVM chunk sizes, existing GPU SVM ranges, and the virtual
> + * memory area boundaries.
> + *
> + * Returns:
> + * Chunk size on success, LONG_MAX on failure.
> + */
> +static u64 drm_gpusvm_range_chunk_size(struct drm_gpusvm *gpusvm,
> + struct drm_gpusvm_notifier *notifier,
> + struct vm_area_struct *vas,
> + u64 fault_addr, u64 gpuva_start,
> + u64 gpuva_end, bool check_pages)
> +{
> + u64 start, end;
> + int i = 0;
> +
> +retry:
> + for (; i < gpusvm->num_chunks; ++i) {
> + start = ALIGN_DOWN(fault_addr, gpusvm->chunk_sizes[i]);
> + end = ALIGN(fault_addr + 1, gpusvm->chunk_sizes[i]);
> +
> + if (start >= vas->vm_start && end <= vas->vm_end &&
> + start >= notifier->interval.start &&
> + end <= notifier->interval.end &&
> + start >= gpuva_start && end <= gpuva_end)
> + break;
> + }
> +
> + if (i == gpusvm->num_chunks)
> + return LONG_MAX;
> +
> + /*
> + * If the allocation is more than a page, ensure it does not overlap with
> + * existing ranges.
> + */
> + if (end - start != SZ_4K) {
> + struct drm_gpusvm_range *range;
> +
> + range = drm_gpusvm_range_find(notifier, start, end);
> + if (range) {
> + ++i;
> + goto retry;
> + }
> +
> + /*
> + * XXX: Only create range on pages CPU has faulted in. Without
> + * this check, or prefault, on BMG 'xe_exec_system_allocator --r
> + * process-many-malloc' fails. In the failure case, each process
> + * mallocs 16k but the CPU VMA is ~128k which results in 64k SVM
> + * ranges. When migrating the SVM ranges, some processes fail in
> + * drm_gpusvm_migrate_to_devmem with 'migrate.cpages != npages'
> + * and then upon drm_gpusvm_range_get_pages device pages from
> + * other processes are collected + faulted in which creates all
> + * sorts of problems. Unsure exactly how this is happening; the problem
> + * also goes away if 'xe_exec_system_allocator --r
> + * process-many-malloc' mallocs at least 64k at a time.
> + */
> + if (check_pages &&
> + !drm_gpusvm_check_pages(gpusvm, notifier, start, end)) {
> + ++i;
> + goto retry;
> + }
> + }
> +
> + return end - start;
> +}
> +
> +/**
> + * drm_gpusvm_range_find_or_insert - Find or insert GPU SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @fault_addr: Fault address
> + * @gpuva_start: Start address of GPUVA which mirrors CPU
> + * @gpuva_end: End address of GPUVA which mirrors CPU
> + * @ctx: GPU SVM context
> + *
> + * This function finds or inserts a newly allocated GPU SVM range based on the
> + * fault address. Caller must hold a lock to protect range lookup and insertion.
> + *
> + * Returns:
> + * Pointer to the GPU SVM range on success, ERR_PTR() on failure.
> + */
> +struct drm_gpusvm_range *
> +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64 fault_addr,
> + u64 gpuva_start, u64 gpuva_end,
> + const struct drm_gpusvm_ctx *ctx)
> +{
> + struct drm_gpusvm_notifier *notifier;
> + struct drm_gpusvm_range *range;
> + struct mm_struct *mm = gpusvm->mm;
> + struct vm_area_struct *vas;
> + bool notifier_alloc = false;
> + u64 chunk_size;
> + int err;
> + bool migrate_devmem;
> +
> + if (fault_addr < gpusvm->mm_start ||
> + fault_addr > gpusvm->mm_start + gpusvm->mm_range) {
> + err = -EINVAL;
> + goto err_out;
> + }
> +
> + if (!mmget_not_zero(mm)) {
> + err = -EFAULT;
> + goto err_out;
> + }
> +
> + notifier = drm_gpusvm_notifier_find(gpusvm, fault_addr);
> + if (!notifier) {
> + notifier = drm_gpusvm_notifier_alloc(gpusvm, fault_addr);
> + if (IS_ERR(notifier)) {
> + err = PTR_ERR(notifier);
> + goto err_mmunlock;
> + }
> + notifier_alloc = true;
> + err = mmu_interval_notifier_insert(&notifier->notifier,
> + mm, notifier->interval.start,
> + notifier->interval.end -
> + notifier->interval.start,
> + &drm_gpusvm_notifier_ops);
> + if (err)
> + goto err_notifier;
> + }
> +
> + mmap_read_lock(mm);
> +
> + vas = vma_lookup(mm, fault_addr);
> + if (!vas) {
> + err = -ENOENT;
> + goto err_notifier_remove;
> + }
> +
> + if (!ctx->read_only && !(vas->vm_flags & VM_WRITE)) {
> + err = -EPERM;
> + goto err_notifier_remove;
> + }
> +
> + range = drm_gpusvm_range_find(notifier, fault_addr, fault_addr + 1);
> + if (range)
> + goto out_mmunlock;
> + /*
> + * XXX: Short-circuiting migration based on migrate_vma_* current
> + * limitations. If/when migrate_vma_* add more support, this logic will
> + * have to change.
> + */
> + migrate_devmem = ctx->devmem_possible &&
> + vma_is_anonymous(vas) && !is_vm_hugetlb_page(vas);
> +
> + chunk_size = drm_gpusvm_range_chunk_size(gpusvm, notifier, vas,
> + fault_addr, gpuva_start,
> + gpuva_end, migrate_devmem &&
> + ctx->check_pages);
> + if (chunk_size == LONG_MAX) {
> + err = -EINVAL;
> + goto err_notifier_remove;
> + }
> +
> + range = drm_gpusvm_range_alloc(gpusvm, notifier, fault_addr, chunk_size,
> + migrate_devmem);
> + if (IS_ERR(range)) {
> + err = PTR_ERR(range);
> + goto err_notifier_remove;
> + }
> +
> + drm_gpusvm_range_insert(notifier, range);
> + if (notifier_alloc)
> + drm_gpusvm_notifier_insert(gpusvm, notifier);
> +
> +out_mmunlock:
> + mmap_read_unlock(mm);
> + mmput(mm);
> +
> + return range;
> +
> +err_notifier_remove:
> + mmap_read_unlock(mm);
> + if (notifier_alloc)
> + mmu_interval_notifier_remove(&notifier->notifier);
> +err_notifier:
> + if (notifier_alloc)
> + drm_gpusvm_notifier_free(gpusvm, notifier);
> +err_mmunlock:
> + mmput(mm);
> +err_out:
> + return ERR_PTR(err);
> +}
> +
> +/**
> + * __drm_gpusvm_range_unmap_pages - Unmap pages associated with a GPU SVM range (internal)
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + * @npages: Number of pages to unmap
> + *
> + * This function unmaps pages associated with a GPU SVM range. Assumes and
> + * asserts correct locking is in place when called.
> + */
> +static void __drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> + struct drm_gpusvm_range *range,
> + unsigned long npages)
> +{
> + unsigned long i, j;
> + struct drm_pagemap *dpagemap = range->dpagemap;
> + struct device *dev = gpusvm->drm->dev;
> +
> + lockdep_assert_held(&gpusvm->notifier_lock);
> +
> + if (range->flags.has_dma_mapping) {
> + for (i = 0, j = 0; i < npages; j++) {
> + struct drm_pagemap_dma_addr *addr = &range->dma_addr[j];
> +
> + if (addr->proto == DRM_INTERCONNECT_SYSTEM) {
> + dma_unmap_page(dev,
> + addr->addr,
> + PAGE_SIZE << addr->order,
> + addr->dir);
> + } else if (dpagemap && dpagemap->ops->unmap_dma) {
> + dpagemap->ops->unmap_dma(dpagemap,
> + dev,
> + *addr);
> + }
> + i += 1 << addr->order;
> + }
> + range->flags.has_devmem_pages = false;
> + range->flags.has_dma_mapping = false;
> + range->dpagemap = NULL;
> + }
> +}
> +
> +/**
> + * drm_gpusvm_range_free_pages - Free pages associated with a GPU SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + *
> + * This function frees pages associated with a GPU SVM range.
> + */
> +static void drm_gpusvm_range_free_pages(struct drm_gpusvm *gpusvm,
> + struct drm_gpusvm_range *range)
> +{
> + lockdep_assert_held(&gpusvm->notifier_lock);
> +
> + if (range->dma_addr) {
> + kvfree(range->dma_addr);
> + range->dma_addr = NULL;
> + }
> +}
> +
> +/**
> + * drm_gpusvm_range_remove - Remove GPU SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range to be removed
> + *
> + * This function removes the specified GPU SVM range and also removes the parent
> + * GPU SVM notifier if no more ranges remain in the notifier. The caller must
> + * hold a lock to protect range and notifier removal.
> + */
> +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> + struct drm_gpusvm_range *range)
> +{
> + unsigned long npages = npages_in_range(range->va.start, range->va.end);
> + struct drm_gpusvm_notifier *notifier;
> +
> + notifier = drm_gpusvm_notifier_find(gpusvm, range->va.start);
> + if (WARN_ON_ONCE(!notifier))
> + return;
> +
> + drm_gpusvm_notifier_lock(gpusvm);
> + __drm_gpusvm_range_unmap_pages(gpusvm, range, npages);
> + drm_gpusvm_range_free_pages(gpusvm, range);
> + __drm_gpusvm_range_remove(notifier, range);
> + drm_gpusvm_notifier_unlock(gpusvm);
> +
> + drm_gpusvm_range_put(range);
> +
> + if (RB_EMPTY_ROOT(&notifier->root.rb_root)) {
> + if (!notifier->flags.removed)
> + mmu_interval_notifier_remove(&notifier->notifier);
> + drm_gpusvm_notifier_remove(gpusvm, notifier);
> + drm_gpusvm_notifier_free(gpusvm, notifier);
> + }
> +}
> +
> +/**
> + * drm_gpusvm_range_get - Get a reference to GPU SVM range
> + * @range: Pointer to the GPU SVM range
> + *
> + * This function increments the reference count of the specified GPU SVM range.
> + *
> + * Returns:
> + * Pointer to the GPU SVM range.
> + */
> +struct drm_gpusvm_range *
> +drm_gpusvm_range_get(struct drm_gpusvm_range *range)
> +{
> + kref_get(&range->refcount);
> +
> + return range;
> +}
> +
> +/**
> + * drm_gpusvm_range_destroy - Destroy GPU SVM range
> + * @refcount: Pointer to the reference counter embedded in the GPU SVM range
> + *
> + * This function destroys the specified GPU SVM range when its reference count
> + * reaches zero. If a custom range-free function is provided, it is invoked to
> + * free the range; otherwise, the range is deallocated using kfree().
> + */
> +static void drm_gpusvm_range_destroy(struct kref *refcount)
> +{
> + struct drm_gpusvm_range *range =
> + container_of(refcount, struct drm_gpusvm_range, refcount);
> + struct drm_gpusvm *gpusvm = range->gpusvm;
> +
> + if (gpusvm->ops->range_free)
> + gpusvm->ops->range_free(range);
> + else
> + kfree(range);
> +}
> +
> +/**
> + * drm_gpusvm_range_put - Put a reference to GPU SVM range
> + * @range: Pointer to the GPU SVM range
> + *
> + * This function decrements the reference count of the specified GPU SVM range
> + * and frees it when the count reaches zero.
> + */
> +void drm_gpusvm_range_put(struct drm_gpusvm_range *range)
> +{
> + kref_put(&range->refcount, drm_gpusvm_range_destroy);
> +}
> +
> +/**
> + * drm_gpusvm_range_pages_valid - GPU SVM range pages valid
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + *
> + * This function determines if a GPU SVM range's pages are valid. Expected to be
> + * called holding gpusvm->notifier_lock and as the last step before committing a
> + * GPU binding.
> + *
> + * Returns:
> + * True if GPU SVM range has valid pages, False otherwise
> + */
> +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> + struct drm_gpusvm_range *range)
> +{
> + lockdep_assert_held(&gpusvm->notifier_lock);
> +
> + return range->flags.has_devmem_pages || range->flags.has_dma_mapping;
> +}
> +
> +/**
> + * drm_gpusvm_range_pages_valid_unlocked - GPU SVM range pages valid unlocked
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + *
> + * This function determines if a GPU SVM range's pages are valid. Expected to be
> + * called without holding gpusvm->notifier_lock.
> + *
> + * Returns:
> + * True if GPU SVM range has valid pages, False otherwise
> + */
> +static bool
> +drm_gpusvm_range_pages_valid_unlocked(struct drm_gpusvm *gpusvm,
> + struct drm_gpusvm_range *range)
> +{
> + bool pages_valid;
> +
> + if (!range->dma_addr)
> + return false;
> +
> + drm_gpusvm_notifier_lock(gpusvm);
> + pages_valid = drm_gpusvm_range_pages_valid(gpusvm, range);
> + if (!pages_valid)
> + drm_gpusvm_range_free_pages(gpusvm, range);
> + drm_gpusvm_notifier_unlock(gpusvm);
> +
> + return pages_valid;
> +}
> +
> +/**
> + * drm_gpusvm_range_get_pages - Get pages for a GPU SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + * @ctx: GPU SVM context
> + *
> + * This function gets pages for a GPU SVM range and ensures they are mapped for
> + * DMA access.
> + *
> + * Returns:
> + * 0 on success, negative error code on failure.
> + */
> +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> + struct drm_gpusvm_range *range,
> + const struct drm_gpusvm_ctx *ctx)
> +{
> + struct mmu_interval_notifier *notifier = &range->notifier->notifier;
> + struct hmm_range hmm_range = {
> + .default_flags = HMM_PFN_REQ_FAULT | (ctx->read_only ? 0 :
> + HMM_PFN_REQ_WRITE),
> + .notifier = notifier,
> + .start = range->va.start,
> + .end = range->va.end,
> + .dev_private_owner = gpusvm->device_private_page_owner,
> + };
> + struct mm_struct *mm = gpusvm->mm;
> + struct drm_gpusvm_zdd *zdd;
> + unsigned long timeout =
> + jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> + unsigned long i, j;
> + unsigned long npages = npages_in_range(range->va.start, range->va.end);
> + unsigned long num_dma_mapped;
> + unsigned int order = 0;
> + unsigned long *pfns;
> + struct page **pages;
> + int err = 0;
> + struct dev_pagemap *pagemap;
> + struct drm_pagemap *dpagemap;
> +
> +retry:
> + hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
> + if (drm_gpusvm_range_pages_valid_unlocked(gpusvm, range))
> + goto set_seqno;
> +
> + pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
> + if (!pfns)
> + return -ENOMEM;
> +
> + if (!mmget_not_zero(mm)) {
> + err = -EFAULT;
> + goto err_out;
> + }
> +
> + hmm_range.hmm_pfns = pfns;
> + while (true) {
> + mmap_read_lock(mm);
> + err = hmm_range_fault(&hmm_range);
> + mmap_read_unlock(mm);
> +
> + if (err == -EBUSY) {
> + if (time_after(jiffies, timeout))
> + break;
> +
> + hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
> + continue;
> + }
> + break;
> + }
> + mmput(mm);
> + if (err)
> + goto err_free;
> +
> + pages = (struct page **)pfns;
> +map_pages:
> + /*
> + * Perform all dma mappings under the notifier lock to not
> + * access freed pages. A notifier will either block on
> + * the notifier lock or unmap dma.
> + */
> + drm_gpusvm_notifier_lock(gpusvm);
> + if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
> + drm_gpusvm_notifier_unlock(gpusvm);
> + goto retry;
> + }
> +
> + if (!range->dma_addr) {
> + /* Unlock and restart mapping to allocate memory. */
> + drm_gpusvm_notifier_unlock(gpusvm);
> + range->dma_addr = kvmalloc_array(npages, sizeof(*range->dma_addr),
> + GFP_KERNEL);
> + if (!range->dma_addr) {
> + err = -ENOMEM;
> + goto err_free;
> + }
> + goto map_pages;
> + }
> +
> + zdd = NULL;
> + num_dma_mapped = 0;
> + for (i = 0, j = 0; i < npages; ++j) {
> + struct page *page = hmm_pfn_to_page(pfns[i]);
> +
> + order = hmm_pfn_to_map_order(pfns[i]);
> + if (is_device_private_page(page) || is_device_coherent_page(page)) {
> + if (zdd != page->zone_device_data && i > 0) {
> + err = -EOPNOTSUPP;
> + goto err_unmap;
> + }
> + zdd = page->zone_device_data;
> + if (pagemap != page->pgmap) {
> + if (i > 0) {
> + err = -EOPNOTSUPP;
> + goto err_unmap;
> + }
> +
> + pagemap = page->pgmap;
> + dpagemap = zdd->devmem_allocation->dpagemap;
> + if (drm_WARN_ON(gpusvm->drm, !dpagemap)) {
> + /*
> + * Raced. This is not supposed to happen
> + * since hmm_range_fault() should've migrated
> + * this page to system.
> + */
> + err = -EAGAIN;
> + goto err_unmap;
> + }
> + }
> + range->dma_addr[j] =
> + dpagemap->ops->map_dma(dpagemap, gpusvm->drm->dev,
> + page, order,
> + DMA_BIDIRECTIONAL);
> + if (dma_mapping_error(gpusvm->drm->dev, range->dma_addr[j].addr)) {
> + err = -EFAULT;
> + goto err_unmap;
> + }
> +
> + pages[i] = page;
> + } else {
> + dma_addr_t addr;
> +
> + if (is_zone_device_page(page) || zdd) {
> + err = -EOPNOTSUPP;
> + goto err_unmap;
> + }
> +
> + addr = dma_map_page(gpusvm->drm->dev,
> + page, 0,
> + PAGE_SIZE << order,
> + DMA_BIDIRECTIONAL);
> + if (dma_mapping_error(gpusvm->drm->dev, addr)) {
> + err = -EFAULT;
> + goto err_unmap;
> + }
> +
> + range->dma_addr[j] = drm_pagemap_dma_addr_encode
> + (addr, DRM_INTERCONNECT_SYSTEM, order,
> + DMA_BIDIRECTIONAL);
> + }
> + i += 1 << order;
> + num_dma_mapped = i;
> + }
> +
> + range->flags.has_dma_mapping = true;
> + if (zdd) {
> + range->flags.has_devmem_pages = true;
> + range->dpagemap = dpagemap;
> + }
> +
> + drm_gpusvm_notifier_unlock(gpusvm);
> + kvfree(pfns);
> +set_seqno:
> + range->notifier_seq = hmm_range.notifier_seq;
> +
> + return 0;
> +
> +err_unmap:
> + __drm_gpusvm_range_unmap_pages(gpusvm, range, num_dma_mapped);
> + drm_gpusvm_notifier_unlock(gpusvm);
> +err_free:
> + kvfree(pfns);
> +err_out:
> + if (err == -EAGAIN)
> + goto retry;
> + return err;
> +}
> +
> +/**
> + * drm_gpusvm_range_unmap_pages - Unmap pages associated with a GPU SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + * @ctx: GPU SVM context
> + *
> + * This function unmaps pages associated with a GPU SVM range. If @in_notifier
> + * is set, it is assumed that gpusvm->notifier_lock is held in write mode; if it
> + * is clear, it acquires gpusvm->notifier_lock in read mode. Must be called on
> + * each GPU SVM range attached to notifier in gpusvm->ops->invalidate for IOMMU
> + * security model.
> + */
> +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> + struct drm_gpusvm_range *range,
> + const struct drm_gpusvm_ctx *ctx)
> +{
> + unsigned long npages = npages_in_range(range->va.start, range->va.end);
> +
> + if (ctx->in_notifier)
> + lockdep_assert_held_write(&gpusvm->notifier_lock);
> + else
> + drm_gpusvm_notifier_lock(gpusvm);
> +
> + __drm_gpusvm_range_unmap_pages(gpusvm, range, npages);
> +
> + if (!ctx->in_notifier)
> + drm_gpusvm_notifier_unlock(gpusvm);
> +}
> +
> +/**
> + * drm_gpusvm_migration_put_page - Put a migration page
> + * @page: Pointer to the page to put
> + *
> + * This function unlocks and puts a page.
> + */
> +static void drm_gpusvm_migration_put_page(struct page *page)
> +{
> + unlock_page(page);
> + put_page(page);
> +}
> +
> +/**
> + * drm_gpusvm_migration_put_pages - Put migration pages
> + * @npages: Number of pages
> + * @migrate_pfn: Array of migrate page frame numbers
> + *
> + * This function puts an array of pages.
> + */
> +static void drm_gpusvm_migration_put_pages(unsigned long npages,
> + unsigned long *migrate_pfn)
> +{
> + unsigned long i;
> +
> + for (i = 0; i < npages; ++i) {
> + if (!migrate_pfn[i])
> + continue;
> +
> + drm_gpusvm_migration_put_page(migrate_pfn_to_page(migrate_pfn[i]));
> + migrate_pfn[i] = 0;
> + }
> +}
> +
> +/**
> + * drm_gpusvm_get_devmem_page - Get a reference to a device memory page
> + * @page: Pointer to the page
> + * @zdd: Pointer to the GPU SVM zone device data
> + *
> + * This function associates the given page with the specified GPU SVM zone
> + * device data and initializes it for zone device usage.
> + */
> +static void drm_gpusvm_get_devmem_page(struct page *page,
> + struct drm_gpusvm_zdd *zdd)
> +{
> + page->zone_device_data = drm_gpusvm_zdd_get(zdd);
> + zone_device_page_init(page);
> +}
> +
> +/**
> + * drm_gpusvm_migrate_map_pages() - Map migration pages for GPU SVM migration
> + * @dev: The device for which the pages are being mapped
> + * @dma_addr: Array to store DMA addresses corresponding to mapped pages
> + * @migrate_pfn: Array of migrate page frame numbers to map
> + * @npages: Number of pages to map
> + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> + *
> + * This function maps pages of memory for migration usage in GPU SVM. It
> + * iterates over each page frame number provided in @migrate_pfn, maps the
> + * corresponding page, and stores the DMA address in the provided @dma_addr
> + * array.
> + *
> + * Return: 0 on success, -EFAULT if an error occurs during mapping.
> + */
> +static int drm_gpusvm_migrate_map_pages(struct device *dev,
> + dma_addr_t *dma_addr,
> + long unsigned int *migrate_pfn,
> + unsigned long npages,
> + enum dma_data_direction dir)
> +{
> + unsigned long i;
> +
> + for (i = 0; i < npages; ++i) {
> + struct page *page = migrate_pfn_to_page(migrate_pfn[i]);
> +
> + if (!page)
> + continue;
> +
> + if (WARN_ON_ONCE(is_zone_device_page(page)))
> + return -EFAULT;
> +
> + dma_addr[i] = dma_map_page(dev, page, 0, PAGE_SIZE, dir);
> + if (dma_mapping_error(dev, dma_addr[i]))
> + return -EFAULT;
> + }
> +
> + return 0;
> +}
> +
> +/**
> + * drm_gpusvm_migrate_unmap_pages() - Unmap pages previously mapped for GPU SVM migration
> + * @dev: The device for which the pages were mapped
> + * @dma_addr: Array of DMA addresses corresponding to mapped pages
> + * @npages: Number of pages to unmap
> + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> + *
> + * This function unmaps previously mapped pages of memory for GPU Shared Virtual
> + * Memory (SVM). It iterates over each DMA address provided in @dma_addr, checks
> + * if it's valid and not already unmapped, and unmaps the corresponding page.
> + */
> +static void drm_gpusvm_migrate_unmap_pages(struct device *dev,
> + dma_addr_t *dma_addr,
> + unsigned long npages,
> + enum dma_data_direction dir)
> +{
> + unsigned long i;
> +
> + for (i = 0; i < npages; ++i) {
> + if (!dma_addr[i] || dma_mapping_error(dev, dma_addr[i]))
> + continue;
> +
> + dma_unmap_page(dev, dma_addr[i], PAGE_SIZE, dir);
> + }
> +}
> +
> +/**
> + * drm_gpusvm_migrate_to_devmem - Migrate GPU SVM range to device memory
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range structure
> + * @devmem_allocation: Pointer to the device memory allocation. The caller
> + * should hold a reference to the device memory allocation,
> + * which should be dropped via ops->devmem_release or upon
> + * the failure of this function.
> + * @ctx: GPU SVM context
> + *
> + * This function migrates the specified GPU SVM range to device memory. It performs the
> + * necessary setup and invokes the driver-specific operations for migration to
> + * device memory. Upon successful return, @devmem_allocation can safely reference @range
> + * until ops->devmem_release is called, which only happens upon successful return.
> + *
> + * Returns:
> + * 0 on success, negative error code on failure.
> + */
> +int drm_gpusvm_migrate_to_devmem(struct drm_gpusvm *gpusvm,
> + struct drm_gpusvm_range *range,
> + struct drm_gpusvm_devmem *devmem_allocation,
> + const struct drm_gpusvm_ctx *ctx)
> +{
> + const struct drm_gpusvm_devmem_ops *ops = devmem_allocation->ops;
> + u64 start = range->va.start, end = range->va.end;
> + struct migrate_vma migrate = {
> + .start = start,
> + .end = end,
> + .pgmap_owner = gpusvm->device_private_page_owner,
> + .flags = MIGRATE_VMA_SELECT_SYSTEM,
> + };
> + struct mm_struct *mm = gpusvm->mm;
> + unsigned long i, npages = npages_in_range(start, end);
> + struct vm_area_struct *vas;
> + struct drm_gpusvm_zdd *zdd = NULL;
> + struct page **pages;
> + dma_addr_t *dma_addr;
> + void *buf;
> + int err;
> +
> + if (!range->flags.migrate_devmem)
> + return -EINVAL;
> +
> + if (!ops->populate_devmem_pfn || !ops->copy_to_devmem || !ops->copy_to_ram)
> + return -EOPNOTSUPP;
> +
> + if (!mmget_not_zero(mm)) {
> + err = -EFAULT;
> + goto err_out;
> + }
> + mmap_read_lock(mm);
> +
> + vas = vma_lookup(mm, start);
> + if (!vas) {
> + err = -ENOENT;
> + goto err_mmunlock;
> + }
> +
> + if (end > vas->vm_end || start < vas->vm_start) {
> + err = -EINVAL;
> + goto err_mmunlock;
> + }
> +
> + if (!vma_is_anonymous(vas)) {
> + err = -EBUSY;
> + goto err_mmunlock;
> + }
> +
> + buf = kvcalloc(npages, 2 * sizeof(*migrate.src) + sizeof(*dma_addr) +
> + sizeof(*pages), GFP_KERNEL);
> + if (!buf) {
> + err = -ENOMEM;
> + goto err_mmunlock;
> + }
> + dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> + pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr)) * npages;
> +
> + zdd = drm_gpusvm_zdd_alloc(gpusvm->device_private_page_owner);
> + if (!zdd) {
> + err = -ENOMEM;
> + goto err_free;
> + }
> +
> + migrate.vma = vas;
> + migrate.src = buf;
> + migrate.dst = migrate.src + npages;
> +
> + err = migrate_vma_setup(&migrate);
> + if (err)
> + goto err_free;
> +
> + /*
> + * FIXME: Below cases, !migrate.cpages and migrate.cpages != npages, not
> + * always an error. Need to revisit possible cases and how to handle. We
> + * could prefault on migrate.cpages != npages via hmm_range_fault.
> + */
> +
> + if (!migrate.cpages) {
> + err = -EFAULT;
> + goto err_free;
> + }
> +
> + if (migrate.cpages != npages) {
> + err = -EBUSY;
> + goto err_finalize;
> + }
> +
> + err = ops->populate_devmem_pfn(devmem_allocation, npages, migrate.dst);
> + if (err)
> + goto err_finalize;
> +
> + err = drm_gpusvm_migrate_map_pages(devmem_allocation->dev, dma_addr,
> + migrate.src, npages, DMA_TO_DEVICE);
> + if (err)
> + goto err_finalize;
> +
> + for (i = 0; i < npages; ++i) {
> + struct page *page = pfn_to_page(migrate.dst[i]);
> +
> + pages[i] = page;
> + migrate.dst[i] = migrate_pfn(migrate.dst[i]);
> + drm_gpusvm_get_devmem_page(page, zdd);
> + }
> +
> + err = ops->copy_to_devmem(pages, dma_addr, npages);
> + if (err)
> + goto err_finalize;
> +
> + /* Upon success bind devmem allocation to range and zdd */
> + WRITE_ONCE(zdd->devmem_allocation, devmem_allocation); /* Owns ref */
> +
> +err_finalize:
> + if (err)
> + drm_gpusvm_migration_put_pages(npages, migrate.dst);
> + migrate_vma_pages(&migrate);
> + migrate_vma_finalize(&migrate);
> + drm_gpusvm_migrate_unmap_pages(devmem_allocation->dev, dma_addr, npages,
> + DMA_TO_DEVICE);
> +err_free:
> + if (zdd)
> + drm_gpusvm_zdd_put(zdd);
> + kvfree(buf);
> +err_mmunlock:
> + mmap_read_unlock(mm);
> + mmput(mm);
> +err_out:
> + return err;
> +}
> +
> +/**
> + * drm_gpusvm_migrate_populate_ram_pfn - Populate RAM PFNs for a VM area
> + * @vas: Pointer to the VM area structure, can be NULL
> + * @npages: Number of pages to populate
> + * @mpages: Number of pages to migrate
> + * @src_mpfn: Source array of migrate PFNs
> + * @mpfn: Array of migrate PFNs to populate
> + * @addr: Start address for PFN allocation
> + *
> + * This function populates the RAM migrate page frame numbers (PFNs) for the
> + * specified VM area structure. It allocates and locks pages in the VM area for
> + * RAM usage. If @vas is non-NULL, alloc_page_vma() is used for allocation;
> + * otherwise alloc_page() is used.
> + *
> + * Returns:
> + * 0 on success, negative error code on failure.
> + */
> +static int drm_gpusvm_migrate_populate_ram_pfn(struct vm_area_struct *vas,
> + unsigned long npages,
> + unsigned long *mpages,
> + unsigned long *src_mpfn,
> + unsigned long *mpfn, u64 addr)
> +{
> + unsigned long i;
> +
> + for (i = 0; i < npages; ++i, addr += PAGE_SIZE) {
> + struct page *page;
> +
> + if (!(src_mpfn[i] & MIGRATE_PFN_MIGRATE))
> + continue;
> +
> + if (vas)
> + page = alloc_page_vma(GFP_HIGHUSER, vas, addr);
> + else
> + page = alloc_page(GFP_HIGHUSER);
> +
> + if (!page)
> + return -ENOMEM;
> +
> + lock_page(page);
> + mpfn[i] = migrate_pfn(page_to_pfn(page));
> + ++*mpages;
> + }
> +
> + return 0;
> +}
> +
> +/**
> + * drm_gpusvm_evict_to_ram - Evict GPU SVM range to RAM
> + * @devmem_allocation: Pointer to the device memory allocation
> + *
> + * Similar to __drm_gpusvm_migrate_to_ram but does not require the mmap lock;
> + * migration is done via the migrate_device_* functions.
> + *
> + * Returns:
> + * 0 on success, negative error code on failure.
> + */
> +int drm_gpusvm_evict_to_ram(struct drm_gpusvm_devmem *devmem_allocation)
> +{
> + const struct drm_gpusvm_devmem_ops *ops = devmem_allocation->ops;
> + unsigned long npages, mpages = 0;
> + struct page **pages;
> + unsigned long *src, *dst;
> + dma_addr_t *dma_addr;
> + void *buf;
> + int i, err = 0;
> +
> + npages = devmem_allocation->size >> PAGE_SHIFT;
> +
> +retry:
> + if (!mmget_not_zero(devmem_allocation->mm))
> + return -EFAULT;
> +
> + buf = kvcalloc(npages, 2 * sizeof(*src) + sizeof(*dma_addr) +
> + sizeof(*pages), GFP_KERNEL);
> + if (!buf) {
> + err = -ENOMEM;
> + goto err_out;
> + }
> + src = buf;
> + dst = buf + (sizeof(*src) * npages);
> + dma_addr = buf + (2 * sizeof(*src) * npages);
> + pages = buf + (2 * sizeof(*src) + sizeof(*dma_addr)) * npages;
> +
> + err = ops->populate_devmem_pfn(devmem_allocation, npages, src);
> + if (err)
> + goto err_free;
> +
> + err = migrate_device_prepopulated_range(src, npages);
> + if (err)
> + goto err_free;
> +
> + err = drm_gpusvm_migrate_populate_ram_pfn(NULL, npages, &mpages, src,
> + dst, 0);
> + if (err || !mpages)
> + goto err_finalize;
> +
> + err = drm_gpusvm_migrate_map_pages(devmem_allocation->dev, dma_addr,
> + dst, npages, DMA_FROM_DEVICE);
> + if (err)
> + goto err_finalize;
> +
> + for (i = 0; i < npages; ++i)
> + pages[i] = migrate_pfn_to_page(src[i]);
> +
> + err = ops->copy_to_ram(pages, dma_addr, npages);
> + if (err)
> + goto err_finalize;
> +
> +err_finalize:
> + if (err)
> + drm_gpusvm_migration_put_pages(npages, dst);
> + migrate_device_pages(src, dst, npages);
> + migrate_device_finalize(src, dst, npages);
> + drm_gpusvm_migrate_unmap_pages(devmem_allocation->dev, dma_addr, npages,
> + DMA_FROM_DEVICE);
> +err_free:
> + kvfree(buf);
> +err_out:
> + mmput_async(devmem_allocation->mm);
> + if (!err && !READ_ONCE(devmem_allocation->detached)) {
> + cond_resched();
> + goto retry;
> + }
> +
> + return err;
> +}
> +
> +/**
> + * __drm_gpusvm_migrate_to_ram - Migrate GPU SVM range to RAM (internal)
> + * @vas: Pointer to the VM area structure
> + * @device_private_page_owner: Device private pages owner
> + * @page: Pointer to the page for fault handling (can be NULL)
> + * @fault_addr: Fault address
> + * @size: Size of migration
> + *
> + * This internal function performs the migration of the specified GPU SVM range
> + * to RAM. It sets up the migration, populates + dma maps RAM PFNs, and
> + * invokes the driver-specific operations for migration to RAM.
> + *
> + * Returns:
> + * 0 on success, negative error code on failure.
> + */
> +static int __drm_gpusvm_migrate_to_ram(struct vm_area_struct *vas,
> + void *device_private_page_owner,
> + struct page *page, u64 fault_addr,
> + u64 size)
> +{
> + struct migrate_vma migrate = {
> + .vma = vas,
> + .pgmap_owner = device_private_page_owner,
> + .flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE |
> + MIGRATE_VMA_SELECT_DEVICE_COHERENT,
> + .fault_page = page,
> + };
> + struct drm_gpusvm_zdd *zdd;
> + const struct drm_gpusvm_devmem_ops *ops;
> + struct device *dev;
> + unsigned long npages, mpages = 0;
> + struct page **pages;
> + dma_addr_t *dma_addr;
> + u64 start, end;
> + void *buf;
> + int i, err = 0;
> +
> + start = ALIGN_DOWN(fault_addr, size);
> + end = ALIGN(fault_addr + 1, size);
> +
> +	/* Corner case where the VM area struct has been partially unmapped */
> + if (start < vas->vm_start)
> + start = vas->vm_start;
> + if (end > vas->vm_end)
> + end = vas->vm_end;
> +
> + migrate.start = start;
> + migrate.end = end;
> + npages = npages_in_range(start, end);
> +
> + buf = kvcalloc(npages, 2 * sizeof(*migrate.src) + sizeof(*dma_addr) +
> + sizeof(*pages), GFP_KERNEL);
> + if (!buf) {
> + err = -ENOMEM;
> + goto err_out;
> + }
> + dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> + pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr)) * npages;
> +
> + migrate.vma = vas;
> + migrate.src = buf;
> + migrate.dst = migrate.src + npages;
> +
> + err = migrate_vma_setup(&migrate);
> + if (err)
> + goto err_free;
> +
> + /* Raced with another CPU fault, nothing to do */
> + if (!migrate.cpages)
> + goto err_free;
> +
> + if (!page) {
> + for (i = 0; i < npages; ++i) {
> + if (!(migrate.src[i] & MIGRATE_PFN_MIGRATE))
> + continue;
> +
> + page = migrate_pfn_to_page(migrate.src[i]);
> + break;
> + }
> +
> + if (!page)
> + goto err_finalize;
> + }
> + zdd = page->zone_device_data;
> + ops = zdd->devmem_allocation->ops;
> + dev = zdd->devmem_allocation->dev;
> +
> + err = drm_gpusvm_migrate_populate_ram_pfn(vas, npages, &mpages,
> + migrate.src, migrate.dst,
> + start);
> + if (err)
> + goto err_finalize;
> +
> + err = drm_gpusvm_migrate_map_pages(dev, dma_addr, migrate.dst, npages,
> + DMA_FROM_DEVICE);
> + if (err)
> + goto err_finalize;
> +
> + for (i = 0; i < npages; ++i)
> + pages[i] = migrate_pfn_to_page(migrate.src[i]);
> +
> + err = ops->copy_to_ram(pages, dma_addr, npages);
> + if (err)
> + goto err_finalize;
> +
> +err_finalize:
> + if (err)
> + drm_gpusvm_migration_put_pages(npages, migrate.dst);
> + migrate_vma_pages(&migrate);
> + migrate_vma_finalize(&migrate);
> + drm_gpusvm_migrate_unmap_pages(dev, dma_addr, npages,
> + DMA_FROM_DEVICE);
> +err_free:
> + kvfree(buf);
> +err_out:
> +
> + return err;
> +}
> +
> +/**
> + * drm_gpusvm_range_evict - Evict GPU SVM range
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @range: Pointer to the GPU SVM range to be evicted
> + *
> + * This function evicts the specified GPU SVM range.
> + */
> +void drm_gpusvm_range_evict(struct drm_gpusvm *gpusvm,
> + struct drm_gpusvm_range *range)
> +{
> + struct mm_struct *mm = gpusvm->mm;
> + struct vm_area_struct *vas;
> +
> + if (!mmget_not_zero(mm))
> + return;
> +
> + mmap_read_lock(mm);
> + vas = vma_lookup(mm, range->va.start);
> + if (!vas)
> + goto unlock;
> +
> + __drm_gpusvm_migrate_to_ram(vas, gpusvm->device_private_page_owner,
> + NULL, range->va.start,
> + range->va.end - range->va.start);
> +unlock:
> + mmap_read_unlock(mm);
> + mmput(mm);
> +}
> +
> +/**
> + * drm_gpusvm_page_free - Put GPU SVM zone device data associated with a page
> + * @page: Pointer to the page
> + *
> + * This function is a callback used to put the GPU SVM zone device data
> + * associated with a page when it is being released.
> + */
> +static void drm_gpusvm_page_free(struct page *page)
> +{
> + drm_gpusvm_zdd_put(page->zone_device_data);
> +}
> +
> +/**
> + * drm_gpusvm_migrate_to_ram - Migrate GPU SVM range to RAM (page fault handler)
> + * @vmf: Pointer to the fault information structure
> + *
> + * This function is a page fault handler used to migrate a GPU SVM range to RAM.
> + * It retrieves the GPU SVM range information from the faulting page and invokes
> + * the internal migration function to migrate the range back to RAM.
> + *
> + * Returns:
> + * VM_FAULT_SIGBUS on failure, 0 on success.
> + */
> +static vm_fault_t drm_gpusvm_migrate_to_ram(struct vm_fault *vmf)
> +{
> + struct drm_gpusvm_zdd *zdd = vmf->page->zone_device_data;
> + int err;
> +
> + err = __drm_gpusvm_migrate_to_ram(vmf->vma,
> + zdd->device_private_page_owner,
> + vmf->page, vmf->address,
> + zdd->devmem_allocation->size);
> +
> + return err ? VM_FAULT_SIGBUS : 0;
> +}
> +
> +/**
> + * drm_gpusvm_pagemap_ops - Device page map operations for GPU SVM
> + */
> +static const struct dev_pagemap_ops drm_gpusvm_pagemap_ops = {
> + .page_free = drm_gpusvm_page_free,
> + .migrate_to_ram = drm_gpusvm_migrate_to_ram,
> +};
> +
> +/**
> + * drm_gpusvm_pagemap_ops_get - Retrieve GPU SVM device page map operations
> + *
> + * Returns:
> + * Pointer to the GPU SVM device page map operations structure.
> + */
> +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void)
> +{
> + return &drm_gpusvm_pagemap_ops;
> +}
> +
> +/**
> + * drm_gpusvm_has_mapping - Check if GPU SVM has mapping for the given address range
> + * @gpusvm: Pointer to the GPU SVM structure.
> + * @start: Start address
> + * @end: End address
> + *
> + * Returns:
> + * True if GPU SVM has mapping, False otherwise
> + */
> +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start, u64 end)
> +{
> + struct drm_gpusvm_notifier *notifier;
> +
> + drm_gpusvm_for_each_notifier(notifier, gpusvm, start, end) {
> + struct drm_gpusvm_range *range = NULL;
> +
> + drm_gpusvm_for_each_range(range, notifier, start, end)
> + return true;
> + }
> +
> + return false;
> +}
> diff --git a/drivers/gpu/drm/xe/drm_gpusvm.h b/drivers/gpu/drm/xe/drm_gpusvm.h
> new file mode 100644
> index 000000000000..15ec22d4f9a5
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/drm_gpusvm.h
> @@ -0,0 +1,447 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2024 Intel Corporation
> + */
> +
> +#ifndef __DRM_GPUSVM_H__
> +#define __DRM_GPUSVM_H__
> +
> +#include <linux/kref.h>
> +#include <linux/mmu_notifier.h>
> +#include <linux/workqueue.h>
> +
> +struct dev_pagemap_ops;
> +struct drm_device;
> +struct drm_gpusvm;
> +struct drm_gpusvm_notifier;
> +struct drm_gpusvm_ops;
> +struct drm_gpusvm_range;
> +struct drm_gpusvm_devmem;
> +struct drm_pagemap;
> +struct drm_pagemap_dma_addr;
> +
> +/**
> + * struct drm_gpusvm_devmem_ops - Operations structure for GPU SVM device memory
> + *
> + * This structure defines the operations for GPU Shared Virtual Memory (SVM)
> + * device memory. These operations are provided by the GPU driver to manage device memory
> + * allocations and perform operations such as migration between device memory and system
> + * RAM.
> + */
> +struct drm_gpusvm_devmem_ops {
> + /**
> + * @devmem_release: Release device memory allocation (optional)
> + * @devmem_allocation: device memory allocation
> + *
> + * This function shall release the device memory allocation and is expected to
> + * drop a reference to the device memory allocation.
> + */
> + void (*devmem_release)(struct drm_gpusvm_devmem *devmem_allocation);
> +
> + /**
> + * @populate_devmem_pfn: Populate device memory PFN (required for migration)
> + * @devmem_allocation: device memory allocation
> + * @npages: Number of pages to populate
> + * @pfn: Array of page frame numbers to populate
> + *
> + * This function shall populate device memory page frame numbers (PFN).
> + *
> + * Returns:
> + * 0 on success, a negative error code on failure.
> + */
> + int (*populate_devmem_pfn)(struct drm_gpusvm_devmem *devmem_allocation,
> + unsigned long npages, unsigned long *pfn);
> +
> + /**
> + * @copy_to_devmem: Copy to device memory (required for migration)
> + * @pages: Pointer to array of device memory pages (destination)
> + * @dma_addr: Pointer to array of DMA addresses (source)
> + * @npages: Number of pages to copy
> + *
> + * This function shall copy pages to device memory.
> + *
> + * Returns:
> + * 0 on success, a negative error code on failure.
> + */
> + int (*copy_to_devmem)(struct page **pages,
> + dma_addr_t *dma_addr,
> + unsigned long npages);
> +
> + /**
> + * @copy_to_ram: Copy to system RAM (required for migration)
> + * @pages: Pointer to array of device memory pages (source)
> + * @dma_addr: Pointer to array of DMA addresses (destination)
> + * @npages: Number of pages to copy
> + *
> + * This function shall copy pages to system RAM.
> + *
> + * Returns:
> + * 0 on success, a negative error code on failure.
> + */
> + int (*copy_to_ram)(struct page **pages,
> + dma_addr_t *dma_addr,
> + unsigned long npages);
> +};
> +
> +/**
> + * struct drm_gpusvm_devmem - Structure representing a GPU SVM device memory allocation
> + *
> + * @dev: Pointer to the device structure to which the device memory allocation belongs
> + * @mm: Pointer to the mm_struct for the address space
> + * @ops: Pointer to the operations structure for GPU SVM device memory
> + * @dpagemap: The struct drm_pagemap of the pages this allocation belongs to.
> + * @size: Size of device memory allocation
> + * @detached: device memory allocation is detached from device pages
> + */
> +struct drm_gpusvm_devmem {
> + struct device *dev;
> + struct mm_struct *mm;
> + const struct drm_gpusvm_devmem_ops *ops;
> + struct drm_pagemap *dpagemap;
> + size_t size;
> + bool detached;
> +};
> +
> +/**
> + * struct drm_gpusvm_ops - Operations structure for GPU SVM
> + *
> + * This structure defines the operations for GPU Shared Virtual Memory (SVM).
> + * These operations are provided by the GPU driver to manage SVM ranges and
> + * notifiers.
> + */
> +struct drm_gpusvm_ops {
> + /**
> + * @notifier_alloc: Allocate a GPU SVM notifier (optional)
> + *
> + * This function shall allocate a GPU SVM notifier.
> + *
> + * Returns:
> + * Pointer to the allocated GPU SVM notifier on success, NULL on failure.
> + */
> + struct drm_gpusvm_notifier *(*notifier_alloc)(void);
> +
> + /**
> + * @notifier_free: Free a GPU SVM notifier (optional)
> + * @notifier: Pointer to the GPU SVM notifier to be freed
> + *
> + * This function shall free a GPU SVM notifier.
> + */
> + void (*notifier_free)(struct drm_gpusvm_notifier *notifier);
> +
> + /**
> + * @range_alloc: Allocate a GPU SVM range (optional)
> + * @gpusvm: Pointer to the GPU SVM
> + *
> + * This function shall allocate a GPU SVM range.
> + *
> + * Returns:
> + * Pointer to the allocated GPU SVM range on success, NULL on failure.
> + */
> + struct drm_gpusvm_range *(*range_alloc)(struct drm_gpusvm *gpusvm);
> +
> + /**
> + * @range_free: Free a GPU SVM range (optional)
> + * @range: Pointer to the GPU SVM range to be freed
> + *
> + * This function shall free a GPU SVM range.
> + */
> + void (*range_free)(struct drm_gpusvm_range *range);
> +
> + /**
> + * @invalidate: Invalidate GPU SVM notifier (required)
> + * @gpusvm: Pointer to the GPU SVM
> + * @notifier: Pointer to the GPU SVM notifier
> + * @mmu_range: Pointer to the mmu_notifier_range structure
> + *
> + * This function shall invalidate the GPU page tables. It can safely
> + * walk the notifier range RB tree/list in this function. Called while
> + * holding the notifier lock.
> + */
> + void (*invalidate)(struct drm_gpusvm *gpusvm,
> + struct drm_gpusvm_notifier *notifier,
> + const struct mmu_notifier_range *mmu_range);
> +};
> +
> +/**
> + * struct drm_gpusvm_notifier - Structure representing a GPU SVM notifier
> + *
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: MMU interval notifier
> + * @interval: Interval for the notifier
> + * @rb: Red-black tree node for the parent GPU SVM structure notifier tree
> + * @root: Cached root node of the RB tree containing ranges
> + * @range_list: List head of ranges in the same order they appear in the
> + *              interval tree. This is useful to keep iterating ranges while
> + *              doing modifications to the RB tree.
> + * @flags.removed: Flag indicating whether the MMU interval notifier has been
> + * removed
> + *
> + * This structure represents a GPU SVM notifier.
> + */
> +struct drm_gpusvm_notifier {
> + struct drm_gpusvm *gpusvm;
> + struct mmu_interval_notifier notifier;
> + struct {
> + u64 start;
> + u64 end;
> + } interval;
> + struct {
> + struct rb_node node;
> + struct list_head entry;
> + u64 __subtree_last;
> + } rb;
> + struct rb_root_cached root;
> + struct list_head range_list;
> + struct {
> + u32 removed : 1;
> + } flags;
> +};
> +
> +/**
> + * struct drm_gpusvm_range - Structure representing a GPU SVM range
> + *
> + * @gpusvm: Pointer to the GPU SVM structure
> + * @notifier: Pointer to the GPU SVM notifier
> + * @refcount: Reference count for the range
> + * @rb: Red-black tree node for the parent GPU SVM notifier structure range tree
> + * @va: Virtual address range
> + * @notifier_seq: Notifier sequence number of the range's pages
> + * @dma_addr: DMA address array
> + * @dpagemap: The struct drm_pagemap of the device pages we're dma-mapping.
> + * Note this is assuming only one drm_pagemap per range is allowed.
> + * @flags.migrate_devmem: Flag indicating whether the range can be migrated to device memory
> + * @flags.unmapped: Flag indicating if the range has been unmapped
> + * @flags.partial_unmap: Flag indicating if the range has been partially unmapped
> + * @flags.has_devmem_pages: Flag indicating if the range has devmem pages
> + * @flags.has_dma_mapping: Flag indicating if the range has a DMA mapping
> + *
> + * This structure represents a GPU SVM range used for tracking memory ranges
> + * mapped in a DRM device.
> + */
> +struct drm_gpusvm_range {
> + struct drm_gpusvm *gpusvm;
> + struct drm_gpusvm_notifier *notifier;
> + struct kref refcount;
> + struct {
> + struct rb_node node;
> + struct list_head entry;
> + u64 __subtree_last;
> + } rb;
> + struct {
> + u64 start;
> + u64 end;
> + } va;
> + unsigned long notifier_seq;
> + struct drm_pagemap_dma_addr *dma_addr;
> + struct drm_pagemap *dpagemap;
> + struct {
> + /* All flags below must be set upon creation */
> + u16 migrate_devmem : 1;
> + /* All flags below must be set / cleared under notifier lock */
> + u16 unmapped : 1;
> + u16 partial_unmap : 1;
> + u16 has_devmem_pages : 1;
> + u16 has_dma_mapping : 1;
> + } flags;
> +};
> +
> +/**
> + * struct drm_gpusvm - GPU SVM structure
> + *
> + * @name: Name of the GPU SVM
> + * @drm: Pointer to the DRM device structure
> + * @mm: Pointer to the mm_struct for the address space
> + * @device_private_page_owner: Device private pages owner
> + * @mm_start: Start address of GPU SVM
> + * @mm_range: Range of the GPU SVM
> + * @notifier_size: Size of individual notifiers
> + * @ops: Pointer to the operations structure for GPU SVM
> + * @chunk_sizes: Pointer to the array of chunk sizes used in range allocation.
> + * Entries should be powers of 2 in descending order.
> + * @num_chunks: Number of chunks
> + * @notifier_lock: Read-write semaphore for protecting notifier operations
> + * @root: Cached root node of the Red-Black tree containing GPU SVM notifiers
> + * @notifier_list: List head of notifiers in the same order they appear in the
> + *                 interval tree. This is useful to keep iterating notifiers
> + *                 while doing modifications to the RB tree.
> + *
> + * This structure represents a GPU SVM (Shared Virtual Memory) used for tracking
> + * memory ranges mapped in a DRM (Direct Rendering Manager) device.
> + *
> + * No reference counting is provided, as this is expected to be embedded in the
> + * driver VM structure along with the struct drm_gpuvm, which handles reference
> + * counting.
> + */
> +struct drm_gpusvm {
> + const char *name;
> + struct drm_device *drm;
> + struct mm_struct *mm;
> + void *device_private_page_owner;
> + u64 mm_start;
> + u64 mm_range;
> + u64 notifier_size;
> + const struct drm_gpusvm_ops *ops;
> + const u64 *chunk_sizes;
> + int num_chunks;
> + struct rw_semaphore notifier_lock;
> + struct rb_root_cached root;
> + struct list_head notifier_list;
> +};
> +
> +/**
> + * struct drm_gpusvm_ctx - DRM GPU SVM context
> + *
> + * @in_notifier: entering from a MMU notifier
> + * @read_only: operating on read-only memory
> + * @devmem_possible: possible to use device memory
> + * @check_pages: check pages and only create range for pages faulted in
> + *
> + * Context that DRM GPUSVM is operating in (i.e. user arguments).
> + */
> +struct drm_gpusvm_ctx {
> + u32 in_notifier :1;
> + u32 read_only :1;
> + u32 devmem_possible :1;
> + u32 check_pages :1;
> +};
> +
> +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> + const char *name, struct drm_device *drm,
> + struct mm_struct *mm, void *device_private_page_owner,
> + u64 mm_start, u64 mm_range, u64 notifier_size,
> + const struct drm_gpusvm_ops *ops,
> + const u64 *chunk_sizes, int num_chunks);
> +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm);
> +void drm_gpusvm_free(struct drm_gpusvm *gpusvm);
> +
> +struct drm_gpusvm_range *
> +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64 fault_addr,
> + u64 gpuva_start, u64 gpuva_end,
> + const struct drm_gpusvm_ctx *ctx);
> +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> + struct drm_gpusvm_range *range);
> +void drm_gpusvm_range_evict(struct drm_gpusvm *gpusvm,
> + struct drm_gpusvm_range *range);
> +
> +struct drm_gpusvm_range *
> +drm_gpusvm_range_get(struct drm_gpusvm_range *range);
> +void drm_gpusvm_range_put(struct drm_gpusvm_range *range);
> +
> +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> + struct drm_gpusvm_range *range);
> +
> +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> + struct drm_gpusvm_range *range,
> + const struct drm_gpusvm_ctx *ctx);
> +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> + struct drm_gpusvm_range *range,
> + const struct drm_gpusvm_ctx *ctx);
> +
> +int drm_gpusvm_migrate_to_devmem(struct drm_gpusvm *gpusvm,
> + struct drm_gpusvm_range *range,
> + struct drm_gpusvm_devmem *devmem_allocation,
> + const struct drm_gpusvm_ctx *ctx);
> +int drm_gpusvm_evict_to_ram(struct drm_gpusvm_devmem *devmem_allocation);
> +
> +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void);
> +
> +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start, u64 end);
> +
> +struct drm_gpusvm_range *
> +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64 start, u64 end);
> +
> +/**
> + * drm_gpusvm_notifier_lock - Lock GPU SVM notifier
> + * @gpusvm__: Pointer to the GPU SVM structure.
> + *
> + * Abstracts client usage of the GPU SVM notifier lock; takes the lock.
> + */
> +#define drm_gpusvm_notifier_lock(gpusvm__) \
> + down_read(&(gpusvm__)->notifier_lock)
> +
> +/**
> + * drm_gpusvm_notifier_unlock - Unlock GPU SVM notifier
> + * @gpusvm__: Pointer to the GPU SVM structure.
> + *
> + * Abstracts client usage of the GPU SVM notifier lock; drops the lock.
> + */
> +#define drm_gpusvm_notifier_unlock(gpusvm__) \
> + up_read(&(gpusvm__)->notifier_lock)
> +
> +/**
> + * __drm_gpusvm_range_next - Get the next GPU SVM range in the list
> + * @range: a pointer to the current GPU SVM range
> + *
> + * Return: A pointer to the next drm_gpusvm_range if available, or NULL if the
> + * current range is the last one or if the input range is NULL.
> + */
> +static inline struct drm_gpusvm_range *
> +__drm_gpusvm_range_next(struct drm_gpusvm_range *range)
> +{
> + if (range && !list_is_last(&range->rb.entry,
> + &range->notifier->range_list))
> + return list_next_entry(range, rb.entry);
> +
> + return NULL;
> +}
> +
> +/**
> + * drm_gpusvm_for_each_range - Iterate over GPU SVM ranges in a notifier
> + * @range__: Iterator variable for the ranges. If set, it indicates the start of
> + *           the iteration. If NULL, drm_gpusvm_range_find() is called to get the
> + *           first range.
> + * @notifier__: Pointer to the GPU SVM notifier
> + * @start__: Start address of the range
> + * @end__: End address of the range
> + *
> + * This macro is used to iterate over GPU SVM ranges in a notifier. It is safe
> + * to use while holding the driver SVM lock or the notifier lock.
> + */
> +#define drm_gpusvm_for_each_range(range__, notifier__, start__, end__) \
> + for ((range__) = (range__) ?: \
> + drm_gpusvm_range_find((notifier__), (start__), (end__)); \
> + (range__) && (range__->va.start < (end__)); \
> + (range__) = __drm_gpusvm_range_next(range__))
> +
> +/**
> + * drm_gpusvm_range_set_unmapped - Mark a GPU SVM range as unmapped
> + * @range: Pointer to the GPU SVM range structure.
> + * @mmu_range: Pointer to the MMU notifier range structure.
> + *
> + * This function marks a GPU SVM range as unmapped and sets the partial_unmap flag
> + * if the range partially falls within the provided MMU notifier range.
> + */
> +static inline void
> +drm_gpusvm_range_set_unmapped(struct drm_gpusvm_range *range,
> + const struct mmu_notifier_range *mmu_range)
> +{
> + lockdep_assert_held_write(&range->gpusvm->notifier_lock);
> +
> + range->flags.unmapped = true;
> + if (range->va.start < mmu_range->start ||
> + range->va.end > mmu_range->end)
> + range->flags.partial_unmap = true;
> +}
> +
> +/**
> + * drm_gpusvm_devmem_init - Initialize a GPU SVM device memory allocation
> + *
> + * @devmem_allocation: Pointer to the device memory allocation structure to initialize
> + * @dev: Pointer to the device structure to which the device memory allocation belongs
> + * @mm: Pointer to the mm_struct for the address space
> + * @ops: Pointer to the operations structure for GPU SVM device memory
> + * @dpagemap: The struct drm_pagemap we're allocating from.
> + * @size: Size of device memory allocation
> + */
> +static inline void
> +drm_gpusvm_devmem_init(struct drm_gpusvm_devmem *devmem_allocation,
> + struct device *dev, struct mm_struct *mm,
> + const struct drm_gpusvm_devmem_ops *ops,
> + struct drm_pagemap *dpagemap, size_t size)
> +{
> + devmem_allocation->dev = dev;
> + devmem_allocation->mm = mm;
> + devmem_allocation->ops = ops;
> + devmem_allocation->dpagemap = dpagemap;
> + devmem_allocation->size = size;
> +}
> +
> +#endif /* __DRM_GPUSVM_H__ */
^ permalink raw reply [flat|nested] 129+ messages in thread* Re: [PATCH v2 05/29] drm/gpusvm: Add support for GPU Shared Virtual Memory
2024-11-29 0:00 ` Alistair Popple
@ 2024-12-14 1:16 ` Matthew Brost
0 siblings, 0 replies; 129+ messages in thread
From: Matthew Brost @ 2024-12-14 1:16 UTC (permalink / raw)
To: Alistair Popple
Cc: intel-xe, dri-devel, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
On Fri, Nov 29, 2024 at 11:00:24AM +1100, Alistair Popple wrote:
>
> Matthew Brost <matthew.brost@intel.com> writes:
>
> [...]
>
> > + * 3) Invalidation driver vfunc.
> > + *
> > + * void driver_invalidation(struct drm_gpusvm *gpusvm,
> > + * struct drm_gpusvm_notifier *notifier,
> > + * const struct mmu_notifier_range *mmu_range)
> > + * {
> > + * struct drm_gpusvm_ctx ctx = { .in_notifier = true, };
> > + * struct drm_gpusvm_range *range = NULL;
> > + *
> > + * driver_invalidate_device_tlb(gpusvm, mmu_range->start, mmu_range->end);
> > + *
> > + * drm_gpusvm_for_each_range(range, notifier, mmu_range->start,
> > + * mmu_range->end) {
> > + * drm_gpusvm_range_unmap_pages(gpusvm, range, &ctx);
> > + *
> > + * if (mmu_range->event != MMU_NOTIFY_UNMAP)
>
> I've only glanced at this series as an interested onlooker so I
> may have overlooked something obvious but why is it ok to skip notifiers
> other than MMU_NOTIFY_UNMAP? Wouldn't you also need to clears GPU PTEs
> in other cases?
>
This just skips the addition to the garbage collector, which performs
the final removal of GPU page tables and also removes the range from the
GPU SVM tree. Both of those steps require locks that are in the reclaim
path, so you can't perform them directly here. The garbage collector is
a worker with a list of ranges pending removal.
In all cases, the GPU PTEs need to be cleared, and any DMA mappings need
to be removed. This is handled in the pseudo code in
driver_invalidate_device_tlb and drm_gpusvm_range_unmap_pages,
respectively.
Hope that clears things up,
Matt
> - Alistair
>
> > + * continue;
> > + *
> > + * drm_gpusvm_range_set_unmapped(range, mmu_range);
> > + * driver_garbage_collector_add(gpusvm, range);
> > + * }
> > + * }
> > + */
> > +
> > +#define DRM_GPUSVM_RANGE_START(_range) ((_range)->va.start)
> > +#define DRM_GPUSVM_RANGE_END(_range) ((_range)->va.end - 1)
> > +INTERVAL_TREE_DEFINE(struct drm_gpusvm_range, rb.node, u64, rb.__subtree_last,
> > + DRM_GPUSVM_RANGE_START, DRM_GPUSVM_RANGE_END,
> > + static __maybe_unused, range);
> > +
> > +#define DRM_GPUSVM_NOTIFIER_START(_notifier) ((_notifier)->interval.start)
> > +#define DRM_GPUSVM_NOTIFIER_END(_notifier) ((_notifier)->interval.end - 1)
> > +INTERVAL_TREE_DEFINE(struct drm_gpusvm_notifier, rb.node, u64,
> > + rb.__subtree_last, DRM_GPUSVM_NOTIFIER_START,
> > + DRM_GPUSVM_NOTIFIER_END, static __maybe_unused, notifier);
> > +
> > +/**
> > + * npages_in_range() - Calculate the number of pages in a given range
> > + * @start__: The start address of the range
> > + * @end__: The end address of the range
> > + *
> > + * This macro calculates the number of pages in a given memory range,
> > + * specified by the start and end addresses. It divides the difference
> > + * between the end and start addresses by the page size (PAGE_SIZE) to
> > + * determine the number of pages in the range.
> > + *
> > + * Return: The number of pages in the specified range.
> > + */
> > +#define npages_in_range(start__, end__) \
> > + (((end__) - (start__)) >> PAGE_SHIFT)
> > +
> > +/**
> > + * struct drm_gpusvm_zdd - GPU SVM zone device data
> > + *
> > + * @refcount: Reference count for the zdd
> > + * @destroy_work: Work structure for asynchronous zdd destruction
> > + * @devmem_allocation: device memory allocation
> > + * @device_private_page_owner: Device private pages owner
> > + *
> > + * This structure serves as a generic wrapper installed in
> > + * page->zone_device_data. It provides infrastructure for looking up a device
> > + * memory allocation upon CPU page fault and asynchronously releasing device
> > + * memory once the CPU has no page references. Asynchronous release is useful
> > + * because CPU page references can be dropped in IRQ contexts, while releasing
> > + * device memory likely requires sleeping locks.
> > + */
> > +struct drm_gpusvm_zdd {
> > + struct kref refcount;
> > + struct work_struct destroy_work;
> > + struct drm_gpusvm_devmem *devmem_allocation;
> > + void *device_private_page_owner;
> > +};
> > +
> > +/**
> > + * drm_gpusvm_zdd_destroy_work_func - Work function for destroying a zdd
> > + * @w: Pointer to the work_struct
> > + *
> > + * This function releases the device memory allocation and frees the zdd.
> > + */
> > +static void drm_gpusvm_zdd_destroy_work_func(struct work_struct *w)
> > +{
> > + struct drm_gpusvm_zdd *zdd =
> > + container_of(w, struct drm_gpusvm_zdd, destroy_work);
> > + const struct drm_gpusvm_devmem_ops *ops = zdd->devmem_allocation ?
> > + zdd->devmem_allocation->ops : NULL;
> > +
> > + if (zdd->devmem_allocation && ops->devmem_release)
> > + ops->devmem_release(zdd->devmem_allocation);
> > + kfree(zdd);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_zdd_alloc - Allocate a zdd structure.
> > + * @device_private_page_owner: Device private pages owner
> > + *
> > + * This function allocates and initializes a new zdd structure. It sets up the
> > + * reference count and initializes the destroy work.
> > + *
> > + * Returns:
> > + * Pointer to the allocated zdd on success, NULL on failure.
> > + */
> > +static struct drm_gpusvm_zdd *
> > +drm_gpusvm_zdd_alloc(void *device_private_page_owner)
> > +{
> > + struct drm_gpusvm_zdd *zdd;
> > +
> > + zdd = kmalloc(sizeof(*zdd), GFP_KERNEL);
> > + if (!zdd)
> > + return NULL;
> > +
> > + kref_init(&zdd->refcount);
> > + INIT_WORK(&zdd->destroy_work, drm_gpusvm_zdd_destroy_work_func);
> > + zdd->devmem_allocation = NULL;
> > + zdd->device_private_page_owner = device_private_page_owner;
> > +
> > + return zdd;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_zdd_get - Get a reference to a zdd structure.
> > + * @zdd: Pointer to the zdd structure.
> > + *
> > + * This function increments the reference count of the provided zdd structure.
> > + *
> > + * Returns: Pointer to the zdd structure.
> > + */
> > +static struct drm_gpusvm_zdd *drm_gpusvm_zdd_get(struct drm_gpusvm_zdd *zdd)
> > +{
> > + kref_get(&zdd->refcount);
> > + return zdd;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_zdd_destroy - Destroy a zdd structure.
> > + * @ref: Pointer to the reference count structure.
> > + *
> > + * This function queues the destroy_work of the zdd for asynchronous destruction.
> > + */
> > +static void drm_gpusvm_zdd_destroy(struct kref *ref)
> > +{
> > + struct drm_gpusvm_zdd *zdd =
> > + container_of(ref, struct drm_gpusvm_zdd, refcount);
> > +
> > + if (zdd->devmem_allocation)
> > + WRITE_ONCE(zdd->devmem_allocation->detached, true);
> > + schedule_work(&zdd->destroy_work);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_zdd_put - Put a zdd reference.
> > + * @zdd: Pointer to the zdd structure.
> > + *
> > + * This function decrements the reference count of the provided zdd structure
> > + * and schedules its destruction if the count drops to zero.
> > + */
> > +static void drm_gpusvm_zdd_put(struct drm_gpusvm_zdd *zdd)
> > +{
> > + kref_put(&zdd->refcount, drm_gpusvm_zdd_destroy);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_find - Find GPU SVM range from GPU SVM notifier
> > + * @notifier: Pointer to the GPU SVM notifier structure.
> > + * @start: Start address of the range
> > + * @end: End address of the range
> > + *
> > + * Return: A pointer to the drm_gpusvm_range if found or NULL
> > + */
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64 start, u64 end)
> > +{
> > + return range_iter_first(¬ifier->root, start, end - 1);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_for_each_range_safe - Safely iterate over GPU SVM ranges in a notifier
> > + * @range__: Iterator variable for the ranges
> > + * @next__: Iterator variable for the ranges' temporary storage
> > + * @notifier__: Pointer to the GPU SVM notifier
> > + * @start__: Start address of the range
> > + * @end__: End address of the range
> > + *
> > + * This macro is used to iterate over GPU SVM ranges in a notifier while
> > + * removing ranges from it.
> > + */
> > +#define drm_gpusvm_for_each_range_safe(range__, next__, notifier__, start__, end__) \
> > + for ((range__) = drm_gpusvm_range_find((notifier__), (start__), (end__)), \
> > + (next__) = __drm_gpusvm_range_next(range__); \
> > + (range__) && (range__->va.start < (end__)); \
> > + (range__) = (next__), (next__) = __drm_gpusvm_range_next(range__))
> > +
> > +/**
> > + * __drm_gpusvm_notifier_next - get the next drm_gpusvm_notifier in the list
> > + * @notifier: a pointer to the current drm_gpusvm_notifier
> > + *
> > + * Return: A pointer to the next drm_gpusvm_notifier if available, or NULL if
> > + * the current notifier is the last one or if the input notifier is
> > + * NULL.
> > + */
> > +static struct drm_gpusvm_notifier *
> > +__drm_gpusvm_notifier_next(struct drm_gpusvm_notifier *notifier)
> > +{
> > + if (notifier && !list_is_last(¬ifier->rb.entry,
> > + ¬ifier->gpusvm->notifier_list))
> > + return list_next_entry(notifier, rb.entry);
> > +
> > + return NULL;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_for_each_notifier - Iterate over GPU SVM notifiers in a gpusvm
> > + * @notifier__: Iterator variable for the notifiers
> > + * @gpusvm__: Pointer to the GPU SVM structure
> > + * @start__: Start address of the notifier
> > + * @end__: End address of the notifier
> > + *
> > + * This macro is used to iterate over GPU SVM notifiers in a gpusvm.
> > + */
> > +#define drm_gpusvm_for_each_notifier(notifier__, gpusvm__, start__, end__) \
> > + for ((notifier__) = notifier_iter_first(&(gpusvm__)->root, (start__), (end__) - 1); \
> > + (notifier__) && (notifier__->interval.start < (end__)); \
> > + (notifier__) = __drm_gpusvm_notifier_next(notifier__))
> > +
> > +/**
> > + * drm_gpusvm_for_each_notifier_safe - Safely iterate over GPU SVM notifiers in a gpusvm
> > + * @notifier__: Iterator variable for the notifiers
> > + * @next__: Iterator variable for the notifiers' temporary storage
> > + * @gpusvm__: Pointer to the GPU SVM structure
> > + * @start__: Start address of the notifier
> > + * @end__: End address of the notifier
> > + *
> > + * This macro is used to iterate over GPU SVM notifiers in a gpusvm while
> > + * removing notifiers from it.
> > + */
> > +#define drm_gpusvm_for_each_notifier_safe(notifier__, next__, gpusvm__, start__, end__) \
> > + for ((notifier__) = notifier_iter_first(&(gpusvm__)->root, (start__), (end__) - 1), \
> > + (next__) = __drm_gpusvm_notifier_next(notifier__); \
> > + (notifier__) && (notifier__->interval.start < (end__)); \
> > + (notifier__) = (next__), (next__) = __drm_gpusvm_notifier_next(notifier__))
> > +
> > +/**
> > + * drm_gpusvm_notifier_invalidate - Invalidate a GPU SVM notifier.
> > + * @mni: Pointer to the mmu_interval_notifier structure.
> > + * @mmu_range: Pointer to the mmu_notifier_range structure.
> > + * @cur_seq: Current sequence number.
> > + *
> > + * This function serves as a generic MMU notifier for GPU SVM. It sets the MMU
> > + * notifier sequence number and calls the driver invalidate vfunc under
> > + * gpusvm->notifier_lock.
> > + *
> > + * Returns:
> > + * true if the operation succeeds, false otherwise.
> > + */
> > +static bool
> > +drm_gpusvm_notifier_invalidate(struct mmu_interval_notifier *mni,
> > + const struct mmu_notifier_range *mmu_range,
> > + unsigned long cur_seq)
> > +{
> > + struct drm_gpusvm_notifier *notifier =
> > + container_of(mni, typeof(*notifier), notifier);
> > + struct drm_gpusvm *gpusvm = notifier->gpusvm;
> > +
> > + if (!mmu_notifier_range_blockable(mmu_range))
> > + return false;
> > +
> > + down_write(&gpusvm->notifier_lock);
> > + mmu_interval_set_seq(mni, cur_seq);
> > + gpusvm->ops->invalidate(gpusvm, notifier, mmu_range);
> > + up_write(&gpusvm->notifier_lock);
> > +
> > + return true;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_notifier_ops - MMU interval notifier operations for GPU SVM
> > + */
> > +static const struct mmu_interval_notifier_ops drm_gpusvm_notifier_ops = {
> > + .invalidate = drm_gpusvm_notifier_invalidate,
> > +};
> > +
> > +/**
> > + * drm_gpusvm_init - Initialize the GPU SVM.
> > + * @gpusvm: Pointer to the GPU SVM structure.
> > + * @name: Name of the GPU SVM.
> > + * @drm: Pointer to the DRM device structure.
> > + * @mm: Pointer to the mm_struct for the address space.
> > + * @device_private_page_owner: Device private pages owner.
> > + * @mm_start: Start address of GPU SVM.
> > + * @mm_range: Range of the GPU SVM.
> > + * @notifier_size: Size of individual notifiers.
> > + * @ops: Pointer to the operations structure for GPU SVM.
> > + * @chunk_sizes: Pointer to the array of chunk sizes used in range allocation.
> > + * Entries should be powers of 2 in descending order with last
> > + * entry being SZ_4K.
> > + * @num_chunks: Number of chunks.
> > + *
> > + * This function initializes the GPU SVM.
> > + *
> > + * Returns:
> > + * 0 on success, a negative error code on failure.
> > + */
> > +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> > + const char *name, struct drm_device *drm,
> > + struct mm_struct *mm, void *device_private_page_owner,
> > + u64 mm_start, u64 mm_range, u64 notifier_size,
> > + const struct drm_gpusvm_ops *ops,
> > + const u64 *chunk_sizes, int num_chunks)
> > +{
> > + if (!ops->invalidate || !num_chunks)
> > + return -EINVAL;
> > +
> > + gpusvm->name = name;
> > + gpusvm->drm = drm;
> > + gpusvm->mm = mm;
> > + gpusvm->device_private_page_owner = device_private_page_owner;
> > + gpusvm->mm_start = mm_start;
> > + gpusvm->mm_range = mm_range;
> > + gpusvm->notifier_size = notifier_size;
> > + gpusvm->ops = ops;
> > + gpusvm->chunk_sizes = chunk_sizes;
> > + gpusvm->num_chunks = num_chunks;
> > +
> > + mmgrab(mm);
> > + gpusvm->root = RB_ROOT_CACHED;
> > + INIT_LIST_HEAD(&gpusvm->notifier_list);
> > +
> > + init_rwsem(&gpusvm->notifier_lock);
> > +
> > + fs_reclaim_acquire(GFP_KERNEL);
> > + might_lock(&gpusvm->notifier_lock);
> > + fs_reclaim_release(GFP_KERNEL);
> > +
> > + return 0;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_notifier_find - Find GPU SVM notifier
> > + * @gpusvm__: Pointer to the GPU SVM structure
> > + * @fault_addr__: Fault address
> > + *
> > + * This macro finds the GPU SVM notifier associated with the fault address.
> > + *
> > + * Returns:
> > + * Pointer to the GPU SVM notifier on success, NULL otherwise.
> > + */
> > +#define drm_gpusvm_notifier_find(gpusvm__, fault_addr__) \
> > + notifier_iter_first(&(gpusvm__)->root, (fault_addr__), \
> > + (fault_addr__ + 1))
> > +
> > +/**
> > + * to_drm_gpusvm_notifier - retrieve the container struct for a given rbtree node
> > + * @node__: a pointer to the rbtree node embedded within a drm_gpusvm_notifier struct
> > + *
> > + * Return: A pointer to the containing drm_gpusvm_notifier structure.
> > + */
> > +#define to_drm_gpusvm_notifier(__node) \
> > + container_of((__node), struct drm_gpusvm_notifier, rb.node)
> > +
> > +/**
> > + * drm_gpusvm_notifier_insert - Insert GPU SVM notifier
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + *
> > + * This function inserts the GPU SVM notifier into the GPU SVM RB tree and list.
> > + */
> > +static void drm_gpusvm_notifier_insert(struct drm_gpusvm *gpusvm,
> > + struct drm_gpusvm_notifier *notifier)
> > +{
> > + struct rb_node *node;
> > + struct list_head *head;
> > +
> > + notifier_insert(notifier, &gpusvm->root);
> > +
> > + node = rb_prev(¬ifier->rb.node);
> > + if (node)
> > + head = &(to_drm_gpusvm_notifier(node))->rb.entry;
> > + else
> > + head = &gpusvm->notifier_list;
> > +
> > + list_add(¬ifier->rb.entry, head);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_notifier_remove - Remove GPU SVM notifier
> > + * @gpusvm__: Pointer to the GPU SVM structure
> > + * @notifier__: Pointer to the GPU SVM notifier structure
> > + *
> > + * This macro removes the GPU SVM notifier from the GPU SVM RB tree and list.
> > + */
> > +#define drm_gpusvm_notifier_remove(gpusvm__, notifier__) \
> > + notifier_remove((notifier__), &(gpusvm__)->root); \
> > + list_del(&(notifier__)->rb.entry)
> > +
> > +/**
> > + * drm_gpusvm_fini - Finalize the GPU SVM.
> > + * @gpusvm: Pointer to the GPU SVM structure.
> > + *
> > + * This function finalizes the GPU SVM by cleaning up any remaining ranges and
> > + * notifiers, and dropping a reference to struct MM.
> > + */
> > +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm)
> > +{
> > + struct drm_gpusvm_notifier *notifier, *next;
> > +
> > + drm_gpusvm_for_each_notifier_safe(notifier, next, gpusvm, 0, LONG_MAX) {
> > + struct drm_gpusvm_range *range, *__next;
> > +
> > + /*
> > + * Remove notifier first to avoid racing with any invalidation
> > + */
> > + mmu_interval_notifier_remove(¬ifier->notifier);
> > + notifier->flags.removed = true;
> > +
> > + drm_gpusvm_for_each_range_safe(range, __next, notifier, 0,
> > + LONG_MAX)
> > + drm_gpusvm_range_remove(gpusvm, range);
> > + }
> > +
> > + mmdrop(gpusvm->mm);
> > + WARN_ON(!RB_EMPTY_ROOT(&gpusvm->root.rb_root));
> > +}
> > +
> > +/**
> > + * drm_gpusvm_notifier_alloc - Allocate GPU SVM notifier
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @fault_addr: Fault address
> > + *
> > + * This function allocates and initializes the GPU SVM notifier structure.
> > + *
> > + * Returns:
> > + * Pointer to the allocated GPU SVM notifier on success, ERR_PTR() on failure.
> > + */
> > +static struct drm_gpusvm_notifier *
> > +drm_gpusvm_notifier_alloc(struct drm_gpusvm *gpusvm, u64 fault_addr)
> > +{
> > + struct drm_gpusvm_notifier *notifier;
> > +
> > + if (gpusvm->ops->notifier_alloc)
> > + notifier = gpusvm->ops->notifier_alloc();
> > + else
> > + notifier = kzalloc(sizeof(*notifier), GFP_KERNEL);
> > +
> > + if (!notifier)
> > + return ERR_PTR(-ENOMEM);
> > +
> > + notifier->gpusvm = gpusvm;
> > + notifier->interval.start = ALIGN_DOWN(fault_addr, gpusvm->notifier_size);
> > + notifier->interval.end = ALIGN(fault_addr + 1, gpusvm->notifier_size);
> > + INIT_LIST_HEAD(¬ifier->rb.entry);
> > + notifier->root = RB_ROOT_CACHED;
> > + INIT_LIST_HEAD(¬ifier->range_list);
> > +
> > + return notifier;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_notifier_free - Free GPU SVM notifier
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + *
> > + * This function frees the GPU SVM notifier structure.
> > + */
> > +static void drm_gpusvm_notifier_free(struct drm_gpusvm *gpusvm,
> > + struct drm_gpusvm_notifier *notifier)
> > +{
> > + WARN_ON(!RB_EMPTY_ROOT(¬ifier->root.rb_root));
> > +
> > + if (gpusvm->ops->notifier_free)
> > + gpusvm->ops->notifier_free(notifier);
> > + else
> > + kfree(notifier);
> > +}
> > +
> > +/**
> > + * to_drm_gpusvm_range - retrieve the container struct for a given rbtree node
> > + * @node__: a pointer to the rbtree node embedded within a drm_gpusvm_range struct
> > + *
> > + * Return: A pointer to the containing drm_gpusvm_range structure.
> > + */
> > +#define to_drm_gpusvm_range(node__) \
> > + container_of((node__), struct drm_gpusvm_range, rb.node)
> > +
> > +/**
> > + * drm_gpusvm_range_insert - Insert GPU SVM range
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + * @range: Pointer to the GPU SVM range structure
> > + *
> > + * This function inserts the GPU SVM range into the notifier RB tree and list.
> > + */
> > +static void drm_gpusvm_range_insert(struct drm_gpusvm_notifier *notifier,
> > + struct drm_gpusvm_range *range)
> > +{
> > + struct rb_node *node;
> > + struct list_head *head;
> > +
> > + drm_gpusvm_notifier_lock(notifier->gpusvm);
> > + range_insert(range, ¬ifier->root);
> > +
> > + node = rb_prev(&range->rb.node);
> > + if (node)
> > + head = &(to_drm_gpusvm_range(node))->rb.entry;
> > + else
> > + head = ¬ifier->range_list;
> > +
> > + list_add(&range->rb.entry, head);
> > + drm_gpusvm_notifier_unlock(notifier->gpusvm);
> > +}
> > +
> > +/**
> > + * __drm_gpusvm_range_remove - Remove GPU SVM range
> > + * @notifier__: Pointer to the GPU SVM notifier structure
> > + * @range__: Pointer to the GPU SVM range structure
> > + *
> > + * This macro removes the GPU SVM range from the notifier RB tree and list.
> > + */
> > +#define __drm_gpusvm_range_remove(notifier__, range__) \
> > + range_remove((range__), &(notifier__)->root); \
> > + list_del(&(range__)->rb.entry)
> > +
> > +/**
> > + * drm_gpusvm_range_alloc - Allocate GPU SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + * @fault_addr: Fault address
> > + * @chunk_size: Chunk size
> > + * @migrate_devmem: Flag indicating whether to migrate device memory
> > + *
> > + * This function allocates and initializes the GPU SVM range structure.
> > + *
> > + * Returns:
> > + * Pointer to the allocated GPU SVM range on success, ERR_PTR() on failure.
> > + */
> > +static struct drm_gpusvm_range *
> > +drm_gpusvm_range_alloc(struct drm_gpusvm *gpusvm,
> > + struct drm_gpusvm_notifier *notifier,
> > + u64 fault_addr, u64 chunk_size, bool migrate_devmem)
> > +{
> > + struct drm_gpusvm_range *range;
> > +
> > + if (gpusvm->ops->range_alloc)
> > + range = gpusvm->ops->range_alloc(gpusvm);
> > + else
> > + range = kzalloc(sizeof(*range), GFP_KERNEL);
> > +
> > + if (!range)
> > + return ERR_PTR(-ENOMEM);
> > +
> > + kref_init(&range->refcount);
> > + range->gpusvm = gpusvm;
> > + range->notifier = notifier;
> > + range->va.start = ALIGN_DOWN(fault_addr, chunk_size);
> > + range->va.end = ALIGN(fault_addr + 1, chunk_size);
> > + INIT_LIST_HEAD(&range->rb.entry);
> > + range->notifier_seq = LONG_MAX;
> > + range->flags.migrate_devmem = migrate_devmem ? 1 : 0;
> > +
> > + return range;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_check_pages - Check pages
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + * @start: Start address
> > + * @end: End address
> > + *
> > + * Check if pages between start and end have been faulted in on the CPU. Used
> > + * to prevent migration of pages without a CPU backing store.
> > + *
> > + * Returns:
> > + * True if pages have been faulted into CPU, False otherwise
> > + */
> > +static bool drm_gpusvm_check_pages(struct drm_gpusvm *gpusvm,
> > + struct drm_gpusvm_notifier *notifier,
> > + u64 start, u64 end)
> > +{
> > + struct hmm_range hmm_range = {
> > + .default_flags = 0,
> > + .notifier = ¬ifier->notifier,
> > + .start = start,
> > + .end = end,
> > + .dev_private_owner = gpusvm->device_private_page_owner,
> > + };
> > + unsigned long timeout =
> > + jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > + unsigned long *pfns;
> > + unsigned long npages = npages_in_range(start, end);
> > + int err, i;
> > +
> > + mmap_assert_locked(gpusvm->mm);
> > +
> > + pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
> > + if (!pfns)
> > + return false;
> > +
> > + hmm_range.notifier_seq = mmu_interval_read_begin(¬ifier->notifier);
> > + hmm_range.hmm_pfns = pfns;
> > +
> > + while (true) {
> > + err = hmm_range_fault(&hmm_range);
> > + if (err == -EBUSY) {
> > + if (time_after(jiffies, timeout))
> > + break;
> > +
> > + hmm_range.notifier_seq = mmu_interval_read_begin(¬ifier->notifier);
> > + continue;
> > + }
> > + break;
> > + }
> > + if (err)
> > + goto err_free;
> > +
> > + for (i = 0; i < npages;) {
> > + if (!(pfns[i] & HMM_PFN_VALID)) {
> > + err = -EFAULT;
> > + goto err_free;
> > + }
> > + i += 0x1 << hmm_pfn_to_map_order(pfns[i]);
> > + }
> > +
> > +err_free:
> > + kvfree(pfns);
> > + return err ? false : true;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_chunk_size - Determine chunk size for GPU SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier structure
> > + * @vas: Pointer to the virtual memory area structure
> > + * @fault_addr: Fault address
> > + * @gpuva_start: Start address of GPUVA which mirrors CPU
> > + * @gpuva_end: End address of GPUVA which mirrors CPU
> > + * @check_pages: Flag indicating whether to check pages
> > + *
> > + * This function determines the chunk size for the GPU SVM range based on the
> > + * fault address, GPU SVM chunk sizes, existing GPU SVM ranges, and the virtual
> > + * memory area boundaries.
> > + *
> > + * Returns:
> > + * Chunk size on success, LONG_MAX on failure.
> > + */
> > +static u64 drm_gpusvm_range_chunk_size(struct drm_gpusvm *gpusvm,
> > + struct drm_gpusvm_notifier *notifier,
> > + struct vm_area_struct *vas,
> > + u64 fault_addr, u64 gpuva_start,
> > + u64 gpuva_end, bool check_pages)
> > +{
> > + u64 start, end;
> > + int i = 0;
> > +
> > +retry:
> > + for (; i < gpusvm->num_chunks; ++i) {
> > + start = ALIGN_DOWN(fault_addr, gpusvm->chunk_sizes[i]);
> > + end = ALIGN(fault_addr + 1, gpusvm->chunk_sizes[i]);
> > +
> > + if (start >= vas->vm_start && end <= vas->vm_end &&
> > + start >= notifier->interval.start &&
> > + end <= notifier->interval.end &&
> > + start >= gpuva_start && end <= gpuva_end)
> > + break;
> > + }
> > +
> > + if (i == gpusvm->num_chunks)
> > + return LONG_MAX;
> > +
> > + /*
> > +	 * If the allocation is more than a page, ensure it does not overlap
> > +	 * with existing ranges.
> > + */
> > + if (end - start != SZ_4K) {
> > + struct drm_gpusvm_range *range;
> > +
> > + range = drm_gpusvm_range_find(notifier, start, end);
> > + if (range) {
> > + ++i;
> > + goto retry;
> > + }
> > +
> > + /*
> > + * XXX: Only create range on pages CPU has faulted in. Without
> > + * this check, or prefault, on BMG 'xe_exec_system_allocator --r
> > + * process-many-malloc' fails. In the failure case, each process
> > + * mallocs 16k but the CPU VMA is ~128k which results in 64k SVM
> > + * ranges. When migrating the SVM ranges, some processes fail in
> > + * drm_gpusvm_migrate_to_devmem with 'migrate.cpages != npages'
> > + * and then upon drm_gpusvm_range_get_pages device pages from
> > + * other processes are collected + faulted in which creates all
> > +	 * sorts of problems. Unsure exactly how this is happening; the problem
> > +	 * also goes away if 'xe_exec_system_allocator --r
> > + * process-many-malloc' mallocs at least 64k at a time.
> > + */
> > + if (check_pages &&
> > + !drm_gpusvm_check_pages(gpusvm, notifier, start, end)) {
> > + ++i;
> > + goto retry;
> > + }
> > + }
> > +
> > + return end - start;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_find_or_insert - Find or insert GPU SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @fault_addr: Fault address
> > + * @gpuva_start: Start address of GPUVA which mirrors CPU
> > + * @gpuva_end: End address of GPUVA which mirrors CPU
> > + * @ctx: GPU SVM context
> > + *
> > + * This function finds or inserts a newly allocated GPU SVM range based on the
> > + * fault address. The caller must hold a lock to protect range lookup and insertion.
> > + *
> > + * Returns:
> > + * Pointer to the GPU SVM range on success, ERR_PTR() on failure.
> > + */
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64 fault_addr,
> > + u64 gpuva_start, u64 gpuva_end,
> > + const struct drm_gpusvm_ctx *ctx)
> > +{
> > + struct drm_gpusvm_notifier *notifier;
> > + struct drm_gpusvm_range *range;
> > + struct mm_struct *mm = gpusvm->mm;
> > + struct vm_area_struct *vas;
> > + bool notifier_alloc = false;
> > + u64 chunk_size;
> > + int err;
> > + bool migrate_devmem;
> > +
> > + if (fault_addr < gpusvm->mm_start ||
> > + fault_addr > gpusvm->mm_start + gpusvm->mm_range) {
> > + err = -EINVAL;
> > + goto err_out;
> > + }
> > +
> > + if (!mmget_not_zero(mm)) {
> > + err = -EFAULT;
> > + goto err_out;
> > + }
> > +
> > + notifier = drm_gpusvm_notifier_find(gpusvm, fault_addr);
> > + if (!notifier) {
> > + notifier = drm_gpusvm_notifier_alloc(gpusvm, fault_addr);
> > + if (IS_ERR(notifier)) {
> > + err = PTR_ERR(notifier);
> > + goto err_mmunlock;
> > + }
> > + notifier_alloc = true;
> > + err = mmu_interval_notifier_insert(¬ifier->notifier,
> > + mm, notifier->interval.start,
> > + notifier->interval.end -
> > + notifier->interval.start,
> > + &drm_gpusvm_notifier_ops);
> > + if (err)
> > + goto err_notifier;
> > + }
> > +
> > + mmap_read_lock(mm);
> > +
> > + vas = vma_lookup(mm, fault_addr);
> > + if (!vas) {
> > + err = -ENOENT;
> > + goto err_notifier_remove;
> > + }
> > +
> > + if (!ctx->read_only && !(vas->vm_flags & VM_WRITE)) {
> > + err = -EPERM;
> > + goto err_notifier_remove;
> > + }
> > +
> > + range = drm_gpusvm_range_find(notifier, fault_addr, fault_addr + 1);
> > + if (range)
> > + goto out_mmunlock;
> > + /*
> > + * XXX: Short-circuiting migration based on migrate_vma_* current
> > + * limitations. If/when migrate_vma_* add more support, this logic will
> > + * have to change.
> > + */
> > + migrate_devmem = ctx->devmem_possible &&
> > + vma_is_anonymous(vas) && !is_vm_hugetlb_page(vas);
> > +
> > + chunk_size = drm_gpusvm_range_chunk_size(gpusvm, notifier, vas,
> > + fault_addr, gpuva_start,
> > + gpuva_end, migrate_devmem &&
> > + ctx->check_pages);
> > + if (chunk_size == LONG_MAX) {
> > + err = -EINVAL;
> > + goto err_notifier_remove;
> > + }
> > +
> > + range = drm_gpusvm_range_alloc(gpusvm, notifier, fault_addr, chunk_size,
> > + migrate_devmem);
> > + if (IS_ERR(range)) {
> > + err = PTR_ERR(range);
> > + goto err_notifier_remove;
> > + }
> > +
> > + drm_gpusvm_range_insert(notifier, range);
> > + if (notifier_alloc)
> > + drm_gpusvm_notifier_insert(gpusvm, notifier);
> > +
> > +out_mmunlock:
> > + mmap_read_unlock(mm);
> > + mmput(mm);
> > +
> > + return range;
> > +
> > +err_notifier_remove:
> > + mmap_read_unlock(mm);
> > + if (notifier_alloc)
> > + mmu_interval_notifier_remove(¬ifier->notifier);
> > +err_notifier:
> > + if (notifier_alloc)
> > + drm_gpusvm_notifier_free(gpusvm, notifier);
> > +err_mmunlock:
> > + mmput(mm);
> > +err_out:
> > + return ERR_PTR(err);
> > +}
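
For reference, the intended caller flow on a GPU page fault looks roughly like
the sketch below. driver_bind_range() is a hypothetical stand-in for the
driver's page-table commit, and the caller is assumed to already hold the
driver-side SVM lock that protects range insertion (per the kernel-doc above):

	struct drm_gpusvm_ctx ctx = { .devmem_possible = true };
	struct drm_gpusvm_range *range;
	int err;

	range = drm_gpusvm_range_find_or_insert(gpusvm, fault_addr,
						gpuva_start, gpuva_end, &ctx);
	if (IS_ERR(range))
		return PTR_ERR(range);

again:
	err = drm_gpusvm_range_get_pages(gpusvm, range, &ctx);
	if (err)
		return err;

	drm_gpusvm_notifier_lock(gpusvm);
	if (!drm_gpusvm_range_pages_valid(gpusvm, range)) {
		/* Raced with an invalidation, collect pages again */
		drm_gpusvm_notifier_unlock(gpusvm);
		goto again;
	}
	err = driver_bind_range(gpusvm, range);	/* hypothetical driver hook */
	drm_gpusvm_notifier_unlock(gpusvm);
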
> > +
> > +/**
> > + * __drm_gpusvm_range_unmap_pages - Unmap pages associated with a GPU SVM range (internal)
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + * @npages: Number of pages to unmap
> > + *
> > + * This function unmaps pages associated with a GPU SVM range. Assumes and
> > + * asserts correct locking is in place when called.
> > + */
> > +static void __drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> > + struct drm_gpusvm_range *range,
> > + unsigned long npages)
> > +{
> > + unsigned long i, j;
> > + struct drm_pagemap *dpagemap = range->dpagemap;
> > + struct device *dev = gpusvm->drm->dev;
> > +
> > + lockdep_assert_held(&gpusvm->notifier_lock);
> > +
> > + if (range->flags.has_dma_mapping) {
> > + for (i = 0, j = 0; i < npages; j++) {
> > + struct drm_pagemap_dma_addr *addr = &range->dma_addr[j];
> > +
> > + if (addr->proto == DRM_INTERCONNECT_SYSTEM) {
> > + dma_unmap_page(dev,
> > + addr->addr,
> > + PAGE_SIZE << addr->order,
> > + addr->dir);
> > + } else if (dpagemap && dpagemap->ops->unmap_dma) {
> > + dpagemap->ops->unmap_dma(dpagemap,
> > + dev,
> > + *addr);
> > + }
> > + i += 1 << addr->order;
> > + }
> > + range->flags.has_devmem_pages = false;
> > + range->flags.has_dma_mapping = false;
> > + range->dpagemap = NULL;
> > + }
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_free_pages - Free pages associated with a GPU SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + *
> > + * This function frees pages associated with a GPU SVM range.
> > + */
> > +static void drm_gpusvm_range_free_pages(struct drm_gpusvm *gpusvm,
> > + struct drm_gpusvm_range *range)
> > +{
> > + lockdep_assert_held(&gpusvm->notifier_lock);
> > +
> > + if (range->dma_addr) {
> > + kvfree(range->dma_addr);
> > + range->dma_addr = NULL;
> > + }
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_remove - Remove GPU SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range to be removed
> > + *
> > + * This function removes the specified GPU SVM range and also removes the parent
> > + * GPU SVM notifier if no more ranges remain in the notifier. The caller must
> > + * hold a lock to protect range and notifier removal.
> > + */
> > +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> > + struct drm_gpusvm_range *range)
> > +{
> > + unsigned long npages = npages_in_range(range->va.start, range->va.end);
> > + struct drm_gpusvm_notifier *notifier;
> > +
> > + notifier = drm_gpusvm_notifier_find(gpusvm, range->va.start);
> > + if (WARN_ON_ONCE(!notifier))
> > + return;
> > +
> > + drm_gpusvm_notifier_lock(gpusvm);
> > + __drm_gpusvm_range_unmap_pages(gpusvm, range, npages);
> > + drm_gpusvm_range_free_pages(gpusvm, range);
> > + __drm_gpusvm_range_remove(notifier, range);
> > + drm_gpusvm_notifier_unlock(gpusvm);
> > +
> > + drm_gpusvm_range_put(range);
> > +
> > + if (RB_EMPTY_ROOT(¬ifier->root.rb_root)) {
> > + if (!notifier->flags.removed)
> > + mmu_interval_notifier_remove(¬ifier->notifier);
> > + drm_gpusvm_notifier_remove(gpusvm, notifier);
> > + drm_gpusvm_notifier_free(gpusvm, notifier);
> > + }
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_get - Get a reference to GPU SVM range
> > + * @range: Pointer to the GPU SVM range
> > + *
> > + * This function increments the reference count of the specified GPU SVM range.
> > + *
> > + * Returns:
> > + * Pointer to the GPU SVM range.
> > + */
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_get(struct drm_gpusvm_range *range)
> > +{
> > + kref_get(&range->refcount);
> > +
> > + return range;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_destroy - Destroy GPU SVM range
> > + * @refcount: Pointer to the reference counter embedded in the GPU SVM range
> > + *
> > + * This function destroys the specified GPU SVM range when its reference count
> > + * reaches zero. If a custom range-free function is provided, it is invoked to
> > + * free the range; otherwise, the range is deallocated using kfree().
> > + */
> > +static void drm_gpusvm_range_destroy(struct kref *refcount)
> > +{
> > + struct drm_gpusvm_range *range =
> > + container_of(refcount, struct drm_gpusvm_range, refcount);
> > + struct drm_gpusvm *gpusvm = range->gpusvm;
> > +
> > + if (gpusvm->ops->range_free)
> > + gpusvm->ops->range_free(range);
> > + else
> > + kfree(range);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_put - Put a reference to GPU SVM range
> > + * @range: Pointer to the GPU SVM range
> > + *
> > + * This function decrements the reference count of the specified GPU SVM range
> > + * and frees it when the count reaches zero.
> > + */
> > +void drm_gpusvm_range_put(struct drm_gpusvm_range *range)
> > +{
> > + kref_put(&range->refcount, drm_gpusvm_range_destroy);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_pages_valid - GPU SVM range pages valid
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + *
> > + * This function determines if a GPU SVM range's pages are valid. It is expected
> > + * to be called holding gpusvm->notifier_lock and as the last step before committing a
> > + * GPU binding.
> > + *
> > + * Returns:
> > + * True if GPU SVM range has valid pages, False otherwise
> > + */
> > +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> > + struct drm_gpusvm_range *range)
> > +{
> > + lockdep_assert_held(&gpusvm->notifier_lock);
> > +
> > + return range->flags.has_devmem_pages || range->flags.has_dma_mapping;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_pages_valid_unlocked - GPU SVM range pages valid unlocked
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + *
> > + * This function determines if a GPU SVM range's pages are valid. It is expected to be
> > + * called without holding gpusvm->notifier_lock.
> > + *
> > + * Returns:
> > + * True if GPU SVM range has valid pages, False otherwise
> > + */
> > +static bool
> > +drm_gpusvm_range_pages_valid_unlocked(struct drm_gpusvm *gpusvm,
> > + struct drm_gpusvm_range *range)
> > +{
> > + bool pages_valid;
> > +
> > + if (!range->dma_addr)
> > + return false;
> > +
> > + drm_gpusvm_notifier_lock(gpusvm);
> > + pages_valid = drm_gpusvm_range_pages_valid(gpusvm, range);
> > + if (!pages_valid)
> > + drm_gpusvm_range_free_pages(gpusvm, range);
> > + drm_gpusvm_notifier_unlock(gpusvm);
> > +
> > + return pages_valid;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_get_pages - Get pages for a GPU SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + * @ctx: GPU SVM context
> > + *
> > + * This function gets pages for a GPU SVM range and ensures they are mapped for
> > + * DMA access.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> > + struct drm_gpusvm_range *range,
> > + const struct drm_gpusvm_ctx *ctx)
> > +{
> > + struct mmu_interval_notifier *notifier = &range->notifier->notifier;
> > + struct hmm_range hmm_range = {
> > + .default_flags = HMM_PFN_REQ_FAULT | (ctx->read_only ? 0 :
> > + HMM_PFN_REQ_WRITE),
> > + .notifier = notifier,
> > + .start = range->va.start,
> > + .end = range->va.end,
> > + .dev_private_owner = gpusvm->device_private_page_owner,
> > + };
> > + struct mm_struct *mm = gpusvm->mm;
> > + struct drm_gpusvm_zdd *zdd;
> > + unsigned long timeout =
> > + jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > + unsigned long i, j;
> > + unsigned long npages = npages_in_range(range->va.start, range->va.end);
> > + unsigned long num_dma_mapped;
> > + unsigned int order = 0;
> > + unsigned long *pfns;
> > + struct page **pages;
> > + int err = 0;
> > + struct dev_pagemap *pagemap = NULL;
> > + struct drm_pagemap *dpagemap = NULL;
> > +
> > +retry:
> > + hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
> > + if (drm_gpusvm_range_pages_valid_unlocked(gpusvm, range))
> > + goto set_seqno;
> > +
> > + pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
> > + if (!pfns)
> > + return -ENOMEM;
> > +
> > + if (!mmget_not_zero(mm)) {
> > + err = -EFAULT;
> > + goto err_out;
> > + }
> > +
> > + hmm_range.hmm_pfns = pfns;
> > + while (true) {
> > + mmap_read_lock(mm);
> > + err = hmm_range_fault(&hmm_range);
> > + mmap_read_unlock(mm);
> > +
> > + if (err == -EBUSY) {
> > + if (time_after(jiffies, timeout))
> > + break;
> > +
> > + hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
> > + continue;
> > + }
> > + break;
> > + }
> > + mmput(mm);
> > + if (err)
> > + goto err_free;
> > +
> > + pages = (struct page **)pfns;
> > +map_pages:
> > + /*
> > + * Perform all DMA mappings under the notifier lock so we do not
> > + * access freed pages. A concurrent notifier invalidation will either
> > + * block on the notifier lock or unmap the DMA mappings.
> > + */
> > + drm_gpusvm_notifier_lock(gpusvm);
> > + if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
> > + drm_gpusvm_notifier_unlock(gpusvm);
> > + goto retry;
> > + }
> > +
> > + if (!range->dma_addr) {
> > + /* Unlock and restart mapping to allocate memory. */
> > + drm_gpusvm_notifier_unlock(gpusvm);
> > + range->dma_addr = kvmalloc_array(npages, sizeof(*range->dma_addr),
> > + GFP_KERNEL);
> > + if (!range->dma_addr) {
> > + err = -ENOMEM;
> > + goto err_free;
> > + }
> > + goto map_pages;
> > + }
> > +
> > + zdd = NULL;
> > + num_dma_mapped = 0;
> > + for (i = 0, j = 0; i < npages; ++j) {
> > + struct page *page = hmm_pfn_to_page(pfns[i]);
> > +
> > + order = hmm_pfn_to_map_order(pfns[i]);
> > + if (is_device_private_page(page) || is_device_coherent_page(page)) {
> > + if (zdd != page->zone_device_data && i > 0) {
> > + err = -EOPNOTSUPP;
> > + goto err_unmap;
> > + }
> > + zdd = page->zone_device_data;
> > + if (pagemap != page->pgmap) {
> > + if (i > 0) {
> > + err = -EOPNOTSUPP;
> > + goto err_unmap;
> > + }
> > +
> > + pagemap = page->pgmap;
> > + dpagemap = zdd->devmem_allocation->dpagemap;
> > + if (drm_WARN_ON(gpusvm->drm, !dpagemap)) {
> > + /*
> > + * Raced. This is not supposed to happen
> > + * since hmm_range_fault() should've migrated
> > + * this page to system.
> > + */
> > + err = -EAGAIN;
> > + goto err_unmap;
> > + }
> > + }
> > + range->dma_addr[j] =
> > + dpagemap->ops->map_dma(dpagemap, gpusvm->drm->dev,
> > + page, order,
> > + DMA_BIDIRECTIONAL);
> > + if (dma_mapping_error(gpusvm->drm->dev, range->dma_addr[j].addr)) {
> > + err = -EFAULT;
> > + goto err_unmap;
> > + }
> > +
> > + pages[i] = page;
> > + } else {
> > + dma_addr_t addr;
> > +
> > + if (is_zone_device_page(page) || zdd) {
> > + err = -EOPNOTSUPP;
> > + goto err_unmap;
> > + }
> > +
> > + addr = dma_map_page(gpusvm->drm->dev,
> > + page, 0,
> > + PAGE_SIZE << order,
> > + DMA_BIDIRECTIONAL);
> > + if (dma_mapping_error(gpusvm->drm->dev, addr)) {
> > + err = -EFAULT;
> > + goto err_unmap;
> > + }
> > +
> > + range->dma_addr[j] = drm_pagemap_dma_addr_encode
> > + (addr, DRM_INTERCONNECT_SYSTEM, order,
> > + DMA_BIDIRECTIONAL);
> > + }
> > + i += 1 << order;
> > + num_dma_mapped = i;
> > + }
> > +
> > + range->flags.has_dma_mapping = true;
> > + if (zdd) {
> > + range->flags.has_devmem_pages = true;
> > + range->dpagemap = dpagemap;
> > + }
> > +
> > + drm_gpusvm_notifier_unlock(gpusvm);
> > + kvfree(pfns);
> > +set_seqno:
> > + range->notifier_seq = hmm_range.notifier_seq;
> > +
> > + return 0;
> > +
> > +err_unmap:
> > + __drm_gpusvm_range_unmap_pages(gpusvm, range, num_dma_mapped);
> > + drm_gpusvm_notifier_unlock(gpusvm);
> > +err_free:
> > + kvfree(pfns);
> > +err_out:
> > + if (err == -EAGAIN)
> > + goto retry;
> > + return err;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_unmap_pages - Unmap pages associated with a GPU SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + * @ctx: GPU SVM context
> > + *
> > + * This function unmaps pages associated with a GPU SVM range. If @in_notifier
> > + * is set, it is assumed that gpusvm->notifier_lock is held in write mode; if it
> > + * is clear, it acquires gpusvm->notifier_lock in read mode. Must be called on
> > + * each GPU SVM range attached to notifier in gpusvm->ops->invalidate for IOMMU
> > + * security model.
> > + */
> > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> > + struct drm_gpusvm_range *range,
> > + const struct drm_gpusvm_ctx *ctx)
> > +{
> > + unsigned long npages = npages_in_range(range->va.start, range->va.end);
> > +
> > + if (ctx->in_notifier)
> > + lockdep_assert_held_write(&gpusvm->notifier_lock);
> > + else
> > + drm_gpusvm_notifier_lock(gpusvm);
> > +
> > + __drm_gpusvm_range_unmap_pages(gpusvm, range, npages);
> > +
> > + if (!ctx->in_notifier)
> > + drm_gpusvm_notifier_unlock(gpusvm);
> > +}
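
Since the kernel-doc above requires each range attached to a notifier to be
unmapped from gpusvm->ops->invalidate, a driver's invalidate callback would
look roughly like this sketch (driver_invalidate_gpu_tlb() is a hypothetical
driver hook; the notifier lock is already held when this is called):

	static void driver_invalidate(struct drm_gpusvm *gpusvm,
				      struct drm_gpusvm_notifier *notifier,
				      const struct mmu_notifier_range *mmu_range)
	{
		struct drm_gpusvm_ctx ctx = { .in_notifier = true };
		struct drm_gpusvm_range *range = NULL;

		/* Zap GPU page tables / TLBs for the invalidated span first */
		driver_invalidate_gpu_tlb(gpusvm, mmu_range->start, mmu_range->end);

		/* Then drop DMA mappings for every range under the notifier */
		drm_gpusvm_for_each_range(range, notifier, mmu_range->start,
					  mmu_range->end)
			drm_gpusvm_range_unmap_pages(gpusvm, range, &ctx);
	}
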
> > +
> > +/**
> > + * drm_gpusvm_migration_put_page - Put a migration page
> > + * @page: Pointer to the page to put
> > + *
> > + * This function unlocks and puts a page.
> > + */
> > +static void drm_gpusvm_migration_put_page(struct page *page)
> > +{
> > + unlock_page(page);
> > + put_page(page);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migration_put_pages - Put migration pages
> > + * @npages: Number of pages
> > + * @migrate_pfn: Array of migrate page frame numbers
> > + *
> > + * This function puts an array of pages.
> > + */
> > +static void drm_gpusvm_migration_put_pages(unsigned long npages,
> > + unsigned long *migrate_pfn)
> > +{
> > + unsigned long i;
> > +
> > + for (i = 0; i < npages; ++i) {
> > + if (!migrate_pfn[i])
> > + continue;
> > +
> > + drm_gpusvm_migration_put_page(migrate_pfn_to_page(migrate_pfn[i]));
> > + migrate_pfn[i] = 0;
> > + }
> > +}
> > +
> > +/**
> > + * drm_gpusvm_get_devmem_page - Get a reference to a device memory page
> > + * @page: Pointer to the page
> > + * @zdd: Pointer to the GPU SVM zone device data
> > + *
> > + * This function associates the given page with the specified GPU SVM zone
> > + * device data and initializes it for zone device usage.
> > + */
> > +static void drm_gpusvm_get_devmem_page(struct page *page,
> > + struct drm_gpusvm_zdd *zdd)
> > +{
> > + page->zone_device_data = drm_gpusvm_zdd_get(zdd);
> > + zone_device_page_init(page);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_map_pages() - Map migration pages for GPU SVM migration
> > + * @dev: The device for which the pages are being mapped
> > + * @dma_addr: Array to store DMA addresses corresponding to mapped pages
> > + * @migrate_pfn: Array of migrate page frame numbers to map
> > + * @npages: Number of pages to map
> > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > + *
> > + * This function maps pages of memory for migration usage in GPU SVM. It
> > + * iterates over each page frame number provided in @migrate_pfn, maps the
> > + * corresponding page, and stores the DMA address in the provided @dma_addr
> > + * array.
> > + *
> > + * Return: 0 on success, -EFAULT if an error occurs during mapping.
> > + */
> > +static int drm_gpusvm_migrate_map_pages(struct device *dev,
> > + dma_addr_t *dma_addr,
> > + unsigned long *migrate_pfn,
> > + unsigned long npages,
> > + enum dma_data_direction dir)
> > +{
> > + unsigned long i;
> > +
> > + for (i = 0; i < npages; ++i) {
> > + struct page *page = migrate_pfn_to_page(migrate_pfn[i]);
> > +
> > + if (!page)
> > + continue;
> > +
> > + if (WARN_ON_ONCE(is_zone_device_page(page)))
> > + return -EFAULT;
> > +
> > + dma_addr[i] = dma_map_page(dev, page, 0, PAGE_SIZE, dir);
> > + if (dma_mapping_error(dev, dma_addr[i]))
> > + return -EFAULT;
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_unmap_pages() - Unmap pages previously mapped for GPU SVM migration
> > + * @dev: The device for which the pages were mapped
> > + * @dma_addr: Array of DMA addresses corresponding to mapped pages
> > + * @npages: Number of pages to unmap
> > + * @dir: Direction of data transfer (e.g., DMA_BIDIRECTIONAL)
> > + *
> > + * This function unmaps previously mapped pages of memory for GPU Shared Virtual
> > + * Memory (SVM). It iterates over each DMA address provided in @dma_addr, checks
> > + * if it's valid and not already unmapped, and unmaps the corresponding page.
> > + */
> > +static void drm_gpusvm_migrate_unmap_pages(struct device *dev,
> > + dma_addr_t *dma_addr,
> > + unsigned long npages,
> > + enum dma_data_direction dir)
> > +{
> > + unsigned long i;
> > +
> > + for (i = 0; i < npages; ++i) {
> > + if (!dma_addr[i] || dma_mapping_error(dev, dma_addr[i]))
> > + continue;
> > +
> > + dma_unmap_page(dev, dma_addr[i], PAGE_SIZE, dir);
> > + }
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_to_devmem - Migrate GPU SVM range to device memory
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range structure
> > + * @devmem_allocation: Pointer to the device memory allocation. The caller
> > + * should hold a reference to the device memory allocation,
> > + * which should be dropped via ops->devmem_release or upon
> > + * the failure of this function.
> > + * @ctx: GPU SVM context
> > + *
> > + * This function migrates the specified GPU SVM range to device memory. It performs the
> > + * necessary setup and invokes the driver-specific operations for migration to
> > + * device memory. Upon successful return, @devmem_allocation can safely reference @range
> > + * until ops->devmem_release is called, which only happens after a successful return.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +int drm_gpusvm_migrate_to_devmem(struct drm_gpusvm *gpusvm,
> > + struct drm_gpusvm_range *range,
> > + struct drm_gpusvm_devmem *devmem_allocation,
> > + const struct drm_gpusvm_ctx *ctx)
> > +{
> > + const struct drm_gpusvm_devmem_ops *ops = devmem_allocation->ops;
> > + u64 start = range->va.start, end = range->va.end;
> > + struct migrate_vma migrate = {
> > + .start = start,
> > + .end = end,
> > + .pgmap_owner = gpusvm->device_private_page_owner,
> > + .flags = MIGRATE_VMA_SELECT_SYSTEM,
> > + };
> > + struct mm_struct *mm = gpusvm->mm;
> > + unsigned long i, npages = npages_in_range(start, end);
> > + struct vm_area_struct *vas;
> > + struct drm_gpusvm_zdd *zdd = NULL;
> > + struct page **pages;
> > + dma_addr_t *dma_addr;
> > + void *buf;
> > + int err;
> > +
> > + if (!range->flags.migrate_devmem)
> > + return -EINVAL;
> > +
> > + if (!ops->populate_devmem_pfn || !ops->copy_to_devmem || !ops->copy_to_ram)
> > + return -EOPNOTSUPP;
> > +
> > + if (!mmget_not_zero(mm)) {
> > + err = -EFAULT;
> > + goto err_out;
> > + }
> > + mmap_read_lock(mm);
> > +
> > + vas = vma_lookup(mm, start);
> > + if (!vas) {
> > + err = -ENOENT;
> > + goto err_mmunlock;
> > + }
> > +
> > + if (end > vas->vm_end || start < vas->vm_start) {
> > + err = -EINVAL;
> > + goto err_mmunlock;
> > + }
> > +
> > + if (!vma_is_anonymous(vas)) {
> > + err = -EBUSY;
> > + goto err_mmunlock;
> > + }
> > +
> > + buf = kvcalloc(npages, 2 * sizeof(*migrate.src) + sizeof(*dma_addr) +
> > + sizeof(*pages), GFP_KERNEL);
> > + if (!buf) {
> > + err = -ENOMEM;
> > + goto err_mmunlock;
> > + }
> > + dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > + pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr)) * npages;
> > +
> > + zdd = drm_gpusvm_zdd_alloc(gpusvm->device_private_page_owner);
> > + if (!zdd) {
> > + err = -ENOMEM;
> > + goto err_free;
> > + }
> > +
> > + migrate.vma = vas;
> > + migrate.src = buf;
> > + migrate.dst = migrate.src + npages;
> > +
> > + err = migrate_vma_setup(&migrate);
> > + if (err)
> > + goto err_free;
> > +
> > + /*
> > + * FIXME: The cases below, !migrate.cpages and migrate.cpages != npages, are
> > + * not always errors. Need to revisit the possible cases and how to handle
> > + * them. We could prefault on migrate.cpages != npages via hmm_range_fault.
> > + */
> > +
> > + if (!migrate.cpages) {
> > + err = -EFAULT;
> > + goto err_free;
> > + }
> > +
> > + if (migrate.cpages != npages) {
> > + err = -EBUSY;
> > + goto err_finalize;
> > + }
> > +
> > + err = ops->populate_devmem_pfn(devmem_allocation, npages, migrate.dst);
> > + if (err)
> > + goto err_finalize;
> > +
> > + err = drm_gpusvm_migrate_map_pages(devmem_allocation->dev, dma_addr,
> > + migrate.src, npages, DMA_TO_DEVICE);
> > + if (err)
> > + goto err_finalize;
> > +
> > + for (i = 0; i < npages; ++i) {
> > + struct page *page = pfn_to_page(migrate.dst[i]);
> > +
> > + pages[i] = page;
> > + migrate.dst[i] = migrate_pfn(migrate.dst[i]);
> > + drm_gpusvm_get_devmem_page(page, zdd);
> > + }
> > +
> > + err = ops->copy_to_devmem(pages, dma_addr, npages);
> > + if (err)
> > + goto err_finalize;
> > +
> > + /* Upon success bind devmem allocation to range and zdd */
> > + WRITE_ONCE(zdd->devmem_allocation, devmem_allocation); /* Owns ref */
> > +
> > +err_finalize:
> > + if (err)
> > + drm_gpusvm_migration_put_pages(npages, migrate.dst);
> > + migrate_vma_pages(&migrate);
> > + migrate_vma_finalize(&migrate);
> > + drm_gpusvm_migrate_unmap_pages(devmem_allocation->dev, dma_addr, npages,
> > + DMA_TO_DEVICE);
> > +err_free:
> > + if (zdd)
> > + drm_gpusvm_zdd_put(zdd);
> > + kvfree(buf);
> > +err_mmunlock:
> > + mmap_read_unlock(mm);
> > + mmput(mm);
> > +err_out:
> > + return err;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_populate_ram_pfn - Populate RAM PFNs for a VM area
> > + * @vas: Pointer to the VM area structure, can be NULL
> > + * @npages: Number of pages to populate
> > + * @mpages: Number of pages to migrate
> > + * @src_mpfn: Source array of migrate PFNs
> > + * @mpfn: Array of migrate PFNs to populate
> > + * @addr: Start address for PFN allocation
> > + *
> > + * This function populates the RAM migrate page frame numbers (PFNs) for the
> > + * specified VM area structure. It allocates and locks pages in the VM area for
> > + * RAM usage. If @vas is non-NULL, alloc_page_vma() is used for allocation;
> > + * if NULL, alloc_page() is used.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +static int drm_gpusvm_migrate_populate_ram_pfn(struct vm_area_struct *vas,
> > + unsigned long npages,
> > + unsigned long *mpages,
> > + unsigned long *src_mpfn,
> > + unsigned long *mpfn, u64 addr)
> > +{
> > + unsigned long i;
> > +
> > + for (i = 0; i < npages; ++i, addr += PAGE_SIZE) {
> > + struct page *page;
> > +
> > + if (!(src_mpfn[i] & MIGRATE_PFN_MIGRATE))
> > + continue;
> > +
> > + if (vas)
> > + page = alloc_page_vma(GFP_HIGHUSER, vas, addr);
> > + else
> > + page = alloc_page(GFP_HIGHUSER);
> > +
> > + if (!page)
> > + return -ENOMEM;
> > +
> > + lock_page(page);
> > + mpfn[i] = migrate_pfn(page_to_pfn(page));
> > + ++*mpages;
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_evict_to_ram - Evict GPU SVM range to RAM
> > + * @devmem_allocation: Pointer to the device memory allocation
> > + *
> > + * Similar to __drm_gpusvm_migrate_to_ram but does not require the mmap lock;
> > + * migration is done via the migrate_device_* functions.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +int drm_gpusvm_evict_to_ram(struct drm_gpusvm_devmem *devmem_allocation)
> > +{
> > + const struct drm_gpusvm_devmem_ops *ops = devmem_allocation->ops;
> > + unsigned long npages, mpages = 0;
> > + struct page **pages;
> > + unsigned long *src, *dst;
> > + dma_addr_t *dma_addr;
> > + void *buf;
> > + int i, err = 0;
> > +
> > + npages = devmem_allocation->size >> PAGE_SHIFT;
> > +
> > +retry:
> > + if (!mmget_not_zero(devmem_allocation->mm))
> > + return -EFAULT;
> > +
> > + buf = kvcalloc(npages, 2 * sizeof(*src) + sizeof(*dma_addr) +
> > + sizeof(*pages), GFP_KERNEL);
> > + if (!buf) {
> > + err = -ENOMEM;
> > + goto err_out;
> > + }
> > + src = buf;
> > + dst = buf + (sizeof(*src) * npages);
> > + dma_addr = buf + (2 * sizeof(*src) * npages);
> > + pages = buf + (2 * sizeof(*src) + sizeof(*dma_addr)) * npages;
> > +
> > + err = ops->populate_devmem_pfn(devmem_allocation, npages, src);
> > + if (err)
> > + goto err_free;
> > +
> > + err = migrate_device_prepopulated_range(src, npages);
> > + if (err)
> > + goto err_free;
> > +
> > + err = drm_gpusvm_migrate_populate_ram_pfn(NULL, npages, &mpages, src,
> > + dst, 0);
> > + if (err || !mpages)
> > + goto err_finalize;
> > +
> > + err = drm_gpusvm_migrate_map_pages(devmem_allocation->dev, dma_addr,
> > + dst, npages, DMA_FROM_DEVICE);
> > + if (err)
> > + goto err_finalize;
> > +
> > + for (i = 0; i < npages; ++i)
> > + pages[i] = migrate_pfn_to_page(src[i]);
> > +
> > + err = ops->copy_to_ram(pages, dma_addr, npages);
> > + if (err)
> > + goto err_finalize;
> > +
> > +err_finalize:
> > + if (err)
> > + drm_gpusvm_migration_put_pages(npages, dst);
> > + migrate_device_pages(src, dst, npages);
> > + migrate_device_finalize(src, dst, npages);
> > + drm_gpusvm_migrate_unmap_pages(devmem_allocation->dev, dma_addr, npages,
> > + DMA_FROM_DEVICE);
> > +err_free:
> > + kvfree(buf);
> > +err_out:
> > + mmput_async(devmem_allocation->mm);
> > + if (!err && !READ_ONCE(devmem_allocation->detached)) {
> > + cond_resched();
> > + goto retry;
> > + }
> > +
> > + return err;
> > +}
> > +
> > +/**
> > + * __drm_gpusvm_migrate_to_ram - Migrate GPU SVM range to RAM (internal)
> > + * @vas: Pointer to the VM area structure
> > + * @device_private_page_owner: Device private pages owner
> > + * @page: Pointer to the page for fault handling (can be NULL)
> > + * @fault_addr: Fault address
> > + * @size: Size of migration
> > + *
> > + * This internal function performs the migration of the specified GPU SVM range
> > + * to RAM. It sets up the migration, populates and DMA-maps the RAM PFNs, and
> > + * invokes the driver-specific operations for migration to RAM.
> > + *
> > + * Returns:
> > + * 0 on success, negative error code on failure.
> > + */
> > +static int __drm_gpusvm_migrate_to_ram(struct vm_area_struct *vas,
> > + void *device_private_page_owner,
> > + struct page *page, u64 fault_addr,
> > + u64 size)
> > +{
> > + struct migrate_vma migrate = {
> > + .vma = vas,
> > + .pgmap_owner = device_private_page_owner,
> > + .flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE |
> > + MIGRATE_VMA_SELECT_DEVICE_COHERENT,
> > + .fault_page = page,
> > + };
> > + struct drm_gpusvm_zdd *zdd;
> > + const struct drm_gpusvm_devmem_ops *ops;
> > + struct device *dev;
> > + unsigned long npages, mpages = 0;
> > + struct page **pages;
> > + dma_addr_t *dma_addr;
> > + u64 start, end;
> > + void *buf;
> > + int i, err = 0;
> > +
> > + start = ALIGN_DOWN(fault_addr, size);
> > + end = ALIGN(fault_addr + 1, size);
> > +
> > + /* Corner case where the VMA has been partially unmapped */
> > + if (start < vas->vm_start)
> > + start = vas->vm_start;
> > + if (end > vas->vm_end)
> > + end = vas->vm_end;
> > +
> > + migrate.start = start;
> > + migrate.end = end;
> > + npages = npages_in_range(start, end);
> > +
> > + buf = kvcalloc(npages, 2 * sizeof(*migrate.src) + sizeof(*dma_addr) +
> > + sizeof(*pages), GFP_KERNEL);
> > + if (!buf) {
> > + err = -ENOMEM;
> > + goto err_out;
> > + }
> > + dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > + pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr)) * npages;
> > +
> > + migrate.vma = vas;
> > + migrate.src = buf;
> > + migrate.dst = migrate.src + npages;
> > +
> > + err = migrate_vma_setup(&migrate);
> > + if (err)
> > + goto err_free;
> > +
> > + /* Raced with another CPU fault, nothing to do */
> > + if (!migrate.cpages)
> > + goto err_free;
> > +
> > + if (!page) {
> > + for (i = 0; i < npages; ++i) {
> > + if (!(migrate.src[i] & MIGRATE_PFN_MIGRATE))
> > + continue;
> > +
> > + page = migrate_pfn_to_page(migrate.src[i]);
> > + break;
> > + }
> > +
> > + if (!page)
> > + goto err_finalize;
> > + }
> > + zdd = page->zone_device_data;
> > + ops = zdd->devmem_allocation->ops;
> > + dev = zdd->devmem_allocation->dev;
> > +
> > + err = drm_gpusvm_migrate_populate_ram_pfn(vas, npages, &mpages,
> > + migrate.src, migrate.dst,
> > + start);
> > + if (err)
> > + goto err_finalize;
> > +
> > + err = drm_gpusvm_migrate_map_pages(dev, dma_addr, migrate.dst, npages,
> > + DMA_FROM_DEVICE);
> > + if (err)
> > + goto err_finalize;
> > +
> > + for (i = 0; i < npages; ++i)
> > + pages[i] = migrate_pfn_to_page(migrate.src[i]);
> > +
> > + err = ops->copy_to_ram(pages, dma_addr, npages);
> > + if (err)
> > + goto err_finalize;
> > +
> > +err_finalize:
> > + if (err)
> > + drm_gpusvm_migration_put_pages(npages, migrate.dst);
> > + migrate_vma_pages(&migrate);
> > + migrate_vma_finalize(&migrate);
> > + drm_gpusvm_migrate_unmap_pages(dev, dma_addr, npages,
> > + DMA_FROM_DEVICE);
> > +err_free:
> > + kvfree(buf);
> > +err_out:
> > +
> > + return err;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_range_evict - Evict GPU SVM range
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @range: Pointer to the GPU SVM range to be removed
> > + *
> > + * This function evicts the specified GPU SVM range.
> > + */
> > +void drm_gpusvm_range_evict(struct drm_gpusvm *gpusvm,
> > + struct drm_gpusvm_range *range)
> > +{
> > + struct mm_struct *mm = gpusvm->mm;
> > + struct vm_area_struct *vas;
> > +
> > + if (!mmget_not_zero(mm))
> > + return;
> > +
> > + mmap_read_lock(mm);
> > + vas = vma_lookup(mm, range->va.start);
> > + if (!vas)
> > + goto unlock;
> > +
> > + __drm_gpusvm_migrate_to_ram(vas, gpusvm->device_private_page_owner,
> > + NULL, range->va.start,
> > + range->va.end - range->va.start);
> > +unlock:
> > + mmap_read_unlock(mm);
> > + mmput(mm);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_page_free - Put GPU SVM zone device data associated with a page
> > + * @page: Pointer to the page
> > + *
> > + * This function is a callback used to put the GPU SVM zone device data
> > + * associated with a page when it is being released.
> > + */
> > +static void drm_gpusvm_page_free(struct page *page)
> > +{
> > + drm_gpusvm_zdd_put(page->zone_device_data);
> > +}
> > +
> > +/**
> > + * drm_gpusvm_migrate_to_ram - Migrate GPU SVM range to RAM (page fault handler)
> > + * @vmf: Pointer to the fault information structure
> > + *
> > + * This function is a page fault handler used to migrate a GPU SVM range to RAM.
> > + * It retrieves the GPU SVM range information from the faulting page and invokes
> > + * the internal migration function to migrate the range back to RAM.
> > + *
> > + * Returns:
> > + * VM_FAULT_SIGBUS on failure, 0 on success.
> > + */
> > +static vm_fault_t drm_gpusvm_migrate_to_ram(struct vm_fault *vmf)
> > +{
> > + struct drm_gpusvm_zdd *zdd = vmf->page->zone_device_data;
> > + int err;
> > +
> > + err = __drm_gpusvm_migrate_to_ram(vmf->vma,
> > + zdd->device_private_page_owner,
> > + vmf->page, vmf->address,
> > + zdd->devmem_allocation->size);
> > +
> > + return err ? VM_FAULT_SIGBUS : 0;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_pagemap_ops - Device page map operations for GPU SVM
> > + */
> > +static const struct dev_pagemap_ops drm_gpusvm_pagemap_ops = {
> > + .page_free = drm_gpusvm_page_free,
> > + .migrate_to_ram = drm_gpusvm_migrate_to_ram,
> > +};
> > +
> > +/**
> > + * drm_gpusvm_pagemap_ops_get - Retrieve GPU SVM device page map operations
> > + *
> > + * Returns:
> > + * Pointer to the GPU SVM device page map operations structure.
> > + */
> > +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void)
> > +{
> > + return &drm_gpusvm_pagemap_ops;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_has_mapping - Check if GPU SVM has mapping for the given address range
> > + * @gpusvm: Pointer to the GPU SVM structure.
> > + * @start: Start address
> > + * @end: End address
> > + *
> > + * Returns:
> > + * True if GPU SVM has mapping, False otherwise
> > + */
> > +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start, u64 end)
> > +{
> > + struct drm_gpusvm_notifier *notifier;
> > +
> > + drm_gpusvm_for_each_notifier(notifier, gpusvm, start, end) {
> > + struct drm_gpusvm_range *range = NULL;
> > +
> > + drm_gpusvm_for_each_range(range, notifier, start, end)
> > + return true;
> > + }
> > +
> > + return false;
> > +}
> > diff --git a/drivers/gpu/drm/xe/drm_gpusvm.h b/drivers/gpu/drm/xe/drm_gpusvm.h
> > new file mode 100644
> > index 000000000000..15ec22d4f9a5
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/drm_gpusvm.h
> > @@ -0,0 +1,447 @@
> > +/* SPDX-License-Identifier: MIT */
> > +/*
> > + * Copyright © 2024 Intel Corporation
> > + */
> > +
> > +#ifndef __DRM_GPUSVM_H__
> > +#define __DRM_GPUSVM_H__
> > +
> > +#include <linux/kref.h>
> > +#include <linux/mmu_notifier.h>
> > +#include <linux/workqueue.h>
> > +
> > +struct dev_pagemap_ops;
> > +struct drm_device;
> > +struct drm_gpusvm;
> > +struct drm_gpusvm_notifier;
> > +struct drm_gpusvm_ops;
> > +struct drm_gpusvm_range;
> > +struct drm_gpusvm_devmem;
> > +struct drm_pagemap;
> > +struct drm_pagemap_dma_addr;
> > +
> > +/**
> > + * struct drm_gpusvm_devmem_ops - Operations structure for GPU SVM device memory
> > + *
> > + * This structure defines the operations for GPU Shared Virtual Memory (SVM)
> > + * device memory. These operations are provided by the GPU driver to manage device memory
> > + * allocations and perform operations such as migration between device memory and system
> > + * RAM.
> > + */
> > +struct drm_gpusvm_devmem_ops {
> > + /**
> > + * @devmem_release: Release device memory allocation (optional)
> > + * @devmem_allocation: device memory allocation
> > + *
> > + * This function shall release the device memory allocation and is expected to
> > + * drop a reference to the device memory allocation.
> > + */
> > + void (*devmem_release)(struct drm_gpusvm_devmem *devmem_allocation);
> > +
> > + /**
> > + * @populate_devmem_pfn: Populate device memory PFN (required for migration)
> > + * @devmem_allocation: device memory allocation
> > + * @npages: Number of pages to populate
> > + * @pfn: Array of page frame numbers to populate
> > + *
> > + * This function shall populate device memory page frame numbers (PFN).
> > + *
> > + * Returns:
> > + * 0 on success, a negative error code on failure.
> > + */
> > + int (*populate_devmem_pfn)(struct drm_gpusvm_devmem *devmem_allocation,
> > + unsigned long npages, unsigned long *pfn);
> > +
> > + /**
> > + * @copy_to_devmem: Copy to device memory (required for migration)
> > + * @pages: Pointer to array of device memory pages (destination)
> > + * @dma_addr: Pointer to array of DMA addresses (source)
> > + * @npages: Number of pages to copy
> > + *
> > + * This function shall copy pages to device memory.
> > + *
> > + * Returns:
> > + * 0 on success, a negative error code on failure.
> > + */
> > + int (*copy_to_devmem)(struct page **pages,
> > + dma_addr_t *dma_addr,
> > + unsigned long npages);
> > +
> > + /**
> > + * @copy_to_ram: Copy to system RAM (required for migration)
> > + * @pages: Pointer to array of device memory pages (source)
> > + * @dma_addr: Pointer to array of DMA addresses (destination)
> > + * @npages: Number of pages to copy
> > + *
> > + * This function shall copy pages to system RAM.
> > + *
> > + * Returns:
> > + * 0 on success, a negative error code on failure.
> > + */
> > + int (*copy_to_ram)(struct page **pages,
> > + dma_addr_t *dma_addr,
> > + unsigned long npages);
> > +};
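
Wiring this up in a driver amounts to filling in the ops table. The stubs
below are shape-only placeholders (hypothetical names, returning -ENODEV
instead of doing real copies) just to show the expected signatures; a real
driver would back them with its VRAM allocator and blitter or CPU copy paths:

	static int driver_populate_devmem_pfn(struct drm_gpusvm_devmem *devmem_allocation,
					      unsigned long npages, unsigned long *pfn)
	{
		return -ENODEV;	/* fill pfn[0..npages) from the device pagemap */
	}

	static int driver_copy_to_devmem(struct page **pages, dma_addr_t *dma_addr,
					 unsigned long npages)
	{
		return -ENODEV;	/* copy dma_addr[] (system RAM) into pages[] (VRAM) */
	}

	static int driver_copy_to_ram(struct page **pages, dma_addr_t *dma_addr,
				      unsigned long npages)
	{
		return -ENODEV;	/* copy pages[] (VRAM) out to dma_addr[] (system RAM) */
	}

	static const struct drm_gpusvm_devmem_ops driver_devmem_ops = {
		.populate_devmem_pfn	= driver_populate_devmem_pfn,
		.copy_to_devmem		= driver_copy_to_devmem,
		.copy_to_ram		= driver_copy_to_ram,
	};
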
> > +
> > +/**
> > + * struct drm_gpusvm_devmem - Structure representing a GPU SVM device memory allocation
> > + *
> > + * @dev: Pointer to the device structure to which the device memory allocation belongs
> > + * @mm: Pointer to the mm_struct for the address space
> > + * @ops: Pointer to the operations structure for GPU SVM device memory
> > + * @dpagemap: The struct drm_pagemap of the pages this allocation belongs to.
> > + * @size: Size of device memory allocation
> > + * @detached: device memory allocation is detached from device pages
> > + */
> > +struct drm_gpusvm_devmem {
> > + struct device *dev;
> > + struct mm_struct *mm;
> > + const struct drm_gpusvm_devmem_ops *ops;
> > + struct drm_pagemap *dpagemap;
> > + size_t size;
> > + bool detached;
> > +};
> > +
> > +/**
> > + * struct drm_gpusvm_ops - Operations structure for GPU SVM
> > + *
> > + * This structure defines the operations for GPU Shared Virtual Memory (SVM).
> > + * These operations are provided by the GPU driver to manage SVM ranges and
> > + * notifiers.
> > + */
> > +struct drm_gpusvm_ops {
> > + /**
> > + * @notifier_alloc: Allocate a GPU SVM notifier (optional)
> > + *
> > + * This function shall allocate a GPU SVM notifier.
> > + *
> > + * Returns:
> > + * Pointer to the allocated GPU SVM notifier on success, NULL on failure.
> > + */
> > + struct drm_gpusvm_notifier *(*notifier_alloc)(void);
> > +
> > + /**
> > + * @notifier_free: Free a GPU SVM notifier (optional)
> > + * @notifier: Pointer to the GPU SVM notifier to be freed
> > + *
> > + * This function shall free a GPU SVM notifier.
> > + */
> > + void (*notifier_free)(struct drm_gpusvm_notifier *notifier);
> > +
> > + /**
> > + * @range_alloc: Allocate a GPU SVM range (optional)
> > + * @gpusvm: Pointer to the GPU SVM
> > + *
> > + * This function shall allocate a GPU SVM range.
> > + *
> > + * Returns:
> > + * Pointer to the allocated GPU SVM range on success, NULL on failure.
> > + */
> > + struct drm_gpusvm_range *(*range_alloc)(struct drm_gpusvm *gpusvm);
> > +
> > + /**
> > + * @range_free: Free a GPU SVM range (optional)
> > + * @range: Pointer to the GPU SVM range to be freed
> > + *
> > + * This function shall free a GPU SVM range.
> > + */
> > + void (*range_free)(struct drm_gpusvm_range *range);
> > +
> > + /**
> > + * @invalidate: Invalidate GPU SVM notifier (required)
> > + * @gpusvm: Pointer to the GPU SVM
> > + * @notifier: Pointer to the GPU SVM notifier
> > + * @mmu_range: Pointer to the mmu_notifier_range structure
> > + *
> > + * This function shall invalidate the GPU page tables. It can safely
> > + * walk the notifier range RB tree/list in this function. Called while
> > + * holding the notifier lock.
> > + */
> > + void (*invalidate)(struct drm_gpusvm *gpusvm,
> > + struct drm_gpusvm_notifier *notifier,
> > + const struct mmu_notifier_range *mmu_range);
> > +};
> > +
> > +/**
> > + * struct drm_gpusvm_notifier - Structure representing a GPU SVM notifier
> > + *
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: MMU interval notifier
> > + * @interval: Interval for the notifier
> > + * @rb: Red-black tree node for the parent GPU SVM structure notifier tree
> > + * @root: Cached root node of the RB tree containing ranges
> > + * @range_list: List head of ranges in the same order they appear in the
> > + * interval tree. This is useful to keep iterating over ranges
> > + * while modifying the RB tree.
> > + * @flags.removed: Flag indicating whether the MMU interval notifier has been
> > + * removed
> > + *
> > + * This structure represents a GPU SVM notifier.
> > + */
> > +struct drm_gpusvm_notifier {
> > + struct drm_gpusvm *gpusvm;
> > + struct mmu_interval_notifier notifier;
> > + struct {
> > + u64 start;
> > + u64 end;
> > + } interval;
> > + struct {
> > + struct rb_node node;
> > + struct list_head entry;
> > + u64 __subtree_last;
> > + } rb;
> > + struct rb_root_cached root;
> > + struct list_head range_list;
> > + struct {
> > + u32 removed : 1;
> > + } flags;
> > +};
> > +
> > +/**
> > + * struct drm_gpusvm_range - Structure representing a GPU SVM range
> > + *
> > + * @gpusvm: Pointer to the GPU SVM structure
> > + * @notifier: Pointer to the GPU SVM notifier
> > + * @refcount: Reference count for the range
> > + * @rb: Red-black tree node for the parent GPU SVM notifier structure range tree
> > + * @va: Virtual address range
> > + * @notifier_seq: Notifier sequence number of the range's pages
> > + * @dma_addr: DMA address array
> > + * @dpagemap: The struct drm_pagemap of the device pages we're dma-mapping.
> > + * Note this is assuming only one drm_pagemap per range is allowed.
> > + * @flags.migrate_devmem: Flag indicating whether the range can be migrated to device memory
> > + * @flags.unmapped: Flag indicating if the range has been unmapped
> > + * @flags.partial_unmap: Flag indicating if the range has been partially unmapped
> > + * @flags.has_devmem_pages: Flag indicating if the range has devmem pages
> > + * @flags.has_dma_mapping: Flag indicating if the range has a DMA mapping
> > + *
> > + * This structure represents a GPU SVM range used for tracking memory ranges
> > + * mapped in a DRM device.
> > + */
> > +struct drm_gpusvm_range {
> > + struct drm_gpusvm *gpusvm;
> > + struct drm_gpusvm_notifier *notifier;
> > + struct kref refcount;
> > + struct {
> > + struct rb_node node;
> > + struct list_head entry;
> > + u64 __subtree_last;
> > + } rb;
> > + struct {
> > + u64 start;
> > + u64 end;
> > + } va;
> > + unsigned long notifier_seq;
> > + struct drm_pagemap_dma_addr *dma_addr;
> > + struct drm_pagemap *dpagemap;
> > + struct {
> > + /* All flags below must be set upon creation */
> > + u16 migrate_devmem : 1;
> > + /* All flags below must be set / cleared under notifier lock */
> > + u16 unmapped : 1;
> > + u16 partial_unmap : 1;
> > + u16 has_devmem_pages : 1;
> > + u16 has_dma_mapping : 1;
> > + } flags;
> > +};
> > +
> > +/**
> > + * struct drm_gpusvm - GPU SVM structure
> > + *
> > + * @name: Name of the GPU SVM
> > + * @drm: Pointer to the DRM device structure
> > + * @mm: Pointer to the mm_struct for the address space
> > + * @device_private_page_owner: Device private pages owner
> > + * @mm_start: Start address of GPU SVM
> > + * @mm_range: Range of the GPU SVM
> > + * @notifier_size: Size of individual notifiers
> > + * @ops: Pointer to the operations structure for GPU SVM
> > + * @chunk_sizes: Pointer to the array of chunk sizes used in range allocation.
> > + * Entries should be powers of 2 in descending order.
> > + * @num_chunks: Number of chunks
> > + * @notifier_lock: Read-write semaphore for protecting notifier operations
> > + * @root: Cached root node of the Red-Black tree containing GPU SVM notifiers
> > + * @notifier_list: list head of notifiers in the same order they appear in the
> > + * interval tree. This is useful to keep iterating over
> > + * notifiers while modifying the RB tree.
> > + *
> > + * This structure represents a GPU SVM (Shared Virtual Memory) used for tracking
> > + * memory ranges mapped in a DRM (Direct Rendering Manager) device.
> > + *
> > + * No reference counting is provided, as this is expected to be embedded in the
> > + * driver VM structure along with the struct drm_gpuvm, which handles reference
> > + * counting.
> > + */
> > +struct drm_gpusvm {
> > + const char *name;
> > + struct drm_device *drm;
> > + struct mm_struct *mm;
> > + void *device_private_page_owner;
> > + u64 mm_start;
> > + u64 mm_range;
> > + u64 notifier_size;
> > + const struct drm_gpusvm_ops *ops;
> > + const u64 *chunk_sizes;
> > + int num_chunks;
> > + struct rw_semaphore notifier_lock;
> > + struct rb_root_cached root;
> > + struct list_head notifier_list;
> > +};
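
Tying the struct members above to the init call, an embedding driver would do
something like the following (all driver-side names and sizes here are
hypothetical; driver_svm_ops and driver_chunk_sizes are whatever the driver
defines):

	err = drm_gpusvm_init(&vm->svm.gpusvm, "driver-svm", &xe->drm,
			      current->mm, NULL /* no devmem owner */,
			      0, 1ull << 47 /* mirrored VA range */,
			      SZ_512M /* notifier size */, &driver_svm_ops,
			      driver_chunk_sizes, ARRAY_SIZE(driver_chunk_sizes));
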
> > +
> > +/**
> > + * struct drm_gpusvm_ctx - DRM GPU SVM context
> > + *
> > + * @in_notifier: entering from a MMU notifier
> > + * @read_only: operating on read-only memory
> > + * @devmem_possible: possible to use device memory
> > + * @check_pages: check pages and only create range for pages faulted in
> > + *
> > + * Context that DRM GPU SVM is operating in (i.e. user arguments).
> > + */
> > +struct drm_gpusvm_ctx {
> > + u32 in_notifier :1;
> > + u32 read_only :1;
> > + u32 devmem_possible :1;
> > + u32 check_pages :1;
> > +};
> > +
> > +int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
> > + const char *name, struct drm_device *drm,
> > + struct mm_struct *mm, void *device_private_page_owner,
> > + u64 mm_start, u64 mm_range, u64 notifier_size,
> > + const struct drm_gpusvm_ops *ops,
> > + const u64 *chunk_sizes, int num_chunks);
> > +void drm_gpusvm_fini(struct drm_gpusvm *gpusvm);
> > +void drm_gpusvm_free(struct drm_gpusvm *gpusvm);
> > +
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_find_or_insert(struct drm_gpusvm *gpusvm, u64 fault_addr,
> > + u64 gpuva_start, u64 gpuva_end,
> > + const struct drm_gpusvm_ctx *ctx);
> > +void drm_gpusvm_range_remove(struct drm_gpusvm *gpusvm,
> > + struct drm_gpusvm_range *range);
> > +void drm_gpusvm_range_evict(struct drm_gpusvm *gpusvm,
> > + struct drm_gpusvm_range *range);
> > +
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_get(struct drm_gpusvm_range *range);
> > +void drm_gpusvm_range_put(struct drm_gpusvm_range *range);
> > +
> > +bool drm_gpusvm_range_pages_valid(struct drm_gpusvm *gpusvm,
> > + struct drm_gpusvm_range *range);
> > +
> > +int drm_gpusvm_range_get_pages(struct drm_gpusvm *gpusvm,
> > + struct drm_gpusvm_range *range,
> > + const struct drm_gpusvm_ctx *ctx);
> > +void drm_gpusvm_range_unmap_pages(struct drm_gpusvm *gpusvm,
> > + struct drm_gpusvm_range *range,
> > + const struct drm_gpusvm_ctx *ctx);
> > +
> > +int drm_gpusvm_migrate_to_devmem(struct drm_gpusvm *gpusvm,
> > + struct drm_gpusvm_range *range,
> > + struct drm_gpusvm_devmem *devmem_allocation,
> > + const struct drm_gpusvm_ctx *ctx);
> > +int drm_gpusvm_evict_to_ram(struct drm_gpusvm_devmem *devmem_allocation);
> > +
> > +const struct dev_pagemap_ops *drm_gpusvm_pagemap_ops_get(void);
> > +
> > +bool drm_gpusvm_has_mapping(struct drm_gpusvm *gpusvm, u64 start, u64 end);
> > +
> > +struct drm_gpusvm_range *
> > +drm_gpusvm_range_find(struct drm_gpusvm_notifier *notifier, u64 start, u64 end);
> > +
> > +/**
> > + * drm_gpusvm_notifier_lock - Lock GPU SVM notifier
> > + * @gpusvm__: Pointer to the GPU SVM structure.
> > + *
> > + * Abstracts client usage of the GPU SVM notifier lock; takes the lock.
> > + */
> > +#define drm_gpusvm_notifier_lock(gpusvm__) \
> > + down_read(&(gpusvm__)->notifier_lock)
> > +
> > +/**
> > + * drm_gpusvm_notifier_unlock - Unlock GPU SVM notifier
> > + * @gpusvm__: Pointer to the GPU SVM structure.
> > + *
> > + * Abstracts client usage of the GPU SVM notifier lock; drops the lock.
> > + */
> > +#define drm_gpusvm_notifier_unlock(gpusvm__) \
> > + up_read(&(gpusvm__)->notifier_lock)
> > +
> > +/**
> > + * __drm_gpusvm_range_next - Get the next GPU SVM range in the list
> > + * @range: a pointer to the current GPU SVM range
> > + *
> > + * Return: A pointer to the next drm_gpusvm_range if available, or NULL if the
> > + * current range is the last one or if the input range is NULL.
> > + */
> > +static inline struct drm_gpusvm_range *
> > +__drm_gpusvm_range_next(struct drm_gpusvm_range *range)
> > +{
> > + if (range && !list_is_last(&range->rb.entry,
> > + &range->notifier->range_list))
> > + return list_next_entry(range, rb.entry);
> > +
> > + return NULL;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_for_each_range - Iterate over GPU SVM ranges in a notifier
> > + * @range__: Iterator variable for the ranges. If set, it indicates the start of
> > + * the iterator. If NULL, call drm_gpusvm_range_find() to get the range.
> > + * @notifier__: Pointer to the GPU SVM notifier
> > + * @start__: Start address of the range
> > + * @end__: End address of the range
> > + *
> > + * This macro is used to iterate over GPU SVM ranges in a notifier. It is safe
> > + * to use while holding the driver SVM lock or the notifier lock.
> > + */
> > +#define drm_gpusvm_for_each_range(range__, notifier__, start__, end__) \
> > + for ((range__) = (range__) ?: \
> > + drm_gpusvm_range_find((notifier__), (start__), (end__)); \
> > + (range__) && (range__->va.start < (end__)); \
> > + (range__) = __drm_gpusvm_range_next(range__))
> > +
> > +/**
> > + * drm_gpusvm_range_set_unmapped - Mark a GPU SVM range as unmapped
> > + * @range: Pointer to the GPU SVM range structure.
> > + * @mmu_range: Pointer to the MMU notifier range structure.
> > + *
> > + * This function marks a GPU SVM range as unmapped and sets the partial_unmap flag
> > + * if the range partially falls within the provided MMU notifier range.
> > + */
> > +static inline void
> > +drm_gpusvm_range_set_unmapped(struct drm_gpusvm_range *range,
> > + const struct mmu_notifier_range *mmu_range)
> > +{
> > + lockdep_assert_held_write(&range->gpusvm->notifier_lock);
> > +
> > + range->flags.unmapped = true;
> > + if (range->va.start < mmu_range->start ||
> > + range->va.end > mmu_range->end)
> > + range->flags.partial_unmap = true;
> > +}
> > +
> > +/**
> > + * drm_gpusvm_devmem_init - Initialize a GPU SVM device memory allocation
> > + *
> > + * @devmem_allocation: Pointer to the struct drm_gpusvm_devmem to initialize
> > + * @dev: Pointer to the device structure to which the device memory allocation belongs
> > + * @mm: Pointer to the mm_struct for the address space
> > + * @ops: Pointer to the operations structure for GPU SVM device memory
> > + * @dpagemap: The struct drm_pagemap we're allocating from.
> > + * @size: Size of device memory allocation
> > + */
> > +static inline void
> > +drm_gpusvm_devmem_init(struct drm_gpusvm_devmem *devmem_allocation,
> > + struct device *dev, struct mm_struct *mm,
> > + const struct drm_gpusvm_devmem_ops *ops,
> > + struct drm_pagemap *dpagemap, size_t size)
> > +{
> > + devmem_allocation->dev = dev;
> > + devmem_allocation->mm = mm;
> > + devmem_allocation->ops = ops;
> > + devmem_allocation->dpagemap = dpagemap;
> > + devmem_allocation->size = size;
> > +}
> > +
> > +#endif /* __DRM_GPUSVM_H__ */
>
^ permalink raw reply [flat|nested] 129+ messages in thread
* [PATCH v2 06/29] drm/xe/uapi: Add DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR flag
2024-10-16 3:24 [PATCH v2 00/29] Introduce GPU SVM and Xe SVM implementation Matthew Brost
` (4 preceding siblings ...)
2024-10-16 3:24 ` [PATCH v2 05/29] drm/gpusvm: Add support for GPU Shared Virtual Memory Matthew Brost
@ 2024-10-16 3:24 ` Matthew Brost
2024-11-18 13:44 ` Thomas Hellström
2024-10-16 3:24 ` [PATCH v2 07/29] drm/xe: Add SVM init / close / fini to faulting VMs Matthew Brost
` (25 subsequent siblings)
31 siblings, 1 reply; 129+ messages in thread
From: Matthew Brost @ 2024-10-16 3:24 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
Add the DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR flag, which is used to
create unpopulated virtual memory areas (VMAs) without memory backing or
GPU page tables. These VMAs are referred to as system allocator VMAs.
The idea is that upon a page fault or prefetch, the memory backing and
GPU page tables will be populated.
System allocator VMAs only update GPUVM state; they do not have an
internal page table (PT) state, nor do they have GPU mappings.
It is expected that system allocator VMAs will be mixed with buffer
object (BO) VMAs within a single VM. In other words, system allocations
and runtime allocations can be mixed within a single user-mode driver
(UMD) program.
Expected usage:
- Bind the entire virtual address (VA) space upon program load using the
DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR flag.
- If a buffer object (BO) requires GPU mapping, allocate an address
using malloc, and bind the BO to the malloc'd address using existing
bind IOCTLs (runtime allocation).
- If a BO no longer requires GPU mapping, bind the mapping address with
the DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR flag.
- Any malloc'd address accessed by the GPU will be faulted in via the
SVM implementation (system allocation).
- Upon freeing any malloc'd data, the SVM implementation will remove GPU
mappings.
Only a 1:1 mapping between the user address space and the GPU address
space is supported at the moment, as that is the expected use case. The
uAPI defines an interface for non-1:1 mappings but enforces 1:1; this
restriction can be lifted if use cases arise for non-1:1 mappings.
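
For illustration, binding the whole VA range as a system allocator VMA from
userspace would look roughly like the sketch below (untested; struct and
field names follow the existing Xe uAPI in include/uapi/drm/xe_drm.h, and
fd / vm_id / va_size are assumed to already exist):

	struct drm_xe_vm_bind bind = {
		.vm_id = vm_id,
		.num_binds = 1,
		.bind = {
			.op = DRM_XE_VM_BIND_OP_MAP,
			.flags = DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR,
			.obj = 0,		/* no BO backing */
			.addr = 0,
			.range = va_size,	/* entire usable VA space */
		},
	};

	if (ioctl(fd, DRM_IOCTL_XE_VM_BIND, &bind))
		/* handle errno */;
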
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
drivers/gpu/drm/xe/xe_pt.c | 76 +++++++++++++++++-----
drivers/gpu/drm/xe/xe_vm.c | 107 ++++++++++++++++++++-----------
drivers/gpu/drm/xe/xe_vm.h | 8 ++-
drivers/gpu/drm/xe/xe_vm_types.h | 3 +
include/uapi/drm/xe_drm.h | 19 +++++-
5 files changed, 157 insertions(+), 56 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
index f27f579f4d85..39357e829b6d 100644
--- a/drivers/gpu/drm/xe/xe_pt.c
+++ b/drivers/gpu/drm/xe/xe_pt.c
@@ -1068,6 +1068,11 @@ static int op_add_deps(struct xe_vm *vm, struct xe_vma_op *op,
{
int err = 0;
+ /*
+ * No need to check for is_system_allocator here as vma_add_deps is a
+ * NOP if VMA is_system_allocator
+ */
+
switch (op->base.op) {
case DRM_GPUVA_OP_MAP:
if (!op->map.immediate && xe_vm_in_fault_mode(vm))
@@ -1646,6 +1651,7 @@ static int bind_op_prepare(struct xe_vm *vm, struct xe_tile *tile,
struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[current_op];
int err;
+ xe_tile_assert(tile, !xe_vma_is_system_allocator(vma));
xe_bo_assert_held(xe_vma_bo(vma));
vm_dbg(&xe_vma_vm(vma)->xe->drm,
@@ -1713,6 +1719,7 @@ static int unbind_op_prepare(struct xe_tile *tile,
if (!((vma->tile_present | vma->tile_staged) & BIT(tile->id)))
return 0;
+ xe_tile_assert(tile, !xe_vma_is_system_allocator(vma));
xe_bo_assert_held(xe_vma_bo(vma));
vm_dbg(&xe_vma_vm(vma)->xe->drm,
@@ -1759,15 +1766,21 @@ static int op_prepare(struct xe_vm *vm,
switch (op->base.op) {
case DRM_GPUVA_OP_MAP:
- if (!op->map.immediate && xe_vm_in_fault_mode(vm))
+ if ((!op->map.immediate && xe_vm_in_fault_mode(vm)) ||
+ op->map.is_system_allocator)
break;
err = bind_op_prepare(vm, tile, pt_update_ops, op->map.vma);
pt_update_ops->wait_vm_kernel = true;
break;
case DRM_GPUVA_OP_REMAP:
- err = unbind_op_prepare(tile, pt_update_ops,
- gpuva_to_vma(op->base.remap.unmap->va));
+ {
+ struct xe_vma *old = gpuva_to_vma(op->base.remap.unmap->va);
+
+ if (xe_vma_is_system_allocator(old))
+ break;
+
+ err = unbind_op_prepare(tile, pt_update_ops, old);
if (!err && op->remap.prev) {
err = bind_op_prepare(vm, tile, pt_update_ops,
@@ -1780,15 +1793,28 @@ static int op_prepare(struct xe_vm *vm,
pt_update_ops->wait_vm_bookkeep = true;
}
break;
+ }
case DRM_GPUVA_OP_UNMAP:
- err = unbind_op_prepare(tile, pt_update_ops,
- gpuva_to_vma(op->base.unmap.va));
+ {
+ struct xe_vma *vma = gpuva_to_vma(op->base.unmap.va);
+
+ if (xe_vma_is_system_allocator(vma))
+ break;
+
+ err = unbind_op_prepare(tile, pt_update_ops, vma);
break;
+ }
case DRM_GPUVA_OP_PREFETCH:
- err = bind_op_prepare(vm, tile, pt_update_ops,
- gpuva_to_vma(op->base.prefetch.va));
+ {
+ struct xe_vma *vma = gpuva_to_vma(op->base.prefetch.va);
+
+ if (xe_vma_is_system_allocator(vma))
+ break;
+
+ err = bind_op_prepare(vm, tile, pt_update_ops, vma);
pt_update_ops->wait_vm_kernel = true;
break;
+ }
default:
drm_warn(&vm->xe->drm, "NOT POSSIBLE");
}
@@ -1857,6 +1883,8 @@ static void bind_op_commit(struct xe_vm *vm, struct xe_tile *tile,
struct xe_vma *vma, struct dma_fence *fence,
struct dma_fence *fence2)
{
+ xe_tile_assert(tile, !xe_vma_is_system_allocator(vma));
+
if (!xe_vma_has_no_bo(vma) && !xe_vma_bo(vma)->vm) {
dma_resv_add_fence(xe_vma_bo(vma)->ttm.base.resv, fence,
pt_update_ops->wait_vm_bookkeep ?
@@ -1890,6 +1918,8 @@ static void unbind_op_commit(struct xe_vm *vm, struct xe_tile *tile,
struct xe_vma *vma, struct dma_fence *fence,
struct dma_fence *fence2)
{
+ xe_tile_assert(tile, !xe_vma_is_system_allocator(vma));
+
if (!xe_vma_has_no_bo(vma) && !xe_vma_bo(vma)->vm) {
dma_resv_add_fence(xe_vma_bo(vma)->ttm.base.resv, fence,
pt_update_ops->wait_vm_bookkeep ?
@@ -1924,16 +1954,21 @@ static void op_commit(struct xe_vm *vm,
switch (op->base.op) {
case DRM_GPUVA_OP_MAP:
- if (!op->map.immediate && xe_vm_in_fault_mode(vm))
+ if ((!op->map.immediate && xe_vm_in_fault_mode(vm)) ||
+ op->map.is_system_allocator)
break;
bind_op_commit(vm, tile, pt_update_ops, op->map.vma, fence,
fence2);
break;
case DRM_GPUVA_OP_REMAP:
- unbind_op_commit(vm, tile, pt_update_ops,
- gpuva_to_vma(op->base.remap.unmap->va), fence,
- fence2);
+ {
+ struct xe_vma *old = gpuva_to_vma(op->base.remap.unmap->va);
+
+ if (xe_vma_is_system_allocator(old))
+ break;
+
+ unbind_op_commit(vm, tile, pt_update_ops, old, fence, fence2);
if (op->remap.prev)
bind_op_commit(vm, tile, pt_update_ops, op->remap.prev,
@@ -1942,14 +1977,25 @@ static void op_commit(struct xe_vm *vm,
bind_op_commit(vm, tile, pt_update_ops, op->remap.next,
fence, fence2);
break;
+ }
case DRM_GPUVA_OP_UNMAP:
- unbind_op_commit(vm, tile, pt_update_ops,
- gpuva_to_vma(op->base.unmap.va), fence, fence2);
+ {
+ struct xe_vma *vma = gpuva_to_vma(op->base.unmap.va);
+
+ if (!xe_vma_is_system_allocator(vma))
+ unbind_op_commit(vm, tile, pt_update_ops, vma, fence,
+ fence2);
break;
+ }
case DRM_GPUVA_OP_PREFETCH:
- bind_op_commit(vm, tile, pt_update_ops,
- gpuva_to_vma(op->base.prefetch.va), fence, fence2);
+ {
+ struct xe_vma *vma = gpuva_to_vma(op->base.prefetch.va);
+
+ if (!xe_vma_is_system_allocator(vma))
+ bind_op_commit(vm, tile, pt_update_ops, vma, fence,
+ fence2);
break;
+ }
default:
drm_warn(&vm->xe->drm, "NOT POSSIBLE");
}
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index c99380271de6..0d887fb9de59 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -901,9 +901,10 @@ static void xe_vma_free(struct xe_vma *vma)
kfree(vma);
}
-#define VMA_CREATE_FLAG_READ_ONLY BIT(0)
-#define VMA_CREATE_FLAG_IS_NULL BIT(1)
-#define VMA_CREATE_FLAG_DUMPABLE BIT(2)
+#define VMA_CREATE_FLAG_READ_ONLY BIT(0)
+#define VMA_CREATE_FLAG_IS_NULL BIT(1)
+#define VMA_CREATE_FLAG_DUMPABLE BIT(2)
+#define VMA_CREATE_FLAG_IS_SYSTEM_ALLOCATOR BIT(3)
static struct xe_vma *xe_vma_create(struct xe_vm *vm,
struct xe_bo *bo,
@@ -917,6 +918,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
bool read_only = (flags & VMA_CREATE_FLAG_READ_ONLY);
bool is_null = (flags & VMA_CREATE_FLAG_IS_NULL);
bool dumpable = (flags & VMA_CREATE_FLAG_DUMPABLE);
+ bool is_system_allocator =
+ (flags & VMA_CREATE_FLAG_IS_SYSTEM_ALLOCATOR);
xe_assert(vm->xe, start < end);
xe_assert(vm->xe, end < vm->size);
@@ -925,7 +928,7 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
* Allocate and ensure that the xe_vma_is_userptr() return
* matches what was allocated.
*/
- if (!bo && !is_null) {
+ if (!bo && !is_null && !is_system_allocator) {
struct xe_userptr_vma *uvma = kzalloc(sizeof(*uvma), GFP_KERNEL);
if (!uvma)
@@ -937,6 +940,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
if (!vma)
return ERR_PTR(-ENOMEM);
+ if (is_system_allocator)
+ vma->gpuva.flags |= XE_VMA_SYSTEM_ALLOCATOR;
if (is_null)
vma->gpuva.flags |= DRM_GPUVA_SPARSE;
if (bo)
@@ -979,7 +984,7 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
drm_gpuva_link(&vma->gpuva, vm_bo);
drm_gpuvm_bo_put(vm_bo);
} else /* userptr or null */ {
- if (!is_null) {
+ if (!is_null && !is_system_allocator) {
struct xe_userptr *userptr = &to_userptr_vma(vma)->userptr;
u64 size = end - start + 1;
int err;
@@ -1029,7 +1034,7 @@ static void xe_vma_destroy_late(struct xe_vma *vma)
*/
mmu_interval_notifier_remove(&userptr->notifier);
xe_vm_put(vm);
- } else if (xe_vma_is_null(vma)) {
+ } else if (xe_vma_is_null(vma) || xe_vma_is_system_allocator(vma)) {
xe_vm_put(vm);
} else {
xe_bo_put(xe_vma_bo(vma));
@@ -1068,7 +1073,7 @@ static void xe_vma_destroy(struct xe_vma *vma, struct dma_fence *fence)
spin_lock(&vm->userptr.invalidated_lock);
list_del(&to_userptr_vma(vma)->userptr.invalidate_link);
spin_unlock(&vm->userptr.invalidated_lock);
- } else if (!xe_vma_is_null(vma)) {
+ } else if (!xe_vma_is_null(vma) && !xe_vma_is_system_allocator(vma)) {
xe_bo_assert_held(xe_vma_bo(vma));
drm_gpuva_unlink(&vma->gpuva);
@@ -1967,6 +1972,8 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
op->map.read_only =
flags & DRM_XE_VM_BIND_FLAG_READONLY;
op->map.is_null = flags & DRM_XE_VM_BIND_FLAG_NULL;
+ op->map.is_system_allocator = flags &
+ DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR;
op->map.dumpable = flags & DRM_XE_VM_BIND_FLAG_DUMPABLE;
op->map.pat_index = pat_index;
} else if (__op->op == DRM_GPUVA_OP_PREFETCH) {
@@ -2158,6 +2165,8 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
VMA_CREATE_FLAG_IS_NULL : 0;
flags |= op->map.dumpable ?
VMA_CREATE_FLAG_DUMPABLE : 0;
+ flags |= op->map.is_system_allocator ?
+ VMA_CREATE_FLAG_IS_SYSTEM_ALLOCATOR : 0;
vma = new_vma(vm, &op->base.map, op->map.pat_index,
flags);
@@ -2165,7 +2174,8 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
return PTR_ERR(vma);
op->map.vma = vma;
- if (op->map.immediate || !xe_vm_in_fault_mode(vm))
+ if ((op->map.immediate || !xe_vm_in_fault_mode(vm)) &&
+ !op->map.is_system_allocator)
xe_vma_ops_incr_pt_update_ops(vops,
op->tile_mask);
break;
@@ -2174,21 +2184,24 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
{
struct xe_vma *old =
gpuva_to_vma(op->base.remap.unmap->va);
+ bool skip = xe_vma_is_system_allocator(old);
op->remap.start = xe_vma_start(old);
op->remap.range = xe_vma_size(old);
- if (op->base.remap.prev) {
- flags |= op->base.remap.unmap->va->flags &
- XE_VMA_READ_ONLY ?
- VMA_CREATE_FLAG_READ_ONLY : 0;
- flags |= op->base.remap.unmap->va->flags &
- DRM_GPUVA_SPARSE ?
- VMA_CREATE_FLAG_IS_NULL : 0;
- flags |= op->base.remap.unmap->va->flags &
- XE_VMA_DUMPABLE ?
- VMA_CREATE_FLAG_DUMPABLE : 0;
+ flags |= op->base.remap.unmap->va->flags &
+ XE_VMA_READ_ONLY ?
+ VMA_CREATE_FLAG_READ_ONLY : 0;
+ flags |= op->base.remap.unmap->va->flags &
+ DRM_GPUVA_SPARSE ?
+ VMA_CREATE_FLAG_IS_NULL : 0;
+ flags |= op->base.remap.unmap->va->flags &
+ XE_VMA_DUMPABLE ?
+ VMA_CREATE_FLAG_DUMPABLE : 0;
+ flags |= xe_vma_is_system_allocator(old) ?
+ VMA_CREATE_FLAG_IS_SYSTEM_ALLOCATOR : 0;
+ if (op->base.remap.prev) {
vma = new_vma(vm, op->base.remap.prev,
old->pat_index, flags);
if (IS_ERR(vma))
@@ -2200,9 +2213,10 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
* Userptr creates a new SG mapping so
* we must also rebind.
*/
- op->remap.skip_prev = !xe_vma_is_userptr(old) &&
+ op->remap.skip_prev = skip ||
+ (!xe_vma_is_userptr(old) &&
IS_ALIGNED(xe_vma_end(vma),
- xe_vma_max_pte_size(old));
+ xe_vma_max_pte_size(old)));
if (op->remap.skip_prev) {
xe_vma_set_pte_size(vma, xe_vma_max_pte_size(old));
op->remap.range -=
@@ -2218,16 +2232,6 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
}
if (op->base.remap.next) {
- flags |= op->base.remap.unmap->va->flags &
- XE_VMA_READ_ONLY ?
- VMA_CREATE_FLAG_READ_ONLY : 0;
- flags |= op->base.remap.unmap->va->flags &
- DRM_GPUVA_SPARSE ?
- VMA_CREATE_FLAG_IS_NULL : 0;
- flags |= op->base.remap.unmap->va->flags &
- XE_VMA_DUMPABLE ?
- VMA_CREATE_FLAG_DUMPABLE : 0;
-
vma = new_vma(vm, op->base.remap.next,
old->pat_index, flags);
if (IS_ERR(vma))
@@ -2239,9 +2243,10 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
* Userptr creates a new SG mapping so
* we must also rebind.
*/
- op->remap.skip_next = !xe_vma_is_userptr(old) &&
+ op->remap.skip_next = skip ||
+ (!xe_vma_is_userptr(old) &&
IS_ALIGNED(xe_vma_start(vma),
- xe_vma_max_pte_size(old));
+ xe_vma_max_pte_size(old)));
if (op->remap.skip_next) {
xe_vma_set_pte_size(vma, xe_vma_max_pte_size(old));
op->remap.range -=
@@ -2254,14 +2259,27 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
}
}
- xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
+ if (!skip)
+ xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
break;
}
case DRM_GPUVA_OP_UNMAP:
+ {
+ struct xe_vma *vma = gpuva_to_vma(op->base.unmap.va);
+
+ if (!xe_vma_is_system_allocator(vma))
+ xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
+ break;
+ }
case DRM_GPUVA_OP_PREFETCH:
+ {
+ struct xe_vma *vma = gpuva_to_vma(op->base.prefetch.va);
+
/* FIXME: Need to skip some prefetch ops */
- xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
+ if (!xe_vma_is_system_allocator(vma))
+ xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
break;
+ }
default:
drm_warn(&vm->xe->drm, "NOT POSSIBLE");
}
@@ -2702,7 +2720,8 @@ static int vm_bind_ioctl_ops_execute(struct xe_vm *vm,
(DRM_XE_VM_BIND_FLAG_READONLY | \
DRM_XE_VM_BIND_FLAG_IMMEDIATE | \
DRM_XE_VM_BIND_FLAG_NULL | \
- DRM_XE_VM_BIND_FLAG_DUMPABLE)
+ DRM_XE_VM_BIND_FLAG_DUMPABLE | \
+ DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR)
#ifdef TEST_VM_OPS_ERROR
#define SUPPORTED_FLAGS (SUPPORTED_FLAGS_STUB | FORCE_OP_ERROR)
@@ -2757,9 +2776,17 @@ static int vm_bind_ioctl_check_args(struct xe_device *xe,
u64 obj_offset = (*bind_ops)[i].obj_offset;
u32 prefetch_region = (*bind_ops)[i].prefetch_mem_region_instance;
bool is_null = flags & DRM_XE_VM_BIND_FLAG_NULL;
+ bool is_system_allocator = flags &
+ DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR;
u16 pat_index = (*bind_ops)[i].pat_index;
u16 coh_mode;
+ /* FIXME: Disabling system allocator for now */
+ if (XE_IOCTL_DBG(xe, is_system_allocator)) {
+ err = -EOPNOTSUPP;
+ goto free_bind_ops;
+ }
+
if (XE_IOCTL_DBG(xe, pat_index >= xe->pat.n_entries)) {
err = -EINVAL;
goto free_bind_ops;
@@ -2780,13 +2807,14 @@ static int vm_bind_ioctl_check_args(struct xe_device *xe,
if (XE_IOCTL_DBG(xe, op > DRM_XE_VM_BIND_OP_PREFETCH) ||
XE_IOCTL_DBG(xe, flags & ~SUPPORTED_FLAGS) ||
- XE_IOCTL_DBG(xe, obj && is_null) ||
- XE_IOCTL_DBG(xe, obj_offset && is_null) ||
+ XE_IOCTL_DBG(xe, obj && (is_null || is_system_allocator)) ||
+ XE_IOCTL_DBG(xe, obj_offset && (is_null ||
+ is_system_allocator)) ||
XE_IOCTL_DBG(xe, op != DRM_XE_VM_BIND_OP_MAP &&
- is_null) ||
+ (is_null || is_system_allocator)) ||
XE_IOCTL_DBG(xe, !obj &&
op == DRM_XE_VM_BIND_OP_MAP &&
- !is_null) ||
+ !is_null && !is_system_allocator) ||
XE_IOCTL_DBG(xe, !obj &&
op == DRM_XE_VM_BIND_OP_UNMAP_ALL) ||
XE_IOCTL_DBG(xe, addr &&
@@ -3170,6 +3198,7 @@ int xe_vm_invalidate_vma(struct xe_vma *vma)
int ret = 0;
xe_assert(xe, !xe_vma_is_null(vma));
+ xe_assert(xe, !xe_vma_is_system_allocator(vma));
trace_xe_vma_invalidate(vma);
vm_dbg(&xe_vma_vm(vma)->xe->drm,
diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
index c864dba35e1d..1a5aed678214 100644
--- a/drivers/gpu/drm/xe/xe_vm.h
+++ b/drivers/gpu/drm/xe/xe_vm.h
@@ -151,6 +151,11 @@ static inline bool xe_vma_is_null(struct xe_vma *vma)
return vma->gpuva.flags & DRM_GPUVA_SPARSE;
}
+static inline bool xe_vma_is_system_allocator(struct xe_vma *vma)
+{
+ return vma->gpuva.flags & XE_VMA_SYSTEM_ALLOCATOR;
+}
+
static inline bool xe_vma_has_no_bo(struct xe_vma *vma)
{
return !xe_vma_bo(vma);
@@ -158,7 +163,8 @@ static inline bool xe_vma_has_no_bo(struct xe_vma *vma)
static inline bool xe_vma_is_userptr(struct xe_vma *vma)
{
- return xe_vma_has_no_bo(vma) && !xe_vma_is_null(vma);
+ return xe_vma_has_no_bo(vma) && !xe_vma_is_null(vma) &&
+ !xe_vma_is_system_allocator(vma);
}
/**
diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
index 7f9a303e51d8..1764781c376b 100644
--- a/drivers/gpu/drm/xe/xe_vm_types.h
+++ b/drivers/gpu/drm/xe/xe_vm_types.h
@@ -42,6 +42,7 @@ struct xe_vm_pgtable_update_op;
#define XE_VMA_PTE_64K (DRM_GPUVA_USERBITS << 6)
#define XE_VMA_PTE_COMPACT (DRM_GPUVA_USERBITS << 7)
#define XE_VMA_DUMPABLE (DRM_GPUVA_USERBITS << 8)
+#define XE_VMA_SYSTEM_ALLOCATOR (DRM_GPUVA_USERBITS << 9)
/** struct xe_userptr - User pointer */
struct xe_userptr {
@@ -294,6 +295,8 @@ struct xe_vma_op_map {
bool read_only;
/** @is_null: is NULL binding */
bool is_null;
+ /** @is_system_allocator: is system allocator binding */
+ bool is_system_allocator;
/** @dumpable: whether BO is dumped on GPU hang */
bool dumpable;
/** @pat_index: The pat index to use for this operation. */
diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
index c4182e95a619..1e92fd498967 100644
--- a/include/uapi/drm/xe_drm.h
+++ b/include/uapi/drm/xe_drm.h
@@ -906,6 +906,12 @@ struct drm_xe_vm_destroy {
* will only be valid for DRM_XE_VM_BIND_OP_MAP operations, the BO
* handle MBZ, and the BO offset MBZ. This flag is intended to
* implement VK sparse bindings.
+ * - %DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR - When the system allocator flag is
+ * set, no mappings are created; rather, the range is reserved for system
+ * allocations which will be populated on GPU page faults. Only valid on VMs
+ * with DRM_XE_VM_CREATE_FLAG_FAULT_MODE set. The system allocator flag is
+ * only valid for DRM_XE_VM_BIND_OP_MAP operations, the BO handle MBZ, and
+ * the BO offset MBZ.
*/
struct drm_xe_vm_bind_op {
/** @extensions: Pointer to the first extension struct, if any */
@@ -958,7 +964,9 @@ struct drm_xe_vm_bind_op {
* on the @pat_index. For such mappings there is no actual memory being
* mapped (the address in the PTE is invalid), so the various PAT memory
* attributes likely do not apply. Simply leaving as zero is one
- * option (still a valid pat_index).
+ * option (still a valid pat_index). The same applies to
+ * DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR bindings, as for such mappings
+ * there is no actual memory being mapped.
*/
__u16 pat_index;
@@ -974,6 +982,14 @@ struct drm_xe_vm_bind_op {
/** @userptr: user pointer to bind on */
__u64 userptr;
+
+ /**
+ * @system_allocator_offset: Offset from GPU @addr to create
+ * system allocator mappings. MBZ with the current level of support
+ * (i.e. only a 1:1 mapping between GPU and CPU mappings is
+ * supported).
+ */
+ __s64 system_allocator_offset;
};
/**
@@ -996,6 +1012,7 @@ struct drm_xe_vm_bind_op {
#define DRM_XE_VM_BIND_FLAG_IMMEDIATE (1 << 1)
#define DRM_XE_VM_BIND_FLAG_NULL (1 << 2)
#define DRM_XE_VM_BIND_FLAG_DUMPABLE (1 << 3)
+#define DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR (1 << 4)
/** @flags: Bind flags */
__u32 flags;
--
2.34.1
^ permalink raw reply related [flat|nested] 129+ messages in thread* Re: [PATCH v2 06/29] drm/xe/uapi: Add DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR flag
2024-10-16 3:24 ` [PATCH v2 06/29] drm/xe/uapi: Add DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR flag Matthew Brost
@ 2024-11-18 13:44 ` Thomas Hellström
2024-11-19 16:01 ` Matthew Brost
0 siblings, 1 reply; 129+ messages in thread
From: Thomas Hellström @ 2024-11-18 13:44 UTC (permalink / raw)
To: Matthew Brost, intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, simona.vetter, felix.kuehling,
dakr
On Tue, 2024-10-15 at 20:24 -0700, Matthew Brost wrote:
> Add the DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR flag, which is used to
> create unpopulated virtual memory areas (VMAs) without memory backing
> or
> GPU page tables. These VMAs are referred to as system allocator VMAs.
> The idea is that upon a page fault or prefetch, the memory backing
> and
> GPU page tables will be populated.
It would be good if the commit message could describe the state of the
code after this patch. It seems we do a lot more than just add a
flag, but there is no real implementation yet. Perhaps just state that it
adjusts the current code to avoid code-paths that are not taken when the
flag is set?
>
> System allocator VMAs only update GPUVM state; they do not have an
> internal page table (PT) state, nor do they have GPU mappings.
>
> It is expected that system allocator VMAs will be mixed with buffer
> object (BO) VMAs within a single VM. In other words, system
> allocations
> and runtime allocations can be mixed within a single user-mode driver
> (UMD) program.
This sounds like compute API-level terminology describing where the app
gets its buffer objects: System allocator - malloc, Runtime allocator -
the compute runtime (allocating buffer objects under the hood).
Not sure what the best terminology would be, though, but something
along the lines of DRM_XE_VM_BIND_FLAG_CPU_ADDR_MIRROR? (And when settled,
change it inside the code as well.)
Otherwise it gets weird if someone asks why it is called "System
Allocator", and the reply is "a compute runtime would typically use
this functionality when an app has allocated the memory using malloc()
which can be called a system allocator".
IOW we name the functionality based on what KMD does and not how the
app uses it through UMD.
>
> Expected usage:
>
> - Bind the entire virtual address (VA) space upon program load using
> the
> DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR flag.
> - If a buffer object (BO) requires GPU mapping, allocate an address
> using malloc, and bind the BO to the malloc'd address using
> existing
> bind IOCTLs (runtime allocation).
allocate a CPU address using mmap(PROT_NONE), bind the BO to the
mmapped address using the existing bind IOCTLs. If a CPU map of the BO is
needed, mmap it again at the same CPU address using mmap(MAP_FIXED).
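Roughly like the following (a sketch only; it assumes the BO's mmap offset
comes from DRM_IOCTL_XE_GEM_MMAP_OFFSET, the GPU-side bind of the BO to
cpu_addr via the existing bind IOCTL is not shown, and error handling is
omitted):

  #include <sys/mman.h>
  #include <sys/ioctl.h>
  #include <drm/xe_drm.h>

  static void *reserve_and_cpu_map_bo(int fd, __u32 bo_handle, size_t size)
  {
  	struct drm_xe_gem_mmap_offset mmo = { .handle = bo_handle };
  	void *cpu_addr;

  	/* Reserve a CPU address range without any backing */
  	cpu_addr = mmap(NULL, size, PROT_NONE,
  			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

  	/* ... bind the BO to cpu_addr on the GPU with the existing bind IOCTL ... */

  	/* Only if a CPU map of the BO is needed: map it over the same range */
  	ioctl(fd, DRM_IOCTL_XE_GEM_MMAP_OFFSET, &mmo);
  	return mmap(cpu_addr, size, PROT_READ | PROT_WRITE,
  		    MAP_SHARED | MAP_FIXED, fd, mmo.offset);
  }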
> - If a BO no longer requires GPU mapping, bind the mapping address
> with
> the DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR flag.
unmap it from cpu space and then...
> - Any malloc'd address accessed by the GPU will be faulted in via the
> SVM implementation (system allocation).
> - Upon freeing any malloc'd data, the SVM implementation will remove
> GPU
> mappings.
>
> Only supporting 1 to 1 mapping between user address space and GPU
> address space at the moment as that is the expected use case. uAPI
> defines interface for non 1 to 1 but enforces 1 to 1, this
> restriction
> can be lifted if use cases arrise for non 1 to 1 mappings.
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
> drivers/gpu/drm/xe/xe_pt.c | 76 +++++++++++++++++-----
> drivers/gpu/drm/xe/xe_vm.c | 107 ++++++++++++++++++++---------
> --
> drivers/gpu/drm/xe/xe_vm.h | 8 ++-
> drivers/gpu/drm/xe/xe_vm_types.h | 3 +
> include/uapi/drm/xe_drm.h | 19 +++++-
> 5 files changed, 157 insertions(+), 56 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
> index f27f579f4d85..39357e829b6d 100644
> --- a/drivers/gpu/drm/xe/xe_pt.c
> +++ b/drivers/gpu/drm/xe/xe_pt.c
> @@ -1068,6 +1068,11 @@ static int op_add_deps(struct xe_vm *vm,
> struct xe_vma_op *op,
> {
> int err = 0;
>
> + /*
> + * No need to check for is_system_allocator here as
> vma_add_deps is a
> + * NOP if VMA is_system_allocator
> + */
> +
> switch (op->base.op) {
> case DRM_GPUVA_OP_MAP:
> if (!op->map.immediate && xe_vm_in_fault_mode(vm))
> @@ -1646,6 +1651,7 @@ static int bind_op_prepare(struct xe_vm *vm,
> struct xe_tile *tile,
> struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops-
> >ops[current_op];
> int err;
>
> + xe_tile_assert(tile, !xe_vma_is_system_allocator(vma));
> xe_bo_assert_held(xe_vma_bo(vma));
>
> vm_dbg(&xe_vma_vm(vma)->xe->drm,
> @@ -1713,6 +1719,7 @@ static int unbind_op_prepare(struct xe_tile
> *tile,
> if (!((vma->tile_present | vma->tile_staged) & BIT(tile-
> >id)))
> return 0;
>
> + xe_tile_assert(tile, !xe_vma_is_system_allocator(vma));
> xe_bo_assert_held(xe_vma_bo(vma));
>
> vm_dbg(&xe_vma_vm(vma)->xe->drm,
> @@ -1759,15 +1766,21 @@ static int op_prepare(struct xe_vm *vm,
>
> switch (op->base.op) {
> case DRM_GPUVA_OP_MAP:
> - if (!op->map.immediate && xe_vm_in_fault_mode(vm))
> + if ((!op->map.immediate && xe_vm_in_fault_mode(vm))
> ||
> + op->map.is_system_allocator)
> break;
>
> err = bind_op_prepare(vm, tile, pt_update_ops, op-
> >map.vma);
> pt_update_ops->wait_vm_kernel = true;
> break;
> case DRM_GPUVA_OP_REMAP:
> - err = unbind_op_prepare(tile, pt_update_ops,
> - gpuva_to_vma(op-
> >base.remap.unmap->va));
> + {
> + struct xe_vma *old = gpuva_to_vma(op-
> >base.remap.unmap->va);
> +
> + if (xe_vma_is_system_allocator(old))
> + break;
> +
> + err = unbind_op_prepare(tile, pt_update_ops, old);
>
> if (!err && op->remap.prev) {
> err = bind_op_prepare(vm, tile,
> pt_update_ops,
> @@ -1780,15 +1793,28 @@ static int op_prepare(struct xe_vm *vm,
> pt_update_ops->wait_vm_bookkeep = true;
> }
> break;
> + }
> case DRM_GPUVA_OP_UNMAP:
> - err = unbind_op_prepare(tile, pt_update_ops,
> - gpuva_to_vma(op-
> >base.unmap.va));
> + {
> + struct xe_vma *vma = gpuva_to_vma(op-
> >base.unmap.va);
> +
> + if (xe_vma_is_system_allocator(vma))
> + break;
> +
> + err = unbind_op_prepare(tile, pt_update_ops, vma);
> break;
> + }
> case DRM_GPUVA_OP_PREFETCH:
> - err = bind_op_prepare(vm, tile, pt_update_ops,
> - gpuva_to_vma(op-
> >base.prefetch.va));
> + {
> + struct xe_vma *vma = gpuva_to_vma(op-
> >base.prefetch.va);
> +
> + if (xe_vma_is_system_allocator(vma))
> + break;
> +
> + err = bind_op_prepare(vm, tile, pt_update_ops, vma);
> pt_update_ops->wait_vm_kernel = true;
> break;
> + }
> default:
> drm_warn(&vm->xe->drm, "NOT POSSIBLE");
> }
> @@ -1857,6 +1883,8 @@ static void bind_op_commit(struct xe_vm *vm,
> struct xe_tile *tile,
> struct xe_vma *vma, struct dma_fence
> *fence,
> struct dma_fence *fence2)
> {
> + xe_tile_assert(tile, !xe_vma_is_system_allocator(vma));
> +
> if (!xe_vma_has_no_bo(vma) && !xe_vma_bo(vma)->vm) {
> dma_resv_add_fence(xe_vma_bo(vma)->ttm.base.resv,
> fence,
> pt_update_ops->wait_vm_bookkeep ?
> @@ -1890,6 +1918,8 @@ static void unbind_op_commit(struct xe_vm *vm,
> struct xe_tile *tile,
> struct xe_vma *vma, struct dma_fence
> *fence,
> struct dma_fence *fence2)
> {
> + xe_tile_assert(tile, !xe_vma_is_system_allocator(vma));
> +
> if (!xe_vma_has_no_bo(vma) && !xe_vma_bo(vma)->vm) {
> dma_resv_add_fence(xe_vma_bo(vma)->ttm.base.resv,
> fence,
> pt_update_ops->wait_vm_bookkeep ?
> @@ -1924,16 +1954,21 @@ static void op_commit(struct xe_vm *vm,
>
> switch (op->base.op) {
> case DRM_GPUVA_OP_MAP:
> - if (!op->map.immediate && xe_vm_in_fault_mode(vm))
> + if ((!op->map.immediate && xe_vm_in_fault_mode(vm))
> ||
> + op->map.is_system_allocator)
> break;
>
> bind_op_commit(vm, tile, pt_update_ops, op->map.vma,
> fence,
> fence2);
> break;
> case DRM_GPUVA_OP_REMAP:
> - unbind_op_commit(vm, tile, pt_update_ops,
> - gpuva_to_vma(op->base.remap.unmap-
> >va), fence,
> - fence2);
> + {
> + struct xe_vma *old = gpuva_to_vma(op-
> >base.remap.unmap->va);
> +
> + if (xe_vma_is_system_allocator(old))
> + break;
> +
> + unbind_op_commit(vm, tile, pt_update_ops, old,
> fence, fence2);
>
> if (op->remap.prev)
> bind_op_commit(vm, tile, pt_update_ops, op-
> >remap.prev,
> @@ -1942,14 +1977,25 @@ static void op_commit(struct xe_vm *vm,
> bind_op_commit(vm, tile, pt_update_ops, op-
> >remap.next,
> fence, fence2);
> break;
> + }
> case DRM_GPUVA_OP_UNMAP:
> - unbind_op_commit(vm, tile, pt_update_ops,
> - gpuva_to_vma(op->base.unmap.va),
> fence, fence2);
> + {
> + struct xe_vma *vma = gpuva_to_vma(op-
> >base.unmap.va);
> +
> + if (!xe_vma_is_system_allocator(vma))
> + unbind_op_commit(vm, tile, pt_update_ops,
> vma, fence,
> + fence2);
> break;
> + }
> case DRM_GPUVA_OP_PREFETCH:
> - bind_op_commit(vm, tile, pt_update_ops,
> - gpuva_to_vma(op->base.prefetch.va),
> fence, fence2);
> + {
> + struct xe_vma *vma = gpuva_to_vma(op-
> >base.prefetch.va);
> +
> + if (!xe_vma_is_system_allocator(vma))
> + bind_op_commit(vm, tile, pt_update_ops, vma,
> fence,
> + fence2);
Wouldn't we want to support prefetch? Or perhaps the implementation is
deferred?
> break;
> + }
> default:
> drm_warn(&vm->xe->drm, "NOT POSSIBLE");
> }
> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> index c99380271de6..0d887fb9de59 100644
> --- a/drivers/gpu/drm/xe/xe_vm.c
> +++ b/drivers/gpu/drm/xe/xe_vm.c
> @@ -901,9 +901,10 @@ static void xe_vma_free(struct xe_vma *vma)
> kfree(vma);
> }
>
> -#define VMA_CREATE_FLAG_READ_ONLY BIT(0)
> -#define VMA_CREATE_FLAG_IS_NULL BIT(1)
> -#define VMA_CREATE_FLAG_DUMPABLE BIT(2)
> +#define VMA_CREATE_FLAG_READ_ONLY BIT(0)
> +#define VMA_CREATE_FLAG_IS_NULL BIT(1)
> +#define VMA_CREATE_FLAG_DUMPABLE BIT(2)
> +#define VMA_CREATE_FLAG_IS_SYSTEM_ALLOCATOR BIT(3)
>
> static struct xe_vma *xe_vma_create(struct xe_vm *vm,
> struct xe_bo *bo,
> @@ -917,6 +918,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm
> *vm,
> bool read_only = (flags & VMA_CREATE_FLAG_READ_ONLY);
> bool is_null = (flags & VMA_CREATE_FLAG_IS_NULL);
> bool dumpable = (flags & VMA_CREATE_FLAG_DUMPABLE);
> + bool is_system_allocator =
> + (flags & VMA_CREATE_FLAG_IS_SYSTEM_ALLOCATOR);
>
> xe_assert(vm->xe, start < end);
> xe_assert(vm->xe, end < vm->size);
> @@ -925,7 +928,7 @@ static struct xe_vma *xe_vma_create(struct xe_vm
> *vm,
> * Allocate and ensure that the xe_vma_is_userptr() return
> * matches what was allocated.
> */
> - if (!bo && !is_null) {
> + if (!bo && !is_null && !is_system_allocator) {
> struct xe_userptr_vma *uvma = kzalloc(sizeof(*uvma),
> GFP_KERNEL);
>
> if (!uvma)
> @@ -937,6 +940,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm
> *vm,
> if (!vma)
> return ERR_PTR(-ENOMEM);
>
> + if (is_system_allocator)
> + vma->gpuva.flags |= XE_VMA_SYSTEM_ALLOCATOR;
> if (is_null)
> vma->gpuva.flags |= DRM_GPUVA_SPARSE;
> if (bo)
> @@ -979,7 +984,7 @@ static struct xe_vma *xe_vma_create(struct xe_vm
> *vm,
> drm_gpuva_link(&vma->gpuva, vm_bo);
> drm_gpuvm_bo_put(vm_bo);
> } else /* userptr or null */ {
> - if (!is_null) {
> + if (!is_null && !is_system_allocator) {
> struct xe_userptr *userptr =
> &to_userptr_vma(vma)->userptr;
> u64 size = end - start + 1;
> int err;
> @@ -1029,7 +1034,7 @@ static void xe_vma_destroy_late(struct xe_vma
> *vma)
> */
> mmu_interval_notifier_remove(&userptr->notifier);
> xe_vm_put(vm);
> - } else if (xe_vma_is_null(vma)) {
> + } else if (xe_vma_is_null(vma) ||
> xe_vma_is_system_allocator(vma)) {
> xe_vm_put(vm);
> } else {
> xe_bo_put(xe_vma_bo(vma));
> @@ -1068,7 +1073,7 @@ static void xe_vma_destroy(struct xe_vma *vma,
> struct dma_fence *fence)
> spin_lock(&vm->userptr.invalidated_lock);
> list_del(&to_userptr_vma(vma)-
> >userptr.invalidate_link);
> spin_unlock(&vm->userptr.invalidated_lock);
> - } else if (!xe_vma_is_null(vma)) {
> + } else if (!xe_vma_is_null(vma) &&
> !xe_vma_is_system_allocator(vma)) {
> xe_bo_assert_held(xe_vma_bo(vma));
>
> drm_gpuva_unlink(&vma->gpuva);
> @@ -1967,6 +1972,8 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm,
> struct xe_bo *bo,
> op->map.read_only =
> flags &
> DRM_XE_VM_BIND_FLAG_READONLY;
> op->map.is_null = flags &
> DRM_XE_VM_BIND_FLAG_NULL;
> + op->map.is_system_allocator = flags &
> + DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR
> ;
> op->map.dumpable = flags &
> DRM_XE_VM_BIND_FLAG_DUMPABLE;
> op->map.pat_index = pat_index;
> } else if (__op->op == DRM_GPUVA_OP_PREFETCH) {
> @@ -2158,6 +2165,8 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm
> *vm, struct drm_gpuva_ops *ops,
> VMA_CREATE_FLAG_IS_NULL : 0;
> flags |= op->map.dumpable ?
> VMA_CREATE_FLAG_DUMPABLE : 0;
> + flags |= op->map.is_system_allocator ?
> + VMA_CREATE_FLAG_IS_SYSTEM_ALLOCATOR
> : 0;
>
> vma = new_vma(vm, &op->base.map, op-
> >map.pat_index,
> flags);
> @@ -2165,7 +2174,8 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm
> *vm, struct drm_gpuva_ops *ops,
> return PTR_ERR(vma);
>
> op->map.vma = vma;
> - if (op->map.immediate ||
> !xe_vm_in_fault_mode(vm))
> + if ((op->map.immediate ||
> !xe_vm_in_fault_mode(vm)) &&
> + !op->map.is_system_allocator)
> xe_vma_ops_incr_pt_update_ops(vops,
> op-
> >tile_mask);
> break;
> @@ -2174,21 +2184,24 @@ static int vm_bind_ioctl_ops_parse(struct
> xe_vm *vm, struct drm_gpuva_ops *ops,
> {
> struct xe_vma *old =
> gpuva_to_vma(op->base.remap.unmap-
> >va);
> + bool skip = xe_vma_is_system_allocator(old);
>
> op->remap.start = xe_vma_start(old);
> op->remap.range = xe_vma_size(old);
>
> - if (op->base.remap.prev) {
> - flags |= op->base.remap.unmap->va-
> >flags &
> - XE_VMA_READ_ONLY ?
> - VMA_CREATE_FLAG_READ_ONLY :
> 0;
> - flags |= op->base.remap.unmap->va-
> >flags &
> - DRM_GPUVA_SPARSE ?
> - VMA_CREATE_FLAG_IS_NULL : 0;
> - flags |= op->base.remap.unmap->va-
> >flags &
> - XE_VMA_DUMPABLE ?
> - VMA_CREATE_FLAG_DUMPABLE :
> 0;
> + flags |= op->base.remap.unmap->va->flags &
> + XE_VMA_READ_ONLY ?
> + VMA_CREATE_FLAG_READ_ONLY : 0;
> + flags |= op->base.remap.unmap->va->flags &
> + DRM_GPUVA_SPARSE ?
> + VMA_CREATE_FLAG_IS_NULL : 0;
> + flags |= op->base.remap.unmap->va->flags &
> + XE_VMA_DUMPABLE ?
> + VMA_CREATE_FLAG_DUMPABLE : 0;
> + flags |= xe_vma_is_system_allocator(old) ?
> + VMA_CREATE_FLAG_IS_SYSTEM_ALLOCATOR
> : 0;
>
> + if (op->base.remap.prev) {
> vma = new_vma(vm, op-
> >base.remap.prev,
> old->pat_index,
> flags);
> if (IS_ERR(vma))
> @@ -2200,9 +2213,10 @@ static int vm_bind_ioctl_ops_parse(struct
> xe_vm *vm, struct drm_gpuva_ops *ops,
> * Userptr creates a new SG mapping
> so
> * we must also rebind.
> */
> - op->remap.skip_prev =
> !xe_vma_is_userptr(old) &&
> + op->remap.skip_prev = skip ||
> + (!xe_vma_is_userptr(old) &&
> IS_ALIGNED(xe_vma_end(vma),
> -
> xe_vma_max_pte_size(old));
> +
> xe_vma_max_pte_size(old)));
> if (op->remap.skip_prev) {
> xe_vma_set_pte_size(vma,
> xe_vma_max_pte_size(old));
> op->remap.range -=
> @@ -2218,16 +2232,6 @@ static int vm_bind_ioctl_ops_parse(struct
> xe_vm *vm, struct drm_gpuva_ops *ops,
> }
>
> if (op->base.remap.next) {
> - flags |= op->base.remap.unmap->va-
> >flags &
> - XE_VMA_READ_ONLY ?
> - VMA_CREATE_FLAG_READ_ONLY :
> 0;
> - flags |= op->base.remap.unmap->va-
> >flags &
> - DRM_GPUVA_SPARSE ?
> - VMA_CREATE_FLAG_IS_NULL : 0;
> - flags |= op->base.remap.unmap->va-
> >flags &
> - XE_VMA_DUMPABLE ?
> - VMA_CREATE_FLAG_DUMPABLE :
> 0;
> -
> vma = new_vma(vm, op-
> >base.remap.next,
> old->pat_index,
> flags);
> if (IS_ERR(vma))
> @@ -2239,9 +2243,10 @@ static int vm_bind_ioctl_ops_parse(struct
> xe_vm *vm, struct drm_gpuva_ops *ops,
> * Userptr creates a new SG mapping
> so
> * we must also rebind.
> */
> - op->remap.skip_next =
> !xe_vma_is_userptr(old) &&
> + op->remap.skip_next = skip ||
> + (!xe_vma_is_userptr(old) &&
> IS_ALIGNED(xe_vma_start(vma)
> ,
> -
> xe_vma_max_pte_size(old));
> +
> xe_vma_max_pte_size(old)));
> if (op->remap.skip_next) {
> xe_vma_set_pte_size(vma,
> xe_vma_max_pte_size(old));
> op->remap.range -=
> @@ -2254,14 +2259,27 @@ static int vm_bind_ioctl_ops_parse(struct
> xe_vm *vm, struct drm_gpuva_ops *ops,
> xe_vma_ops_incr_pt_update_op
> s(vops, op->tile_mask);
> }
> }
> - xe_vma_ops_incr_pt_update_ops(vops, op-
> >tile_mask);
> + if (!skip)
> + xe_vma_ops_incr_pt_update_ops(vops,
> op->tile_mask);
> break;
> }
> case DRM_GPUVA_OP_UNMAP:
> + {
> + struct xe_vma *vma = gpuva_to_vma(op-
> >base.unmap.va);
> +
> + if (!xe_vma_is_system_allocator(vma))
> + xe_vma_ops_incr_pt_update_ops(vops,
> op->tile_mask);
> + break;
> + }
> case DRM_GPUVA_OP_PREFETCH:
> + {
> + struct xe_vma *vma = gpuva_to_vma(op-
> >base.prefetch.va);
> +
> /* FIXME: Need to skip some prefetch ops */
> - xe_vma_ops_incr_pt_update_ops(vops, op-
> >tile_mask);
> + if (!xe_vma_is_system_allocator(vma))
> + xe_vma_ops_incr_pt_update_ops(vops,
> op->tile_mask);
> break;
> + }
> default:
> drm_warn(&vm->xe->drm, "NOT POSSIBLE");
> }
> @@ -2702,7 +2720,8 @@ static int vm_bind_ioctl_ops_execute(struct
> xe_vm *vm,
> (DRM_XE_VM_BIND_FLAG_READONLY | \
> DRM_XE_VM_BIND_FLAG_IMMEDIATE | \
> DRM_XE_VM_BIND_FLAG_NULL | \
> - DRM_XE_VM_BIND_FLAG_DUMPABLE)
> + DRM_XE_VM_BIND_FLAG_DUMPABLE | \
> + DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR)
>
> #ifdef TEST_VM_OPS_ERROR
> #define SUPPORTED_FLAGS (SUPPORTED_FLAGS_STUB |
> FORCE_OP_ERROR)
> @@ -2757,9 +2776,17 @@ static int vm_bind_ioctl_check_args(struct
> xe_device *xe,
> u64 obj_offset = (*bind_ops)[i].obj_offset;
> u32 prefetch_region =
> (*bind_ops)[i].prefetch_mem_region_instance;
> bool is_null = flags & DRM_XE_VM_BIND_FLAG_NULL;
> + bool is_system_allocator = flags &
> + DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR;
> u16 pat_index = (*bind_ops)[i].pat_index;
> u16 coh_mode;
>
> + /* FIXME: Disabling system allocator for now */
> + if (XE_IOCTL_DBG(xe, is_system_allocator)) {
> + err = -EOPNOTSUPP;
> + goto free_bind_ops;
> + }
> +
> if (XE_IOCTL_DBG(xe, pat_index >= xe-
> >pat.n_entries)) {
> err = -EINVAL;
> goto free_bind_ops;
> @@ -2780,13 +2807,14 @@ static int vm_bind_ioctl_check_args(struct
> xe_device *xe,
>
> if (XE_IOCTL_DBG(xe, op >
> DRM_XE_VM_BIND_OP_PREFETCH) ||
> XE_IOCTL_DBG(xe, flags & ~SUPPORTED_FLAGS) ||
> - XE_IOCTL_DBG(xe, obj && is_null) ||
> - XE_IOCTL_DBG(xe, obj_offset && is_null) ||
> + XE_IOCTL_DBG(xe, obj && (is_null ||
> is_system_allocator)) ||
> + XE_IOCTL_DBG(xe, obj_offset && (is_null ||
> + is_system_allocator)) ||
> XE_IOCTL_DBG(xe, op != DRM_XE_VM_BIND_OP_MAP &&
> - is_null) ||
> + (is_null || is_system_allocator))
> ||
> XE_IOCTL_DBG(xe, !obj &&
> op == DRM_XE_VM_BIND_OP_MAP &&
> - !is_null) ||
> + !is_null && !is_system_allocator)
> ||
> XE_IOCTL_DBG(xe, !obj &&
> op == DRM_XE_VM_BIND_OP_UNMAP_ALL)
> ||
> XE_IOCTL_DBG(xe, addr &&
> @@ -3170,6 +3198,7 @@ int xe_vm_invalidate_vma(struct xe_vma *vma)
> int ret = 0;
>
> xe_assert(xe, !xe_vma_is_null(vma));
> + xe_assert(xe, !xe_vma_is_system_allocator(vma));
> trace_xe_vma_invalidate(vma);
>
> vm_dbg(&xe_vma_vm(vma)->xe->drm,
> diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
> index c864dba35e1d..1a5aed678214 100644
> --- a/drivers/gpu/drm/xe/xe_vm.h
> +++ b/drivers/gpu/drm/xe/xe_vm.h
> @@ -151,6 +151,11 @@ static inline bool xe_vma_is_null(struct xe_vma
> *vma)
> return vma->gpuva.flags & DRM_GPUVA_SPARSE;
> }
>
> +static inline bool xe_vma_is_system_allocator(struct xe_vma *vma)
> +{
> + return vma->gpuva.flags & XE_VMA_SYSTEM_ALLOCATOR;
> +}
> +
> static inline bool xe_vma_has_no_bo(struct xe_vma *vma)
> {
> return !xe_vma_bo(vma);
> @@ -158,7 +163,8 @@ static inline bool xe_vma_has_no_bo(struct xe_vma
> *vma)
>
> static inline bool xe_vma_is_userptr(struct xe_vma *vma)
> {
> - return xe_vma_has_no_bo(vma) && !xe_vma_is_null(vma);
> + return xe_vma_has_no_bo(vma) && !xe_vma_is_null(vma) &&
> + !xe_vma_is_system_allocator(vma);
> }
>
> /**
> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h
> b/drivers/gpu/drm/xe/xe_vm_types.h
> index 7f9a303e51d8..1764781c376b 100644
> --- a/drivers/gpu/drm/xe/xe_vm_types.h
> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
> @@ -42,6 +42,7 @@ struct xe_vm_pgtable_update_op;
> #define XE_VMA_PTE_64K (DRM_GPUVA_USERBITS << 6)
> #define XE_VMA_PTE_COMPACT (DRM_GPUVA_USERBITS << 7)
> #define XE_VMA_DUMPABLE (DRM_GPUVA_USERBITS << 8)
> +#define XE_VMA_SYSTEM_ALLOCATOR (DRM_GPUVA_USERBITS << 9)
>
> /** struct xe_userptr - User pointer */
> struct xe_userptr {
> @@ -294,6 +295,8 @@ struct xe_vma_op_map {
> bool read_only;
> /** @is_null: is NULL binding */
> bool is_null;
> + /** @is_system_allocator: is system allocator binding */
> + bool is_system_allocator;
> /** @dumpable: whether BO is dumped on GPU hang */
> bool dumpable;
> /** @pat_index: The pat index to use for this operation. */
> diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
> index c4182e95a619..1e92fd498967 100644
> --- a/include/uapi/drm/xe_drm.h
> +++ b/include/uapi/drm/xe_drm.h
> @@ -906,6 +906,12 @@ struct drm_xe_vm_destroy {
> * will only be valid for DRM_XE_VM_BIND_OP_MAP operations, the
> BO
> * handle MBZ, and the BO offset MBZ. This flag is intended to
> * implement VK sparse bindings.
> + * - %DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR - When the system
> allocator flag is
> + * set, no mappings are created rather the range is reserved for
> system
> + * allocations which will be populated on GPU page faults. Only
> valid on VMs
> + * with DRM_XE_VM_CREATE_FLAG_FAULT_MODE set. The system
> allocator flag are
> + * only valid for DRM_XE_VM_BIND_OP_MAP operations, the BO handle
> MBZ, and
> + * the BO offset MBZ.
> */
> struct drm_xe_vm_bind_op {
> /** @extensions: Pointer to the first extension struct, if
> any */
> @@ -958,7 +964,9 @@ struct drm_xe_vm_bind_op {
> * on the @pat_index. For such mappings there is no actual
> memory being
> * mapped (the address in the PTE is invalid), so the
> various PAT memory
> * attributes likely do not apply. Simply leaving as zero
> is one
> - * option (still a valid pat_index).
> + * option (still a valid pat_index). Same applies to
> + * DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR bindings as for such
> mapping
> + * there is no actual memory being mapped.
> */
> __u16 pat_index;
>
> @@ -974,6 +982,14 @@ struct drm_xe_vm_bind_op {
>
> /** @userptr: user pointer to bind on */
> __u64 userptr;
> +
> + /**
> + * @system_allocator_offset: Offset from GPU @addr
> to create
> + * system allocator mappings. MBZ with current level
> of support
> + * (e.g. 1 to 1 mapping between GPU and CPU mappings
> only
> + * supported).
> + */
> + __s64 system_allocator_offset;
> };
>
> /**
> @@ -996,6 +1012,7 @@ struct drm_xe_vm_bind_op {
> #define DRM_XE_VM_BIND_FLAG_IMMEDIATE (1 << 1)
> #define DRM_XE_VM_BIND_FLAG_NULL (1 << 2)
> #define DRM_XE_VM_BIND_FLAG_DUMPABLE (1 << 3)
> +#define DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR (1 << 4)
> /** @flags: Bind flags */
> __u32 flags;
>
^ permalink raw reply [flat|nested] 129+ messages in thread* Re: [PATCH v2 06/29] drm/xe/uapi: Add DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR flag
2024-11-18 13:44 ` Thomas Hellström
@ 2024-11-19 16:01 ` Matthew Brost
0 siblings, 0 replies; 129+ messages in thread
From: Matthew Brost @ 2024-11-19 16:01 UTC (permalink / raw)
To: Thomas Hellström
Cc: intel-xe, dri-devel, apopple, airlied, christian.koenig,
simona.vetter, felix.kuehling, dakr
On Mon, Nov 18, 2024 at 02:44:58PM +0100, Thomas Hellström wrote:
> On Tue, 2024-10-15 at 20:24 -0700, Matthew Brost wrote:
> > Add the DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR flag, which is used to
> > create unpopulated virtual memory areas (VMAs) without memory backing
> > or
> > GPU page tables. These VMAs are referred to as system allocator VMAs.
> > The idea is that upon a page fault or prefetch, the memory backing
> > and
> > GPU page tables will be populated.
>
> It would be good if the commit message could describe the state of the
> code after this patch. It seems we do a lot more than just adding a
> flag, but no real implementation. Perhaps just adjust the current code
> to avoid code-paths that are not taken when the flag is set?
>
Sure, I can add a description of what the patch does, which aligns with your
assessment - it updates the bind code to create VMAs without creating
page tables when this flag is set.
> >
> > System allocator VMAs only update GPUVM state; they do not have an
> > internal page table (PT) state, nor do they have GPU mappings.
> >
> > It is expected that system allocator VMAs will be mixed with buffer
> > object (BO) VMAs within a single VM. In other words, system
> > allocations
> > and runtime allocations can be mixed within a single user-mode driver
> > (UMD) program.
>
> This sounds like compute API-level terminology describing where the app
> gets its buffer objects: System allocator - malloc, Runtime allocator -
> the compute runtime (allocating buffer objects under the hood).
>
> Not sure what would be the best terminology, though, but something
> along DRM_XE_VM_BIND_FLAG_CPU_ADDR_MIRROR? (And when setteled change
> inside code as well).
>
DRM_XE_VM_BIND_FLAG_CPU_ADDR_MIRROR seems reasonable. Then also
s/xe_vma_is_system_allocator/xe_vma_is_cpu_addr_mirror/ too?
> Otherwise it gets weird if someone asks why is it called "System
> Allocator", and the reply is "a compute runtime would typically use
> this functionality when an app has allocated the memory using malloc()
> which can be called a system allocator".
>
> IOW we name the functionality based on what KMD does and not how the
> app uses it through UMD.
>
> >
> > Expected usage:
> >
> > - Bind the entire virtual address (VA) space upon program load using
> > the
> > DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR flag.
> > - If a buffer object (BO) requires GPU mapping, allocate an address
> > using malloc, and bind the BO to the malloc'd address using
> > existing
> > bind IOCTLs (runtime allocation).
>
> allocate a cpu address using mmap(PROT_NONE), bind the BO to the
> malloced address using existing bind IOCTLS. If a cpu map of the bo is
> needed, mmap it again to the same cpu address using mmap(MAP_FIXED)
>
Will adjust.
> > - If a BO no longer requires GPU mapping, bind the mapping address
> > with
> > the DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR flag.
>
> unmap it from cpu space and then...
Yes. Will add.
>
> > - Any malloc'd address accessed by the GPU will be faulted in via the
> > SVM implementation (system allocation).
> > - Upon freeing any malloc'd data, the SVM implementation will remove
> > GPU
> > mappings.
> >
> > Only supporting 1 to 1 mapping between user address space and GPU
> > address space at the moment as that is the expected use case. uAPI
> > defines interface for non 1 to 1 but enforces 1 to 1, this
> > restriction
> > can be lifted if use cases arrise for non 1 to 1 mappings.
> >
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> > drivers/gpu/drm/xe/xe_pt.c | 76 +++++++++++++++++-----
> > drivers/gpu/drm/xe/xe_vm.c | 107 ++++++++++++++++++++---------
> > --
> > drivers/gpu/drm/xe/xe_vm.h | 8 ++-
> > drivers/gpu/drm/xe/xe_vm_types.h | 3 +
> > include/uapi/drm/xe_drm.h | 19 +++++-
> > 5 files changed, 157 insertions(+), 56 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
> > index f27f579f4d85..39357e829b6d 100644
> > --- a/drivers/gpu/drm/xe/xe_pt.c
> > +++ b/drivers/gpu/drm/xe/xe_pt.c
> > @@ -1068,6 +1068,11 @@ static int op_add_deps(struct xe_vm *vm,
> > struct xe_vma_op *op,
> > {
> > int err = 0;
> >
> > + /*
> > + * No need to check for is_system_allocator here as
> > vma_add_deps is a
> > + * NOP if VMA is_system_allocator
> > + */
> > +
> > switch (op->base.op) {
> > case DRM_GPUVA_OP_MAP:
> > if (!op->map.immediate && xe_vm_in_fault_mode(vm))
> > @@ -1646,6 +1651,7 @@ static int bind_op_prepare(struct xe_vm *vm,
> > struct xe_tile *tile,
> > struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops-
> > >ops[current_op];
> > int err;
> >
> > + xe_tile_assert(tile, !xe_vma_is_system_allocator(vma));
> > xe_bo_assert_held(xe_vma_bo(vma));
> >
> > vm_dbg(&xe_vma_vm(vma)->xe->drm,
> > @@ -1713,6 +1719,7 @@ static int unbind_op_prepare(struct xe_tile
> > *tile,
> > if (!((vma->tile_present | vma->tile_staged) & BIT(tile-
> > >id)))
> > return 0;
> >
> > + xe_tile_assert(tile, !xe_vma_is_system_allocator(vma));
> > xe_bo_assert_held(xe_vma_bo(vma));
> >
> > vm_dbg(&xe_vma_vm(vma)->xe->drm,
> > @@ -1759,15 +1766,21 @@ static int op_prepare(struct xe_vm *vm,
> >
> > switch (op->base.op) {
> > case DRM_GPUVA_OP_MAP:
> > - if (!op->map.immediate && xe_vm_in_fault_mode(vm))
> > + if ((!op->map.immediate && xe_vm_in_fault_mode(vm))
> > ||
> > + op->map.is_system_allocator)
> > break;
> >
> > err = bind_op_prepare(vm, tile, pt_update_ops, op-
> > >map.vma);
> > pt_update_ops->wait_vm_kernel = true;
> > break;
> > case DRM_GPUVA_OP_REMAP:
> > - err = unbind_op_prepare(tile, pt_update_ops,
> > - gpuva_to_vma(op-
> > >base.remap.unmap->va));
> > + {
> > + struct xe_vma *old = gpuva_to_vma(op-
> > >base.remap.unmap->va);
> > +
> > + if (xe_vma_is_system_allocator(old))
> > + break;
> > +
> > + err = unbind_op_prepare(tile, pt_update_ops, old);
> >
> > if (!err && op->remap.prev) {
> > err = bind_op_prepare(vm, tile,
> > pt_update_ops,
> > @@ -1780,15 +1793,28 @@ static int op_prepare(struct xe_vm *vm,
> > pt_update_ops->wait_vm_bookkeep = true;
> > }
> > break;
> > + }
> > case DRM_GPUVA_OP_UNMAP:
> > - err = unbind_op_prepare(tile, pt_update_ops,
> > - gpuva_to_vma(op-
> > >base.unmap.va));
> > + {
> > + struct xe_vma *vma = gpuva_to_vma(op-
> > >base.unmap.va);
> > +
> > + if (xe_vma_is_system_allocator(vma))
> > + break;
> > +
> > + err = unbind_op_prepare(tile, pt_update_ops, vma);
> > break;
> > + }
> > case DRM_GPUVA_OP_PREFETCH:
> > - err = bind_op_prepare(vm, tile, pt_update_ops,
> > - gpuva_to_vma(op-
> > >base.prefetch.va));
> > + {
> > + struct xe_vma *vma = gpuva_to_vma(op-
> > >base.prefetch.va);
> > +
> > + if (xe_vma_is_system_allocator(vma))
> > + break;
> > +
> > + err = bind_op_prepare(vm, tile, pt_update_ops, vma);
> > pt_update_ops->wait_vm_kernel = true;
> > break;
> > + }
> > default:
> > drm_warn(&vm->xe->drm, "NOT POSSIBLE");
> > }
> > @@ -1857,6 +1883,8 @@ static void bind_op_commit(struct xe_vm *vm,
> > struct xe_tile *tile,
> > struct xe_vma *vma, struct dma_fence
> > *fence,
> > struct dma_fence *fence2)
> > {
> > + xe_tile_assert(tile, !xe_vma_is_system_allocator(vma));
> > +
> > if (!xe_vma_has_no_bo(vma) && !xe_vma_bo(vma)->vm) {
> > dma_resv_add_fence(xe_vma_bo(vma)->ttm.base.resv,
> > fence,
> > pt_update_ops->wait_vm_bookkeep ?
> > @@ -1890,6 +1918,8 @@ static void unbind_op_commit(struct xe_vm *vm,
> > struct xe_tile *tile,
> > struct xe_vma *vma, struct dma_fence
> > *fence,
> > struct dma_fence *fence2)
> > {
> > + xe_tile_assert(tile, !xe_vma_is_system_allocator(vma));
> > +
> > if (!xe_vma_has_no_bo(vma) && !xe_vma_bo(vma)->vm) {
> > dma_resv_add_fence(xe_vma_bo(vma)->ttm.base.resv,
> > fence,
> > pt_update_ops->wait_vm_bookkeep ?
> > @@ -1924,16 +1954,21 @@ static void op_commit(struct xe_vm *vm,
> >
> > switch (op->base.op) {
> > case DRM_GPUVA_OP_MAP:
> > - if (!op->map.immediate && xe_vm_in_fault_mode(vm))
> > + if ((!op->map.immediate && xe_vm_in_fault_mode(vm))
> > ||
> > + op->map.is_system_allocator)
> > break;
> >
> > bind_op_commit(vm, tile, pt_update_ops, op->map.vma,
> > fence,
> > fence2);
> > break;
> > case DRM_GPUVA_OP_REMAP:
> > - unbind_op_commit(vm, tile, pt_update_ops,
> > - gpuva_to_vma(op->base.remap.unmap-
> > >va), fence,
> > - fence2);
> > + {
> > + struct xe_vma *old = gpuva_to_vma(op-
> > >base.remap.unmap->va);
> > +
> > + if (xe_vma_is_system_allocator(old))
> > + break;
> > +
> > + unbind_op_commit(vm, tile, pt_update_ops, old,
> > fence, fence2);
> >
> > if (op->remap.prev)
> > bind_op_commit(vm, tile, pt_update_ops, op-
> > >remap.prev,
> > @@ -1942,14 +1977,25 @@ static void op_commit(struct xe_vm *vm,
> > bind_op_commit(vm, tile, pt_update_ops, op-
> > >remap.next,
> > fence, fence2);
> > break;
> > + }
> > case DRM_GPUVA_OP_UNMAP:
> > - unbind_op_commit(vm, tile, pt_update_ops,
> > - gpuva_to_vma(op->base.unmap.va),
> > fence, fence2);
> > + {
> > + struct xe_vma *vma = gpuva_to_vma(op-
> > >base.unmap.va);
> > +
> > + if (!xe_vma_is_system_allocator(vma))
> > + unbind_op_commit(vm, tile, pt_update_ops,
> > vma, fence,
> > + fence2);
> > break;
> > + }
> > case DRM_GPUVA_OP_PREFETCH:
> > - bind_op_commit(vm, tile, pt_update_ops,
> > - gpuva_to_vma(op->base.prefetch.va),
> > fence, fence2);
> > + {
> > + struct xe_vma *vma = gpuva_to_vma(op-
> > >base.prefetch.va);
> > +
> > + if (!xe_vma_is_system_allocator(vma))
> > + bind_op_commit(vm, tile, pt_update_ops, vma,
> > fence,
> > + fence2);
>
> Wouldn't we want to support prefetch? Or perhaps the implementation is
> deferred?
>
Yes, this will be deferred. Himal is looking at this piece.
Matt
>
> > break;
> > + }
> > default:
> > drm_warn(&vm->xe->drm, "NOT POSSIBLE");
> > }
> > diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> > index c99380271de6..0d887fb9de59 100644
> > --- a/drivers/gpu/drm/xe/xe_vm.c
> > +++ b/drivers/gpu/drm/xe/xe_vm.c
> > @@ -901,9 +901,10 @@ static void xe_vma_free(struct xe_vma *vma)
> > kfree(vma);
> > }
> >
> > -#define VMA_CREATE_FLAG_READ_ONLY BIT(0)
> > -#define VMA_CREATE_FLAG_IS_NULL BIT(1)
> > -#define VMA_CREATE_FLAG_DUMPABLE BIT(2)
> > +#define VMA_CREATE_FLAG_READ_ONLY BIT(0)
> > +#define VMA_CREATE_FLAG_IS_NULL BIT(1)
> > +#define VMA_CREATE_FLAG_DUMPABLE BIT(2)
> > +#define VMA_CREATE_FLAG_IS_SYSTEM_ALLOCATOR BIT(3)
> >
> > static struct xe_vma *xe_vma_create(struct xe_vm *vm,
> > struct xe_bo *bo,
> > @@ -917,6 +918,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm
> > *vm,
> > bool read_only = (flags & VMA_CREATE_FLAG_READ_ONLY);
> > bool is_null = (flags & VMA_CREATE_FLAG_IS_NULL);
> > bool dumpable = (flags & VMA_CREATE_FLAG_DUMPABLE);
> > + bool is_system_allocator =
> > + (flags & VMA_CREATE_FLAG_IS_SYSTEM_ALLOCATOR);
> >
> > xe_assert(vm->xe, start < end);
> > xe_assert(vm->xe, end < vm->size);
> > @@ -925,7 +928,7 @@ static struct xe_vma *xe_vma_create(struct xe_vm
> > *vm,
> > * Allocate and ensure that the xe_vma_is_userptr() return
> > * matches what was allocated.
> > */
> > - if (!bo && !is_null) {
> > + if (!bo && !is_null && !is_system_allocator) {
> > struct xe_userptr_vma *uvma = kzalloc(sizeof(*uvma),
> > GFP_KERNEL);
> >
> > if (!uvma)
> > @@ -937,6 +940,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm
> > *vm,
> > if (!vma)
> > return ERR_PTR(-ENOMEM);
> >
> > + if (is_system_allocator)
> > + vma->gpuva.flags |= XE_VMA_SYSTEM_ALLOCATOR;
> > if (is_null)
> > vma->gpuva.flags |= DRM_GPUVA_SPARSE;
> > if (bo)
> > @@ -979,7 +984,7 @@ static struct xe_vma *xe_vma_create(struct xe_vm
> > *vm,
> > drm_gpuva_link(&vma->gpuva, vm_bo);
> > drm_gpuvm_bo_put(vm_bo);
> > } else /* userptr or null */ {
> > - if (!is_null) {
> > + if (!is_null && !is_system_allocator) {
> > struct xe_userptr *userptr =
> > &to_userptr_vma(vma)->userptr;
> > u64 size = end - start + 1;
> > int err;
> > @@ -1029,7 +1034,7 @@ static void xe_vma_destroy_late(struct xe_vma
> > *vma)
> > */
> > mmu_interval_notifier_remove(&userptr->notifier);
> > xe_vm_put(vm);
> > - } else if (xe_vma_is_null(vma)) {
> > + } else if (xe_vma_is_null(vma) ||
> > xe_vma_is_system_allocator(vma)) {
> > xe_vm_put(vm);
> > } else {
> > xe_bo_put(xe_vma_bo(vma));
> > @@ -1068,7 +1073,7 @@ static void xe_vma_destroy(struct xe_vma *vma,
> > struct dma_fence *fence)
> > spin_lock(&vm->userptr.invalidated_lock);
> > list_del(&to_userptr_vma(vma)-
> > >userptr.invalidate_link);
> > spin_unlock(&vm->userptr.invalidated_lock);
> > - } else if (!xe_vma_is_null(vma)) {
> > + } else if (!xe_vma_is_null(vma) &&
> > !xe_vma_is_system_allocator(vma)) {
> > xe_bo_assert_held(xe_vma_bo(vma));
> >
> > drm_gpuva_unlink(&vma->gpuva);
> > @@ -1967,6 +1972,8 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm,
> > struct xe_bo *bo,
> > op->map.read_only =
> > flags &
> > DRM_XE_VM_BIND_FLAG_READONLY;
> > op->map.is_null = flags &
> > DRM_XE_VM_BIND_FLAG_NULL;
> > + op->map.is_system_allocator = flags &
> > + DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR
> > ;
> > op->map.dumpable = flags &
> > DRM_XE_VM_BIND_FLAG_DUMPABLE;
> > op->map.pat_index = pat_index;
> > } else if (__op->op == DRM_GPUVA_OP_PREFETCH) {
> > @@ -2158,6 +2165,8 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm
> > *vm, struct drm_gpuva_ops *ops,
> > VMA_CREATE_FLAG_IS_NULL : 0;
> > flags |= op->map.dumpable ?
> > VMA_CREATE_FLAG_DUMPABLE : 0;
> > + flags |= op->map.is_system_allocator ?
> > + VMA_CREATE_FLAG_IS_SYSTEM_ALLOCATOR
> > : 0;
> >
> > vma = new_vma(vm, &op->base.map, op-
> > >map.pat_index,
> > flags);
> > @@ -2165,7 +2174,8 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm
> > *vm, struct drm_gpuva_ops *ops,
> > return PTR_ERR(vma);
> >
> > op->map.vma = vma;
> > - if (op->map.immediate ||
> > !xe_vm_in_fault_mode(vm))
> > + if ((op->map.immediate ||
> > !xe_vm_in_fault_mode(vm)) &&
> > + !op->map.is_system_allocator)
> > xe_vma_ops_incr_pt_update_ops(vops,
> > op-
> > >tile_mask);
> > break;
> > @@ -2174,21 +2184,24 @@ static int vm_bind_ioctl_ops_parse(struct
> > xe_vm *vm, struct drm_gpuva_ops *ops,
> > {
> > struct xe_vma *old =
> > gpuva_to_vma(op->base.remap.unmap-
> > >va);
> > + bool skip = xe_vma_is_system_allocator(old);
> >
> > op->remap.start = xe_vma_start(old);
> > op->remap.range = xe_vma_size(old);
> >
> > - if (op->base.remap.prev) {
> > - flags |= op->base.remap.unmap->va-
> > >flags &
> > - XE_VMA_READ_ONLY ?
> > - VMA_CREATE_FLAG_READ_ONLY :
> > 0;
> > - flags |= op->base.remap.unmap->va-
> > >flags &
> > - DRM_GPUVA_SPARSE ?
> > - VMA_CREATE_FLAG_IS_NULL : 0;
> > - flags |= op->base.remap.unmap->va-
> > >flags &
> > - XE_VMA_DUMPABLE ?
> > - VMA_CREATE_FLAG_DUMPABLE :
> > 0;
> > + flags |= op->base.remap.unmap->va->flags &
> > + XE_VMA_READ_ONLY ?
> > + VMA_CREATE_FLAG_READ_ONLY : 0;
> > + flags |= op->base.remap.unmap->va->flags &
> > + DRM_GPUVA_SPARSE ?
> > + VMA_CREATE_FLAG_IS_NULL : 0;
> > + flags |= op->base.remap.unmap->va->flags &
> > + XE_VMA_DUMPABLE ?
> > + VMA_CREATE_FLAG_DUMPABLE : 0;
> > + flags |= xe_vma_is_system_allocator(old) ?
> > + VMA_CREATE_FLAG_IS_SYSTEM_ALLOCATOR
> > : 0;
> >
> > + if (op->base.remap.prev) {
> > vma = new_vma(vm, op-
> > >base.remap.prev,
> > old->pat_index,
> > flags);
> > if (IS_ERR(vma))
> > @@ -2200,9 +2213,10 @@ static int vm_bind_ioctl_ops_parse(struct
> > xe_vm *vm, struct drm_gpuva_ops *ops,
> > * Userptr creates a new SG mapping
> > so
> > * we must also rebind.
> > */
> > - op->remap.skip_prev =
> > !xe_vma_is_userptr(old) &&
> > + op->remap.skip_prev = skip ||
> > + (!xe_vma_is_userptr(old) &&
> > IS_ALIGNED(xe_vma_end(vma),
> > -
> > xe_vma_max_pte_size(old));
> > +
> > xe_vma_max_pte_size(old)));
> > if (op->remap.skip_prev) {
> > xe_vma_set_pte_size(vma,
> > xe_vma_max_pte_size(old));
> > op->remap.range -=
> > @@ -2218,16 +2232,6 @@ static int vm_bind_ioctl_ops_parse(struct
> > xe_vm *vm, struct drm_gpuva_ops *ops,
> > }
> >
> > if (op->base.remap.next) {
> > - flags |= op->base.remap.unmap->va-
> > >flags &
> > - XE_VMA_READ_ONLY ?
> > - VMA_CREATE_FLAG_READ_ONLY :
> > 0;
> > - flags |= op->base.remap.unmap->va-
> > >flags &
> > - DRM_GPUVA_SPARSE ?
> > - VMA_CREATE_FLAG_IS_NULL : 0;
> > - flags |= op->base.remap.unmap->va-
> > >flags &
> > - XE_VMA_DUMPABLE ?
> > - VMA_CREATE_FLAG_DUMPABLE :
> > 0;
> > -
> > vma = new_vma(vm, op-
> > >base.remap.next,
> > old->pat_index,
> > flags);
> > if (IS_ERR(vma))
> > @@ -2239,9 +2243,10 @@ static int vm_bind_ioctl_ops_parse(struct
> > xe_vm *vm, struct drm_gpuva_ops *ops,
> > * Userptr creates a new SG mapping
> > so
> > * we must also rebind.
> > */
> > - op->remap.skip_next =
> > !xe_vma_is_userptr(old) &&
> > + op->remap.skip_next = skip ||
> > + (!xe_vma_is_userptr(old) &&
> > IS_ALIGNED(xe_vma_start(vma)
> > ,
> > -
> > xe_vma_max_pte_size(old));
> > +
> > xe_vma_max_pte_size(old)));
> > if (op->remap.skip_next) {
> > xe_vma_set_pte_size(vma,
> > xe_vma_max_pte_size(old));
> > op->remap.range -=
> > @@ -2254,14 +2259,27 @@ static int vm_bind_ioctl_ops_parse(struct
> > xe_vm *vm, struct drm_gpuva_ops *ops,
> > xe_vma_ops_incr_pt_update_op
> > s(vops, op->tile_mask);
> > }
> > }
> > - xe_vma_ops_incr_pt_update_ops(vops, op-
> > >tile_mask);
> > + if (!skip)
> > + xe_vma_ops_incr_pt_update_ops(vops,
> > op->tile_mask);
> > break;
> > }
> > case DRM_GPUVA_OP_UNMAP:
> > + {
> > + struct xe_vma *vma = gpuva_to_vma(op-
> > >base.unmap.va);
> > +
> > + if (!xe_vma_is_system_allocator(vma))
> > + xe_vma_ops_incr_pt_update_ops(vops,
> > op->tile_mask);
> > + break;
> > + }
> > case DRM_GPUVA_OP_PREFETCH:
> > + {
> > + struct xe_vma *vma = gpuva_to_vma(op-
> > >base.prefetch.va);
> > +
> > /* FIXME: Need to skip some prefetch ops */
> > - xe_vma_ops_incr_pt_update_ops(vops, op-
> > >tile_mask);
> > + if (!xe_vma_is_system_allocator(vma))
> > + xe_vma_ops_incr_pt_update_ops(vops,
> > op->tile_mask);
> > break;
> > + }
> > default:
> > drm_warn(&vm->xe->drm, "NOT POSSIBLE");
> > }
> > @@ -2702,7 +2720,8 @@ static int vm_bind_ioctl_ops_execute(struct
> > xe_vm *vm,
> > (DRM_XE_VM_BIND_FLAG_READONLY | \
> > DRM_XE_VM_BIND_FLAG_IMMEDIATE | \
> > DRM_XE_VM_BIND_FLAG_NULL | \
> > - DRM_XE_VM_BIND_FLAG_DUMPABLE)
> > + DRM_XE_VM_BIND_FLAG_DUMPABLE | \
> > + DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR)
> >
> > #ifdef TEST_VM_OPS_ERROR
> > #define SUPPORTED_FLAGS (SUPPORTED_FLAGS_STUB |
> > FORCE_OP_ERROR)
> > @@ -2757,9 +2776,17 @@ static int vm_bind_ioctl_check_args(struct
> > xe_device *xe,
> > u64 obj_offset = (*bind_ops)[i].obj_offset;
> > u32 prefetch_region =
> > (*bind_ops)[i].prefetch_mem_region_instance;
> > bool is_null = flags & DRM_XE_VM_BIND_FLAG_NULL;
> > + bool is_system_allocator = flags &
> > + DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR;
> > u16 pat_index = (*bind_ops)[i].pat_index;
> > u16 coh_mode;
> >
> > + /* FIXME: Disabling system allocator for now */
> > + if (XE_IOCTL_DBG(xe, is_system_allocator)) {
> > + err = -EOPNOTSUPP;
> > + goto free_bind_ops;
> > + }
> > +
> > if (XE_IOCTL_DBG(xe, pat_index >= xe-
> > >pat.n_entries)) {
> > err = -EINVAL;
> > goto free_bind_ops;
> > @@ -2780,13 +2807,14 @@ static int vm_bind_ioctl_check_args(struct
> > xe_device *xe,
> >
> > if (XE_IOCTL_DBG(xe, op >
> > DRM_XE_VM_BIND_OP_PREFETCH) ||
> > XE_IOCTL_DBG(xe, flags & ~SUPPORTED_FLAGS) ||
> > - XE_IOCTL_DBG(xe, obj && is_null) ||
> > - XE_IOCTL_DBG(xe, obj_offset && is_null) ||
> > + XE_IOCTL_DBG(xe, obj && (is_null ||
> > is_system_allocator)) ||
> > + XE_IOCTL_DBG(xe, obj_offset && (is_null ||
> > + is_system_allocator)) ||
> > XE_IOCTL_DBG(xe, op != DRM_XE_VM_BIND_OP_MAP &&
> > - is_null) ||
> > + (is_null || is_system_allocator))
> > ||
> > XE_IOCTL_DBG(xe, !obj &&
> > op == DRM_XE_VM_BIND_OP_MAP &&
> > - !is_null) ||
> > + !is_null && !is_system_allocator)
> > ||
> > XE_IOCTL_DBG(xe, !obj &&
> > op == DRM_XE_VM_BIND_OP_UNMAP_ALL)
> > ||
> > XE_IOCTL_DBG(xe, addr &&
> > @@ -3170,6 +3198,7 @@ int xe_vm_invalidate_vma(struct xe_vma *vma)
> > int ret = 0;
> >
> > xe_assert(xe, !xe_vma_is_null(vma));
> > + xe_assert(xe, !xe_vma_is_system_allocator(vma));
> > trace_xe_vma_invalidate(vma);
> >
> > vm_dbg(&xe_vma_vm(vma)->xe->drm,
> > diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
> > index c864dba35e1d..1a5aed678214 100644
> > --- a/drivers/gpu/drm/xe/xe_vm.h
> > +++ b/drivers/gpu/drm/xe/xe_vm.h
> > @@ -151,6 +151,11 @@ static inline bool xe_vma_is_null(struct xe_vma
> > *vma)
> > return vma->gpuva.flags & DRM_GPUVA_SPARSE;
> > }
> >
> > +static inline bool xe_vma_is_system_allocator(struct xe_vma *vma)
> > +{
> > + return vma->gpuva.flags & XE_VMA_SYSTEM_ALLOCATOR;
> > +}
> > +
> > static inline bool xe_vma_has_no_bo(struct xe_vma *vma)
> > {
> > return !xe_vma_bo(vma);
> > @@ -158,7 +163,8 @@ static inline bool xe_vma_has_no_bo(struct xe_vma
> > *vma)
> >
> > static inline bool xe_vma_is_userptr(struct xe_vma *vma)
> > {
> > - return xe_vma_has_no_bo(vma) && !xe_vma_is_null(vma);
> > + return xe_vma_has_no_bo(vma) && !xe_vma_is_null(vma) &&
> > + !xe_vma_is_system_allocator(vma);
> > }
> >
> > /**
> > diff --git a/drivers/gpu/drm/xe/xe_vm_types.h
> > b/drivers/gpu/drm/xe/xe_vm_types.h
> > index 7f9a303e51d8..1764781c376b 100644
> > --- a/drivers/gpu/drm/xe/xe_vm_types.h
> > +++ b/drivers/gpu/drm/xe/xe_vm_types.h
> > @@ -42,6 +42,7 @@ struct xe_vm_pgtable_update_op;
> > #define XE_VMA_PTE_64K (DRM_GPUVA_USERBITS << 6)
> > #define XE_VMA_PTE_COMPACT (DRM_GPUVA_USERBITS << 7)
> > #define XE_VMA_DUMPABLE (DRM_GPUVA_USERBITS << 8)
> > +#define XE_VMA_SYSTEM_ALLOCATOR (DRM_GPUVA_USERBITS << 9)
> >
> > /** struct xe_userptr - User pointer */
> > struct xe_userptr {
> > @@ -294,6 +295,8 @@ struct xe_vma_op_map {
> > bool read_only;
> > /** @is_null: is NULL binding */
> > bool is_null;
> > + /** @is_system_allocator: is system allocator binding */
> > + bool is_system_allocator;
> > /** @dumpable: whether BO is dumped on GPU hang */
> > bool dumpable;
> > /** @pat_index: The pat index to use for this operation. */
> > diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
> > index c4182e95a619..1e92fd498967 100644
> > --- a/include/uapi/drm/xe_drm.h
> > +++ b/include/uapi/drm/xe_drm.h
> > @@ -906,6 +906,12 @@ struct drm_xe_vm_destroy {
> > * will only be valid for DRM_XE_VM_BIND_OP_MAP operations, the
> > BO
> > * handle MBZ, and the BO offset MBZ. This flag is intended to
> > * implement VK sparse bindings.
> > + * - %DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR - When the system
> > allocator flag is
> > + * set, no mappings are created rather the range is reserved for
> > system
> > + * allocations which will be populated on GPU page faults. Only
> > valid on VMs
> > + * with DRM_XE_VM_CREATE_FLAG_FAULT_MODE set. The system
> > allocator flag are
> > + * only valid for DRM_XE_VM_BIND_OP_MAP operations, the BO handle
> > MBZ, and
> > + * the BO offset MBZ.
> > */
> > struct drm_xe_vm_bind_op {
> > /** @extensions: Pointer to the first extension struct, if
> > any */
> > @@ -958,7 +964,9 @@ struct drm_xe_vm_bind_op {
> > * on the @pat_index. For such mappings there is no actual
> > memory being
> > * mapped (the address in the PTE is invalid), so the
> > various PAT memory
> > * attributes likely do not apply. Simply leaving as zero
> > is one
> > - * option (still a valid pat_index).
> > + * option (still a valid pat_index). Same applies to
> > + * DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR bindings as for such
> > mapping
> > + * there is no actual memory being mapped.
> > */
> > __u16 pat_index;
> >
> > @@ -974,6 +982,14 @@ struct drm_xe_vm_bind_op {
> >
> > /** @userptr: user pointer to bind on */
> > __u64 userptr;
> > +
> > + /**
> > + * @system_allocator_offset: Offset from GPU @addr
> > to create
> > + * system allocator mappings. MBZ with current level
> > of support
> > + * (e.g. 1 to 1 mapping between GPU and CPU mappings
> > only
> > + * supported).
> > + */
> > + __s64 system_allocator_offset;
> > };
> >
> > /**
> > @@ -996,6 +1012,7 @@ struct drm_xe_vm_bind_op {
> > #define DRM_XE_VM_BIND_FLAG_IMMEDIATE (1 << 1)
> > #define DRM_XE_VM_BIND_FLAG_NULL (1 << 2)
> > #define DRM_XE_VM_BIND_FLAG_DUMPABLE (1 << 3)
> > +#define DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR (1 << 4)
> > /** @flags: Bind flags */
> > __u32 flags;
> >
>
^ permalink raw reply [flat|nested] 129+ messages in thread
* [PATCH v2 07/29] drm/xe: Add SVM init / close / fini to faulting VMs
2024-10-16 3:24 [PATCH v2 00/29] Introduce GPU SVM and Xe SVM implementation Matthew Brost
` (5 preceding siblings ...)
2024-10-16 3:24 ` [PATCH v2 06/29] drm/xe/uapi: Add DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATON flag Matthew Brost
@ 2024-10-16 3:24 ` Matthew Brost
2024-11-19 12:13 ` Thomas Hellström
2024-10-16 3:24 ` [PATCH v2 08/29] drm/xe: Add dma_addr res cursor Matthew Brost
` (24 subsequent siblings)
31 siblings, 1 reply; 129+ messages in thread
From: Matthew Brost @ 2024-10-16 3:24 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
Add SVM init / close / fini to faulting VMs. Minimal implementation.
v2:
- Add close function
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
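For context on when this path runs: xe_svm_init() is only reached for VMs
created in fault mode. A minimal userspace sketch of creating such a VM
(illustrative only; it assumes libdrm's drmIoctl() and the existing xe uapi,
and sets LR mode alongside fault mode since the driver currently expects
fault-mode VMs to be long-running):

#include <string.h>
#include <xf86drm.h>
#include <drm/xe_drm.h>

/* Create a VM that takes the XE_VM_FLAG_FAULT_MODE path below. */
static int create_fault_mode_vm(int fd, __u32 *vm_id)
{
	struct drm_xe_vm_create create;
	int ret;

	memset(&create, 0, sizeof(create));
	create.flags = DRM_XE_VM_CREATE_FLAG_LR_MODE |
		       DRM_XE_VM_CREATE_FLAG_FAULT_MODE;

	ret = drmIoctl(fd, DRM_IOCTL_XE_VM_CREATE, &create);
	if (ret)
		return ret;

	*vm_id = create.vm_id;
	return 0;
}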
drivers/gpu/drm/xe/Makefile | 1 +
drivers/gpu/drm/xe/xe_svm.c | 46 ++++++++++++++++++++++++++++++++
drivers/gpu/drm/xe/xe_svm.h | 15 +++++++++++
drivers/gpu/drm/xe/xe_vm.c | 12 +++++++++
drivers/gpu/drm/xe/xe_vm_types.h | 7 +++++
5 files changed, 81 insertions(+)
create mode 100644 drivers/gpu/drm/xe/xe_svm.c
create mode 100644 drivers/gpu/drm/xe/xe_svm.h
diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
index 8d991d4a92a5..c3e85bcfd4d1 100644
--- a/drivers/gpu/drm/xe/Makefile
+++ b/drivers/gpu/drm/xe/Makefile
@@ -96,6 +96,7 @@ xe-y += drm_gpusvm.o \
xe_sa.o \
xe_sched_job.o \
xe_step.o \
+ xe_svm.o \
xe_sync.o \
xe_tile.o \
xe_tile_sysfs.o \
diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
new file mode 100644
index 000000000000..57b740367843
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -0,0 +1,46 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2024 Intel Corporation
+ */
+
+#include "drm_gpusvm.h"
+
+#include "xe_svm.h"
+#include "xe_vm.h"
+#include "xe_vm_types.h"
+
+static void xe_svm_invalidate(struct drm_gpusvm *gpusvm,
+ struct drm_gpusvm_notifier *notifier,
+ const struct mmu_notifier_range *mmu_range)
+{
+ /* TODO: Implement */
+}
+
+static const struct drm_gpusvm_ops gpusvm_ops = {
+ .invalidate = xe_svm_invalidate,
+};
+
+static const u64 fault_chunk_sizes[] = {
+ SZ_2M,
+ SZ_64K,
+ SZ_4K,
+};
+
+int xe_svm_init(struct xe_vm *vm)
+{
+ return drm_gpusvm_init(&vm->svm.gpusvm, "Xe SVM", &vm->xe->drm,
+ current->mm, NULL, 0, vm->size,
+ SZ_512M, &gpusvm_ops, fault_chunk_sizes,
+ ARRAY_SIZE(fault_chunk_sizes));
+}
+
+void xe_svm_close(struct xe_vm *vm)
+{
+}
+
+void xe_svm_fini(struct xe_vm *vm)
+{
+ xe_assert(vm->xe, xe_vm_is_closed(vm));
+
+ drm_gpusvm_fini(&vm->svm.gpusvm);
+}
diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
new file mode 100644
index 000000000000..979f2322eeba
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2024 Intel Corporation
+ */
+
+#ifndef _XE_SVM_H_
+#define _XE_SVM_H_
+
+struct xe_vm;
+
+int xe_svm_init(struct xe_vm *vm);
+void xe_svm_fini(struct xe_vm *vm);
+void xe_svm_close(struct xe_vm *vm);
+
+#endif
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 0d887fb9de59..b11fb0ac9520 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -35,6 +35,7 @@
#include "xe_preempt_fence.h"
#include "xe_pt.h"
#include "xe_res_cursor.h"
+#include "xe_svm.h"
#include "xe_sync.h"
#include "xe_trace_bo.h"
#include "xe_wa.h"
@@ -1503,6 +1504,12 @@ struct xe_vm *xe_vm_create(struct xe_device *xe, u32 flags)
}
}
+ if (flags & XE_VM_FLAG_FAULT_MODE) {
+ err = xe_svm_init(vm);
+ if (err)
+ goto err_close;
+ }
+
if (number_tiles > 1)
vm->composite_fence_ctx = dma_fence_context_alloc(1);
@@ -1548,6 +1555,8 @@ void xe_vm_close_and_put(struct xe_vm *vm)
xe_vm_close(vm);
if (xe_vm_in_preempt_fence_mode(vm))
flush_work(&vm->preempt.rebind_work);
+ if (xe_vm_in_fault_mode(vm))
+ xe_svm_close(vm);
down_write(&vm->lock);
for_each_tile(tile, xe, id) {
@@ -1616,6 +1625,9 @@ void xe_vm_close_and_put(struct xe_vm *vm)
xe_vma_destroy_unlocked(vma);
}
+ if (xe_vm_in_fault_mode(vm))
+ xe_svm_fini(vm);
+
up_write(&vm->lock);
down_write(&xe->usm.lock);
diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
index 1764781c376b..bd1c0e368238 100644
--- a/drivers/gpu/drm/xe/xe_vm_types.h
+++ b/drivers/gpu/drm/xe/xe_vm_types.h
@@ -6,6 +6,7 @@
#ifndef _XE_VM_TYPES_H_
#define _XE_VM_TYPES_H_
+#include "drm_gpusvm.h"
#include <drm/drm_gpuvm.h>
#include <linux/dma-resv.h>
@@ -140,6 +141,12 @@ struct xe_vm {
/** @gpuvm: base GPUVM used to track VMAs */
struct drm_gpuvm gpuvm;
+ /** @svm: Shared virtual memory state */
+ struct {
+ /** @svm.gpusvm: base GPUSVM used to track fault allocations */
+ struct drm_gpusvm gpusvm;
+ } svm;
+
struct xe_device *xe;
/* exec queue used for (un)binding vma's */
--
2.34.1
^ permalink raw reply related	[flat|nested] 129+ messages in thread
* Re: [PATCH v2 07/29] drm/xe: Add SVM init / close / fini to faulting VMs
2024-10-16 3:24 ` [PATCH v2 07/29] drm/xe: Add SVM init / close / fini to faulting VMs Matthew Brost
@ 2024-11-19 12:13 ` Thomas Hellström
2024-11-19 16:22 ` Matthew Brost
0 siblings, 1 reply; 129+ messages in thread
From: Thomas Hellström @ 2024-11-19 12:13 UTC (permalink / raw)
To: Matthew Brost, intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, simona.vetter, felix.kuehling,
dakr
On Tue, 2024-10-15 at 20:24 -0700, Matthew Brost wrote:
> Add SVM init / close / fini to faulting VMs. Minimual implementation.
>
> v2:
> - Add close function
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
> drivers/gpu/drm/xe/Makefile | 1 +
> drivers/gpu/drm/xe/xe_svm.c | 46
> ++++++++++++++++++++++++++++++++
> drivers/gpu/drm/xe/xe_svm.h | 15 +++++++++++
> drivers/gpu/drm/xe/xe_vm.c | 12 +++++++++
> drivers/gpu/drm/xe/xe_vm_types.h | 7 +++++
> 5 files changed, 81 insertions(+)
> create mode 100644 drivers/gpu/drm/xe/xe_svm.c
> create mode 100644 drivers/gpu/drm/xe/xe_svm.h
>
> diff --git a/drivers/gpu/drm/xe/Makefile
> b/drivers/gpu/drm/xe/Makefile
> index 8d991d4a92a5..c3e85bcfd4d1 100644
> --- a/drivers/gpu/drm/xe/Makefile
> +++ b/drivers/gpu/drm/xe/Makefile
> @@ -96,6 +96,7 @@ xe-y += drm_gpusvm.o \
> xe_sa.o \
> xe_sched_job.o \
> xe_step.o \
> + xe_svm.o \
> xe_sync.o \
> xe_tile.o \
> xe_tile_sysfs.o \
> diff --git a/drivers/gpu/drm/xe/xe_svm.c
> b/drivers/gpu/drm/xe/xe_svm.c
> new file mode 100644
> index 000000000000..57b740367843
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_svm.c
> @@ -0,0 +1,46 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2024 Intel Corporation
> + */
> +
> +#include "drm_gpusvm.h"
> +
> +#include "xe_svm.h"
> +#include "xe_vm.h"
> +#include "xe_vm_types.h"
> +
> +static void xe_svm_invalidate(struct drm_gpusvm *gpusvm,
> + struct drm_gpusvm_notifier *notifier,
> + const struct mmu_notifier_range
> *mmu_range)
> +{
> + /* TODO: Implement */
> +}
> +
> +static const struct drm_gpusvm_ops gpusvm_ops = {
> + .invalidate = xe_svm_invalidate,
> +};
> +
> +static const u64 fault_chunk_sizes[] = {
> + SZ_2M,
> + SZ_64K,
> + SZ_4K,
> +};
> +
> +int xe_svm_init(struct xe_vm *vm)
Kerneldoc + other undocumented extern funcs
> +{
> + return drm_gpusvm_init(&vm->svm.gpusvm, "Xe SVM", &vm->xe-
> >drm,
> + current->mm, NULL, 0, vm->size,
> + SZ_512M, &gpusvm_ops,
> fault_chunk_sizes,
> + ARRAY_SIZE(fault_chunk_sizes));
> +}
> +
> +void xe_svm_close(struct xe_vm *vm)
> +{
> +}
> +
> +void xe_svm_fini(struct xe_vm *vm)
> +{
> + xe_assert(vm->xe, xe_vm_is_closed(vm));
> +
> + drm_gpusvm_fini(&vm->svm.gpusvm);
> +}
> diff --git a/drivers/gpu/drm/xe/xe_svm.h
> b/drivers/gpu/drm/xe/xe_svm.h
> new file mode 100644
> index 000000000000..979f2322eeba
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_svm.h
> @@ -0,0 +1,15 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2024 Intel Corporation
> + */
> +
> +#ifndef _XE_SVM_H_
> +#define _XE_SVM_H_
> +
> +struct xe_vm;
> +
> +int xe_svm_init(struct xe_vm *vm);
> +void xe_svm_fini(struct xe_vm *vm);
> +void xe_svm_close(struct xe_vm *vm);
> +
> +#endif
> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> index 0d887fb9de59..b11fb0ac9520 100644
> --- a/drivers/gpu/drm/xe/xe_vm.c
> +++ b/drivers/gpu/drm/xe/xe_vm.c
> @@ -35,6 +35,7 @@
> #include "xe_preempt_fence.h"
> #include "xe_pt.h"
> #include "xe_res_cursor.h"
> +#include "xe_svm.h"
> #include "xe_sync.h"
> #include "xe_trace_bo.h"
> #include "xe_wa.h"
> @@ -1503,6 +1504,12 @@ struct xe_vm *xe_vm_create(struct xe_device
> *xe, u32 flags)
> }
> }
>
> + if (flags & XE_VM_FLAG_FAULT_MODE) {
> + err = xe_svm_init(vm);
> + if (err)
> + goto err_close;
> + }
> +
> if (number_tiles > 1)
> vm->composite_fence_ctx =
> dma_fence_context_alloc(1);
>
> @@ -1548,6 +1555,8 @@ void xe_vm_close_and_put(struct xe_vm *vm)
> xe_vm_close(vm);
> if (xe_vm_in_preempt_fence_mode(vm))
> flush_work(&vm->preempt.rebind_work);
> + if (xe_vm_in_fault_mode(vm))
> + xe_svm_close(vm);
>
> down_write(&vm->lock);
> for_each_tile(tile, xe, id) {
> @@ -1616,6 +1625,9 @@ void xe_vm_close_and_put(struct xe_vm *vm)
> xe_vma_destroy_unlocked(vma);
> }
>
> + if (xe_vm_in_fault_mode(vm))
> + xe_svm_fini(vm);
> +
> up_write(&vm->lock);
>
> down_write(&xe->usm.lock);
> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h
> b/drivers/gpu/drm/xe/xe_vm_types.h
> index 1764781c376b..bd1c0e368238 100644
> --- a/drivers/gpu/drm/xe/xe_vm_types.h
> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
> @@ -6,6 +6,7 @@
> #ifndef _XE_VM_TYPES_H_
> #define _XE_VM_TYPES_H_
>
> +#include "drm_gpusvm.h"
> #include <drm/drm_gpuvm.h>
>
> #include <linux/dma-resv.h>
> @@ -140,6 +141,12 @@ struct xe_vm {
> /** @gpuvm: base GPUVM used to track VMAs */
> struct drm_gpuvm gpuvm;
>
> + /** @svm: Shared virtual memory state */
> + struct {
> + /** @svm.gpusvm: base GPUSVM used to track fault
> allocations */
> + struct drm_gpusvm gpusvm;
> + } svm;
> +
> struct xe_device *xe;
>
> /* exec queue used for (un)binding vma's */
Thanks,
Thomas
^ permalink raw reply	[flat|nested] 129+ messages in thread
* Re: [PATCH v2 07/29] drm/xe: Add SVM init / close / fini to faulting VMs
2024-11-19 12:13 ` Thomas Hellström
@ 2024-11-19 16:22 ` Matthew Brost
0 siblings, 0 replies; 129+ messages in thread
From: Matthew Brost @ 2024-11-19 16:22 UTC (permalink / raw)
To: Thomas Hellström
Cc: intel-xe, dri-devel, apopple, airlied, christian.koenig,
simona.vetter, felix.kuehling, dakr
On Tue, Nov 19, 2024 at 01:13:26PM +0100, Thomas Hellström wrote:
> On Tue, 2024-10-15 at 20:24 -0700, Matthew Brost wrote:
> > Add SVM init / close / fini to faulting VMs. Minimual implementation.
> >
> > v2:
> > - Add close function
> >
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> > drivers/gpu/drm/xe/Makefile | 1 +
> > drivers/gpu/drm/xe/xe_svm.c | 46
> > ++++++++++++++++++++++++++++++++
> > drivers/gpu/drm/xe/xe_svm.h | 15 +++++++++++
> > drivers/gpu/drm/xe/xe_vm.c | 12 +++++++++
> > drivers/gpu/drm/xe/xe_vm_types.h | 7 +++++
> > 5 files changed, 81 insertions(+)
> > create mode 100644 drivers/gpu/drm/xe/xe_svm.c
> > create mode 100644 drivers/gpu/drm/xe/xe_svm.h
> >
> > diff --git a/drivers/gpu/drm/xe/Makefile
> > b/drivers/gpu/drm/xe/Makefile
> > index 8d991d4a92a5..c3e85bcfd4d1 100644
> > --- a/drivers/gpu/drm/xe/Makefile
> > +++ b/drivers/gpu/drm/xe/Makefile
> > @@ -96,6 +96,7 @@ xe-y += drm_gpusvm.o \
> > xe_sa.o \
> > xe_sched_job.o \
> > xe_step.o \
> > + xe_svm.o \
> > xe_sync.o \
> > xe_tile.o \
> > xe_tile_sysfs.o \
> > diff --git a/drivers/gpu/drm/xe/xe_svm.c
> > b/drivers/gpu/drm/xe/xe_svm.c
> > new file mode 100644
> > index 000000000000..57b740367843
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/xe_svm.c
> > @@ -0,0 +1,46 @@
> > +// SPDX-License-Identifier: MIT
> > +/*
> > + * Copyright © 2024 Intel Corporation
> > + */
> > +
> > +#include "drm_gpusvm.h"
> > +
> > +#include "xe_svm.h"
> > +#include "xe_vm.h"
> > +#include "xe_vm_types.h"
> > +
> > +static void xe_svm_invalidate(struct drm_gpusvm *gpusvm,
> > + struct drm_gpusvm_notifier *notifier,
> > + const struct mmu_notifier_range
> > *mmu_range)
> > +{
> > + /* TODO: Implement */
> > +}
> > +
> > +static const struct drm_gpusvm_ops gpusvm_ops = {
> > + .invalidate = xe_svm_invalidate,
> > +};
> > +
> > +static const u64 fault_chunk_sizes[] = {
> > + SZ_2M,
> > + SZ_64K,
> > + SZ_4K,
> > +};
> > +
> > +int xe_svm_init(struct xe_vm *vm)
> Kerneldoc + other undocumented extern funcs
>
This whole series is going to be missing kernel doc aside from GPUSVM.
I'm aware of the issue and will fix up all kernel doc in the next rev.
Matt
> > +{
> > + return drm_gpusvm_init(&vm->svm.gpusvm, "Xe SVM", &vm->xe-
> > >drm,
> > + current->mm, NULL, 0, vm->size,
> > + SZ_512M, &gpusvm_ops,
> > fault_chunk_sizes,
> > + ARRAY_SIZE(fault_chunk_sizes));
> > +}
> > +
> > +void xe_svm_close(struct xe_vm *vm)
>
> > +{
> > +}
> > +
> > +void xe_svm_fini(struct xe_vm *vm)
> > +{
> > + xe_assert(vm->xe, xe_vm_is_closed(vm));
> > +
> > + drm_gpusvm_fini(&vm->svm.gpusvm);
> > +}
> > diff --git a/drivers/gpu/drm/xe/xe_svm.h
> > b/drivers/gpu/drm/xe/xe_svm.h
> > new file mode 100644
> > index 000000000000..979f2322eeba
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > @@ -0,0 +1,15 @@
> > +/* SPDX-License-Identifier: MIT */
> > +/*
> > + * Copyright © 2024 Intel Corporation
> > + */
> > +
> > +#ifndef _XE_SVM_H_
> > +#define _XE_SVM_H_
> > +
> > +struct xe_vm;
> > +
> > +int xe_svm_init(struct xe_vm *vm);
> > +void xe_svm_fini(struct xe_vm *vm);
> > +void xe_svm_close(struct xe_vm *vm);
> > +
> > +#endif
> > diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> > index 0d887fb9de59..b11fb0ac9520 100644
> > --- a/drivers/gpu/drm/xe/xe_vm.c
> > +++ b/drivers/gpu/drm/xe/xe_vm.c
> > @@ -35,6 +35,7 @@
> > #include "xe_preempt_fence.h"
> > #include "xe_pt.h"
> > #include "xe_res_cursor.h"
> > +#include "xe_svm.h"
> > #include "xe_sync.h"
> > #include "xe_trace_bo.h"
> > #include "xe_wa.h"
> > @@ -1503,6 +1504,12 @@ struct xe_vm *xe_vm_create(struct xe_device
> > *xe, u32 flags)
> > }
> > }
> >
> > + if (flags & XE_VM_FLAG_FAULT_MODE) {
> > + err = xe_svm_init(vm);
> > + if (err)
> > + goto err_close;
> > + }
> > +
> > if (number_tiles > 1)
> > vm->composite_fence_ctx =
> > dma_fence_context_alloc(1);
> >
> > @@ -1548,6 +1555,8 @@ void xe_vm_close_and_put(struct xe_vm *vm)
> > xe_vm_close(vm);
> > if (xe_vm_in_preempt_fence_mode(vm))
> > flush_work(&vm->preempt.rebind_work);
> > + if (xe_vm_in_fault_mode(vm))
> > + xe_svm_close(vm);
> >
> > down_write(&vm->lock);
> > for_each_tile(tile, xe, id) {
> > @@ -1616,6 +1625,9 @@ void xe_vm_close_and_put(struct xe_vm *vm)
> > xe_vma_destroy_unlocked(vma);
> > }
> >
> > + if (xe_vm_in_fault_mode(vm))
> > + xe_svm_fini(vm);
> > +
> > up_write(&vm->lock);
> >
> > down_write(&xe->usm.lock);
> > diff --git a/drivers/gpu/drm/xe/xe_vm_types.h
> > b/drivers/gpu/drm/xe/xe_vm_types.h
> > index 1764781c376b..bd1c0e368238 100644
> > --- a/drivers/gpu/drm/xe/xe_vm_types.h
> > +++ b/drivers/gpu/drm/xe/xe_vm_types.h
> > @@ -6,6 +6,7 @@
> > #ifndef _XE_VM_TYPES_H_
> > #define _XE_VM_TYPES_H_
> >
> > +#include "drm_gpusvm.h"
> > #include <drm/drm_gpuvm.h>
> >
> > #include <linux/dma-resv.h>
> > @@ -140,6 +141,12 @@ struct xe_vm {
> > /** @gpuvm: base GPUVM used to track VMAs */
> > struct drm_gpuvm gpuvm;
> >
> > + /** @svm: Shared virtual memory state */
> > + struct {
> > + /** @svm.gpusvm: base GPUSVM used to track fault
> > allocations */
> > + struct drm_gpusvm gpusvm;
> > + } svm;
> > +
> > struct xe_device *xe;
> >
> > /* exec queue used for (un)binding vma's */
>
> Thanks,
> Thomas
>
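For reference, the kind of kerneldoc being asked for above might look roughly
like the following for xe_svm_init() (a sketch only; the actual wording is
deferred to the next revision):

/**
 * xe_svm_init() - Initialize SVM (shared virtual memory) state for a VM
 * @vm: The VM to initialize SVM state for. Expected to be a faulting VM.
 *
 * Initializes the GPU SVM tracking structure embedded in @vm, covering the
 * VM's address range [0, vm->size) with 512M notifier chunks and fault chunk
 * sizes of 2M, 64K and 4K.
 *
 * Return: 0 on success, negative error code on failure.
 */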
^ permalink raw reply [flat|nested] 129+ messages in thread
* [PATCH v2 08/29] drm/xe: Add dma_addr res cursor
2024-10-16 3:24 [PATCH v2 00/29] Introduce GPU SVM and Xe SVM implementation Matthew Brost
` (6 preceding siblings ...)
2024-10-16 3:24 ` [PATCH v2 07/29] drm/xe: Add SVM init / close / fini to faulting VMs Matthew Brost
@ 2024-10-16 3:24 ` Matthew Brost
2024-11-19 12:15 ` Thomas Hellström
2024-10-16 3:24 ` [PATCH v2 09/29] drm/xe: Add SVM range invalidation Matthew Brost
` (23 subsequent siblings)
31 siblings, 1 reply; 129+ messages in thread
From: Matthew Brost @ 2024-10-16 3:24 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
From: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Useful for SVM ranges in SRAM and for programming page tables.
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
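A minimal usage sketch of the new cursor mode (illustrative; program_pte() is
a made-up placeholder for the PTE programming that later patches do in
xe_pt.c):

	struct xe_res_cursor cur;

	/* Walk a drm_pagemap_dma_addr array covering npages pages. */
	for (xe_res_first_dma(dma_addr, 0, (u64)npages << PAGE_SHIFT, &cur);
	     cur.remaining; xe_res_next(&cur, PAGE_SIZE)) {
		u64 addr = xe_res_dma(&cur);		/* current DMA address */
		bool is_vram = xe_res_is_vram(&cur);	/* local VRAM vs. SRAM */

		program_pte(addr, is_vram);		/* hypothetical helper */
	}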
drivers/gpu/drm/xe/xe_res_cursor.h | 116 ++++++++++++++++++++++++++++-
drivers/gpu/drm/xe/xe_svm.h | 4 +
2 files changed, 118 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_res_cursor.h b/drivers/gpu/drm/xe/xe_res_cursor.h
index dca374b6521c..3faa3d9adb82 100644
--- a/drivers/gpu/drm/xe/xe_res_cursor.h
+++ b/drivers/gpu/drm/xe/xe_res_cursor.h
@@ -30,13 +30,18 @@
#include <drm/ttm/ttm_range_manager.h>
#include <drm/ttm/ttm_resource.h>
#include <drm/ttm/ttm_tt.h>
+#include "drm_pagemap.h"
#include "xe_bo.h"
#include "xe_device.h"
#include "xe_macros.h"
+#include "xe_svm.h"
#include "xe_ttm_vram_mgr.h"
-/* state back for walking over vram_mgr, stolen_mgr, and gtt_mgr allocations */
+/**
+ * struct xe_res_cursor - state for walking over vram_mgr, stolen_mgr,
+ * and gtt_mgr allocations
+ */
struct xe_res_cursor {
u64 start;
u64 size;
@@ -44,7 +49,17 @@ struct xe_res_cursor {
void *node;
u32 mem_type;
struct scatterlist *sgl;
+ /** @dma_addr: Current element in a struct drm_pagemap_dma_addr array */
+ const struct drm_pagemap_dma_addr *dma_addr;
struct drm_buddy *mm;
+ /**
+ * @dma_start: DMA start address for the current segment.
+ * This may be different to @dma_addr.addr since elements in
+ * the array may be coalesced to a single segment.
+ */
+ u64 dma_start;
+ /** @dma_seg_size: Size of the current segment. */
+ u64 dma_seg_size;
};
static struct drm_buddy *xe_res_get_buddy(struct ttm_resource *res)
@@ -70,6 +85,7 @@ static inline void xe_res_first(struct ttm_resource *res,
struct xe_res_cursor *cur)
{
cur->sgl = NULL;
+ cur->dma_addr = NULL;
if (!res)
goto fallback;
@@ -141,6 +157,36 @@ static inline void __xe_res_sg_next(struct xe_res_cursor *cur)
cur->sgl = sgl;
}
+/**
+ * __xe_res_dma_next() - Advance the cursor when end-of-segment is reached
+ * @cur: The cursor
+ */
+static inline void __xe_res_dma_next(struct xe_res_cursor *cur)
+{
+ const struct drm_pagemap_dma_addr *addr = cur->dma_addr;
+ u64 start = cur->start;
+
+ while (start >= cur->dma_seg_size) {
+ start -= cur->dma_seg_size;
+ addr++;
+ cur->dma_seg_size = PAGE_SIZE << addr->order;
+ }
+ cur->dma_start = addr->addr;
+
+ /* Coalesce array_elements */
+ while (cur->dma_seg_size - start < cur->remaining) {
+ if (cur->dma_start + cur->dma_seg_size != addr[1].addr ||
+ addr->proto != addr[1].proto)
+ break;
+ addr++;
+ cur->dma_seg_size += PAGE_SIZE << addr->order;
+ }
+
+ cur->dma_addr = addr;
+ cur->start = start;
+ cur->size = cur->dma_seg_size - start;
+}
+
/**
* xe_res_first_sg - initialize a xe_res_cursor with a scatter gather table
*
@@ -160,11 +206,42 @@ static inline void xe_res_first_sg(const struct sg_table *sg,
cur->start = start;
cur->remaining = size;
cur->size = 0;
+ cur->dma_addr = NULL;
cur->sgl = sg->sgl;
cur->mem_type = XE_PL_TT;
__xe_res_sg_next(cur);
}
+/**
+ * xe_res_first_dma - initialize a xe_res_cursor with dma_addr array
+ *
+ * @dma_addr: struct drm_pagemap_dma_addr array to walk
+ * @start: Start of the range
+ * @size: Size of the range
+ * @cur: cursor object to initialize
+ *
+ * Start walking over the range of allocations between @start and @size.
+ */
+static inline void xe_res_first_dma(const struct drm_pagemap_dma_addr *dma_addr,
+ u64 start, u64 size,
+ struct xe_res_cursor *cur)
+{
+ XE_WARN_ON(!dma_addr);
+ XE_WARN_ON(!IS_ALIGNED(start, PAGE_SIZE) ||
+ !IS_ALIGNED(size, PAGE_SIZE));
+
+ cur->node = NULL;
+ cur->start = start;
+ cur->remaining = size;
+ cur->dma_seg_size = PAGE_SIZE << dma_addr->order;
+ cur->dma_start = 0;
+ cur->size = 0;
+ cur->dma_addr = dma_addr;
+ __xe_res_dma_next(cur);
+ cur->sgl = NULL;
+ cur->mem_type = XE_PL_TT;
+}
+
/**
* xe_res_next - advance the cursor
*
@@ -191,6 +268,12 @@ static inline void xe_res_next(struct xe_res_cursor *cur, u64 size)
return;
}
+ if (cur->dma_addr) {
+ cur->start += size;
+ __xe_res_dma_next(cur);
+ return;
+ }
+
if (cur->sgl) {
cur->start += size;
__xe_res_sg_next(cur);
@@ -232,6 +315,35 @@ static inline void xe_res_next(struct xe_res_cursor *cur, u64 size)
*/
static inline u64 xe_res_dma(const struct xe_res_cursor *cur)
{
- return cur->sgl ? sg_dma_address(cur->sgl) + cur->start : cur->start;
+ if (cur->dma_addr)
+ return cur->dma_start + cur->start;
+ else if (cur->sgl)
+ return sg_dma_address(cur->sgl) + cur->start;
+ else
+ return cur->start;
+}
+
+/**
+ * xe_res_is_vram() - Whether the cursor current dma address points to
+ * same-device VRAM
+ * @cur: The cursor.
+ *
+ * Return: true iff the address returned by xe_res_dma() points to internal vram.
+ */
+static inline bool xe_res_is_vram(const struct xe_res_cursor *cur)
+{
+ if (cur->dma_addr)
+ return cur->dma_addr->proto == XE_INTERCONNECT_VRAM;
+
+ switch (cur->mem_type) {
+ case XE_PL_STOLEN:
+ case XE_PL_VRAM0:
+ case XE_PL_VRAM1:
+ return true;
+ default:
+ break;
+ }
+
+ return false;
}
#endif
diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index 979f2322eeba..376e86876a11 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -6,6 +6,10 @@
#ifndef _XE_SVM_H_
#define _XE_SVM_H_
+#include "drm_pagemap.h"
+
+#define XE_INTERCONNECT_VRAM DRM_INTERCONNECT_DRIVER
+
struct xe_vm;
int xe_svm_init(struct xe_vm *vm);
--
2.34.1
^ permalink raw reply related	[flat|nested] 129+ messages in thread
* Re: [PATCH v2 08/29] drm/xe: Add dma_addr res cursor
2024-10-16 3:24 ` [PATCH v2 08/29] drm/xe: Add dma_addr res cursor Matthew Brost
@ 2024-11-19 12:15 ` Thomas Hellström
2024-11-19 16:24 ` Matthew Brost
0 siblings, 1 reply; 129+ messages in thread
From: Thomas Hellström @ 2024-11-19 12:15 UTC (permalink / raw)
To: Matthew Brost, intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, simona.vetter, felix.kuehling,
dakr
On Tue, 2024-10-15 at 20:24 -0700, Matthew Brost wrote:
> From: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>
> Useful for SVM ranges in SRAM and programing page tables.
We should look at providing a better commit message.
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> ---
> drivers/gpu/drm/xe/xe_res_cursor.h | 116
> ++++++++++++++++++++++++++++-
> drivers/gpu/drm/xe/xe_svm.h | 4 +
> 2 files changed, 118 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_res_cursor.h
> b/drivers/gpu/drm/xe/xe_res_cursor.h
> index dca374b6521c..3faa3d9adb82 100644
> --- a/drivers/gpu/drm/xe/xe_res_cursor.h
> +++ b/drivers/gpu/drm/xe/xe_res_cursor.h
> @@ -30,13 +30,18 @@
> #include <drm/ttm/ttm_range_manager.h>
> #include <drm/ttm/ttm_resource.h>
> #include <drm/ttm/ttm_tt.h>
> +#include "drm_pagemap.h"
>
> #include "xe_bo.h"
> #include "xe_device.h"
> #include "xe_macros.h"
> +#include "xe_svm.h"
> #include "xe_ttm_vram_mgr.h"
>
> -/* state back for walking over vram_mgr, stolen_mgr, and gtt_mgr
> allocations */
> +/**
> + * struct xe_res_cursor - state for walking over vram_mgr,
> stolen_mgr,
> + * and gtt_mgr allocations
> + */
> struct xe_res_cursor {
> u64 start;
> u64 size;
> @@ -44,7 +49,17 @@ struct xe_res_cursor {
> void *node;
> u32 mem_type;
> struct scatterlist *sgl;
> + /** @dma_addr: Current element in a struct
> drm_pagemap_dma_addr array */
> + const struct drm_pagemap_dma_addr *dma_addr;
> struct drm_buddy *mm;
> + /**
> + * @dma_start: DMA start address for the current segment.
> + * This may be different to @dma_addr.addr since elements in
> + * the array may be coalesced to a single segment.
> + */
> + u64 dma_start;
> + /** @dma_seg_size: Size of the current segment. */
> + u64 dma_seg_size;
> };
>
> static struct drm_buddy *xe_res_get_buddy(struct ttm_resource *res)
> @@ -70,6 +85,7 @@ static inline void xe_res_first(struct ttm_resource
> *res,
> struct xe_res_cursor *cur)
> {
> cur->sgl = NULL;
> + cur->dma_addr = NULL;
> if (!res)
> goto fallback;
>
> @@ -141,6 +157,36 @@ static inline void __xe_res_sg_next(struct
> xe_res_cursor *cur)
> cur->sgl = sgl;
> }
>
> +/**
> + * __xe_res_dma_next() - Advance the cursor when end-of-segment is
> reached
> + * @cur: The cursor
> + */
> +static inline void __xe_res_dma_next(struct xe_res_cursor *cur)
> +{
> + const struct drm_pagemap_dma_addr *addr = cur->dma_addr;
> + u64 start = cur->start;
> +
> + while (start >= cur->dma_seg_size) {
> + start -= cur->dma_seg_size;
> + addr++;
> + cur->dma_seg_size = PAGE_SIZE << addr->order;
> + }
> + cur->dma_start = addr->addr;
> +
> + /* Coalesce array_elements */
> + while (cur->dma_seg_size - start < cur->remaining) {
> + if (cur->dma_start + cur->dma_seg_size !=
> addr[1].addr ||
> + addr->proto != addr[1].proto)
> + break;
> + addr++;
> + cur->dma_seg_size += PAGE_SIZE << addr->order;
> + }
> +
> + cur->dma_addr = addr;
> + cur->start = start;
> + cur->size = cur->dma_seg_size - start;
> +}
> +
> /**
> * xe_res_first_sg - initialize a xe_res_cursor with a scatter
> gather table
> *
> @@ -160,11 +206,42 @@ static inline void xe_res_first_sg(const struct
> sg_table *sg,
> cur->start = start;
> cur->remaining = size;
> cur->size = 0;
> + cur->dma_addr = NULL;
> cur->sgl = sg->sgl;
> cur->mem_type = XE_PL_TT;
> __xe_res_sg_next(cur);
> }
>
> +/**
> + * xe_res_first_dma - initialize a xe_res_cursor with dma_addr array
> + *
> + * @dma_addr: struct drm_pagemap_dma_addr array to walk
> + * @start: Start of the range
> + * @size: Size of the range
> + * @cur: cursor object to initialize
> + *
> + * Start walking over the range of allocations between @start and
> @size.
> + */
> +static inline void xe_res_first_dma(const struct
> drm_pagemap_dma_addr *dma_addr,
> + u64 start, u64 size,
> + struct xe_res_cursor *cur)
> +{
> + XE_WARN_ON(!dma_addr);
> + XE_WARN_ON(!IS_ALIGNED(start, PAGE_SIZE) ||
> + !IS_ALIGNED(size, PAGE_SIZE));
> +
> + cur->node = NULL;
> + cur->start = start;
> + cur->remaining = size;
> + cur->dma_seg_size = PAGE_SIZE << dma_addr->order;
> + cur->dma_start = 0;
> + cur->size = 0;
> + cur->dma_addr = dma_addr;
> + __xe_res_dma_next(cur);
> + cur->sgl = NULL;
> + cur->mem_type = XE_PL_TT;
> +}
> +
> /**
> * xe_res_next - advance the cursor
> *
> @@ -191,6 +268,12 @@ static inline void xe_res_next(struct
> xe_res_cursor *cur, u64 size)
> return;
> }
>
> + if (cur->dma_addr) {
> + cur->start += size;
> + __xe_res_dma_next(cur);
> + return;
> + }
> +
> if (cur->sgl) {
> cur->start += size;
> __xe_res_sg_next(cur);
> @@ -232,6 +315,35 @@ static inline void xe_res_next(struct
> xe_res_cursor *cur, u64 size)
> */
> static inline u64 xe_res_dma(const struct xe_res_cursor *cur)
> {
> - return cur->sgl ? sg_dma_address(cur->sgl) + cur->start :
> cur->start;
> + if (cur->dma_addr)
> + return cur->dma_start + cur->start;
> + else if (cur->sgl)
> + return sg_dma_address(cur->sgl) + cur->start;
> + else
> + return cur->start;
> +}
> +
> +/**
> + * xe_res_is_vram() - Whether the cursor current dma address points
> to
> + * same-device VRAM
> + * @cur: The cursor.
> + *
> + * Return: true iff the address returned by xe_res_dma() points to
> internal vram.
> + */
> +static inline bool xe_res_is_vram(const struct xe_res_cursor *cur)
> +{
> + if (cur->dma_addr)
> + return cur->dma_addr->proto == XE_INTERCONNECT_VRAM;
> +
> + switch (cur->mem_type) {
> + case XE_PL_STOLEN:
> + case XE_PL_VRAM0:
> + case XE_PL_VRAM1:
> + return true;
> + default:
> + break;
> + }
> +
> + return false;
> }
> #endif
> diff --git a/drivers/gpu/drm/xe/xe_svm.h
> b/drivers/gpu/drm/xe/xe_svm.h
> index 979f2322eeba..376e86876a11 100644
> --- a/drivers/gpu/drm/xe/xe_svm.h
> +++ b/drivers/gpu/drm/xe/xe_svm.h
> @@ -6,6 +6,10 @@
> #ifndef _XE_SVM_H_
> #define _XE_SVM_H_
>
> +#include "drm_pagemap.h"
> +
> +#define XE_INTERCONNECT_VRAM DRM_INTERCONNECT_DRIVER
> +
> struct xe_vm;
>
> int xe_svm_init(struct xe_vm *vm);
^ permalink raw reply	[flat|nested] 129+ messages in thread
* Re: [PATCH v2 08/29] drm/xe: Add dma_addr res cursor
2024-11-19 12:15 ` Thomas Hellström
@ 2024-11-19 16:24 ` Matthew Brost
0 siblings, 0 replies; 129+ messages in thread
From: Matthew Brost @ 2024-11-19 16:24 UTC (permalink / raw)
To: Thomas Hellström
Cc: intel-xe, dri-devel, apopple, airlied, christian.koenig,
simona.vetter, felix.kuehling, dakr
On Tue, Nov 19, 2024 at 01:15:12PM +0100, Thomas Hellström wrote:
> On Tue, 2024-10-15 at 20:24 -0700, Matthew Brost wrote:
> > From: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> >
> > Useful for SVM ranges in SRAM and programing page tables.
>
> We should look at providing a better commit message.
>
Yes, this is pretty poor. Will scrub all commit messages in the next rev
and ensure higher quality.
Matt
> >
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > ---
> > drivers/gpu/drm/xe/xe_res_cursor.h | 116
> > ++++++++++++++++++++++++++++-
> > drivers/gpu/drm/xe/xe_svm.h | 4 +
> > 2 files changed, 118 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_res_cursor.h
> > b/drivers/gpu/drm/xe/xe_res_cursor.h
> > index dca374b6521c..3faa3d9adb82 100644
> > --- a/drivers/gpu/drm/xe/xe_res_cursor.h
> > +++ b/drivers/gpu/drm/xe/xe_res_cursor.h
> > @@ -30,13 +30,18 @@
> > #include <drm/ttm/ttm_range_manager.h>
> > #include <drm/ttm/ttm_resource.h>
> > #include <drm/ttm/ttm_tt.h>
> > +#include "drm_pagemap.h"
> >
> > #include "xe_bo.h"
> > #include "xe_device.h"
> > #include "xe_macros.h"
> > +#include "xe_svm.h"
> > #include "xe_ttm_vram_mgr.h"
> >
> > -/* state back for walking over vram_mgr, stolen_mgr, and gtt_mgr
> > allocations */
> > +/**
> > + * struct xe_res_cursor - state for walking over vram_mgr,
> > stolen_mgr,
> > + * and gtt_mgr allocations
> > + */
> > struct xe_res_cursor {
> > u64 start;
> > u64 size;
> > @@ -44,7 +49,17 @@ struct xe_res_cursor {
> > void *node;
> > u32 mem_type;
> > struct scatterlist *sgl;
> > + /** @dma_addr: Current element in a struct
> > drm_pagemap_dma_addr array */
> > + const struct drm_pagemap_dma_addr *dma_addr;
> > struct drm_buddy *mm;
> > + /**
> > + * @dma_start: DMA start address for the current segment.
> > + * This may be different to @dma_addr.addr since elements in
> > + * the array may be coalesced to a single segment.
> > + */
> > + u64 dma_start;
> > + /** @dma_seg_size: Size of the current segment. */
> > + u64 dma_seg_size;
> > };
> >
> > static struct drm_buddy *xe_res_get_buddy(struct ttm_resource *res)
> > @@ -70,6 +85,7 @@ static inline void xe_res_first(struct ttm_resource
> > *res,
> > struct xe_res_cursor *cur)
> > {
> > cur->sgl = NULL;
> > + cur->dma_addr = NULL;
> > if (!res)
> > goto fallback;
> >
> > @@ -141,6 +157,36 @@ static inline void __xe_res_sg_next(struct
> > xe_res_cursor *cur)
> > cur->sgl = sgl;
> > }
> >
> > +/**
> > + * __xe_res_dma_next() - Advance the cursor when end-of-segment is
> > reached
> > + * @cur: The cursor
> > + */
> > +static inline void __xe_res_dma_next(struct xe_res_cursor *cur)
> > +{
> > + const struct drm_pagemap_dma_addr *addr = cur->dma_addr;
> > + u64 start = cur->start;
> > +
> > + while (start >= cur->dma_seg_size) {
> > + start -= cur->dma_seg_size;
> > + addr++;
> > + cur->dma_seg_size = PAGE_SIZE << addr->order;
> > + }
> > + cur->dma_start = addr->addr;
> > +
> > + /* Coalesce array_elements */
> > + while (cur->dma_seg_size - start < cur->remaining) {
> > + if (cur->dma_start + cur->dma_seg_size !=
> > addr[1].addr ||
> > + addr->proto != addr[1].proto)
> > + break;
> > + addr++;
> > + cur->dma_seg_size += PAGE_SIZE << addr->order;
> > + }
> > +
> > + cur->dma_addr = addr;
> > + cur->start = start;
> > + cur->size = cur->dma_seg_size - start;
> > +}
> > +
> > /**
> > * xe_res_first_sg - initialize a xe_res_cursor with a scatter
> > gather table
> > *
> > @@ -160,11 +206,42 @@ static inline void xe_res_first_sg(const struct
> > sg_table *sg,
> > cur->start = start;
> > cur->remaining = size;
> > cur->size = 0;
> > + cur->dma_addr = NULL;
> > cur->sgl = sg->sgl;
> > cur->mem_type = XE_PL_TT;
> > __xe_res_sg_next(cur);
> > }
> >
> > +/**
> > + * xe_res_first_dma - initialize a xe_res_cursor with dma_addr array
> > + *
> > + * @dma_addr: struct drm_pagemap_dma_addr array to walk
> > + * @start: Start of the range
> > + * @size: Size of the range
> > + * @cur: cursor object to initialize
> > + *
> > + * Start walking over the range of allocations between @start and
> > @size.
> > + */
> > +static inline void xe_res_first_dma(const struct
> > drm_pagemap_dma_addr *dma_addr,
> > + u64 start, u64 size,
> > + struct xe_res_cursor *cur)
> > +{
> > + XE_WARN_ON(!dma_addr);
> > + XE_WARN_ON(!IS_ALIGNED(start, PAGE_SIZE) ||
> > + !IS_ALIGNED(size, PAGE_SIZE));
> > +
> > + cur->node = NULL;
> > + cur->start = start;
> > + cur->remaining = size;
> > + cur->dma_seg_size = PAGE_SIZE << dma_addr->order;
> > + cur->dma_start = 0;
> > + cur->size = 0;
> > + cur->dma_addr = dma_addr;
> > + __xe_res_dma_next(cur);
> > + cur->sgl = NULL;
> > + cur->mem_type = XE_PL_TT;
> > +}
> > +
> > /**
> > * xe_res_next - advance the cursor
> > *
> > @@ -191,6 +268,12 @@ static inline void xe_res_next(struct
> > xe_res_cursor *cur, u64 size)
> > return;
> > }
> >
> > + if (cur->dma_addr) {
> > + cur->start += size;
> > + __xe_res_dma_next(cur);
> > + return;
> > + }
> > +
> > if (cur->sgl) {
> > cur->start += size;
> > __xe_res_sg_next(cur);
> > @@ -232,6 +315,35 @@ static inline void xe_res_next(struct
> > xe_res_cursor *cur, u64 size)
> > */
> > static inline u64 xe_res_dma(const struct xe_res_cursor *cur)
> > {
> > - return cur->sgl ? sg_dma_address(cur->sgl) + cur->start :
> > cur->start;
> > + if (cur->dma_addr)
> > + return cur->dma_start + cur->start;
> > + else if (cur->sgl)
> > + return sg_dma_address(cur->sgl) + cur->start;
> > + else
> > + return cur->start;
> > +}
> > +
> > +/**
> > + * xe_res_is_vram() - Whether the cursor current dma address points
> > to
> > + * same-device VRAM
> > + * @cur: The cursor.
> > + *
> > + * Return: true iff the address returned by xe_res_dma() points to
> > internal vram.
> > + */
> > +static inline bool xe_res_is_vram(const struct xe_res_cursor *cur)
> > +{
> > + if (cur->dma_addr)
> > + return cur->dma_addr->proto == XE_INTERCONNECT_VRAM;
> > +
> > + switch (cur->mem_type) {
> > + case XE_PL_STOLEN:
> > + case XE_PL_VRAM0:
> > + case XE_PL_VRAM1:
> > + return true;
> > + default:
> > + break;
> > + }
> > +
> > + return false;
> > }
> > #endif
> > diff --git a/drivers/gpu/drm/xe/xe_svm.h
> > b/drivers/gpu/drm/xe/xe_svm.h
> > index 979f2322eeba..376e86876a11 100644
> > --- a/drivers/gpu/drm/xe/xe_svm.h
> > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > @@ -6,6 +6,10 @@
> > #ifndef _XE_SVM_H_
> > #define _XE_SVM_H_
> >
> > +#include "drm_pagemap.h"
> > +
> > +#define XE_INTERCONNECT_VRAM DRM_INTERCONNECT_DRIVER
> > +
> > struct xe_vm;
> >
> > int xe_svm_init(struct xe_vm *vm);
>
^ permalink raw reply [flat|nested] 129+ messages in thread
* [PATCH v2 09/29] drm/xe: Add SVM range invalidation
2024-10-16 3:24 [PATCH v2 00/29] Introduce GPU SVM and Xe SVM implementation Matthew Brost
` (7 preceding siblings ...)
2024-10-16 3:24 ` [PATCH v2 08/29] drm/xe: Add dma_addr res cursor Matthew Brost
@ 2024-10-16 3:24 ` Matthew Brost
2024-11-19 13:56 ` Thomas Hellström
2024-10-16 3:24 ` [PATCH v2 10/29] drm/gpuvm: Add DRM_GPUVA_OP_USER Matthew Brost
` (22 subsequent siblings)
31 siblings, 1 reply; 129+ messages in thread
From: Matthew Brost @ 2024-10-16 3:24 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
Add SVM range invalidation vfunc.
v2:
- Don't run invalidation if VM is closed
- Cycle notifier lock in xe_svm_close
- Drop xe_gt_tlb_invalidation_fence_fini
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
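For orientation, the calling sequence this patch wires up is roughly the
following (a summary of the code below, not additional behaviour):

/*
 * CPU-side unmap / migration of memory covered by an SVM range
 *   -> mmu notifier invalidation
 *     -> drm_gpusvm_ops.invalidate == xe_svm_invalidate()
 *          - wait for the VM's dma-resv bookkeep fences
 *          - xe_pt_zap_ptes_range() on each tile with a GPU binding
 *          - xe_gt_tlb_invalidation_range() on primary (and media) GTs
 *          - drm_gpusvm_range_unmap_pages() on each affected range
 */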
drivers/gpu/drm/xe/xe_gt_pagefault.c | 17 ++-
drivers/gpu/drm/xe/xe_pt.c | 24 ++++
drivers/gpu/drm/xe/xe_pt.h | 3 +
drivers/gpu/drm/xe/xe_svm.c | 205 ++++++++++++++++++++++++++-
drivers/gpu/drm/xe/xe_svm.h | 13 ++
5 files changed, 256 insertions(+), 6 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c b/drivers/gpu/drm/xe/xe_gt_pagefault.c
index 79c426dc2505..92923947a12c 100644
--- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
+++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
@@ -19,6 +19,7 @@
#include "xe_guc.h"
#include "xe_guc_ct.h"
#include "xe_migrate.h"
+#include "xe_svm.h"
#include "xe_trace_bo.h"
#include "xe_vm.h"
@@ -125,18 +126,17 @@ static int xe_pf_begin(struct drm_exec *exec, struct xe_vma *vma,
return 0;
}
-static int handle_vma_pagefault(struct xe_tile *tile, struct pagefault *pf,
- struct xe_vma *vma)
+static int handle_vma_pagefault(struct xe_tile *tile, struct xe_vma *vma,
+ bool atomic)
{
struct xe_vm *vm = xe_vma_vm(vma);
struct drm_exec exec;
struct dma_fence *fence;
ktime_t end = 0;
int err;
- bool atomic;
+ lockdep_assert_held_write(&vm->lock);
trace_xe_vma_pagefault(vma);
- atomic = access_is_atomic(pf->access_type);
/* Check if VMA is valid */
if (vma_is_valid(tile, vma) && !atomic)
@@ -207,6 +207,7 @@ static int handle_pagefault(struct xe_gt *gt, struct pagefault *pf)
struct xe_vm *vm;
struct xe_vma *vma = NULL;
int err;
+ bool atomic;
/* SW isn't expected to handle TRTT faults */
if (pf->trva_fault)
@@ -232,7 +233,13 @@ static int handle_pagefault(struct xe_gt *gt, struct pagefault *pf)
goto unlock_vm;
}
- err = handle_vma_pagefault(tile, pf, vma);
+ atomic = access_is_atomic(pf->access_type);
+
+ if (xe_vma_is_system_allocator(vma))
+ err = xe_svm_handle_pagefault(vm, vma, tile,
+ pf->page_addr, atomic);
+ else
+ err = handle_vma_pagefault(tile, vma, atomic);
unlock_vm:
if (!err)
diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
index 39357e829b6d..282476c4edbd 100644
--- a/drivers/gpu/drm/xe/xe_pt.c
+++ b/drivers/gpu/drm/xe/xe_pt.c
@@ -20,6 +20,7 @@
#include "xe_res_cursor.h"
#include "xe_sched_job.h"
#include "xe_sync.h"
+#include "xe_svm.h"
#include "xe_trace.h"
#include "xe_ttm_stolen_mgr.h"
#include "xe_vm.h"
@@ -829,6 +830,29 @@ bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma)
return xe_walk.needs_invalidate;
}
+bool xe_pt_zap_ptes_range(struct xe_tile *tile, struct xe_vm *vm,
+ struct xe_svm_range *range)
+{
+ struct xe_pt_zap_ptes_walk xe_walk = {
+ .base = {
+ .ops = &xe_pt_zap_ptes_ops,
+ .shifts = xe_normal_pt_shifts,
+ .max_level = XE_PT_HIGHEST_LEVEL,
+ },
+ .tile = tile,
+ };
+ struct xe_pt *pt = vm->pt_root[tile->id];
+ u8 pt_mask = (range->tile_present & ~range->tile_invalidated);
+
+ if (!(pt_mask & BIT(tile->id)))
+ return false;
+
+ (void)xe_pt_walk_shared(&pt->base, pt->level, range->base.va.start,
+ range->base.va.end, &xe_walk.base);
+
+ return xe_walk.needs_invalidate;
+}
+
static void
xe_vm_populate_pgtable(struct xe_migrate_pt_update *pt_update, struct xe_tile *tile,
struct iosys_map *map, void *data,
diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h
index 9ab386431cad..5f333eeedf5c 100644
--- a/drivers/gpu/drm/xe/xe_pt.h
+++ b/drivers/gpu/drm/xe/xe_pt.h
@@ -13,6 +13,7 @@ struct dma_fence;
struct xe_bo;
struct xe_device;
struct xe_exec_queue;
+struct xe_svm_range;
struct xe_sync_entry;
struct xe_tile;
struct xe_vm;
@@ -42,5 +43,7 @@ void xe_pt_update_ops_fini(struct xe_tile *tile, struct xe_vma_ops *vops);
void xe_pt_update_ops_abort(struct xe_tile *tile, struct xe_vma_ops *vops);
bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
+bool xe_pt_zap_ptes_range(struct xe_tile *tile, struct xe_vm *vm,
+ struct xe_svm_range *range);
#endif
diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index 57b740367843..b2bc259978c4 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -5,18 +5,188 @@
#include "drm_gpusvm.h"
+#include "xe_gt_tlb_invalidation.h"
+#include "xe_pt.h"
#include "xe_svm.h"
#include "xe_vm.h"
#include "xe_vm_types.h"
+static struct xe_vm *gpusvm_to_vm(struct drm_gpusvm *gpusvm)
+{
+ return container_of(gpusvm, struct xe_vm, svm.gpusvm);
+}
+
+static struct xe_vm *range_to_vm(struct drm_gpusvm_range *r)
+{
+ return gpusvm_to_vm(r->gpusvm);
+}
+
+static struct drm_gpusvm_range *
+xe_svm_range_alloc(struct drm_gpusvm *gpusvm)
+{
+ struct xe_svm_range *range;
+
+ range = kzalloc(sizeof(*range), GFP_KERNEL);
+ if (!range)
+ return ERR_PTR(-ENOMEM);
+
+ xe_vm_get(gpusvm_to_vm(gpusvm));
+
+ return &range->base;
+}
+
+static void xe_svm_range_free(struct drm_gpusvm_range *range)
+{
+ xe_vm_put(range_to_vm(range));
+ kfree(range);
+}
+
+static struct xe_svm_range *to_xe_range(struct drm_gpusvm_range *r)
+{
+ return container_of(r, struct xe_svm_range, base);
+}
+
+static u8
+xe_svm_range_notifier_event_begin(struct xe_vm *vm, struct drm_gpusvm_range *r,
+ const struct mmu_notifier_range *mmu_range,
+ u64 *adj_start, u64 *adj_end)
+{
+ struct xe_svm_range *range = to_xe_range(r);
+ struct xe_device *xe = vm->xe;
+ struct xe_tile *tile;
+ u8 tile_mask = 0;
+ u8 id;
+
+ /* Skip if already unmapped or if no binding exist */
+ if (range->base.flags.unmapped || !range->tile_present)
+ return 0;
+
+ /* Adjust invalidation to range boundaries */
+ if (range->base.va.start < mmu_range->start)
+ *adj_start = range->base.va.start;
+ if (range->base.va.end > mmu_range->end)
+ *adj_end = range->base.va.end;
+
+ /*
+ * XXX: Ideally would zap PTEs in one shot in xe_svm_invalidate but the
+ * invalidation code can't correctly cope with sparse ranges or
+ * invalidations spanning multiple ranges.
+ */
+ for_each_tile(tile, xe, id)
+ if (xe_pt_zap_ptes_range(tile, vm, range)) {
+ tile_mask |= BIT(id);
+ range->tile_invalidated |= BIT(id);
+ }
+
+ return tile_mask;
+}
+
+static void
+xe_svm_range_notifier_event_end(struct xe_vm *vm, struct drm_gpusvm_range *r,
+ const struct mmu_notifier_range *mmu_range)
+{
+ struct drm_gpusvm_ctx ctx = { .in_notifier = true, };
+
+ drm_gpusvm_range_unmap_pages(&vm->svm.gpusvm, r, &ctx);
+ /* TODO: Add range to garbage collector */
+}
+
static void xe_svm_invalidate(struct drm_gpusvm *gpusvm,
struct drm_gpusvm_notifier *notifier,
const struct mmu_notifier_range *mmu_range)
{
- /* TODO: Implement */
+ struct xe_vm *vm = gpusvm_to_vm(gpusvm);
+ struct xe_device *xe = vm->xe;
+ struct xe_tile *tile;
+ struct drm_gpusvm_range *r, *first;
+ struct xe_gt_tlb_invalidation_fence
+ fence[XE_MAX_TILES_PER_DEVICE * XE_MAX_GT_PER_TILE];
+ u64 adj_start = mmu_range->start, adj_end = mmu_range->end;
+ u8 tile_mask = 0;
+ u8 id;
+ u32 fence_id = 0;
+ long err;
+
+ if (xe_vm_is_closed(vm))
+ return;
+
+ /* Adjust invalidation to notifier boundaries */
+ if (adj_start < notifier->interval.start)
+ adj_start = notifier->interval.start;
+ if (adj_end > notifier->interval.end)
+ adj_end = notifier->interval.end;
+
+ first = drm_gpusvm_range_find(notifier, adj_start, adj_end);
+ if (!first)
+ return;
+
+ /*
+ * XXX: Less than ideal to always wait on VM's resv slots if an
+ * invalidation is not required. Could walk range list twice to figure
+ * out if an invalidations is need, but also not ideal. Maybe a counter
+ * within the notifier, seems like that could work.
+ */
+ err = dma_resv_wait_timeout(xe_vm_resv(vm),
+ DMA_RESV_USAGE_BOOKKEEP,
+ false, MAX_SCHEDULE_TIMEOUT);
+ XE_WARN_ON(err <= 0);
+
+ r = first;
+ drm_gpusvm_for_each_range(r, notifier, adj_start, adj_end)
+ tile_mask |= xe_svm_range_notifier_event_begin(vm, r, mmu_range,
+ &adj_start,
+ &adj_end);
+ if (!tile_mask)
+ goto range_notifier_event_end;
+
+ xe_device_wmb(xe);
+
+ for_each_tile(tile, xe, id) {
+ if (tile_mask & BIT(id)) {
+ int err;
+
+ xe_gt_tlb_invalidation_fence_init(tile->primary_gt,
+ &fence[fence_id], true);
+
+ err = xe_gt_tlb_invalidation_range(tile->primary_gt,
+ &fence[fence_id],
+ adj_start,
+ adj_end,
+ vm->usm.asid);
+ if (WARN_ON_ONCE(err < 0))
+ goto wait;
+ ++fence_id;
+
+ if (!tile->media_gt)
+ continue;
+
+ xe_gt_tlb_invalidation_fence_init(tile->media_gt,
+ &fence[fence_id], true);
+
+ err = xe_gt_tlb_invalidation_range(tile->media_gt,
+ &fence[fence_id],
+ adj_start,
+ adj_end,
+ vm->usm.asid);
+ if (WARN_ON_ONCE(err < 0))
+ goto wait;
+ ++fence_id;
+ }
+ }
+
+wait:
+ for (id = 0; id < fence_id; ++id)
+ xe_gt_tlb_invalidation_fence_wait(&fence[id]);
+
+range_notifier_event_end:
+ r = first;
+ drm_gpusvm_for_each_range(r, notifier, adj_start, adj_end)
+ xe_svm_range_notifier_event_end(vm, r, mmu_range);
}
static const struct drm_gpusvm_ops gpusvm_ops = {
+ .range_alloc = xe_svm_range_alloc,
+ .range_free = xe_svm_range_free,
.invalidate = xe_svm_invalidate,
};
@@ -36,6 +206,11 @@ int xe_svm_init(struct xe_vm *vm)
void xe_svm_close(struct xe_vm *vm)
{
+ xe_assert(vm->xe, xe_vm_is_closed(vm));
+
+ /* Flush running notifiers making xe_vm_close() visible */
+ drm_gpusvm_notifier_lock(&vm->svm.gpusvm);
+ drm_gpusvm_notifier_unlock(&vm->svm.gpusvm);
}
void xe_svm_fini(struct xe_vm *vm)
@@ -44,3 +219,31 @@ void xe_svm_fini(struct xe_vm *vm)
drm_gpusvm_fini(&vm->svm.gpusvm);
}
+
+int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
+ struct xe_tile *tile, u64 fault_addr,
+ bool atomic)
+{
+ struct drm_gpusvm_ctx ctx = { .read_only = xe_vma_read_only(vma), };
+ struct drm_gpusvm_range *r;
+ int err;
+
+ lockdep_assert_held_write(&vm->lock);
+
+retry:
+ /* TODO: Run garbage collector */
+
+ r = drm_gpusvm_range_find_or_insert(&vm->svm.gpusvm, fault_addr,
+ xe_vma_start(vma), xe_vma_end(vma),
+ &ctx);
+ if (IS_ERR(r))
+ return PTR_ERR(r);
+
+ err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r, false);
+ if (err == -EFAULT || err == -EPERM) /* Corner case where CPU mappings have changed */
+ goto retry;
+
+ /* TODO: Issue bind */
+
+ return err;
+}
diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index 376e86876a11..c91c5f538024 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -6,14 +6,27 @@
#ifndef _XE_SVM_H_
#define _XE_SVM_H_
+#include "drm_gpusvm.h"
#include "drm_pagemap.h"
#define XE_INTERCONNECT_VRAM DRM_INTERCONNECT_DRIVER
+struct xe_tile;
struct xe_vm;
+struct xe_vma;
+
+struct xe_svm_range {
+ struct drm_gpusvm_range base;
+ u8 tile_present;
+ u8 tile_invalidated;
+};
int xe_svm_init(struct xe_vm *vm);
void xe_svm_fini(struct xe_vm *vm);
void xe_svm_close(struct xe_vm *vm);
+int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
+ struct xe_tile *tile, u64 fault_addr,
+ bool atomic);
+
#endif
--
2.34.1
^ permalink raw reply related	[flat|nested] 129+ messages in thread
* Re: [PATCH v2 09/29] drm/xe: Add SVM range invalidation
2024-10-16 3:24 ` [PATCH v2 09/29] drm/xe: Add SVM range invalidation Matthew Brost
@ 2024-11-19 13:56 ` Thomas Hellström
2024-12-11 19:01 ` Matthew Brost
0 siblings, 1 reply; 129+ messages in thread
From: Thomas Hellström @ 2024-11-19 13:56 UTC (permalink / raw)
To: Matthew Brost, intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, simona.vetter, felix.kuehling,
dakr
On Tue, 2024-10-15 at 20:24 -0700, Matthew Brost wrote:
> Add SVM range invalidation vfunc.
>
> v2:
> - Don't run invalidation if VM is closed
> - Cycle notifier lock in xe_svm_close
> - Drop xe_gt_tlb_invalidation_fence_fini
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
> drivers/gpu/drm/xe/xe_gt_pagefault.c | 17 ++-
> drivers/gpu/drm/xe/xe_pt.c | 24 ++++
> drivers/gpu/drm/xe/xe_pt.h | 3 +
> drivers/gpu/drm/xe/xe_svm.c | 205
> ++++++++++++++++++++++++++-
> drivers/gpu/drm/xe/xe_svm.h | 13 ++
> 5 files changed, 256 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c
> b/drivers/gpu/drm/xe/xe_gt_pagefault.c
> index 79c426dc2505..92923947a12c 100644
> --- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
> +++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
> @@ -19,6 +19,7 @@
> #include "xe_guc.h"
> #include "xe_guc_ct.h"
> #include "xe_migrate.h"
> +#include "xe_svm.h"
> #include "xe_trace_bo.h"
> #include "xe_vm.h"
>
> @@ -125,18 +126,17 @@ static int xe_pf_begin(struct drm_exec *exec,
> struct xe_vma *vma,
> return 0;
> }
>
> -static int handle_vma_pagefault(struct xe_tile *tile, struct
> pagefault *pf,
> - struct xe_vma *vma)
> +static int handle_vma_pagefault(struct xe_tile *tile, struct xe_vma
> *vma,
> + bool atomic)
> {
> struct xe_vm *vm = xe_vma_vm(vma);
> struct drm_exec exec;
> struct dma_fence *fence;
> ktime_t end = 0;
> int err;
> - bool atomic;
>
> + lockdep_assert_held_write(&vm->lock);
> trace_xe_vma_pagefault(vma);
> - atomic = access_is_atomic(pf->access_type);
>
> /* Check if VMA is valid */
> if (vma_is_valid(tile, vma) && !atomic)
> @@ -207,6 +207,7 @@ static int handle_pagefault(struct xe_gt *gt,
> struct pagefault *pf)
> struct xe_vm *vm;
> struct xe_vma *vma = NULL;
> int err;
> + bool atomic;
>
> /* SW isn't expected to handle TRTT faults */
> if (pf->trva_fault)
> @@ -232,7 +233,13 @@ static int handle_pagefault(struct xe_gt *gt,
> struct pagefault *pf)
> goto unlock_vm;
> }
>
> - err = handle_vma_pagefault(tile, pf, vma);
> + atomic = access_is_atomic(pf->access_type);
> +
> + if (xe_vma_is_system_allocator(vma))
> + err = xe_svm_handle_pagefault(vm, vma, tile,
> + pf->page_addr,
> atomic);
> + else
> + err = handle_vma_pagefault(tile, vma, atomic);
>
> unlock_vm:
> if (!err)
> diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
> index 39357e829b6d..282476c4edbd 100644
> --- a/drivers/gpu/drm/xe/xe_pt.c
> +++ b/drivers/gpu/drm/xe/xe_pt.c
> @@ -20,6 +20,7 @@
> #include "xe_res_cursor.h"
> #include "xe_sched_job.h"
> #include "xe_sync.h"
> +#include "xe_svm.h"
> #include "xe_trace.h"
> #include "xe_ttm_stolen_mgr.h"
> #include "xe_vm.h"
> @@ -829,6 +830,29 @@ bool xe_pt_zap_ptes(struct xe_tile *tile, struct
> xe_vma *vma)
> return xe_walk.needs_invalidate;
> }
>
> +bool xe_pt_zap_ptes_range(struct xe_tile *tile, struct xe_vm *vm,
> + struct xe_svm_range *range)
Kerneldoc.
Here (and I saw Oak also commented on this some time ago), ideally we
should make xe_pt.c unaware of VMAs and SVM ranges, and in this case use
the same xe_pt function for both.
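A purely illustrative sketch of one possible shape for such a shared helper,
based on the quoted xe_pt_zap_ptes_range(); the name and signature are made
up, and both xe_pt_zap_ptes() and xe_pt_zap_ptes_range() would then reduce to
computing a start/end plus their existing tile-presence checks:

static bool xe_pt_zap_ptes_addr_range(struct xe_tile *tile, struct xe_vm *vm,
				      u64 start, u64 end)
{
	struct xe_pt_zap_ptes_walk xe_walk = {
		.base = {
			.ops = &xe_pt_zap_ptes_ops,
			.shifts = xe_normal_pt_shifts,
			.max_level = XE_PT_HIGHEST_LEVEL,
		},
		.tile = tile,
	};
	struct xe_pt *pt = vm->pt_root[tile->id];

	/* Walk the shared page-table tree and zap leaf PTEs in [start, end) */
	(void)xe_pt_walk_shared(&pt->base, pt->level, start, end,
				&xe_walk.base);

	return xe_walk.needs_invalidate;
}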
> +{
> + struct xe_pt_zap_ptes_walk xe_walk = {
> + .base = {
> + .ops = &xe_pt_zap_ptes_ops,
> + .shifts = xe_normal_pt_shifts,
> + .max_level = XE_PT_HIGHEST_LEVEL,
> + },
> + .tile = tile,
> + };
> + struct xe_pt *pt = vm->pt_root[tile->id];
> + u8 pt_mask = (range->tile_present & ~range-
> >tile_invalidated);
> +
> + if (!(pt_mask & BIT(tile->id)))
> + return false;
> +
> + (void)xe_pt_walk_shared(&pt->base, pt->level, range-
> >base.va.start,
> + range->base.va.end, &xe_walk.base);
> +
> + return xe_walk.needs_invalidate;
> +}
> +
> static void
> xe_vm_populate_pgtable(struct xe_migrate_pt_update *pt_update,
> struct xe_tile *tile,
> struct iosys_map *map, void *data,
> diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h
> index 9ab386431cad..5f333eeedf5c 100644
> --- a/drivers/gpu/drm/xe/xe_pt.h
> +++ b/drivers/gpu/drm/xe/xe_pt.h
> @@ -13,6 +13,7 @@ struct dma_fence;
> struct xe_bo;
> struct xe_device;
> struct xe_exec_queue;
> +struct xe_svm_range;
> struct xe_sync_entry;
> struct xe_tile;
> struct xe_vm;
> @@ -42,5 +43,7 @@ void xe_pt_update_ops_fini(struct xe_tile *tile,
> struct xe_vma_ops *vops);
> void xe_pt_update_ops_abort(struct xe_tile *tile, struct xe_vma_ops
> *vops);
>
> bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
> +bool xe_pt_zap_ptes_range(struct xe_tile *tile, struct xe_vm *vm,
> + struct xe_svm_range *range);
>
> #endif
> diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
> index 57b740367843..b2bc259978c4 100644
> --- a/drivers/gpu/drm/xe/xe_svm.c
> +++ b/drivers/gpu/drm/xe/xe_svm.c
> @@ -5,18 +5,188 @@
>
> #include "drm_gpusvm.h"
>
> +#include "xe_gt_tlb_invalidation.h"
> +#include "xe_pt.h"
> #include "xe_svm.h"
> #include "xe_vm.h"
> #include "xe_vm_types.h"
>
> +static struct xe_vm *gpusvm_to_vm(struct drm_gpusvm *gpusvm)
> +{
> + return container_of(gpusvm, struct xe_vm, svm.gpusvm);
> +}
> +
> +static struct xe_vm *range_to_vm(struct drm_gpusvm_range *r)
> +{
> + return gpusvm_to_vm(r->gpusvm);
> +}
> +
> +static struct drm_gpusvm_range *
> +xe_svm_range_alloc(struct drm_gpusvm *gpusvm)
> +{
> + struct xe_svm_range *range;
> +
> + range = kzalloc(sizeof(*range), GFP_KERNEL);
> + if (!range)
> + return ERR_PTR(-ENOMEM);
> +
> + xe_vm_get(gpusvm_to_vm(gpusvm));
> +
> + return &range->base;
> +}
> +
> +static void xe_svm_range_free(struct drm_gpusvm_range *range)
> +{
> + xe_vm_put(range_to_vm(range));
> + kfree(range);
> +}
> +
> +static struct xe_svm_range *to_xe_range(struct drm_gpusvm_range *r)
> +{
> + return container_of(r, struct xe_svm_range, base);
> +}
> +
> +static u8
> +xe_svm_range_notifier_event_begin(struct xe_vm *vm, struct drm_gpusvm_range *r,
> + const struct mmu_notifier_range *mmu_range,
> + u64 *adj_start, u64 *adj_end)
> +{
> + struct xe_svm_range *range = to_xe_range(r);
> + struct xe_device *xe = vm->xe;
> + struct xe_tile *tile;
> + u8 tile_mask = 0;
> + u8 id;
> +
lockdep assert?
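For example, assuming the gpusvm notifier lock is what should be asserted
here (the member name is a guess):

        lockdep_assert_held(&vm->svm.gpusvm.notifier_lock);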
> + /* Skip if already unmapped or if no binding exist */
> + if (range->base.flags.unmapped || !range->tile_present)
> + return 0;
> +
> + /* Adjust invalidation to range boundaries */
> + if (range->base.va.start < mmu_range->start)
> + *adj_start = range->base.va.start;
> + if (range->base.va.end > mmu_range->end)
> + *adj_end = range->base.va.end;
> +
> + /*
> + * XXX: Ideally would zap PTEs in one shot in xe_svm_invalidate but the
> + * invalidation code can't correctly cope with sparse ranges or
> + * invalidations spanning multiple ranges.
> + */
> + for_each_tile(tile, xe, id)
> + if (xe_pt_zap_ptes_range(tile, vm, range)) {
> + tile_mask |= BIT(id);
> + range->tile_invalidated |= BIT(id);
> + }
> +
> + return tile_mask;
> +}
> +
> +static void
> +xe_svm_range_notifier_event_end(struct xe_vm *vm, struct drm_gpusvm_range *r,
> + const struct mmu_notifier_range *mmu_range)
> +{
> + struct drm_gpusvm_ctx ctx = { .in_notifier = true, };
> +
> + drm_gpusvm_range_unmap_pages(&vm->svm.gpusvm, r, &ctx);
> + /* TODO: Add range to garbage collector */
> +}
> +
> static void xe_svm_invalidate(struct drm_gpusvm *gpusvm,
> struct drm_gpusvm_notifier *notifier,
> const struct mmu_notifier_range *mmu_range)
> {
> - /* TODO: Implement */
> + struct xe_vm *vm = gpusvm_to_vm(gpusvm);
> + struct xe_device *xe = vm->xe;
> + struct xe_tile *tile;
> + struct drm_gpusvm_range *r, *first;
> + struct xe_gt_tlb_invalidation_fence
> + fence[XE_MAX_TILES_PER_DEVICE * XE_MAX_GT_PER_TILE];
> + u64 adj_start = mmu_range->start, adj_end = mmu_range->end;
> + u8 tile_mask = 0;
> + u8 id;
> + u32 fence_id = 0;
> + long err;
> +
> + if (xe_vm_is_closed(vm))
> + return;
How do we ensure we don't race here? Are we sure that all dma mappings
and all PTEs pointing to the range are gone at this point? Because "they
will soon be gone anyway" isn't enough.
> +
> + /* Adjust invalidation to notifier boundaries */
> + if (adj_start < notifier->interval.start)
> + adj_start = notifier->interval.start;
> + if (adj_end > notifier->interval.end)
> + adj_end = notifier->interval.end;
> +
> + first = drm_gpusvm_range_find(notifier, adj_start, adj_end);
> + if (!first)
> + return;
> +
> + /*
> + * XXX: Less than ideal to always wait on VM's resv slots if an
> + * invalidation is not required. Could walk range list twice to figure
> + * out if an invalidations is need, but also not ideal. Maybe a counter
> + * within the notifier, seems like that could work.
> + */
> + err = dma_resv_wait_timeout(xe_vm_resv(vm),
> + DMA_RESV_USAGE_BOOKKEEP,
> + false, MAX_SCHEDULE_TIMEOUT);
> + XE_WARN_ON(err <= 0);
> +
> + r = first;
> + drm_gpusvm_for_each_range(r, notifier, adj_start, adj_end)
> + tile_mask |= xe_svm_range_notifier_event_begin(vm, r, mmu_range,
> + &adj_start, &adj_end);
> + if (!tile_mask)
> + goto range_notifier_event_end;
> +
> + xe_device_wmb(xe);
> +
> + for_each_tile(tile, xe, id) {
> + if (tile_mask & BIT(id)) {
> + int err;
> +
> + xe_gt_tlb_invalidation_fence_init(tile->primary_gt,
> + &fence[fence_id], true);
> +
> + err = xe_gt_tlb_invalidation_range(tile->primary_gt,
> + &fence[fence_id],
> + adj_start, adj_end,
> + vm->usm.asid);
> + if (WARN_ON_ONCE(err < 0))
> + goto wait;
> + ++fence_id;
> +
> + if (!tile->media_gt)
> + continue;
> +
> + xe_gt_tlb_invalidation_fence_init(tile->media_gt,
> + &fence[fence_id], true);
> +
> + err = xe_gt_tlb_invalidation_range(tile->media_gt,
> + &fence[fence_id],
> + adj_start, adj_end,
> + vm->usm.asid);
> + if (WARN_ON_ONCE(err < 0))
> + goto wait;
> + ++fence_id;
> + }
> + }
> +
> +wait:
> + for (id = 0; id < fence_id; ++id)
> + xe_gt_tlb_invalidation_fence_wait(&fence[id]);
> +
> +range_notifier_event_end:
> + r = first;
> + drm_gpusvm_for_each_range(r, notifier, adj_start, adj_end)
> + xe_svm_range_notifier_event_end(vm, r, mmu_range);
> }
>
> static const struct drm_gpusvm_ops gpusvm_ops = {
> + .range_alloc = xe_svm_range_alloc,
> + .range_free = xe_svm_range_free,
> .invalidate = xe_svm_invalidate,
> };
>
> @@ -36,6 +206,11 @@ int xe_svm_init(struct xe_vm *vm)
>
> void xe_svm_close(struct xe_vm *vm)
> {
> + xe_assert(vm->xe, xe_vm_is_closed(vm));
> +
> + /* Flush running notifiers making xe_vm_close() visable */
> + drm_gpusvm_notifier_lock(&vm->svm.gpusvm);
> + drm_gpusvm_notifier_unlock(&vm->svm.gpusvm);
Calling mmu_notifier_read_begin() ensures that nothing is invalidating
on the range. Probably a better choice.
> }
>
> void xe_svm_fini(struct xe_vm *vm)
> @@ -44,3 +219,31 @@ void xe_svm_fini(struct xe_vm *vm)
>
> drm_gpusvm_fini(&vm->svm.gpusvm);
> }
> +
> +int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
> + struct xe_tile *tile, u64 fault_addr,
> + bool atomic)
> +{
> + struct drm_gpusvm_ctx ctx = { .read_only = xe_vma_read_only(vma), };
> + struct drm_gpusvm_range *r;
> + int err;
> +
> + lockdep_assert_held_write(&vm->lock);
> +
> +retry:
> + /* TODO: Run garbage collector */
> +
> + r = drm_gpusvm_range_find_or_insert(&vm->svm.gpusvm, fault_addr,
> + xe_vma_start(vma), xe_vma_end(vma),
> + &ctx);
> + if (IS_ERR(r))
> + return PTR_ERR(r);
> +
> + err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r, false);
> + if (err == -EFAULT || err == -EPERM) /* Corner where CPU mappings have change */
s/change/changed/
> + goto retry;
> +
> + /* TODO: Issue bind */
> +
> + return err;
> +}
> diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> index 376e86876a11..c91c5f538024 100644
> --- a/drivers/gpu/drm/xe/xe_svm.h
> +++ b/drivers/gpu/drm/xe/xe_svm.h
> @@ -6,14 +6,27 @@
> #ifndef _XE_SVM_H_
> #define _XE_SVM_H_
>
> +#include "drm_gpusvm.h"
> #include "drm_pagemap.h"
>
> #define XE_INTERCONNECT_VRAM DRM_INTERCONNECT_DRIVER
Not used yet
>
> +struct xe_tile;
> struct xe_vm;
> +struct xe_vma;
> +
> +struct xe_svm_range {
> + struct drm_gpusvm_range base;
> + u8 tile_present;
> + u8 tile_invalidated;
> +};
Kerneldoc
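For example, something along these lines (field descriptions inferred from
how the masks are used elsewhere in this patch):

/**
 * struct xe_svm_range - SVM range
 * @base: base drm_gpusvm_range
 * @tile_present: Tile mask of binding being present for this range
 * @tile_invalidated: Tile mask of binding being invalidated for this range
 */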
>
> int xe_svm_init(struct xe_vm *vm);
> void xe_svm_fini(struct xe_vm *vm);
> void xe_svm_close(struct xe_vm *vm);
>
> +int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
> + struct xe_tile *tile, u64 fault_addr,
> + bool atomic);
> +
> #endif
Thanks,
Thomas
^ permalink raw reply [flat|nested] 129+ messages in thread

* Re: [PATCH v2 09/29] drm/xe: Add SVM range invalidation
2024-11-19 13:56 ` Thomas Hellström
@ 2024-12-11 19:01 ` Matthew Brost
2024-12-14 23:11 ` Matthew Brost
2024-12-16 10:01 ` Thomas Hellström
0 siblings, 2 replies; 129+ messages in thread
From: Matthew Brost @ 2024-12-11 19:01 UTC (permalink / raw)
To: Thomas Hellström
Cc: intel-xe, dri-devel, apopple, airlied, christian.koenig,
simona.vetter, felix.kuehling, dakr
On Tue, Nov 19, 2024 at 02:56:12PM +0100, Thomas Hellström wrote:
> On Tue, 2024-10-15 at 20:24 -0700, Matthew Brost wrote:
> > Add SVM range invalidation vfunc.
> >
> > v2:
> > - Don't run invalidation if VM is closed
> > - Cycle notifier lock in xe_svm_close
> > - Drop xe_gt_tlb_invalidation_fence_fini
> >
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> > drivers/gpu/drm/xe/xe_gt_pagefault.c | 17 ++-
> > drivers/gpu/drm/xe/xe_pt.c | 24 ++++
> > drivers/gpu/drm/xe/xe_pt.h | 3 +
> > drivers/gpu/drm/xe/xe_svm.c | 205
> > ++++++++++++++++++++++++++-
> > drivers/gpu/drm/xe/xe_svm.h | 13 ++
> > 5 files changed, 256 insertions(+), 6 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c
> > b/drivers/gpu/drm/xe/xe_gt_pagefault.c
> > index 79c426dc2505..92923947a12c 100644
> > --- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
> > +++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
> > @@ -19,6 +19,7 @@
> > #include "xe_guc.h"
> > #include "xe_guc_ct.h"
> > #include "xe_migrate.h"
> > +#include "xe_svm.h"
> > #include "xe_trace_bo.h"
> > #include "xe_vm.h"
> >
> > @@ -125,18 +126,17 @@ static int xe_pf_begin(struct drm_exec *exec,
> > struct xe_vma *vma,
> > return 0;
> > }
> >
> > -static int handle_vma_pagefault(struct xe_tile *tile, struct
> > pagefault *pf,
> > - struct xe_vma *vma)
> > +static int handle_vma_pagefault(struct xe_tile *tile, struct xe_vma
> > *vma,
> > + bool atomic)
> > {
> > struct xe_vm *vm = xe_vma_vm(vma);
> > struct drm_exec exec;
> > struct dma_fence *fence;
> > ktime_t end = 0;
> > int err;
> > - bool atomic;
> >
> > + lockdep_assert_held_write(&vm->lock);
> > trace_xe_vma_pagefault(vma);
> > - atomic = access_is_atomic(pf->access_type);
> >
> > /* Check if VMA is valid */
> > if (vma_is_valid(tile, vma) && !atomic)
> > @@ -207,6 +207,7 @@ static int handle_pagefault(struct xe_gt *gt,
> > struct pagefault *pf)
> > struct xe_vm *vm;
> > struct xe_vma *vma = NULL;
> > int err;
> > + bool atomic;
> >
> > /* SW isn't expected to handle TRTT faults */
> > if (pf->trva_fault)
> > @@ -232,7 +233,13 @@ static int handle_pagefault(struct xe_gt *gt,
> > struct pagefault *pf)
> > goto unlock_vm;
> > }
> >
> > - err = handle_vma_pagefault(tile, pf, vma);
> > + atomic = access_is_atomic(pf->access_type);
> > +
> > + if (xe_vma_is_system_allocator(vma))
> > + err = xe_svm_handle_pagefault(vm, vma, tile,
> > + pf->page_addr,
> > atomic);
> > + else
> > + err = handle_vma_pagefault(tile, vma, atomic);
> >
> > unlock_vm:
> > if (!err)
> > diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
> > index 39357e829b6d..282476c4edbd 100644
> > --- a/drivers/gpu/drm/xe/xe_pt.c
> > +++ b/drivers/gpu/drm/xe/xe_pt.c
> > @@ -20,6 +20,7 @@
> > #include "xe_res_cursor.h"
> > #include "xe_sched_job.h"
> > #include "xe_sync.h"
> > +#include "xe_svm.h"
> > #include "xe_trace.h"
> > #include "xe_ttm_stolen_mgr.h"
> > #include "xe_vm.h"
> > @@ -829,6 +830,29 @@ bool xe_pt_zap_ptes(struct xe_tile *tile, struct
> > xe_vma *vma)
> > return xe_walk.needs_invalidate;
> > }
> >
> > +bool xe_pt_zap_ptes_range(struct xe_tile *tile, struct xe_vm *vm,
> > + struct xe_svm_range *range)
>
> Kerneldoc.
>
Will add.
> Here, (and I saw Oak also commented around this some time ago) ideally
> we should make xe_pt.c unaware of vmas and svm ranges, and in this
> case, use the same xe_pt function for both.
>
See some of the other comments; agreed, we should do this in a follow-up.
>
>
> > +{
> > + struct xe_pt_zap_ptes_walk xe_walk = {
> > + .base = {
> > + .ops = &xe_pt_zap_ptes_ops,
> > + .shifts = xe_normal_pt_shifts,
> > + .max_level = XE_PT_HIGHEST_LEVEL,
> > + },
> > + .tile = tile,
> > + };
> > + struct xe_pt *pt = vm->pt_root[tile->id];
> > + u8 pt_mask = (range->tile_present & ~range-
> > >tile_invalidated);
> > +
> > + if (!(pt_mask & BIT(tile->id)))
> > + return false;
> > +
> > + (void)xe_pt_walk_shared(&pt->base, pt->level, range-
> > >base.va.start,
> > + range->base.va.end, &xe_walk.base);
> > +
> > + return xe_walk.needs_invalidate;
> > +}
> > +
> > static void
> > xe_vm_populate_pgtable(struct xe_migrate_pt_update *pt_update,
> > struct xe_tile *tile,
> > struct iosys_map *map, void *data,
> > diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h
> > index 9ab386431cad..5f333eeedf5c 100644
> > --- a/drivers/gpu/drm/xe/xe_pt.h
> > +++ b/drivers/gpu/drm/xe/xe_pt.h
> > @@ -13,6 +13,7 @@ struct dma_fence;
> > struct xe_bo;
> > struct xe_device;
> > struct xe_exec_queue;
> > +struct xe_svm_range;
> > struct xe_sync_entry;
> > struct xe_tile;
> > struct xe_vm;
> > @@ -42,5 +43,7 @@ void xe_pt_update_ops_fini(struct xe_tile *tile,
> > struct xe_vma_ops *vops);
> > void xe_pt_update_ops_abort(struct xe_tile *tile, struct xe_vma_ops
> > *vops);
> >
> > bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
> > +bool xe_pt_zap_ptes_range(struct xe_tile *tile, struct xe_vm *vm,
> > + struct xe_svm_range *range);
> >
> > #endif
> > diff --git a/drivers/gpu/drm/xe/xe_svm.c
> > b/drivers/gpu/drm/xe/xe_svm.c
> > index 57b740367843..b2bc259978c4 100644
> > --- a/drivers/gpu/drm/xe/xe_svm.c
> > +++ b/drivers/gpu/drm/xe/xe_svm.c
> > @@ -5,18 +5,188 @@
> >
> > #include "drm_gpusvm.h"
> >
> > +#include "xe_gt_tlb_invalidation.h"
> > +#include "xe_pt.h"
> > #include "xe_svm.h"
> > #include "xe_vm.h"
> > #include "xe_vm_types.h"
> >
> > +static struct xe_vm *gpusvm_to_vm(struct drm_gpusvm *gpusvm)
> > +{
> > + return container_of(gpusvm, struct xe_vm, svm.gpusvm);
> > +}
> > +
> > +static struct xe_vm *range_to_vm(struct drm_gpusvm_range *r)
> > +{
> > + return gpusvm_to_vm(r->gpusvm);
> > +}
> > +
> > +static struct drm_gpusvm_range *
> > +xe_svm_range_alloc(struct drm_gpusvm *gpusvm)
> > +{
> > + struct xe_svm_range *range;
> > +
> > + range = kzalloc(sizeof(*range), GFP_KERNEL);
> > + if (!range)
> > + return ERR_PTR(-ENOMEM);
> > +
> > + xe_vm_get(gpusvm_to_vm(gpusvm));
> > +
> > + return &range->base;
> > +}
> > +
> > +static void xe_svm_range_free(struct drm_gpusvm_range *range)
> > +{
> > + xe_vm_put(range_to_vm(range));
> > + kfree(range);
> > +}
> > +
> > +static struct xe_svm_range *to_xe_range(struct drm_gpusvm_range *r)
> > +{
> > + return container_of(r, struct xe_svm_range, base);
> > +}
> > +
> > +static u8
> > +xe_svm_range_notifier_event_begin(struct xe_vm *vm, struct
> > drm_gpusvm_range *r,
> > + const struct mmu_notifier_range
> > *mmu_range,
> > + u64 *adj_start, u64 *adj_end)
> > +{
> > + struct xe_svm_range *range = to_xe_range(r);
> > + struct xe_device *xe = vm->xe;
> > + struct xe_tile *tile;
> > + u8 tile_mask = 0;
> > + u8 id;
> > +
>
> lockdep assert?
>
Sure.
> > + /* Skip if already unmapped or if no binding exist */
> > + if (range->base.flags.unmapped || !range->tile_present)
> > + return 0;
> > +
> > + /* Adjust invalidation to range boundaries */
> > + if (range->base.va.start < mmu_range->start)
> > + *adj_start = range->base.va.start;
> > + if (range->base.va.end > mmu_range->end)
> > + *adj_end = range->base.va.end;
> > +
> > + /*
> > + * XXX: Ideally would zap PTEs in one shot in
> > xe_svm_invalidate but the
> > + * invalidation code can't correctly cope with sparse ranges
> > or
> > + * invalidations spanning multiple ranges.
> > + */
> > + for_each_tile(tile, xe, id)
> > + if (xe_pt_zap_ptes_range(tile, vm, range)) {
> > + tile_mask |= BIT(id);
> > + range->tile_invalidated |= BIT(id);
> > + }
> > +
> > + return tile_mask;
> > +}
> > +
> > +static void
> > +xe_svm_range_notifier_event_end(struct xe_vm *vm, struct
> > drm_gpusvm_range *r,
> > + const struct mmu_notifier_range
> > *mmu_range)
> > +{
> > + struct drm_gpusvm_ctx ctx = { .in_notifier = true, };
> > +
> > + drm_gpusvm_range_unmap_pages(&vm->svm.gpusvm, r, &ctx);
> > + /* TODO: Add range to garbage collector */
> > +}
> > +
> > static void xe_svm_invalidate(struct drm_gpusvm *gpusvm,
> > struct drm_gpusvm_notifier *notifier,
> > const struct mmu_notifier_range
> > *mmu_range)
> > {
> > - /* TODO: Implement */
> > + struct xe_vm *vm = gpusvm_to_vm(gpusvm);
> > + struct xe_device *xe = vm->xe;
> > + struct xe_tile *tile;
> > + struct drm_gpusvm_range *r, *first;
> > + struct xe_gt_tlb_invalidation_fence
> > + fence[XE_MAX_TILES_PER_DEVICE * XE_MAX_GT_PER_TILE];
> > + u64 adj_start = mmu_range->start, adj_end = mmu_range->end;
> > + u8 tile_mask = 0;
> > + u8 id;
> > + u32 fence_id = 0;
> > + long err;
> > +
> > + if (xe_vm_is_closed(vm))
> > + return;
>
> How do we ensure we don't race here? Are we sure that all dma mappings
> and all PTEs pointing to the range is gone at this point? Becase "They
> will soon be gone anyway" isn't enough.
>
I think this is to prevent touching PTs which are being destroyed in
parallel, which resulted in a kernel explosion, so I think we need this.

How to prevent a race? How about invalidating the PT root on VM close?
I had a patch at one point which did this. We'd still have dma mappings
too, but I think we can safely dma-unmap the pages if the VM is closed
as well, should the need arise. Thoughts?
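As a rough illustration of the "invalidate everything on close" idea (not
an actual patch; doing this from the VM-close path, using vm->size as the
upper bound, and skipping the media GTs are all assumptions, with the
helpers reused from the invalidation loop above):

static void xe_vm_close_invalidate_sketch(struct xe_vm *vm)
{
        struct xe_tile *tile;
        u8 id;

        for_each_tile(tile, vm->xe, id) {
                struct xe_gt_tlb_invalidation_fence fence;
                int err;

                /* Zap the whole VM range so stale PTEs can't be used */
                xe_gt_tlb_invalidation_fence_init(tile->primary_gt, &fence, true);
                err = xe_gt_tlb_invalidation_range(tile->primary_gt, &fence,
                                                   0, vm->size, vm->usm.asid);
                if (!WARN_ON_ONCE(err < 0))
                        xe_gt_tlb_invalidation_fence_wait(&fence);
        }
}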
> > +
> > + /* Adjust invalidation to notifier boundaries */
> > + if (adj_start < notifier->interval.start)
> > + adj_start = notifier->interval.start;
> > + if (adj_end > notifier->interval.end)
> > + adj_end = notifier->interval.end;
> > +
> > + first = drm_gpusvm_range_find(notifier, adj_start, adj_end);
> > + if (!first)
> > + return;
> > +
> > + /*
> > + * XXX: Less than ideal to always wait on VM's resv slots if
> > an
> > + * invalidation is not required. Could walk range list twice
> > to figure
> > + * out if an invalidations is need, but also not ideal.
> > Maybe a counter
> > + * within the notifier, seems like that could work.
> > + */
> > + err = dma_resv_wait_timeout(xe_vm_resv(vm),
> > + DMA_RESV_USAGE_BOOKKEEP,
> > + false, MAX_SCHEDULE_TIMEOUT);
> > + XE_WARN_ON(err <= 0);
> > +
> > + r = first;
> > + drm_gpusvm_for_each_range(r, notifier, adj_start, adj_end)
> > + tile_mask |= xe_svm_range_notifier_event_begin(vm,
> > r, mmu_range,
> > +
> > &adj_start,
> > +
> > &adj_end);
> > + if (!tile_mask)
> > + goto range_notifier_event_end;
> > +
> > + xe_device_wmb(xe);
> > +
> > + for_each_tile(tile, xe, id) {
> > + if (tile_mask & BIT(id)) {
> > + int err;
> > +
> > + xe_gt_tlb_invalidation_fence_init(tile-
> > >primary_gt,
> > +
> > &fence[fence_id], true);
> > +
> > + err = xe_gt_tlb_invalidation_range(tile-
> > >primary_gt,
> > +
> > &fence[fence_id],
> > +
> > adj_start,
> > + adj_end,
> > + vm-
> > >usm.asid);
> > + if (WARN_ON_ONCE(err < 0))
> > + goto wait;
> > + ++fence_id;
> > +
> > + if (!tile->media_gt)
> > + continue;
> > +
> > + xe_gt_tlb_invalidation_fence_init(tile-
> > >media_gt,
> > +
> > &fence[fence_id], true);
> > +
> > + err = xe_gt_tlb_invalidation_range(tile-
> > >media_gt,
> > +
> > &fence[fence_id],
> > +
> > adj_start,
> > + adj_end,
> > + vm-
> > >usm.asid);
> > + if (WARN_ON_ONCE(err < 0))
> > + goto wait;
> > + ++fence_id;
> > + }
> > + }
> > +
> > +wait:
> > + for (id = 0; id < fence_id; ++id)
> > + xe_gt_tlb_invalidation_fence_wait(&fence[id]);
> > +
> > +range_notifier_event_end:
> > + r = first;
> > + drm_gpusvm_for_each_range(r, notifier, adj_start, adj_end)
> > + xe_svm_range_notifier_event_end(vm, r, mmu_range);
> > }
> >
> > static const struct drm_gpusvm_ops gpusvm_ops = {
> > + .range_alloc = xe_svm_range_alloc,
> > + .range_free = xe_svm_range_free,
> > .invalidate = xe_svm_invalidate,
> > };
> >
> > @@ -36,6 +206,11 @@ int xe_svm_init(struct xe_vm *vm)
> >
> > void xe_svm_close(struct xe_vm *vm)
> > {
> > + xe_assert(vm->xe, xe_vm_is_closed(vm));
> > +
> > + /* Flush running notifiers making xe_vm_close() visable */
> > + drm_gpusvm_notifier_lock(&vm->svm.gpusvm);
> > + drm_gpusvm_notifier_unlock(&vm->svm.gpusvm);
>
> Calling mmu_notifier_read_begin() ensures that nothing is invalidating
> on the range. Probably a better choice.
>
We'd have to call that on every notifier rather than just cycle the
lock, so with that I'd prefer to leave it as is.
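Roughly, the per-notifier variant would look like the sketch below; the
notifier iterator and the name of the embedded struct mmu_interval_notifier
are hypothetical here, which is part of why just cycling the lock is
simpler:

static void xe_svm_close_flush_sketch(struct xe_vm *vm)
{
        struct drm_gpusvm_notifier *notifier;

        /* Hypothetical iterator over every notifier in the gpusvm */
        drm_gpusvm_for_each_notifier(notifier, &vm->svm.gpusvm, 0, ULONG_MAX)
                mmu_interval_read_begin(&notifier->notifier);
}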
> > }
> >
> > void xe_svm_fini(struct xe_vm *vm)
> > @@ -44,3 +219,31 @@ void xe_svm_fini(struct xe_vm *vm)
> >
> > drm_gpusvm_fini(&vm->svm.gpusvm);
> > }
> > +
> > +int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
> > + struct xe_tile *tile, u64 fault_addr,
> > + bool atomic)
> > +{
> > + struct drm_gpusvm_ctx ctx = { .read_only =
> > xe_vma_read_only(vma), };
> > + struct drm_gpusvm_range *r;
> > + int err;
> > +
> > + lockdep_assert_held_write(&vm->lock);
> > +
> > +retry:
> > + /* TODO: Run garbage collector */
> > +
> > + r = drm_gpusvm_range_find_or_insert(&vm->svm.gpusvm,
> > fault_addr,
> > + xe_vma_start(vma),
> > xe_vma_end(vma),
> > + &ctx);
> > + if (IS_ERR(r))
> > + return PTR_ERR(r);
> > +
> > + err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r, false);
> > + if (err == -EFAULT || err == -EPERM) /* Corner where CPU
> > mappings have change */
>
> s/change/changed/
>
Yep.
> > + goto retry;
> > +
> > + /* TODO: Issue bind */
> > +
> > + return err;
> > +}
> > diff --git a/drivers/gpu/drm/xe/xe_svm.h
> > b/drivers/gpu/drm/xe/xe_svm.h
> > index 376e86876a11..c91c5f538024 100644
> > --- a/drivers/gpu/drm/xe/xe_svm.h
> > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > @@ -6,14 +6,27 @@
> > #ifndef _XE_SVM_H_
> > #define _XE_SVM_H_
> >
> > +#include "drm_gpusvm.h"
> > #include "drm_pagemap.h"
> >
> > #define XE_INTERCONNECT_VRAM DRM_INTERCONNECT_DRIVER
>
> Not used yet
>
Will remove.
> >
> > +struct xe_tile;
> > struct xe_vm;
> > +struct xe_vma;
> > +
> > +struct xe_svm_range {
> > + struct drm_gpusvm_range base;
> > + u8 tile_present;
> > + u8 tile_invalidated;
> > +};
>
> Kerneldoc
>
Will add.
>
> >
> > int xe_svm_init(struct xe_vm *vm);
> > void xe_svm_fini(struct xe_vm *vm);
> > void xe_svm_close(struct xe_vm *vm);
> >
> > +int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
> > + struct xe_tile *tile, u64 fault_addr,
> > + bool atomic);
> > +
> > #endif
>
> Thanks,
Thanks,
Matt
> Thomas
>
^ permalink raw reply [flat|nested] 129+ messages in thread

* Re: [PATCH v2 09/29] drm/xe: Add SVM range invalidation
2024-12-11 19:01 ` Matthew Brost
@ 2024-12-14 23:11 ` Matthew Brost
2024-12-16 10:01 ` Thomas Hellström
1 sibling, 0 replies; 129+ messages in thread
From: Matthew Brost @ 2024-12-14 23:11 UTC (permalink / raw)
To: Thomas Hellström
Cc: intel-xe, dri-devel, apopple, airlied, christian.koenig,
simona.vetter, felix.kuehling, dakr
On Wed, Dec 11, 2024 at 11:01:14AM -0800, Matthew Brost wrote:
> On Tue, Nov 19, 2024 at 02:56:12PM +0100, Thomas Hellström wrote:
> > On Tue, 2024-10-15 at 20:24 -0700, Matthew Brost wrote:
> > > Add SVM range invalidation vfunc.
> > >
> > > v2:
> > > - Don't run invalidation if VM is closed
> > > - Cycle notifier lock in xe_svm_close
> > > - Drop xe_gt_tlb_invalidation_fence_fini
> > >
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > ---
> > > drivers/gpu/drm/xe/xe_gt_pagefault.c | 17 ++-
> > > drivers/gpu/drm/xe/xe_pt.c | 24 ++++
> > > drivers/gpu/drm/xe/xe_pt.h | 3 +
> > > drivers/gpu/drm/xe/xe_svm.c | 205
> > > ++++++++++++++++++++++++++-
> > > drivers/gpu/drm/xe/xe_svm.h | 13 ++
> > > 5 files changed, 256 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c
> > > b/drivers/gpu/drm/xe/xe_gt_pagefault.c
> > > index 79c426dc2505..92923947a12c 100644
> > > --- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
> > > +++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
> > > @@ -19,6 +19,7 @@
> > > #include "xe_guc.h"
> > > #include "xe_guc_ct.h"
> > > #include "xe_migrate.h"
> > > +#include "xe_svm.h"
> > > #include "xe_trace_bo.h"
> > > #include "xe_vm.h"
> > >
> > > @@ -125,18 +126,17 @@ static int xe_pf_begin(struct drm_exec *exec,
> > > struct xe_vma *vma,
> > > return 0;
> > > }
> > >
> > > -static int handle_vma_pagefault(struct xe_tile *tile, struct
> > > pagefault *pf,
> > > - struct xe_vma *vma)
> > > +static int handle_vma_pagefault(struct xe_tile *tile, struct xe_vma
> > > *vma,
> > > + bool atomic)
> > > {
> > > struct xe_vm *vm = xe_vma_vm(vma);
> > > struct drm_exec exec;
> > > struct dma_fence *fence;
> > > ktime_t end = 0;
> > > int err;
> > > - bool atomic;
> > >
> > > + lockdep_assert_held_write(&vm->lock);
> > > trace_xe_vma_pagefault(vma);
> > > - atomic = access_is_atomic(pf->access_type);
> > >
> > > /* Check if VMA is valid */
> > > if (vma_is_valid(tile, vma) && !atomic)
> > > @@ -207,6 +207,7 @@ static int handle_pagefault(struct xe_gt *gt,
> > > struct pagefault *pf)
> > > struct xe_vm *vm;
> > > struct xe_vma *vma = NULL;
> > > int err;
> > > + bool atomic;
> > >
> > > /* SW isn't expected to handle TRTT faults */
> > > if (pf->trva_fault)
> > > @@ -232,7 +233,13 @@ static int handle_pagefault(struct xe_gt *gt,
> > > struct pagefault *pf)
> > > goto unlock_vm;
> > > }
> > >
> > > - err = handle_vma_pagefault(tile, pf, vma);
> > > + atomic = access_is_atomic(pf->access_type);
> > > +
> > > + if (xe_vma_is_system_allocator(vma))
> > > + err = xe_svm_handle_pagefault(vm, vma, tile,
> > > + pf->page_addr,
> > > atomic);
> > > + else
> > > + err = handle_vma_pagefault(tile, vma, atomic);
> > >
> > > unlock_vm:
> > > if (!err)
> > > diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
> > > index 39357e829b6d..282476c4edbd 100644
> > > --- a/drivers/gpu/drm/xe/xe_pt.c
> > > +++ b/drivers/gpu/drm/xe/xe_pt.c
> > > @@ -20,6 +20,7 @@
> > > #include "xe_res_cursor.h"
> > > #include "xe_sched_job.h"
> > > #include "xe_sync.h"
> > > +#include "xe_svm.h"
> > > #include "xe_trace.h"
> > > #include "xe_ttm_stolen_mgr.h"
> > > #include "xe_vm.h"
> > > @@ -829,6 +830,29 @@ bool xe_pt_zap_ptes(struct xe_tile *tile, struct
> > > xe_vma *vma)
> > > return xe_walk.needs_invalidate;
> > > }
> > >
> > > +bool xe_pt_zap_ptes_range(struct xe_tile *tile, struct xe_vm *vm,
> > > + struct xe_svm_range *range)
> >
> > Kerneldoc.
> >
>
> Will add.
>
> > Here, (and I saw Oak also commented around this some time ago) ideally
> > we should make xe_pt.c unaware of vmas and svm ranges, and in this
> > case, use the same xe_pt function for both.
> >
>
> See some of other comments, agree we should do in a follow up.
>
> >
> >
> > > +{
> > > + struct xe_pt_zap_ptes_walk xe_walk = {
> > > + .base = {
> > > + .ops = &xe_pt_zap_ptes_ops,
> > > + .shifts = xe_normal_pt_shifts,
> > > + .max_level = XE_PT_HIGHEST_LEVEL,
> > > + },
> > > + .tile = tile,
> > > + };
> > > + struct xe_pt *pt = vm->pt_root[tile->id];
> > > + u8 pt_mask = (range->tile_present & ~range-
> > > >tile_invalidated);
> > > +
> > > + if (!(pt_mask & BIT(tile->id)))
> > > + return false;
> > > +
> > > + (void)xe_pt_walk_shared(&pt->base, pt->level, range-
> > > >base.va.start,
> > > + range->base.va.end, &xe_walk.base);
> > > +
> > > + return xe_walk.needs_invalidate;
> > > +}
> > > +
> > > static void
> > > xe_vm_populate_pgtable(struct xe_migrate_pt_update *pt_update,
> > > struct xe_tile *tile,
> > > struct iosys_map *map, void *data,
> > > diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h
> > > index 9ab386431cad..5f333eeedf5c 100644
> > > --- a/drivers/gpu/drm/xe/xe_pt.h
> > > +++ b/drivers/gpu/drm/xe/xe_pt.h
> > > @@ -13,6 +13,7 @@ struct dma_fence;
> > > struct xe_bo;
> > > struct xe_device;
> > > struct xe_exec_queue;
> > > +struct xe_svm_range;
> > > struct xe_sync_entry;
> > > struct xe_tile;
> > > struct xe_vm;
> > > @@ -42,5 +43,7 @@ void xe_pt_update_ops_fini(struct xe_tile *tile,
> > > struct xe_vma_ops *vops);
> > > void xe_pt_update_ops_abort(struct xe_tile *tile, struct xe_vma_ops
> > > *vops);
> > >
> > > bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
> > > +bool xe_pt_zap_ptes_range(struct xe_tile *tile, struct xe_vm *vm,
> > > + struct xe_svm_range *range);
> > >
> > > #endif
> > > diff --git a/drivers/gpu/drm/xe/xe_svm.c
> > > b/drivers/gpu/drm/xe/xe_svm.c
> > > index 57b740367843..b2bc259978c4 100644
> > > --- a/drivers/gpu/drm/xe/xe_svm.c
> > > +++ b/drivers/gpu/drm/xe/xe_svm.c
> > > @@ -5,18 +5,188 @@
> > >
> > > #include "drm_gpusvm.h"
> > >
> > > +#include "xe_gt_tlb_invalidation.h"
> > > +#include "xe_pt.h"
> > > #include "xe_svm.h"
> > > #include "xe_vm.h"
> > > #include "xe_vm_types.h"
> > >
> > > +static struct xe_vm *gpusvm_to_vm(struct drm_gpusvm *gpusvm)
> > > +{
> > > + return container_of(gpusvm, struct xe_vm, svm.gpusvm);
> > > +}
> > > +
> > > +static struct xe_vm *range_to_vm(struct drm_gpusvm_range *r)
> > > +{
> > > + return gpusvm_to_vm(r->gpusvm);
> > > +}
> > > +
> > > +static struct drm_gpusvm_range *
> > > +xe_svm_range_alloc(struct drm_gpusvm *gpusvm)
> > > +{
> > > + struct xe_svm_range *range;
> > > +
> > > + range = kzalloc(sizeof(*range), GFP_KERNEL);
> > > + if (!range)
> > > + return ERR_PTR(-ENOMEM);
> > > +
> > > + xe_vm_get(gpusvm_to_vm(gpusvm));
> > > +
> > > + return &range->base;
> > > +}
> > > +
> > > +static void xe_svm_range_free(struct drm_gpusvm_range *range)
> > > +{
> > > + xe_vm_put(range_to_vm(range));
> > > + kfree(range);
> > > +}
> > > +
> > > +static struct xe_svm_range *to_xe_range(struct drm_gpusvm_range *r)
> > > +{
> > > + return container_of(r, struct xe_svm_range, base);
> > > +}
> > > +
> > > +static u8
> > > +xe_svm_range_notifier_event_begin(struct xe_vm *vm, struct
> > > drm_gpusvm_range *r,
> > > + const struct mmu_notifier_range
> > > *mmu_range,
> > > + u64 *adj_start, u64 *adj_end)
> > > +{
> > > + struct xe_svm_range *range = to_xe_range(r);
> > > + struct xe_device *xe = vm->xe;
> > > + struct xe_tile *tile;
> > > + u8 tile_mask = 0;
> > > + u8 id;
> > > +
> >
> > lockdep assert?
> >
>
> Sure.
>
> > > + /* Skip if already unmapped or if no binding exist */
> > > + if (range->base.flags.unmapped || !range->tile_present)
> > > + return 0;
> > > +
> > > + /* Adjust invalidation to range boundaries */
> > > + if (range->base.va.start < mmu_range->start)
> > > + *adj_start = range->base.va.start;
> > > + if (range->base.va.end > mmu_range->end)
> > > + *adj_end = range->base.va.end;
> > > +
> > > + /*
> > > + * XXX: Ideally would zap PTEs in one shot in
> > > xe_svm_invalidate but the
> > > + * invalidation code can't correctly cope with sparse ranges
> > > or
> > > + * invalidations spanning multiple ranges.
> > > + */
> > > + for_each_tile(tile, xe, id)
> > > + if (xe_pt_zap_ptes_range(tile, vm, range)) {
> > > + tile_mask |= BIT(id);
> > > + range->tile_invalidated |= BIT(id);
> > > + }
> > > +
> > > + return tile_mask;
> > > +}
> > > +
> > > +static void
> > > +xe_svm_range_notifier_event_end(struct xe_vm *vm, struct
> > > drm_gpusvm_range *r,
> > > + const struct mmu_notifier_range
> > > *mmu_range)
> > > +{
> > > + struct drm_gpusvm_ctx ctx = { .in_notifier = true, };
> > > +
> > > + drm_gpusvm_range_unmap_pages(&vm->svm.gpusvm, r, &ctx);
> > > + /* TODO: Add range to garbage collector */
> > > +}
> > > +
> > > static void xe_svm_invalidate(struct drm_gpusvm *gpusvm,
> > > struct drm_gpusvm_notifier *notifier,
> > > const struct mmu_notifier_range
> > > *mmu_range)
> > > {
> > > - /* TODO: Implement */
> > > + struct xe_vm *vm = gpusvm_to_vm(gpusvm);
> > > + struct xe_device *xe = vm->xe;
> > > + struct xe_tile *tile;
> > > + struct drm_gpusvm_range *r, *first;
> > > + struct xe_gt_tlb_invalidation_fence
> > > + fence[XE_MAX_TILES_PER_DEVICE * XE_MAX_GT_PER_TILE];
> > > + u64 adj_start = mmu_range->start, adj_end = mmu_range->end;
> > > + u8 tile_mask = 0;
> > > + u8 id;
> > > + u32 fence_id = 0;
> > > + long err;
> > > +
> > > + if (xe_vm_is_closed(vm))
> > > + return;
> >
> > How do we ensure we don't race here? Are we sure that all dma mappings
> > and all PTEs pointing to the range is gone at this point? Becase "They
> > will soon be gone anyway" isn't enough.
> >
>
> I think this is to prevent touching PTs which are being destroyed in
> parallel which resulted in kernel explosion, so I think we need this.
>
> How to prevent a race? How about on VM close we invalidate the PT root?
> I had patch at one point which did this. We'd still have dma mappings
> too but I think if need to we can safely dma-unmap the pages if the VM
> is closed too. Thoughts?
>
> > > +
> > > + /* Adjust invalidation to notifier boundaries */
> > > + if (adj_start < notifier->interval.start)
> > > + adj_start = notifier->interval.start;
> > > + if (adj_end > notifier->interval.end)
> > > + adj_end = notifier->interval.end;
> > > +
> > > + first = drm_gpusvm_range_find(notifier, adj_start, adj_end);
> > > + if (!first)
> > > + return;
> > > +
> > > + /*
> > > + * XXX: Less than ideal to always wait on VM's resv slots if
> > > an
> > > + * invalidation is not required. Could walk range list twice
> > > to figure
> > > + * out if an invalidations is need, but also not ideal.
> > > Maybe a counter
> > > + * within the notifier, seems like that could work.
> > > + */
> > > + err = dma_resv_wait_timeout(xe_vm_resv(vm),
> > > + DMA_RESV_USAGE_BOOKKEEP,
> > > + false, MAX_SCHEDULE_TIMEOUT);
> > > + XE_WARN_ON(err <= 0);
> > > +
> > > + r = first;
> > > + drm_gpusvm_for_each_range(r, notifier, adj_start, adj_end)
> > > + tile_mask |= xe_svm_range_notifier_event_begin(vm,
> > > r, mmu_range,
> > > +
> > > &adj_start,
> > > +
> > > &adj_end);
> > > + if (!tile_mask)
> > > + goto range_notifier_event_end;
> > > +
> > > + xe_device_wmb(xe);
> > > +
> > > + for_each_tile(tile, xe, id) {
> > > + if (tile_mask & BIT(id)) {
> > > + int err;
> > > +
> > > + xe_gt_tlb_invalidation_fence_init(tile-
> > > >primary_gt,
> > > +
> > > &fence[fence_id], true);
> > > +
> > > + err = xe_gt_tlb_invalidation_range(tile-
> > > >primary_gt,
> > > +
> > > &fence[fence_id],
> > > +
> > > adj_start,
> > > + adj_end,
> > > + vm-
> > > >usm.asid);
> > > + if (WARN_ON_ONCE(err < 0))
> > > + goto wait;
> > > + ++fence_id;
> > > +
> > > + if (!tile->media_gt)
> > > + continue;
> > > +
> > > + xe_gt_tlb_invalidation_fence_init(tile-
> > > >media_gt,
> > > +
> > > &fence[fence_id], true);
> > > +
> > > + err = xe_gt_tlb_invalidation_range(tile-
> > > >media_gt,
> > > +
> > > &fence[fence_id],
> > > +
> > > adj_start,
> > > + adj_end,
> > > + vm-
> > > >usm.asid);
> > > + if (WARN_ON_ONCE(err < 0))
> > > + goto wait;
> > > + ++fence_id;
> > > + }
> > > + }
> > > +
> > > +wait:
> > > + for (id = 0; id < fence_id; ++id)
> > > + xe_gt_tlb_invalidation_fence_wait(&fence[id]);
> > > +
> > > +range_notifier_event_end:
> > > + r = first;
> > > + drm_gpusvm_for_each_range(r, notifier, adj_start, adj_end)
> > > + xe_svm_range_notifier_event_end(vm, r, mmu_range);
> > > }
> > >
> > > static const struct drm_gpusvm_ops gpusvm_ops = {
> > > + .range_alloc = xe_svm_range_alloc,
> > > + .range_free = xe_svm_range_free,
> > > .invalidate = xe_svm_invalidate,
> > > };
> > >
> > > @@ -36,6 +206,11 @@ int xe_svm_init(struct xe_vm *vm)
> > >
> > > void xe_svm_close(struct xe_vm *vm)
> > > {
> > > + xe_assert(vm->xe, xe_vm_is_closed(vm));
> > > +
> > > + /* Flush running notifiers making xe_vm_close() visable */
> > > + drm_gpusvm_notifier_lock(&vm->svm.gpusvm);
> > > + drm_gpusvm_notifier_unlock(&vm->svm.gpusvm);
> >
> > Calling mmu_notifier_read_begin() ensures that nothing is invalidating
> > on the range. Probably a better choice.
> >
>
> We'd have to call that on every notifier rather than just cycle the
> lock, so with that I'd prefer to leave it as is.
>
> > > }
> > >
> > > void xe_svm_fini(struct xe_vm *vm)
> > > @@ -44,3 +219,31 @@ void xe_svm_fini(struct xe_vm *vm)
> > >
> > > drm_gpusvm_fini(&vm->svm.gpusvm);
> > > }
> > > +
> > > +int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
> > > + struct xe_tile *tile, u64 fault_addr,
> > > + bool atomic)
> > > +{
> > > + struct drm_gpusvm_ctx ctx = { .read_only =
> > > xe_vma_read_only(vma), };
> > > + struct drm_gpusvm_range *r;
> > > + int err;
> > > +
> > > + lockdep_assert_held_write(&vm->lock);
> > > +
> > > +retry:
> > > + /* TODO: Run garbage collector */
> > > +
> > > + r = drm_gpusvm_range_find_or_insert(&vm->svm.gpusvm,
> > > fault_addr,
> > > + xe_vma_start(vma),
> > > xe_vma_end(vma),
> > > + &ctx);
> > > + if (IS_ERR(r))
> > > + return PTR_ERR(r);
> > > +
> > > + err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r, false);
> > > + if (err == -EFAULT || err == -EPERM) /* Corner where CPU
> > > mappings have change */
> >
> > s/change/changed/
> >
>
> Yep.
>
> > > + goto retry;
> > > +
> > > + /* TODO: Issue bind */
> > > +
> > > + return err;
> > > +}
> > > diff --git a/drivers/gpu/drm/xe/xe_svm.h
> > > b/drivers/gpu/drm/xe/xe_svm.h
> > > index 376e86876a11..c91c5f538024 100644
> > > --- a/drivers/gpu/drm/xe/xe_svm.h
> > > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > > @@ -6,14 +6,27 @@
> > > #ifndef _XE_SVM_H_
> > > #define _XE_SVM_H_
> > >
> > > +#include "drm_gpusvm.h"
> > > #include "drm_pagemap.h"
> > >
> > > #define XE_INTERCONNECT_VRAM DRM_INTERCONNECT_DRIVER
> >
> > Not used yet
> >
>
> Will remove.
>
xe_res_cursor.h uses this in a prior patch in the series.
Matt
> > >
> > > +struct xe_tile;
> > > struct xe_vm;
> > > +struct xe_vma;
> > > +
> > > +struct xe_svm_range {
> > > + struct drm_gpusvm_range base;
> > > + u8 tile_present;
> > > + u8 tile_invalidated;
> > > +};
> >
> > Kerneldoc
> >
>
> Will add.
>
> >
> > >
> > > int xe_svm_init(struct xe_vm *vm);
> > > void xe_svm_fini(struct xe_vm *vm);
> > > void xe_svm_close(struct xe_vm *vm);
> > >
> > > +int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
> > > + struct xe_tile *tile, u64 fault_addr,
> > > + bool atomic);
> > > +
> > > #endif
> >
> > Thanks,
>
> Thanks,
> Matt
>
> > Thomas
> >
^ permalink raw reply [flat|nested] 129+ messages in thread

* Re: [PATCH v2 09/29] drm/xe: Add SVM range invalidation
2024-12-11 19:01 ` Matthew Brost
2024-12-14 23:11 ` Matthew Brost
@ 2024-12-16 10:01 ` Thomas Hellström
2024-12-16 16:09 ` Matthew Brost
1 sibling, 1 reply; 129+ messages in thread
From: Thomas Hellström @ 2024-12-16 10:01 UTC (permalink / raw)
To: Matthew Brost
Cc: intel-xe, dri-devel, apopple, airlied, christian.koenig,
simona.vetter, felix.kuehling, dakr
On Wed, 2024-12-11 at 11:01 -0800, Matthew Brost wrote:
> On Tue, Nov 19, 2024 at 02:56:12PM +0100, Thomas Hellström wrote:
> > On Tue, 2024-10-15 at 20:24 -0700, Matthew Brost wrote:
> > > Add SVM range invalidation vfunc.
> > >
> > > v2:
> > > - Don't run invalidation if VM is closed
> > > - Cycle notifier lock in xe_svm_close
> > > - Drop xe_gt_tlb_invalidation_fence_fini
> > >
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > ---
> > > drivers/gpu/drm/xe/xe_gt_pagefault.c | 17 ++-
> > > drivers/gpu/drm/xe/xe_pt.c | 24 ++++
> > > drivers/gpu/drm/xe/xe_pt.h | 3 +
> > > drivers/gpu/drm/xe/xe_svm.c | 205
> > > ++++++++++++++++++++++++++-
> > > drivers/gpu/drm/xe/xe_svm.h | 13 ++
> > > 5 files changed, 256 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c
> > > b/drivers/gpu/drm/xe/xe_gt_pagefault.c
> > > index 79c426dc2505..92923947a12c 100644
> > > --- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
> > > +++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
> > > @@ -19,6 +19,7 @@
> > > #include "xe_guc.h"
> > > #include "xe_guc_ct.h"
> > > #include "xe_migrate.h"
> > > +#include "xe_svm.h"
> > > #include "xe_trace_bo.h"
> > > #include "xe_vm.h"
> > >
> > > @@ -125,18 +126,17 @@ static int xe_pf_begin(struct drm_exec
> > > *exec,
> > > struct xe_vma *vma,
> > > return 0;
> > > }
> > >
> > > -static int handle_vma_pagefault(struct xe_tile *tile, struct
> > > pagefault *pf,
> > > - struct xe_vma *vma)
> > > +static int handle_vma_pagefault(struct xe_tile *tile, struct
> > > xe_vma
> > > *vma,
> > > + bool atomic)
> > > {
> > > struct xe_vm *vm = xe_vma_vm(vma);
> > > struct drm_exec exec;
> > > struct dma_fence *fence;
> > > ktime_t end = 0;
> > > int err;
> > > - bool atomic;
> > >
> > > + lockdep_assert_held_write(&vm->lock);
> > > trace_xe_vma_pagefault(vma);
> > > - atomic = access_is_atomic(pf->access_type);
> > >
> > > /* Check if VMA is valid */
> > > if (vma_is_valid(tile, vma) && !atomic)
> > > @@ -207,6 +207,7 @@ static int handle_pagefault(struct xe_gt *gt,
> > > struct pagefault *pf)
> > > struct xe_vm *vm;
> > > struct xe_vma *vma = NULL;
> > > int err;
> > > + bool atomic;
> > >
> > > /* SW isn't expected to handle TRTT faults */
> > > if (pf->trva_fault)
> > > @@ -232,7 +233,13 @@ static int handle_pagefault(struct xe_gt
> > > *gt,
> > > struct pagefault *pf)
> > > goto unlock_vm;
> > > }
> > >
> > > - err = handle_vma_pagefault(tile, pf, vma);
> > > + atomic = access_is_atomic(pf->access_type);
> > > +
> > > + if (xe_vma_is_system_allocator(vma))
> > > + err = xe_svm_handle_pagefault(vm, vma, tile,
> > > + pf->page_addr,
> > > atomic);
> > > + else
> > > + err = handle_vma_pagefault(tile, vma, atomic);
> > >
> > > unlock_vm:
> > > if (!err)
> > > diff --git a/drivers/gpu/drm/xe/xe_pt.c
> > > b/drivers/gpu/drm/xe/xe_pt.c
> > > index 39357e829b6d..282476c4edbd 100644
> > > --- a/drivers/gpu/drm/xe/xe_pt.c
> > > +++ b/drivers/gpu/drm/xe/xe_pt.c
> > > @@ -20,6 +20,7 @@
> > > #include "xe_res_cursor.h"
> > > #include "xe_sched_job.h"
> > > #include "xe_sync.h"
> > > +#include "xe_svm.h"
> > > #include "xe_trace.h"
> > > #include "xe_ttm_stolen_mgr.h"
> > > #include "xe_vm.h"
> > > @@ -829,6 +830,29 @@ bool xe_pt_zap_ptes(struct xe_tile *tile,
> > > struct
> > > xe_vma *vma)
> > > return xe_walk.needs_invalidate;
> > > }
> > >
> > > +bool xe_pt_zap_ptes_range(struct xe_tile *tile, struct xe_vm
> > > *vm,
> > > + struct xe_svm_range *range)
> >
> > Kerneldoc.
> >
>
> Will add.
>
> > Here, (and I saw Oak also commented around this some time ago)
> > ideally
> > we should make xe_pt.c unaware of vmas and svm ranges, and in this
> > case, use the same xe_pt function for both.
> >
>
> See some of other comments, agree we should do in a follow up.
>
> >
> >
> > > +{
> > > + struct xe_pt_zap_ptes_walk xe_walk = {
> > > + .base = {
> > > + .ops = &xe_pt_zap_ptes_ops,
> > > + .shifts = xe_normal_pt_shifts,
> > > + .max_level = XE_PT_HIGHEST_LEVEL,
> > > + },
> > > + .tile = tile,
> > > + };
> > > + struct xe_pt *pt = vm->pt_root[tile->id];
> > > + u8 pt_mask = (range->tile_present & ~range-
> > > > tile_invalidated);
> > > +
> > > + if (!(pt_mask & BIT(tile->id)))
> > > + return false;
> > > +
> > > + (void)xe_pt_walk_shared(&pt->base, pt->level, range-
> > > > base.va.start,
> > > + range->base.va.end,
> > > &xe_walk.base);
> > > +
> > > + return xe_walk.needs_invalidate;
> > > +}
> > > +
> > > static void
> > > xe_vm_populate_pgtable(struct xe_migrate_pt_update *pt_update,
> > > struct xe_tile *tile,
> > > struct iosys_map *map, void *data,
> > > diff --git a/drivers/gpu/drm/xe/xe_pt.h
> > > b/drivers/gpu/drm/xe/xe_pt.h
> > > index 9ab386431cad..5f333eeedf5c 100644
> > > --- a/drivers/gpu/drm/xe/xe_pt.h
> > > +++ b/drivers/gpu/drm/xe/xe_pt.h
> > > @@ -13,6 +13,7 @@ struct dma_fence;
> > > struct xe_bo;
> > > struct xe_device;
> > > struct xe_exec_queue;
> > > +struct xe_svm_range;
> > > struct xe_sync_entry;
> > > struct xe_tile;
> > > struct xe_vm;
> > > @@ -42,5 +43,7 @@ void xe_pt_update_ops_fini(struct xe_tile
> > > *tile,
> > > struct xe_vma_ops *vops);
> > > void xe_pt_update_ops_abort(struct xe_tile *tile, struct
> > > xe_vma_ops
> > > *vops);
> > >
> > > bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
> > > +bool xe_pt_zap_ptes_range(struct xe_tile *tile, struct xe_vm
> > > *vm,
> > > + struct xe_svm_range *range);
> > >
> > > #endif
> > > diff --git a/drivers/gpu/drm/xe/xe_svm.c
> > > b/drivers/gpu/drm/xe/xe_svm.c
> > > index 57b740367843..b2bc259978c4 100644
> > > --- a/drivers/gpu/drm/xe/xe_svm.c
> > > +++ b/drivers/gpu/drm/xe/xe_svm.c
> > > @@ -5,18 +5,188 @@
> > >
> > > #include "drm_gpusvm.h"
> > >
> > > +#include "xe_gt_tlb_invalidation.h"
> > > +#include "xe_pt.h"
> > > #include "xe_svm.h"
> > > #include "xe_vm.h"
> > > #include "xe_vm_types.h"
> > >
> > > +static struct xe_vm *gpusvm_to_vm(struct drm_gpusvm *gpusvm)
> > > +{
> > > + return container_of(gpusvm, struct xe_vm, svm.gpusvm);
> > > +}
> > > +
> > > +static struct xe_vm *range_to_vm(struct drm_gpusvm_range *r)
> > > +{
> > > + return gpusvm_to_vm(r->gpusvm);
> > > +}
> > > +
> > > +static struct drm_gpusvm_range *
> > > +xe_svm_range_alloc(struct drm_gpusvm *gpusvm)
> > > +{
> > > + struct xe_svm_range *range;
> > > +
> > > + range = kzalloc(sizeof(*range), GFP_KERNEL);
> > > + if (!range)
> > > + return ERR_PTR(-ENOMEM);
> > > +
> > > + xe_vm_get(gpusvm_to_vm(gpusvm));
> > > +
> > > + return &range->base;
> > > +}
> > > +
> > > +static void xe_svm_range_free(struct drm_gpusvm_range *range)
> > > +{
> > > + xe_vm_put(range_to_vm(range));
> > > + kfree(range);
> > > +}
> > > +
> > > +static struct xe_svm_range *to_xe_range(struct drm_gpusvm_range
> > > *r)
> > > +{
> > > + return container_of(r, struct xe_svm_range, base);
> > > +}
> > > +
> > > +static u8
> > > +xe_svm_range_notifier_event_begin(struct xe_vm *vm, struct
> > > drm_gpusvm_range *r,
> > > + const struct
> > > mmu_notifier_range
> > > *mmu_range,
> > > + u64 *adj_start, u64 *adj_end)
> > > +{
> > > + struct xe_svm_range *range = to_xe_range(r);
> > > + struct xe_device *xe = vm->xe;
> > > + struct xe_tile *tile;
> > > + u8 tile_mask = 0;
> > > + u8 id;
> > > +
> >
> > lockdep assert?
> >
>
> Sure.
>
> > > + /* Skip if already unmapped or if no binding exist */
> > > + if (range->base.flags.unmapped || !range->tile_present)
> > > + return 0;
> > > +
> > > + /* Adjust invalidation to range boundaries */
> > > + if (range->base.va.start < mmu_range->start)
> > > + *adj_start = range->base.va.start;
> > > + if (range->base.va.end > mmu_range->end)
> > > + *adj_end = range->base.va.end;
> > > +
> > > + /*
> > > + * XXX: Ideally would zap PTEs in one shot in
> > > xe_svm_invalidate but the
> > > + * invalidation code can't correctly cope with sparse
> > > ranges
> > > or
> > > + * invalidations spanning multiple ranges.
> > > + */
> > > + for_each_tile(tile, xe, id)
> > > + if (xe_pt_zap_ptes_range(tile, vm, range)) {
> > > + tile_mask |= BIT(id);
> > > + range->tile_invalidated |= BIT(id);
> > > + }
> > > +
> > > + return tile_mask;
> > > +}
> > > +
> > > +static void
> > > +xe_svm_range_notifier_event_end(struct xe_vm *vm, struct
> > > drm_gpusvm_range *r,
> > > + const struct mmu_notifier_range
> > > *mmu_range)
> > > +{
> > > + struct drm_gpusvm_ctx ctx = { .in_notifier = true, };
> > > +
> > > + drm_gpusvm_range_unmap_pages(&vm->svm.gpusvm, r, &ctx);
> > > + /* TODO: Add range to garbage collector */
> > > +}
> > > +
> > > static void xe_svm_invalidate(struct drm_gpusvm *gpusvm,
> > > struct drm_gpusvm_notifier
> > > *notifier,
> > > const struct mmu_notifier_range
> > > *mmu_range)
> > > {
> > > - /* TODO: Implement */
> > > + struct xe_vm *vm = gpusvm_to_vm(gpusvm);
> > > + struct xe_device *xe = vm->xe;
> > > + struct xe_tile *tile;
> > > + struct drm_gpusvm_range *r, *first;
> > > + struct xe_gt_tlb_invalidation_fence
> > > + fence[XE_MAX_TILES_PER_DEVICE *
> > > XE_MAX_GT_PER_TILE];
> > > + u64 adj_start = mmu_range->start, adj_end = mmu_range-
> > > >end;
> > > + u8 tile_mask = 0;
> > > + u8 id;
> > > + u32 fence_id = 0;
> > > + long err;
> > > +
> > > + if (xe_vm_is_closed(vm))
> > > + return;
> >
> > How do we ensure we don't race here? Are we sure that all dma
> > mappings
> > and all PTEs pointing to the range is gone at this point? Becase
> > "They
> > will soon be gone anyway" isn't enough.
> >
>
> I think this is to prevent touching PTs which are being destroyed in
> parallel which resulted in kernel explosion, so I think we need this.
IIRC, the pt structure change is committed under the notifier lock when
unbinding, which means that a racing pt zap shouldn't do any harm and
would just cause the commit phase to re-run?
So if we follow the
1.) take vm->lock : waits for existing binding to complete
2.) mark vm closed : inhibits future binding
3.) unbind page-table
4.) Remove notifiers
We should be ok?
/Thomas
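In code form, that ordering would be roughly the sketch below, where
mark_vm_closed() and unbind_all_page_tables() are only placeholders for
whatever the real close path does:

static void vm_close_ordering_sketch(struct xe_vm *vm)
{
        down_write(&vm->lock);          /* 1) waits for existing bindings */
        mark_vm_closed(vm);             /* 2) inhibits future bindings */
        unbind_all_page_tables(vm);     /* 3) no PTEs left pointing at ranges */
        up_write(&vm->lock);

        xe_svm_close(vm);               /* flush any invalidation still running */
        xe_svm_fini(vm);                /* 4) removes notifiers via drm_gpusvm_fini() */
}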
>
> How to prevent a race? How about on VM close we invalidate the PT
> root?
> I had patch at one point which did this. We'd still have dma mappings
> too but I think if need to we can safely dma-unmap the pages if the
> VM
> is closed too. Thoughts?
>
> > > +
> > > + /* Adjust invalidation to notifier boundaries */
> > > + if (adj_start < notifier->interval.start)
> > > + adj_start = notifier->interval.start;
> > > + if (adj_end > notifier->interval.end)
> > > + adj_end = notifier->interval.end;
> > > +
> > > + first = drm_gpusvm_range_find(notifier, adj_start,
> > > adj_end);
> > > + if (!first)
> > > + return;
> > > +
> > > + /*
> > > + * XXX: Less than ideal to always wait on VM's resv
> > > slots if
> > > an
> > > + * invalidation is not required. Could walk range list
> > > twice
> > > to figure
> > > + * out if an invalidations is need, but also not ideal.
> > > Maybe a counter
> > > + * within the notifier, seems like that could work.
> > > + */
> > > + err = dma_resv_wait_timeout(xe_vm_resv(vm),
> > > + DMA_RESV_USAGE_BOOKKEEP,
> > > + false,
> > > MAX_SCHEDULE_TIMEOUT);
> > > + XE_WARN_ON(err <= 0);
> > > +
> > > + r = first;
> > > + drm_gpusvm_for_each_range(r, notifier, adj_start,
> > > adj_end)
> > > + tile_mask |=
> > > xe_svm_range_notifier_event_begin(vm,
> > > r, mmu_range,
> > > +
> > > &adj_start,
> > > +
> > > &adj_end);
> > > + if (!tile_mask)
> > > + goto range_notifier_event_end;
> > > +
> > > + xe_device_wmb(xe);
> > > +
> > > + for_each_tile(tile, xe, id) {
> > > + if (tile_mask & BIT(id)) {
> > > + int err;
> > > +
> > > + xe_gt_tlb_invalidation_fence_init(tile-
> > > > primary_gt,
> > > +
> > > &fence[fence_id], true);
> > > +
> > > + err = xe_gt_tlb_invalidation_range(tile-
> > > > primary_gt,
> > > +
> > > &fence[fence_id],
> > > +
> > > adj_start,
> > > +
> > > adj_end,
> > > + vm-
> > > > usm.asid);
> > > + if (WARN_ON_ONCE(err < 0))
> > > + goto wait;
> > > + ++fence_id;
> > > +
> > > + if (!tile->media_gt)
> > > + continue;
> > > +
> > > + xe_gt_tlb_invalidation_fence_init(tile-
> > > > media_gt,
> > > +
> > > &fence[fence_id], true);
> > > +
> > > + err = xe_gt_tlb_invalidation_range(tile-
> > > > media_gt,
> > > +
> > > &fence[fence_id],
> > > +
> > > adj_start,
> > > +
> > > adj_end,
> > > + vm-
> > > > usm.asid);
> > > + if (WARN_ON_ONCE(err < 0))
> > > + goto wait;
> > > + ++fence_id;
> > > + }
> > > + }
> > > +
> > > +wait:
> > > + for (id = 0; id < fence_id; ++id)
> > > + xe_gt_tlb_invalidation_fence_wait(&fence[id]);
> > > +
> > > +range_notifier_event_end:
> > > + r = first;
> > > + drm_gpusvm_for_each_range(r, notifier, adj_start,
> > > adj_end)
> > > + xe_svm_range_notifier_event_end(vm, r,
> > > mmu_range);
> > > }
> > >
> > > static const struct drm_gpusvm_ops gpusvm_ops = {
> > > + .range_alloc = xe_svm_range_alloc,
> > > + .range_free = xe_svm_range_free,
> > > .invalidate = xe_svm_invalidate,
> > > };
> > >
> > > @@ -36,6 +206,11 @@ int xe_svm_init(struct xe_vm *vm)
> > >
> > > void xe_svm_close(struct xe_vm *vm)
> > > {
> > > + xe_assert(vm->xe, xe_vm_is_closed(vm));
> > > +
> > > + /* Flush running notifiers making xe_vm_close() visable
> > > */
> > > + drm_gpusvm_notifier_lock(&vm->svm.gpusvm);
> > > + drm_gpusvm_notifier_unlock(&vm->svm.gpusvm);
> >
> > Calling mmu_notifier_read_begin() ensures that nothing is
> > invalidating
> > on the range. Probably a better choice.
> >
>
> We'd have to call that on every notifier rather than just cycle the
> lock, so with that I'd prefer to leave it as is.
>
> > > }
> > >
> > > void xe_svm_fini(struct xe_vm *vm)
> > > @@ -44,3 +219,31 @@ void xe_svm_fini(struct xe_vm *vm)
> > >
> > > drm_gpusvm_fini(&vm->svm.gpusvm);
> > > }
> > > +
> > > +int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma
> > > *vma,
> > > + struct xe_tile *tile, u64
> > > fault_addr,
> > > + bool atomic)
> > > +{
> > > + struct drm_gpusvm_ctx ctx = { .read_only =
> > > xe_vma_read_only(vma), };
> > > + struct drm_gpusvm_range *r;
> > > + int err;
> > > +
> > > + lockdep_assert_held_write(&vm->lock);
> > > +
> > > +retry:
> > > + /* TODO: Run garbage collector */
> > > +
> > > + r = drm_gpusvm_range_find_or_insert(&vm->svm.gpusvm,
> > > fault_addr,
> > > + xe_vma_start(vma),
> > > xe_vma_end(vma),
> > > + &ctx);
> > > + if (IS_ERR(r))
> > > + return PTR_ERR(r);
> > > +
> > > + err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r,
> > > false);
> > > + if (err == -EFAULT || err == -EPERM) /* Corner where
> > > CPU
> > > mappings have change */
> >
> > s/change/changed/
> >
>
> Yep.
>
> > > + goto retry;
> > > +
> > > + /* TODO: Issue bind */
> > > +
> > > + return err;
> > > +}
> > > diff --git a/drivers/gpu/drm/xe/xe_svm.h
> > > b/drivers/gpu/drm/xe/xe_svm.h
> > > index 376e86876a11..c91c5f538024 100644
> > > --- a/drivers/gpu/drm/xe/xe_svm.h
> > > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > > @@ -6,14 +6,27 @@
> > > #ifndef _XE_SVM_H_
> > > #define _XE_SVM_H_
> > >
> > > +#include "drm_gpusvm.h"
> > > #include "drm_pagemap.h"
> > >
> > > #define XE_INTERCONNECT_VRAM DRM_INTERCONNECT_DRIVER
> >
> > Not used yet
> >
>
> Will remove.
>
> > >
> > > +struct xe_tile;
> > > struct xe_vm;
> > > +struct xe_vma;
> > > +
> > > +struct xe_svm_range {
> > > + struct drm_gpusvm_range base;
> > > + u8 tile_present;
> > > + u8 tile_invalidated;
> > > +};
> >
> > Kerneldoc
> >
>
> Will add.
>
> >
> > >
> > > int xe_svm_init(struct xe_vm *vm);
> > > void xe_svm_fini(struct xe_vm *vm);
> > > void xe_svm_close(struct xe_vm *vm);
> > >
> > > +int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma
> > > *vma,
> > > + struct xe_tile *tile, u64
> > > fault_addr,
> > > + bool atomic);
> > > +
> > > #endif
> >
> > Thanks,
>
> Thanks,
> Matt
>
> > Thomas
> >
^ permalink raw reply [flat|nested] 129+ messages in thread

* Re: [PATCH v2 09/29] drm/xe: Add SVM range invalidation
2024-12-16 10:01 ` Thomas Hellström
@ 2024-12-16 16:09 ` Matthew Brost
2024-12-16 17:35 ` Thomas Hellström
0 siblings, 1 reply; 129+ messages in thread
From: Matthew Brost @ 2024-12-16 16:09 UTC (permalink / raw)
To: Thomas Hellström
Cc: intel-xe, dri-devel, apopple, airlied, christian.koenig,
simona.vetter, felix.kuehling, dakr
On Mon, Dec 16, 2024 at 11:01:23AM +0100, Thomas Hellström wrote:
> On Wed, 2024-12-11 at 11:01 -0800, Matthew Brost wrote:
> > On Tue, Nov 19, 2024 at 02:56:12PM +0100, Thomas Hellström wrote:
> > > On Tue, 2024-10-15 at 20:24 -0700, Matthew Brost wrote:
> > > > Add SVM range invalidation vfunc.
> > > >
> > > > v2:
> > > > - Don't run invalidation if VM is closed
> > > > - Cycle notifier lock in xe_svm_close
> > > > - Drop xe_gt_tlb_invalidation_fence_fini
> > > >
> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > ---
> > > > drivers/gpu/drm/xe/xe_gt_pagefault.c | 17 ++-
> > > > drivers/gpu/drm/xe/xe_pt.c | 24 ++++
> > > > drivers/gpu/drm/xe/xe_pt.h | 3 +
> > > > drivers/gpu/drm/xe/xe_svm.c | 205
> > > > ++++++++++++++++++++++++++-
> > > > drivers/gpu/drm/xe/xe_svm.h | 13 ++
> > > > 5 files changed, 256 insertions(+), 6 deletions(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c
> > > > b/drivers/gpu/drm/xe/xe_gt_pagefault.c
> > > > index 79c426dc2505..92923947a12c 100644
> > > > --- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
> > > > +++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
> > > > @@ -19,6 +19,7 @@
> > > > #include "xe_guc.h"
> > > > #include "xe_guc_ct.h"
> > > > #include "xe_migrate.h"
> > > > +#include "xe_svm.h"
> > > > #include "xe_trace_bo.h"
> > > > #include "xe_vm.h"
> > > >
> > > > @@ -125,18 +126,17 @@ static int xe_pf_begin(struct drm_exec
> > > > *exec,
> > > > struct xe_vma *vma,
> > > > return 0;
> > > > }
> > > >
> > > > -static int handle_vma_pagefault(struct xe_tile *tile, struct
> > > > pagefault *pf,
> > > > - struct xe_vma *vma)
> > > > +static int handle_vma_pagefault(struct xe_tile *tile, struct
> > > > xe_vma
> > > > *vma,
> > > > + bool atomic)
> > > > {
> > > > struct xe_vm *vm = xe_vma_vm(vma);
> > > > struct drm_exec exec;
> > > > struct dma_fence *fence;
> > > > ktime_t end = 0;
> > > > int err;
> > > > - bool atomic;
> > > >
> > > > + lockdep_assert_held_write(&vm->lock);
> > > > trace_xe_vma_pagefault(vma);
> > > > - atomic = access_is_atomic(pf->access_type);
> > > >
> > > > /* Check if VMA is valid */
> > > > if (vma_is_valid(tile, vma) && !atomic)
> > > > @@ -207,6 +207,7 @@ static int handle_pagefault(struct xe_gt *gt,
> > > > struct pagefault *pf)
> > > > struct xe_vm *vm;
> > > > struct xe_vma *vma = NULL;
> > > > int err;
> > > > + bool atomic;
> > > >
> > > > /* SW isn't expected to handle TRTT faults */
> > > > if (pf->trva_fault)
> > > > @@ -232,7 +233,13 @@ static int handle_pagefault(struct xe_gt
> > > > *gt,
> > > > struct pagefault *pf)
> > > > goto unlock_vm;
> > > > }
> > > >
> > > > - err = handle_vma_pagefault(tile, pf, vma);
> > > > + atomic = access_is_atomic(pf->access_type);
> > > > +
> > > > + if (xe_vma_is_system_allocator(vma))
> > > > + err = xe_svm_handle_pagefault(vm, vma, tile,
> > > > + pf->page_addr,
> > > > atomic);
> > > > + else
> > > > + err = handle_vma_pagefault(tile, vma, atomic);
> > > >
> > > > unlock_vm:
> > > > if (!err)
> > > > diff --git a/drivers/gpu/drm/xe/xe_pt.c
> > > > b/drivers/gpu/drm/xe/xe_pt.c
> > > > index 39357e829b6d..282476c4edbd 100644
> > > > --- a/drivers/gpu/drm/xe/xe_pt.c
> > > > +++ b/drivers/gpu/drm/xe/xe_pt.c
> > > > @@ -20,6 +20,7 @@
> > > > #include "xe_res_cursor.h"
> > > > #include "xe_sched_job.h"
> > > > #include "xe_sync.h"
> > > > +#include "xe_svm.h"
> > > > #include "xe_trace.h"
> > > > #include "xe_ttm_stolen_mgr.h"
> > > > #include "xe_vm.h"
> > > > @@ -829,6 +830,29 @@ bool xe_pt_zap_ptes(struct xe_tile *tile,
> > > > struct
> > > > xe_vma *vma)
> > > > return xe_walk.needs_invalidate;
> > > > }
> > > >
> > > > +bool xe_pt_zap_ptes_range(struct xe_tile *tile, struct xe_vm
> > > > *vm,
> > > > + struct xe_svm_range *range)
> > >
> > > Kerneldoc.
> > >
> >
> > Will add.
> >
> > > Here, (and I saw Oak also commented around this some time ago)
> > > ideally
> > > we should make xe_pt.c unaware of vmas and svm ranges, and in this
> > > case, use the same xe_pt function for both.
> > >
> >
> > See some of other comments, agree we should do in a follow up.
> >
> > >
> > >
> > > > +{
> > > > + struct xe_pt_zap_ptes_walk xe_walk = {
> > > > + .base = {
> > > > + .ops = &xe_pt_zap_ptes_ops,
> > > > + .shifts = xe_normal_pt_shifts,
> > > > + .max_level = XE_PT_HIGHEST_LEVEL,
> > > > + },
> > > > + .tile = tile,
> > > > + };
> > > > + struct xe_pt *pt = vm->pt_root[tile->id];
> > > > + u8 pt_mask = (range->tile_present & ~range-
> > > > > tile_invalidated);
> > > > +
> > > > + if (!(pt_mask & BIT(tile->id)))
> > > > + return false;
> > > > +
> > > > + (void)xe_pt_walk_shared(&pt->base, pt->level, range-
> > > > > base.va.start,
> > > > + range->base.va.end,
> > > > &xe_walk.base);
> > > > +
> > > > + return xe_walk.needs_invalidate;
> > > > +}
> > > > +
> > > > static void
> > > > xe_vm_populate_pgtable(struct xe_migrate_pt_update *pt_update,
> > > > struct xe_tile *tile,
> > > > struct iosys_map *map, void *data,
> > > > diff --git a/drivers/gpu/drm/xe/xe_pt.h
> > > > b/drivers/gpu/drm/xe/xe_pt.h
> > > > index 9ab386431cad..5f333eeedf5c 100644
> > > > --- a/drivers/gpu/drm/xe/xe_pt.h
> > > > +++ b/drivers/gpu/drm/xe/xe_pt.h
> > > > @@ -13,6 +13,7 @@ struct dma_fence;
> > > > struct xe_bo;
> > > > struct xe_device;
> > > > struct xe_exec_queue;
> > > > +struct xe_svm_range;
> > > > struct xe_sync_entry;
> > > > struct xe_tile;
> > > > struct xe_vm;
> > > > @@ -42,5 +43,7 @@ void xe_pt_update_ops_fini(struct xe_tile
> > > > *tile,
> > > > struct xe_vma_ops *vops);
> > > > void xe_pt_update_ops_abort(struct xe_tile *tile, struct
> > > > xe_vma_ops
> > > > *vops);
> > > >
> > > > bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
> > > > +bool xe_pt_zap_ptes_range(struct xe_tile *tile, struct xe_vm
> > > > *vm,
> > > > + struct xe_svm_range *range);
> > > >
> > > > #endif
> > > > diff --git a/drivers/gpu/drm/xe/xe_svm.c
> > > > b/drivers/gpu/drm/xe/xe_svm.c
> > > > index 57b740367843..b2bc259978c4 100644
> > > > --- a/drivers/gpu/drm/xe/xe_svm.c
> > > > +++ b/drivers/gpu/drm/xe/xe_svm.c
> > > > @@ -5,18 +5,188 @@
> > > >
> > > > #include "drm_gpusvm.h"
> > > >
> > > > +#include "xe_gt_tlb_invalidation.h"
> > > > +#include "xe_pt.h"
> > > > #include "xe_svm.h"
> > > > #include "xe_vm.h"
> > > > #include "xe_vm_types.h"
> > > >
> > > > +static struct xe_vm *gpusvm_to_vm(struct drm_gpusvm *gpusvm)
> > > > +{
> > > > + return container_of(gpusvm, struct xe_vm, svm.gpusvm);
> > > > +}
> > > > +
> > > > +static struct xe_vm *range_to_vm(struct drm_gpusvm_range *r)
> > > > +{
> > > > + return gpusvm_to_vm(r->gpusvm);
> > > > +}
> > > > +
> > > > +static struct drm_gpusvm_range *
> > > > +xe_svm_range_alloc(struct drm_gpusvm *gpusvm)
> > > > +{
> > > > + struct xe_svm_range *range;
> > > > +
> > > > + range = kzalloc(sizeof(*range), GFP_KERNEL);
> > > > + if (!range)
> > > > + return ERR_PTR(-ENOMEM);
> > > > +
> > > > + xe_vm_get(gpusvm_to_vm(gpusvm));
> > > > +
> > > > + return &range->base;
> > > > +}
> > > > +
> > > > +static void xe_svm_range_free(struct drm_gpusvm_range *range)
> > > > +{
> > > > + xe_vm_put(range_to_vm(range));
> > > > + kfree(range);
> > > > +}
> > > > +
> > > > +static struct xe_svm_range *to_xe_range(struct drm_gpusvm_range
> > > > *r)
> > > > +{
> > > > + return container_of(r, struct xe_svm_range, base);
> > > > +}
> > > > +
> > > > +static u8
> > > > +xe_svm_range_notifier_event_begin(struct xe_vm *vm, struct
> > > > drm_gpusvm_range *r,
> > > > + const struct
> > > > mmu_notifier_range
> > > > *mmu_range,
> > > > + u64 *adj_start, u64 *adj_end)
> > > > +{
> > > > + struct xe_svm_range *range = to_xe_range(r);
> > > > + struct xe_device *xe = vm->xe;
> > > > + struct xe_tile *tile;
> > > > + u8 tile_mask = 0;
> > > > + u8 id;
> > > > +
> > >
> > > lockdep assert?
> > >
> >
> > Sure.
> >
> > > > + /* Skip if already unmapped or if no binding exist */
> > > > + if (range->base.flags.unmapped || !range->tile_present)
> > > > + return 0;
> > > > +
> > > > + /* Adjust invalidation to range boundaries */
> > > > + if (range->base.va.start < mmu_range->start)
> > > > + *adj_start = range->base.va.start;
> > > > + if (range->base.va.end > mmu_range->end)
> > > > + *adj_end = range->base.va.end;
> > > > +
> > > > + /*
> > > > + * XXX: Ideally would zap PTEs in one shot in
> > > > xe_svm_invalidate but the
> > > > + * invalidation code can't correctly cope with sparse
> > > > ranges
> > > > or
> > > > + * invalidations spanning multiple ranges.
> > > > + */
> > > > + for_each_tile(tile, xe, id)
> > > > + if (xe_pt_zap_ptes_range(tile, vm, range)) {
> > > > + tile_mask |= BIT(id);
> > > > + range->tile_invalidated |= BIT(id);
> > > > + }
> > > > +
> > > > + return tile_mask;
> > > > +}
> > > > +
> > > > +static void
> > > > +xe_svm_range_notifier_event_end(struct xe_vm *vm, struct
> > > > drm_gpusvm_range *r,
> > > > + const struct mmu_notifier_range
> > > > *mmu_range)
> > > > +{
> > > > + struct drm_gpusvm_ctx ctx = { .in_notifier = true, };
> > > > +
> > > > + drm_gpusvm_range_unmap_pages(&vm->svm.gpusvm, r, &ctx);
> > > > + /* TODO: Add range to garbage collector */
> > > > +}
> > > > +
> > > > static void xe_svm_invalidate(struct drm_gpusvm *gpusvm,
> > > > struct drm_gpusvm_notifier
> > > > *notifier,
> > > > const struct mmu_notifier_range
> > > > *mmu_range)
> > > > {
> > > > - /* TODO: Implement */
> > > > + struct xe_vm *vm = gpusvm_to_vm(gpusvm);
> > > > + struct xe_device *xe = vm->xe;
> > > > + struct xe_tile *tile;
> > > > + struct drm_gpusvm_range *r, *first;
> > > > + struct xe_gt_tlb_invalidation_fence
> > > > + fence[XE_MAX_TILES_PER_DEVICE *
> > > > XE_MAX_GT_PER_TILE];
> > > > + u64 adj_start = mmu_range->start, adj_end = mmu_range-
> > > > >end;
> > > > + u8 tile_mask = 0;
> > > > + u8 id;
> > > > + u32 fence_id = 0;
> > > > + long err;
> > > > +
> > > > + if (xe_vm_is_closed(vm))
> > > > + return;
> > >
> > > How do we ensure we don't race here? Are we sure that all dma
> > > mappings
> > > and all PTEs pointing to the range is gone at this point? Becase
> > > "They
> > > will soon be gone anyway" isn't enough.
> > >
> >
> > I think this is to prevent touching PTs which are being destroyed in
> > parallel which resulted in kernel explosion, so I think we need this.
>
> IIRC, the pt structure change is committed under the notifier lock when
> unbinding, which means that a racing pt zap shouldn't do any harm and
> just have the commit phase re-run?
>
It is destroying the PTs in xe_vm_close_and_put which can race with the
notifier.
> So if we follow the
> 1.) take vm->lock : waits for existing binding to complete
> 2.) mark vm closed : inhibits future binding
> 3.) unbind page-table
> 4.) Remove notifiers
>
> We should be ok?
>
Yea, I think you're missing a few locks, but probably something like you
describe works. I have already coded it like I describe below in my
previous reply though - the next rev is likely to be posted in the next
day or so. I think either works, but what you suggest is likely a bit of
a larger rework.
I'm really thinking we should completely rework xe_vm_close_and_put, as
it is a bit of a mess; we have just continually bolted things on over
time without a ton of deep thought about what we want the flow to look
like, rather than just getting stuff working and moving on. Would it be
ok for me to open a Jira for this rework? I can take ownership and do
the rework in a follow-on series.
Matt
> /Thomas
>
>
> >
> > How to prevent a race? How about on VM close we invalidate the PT
> > root?
> > I had patch at one point which did this. We'd still have dma mappings
> > too but I think if need to we can safely dma-unmap the pages if the
> > VM
> > is closed too. Thoughts?
> >
> > > > +
> > > > + /* Adjust invalidation to notifier boundaries */
> > > > + if (adj_start < notifier->interval.start)
> > > > + adj_start = notifier->interval.start;
> > > > + if (adj_end > notifier->interval.end)
> > > > + adj_end = notifier->interval.end;
> > > > +
> > > > + first = drm_gpusvm_range_find(notifier, adj_start,
> > > > adj_end);
> > > > + if (!first)
> > > > + return;
> > > > +
> > > > + /*
> > > > + * XXX: Less than ideal to always wait on VM's resv
> > > > slots if
> > > > an
> > > > + * invalidation is not required. Could walk range list
> > > > twice
> > > > to figure
> > > > + * out if an invalidations is need, but also not ideal.
> > > > Maybe a counter
> > > > + * within the notifier, seems like that could work.
> > > > + */
> > > > + err = dma_resv_wait_timeout(xe_vm_resv(vm),
> > > > + DMA_RESV_USAGE_BOOKKEEP,
> > > > + false,
> > > > MAX_SCHEDULE_TIMEOUT);
> > > > + XE_WARN_ON(err <= 0);
> > > > +
> > > > + r = first;
> > > > + drm_gpusvm_for_each_range(r, notifier, adj_start,
> > > > adj_end)
> > > > + tile_mask |=
> > > > xe_svm_range_notifier_event_begin(vm,
> > > > r, mmu_range,
> > > > +
> > > > &adj_start,
> > > > +
> > > > &adj_end);
> > > > + if (!tile_mask)
> > > > + goto range_notifier_event_end;
> > > > +
> > > > + xe_device_wmb(xe);
> > > > +
> > > > + for_each_tile(tile, xe, id) {
> > > > + if (tile_mask & BIT(id)) {
> > > > + int err;
> > > > +
> > > > + xe_gt_tlb_invalidation_fence_init(tile-
> > > > > primary_gt,
> > > > +
> > > > &fence[fence_id], true);
> > > > +
> > > > + err = xe_gt_tlb_invalidation_range(tile-
> > > > > primary_gt,
> > > > +
> > > > &fence[fence_id],
> > > > +
> > > > adj_start,
> > > > +
> > > > adj_end,
> > > > + vm-
> > > > > usm.asid);
> > > > + if (WARN_ON_ONCE(err < 0))
> > > > + goto wait;
> > > > + ++fence_id;
> > > > +
> > > > + if (!tile->media_gt)
> > > > + continue;
> > > > +
> > > > + xe_gt_tlb_invalidation_fence_init(tile-
> > > > > media_gt,
> > > > +
> > > > &fence[fence_id], true);
> > > > +
> > > > + err = xe_gt_tlb_invalidation_range(tile-
> > > > > media_gt,
> > > > +
> > > > &fence[fence_id],
> > > > +
> > > > adj_start,
> > > > +
> > > > adj_end,
> > > > + vm-
> > > > > usm.asid);
> > > > + if (WARN_ON_ONCE(err < 0))
> > > > + goto wait;
> > > > + ++fence_id;
> > > > + }
> > > > + }
> > > > +
> > > > +wait:
> > > > + for (id = 0; id < fence_id; ++id)
> > > > + xe_gt_tlb_invalidation_fence_wait(&fence[id]);
> > > > +
> > > > +range_notifier_event_end:
> > > > + r = first;
> > > > + drm_gpusvm_for_each_range(r, notifier, adj_start,
> > > > adj_end)
> > > > + xe_svm_range_notifier_event_end(vm, r,
> > > > mmu_range);
> > > > }
> > > >
> > > > static const struct drm_gpusvm_ops gpusvm_ops = {
> > > > + .range_alloc = xe_svm_range_alloc,
> > > > + .range_free = xe_svm_range_free,
> > > > .invalidate = xe_svm_invalidate,
> > > > };
> > > >
> > > > @@ -36,6 +206,11 @@ int xe_svm_init(struct xe_vm *vm)
> > > >
> > > > void xe_svm_close(struct xe_vm *vm)
> > > > {
> > > > + xe_assert(vm->xe, xe_vm_is_closed(vm));
> > > > +
> > > > + /* Flush running notifiers making xe_vm_close() visable
> > > > */
> > > > + drm_gpusvm_notifier_lock(&vm->svm.gpusvm);
> > > > + drm_gpusvm_notifier_unlock(&vm->svm.gpusvm);
> > >
> > > Calling mmu_notifier_read_begin() ensures that nothing is
> > > invalidating
> > > on the range. Probably a better choice.
> > >
> >
> > We'd have to call that on every notifier rather than just cycle the
> > lock, so with that I'd prefer to leave it as is.
> >
> > > > }
> > > >
> > > > void xe_svm_fini(struct xe_vm *vm)
> > > > @@ -44,3 +219,31 @@ void xe_svm_fini(struct xe_vm *vm)
> > > >
> > > > drm_gpusvm_fini(&vm->svm.gpusvm);
> > > > }
> > > > +
> > > > +int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma
> > > > *vma,
> > > > + struct xe_tile *tile, u64
> > > > fault_addr,
> > > > + bool atomic)
> > > > +{
> > > > + struct drm_gpusvm_ctx ctx = { .read_only =
> > > > xe_vma_read_only(vma), };
> > > > + struct drm_gpusvm_range *r;
> > > > + int err;
> > > > +
> > > > + lockdep_assert_held_write(&vm->lock);
> > > > +
> > > > +retry:
> > > > + /* TODO: Run garbage collector */
> > > > +
> > > > + r = drm_gpusvm_range_find_or_insert(&vm->svm.gpusvm,
> > > > fault_addr,
> > > > + xe_vma_start(vma),
> > > > xe_vma_end(vma),
> > > > + &ctx);
> > > > + if (IS_ERR(r))
> > > > + return PTR_ERR(r);
> > > > +
> > > > + err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r,
> > > > false);
> > > > + if (err == -EFAULT || err == -EPERM) /* Corner where
> > > > CPU
> > > > mappings have change */
> > >
> > > s/change/changed/
> > >
> >
> > Yep.
> >
> > > > + goto retry;
> > > > +
> > > > + /* TODO: Issue bind */
> > > > +
> > > > + return err;
> > > > +}
> > > > diff --git a/drivers/gpu/drm/xe/xe_svm.h
> > > > b/drivers/gpu/drm/xe/xe_svm.h
> > > > index 376e86876a11..c91c5f538024 100644
> > > > --- a/drivers/gpu/drm/xe/xe_svm.h
> > > > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > > > @@ -6,14 +6,27 @@
> > > > #ifndef _XE_SVM_H_
> > > > #define _XE_SVM_H_
> > > >
> > > > +#include "drm_gpusvm.h"
> > > > #include "drm_pagemap.h"
> > > >
> > > > #define XE_INTERCONNECT_VRAM DRM_INTERCONNECT_DRIVER
> > >
> > > Not used yet
> > >
> >
> > Will remove.
> >
> > > >
> > > > +struct xe_tile;
> > > > struct xe_vm;
> > > > +struct xe_vma;
> > > > +
> > > > +struct xe_svm_range {
> > > > + struct drm_gpusvm_range base;
> > > > + u8 tile_present;
> > > > + u8 tile_invalidated;
> > > > +};
> > >
> > > Kerneldoc
> > >
> >
> > Will add.
> >
> > >
> > > >
> > > > int xe_svm_init(struct xe_vm *vm);
> > > > void xe_svm_fini(struct xe_vm *vm);
> > > > void xe_svm_close(struct xe_vm *vm);
> > > >
> > > > +int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma
> > > > *vma,
> > > > + struct xe_tile *tile, u64
> > > > fault_addr,
> > > > + bool atomic);
> > > > +
> > > > #endif
> > >
> > > Thanks,
> >
> > Thanks,
> > Matt
> >
> > > Thomas
> > >
>
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [PATCH v2 09/29] drm/xe: Add SVM range invalidation
2024-12-16 16:09 ` Matthew Brost
@ 2024-12-16 17:35 ` Thomas Hellström
0 siblings, 0 replies; 129+ messages in thread
From: Thomas Hellström @ 2024-12-16 17:35 UTC (permalink / raw)
To: Matthew Brost
Cc: intel-xe, dri-devel, apopple, airlied, christian.koenig,
simona.vetter, felix.kuehling, dakr
On Mon, 2024-12-16 at 08:09 -0800, Matthew Brost wrote:
> On Mon, Dec 16, 2024 at 11:01:23AM +0100, Thomas Hellström wrote:
> > On Wed, 2024-12-11 at 11:01 -0800, Matthew Brost wrote:
> > > On Tue, Nov 19, 2024 at 02:56:12PM +0100, Thomas Hellström wrote:
> > > > On Tue, 2024-10-15 at 20:24 -0700, Matthew Brost wrote:
> > > > > Add SVM range invalidation vfunc.
> > > > >
> > > > > v2:
> > > > > - Don't run invalidation if VM is closed
> > > > > - Cycle notifier lock in xe_svm_close
> > > > > - Drop xe_gt_tlb_invalidation_fence_fini
> > > > >
> > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > > ---
> > > > > drivers/gpu/drm/xe/xe_gt_pagefault.c | 17 ++-
> > > > > drivers/gpu/drm/xe/xe_pt.c | 24 ++++
> > > > > drivers/gpu/drm/xe/xe_pt.h | 3 +
> > > > > drivers/gpu/drm/xe/xe_svm.c | 205
> > > > > ++++++++++++++++++++++++++-
> > > > > drivers/gpu/drm/xe/xe_svm.h | 13 ++
> > > > > 5 files changed, 256 insertions(+), 6 deletions(-)
> > > > >
> > > > > diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c
> > > > > b/drivers/gpu/drm/xe/xe_gt_pagefault.c
> > > > > index 79c426dc2505..92923947a12c 100644
> > > > > --- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
> > > > > +++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
> > > > > @@ -19,6 +19,7 @@
> > > > > #include "xe_guc.h"
> > > > > #include "xe_guc_ct.h"
> > > > > #include "xe_migrate.h"
> > > > > +#include "xe_svm.h"
> > > > > #include "xe_trace_bo.h"
> > > > > #include "xe_vm.h"
> > > > >
> > > > > @@ -125,18 +126,17 @@ static int xe_pf_begin(struct drm_exec
> > > > > *exec,
> > > > > struct xe_vma *vma,
> > > > > return 0;
> > > > > }
> > > > >
> > > > > -static int handle_vma_pagefault(struct xe_tile *tile, struct
> > > > > pagefault *pf,
> > > > > - struct xe_vma *vma)
> > > > > +static int handle_vma_pagefault(struct xe_tile *tile, struct
> > > > > xe_vma
> > > > > *vma,
> > > > > + bool atomic)
> > > > > {
> > > > > struct xe_vm *vm = xe_vma_vm(vma);
> > > > > struct drm_exec exec;
> > > > > struct dma_fence *fence;
> > > > > ktime_t end = 0;
> > > > > int err;
> > > > > - bool atomic;
> > > > >
> > > > > + lockdep_assert_held_write(&vm->lock);
> > > > > trace_xe_vma_pagefault(vma);
> > > > > - atomic = access_is_atomic(pf->access_type);
> > > > >
> > > > > /* Check if VMA is valid */
> > > > > if (vma_is_valid(tile, vma) && !atomic)
> > > > > @@ -207,6 +207,7 @@ static int handle_pagefault(struct xe_gt
> > > > > *gt,
> > > > > struct pagefault *pf)
> > > > > struct xe_vm *vm;
> > > > > struct xe_vma *vma = NULL;
> > > > > int err;
> > > > > + bool atomic;
> > > > >
> > > > > /* SW isn't expected to handle TRTT faults */
> > > > > if (pf->trva_fault)
> > > > > @@ -232,7 +233,13 @@ static int handle_pagefault(struct xe_gt
> > > > > *gt,
> > > > > struct pagefault *pf)
> > > > > goto unlock_vm;
> > > > > }
> > > > >
> > > > > - err = handle_vma_pagefault(tile, pf, vma);
> > > > > + atomic = access_is_atomic(pf->access_type);
> > > > > +
> > > > > + if (xe_vma_is_system_allocator(vma))
> > > > > + err = xe_svm_handle_pagefault(vm, vma, tile,
> > > > > + pf->page_addr,
> > > > > atomic);
> > > > > + else
> > > > > + err = handle_vma_pagefault(tile, vma,
> > > > > atomic);
> > > > >
> > > > > unlock_vm:
> > > > > if (!err)
> > > > > diff --git a/drivers/gpu/drm/xe/xe_pt.c
> > > > > b/drivers/gpu/drm/xe/xe_pt.c
> > > > > index 39357e829b6d..282476c4edbd 100644
> > > > > --- a/drivers/gpu/drm/xe/xe_pt.c
> > > > > +++ b/drivers/gpu/drm/xe/xe_pt.c
> > > > > @@ -20,6 +20,7 @@
> > > > > #include "xe_res_cursor.h"
> > > > > #include "xe_sched_job.h"
> > > > > #include "xe_sync.h"
> > > > > +#include "xe_svm.h"
> > > > > #include "xe_trace.h"
> > > > > #include "xe_ttm_stolen_mgr.h"
> > > > > #include "xe_vm.h"
> > > > > @@ -829,6 +830,29 @@ bool xe_pt_zap_ptes(struct xe_tile
> > > > > *tile,
> > > > > struct
> > > > > xe_vma *vma)
> > > > > return xe_walk.needs_invalidate;
> > > > > }
> > > > >
> > > > > +bool xe_pt_zap_ptes_range(struct xe_tile *tile, struct xe_vm
> > > > > *vm,
> > > > > + struct xe_svm_range *range)
> > > >
> > > > Kerneldoc.
> > > >
> > >
> > > Will add.
> > >
> > > > Here, (and I saw Oak also commented around this some time ago)
> > > > ideally
> > > > we should make xe_pt.c unaware of vmas and svm ranges, and in
> > > > this
> > > > case, use the same xe_pt function for both.
> > > >
> > >
> > > See some of other comments, agree we should do in a follow up.
> > >
> > > >
> > > >
> > > > > +{
> > > > > + struct xe_pt_zap_ptes_walk xe_walk = {
> > > > > + .base = {
> > > > > + .ops = &xe_pt_zap_ptes_ops,
> > > > > + .shifts = xe_normal_pt_shifts,
> > > > > + .max_level = XE_PT_HIGHEST_LEVEL,
> > > > > + },
> > > > > + .tile = tile,
> > > > > + };
> > > > > + struct xe_pt *pt = vm->pt_root[tile->id];
> > > > > + u8 pt_mask = (range->tile_present & ~range-
> > > > > > tile_invalidated);
> > > > > +
> > > > > + if (!(pt_mask & BIT(tile->id)))
> > > > > + return false;
> > > > > +
> > > > > + (void)xe_pt_walk_shared(&pt->base, pt->level, range-
> > > > > > base.va.start,
> > > > > + range->base.va.end,
> > > > > &xe_walk.base);
> > > > > +
> > > > > + return xe_walk.needs_invalidate;
> > > > > +}
> > > > > +
> > > > > static void
> > > > > xe_vm_populate_pgtable(struct xe_migrate_pt_update
> > > > > *pt_update,
> > > > > struct xe_tile *tile,
> > > > > struct iosys_map *map, void *data,
> > > > > diff --git a/drivers/gpu/drm/xe/xe_pt.h
> > > > > b/drivers/gpu/drm/xe/xe_pt.h
> > > > > index 9ab386431cad..5f333eeedf5c 100644
> > > > > --- a/drivers/gpu/drm/xe/xe_pt.h
> > > > > +++ b/drivers/gpu/drm/xe/xe_pt.h
> > > > > @@ -13,6 +13,7 @@ struct dma_fence;
> > > > > struct xe_bo;
> > > > > struct xe_device;
> > > > > struct xe_exec_queue;
> > > > > +struct xe_svm_range;
> > > > > struct xe_sync_entry;
> > > > > struct xe_tile;
> > > > > struct xe_vm;
> > > > > @@ -42,5 +43,7 @@ void xe_pt_update_ops_fini(struct xe_tile
> > > > > *tile,
> > > > > struct xe_vma_ops *vops);
> > > > > void xe_pt_update_ops_abort(struct xe_tile *tile, struct
> > > > > xe_vma_ops
> > > > > *vops);
> > > > >
> > > > > bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma
> > > > > *vma);
> > > > > +bool xe_pt_zap_ptes_range(struct xe_tile *tile, struct xe_vm
> > > > > *vm,
> > > > > + struct xe_svm_range *range);
> > > > >
> > > > > #endif
> > > > > diff --git a/drivers/gpu/drm/xe/xe_svm.c
> > > > > b/drivers/gpu/drm/xe/xe_svm.c
> > > > > index 57b740367843..b2bc259978c4 100644
> > > > > --- a/drivers/gpu/drm/xe/xe_svm.c
> > > > > +++ b/drivers/gpu/drm/xe/xe_svm.c
> > > > > @@ -5,18 +5,188 @@
> > > > >
> > > > > #include "drm_gpusvm.h"
> > > > >
> > > > > +#include "xe_gt_tlb_invalidation.h"
> > > > > +#include "xe_pt.h"
> > > > > #include "xe_svm.h"
> > > > > #include "xe_vm.h"
> > > > > #include "xe_vm_types.h"
> > > > >
> > > > > +static struct xe_vm *gpusvm_to_vm(struct drm_gpusvm *gpusvm)
> > > > > +{
> > > > > + return container_of(gpusvm, struct xe_vm,
> > > > > svm.gpusvm);
> > > > > +}
> > > > > +
> > > > > +static struct xe_vm *range_to_vm(struct drm_gpusvm_range *r)
> > > > > +{
> > > > > + return gpusvm_to_vm(r->gpusvm);
> > > > > +}
> > > > > +
> > > > > +static struct drm_gpusvm_range *
> > > > > +xe_svm_range_alloc(struct drm_gpusvm *gpusvm)
> > > > > +{
> > > > > + struct xe_svm_range *range;
> > > > > +
> > > > > + range = kzalloc(sizeof(*range), GFP_KERNEL);
> > > > > + if (!range)
> > > > > + return ERR_PTR(-ENOMEM);
> > > > > +
> > > > > + xe_vm_get(gpusvm_to_vm(gpusvm));
> > > > > +
> > > > > + return &range->base;
> > > > > +}
> > > > > +
> > > > > +static void xe_svm_range_free(struct drm_gpusvm_range
> > > > > *range)
> > > > > +{
> > > > > + xe_vm_put(range_to_vm(range));
> > > > > + kfree(range);
> > > > > +}
> > > > > +
> > > > > +static struct xe_svm_range *to_xe_range(struct
> > > > > drm_gpusvm_range
> > > > > *r)
> > > > > +{
> > > > > + return container_of(r, struct xe_svm_range, base);
> > > > > +}
> > > > > +
> > > > > +static u8
> > > > > +xe_svm_range_notifier_event_begin(struct xe_vm *vm, struct
> > > > > drm_gpusvm_range *r,
> > > > > + const struct
> > > > > mmu_notifier_range
> > > > > *mmu_range,
> > > > > + u64 *adj_start, u64
> > > > > *adj_end)
> > > > > +{
> > > > > + struct xe_svm_range *range = to_xe_range(r);
> > > > > + struct xe_device *xe = vm->xe;
> > > > > + struct xe_tile *tile;
> > > > > + u8 tile_mask = 0;
> > > > > + u8 id;
> > > > > +
> > > >
> > > > lockdep assert?
> > > >
> > >
> > > Sure.
> > >
> > > > > + /* Skip if already unmapped or if no binding exist
> > > > > */
> > > > > + if (range->base.flags.unmapped || !range-
> > > > > >tile_present)
> > > > > + return 0;
> > > > > +
> > > > > + /* Adjust invalidation to range boundaries */
> > > > > + if (range->base.va.start < mmu_range->start)
> > > > > + *adj_start = range->base.va.start;
> > > > > + if (range->base.va.end > mmu_range->end)
> > > > > + *adj_end = range->base.va.end;
> > > > > +
> > > > > + /*
> > > > > + * XXX: Ideally would zap PTEs in one shot in
> > > > > xe_svm_invalidate but the
> > > > > + * invalidation code can't correctly cope with
> > > > > sparse
> > > > > ranges
> > > > > or
> > > > > + * invalidations spanning multiple ranges.
> > > > > + */
> > > > > + for_each_tile(tile, xe, id)
> > > > > + if (xe_pt_zap_ptes_range(tile, vm, range)) {
> > > > > + tile_mask |= BIT(id);
> > > > > + range->tile_invalidated |= BIT(id);
> > > > > + }
> > > > > +
> > > > > + return tile_mask;
> > > > > +}
> > > > > +
> > > > > +static void
> > > > > +xe_svm_range_notifier_event_end(struct xe_vm *vm, struct
> > > > > drm_gpusvm_range *r,
> > > > > + const struct
> > > > > mmu_notifier_range
> > > > > *mmu_range)
> > > > > +{
> > > > > + struct drm_gpusvm_ctx ctx = { .in_notifier = true,
> > > > > };
> > > > > +
> > > > > + drm_gpusvm_range_unmap_pages(&vm->svm.gpusvm, r,
> > > > > &ctx);
> > > > > + /* TODO: Add range to garbage collector */
> > > > > +}
> > > > > +
> > > > > static void xe_svm_invalidate(struct drm_gpusvm *gpusvm,
> > > > > struct drm_gpusvm_notifier
> > > > > *notifier,
> > > > > const struct
> > > > > mmu_notifier_range
> > > > > *mmu_range)
> > > > > {
> > > > > - /* TODO: Implement */
> > > > > + struct xe_vm *vm = gpusvm_to_vm(gpusvm);
> > > > > + struct xe_device *xe = vm->xe;
> > > > > + struct xe_tile *tile;
> > > > > + struct drm_gpusvm_range *r, *first;
> > > > > + struct xe_gt_tlb_invalidation_fence
> > > > > + fence[XE_MAX_TILES_PER_DEVICE *
> > > > > XE_MAX_GT_PER_TILE];
> > > > > + u64 adj_start = mmu_range->start, adj_end =
> > > > > mmu_range-
> > > > > > end;
> > > > > + u8 tile_mask = 0;
> > > > > + u8 id;
> > > > > + u32 fence_id = 0;
> > > > > + long err;
> > > > > +
> > > > > + if (xe_vm_is_closed(vm))
> > > > > + return;
> > > >
> > > > How do we ensure we don't race here? Are we sure that all dma
> > > > mappings
> > > > and all PTEs pointing to the range is gone at this point?
> > > > Becase
> > > > "They
> > > > will soon be gone anyway" isn't enough.
> > > >
> > >
> > > I think this is to prevent touching PTs which are being destroyed
> > > in
> > > parallel which resulted in kernel explosion, so I think we need
> > > this.
> >
> > IIRC, the pt structure change is committed under the notifier lock
> > when
> > unbinding, which means that a racing pt zap shouldn't do any harm
> > and
> > just have the commit phase re-run?
> >
>
> It is destroying the PTs in xe_vm_close_and_put which can race with
> the
> notifier.
>
> > So if we follow the
> > 1.) take vm->lock : waits for existing binding to complete
> > 2.) mark vm closed : inhibits future binding
> > 3.) unbind page-table
> > 4.) Remove notifiers
> >
> > We should be ok?
> >
>
> Yea, I think you're missing a few locks, but probably something like
> you describe works. I have already coded it like I describe below in
> my previous reply though - the next rev is likely to be posted in the
> next day or so. I think either works, but what you suggest is likely a
> bit of a larger rework.
>
> I'm really thinking we should completely rework xe_vm_close_and_put,
> as it is a bit of a mess; we have just continually bolted things on
> over time without a ton of deep thought about what we want the flow to
> look like, rather than just getting stuff working and moving on. Would
> it be ok for me to open a Jira for this rework? I can take ownership
> and do the rework in a follow-on series.
Sure, I think that makes sense. Of immediate concern, though, is GPU
page access if we disable the notifiers, but as you say, if we
invalidate the pt root and mark the vm closed under the same vm lock, I
agree we should be good.
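Something like the below is what I'm picturing, just as a rough sketch
(xe_vm_set_closed() and xe_pt_invalidate_root() are placeholders for
whatever helpers this ends up using, not existing functions):
	down_write(&vm->lock);		/* 1.) waits for in-flight binds */
	xe_vm_set_closed(vm);		/* 2.) inhibits future binds/faults */
	xe_pt_invalidate_root(vm);	/* 3.) zap the PT root so the GPU
					 * stops walking the tables */
	up_write(&vm->lock);
	/*
	 * 4.) Cycle the notifier lock so the closed state is visible to any
	 * racing invalidation before the page tables are actually destroyed.
	 */
	xe_svm_notifier_lock(vm);
	xe_svm_notifier_unlock(vm);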
Thanks,
Thomas
>
> Matt
>
> > /Thomas
> >
> >
> > >
> > > How to prevent a race? How about on VM close we invalidate the PT
> > > root?
> > > I had patch at one point which did this. We'd still have dma
> > > mappings
> > > too but I think if need to we can safely dma-unmap the pages if
> > > the
> > > VM
> > > is closed too. Thoughts?
> > >
> > > > > +
> > > > > + /* Adjust invalidation to notifier boundaries */
> > > > > + if (adj_start < notifier->interval.start)
> > > > > + adj_start = notifier->interval.start;
> > > > > + if (adj_end > notifier->interval.end)
> > > > > + adj_end = notifier->interval.end;
> > > > > +
> > > > > + first = drm_gpusvm_range_find(notifier, adj_start,
> > > > > adj_end);
> > > > > + if (!first)
> > > > > + return;
> > > > > +
> > > > > + /*
> > > > > + * XXX: Less than ideal to always wait on VM's resv
> > > > > slots if
> > > > > an
> > > > > + * invalidation is not required. Could walk range
> > > > > list
> > > > > twice
> > > > > to figure
> > > > > + * out if an invalidations is need, but also not
> > > > > ideal.
> > > > > Maybe a counter
> > > > > + * within the notifier, seems like that could work.
> > > > > + */
> > > > > + err = dma_resv_wait_timeout(xe_vm_resv(vm),
> > > > > + DMA_RESV_USAGE_BOOKKEEP,
> > > > > + false,
> > > > > MAX_SCHEDULE_TIMEOUT);
> > > > > + XE_WARN_ON(err <= 0);
> > > > > +
> > > > > + r = first;
> > > > > + drm_gpusvm_for_each_range(r, notifier, adj_start,
> > > > > adj_end)
> > > > > + tile_mask |=
> > > > > xe_svm_range_notifier_event_begin(vm,
> > > > > r, mmu_range,
> > > > > +
> > > > >
> > > > > &adj_start,
> > > > > +
> > > > >
> > > > > &adj_end);
> > > > > + if (!tile_mask)
> > > > > + goto range_notifier_event_end;
> > > > > +
> > > > > + xe_device_wmb(xe);
> > > > > +
> > > > > + for_each_tile(tile, xe, id) {
> > > > > + if (tile_mask & BIT(id)) {
> > > > > + int err;
> > > > > +
> > > > > + xe_gt_tlb_invalidation_fence_init(ti
> > > > > le-
> > > > > > primary_gt,
> > > > > +
> > > > > &fence[fence_id], true);
> > > > > +
> > > > > + err =
> > > > > xe_gt_tlb_invalidation_range(tile-
> > > > > > primary_gt,
> > > > > +
> > > > > &fence[fence_id],
> > > > > +
> > > > > adj_start,
> > > > > +
> > > > > adj_end,
> > > > > +
> > > > > vm-
> > > > > > usm.asid);
> > > > > + if (WARN_ON_ONCE(err < 0))
> > > > > + goto wait;
> > > > > + ++fence_id;
> > > > > +
> > > > > + if (!tile->media_gt)
> > > > > + continue;
> > > > > +
> > > > > + xe_gt_tlb_invalidation_fence_init(ti
> > > > > le-
> > > > > > media_gt,
> > > > > +
> > > > > &fence[fence_id], true);
> > > > > +
> > > > > + err =
> > > > > xe_gt_tlb_invalidation_range(tile-
> > > > > > media_gt,
> > > > > +
> > > > > &fence[fence_id],
> > > > > +
> > > > > adj_start,
> > > > > +
> > > > > adj_end,
> > > > > +
> > > > > vm-
> > > > > > usm.asid);
> > > > > + if (WARN_ON_ONCE(err < 0))
> > > > > + goto wait;
> > > > > + ++fence_id;
> > > > > + }
> > > > > + }
> > > > > +
> > > > > +wait:
> > > > > + for (id = 0; id < fence_id; ++id)
> > > > > + xe_gt_tlb_invalidation_fence_wait(&fence[id]
> > > > > );
> > > > > +
> > > > > +range_notifier_event_end:
> > > > > + r = first;
> > > > > + drm_gpusvm_for_each_range(r, notifier, adj_start,
> > > > > adj_end)
> > > > > + xe_svm_range_notifier_event_end(vm, r,
> > > > > mmu_range);
> > > > > }
> > > > >
> > > > > static const struct drm_gpusvm_ops gpusvm_ops = {
> > > > > + .range_alloc = xe_svm_range_alloc,
> > > > > + .range_free = xe_svm_range_free,
> > > > > .invalidate = xe_svm_invalidate,
> > > > > };
> > > > >
> > > > > @@ -36,6 +206,11 @@ int xe_svm_init(struct xe_vm *vm)
> > > > >
> > > > > void xe_svm_close(struct xe_vm *vm)
> > > > > {
> > > > > + xe_assert(vm->xe, xe_vm_is_closed(vm));
> > > > > +
> > > > > + /* Flush running notifiers making xe_vm_close()
> > > > > visable
> > > > > */
> > > > > + drm_gpusvm_notifier_lock(&vm->svm.gpusvm);
> > > > > + drm_gpusvm_notifier_unlock(&vm->svm.gpusvm);
> > > >
> > > > Calling mmu_notifier_read_begin() ensures that nothing is
> > > > invalidating
> > > > on the range. Probably a better choice.
> > > >
> > >
> > > We'd have to call that on every notifier rather than just cycle
> > > the
> > > lock, so with that I'd prefer to leave it as is.
> > >
> > > > > }
> > > > >
> > > > > void xe_svm_fini(struct xe_vm *vm)
> > > > > @@ -44,3 +219,31 @@ void xe_svm_fini(struct xe_vm *vm)
> > > > >
> > > > > drm_gpusvm_fini(&vm->svm.gpusvm);
> > > > > }
> > > > > +
> > > > > +int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma
> > > > > *vma,
> > > > > + struct xe_tile *tile, u64
> > > > > fault_addr,
> > > > > + bool atomic)
> > > > > +{
> > > > > + struct drm_gpusvm_ctx ctx = { .read_only =
> > > > > xe_vma_read_only(vma), };
> > > > > + struct drm_gpusvm_range *r;
> > > > > + int err;
> > > > > +
> > > > > + lockdep_assert_held_write(&vm->lock);
> > > > > +
> > > > > +retry:
> > > > > + /* TODO: Run garbage collector */
> > > > > +
> > > > > + r = drm_gpusvm_range_find_or_insert(&vm->svm.gpusvm,
> > > > > fault_addr,
> > > > > +
> > > > > xe_vma_start(vma),
> > > > > xe_vma_end(vma),
> > > > > + &ctx);
> > > > > + if (IS_ERR(r))
> > > > > + return PTR_ERR(r);
> > > > > +
> > > > > + err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r,
> > > > > false);
> > > > > + if (err == -EFAULT || err == -EPERM) /* Corner
> > > > > where
> > > > > CPU
> > > > > mappings have change */
> > > >
> > > > s/change/changed/
> > > >
> > >
> > > Yep.
> > >
> > > > > + goto retry;
> > > > > +
> > > > > + /* TODO: Issue bind */
> > > > > +
> > > > > + return err;
> > > > > +}
> > > > > diff --git a/drivers/gpu/drm/xe/xe_svm.h
> > > > > b/drivers/gpu/drm/xe/xe_svm.h
> > > > > index 376e86876a11..c91c5f538024 100644
> > > > > --- a/drivers/gpu/drm/xe/xe_svm.h
> > > > > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > > > > @@ -6,14 +6,27 @@
> > > > > #ifndef _XE_SVM_H_
> > > > > #define _XE_SVM_H_
> > > > >
> > > > > +#include "drm_gpusvm.h"
> > > > > #include "drm_pagemap.h"
> > > > >
> > > > > #define XE_INTERCONNECT_VRAM DRM_INTERCONNECT_DRIVER
> > > >
> > > > Not used yet
> > > >
> > >
> > > Will remove.
> > >
> > > > >
> > > > > +struct xe_tile;
> > > > > struct xe_vm;
> > > > > +struct xe_vma;
> > > > > +
> > > > > +struct xe_svm_range {
> > > > > + struct drm_gpusvm_range base;
> > > > > + u8 tile_present;
> > > > > + u8 tile_invalidated;
> > > > > +};
> > > >
> > > > Kerneldoc
> > > >
> > >
> > > Will add.
> > >
> > > >
> > > > >
> > > > > int xe_svm_init(struct xe_vm *vm);
> > > > > void xe_svm_fini(struct xe_vm *vm);
> > > > > void xe_svm_close(struct xe_vm *vm);
> > > > >
> > > > > +int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma
> > > > > *vma,
> > > > > + struct xe_tile *tile, u64
> > > > > fault_addr,
> > > > > + bool atomic);
> > > > > +
> > > > > #endif
> > > >
> > > > Thanks,
> > >
> > > Thanks,
> > > Matt
> > >
> > > > Thomas
> > > >
> >
^ permalink raw reply [flat|nested] 129+ messages in thread
* [PATCH v2 10/29] drm/gpuvm: Add DRM_GPUVA_OP_USER
2024-10-16 3:24 [PATCH v2 00/29] Introduce GPU SVM and Xe SVM implementation Matthew Brost
` (8 preceding siblings ...)
2024-10-16 3:24 ` [PATCH v2 09/29] drm/xe: Add SVM range invalidation Matthew Brost
@ 2024-10-16 3:24 ` Matthew Brost
2024-11-19 13:57 ` Thomas Hellström
2024-10-16 3:25 ` [PATCH v2 11/29] drm/xe: Add (re)bind to SVM page fault handler Matthew Brost
` (21 subsequent siblings)
31 siblings, 1 reply; 129+ messages in thread
From: Matthew Brost @ 2024-10-16 3:24 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
Add DRM_GPUVA_OP_USER, which allows drivers to define their own gpuvm ops.
Cc: Danilo Krummrich <dakr@redhat.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
include/drm/drm_gpuvm.h | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/include/drm/drm_gpuvm.h b/include/drm/drm_gpuvm.h
index 00d4e43b76b6..cc3f8ed5113b 100644
--- a/include/drm/drm_gpuvm.h
+++ b/include/drm/drm_gpuvm.h
@@ -812,6 +812,11 @@ enum drm_gpuva_op_type {
* @DRM_GPUVA_OP_PREFETCH: the prefetch op type
*/
DRM_GPUVA_OP_PREFETCH,
+
+ /**
+ * @DRM_GPUVA_OP_USER: the user defined op type
+ */
+ DRM_GPUVA_OP_USER,
};
/**
--
2.34.1
^ permalink raw reply related [flat|nested] 129+ messages in thread
* Re: [PATCH v2 10/29] drm/gpuvm: Add DRM_GPUVA_OP_USER
2024-10-16 3:24 ` [PATCH v2 10/29] drm/gpuvm: Add DRM_GPUVA_OP_USER Matthew Brost
@ 2024-11-19 13:57 ` Thomas Hellström
2024-11-19 16:26 ` Matthew Brost
0 siblings, 1 reply; 129+ messages in thread
From: Thomas Hellström @ 2024-11-19 13:57 UTC (permalink / raw)
To: Matthew Brost, intel-xe, dri-devel, Danilo Krummrich
Cc: apopple, airlied, christian.koenig, simona.vetter, felix.kuehling,
dakr
On Tue, 2024-10-15 at 20:24 -0700, Matthew Brost wrote:
> Add DRM_GPUVA_OP_USER, which allows drivers to define their own gpuvm
> ops.
>
> Cc: Danilo Krummrich <dakr@redhat.com>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
> include/drm/drm_gpuvm.h | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/include/drm/drm_gpuvm.h b/include/drm/drm_gpuvm.h
> index 00d4e43b76b6..cc3f8ed5113b 100644
> --- a/include/drm/drm_gpuvm.h
> +++ b/include/drm/drm_gpuvm.h
> @@ -812,6 +812,11 @@ enum drm_gpuva_op_type {
> * @DRM_GPUVA_OP_PREFETCH: the prefetch op type
> */
> DRM_GPUVA_OP_PREFETCH,
> +
> + /**
> + * @DRM_GPUVA_OP_USER: the user defined op type
> + */
> + DRM_GPUVA_OP_USER,
Perhaps _OP_DRIVER, but Danilo might want to chime in.
Otherwise LGTM.
Thanks,
Thomas
> };
>
> /**
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [PATCH v2 10/29] drm/gpuvm: Add DRM_GPUVA_OP_USER
2024-11-19 13:57 ` Thomas Hellström
@ 2024-11-19 16:26 ` Matthew Brost
0 siblings, 0 replies; 129+ messages in thread
From: Matthew Brost @ 2024-11-19 16:26 UTC (permalink / raw)
To: Thomas Hellström
Cc: intel-xe, dri-devel, Danilo Krummrich, apopple, airlied,
christian.koenig, simona.vetter, felix.kuehling, dakr
On Tue, Nov 19, 2024 at 02:57:56PM +0100, Thomas Hellström wrote:
> On Tue, 2024-10-15 at 20:24 -0700, Matthew Brost wrote:
> > Add DRM_GPUVA_OP_USER, which allows drivers to define their own gpuvm
> > ops.
> >
> > Cc: Danilo Krummrich <dakr@redhat.com>
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> > include/drm/drm_gpuvm.h | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > diff --git a/include/drm/drm_gpuvm.h b/include/drm/drm_gpuvm.h
> > index 00d4e43b76b6..cc3f8ed5113b 100644
> > --- a/include/drm/drm_gpuvm.h
> > +++ b/include/drm/drm_gpuvm.h
> > @@ -812,6 +812,11 @@ enum drm_gpuva_op_type {
> > * @DRM_GPUVA_OP_PREFETCH: the prefetch op type
> > */
> > DRM_GPUVA_OP_PREFETCH,
> > +
> > + /**
> > + * @DRM_GPUVA_OP_USER: the user defined op type
> > + */
> > + DRM_GPUVA_OP_USER,
>
> Perhaps _OP_DRIVER, but Danilo might want to chime in.
>
I think that's better too. Will change, and I'm open to any feedback from
Danilo too.
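For reference, on the driver side this just ends up being an extra case
in the driver's op walk, something like the sketch below (this mirrors
how the next patch in this series consumes it via op->subop):
	case DRM_GPUVA_OP_USER:	/* or DRM_GPUVA_OP_DRIVER after the rename */
		/*
		 * Core gpuvm never generates this op type; the driver owns
		 * the sub-op encoding entirely.
		 */
		if (op->subop == XE_VMA_SUBOP_MAP_RANGE)
			err = bind_range_prepare(vm, tile, pt_update_ops,
						 op->map_range.vma,
						 op->map_range.range);
		break;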
Matt
> Otherwise LGTM.
> Thanks,
> Thomas
>
>
>
> > };
> >
> > /**
>
^ permalink raw reply [flat|nested] 129+ messages in thread
* [PATCH v2 11/29] drm/xe: Add (re)bind to SVM page fault handler
2024-10-16 3:24 [PATCH v2 00/29] Introduce GPU SVM and Xe SVM implementation Matthew Brost
` (9 preceding siblings ...)
2024-10-16 3:24 ` [PATCH v2 10/29] drm/gpuvm: Add DRM_GPUVA_OP_USER Matthew Brost
@ 2024-10-16 3:25 ` Matthew Brost
2024-11-19 14:26 ` Thomas Hellström
2024-10-16 3:25 ` [PATCH v2 12/29] drm/xe: Add SVM garbage collector Matthew Brost
` (20 subsequent siblings)
31 siblings, 1 reply; 129+ messages in thread
From: Matthew Brost @ 2024-10-16 3:25 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
Add (re)bind to the SVM page fault handler. To facilitate this, add a
support function to the VM layer which (re)binds an SVM range. Also
teach the PT layer to understand (re)binds of SVM ranges.
v2:
- Don't assert BO lock held for range binds
- Use xe_svm_notifier_lock/unlock helper in xe_svm_close
- Use drm_pagemap dma cursor
- Take notifier lock in bind code to check range state
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
drivers/gpu/drm/xe/xe_pt.c | 170 +++++++++++++++++++++++++++----
drivers/gpu/drm/xe/xe_pt_types.h | 2 +
drivers/gpu/drm/xe/xe_svm.c | 49 ++++++++-
drivers/gpu/drm/xe/xe_svm.h | 17 ++++
drivers/gpu/drm/xe/xe_vm.c | 80 +++++++++++++++
drivers/gpu/drm/xe/xe_vm.h | 5 +
drivers/gpu/drm/xe/xe_vm_types.h | 19 ++++
7 files changed, 319 insertions(+), 23 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
index 282476c4edbd..024e4eb83408 100644
--- a/drivers/gpu/drm/xe/xe_pt.c
+++ b/drivers/gpu/drm/xe/xe_pt.c
@@ -587,6 +587,7 @@ static const struct xe_pt_walk_ops xe_pt_stage_bind_ops = {
* range.
* @tile: The tile we're building for.
* @vma: The vma indicating the address range.
+ * @range: The range indicating the address range.
* @entries: Storage for the update entries used for connecting the tree to
* the main tree at commit time.
* @num_entries: On output contains the number of @entries used.
@@ -602,6 +603,7 @@ static const struct xe_pt_walk_ops xe_pt_stage_bind_ops = {
*/
static int
xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
+ struct xe_svm_range *range,
struct xe_vm_pgtable_update *entries, u32 *num_entries)
{
struct xe_device *xe = tile_to_xe(tile);
@@ -618,14 +620,38 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
.vm = xe_vma_vm(vma),
.tile = tile,
.curs = &curs,
- .va_curs_start = xe_vma_start(vma),
+ .va_curs_start = range ? range->base.va.start :
+ xe_vma_start(vma),
.vma = vma,
.wupd.entries = entries,
- .needs_64K = (xe_vma_vm(vma)->flags & XE_VM_FLAG_64K) && is_devmem,
};
struct xe_pt *pt = xe_vma_vm(vma)->pt_root[tile->id];
int ret;
+ if (range) {
+ /* Move this entire thing to xe_svm.c? */
+ xe_svm_notifier_lock(xe_vma_vm(vma));
+ if (!xe_svm_range_pages_valid(range)) {
+ xe_svm_notifier_unlock(xe_vma_vm(vma));
+ return -EAGAIN;
+ }
+ if (xe_svm_range_has_dma_mapping(range)) {
+ xe_res_first_dma(range->base.dma_addr, 0,
+ range->base.va.end - range->base.va.start,
+ &curs);
+ is_devmem = xe_res_is_vram(&curs);
+ } else {
+ xe_assert(xe, false);
+ }
+ /*
+ * Note, when unlocking the resource cursor dma addresses may become
+	 * stale, but the bind will be aborted anyway at commit time.
+ */
+ xe_svm_notifier_unlock(xe_vma_vm(vma));
+ }
+
+ xe_walk.needs_64K = (xe_vma_vm(vma)->flags & XE_VM_FLAG_64K) && is_devmem;
+
/**
* Default atomic expectations for different allocation scenarios are as follows:
*
@@ -647,7 +673,7 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
* gets migrated to LMEM, bind such allocations with
* device atomics enabled.
*/
- else if (is_devmem && !xe_bo_has_single_placement(bo))
+ else if (is_devmem)
xe_walk.default_pte |= XE_USM_PPGTT_PTE_AE;
} else {
xe_walk.default_pte |= XE_USM_PPGTT_PTE_AE;
@@ -663,15 +689,16 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
if (is_devmem) {
xe_walk.default_pte |= XE_PPGTT_PTE_DM;
- xe_walk.dma_offset = vram_region_gpu_offset(bo->ttm.resource);
+ xe_walk.dma_offset = bo ? vram_region_gpu_offset(bo->ttm.resource) : 0;
}
if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
xe_walk.dma_offset = xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
- xe_bo_assert_held(bo);
+ if (!range)
+ xe_bo_assert_held(bo);
- if (!xe_vma_is_null(vma)) {
+ if (!xe_vma_is_null(vma) && !range) {
if (xe_vma_is_userptr(vma))
xe_res_first_sg(to_userptr_vma(vma)->userptr.sg, 0,
xe_vma_size(vma), &curs);
@@ -681,12 +708,14 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
else
xe_res_first_sg(xe_bo_sg(bo), xe_vma_bo_offset(vma),
xe_vma_size(vma), &curs);
- } else {
+ } else if (!range) {
curs.size = xe_vma_size(vma);
}
- ret = xe_pt_walk_range(&pt->base, pt->level, xe_vma_start(vma),
- xe_vma_end(vma), &xe_walk.base);
+ ret = xe_pt_walk_range(&pt->base, pt->level,
+ range ? range->base.va.start : xe_vma_start(vma),
+ range ? range->base.va.end : xe_vma_end(vma),
+ &xe_walk.base);
*num_entries = xe_walk.wupd.num_used_entries;
return ret;
@@ -902,7 +931,7 @@ static void xe_pt_commit_locks_assert(struct xe_vma *vma)
lockdep_assert_held(&vm->lock);
- if (!xe_vma_is_userptr(vma) && !xe_vma_is_null(vma))
+ if (!xe_vma_has_no_bo(vma))
dma_resv_assert_held(xe_vma_bo(vma)->ttm.base.resv);
xe_vm_assert_held(vm);
@@ -1004,12 +1033,13 @@ static void xe_pt_free_bind(struct xe_vm_pgtable_update *entries,
static int
xe_pt_prepare_bind(struct xe_tile *tile, struct xe_vma *vma,
+ struct xe_svm_range *range,
struct xe_vm_pgtable_update *entries, u32 *num_entries)
{
int err;
*num_entries = 0;
- err = xe_pt_stage_bind(tile, vma, entries, num_entries);
+ err = xe_pt_stage_bind(tile, vma, range, entries, num_entries);
if (!err)
xe_tile_assert(tile, *num_entries);
@@ -1115,6 +1145,8 @@ static int op_add_deps(struct xe_vm *vm, struct xe_vma_op *op,
case DRM_GPUVA_OP_PREFETCH:
err = vma_add_deps(gpuva_to_vma(op->base.prefetch.va), job);
break;
+ case DRM_GPUVA_OP_USER:
+ break;
default:
drm_warn(&vm->xe->drm, "NOT POSSIBLE");
}
@@ -1339,6 +1371,34 @@ static int xe_pt_userptr_pre_commit(struct xe_migrate_pt_update *pt_update)
return err;
}
+static int xe_pt_svm_pre_commit(struct xe_migrate_pt_update *pt_update)
+{
+ struct xe_vm *vm = pt_update->vops->vm;
+ struct xe_vma_ops *vops = pt_update->vops;
+ struct xe_vma_op *op;
+ int err;
+
+ err = xe_pt_pre_commit(pt_update);
+ if (err)
+ return err;
+
+ xe_svm_notifier_lock(vm);
+
+ list_for_each_entry(op, &vops->list, link) {
+ struct xe_svm_range *range = op->map_range.range;
+
+ xe_assert(vm->xe, xe_vma_is_system_allocator(op->map_range.vma));
+ xe_assert(vm->xe, op->subop == XE_VMA_SUBOP_MAP_RANGE);
+
+ if (!xe_svm_range_pages_valid(range)) {
+ xe_svm_notifier_unlock(vm);
+ return -EAGAIN;
+ }
+ }
+
+ return 0;
+}
+
struct invalidation_fence {
struct xe_gt_tlb_invalidation_fence base;
struct xe_gt *gt;
@@ -1632,12 +1692,12 @@ xe_pt_commit_prepare_unbind(struct xe_vma *vma,
static void
xe_pt_update_ops_rfence_interval(struct xe_vm_pgtable_update_ops *pt_update_ops,
- struct xe_vma *vma)
+ u64 start, u64 end)
{
+ u64 last;
u32 current_op = pt_update_ops->current_op;
struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[current_op];
int i, level = 0;
- u64 start, last;
for (i = 0; i < pt_op->num_entries; i++) {
const struct xe_vm_pgtable_update *entry = &pt_op->entries[i];
@@ -1647,8 +1707,8 @@ xe_pt_update_ops_rfence_interval(struct xe_vm_pgtable_update_ops *pt_update_ops,
}
/* Greedy (non-optimal) calculation but simple */
- start = ALIGN_DOWN(xe_vma_start(vma), 0x1ull << xe_pt_shift(level));
- last = ALIGN(xe_vma_end(vma), 0x1ull << xe_pt_shift(level)) - 1;
+ start = ALIGN_DOWN(start, 0x1ull << xe_pt_shift(level));
+ last = ALIGN(end, 0x1ull << xe_pt_shift(level)) - 1;
if (start < pt_update_ops->start)
pt_update_ops->start = start;
@@ -1690,7 +1750,7 @@ static int bind_op_prepare(struct xe_vm *vm, struct xe_tile *tile,
if (err)
return err;
- err = xe_pt_prepare_bind(tile, vma, pt_op->entries,
+ err = xe_pt_prepare_bind(tile, vma, NULL, pt_op->entries,
&pt_op->num_entries);
if (!err) {
xe_tile_assert(tile, pt_op->num_entries <=
@@ -1698,7 +1758,9 @@ static int bind_op_prepare(struct xe_vm *vm, struct xe_tile *tile,
xe_vm_dbg_print_entries(tile_to_xe(tile), pt_op->entries,
pt_op->num_entries, true);
- xe_pt_update_ops_rfence_interval(pt_update_ops, vma);
+ xe_pt_update_ops_rfence_interval(pt_update_ops,
+ xe_vma_start(vma),
+ xe_vma_end(vma));
++pt_update_ops->current_op;
pt_update_ops->needs_userptr_lock |= xe_vma_is_userptr(vma);
@@ -1732,6 +1794,48 @@ static int bind_op_prepare(struct xe_vm *vm, struct xe_tile *tile,
return err;
}
+static int bind_range_prepare(struct xe_vm *vm, struct xe_tile *tile,
+ struct xe_vm_pgtable_update_ops *pt_update_ops,
+ struct xe_vma *vma, struct xe_svm_range *range)
+{
+ u32 current_op = pt_update_ops->current_op;
+ struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[current_op];
+ int err;
+
+ xe_tile_assert(tile, xe_vma_is_system_allocator(vma));
+
+ vm_dbg(&xe_vma_vm(vma)->xe->drm,
+ "Preparing bind, with range [%llx...%llx)\n",
+ range->base.va.start, range->base.va.end - 1);
+
+ pt_op->vma = NULL;
+ pt_op->bind = true;
+ pt_op->rebind = BIT(tile->id) & range->tile_present;
+
+ err = xe_pt_prepare_bind(tile, vma, range, pt_op->entries,
+ &pt_op->num_entries);
+ if (!err) {
+ xe_tile_assert(tile, pt_op->num_entries <=
+ ARRAY_SIZE(pt_op->entries));
+ xe_vm_dbg_print_entries(tile_to_xe(tile), pt_op->entries,
+ pt_op->num_entries, true);
+
+ xe_pt_update_ops_rfence_interval(pt_update_ops,
+ range->base.va.start,
+ range->base.va.end);
+ ++pt_update_ops->current_op;
+ pt_update_ops->needs_svm_lock = true;
+
+ pt_op->vma = vma;
+ xe_pt_commit_prepare_bind(vma, pt_op->entries,
+ pt_op->num_entries, pt_op->rebind);
+ } else {
+ xe_pt_cancel_bind(vma, pt_op->entries, pt_op->num_entries);
+ }
+
+ return err;
+}
+
static int unbind_op_prepare(struct xe_tile *tile,
struct xe_vm_pgtable_update_ops *pt_update_ops,
struct xe_vma *vma)
@@ -1769,7 +1873,8 @@ static int unbind_op_prepare(struct xe_tile *tile,
xe_vm_dbg_print_entries(tile_to_xe(tile), pt_op->entries,
pt_op->num_entries, false);
- xe_pt_update_ops_rfence_interval(pt_update_ops, vma);
+ xe_pt_update_ops_rfence_interval(pt_update_ops, xe_vma_start(vma),
+ xe_vma_end(vma));
++pt_update_ops->current_op;
pt_update_ops->needs_userptr_lock |= xe_vma_is_userptr(vma);
pt_update_ops->needs_invalidation = true;
@@ -1839,6 +1944,15 @@ static int op_prepare(struct xe_vm *vm,
pt_update_ops->wait_vm_kernel = true;
break;
}
+ case DRM_GPUVA_OP_USER:
+ if (op->subop == XE_VMA_SUBOP_MAP_RANGE) {
+ xe_assert(vm->xe, xe_vma_is_system_allocator(op->map_range.vma));
+
+ err = bind_range_prepare(vm, tile, pt_update_ops,
+ op->map_range.vma,
+ op->map_range.range);
+ }
+ break;
default:
drm_warn(&vm->xe->drm, "NOT POSSIBLE");
}
@@ -2020,6 +2134,14 @@ static void op_commit(struct xe_vm *vm,
fence2);
break;
}
+ case DRM_GPUVA_OP_USER:
+ {
+ if (op->subop == XE_VMA_SUBOP_MAP_RANGE) {
+ op->map_range.range->tile_present |= BIT(tile->id);
+ op->map_range.range->tile_invalidated &= ~BIT(tile->id);
+ }
+ break;
+ }
default:
drm_warn(&vm->xe->drm, "NOT POSSIBLE");
}
@@ -2037,6 +2159,12 @@ static const struct xe_migrate_pt_update_ops userptr_migrate_ops = {
.pre_commit = xe_pt_userptr_pre_commit,
};
+static const struct xe_migrate_pt_update_ops svm_migrate_ops = {
+ .populate = xe_vm_populate_pgtable,
+ .clear = xe_migrate_clear_pgtable_callback,
+ .pre_commit = xe_pt_svm_pre_commit,
+};
+
/**
* xe_pt_update_ops_run() - Run PT update operations
* @tile: Tile of PT update operations
@@ -2062,7 +2190,9 @@ xe_pt_update_ops_run(struct xe_tile *tile, struct xe_vma_ops *vops)
struct xe_vma_op *op;
int err = 0, i;
struct xe_migrate_pt_update update = {
- .ops = pt_update_ops->needs_userptr_lock ?
+ .ops = pt_update_ops->needs_svm_lock ?
+ &svm_migrate_ops :
+ pt_update_ops->needs_userptr_lock ?
&userptr_migrate_ops :
&migrate_ops,
.vops = vops,
@@ -2183,6 +2313,8 @@ xe_pt_update_ops_run(struct xe_tile *tile, struct xe_vma_ops *vops)
&ifence->base.base, &mfence->base.base);
}
+ if (pt_update_ops->needs_svm_lock)
+ xe_svm_notifier_unlock(vm);
if (pt_update_ops->needs_userptr_lock)
up_read(&vm->userptr.notifier_lock);
diff --git a/drivers/gpu/drm/xe/xe_pt_types.h b/drivers/gpu/drm/xe/xe_pt_types.h
index 384cc04de719..69eab6f37cfe 100644
--- a/drivers/gpu/drm/xe/xe_pt_types.h
+++ b/drivers/gpu/drm/xe/xe_pt_types.h
@@ -104,6 +104,8 @@ struct xe_vm_pgtable_update_ops {
u32 num_ops;
/** @current_op: current operations */
u32 current_op;
+ /** @needs_svm_lock: Needs SVM lock */
+ bool needs_svm_lock;
/** @needs_userptr_lock: Needs userptr lock */
bool needs_userptr_lock;
/** @needs_invalidation: Needs invalidation */
diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index b2bc259978c4..a9addaea316d 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -209,8 +209,8 @@ void xe_svm_close(struct xe_vm *vm)
xe_assert(vm->xe, xe_vm_is_closed(vm));
/* Flush running notifiers making xe_vm_close() visable */
- drm_gpusvm_notifier_lock(&vm->svm.gpusvm);
- drm_gpusvm_notifier_unlock(&vm->svm.gpusvm);
+ xe_svm_notifier_lock(vm);
+ xe_svm_notifier_unlock(vm);
}
void xe_svm_fini(struct xe_vm *vm)
@@ -220,12 +220,22 @@ void xe_svm_fini(struct xe_vm *vm)
drm_gpusvm_fini(&vm->svm.gpusvm);
}
+static bool xe_svm_range_is_valid(struct xe_svm_range *range,
+ struct xe_tile *tile)
+{
+ return (range->tile_present & ~range->tile_invalidated) & BIT(tile->id);
+}
+
int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
struct xe_tile *tile, u64 fault_addr,
bool atomic)
{
struct drm_gpusvm_ctx ctx = { .read_only = xe_vma_read_only(vma), };
+ struct xe_svm_range *range;
struct drm_gpusvm_range *r;
+ struct drm_exec exec;
+ struct dma_fence *fence;
+ ktime_t end = 0;
int err;
lockdep_assert_held_write(&vm->lock);
@@ -239,11 +249,42 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
if (IS_ERR(r))
return PTR_ERR(r);
- err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r, false);
+ range = to_xe_range(r);
+ if (xe_svm_range_is_valid(range, tile))
+ return 0;
+
+ err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r, &ctx);
if (err == -EFAULT || err == -EPERM) /* Corner where CPU mappings have change */
goto retry;
+ if (err)
+ goto err_out;
+
+retry_bind:
+ drm_exec_init(&exec, 0, 0);
+ drm_exec_until_all_locked(&exec) {
+ err = drm_exec_lock_obj(&exec, vm->gpuvm.r_obj);
+ drm_exec_retry_on_contention(&exec);
+ if (err) {
+ drm_exec_fini(&exec);
+ goto err_out;
+ }
+
+ fence = xe_vm_range_rebind(vm, vma, range, BIT(tile->id));
+ if (IS_ERR(fence)) {
+ drm_exec_fini(&exec);
+ err = PTR_ERR(fence);
+ if (err == -EAGAIN)
+ goto retry;
+ if (xe_vm_validate_should_retry(&exec, err, &end))
+ goto retry_bind;
+ goto err_out;
+ }
+ }
+ drm_exec_fini(&exec);
- /* TODO: Issue bind */
+ dma_fence_wait(fence, false);
+ dma_fence_put(fence);
+err_out:
return err;
}
diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index c91c5f538024..ee0bd1ae655b 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -29,4 +29,21 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
struct xe_tile *tile, u64 fault_addr,
bool atomic);
+static inline bool xe_svm_range_pages_valid(struct xe_svm_range *range)
+{
+ return drm_gpusvm_range_pages_valid(range->base.gpusvm, &range->base);
+}
+
+static inline bool xe_svm_range_has_dma_mapping(struct xe_svm_range *range)
+{
+ lockdep_assert_held(&range->base.gpusvm->notifier_lock);
+ return range->base.flags.has_dma_mapping;
+}
+
+#define xe_svm_notifier_lock(vm__) \
+ drm_gpusvm_notifier_lock(&(vm__)->svm.gpusvm)
+
+#define xe_svm_notifier_unlock(vm__) \
+ drm_gpusvm_notifier_unlock(&(vm__)->svm.gpusvm)
+
#endif
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index b11fb0ac9520..63aa0a25d3b7 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -894,6 +894,84 @@ struct dma_fence *xe_vma_rebind(struct xe_vm *vm, struct xe_vma *vma, u8 tile_ma
return fence;
}
+static void xe_vm_populate_range_rebind(struct xe_vma_op *op,
+ struct xe_vma *vma,
+ struct xe_svm_range *range,
+ u8 tile_mask)
+{
+ INIT_LIST_HEAD(&op->link);
+ op->tile_mask = tile_mask;
+ op->base.op = DRM_GPUVA_OP_USER;
+ op->subop = XE_VMA_SUBOP_MAP_RANGE;
+ op->map_range.vma = vma;
+ op->map_range.range = range;
+}
+
+static int
+xe_vm_ops_add_range_rebind(struct xe_vma_ops *vops,
+ struct xe_vma *vma,
+ struct xe_svm_range *range,
+ u8 tile_mask)
+{
+ struct xe_vma_op *op;
+
+ op = kzalloc(sizeof(*op), GFP_KERNEL);
+ if (!op)
+ return -ENOMEM;
+
+ xe_vm_populate_range_rebind(op, vma, range, tile_mask);
+ list_add_tail(&op->link, &vops->list);
+ xe_vma_ops_incr_pt_update_ops(vops, tile_mask);
+
+ return 0;
+}
+
+struct dma_fence *xe_vm_range_rebind(struct xe_vm *vm,
+ struct xe_vma *vma,
+ struct xe_svm_range *range,
+ u8 tile_mask)
+{
+ struct dma_fence *fence = NULL;
+ struct xe_vma_ops vops;
+ struct xe_vma_op *op, *next_op;
+ struct xe_tile *tile;
+ u8 id;
+ int err;
+
+ lockdep_assert_held(&vm->lock);
+ xe_vm_assert_held(vm);
+ xe_assert(vm->xe, xe_vm_in_fault_mode(vm));
+ xe_assert(vm->xe, xe_vma_is_system_allocator(vma));
+
+ xe_vma_ops_init(&vops, vm, NULL, NULL, 0);
+ for_each_tile(tile, vm->xe, id) {
+ vops.pt_update_ops[id].wait_vm_bookkeep = true;
+ vops.pt_update_ops[tile->id].q =
+ xe_tile_migrate_exec_queue(tile);
+ }
+
+ err = xe_vm_ops_add_range_rebind(&vops, vma, range, tile_mask);
+ if (err)
+ return ERR_PTR(err);
+
+ err = xe_vma_ops_alloc(&vops, false);
+ if (err) {
+ fence = ERR_PTR(err);
+ goto free_ops;
+ }
+
+ fence = ops_execute(vm, &vops);
+
+free_ops:
+ list_for_each_entry_safe(op, next_op, &vops.list, link) {
+ list_del(&op->link);
+ kfree(op);
+ }
+ xe_vma_ops_fini(&vops);
+
+ return fence;
+}
+
static void xe_vma_free(struct xe_vma *vma)
{
if (xe_vma_is_userptr(vma))
@@ -2514,6 +2592,8 @@ static void op_trace(struct xe_vma_op *op)
case DRM_GPUVA_OP_PREFETCH:
trace_xe_vma_bind(gpuva_to_vma(op->base.prefetch.va));
break;
+ case DRM_GPUVA_OP_USER:
+ break;
default:
XE_WARN_ON("NOT POSSIBLE");
}
diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
index 1a5aed678214..8bd921b33090 100644
--- a/drivers/gpu/drm/xe/xe_vm.h
+++ b/drivers/gpu/drm/xe/xe_vm.h
@@ -22,6 +22,7 @@ struct ttm_validate_buffer;
struct xe_exec_queue;
struct xe_file;
struct xe_sync_entry;
+struct xe_svm_range;
struct drm_exec;
struct xe_vm *xe_vm_create(struct xe_device *xe, u32 flags);
@@ -217,6 +218,10 @@ int xe_vm_userptr_check_repin(struct xe_vm *vm);
int xe_vm_rebind(struct xe_vm *vm, bool rebind_worker);
struct dma_fence *xe_vma_rebind(struct xe_vm *vm, struct xe_vma *vma,
u8 tile_mask);
+struct dma_fence *xe_vm_range_rebind(struct xe_vm *vm,
+ struct xe_vma *vma,
+ struct xe_svm_range *range,
+ u8 tile_mask);
int xe_vm_invalidate_vma(struct xe_vma *vma);
diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
index bd1c0e368238..b736e53779d2 100644
--- a/drivers/gpu/drm/xe/xe_vm_types.h
+++ b/drivers/gpu/drm/xe/xe_vm_types.h
@@ -19,6 +19,7 @@
#include "xe_range_fence.h"
struct xe_bo;
+struct xe_svm_range;
struct xe_sync_entry;
struct xe_user_fence;
struct xe_vm;
@@ -334,6 +335,14 @@ struct xe_vma_op_prefetch {
u32 region;
};
+/** struct xe_vma_op_map_range - VMA map range operation */
+struct xe_vma_op_map_range {
+ /** @vma: VMA to map (system allocator VMA) */
+ struct xe_vma *vma;
+ /** @range: SVM range to map */
+ struct xe_svm_range *range;
+};
+
/** enum xe_vma_op_flags - flags for VMA operation */
enum xe_vma_op_flags {
/** @XE_VMA_OP_COMMITTED: VMA operation committed */
@@ -344,6 +353,12 @@ enum xe_vma_op_flags {
XE_VMA_OP_NEXT_COMMITTED = BIT(2),
};
+/** enum xe_vma_subop - VMA sub-operation */
+enum xe_vma_subop {
+ /** @XE_VMA_SUBOP_MAP_RANGE: Map range */
+ XE_VMA_SUBOP_MAP_RANGE,
+};
+
/** struct xe_vma_op - VMA operation */
struct xe_vma_op {
/** @base: GPUVA base operation */
@@ -352,6 +367,8 @@ struct xe_vma_op {
struct list_head link;
/** @flags: operation flags */
enum xe_vma_op_flags flags;
+ /** @subop: user defined sub-operation */
+ enum xe_vma_subop subop;
/** @tile_mask: Tile mask for operation */
u8 tile_mask;
@@ -362,6 +379,8 @@ struct xe_vma_op {
struct xe_vma_op_remap remap;
/** @prefetch: VMA prefetch operation specific data */
struct xe_vma_op_prefetch prefetch;
+ /** @map_range: VMA map range operation specific data */
+ struct xe_vma_op_map_range map_range;
};
};
--
2.34.1
^ permalink raw reply related [flat|nested] 129+ messages in thread* Re: [PATCH v2 11/29] drm/xe: Add (re)bind to SVM page fault handler
2024-10-16 3:25 ` [PATCH v2 11/29] drm/xe: Add (re)bind to SVM page fault handler Matthew Brost
@ 2024-11-19 14:26 ` Thomas Hellström
2024-12-11 19:07 ` Matthew Brost
0 siblings, 1 reply; 129+ messages in thread
From: Thomas Hellström @ 2024-11-19 14:26 UTC (permalink / raw)
To: Matthew Brost, intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, simona.vetter, felix.kuehling,
dakr
On Tue, 2024-10-15 at 20:25 -0700, Matthew Brost wrote:
> Add (re)bind to SVM page fault handler. To facilitate add support
> function to VM layer which (re)binds a SVM range. Also teach PT layer
> to
> understand (re)binds of SVM ranges.
>
> v2:
> - Don't assert BO lock held for range binds
> - Use xe_svm_notifier_lock/unlock helper in xe_svm_close
> - Use drm_pagemap dma cursor
> - Take notifier lock in bind code to check range state
>
> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
> drivers/gpu/drm/xe/xe_pt.c | 170 +++++++++++++++++++++++++++--
> --
> drivers/gpu/drm/xe/xe_pt_types.h | 2 +
> drivers/gpu/drm/xe/xe_svm.c | 49 ++++++++-
> drivers/gpu/drm/xe/xe_svm.h | 17 ++++
> drivers/gpu/drm/xe/xe_vm.c | 80 +++++++++++++++
> drivers/gpu/drm/xe/xe_vm.h | 5 +
> drivers/gpu/drm/xe/xe_vm_types.h | 19 ++++
> 7 files changed, 319 insertions(+), 23 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
> index 282476c4edbd..024e4eb83408 100644
> --- a/drivers/gpu/drm/xe/xe_pt.c
> +++ b/drivers/gpu/drm/xe/xe_pt.c
> @@ -587,6 +587,7 @@ static const struct xe_pt_walk_ops
> xe_pt_stage_bind_ops = {
> * range.
> * @tile: The tile we're building for.
> * @vma: The vma indicating the address range.
> + * @range: The range indicating the address range.
> * @entries: Storage for the update entries used for connecting the
> tree to
> * the main tree at commit time.
> * @num_entries: On output contains the number of @entries used.
> @@ -602,6 +603,7 @@ static const struct xe_pt_walk_ops
> xe_pt_stage_bind_ops = {
> */
> static int
> xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
> + struct xe_svm_range *range,
> struct xe_vm_pgtable_update *entries, u32
> *num_entries)
Really the same comment here, should rework the interface to drop vma
and range, although since it's more involved here let's do it as a
follow-up.
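Just to sketch the direction for that follow-up (illustrative only; the name
and parameters below are made up, not a proposal for the final interface),
something where the caller passes the VA span and a pre-positioned cursor
instead of a vma/range pair:

static int
xe_pt_stage_bind_span(struct xe_tile *tile, struct xe_vm *vm,
		      u64 start, u64 end, bool is_devmem,
		      struct xe_res_cursor *curs,
		      struct xe_vm_pgtable_update *entries,
		      u32 *num_entries);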
> {
> struct xe_device *xe = tile_to_xe(tile);
> @@ -618,14 +620,38 @@ xe_pt_stage_bind(struct xe_tile *tile, struct
> xe_vma *vma,
> .vm = xe_vma_vm(vma),
> .tile = tile,
> .curs = &curs,
> - .va_curs_start = xe_vma_start(vma),
> + .va_curs_start = range ? range->base.va.start :
> + xe_vma_start(vma),
> .vma = vma,
> .wupd.entries = entries,
> - .needs_64K = (xe_vma_vm(vma)->flags &
> XE_VM_FLAG_64K) && is_devmem,
> };
> struct xe_pt *pt = xe_vma_vm(vma)->pt_root[tile->id];
> int ret;
>
> + if (range) {
> + /* Move this entire thing to xe_svm.c? */
> + xe_svm_notifier_lock(xe_vma_vm(vma));
> + if (!xe_svm_range_pages_valid(range)) {
> + xe_svm_notifier_unlock(xe_vma_vm(vma));
> + return -EAGAIN;
> + }
> + if (xe_svm_range_has_dma_mapping(range)) {
> + xe_res_first_dma(range->base.dma_addr, 0,
> + range->base.va.end - range-
> >base.va.start,
> + &curs);
> + is_devmem = xe_res_is_vram(&curs);
> + } else {
> + xe_assert(xe, false);
> + }
> + /*
> + * Note, when unlocking the resource cursor dma
> addresses may become
> + * stale, but the bind will be aborted anyway att
> commit time.
> + */
> + xe_svm_notifier_unlock(xe_vma_vm(vma));
> + }
> +
> + xe_walk.needs_64K = (xe_vma_vm(vma)->flags & XE_VM_FLAG_64K)
> && is_devmem;
> +
> /**
> * Default atomic expectations for different allocation
> scenarios are as follows:
> *
> @@ -647,7 +673,7 @@ xe_pt_stage_bind(struct xe_tile *tile, struct
> xe_vma *vma,
> * gets migrated to LMEM, bind such
> allocations with
> * device atomics enabled.
> */
> - else if (is_devmem &&
> !xe_bo_has_single_placement(bo))
> + else if (is_devmem)
> xe_walk.default_pte |=
> XE_USM_PPGTT_PTE_AE;
> } else {
> xe_walk.default_pte |= XE_USM_PPGTT_PTE_AE;
> @@ -663,15 +689,16 @@ xe_pt_stage_bind(struct xe_tile *tile, struct
> xe_vma *vma,
>
> if (is_devmem) {
> xe_walk.default_pte |= XE_PPGTT_PTE_DM;
> - xe_walk.dma_offset = vram_region_gpu_offset(bo-
> >ttm.resource);
> + xe_walk.dma_offset = bo ? vram_region_gpu_offset(bo-
> >ttm.resource) : 0;
> }
>
> if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
> xe_walk.dma_offset =
> xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
>
> - xe_bo_assert_held(bo);
> + if (!range)
> + xe_bo_assert_held(bo);
>
> - if (!xe_vma_is_null(vma)) {
> + if (!xe_vma_is_null(vma) && !range) {
> if (xe_vma_is_userptr(vma))
> xe_res_first_sg(to_userptr_vma(vma)-
> >userptr.sg, 0,
> xe_vma_size(vma), &curs);
> @@ -681,12 +708,14 @@ xe_pt_stage_bind(struct xe_tile *tile, struct
> xe_vma *vma,
> else
> xe_res_first_sg(xe_bo_sg(bo),
> xe_vma_bo_offset(vma),
> xe_vma_size(vma), &curs);
> - } else {
> + } else if (!range) {
> curs.size = xe_vma_size(vma);
> }
>
> - ret = xe_pt_walk_range(&pt->base, pt->level,
> xe_vma_start(vma),
> - xe_vma_end(vma), &xe_walk.base);
> + ret = xe_pt_walk_range(&pt->base, pt->level,
> + range ? range->base.va.start :
> xe_vma_start(vma),
> + range ? range->base.va.end :
> xe_vma_end(vma),
> + &xe_walk.base);
>
> *num_entries = xe_walk.wupd.num_used_entries;
> return ret;
> @@ -902,7 +931,7 @@ static void xe_pt_commit_locks_assert(struct
> xe_vma *vma)
>
> lockdep_assert_held(&vm->lock);
>
> - if (!xe_vma_is_userptr(vma) && !xe_vma_is_null(vma))
> + if (!xe_vma_has_no_bo(vma))
> dma_resv_assert_held(xe_vma_bo(vma)->ttm.base.resv);
>
> xe_vm_assert_held(vm);
> @@ -1004,12 +1033,13 @@ static void xe_pt_free_bind(struct
> xe_vm_pgtable_update *entries,
>
> static int
> xe_pt_prepare_bind(struct xe_tile *tile, struct xe_vma *vma,
> + struct xe_svm_range *range,
> struct xe_vm_pgtable_update *entries, u32
> *num_entries)
> {
> int err;
>
> *num_entries = 0;
> - err = xe_pt_stage_bind(tile, vma, entries, num_entries);
> + err = xe_pt_stage_bind(tile, vma, range, entries,
> num_entries);
> if (!err)
> xe_tile_assert(tile, *num_entries);
>
> @@ -1115,6 +1145,8 @@ static int op_add_deps(struct xe_vm *vm, struct
> xe_vma_op *op,
> case DRM_GPUVA_OP_PREFETCH:
> err = vma_add_deps(gpuva_to_vma(op-
> >base.prefetch.va), job);
> break;
> + case DRM_GPUVA_OP_USER:
> + break;
> default:
> drm_warn(&vm->xe->drm, "NOT POSSIBLE");
> }
> @@ -1339,6 +1371,34 @@ static int xe_pt_userptr_pre_commit(struct
> xe_migrate_pt_update *pt_update)
> return err;
> }
>
> +static int xe_pt_svm_pre_commit(struct xe_migrate_pt_update
> *pt_update)
> +{
> + struct xe_vm *vm = pt_update->vops->vm;
> + struct xe_vma_ops *vops = pt_update->vops;
> + struct xe_vma_op *op;
> + int err;
> +
> + err = xe_pt_pre_commit(pt_update);
> + if (err)
> + return err;
> +
> + xe_svm_notifier_lock(vm);
> +
> + list_for_each_entry(op, &vops->list, link) {
> + struct xe_svm_range *range = op->map_range.range;
> +
> + xe_assert(vm->xe, xe_vma_is_system_allocator(op-
> >map_range.vma));
> + xe_assert(vm->xe, op->subop ==
> XE_VMA_SUBOP_MAP_RANGE);
> +
> + if (!xe_svm_range_pages_valid(range)) {
> + xe_svm_notifier_unlock(vm);
> + return -EAGAIN;
> + }
> + }
> +
> + return 0;
> +}
> +
> struct invalidation_fence {
> struct xe_gt_tlb_invalidation_fence base;
> struct xe_gt *gt;
> @@ -1632,12 +1692,12 @@ xe_pt_commit_prepare_unbind(struct xe_vma
> *vma,
>
> static void
> xe_pt_update_ops_rfence_interval(struct xe_vm_pgtable_update_ops
> *pt_update_ops,
> - struct xe_vma *vma)
> + u64 start, u64 end)
> {
> + u64 last;
> u32 current_op = pt_update_ops->current_op;
> struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops-
> >ops[current_op];
> int i, level = 0;
> - u64 start, last;
>
> for (i = 0; i < pt_op->num_entries; i++) {
> const struct xe_vm_pgtable_update *entry = &pt_op-
> >entries[i];
> @@ -1647,8 +1707,8 @@ xe_pt_update_ops_rfence_interval(struct
> xe_vm_pgtable_update_ops *pt_update_ops,
> }
>
> /* Greedy (non-optimal) calculation but simple */
> - start = ALIGN_DOWN(xe_vma_start(vma), 0x1ull <<
> xe_pt_shift(level));
> - last = ALIGN(xe_vma_end(vma), 0x1ull << xe_pt_shift(level))
> - 1;
> + start = ALIGN_DOWN(start, 0x1ull << xe_pt_shift(level));
> + last = ALIGN(end, 0x1ull << xe_pt_shift(level)) - 1;
>
> if (start < pt_update_ops->start)
> pt_update_ops->start = start;
> @@ -1690,7 +1750,7 @@ static int bind_op_prepare(struct xe_vm *vm,
> struct xe_tile *tile,
> if (err)
> return err;
>
> - err = xe_pt_prepare_bind(tile, vma, pt_op->entries,
> + err = xe_pt_prepare_bind(tile, vma, NULL, pt_op->entries,
> &pt_op->num_entries);
> if (!err) {
> xe_tile_assert(tile, pt_op->num_entries <=
> @@ -1698,7 +1758,9 @@ static int bind_op_prepare(struct xe_vm *vm,
> struct xe_tile *tile,
> xe_vm_dbg_print_entries(tile_to_xe(tile), pt_op-
> >entries,
> pt_op->num_entries, true);
>
> - xe_pt_update_ops_rfence_interval(pt_update_ops,
> vma);
> + xe_pt_update_ops_rfence_interval(pt_update_ops,
> + xe_vma_start(vma),
> + xe_vma_end(vma));
> ++pt_update_ops->current_op;
> pt_update_ops->needs_userptr_lock |=
> xe_vma_is_userptr(vma);
>
> @@ -1732,6 +1794,48 @@ static int bind_op_prepare(struct xe_vm *vm,
> struct xe_tile *tile,
> return err;
> }
>
> +static int bind_range_prepare(struct xe_vm *vm, struct xe_tile
> *tile,
> + struct xe_vm_pgtable_update_ops
> *pt_update_ops,
> + struct xe_vma *vma, struct
> xe_svm_range *range)
> +{
> + u32 current_op = pt_update_ops->current_op;
> + struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops-
> >ops[current_op];
> + int err;
> +
> + xe_tile_assert(tile, xe_vma_is_system_allocator(vma));
> +
> + vm_dbg(&xe_vma_vm(vma)->xe->drm,
> + "Preparing bind, with range [%llx...%llx)\n",
> + range->base.va.start, range->base.va.end - 1);
> +
> + pt_op->vma = NULL;
> + pt_op->bind = true;
> + pt_op->rebind = BIT(tile->id) & range->tile_present;
> +
> + err = xe_pt_prepare_bind(tile, vma, range, pt_op->entries,
> + &pt_op->num_entries);
> + if (!err) {
> + xe_tile_assert(tile, pt_op->num_entries <=
> + ARRAY_SIZE(pt_op->entries));
> + xe_vm_dbg_print_entries(tile_to_xe(tile), pt_op-
> >entries,
> + pt_op->num_entries, true);
> +
> + xe_pt_update_ops_rfence_interval(pt_update_ops,
> + range-
> >base.va.start,
> + range-
> >base.va.end);
> + ++pt_update_ops->current_op;
> + pt_update_ops->needs_svm_lock = true;
> +
> + pt_op->vma = vma;
> + xe_pt_commit_prepare_bind(vma, pt_op->entries,
> + pt_op->num_entries, pt_op-
> >rebind);
> + } else {
> + xe_pt_cancel_bind(vma, pt_op->entries, pt_op-
> >num_entries);
> + }
> +
> + return err;
> +}
> +
> static int unbind_op_prepare(struct xe_tile *tile,
> struct xe_vm_pgtable_update_ops
> *pt_update_ops,
> struct xe_vma *vma)
> @@ -1769,7 +1873,8 @@ static int unbind_op_prepare(struct xe_tile
> *tile,
>
> xe_vm_dbg_print_entries(tile_to_xe(tile), pt_op->entries,
> pt_op->num_entries, false);
> - xe_pt_update_ops_rfence_interval(pt_update_ops, vma);
> + xe_pt_update_ops_rfence_interval(pt_update_ops,
> xe_vma_start(vma),
> + xe_vma_end(vma));
> ++pt_update_ops->current_op;
> pt_update_ops->needs_userptr_lock |= xe_vma_is_userptr(vma);
> pt_update_ops->needs_invalidation = true;
> @@ -1839,6 +1944,15 @@ static int op_prepare(struct xe_vm *vm,
> pt_update_ops->wait_vm_kernel = true;
> break;
> }
> + case DRM_GPUVA_OP_USER:
> + if (op->subop == XE_VMA_SUBOP_MAP_RANGE) {
See question below on subops
> + xe_assert(vm->xe,
> xe_vma_is_system_allocator(op->map_range.vma));
> +
> + err = bind_range_prepare(vm, tile,
> pt_update_ops,
> + op->map_range.vma,
> + op-
> >map_range.range);
> + }
> + break;
> default:
> drm_warn(&vm->xe->drm, "NOT POSSIBLE");
> }
> @@ -2020,6 +2134,14 @@ static void op_commit(struct xe_vm *vm,
> fence2);
> break;
> }
> + case DRM_GPUVA_OP_USER:
> + {
> + if (op->subop == XE_VMA_SUBOP_MAP_RANGE) {
> + op->map_range.range->tile_present |=
> BIT(tile->id);
> + op->map_range.range->tile_invalidated &=
> ~BIT(tile->id);
> + }
> + break;
> + }
> default:
> drm_warn(&vm->xe->drm, "NOT POSSIBLE");
> }
> @@ -2037,6 +2159,12 @@ static const struct xe_migrate_pt_update_ops
> userptr_migrate_ops = {
> .pre_commit = xe_pt_userptr_pre_commit,
> };
>
> +static const struct xe_migrate_pt_update_ops svm_migrate_ops = {
> + .populate = xe_vm_populate_pgtable,
> + .clear = xe_migrate_clear_pgtable_callback,
> + .pre_commit = xe_pt_svm_pre_commit,
> +};
> +
> /**
> * xe_pt_update_ops_run() - Run PT update operations
> * @tile: Tile of PT update operations
> @@ -2062,7 +2190,9 @@ xe_pt_update_ops_run(struct xe_tile *tile,
> struct xe_vma_ops *vops)
> struct xe_vma_op *op;
> int err = 0, i;
> struct xe_migrate_pt_update update = {
> - .ops = pt_update_ops->needs_userptr_lock ?
> + .ops = pt_update_ops->needs_svm_lock ?
> + &svm_migrate_ops :
> + pt_update_ops->needs_userptr_lock ?
> &userptr_migrate_ops :
> &migrate_ops,
> .vops = vops,
> @@ -2183,6 +2313,8 @@ xe_pt_update_ops_run(struct xe_tile *tile,
> struct xe_vma_ops *vops)
> &ifence->base.base, &mfence-
> >base.base);
> }
>
> + if (pt_update_ops->needs_svm_lock)
> + xe_svm_notifier_unlock(vm);
> if (pt_update_ops->needs_userptr_lock)
> up_read(&vm->userptr.notifier_lock);
>
> diff --git a/drivers/gpu/drm/xe/xe_pt_types.h
> b/drivers/gpu/drm/xe/xe_pt_types.h
> index 384cc04de719..69eab6f37cfe 100644
> --- a/drivers/gpu/drm/xe/xe_pt_types.h
> +++ b/drivers/gpu/drm/xe/xe_pt_types.h
> @@ -104,6 +104,8 @@ struct xe_vm_pgtable_update_ops {
> u32 num_ops;
> /** @current_op: current operations */
> u32 current_op;
> + /** @needs_svm_lock: Needs SVM lock */
> + bool needs_svm_lock;
> /** @needs_userptr_lock: Needs userptr lock */
> bool needs_userptr_lock;
> /** @needs_invalidation: Needs invalidation */
> diff --git a/drivers/gpu/drm/xe/xe_svm.c
> b/drivers/gpu/drm/xe/xe_svm.c
> index b2bc259978c4..a9addaea316d 100644
> --- a/drivers/gpu/drm/xe/xe_svm.c
> +++ b/drivers/gpu/drm/xe/xe_svm.c
> @@ -209,8 +209,8 @@ void xe_svm_close(struct xe_vm *vm)
> xe_assert(vm->xe, xe_vm_is_closed(vm));
>
> /* Flush running notifiers making xe_vm_close() visable */
> - drm_gpusvm_notifier_lock(&vm->svm.gpusvm);
> - drm_gpusvm_notifier_unlock(&vm->svm.gpusvm);
> + xe_svm_notifier_lock(vm);
> + xe_svm_notifier_unlock(vm);
> }
>
> void xe_svm_fini(struct xe_vm *vm)
> @@ -220,12 +220,22 @@ void xe_svm_fini(struct xe_vm *vm)
> drm_gpusvm_fini(&vm->svm.gpusvm);
> }
>
> +static bool xe_svm_range_is_valid(struct xe_svm_range *range,
> + struct xe_tile *tile)
> +{
> + return (range->tile_present & ~range->tile_invalidated) &
> BIT(tile->id);
> +}
> +
> int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
> struct xe_tile *tile, u64 fault_addr,
> bool atomic)
> {
> struct drm_gpusvm_ctx ctx = { .read_only =
> xe_vma_read_only(vma), };
> + struct xe_svm_range *range;
> struct drm_gpusvm_range *r;
> + struct drm_exec exec;
> + struct dma_fence *fence;
> + ktime_t end = 0;
> int err;
>
> lockdep_assert_held_write(&vm->lock);
> @@ -239,11 +249,42 @@ int xe_svm_handle_pagefault(struct xe_vm *vm,
> struct xe_vma *vma,
> if (IS_ERR(r))
> return PTR_ERR(r);
>
> - err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r, false);
> + range = to_xe_range(r);
> + if (xe_svm_range_is_valid(range, tile))
> + return 0;
> +
> + err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r, &ctx);
> if (err == -EFAULT || err == -EPERM) /* Corner where CPU
> mappings have change */
> goto retry;
> + if (err)
> + goto err_out;
> +
> +retry_bind:
> + drm_exec_init(&exec, 0, 0);
> + drm_exec_until_all_locked(&exec) {
> + err = drm_exec_lock_obj(&exec, vm->gpuvm.r_obj);
> + drm_exec_retry_on_contention(&exec);
> + if (err) {
> + drm_exec_fini(&exec);
> + goto err_out;
> + }
> +
> + fence = xe_vm_range_rebind(vm, vma, range, BIT(tile-
> >id));
> + if (IS_ERR(fence)) {
> + drm_exec_fini(&exec);
> + err = PTR_ERR(fence);
> + if (err == -EAGAIN)
> + goto retry;
> + if (xe_vm_validate_should_retry(&exec, err,
> &end))
> + goto retry_bind;
> + goto err_out;
> + }
> + }
> + drm_exec_fini(&exec);
>
> - /* TODO: Issue bind */
> + dma_fence_wait(fence, false);
> + dma_fence_put(fence);
>
> +err_out:
> return err;
> }
> diff --git a/drivers/gpu/drm/xe/xe_svm.h
> b/drivers/gpu/drm/xe/xe_svm.h
> index c91c5f538024..ee0bd1ae655b 100644
> --- a/drivers/gpu/drm/xe/xe_svm.h
> +++ b/drivers/gpu/drm/xe/xe_svm.h
> @@ -29,4 +29,21 @@ int xe_svm_handle_pagefault(struct xe_vm *vm,
> struct xe_vma *vma,
> struct xe_tile *tile, u64 fault_addr,
> bool atomic);
>
> +static inline bool xe_svm_range_pages_valid(struct xe_svm_range
> *range)
> +{
> + return drm_gpusvm_range_pages_valid(range->base.gpusvm,
> &range->base);
> +}
> +
> +static inline bool xe_svm_range_has_dma_mapping(struct xe_svm_range
> *range)
> +{
> + lockdep_assert_held(&range->base.gpusvm->notifier_lock);
> + return range->base.flags.has_dma_mapping;
> +}
> +
> +#define xe_svm_notifier_lock(vm__) \
> + drm_gpusvm_notifier_lock(&(vm__)->svm.gpusvm)
> +
> +#define xe_svm_notifier_unlock(vm__) \
> + drm_gpusvm_notifier_unlock(&(vm__)->svm.gpusvm)
> +
> #endif
> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> index b11fb0ac9520..63aa0a25d3b7 100644
> --- a/drivers/gpu/drm/xe/xe_vm.c
> +++ b/drivers/gpu/drm/xe/xe_vm.c
> @@ -894,6 +894,84 @@ struct dma_fence *xe_vma_rebind(struct xe_vm
> *vm, struct xe_vma *vma, u8 tile_ma
> return fence;
> }
>
> +static void xe_vm_populate_range_rebind(struct xe_vma_op *op,
> + struct xe_vma *vma,
> + struct xe_svm_range *range,
> + u8 tile_mask)
> +{
> + INIT_LIST_HEAD(&op->link);
> + op->tile_mask = tile_mask;
> + op->base.op = DRM_GPUVA_OP_USER;
> + op->subop = XE_VMA_SUBOP_MAP_RANGE;
> + op->map_range.vma = vma;
> + op->map_range.range = range;
> +}
> +
> +static int
> +xe_vm_ops_add_range_rebind(struct xe_vma_ops *vops,
> + struct xe_vma *vma,
> + struct xe_svm_range *range,
> + u8 tile_mask)
> +{
> + struct xe_vma_op *op;
> +
> + op = kzalloc(sizeof(*op), GFP_KERNEL);
> + if (!op)
> + return -ENOMEM;
> +
> + xe_vm_populate_range_rebind(op, vma, range, tile_mask);
> + list_add_tail(&op->link, &vops->list);
> + xe_vma_ops_incr_pt_update_ops(vops, tile_mask);
> +
> + return 0;
> +}
> +
> +struct dma_fence *xe_vm_range_rebind(struct xe_vm *vm,
> + struct xe_vma *vma,
> + struct xe_svm_range *range,
> + u8 tile_mask)
kerneldoc
> +{
> + struct dma_fence *fence = NULL;
> + struct xe_vma_ops vops;
> + struct xe_vma_op *op, *next_op;
> + struct xe_tile *tile;
> + u8 id;
> + int err;
> +
> + lockdep_assert_held(&vm->lock);
> + xe_vm_assert_held(vm);
> + xe_assert(vm->xe, xe_vm_in_fault_mode(vm));
> + xe_assert(vm->xe, xe_vma_is_system_allocator(vma));
> +
> + xe_vma_ops_init(&vops, vm, NULL, NULL, 0);
> + for_each_tile(tile, vm->xe, id) {
> + vops.pt_update_ops[id].wait_vm_bookkeep = true;
> + vops.pt_update_ops[tile->id].q =
> + xe_tile_migrate_exec_queue(tile);
> + }
> +
> + err = xe_vm_ops_add_range_rebind(&vops, vma, range,
> tile_mask);
> + if (err)
> + return ERR_PTR(err);
> +
> + err = xe_vma_ops_alloc(&vops, false);
> + if (err) {
> + fence = ERR_PTR(err);
> + goto free_ops;
> + }
> +
> + fence = ops_execute(vm, &vops);
> +
> +free_ops:
> + list_for_each_entry_safe(op, next_op, &vops.list, link) {
> + list_del(&op->link);
> + kfree(op);
> + }
> + xe_vma_ops_fini(&vops);
> +
> + return fence;
> +}
> +
> static void xe_vma_free(struct xe_vma *vma)
> {
> if (xe_vma_is_userptr(vma))
> @@ -2514,6 +2592,8 @@ static void op_trace(struct xe_vma_op *op)
> case DRM_GPUVA_OP_PREFETCH:
> trace_xe_vma_bind(gpuva_to_vma(op-
> >base.prefetch.va));
> break;
> + case DRM_GPUVA_OP_USER:
> + break;
> default:
> XE_WARN_ON("NOT POSSIBLE");
> }
> diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
> index 1a5aed678214..8bd921b33090 100644
> --- a/drivers/gpu/drm/xe/xe_vm.h
> +++ b/drivers/gpu/drm/xe/xe_vm.h
> @@ -22,6 +22,7 @@ struct ttm_validate_buffer;
> struct xe_exec_queue;
> struct xe_file;
> struct xe_sync_entry;
> +struct xe_svm_range;
> struct drm_exec;
>
> struct xe_vm *xe_vm_create(struct xe_device *xe, u32 flags);
> @@ -217,6 +218,10 @@ int xe_vm_userptr_check_repin(struct xe_vm *vm);
> int xe_vm_rebind(struct xe_vm *vm, bool rebind_worker);
> struct dma_fence *xe_vma_rebind(struct xe_vm *vm, struct xe_vma
> *vma,
> u8 tile_mask);
> +struct dma_fence *xe_vm_range_rebind(struct xe_vm *vm,
> + struct xe_vma *vma,
> + struct xe_svm_range *range,
> + u8 tile_mask);
>
> int xe_vm_invalidate_vma(struct xe_vma *vma);
>
> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h
> b/drivers/gpu/drm/xe/xe_vm_types.h
> index bd1c0e368238..b736e53779d2 100644
> --- a/drivers/gpu/drm/xe/xe_vm_types.h
> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
> @@ -19,6 +19,7 @@
> #include "xe_range_fence.h"
>
> struct xe_bo;
> +struct xe_svm_range;
> struct xe_sync_entry;
> struct xe_user_fence;
> struct xe_vm;
> @@ -334,6 +335,14 @@ struct xe_vma_op_prefetch {
> u32 region;
> };
>
> +/** struct xe_vma_op_map_range - VMA map range operation */
> +struct xe_vma_op_map_range {
> + /** @vma: VMA to map (system allocator VMA) */
> + struct xe_vma *vma;
> + /** @range: SVM range to map */
> + struct xe_svm_range *range;
> +};
> +
> /** enum xe_vma_op_flags - flags for VMA operation */
> enum xe_vma_op_flags {
> /** @XE_VMA_OP_COMMITTED: VMA operation committed */
> @@ -344,6 +353,12 @@ enum xe_vma_op_flags {
> XE_VMA_OP_NEXT_COMMITTED = BIT(2),
> };
>
> +/** enum xe_vma_subop - VMA sub-operation */
> +enum xe_vma_subop {
> + /** @XE_VMA_SUBOP_MAP_RANGE: Map range */
> + XE_VMA_SUBOP_MAP_RANGE,
Instead of introducing subops, should we perhaps consider
DRM_GPUVMA_OP_USER plus any following op as driver defined so that the
next subop would instead be DRM_GPUVMA_OP_USER + 1?
> +};
> +
> /** struct xe_vma_op - VMA operation */
> struct xe_vma_op {
> /** @base: GPUVA base operation */
> @@ -352,6 +367,8 @@ struct xe_vma_op {
> struct list_head link;
> /** @flags: operation flags */
> enum xe_vma_op_flags flags;
> + /** @subop: user defined sub-operation */
> + enum xe_vma_subop subop;
> /** @tile_mask: Tile mask for operation */
> u8 tile_mask;
>
> @@ -362,6 +379,8 @@ struct xe_vma_op {
> struct xe_vma_op_remap remap;
> /** @prefetch: VMA prefetch operation specific data
> */
> struct xe_vma_op_prefetch prefetch;
> + /** @map: VMA map range operation specific data */
> + struct xe_vma_op_map_range map_range;
> };
> };
>
Thanks,
Thomas
^ permalink raw reply [flat|nested] 129+ messages in thread* Re: [PATCH v2 11/29] drm/xe: Add (re)bind to SVM page fault handler
2024-11-19 14:26 ` Thomas Hellström
@ 2024-12-11 19:07 ` Matthew Brost
2024-12-16 10:03 ` Thomas Hellström
0 siblings, 1 reply; 129+ messages in thread
From: Matthew Brost @ 2024-12-11 19:07 UTC (permalink / raw)
To: Thomas Hellström
Cc: intel-xe, dri-devel, apopple, airlied, christian.koenig,
simona.vetter, felix.kuehling, dakr
On Tue, Nov 19, 2024 at 03:26:32PM +0100, Thomas Hellström wrote:
> On Tue, 2024-10-15 at 20:25 -0700, Matthew Brost wrote:
> > Add (re)bind to SVM page fault handler. To facilitate add support
> > function to VM layer which (re)binds a SVM range. Also teach PT layer
> > to
> > understand (re)binds of SVM ranges.
> >
> > v2:
> > - Don't assert BO lock held for range binds
> > - Use xe_svm_notifier_lock/unlock helper in xe_svm_close
> > - Use drm_pagemap dma cursor
> > - Take notifier lock in bind code to check range state
> >
> > Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> > drivers/gpu/drm/xe/xe_pt.c | 170 +++++++++++++++++++++++++++--
> > --
> > drivers/gpu/drm/xe/xe_pt_types.h | 2 +
> > drivers/gpu/drm/xe/xe_svm.c | 49 ++++++++-
> > drivers/gpu/drm/xe/xe_svm.h | 17 ++++
> > drivers/gpu/drm/xe/xe_vm.c | 80 +++++++++++++++
> > drivers/gpu/drm/xe/xe_vm.h | 5 +
> > drivers/gpu/drm/xe/xe_vm_types.h | 19 ++++
> > 7 files changed, 319 insertions(+), 23 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
> > index 282476c4edbd..024e4eb83408 100644
> > --- a/drivers/gpu/drm/xe/xe_pt.c
> > +++ b/drivers/gpu/drm/xe/xe_pt.c
> > @@ -587,6 +587,7 @@ static const struct xe_pt_walk_ops
> > xe_pt_stage_bind_ops = {
> > * range.
> > * @tile: The tile we're building for.
> > * @vma: The vma indicating the address range.
> > + * @range: The range indicating the address range.
> > * @entries: Storage for the update entries used for connecting the
> > tree to
> > * the main tree at commit time.
> > * @num_entries: On output contains the number of @entries used.
> > @@ -602,6 +603,7 @@ static const struct xe_pt_walk_ops
> > xe_pt_stage_bind_ops = {
> > */
> > static int
> > xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
> > + struct xe_svm_range *range,
> > struct xe_vm_pgtable_update *entries, u32
> > *num_entries)
>
> Really the same comment here, should rework the interface to drop vma
> and range, although since it's more involved here let's do it as a
> follow-up.
>
Yep, agree. We have a Jira open for this, but it is a larger rework, mainly
moving the tile_present / tile_invalidated tracking from the VMAs / ranges to
the PT state.
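Roughly the direction is something like the below (purely a placeholder
sketch, names invented):

/*
 * Placeholder sketch -- track presence / invalidation in page-table
 * state rather than on the VMA or SVM range, so the bind / unbind
 * paths no longer care which of the two they were handed.
 */
struct xe_pt_state {
	/** @tile_present: tiles holding valid PTEs for this span */
	u8 tile_present;
	/** @tile_invalidated: tiles whose PTEs have been zapped */
	u8 tile_invalidated;
};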
> > {
> > struct xe_device *xe = tile_to_xe(tile);
> > @@ -618,14 +620,38 @@ xe_pt_stage_bind(struct xe_tile *tile, struct
> > xe_vma *vma,
> > .vm = xe_vma_vm(vma),
> > .tile = tile,
> > .curs = &curs,
> > - .va_curs_start = xe_vma_start(vma),
> > + .va_curs_start = range ? range->base.va.start :
> > + xe_vma_start(vma),
> > .vma = vma,
> > .wupd.entries = entries,
> > - .needs_64K = (xe_vma_vm(vma)->flags &
> > XE_VM_FLAG_64K) && is_devmem,
> > };
> > struct xe_pt *pt = xe_vma_vm(vma)->pt_root[tile->id];
> > int ret;
> >
> > + if (range) {
> > + /* Move this entire thing to xe_svm.c? */
> > + xe_svm_notifier_lock(xe_vma_vm(vma));
> > + if (!xe_svm_range_pages_valid(range)) {
> > + xe_svm_notifier_unlock(xe_vma_vm(vma));
> > + return -EAGAIN;
> > + }
> > + if (xe_svm_range_has_dma_mapping(range)) {
> > + xe_res_first_dma(range->base.dma_addr, 0,
> > + range->base.va.end - range-
> > >base.va.start,
> > + &curs);
> > + is_devmem = xe_res_is_vram(&curs);
> > + } else {
> > + xe_assert(xe, false);
> > + }
> > + /*
> > + * Note, when unlocking the resource cursor dma
> > addresses may become
> > + * stale, but the bind will be aborted anyway att
> > commit time.
> > + */
> > + xe_svm_notifier_unlock(xe_vma_vm(vma));
> > + }
> > +
> > + xe_walk.needs_64K = (xe_vma_vm(vma)->flags & XE_VM_FLAG_64K)
> > && is_devmem;
> > +
> > /**
> > * Default atomic expectations for different allocation
> > scenarios are as follows:
> > *
> > @@ -647,7 +673,7 @@ xe_pt_stage_bind(struct xe_tile *tile, struct
> > xe_vma *vma,
> > * gets migrated to LMEM, bind such
> > allocations with
> > * device atomics enabled.
> > */
> > - else if (is_devmem &&
> > !xe_bo_has_single_placement(bo))
> > + else if (is_devmem)
> > xe_walk.default_pte |=
> > XE_USM_PPGTT_PTE_AE;
> > } else {
> > xe_walk.default_pte |= XE_USM_PPGTT_PTE_AE;
> > @@ -663,15 +689,16 @@ xe_pt_stage_bind(struct xe_tile *tile, struct
> > xe_vma *vma,
> >
> > if (is_devmem) {
> > xe_walk.default_pte |= XE_PPGTT_PTE_DM;
> > - xe_walk.dma_offset = vram_region_gpu_offset(bo-
> > >ttm.resource);
> > + xe_walk.dma_offset = bo ? vram_region_gpu_offset(bo-
> > >ttm.resource) : 0;
> > }
> >
> > if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
> > xe_walk.dma_offset =
> > xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
> >
> > - xe_bo_assert_held(bo);
> > + if (!range)
> > + xe_bo_assert_held(bo);
> >
> > - if (!xe_vma_is_null(vma)) {
> > + if (!xe_vma_is_null(vma) && !range) {
> > if (xe_vma_is_userptr(vma))
> > xe_res_first_sg(to_userptr_vma(vma)-
> > >userptr.sg, 0,
> > xe_vma_size(vma), &curs);
> > @@ -681,12 +708,14 @@ xe_pt_stage_bind(struct xe_tile *tile, struct
> > xe_vma *vma,
> > else
> > xe_res_first_sg(xe_bo_sg(bo),
> > xe_vma_bo_offset(vma),
> > xe_vma_size(vma), &curs);
> > - } else {
> > + } else if (!range) {
> > curs.size = xe_vma_size(vma);
> > }
> >
> > - ret = xe_pt_walk_range(&pt->base, pt->level,
> > xe_vma_start(vma),
> > - xe_vma_end(vma), &xe_walk.base);
> > + ret = xe_pt_walk_range(&pt->base, pt->level,
> > + range ? range->base.va.start :
> > xe_vma_start(vma),
> > + range ? range->base.va.end :
> > xe_vma_end(vma),
> > + &xe_walk.base);
> >
> > *num_entries = xe_walk.wupd.num_used_entries;
> > return ret;
> > @@ -902,7 +931,7 @@ static void xe_pt_commit_locks_assert(struct
> > xe_vma *vma)
> >
> > lockdep_assert_held(&vm->lock);
> >
> > - if (!xe_vma_is_userptr(vma) && !xe_vma_is_null(vma))
> > + if (!xe_vma_has_no_bo(vma))
> > dma_resv_assert_held(xe_vma_bo(vma)->ttm.base.resv);
> >
> > xe_vm_assert_held(vm);
> > @@ -1004,12 +1033,13 @@ static void xe_pt_free_bind(struct
> > xe_vm_pgtable_update *entries,
> >
> > static int
> > xe_pt_prepare_bind(struct xe_tile *tile, struct xe_vma *vma,
> > + struct xe_svm_range *range,
> > struct xe_vm_pgtable_update *entries, u32
> > *num_entries)
> > {
> > int err;
> >
> > *num_entries = 0;
> > - err = xe_pt_stage_bind(tile, vma, entries, num_entries);
> > + err = xe_pt_stage_bind(tile, vma, range, entries,
> > num_entries);
> > if (!err)
> > xe_tile_assert(tile, *num_entries);
> >
> > @@ -1115,6 +1145,8 @@ static int op_add_deps(struct xe_vm *vm, struct
> > xe_vma_op *op,
> > case DRM_GPUVA_OP_PREFETCH:
> > err = vma_add_deps(gpuva_to_vma(op-
> > >base.prefetch.va), job);
> > break;
> > + case DRM_GPUVA_OP_USER:
> > + break;
> > default:
> > drm_warn(&vm->xe->drm, "NOT POSSIBLE");
> > }
> > @@ -1339,6 +1371,34 @@ static int xe_pt_userptr_pre_commit(struct
> > xe_migrate_pt_update *pt_update)
> > return err;
> > }
> >
> > +static int xe_pt_svm_pre_commit(struct xe_migrate_pt_update
> > *pt_update)
> > +{
> > + struct xe_vm *vm = pt_update->vops->vm;
> > + struct xe_vma_ops *vops = pt_update->vops;
> > + struct xe_vma_op *op;
> > + int err;
> > +
> > + err = xe_pt_pre_commit(pt_update);
> > + if (err)
> > + return err;
> > +
> > + xe_svm_notifier_lock(vm);
> > +
> > + list_for_each_entry(op, &vops->list, link) {
> > + struct xe_svm_range *range = op->map_range.range;
> > +
> > + xe_assert(vm->xe, xe_vma_is_system_allocator(op-
> > >map_range.vma));
> > + xe_assert(vm->xe, op->subop ==
> > XE_VMA_SUBOP_MAP_RANGE);
> > +
> > + if (!xe_svm_range_pages_valid(range)) {
> > + xe_svm_notifier_unlock(vm);
> > + return -EAGAIN;
> > + }
> > + }
> > +
> > + return 0;
> > +}
> > +
> > struct invalidation_fence {
> > struct xe_gt_tlb_invalidation_fence base;
> > struct xe_gt *gt;
> > @@ -1632,12 +1692,12 @@ xe_pt_commit_prepare_unbind(struct xe_vma
> > *vma,
> >
> > static void
> > xe_pt_update_ops_rfence_interval(struct xe_vm_pgtable_update_ops
> > *pt_update_ops,
> > - struct xe_vma *vma)
> > + u64 start, u64 end)
> > {
> > + u64 last;
> > u32 current_op = pt_update_ops->current_op;
> > struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops-
> > >ops[current_op];
> > int i, level = 0;
> > - u64 start, last;
> >
> > for (i = 0; i < pt_op->num_entries; i++) {
> > const struct xe_vm_pgtable_update *entry = &pt_op-
> > >entries[i];
> > @@ -1647,8 +1707,8 @@ xe_pt_update_ops_rfence_interval(struct
> > xe_vm_pgtable_update_ops *pt_update_ops,
> > }
> >
> > /* Greedy (non-optimal) calculation but simple */
> > - start = ALIGN_DOWN(xe_vma_start(vma), 0x1ull <<
> > xe_pt_shift(level));
> > - last = ALIGN(xe_vma_end(vma), 0x1ull << xe_pt_shift(level))
> > - 1;
> > + start = ALIGN_DOWN(start, 0x1ull << xe_pt_shift(level));
> > + last = ALIGN(end, 0x1ull << xe_pt_shift(level)) - 1;
> >
> > if (start < pt_update_ops->start)
> > pt_update_ops->start = start;
> > @@ -1690,7 +1750,7 @@ static int bind_op_prepare(struct xe_vm *vm,
> > struct xe_tile *tile,
> > if (err)
> > return err;
> >
> > - err = xe_pt_prepare_bind(tile, vma, pt_op->entries,
> > + err = xe_pt_prepare_bind(tile, vma, NULL, pt_op->entries,
> > &pt_op->num_entries);
> > if (!err) {
> > xe_tile_assert(tile, pt_op->num_entries <=
> > @@ -1698,7 +1758,9 @@ static int bind_op_prepare(struct xe_vm *vm,
> > struct xe_tile *tile,
> > xe_vm_dbg_print_entries(tile_to_xe(tile), pt_op-
> > >entries,
> > pt_op->num_entries, true);
> >
> > - xe_pt_update_ops_rfence_interval(pt_update_ops,
> > vma);
> > + xe_pt_update_ops_rfence_interval(pt_update_ops,
> > + xe_vma_start(vma),
> > + xe_vma_end(vma));
> > ++pt_update_ops->current_op;
> > pt_update_ops->needs_userptr_lock |=
> > xe_vma_is_userptr(vma);
> >
> > @@ -1732,6 +1794,48 @@ static int bind_op_prepare(struct xe_vm *vm,
> > struct xe_tile *tile,
> > return err;
> > }
> >
> > +static int bind_range_prepare(struct xe_vm *vm, struct xe_tile
> > *tile,
> > + struct xe_vm_pgtable_update_ops
> > *pt_update_ops,
> > + struct xe_vma *vma, struct
> > xe_svm_range *range)
> > +{
> > + u32 current_op = pt_update_ops->current_op;
> > + struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops-
> > >ops[current_op];
> > + int err;
> > +
> > + xe_tile_assert(tile, xe_vma_is_system_allocator(vma));
> > +
> > + vm_dbg(&xe_vma_vm(vma)->xe->drm,
> > + "Preparing bind, with range [%llx...%llx)\n",
> > + range->base.va.start, range->base.va.end - 1);
> > +
> > + pt_op->vma = NULL;
> > + pt_op->bind = true;
> > + pt_op->rebind = BIT(tile->id) & range->tile_present;
> > +
> > + err = xe_pt_prepare_bind(tile, vma, range, pt_op->entries,
> > + &pt_op->num_entries);
> > + if (!err) {
> > + xe_tile_assert(tile, pt_op->num_entries <=
> > + ARRAY_SIZE(pt_op->entries));
> > + xe_vm_dbg_print_entries(tile_to_xe(tile), pt_op-
> > >entries,
> > + pt_op->num_entries, true);
> > +
> > + xe_pt_update_ops_rfence_interval(pt_update_ops,
> > + range-
> > >base.va.start,
> > + range-
> > >base.va.end);
> > + ++pt_update_ops->current_op;
> > + pt_update_ops->needs_svm_lock = true;
> > +
> > + pt_op->vma = vma;
> > + xe_pt_commit_prepare_bind(vma, pt_op->entries,
> > + pt_op->num_entries, pt_op-
> > >rebind);
> > + } else {
> > + xe_pt_cancel_bind(vma, pt_op->entries, pt_op-
> > >num_entries);
> > + }
> > +
> > + return err;
> > +}
> > +
> > static int unbind_op_prepare(struct xe_tile *tile,
> > struct xe_vm_pgtable_update_ops
> > *pt_update_ops,
> > struct xe_vma *vma)
> > @@ -1769,7 +1873,8 @@ static int unbind_op_prepare(struct xe_tile
> > *tile,
> >
> > xe_vm_dbg_print_entries(tile_to_xe(tile), pt_op->entries,
> > pt_op->num_entries, false);
> > - xe_pt_update_ops_rfence_interval(pt_update_ops, vma);
> > + xe_pt_update_ops_rfence_interval(pt_update_ops,
> > xe_vma_start(vma),
> > + xe_vma_end(vma));
> > ++pt_update_ops->current_op;
> > pt_update_ops->needs_userptr_lock |= xe_vma_is_userptr(vma);
> > pt_update_ops->needs_invalidation = true;
> > @@ -1839,6 +1944,15 @@ static int op_prepare(struct xe_vm *vm,
> > pt_update_ops->wait_vm_kernel = true;
> > break;
> > }
> > + case DRM_GPUVA_OP_USER:
> > + if (op->subop == XE_VMA_SUBOP_MAP_RANGE) {
>
> See question below on subops
>
> > + xe_assert(vm->xe,
> > xe_vma_is_system_allocator(op->map_range.vma));
> > +
> > + err = bind_range_prepare(vm, tile,
> > pt_update_ops,
> > + op->map_range.vma,
> > + op-
> > >map_range.range);
> > + }
> > + break;
> > default:
> > drm_warn(&vm->xe->drm, "NOT POSSIBLE");
> > }
> > @@ -2020,6 +2134,14 @@ static void op_commit(struct xe_vm *vm,
> > fence2);
> > break;
> > }
> > + case DRM_GPUVA_OP_USER:
> > + {
> > + if (op->subop == XE_VMA_SUBOP_MAP_RANGE) {
> > + op->map_range.range->tile_present |=
> > BIT(tile->id);
> > + op->map_range.range->tile_invalidated &=
> > ~BIT(tile->id);
> > + }
> > + break;
> > + }
> > default:
> > drm_warn(&vm->xe->drm, "NOT POSSIBLE");
> > }
> > @@ -2037,6 +2159,12 @@ static const struct xe_migrate_pt_update_ops
> > userptr_migrate_ops = {
> > .pre_commit = xe_pt_userptr_pre_commit,
> > };
> >
> > +static const struct xe_migrate_pt_update_ops svm_migrate_ops = {
> > + .populate = xe_vm_populate_pgtable,
> > + .clear = xe_migrate_clear_pgtable_callback,
> > + .pre_commit = xe_pt_svm_pre_commit,
> > +};
> > +
> > /**
> > * xe_pt_update_ops_run() - Run PT update operations
> > * @tile: Tile of PT update operations
> > @@ -2062,7 +2190,9 @@ xe_pt_update_ops_run(struct xe_tile *tile,
> > struct xe_vma_ops *vops)
> > struct xe_vma_op *op;
> > int err = 0, i;
> > struct xe_migrate_pt_update update = {
> > - .ops = pt_update_ops->needs_userptr_lock ?
> > + .ops = pt_update_ops->needs_svm_lock ?
> > + &svm_migrate_ops :
> > + pt_update_ops->needs_userptr_lock ?
> > &userptr_migrate_ops :
> > &migrate_ops,
> > .vops = vops,
> > @@ -2183,6 +2313,8 @@ xe_pt_update_ops_run(struct xe_tile *tile,
> > struct xe_vma_ops *vops)
> > &ifence->base.base, &mfence-
> > >base.base);
> > }
> >
> > + if (pt_update_ops->needs_svm_lock)
> > + xe_svm_notifier_unlock(vm);
> > if (pt_update_ops->needs_userptr_lock)
> > up_read(&vm->userptr.notifier_lock);
> >
> > diff --git a/drivers/gpu/drm/xe/xe_pt_types.h
> > b/drivers/gpu/drm/xe/xe_pt_types.h
> > index 384cc04de719..69eab6f37cfe 100644
> > --- a/drivers/gpu/drm/xe/xe_pt_types.h
> > +++ b/drivers/gpu/drm/xe/xe_pt_types.h
> > @@ -104,6 +104,8 @@ struct xe_vm_pgtable_update_ops {
> > u32 num_ops;
> > /** @current_op: current operations */
> > u32 current_op;
> > + /** @needs_svm_lock: Needs SVM lock */
> > + bool needs_svm_lock;
> > /** @needs_userptr_lock: Needs userptr lock */
> > bool needs_userptr_lock;
> > /** @needs_invalidation: Needs invalidation */
> > diff --git a/drivers/gpu/drm/xe/xe_svm.c
> > b/drivers/gpu/drm/xe/xe_svm.c
> > index b2bc259978c4..a9addaea316d 100644
> > --- a/drivers/gpu/drm/xe/xe_svm.c
> > +++ b/drivers/gpu/drm/xe/xe_svm.c
> > @@ -209,8 +209,8 @@ void xe_svm_close(struct xe_vm *vm)
> > xe_assert(vm->xe, xe_vm_is_closed(vm));
> >
> > /* Flush running notifiers making xe_vm_close() visable */
> > - drm_gpusvm_notifier_lock(&vm->svm.gpusvm);
> > - drm_gpusvm_notifier_unlock(&vm->svm.gpusvm);
> > + xe_svm_notifier_lock(vm);
> > + xe_svm_notifier_unlock(vm);
> > }
> >
> > void xe_svm_fini(struct xe_vm *vm)
> > @@ -220,12 +220,22 @@ void xe_svm_fini(struct xe_vm *vm)
> > drm_gpusvm_fini(&vm->svm.gpusvm);
> > }
> >
> > +static bool xe_svm_range_is_valid(struct xe_svm_range *range,
> > + struct xe_tile *tile)
> > +{
> > + return (range->tile_present & ~range->tile_invalidated) &
> > BIT(tile->id);
> > +}
> > +
> > int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
> > struct xe_tile *tile, u64 fault_addr,
> > bool atomic)
> > {
> > struct drm_gpusvm_ctx ctx = { .read_only =
> > xe_vma_read_only(vma), };
> > + struct xe_svm_range *range;
> > struct drm_gpusvm_range *r;
> > + struct drm_exec exec;
> > + struct dma_fence *fence;
> > + ktime_t end = 0;
> > int err;
> >
> > lockdep_assert_held_write(&vm->lock);
> > @@ -239,11 +249,42 @@ int xe_svm_handle_pagefault(struct xe_vm *vm,
> > struct xe_vma *vma,
> > if (IS_ERR(r))
> > return PTR_ERR(r);
> >
> > - err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r, false);
> > + range = to_xe_range(r);
> > + if (xe_svm_range_is_valid(range, tile))
> > + return 0;
> > +
> > + err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r, &ctx);
> > if (err == -EFAULT || err == -EPERM) /* Corner where CPU
> > mappings have change */
> > goto retry;
> > + if (err)
> > + goto err_out;
> > +
> > +retry_bind:
> > + drm_exec_init(&exec, 0, 0);
> > + drm_exec_until_all_locked(&exec) {
> > + err = drm_exec_lock_obj(&exec, vm->gpuvm.r_obj);
> > + drm_exec_retry_on_contention(&exec);
> > + if (err) {
> > + drm_exec_fini(&exec);
> > + goto err_out;
> > + }
> > +
> > + fence = xe_vm_range_rebind(vm, vma, range, BIT(tile-
> > >id));
> > + if (IS_ERR(fence)) {
> > + drm_exec_fini(&exec);
> > + err = PTR_ERR(fence);
> > + if (err == -EAGAIN)
> > + goto retry;
> > + if (xe_vm_validate_should_retry(&exec, err,
> > &end))
> > + goto retry_bind;
> > + goto err_out;
> > + }
> > + }
> > + drm_exec_fini(&exec);
> >
> > - /* TODO: Issue bind */
> > + dma_fence_wait(fence, false);
> > + dma_fence_put(fence);
> >
> > +err_out:
> > return err;
> > }
> > diff --git a/drivers/gpu/drm/xe/xe_svm.h
> > b/drivers/gpu/drm/xe/xe_svm.h
> > index c91c5f538024..ee0bd1ae655b 100644
> > --- a/drivers/gpu/drm/xe/xe_svm.h
> > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > @@ -29,4 +29,21 @@ int xe_svm_handle_pagefault(struct xe_vm *vm,
> > struct xe_vma *vma,
> > struct xe_tile *tile, u64 fault_addr,
> > bool atomic);
> >
> > +static inline bool xe_svm_range_pages_valid(struct xe_svm_range
> > *range)
> > +{
> > + return drm_gpusvm_range_pages_valid(range->base.gpusvm,
> > &range->base);
> > +}
> > +
> > +static inline bool xe_svm_range_has_dma_mapping(struct xe_svm_range
> > *range)
> > +{
> > + lockdep_assert_held(&range->base.gpusvm->notifier_lock);
> > + return range->base.flags.has_dma_mapping;
> > +}
> > +
> > +#define xe_svm_notifier_lock(vm__) \
> > + drm_gpusvm_notifier_lock(&(vm__)->svm.gpusvm)
> > +
> > +#define xe_svm_notifier_unlock(vm__) \
> > + drm_gpusvm_notifier_unlock(&(vm__)->svm.gpusvm)
> > +
> > #endif
> > diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> > index b11fb0ac9520..63aa0a25d3b7 100644
> > --- a/drivers/gpu/drm/xe/xe_vm.c
> > +++ b/drivers/gpu/drm/xe/xe_vm.c
> > @@ -894,6 +894,84 @@ struct dma_fence *xe_vma_rebind(struct xe_vm
> > *vm, struct xe_vma *vma, u8 tile_ma
> > return fence;
> > }
> >
> > +static void xe_vm_populate_range_rebind(struct xe_vma_op *op,
> > + struct xe_vma *vma,
> > + struct xe_svm_range *range,
> > + u8 tile_mask)
> > +{
> > + INIT_LIST_HEAD(&op->link);
> > + op->tile_mask = tile_mask;
> > + op->base.op = DRM_GPUVA_OP_USER;
> > + op->subop = XE_VMA_SUBOP_MAP_RANGE;
> > + op->map_range.vma = vma;
> > + op->map_range.range = range;
> > +}
> > +
> > +static int
> > +xe_vm_ops_add_range_rebind(struct xe_vma_ops *vops,
> > + struct xe_vma *vma,
> > + struct xe_svm_range *range,
> > + u8 tile_mask)
> > +{
> > + struct xe_vma_op *op;
> > +
> > + op = kzalloc(sizeof(*op), GFP_KERNEL);
> > + if (!op)
> > + return -ENOMEM;
> > +
> > + xe_vm_populate_range_rebind(op, vma, range, tile_mask);
> > + list_add_tail(&op->link, &vops->list);
> > + xe_vma_ops_incr_pt_update_ops(vops, tile_mask);
> > +
> > + return 0;
> > +}
> > +
> > +struct dma_fence *xe_vm_range_rebind(struct xe_vm *vm,
> > + struct xe_vma *vma,
> > + struct xe_svm_range *range,
> > + u8 tile_mask)
>
> kerneldoc
>
Will add.
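Roughly along these lines, I think (sketch only, wording not final):

/**
 * xe_vm_range_rebind() - VM range (re)bind
 * @vm: The VM which the range (and VMA) belongs to.
 * @vma: The system allocator VMA which the range belongs to.
 * @range: SVM range to rebind.
 * @tile_mask: Tile mask of tiles to rebind the range on.
 *
 * (Re)bind an SVM range to the GPU page tables of @vm on the tiles in
 * @tile_mask.
 *
 * Return: dma fence for the rebind to signal on completion, or an ERR_PTR
 * on failure.
 */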
> > +{
> > + struct dma_fence *fence = NULL;
> > + struct xe_vma_ops vops;
> > + struct xe_vma_op *op, *next_op;
> > + struct xe_tile *tile;
> > + u8 id;
> > + int err;
> > +
> > + lockdep_assert_held(&vm->lock);
> > + xe_vm_assert_held(vm);
> > + xe_assert(vm->xe, xe_vm_in_fault_mode(vm));
> > + xe_assert(vm->xe, xe_vma_is_system_allocator(vma));
> > +
> > + xe_vma_ops_init(&vops, vm, NULL, NULL, 0);
> > + for_each_tile(tile, vm->xe, id) {
> > + vops.pt_update_ops[id].wait_vm_bookkeep = true;
> > + vops.pt_update_ops[tile->id].q =
> > + xe_tile_migrate_exec_queue(tile);
> > + }
> > +
> > + err = xe_vm_ops_add_range_rebind(&vops, vma, range,
> > tile_mask);
> > + if (err)
> > + return ERR_PTR(err);
> > +
> > + err = xe_vma_ops_alloc(&vops, false);
> > + if (err) {
> > + fence = ERR_PTR(err);
> > + goto free_ops;
> > + }
> > +
> > + fence = ops_execute(vm, &vops);
> > +
> > +free_ops:
> > + list_for_each_entry_safe(op, next_op, &vops.list, link) {
> > + list_del(&op->link);
> > + kfree(op);
> > + }
> > + xe_vma_ops_fini(&vops);
> > +
> > + return fence;
> > +}
> > +
> > static void xe_vma_free(struct xe_vma *vma)
> > {
> > if (xe_vma_is_userptr(vma))
> > @@ -2514,6 +2592,8 @@ static void op_trace(struct xe_vma_op *op)
> > case DRM_GPUVA_OP_PREFETCH:
> > trace_xe_vma_bind(gpuva_to_vma(op-
> > >base.prefetch.va));
> > break;
> > + case DRM_GPUVA_OP_USER:
> > + break;
> > default:
> > XE_WARN_ON("NOT POSSIBLE");
> > }
> > diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
> > index 1a5aed678214..8bd921b33090 100644
> > --- a/drivers/gpu/drm/xe/xe_vm.h
> > +++ b/drivers/gpu/drm/xe/xe_vm.h
> > @@ -22,6 +22,7 @@ struct ttm_validate_buffer;
> > struct xe_exec_queue;
> > struct xe_file;
> > struct xe_sync_entry;
> > +struct xe_svm_range;
> > struct drm_exec;
> >
> > struct xe_vm *xe_vm_create(struct xe_device *xe, u32 flags);
> > @@ -217,6 +218,10 @@ int xe_vm_userptr_check_repin(struct xe_vm *vm);
> > int xe_vm_rebind(struct xe_vm *vm, bool rebind_worker);
> > struct dma_fence *xe_vma_rebind(struct xe_vm *vm, struct xe_vma
> > *vma,
> > u8 tile_mask);
> > +struct dma_fence *xe_vm_range_rebind(struct xe_vm *vm,
> > + struct xe_vma *vma,
> > + struct xe_svm_range *range,
> > + u8 tile_mask);
> >
> > int xe_vm_invalidate_vma(struct xe_vma *vma);
> >
> > diff --git a/drivers/gpu/drm/xe/xe_vm_types.h
> > b/drivers/gpu/drm/xe/xe_vm_types.h
> > index bd1c0e368238..b736e53779d2 100644
> > --- a/drivers/gpu/drm/xe/xe_vm_types.h
> > +++ b/drivers/gpu/drm/xe/xe_vm_types.h
> > @@ -19,6 +19,7 @@
> > #include "xe_range_fence.h"
> >
> > struct xe_bo;
> > +struct xe_svm_range;
> > struct xe_sync_entry;
> > struct xe_user_fence;
> > struct xe_vm;
> > @@ -334,6 +335,14 @@ struct xe_vma_op_prefetch {
> > u32 region;
> > };
> >
> > +/** struct xe_vma_op_map_range - VMA map range operation */
> > +struct xe_vma_op_map_range {
> > + /** @vma: VMA to map (system allocator VMA) */
> > + struct xe_vma *vma;
> > + /** @range: SVM range to map */
> > + struct xe_svm_range *range;
> > +};
> > +
> > /** enum xe_vma_op_flags - flags for VMA operation */
> > enum xe_vma_op_flags {
> > /** @XE_VMA_OP_COMMITTED: VMA operation committed */
> > @@ -344,6 +353,12 @@ enum xe_vma_op_flags {
> > XE_VMA_OP_NEXT_COMMITTED = BIT(2),
> > };
> >
> > +/** enum xe_vma_subop - VMA sub-operation */
> > +enum xe_vma_subop {
> > + /** @XE_VMA_SUBOP_MAP_RANGE: Map range */
> > + XE_VMA_SUBOP_MAP_RANGE,
>
> Instead of introducing subops, should we perhaps consider
> DRM_GPUVMA_OP_USER plus any following op as driver defined so that the
> next subop would instead be DRM_GPUVMA_OP_USER + 1?
>
Since DRM_GPUVMA_OP_* is an enum, that doesn't really work (think case
statements + warnings). At one point I had it this way, but then we would
need to define a new enum entry for *every* driver-defined op, e.g.
DRM_GPUVMA_OP_USER1 -> Range MAP
DRM_GPUVMA_OP_USER2 -> Range UNMAP
So with this, I just added one driver enum entry plus subops. Of course, if
we changed DRM_GPUVMA_OP_* to defines I think this problem goes away, but
that would need larger community buy-in.
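As a standalone illustration of the case-statement point (not Xe code; names
and values here are made up for the example), a single OP_USER entry plus a
driver subop keeps the switch over the enum fully covered:

#include <stdio.h>

/* Stand-ins for the real enums; values and names are invented. */
enum gpuva_op { OP_MAP, OP_REMAP, OP_UNMAP, OP_PREFETCH, OP_USER };
enum drv_subop { SUBOP_MAP_RANGE };

struct demo_op {
	enum gpuva_op base;
	enum drv_subop subop;
};

static void dispatch(const struct demo_op *op)
{
	/* Covering every enumerator keeps -Wswitch happy without a default:. */
	switch (op->base) {
	case OP_MAP:
	case OP_REMAP:
	case OP_UNMAP:
	case OP_PREFETCH:
		printf("core GPUVA op %d\n", op->base);
		break;
	case OP_USER:
		/* All driver-defined work hangs off the single OP_USER entry. */
		if (op->subop == SUBOP_MAP_RANGE)
			printf("driver subop: map range\n");
		break;
	}
	/*
	 * With an "OP_USER + 1" scheme instead, the driver would be switching
	 * on values that are not enumerators of gpuva_op, which is exactly
	 * the case-statement / warning awkwardness mentioned above.
	 */
}

int main(void)
{
	struct demo_op op = { .base = OP_USER, .subop = SUBOP_MAP_RANGE };

	dispatch(&op);
	return 0;
}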
> > +};
> > +
> > /** struct xe_vma_op - VMA operation */
> > struct xe_vma_op {
> > /** @base: GPUVA base operation */
> > @@ -352,6 +367,8 @@ struct xe_vma_op {
> > struct list_head link;
> > /** @flags: operation flags */
> > enum xe_vma_op_flags flags;
> > + /** @subop: user defined sub-operation */
> > + enum xe_vma_subop subop;
> > /** @tile_mask: Tile mask for operation */
> > u8 tile_mask;
> >
> > @@ -362,6 +379,8 @@ struct xe_vma_op {
> > struct xe_vma_op_remap remap;
> > /** @prefetch: VMA prefetch operation specific data
> > */
> > struct xe_vma_op_prefetch prefetch;
> > + /** @map: VMA map range operation specific data */
> > + struct xe_vma_op_map_range map_range;
> > };
> > };
> >
>
> Thanks,
Thanks,
Matt
> Thomas
>
^ permalink raw reply [flat|nested] 129+ messages in thread* Re: [PATCH v2 11/29] drm/xe: Add (re)bind to SVM page fault handler
2024-12-11 19:07 ` Matthew Brost
@ 2024-12-16 10:03 ` Thomas Hellström
0 siblings, 0 replies; 129+ messages in thread
From: Thomas Hellström @ 2024-12-16 10:03 UTC (permalink / raw)
To: Matthew Brost
Cc: intel-xe, dri-devel, apopple, airlied, christian.koenig,
simona.vetter, felix.kuehling, dakr
On Wed, 2024-12-11 at 11:07 -0800, Matthew Brost wrote:
> On Tue, Nov 19, 2024 at 03:26:32PM +0100, Thomas Hellström wrote:
> > On Tue, 2024-10-15 at 20:25 -0700, Matthew Brost wrote:
> > > Add (re)bind to SVM page fault handler. To facilitate add support
> > > function to VM layer which (re)binds a SVM range. Also teach PT
> > > layer
> > > to
> > > understand (re)binds of SVM ranges.
> > >
> > > v2:
> > > - Don't assert BO lock held for range binds
> > > - Use xe_svm_notifier_lock/unlock helper in xe_svm_close
> > > - Use drm_pagemap dma cursor
> > > - Take notifier lock in bind code to check range state
> > >
> > > Signed-off-by: Thomas Hellström
> > > <thomas.hellstrom@linux.intel.com>
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > ---
> > > drivers/gpu/drm/xe/xe_pt.c | 170
> > > +++++++++++++++++++++++++++--
> > > --
> > > drivers/gpu/drm/xe/xe_pt_types.h | 2 +
> > > drivers/gpu/drm/xe/xe_svm.c | 49 ++++++++-
> > > drivers/gpu/drm/xe/xe_svm.h | 17 ++++
> > > drivers/gpu/drm/xe/xe_vm.c | 80 +++++++++++++++
> > > drivers/gpu/drm/xe/xe_vm.h | 5 +
> > > drivers/gpu/drm/xe/xe_vm_types.h | 19 ++++
> > > 7 files changed, 319 insertions(+), 23 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_pt.c
> > > b/drivers/gpu/drm/xe/xe_pt.c
> > > index 282476c4edbd..024e4eb83408 100644
> > > --- a/drivers/gpu/drm/xe/xe_pt.c
> > > +++ b/drivers/gpu/drm/xe/xe_pt.c
> > > @@ -587,6 +587,7 @@ static const struct xe_pt_walk_ops
> > > xe_pt_stage_bind_ops = {
> > > * range.
> > > * @tile: The tile we're building for.
> > > * @vma: The vma indicating the address range.
> > > + * @range: The range indicating the address range.
> > > * @entries: Storage for the update entries used for connecting
> > > the
> > > tree to
> > > * the main tree at commit time.
> > > * @num_entries: On output contains the number of @entries used.
> > > @@ -602,6 +603,7 @@ static const struct xe_pt_walk_ops
> > > xe_pt_stage_bind_ops = {
> > > */
> > > static int
> > > xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
> > > + struct xe_svm_range *range,
> > > struct xe_vm_pgtable_update *entries, u32
> > > *num_entries)
> >
> > Really the same comment here, should rework the interface to drop
> > vma
> > and range, although since it's more involved here let's do it as a
> > follow-up.
> >
>
> Yep agree. We have Jira open for this but is larger rework, mainly
> moving the tile_present / tile_invalidated from the VMA's / Range's
> to
> the PT state.
>
> > > {
> > > struct xe_device *xe = tile_to_xe(tile);
> > > @@ -618,14 +620,38 @@ xe_pt_stage_bind(struct xe_tile *tile,
> > > struct
> > > xe_vma *vma,
> > > .vm = xe_vma_vm(vma),
> > > .tile = tile,
> > > .curs = &curs,
> > > - .va_curs_start = xe_vma_start(vma),
> > > + .va_curs_start = range ? range->base.va.start :
> > > + xe_vma_start(vma),
> > > .vma = vma,
> > > .wupd.entries = entries,
> > > - .needs_64K = (xe_vma_vm(vma)->flags &
> > > XE_VM_FLAG_64K) && is_devmem,
> > > };
> > > struct xe_pt *pt = xe_vma_vm(vma)->pt_root[tile->id];
> > > int ret;
> > >
> > > + if (range) {
> > > + /* Move this entire thing to xe_svm.c? */
> > > + xe_svm_notifier_lock(xe_vma_vm(vma));
> > > + if (!xe_svm_range_pages_valid(range)) {
> > > + xe_svm_notifier_unlock(xe_vma_vm(vma));
> > > + return -EAGAIN;
> > > + }
> > > + if (xe_svm_range_has_dma_mapping(range)) {
> > > + xe_res_first_dma(range->base.dma_addr,
> > > 0,
> > > + range->base.va.end -
> > > range-
> > > > base.va.start,
> > > + &curs);
> > > + is_devmem = xe_res_is_vram(&curs);
> > > + } else {
> > > + xe_assert(xe, false);
> > > + }
> > > + /*
> > > + * Note, when unlocking the resource cursor dma
> > > addresses may become
> > > + * stale, but the bind will be aborted anyway
> > > att
> > > commit time.
> > > + */
> > > + xe_svm_notifier_unlock(xe_vma_vm(vma));
> > > + }
> > > +
> > > + xe_walk.needs_64K = (xe_vma_vm(vma)->flags &
> > > XE_VM_FLAG_64K)
> > > && is_devmem;
> > > +
> > > /**
> > > * Default atomic expectations for different allocation
> > > scenarios are as follows:
> > > *
> > > @@ -647,7 +673,7 @@ xe_pt_stage_bind(struct xe_tile *tile, struct
> > > xe_vma *vma,
> > > * gets migrated to LMEM, bind such
> > > allocations with
> > > * device atomics enabled.
> > > */
> > > - else if (is_devmem &&
> > > !xe_bo_has_single_placement(bo))
> > > + else if (is_devmem)
> > > xe_walk.default_pte |=
> > > XE_USM_PPGTT_PTE_AE;
> > > } else {
> > > xe_walk.default_pte |=
> > > XE_USM_PPGTT_PTE_AE;
> > > @@ -663,15 +689,16 @@ xe_pt_stage_bind(struct xe_tile *tile,
> > > struct
> > > xe_vma *vma,
> > >
> > > if (is_devmem) {
> > > xe_walk.default_pte |= XE_PPGTT_PTE_DM;
> > > - xe_walk.dma_offset = vram_region_gpu_offset(bo-
> > > > ttm.resource);
> > > + xe_walk.dma_offset = bo ?
> > > vram_region_gpu_offset(bo-
> > > > ttm.resource) : 0;
> > > }
> > >
> > > if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
> > > xe_walk.dma_offset =
> > > xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
> > >
> > > - xe_bo_assert_held(bo);
> > > + if (!range)
> > > + xe_bo_assert_held(bo);
> > >
> > > - if (!xe_vma_is_null(vma)) {
> > > + if (!xe_vma_is_null(vma) && !range) {
> > > if (xe_vma_is_userptr(vma))
> > > xe_res_first_sg(to_userptr_vma(vma)-
> > > > userptr.sg, 0,
> > > xe_vma_size(vma),
> > > &curs);
> > > @@ -681,12 +708,14 @@ xe_pt_stage_bind(struct xe_tile *tile,
> > > struct
> > > xe_vma *vma,
> > > else
> > > xe_res_first_sg(xe_bo_sg(bo),
> > > xe_vma_bo_offset(vma),
> > > xe_vma_size(vma),
> > > &curs);
> > > - } else {
> > > + } else if (!range) {
> > > curs.size = xe_vma_size(vma);
> > > }
> > >
> > > - ret = xe_pt_walk_range(&pt->base, pt->level,
> > > xe_vma_start(vma),
> > > - xe_vma_end(vma), &xe_walk.base);
> > > + ret = xe_pt_walk_range(&pt->base, pt->level,
> > > + range ? range->base.va.start :
> > > xe_vma_start(vma),
> > > + range ? range->base.va.end :
> > > xe_vma_end(vma),
> > > + &xe_walk.base);
> > >
> > > *num_entries = xe_walk.wupd.num_used_entries;
> > > return ret;
> > > @@ -902,7 +931,7 @@ static void xe_pt_commit_locks_assert(struct
> > > xe_vma *vma)
> > >
> > > lockdep_assert_held(&vm->lock);
> > >
> > > - if (!xe_vma_is_userptr(vma) && !xe_vma_is_null(vma))
> > > + if (!xe_vma_has_no_bo(vma))
> > > dma_resv_assert_held(xe_vma_bo(vma)-
> > > >ttm.base.resv);
> > >
> > > xe_vm_assert_held(vm);
> > > @@ -1004,12 +1033,13 @@ static void xe_pt_free_bind(struct
> > > xe_vm_pgtable_update *entries,
> > >
> > > static int
> > > xe_pt_prepare_bind(struct xe_tile *tile, struct xe_vma *vma,
> > > + struct xe_svm_range *range,
> > > struct xe_vm_pgtable_update *entries, u32
> > > *num_entries)
> > > {
> > > int err;
> > >
> > > *num_entries = 0;
> > > - err = xe_pt_stage_bind(tile, vma, entries, num_entries);
> > > + err = xe_pt_stage_bind(tile, vma, range, entries,
> > > num_entries);
> > > if (!err)
> > > xe_tile_assert(tile, *num_entries);
> > >
> > > @@ -1115,6 +1145,8 @@ static int op_add_deps(struct xe_vm *vm,
> > > struct
> > > xe_vma_op *op,
> > > case DRM_GPUVA_OP_PREFETCH:
> > > err = vma_add_deps(gpuva_to_vma(op->base.prefetch.va), job);
> > > break;
> > > + case DRM_GPUVA_OP_USER:
> > > + break;
> > > default:
> > > drm_warn(&vm->xe->drm, "NOT POSSIBLE");
> > > }
> > > @@ -1339,6 +1371,34 @@ static int xe_pt_userptr_pre_commit(struct
> > > xe_migrate_pt_update *pt_update)
> > > return err;
> > > }
> > >
> > > +static int xe_pt_svm_pre_commit(struct xe_migrate_pt_update
> > > *pt_update)
> > > +{
> > > + struct xe_vm *vm = pt_update->vops->vm;
> > > + struct xe_vma_ops *vops = pt_update->vops;
> > > + struct xe_vma_op *op;
> > > + int err;
> > > +
> > > + err = xe_pt_pre_commit(pt_update);
> > > + if (err)
> > > + return err;
> > > +
> > > + xe_svm_notifier_lock(vm);
> > > +
> > > + list_for_each_entry(op, &vops->list, link) {
> > > + struct xe_svm_range *range = op->map_range.range;
> > > +
> > > + xe_assert(vm->xe, xe_vma_is_system_allocator(op->map_range.vma));
> > > + xe_assert(vm->xe, op->subop == XE_VMA_SUBOP_MAP_RANGE);
> > > +
> > > + if (!xe_svm_range_pages_valid(range)) {
> > > + xe_svm_notifier_unlock(vm);
> > > + return -EAGAIN;
> > > + }
> > > + }
> > > +
> > > + return 0;
> > > +}
> > > +
> > > struct invalidation_fence {
> > > struct xe_gt_tlb_invalidation_fence base;
> > > struct xe_gt *gt;
> > > @@ -1632,12 +1692,12 @@ xe_pt_commit_prepare_unbind(struct xe_vma
> > > *vma,
> > >
> > > static void
> > > xe_pt_update_ops_rfence_interval(struct xe_vm_pgtable_update_ops
> > > *pt_update_ops,
> > > - struct xe_vma *vma)
> > > + u64 start, u64 end)
> > > {
> > > + u64 last;
> > > u32 current_op = pt_update_ops->current_op;
> > > struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[current_op];
> > > int i, level = 0;
> > > - u64 start, last;
> > >
> > > for (i = 0; i < pt_op->num_entries; i++) {
> > > const struct xe_vm_pgtable_update *entry = &pt_op->entries[i];
> > > @@ -1647,8 +1707,8 @@ xe_pt_update_ops_rfence_interval(struct
> > > xe_vm_pgtable_update_ops *pt_update_ops,
> > > }
> > >
> > > /* Greedy (non-optimal) calculation but simple */
> > > - start = ALIGN_DOWN(xe_vma_start(vma), 0x1ull << xe_pt_shift(level));
> > > - last = ALIGN(xe_vma_end(vma), 0x1ull << xe_pt_shift(level)) - 1;
> > > + start = ALIGN_DOWN(start, 0x1ull << xe_pt_shift(level));
> > > + last = ALIGN(end, 0x1ull << xe_pt_shift(level)) - 1;
> > >
> > > if (start < pt_update_ops->start)
> > > pt_update_ops->start = start;
> > > @@ -1690,7 +1750,7 @@ static int bind_op_prepare(struct xe_vm *vm, struct xe_tile *tile,
> > > if (err)
> > > return err;
> > >
> > > - err = xe_pt_prepare_bind(tile, vma, pt_op->entries,
> > > + err = xe_pt_prepare_bind(tile, vma, NULL, pt_op->entries,
> > > &pt_op->num_entries);
> > > if (!err) {
> > > xe_tile_assert(tile, pt_op->num_entries <=
> > > @@ -1698,7 +1758,9 @@ static int bind_op_prepare(struct xe_vm *vm, struct xe_tile *tile,
> > > xe_vm_dbg_print_entries(tile_to_xe(tile), pt_op->entries,
> > > pt_op->num_entries, true);
> > >
> > > - xe_pt_update_ops_rfence_interval(pt_update_ops, vma);
> > > + xe_pt_update_ops_rfence_interval(pt_update_ops,
> > > + xe_vma_start(vma),
> > > + xe_vma_end(vma));
> > > ++pt_update_ops->current_op;
> > > pt_update_ops->needs_userptr_lock |=
> > > xe_vma_is_userptr(vma);
> > >
> > > @@ -1732,6 +1794,48 @@ static int bind_op_prepare(struct xe_vm *vm, struct xe_tile *tile,
> > > return err;
> > > }
> > >
> > > +static int bind_range_prepare(struct xe_vm *vm, struct xe_tile *tile,
> > > + struct xe_vm_pgtable_update_ops *pt_update_ops,
> > > + struct xe_vma *vma, struct xe_svm_range *range)
> > > +{
> > > + u32 current_op = pt_update_ops->current_op;
> > > + struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[current_op];
> > > + int err;
> > > +
> > > + xe_tile_assert(tile, xe_vma_is_system_allocator(vma));
> > > +
> > > + vm_dbg(&xe_vma_vm(vma)->xe->drm,
> > > + "Preparing bind, with range [%llx...%llx)\n",
> > > + range->base.va.start, range->base.va.end - 1);
> > > +
> > > + pt_op->vma = NULL;
> > > + pt_op->bind = true;
> > > + pt_op->rebind = BIT(tile->id) & range->tile_present;
> > > +
> > > + err = xe_pt_prepare_bind(tile, vma, range, pt_op->entries,
> > > + &pt_op->num_entries);
> > > + if (!err) {
> > > + xe_tile_assert(tile, pt_op->num_entries <=
> > > + ARRAY_SIZE(pt_op->entries));
> > > + xe_vm_dbg_print_entries(tile_to_xe(tile), pt_op->entries,
> > > + pt_op->num_entries, true);
> > > +
> > > + xe_pt_update_ops_rfence_interval(pt_update_ops,
> > > + range->base.va.start,
> > > + range->base.va.end);
> > > + ++pt_update_ops->current_op;
> > > + pt_update_ops->needs_svm_lock = true;
> > > +
> > > + pt_op->vma = vma;
> > > + xe_pt_commit_prepare_bind(vma, pt_op->entries,
> > > + pt_op->num_entries, pt_op->rebind);
> > > + } else {
> > > + xe_pt_cancel_bind(vma, pt_op->entries, pt_op->num_entries);
> > > + }
> > > +
> > > + return err;
> > > +}
> > > +
> > > static int unbind_op_prepare(struct xe_tile *tile,
> > > struct xe_vm_pgtable_update_ops
> > > *pt_update_ops,
> > > struct xe_vma *vma)
> > > @@ -1769,7 +1873,8 @@ static int unbind_op_prepare(struct xe_tile
> > > *tile,
> > >
> > > xe_vm_dbg_print_entries(tile_to_xe(tile), pt_op->entries,
> > > pt_op->num_entries, false);
> > > - xe_pt_update_ops_rfence_interval(pt_update_ops, vma);
> > > + xe_pt_update_ops_rfence_interval(pt_update_ops, xe_vma_start(vma),
> > > + xe_vma_end(vma));
> > > ++pt_update_ops->current_op;
> > > pt_update_ops->needs_userptr_lock |=
> > > xe_vma_is_userptr(vma);
> > > pt_update_ops->needs_invalidation = true;
> > > @@ -1839,6 +1944,15 @@ static int op_prepare(struct xe_vm *vm,
> > > pt_update_ops->wait_vm_kernel = true;
> > > break;
> > > }
> > > + case DRM_GPUVA_OP_USER:
> > > + if (op->subop == XE_VMA_SUBOP_MAP_RANGE) {
> >
> > See question below on subops
> >
> > > + xe_assert(vm->xe, xe_vma_is_system_allocator(op->map_range.vma));
> > > +
> > > + err = bind_range_prepare(vm, tile, pt_update_ops,
> > > + op->map_range.vma,
> > > + op->map_range.range);
> > > + }
> > > + break;
> > > default:
> > > drm_warn(&vm->xe->drm, "NOT POSSIBLE");
> > > }
> > > @@ -2020,6 +2134,14 @@ static void op_commit(struct xe_vm *vm,
> > > fence2);
> > > break;
> > > }
> > > + case DRM_GPUVA_OP_USER:
> > > + {
> > > + if (op->subop == XE_VMA_SUBOP_MAP_RANGE) {
> > > + op->map_range.range->tile_present |= BIT(tile->id);
> > > + op->map_range.range->tile_invalidated &= ~BIT(tile->id);
> > > + }
> > > + break;
> > > + }
> > > default:
> > > drm_warn(&vm->xe->drm, "NOT POSSIBLE");
> > > }
> > > @@ -2037,6 +2159,12 @@ static const struct xe_migrate_pt_update_ops userptr_migrate_ops = {
> > > .pre_commit = xe_pt_userptr_pre_commit,
> > > };
> > >
> > > +static const struct xe_migrate_pt_update_ops svm_migrate_ops = {
> > > + .populate = xe_vm_populate_pgtable,
> > > + .clear = xe_migrate_clear_pgtable_callback,
> > > + .pre_commit = xe_pt_svm_pre_commit,
> > > +};
> > > +
> > > /**
> > > * xe_pt_update_ops_run() - Run PT update operations
> > > * @tile: Tile of PT update operations
> > > @@ -2062,7 +2190,9 @@ xe_pt_update_ops_run(struct xe_tile *tile,
> > > struct xe_vma_ops *vops)
> > > struct xe_vma_op *op;
> > > int err = 0, i;
> > > struct xe_migrate_pt_update update = {
> > > - .ops = pt_update_ops->needs_userptr_lock ?
> > > + .ops = pt_update_ops->needs_svm_lock ?
> > > + &svm_migrate_ops :
> > > + pt_update_ops->needs_userptr_lock ?
> > > &userptr_migrate_ops :
> > > &migrate_ops,
> > > .vops = vops,
> > > @@ -2183,6 +2313,8 @@ xe_pt_update_ops_run(struct xe_tile *tile, struct xe_vma_ops *vops)
> > > &ifence->base.base, &mfence->base.base);
> > > }
> > >
> > > + if (pt_update_ops->needs_svm_lock)
> > > + xe_svm_notifier_unlock(vm);
> > > if (pt_update_ops->needs_userptr_lock)
> > > up_read(&vm->userptr.notifier_lock);
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_pt_types.h
> > > b/drivers/gpu/drm/xe/xe_pt_types.h
> > > index 384cc04de719..69eab6f37cfe 100644
> > > --- a/drivers/gpu/drm/xe/xe_pt_types.h
> > > +++ b/drivers/gpu/drm/xe/xe_pt_types.h
> > > @@ -104,6 +104,8 @@ struct xe_vm_pgtable_update_ops {
> > > u32 num_ops;
> > > /** @current_op: current operations */
> > > u32 current_op;
> > > + /** @needs_svm_lock: Needs SVM lock */
> > > + bool needs_svm_lock;
> > > /** @needs_userptr_lock: Needs userptr lock */
> > > bool needs_userptr_lock;
> > > /** @needs_invalidation: Needs invalidation */
> > > diff --git a/drivers/gpu/drm/xe/xe_svm.c
> > > b/drivers/gpu/drm/xe/xe_svm.c
> > > index b2bc259978c4..a9addaea316d 100644
> > > --- a/drivers/gpu/drm/xe/xe_svm.c
> > > +++ b/drivers/gpu/drm/xe/xe_svm.c
> > > @@ -209,8 +209,8 @@ void xe_svm_close(struct xe_vm *vm)
> > > xe_assert(vm->xe, xe_vm_is_closed(vm));
> > >
> > > /* Flush running notifiers making xe_vm_close() visable
> > > */
> > > - drm_gpusvm_notifier_lock(&vm->svm.gpusvm);
> > > - drm_gpusvm_notifier_unlock(&vm->svm.gpusvm);
> > > + xe_svm_notifier_lock(vm);
> > > + xe_svm_notifier_unlock(vm);
> > > }
> > >
> > > void xe_svm_fini(struct xe_vm *vm)
> > > @@ -220,12 +220,22 @@ void xe_svm_fini(struct xe_vm *vm)
> > > drm_gpusvm_fini(&vm->svm.gpusvm);
> > > }
> > >
> > > +static bool xe_svm_range_is_valid(struct xe_svm_range *range,
> > > + struct xe_tile *tile)
> > > +{
> > > + return (range->tile_present & ~range->tile_invalidated) & BIT(tile->id);
> > > +}
> > > +
> > > int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma
> > > *vma,
> > > struct xe_tile *tile, u64
> > > fault_addr,
> > > bool atomic)
> > > {
> > > struct drm_gpusvm_ctx ctx = { .read_only =
> > > xe_vma_read_only(vma), };
> > > + struct xe_svm_range *range;
> > > struct drm_gpusvm_range *r;
> > > + struct drm_exec exec;
> > > + struct dma_fence *fence;
> > > + ktime_t end = 0;
> > > int err;
> > >
> > > lockdep_assert_held_write(&vm->lock);
> > > @@ -239,11 +249,42 @@ int xe_svm_handle_pagefault(struct xe_vm
> > > *vm,
> > > struct xe_vma *vma,
> > > if (IS_ERR(r))
> > > return PTR_ERR(r);
> > >
> > > - err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r,
> > > false);
> > > + range = to_xe_range(r);
> > > + if (xe_svm_range_is_valid(range, tile))
> > > + return 0;
> > > +
> > > + err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r, &ctx);
> > > if (err == -EFAULT || err == -EPERM) /* Corner where CPU mappings have change */
> > > goto retry;
> > > + if (err)
> > > + goto err_out;
> > > +
> > > +retry_bind:
> > > + drm_exec_init(&exec, 0, 0);
> > > + drm_exec_until_all_locked(&exec) {
> > > + err = drm_exec_lock_obj(&exec, vm->gpuvm.r_obj);
> > > + drm_exec_retry_on_contention(&exec);
> > > + if (err) {
> > > + drm_exec_fini(&exec);
> > > + goto err_out;
> > > + }
> > > +
> > > + fence = xe_vm_range_rebind(vm, vma, range, BIT(tile->id));
> > > + if (IS_ERR(fence)) {
> > > + drm_exec_fini(&exec);
> > > + err = PTR_ERR(fence);
> > > + if (err == -EAGAIN)
> > > + goto retry;
> > > + if (xe_vm_validate_should_retry(&exec,
> > > err,
> > > &end))
> > > + goto retry_bind;
> > > + goto err_out;
> > > + }
> > > + }
> > > + drm_exec_fini(&exec);
> > >
> > > - /* TODO: Issue bind */
> > > + dma_fence_wait(fence, false);
> > > + dma_fence_put(fence);
> > >
> > > +err_out:
> > > return err;
> > > }
> > > diff --git a/drivers/gpu/drm/xe/xe_svm.h
> > > b/drivers/gpu/drm/xe/xe_svm.h
> > > index c91c5f538024..ee0bd1ae655b 100644
> > > --- a/drivers/gpu/drm/xe/xe_svm.h
> > > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > > @@ -29,4 +29,21 @@ int xe_svm_handle_pagefault(struct xe_vm *vm,
> > > struct xe_vma *vma,
> > > struct xe_tile *tile, u64
> > > fault_addr,
> > > bool atomic);
> > >
> > > +static inline bool xe_svm_range_pages_valid(struct xe_svm_range *range)
> > > +{
> > > + return drm_gpusvm_range_pages_valid(range->base.gpusvm, &range->base);
> > > +}
> > > +
> > > +static inline bool xe_svm_range_has_dma_mapping(struct
> > > xe_svm_range
> > > *range)
> > > +{
> > > + lockdep_assert_held(&range->base.gpusvm->notifier_lock);
> > > + return range->base.flags.has_dma_mapping;
> > > +}
> > > +
> > > +#define xe_svm_notifier_lock(vm__) \
> > > + drm_gpusvm_notifier_lock(&(vm__)->svm.gpusvm)
> > > +
> > > +#define xe_svm_notifier_unlock(vm__) \
> > > + drm_gpusvm_notifier_unlock(&(vm__)->svm.gpusvm)
> > > +
> > > #endif
> > > diff --git a/drivers/gpu/drm/xe/xe_vm.c
> > > b/drivers/gpu/drm/xe/xe_vm.c
> > > index b11fb0ac9520..63aa0a25d3b7 100644
> > > --- a/drivers/gpu/drm/xe/xe_vm.c
> > > +++ b/drivers/gpu/drm/xe/xe_vm.c
> > > @@ -894,6 +894,84 @@ struct dma_fence *xe_vma_rebind(struct xe_vm *vm, struct xe_vma *vma, u8 tile_ma
> > > return fence;
> > > }
> > >
> > > +static void xe_vm_populate_range_rebind(struct xe_vma_op *op,
> > > + struct xe_vma *vma,
> > > + struct xe_svm_range
> > > *range,
> > > + u8 tile_mask)
> > > +{
> > > + INIT_LIST_HEAD(&op->link);
> > > + op->tile_mask = tile_mask;
> > > + op->base.op = DRM_GPUVA_OP_USER;
> > > + op->subop = XE_VMA_SUBOP_MAP_RANGE;
> > > + op->map_range.vma = vma;
> > > + op->map_range.range = range;
> > > +}
> > > +
> > > +static int
> > > +xe_vm_ops_add_range_rebind(struct xe_vma_ops *vops,
> > > + struct xe_vma *vma,
> > > + struct xe_svm_range *range,
> > > + u8 tile_mask)
> > > +{
> > > + struct xe_vma_op *op;
> > > +
> > > + op = kzalloc(sizeof(*op), GFP_KERNEL);
> > > + if (!op)
> > > + return -ENOMEM;
> > > +
> > > + xe_vm_populate_range_rebind(op, vma, range, tile_mask);
> > > + list_add_tail(&op->link, &vops->list);
> > > + xe_vma_ops_incr_pt_update_ops(vops, tile_mask);
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +struct dma_fence *xe_vm_range_rebind(struct xe_vm *vm,
> > > + struct xe_vma *vma,
> > > + struct xe_svm_range *range,
> > > + u8 tile_mask)
> >
> > kerneldoc
> >
>
> Will add.
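> Something along these lines, perhaps (wording is only a sketch, to be
> adjusted when actually adding it):
>
> /**
>  * xe_vm_range_rebind() - VM range (re)bind
>  * @vm: The VM which the range belongs to.
>  * @vma: The system allocator VMA which the range belongs to.
>  * @range: SVM range to rebind.
>  * @tile_mask: Tile mask of tiles to rebind the range on.
>  *
>  * (Re)bind an SVM range's GPU page tables on the tiles in @tile_mask.
>  *
>  * Return: dma fence for the rebind on success, ERR_PTR on failure.
>  */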
>
> > > +{
> > > + struct dma_fence *fence = NULL;
> > > + struct xe_vma_ops vops;
> > > + struct xe_vma_op *op, *next_op;
> > > + struct xe_tile *tile;
> > > + u8 id;
> > > + int err;
> > > +
> > > + lockdep_assert_held(&vm->lock);
> > > + xe_vm_assert_held(vm);
> > > + xe_assert(vm->xe, xe_vm_in_fault_mode(vm));
> > > + xe_assert(vm->xe, xe_vma_is_system_allocator(vma));
> > > +
> > > + xe_vma_ops_init(&vops, vm, NULL, NULL, 0);
> > > + for_each_tile(tile, vm->xe, id) {
> > > + vops.pt_update_ops[id].wait_vm_bookkeep = true;
> > > + vops.pt_update_ops[tile->id].q =
> > > + xe_tile_migrate_exec_queue(tile);
> > > + }
> > > +
> > > + err = xe_vm_ops_add_range_rebind(&vops, vma, range,
> > > tile_mask);
> > > + if (err)
> > > + return ERR_PTR(err);
> > > +
> > > + err = xe_vma_ops_alloc(&vops, false);
> > > + if (err) {
> > > + fence = ERR_PTR(err);
> > > + goto free_ops;
> > > + }
> > > +
> > > + fence = ops_execute(vm, &vops);
> > > +
> > > +free_ops:
> > > + list_for_each_entry_safe(op, next_op, &vops.list, link)
> > > {
> > > + list_del(&op->link);
> > > + kfree(op);
> > > + }
> > > + xe_vma_ops_fini(&vops);
> > > +
> > > + return fence;
> > > +}
> > > +
> > > static void xe_vma_free(struct xe_vma *vma)
> > > {
> > > if (xe_vma_is_userptr(vma))
> > > @@ -2514,6 +2592,8 @@ static void op_trace(struct xe_vma_op *op)
> > > case DRM_GPUVA_OP_PREFETCH:
> > > trace_xe_vma_bind(gpuva_to_vma(op->base.prefetch.va));
> > > break;
> > > + case DRM_GPUVA_OP_USER:
> > > + break;
> > > default:
> > > XE_WARN_ON("NOT POSSIBLE");
> > > }
> > > diff --git a/drivers/gpu/drm/xe/xe_vm.h
> > > b/drivers/gpu/drm/xe/xe_vm.h
> > > index 1a5aed678214..8bd921b33090 100644
> > > --- a/drivers/gpu/drm/xe/xe_vm.h
> > > +++ b/drivers/gpu/drm/xe/xe_vm.h
> > > @@ -22,6 +22,7 @@ struct ttm_validate_buffer;
> > > struct xe_exec_queue;
> > > struct xe_file;
> > > struct xe_sync_entry;
> > > +struct xe_svm_range;
> > > struct drm_exec;
> > >
> > > struct xe_vm *xe_vm_create(struct xe_device *xe, u32 flags);
> > > @@ -217,6 +218,10 @@ int xe_vm_userptr_check_repin(struct xe_vm
> > > *vm);
> > > int xe_vm_rebind(struct xe_vm *vm, bool rebind_worker);
> > > struct dma_fence *xe_vma_rebind(struct xe_vm *vm, struct xe_vma
> > > *vma,
> > > u8 tile_mask);
> > > +struct dma_fence *xe_vm_range_rebind(struct xe_vm *vm,
> > > + struct xe_vma *vma,
> > > + struct xe_svm_range *range,
> > > + u8 tile_mask);
> > >
> > > int xe_vm_invalidate_vma(struct xe_vma *vma);
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_vm_types.h
> > > b/drivers/gpu/drm/xe/xe_vm_types.h
> > > index bd1c0e368238..b736e53779d2 100644
> > > --- a/drivers/gpu/drm/xe/xe_vm_types.h
> > > +++ b/drivers/gpu/drm/xe/xe_vm_types.h
> > > @@ -19,6 +19,7 @@
> > > #include "xe_range_fence.h"
> > >
> > > struct xe_bo;
> > > +struct xe_svm_range;
> > > struct xe_sync_entry;
> > > struct xe_user_fence;
> > > struct xe_vm;
> > > @@ -334,6 +335,14 @@ struct xe_vma_op_prefetch {
> > > u32 region;
> > > };
> > >
> > > +/** struct xe_vma_op_map_range - VMA map range operation */
> > > +struct xe_vma_op_map_range {
> > > + /** @vma: VMA to map (system allocator VMA) */
> > > + struct xe_vma *vma;
> > > + /** @range: SVM range to map */
> > > + struct xe_svm_range *range;
> > > +};
> > > +
> > > /** enum xe_vma_op_flags - flags for VMA operation */
> > > enum xe_vma_op_flags {
> > > /** @XE_VMA_OP_COMMITTED: VMA operation committed */
> > > @@ -344,6 +353,12 @@ enum xe_vma_op_flags {
> > > XE_VMA_OP_NEXT_COMMITTED = BIT(2),
> > > };
> > >
> > > +/** enum xe_vma_subop - VMA sub-operation */
> > > +enum xe_vma_subop {
> > > + /** @XE_VMA_SUBOP_MAP_RANGE: Map range */
> > > + XE_VMA_SUBOP_MAP_RANGE,
> >
> > Instead of introducing subops, should we perhaps consider
> > DRM_GPUVMA_OP_USER plus any following op as driver defined, so that the
> > next subop would instead be DRM_GPUVMA_OP_USER + 1?
> >
>
> Since DRM_GPUVMA_OP_* is an enum, that doesn't really work (think case
> statements + warnings). At one point I had it this way, but then we need
> to define a new enum entry for *every* driver-defined op.
>
> e.g.
>
> DRM_GPUVMA_OP_USER1 -> Range MAP
> DRM_GPUVMA_OP_USER2 -> Range UNMAP
>
> So with this, I just added 1 driver enum entry plus subops. Ofc if we
> changed DRM_GPUVMA_OP_* to defines I think this problem goes away, but
> that would need larger community buy-in.
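> A stripped-down sketch of the shape I mean (names simplified and purely
> illustrative, not the actual code from the series):
>
> 	enum gpuva_op_type {
> 		GPUVA_OP_MAP,
> 		GPUVA_OP_UNMAP,
> 		GPUVA_OP_USER,		/* single driver-defined entry */
> 	};
>
> 	enum vma_subop {
> 		VMA_SUBOP_MAP_RANGE,
> 		VMA_SUBOP_UNMAP_RANGE,	/* hypothetical later addition */
> 	};
>
> 	static void dispatch(enum gpuva_op_type op, enum vma_subop subop)
> 	{
> 		switch (op) {
> 		case GPUVA_OP_MAP:
> 		case GPUVA_OP_UNMAP:
> 			/* common, core-defined handling */
> 			break;
> 		case GPUVA_OP_USER:
> 			/* fan out on the driver-private subop instead of
> 			 * adding new entries to the core enum */
> 			switch (subop) {
> 			case VMA_SUBOP_MAP_RANGE:
> 				break;
> 			case VMA_SUBOP_UNMAP_RANGE:
> 				break;
> 			}
> 			break;
> 		}
> 	}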
OK. Let's keep it as is for now then.
Thanks,
Thomas
>
> > > +};
> > > +
> > > /** struct xe_vma_op - VMA operation */
> > > struct xe_vma_op {
> > > /** @base: GPUVA base operation */
> > > @@ -352,6 +367,8 @@ struct xe_vma_op {
> > > struct list_head link;
> > > /** @flags: operation flags */
> > > enum xe_vma_op_flags flags;
> > > + /** @subop: user defined sub-operation */
> > > + enum xe_vma_subop subop;
> > > /** @tile_mask: Tile mask for operation */
> > > u8 tile_mask;
> > >
> > > @@ -362,6 +379,8 @@ struct xe_vma_op {
> > > struct xe_vma_op_remap remap;
> > > /** @prefetch: VMA prefetch operation specific
> > > data
> > > */
> > > struct xe_vma_op_prefetch prefetch;
> > > + /** @map_range: VMA map range operation specific data */
> > > + struct xe_vma_op_map_range map_range;
> > > };
> > > };
> > >
> >
> > Thanks,
>
> Thanks,
> Matt
>
> > Thomas
> >
^ permalink raw reply [flat|nested] 129+ messages in thread
* [PATCH v2 12/29] drm/xe: Add SVM garbage collector
2024-10-16 3:24 [PATCH v2 00/29] Introduce GPU SVM and Xe SVM implementation Matthew Brost
` (10 preceding siblings ...)
2024-10-16 3:25 ` [PATCH v2 11/29] drm/xe: Add (re)bind to SVM page fault handler Matthew Brost
@ 2024-10-16 3:25 ` Matthew Brost
2024-11-19 14:45 ` Thomas Hellström
2024-10-16 3:25 ` [PATCH v2 13/29] drm/xe: Add unbind to " Matthew Brost
` (19 subsequent siblings)
31 siblings, 1 reply; 129+ messages in thread
From: Matthew Brost @ 2024-10-16 3:25 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
Add basic SVM garbage collector which can destroy an SVM range upon an
MMU UNMAP event.
v2:
- Flush garbage collector in xe_svm_close
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
drivers/gpu/drm/xe/xe_svm.c | 87 +++++++++++++++++++++++++++++++-
drivers/gpu/drm/xe/xe_svm.h | 1 +
drivers/gpu/drm/xe/xe_vm.c | 4 ++
drivers/gpu/drm/xe/xe_vm_types.h | 5 ++
4 files changed, 95 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index a9addaea316d..9c2f44cba166 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -30,6 +30,7 @@ xe_svm_range_alloc(struct drm_gpusvm *gpusvm)
if (!range)
return ERR_PTR(-ENOMEM);
+ INIT_LIST_HEAD(&range->garbage_collector_link);
xe_vm_get(gpusvm_to_vm(gpusvm));
return &range->base;
@@ -46,6 +47,24 @@ static struct xe_svm_range *to_xe_range(struct drm_gpusvm_range *r)
return container_of(r, struct xe_svm_range, base);
}
+static void
+xe_svm_garbage_collector_add_range(struct xe_vm *vm, struct xe_svm_range *range,
+ const struct mmu_notifier_range *mmu_range)
+{
+ struct xe_device *xe = vm->xe;
+
+ drm_gpusvm_range_set_unmapped(&range->base, mmu_range);
+
+ spin_lock(&vm->svm.garbage_collector.lock);
+ if (list_empty(&range->garbage_collector_link))
+ list_add_tail(&range->garbage_collector_link,
+ &vm->svm.garbage_collector.range_list);
+ spin_unlock(&vm->svm.garbage_collector.lock);
+
+ queue_work(xe_device_get_root_tile(xe)->primary_gt->usm.pf_wq,
+ &vm->svm.garbage_collector.work);
+}
+
static u8
xe_svm_range_notifier_event_begin(struct xe_vm *vm, struct drm_gpusvm_range *r,
const struct mmu_notifier_range *mmu_range,
@@ -88,7 +107,9 @@ xe_svm_range_notifier_event_end(struct xe_vm *vm, struct drm_gpusvm_range *r,
struct drm_gpusvm_ctx ctx = { .in_notifier = true, };
drm_gpusvm_range_unmap_pages(&vm->svm.gpusvm, r, &ctx);
- /* TODO: Add range to garbage collector */
+ if (mmu_range->event == MMU_NOTIFY_UNMAP)
+ xe_svm_garbage_collector_add_range(vm, to_xe_range(r),
+ mmu_range);
}
static void xe_svm_invalidate(struct drm_gpusvm *gpusvm,
@@ -184,6 +205,58 @@ static void xe_svm_invalidate(struct drm_gpusvm *gpusvm,
xe_svm_range_notifier_event_end(vm, r, mmu_range);
}
+static int __xe_svm_garbage_collector(struct xe_vm *vm,
+ struct xe_svm_range *range)
+{
+ /* TODO: Do unbind */
+
+ drm_gpusvm_range_remove(&vm->svm.gpusvm, &range->base);
+
+ return 0;
+}
+
+static int xe_svm_garbage_collector(struct xe_vm *vm)
+{
+ struct xe_svm_range *range, *next;
+ int err;
+
+ lockdep_assert_held_write(&vm->lock);
+
+ if (xe_vm_is_closed_or_banned(vm))
+ return -ENOENT;
+
+ spin_lock(&vm->svm.garbage_collector.lock);
+ list_for_each_entry_safe(range, next,
+ &vm->svm.garbage_collector.range_list,
+ garbage_collector_link) {
+ list_del(&range->garbage_collector_link);
+ spin_unlock(&vm->svm.garbage_collector.lock);
+
+ err = __xe_svm_garbage_collector(vm, range);
+ if (err) {
+ drm_warn(&vm->xe->drm,
+ "Garbage collection failed: %d\n", err);
+ xe_vm_kill(vm, true);
+ return err;
+ }
+
+ spin_lock(&vm->svm.garbage_collector.lock);
+ }
+ spin_unlock(&vm->svm.garbage_collector.lock);
+
+ return 0;
+}
+
+static void xe_svm_garbage_collector_work_func(struct work_struct *w)
+{
+ struct xe_vm *vm = container_of(w, struct xe_vm,
+ svm.garbage_collector.work);
+
+ down_write(&vm->lock);
+ xe_svm_garbage_collector(vm);
+ up_write(&vm->lock);
+}
+
static const struct drm_gpusvm_ops gpusvm_ops = {
.range_alloc = xe_svm_range_alloc,
.range_free = xe_svm_range_free,
@@ -198,6 +271,11 @@ static const u64 fault_chunk_sizes[] = {
int xe_svm_init(struct xe_vm *vm)
{
+ spin_lock_init(&vm->svm.garbage_collector.lock);
+ INIT_LIST_HEAD(&vm->svm.garbage_collector.range_list);
+ INIT_WORK(&vm->svm.garbage_collector.work,
+ xe_svm_garbage_collector_work_func);
+
return drm_gpusvm_init(&vm->svm.gpusvm, "Xe SVM", &vm->xe->drm,
current->mm, NULL, 0, vm->size,
SZ_512M, &gpusvm_ops, fault_chunk_sizes,
@@ -211,6 +289,8 @@ void xe_svm_close(struct xe_vm *vm)
/* Flush running notifiers making xe_vm_close() visable */
xe_svm_notifier_lock(vm);
xe_svm_notifier_unlock(vm);
+
+ flush_work(&vm->svm.garbage_collector.work);
}
void xe_svm_fini(struct xe_vm *vm)
@@ -241,7 +321,10 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
lockdep_assert_held_write(&vm->lock);
retry:
- /* TODO: Run garbage collector */
+ /* Always process UNMAPs first so view SVM ranges is current */
+ err = xe_svm_garbage_collector(vm);
+ if (err)
+ return err;
r = drm_gpusvm_range_find_or_insert(&vm->svm.gpusvm, fault_addr,
xe_vma_start(vma), xe_vma_end(vma),
diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index ee0bd1ae655b..06d90d0f71a6 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -17,6 +17,7 @@ struct xe_vma;
struct xe_svm_range {
struct drm_gpusvm_range base;
+ struct list_head garbage_collector_link;
u8 tile_present;
u8 tile_invalidated;
};
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 63aa0a25d3b7..399cbbdbddd5 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -3071,6 +3071,10 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
goto put_exec_queue;
}
+ /* Ensure all UNMAPs visable */
+ if (xe_vm_in_fault_mode(vm))
+ flush_work(&vm->svm.garbage_collector.work);
+
err = down_write_killable(&vm->lock);
if (err)
goto put_vm;
diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
index b736e53779d2..2eae3575c409 100644
--- a/drivers/gpu/drm/xe/xe_vm_types.h
+++ b/drivers/gpu/drm/xe/xe_vm_types.h
@@ -146,6 +146,11 @@ struct xe_vm {
struct {
/** @svm.gpusvm: base GPUSVM used to track fault allocations */
struct drm_gpusvm gpusvm;
+ struct {
+ spinlock_t lock;
+ struct list_head range_list;
+ struct work_struct work;
+ } garbage_collector;
} svm;
struct xe_device *xe;
--
2.34.1
^ permalink raw reply related [flat|nested] 129+ messages in thread* Re: [PATCH v2 12/29] drm/xe: Add SVM garbage collector
2024-10-16 3:25 ` [PATCH v2 12/29] drm/xe: Add SVM garbage collector Matthew Brost
@ 2024-11-19 14:45 ` Thomas Hellström
2024-12-11 19:17 ` Matthew Brost
0 siblings, 1 reply; 129+ messages in thread
From: Thomas Hellström @ 2024-11-19 14:45 UTC (permalink / raw)
To: Matthew Brost, intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, simona.vetter, felix.kuehling,
dakr
On Tue, 2024-10-15 at 20:25 -0700, Matthew Brost wrote:
> Add basic SVM garbage collector which can destroy an SVM range upon
> an
> MMU UNMAP event.
>
> v2:
> - Flush garbage collector in xe_svm_close
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
> drivers/gpu/drm/xe/xe_svm.c | 87
> +++++++++++++++++++++++++++++++-
> drivers/gpu/drm/xe/xe_svm.h | 1 +
> drivers/gpu/drm/xe/xe_vm.c | 4 ++
> drivers/gpu/drm/xe/xe_vm_types.h | 5 ++
> 4 files changed, 95 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_svm.c
> b/drivers/gpu/drm/xe/xe_svm.c
> index a9addaea316d..9c2f44cba166 100644
> --- a/drivers/gpu/drm/xe/xe_svm.c
> +++ b/drivers/gpu/drm/xe/xe_svm.c
> @@ -30,6 +30,7 @@ xe_svm_range_alloc(struct drm_gpusvm *gpusvm)
> if (!range)
> return ERR_PTR(-ENOMEM);
>
> + INIT_LIST_HEAD(&range->garbage_collector_link);
> xe_vm_get(gpusvm_to_vm(gpusvm));
>
> return &range->base;
> @@ -46,6 +47,24 @@ static struct xe_svm_range *to_xe_range(struct
> drm_gpusvm_range *r)
> return container_of(r, struct xe_svm_range, base);
> }
>
> +static void
> +xe_svm_garbage_collector_add_range(struct xe_vm *vm, struct
> xe_svm_range *range,
> + const struct mmu_notifier_range
> *mmu_range)
> +{
> + struct xe_device *xe = vm->xe;
> +
> + drm_gpusvm_range_set_unmapped(&range->base, mmu_range);
> +
> + spin_lock(&vm->svm.garbage_collector.lock);
> + if (list_empty(&range->garbage_collector_link))
> + list_add_tail(&range->garbage_collector_link,
> + &vm-
> >svm.garbage_collector.range_list);
> + spin_unlock(&vm->svm.garbage_collector.lock);
> +
> + queue_work(xe_device_get_root_tile(xe)->primary_gt-
> >usm.pf_wq,
> + &vm->svm.garbage_collector.work);
> +}
> +
> static u8
> xe_svm_range_notifier_event_begin(struct xe_vm *vm, struct
> drm_gpusvm_range *r,
> const struct mmu_notifier_range
> *mmu_range,
> @@ -88,7 +107,9 @@ xe_svm_range_notifier_event_end(struct xe_vm *vm,
> struct drm_gpusvm_range *r,
> struct drm_gpusvm_ctx ctx = { .in_notifier = true, };
>
> drm_gpusvm_range_unmap_pages(&vm->svm.gpusvm, r, &ctx);
> - /* TODO: Add range to garbage collector */
> + if (mmu_range->event == MMU_NOTIFY_UNMAP)
> + xe_svm_garbage_collector_add_range(vm,
> to_xe_range(r),
> + mmu_range);
> }
>
> static void xe_svm_invalidate(struct drm_gpusvm *gpusvm,
> @@ -184,6 +205,58 @@ static void xe_svm_invalidate(struct drm_gpusvm
> *gpusvm,
> xe_svm_range_notifier_event_end(vm, r, mmu_range);
> }
>
> +static int __xe_svm_garbage_collector(struct xe_vm *vm,
> + struct xe_svm_range *range)
> +{
> + /* TODO: Do unbind */
> +
> + drm_gpusvm_range_remove(&vm->svm.gpusvm, &range->base);
> +
> + return 0;
> +}
> +
> +static int xe_svm_garbage_collector(struct xe_vm *vm)
> +{
> + struct xe_svm_range *range, *next;
> + int err;
> +
> + lockdep_assert_held_write(&vm->lock);
> +
> + if (xe_vm_is_closed_or_banned(vm))
> + return -ENOENT;
> +
> + spin_lock(&vm->svm.garbage_collector.lock);
> + list_for_each_entry_safe(range, next,
> + &vm-
> >svm.garbage_collector.range_list,
> + garbage_collector_link) {
> + list_del(&range->garbage_collector_link);
> + spin_unlock(&vm->svm.garbage_collector.lock);
This looks broken, what if someone removed the "next" entry here?
You probably want to use list_next_entry_or_null();
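Something like a pop-one-at-a-time loop would avoid relying on "next"
across the unlock, e.g. (only a sketch, untested):
	spin_lock(&vm->svm.garbage_collector.lock);
	for (;;) {
		range = list_first_entry_or_null(&vm->svm.garbage_collector.range_list,
						 typeof(*range),
						 garbage_collector_link);
		if (!range)
			break;
		list_del(&range->garbage_collector_link);
		spin_unlock(&vm->svm.garbage_collector.lock);
		err = __xe_svm_garbage_collector(vm, range);
		if (err) {
			drm_warn(&vm->xe->drm,
				 "Garbage collection failed: %d\n", err);
			xe_vm_kill(vm, true);
			return err;
		}
		spin_lock(&vm->svm.garbage_collector.lock);
	}
	spin_unlock(&vm->svm.garbage_collector.lock);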
> +
> + err = __xe_svm_garbage_collector(vm, range);
> + if (err) {
> + drm_warn(&vm->xe->drm,
> + "Garbage collection failed: %d\n",
> err);
> + xe_vm_kill(vm, true);
> + return err;
> + }
> +
> + spin_lock(&vm->svm.garbage_collector.lock);
> + }
> + spin_unlock(&vm->svm.garbage_collector.lock);
> +
> + return 0;
> +}
> +
> +static void xe_svm_garbage_collector_work_func(struct work_struct
> *w)
> +{
> + struct xe_vm *vm = container_of(w, struct xe_vm,
> + svm.garbage_collector.work);
> +
> + down_write(&vm->lock);
> + xe_svm_garbage_collector(vm);
> + up_write(&vm->lock);
> +}
> +
> static const struct drm_gpusvm_ops gpusvm_ops = {
> .range_alloc = xe_svm_range_alloc,
> .range_free = xe_svm_range_free,
> @@ -198,6 +271,11 @@ static const u64 fault_chunk_sizes[] = {
>
> int xe_svm_init(struct xe_vm *vm)
> {
> + spin_lock_init(&vm->svm.garbage_collector.lock);
> + INIT_LIST_HEAD(&vm->svm.garbage_collector.range_list);
> + INIT_WORK(&vm->svm.garbage_collector.work,
> + xe_svm_garbage_collector_work_func);
> +
> return drm_gpusvm_init(&vm->svm.gpusvm, "Xe SVM", &vm->xe-
> >drm,
> current->mm, NULL, 0, vm->size,
> SZ_512M, &gpusvm_ops,
> fault_chunk_sizes,
> @@ -211,6 +289,8 @@ void xe_svm_close(struct xe_vm *vm)
> /* Flush running notifiers making xe_vm_close() visable */
> xe_svm_notifier_lock(vm);
> xe_svm_notifier_unlock(vm);
> +
> + flush_work(&vm->svm.garbage_collector.work);
> }
>
> void xe_svm_fini(struct xe_vm *vm)
> @@ -241,7 +321,10 @@ int xe_svm_handle_pagefault(struct xe_vm *vm,
> struct xe_vma *vma,
> lockdep_assert_held_write(&vm->lock);
>
> retry:
> - /* TODO: Run garbage collector */
> + /* Always process UNMAPs first so view SVM ranges is current
> */
> + err = xe_svm_garbage_collector(vm);
> + if (err)
> + return err;
>
> r = drm_gpusvm_range_find_or_insert(&vm->svm.gpusvm,
> fault_addr,
> xe_vma_start(vma),
> xe_vma_end(vma),
> diff --git a/drivers/gpu/drm/xe/xe_svm.h
> b/drivers/gpu/drm/xe/xe_svm.h
> index ee0bd1ae655b..06d90d0f71a6 100644
> --- a/drivers/gpu/drm/xe/xe_svm.h
> +++ b/drivers/gpu/drm/xe/xe_svm.h
> @@ -17,6 +17,7 @@ struct xe_vma;
>
> struct xe_svm_range {
> struct drm_gpusvm_range base;
> + struct list_head garbage_collector_link;
> u8 tile_present;
> u8 tile_invalidated;
> };
> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> index 63aa0a25d3b7..399cbbdbddd5 100644
> --- a/drivers/gpu/drm/xe/xe_vm.c
> +++ b/drivers/gpu/drm/xe/xe_vm.c
> @@ -3071,6 +3071,10 @@ int xe_vm_bind_ioctl(struct drm_device *dev,
> void *data, struct drm_file *file)
> goto put_exec_queue;
> }
>
> + /* Ensure all UNMAPs visable */
> + if (xe_vm_in_fault_mode(vm))
> + flush_work(&vm->svm.garbage_collector.work);
Hmm, what if someone added an UNMAP here?
Thanks,
Thomas
> +
> err = down_write_killable(&vm->lock);
> if (err)
> goto put_vm;
> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h
> b/drivers/gpu/drm/xe/xe_vm_types.h
> index b736e53779d2..2eae3575c409 100644
> --- a/drivers/gpu/drm/xe/xe_vm_types.h
> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
> @@ -146,6 +146,11 @@ struct xe_vm {
> struct {
> /** @svm.gpusvm: base GPUSVM used to track fault
> allocations */
> struct drm_gpusvm gpusvm;
> + struct {
> + spinlock_t lock;
> + struct list_head range_list;
> + struct work_struct work;
> + } garbage_collector;
> } svm;
>
> struct xe_device *xe;
^ permalink raw reply [flat|nested] 129+ messages in thread* Re: [PATCH v2 12/29] drm/xe: Add SVM garbage collector
2024-11-19 14:45 ` Thomas Hellström
@ 2024-12-11 19:17 ` Matthew Brost
2024-12-16 10:36 ` Thomas Hellström
0 siblings, 1 reply; 129+ messages in thread
From: Matthew Brost @ 2024-12-11 19:17 UTC (permalink / raw)
To: Thomas Hellström
Cc: intel-xe, dri-devel, apopple, airlied, christian.koenig,
simona.vetter, felix.kuehling, dakr
On Tue, Nov 19, 2024 at 03:45:33PM +0100, Thomas Hellström wrote:
> On Tue, 2024-10-15 at 20:25 -0700, Matthew Brost wrote:
> > Add basic SVM garbage collector which can destroy an SVM range upon
> > an
> > MMU UNMAP event.
> >
> > v2:
> > - Flush garbage collector in xe_svm_close
> >
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> > drivers/gpu/drm/xe/xe_svm.c | 87
> > +++++++++++++++++++++++++++++++-
> > drivers/gpu/drm/xe/xe_svm.h | 1 +
> > drivers/gpu/drm/xe/xe_vm.c | 4 ++
> > drivers/gpu/drm/xe/xe_vm_types.h | 5 ++
> > 4 files changed, 95 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_svm.c
> > b/drivers/gpu/drm/xe/xe_svm.c
> > index a9addaea316d..9c2f44cba166 100644
> > --- a/drivers/gpu/drm/xe/xe_svm.c
> > +++ b/drivers/gpu/drm/xe/xe_svm.c
> > @@ -30,6 +30,7 @@ xe_svm_range_alloc(struct drm_gpusvm *gpusvm)
> > if (!range)
> > return ERR_PTR(-ENOMEM);
> >
> > + INIT_LIST_HEAD(&range->garbage_collector_link);
> > xe_vm_get(gpusvm_to_vm(gpusvm));
> >
> > return &range->base;
> > @@ -46,6 +47,24 @@ static struct xe_svm_range *to_xe_range(struct
> > drm_gpusvm_range *r)
> > return container_of(r, struct xe_svm_range, base);
> > }
> >
> > +static void
> > +xe_svm_garbage_collector_add_range(struct xe_vm *vm, struct
> > xe_svm_range *range,
> > + const struct mmu_notifier_range
> > *mmu_range)
> > +{
> > + struct xe_device *xe = vm->xe;
> > +
> > + drm_gpusvm_range_set_unmapped(&range->base, mmu_range);
> > +
> > + spin_lock(&vm->svm.garbage_collector.lock);
> > + if (list_empty(&range->garbage_collector_link))
> > + list_add_tail(&range->garbage_collector_link,
> > + &vm-
> > >svm.garbage_collector.range_list);
> > + spin_unlock(&vm->svm.garbage_collector.lock);
> > +
> > + queue_work(xe_device_get_root_tile(xe)->primary_gt-
> > >usm.pf_wq,
> > + &vm->svm.garbage_collector.work);
> > +}
> > +
> > static u8
> > xe_svm_range_notifier_event_begin(struct xe_vm *vm, struct
> > drm_gpusvm_range *r,
> > const struct mmu_notifier_range
> > *mmu_range,
> > @@ -88,7 +107,9 @@ xe_svm_range_notifier_event_end(struct xe_vm *vm,
> > struct drm_gpusvm_range *r,
> > struct drm_gpusvm_ctx ctx = { .in_notifier = true, };
> >
> > drm_gpusvm_range_unmap_pages(&vm->svm.gpusvm, r, &ctx);
> > - /* TODO: Add range to garbage collector */
> > + if (mmu_range->event == MMU_NOTIFY_UNMAP)
> > + xe_svm_garbage_collector_add_range(vm,
> > to_xe_range(r),
> > + mmu_range);
> > }
> >
> > static void xe_svm_invalidate(struct drm_gpusvm *gpusvm,
> > @@ -184,6 +205,58 @@ static void xe_svm_invalidate(struct drm_gpusvm
> > *gpusvm,
> > xe_svm_range_notifier_event_end(vm, r, mmu_range);
> > }
> >
> > +static int __xe_svm_garbage_collector(struct xe_vm *vm,
> > + struct xe_svm_range *range)
> > +{
> > + /* TODO: Do unbind */
> > +
> > + drm_gpusvm_range_remove(&vm->svm.gpusvm, &range->base);
> > +
> > + return 0;
> > +}
> > +
> > +static int xe_svm_garbage_collector(struct xe_vm *vm)
> > +{
> > + struct xe_svm_range *range, *next;
> > + int err;
> > +
> > + lockdep_assert_held_write(&vm->lock);
> > +
> > + if (xe_vm_is_closed_or_banned(vm))
> > + return -ENOENT;
> > +
> > + spin_lock(&vm->svm.garbage_collector.lock);
> > + list_for_each_entry_safe(range, next,
> > + &vm-
> > >svm.garbage_collector.range_list,
> > + garbage_collector_link) {
> > + list_del(&range->garbage_collector_link);
> > + spin_unlock(&vm->svm.garbage_collector.lock);
>
> This looks broken, what if someone removed the "next" entry here?
> You probably want to use list_next_entry_or_null();
>
Yea, let me fix this loop structure.
> > +
> > + err = __xe_svm_garbage_collector(vm, range);
> > + if (err) {
> > + drm_warn(&vm->xe->drm,
> > + "Garbage collection failed: %d\n",
> > err);
> > + xe_vm_kill(vm, true);
> > + return err;
> > + }
> > +
> > + spin_lock(&vm->svm.garbage_collector.lock);
> > + }
> > + spin_unlock(&vm->svm.garbage_collector.lock);
> > +
> > + return 0;
> > +}
> > +
> > +static void xe_svm_garbage_collector_work_func(struct work_struct
> > *w)
> > +{
> > + struct xe_vm *vm = container_of(w, struct xe_vm,
> > + svm.garbage_collector.work);
> > +
> > + down_write(&vm->lock);
> > + xe_svm_garbage_collector(vm);
> > + up_write(&vm->lock);
> > +}
> > +
> > static const struct drm_gpusvm_ops gpusvm_ops = {
> > .range_alloc = xe_svm_range_alloc,
> > .range_free = xe_svm_range_free,
> > @@ -198,6 +271,11 @@ static const u64 fault_chunk_sizes[] = {
> >
> > int xe_svm_init(struct xe_vm *vm)
> > {
> > + spin_lock_init(&vm->svm.garbage_collector.lock);
> > + INIT_LIST_HEAD(&vm->svm.garbage_collector.range_list);
> > + INIT_WORK(&vm->svm.garbage_collector.work,
> > + xe_svm_garbage_collector_work_func);
> > +
> > return drm_gpusvm_init(&vm->svm.gpusvm, "Xe SVM", &vm->xe-
> > >drm,
> > current->mm, NULL, 0, vm->size,
> > SZ_512M, &gpusvm_ops,
> > fault_chunk_sizes,
> > @@ -211,6 +289,8 @@ void xe_svm_close(struct xe_vm *vm)
> > /* Flush running notifiers making xe_vm_close() visable */
> > xe_svm_notifier_lock(vm);
> > xe_svm_notifier_unlock(vm);
> > +
> > + flush_work(&vm->svm.garbage_collector.work);
> > }
> >
> > void xe_svm_fini(struct xe_vm *vm)
> > @@ -241,7 +321,10 @@ int xe_svm_handle_pagefault(struct xe_vm *vm,
> > struct xe_vma *vma,
> > lockdep_assert_held_write(&vm->lock);
> >
> > retry:
> > - /* TODO: Run garbage collector */
> > + /* Always process UNMAPs first so view SVM ranges is current
> > */
> > + err = xe_svm_garbage_collector(vm);
> > + if (err)
> > + return err;
> >
> > r = drm_gpusvm_range_find_or_insert(&vm->svm.gpusvm,
> > fault_addr,
> > xe_vma_start(vma),
> > xe_vma_end(vma),
> > diff --git a/drivers/gpu/drm/xe/xe_svm.h
> > b/drivers/gpu/drm/xe/xe_svm.h
> > index ee0bd1ae655b..06d90d0f71a6 100644
> > --- a/drivers/gpu/drm/xe/xe_svm.h
> > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > @@ -17,6 +17,7 @@ struct xe_vma;
> >
> > struct xe_svm_range {
> > struct drm_gpusvm_range base;
> > + struct list_head garbage_collector_link;
> > u8 tile_present;
> > u8 tile_invalidated;
> > };
> > diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> > index 63aa0a25d3b7..399cbbdbddd5 100644
> > --- a/drivers/gpu/drm/xe/xe_vm.c
> > +++ b/drivers/gpu/drm/xe/xe_vm.c
> > @@ -3071,6 +3071,10 @@ int xe_vm_bind_ioctl(struct drm_device *dev,
> > void *data, struct drm_file *file)
> > goto put_exec_queue;
> > }
> >
> > + /* Ensure all UNMAPs visable */
> > + if (xe_vm_in_fault_mode(vm))
> > + flush_work(&vm->svm.garbage_collector.work);
>
> Hmm, what if someone added an UNMAP here?
>
What we're really trying to guard against is user space doing something
like this:
addr = malloc();
gpu access
free(addr)
bind_bo(addr);
We want to make sure all SVM mappings from the GPU access have processed
the UNMAP events from the 'free(addr)'. So I think the code is fine as
is - we just want to make sure UNMAP events prior to the IOCTL are
processed.
Matt
> Thanks,
> Thomas
>
> > +
> > err = down_write_killable(&vm->lock);
> > if (err)
> > goto put_vm;
> > diff --git a/drivers/gpu/drm/xe/xe_vm_types.h
> > b/drivers/gpu/drm/xe/xe_vm_types.h
> > index b736e53779d2..2eae3575c409 100644
> > --- a/drivers/gpu/drm/xe/xe_vm_types.h
> > +++ b/drivers/gpu/drm/xe/xe_vm_types.h
> > @@ -146,6 +146,11 @@ struct xe_vm {
> > struct {
> > /** @svm.gpusvm: base GPUSVM used to track fault
> > allocations */
> > struct drm_gpusvm gpusvm;
> > + struct {
> > + spinlock_t lock;
> > + struct list_head range_list;
> > + struct work_struct work;
> > + } garbage_collector;
> > } svm;
> >
> > struct xe_device *xe;
>
^ permalink raw reply [flat|nested] 129+ messages in thread* Re: [PATCH v2 12/29] drm/xe: Add SVM garbage collector
2024-12-11 19:17 ` Matthew Brost
@ 2024-12-16 10:36 ` Thomas Hellström
2024-12-16 23:46 ` Matthew Brost
0 siblings, 1 reply; 129+ messages in thread
From: Thomas Hellström @ 2024-12-16 10:36 UTC (permalink / raw)
To: Matthew Brost
Cc: intel-xe, dri-devel, apopple, airlied, christian.koenig,
simona.vetter, felix.kuehling, dakr
On Wed, 2024-12-11 at 11:17 -0800, Matthew Brost wrote:
> On Tue, Nov 19, 2024 at 03:45:33PM +0100, Thomas Hellström wrote:
> > On Tue, 2024-10-15 at 20:25 -0700, Matthew Brost wrote:
> > > Add basic SVM garbage collector which can destroy an SVM range
> > > upon
> > > an
> > > MMU UNMAP event.
> > >
> > > v2:
> > > - Flush garbage collector in xe_svm_close
> > >
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > ---
> > > drivers/gpu/drm/xe/xe_svm.c | 87
> > > +++++++++++++++++++++++++++++++-
> > > drivers/gpu/drm/xe/xe_svm.h | 1 +
> > > drivers/gpu/drm/xe/xe_vm.c | 4 ++
> > > drivers/gpu/drm/xe/xe_vm_types.h | 5 ++
> > > 4 files changed, 95 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_svm.c
> > > b/drivers/gpu/drm/xe/xe_svm.c
> > > index a9addaea316d..9c2f44cba166 100644
> > > --- a/drivers/gpu/drm/xe/xe_svm.c
> > > +++ b/drivers/gpu/drm/xe/xe_svm.c
> > > @@ -30,6 +30,7 @@ xe_svm_range_alloc(struct drm_gpusvm *gpusvm)
> > > if (!range)
> > > return ERR_PTR(-ENOMEM);
> > >
> > > + INIT_LIST_HEAD(&range->garbage_collector_link);
> > > xe_vm_get(gpusvm_to_vm(gpusvm));
> > >
> > > return &range->base;
> > > @@ -46,6 +47,24 @@ static struct xe_svm_range *to_xe_range(struct
> > > drm_gpusvm_range *r)
> > > return container_of(r, struct xe_svm_range, base);
> > > }
> > >
> > > +static void
> > > +xe_svm_garbage_collector_add_range(struct xe_vm *vm, struct
> > > xe_svm_range *range,
> > > + const struct
> > > mmu_notifier_range
> > > *mmu_range)
> > > +{
> > > + struct xe_device *xe = vm->xe;
> > > +
> > > + drm_gpusvm_range_set_unmapped(&range->base, mmu_range);
> > > +
> > > + spin_lock(&vm->svm.garbage_collector.lock);
> > > + if (list_empty(&range->garbage_collector_link))
> > > + list_add_tail(&range->garbage_collector_link,
> > > + &vm-
> > > > svm.garbage_collector.range_list);
> > > + spin_unlock(&vm->svm.garbage_collector.lock);
> > > +
> > > + queue_work(xe_device_get_root_tile(xe)->primary_gt-
> > > > usm.pf_wq,
> > > + &vm->svm.garbage_collector.work);
> > > +}
> > > +
> > > static u8
> > > xe_svm_range_notifier_event_begin(struct xe_vm *vm, struct
> > > drm_gpusvm_range *r,
> > > const struct
> > > mmu_notifier_range
> > > *mmu_range,
> > > @@ -88,7 +107,9 @@ xe_svm_range_notifier_event_end(struct xe_vm
> > > *vm,
> > > struct drm_gpusvm_range *r,
> > > struct drm_gpusvm_ctx ctx = { .in_notifier = true, };
> > >
> > > drm_gpusvm_range_unmap_pages(&vm->svm.gpusvm, r, &ctx);
> > > - /* TODO: Add range to garbage collector */
> > > + if (mmu_range->event == MMU_NOTIFY_UNMAP)
> > > + xe_svm_garbage_collector_add_range(vm,
> > > to_xe_range(r),
> > > + mmu_range);
> > > }
> > >
> > > static void xe_svm_invalidate(struct drm_gpusvm *gpusvm,
> > > @@ -184,6 +205,58 @@ static void xe_svm_invalidate(struct
> > > drm_gpusvm
> > > *gpusvm,
> > > xe_svm_range_notifier_event_end(vm, r,
> > > mmu_range);
> > > }
> > >
> > > +static int __xe_svm_garbage_collector(struct xe_vm *vm,
> > > + struct xe_svm_range
> > > *range)
> > > +{
> > > + /* TODO: Do unbind */
> > > +
> > > + drm_gpusvm_range_remove(&vm->svm.gpusvm, &range->base);
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +static int xe_svm_garbage_collector(struct xe_vm *vm)
> > > +{
> > > + struct xe_svm_range *range, *next;
> > > + int err;
> > > +
> > > + lockdep_assert_held_write(&vm->lock);
> > > +
> > > + if (xe_vm_is_closed_or_banned(vm))
> > > + return -ENOENT;
> > > +
> > > + spin_lock(&vm->svm.garbage_collector.lock);
> > > + list_for_each_entry_safe(range, next,
> > > + &vm-
> > > > svm.garbage_collector.range_list,
> > > + garbage_collector_link) {
> > > + list_del(&range->garbage_collector_link);
> > > + spin_unlock(&vm->svm.garbage_collector.lock);
> >
> > This looks broken, what if someone removed the "next" entry here?
> > You probably want to use list_next_entry_or_null();
> >
>
> Yea, let me fix this loop structure.
>
> > > +
> > > + err = __xe_svm_garbage_collector(vm, range);
> > > + if (err) {
> > > + drm_warn(&vm->xe->drm,
> > > + "Garbage collection failed:
> > > %d\n",
> > > err);
> > > + xe_vm_kill(vm, true);
> > > + return err;
> > > + }
> > > +
> > > + spin_lock(&vm->svm.garbage_collector.lock);
> > > + }
> > > + spin_unlock(&vm->svm.garbage_collector.lock);
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +static void xe_svm_garbage_collector_work_func(struct
> > > work_struct
> > > *w)
> > > +{
> > > + struct xe_vm *vm = container_of(w, struct xe_vm,
> > > + svm.garbage_collector.wo
> > > rk);
> > > +
> > > + down_write(&vm->lock);
> > > + xe_svm_garbage_collector(vm);
> > > + up_write(&vm->lock);
> > > +}
> > > +
> > > static const struct drm_gpusvm_ops gpusvm_ops = {
> > > .range_alloc = xe_svm_range_alloc,
> > > .range_free = xe_svm_range_free,
> > > @@ -198,6 +271,11 @@ static const u64 fault_chunk_sizes[] = {
> > >
> > > int xe_svm_init(struct xe_vm *vm)
> > > {
> > > + spin_lock_init(&vm->svm.garbage_collector.lock);
> > > + INIT_LIST_HEAD(&vm->svm.garbage_collector.range_list);
> > > + INIT_WORK(&vm->svm.garbage_collector.work,
> > > + xe_svm_garbage_collector_work_func);
> > > +
> > > return drm_gpusvm_init(&vm->svm.gpusvm, "Xe SVM", &vm-
> > > >xe-
> > > > drm,
> > > current->mm, NULL, 0, vm->size,
> > > SZ_512M, &gpusvm_ops,
> > > fault_chunk_sizes,
> > > @@ -211,6 +289,8 @@ void xe_svm_close(struct xe_vm *vm)
> > > /* Flush running notifiers making xe_vm_close() visable
> > > */
> > > xe_svm_notifier_lock(vm);
> > > xe_svm_notifier_unlock(vm);
> > > +
> > > + flush_work(&vm->svm.garbage_collector.work);
> > > }
> > >
> > > void xe_svm_fini(struct xe_vm *vm)
> > > @@ -241,7 +321,10 @@ int xe_svm_handle_pagefault(struct xe_vm
> > > *vm,
> > > struct xe_vma *vma,
> > > lockdep_assert_held_write(&vm->lock);
> > >
> > > retry:
> > > - /* TODO: Run garbage collector */
> > > + /* Always process UNMAPs first so view SVM ranges is
> > > current
> > > */
> > > + err = xe_svm_garbage_collector(vm);
> > > + if (err)
> > > + return err;
> > >
> > > r = drm_gpusvm_range_find_or_insert(&vm->svm.gpusvm,
> > > fault_addr,
> > > xe_vma_start(vma),
> > > xe_vma_end(vma),
> > > diff --git a/drivers/gpu/drm/xe/xe_svm.h
> > > b/drivers/gpu/drm/xe/xe_svm.h
> > > index ee0bd1ae655b..06d90d0f71a6 100644
> > > --- a/drivers/gpu/drm/xe/xe_svm.h
> > > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > > @@ -17,6 +17,7 @@ struct xe_vma;
> > >
> > > struct xe_svm_range {
> > > struct drm_gpusvm_range base;
> > > + struct list_head garbage_collector_link;
> > > u8 tile_present;
> > > u8 tile_invalidated;
> > > };
> > > diff --git a/drivers/gpu/drm/xe/xe_vm.c
> > > b/drivers/gpu/drm/xe/xe_vm.c
> > > index 63aa0a25d3b7..399cbbdbddd5 100644
> > > --- a/drivers/gpu/drm/xe/xe_vm.c
> > > +++ b/drivers/gpu/drm/xe/xe_vm.c
> > > @@ -3071,6 +3071,10 @@ int xe_vm_bind_ioctl(struct drm_device
> > > *dev,
> > > void *data, struct drm_file *file)
> > > goto put_exec_queue;
> > > }
> > >
> > > + /* Ensure all UNMAPs visable */
> > > + if (xe_vm_in_fault_mode(vm))
> > > + flush_work(&vm->svm.garbage_collector.work);
> >
> > Hmm, what if someone added an UNMAP here?
> >
>
> What we're really trying to guard against is user space doing
> something like this:
>
> addr = malloc();
> gpu access
> free(addr)
> bind_bo(addr);
>
> We want to make sure all SVM mappings from the GPU access have
> processed
> the UNMAP events from the 'free(addr)'. So I think the code is fine
> as
> is - we just want to make sure UNMAP events prior to the IOCTL are
> processed.
But the notion of "prior" only exists in the presence of some form of
synchronization, like a lock. Let's say another thread calls a free
either
a) before the flush_work
b) racing with the flush_work
c) after the flush_work
Is there any difference WRT correctness and how do we differentiate?
I don't think it's clear what this flush_work actually protects
against.
Thanks,
Thomas
>
> Matt
>
>
> > Thanks,
> > Thomas
> >
> > > +
> > > err = down_write_killable(&vm->lock);
> > > if (err)
> > > goto put_vm;
> > > diff --git a/drivers/gpu/drm/xe/xe_vm_types.h
> > > b/drivers/gpu/drm/xe/xe_vm_types.h
> > > index b736e53779d2..2eae3575c409 100644
> > > --- a/drivers/gpu/drm/xe/xe_vm_types.h
> > > +++ b/drivers/gpu/drm/xe/xe_vm_types.h
> > > @@ -146,6 +146,11 @@ struct xe_vm {
> > > struct {
> > > /** @svm.gpusvm: base GPUSVM used to track fault
> > > allocations */
> > > struct drm_gpusvm gpusvm;
> > > + struct {
> > > + spinlock_t lock;
> > > + struct list_head range_list;
> > > + struct work_struct work;
> > > + } garbage_collector;
> > > } svm;
> > >
> > > struct xe_device *xe;
> >
^ permalink raw reply [flat|nested] 129+ messages in thread* Re: [PATCH v2 12/29] drm/xe: Add SVM garbage collector
2024-12-16 10:36 ` Thomas Hellström
@ 2024-12-16 23:46 ` Matthew Brost
0 siblings, 0 replies; 129+ messages in thread
From: Matthew Brost @ 2024-12-16 23:46 UTC (permalink / raw)
To: Thomas Hellström
Cc: intel-xe, dri-devel, apopple, airlied, christian.koenig,
simona.vetter, felix.kuehling, dakr
On Mon, Dec 16, 2024 at 11:36:20AM +0100, Thomas Hellström wrote:
> On Wed, 2024-12-11 at 11:17 -0800, Matthew Brost wrote:
> > On Tue, Nov 19, 2024 at 03:45:33PM +0100, Thomas Hellström wrote:
> > > On Tue, 2024-10-15 at 20:25 -0700, Matthew Brost wrote:
> > > > Add basic SVM garbage collector which can destroy an SVM range
> > > > upon
> > > > an
> > > > MMU UNMAP event.
> > > >
> > > > v2:
> > > > - Flush garbage collector in xe_svm_close
> > > >
> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > ---
> > > > drivers/gpu/drm/xe/xe_svm.c | 87
> > > > +++++++++++++++++++++++++++++++-
> > > > drivers/gpu/drm/xe/xe_svm.h | 1 +
> > > > drivers/gpu/drm/xe/xe_vm.c | 4 ++
> > > > drivers/gpu/drm/xe/xe_vm_types.h | 5 ++
> > > > 4 files changed, 95 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/xe/xe_svm.c
> > > > b/drivers/gpu/drm/xe/xe_svm.c
> > > > index a9addaea316d..9c2f44cba166 100644
> > > > --- a/drivers/gpu/drm/xe/xe_svm.c
> > > > +++ b/drivers/gpu/drm/xe/xe_svm.c
> > > > @@ -30,6 +30,7 @@ xe_svm_range_alloc(struct drm_gpusvm *gpusvm)
> > > > if (!range)
> > > > return ERR_PTR(-ENOMEM);
> > > >
> > > > + INIT_LIST_HEAD(&range->garbage_collector_link);
> > > > xe_vm_get(gpusvm_to_vm(gpusvm));
> > > >
> > > > return &range->base;
> > > > @@ -46,6 +47,24 @@ static struct xe_svm_range *to_xe_range(struct
> > > > drm_gpusvm_range *r)
> > > > return container_of(r, struct xe_svm_range, base);
> > > > }
> > > >
> > > > +static void
> > > > +xe_svm_garbage_collector_add_range(struct xe_vm *vm, struct
> > > > xe_svm_range *range,
> > > > + const struct
> > > > mmu_notifier_range
> > > > *mmu_range)
> > > > +{
> > > > + struct xe_device *xe = vm->xe;
> > > > +
> > > > + drm_gpusvm_range_set_unmapped(&range->base, mmu_range);
> > > > +
> > > > + spin_lock(&vm->svm.garbage_collector.lock);
> > > > + if (list_empty(&range->garbage_collector_link))
> > > > + list_add_tail(&range->garbage_collector_link,
> > > > + &vm-
> > > > > svm.garbage_collector.range_list);
> > > > + spin_unlock(&vm->svm.garbage_collector.lock);
> > > > +
> > > > + queue_work(xe_device_get_root_tile(xe)->primary_gt-
> > > > > usm.pf_wq,
> > > > + &vm->svm.garbage_collector.work);
> > > > +}
> > > > +
> > > > static u8
> > > > xe_svm_range_notifier_event_begin(struct xe_vm *vm, struct
> > > > drm_gpusvm_range *r,
> > > > const struct
> > > > mmu_notifier_range
> > > > *mmu_range,
> > > > @@ -88,7 +107,9 @@ xe_svm_range_notifier_event_end(struct xe_vm
> > > > *vm,
> > > > struct drm_gpusvm_range *r,
> > > > struct drm_gpusvm_ctx ctx = { .in_notifier = true, };
> > > >
> > > > drm_gpusvm_range_unmap_pages(&vm->svm.gpusvm, r, &ctx);
> > > > - /* TODO: Add range to garbage collector */
> > > > + if (mmu_range->event == MMU_NOTIFY_UNMAP)
> > > > + xe_svm_garbage_collector_add_range(vm,
> > > > to_xe_range(r),
> > > > + mmu_range);
> > > > }
> > > >
> > > > static void xe_svm_invalidate(struct drm_gpusvm *gpusvm,
> > > > @@ -184,6 +205,58 @@ static void xe_svm_invalidate(struct
> > > > drm_gpusvm
> > > > *gpusvm,
> > > > xe_svm_range_notifier_event_end(vm, r,
> > > > mmu_range);
> > > > }
> > > >
> > > > +static int __xe_svm_garbage_collector(struct xe_vm *vm,
> > > > + struct xe_svm_range
> > > > *range)
> > > > +{
> > > > + /* TODO: Do unbind */
> > > > +
> > > > + drm_gpusvm_range_remove(&vm->svm.gpusvm, &range->base);
> > > > +
> > > > + return 0;
> > > > +}
> > > > +
> > > > +static int xe_svm_garbage_collector(struct xe_vm *vm)
> > > > +{
> > > > + struct xe_svm_range *range, *next;
> > > > + int err;
> > > > +
> > > > + lockdep_assert_held_write(&vm->lock);
> > > > +
> > > > + if (xe_vm_is_closed_or_banned(vm))
> > > > + return -ENOENT;
> > > > +
> > > > + spin_lock(&vm->svm.garbage_collector.lock);
> > > > + list_for_each_entry_safe(range, next,
> > > > + &vm-
> > > > > svm.garbage_collector.range_list,
> > > > + garbage_collector_link) {
> > > > + list_del(&range->garbage_collector_link);
> > > > + spin_unlock(&vm->svm.garbage_collector.lock);
> > >
> > > This looks broken, what if someone removed the "next" entry here?
> > > You probably want to use list_next_entry_or_null();
> > >
> >
> > Yea, let me fix this loop structure.
> >
> > > > +
> > > > + err = __xe_svm_garbage_collector(vm, range);
> > > > + if (err) {
> > > > + drm_warn(&vm->xe->drm,
> > > > + "Garbage collection failed:
> > > > %d\n",
> > > > err);
> > > > + xe_vm_kill(vm, true);
> > > > + return err;
> > > > + }
> > > > +
> > > > + spin_lock(&vm->svm.garbage_collector.lock);
> > > > + }
> > > > + spin_unlock(&vm->svm.garbage_collector.lock);
> > > > +
> > > > + return 0;
> > > > +}
> > > > +
> > > > +static void xe_svm_garbage_collector_work_func(struct
> > > > work_struct
> > > > *w)
> > > > +{
> > > > + struct xe_vm *vm = container_of(w, struct xe_vm,
> > > > + svm.garbage_collector.wo
> > > > rk);
> > > > +
> > > > + down_write(&vm->lock);
> > > > + xe_svm_garbage_collector(vm);
> > > > + up_write(&vm->lock);
> > > > +}
> > > > +
> > > > static const struct drm_gpusvm_ops gpusvm_ops = {
> > > > .range_alloc = xe_svm_range_alloc,
> > > > .range_free = xe_svm_range_free,
> > > > @@ -198,6 +271,11 @@ static const u64 fault_chunk_sizes[] = {
> > > >
> > > > int xe_svm_init(struct xe_vm *vm)
> > > > {
> > > > + spin_lock_init(&vm->svm.garbage_collector.lock);
> > > > + INIT_LIST_HEAD(&vm->svm.garbage_collector.range_list);
> > > > + INIT_WORK(&vm->svm.garbage_collector.work,
> > > > + xe_svm_garbage_collector_work_func);
> > > > +
> > > > return drm_gpusvm_init(&vm->svm.gpusvm, "Xe SVM", &vm-
> > > > >xe-
> > > > > drm,
> > > > current->mm, NULL, 0, vm->size,
> > > > SZ_512M, &gpusvm_ops,
> > > > fault_chunk_sizes,
> > > > @@ -211,6 +289,8 @@ void xe_svm_close(struct xe_vm *vm)
> > > > /* Flush running notifiers making xe_vm_close() visable
> > > > */
> > > > xe_svm_notifier_lock(vm);
> > > > xe_svm_notifier_unlock(vm);
> > > > +
> > > > + flush_work(&vm->svm.garbage_collector.work);
> > > > }
> > > >
> > > > void xe_svm_fini(struct xe_vm *vm)
> > > > @@ -241,7 +321,10 @@ int xe_svm_handle_pagefault(struct xe_vm
> > > > *vm,
> > > > struct xe_vma *vma,
> > > > lockdep_assert_held_write(&vm->lock);
> > > >
> > > > retry:
> > > > - /* TODO: Run garbage collector */
> > > > + /* Always process UNMAPs first so view SVM ranges is
> > > > current
> > > > */
> > > > + err = xe_svm_garbage_collector(vm);
> > > > + if (err)
> > > > + return err;
> > > >
> > > > r = drm_gpusvm_range_find_or_insert(&vm->svm.gpusvm,
> > > > fault_addr,
> > > > xe_vma_start(vma),
> > > > xe_vma_end(vma),
> > > > diff --git a/drivers/gpu/drm/xe/xe_svm.h
> > > > b/drivers/gpu/drm/xe/xe_svm.h
> > > > index ee0bd1ae655b..06d90d0f71a6 100644
> > > > --- a/drivers/gpu/drm/xe/xe_svm.h
> > > > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > > > @@ -17,6 +17,7 @@ struct xe_vma;
> > > >
> > > > struct xe_svm_range {
> > > > struct drm_gpusvm_range base;
> > > > + struct list_head garbage_collector_link;
> > > > u8 tile_present;
> > > > u8 tile_invalidated;
> > > > };
> > > > diff --git a/drivers/gpu/drm/xe/xe_vm.c
> > > > b/drivers/gpu/drm/xe/xe_vm.c
> > > > index 63aa0a25d3b7..399cbbdbddd5 100644
> > > > --- a/drivers/gpu/drm/xe/xe_vm.c
> > > > +++ b/drivers/gpu/drm/xe/xe_vm.c
> > > > @@ -3071,6 +3071,10 @@ int xe_vm_bind_ioctl(struct drm_device
> > > > *dev,
> > > > void *data, struct drm_file *file)
> > > > goto put_exec_queue;
> > > > }
> > > >
> > > > + /* Ensure all UNMAPs visable */
> > > > + if (xe_vm_in_fault_mode(vm))
> > > > + flush_work(&vm->svm.garbage_collector.work);
> > >
> > > Hmm, what is someone added an UNMAP here?
> > >
> >
> > What we really trying to guard to against is user space doing
> > something
> > like this:
> >
> > addr = malloc();
> > gpu access
> > free(addr)
> > bind_bo(addr);
> >
> > We want to make sure all SVM mappings from the GPU access have
> > processed
> > the UNMAP events from the 'free(addr)'. So I think the code is fine
> > as
> > is - we just want to make sure UNMAP events prior to the IOCTL are
> > processed.
>
> But the notion of "prior" only exists in the presence of some form of
> synchronization, like a lock. Let's say another thread calls a free
> either
>
> a) before the flush_work
> b) racing with the flush_work
> c) after the flush_work
>
> Is there any difference WRT correctness and how do we differentiate?
>
> I don't think it's clear what this flush_work actually protects
> against.
>
I still think this is ok.
Let's say we have 2 threads...
- Thread A munmap(address A) - This address has an SVM GPU binding, so we will get an UNMAP notifier
- Thread B address B = mmap() - This happens to equal address A
- Thread B bind BO(address B) - We flush_work, which ensures the UNMAP event is processed and the bind sees a current view of SVM state, avoiding the bind returning -EBUSY
The key here is that it is impossible for address A == address B unless an
UNMAP event is queued in the garbage collector - unless I'm completely
missing something. This is really the only race we care about - we care
that UNMAP events issued prior to the bind and matching the bind address
are processed. If other UNMAP events occur while processing the bind, that
is fine as they shouldn't collide. Worst case, if a collision does occur,
we'd return -EBUSY in the bind IOCTL.
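To make it concrete, here is a minimal userspace sketch of the sequence
being guarded against - gpu_touch() and bind_bo_at() are hypothetical
stand-ins for a GPU kernel dereferencing the address and a wrapper around
the VM bind IOCTL, not real uAPI:

	#include <stdint.h>
	#include <sys/mman.h>

	static void repro(int fd, uint32_t vm_id, uint32_t bo_handle, size_t sz)
	{
		void *addr = mmap(NULL, sz, PROT_READ | PROT_WRITE,
				  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		gpu_touch(addr, sz);	/* faults in SVM ranges + GPU bindings */
		munmap(addr, sz);	/* queues UNMAP events for the garbage collector */

		/* the flush_work ensures the UNMAP events above are processed */
		bind_bo_at(fd, vm_id, bo_handle, (uint64_t)addr);
	}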
Does this make sense?
Matt
> Thanks,
> Thomas
>
>
>
>
> >
> > Matt
> >
> >
> > > Thanks,
> > > Thomas
> > >
> > > > +
> > > > err = down_write_killable(&vm->lock);
> > > > if (err)
> > > > goto put_vm;
> > > > diff --git a/drivers/gpu/drm/xe/xe_vm_types.h
> > > > b/drivers/gpu/drm/xe/xe_vm_types.h
> > > > index b736e53779d2..2eae3575c409 100644
> > > > --- a/drivers/gpu/drm/xe/xe_vm_types.h
> > > > +++ b/drivers/gpu/drm/xe/xe_vm_types.h
> > > > @@ -146,6 +146,11 @@ struct xe_vm {
> > > > struct {
> > > > /** @svm.gpusvm: base GPUSVM used to track fault
> > > > allocations */
> > > > struct drm_gpusvm gpusvm;
> > > > + struct {
> > > > + spinlock_t lock;
> > > > + struct list_head range_list;
> > > > + struct work_struct work;
> > > > + } garbage_collector;
> > > > } svm;
> > > >
> > > > struct xe_device *xe;
> > >
>
^ permalink raw reply [flat|nested] 129+ messages in thread
* [PATCH v2 13/29] drm/xe: Add unbind to SVM garbage collector
2024-10-16 3:24 [PATCH v2 00/29] Introduce GPU SVM and Xe SVM implementation Matthew Brost
` (11 preceding siblings ...)
2024-10-16 3:25 ` [PATCH v2 12/29] drm/xe: Add SVM garbage collector Matthew Brost
@ 2024-10-16 3:25 ` Matthew Brost
2024-11-19 15:31 ` Thomas Hellström
2024-10-16 3:25 ` [PATCH v2 14/29] drm/xe: Do not allow system allocator VMA unbind if the GPU has bindings Matthew Brost
` (18 subsequent siblings)
31 siblings, 1 reply; 129+ messages in thread
From: Matthew Brost @ 2024-10-16 3:25 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
Add unbind to SVM garbage collector. To facilitate this, add an unbind
support function to the VM layer which unbinds an SVM range. Also teach the
PT layer to understand unbinds of SVM ranges.
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
drivers/gpu/drm/xe/xe_pt.c | 84 ++++++++++++++++++++++++++------
drivers/gpu/drm/xe/xe_svm.c | 9 +++-
drivers/gpu/drm/xe/xe_vm.c | 73 +++++++++++++++++++++++++++
drivers/gpu/drm/xe/xe_vm.h | 2 +
drivers/gpu/drm/xe/xe_vm_types.h | 12 ++++-
5 files changed, 162 insertions(+), 18 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
index 024e4eb83408..687abd1a5e74 100644
--- a/drivers/gpu/drm/xe/xe_pt.c
+++ b/drivers/gpu/drm/xe/xe_pt.c
@@ -925,10 +925,16 @@ static void xe_pt_cancel_bind(struct xe_vma *vma,
}
}
+#define INVALID_VMA (struct xe_vma*)(0xdeaddeadull)
+
static void xe_pt_commit_locks_assert(struct xe_vma *vma)
{
- struct xe_vm *vm = xe_vma_vm(vma);
+ struct xe_vm *vm;
+ if (vma == INVALID_VMA)
+ return;
+
+ vm = xe_vma_vm(vma);
lockdep_assert_held(&vm->lock);
if (!xe_vma_has_no_bo(vma))
@@ -954,7 +960,8 @@ static void xe_pt_commit(struct xe_vma *vma,
for (j = 0; j < entries[i].qwords; j++) {
struct xe_pt *oldpte = entries[i].pt_entries[j].pt;
- xe_pt_destroy(oldpte, xe_vma_vm(vma)->flags, deferred);
+ xe_pt_destroy(oldpte, (vma == INVALID_VMA) ? 0 :
+ xe_vma_vm(vma)->flags, deferred);
}
}
}
@@ -1387,6 +1394,9 @@ static int xe_pt_svm_pre_commit(struct xe_migrate_pt_update *pt_update)
list_for_each_entry(op, &vops->list, link) {
struct xe_svm_range *range = op->map_range.range;
+ if (op->subop == XE_VMA_SUBOP_UNMAP_RANGE)
+ continue;
+
xe_assert(vm->xe, xe_vma_is_system_allocator(op->map_range.vma));
xe_assert(vm->xe, op->subop == XE_VMA_SUBOP_MAP_RANGE);
@@ -1585,7 +1595,9 @@ static const struct xe_pt_walk_ops xe_pt_stage_unbind_ops = {
* xe_pt_stage_unbind() - Build page-table update structures for an unbind
* operation
* @tile: The tile we're unbinding for.
+ * @vm: The vm
* @vma: The vma we're unbinding.
+ * @range: The range we're unbinding.
* @entries: Caller-provided storage for the update structures.
*
* Builds page-table update structures for an unbind operation. The function
@@ -1595,9 +1607,14 @@ static const struct xe_pt_walk_ops xe_pt_stage_unbind_ops = {
*
* Return: The number of entries used.
*/
-static unsigned int xe_pt_stage_unbind(struct xe_tile *tile, struct xe_vma *vma,
+static unsigned int xe_pt_stage_unbind(struct xe_tile *tile,
+ struct xe_vm *vm,
+ struct xe_vma *vma,
+ struct xe_svm_range *range,
struct xe_vm_pgtable_update *entries)
{
+ u64 start = range ? range->base.va.start : xe_vma_start(vma);
+ u64 end = range ? range->base.va.end : xe_vma_end(vma);
struct xe_pt_stage_unbind_walk xe_walk = {
.base = {
.ops = &xe_pt_stage_unbind_ops,
@@ -1605,14 +1622,14 @@ static unsigned int xe_pt_stage_unbind(struct xe_tile *tile, struct xe_vma *vma,
.max_level = XE_PT_HIGHEST_LEVEL,
},
.tile = tile,
- .modified_start = xe_vma_start(vma),
- .modified_end = xe_vma_end(vma),
+ .modified_start = start,
+ .modified_end = end,
.wupd.entries = entries,
};
- struct xe_pt *pt = xe_vma_vm(vma)->pt_root[tile->id];
+ struct xe_pt *pt = vm->pt_root[tile->id];
- (void)xe_pt_walk_shared(&pt->base, pt->level, xe_vma_start(vma),
- xe_vma_end(vma), &xe_walk.base);
+ (void)xe_pt_walk_shared(&pt->base, pt->level, start, end,
+ &xe_walk.base);
return xe_walk.wupd.num_used_entries;
}
@@ -1854,13 +1871,6 @@ static int unbind_op_prepare(struct xe_tile *tile,
"Preparing unbind, with range [%llx...%llx)\n",
xe_vma_start(vma), xe_vma_end(vma) - 1);
- /*
- * Wait for invalidation to complete. Can corrupt internal page table
- * state if an invalidation is running while preparing an unbind.
- */
- if (xe_vma_is_userptr(vma) && xe_vm_in_fault_mode(xe_vma_vm(vma)))
- mmu_interval_read_begin(&to_userptr_vma(vma)->userptr.notifier);
-
pt_op->vma = vma;
pt_op->bind = false;
pt_op->rebind = false;
@@ -1869,7 +1879,8 @@ static int unbind_op_prepare(struct xe_tile *tile,
if (err)
return err;
- pt_op->num_entries = xe_pt_stage_unbind(tile, vma, pt_op->entries);
+ pt_op->num_entries = xe_pt_stage_unbind(tile, xe_vma_vm(vma),
+ vma, NULL, pt_op->entries);
xe_vm_dbg_print_entries(tile_to_xe(tile), pt_op->entries,
pt_op->num_entries, false);
@@ -1884,6 +1895,42 @@ static int unbind_op_prepare(struct xe_tile *tile,
return 0;
}
+static int unbind_range_prepare(struct xe_vm *vm,
+ struct xe_tile *tile,
+ struct xe_vm_pgtable_update_ops *pt_update_ops,
+ struct xe_svm_range *range)
+{
+ u32 current_op = pt_update_ops->current_op;
+ struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[current_op];
+
+ if (!(range->tile_present & BIT(tile->id)))
+ return 0;
+
+ vm_dbg(&vm->xe->drm,
+ "Preparing unbind, with range [%llx...%llx)\n",
+ range->base.va.start, range->base.va.end - 1);
+
+ pt_op->vma = INVALID_VMA;
+ pt_op->bind = false;
+ pt_op->rebind = false;
+
+ pt_op->num_entries = xe_pt_stage_unbind(tile, vm, NULL, range,
+ pt_op->entries);
+
+ xe_vm_dbg_print_entries(tile_to_xe(tile), pt_op->entries,
+ pt_op->num_entries, false);
+ xe_pt_update_ops_rfence_interval(pt_update_ops, range->base.va.start,
+ range->base.va.end);
+ ++pt_update_ops->current_op;
+ pt_update_ops->needs_svm_lock = true;
+ pt_update_ops->needs_invalidation = true;
+
+ xe_pt_commit_prepare_unbind(INVALID_VMA, pt_op->entries,
+ pt_op->num_entries);
+
+ return 0;
+}
+
static int op_prepare(struct xe_vm *vm,
struct xe_tile *tile,
struct xe_vm_pgtable_update_ops *pt_update_ops,
@@ -1951,6 +1998,9 @@ static int op_prepare(struct xe_vm *vm,
err = bind_range_prepare(vm, tile, pt_update_ops,
op->map_range.vma,
op->map_range.range);
+ } else if (op->subop == XE_VMA_SUBOP_UNMAP_RANGE) {
+ err = unbind_range_prepare(vm, tile, pt_update_ops,
+ op->unmap_range.range);
}
break;
default:
@@ -2139,6 +2189,8 @@ static void op_commit(struct xe_vm *vm,
if (op->subop == XE_VMA_SUBOP_MAP_RANGE) {
op->map_range.range->tile_present |= BIT(tile->id);
op->map_range.range->tile_invalidated &= ~BIT(tile->id);
+ } else if (op->subop == XE_VMA_SUBOP_UNMAP_RANGE) {
+ op->unmap_range.range->tile_present &= ~BIT(tile->id);
}
break;
}
diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index 9c2f44cba166..0762126f65e0 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -208,7 +208,14 @@ static void xe_svm_invalidate(struct drm_gpusvm *gpusvm,
static int __xe_svm_garbage_collector(struct xe_vm *vm,
struct xe_svm_range *range)
{
- /* TODO: Do unbind */
+ struct dma_fence *fence;
+
+ xe_vm_lock(vm, false);
+ fence = xe_vm_range_unbind(vm, range);
+ xe_vm_unlock(vm);
+ if (IS_ERR(fence))
+ return PTR_ERR(fence);
+ dma_fence_put(fence);
drm_gpusvm_range_remove(&vm->svm.gpusvm, &range->base);
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 399cbbdbddd5..76a20e96084e 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -972,6 +972,79 @@ struct dma_fence *xe_vm_range_rebind(struct xe_vm *vm,
return fence;
}
+static void xe_vm_populate_range_unbind(struct xe_vma_op *op,
+ struct xe_svm_range *range)
+{
+ INIT_LIST_HEAD(&op->link);
+ op->tile_mask = range->tile_present;
+ op->base.op = DRM_GPUVA_OP_USER;
+ op->subop = XE_VMA_SUBOP_UNMAP_RANGE;
+ op->unmap_range.range = range;
+}
+
+static int
+xe_vm_ops_add_range_unbind(struct xe_vma_ops *vops,
+ struct xe_svm_range *range)
+{
+ struct xe_vma_op *op;
+
+ op = kzalloc(sizeof(*op), GFP_KERNEL);
+ if (!op)
+ return -ENOMEM;
+
+ xe_vm_populate_range_unbind(op, range);
+ list_add_tail(&op->link, &vops->list);
+ xe_vma_ops_incr_pt_update_ops(vops, range->tile_present);
+
+ return 0;
+}
+
+struct dma_fence *xe_vm_range_unbind(struct xe_vm *vm,
+ struct xe_svm_range *range)
+{
+ struct dma_fence *fence = NULL;
+ struct xe_vma_ops vops;
+ struct xe_vma_op *op, *next_op;
+ struct xe_tile *tile;
+ u8 id;
+ int err;
+
+ lockdep_assert_held(&vm->lock);
+ xe_vm_assert_held(vm);
+ xe_assert(vm->xe, xe_vm_in_fault_mode(vm));
+
+ if (!range->tile_present)
+ return dma_fence_get_stub();
+
+ xe_vma_ops_init(&vops, vm, NULL, NULL, 0);
+ for_each_tile(tile, vm->xe, id) {
+ vops.pt_update_ops[id].wait_vm_bookkeep = true;
+ vops.pt_update_ops[tile->id].q =
+ xe_tile_migrate_exec_queue(tile);
+ }
+
+ err = xe_vm_ops_add_range_unbind(&vops, range);
+ if (err)
+ return ERR_PTR(err);
+
+ err = xe_vma_ops_alloc(&vops, false);
+ if (err) {
+ fence = ERR_PTR(err);
+ goto free_ops;
+ }
+
+ fence = ops_execute(vm, &vops);
+
+free_ops:
+ list_for_each_entry_safe(op, next_op, &vops.list, link) {
+ list_del(&op->link);
+ kfree(op);
+ }
+ xe_vma_ops_fini(&vops);
+
+ return fence;
+}
+
static void xe_vma_free(struct xe_vma *vma)
{
if (xe_vma_is_userptr(vma))
diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
index 8bd921b33090..d577ca9e3d65 100644
--- a/drivers/gpu/drm/xe/xe_vm.h
+++ b/drivers/gpu/drm/xe/xe_vm.h
@@ -222,6 +222,8 @@ struct dma_fence *xe_vm_range_rebind(struct xe_vm *vm,
struct xe_vma *vma,
struct xe_svm_range *range,
u8 tile_mask);
+struct dma_fence *xe_vm_range_unbind(struct xe_vm *vm,
+ struct xe_svm_range *range);
int xe_vm_invalidate_vma(struct xe_vma *vma);
diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
index 2eae3575c409..d38cf7558f62 100644
--- a/drivers/gpu/drm/xe/xe_vm_types.h
+++ b/drivers/gpu/drm/xe/xe_vm_types.h
@@ -348,6 +348,12 @@ struct xe_vma_op_map_range {
struct xe_svm_range *range;
};
+/** struct xe_vma_op_unmap_range - VMA unmap range operation */
+struct xe_vma_op_unmap_range {
+ /** @range: SVM range to unmap */
+ struct xe_svm_range *range;
+};
+
/** enum xe_vma_op_flags - flags for VMA operation */
enum xe_vma_op_flags {
/** @XE_VMA_OP_COMMITTED: VMA operation committed */
@@ -362,6 +368,8 @@ enum xe_vma_op_flags {
enum xe_vma_subop {
/** @XE_VMA_SUBOP_MAP_RANGE: Map range */
XE_VMA_SUBOP_MAP_RANGE,
+ /** @XE_VMA_SUBOP_UNMAP_RANGE: Unmap range */
+ XE_VMA_SUBOP_UNMAP_RANGE,
};
/** struct xe_vma_op - VMA operation */
@@ -384,8 +392,10 @@ struct xe_vma_op {
struct xe_vma_op_remap remap;
/** @prefetch: VMA prefetch operation specific data */
struct xe_vma_op_prefetch prefetch;
- /** @map: VMA map range operation specific data */
+ /** @map_range: VMA map range operation specific data */
struct xe_vma_op_map_range map_range;
+ /** @unmap_range: VMA unmap range operation specific data */
+ struct xe_vma_op_map_range unmap_range;
};
};
--
2.34.1
^ permalink raw reply related [flat|nested] 129+ messages in thread* Re: [PATCH v2 13/29] drm/xe: Add unbind to SVM garbage collector
2024-10-16 3:25 ` [PATCH v2 13/29] drm/xe: Add unbind to " Matthew Brost
@ 2024-11-19 15:31 ` Thomas Hellström
2024-11-19 23:44 ` Matthew Brost
0 siblings, 1 reply; 129+ messages in thread
From: Thomas Hellström @ 2024-11-19 15:31 UTC (permalink / raw)
To: Matthew Brost, intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, simona.vetter, felix.kuehling,
dakr
On Tue, 2024-10-15 at 20:25 -0700, Matthew Brost wrote:
> Add unbind to SVM garbage collector. To facilitate add unbind support
> function to VM layer which unbinds a SVM range. Also teach PY layer
> to
> understand unbinds of SVM ranges.
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
> drivers/gpu/drm/xe/xe_pt.c | 84 ++++++++++++++++++++++++++----
> --
> drivers/gpu/drm/xe/xe_svm.c | 9 +++-
> drivers/gpu/drm/xe/xe_vm.c | 73 +++++++++++++++++++++++++++
> drivers/gpu/drm/xe/xe_vm.h | 2 +
> drivers/gpu/drm/xe/xe_vm_types.h | 12 ++++-
> 5 files changed, 162 insertions(+), 18 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
> index 024e4eb83408..687abd1a5e74 100644
> --- a/drivers/gpu/drm/xe/xe_pt.c
> +++ b/drivers/gpu/drm/xe/xe_pt.c
> @@ -925,10 +925,16 @@ static void xe_pt_cancel_bind(struct xe_vma
> *vma,
> }
> }
>
> +#define INVALID_VMA (struct xe_vma*)(0xdeaddeadull)
Please prefix with XE_ to avoid future name clashes.
> +
> static void xe_pt_commit_locks_assert(struct xe_vma *vma)
> {
> - struct xe_vm *vm = xe_vma_vm(vma);
> + struct xe_vm *vm;
>
> + if (vma == INVALID_VMA)
> + return;
> +
> + vm = xe_vma_vm(vma);
> lockdep_assert_held(&vm->lock);
>
> if (!xe_vma_has_no_bo(vma))
> @@ -954,7 +960,8 @@ static void xe_pt_commit(struct xe_vma *vma,
> for (j = 0; j < entries[i].qwords; j++) {
> struct xe_pt *oldpte =
> entries[i].pt_entries[j].pt;
>
> - xe_pt_destroy(oldpte, xe_vma_vm(vma)->flags,
> deferred);
> + xe_pt_destroy(oldpte, (vma == INVALID_VMA) ?
> 0 :
> + xe_vma_vm(vma)->flags,
> deferred);
> }
> }
> }
> @@ -1387,6 +1394,9 @@ static int xe_pt_svm_pre_commit(struct
> xe_migrate_pt_update *pt_update)
> list_for_each_entry(op, &vops->list, link) {
> struct xe_svm_range *range = op->map_range.range;
>
> + if (op->subop == XE_VMA_SUBOP_UNMAP_RANGE)
> + continue;
> +
> xe_assert(vm->xe, xe_vma_is_system_allocator(op-
> >map_range.vma));
> xe_assert(vm->xe, op->subop ==
> XE_VMA_SUBOP_MAP_RANGE);
>
> @@ -1585,7 +1595,9 @@ static const struct xe_pt_walk_ops
> xe_pt_stage_unbind_ops = {
> * xe_pt_stage_unbind() - Build page-table update structures for an
> unbind
> * operation
> * @tile: The tile we're unbinding for.
> + * @vm: The vm
> * @vma: The vma we're unbinding.
> + * @range: The range we're unbinding.
> * @entries: Caller-provided storage for the update structures.
> *
> * Builds page-table update structures for an unbind operation. The
> function
> @@ -1595,9 +1607,14 @@ static const struct xe_pt_walk_ops
> xe_pt_stage_unbind_ops = {
> *
> * Return: The number of entries used.
> */
> -static unsigned int xe_pt_stage_unbind(struct xe_tile *tile, struct
> xe_vma *vma,
> +static unsigned int xe_pt_stage_unbind(struct xe_tile *tile,
> + struct xe_vm *vm,
> + struct xe_vma *vma,
> + struct xe_svm_range *range,
> struct xe_vm_pgtable_update
> *entries)
> {
> + u64 start = range ? range->base.va.start :
> xe_vma_start(vma);
> + u64 end = range ? range->base.va.end : xe_vma_end(vma);
> struct xe_pt_stage_unbind_walk xe_walk = {
> .base = {
> .ops = &xe_pt_stage_unbind_ops,
> @@ -1605,14 +1622,14 @@ static unsigned int xe_pt_stage_unbind(struct
> xe_tile *tile, struct xe_vma *vma,
> .max_level = XE_PT_HIGHEST_LEVEL,
> },
> .tile = tile,
> - .modified_start = xe_vma_start(vma),
> - .modified_end = xe_vma_end(vma),
> + .modified_start = start,
> + .modified_end = end,
> .wupd.entries = entries,
> };
> - struct xe_pt *pt = xe_vma_vm(vma)->pt_root[tile->id];
> + struct xe_pt *pt = vm->pt_root[tile->id];
>
> - (void)xe_pt_walk_shared(&pt->base, pt->level,
> xe_vma_start(vma),
> - xe_vma_end(vma), &xe_walk.base);
> + (void)xe_pt_walk_shared(&pt->base, pt->level, start, end,
> + &xe_walk.base);
>
> return xe_walk.wupd.num_used_entries;
> }
> @@ -1854,13 +1871,6 @@ static int unbind_op_prepare(struct xe_tile
> *tile,
> "Preparing unbind, with range [%llx...%llx)\n",
> xe_vma_start(vma), xe_vma_end(vma) - 1);
>
> - /*
> - * Wait for invalidation to complete. Can corrupt internal
> page table
> - * state if an invalidation is running while preparing an
> unbind.
> - */
> - if (xe_vma_is_userptr(vma) &&
> xe_vm_in_fault_mode(xe_vma_vm(vma)))
> - mmu_interval_read_begin(&to_userptr_vma(vma)-
> >userptr.notifier);
> -
> pt_op->vma = vma;
> pt_op->bind = false;
> pt_op->rebind = false;
> @@ -1869,7 +1879,8 @@ static int unbind_op_prepare(struct xe_tile
> *tile,
> if (err)
> return err;
>
> - pt_op->num_entries = xe_pt_stage_unbind(tile, vma, pt_op-
> >entries);
> + pt_op->num_entries = xe_pt_stage_unbind(tile,
> xe_vma_vm(vma),
> + vma, NULL, pt_op-
> >entries);
>
> xe_vm_dbg_print_entries(tile_to_xe(tile), pt_op->entries,
> pt_op->num_entries, false);
> @@ -1884,6 +1895,42 @@ static int unbind_op_prepare(struct xe_tile
> *tile,
> return 0;
> }
>
> +static int unbind_range_prepare(struct xe_vm *vm,
> + struct xe_tile *tile,
> + struct xe_vm_pgtable_update_ops
> *pt_update_ops,
> + struct xe_svm_range *range)
> +{
> + u32 current_op = pt_update_ops->current_op;
> + struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops-
> >ops[current_op];
> +
> + if (!(range->tile_present & BIT(tile->id)))
> + return 0;
> +
> + vm_dbg(&vm->xe->drm,
> + "Preparing unbind, with range [%llx...%llx)\n",
> + range->base.va.start, range->base.va.end - 1);
> +
> + pt_op->vma = INVALID_VMA;
> + pt_op->bind = false;
> + pt_op->rebind = false;
> +
> + pt_op->num_entries = xe_pt_stage_unbind(tile, vm, NULL,
> range,
> + pt_op->entries);
> +
> + xe_vm_dbg_print_entries(tile_to_xe(tile), pt_op->entries,
> + pt_op->num_entries, false);
> + xe_pt_update_ops_rfence_interval(pt_update_ops, range-
> >base.va.start,
> + range->base.va.end);
> + ++pt_update_ops->current_op;
> + pt_update_ops->needs_svm_lock = true;
> + pt_update_ops->needs_invalidation = true;
> +
> + xe_pt_commit_prepare_unbind(INVALID_VMA, pt_op->entries,
> + pt_op->num_entries);
> +
> + return 0;
> +}
> +
> static int op_prepare(struct xe_vm *vm,
> struct xe_tile *tile,
> struct xe_vm_pgtable_update_ops
> *pt_update_ops,
> @@ -1951,6 +1998,9 @@ static int op_prepare(struct xe_vm *vm,
> err = bind_range_prepare(vm, tile,
> pt_update_ops,
> op->map_range.vma,
> op-
> >map_range.range);
> + } else if (op->subop == XE_VMA_SUBOP_UNMAP_RANGE) {
> + err = unbind_range_prepare(vm, tile,
> pt_update_ops,
> + op-
> >unmap_range.range);
> }
> break;
> default:
> @@ -2139,6 +2189,8 @@ static void op_commit(struct xe_vm *vm,
> if (op->subop == XE_VMA_SUBOP_MAP_RANGE) {
> op->map_range.range->tile_present |=
> BIT(tile->id);
> op->map_range.range->tile_invalidated &=
> ~BIT(tile->id);
> + } else if (op->subop == XE_VMA_SUBOP_UNMAP_RANGE) {
> + op->unmap_range.range->tile_present &=
> ~BIT(tile->id);
> }
> break;
> }
I think this further stresses the need to provide a pt code interface
that is oblivious of vmas, userptr and ranges so that we can get rid of
all special-casing, but for now code looks good as is IMO.
> diff --git a/drivers/gpu/drm/xe/xe_svm.c
> b/drivers/gpu/drm/xe/xe_svm.c
> index 9c2f44cba166..0762126f65e0 100644
> --- a/drivers/gpu/drm/xe/xe_svm.c
> +++ b/drivers/gpu/drm/xe/xe_svm.c
> @@ -208,7 +208,14 @@ static void xe_svm_invalidate(struct drm_gpusvm
> *gpusvm,
> static int __xe_svm_garbage_collector(struct xe_vm *vm,
> struct xe_svm_range *range)
> {
> - /* TODO: Do unbind */
> + struct dma_fence *fence;
> +
> + xe_vm_lock(vm, false);
> + fence = xe_vm_range_unbind(vm, range);
> + xe_vm_unlock(vm);
> + if (IS_ERR(fence))
> + return PTR_ERR(fence);
> + dma_fence_put(fence);
>
> drm_gpusvm_range_remove(&vm->svm.gpusvm, &range->base);
>
> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> index 399cbbdbddd5..76a20e96084e 100644
> --- a/drivers/gpu/drm/xe/xe_vm.c
> +++ b/drivers/gpu/drm/xe/xe_vm.c
> @@ -972,6 +972,79 @@ struct dma_fence *xe_vm_range_rebind(struct
> xe_vm *vm,
> return fence;
> }
>
> +static void xe_vm_populate_range_unbind(struct xe_vma_op *op,
> + struct xe_svm_range *range)
> +{
> + INIT_LIST_HEAD(&op->link);
> + op->tile_mask = range->tile_present;
> + op->base.op = DRM_GPUVA_OP_USER;
> + op->subop = XE_VMA_SUBOP_UNMAP_RANGE;
> + op->unmap_range.range = range;
> +}
> +
> +static int
> +xe_vm_ops_add_range_unbind(struct xe_vma_ops *vops,
> + struct xe_svm_range *range)
> +{
> + struct xe_vma_op *op;
> +
> + op = kzalloc(sizeof(*op), GFP_KERNEL);
> + if (!op)
> + return -ENOMEM;
> +
> + xe_vm_populate_range_unbind(op, range);
> + list_add_tail(&op->link, &vops->list);
> + xe_vma_ops_incr_pt_update_ops(vops, range->tile_present);
> +
> + return 0;
> +}
> +
> +struct dma_fence *xe_vm_range_unbind(struct xe_vm *vm,
> + struct xe_svm_range *range)
Kerneldoc.
> +{
> + struct dma_fence *fence = NULL;
> + struct xe_vma_ops vops;
> + struct xe_vma_op *op, *next_op;
> + struct xe_tile *tile;
> + u8 id;
> + int err;
> +
> + lockdep_assert_held(&vm->lock);
> + xe_vm_assert_held(vm);
> + xe_assert(vm->xe, xe_vm_in_fault_mode(vm));
> +
> + if (!range->tile_present)
> + return dma_fence_get_stub();
> +
> + xe_vma_ops_init(&vops, vm, NULL, NULL, 0);
> + for_each_tile(tile, vm->xe, id) {
> + vops.pt_update_ops[id].wait_vm_bookkeep = true;
> + vops.pt_update_ops[tile->id].q =
> + xe_tile_migrate_exec_queue(tile);
> + }
> +
> + err = xe_vm_ops_add_range_unbind(&vops, range);
> + if (err)
> + return ERR_PTR(err);
> +
> + err = xe_vma_ops_alloc(&vops, false);
> + if (err) {
> + fence = ERR_PTR(err);
> + goto free_ops;
> + }
> +
> + fence = ops_execute(vm, &vops);
> +
> +free_ops:
> + list_for_each_entry_safe(op, next_op, &vops.list, link) {
> + list_del(&op->link);
> + kfree(op);
> + }
> + xe_vma_ops_fini(&vops);
> +
> + return fence;
> +}
> +
> static void xe_vma_free(struct xe_vma *vma)
> {
> if (xe_vma_is_userptr(vma))
> diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
> index 8bd921b33090..d577ca9e3d65 100644
> --- a/drivers/gpu/drm/xe/xe_vm.h
> +++ b/drivers/gpu/drm/xe/xe_vm.h
> @@ -222,6 +222,8 @@ struct dma_fence *xe_vm_range_rebind(struct xe_vm
> *vm,
> struct xe_vma *vma,
> struct xe_svm_range *range,
> u8 tile_mask);
> +struct dma_fence *xe_vm_range_unbind(struct xe_vm *vm,
> + struct xe_svm_range *range);
>
> int xe_vm_invalidate_vma(struct xe_vma *vma);
>
> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h
> b/drivers/gpu/drm/xe/xe_vm_types.h
> index 2eae3575c409..d38cf7558f62 100644
> --- a/drivers/gpu/drm/xe/xe_vm_types.h
> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
> @@ -348,6 +348,12 @@ struct xe_vma_op_map_range {
> struct xe_svm_range *range;
> };
>
> +/** struct xe_vma_op_unmap_range - VMA unmap range operation */
> +struct xe_vma_op_unmap_range {
> + /** @range: SVM range to unmap */
> + struct xe_svm_range *range;
> +};
> +
> /** enum xe_vma_op_flags - flags for VMA operation */
> enum xe_vma_op_flags {
> /** @XE_VMA_OP_COMMITTED: VMA operation committed */
> @@ -362,6 +368,8 @@ enum xe_vma_op_flags {
> enum xe_vma_subop {
> /** @XE_VMA_SUBOP_MAP_RANGE: Map range */
> XE_VMA_SUBOP_MAP_RANGE,
> + /** @XE_VMA_SUBOP_UNMAP_RANGE: Unmap range */
> + XE_VMA_SUBOP_UNMAP_RANGE,
> };
>
> /** struct xe_vma_op - VMA operation */
> @@ -384,8 +392,10 @@ struct xe_vma_op {
> struct xe_vma_op_remap remap;
> /** @prefetch: VMA prefetch operation specific data
> */
> struct xe_vma_op_prefetch prefetch;
> - /** @map: VMA map range operation specific data */
> + /** @map_range: VMA map range operation specific
> data */
> struct xe_vma_op_map_range map_range;
> + /** @unmap_range: VMA unmap range operation specific
> data */
> + struct xe_vma_op_map_range unmap_range;
> };
> };
>
Thanks,
Thomas
^ permalink raw reply [flat|nested] 129+ messages in thread* Re: [PATCH v2 13/29] drm/xe: Add unbind to SVM garbage collector
2024-11-19 15:31 ` Thomas Hellström
@ 2024-11-19 23:44 ` Matthew Brost
0 siblings, 0 replies; 129+ messages in thread
From: Matthew Brost @ 2024-11-19 23:44 UTC (permalink / raw)
To: Thomas Hellström
Cc: intel-xe, dri-devel, apopple, airlied, christian.koenig,
simona.vetter, felix.kuehling, dakr
On Tue, Nov 19, 2024 at 04:31:05PM +0100, Thomas Hellström wrote:
> On Tue, 2024-10-15 at 20:25 -0700, Matthew Brost wrote:
> > Add unbind to SVM garbage collector. To facilitate add unbind support
> > function to VM layer which unbinds a SVM range. Also teach PY layer
> > to
> > understand unbinds of SVM ranges.
> >
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> > drivers/gpu/drm/xe/xe_pt.c | 84 ++++++++++++++++++++++++++----
> > --
> > drivers/gpu/drm/xe/xe_svm.c | 9 +++-
> > drivers/gpu/drm/xe/xe_vm.c | 73 +++++++++++++++++++++++++++
> > drivers/gpu/drm/xe/xe_vm.h | 2 +
> > drivers/gpu/drm/xe/xe_vm_types.h | 12 ++++-
> > 5 files changed, 162 insertions(+), 18 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
> > index 024e4eb83408..687abd1a5e74 100644
> > --- a/drivers/gpu/drm/xe/xe_pt.c
> > +++ b/drivers/gpu/drm/xe/xe_pt.c
> > @@ -925,10 +925,16 @@ static void xe_pt_cancel_bind(struct xe_vma
> > *vma,
> > }
> > }
> >
> > +#define INVALID_VMA (struct xe_vma*)(0xdeaddeadull)
>
> Please prefix with XE_ to avoid future name clashes.
>
Good idea.
> > +
> > static void xe_pt_commit_locks_assert(struct xe_vma *vma)
> > {
> > - struct xe_vm *vm = xe_vma_vm(vma);
> > + struct xe_vm *vm;
> >
> > + if (vma == INVALID_VMA)
> > + return;
> > +
> > + vm = xe_vma_vm(vma);
> > lockdep_assert_held(&vm->lock);
> >
> > if (!xe_vma_has_no_bo(vma))
> > @@ -954,7 +960,8 @@ static void xe_pt_commit(struct xe_vma *vma,
> > for (j = 0; j < entries[i].qwords; j++) {
> > struct xe_pt *oldpte =
> > entries[i].pt_entries[j].pt;
> >
> > - xe_pt_destroy(oldpte, xe_vma_vm(vma)->flags,
> > deferred);
> > + xe_pt_destroy(oldpte, (vma == INVALID_VMA) ?
> > 0 :
> > + xe_vma_vm(vma)->flags,
> > deferred);
> > }
> > }
> > }
> > @@ -1387,6 +1394,9 @@ static int xe_pt_svm_pre_commit(struct
> > xe_migrate_pt_update *pt_update)
> > list_for_each_entry(op, &vops->list, link) {
> > struct xe_svm_range *range = op->map_range.range;
> >
> > + if (op->subop == XE_VMA_SUBOP_UNMAP_RANGE)
> > + continue;
> > +
> > xe_assert(vm->xe, xe_vma_is_system_allocator(op-
> > >map_range.vma));
> > xe_assert(vm->xe, op->subop ==
> > XE_VMA_SUBOP_MAP_RANGE);
> >
> > @@ -1585,7 +1595,9 @@ static const struct xe_pt_walk_ops
> > xe_pt_stage_unbind_ops = {
> > * xe_pt_stage_unbind() - Build page-table update structures for an
> > unbind
> > * operation
> > * @tile: The tile we're unbinding for.
> > + * @vm: The vm
> > * @vma: The vma we're unbinding.
> > + * @range: The range we're unbinding.
> > * @entries: Caller-provided storage for the update structures.
> > *
> > * Builds page-table update structures for an unbind operation. The
> > function
> > @@ -1595,9 +1607,14 @@ static const struct xe_pt_walk_ops
> > xe_pt_stage_unbind_ops = {
> > *
> > * Return: The number of entries used.
> > */
> > -static unsigned int xe_pt_stage_unbind(struct xe_tile *tile, struct
> > xe_vma *vma,
> > +static unsigned int xe_pt_stage_unbind(struct xe_tile *tile,
> > + struct xe_vm *vm,
> > + struct xe_vma *vma,
> > + struct xe_svm_range *range,
> > struct xe_vm_pgtable_update
> > *entries)
> > {
> > + u64 start = range ? range->base.va.start :
> > xe_vma_start(vma);
> > + u64 end = range ? range->base.va.end : xe_vma_end(vma);
> > struct xe_pt_stage_unbind_walk xe_walk = {
> > .base = {
> > .ops = &xe_pt_stage_unbind_ops,
> > @@ -1605,14 +1622,14 @@ static unsigned int xe_pt_stage_unbind(struct
> > xe_tile *tile, struct xe_vma *vma,
> > .max_level = XE_PT_HIGHEST_LEVEL,
> > },
> > .tile = tile,
> > - .modified_start = xe_vma_start(vma),
> > - .modified_end = xe_vma_end(vma),
> > + .modified_start = start,
> > + .modified_end = end,
> > .wupd.entries = entries,
> > };
> > - struct xe_pt *pt = xe_vma_vm(vma)->pt_root[tile->id];
> > + struct xe_pt *pt = vm->pt_root[tile->id];
> >
> > - (void)xe_pt_walk_shared(&pt->base, pt->level,
> > xe_vma_start(vma),
> > - xe_vma_end(vma), &xe_walk.base);
> > + (void)xe_pt_walk_shared(&pt->base, pt->level, start, end,
> > + &xe_walk.base);
> >
> > return xe_walk.wupd.num_used_entries;
> > }
> > @@ -1854,13 +1871,6 @@ static int unbind_op_prepare(struct xe_tile
> > *tile,
> > "Preparing unbind, with range [%llx...%llx)\n",
> > xe_vma_start(vma), xe_vma_end(vma) - 1);
> >
> > - /*
> > - * Wait for invalidation to complete. Can corrupt internal
> > page table
> > - * state if an invalidation is running while preparing an
> > unbind.
> > - */
> > - if (xe_vma_is_userptr(vma) &&
> > xe_vm_in_fault_mode(xe_vma_vm(vma)))
> > - mmu_interval_read_begin(&to_userptr_vma(vma)-
> > >userptr.notifier);
> > -
> > pt_op->vma = vma;
> > pt_op->bind = false;
> > pt_op->rebind = false;
> > @@ -1869,7 +1879,8 @@ static int unbind_op_prepare(struct xe_tile
> > *tile,
> > if (err)
> > return err;
> >
> > - pt_op->num_entries = xe_pt_stage_unbind(tile, vma, pt_op-
> > >entries);
> > + pt_op->num_entries = xe_pt_stage_unbind(tile,
> > xe_vma_vm(vma),
> > + vma, NULL, pt_op-
> > >entries);
> >
> > xe_vm_dbg_print_entries(tile_to_xe(tile), pt_op->entries,
> > pt_op->num_entries, false);
> > @@ -1884,6 +1895,42 @@ static int unbind_op_prepare(struct xe_tile
> > *tile,
> > return 0;
> > }
> >
> > +static int unbind_range_prepare(struct xe_vm *vm,
> > + struct xe_tile *tile,
> > + struct xe_vm_pgtable_update_ops
> > *pt_update_ops,
> > + struct xe_svm_range *range)
> > +{
> > + u32 current_op = pt_update_ops->current_op;
> > + struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops-
> > >ops[current_op];
> > +
> > + if (!(range->tile_present & BIT(tile->id)))
> > + return 0;
> > +
> > + vm_dbg(&vm->xe->drm,
> > + "Preparing unbind, with range [%llx...%llx)\n",
> > + range->base.va.start, range->base.va.end - 1);
> > +
> > + pt_op->vma = INVALID_VMA;
> > + pt_op->bind = false;
> > + pt_op->rebind = false;
> > +
> > + pt_op->num_entries = xe_pt_stage_unbind(tile, vm, NULL,
> > range,
> > + pt_op->entries);
> > +
> > + xe_vm_dbg_print_entries(tile_to_xe(tile), pt_op->entries,
> > + pt_op->num_entries, false);
> > + xe_pt_update_ops_rfence_interval(pt_update_ops, range-
> > >base.va.start,
> > + range->base.va.end);
> > + ++pt_update_ops->current_op;
> > + pt_update_ops->needs_svm_lock = true;
> > + pt_update_ops->needs_invalidation = true;
> > +
> > + xe_pt_commit_prepare_unbind(INVALID_VMA, pt_op->entries,
> > + pt_op->num_entries);
> > +
> > + return 0;
> > +}
> > +
> > static int op_prepare(struct xe_vm *vm,
> > struct xe_tile *tile,
> > struct xe_vm_pgtable_update_ops
> > *pt_update_ops,
> > @@ -1951,6 +1998,9 @@ static int op_prepare(struct xe_vm *vm,
> > err = bind_range_prepare(vm, tile,
> > pt_update_ops,
> > op->map_range.vma,
> > op-
> > >map_range.range);
> > + } else if (op->subop == XE_VMA_SUBOP_UNMAP_RANGE) {
> > + err = unbind_range_prepare(vm, tile,
> > pt_update_ops,
> > + op-
> > >unmap_range.range);
> > }
> > break;
> > default:
> > @@ -2139,6 +2189,8 @@ static void op_commit(struct xe_vm *vm,
> > if (op->subop == XE_VMA_SUBOP_MAP_RANGE) {
> > op->map_range.range->tile_present |=
> > BIT(tile->id);
> > op->map_range.range->tile_invalidated &=
> > ~BIT(tile->id);
> > + } else if (op->subop == XE_VMA_SUBOP_UNMAP_RANGE) {
> > + op->unmap_range.range->tile_present &=
> > ~BIT(tile->id);
> > }
> > break;
> > }
>
> I think this further stresses the need to provide a pt code interface
> that is oblivious of vmas, userptr and ranges so that we can get rid of
> all special-casing, but for now code looks good as is IMO.
>
I agree it would eventually be nice to drop vmas / userptr / ranges from
the bind code. I believe I even opened a Jira for this task. The main
issue I see is that tile_present and tile_invalidated need to be moved to
the PT state, which is a fairly large refactor. We also short-circuit in a
bunch of places based on this state, which would turn into a page table
walk and needs to be considered as well.
As long as you're ok with this for now, I agree - let's scope / do this in
a follow up.
>
> > diff --git a/drivers/gpu/drm/xe/xe_svm.c
> > b/drivers/gpu/drm/xe/xe_svm.c
> > index 9c2f44cba166..0762126f65e0 100644
> > --- a/drivers/gpu/drm/xe/xe_svm.c
> > +++ b/drivers/gpu/drm/xe/xe_svm.c
> > @@ -208,7 +208,14 @@ static void xe_svm_invalidate(struct drm_gpusvm
> > *gpusvm,
> > static int __xe_svm_garbage_collector(struct xe_vm *vm,
> > struct xe_svm_range *range)
> > {
> > - /* TODO: Do unbind */
> > + struct dma_fence *fence;
> > +
> > + xe_vm_lock(vm, false);
> > + fence = xe_vm_range_unbind(vm, range);
> > + xe_vm_unlock(vm);
> > + if (IS_ERR(fence))
> > + return PTR_ERR(fence);
> > + dma_fence_put(fence);
> >
> > drm_gpusvm_range_remove(&vm->svm.gpusvm, &range->base);
> >
> > diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> > index 399cbbdbddd5..76a20e96084e 100644
> > --- a/drivers/gpu/drm/xe/xe_vm.c
> > +++ b/drivers/gpu/drm/xe/xe_vm.c
> > @@ -972,6 +972,79 @@ struct dma_fence *xe_vm_range_rebind(struct
> > xe_vm *vm,
> > return fence;
> > }
> >
> > +static void xe_vm_populate_range_unbind(struct xe_vma_op *op,
> > + struct xe_svm_range *range)
> > +{
> > + INIT_LIST_HEAD(&op->link);
> > + op->tile_mask = range->tile_present;
> > + op->base.op = DRM_GPUVA_OP_USER;
> > + op->subop = XE_VMA_SUBOP_UNMAP_RANGE;
> > + op->unmap_range.range = range;
> > +}
> > +
> > +static int
> > +xe_vm_ops_add_range_unbind(struct xe_vma_ops *vops,
> > + struct xe_svm_range *range)
> > +{
> > + struct xe_vma_op *op;
> > +
> > + op = kzalloc(sizeof(*op), GFP_KERNEL);
> > + if (!op)
> > + return -ENOMEM;
> > +
> > + xe_vm_populate_range_unbind(op, range);
> > + list_add_tail(&op->link, &vops->list);
> > + xe_vma_ops_incr_pt_update_ops(vops, range->tile_present);
> > +
> > + return 0;
> > +}
> > +
> > +struct dma_fence *xe_vm_range_unbind(struct xe_vm *vm,
> > + struct xe_svm_range *range)
>
> Kerneldoc.
>
Yep.
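Something like this as a starting point - wording is illustrative, based
on the asserts and return value in the function:

/**
 * xe_vm_range_unbind() - Unbind an SVM range from a VM
 * @vm: The VM.
 * @range: The SVM range to unbind.
 *
 * Unbind the page-table mappings for @range on every tile where the range
 * is present. The caller must hold vm->lock and the VM dma-resv, and the
 * VM must be in fault mode.
 *
 * Return: dma-fence for the unbind job on success, ERR_PTR on failure.
 */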
Matt
>
> > +{
> > + struct dma_fence *fence = NULL;
> > + struct xe_vma_ops vops;
> > + struct xe_vma_op *op, *next_op;
> > + struct xe_tile *tile;
> > + u8 id;
> > + int err;
> > +
> > + lockdep_assert_held(&vm->lock);
> > + xe_vm_assert_held(vm);
> > + xe_assert(vm->xe, xe_vm_in_fault_mode(vm));
> > +
> > + if (!range->tile_present)
> > + return dma_fence_get_stub();
> > +
> > + xe_vma_ops_init(&vops, vm, NULL, NULL, 0);
> > + for_each_tile(tile, vm->xe, id) {
> > + vops.pt_update_ops[id].wait_vm_bookkeep = true;
> > + vops.pt_update_ops[tile->id].q =
> > + xe_tile_migrate_exec_queue(tile);
> > + }
> > +
> > + err = xe_vm_ops_add_range_unbind(&vops, range);
> > + if (err)
> > + return ERR_PTR(err);
> > +
> > + err = xe_vma_ops_alloc(&vops, false);
> > + if (err) {
> > + fence = ERR_PTR(err);
> > + goto free_ops;
> > + }
> > +
> > + fence = ops_execute(vm, &vops);
> > +
> > +free_ops:
> > + list_for_each_entry_safe(op, next_op, &vops.list, link) {
> > + list_del(&op->link);
> > + kfree(op);
> > + }
> > + xe_vma_ops_fini(&vops);
> > +
> > + return fence;
> > +}
> > +
> > static void xe_vma_free(struct xe_vma *vma)
> > {
> > if (xe_vma_is_userptr(vma))
> > diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
> > index 8bd921b33090..d577ca9e3d65 100644
> > --- a/drivers/gpu/drm/xe/xe_vm.h
> > +++ b/drivers/gpu/drm/xe/xe_vm.h
> > @@ -222,6 +222,8 @@ struct dma_fence *xe_vm_range_rebind(struct xe_vm
> > *vm,
> > struct xe_vma *vma,
> > struct xe_svm_range *range,
> > u8 tile_mask);
> > +struct dma_fence *xe_vm_range_unbind(struct xe_vm *vm,
> > + struct xe_svm_range *range);
> >
> > int xe_vm_invalidate_vma(struct xe_vma *vma);
> >
> > diff --git a/drivers/gpu/drm/xe/xe_vm_types.h
> > b/drivers/gpu/drm/xe/xe_vm_types.h
> > index 2eae3575c409..d38cf7558f62 100644
> > --- a/drivers/gpu/drm/xe/xe_vm_types.h
> > +++ b/drivers/gpu/drm/xe/xe_vm_types.h
> > @@ -348,6 +348,12 @@ struct xe_vma_op_map_range {
> > struct xe_svm_range *range;
> > };
> >
> > +/** struct xe_vma_op_unmap_range - VMA unmap range operation */
> > +struct xe_vma_op_unmap_range {
> > + /** @range: SVM range to unmap */
> > + struct xe_svm_range *range;
> > +};
> > +
> > /** enum xe_vma_op_flags - flags for VMA operation */
> > enum xe_vma_op_flags {
> > /** @XE_VMA_OP_COMMITTED: VMA operation committed */
> > @@ -362,6 +368,8 @@ enum xe_vma_op_flags {
> > enum xe_vma_subop {
> > /** @XE_VMA_SUBOP_MAP_RANGE: Map range */
> > XE_VMA_SUBOP_MAP_RANGE,
> > + /** @XE_VMA_SUBOP_UNMAP_RANGE: Unmap range */
> > + XE_VMA_SUBOP_UNMAP_RANGE,
> > };
> >
> > /** struct xe_vma_op - VMA operation */
> > @@ -384,8 +392,10 @@ struct xe_vma_op {
> > struct xe_vma_op_remap remap;
> > /** @prefetch: VMA prefetch operation specific data
> > */
> > struct xe_vma_op_prefetch prefetch;
> > - /** @map: VMA map range operation specific data */
> > + /** @map_range: VMA map range operation specific
> > data */
> > struct xe_vma_op_map_range map_range;
> > + /** @unmap_range: VMA unmap range operation specific
> > data */
> > + struct xe_vma_op_map_range unmap_range;
> > };
> > };
> >
>
> Thanks,
> Thomas
>
^ permalink raw reply [flat|nested] 129+ messages in thread
* [PATCH v2 14/29] drm/xe: Do not allow system allocator VMA unbind if the GPU has bindings
2024-10-16 3:24 [PATCH v2 00/29] Introduce GPU SVM and Xe SVM implementation Matthew Brost
` (12 preceding siblings ...)
2024-10-16 3:25 ` [PATCH v2 13/29] drm/xe: Add unbind to " Matthew Brost
@ 2024-10-16 3:25 ` Matthew Brost
2024-11-19 16:33 ` Thomas Hellström
2024-10-16 3:25 ` [PATCH v2 15/29] drm/xe: Enable system allocator uAPI Matthew Brost
` (17 subsequent siblings)
31 siblings, 1 reply; 129+ messages in thread
From: Matthew Brost @ 2024-10-16 3:25 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
uAPI is designed with the use case that only mapping a BO to a
malloc'd address will unbind a system allocator VMA. Thus it doesn't
make tons of sense to allow a system allocator VMA unbind if the GPU has
bindings in the range being unbound. Do not support this as it
simplifies the code. Can always be revisited if a use case for this
arrises.
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
drivers/gpu/drm/xe/xe_svm.c | 5 +++++
drivers/gpu/drm/xe/xe_svm.h | 1 +
drivers/gpu/drm/xe/xe_vm.c | 16 ++++++++++++++++
3 files changed, 22 insertions(+)
diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index 0762126f65e0..1d8021b4e2f0 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -378,3 +378,8 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
err_out:
return err;
}
+
+bool xe_svm_has_mapping(struct xe_vm *vm, u64 start, u64 end)
+{
+ return drm_gpusvm_has_mapping(&vm->svm.gpusvm, start, end);
+}
diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index 06d90d0f71a6..472fbc51f30e 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -29,6 +29,7 @@ void xe_svm_close(struct xe_vm *vm);
int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
struct xe_tile *tile, u64 fault_addr,
bool atomic);
+bool xe_svm_has_mapping(struct xe_vm *vm, u64 start, u64 end);
static inline bool xe_svm_range_pages_valid(struct xe_svm_range *range)
{
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 76a20e96084e..158fbb1c3f28 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -2348,6 +2348,17 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
struct xe_vma *old =
gpuva_to_vma(op->base.remap.unmap->va);
bool skip = xe_vma_is_system_allocator(old);
+ u64 start = xe_vma_start(old), end = xe_vma_end(old);
+
+ if (op->base.remap.prev)
+ start = op->base.remap.prev->va.addr +
+ op->base.remap.prev->va.range;
+ if (op->base.remap.next)
+ end = op->base.remap.next->va.addr;
+
+ if (xe_vma_is_system_allocator(old) &&
+ xe_svm_has_mapping(vm, start, end))
+ return -EBUSY;
op->remap.start = xe_vma_start(old);
op->remap.range = xe_vma_size(old);
@@ -2430,6 +2441,11 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
{
struct xe_vma *vma = gpuva_to_vma(op->base.unmap.va);
+ if (xe_vma_is_system_allocator(vma) &&
+ xe_svm_has_mapping(vm, xe_vma_start(vma),
+ xe_vma_end(vma)))
+ return -EBUSY;
+
if (!xe_vma_is_system_allocator(vma))
xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
break;
--
2.34.1
^ permalink raw reply related [flat|nested] 129+ messages in thread* Re: [PATCH v2 14/29] drm/xe: Do not allow system allocator VMA unbind if the GPU has bindings
2024-10-16 3:25 ` [PATCH v2 14/29] drm/xe: Do not allow system allocator VMA unbind if the GPU has bindings Matthew Brost
@ 2024-11-19 16:33 ` Thomas Hellström
2024-11-19 23:37 ` Matthew Brost
0 siblings, 1 reply; 129+ messages in thread
From: Thomas Hellström @ 2024-11-19 16:33 UTC (permalink / raw)
To: Matthew Brost, intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, simona.vetter, felix.kuehling,
dakr
On Tue, 2024-10-15 at 20:25 -0700, Matthew Brost wrote:
> uAPI is designed with the the use case that only mapping a BO to a
> malloc'd address will unbind a system allocator VMA. Thus it doesn't
> make tons of sense to allow a system allocator VMA unbind if the GPU
> has
> bindings in the range being unbound. Do not support this as it
> simplifies the code. Can always be revisited if a use case for this
> arrises.
s/arrises/arises
I think a uAPI without special cases like this would be ideal - what is
the code simplification, given that we already support this implicitly?
Thanks,
/Thomas
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
> drivers/gpu/drm/xe/xe_svm.c | 5 +++++
> drivers/gpu/drm/xe/xe_svm.h | 1 +
> drivers/gpu/drm/xe/xe_vm.c | 16 ++++++++++++++++
> 3 files changed, 22 insertions(+)
>
> diff --git a/drivers/gpu/drm/xe/xe_svm.c
> b/drivers/gpu/drm/xe/xe_svm.c
> index 0762126f65e0..1d8021b4e2f0 100644
> --- a/drivers/gpu/drm/xe/xe_svm.c
> +++ b/drivers/gpu/drm/xe/xe_svm.c
> @@ -378,3 +378,8 @@ int xe_svm_handle_pagefault(struct xe_vm *vm,
> struct xe_vma *vma,
> err_out:
> return err;
> }
> +
> +bool xe_svm_has_mapping(struct xe_vm *vm, u64 start, u64 end)
> +{
> + return drm_gpusvm_has_mapping(&vm->svm.gpusvm, start, end);
> +}
> diff --git a/drivers/gpu/drm/xe/xe_svm.h
> b/drivers/gpu/drm/xe/xe_svm.h
> index 06d90d0f71a6..472fbc51f30e 100644
> --- a/drivers/gpu/drm/xe/xe_svm.h
> +++ b/drivers/gpu/drm/xe/xe_svm.h
> @@ -29,6 +29,7 @@ void xe_svm_close(struct xe_vm *vm);
> int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
> struct xe_tile *tile, u64 fault_addr,
> bool atomic);
> +bool xe_svm_has_mapping(struct xe_vm *vm, u64 start, u64 end);
>
> static inline bool xe_svm_range_pages_valid(struct xe_svm_range
> *range)
> {
> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> index 76a20e96084e..158fbb1c3f28 100644
> --- a/drivers/gpu/drm/xe/xe_vm.c
> +++ b/drivers/gpu/drm/xe/xe_vm.c
> @@ -2348,6 +2348,17 @@ static int vm_bind_ioctl_ops_parse(struct
> xe_vm *vm, struct drm_gpuva_ops *ops,
> struct xe_vma *old =
> gpuva_to_vma(op->base.remap.unmap-
> >va);
> bool skip = xe_vma_is_system_allocator(old);
> + u64 start = xe_vma_start(old), end =
> xe_vma_end(old);
> +
> + if (op->base.remap.prev)
> + start = op->base.remap.prev->va.addr
> +
> + op->base.remap.prev-
> >va.range;
> + if (op->base.remap.next)
> + end = op->base.remap.next->va.addr;
> +
> + if (xe_vma_is_system_allocator(old) &&
> + xe_svm_has_mapping(vm, start, end))
> + return -EBUSY;
>
> op->remap.start = xe_vma_start(old);
> op->remap.range = xe_vma_size(old);
> @@ -2430,6 +2441,11 @@ static int vm_bind_ioctl_ops_parse(struct
> xe_vm *vm, struct drm_gpuva_ops *ops,
> {
> struct xe_vma *vma = gpuva_to_vma(op-
> >base.unmap.va);
>
> + if (xe_vma_is_system_allocator(vma) &&
> + xe_svm_has_mapping(vm,
> xe_vma_start(vma),
> + xe_vma_end(vma)))
> + return -EBUSY;
> +
> if (!xe_vma_is_system_allocator(vma))
> xe_vma_ops_incr_pt_update_ops(vops,
> op->tile_mask);
> break;
^ permalink raw reply [flat|nested] 129+ messages in thread* Re: [PATCH v2 14/29] drm/xe: Do not allow system allocator VMA unbind if the GPU has bindings
2024-11-19 16:33 ` Thomas Hellström
@ 2024-11-19 23:37 ` Matthew Brost
0 siblings, 0 replies; 129+ messages in thread
From: Matthew Brost @ 2024-11-19 23:37 UTC (permalink / raw)
To: Thomas Hellström
Cc: intel-xe, dri-devel, apopple, airlied, christian.koenig,
simona.vetter, felix.kuehling, dakr
On Tue, Nov 19, 2024 at 05:33:11PM +0100, Thomas Hellström wrote:
> On Tue, 2024-10-15 at 20:25 -0700, Matthew Brost wrote:
> > uAPI is designed with the the use case that only mapping a BO to a
> > malloc'd address will unbind a system allocator VMA. Thus it doesn't
> > make tons of sense to allow a system allocator VMA unbind if the GPU
> > has
> > bindings in the range being unbound. Do not support this as it
> > simplifies the code. Can always be revisited if a use case for this
> > arrises.
>
> s/arrises/arises
>
> I think a uAPI without special cases like this would be ideal, what is
> the code simplification, given that we already support this implicitly?
Yes, simplicity. SVM allocations are only unbound via the garbage
collector, not in the IOCTL - supporting that would be new code.
I also cannot think of a use case where this would need to be supported.
If we are binding a BO (which causes a system allocator VMA UNMAP), the UMD
should have allocated the BO's address via malloc or mmap, so we shouldn't
have GPU mappings for the new address.
We can add support for this but without a use case, it seems rather
pointless.
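For illustration, the UMD flow the uAPI assumes looks roughly like this
(bind_bo_at() is a hypothetical wrapper around the VM bind IOCTL, not real
uAPI):

	void *addr = mmap(NULL, sz, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* No GPU access to addr yet, so no SVM ranges exist in this range. */
	int err = bind_bo_at(fd, vm_id, bo_handle, (uint64_t)addr); /* ok */

	/*
	 * Binding over an address range the GPU still has SVM mappings for
	 * would instead fail with -EBUSY rather than implicitly unbinding.
	 */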
Matt
>
> Thanks,
> /Thomas
>
>
> >
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> > drivers/gpu/drm/xe/xe_svm.c | 5 +++++
> > drivers/gpu/drm/xe/xe_svm.h | 1 +
> > drivers/gpu/drm/xe/xe_vm.c | 16 ++++++++++++++++
> > 3 files changed, 22 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_svm.c
> > b/drivers/gpu/drm/xe/xe_svm.c
> > index 0762126f65e0..1d8021b4e2f0 100644
> > --- a/drivers/gpu/drm/xe/xe_svm.c
> > +++ b/drivers/gpu/drm/xe/xe_svm.c
> > @@ -378,3 +378,8 @@ int xe_svm_handle_pagefault(struct xe_vm *vm,
> > struct xe_vma *vma,
> > err_out:
> > return err;
> > }
> > +
> > +bool xe_svm_has_mapping(struct xe_vm *vm, u64 start, u64 end)
> > +{
> > + return drm_gpusvm_has_mapping(&vm->svm.gpusvm, start, end);
> > +}
> > diff --git a/drivers/gpu/drm/xe/xe_svm.h
> > b/drivers/gpu/drm/xe/xe_svm.h
> > index 06d90d0f71a6..472fbc51f30e 100644
> > --- a/drivers/gpu/drm/xe/xe_svm.h
> > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > @@ -29,6 +29,7 @@ void xe_svm_close(struct xe_vm *vm);
> > int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
> > struct xe_tile *tile, u64 fault_addr,
> > bool atomic);
> > +bool xe_svm_has_mapping(struct xe_vm *vm, u64 start, u64 end);
> >
> > static inline bool xe_svm_range_pages_valid(struct xe_svm_range
> > *range)
> > {
> > diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> > index 76a20e96084e..158fbb1c3f28 100644
> > --- a/drivers/gpu/drm/xe/xe_vm.c
> > +++ b/drivers/gpu/drm/xe/xe_vm.c
> > @@ -2348,6 +2348,17 @@ static int vm_bind_ioctl_ops_parse(struct
> > xe_vm *vm, struct drm_gpuva_ops *ops,
> > struct xe_vma *old =
> > gpuva_to_vma(op->base.remap.unmap-
> > >va);
> > bool skip = xe_vma_is_system_allocator(old);
> > + u64 start = xe_vma_start(old), end =
> > xe_vma_end(old);
> > +
> > + if (op->base.remap.prev)
> > + start = op->base.remap.prev->va.addr
> > +
> > + op->base.remap.prev-
> > >va.range;
> > + if (op->base.remap.next)
> > + end = op->base.remap.next->va.addr;
> > +
> > + if (xe_vma_is_system_allocator(old) &&
> > + xe_svm_has_mapping(vm, start, end))
> > + return -EBUSY;
> >
> > op->remap.start = xe_vma_start(old);
> > op->remap.range = xe_vma_size(old);
> > @@ -2430,6 +2441,11 @@ static int vm_bind_ioctl_ops_parse(struct
> > xe_vm *vm, struct drm_gpuva_ops *ops,
> > {
> > struct xe_vma *vma = gpuva_to_vma(op-
> > >base.unmap.va);
> >
> > + if (xe_vma_is_system_allocator(vma) &&
> > + xe_svm_has_mapping(vm,
> > xe_vma_start(vma),
> > + xe_vma_end(vma)))
> > + return -EBUSY;
> > +
> > if (!xe_vma_is_system_allocator(vma))
> > xe_vma_ops_incr_pt_update_ops(vops,
> > op->tile_mask);
> > break;
>
^ permalink raw reply [flat|nested] 129+ messages in thread
* [PATCH v2 15/29] drm/xe: Enable system allocator uAPI
2024-10-16 3:24 [PATCH v2 00/29] Introduce GPU SVM and Xe SVM implementation Matthew Brost
` (13 preceding siblings ...)
2024-10-16 3:25 ` [PATCH v2 14/29] drm/xe: Do not allow system allocator VMA unbind if the GPU has bindings Matthew Brost
@ 2024-10-16 3:25 ` Matthew Brost
2024-11-19 16:34 ` Thomas Hellström
2024-10-16 3:25 ` [PATCH v2 16/29] drm/xe: Add migrate layer functions for SVM support Matthew Brost
` (16 subsequent siblings)
31 siblings, 1 reply; 129+ messages in thread
From: Matthew Brost @ 2024-10-16 3:25 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
With support for system allocator bindings in SRAM fully in place, enable
the implementation.
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
drivers/gpu/drm/xe/xe_vm.c | 6 ------
1 file changed, 6 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 158fbb1c3f28..8eed820079ba 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -2962,12 +2962,6 @@ static int vm_bind_ioctl_check_args(struct xe_device *xe,
u16 pat_index = (*bind_ops)[i].pat_index;
u16 coh_mode;
- /* FIXME: Disabling system allocator for now */
- if (XE_IOCTL_DBG(xe, is_system_allocator)) {
- err = -EOPNOTSUPP;
- goto free_bind_ops;
- }
-
if (XE_IOCTL_DBG(xe, pat_index >= xe->pat.n_entries)) {
err = -EINVAL;
goto free_bind_ops;
--
2.34.1
^ permalink raw reply related [flat|nested] 129+ messages in thread* Re: [PATCH v2 15/29] drm/xe: Enable system allocator uAPI
2024-10-16 3:25 ` [PATCH v2 15/29] drm/xe: Enable system allocator uAPI Matthew Brost
@ 2024-11-19 16:34 ` Thomas Hellström
0 siblings, 0 replies; 129+ messages in thread
From: Thomas Hellström @ 2024-11-19 16:34 UTC (permalink / raw)
To: Matthew Brost, intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, simona.vetter, felix.kuehling,
dakr
On Tue, 2024-10-15 at 20:25 -0700, Matthew Brost wrote:
> Support for system allocator bindings in SRAM fully in place, enable
> the
> implementation.
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> ---
> drivers/gpu/drm/xe/xe_vm.c | 6 ------
> 1 file changed, 6 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> index 158fbb1c3f28..8eed820079ba 100644
> --- a/drivers/gpu/drm/xe/xe_vm.c
> +++ b/drivers/gpu/drm/xe/xe_vm.c
> @@ -2962,12 +2962,6 @@ static int vm_bind_ioctl_check_args(struct
> xe_device *xe,
> u16 pat_index = (*bind_ops)[i].pat_index;
> u16 coh_mode;
>
> - /* FIXME: Disabling system allocator for now */
> - if (XE_IOCTL_DBG(xe, is_system_allocator)) {
> - err = -EOPNOTSUPP;
> - goto free_bind_ops;
> - }
> -
> if (XE_IOCTL_DBG(xe, pat_index >= xe-
> >pat.n_entries)) {
> err = -EINVAL;
> goto free_bind_ops;
^ permalink raw reply [flat|nested] 129+ messages in thread
* [PATCH v2 16/29] drm/xe: Add migrate layer functions for SVM support
2024-10-16 3:24 [PATCH v2 00/29] Introduce GPU SVM and Xe SVM implementation Matthew Brost
` (14 preceding siblings ...)
2024-10-16 3:25 ` [PATCH v2 15/29] drm/xe: Enable system allocator uAPI Matthew Brost
@ 2024-10-16 3:25 ` Matthew Brost
2024-11-19 16:45 ` Thomas Hellström
2024-10-16 3:25 ` [PATCH v2 17/29] drm/xe: Add SVM device memory mirroring Matthew Brost
` (15 subsequent siblings)
31 siblings, 1 reply; 129+ messages in thread
From: Matthew Brost @ 2024-10-16 3:25 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
Add functions which migrate to / from VRAM, accepting a single DPA
argument (VRAM) and an array of DMA addresses (SRAM).
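As a rough illustration of the intended usage (the error and fence
handling below is illustrative only; tile->migrate is assumed to be the
tile's migrate context):

	struct dma_fence *fence;

	/* Copy npages of DMA-mapped system memory into VRAM at dpa. */
	fence = xe_migrate_to_vram(tile->migrate, npages, dma_addrs, dpa);
	if (IS_ERR(fence))
		return PTR_ERR(fence);

	dma_fence_wait(fence, false);	/* wait for the copy job to complete */
	dma_fence_put(fence);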
v2:
- Don't unlock job_mutex in error path of xe_migrate_vram
Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
drivers/gpu/drm/xe/xe_migrate.c | 149 ++++++++++++++++++++++++++++++++
drivers/gpu/drm/xe/xe_migrate.h | 10 +++
2 files changed, 159 insertions(+)
diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
index cfd31ae49cc1..d7b6636286ae 100644
--- a/drivers/gpu/drm/xe/xe_migrate.c
+++ b/drivers/gpu/drm/xe/xe_migrate.c
@@ -1542,6 +1542,155 @@ void xe_migrate_wait(struct xe_migrate *m)
dma_fence_wait(m->fence, false);
}
+static u32 pte_update_cmd_size(u64 size)
+{
+ u32 dword;
+ u64 entries = DIV_ROUND_UP(size, XE_PAGE_SIZE);
+
+ XE_WARN_ON(size > MAX_PREEMPTDISABLE_TRANSFER);
+ /*
+ * MI_STORE_DATA_IMM command is used to update page table. Each
+ * instruction can update maximumly 0x1ff pte entries. To update
+ * n (n <= 0x1ff) pte entries, we need:
+ * 1 dword for the MI_STORE_DATA_IMM command header (opcode etc)
+ * 2 dword for the page table's physical location
+ * 2*n dword for value of pte to fill (each pte entry is 2 dwords)
+ */
+ dword = (1 + 2) * DIV_ROUND_UP(entries, 0x1ff);
+ dword += entries * 2;
+
+ return dword;
+}
+
+static void build_pt_update_batch_sram(struct xe_migrate *m,
+ struct xe_bb *bb, u32 pt_offset,
+ dma_addr_t *sram_addr, u32 size)
+{
+ u16 pat_index = tile_to_xe(m->tile)->pat.idx[XE_CACHE_WB];
+ u32 ptes;
+ int i = 0;
+
+ ptes = DIV_ROUND_UP(size, XE_PAGE_SIZE);
+ while (ptes) {
+ u32 chunk = min(0x1ffU, ptes);
+
+ bb->cs[bb->len++] = MI_STORE_DATA_IMM | MI_SDI_NUM_QW(chunk);
+ bb->cs[bb->len++] = pt_offset;
+ bb->cs[bb->len++] = 0;
+
+ pt_offset += chunk * 8;
+ ptes -= chunk;
+
+ while (chunk--) {
+ u64 addr = sram_addr[i++] & PAGE_MASK;
+
+ xe_tile_assert(m->tile, addr);
+ addr = m->q->vm->pt_ops->pte_encode_addr(m->tile->xe,
+ addr, pat_index,
+ 0, false, 0);
+ bb->cs[bb->len++] = lower_32_bits(addr);
+ bb->cs[bb->len++] = upper_32_bits(addr);
+ }
+ }
+}
+
+enum xe_migrate_copy_dir {
+ XE_MIGRATE_COPY_TO_VRAM,
+ XE_MIGRATE_COPY_TO_SRAM,
+};
+
+static struct dma_fence *xe_migrate_vram(struct xe_migrate *m,
+ unsigned long npages,
+ dma_addr_t *sram_addr, u64 vram_addr,
+ const enum xe_migrate_copy_dir dir)
+{
+ struct xe_gt *gt = m->tile->primary_gt;
+ struct xe_device *xe = gt_to_xe(gt);
+ struct dma_fence *fence = NULL;
+ u32 batch_size = 2;
+ u64 src_L0_ofs, dst_L0_ofs;
+ u64 round_update_size;
+ struct xe_sched_job *job;
+ struct xe_bb *bb;
+ u32 update_idx, pt_slot = 0;
+ int err;
+
+ round_update_size = min_t(u64, npages * PAGE_SIZE,
+ MAX_PREEMPTDISABLE_TRANSFER);
+ batch_size += pte_update_cmd_size(round_update_size);
+ batch_size += EMIT_COPY_DW;
+
+ bb = xe_bb_new(gt, batch_size, true);
+ if (IS_ERR(bb)) {
+ err = PTR_ERR(bb);
+ return ERR_PTR(err);
+ }
+
+ build_pt_update_batch_sram(m, bb, pt_slot * XE_PAGE_SIZE,
+ sram_addr, round_update_size);
+
+ if (dir == XE_MIGRATE_COPY_TO_VRAM) {
+ src_L0_ofs = xe_migrate_vm_addr(pt_slot, 0);
+ dst_L0_ofs = xe_migrate_vram_ofs(xe, vram_addr, false);
+
+ } else {
+ src_L0_ofs = xe_migrate_vram_ofs(xe, vram_addr, false);
+ dst_L0_ofs = xe_migrate_vm_addr(pt_slot, 0);
+ }
+
+ bb->cs[bb->len++] = MI_BATCH_BUFFER_END;
+ update_idx = bb->len;
+
+ emit_copy(gt, bb, src_L0_ofs, dst_L0_ofs, round_update_size,
+ XE_PAGE_SIZE);
+
+ job = xe_bb_create_migration_job(m->q, bb,
+ xe_migrate_batch_base(m, true),
+ update_idx);
+ if (IS_ERR(job)) {
+ err = PTR_ERR(job);
+ goto err;
+ }
+
+ xe_sched_job_add_migrate_flush(job, 0);
+
+ mutex_lock(&m->job_mutex);
+ xe_sched_job_arm(job);
+ fence = dma_fence_get(&job->drm.s_fence->finished);
+ xe_sched_job_push(job);
+
+ dma_fence_put(m->fence);
+ m->fence = dma_fence_get(fence);
+ mutex_unlock(&m->job_mutex);
+
+ xe_bb_free(bb, fence);
+
+ return fence;
+
+err:
+ xe_bb_free(bb, NULL);
+
+ return ERR_PTR(err);
+}
+
+struct dma_fence *xe_migrate_to_vram(struct xe_migrate *m,
+ unsigned long npages,
+ dma_addr_t *src_addr,
+ u64 dst_addr)
+{
+ return xe_migrate_vram(m, npages, src_addr, dst_addr,
+ XE_MIGRATE_COPY_TO_VRAM);
+}
+
+struct dma_fence *xe_migrate_from_vram(struct xe_migrate *m,
+ unsigned long npages,
+ u64 src_addr,
+ dma_addr_t *dst_addr)
+{
+ return xe_migrate_vram(m, npages, dst_addr, src_addr,
+ XE_MIGRATE_COPY_TO_SRAM);
+}
+
#if IS_ENABLED(CONFIG_DRM_XE_KUNIT_TEST)
#include "tests/xe_migrate.c"
#endif
diff --git a/drivers/gpu/drm/xe/xe_migrate.h b/drivers/gpu/drm/xe/xe_migrate.h
index 0109866e398a..6ff9a963425c 100644
--- a/drivers/gpu/drm/xe/xe_migrate.h
+++ b/drivers/gpu/drm/xe/xe_migrate.h
@@ -95,6 +95,16 @@ struct xe_migrate_pt_update {
struct xe_migrate *xe_migrate_init(struct xe_tile *tile);
+struct dma_fence *xe_migrate_to_vram(struct xe_migrate *m,
+ unsigned long npages,
+ dma_addr_t *src_addr,
+ u64 dst_addr);
+
+struct dma_fence *xe_migrate_from_vram(struct xe_migrate *m,
+ unsigned long npages,
+ u64 src_addr,
+ dma_addr_t *dst_addr);
+
struct dma_fence *xe_migrate_copy(struct xe_migrate *m,
struct xe_bo *src_bo,
struct xe_bo *dst_bo,
--
2.34.1
^ permalink raw reply related [flat|nested] 129+ messages in thread
* Re: [PATCH v2 16/29] drm/xe: Add migrate layer functions for SVM support
2024-10-16 3:25 ` [PATCH v2 16/29] drm/xe: Add migrate layer functions for SVM support Matthew Brost
@ 2024-11-19 16:45 ` Thomas Hellström
2024-11-19 23:08 ` Matthew Brost
0 siblings, 1 reply; 129+ messages in thread
From: Thomas Hellström @ 2024-11-19 16:45 UTC (permalink / raw)
To: Matthew Brost, intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, simona.vetter, felix.kuehling,
dakr
On Tue, 2024-10-15 at 20:25 -0700, Matthew Brost wrote:
> Add functions which migrate to / from VRAM accepting a single DPA
> argument (VRAM) and array of dma addresses (SRAM).
>
> v2:
> - Don't unlock job_mutex in error path of xe_migrate_vram
>
> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
> drivers/gpu/drm/xe/xe_migrate.c | 149
> ++++++++++++++++++++++++++++++++
> drivers/gpu/drm/xe/xe_migrate.h | 10 +++
> 2 files changed, 159 insertions(+)
>
> diff --git a/drivers/gpu/drm/xe/xe_migrate.c
> b/drivers/gpu/drm/xe/xe_migrate.c
> index cfd31ae49cc1..d7b6636286ae 100644
> --- a/drivers/gpu/drm/xe/xe_migrate.c
> +++ b/drivers/gpu/drm/xe/xe_migrate.c
> @@ -1542,6 +1542,155 @@ void xe_migrate_wait(struct xe_migrate *m)
> dma_fence_wait(m->fence, false);
> }
>
> +static u32 pte_update_cmd_size(u64 size)
> +{
> + u32 dword;
dwords or num_dword?
> + u64 entries = DIV_ROUND_UP(size, XE_PAGE_SIZE);
> +
> + XE_WARN_ON(size > MAX_PREEMPTDISABLE_TRANSFER);
> + /*
> + * MI_STORE_DATA_IMM command is used to update page table.
> Each
> + * instruction can update maximumly 0x1ff pte entries. To
> update
> + * n (n <= 0x1ff) pte entries, we need:
> + * 1 dword for the MI_STORE_DATA_IMM command header (opcode
> etc)
> + * 2 dword for the page table's physical location
> + * 2*n dword for value of pte to fill (each pte entry is 2
> dwords)
> + */
> + dword = (1 + 2) * DIV_ROUND_UP(entries, 0x1ff);
> + dword += entries * 2;
> +
> + return dword;
> +}
> +
> +static void build_pt_update_batch_sram(struct xe_migrate *m,
> + struct xe_bb *bb, u32
> pt_offset,
> + dma_addr_t *sram_addr, u32
> size)
> +{
> + u16 pat_index = tile_to_xe(m->tile)->pat.idx[XE_CACHE_WB];
> + u32 ptes;
> + int i = 0;
> +
> + ptes = DIV_ROUND_UP(size, XE_PAGE_SIZE);
> + while (ptes) {
> + u32 chunk = min(0x1ffU, ptes);
> +
> + bb->cs[bb->len++] = MI_STORE_DATA_IMM |
> MI_SDI_NUM_QW(chunk);
> + bb->cs[bb->len++] = pt_offset;
> + bb->cs[bb->len++] = 0;
> +
> + pt_offset += chunk * 8;
> + ptes -= chunk;
> +
> + while (chunk--) {
> + u64 addr = sram_addr[i++] & PAGE_MASK;
> +
> + xe_tile_assert(m->tile, addr);
> + addr = m->q->vm->pt_ops->pte_encode_addr(m-
> >tile->xe,
> +
> addr, pat_index,
> + 0,
> false, 0);
> + bb->cs[bb->len++] = lower_32_bits(addr);
> + bb->cs[bb->len++] = upper_32_bits(addr);
> + }
> + }
> +}
> +
> +enum xe_migrate_copy_dir {
> + XE_MIGRATE_COPY_TO_VRAM,
> + XE_MIGRATE_COPY_TO_SRAM,
> +};
> +
> +static struct dma_fence *xe_migrate_vram(struct xe_migrate *m,
> + unsigned long npages,
> + dma_addr_t *sram_addr, u64
> vram_addr,
> + const enum
> xe_migrate_copy_dir dir)
> +{
> + struct xe_gt *gt = m->tile->primary_gt;
> + struct xe_device *xe = gt_to_xe(gt);
> + struct dma_fence *fence = NULL;
> + u32 batch_size = 2;
> + u64 src_L0_ofs, dst_L0_ofs;
> + u64 round_update_size;
> + struct xe_sched_job *job;
> + struct xe_bb *bb;
> + u32 update_idx, pt_slot = 0;
> + int err;
> +
> + round_update_size = min_t(u64, npages * PAGE_SIZE,
> + MAX_PREEMPTDISABLE_TRANSFER);
Hm. How does the caller know how many pages were actually migrated?
> + batch_size += pte_update_cmd_size(round_update_size);
> + batch_size += EMIT_COPY_DW;
> +
> + bb = xe_bb_new(gt, batch_size, true);
> + if (IS_ERR(bb)) {
> + err = PTR_ERR(bb);
> + return ERR_PTR(err);
> + }
> +
> + build_pt_update_batch_sram(m, bb, pt_slot * XE_PAGE_SIZE,
> + sram_addr, round_update_size);
> +
> + if (dir == XE_MIGRATE_COPY_TO_VRAM) {
> + src_L0_ofs = xe_migrate_vm_addr(pt_slot, 0);
> + dst_L0_ofs = xe_migrate_vram_ofs(xe, vram_addr,
> false);
> +
> + } else {
> + src_L0_ofs = xe_migrate_vram_ofs(xe, vram_addr,
> false);
> + dst_L0_ofs = xe_migrate_vm_addr(pt_slot, 0);
> + }
> +
> + bb->cs[bb->len++] = MI_BATCH_BUFFER_END;
> + update_idx = bb->len;
> +
> + emit_copy(gt, bb, src_L0_ofs, dst_L0_ofs, round_update_size,
> + XE_PAGE_SIZE);
> +
> + job = xe_bb_create_migration_job(m->q, bb,
> + xe_migrate_batch_base(m,
> true),
> + update_idx);
> + if (IS_ERR(job)) {
> + err = PTR_ERR(job);
> + goto err;
> + }
> +
> + xe_sched_job_add_migrate_flush(job, 0);
> +
> + mutex_lock(&m->job_mutex);
> + xe_sched_job_arm(job);
> + fence = dma_fence_get(&job->drm.s_fence->finished);
> + xe_sched_job_push(job);
> +
> + dma_fence_put(m->fence);
> + m->fence = dma_fence_get(fence);
> + mutex_unlock(&m->job_mutex);
> +
> + xe_bb_free(bb, fence);
> +
> + return fence;
> +
> +err:
> + xe_bb_free(bb, NULL);
> +
> + return ERR_PTR(err);
> +}
> +
> +struct dma_fence *xe_migrate_to_vram(struct xe_migrate *m,
> + unsigned long npages,
> + dma_addr_t *src_addr,
> + u64 dst_addr)
Kerneldoc.
> +{
> + return xe_migrate_vram(m, npages, src_addr, dst_addr,
> + XE_MIGRATE_COPY_TO_VRAM);
> +}
> +
> +struct dma_fence *xe_migrate_from_vram(struct xe_migrate *m,
> + unsigned long npages,
> + u64 src_addr,
> + dma_addr_t *dst_addr)
Kerneldoc.
> +{
> + return xe_migrate_vram(m, npages, dst_addr, src_addr,
> + XE_MIGRATE_COPY_TO_SRAM);
> +}
> +
> #if IS_ENABLED(CONFIG_DRM_XE_KUNIT_TEST)
> #include "tests/xe_migrate.c"
> #endif
> diff --git a/drivers/gpu/drm/xe/xe_migrate.h
> b/drivers/gpu/drm/xe/xe_migrate.h
> index 0109866e398a..6ff9a963425c 100644
> --- a/drivers/gpu/drm/xe/xe_migrate.h
> +++ b/drivers/gpu/drm/xe/xe_migrate.h
> @@ -95,6 +95,16 @@ struct xe_migrate_pt_update {
>
> struct xe_migrate *xe_migrate_init(struct xe_tile *tile);
>
> +struct dma_fence *xe_migrate_to_vram(struct xe_migrate *m,
> + unsigned long npages,
> + dma_addr_t *src_addr,
> + u64 dst_addr);
> +
> +struct dma_fence *xe_migrate_from_vram(struct xe_migrate *m,
> + unsigned long npages,
> + u64 src_addr,
> + dma_addr_t *dst_addr);
> +
> struct dma_fence *xe_migrate_copy(struct xe_migrate *m,
> struct xe_bo *src_bo,
> struct xe_bo *dst_bo,
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [PATCH v2 16/29] drm/xe: Add migrate layer functions for SVM support
2024-11-19 16:45 ` Thomas Hellström
@ 2024-11-19 23:08 ` Matthew Brost
2024-11-20 8:04 ` Thomas Hellström
0 siblings, 1 reply; 129+ messages in thread
From: Matthew Brost @ 2024-11-19 23:08 UTC (permalink / raw)
To: Thomas Hellström
Cc: intel-xe, dri-devel, apopple, airlied, christian.koenig,
simona.vetter, felix.kuehling, dakr
On Tue, Nov 19, 2024 at 05:45:27PM +0100, Thomas Hellström wrote:
> On Tue, 2024-10-15 at 20:25 -0700, Matthew Brost wrote:
> > Add functions which migrate to / from VRAM accepting a single DPA
> > argument (VRAM) and array of dma addresses (SRAM).
> >
> > v2:
> > - Don't unlock job_mutex in error path of xe_migrate_vram
> >
> > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> > drivers/gpu/drm/xe/xe_migrate.c | 149
> > ++++++++++++++++++++++++++++++++
> > drivers/gpu/drm/xe/xe_migrate.h | 10 +++
> > 2 files changed, 159 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_migrate.c
> > b/drivers/gpu/drm/xe/xe_migrate.c
> > index cfd31ae49cc1..d7b6636286ae 100644
> > --- a/drivers/gpu/drm/xe/xe_migrate.c
> > +++ b/drivers/gpu/drm/xe/xe_migrate.c
> > @@ -1542,6 +1542,155 @@ void xe_migrate_wait(struct xe_migrate *m)
> > dma_fence_wait(m->fence, false);
> > }
> >
> > +static u32 pte_update_cmd_size(u64 size)
> > +{
> > + u32 dword;
>
> dwords or num_dword?
>
num_dword
> > + u64 entries = DIV_ROUND_UP(size, XE_PAGE_SIZE);
> > +
> > + XE_WARN_ON(size > MAX_PREEMPTDISABLE_TRANSFER);
> > + /*
> > + * MI_STORE_DATA_IMM command is used to update page table.
> > Each
> > + * instruction can update maximumly 0x1ff pte entries. To
> > update
> > + * n (n <= 0x1ff) pte entries, we need:
> > + * 1 dword for the MI_STORE_DATA_IMM command header (opcode
> > etc)
> > + * 2 dword for the page table's physical location
> > + * 2*n dword for value of pte to fill (each pte entry is 2
> > dwords)
> > + */
> > + dword = (1 + 2) * DIV_ROUND_UP(entries, 0x1ff);
> > + dword += entries * 2;
> > +
> > + return dword;
> > +}
> > +
> > +static void build_pt_update_batch_sram(struct xe_migrate *m,
> > + struct xe_bb *bb, u32
> > pt_offset,
> > + dma_addr_t *sram_addr, u32
> > size)
> > +{
> > + u16 pat_index = tile_to_xe(m->tile)->pat.idx[XE_CACHE_WB];
> > + u32 ptes;
> > + int i = 0;
> > +
> > + ptes = DIV_ROUND_UP(size, XE_PAGE_SIZE);
> > + while (ptes) {
> > + u32 chunk = min(0x1ffU, ptes);
> > +
> > + bb->cs[bb->len++] = MI_STORE_DATA_IMM |
> > MI_SDI_NUM_QW(chunk);
> > + bb->cs[bb->len++] = pt_offset;
> > + bb->cs[bb->len++] = 0;
> > +
> > + pt_offset += chunk * 8;
> > + ptes -= chunk;
> > +
> > + while (chunk--) {
> > + u64 addr = sram_addr[i++] & PAGE_MASK;
> > +
> > + xe_tile_assert(m->tile, addr);
> > + addr = m->q->vm->pt_ops->pte_encode_addr(m-
> > >tile->xe,
> > +
> > addr, pat_index,
> > + 0,
> > false, 0);
> > + bb->cs[bb->len++] = lower_32_bits(addr);
> > + bb->cs[bb->len++] = upper_32_bits(addr);
> > + }
> > + }
> > +}
> > +
> > +enum xe_migrate_copy_dir {
> > + XE_MIGRATE_COPY_TO_VRAM,
> > + XE_MIGRATE_COPY_TO_SRAM,
> > +};
> > +
> > +static struct dma_fence *xe_migrate_vram(struct xe_migrate *m,
> > + unsigned long npages,
> > + dma_addr_t *sram_addr, u64
> > vram_addr,
> > + const enum
> > xe_migrate_copy_dir dir)
> > +{
> > + struct xe_gt *gt = m->tile->primary_gt;
> > + struct xe_device *xe = gt_to_xe(gt);
> > + struct dma_fence *fence = NULL;
> > + u32 batch_size = 2;
> > + u64 src_L0_ofs, dst_L0_ofs;
> > + u64 round_update_size;
> > + struct xe_sched_job *job;
> > + struct xe_bb *bb;
> > + u32 update_idx, pt_slot = 0;
> > + int err;
> > +
> > + round_update_size = min_t(u64, npages * PAGE_SIZE,
> > + MAX_PREEMPTDISABLE_TRANSFER);
>
> Hm. How does the caller know how many pages were actually migrated?
>
This is an intermediate between migrate_vma_setup and
migrate_vma_pages/finalize. The number of pages here is based on mpfn
returned from migrate_vma_setup. The migration for individual pages may
still be aborted in migrate_vma_pages/finalize. In this case both the
old and new page have the same data, so migrate_vma_pages/finalize can
pick either page.
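For reference, the overall three-step flow described above looks roughly like
this (a simplified sketch with placeholder variable names, not the actual Xe
code; destination page allocation and error handling are omitted):

	struct migrate_vma migrate = {
		.vma		= vma,
		.start		= start,
		.end		= end,
		.src		= src_pfns,
		.dst		= dst_pfns,
		.pgmap_owner	= owner,
		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
	};
	struct dma_fence *fence;
	int err;

	/* 1) Isolate the source pages; src_pfns / migrate.cpages describe
	 *    what can actually be migrated.
	 */
	err = migrate_vma_setup(&migrate);
	if (err || !migrate.cpages)
		return err;

	/* 2) Copy the data based on the pfns from setup; this series does
	 *    the copy with xe_migrate_to_vram() and waits on its fence.
	 */
	fence = xe_migrate_to_vram(tile->migrate, npages, dma_addr, vram_addr);
	dma_fence_wait(fence, false);
	dma_fence_put(fence);

	/* 3) Install the destination pages. Individual pages may still be
	 *    dropped here; both copies hold the same data by now, so either
	 *    page is valid.
	 */
	migrate_vma_pages(&migrate);
	migrate_vma_finalize(&migrate);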
> > + batch_size += pte_update_cmd_size(round_update_size);
> > + batch_size += EMIT_COPY_DW;
> > +
> > + bb = xe_bb_new(gt, batch_size, true);
> > + if (IS_ERR(bb)) {
> > + err = PTR_ERR(bb);
> > + return ERR_PTR(err);
> > + }
> > +
> > + build_pt_update_batch_sram(m, bb, pt_slot * XE_PAGE_SIZE,
> > + sram_addr, round_update_size);
> > +
> > + if (dir == XE_MIGRATE_COPY_TO_VRAM) {
> > + src_L0_ofs = xe_migrate_vm_addr(pt_slot, 0);
> > + dst_L0_ofs = xe_migrate_vram_ofs(xe, vram_addr,
> > false);
> > +
> > + } else {
> > + src_L0_ofs = xe_migrate_vram_ofs(xe, vram_addr,
> > false);
> > + dst_L0_ofs = xe_migrate_vm_addr(pt_slot, 0);
> > + }
> > +
> > + bb->cs[bb->len++] = MI_BATCH_BUFFER_END;
> > + update_idx = bb->len;
> > +
> > + emit_copy(gt, bb, src_L0_ofs, dst_L0_ofs, round_update_size,
> > + XE_PAGE_SIZE);
> > +
> > + job = xe_bb_create_migration_job(m->q, bb,
> > + xe_migrate_batch_base(m,
> > true),
> > + update_idx);
> > + if (IS_ERR(job)) {
> > + err = PTR_ERR(job);
> > + goto err;
> > + }
> > +
> > + xe_sched_job_add_migrate_flush(job, 0);
> > +
> > + mutex_lock(&m->job_mutex);
> > + xe_sched_job_arm(job);
> > + fence = dma_fence_get(&job->drm.s_fence->finished);
> > + xe_sched_job_push(job);
> > +
> > + dma_fence_put(m->fence);
> > + m->fence = dma_fence_get(fence);
> > + mutex_unlock(&m->job_mutex);
> > +
> > + xe_bb_free(bb, fence);
> > +
> > + return fence;
> > +
> > +err:
> > + xe_bb_free(bb, NULL);
> > +
> > + return ERR_PTR(err);
> > +}
> > +
> > +struct dma_fence *xe_migrate_to_vram(struct xe_migrate *m,
> > + unsigned long npages,
> > + dma_addr_t *src_addr,
> > + u64 dst_addr)
>
> Kerneldoc.
>
Yep.
> > +{
> > + return xe_migrate_vram(m, npages, src_addr, dst_addr,
> > + XE_MIGRATE_COPY_TO_VRAM);
> > +}
> > +
> > +struct dma_fence *xe_migrate_from_vram(struct xe_migrate *m,
> > + unsigned long npages,
> > + u64 src_addr,
> > + dma_addr_t *dst_addr)
>
> Kerneldoc.
>
Yep.
Matt
> > +{
> > + return xe_migrate_vram(m, npages, dst_addr, src_addr,
> > + XE_MIGRATE_COPY_TO_SRAM);
> > +}
> > +
> > #if IS_ENABLED(CONFIG_DRM_XE_KUNIT_TEST)
> > #include "tests/xe_migrate.c"
> > #endif
> > diff --git a/drivers/gpu/drm/xe/xe_migrate.h
> > b/drivers/gpu/drm/xe/xe_migrate.h
> > index 0109866e398a..6ff9a963425c 100644
> > --- a/drivers/gpu/drm/xe/xe_migrate.h
> > +++ b/drivers/gpu/drm/xe/xe_migrate.h
> > @@ -95,6 +95,16 @@ struct xe_migrate_pt_update {
> >
> > struct xe_migrate *xe_migrate_init(struct xe_tile *tile);
> >
> > +struct dma_fence *xe_migrate_to_vram(struct xe_migrate *m,
> > + unsigned long npages,
> > + dma_addr_t *src_addr,
> > + u64 dst_addr);
> > +
> > +struct dma_fence *xe_migrate_from_vram(struct xe_migrate *m,
> > + unsigned long npages,
> > + u64 src_addr,
> > + dma_addr_t *dst_addr);
> > +
> > struct dma_fence *xe_migrate_copy(struct xe_migrate *m,
> > struct xe_bo *src_bo,
> > struct xe_bo *dst_bo,
>
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [PATCH v2 16/29] drm/xe: Add migrate layer functions for SVM support
2024-11-19 23:08 ` Matthew Brost
@ 2024-11-20 8:04 ` Thomas Hellström
2024-12-11 19:11 ` Matthew Brost
0 siblings, 1 reply; 129+ messages in thread
From: Thomas Hellström @ 2024-11-20 8:04 UTC (permalink / raw)
To: Matthew Brost
Cc: intel-xe, dri-devel, apopple, airlied, christian.koenig,
simona.vetter, felix.kuehling, dakr
On Tue, 2024-11-19 at 15:08 -0800, Matthew Brost wrote:
> On Tue, Nov 19, 2024 at 05:45:27PM +0100, Thomas Hellström wrote:
> > On Tue, 2024-10-15 at 20:25 -0700, Matthew Brost wrote:
> > > Add functions which migrate to / from VRAM accepting a single DPA
> > > argument (VRAM) and array of dma addresses (SRAM).
> > >
> > > v2:
> > > - Don't unlock job_mutex in error path of xe_migrate_vram
> > >
> > > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > ---
> > > drivers/gpu/drm/xe/xe_migrate.c | 149
> > > ++++++++++++++++++++++++++++++++
> > > drivers/gpu/drm/xe/xe_migrate.h | 10 +++
> > > 2 files changed, 159 insertions(+)
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_migrate.c
> > > b/drivers/gpu/drm/xe/xe_migrate.c
> > > index cfd31ae49cc1..d7b6636286ae 100644
> > > --- a/drivers/gpu/drm/xe/xe_migrate.c
> > > +++ b/drivers/gpu/drm/xe/xe_migrate.c
> > > @@ -1542,6 +1542,155 @@ void xe_migrate_wait(struct xe_migrate
> > > *m)
> > > dma_fence_wait(m->fence, false);
> > > }
> > >
> > > +static u32 pte_update_cmd_size(u64 size)
> > > +{
> > > + u32 dword;
> >
> > dwords or num_dword?
> >
>
> num_dword
>
> > > + u64 entries = DIV_ROUND_UP(size, XE_PAGE_SIZE);
> > > +
> > > + XE_WARN_ON(size > MAX_PREEMPTDISABLE_TRANSFER);
> > > + /*
> > > + * MI_STORE_DATA_IMM command is used to update page
> > > table.
> > > Each
> > > + * instruction can update maximumly 0x1ff pte entries.
> > > To
> > > update
> > > + * n (n <= 0x1ff) pte entries, we need:
> > > + * 1 dword for the MI_STORE_DATA_IMM command header
> > > (opcode
> > > etc)
> > > + * 2 dword for the page table's physical location
> > > + * 2*n dword for value of pte to fill (each pte entry is
> > > 2
> > > dwords)
> > > + */
> > > + dword = (1 + 2) * DIV_ROUND_UP(entries, 0x1ff);
> > > + dword += entries * 2;
> > > +
> > > + return dword;
> > > +}
> > > +
> > > +static void build_pt_update_batch_sram(struct xe_migrate *m,
> > > + struct xe_bb *bb, u32
> > > pt_offset,
> > > + dma_addr_t *sram_addr,
> > > u32
> > > size)
> > > +{
> > > + u16 pat_index = tile_to_xe(m->tile)-
> > > >pat.idx[XE_CACHE_WB];
> > > + u32 ptes;
> > > + int i = 0;
> > > +
> > > + ptes = DIV_ROUND_UP(size, XE_PAGE_SIZE);
> > > + while (ptes) {
> > > + u32 chunk = min(0x1ffU, ptes);
> > > +
> > > + bb->cs[bb->len++] = MI_STORE_DATA_IMM |
> > > MI_SDI_NUM_QW(chunk);
> > > + bb->cs[bb->len++] = pt_offset;
> > > + bb->cs[bb->len++] = 0;
> > > +
> > > + pt_offset += chunk * 8;
> > > + ptes -= chunk;
> > > +
> > > + while (chunk--) {
> > > + u64 addr = sram_addr[i++] & PAGE_MASK;
> > > +
> > > + xe_tile_assert(m->tile, addr);
> > > + addr = m->q->vm->pt_ops-
> > > >pte_encode_addr(m-
> > > > tile->xe,
> > > +
> > > addr, pat_index,
> > > +
> > > 0,
> > > false, 0);
> > > + bb->cs[bb->len++] = lower_32_bits(addr);
> > > + bb->cs[bb->len++] = upper_32_bits(addr);
> > > + }
> > > + }
> > > +}
> > > +
> > > +enum xe_migrate_copy_dir {
> > > + XE_MIGRATE_COPY_TO_VRAM,
> > > + XE_MIGRATE_COPY_TO_SRAM,
> > > +};
> > > +
> > > +static struct dma_fence *xe_migrate_vram(struct xe_migrate *m,
> > > + unsigned long npages,
> > > + dma_addr_t *sram_addr,
> > > u64
> > > vram_addr,
> > > + const enum
> > > xe_migrate_copy_dir dir)
> > > +{
> > > + struct xe_gt *gt = m->tile->primary_gt;
> > > + struct xe_device *xe = gt_to_xe(gt);
> > > + struct dma_fence *fence = NULL;
> > > + u32 batch_size = 2;
> > > + u64 src_L0_ofs, dst_L0_ofs;
> > > + u64 round_update_size;
> > > + struct xe_sched_job *job;
> > > + struct xe_bb *bb;
> > > + u32 update_idx, pt_slot = 0;
> > > + int err;
> > > +
> > > + round_update_size = min_t(u64, npages * PAGE_SIZE,
> > > + MAX_PREEMPTDISABLE_TRANSFER);
> >
> > Hm. How does the caller know how many pages were actually migrated?
> >
>
> This is an intermediate between migrate_vma_setup and
> migrate_vma_pages/finalize. The number of pages here is based on mpfn
> returned from migrate_vma_setup. The migration for individual pages
> may
> still be aborted in migrate_vma_pages/finalize. In this case both the
> old and new page have the same data, dso migrate_vma_pages/finalize
> can
> pick either page.
I might be misunderstanding, but I meant: if npages corresponds to, for
example, 16MiB of data, the above min_t reduces that to 8MiB of
data. How would the caller know?
/Thomas
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [PATCH v2 16/29] drm/xe: Add migrate layer functions for SVM support
2024-11-20 8:04 ` Thomas Hellström
@ 2024-12-11 19:11 ` Matthew Brost
0 siblings, 0 replies; 129+ messages in thread
From: Matthew Brost @ 2024-12-11 19:11 UTC (permalink / raw)
To: Thomas Hellström
Cc: intel-xe, dri-devel, apopple, airlied, christian.koenig,
simona.vetter, felix.kuehling, dakr
On Wed, Nov 20, 2024 at 09:04:20AM +0100, Thomas Hellström wrote:
> On Tue, 2024-11-19 at 15:08 -0800, Matthew Brost wrote:
> > On Tue, Nov 19, 2024 at 05:45:27PM +0100, Thomas Hellström wrote:
> > > On Tue, 2024-10-15 at 20:25 -0700, Matthew Brost wrote:
> > > > Add functions which migrate to / from VRAM accepting a single DPA
> > > > argument (VRAM) and array of dma addresses (SRAM).
> > > >
> > > > v2:
> > > > - Don't unlock job_mutex in error path of xe_migrate_vram
> > > >
> > > > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > ---
> > > > drivers/gpu/drm/xe/xe_migrate.c | 149
> > > > ++++++++++++++++++++++++++++++++
> > > > drivers/gpu/drm/xe/xe_migrate.h | 10 +++
> > > > 2 files changed, 159 insertions(+)
> > > >
> > > > diff --git a/drivers/gpu/drm/xe/xe_migrate.c
> > > > b/drivers/gpu/drm/xe/xe_migrate.c
> > > > index cfd31ae49cc1..d7b6636286ae 100644
> > > > --- a/drivers/gpu/drm/xe/xe_migrate.c
> > > > +++ b/drivers/gpu/drm/xe/xe_migrate.c
> > > > @@ -1542,6 +1542,155 @@ void xe_migrate_wait(struct xe_migrate
> > > > *m)
> > > > dma_fence_wait(m->fence, false);
> > > > }
> > > >
> > > > +static u32 pte_update_cmd_size(u64 size)
> > > > +{
> > > > + u32 dword;
> > >
> > > dwords or num_dword?
> > >
> >
> > num_dword
> >
> > > > + u64 entries = DIV_ROUND_UP(size, XE_PAGE_SIZE);
> > > > +
> > > > + XE_WARN_ON(size > MAX_PREEMPTDISABLE_TRANSFER);
> > > > + /*
> > > > + * MI_STORE_DATA_IMM command is used to update page
> > > > table.
> > > > Each
> > > > + * instruction can update maximumly 0x1ff pte entries.
> > > > To
> > > > update
> > > > + * n (n <= 0x1ff) pte entries, we need:
> > > > + * 1 dword for the MI_STORE_DATA_IMM command header
> > > > (opcode
> > > > etc)
> > > > + * 2 dword for the page table's physical location
> > > > + * 2*n dword for value of pte to fill (each pte entry is
> > > > 2
> > > > dwords)
> > > > + */
> > > > + dword = (1 + 2) * DIV_ROUND_UP(entries, 0x1ff);
> > > > + dword += entries * 2;
> > > > +
> > > > + return dword;
> > > > +}
> > > > +
> > > > +static void build_pt_update_batch_sram(struct xe_migrate *m,
> > > > + struct xe_bb *bb, u32
> > > > pt_offset,
> > > > + dma_addr_t *sram_addr,
> > > > u32
> > > > size)
> > > > +{
> > > > + u16 pat_index = tile_to_xe(m->tile)-
> > > > >pat.idx[XE_CACHE_WB];
> > > > + u32 ptes;
> > > > + int i = 0;
> > > > +
> > > > + ptes = DIV_ROUND_UP(size, XE_PAGE_SIZE);
> > > > + while (ptes) {
> > > > + u32 chunk = min(0x1ffU, ptes);
> > > > +
> > > > + bb->cs[bb->len++] = MI_STORE_DATA_IMM |
> > > > MI_SDI_NUM_QW(chunk);
> > > > + bb->cs[bb->len++] = pt_offset;
> > > > + bb->cs[bb->len++] = 0;
> > > > +
> > > > + pt_offset += chunk * 8;
> > > > + ptes -= chunk;
> > > > +
> > > > + while (chunk--) {
> > > > + u64 addr = sram_addr[i++] & PAGE_MASK;
> > > > +
> > > > + xe_tile_assert(m->tile, addr);
> > > > + addr = m->q->vm->pt_ops-
> > > > >pte_encode_addr(m-
> > > > > tile->xe,
> > > > +
> > > > addr, pat_index,
> > > > +
> > > > 0,
> > > > false, 0);
> > > > + bb->cs[bb->len++] = lower_32_bits(addr);
> > > > + bb->cs[bb->len++] = upper_32_bits(addr);
> > > > + }
> > > > + }
> > > > +}
> > > > +
> > > > +enum xe_migrate_copy_dir {
> > > > + XE_MIGRATE_COPY_TO_VRAM,
> > > > + XE_MIGRATE_COPY_TO_SRAM,
> > > > +};
> > > > +
> > > > +static struct dma_fence *xe_migrate_vram(struct xe_migrate *m,
> > > > + unsigned long npages,
> > > > + dma_addr_t *sram_addr,
> > > > u64
> > > > vram_addr,
> > > > + const enum
> > > > xe_migrate_copy_dir dir)
> > > > +{
> > > > + struct xe_gt *gt = m->tile->primary_gt;
> > > > + struct xe_device *xe = gt_to_xe(gt);
> > > > + struct dma_fence *fence = NULL;
> > > > + u32 batch_size = 2;
> > > > + u64 src_L0_ofs, dst_L0_ofs;
> > > > + u64 round_update_size;
> > > > + struct xe_sched_job *job;
> > > > + struct xe_bb *bb;
> > > > + u32 update_idx, pt_slot = 0;
> > > > + int err;
> > > > +
> > > > + round_update_size = min_t(u64, npages * PAGE_SIZE,
> > > > + MAX_PREEMPTDISABLE_TRANSFER);
> > >
> > > Hm. How does the caller know how many pages were actually migrated?
> > >
> >
> > This is an intermediate between migrate_vma_setup and
> > migrate_vma_pages/finalize. The number of pages here is based on mpfn
> > returned from migrate_vma_setup. The migration for individual pages
> > may
> > still be aborted in migrate_vma_pages/finalize. In this case both the
> > old and new page have the same data, so migrate_vma_pages/finalize
> > can
> > pick either page.
>
> I might be misunderstanding, but I meant: if npages corresponds to, for
> example, 16MiB of data, the above min_t reduces that to 8MiB of
> data. How would the caller know?
>
Oh, yea - that is broken - it kinda assumes a chunk is 8M or less. I had
some local patches which fixed this function to do a loop, will pull
those into the next rev to future proof this.
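Roughly along these lines, I think (a hypothetical sketch only; the helper
name xe_migrate_vram_chunked is made up and the real patches may look
different):

	static struct dma_fence *
	xe_migrate_vram_chunked(struct xe_migrate *m, unsigned long npages,
				dma_addr_t *sram_addr, u64 vram_addr,
				const enum xe_migrate_copy_dir dir)
	{
		unsigned long pages_per_chunk =
			MAX_PREEMPTDISABLE_TRANSFER / PAGE_SIZE;
		struct dma_fence *fence = NULL;
		unsigned long i;

		for (i = 0; i < npages; i += pages_per_chunk) {
			unsigned long chunk = min(pages_per_chunk, npages - i);
			struct dma_fence *__fence;

			__fence = xe_migrate_vram(m, chunk, sram_addr + i,
						  vram_addr + (u64)i * PAGE_SIZE,
						  dir);
			if (IS_ERR(__fence)) {
				dma_fence_put(fence);
				return __fence;
			}

			/* Jobs on m->q are submitted in order, so keeping only
			 * the last fence covers the whole transfer.
			 */
			dma_fence_put(fence);
			fence = __fence;
		}

		return fence;
	}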
Matt
>
> /Thomas
>
^ permalink raw reply [flat|nested] 129+ messages in thread
* [PATCH v2 17/29] drm/xe: Add SVM device memory mirroring
2024-10-16 3:24 [PATCH v2 00/29] Introduce GPU SVM and Xe SVM implementation Matthew Brost
` (15 preceding siblings ...)
2024-10-16 3:25 ` [PATCH v2 16/29] drm/xe: Add migrate layer functions for SVM support Matthew Brost
@ 2024-10-16 3:25 ` Matthew Brost
2024-11-19 16:50 ` Thomas Hellström
2024-11-20 3:05 ` Gwan-gyeong Mun
2024-10-16 3:25 ` [PATCH v2 18/29] drm/xe: Add drm_gpusvm_devmem to xe_bo Matthew Brost
` (14 subsequent siblings)
31 siblings, 2 replies; 129+ messages in thread
From: Matthew Brost @ 2024-10-16 3:25 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
Add SVM device memory mirroring which enables device pages for
migration.
TODO: Hide this behind Kconfig
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
drivers/gpu/drm/xe/xe_device_types.h | 8 ++++
drivers/gpu/drm/xe/xe_svm.c | 56 +++++++++++++++++++++++++++-
drivers/gpu/drm/xe/xe_svm.h | 3 ++
drivers/gpu/drm/xe/xe_tile.c | 5 +++
4 files changed, 70 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index 85bede4dd646..2ac5de7751c9 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -104,6 +104,14 @@ struct xe_mem_region {
resource_size_t actual_physical_size;
/** @mapping: pointer to VRAM mappable space */
void __iomem *mapping;
+ /** @pagemap: Used to remap device memory as ZONE_DEVICE */
+ struct dev_pagemap pagemap;
+ /**
+ * @hpa_base: base host physical address
+ *
+ * This is generated when remap device memory as ZONE_DEVICE
+ */
+ resource_size_t hpa_base;
};
/**
diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index 1d8021b4e2f0..22e6341117bd 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -21,6 +21,11 @@ static struct xe_vm *range_to_vm(struct drm_gpusvm_range *r)
return gpusvm_to_vm(r->gpusvm);
}
+static void *xe_svm_devm_owner(struct xe_device *xe)
+{
+ return xe;
+}
+
static struct drm_gpusvm_range *
xe_svm_range_alloc(struct drm_gpusvm *gpusvm)
{
@@ -284,8 +289,9 @@ int xe_svm_init(struct xe_vm *vm)
xe_svm_garbage_collector_work_func);
return drm_gpusvm_init(&vm->svm.gpusvm, "Xe SVM", &vm->xe->drm,
- current->mm, NULL, 0, vm->size,
- SZ_512M, &gpusvm_ops, fault_chunk_sizes,
+ current->mm, xe_svm_devm_owner(vm->xe), 0,
+ vm->size, SZ_512M, &gpusvm_ops,
+ fault_chunk_sizes,
ARRAY_SIZE(fault_chunk_sizes));
}
@@ -383,3 +389,49 @@ bool xe_svm_has_mapping(struct xe_vm *vm, u64 start, u64 end)
{
return drm_gpusvm_has_mapping(&vm->svm.gpusvm, start, end);
}
+
+/**
+ * xe_devm_add: Remap and provide memmap backing for device memory
+ * @tile: tile that the memory region belongs to
+ * @mr: memory region to remap
+ *
+ * This remap device memory to host physical address space and create
+ * struct page to back device memory
+ *
+ * Return: 0 on success standard error code otherwise
+ */
+int xe_devm_add(struct xe_tile *tile, struct xe_mem_region *mr)
+{
+ struct xe_device *xe = tile_to_xe(tile);
+ struct device *dev = &to_pci_dev(xe->drm.dev)->dev;
+ struct resource *res;
+ void *addr;
+ int ret;
+
+ res = devm_request_free_mem_region(dev, &iomem_resource,
+ mr->usable_size);
+ if (IS_ERR(res)) {
+ ret = PTR_ERR(res);
+ return ret;
+ }
+
+ mr->pagemap.type = MEMORY_DEVICE_PRIVATE;
+ mr->pagemap.range.start = res->start;
+ mr->pagemap.range.end = res->end;
+ mr->pagemap.nr_range = 1;
+ mr->pagemap.ops = drm_gpusvm_pagemap_ops_get();
+ mr->pagemap.owner = xe_svm_devm_owner(xe);
+ addr = devm_memremap_pages(dev, &mr->pagemap);
+ if (IS_ERR(addr)) {
+ devm_release_mem_region(dev, res->start, resource_size(res));
+ ret = PTR_ERR(addr);
+ drm_err(&xe->drm, "Failed to remap tile %d memory, errno %d\n",
+ tile->id, ret);
+ return ret;
+ }
+ mr->hpa_base = res->start;
+
+ drm_info(&xe->drm, "Added tile %d memory [%llx-%llx] to devm, remapped to %pr\n",
+ tile->id, mr->io_start, mr->io_start + mr->usable_size, res);
+ return 0;
+}
diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index 472fbc51f30e..760d22cefb1e 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -11,6 +11,7 @@
#define XE_INTERCONNECT_VRAM DRM_INTERCONNECT_DRIVER
+struct xe_mem_region;
struct xe_tile;
struct xe_vm;
struct xe_vma;
@@ -22,6 +23,8 @@ struct xe_svm_range {
u8 tile_invalidated;
};
+int xe_devm_add(struct xe_tile *tile, struct xe_mem_region *mr);
+
int xe_svm_init(struct xe_vm *vm);
void xe_svm_fini(struct xe_vm *vm);
void xe_svm_close(struct xe_vm *vm);
diff --git a/drivers/gpu/drm/xe/xe_tile.c b/drivers/gpu/drm/xe/xe_tile.c
index 07cf7cfe4abd..bbb430392dfb 100644
--- a/drivers/gpu/drm/xe/xe_tile.c
+++ b/drivers/gpu/drm/xe/xe_tile.c
@@ -13,6 +13,7 @@
#include "xe_migrate.h"
#include "xe_pcode.h"
#include "xe_sa.h"
+#include "xe_svm.h"
#include "xe_tile.h"
#include "xe_tile_sysfs.h"
#include "xe_ttm_vram_mgr.h"
@@ -164,6 +165,7 @@ static int tile_ttm_mgr_init(struct xe_tile *tile)
*/
int xe_tile_init_noalloc(struct xe_tile *tile)
{
+ struct xe_device *xe = tile_to_xe(tile);
int err;
err = tile_ttm_mgr_init(tile);
@@ -176,6 +178,9 @@ int xe_tile_init_noalloc(struct xe_tile *tile)
xe_wa_apply_tile_workarounds(tile);
+ if (xe->info.has_usm && IS_DGFX(xe))
+ xe_devm_add(tile, &tile->mem.vram);
+
err = xe_tile_sysfs_init(tile);
return 0;
--
2.34.1
^ permalink raw reply related [flat|nested] 129+ messages in thread
* Re: [PATCH v2 17/29] drm/xe: Add SVM device memory mirroring
2024-10-16 3:25 ` [PATCH v2 17/29] drm/xe: Add SVM device memory mirroring Matthew Brost
@ 2024-11-19 16:50 ` Thomas Hellström
2024-11-20 3:05 ` Gwan-gyeong Mun
1 sibling, 0 replies; 129+ messages in thread
From: Thomas Hellström @ 2024-11-19 16:50 UTC (permalink / raw)
To: Matthew Brost, intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, simona.vetter, felix.kuehling,
dakr
On Tue, 2024-10-15 at 20:25 -0700, Matthew Brost wrote:
> Add SVM device memory mirroring which enables device pages for
> migration.
>
> TODO: Hide this behind Kconfig
>
> Signed-off-by: Niranjana Vishwanathapura
> <niranjana.vishwanathapura@intel.com>
> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
> drivers/gpu/drm/xe/xe_device_types.h | 8 ++++
> drivers/gpu/drm/xe/xe_svm.c | 56
> +++++++++++++++++++++++++++-
> drivers/gpu/drm/xe/xe_svm.h | 3 ++
> drivers/gpu/drm/xe/xe_tile.c | 5 +++
> 4 files changed, 70 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_device_types.h
> b/drivers/gpu/drm/xe/xe_device_types.h
> index 85bede4dd646..2ac5de7751c9 100644
> --- a/drivers/gpu/drm/xe/xe_device_types.h
> +++ b/drivers/gpu/drm/xe/xe_device_types.h
> @@ -104,6 +104,14 @@ struct xe_mem_region {
> resource_size_t actual_physical_size;
> /** @mapping: pointer to VRAM mappable space */
> void __iomem *mapping;
> + /** @pagemap: Used to remap device memory as ZONE_DEVICE */
> + struct dev_pagemap pagemap;
> + /**
> + * @hpa_base: base host physical address
> + *
> + * This is generated when remap device memory as ZONE_DEVICE
> + */
> + resource_size_t hpa_base;
> };
>
> /**
> diff --git a/drivers/gpu/drm/xe/xe_svm.c
> b/drivers/gpu/drm/xe/xe_svm.c
> index 1d8021b4e2f0..22e6341117bd 100644
> --- a/drivers/gpu/drm/xe/xe_svm.c
> +++ b/drivers/gpu/drm/xe/xe_svm.c
> @@ -21,6 +21,11 @@ static struct xe_vm *range_to_vm(struct
> drm_gpusvm_range *r)
> return gpusvm_to_vm(r->gpusvm);
> }
>
> +static void *xe_svm_devm_owner(struct xe_device *xe)
> +{
> + return xe;
> +}
> +
> static struct drm_gpusvm_range *
> xe_svm_range_alloc(struct drm_gpusvm *gpusvm)
> {
> @@ -284,8 +289,9 @@ int xe_svm_init(struct xe_vm *vm)
> xe_svm_garbage_collector_work_func);
>
> return drm_gpusvm_init(&vm->svm.gpusvm, "Xe SVM", &vm->xe-
> >drm,
> - current->mm, NULL, 0, vm->size,
> - SZ_512M, &gpusvm_ops,
> fault_chunk_sizes,
> + current->mm, xe_svm_devm_owner(vm-
> >xe), 0,
> + vm->size, SZ_512M, &gpusvm_ops,
> + fault_chunk_sizes,
> ARRAY_SIZE(fault_chunk_sizes));
> }
>
> @@ -383,3 +389,49 @@ bool xe_svm_has_mapping(struct xe_vm *vm, u64
> start, u64 end)
> {
> return drm_gpusvm_has_mapping(&vm->svm.gpusvm, start, end);
> }
> +
> +/**
> + * xe_devm_add: Remap and provide memmap backing for device memory
xe_devm_add():
Otherwise LGTM.
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> + * @tile: tile that the memory region belongs to
> + * @mr: memory region to remap
> + *
> + * This remap device memory to host physical address space and
> create
> + * struct page to back device memory
> + *
> + * Return: 0 on success standard error code otherwise
> + */
> +int xe_devm_add(struct xe_tile *tile, struct xe_mem_region *mr)
> +{
> + struct xe_device *xe = tile_to_xe(tile);
> + struct device *dev = &to_pci_dev(xe->drm.dev)->dev;
> + struct resource *res;
> + void *addr;
> + int ret;
> +
> + res = devm_request_free_mem_region(dev, &iomem_resource,
> + mr->usable_size);
> + if (IS_ERR(res)) {
> + ret = PTR_ERR(res);
> + return ret;
> + }
> +
> + mr->pagemap.type = MEMORY_DEVICE_PRIVATE;
> + mr->pagemap.range.start = res->start;
> + mr->pagemap.range.end = res->end;
> + mr->pagemap.nr_range = 1;
> + mr->pagemap.ops = drm_gpusvm_pagemap_ops_get();
> + mr->pagemap.owner = xe_svm_devm_owner(xe);
> + addr = devm_memremap_pages(dev, &mr->pagemap);
> + if (IS_ERR(addr)) {
> + devm_release_mem_region(dev, res->start,
> resource_size(res));
> + ret = PTR_ERR(addr);
> + drm_err(&xe->drm, "Failed to remap tile %d memory,
> errno %d\n",
> + tile->id, ret);
> + return ret;
> + }
> + mr->hpa_base = res->start;
> +
> + drm_info(&xe->drm, "Added tile %d memory [%llx-%llx] to
> devm, remapped to %pr\n",
> + tile->id, mr->io_start, mr->io_start + mr-
> >usable_size, res);
> + return 0;
> +}
> diff --git a/drivers/gpu/drm/xe/xe_svm.h
> b/drivers/gpu/drm/xe/xe_svm.h
> index 472fbc51f30e..760d22cefb1e 100644
> --- a/drivers/gpu/drm/xe/xe_svm.h
> +++ b/drivers/gpu/drm/xe/xe_svm.h
> @@ -11,6 +11,7 @@
>
> #define XE_INTERCONNECT_VRAM DRM_INTERCONNECT_DRIVER
>
> +struct xe_mem_region;
> struct xe_tile;
> struct xe_vm;
> struct xe_vma;
> @@ -22,6 +23,8 @@ struct xe_svm_range {
> u8 tile_invalidated;
> };
>
> +int xe_devm_add(struct xe_tile *tile, struct xe_mem_region *mr);
> +
> int xe_svm_init(struct xe_vm *vm);
> void xe_svm_fini(struct xe_vm *vm);
> void xe_svm_close(struct xe_vm *vm);
> diff --git a/drivers/gpu/drm/xe/xe_tile.c
> b/drivers/gpu/drm/xe/xe_tile.c
> index 07cf7cfe4abd..bbb430392dfb 100644
> --- a/drivers/gpu/drm/xe/xe_tile.c
> +++ b/drivers/gpu/drm/xe/xe_tile.c
> @@ -13,6 +13,7 @@
> #include "xe_migrate.h"
> #include "xe_pcode.h"
> #include "xe_sa.h"
> +#include "xe_svm.h"
> #include "xe_tile.h"
> #include "xe_tile_sysfs.h"
> #include "xe_ttm_vram_mgr.h"
> @@ -164,6 +165,7 @@ static int tile_ttm_mgr_init(struct xe_tile
> *tile)
> */
> int xe_tile_init_noalloc(struct xe_tile *tile)
> {
> + struct xe_device *xe = tile_to_xe(tile);
> int err;
>
> err = tile_ttm_mgr_init(tile);
> @@ -176,6 +178,9 @@ int xe_tile_init_noalloc(struct xe_tile *tile)
>
> xe_wa_apply_tile_workarounds(tile);
>
> + if (xe->info.has_usm && IS_DGFX(xe))
> + xe_devm_add(tile, &tile->mem.vram);
> +
> err = xe_tile_sysfs_init(tile);
>
> return 0;
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [PATCH v2 17/29] drm/xe: Add SVM device memory mirroring
2024-10-16 3:25 ` [PATCH v2 17/29] drm/xe: Add SVM device memory mirroring Matthew Brost
2024-11-19 16:50 ` Thomas Hellström
@ 2024-11-20 3:05 ` Gwan-gyeong Mun
2024-12-11 19:44 ` Matthew Brost
1 sibling, 1 reply; 129+ messages in thread
From: Gwan-gyeong Mun @ 2024-11-20 3:05 UTC (permalink / raw)
To: Matthew Brost, intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
On 10/16/24 6:25 AM, Matthew Brost wrote:
> +/**
> + * xe_devm_add: Remap and provide memmap backing for device memory
> + * @tile: tile that the memory region belongs to
> + * @mr: memory region to remap
> + *
> + * This remap device memory to host physical address space and create
> + * struct page to back device memory
> + *
> + * Return: 0 on success standard error code otherwise
> + */
> +int xe_devm_add(struct xe_tile *tile, struct xe_mem_region *mr)
> +{
> + struct xe_device *xe = tile_to_xe(tile);
> + struct device *dev = &to_pci_dev(xe->drm.dev)->dev;
> + struct resource *res;
> + void *addr;
> + int ret;
> +
> + res = devm_request_free_mem_region(dev, &iomem_resource,
> + mr->usable_size);
To use the devm_request_free_mem_region() function,
CONFIG_GET_FREE_REGION=y needs to be set in the config.
xe's Kconfig needs a CONFIG_GET_FREE_REGION dependency.
> + if (IS_ERR(res)) {
> + ret = PTR_ERR(res);
> + return ret;
> + }
> +
> + mr->pagemap.type = MEMORY_DEVICE_PRIVATE;
> + mr->pagemap.range.start = res->start;
> + mr->pagemap.range.end = res->end;
> + mr->pagemap.nr_range = 1;
> + mr->pagemap.ops = drm_gpusvm_pagemap_ops_get();
> + mr->pagemap.owner = xe_svm_devm_owner(xe);
> + addr = devm_memremap_pages(dev, &mr->pagemap);
> + if (IS_ERR(addr)) {
> + devm_release_mem_region(dev, res->start, resource_size(res));
> + ret = PTR_ERR(addr);
> + drm_err(&xe->drm, "Failed to remap tile %d memory, errno %d\n",
> + tile->id, ret);
> + return ret;
> + }
> + mr->hpa_base = res->start;
> +
> + drm_info(&xe->drm, "Added tile %d memory [%llx-%llx] to devm, remapped to %pr\n",
> + tile->id, mr->io_start, mr->io_start + mr->usable_size, res);
> + return 0;
> +}
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [PATCH v2 17/29] drm/xe: Add SVM device memory mirroring
2024-11-20 3:05 ` Gwan-gyeong Mun
@ 2024-12-11 19:44 ` Matthew Brost
0 siblings, 0 replies; 129+ messages in thread
From: Matthew Brost @ 2024-12-11 19:44 UTC (permalink / raw)
To: Gwan-gyeong Mun
Cc: intel-xe, dri-devel, apopple, airlied, christian.koenig,
thomas.hellstrom, simona.vetter, felix.kuehling, dakr
On Wed, Nov 20, 2024 at 05:05:32AM +0200, Gwan-gyeong Mun wrote:
>
>
> On 10/16/24 6:25 AM, Matthew Brost wrote:
> > +/**
> > + * xe_devm_add: Remap and provide memmap backing for device memory
> > + * @tile: tile that the memory region belongs to
> > + * @mr: memory region to remap
> > + *
> > + * This remap device memory to host physical address space and create
> > + * struct page to back device memory
> > + *
> > + * Return: 0 on success standard error code otherwise
> > + */
> > +int xe_devm_add(struct xe_tile *tile, struct xe_mem_region *mr)
> > +{
> > + struct xe_device *xe = tile_to_xe(tile);
> > + struct device *dev = &to_pci_dev(xe->drm.dev)->dev;
> > + struct resource *res;
> > + void *addr;
> > + int ret;
> > +
> > + res = devm_request_free_mem_region(dev, &iomem_resource,
> > + mr->usable_size);
> To use the devm_request_free_mem_region() function, CONFIG_GET_FREE_REGION=y
> needs to be set in the config.
> xe's Kconfig needs a CONFIG_GET_FREE_REGION dependency.
Will add a CONFIG_GET_FREE_REGION dependency or perhaps even a
CONFIG_XE_DEVMEM_MIRROR Kconfig option which enables this code.
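E.g. the call site in xe_tile_init_noalloc() could then be compiled out
roughly like this (sketch only; CONFIG_XE_DEVMEM_MIRROR is the hypothetical
option name mentioned above):

#if IS_ENABLED(CONFIG_XE_DEVMEM_MIRROR)
	if (xe->info.has_usm && IS_DGFX(xe))
		xe_devm_add(tile, &tile->mem.vram);
#endif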
Matt
> > + if (IS_ERR(res)) {
> > + ret = PTR_ERR(res);
> > + return ret;
> > + }
> > +
> > + mr->pagemap.type = MEMORY_DEVICE_PRIVATE;
> > + mr->pagemap.range.start = res->start;
> > + mr->pagemap.range.end = res->end;
> > + mr->pagemap.nr_range = 1;
> > + mr->pagemap.ops = drm_gpusvm_pagemap_ops_get();
> > + mr->pagemap.owner = xe_svm_devm_owner(xe);
> > + addr = devm_memremap_pages(dev, &mr->pagemap);
> > + if (IS_ERR(addr)) {
> > + devm_release_mem_region(dev, res->start, resource_size(res));
> > + ret = PTR_ERR(addr);
> > + drm_err(&xe->drm, "Failed to remap tile %d memory, errno %d\n",
> > + tile->id, ret);
> > + return ret;
> > + }
> > + mr->hpa_base = res->start;
> > +
> > + drm_info(&xe->drm, "Added tile %d memory [%llx-%llx] to devm, remapped to %pr\n",
> > + tile->id, mr->io_start, mr->io_start + mr->usable_size, res);
> > + return 0;
> > +}
^ permalink raw reply [flat|nested] 129+ messages in thread
* [PATCH v2 18/29] drm/xe: Add drm_gpusvm_devmem to xe_bo
2024-10-16 3:24 [PATCH v2 00/29] Introduce GPU SVM and Xe SVM implementation Matthew Brost
` (16 preceding siblings ...)
2024-10-16 3:25 ` [PATCH v2 17/29] drm/xe: Add SVM device memory mirroring Matthew Brost
@ 2024-10-16 3:25 ` Matthew Brost
2024-11-19 16:51 ` Thomas Hellström
2024-10-16 3:25 ` [PATCH v2 19/29] drm/xe: Add GPUSVM devic memory copy vfunc functions Matthew Brost
` (13 subsequent siblings)
31 siblings, 1 reply; 129+ messages in thread
From: Matthew Brost @ 2024-10-16 3:25 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
Add drm_gpusvm_devmem to xe_bo. Required to enable SVM migrations.
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
drivers/gpu/drm/xe/xe_bo_types.h | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/gpu/drm/xe/xe_bo_types.h b/drivers/gpu/drm/xe/xe_bo_types.h
index 13c6d8a69e91..54d337004621 100644
--- a/drivers/gpu/drm/xe/xe_bo_types.h
+++ b/drivers/gpu/drm/xe/xe_bo_types.h
@@ -8,6 +8,8 @@
#include <linux/iosys-map.h>
+#include "drm_gpusvm.h"
+
#include <drm/ttm/ttm_bo.h>
#include <drm/ttm/ttm_device.h>
#include <drm/ttm/ttm_execbuf_util.h>
@@ -74,6 +76,9 @@ struct xe_bo {
*/
u16 cpu_caching;
+ /** @devmem_allocation: SVM device memory allocation */
+ struct drm_gpusvm_devmem devmem_allocation;
+
/** @vram_userfault_link: Link into @mem_access.vram_userfault.list */
struct list_head vram_userfault_link;
--
2.34.1
^ permalink raw reply related [flat|nested] 129+ messages in thread
* Re: [PATCH v2 18/29] drm/xe: Add drm_gpusvm_devmem to xe_bo
2024-10-16 3:25 ` [PATCH v2 18/29] drm/xe: Add drm_gpusvm_devmem to xe_bo Matthew Brost
@ 2024-11-19 16:51 ` Thomas Hellström
2024-12-15 4:38 ` Matthew Brost
0 siblings, 1 reply; 129+ messages in thread
From: Thomas Hellström @ 2024-11-19 16:51 UTC (permalink / raw)
To: Matthew Brost, intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, simona.vetter, felix.kuehling,
dakr
On Tue, 2024-10-15 at 20:25 -0700, Matthew Brost wrote:
> Add drm_gpusvm_devmem to xe_bo. Required to enable SVM migrations.
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
> drivers/gpu/drm/xe/xe_bo_types.h | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/drivers/gpu/drm/xe/xe_bo_types.h
> b/drivers/gpu/drm/xe/xe_bo_types.h
> index 13c6d8a69e91..54d337004621 100644
> --- a/drivers/gpu/drm/xe/xe_bo_types.h
> +++ b/drivers/gpu/drm/xe/xe_bo_types.h
> @@ -8,6 +8,8 @@
>
> #include <linux/iosys-map.h>
>
> +#include "drm_gpusvm.h"
> +
> #include <drm/ttm/ttm_bo.h>
> #include <drm/ttm/ttm_device.h>
> #include <drm/ttm/ttm_execbuf_util.h>
> @@ -74,6 +76,9 @@ struct xe_bo {
> */
> u16 cpu_caching;
>
> + /** @devmem_allocation: SVM device memory allocation */
> + struct drm_gpusvm_devmem devmem_allocation;
> +
I think this can go away with follow-up multi-device patches, but for
now
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> /** @vram_userfault_link: Link into
> @mem_access.vram_userfault.list */
> struct list_head vram_userfault_link;
>
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [PATCH v2 18/29] drm/xe: Add drm_gpusvm_devmem to xe_bo
2024-11-19 16:51 ` Thomas Hellström
@ 2024-12-15 4:38 ` Matthew Brost
0 siblings, 0 replies; 129+ messages in thread
From: Matthew Brost @ 2024-12-15 4:38 UTC (permalink / raw)
To: Thomas Hellström
Cc: intel-xe, dri-devel, apopple, airlied, christian.koenig,
simona.vetter, felix.kuehling, dakr
On Tue, Nov 19, 2024 at 05:51:50PM +0100, Thomas Hellström wrote:
>
>
> On Tue, 2024-10-15 at 20:25 -0700, Matthew Brost wrote:
> > Add drm_gpusvm_devmem to xe_bo. Required to enable SVM migrations.
> >
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> > drivers/gpu/drm/xe/xe_bo_types.h | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_bo_types.h
> > b/drivers/gpu/drm/xe/xe_bo_types.h
> > index 13c6d8a69e91..54d337004621 100644
> > --- a/drivers/gpu/drm/xe/xe_bo_types.h
> > +++ b/drivers/gpu/drm/xe/xe_bo_types.h
> > @@ -8,6 +8,8 @@
> >
> > #include <linux/iosys-map.h>
> >
> > +#include "drm_gpusvm.h"
> > +
> > #include <drm/ttm/ttm_bo.h>
> > #include <drm/ttm/ttm_device.h>
> > #include <drm/ttm/ttm_execbuf_util.h>
> > @@ -74,6 +76,9 @@ struct xe_bo {
> > */
> > u16 cpu_caching;
> >
> > + /** @devmem_allocation: SVM device memory allocation */
> > + struct drm_gpusvm_devmem devmem_allocation;
> > +
>
> I think this can go away with follow-up multi-device patches, but for
> now
Yea I could see that.
> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>
Thanks.
Matt
>
> > /** @vram_userfault_link: Link into
> > @mem_access.vram_userfault.list */
> > struct list_head vram_userfault_link;
> >
>
^ permalink raw reply [flat|nested] 129+ messages in thread
* [PATCH v2 19/29] drm/xe: Add GPUSVM devic memory copy vfunc functions
2024-10-16 3:24 [PATCH v2 00/29] Introduce GPU SVM and Xe SVM implementation Matthew Brost
` (17 preceding siblings ...)
2024-10-16 3:25 ` [PATCH v2 18/29] drm/xe: Add drm_gpusvm_devmem to xe_bo Matthew Brost
@ 2024-10-16 3:25 ` Matthew Brost
2024-12-02 10:13 ` Thomas Hellström
2024-10-16 3:25 ` [PATCH v2 20/29] drm/xe: Add drm_pagemap ops to SVM Matthew Brost
` (12 subsequent siblings)
31 siblings, 1 reply; 129+ messages in thread
From: Matthew Brost @ 2024-10-16 3:25 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
Add GPUSVM devic memory copy vfunc functions and connect to migration
layer.
v2:
- Allow NULL device pages in xe_svm_copy
- Use new drm_gpusvm_devmem_ops
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
drivers/gpu/drm/xe/xe_svm.c | 150 ++++++++++++++++++++++++++++++++++++
1 file changed, 150 insertions(+)
diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index 22e6341117bd..b33fd42d035b 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -6,6 +6,7 @@
#include "drm_gpusvm.h"
#include "xe_gt_tlb_invalidation.h"
+#include "xe_migrate.h"
#include "xe_pt.h"
#include "xe_svm.h"
#include "xe_vm.h"
@@ -269,6 +270,155 @@ static void xe_svm_garbage_collector_work_func(struct work_struct *w)
up_write(&vm->lock);
}
+static struct xe_mem_region *page_to_mr(struct page *page)
+{
+ return container_of(page->pgmap, struct xe_mem_region, pagemap);
+}
+
+static struct xe_tile *mr_to_tile(struct xe_mem_region *mr)
+{
+ return container_of(mr, struct xe_tile, mem.vram);
+}
+
+static u64 xe_mem_region_page_to_dpa(struct xe_mem_region *mr,
+ struct page *page)
+{
+ u64 dpa;
+ struct xe_tile *tile = mr_to_tile(mr);
+ u64 pfn = page_to_pfn(page);
+ u64 offset;
+
+ xe_tile_assert(tile, is_device_private_page(page));
+ xe_tile_assert(tile, (pfn << PAGE_SHIFT) >= mr->hpa_base);
+
+ offset = (pfn << PAGE_SHIFT) - mr->hpa_base;
+ dpa = mr->dpa_base + offset;
+
+ return dpa;
+}
+
+enum xe_svm_copy_dir {
+ XE_SVM_COPY_TO_VRAM,
+ XE_SVM_COPY_TO_SRAM,
+};
+
+static int xe_svm_copy(struct page **pages, dma_addr_t *dma_addr,
+ unsigned long npages, const enum xe_svm_copy_dir dir)
+{
+ struct xe_mem_region *mr = NULL;
+ struct xe_tile *tile;
+ struct dma_fence *fence = NULL;
+ unsigned long i;
+#define VRAM_ADDR_INVALID ~0x0ull
+ u64 vram_addr = VRAM_ADDR_INVALID;
+ int err = 0, pos = 0;
+ bool sram = dir == XE_SVM_COPY_TO_SRAM;
+
+ for (i = 0; i < npages; ++i) {
+ struct page *spage = pages[i];
+ struct dma_fence *__fence;
+ u64 __vram_addr;
+ bool match = false, chunk, last;
+
+ chunk = (i - pos) == (SZ_2M / PAGE_SIZE);
+ last = (i + 1) == npages;
+
+ if (!dma_addr[i] && vram_addr == VRAM_ADDR_INVALID)
+ continue;
+
+ if (!mr && spage) {
+ mr = page_to_mr(spage);
+ tile = mr_to_tile(mr);
+ }
+
+ if (dma_addr[i] && spage) {
+ __vram_addr = xe_mem_region_page_to_dpa(mr, spage);
+ if (vram_addr == VRAM_ADDR_INVALID) {
+ vram_addr = __vram_addr;
+ pos = i;
+ }
+
+ match = vram_addr + PAGE_SIZE * (i - pos) == __vram_addr;
+ }
+
+ if (!match || chunk || last) {
+ int incr = (match && last) ? 1 : 0;
+
+ if (vram_addr != VRAM_ADDR_INVALID) {
+ if (sram)
+ __fence = xe_migrate_from_vram(tile->migrate,
+ i - pos + incr,
+ vram_addr,
+ dma_addr + pos);
+ else
+ __fence = xe_migrate_to_vram(tile->migrate,
+ i - pos + incr,
+ dma_addr + pos,
+ vram_addr);
+ if (IS_ERR(__fence)) {
+ err = PTR_ERR(__fence);
+ goto err_out;
+ }
+
+ dma_fence_put(fence);
+ fence = __fence;
+ }
+
+ if (dma_addr[i] && spage) {
+ vram_addr = __vram_addr;
+ pos = i;
+ } else {
+ vram_addr = VRAM_ADDR_INVALID;
+ }
+
+ if (!match && last && dma_addr[i] && spage) {
+ if (sram)
+ __fence = xe_migrate_from_vram(tile->migrate, 1,
+ vram_addr,
+ dma_addr + pos);
+ else
+ __fence = xe_migrate_to_vram(tile->migrate, 1,
+ dma_addr + pos,
+ vram_addr);
+ if (IS_ERR(__fence)) {
+ err = PTR_ERR(__fence);
+ goto err_out;
+ }
+
+ dma_fence_put(fence);
+ fence = __fence;
+ }
+ }
+ }
+
+err_out:
+ if (fence) {
+ dma_fence_wait(fence, false);
+ dma_fence_put(fence);
+ }
+
+ return err;
+#undef VRAM_ADDR_INVALID
+}
+
+static int xe_svm_copy_to_devmem(struct page **pages, dma_addr_t *dma_addr,
+ unsigned long npages)
+{
+ return xe_svm_copy(pages, dma_addr, npages, XE_SVM_COPY_TO_VRAM);
+}
+
+static int xe_svm_copy_to_ram(struct page **pages, dma_addr_t *dma_addr,
+ unsigned long npages)
+{
+ return xe_svm_copy(pages, dma_addr, npages, XE_SVM_COPY_TO_SRAM);
+}
+
+__maybe_unused
+static const struct drm_gpusvm_devmem_ops gpusvm_devmem_ops = {
+ .copy_to_devmem = xe_svm_copy_to_devmem,
+ .copy_to_ram = xe_svm_copy_to_ram,
+};
+
static const struct drm_gpusvm_ops gpusvm_ops = {
.range_alloc = xe_svm_range_alloc,
.range_free = xe_svm_range_free,
--
2.34.1
^ permalink raw reply related [flat|nested] 129+ messages in thread
* Re: [PATCH v2 19/29] drm/xe: Add GPUSVM devic memory copy vfunc functions
2024-10-16 3:25 ` [PATCH v2 19/29] drm/xe: Add GPUSVM devic memory copy vfunc functions Matthew Brost
@ 2024-12-02 10:13 ` Thomas Hellström
2024-12-12 3:59 ` Matthew Brost
0 siblings, 1 reply; 129+ messages in thread
From: Thomas Hellström @ 2024-12-02 10:13 UTC (permalink / raw)
To: Matthew Brost, intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, simona.vetter, felix.kuehling,
dakr
On Tue, 2024-10-15 at 20:25 -0700, Matthew Brost wrote:
> Add GPUSVM devic memory copy vfunc functions and connect to migration
s/devic/device
>
> layer.
>
> v2:
> - Allow NULL device pages in xe_svm_copy
> - Use new drm_gpusvm_devmem_ops
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
> drivers/gpu/drm/xe/xe_svm.c | 150
> ++++++++++++++++++++++++++++++++++++
> 1 file changed, 150 insertions(+)
>
> diff --git a/drivers/gpu/drm/xe/xe_svm.c
> b/drivers/gpu/drm/xe/xe_svm.c
> index 22e6341117bd..b33fd42d035b 100644
> --- a/drivers/gpu/drm/xe/xe_svm.c
> +++ b/drivers/gpu/drm/xe/xe_svm.c
> @@ -6,6 +6,7 @@
> #include "drm_gpusvm.h"
>
> #include "xe_gt_tlb_invalidation.h"
> +#include "xe_migrate.h"
> #include "xe_pt.h"
> #include "xe_svm.h"
> #include "xe_vm.h"
> @@ -269,6 +270,155 @@ static void
> xe_svm_garbage_collector_work_func(struct work_struct *w)
> up_write(&vm->lock);
> }
>
> +static struct xe_mem_region *page_to_mr(struct page *page)
> +{
> + return container_of(page->pgmap, struct xe_mem_region,
> pagemap);
> +}
> +
> +static struct xe_tile *mr_to_tile(struct xe_mem_region *mr)
> +{
> + return container_of(mr, struct xe_tile, mem.vram);
> +}
> +
> +static u64 xe_mem_region_page_to_dpa(struct xe_mem_region *mr,
> + struct page *page)
> +{
> + u64 dpa;
> + struct xe_tile *tile = mr_to_tile(mr);
> + u64 pfn = page_to_pfn(page);
> + u64 offset;
> +
> + xe_tile_assert(tile, is_device_private_page(page));
> + xe_tile_assert(tile, (pfn << PAGE_SHIFT) >= mr->hpa_base);
> +
> + offset = (pfn << PAGE_SHIFT) - mr->hpa_base;
> + dpa = mr->dpa_base + offset;
> +
> + return dpa;
> +}
> +
> +enum xe_svm_copy_dir {
> + XE_SVM_COPY_TO_VRAM,
> + XE_SVM_COPY_TO_SRAM,
> +};
> +
> +static int xe_svm_copy(struct page **pages, dma_addr_t *dma_addr,
> + unsigned long npages, const enum
> xe_svm_copy_dir dir)
> +{
> + struct xe_mem_region *mr = NULL;
> + struct xe_tile *tile;
> + struct dma_fence *fence = NULL;
> + unsigned long i;
> +#define VRAM_ADDR_INVALID ~0x0ull
> + u64 vram_addr = VRAM_ADDR_INVALID;
> + int err = 0, pos = 0;
> + bool sram = dir == XE_SVM_COPY_TO_SRAM;
> +
> + for (i = 0; i < npages; ++i) {
> + struct page *spage = pages[i];
> + struct dma_fence *__fence;
> + u64 __vram_addr;
> + bool match = false, chunk, last;
> +
> + chunk = (i - pos) == (SZ_2M / PAGE_SIZE);
> + last = (i + 1) == npages;
> +
> + if (!dma_addr[i] && vram_addr == VRAM_ADDR_INVALID)
> + continue;
> +
> + if (!mr && spage) {
> + mr = page_to_mr(spage);
> + tile = mr_to_tile(mr);
> + }
> +
> + if (dma_addr[i] && spage) {
> + __vram_addr = xe_mem_region_page_to_dpa(mr,
> spage);
> + if (vram_addr == VRAM_ADDR_INVALID) {
> + vram_addr = __vram_addr;
> + pos = i;
> + }
> +
> + match = vram_addr + PAGE_SIZE * (i - pos) ==
> __vram_addr;
> + }
> +
> + if (!match || chunk || last) {
> + int incr = (match && last) ? 1 : 0;
> +
> + if (vram_addr != VRAM_ADDR_INVALID) {
> + if (sram)
> + __fence =
> xe_migrate_from_vram(tile->migrate,
> +
> i - pos + incr,
> +
> vram_addr,
> +
> dma_addr + pos);
> + else
> + __fence =
> xe_migrate_to_vram(tile->migrate,
> +
> i - pos + incr,
> +
> dma_addr + pos,
> +
> vram_addr);
> + if (IS_ERR(__fence)) {
> + err = PTR_ERR(__fence);
> + goto err_out;
> + }
> +
> + dma_fence_put(fence);
> + fence = __fence;
> + }
> +
> + if (dma_addr[i] && spage) {
> + vram_addr = __vram_addr;
> + pos = i;
> + } else {
> + vram_addr = VRAM_ADDR_INVALID;
> + }
> +
> + if (!match && last && dma_addr[i] && spage)
> {
> + if (sram)
> + __fence =
> xe_migrate_from_vram(tile->migrate, 1,
> +
> vram_addr,
> +
> dma_addr + pos);
> + else
> + __fence =
> xe_migrate_to_vram(tile->migrate, 1,
> +
> dma_addr + pos,
> +
> vram_addr);
> + if (IS_ERR(__fence)) {
> + err = PTR_ERR(__fence);
> + goto err_out;
> + }
> +
> + dma_fence_put(fence);
> + fence = __fence;
> + }
I think the flow in this function is a bit hard to follow. Could it
perhaps be simplified? If not, perhaps add a comment to the function
describing what it expects from the input arguments and the possible
corner cases that complicate it?
> + }
> + }
> +
> +err_out:
> + if (fence) {
> + dma_fence_wait(fence, false);
> + dma_fence_put(fence);
> + }
> +
> + return err;
> +#undef VRAM_ADDR_INVALID
> +}
> +
> +static int xe_svm_copy_to_devmem(struct page **pages, dma_addr_t
> *dma_addr,
> + unsigned long npages)
> +{
> + return xe_svm_copy(pages, dma_addr, npages,
> XE_SVM_COPY_TO_VRAM);
> +}
> +
> +static int xe_svm_copy_to_ram(struct page **pages, dma_addr_t
> *dma_addr,
> + unsigned long npages)
> +{
> + return xe_svm_copy(pages, dma_addr, npages,
> XE_SVM_COPY_TO_SRAM);
> +}
> +
> +__maybe_unused
Is this __maybe_unused to be removed in a follow-up patch? If so could
you add a comment stating that?
> +static const struct drm_gpusvm_devmem_ops gpusvm_devmem_ops = {
> + .copy_to_devmem = xe_svm_copy_to_devmem,
> + .copy_to_ram = xe_svm_copy_to_ram,
> +};
> +
> static const struct drm_gpusvm_ops gpusvm_ops = {
> .range_alloc = xe_svm_range_alloc,
> .range_free = xe_svm_range_free,
Thanks,
Thomas
^ permalink raw reply [flat|nested] 129+ messages in thread* Re: [PATCH v2 19/29] drm/xe: Add GPUSVM devic memory copy vfunc functions
2024-12-02 10:13 ` Thomas Hellström
@ 2024-12-12 3:59 ` Matthew Brost
0 siblings, 0 replies; 129+ messages in thread
From: Matthew Brost @ 2024-12-12 3:59 UTC (permalink / raw)
To: Thomas Hellström
Cc: intel-xe, dri-devel, apopple, airlied, christian.koenig,
simona.vetter, felix.kuehling, dakr
On Mon, Dec 02, 2024 at 11:13:55AM +0100, Thomas Hellström wrote:
> On Tue, 2024-10-15 at 20:25 -0700, Matthew Brost wrote:
> > Add GPUSVM devic memory copy vfunc functions and connect to migration
>
> s/devic/device
Yes.
> >
>
> > layer.
> >
> > v2:
> > - Allow NULL device pages in xe_svm_copy
> > - Use new drm_gpusvm_devmem_ops
> >
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> > drivers/gpu/drm/xe/xe_svm.c | 150
> > ++++++++++++++++++++++++++++++++++++
> > 1 file changed, 150 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_svm.c
> > b/drivers/gpu/drm/xe/xe_svm.c
> > index 22e6341117bd..b33fd42d035b 100644
> > --- a/drivers/gpu/drm/xe/xe_svm.c
> > +++ b/drivers/gpu/drm/xe/xe_svm.c
> > @@ -6,6 +6,7 @@
> > #include "drm_gpusvm.h"
> >
> > #include "xe_gt_tlb_invalidation.h"
> > +#include "xe_migrate.h"
> > #include "xe_pt.h"
> > #include "xe_svm.h"
> > #include "xe_vm.h"
> > @@ -269,6 +270,155 @@ static void
> > xe_svm_garbage_collector_work_func(struct work_struct *w)
> > up_write(&vm->lock);
> > }
> >
> > +static struct xe_mem_region *page_to_mr(struct page *page)
> > +{
> > + return container_of(page->pgmap, struct xe_mem_region,
> > pagemap);
> > +}
> > +
> > +static struct xe_tile *mr_to_tile(struct xe_mem_region *mr)
> > +{
> > + return container_of(mr, struct xe_tile, mem.vram);
> > +}
> > +
> > +static u64 xe_mem_region_page_to_dpa(struct xe_mem_region *mr,
> > + struct page *page)
> > +{
> > + u64 dpa;
> > + struct xe_tile *tile = mr_to_tile(mr);
> > + u64 pfn = page_to_pfn(page);
> > + u64 offset;
> > +
> > + xe_tile_assert(tile, is_device_private_page(page));
> > + xe_tile_assert(tile, (pfn << PAGE_SHIFT) >= mr->hpa_base);
> > +
> > + offset = (pfn << PAGE_SHIFT) - mr->hpa_base;
> > + dpa = mr->dpa_base + offset;
> > +
> > + return dpa;
> > +}
> > +
> > +enum xe_svm_copy_dir {
> > + XE_SVM_COPY_TO_VRAM,
> > + XE_SVM_COPY_TO_SRAM,
> > +};
> > +
> > +static int xe_svm_copy(struct page **pages, dma_addr_t *dma_addr,
> > + unsigned long npages, const enum
> > xe_svm_copy_dir dir)
> > +{
> > + struct xe_mem_region *mr = NULL;
> > + struct xe_tile *tile;
> > + struct dma_fence *fence = NULL;
> > + unsigned long i;
> > +#define VRAM_ADDR_INVALID ~0x0ull
> > + u64 vram_addr = VRAM_ADDR_INVALID;
> > + int err = 0, pos = 0;
> > + bool sram = dir == XE_SVM_COPY_TO_SRAM;
> > +
> > + for (i = 0; i < npages; ++i) {
> > + struct page *spage = pages[i];
> > + struct dma_fence *__fence;
> > + u64 __vram_addr;
> > + bool match = false, chunk, last;
> > +
> > + chunk = (i - pos) == (SZ_2M / PAGE_SIZE);
> > + last = (i + 1) == npages;
> > +
> > + if (!dma_addr[i] && vram_addr == VRAM_ADDR_INVALID)
> > + continue;
> > +
> > + if (!mr && spage) {
> > + mr = page_to_mr(spage);
> > + tile = mr_to_tile(mr);
> > + }
> > +
> > + if (dma_addr[i] && spage) {
> > + __vram_addr = xe_mem_region_page_to_dpa(mr,
> > spage);
> > + if (vram_addr == VRAM_ADDR_INVALID) {
> > + vram_addr = __vram_addr;
> > + pos = i;
> > + }
> > +
> > + match = vram_addr + PAGE_SIZE * (i - pos) ==
> > __vram_addr;
> > + }
> > +
> > + if (!match || chunk || last) {
> > + int incr = (match && last) ? 1 : 0;
> > +
> > + if (vram_addr != VRAM_ADDR_INVALID) {
> > + if (sram)
> > + __fence =
> > xe_migrate_from_vram(tile->migrate,
> > +
> > i - pos + incr,
> > +
> > vram_addr,
> > +
> > dma_addr + pos);
> > + else
> > + __fence =
> > xe_migrate_to_vram(tile->migrate,
> > +
> > i - pos + incr,
> > +
> > dma_addr + pos,
> > +
> > vram_addr);
> > + if (IS_ERR(__fence)) {
> > + err = PTR_ERR(__fence);
> > + goto err_out;
> > + }
> > +
> > + dma_fence_put(fence);
> > + fence = __fence;
> > + }
> > +
> > + if (dma_addr[i] && spage) {
> > + vram_addr = __vram_addr;
> > + pos = i;
> > + } else {
> > + vram_addr = VRAM_ADDR_INVALID;
> > + }
> > +
> > + if (!match && last && dma_addr[i] && spage)
> > {
> > + if (sram)
> > + __fence =
> > xe_migrate_from_vram(tile->migrate, 1,
> > +
> > vram_addr,
> > +
> > dma_addr + pos);
> > + else
> > + __fence =
> > xe_migrate_to_vram(tile->migrate, 1,
> > +
> > dma_addr + pos,
> > +
> > vram_addr);
> > + if (IS_ERR(__fence)) {
> > + err = PTR_ERR(__fence);
> > + goto err_out;
> > + }
> > +
> > + dma_fence_put(fence);
> > + fence = __fence;
> > + }
>
> I think the flow in this function is a bit hard to follow. Could it
> perhaps be simplified? If not, perhaps add a comment to the function
> describing what it expects from the input arguments and the possible
> corner cases that complicate it?
>
Maybe? It may need to be updated to do clears too in the next rev, so let
me play around with this. At minimum I can add comments + kernel doc.
Matt
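
For illustration, a minimal, self-contained sketch of the coalescing pattern
under discussion: walk an index-aligned address array, extend a run while the
entries stay physically contiguous, and flush the run when contiguity breaks,
when the 2M chunk cap is hit, or when the array ends. The sample addresses,
helper names, and the printf stand-in for the migrate call are made up; this
is not the driver code.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SZ   4096ull
#define CHUNK_CAP (2ull * 1024 * 1024)		/* 2M, as in the patch */
#define ADDR_INVALID (~0ull)

/* Stand-in for xe_migrate_to_vram()/xe_migrate_from_vram() */
static void flush_run(uint64_t start, unsigned long pages)
{
	printf("copy %lu page(s) starting at 0x%llx\n", pages,
	       (unsigned long long)start);
}

int main(void)
{
	/* 0 means "no page at this index", like a NULL page / 0 dma_addr */
	uint64_t addr[] = { 0x10000, 0x11000, 0x12000, 0, 0x40000, 0x41000 };
	unsigned long npages = sizeof(addr) / sizeof(addr[0]);
	uint64_t run_start = ADDR_INVALID;
	unsigned long i, pos = 0;

	for (i = 0; i < npages; ++i) {
		bool chunk = (i - pos) == (CHUNK_CAP / PAGE_SZ);
		bool last = (i + 1) == npages;
		bool match = false;

		if (addr[i]) {
			if (run_start == ADDR_INVALID) {
				run_start = addr[i];
				pos = i;
			}
			/* still contiguous with the run started at pos? */
			match = run_start + PAGE_SZ * (i - pos) == addr[i];
		}

		if (!match || chunk || last) {
			if (run_start != ADDR_INVALID)
				flush_run(run_start,
					  i - pos + ((match && last) ? 1 : 0));
			/* start a new run at the page that ended the old one */
			run_start = addr[i] ? addr[i] : ADDR_INVALID;
			pos = i;
			/* a non-matching page that is also the last one still
			 * needs its own single-page copy */
			if (!match && last && addr[i])
				flush_run(run_start, 1);
		}
	}
	return 0;
}
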
>
> > + }
> > + }
> > +
> > +err_out:
> > + if (fence) {
> > + dma_fence_wait(fence, false);
> > + dma_fence_put(fence);
> > + }
> > +
> > + return err;
> > +#undef VRAM_ADDR_INVALID
> > +}
> > +
> > +static int xe_svm_copy_to_devmem(struct page **pages, dma_addr_t
> > *dma_addr,
> > + unsigned long npages)
> > +{
> > + return xe_svm_copy(pages, dma_addr, npages,
> > XE_SVM_COPY_TO_VRAM);
> > +}
> > +
> > +static int xe_svm_copy_to_ram(struct page **pages, dma_addr_t
> > *dma_addr,
> > + unsigned long npages)
> > +{
> > + return xe_svm_copy(pages, dma_addr, npages,
> > XE_SVM_COPY_TO_SRAM);
> > +}
> > +
> > +__maybe_unused
>
> Is this __maybe_unused to be removed in a follow-up patch? If so could
> you add a comment stating that?
>
> > +static const struct drm_gpusvm_devmem_ops gpusvm_devmem_ops = {
> > + .copy_to_devmem = xe_svm_copy_to_devmem,
> > + .copy_to_ram = xe_svm_copy_to_ram,
> > +};
> > +
> > static const struct drm_gpusvm_ops gpusvm_ops = {
> > .range_alloc = xe_svm_range_alloc,
> > .range_free = xe_svm_range_free,
>
> Thanks,
> Thomas
>
>
^ permalink raw reply [flat|nested] 129+ messages in thread
* [PATCH v2 20/29] drm/xe: Add drm_pagemap ops to SVM
2024-10-16 3:24 [PATCH v2 00/29] Introduce GPU SVM and Xe SVM implementation Matthew Brost
` (18 preceding siblings ...)
2024-10-16 3:25 ` [PATCH v2 19/29] drm/xe: Add GPUSVM devic memory copy vfunc functions Matthew Brost
@ 2024-10-16 3:25 ` Matthew Brost
2024-10-16 3:25 ` [PATCH v2 21/29] drm/xe: Add Xe SVM populate_devmem_pfn vfunc Matthew Brost
` (11 subsequent siblings)
31 siblings, 0 replies; 129+ messages in thread
From: Matthew Brost @ 2024-10-16 3:25 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
From: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Add support for mapping device pages to Xe SVM.
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
drivers/gpu/drm/xe/xe_device_types.h | 7 +++++++
drivers/gpu/drm/xe/xe_svm.c | 30 ++++++++++++++++++++++++++++
2 files changed, 37 insertions(+)
diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index 2ac5de7751c9..72264f9f64d7 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -12,6 +12,8 @@
#include <drm/drm_file.h>
#include <drm/ttm/ttm_device.h>
+#include "drm_pagemap.h"
+
#include "xe_devcoredump_types.h"
#include "xe_heci_gsc.h"
#include "xe_lmtt_types.h"
@@ -106,6 +108,11 @@ struct xe_mem_region {
void __iomem *mapping;
/** @pagemap: Used to remap device memory as ZONE_DEVICE */
struct dev_pagemap pagemap;
+ /**
+ * @dpagemap: The struct drm_pagemap of the ZONE_DEVICE memory
+ * pages of this tile.
+ */
+ struct drm_pagemap dpagemap;
/**
* @hpa_base: base host physical address
*
diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index b33fd42d035b..4f01941b2cc2 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -431,6 +431,32 @@ static const u64 fault_chunk_sizes[] = {
SZ_4K,
};
+static struct drm_pagemap_dma_addr
+xe_drm_pagemap_map_dma(struct drm_pagemap *dpagemap,
+ struct device *dev,
+ struct page *page,
+ unsigned int order,
+ enum dma_data_direction dir)
+{
+ struct device *pgmap_dev = dpagemap->dev;
+ dma_addr_t addr;
+ enum drm_interconnect_protocol prot;
+
+ if (pgmap_dev == dev) {
+ addr = xe_mem_region_page_to_dpa(page_to_mr(page), page);
+ prot = XE_INTERCONNECT_VRAM;
+ } else {
+ addr = DMA_MAPPING_ERROR;
+ prot = 0;
+ }
+
+ return drm_pagemap_dma_addr_encode(addr, prot, order, dir);
+}
+
+static const struct drm_pagemap_ops xe_drm_pagemap_ops = {
+ .map_dma = xe_drm_pagemap_map_dma,
+};
+
int xe_svm_init(struct xe_vm *vm)
{
spin_lock_init(&vm->svm.garbage_collector.lock);
@@ -572,6 +598,10 @@ int xe_devm_add(struct xe_tile *tile, struct xe_mem_region *mr)
mr->pagemap.ops = drm_gpusvm_pagemap_ops_get();
mr->pagemap.owner = xe_svm_devm_owner(xe);
addr = devm_memremap_pages(dev, &mr->pagemap);
+
+ mr->dpagemap.dev = dev;
+ mr->dpagemap.ops = &xe_drm_pagemap_ops;
+
if (IS_ERR(addr)) {
devm_release_mem_region(dev, res->start, resource_size(res));
ret = PTR_ERR(addr);
--
2.34.1
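
For a concrete feel of the address translation the map_dma callback relies on
(the same arithmetic as xe_mem_region_page_to_dpa() in the previous patch),
here is a tiny standalone sketch with made-up base addresses; only the pfn to
device-physical-address math mirrors the code above.

#include <stdio.h>

#define PAGE_SHIFT 12

int main(void)
{
	/* hypothetical example values, not a real hardware layout */
	unsigned long long hpa_base = 0x800000000ull; /* CPU-visible VRAM base */
	unsigned long long dpa_base = 0x0ull;         /* device-visible VRAM base */
	unsigned long long pfn = 0x800200000ull >> PAGE_SHIFT; /* page 2M into VRAM */

	/* offset of the page within the region, rebased to the device view */
	unsigned long long offset = (pfn << PAGE_SHIFT) - hpa_base;
	unsigned long long dpa = dpa_base + offset;

	printf("dpa = 0x%llx\n", dpa); /* prints dpa = 0x200000 */
	return 0;
}
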
^ permalink raw reply related [flat|nested] 129+ messages in thread* [PATCH v2 21/29] drm/xe: Add Xe SVM populate_devmem_pfn vfunc
2024-10-16 3:24 [PATCH v2 00/29] Introduce GPU SVM and Xe SVM implementation Matthew Brost
` (19 preceding siblings ...)
2024-10-16 3:25 ` [PATCH v2 20/29] drm/xe: Add drm_pagemap ops to SVM Matthew Brost
@ 2024-10-16 3:25 ` Matthew Brost
2024-12-02 10:19 ` Thomas Hellström
2024-10-16 3:25 ` [PATCH v2 22/29] drm/xe: Add Xe SVM devmem_release vfunc Matthew Brost
` (10 subsequent siblings)
31 siblings, 1 reply; 129+ messages in thread
From: Matthew Brost @ 2024-10-16 3:25 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
Get VRAM pfns from BO's buddy blocks.
v2:
- Use new drm_gpusvm_devmem_ops
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
drivers/gpu/drm/xe/xe_svm.c | 40 +++++++++++++++++++++++++++++++++++++
1 file changed, 40 insertions(+)
diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index 4f01941b2cc2..19fcb8f71791 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -9,6 +9,7 @@
#include "xe_migrate.h"
#include "xe_pt.h"
#include "xe_svm.h"
+#include "xe_ttm_vram_mgr.h"
#include "xe_vm.h"
#include "xe_vm_types.h"
@@ -413,8 +414,47 @@ static int xe_svm_copy_to_ram(struct page **pages, dma_addr_t *dma_addr,
return xe_svm_copy(pages, dma_addr, npages, XE_SVM_COPY_TO_SRAM);
}
+static struct xe_bo *to_xe_bo(struct drm_gpusvm_devmem *devmem_allocation)
+{
+ return container_of(devmem_allocation, struct xe_bo, devmem_allocation);
+}
+
+static u64 block_offset_to_pfn(struct xe_mem_region *mr, u64 offset)
+{
+ return PHYS_PFN(offset + mr->hpa_base);
+}
+
+static struct drm_buddy *tile_to_buddy(struct xe_tile *tile)
+{
+ return &tile->mem.vram_mgr->mm;
+}
+
+static int xe_svm_populate_devmem_pfn(struct drm_gpusvm_devmem *devmem_allocation,
+ unsigned long npages, unsigned long *pfn)
+{
+ struct xe_bo *bo = to_xe_bo(devmem_allocation);
+ struct ttm_resource *res = bo->ttm.resource;
+ struct list_head *blocks = &to_xe_ttm_vram_mgr_resource(res)->blocks;
+ struct drm_buddy_block *block;
+ int j =0;
+
+ list_for_each_entry(block, blocks, link) {
+ struct xe_mem_region *mr = block->private;
+ struct xe_tile *tile = mr_to_tile(mr);
+ struct drm_buddy *buddy = tile_to_buddy(tile);
+ u64 block_pfn = block_offset_to_pfn(mr, drm_buddy_block_offset(block));
+ int i;
+
+ for(i = 0; i < drm_buddy_block_size(buddy, block) >> PAGE_SHIFT; ++i)
+ pfn[j++] = block_pfn + i;
+ }
+
+ return 0;
+}
+
__maybe_unused
static const struct drm_gpusvm_devmem_ops gpusvm_devmem_ops = {
+ .populate_devmem_pfn = xe_svm_populate_devmem_pfn,
.copy_to_devmem = xe_svm_copy_to_devmem,
.copy_to_ram = xe_svm_copy_to_ram,
};
--
2.34.1
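
To illustrate what populate_devmem_pfn() produces, a small standalone sketch
expanding one buddy block into per-page pfns; the base address and block
geometry are made-up numbers rather than a real layout.

#include <stdio.h>

#define PAGE_SHIFT 12

int main(void)
{
	/* hypothetical values: one 64K buddy block, 2M into a VRAM region
	 * whose CPU-visible base is hpa_base */
	unsigned long long hpa_base = 0x800000000ull;
	unsigned long long block_offset = 0x200000;
	unsigned long long block_size = 0x10000;
	unsigned long block_pfn = (hpa_base + block_offset) >> PAGE_SHIFT;
	unsigned long pfn[16];
	unsigned long i, npages = block_size >> PAGE_SHIFT;

	for (i = 0; i < npages; ++i)
		pfn[i] = block_pfn + i;	/* consecutive pfns within the block */

	printf("%lu pfns, first 0x%lx, last 0x%lx\n", npages, pfn[0],
	       pfn[npages - 1]);
	return 0;
}
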
^ permalink raw reply related [flat|nested] 129+ messages in thread* Re: [PATCH v2 21/29] drm/xe: Add Xe SVM populate_devmem_pfn vfunc
2024-10-16 3:25 ` [PATCH v2 21/29] drm/xe: Add Xe SVM populate_devmem_pfn vfunc Matthew Brost
@ 2024-12-02 10:19 ` Thomas Hellström
0 siblings, 0 replies; 129+ messages in thread
From: Thomas Hellström @ 2024-12-02 10:19 UTC (permalink / raw)
To: Matthew Brost, intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, simona.vetter, felix.kuehling,
dakr
On Tue, 2024-10-15 at 20:25 -0700, Matthew Brost wrote:
> Get VRAM pfns from BO's buddy blocks.
>
> v2:
> - Use new drm_gpusvm_devmem_ops
>
> Signed-off-by: Niranjana Vishwanathapura
> <niranjana.vishwanathapura@intel.com>
> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> ---
> drivers/gpu/drm/xe/xe_svm.c | 40
> +++++++++++++++++++++++++++++++++++++
> 1 file changed, 40 insertions(+)
>
> diff --git a/drivers/gpu/drm/xe/xe_svm.c
> b/drivers/gpu/drm/xe/xe_svm.c
> index 4f01941b2cc2..19fcb8f71791 100644
> --- a/drivers/gpu/drm/xe/xe_svm.c
> +++ b/drivers/gpu/drm/xe/xe_svm.c
> @@ -9,6 +9,7 @@
> #include "xe_migrate.h"
> #include "xe_pt.h"
> #include "xe_svm.h"
> +#include "xe_ttm_vram_mgr.h"
> #include "xe_vm.h"
> #include "xe_vm_types.h"
>
> @@ -413,8 +414,47 @@ static int xe_svm_copy_to_ram(struct page
> **pages, dma_addr_t *dma_addr,
> return xe_svm_copy(pages, dma_addr, npages,
> XE_SVM_COPY_TO_SRAM);
> }
>
> +static struct xe_bo *to_xe_bo(struct drm_gpusvm_devmem
> *devmem_allocation)
> +{
> + return container_of(devmem_allocation, struct xe_bo,
> devmem_allocation);
> +}
> +
> +static u64 block_offset_to_pfn(struct xe_mem_region *mr, u64 offset)
> +{
> + return PHYS_PFN(offset + mr->hpa_base);
> +}
> +
> +static struct drm_buddy *tile_to_buddy(struct xe_tile *tile)
> +{
> + return &tile->mem.vram_mgr->mm;
> +}
> +
> +static int xe_svm_populate_devmem_pfn(struct drm_gpusvm_devmem
> *devmem_allocation,
> + unsigned long npages, unsigned
> long *pfn)
> +{
> + struct xe_bo *bo = to_xe_bo(devmem_allocation);
> + struct ttm_resource *res = bo->ttm.resource;
> + struct list_head *blocks =
> &to_xe_ttm_vram_mgr_resource(res)->blocks;
> + struct drm_buddy_block *block;
> + int j =0;
> +
> + list_for_each_entry(block, blocks, link) {
> + struct xe_mem_region *mr = block->private;
> + struct xe_tile *tile = mr_to_tile(mr);
> + struct drm_buddy *buddy = tile_to_buddy(tile);
> + u64 block_pfn = block_offset_to_pfn(mr,
> drm_buddy_block_offset(block));
> + int i;
> +
> + for(i = 0; i < drm_buddy_block_size(buddy, block) >>
> PAGE_SHIFT; ++i)
> + pfn[j++] = block_pfn + i;
> + }
> +
> + return 0;
> +}
> +
> __maybe_unused
> static const struct drm_gpusvm_devmem_ops gpusvm_devmem_ops = {
> + .populate_devmem_pfn = xe_svm_populate_devmem_pfn,
> .copy_to_devmem = xe_svm_copy_to_devmem,
> .copy_to_ram = xe_svm_copy_to_ram,
> };
^ permalink raw reply [flat|nested] 129+ messages in thread
* [PATCH v2 22/29] drm/xe: Add Xe SVM devmem_release vfunc
2024-10-16 3:24 [PATCH v2 00/29] Introduce GPU SVM and Xe SVM implementation Matthew Brost
` (20 preceding siblings ...)
2024-10-16 3:25 ` [PATCH v2 21/29] drm/xe: Add Xe SVM populate_devmem_pfn vfunc Matthew Brost
@ 2024-10-16 3:25 ` Matthew Brost
2024-12-02 10:21 ` Thomas Hellström
2024-10-16 3:25 ` [PATCH v2 23/29] drm/xe: Add BO flags required for SVM Matthew Brost
` (9 subsequent siblings)
31 siblings, 1 reply; 129+ messages in thread
From: Matthew Brost @ 2024-10-16 3:25 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
Implement with a simple BO put.
v2:
- Use new drm_gpusvm_devmem_ops
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
drivers/gpu/drm/xe/xe_svm.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index 19fcb8f71791..976b4ce15db4 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -5,6 +5,7 @@
#include "drm_gpusvm.h"
+#include "xe_bo.h"
#include "xe_gt_tlb_invalidation.h"
#include "xe_migrate.h"
#include "xe_pt.h"
@@ -419,6 +420,11 @@ static struct xe_bo *to_xe_bo(struct drm_gpusvm_devmem *devmem_allocation)
return container_of(devmem_allocation, struct xe_bo, devmem_allocation);
}
+static void xe_svm_devmem_release(struct drm_gpusvm_devmem *devmem_allocation)
+{
+ xe_bo_put(to_xe_bo(devmem_allocation));
+}
+
static u64 block_offset_to_pfn(struct xe_mem_region *mr, u64 offset)
{
return PHYS_PFN(offset + mr->hpa_base);
@@ -454,6 +460,7 @@ static int xe_svm_populate_devmem_pfn(struct drm_gpusvm_devmem *devmem_allocatio
__maybe_unused
static const struct drm_gpusvm_devmem_ops gpusvm_devmem_ops = {
+ .devmem_release = xe_svm_devmem_release,
.populate_devmem_pfn = xe_svm_populate_devmem_pfn,
.copy_to_devmem = xe_svm_copy_to_devmem,
.copy_to_ram = xe_svm_copy_to_ram,
--
2.34.1
^ permalink raw reply related [flat|nested] 129+ messages in thread* Re: [PATCH v2 22/29] drm/xe: Add Xe SVM devmem_release vfunc
2024-10-16 3:25 ` [PATCH v2 22/29] drm/xe: Add Xe SVM devmem_release vfunc Matthew Brost
@ 2024-12-02 10:21 ` Thomas Hellström
0 siblings, 0 replies; 129+ messages in thread
From: Thomas Hellström @ 2024-12-02 10:21 UTC (permalink / raw)
To: Matthew Brost, intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, simona.vetter, felix.kuehling,
dakr
On Tue, 2024-10-15 at 20:25 -0700, Matthew Brost wrote:
> Implement with a simple BO put.
>
> v2:
> - Use new drm_gpusvm_devmem_ops
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> ---
> drivers/gpu/drm/xe/xe_svm.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/drivers/gpu/drm/xe/xe_svm.c
> b/drivers/gpu/drm/xe/xe_svm.c
> index 19fcb8f71791..976b4ce15db4 100644
> --- a/drivers/gpu/drm/xe/xe_svm.c
> +++ b/drivers/gpu/drm/xe/xe_svm.c
> @@ -5,6 +5,7 @@
>
> #include "drm_gpusvm.h"
>
> +#include "xe_bo.h"
> #include "xe_gt_tlb_invalidation.h"
> #include "xe_migrate.h"
> #include "xe_pt.h"
> @@ -419,6 +420,11 @@ static struct xe_bo *to_xe_bo(struct
> drm_gpusvm_devmem *devmem_allocation)
> return container_of(devmem_allocation, struct xe_bo,
> devmem_allocation);
> }
>
> +static void xe_svm_devmem_release(struct drm_gpusvm_devmem
> *devmem_allocation)
> +{
> + xe_bo_put(to_xe_bo(devmem_allocation));
> +}
> +
> static u64 block_offset_to_pfn(struct xe_mem_region *mr, u64 offset)
> {
> return PHYS_PFN(offset + mr->hpa_base);
> @@ -454,6 +460,7 @@ static int xe_svm_populate_devmem_pfn(struct
> drm_gpusvm_devmem *devmem_allocatio
>
> __maybe_unused
> static const struct drm_gpusvm_devmem_ops gpusvm_devmem_ops = {
> + .devmem_release = xe_svm_devmem_release,
> .populate_devmem_pfn = xe_svm_populate_devmem_pfn,
> .copy_to_devmem = xe_svm_copy_to_devmem,
> .copy_to_ram = xe_svm_copy_to_ram,
^ permalink raw reply [flat|nested] 129+ messages in thread
* [PATCH v2 23/29] drm/xe: Add BO flags required for SVM
2024-10-16 3:24 [PATCH v2 00/29] Introduce GPU SVM and Xe SVM implementation Matthew Brost
` (21 preceding siblings ...)
2024-10-16 3:25 ` [PATCH v2 22/29] drm/xe: Add Xe SVM devmem_release vfunc Matthew Brost
@ 2024-10-16 3:25 ` Matthew Brost
2024-12-02 10:44 ` Thomas Hellström
2024-10-16 3:25 ` [PATCH v2 24/29] drm/xe: Add SVM VRAM migration Matthew Brost
` (8 subsequent siblings)
31 siblings, 1 reply; 129+ messages in thread
From: Matthew Brost @ 2024-10-16 3:25 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
Add XE_BO_FLAG_SYSTEM_ALLOC to indicate BO is tied to SVM range.
Add XE_BO_FLAG_SKIP_CLEAR to indicate BO does not need to be cleared.
v2:
- Take VM ref for system allocator BOs
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
drivers/gpu/drm/xe/xe_bo.c | 15 +++++++++------
drivers/gpu/drm/xe/xe_bo.h | 2 ++
2 files changed, 11 insertions(+), 6 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
index a02d63e322ae..dbd03383878e 100644
--- a/drivers/gpu/drm/xe/xe_bo.c
+++ b/drivers/gpu/drm/xe/xe_bo.c
@@ -685,8 +685,9 @@ static int xe_bo_move(struct ttm_buffer_object *ttm_bo, bool evict,
move_lacks_source = !old_mem || (handle_system_ccs ? (!bo->ccs_cleared) :
(!mem_type_is_vram(old_mem_type) && !tt_has_data));
- needs_clear = (ttm && ttm->page_flags & TTM_TT_FLAG_ZERO_ALLOC) ||
- (!ttm && ttm_bo->type == ttm_bo_type_device);
+ needs_clear = !(bo->flags & XE_BO_FLAG_SKIP_CLEAR) &&
+ ((ttm && ttm->page_flags & TTM_TT_FLAG_ZERO_ALLOC) ||
+ (!ttm && ttm_bo->type == ttm_bo_type_device));
if (new_mem->mem_type == XE_PL_TT) {
ret = xe_tt_map_sg(ttm);
@@ -1145,7 +1146,7 @@ static void xe_ttm_bo_destroy(struct ttm_buffer_object *ttm_bo)
xe_drm_client_remove_bo(bo);
#endif
- if (bo->vm && xe_bo_is_user(bo))
+ if (bo->vm && (xe_bo_is_user(bo) || bo->flags & XE_BO_FLAG_SYSTEM_ALLOC))
xe_vm_put(bo->vm);
mutex_lock(&xe->mem_access.vram_userfault.lock);
@@ -1301,7 +1302,8 @@ struct xe_bo *___xe_bo_create_locked(struct xe_device *xe, struct xe_bo *bo,
int err;
/* Only kernel objects should set GT */
- xe_assert(xe, !tile || type == ttm_bo_type_kernel);
+ xe_assert(xe, !tile || type == ttm_bo_type_kernel ||
+ flags & XE_BO_FLAG_SYSTEM_ALLOC);
if (XE_WARN_ON(!size)) {
xe_bo_free(bo);
@@ -1493,7 +1495,7 @@ __xe_bo_create_locked(struct xe_device *xe,
* by having all the vm's bo refereferences released at vm close
* time.
*/
- if (vm && xe_bo_is_user(bo))
+ if (vm && (xe_bo_is_user(bo) || bo->flags & XE_BO_FLAG_SYSTEM_ALLOC))
xe_vm_get(vm);
bo->vm = vm;
@@ -2333,7 +2335,8 @@ bool xe_bo_needs_ccs_pages(struct xe_bo *bo)
* can't be used since there's no CCS storage associated with
* non-VRAM addresses.
*/
- if (IS_DGFX(xe) && (bo->flags & XE_BO_FLAG_SYSTEM))
+ if (IS_DGFX(xe) && ((bo->flags & XE_BO_FLAG_SYSTEM) ||
+ (bo->flags & XE_BO_FLAG_SYSTEM_ALLOC)))
return false;
return true;
diff --git a/drivers/gpu/drm/xe/xe_bo.h b/drivers/gpu/drm/xe/xe_bo.h
index 7fa44a0138b0..caf0459d16ad 100644
--- a/drivers/gpu/drm/xe/xe_bo.h
+++ b/drivers/gpu/drm/xe/xe_bo.h
@@ -39,6 +39,8 @@
#define XE_BO_FLAG_NEEDS_64K BIT(15)
#define XE_BO_FLAG_NEEDS_2M BIT(16)
#define XE_BO_FLAG_GGTT_INVALIDATE BIT(17)
+#define XE_BO_FLAG_SYSTEM_ALLOC BIT(18)
+#define XE_BO_FLAG_SKIP_CLEAR BIT(19)
/* this one is trigger internally only */
#define XE_BO_FLAG_INTERNAL_TEST BIT(30)
#define XE_BO_FLAG_INTERNAL_64K BIT(31)
--
2.34.1
^ permalink raw reply related [flat|nested] 129+ messages in thread* Re: [PATCH v2 23/29] drm/xe: Add BO flags required for SVM
2024-10-16 3:25 ` [PATCH v2 23/29] drm/xe: Add BO flags required for SVM Matthew Brost
@ 2024-12-02 10:44 ` Thomas Hellström
2024-12-11 21:42 ` Matthew Brost
0 siblings, 1 reply; 129+ messages in thread
From: Thomas Hellström @ 2024-12-02 10:44 UTC (permalink / raw)
To: Matthew Brost, intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, simona.vetter, felix.kuehling,
dakr
On Tue, 2024-10-15 at 20:25 -0700, Matthew Brost wrote:
> Add XE_BO_FLAG_SYSTEM_ALLOC to indicate BO is tied to SVM range.
>
> Add XE_BO_FLAG_SKIP_CLEAR to indicate BO does not need to be cleared.
>
> v2:
> - Take VM ref for system allocator BOs
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
> drivers/gpu/drm/xe/xe_bo.c | 15 +++++++++------
> drivers/gpu/drm/xe/xe_bo.h | 2 ++
> 2 files changed, 11 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
> index a02d63e322ae..dbd03383878e 100644
> --- a/drivers/gpu/drm/xe/xe_bo.c
> +++ b/drivers/gpu/drm/xe/xe_bo.c
> @@ -685,8 +685,9 @@ static int xe_bo_move(struct ttm_buffer_object
> *ttm_bo, bool evict,
> move_lacks_source = !old_mem || (handle_system_ccs ? (!bo-
> >ccs_cleared) :
>
> (!mem_type_is_vram(old_mem_type) && !tt_has_data));
>
> - needs_clear = (ttm && ttm->page_flags &
> TTM_TT_FLAG_ZERO_ALLOC) ||
> - (!ttm && ttm_bo->type == ttm_bo_type_device);
> + needs_clear = !(bo->flags & XE_BO_FLAG_SKIP_CLEAR) &&
> + ((ttm && ttm->page_flags & TTM_TT_FLAG_ZERO_ALLOC)
> ||
> + (!ttm && ttm_bo->type == ttm_bo_type_device));
It should be worth adding a note about how clearing for svm bos is
intended to work. From what I can tell, there is an option to clear on
migration from system to vram if no system pages are present?
>
> if (new_mem->mem_type == XE_PL_TT) {
> ret = xe_tt_map_sg(ttm);
> @@ -1145,7 +1146,7 @@ static void xe_ttm_bo_destroy(struct
> ttm_buffer_object *ttm_bo)
> xe_drm_client_remove_bo(bo);
> #endif
>
> - if (bo->vm && xe_bo_is_user(bo))
> + if (bo->vm && (xe_bo_is_user(bo) || bo->flags &
> XE_BO_FLAG_SYSTEM_ALLOC))
> xe_vm_put(bo->vm);
>
> mutex_lock(&xe->mem_access.vram_userfault.lock);
> @@ -1301,7 +1302,8 @@ struct xe_bo *___xe_bo_create_locked(struct
> xe_device *xe, struct xe_bo *bo,
> int err;
>
> /* Only kernel objects should set GT */
> - xe_assert(xe, !tile || type == ttm_bo_type_kernel);
> + xe_assert(xe, !tile || type == ttm_bo_type_kernel ||
> + flags & XE_BO_FLAG_SYSTEM_ALLOC);
>
> if (XE_WARN_ON(!size)) {
> xe_bo_free(bo);
> @@ -1493,7 +1495,7 @@ __xe_bo_create_locked(struct xe_device *xe,
> * by having all the vm's bo refereferences released at vm
> close
> * time.
> */
> - if (vm && xe_bo_is_user(bo))
> + if (vm && (xe_bo_is_user(bo) || bo->flags &
> XE_BO_FLAG_SYSTEM_ALLOC))
> xe_vm_get(vm);
> bo->vm = vm;
>
> @@ -2333,7 +2335,8 @@ bool xe_bo_needs_ccs_pages(struct xe_bo *bo)
> * can't be used since there's no CCS storage associated
> with
> * non-VRAM addresses.
> */
> - if (IS_DGFX(xe) && (bo->flags & XE_BO_FLAG_SYSTEM))
> + if (IS_DGFX(xe) && ((bo->flags & XE_BO_FLAG_SYSTEM) ||
> + (bo->flags & XE_BO_FLAG_SYSTEM_ALLOC)))
> return false;
Can we support CCS with system allocator? Perhaps add a TODO comment if
so. I figure it should be possible if we resolve on migration to
system, which we do on BMG.
Thanks,
Thomas
>
> return true;
> diff --git a/drivers/gpu/drm/xe/xe_bo.h b/drivers/gpu/drm/xe/xe_bo.h
> index 7fa44a0138b0..caf0459d16ad 100644
> --- a/drivers/gpu/drm/xe/xe_bo.h
> +++ b/drivers/gpu/drm/xe/xe_bo.h
> @@ -39,6 +39,8 @@
> #define XE_BO_FLAG_NEEDS_64K BIT(15)
> #define XE_BO_FLAG_NEEDS_2M BIT(16)
> #define XE_BO_FLAG_GGTT_INVALIDATE BIT(17)
> +#define XE_BO_FLAG_SYSTEM_ALLOC BIT(18)
> +#define XE_BO_FLAG_SKIP_CLEAR BIT(19)
> /* this one is trigger internally only */
> #define XE_BO_FLAG_INTERNAL_TEST BIT(30)
> #define XE_BO_FLAG_INTERNAL_64K BIT(31)
^ permalink raw reply [flat|nested] 129+ messages in thread* Re: [PATCH v2 23/29] drm/xe: Add BO flags required for SVM
2024-12-02 10:44 ` Thomas Hellström
@ 2024-12-11 21:42 ` Matthew Brost
2024-12-16 10:44 ` Thomas Hellström
0 siblings, 1 reply; 129+ messages in thread
From: Matthew Brost @ 2024-12-11 21:42 UTC (permalink / raw)
To: Thomas Hellström
Cc: intel-xe, dri-devel, apopple, airlied, christian.koenig,
simona.vetter, felix.kuehling, dakr
On Mon, Dec 02, 2024 at 11:44:47AM +0100, Thomas Hellström wrote:
> On Tue, 2024-10-15 at 20:25 -0700, Matthew Brost wrote:
> > Add XE_BO_FLAG_SYSTEM_ALLOC to indicate BO is tied to SVM range.
> >
> > Add XE_BO_FLAG_SKIP_CLEAR to indicate BO does not need to be cleared.
> >
> > v2:
> > - Take VM ref for system allocator BOs
> >
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> > drivers/gpu/drm/xe/xe_bo.c | 15 +++++++++------
> > drivers/gpu/drm/xe/xe_bo.h | 2 ++
> > 2 files changed, 11 insertions(+), 6 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
> > index a02d63e322ae..dbd03383878e 100644
> > --- a/drivers/gpu/drm/xe/xe_bo.c
> > +++ b/drivers/gpu/drm/xe/xe_bo.c
> > @@ -685,8 +685,9 @@ static int xe_bo_move(struct ttm_buffer_object
> > *ttm_bo, bool evict,
> > move_lacks_source = !old_mem || (handle_system_ccs ? (!bo-
> > >ccs_cleared) :
> >
> > (!mem_type_is_vram(old_mem_type) && !tt_has_data));
> >
> > - needs_clear = (ttm && ttm->page_flags &
> > TTM_TT_FLAG_ZERO_ALLOC) ||
> > - (!ttm && ttm_bo->type == ttm_bo_type_device);
> > + needs_clear = !(bo->flags & XE_BO_FLAG_SKIP_CLEAR) &&
> > + ((ttm && ttm->page_flags & TTM_TT_FLAG_ZERO_ALLOC)
> > ||
> > + (!ttm && ttm_bo->type == ttm_bo_type_device));
>
> It should be worth adding a note about how clearing for svm bos is
> intended to work. From what I can tell, there is an option to clear on
> migration from system to vram if no system pages are present?
>
Sure can add a comment. The migration from system to vram doesn't do a
clear currently because when 'check_pages' is set we only migrate CPU
faulted in pages. If we remove that, then yes we'd need a clear on
migration.
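
As a rough sketch of the clear rule being discussed, here is the condition
from the hunk above lifted into a standalone predicate with hypothetical
parameter names; whether a system-to-VRAM migration must additionally clear
remains the open question in this thread.

#include <stdbool.h>

/*
 * Mirrors the xe_bo_move() condition from the patch: an SVM BO created with
 * XE_BO_FLAG_SKIP_CLEAR opts out; otherwise a BO whose TT pages were
 * zero-allocated, or a freshly created device BO with no TT at all, gets
 * cleared on the move.
 */
static bool bo_move_needs_clear(bool skip_clear_flag, bool have_tt,
				bool tt_zero_alloc, bool fresh_device_bo)
{
	return !skip_clear_flag &&
	       ((have_tt && tt_zero_alloc) ||
		(!have_tt && fresh_device_bo));
}

int main(void)
{
	/* an SVM BO: skip-clear set, no TT yet, device type -> no clear */
	return bo_move_needs_clear(true, false, false, true) ? 1 : 0;
}
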
> >
> > if (new_mem->mem_type == XE_PL_TT) {
> > ret = xe_tt_map_sg(ttm);
> > @@ -1145,7 +1146,7 @@ static void xe_ttm_bo_destroy(struct
> > ttm_buffer_object *ttm_bo)
> > xe_drm_client_remove_bo(bo);
> > #endif
> >
> > - if (bo->vm && xe_bo_is_user(bo))
> > + if (bo->vm && (xe_bo_is_user(bo) || bo->flags &
> > XE_BO_FLAG_SYSTEM_ALLOC))
> > xe_vm_put(bo->vm);
> >
> > mutex_lock(&xe->mem_access.vram_userfault.lock);
> > @@ -1301,7 +1302,8 @@ struct xe_bo *___xe_bo_create_locked(struct
> > xe_device *xe, struct xe_bo *bo,
> > int err;
> >
> > /* Only kernel objects should set GT */
> > - xe_assert(xe, !tile || type == ttm_bo_type_kernel);
> > + xe_assert(xe, !tile || type == ttm_bo_type_kernel ||
> > + flags & XE_BO_FLAG_SYSTEM_ALLOC);
> >
> > if (XE_WARN_ON(!size)) {
> > xe_bo_free(bo);
> > @@ -1493,7 +1495,7 @@ __xe_bo_create_locked(struct xe_device *xe,
> > * by having all the vm's bo refereferences released at vm
> > close
> > * time.
> > */
> > - if (vm && xe_bo_is_user(bo))
> > + if (vm && (xe_bo_is_user(bo) || bo->flags &
> > XE_BO_FLAG_SYSTEM_ALLOC))
> > xe_vm_get(vm);
> > bo->vm = vm;
> >
> > @@ -2333,7 +2335,8 @@ bool xe_bo_needs_ccs_pages(struct xe_bo *bo)
> > * can't be used since there's no CCS storage associated
> > with
> > * non-VRAM addresses.
> > */
> > - if (IS_DGFX(xe) && (bo->flags & XE_BO_FLAG_SYSTEM))
> > + if (IS_DGFX(xe) && ((bo->flags & XE_BO_FLAG_SYSTEM) ||
> > + (bo->flags & XE_BO_FLAG_SYSTEM_ALLOC)))
> > return false;
>
> Can we support CCS with system allocator? Perhaps add a TODO comment if
> so. I figure it should be possible if we resolve on migration to
> system, which we do on BMG.
>
Honestly don't really understand how CCS works, so unsure if possible.
Can add a TODO comment and we can circle back.
Matt
>
> Thanks,
> Thomas
>
>
> >
> > return true;
> > diff --git a/drivers/gpu/drm/xe/xe_bo.h b/drivers/gpu/drm/xe/xe_bo.h
> > index 7fa44a0138b0..caf0459d16ad 100644
> > --- a/drivers/gpu/drm/xe/xe_bo.h
> > +++ b/drivers/gpu/drm/xe/xe_bo.h
> > @@ -39,6 +39,8 @@
> > #define XE_BO_FLAG_NEEDS_64K BIT(15)
> > #define XE_BO_FLAG_NEEDS_2M BIT(16)
> > #define XE_BO_FLAG_GGTT_INVALIDATE BIT(17)
> > +#define XE_BO_FLAG_SYSTEM_ALLOC BIT(18)
> > +#define XE_BO_FLAG_SKIP_CLEAR BIT(19)
> > /* this one is trigger internally only */
> > #define XE_BO_FLAG_INTERNAL_TEST BIT(30)
> > #define XE_BO_FLAG_INTERNAL_64K BIT(31)
>
^ permalink raw reply [flat|nested] 129+ messages in thread* Re: [PATCH v2 23/29] drm/xe: Add BO flags required for SVM
2024-12-11 21:42 ` Matthew Brost
@ 2024-12-16 10:44 ` Thomas Hellström
0 siblings, 0 replies; 129+ messages in thread
From: Thomas Hellström @ 2024-12-16 10:44 UTC (permalink / raw)
To: Matthew Brost
Cc: intel-xe, dri-devel, apopple, airlied, christian.koenig,
simona.vetter, felix.kuehling, dakr
On Wed, 2024-12-11 at 13:42 -0800, Matthew Brost wrote:
> On Mon, Dec 02, 2024 at 11:44:47AM +0100, Thomas Hellström wrote:
> > On Tue, 2024-10-15 at 20:25 -0700, Matthew Brost wrote:
> > > Add XE_BO_FLAG_SYSTEM_ALLOC to indicate BO is tied to SVM range.
> > >
> > > Add XE_BO_FLAG_SKIP_CLEAR to indicate BO does not need to be
> > > cleared.
> > >
> > > v2:
> > > - Take VM ref for system allocator BOs
> > >
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > ---
> > > drivers/gpu/drm/xe/xe_bo.c | 15 +++++++++------
> > > drivers/gpu/drm/xe/xe_bo.h | 2 ++
> > > 2 files changed, 11 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_bo.c
> > > b/drivers/gpu/drm/xe/xe_bo.c
> > > index a02d63e322ae..dbd03383878e 100644
> > > --- a/drivers/gpu/drm/xe/xe_bo.c
> > > +++ b/drivers/gpu/drm/xe/xe_bo.c
> > > @@ -685,8 +685,9 @@ static int xe_bo_move(struct
> > > ttm_buffer_object
> > > *ttm_bo, bool evict,
> > > move_lacks_source = !old_mem || (handle_system_ccs ?
> > > (!bo-
> > > > ccs_cleared) :
> > >
> > > (!mem_type_is_vram(old_mem_type) && !tt_has_data));
> > >
> > > - needs_clear = (ttm && ttm->page_flags &
> > > TTM_TT_FLAG_ZERO_ALLOC) ||
> > > - (!ttm && ttm_bo->type == ttm_bo_type_device);
> > > + needs_clear = !(bo->flags & XE_BO_FLAG_SKIP_CLEAR) &&
> > > + ((ttm && ttm->page_flags &
> > > TTM_TT_FLAG_ZERO_ALLOC)
> > > > >
> > > + (!ttm && ttm_bo->type == ttm_bo_type_device));
> >
> > It should be worth adding a note about how clearing for svm bos is
> > intended to work. From what I can tell, there is an option to clear
> > on
> > migration from system to vram if no system pages are present?
> >
>
> Sure can add a comment. The migration from system to vram doesn't do
> a
> clear currently because when 'check_pages' is set we only migrate CPU
> faulted in pages. If we remove that, then yes we'd need a clear on
> migration.
>
> > >
> > > if (new_mem->mem_type == XE_PL_TT) {
> > > ret = xe_tt_map_sg(ttm);
> > > @@ -1145,7 +1146,7 @@ static void xe_ttm_bo_destroy(struct
> > > ttm_buffer_object *ttm_bo)
> > > xe_drm_client_remove_bo(bo);
> > > #endif
> > >
> > > - if (bo->vm && xe_bo_is_user(bo))
> > > + if (bo->vm && (xe_bo_is_user(bo) || bo->flags &
> > > XE_BO_FLAG_SYSTEM_ALLOC))
> > > xe_vm_put(bo->vm);
> > >
> > > mutex_lock(&xe->mem_access.vram_userfault.lock);
> > > @@ -1301,7 +1302,8 @@ struct xe_bo *___xe_bo_create_locked(struct
> > > xe_device *xe, struct xe_bo *bo,
> > > int err;
> > >
> > > /* Only kernel objects should set GT */
> > > - xe_assert(xe, !tile || type == ttm_bo_type_kernel);
> > > + xe_assert(xe, !tile || type == ttm_bo_type_kernel ||
> > > + flags & XE_BO_FLAG_SYSTEM_ALLOC);
> > >
> > > if (XE_WARN_ON(!size)) {
> > > xe_bo_free(bo);
> > > @@ -1493,7 +1495,7 @@ __xe_bo_create_locked(struct xe_device *xe,
> > > * by having all the vm's bo refereferences released at
> > > vm
> > > close
> > > * time.
> > > */
> > > - if (vm && xe_bo_is_user(bo))
> > > + if (vm && (xe_bo_is_user(bo) || bo->flags &
> > > XE_BO_FLAG_SYSTEM_ALLOC))
> > > xe_vm_get(vm);
> > > bo->vm = vm;
> > >
> > > @@ -2333,7 +2335,8 @@ bool xe_bo_needs_ccs_pages(struct xe_bo
> > > *bo)
> > > * can't be used since there's no CCS storage associated
> > > with
> > > * non-VRAM addresses.
> > > */
> > > - if (IS_DGFX(xe) && (bo->flags & XE_BO_FLAG_SYSTEM))
> > > + if (IS_DGFX(xe) && ((bo->flags & XE_BO_FLAG_SYSTEM) ||
> > > + (bo->flags & XE_BO_FLAG_SYSTEM_ALLOC)))
> > > return false;
> >
> > Can we support CCS with system allocator? Perhaps add a TODO
> > comment if
> > so. I figure it should be possible if we resolve on migration to
> > system, which we do on BMG.
> >
>
> Honestly don't really understand how CCS works, so unsure if
> possible.
> Can add a TODO comment and we can circle back.
Sounds good. We should probably also discuss with UMD if they see a
performance use-case.
/Thomas
>
> Matt
>
> >
> > Thanks,
> > Thomas
> >
> >
> > >
> > > return true;
> > > diff --git a/drivers/gpu/drm/xe/xe_bo.h
> > > b/drivers/gpu/drm/xe/xe_bo.h
> > > index 7fa44a0138b0..caf0459d16ad 100644
> > > --- a/drivers/gpu/drm/xe/xe_bo.h
> > > +++ b/drivers/gpu/drm/xe/xe_bo.h
> > > @@ -39,6 +39,8 @@
> > > #define XE_BO_FLAG_NEEDS_64K BIT(15)
> > > #define XE_BO_FLAG_NEEDS_2M BIT(16)
> > > #define XE_BO_FLAG_GGTT_INVALIDATE BIT(17)
> > > +#define XE_BO_FLAG_SYSTEM_ALLOC BIT(18)
> > > +#define XE_BO_FLAG_SKIP_CLEAR BIT(19)
> > > /* this one is trigger internally only */
> > > #define XE_BO_FLAG_INTERNAL_TEST BIT(30)
> > > #define XE_BO_FLAG_INTERNAL_64K BIT(31)
> >
^ permalink raw reply [flat|nested] 129+ messages in thread
* [PATCH v2 24/29] drm/xe: Add SVM VRAM migration
2024-10-16 3:24 [PATCH v2 00/29] Introduce GPU SVM and Xe SVM implementation Matthew Brost
` (22 preceding siblings ...)
2024-10-16 3:25 ` [PATCH v2 23/29] drm/xe: Add BO flags required for SVM Matthew Brost
@ 2024-10-16 3:25 ` Matthew Brost
2024-12-02 12:06 ` Thomas Hellström
2024-10-16 3:25 ` [PATCH v2 25/29] drm/xe: Basic SVM BO eviction Matthew Brost
` (7 subsequent siblings)
31 siblings, 1 reply; 129+ messages in thread
From: Matthew Brost @ 2024-10-16 3:25 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
Migration is implemented with range granularity, with VRAM backing being
a VM private TTM BO (i.e., shares dma-resv with VM). The lifetime of the
TTM BO is limited to when the SVM range is in VRAM (i.e., when a VRAM
SVM range is migrated to SRAM, the TTM BO is destroyed).
The design choice for using TTM BO for VRAM backing store, as opposed to
direct buddy allocation, is as follows:
- DRM buddy allocations are not at page granularity, offering no
advantage over a BO.
- Unified eviction is required (SVM VRAM and TTM BOs need to be able to
evict each other).
- For exhaustive eviction [1], SVM VRAM allocations will almost certainly
require a dma-resv.
- Likely allocation size is 2M which makes the size of the BO (872 bytes)
acceptable per allocation (872 / 2M == .0004158).
With this, using TTM BO for VRAM backing store seems to be an obvious
choice as it allows leveraging of the TTM eviction code.
Current migration policy is migrate any SVM range greater than or equal
to 64k once.
[1] https://patchwork.freedesktop.org/series/133643/
v2:
- Rebase on latest GPU SVM
- Retry page fault on get pages returning mixed allocation
- Use drm_gpusvm_devmem
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
drivers/gpu/drm/xe/xe_svm.c | 96 +++++++++++++++++++++++++++++++++++--
drivers/gpu/drm/xe/xe_svm.h | 1 +
2 files changed, 94 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index 976b4ce15db4..31b80cde15c4 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -218,6 +218,9 @@ static int __xe_svm_garbage_collector(struct xe_vm *vm,
{
struct dma_fence *fence;
+ if (IS_DGFX(vm->xe) && range->base.flags.partial_unmap)
+ drm_gpusvm_range_evict(&vm->svm.gpusvm, &range->base);
+
xe_vm_lock(vm, false);
fence = xe_vm_range_unbind(vm, range);
xe_vm_unlock(vm);
@@ -458,7 +461,6 @@ static int xe_svm_populate_devmem_pfn(struct drm_gpusvm_devmem *devmem_allocatio
return 0;
}
-__maybe_unused
static const struct drm_gpusvm_devmem_ops gpusvm_devmem_ops = {
.devmem_release = xe_svm_devmem_release,
.populate_devmem_pfn = xe_svm_populate_devmem_pfn,
@@ -542,21 +544,84 @@ static bool xe_svm_range_is_valid(struct xe_svm_range *range,
return (range->tile_present & ~range->tile_invalidated) & BIT(tile->id);
}
+static struct xe_mem_region *tile_to_mr(struct xe_tile *tile)
+{
+ return &tile->mem.vram;
+}
+
+static struct xe_bo *xe_svm_alloc_vram(struct xe_vm *vm, struct xe_tile *tile,
+ struct xe_svm_range *range,
+ const struct drm_gpusvm_ctx *ctx)
+{
+ struct xe_mem_region *mr = tile_to_mr(tile);
+ struct drm_buddy_block *block;
+ struct list_head *blocks;
+ struct xe_bo *bo;
+ ktime_t end = 0;
+ int err;
+
+retry:
+ xe_vm_lock(vm, false);
+ bo = xe_bo_create(tile_to_xe(tile), tile, vm, range->base.va.end -
+ range->base.va.start, ttm_bo_type_device,
+ XE_BO_FLAG_VRAM_IF_DGFX(tile) |
+ XE_BO_FLAG_SYSTEM_ALLOC | XE_BO_FLAG_SKIP_CLEAR);
+ xe_vm_unlock(vm);
+ if (IS_ERR(bo)) {
+ err = PTR_ERR(bo);
+ if (xe_vm_validate_should_retry(NULL, err, &end))
+ goto retry;
+ return bo;
+ }
+
+ drm_gpusvm_devmem_init(&bo->devmem_allocation,
+ vm->xe->drm.dev, vm->svm.gpusvm.mm,
+ &gpusvm_devmem_ops,
+ &tile->mem.vram.dpagemap,
+ range->base.va.end -
+ range->base.va.start);
+
+ blocks = &to_xe_ttm_vram_mgr_resource(bo->ttm.resource)->blocks;
+ list_for_each_entry(block, blocks, link)
+ block->private = mr;
+
+ /*
+ * Take ref because as soon as drm_gpusvm_migrate_to_devmem succeeds the
+ * creation ref can be dropped upon CPU fault or unmap.
+ */
+ xe_bo_get(bo);
+
+ err = drm_gpusvm_migrate_to_devmem(&vm->svm.gpusvm, &range->base,
+ &bo->devmem_allocation, ctx);
+ if (err) {
+ xe_bo_put(bo); /* Local ref */
+ xe_bo_put(bo); /* Creation ref */
+ return ERR_PTR(err);
+ }
+
+ return bo;
+}
+
int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
struct xe_tile *tile, u64 fault_addr,
bool atomic)
{
- struct drm_gpusvm_ctx ctx = { .read_only = xe_vma_read_only(vma), };
+ struct drm_gpusvm_ctx ctx = { .read_only = xe_vma_read_only(vma),
+ .devmem_possible = IS_DGFX(vm->xe), .check_pages = true, };
struct xe_svm_range *range;
struct drm_gpusvm_range *r;
struct drm_exec exec;
struct dma_fence *fence;
+ struct xe_bo *bo = NULL;
ktime_t end = 0;
int err;
lockdep_assert_held_write(&vm->lock);
retry:
+ xe_bo_put(bo);
+ bo = NULL;
+
/* Always process UNMAPs first so view SVM ranges is current */
err = xe_svm_garbage_collector(vm);
if (err)
@@ -572,9 +637,32 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
if (xe_svm_range_is_valid(range, tile))
return 0;
+ /* XXX: Add migration policy, for now migrate range once */
+ if (IS_DGFX(vm->xe) && !range->migrated &&
+ range->base.flags.migrate_devmem &&
+ (range->base.va.end - range->base.va.start) >= SZ_64K) {
+ range->migrated = true;
+
+ bo = xe_svm_alloc_vram(vm, tile, range, &ctx);
+ if (IS_ERR(bo)) {
+ drm_info(&vm->xe->drm,
+ "VRAM allocation failed, falling back to retrying, asid=%u, errno %ld\n",
+ vm->usm.asid, PTR_ERR(bo));
+ bo = NULL;
+ goto retry;
+ }
+ }
+
err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r, &ctx);
if (err == -EFAULT || err == -EPERM) /* Corner where CPU mappings have change */
- goto retry;
+ if (err == -EOPNOTSUPP || err == -EFAULT || err == -EPERM) { /* Corner where CPU mappings have change */
+ if (err == -EOPNOTSUPP)
+ drm_gpusvm_range_evict(&vm->svm.gpusvm, &range->base);
+ drm_info(&vm->xe->drm,
+ "Get pages failed, falling back to retrying, asid=%u, gpusvm=0x%016llx, errno %d\n",
+ vm->usm.asid, (u64)&vm->svm.gpusvm, err);
+ goto retry;
+ }
if (err)
goto err_out;
@@ -605,6 +693,8 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
dma_fence_put(fence);
err_out:
+ xe_bo_put(bo);
+
return err;
}
diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index 760d22cefb1e..6893664dae70 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -21,6 +21,7 @@ struct xe_svm_range {
struct list_head garbage_collector_link;
u8 tile_present;
u8 tile_invalidated;
+ u8 migrated :1;
};
int xe_devm_add(struct xe_tile *tile, struct xe_mem_region *mr);
--
2.34.1
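
To make the creation-versus-local reference comment in xe_svm_alloc_vram()
concrete, here is a toy refcount sketch; the plain integer counter and the
helpers stand in for xe_bo_get()/xe_bo_put() and the migrate call, and are
not driver code.

#include <stdio.h>
#include <stdlib.h>

struct bo { int refs; };

static struct bo *bo_create(void)
{
	struct bo *b = malloc(sizeof(*b));

	if (!b)
		abort();
	b->refs = 1;			/* creation reference */
	return b;
}

static void bo_get(struct bo *b) { b->refs++; }

static void bo_put(struct bo *b)
{
	if (--b->refs == 0) {
		free(b);
		printf("bo freed\n");
	}
}

/*
 * Stand-in for drm_gpusvm_migrate_to_devmem(): once it succeeds the BO is
 * published, and a CPU fault or unmap may drop the creation reference at
 * any time (simulated here by dropping it immediately on success).
 */
static int migrate(struct bo *b)
{
	bo_put(b);
	return 0;
}

int main(void)
{
	struct bo *b = bo_create();

	bo_get(b);			/* local ref, taken before publishing */
	if (migrate(b)) {
		bo_put(b);		/* local ref */
		bo_put(b);		/* creation ref, still ours on failure */
		return 1;
	}
	/* the local ref keeps b valid even after the creation ref is gone */
	bo_put(b);			/* local ref */
	return 0;
}
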
^ permalink raw reply related [flat|nested] 129+ messages in thread* Re: [PATCH v2 24/29] drm/xe: Add SVM VRAM migration
2024-10-16 3:25 ` [PATCH v2 24/29] drm/xe: Add SVM VRAM migration Matthew Brost
@ 2024-12-02 12:06 ` Thomas Hellström
2024-12-11 20:17 ` Matthew Brost
0 siblings, 1 reply; 129+ messages in thread
From: Thomas Hellström @ 2024-12-02 12:06 UTC (permalink / raw)
To: Matthew Brost, intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, simona.vetter, felix.kuehling,
dakr
On Tue, 2024-10-15 at 20:25 -0700, Matthew Brost wrote:
> Migration is implemented with range granularity, with VRAM backing
> being
> a VM private TTM BO (i.e., shares dma-resv with VM). The lifetime of
> the
> TTM BO is limited to when the SVM range is in VRAM (i.e., when a VRAM
> SVM range is migrated to SRAM, the TTM BO is destroyed).
>
> The design choice for using TTM BO for VRAM backing store, as opposed
> to
> direct buddy allocation, is as follows:
>
> - DRM buddy allocations are not at page granularity, offering no
> advantage over a BO.
> - Unified eviction is required (SVM VRAM and TTM BOs need to be able
> to
> evict each other).
> - For exhaustive eviction [1], SVM VRAM allocations will almost
> certainly
> require a dma-resv.
> - Likely allocation size is 2M which makes the size of the BO (872 bytes)
> acceptable per allocation (872 / 2M == .0004158).
>
> With this, using TTM BO for VRAM backing store seems to be an obvious
> choice as it allows leveraging of the TTM eviction code.
>
> Current migration policy is migrate any SVM range greater than or
> equal
> to 64k once.
>
> [1] https://patchwork.freedesktop.org/series/133643/
>
> v2:
> - Rebase on latest GPU SVM
> - Retry page fault on get pages returning mixed allocation
> - Use drm_gpusvm_devmem
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
> drivers/gpu/drm/xe/xe_svm.c | 96
> +++++++++++++++++++++++++++++++++++--
> drivers/gpu/drm/xe/xe_svm.h | 1 +
> 2 files changed, 94 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_svm.c
> b/drivers/gpu/drm/xe/xe_svm.c
> index 976b4ce15db4..31b80cde15c4 100644
> --- a/drivers/gpu/drm/xe/xe_svm.c
> +++ b/drivers/gpu/drm/xe/xe_svm.c
> @@ -218,6 +218,9 @@ static int __xe_svm_garbage_collector(struct
> xe_vm *vm,
> {
> struct dma_fence *fence;
>
> + if (IS_DGFX(vm->xe) && range->base.flags.partial_unmap)
> + drm_gpusvm_range_evict(&vm->svm.gpusvm, &range-
> >base);
> +
> xe_vm_lock(vm, false);
> fence = xe_vm_range_unbind(vm, range);
> xe_vm_unlock(vm);
> @@ -458,7 +461,6 @@ static int xe_svm_populate_devmem_pfn(struct
> drm_gpusvm_devmem *devmem_allocatio
> return 0;
> }
>
> -__maybe_unused
> static const struct drm_gpusvm_devmem_ops gpusvm_devmem_ops = {
> .devmem_release = xe_svm_devmem_release,
> .populate_devmem_pfn = xe_svm_populate_devmem_pfn,
> @@ -542,21 +544,84 @@ static bool xe_svm_range_is_valid(struct
> xe_svm_range *range,
> return (range->tile_present & ~range->tile_invalidated) &
> BIT(tile->id);
> }
>
> +static struct xe_mem_region *tile_to_mr(struct xe_tile *tile)
> +{
> + return &tile->mem.vram;
> +}
> +
> +static struct xe_bo *xe_svm_alloc_vram(struct xe_vm *vm, struct
> xe_tile *tile,
> + struct xe_svm_range *range,
> + const struct drm_gpusvm_ctx
> *ctx)
This function will se substantial updates with multi-device, but let's
leave as is for now.
> +{
> + struct xe_mem_region *mr = tile_to_mr(tile);
> + struct drm_buddy_block *block;
> + struct list_head *blocks;
> + struct xe_bo *bo;
> + ktime_t end = 0;
> + int err;
> +
> +retry:
> + xe_vm_lock(vm, false);
> + bo = xe_bo_create(tile_to_xe(tile), tile, vm, range-
> >base.va.end -
> + range->base.va.start, ttm_bo_type_device,
> + XE_BO_FLAG_VRAM_IF_DGFX(tile) |
> + XE_BO_FLAG_SYSTEM_ALLOC |
> XE_BO_FLAG_SKIP_CLEAR);
> + xe_vm_unlock(vm);
> + if (IS_ERR(bo)) {
> + err = PTR_ERR(bo);
> + if (xe_vm_validate_should_retry(NULL, err, &end))
> + goto retry;
> + return bo;
> + }
> +
> + drm_gpusvm_devmem_init(&bo->devmem_allocation,
> + vm->xe->drm.dev, vm->svm.gpusvm.mm,
> + &gpusvm_devmem_ops,
> + &tile->mem.vram.dpagemap,
> + range->base.va.end -
> + range->base.va.start);
> +
> + blocks = &to_xe_ttm_vram_mgr_resource(bo->ttm.resource)-
> >blocks;
> + list_for_each_entry(block, blocks, link)
> + block->private = mr;
> +
> + /*
> + * Take ref because as soon as drm_gpusvm_migrate_to_devmem
> succeeds the
> + * creation ref can be dropped upon CPU fault or unmap.
> + */
> + xe_bo_get(bo);
> +
> + err = drm_gpusvm_migrate_to_devmem(&vm->svm.gpusvm, &range-
> >base,
> + &bo->devmem_allocation,
> ctx);
> + if (err) {
> + xe_bo_put(bo); /* Local ref */
> + xe_bo_put(bo); /* Creation ref */
> + return ERR_PTR(err);
> + }
> +
> + return bo;
> +}
> +
> int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
> struct xe_tile *tile, u64 fault_addr,
> bool atomic)
> {
> - struct drm_gpusvm_ctx ctx = { .read_only =
> xe_vma_read_only(vma), };
> + struct drm_gpusvm_ctx ctx = { .read_only =
> xe_vma_read_only(vma),
> + .devmem_possible = IS_DGFX(vm->xe), .check_pages =
> true, };
> struct xe_svm_range *range;
> struct drm_gpusvm_range *r;
> struct drm_exec exec;
> struct dma_fence *fence;
> + struct xe_bo *bo = NULL;
> ktime_t end = 0;
> int err;
>
> lockdep_assert_held_write(&vm->lock);
>
> retry:
> + xe_bo_put(bo);
> + bo = NULL;
> +
> /* Always process UNMAPs first so view SVM ranges is current
> */
> err = xe_svm_garbage_collector(vm);
> if (err)
> @@ -572,9 +637,32 @@ int xe_svm_handle_pagefault(struct xe_vm *vm,
> struct xe_vma *vma,
> if (xe_svm_range_is_valid(range, tile))
> return 0;
>
> + /* XXX: Add migration policy, for now migrate range once */
> + if (IS_DGFX(vm->xe) && !range->migrated &&
> + range->base.flags.migrate_devmem &&
> + (range->base.va.end - range->base.va.start) >= SZ_64K) {
> + range->migrated = true;
> +
> + bo = xe_svm_alloc_vram(vm, tile, range, &ctx);
> + if (IS_ERR(bo)) {
> + drm_info(&vm->xe->drm,
> + "VRAM allocation failed, falling
> back to retrying, asid=%u, errno %ld\n",
> + vm->usm.asid, PTR_ERR(bo));
> + bo = NULL;
> + goto retry;
> + }
> + }
> +
> err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r, &ctx);
> if (err == -EFAULT || err == -EPERM) /* Corner where CPU
> mappings have change */
> - goto retry;
> + if (err == -EOPNOTSUPP || err == -EFAULT || err == -EPERM)
> { /* Corner where CPU mappings have change */
have changed or have seen a change?
> + if (err == -EOPNOTSUPP)
> + drm_gpusvm_range_evict(&vm->svm.gpusvm,
> &range->base);
> + drm_info(&vm->xe->drm,
> + "Get pages failed, falling back to
> retrying, asid=%u, gpusvm=0x%016llx, errno %d\n",
> + vm->usm.asid, (u64)&vm->svm.gpusvm, err);
> + goto retry;
> + }
> if (err)
> goto err_out;
>
> @@ -605,6 +693,8 @@ int xe_svm_handle_pagefault(struct xe_vm *vm,
> struct xe_vma *vma,
> dma_fence_put(fence);
>
> err_out:
> + xe_bo_put(bo);
> +
> return err;
> }
>
> diff --git a/drivers/gpu/drm/xe/xe_svm.h
> b/drivers/gpu/drm/xe/xe_svm.h
> index 760d22cefb1e..6893664dae70 100644
> --- a/drivers/gpu/drm/xe/xe_svm.h
> +++ b/drivers/gpu/drm/xe/xe_svm.h
> @@ -21,6 +21,7 @@ struct xe_svm_range {
> struct list_head garbage_collector_link;
> u8 tile_present;
> u8 tile_invalidated;
> + u8 migrated :1;
Kerneldoc, including protection information
> };
>
> int xe_devm_add(struct xe_tile *tile, struct xe_mem_region *mr);
Thanks,
Thomas
^ permalink raw reply [flat|nested] 129+ messages in thread* Re: [PATCH v2 24/29] drm/xe: Add SVM VRAM migration
2024-12-02 12:06 ` Thomas Hellström
@ 2024-12-11 20:17 ` Matthew Brost
0 siblings, 0 replies; 129+ messages in thread
From: Matthew Brost @ 2024-12-11 20:17 UTC (permalink / raw)
To: Thomas Hellström
Cc: intel-xe, dri-devel, apopple, airlied, christian.koenig,
simona.vetter, felix.kuehling, dakr
On Mon, Dec 02, 2024 at 01:06:33PM +0100, Thomas Hellström wrote:
> On Tue, 2024-10-15 at 20:25 -0700, Matthew Brost wrote:
> > Migration is implemented with range granularity, with VRAM backing
> > being
> > a VM private TTM BO (i.e., shares dma-resv with VM). The lifetime of
> > the
> > TTM BO is limited to when the SVM range is in VRAM (i.e., when a VRAM
> > SVM range is migrated to SRAM, the TTM BO is destroyed).
> >
> > The design choice for using TTM BO for VRAM backing store, as opposed
> > to
> > direct buddy allocation, is as follows:
> >
> > - DRM buddy allocations are not at page granularity, offering no
> > advantage over a BO.
> > - Unified eviction is required (SVM VRAM and TTM BOs need to be able
> > to
> > evict each other).
> > - For exhaustive eviction [1], SVM VRAM allocations will almost
> > certainly
> > require a dma-resv.
> > - Likely allocation size is 2M which makes the size of the BO (872 bytes)
> > acceptable per allocation (872 / 2M == .0004158).
> >
> > With this, using TTM BO for VRAM backing store seems to be an obvious
> > choice as it allows leveraging of the TTM eviction code.
> >
> > Current migration policy is migrate any SVM range greater than or
> > equal
> > to 64k once.
> >
> > [1] https://patchwork.freedesktop.org/series/133643/
> >
> > v2:
> > - Rebase on latest GPU SVM
> > - Retry page fault on get pages returning mixed allocation
> > - Use drm_gpusvm_devmem
> >
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> > drivers/gpu/drm/xe/xe_svm.c | 96
> > +++++++++++++++++++++++++++++++++++--
> > drivers/gpu/drm/xe/xe_svm.h | 1 +
> > 2 files changed, 94 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_svm.c
> > b/drivers/gpu/drm/xe/xe_svm.c
> > index 976b4ce15db4..31b80cde15c4 100644
> > --- a/drivers/gpu/drm/xe/xe_svm.c
> > +++ b/drivers/gpu/drm/xe/xe_svm.c
> > @@ -218,6 +218,9 @@ static int __xe_svm_garbage_collector(struct
> > xe_vm *vm,
> > {
> > struct dma_fence *fence;
> >
> > + if (IS_DGFX(vm->xe) && range->base.flags.partial_unmap)
> > + drm_gpusvm_range_evict(&vm->svm.gpusvm, &range-
> > >base);
> > +
> > xe_vm_lock(vm, false);
> > fence = xe_vm_range_unbind(vm, range);
> > xe_vm_unlock(vm);
> > @@ -458,7 +461,6 @@ static int xe_svm_populate_devmem_pfn(struct
> > drm_gpusvm_devmem *devmem_allocatio
> > return 0;
> > }
> >
> > -__maybe_unused
> > static const struct drm_gpusvm_devmem_ops gpusvm_devmem_ops = {
> > .devmem_release = xe_svm_devmem_release,
> > .populate_devmem_pfn = xe_svm_populate_devmem_pfn,
> > @@ -542,21 +544,84 @@ static bool xe_svm_range_is_valid(struct
> > xe_svm_range *range,
> > return (range->tile_present & ~range->tile_invalidated) &
> > BIT(tile->id);
> > }
> >
> > +static struct xe_mem_region *tile_to_mr(struct xe_tile *tile)
> > +{
> > + return &tile->mem.vram;
> > +}
> > +
> > +static struct xe_bo *xe_svm_alloc_vram(struct xe_vm *vm, struct
> > xe_tile *tile,
> > + struct xe_svm_range *range,
> > + const struct drm_gpusvm_ctx
> > *ctx)
>
> This function will se substantial updates with multi-device, but let's
> leave as is for now.
>
Agree. Let's get a baseline in and then rework.
> > +{
> > + struct xe_mem_region *mr = tile_to_mr(tile);
> > + struct drm_buddy_block *block;
> > + struct list_head *blocks;
> > + struct xe_bo *bo;
> > + ktime_t end = 0;
> > + int err;
> > +
> > +retry:
> > + xe_vm_lock(vm, false);
> > + bo = xe_bo_create(tile_to_xe(tile), tile, vm, range-
> > >base.va.end -
> > + range->base.va.start, ttm_bo_type_device,
> > + XE_BO_FLAG_VRAM_IF_DGFX(tile) |
> > + XE_BO_FLAG_SYSTEM_ALLOC |
> > XE_BO_FLAG_SKIP_CLEAR);
> > + xe_vm_unlock(vm);
> > + if (IS_ERR(bo)) {
> > + err = PTR_ERR(bo);
> > + if (xe_vm_validate_should_retry(NULL, err, &end))
> > + goto retry;
> > + return bo;
> > + }
> > +
> > + drm_gpusvm_devmem_init(&bo->devmem_allocation,
> > + vm->xe->drm.dev, vm->svm.gpusvm.mm,
> > + &gpusvm_devmem_ops,
> > + &tile->mem.vram.dpagemap,
> > + range->base.va.end -
> > + range->base.va.start);
> > +
> > + blocks = &to_xe_ttm_vram_mgr_resource(bo->ttm.resource)-
> > >blocks;
> > + list_for_each_entry(block, blocks, link)
> > + block->private = mr;
> > +
> > + /*
> > + * Take ref because as soon as drm_gpusvm_migrate_to_devmem
> > succeeds the
> > + * creation ref can be dropped upon CPU fault or unmap.
> > + */
> > + xe_bo_get(bo);
> > +
> > + err = drm_gpusvm_migrate_to_devmem(&vm->svm.gpusvm, &range-
> > >base,
> > + &bo->devmem_allocation,
> > ctx);
> > + if (err) {
> > + xe_bo_put(bo); /* Local ref */
> > + xe_bo_put(bo); /* Creation ref */
> > + return ERR_PTR(err);
> > + }
> > +
> > + return bo;
> > +}
> > +
> > int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
> > struct xe_tile *tile, u64 fault_addr,
> > bool atomic)
> > {
> > - struct drm_gpusvm_ctx ctx = { .read_only =
> > xe_vma_read_only(vma), };
> > + struct drm_gpusvm_ctx ctx = { .read_only =
> > xe_vma_read_only(vma),
> > + .devmem_possible = IS_DGFX(vm->xe), .check_pages =
> > true, };
> > struct xe_svm_range *range;
> > struct drm_gpusvm_range *r;
> > struct drm_exec exec;
> > struct dma_fence *fence;
> > + struct xe_bo *bo = NULL;
> > ktime_t end = 0;
> > int err;
> >
> > lockdep_assert_held_write(&vm->lock);
> >
> > retry:
> > + xe_bo_put(bo);
> > + bo = NULL;
> > +
> > /* Always process UNMAPs first so view SVM ranges is current
> > */
> > err = xe_svm_garbage_collector(vm);
> > if (err)
> > @@ -572,9 +637,32 @@ int xe_svm_handle_pagefault(struct xe_vm *vm,
> > struct xe_vma *vma,
> > if (xe_svm_range_is_valid(range, tile))
> > return 0;
> >
> > + /* XXX: Add migration policy, for now migrate range once */
> > + if (IS_DGFX(vm->xe) && !range->migrated &&
> > + range->base.flags.migrate_devmem &&
> > + (range->base.va.end - range->base.va.start) >= SZ_64K) {
> > + range->migrated = true;
> > +
> > + bo = xe_svm_alloc_vram(vm, tile, range, &ctx);
> > + if (IS_ERR(bo)) {
> > + drm_info(&vm->xe->drm,
> > + "VRAM allocation failed, falling
> > back to retrying, asid=%u, errno %ld\n",
> > + vm->usm.asid, PTR_ERR(bo));
> > + bo = NULL;
> > + goto retry;
> > + }
> > + }
> > +
> > err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r, &ctx);
> > if (err == -EFAULT || err == -EPERM) /* Corner where CPU
> > mappings have change */
> > - goto retry;
> > + if (err == -EOPNOTSUPP || err == -EFAULT || err == -EPERM)
> > { /* Corner where CPU mappings have change */
>
> have changed or have seen a change?
>
Have changed.
>
> > + if (err == -EOPNOTSUPP)
> > + drm_gpusvm_range_evict(&vm->svm.gpusvm,
> > &range->base);
> > + drm_info(&vm->xe->drm,
> > + "Get pages failed, falling back to
> > retrying, asid=%u, gpusvm=0x%016llx, errno %d\n",
> > + vm->usm.asid, (u64)&vm->svm.gpusvm, err);
> > + goto retry;
> > + }
> > if (err)
> > goto err_out;
> >
> > @@ -605,6 +693,8 @@ int xe_svm_handle_pagefault(struct xe_vm *vm,
> > struct xe_vma *vma,
> > dma_fence_put(fence);
> >
> > err_out:
> > + xe_bo_put(bo);
> > +
> > return err;
> > }
> >
> > diff --git a/drivers/gpu/drm/xe/xe_svm.h
> > b/drivers/gpu/drm/xe/xe_svm.h
> > index 760d22cefb1e..6893664dae70 100644
> > --- a/drivers/gpu/drm/xe/xe_svm.h
> > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > @@ -21,6 +21,7 @@ struct xe_svm_range {
> > struct list_head garbage_collector_link;
> > u8 tile_present;
> > u8 tile_invalidated;
> > + u8 migrated :1;
>
> Kerneldoc, including protection information
>
Will fix.
Matt
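
For illustration only, the requested kerneldoc might end up looking
something like the sketch below. The protection note is an assumption
drawn from the lockdep_assert_held_write(&vm->lock) in
xe_svm_handle_pagefault(); the final wording would need to confirm the
actual locking rule.

	/**
	 * @migrated: Range has been migrated to VRAM once. Protected by
	 * the VM lock held in write mode, which the GPU fault handler
	 * asserts before setting this flag.
	 */
	u8 migrated :1;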
> > };
> >
> > int xe_devm_add(struct xe_tile *tile, struct xe_mem_region *mr);
>
> Thanks,
> Thomas
>
^ permalink raw reply [flat|nested] 129+ messages in thread
* [PATCH v2 25/29] drm/xe: Basic SVM BO eviction
2024-10-16 3:24 [PATCH v2 00/29] Introduce GPU SVM and Xe SVM implementation Matthew Brost
` (23 preceding siblings ...)
2024-10-16 3:25 ` [PATCH v2 24/29] drm/xe: Add SVM VRAM migration Matthew Brost
@ 2024-10-16 3:25 ` Matthew Brost
2024-12-02 12:27 ` Thomas Hellström
2024-10-16 3:25 ` [PATCH v2 26/29] drm/xe: Add SVM debug Matthew Brost
` (6 subsequent siblings)
31 siblings, 1 reply; 129+ messages in thread
From: Matthew Brost @ 2024-10-16 3:25 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
Wire xe_bo_move to GPUSVM migration to SRAM with trylocking of mmap
lock.
v2:
- Use xe_svm_bo_evict
- Drop bo->range
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
drivers/gpu/drm/xe/xe_bo.c | 20 ++++++++++++++++++++
drivers/gpu/drm/xe/xe_svm.c | 5 +++++
drivers/gpu/drm/xe/xe_svm.h | 3 +++
3 files changed, 28 insertions(+)
diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
index dbd03383878e..17d158762e03 100644
--- a/drivers/gpu/drm/xe/xe_bo.c
+++ b/drivers/gpu/drm/xe/xe_bo.c
@@ -25,6 +25,7 @@
#include "xe_pm.h"
#include "xe_preempt_fence.h"
#include "xe_res_cursor.h"
+#include "xe_svm.h"
#include "xe_trace_bo.h"
#include "xe_ttm_stolen_mgr.h"
#include "xe_vm.h"
@@ -250,6 +251,8 @@ int xe_bo_placement_for_flags(struct xe_device *xe, struct xe_bo *bo,
static void xe_evict_flags(struct ttm_buffer_object *tbo,
struct ttm_placement *placement)
{
+ struct xe_bo *bo;
+
if (!xe_bo_is_xe_bo(tbo)) {
/* Don't handle scatter gather BOs */
if (tbo->type == ttm_bo_type_sg) {
@@ -261,6 +264,12 @@ static void xe_evict_flags(struct ttm_buffer_object *tbo,
return;
}
+ bo = ttm_to_xe_bo(tbo);
+ if (bo->flags & XE_BO_FLAG_SYSTEM_ALLOC) {
+ *placement = sys_placement;
+ return;
+ }
+
/*
* For xe, sg bos that are evicted to system just triggers a
* rebind of the sg list upon subsequent validation to XE_PL_TT.
@@ -738,6 +747,17 @@ static int xe_bo_move(struct ttm_buffer_object *ttm_bo, bool evict,
}
}
+ if (!move_lacks_source && (bo->flags & XE_BO_FLAG_SYSTEM_ALLOC) &&
+ new_mem->mem_type == XE_PL_SYSTEM) {
+ ret = xe_svm_bo_evict(bo);
+ if (!ret) {
+ drm_dbg(&xe->drm, "Evict system allocator BO success\n");
+ ttm_bo_move_null(ttm_bo, new_mem);
+ }
+
+ goto out;
+ }
+
if (!move_lacks_source &&
((old_mem_type == XE_PL_SYSTEM && resource_is_vram(new_mem)) ||
(mem_type_is_vram(old_mem_type) &&
diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index 31b80cde15c4..555bc71ae523 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -752,3 +752,8 @@ int xe_devm_add(struct xe_tile *tile, struct xe_mem_region *mr)
tile->id, mr->io_start, mr->io_start + mr->usable_size, res);
return 0;
}
+
+int xe_svm_bo_evict(struct xe_bo *bo)
+{
+ return drm_gpusvm_evict_to_ram(&bo->devmem_allocation);
+}
diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index 6893664dae70..5b9d5ac9ef72 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -11,6 +11,7 @@
#define XE_INTERCONNECT_VRAM DRM_INTERCONNECT_DRIVER
+struct xe_bo;
struct xe_mem_region;
struct xe_tile;
struct xe_vm;
@@ -35,6 +36,8 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
bool atomic);
bool xe_svm_has_mapping(struct xe_vm *vm, u64 start, u64 end);
+int xe_svm_bo_evict(struct xe_bo *bo);
+
static inline bool xe_svm_range_pages_valid(struct xe_svm_range *range)
{
return drm_gpusvm_range_pages_valid(range->base.gpusvm, &range->base);
--
2.34.1
^ permalink raw reply related [flat|nested] 129+ messages in thread* Re: [PATCH v2 25/29] drm/xe: Basic SVM BO eviction
2024-10-16 3:25 ` [PATCH v2 25/29] drm/xe: Basic SVM BO eviction Matthew Brost
@ 2024-12-02 12:27 ` Thomas Hellström
2024-12-11 19:47 ` Matthew Brost
0 siblings, 1 reply; 129+ messages in thread
From: Thomas Hellström @ 2024-12-02 12:27 UTC (permalink / raw)
To: Matthew Brost, intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, simona.vetter, felix.kuehling,
dakr
On Tue, 2024-10-15 at 20:25 -0700, Matthew Brost wrote:
> Wire xe_bo_move to GPUSVM migration to SRAM with trylocking of mmap
> lock.
>
> v2:
> - Use xe_svm_bo_evict
> - Drop bo->range
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
> drivers/gpu/drm/xe/xe_bo.c | 20 ++++++++++++++++++++
> drivers/gpu/drm/xe/xe_svm.c | 5 +++++
> drivers/gpu/drm/xe/xe_svm.h | 3 +++
> 3 files changed, 28 insertions(+)
>
> diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
> index dbd03383878e..17d158762e03 100644
> --- a/drivers/gpu/drm/xe/xe_bo.c
> +++ b/drivers/gpu/drm/xe/xe_bo.c
> @@ -25,6 +25,7 @@
> #include "xe_pm.h"
> #include "xe_preempt_fence.h"
> #include "xe_res_cursor.h"
> +#include "xe_svm.h"
> #include "xe_trace_bo.h"
> #include "xe_ttm_stolen_mgr.h"
> #include "xe_vm.h"
> @@ -250,6 +251,8 @@ int xe_bo_placement_for_flags(struct xe_device
> *xe, struct xe_bo *bo,
> static void xe_evict_flags(struct ttm_buffer_object *tbo,
> struct ttm_placement *placement)
> {
> + struct xe_bo *bo;
> +
> if (!xe_bo_is_xe_bo(tbo)) {
> /* Don't handle scatter gather BOs */
> if (tbo->type == ttm_bo_type_sg) {
> @@ -261,6 +264,12 @@ static void xe_evict_flags(struct
> ttm_buffer_object *tbo,
> return;
> }
>
> + bo = ttm_to_xe_bo(tbo);
> + if (bo->flags & XE_BO_FLAG_SYSTEM_ALLOC) {
> + *placement = sys_placement;
> + return;
> + }
> +
> /*
> * For xe, sg bos that are evicted to system just triggers a
> * rebind of the sg list upon subsequent validation to
> XE_PL_TT.
> @@ -738,6 +747,17 @@ static int xe_bo_move(struct ttm_buffer_object
> *ttm_bo, bool evict,
> }
> }
>
> + if (!move_lacks_source && (bo->flags &
> XE_BO_FLAG_SYSTEM_ALLOC) &&
> + new_mem->mem_type == XE_PL_SYSTEM) {
> + ret = xe_svm_bo_evict(bo);
> + if (!ret) {
> + drm_dbg(&xe->drm, "Evict system allocator BO
> success\n");
> + ttm_bo_move_null(ttm_bo, new_mem);
> + }
> +
> + goto out;
> + }
> +
> if (!move_lacks_source &&
> ((old_mem_type == XE_PL_SYSTEM &&
> resource_is_vram(new_mem)) ||
> (mem_type_is_vram(old_mem_type) &&
> diff --git a/drivers/gpu/drm/xe/xe_svm.c
> b/drivers/gpu/drm/xe/xe_svm.c
> index 31b80cde15c4..555bc71ae523 100644
> --- a/drivers/gpu/drm/xe/xe_svm.c
> +++ b/drivers/gpu/drm/xe/xe_svm.c
> @@ -752,3 +752,8 @@ int xe_devm_add(struct xe_tile *tile, struct
> xe_mem_region *mr)
> tile->id, mr->io_start, mr->io_start + mr-
> >usable_size, res);
> return 0;
> }
> +
> +int xe_svm_bo_evict(struct xe_bo *bo)
Kerneldoc. Also important IMO to specify the contract that if this
function returns success, then no VRAM pages must be in use anymore
since we will free the vram resource. (Can we guarantee that?)
Thanks,
Thomas
> +{
> + return drm_gpusvm_evict_to_ram(&bo->devmem_allocation);
> +}
> diff --git a/drivers/gpu/drm/xe/xe_svm.h
> b/drivers/gpu/drm/xe/xe_svm.h
> index 6893664dae70..5b9d5ac9ef72 100644
> --- a/drivers/gpu/drm/xe/xe_svm.h
> +++ b/drivers/gpu/drm/xe/xe_svm.h
> @@ -11,6 +11,7 @@
>
> #define XE_INTERCONNECT_VRAM DRM_INTERCONNECT_DRIVER
>
> +struct xe_bo;
> struct xe_mem_region;
> struct xe_tile;
> struct xe_vm;
> @@ -35,6 +36,8 @@ int xe_svm_handle_pagefault(struct xe_vm *vm,
> struct xe_vma *vma,
> bool atomic);
> bool xe_svm_has_mapping(struct xe_vm *vm, u64 start, u64 end);
>
> +int xe_svm_bo_evict(struct xe_bo *bo);
> +
> static inline bool xe_svm_range_pages_valid(struct xe_svm_range
> *range)
> {
> return drm_gpusvm_range_pages_valid(range->base.gpusvm,
> &range->base);
^ permalink raw reply [flat|nested] 129+ messages in thread* Re: [PATCH v2 25/29] drm/xe: Basic SVM BO eviction
2024-12-02 12:27 ` Thomas Hellström
@ 2024-12-11 19:47 ` Matthew Brost
0 siblings, 0 replies; 129+ messages in thread
From: Matthew Brost @ 2024-12-11 19:47 UTC (permalink / raw)
To: Thomas Hellström
Cc: intel-xe, dri-devel, apopple, airlied, christian.koenig,
simona.vetter, felix.kuehling, dakr
On Mon, Dec 02, 2024 at 01:27:24PM +0100, Thomas Hellström wrote:
> On Tue, 2024-10-15 at 20:25 -0700, Matthew Brost wrote:
> > Wire xe_bo_move to GPUSVM migration to SRAM with trylocking of mmap
> > lock.
> >
> > v2:
> > - Use xe_svm_bo_evict
> > - Drop bo->range
> >
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> > drivers/gpu/drm/xe/xe_bo.c | 20 ++++++++++++++++++++
> > drivers/gpu/drm/xe/xe_svm.c | 5 +++++
> > drivers/gpu/drm/xe/xe_svm.h | 3 +++
> > 3 files changed, 28 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
> > index dbd03383878e..17d158762e03 100644
> > --- a/drivers/gpu/drm/xe/xe_bo.c
> > +++ b/drivers/gpu/drm/xe/xe_bo.c
> > @@ -25,6 +25,7 @@
> > #include "xe_pm.h"
> > #include "xe_preempt_fence.h"
> > #include "xe_res_cursor.h"
> > +#include "xe_svm.h"
> > #include "xe_trace_bo.h"
> > #include "xe_ttm_stolen_mgr.h"
> > #include "xe_vm.h"
> > @@ -250,6 +251,8 @@ int xe_bo_placement_for_flags(struct xe_device
> > *xe, struct xe_bo *bo,
> > static void xe_evict_flags(struct ttm_buffer_object *tbo,
> > struct ttm_placement *placement)
> > {
> > + struct xe_bo *bo;
> > +
> > if (!xe_bo_is_xe_bo(tbo)) {
> > /* Don't handle scatter gather BOs */
> > if (tbo->type == ttm_bo_type_sg) {
> > @@ -261,6 +264,12 @@ static void xe_evict_flags(struct
> > ttm_buffer_object *tbo,
> > return;
> > }
> >
> > + bo = ttm_to_xe_bo(tbo);
> > + if (bo->flags & XE_BO_FLAG_SYSTEM_ALLOC) {
> > + *placement = sys_placement;
> > + return;
> > + }
> > +
> > /*
> > * For xe, sg bos that are evicted to system just triggers a
> > * rebind of the sg list upon subsequent validation to
> > XE_PL_TT.
> > @@ -738,6 +747,17 @@ static int xe_bo_move(struct ttm_buffer_object
> > *ttm_bo, bool evict,
> > }
> > }
> >
> > + if (!move_lacks_source && (bo->flags &
> > XE_BO_FLAG_SYSTEM_ALLOC) &&
> > + new_mem->mem_type == XE_PL_SYSTEM) {
> > + ret = xe_svm_bo_evict(bo);
> > + if (!ret) {
> > + drm_dbg(&xe->drm, "Evict system allocator BO
> > success\n");
> > + ttm_bo_move_null(ttm_bo, new_mem);
> > + }
> > +
> > + goto out;
> > + }
> > +
> > if (!move_lacks_source &&
> > ((old_mem_type == XE_PL_SYSTEM &&
> > resource_is_vram(new_mem)) ||
> > (mem_type_is_vram(old_mem_type) &&
> > diff --git a/drivers/gpu/drm/xe/xe_svm.c
> > b/drivers/gpu/drm/xe/xe_svm.c
> > index 31b80cde15c4..555bc71ae523 100644
> > --- a/drivers/gpu/drm/xe/xe_svm.c
> > +++ b/drivers/gpu/drm/xe/xe_svm.c
> > @@ -752,3 +752,8 @@ int xe_devm_add(struct xe_tile *tile, struct
> > xe_mem_region *mr)
> > tile->id, mr->io_start, mr->io_start + mr-
> > >usable_size, res);
> > return 0;
> > }
> > +
> > +int xe_svm_bo_evict(struct xe_bo *bo)
>
> Kerneldoc. Also important IMO to specify the contract that if this
> function returns success, then no VRAM pages must be in use anymore
> since we will free the vram resource. (Can we guarantee that?)
>
Will add kerneldoc. Yes, we guarantee that all VRAM pages are evicted
via a retry loop in the GPUSVM layer.
Matt
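
As a sketch only (the exact contract wording is not in the series yet),
the kerneldoc could read roughly as follows; the function body is
unchanged from the patch:

/**
 * xe_svm_bo_evict() - SVM evict BO to system memory
 * @bo: BO to evict
 *
 * GPUSVM retries internally until every device page backing @bo has
 * been migrated back to system memory, so a successful return means no
 * VRAM pages of @bo are still in use and the VRAM resource can be
 * freed.
 *
 * Return: 0 on success, negative error code on error.
 */
int xe_svm_bo_evict(struct xe_bo *bo)
{
	return drm_gpusvm_evict_to_ram(&bo->devmem_allocation);
}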
>
> Thanks,
> Thomas
>
>
> > +{
> > + return drm_gpusvm_evict_to_ram(&bo->devmem_allocation);
> > +}
> > diff --git a/drivers/gpu/drm/xe/xe_svm.h
> > b/drivers/gpu/drm/xe/xe_svm.h
> > index 6893664dae70..5b9d5ac9ef72 100644
> > --- a/drivers/gpu/drm/xe/xe_svm.h
> > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > @@ -11,6 +11,7 @@
> >
> > #define XE_INTERCONNECT_VRAM DRM_INTERCONNECT_DRIVER
> >
> > +struct xe_bo;
> > struct xe_mem_region;
> > struct xe_tile;
> > struct xe_vm;
> > @@ -35,6 +36,8 @@ int xe_svm_handle_pagefault(struct xe_vm *vm,
> > struct xe_vma *vma,
> > bool atomic);
> > bool xe_svm_has_mapping(struct xe_vm *vm, u64 start, u64 end);
> >
> > +int xe_svm_bo_evict(struct xe_bo *bo);
> > +
> > static inline bool xe_svm_range_pages_valid(struct xe_svm_range
> > *range)
> > {
> > return drm_gpusvm_range_pages_valid(range->base.gpusvm,
> > &range->base);
>
^ permalink raw reply [flat|nested] 129+ messages in thread
* [PATCH v2 26/29] drm/xe: Add SVM debug
2024-10-16 3:24 [PATCH v2 00/29] Introduce GPU SVM and Xe SVM implementation Matthew Brost
` (24 preceding siblings ...)
2024-10-16 3:25 ` [PATCH v2 25/29] drm/xe: Basic SVM BO eviction Matthew Brost
@ 2024-10-16 3:25 ` Matthew Brost
2024-12-02 12:33 ` Thomas Hellström
2024-10-16 3:25 ` [PATCH v2 27/29] drm/xe: Add modparam for SVM notifier size Matthew Brost
` (5 subsequent siblings)
31 siblings, 1 reply; 129+ messages in thread
From: Matthew Brost @ 2024-10-16 3:25 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
Add some useful SVM debug logging.
v2:
- Update logging with latest structure layout
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
drivers/gpu/drm/xe/xe_pt.c | 8 +++
drivers/gpu/drm/xe/xe_svm.c | 101 +++++++++++++++++++++++++++++++-----
drivers/gpu/drm/xe/xe_svm.h | 2 +
3 files changed, 99 insertions(+), 12 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
index 687abd1a5e74..75f548ebe2b3 100644
--- a/drivers/gpu/drm/xe/xe_pt.c
+++ b/drivers/gpu/drm/xe/xe_pt.c
@@ -632,6 +632,7 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
/* Move this entire thing to xe_svm.c? */
xe_svm_notifier_lock(xe_vma_vm(vma));
if (!xe_svm_range_pages_valid(range)) {
+ xe_svm_range_debug(range, "BIND PREPARE - RETRY");
xe_svm_notifier_unlock(xe_vma_vm(vma));
return -EAGAIN;
}
@@ -640,6 +641,10 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
range->base.va.end - range->base.va.start,
&curs);
is_devmem = xe_res_is_vram(&curs);
+ if (is_devmem)
+ xe_svm_range_debug(range, "BIND PREPARE - DMA VRAM");
+ else
+ xe_svm_range_debug(range, "BIND PREPARE - DMA");
} else {
xe_assert(xe, false);
}
@@ -1397,10 +1402,13 @@ static int xe_pt_svm_pre_commit(struct xe_migrate_pt_update *pt_update)
if (op->subop == XE_VMA_SUBOP_UNMAP_RANGE)
continue;
+ xe_svm_range_debug(range, "PRE-COMMIT");
+
xe_assert(vm->xe, xe_vma_is_system_allocator(op->map_range.vma));
xe_assert(vm->xe, op->subop == XE_VMA_SUBOP_MAP_RANGE);
if (!xe_svm_range_pages_valid(range)) {
+ xe_svm_range_debug(range, "PRE-COMMIT - RETRY");
xe_svm_notifier_unlock(vm);
return -EAGAIN;
}
diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index 555bc71ae523..acf2a3750f38 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -14,6 +14,18 @@
#include "xe_vm.h"
#include "xe_vm_types.h"
+static bool xe_svm_range_in_vram(struct xe_svm_range *range)
+{
+ /* Not reliable without notifier lock */
+ return range->base.flags.has_devmem_pages;
+}
+
+static bool xe_svm_range_has_vram_binding(struct xe_svm_range *range)
+{
+ /* Not reliable without notifier lock */
+ return xe_svm_range_in_vram(range) && range->tile_present;
+}
+
static struct xe_vm *gpusvm_to_vm(struct drm_gpusvm *gpusvm)
{
return container_of(gpusvm, struct xe_vm, svm.gpusvm);
@@ -24,6 +36,23 @@ static struct xe_vm *range_to_vm(struct drm_gpusvm_range *r)
return gpusvm_to_vm(r->gpusvm);
}
+#define range_debug(r__, operaton__) \
+ vm_dbg(&range_to_vm(&(r__)->base)->xe->drm, \
+ "%s: asid=%u, gpusvm=0x%016llx, vram=%d,%d, seqno=%lu, " \
+ "start=0x%014llx, end=0x%014llx, size=%llu", \
+ (operaton__), range_to_vm(&(r__)->base)->usm.asid, \
+ (u64)(r__)->base.gpusvm, \
+ xe_svm_range_in_vram((r__)) ? 1 : 0, \
+ xe_svm_range_has_vram_binding((r__)) ? 1 : 0, \
+ (r__)->base.notifier_seq, \
+ (r__)->base.va.start, (r__)->base.va.end, \
+ (r__)->base.va.end - (r__)->base.va.start)
+
+void xe_svm_range_debug(struct xe_svm_range *range, const char *operation)
+{
+ range_debug(range, operation);
+}
+
static void *xe_svm_devm_owner(struct xe_device *xe)
{
return xe;
@@ -61,6 +90,8 @@ xe_svm_garbage_collector_add_range(struct xe_vm *vm, struct xe_svm_range *range,
{
struct xe_device *xe = vm->xe;
+ range_debug(range, "GARBAGE COLLECTOR ADD");
+
drm_gpusvm_range_set_unmapped(&range->base, mmu_range);
spin_lock(&vm->svm.garbage_collector.lock);
@@ -84,10 +115,14 @@ xe_svm_range_notifier_event_begin(struct xe_vm *vm, struct drm_gpusvm_range *r,
u8 tile_mask = 0;
u8 id;
+ range_debug(range, "NOTIFIER");
+
/* Skip if already unmapped or if no binding exist */
if (range->base.flags.unmapped || !range->tile_present)
return 0;
+ range_debug(range, "NOTIFIER - EXECUTE");
+
/* Adjust invalidation to range boundaries */
if (range->base.va.start < mmu_range->start)
*adj_start = range->base.va.start;
@@ -139,6 +174,11 @@ static void xe_svm_invalidate(struct drm_gpusvm *gpusvm,
if (xe_vm_is_closed(vm))
return;
+ vm_dbg(&gpusvm_to_vm(gpusvm)->xe->drm,
+ "INVALIDATE: asid=%u, gpusvm=0x%016llx, seqno=%lu, start=0x%016lx, end=0x%016lx, event=%d",
+ vm->usm.asid, (u64)gpusvm, notifier->notifier.invalidate_seq,
+ mmu_range->start, mmu_range->end, mmu_range->event);
+
/* Adjust invalidation to notifier boundaries */
if (adj_start < notifier->interval.start)
adj_start = notifier->interval.start;
@@ -218,8 +258,12 @@ static int __xe_svm_garbage_collector(struct xe_vm *vm,
{
struct dma_fence *fence;
- if (IS_DGFX(vm->xe) && range->base.flags.partial_unmap)
+ range_debug(range, "GARBAGE COLLECTOR");
+
+ if (IS_DGFX(vm->xe) && range->base.flags.partial_unmap) {
+ range_debug(range, "GARBAGE COLLECTOT - EVICT");
drm_gpusvm_range_evict(&vm->svm.gpusvm, &range->base);
+ }
xe_vm_lock(vm, false);
fence = xe_vm_range_unbind(vm, range);
@@ -350,16 +394,23 @@ static int xe_svm_copy(struct page **pages, dma_addr_t *dma_addr,
int incr = (match && last) ? 1 : 0;
if (vram_addr != VRAM_ADDR_INVALID) {
- if (sram)
+ if (sram) {
+ vm_dbg(&tile->xe->drm,
+ "COPY TO SRAM - 0x%016llx -> 0x%016llx, NPAGES=%ld",
+ vram_addr, dma_addr[pos], i - pos + incr);
__fence = xe_migrate_from_vram(tile->migrate,
i - pos + incr,
vram_addr,
dma_addr + pos);
- else
+ } else {
+ vm_dbg(&tile->xe->drm,
+ "COPY TO VRAM - 0x%016llx -> 0x%016llx, NPAGES=%ld",
+ dma_addr[pos], vram_addr, i - pos + incr);
__fence = xe_migrate_to_vram(tile->migrate,
i - pos + incr,
dma_addr + pos,
vram_addr);
+ }
if (IS_ERR(__fence)) {
err = PTR_ERR(__fence);
goto err_out;
@@ -377,14 +428,21 @@ static int xe_svm_copy(struct page **pages, dma_addr_t *dma_addr,
}
if (!match && last && dma_addr[i] && spage) {
- if (sram)
+ if (sram) {
+ vm_dbg(&tile->xe->drm,
+ "COPY TO SRAM - 0x%016llx -> 0x%016llx, NPAGES=%d",
+ vram_addr, dma_addr[pos], 1);
__fence = xe_migrate_from_vram(tile->migrate, 1,
vram_addr,
dma_addr + pos);
- else
+ } else {
+ vm_dbg(&tile->xe->drm,
+ "COPY TO VRAM - 0x%016llx -> 0x%016llx, NPAGES=%d",
+ dma_addr[pos], vram_addr, 1);
__fence = xe_migrate_to_vram(tile->migrate, 1,
dma_addr + pos,
vram_addr);
+ }
if (IS_ERR(__fence)) {
err = PTR_ERR(__fence);
goto err_out;
@@ -554,12 +612,14 @@ static struct xe_bo *xe_svm_alloc_vram(struct xe_vm *vm, struct xe_tile *tile,
const struct drm_gpusvm_ctx *ctx)
{
struct xe_mem_region *mr = tile_to_mr(tile);
+ struct drm_buddy *buddy = tile_to_buddy(tile);
struct drm_buddy_block *block;
struct list_head *blocks;
struct xe_bo *bo;
ktime_t end = 0;
int err;
+ range_debug(range, "ALLOCATE VRAM");
retry:
xe_vm_lock(vm, false);
bo = xe_bo_create(tile_to_xe(tile), tile, vm, range->base.va.end -
@@ -582,8 +642,13 @@ static struct xe_bo *xe_svm_alloc_vram(struct xe_vm *vm, struct xe_tile *tile,
range->base.va.start);
blocks = &to_xe_ttm_vram_mgr_resource(bo->ttm.resource)->blocks;
- list_for_each_entry(block, blocks, link)
+ list_for_each_entry(block, blocks, link) {
+ vm_dbg(&vm->xe->drm, "ALLOC VRAM: asid=%u, gpusvm=0x%016llx, pfn=%llu, npages=%llu",
+ vm->usm.asid, (u64)&vm->svm.gpusvm,
+ block_offset_to_pfn(mr, drm_buddy_block_offset(block)),
+ drm_buddy_block_size(buddy, block) >> PAGE_SHIFT);
block->private = mr;
+ }
/*
* Take ref because as soon as drm_gpusvm_migrate_to_devmem succeeds the
@@ -637,6 +702,8 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
if (xe_svm_range_is_valid(range, tile))
return 0;
+ range_debug(range, "PAGE FAULT");
+
/* XXX: Add migration policy, for now migrate range once */
if (IS_DGFX(vm->xe) && !range->migrated &&
range->base.flags.migrate_devmem &&
@@ -646,25 +713,33 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
bo = xe_svm_alloc_vram(vm, tile, range, &ctx);
if (IS_ERR(bo)) {
drm_info(&vm->xe->drm,
- "VRAM allocation failed, falling back to retrying, asid=%u, errno %ld\n",
- vm->usm.asid, PTR_ERR(bo));
+ "VRAM allocation failed, falling back to retrying, asid=%u, gpusvm=0x%016llx, errno %ld\n",
+ vm->usm.asid, (u64)&vm->svm.gpusvm,
+ PTR_ERR(bo));
bo = NULL;
goto retry;
}
}
+ range_debug(range, "GET PAGES");
err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r, &ctx);
- if (err == -EFAULT || err == -EPERM) /* Corner where CPU mappings have change */
if (err == -EOPNOTSUPP || err == -EFAULT || err == -EPERM) { /* Corner where CPU mappings have change */
- if (err == -EOPNOTSUPP)
+ if (err == -EOPNOTSUPP) {
+ range_debug(range, "PAGE FAULT - EVICT PAGES");
drm_gpusvm_range_evict(&vm->svm.gpusvm, &range->base);
+ }
drm_info(&vm->xe->drm,
"Get pages failed, falling back to retrying, asid=%u, gpusvm=0x%016llx, errno %d\n",
vm->usm.asid, (u64)&vm->svm.gpusvm, err);
+ range_debug(range, "PAGE FAULT - RETRY PAGES");
goto retry;
}
- if (err)
+ if (err) {
+ range_debug(range, "PAGE FAULT - FAIL PAGE COLLECT");
goto err_out;
+ }
+
+ range_debug(range, "PAGE FAULT - BIND");
retry_bind:
drm_exec_init(&exec, 0, 0);
@@ -680,8 +755,10 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
if (IS_ERR(fence)) {
drm_exec_fini(&exec);
err = PTR_ERR(fence);
- if (err == -EAGAIN)
+ if (err == -EAGAIN) {
+ range_debug(range, "PAGE FAULT - RETRY BIND");
goto retry;
+ }
if (xe_vm_validate_should_retry(&exec, err, &end))
goto retry_bind;
goto err_out;
diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index 5b9d5ac9ef72..139acee41b42 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -36,6 +36,8 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
bool atomic);
bool xe_svm_has_mapping(struct xe_vm *vm, u64 start, u64 end);
+void xe_svm_range_debug(struct xe_svm_range *range, const char *operation);
+
int xe_svm_bo_evict(struct xe_bo *bo);
static inline bool xe_svm_range_pages_valid(struct xe_svm_range *range)
--
2.34.1
^ permalink raw reply related [flat|nested] 129+ messages in thread* Re: [PATCH v2 26/29] drm/xe: Add SVM debug
2024-10-16 3:25 ` [PATCH v2 26/29] drm/xe: Add SVM debug Matthew Brost
@ 2024-12-02 12:33 ` Thomas Hellström
2024-12-17 1:05 ` Matthew Brost
0 siblings, 1 reply; 129+ messages in thread
From: Thomas Hellström @ 2024-12-02 12:33 UTC (permalink / raw)
To: Matthew Brost, intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, simona.vetter, felix.kuehling,
dakr
On Tue, 2024-10-15 at 20:25 -0700, Matthew Brost wrote:
> Add some useful SVM debug logging.
>
> v2:
> - Update logging with latest structure layout
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
> drivers/gpu/drm/xe/xe_pt.c | 8 +++
> drivers/gpu/drm/xe/xe_svm.c | 101 +++++++++++++++++++++++++++++++---
> --
> drivers/gpu/drm/xe/xe_svm.h | 2 +
> 3 files changed, 99 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
> index 687abd1a5e74..75f548ebe2b3 100644
> --- a/drivers/gpu/drm/xe/xe_pt.c
> +++ b/drivers/gpu/drm/xe/xe_pt.c
> @@ -632,6 +632,7 @@ xe_pt_stage_bind(struct xe_tile *tile, struct
> xe_vma *vma,
> /* Move this entire thing to xe_svm.c? */
> xe_svm_notifier_lock(xe_vma_vm(vma));
> if (!xe_svm_range_pages_valid(range)) {
> + xe_svm_range_debug(range, "BIND PREPARE -
> RETRY");
> xe_svm_notifier_unlock(xe_vma_vm(vma));
> return -EAGAIN;
> }
> @@ -640,6 +641,10 @@ xe_pt_stage_bind(struct xe_tile *tile, struct
> xe_vma *vma,
> range->base.va.end - range-
> >base.va.start,
> &curs);
> is_devmem = xe_res_is_vram(&curs);
> + if (is_devmem)
> + xe_svm_range_debug(range, "BIND
> PREPARE - DMA VRAM");
> + else
> + xe_svm_range_debug(range, "BIND
> PREPARE - DMA");
> } else {
> xe_assert(xe, false);
> }
> @@ -1397,10 +1402,13 @@ static int xe_pt_svm_pre_commit(struct
> xe_migrate_pt_update *pt_update)
> if (op->subop == XE_VMA_SUBOP_UNMAP_RANGE)
> continue;
>
> + xe_svm_range_debug(range, "PRE-COMMIT");
> +
> xe_assert(vm->xe, xe_vma_is_system_allocator(op-
> >map_range.vma));
> xe_assert(vm->xe, op->subop ==
> XE_VMA_SUBOP_MAP_RANGE);
>
> if (!xe_svm_range_pages_valid(range)) {
> + xe_svm_range_debug(range, "PRE-COMMIT -
> RETRY");
> xe_svm_notifier_unlock(vm);
> return -EAGAIN;
> }
> diff --git a/drivers/gpu/drm/xe/xe_svm.c
> b/drivers/gpu/drm/xe/xe_svm.c
> index 555bc71ae523..acf2a3750f38 100644
> --- a/drivers/gpu/drm/xe/xe_svm.c
> +++ b/drivers/gpu/drm/xe/xe_svm.c
> @@ -14,6 +14,18 @@
> #include "xe_vm.h"
> #include "xe_vm_types.h"
>
> +static bool xe_svm_range_in_vram(struct xe_svm_range *range)
> +{
> + /* Not reliable without notifier lock */
lockdep assert?
> + return range->base.flags.has_devmem_pages;
> +}
> +
> +static bool xe_svm_range_has_vram_binding(struct xe_svm_range
> *range)
> +{
> + /* Not reliable without notifier lock */
lockdep assert?
> + return xe_svm_range_in_vram(range) && range->tile_present;
> +}
> +
> static struct xe_vm *gpusvm_to_vm(struct drm_gpusvm *gpusvm)
> {
> return container_of(gpusvm, struct xe_vm, svm.gpusvm);
> @@ -24,6 +36,23 @@ static struct xe_vm *range_to_vm(struct
> drm_gpusvm_range *r)
> return gpusvm_to_vm(r->gpusvm);
> }
>
> +#define range_debug(r__,
> operaton__) \
> + vm_dbg(&range_to_vm(&(r__)->base)->xe-
> >drm, \
> + "%s: asid=%u, gpusvm=0x%016llx, vram=%d,%d,
> seqno=%lu, " \
> + "start=0x%014llx, end=0x%014llx,
> size=%llu", \
> + (operaton__), range_to_vm(&(r__)->base)-
> >usm.asid, \
> + (u64)(r__)-
> >base.gpusvm, \
> + xe_svm_range_in_vram((r__)) ? 1 :
> 0, \
> + xe_svm_range_has_vram_binding((r__)) ? 1 :
> 0, \
> + (r__)-
> >base.notifier_seq, \
> + (r__)->base.va.start, (r__)-
> >base.va.end, \
> + (r__)->base.va.end - (r__)->base.va.start)
> +
> +void xe_svm_range_debug(struct xe_svm_range *range, const char
> *operation)
> +{
> + range_debug(range, operation);
> +}
> +
> static void *xe_svm_devm_owner(struct xe_device *xe)
> {
> return xe;
> @@ -61,6 +90,8 @@ xe_svm_garbage_collector_add_range(struct xe_vm
> *vm, struct xe_svm_range *range,
> {
> struct xe_device *xe = vm->xe;
>
> + range_debug(range, "GARBAGE COLLECTOR ADD");
> +
> drm_gpusvm_range_set_unmapped(&range->base, mmu_range);
>
> spin_lock(&vm->svm.garbage_collector.lock);
> @@ -84,10 +115,14 @@ xe_svm_range_notifier_event_begin(struct xe_vm
> *vm, struct drm_gpusvm_range *r,
> u8 tile_mask = 0;
> u8 id;
>
> + range_debug(range, "NOTIFIER");
> +
> /* Skip if already unmapped or if no binding exist */
> if (range->base.flags.unmapped || !range->tile_present)
> return 0;
>
> + range_debug(range, "NOTIFIER - EXECUTE");
> +
> /* Adjust invalidation to range boundaries */
> if (range->base.va.start < mmu_range->start)
> *adj_start = range->base.va.start;
> @@ -139,6 +174,11 @@ static void xe_svm_invalidate(struct drm_gpusvm
> *gpusvm,
> if (xe_vm_is_closed(vm))
> return;
>
> + vm_dbg(&gpusvm_to_vm(gpusvm)->xe->drm,
> + "INVALIDATE: asid=%u, gpusvm=0x%016llx, seqno=%lu,
> start=0x%016lx, end=0x%016lx, event=%d",
> + vm->usm.asid, (u64)gpusvm, notifier-
> >notifier.invalidate_seq,
> + mmu_range->start, mmu_range->end, mmu_range->event);
> +
> /* Adjust invalidation to notifier boundaries */
> if (adj_start < notifier->interval.start)
> adj_start = notifier->interval.start;
> @@ -218,8 +258,12 @@ static int __xe_svm_garbage_collector(struct
> xe_vm *vm,
> {
> struct dma_fence *fence;
>
> - if (IS_DGFX(vm->xe) && range->base.flags.partial_unmap)
> + range_debug(range, "GARBAGE COLLECTOR");
> +
> + if (IS_DGFX(vm->xe) && range->base.flags.partial_unmap) {
> + range_debug(range, "GARBAGE COLLECTOT - EVICT");
Typo COLLECTOT
With those fixed,
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> drm_gpusvm_range_evict(&vm->svm.gpusvm, &range-
> >base);
> + }
>
> xe_vm_lock(vm, false);
> fence = xe_vm_range_unbind(vm, range);
> @@ -350,16 +394,23 @@ static int xe_svm_copy(struct page **pages,
> dma_addr_t *dma_addr,
> int incr = (match && last) ? 1 : 0;
>
> if (vram_addr != VRAM_ADDR_INVALID) {
> - if (sram)
> + if (sram) {
> + vm_dbg(&tile->xe->drm,
> + "COPY TO SRAM -
> 0x%016llx -> 0x%016llx, NPAGES=%ld",
> + vram_addr,
> dma_addr[pos], i - pos + incr);
> __fence =
> xe_migrate_from_vram(tile->migrate,
>
> i - pos + incr,
>
> vram_addr,
>
> dma_addr + pos);
> - else
> + } else {
> + vm_dbg(&tile->xe->drm,
> + "COPY TO VRAM -
> 0x%016llx -> 0x%016llx, NPAGES=%ld",
> + dma_addr[pos],
> vram_addr, i - pos + incr);
> __fence =
> xe_migrate_to_vram(tile->migrate,
>
> i - pos + incr,
>
> dma_addr + pos,
>
> vram_addr);
> + }
> if (IS_ERR(__fence)) {
> err = PTR_ERR(__fence);
> goto err_out;
> @@ -377,14 +428,21 @@ static int xe_svm_copy(struct page **pages,
> dma_addr_t *dma_addr,
> }
>
> if (!match && last && dma_addr[i] && spage)
> {
> - if (sram)
> + if (sram) {
> + vm_dbg(&tile->xe->drm,
> + "COPY TO SRAM -
> 0x%016llx -> 0x%016llx, NPAGES=%d",
> + vram_addr,
> dma_addr[pos], 1);
> __fence =
> xe_migrate_from_vram(tile->migrate, 1,
>
> vram_addr,
>
> dma_addr + pos);
> - else
> + } else {
> + vm_dbg(&tile->xe->drm,
> + "COPY TO VRAM -
> 0x%016llx -> 0x%016llx, NPAGES=%d",
> + dma_addr[pos],
> vram_addr, 1);
> __fence =
> xe_migrate_to_vram(tile->migrate, 1,
>
> dma_addr + pos,
>
> vram_addr);
> + }
> if (IS_ERR(__fence)) {
> err = PTR_ERR(__fence);
> goto err_out;
> @@ -554,12 +612,14 @@ static struct xe_bo *xe_svm_alloc_vram(struct
> xe_vm *vm, struct xe_tile *tile,
> const struct drm_gpusvm_ctx
> *ctx)
> {
> struct xe_mem_region *mr = tile_to_mr(tile);
> + struct drm_buddy *buddy = tile_to_buddy(tile);
> struct drm_buddy_block *block;
> struct list_head *blocks;
> struct xe_bo *bo;
> ktime_t end = 0;
> int err;
>
> + range_debug(range, "ALLOCATE VRAM");
> retry:
> xe_vm_lock(vm, false);
> bo = xe_bo_create(tile_to_xe(tile), tile, vm, range-
> >base.va.end -
> @@ -582,8 +642,13 @@ static struct xe_bo *xe_svm_alloc_vram(struct
> xe_vm *vm, struct xe_tile *tile,
> range->base.va.start);
>
> blocks = &to_xe_ttm_vram_mgr_resource(bo->ttm.resource)-
> >blocks;
> - list_for_each_entry(block, blocks, link)
> + list_for_each_entry(block, blocks, link) {
> + vm_dbg(&vm->xe->drm, "ALLOC VRAM: asid=%u,
> gpusvm=0x%016llx, pfn=%llu, npages=%llu",
> + vm->usm.asid, (u64)&vm->svm.gpusvm,
> + block_offset_to_pfn(mr,
> drm_buddy_block_offset(block)),
> + drm_buddy_block_size(buddy, block) >>
> PAGE_SHIFT);
> block->private = mr;
> + }
>
> /*
> * Take ref because as soon as drm_gpusvm_migrate_to_devmem
> succeeds the
> @@ -637,6 +702,8 @@ int xe_svm_handle_pagefault(struct xe_vm *vm,
> struct xe_vma *vma,
> if (xe_svm_range_is_valid(range, tile))
> return 0;
>
> + range_debug(range, "PAGE FAULT");
> +
> /* XXX: Add migration policy, for now migrate range once */
> if (IS_DGFX(vm->xe) && !range->migrated &&
> range->base.flags.migrate_devmem &&
> @@ -646,25 +713,33 @@ int xe_svm_handle_pagefault(struct xe_vm *vm,
> struct xe_vma *vma,
> bo = xe_svm_alloc_vram(vm, tile, range, &ctx);
> if (IS_ERR(bo)) {
> drm_info(&vm->xe->drm,
> - "VRAM allocation failed, falling
> back to retrying, asid=%u, errno %ld\n",
> - vm->usm.asid, PTR_ERR(bo));
> + "VRAM allocation failed, falling
> back to retrying, asid=%u, gpusvm=0x%016llx, errno %ld\n",
> + vm->usm.asid, (u64)&vm->svm.gpusvm,
> + PTR_ERR(bo));
> bo = NULL;
> goto retry;
> }
> }
>
> + range_debug(range, "GET PAGES");
> err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r, &ctx);
> - if (err == -EFAULT || err == -EPERM) /* Corner where CPU
> mappings have change */
> if (err == -EOPNOTSUPP || err == -EFAULT || err == -EPERM)
> { /* Corner where CPU mappings have change */
> - if (err == -EOPNOTSUPP)
> + if (err == -EOPNOTSUPP) {
> + range_debug(range, "PAGE FAULT - EVICT
> PAGES");
> drm_gpusvm_range_evict(&vm->svm.gpusvm,
> &range->base);
> + }
> drm_info(&vm->xe->drm,
> "Get pages failed, falling back to
> retrying, asid=%u, gpusvm=0x%016llx, errno %d\n",
> vm->usm.asid, (u64)&vm->svm.gpusvm, err);
> + range_debug(range, "PAGE FAULT - RETRY PAGES");
> goto retry;
> }
> - if (err)
> + if (err) {
> + range_debug(range, "PAGE FAULT - FAIL PAGE
> COLLECT");
> goto err_out;
> + }
> +
> + range_debug(range, "PAGE FAULT - BIND");
>
> retry_bind:
> drm_exec_init(&exec, 0, 0);
> @@ -680,8 +755,10 @@ int xe_svm_handle_pagefault(struct xe_vm *vm,
> struct xe_vma *vma,
> if (IS_ERR(fence)) {
> drm_exec_fini(&exec);
> err = PTR_ERR(fence);
> - if (err == -EAGAIN)
> + if (err == -EAGAIN) {
> + range_debug(range, "PAGE FAULT -
> RETRY BIND");
> goto retry;
> + }
> if (xe_vm_validate_should_retry(&exec, err,
> &end))
> goto retry_bind;
> goto err_out;
> diff --git a/drivers/gpu/drm/xe/xe_svm.h
> b/drivers/gpu/drm/xe/xe_svm.h
> index 5b9d5ac9ef72..139acee41b42 100644
> --- a/drivers/gpu/drm/xe/xe_svm.h
> +++ b/drivers/gpu/drm/xe/xe_svm.h
> @@ -36,6 +36,8 @@ int xe_svm_handle_pagefault(struct xe_vm *vm,
> struct xe_vma *vma,
> bool atomic);
> bool xe_svm_has_mapping(struct xe_vm *vm, u64 start, u64 end);
>
> +void xe_svm_range_debug(struct xe_svm_range *range, const char
> *operation);
> +
> int xe_svm_bo_evict(struct xe_bo *bo);
>
> static inline bool xe_svm_range_pages_valid(struct xe_svm_range
> *range)
^ permalink raw reply [flat|nested] 129+ messages in thread* Re: [PATCH v2 26/29] drm/xe: Add SVM debug
2024-12-02 12:33 ` Thomas Hellström
@ 2024-12-17 1:05 ` Matthew Brost
0 siblings, 0 replies; 129+ messages in thread
From: Matthew Brost @ 2024-12-17 1:05 UTC (permalink / raw)
To: Thomas Hellström
Cc: intel-xe, dri-devel, apopple, airlied, christian.koenig,
simona.vetter, felix.kuehling, dakr
On Mon, Dec 02, 2024 at 01:33:29PM +0100, Thomas Hellström wrote:
> On Tue, 2024-10-15 at 20:25 -0700, Matthew Brost wrote:
> > Add some useful SVM debug logging.
> >
> > v2:
> > - Update logging with latest structure layout
> >
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> > drivers/gpu/drm/xe/xe_pt.c | 8 +++
> > drivers/gpu/drm/xe/xe_svm.c | 101 +++++++++++++++++++++++++++++++---
> > --
> > drivers/gpu/drm/xe/xe_svm.h | 2 +
> > 3 files changed, 99 insertions(+), 12 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
> > index 687abd1a5e74..75f548ebe2b3 100644
> > --- a/drivers/gpu/drm/xe/xe_pt.c
> > +++ b/drivers/gpu/drm/xe/xe_pt.c
> > @@ -632,6 +632,7 @@ xe_pt_stage_bind(struct xe_tile *tile, struct
> > xe_vma *vma,
> > /* Move this entire thing to xe_svm.c? */
> > xe_svm_notifier_lock(xe_vma_vm(vma));
> > if (!xe_svm_range_pages_valid(range)) {
> > + xe_svm_range_debug(range, "BIND PREPARE -
> > RETRY");
> > xe_svm_notifier_unlock(xe_vma_vm(vma));
> > return -EAGAIN;
> > }
> > @@ -640,6 +641,10 @@ xe_pt_stage_bind(struct xe_tile *tile, struct
> > xe_vma *vma,
> > range->base.va.end - range-
> > >base.va.start,
> > &curs);
> > is_devmem = xe_res_is_vram(&curs);
> > + if (is_devmem)
> > + xe_svm_range_debug(range, "BIND
> > PREPARE - DMA VRAM");
> > + else
> > + xe_svm_range_debug(range, "BIND
> > PREPARE - DMA");
> > } else {
> > xe_assert(xe, false);
> > }
> > @@ -1397,10 +1402,13 @@ static int xe_pt_svm_pre_commit(struct
> > xe_migrate_pt_update *pt_update)
> > if (op->subop == XE_VMA_SUBOP_UNMAP_RANGE)
> > continue;
> >
> > + xe_svm_range_debug(range, "PRE-COMMIT");
> > +
> > xe_assert(vm->xe, xe_vma_is_system_allocator(op-
> > >map_range.vma));
> > xe_assert(vm->xe, op->subop ==
> > XE_VMA_SUBOP_MAP_RANGE);
> >
> > if (!xe_svm_range_pages_valid(range)) {
> > + xe_svm_range_debug(range, "PRE-COMMIT -
> > RETRY");
> > xe_svm_notifier_unlock(vm);
> > return -EAGAIN;
> > }
> > diff --git a/drivers/gpu/drm/xe/xe_svm.c
> > b/drivers/gpu/drm/xe/xe_svm.c
> > index 555bc71ae523..acf2a3750f38 100644
> > --- a/drivers/gpu/drm/xe/xe_svm.c
> > +++ b/drivers/gpu/drm/xe/xe_svm.c
> > @@ -14,6 +14,18 @@
> > #include "xe_vm.h"
> > #include "xe_vm_types.h"
> >
> > +static bool xe_svm_range_in_vram(struct xe_svm_range *range)
> > +{
> > + /* Not reliable without notifier lock */
>
> lockdep assert?
>
Ah, no. We call this from the debug code, which doesn't hold this lock,
so it is best effort there. The comment is saying don't call this and
expect it to be reliable without the lock.
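
To make the trade-off concrete, a hypothetical sketch (not part of the
series): a locked variant could assert the notifier lock while the
debug path keeps the unlocked, best-effort read. The
xe_svm_range_in_vram_locked() name and the
xe_svm_notifier_lockdep_assert() helper are made up for illustration:

static bool xe_svm_range_in_vram_locked(struct xe_svm_range *range)
{
	/* Reliable: caller must hold the GPUSVM notifier lock */
	xe_svm_notifier_lockdep_assert(range_to_vm(&range->base));

	return range->base.flags.has_devmem_pages;
}

static bool xe_svm_range_in_vram(struct xe_svm_range *range)
{
	/* Best effort: debug logging calls this without the lock */
	return range->base.flags.has_devmem_pages;
}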
> > + return range->base.flags.has_devmem_pages;
> > +}
> > +
> > +static bool xe_svm_range_has_vram_binding(struct xe_svm_range
> > *range)
> > +{
> > + /* Not reliable without notifier lock */
>
> lockdep assert?
>
Same here.
> > + return xe_svm_range_in_vram(range) && range->tile_present;
> > +}
> > +
> > static struct xe_vm *gpusvm_to_vm(struct drm_gpusvm *gpusvm)
> > {
> > return container_of(gpusvm, struct xe_vm, svm.gpusvm);
> > @@ -24,6 +36,23 @@ static struct xe_vm *range_to_vm(struct
> > drm_gpusvm_range *r)
> > return gpusvm_to_vm(r->gpusvm);
> > }
> >
> > +#define range_debug(r__,
> > operaton__) \
> > + vm_dbg(&range_to_vm(&(r__)->base)->xe-
> > >drm, \
> > + "%s: asid=%u, gpusvm=0x%016llx, vram=%d,%d,
> > seqno=%lu, " \
> > + "start=0x%014llx, end=0x%014llx,
> > size=%llu", \
> > + (operaton__), range_to_vm(&(r__)->base)-
> > >usm.asid, \
> > + (u64)(r__)-
> > >base.gpusvm, \
> > + xe_svm_range_in_vram((r__)) ? 1 :
> > 0, \
> > + xe_svm_range_has_vram_binding((r__)) ? 1 :
> > 0, \
> > + (r__)-
> > >base.notifier_seq, \
> > + (r__)->base.va.start, (r__)-
> > >base.va.end, \
> > + (r__)->base.va.end - (r__)->base.va.start)
> > +
> > +void xe_svm_range_debug(struct xe_svm_range *range, const char
> > *operation)
> > +{
> > + range_debug(range, operation);
> > +}
> > +
> > static void *xe_svm_devm_owner(struct xe_device *xe)
> > {
> > return xe;
> > @@ -61,6 +90,8 @@ xe_svm_garbage_collector_add_range(struct xe_vm
> > *vm, struct xe_svm_range *range,
> > {
> > struct xe_device *xe = vm->xe;
> >
> > + range_debug(range, "GARBAGE COLLECTOR ADD");
> > +
> > drm_gpusvm_range_set_unmapped(&range->base, mmu_range);
> >
> > spin_lock(&vm->svm.garbage_collector.lock);
> > @@ -84,10 +115,14 @@ xe_svm_range_notifier_event_begin(struct xe_vm
> > *vm, struct drm_gpusvm_range *r,
> > u8 tile_mask = 0;
> > u8 id;
> >
> > + range_debug(range, "NOTIFIER");
> > +
> > /* Skip if already unmapped or if no binding exist */
> > if (range->base.flags.unmapped || !range->tile_present)
> > return 0;
> >
> > + range_debug(range, "NOTIFIER - EXECUTE");
> > +
> > /* Adjust invalidation to range boundaries */
> > if (range->base.va.start < mmu_range->start)
> > *adj_start = range->base.va.start;
> > @@ -139,6 +174,11 @@ static void xe_svm_invalidate(struct drm_gpusvm
> > *gpusvm,
> > if (xe_vm_is_closed(vm))
> > return;
> >
> > + vm_dbg(&gpusvm_to_vm(gpusvm)->xe->drm,
> > + "INVALIDATE: asid=%u, gpusvm=0x%016llx, seqno=%lu,
> > start=0x%016lx, end=0x%016lx, event=%d",
> > + vm->usm.asid, (u64)gpusvm, notifier-
> > >notifier.invalidate_seq,
> > + mmu_range->start, mmu_range->end, mmu_range->event);
> > +
> > /* Adjust invalidation to notifier boundaries */
> > if (adj_start < notifier->interval.start)
> > adj_start = notifier->interval.start;
> > @@ -218,8 +258,12 @@ static int __xe_svm_garbage_collector(struct
> > xe_vm *vm,
> > {
> > struct dma_fence *fence;
> >
> > - if (IS_DGFX(vm->xe) && range->base.flags.partial_unmap)
> > + range_debug(range, "GARBAGE COLLECTOR");
> > +
> > + if (IS_DGFX(vm->xe) && range->base.flags.partial_unmap) {
> > + range_debug(range, "GARBAGE COLLECTOT - EVICT");
> Typo COLLECTOT
>
Will fix.
Matt
> With those fixed,
> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>
>
>
>
> > drm_gpusvm_range_evict(&vm->svm.gpusvm, &range-
> > >base);
> > + }
> >
> > xe_vm_lock(vm, false);
> > fence = xe_vm_range_unbind(vm, range);
> > @@ -350,16 +394,23 @@ static int xe_svm_copy(struct page **pages,
> > dma_addr_t *dma_addr,
> > int incr = (match && last) ? 1 : 0;
> >
> > if (vram_addr != VRAM_ADDR_INVALID) {
> > - if (sram)
> > + if (sram) {
> > + vm_dbg(&tile->xe->drm,
> > + "COPY TO SRAM -
> > 0x%016llx -> 0x%016llx, NPAGES=%ld",
> > + vram_addr,
> > dma_addr[pos], i - pos + incr);
> > __fence =
> > xe_migrate_from_vram(tile->migrate,
> >
> > i - pos + incr,
> >
> > vram_addr,
> >
> > dma_addr + pos);
> > - else
> > + } else {
> > + vm_dbg(&tile->xe->drm,
> > + "COPY TO VRAM -
> > 0x%016llx -> 0x%016llx, NPAGES=%ld",
> > + dma_addr[pos],
> > vram_addr, i - pos + incr);
> > __fence =
> > xe_migrate_to_vram(tile->migrate,
> >
> > i - pos + incr,
> >
> > dma_addr + pos,
> >
> > vram_addr);
> > + }
> > if (IS_ERR(__fence)) {
> > err = PTR_ERR(__fence);
> > goto err_out;
> > @@ -377,14 +428,21 @@ static int xe_svm_copy(struct page **pages,
> > dma_addr_t *dma_addr,
> > }
> >
> > if (!match && last && dma_addr[i] && spage)
> > {
> > - if (sram)
> > + if (sram) {
> > + vm_dbg(&tile->xe->drm,
> > + "COPY TO SRAM -
> > 0x%016llx -> 0x%016llx, NPAGES=%d",
> > + vram_addr,
> > dma_addr[pos], 1);
> > __fence =
> > xe_migrate_from_vram(tile->migrate, 1,
> >
> > vram_addr,
> >
> > dma_addr + pos);
> > - else
> > + } else {
> > + vm_dbg(&tile->xe->drm,
> > + "COPY TO VRAM -
> > 0x%016llx -> 0x%016llx, NPAGES=%d",
> > + dma_addr[pos],
> > vram_addr, 1);
> > __fence =
> > xe_migrate_to_vram(tile->migrate, 1,
> >
> > dma_addr + pos,
> >
> > vram_addr);
> > + }
> > if (IS_ERR(__fence)) {
> > err = PTR_ERR(__fence);
> > goto err_out;
> > @@ -554,12 +612,14 @@ static struct xe_bo *xe_svm_alloc_vram(struct
> > xe_vm *vm, struct xe_tile *tile,
> > const struct drm_gpusvm_ctx
> > *ctx)
> > {
> > struct xe_mem_region *mr = tile_to_mr(tile);
> > + struct drm_buddy *buddy = tile_to_buddy(tile);
> > struct drm_buddy_block *block;
> > struct list_head *blocks;
> > struct xe_bo *bo;
> > ktime_t end = 0;
> > int err;
> >
> > + range_debug(range, "ALLOCATE VRAM");
> > retry:
> > xe_vm_lock(vm, false);
> > bo = xe_bo_create(tile_to_xe(tile), tile, vm, range-
> > >base.va.end -
> > @@ -582,8 +642,13 @@ static struct xe_bo *xe_svm_alloc_vram(struct
> > xe_vm *vm, struct xe_tile *tile,
> > range->base.va.start);
> >
> > blocks = &to_xe_ttm_vram_mgr_resource(bo->ttm.resource)-
> > >blocks;
> > - list_for_each_entry(block, blocks, link)
> > + list_for_each_entry(block, blocks, link) {
> > + vm_dbg(&vm->xe->drm, "ALLOC VRAM: asid=%u,
> > gpusvm=0x%016llx, pfn=%llu, npages=%llu",
> > + vm->usm.asid, (u64)&vm->svm.gpusvm,
> > + block_offset_to_pfn(mr,
> > drm_buddy_block_offset(block)),
> > + drm_buddy_block_size(buddy, block) >>
> > PAGE_SHIFT);
> > block->private = mr;
> > + }
> >
> > /*
> > * Take ref because as soon as drm_gpusvm_migrate_to_devmem
> > succeeds the
> > @@ -637,6 +702,8 @@ int xe_svm_handle_pagefault(struct xe_vm *vm,
> > struct xe_vma *vma,
> > if (xe_svm_range_is_valid(range, tile))
> > return 0;
> >
> > + range_debug(range, "PAGE FAULT");
> > +
> > /* XXX: Add migration policy, for now migrate range once */
> > if (IS_DGFX(vm->xe) && !range->migrated &&
> > range->base.flags.migrate_devmem &&
> > @@ -646,25 +713,33 @@ int xe_svm_handle_pagefault(struct xe_vm *vm,
> > struct xe_vma *vma,
> > bo = xe_svm_alloc_vram(vm, tile, range, &ctx);
> > if (IS_ERR(bo)) {
> > drm_info(&vm->xe->drm,
> > - "VRAM allocation failed, falling
> > back to retrying, asid=%u, errno %ld\n",
> > - vm->usm.asid, PTR_ERR(bo));
> > + "VRAM allocation failed, falling
> > back to retrying, asid=%u, gpusvm=0x%016llx, errno %ld\n",
> > + vm->usm.asid, (u64)&vm->svm.gpusvm,
> > + PTR_ERR(bo));
> > bo = NULL;
> > goto retry;
> > }
> > }
> >
> > + range_debug(range, "GET PAGES");
> > err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r, &ctx);
> > - if (err == -EFAULT || err == -EPERM) /* Corner where CPU
> > mappings have change */
> > if (err == -EOPNOTSUPP || err == -EFAULT || err == -EPERM)
> > { /* Corner where CPU mappings have change */
> > - if (err == -EOPNOTSUPP)
> > + if (err == -EOPNOTSUPP) {
> > + range_debug(range, "PAGE FAULT - EVICT
> > PAGES");
> > drm_gpusvm_range_evict(&vm->svm.gpusvm,
> > &range->base);
> > + }
> > drm_info(&vm->xe->drm,
> > "Get pages failed, falling back to
> > retrying, asid=%u, gpusvm=0x%016llx, errno %d\n",
> > vm->usm.asid, (u64)&vm->svm.gpusvm, err);
> > + range_debug(range, "PAGE FAULT - RETRY PAGES");
> > goto retry;
> > }
> > - if (err)
> > + if (err) {
> > + range_debug(range, "PAGE FAULT - FAIL PAGE
> > COLLECT");
> > goto err_out;
> > + }
> > +
> > + range_debug(range, "PAGE FAULT - BIND");
> >
> > retry_bind:
> > drm_exec_init(&exec, 0, 0);
> > @@ -680,8 +755,10 @@ int xe_svm_handle_pagefault(struct xe_vm *vm,
> > struct xe_vma *vma,
> > if (IS_ERR(fence)) {
> > drm_exec_fini(&exec);
> > err = PTR_ERR(fence);
> > - if (err == -EAGAIN)
> > + if (err == -EAGAIN) {
> > + range_debug(range, "PAGE FAULT -
> > RETRY BIND");
> > goto retry;
> > + }
> > if (xe_vm_validate_should_retry(&exec, err,
> > &end))
> > goto retry_bind;
> > goto err_out;
> > diff --git a/drivers/gpu/drm/xe/xe_svm.h
> > b/drivers/gpu/drm/xe/xe_svm.h
> > index 5b9d5ac9ef72..139acee41b42 100644
> > --- a/drivers/gpu/drm/xe/xe_svm.h
> > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > @@ -36,6 +36,8 @@ int xe_svm_handle_pagefault(struct xe_vm *vm,
> > struct xe_vma *vma,
> > bool atomic);
> > bool xe_svm_has_mapping(struct xe_vm *vm, u64 start, u64 end);
> >
> > +void xe_svm_range_debug(struct xe_svm_range *range, const char
> > *operation);
> > +
> > int xe_svm_bo_evict(struct xe_bo *bo);
> >
> > static inline bool xe_svm_range_pages_valid(struct xe_svm_range
> > *range)
>
^ permalink raw reply [flat|nested] 129+ messages in thread
* [PATCH v2 27/29] drm/xe: Add modparam for SVM notifier size
2024-10-16 3:24 [PATCH v2 00/29] Introduce GPU SVM and Xe SVM implementation Matthew Brost
` (25 preceding siblings ...)
2024-10-16 3:25 ` [PATCH v2 26/29] drm/xe: Add SVM debug Matthew Brost
@ 2024-10-16 3:25 ` Matthew Brost
2024-12-02 12:37 ` Thomas Hellström
2024-10-16 3:25 ` [PATCH v2 28/29] drm/xe: Add always_migrate_to_vram modparam Matthew Brost
` (4 subsequent siblings)
31 siblings, 1 reply; 129+ messages in thread
From: Matthew Brost @ 2024-10-16 3:25 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
Useful to experiment with notifier size and how it affects performance.
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
drivers/gpu/drm/xe/xe_module.h | 1 +
drivers/gpu/drm/xe/xe_svm.c | 5 +++--
2 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_module.h b/drivers/gpu/drm/xe/xe_module.h
index 161a5e6f717f..5a3bfea8b7b4 100644
--- a/drivers/gpu/drm/xe/xe_module.h
+++ b/drivers/gpu/drm/xe/xe_module.h
@@ -22,6 +22,7 @@ struct xe_modparam {
unsigned int max_vfs;
#endif
int wedged_mode;
+ u32 svm_notifier_size;
};
extern struct xe_modparam xe_modparam;
diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index acf2a3750f38..16e34aaead79 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -8,6 +8,7 @@
#include "xe_bo.h"
#include "xe_gt_tlb_invalidation.h"
#include "xe_migrate.h"
+#include "xe_module.h"
#include "xe_pt.h"
#include "xe_svm.h"
#include "xe_ttm_vram_mgr.h"
@@ -573,8 +574,8 @@ int xe_svm_init(struct xe_vm *vm)
return drm_gpusvm_init(&vm->svm.gpusvm, "Xe SVM", &vm->xe->drm,
current->mm, xe_svm_devm_owner(vm->xe), 0,
- vm->size, SZ_512M, &gpusvm_ops,
- fault_chunk_sizes,
+ vm->size, xe_modparam.svm_notifier_size * SZ_1M,
+ &gpusvm_ops, fault_chunk_sizes,
ARRAY_SIZE(fault_chunk_sizes));
}
--
2.34.1
^ permalink raw reply related [flat|nested] 129+ messages in thread* Re: [PATCH v2 27/29] drm/xe: Add modparam for SVM notifier size
2024-10-16 3:25 ` [PATCH v2 27/29] drm/xe: Add modparam for SVM notifier size Matthew Brost
@ 2024-12-02 12:37 ` Thomas Hellström
2024-12-11 19:50 ` Matthew Brost
0 siblings, 1 reply; 129+ messages in thread
From: Thomas Hellström @ 2024-12-02 12:37 UTC (permalink / raw)
To: Matthew Brost, intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, simona.vetter, felix.kuehling,
dakr
On Tue, 2024-10-15 at 20:25 -0700, Matthew Brost wrote:
> Useful to experiment with notifier size and how it affects
> performance.
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
> drivers/gpu/drm/xe/xe_module.h | 1 +
> drivers/gpu/drm/xe/xe_svm.c | 5 +++--
> 2 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_module.h
> b/drivers/gpu/drm/xe/xe_module.h
> index 161a5e6f717f..5a3bfea8b7b4 100644
> --- a/drivers/gpu/drm/xe/xe_module.h
> +++ b/drivers/gpu/drm/xe/xe_module.h
> @@ -22,6 +22,7 @@ struct xe_modparam {
> unsigned int max_vfs;
> #endif
> int wedged_mode;
> + u32 svm_notifier_size;
Hmm. Shouldn't this be assigned and documented somewhere?
Thanks,
Thomas
> };
>
> extern struct xe_modparam xe_modparam;
> diff --git a/drivers/gpu/drm/xe/xe_svm.c
> b/drivers/gpu/drm/xe/xe_svm.c
> index acf2a3750f38..16e34aaead79 100644
> --- a/drivers/gpu/drm/xe/xe_svm.c
> +++ b/drivers/gpu/drm/xe/xe_svm.c
> @@ -8,6 +8,7 @@
> #include "xe_bo.h"
> #include "xe_gt_tlb_invalidation.h"
> #include "xe_migrate.h"
> +#include "xe_module.h"
> #include "xe_pt.h"
> #include "xe_svm.h"
> #include "xe_ttm_vram_mgr.h"
> @@ -573,8 +574,8 @@ int xe_svm_init(struct xe_vm *vm)
>
> return drm_gpusvm_init(&vm->svm.gpusvm, "Xe SVM", &vm->xe-
> >drm,
> current->mm, xe_svm_devm_owner(vm-
> >xe), 0,
> - vm->size, SZ_512M, &gpusvm_ops,
> - fault_chunk_sizes,
> + vm->size,
> xe_modparam.svm_notifier_size * SZ_1M,
> + &gpusvm_ops, fault_chunk_sizes,
> ARRAY_SIZE(fault_chunk_sizes));
> }
>
^ permalink raw reply [flat|nested] 129+ messages in thread* Re: [PATCH v2 27/29] drm/xe: Add modparam for SVM notifier size
2024-12-02 12:37 ` Thomas Hellström
@ 2024-12-11 19:50 ` Matthew Brost
0 siblings, 0 replies; 129+ messages in thread
From: Matthew Brost @ 2024-12-11 19:50 UTC (permalink / raw)
To: Thomas Hellström
Cc: intel-xe, dri-devel, apopple, airlied, christian.koenig,
simona.vetter, felix.kuehling, dakr
On Mon, Dec 02, 2024 at 01:37:46PM +0100, Thomas Hellström wrote:
> On Tue, 2024-10-15 at 20:25 -0700, Matthew Brost wrote:
> > Useful to experiment with notifier size and how it affects
> > performance.
> >
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> > drivers/gpu/drm/xe/xe_module.h | 1 +
> > drivers/gpu/drm/xe/xe_svm.c | 5 +++--
> > 2 files changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_module.h
> > b/drivers/gpu/drm/xe/xe_module.h
> > index 161a5e6f717f..5a3bfea8b7b4 100644
> > --- a/drivers/gpu/drm/xe/xe_module.h
> > +++ b/drivers/gpu/drm/xe/xe_module.h
> > @@ -22,6 +22,7 @@ struct xe_modparam {
> > unsigned int max_vfs;
> > #endif
> > int wedged_mode;
> > + u32 svm_notifier_size;
>
> Hmm. Shouldn't this be assigned and documented somewhere?
>
Yes, the following patch does this - this was a mistake in the rebase. Will fix.
Matt
> Thanks,
> Thomas
>
>
>
> > };
> >
> > extern struct xe_modparam xe_modparam;
> > diff --git a/drivers/gpu/drm/xe/xe_svm.c
> > b/drivers/gpu/drm/xe/xe_svm.c
> > index acf2a3750f38..16e34aaead79 100644
> > --- a/drivers/gpu/drm/xe/xe_svm.c
> > +++ b/drivers/gpu/drm/xe/xe_svm.c
> > @@ -8,6 +8,7 @@
> > #include "xe_bo.h"
> > #include "xe_gt_tlb_invalidation.h"
> > #include "xe_migrate.h"
> > +#include "xe_module.h"
> > #include "xe_pt.h"
> > #include "xe_svm.h"
> > #include "xe_ttm_vram_mgr.h"
> > @@ -573,8 +574,8 @@ int xe_svm_init(struct xe_vm *vm)
> >
> > return drm_gpusvm_init(&vm->svm.gpusvm, "Xe SVM", &vm->xe-
> > >drm,
> > current->mm, xe_svm_devm_owner(vm-
> > >xe), 0,
> > - vm->size, SZ_512M, &gpusvm_ops,
> > - fault_chunk_sizes,
> > + vm->size,
> > xe_modparam.svm_notifier_size * SZ_1M,
> > + &gpusvm_ops, fault_chunk_sizes,
> > ARRAY_SIZE(fault_chunk_sizes));
> > }
> >
>
^ permalink raw reply [flat|nested] 129+ messages in thread
* [PATCH v2 28/29] drm/xe: Add always_migrate_to_vram modparam
2024-10-16 3:24 [PATCH v2 00/29] Introduce GPU SVM and Xe SVM implementation Matthew Brost
` (26 preceding siblings ...)
2024-10-16 3:25 ` [PATCH v2 27/29] drm/xe: Add modparam for SVM notifier size Matthew Brost
@ 2024-10-16 3:25 ` Matthew Brost
2024-12-02 12:40 ` Thomas Hellström
2024-10-16 3:25 ` [PATCH v2 29/29] drm/doc: gpusvm: Add GPU SVM documentation Matthew Brost
` (3 subsequent siblings)
31 siblings, 1 reply; 129+ messages in thread
From: Matthew Brost @ 2024-10-16 3:25 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
Used to show we can bounce memory multiple times.
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
drivers/gpu/drm/xe/xe_module.c | 7 +++++++
drivers/gpu/drm/xe/xe_module.h | 1 +
drivers/gpu/drm/xe/xe_svm.c | 3 +++
3 files changed, 11 insertions(+)
diff --git a/drivers/gpu/drm/xe/xe_module.c b/drivers/gpu/drm/xe/xe_module.c
index 77ce9f9ca7a5..088f6caea307 100644
--- a/drivers/gpu/drm/xe/xe_module.c
+++ b/drivers/gpu/drm/xe/xe_module.c
@@ -25,9 +25,16 @@ struct xe_modparam xe_modparam = {
.max_vfs = IS_ENABLED(CONFIG_DRM_XE_DEBUG) ? ~0 : 0,
#endif
.wedged_mode = 1,
+ .svm_notifier_size = 512,
/* the rest are 0 by default */
};
+module_param_named(svm_notifier_size, xe_modparam.svm_notifier_size, uint, 0600);
+MODULE_PARM_DESC(svm_notifier_size, "Set the svm notifier size(in MiB), must be pow2");
+
+module_param_named(always_migrate_to_vram, xe_modparam.always_migrate_to_vram, bool, 0444);
+MODULE_PARM_DESC(always_migrate_to_vram, "Always migrate to VRAM on GPU fault");
+
module_param_named_unsafe(force_execlist, xe_modparam.force_execlist, bool, 0444);
MODULE_PARM_DESC(force_execlist, "Force Execlist submission");
diff --git a/drivers/gpu/drm/xe/xe_module.h b/drivers/gpu/drm/xe/xe_module.h
index 5a3bfea8b7b4..84339e509c80 100644
--- a/drivers/gpu/drm/xe/xe_module.h
+++ b/drivers/gpu/drm/xe/xe_module.h
@@ -12,6 +12,7 @@
struct xe_modparam {
bool force_execlist;
bool probe_display;
+ bool always_migrate_to_vram;
u32 force_vram_bar_size;
int guc_log_level;
char *guc_firmware_path;
diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index 16e34aaead79..bb386f56a189 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -767,6 +767,9 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
}
drm_exec_fini(&exec);
+ if (xe_modparam.always_migrate_to_vram)
+ range->migrated = false;
+
dma_fence_wait(fence, false);
dma_fence_put(fence);
--
2.34.1
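To illustrate the intent, roughly this kind of user-space loop is what the
modparam is meant to exercise - purely a sketch; submit_gpu_write() is a
made-up helper standing in for whatever path triggers a GPU access to the
buffer:

	/* with always_migrate_to_vram=1, pages bounce between system RAM
	 * and VRAM on every iteration */
	for (i = 0; i < loops; i++) {
		buf[0] = i;             /* CPU touch -> migrate_to_ram() pulls pages back */
		submit_gpu_write(buf);  /* GPU fault -> pages migrated to VRAM again */
	}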
^ permalink raw reply related [flat|nested] 129+ messages in thread* Re: [PATCH v2 28/29] drm/xe: Add always_migrate_to_vram modparam
2024-10-16 3:25 ` [PATCH v2 28/29] drm/xe: Add always_migrate_to_vram modparam Matthew Brost
@ 2024-12-02 12:40 ` Thomas Hellström
2024-12-11 19:51 ` Matthew Brost
0 siblings, 1 reply; 129+ messages in thread
From: Thomas Hellström @ 2024-12-02 12:40 UTC (permalink / raw)
To: Matthew Brost, intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, simona.vetter, felix.kuehling,
dakr
On Tue, 2024-10-15 at 20:25 -0700, Matthew Brost wrote:
> Used to show we can bounce memory multiple times.
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
> drivers/gpu/drm/xe/xe_module.c | 7 +++++++
> drivers/gpu/drm/xe/xe_module.h | 1 +
> drivers/gpu/drm/xe/xe_svm.c | 3 +++
> 3 files changed, 11 insertions(+)
>
> diff --git a/drivers/gpu/drm/xe/xe_module.c
> b/drivers/gpu/drm/xe/xe_module.c
> index 77ce9f9ca7a5..088f6caea307 100644
> --- a/drivers/gpu/drm/xe/xe_module.c
> +++ b/drivers/gpu/drm/xe/xe_module.c
> @@ -25,9 +25,16 @@ struct xe_modparam xe_modparam = {
> .max_vfs = IS_ENABLED(CONFIG_DRM_XE_DEBUG) ? ~0 : 0,
> #endif
> .wedged_mode = 1,
> + .svm_notifier_size = 512,
> /* the rest are 0 by default */
> };
>
> +module_param_named(svm_notifier_size, xe_modparam.svm_notifier_size,
> uint, 0600);
> +MODULE_PARM_DESC(svm_notifier_size, "Set the svm notifier size(in
> MiB), must be pow2");
Ah, this should probably have been in the previous patch?
pow2 could be spelled out "a power of 2"?
> +
> +module_param_named(always_migrate_to_vram,
> xe_modparam.always_migrate_to_vram, bool, 0444);
> +MODULE_PARM_DESC(always_migrate_to_vram, "Always migrate to VRAM on
> GPU fault");
> +
> module_param_named_unsafe(force_execlist,
> xe_modparam.force_execlist, bool, 0444);
> MODULE_PARM_DESC(force_execlist, "Force Execlist submission");
>
> diff --git a/drivers/gpu/drm/xe/xe_module.h
> b/drivers/gpu/drm/xe/xe_module.h
> index 5a3bfea8b7b4..84339e509c80 100644
> --- a/drivers/gpu/drm/xe/xe_module.h
> +++ b/drivers/gpu/drm/xe/xe_module.h
> @@ -12,6 +12,7 @@
> struct xe_modparam {
> bool force_execlist;
> bool probe_display;
> + bool always_migrate_to_vram;
Kerneldoc
Thanks,
Thomas
> u32 force_vram_bar_size;
> int guc_log_level;
> char *guc_firmware_path;
> diff --git a/drivers/gpu/drm/xe/xe_svm.c
> b/drivers/gpu/drm/xe/xe_svm.c
> index 16e34aaead79..bb386f56a189 100644
> --- a/drivers/gpu/drm/xe/xe_svm.c
> +++ b/drivers/gpu/drm/xe/xe_svm.c
> @@ -767,6 +767,9 @@ int xe_svm_handle_pagefault(struct xe_vm *vm,
> struct xe_vma *vma,
> }
> drm_exec_fini(&exec);
>
> + if (xe_modparam.always_migrate_to_vram)
> + range->migrated = false;
> +
> dma_fence_wait(fence, false);
> dma_fence_put(fence);
>
^ permalink raw reply [flat|nested] 129+ messages in thread* Re: [PATCH v2 28/29] drm/xe: Add always_migrate_to_vram modparam
2024-12-02 12:40 ` Thomas Hellström
@ 2024-12-11 19:51 ` Matthew Brost
0 siblings, 0 replies; 129+ messages in thread
From: Matthew Brost @ 2024-12-11 19:51 UTC (permalink / raw)
To: Thomas Hellström
Cc: intel-xe, dri-devel, apopple, airlied, christian.koenig,
simona.vetter, felix.kuehling, dakr
On Mon, Dec 02, 2024 at 01:40:20PM +0100, Thomas Hellström wrote:
> On Tue, 2024-10-15 at 20:25 -0700, Matthew Brost wrote:
> > Used to show we can bounce memory multiple times.
> >
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> > drivers/gpu/drm/xe/xe_module.c | 7 +++++++
> > drivers/gpu/drm/xe/xe_module.h | 1 +
> > drivers/gpu/drm/xe/xe_svm.c | 3 +++
> > 3 files changed, 11 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_module.c
> > b/drivers/gpu/drm/xe/xe_module.c
> > index 77ce9f9ca7a5..088f6caea307 100644
> > --- a/drivers/gpu/drm/xe/xe_module.c
> > +++ b/drivers/gpu/drm/xe/xe_module.c
> > @@ -25,9 +25,16 @@ struct xe_modparam xe_modparam = {
> > .max_vfs = IS_ENABLED(CONFIG_DRM_XE_DEBUG) ? ~0 : 0,
> > #endif
> > .wedged_mode = 1,
> > + .svm_notifier_size = 512,
> > /* the rest are 0 by default */
> > };
> >
> > +module_param_named(svm_notifier_size, xe_modparam.svm_notifier_size,
> > uint, 0600);
> > +MODULE_PARM_DESC(svm_notifier_size, "Set the svm notifier size(in
> > MiB), must be pow2");
>
> Ah, this should probably have been in the previous patch?
Yes.
> pow2 could be spelled out "a power of 2"?
And yes.
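Something along these lines, as a rough sketch (exact wording and the
fallback value are placeholders, not what will land):

	MODULE_PARM_DESC(svm_notifier_size,
			 "Set the SVM notifier size in MiB, must be a power of 2");

	/* plus a sanity check at init time, is_power_of_2() is from <linux/log2.h> */
	if (!is_power_of_2(xe_modparam.svm_notifier_size))
		xe_modparam.svm_notifier_size = 512;	/* fall back to the default */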
>
>
> > +
> > +module_param_named(always_migrate_to_vram,
> > xe_modparam.always_migrate_to_vram, bool, 0444);
> > +MODULE_PARM_DESC(always_migrate_to_vram, "Always migrate to VRAM on
> > GPU fault");
> > +
> > module_param_named_unsafe(force_execlist,
> > xe_modparam.force_execlist, bool, 0444);
> > MODULE_PARM_DESC(force_execlist, "Force Execlist submission");
> >
> > diff --git a/drivers/gpu/drm/xe/xe_module.h
> > b/drivers/gpu/drm/xe/xe_module.h
> > index 5a3bfea8b7b4..84339e509c80 100644
> > --- a/drivers/gpu/drm/xe/xe_module.h
> > +++ b/drivers/gpu/drm/xe/xe_module.h
> > @@ -12,6 +12,7 @@
> > struct xe_modparam {
> > bool force_execlist;
> > bool probe_display;
> > + bool always_migrate_to_vram;
>
> Kerneldoc
>
Will add.
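Roughly along the lines of the other members, e.g. (wording is a
placeholder):

	/** @always_migrate_to_vram: always migrate to VRAM on GPU fault (debug aid) */
	bool always_migrate_to_vram;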
Matt
> Thanks,
> Thomas
>
>
> > u32 force_vram_bar_size;
> > int guc_log_level;
> > char *guc_firmware_path;
> > diff --git a/drivers/gpu/drm/xe/xe_svm.c
> > b/drivers/gpu/drm/xe/xe_svm.c
> > index 16e34aaead79..bb386f56a189 100644
> > --- a/drivers/gpu/drm/xe/xe_svm.c
> > +++ b/drivers/gpu/drm/xe/xe_svm.c
> > @@ -767,6 +767,9 @@ int xe_svm_handle_pagefault(struct xe_vm *vm,
> > struct xe_vma *vma,
> > }
> > drm_exec_fini(&exec);
> >
> > + if (xe_modparam.always_migrate_to_vram)
> > + range->migrated = false;
> > +
> > dma_fence_wait(fence, false);
> > dma_fence_put(fence);
> >
>
^ permalink raw reply [flat|nested] 129+ messages in thread
* [PATCH v2 29/29] drm/doc: gpusvm: Add GPU SVM documentation
2024-10-16 3:24 [PATCH v2 00/29] Introduce GPU SVM and Xe SVM implementation Matthew Brost
` (27 preceding siblings ...)
2024-10-16 3:25 ` [PATCH v2 28/29] drm/xe: Add always_migrate_to_vram modparam Matthew Brost
@ 2024-10-16 3:25 ` Matthew Brost
2024-12-02 13:00 ` Thomas Hellström
2024-10-16 3:30 ` ✓ CI.Patch_applied: success for Introduce GPU SVM and Xe SVM implementation (rev2) Patchwork
` (2 subsequent siblings)
31 siblings, 1 reply; 129+ messages in thread
From: Matthew Brost @ 2024-10-16 3:25 UTC (permalink / raw)
To: intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, thomas.hellstrom,
simona.vetter, felix.kuehling, dakr
Add documentation for agreed-upon GPU SVM design principles, current
status, and future plans.
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
Documentation/gpu/rfc/gpusvm.rst | 70 ++++++++++++++++++++++++++++++++
Documentation/gpu/rfc/index.rst | 4 ++
2 files changed, 74 insertions(+)
create mode 100644 Documentation/gpu/rfc/gpusvm.rst
diff --git a/Documentation/gpu/rfc/gpusvm.rst b/Documentation/gpu/rfc/gpusvm.rst
new file mode 100644
index 000000000000..2d3f79a6c30a
--- /dev/null
+++ b/Documentation/gpu/rfc/gpusvm.rst
@@ -0,0 +1,70 @@
+===============
+GPU SVM Section
+===============
+
+Agreed upon design principles
+=============================
+
+* migrate_to_ram path
+ * Rely on core MM concepts (migration ptes, page refs, and page locking)
+ only
+ * No driver specific locks other than locks for hardware interaction in
+ this path
+ * Partial migration is supported
+ * Driver handles mixed migrations via retry loops rather than locking
+* Eviction
+ * Only looking at physical memory datastructures and locks
+ * No looking at mm/vma structs or relying on those being locked
+* GPU fault side
+ * mmap_read only used around core MM functions which require this lock
+ * Big retry loop to handle all races with the mmu notifier under the gpu
+ pagetable locks/mmu notifier range lock/whatever we end up calling
+ those
+ * Races (especially against concurrent eviction/migrate_to_ram) should
+ not be handled on the fault side by trying to hold locks
+* Physical memory to virtual backpointer
+ * Does not work, no pointers from physical memory to virtual should
+ exist
+* GPU pagetable locking
+ * Notifier lock only protects range tree, pages, pagetable entries, and
+ mmu notifier seqno tracking, it is not a global lock to protect
+ against races
+ * All races handled with big retry as mentioned above
+
+Overview of current design
+==========================
+
+Current design is simple as possible to get a working basline in which can built
+upon.
+
+.. kernel-doc:: drivers/gpu/drm/xe/drm_gpusvm.c
+ :doc: Overview
+ :doc: Locking
+ :doc: Migrataion
+ :doc: Partial Unmapping of Ranges
+ :doc: Examples
+
+Possible future design features
+===============================
+
+* Concurrent GPU faults
+ * CPU faults are concurrent so makes sense to have concurrent GPU faults
+ * Should be possible with fined grained locking in the driver GPU
+ fault handler
+ * No expected GPU SVM changes required
+* Ranges with mixed system and device pages
+ * Can be added if required to drm_gpusvm_get_pages fairly easily
+* Multi-GPU support
+ * Work in progress and patches expected after initially landing on GPU
+ SVM
+ * Ideally can be done with little to no changes to GPU SVM
+* Drop ranges in favor of radix tree
+ * May be desirable for faster notifiers
+* Compound device pages
+ * Nvidia, AMD, and Intel all have agreed expensive core MM functions in
+ migrate device layer are a performance bottleneck, having compound
+ device pages should help increase performance by reducing the number
+ of these expensive calls
+* Higher order dma mapping for migration
+ * 4k dma mapping adversely affects migration performance on Intel
+ hardware, higher order (2M) dma mapping should help here
diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
index 476719771eef..396e535377fb 100644
--- a/Documentation/gpu/rfc/index.rst
+++ b/Documentation/gpu/rfc/index.rst
@@ -16,6 +16,10 @@ host such documentation:
* Once the code has landed move all the documentation to the right places in
the main core, helper or driver sections.
+.. toctree::
+
+ gpusvm.rst
+
.. toctree::
i915_gem_lmem.rst
--
2.34.1
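To make the "big retry loop" principle above concrete, the GPU fault path
the document describes has roughly this shape - a sketch only, not the
actual drm_gpusvm/Xe code, and the helper names are illustrative:

	again:
		seq = mmu_interval_read_begin(&notifier->notifier);

		err = get_pages(range);		/* hmm_range_fault() under mmap_read_lock() */
		if (err)
			goto again;

		notifier_lock(gpusvm);		/* GPU pagetable / notifier lock */
		if (mmu_interval_read_retry(&notifier->notifier, seq)) {
			notifier_unlock(gpusvm);
			goto again;		/* raced with an invalidation - retry, no extra locks */
		}
		bind_gpu_pagetables(range);	/* commit while the seqno is known valid */
		notifier_unlock(gpusvm);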
^ permalink raw reply related [flat|nested] 129+ messages in thread* Re: [PATCH v2 29/29] drm/doc: gpusvm: Add GPU SVM documentation
2024-10-16 3:25 ` [PATCH v2 29/29] drm/doc: gpusvm: Add GPU SVM documentation Matthew Brost
@ 2024-12-02 13:00 ` Thomas Hellström
2024-12-17 23:14 ` Matthew Brost
0 siblings, 1 reply; 129+ messages in thread
From: Thomas Hellström @ 2024-12-02 13:00 UTC (permalink / raw)
To: Matthew Brost, intel-xe, dri-devel
Cc: apopple, airlied, christian.koenig, simona.vetter, felix.kuehling,
dakr
On Tue, 2024-10-15 at 20:25 -0700, Matthew Brost wrote:
> Add documentation for agreed-upon GPU SVM design principles, current
> status, and future plans.
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
> Documentation/gpu/rfc/gpusvm.rst | 70
> ++++++++++++++++++++++++++++++++
> Documentation/gpu/rfc/index.rst | 4 ++
> 2 files changed, 74 insertions(+)
> create mode 100644 Documentation/gpu/rfc/gpusvm.rst
>
> diff --git a/Documentation/gpu/rfc/gpusvm.rst
> b/Documentation/gpu/rfc/gpusvm.rst
> new file mode 100644
> index 000000000000..2d3f79a6c30a
> --- /dev/null
> +++ b/Documentation/gpu/rfc/gpusvm.rst
> @@ -0,0 +1,70 @@
> +===============
> +GPU SVM Section
> +===============
> +
> +Agreed upon design principles
> +=============================
> +
> +* migrate_to_ram path
> + * Rely on core MM concepts (migration ptes, page refs, and
> page locking)
> + only
> + * No driver specific locks other than locks for hardware
> interaction in
> + this path
We have previously been discussing the bo lock to protect the bo from
eviction during migrate, if the vram allocation is bo-based. This is a
cross-driver lock with a well-established locking order and I suggest
we allow this. Apart from that I think the above statement needs some
elaboration: what is the problem we are trying to avoid with driver-
specific locks, written so that it's easy to understand why it's a bad
idea?
> + * Partial migration is supported
Exactly what do we mean by partial migration.
> + * Driver handles mixed migrations via retry loops rather
> than locking
> +* Eviction
> + * Only looking at physical memory datastructures and locks
as opposed to...
> + * No looking at mm/vma structs or relying on those being
> locked
We're violating this with the current implementation, aren't we?
> +* GPU fault side
> + * mmap_read only used around core MM functions which require
> this lock
> + * Big retry loop to handle all races with the mmu notifier
> under the gpu
> + pagetable locks/mmu notifier range lock/whatever we end up
> calling
> + those
> + * Races (especially against concurrent
> eviction/migrate_to_ram) should
> + not be handled on the fault side by trying to hold locks
This actually contradicts my comment written above about using the bo
lock to block eviction here. The alternative would be to pin vram
allocations during migration until the mm_struct has references on the
allocation, but it'd be good to clarify exactly why locking here is a
bad idea, and why we can't rely on lockdep?
> +* Physical memory to virtual backpointer
> + * Does not work, no pointers from physical memory to virtual
> should
> + exist
We actually still have the private zdd structure, but strictly it points
not to virtual memory but to the allocation metadata. Is it verified that the
zone_device_data field is allowed to be modified by the pagemap between
allocation and migration?
> +* GPU pagetable locking
> + * Notifier lock only protects range tree, pages, pagetable
> entries, and
> + mmu notifier seqno tracking, it is not a global lock to
> protect
> + against races
> + * All races handled with big retry as mentioned above
Adding a note here about "pages valid" for subranges rather than
relying on the wider notifier seqno, i.e. a subrange can be valid even
if the notifier seqno says otherwise.
Performance considerations:
Perhaps mention that notifier (core mm) performance is more important
than gpu fault (driver) performance when considering optimizations that
improve one at the cost of the other?
> +
> +Overview of current design
> +==========================
> +
> +Current design is simple as possible to get a working basline in
> which can built
can be built
> +upon.
> +
> +.. kernel-doc:: drivers/gpu/drm/xe/drm_gpusvm.c
> + :doc: Overview
> + :doc: Locking
> + :doc: Migrataion
> + :doc: Partial Unmapping of Ranges
> + :doc: Examples
> +
> +Possible future design features
> +===============================
> +
> +* Concurrent GPU faults
> + * CPU faults are concurrent so makes sense to have
> concurrent GPU faults
> + * Should be possible with fined grained locking in the
> driver GPU
> + fault handler
> + * No expected GPU SVM changes required
> +* Ranges with mixed system and device pages
> + * Can be added if required to drm_gpusvm_get_pages fairly
> easily
> +* Multi-GPU support
> + * Work in progress and patches expected after initially
> landing on GPU
> + SVM
> + * Ideally can be done with little to no changes to GPU SVM
> +* Drop ranges in favor of radix tree
> + * May be desirable for faster notifiers
> +* Compound device pages
> + * Nvidia, AMD, and Intel all have agreed expensive core MM
> functions in
> + migrate device layer are a performance bottleneck, having
> compound
> + device pages should help increase performance by reducing
> the number
> + of these expensive calls
> +* Higher order dma mapping for migration
> + * 4k dma mapping adversely affects migration performance on
> Intel
> + hardware, higher order (2M) dma mapping should help here
> diff --git a/Documentation/gpu/rfc/index.rst
> b/Documentation/gpu/rfc/index.rst
> index 476719771eef..396e535377fb 100644
> --- a/Documentation/gpu/rfc/index.rst
> +++ b/Documentation/gpu/rfc/index.rst
> @@ -16,6 +16,10 @@ host such documentation:
> * Once the code has landed move all the documentation to the right
> places in
> the main core, helper or driver sections.
>
> +.. toctree::
> +
> + gpusvm.rst
> +
> .. toctree::
>
> i915_gem_lmem.rst
Thanks,
Thomas
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [PATCH v2 29/29] drm/doc: gpusvm: Add GPU SVM documentation
2024-12-02 13:00 ` Thomas Hellström
@ 2024-12-17 23:14 ` Matthew Brost
0 siblings, 0 replies; 129+ messages in thread
From: Matthew Brost @ 2024-12-17 23:14 UTC (permalink / raw)
To: Thomas Hellström
Cc: intel-xe, dri-devel, apopple, airlied, christian.koenig,
simona.vetter, felix.kuehling, dakr
On Mon, Dec 02, 2024 at 02:00:44PM +0100, Thomas Hellström wrote:
> On Tue, 2024-10-15 at 20:25 -0700, Matthew Brost wrote:
> > Add documentation for agreed-upon GPU SVM design principles, current
> > status, and future plans.
> >
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> > Documentation/gpu/rfc/gpusvm.rst | 70
> > ++++++++++++++++++++++++++++++++
> > Documentation/gpu/rfc/index.rst | 4 ++
> > 2 files changed, 74 insertions(+)
> > create mode 100644 Documentation/gpu/rfc/gpusvm.rst
> >
> > diff --git a/Documentation/gpu/rfc/gpusvm.rst
> > b/Documentation/gpu/rfc/gpusvm.rst
> > new file mode 100644
> > index 000000000000..2d3f79a6c30a
> > --- /dev/null
> > +++ b/Documentation/gpu/rfc/gpusvm.rst
> > @@ -0,0 +1,70 @@
> > +===============
> > +GPU SVM Section
> > +===============
> > +
> > +Agreed upon design principles
> > +=============================
> > +
> > +* migrate_to_ram path
> > + * Rely on core MM concepts (migration ptes, page refs, and
> > page locking)
> > + only
> > + * No driver specific locks other than locks for hardware
> > interaction in
> > + this path
>
> We have previously been discussing the bo lock to protect the bo from
> eviction during migrate, if the vram allocation is bo-based. This is a
> cross-driver lock with a well-established locking order and I suggest
> we allow this. Apart from that I think the above statement needs some
Not allowing additional locks was suggested by Sima and I think it makes
sense, but I do agree taking the dma-resv lock in migrate_to_ram would be
safe. However, the way GPU SVM is structured there are no hooks to enable
this.
> elaboration: What is the problem we are trying to avoid with driver-
> specific locks, written so that it's easy to understand it's a bad
> idea.
>
Sure, will try to elaborate.
> > + * Partial migration is supported
>
> Exactly what do we mean by partial migration.
>
Will elaborate.
> > + * Driver handles mixed migrations via retry loops rather
> > than locking
> > +* Eviction
> > + * Only looking at physical memory datastructures and locks
> as opposed to...
>
As opposed to looking at virtual memory data structures. Can elaborate on why
this is a bad idea too - basically mremap + fork fall apart when you
start looking at virtual things.
> > + * No looking at mm/vma structs or relying on those being
> > locked
> We're violating this with the current implementation, aren't we?
>
Aside from when calling migrate_vma_* or creating the initial range. Can
elaborate on this and certainly say drivers should not look at CPU VMAs.
>
> > +* GPU fault side
> > + * mmap_read only used around core MM functions which require
> > this lock
> > + * Big retry loop to handle all races with the mmu notifier
> > under the gpu
> > + pagetable locks/mmu notifier range lock/whatever we end up
> > calling
> > + those
> > + * Races (especially against concurrent
> > eviction/migrate_to_ram) should
> > + not be handled on the fault side by trying to hold locks
>
> This actually contradicts my comment written above about using the bo
> lock to block eviction here. The alternative would be to pin vram
> allocations during migration until the mm_struct has references on the
> allocation, but it'd be good to clarify exactly why locking here is a
> bad idea, and why we can't rely on lockdep?
>
I'll try to clarify.
> > +* Physical memory to virtual backpointer
> > + * Does not work, no pointers from physical memory to virtual
> > should
> > + exist
>
> We actually still have the private zdd structure, but it's strictly not
> to virtual but to the allocation metadata. Is it verified that the
> zone_device_data field is allowed to be modified by the pagemap between
> allocation and migration?
>
We don't modify zdd between allocation and migration aside from the ref
count of the zdd.
>
> > +* GPU pagetable locking
> > + * Notifier lock only protects range tree, pages, pagetable
> > entries, and
> > + mmu notifier seqno tracking, it is not a global lock to
> > protect
> > + against races
> > + * All races handled with big retry as mentioned above
>
> Adding a note here about "pages valid" for subranges rather than
> relying on the wider notifier seqno, i.e. a subrange can be valid even
> if the notifier seqno says otherwise.
>
Sure.
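Something like this is the idea - a sketch with made-up field names, just
to capture the note:

	/* a range's pages can remain valid even if the notifier-wide seqno
	 * has moved on due to an invalidation elsewhere in the notifier;
	 * only an invalidation overlapping this range clears the flag */
	lockdep_assert_held(&gpusvm->notifier_lock);
	if (range->pages_valid)
		return true;
	return !mmu_interval_read_retry(&notifier->notifier, range->notifier_seq);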
> Performance considerations:
> Perhaps mention that notifier (core mm) performance is more important
> than gpu fault (driver) performance when considering optimizations that
> improve one at the cost of the other?
>
Hmm, that is kinda speculation IMO. I have heard that feedback but I'm
unsure if I agree with it, nor do we have any data to back up that claim.
I'd rather not write something down like this that is based on
speculation. I do agree we should profile the code to really understand
the hot spots and write down our findings once we have done that.
> > +
> > +Overview of current design
> > +==========================
> > +
> > +Current design is simple as possible to get a working basline in
> > which can built
>
> can be built
>
+1
Matt
> > +upon.
> > +
> > +.. kernel-doc:: drivers/gpu/drm/xe/drm_gpusvm.c
> > + :doc: Overview
> > + :doc: Locking
> > + :doc: Migrataion
> > + :doc: Partial Unmapping of Ranges
> > + :doc: Examples
> > +
> > +Possible future design features
> > +===============================
> > +
> > +* Concurrent GPU faults
> > + * CPU faults are concurrent so makes sense to have
> > concurrent GPU faults
> > + * Should be possible with fined grained locking in the
> > driver GPU
> > + fault handler
> > + * No expected GPU SVM changes required
> > +* Ranges with mixed system and device pages
> > + * Can be added if required to drm_gpusvm_get_pages fairly
> > easily
> > +* Multi-GPU support
> > + * Work in progress and patches expected after initially
> > landing on GPU
> > + SVM
> > + * Ideally can be done with little to no changes to GPU SVM
> > +* Drop ranges in favor of radix tree
> > + * May be desirable for faster notifiers
> > +* Compound device pages
> > + * Nvidia, AMD, and Intel all have agreed expensive core MM
> > functions in
> > + migrate device layer are a performance bottleneck, having
> > compound
> > + device pages should help increase performance by reducing
> > the number
> > + of these expensive calls
> > +* Higher order dma mapping for migration
> > + * 4k dma mapping adversely affects migration performance on
> > Intel
> > + hardware, higher order (2M) dma mapping should help here
> > diff --git a/Documentation/gpu/rfc/index.rst
> > b/Documentation/gpu/rfc/index.rst
> > index 476719771eef..396e535377fb 100644
> > --- a/Documentation/gpu/rfc/index.rst
> > +++ b/Documentation/gpu/rfc/index.rst
> > @@ -16,6 +16,10 @@ host such documentation:
> > * Once the code has landed move all the documentation to the right
> > places in
> > the main core, helper or driver sections.
> >
> > +.. toctree::
> > +
> > + gpusvm.rst
> > +
> > .. toctree::
> >
> > i915_gem_lmem.rst
>
> Thanks,
> Thomas
>
^ permalink raw reply [flat|nested] 129+ messages in thread
* ✓ CI.Patch_applied: success for Introduce GPU SVM and Xe SVM implementation (rev2)
2024-10-16 3:24 [PATCH v2 00/29] Introduce GPU SVM and Xe SVM implementation Matthew Brost
` (28 preceding siblings ...)
2024-10-16 3:25 ` [PATCH v2 29/29] drm/doc: gpusvm: Add GPU SVM documentation Matthew Brost
@ 2024-10-16 3:30 ` Patchwork
2024-10-16 3:31 ` ✗ CI.checkpatch: warning " Patchwork
2024-10-16 3:31 ` ✗ CI.KUnit: failure " Patchwork
31 siblings, 0 replies; 129+ messages in thread
From: Patchwork @ 2024-10-16 3:30 UTC (permalink / raw)
To: Matthew Brost; +Cc: intel-xe
== Series Details ==
Series: Introduce GPU SVM and Xe SVM implementation (rev2)
URL : https://patchwork.freedesktop.org/series/137870/
State : success
== Summary ==
=== Applying kernel patches on branch 'drm-tip' with base: ===
Base commit: 01c7b2c084e5 drm-tip: 2024y-10m-15d-14h-57m-37s UTC integration manifest
=== git am output follows ===
Applying: drm/xe: Retry BO allocation
Applying: mm/migrate: Add migrate_device_prepopulated_range
Applying: mm/migrate: Trylock device page in do_swap_page
Applying: drm/pagemap: Add DRM pagemap
Applying: drm/gpusvm: Add support for GPU Shared Virtual Memory
Applying: drm/xe/uapi: Add DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATON flag
Applying: drm/xe: Add SVM init / close / fini to faulting VMs
Applying: drm/xe: Add dma_addr res cursor
Applying: drm/xe: Add SVM range invalidation
Applying: drm/gpuvm: Add DRM_GPUVA_OP_USER
Applying: drm/xe: Add (re)bind to SVM page fault handler
Applying: drm/xe: Add SVM garbage collector
Applying: drm/xe: Add unbind to SVM garbage collector
Applying: drm/xe: Do not allow system allocator VMA unbind if the GPU has bindings
Applying: drm/xe: Enable system allocator uAPI
Applying: drm/xe: Add migrate layer functions for SVM support
Applying: drm/xe: Add SVM device memory mirroring
Applying: drm/xe: Add drm_gpusvm_devmem to xe_bo
Applying: drm/xe: Add GPUSVM devic memory copy vfunc functions
Applying: drm/xe: Add drm_pagemap ops to SVM
Applying: drm/xe: Add Xe SVM populate_devmem_pfn vfunc
Applying: drm/xe: Add Xe SVM devmem_release vfunc
Applying: drm/xe: Add BO flags required for SVM
Applying: drm/xe: Add SVM VRAM migration
Applying: drm/xe: Basic SVM BO eviction
Applying: drm/xe: Add SVM debug
Applying: drm/xe: Add modparam for SVM notifier size
Applying: drm/xe: Add always_migrate_to_vram modparam
Applying: drm/doc: gpusvm: Add GPU SVM documentation
^ permalink raw reply [flat|nested] 129+ messages in thread* ✗ CI.checkpatch: warning for Introduce GPU SVM and Xe SVM implementation (rev2)
2024-10-16 3:24 [PATCH v2 00/29] Introduce GPU SVM and Xe SVM implementation Matthew Brost
` (29 preceding siblings ...)
2024-10-16 3:30 ` ✓ CI.Patch_applied: success for Introduce GPU SVM and Xe SVM implementation (rev2) Patchwork
@ 2024-10-16 3:31 ` Patchwork
2024-10-16 3:31 ` ✗ CI.KUnit: failure " Patchwork
31 siblings, 0 replies; 129+ messages in thread
From: Patchwork @ 2024-10-16 3:31 UTC (permalink / raw)
To: Matthew Brost; +Cc: intel-xe
== Series Details ==
Series: Introduce GPU SVM and Xe SVM implementation (rev2)
URL : https://patchwork.freedesktop.org/series/137870/
State : warning
== Summary ==
+ KERNEL=/kernel
+ git clone https://gitlab.freedesktop.org/drm/maintainer-tools mt
Cloning into 'mt'...
warning: redirecting to https://gitlab.freedesktop.org/drm/maintainer-tools.git/
+ git -C mt rev-list -n1 origin/master
30ab6715fc09baee6cc14cb3c89ad8858688d474
+ cd /kernel
+ git config --global --add safe.directory /kernel
+ git log -n1
commit 6dfbb739d4447b459c1134fbb59b1687bd4ff475
Author: Matthew Brost <matthew.brost@intel.com>
Date: Tue Oct 15 20:25:18 2024 -0700
drm/doc: gpusvm: Add GPU SVM documentation
Add documentation for agreed-upon GPU SVM design principles, current
status, and future plans.
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
+ /mt/dim checkpatch 01c7b2c084e5c84313f382734c10945b9aa49823 drm-intel
90fa5ebc748c drm/xe: Retry BO allocation
1464915f18bd mm/migrate: Add migrate_device_prepopulated_range
af485f4c9894 mm/migrate: Trylock device page in do_swap_page
-:25: WARNING:BAD_SIGN_OFF: Non-standard signature: 'Suggessted-by:' - perhaps 'Suggested-by:'?
#25:
Suggessted-by: Simona Vetter <simona.vetter@ffwll.ch>
-:209: CHECK:PARENTHESIS_ALIGNMENT: Alignment should match open parenthesis
#209: FILE: mm/migrate_device.c:880:
+void migrate_device_finalize(unsigned long *src_pfns,
+ unsigned long *dst_pfns, unsigned long npages)
total: 0 errors, 1 warnings, 1 checks, 176 lines checked
fa631d64bed5 drm/pagemap: Add DRM pagemap
-:21: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#21:
new file mode 100644
total: 0 errors, 1 warnings, 0 checks, 103 lines checked
477527327fc5 drm/gpusvm: Add support for GPU Shared Virtual Memory
-:70: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#70:
new file mode 100644
-:265: WARNING:LONG_LINE_COMMENT: line length of 103 exceeds 100 columns
#265: FILE: drivers/gpu/drm/xe/drm_gpusvm.c:191:
+ * if (err == -EOPNOTSUPP || err == -EFAULT || err == -EPERM) { // CPU mappings changed
-:490: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'range__' - possible side-effects?
#490: FILE: drivers/gpu/drm/xe/drm_gpusvm.c:416:
+#define drm_gpusvm_for_each_range_safe(range__, next__, notifier__, start__, end__) \
+ for ((range__) = drm_gpusvm_range_find((notifier__), (start__), (end__)), \
+ (next__) = __drm_gpusvm_range_next(range__); \
+ (range__) && (range__->va.start < (end__)); \
+ (range__) = (next__), (next__) = __drm_gpusvm_range_next(range__))
-:490: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'next__' - possible side-effects?
#490: FILE: drivers/gpu/drm/xe/drm_gpusvm.c:416:
+#define drm_gpusvm_for_each_range_safe(range__, next__, notifier__, start__, end__) \
+ for ((range__) = drm_gpusvm_range_find((notifier__), (start__), (end__)), \
+ (next__) = __drm_gpusvm_range_next(range__); \
+ (range__) && (range__->va.start < (end__)); \
+ (range__) = (next__), (next__) = __drm_gpusvm_range_next(range__))
-:490: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'end__' - possible side-effects?
#490: FILE: drivers/gpu/drm/xe/drm_gpusvm.c:416:
+#define drm_gpusvm_for_each_range_safe(range__, next__, notifier__, start__, end__) \
+ for ((range__) = drm_gpusvm_range_find((notifier__), (start__), (end__)), \
+ (next__) = __drm_gpusvm_range_next(range__); \
+ (range__) && (range__->va.start < (end__)); \
+ (range__) = (next__), (next__) = __drm_gpusvm_range_next(range__))
-:523: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'notifier__' - possible side-effects?
#523: FILE: drivers/gpu/drm/xe/drm_gpusvm.c:449:
+#define drm_gpusvm_for_each_notifier(notifier__, gpusvm__, start__, end__) \
+ for ((notifier__) = notifier_iter_first(&(gpusvm__)->root, (start__), (end__) - 1); \
+ (notifier__) && (notifier__->interval.start < (end__)); \
+ (notifier__) = __drm_gpusvm_notifier_next(notifier__))
-:523: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'end__' - possible side-effects?
#523: FILE: drivers/gpu/drm/xe/drm_gpusvm.c:449:
+#define drm_gpusvm_for_each_notifier(notifier__, gpusvm__, start__, end__) \
+ for ((notifier__) = notifier_iter_first(&(gpusvm__)->root, (start__), (end__) - 1); \
+ (notifier__) && (notifier__->interval.start < (end__)); \
+ (notifier__) = __drm_gpusvm_notifier_next(notifier__))
-:539: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'notifier__' - possible side-effects?
#539: FILE: drivers/gpu/drm/xe/drm_gpusvm.c:465:
+#define drm_gpusvm_for_each_notifier_safe(notifier__, next__, gpusvm__, start__, end__) \
+ for ((notifier__) = notifier_iter_first(&(gpusvm__)->root, (start__), (end__) - 1), \
+ (next__) = __drm_gpusvm_notifier_next(notifier__); \
+ (notifier__) && (notifier__->interval.start < (end__)); \
+ (notifier__) = (next__), (next__) = __drm_gpusvm_notifier_next(notifier__))
-:539: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'next__' - possible side-effects?
#539: FILE: drivers/gpu/drm/xe/drm_gpusvm.c:465:
+#define drm_gpusvm_for_each_notifier_safe(notifier__, next__, gpusvm__, start__, end__) \
+ for ((notifier__) = notifier_iter_first(&(gpusvm__)->root, (start__), (end__) - 1), \
+ (next__) = __drm_gpusvm_notifier_next(notifier__); \
+ (notifier__) && (notifier__->interval.start < (end__)); \
+ (notifier__) = (next__), (next__) = __drm_gpusvm_notifier_next(notifier__))
-:539: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'end__' - possible side-effects?
#539: FILE: drivers/gpu/drm/xe/drm_gpusvm.c:465:
+#define drm_gpusvm_for_each_notifier_safe(notifier__, next__, gpusvm__, start__, end__) \
+ for ((notifier__) = notifier_iter_first(&(gpusvm__)->root, (start__), (end__) - 1), \
+ (next__) = __drm_gpusvm_notifier_next(notifier__); \
+ (notifier__) && (notifier__->interval.start < (end__)); \
+ (notifier__) = (next__), (next__) = __drm_gpusvm_notifier_next(notifier__))
-:650: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'fault_addr__' - possible side-effects?
#650: FILE: drivers/gpu/drm/xe/drm_gpusvm.c:576:
+#define drm_gpusvm_notifier_find(gpusvm__, fault_addr__) \
+ notifier_iter_first(&(gpusvm__)->root, (fault_addr__), \
+ (fault_addr__ + 1))
-:694: ERROR:MULTISTATEMENT_MACRO_USE_DO_WHILE: Macros with multiple statements should be enclosed in a do - while loop
#694: FILE: drivers/gpu/drm/xe/drm_gpusvm.c:620:
+#define drm_gpusvm_notifier_remove(gpusvm__, notifier__) \
+ notifier_remove((notifier__), &(gpusvm__)->root); \
+ list_del(&(notifier__)->rb.entry)
-:694: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'notifier__' - possible side-effects?
#694: FILE: drivers/gpu/drm/xe/drm_gpusvm.c:620:
+#define drm_gpusvm_notifier_remove(gpusvm__, notifier__) \
+ notifier_remove((notifier__), &(gpusvm__)->root); \
+ list_del(&(notifier__)->rb.entry)
-:820: ERROR:MULTISTATEMENT_MACRO_USE_DO_WHILE: Macros with multiple statements should be enclosed in a do - while loop
#820: FILE: drivers/gpu/drm/xe/drm_gpusvm.c:746:
+#define __drm_gpusvm_range_remove(notifier__, range__) \
+ range_remove((range__), &(notifier__)->root); \
+ list_del(&(range__)->rb.entry)
-:820: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'range__' - possible side-effects?
#820: FILE: drivers/gpu/drm/xe/drm_gpusvm.c:746:
+#define __drm_gpusvm_range_remove(notifier__, range__) \
+ range_remove((range__), &(notifier__)->root); \
+ list_del(&(range__)->rb.entry)
-:1275: WARNING:TYPO_SPELLING: 'commiting' may be misspelled - perhaps 'committing'?
#1275: FILE: drivers/gpu/drm/xe/drm_gpusvm.c:1201:
+ * called holding gpusvm->notifier_lock and as the last step before commiting a
^^^^^^^^^
-:1575: CHECK:PARENTHESIS_ALIGNMENT: Alignment should match open parenthesis
#1575: FILE: drivers/gpu/drm/xe/drm_gpusvm.c:1501:
+static void drm_gpusvm_get_devmem_page(struct page *page,
+ struct drm_gpusvm_zdd *zdd)
-:1598: WARNING:MISORDERED_TYPE: type 'long unsigned int *' should be specified in [[un]signed] [short|int|long|long long] order
#1598: FILE: drivers/gpu/drm/xe/drm_gpusvm.c:1524:
+ long unsigned int *migrate_pfn,
-:1598: WARNING:UNNECESSARY_INT: Prefer 'unsigned long *' over 'long unsigned int *' as the int is unnecessary
#1598: FILE: drivers/gpu/drm/xe/drm_gpusvm.c:1524:
+ long unsigned int *migrate_pfn,
-:2207: CHECK:PARENTHESIS_ALIGNMENT: Alignment should match open parenthesis
#2207: FILE: drivers/gpu/drm/xe/drm_gpusvm.h:53:
+ int (*populate_devmem_pfn)(struct drm_gpusvm_devmem *devmem_allocation,
+ unsigned long npages, unsigned long *pfn);
-:2553: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'range__' - possible side-effects?
#2553: FILE: drivers/gpu/drm/xe/drm_gpusvm.h:399:
+#define drm_gpusvm_for_each_range(range__, notifier__, start__, end__) \
+ for ((range__) = (range__) ?: \
+ drm_gpusvm_range_find((notifier__), (start__), (end__)); \
+ (range__) && (range__->va.start < (end__)); \
+ (range__) = __drm_gpusvm_range_next(range__))
-:2553: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'end__' - possible side-effects?
#2553: FILE: drivers/gpu/drm/xe/drm_gpusvm.h:399:
+#define drm_gpusvm_for_each_range(range__, notifier__, start__, end__) \
+ for ((range__) = (range__) ?: \
+ drm_gpusvm_range_find((notifier__), (start__), (end__)); \
+ (range__) && (range__->va.start < (end__)); \
+ (range__) = __drm_gpusvm_range_next(range__))
total: 2 errors, 5 warnings, 15 checks, 2530 lines checked
58219bdbb278 drm/xe/uapi: Add DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATON flag
-:449: CHECK:PARENTHESIS_ALIGNMENT: Alignment should match open parenthesis
#449: FILE: drivers/gpu/drm/xe/xe_vm.c:2812:
+ XE_IOCTL_DBG(xe, obj_offset && (is_null ||
+ is_system_allocator)) ||
total: 0 errors, 0 warnings, 1 checks, 469 lines checked
a07dbc648129 drm/xe: Add SVM init / close / fini to faulting VMs
-:26: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#26:
new file mode 100644
total: 0 errors, 1 warnings, 0 checks, 123 lines checked
4791a7d3ffc2 drm/xe: Add dma_addr res cursor
928636181157 drm/xe: Add SVM range invalidation
-:365: WARNING:SUSPECT_CODE_INDENT: suspect code indent for conditional statements (8, 15)
#365: FILE: drivers/gpu/drm/xe/xe_svm.c:243:
+ if (err == -EFAULT || err == -EPERM) /* Corner where CPU mappings have change */
+ goto retry;
-:366: WARNING:TABSTOP: Statements should start on a tabstop
#366: FILE: drivers/gpu/drm/xe/xe_svm.c:244:
+ goto retry;
total: 0 errors, 2 warnings, 0 checks, 357 lines checked
83ae34ad1461 drm/gpuvm: Add DRM_GPUVA_OP_USER
34d89d6a47c3 drm/xe: Add (re)bind to SVM page fault handler
f27373534f7f drm/xe: Add SVM garbage collector
-:190: CHECK:UNCOMMENTED_DEFINITION: spinlock_t definition without comment
#190: FILE: drivers/gpu/drm/xe/xe_vm_types.h:150:
+ spinlock_t lock;
total: 0 errors, 0 warnings, 1 checks, 157 lines checked
9c5ad681986a drm/xe: Add unbind to SVM garbage collector
-:20: ERROR:POINTER_LOCATION: "(foo*)" should be "(foo *)"
#20: FILE: drivers/gpu/drm/xe/xe_pt.c:928:
+#define INVALID_VMA (struct xe_vma*)(0xdeaddeadull)
-:20: ERROR:COMPLEX_MACRO: Macros with complex values should be enclosed in parentheses
#20: FILE: drivers/gpu/drm/xe/xe_pt.c:928:
+#define INVALID_VMA (struct xe_vma*)(0xdeaddeadull)
total: 2 errors, 0 warnings, 0 checks, 292 lines checked
5f9e1765c48c drm/xe: Do not allow system allocator VMA unbind if the GPU has bindings
-:7: WARNING:REPEATED_WORD: Possible repeated word: 'the'
#7:
uAPI is designed with the the use case that only mapping a BO to a
total: 0 errors, 1 warnings, 0 checks, 43 lines checked
29044345fb1d drm/xe: Enable system allocator uAPI
8dc37a6cd1c2 drm/xe: Add migrate layer functions for SVM support
3808560a54c6 drm/xe: Add SVM device memory mirroring
-:11: ERROR:BAD_SIGN_OFF: Unrecognized email address: 'Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com'
#11:
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com
-:103: CHECK:PARENTHESIS_ALIGNMENT: Alignment should match open parenthesis
#103: FILE: drivers/gpu/drm/xe/xe_svm.c:429:
+ drm_err(&xe->drm, "Failed to remap tile %d memory, errno %d\n",
+ tile->id, ret);
total: 1 errors, 0 warnings, 1 checks, 123 lines checked
f64edb9e67d1 drm/xe: Add drm_gpusvm_devmem to xe_bo
32fd9fc8bcdd drm/xe: Add GPUSVM devic memory copy vfunc functions
2ae09f28b907 drm/xe: Add drm_pagemap ops to SVM
8962941818f0 drm/xe: Add Xe SVM populate_devmem_pfn vfunc
-:53: ERROR:SPACING: spaces required around that '=' (ctx:WxV)
#53: FILE: drivers/gpu/drm/xe/xe_svm.c:439:
+ int j =0;
^
-:62: ERROR:SPACING: space required before the open parenthesis '('
#62: FILE: drivers/gpu/drm/xe/xe_svm.c:448:
+ for(i = 0; i < drm_buddy_block_size(buddy, block) >> PAGE_SHIFT; ++i)
total: 2 errors, 0 warnings, 0 checks, 54 lines checked
59083f4910ee drm/xe: Add Xe SVM devmem_release vfunc
7eb3ce55282c drm/xe: Add BO flags required for SVM
-:65: CHECK:PARENTHESIS_ALIGNMENT: Alignment should match open parenthesis
#65: FILE: drivers/gpu/drm/xe/xe_bo.c:2339:
+ if (IS_DGFX(xe) && ((bo->flags & XE_BO_FLAG_SYSTEM) ||
+ (bo->flags & XE_BO_FLAG_SYSTEM_ALLOC)))
total: 0 errors, 0 warnings, 1 checks, 53 lines checked
766a658cae7b drm/xe: Add SVM VRAM migration
-:36: ERROR:BAD_SIGN_OFF: Unrecognized email address: 'Matthew Brost matthew.brost@intel.com'
#36:
Signed-off-by: Matthew Brost matthew.brost@intel.com
-:167: WARNING:SUSPECT_CODE_INDENT: suspect code indent for conditional statements (8, 8)
#167: FILE: drivers/gpu/drm/xe/xe_svm.c:657:
if (err == -EFAULT || err == -EPERM) /* Corner where CPU mappings have change */
+ if (err == -EOPNOTSUPP || err == -EFAULT || err == -EPERM) { /* Corner where CPU mappings have change */
-:169: WARNING:LONG_LINE_COMMENT: line length of 115 exceeds 100 columns
#169: FILE: drivers/gpu/drm/xe/xe_svm.c:658:
+ if (err == -EOPNOTSUPP || err == -EFAULT || err == -EPERM) { /* Corner where CPU mappings have change */
-:200: ERROR:NO_AUTHOR_SIGN_OFF: Missing Signed-off-by: line by nominal patch author 'Matthew Brost <matthew.brost@intel.com>'
total: 2 errors, 2 warnings, 0 checks, 149 lines checked
44266a05dc8e drm/xe: Basic SVM BO eviction
b0f98930e66a drm/xe: Add SVM debug
-:77: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'r__' - possible side-effects?
#77: FILE: drivers/gpu/drm/xe/xe_svm.c:39:
+#define range_debug(r__, operaton__) \
+ vm_dbg(&range_to_vm(&(r__)->base)->xe->drm, \
+ "%s: asid=%u, gpusvm=0x%016llx, vram=%d,%d, seqno=%lu, " \
+ "start=0x%014llx, end=0x%014llx, size=%llu", \
+ (operaton__), range_to_vm(&(r__)->base)->usm.asid, \
+ (u64)(r__)->base.gpusvm, \
+ xe_svm_range_in_vram((r__)) ? 1 : 0, \
+ xe_svm_range_has_vram_binding((r__)) ? 1 : 0, \
+ (r__)->base.notifier_seq, \
+ (r__)->base.va.start, (r__)->base.va.end, \
+ (r__)->base.va.end - (r__)->base.va.start)
total: 0 errors, 0 warnings, 1 checks, 258 lines checked
2194b7e15026 drm/xe: Add modparam for SVM notifier size
664726daef4f drm/xe: Add always_migrate_to_vram modparam
6dfbb739d444 drm/doc: gpusvm: Add GPU SVM documentation
-:12: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#12:
new file mode 100644
-:17: WARNING:SPDX_LICENSE_TAG: Missing or malformed SPDX-License-Identifier tag in line 1
#17: FILE: Documentation/gpu/rfc/gpusvm.rst:1:
+===============
total: 0 errors, 2 warnings, 0 checks, 80 lines checked
^ permalink raw reply [flat|nested] 129+ messages in thread* ✗ CI.KUnit: failure for Introduce GPU SVM and Xe SVM implementation (rev2)
2024-10-16 3:24 [PATCH v2 00/29] Introduce GPU SVM and Xe SVM implementation Matthew Brost
` (30 preceding siblings ...)
2024-10-16 3:31 ` ✗ CI.checkpatch: warning " Patchwork
@ 2024-10-16 3:31 ` Patchwork
31 siblings, 0 replies; 129+ messages in thread
From: Patchwork @ 2024-10-16 3:31 UTC (permalink / raw)
To: Matthew Brost; +Cc: intel-xe
== Series Details ==
Series: Introduce GPU SVM and Xe SVM implementation (rev2)
URL : https://patchwork.freedesktop.org/series/137870/
State : failure
== Summary ==
+ trap cleanup EXIT
+ /kernel/tools/testing/kunit/kunit.py run --kunitconfig /kernel/drivers/gpu/drm/xe/.kunitconfig
ERROR:root:../drivers/gpu/drm/xe/drm_gpusvm.c: In function ‘drm_gpusvm_get_devmem_page’:
../drivers/gpu/drm/xe/drm_gpusvm.c:1504:9: error: implicit declaration of function ‘zone_device_page_init’ [-Werror=implicit-function-declaration]
1504 | zone_device_page_init(page);
| ^~~~~~~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors
make[7]: *** [../scripts/Makefile.build:229: drivers/gpu/drm/xe/drm_gpusvm.o] Error 1
make[7]: *** Waiting for unfinished jobs....
make[6]: *** [../scripts/Makefile.build:478: drivers/gpu/drm/xe] Error 2
make[6]: *** Waiting for unfinished jobs....
make[5]: *** [../scripts/Makefile.build:478: drivers/gpu/drm] Error 2
make[4]: *** [../scripts/Makefile.build:478: drivers/gpu] Error 2
make[4]: *** Waiting for unfinished jobs....
make[3]: *** [../scripts/Makefile.build:478: drivers] Error 2
make[3]: *** Waiting for unfinished jobs....
../lib/iomap.c:156:5: warning: no previous prototype for ‘ioread64_lo_hi’ [-Wmissing-prototypes]
156 | u64 ioread64_lo_hi(const void __iomem *addr)
| ^~~~~~~~~~~~~~
../lib/iomap.c:163:5: warning: no previous prototype for ‘ioread64_hi_lo’ [-Wmissing-prototypes]
163 | u64 ioread64_hi_lo(const void __iomem *addr)
| ^~~~~~~~~~~~~~
../lib/iomap.c:170:5: warning: no previous prototype for ‘ioread64be_lo_hi’ [-Wmissing-prototypes]
170 | u64 ioread64be_lo_hi(const void __iomem *addr)
| ^~~~~~~~~~~~~~~~
../lib/iomap.c:178:5: warning: no previous prototype for ‘ioread64be_hi_lo’ [-Wmissing-prototypes]
178 | u64 ioread64be_hi_lo(const void __iomem *addr)
| ^~~~~~~~~~~~~~~~
../lib/iomap.c:264:6: warning: no previous prototype for ‘iowrite64_lo_hi’ [-Wmissing-prototypes]
264 | void iowrite64_lo_hi(u64 val, void __iomem *addr)
| ^~~~~~~~~~~~~~~
../lib/iomap.c:272:6: warning: no previous prototype for ‘iowrite64_hi_lo’ [-Wmissing-prototypes]
272 | void iowrite64_hi_lo(u64 val, void __iomem *addr)
| ^~~~~~~~~~~~~~~
../lib/iomap.c:280:6: warning: no previous prototype for ‘iowrite64be_lo_hi’ [-Wmissing-prototypes]
280 | void iowrite64be_lo_hi(u64 val, void __iomem *addr)
| ^~~~~~~~~~~~~~~~~
../lib/iomap.c:288:6: warning: no previous prototype for ‘iowrite64be_hi_lo’ [-Wmissing-prototypes]
288 | void iowrite64be_hi_lo(u64 val, void __iomem *addr)
| ^~~~~~~~~~~~~~~~~
make[2]: *** [/kernel/Makefile:1936: .] Error 2
make[1]: *** [/kernel/Makefile:224: __sub-make] Error 2
make: *** [Makefile:224: __sub-make] Error 2
[03:31:15] Configuring KUnit Kernel ...
Generating .config ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
[03:31:20] Building KUnit Kernel ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
Building with:
$ make all compile_commands.json ARCH=um O=.kunit --jobs=48
+ cleanup
++ stat -c %u:%g /kernel
+ chown -R 1003:1003 /kernel
^ permalink raw reply [flat|nested] 129+ messages in thread