linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 0/4] Migrate on fault for device pages
@ 2025-08-14  7:19 Mika Penttilä
  2025-08-14  7:19 ` [RFC PATCH 1/4] mm: use current as mmu notifier's owner Mika Penttilä
                   ` (4 more replies)
  0 siblings, 5 replies; 22+ messages in thread
From: Mika Penttilä @ 2025-08-14  7:19 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Mika Penttilä, David Hildenbrand,
	Jason Gunthorpe, Leon Romanovsky, Alistair Popple, Balbir Singh

As of this writing, the way device page faulting and migration work
is not optimal when you want to do both fault handling and migration
at once.

Being able to migrate non-present pages (or pages mapped with incorrect
permissions, e.g. COW) to the GPU requires one of the following
sequences:

1. hmm_range_fault() - fault in non-present pages with correct
   permissions, etc.
2. migrate_vma_*() - migrate the pages

Or:

1. migrate_vma_*() - migrate present pages
2. If non-present pages detected by migrate_vma_*():
   a) call hmm_range_fault() to fault pages in
   b) call migrate_vma_*() again to migrate now present pages

The problem with the first sequence is that you always have to do two
page walks, even though most of the time the pages are present or
zero-page mappings, so the common case takes a performance hit.

The second sequence is better for the common case, but far worse if
pages aren't present, because now you have to walk the page tables
three times (once to find the page is not present, once so
hmm_range_fault() can find a non-present page to fault in, and once
again to set up the migration). It is also tricky to code correctly.

We should be able to walk the page table once, faulting
pages in as required and replacing them with migration entries if
requested.

These patches add a new flag to the HMM API, HMM_PFN_REQ_MIGRATE,
which tells hmm_range_fault() to also prepare pages for migration
while handling faults. For the migrate_vma_setup() call paths, a new
flag, MIGRATE_VMA_FAULT, is added to tell the migrate code to also
fault pages in.
The original idea came from Alistair.
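
For illustration, the fault-path usage then looks roughly like the
lib/test_hmm.c selftest added in patch 4. This is only a sketch
(locking and the notifier sequence/retry handling are omitted;
mdevice, notifier and the pfn arrays stand in for the driver's own):

  struct migrate_vma migrate = {
  	.src		= src_pfns,
  	.dst		= dst_pfns,
  	.pgmap_owner	= mdevice,
  };
  struct hmm_range range = {
  	.notifier	   = &notifier,
  	.start		   = start,
  	.end		   = end,
  	.hmm_pfns	   = src_pfns,
  	.default_flags	   = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_MIGRATE,
  	.dev_private_owner = mdevice,
  	.migrate	   = &migrate,
  };

  /* Faults pages in and installs migration entries in one walk */
  ret = hmm_range_fault(&range);

  /* Converts hmm_pfns into migrate src entries; replaces migrate_vma_setup() */
  migrate_hmm_range_setup(&range);

  /* Allocate and copy device pages as before, then: */
  migrate_vma_pages(&migrate);
  migrate_vma_finalize(&migrate);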

These patches are based on the 6.16 mainline kernel, and they pass
the HMM selftests. The THP support from Balbir should not be too hard
to integrate on top of this, and that is something I am looking at.

Cc: David Hildenbrand <david@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Leon Romanovsky <leonro@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Balbir Singh <balbirs@nvidia.com>

Mika Penttilä (4):
  mm: use current as mmu notifier's owner
  mm: unified fault and migrate device page paths
  mm:/migrate_device.c: remove migrate_vma_collect_*() functions
  mm: add new testcase for the migrate on fault case

 include/linux/hmm.h                    |  10 +-
 include/linux/migrate.h                |   6 +-
 lib/test_hmm.c                         | 105 ++++++-
 lib/test_hmm_uapi.h                    |  17 +-
 mm/hmm.c                               | 351 +++++++++++++++++++++-
 mm/huge_memory.c                       |   6 +-
 mm/migrate_device.c                    | 384 +++++--------------------
 mm/rmap.c                              |   4 +-
 tools/testing/selftests/mm/hmm-tests.c |  53 ++++
 9 files changed, 592 insertions(+), 344 deletions(-)

-- 
2.50.0


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC PATCH 1/4] mm: use current as mmu notifier's owner
  2025-08-14  7:19 [RFC PATCH 0/4] Migrate on fault for device pages Mika Penttilä
@ 2025-08-14  7:19 ` Mika Penttilä
  2025-08-14 12:40   ` Jason Gunthorpe
  2025-08-14  7:19 ` [RFC PATCH 2/4] mm: unified fault and migrate device page paths Mika Penttilä
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 22+ messages in thread
From: Mika Penttilä @ 2025-08-14  7:19 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Mika Penttilä, David Hildenbrand,
	Jason Gunthorpe, Leon Romanovsky, Alistair Popple, Balbir Singh

When doing migration in combination with device fault handling,
detect the case in the interval notifier.

Without that, we would livelock with our own invalidations
while migrating and splitting pages during fault handling.

Note that pgmap_owner, which some other code paths use as the owner
for filtering, is not readily available in the split path, so use
current for this case. Also, since current and pgmap_owner are both
pointers to memory, one cannot be misinterpreted as the other.
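
A driver's interval notifier can then skip the invalidations it
triggered itself, along the same lines as the lib/test_hmm.c change
below. A minimal sketch (the callback name and the rest of the
handling are hypothetical):

  static bool drv_interval_invalidate(struct mmu_interval_notifier *mni,
  				      const struct mmu_notifier_range *range,
  				      unsigned long cur_seq)
  {
  	/* Split/unmap invalidation caused by our own fault+migrate path */
  	if (range->event == MMU_NOTIFY_CLEAR && range->owner == current)
  		return true;

  	/* ... otherwise take the driver lock, mmu_interval_set_seq(), ... */
  	return true;
  }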

Cc: David Hildenbrand <david@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Leon Romanovsky <leonro@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Balbir Singh <balbirs@nvidia.com>

Signed-off-by: Mika Penttilä <mpenttil@redhat.com>
---
 lib/test_hmm.c   | 5 +++++
 mm/huge_memory.c | 6 +++---
 mm/rmap.c        | 4 ++--
 3 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 761725bc713c..cd5c139213be 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -269,6 +269,11 @@ static bool dmirror_interval_invalidate(struct mmu_interval_notifier *mni,
 	    range->owner == dmirror->mdevice)
 		return true;
 
+	if (range->event == MMU_NOTIFY_CLEAR &&
+	    range->owner == current) {
+		return true;
+	}
+
 	if (mmu_notifier_range_blockable(range))
 		mutex_lock(&dmirror->mutex);
 	else if (!mutex_trylock(&dmirror->mutex))
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9c38a95e9f09..276e38dd8f68 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3069,9 +3069,9 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	spinlock_t *ptl;
 	struct mmu_notifier_range range;
 
-	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
-				address & HPAGE_PMD_MASK,
-				(address & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE);
+	mmu_notifier_range_init_owner(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
+				      address & HPAGE_PMD_MASK,
+				      (address & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE, current);
 	mmu_notifier_invalidate_range_start(&range);
 	ptl = pmd_lock(vma->vm_mm, pmd);
 	split_huge_pmd_locked(vma, range.start, pmd, freeze);
diff --git a/mm/rmap.c b/mm/rmap.c
index f93ce27132ab..e7829015a40b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2308,8 +2308,8 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 	 * try_to_unmap() must hold a reference on the page.
 	 */
 	range.end = vma_address_end(&pvmw);
-	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
-				address, range.end);
+	mmu_notifier_range_init_owner(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
+				      address, range.end, current);
 	if (folio_test_hugetlb(folio)) {
 		/*
 		 * If sharing is possible, start and end will be adjusted
-- 
2.50.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH 2/4] mm: unified fault and migrate device page paths
  2025-08-14  7:19 [RFC PATCH 0/4] Migrate on fault for device pages Mika Penttilä
  2025-08-14  7:19 ` [RFC PATCH 1/4] mm: use current as mmu notifier's owner Mika Penttilä
@ 2025-08-14  7:19 ` Mika Penttilä
  2025-08-21  4:30   ` Balbir Singh
  2025-08-14  7:19 ` [RFC PATCH 3/4] mm:/migrate_device.c: remove migrate_vma_collect_*() functions Mika Penttilä
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 22+ messages in thread
From: Mika Penttilä @ 2025-08-14  7:19 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Mika Penttilä, David Hildenbrand,
	Jason Gunthorpe, Leon Romanovsky, Alistair Popple, Balbir Singh

As of this writing, the way device page faulting and migration work
is not optimal when you want to do both fault handling and migration
at once.

Being able to migrate non-present pages (or pages mapped with incorrect
permissions, e.g. COW) to the GPU requires one of the following
sequences:

1. hmm_range_fault() - fault in non-present pages with correct permissions, etc.
2. migrate_vma_*() - migrate the pages

Or:

1. migrate_vma_*() - migrate present pages
2. If non-present pages detected by migrate_vma_*():
   a) call hmm_range_fault() to fault pages in
   b) call migrate_vma_*() again to migrate now present pages

The problem with the first sequence is that you always have to do two
page walks, even though most of the time the pages are present or
zero-page mappings, so the common case takes a performance hit.

The second sequence is better for the common case, but far worse if
pages aren't present, because now you have to walk the page tables
three times (once to find the page is not present, once so
hmm_range_fault() can find a non-present page to fault in, and once
again to set up the migration). It is also tricky to code correctly.

We should be able to walk the page table once, faulting
pages in as required and replacing them with migration entries if
requested.

Add a new flag to the HMM API, HMM_PFN_REQ_MIGRATE, which tells
hmm_range_fault() to also prepare pages for migration while handling
faults. For the migrate_vma_setup() call paths, a new flag,
MIGRATE_VMA_FAULT, is added to tell the migrate code to also fault
pages in.
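
For the migrate_vma_setup() entry point the usage stays the same apart
from the new flag. A minimal sketch (vma, range and the pfn arrays are
set up by the driver as before; mdevice is a placeholder for the
driver's pgmap owner):

  struct migrate_vma args = {
  	.vma		= vma,
  	.start		= start,
  	.end		= end,
  	.src		= src_pfns,
  	.dst		= dst_pfns,
  	.pgmap_owner	= mdevice,
  	.flags		= MIGRATE_VMA_SELECT_SYSTEM | MIGRATE_VMA_FAULT,
  };

  /* Collects present pages and now also faults in non-present anonymous pages */
  ret = migrate_vma_setup(&args);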

Cc: David Hildenbrand <david@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Leon Romanovsky <leonro@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Balbir Singh <balbirs@nvidia.com>

Suggested-by: Alistair Popple <apopple@nvidia.com>
Signed-off-by: Mika Penttilä <mpenttil@redhat.com>
---
 include/linux/hmm.h     |  10 +-
 include/linux/migrate.h |   6 +-
 mm/hmm.c                | 351 ++++++++++++++++++++++++++++++++++++++--
 mm/migrate_device.c     |  72 ++++++++-
 4 files changed, 420 insertions(+), 19 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index db75ffc949a7..7485e549c675 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -12,7 +12,7 @@
 #include <linux/mm.h>
 
 struct mmu_interval_notifier;
-
+struct migrate_vma;
 /*
  * On output:
  * 0             - The page is faultable and a future call with 
@@ -48,11 +48,14 @@ enum hmm_pfn_flags {
 	HMM_PFN_P2PDMA     = 1UL << (BITS_PER_LONG - 5),
 	HMM_PFN_P2PDMA_BUS = 1UL << (BITS_PER_LONG - 6),
 
-	HMM_PFN_ORDER_SHIFT = (BITS_PER_LONG - 11),
+	/* Migrate request */
+	HMM_PFN_MIGRATE    = 1UL << (BITS_PER_LONG - 7),
+	HMM_PFN_ORDER_SHIFT = (BITS_PER_LONG - 12),
 
 	/* Input flags */
 	HMM_PFN_REQ_FAULT = HMM_PFN_VALID,
 	HMM_PFN_REQ_WRITE = HMM_PFN_WRITE,
+	HMM_PFN_REQ_MIGRATE = HMM_PFN_MIGRATE,
 
 	HMM_PFN_FLAGS = ~((1UL << HMM_PFN_ORDER_SHIFT) - 1),
 };
@@ -107,6 +110,7 @@ static inline unsigned int hmm_pfn_to_map_order(unsigned long hmm_pfn)
  * @default_flags: default flags for the range (write, read, ... see hmm doc)
  * @pfn_flags_mask: allows to mask pfn flags so that only default_flags matter
  * @dev_private_owner: owner of device private pages
+ * @migrate: structure for migrating the associated vma
  */
 struct hmm_range {
 	struct mmu_interval_notifier *notifier;
@@ -117,12 +121,14 @@ struct hmm_range {
 	unsigned long		default_flags;
 	unsigned long		pfn_flags_mask;
 	void			*dev_private_owner;
+	struct migrate_vma      *migrate;
 };
 
 /*
  * Please see Documentation/mm/hmm.rst for how to use the range API.
  */
 int hmm_range_fault(struct hmm_range *range);
+int hmm_range_migrate_prepare(struct hmm_range *range, struct migrate_vma **pargs);
 
 /*
  * HMM_RANGE_DEFAULT_TIMEOUT - default timeout (ms) when waiting for a range
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index acadd41e0b5c..ab35d0f1f65d 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -3,6 +3,7 @@
 #define _LINUX_MIGRATE_H
 
 #include <linux/mm.h>
+#include <linux/hmm.h>
 #include <linux/mempolicy.h>
 #include <linux/migrate_mode.h>
 #include <linux/hugetlb.h>
@@ -143,10 +144,11 @@ static inline unsigned long migrate_pfn(unsigned long pfn)
 	return (pfn << MIGRATE_PFN_SHIFT) | MIGRATE_PFN_VALID;
 }
 
-enum migrate_vma_direction {
+enum migrate_vma_info {
 	MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
 	MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
 	MIGRATE_VMA_SELECT_DEVICE_COHERENT = 1 << 2,
+	MIGRATE_VMA_FAULT = 1 << 3,
 };
 
 struct migrate_vma {
@@ -194,7 +196,7 @@ void migrate_device_pages(unsigned long *src_pfns, unsigned long *dst_pfns,
 			unsigned long npages);
 void migrate_device_finalize(unsigned long *src_pfns,
 			unsigned long *dst_pfns, unsigned long npages);
-
+void migrate_hmm_range_setup(struct hmm_range *range);
 #endif /* CONFIG_MIGRATION */
 
 #endif /* _LINUX_MIGRATE_H */
diff --git a/mm/hmm.c b/mm/hmm.c
index d545e2494994..8cb2b325fa9f 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -20,6 +20,7 @@
 #include <linux/pagemap.h>
 #include <linux/swapops.h>
 #include <linux/hugetlb.h>
+#include <linux/migrate.h>
 #include <linux/memremap.h>
 #include <linux/sched/mm.h>
 #include <linux/jump_label.h>
@@ -33,6 +34,10 @@
 struct hmm_vma_walk {
 	struct hmm_range	*range;
 	unsigned long		last;
+	struct mmu_notifier_range mmu_range;
+	struct vm_area_struct 	*vma;
+	unsigned long 		start;
+	unsigned long 		end;
 };
 
 enum {
@@ -47,15 +52,33 @@ enum {
 			      HMM_PFN_P2PDMA_BUS,
 };
 
+static enum migrate_vma_info hmm_want_migrate(struct hmm_range *range)
+{
+	enum migrate_vma_info minfo;
+
+	minfo = range->migrate ? range->migrate->flags : 0;
+	minfo |= (range->default_flags & HMM_PFN_REQ_MIGRATE) ?
+		MIGRATE_VMA_SELECT_SYSTEM : 0;
+
+	return minfo;
+}
+
 static int hmm_pfns_fill(unsigned long addr, unsigned long end,
-			 struct hmm_range *range, unsigned long cpu_flags)
+			 struct hmm_vma_walk *hmm_vma_walk, unsigned long cpu_flags)
 {
+	struct hmm_range *range = hmm_vma_walk->range;
 	unsigned long i = (addr - range->start) >> PAGE_SHIFT;
 
+	if (cpu_flags != HMM_PFN_ERROR)
+		if (hmm_want_migrate(range) &&
+		    (vma_is_anonymous(hmm_vma_walk->vma)))
+			cpu_flags |= (HMM_PFN_VALID | HMM_PFN_MIGRATE);
+
 	for (; addr < end; addr += PAGE_SIZE, i++) {
 		range->hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
 		range->hmm_pfns[i] |= cpu_flags;
 	}
+
 	return 0;
 }
 
@@ -171,11 +194,11 @@ static int hmm_vma_walk_hole(unsigned long addr, unsigned long end,
 	if (!walk->vma) {
 		if (required_fault)
 			return -EFAULT;
-		return hmm_pfns_fill(addr, end, range, HMM_PFN_ERROR);
+		return hmm_pfns_fill(addr, end, hmm_vma_walk, HMM_PFN_ERROR);
 	}
 	if (required_fault)
 		return hmm_vma_fault(addr, end, required_fault, walk);
-	return hmm_pfns_fill(addr, end, range, 0);
+	return hmm_pfns_fill(addr, end, hmm_vma_walk, 0);
 }
 
 static inline unsigned long hmm_pfn_flags_order(unsigned long order)
@@ -326,6 +349,257 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 	return hmm_vma_fault(addr, end, required_fault, walk);
 }
 
+/*
+ * Install migration entries if migration requested, either from fault
+ * or migrate paths.
+ *
+ */
+static void hmm_vma_handle_migrate_prepare(const struct mm_walk *walk,
+					   pmd_t *pmdp,
+					   unsigned long addr,
+					   unsigned long *hmm_pfn)
+{
+	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct hmm_range *range = hmm_vma_walk->range;
+	struct migrate_vma *migrate = range->migrate;
+	struct mm_struct *mm = walk->vma->vm_mm;
+	struct folio *fault_folio = NULL;
+	enum migrate_vma_info minfo;
+	struct dev_pagemap *pgmap;
+	bool anon_exclusive;
+	struct folio *folio;
+	unsigned long pfn;
+	struct page *page;
+	swp_entry_t entry;
+	pte_t pte, swp_pte;
+	spinlock_t *ptl;
+	bool writable = false;
+	pte_t *ptep;
+
+
+	// Do we want to migrate at all?
+	minfo = hmm_want_migrate(range);
+	if (!minfo)
+		return;
+
+	fault_folio = (migrate && migrate->fault_page) ?
+		page_folio(migrate->fault_page) : NULL;
+
+	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+	if (!ptep)
+		return;
+
+	pte = ptep_get(ptep);
+
+	if (pte_none(pte)) {
+		// migrate without faulting case
+		if (vma_is_anonymous(walk->vma))
+			*hmm_pfn = HMM_PFN_MIGRATE|HMM_PFN_VALID;
+		goto out;
+	}
+
+	if (!(*hmm_pfn & HMM_PFN_VALID))
+		goto out;
+
+	if (!pte_present(pte)) {
+		/*
+		 * Only care about unaddressable device page special
+		 * page table entry. Other special swap entries are not
+		 * migratable, and we ignore regular swapped page.
+		 */
+		entry = pte_to_swp_entry(pte);
+		if (!is_device_private_entry(entry))
+			goto out;
+
+		// We have already checked that we are the pgmap owner
+		if (!(minfo & MIGRATE_VMA_SELECT_DEVICE_PRIVATE))
+			goto out;
+
+		page = pfn_swap_entry_to_page(entry);
+		pfn = page_to_pfn(page);
+		if (is_writable_device_private_entry(entry))
+			writable = true;
+	} else {
+		pfn = pte_pfn(pte);
+		if (is_zero_pfn(pfn) &&
+		    (minfo & MIGRATE_VMA_SELECT_SYSTEM)) {
+			*hmm_pfn = HMM_PFN_MIGRATE|HMM_PFN_VALID;
+			goto out;
+		}
+		page = vm_normal_page(walk->vma, addr, pte);
+		if (page && !is_zone_device_page(page) &&
+		    !(minfo & MIGRATE_VMA_SELECT_SYSTEM)) {
+			goto out;
+		} else if (page && is_device_coherent_page(page)) {
+			pgmap = page_pgmap(page);
+
+			if (!(minfo &
+			      MIGRATE_VMA_SELECT_DEVICE_COHERENT) ||
+			    pgmap->owner != migrate->pgmap_owner)
+				goto out;
+		}
+		writable = pte_write(pte);
+	}
+
+	/* FIXME support THP */
+	if (!page || !page->mapping || PageTransCompound(page))
+		goto out;
+
+	/*
+	 * By getting a reference on the folio we pin it and that blocks
+	 * any kind of migration. Side effect is that it "freezes" the
+	 * pte.
+	 *
+	 * We drop this reference after isolating the folio from the lru
+	 * for non device folio (device folio are not on the lru and thus
+	 * can't be dropped from it).
+	 */
+	folio = page_folio(page);
+	folio_get(folio);
+
+	/*
+	 * We rely on folio_trylock() to avoid deadlock between
+	 * concurrent migrations where each is waiting on the others
+	 * folio lock. If we can't immediately lock the folio we fail this
+	 * migration as it is only best effort anyway.
+	 *
+	 * If we can lock the folio it's safe to set up a migration entry
+	 * now. In the common case where the folio is mapped once in a
+	 * single process setting up the migration entry now is an
+	 * optimisation to avoid walking the rmap later with
+	 * try_to_migrate().
+	 */
+
+	if (fault_folio == folio || folio_trylock(folio)) {
+		anon_exclusive = folio_test_anon(folio) &&
+			PageAnonExclusive(page);
+
+		flush_cache_page(walk->vma, addr, pfn);
+
+		if (anon_exclusive) {
+			pte = ptep_clear_flush(walk->vma, addr, ptep);
+
+			if (folio_try_share_anon_rmap_pte(folio, page)) {
+				set_pte_at(mm, addr, ptep, pte);
+				folio_unlock(folio);
+				folio_put(folio);
+				goto out;
+			}
+		} else {
+			pte = ptep_get_and_clear(mm, addr, ptep);
+		}
+
+		/* Setup special migration page table entry */
+		if (writable)
+			entry = make_writable_migration_entry(pfn);
+		else if (anon_exclusive)
+			entry = make_readable_exclusive_migration_entry(pfn);
+		else
+			entry = make_readable_migration_entry(pfn);
+
+		swp_pte = swp_entry_to_pte(entry);
+		if (pte_present(pte)) {
+			if (pte_soft_dirty(pte))
+				swp_pte = pte_swp_mksoft_dirty(swp_pte);
+			if (pte_uffd_wp(pte))
+				swp_pte = pte_swp_mkuffd_wp(swp_pte);
+		} else {
+			if (pte_swp_soft_dirty(pte))
+				swp_pte = pte_swp_mksoft_dirty(swp_pte);
+			if (pte_swp_uffd_wp(pte))
+				swp_pte = pte_swp_mkuffd_wp(swp_pte);
+		}
+
+		set_pte_at(mm, addr, ptep, swp_pte);
+		folio_remove_rmap_pte(folio, page, walk->vma);
+		folio_put(folio);
+		*hmm_pfn |= HMM_PFN_MIGRATE;
+
+		if (pte_present(pte))
+			flush_tlb_range(walk->vma, addr, addr + PAGE_SIZE);
+	} else
+		folio_put(folio);
+out:
+	pte_unmap_unlock(ptep, ptl);
+
+}
+
+static int hmm_vma_walk_split(pmd_t *pmdp,
+			      unsigned long addr,
+			      struct mm_walk *walk)
+{
+	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct hmm_range *range = hmm_vma_walk->range;
+	struct migrate_vma *migrate = range->migrate;
+	struct folio *folio, *fault_folio;
+	spinlock_t *ptl;
+	int ret = 0;
+
+	fault_folio = (migrate && migrate->fault_page) ?
+		page_folio(migrate->fault_page) : NULL;
+
+	ptl = pmd_lock(walk->mm, pmdp);
+	if (unlikely(!pmd_trans_huge(*pmdp))) {
+		spin_unlock(ptl);
+		goto out;
+	}
+
+	folio = pmd_folio(*pmdp);
+	if (is_huge_zero_folio(folio)) {
+		spin_unlock(ptl);
+		split_huge_pmd(walk->vma, pmdp, addr);
+	} else {
+		folio_get(folio);
+		spin_unlock(ptl);
+		/* FIXME: we don't expect THP for fault_folio */
+		if (WARN_ON_ONCE(fault_folio == folio)) {
+			folio_put(folio);
+			ret = -EBUSY;
+			goto out;
+		}
+		if (unlikely(!folio_trylock(folio))) {
+			folio_put(folio);
+			ret = -EBUSY;
+			goto out;
+		}
+		ret = split_folio(folio);
+		folio_unlock(folio);
+		folio_put(folio);
+	}
+out:
+	return ret;
+}
+
+static int hmm_vma_capture_migrate_range(unsigned long start,
+					 unsigned long end,
+					 struct mm_walk *walk)
+{
+	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct hmm_range *range = hmm_vma_walk->range;
+
+	if (!hmm_want_migrate(range))
+		return 0;
+
+	if (hmm_vma_walk->vma && (hmm_vma_walk->vma != walk->vma))
+		return -ERANGE;
+
+	hmm_vma_walk->vma = walk->vma;
+	hmm_vma_walk->start = start;
+	hmm_vma_walk->end = end;
+
+	if (end - start > range->end - range->start)
+		return -ERANGE;
+
+	if (!hmm_vma_walk->mmu_range.owner) {
+		mmu_notifier_range_init_owner(&hmm_vma_walk->mmu_range, MMU_NOTIFY_MIGRATE, 0,
+					      walk->vma->vm_mm, start, end,
+					      range->dev_private_owner);
+		mmu_notifier_invalidate_range_start(&hmm_vma_walk->mmu_range);
+	}
+
+	return 0;
+}
+
 static int hmm_vma_walk_pmd(pmd_t *pmdp,
 			    unsigned long start,
 			    unsigned long end,
@@ -351,13 +625,28 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 			pmd_migration_entry_wait(walk->mm, pmdp);
 			return -EBUSY;
 		}
-		return hmm_pfns_fill(start, end, range, 0);
+		return hmm_pfns_fill(start, end, hmm_vma_walk, 0);
 	}
 
 	if (!pmd_present(pmd)) {
 		if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0))
 			return -EFAULT;
-		return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
+		return hmm_pfns_fill(start, end, hmm_vma_walk, HMM_PFN_ERROR);
+	}
+
+	if (hmm_want_migrate(range) &&
+	    pmd_trans_huge(pmd)) {
+		int r;
+
+		r = hmm_vma_walk_split(pmdp, addr, walk);
+		if (r) {
+			/* Split not successful, skip */
+			return hmm_pfns_fill(start, end, hmm_vma_walk, HMM_PFN_ERROR);
+		}
+
+		/* Split successful or "again", reloop */
+		hmm_vma_walk->last = addr;
+		return -EBUSY;
 	}
 
 	if (pmd_trans_huge(pmd)) {
@@ -386,7 +675,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 	if (pmd_bad(pmd)) {
 		if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0))
 			return -EFAULT;
-		return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
+		return hmm_pfns_fill(start, end, hmm_vma_walk, HMM_PFN_ERROR);
 	}
 
 	ptep = pte_offset_map(pmdp, addr);
@@ -400,8 +689,11 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 			/* hmm_vma_handle_pte() did pte_unmap() */
 			return r;
 		}
+
+		hmm_vma_handle_migrate_prepare(walk, pmdp, addr, hmm_pfns);
 	}
 	pte_unmap(ptep - 1);
+
 	return 0;
 }
 
@@ -535,6 +827,11 @@ static int hmm_vma_walk_test(unsigned long start, unsigned long end,
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
 	struct hmm_range *range = hmm_vma_walk->range;
 	struct vm_area_struct *vma = walk->vma;
+	int r;
+
+	r = hmm_vma_capture_migrate_range(start, end, walk);
+	if (r)
+		return r;
 
 	if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)) &&
 	    vma->vm_flags & VM_READ)
@@ -557,7 +854,7 @@ static int hmm_vma_walk_test(unsigned long start, unsigned long end,
 				 (end - start) >> PAGE_SHIFT, 0))
 		return -EFAULT;
 
-	hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
+	hmm_pfns_fill(start, end, hmm_vma_walk, HMM_PFN_ERROR);
 
 	/* Skip this vma and continue processing the next vma. */
 	return 1;
@@ -587,9 +884,17 @@ static const struct mm_walk_ops hmm_walk_ops = {
  *		the invalidation to finish.
  * -EFAULT:     A page was requested to be valid and could not be made valid
  *              ie it has no backing VMA or it is illegal to access
+ * -ERANGE:     The range crosses multiple VMAs, or space for hmm_pfns array
+ *              is too small.
  *
  * This is similar to get_user_pages(), except that it can read the page tables
  * without mutating them (ie causing faults).
+ *
+ * If you want to migrate after faulting, call hmm_range_fault() with
+ * HMM_PFN_REQ_MIGRATE set and the range.migrate field initialized.
+ * After hmm_range_fault(), call migrate_hmm_range_setup() instead of
+ * migrate_vma_setup() and then follow the normal migrate call path.
+ *
  */
 int hmm_range_fault(struct hmm_range *range)
 {
@@ -597,16 +902,28 @@ int hmm_range_fault(struct hmm_range *range)
 		.range = range,
 		.last = range->start,
 	};
-	struct mm_struct *mm = range->notifier->mm;
+	bool is_fault_path = !!range->notifier;
+	struct mm_struct *mm;
 	int ret;
 
+	/*
+	 *
+	 *  Could be serving a device fault or come from migrate
+	 *  entry point. For the former we have not resolved the vma
+	 *  yet, and the latter we don't have a notifier (but have a vma).
+	 *
+	 */
+	mm = is_fault_path ? range->notifier->mm : range->migrate->vma->vm_mm;
 	mmap_assert_locked(mm);
 
 	do {
 		/* If range is no longer valid force retry. */
-		if (mmu_interval_check_retry(range->notifier,
-					     range->notifier_seq))
-			return -EBUSY;
+		if (is_fault_path && mmu_interval_check_retry(range->notifier,
+					     range->notifier_seq)) {
+			ret = -EBUSY;
+			break;
+		}
+
 		ret = walk_page_range(mm, hmm_vma_walk.last, range->end,
 				      &hmm_walk_ops, &hmm_vma_walk);
 		/*
@@ -616,6 +933,18 @@ int hmm_range_fault(struct hmm_range *range)
 		 * output, and all >= are still at their input values.
 		 */
 	} while (ret == -EBUSY);
+
+	if (hmm_want_migrate(range) && range->migrate &&
+	    hmm_vma_walk.mmu_range.owner) {
+		// The migrate_vma path has the following initialized
+		if (is_fault_path) {
+			range->migrate->vma   = hmm_vma_walk.vma;
+			range->migrate->start = range->start;
+			range->migrate->end   = hmm_vma_walk.end;
+		}
+		mmu_notifier_invalidate_range_end(&hmm_vma_walk.mmu_range);
+	}
+
 	return ret;
 }
 EXPORT_SYMBOL(hmm_range_fault);
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index e05e14d6eacd..87ddc0353165 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -535,7 +535,17 @@ static void migrate_vma_unmap(struct migrate_vma *migrate)
  */
 int migrate_vma_setup(struct migrate_vma *args)
 {
+	int ret;
 	long nr_pages = (args->end - args->start) >> PAGE_SHIFT;
+	struct hmm_range range = {
+		.notifier = NULL,
+		.start = args->start,
+		.end = args->end,
+		.migrate = args,
+		.hmm_pfns = args->src,
+		.default_flags = HMM_PFN_REQ_MIGRATE,
+		.dev_private_owner = args->pgmap_owner,
+	};
 
 	args->start &= PAGE_MASK;
 	args->end &= PAGE_MASK;
@@ -560,17 +571,19 @@ int migrate_vma_setup(struct migrate_vma *args)
 	args->cpages = 0;
 	args->npages = 0;
 
-	migrate_vma_collect(args);
+	if (args->flags & MIGRATE_VMA_FAULT)
+		range.default_flags |= HMM_PFN_REQ_FAULT;
 
-	if (args->cpages)
-		migrate_vma_unmap(args);
+	ret = hmm_range_fault(&range);
+
+	migrate_hmm_range_setup(&range);
 
 	/*
 	 * At this point pages are locked and unmapped, and thus they have
 	 * stable content and can safely be copied to destination memory that
 	 * is allocated by the drivers.
 	 */
-	return 0;
+	return ret;
 
 }
 EXPORT_SYMBOL(migrate_vma_setup);
@@ -1014,3 +1027,54 @@ int migrate_device_coherent_folio(struct folio *folio)
 		return 0;
 	return -EBUSY;
 }
+
+void migrate_hmm_range_setup(struct hmm_range *range)
+{
+
+	struct migrate_vma *migrate = range->migrate;
+
+	if (!migrate)
+		return;
+
+	migrate->npages = (migrate->end - migrate->start) >> PAGE_SHIFT;
+	migrate->cpages = 0;
+
+	for (unsigned long i = 0; i < migrate->npages; i++) {
+
+		unsigned long pfn = range->hmm_pfns[i];
+
+		/*
+		 *
+		 *  Don't do migration if valid and migrate flags are not both set.
+		 *
+		 */
+		if ((pfn & (HMM_PFN_VALID | HMM_PFN_MIGRATE)) !=
+		    (HMM_PFN_VALID | HMM_PFN_MIGRATE)) {
+			migrate->src[i] = 0;
+			migrate->dst[i] = 0;
+			continue;
+		}
+
+		migrate->cpages++;
+
+		/*
+		 *
+		 * The zero page is encoded in a special way, valid and migrate is
+		 * set, and pfn part is zero. Encode specially for migrate also.
+		 *
+		 */
+		if (pfn == (HMM_PFN_VALID|HMM_PFN_MIGRATE)) {
+			migrate->src[i] = MIGRATE_PFN_MIGRATE;
+			continue;
+		}
+
+		migrate->src[i] = migrate_pfn(page_to_pfn(hmm_pfn_to_page(pfn)))
+			| MIGRATE_PFN_MIGRATE;
+		migrate->src[i] |= (pfn & HMM_PFN_WRITE) ? MIGRATE_PFN_WRITE : 0;
+	}
+
+	if (migrate->cpages)
+		migrate_vma_unmap(migrate);
+
+}
+EXPORT_SYMBOL(migrate_hmm_range_setup);
-- 
2.50.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH 3/4] mm:/migrate_device.c: remove migrate_vma_collect_*() functions
  2025-08-14  7:19 [RFC PATCH 0/4] Migrate on fault for device pages Mika Penttilä
  2025-08-14  7:19 ` [RFC PATCH 1/4] mm: use current as mmu notifier's owner Mika Penttilä
  2025-08-14  7:19 ` [RFC PATCH 2/4] mm: unified fault and migrate device page paths Mika Penttilä
@ 2025-08-14  7:19 ` Mika Penttilä
  2025-08-14  7:19 ` [RFC PATCH 4/4] mm: add new testcase for the migrate on fault case Mika Penttilä
  2025-08-15 11:36 ` [RFC PATCH 0/4] Migrate on fault for device pages Balbir Singh
  4 siblings, 0 replies; 22+ messages in thread
From: Mika Penttilä @ 2025-08-14  7:19 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Mika Penttilä, David Hildenbrand,
	Jason Gunthorpe, Leon Romanovsky, Alistair Popple, Balbir Singh

With the unified fault handling and migrate path, the
migrate_vma_collect_*() functions are unused, so remove them.

Cc: David Hildenbrand <david@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Leon Romanovsky <leonro@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Balbir Singh <balbirs@nvidia.com>

Signed-off-by: Mika Penttilä <mpenttil@redhat.com>
---
 mm/migrate_device.c | 312 +-------------------------------------------
 1 file changed, 1 insertion(+), 311 deletions(-)

diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 87ddc0353165..0c84dfcd5058 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -15,319 +15,9 @@
 #include <linux/rmap.h>
 #include <linux/swapops.h>
 #include <asm/tlbflush.h>
+#include <linux/hmm.h>
 #include "internal.h"
 
-static int migrate_vma_collect_skip(unsigned long start,
-				    unsigned long end,
-				    struct mm_walk *walk)
-{
-	struct migrate_vma *migrate = walk->private;
-	unsigned long addr;
-
-	for (addr = start; addr < end; addr += PAGE_SIZE) {
-		migrate->dst[migrate->npages] = 0;
-		migrate->src[migrate->npages++] = 0;
-	}
-
-	return 0;
-}
-
-static int migrate_vma_collect_hole(unsigned long start,
-				    unsigned long end,
-				    __always_unused int depth,
-				    struct mm_walk *walk)
-{
-	struct migrate_vma *migrate = walk->private;
-	unsigned long addr;
-
-	/* Only allow populating anonymous memory. */
-	if (!vma_is_anonymous(walk->vma))
-		return migrate_vma_collect_skip(start, end, walk);
-
-	for (addr = start; addr < end; addr += PAGE_SIZE) {
-		migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE;
-		migrate->dst[migrate->npages] = 0;
-		migrate->npages++;
-		migrate->cpages++;
-	}
-
-	return 0;
-}
-
-static int migrate_vma_collect_pmd(pmd_t *pmdp,
-				   unsigned long start,
-				   unsigned long end,
-				   struct mm_walk *walk)
-{
-	struct migrate_vma *migrate = walk->private;
-	struct folio *fault_folio = migrate->fault_page ?
-		page_folio(migrate->fault_page) : NULL;
-	struct vm_area_struct *vma = walk->vma;
-	struct mm_struct *mm = vma->vm_mm;
-	unsigned long addr = start, unmapped = 0;
-	spinlock_t *ptl;
-	pte_t *ptep;
-
-again:
-	if (pmd_none(*pmdp))
-		return migrate_vma_collect_hole(start, end, -1, walk);
-
-	if (pmd_trans_huge(*pmdp)) {
-		struct folio *folio;
-
-		ptl = pmd_lock(mm, pmdp);
-		if (unlikely(!pmd_trans_huge(*pmdp))) {
-			spin_unlock(ptl);
-			goto again;
-		}
-
-		folio = pmd_folio(*pmdp);
-		if (is_huge_zero_folio(folio)) {
-			spin_unlock(ptl);
-			split_huge_pmd(vma, pmdp, addr);
-		} else {
-			int ret;
-
-			folio_get(folio);
-			spin_unlock(ptl);
-			/* FIXME: we don't expect THP for fault_folio */
-			if (WARN_ON_ONCE(fault_folio == folio))
-				return migrate_vma_collect_skip(start, end,
-								walk);
-			if (unlikely(!folio_trylock(folio)))
-				return migrate_vma_collect_skip(start, end,
-								walk);
-			ret = split_folio(folio);
-			if (fault_folio != folio)
-				folio_unlock(folio);
-			folio_put(folio);
-			if (ret)
-				return migrate_vma_collect_skip(start, end,
-								walk);
-		}
-	}
-
-	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
-	if (!ptep)
-		goto again;
-	arch_enter_lazy_mmu_mode();
-
-	for (; addr < end; addr += PAGE_SIZE, ptep++) {
-		struct dev_pagemap *pgmap;
-		unsigned long mpfn = 0, pfn;
-		struct folio *folio;
-		struct page *page;
-		swp_entry_t entry;
-		pte_t pte;
-
-		pte = ptep_get(ptep);
-
-		if (pte_none(pte)) {
-			if (vma_is_anonymous(vma)) {
-				mpfn = MIGRATE_PFN_MIGRATE;
-				migrate->cpages++;
-			}
-			goto next;
-		}
-
-		if (!pte_present(pte)) {
-			/*
-			 * Only care about unaddressable device page special
-			 * page table entry. Other special swap entries are not
-			 * migratable, and we ignore regular swapped page.
-			 */
-			entry = pte_to_swp_entry(pte);
-			if (!is_device_private_entry(entry))
-				goto next;
-
-			page = pfn_swap_entry_to_page(entry);
-			pgmap = page_pgmap(page);
-			if (!(migrate->flags &
-				MIGRATE_VMA_SELECT_DEVICE_PRIVATE) ||
-			    pgmap->owner != migrate->pgmap_owner)
-				goto next;
-
-			mpfn = migrate_pfn(page_to_pfn(page)) |
-					MIGRATE_PFN_MIGRATE;
-			if (is_writable_device_private_entry(entry))
-				mpfn |= MIGRATE_PFN_WRITE;
-		} else {
-			pfn = pte_pfn(pte);
-			if (is_zero_pfn(pfn) &&
-			    (migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)) {
-				mpfn = MIGRATE_PFN_MIGRATE;
-				migrate->cpages++;
-				goto next;
-			}
-			page = vm_normal_page(migrate->vma, addr, pte);
-			if (page && !is_zone_device_page(page) &&
-			    !(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)) {
-				goto next;
-			} else if (page && is_device_coherent_page(page)) {
-				pgmap = page_pgmap(page);
-
-				if (!(migrate->flags &
-					MIGRATE_VMA_SELECT_DEVICE_COHERENT) ||
-					pgmap->owner != migrate->pgmap_owner)
-					goto next;
-			}
-			mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
-			mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
-		}
-
-		/* FIXME support THP */
-		if (!page || !page->mapping || PageTransCompound(page)) {
-			mpfn = 0;
-			goto next;
-		}
-
-		/*
-		 * By getting a reference on the folio we pin it and that blocks
-		 * any kind of migration. Side effect is that it "freezes" the
-		 * pte.
-		 *
-		 * We drop this reference after isolating the folio from the lru
-		 * for non device folio (device folio are not on the lru and thus
-		 * can't be dropped from it).
-		 */
-		folio = page_folio(page);
-		folio_get(folio);
-
-		/*
-		 * We rely on folio_trylock() to avoid deadlock between
-		 * concurrent migrations where each is waiting on the others
-		 * folio lock. If we can't immediately lock the folio we fail this
-		 * migration as it is only best effort anyway.
-		 *
-		 * If we can lock the folio it's safe to set up a migration entry
-		 * now. In the common case where the folio is mapped once in a
-		 * single process setting up the migration entry now is an
-		 * optimisation to avoid walking the rmap later with
-		 * try_to_migrate().
-		 */
-		if (fault_folio == folio || folio_trylock(folio)) {
-			bool anon_exclusive;
-			pte_t swp_pte;
-
-			flush_cache_page(vma, addr, pte_pfn(pte));
-			anon_exclusive = folio_test_anon(folio) &&
-					  PageAnonExclusive(page);
-			if (anon_exclusive) {
-				pte = ptep_clear_flush(vma, addr, ptep);
-
-				if (folio_try_share_anon_rmap_pte(folio, page)) {
-					set_pte_at(mm, addr, ptep, pte);
-					if (fault_folio != folio)
-						folio_unlock(folio);
-					folio_put(folio);
-					mpfn = 0;
-					goto next;
-				}
-			} else {
-				pte = ptep_get_and_clear(mm, addr, ptep);
-			}
-
-			migrate->cpages++;
-
-			/* Set the dirty flag on the folio now the pte is gone. */
-			if (pte_dirty(pte))
-				folio_mark_dirty(folio);
-
-			/* Setup special migration page table entry */
-			if (mpfn & MIGRATE_PFN_WRITE)
-				entry = make_writable_migration_entry(
-							page_to_pfn(page));
-			else if (anon_exclusive)
-				entry = make_readable_exclusive_migration_entry(
-							page_to_pfn(page));
-			else
-				entry = make_readable_migration_entry(
-							page_to_pfn(page));
-			if (pte_present(pte)) {
-				if (pte_young(pte))
-					entry = make_migration_entry_young(entry);
-				if (pte_dirty(pte))
-					entry = make_migration_entry_dirty(entry);
-			}
-			swp_pte = swp_entry_to_pte(entry);
-			if (pte_present(pte)) {
-				if (pte_soft_dirty(pte))
-					swp_pte = pte_swp_mksoft_dirty(swp_pte);
-				if (pte_uffd_wp(pte))
-					swp_pte = pte_swp_mkuffd_wp(swp_pte);
-			} else {
-				if (pte_swp_soft_dirty(pte))
-					swp_pte = pte_swp_mksoft_dirty(swp_pte);
-				if (pte_swp_uffd_wp(pte))
-					swp_pte = pte_swp_mkuffd_wp(swp_pte);
-			}
-			set_pte_at(mm, addr, ptep, swp_pte);
-
-			/*
-			 * This is like regular unmap: we remove the rmap and
-			 * drop the folio refcount. The folio won't be freed, as
-			 * we took a reference just above.
-			 */
-			folio_remove_rmap_pte(folio, page, vma);
-			folio_put(folio);
-
-			if (pte_present(pte))
-				unmapped++;
-		} else {
-			folio_put(folio);
-			mpfn = 0;
-		}
-
-next:
-		migrate->dst[migrate->npages] = 0;
-		migrate->src[migrate->npages++] = mpfn;
-	}
-
-	/* Only flush the TLB if we actually modified any entries */
-	if (unmapped)
-		flush_tlb_range(walk->vma, start, end);
-
-	arch_leave_lazy_mmu_mode();
-	pte_unmap_unlock(ptep - 1, ptl);
-
-	return 0;
-}
-
-static const struct mm_walk_ops migrate_vma_walk_ops = {
-	.pmd_entry		= migrate_vma_collect_pmd,
-	.pte_hole		= migrate_vma_collect_hole,
-	.walk_lock		= PGWALK_RDLOCK,
-};
-
-/*
- * migrate_vma_collect() - collect pages over a range of virtual addresses
- * @migrate: migrate struct containing all migration information
- *
- * This will walk the CPU page table. For each virtual address backed by a
- * valid page, it updates the src array and takes a reference on the page, in
- * order to pin the page until we lock it and unmap it.
- */
-static void migrate_vma_collect(struct migrate_vma *migrate)
-{
-	struct mmu_notifier_range range;
-
-	/*
-	 * Note that the pgmap_owner is passed to the mmu notifier callback so
-	 * that the registered device driver can skip invalidating device
-	 * private page mappings that won't be migrated.
-	 */
-	mmu_notifier_range_init_owner(&range, MMU_NOTIFY_MIGRATE, 0,
-		migrate->vma->vm_mm, migrate->start, migrate->end,
-		migrate->pgmap_owner);
-	mmu_notifier_invalidate_range_start(&range);
-
-	walk_page_range(migrate->vma->vm_mm, migrate->start, migrate->end,
-			&migrate_vma_walk_ops, migrate);
-
-	mmu_notifier_invalidate_range_end(&range);
-	migrate->end = migrate->start + (migrate->npages << PAGE_SHIFT);
-}
-
 /*
  * migrate_vma_check_page() - check if page is pinned or not
  * @page: struct page to check
-- 
2.50.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH 4/4] mm: add new testcase for the migrate on fault case
  2025-08-14  7:19 [RFC PATCH 0/4] Migrate on fault for device pages Mika Penttilä
                   ` (2 preceding siblings ...)
  2025-08-14  7:19 ` [RFC PATCH 3/4] mm:/migrate_device.c: remove migrate_vma_collect_*() functions Mika Penttilä
@ 2025-08-14  7:19 ` Mika Penttilä
  2025-08-15 11:36 ` [RFC PATCH 0/4] Migrate on fault for device pages Balbir Singh
  4 siblings, 0 replies; 22+ messages in thread
From: Mika Penttilä @ 2025-08-14  7:19 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Mika Penttilä, David Hildenbrand,
	Jason Gunthorpe, Leon Romanovsky, Alistair Popple, Balbir Singh,
	Marco Pagani

Cc: David Hildenbrand <david@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Leon Romanovsky <leonro@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Balbir Singh <balbirs@nvidia.com>

Signed-off-by: Marco Pagani <marpagan@redhat.com>
Signed-off-by: Mika Penttilä <mpenttil@redhat.com>
---
 lib/test_hmm.c                         | 100 ++++++++++++++++++++++++-
 lib/test_hmm_uapi.h                    |  17 +++--
 tools/testing/selftests/mm/hmm-tests.c |  53 +++++++++++++
 3 files changed, 161 insertions(+), 9 deletions(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index cd5c139213be..e69a5c87e414 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -36,6 +36,7 @@
 #define DMIRROR_RANGE_FAULT_TIMEOUT	1000
 #define DEVMEM_CHUNK_SIZE		(256 * 1024 * 1024U)
 #define DEVMEM_CHUNKS_RESERVE		16
+#define PFNS_ARRAY_SIZE			64
 
 /*
  * For device_private pages, dpage is just a dummy struct page
@@ -143,7 +144,7 @@ static bool dmirror_is_private_zone(struct dmirror_device *mdevice)
 		HMM_DMIRROR_MEMORY_DEVICE_PRIVATE) ? true : false;
 }
 
-static enum migrate_vma_direction
+static enum migrate_vma_info
 dmirror_select_device(struct dmirror *dmirror)
 {
 	return (dmirror->mdevice->zone_device_type ==
@@ -1016,6 +1017,99 @@ static int dmirror_migrate_to_device(struct dmirror *dmirror,
 	return ret;
 }
 
+static int do_fault_and_migrate(struct dmirror *dmirror, struct hmm_range *range)
+{
+	struct migrate_vma *migrate = range->migrate;
+	int ret;
+
+	mmap_read_lock(dmirror->notifier.mm);
+
+	/* Fault-in pages for migration and update device page table */
+	ret = dmirror_range_fault(dmirror, range);
+
+	pr_debug("Migrating from sys mem to device mem\n");
+	migrate_hmm_range_setup(range);
+
+	dmirror_migrate_alloc_and_copy(migrate, dmirror);
+	migrate_vma_pages(migrate);
+	dmirror_migrate_finalize_and_map(migrate, dmirror);
+	migrate_vma_finalize(migrate);
+
+	mmap_read_unlock(dmirror->notifier.mm);
+	return ret;
+}
+
+static int dmirror_fault_and_migrate_to_device(struct dmirror *dmirror,
+					       struct hmm_dmirror_cmd *cmd)
+{
+	unsigned long start, size, end, next;
+	unsigned long src_pfns[PFNS_ARRAY_SIZE] = { 0 };
+	unsigned long dst_pfns[PFNS_ARRAY_SIZE] = { 0 };
+	struct migrate_vma migrate = { 0 };
+	struct hmm_range range = { 0 };
+	struct dmirror_bounce bounce;
+	int ret = 0;
+
+	/* Whole range */
+	start = cmd->addr;
+	size = cmd->npages << PAGE_SHIFT;
+	end = start + size;
+
+	if (!mmget_not_zero(dmirror->notifier.mm)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	migrate.pgmap_owner = dmirror->mdevice;
+	migrate.src = src_pfns;
+	migrate.dst = dst_pfns;
+
+	range.migrate = &migrate;
+	range.hmm_pfns = src_pfns;
+	range.pfn_flags_mask = 0;
+	range.default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_MIGRATE;
+	range.dev_private_owner = dmirror->mdevice;
+	range.notifier = &dmirror->notifier;
+
+	for (next = start; next < end; next = range.end) {
+		range.start = next;
+		range.end = min(end, next + (PFNS_ARRAY_SIZE << PAGE_SHIFT));
+
+		pr_debug("Fault and migrate range start:%#lx end:%#lx\n",
+			 range.start, range.end);
+
+		ret = do_fault_and_migrate(dmirror, &range);
+		if (ret)
+			goto out_mmput;
+	}
+
+	/*
+	 * Return the migrated data for verification.
+	 * Only for pages in the device zone.
+	 */
+	ret = dmirror_bounce_init(&bounce, start, size);
+	if (ret)
+		goto out_mmput;
+
+	mutex_lock(&dmirror->mutex);
+	ret = dmirror_do_read(dmirror, start, end, &bounce);
+	mutex_unlock(&dmirror->mutex);
+	if (ret == 0) {
+		ret = copy_to_user(u64_to_user_ptr(cmd->ptr), bounce.ptr, bounce.size);
+		if (ret)
+			ret = -EFAULT;
+	}
+
+	cmd->cpages = bounce.cpages;
+	dmirror_bounce_fini(&bounce);
+
+
+out_mmput:
+	mmput(dmirror->notifier.mm);
+out:
+	return ret;
+}
+
 static void dmirror_mkentry(struct dmirror *dmirror, struct hmm_range *range,
 			    unsigned char *perm, unsigned long entry)
 {
@@ -1313,6 +1407,10 @@ static long dmirror_fops_unlocked_ioctl(struct file *filp,
 		ret = dmirror_migrate_to_device(dmirror, &cmd);
 		break;
 
+	case HMM_DMIRROR_MIGRATE_ON_FAULT_TO_DEV:
+		ret = dmirror_fault_and_migrate_to_device(dmirror, &cmd);
+		break;
+
 	case HMM_DMIRROR_MIGRATE_TO_SYS:
 		ret = dmirror_migrate_to_system(dmirror, &cmd);
 		break;
diff --git a/lib/test_hmm_uapi.h b/lib/test_hmm_uapi.h
index 8c818a2cf4f6..4266f0a12201 100644
--- a/lib/test_hmm_uapi.h
+++ b/lib/test_hmm_uapi.h
@@ -29,14 +29,15 @@ struct hmm_dmirror_cmd {
 };
 
 /* Expose the address space of the calling process through hmm device file */
-#define HMM_DMIRROR_READ		_IOWR('H', 0x00, struct hmm_dmirror_cmd)
-#define HMM_DMIRROR_WRITE		_IOWR('H', 0x01, struct hmm_dmirror_cmd)
-#define HMM_DMIRROR_MIGRATE_TO_DEV	_IOWR('H', 0x02, struct hmm_dmirror_cmd)
-#define HMM_DMIRROR_MIGRATE_TO_SYS	_IOWR('H', 0x03, struct hmm_dmirror_cmd)
-#define HMM_DMIRROR_SNAPSHOT		_IOWR('H', 0x04, struct hmm_dmirror_cmd)
-#define HMM_DMIRROR_EXCLUSIVE		_IOWR('H', 0x05, struct hmm_dmirror_cmd)
-#define HMM_DMIRROR_CHECK_EXCLUSIVE	_IOWR('H', 0x06, struct hmm_dmirror_cmd)
-#define HMM_DMIRROR_RELEASE		_IOWR('H', 0x07, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_READ			_IOWR('H', 0x00, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_WRITE			_IOWR('H', 0x01, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_MIGRATE_TO_DEV		_IOWR('H', 0x02, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_MIGRATE_ON_FAULT_TO_DEV	_IOWR('H', 0x03, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_MIGRATE_TO_SYS		_IOWR('H', 0x04, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_SNAPSHOT			_IOWR('H', 0x05, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_EXCLUSIVE			_IOWR('H', 0x06, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_CHECK_EXCLUSIVE		_IOWR('H', 0x07, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_RELEASE			_IOWR('H', 0x08, struct hmm_dmirror_cmd)
 
 /*
  * Values returned in hmm_dmirror_cmd.ptr for HMM_DMIRROR_SNAPSHOT.
diff --git a/tools/testing/selftests/mm/hmm-tests.c b/tools/testing/selftests/mm/hmm-tests.c
index 141bf63cbe05..059512f62d29 100644
--- a/tools/testing/selftests/mm/hmm-tests.c
+++ b/tools/testing/selftests/mm/hmm-tests.c
@@ -272,6 +272,13 @@ static int hmm_migrate_sys_to_dev(int fd,
 	return hmm_dmirror_cmd(fd, HMM_DMIRROR_MIGRATE_TO_DEV, buffer, npages);
 }
 
+static int hmm_migrate_on_fault_sys_to_dev(int fd,
+					   struct hmm_buffer *buffer,
+					   unsigned long npages)
+{
+	return hmm_dmirror_cmd(fd, HMM_DMIRROR_MIGRATE_ON_FAULT_TO_DEV, buffer, npages);
+}
+
 static int hmm_migrate_dev_to_sys(int fd,
 				   struct hmm_buffer *buffer,
 				   unsigned long npages)
@@ -998,6 +1005,52 @@ TEST_F(hmm, migrate)
 	hmm_buffer_free(buffer);
 }
 
+/*
+ * Fault and migrate anonymous memory to device private memory.
+ */
+TEST_F(hmm, migrate_on_fault)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	int ret;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/* Initialize buffer in system memory. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Fault and migrate memory to device. */
+	ret = hmm_migrate_on_fault_sys_to_dev(self->fd, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	hmm_buffer_free(buffer);
+}
+
 /*
  * Migrate anonymous memory to device private memory and fault some of it back
  * to system memory, then try migrating the resulting mix of system and device
-- 
2.50.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 1/4] mm: use current as mmu notifier's owner
  2025-08-14  7:19 ` [RFC PATCH 1/4] mm: use current as mmu notifier's owner Mika Penttilä
@ 2025-08-14 12:40   ` Jason Gunthorpe
  2025-08-14 12:53     ` Mika Penttilä
  0 siblings, 1 reply; 22+ messages in thread
From: Jason Gunthorpe @ 2025-08-14 12:40 UTC (permalink / raw)
  To: Mika Penttilä
  Cc: linux-mm, linux-kernel, David Hildenbrand, Leon Romanovsky,
	Alistair Popple, Balbir Singh

On Thu, Aug 14, 2025 at 10:19:26AM +0300, Mika Penttilä wrote:
> When doing migration in combination with device fault handling,
> detect the case in the interval notifier.
> 
> Without that, we would livelock with our own invalidations
> while migrating and splitting pages during fault handling.
> 
> Note, pgmap_owner, used in some other code paths as owner for filtering,
> is not readily available for split path, so use current for this use case.
> Also, current and pgmap_owner, both being pointers to memory, can not be
> mis-interpreted to each other.
> 
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Jason Gunthorpe <jgg@nvidia.com>
> Cc: Leon Romanovsky <leonro@nvidia.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Balbir Singh <balbirs@nvidia.com>
> 
> Signed-off-by: Mika Penttilä <mpenttil@redhat.com>
> ---
>  lib/test_hmm.c   | 5 +++++
>  mm/huge_memory.c | 6 +++---
>  mm/rmap.c        | 4 ++--
>  3 files changed, 10 insertions(+), 5 deletions(-)
> 
> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
> index 761725bc713c..cd5c139213be 100644
> --- a/lib/test_hmm.c
> +++ b/lib/test_hmm.c
> @@ -269,6 +269,11 @@ static bool dmirror_interval_invalidate(struct mmu_interval_notifier *mni,
>  	    range->owner == dmirror->mdevice)
>  		return true;
>  
> +	if (range->event == MMU_NOTIFY_CLEAR &&
> +	    range->owner == current) {
> +		return true;
> +	}

I don't understand this; there is nothing in hmm that says only
current can call hmm_range_fault, and indeed most applications won't
even guarantee that.

So if this plan relies on something like the above in drivers I don't
see how it can work.

If this is just some hack for tests, try instead to find a solution
that more accurately matches what a real driver should do.

But this also seems overall troublesome to your goal, if you do a
migrate inside hmm_range_fault() it will generate an invalidation call
back and that will increment the seqlock and we will loop
hmm_range_fault() again which rewalks.

Jason

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 1/4] mm: use current as mmu notifier's owner
  2025-08-14 12:40   ` Jason Gunthorpe
@ 2025-08-14 12:53     ` Mika Penttilä
  2025-08-14 13:04       ` Jason Gunthorpe
  0 siblings, 1 reply; 22+ messages in thread
From: Mika Penttilä @ 2025-08-14 12:53 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-mm, linux-kernel, David Hildenbrand, Leon Romanovsky,
	Alistair Popple, Balbir Singh


On 8/14/25 15:40, Jason Gunthorpe wrote:
> On Thu, Aug 14, 2025 at 10:19:26AM +0300, Mika Penttilä wrote:
>> When doing migration in combination with device fault handling,
>> detect the case in the interval notifier.
>>
>> Without that, we would livelock with our own invalidations
>> while migrating and splitting pages during fault handling.
>>
>> Note, pgmap_owner, used in some other code paths as owner for filtering,
>> is not readily available for split path, so use current for this use case.
>> Also, current and pgmap_owner, both being pointers to memory, can not be
>> mis-interpreted to each other.
>>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Jason Gunthorpe <jgg@nvidia.com>
>> Cc: Leon Romanovsky <leonro@nvidia.com>
>> Cc: Alistair Popple <apopple@nvidia.com>
>> Cc: Balbir Singh <balbirs@nvidia.com>
>>
>> Signed-off-by: Mika Penttilä <mpenttil@redhat.com>
>> ---
>>  lib/test_hmm.c   | 5 +++++
>>  mm/huge_memory.c | 6 +++---
>>  mm/rmap.c        | 4 ++--
>>  3 files changed, 10 insertions(+), 5 deletions(-)
>>
>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>> index 761725bc713c..cd5c139213be 100644
>> --- a/lib/test_hmm.c
>> +++ b/lib/test_hmm.c
>> @@ -269,6 +269,11 @@ static bool dmirror_interval_invalidate(struct mmu_interval_notifier *mni,
>>  	    range->owner == dmirror->mdevice)
>>  		return true;
>>  
>> +	if (range->event == MMU_NOTIFY_CLEAR &&
>> +	    range->owner == current) {
>> +		return true;
>> +	}
> I don't understand this, there is nothing in hmm that says only
> current can call hmm_range_fault, and indeed most applications won't
> even gurantee that.

No, it's the opposite: if we are ourselves the invalidator, we don't care.

> So if this plan relies on something like the above in drivers I don't
> see how it can work.
>
> If this is just some hack for tests, try instead to find a solution
> that more accurately matches what a real driver should do.
>
> But this also seems overall troublesome to your goal, if you do a
> migrate inside hmm_range_fault() it will generate an invalidation call
> back and that will increment the seqlock and we will loop
> hmm_range_fault() again which rewalks.

That's the problem this solves.
The semantics are "if we are the invalidator, don't wait for the
invalidate to end", i.e. don't call mmu_interval_set_seq(), which would
make the next mmu_interval_read_begin() hang waiting for the invalidate
to end.

>
> Jason
>
--Mika


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 1/4] mm: use current as mmu notifier's owner
  2025-08-14 12:53     ` Mika Penttilä
@ 2025-08-14 13:04       ` Jason Gunthorpe
  2025-08-14 13:20         ` Mika Penttilä
  0 siblings, 1 reply; 22+ messages in thread
From: Jason Gunthorpe @ 2025-08-14 13:04 UTC (permalink / raw)
  To: Mika Penttilä
  Cc: linux-mm, linux-kernel, David Hildenbrand, Leon Romanovsky,
	Alistair Popple, Balbir Singh

On Thu, Aug 14, 2025 at 03:53:00PM +0300, Mika Penttilä wrote:
> 
> On 8/14/25 15:40, Jason Gunthorpe wrote:
> > On Thu, Aug 14, 2025 at 10:19:26AM +0300, Mika Penttilä wrote:
> >> When doing migration in combination with device fault handling,
> >> detect the case in the interval notifier.
> >>
> >> Without that, we would livelock with our own invalidations
> >> while migrating and splitting pages during fault handling.
> >>
> >> Note, pgmap_owner, used in some other code paths as owner for filtering,
> >> is not readily available for split path, so use current for this use case.
> >> Also, current and pgmap_owner, both being pointers to memory, can not be
> >> mis-interpreted to each other.
> >>
> >> Cc: David Hildenbrand <david@redhat.com>
> >> Cc: Jason Gunthorpe <jgg@nvidia.com>
> >> Cc: Leon Romanovsky <leonro@nvidia.com>
> >> Cc: Alistair Popple <apopple@nvidia.com>
> >> Cc: Balbir Singh <balbirs@nvidia.com>
> >>
> >> Signed-off-by: Mika Penttilä <mpenttil@redhat.com>
> >> ---
> >>  lib/test_hmm.c   | 5 +++++
> >>  mm/huge_memory.c | 6 +++---
> >>  mm/rmap.c        | 4 ++--
> >>  3 files changed, 10 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
> >> index 761725bc713c..cd5c139213be 100644
> >> --- a/lib/test_hmm.c
> >> +++ b/lib/test_hmm.c
> >> @@ -269,6 +269,11 @@ static bool dmirror_interval_invalidate(struct mmu_interval_notifier *mni,
> >>  	    range->owner == dmirror->mdevice)
> >>  		return true;
> >>  
> >> +	if (range->event == MMU_NOTIFY_CLEAR &&
> >> +	    range->owner == current) {
> >> +		return true;
> >> +	}
> > I don't understand this, there is nothing in hmm that says only
> > current can call hmm_range_fault, and indeed most applications won't
> > even gurantee that.
> 
> No it's the opposite, if we are ourselves the invalidator, don't care.

I think you've missed the point: you cannot use range->owner in any
way to tell "we are ourselves the invalidator". It is simply not the
meaning of range->owner.

> > So if this plan relies on something like the above in drivers I don't
> > see how it can work.
> >
> > If this is just some hack for tests, try instead to find a solution
> > that more accurately matches what a real driver should do.
> >
> > But this also seems overall troublesome to your goal, if you do a
> > migrate inside hmm_range_fault() it will generate an invalidation call
> > back and that will increment the seqlock and we will loop
> > hmm_range_fault() again which rewalks.
> 
> That's the problem this solves.
> The semantics is "if we are the invalidator don't wait for invalidate end",
> aka don't mmu_interval_set_seq() that would make hang in the next mmu_interval_read_begin(),
> waiting the invalidate to end

I doubt we can skip mmu_interval_set_seq(); doing so can corrupt a
concurrent hmm_range_fault() in some other thread.

Nor, as I said, can we actually skip the invalidation toward HW
anyhow.

At the very least this commit message fails to explain the new locking
proposal, or justify why it can work.

Jason

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 1/4] mm: use current as mmu notifier's owner
  2025-08-14 13:04       ` Jason Gunthorpe
@ 2025-08-14 13:20         ` Mika Penttilä
  2025-08-14 14:11           ` Jason Gunthorpe
  0 siblings, 1 reply; 22+ messages in thread
From: Mika Penttilä @ 2025-08-14 13:20 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-mm, linux-kernel, David Hildenbrand, Leon Romanovsky,
	Alistair Popple, Balbir Singh


On 8/14/25 16:04, Jason Gunthorpe wrote:

> On Thu, Aug 14, 2025 at 03:53:00PM +0300, Mika Penttilä wrote:
>> On 8/14/25 15:40, Jason Gunthorpe wrote:
>>> On Thu, Aug 14, 2025 at 10:19:26AM +0300, Mika Penttilä wrote:
>>>> When doing migration in combination with device fault handling,
>>>> detect the case in the interval notifier.
>>>>
>>>> Without that, we would livelock with our own invalidations
>>>> while migrating and splitting pages during fault handling.
>>>>
>>>> Note, pgmap_owner, used in some other code paths as owner for filtering,
>>>> is not readily available for split path, so use current for this use case.
>>>> Also, current and pgmap_owner, both being pointers to memory, can not be
>>>> mis-interpreted to each other.
>>>>
>>>> Cc: David Hildenbrand <david@redhat.com>
>>>> Cc: Jason Gunthorpe <jgg@nvidia.com>
>>>> Cc: Leon Romanovsky <leonro@nvidia.com>
>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>> Cc: Balbir Singh <balbirs@nvidia.com>
>>>>
>>>> Signed-off-by: Mika Penttilä <mpenttil@redhat.com>
>>>> ---
>>>>  lib/test_hmm.c   | 5 +++++
>>>>  mm/huge_memory.c | 6 +++---
>>>>  mm/rmap.c        | 4 ++--
>>>>  3 files changed, 10 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>>>> index 761725bc713c..cd5c139213be 100644
>>>> --- a/lib/test_hmm.c
>>>> +++ b/lib/test_hmm.c
>>>> @@ -269,6 +269,11 @@ static bool dmirror_interval_invalidate(struct mmu_interval_notifier *mni,
>>>>  	    range->owner == dmirror->mdevice)
>>>>  		return true;
>>>>  
>>>> +	if (range->event == MMU_NOTIFY_CLEAR &&
>>>> +	    range->owner == current) {
>>>> +		return true;
>>>> +	}
>>> I don't understand this, there is nothing in hmm that says only
>>> current can call hmm_range_fault, and indeed most applications won't
>>> even gurantee that.
>> No it's the opposite, if we are ourselves the invalidator, don't care.
> I think you've missed the point, you cannot use range->owner in any
> way to tell "we are ourselves the invalidator". It is simply not the
> meaning of range->owner.

Usually it is the device, but it is used in a similar way; look for instance at nouveau:


static bool nouveau_svm_range_invalidate(struct mmu_interval_notifier *mni,
                                         const struct mmu_notifier_range *range,
                                         unsigned long cur_seq)
{
        struct svm_notifier *sn =
                container_of(mni, struct svm_notifier, notifier);

        if (range->event == MMU_NOTIFY_EXCLUSIVE &&
            range->owner == sn->svmm->vmm->cli->drm->dev)
                return true;

Where we return early in case we are the initiator of the make_device_exclusive. Otherwise,
it would also hang in the next mmu_interval_read_begin().

owner is void * and is admittedly used in a creative way here, but it can't be wrongly
interpreted as a dev pointer if it is current.
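
To be concrete, the idea on the invalidation side (the mm/huge_memory.c and
mm/rmap.c hunks are not quoted here, so this is only a sketch, not the actual
change) is to pass current as the owner when initializing the notifier range
around the split, roughly:

static void notify_split_range(struct vm_area_struct *vma,
                               unsigned long start, unsigned long end)
{
        struct mmu_notifier_range range;

        /*
         * Sketch only: using current as the owner lets the task that
         * triggered the split recognize and filter its own invalidation
         * in its interval notifier callback.
         */
        mmu_notifier_range_init_owner(&range, MMU_NOTIFY_CLEAR, 0,
                                      vma->vm_mm, start, end, current);
        mmu_notifier_invalidate_range_start(&range);
        /* ... unmap / split work ... */
        mmu_notifier_invalidate_range_end(&range);
}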

>
>>> So if this plan relies on something like the above in drivers I don't
>>> see how it can work.
>>>
>>> If this is just some hack for tests, try instead to find a solution
>>> that more accurately matches what a real driver should do.
>>>
>>> But this also seems overall troublesome to your goal, if you do a
>>> migrate inside hmm_range_fault() it will generate an invalidation call
>>> back and that will increment the seqlock and we will loop
>>> hmm_range_fault() again which rewalks.
>> That's the problem this solves.
>> The semantics is "if we are the invalidator don't wait for invalidate end",
>> aka don't mmu_interval_set_seq() that would make hang in the next mmu_interval_read_begin(),
>> waiting the invalidate to end
> I doubt we can skip mmu_interval_set_seq(), doing so can corrupt concurrent
> hmm_range_fault in some other thread.
>
> Nor, as I said, can we actually skip the invalidation toward HW
> anyhow.
>
> At the very least this commit message fails to explain the new locking
> proposal, or justify why it can work.

Yes, the commit message could be better.
But this is essentially the same as what nouveau is doing with
MMU_NOTIFY_EXCLUSIVE handling, just not using the dev as the qualifier,
because that is not easily available in this context.

--Mika

> Jason
>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 1/4] mm: use current as mmu notifier's owner
  2025-08-14 13:20         ` Mika Penttilä
@ 2025-08-14 14:11           ` Jason Gunthorpe
  2025-08-14 17:00             ` Mika Penttilä
  0 siblings, 1 reply; 22+ messages in thread
From: Jason Gunthorpe @ 2025-08-14 14:11 UTC (permalink / raw)
  To: Mika Penttilä
  Cc: linux-mm, linux-kernel, David Hildenbrand, Leon Romanovsky,
	Alistair Popple, Balbir Singh

On Thu, Aug 14, 2025 at 04:20:42PM +0300, Mika Penttilä wrote:
> 
> On 8/14/25 16:04, Jason Gunthorpe wrote:
> 
> > On Thu, Aug 14, 2025 at 03:53:00PM +0300, Mika Penttilä wrote:
> >> On 8/14/25 15:40, Jason Gunthorpe wrote:
> >>> On Thu, Aug 14, 2025 at 10:19:26AM +0300, Mika Penttilä wrote:
> >>>> When doing migration in combination with device fault handling,
> >>>> detect the case in the interval notifier.
> >>>>
> >>>> Without that, we would livelock with our own invalidations
> >>>> while migrating and splitting pages during fault handling.
> >>>>
> >>>> Note, pgmap_owner, used in some other code paths as owner for filtering,
> >>>> is not readily available for split path, so use current for this use case.
> >>>> Also, current and pgmap_owner, both being pointers to memory, can not be
> >>>> mis-interpreted to each other.
> >>>>
> >>>> Cc: David Hildenbrand <david@redhat.com>
> >>>> Cc: Jason Gunthorpe <jgg@nvidia.com>
> >>>> Cc: Leon Romanovsky <leonro@nvidia.com>
> >>>> Cc: Alistair Popple <apopple@nvidia.com>
> >>>> Cc: Balbir Singh <balbirs@nvidia.com>
> >>>>
> >>>> Signed-off-by: Mika Penttilä <mpenttil@redhat.com>
> >>>> ---
> >>>>  lib/test_hmm.c   | 5 +++++
> >>>>  mm/huge_memory.c | 6 +++---
> >>>>  mm/rmap.c        | 4 ++--
> >>>>  3 files changed, 10 insertions(+), 5 deletions(-)
> >>>>
> >>>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
> >>>> index 761725bc713c..cd5c139213be 100644
> >>>> --- a/lib/test_hmm.c
> >>>> +++ b/lib/test_hmm.c
> >>>> @@ -269,6 +269,11 @@ static bool dmirror_interval_invalidate(struct mmu_interval_notifier *mni,
> >>>>  	    range->owner == dmirror->mdevice)
> >>>>  		return true;
> >>>>  
> >>>> +	if (range->event == MMU_NOTIFY_CLEAR &&
> >>>> +	    range->owner == current) {
> >>>> +		return true;
> >>>> +	}
> >>> I don't understand this, there is nothing in hmm that says only
> >>> current can call hmm_range_fault, and indeed most applications won't
> >>> even gurantee that.
> >> No it's the opposite, if we are ourselves the invalidator, don't care.
> > I think you've missed the point, you cannot use range->owner in any
> > way to tell "we are ourselves the invalidator". It is simply not the
> > meaning of range->owner.
> 
> Usually it is the device but used similarly, look for instance nouveau:

Yes, dev is fine, but current or struct task is not reasonable.

> Yes the commit message could be better. 
> But this is essentially the same as nouveau is doing with 
> MMU_NOTIFY_EXCLUSIVE handling, just not using the dev as the
> qualifier,

I wouldn't necessarily assume anything nouveau does is correct, but this is
certainly not the same thing. nouveau is trying to eliminate an
unnecessary invalidation for its HW when it knows the memory is
already only private to this local device.

This is not a statement about callchain or recursion.

Jason

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 1/4] mm: use current as mmu notifier's owner
  2025-08-14 14:11           ` Jason Gunthorpe
@ 2025-08-14 17:00             ` Mika Penttilä
  2025-08-14 17:20               ` Jason Gunthorpe
  0 siblings, 1 reply; 22+ messages in thread
From: Mika Penttilä @ 2025-08-14 17:00 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-mm, linux-kernel, David Hildenbrand, Leon Romanovsky,
	Alistair Popple, Balbir Singh


On 8/14/25 17:11, Jason Gunthorpe wrote:

> On Thu, Aug 14, 2025 at 04:20:42PM +0300, Mika Penttilä wrote:
>> On 8/14/25 16:04, Jason Gunthorpe wrote:
>>
>>> On Thu, Aug 14, 2025 at 03:53:00PM +0300, Mika Penttilä wrote:
>>>> On 8/14/25 15:40, Jason Gunthorpe wrote:
>>>>> On Thu, Aug 14, 2025 at 10:19:26AM +0300, Mika Penttilä wrote:
>>>>>> When doing migration in combination with device fault handling,
>>>>>> detect the case in the interval notifier.
>>>>>>
>>>>>> Without that, we would livelock with our own invalidations
>>>>>> while migrating and splitting pages during fault handling.
>>>>>>
>>>>>> Note, pgmap_owner, used in some other code paths as owner for filtering,
>>>>>> is not readily available for split path, so use current for this use case.
>>>>>> Also, current and pgmap_owner, both being pointers to memory, can not be
>>>>>> mis-interpreted to each other.
>>>>>>
>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>> Cc: Jason Gunthorpe <jgg@nvidia.com>
>>>>>> Cc: Leon Romanovsky <leonro@nvidia.com>
>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>> Cc: Balbir Singh <balbirs@nvidia.com>
>>>>>>
>>>>>> Signed-off-by: Mika Penttilä <mpenttil@redhat.com>
>>>>>> ---
>>>>>>  lib/test_hmm.c   | 5 +++++
>>>>>>  mm/huge_memory.c | 6 +++---
>>>>>>  mm/rmap.c        | 4 ++--
>>>>>>  3 files changed, 10 insertions(+), 5 deletions(-)
>>>>>>
>>>>>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>>>>>> index 761725bc713c..cd5c139213be 100644
>>>>>> --- a/lib/test_hmm.c
>>>>>> +++ b/lib/test_hmm.c
>>>>>> @@ -269,6 +269,11 @@ static bool dmirror_interval_invalidate(struct mmu_interval_notifier *mni,
>>>>>>  	    range->owner == dmirror->mdevice)
>>>>>>  		return true;
>>>>>>  
>>>>>> +	if (range->event == MMU_NOTIFY_CLEAR &&
>>>>>> +	    range->owner == current) {
>>>>>> +		return true;
>>>>>> +	}
>>>>> I don't understand this, there is nothing in hmm that says only
>>>>> current can call hmm_range_fault, and indeed most applications won't
>>>>> even gurantee that.
>>>> No it's the opposite, if we are ourselves the invalidator, don't care.
>>> I think you've missed the point, you cannot use range->owner in any
>>> way to tell "we are ourselves the invalidator". It is simply not the
>>> meaning of range->owner.
>> Usually it is the device but used similarly, look for instance nouveau:
> Yes, dev is fine, but current or struct task is not reasonable.
>
>> Yes the commit message could be better. 
>> But this is essentially the same as nouveau is doing with 
>> MMU_NOTIFY_EXCLUSIVE handling, just not using the dev as the
>> qualifier,
> I wouldn't necesarily assume anything nouveau is correct, but this is
> certainly not the same thing. nouveau is trying to eliminate an
> unncessary invalidation for it's HW when it knows the memory is
> already only private to this local device.

Yes, but it is omitting
  mmu_interval_set_seq()
in that case while returning early, which is OK.

as does the hmm test module with:

         * Ignore invalidation callbacks for device private pages since
         * the invalidation is handled as part of the migration process.
         */
        if (range->event == MMU_NOTIFY_MIGRATE &&
            range->owner == dmirror->mdevice)
                return true;


> This is not a statement about callchain or recursion.

In my case the unnecessary (self-caused) invalidation causes a hang,
so whatever the reason for it being unnecessary, the mechanism is the same
and is used elsewhere in the kernel as well.

So technically I think that would work. But there are other ways to do that
without using owner, so I will look into that for the next version.

>
> Jason
>

--Mika


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 1/4] mm: use current as mmu notifier's owner
  2025-08-14 17:00             ` Mika Penttilä
@ 2025-08-14 17:20               ` Jason Gunthorpe
  2025-08-14 17:45                 ` Mika Penttilä
  0 siblings, 1 reply; 22+ messages in thread
From: Jason Gunthorpe @ 2025-08-14 17:20 UTC (permalink / raw)
  To: Mika Penttilä, Alistair Popple
  Cc: linux-mm, linux-kernel, David Hildenbrand, Leon Romanovsky,
	Balbir Singh

On Thu, Aug 14, 2025 at 08:00:01PM +0300, Mika Penttilä wrote:
> as well as hmm test module with :
> 
>          * Ignore invalidation callbacks for device private pages since
>          * the invalidation is handled as part of the migration process.
>          */
>         if (range->event == MMU_NOTIFY_MIGRATE &&
>             range->owner == dmirror->mdevice)
>                 return true;

If I recall this was about a very specific case where migration does a
number of invalidations and some of the earlier ones are known to be
redundant in this specific case. Redundant means it can be ignored
without causing an inconsistency.

Alistair would know, but I assumed this works OK because the above
invalidation doesn't actually go on to free any pages but keeps them
around until a later invalidation?

This is nothing like what your case is talking about.

Jason

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 1/4] mm: use current as mmu notifier's owner
  2025-08-14 17:20               ` Jason Gunthorpe
@ 2025-08-14 17:45                 ` Mika Penttilä
  2025-08-15  5:23                   ` Alistair Popple
  0 siblings, 1 reply; 22+ messages in thread
From: Mika Penttilä @ 2025-08-14 17:45 UTC (permalink / raw)
  To: Jason Gunthorpe, Alistair Popple
  Cc: linux-mm, linux-kernel, David Hildenbrand, Leon Romanovsky,
	Balbir Singh


On 8/14/25 20:20, Jason Gunthorpe wrote:

> On Thu, Aug 14, 2025 at 08:00:01PM +0300, Mika Penttilä wrote:
>> as well as hmm test module with :
>>
>>          * Ignore invalidation callbacks for device private pages since
>>          * the invalidation is handled as part of the migration process.
>>          */
>>         if (range->event == MMU_NOTIFY_MIGRATE &&
>>             range->owner == dmirror->mdevice)
>>                 return true;
> If I recall this was about a very specific case where migration does a
> number of invalidations and some of the earlier ones are known to be
> redundant in this specific case. Redundant means it can be ignored
> without causing an inconsistency.
>
> Alistair would know, but I assumed this works OK because the above
> invalidation doesn't actually go on to free any pages but keeps them
> around until a later invalidation?
>
> This is nothing like what your case is talking about.

This one is actually pretty similar: MMU_NOTIFY_CLEAR is also fired in the migration
process (the split case) and the invalidation is handled as part of the migration process.

But I already have a working version without any of that.

>
> Jason
>
--Mika


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 1/4] mm: use current as mmu notifier's owner
  2025-08-14 17:45                 ` Mika Penttilä
@ 2025-08-15  5:23                   ` Alistair Popple
  2025-08-15  7:11                     ` Mika Penttilä
  0 siblings, 1 reply; 22+ messages in thread
From: Alistair Popple @ 2025-08-15  5:23 UTC (permalink / raw)
  To: Mika Penttilä
  Cc: Jason Gunthorpe, linux-mm, linux-kernel, David Hildenbrand,
	Leon Romanovsky, Balbir Singh

On Thu, Aug 14, 2025 at 08:45:43PM +0300, Mika Penttilä wrote:
> 
> On 8/14/25 20:20, Jason Gunthorpe wrote:
> 
> > On Thu, Aug 14, 2025 at 08:00:01PM +0300, Mika Penttilä wrote:
> >> as well as hmm test module with :
> >>
> >>          * Ignore invalidation callbacks for device private pages since
> >>          * the invalidation is handled as part of the migration process.
> >>          */
> >>         if (range->event == MMU_NOTIFY_MIGRATE &&
> >>             range->owner == dmirror->mdevice)
> >>                 return true;
> > If I recall this was about a very specific case where migration does a
> > number of invalidations and some of the earlier ones are known to be
> > redundant in this specific case. Redundant means it can be ignored
> > without causing an inconsistency.
> >
> > Alistair would know, but I assumed this works OK because the above
> > invalidation doesn't actually go on to free any pages but keeps them
> > around until a later invalidation?

Right, the pages don't actually get freed because a reference is taken on them
during migrate_vma_setup(). However, other device MMUs still need invalidating
because the driver will go on to copy the page after this step. It's just
assumed that the driver is able to be consistent with itself (ie. it will unmap/
invalidate its own MMU prior to initiating the copy).

In practice I suspect what Mika is running into is that the page table
synchronisation for migration works slightly differently for migrate_vma_*().

Instead of using mmu_interval_notifier's which have a sequence number drivers
typically use normal mmu_notifier's and take a device specific lock to block
page table downgrades (eg. RW -> RO). This ensures it's safe to update the
device page tables with the PFNs/permissions collected in migrate_vma_setup()
(or the new PFN) by blocking other threads from updating the page table.

The usual problem with this approach is that when migrate_vma_setup() calls
the mmu_notifier it deadlocks on the device specific lock in the notifier
callback because it already holds the lock, which it can't drop before calling
migrate_vma_setup().

I think one of the main benefits of a series which consolidates these two
page-table mirroring techniques into common code would also be to make the
mirroring/invalidation logic the same for migration as for hmm_range_fault(), ie. to
move to mmu_interval notifiers with sequence numbers for migration, perhaps with
filtering if required/safe and retries.
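
For reference, a minimal sketch of that interval-notifier retry pattern (the
drv_mirror structure, its mm pointer and drv->pt_lock are hypothetical; the
rest is roughly the pattern described in Documentation/mm/hmm.rst):

static int drv_fault_range(struct drv_mirror *drv, struct hmm_range *range)
{
        int ret;

        do {
                range->notifier_seq = mmu_interval_read_begin(range->notifier);

                mmap_read_lock(drv->mm);
                ret = hmm_range_fault(range);
                mmap_read_unlock(drv->mm);
                if (ret == -EBUSY)
                        continue;       /* range was invalidated, walk again */
                if (ret)
                        return ret;

                mutex_lock(&drv->pt_lock);
                if (mmu_interval_read_retry(range->notifier,
                                            range->notifier_seq)) {
                        mutex_unlock(&drv->pt_lock);
                        continue;       /* raced with an invalidation, retry */
                }
                break;
        } while (1);

        /* safe to update the device page tables from range->hmm_pfns ... */
        mutex_unlock(&drv->pt_lock);
        return 0;
}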

 - Alistair

> > This is nothing like what your case is talking about.
> 
> This one is actually pretty similar, MMU_NOTIFY_CLEAR is also fired in migration process
> (split case) and invalidation handled part of the migration process.
> 
> But I have already a working version without any of that.
> 
> >
> > Jason
> >
> --Mika
> 
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 1/4] mm: use current as mmu notifier's owner
  2025-08-15  5:23                   ` Alistair Popple
@ 2025-08-15  7:11                     ` Mika Penttilä
  2025-08-19  4:27                       ` Balbir Singh
  0 siblings, 1 reply; 22+ messages in thread
From: Mika Penttilä @ 2025-08-15  7:11 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Jason Gunthorpe, linux-mm, linux-kernel, David Hildenbrand,
	Leon Romanovsky, Balbir Singh

On 8/15/25 08:23, Alistair Popple wrote:

> On Thu, Aug 14, 2025 at 08:45:43PM +0300, Mika Penttilä wrote:
>> On 8/14/25 20:20, Jason Gunthorpe wrote:
>>
>>> On Thu, Aug 14, 2025 at 08:00:01PM +0300, Mika Penttilä wrote:
>>>> as well as hmm test module with :
>>>>
>>>>          * Ignore invalidation callbacks for device private pages since
>>>>          * the invalidation is handled as part of the migration process.
>>>>          */
>>>>         if (range->event == MMU_NOTIFY_MIGRATE &&
>>>>             range->owner == dmirror->mdevice)
>>>>                 return true;
>>> If I recall this was about a very specific case where migration does a
>>> number of invalidations and some of the earlier ones are known to be
>>> redundant in this specific case. Redundant means it can be ignored
>>> without causing an inconsistency.
>>>
>>> Alistair would know, but I assumed this works OK because the above
>>> invalidation doesn't actually go on to free any pages but keeps them
>>> around until a later invalidation?

Thanks Alistair for your deep insights! 

> Right, the pages don't actually get freed because a reference is taken on them
> during migrate_vma_setup(). However other device MMU's still need invalidating
> because the driver will go on to copy the page after this step. It's just
> assumed that the driver is able to be consistent with itself (ie. it will unmap/
> invalidate it's own MMU prior to initiating the copy).

And a reference is taken as well in the migrate-on-fault case, during hmm_range_fault(),
if migrating.

>
> In practice I suspect what Mika is running into is that the page table
> synchronisation for migration works slightly differently for migrate_vma_*().
>
> Instead of using mmu_interval_notifier's which have a sequence number drivers
> typically use normal mmu_notifier's and take a device specific lock to block
> page table downgrades (eg. RW -> RO). This ensures it's safe to update the
> device page tables with the PFNs/permissions collected in migrate_vma_setup()
> (or the new PFN) by blocking other threads from updating the page table.
>
> The ususal problem with this approach is that when migrate_vma_setup() calls
> the mmu_notifier it deadlocks on the device specific lock in the notifier
> callback because it already holds the lock, which it can't drop before calling
> migrate_vma_setup().
>
> I think one of the main benefits of a series which consolidates these two
> page-table mirroring techniques into common code would also be to make the
> mirroring/invalidation logic the same for migration as hmm_range_fault(). Ie. to
> move to mmu_interval notifers with sequence numbers for migration, perhaps with
> filtering if required/safe and retries

Yes, with migrate_vma_setup() and its collect step removed, the firing of the mmu notifiers
and the "collecting" become an integral part of the hmm_range_fault() flow, so it is logical
to use interval notifiers for migrate as well.

I have removed the commit with the owner games. I studied it more and it seems it was added
to mitigate a bug in an early version, which led me to a wrong conclusion about the root cause
of the hang. That version had an unbalanced mmu_notifier_invalidate_range_start()
after returning from hmm_range_fault() with -EBUSY (after a folio split was done).
With that fixed, driving the migrate on fault using the interval notifiers seems to work well,
filtering MMU_NOTIFY_MIGRATE for the device for retries.
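
To make the intended driver-side usage concrete, with this series the
fault+migrate path would look roughly like the sketch below (the drv context,
its notifier and pgmap_owner are hypothetical placeholders; the flag and the
helper are the ones added in patch 2/4; the mmap lock and the interval
notifier read_begin/read_retry handling are omitted for brevity):

static int drv_migrate_to_device(struct drv_ctx *drv,
                                 unsigned long start, unsigned long end,
                                 unsigned long *src_pfns, unsigned long *dst_pfns)
{
        struct migrate_vma args = {
                .src            = src_pfns,
                .dst            = dst_pfns,
                .pgmap_owner    = drv->pgmap_owner,
                .flags          = MIGRATE_VMA_SELECT_SYSTEM,
        };
        struct hmm_range range = {
                .notifier       = &drv->notifier,
                .start          = start,
                .end            = end,
                .hmm_pfns       = src_pfns,
                .default_flags  = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_MIGRATE,
                .dev_private_owner = drv->pgmap_owner,
                .migrate        = &args,
        };
        int ret;

        /* one walk: faults pages in and installs migration entries */
        ret = hmm_range_fault(&range);
        if (ret)
                return ret;

        /* instead of migrate_vma_setup(); vma/start/end were filled by the walk */
        migrate_hmm_range_setup(&range);

        /* allocate device pages, fill args.dst, copy the data, then: */
        migrate_vma_pages(&args);
        migrate_vma_finalize(&args);
        return 0;
}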

>
>  - Alistair

--Mika



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 0/4] Migrate on fault for device pages
  2025-08-14  7:19 [RFC PATCH 0/4] Migrate on fault for device pages Mika Penttilä
                   ` (3 preceding siblings ...)
  2025-08-14  7:19 ` [RFC PATCH 4/4] mm: add new testcase for the migrate on fault case Mika Penttilä
@ 2025-08-15 11:36 ` Balbir Singh
  2025-08-15 11:44   ` Mika Penttilä
  4 siblings, 1 reply; 22+ messages in thread
From: Balbir Singh @ 2025-08-15 11:36 UTC (permalink / raw)
  To: Mika Penttilä, linux-mm
  Cc: linux-kernel, David Hildenbrand, Jason Gunthorpe, Leon Romanovsky,
	Alistair Popple

On 8/14/25 17:19, Mika Penttilä wrote:
> As of this writing, the way device page faulting and migration
> works is not optimal, if you want to do both fault handling
> and migration at once.
> 
> Being able to migrate not present pages (or pages mapped with incorrect
> permissions, eg. COW) to the GPU requires doing either of the following
> sequences:
> 
> 1. hmm_range_fault() - fault in non-present pages with correct
>    permissions,etc.
> 2. migrate_vma_*() - migrate the pages
> 
> Or:
> 
> 1. migrate_vma_*() - migrate present pages
> 2. If non-present pages detected by migrate_vma_*():
>    a) call hmm_range_fault() to fault pages in
>    b) call migrate_vma_*() again to migrate now present pages
> 
> The problem with the first sequence is that you always have to do two
> page walks even when most of the time the pages are present or zero page
> mappings so the common case takes a performance hit.
> 
> The second sequence is better for the common case, but far worse if
> pages aren't present because now you have to walk the page tables three
> times (once to find the page is not present, once so hmm_range_fault()
> can find a non-present page to fault in and once again to setup the
> migration). It also tricky to code correctly.
> 
> We should be able to walk the page table once, faulting
> pages in as required and replacing them with migration entries if
> requested.
> 

The use case makes sense to me, but isn't the sequence always going
to be racy? By the time the pages are faulted in, there could be
others that have been marked non-present, or do you intend to lock
all pages during this operation?

Balbir

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 0/4] Migrate on fault for device pages
  2025-08-15 11:36 ` [RFC PATCH 0/4] Migrate on fault for device pages Balbir Singh
@ 2025-08-15 11:44   ` Mika Penttilä
  0 siblings, 0 replies; 22+ messages in thread
From: Mika Penttilä @ 2025-08-15 11:44 UTC (permalink / raw)
  To: Balbir Singh, linux-mm
  Cc: linux-kernel, David Hildenbrand, Jason Gunthorpe, Leon Romanovsky,
	Alistair Popple


On 8/15/25 14:36, Balbir Singh wrote:

> On 8/14/25 17:19, Mika Penttilä wrote:
>> As of this writing, the way device page faulting and migration
>> works is not optimal, if you want to do both fault handling
>> and migration at once.
>>
>> Being able to migrate not present pages (or pages mapped with incorrect
>> permissions, eg. COW) to the GPU requires doing either of the following
>> sequences:
>>
>> 1. hmm_range_fault() - fault in non-present pages with correct
>>    permissions,etc.
>> 2. migrate_vma_*() - migrate the pages
>>
>> Or:
>>
>> 1. migrate_vma_*() - migrate present pages
>> 2. If non-present pages detected by migrate_vma_*():
>>    a) call hmm_range_fault() to fault pages in
>>    b) call migrate_vma_*() again to migrate now present pages
>>
>> The problem with the first sequence is that you always have to do two
>> page walks even when most of the time the pages are present or zero page
>> mappings so the common case takes a performance hit.
>>
>> The second sequence is better for the common case, but far worse if
>> pages aren't present because now you have to walk the page tables three
>> times (once to find the page is not present, once so hmm_range_fault()
>> can find a non-present page to fault in and once again to setup the
>> migration). It also tricky to code correctly.
>>
>> We should be able to walk the page table once, faulting
>> pages in as required and replacing them with migration entries if
>> requested.
>>
> The use case makes sense to me, but isn't the sequence always going
> to be racy, by the time the pages are faulted in, there could be
> others that have been marked non-present or do you intend to lock
> all pages during this operation?
>
> Balbir

Yes, the pages are "collected", i.e. locked and a reference taken as soon as they are faulted in.

--Mika

>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 1/4] mm: use current as mmu notifier's owner
  2025-08-15  7:11                     ` Mika Penttilä
@ 2025-08-19  4:27                       ` Balbir Singh
  2025-08-19  4:33                         ` Mika Penttilä
  0 siblings, 1 reply; 22+ messages in thread
From: Balbir Singh @ 2025-08-19  4:27 UTC (permalink / raw)
  To: Mika Penttilä, Alistair Popple
  Cc: Jason Gunthorpe, linux-mm, linux-kernel, David Hildenbrand,
	Leon Romanovsky

On 8/15/25 17:11, Mika Penttilä wrote:
> On 8/15/25 08:23, Alistair Popple wrote:
> 
>> On Thu, Aug 14, 2025 at 08:45:43PM +0300, Mika Penttilä wrote:
>>> On 8/14/25 20:20, Jason Gunthorpe wrote:
>>>
>>>> On Thu, Aug 14, 2025 at 08:00:01PM +0300, Mika Penttilä wrote:
>>>>> as well as hmm test module with :
>>>>>
>>>>>          * Ignore invalidation callbacks for device private pages since
>>>>>          * the invalidation is handled as part of the migration process.
>>>>>          */
>>>>>         if (range->event == MMU_NOTIFY_MIGRATE &&
>>>>>             range->owner == dmirror->mdevice)
>>>>>                 return true;
>>>> If I recall this was about a very specific case where migration does a
>>>> number of invalidations and some of the earlier ones are known to be
>>>> redundant in this specific case. Redundant means it can be ignored
>>>> without causing an inconsistency.
>>>>
>>>> Alistair would know, but I assumed this works OK because the above
>>>> invalidation doesn't actually go on to free any pages but keeps them
>>>> around until a later invalidation?
> 
> Thanks Alistair for your deep insights! 
> 
>> Right, the pages don't actually get freed because a reference is taken on them
>> during migrate_vma_setup(). However other device MMU's still need invalidating
>> because the driver will go on to copy the page after this step. It's just
>> assumed that the driver is able to be consistent with itself (ie. it will unmap/
>> invalidate it's own MMU prior to initiating the copy).
> 
> And reference is taken as well in migrate on fault during hmm_range_fault
> if migrating.
> 
>>
>> In practice I suspect what Mika is running into is that the page table
>> synchronisation for migration works slightly differently for migrate_vma_*().
>>
>> Instead of using mmu_interval_notifier's which have a sequence number drivers
>> typically use normal mmu_notifier's and take a device specific lock to block
>> page table downgrades (eg. RW -> RO). This ensures it's safe to update the
>> device page tables with the PFNs/permissions collected in migrate_vma_setup()
>> (or the new PFN) by blocking other threads from updating the page table.
>>
>> The ususal problem with this approach is that when migrate_vma_setup() calls
>> the mmu_notifier it deadlocks on the device specific lock in the notifier
>> callback because it already holds the lock, which it can't drop before calling
>> migrate_vma_setup().
>>
>> I think one of the main benefits of a series which consolidates these two
>> page-table mirroring techniques into common code would also be to make the
>> mirroring/invalidation logic the same for migration as hmm_range_fault(). Ie. to
>> move to mmu_interval notifers with sequence numbers for migration, perhaps with
>> filtering if required/safe and retries
> 
> Yes with the migrate_vma_setup() and collecting removed, the firing of mmu notifiers
> and "collecting" are integral part of the hmm_range_fault() flow, so logical to use
> interval notifiers for migrate also.
> 
> I have removed the commit with the owner games. I studied it more and seems it was added
> to mitigate a bug in an early version, which led me to do wrong conclusion of the root cause
> of the hang. That version had unbalanced mmu_notifier_invalidate_range_start()
> after returning from hmm_range_fault() with EBUSY (after done a folio split).
> With that fixed, driving the migrate on fault using the interval notifiers seems to work well, 
> filtering MMU_NOTIFY_MIGRATE for device for retries.
> 

So this patch can be ignored in the series?

Balbir

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 1/4] mm: use current as mmu notifier's owner
  2025-08-19  4:27                       ` Balbir Singh
@ 2025-08-19  4:33                         ` Mika Penttilä
  0 siblings, 0 replies; 22+ messages in thread
From: Mika Penttilä @ 2025-08-19  4:33 UTC (permalink / raw)
  To: Balbir Singh, Alistair Popple
  Cc: Jason Gunthorpe, linux-mm, linux-kernel, David Hildenbrand,
	Leon Romanovsky

Hi,

On 8/19/25 07:27, Balbir Singh wrote:

> On 8/15/25 17:11, Mika Penttilä wrote:
>> On 8/15/25 08:23, Alistair Popple wrote:
>>
>>> On Thu, Aug 14, 2025 at 08:45:43PM +0300, Mika Penttilä wrote:
>>>> On 8/14/25 20:20, Jason Gunthorpe wrote:
>>>>
>>>>> On Thu, Aug 14, 2025 at 08:00:01PM +0300, Mika Penttilä wrote:
>>>>>> as well as hmm test module with :
>>>>>>
>>>>>>          * Ignore invalidation callbacks for device private pages since
>>>>>>          * the invalidation is handled as part of the migration process.
>>>>>>          */
>>>>>>         if (range->event == MMU_NOTIFY_MIGRATE &&
>>>>>>             range->owner == dmirror->mdevice)
>>>>>>                 return true;
>>>>> If I recall this was about a very specific case where migration does a
>>>>> number of invalidations and some of the earlier ones are known to be
>>>>> redundant in this specific case. Redundant means it can be ignored
>>>>> without causing an inconsistency.
>>>>>
>>>>> Alistair would know, but I assumed this works OK because the above
>>>>> invalidation doesn't actually go on to free any pages but keeps them
>>>>> around until a later invalidation?
>> Thanks Alistair for your deep insights! 
>>
>>> Right, the pages don't actually get freed because a reference is taken on them
>>> during migrate_vma_setup(). However other device MMU's still need invalidating
>>> because the driver will go on to copy the page after this step. It's just
>>> assumed that the driver is able to be consistent with itself (ie. it will unmap/
>>> invalidate it's own MMU prior to initiating the copy).
>> And reference is taken as well in migrate on fault during hmm_range_fault
>> if migrating.
>>
>>> In practice I suspect what Mika is running into is that the page table
>>> synchronisation for migration works slightly differently for migrate_vma_*().
>>>
>>> Instead of using mmu_interval_notifier's which have a sequence number drivers
>>> typically use normal mmu_notifier's and take a device specific lock to block
>>> page table downgrades (eg. RW -> RO). This ensures it's safe to update the
>>> device page tables with the PFNs/permissions collected in migrate_vma_setup()
>>> (or the new PFN) by blocking other threads from updating the page table.
>>>
>>> The ususal problem with this approach is that when migrate_vma_setup() calls
>>> the mmu_notifier it deadlocks on the device specific lock in the notifier
>>> callback because it already holds the lock, which it can't drop before calling
>>> migrate_vma_setup().
>>>
>>> I think one of the main benefits of a series which consolidates these two
>>> page-table mirroring techniques into common code would also be to make the
>>> mirroring/invalidation logic the same for migration as hmm_range_fault(). Ie. to
>>> move to mmu_interval notifers with sequence numbers for migration, perhaps with
>>> filtering if required/safe and retries
>> Yes with the migrate_vma_setup() and collecting removed, the firing of mmu notifiers
>> and "collecting" are integral part of the hmm_range_fault() flow, so logical to use
>> interval notifiers for migrate also.
>>
>> I have removed the commit with the owner games. I studied it more and seems it was added
>> to mitigate a bug in an early version, which led me to do wrong conclusion of the root cause
>> of the hang. That version had unbalanced mmu_notifier_invalidate_range_start()
>> after returning from hmm_range_fault() with EBUSY (after done a folio split).
>> With that fixed, driving the migrate on fault using the interval notifiers seems to work well, 
>> filtering MMU_NOTIFY_MIGRATE for device for retries.
>>
> So this patch can be ignored in the series?

Yes, this can be ignored. I will do more testing and repost after a while with this commit
removed and possibly with other changes if needed.

>
> Balbir
>
--Mika


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/4] mm: unified fault and migrate device page paths
  2025-08-14  7:19 ` [RFC PATCH 2/4] mm: unified fault and migrate device page paths Mika Penttilä
@ 2025-08-21  4:30   ` Balbir Singh
  2025-08-21  5:10     ` Mika Penttilä
  0 siblings, 1 reply; 22+ messages in thread
From: Balbir Singh @ 2025-08-21  4:30 UTC (permalink / raw)
  To: Mika Penttilä, linux-mm
  Cc: linux-kernel, David Hildenbrand, Jason Gunthorpe, Leon Romanovsky,
	Alistair Popple

On 8/14/25 17:19, Mika Penttilä wrote:
> As of this writing, the way device page faulting and migration works
> is not optimal, if you want to do both fault handling and
> migration at once.
> 
> Being able to migrate not present pages (or pages mapped with incorrect
> permissions, eg. COW) to the GPU requires doing either of the
> following sequences:
> 
> 1. hmm_range_fault() - fault in non-present pages with correct permissions, etc.
> 2. migrate_vma_*() - migrate the pages
> 
> Or:
> 
> 1. migrate_vma_*() - migrate present pages
> 2. If non-present pages detected by migrate_vma_*():
>    a) call hmm_range_fault() to fault pages in
>    b) call migrate_vma_*() again to migrate now present pages
> 
> The problem with the first sequence is that you always have to do two
> page walks even when most of the time the pages are present or zero page
> mappings so the common case takes a performance hit.
> 
> The second sequence is better for the common case, but far worse if
> pages aren't present because now you have to walk the page tables three
> times (once to find the page is not present, once so hmm_range_fault()
> can find a non-present page to fault in and once again to setup the
> migration). It also tricky to code correctly.
> 
> We should be able to walk the page table once, faulting
> pages in as required and replacing them with migration entries if
> requested.
> 
> Add a new flag to HMM APIs, HMM_PFN_REQ_MIGRATE,
> which tells to prepare for migration also during fault handling.
> Also, for the migrate_vma_setup() call paths, a flags, MIGRATE_VMA_FAULT,
> is added to tell to add fault handling to migrate.
> 
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Jason Gunthorpe <jgg@nvidia.com>
> Cc: Leon Romanovsky <leonro@nvidia.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Balbir Singh <balbirs@nvidia.com>
> 
> Suggested-by: Alistair Popple <apopple@nvidia.com>
> Signed-off-by: Mika Penttilä <mpenttil@redhat.com>
> ---
>  include/linux/hmm.h     |  10 +-
>  include/linux/migrate.h |   6 +-
>  mm/hmm.c                | 351 ++++++++++++++++++++++++++++++++++++++--
>  mm/migrate_device.c     |  72 ++++++++-
>  4 files changed, 420 insertions(+), 19 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index db75ffc949a7..7485e549c675 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -12,7 +12,7 @@
>  #include <linux/mm.h>
>  
>  struct mmu_interval_notifier;
> -
> +struct migrate_vma;
>  /*
>   * On output:
>   * 0             - The page is faultable and a future call with 
> @@ -48,11 +48,14 @@ enum hmm_pfn_flags {
>  	HMM_PFN_P2PDMA     = 1UL << (BITS_PER_LONG - 5),
>  	HMM_PFN_P2PDMA_BUS = 1UL << (BITS_PER_LONG - 6),
>  
> -	HMM_PFN_ORDER_SHIFT = (BITS_PER_LONG - 11),
> +	/* Migrate request */
> +	HMM_PFN_MIGRATE    = 1UL << (BITS_PER_LONG - 7),
> +	HMM_PFN_ORDER_SHIFT = (BITS_PER_LONG - 12),
>  
>  	/* Input flags */
>  	HMM_PFN_REQ_FAULT = HMM_PFN_VALID,
>  	HMM_PFN_REQ_WRITE = HMM_PFN_WRITE,
> +	HMM_PFN_REQ_MIGRATE = HMM_PFN_MIGRATE,
>  
>  	HMM_PFN_FLAGS = ~((1UL << HMM_PFN_ORDER_SHIFT) - 1),
>  };
> @@ -107,6 +110,7 @@ static inline unsigned int hmm_pfn_to_map_order(unsigned long hmm_pfn)
>   * @default_flags: default flags for the range (write, read, ... see hmm doc)
>   * @pfn_flags_mask: allows to mask pfn flags so that only default_flags matter
>   * @dev_private_owner: owner of device private pages
> + * @migrate: structure for migrating the associated vma
>   */
>  struct hmm_range {
>  	struct mmu_interval_notifier *notifier;
> @@ -117,12 +121,14 @@ struct hmm_range {
>  	unsigned long		default_flags;
>  	unsigned long		pfn_flags_mask;
>  	void			*dev_private_owner;
> +	struct migrate_vma      *migrate;
>  };
>  
>  /*
>   * Please see Documentation/mm/hmm.rst for how to use the range API.
>   */
>  int hmm_range_fault(struct hmm_range *range);
> +int hmm_range_migrate_prepare(struct hmm_range *range, struct migrate_vma **pargs);
>  
>  /*
>   * HMM_RANGE_DEFAULT_TIMEOUT - default timeout (ms) when waiting for a range
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index acadd41e0b5c..ab35d0f1f65d 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -3,6 +3,7 @@
>  #define _LINUX_MIGRATE_H
>  
>  #include <linux/mm.h>
> +#include <linux/hmm.h>
>  #include <linux/mempolicy.h>
>  #include <linux/migrate_mode.h>
>  #include <linux/hugetlb.h>
> @@ -143,10 +144,11 @@ static inline unsigned long migrate_pfn(unsigned long pfn)
>  	return (pfn << MIGRATE_PFN_SHIFT) | MIGRATE_PFN_VALID;
>  }
>  
> -enum migrate_vma_direction {
> +enum migrate_vma_info {
>  	MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
>  	MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
>  	MIGRATE_VMA_SELECT_DEVICE_COHERENT = 1 << 2,
> +	MIGRATE_VMA_FAULT = 1 << 3,
>  };
>  

I suspect there are some points of conflict with my series that can be resolved.

>  struct migrate_vma {
> @@ -194,7 +196,7 @@ void migrate_device_pages(unsigned long *src_pfns, unsigned long *dst_pfns,
>  			unsigned long npages);
>  void migrate_device_finalize(unsigned long *src_pfns,
>  			unsigned long *dst_pfns, unsigned long npages);
> -
> +void migrate_hmm_range_setup(struct hmm_range *range);
>  #endif /* CONFIG_MIGRATION */
>  
>  #endif /* _LINUX_MIGRATE_H */
> diff --git a/mm/hmm.c b/mm/hmm.c
> index d545e2494994..8cb2b325fa9f 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -20,6 +20,7 @@
>  #include <linux/pagemap.h>
>  #include <linux/swapops.h>
>  #include <linux/hugetlb.h>
> +#include <linux/migrate.h>
>  #include <linux/memremap.h>
>  #include <linux/sched/mm.h>
>  #include <linux/jump_label.h>
> @@ -33,6 +34,10 @@
>  struct hmm_vma_walk {
>  	struct hmm_range	*range;
>  	unsigned long		last;
> +	struct mmu_notifier_range mmu_range;
> +	struct vm_area_struct 	*vma;
> +	unsigned long 		start;
> +	unsigned long 		end;
>  };
>  
>  enum {
> @@ -47,15 +52,33 @@ enum {
>  			      HMM_PFN_P2PDMA_BUS,
>  };
>  
> +static enum migrate_vma_info hmm_want_migrate(struct hmm_range *range)

hmm_want_migrate -> hmm_select_and_migrate?

> +{
> +	enum migrate_vma_info minfo;
> +
> +	minfo = range->migrate ? range->migrate->flags : 0;
> +	minfo |= (range->default_flags & HMM_PFN_REQ_MIGRATE) ?
> +		MIGRATE_VMA_SELECT_SYSTEM : 0;
> +

Just to understand: this selects just system pages?

> +	return minfo;
> +}
> +
>  static int hmm_pfns_fill(unsigned long addr, unsigned long end,
> -			 struct hmm_range *range, unsigned long cpu_flags)
> +			 struct hmm_vma_walk *hmm_vma_walk, unsigned long cpu_flags)
>  {
> +	struct hmm_range *range = hmm_vma_walk->range;
>  	unsigned long i = (addr - range->start) >> PAGE_SHIFT;
>  
> +	if (cpu_flags != HMM_PFN_ERROR)
> +		if (hmm_want_migrate(range) &&
> +		    (vma_is_anonymous(hmm_vma_walk->vma)))
> +			cpu_flags |= (HMM_PFN_VALID | HMM_PFN_MIGRATE);
> +
>  	for (; addr < end; addr += PAGE_SIZE, i++) {
>  		range->hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
>  		range->hmm_pfns[i] |= cpu_flags;
>  	}
> +
>  	return 0;
>  }
>  
> @@ -171,11 +194,11 @@ static int hmm_vma_walk_hole(unsigned long addr, unsigned long end,
>  	if (!walk->vma) {
>  		if (required_fault)
>  			return -EFAULT;
> -		return hmm_pfns_fill(addr, end, range, HMM_PFN_ERROR);
> +		return hmm_pfns_fill(addr, end, hmm_vma_walk, HMM_PFN_ERROR);
>  	}
>  	if (required_fault)
>  		return hmm_vma_fault(addr, end, required_fault, walk);
> -	return hmm_pfns_fill(addr, end, range, 0);
> +	return hmm_pfns_fill(addr, end, hmm_vma_walk, 0);
>  }
>  
>  static inline unsigned long hmm_pfn_flags_order(unsigned long order)
> @@ -326,6 +349,257 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
>  	return hmm_vma_fault(addr, end, required_fault, walk);
>  }
>  
> +/*
> + * Install migration entries if migration requested, either from fault
> + * or migrate paths.
> + *
> + */
> +static void hmm_vma_handle_migrate_prepare(const struct mm_walk *walk,
> +					   pmd_t *pmdp,
> +					   unsigned long addr,
> +					   unsigned long *hmm_pfn)
> +{
> +	struct hmm_vma_walk *hmm_vma_walk = walk->private;
> +	struct hmm_range *range = hmm_vma_walk->range;
> +	struct migrate_vma *migrate = range->migrate;
> +	struct mm_struct *mm = walk->vma->vm_mm;
> +	struct folio *fault_folio = NULL;
> +	enum migrate_vma_info minfo;
> +	struct dev_pagemap *pgmap;
> +	bool anon_exclusive;
> +	struct folio *folio;
> +	unsigned long pfn;
> +	struct page *page;
> +	swp_entry_t entry;
> +	pte_t pte, swp_pte;
> +	spinlock_t *ptl;
> +	bool writable = false;
> +	pte_t *ptep;
> +
> +
> +	// Do we want to migrate at all?
> +	minfo = hmm_want_migrate(range);
> +	if (!minfo)
> +		return;
> +
> +	fault_folio = (migrate && migrate->fault_page) ?
> +		page_folio(migrate->fault_page) : NULL;
> +
> +	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> +	if (!ptep)
> +		return;
> +
> +	pte = ptep_get(ptep);
> +
> +	if (pte_none(pte)) {
> +		// migrate without faulting case
> +		if (vma_is_anonymous(walk->vma))
> +			*hmm_pfn = HMM_PFN_MIGRATE|HMM_PFN_VALID;
> +		goto out;
> +	}
> +
> +	if (!(*hmm_pfn & HMM_PFN_VALID))
> +		goto out;
> +
> +	if (!pte_present(pte)) {
> +		/*
> +		 * Only care about unaddressable device page special
> +		 * page table entry. Other special swap entries are not
> +		 * migratable, and we ignore regular swapped page.
> +		 */
> +		entry = pte_to_swp_entry(pte);
> +		if (!is_device_private_entry(entry))
> +			goto out;
> +
> +		// We have already checked that are the pgmap owners
> +		if (!(minfo & MIGRATE_VMA_SELECT_DEVICE_PRIVATE))
> +			goto out;
> +
> +		page = pfn_swap_entry_to_page(entry);
> +		pfn = page_to_pfn(page);
> +		if (is_writable_device_private_entry(entry))
> +			writable = true;
> +	} else {
> +		pfn = pte_pfn(pte);
> +		if (is_zero_pfn(pfn) &&
> +		    (minfo & MIGRATE_VMA_SELECT_SYSTEM)) {
> +			*hmm_pfn = HMM_PFN_MIGRATE|HMM_PFN_VALID;
> +			goto out;
> +		}
> +		page = vm_normal_page(walk->vma, addr, pte);
> +		if (page && !is_zone_device_page(page) &&
> +		    !(minfo & MIGRATE_VMA_SELECT_SYSTEM)) {
> +			goto out;
> +		} else if (page && is_device_coherent_page(page)) {
> +			pgmap = page_pgmap(page);
> +
> +			if (!(minfo &
> +			      MIGRATE_VMA_SELECT_DEVICE_COHERENT) ||
> +			    pgmap->owner != migrate->pgmap_owner)
> +				goto out;
> +		}
> +		writable = pte_write(pte);
> +	}
> +
> +	/* FIXME support THP */
> +	if (!page || !page->mapping || PageTransCompound(page))
> +		goto out;
> +
> +	/*
> +	 * By getting a reference on the folio we pin it and that blocks
> +	 * any kind of migration. Side effect is that it "freezes" the
> +	 * pte.
> +	 *
> +	 * We drop this reference after isolating the folio from the lru
> +	 * for non device folio (device folio are not on the lru and thus
> +	 * can't be dropped from it).
> +	 */
> +	folio = page_folio(page);
> +	folio_get(folio);
> +
> +	/*
> +	 * We rely on folio_trylock() to avoid deadlock between
> +	 * concurrent migrations where each is waiting on the others
> +	 * folio lock. If we can't immediately lock the folio we fail this
> +	 * migration as it is only best effort anyway.
> +	 *
> +	 * If we can lock the folio it's safe to set up a migration entry
> +	 * now. In the common case where the folio is mapped once in a
> +	 * single process setting up the migration entry now is an
> +	 * optimisation to avoid walking the rmap later with
> +	 * try_to_migrate().
> +	 */
> +
> +	if (fault_folio == folio || folio_trylock(folio)) {
> +		anon_exclusive = folio_test_anon(folio) &&
> +			PageAnonExclusive(page);
> +
> +		flush_cache_page(walk->vma, addr, pfn);
> +
> +		if (anon_exclusive) {
> +			pte = ptep_clear_flush(walk->vma, addr, ptep);
> +
> +			if (folio_try_share_anon_rmap_pte(folio, page)) {
> +				set_pte_at(mm, addr, ptep, pte);
> +				folio_unlock(folio);
> +				folio_put(folio);
> +				goto out;
> +			}
> +		} else {
> +			pte = ptep_get_and_clear(mm, addr, ptep);
> +		}
> +
> +		/* Setup special migration page table entry */
> +		if (writable)
> +			entry = make_writable_migration_entry(pfn);
> +		else if (anon_exclusive)
> +			entry = make_readable_exclusive_migration_entry(pfn);
> +		else
> +			entry = make_readable_migration_entry(pfn);
> +
> +		swp_pte = swp_entry_to_pte(entry);
> +		if (pte_present(pte)) {
> +			if (pte_soft_dirty(pte))
> +				swp_pte = pte_swp_mksoft_dirty(swp_pte);
> +			if (pte_uffd_wp(pte))
> +				swp_pte = pte_swp_mkuffd_wp(swp_pte);
> +		} else {
> +			if (pte_swp_soft_dirty(pte))
> +				swp_pte = pte_swp_mksoft_dirty(swp_pte);
> +			if (pte_swp_uffd_wp(pte))
> +				swp_pte = pte_swp_mkuffd_wp(swp_pte);
> +		}
> +
> +		set_pte_at(mm, addr, ptep, swp_pte);
> +		folio_remove_rmap_pte(folio, page, walk->vma);
> +		folio_put(folio);
> +		*hmm_pfn |= HMM_PFN_MIGRATE;
> +
> +		if (pte_present(pte))
> +			flush_tlb_range(walk->vma, addr, addr + PAGE_SIZE);
> +	} else
> +		folio_put(folio);
> +out:
> +	pte_unmap_unlock(ptep, ptl);
> +
> +}
> +
> +static int hmm_vma_walk_split(pmd_t *pmdp,
> +			      unsigned long addr,
> +			      struct mm_walk *walk)
> +{
> +	struct hmm_vma_walk *hmm_vma_walk = walk->private;
> +	struct hmm_range *range = hmm_vma_walk->range;
> +	struct migrate_vma *migrate = range->migrate;
> +	struct folio *folio, *fault_folio;
> +	spinlock_t *ptl;
> +	int ret = 0;
> +
> +	fault_folio = (migrate && migrate->fault_page) ?
> +		page_folio(migrate->fault_page) : NULL;
> +
> +	ptl = pmd_lock(walk->mm, pmdp);
> +	if (unlikely(!pmd_trans_huge(*pmdp))) {
> +		spin_unlock(ptl);
> +		goto out;
> +	}
> +
> +	folio = pmd_folio(*pmdp);
> +	if (is_huge_zero_folio(folio)) {
> +		spin_unlock(ptl);
> +		split_huge_pmd(walk->vma, pmdp, addr);
> +	} else {
> +		folio_get(folio);
> +		spin_unlock(ptl);
> +		/* FIXME: we don't expect THP for fault_folio */
> +		if (WARN_ON_ONCE(fault_folio == folio)) {
> +			folio_put(folio);
> +			ret = -EBUSY;
> +			goto out;
> +		}
> +		if (unlikely(!folio_trylock(folio))) {
> +			folio_put(folio);
> +			ret = -EBUSY;
> +			goto out;
> +		}
> +		ret = split_folio(folio);
> +		folio_unlock(folio);
> +		folio_put(folio);
> +	}
> +out:
> +	return ret;
> +}
> +
> +static int hmm_vma_capture_migrate_range(unsigned long start,
> +					 unsigned long end,
> +					 struct mm_walk *walk)
> +{
> +	struct hmm_vma_walk *hmm_vma_walk = walk->private;
> +	struct hmm_range *range = hmm_vma_walk->range;
> +
> +	if (!hmm_want_migrate(range))
> +		return 0;
> +
> +	if (hmm_vma_walk->vma && (hmm_vma_walk->vma != walk->vma))
> +		return -ERANGE;
> +
> +	hmm_vma_walk->vma = walk->vma;
> +	hmm_vma_walk->start = start;
> +	hmm_vma_walk->end = end;
> +
> +	if (end - start > range->end - range->start)
> +		return -ERANGE;
> +
> +	if (!hmm_vma_walk->mmu_range.owner) {
> +		mmu_notifier_range_init_owner(&hmm_vma_walk->mmu_range, MMU_NOTIFY_MIGRATE, 0,
> +					      walk->vma->vm_mm, start, end,
> +					      range->dev_private_owner);
> +		mmu_notifier_invalidate_range_start(&hmm_vma_walk->mmu_range);
> +	}
> +
> +	return 0;
> +}
> +
>  static int hmm_vma_walk_pmd(pmd_t *pmdp,
>  			    unsigned long start,
>  			    unsigned long end,
> @@ -351,13 +625,28 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
>  			pmd_migration_entry_wait(walk->mm, pmdp);
>  			return -EBUSY;
>  		}
> -		return hmm_pfns_fill(start, end, range, 0);
> +		return hmm_pfns_fill(start, end, hmm_vma_walk, 0);
>  	}
>  
>  	if (!pmd_present(pmd)) {
>  		if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0))
>  			return -EFAULT;
> -		return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
> +		return hmm_pfns_fill(start, end, hmm_vma_walk, HMM_PFN_ERROR);
> +	}
> +
> +	if (hmm_want_migrate(range) &&
> +	    pmd_trans_huge(pmd)) {
> +		int r;
> +
> +		r = hmm_vma_walk_split(pmdp, addr, walk);
> +		if (r) {
> +			/* Split not successful, skip */
> +			return hmm_pfns_fill(start, end, hmm_vma_walk, HMM_PFN_ERROR);
> +		}
> +
> +		/* Split successful or "again", reloop */
> +		hmm_vma_walk->last = addr;
> +		return -EBUSY;
>  	}
>  
>  	if (pmd_trans_huge(pmd)) {
> @@ -386,7 +675,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
>  	if (pmd_bad(pmd)) {
>  		if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0))
>  			return -EFAULT;
> -		return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
> +		return hmm_pfns_fill(start, end, hmm_vma_walk, HMM_PFN_ERROR);
>  	}
>  
>  	ptep = pte_offset_map(pmdp, addr);
> @@ -400,8 +689,11 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
>  			/* hmm_vma_handle_pte() did pte_unmap() */
>  			return r;
>  		}
> +
> +		hmm_vma_handle_migrate_prepare(walk, pmdp, addr, hmm_pfns);
>  	}
>  	pte_unmap(ptep - 1);
> +
>  	return 0;
>  }
>  
> @@ -535,6 +827,11 @@ static int hmm_vma_walk_test(unsigned long start, unsigned long end,
>  	struct hmm_vma_walk *hmm_vma_walk = walk->private;
>  	struct hmm_range *range = hmm_vma_walk->range;
>  	struct vm_area_struct *vma = walk->vma;
> +	int r;
> +
> +	r = hmm_vma_capture_migrate_range(start, end, walk);
> +	if (r)
> +		return r;
>  
>  	if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)) &&
>  	    vma->vm_flags & VM_READ)
> @@ -557,7 +854,7 @@ static int hmm_vma_walk_test(unsigned long start, unsigned long end,
>  				 (end - start) >> PAGE_SHIFT, 0))
>  		return -EFAULT;
>  
> -	hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
> +	hmm_pfns_fill(start, end, hmm_vma_walk, HMM_PFN_ERROR);
>  
>  	/* Skip this vma and continue processing the next vma. */
>  	return 1;
> @@ -587,9 +884,17 @@ static const struct mm_walk_ops hmm_walk_ops = {
>   *		the invalidation to finish.
>   * -EFAULT:     A page was requested to be valid and could not be made valid
>   *              ie it has no backing VMA or it is illegal to access
> + * -ERANGE:     The range crosses multiple VMAs, or space for hmm_pfns array
> + *              is too low.
>   *
>   * This is similar to get_user_pages(), except that it can read the page tables
>   * without mutating them (ie causing faults).
> + *
> + * If want to do migrate after faultin, call hmm_range_fault() with
> + * HMM_PFN_REQ_MIGRATE and initialize range.migrate field.
> + * After hmm_range_fault() call migrate_hmm_range_setup() instead of
> + * migrate_vma_setup() and after that follow normal migrate calls path.
> + *
>   */
>  int hmm_range_fault(struct hmm_range *range)
>  {
> @@ -597,16 +902,28 @@ int hmm_range_fault(struct hmm_range *range)
>  		.range = range,
>  		.last = range->start,
>  	};
> -	struct mm_struct *mm = range->notifier->mm;
> +	bool is_fault_path = !!range->notifier;
> +	struct mm_struct *mm;
>  	int ret;
>  
> +	/*
> +	 *
> +	 *  Could be serving a device fault or come from migrate
> +	 *  entry point. For the former we have not resolved the vma
> +	 *  yet, and the latter we don't have a notifier (but have a vma).
> +	 *
> +	 */
> +	mm = is_fault_path ? range->notifier->mm : range->migrate->vma->vm_mm;
>  	mmap_assert_locked(mm);
>  
>  	do {
>  		/* If range is no longer valid force retry. */
> -		if (mmu_interval_check_retry(range->notifier,
> -					     range->notifier_seq))
> -			return -EBUSY;
> +		if (is_fault_path && mmu_interval_check_retry(range->notifier,
> +					     range->notifier_seq)) {
> +			ret = -EBUSY;
> +			break;
> +		}
> +
>  		ret = walk_page_range(mm, hmm_vma_walk.last, range->end,
>  				      &hmm_walk_ops, &hmm_vma_walk);
>  		/*
> @@ -616,6 +933,18 @@ int hmm_range_fault(struct hmm_range *range)
>  		 * output, and all >= are still at their input values.
>  		 */
>  	} while (ret == -EBUSY);
> +
> +	if (hmm_want_migrate(range) && range->migrate &&
> +	    hmm_vma_walk.mmu_range.owner) {
> +		// The migrate_vma path has the following initialized
> +		if (is_fault_path) {
> +			range->migrate->vma   = hmm_vma_walk.vma;
> +			range->migrate->start = range->start;
> +			range->migrate->end   = hmm_vma_walk.end;
> +		}
> +		mmu_notifier_invalidate_range_end(&hmm_vma_walk.mmu_range);
> +	}
> +
>  	return ret;
>  }
>  EXPORT_SYMBOL(hmm_range_fault);
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index e05e14d6eacd..87ddc0353165 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -535,7 +535,18 @@ static void migrate_vma_unmap(struct migrate_vma *migrate)
>   */
>  int migrate_vma_setup(struct migrate_vma *args)
>  {
> +	int ret;
>  	long nr_pages = (args->end - args->start) >> PAGE_SHIFT;
> +	struct hmm_range range = {
> +		.notifier = NULL,
> +		.start = args->start,
> +		.end = args->end,
> +		.migrate = args,
> +		.hmm_pfns = args->src,
> +		.default_flags = HMM_PFN_REQ_MIGRATE,
> +		.dev_private_owner = args->pgmap_owner,
> +		.migrate = args
> +	};
>  
>  	args->start &= PAGE_MASK;
>  	args->end &= PAGE_MASK;
> @@ -560,17 +571,19 @@ int migrate_vma_setup(struct migrate_vma *args)
>  	args->cpages = 0;
>  	args->npages = 0;
>  
> -	migrate_vma_collect(args);
> +	if (args->flags & MIGRATE_VMA_FAULT)
> +		range.default_flags |= HMM_PFN_REQ_FAULT;
>  
> -	if (args->cpages)
> -		migrate_vma_unmap(args);
> +	ret = hmm_range_fault(&range);
> +
> +	migrate_hmm_range_setup(&range);
>  
>  	/*
>  	 * At this point pages are locked and unmapped, and thus they have
>  	 * stable content and can safely be copied to destination memory that
>  	 * is allocated by the drivers.
>  	 */
> -	return 0;
> +	return ret;
>  
>  }
>  EXPORT_SYMBOL(migrate_vma_setup);
> @@ -1014,3 +1027,54 @@ int migrate_device_coherent_folio(struct folio *folio)
>  		return 0;
>  	return -EBUSY;
>  }
> +
> +void migrate_hmm_range_setup(struct hmm_range *range)
> +{
> +
> +	struct migrate_vma *migrate = range->migrate;
> +
> +	if (!migrate)
> +		return;
> +
> +	migrate->npages = (migrate->end - migrate->start) >> PAGE_SHIFT;
> +	migrate->cpages = 0;
> +
> +	for (unsigned long i = 0; i < migrate->npages; i++) {
> +
> +		unsigned long pfn = range->hmm_pfns[i];
> +
> +		/*
> +		 * Don't do migration if valid and migrate flags are not both set.
> +		 */
> +		if ((pfn & (HMM_PFN_VALID | HMM_PFN_MIGRATE)) !=
> +		    (HMM_PFN_VALID | HMM_PFN_MIGRATE)) {
> +			migrate->src[i] = 0;
> +			migrate->dst[i] = 0;
> +			continue;
> +		}
> +
> +		migrate->cpages++;
> +
> +		/*
> +		 * The zero page is encoded in a special way: valid and migrate
> +		 * are set and the pfn part is zero. Encode it specially for
> +		 * migrate as well.
> +		 */
> +		if (pfn == (HMM_PFN_VALID|HMM_PFN_MIGRATE)) {
> +			migrate->src[i] = MIGRATE_PFN_MIGRATE;
> +			continue;
> +		}
> +
> +		migrate->src[i] = migrate_pfn(page_to_pfn(hmm_pfn_to_page(pfn)))
> +			| MIGRATE_PFN_MIGRATE;
> +		migrate->src[i] |= (pfn & HMM_PFN_WRITE) ? MIGRATE_PFN_WRITE : 0;
> +	}
> +
> +	if (migrate->cpages)
> +		migrate_vma_unmap(migrate);
> +
> +}
> +EXPORT_SYMBOL(migrate_hmm_range_setup);


I've not had a chance to test the code; do you have any numbers with the changes
to show the advantages of doing both fault and migrate together?

Balbir

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/4] mm: unified fault and migrate device page paths
  2025-08-21  4:30   ` Balbir Singh
@ 2025-08-21  5:10     ` Mika Penttilä
  2025-08-22  5:02       ` Alistair Popple
  0 siblings, 1 reply; 22+ messages in thread
From: Mika Penttilä @ 2025-08-21  5:10 UTC (permalink / raw)
  To: Balbir Singh, linux-mm
  Cc: linux-kernel, David Hildenbrand, Jason Gunthorpe, Leon Romanovsky,
	Alistair Popple


On 8/21/25 07:30, Balbir Singh wrote:

> On 8/14/25 17:19, Mika Penttilä wrote:
>> As of this writing, the way device page faulting and migration works
>> is not optimal if you want to do both fault handling and
>> migration at once.
>>
>> Being able to migrate not present pages (or pages mapped with incorrect
>> permissions, eg. COW) to the GPU requires doing either of the
>> following sequences:
>>
>> 1. hmm_range_fault() - fault in non-present pages with correct permissions, etc.
>> 2. migrate_vma_*() - migrate the pages
>>
>> Or:
>>
>> 1. migrate_vma_*() - migrate present pages
>> 2. If non-present pages detected by migrate_vma_*():
>>    a) call hmm_range_fault() to fault pages in
>>    b) call migrate_vma_*() again to migrate now present pages
>>
>> The problem with the first sequence is that you always have to do two
>> page walks even when most of the time the pages are present or zero page
>> mappings, so the common case takes a performance hit.
>>
>> The second sequence is better for the common case, but far worse if
>> pages aren't present because now you have to walk the page tables three
>> times (once to find the page is not present, once so hmm_range_fault()
>> can find a non-present page to fault in and once again to set up the
>> migration). It is also tricky to code correctly.
>>
>> We should be able to walk the page table once, faulting
>> pages in as required and replacing them with migration entries if
>> requested.
>>
>> Add a new flag to the HMM API, HMM_PFN_REQ_MIGRATE,
>> which requests that migration also be prepared during fault handling.
>> Also, for the migrate_vma_setup() call paths, a flag, MIGRATE_VMA_FAULT,
>> is added to request fault handling as part of migration.
>>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Jason Gunthorpe <jgg@nvidia.com>
>> Cc: Leon Romanovsky <leonro@nvidia.com>
>> Cc: Alistair Popple <apopple@nvidia.com>
>> Cc: Balbir Singh <balbirs@nvidia.com>
>>
>> Suggested-by: Alistair Popple <apopple@nvidia.com>
>> Signed-off-by: Mika Penttilä <mpenttil@redhat.com>
>> ---
>>  include/linux/hmm.h     |  10 +-
>>  include/linux/migrate.h |   6 +-
>>  mm/hmm.c                | 351 ++++++++++++++++++++++++++++++++++++++--
>>  mm/migrate_device.c     |  72 ++++++++-
>>  4 files changed, 420 insertions(+), 19 deletions(-)
>>
>> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
>> index db75ffc949a7..7485e549c675 100644
>> --- a/include/linux/hmm.h
>> +++ b/include/linux/hmm.h
>> @@ -12,7 +12,7 @@
>>  #include <linux/mm.h>
>>  
>>  struct mmu_interval_notifier;
>> -
>> +struct migrate_vma;
>>  /*
>>   * On output:
>>   * 0             - The page is faultable and a future call with 
>> @@ -48,11 +48,14 @@ enum hmm_pfn_flags {
>>  	HMM_PFN_P2PDMA     = 1UL << (BITS_PER_LONG - 5),
>>  	HMM_PFN_P2PDMA_BUS = 1UL << (BITS_PER_LONG - 6),
>>  
>> -	HMM_PFN_ORDER_SHIFT = (BITS_PER_LONG - 11),
>> +	/* Migrate request */
>> +	HMM_PFN_MIGRATE    = 1UL << (BITS_PER_LONG - 7),
>> +	HMM_PFN_ORDER_SHIFT = (BITS_PER_LONG - 12),
>>  
>>  	/* Input flags */
>>  	HMM_PFN_REQ_FAULT = HMM_PFN_VALID,
>>  	HMM_PFN_REQ_WRITE = HMM_PFN_WRITE,
>> +	HMM_PFN_REQ_MIGRATE = HMM_PFN_MIGRATE,
>>  
>>  	HMM_PFN_FLAGS = ~((1UL << HMM_PFN_ORDER_SHIFT) - 1),
>>  };
>> @@ -107,6 +110,7 @@ static inline unsigned int hmm_pfn_to_map_order(unsigned long hmm_pfn)
>>   * @default_flags: default flags for the range (write, read, ... see hmm doc)
>>   * @pfn_flags_mask: allows to mask pfn flags so that only default_flags matter
>>   * @dev_private_owner: owner of device private pages
>> + * @migrate: structure for migrating the associated vma
>>   */
>>  struct hmm_range {
>>  	struct mmu_interval_notifier *notifier;
>> @@ -117,12 +121,14 @@ struct hmm_range {
>>  	unsigned long		default_flags;
>>  	unsigned long		pfn_flags_mask;
>>  	void			*dev_private_owner;
>> +	struct migrate_vma      *migrate;
>>  };
>>  
>>  /*
>>   * Please see Documentation/mm/hmm.rst for how to use the range API.
>>   */
>>  int hmm_range_fault(struct hmm_range *range);
>> +int hmm_range_migrate_prepare(struct hmm_range *range, struct migrate_vma **pargs);
>>  
>>  /*
>>   * HMM_RANGE_DEFAULT_TIMEOUT - default timeout (ms) when waiting for a range
>> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
>> index acadd41e0b5c..ab35d0f1f65d 100644
>> --- a/include/linux/migrate.h
>> +++ b/include/linux/migrate.h
>> @@ -3,6 +3,7 @@
>>  #define _LINUX_MIGRATE_H
>>  
>>  #include <linux/mm.h>
>> +#include <linux/hmm.h>
>>  #include <linux/mempolicy.h>
>>  #include <linux/migrate_mode.h>
>>  #include <linux/hugetlb.h>
>> @@ -143,10 +144,11 @@ static inline unsigned long migrate_pfn(unsigned long pfn)
>>  	return (pfn << MIGRATE_PFN_SHIFT) | MIGRATE_PFN_VALID;
>>  }
>>  
>> -enum migrate_vma_direction {
>> +enum migrate_vma_info {
>>  	MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
>>  	MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
>>  	MIGRATE_VMA_SELECT_DEVICE_COHERENT = 1 << 2,
>> +	MIGRATE_VMA_FAULT = 1 << 3,
>>  };
>>  
> I suspect there are some points of conflict with my series that can be resolved

Yes, there are some; I have also been looking into them and they seem not too bad.

>
>>  struct migrate_vma {
>> @@ -194,7 +196,7 @@ void migrate_device_pages(unsigned long *src_pfns, unsigned long *dst_pfns,
>>  			unsigned long npages);
>>  void migrate_device_finalize(unsigned long *src_pfns,
>>  			unsigned long *dst_pfns, unsigned long npages);
>> -
>> +void migrate_hmm_range_setup(struct hmm_range *range);
>>  #endif /* CONFIG_MIGRATION */
>>  
>>  #endif /* _LINUX_MIGRATE_H */
>> diff --git a/mm/hmm.c b/mm/hmm.c
>> index d545e2494994..8cb2b325fa9f 100644
>> --- a/mm/hmm.c
>> +++ b/mm/hmm.c
>> @@ -20,6 +20,7 @@
>>  #include <linux/pagemap.h>
>>  #include <linux/swapops.h>
>>  #include <linux/hugetlb.h>
>> +#include <linux/migrate.h>
>>  #include <linux/memremap.h>
>>  #include <linux/sched/mm.h>
>>  #include <linux/jump_label.h>
>> @@ -33,6 +34,10 @@
>>  struct hmm_vma_walk {
>>  	struct hmm_range	*range;
>>  	unsigned long		last;
>> +	struct mmu_notifier_range mmu_range;
>> +	struct vm_area_struct 	*vma;
>> +	unsigned long 		start;
>> +	unsigned long 		end;
>>  };
>>  
>>  enum {
>> @@ -47,15 +52,33 @@ enum {
>>  			      HMM_PFN_P2PDMA_BUS,
>>  };
>>  
>> +static enum migrate_vma_info hmm_want_migrate(struct hmm_range *range)
> hmm_want_migrate -> hmm_select_and_migrate?

Yeah maybe that's better

>
>> +{
>> +	enum migrate_vma_info minfo;
>> +
>> +	minfo = range->migrate ? range->migrate->flags : 0;
>> +	minfo |= (range->default_flags & HMM_PFN_REQ_MIGRATE) ?
>> +		MIGRATE_VMA_SELECT_SYSTEM : 0;
>> +
> Just to understand, this selects just system pages

Yes, it indicates the migration type for the fault path (migrate on fault).
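
To make the fault path ("migrate on fault") concrete, here is a rough sketch of
the driver-side flow I have in mind. Only hmm_range_fault(), the new
HMM_PFN_REQ_MIGRATE flag and migrate_hmm_range_setup() are from this series;
the notifier, the arrays, the owner and the dst allocation/copy are
placeholders, and retry/locking details are elided:

	unsigned long hmm_pfns[NPAGES], src_pfns[NPAGES], dst_pfns[NPAGES];
	struct migrate_vma migrate = {
		.src = src_pfns,
		.dst = dst_pfns,
		.pgmap_owner = drvdata,		/* placeholder owner */
	};
	struct hmm_range range = {
		.notifier = &notifier,		/* driver's interval notifier */
		.notifier_seq = mmu_interval_read_begin(&notifier),
		.start = start,
		.end = end,
		.hmm_pfns = hmm_pfns,
		.default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_MIGRATE,
		.dev_private_owner = drvdata,
		.migrate = &migrate,
	};
	int ret;

	mmap_read_lock(mm);
	/* One walk: fault pages in and install migration entries. */
	ret = hmm_range_fault(&range);
	if (!ret) {
		/* Instead of migrate_vma_setup(). */
		migrate_hmm_range_setup(&range);
		/* Allocate device pages into dst_pfns[], copy, then: */
		migrate_vma_pages(&migrate);
		migrate_vma_finalize(&migrate);
	}
	mmap_read_unlock(mm);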

>
>> +	return minfo;
>> +}
>> +
>>  static int hmm_pfns_fill(unsigned long addr, unsigned long end,
>> -			 struct hmm_range *range, unsigned long cpu_flags)
>> +			 struct hmm_vma_walk *hmm_vma_walk, unsigned long cpu_flags)
>>  {
>> +	struct hmm_range *range = hmm_vma_walk->range;
>>  	unsigned long i = (addr - range->start) >> PAGE_SHIFT;
>>  
>> +	if (cpu_flags != HMM_PFN_ERROR)
>> +		if (hmm_want_migrate(range) &&
>> +		    (vma_is_anonymous(hmm_vma_walk->vma)))
>> +			cpu_flags |= (HMM_PFN_VALID | HMM_PFN_MIGRATE);
>> +
>>  	for (; addr < end; addr += PAGE_SIZE, i++) {
>>  		range->hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
>>  		range->hmm_pfns[i] |= cpu_flags;
>>  	}
>> +
>>  	return 0;
>>  }
>>  
>> @@ -171,11 +194,11 @@ static int hmm_vma_walk_hole(unsigned long addr, unsigned long end,
>>  	if (!walk->vma) {
>>  		if (required_fault)
>>  			return -EFAULT;
>> -		return hmm_pfns_fill(addr, end, range, HMM_PFN_ERROR);
>> +		return hmm_pfns_fill(addr, end, hmm_vma_walk, HMM_PFN_ERROR);
>>  	}
>>  	if (required_fault)
>>  		return hmm_vma_fault(addr, end, required_fault, walk);
>> -	return hmm_pfns_fill(addr, end, range, 0);
>> +	return hmm_pfns_fill(addr, end, hmm_vma_walk, 0);
>>  }
>>  
>>  static inline unsigned long hmm_pfn_flags_order(unsigned long order)
>> @@ -326,6 +349,257 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
>>  	return hmm_vma_fault(addr, end, required_fault, walk);
>>  }
>>  
>> +/*
>> + * Install migration entries if migration is requested, either from the
>> + * fault or the migrate path.
>> + */
>> +static void hmm_vma_handle_migrate_prepare(const struct mm_walk *walk,
>> +					   pmd_t *pmdp,
>> +					   unsigned long addr,
>> +					   unsigned long *hmm_pfn)
>> +{
>> +	struct hmm_vma_walk *hmm_vma_walk = walk->private;
>> +	struct hmm_range *range = hmm_vma_walk->range;
>> +	struct migrate_vma *migrate = range->migrate;
>> +	struct mm_struct *mm = walk->vma->vm_mm;
>> +	struct folio *fault_folio = NULL;
>> +	enum migrate_vma_info minfo;
>> +	struct dev_pagemap *pgmap;
>> +	bool anon_exclusive;
>> +	struct folio *folio;
>> +	unsigned long pfn;
>> +	struct page *page;
>> +	swp_entry_t entry;
>> +	pte_t pte, swp_pte;
>> +	spinlock_t *ptl;
>> +	bool writable = false;
>> +	pte_t *ptep;
>> +
>> +
>> +	// Do we want to migrate at all?
>> +	minfo = hmm_want_migrate(range);
>> +	if (!minfo)
>> +		return;
>> +
>> +	fault_folio = (migrate && migrate->fault_page) ?
>> +		page_folio(migrate->fault_page) : NULL;
>> +
>> +	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>> +	if (!ptep)
>> +		return;
>> +
>> +	pte = ptep_get(ptep);
>> +
>> +	if (pte_none(pte)) {
>> +		// migrate without faulting case
>> +		if (vma_is_anonymous(walk->vma))
>> +			*hmm_pfn = HMM_PFN_MIGRATE|HMM_PFN_VALID;
>> +		goto out;
>> +	}
>> +
>> +	if (!(*hmm_pfn & HMM_PFN_VALID))
>> +		goto out;
>> +
>> +	if (!pte_present(pte)) {
>> +		/*
>> +		 * Only care about unaddressable device page special
>> +		 * page table entry. Other special swap entries are not
>> +		 * migratable, and we ignore regular swapped page.
>> +		 */
>> +		entry = pte_to_swp_entry(pte);
>> +		if (!is_device_private_entry(entry))
>> +			goto out;
>> +
>> +		// We have already checked that we are the pgmap owner
>> +		if (!(minfo & MIGRATE_VMA_SELECT_DEVICE_PRIVATE))
>> +			goto out;
>> +
>> +		page = pfn_swap_entry_to_page(entry);
>> +		pfn = page_to_pfn(page);
>> +		if (is_writable_device_private_entry(entry))
>> +			writable = true;
>> +	} else {
>> +		pfn = pte_pfn(pte);
>> +		if (is_zero_pfn(pfn) &&
>> +		    (minfo & MIGRATE_VMA_SELECT_SYSTEM)) {
>> +			*hmm_pfn = HMM_PFN_MIGRATE|HMM_PFN_VALID;
>> +			goto out;
>> +		}
>> +		page = vm_normal_page(walk->vma, addr, pte);
>> +		if (page && !is_zone_device_page(page) &&
>> +		    !(minfo & MIGRATE_VMA_SELECT_SYSTEM)) {
>> +			goto out;
>> +		} else if (page && is_device_coherent_page(page)) {
>> +			pgmap = page_pgmap(page);
>> +
>> +			if (!(minfo &
>> +			      MIGRATE_VMA_SELECT_DEVICE_COHERENT) ||
>> +			    pgmap->owner != migrate->pgmap_owner)
>> +				goto out;
>> +		}
>> +		writable = pte_write(pte);
>> +	}
>> +
>> +	/* FIXME support THP */
>> +	if (!page || !page->mapping || PageTransCompound(page))
>> +		goto out;
>> +
>> +	/*
>> +	 * By getting a reference on the folio we pin it and that blocks
>> +	 * any kind of migration. Side effect is that it "freezes" the
>> +	 * pte.
>> +	 *
>> +	 * We drop this reference after isolating the folio from the lru
>> +	 * for non device folio (device folio are not on the lru and thus
>> +	 * can't be dropped from it).
>> +	 */
>> +	folio = page_folio(page);
>> +	folio_get(folio);
>> +
>> +	/*
>> +	 * We rely on folio_trylock() to avoid deadlock between
>> +	 * concurrent migrations where each is waiting on the others
>> +	 * folio lock. If we can't immediately lock the folio we fail this
>> +	 * migration as it is only best effort anyway.
>> +	 *
>> +	 * If we can lock the folio it's safe to set up a migration entry
>> +	 * now. In the common case where the folio is mapped once in a
>> +	 * single process setting up the migration entry now is an
>> +	 * optimisation to avoid walking the rmap later with
>> +	 * try_to_migrate().
>> +	 */
>> +
>> +	if (fault_folio == folio || folio_trylock(folio)) {
>> +		anon_exclusive = folio_test_anon(folio) &&
>> +			PageAnonExclusive(page);
>> +
>> +		flush_cache_page(walk->vma, addr, pfn);
>> +
>> +		if (anon_exclusive) {
>> +			pte = ptep_clear_flush(walk->vma, addr, ptep);
>> +
>> +			if (folio_try_share_anon_rmap_pte(folio, page)) {
>> +				set_pte_at(mm, addr, ptep, pte);
>> +				folio_unlock(folio);
>> +				folio_put(folio);
>> +				goto out;
>> +			}
>> +		} else {
>> +			pte = ptep_get_and_clear(mm, addr, ptep);
>> +		}
>> +
>> +		/* Setup special migration page table entry */
>> +		if (writable)
>> +			entry = make_writable_migration_entry(pfn);
>> +		else if (anon_exclusive)
>> +			entry = make_readable_exclusive_migration_entry(pfn);
>> +		else
>> +			entry = make_readable_migration_entry(pfn);
>> +
>> +		swp_pte = swp_entry_to_pte(entry);
>> +		if (pte_present(pte)) {
>> +			if (pte_soft_dirty(pte))
>> +				swp_pte = pte_swp_mksoft_dirty(swp_pte);
>> +			if (pte_uffd_wp(pte))
>> +				swp_pte = pte_swp_mkuffd_wp(swp_pte);
>> +		} else {
>> +			if (pte_swp_soft_dirty(pte))
>> +				swp_pte = pte_swp_mksoft_dirty(swp_pte);
>> +			if (pte_swp_uffd_wp(pte))
>> +				swp_pte = pte_swp_mkuffd_wp(swp_pte);
>> +		}
>> +
>> +		set_pte_at(mm, addr, ptep, swp_pte);
>> +		folio_remove_rmap_pte(folio, page, walk->vma);
>> +		folio_put(folio);
>> +		*hmm_pfn |= HMM_PFN_MIGRATE;
>> +
>> +		if (pte_present(pte))
>> +			flush_tlb_range(walk->vma, addr, addr + PAGE_SIZE);
>> +	} else
>> +		folio_put(folio);
>> +out:
>> +	pte_unmap_unlock(ptep, ptl);
>> +
>> +}
>> +
>> +static int hmm_vma_walk_split(pmd_t *pmdp,
>> +			      unsigned long addr,
>> +			      struct mm_walk *walk)
>> +{
>> +	struct hmm_vma_walk *hmm_vma_walk = walk->private;
>> +	struct hmm_range *range = hmm_vma_walk->range;
>> +	struct migrate_vma *migrate = range->migrate;
>> +	struct folio *folio, *fault_folio;
>> +	spinlock_t *ptl;
>> +	int ret = 0;
>> +
>> +	fault_folio = (migrate && migrate->fault_page) ?
>> +		page_folio(migrate->fault_page) : NULL;
>> +
>> +	ptl = pmd_lock(walk->mm, pmdp);
>> +	if (unlikely(!pmd_trans_huge(*pmdp))) {
>> +		spin_unlock(ptl);
>> +		goto out;
>> +	}
>> +
>> +	folio = pmd_folio(*pmdp);
>> +	if (is_huge_zero_folio(folio)) {
>> +		spin_unlock(ptl);
>> +		split_huge_pmd(walk->vma, pmdp, addr);
>> +	} else {
>> +		folio_get(folio);
>> +		spin_unlock(ptl);
>> +		/* FIXME: we don't expect THP for fault_folio */
>> +		if (WARN_ON_ONCE(fault_folio == folio)) {
>> +			folio_put(folio);
>> +			ret = -EBUSY;
>> +			goto out;
>> +		}
>> +		if (unlikely(!folio_trylock(folio))) {
>> +			folio_put(folio);
>> +			ret = -EBUSY;
>> +			goto out;
>> +		}
>> +		ret = split_folio(folio);
>> +		folio_unlock(folio);
>> +		folio_put(folio);
>> +	}
>> +out:
>> +	return ret;
>> +}
>> +
>> +static int hmm_vma_capture_migrate_range(unsigned long start,
>> +					 unsigned long end,
>> +					 struct mm_walk *walk)
>> +{
>> +	struct hmm_vma_walk *hmm_vma_walk = walk->private;
>> +	struct hmm_range *range = hmm_vma_walk->range;
>> +
>> +	if (!hmm_want_migrate(range))
>> +		return 0;
>> +
>> +	if (hmm_vma_walk->vma && (hmm_vma_walk->vma != walk->vma))
>> +		return -ERANGE;
>> +
>> +	hmm_vma_walk->vma = walk->vma;
>> +	hmm_vma_walk->start = start;
>> +	hmm_vma_walk->end = end;
>> +
>> +	if (end - start > range->end - range->start)
>> +		return -ERANGE;
>> +
>> +	if (!hmm_vma_walk->mmu_range.owner) {
>> +		mmu_notifier_range_init_owner(&hmm_vma_walk->mmu_range, MMU_NOTIFY_MIGRATE, 0,
>> +					      walk->vma->vm_mm, start, end,
>> +					      range->dev_private_owner);
>> +		mmu_notifier_invalidate_range_start(&hmm_vma_walk->mmu_range);
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>>  static int hmm_vma_walk_pmd(pmd_t *pmdp,
>>  			    unsigned long start,
>>  			    unsigned long end,
>> @@ -351,13 +625,28 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
>>  			pmd_migration_entry_wait(walk->mm, pmdp);
>>  			return -EBUSY;
>>  		}
>> -		return hmm_pfns_fill(start, end, range, 0);
>> +		return hmm_pfns_fill(start, end, hmm_vma_walk, 0);
>>  	}
>>  
>>  	if (!pmd_present(pmd)) {
>>  		if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0))
>>  			return -EFAULT;
>> -		return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
>> +		return hmm_pfns_fill(start, end, hmm_vma_walk, HMM_PFN_ERROR);
>> +	}
>> +
>> +	if (hmm_want_migrate(range) &&
>> +	    pmd_trans_huge(pmd)) {
>> +		int r;
>> +
>> +		r = hmm_vma_walk_split(pmdp, addr, walk);
>> +		if (r) {
>> +			/* Split not successful, skip */
>> +			return hmm_pfns_fill(start, end, hmm_vma_walk, HMM_PFN_ERROR);
>> +		}
>> +
>> +		/* Split successful or "again", reloop */
>> +		hmm_vma_walk->last = addr;
>> +		return -EBUSY;
>>  	}
>>  
>>  	if (pmd_trans_huge(pmd)) {
>> @@ -386,7 +675,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
>>  	if (pmd_bad(pmd)) {
>>  		if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0))
>>  			return -EFAULT;
>> -		return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
>> +		return hmm_pfns_fill(start, end, hmm_vma_walk, HMM_PFN_ERROR);
>>  	}
>>  
>>  	ptep = pte_offset_map(pmdp, addr);
>> @@ -400,8 +689,11 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
>>  			/* hmm_vma_handle_pte() did pte_unmap() */
>>  			return r;
>>  		}
>> +
>> +		hmm_vma_handle_migrate_prepare(walk, pmdp, addr, hmm_pfns);
>>  	}
>>  	pte_unmap(ptep - 1);
>> +
>>  	return 0;
>>  }
>>  
>> @@ -535,6 +827,11 @@ static int hmm_vma_walk_test(unsigned long start, unsigned long end,
>>  	struct hmm_vma_walk *hmm_vma_walk = walk->private;
>>  	struct hmm_range *range = hmm_vma_walk->range;
>>  	struct vm_area_struct *vma = walk->vma;
>> +	int r;
>> +
>> +	r = hmm_vma_capture_migrate_range(start, end, walk);
>> +	if (r)
>> +		return r;
>>  
>>  	if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)) &&
>>  	    vma->vm_flags & VM_READ)
>> @@ -557,7 +854,7 @@ static int hmm_vma_walk_test(unsigned long start, unsigned long end,
>>  				 (end - start) >> PAGE_SHIFT, 0))
>>  		return -EFAULT;
>>  
>> -	hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
>> +	hmm_pfns_fill(start, end, hmm_vma_walk, HMM_PFN_ERROR);
>>  
>>  	/* Skip this vma and continue processing the next vma. */
>>  	return 1;
>> @@ -587,9 +884,17 @@ static const struct mm_walk_ops hmm_walk_ops = {
>>   *		the invalidation to finish.
>>   * -EFAULT:     A page was requested to be valid and could not be made valid
>>   *              ie it has no backing VMA or it is illegal to access
>> + * -ERANGE:     The range crosses multiple VMAs, or the hmm_pfns array
>> + *              is too small to cover it.
>>   *
>>   * This is similar to get_user_pages(), except that it can read the page tables
>>   * without mutating them (ie causing faults).
>> + *
>> + * If you want to migrate after faulting pages in, call hmm_range_fault()
>> + * with HMM_PFN_REQ_MIGRATE set and initialize the range.migrate field.
>> + * After hmm_range_fault() returns, call migrate_hmm_range_setup() instead
>> + * of migrate_vma_setup() and then follow the normal migrate call path.
>> + *
>>   */
>>  int hmm_range_fault(struct hmm_range *range)
>>  {
>> @@ -597,16 +902,28 @@ int hmm_range_fault(struct hmm_range *range)
>>  		.range = range,
>>  		.last = range->start,
>>  	};
>> -	struct mm_struct *mm = range->notifier->mm;
>> +	bool is_fault_path = !!range->notifier;
>> +	struct mm_struct *mm;
>>  	int ret;
>>  
>> +	/*
>> +	 * We could be serving a device fault or coming from the migrate
>> +	 * entry point. For the former we have not resolved the vma yet;
>> +	 * for the latter we don't have a notifier (but do have a vma).
>> +	 */
>> +	mm = is_fault_path ? range->notifier->mm : range->migrate->vma->vm_mm;
>>  	mmap_assert_locked(mm);
>>  
>>  	do {
>>  		/* If range is no longer valid force retry. */
>> -		if (mmu_interval_check_retry(range->notifier,
>> -					     range->notifier_seq))
>> -			return -EBUSY;
>> +		if (is_fault_path && mmu_interval_check_retry(range->notifier,
>> +					     range->notifier_seq)) {
>> +			ret = -EBUSY;
>> +			break;
>> +		}
>> +
>>  		ret = walk_page_range(mm, hmm_vma_walk.last, range->end,
>>  				      &hmm_walk_ops, &hmm_vma_walk);
>>  		/*
>> @@ -616,6 +933,18 @@ int hmm_range_fault(struct hmm_range *range)
>>  		 * output, and all >= are still at their input values.
>>  		 */
>>  	} while (ret == -EBUSY);
>> +
>> +	if (hmm_want_migrate(range) && range->migrate &&
>> +	    hmm_vma_walk.mmu_range.owner) {
>> +		// The migrate_vma path has the following initialized
>> +		if (is_fault_path) {
>> +			range->migrate->vma   = hmm_vma_walk.vma;
>> +			range->migrate->start = range->start;
>> +			range->migrate->end   = hmm_vma_walk.end;
>> +		}
>> +		mmu_notifier_invalidate_range_end(&hmm_vma_walk.mmu_range);
>> +	}
>> +
>>  	return ret;
>>  }
>>  EXPORT_SYMBOL(hmm_range_fault);
>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>> index e05e14d6eacd..87ddc0353165 100644
>> --- a/mm/migrate_device.c
>> +++ b/mm/migrate_device.c
>> @@ -535,7 +535,18 @@ static void migrate_vma_unmap(struct migrate_vma *migrate)
>>   */
>>  int migrate_vma_setup(struct migrate_vma *args)
>>  {
>> +	int ret;
>>  	long nr_pages = (args->end - args->start) >> PAGE_SHIFT;
>> +	struct hmm_range range = {
>> +		.notifier = NULL,
>> +		.start = args->start,
>> +		.end = args->end,
>> +		.migrate = args,
>> +		.hmm_pfns = args->src,
>> +		.default_flags = HMM_PFN_REQ_MIGRATE,
>> +		.dev_private_owner = args->pgmap_owner,
>> +	};
>>  
>>  	args->start &= PAGE_MASK;
>>  	args->end &= PAGE_MASK;
>> @@ -560,17 +571,19 @@ int migrate_vma_setup(struct migrate_vma *args)
>>  	args->cpages = 0;
>>  	args->npages = 0;
>>  
>> -	migrate_vma_collect(args);
>> +	if (args->flags & MIGRATE_VMA_FAULT)
>> +		range.default_flags |= HMM_PFN_REQ_FAULT;
>>  
>> -	if (args->cpages)
>> -		migrate_vma_unmap(args);
>> +	ret = hmm_range_fault(&range);
>> +
>> +	migrate_hmm_range_setup(&range);
>>  
>>  	/*
>>  	 * At this point pages are locked and unmapped, and thus they have
>>  	 * stable content and can safely be copied to destination memory that
>>  	 * is allocated by the drivers.
>>  	 */
>> -	return 0;
>> +	return ret;
>>  
>>  }
>>  EXPORT_SYMBOL(migrate_vma_setup);
>> @@ -1014,3 +1027,54 @@ int migrate_device_coherent_folio(struct folio *folio)
>>  		return 0;
>>  	return -EBUSY;
>>  }
>> +
>> +void migrate_hmm_range_setup(struct hmm_range *range)
>> +{
>> +
>> +	struct migrate_vma *migrate = range->migrate;
>> +
>> +	if (!migrate)
>> +		return;
>> +
>> +	migrate->npages = (migrate->end - migrate->start) >> PAGE_SHIFT;
>> +	migrate->cpages = 0;
>> +
>> +	for (unsigned long i = 0; i < migrate->npages; i++) {
>> +
>> +		unsigned long pfn = range->hmm_pfns[i];
>> +
>> +		/*
>> +		 * Don't do migration if valid and migrate flags are not both set.
>> +		 */
>> +		if ((pfn & (HMM_PFN_VALID | HMM_PFN_MIGRATE)) !=
>> +		    (HMM_PFN_VALID | HMM_PFN_MIGRATE)) {
>> +			migrate->src[i] = 0;
>> +			migrate->dst[i] = 0;
>> +			continue;
>> +		}
>> +
>> +		migrate->cpages++;
>> +
>> +		/*
>> +		 * The zero page is encoded in a special way: valid and migrate
>> +		 * are set and the pfn part is zero. Encode it specially for
>> +		 * migrate as well.
>> +		 */
>> +		if (pfn == (HMM_PFN_VALID|HMM_PFN_MIGRATE)) {
>> +			migrate->src[i] = MIGRATE_PFN_MIGRATE;
>> +			continue;
>> +		}
>> +
>> +		migrate->src[i] = migrate_pfn(page_to_pfn(hmm_pfn_to_page(pfn)))
>> +			| MIGRATE_PFN_MIGRATE;
>> +		migrate->src[i] |= (pfn & HMM_PFN_WRITE) ? MIGRATE_PFN_WRITE : 0;
>> +	}
>> +
>> +	if (migrate->cpages)
>> +		migrate_vma_unmap(migrate);
>> +
>> +}
>> +EXPORT_SYMBOL(migrate_hmm_range_setup);
>
> I've not had a chance to test the code; do you have any numbers with the changes
> to show the advantages of doing both fault and migrate together?

Not yet, but I plan to have some numbers later.

>
> Balbir
>
Thanks,
--Mika


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/4] mm: unified fault and migrate device page paths
  2025-08-21  5:10     ` Mika Penttilä
@ 2025-08-22  5:02       ` Alistair Popple
  0 siblings, 0 replies; 22+ messages in thread
From: Alistair Popple @ 2025-08-22  5:02 UTC (permalink / raw)
  To: Mika Penttilä
  Cc: Balbir Singh, linux-mm, linux-kernel, David Hildenbrand,
	Jason Gunthorpe, Leon Romanovsky

On Thu, Aug 21, 2025 at 08:10:31AM +0300, Mika Penttilä wrote:
> 
> On 8/21/25 07:30, Balbir Singh wrote:
> 
> > On 8/14/25 17:19, Mika Penttilä wrote:

[...]

> >>  EXPORT_SYMBOL(hmm_range_fault);
> >> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> >> index e05e14d6eacd..87ddc0353165 100644
> >> --- a/mm/migrate_device.c
> >> +++ b/mm/migrate_device.c
> >> @@ -535,7 +535,18 @@ static void migrate_vma_unmap(struct migrate_vma *migrate)
> >>   */
> >>  int migrate_vma_setup(struct migrate_vma *args)
> >>  {
> >> +	int ret;
> >>  	long nr_pages = (args->end - args->start) >> PAGE_SHIFT;
> >> +	struct hmm_range range = {
> >> +		.notifier = NULL,
> >> +		.start = args->start,
> >> +		.end = args->end,
> >> +		.migrate = args,
> >> +		.hmm_pfns = args->src,
> >> +		.default_flags = HMM_PFN_REQ_MIGRATE,
> >> +		.dev_private_owner = args->pgmap_owner,
> >> +	};
> >>  
> >>  	args->start &= PAGE_MASK;
> >>  	args->end &= PAGE_MASK;
> >> @@ -560,17 +571,19 @@ int migrate_vma_setup(struct migrate_vma *args)
> >>  	args->cpages = 0;
> >>  	args->npages = 0;
> >>  
> >> -	migrate_vma_collect(args);
> >> +	if (args->flags & MIGRATE_VMA_FAULT)
> >> +		range.default_flags |= HMM_PFN_REQ_FAULT;
> >>  
> >> -	if (args->cpages)
> >> -		migrate_vma_unmap(args);
> >> +	ret = hmm_range_fault(&range);
> >> +
> >> +	migrate_hmm_range_setup(&range);
> >>  
> >>  	/*
> >>  	 * At this point pages are locked and unmapped, and thus they have
> >>  	 * stable content and can safely be copied to destination memory that
> >>  	 * is allocated by the drivers.
> >>  	 */
> >> -	return 0;
> >> +	return ret;
> >>  
> >>  }
> >>  EXPORT_SYMBOL(migrate_vma_setup);
> >> @@ -1014,3 +1027,54 @@ int migrate_device_coherent_folio(struct folio *folio)
> >>  		return 0;
> >>  	return -EBUSY;
> >>  }
> >> +
> >> +void migrate_hmm_range_setup(struct hmm_range *range)
> >> +{
> >> +
> >> +	struct migrate_vma *migrate = range->migrate;
> >> +
> >> +	if (!migrate)
> >> +		return;
> >> +
> >> +	migrate->npages = (migrate->end - migrate->start) >> PAGE_SHIFT;
> >> +	migrate->cpages = 0;
> >> +
> >> +	for (unsigned long i = 0; i < migrate->npages; i++) {
> >> +
> >> +		unsigned long pfn = range->hmm_pfns[i];
> >> +
> >> +		/*
> >> +		 * Don't do migration if valid and migrate flags are not both set.
> >> +		 */
> >> +		if ((pfn & (HMM_PFN_VALID | HMM_PFN_MIGRATE)) !=
> >> +		    (HMM_PFN_VALID | HMM_PFN_MIGRATE)) {
> >> +			migrate->src[i] = 0;
> >> +			migrate->dst[i] = 0;
> >> +			continue;
> >> +		}
> >> +
> >> +		migrate->cpages++;
> >> +
> >> +		/*
> >> +		 * The zero page is encoded in a special way: valid and migrate
> >> +		 * are set and the pfn part is zero. Encode it specially for
> >> +		 * migrate as well.
> >> +		 */
> >> +		if (pfn == (HMM_PFN_VALID|HMM_PFN_MIGRATE)) {
> >> +			migrate->src[i] = MIGRATE_PFN_MIGRATE;
> >> +			continue;
> >> +		}
> >> +
> >> +		migrate->src[i] = migrate_pfn(page_to_pfn(hmm_pfn_to_page(pfn)))
> >> +			| MIGRATE_PFN_MIGRATE;
> >> +		migrate->src[i] |= (pfn & HMM_PFN_WRITE) ? MIGRATE_PFN_WRITE : 0;
> >> +	}
> >> +
> >> +	if (migrate->cpages)
> >> +		migrate_vma_unmap(migrate);
> >> +
> >> +}
> >> +EXPORT_SYMBOL(migrate_hmm_range_setup);
> >
> > I've not had a chance to test the code; do you have any numbers with the changes
> > to show the advantages of doing both fault and migrate together?
> 
> Not yet, but I plan to have some numbers later.

I don't have any recent numbers, but I profiled this a while ago, and having to
walk the page tables multiple times was a significant overhead, at least in our
driver.
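
For the migrate_vma_setup() entry point the nice part is that the existing call
sequence doesn't change; with the new MIGRATE_VMA_FAULT flag the faulting just
happens in the same walk. Roughly, with placeholder names and the dst
allocation/copy elided:

	int ret;
	struct migrate_vma args = {
		.vma = vma,
		.start = start,
		.end = end,
		.src = src_pfns,
		.dst = dst_pfns,
		.pgmap_owner = drvdata,
		.flags = MIGRATE_VMA_SELECT_SYSTEM | MIGRATE_VMA_FAULT,
	};

	/* Single walk: collect, fault in what is missing, unmap and lock. */
	ret = migrate_vma_setup(&args);
	if (ret)
		return ret;

	if (args.cpages) {
		/* Allocate device pages into dst_pfns[], copy, then: */
		migrate_vma_pages(&args);
		migrate_vma_finalize(&args);
	}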

> >
> > Balbir
> >
> Thanks,
> --Mika
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread

Thread overview: 22+ messages
2025-08-14  7:19 [RFC PATCH 0/4] Migrate on fault for device pages Mika Penttilä
2025-08-14  7:19 ` [RFC PATCH 1/4] mm: use current as mmu notifier's owner Mika Penttilä
2025-08-14 12:40   ` Jason Gunthorpe
2025-08-14 12:53     ` Mika Penttilä
2025-08-14 13:04       ` Jason Gunthorpe
2025-08-14 13:20         ` Mika Penttilä
2025-08-14 14:11           ` Jason Gunthorpe
2025-08-14 17:00             ` Mika Penttilä
2025-08-14 17:20               ` Jason Gunthorpe
2025-08-14 17:45                 ` Mika Penttilä
2025-08-15  5:23                   ` Alistair Popple
2025-08-15  7:11                     ` Mika Penttilä
2025-08-19  4:27                       ` Balbir Singh
2025-08-19  4:33                         ` Mika Penttilä
2025-08-14  7:19 ` [RFC PATCH 2/4] mm: unified fault and migrate device page paths Mika Penttilä
2025-08-21  4:30   ` Balbir Singh
2025-08-21  5:10     ` Mika Penttilä
2025-08-22  5:02       ` Alistair Popple
2025-08-14  7:19 ` [RFC PATCH 3/4] mm:/migrate_device.c: remove migrate_vma_collect_*() functions Mika Penttilä
2025-08-14  7:19 ` [RFC PATCH 4/4] mm: add new testcase for the migrate on fault case Mika Penttilä
2025-08-15 11:36 ` [RFC PATCH 0/4] Migrate on fault for device pages Balbir Singh
2025-08-15 11:44   ` Mika Penttilä
