linux-mm.kvack.org archive mirror
* [RFC 00/11] THP support for zone device pages
@ 2025-03-06  4:42 Balbir Singh
  2025-03-06  4:42 ` [RFC 01/11] mm/zone_device: support large zone device private folios Balbir Singh
                   ` (11 more replies)
  0 siblings, 12 replies; 38+ messages in thread
From: Balbir Singh @ 2025-03-06  4:42 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: dri-devel, nouveau, Balbir Singh, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, David Hildenbrand,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom

This patch series adds support for THP migration of zone device pages.
To do so, the patches add folio support for zone device pages, allowing
larger-order pages to be set up.

These patches build on the earlier posts by Ralph Campbell [1].

Two new flags are added to the migrate_vma interface to select and mark
compound pages. migrate_vma_setup(), migrate_vma_pages() and
migrate_vma_finalize() support migration of these pages when
MIGRATE_VMA_SELECT_COMPOUND is passed as an argument.
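
As a rough usage sketch (not taken from the patches; the on-stack array
sizes, the "drvdata" owner and the error handling are placeholders), a
driver that wants a PMD-sized range selected as a compound page might do:

  unsigned long src_pfns[HPAGE_PMD_NR] = { 0 };
  unsigned long dst_pfns[HPAGE_PMD_NR] = { 0 };
  struct migrate_vma args = {
          .vma         = vma,
          .start       = start,              /* HPAGE_PMD_SIZE aligned */
          .end         = start + HPAGE_PMD_SIZE,
          .src         = src_pfns,
          .dst         = dst_pfns,
          .pgmap_owner = drvdata,
          .flags       = MIGRATE_VMA_SELECT_SYSTEM |
                         MIGRATE_VMA_SELECT_COMPOUND,
  };

  if (migrate_vma_setup(&args))
          return -EAGAIN;
  /*
   * Allocate device memory and fill dst_pfns[] (setting
   * MIGRATE_PFN_COMPOUND on the first entry if a large destination
   * page was allocated), then:
   */
  migrate_vma_pages(&args);
  migrate_vma_finalize(&args);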

The series also adds zone device awareness to (m)THP handling, along
with fault handling for large zone device private pages. The page vma
walk and rmap code are made zone device aware as well. Support has also
been added for folios that need to be split in the middle of migration
(when the src and dst do not agree on MIGRATE_PFN_COMPOUND); this
occurs when the source side of the migration can migrate large pages,
but the destination has not been able to allocate large pages. The code
supports and uses folio_split() when migrating THP pages; this is used
when MIGRATE_VMA_SELECT_COMPOUND is not passed as an argument to
migrate_vma_setup().

The test infrastructure in lib/test_hmm.c has been enhanced to support
THP migration. A new ioctl to emulate failure of large page allocations
has been added to exercise the folio split code path. hmm-tests.c has
new test cases for huge page migration and for the folio split path.

The nouveau dmem code has been enhanced to use the new THP migration
capability.

mTHP support:

The patches hard-code HPAGE_PMD_NR in a few places, but the code has
been kept generic to support various orders. With additional
refactoring, support for different order sizes should be possible.

References:
[1] https://lore.kernel.org/linux-mm/20201106005147.20113-1-rcampbell@nvidia.com/

These patches are built on top of mm-everything-2025-03-04-05-51

Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Donet Tom <donettom@linux.ibm.com>

Balbir Singh (11):
  mm/zone_device: support large zone device private folios
  mm/migrate_device: flags for selecting device private THP pages
  mm/thp: zone_device awareness in THP handling code
  mm/migrate_device: THP migration of zone device pages
  mm/memory/fault: Add support for zone device THP fault handling
  lib/test_hmm: test cases and support for zone device private THP
  mm/memremap: Add folio_split support
  mm/thp: add split during migration support
  lib/test_hmm: add test case for split pages
  selftests/mm/hmm-tests: new tests for zone device THP migration
  gpu/drm/nouveau: Add THP migration support

 drivers/gpu/drm/nouveau/nouveau_dmem.c | 244 +++++++++----
 drivers/gpu/drm/nouveau/nouveau_svm.c  |   6 +-
 drivers/gpu/drm/nouveau/nouveau_svm.h  |   3 +-
 include/linux/huge_mm.h                |  18 +-
 include/linux/memremap.h               |  29 +-
 include/linux/migrate.h                |   2 +
 include/linux/mm.h                     |   1 +
 lib/test_hmm.c                         | 387 ++++++++++++++++----
 lib/test_hmm_uapi.h                    |   3 +
 mm/huge_memory.c                       | 242 +++++++++---
 mm/memory.c                            |   6 +-
 mm/memremap.c                          |  50 ++-
 mm/migrate.c                           |   2 +
 mm/migrate_device.c                    | 488 +++++++++++++++++++++----
 mm/page_alloc.c                        |   1 +
 mm/page_vma_mapped.c                   |  10 +
 mm/rmap.c                              |  19 +-
 tools/testing/selftests/mm/hmm-tests.c | 407 +++++++++++++++++++++
 18 files changed, 1630 insertions(+), 288 deletions(-)

-- 
2.48.1




* [RFC 01/11] mm/zone_device: support large zone device private folios
  2025-03-06  4:42 [RFC 00/11] THP support for zone device pages Balbir Singh
@ 2025-03-06  4:42 ` Balbir Singh
  2025-03-06 23:02   ` Alistair Popple
  2025-07-08 13:37   ` David Hildenbrand
  2025-03-06  4:42 ` [RFC 02/11] mm/migrate_device: flags for selecting device private THP pages Balbir Singh
                   ` (10 subsequent siblings)
  11 siblings, 2 replies; 38+ messages in thread
From: Balbir Singh @ 2025-03-06  4:42 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: dri-devel, nouveau, Balbir Singh, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, David Hildenbrand,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom

Add routines to support allocation of large-order zone device folios,
along with helper functions for zone device folios to check whether a
folio is device private and to get and set zone device data.

When large folios are used, the existing page_free() callback in pgmap
is called when the folio is freed; this is true for both PAGE_SIZE and
higher-order pages.
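
As a hedged illustration (not taken from a driver in this series;
"dpage" and "drv_data" are placeholders), a driver that owns a free
HPAGE_PMD_ORDER-aligned range of device private pages could set it up
as a large folio with:

  struct folio *folio = page_folio(dpage);

  init_zone_device_folio(folio, HPAGE_PMD_ORDER);
  folio_set_zone_device_data(folio, drv_data);

The existing zone_device_page_init() is kept as an order-0 wrapper
around init_zone_device_folio().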

Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
 include/linux/memremap.h | 22 +++++++++++++++++-
 mm/memremap.c            | 50 +++++++++++++++++++++++++++++-----------
 2 files changed, 58 insertions(+), 14 deletions(-)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 4aa151914eab..11d586dd8ef1 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -169,6 +169,18 @@ static inline bool folio_is_device_private(const struct folio *folio)
 	return is_device_private_page(&folio->page);
 }
 
+static inline void *folio_zone_device_data(const struct folio *folio)
+{
+	VM_BUG_ON_FOLIO(!folio_is_device_private(folio), folio);
+	return folio->page.zone_device_data;
+}
+
+static inline void folio_set_zone_device_data(struct folio *folio, void *data)
+{
+	VM_BUG_ON_FOLIO(!folio_is_device_private(folio), folio);
+	folio->page.zone_device_data = data;
+}
+
 static inline bool is_pci_p2pdma_page(const struct page *page)
 {
 	return IS_ENABLED(CONFIG_PCI_P2PDMA) &&
@@ -199,7 +211,7 @@ static inline bool folio_is_fsdax(const struct folio *folio)
 }
 
 #ifdef CONFIG_ZONE_DEVICE
-void zone_device_page_init(struct page *page);
+void init_zone_device_folio(struct folio *folio, unsigned int order);
 void *memremap_pages(struct dev_pagemap *pgmap, int nid);
 void memunmap_pages(struct dev_pagemap *pgmap);
 void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
@@ -209,6 +221,14 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);
 
 unsigned long memremap_compat_align(void);
+
+static inline void zone_device_page_init(struct page *page)
+{
+	struct folio *folio = page_folio(page);
+
+	init_zone_device_folio(folio, 0);
+}
+
 #else
 static inline void *devm_memremap_pages(struct device *dev,
 		struct dev_pagemap *pgmap)
diff --git a/mm/memremap.c b/mm/memremap.c
index 2aebc1b192da..7d98d0a4c0cd 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -459,20 +459,21 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
 void free_zone_device_folio(struct folio *folio)
 {
 	struct dev_pagemap *pgmap = folio->pgmap;
+	unsigned int nr = folio_nr_pages(folio);
+	int i;
+	bool anon = folio_test_anon(folio);
+	struct page *page = folio_page(folio, 0);
 
 	if (WARN_ON_ONCE(!pgmap))
 		return;
 
 	mem_cgroup_uncharge(folio);
 
-	/*
-	 * Note: we don't expect anonymous compound pages yet. Once supported
-	 * and we could PTE-map them similar to THP, we'd have to clear
-	 * PG_anon_exclusive on all tail pages.
-	 */
-	if (folio_test_anon(folio)) {
-		VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
-		__ClearPageAnonExclusive(folio_page(folio, 0));
+	WARN_ON_ONCE(folio_test_large(folio) && !anon);
+
+	for (i = 0; i < nr; i++) {
+		if (anon)
+			__ClearPageAnonExclusive(folio_page(folio, i));
 	}
 
 	/*
@@ -496,10 +497,19 @@ void free_zone_device_folio(struct folio *folio)
 
 	switch (pgmap->type) {
 	case MEMORY_DEVICE_PRIVATE:
+		if (folio_test_large(folio)) {
+			folio_unqueue_deferred_split(folio);
+
+			percpu_ref_put_many(&folio->pgmap->ref, nr - 1);
+		}
+		pgmap->ops->page_free(page);
+		put_dev_pagemap(pgmap);
+		page->mapping = NULL;
+		break;
 	case MEMORY_DEVICE_COHERENT:
 		if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->page_free))
 			break;
-		pgmap->ops->page_free(folio_page(folio, 0));
+		pgmap->ops->page_free(page);
 		put_dev_pagemap(pgmap);
 		break;
 
@@ -523,14 +533,28 @@ void free_zone_device_folio(struct folio *folio)
 	}
 }
 
-void zone_device_page_init(struct page *page)
+void init_zone_device_folio(struct folio *folio, unsigned int order)
 {
+	struct page *page = folio_page(folio, 0);
+
+	VM_BUG_ON(order > MAX_ORDER_NR_PAGES);
+
+	WARN_ON_ONCE(order && order != HPAGE_PMD_ORDER);
+
 	/*
 	 * Drivers shouldn't be allocating pages after calling
 	 * memunmap_pages().
 	 */
-	WARN_ON_ONCE(!percpu_ref_tryget_live(&page_pgmap(page)->ref));
-	set_page_count(page, 1);
+	WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << order));
+	folio_set_count(folio, 1);
 	lock_page(page);
+
+	/*
+	 * Only PMD level migration is supported for THP migration
+	 */
+	if (order > 1) {
+		prep_compound_page(page, order);
+		folio_set_large_rmappable(folio);
+	}
 }
-EXPORT_SYMBOL_GPL(zone_device_page_init);
+EXPORT_SYMBOL_GPL(init_zone_device_folio);
-- 
2.48.1




* [RFC 02/11] mm/migrate_device: flags for selecting device private THP pages
  2025-03-06  4:42 [RFC 00/11] THP support for zone device pages Balbir Singh
  2025-03-06  4:42 ` [RFC 01/11] mm/zone_device: support large zone device private folios Balbir Singh
@ 2025-03-06  4:42 ` Balbir Singh
  2025-07-08 13:41   ` David Hildenbrand
  2025-03-06  4:42 ` [RFC 03/11] mm/thp: zone_device awareness in THP handling code Balbir Singh
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 38+ messages in thread
From: Balbir Singh @ 2025-03-06  4:42 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: dri-devel, nouveau, Balbir Singh, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, David Hildenbrand,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom

Add flags for marking zone device pages for migration.

MIGRATE_VMA_SELECT_COMPOUND will be used to select THP pages during
migrate_vma_setup(), and MIGRATE_PFN_COMPOUND will mark device pages
being migrated as compound pages during device pfn migration.
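
As a small illustrative sketch (the names are placeholders), a driver
that allocated a large destination page "dpage" would encode the first
dst entry as:

  dst[0] = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_WRITE |
           MIGRATE_PFN_COMPOUND;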

Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
 include/linux/migrate.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 61899ec7a9a3..b5e4f51e64c7 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -167,6 +167,7 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
 #define MIGRATE_PFN_VALID	(1UL << 0)
 #define MIGRATE_PFN_MIGRATE	(1UL << 1)
 #define MIGRATE_PFN_WRITE	(1UL << 3)
+#define MIGRATE_PFN_COMPOUND	(1UL << 4)
 #define MIGRATE_PFN_SHIFT	6
 
 static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
@@ -185,6 +186,7 @@ enum migrate_vma_direction {
 	MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
 	MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
 	MIGRATE_VMA_SELECT_DEVICE_COHERENT = 1 << 2,
+	MIGRATE_VMA_SELECT_COMPOUND = 1 << 3,
 };
 
 struct migrate_vma {
-- 
2.48.1




* [RFC 03/11] mm/thp: zone_device awareness in THP handling code
  2025-03-06  4:42 [RFC 00/11] THP support for zone device pages Balbir Singh
  2025-03-06  4:42 ` [RFC 01/11] mm/zone_device: support large zone device private folios Balbir Singh
  2025-03-06  4:42 ` [RFC 02/11] mm/migrate_device: flags for selecting device private THP pages Balbir Singh
@ 2025-03-06  4:42 ` Balbir Singh
  2025-07-08 14:10   ` David Hildenbrand
  2025-03-06  4:42 ` [RFC 04/11] mm/migrate_device: THP migration of zone device pages Balbir Singh
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 38+ messages in thread
From: Balbir Singh @ 2025-03-06  4:42 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: dri-devel, nouveau, Balbir Singh, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, David Hildenbrand,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom

Make the THP handling code in the mm subsystem aware of zone device
pages. Although the code is designed to be generic when it comes to
splitting pages, it currently works for THP sizes corresponding to
HPAGE_PMD_NR.

Modify page_vma_mapped_walk() to return true when a zone device huge
entry is present, enabling try_to_migrate() and other migration code
paths to process the entry appropriately.

pmd_pfn() does not work well with zone device entries; derive the pfn
from the swap entry for checking and comparison of zone device entries
instead.
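
A minimal sketch of the pattern used instead (as in the rmap.c hunk
below):

  swp_entry_t entry = pmd_to_swp_entry(*pmd);
  unsigned long pfn = swp_offset_pfn(entry);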

try_to_map_unused_to_zeropage() does not apply to zone device entries;
such entries are ignored in that path.

Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
 mm/huge_memory.c     | 151 +++++++++++++++++++++++++++++++------------
 mm/migrate.c         |   2 +
 mm/page_vma_mapped.c |  10 +++
 mm/rmap.c            |  19 +++++-
 4 files changed, 138 insertions(+), 44 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 826bfe907017..d8e018d1bdbd 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2247,10 +2247,17 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		} else if (thp_migration_supported()) {
 			swp_entry_t entry;
 
-			VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
 			entry = pmd_to_swp_entry(orig_pmd);
 			folio = pfn_swap_entry_folio(entry);
 			flush_needed = 0;
+
+			VM_BUG_ON(!is_pmd_migration_entry(*pmd) &&
+					!folio_is_device_private(folio));
+
+			if (folio_is_device_private(folio)) {
+				folio_remove_rmap_pmd(folio, folio_page(folio, 0), vma);
+				WARN_ON_ONCE(folio_mapcount(folio) < 0);
+			}
 		} else
 			WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
 
@@ -2264,6 +2271,15 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 				       -HPAGE_PMD_NR);
 		}
 
+		/*
+		 * Do a folio put on zone device private pages after
+		 * changes to mm_counter, because the folio_put() will
+		 * clean folio->mapping and the folio_test_anon() check
+		 * will not be usable.
+		 */
+		if (folio_is_device_private(folio))
+			folio_put(folio);
+
 		spin_unlock(ptl);
 		if (flush_needed)
 			tlb_remove_page_size(tlb, &folio->page, HPAGE_PMD_SIZE);
@@ -2392,7 +2408,8 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		struct folio *folio = pfn_swap_entry_folio(entry);
 		pmd_t newpmd;
 
-		VM_BUG_ON(!is_pmd_migration_entry(*pmd));
+		VM_BUG_ON(!is_pmd_migration_entry(*pmd) &&
+			  !folio_is_device_private(folio));
 		if (is_writable_migration_entry(entry)) {
 			/*
 			 * A protection check is difficult so
@@ -2405,9 +2422,11 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			newpmd = swp_entry_to_pmd(entry);
 			if (pmd_swp_soft_dirty(*pmd))
 				newpmd = pmd_swp_mksoft_dirty(newpmd);
-		} else {
+		} else if (is_writable_device_private_entry(entry)) {
+			newpmd = swp_entry_to_pmd(entry);
+			entry = make_device_exclusive_entry(swp_offset(entry));
+		} else
 			newpmd = *pmd;
-		}
 
 		if (uffd_wp)
 			newpmd = pmd_swp_mkuffd_wp(newpmd);
@@ -2860,11 +2879,12 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	struct page *page;
 	pgtable_t pgtable;
 	pmd_t old_pmd, _pmd;
-	bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
-	bool anon_exclusive = false, dirty = false;
+	bool young, write, soft_dirty, uffd_wp = false;
+	bool anon_exclusive = false, dirty = false, present = false;
 	unsigned long addr;
 	pte_t *pte;
 	int i;
+	swp_entry_t swp_entry;
 
 	VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
 	VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
@@ -2918,20 +2938,25 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		return __split_huge_zero_page_pmd(vma, haddr, pmd);
 	}
 
-	pmd_migration = is_pmd_migration_entry(*pmd);
-	if (unlikely(pmd_migration)) {
-		swp_entry_t entry;
 
+	present = pmd_present(*pmd);
+	if (unlikely(!present)) {
+		swp_entry = pmd_to_swp_entry(*pmd);
 		old_pmd = *pmd;
-		entry = pmd_to_swp_entry(old_pmd);
-		page = pfn_swap_entry_to_page(entry);
-		write = is_writable_migration_entry(entry);
+
+		folio = pfn_swap_entry_folio(swp_entry);
+		VM_BUG_ON(!is_migration_entry(swp_entry) &&
+				!is_device_private_entry(swp_entry));
+		page = pfn_swap_entry_to_page(swp_entry);
+		write = is_writable_migration_entry(swp_entry);
+
 		if (PageAnon(page))
-			anon_exclusive = is_readable_exclusive_migration_entry(entry);
-		young = is_migration_entry_young(entry);
-		dirty = is_migration_entry_dirty(entry);
+			anon_exclusive =
+				is_readable_exclusive_migration_entry(swp_entry);
 		soft_dirty = pmd_swp_soft_dirty(old_pmd);
 		uffd_wp = pmd_swp_uffd_wp(old_pmd);
+		young = is_migration_entry_young(swp_entry);
+		dirty = is_migration_entry_dirty(swp_entry);
 	} else {
 		/*
 		 * Up to this point the pmd is present and huge and userland has
@@ -3015,30 +3040,45 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	 * Note that NUMA hinting access restrictions are not transferred to
 	 * avoid any possibility of altering permissions across VMAs.
 	 */
-	if (freeze || pmd_migration) {
+	if (freeze || !present) {
 		for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
 			pte_t entry;
-			swp_entry_t swp_entry;
-
-			if (write)
-				swp_entry = make_writable_migration_entry(
-							page_to_pfn(page + i));
-			else if (anon_exclusive)
-				swp_entry = make_readable_exclusive_migration_entry(
-							page_to_pfn(page + i));
-			else
-				swp_entry = make_readable_migration_entry(
-							page_to_pfn(page + i));
-			if (young)
-				swp_entry = make_migration_entry_young(swp_entry);
-			if (dirty)
-				swp_entry = make_migration_entry_dirty(swp_entry);
-			entry = swp_entry_to_pte(swp_entry);
-			if (soft_dirty)
-				entry = pte_swp_mksoft_dirty(entry);
-			if (uffd_wp)
-				entry = pte_swp_mkuffd_wp(entry);
-
+			if (freeze || is_migration_entry(swp_entry)) {
+				if (write)
+					swp_entry = make_writable_migration_entry(
+								page_to_pfn(page + i));
+				else if (anon_exclusive)
+					swp_entry = make_readable_exclusive_migration_entry(
+								page_to_pfn(page + i));
+				else
+					swp_entry = make_readable_migration_entry(
+								page_to_pfn(page + i));
+				if (young)
+					swp_entry = make_migration_entry_young(swp_entry);
+				if (dirty)
+					swp_entry = make_migration_entry_dirty(swp_entry);
+				entry = swp_entry_to_pte(swp_entry);
+				if (soft_dirty)
+					entry = pte_swp_mksoft_dirty(entry);
+				if (uffd_wp)
+					entry = pte_swp_mkuffd_wp(entry);
+			} else {
+				VM_BUG_ON(!is_device_private_entry(swp_entry));
+				if (write)
+					swp_entry = make_writable_device_private_entry(
+								page_to_pfn(page + i));
+				else if (anon_exclusive)
+					swp_entry = make_device_exclusive_entry(
+								page_to_pfn(page + i));
+				else
+					swp_entry = make_readable_device_private_entry(
+								page_to_pfn(page + i));
+				entry = swp_entry_to_pte(swp_entry);
+				if (soft_dirty)
+					entry = pte_swp_mksoft_dirty(entry);
+				if (uffd_wp)
+					entry = pte_swp_mkuffd_wp(entry);
+			}
 			VM_WARN_ON(!pte_none(ptep_get(pte + i)));
 			set_pte_at(mm, addr, pte + i, entry);
 		}
@@ -3065,7 +3105,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	}
 	pte_unmap(pte);
 
-	if (!pmd_migration)
+	if (present)
 		folio_remove_rmap_pmd(folio, page, vma);
 	if (freeze)
 		put_page(page);
@@ -3077,6 +3117,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
 			   pmd_t *pmd, bool freeze, struct folio *folio)
 {
+	struct folio *pmd_folio;
 	VM_WARN_ON_ONCE(folio && !folio_test_pmd_mappable(folio));
 	VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
 	VM_WARN_ON_ONCE(folio && !folio_test_locked(folio));
@@ -3089,7 +3130,14 @@ void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
 	 */
 	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) ||
 	    is_pmd_migration_entry(*pmd)) {
-		if (folio && folio != pmd_folio(*pmd))
+		if (folio && !pmd_present(*pmd)) {
+			swp_entry_t swp_entry = pmd_to_swp_entry(*pmd);
+
+			pmd_folio = page_folio(pfn_swap_entry_to_page(swp_entry));
+		} else {
+			pmd_folio = pmd_folio(*pmd);
+		}
+		if (folio && folio != pmd_folio)
 			return;
 		__split_huge_pmd_locked(vma, pmd, address, freeze);
 	}
@@ -3581,11 +3629,16 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
 				     folio_test_swapcache(origin_folio)) ?
 					     folio_nr_pages(release) : 0));
 
+			if (folio_is_device_private(release))
+				percpu_ref_get_many(&release->pgmap->ref,
+							(1 << new_order) - 1);
+
 			if (release == origin_folio)
 				continue;
 
-			lru_add_page_tail(origin_folio, &release->page,
-						lruvec, list);
+			if (!folio_is_device_private(origin_folio))
+				lru_add_page_tail(origin_folio, &release->page,
+							lruvec, list);
 
 			/* Some pages can be beyond EOF: drop them from page cache */
 			if (release->index >= end) {
@@ -4625,7 +4678,10 @@ int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
 		return 0;
 
 	flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
-	pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
+	if (!folio_is_device_private(folio))
+		pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
+	else
+		pmdval = pmdp_huge_clear_flush(vma, address, pvmw->pmd);
 
 	/* See folio_try_share_anon_rmap_pmd(): invalidate PMD first. */
 	anon_exclusive = folio_test_anon(folio) && PageAnonExclusive(page);
@@ -4675,6 +4731,17 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
 	entry = pmd_to_swp_entry(*pvmw->pmd);
 	folio_get(folio);
 	pmde = mk_huge_pmd(new, READ_ONCE(vma->vm_page_prot));
+
+	if (unlikely(folio_is_device_private(folio))) {
+		if (pmd_write(pmde))
+			entry = make_writable_device_private_entry(
+							page_to_pfn(new));
+		else
+			entry = make_readable_device_private_entry(
+							page_to_pfn(new));
+		pmde = swp_entry_to_pmd(entry);
+	}
+
 	if (pmd_swp_soft_dirty(*pvmw->pmd))
 		pmde = pmd_mksoft_dirty(pmde);
 	if (is_writable_migration_entry(entry))
diff --git a/mm/migrate.c b/mm/migrate.c
index 59e39aaa74e7..0aa1bdb711c3 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -200,6 +200,8 @@ static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
 
 	if (PageCompound(page))
 		return false;
+	if (folio_is_device_private(folio))
+		return false;
 	VM_BUG_ON_PAGE(!PageAnon(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(pte_present(*pvmw->pte), page);
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index e463c3be934a..5dd2e51477d3 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -278,6 +278,16 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 			 * cannot return prematurely, while zap_huge_pmd() has
 			 * cleared *pmd but not decremented compound_mapcount().
 			 */
+			swp_entry_t entry;
+
+			if (!thp_migration_supported())
+				return not_found(pvmw);
+			entry = pmd_to_swp_entry(pmde);
+			if (is_device_private_entry(entry)) {
+				pvmw->ptl = pmd_lock(mm, pvmw->pmd);
+				return true;
+			}
+
 			if ((pvmw->flags & PVMW_SYNC) &&
 			    thp_vma_suitable_order(vma, pvmw->address,
 						   PMD_ORDER) &&
diff --git a/mm/rmap.c b/mm/rmap.c
index 67bb273dfb80..67e99dc5f2ef 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2326,8 +2326,23 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
 		/* PMD-mapped THP migration entry */
 		if (!pvmw.pte) {
-			subpage = folio_page(folio,
-				pmd_pfn(*pvmw.pmd) - folio_pfn(folio));
+			/*
+			 * Zone device private folios do not work well with
+			 * pmd_pfn() on some architectures due to pte
+			 * inversion.
+			 */
+			if (folio_is_device_private(folio)) {
+				swp_entry_t entry = pmd_to_swp_entry(*pvmw.pmd);
+				unsigned long pfn = swp_offset_pfn(entry);
+
+				subpage = folio_page(folio, pfn
+							- folio_pfn(folio));
+			} else {
+				subpage = folio_page(folio,
+							pmd_pfn(*pvmw.pmd)
+							- folio_pfn(folio));
+			}
+
 			VM_BUG_ON_FOLIO(folio_test_hugetlb(folio) ||
 					!folio_test_pmd_mappable(folio), folio);
 
-- 
2.48.1




* [RFC 04/11] mm/migrate_device: THP migration of zone device pages
  2025-03-06  4:42 [RFC 00/11] THP support for zone device pages Balbir Singh
                   ` (2 preceding siblings ...)
  2025-03-06  4:42 ` [RFC 03/11] mm/thp: zone_device awareness in THP handling code Balbir Singh
@ 2025-03-06  4:42 ` Balbir Singh
  2025-03-06  9:24   ` Mika Penttilä
  2025-03-06  4:42 ` [RFC 05/11] mm/memory/fault: Add support for zone device THP fault handling Balbir Singh
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 38+ messages in thread
From: Balbir Singh @ 2025-03-06  4:42 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: dri-devel, nouveau, Balbir Singh, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, David Hildenbrand,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom

The migrate_device code paths go through the collect, setup and
finalize phases of migration. Support for the MIGRATE_PFN_COMPOUND
flag, used to mark THP pages, was added earlier in the series.

The entries in the src and dst arrays passed to these functions still
remain at PAGE_SIZE granularity. When a compound page is passed, the
first entry has the PFN along with MIGRATE_PFN_COMPOUND and other flags
set (MIGRATE_PFN_MIGRATE, MIGRATE_PFN_VALID); the remaining
(HPAGE_PMD_NR - 1) entries are filled with 0's. This representation
allows the compound page to be split into smaller page sizes.
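
A minimal sketch of the resulting src[] layout for one PMD-sized THP
("head_pfn" stands in for the head page's pfn):

  /* migrate_pfn() sets MIGRATE_PFN_VALID */
  src[0] = migrate_pfn(head_pfn) | MIGRATE_PFN_MIGRATE |
           MIGRATE_PFN_COMPOUND;
  for (i = 1; i < HPAGE_PMD_NR; i++)
          src[i] = 0;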

migrate_vma_collect_hole() and migrate_vma_collect_pmd() are now THP
aware. Two new helper functions, migrate_vma_collect_huge_pmd() and
migrate_vma_insert_huge_pmd_page(), have been added.

migrate_vma_collect_huge_pmd() can collect THP pages, but if for
some reason this fails, there is fallback support to split the folio
and migrate it.

migrate_vma_insert_huge_pmd_page() closely follows the logic of
migrate_vma_insert_page().

Support for splitting pages as needed for migration will follow in
later patches in this series.

Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
 mm/migrate_device.c | 436 ++++++++++++++++++++++++++++++++++++++------
 1 file changed, 379 insertions(+), 57 deletions(-)

diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 7d0d64f67cdf..f3fff5d705bd 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -14,6 +14,7 @@
 #include <linux/pagewalk.h>
 #include <linux/rmap.h>
 #include <linux/swapops.h>
+#include <asm/pgalloc.h>
 #include <asm/tlbflush.h>
 #include "internal.h"
 
@@ -44,6 +45,23 @@ static int migrate_vma_collect_hole(unsigned long start,
 	if (!vma_is_anonymous(walk->vma))
 		return migrate_vma_collect_skip(start, end, walk);
 
+	if (thp_migration_supported() &&
+		(migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
+		(IS_ALIGNED(start, HPAGE_PMD_SIZE) &&
+		 IS_ALIGNED(end, HPAGE_PMD_SIZE))) {
+		migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE |
+						MIGRATE_PFN_COMPOUND;
+		migrate->dst[migrate->npages] = 0;
+		migrate->npages++;
+		migrate->cpages++;
+
+		/*
+		 * Collect the remaining entries as holes, in case we
+		 * need to split later
+		 */
+		return migrate_vma_collect_skip(start + PAGE_SIZE, end, walk);
+	}
+
 	for (addr = start; addr < end; addr += PAGE_SIZE) {
 		migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE;
 		migrate->dst[migrate->npages] = 0;
@@ -54,50 +72,145 @@ static int migrate_vma_collect_hole(unsigned long start,
 	return 0;
 }
 
-static int migrate_vma_collect_pmd(pmd_t *pmdp,
-				   unsigned long start,
-				   unsigned long end,
-				   struct mm_walk *walk)
+/**
+ * migrate_vma_collect_huge_pmd - collect THP pages without splitting the
+ * folio for device private pages.
+ * @pmdp: pointer to pmd entry
+ * @start: start address of the range for migration
+ * @end: end address of the range for migration
+ * @walk: mm_walk callback structure
+ *
+ * Collect the huge pmd entry at @pmdp for migration and set the
+ * MIGRATE_PFN_COMPOUND flag in the migrate src entry to indicate that
+ * migration will occur at HPAGE_PMD granularity
+ */
+static int migrate_vma_collect_huge_pmd(pmd_t *pmdp, unsigned long start,
+					unsigned long end, struct mm_walk *walk)
 {
+	struct mm_struct *mm = walk->mm;
+	struct folio *folio;
 	struct migrate_vma *migrate = walk->private;
-	struct vm_area_struct *vma = walk->vma;
-	struct mm_struct *mm = vma->vm_mm;
-	unsigned long addr = start, unmapped = 0;
+	swp_entry_t entry;
+	int ret;
+	unsigned long write = 0;
+
 	spinlock_t *ptl;
-	pte_t *ptep;
 
-again:
-	if (pmd_none(*pmdp))
+	ptl = pmd_lock(mm, pmdp);
+	if (pmd_none(*pmdp)) {
+		spin_unlock(ptl);
 		return migrate_vma_collect_hole(start, end, -1, walk);
+	}
 
 	if (pmd_trans_huge(*pmdp)) {
-		struct folio *folio;
-
-		ptl = pmd_lock(mm, pmdp);
-		if (unlikely(!pmd_trans_huge(*pmdp))) {
+		if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)) {
 			spin_unlock(ptl);
-			goto again;
+			return migrate_vma_collect_skip(start, end, walk);
 		}
 
 		folio = pmd_folio(*pmdp);
 		if (is_huge_zero_folio(folio)) {
 			spin_unlock(ptl);
-			split_huge_pmd(vma, pmdp, addr);
-		} else {
-			int ret;
+			return migrate_vma_collect_hole(start, end, -1, walk);
+		}
+		if (pmd_write(*pmdp))
+			write = MIGRATE_PFN_WRITE;
+	} else if (!pmd_present(*pmdp)) {
+		entry = pmd_to_swp_entry(*pmdp);
+		folio = pfn_swap_entry_folio(entry);
+
+		if (!is_device_private_entry(entry) ||
+			!(migrate->flags & MIGRATE_VMA_SELECT_DEVICE_PRIVATE) ||
+			(folio->pgmap->owner != migrate->pgmap_owner)) {
+			spin_unlock(ptl);
+			return migrate_vma_collect_skip(start, end, walk);
+		}
 
-			folio_get(folio);
+		if (is_migration_entry(entry)) {
+			migration_entry_wait_on_locked(entry, ptl);
 			spin_unlock(ptl);
-			if (unlikely(!folio_trylock(folio)))
-				return migrate_vma_collect_skip(start, end,
-								walk);
-			ret = split_folio(folio);
-			folio_unlock(folio);
-			folio_put(folio);
-			if (ret)
-				return migrate_vma_collect_skip(start, end,
-								walk);
+			return -EAGAIN;
 		}
+
+		if (is_writable_device_private_entry(entry))
+			write = MIGRATE_PFN_WRITE;
+	} else {
+		spin_unlock(ptl);
+		return -EAGAIN;
+	}
+
+	folio_get(folio);
+	if (unlikely(!folio_trylock(folio))) {
+		spin_unlock(ptl);
+		folio_put(folio);
+		return migrate_vma_collect_skip(start, end, walk);
+	}
+
+	if (thp_migration_supported() &&
+		(migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
+		(IS_ALIGNED(start, HPAGE_PMD_SIZE) &&
+		 IS_ALIGNED(end, HPAGE_PMD_SIZE))) {
+
+		struct page_vma_mapped_walk pvmw = {
+			.ptl = ptl,
+			.address = start,
+			.pmd = pmdp,
+			.vma = walk->vma,
+		};
+
+		unsigned long pfn = page_to_pfn(folio_page(folio, 0));
+
+		migrate->src[migrate->npages] = migrate_pfn(pfn) | write
+						| MIGRATE_PFN_MIGRATE
+						| MIGRATE_PFN_COMPOUND;
+		migrate->dst[migrate->npages++] = 0;
+		migrate->cpages++;
+		ret = set_pmd_migration_entry(&pvmw, folio_page(folio, 0));
+		if (ret) {
+			migrate->npages--;
+			migrate->cpages--;
+			migrate->src[migrate->npages] = 0;
+			migrate->dst[migrate->npages] = 0;
+			goto fallback;
+		}
+		migrate_vma_collect_skip(start + PAGE_SIZE, end, walk);
+		spin_unlock(ptl);
+		return 0;
+	}
+
+fallback:
+	spin_unlock(ptl);
+	ret = split_folio(folio);
+	folio_unlock(folio);
+	folio_put(folio);
+	if (ret)
+		return migrate_vma_collect_skip(start, end, walk);
+	if (pmd_none(pmdp_get_lockless(pmdp)))
+		return migrate_vma_collect_hole(start, end, -1, walk);
+
+	return -ENOENT;
+}
+
+static int migrate_vma_collect_pmd(pmd_t *pmdp,
+				   unsigned long start,
+				   unsigned long end,
+				   struct mm_walk *walk)
+{
+	struct migrate_vma *migrate = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long addr = start, unmapped = 0;
+	spinlock_t *ptl;
+	pte_t *ptep;
+
+again:
+	if (pmd_trans_huge(*pmdp) || !pmd_present(*pmdp)) {
+		int ret = migrate_vma_collect_huge_pmd(pmdp, start, end, walk);
+
+		if (ret == -EAGAIN)
+			goto again;
+		if (ret == 0)
+			return 0;
 	}
 
 	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
@@ -168,8 +281,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
 		}
 
-		/* FIXME support THP */
-		if (!page || !page->mapping || PageTransCompound(page)) {
+		if (!page || !page->mapping) {
 			mpfn = 0;
 			goto next;
 		}
@@ -339,14 +451,6 @@ static bool migrate_vma_check_page(struct page *page, struct page *fault_page)
 	 */
 	int extra = 1 + (page == fault_page);
 
-	/*
-	 * FIXME support THP (transparent huge page), it is bit more complex to
-	 * check them than regular pages, because they can be mapped with a pmd
-	 * or with a pte (split pte mapping).
-	 */
-	if (folio_test_large(folio))
-		return false;
-
 	/* Page from ZONE_DEVICE have one extra reference */
 	if (folio_is_zone_device(folio))
 		extra++;
@@ -375,17 +479,24 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
 
 	lru_add_drain();
 
-	for (i = 0; i < npages; i++) {
+	for (i = 0; i < npages; ) {
 		struct page *page = migrate_pfn_to_page(src_pfns[i]);
 		struct folio *folio;
+		unsigned int nr = 1;
 
 		if (!page) {
 			if (src_pfns[i] & MIGRATE_PFN_MIGRATE)
 				unmapped++;
-			continue;
+			goto next;
 		}
 
 		folio =	page_folio(page);
+		nr = folio_nr_pages(folio);
+
+		if (nr > 1)
+			src_pfns[i] |= MIGRATE_PFN_COMPOUND;
+
+
 		/* ZONE_DEVICE folios are not on LRU */
 		if (!folio_is_zone_device(folio)) {
 			if (!folio_test_lru(folio) && allow_drain) {
@@ -397,7 +508,7 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
 			if (!folio_isolate_lru(folio)) {
 				src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
 				restore++;
-				continue;
+				goto next;
 			}
 
 			/* Drop the reference we took in collect */
@@ -416,10 +527,12 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
 
 			src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
 			restore++;
-			continue;
+			goto next;
 		}
 
 		unmapped++;
+next:
+		i += nr;
 	}
 
 	for (i = 0; i < npages && restore; i++) {
@@ -562,6 +675,146 @@ int migrate_vma_setup(struct migrate_vma *args)
 }
 EXPORT_SYMBOL(migrate_vma_setup);
 
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+/**
+ * migrate_vma_insert_huge_pmd_page: Insert a huge folio into @migrate->vma->vm_mm
+ * at @addr. folio is already allocated as a part of the migration process with
+ * large page.
+ *
+ * @folio needs to be initialized and setup after it's allocated. The code bits
+ * here follow closely the code in __do_huge_pmd_anonymous_page(). This API does
+ * not support THP zero pages.
+ *
+ * @migrate: migrate_vma arguments
+ * @addr: address where the folio will be inserted
+ * @folio: folio to be inserted at @addr
+ * @src: src pfn which is being migrated
+ * @pmdp: pointer to the pmd
+ */
+static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
+					 unsigned long addr,
+					 struct page *page,
+					 unsigned long *src,
+					 pmd_t *pmdp)
+{
+	struct vm_area_struct *vma = migrate->vma;
+	gfp_t gfp = vma_thp_gfp_mask(vma);
+	struct folio *folio = page_folio(page);
+	int ret;
+	spinlock_t *ptl;
+	pgtable_t pgtable;
+	pmd_t entry;
+	bool flush = false;
+	unsigned long i;
+
+	VM_WARN_ON_FOLIO(!folio, folio);
+	VM_WARN_ON_ONCE(!pmd_none(*pmdp) && !is_huge_zero_pmd(*pmdp));
+
+	if (!thp_vma_suitable_order(vma, addr, HPAGE_PMD_ORDER))
+		return -EINVAL;
+
+	ret = anon_vma_prepare(vma);
+	if (ret)
+		return ret;
+
+	folio_set_order(folio, HPAGE_PMD_ORDER);
+	folio_set_large_rmappable(folio);
+
+	if (mem_cgroup_charge(folio, migrate->vma->vm_mm, gfp)) {
+		count_vm_event(THP_FAULT_FALLBACK);
+		count_mthp_stat(HPAGE_PMD_ORDER, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
+		ret = -ENOMEM;
+		goto abort;
+	}
+
+	__folio_mark_uptodate(folio);
+
+	pgtable = pte_alloc_one(vma->vm_mm);
+	if (unlikely(!pgtable))
+		goto abort;
+
+	if (folio_is_device_private(folio)) {
+		swp_entry_t swp_entry;
+
+		if (vma->vm_flags & VM_WRITE)
+			swp_entry = make_writable_device_private_entry(
+						page_to_pfn(page));
+		else
+			swp_entry = make_readable_device_private_entry(
+						page_to_pfn(page));
+		entry = swp_entry_to_pmd(swp_entry);
+	} else {
+		if (folio_is_zone_device(folio) &&
+		    !folio_is_device_coherent(folio)) {
+			goto abort;
+		}
+		entry = mk_pmd(page, vma->vm_page_prot);
+		if (vma->vm_flags & VM_WRITE)
+			entry = pmd_mkwrite(pmd_mkdirty(entry), vma);
+	}
+
+	ptl = pmd_lock(vma->vm_mm, pmdp);
+	ret = check_stable_address_space(vma->vm_mm);
+	if (ret)
+		goto abort;
+
+	/*
+	 * Check for userfaultfd but do not deliver the fault. Instead,
+	 * just back off.
+	 */
+	if (userfaultfd_missing(vma))
+		goto unlock_abort;
+
+	if (!pmd_none(*pmdp)) {
+		if (!is_huge_zero_pmd(*pmdp))
+			goto unlock_abort;
+		flush = true;
+	} else if (!pmd_none(*pmdp))
+		goto unlock_abort;
+
+	add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+	folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE);
+	if (!folio_is_zone_device(folio))
+		folio_add_lru_vma(folio, vma);
+	folio_get(folio);
+
+	if (flush) {
+		pte_free(vma->vm_mm, pgtable);
+		flush_cache_page(vma, addr, addr + HPAGE_PMD_SIZE);
+		pmdp_invalidate(vma, addr, pmdp);
+	} else {
+		pgtable_trans_huge_deposit(vma->vm_mm, pmdp, pgtable);
+		mm_inc_nr_ptes(vma->vm_mm);
+	}
+	set_pmd_at(vma->vm_mm, addr, pmdp, entry);
+	update_mmu_cache_pmd(vma, addr, pmdp);
+
+	spin_unlock(ptl);
+
+	count_vm_event(THP_FAULT_ALLOC);
+	count_mthp_stat(HPAGE_PMD_ORDER, MTHP_STAT_ANON_FAULT_ALLOC);
+	count_memcg_event_mm(vma->vm_mm, THP_FAULT_ALLOC);
+
+	return 0;
+
+unlock_abort:
+	spin_unlock(ptl);
+abort:
+	for (i = 0; i < HPAGE_PMD_NR; i++)
+		src[i] &= ~MIGRATE_PFN_MIGRATE;
+	return 0;
+}
+#else /* !CONFIG_ARCH_ENABLE_THP_MIGRATION */
+static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
+					 unsigned long addr,
+					 struct page *page,
+					 unsigned long *src,
+					 pmd_t *pmdp)
+{
+	return 0;
+}
+#endif
+
 /*
  * This code closely matches the code in:
  *   __handle_mm_fault()
@@ -572,9 +825,10 @@ EXPORT_SYMBOL(migrate_vma_setup);
  */
 static void migrate_vma_insert_page(struct migrate_vma *migrate,
 				    unsigned long addr,
-				    struct page *page,
+				    unsigned long *dst,
 				    unsigned long *src)
 {
+	struct page *page = migrate_pfn_to_page(*dst);
 	struct folio *folio = page_folio(page);
 	struct vm_area_struct *vma = migrate->vma;
 	struct mm_struct *mm = vma->vm_mm;
@@ -602,8 +856,25 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 	pmdp = pmd_alloc(mm, pudp, addr);
 	if (!pmdp)
 		goto abort;
-	if (pmd_trans_huge(*pmdp) || pmd_devmap(*pmdp))
-		goto abort;
+
+	if (thp_migration_supported() && (*dst & MIGRATE_PFN_COMPOUND)) {
+		int ret = migrate_vma_insert_huge_pmd_page(migrate, addr, page,
+								src, pmdp);
+		if (ret)
+			goto abort;
+		return;
+	}
+
+	if (!pmd_none(*pmdp)) {
+		if (pmd_trans_huge(*pmdp)) {
+			if (!is_huge_zero_pmd(*pmdp))
+				goto abort;
+			folio_get(pmd_folio(*pmdp));
+			split_huge_pmd(vma, pmdp, addr);
+		} else if (pmd_leaf(*pmdp))
+			goto abort;
+	}
+
 	if (pte_alloc(mm, pmdp))
 		goto abort;
 	if (unlikely(anon_vma_prepare(vma)))
@@ -694,23 +965,24 @@ static void __migrate_device_pages(unsigned long *src_pfns,
 	unsigned long i;
 	bool notified = false;
 
-	for (i = 0; i < npages; i++) {
+	for (i = 0; i < npages; ) {
 		struct page *newpage = migrate_pfn_to_page(dst_pfns[i]);
 		struct page *page = migrate_pfn_to_page(src_pfns[i]);
 		struct address_space *mapping;
 		struct folio *newfolio, *folio;
 		int r, extra_cnt = 0;
+		unsigned long nr = 1;
 
 		if (!newpage) {
 			src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
-			continue;
+			goto next;
 		}
 
 		if (!page) {
 			unsigned long addr;
 
 			if (!(src_pfns[i] & MIGRATE_PFN_MIGRATE))
-				continue;
+				goto next;
 
 			/*
 			 * The only time there is no vma is when called from
@@ -728,15 +1000,47 @@ static void __migrate_device_pages(unsigned long *src_pfns,
 					migrate->pgmap_owner);
 				mmu_notifier_invalidate_range_start(&range);
 			}
-			migrate_vma_insert_page(migrate, addr, newpage,
+
+			if ((src_pfns[i] & MIGRATE_PFN_COMPOUND) &&
+				(!(dst_pfns[i] & MIGRATE_PFN_COMPOUND))) {
+				nr = HPAGE_PMD_NR;
+				src_pfns[i] &= ~MIGRATE_PFN_COMPOUND;
+				src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
+				goto next;
+			}
+
+			migrate_vma_insert_page(migrate, addr, &dst_pfns[i],
 						&src_pfns[i]);
-			continue;
+			goto next;
 		}
 
 		newfolio = page_folio(newpage);
 		folio = page_folio(page);
 		mapping = folio_mapping(folio);
 
+		/*
+		 * If THP migration is enabled, check if both src and dst
+		 * can migrate large pages
+		 */
+		if (thp_migration_supported()) {
+			if ((src_pfns[i] & MIGRATE_PFN_MIGRATE) &&
+				(src_pfns[i] & MIGRATE_PFN_COMPOUND) &&
+				!(dst_pfns[i] & MIGRATE_PFN_COMPOUND)) {
+
+				if (!migrate) {
+					src_pfns[i] &= ~(MIGRATE_PFN_MIGRATE |
+							 MIGRATE_PFN_COMPOUND);
+					goto next;
+				}
+				src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
+			} else if ((src_pfns[i] & MIGRATE_PFN_MIGRATE) &&
+				(dst_pfns[i] & MIGRATE_PFN_COMPOUND) &&
+				!(src_pfns[i] & MIGRATE_PFN_COMPOUND)) {
+				src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
+			}
+		}
+
+
 		if (folio_is_device_private(newfolio) ||
 		    folio_is_device_coherent(newfolio)) {
 			if (mapping) {
@@ -749,7 +1053,7 @@ static void __migrate_device_pages(unsigned long *src_pfns,
 				if (!folio_test_anon(folio) ||
 				    !folio_free_swap(folio)) {
 					src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
-					continue;
+					goto next;
 				}
 			}
 		} else if (folio_is_zone_device(newfolio)) {
@@ -757,7 +1061,7 @@ static void __migrate_device_pages(unsigned long *src_pfns,
 			 * Other types of ZONE_DEVICE page are not supported.
 			 */
 			src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
-			continue;
+			goto next;
 		}
 
 		BUG_ON(folio_test_writeback(folio));
@@ -769,6 +1073,8 @@ static void __migrate_device_pages(unsigned long *src_pfns,
 			src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
 		else
 			folio_migrate_flags(newfolio, folio);
+next:
+		i += nr;
 	}
 
 	if (notified)
@@ -899,24 +1205,40 @@ EXPORT_SYMBOL(migrate_vma_finalize);
 int migrate_device_range(unsigned long *src_pfns, unsigned long start,
 			unsigned long npages)
 {
-	unsigned long i, pfn;
+	unsigned long i, j, pfn;
 
-	for (pfn = start, i = 0; i < npages; pfn++, i++) {
-		struct folio *folio;
+	i = 0;
+	pfn = start;
+	while (i < npages) {
+		struct page *page = pfn_to_page(pfn);
+		struct folio *folio = page_folio(page);
+		unsigned int nr = 1;
 
 		folio = folio_get_nontail_page(pfn_to_page(pfn));
 		if (!folio) {
 			src_pfns[i] = 0;
-			continue;
+			goto next;
 		}
 
 		if (!folio_trylock(folio)) {
 			src_pfns[i] = 0;
 			folio_put(folio);
-			continue;
+			goto next;
 		}
 
 		src_pfns[i] = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
+		nr = folio_nr_pages(folio);
+		if (nr > 1) {
+			src_pfns[i] |= MIGRATE_PFN_COMPOUND;
+			for (j = 1; j < nr; j++)
+				src_pfns[i+j] = 0;
+			i += j;
+			pfn += j;
+			continue;
+		}
+next:
+		i++;
+		pfn++;
 	}
 
 	migrate_device_unmap(src_pfns, npages, NULL);
-- 
2.48.1




* [RFC 05/11] mm/memory/fault: Add support for zone device THP fault handling
  2025-03-06  4:42 [RFC 00/11] THP support for zone device pages Balbir Singh
                   ` (3 preceding siblings ...)
  2025-03-06  4:42 ` [RFC 04/11] mm/migrate_device: THP migration of zone device pages Balbir Singh
@ 2025-03-06  4:42 ` Balbir Singh
  2025-07-08 14:40   ` David Hildenbrand
  2025-03-06  4:42 ` [RFC 06/11] lib/test_hmm: test cases and support for zone device private THP Balbir Singh
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 38+ messages in thread
From: Balbir Singh @ 2025-03-06  4:42 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: dri-devel, nouveau, Balbir Singh, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, David Hildenbrand,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom

When the CPU touches a zone device THP entry, the data needs to be
migrated back to the CPU. Call migrate_to_ram() on these pages via the
do_huge_pmd_device_private() fault handling helper.
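
As a hedged sketch of what a driver's migrate_to_ram() callback can now
expect (src_pfns/dst_pfns and "drvdata" are placeholders; the field
names match struct migrate_vma), the whole PMD-sized folio can be
migrated back in one go:

  unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
  struct migrate_vma args = {
          .vma         = vmf->vma,
          .start       = haddr,
          .end         = haddr + HPAGE_PMD_SIZE,
          .src         = src_pfns,       /* HPAGE_PMD_NR entries */
          .dst         = dst_pfns,       /* HPAGE_PMD_NR entries */
          .fault_page  = vmf->page,
          .pgmap_owner = drvdata,
          .flags       = MIGRATE_VMA_SELECT_DEVICE_PRIVATE |
                         MIGRATE_VMA_SELECT_COMPOUND,
  };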

Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
 include/linux/huge_mm.h |  7 +++++++
 mm/huge_memory.c        | 35 +++++++++++++++++++++++++++++++++++
 mm/memory.c             |  6 ++++--
 3 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index e893d546a49f..ad0c0ccfcbc2 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -479,6 +479,8 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
 
 vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);
 
+vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf);
+
 extern struct folio *huge_zero_folio;
 extern unsigned long huge_zero_pfn;
 
@@ -634,6 +636,11 @@ static inline vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 	return 0;
 }
 
+static inline vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf)
+{
+	return 0;
+}
+
 static inline bool is_huge_zero_folio(const struct folio *folio)
 {
 	return false;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d8e018d1bdbd..995ac8be5709 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1375,6 +1375,41 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 	return __do_huge_pmd_anonymous_page(vmf);
 }
 
+vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
+	vm_fault_t ret;
+	spinlock_t *ptl;
+	swp_entry_t swp_entry;
+	struct page *page;
+
+	if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER))
+		return VM_FAULT_FALLBACK;
+
+	if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
+		vma_end_read(vma);
+		return VM_FAULT_RETRY;
+	}
+
+	ptl = pmd_lock(vma->vm_mm, vmf->pmd);
+	if (unlikely(!pmd_same(*vmf->pmd, vmf->orig_pmd))) {
+		spin_unlock(ptl);
+		return 0;
+	}
+
+	swp_entry = pmd_to_swp_entry(vmf->orig_pmd);
+	page = pfn_swap_entry_to_page(swp_entry);
+	vmf->page = page;
+	vmf->pte = NULL;
+	get_page(page);
+	spin_unlock(ptl);
+	ret = page_pgmap(page)->ops->migrate_to_ram(vmf);
+	put_page(page);
+
+	return ret;
+}
+
 static int insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 		pmd_t *pmd, pfn_t pfn, pgprot_t prot, bool write,
 		pgtable_t pgtable)
diff --git a/mm/memory.c b/mm/memory.c
index a838c8c44bfd..deaa67b88708 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6149,8 +6149,10 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 		vmf.orig_pmd = pmdp_get_lockless(vmf.pmd);
 
 		if (unlikely(is_swap_pmd(vmf.orig_pmd))) {
-			VM_BUG_ON(thp_migration_supported() &&
-					  !is_pmd_migration_entry(vmf.orig_pmd));
+			if (is_device_private_entry(
+					pmd_to_swp_entry(vmf.orig_pmd)))
+				return do_huge_pmd_device_private(&vmf);
+
 			if (is_pmd_migration_entry(vmf.orig_pmd))
 				pmd_migration_entry_wait(mm, vmf.pmd);
 			return 0;
-- 
2.48.1




* [RFC 06/11] lib/test_hmm: test cases and support for zone device private THP
  2025-03-06  4:42 [RFC 00/11] THP support for zone device pages Balbir Singh
                   ` (4 preceding siblings ...)
  2025-03-06  4:42 ` [RFC 05/11] mm/memory/fault: Add support for zone device THP fault handling Balbir Singh
@ 2025-03-06  4:42 ` Balbir Singh
  2025-03-06  4:42 ` [RFC 07/11] mm/memremap: Add folio_split support Balbir Singh
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 38+ messages in thread
From: Balbir Singh @ 2025-03-06  4:42 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: dri-devel, nouveau, Balbir Singh, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, David Hildenbrand,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom

Enhance the hmm test driver (lib/test_hmm) with support for
THP pages.

A new free_folios pool has been added to the dmirror device, from which
an allocation is made when a THP zone device private page is requested.

Add compound page awareness to the allocation function for both normal
migration and fault-based migration. These routines also copy
folio_nr_pages() worth of data when moving it between system memory and
device memory.

The args.src and args.dst arrays used to hold migration entries are now
dynamically allocated (as they need to hold HPAGE_PMD_NR entries or
more).
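
A minimal sketch of that allocation (sizes and GFP flags are
illustrative):

  unsigned long npages = (end - start) >> PAGE_SHIFT;

  src_pfns = kcalloc(npages, sizeof(*src_pfns), GFP_KERNEL);
  dst_pfns = kcalloc(npages, sizeof(*dst_pfns), GFP_KERNEL);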

Split and migrate support will be added in future patches in this
series.

Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
 lib/test_hmm.c | 342 +++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 273 insertions(+), 69 deletions(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 5b144bc5c4ec..a81d2f8a0426 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -119,6 +119,7 @@ struct dmirror_device {
 	unsigned long		calloc;
 	unsigned long		cfree;
 	struct page		*free_pages;
+	struct folio		*free_folios;
 	spinlock_t		lock;		/* protects the above */
 };
 
@@ -492,7 +493,7 @@ static int dmirror_write(struct dmirror *dmirror, struct hmm_dmirror_cmd *cmd)
 }
 
 static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
-				   struct page **ppage)
+				  struct page **ppage, bool is_large)
 {
 	struct dmirror_chunk *devmem;
 	struct resource *res = NULL;
@@ -572,20 +573,45 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
 		pfn_first, pfn_last);
 
 	spin_lock(&mdevice->lock);
-	for (pfn = pfn_first; pfn < pfn_last; pfn++) {
+	for (pfn = pfn_first; pfn < pfn_last; ) {
 		struct page *page = pfn_to_page(pfn);
 
+		if (is_large && IS_ALIGNED(pfn, HPAGE_PMD_NR)
+			&& (pfn + HPAGE_PMD_NR <= pfn_last)) {
+			page->zone_device_data = mdevice->free_folios;
+			mdevice->free_folios = page_folio(page);
+			pfn += HPAGE_PMD_NR;
+			continue;
+		}
+
 		page->zone_device_data = mdevice->free_pages;
 		mdevice->free_pages = page;
+		pfn++;
 	}
+
+	ret = 0;
 	if (ppage) {
-		*ppage = mdevice->free_pages;
-		mdevice->free_pages = (*ppage)->zone_device_data;
-		mdevice->calloc++;
+		if (is_large) {
+			if (!mdevice->free_folios) {
+				ret = -ENOMEM;
+				goto err_unlock;
+			}
+			*ppage = folio_page(mdevice->free_folios, 0);
+			mdevice->free_folios = (*ppage)->zone_device_data;
+			mdevice->calloc += HPAGE_PMD_NR;
+		} else if (mdevice->free_pages) {
+			*ppage = mdevice->free_pages;
+			mdevice->free_pages = (*ppage)->zone_device_data;
+			mdevice->calloc++;
+		} else {
+			ret = -ENOMEM;
+			goto err_unlock;
+		}
 	}
+err_unlock:
 	spin_unlock(&mdevice->lock);
 
-	return 0;
+	return ret;
 
 err_release:
 	mutex_unlock(&mdevice->devmem_lock);
@@ -598,10 +624,13 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
 	return ret;
 }
 
-static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
+static struct page *dmirror_devmem_alloc_page(struct dmirror *dmirror,
+					      bool is_large)
 {
 	struct page *dpage = NULL;
 	struct page *rpage = NULL;
+	unsigned int order = is_large ? HPAGE_PMD_ORDER : 0;
+	struct dmirror_device *mdevice = dmirror->mdevice;
 
 	/*
 	 * For ZONE_DEVICE private type, this is a fake device so we allocate
@@ -610,49 +639,55 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
 	 * data and ignore rpage.
 	 */
 	if (dmirror_is_private_zone(mdevice)) {
-		rpage = alloc_page(GFP_HIGHUSER);
+		rpage = folio_page(folio_alloc(GFP_HIGHUSER, order), 0);
 		if (!rpage)
 			return NULL;
 	}
 	spin_lock(&mdevice->lock);
 
-	if (mdevice->free_pages) {
+	if (is_large && mdevice->free_folios) {
+		dpage = folio_page(mdevice->free_folios, 0);
+		mdevice->free_folios = dpage->zone_device_data;
+		mdevice->calloc += 1 << order;
+		spin_unlock(&mdevice->lock);
+	} else if (!is_large && mdevice->free_pages) {
 		dpage = mdevice->free_pages;
 		mdevice->free_pages = dpage->zone_device_data;
 		mdevice->calloc++;
 		spin_unlock(&mdevice->lock);
 	} else {
 		spin_unlock(&mdevice->lock);
-		if (dmirror_allocate_chunk(mdevice, &dpage))
+		if (dmirror_allocate_chunk(mdevice, &dpage, is_large))
 			goto error;
 	}
 
-	zone_device_page_init(dpage);
+	init_zone_device_folio(page_folio(dpage), order);
 	dpage->zone_device_data = rpage;
 	return dpage;
 
 error:
 	if (rpage)
-		__free_page(rpage);
+		__free_pages(rpage, order);
 	return NULL;
 }
 
 static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
 					   struct dmirror *dmirror)
 {
-	struct dmirror_device *mdevice = dmirror->mdevice;
 	const unsigned long *src = args->src;
 	unsigned long *dst = args->dst;
 	unsigned long addr;
 
-	for (addr = args->start; addr < args->end; addr += PAGE_SIZE,
-						   src++, dst++) {
+	for (addr = args->start; addr < args->end; ) {
 		struct page *spage;
 		struct page *dpage;
 		struct page *rpage;
+		bool is_large = *src & MIGRATE_PFN_COMPOUND;
+		int write = (*src & MIGRATE_PFN_WRITE) ? MIGRATE_PFN_WRITE : 0;
+		unsigned long nr = 1;
 
 		if (!(*src & MIGRATE_PFN_MIGRATE))
-			continue;
+			goto next;
 
 		/*
 		 * Note that spage might be NULL which is OK since it is an
@@ -662,17 +697,45 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
 		if (WARN(spage && is_zone_device_page(spage),
 		     "page already in device spage pfn: 0x%lx\n",
 		     page_to_pfn(spage)))
+			goto next;
+
+		dpage = dmirror_devmem_alloc_page(dmirror, is_large);
+		if (!dpage) {
+			struct folio *folio;
+			unsigned long i;
+			unsigned long spfn = *src >> MIGRATE_PFN_SHIFT;
+			struct page *src_page;
+
+			if (!is_large)
+				goto next;
+
+			if (!spage && is_large) {
+				nr = HPAGE_PMD_NR;
+			} else {
+				folio = page_folio(spage);
+				nr = folio_nr_pages(folio);
+			}
+
+			for (i = 0; i < nr && addr < args->end; i++) {
+				dpage = dmirror_devmem_alloc_page(dmirror, false);
+				rpage = BACKING_PAGE(dpage);
+				rpage->zone_device_data = dmirror;
+
+				*dst = migrate_pfn(page_to_pfn(dpage)) | write;
+				src_page = pfn_to_page(spfn + i);
+
+				if (spage)
+					copy_highpage(rpage, src_page);
+				else
+					clear_highpage(rpage);
+				src++;
+				dst++;
+				addr += PAGE_SIZE;
+			}
 			continue;
-
-		dpage = dmirror_devmem_alloc_page(mdevice);
-		if (!dpage)
-			continue;
+		}
 
 		rpage = BACKING_PAGE(dpage);
-		if (spage)
-			copy_highpage(rpage, spage);
-		else
-			clear_highpage(rpage);
 
 		/*
 		 * Normally, a device would use the page->zone_device_data to
@@ -684,10 +747,42 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
 
 		pr_debug("migrating from sys to dev pfn src: 0x%lx pfn dst: 0x%lx\n",
 			 page_to_pfn(spage), page_to_pfn(dpage));
-		*dst = migrate_pfn(page_to_pfn(dpage));
-		if ((*src & MIGRATE_PFN_WRITE) ||
-		    (!spage && args->vma->vm_flags & VM_WRITE))
-			*dst |= MIGRATE_PFN_WRITE;
+
+		*dst = migrate_pfn(page_to_pfn(dpage)) | write;
+
+		if (is_large) {
+			int i;
+			struct folio *folio = page_folio(dpage);
+			*dst |= MIGRATE_PFN_COMPOUND;
+
+			if (folio_test_large(folio)) {
+				for (i = 0; i < folio_nr_pages(folio); i++) {
+					struct page *dst_page =
+						pfn_to_page(page_to_pfn(rpage) + i);
+					struct page *src_page =
+						pfn_to_page(page_to_pfn(spage) + i);
+
+					if (spage)
+						copy_highpage(dst_page, src_page);
+					else
+						clear_highpage(dst_page);
+					src++;
+					dst++;
+					addr += PAGE_SIZE;
+				}
+				continue;
+			}
+		}
+
+		if (spage)
+			copy_highpage(rpage, spage);
+		else
+			clear_highpage(rpage);
+
+next:
+		src++;
+		dst++;
+		addr += PAGE_SIZE;
 	}
 }
 
@@ -734,14 +829,17 @@ static int dmirror_migrate_finalize_and_map(struct migrate_vma *args,
 	const unsigned long *src = args->src;
 	const unsigned long *dst = args->dst;
 	unsigned long pfn;
+	const unsigned long start_pfn = start >> PAGE_SHIFT;
+	const unsigned long end_pfn = end >> PAGE_SHIFT;
 
 	/* Map the migrated pages into the device's page tables. */
 	mutex_lock(&dmirror->mutex);
 
-	for (pfn = start >> PAGE_SHIFT; pfn < (end >> PAGE_SHIFT); pfn++,
-								src++, dst++) {
+	for (pfn = start_pfn; pfn < end_pfn; pfn++, src++, dst++) {
 		struct page *dpage;
 		void *entry;
+		int nr, i;
+		struct page *rpage;
 
 		if (!(*src & MIGRATE_PFN_MIGRATE))
 			continue;
@@ -750,13 +848,25 @@ static int dmirror_migrate_finalize_and_map(struct migrate_vma *args,
 		if (!dpage)
 			continue;
 
-		entry = BACKING_PAGE(dpage);
-		if (*dst & MIGRATE_PFN_WRITE)
-			entry = xa_tag_pointer(entry, DPT_XA_TAG_WRITE);
-		entry = xa_store(&dmirror->pt, pfn, entry, GFP_ATOMIC);
-		if (xa_is_err(entry)) {
-			mutex_unlock(&dmirror->mutex);
-			return xa_err(entry);
+		if (*dst & MIGRATE_PFN_COMPOUND)
+			nr = folio_nr_pages(page_folio(dpage));
+		else
+			nr = 1;
+
+		WARN_ON_ONCE(end_pfn < start_pfn + nr);
+
+		rpage = BACKING_PAGE(dpage);
+		VM_BUG_ON(folio_nr_pages(page_folio(rpage)) != nr);
+
+		for (i = 0; i < nr; i++) {
+			entry = folio_page(page_folio(rpage), i);
+			if (*dst & MIGRATE_PFN_WRITE)
+				entry = xa_tag_pointer(entry, DPT_XA_TAG_WRITE);
+			entry = xa_store(&dmirror->pt, pfn + i, entry, GFP_ATOMIC);
+			if (xa_is_err(entry)) {
+				mutex_unlock(&dmirror->mutex);
+				return xa_err(entry);
+			}
 		}
 	}
 
@@ -829,31 +939,61 @@ static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma *args,
 	unsigned long start = args->start;
 	unsigned long end = args->end;
 	unsigned long addr;
+	unsigned int order = 0;
+	int i;
 
-	for (addr = start; addr < end; addr += PAGE_SIZE,
-				       src++, dst++) {
+	for (addr = start; addr < end; ) {
 		struct page *dpage, *spage;
 
 		spage = migrate_pfn_to_page(*src);
 		if (!spage || !(*src & MIGRATE_PFN_MIGRATE))
-			continue;
+			goto next;
 
 		if (WARN_ON(!is_device_private_page(spage) &&
 			    !is_device_coherent_page(spage)))
-			continue;
+			goto next;
 		spage = BACKING_PAGE(spage);
-		dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
-		if (!dpage)
-			continue;
-		pr_debug("migrating from dev to sys pfn src: 0x%lx pfn dst: 0x%lx\n",
-			 page_to_pfn(spage), page_to_pfn(dpage));
+		order = folio_order(page_folio(spage));
 
+		if (order)
+			dpage = folio_page(vma_alloc_folio(GFP_HIGHUSER_MOVABLE,
+						order, args->vma, addr), 0);
+		else
+			dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
+
+		/* Try with smaller pages if large allocation fails */
+		if (!dpage && order) {
+			dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
+			if (!dpage)
+				return VM_FAULT_OOM;
+			order = 0;
+		}
+
+		pr_debug("migrating from sys to dev pfn src: 0x%lx pfn dst: 0x%lx\n",
+				page_to_pfn(spage), page_to_pfn(dpage));
 		lock_page(dpage);
 		xa_erase(&dmirror->pt, addr >> PAGE_SHIFT);
 		copy_highpage(dpage, spage);
 		*dst = migrate_pfn(page_to_pfn(dpage));
 		if (*src & MIGRATE_PFN_WRITE)
 			*dst |= MIGRATE_PFN_WRITE;
+		if (order)
+			*dst |= MIGRATE_PFN_COMPOUND;
+
+		for (i = 0; i < (1 << order); i++) {
+			struct page *src_page;
+			struct page *dst_page;
+
+			src_page = pfn_to_page(page_to_pfn(spage) + i);
+			dst_page = pfn_to_page(page_to_pfn(dpage) + i);
+
+			xa_erase(&dmirror->pt, addr >> PAGE_SHIFT);
+			copy_highpage(dst_page, src_page);
+		}
+next:
+		addr += PAGE_SIZE << order;
+		src += 1 << order;
+		dst += 1 << order;
 	}
 	return 0;
 }
@@ -939,8 +1079,8 @@ static int dmirror_migrate_to_device(struct dmirror *dmirror,
 	unsigned long size = cmd->npages << PAGE_SHIFT;
 	struct mm_struct *mm = dmirror->notifier.mm;
 	struct vm_area_struct *vma;
-	unsigned long src_pfns[64] = { 0 };
-	unsigned long dst_pfns[64] = { 0 };
+	unsigned long *src_pfns = NULL;
+	unsigned long *dst_pfns = NULL;
 	struct dmirror_bounce bounce;
 	struct migrate_vma args = { 0 };
 	unsigned long next;
@@ -955,6 +1095,18 @@ static int dmirror_migrate_to_device(struct dmirror *dmirror,
 	if (!mmget_not_zero(mm))
 		return -EINVAL;
 
+	ret = -ENOMEM;
+	src_pfns = kmalloc_array(PTRS_PER_PTE, sizeof(*src_pfns),
+			  GFP_KERNEL | __GFP_RETRY_MAYFAIL);
+	if (!src_pfns)
+		goto free_mem;
+
+	dst_pfns = kmalloc_array(PTRS_PER_PTE, sizeof(*dst_pfns),
+			  GFP_KERNEL | __GFP_RETRY_MAYFAIL);
+	if (!dst_pfns)
+		goto free_mem;
+
+	ret = 0;
 	mmap_read_lock(mm);
 	for (addr = start; addr < end; addr = next) {
 		vma = vma_lookup(mm, addr);
@@ -962,7 +1114,7 @@ static int dmirror_migrate_to_device(struct dmirror *dmirror,
 			ret = -EINVAL;
 			goto out;
 		}
-		next = min(end, addr + (ARRAY_SIZE(src_pfns) << PAGE_SHIFT));
+		next = min(end, addr + (PTRS_PER_PTE << PAGE_SHIFT));
 		if (next > vma->vm_end)
 			next = vma->vm_end;
 
@@ -972,7 +1124,8 @@ static int dmirror_migrate_to_device(struct dmirror *dmirror,
 		args.start = addr;
 		args.end = next;
 		args.pgmap_owner = dmirror->mdevice;
-		args.flags = MIGRATE_VMA_SELECT_SYSTEM;
+		args.flags = MIGRATE_VMA_SELECT_SYSTEM |
+				MIGRATE_VMA_SELECT_COMPOUND;
 		ret = migrate_vma_setup(&args);
 		if (ret)
 			goto out;
@@ -992,7 +1145,7 @@ static int dmirror_migrate_to_device(struct dmirror *dmirror,
 	 */
 	ret = dmirror_bounce_init(&bounce, start, size);
 	if (ret)
-		return ret;
+		goto free_mem;
 	mutex_lock(&dmirror->mutex);
 	ret = dmirror_do_read(dmirror, start, end, &bounce);
 	mutex_unlock(&dmirror->mutex);
@@ -1003,11 +1156,14 @@ static int dmirror_migrate_to_device(struct dmirror *dmirror,
 	}
 	cmd->cpages = bounce.cpages;
 	dmirror_bounce_fini(&bounce);
-	return ret;
+	goto free_mem;
 
 out:
 	mmap_read_unlock(mm);
 	mmput(mm);
+free_mem:
+	kfree(src_pfns);
+	kfree(dst_pfns);
 	return ret;
 }
 
@@ -1200,6 +1356,7 @@ static void dmirror_device_evict_chunk(struct dmirror_chunk *chunk)
 	unsigned long i;
 	unsigned long *src_pfns;
 	unsigned long *dst_pfns;
+	unsigned int order = 0;
 
 	src_pfns = kvcalloc(npages, sizeof(*src_pfns), GFP_KERNEL | __GFP_NOFAIL);
 	dst_pfns = kvcalloc(npages, sizeof(*dst_pfns), GFP_KERNEL | __GFP_NOFAIL);
@@ -1215,13 +1372,25 @@ static void dmirror_device_evict_chunk(struct dmirror_chunk *chunk)
 		if (WARN_ON(!is_device_private_page(spage) &&
 			    !is_device_coherent_page(spage)))
 			continue;
+
+		order = folio_order(page_folio(spage));
 		spage = BACKING_PAGE(spage);
-		dpage = alloc_page(GFP_HIGHUSER_MOVABLE | __GFP_NOFAIL);
+		if (src_pfns[i] & MIGRATE_PFN_COMPOUND) {
+			dpage = folio_page(folio_alloc(GFP_HIGHUSER_MOVABLE,
+					      order), 0);
+		} else {
+			dpage = alloc_page(GFP_HIGHUSER_MOVABLE | __GFP_NOFAIL);
+			order = 0;
+		}
+
+		/* TODO Support splitting here */
 		lock_page(dpage);
-		copy_highpage(dpage, spage);
 		dst_pfns[i] = migrate_pfn(page_to_pfn(dpage));
 		if (src_pfns[i] & MIGRATE_PFN_WRITE)
 			dst_pfns[i] |= MIGRATE_PFN_WRITE;
+		if (order)
+			dst_pfns[i] |= MIGRATE_PFN_COMPOUND;
+		folio_copy(page_folio(dpage), page_folio(spage));
 	}
 	migrate_device_pages(src_pfns, dst_pfns, npages);
 	migrate_device_finalize(src_pfns, dst_pfns, npages);
@@ -1234,7 +1403,12 @@ static void dmirror_remove_free_pages(struct dmirror_chunk *devmem)
 {
 	struct dmirror_device *mdevice = devmem->mdevice;
 	struct page *page;
+	struct folio *folio;
+
 
+	for (folio = mdevice->free_folios; folio; folio = folio_zone_device_data(folio))
+		if (dmirror_page_to_chunk(folio_page(folio, 0)) == devmem)
+			mdevice->free_folios = folio_zone_device_data(folio);
 	for (page = mdevice->free_pages; page; page = page->zone_device_data)
 		if (dmirror_page_to_chunk(page) == devmem)
 			mdevice->free_pages = page->zone_device_data;
@@ -1265,6 +1439,7 @@ static void dmirror_device_remove_chunks(struct dmirror_device *mdevice)
 		mdevice->devmem_count = 0;
 		mdevice->devmem_capacity = 0;
 		mdevice->free_pages = NULL;
+		mdevice->free_folios = NULL;
 		kfree(mdevice->devmem_chunks);
 		mdevice->devmem_chunks = NULL;
 	}
@@ -1378,18 +1553,29 @@ static void dmirror_devmem_free(struct page *page)
 {
 	struct page *rpage = BACKING_PAGE(page);
 	struct dmirror_device *mdevice;
+	struct folio *folio = page_folio(page);
+	unsigned int order = folio_order(folio);
 
-	if (rpage != page)
-		__free_page(rpage);
+	if (rpage != page) {
+		if (order)
+			__free_pages(rpage, order);
+		else
+			__free_page(rpage);
+	}
 
 	mdevice = dmirror_page_to_device(page);
 	spin_lock(&mdevice->lock);
 
 	/* Return page to our allocator if not freeing the chunk */
 	if (!dmirror_page_to_chunk(page)->remove) {
-		mdevice->cfree++;
-		page->zone_device_data = mdevice->free_pages;
-		mdevice->free_pages = page;
+		mdevice->cfree += 1 << order;
+		if (order) {
+			page->zone_device_data = mdevice->free_folios;
+			mdevice->free_folios = folio;
+		} else {
+			page->zone_device_data = mdevice->free_pages;
+			mdevice->free_pages = page;
+		}
 	}
 	spin_unlock(&mdevice->lock);
 }
@@ -1397,11 +1583,10 @@ static void dmirror_devmem_free(struct page *page)
 static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
 {
 	struct migrate_vma args = { 0 };
-	unsigned long src_pfns = 0;
-	unsigned long dst_pfns = 0;
 	struct page *rpage;
 	struct dmirror *dmirror;
-	vm_fault_t ret;
+	vm_fault_t ret = 0;
+	unsigned int order, nr;
 
 	/*
 	 * Normally, a device would use the page->zone_device_data to point to
@@ -1412,21 +1597,36 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
 	dmirror = rpage->zone_device_data;
 
 	/* FIXME demonstrate how we can adjust migrate range */
+	order = folio_order(page_folio(vmf->page));
+	nr = 1 << order;
+
+	/*
+	 * A per-cpu cache of src and dst pfns could be considered, but
+	 * it might not scale well with a large number of CPUs.
+	 */
+	args.start = ALIGN_DOWN(vmf->address, (1 << (PAGE_SHIFT + order)));
 	args.vma = vmf->vma;
-	args.start = vmf->address;
-	args.end = args.start + PAGE_SIZE;
-	args.src = &src_pfns;
-	args.dst = &dst_pfns;
+	args.end = args.start + (PAGE_SIZE << order);
+	args.src = kcalloc(nr, sizeof(*args.src), GFP_KERNEL);
+	args.dst = kcalloc(nr, sizeof(*args.dst), GFP_KERNEL);
 	args.pgmap_owner = dmirror->mdevice;
 	args.flags = dmirror_select_device(dmirror);
 	args.fault_page = vmf->page;
 
+	if (!args.src || !args.dst) {
+		ret = VM_FAULT_OOM;
+		goto err;
+	}
+
+	if (order)
+		args.flags |= MIGRATE_VMA_SELECT_COMPOUND;
+
 	if (migrate_vma_setup(&args))
 		return VM_FAULT_SIGBUS;
 
 	ret = dmirror_devmem_fault_alloc_and_copy(&args, dmirror);
 	if (ret)
-		return ret;
+		goto err;
 	migrate_vma_pages(&args);
 	/*
 	 * No device finalize step is needed since
@@ -1434,12 +1634,16 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
 	 * invalidated the device page table.
 	 */
 	migrate_vma_finalize(&args);
-	return 0;
+err:
+	kfree(args.src);
+	kfree(args.dst);
+	return ret;
 }
 
 static const struct dev_pagemap_ops dmirror_devmem_ops = {
 	.page_free	= dmirror_devmem_free,
 	.migrate_to_ram	= dmirror_devmem_fault,
 };
 
 static int dmirror_device_init(struct dmirror_device *mdevice, int id)
@@ -1465,7 +1669,7 @@ static int dmirror_device_init(struct dmirror_device *mdevice, int id)
 		return ret;
 
 	/* Build a list of free ZONE_DEVICE struct pages */
-	return dmirror_allocate_chunk(mdevice, NULL);
+	return dmirror_allocate_chunk(mdevice, NULL, false);
 }
 
 static void dmirror_device_remove(struct dmirror_device *mdevice)
-- 
2.48.1



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC 07/11] mm/memremap: Add folio_split support
  2025-03-06  4:42 [RFC 00/11] THP support for zone device pages Balbir Singh
                   ` (5 preceding siblings ...)
  2025-03-06  4:42 ` [RFC 06/11] lib/test_hmm: test cases and support for zone device private THP Balbir Singh
@ 2025-03-06  4:42 ` Balbir Singh
  2025-03-06  8:16   ` Mika Penttilä
                     ` (2 more replies)
  2025-03-06  4:42 ` [RFC 08/11] mm/thp: add split during migration support Balbir Singh
                   ` (4 subsequent siblings)
  11 siblings, 3 replies; 38+ messages in thread
From: Balbir Singh @ 2025-03-06  4:42 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: dri-devel, nouveau, Balbir Singh, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, David Hildenbrand,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom

When a zone device folio is split (via a huge PMD folio split), the
folio_split driver callback is invoked to let the device driver know
that the folio has been split into folios of a smaller order.

The HMM test driver has been updated to handle the split. Since the
test driver uses backing pages (the backing pages implement the mirror
device), it needs a mechanism to reorganize those backing pages into
folios of the right order after a split. This is supported by
exporting prep_compound_page().
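
For a driver that keeps no extra per-page state, the callback can be as
small as propagating the pagemap and the per-folio driver data from the
head folio to the new tail folio, as the nouveau change later in this
series does. A minimal sketch (the my_drv_* names are illustrative, not
part of this series):

	static void my_drv_folio_split(struct folio *head, struct folio *tail)
	{
		/* The tail folio belongs to the same device pagemap as the head. */
		tail->pgmap = head->pgmap;
		/* Carry over whatever per-folio data the driver keeps. */
		folio_set_zone_device_data(tail, folio_zone_device_data(head));
	}

	static const struct dev_pagemap_ops my_drv_pagemap_ops = {
		.page_free	= my_drv_page_free,
		.migrate_to_ram	= my_drv_migrate_to_ram,
		.folio_split	= my_drv_folio_split,
	};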

Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
 include/linux/memremap.h |  7 +++++++
 include/linux/mm.h       |  1 +
 lib/test_hmm.c           | 35 +++++++++++++++++++++++++++++++++++
 mm/huge_memory.c         |  5 +++++
 mm/page_alloc.c          |  1 +
 5 files changed, 49 insertions(+)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 11d586dd8ef1..2091b754f1da 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -100,6 +100,13 @@ struct dev_pagemap_ops {
 	 */
 	int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long pfn,
 			      unsigned long nr_pages, int mf_flags);
+
+	/*
+	 * Used for private (un-addressable) device memory only.
+	 * This callback is used when a folio is split into
+	 * a smaller folio
+	 */
+	void (*folio_split)(struct folio *head, struct folio *tail);
 };
 
 #define PGMAP_ALTMAP_VALID	(1 << 0)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 98a67488b5fe..3d0e91e0a923 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1415,6 +1415,7 @@ static inline struct folio *virt_to_folio(const void *x)
 void __folio_put(struct folio *folio);
 
 void split_page(struct page *page, unsigned int order);
+void prep_compound_page(struct page *page, unsigned int order);
 void folio_copy(struct folio *dst, struct folio *src);
 int folio_mc_copy(struct folio *dst, struct folio *src);
 
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index a81d2f8a0426..18b6a7b061d7 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -1640,10 +1640,45 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
 	return ret;
 }
 
+
+static void dmirror_devmem_folio_split(struct folio *head, struct folio *tail)
+{
+	struct page *rpage = BACKING_PAGE(folio_page(head, 0));
+	struct folio *new_rfolio;
+	struct folio *rfolio;
+	unsigned long offset = 0;
+
+	if (!rpage) {
+		folio_page(tail, 0)->zone_device_data = NULL;
+		return;
+	}
+
+	offset = folio_pfn(tail) - folio_pfn(head);
+	rfolio = page_folio(rpage);
+	new_rfolio = page_folio(folio_page(rfolio, offset));
+
+	folio_page(tail, 0)->zone_device_data = folio_page(new_rfolio, 0);
+
+	if (folio_pfn(tail) - folio_pfn(head) == 1) {
+		if (folio_order(head))
+			prep_compound_page(folio_page(rfolio, 0),
+						folio_order(head));
+		folio_set_count(rfolio, 1);
+	}
+	clear_compound_head(folio_page(new_rfolio, 0));
+	if (folio_order(tail))
+		prep_compound_page(folio_page(new_rfolio, 0),
+						folio_order(tail));
+	folio_set_count(new_rfolio, 1);
+	folio_page(new_rfolio, 0)->mapping = folio_page(rfolio, 0)->mapping;
+	tail->pgmap = head->pgmap;
+}
+
 static const struct dev_pagemap_ops dmirror_devmem_ops = {
 	.page_free	= dmirror_devmem_free,
 	.migrate_to_ram	= dmirror_devmem_fault,
+	.folio_split	= dmirror_devmem_folio_split,
 };
 
 static int dmirror_device_init(struct dmirror_device *mdevice, int id)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 995ac8be5709..518a70d1b58a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3655,6 +3655,11 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
 						MTHP_STAT_NR_ANON, 1);
 			}
 
+			if (folio_is_device_private(origin_folio) &&
+					origin_folio->pgmap->ops->folio_split)
+				origin_folio->pgmap->ops->folio_split(
+					origin_folio, release);
+
 			/*
 			 * Unfreeze refcount first. Additional reference from
 			 * page cache.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 17ea8fb27cbf..563f7e39aa79 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -573,6 +573,7 @@ void prep_compound_page(struct page *page, unsigned int order)
 
 	prep_compound_head(page, order);
 }
+EXPORT_SYMBOL_GPL(prep_compound_page);
 
 static inline void set_buddy_order(struct page *page, unsigned int order)
 {
-- 
2.48.1



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC 08/11] mm/thp: add split during migration support
  2025-03-06  4:42 [RFC 00/11] THP support for zone device pages Balbir Singh
                   ` (6 preceding siblings ...)
  2025-03-06  4:42 ` [RFC 07/11] mm/memremap: Add folio_split support Balbir Singh
@ 2025-03-06  4:42 ` Balbir Singh
  2025-07-08 14:38   ` David Hildenbrand
  2025-03-06  4:42 ` [RFC 09/11] lib/test_hmm: add test case for split pages Balbir Singh
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 38+ messages in thread
From: Balbir Singh @ 2025-03-06  4:42 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: dri-devel, nouveau, Balbir Singh, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, David Hildenbrand,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom

Support splitting pages during THP zone device migration as needed.
The common case is that, after migrate_vma_setup(), the destination
is unable to allocate MIGRATE_PFN_COMPOUND pages during the migrate
phase.

Add a new routine, migrate_vma_split_pages(), to support splitting
pages that have already been isolated. The pages being migrated were
already unmapped and marked for migration during setup. folio_split()
and __split_unmapped_folio() take an additional "isolated" argument so
that these pages are not unmapped and remapped again and the folio is
not unlocked/put.

Since unmap/remap is avoided in these code paths, an extra reference
count is added to the split folio pages; it is dropped in the
finalize phase.
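
In other words, when the destination side cannot provide a compound
page, the already-isolated source folio is split and the single
compound src[] entry is fanned out into per-page entries so that each
base page can be migrated individually. A condensed sketch of what
migrate_vma_split_pages() does to the pfn array (see the full hunk
below):

	/* Rewrite the compound src entry as HPAGE_PMD_NR order-0 entries. */
	src[idx] &= ~MIGRATE_PFN_COMPOUND;
	flags = src[idx] & ((1UL << MIGRATE_PFN_SHIFT) - 1);
	pfn = src[idx] >> MIGRATE_PFN_SHIFT;
	for (i = 1; i < HPAGE_PMD_NR; i++)
		src[idx + i] = migrate_pfn(pfn + i) | flags;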

Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
 include/linux/huge_mm.h | 11 ++++++--
 mm/huge_memory.c        | 53 +++++++++++++++++++++++++-----------
 mm/migrate_device.c     | 60 ++++++++++++++++++++++++++++++++---------
 3 files changed, 94 insertions(+), 30 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index ad0c0ccfcbc2..abb8debfb362 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -341,8 +341,8 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
 		vm_flags_t vm_flags);
 
 bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
-int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
-		unsigned int new_order);
+int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
+		unsigned int new_order, bool isolated);
 int min_order_for_split(struct folio *folio);
 int split_folio_to_list(struct folio *folio, struct list_head *list);
 bool uniform_split_supported(struct folio *folio, unsigned int new_order,
@@ -351,6 +351,13 @@ bool non_uniform_split_supported(struct folio *folio, unsigned int new_order,
 		bool warns);
 int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
 		struct list_head *list);
+
+static inline int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
+		unsigned int new_order)
+{
+	return __split_huge_page_to_list_to_order(page, list, new_order, false);
+}
+
 /*
  * try_folio_split - try to split a @folio at @page using non uniform split.
  * @folio: folio to be split
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 518a70d1b58a..1a6f0e70acee 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3544,7 +3544,7 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
 		struct page *split_at, struct page *lock_at,
 		struct list_head *list, pgoff_t end,
 		struct xa_state *xas, struct address_space *mapping,
-		bool uniform_split)
+		bool uniform_split, bool isolated)
 {
 	struct lruvec *lruvec;
 	struct address_space *swap_cache = NULL;
@@ -3586,6 +3586,7 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
 		int old_order = folio_order(folio);
 		struct folio *release;
 		struct folio *end_folio = folio_next(folio);
+		int extra_count = 1;
 
 		/* order-1 anonymous folio is not supported */
 		if (folio_test_anon(folio) && split_order == 1)
@@ -3629,6 +3630,14 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
 		__split_folio_to_order(folio, old_order, split_order);
 
 after_split:
+		/*
+		 * When a folio is isolated, the split folios will
+		 * not go through unmap/remap, so add the extra
+		 * count here
+		 */
+		if (isolated)
+			extra_count++;
+
 		/*
 		 * Iterate through after-split folios and perform related
 		 * operations. But in buddy allocator like split, the folio
@@ -3665,7 +3674,7 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
 			 * page cache.
 			 */
 			folio_ref_unfreeze(release,
-				1 + ((!folio_test_anon(origin_folio) ||
+				extra_count + ((!folio_test_anon(origin_folio) ||
 				     folio_test_swapcache(origin_folio)) ?
 					     folio_nr_pages(release) : 0));
 
@@ -3676,7 +3685,7 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
 			if (release == origin_folio)
 				continue;
 
-			if (!folio_is_device_private(origin_folio))
+			if (!isolated && !folio_is_device_private(origin_folio))
 				lru_add_page_tail(origin_folio, &release->page,
 							lruvec, list);
 
@@ -3714,6 +3723,12 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
 	if (nr_dropped)
 		shmem_uncharge(mapping->host, nr_dropped);
 
+	/*
+	 * Don't remap and unlock isolated folios
+	 */
+	if (isolated)
+		return ret;
+
 	remap_page(origin_folio, 1 << order,
 			folio_test_anon(origin_folio) ?
 				RMP_USE_SHARED_ZEROPAGE : 0);
@@ -3808,6 +3823,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
  * @lock_at: a page within @folio to be left locked to caller
  * @list: after-split folios will be put on it if non NULL
  * @uniform_split: perform uniform split or not (non-uniform split)
+ * @isolated: The pages are already unmapped
  *
  * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
  * It is in charge of checking whether the split is supported or not and
@@ -3818,7 +3834,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
  */
 static int __folio_split(struct folio *folio, unsigned int new_order,
 		struct page *split_at, struct page *lock_at,
-		struct list_head *list, bool uniform_split)
+		struct list_head *list, bool uniform_split, bool isolated)
 {
 	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
 	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
@@ -3864,14 +3880,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 		 * is taken to serialise against parallel split or collapse
 		 * operations.
 		 */
-		anon_vma = folio_get_anon_vma(folio);
-		if (!anon_vma) {
-			ret = -EBUSY;
-			goto out;
+		if (!isolated) {
+			anon_vma = folio_get_anon_vma(folio);
+			if (!anon_vma) {
+				ret = -EBUSY;
+				goto out;
+			}
+			anon_vma_lock_write(anon_vma);
 		}
 		end = -1;
 		mapping = NULL;
-		anon_vma_lock_write(anon_vma);
 	} else {
 		unsigned int min_order;
 		gfp_t gfp;
@@ -3933,7 +3951,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 		goto out_unlock;
 	}
 
-	unmap_folio(folio);
+	if (!isolated)
+		unmap_folio(folio);
 
 	/* block interrupt reentry in xa_lock and spinlock */
 	local_irq_disable();
@@ -3986,14 +4005,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 
 		ret = __split_unmapped_folio(folio, new_order,
 				split_at, lock_at, list, end, &xas, mapping,
-				uniform_split);
+				uniform_split, isolated);
 	} else {
 		spin_unlock(&ds_queue->split_queue_lock);
 fail:
 		if (mapping)
 			xas_unlock(&xas);
 		local_irq_enable();
-		remap_page(folio, folio_nr_pages(folio), 0);
+		if (!isolated)
+			remap_page(folio, folio_nr_pages(folio), 0);
 		ret = -EAGAIN;
 	}
 
@@ -4059,12 +4079,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
  * Returns -EINVAL when trying to split to an order that is incompatible
  * with the folio. Splitting to order 0 is compatible with all folios.
  */
-int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
-				     unsigned int new_order)
+int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
+				     unsigned int new_order, bool isolated)
 {
 	struct folio *folio = page_folio(page);
 
-	return __folio_split(folio, new_order, &folio->page, page, list, true);
+	return __folio_split(folio, new_order, &folio->page, page, list, true,
+				isolated);
 }
 
 /*
@@ -4093,7 +4114,7 @@ int folio_split(struct folio *folio, unsigned int new_order,
 		struct page *split_at, struct list_head *list)
 {
 	return __folio_split(folio, new_order, split_at, &folio->page, list,
-			false);
+			false, false);
 }
 
 int min_order_for_split(struct folio *folio)
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index f3fff5d705bd..e4510bb86b3c 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -804,6 +804,24 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
 		src[i] &= ~MIGRATE_PFN_MIGRATE;
 	return 0;
 }
+
+static void migrate_vma_split_pages(struct migrate_vma *migrate,
+					unsigned long idx, unsigned long addr,
+					struct folio *folio)
+{
+	unsigned long i;
+	unsigned long pfn;
+	unsigned long flags;
+
+	folio_get(folio);
+	split_huge_pmd_address(migrate->vma, addr, true, folio);
+	__split_huge_page_to_list_to_order(folio_page(folio, 0), NULL, 0, true);
+	migrate->src[idx] &= ~MIGRATE_PFN_COMPOUND;
+	flags = migrate->src[idx] & ((1UL << MIGRATE_PFN_SHIFT) - 1);
+	pfn = migrate->src[idx] >> MIGRATE_PFN_SHIFT;
+	for (i = 1; i < HPAGE_PMD_NR; i++)
+		migrate->src[i+idx] = migrate_pfn(pfn + i) | flags;
+}
 #else /* !CONFIG_ARCH_ENABLE_THP_MIGRATION */
 static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
 					 unsigned long addr,
@@ -813,6 +831,11 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
 {
 	return 0;
 }
+
+static void migrate_vma_split_pages(struct migrate_vma *migrate,
+					unsigned long idx, unsigned long addr,
+					struct folio *folio)
+{}
 #endif
 
 /*
@@ -962,8 +985,9 @@ static void __migrate_device_pages(unsigned long *src_pfns,
 				struct migrate_vma *migrate)
 {
 	struct mmu_notifier_range range;
-	unsigned long i;
+	unsigned long i, j;
 	bool notified = false;
+	unsigned long addr;
 
 	for (i = 0; i < npages; ) {
 		struct page *newpage = migrate_pfn_to_page(dst_pfns[i]);
@@ -1005,12 +1029,16 @@ static void __migrate_device_pages(unsigned long *src_pfns,
 				(!(dst_pfns[i] & MIGRATE_PFN_COMPOUND))) {
 				nr = HPAGE_PMD_NR;
 				src_pfns[i] &= ~MIGRATE_PFN_COMPOUND;
-				src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
-				goto next;
+			} else {
+				nr = 1;
 			}
 
-			migrate_vma_insert_page(migrate, addr, &dst_pfns[i],
-						&src_pfns[i]);
+			for (j = 0; j < nr && i + j < npages; j++) {
+				src_pfns[i+j] |= MIGRATE_PFN_MIGRATE;
+				migrate_vma_insert_page(migrate,
+					addr + j * PAGE_SIZE,
+					&dst_pfns[i+j], &src_pfns[i+j]);
+			}
 			goto next;
 		}
 
@@ -1032,7 +1060,10 @@ static void __migrate_device_pages(unsigned long *src_pfns,
 							 MIGRATE_PFN_COMPOUND);
 					goto next;
 				}
-				src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
+				nr = 1 << folio_order(folio);
+				addr = migrate->start + i * PAGE_SIZE;
+				migrate_vma_split_pages(migrate, i, addr, folio);
+				extra_cnt++;
 			} else if ((src_pfns[i] & MIGRATE_PFN_MIGRATE) &&
 				(dst_pfns[i] & MIGRATE_PFN_COMPOUND) &&
 				!(src_pfns[i] & MIGRATE_PFN_COMPOUND)) {
@@ -1067,12 +1098,17 @@ static void __migrate_device_pages(unsigned long *src_pfns,
 		BUG_ON(folio_test_writeback(folio));
 
 		if (migrate && migrate->fault_page == page)
-			extra_cnt = 1;
-		r = folio_migrate_mapping(mapping, newfolio, folio, extra_cnt);
-		if (r != MIGRATEPAGE_SUCCESS)
-			src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
-		else
-			folio_migrate_flags(newfolio, folio);
+			extra_cnt++;
+		for (j = 0; j < nr && i + j < npages; j++) {
+			folio = page_folio(migrate_pfn_to_page(src_pfns[i+j]));
+			newfolio = page_folio(migrate_pfn_to_page(dst_pfns[i+j]));
+
+			r = folio_migrate_mapping(mapping, newfolio, folio, extra_cnt);
+			if (r != MIGRATEPAGE_SUCCESS)
+				src_pfns[i+j] &= ~MIGRATE_PFN_MIGRATE;
+			else
+				folio_migrate_flags(newfolio, folio);
+		}
 next:
 		i += nr;
 	}
-- 
2.48.1



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC 09/11] lib/test_hmm: add test case for split pages
  2025-03-06  4:42 [RFC 00/11] THP support for zone device pages Balbir Singh
                   ` (7 preceding siblings ...)
  2025-03-06  4:42 ` [RFC 08/11] mm/thp: add split during migration support Balbir Singh
@ 2025-03-06  4:42 ` Balbir Singh
  2025-03-06  4:42 ` [RFC 10/11] selftests/mm/hmm-tests: new tests for zone device THP migration Balbir Singh
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 38+ messages in thread
From: Balbir Singh @ 2025-03-06  4:42 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: dri-devel, nouveau, Balbir Singh, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, David Hildenbrand,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom

Add a new flag, HMM_DMIRROR_FLAG_FAIL_ALLOC, to emulate failure to
allocate a large page. This exercises the code paths that split folios
during migration.
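
The flag is one-shot: the driver clears it after failing the first
large-page allocation. A test arms it through the new HMM_DMIRROR_FLAGS
ioctl right before triggering a migration, as the selftests added in
the next patch do:

	/* Force the next large device-page allocation to fail, then migrate. */
	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_FLAGS, buffer,
			      HMM_DMIRROR_FLAG_FAIL_ALLOC);
	ASSERT_EQ(ret, 0);
	ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
	ASSERT_EQ(ret, 0);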

Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
 lib/test_hmm.c      | 12 +++++++++++-
 lib/test_hmm_uapi.h |  3 +++
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 18b6a7b061d7..36209184c430 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -92,6 +92,7 @@ struct dmirror {
 	struct xarray			pt;
 	struct mmu_interval_notifier	notifier;
 	struct mutex			mutex;
+	__u64			flags;
 };
 
 /*
@@ -699,7 +700,12 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
 		     page_to_pfn(spage)))
 			goto next;
 
-		dpage = dmirror_devmem_alloc_page(dmirror, is_large);
+		if (dmirror->flags & HMM_DMIRROR_FLAG_FAIL_ALLOC) {
+			dmirror->flags &= ~HMM_DMIRROR_FLAG_FAIL_ALLOC;
+			dpage = NULL;
+		} else
+			dpage = dmirror_devmem_alloc_page(dmirror, is_large);
+
 		if (!dpage) {
 			struct folio *folio;
 			unsigned long i;
@@ -1504,6 +1510,10 @@ static long dmirror_fops_unlocked_ioctl(struct file *filp,
 		dmirror_device_remove_chunks(dmirror->mdevice);
 		ret = 0;
 		break;
+	case HMM_DMIRROR_FLAGS:
+		dmirror->flags = cmd.npages;
+		ret = 0;
+		break;
 
 	default:
 		return -EINVAL;
diff --git a/lib/test_hmm_uapi.h b/lib/test_hmm_uapi.h
index 8c818a2cf4f6..f94c6d457338 100644
--- a/lib/test_hmm_uapi.h
+++ b/lib/test_hmm_uapi.h
@@ -37,6 +37,9 @@ struct hmm_dmirror_cmd {
 #define HMM_DMIRROR_EXCLUSIVE		_IOWR('H', 0x05, struct hmm_dmirror_cmd)
 #define HMM_DMIRROR_CHECK_EXCLUSIVE	_IOWR('H', 0x06, struct hmm_dmirror_cmd)
 #define HMM_DMIRROR_RELEASE		_IOWR('H', 0x07, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_FLAGS		_IOWR('H', 0x08, struct hmm_dmirror_cmd)
+
+#define HMM_DMIRROR_FLAG_FAIL_ALLOC	(1ULL << 0)
 
 /*
  * Values returned in hmm_dmirror_cmd.ptr for HMM_DMIRROR_SNAPSHOT.
-- 
2.48.1



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC 10/11] selftests/mm/hmm-tests: new tests for zone device THP migration
  2025-03-06  4:42 [RFC 00/11] THP support for zone device pages Balbir Singh
                   ` (8 preceding siblings ...)
  2025-03-06  4:42 ` [RFC 09/11] lib/test_hmm: add test case for split pages Balbir Singh
@ 2025-03-06  4:42 ` Balbir Singh
  2025-03-06  4:42 ` [RFC 11/11] gpu/drm/nouveau: Add THP migration support Balbir Singh
  2025-03-06 23:08 ` [RFC 00/11] THP support for zone device pages Matthew Brost
  11 siblings, 0 replies; 38+ messages in thread
From: Balbir Singh @ 2025-03-06  4:42 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: dri-devel, nouveau, Balbir Singh, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, David Hildenbrand,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom

Add new tests for migrating anon THP pages, including anon_huge,
anon_huge_zero and error cases involving forced splitting of pages
during migration.

Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
 tools/testing/selftests/mm/hmm-tests.c | 407 +++++++++++++++++++++++++
 1 file changed, 407 insertions(+)

diff --git a/tools/testing/selftests/mm/hmm-tests.c b/tools/testing/selftests/mm/hmm-tests.c
index 141bf63cbe05..b79274190022 100644
--- a/tools/testing/selftests/mm/hmm-tests.c
+++ b/tools/testing/selftests/mm/hmm-tests.c
@@ -2056,4 +2056,411 @@ TEST_F(hmm, hmm_cow_in_device)
 
 	hmm_buffer_free(buffer);
 }
+
+/*
+ * Migrate private anonymous huge empty page.
+ */
+TEST_F(hmm, migrate_anon_huge_empty)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	void *old_ptr;
+	void *map;
+	int *ptr;
+	int ret;
+
+	size = TWOMEG;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = 2 * size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+	memset(buffer->mirror, 0xFF, size);
+
+	buffer->ptr = mmap(NULL, 2 * size,
+			   PROT_READ,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	npages = size >> self->page_shift;
+	map = (void *)ALIGN((uintptr_t)buffer->ptr, size);
+	ret = madvise(map, size, MADV_HUGEPAGE);
+	ASSERT_EQ(ret, 0);
+	old_ptr = buffer->ptr;
+	buffer->ptr = map;
+
+	/* Migrate memory to device. */
+	ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], 0);
+
+	buffer->ptr = old_ptr;
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Migrate private anonymous huge zero page.
+ */
+TEST_F(hmm, migrate_anon_huge_zero)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	void *old_ptr;
+	void *map;
+	int *ptr;
+	int ret;
+	int val;
+
+	size = TWOMEG;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = 2 * size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+	memset(buffer->mirror, 0xFF, size);
+
+	buffer->ptr = mmap(NULL, 2 * size,
+			   PROT_READ,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	npages = size >> self->page_shift;
+	map = (void *)ALIGN((uintptr_t)buffer->ptr, size);
+	ret = madvise(map, size, MADV_HUGEPAGE);
+	ASSERT_EQ(ret, 0);
+	old_ptr = buffer->ptr;
+	buffer->ptr = map;
+
+	/* Initialize a read-only zero huge page. */
+	val = *(int *)buffer->ptr;
+	ASSERT_EQ(val, 0);
+
+	/* Migrate memory to device. */
+	ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], 0);
+
+	/* Fault pages back to system memory and check them. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) {
+		ASSERT_EQ(ptr[i], 0);
+		/* If it asserts once, it probably will 500,000 times */
+		if (ptr[i] != 0)
+			break;
+	}
+
+	buffer->ptr = old_ptr;
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Migrate private anonymous huge page and free.
+ */
+TEST_F(hmm, migrate_anon_huge_free)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	void *old_ptr;
+	void *map;
+	int *ptr;
+	int ret;
+
+	size = TWOMEG;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = 2 * size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+	memset(buffer->mirror, 0xFF, size);
+
+	buffer->ptr = mmap(NULL, 2 * size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	npages = size >> self->page_shift;
+	map = (void *)ALIGN((uintptr_t)buffer->ptr, size);
+	ret = madvise(map, size, MADV_HUGEPAGE);
+	ASSERT_EQ(ret, 0);
+	old_ptr = buffer->ptr;
+	buffer->ptr = map;
+
+	/* Initialize buffer in system memory. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Migrate memory to device. */
+	ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	/* Try freeing it. */
+	ret = madvise(map, size, MADV_FREE);
+	ASSERT_EQ(ret, 0);
+
+	buffer->ptr = old_ptr;
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Migrate private anonymous huge page and fault back to sysmem.
+ */
+TEST_F(hmm, migrate_anon_huge_fault)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	void *old_ptr;
+	void *map;
+	int *ptr;
+	int ret;
+
+	size = TWOMEG;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = 2 * size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+	memset(buffer->mirror, 0xFF, size);
+
+	buffer->ptr = mmap(NULL, 2 * size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	npages = size >> self->page_shift;
+	map = (void *)ALIGN((uintptr_t)buffer->ptr, size);
+	ret = madvise(map, size, MADV_HUGEPAGE);
+	ASSERT_EQ(ret, 0);
+	old_ptr = buffer->ptr;
+	buffer->ptr = map;
+
+	/* Initialize buffer in system memory. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Migrate memory to device. */
+	ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	/* Fault pages back to system memory and check them. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	buffer->ptr = old_ptr;
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Migrate private anonymous huge page with allocation errors.
+ */
+TEST_F(hmm, migrate_anon_huge_err)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	void *old_ptr;
+	void *map;
+	int *ptr;
+	int ret;
+
+	size = TWOMEG;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = 2 * size;
+	buffer->mirror = malloc(2 * size);
+	ASSERT_NE(buffer->mirror, NULL);
+	memset(buffer->mirror, 0xFF, 2 * size);
+
+	old_ptr = mmap(NULL, 2 * size, PROT_READ | PROT_WRITE,
+			MAP_PRIVATE | MAP_ANONYMOUS, buffer->fd, 0);
+	ASSERT_NE(old_ptr, MAP_FAILED);
+
+	npages = size >> self->page_shift;
+	map = (void *)ALIGN((uintptr_t)old_ptr, size);
+	ret = madvise(map, size, MADV_HUGEPAGE);
+	ASSERT_EQ(ret, 0);
+	buffer->ptr = map;
+
+	/* Initialize buffer in system memory. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Migrate memory to device but force a THP allocation error. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_FLAGS, buffer,
+			      HMM_DMIRROR_FLAG_FAIL_ALLOC);
+	ASSERT_EQ(ret, 0);
+	ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i) {
+		ASSERT_EQ(ptr[i], i);
+		if (ptr[i] != i)
+			break;
+	}
+
+	/* Try faulting back a single (PAGE_SIZE) page. */
+	ptr = buffer->ptr;
+	ASSERT_EQ(ptr[2048], 2048);
+
+	/* unmap and remap the region to reset things. */
+	ret = munmap(old_ptr, 2 * size);
+	ASSERT_EQ(ret, 0);
+	old_ptr = mmap(NULL, 2 * size, PROT_READ | PROT_WRITE,
+			MAP_PRIVATE | MAP_ANONYMOUS, buffer->fd, 0);
+	ASSERT_NE(old_ptr, MAP_FAILED);
+	map = (void *)ALIGN((uintptr_t)old_ptr, size);
+	ret = madvise(map, size, MADV_HUGEPAGE);
+	ASSERT_EQ(ret, 0);
+	buffer->ptr = map;
+
+	/* Initialize buffer in system memory. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Migrate THP to device. */
+	ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/*
+	 * Force an allocation error when faulting back a THP resident in the
+	 * device.
+	 */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_FLAGS, buffer,
+			      HMM_DMIRROR_FLAG_FAIL_ALLOC);
+	ASSERT_EQ(ret, 0);
+	ptr = buffer->ptr;
+	ASSERT_EQ(ptr[2048], 2048);
+
+	buffer->ptr = old_ptr;
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Migrate private anonymous huge zero page with allocation errors.
+ */
+TEST_F(hmm, migrate_anon_huge_zero_err)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	void *old_ptr;
+	void *map;
+	int *ptr;
+	int ret;
+
+	size = TWOMEG;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = 2 * size;
+	buffer->mirror = malloc(2 * size);
+	ASSERT_NE(buffer->mirror, NULL);
+	memset(buffer->mirror, 0xFF, 2 * size);
+
+	old_ptr = mmap(NULL, 2 * size, PROT_READ,
+			MAP_PRIVATE | MAP_ANONYMOUS, buffer->fd, 0);
+	ASSERT_NE(old_ptr, MAP_FAILED);
+
+	npages = size >> self->page_shift;
+	map = (void *)ALIGN((uintptr_t)old_ptr, size);
+	ret = madvise(map, size, MADV_HUGEPAGE);
+	ASSERT_EQ(ret, 0);
+	buffer->ptr = map;
+
+	/* Migrate memory to device but force a THP allocation error. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_FLAGS, buffer,
+			      HMM_DMIRROR_FLAG_FAIL_ALLOC);
+	ASSERT_EQ(ret, 0);
+	ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], 0);
+
+	/* Try faulting back a single (PAGE_SIZE) page. */
+	ptr = buffer->ptr;
+	ASSERT_EQ(ptr[2048], 0);
+
+	/* unmap and remap the region to reset things. */
+	ret = munmap(old_ptr, 2 * size);
+	ASSERT_EQ(ret, 0);
+	old_ptr = mmap(NULL, 2 * size, PROT_READ,
+			MAP_PRIVATE | MAP_ANONYMOUS, buffer->fd, 0);
+	ASSERT_NE(old_ptr, MAP_FAILED);
+	map = (void *)ALIGN((uintptr_t)old_ptr, size);
+	ret = madvise(map, size, MADV_HUGEPAGE);
+	ASSERT_EQ(ret, 0);
+	buffer->ptr = map;
+
+	/* Initialize buffer in system memory (zero THP page). */
+	ret = *(int *)buffer->ptr;
+	ASSERT_EQ(ret, 0);
+
+	/* Migrate memory to device but force a THP allocation error. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_FLAGS, buffer,
+			      HMM_DMIRROR_FLAG_FAIL_ALLOC);
+	ASSERT_EQ(ret, 0);
+	ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/* Fault the device memory back and check it. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], 0);
+
+	buffer->ptr = old_ptr;
+	hmm_buffer_free(buffer);
+}
 TEST_HARNESS_MAIN
-- 
2.48.1



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC 11/11] gpu/drm/nouveau: Add THP migration support
  2025-03-06  4:42 [RFC 00/11] THP support for zone device pages Balbir Singh
                   ` (9 preceding siblings ...)
  2025-03-06  4:42 ` [RFC 10/11] selftests/mm/hmm-tests: new tests for zone device THP migration Balbir Singh
@ 2025-03-06  4:42 ` Balbir Singh
  2025-03-06 23:08 ` [RFC 00/11] THP support for zone device pages Matthew Brost
  11 siblings, 0 replies; 38+ messages in thread
From: Balbir Singh @ 2025-03-06  4:42 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: dri-devel, nouveau, Balbir Singh, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, David Hildenbrand,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom

Change the code to add support for MIGRATE_VMA_SELECT_COMPOUND and to
handle page sizes appropriately in the migrate/evict code paths.
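
On the CPU fault path the migrate window and the pfn arrays are now
sized by the order of the faulting device folio, roughly (condensed
from the nouveau_dmem_migrate_to_ram() changes below):

	order = folio_order(page_folio(vmf->page));
	if (order)
		args.flags |= MIGRATE_VMA_SELECT_COMPOUND;
	args.start = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
	args.end = args.start + (PAGE_SIZE << order);
	/* One src/dst slot per base page in the window. */
	args.src = kcalloc(1 << order, sizeof(*args.src), GFP_KERNEL);
	args.dst = kcalloc(1 << order, sizeof(*args.dst), GFP_KERNEL);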

Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
 drivers/gpu/drm/nouveau/nouveau_dmem.c | 244 +++++++++++++++++--------
 drivers/gpu/drm/nouveau/nouveau_svm.c  |   6 +-
 drivers/gpu/drm/nouveau/nouveau_svm.h  |   3 +-
 3 files changed, 176 insertions(+), 77 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 61d0f411ef84..bf3681f52ce0 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -83,9 +83,15 @@ struct nouveau_dmem {
 	struct list_head chunks;
 	struct mutex mutex;
 	struct page *free_pages;
+	struct folio *free_folios;
 	spinlock_t lock;
 };
 
+struct nouveau_dmem_dma_info {
+	dma_addr_t dma_addr;
+	size_t size;
+};
+
 static struct nouveau_dmem_chunk *nouveau_page_to_chunk(struct page *page)
 {
 	return container_of(page_pgmap(page), struct nouveau_dmem_chunk,
@@ -112,10 +118,16 @@ static void nouveau_dmem_page_free(struct page *page)
 {
 	struct nouveau_dmem_chunk *chunk = nouveau_page_to_chunk(page);
 	struct nouveau_dmem *dmem = chunk->drm->dmem;
+	struct folio *folio = page_folio(page);
 
 	spin_lock(&dmem->lock);
-	page->zone_device_data = dmem->free_pages;
-	dmem->free_pages = page;
+	if (folio_order(folio)) {
+		folio_set_zone_device_data(folio, dmem->free_folios);
+		dmem->free_folios = folio;
+	} else {
+		page->zone_device_data = dmem->free_pages;
+		dmem->free_pages = page;
+	}
 
 	WARN_ON(!chunk->callocated);
 	chunk->callocated--;
@@ -139,20 +151,28 @@ static void nouveau_dmem_fence_done(struct nouveau_fence **fence)
 	}
 }
 
-static int nouveau_dmem_copy_one(struct nouveau_drm *drm, struct page *spage,
-				struct page *dpage, dma_addr_t *dma_addr)
+static int nouveau_dmem_copy_folio(struct nouveau_drm *drm,
+				   struct folio *sfolio, struct folio *dfolio,
+				   struct nouveau_dmem_dma_info *dma_info)
 {
 	struct device *dev = drm->dev->dev;
+	struct page *dpage = folio_page(dfolio, 0);
+	struct page *spage = folio_page(sfolio, 0);
 
-	lock_page(dpage);
+	folio_lock(dfolio);
 
-	*dma_addr = dma_map_page(dev, dpage, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
-	if (dma_mapping_error(dev, *dma_addr))
+	dma_info->dma_addr = dma_map_page(dev, dpage, 0, page_size(dpage),
+					DMA_BIDIRECTIONAL);
+	dma_info->size = page_size(dpage);
+	if (dma_mapping_error(dev, dma_info->dma_addr))
 		return -EIO;
 
-	if (drm->dmem->migrate.copy_func(drm, 1, NOUVEAU_APER_HOST, *dma_addr,
-					 NOUVEAU_APER_VRAM, nouveau_dmem_page_addr(spage))) {
-		dma_unmap_page(dev, *dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
+	if (drm->dmem->migrate.copy_func(drm, folio_nr_pages(sfolio),
+					 NOUVEAU_APER_HOST, dma_info->dma_addr,
+					 NOUVEAU_APER_VRAM,
+					 nouveau_dmem_page_addr(spage))) {
+		dma_unmap_page(dev, dma_info->dma_addr, page_size(dpage),
+					DMA_BIDIRECTIONAL);
 		return -EIO;
 	}
 
@@ -165,21 +185,38 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf)
 	struct nouveau_dmem *dmem = drm->dmem;
 	struct nouveau_fence *fence;
 	struct nouveau_svmm *svmm;
-	struct page *spage, *dpage;
-	unsigned long src = 0, dst = 0;
-	dma_addr_t dma_addr = 0;
+	struct page *dpage;
 	vm_fault_t ret = 0;
 	struct migrate_vma args = {
 		.vma		= vmf->vma,
-		.start		= vmf->address,
-		.end		= vmf->address + PAGE_SIZE,
-		.src		= &src,
-		.dst		= &dst,
 		.pgmap_owner	= drm->dev,
 		.fault_page	= vmf->page,
-		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
+		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE |
+				  MIGRATE_VMA_SELECT_COMPOUND,
+		.src = NULL,
+		.dst = NULL,
 	};
-
+	unsigned int order, nr;
+	struct folio *sfolio, *dfolio;
+	struct nouveau_dmem_dma_info dma_info;
+
+	sfolio = page_folio(vmf->page);
+	order = folio_order(sfolio);
+	nr = 1 << order;
+
+	if (order)
+		args.flags |= MIGRATE_VMA_SELECT_COMPOUND;
+
+	args.start = ALIGN_DOWN(vmf->address, (1 << (PAGE_SHIFT + order)));
+	args.vma = vmf->vma;
+	args.end = args.start + (PAGE_SIZE << order);
+	args.src = kcalloc(nr, sizeof(*args.src), GFP_KERNEL);
+	args.dst = kcalloc(nr, sizeof(*args.dst), GFP_KERNEL);
+
+	if (!args.src || !args.dst) {
+		ret = VM_FAULT_OOM;
+		goto err;
+	}
 	/*
 	 * FIXME what we really want is to find some heuristic to migrate more
 	 * than just one page on CPU fault. When such fault happens it is very
@@ -190,20 +227,26 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf)
 	if (!args.cpages)
 		return 0;
 
-	spage = migrate_pfn_to_page(src);
-	if (!spage || !(src & MIGRATE_PFN_MIGRATE))
-		goto done;
-
-	dpage = alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vmf->vma, vmf->address);
-	if (!dpage)
+	if (order)
+		dpage = folio_page(vma_alloc_folio(GFP_HIGHUSER | __GFP_ZERO,
+					order, vmf->vma, vmf->address), 0);
+	else
+		dpage = alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vmf->vma,
+					vmf->address);
+	if (!dpage) {
+		ret = VM_FAULT_OOM;
 		goto done;
+	}
 
-	dst = migrate_pfn(page_to_pfn(dpage));
+	args.dst[0] = migrate_pfn(page_to_pfn(dpage));
+	if (order)
+		args.dst[0] |= MIGRATE_PFN_COMPOUND;
+	dfolio = page_folio(dpage);
 
-	svmm = spage->zone_device_data;
+	svmm = folio_zone_device_data(sfolio);
 	mutex_lock(&svmm->mutex);
 	nouveau_svmm_invalidate(svmm, args.start, args.end);
-	ret = nouveau_dmem_copy_one(drm, spage, dpage, &dma_addr);
+	ret = nouveau_dmem_copy_folio(drm, sfolio, dfolio, &dma_info);
 	mutex_unlock(&svmm->mutex);
 	if (ret) {
 		ret = VM_FAULT_SIGBUS;
@@ -213,19 +256,31 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf)
 	nouveau_fence_new(&fence, dmem->migrate.chan);
 	migrate_vma_pages(&args);
 	nouveau_dmem_fence_done(&fence);
-	dma_unmap_page(drm->dev->dev, dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
+	dma_unmap_page(drm->dev->dev, dma_info.dma_addr, dma_info.size,
+				DMA_BIDIRECTIONAL);
 done:
 	migrate_vma_finalize(&args);
+err:
+	kfree(args.src);
+	kfree(args.dst);
 	return ret;
 }
 
+static void nouveau_dmem_folio_split(struct folio *head, struct folio *tail)
+{
+	tail->pgmap = head->pgmap;
+	folio_set_zone_device_data(tail, folio_zone_device_data(head));
+}
+
 static const struct dev_pagemap_ops nouveau_dmem_pagemap_ops = {
 	.page_free		= nouveau_dmem_page_free,
 	.migrate_to_ram		= nouveau_dmem_migrate_to_ram,
+	.folio_split		= nouveau_dmem_folio_split,
 };
 
 static int
-nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage)
+nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage,
+			 bool is_large)
 {
 	struct nouveau_dmem_chunk *chunk;
 	struct resource *res;
@@ -279,16 +334,21 @@ nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage)
 	pfn_first = chunk->pagemap.range.start >> PAGE_SHIFT;
 	page = pfn_to_page(pfn_first);
 	spin_lock(&drm->dmem->lock);
-	for (i = 0; i < DMEM_CHUNK_NPAGES - 1; ++i, ++page) {
-		page->zone_device_data = drm->dmem->free_pages;
-		drm->dmem->free_pages = page;
+
+	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) || !is_large) {
+		for (i = 0; i < DMEM_CHUNK_NPAGES - 1; ++i, ++page) {
+			page->zone_device_data = drm->dmem->free_pages;
+			drm->dmem->free_pages = page;
+		}
 	}
+
 	*ppage = page;
 	chunk->callocated++;
 	spin_unlock(&drm->dmem->lock);
 
-	NV_INFO(drm, "DMEM: registered %ldMB of device memory\n",
-		DMEM_CHUNK_SIZE >> 20);
+	NV_INFO(drm, "DMEM: registered %ldMB of %sdevice memory %lx %lx\n",
+		DMEM_CHUNK_SIZE >> 20, is_large ? "THP " : "", pfn_first,
+		nouveau_dmem_page_addr(page));
 
 	return 0;
 
@@ -305,27 +365,37 @@ nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage)
 }
 
 static struct page *
-nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
+nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm, bool is_large)
 {
 	struct nouveau_dmem_chunk *chunk;
 	struct page *page = NULL;
+	struct folio *folio = NULL;
 	int ret;
+	unsigned int order = 0;
 
 	spin_lock(&drm->dmem->lock);
-	if (drm->dmem->free_pages) {
+	if (is_large && drm->dmem->free_folios) {
+		folio = drm->dmem->free_folios;
+		drm->dmem->free_folios = folio_zone_device_data(folio);
+		chunk = nouveau_page_to_chunk(page);
+		chunk->callocated++;
+		spin_unlock(&drm->dmem->lock);
+		order = DMEM_CHUNK_NPAGES;
+	} else if (!is_large && drm->dmem->free_pages) {
 		page = drm->dmem->free_pages;
 		drm->dmem->free_pages = page->zone_device_data;
 		chunk = nouveau_page_to_chunk(page);
 		chunk->callocated++;
 		spin_unlock(&drm->dmem->lock);
+		folio = page_folio(page);
 	} else {
 		spin_unlock(&drm->dmem->lock);
-		ret = nouveau_dmem_chunk_alloc(drm, &page);
+		ret = nouveau_dmem_chunk_alloc(drm, &page, is_large);
 		if (ret)
 			return NULL;
 	}
 
-	zone_device_page_init(page);
+	init_zone_device_folio(folio, order);
 	return page;
 }
 
@@ -376,12 +446,12 @@ nouveau_dmem_evict_chunk(struct nouveau_dmem_chunk *chunk)
 {
 	unsigned long i, npages = range_len(&chunk->pagemap.range) >> PAGE_SHIFT;
 	unsigned long *src_pfns, *dst_pfns;
-	dma_addr_t *dma_addrs;
+	struct nouveau_dmem_dma_info *dma_info;
 	struct nouveau_fence *fence;
 
 	src_pfns = kvcalloc(npages, sizeof(*src_pfns), GFP_KERNEL | __GFP_NOFAIL);
 	dst_pfns = kvcalloc(npages, sizeof(*dst_pfns), GFP_KERNEL | __GFP_NOFAIL);
-	dma_addrs = kvcalloc(npages, sizeof(*dma_addrs), GFP_KERNEL | __GFP_NOFAIL);
+	dma_info = kvcalloc(npages, sizeof(*dma_info), GFP_KERNEL | __GFP_NOFAIL);
 
 	migrate_device_range(src_pfns, chunk->pagemap.range.start >> PAGE_SHIFT,
 			npages);
@@ -389,17 +459,28 @@ nouveau_dmem_evict_chunk(struct nouveau_dmem_chunk *chunk)
 	for (i = 0; i < npages; i++) {
 		if (src_pfns[i] & MIGRATE_PFN_MIGRATE) {
 			struct page *dpage;
+			struct folio *folio = page_folio(
+				migrate_pfn_to_page(src_pfns[i]));
+			unsigned int order = folio_order(folio);
+
+			if (src_pfns[i] & MIGRATE_PFN_COMPOUND) {
+				dpage = folio_page(
+						folio_alloc(
+						GFP_HIGHUSER_MOVABLE, order), 0);
+			} else {
+				/*
+				 * _GFP_NOFAIL because the GPU is going away and there
+				 * is nothing sensible we can do if we can't copy the
+				 * data back.
+				 */
+				dpage = alloc_page(GFP_HIGHUSER | __GFP_NOFAIL);
+			}
 
-			/*
-			 * _GFP_NOFAIL because the GPU is going away and there
-			 * is nothing sensible we can do if we can't copy the
-			 * data back.
-			 */
-			dpage = alloc_page(GFP_HIGHUSER | __GFP_NOFAIL);
 			dst_pfns[i] = migrate_pfn(page_to_pfn(dpage));
-			nouveau_dmem_copy_one(chunk->drm,
-					migrate_pfn_to_page(src_pfns[i]), dpage,
-					&dma_addrs[i]);
+			nouveau_dmem_copy_folio(chunk->drm,
+				page_folio(migrate_pfn_to_page(src_pfns[i])),
+				page_folio(dpage),
+				&dma_info[i]);
 		}
 	}
 
@@ -410,8 +491,9 @@ nouveau_dmem_evict_chunk(struct nouveau_dmem_chunk *chunk)
 	kvfree(src_pfns);
 	kvfree(dst_pfns);
 	for (i = 0; i < npages; i++)
-		dma_unmap_page(chunk->drm->dev->dev, dma_addrs[i], PAGE_SIZE, DMA_BIDIRECTIONAL);
-	kvfree(dma_addrs);
+		dma_unmap_page(chunk->drm->dev->dev, dma_info[i].dma_addr,
+				dma_info[i].size, DMA_BIDIRECTIONAL);
+	kvfree(dma_info);
 }
 
 void
@@ -615,31 +697,35 @@ nouveau_dmem_init(struct nouveau_drm *drm)
 
 static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
 		struct nouveau_svmm *svmm, unsigned long src,
-		dma_addr_t *dma_addr, u64 *pfn)
+		struct nouveau_dmem_dma_info *dma_info, u64 *pfn)
 {
 	struct device *dev = drm->dev->dev;
 	struct page *dpage, *spage;
 	unsigned long paddr;
+	bool is_large = false;
 
 	spage = migrate_pfn_to_page(src);
 	if (!(src & MIGRATE_PFN_MIGRATE))
 		goto out;
 
-	dpage = nouveau_dmem_page_alloc_locked(drm);
+	is_large = src & MIGRATE_PFN_COMPOUND;
+	dpage = nouveau_dmem_page_alloc_locked(drm, is_large);
 	if (!dpage)
 		goto out;
 
 	paddr = nouveau_dmem_page_addr(dpage);
 	if (spage) {
-		*dma_addr = dma_map_page(dev, spage, 0, page_size(spage),
+		dma_info->dma_addr = dma_map_page(dev, spage, 0, page_size(spage),
 					 DMA_BIDIRECTIONAL);
-		if (dma_mapping_error(dev, *dma_addr))
+		dma_info->size = page_size(spage);
+		if (dma_mapping_error(dev, dma_info->dma_addr))
 			goto out_free_page;
-		if (drm->dmem->migrate.copy_func(drm, 1,
-			NOUVEAU_APER_VRAM, paddr, NOUVEAU_APER_HOST, *dma_addr))
+		if (drm->dmem->migrate.copy_func(drm, folio_nr_pages(page_folio(spage)),
+			NOUVEAU_APER_VRAM, paddr, NOUVEAU_APER_HOST,
+			dma_info->dma_addr))
 			goto out_dma_unmap;
 	} else {
-		*dma_addr = DMA_MAPPING_ERROR;
+		dma_info->dma_addr = DMA_MAPPING_ERROR;
 		if (drm->dmem->migrate.clear_func(drm, page_size(dpage),
 			NOUVEAU_APER_VRAM, paddr))
 			goto out_free_page;
@@ -653,7 +739,7 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
 	return migrate_pfn(page_to_pfn(dpage));
 
 out_dma_unmap:
-	dma_unmap_page(dev, *dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
+	dma_unmap_page(dev, dma_info->dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
 out_free_page:
 	nouveau_dmem_page_free_locked(drm, dpage);
 out:
@@ -663,27 +749,33 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
 
 static void nouveau_dmem_migrate_chunk(struct nouveau_drm *drm,
 		struct nouveau_svmm *svmm, struct migrate_vma *args,
-		dma_addr_t *dma_addrs, u64 *pfns)
+		struct nouveau_dmem_dma_info *dma_info, u64 *pfns)
 {
 	struct nouveau_fence *fence;
 	unsigned long addr = args->start, nr_dma = 0, i;
+	unsigned long order = 0;
 
-	for (i = 0; addr < args->end; i++) {
+	for (i = 0; addr < args->end; ) {
+		struct folio *folio;
+
+		folio = page_folio(migrate_pfn_to_page(args->dst[i]));
+		order = folio_order(folio);
 		args->dst[i] = nouveau_dmem_migrate_copy_one(drm, svmm,
-				args->src[i], dma_addrs + nr_dma, pfns + i);
-		if (!dma_mapping_error(drm->dev->dev, dma_addrs[nr_dma]))
+				args->src[i], dma_info + nr_dma, pfns + i);
+		if (!dma_mapping_error(drm->dev->dev, dma_info[nr_dma].dma_addr))
 			nr_dma++;
-		addr += PAGE_SIZE;
+		i += 1 << order;
+		addr += (1 << order) * PAGE_SIZE;
 	}
 
 	nouveau_fence_new(&fence, drm->dmem->migrate.chan);
 	migrate_vma_pages(args);
 	nouveau_dmem_fence_done(&fence);
-	nouveau_pfns_map(svmm, args->vma->vm_mm, args->start, pfns, i);
+	nouveau_pfns_map(svmm, args->vma->vm_mm, args->start, pfns, i, order);
 
 	while (nr_dma--) {
-		dma_unmap_page(drm->dev->dev, dma_addrs[nr_dma], PAGE_SIZE,
-				DMA_BIDIRECTIONAL);
+		dma_unmap_page(drm->dev->dev, dma_info[nr_dma].dma_addr,
+				dma_info[nr_dma].size, DMA_BIDIRECTIONAL);
 	}
 	migrate_vma_finalize(args);
 }
@@ -697,20 +789,24 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm,
 {
 	unsigned long npages = (end - start) >> PAGE_SHIFT;
 	unsigned long max = min(SG_MAX_SINGLE_ALLOC, npages);
-	dma_addr_t *dma_addrs;
 	struct migrate_vma args = {
 		.vma		= vma,
 		.start		= start,
 		.pgmap_owner	= drm->dev,
-		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
+		.flags		= MIGRATE_VMA_SELECT_SYSTEM
+				  | MIGRATE_VMA_SELECT_COMPOUND,
 	};
 	unsigned long i;
 	u64 *pfns;
 	int ret = -ENOMEM;
+	struct nouveau_dmem_dma_info *dma_info;
 
 	if (drm->dmem == NULL)
 		return -ENODEV;
 
+	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+		max = max(HPAGE_PMD_NR, max);
+
 	args.src = kcalloc(max, sizeof(*args.src), GFP_KERNEL);
 	if (!args.src)
 		goto out;
@@ -718,8 +814,8 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm,
 	if (!args.dst)
 		goto out_free_src;
 
-	dma_addrs = kmalloc_array(max, sizeof(*dma_addrs), GFP_KERNEL);
-	if (!dma_addrs)
+	dma_info = kmalloc_array(max, sizeof(*dma_info), GFP_KERNEL);
+	if (!dma_info)
 		goto out_free_dst;
 
 	pfns = nouveau_pfns_alloc(max);
@@ -737,7 +833,7 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm,
 			goto out_free_pfns;
 
 		if (args.cpages)
-			nouveau_dmem_migrate_chunk(drm, svmm, &args, dma_addrs,
+			nouveau_dmem_migrate_chunk(drm, svmm, &args, dma_info,
 						   pfns);
 		args.start = args.end;
 	}
@@ -746,7 +842,7 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm,
 out_free_pfns:
 	nouveau_pfns_free(pfns);
 out_free_dma:
-	kfree(dma_addrs);
+	kfree(dma_info);
 out_free_dst:
 	kfree(args.dst);
 out_free_src:
diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
index 1fed638b9eba..0693179d0a7d 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -920,12 +920,14 @@ nouveau_pfns_free(u64 *pfns)
 
 void
 nouveau_pfns_map(struct nouveau_svmm *svmm, struct mm_struct *mm,
-		 unsigned long addr, u64 *pfns, unsigned long npages)
+		 unsigned long addr, u64 *pfns, unsigned long npages,
+		 unsigned int page_shift)
 {
 	struct nouveau_pfnmap_args *args = nouveau_pfns_to_args(pfns);
 
 	args->p.addr = addr;
-	args->p.size = npages << PAGE_SHIFT;
+	args->p.size = npages << page_shift;
+	args->p.page = page_shift;
 
 	mutex_lock(&svmm->mutex);
 
diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.h b/drivers/gpu/drm/nouveau/nouveau_svm.h
index e7d63d7f0c2d..3fd78662f17e 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.h
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.h
@@ -33,7 +33,8 @@ void nouveau_svmm_invalidate(struct nouveau_svmm *svmm, u64 start, u64 limit);
 u64 *nouveau_pfns_alloc(unsigned long npages);
 void nouveau_pfns_free(u64 *pfns);
 void nouveau_pfns_map(struct nouveau_svmm *svmm, struct mm_struct *mm,
-		      unsigned long addr, u64 *pfns, unsigned long npages);
+		      unsigned long addr, u64 *pfns, unsigned long npages,
+		      unsigned int page_shift);
 #else /* IS_ENABLED(CONFIG_DRM_NOUVEAU_SVM) */
 static inline void nouveau_svm_init(struct nouveau_drm *drm) {}
 static inline void nouveau_svm_fini(struct nouveau_drm *drm) {}
-- 
2.48.1




* Re: [RFC 07/11] mm/memremap: Add folio_split support
  2025-03-06  4:42 ` [RFC 07/11] mm/memremap: Add folio_split support Balbir Singh
@ 2025-03-06  8:16   ` Mika Penttilä
  2025-03-06 21:42     ` Balbir Singh
  2025-03-06 22:36   ` Alistair Popple
  2025-07-08 14:31   ` David Hildenbrand
  2 siblings, 1 reply; 38+ messages in thread
From: Mika Penttilä @ 2025-03-06  8:16 UTC (permalink / raw)
  To: Balbir Singh, linux-mm, akpm
  Cc: dri-devel, nouveau, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
	David Hildenbrand, Barry Song, Baolin Wang, Ryan Roberts,
	Matthew Wilcox, Peter Xu, Zi Yan, Kefeng Wang, Jane Chu,
	Alistair Popple, Donet Tom

Hi,

On 3/6/25 06:42, Balbir Singh wrote:
> When a zone device page is split (via a huge pmd folio split), the
> driver callback for folio_split is invoked to let the device driver
> know that the folio has been split into a smaller order.
>
> The HMM test driver has been updated to handle the split. Since the
> test driver uses backing pages (used to create a mirror device), it
> requires a mechanism for reorganizing the backing pages into pages of
> the right order. This is supported by exporting prep_compound_page().
>
> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> ---
>  include/linux/memremap.h |  7 +++++++
>  include/linux/mm.h       |  1 +
>  lib/test_hmm.c           | 35 +++++++++++++++++++++++++++++++++++
>  mm/huge_memory.c         |  5 +++++
>  mm/page_alloc.c          |  1 +
>  5 files changed, 49 insertions(+)
>
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 11d586dd8ef1..2091b754f1da 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -100,6 +100,13 @@ struct dev_pagemap_ops {
>  	 */
>  	int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long pfn,
>  			      unsigned long nr_pages, int mf_flags);
> +
> +	/*
> +	 * Used for private (un-addressable) device memory only.
> +	 * This callback is used when a folio is split into
> +	 * a smaller folio
> +	 */
> +	void (*folio_split)(struct folio *head, struct folio *tail);
>  };
>  
>  #define PGMAP_ALTMAP_VALID	(1 << 0)
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 98a67488b5fe..3d0e91e0a923 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1415,6 +1415,7 @@ static inline struct folio *virt_to_folio(const void *x)
>  void __folio_put(struct folio *folio);
>  
>  void split_page(struct page *page, unsigned int order);
> +void prep_compound_page(struct page *page, unsigned int order);
>  void folio_copy(struct folio *dst, struct folio *src);
>  int folio_mc_copy(struct folio *dst, struct folio *src);
>  
> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
> index a81d2f8a0426..18b6a7b061d7 100644
> --- a/lib/test_hmm.c
> +++ b/lib/test_hmm.c
> @@ -1640,10 +1640,45 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>  	return ret;
>  }
>  
> +
> +static void dmirror_devmem_folio_split(struct folio *head, struct folio *tail)
> +{
> +	struct page *rpage = BACKING_PAGE(folio_page(head, 0));
> +	struct folio *new_rfolio;
> +	struct folio *rfolio;
> +	unsigned long offset = 0;
> +
> +	if (!rpage) {
> +		folio_page(tail, 0)->zone_device_data = NULL;
> +		return;
> +	}
> +
> +	offset = folio_pfn(tail) - folio_pfn(head);
> +	rfolio = page_folio(rpage);
> +	new_rfolio = page_folio(folio_page(rfolio, offset));
> +
> +	folio_page(tail, 0)->zone_device_data = folio_page(new_rfolio, 0);
> +

> +	if (folio_pfn(tail) - folio_pfn(head) == 1) {
> +		if (folio_order(head))
> +			prep_compound_page(folio_page(rfolio, 0),
> +						folio_order(head));
> +		folio_set_count(rfolio, 1);
> +	}

I think this might need at least a comment. Also, isn't folio_order(head)
always 0 here, since tail and head are already split folios when the pfn
difference == 1?
If the intention is to adjust the backing folio's head page to the new
order, shouldn't there also be a clear_compound_head() for the backing
head page when it is split to order zero?


> +	clear_compound_head(folio_page(new_rfolio, 0));
> +	if (folio_order(tail))
> +		prep_compound_page(folio_page(new_rfolio, 0),
> +						folio_order(tail));
> +	folio_set_count(new_rfolio, 1);
> +	folio_page(new_rfolio, 0)->mapping = folio_page(rfolio, 0)->mapping;
> +	tail->pgmap = head->pgmap;
> +}
> +
>  static const struct dev_pagemap_ops dmirror_devmem_ops = {
>  	.page_free	= dmirror_devmem_free,
>  	.migrate_to_ram	= dmirror_devmem_fault,
>  	.page_free	= dmirror_devmem_free,
> +	.folio_split	= dmirror_devmem_folio_split,
>  };
>  
>  static int dmirror_device_init(struct dmirror_device *mdevice, int id)
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 995ac8be5709..518a70d1b58a 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3655,6 +3655,11 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
>  						MTHP_STAT_NR_ANON, 1);
>  			}
>  
> +			if (folio_is_device_private(origin_folio) &&
> +					origin_folio->pgmap->ops->folio_split)
> +				origin_folio->pgmap->ops->folio_split(
> +					origin_folio, release);
> +
>  			/*
>  			 * Unfreeze refcount first. Additional reference from
>  			 * page cache.
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 17ea8fb27cbf..563f7e39aa79 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -573,6 +573,7 @@ void prep_compound_page(struct page *page, unsigned int order)
>  
>  	prep_compound_head(page, order);
>  }
> +EXPORT_SYMBOL_GPL(prep_compound_page);
>  
>  static inline void set_buddy_order(struct page *page, unsigned int order)
>  {

--Mika





* Re: [RFC 04/11] mm/migrate_device: THP migration of zone device pages
  2025-03-06  4:42 ` [RFC 04/11] mm/migrate_device: THP migration of zone device pages Balbir Singh
@ 2025-03-06  9:24   ` Mika Penttilä
  2025-03-06 21:35     ` Balbir Singh
  0 siblings, 1 reply; 38+ messages in thread
From: Mika Penttilä @ 2025-03-06  9:24 UTC (permalink / raw)
  To: Balbir Singh, linux-mm, akpm
  Cc: dri-devel, nouveau, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
	David Hildenbrand, Barry Song, Baolin Wang, Ryan Roberts,
	Matthew Wilcox, Peter Xu, Zi Yan, Kefeng Wang, Jane Chu,
	Alistair Popple, Donet Tom

Hi,

On 3/6/25 06:42, Balbir Singh wrote:
...

>  
>  			/*
>  			 * The only time there is no vma is when called from
> @@ -728,15 +1000,47 @@ static void __migrate_device_pages(unsigned long *src_pfns,
>  					migrate->pgmap_owner);
>  				mmu_notifier_invalidate_range_start(&range);
>  			}
> -			migrate_vma_insert_page(migrate, addr, newpage,
> +
> +			if ((src_pfns[i] & MIGRATE_PFN_COMPOUND) &&
> +				(!(dst_pfns[i] & MIGRATE_PFN_COMPOUND))) {
> +				nr = HPAGE_PMD_NR;
> +				src_pfns[i] &= ~MIGRATE_PFN_COMPOUND;
> +				src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
> +				goto next;
> +			}
> +
> +			migrate_vma_insert_page(migrate, addr, &dst_pfns[i],
>  						&src_pfns[i]);
> -			continue;
> +			goto next;
>  		}
>  
>  		newfolio = page_folio(newpage);
>  		folio = page_folio(page);
>  		mapping = folio_mapping(folio);
>  
> +		/*
> +		 * If THP migration is enabled, check if both src and dst
> +		 * can migrate large pages
> +		 */
> +		if (thp_migration_supported()) {
> +			if ((src_pfns[i] & MIGRATE_PFN_MIGRATE) &&
> +				(src_pfns[i] & MIGRATE_PFN_COMPOUND) &&
> +				!(dst_pfns[i] & MIGRATE_PFN_COMPOUND)) {
> +
> +				if (!migrate) {
> +					src_pfns[i] &= ~(MIGRATE_PFN_MIGRATE |
> +							 MIGRATE_PFN_COMPOUND);
> +					goto next;
> +				}
> +				src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;

This looks strange as is but patch 08 changes this to split and then
migrate.


> +			} else if ((src_pfns[i] & MIGRATE_PFN_MIGRATE) &&
> +				(dst_pfns[i] & MIGRATE_PFN_COMPOUND) &&
> +				!(src_pfns[i] & MIGRATE_PFN_COMPOUND)) {
> +				src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;

Should there be a goto next; or similar here also, since we are not
migrating this src?


> +			}
> +		}
> +
> +
>  		if (folio_is_device_private(newfolio) ||
>  		    folio_is_device_coherent(newfolio)) {
>  			if (mapping) {
> @@ -749,7 +1053,7 @@ static void __migrate_device_pages(unsigned long *src_pfns,
>  				if (!folio_test_anon(folio) ||
>  				    !folio_free_swap(folio)) {
>  					src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
> -					continue;
> +					goto next;
>  				}
>  			}
>  		} else if (folio_is_zone_device(newfolio)) {
> @@ -757,7 +1061,7 @@ static void __migrate_device_pages(unsigned long *src_pfns,
>  			 * Other types of ZONE_DEVICE page are not supported.
>  			 */
>  			src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
> -			continue;
> +			goto next;
>  		}
>  
>  		BUG_ON(folio_test_writeback(folio));
> @@ -769,6 +1073,8 @@ static void __migrate_device_pages(unsigned long *src_pfns,
>  			src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
>  		else
>  			folio_migrate_flags(newfolio, folio);
> +next:
> +		i += nr;
>  	}
>  
>  	if (notified)
> @@ -899,24 +1205,40 @@ EXPORT_SYMBOL(migrate_vma_finalize);
>  int migrate_device_range(unsigned long *src_pfns, unsigned long start,
>  			unsigned long npages)
>  {
> -	unsigned long i, pfn;
> +	unsigned long i, j, pfn;
>  
> -	for (pfn = start, i = 0; i < npages; pfn++, i++) {
> -		struct folio *folio;
> +	i = 0;
> +	pfn = start;
> +	while (i < npages) {
> +		struct page *page = pfn_to_page(pfn);
> +		struct folio *folio = page_folio(page);
> +		unsigned int nr = 1;
>  
>  		folio = folio_get_nontail_page(pfn_to_page(pfn));
>  		if (!folio) {
>  			src_pfns[i] = 0;
> -			continue;
> +			goto next;
>  		}
>  
>  		if (!folio_trylock(folio)) {
>  			src_pfns[i] = 0;
>  			folio_put(folio);
> -			continue;
> +			goto next;
>  		}
>  
>  		src_pfns[i] = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
> +		nr = folio_nr_pages(folio);
> +		if (nr > 1) {
> +			src_pfns[i] |= MIGRATE_PFN_COMPOUND;
> +			for (j = 1; j < nr; j++)
> +				src_pfns[i+j] = 0;
> +			i += j;
> +			pfn += j;
> +			continue;
> +		}
> +next:
> +		i++;
> +		pfn++;
>  	}
>  
>  	migrate_device_unmap(src_pfns, npages, NULL);

--Mika





* Re: [RFC 04/11] mm/migrate_device: THP migration of zone device pages
  2025-03-06  9:24   ` Mika Penttilä
@ 2025-03-06 21:35     ` Balbir Singh
  0 siblings, 0 replies; 38+ messages in thread
From: Balbir Singh @ 2025-03-06 21:35 UTC (permalink / raw)
  To: Mika Penttilä, linux-mm, akpm
  Cc: dri-devel, nouveau, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
	David Hildenbrand, Barry Song, Baolin Wang, Ryan Roberts,
	Matthew Wilcox, Peter Xu, Zi Yan, Kefeng Wang, Jane Chu,
	Alistair Popple, Donet Tom

On 3/6/25 20:24, Mika Penttilä wrote:
> Hi,
> 
> On 3/6/25 06:42, Balbir Singh wrote:
> ...
> 
>>  
>>  			/*
>>  			 * The only time there is no vma is when called from
>> @@ -728,15 +1000,47 @@ static void __migrate_device_pages(unsigned long *src_pfns,
>>  					migrate->pgmap_owner);
>>  				mmu_notifier_invalidate_range_start(&range);
>>  			}
>> -			migrate_vma_insert_page(migrate, addr, newpage,
>> +
>> +			if ((src_pfns[i] & MIGRATE_PFN_COMPOUND) &&
>> +				(!(dst_pfns[i] & MIGRATE_PFN_COMPOUND))) {
>> +				nr = HPAGE_PMD_NR;
>> +				src_pfns[i] &= ~MIGRATE_PFN_COMPOUND;
>> +				src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
>> +				goto next;
>> +			}
>> +
>> +			migrate_vma_insert_page(migrate, addr, &dst_pfns[i],
>>  						&src_pfns[i]);
>> -			continue;
>> +			goto next;
>>  		}
>>  
>>  		newfolio = page_folio(newpage);
>>  		folio = page_folio(page);
>>  		mapping = folio_mapping(folio);
>>  
>> +		/*
>> +		 * If THP migration is enabled, check if both src and dst
>> +		 * can migrate large pages
>> +		 */
>> +		if (thp_migration_supported()) {
>> +			if ((src_pfns[i] & MIGRATE_PFN_MIGRATE) &&
>> +				(src_pfns[i] & MIGRATE_PFN_COMPOUND) &&
>> +				!(dst_pfns[i] & MIGRATE_PFN_COMPOUND)) {
>> +
>> +				if (!migrate) {
>> +					src_pfns[i] &= ~(MIGRATE_PFN_MIGRATE |
>> +							 MIGRATE_PFN_COMPOUND);
>> +					goto next;
>> +				}
>> +				src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
> 
> This looks strange as is but patch 08 changes this to split and then
> migrate.

Yes, at patch 4/11 in the series, split migration is not supported. That
support requires additional changes and is added later in the series.

> 
> 
>> +			} else if ((src_pfns[i] & MIGRATE_PFN_MIGRATE) &&
>> +				(dst_pfns[i] & MIGRATE_PFN_COMPOUND) &&
>> +				!(src_pfns[i] & MIGRATE_PFN_COMPOUND)) {
>> +				src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
> 
> Should there be a goto next; or similar here also, since we are not
> migrating this src?
> 

Yes, will do. Generally it just falls through, but the additional checks below
are not needed.
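
Roughly this (untested, the same hunk as above with the goto added):

			} else if ((src_pfns[i] & MIGRATE_PFN_MIGRATE) &&
				(dst_pfns[i] & MIGRATE_PFN_COMPOUND) &&
				!(src_pfns[i] & MIGRATE_PFN_COMPOUND)) {
				src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
				goto next;
			}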

Thanks for the review!
Balbir



* Re: [RFC 07/11] mm/memremap: Add folio_split support
  2025-03-06  8:16   ` Mika Penttilä
@ 2025-03-06 21:42     ` Balbir Singh
  0 siblings, 0 replies; 38+ messages in thread
From: Balbir Singh @ 2025-03-06 21:42 UTC (permalink / raw)
  To: Mika Penttilä, linux-mm, akpm
  Cc: dri-devel, nouveau, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
	David Hildenbrand, Barry Song, Baolin Wang, Ryan Roberts,
	Matthew Wilcox, Peter Xu, Zi Yan, Kefeng Wang, Jane Chu,
	Alistair Popple, Donet Tom

On 3/6/25 19:16, Mika Penttilä wrote:
> Hi,
> 
> On 3/6/25 06:42, Balbir Singh wrote:
>> When a zone device page is split (via a huge pmd folio split), the
>> driver callback for folio_split is invoked to let the device driver
>> know that the folio has been split into a smaller order.
>>
>> The HMM test driver has been updated to handle the split. Since the
>> test driver uses backing pages (used to create a mirror device), it
>> requires a mechanism for reorganizing the backing pages into pages of
>> the right order. This is supported by exporting prep_compound_page().
>>
>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>> ---
>>  include/linux/memremap.h |  7 +++++++
>>  include/linux/mm.h       |  1 +
>>  lib/test_hmm.c           | 35 +++++++++++++++++++++++++++++++++++
>>  mm/huge_memory.c         |  5 +++++
>>  mm/page_alloc.c          |  1 +
>>  5 files changed, 49 insertions(+)
>>
>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>> index 11d586dd8ef1..2091b754f1da 100644
>> --- a/include/linux/memremap.h
>> +++ b/include/linux/memremap.h
>> @@ -100,6 +100,13 @@ struct dev_pagemap_ops {
>>  	 */
>>  	int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long pfn,
>>  			      unsigned long nr_pages, int mf_flags);
>> +
>> +	/*
>> +	 * Used for private (un-addressable) device memory only.
>> +	 * This callback is used when a folio is split into
>> +	 * a smaller folio
>> +	 */
>> +	void (*folio_split)(struct folio *head, struct folio *tail);
>>  };
>>  
>>  #define PGMAP_ALTMAP_VALID	(1 << 0)
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 98a67488b5fe..3d0e91e0a923 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1415,6 +1415,7 @@ static inline struct folio *virt_to_folio(const void *x)
>>  void __folio_put(struct folio *folio);
>>  
>>  void split_page(struct page *page, unsigned int order);
>> +void prep_compound_page(struct page *page, unsigned int order);
>>  void folio_copy(struct folio *dst, struct folio *src);
>>  int folio_mc_copy(struct folio *dst, struct folio *src);
>>  
>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>> index a81d2f8a0426..18b6a7b061d7 100644
>> --- a/lib/test_hmm.c
>> +++ b/lib/test_hmm.c
>> @@ -1640,10 +1640,45 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>>  	return ret;
>>  }
>>  
>> +
>> +static void dmirror_devmem_folio_split(struct folio *head, struct folio *tail)
>> +{
>> +	struct page *rpage = BACKING_PAGE(folio_page(head, 0));
>> +	struct folio *new_rfolio;
>> +	struct folio *rfolio;
>> +	unsigned long offset = 0;
>> +
>> +	if (!rpage) {
>> +		folio_page(tail, 0)->zone_device_data = NULL;
>> +		return;
>> +	}
>> +
>> +	offset = folio_pfn(tail) - folio_pfn(head);
>> +	rfolio = page_folio(rpage);
>> +	new_rfolio = page_folio(folio_page(rfolio, offset));
>> +
>> +	folio_page(tail, 0)->zone_device_data = folio_page(new_rfolio, 0);
>> +
> 
>> +	if (folio_pfn(tail) - folio_pfn(head) == 1) {
>> +		if (folio_order(head))
>> +			prep_compound_page(folio_page(rfolio, 0),
>> +						folio_order(head));
>> +		folio_set_count(rfolio, 1);
>> +	}
> 
> I think this might need at least a comment. Also, isn't folio_order(head)
> always 0 here, since tail and head are already split folios when the pfn
> difference == 1?
> If the intention is to adjust the backing folio's head page to the new
> order, shouldn't there also be a clear_compound_head() for the backing
> head page when it is split to order zero?
> 

I'll add some comments to clarify, and I'll add clear_compound_head().

Thanks,
Balbir



* Re: [RFC 07/11] mm/memremap: Add folio_split support
  2025-03-06  4:42 ` [RFC 07/11] mm/memremap: Add folio_split support Balbir Singh
  2025-03-06  8:16   ` Mika Penttilä
@ 2025-03-06 22:36   ` Alistair Popple
  2025-07-08 14:31   ` David Hildenbrand
  2 siblings, 0 replies; 38+ messages in thread
From: Alistair Popple @ 2025-03-06 22:36 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, akpm, dri-devel, nouveau, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, David Hildenbrand,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Donet Tom

On Thu, Mar 06, 2025 at 03:42:35PM +1100, Balbir Singh wrote:
> When a zone device page is split (via a huge pmd folio split), the
> driver callback for folio_split is invoked to let the device driver
> know that the folio has been split into a smaller order.
>
> The HMM test driver has been updated to handle the split. Since the
> test driver uses backing pages (used to create a mirror device), it
> requires a mechanism for reorganizing the backing pages into pages of
> the right order. This is supported by exporting prep_compound_page().
> 
> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> ---
>  include/linux/memremap.h |  7 +++++++
>  include/linux/mm.h       |  1 +
>  lib/test_hmm.c           | 35 +++++++++++++++++++++++++++++++++++
>  mm/huge_memory.c         |  5 +++++
>  mm/page_alloc.c          |  1 +
>  5 files changed, 49 insertions(+)
> 
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 11d586dd8ef1..2091b754f1da 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -100,6 +100,13 @@ struct dev_pagemap_ops {
>  	 */
>  	int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long pfn,
>  			      unsigned long nr_pages, int mf_flags);
> +
> +	/*
> +	 * Used for private (un-addressable) device memory only.
> +	 * This callback is used when a folio is split into
> +	 * a smaller folio
> +	 */
> +	void (*folio_split)(struct folio *head, struct folio *tail);
>  };
>  
>  #define PGMAP_ALTMAP_VALID	(1 << 0)
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 98a67488b5fe..3d0e91e0a923 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1415,6 +1415,7 @@ static inline struct folio *virt_to_folio(const void *x)
>  void __folio_put(struct folio *folio);
>  
>  void split_page(struct page *page, unsigned int order);
> +void prep_compound_page(struct page *page, unsigned int order);
>  void folio_copy(struct folio *dst, struct folio *src);
>  int folio_mc_copy(struct folio *dst, struct folio *src);
>  
> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
> index a81d2f8a0426..18b6a7b061d7 100644
> --- a/lib/test_hmm.c
> +++ b/lib/test_hmm.c
> @@ -1640,10 +1640,45 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>  	return ret;
>  }
>  
> +
> +static void dmirror_devmem_folio_split(struct folio *head, struct folio *tail)
> +{
> +	struct page *rpage = BACKING_PAGE(folio_page(head, 0));
> +	struct folio *new_rfolio;
> +	struct folio *rfolio;
> +	unsigned long offset = 0;
> +
> +	if (!rpage) {
> +		folio_page(tail, 0)->zone_device_data = NULL;
> +		return;
> +	}
> +
> +	offset = folio_pfn(tail) - folio_pfn(head);
> +	rfolio = page_folio(rpage);
> +	new_rfolio = page_folio(folio_page(rfolio, offset));
> +
> +	folio_page(tail, 0)->zone_device_data = folio_page(new_rfolio, 0);
> +
> +	if (folio_pfn(tail) - folio_pfn(head) == 1) {
> +		if (folio_order(head))
> +			prep_compound_page(folio_page(rfolio, 0),
> +						folio_order(head));
> +		folio_set_count(rfolio, 1);
> +	}
> +	clear_compound_head(folio_page(new_rfolio, 0));
> +	if (folio_order(tail))
> +		prep_compound_page(folio_page(new_rfolio, 0),
> +						folio_order(tail));
> +	folio_set_count(new_rfolio, 1);
> +	folio_page(new_rfolio, 0)->mapping = folio_page(rfolio, 0)->mapping;
> +	tail->pgmap = head->pgmap;

It seems what you're trying to do here is split a higher order driver allocated
folio (rfolio) into smaller ones, right? Rather than open coding that, I think
it would be beneficial to create an exported helper function to do this. Pretty
sure at least DAX could also use this, as it has a need to split device folios
into smaller ones as well. That would also address my comment below.
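
Something along these lines is what I'm imagining (completely untested
sketch; the name is made up and the backing pages are assumed to be
physically contiguous):

void split_device_folio_pages(struct page *head, unsigned int old_order,
			      unsigned int new_order)
{
	unsigned int i;

	for (i = 0; i < (1U << old_order); i += 1U << new_order) {
		struct page *piece = head + i;

		if (i)
			/* former tail page becomes the head of a new piece */
			clear_compound_head(piece);
		if (new_order)
			prep_compound_page(piece, new_order);
		set_page_count(piece, 1);
	}
	/*
	 * Note: when new_order == 0, the old head page would also need its
	 * compound state cleared.
	 */
}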

> +}
> +
>  static const struct dev_pagemap_ops dmirror_devmem_ops = {
>  	.page_free	= dmirror_devmem_free,
>  	.migrate_to_ram	= dmirror_devmem_fault,
>  	.page_free	= dmirror_devmem_free,
> +	.folio_split	= dmirror_devmem_folio_split,
>  };
>  
>  static int dmirror_device_init(struct dmirror_device *mdevice, int id)
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 995ac8be5709..518a70d1b58a 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3655,6 +3655,11 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
>  						MTHP_STAT_NR_ANON, 1);
>  			}
>  
> +			if (folio_is_device_private(origin_folio) &&
> +					origin_folio->pgmap->ops->folio_split)
> +				origin_folio->pgmap->ops->folio_split(
> +					origin_folio, release);
> +
>  			/*
>  			 * Unfreeze refcount first. Additional reference from
>  			 * page cache.
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 17ea8fb27cbf..563f7e39aa79 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -573,6 +573,7 @@ void prep_compound_page(struct page *page, unsigned int order)
>  
>  	prep_compound_head(page, order);
>  }
> +EXPORT_SYMBOL_GPL(prep_compound_page);

I think this is probably too low-level a function to export, especially just
for a test.

>  static inline void set_buddy_order(struct page *page, unsigned int order)
>  {
> -- 
> 2.48.1
> 



* Re: [RFC 01/11] mm/zone_device: support large zone device private folios
  2025-03-06  4:42 ` [RFC 01/11] mm/zone_device: support large zone device private folios Balbir Singh
@ 2025-03-06 23:02   ` Alistair Popple
  2025-07-08 13:37   ` David Hildenbrand
  1 sibling, 0 replies; 38+ messages in thread
From: Alistair Popple @ 2025-03-06 23:02 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, akpm, dri-devel, nouveau, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, David Hildenbrand,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Donet Tom

On Thu, Mar 06, 2025 at 03:42:29PM +1100, Balbir Singh wrote:
> Add routines to support allocation of large order zone device folios,
> along with helper functions to check whether a folio is device private
> and to set its zone device data.
> 
> When large folios are used, the existing page_free() callback in
> pgmap is called when the folio is freed; this is true for both
> PAGE_SIZE and higher order pages.
> 
> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> ---
>  include/linux/memremap.h | 22 +++++++++++++++++-
>  mm/memremap.c            | 50 +++++++++++++++++++++++++++++-----------
>  2 files changed, 58 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 4aa151914eab..11d586dd8ef1 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -169,6 +169,18 @@ static inline bool folio_is_device_private(const struct folio *folio)
>  	return is_device_private_page(&folio->page);
>  }
>  
> +static inline void *folio_zone_device_data(const struct folio *folio)
> +{
> +	VM_BUG_ON_FOLIO(!folio_is_device_private(folio), folio);
> +	return folio->page.zone_device_data;
> +}
> +
> +static inline void folio_set_zone_device_data(struct folio *folio, void *data)
> +{
> +	VM_BUG_ON_FOLIO(!folio_is_device_private(folio), folio);
> +	folio->page.zone_device_data = data;
> +}
> +
>  static inline bool is_pci_p2pdma_page(const struct page *page)
>  {
>  	return IS_ENABLED(CONFIG_PCI_P2PDMA) &&
> @@ -199,7 +211,7 @@ static inline bool folio_is_fsdax(const struct folio *folio)
>  }
>  
>  #ifdef CONFIG_ZONE_DEVICE
> -void zone_device_page_init(struct page *page);
> +void init_zone_device_folio(struct folio *folio, unsigned int order);
>  void *memremap_pages(struct dev_pagemap *pgmap, int nid);
>  void memunmap_pages(struct dev_pagemap *pgmap);
>  void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
> @@ -209,6 +221,14 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
>  bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);
>  
>  unsigned long memremap_compat_align(void);
> +
> +static inline void zone_device_page_init(struct page *page)
> +{
> +	struct folio *folio = page_folio(page);
> +
> +	init_zone_device_folio(folio, 0);
> +}
> +
>  #else
>  static inline void *devm_memremap_pages(struct device *dev,
>  		struct dev_pagemap *pgmap)
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 2aebc1b192da..7d98d0a4c0cd 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -459,20 +459,21 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
>  void free_zone_device_folio(struct folio *folio)
>  {
>  	struct dev_pagemap *pgmap = folio->pgmap;
> +	unsigned int nr = folio_nr_pages(folio);
> +	int i;
> +	bool anon = folio_test_anon(folio);
> +	struct page *page = folio_page(folio, 0);
>  
>  	if (WARN_ON_ONCE(!pgmap))
>  		return;
>  
>  	mem_cgroup_uncharge(folio);
>  
> -	/*
> -	 * Note: we don't expect anonymous compound pages yet. Once supported
> -	 * and we could PTE-map them similar to THP, we'd have to clear
> -	 * PG_anon_exclusive on all tail pages.
> -	 */
> -	if (folio_test_anon(folio)) {
> -		VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> -		__ClearPageAnonExclusive(folio_page(folio, 0));
> +	WARN_ON_ONCE(folio_test_large(folio) && !anon);
> +
> +	for (i = 0; i < nr; i++) {
> +		if (anon)
> +			__ClearPageAnonExclusive(folio_page(folio, i));
>  	}
>  
>  	/*
> @@ -496,10 +497,19 @@ void free_zone_device_folio(struct folio *folio)
>  
>  	switch (pgmap->type) {
>  	case MEMORY_DEVICE_PRIVATE:
> +		if (folio_test_large(folio)) {
> +			folio_unqueue_deferred_split(folio);
> +
> +			percpu_ref_put_many(&folio->pgmap->ref, nr - 1);
> +		}
> +		pgmap->ops->page_free(page);

It seems to me this would be a good time to finally convert page_free() to
folio_free() and have it take a folio instead.
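
Roughly (untested; the op name is assumed):

	/* in struct dev_pagemap_ops, replacing page_free(): */
	void (*folio_free)(struct folio *folio);

	/* and the call site here would become, for any order: */
	pgmap->ops->folio_free(folio);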

> +		put_dev_pagemap(pgmap);
> +		page->mapping = NULL;
> +		break;
>  	case MEMORY_DEVICE_COHERENT:
>  		if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->page_free))
>  			break;
> -		pgmap->ops->page_free(folio_page(folio, 0));
> +		pgmap->ops->page_free(page);

Ditto.

>  		put_dev_pagemap(pgmap);
>  		break;
>  
> @@ -523,14 +533,28 @@ void free_zone_device_folio(struct folio *folio)
>  	}
>  }
>  
> -void zone_device_page_init(struct page *page)
> +void init_zone_device_folio(struct folio *folio, unsigned int order)

Is this supposed to deal with taking any arbitrary zone_device folio and
initialising it to the given order? If so I'd expect to see things like
clear_compound_head() to cope with initialising a previously higher order folio
to a 0-order folio.

>  {
> +	struct page *page = folio_page(folio, 0);
> +
> +	VM_BUG_ON(order > MAX_ORDER_NR_PAGES);
> +
> +	WARN_ON_ONCE(order && order != HPAGE_PMD_ORDER);
> +
>  	/*
>  	 * Drivers shouldn't be allocating pages after calling
>  	 * memunmap_pages().
>  	 */
> -	WARN_ON_ONCE(!percpu_ref_tryget_live(&page_pgmap(page)->ref));
> -	set_page_count(page, 1);
> +	WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << order));
> +	folio_set_count(folio, 1);
>  	lock_page(page);
> +
> +	/*
> +	 * Only PMD level migration is supported for THP migration
> +	 */
> +	if (order > 1) {
> +		prep_compound_page(page, order);
> +		folio_set_large_rmappable(folio);
> +	}
>  }
> -EXPORT_SYMBOL_GPL(zone_device_page_init);
> +EXPORT_SYMBOL_GPL(init_zone_device_folio);
> -- 
> 2.48.1
> 



* Re: [RFC 00/11] THP support for zone device pages
  2025-03-06  4:42 [RFC 00/11] THP support for zone device pages Balbir Singh
                   ` (10 preceding siblings ...)
  2025-03-06  4:42 ` [RFC 11/11] gpu/drm/nouveau: Add THP migration support Balbir Singh
@ 2025-03-06 23:08 ` Matthew Brost
  2025-03-06 23:20   ` Balbir Singh
  11 siblings, 1 reply; 38+ messages in thread
From: Matthew Brost @ 2025-03-06 23:08 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, akpm, dri-devel, nouveau, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, David Hildenbrand,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom

On Thu, Mar 06, 2025 at 03:42:28PM +1100, Balbir Singh wrote:

This is an exciting series to see. As of today, we have just merged this
series into the DRM subsystem / Xe [2], which adds very basic SVM
support. One of the performance bottlenecks we quickly identified was
the lack of THP for device pages—I believe our profiling showed that 96%
of the time spent on 2M page GPU faults was within the migrate_vma_*
functions. Presumably, this will help significantly.

We will likely attempt to pull this code into GPU SVM / Xe fairly soon.
I believe we will encounter a conflict since [2] includes these patches
[3] [4], but we should be able to resolve that. These patches might make
it into the 6.15 PR — TBD but I can get back to you on that.

I have one question—does this series contain all the required core MM
changes for us to give it a try? That is, do I need to include any other
code from the list to test this out?

Matt

[2] https://patchwork.freedesktop.org/series/137870/
[3] https://patchwork.freedesktop.org/patch/641207/?series=137870&rev=8
[4] https://patchwork.freedesktop.org/patch/641214/?series=137870&rev=8

> This patch series adds support for THP migration of zone device pages.
> To do so, the patches implement support for folio zone device pages
> by adding support for setting up larger order pages.
> 
> These patches build on the earlier posts by Ralph Campbell [1]
> 
> Two new flags are added in vma_migration to select and mark compound pages.
> migrate_vma_setup(), migrate_vma_pages() and migrate_vma_finalize()
> support migration of these pages when MIGRATE_VMA_SELECT_COMPOUND
> is passed in as arguments.
> 
> The series also adds zone device awareness to (m)THP pages along
> with fault handling of large zone device private pages. page vma walk
> and the rmap code is also zone device aware. Support has also been
> added for folios that might need to be split in the middle
> of migration (when the src and dst do not agree on
> MIGRATE_PFN_COMPOUND), that occurs when src side of the migration can
> migrate large pages, but the destination has not been able to allocate
> large pages. The code supported and used folio_split() when migrating
> THP pages, this is used when MIGRATE_VMA_SELECT_COMPOUND is not passed
> as an argument to migrate_vma_setup().
> 
> The test infrastructure lib/test_hmm.c has been enhanced to support THP
> migration. A new ioctl to emulate failure of large page allocations has
> been added to test the folio split code path. hmm-tests.c has new test
> cases for huge page migration and to test the folio split path.
> 
> The nouveau dmem code has been enhanced to use the new THP migration
> capability.
> 
> mTHP support:
> 
> The patches hard code, HPAGE_PMD_NR in a few places, but the code has
> been kept generic to support various order sizes. With additional
> refactoring of the code support of different order sizes should be
> possible.
> 
> References:
> [1] https://lore.kernel.org/linux-mm/20201106005147.20113-1-rcampbell@nvidia.com/
> 
> These patches are built on top of mm-everything-2025-03-04-05-51
> 
> Cc: Karol Herbst <kherbst@redhat.com>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
> Cc: Jane Chu <jane.chu@oracle.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Donet Tom <donettom@linux.ibm.com>
> 
> Balbir Singh (11):
>   mm/zone_device: support large zone device private folios
>   mm/migrate_device: flags for selecting device private THP pages
>   mm/thp: zone_device awareness in THP handling code
>   mm/migrate_device: THP migration of zone device pages
>   mm/memory/fault: Add support for zone device THP fault handling
>   lib/test_hmm: test cases and support for zone device private THP
>   mm/memremap: Add folio_split support
>   mm/thp: add split during migration support
>   lib/test_hmm: add test case for split pages
>   selftests/mm/hmm-tests: new tests for zone device THP migration
>   gpu/drm/nouveau: Add THP migration support
> 
>  drivers/gpu/drm/nouveau/nouveau_dmem.c | 244 +++++++++----
>  drivers/gpu/drm/nouveau/nouveau_svm.c  |   6 +-
>  drivers/gpu/drm/nouveau/nouveau_svm.h  |   3 +-
>  include/linux/huge_mm.h                |  18 +-
>  include/linux/memremap.h               |  29 +-
>  include/linux/migrate.h                |   2 +
>  include/linux/mm.h                     |   1 +
>  lib/test_hmm.c                         | 387 ++++++++++++++++----
>  lib/test_hmm_uapi.h                    |   3 +
>  mm/huge_memory.c                       | 242 +++++++++---
>  mm/memory.c                            |   6 +-
>  mm/memremap.c                          |  50 ++-
>  mm/migrate.c                           |   2 +
>  mm/migrate_device.c                    | 488 +++++++++++++++++++++----
>  mm/page_alloc.c                        |   1 +
>  mm/page_vma_mapped.c                   |  10 +
>  mm/rmap.c                              |  19 +-
>  tools/testing/selftests/mm/hmm-tests.c | 407 +++++++++++++++++++++
>  18 files changed, 1630 insertions(+), 288 deletions(-)
> 
> -- 
> 2.48.1
> 



* Re: [RFC 00/11] THP support for zone device pages
  2025-03-06 23:08 ` [RFC 00/11] THP support for zone device pages Matthew Brost
@ 2025-03-06 23:20   ` Balbir Singh
  2025-07-04 13:52     ` Francois Dugast
  0 siblings, 1 reply; 38+ messages in thread
From: Balbir Singh @ 2025-03-06 23:20 UTC (permalink / raw)
  To: Matthew Brost
  Cc: linux-mm, akpm, dri-devel, nouveau, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, David Hildenbrand,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom

On 3/7/25 10:08, Matthew Brost wrote:
> On Thu, Mar 06, 2025 at 03:42:28PM +1100, Balbir Singh wrote:
> 
> This is an exciting series to see. As of today, we have just merged this
> series into the DRM subsystem / Xe [2], which adds very basic SVM
> support. One of the performance bottlenecks we quickly identified was
> the lack of THP for device pages—I believe our profiling showed that 96%
> of the time spent on 2M page GPU faults was within the migrate_vma_*
> functions. Presumably, this will help significantly.
> 
> We will likely attempt to pull this code into GPU SVM / Xe fairly soon.
> I believe we will encounter a conflict since [2] includes these patches
> [3] [4], but we should be able to resolve that. These patches might make
> it into the 6.15 PR — TBD but I can get back to you on that.
> 
> I have one question—does this series contain all the required core MM
> changes for us to give it a try? That is, do I need to include any other
> code from the list to test this out?
> 

Thank you. The patches are built on top of mm-everything-2025-03-04-05-51, which
includes changes by Alistair to fix fs/dax reference counting and changes by
Zi Yan (folio split changes). The series builds on top of those, but the
patches are not dependent on the folio split changes, IIRC.

Please do report bugs/issues that you come across.

Balbir



* Re: [RFC 00/11] THP support for zone device pages
  2025-03-06 23:20   ` Balbir Singh
@ 2025-07-04 13:52     ` Francois Dugast
  2025-07-04 16:17       ` Zi Yan
  0 siblings, 1 reply; 38+ messages in thread
From: Francois Dugast @ 2025-07-04 13:52 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Matthew Brost, linux-mm, akpm, dri-devel, nouveau, Karol Herbst,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, David Hildenbrand,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom

Hi,

On Fri, Mar 07, 2025 at 10:20:30AM +1100, Balbir Singh wrote:
> On 3/7/25 10:08, Matthew Brost wrote:
> > On Thu, Mar 06, 2025 at 03:42:28PM +1100, Balbir Singh wrote:
> > 
> > This is an exciting series to see. As of today, we have just merged this
> > series into the DRM subsystem / Xe [2], which adds very basic SVM
> > support. One of the performance bottlenecks we quickly identified was
> > the lack of THP for device pages—I believe our profiling showed that 96%
> > of the time spent on 2M page GPU faults was within the migrate_vma_*
> > functions. Presumably, this will help significantly.
> > 
> > We will likely attempt to pull this code into GPU SVM / Xe fairly soon.
> > I believe we will encounter a conflict since [2] includes these patches
> > [3] [4], but we should be able to resolve that. These patches might make
> > it into the 6.15 PR — TBD but I can get back to you on that.
> > 
> > I have one question—does this series contain all the required core MM
> > changes for us to give it a try? That is, do I need to include any other
> > code from the list to test this out?
> > 
> 
> Thank you. The patches are built on top of mm-everything-2025-03-04-05-51, which
> includes changes by Alistair to fix fs/dax reference counting and changes by
> Zi Yan (folio split changes). The series builds on top of those, but the
> patches are not dependent on the folio split changes, IIRC.
> 
> Please do report bugs/issues that you come across.
> 
> Balbir
> 

Thanks for sharing. We used your series to experimentally enable THP migration
of zone device pages in DRM GPU SVM and Xe. Here is an early draft [1] rebased
on 6.16-rc1. It is still hacky but I wanted to share some findings/questions:
- Is there an updated version of your series?
- In hmm_vma_walk_pmd(), when the device private pages are owned by the caller,
  is it necessary to fault them in, or could execution just continue in order
  to handle the PMD?
- When __drm_gpusvm_migrate_to_ram() is called from the CPU fault handler, the
  faulting folio is already locked when reaching migrate_vma_collect_huge_pmd(),
  so folio_trylock() fails, which leads to skipping collection. As this case
  seems valid, collection should probably be skipped only when the locked folio
  is not the faulting folio (see the sketch after this list).
- Something seems odd with the folio ref count in folio_migrate_mapping(): it
  does not match the expected count in our runs. This has not been root-caused
  yet.
- The expectation for HMM internals is a speedup, as it should find one single
  THP versus 512 device pages previously. However, we noticed slowdowns, for
  example in hmm_range_fault(), which increase drm_gpusvm_range_get_pages()
  execution time. We are investigating why this happens, as it could be caused
  by leftover hacks in my patches, but is the above expectation correct? Have
  you also observed such side effects?
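
For the folio_trylock() point, here is a minimal sketch of what I mean
(untested, with illustrative variable names; it assumes the faulting page is
identified via migrate->fault_page, as the existing non-compound collection
path already does):

	struct folio *fault_folio = migrate->fault_page ?
				    page_folio(migrate->fault_page) : NULL;

	if (folio != fault_folio && !folio_trylock(folio)) {
		/* locked by someone else: do not collect this entry */
		src_pfns[i] = 0;
		goto next;
	}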

Thanks,
Francois

[1] https://gitlab.freedesktop.org/ifdu/kernel/-/tree/svm-thp-device



* Re: [RFC 00/11] THP support for zone device pages
  2025-07-04 13:52     ` Francois Dugast
@ 2025-07-04 16:17       ` Zi Yan
  2025-07-06  1:25         ` Balbir Singh
  0 siblings, 1 reply; 38+ messages in thread
From: Zi Yan @ 2025-07-04 16:17 UTC (permalink / raw)
  To: Francois Dugast
  Cc: Balbir Singh, Matthew Brost, linux-mm, akpm, dri-devel, nouveau,
	Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Jérôme Glisse, Shuah Khan,
	David Hildenbrand, Barry Song, Baolin Wang, Ryan Roberts,
	Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu, Alistair Popple,
	Donet Tom

On 4 Jul 2025, at 9:52, Francois Dugast wrote:

> Hi,
>
> On Fri, Mar 07, 2025 at 10:20:30AM +1100, Balbir Singh wrote:
>> On 3/7/25 10:08, Matthew Brost wrote:
>>> On Thu, Mar 06, 2025 at 03:42:28PM +1100, Balbir Singh wrote:
>>>
>>> This is an exciting series to see. As of today, we have just merged this
>>> series into the DRM subsystem / Xe [2], which adds very basic SVM
>>> support. One of the performance bottlenecks we quickly identified was
>>> the lack of THP for device pages—I believe our profiling showed that 96%
>>> of the time spent on 2M page GPU faults was within the migrate_vma_*
>>> functions. Presumably, this will help significantly.
>>>
>>> We will likely attempt to pull this code into GPU SVM / Xe fairly soon.
>>> I believe we will encounter a conflict since [2] includes these patches
>>> [3] [4], but we should be able to resolve that. These patches might make
>>> it into the 6.15 PR — TBD but I can get back to you on that.
>>>
>>> I have one question—does this series contain all the required core MM
>>> changes for us to give it a try? That is, do I need to include any other
>>> code from the list to test this out?
>>>
>>
>> Thank you. The patches are built on top of mm-everything-2025-03-04-05-51, which
>> includes changes by Alistair to fix fs/dax reference counting and changes by
>> Zi Yan (folio split changes). The series builds on top of those, but the
>> patches are not dependent on the folio split changes, IIRC.
>>
>> Please do report bugs/issues that you come across.
>>
>> Balbir
>>
>
> Thanks for sharing. We used your series to experimentally enable THP migration
> of zone device pages in DRM GPU SVM and Xe. Here is an early draft [1] rebased
> on 6.16-rc1. It is still hacky but I wanted to share some findings/questions:
> - Is there an updated version of your series?

Here is a new one: https://lore.kernel.org/linux-mm/20250703233511.2028395-1-balbirs@nvidia.com/.

--
Best Regards,
Yan, Zi



* Re: [RFC 00/11] THP support for zone device pages
  2025-07-04 16:17       ` Zi Yan
@ 2025-07-06  1:25         ` Balbir Singh
  2025-07-06 16:34           ` Francois Dugast
  0 siblings, 1 reply; 38+ messages in thread
From: Balbir Singh @ 2025-07-06  1:25 UTC (permalink / raw)
  To: Zi Yan, Francois Dugast
  Cc: Matthew Brost, linux-mm, akpm, dri-devel, nouveau, Karol Herbst,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, David Hildenbrand,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom

On 7/5/25 02:17, Zi Yan wrote:
> On 4 Jul 2025, at 9:52, Francois Dugast wrote:
> 
>> Hi,
>>
>> On Fri, Mar 07, 2025 at 10:20:30AM +1100, Balbir Singh wrote:
>>> On 3/7/25 10:08, Matthew Brost wrote:
>>>> On Thu, Mar 06, 2025 at 03:42:28PM +1100, Balbir Singh wrote:
>>>>
>>>> This is an exciting series to see. As of today, we have just merged this
>>>> series into the DRM subsystem / Xe [2], which adds very basic SVM
>>>> support. One of the performance bottlenecks we quickly identified was
>>>> the lack of THP for device pages—I believe our profiling showed that 96%
>>>> of the time spent on 2M page GPU faults was within the migrate_vma_*
>>>> functions. Presumably, this will help significantly.
>>>>
>>>> We will likely attempt to pull this code into GPU SVM / Xe fairly soon.
>>>> I believe we will encounter a conflict since [2] includes these patches
>>>> [3] [4], but we should be able to resolve that. These patches might make
>>>> it into the 6.15 PR — TBD but I can get back to you on that.
>>>>
>>>> I have one question—does this series contain all the required core MM
>>>> changes for us to give it a try? That is, do I need to include any other
>>>> code from the list to test this out?
>>>>
>>>
>>> Thank you. The patches are built on top of mm-everything-2025-03-04-05-51, which
>>> includes changes by Alistair to fix fs/dax reference counting and changes by
>>> Zi Yan (folio split changes). The series builds on top of those, but the
>>> patches are not dependent on the folio split changes, IIRC.
>>>
>>> Please do report bugs/issues that you come across.
>>>
>>> Balbir
>>>
>>
>> Thanks for sharing. We used your series to experimentally enable THP migration
>> of zone device pages in DRM GPU SVM and Xe. Here is an early draft [1] rebased
>> on 6.16-rc1. It is still hacky but I wanted to share some findings/questions:
>> - Is there an updated version of your series?
> 
> Here is a new one: https://lore.kernel.org/linux-mm/20250703233511.2028395-1-balbirs@nvidia.com/.

Thanks Zi!

Could you please try out the latest patches Francois?

Balbir



* Re: [RFC 00/11] THP support for zone device pages
  2025-07-06  1:25         ` Balbir Singh
@ 2025-07-06 16:34           ` Francois Dugast
  0 siblings, 0 replies; 38+ messages in thread
From: Francois Dugast @ 2025-07-06 16:34 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Zi Yan, Matthew Brost, linux-mm, akpm, dri-devel, nouveau,
	Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Jérôme Glisse, Shuah Khan,
	David Hildenbrand, Barry Song, Baolin Wang, Ryan Roberts,
	Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu, Alistair Popple,
	Donet Tom

On Sun, Jul 06, 2025 at 11:25:32AM +1000, Balbir Singh wrote:
> On 7/5/25 02:17, Zi Yan wrote:
> > On 4 Jul 2025, at 9:52, Francois Dugast wrote:
> > 
> >> Hi,
> >>
> >> On Fri, Mar 07, 2025 at 10:20:30AM +1100, Balbir Singh wrote:
> >>> On 3/7/25 10:08, Matthew Brost wrote:
> >>>> On Thu, Mar 06, 2025 at 03:42:28PM +1100, Balbir Singh wrote:
> >>>>
> >>>> This is an exciting series to see. As of today, we have just merged this
> >>>> series into the DRM subsystem / Xe [2], which adds very basic SVM
> >>>> support. One of the performance bottlenecks we quickly identified was
> >>>> the lack of THP for device pages—I believe our profiling showed that 96%
> >>>> of the time spent on 2M page GPU faults was within the migrate_vma_*
> >>>> functions. Presumably, this will help significantly.
> >>>>
> >>>> We will likely attempt to pull this code into GPU SVM / Xe fairly soon.
> >>>> I believe we will encounter a conflict since [2] includes these patches
> >>>> [3] [4], but we should be able to resolve that. These patches might make
> >>>> it into the 6.15 PR — TBD but I can get back to you on that.
> >>>>
> >>>> I have one question—does this series contain all the required core MM
> >>>> changes for us to give it a try? That is, do I need to include any other
> >>>> code from the list to test this out?
> >>>>
> >>>
> >>> Thank you. The patches are built on top of mm-everything-2025-03-04-05-51, which
> >>> includes changes by Alistair to fix fs/dax reference counting and changes by
> >>> Zi Yan (folio split changes). The series builds on top of those, but the
> >>> patches are not dependent on the folio split changes, IIRC.
> >>>
> >>> Please do report bugs/issues that you come across.
> >>>
> >>> Balbir
> >>>
> >>
> >> Thanks for sharing. We used your series to experimentally enable THP migration
> >> of zone device pages in DRM GPU SVM and Xe. Here is an early draft [1] rebased
> >> on 6.16-rc1. It is still hacky but I wanted to share some findings/questions:
> >> - Is there an updated version of your series?
> > 
> > Here is a new one: https://lore.kernel.org/linux-mm/20250703233511.2028395-1-balbirs@nvidia.com/.
> 
> Thanks Zi!
> 
> Could you please try out the latest patches Francois?

Sure! Let me rebase and share results.

Francois



* Re: [RFC 01/11] mm/zone_device: support large zone device private folios
  2025-03-06  4:42 ` [RFC 01/11] mm/zone_device: support large zone device private folios Balbir Singh
  2025-03-06 23:02   ` Alistair Popple
@ 2025-07-08 13:37   ` David Hildenbrand
  2025-07-09  5:25     ` Balbir Singh
  1 sibling, 1 reply; 38+ messages in thread
From: David Hildenbrand @ 2025-07-08 13:37 UTC (permalink / raw)
  To: Balbir Singh, linux-mm, akpm
  Cc: dri-devel, nouveau, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom

On 06.03.25 05:42, Balbir Singh wrote:
> Add routines to support allocation of large order zone device folios
> and helper functions for zone device folios, to check if a folio is
> device private and helpers for setting zone device data.
> 
> When large folios are used, the existing page_free() callback in
> pgmap is called when the folio is freed, this is true for both
> PAGE_SIZE and higher order pages.
> 
> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> ---
>   include/linux/memremap.h | 22 +++++++++++++++++-
>   mm/memremap.c            | 50 +++++++++++++++++++++++++++++-----------
>   2 files changed, 58 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 4aa151914eab..11d586dd8ef1 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -169,6 +169,18 @@ static inline bool folio_is_device_private(const struct folio *folio)
>   	return is_device_private_page(&folio->page);
>   }
>   
> +static inline void *folio_zone_device_data(const struct folio *folio)
> +{
> +	VM_BUG_ON_FOLIO(!folio_is_device_private(folio), folio);
> +	return folio->page.zone_device_data;
> +}

Not used.

> +
> +static inline void folio_set_zone_device_data(struct folio *folio, void *data)
> +{
> +	VM_BUG_ON_FOLIO(!folio_is_device_private(folio), folio);
> +	folio->page.zone_device_data = data;
> +}
> +

Not used.

Move both into the patch where they are actually used.

>   static inline bool is_pci_p2pdma_page(const struct page *page)
>   {
>   	return IS_ENABLED(CONFIG_PCI_P2PDMA) &&
> @@ -199,7 +211,7 @@ static inline bool folio_is_fsdax(const struct folio *folio)
>   }
>   
>   #ifdef CONFIG_ZONE_DEVICE
> -void zone_device_page_init(struct page *page);
> +void init_zone_device_folio(struct folio *folio, unsigned int order);
>   void *memremap_pages(struct dev_pagemap *pgmap, int nid);
>   void memunmap_pages(struct dev_pagemap *pgmap);
>   void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
> @@ -209,6 +221,14 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
>   bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);
>   
>   unsigned long memremap_compat_align(void);
> +
> +static inline void zone_device_page_init(struct page *page)
> +{
> +	struct folio *folio = page_folio(page);
> +
> +	init_zone_device_folio(folio, 0);
> +}
> +
>   #else
>   static inline void *devm_memremap_pages(struct device *dev,
>   		struct dev_pagemap *pgmap)
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 2aebc1b192da..7d98d0a4c0cd 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -459,20 +459,21 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
>   void free_zone_device_folio(struct folio *folio)
>   {
>   	struct dev_pagemap *pgmap = folio->pgmap;
> +	unsigned int nr = folio_nr_pages(folio);
> +	int i;
> +	bool anon = folio_test_anon(folio);

You can easily get rid of this (see below).

> +	struct page *page = folio_page(folio, 0);

Please inline folio_page(folio, 0) below instead.

>   
>   	if (WARN_ON_ONCE(!pgmap))
>   		return;
>   
>   	mem_cgroup_uncharge(folio);
>   
> -	/*
> -	 * Note: we don't expect anonymous compound pages yet. Once supported
> -	 * and we could PTE-map them similar to THP, we'd have to clear
> -	 * PG_anon_exclusive on all tail pages.
> -	 */
> -	if (folio_test_anon(folio)) {
> -		VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> -		__ClearPageAnonExclusive(folio_page(folio, 0));
> +	WARN_ON_ONCE(folio_test_large(folio) && !anon);
> +
> +	for (i = 0; i < nr; i++) {
> +		if (anon)
> +			__ClearPageAnonExclusive(folio_page(folio, i));
>   	}

if (folio_test_anon(folio)) {
	for (i = 0; i < nr; i++)
		__ClearPageAnonExclusive(folio_page(folio, i));
} else {
	VM_WARN_ON_ONCE(folio_test_large(folio));
}

>   
>   	/*
> @@ -496,10 +497,19 @@ void free_zone_device_folio(struct folio *folio)
>   
>   	switch (pgmap->type) {
>   	case MEMORY_DEVICE_PRIVATE:
> +		if (folio_test_large(folio)) {
> +			folio_unqueue_deferred_split(folio);

Is deferred splitting even a thing for device-private?

Should we ever queue them for deferred splitting?

> +
> +			percpu_ref_put_many(&folio->pgmap->ref, nr - 1);

Looks like we want a helper put_dev_pagemap_refs(pgmap, nr) below instead.
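
Something along these lines (untested sketch, using the name suggested 
above and mirroring put_dev_pagemap()):

/* sketch: drop multiple pgmap references in one go */
static inline void put_dev_pagemap_refs(struct dev_pagemap *pgmap,
					unsigned long refs)
{
	if (pgmap)
		percpu_ref_put_many(&pgmap->ref, refs);
}

Then the MEMORY_DEVICE_PRIVATE case can do a single 
put_dev_pagemap_refs(pgmap, nr) instead of open-coding the percpu_ref 
handling.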

> +		}
> +		pgmap->ops->page_free(page);
> +		put_dev_pagemap(pgmap);
> +		page->mapping = NULL;
> +		break;
>   	case MEMORY_DEVICE_COHERENT:
>   		if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->page_free))
>   			break;
> -		pgmap->ops->page_free(folio_page(folio, 0));
> +		pgmap->ops->page_free(page);
>   		put_dev_pagemap(pgmap);
>   		break;
>   
> @@ -523,14 +533,28 @@ void free_zone_device_folio(struct folio *folio)
>   	}
>   }
>   
> -void zone_device_page_init(struct page *page)
> +void init_zone_device_folio(struct folio *folio, unsigned int order)
>   {
> +	struct page *page = folio_page(folio, 0);
> +
> +	VM_BUG_ON(order > MAX_ORDER_NR_PAGES);

VM_WARN_ON_ONCE() or anything else that is not *BUG, please.

> +
> +	WARN_ON_ONCE(order && order != HPAGE_PMD_ORDER);

Why do we need that limitation?

> +
>   	/*
>   	 * Drivers shouldn't be allocating pages after calling
>   	 * memunmap_pages().
>   	 */
> -	WARN_ON_ONCE(!percpu_ref_tryget_live(&page_pgmap(page)->ref));
> -	set_page_count(page, 1);
> +	WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << order));
> +	folio_set_count(folio, 1);
>   	lock_page(page);
> +
> +	/*
> +	 * Only PMD level migration is supported for THP migration
> +	 */

I don't understand how that comment interacts with the code below. This 
is basic large folio initialization.

Drop the comment, or move it above the HPAGE_PMD_ORDER check with a 
clear reason why that limitation exists.

> +	if (order > 1) {
> +		prep_compound_page(page, order);
> +		folio_set_large_rmappable(folio);
> +	}
>   }
> -EXPORT_SYMBOL_GPL(zone_device_page_init);
> +EXPORT_SYMBOL_GPL(init_zone_device_folio);


-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 02/11] mm/migrate_device: flags for selecting device private THP pages
  2025-03-06  4:42 ` [RFC 02/11] mm/migrate_device: flags for selecting device private THP pages Balbir Singh
@ 2025-07-08 13:41   ` David Hildenbrand
  2025-07-09  5:25     ` Balbir Singh
  0 siblings, 1 reply; 38+ messages in thread
From: David Hildenbrand @ 2025-07-08 13:41 UTC (permalink / raw)
  To: Balbir Singh, linux-mm, akpm
  Cc: dri-devel, nouveau, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom

On 06.03.25 05:42, Balbir Singh wrote:
> Add flags to mark zone device migration pages.
> 
> MIGRATE_VMA_SELECT_COMPOUND will be used to select THP pages during
> migrate_vma_setup() and MIGRATE_PFN_COMPOUND will make migrating
> device pages as compound pages during device pfn migration.
> 
> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> ---
>   include/linux/migrate.h | 2 ++
>   1 file changed, 2 insertions(+)
> 
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 61899ec7a9a3..b5e4f51e64c7 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -167,6 +167,7 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
>   #define MIGRATE_PFN_VALID	(1UL << 0)
>   #define MIGRATE_PFN_MIGRATE	(1UL << 1)
>   #define MIGRATE_PFN_WRITE	(1UL << 3)
> +#define MIGRATE_PFN_COMPOUND	(1UL << 4)
>   #define MIGRATE_PFN_SHIFT	6
>   
>   static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
> @@ -185,6 +186,7 @@ enum migrate_vma_direction {
>   	MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
>   	MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
>   	MIGRATE_VMA_SELECT_DEVICE_COHERENT = 1 << 2,
> +	MIGRATE_VMA_SELECT_COMPOUND = 1 << 3,
>   };
>   
>   struct migrate_vma {

Squash that into the patch where it is actually used.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 03/11] mm/thp: zone_device awareness in THP handling code
  2025-03-06  4:42 ` [RFC 03/11] mm/thp: zone_device awareness in THP handling code Balbir Singh
@ 2025-07-08 14:10   ` David Hildenbrand
  2025-07-09  6:06     ` Alistair Popple
  2025-07-09 12:30     ` Balbir Singh
  0 siblings, 2 replies; 38+ messages in thread
From: David Hildenbrand @ 2025-07-08 14:10 UTC (permalink / raw)
  To: Balbir Singh, linux-mm, akpm
  Cc: dri-devel, nouveau, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom

On 06.03.25 05:42, Balbir Singh wrote:
> Make THP handling code in the mm subsystem for THP pages
> aware of zone device pages. Although the code is
> designed to be generic when it comes to handling splitting
> of pages, the code is designed to work for THP page sizes
> corresponding to HPAGE_PMD_NR.
> 
> Modify page_vma_mapped_walk() to return true when a zone
> device huge entry is present, enabling try_to_migrate()
> and other code migration paths to appropriately process the
> entry
> 
> pmd_pfn() does not work well with zone device entries, use
> pfn_pmd_entry_to_swap() for checking and comparison as for
> zone device entries.
> 
> try_to_map_to_unused_zeropage() does not apply to zone device
> entries, zone device entries are ignored in the call.
> 
> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> ---
>   mm/huge_memory.c     | 151 +++++++++++++++++++++++++++++++------------
>   mm/migrate.c         |   2 +
>   mm/page_vma_mapped.c |  10 +++
>   mm/rmap.c            |  19 +++++-
>   4 files changed, 138 insertions(+), 44 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 826bfe907017..d8e018d1bdbd 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2247,10 +2247,17 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>   		} else if (thp_migration_supported()) {
>   			swp_entry_t entry;
>   
> -			VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
>   			entry = pmd_to_swp_entry(orig_pmd);
>   			folio = pfn_swap_entry_folio(entry);
>   			flush_needed = 0;
> +
> +			VM_BUG_ON(!is_pmd_migration_entry(*pmd) &&
> +					!folio_is_device_private(folio));

Convert that to a VM_WARN_ON_ONCE() while you are at it.

But really, check that the *pmd* is as expected (a device-private entry), 
and not the folio after the fact.

Also, hiding all that under the thp_migration_supported() looks wrong.

Likely you must clean that up first, to have something that expresses 
that we support PMD swap entries or sth like that. Not just "migration 
entries".


> +
> +			if (folio_is_device_private(folio)) {
> +				folio_remove_rmap_pmd(folio, folio_page(folio, 0), vma);
> +				WARN_ON_ONCE(folio_mapcount(folio) < 0);
> +			}


zap_nonpresent_ptes() does

if (is_device_private_entry(entry)) {
	...
} else if (is_migration_entry(entry)) {
	....
}

Can we adopt the same way of doing things? (yes, we might want a 
thp_migration_supported() check somewhere)
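
With the hunks from this patch that would look roughly like (sketch only):

			if (is_device_private_entry(entry)) {
				/* device-private PMD: tear down the rmap here */
				folio_remove_rmap_pmd(folio, folio_page(folio, 0), vma);
				WARN_ON_ONCE(folio_mapcount(folio) < 0);
			} else if (is_migration_entry(entry)) {
				/* migration entry: nothing extra to do */
			} else {
				WARN_ON_ONCE(1);
			}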

>   		} else
>   			WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
>   
> @@ -2264,6 +2271,15 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>   				       -HPAGE_PMD_NR);
>   		}
>   
> +		/*
> +		 * Do a folio put on zone device private pages after
> +		 * changes to mm_counter, because the folio_put() will
> +		 * clean folio->mapping and the folio_test_anon() check
> +		 * will not be usable.
> +		 */
> +		if (folio_is_device_private(folio))
> +			folio_put(folio);
> +
>   		spin_unlock(ptl);
>   		if (flush_needed)
>   			tlb_remove_page_size(tlb, &folio->page, HPAGE_PMD_SIZE);
> @@ -2392,7 +2408,8 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>   		struct folio *folio = pfn_swap_entry_folio(entry);
>   		pmd_t newpmd;
>   
> -		VM_BUG_ON(!is_pmd_migration_entry(*pmd));
> +		VM_BUG_ON(!is_pmd_migration_entry(*pmd) &&
> +			  !folio_is_device_private(folio));
>   		if (is_writable_migration_entry(entry)) {
>   			/*
>   			 * A protection check is difficult so
> @@ -2405,9 +2422,11 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>   			newpmd = swp_entry_to_pmd(entry);
>   			if (pmd_swp_soft_dirty(*pmd))
>   				newpmd = pmd_swp_mksoft_dirty(newpmd);
> -		} else {
> +		} else if (is_writable_device_private_entry(entry)) {
> +			newpmd = swp_entry_to_pmd(entry);
> +			entry = make_device_exclusive_entry(swp_offset(entry));
> +		} else
>   			newpmd = *pmd;
> -		}
>   
>   		if (uffd_wp)
>   			newpmd = pmd_swp_mkuffd_wp(newpmd);
> @@ -2860,11 +2879,12 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>   	struct page *page;
>   	pgtable_t pgtable;
>   	pmd_t old_pmd, _pmd;
> -	bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
> -	bool anon_exclusive = false, dirty = false;
> +	bool young, write, soft_dirty, uffd_wp = false;
> +	bool anon_exclusive = false, dirty = false, present = false;
>   	unsigned long addr;
>   	pte_t *pte;
>   	int i;
> +	swp_entry_t swp_entry;
>   
>   	VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
>   	VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
> @@ -2918,20 +2938,25 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>   		return __split_huge_zero_page_pmd(vma, haddr, pmd);
>   	}
>   
> -	pmd_migration = is_pmd_migration_entry(*pmd);
> -	if (unlikely(pmd_migration)) {
> -		swp_entry_t entry;
>   
> +	present = pmd_present(*pmd);
> +	if (unlikely(!present)) {
> +		swp_entry = pmd_to_swp_entry(*pmd);
>   		old_pmd = *pmd;
> -		entry = pmd_to_swp_entry(old_pmd);
> -		page = pfn_swap_entry_to_page(entry);
> -		write = is_writable_migration_entry(entry);
> +
> +		folio = pfn_swap_entry_folio(swp_entry);
> +		VM_BUG_ON(!is_migration_entry(swp_entry) &&
> +				!is_device_private_entry(swp_entry));
> +		page = pfn_swap_entry_to_page(swp_entry);
> +		write = is_writable_migration_entry(swp_entry);
> +
>   		if (PageAnon(page))
> -			anon_exclusive = is_readable_exclusive_migration_entry(entry);
> -		young = is_migration_entry_young(entry);
> -		dirty = is_migration_entry_dirty(entry);
> +			anon_exclusive =
> +				is_readable_exclusive_migration_entry(swp_entry);
>   		soft_dirty = pmd_swp_soft_dirty(old_pmd);
>   		uffd_wp = pmd_swp_uffd_wp(old_pmd);
> +		young = is_migration_entry_young(swp_entry);
> +		dirty = is_migration_entry_dirty(swp_entry);
>   	} else {
>   		/*
>   		 * Up to this point the pmd is present and huge and userland has
> @@ -3015,30 +3040,45 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>   	 * Note that NUMA hinting access restrictions are not transferred to
>   	 * avoid any possibility of altering permissions across VMAs.
>   	 */
> -	if (freeze || pmd_migration) {
> +	if (freeze || !present) {
>   		for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
>   			pte_t entry;
> -			swp_entry_t swp_entry;
> -
> -			if (write)
> -				swp_entry = make_writable_migration_entry(
> -							page_to_pfn(page + i));
> -			else if (anon_exclusive)
> -				swp_entry = make_readable_exclusive_migration_entry(
> -							page_to_pfn(page + i));
> -			else
> -				swp_entry = make_readable_migration_entry(
> -							page_to_pfn(page + i));
> -			if (young)
> -				swp_entry = make_migration_entry_young(swp_entry);
> -			if (dirty)
> -				swp_entry = make_migration_entry_dirty(swp_entry);
> -			entry = swp_entry_to_pte(swp_entry);
> -			if (soft_dirty)
> -				entry = pte_swp_mksoft_dirty(entry);
> -			if (uffd_wp)
> -				entry = pte_swp_mkuffd_wp(entry);
> -
> +			if (freeze || is_migration_entry(swp_entry)) {
> +				if (write)
> +					swp_entry = make_writable_migration_entry(
> +								page_to_pfn(page + i));
> +				else if (anon_exclusive)
> +					swp_entry = make_readable_exclusive_migration_entry(
> +								page_to_pfn(page + i));
> +				else
> +					swp_entry = make_readable_migration_entry(
> +								page_to_pfn(page + i));
> +				if (young)
> +					swp_entry = make_migration_entry_young(swp_entry);
> +				if (dirty)
> +					swp_entry = make_migration_entry_dirty(swp_entry);
> +				entry = swp_entry_to_pte(swp_entry);
> +				if (soft_dirty)
> +					entry = pte_swp_mksoft_dirty(entry);
> +				if (uffd_wp)
> +					entry = pte_swp_mkuffd_wp(entry);
> +			} else {
> +				VM_BUG_ON(!is_device_private_entry(swp_entry));
> +				if (write)
> +					swp_entry = make_writable_device_private_entry(
> +								page_to_pfn(page + i));
> +				else if (anon_exclusive)
> +					swp_entry = make_device_exclusive_entry(
> +								page_to_pfn(page + i));

I am pretty sure this is wrong. You cannot suddenly mix in 
device-exclusive entries.

And now I am confused again how device-private, anon and GUP interact.

> +				else
> +					swp_entry = make_readable_device_private_entry(
> +								page_to_pfn(page + i));
> +				entry = swp_entry_to_pte(swp_entry);
> +				if (soft_dirty)
> +					entry = pte_swp_mksoft_dirty(entry);
> +				if (uffd_wp)
> +					entry = pte_swp_mkuffd_wp(entry);
> +			}
>   			VM_WARN_ON(!pte_none(ptep_get(pte + i)));
>   			set_pte_at(mm, addr, pte + i, entry);
>   		}
> @@ -3065,7 +3105,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>   	}
>   	pte_unmap(pte);
>   
> -	if (!pmd_migration)
> +	if (present)
>   		folio_remove_rmap_pmd(folio, page, vma);
>   	if (freeze)
>   		put_page(page);
> @@ -3077,6 +3117,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>   void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
>   			   pmd_t *pmd, bool freeze, struct folio *folio)
>   {
> +	struct folio *pmd_folio;
>   	VM_WARN_ON_ONCE(folio && !folio_test_pmd_mappable(folio));
>   	VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
>   	VM_WARN_ON_ONCE(folio && !folio_test_locked(folio));
> @@ -3089,7 +3130,14 @@ void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
>   	 */
>   	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) ||
>   	    is_pmd_migration_entry(*pmd)) {
> -		if (folio && folio != pmd_folio(*pmd))
> +		if (folio && !pmd_present(*pmd)) {
> +			swp_entry_t swp_entry = pmd_to_swp_entry(*pmd);
> +
> +			pmd_folio = page_folio(pfn_swap_entry_to_page(swp_entry));
> +		} else {
> +			pmd_folio = pmd_folio(*pmd);
> +		}
> +		if (folio && folio != pmd_folio)
>   			return;
>   		__split_huge_pmd_locked(vma, pmd, address, freeze);
>   	}
> @@ -3581,11 +3629,16 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
>   				     folio_test_swapcache(origin_folio)) ?
>   					     folio_nr_pages(release) : 0));
>   
> +			if (folio_is_device_private(release))
> +				percpu_ref_get_many(&release->pgmap->ref,
> +							(1 << new_order) - 1);
> +
>   			if (release == origin_folio)
>   				continue;
>   
> -			lru_add_page_tail(origin_folio, &release->page,
> -						lruvec, list);
> +			if (!folio_is_device_private(origin_folio))
> +				lru_add_page_tail(origin_folio, &release->page,
> +							lruvec, list);
>   
>   			/* Some pages can be beyond EOF: drop them from page cache */
>   			if (release->index >= end) {
> @@ -4625,7 +4678,10 @@ int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
>   		return 0;
>   
>   	flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
> -	pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
> +	if (!folio_is_device_private(folio))
> +		pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
> +	else
> +		pmdval = pmdp_huge_clear_flush(vma, address, pvmw->pmd);

Please handle this like we handle the PTE case -- checking for 
pmd_present() instead.

Avoid placing these nasty folio_is_device_private() all over the place 
where avoidable.
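
i.e. something like (sketch; should be equivalent here, since the 
device-private PMD is the non-present case):

	if (pmd_present(*pvmw->pmd))
		pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
	else
		pmdval = pmdp_huge_clear_flush(vma, address, pvmw->pmd);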

>   
>   	/* See folio_try_share_anon_rmap_pmd(): invalidate PMD first. */
>   	anon_exclusive = folio_test_anon(folio) && PageAnonExclusive(page);
> @@ -4675,6 +4731,17 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
>   	entry = pmd_to_swp_entry(*pvmw->pmd);
>   	folio_get(folio);
>   	pmde = mk_huge_pmd(new, READ_ONCE(vma->vm_page_prot));
> +
> +	if (unlikely(folio_is_device_private(folio))) {
> +		if (pmd_write(pmde))
> +			entry = make_writable_device_private_entry(
> +							page_to_pfn(new));
> +		else
> +			entry = make_readable_device_private_entry(
> +							page_to_pfn(new));
> +		pmde = swp_entry_to_pmd(entry);
> +	}
> +
>   	if (pmd_swp_soft_dirty(*pvmw->pmd))
>   		pmde = pmd_mksoft_dirty(pmde);
>   	if (is_writable_migration_entry(entry))
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 59e39aaa74e7..0aa1bdb711c3 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -200,6 +200,8 @@ static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
>   
>   	if (PageCompound(page))
>   		return false;
> +	if (folio_is_device_private(folio))
> +		return false;

Why is that check required when you are adding THP handling and there is 
a PageCompound check right there?

>   	VM_BUG_ON_PAGE(!PageAnon(page), page);
>   	VM_BUG_ON_PAGE(!PageLocked(page), page);
>   	VM_BUG_ON_PAGE(pte_present(*pvmw->pte), page);
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index e463c3be934a..5dd2e51477d3 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -278,6 +278,16 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>   			 * cannot return prematurely, while zap_huge_pmd() has
>   			 * cleared *pmd but not decremented compound_mapcount().
>   			 */
> +			swp_entry_t entry;
> +
> +			if (!thp_migration_supported())
> +				return not_found(pvmw);

This check looks misplaced. We should follow the same model as check_pte().

Checking for THP migration support when you actually care about 
device-private entries is weird.

That is, I would expect something like

} else if (is_swap_pmd(pmde)) {
	swp_entry_t entry;

	entry = pmd_to_swp_entry(pmde);
	if (!is_device_private_entry(entry))
		return false;

	...
}

> +			entry = pmd_to_swp_entry(pmde);
> +			if (is_device_private_entry(entry)) {
> +				pvmw->ptl = pmd_lock(mm, pvmw->pmd);
> +				return true;
> +			}
> +
>   			if ((pvmw->flags & PVMW_SYNC) &&
>   			    thp_vma_suitable_order(vma, pvmw->address,
>   						   PMD_ORDER) &&
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 67bb273dfb80..67e99dc5f2ef 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2326,8 +2326,23 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>   #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
>   		/* PMD-mapped THP migration entry */
>   		if (!pvmw.pte) {
> -			subpage = folio_page(folio,
> -				pmd_pfn(*pvmw.pmd) - folio_pfn(folio));
> +			/*
> +			 * Zone device private folios do not work well with
> +			 * pmd_pfn() on some architectures due to pte
> +			 * inversion.
> +			 */
> +			if (folio_is_device_private(folio)) {
> +				swp_entry_t entry = pmd_to_swp_entry(*pvmw.pmd);
> +				unsigned long pfn = swp_offset_pfn(entry);
> +
> +				subpage = folio_page(folio, pfn
> +							- folio_pfn(folio));
> +			} else {
> +				subpage = folio_page(folio,
> +							pmd_pfn(*pvmw.pmd)
> +							- folio_pfn(folio));
> +			}
> +


Please follow the same model we use for PTEs.

/*
  * Handle PFN swap PMDs, such as device-exclusive ones, that
  * actually map pages.
  */
if (likely(pmd_present(...))) {

}
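
Filled in with the hunks from this patch, that would be something like 
(untested):

			if (likely(pmd_present(*pvmw.pmd))) {
				subpage = folio_page(folio,
					pmd_pfn(*pvmw.pmd) - folio_pfn(folio));
			} else {
				/* PFN swap PMD, e.g. device-private */
				swp_entry_t entry = pmd_to_swp_entry(*pvmw.pmd);

				subpage = folio_page(folio,
					swp_offset_pfn(entry) - folio_pfn(folio));
			}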


-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 07/11] mm/memremap: Add folio_split support
  2025-03-06  4:42 ` [RFC 07/11] mm/memremap: Add folio_split support Balbir Singh
  2025-03-06  8:16   ` Mika Penttilä
  2025-03-06 22:36   ` Alistair Popple
@ 2025-07-08 14:31   ` David Hildenbrand
  2025-07-09 23:34     ` Balbir Singh
  2 siblings, 1 reply; 38+ messages in thread
From: David Hildenbrand @ 2025-07-08 14:31 UTC (permalink / raw)
  To: Balbir Singh, linux-mm, akpm
  Cc: dri-devel, nouveau, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom

On 06.03.25 05:42, Balbir Singh wrote:
> When a zone device page is split (via huge pmd folio split). The
> driver callback for folio_split is invoked to let the device driver
> know that the folio size has been split into a smaller order.
> 
> The HMM test driver has been updated to handle the split, since the
> test driver uses backing pages, it requires a mechanism of reorganizing
> the backing pages (backing pages are used to create a mirror device)
> again into the right sized order pages. This is supported by exporting
> prep_compound_page().
> 
> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> ---
>   include/linux/memremap.h |  7 +++++++
>   include/linux/mm.h       |  1 +
>   lib/test_hmm.c           | 35 +++++++++++++++++++++++++++++++++++
>   mm/huge_memory.c         |  5 +++++
>   mm/page_alloc.c          |  1 +
>   5 files changed, 49 insertions(+)
> 
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 11d586dd8ef1..2091b754f1da 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -100,6 +100,13 @@ struct dev_pagemap_ops {
>   	 */
>   	int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long pfn,
>   			      unsigned long nr_pages, int mf_flags);
> +
> +	/*
> +	 * Used for private (un-addressable) device memory only.
> +	 * This callback is used when a folio is split into
> +	 * a smaller folio

Confusing. When a folio is split, it is split into multiple folios.

So when will this be invoked?

> +	 */
> +	void (*folio_split)(struct folio *head, struct folio *tail);

head and tail are really suboptimal terminology. They refer to head and 
tail pages, which is not really the case with folios (in the long run).

>   };
>   
>   #define PGMAP_ALTMAP_VALID	(1 << 0)
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 98a67488b5fe..3d0e91e0a923 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1415,6 +1415,7 @@ static inline struct folio *virt_to_folio(const void *x)
>   void __folio_put(struct folio *folio);
>   
>   void split_page(struct page *page, unsigned int order);
> +void prep_compound_page(struct page *page, unsigned int order);
>   void folio_copy(struct folio *dst, struct folio *src);
>   int folio_mc_copy(struct folio *dst, struct folio *src);
>   
> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
> index a81d2f8a0426..18b6a7b061d7 100644
> --- a/lib/test_hmm.c
> +++ b/lib/test_hmm.c
> @@ -1640,10 +1640,45 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>   	return ret;
>   }
>   
> +
> +static void dmirror_devmem_folio_split(struct folio *head, struct folio *tail)
> +{
> +	struct page *rpage = BACKING_PAGE(folio_page(head, 0));
> +	struct folio *new_rfolio;
> +	struct folio *rfolio;
> +	unsigned long offset = 0;
> +
> +	if (!rpage) {
> +		folio_page(tail, 0)->zone_device_data = NULL;
> +		return;
> +	}
> +
> +	offset = folio_pfn(tail) - folio_pfn(head);
> +	rfolio = page_folio(rpage);
> +	new_rfolio = page_folio(folio_page(rfolio, offset));
> +
> +	folio_page(tail, 0)->zone_device_data = folio_page(new_rfolio, 0);
> +
> +	if (folio_pfn(tail) - folio_pfn(head) == 1) {
> +		if (folio_order(head))
> +			prep_compound_page(folio_page(rfolio, 0),
> +						folio_order(head));
> +		folio_set_count(rfolio, 1);
> +	}
> +	clear_compound_head(folio_page(new_rfolio, 0));
> +	if (folio_order(tail))
> +		prep_compound_page(folio_page(new_rfolio, 0),
> +						folio_order(tail));
> +	folio_set_count(new_rfolio, 1);
> +	folio_page(new_rfolio, 0)->mapping = folio_page(rfolio, 0)->mapping;
> +	tail->pgmap = head->pgmap;

Most of this doesn't look like it should be the responsibility of this 
callback.

Setting up a new folio after the split (messing with compound pages etc) 
really should not be the responsibility of this callback.

So no, this looks misplaced.

> +}
> +
>   static const struct dev_pagemap_ops dmirror_devmem_ops = {
>   	.page_free	= dmirror_devmem_free,
>   	.migrate_to_ram	= dmirror_devmem_fault,
>   	.page_free	= dmirror_devmem_free,
> +	.folio_split	= dmirror_devmem_folio_split,
>   };
>   
>   static int dmirror_device_init(struct dmirror_device *mdevice, int id)
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 995ac8be5709..518a70d1b58a 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3655,6 +3655,11 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
>   						MTHP_STAT_NR_ANON, 1);
>   			}
>   
> +			if (folio_is_device_private(origin_folio) &&
> +					origin_folio->pgmap->ops->folio_split)
> +				origin_folio->pgmap->ops->folio_split(
> +					origin_folio, release);

Absolutely ugly. I think we need a wrapper for that callback invocation.
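
Something like (hypothetical name, rough sketch):

/* sketch: hide the pgmap callback dereference from the split code */
static inline void zone_device_private_folio_split(struct folio *original,
						   struct folio *new_folio)
{
	if (folio_is_device_private(original) &&
	    original->pgmap->ops->folio_split)
		original->pgmap->ops->folio_split(original, new_folio);
}

so that __split_unmapped_folio() only needs a single call.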

> +
>   			/*
>   			 * Unfreeze refcount first. Additional reference from
>   			 * page cache.
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 17ea8fb27cbf..563f7e39aa79 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -573,6 +573,7 @@ void prep_compound_page(struct page *page, unsigned int order)
>   
>   	prep_compound_head(page, order);
>   }
> +EXPORT_SYMBOL_GPL(prep_compound_page);

Hmmm, that is questionable. We don't want arbitrary modules to make use 
of that.

Another sign that you are exposing the wrong functionality/interface 
(folio_split) to modules.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 08/11] mm/thp: add split during migration support
  2025-03-06  4:42 ` [RFC 08/11] mm/thp: add split during migration support Balbir Singh
@ 2025-07-08 14:38   ` David Hildenbrand
  2025-07-08 14:46     ` Zi Yan
  0 siblings, 1 reply; 38+ messages in thread
From: David Hildenbrand @ 2025-07-08 14:38 UTC (permalink / raw)
  To: Balbir Singh, linux-mm, akpm
  Cc: dri-devel, nouveau, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom

On 06.03.25 05:42, Balbir Singh wrote:
> Support splitting pages during THP zone device migration as needed.
> The common case that arises is that after setup, during migrate
> the destination might not be able to allocate MIGRATE_PFN_COMPOUND
> pages.
> 
> Add a new routine migrate_vma_split_pages() to support the splitting
> of already isolated pages. The pages being migrated are already unmapped
> and marked for migration during setup (via unmap). folio_split() and
> __split_unmapped_folio() take additional isolated arguments, to avoid
> unmapping and remaping these pages and unlocking/putting the folio.
> 
> Since unmap/remap is avoided in these code paths, an extra reference
> count is added to the split folio pages, which will be dropped in
> the finalize phase.
> 
> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> ---

[...]

>   	remap_page(origin_folio, 1 << order,
>   			folio_test_anon(origin_folio) ?
>   				RMP_USE_SHARED_ZEROPAGE : 0);
> @@ -3808,6 +3823,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>    * @lock_at: a page within @folio to be left locked to caller
>    * @list: after-split folios will be put on it if non NULL
>    * @uniform_split: perform uniform split or not (non-uniform split)
> + * @isolated: The pages are already unmapped

Isolated -> unmapped? Huh?

Can we just detect that state from the folio so we don't have to pass 
random boolean variables around?

For example, folio_mapped() can tell you if the folio is currently mapped.
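
e.g. (sketch):

	/* derive the state instead of threading a boolean through callers */
	bool unmapped = !folio_mapped(folio);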

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 05/11] mm/memory/fault: Add support for zone device THP fault handling
  2025-03-06  4:42 ` [RFC 05/11] mm/memory/fault: Add support for zone device THP fault handling Balbir Singh
@ 2025-07-08 14:40   ` David Hildenbrand
  2025-07-09 23:26     ` Balbir Singh
  0 siblings, 1 reply; 38+ messages in thread
From: David Hildenbrand @ 2025-07-08 14:40 UTC (permalink / raw)
  To: Balbir Singh, linux-mm, akpm
  Cc: dri-devel, nouveau, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom

On 06.03.25 05:42, Balbir Singh wrote:
> When the CPU touches a zone device THP entry, the data needs to
> be migrated back to the CPU, call migrate_to_ram() on these pages
> via do_huge_pmd_device_private() fault handling helper.
> 
> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> ---
>   include/linux/huge_mm.h |  7 +++++++
>   mm/huge_memory.c        | 35 +++++++++++++++++++++++++++++++++++
>   mm/memory.c             |  6 ++++--
>   3 files changed, 46 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index e893d546a49f..ad0c0ccfcbc2 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -479,6 +479,8 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
>   
>   vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);
>   
> +vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf);
> +
>   extern struct folio *huge_zero_folio;
>   extern unsigned long huge_zero_pfn;
>   
> @@ -634,6 +636,11 @@ static inline vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>   	return 0;
>   }
>   
> +static inline vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf)
> +{
> +	return 0;
> +}
> +
>   static inline bool is_huge_zero_folio(const struct folio *folio)
>   {
>   	return false;
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d8e018d1bdbd..995ac8be5709 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1375,6 +1375,41 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>   	return __do_huge_pmd_anonymous_page(vmf);
>   }
>   
> +vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf)
> +{
> +	struct vm_area_struct *vma = vmf->vma;
> +	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
> +	vm_fault_t ret;
> +	spinlock_t *ptl;
> +	swp_entry_t swp_entry;
> +	struct page *page;
> +
> +	if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER))
> +		return VM_FAULT_FALLBACK;

I'm confused. Why is that required when we already have a PMD entry?

Apart from that, nothing jumped at me.


-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 08/11] mm/thp: add split during migration support
  2025-07-08 14:38   ` David Hildenbrand
@ 2025-07-08 14:46     ` Zi Yan
  2025-07-08 14:53       ` David Hildenbrand
  0 siblings, 1 reply; 38+ messages in thread
From: Zi Yan @ 2025-07-08 14:46 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Balbir Singh, linux-mm, akpm, dri-devel, nouveau, Karol Herbst,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
	Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
	Alistair Popple, Donet Tom

On 8 Jul 2025, at 10:38, David Hildenbrand wrote:

> On 06.03.25 05:42, Balbir Singh wrote:
>> Support splitting pages during THP zone device migration as needed.
>> The common case that arises is that after setup, during migrate
>> the destination might not be able to allocate MIGRATE_PFN_COMPOUND
>> pages.
>>
>> Add a new routine migrate_vma_split_pages() to support the splitting
>> of already isolated pages. The pages being migrated are already unmapped
>> and marked for migration during setup (via unmap). folio_split() and
>> __split_unmapped_folio() take additional isolated arguments, to avoid
>> unmapping and remaping these pages and unlocking/putting the folio.
>>
>> Since unmap/remap is avoided in these code paths, an extra reference
>> count is added to the split folio pages, which will be dropped in
>> the finalize phase.
>>
>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>> ---
>
> [...]
>
>>   	remap_page(origin_folio, 1 << order,
>>   			folio_test_anon(origin_folio) ?
>>   				RMP_USE_SHARED_ZEROPAGE : 0);
>> @@ -3808,6 +3823,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>    * @lock_at: a page within @folio to be left locked to caller
>>    * @list: after-split folios will be put on it if non NULL
>>    * @uniform_split: perform uniform split or not (non-uniform split)
>> + * @isolated: The pages are already unmapped
>
> Isolated -> unmapped? Huh?
>
> Can we just detect that state from the folio so we don't have to pass random boolean variables around?
>
> For example, folio_mapped() can tell you if the folio is currently mapped.

My proposal is to clean up __split_unmapped_folio() to not include
remap(), folio_ref_unfreeze(), lru_add_split_folio(), so that Balbir
can use __split_unmapped_folio() directly. Since the folio is
unmapped and all page table entries are migration entries, __folio_split()
code could be avoided.

My cleanup patch is at: https://lore.kernel.org/linux-mm/660F3BCC-0360-458F-BFF5-92C797E165CC@nvidia.com/. I will polish it a bit more and send it out properly.



Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 08/11] mm/thp: add split during migration support
  2025-07-08 14:46     ` Zi Yan
@ 2025-07-08 14:53       ` David Hildenbrand
  0 siblings, 0 replies; 38+ messages in thread
From: David Hildenbrand @ 2025-07-08 14:53 UTC (permalink / raw)
  To: Zi Yan
  Cc: Balbir Singh, linux-mm, akpm, dri-devel, nouveau, Karol Herbst,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
	Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
	Alistair Popple, Donet Tom

On 08.07.25 16:46, Zi Yan wrote:
> On 8 Jul 2025, at 10:38, David Hildenbrand wrote:
> 
>> On 06.03.25 05:42, Balbir Singh wrote:
>>> Support splitting pages during THP zone device migration as needed.
>>> The common case that arises is that after setup, during migrate
>>> the destination might not be able to allocate MIGRATE_PFN_COMPOUND
>>> pages.
>>>
>>> Add a new routine migrate_vma_split_pages() to support the splitting
>>> of already isolated pages. The pages being migrated are already unmapped
>>> and marked for migration during setup (via unmap). folio_split() and
>>> __split_unmapped_folio() take additional isolated arguments, to avoid
>>> unmapping and remaping these pages and unlocking/putting the folio.
>>>
>>> Since unmap/remap is avoided in these code paths, an extra reference
>>> count is added to the split folio pages, which will be dropped in
>>> the finalize phase.
>>>
>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>> ---
>>
>> [...]
>>
>>>    	remap_page(origin_folio, 1 << order,
>>>    			folio_test_anon(origin_folio) ?
>>>    				RMP_USE_SHARED_ZEROPAGE : 0);
>>> @@ -3808,6 +3823,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>     * @lock_at: a page within @folio to be left locked to caller
>>>     * @list: after-split folios will be put on it if non NULL
>>>     * @uniform_split: perform uniform split or not (non-uniform split)
>>> + * @isolated: The pages are already unmapped
>>
>> Isolated -> unmapped? Huh?
>>
>> Can we just detect that state from the folio so we don't have to pass random boolean variables around?
>>
>> For example, folio_mapped() can tell you if the folio is currently mapped.
> 
> My proposal is to clean up __split_unmapped_folio() to not include
> remap(), folio_ref_unfreeze(), lru_add_split_folio(), so that Balbir
> can use __split_unmapped_folio() directly. Since the folio is
> unmapped and all page table entries are migration entries, __folio_split()
> code could be avoided.
> 
> My clean up patch is at: https://lore.kernel.org/linux-mm/660F3BCC-0360-458F-BFF5-92C797E165CC@nvidia.com/. I will make some polish and send it out properly.

Works for me as well.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 01/11] mm/zone_device: support large zone device private folios
  2025-07-08 13:37   ` David Hildenbrand
@ 2025-07-09  5:25     ` Balbir Singh
  0 siblings, 0 replies; 38+ messages in thread
From: Balbir Singh @ 2025-07-09  5:25 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm, akpm
  Cc: dri-devel, nouveau, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom

On 7/8/25 23:37, David Hildenbrand wrote:
> On 06.03.25 05:42, Balbir Singh wrote:
>> Add routines to support allocation of large order zone device folios
>> and helper functions for zone device folios, to check if a folio is
>> device private and helpers for setting zone device data.
>>
>> When large folios are used, the existing page_free() callback in
>> pgmap is called when the folio is freed, this is true for both
>> PAGE_SIZE and higher order pages.
>>
>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>> ---
>>   include/linux/memremap.h | 22 +++++++++++++++++-
>>   mm/memremap.c            | 50 +++++++++++++++++++++++++++++-----------
>>   2 files changed, 58 insertions(+), 14 deletions(-)
>>
>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>> index 4aa151914eab..11d586dd8ef1 100644
>> --- a/include/linux/memremap.h
>> +++ b/include/linux/memremap.h
>> @@ -169,6 +169,18 @@ static inline bool folio_is_device_private(const struct folio *folio)
>>       return is_device_private_page(&folio->page);
>>   }
>>   +static inline void *folio_zone_device_data(const struct folio *folio)
>> +{
>> +    VM_BUG_ON_FOLIO(!folio_is_device_private(folio), folio);
>> +    return folio->page.zone_device_data;
>> +}
> 
> Not used.
> 
>> +
>> +static inline void folio_set_zone_device_data(struct folio *folio, void *data)
>> +{
>> +    VM_BUG_ON_FOLIO(!folio_is_device_private(folio), folio);
>> +    folio->page.zone_device_data = data;
>> +}
>> +
> 
> Not used.
> 
> Move both into the patch where they are actually used.
> 

Ack

>>   static inline bool is_pci_p2pdma_page(const struct page *page)
>>   {
>>       return IS_ENABLED(CONFIG_PCI_P2PDMA) &&
>> @@ -199,7 +211,7 @@ static inline bool folio_is_fsdax(const struct folio *folio)
>>   }
>>     #ifdef CONFIG_ZONE_DEVICE
>> -void zone_device_page_init(struct page *page);
>> +void init_zone_device_folio(struct folio *folio, unsigned int order);
>>   void *memremap_pages(struct dev_pagemap *pgmap, int nid);
>>   void memunmap_pages(struct dev_pagemap *pgmap);
>>   void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
>> @@ -209,6 +221,14 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
>>   bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);
>>     unsigned long memremap_compat_align(void);
>> +
>> +static inline void zone_device_page_init(struct page *page)
>> +{
>> +    struct folio *folio = page_folio(page);
>> +
>> +    init_zone_device_folio(folio, 0);
>> +}
>> +
>>   #else
>>   static inline void *devm_memremap_pages(struct device *dev,
>>           struct dev_pagemap *pgmap)
>> diff --git a/mm/memremap.c b/mm/memremap.c
>> index 2aebc1b192da..7d98d0a4c0cd 100644
>> --- a/mm/memremap.c
>> +++ b/mm/memremap.c
>> @@ -459,20 +459,21 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
>>   void free_zone_device_folio(struct folio *folio)
>>   {
>>       struct dev_pagemap *pgmap = folio->pgmap;
>> +    unsigned int nr = folio_nr_pages(folio);
>> +    int i;
>> +    bool anon = folio_test_anon(folio);
> 
> You can easily get rid of this (see below).
> 
>> +    struct page *page = folio_page(folio, 0);
> 
> Please inline folio_page(folio, 0) below instead.

Sure. Is that preferred over keeping a local struct page pointer?

> 
>>         if (WARN_ON_ONCE(!pgmap))
>>           return;
>>         mem_cgroup_uncharge(folio);
>>   -    /*
>> -     * Note: we don't expect anonymous compound pages yet. Once supported
>> -     * and we could PTE-map them similar to THP, we'd have to clear
>> -     * PG_anon_exclusive on all tail pages.
>> -     */
>> -    if (folio_test_anon(folio)) {
>> -        VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>> -        __ClearPageAnonExclusive(folio_page(folio, 0));
>> +    WARN_ON_ONCE(folio_test_large(folio) && !anon);
>> +
>> +    for (i = 0; i < nr; i++) {
>> +        if (anon)
>> +            __ClearPageAnonExclusive(folio_page(folio, i));
>>       }
> 
> if (folio_test_anon(folio)) {
>     for (i = 0; i < nr; i++)
>         __ClearPageAnonExclusive(folio_page(folio, i));
> } else {
>     VM_WARN_ON_ONCE(folio_test_large(folio));
> }
> 

Ack

>>         /*
>> @@ -496,10 +497,19 @@ void free_zone_device_folio(struct folio *folio)
>>         switch (pgmap->type) {
>>       case MEMORY_DEVICE_PRIVATE:
>> +        if (folio_test_large(folio)) {
>> +            folio_unqueue_deferred_split(folio);
> 
> Is deferred splitting even a thing for device-private?
> 
> Should we ever queue them for deferred splitting?
> 

Not really, but I wanted to do the right thing in the teardown path; I can remove these bits

>> +
>> +            percpu_ref_put_many(&folio->pgmap->ref, nr - 1);
> 
> Looks like we want a helper put_dev_pagemap_refs(pgmap, nr) below instead.
> 
>> +        }
>> +        pgmap->ops->page_free(page);
>> +        put_dev_pagemap(pgmap);
>> +        page->mapping = NULL;
>> +        break;
>>       case MEMORY_DEVICE_COHERENT:
>>           if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->page_free))
>>               break;
>> -        pgmap->ops->page_free(folio_page(folio, 0));
>> +        pgmap->ops->page_free(page);
>>           put_dev_pagemap(pgmap);
>>           break;
>>   @@ -523,14 +533,28 @@ void free_zone_device_folio(struct folio *folio)
>>       }
>>   }
>>   -void zone_device_page_init(struct page *page)
>> +void init_zone_device_folio(struct folio *folio, unsigned int order)
>>   {
>> +    struct page *page = folio_page(folio, 0);
>> +
>> +    VM_BUG_ON(order > MAX_ORDER_NR_PAGES);
> 
> VM_WARN_ON_ONCE() or anything else that is not *BUG, please.
> 

Ack

>> +
>> +    WARN_ON_ONCE(order && order != HPAGE_PMD_ORDER);
> 
> Why do we need that limitation?
> 

mTHP is not yet supported in the series. We could keep this routine more generic
and not need the checks, but I added them to prevent use of unsupported orders.

>> +
>>       /*
>>        * Drivers shouldn't be allocating pages after calling
>>        * memunmap_pages().
>>        */
>> -    WARN_ON_ONCE(!percpu_ref_tryget_live(&page_pgmap(page)->ref));
>> -    set_page_count(page, 1);
>> +    WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << order));
>> +    folio_set_count(folio, 1);
>>       lock_page(page);
>> +
>> +    /*
>> +     * Only PMD level migration is supported for THP migration
>> +     */
> 
> I don't understand how that comment interacts with the code below. This is basic large folio initialization.
> 
> Drop the comment, or move it above the HPAGE_PMD_ORDER check with a clear reason why that limitation exists.
>

Ack

 
>> +    if (order > 1) {
>> +        prep_compound_page(page, order);
>> +        folio_set_large_rmappable(folio);
>> +    }
>>   }
>> -EXPORT_SYMBOL_GPL(zone_device_page_init);
>> +EXPORT_SYMBOL_GPL(init_zone_device_folio);
> 
> 

Thanks for the review
Balbir


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 02/11] mm/migrate_device: flags for selecting device private THP pages
  2025-07-08 13:41   ` David Hildenbrand
@ 2025-07-09  5:25     ` Balbir Singh
  0 siblings, 0 replies; 38+ messages in thread
From: Balbir Singh @ 2025-07-09  5:25 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm, akpm
  Cc: dri-devel, nouveau, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom

On 7/8/25 23:41, David Hildenbrand wrote:
> On 06.03.25 05:42, Balbir Singh wrote:
>> Add flags to mark zone device migration pages.
>>
>> MIGRATE_VMA_SELECT_COMPOUND will be used to select THP pages during
>> migrate_vma_setup() and MIGRATE_PFN_COMPOUND will make migrating
>> device pages as compound pages during device pfn migration.
>>
>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>> ---
>>   include/linux/migrate.h | 2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
>> index 61899ec7a9a3..b5e4f51e64c7 100644
>> --- a/include/linux/migrate.h
>> +++ b/include/linux/migrate.h
>> @@ -167,6 +167,7 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
>>   #define MIGRATE_PFN_VALID    (1UL << 0)
>>   #define MIGRATE_PFN_MIGRATE    (1UL << 1)
>>   #define MIGRATE_PFN_WRITE    (1UL << 3)
>> +#define MIGRATE_PFN_COMPOUND    (1UL << 4)
>>   #define MIGRATE_PFN_SHIFT    6
>>     static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
>> @@ -185,6 +186,7 @@ enum migrate_vma_direction {
>>       MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
>>       MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
>>       MIGRATE_VMA_SELECT_DEVICE_COHERENT = 1 << 2,
>> +    MIGRATE_VMA_SELECT_COMPOUND = 1 << 3,
>>   };
>>     struct migrate_vma {
> 
> Squash that into the patch where it is actually used.
> 

Will do,

Thanks,
Balbir Singh


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 03/11] mm/thp: zone_device awareness in THP handling code
  2025-07-08 14:10   ` David Hildenbrand
@ 2025-07-09  6:06     ` Alistair Popple
  2025-07-09 12:30     ` Balbir Singh
  1 sibling, 0 replies; 38+ messages in thread
From: Alistair Popple @ 2025-07-09  6:06 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Balbir Singh, linux-mm, akpm, dri-devel, nouveau, Karol Herbst,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
	Ryan Roberts, Matthew Wilcox, Peter Xu, Zi Yan, Kefeng Wang,
	Jane Chu, Donet Tom

On Tue, Jul 08, 2025 at 04:10:55PM +0200, David Hildenbrand wrote:
> On 06.03.25 05:42, Balbir Singh wrote:
> > Make THP handling code in the mm subsystem for THP pages
> > aware of zone device pages. Although the code is
> > designed to be generic when it comes to handling splitting
> > of pages, the code is designed to work for THP page sizes
> > corresponding to HPAGE_PMD_NR.
> > 
> > Modify page_vma_mapped_walk() to return true when a zone
> > device huge entry is present, enabling try_to_migrate()
> > and other code migration paths to appropriately process the
> > entry
> > 
> > pmd_pfn() does not work well with zone device entries, use
> > pfn_pmd_entry_to_swap() for checking and comparison as for
> > zone device entries.
> > 
> > try_to_map_to_unused_zeropage() does not apply to zone device
> > entries, zone device entries are ignored in the call.
> > 
> > Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> > ---
> >   mm/huge_memory.c     | 151 +++++++++++++++++++++++++++++++------------
> >   mm/migrate.c         |   2 +
> >   mm/page_vma_mapped.c |  10 +++
> >   mm/rmap.c            |  19 +++++-
> >   4 files changed, 138 insertions(+), 44 deletions(-)
> > 
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 826bfe907017..d8e018d1bdbd 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -2247,10 +2247,17 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >   		} else if (thp_migration_supported()) {
> >   			swp_entry_t entry;
> > -			VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
> >   			entry = pmd_to_swp_entry(orig_pmd);
> >   			folio = pfn_swap_entry_folio(entry);
> >   			flush_needed = 0;
> > +
> > +			VM_BUG_ON(!is_pmd_migration_entry(*pmd) &&
> > +					!folio_is_device_private(folio));
> 
> Convert that to a VM_WARN_ON_ONCE() while you are at it.
> 
> But really, check that the *pmd* is as expected (a device-private entry), and
> not the folio after the fact.
> 
> Also, hiding all that under the thp_migration_supported() looks wrong.
> 
> Likely you must clean that up first, to have something that expresses that
> we support PMD swap entries or sth like that. Not just "migration entries".
> 
> 
> > +
> > +			if (folio_is_device_private(folio)) {
> > +				folio_remove_rmap_pmd(folio, folio_page(folio, 0), vma);
> > +				WARN_ON_ONCE(folio_mapcount(folio) < 0);
> > +			}
> 
> 
> zap_nonpresent_ptes() does
> 
> if (is_device_private_entry(entry)) {
> 	...
> } else if (is_migration_entry(entry)) {
> 	....
> }
> 
> Can we adopt the same way of doing things? (yes, we might want a
> thp_migration_supported() check somewhere)
> 
> >   		} else
> >   			WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
> > @@ -2264,6 +2271,15 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >   				       -HPAGE_PMD_NR);
> >   		}
> > +		/*
> > +		 * Do a folio put on zone device private pages after
> > +		 * changes to mm_counter, because the folio_put() will
> > +		 * clean folio->mapping and the folio_test_anon() check
> > +		 * will not be usable.
> > +		 */
> > +		if (folio_is_device_private(folio))
> > +			folio_put(folio);
> > +
> >   		spin_unlock(ptl);
> >   		if (flush_needed)
> >   			tlb_remove_page_size(tlb, &folio->page, HPAGE_PMD_SIZE);
> > @@ -2392,7 +2408,8 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >   		struct folio *folio = pfn_swap_entry_folio(entry);
> >   		pmd_t newpmd;
> > -		VM_BUG_ON(!is_pmd_migration_entry(*pmd));
> > +		VM_BUG_ON(!is_pmd_migration_entry(*pmd) &&
> > +			  !folio_is_device_private(folio));
> >   		if (is_writable_migration_entry(entry)) {
> >   			/*
> >   			 * A protection check is difficult so
> > @@ -2405,9 +2422,11 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >   			newpmd = swp_entry_to_pmd(entry);
> >   			if (pmd_swp_soft_dirty(*pmd))
> >   				newpmd = pmd_swp_mksoft_dirty(newpmd);
> > -		} else {
> > +		} else if (is_writable_device_private_entry(entry)) {
> > +			newpmd = swp_entry_to_pmd(entry);
> > +			entry = make_device_exclusive_entry(swp_offset(entry));
> > +		} else
> >   			newpmd = *pmd;
> > -		}
> >   		if (uffd_wp)
> >   			newpmd = pmd_swp_mkuffd_wp(newpmd);
> > @@ -2860,11 +2879,12 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> >   	struct page *page;
> >   	pgtable_t pgtable;
> >   	pmd_t old_pmd, _pmd;
> > -	bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
> > -	bool anon_exclusive = false, dirty = false;
> > +	bool young, write, soft_dirty, uffd_wp = false;
> > +	bool anon_exclusive = false, dirty = false, present = false;
> >   	unsigned long addr;
> >   	pte_t *pte;
> >   	int i;
> > +	swp_entry_t swp_entry;
> >   	VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
> >   	VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
> > @@ -2918,20 +2938,25 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> >   		return __split_huge_zero_page_pmd(vma, haddr, pmd);
> >   	}
> > -	pmd_migration = is_pmd_migration_entry(*pmd);
> > -	if (unlikely(pmd_migration)) {
> > -		swp_entry_t entry;
> > +	present = pmd_present(*pmd);
> > +	if (unlikely(!present)) {
> > +		swp_entry = pmd_to_swp_entry(*pmd);
> >   		old_pmd = *pmd;
> > -		entry = pmd_to_swp_entry(old_pmd);
> > -		page = pfn_swap_entry_to_page(entry);
> > -		write = is_writable_migration_entry(entry);
> > +
> > +		folio = pfn_swap_entry_folio(swp_entry);
> > +		VM_BUG_ON(!is_migration_entry(swp_entry) &&
> > +				!is_device_private_entry(swp_entry));
> > +		page = pfn_swap_entry_to_page(swp_entry);
> > +		write = is_writable_migration_entry(swp_entry);
> > +
> >   		if (PageAnon(page))
> > -			anon_exclusive = is_readable_exclusive_migration_entry(entry);
> > -		young = is_migration_entry_young(entry);
> > -		dirty = is_migration_entry_dirty(entry);
> > +			anon_exclusive =
> > +				is_readable_exclusive_migration_entry(swp_entry);
> >   		soft_dirty = pmd_swp_soft_dirty(old_pmd);
> >   		uffd_wp = pmd_swp_uffd_wp(old_pmd);
> > +		young = is_migration_entry_young(swp_entry);
> > +		dirty = is_migration_entry_dirty(swp_entry);
> >   	} else {
> >   		/*
> >   		 * Up to this point the pmd is present and huge and userland has
> > @@ -3015,30 +3040,45 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> >   	 * Note that NUMA hinting access restrictions are not transferred to
> >   	 * avoid any possibility of altering permissions across VMAs.
> >   	 */
> > -	if (freeze || pmd_migration) {
> > +	if (freeze || !present) {
> >   		for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
> >   			pte_t entry;
> > -			swp_entry_t swp_entry;
> > -
> > -			if (write)
> > -				swp_entry = make_writable_migration_entry(
> > -							page_to_pfn(page + i));
> > -			else if (anon_exclusive)
> > -				swp_entry = make_readable_exclusive_migration_entry(
> > -							page_to_pfn(page + i));
> > -			else
> > -				swp_entry = make_readable_migration_entry(
> > -							page_to_pfn(page + i));
> > -			if (young)
> > -				swp_entry = make_migration_entry_young(swp_entry);
> > -			if (dirty)
> > -				swp_entry = make_migration_entry_dirty(swp_entry);
> > -			entry = swp_entry_to_pte(swp_entry);
> > -			if (soft_dirty)
> > -				entry = pte_swp_mksoft_dirty(entry);
> > -			if (uffd_wp)
> > -				entry = pte_swp_mkuffd_wp(entry);
> > -
> > +			if (freeze || is_migration_entry(swp_entry)) {
> > +				if (write)
> > +					swp_entry = make_writable_migration_entry(
> > +								page_to_pfn(page + i));
> > +				else if (anon_exclusive)
> > +					swp_entry = make_readable_exclusive_migration_entry(
> > +								page_to_pfn(page + i));
> > +				else
> > +					swp_entry = make_readable_migration_entry(
> > +								page_to_pfn(page + i));
> > +				if (young)
> > +					swp_entry = make_migration_entry_young(swp_entry);
> > +				if (dirty)
> > +					swp_entry = make_migration_entry_dirty(swp_entry);
> > +				entry = swp_entry_to_pte(swp_entry);
> > +				if (soft_dirty)
> > +					entry = pte_swp_mksoft_dirty(entry);
> > +				if (uffd_wp)
> > +					entry = pte_swp_mkuffd_wp(entry);
> > +			} else {
> > +				VM_BUG_ON(!is_device_private_entry(swp_entry));
> > +				if (write)
> > +					swp_entry = make_writable_device_private_entry(
> > +								page_to_pfn(page + i));
> > +				else if (anon_exclusive)
> > +					swp_entry = make_device_exclusive_entry(
> > +								page_to_pfn(page + i));
> 
> I am pretty sure this is wrong. You cannot suddenly mix in device-exclusive
> entries.
> 
> And now I am confused again how device-private, anon and GUP interact.

See my comments on Balbir's v1 resend. I'm pretty sure he's just gotten mixed
up with the wonderfully confusing naming I helped create and incorrectly copied
the code for making migration entries above. GUP doesn't work for device-private
pages - it will fault which will cause the device driver to migrate the pages
back.

> > +				else
> > +					swp_entry = make_readable_device_private_entry(
> > +								page_to_pfn(page + i));
> > +				entry = swp_entry_to_pte(swp_entry);
> > +				if (soft_dirty)
> > +					entry = pte_swp_mksoft_dirty(entry);
> > +				if (uffd_wp)
> > +					entry = pte_swp_mkuffd_wp(entry);
> > +			}
> >   			VM_WARN_ON(!pte_none(ptep_get(pte + i)));
> >   			set_pte_at(mm, addr, pte + i, entry);
> >   		}
> > @@ -3065,7 +3105,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> >   	}
> >   	pte_unmap(pte);
> > -	if (!pmd_migration)
> > +	if (present)
> >   		folio_remove_rmap_pmd(folio, page, vma);
> >   	if (freeze)
> >   		put_page(page);
> > @@ -3077,6 +3117,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> >   void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
> >   			   pmd_t *pmd, bool freeze, struct folio *folio)
> >   {
> > +	struct folio *pmd_folio;
> >   	VM_WARN_ON_ONCE(folio && !folio_test_pmd_mappable(folio));
> >   	VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
> >   	VM_WARN_ON_ONCE(folio && !folio_test_locked(folio));
> > @@ -3089,7 +3130,14 @@ void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
> >   	 */
> >   	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) ||
> >   	    is_pmd_migration_entry(*pmd)) {
> > -		if (folio && folio != pmd_folio(*pmd))
> > +		if (folio && !pmd_present(*pmd)) {
> > +			swp_entry_t swp_entry = pmd_to_swp_entry(*pmd);
> > +
> > +			pmd_folio = page_folio(pfn_swap_entry_to_page(swp_entry));
> > +		} else {
> > +			pmd_folio = pmd_folio(*pmd);
> > +		}
> > +		if (folio && folio != pmd_folio)
> >   			return;
> >   		__split_huge_pmd_locked(vma, pmd, address, freeze);
> >   	}
> > @@ -3581,11 +3629,16 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
> >   				     folio_test_swapcache(origin_folio)) ?
> >   					     folio_nr_pages(release) : 0));
> > +			if (folio_is_device_private(release))
> > +				percpu_ref_get_many(&release->pgmap->ref,
> > +							(1 << new_order) - 1);
> > +
> >   			if (release == origin_folio)
> >   				continue;
> > -			lru_add_page_tail(origin_folio, &release->page,
> > -						lruvec, list);
> > +			if (!folio_is_device_private(origin_folio))
> > +				lru_add_page_tail(origin_folio, &release->page,
> > +							lruvec, list);
> >   			/* Some pages can be beyond EOF: drop them from page cache */
> >   			if (release->index >= end) {
> > @@ -4625,7 +4678,10 @@ int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
> >   		return 0;
> >   	flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
> > -	pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
> > +	if (!folio_is_device_private(folio))
> > +		pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
> > +	else
> > +		pmdval = pmdp_huge_clear_flush(vma, address, pvmw->pmd);
> 
> Please handle this like we handle the PTE case -- checking for pmd_present()
> instead.
> 
> Avoid placing these nasty folio_is_device_private() all over the place where
> avoidable.
> 
> >   	/* See folio_try_share_anon_rmap_pmd(): invalidate PMD first. */
> >   	anon_exclusive = folio_test_anon(folio) && PageAnonExclusive(page);
> > @@ -4675,6 +4731,17 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
> >   	entry = pmd_to_swp_entry(*pvmw->pmd);
> >   	folio_get(folio);
> >   	pmde = mk_huge_pmd(new, READ_ONCE(vma->vm_page_prot));
> > +
> > +	if (unlikely(folio_is_device_private(folio))) {
> > +		if (pmd_write(pmde))
> > +			entry = make_writable_device_private_entry(
> > +							page_to_pfn(new));
> > +		else
> > +			entry = make_readable_device_private_entry(
> > +							page_to_pfn(new));
> > +		pmde = swp_entry_to_pmd(entry);
> > +	}
> > +
> >   	if (pmd_swp_soft_dirty(*pvmw->pmd))
> >   		pmde = pmd_mksoft_dirty(pmde);
> >   	if (is_writable_migration_entry(entry))
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 59e39aaa74e7..0aa1bdb711c3 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -200,6 +200,8 @@ static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
> >   	if (PageCompound(page))
> >   		return false;
> > +	if (folio_is_device_private(folio))
> > +		return false;
> 
> Why is that check required when you are adding THP handling and there is a
> PageCompound check right there?
> 
> >   	VM_BUG_ON_PAGE(!PageAnon(page), page);
> >   	VM_BUG_ON_PAGE(!PageLocked(page), page);
> >   	VM_BUG_ON_PAGE(pte_present(*pvmw->pte), page);
> > diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> > index e463c3be934a..5dd2e51477d3 100644
> > --- a/mm/page_vma_mapped.c
> > +++ b/mm/page_vma_mapped.c
> > @@ -278,6 +278,16 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> >   			 * cannot return prematurely, while zap_huge_pmd() has
> >   			 * cleared *pmd but not decremented compound_mapcount().
> >   			 */
> > +			swp_entry_t entry;
> > +
> > +			if (!thp_migration_supported())
> > +				return not_found(pvmw);
> 
> This check looks misplaced. We should follow the same model as check_pte().
> 
> Checking for THP migration support when you are actually caring about
> device-private entries is weird.
> 
> That is, I would expect something like
> 
> } else if (is_swap_pmd(pmde)) {
> 	swp_entry_t entry;
> 
> 	entry = pmd_to_swp_entry(pmde);
> 	if (!is_device_private_entry(entry))
> 		return false;
> 
> 	...
> }
> 
> > +			entry = pmd_to_swp_entry(pmde);
> > +			if (is_device_private_entry(entry)) {
> > +				pvmw->ptl = pmd_lock(mm, pvmw->pmd);
> > +				return true;
> > +			}
> > +
> >   			if ((pvmw->flags & PVMW_SYNC) &&
> >   			    thp_vma_suitable_order(vma, pvmw->address,
> >   						   PMD_ORDER) &&
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 67bb273dfb80..67e99dc5f2ef 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -2326,8 +2326,23 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
> >   #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> >   		/* PMD-mapped THP migration entry */
> >   		if (!pvmw.pte) {
> > -			subpage = folio_page(folio,
> > -				pmd_pfn(*pvmw.pmd) - folio_pfn(folio));
> > +			/*
> > +			 * Zone device private folios do not work well with
> > +			 * pmd_pfn() on some architectures due to pte
> > +			 * inversion.
> > +			 */
> > +			if (folio_is_device_private(folio)) {
> > +				swp_entry_t entry = pmd_to_swp_entry(*pvmw.pmd);
> > +				unsigned long pfn = swp_offset_pfn(entry);
> > +
> > +				subpage = folio_page(folio, pfn
> > +							- folio_pfn(folio));
> > +			} else {
> > +				subpage = folio_page(folio,
> > +							pmd_pfn(*pvmw.pmd)
> > +							- folio_pfn(folio));
> > +			}
> > +
> 
> 
> Please follow the same model we use for PTEs.
> 
> /*
>  * Handle PFN swap PMDs, such as device-exclusive ones, that
>  * actually map pages.
>  */
> if (likely(pmd_present(...))) {
> 
> }
> 
> 
> -- 
> Cheers,
> 
> David / dhildenb
> 



* Re: [RFC 03/11] mm/thp: zone_device awareness in THP handling code
  2025-07-08 14:10   ` David Hildenbrand
  2025-07-09  6:06     ` Alistair Popple
@ 2025-07-09 12:30     ` Balbir Singh
  1 sibling, 0 replies; 38+ messages in thread
From: Balbir Singh @ 2025-07-09 12:30 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm, akpm
  Cc: dri-devel, nouveau, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom

On 7/9/25 00:10, David Hildenbrand wrote:
> On 06.03.25 05:42, Balbir Singh wrote:
>> Make THP handling code in the mm subsystem for THP pages
>> aware of zone device pages. Although the code is
>> designed to be generic when it comes to handling splitting
>> of pages, the code is designed to work for THP page sizes
>> corresponding to HPAGE_PMD_NR.
>>
>> Modify page_vma_mapped_walk() to return true when a zone
>> device huge entry is present, enabling try_to_migrate()
>> and other code migration paths to appropriately process the
>> entry
>>
>> pmd_pfn() does not work well with zone device entries, use
>> pfn_pmd_entry_to_swap() for checking and comparison as for
>> zone device entries.
>>
>> try_to_map_to_unused_zeropage() does not apply to zone device
>> entries, zone device entries are ignored in the call.
>>
>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>> ---
>>   mm/huge_memory.c     | 151 +++++++++++++++++++++++++++++++------------
>>   mm/migrate.c         |   2 +
>>   mm/page_vma_mapped.c |  10 +++
>>   mm/rmap.c            |  19 +++++-
>>   4 files changed, 138 insertions(+), 44 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 826bfe907017..d8e018d1bdbd 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2247,10 +2247,17 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>           } else if (thp_migration_supported()) {
>>               swp_entry_t entry;
>>   -            VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
>>               entry = pmd_to_swp_entry(orig_pmd);
>>               folio = pfn_swap_entry_folio(entry);
>>               flush_needed = 0;
>> +
>> +            VM_BUG_ON(!is_pmd_migration_entry(*pmd) &&
>> +                    !folio_is_device_private(folio));
> 
> Convert that to a VM_WARN_ON_ONCE() while you are at it.
> 

Ack

> But really, check that the *pmd* is as expected (device_private entry), and not the folio after the effects.
> 
> Also, hiding all that under the thp_migration_supported() looks wrong.
> 
> Likely you must clean that up first, to have something that expresses that we support PMD swap entries or sth like that. Not just "migration entries".
> 

The logic for the check is

if (pmd_present()) {
    ...
} else if (thp_migration_supported()) {
    ...
}

PMD swap entries are supported for migration entries (and, after these changes, for zone device private entries)



> 
>> +
>> +            if (folio_is_device_private(folio)) {
>> +                folio_remove_rmap_pmd(folio, folio_page(folio, 0), vma);
>> +                WARN_ON_ONCE(folio_mapcount(folio) < 0);
>> +            }
> 
> 
> zap_nonpresent_ptes() does
> 
> if (is_device_private_entry(entry)) {
>     ...
> } else if (is_migration_entry(entry)) {
>     ....
> }
> 
> Can we adjust the same way of doing things? (yes, we might want a thp_migration_supported() check somewhere)

Are you suggesting refactoring the code to add a zap_nonpresent_pmd()? There really isn't much
to be done specifically for migration entries in this context.
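
If it helps, this is roughly the shape I read into that suggestion (sketch only; the
folio_put() ordering w.r.t. the mm_counter update from the hunk above stays as is):

    swp_entry_t entry = pmd_to_swp_entry(orig_pmd);
    struct folio *folio = pfn_swap_entry_folio(entry);

    if (is_device_private_entry(entry)) {
        /* drop the PMD-level rmap; folio_put() happens later, after the mm_counter update */
        folio_remove_rmap_pmd(folio, folio_page(folio, 0), vma);
    } else if (is_migration_entry(entry)) {
        /* nothing extra to do for migration entries here */
    } else {
        WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
    }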

> 
>>           } else
>>               WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
>>   @@ -2264,6 +2271,15 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>                          -HPAGE_PMD_NR);
>>           }
>>   +        /*
>> +         * Do a folio put on zone device private pages after
>> +         * changes to mm_counter, because the folio_put() will
>> +         * clean folio->mapping and the folio_test_anon() check
>> +         * will not be usable.
>> +         */
>> +        if (folio_is_device_private(folio))
>> +            folio_put(folio);
>> +
>>           spin_unlock(ptl);
>>           if (flush_needed)
>>               tlb_remove_page_size(tlb, &folio->page, HPAGE_PMD_SIZE);
>> @@ -2392,7 +2408,8 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>           struct folio *folio = pfn_swap_entry_folio(entry);
>>           pmd_t newpmd;
>>   -        VM_BUG_ON(!is_pmd_migration_entry(*pmd));
>> +        VM_BUG_ON(!is_pmd_migration_entry(*pmd) &&
>> +              !folio_is_device_private(folio));
>>           if (is_writable_migration_entry(entry)) {
>>               /*
>>                * A protection check is difficult so
>> @@ -2405,9 +2422,11 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>               newpmd = swp_entry_to_pmd(entry);
>>               if (pmd_swp_soft_dirty(*pmd))
>>                   newpmd = pmd_swp_mksoft_dirty(newpmd);
>> -        } else {
>> +        } else if (is_writable_device_private_entry(entry)) {
>> +            newpmd = swp_entry_to_pmd(entry);
>> +            entry = make_device_exclusive_entry(swp_offset(entry));
>> +        } else
>>               newpmd = *pmd;
>> -        }
>>             if (uffd_wp)
>>               newpmd = pmd_swp_mkuffd_wp(newpmd);
>> @@ -2860,11 +2879,12 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>       struct page *page;
>>       pgtable_t pgtable;
>>       pmd_t old_pmd, _pmd;
>> -    bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
>> -    bool anon_exclusive = false, dirty = false;
>> +    bool young, write, soft_dirty, uffd_wp = false;
>> +    bool anon_exclusive = false, dirty = false, present = false;
>>       unsigned long addr;
>>       pte_t *pte;
>>       int i;
>> +    swp_entry_t swp_entry;
>>         VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
>>       VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
>> @@ -2918,20 +2938,25 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>           return __split_huge_zero_page_pmd(vma, haddr, pmd);
>>       }
>>   -    pmd_migration = is_pmd_migration_entry(*pmd);
>> -    if (unlikely(pmd_migration)) {
>> -        swp_entry_t entry;
>>   +    present = pmd_present(*pmd);
>> +    if (unlikely(!present)) {
>> +        swp_entry = pmd_to_swp_entry(*pmd);
>>           old_pmd = *pmd;
>> -        entry = pmd_to_swp_entry(old_pmd);
>> -        page = pfn_swap_entry_to_page(entry);
>> -        write = is_writable_migration_entry(entry);
>> +
>> +        folio = pfn_swap_entry_folio(swp_entry);
>> +        VM_BUG_ON(!is_migration_entry(swp_entry) &&
>> +                !is_device_private_entry(swp_entry));
>> +        page = pfn_swap_entry_to_page(swp_entry);
>> +        write = is_writable_migration_entry(swp_entry);
>> +
>>           if (PageAnon(page))
>> -            anon_exclusive = is_readable_exclusive_migration_entry(entry);
>> -        young = is_migration_entry_young(entry);
>> -        dirty = is_migration_entry_dirty(entry);
>> +            anon_exclusive =
>> +                is_readable_exclusive_migration_entry(swp_entry);
>>           soft_dirty = pmd_swp_soft_dirty(old_pmd);
>>           uffd_wp = pmd_swp_uffd_wp(old_pmd);
>> +        young = is_migration_entry_young(swp_entry);
>> +        dirty = is_migration_entry_dirty(swp_entry);
>>       } else {
>>           /*
>>            * Up to this point the pmd is present and huge and userland has
>> @@ -3015,30 +3040,45 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>        * Note that NUMA hinting access restrictions are not transferred to
>>        * avoid any possibility of altering permissions across VMAs.
>>        */
>> -    if (freeze || pmd_migration) {
>> +    if (freeze || !present) {
>>           for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
>>               pte_t entry;
>> -            swp_entry_t swp_entry;
>> -
>> -            if (write)
>> -                swp_entry = make_writable_migration_entry(
>> -                            page_to_pfn(page + i));
>> -            else if (anon_exclusive)
>> -                swp_entry = make_readable_exclusive_migration_entry(
>> -                            page_to_pfn(page + i));
>> -            else
>> -                swp_entry = make_readable_migration_entry(
>> -                            page_to_pfn(page + i));
>> -            if (young)
>> -                swp_entry = make_migration_entry_young(swp_entry);
>> -            if (dirty)
>> -                swp_entry = make_migration_entry_dirty(swp_entry);
>> -            entry = swp_entry_to_pte(swp_entry);
>> -            if (soft_dirty)
>> -                entry = pte_swp_mksoft_dirty(entry);
>> -            if (uffd_wp)
>> -                entry = pte_swp_mkuffd_wp(entry);
>> -
>> +            if (freeze || is_migration_entry(swp_entry)) {
>> +                if (write)
>> +                    swp_entry = make_writable_migration_entry(
>> +                                page_to_pfn(page + i));
>> +                else if (anon_exclusive)
>> +                    swp_entry = make_readable_exclusive_migration_entry(
>> +                                page_to_pfn(page + i));
>> +                else
>> +                    swp_entry = make_readable_migration_entry(
>> +                                page_to_pfn(page + i));
>> +                if (young)
>> +                    swp_entry = make_migration_entry_young(swp_entry);
>> +                if (dirty)
>> +                    swp_entry = make_migration_entry_dirty(swp_entry);
>> +                entry = swp_entry_to_pte(swp_entry);
>> +                if (soft_dirty)
>> +                    entry = pte_swp_mksoft_dirty(entry);
>> +                if (uffd_wp)
>> +                    entry = pte_swp_mkuffd_wp(entry);
>> +            } else {
>> +                VM_BUG_ON(!is_device_private_entry(swp_entry));
>> +                if (write)
>> +                    swp_entry = make_writable_device_private_entry(
>> +                                page_to_pfn(page + i));
>> +                else if (anon_exclusive)
>> +                    swp_entry = make_device_exclusive_entry(
>> +                                page_to_pfn(page + i));
> 
> I am pretty sure this is wrong. You cannot suddenly mix in device-exclusive entries.
> 
> And now I am confused again how device-private, anon and GUP interact.
> 

:)

Yep, this is wrong; we don't need to do anything specific for anon_exclusive since
device private pages cannot be pinned. Other reviews have pointed towards fixes
needed here as well.
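
For reference, the corrected branch would look something like this (sketch only: no
device-exclusive entries, and anon_exclusive is simply not consulted for device-private
pages):

            } else {
                VM_WARN_ON_ONCE(!is_device_private_entry(swp_entry));
                if (write)
                    swp_entry = make_writable_device_private_entry(
                                page_to_pfn(page + i));
                else
                    swp_entry = make_readable_device_private_entry(
                                page_to_pfn(page + i));
                entry = swp_entry_to_pte(swp_entry);
                if (soft_dirty)
                    entry = pte_swp_mksoft_dirty(entry);
                if (uffd_wp)
                    entry = pte_swp_mkuffd_wp(entry);
            }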

>> +                else
>> +                    swp_entry = make_readable_device_private_entry(
>> +                                page_to_pfn(page + i));
>> +                entry = swp_entry_to_pte(swp_entry);
>> +                if (soft_dirty)
>> +                    entry = pte_swp_mksoft_dirty(entry);
>> +                if (uffd_wp)
>> +                    entry = pte_swp_mkuffd_wp(entry);
>> +            }
>>               VM_WARN_ON(!pte_none(ptep_get(pte + i)));
>>               set_pte_at(mm, addr, pte + i, entry);
>>           }
>> @@ -3065,7 +3105,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>       }
>>       pte_unmap(pte);
>>   -    if (!pmd_migration)
>> +    if (present)
>>           folio_remove_rmap_pmd(folio, page, vma);
>>       if (freeze)
>>           put_page(page);
>> @@ -3077,6 +3117,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>   void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
>>                  pmd_t *pmd, bool freeze, struct folio *folio)
>>   {
>> +    struct folio *pmd_folio;
>>       VM_WARN_ON_ONCE(folio && !folio_test_pmd_mappable(folio));
>>       VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
>>       VM_WARN_ON_ONCE(folio && !folio_test_locked(folio));
>> @@ -3089,7 +3130,14 @@ void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
>>        */
>>       if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) ||
>>           is_pmd_migration_entry(*pmd)) {
>> -        if (folio && folio != pmd_folio(*pmd))
>> +        if (folio && !pmd_present(*pmd)) {
>> +            swp_entry_t swp_entry = pmd_to_swp_entry(*pmd);
>> +
>> +            pmd_folio = page_folio(pfn_swap_entry_to_page(swp_entry));
>> +        } else {
>> +            pmd_folio = pmd_folio(*pmd);
>> +        }
>> +        if (folio && folio != pmd_folio)
>>               return;
>>           __split_huge_pmd_locked(vma, pmd, address, freeze);
>>       }
>> @@ -3581,11 +3629,16 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
>>                        folio_test_swapcache(origin_folio)) ?
>>                            folio_nr_pages(release) : 0));
>>   +            if (folio_is_device_private(release))
>> +                percpu_ref_get_many(&release->pgmap->ref,
>> +                            (1 << new_order) - 1);
>> +
>>               if (release == origin_folio)
>>                   continue;
>>   -            lru_add_page_tail(origin_folio, &release->page,
>> -                        lruvec, list);
>> +            if (!folio_is_device_private(origin_folio))
>> +                lru_add_page_tail(origin_folio, &release->page,
>> +                            lruvec, list);
>>                 /* Some pages can be beyond EOF: drop them from page cache */
>>               if (release->index >= end) {
>> @@ -4625,7 +4678,10 @@ int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
>>           return 0;
>>         flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
>> -    pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
>> +    if (!folio_is_device_private(folio))
>> +        pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
>> +    else
>> +        pmdval = pmdp_huge_clear_flush(vma, address, pvmw->pmd);
> 
> Please handle this like we handle the PTE case -- checking for pmd_present() instead.
> 
> Avoid placing these nasty folio_is_device_private() all over the place where avoidable.
> 

Ack, I'll try to use the presence of the pmd to create the migration entries as appropriate.
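
Roughly, something like this (sketch, keying off pmd_present() instead of the folio type):

    /* device-private entries are non-present: clear+flush rather than invalidate */
    if (pmd_present(*pvmw->pmd))
        pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
    else
        pmdval = pmdp_huge_clear_flush(vma, address, pvmw->pmd);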

>>         /* See folio_try_share_anon_rmap_pmd(): invalidate PMD first. */
>>       anon_exclusive = folio_test_anon(folio) && PageAnonExclusive(page);
>> @@ -4675,6 +4731,17 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
>>       entry = pmd_to_swp_entry(*pvmw->pmd);
>>       folio_get(folio);
>>       pmde = mk_huge_pmd(new, READ_ONCE(vma->vm_page_prot));
>> +
>> +    if (unlikely(folio_is_device_private(folio))) {
>> +        if (pmd_write(pmde))
>> +            entry = make_writable_device_private_entry(
>> +                            page_to_pfn(new));
>> +        else
>> +            entry = make_readable_device_private_entry(
>> +                            page_to_pfn(new));
>> +        pmde = swp_entry_to_pmd(entry);
>> +    }
>> +
>>       if (pmd_swp_soft_dirty(*pvmw->pmd))
>>           pmde = pmd_mksoft_dirty(pmde);
>>       if (is_writable_migration_entry(entry))
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index 59e39aaa74e7..0aa1bdb711c3 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -200,6 +200,8 @@ static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
>>         if (PageCompound(page))
>>           return false;
>> +    if (folio_is_device_private(folio))
>> +        return false;
> 
> Why is that check required when you are adding THP handling and there is a PageCompound check right there?
> 

Fair point, we might not need the check here for THP handling.

>>       VM_BUG_ON_PAGE(!PageAnon(page), page);
>>       VM_BUG_ON_PAGE(!PageLocked(page), page);
>>       VM_BUG_ON_PAGE(pte_present(*pvmw->pte), page);
>> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
>> index e463c3be934a..5dd2e51477d3 100644
>> --- a/mm/page_vma_mapped.c
>> +++ b/mm/page_vma_mapped.c
>> @@ -278,6 +278,16 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>>                * cannot return prematurely, while zap_huge_pmd() has
>>                * cleared *pmd but not decremented compound_mapcount().
>>                */
>> +            swp_entry_t entry;
>> +
>> +            if (!thp_migration_supported())
>> +                return not_found(pvmw);
> 
> This check looks misplaced. We should follow the same model as check_pte().
> 
> Checking for THP migration support when you are actually caring about device-private entries is weird.
> 

The thp_migration_supported() check is common to the pmd handling even above the patched
code; that code checks for thp_migration_supported() and PVMW_MIGRATION. If THP migration
is not supported, is there any point in returning true?

> That is, I would expect something like
> 
> } else if (is_swap_pmd(pmde)) {
>     swp_entry_t entry;
> 
>     entry = pmd_to_swp_entry(pmde);
>     if (!is_device_private_entry(entry))
>         return false;
> 

I don't think the code above is correct; you'll notice that there is a specific race
that the existing code handles for the !pmd_present() case against zap_huge_pmd() when PVMW_SYNC is set.

I get the idea you're driving towards; I'll see how I can refactor it better.
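
Roughly what I'll try, keeping the existing PVMW_SYNC vs zap_huge_pmd() race handling in
place (sketch only):

    } else if (!pmd_present(pmde)) {
        swp_entry_t entry = pmd_to_swp_entry(pmde);

        /* device-private PMDs actually map a folio, report them */
        if (is_device_private_entry(entry)) {
            pvmw->ptl = pmd_lock(mm, pvmw->pmd);
            return true;
        }

        /*
         * The existing PVMW_SYNC handling (take and drop the pmd lock so
         * that we cannot return prematurely while zap_huge_pmd() is in
         * progress) stays unchanged below this point.
         */
        ...
    }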


>     ...
> }
> 
>> +            entry = pmd_to_swp_entry(pmde);
>> +            if (is_device_private_entry(entry)) {
>> +                pvmw->ptl = pmd_lock(mm, pvmw->pmd);
>> +                return true;
>> +            }
>> +
>>               if ((pvmw->flags & PVMW_SYNC) &&
>>                   thp_vma_suitable_order(vma, pvmw->address,
>>                              PMD_ORDER) &&
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 67bb273dfb80..67e99dc5f2ef 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -2326,8 +2326,23 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>>   #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
>>           /* PMD-mapped THP migration entry */
>>           if (!pvmw.pte) {
>> -            subpage = folio_page(folio,
>> -                pmd_pfn(*pvmw.pmd) - folio_pfn(folio));
>> +            /*
>> +             * Zone device private folios do not work well with
>> +             * pmd_pfn() on some architectures due to pte
>> +             * inversion.
>> +             */
>> +            if (folio_is_device_private(folio)) {
>> +                swp_entry_t entry = pmd_to_swp_entry(*pvmw.pmd);
>> +                unsigned long pfn = swp_offset_pfn(entry);
>> +
>> +                subpage = folio_page(folio, pfn
>> +                            - folio_pfn(folio));
>> +            } else {
>> +                subpage = folio_page(folio,
>> +                            pmd_pfn(*pvmw.pmd)
>> +                            - folio_pfn(folio));
>> +            }
>> +
> 
> 
> Please follow the same model we use for PTEs.
> 
> /*
>  * Handle PFN swap PMDs, such as device-exclusive ones, that
>  * actually map pages.
>  */
> if (likely(pmd_present(...))) {
> 
> }
> 
> 

Will refactor to check for pmd_present() first.
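
Something along these lines (sketch, mirroring the PTE-side structure):

    if (likely(pmd_present(*pvmw.pmd))) {
        subpage = folio_page(folio,
                    pmd_pfn(*pvmw.pmd) - folio_pfn(folio));
    } else {
        swp_entry_t entry = pmd_to_swp_entry(*pvmw.pmd);

        subpage = folio_page(folio,
                    swp_offset_pfn(entry) - folio_pfn(folio));
    }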

Balbir



* Re: [RFC 05/11] mm/memory/fault: Add support for zone device THP fault handling
  2025-07-08 14:40   ` David Hildenbrand
@ 2025-07-09 23:26     ` Balbir Singh
  0 siblings, 0 replies; 38+ messages in thread
From: Balbir Singh @ 2025-07-09 23:26 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm, akpm
  Cc: dri-devel, nouveau, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom

On 7/9/25 00:40, David Hildenbrand wrote:
> On 06.03.25 05:42, Balbir Singh wrote:
>> When the CPU touches a zone device THP entry, the data needs to
>> be migrated back to the CPU, call migrate_to_ram() on these pages
>> via do_huge_pmd_device_private() fault handling helper.
>>
>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>> ---
>>   include/linux/huge_mm.h |  7 +++++++
>>   mm/huge_memory.c        | 35 +++++++++++++++++++++++++++++++++++
>>   mm/memory.c             |  6 ++++--
>>   3 files changed, 46 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index e893d546a49f..ad0c0ccfcbc2 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -479,6 +479,8 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
>>     vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);
>>   +vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf);
>> +
>>   extern struct folio *huge_zero_folio;
>>   extern unsigned long huge_zero_pfn;
>>   @@ -634,6 +636,11 @@ static inline vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>       return 0;
>>   }
>>   +static inline vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf)
>> +{
>> +    return 0;
>> +}
>> +
>>   static inline bool is_huge_zero_folio(const struct folio *folio)
>>   {
>>       return false;
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index d8e018d1bdbd..995ac8be5709 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1375,6 +1375,41 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>>       return __do_huge_pmd_anonymous_page(vmf);
>>   }
>>   +vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf)
>> +{
>> +    struct vm_area_struct *vma = vmf->vma;
>> +    unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
>> +    vm_fault_t ret;
>> +    spinlock_t *ptl;
>> +    swp_entry_t swp_entry;
>> +    struct page *page;
>> +
>> +    if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER))
>> +        return VM_FAULT_FALLBACK;
> 
> I'm confused. Why is that required when we already have a PMD entry?
> 
> Apart from that, nothing jumped at me.
> 
> 

You're right, it is not required

Balbir Singh




* Re: [RFC 07/11] mm/memremap: Add folio_split support
  2025-07-08 14:31   ` David Hildenbrand
@ 2025-07-09 23:34     ` Balbir Singh
  0 siblings, 0 replies; 38+ messages in thread
From: Balbir Singh @ 2025-07-09 23:34 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm, akpm
  Cc: dri-devel, nouveau, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom

On 7/9/25 00:31, David Hildenbrand wrote:
> On 06.03.25 05:42, Balbir Singh wrote:
>> When a zone device page is split (via huge pmd folio split). The
>> driver callback for folio_split is invoked to let the device driver
>> know that the folio size has been split into a smaller order.
>>
>> The HMM test driver has been updated to handle the split, since the
>> test driver uses backing pages, it requires a mechanism of reorganizing
>> the backing pages (backing pages are used to create a mirror device)
>> again into the right sized order pages. This is supported by exporting
>> prep_compound_page().
>>
>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>> ---
>>   include/linux/memremap.h |  7 +++++++
>>   include/linux/mm.h       |  1 +
>>   lib/test_hmm.c           | 35 +++++++++++++++++++++++++++++++++++
>>   mm/huge_memory.c         |  5 +++++
>>   mm/page_alloc.c          |  1 +
>>   5 files changed, 49 insertions(+)
>>
>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>> index 11d586dd8ef1..2091b754f1da 100644
>> --- a/include/linux/memremap.h
>> +++ b/include/linux/memremap.h
>> @@ -100,6 +100,13 @@ struct dev_pagemap_ops {
>>        */
>>       int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long pfn,
>>                     unsigned long nr_pages, int mf_flags);
>> +
>> +    /*
>> +     * Used for private (un-addressable) device memory only.
>> +     * This callback is used when a folio is split into
>> +     * a smaller folio
> 
> Confusing. When a folio is split, it is split into multiple folios.
> 
> So when will this be invoked?
> 

It is invoked when a folio is split in mm/huge_memory.c. This allows the device
driver to update any metadata it is tracking in zone_device_data w.r.t. the original folio.

>> +     */
>> +    void (*folio_split)(struct folio *head, struct folio *tail);
> 
> head and tail are really suboptimal terminology. They refer to head and tail pages, which is not really the case with folios (in the long run).
> 

Will rename them to original_folio and new_folio if that helps with readability

>>   };
>>     #define PGMAP_ALTMAP_VALID    (1 << 0)
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 98a67488b5fe..3d0e91e0a923 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1415,6 +1415,7 @@ static inline struct folio *virt_to_folio(const void *x)
>>   void __folio_put(struct folio *folio);
>>     void split_page(struct page *page, unsigned int order);
>> +void prep_compound_page(struct page *page, unsigned int order);
>>   void folio_copy(struct folio *dst, struct folio *src);
>>   int folio_mc_copy(struct folio *dst, struct folio *src);
>>   diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>> index a81d2f8a0426..18b6a7b061d7 100644
>> --- a/lib/test_hmm.c
>> +++ b/lib/test_hmm.c
>> @@ -1640,10 +1640,45 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>>       return ret;
>>   }
>>   +
>> +static void dmirror_devmem_folio_split(struct folio *head, struct folio *tail)
>> +{
>> +    struct page *rpage = BACKING_PAGE(folio_page(head, 0));
>> +    struct folio *new_rfolio;
>> +    struct folio *rfolio;
>> +    unsigned long offset = 0;
>> +
>> +    if (!rpage) {
>> +        folio_page(tail, 0)->zone_device_data = NULL;
>> +        return;
>> +    }
>> +
>> +    offset = folio_pfn(tail) - folio_pfn(head);
>> +    rfolio = page_folio(rpage);
>> +    new_rfolio = page_folio(folio_page(rfolio, offset));
>> +
>> +    folio_page(tail, 0)->zone_device_data = folio_page(new_rfolio, 0);
>> +
>> +    if (folio_pfn(tail) - folio_pfn(head) == 1) {
>> +        if (folio_order(head))
>> +            prep_compound_page(folio_page(rfolio, 0),
>> +                        folio_order(head));
>> +        folio_set_count(rfolio, 1);
>> +    }
>> +    clear_compound_head(folio_page(new_rfolio, 0));
>> +    if (folio_order(tail))
>> +        prep_compound_page(folio_page(new_rfolio, 0),
>> +                        folio_order(tail));
>> +    folio_set_count(new_rfolio, 1);
>> +    folio_page(new_rfolio, 0)->mapping = folio_page(rfolio, 0)->mapping;
>> +    tail->pgmap = head->pgmap;
> 
> Most of this doesn't look like it should be the responsibility of this callback.
> 
> Setting up a new folio after the split (messing with compound pages etc) really should not be the responsibility of this callback.
> 
> So no, this looks misplaced.
> 

We do need a callback for drivers to do the right thing. In this case if you look at lib/test_hmm.c,
device pages are emulated via backing pages (real folios allocated from system memory). Hence, you
see all the changes here. I can try and simplify this going forward.

>> +}
>> +
>>   static const struct dev_pagemap_ops dmirror_devmem_ops = {
>>       .page_free    = dmirror_devmem_free,
>>       .migrate_to_ram    = dmirror_devmem_fault,
>>       .page_free    = dmirror_devmem_free,
>> +    .folio_split    = dmirror_devmem_folio_split,
>>   };
>>     static int dmirror_device_init(struct dmirror_device *mdevice, int id)
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 995ac8be5709..518a70d1b58a 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -3655,6 +3655,11 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
>>                           MTHP_STAT_NR_ANON, 1);
>>               }
>>   +            if (folio_is_device_private(origin_folio) &&
>> +                    origin_folio->pgmap->ops->folio_split)
>> +                origin_folio->pgmap->ops->folio_split(
>> +                    origin_folio, release);
> 
> Absolutely ugly. I think we need a wrapper for the
> 
Will do
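
Something like the following helper is what I have in mind (sketch only; the name is a
placeholder):

static inline void zone_device_private_folio_split(struct folio *original_folio,
                                                   struct folio *new_folio)
{
    /* keeps the pgmap->ops dereference out of the split loop */
    if (folio_is_device_private(original_folio) &&
        original_folio->pgmap->ops->folio_split)
        original_folio->pgmap->ops->folio_split(original_folio, new_folio);
}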

>> +
>>               /*
>>                * Unfreeze refcount first. Additional reference from
>>                * page cache.
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 17ea8fb27cbf..563f7e39aa79 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -573,6 +573,7 @@ void prep_compound_page(struct page *page, unsigned int order)
>>         prep_compound_head(page, order);
>>   }
>> +EXPORT_SYMBOL_GPL(prep_compound_page);
> 
> Hmmm, that is questionable. We don't want arbitrary modules to make use of that.
> 
> Another sign that you are exposing the wrong functionality/interface (folio_split) to modules.
> 

prep_compound_page() is required for generic THP support. In our case the driver, lib/test_hmm.c, has no
real device pages, just regular system folios backing them. When the split occurs, we need to ensure
the pgmap entries are correct, the mapping is right and the backing folio is set to the right order.

I tried copying the pages to new folios (but I can't allocate in the split context); I'll see
if I can get rid of this requirement.

Thanks,
Balbir Singh




end of thread

Thread overview: 38+ messages
2025-03-06  4:42 [RFC 00/11] THP support for zone device pages Balbir Singh
2025-03-06  4:42 ` [RFC 01/11] mm/zone_device: support large zone device private folios Balbir Singh
2025-03-06 23:02   ` Alistair Popple
2025-07-08 13:37   ` David Hildenbrand
2025-07-09  5:25     ` Balbir Singh
2025-03-06  4:42 ` [RFC 02/11] mm/migrate_device: flags for selecting device private THP pages Balbir Singh
2025-07-08 13:41   ` David Hildenbrand
2025-07-09  5:25     ` Balbir Singh
2025-03-06  4:42 ` [RFC 03/11] mm/thp: zone_device awareness in THP handling code Balbir Singh
2025-07-08 14:10   ` David Hildenbrand
2025-07-09  6:06     ` Alistair Popple
2025-07-09 12:30     ` Balbir Singh
2025-03-06  4:42 ` [RFC 04/11] mm/migrate_device: THP migration of zone device pages Balbir Singh
2025-03-06  9:24   ` Mika Penttilä
2025-03-06 21:35     ` Balbir Singh
2025-03-06  4:42 ` [RFC 05/11] mm/memory/fault: Add support for zone device THP fault handling Balbir Singh
2025-07-08 14:40   ` David Hildenbrand
2025-07-09 23:26     ` Balbir Singh
2025-03-06  4:42 ` [RFC 06/11] lib/test_hmm: test cases and support for zone device private THP Balbir Singh
2025-03-06  4:42 ` [RFC 07/11] mm/memremap: Add folio_split support Balbir Singh
2025-03-06  8:16   ` Mika Penttilä
2025-03-06 21:42     ` Balbir Singh
2025-03-06 22:36   ` Alistair Popple
2025-07-08 14:31   ` David Hildenbrand
2025-07-09 23:34     ` Balbir Singh
2025-03-06  4:42 ` [RFC 08/11] mm/thp: add split during migration support Balbir Singh
2025-07-08 14:38   ` David Hildenbrand
2025-07-08 14:46     ` Zi Yan
2025-07-08 14:53       ` David Hildenbrand
2025-03-06  4:42 ` [RFC 09/11] lib/test_hmm: add test case for split pages Balbir Singh
2025-03-06  4:42 ` [RFC 10/11] selftests/mm/hmm-tests: new tests for zone device THP migration Balbir Singh
2025-03-06  4:42 ` [RFC 11/11] gpu/drm/nouveau: Add THP migration support Balbir Singh
2025-03-06 23:08 ` [RFC 00/11] THP support for zone device pages Matthew Brost
2025-03-06 23:20   ` Balbir Singh
2025-07-04 13:52     ` Francois Dugast
2025-07-04 16:17       ` Zi Yan
2025-07-06  1:25         ` Balbir Singh
2025-07-06 16:34           ` Francois Dugast
