* [v3 00/11] mm: support device-private THP
@ 2025-08-12  2:40 Balbir Singh
  2025-08-12  2:40 ` [v3 01/11] mm/zone_device: support large zone device private folios Balbir Singh
                   ` (10 more replies)
  0 siblings, 11 replies; 36+ messages in thread
From: Balbir Singh @ 2025-08-12  2:40 UTC (permalink / raw)
  To: dri-devel, linux-mm, linux-kernel
  Cc: Balbir Singh, Andrew Morton, David Hildenbrand, Zi Yan,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Oscar Salvador, Lorenzo Stoakes, Baolin Wang,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Ralph Campbell, Mika Penttilä, Matthew Brost,
	Francois Dugast

This patch series adds support for THP folios of zone device pages
and supports their migration.

To do so, the patches add support for setting up larger order zone
device folios. Larger order pages provide a speedup in both throughput
and latency.

In my local testing (using lib/test_hmm and a throughput test), the
series shows a 350% improvement in data transfer throughput and a
500% improvement in latency.

These patches build on the earlier posts by Ralph Campbell [1].

Two new flags are added to the migrate_vma interface to select and mark
compound pages. migrate_vma_setup(), migrate_vma_pages() and
migrate_vma_finalize() support migration of these pages when
MIGRATE_VMA_SELECT_COMPOUND is passed as an argument.
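
As an illustration only, here is a minimal driver-side sketch of the
new flow (vma, start, src_pfns, dst_pfns, drvdata and huge_dpage are
placeholders for driver-specific state, not names from this series):

        struct migrate_vma args = {
                .vma            = vma,
                .start          = start,        /* HPAGE_PMD_SIZE aligned */
                .end            = start + HPAGE_PMD_SIZE,
                .src            = src_pfns,
                .dst            = dst_pfns,
                .pgmap_owner    = drvdata,
                .flags          = MIGRATE_VMA_SELECT_SYSTEM |
                                  MIGRATE_VMA_SELECT_COMPOUND,
        };

        if (migrate_vma_setup(&args))
                return -EINVAL;

        /* a compound source is reported via MIGRATE_PFN_COMPOUND in src[0] */
        if (args.src[0] & MIGRATE_PFN_COMPOUND)
                args.dst[0] = migrate_pfn(page_to_pfn(huge_dpage)) |
                              MIGRATE_PFN_COMPOUND;

        migrate_vma_pages(&args);
        migrate_vma_finalize(&args);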

The series also adds zone device awareness to (m)THP pages along
with fault handling of large zone device private pages. The page vma
walk and the rmap code are also zone device aware. Support has also
been added for folios that might need to be split in the middle of
migration (when the src and dst do not agree on MIGRATE_PFN_COMPOUND),
which occurs when the src side of the migration can migrate large
pages but the destination has not been able to allocate large pages.
The code also uses folio_split() when migrating THP pages; this is the
path taken when MIGRATE_VMA_SELECT_COMPOUND is not passed as an
argument to migrate_vma_setup().
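
Purely as a hedged sketch of that destination-side fallback
(alloc_device_huge_page() and alloc_device_page() are hypothetical
driver helpers, not part of this series):

        unsigned long i;

        if (args.src[0] & MIGRATE_PFN_COMPOUND) {
                struct page *dpage = alloc_device_huge_page();

                if (dpage) {
                        args.dst[0] = migrate_pfn(page_to_pfn(dpage)) |
                                      MIGRATE_PFN_COMPOUND;
                } else {
                        /*
                         * No large destination page: leaving
                         * MIGRATE_PFN_COMPOUND clear in dst lets the core
                         * code split the source folio; fill the PAGE_SIZE
                         * slots individually instead.
                         */
                        for (i = 0; i < HPAGE_PMD_NR; i++)
                                args.dst[i] = migrate_pfn(page_to_pfn(
                                                      alloc_device_page()));
                }
        }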

The test infrastructure lib/test_hmm.c has been enhanced to support THP
migration. A new ioctl to emulate failure of large page allocations has
been added to test the folio split code path. hmm-tests.c has new test
cases for huge page migration and for the folio split path. A new
throughput test has been added as well.

The nouveau dmem code has been enhanced to use the new THP migration
capability.

mTHP support:

The patches hard code HPAGE_PMD_NR in a few places, but the code has
been kept generic to support various order sizes. With additional
refactoring, support for different order sizes should be possible.

The future plan is to post enhancements to support mTHP with a rough
design as follows:

1. Add the notion of allowable thp orders to the HMM based test driver
2. For non PMD based THP paths in migrate_device.c, check to see if
   a suitable order is found and supported by the driver
3. Iterate across orders to check the highest supported order for migration
4. Migrate and finalize

The mTHP patches can be built on top of this series; the key design
elements that need to be worked out are infrastructure and driver
support for pages of multiple orders and their migration.
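
Purely as an illustrative sketch of that rough design (none of the
helpers below exist yet; dmirror_supported_orders() and
migrate_vma_collect_order() are hypothetical names):

        unsigned long supported = dmirror_supported_orders();
        int order;

        for (order = HPAGE_PMD_ORDER; order > 0; order--) {
                if (!(supported & BIT(order)))
                        continue;
                /* try to collect a folio of this order for migration */
                if (migrate_vma_collect_order(migrate, addr, order) == 0)
                        break;
        }
        /* order == 0 here means falling back to PAGE_SIZE migration */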

HMM support for large folios:

Francois Dugast posted patches adding support for HMM handling [4];
the proposed changes can build on top of this series to provide
support for HMM fault handling.

References:
[1] https://lore.kernel.org/linux-mm/20201106005147.20113-1-rcampbell@nvidia.com/
[2] https://lore.kernel.org/linux-mm/20250306044239.3874247-3-balbirs@nvidia.com/T/
[3] https://lore.kernel.org/lkml/20250703233511.2028395-1-balbirs@nvidia.com/
[4] https://lore.kernel.org/lkml/20250722193445.1588348-1-francois.dugast@intel.com/
[5] https://lore.kernel.org/lkml/20250730092139.3890844-1-balbirs@nvidia.com/

These patches are built on top of mm/mm-stable.

Cc: Andrew Morton <akpm@linux-foundation.org> 
Cc: David Hildenbrand <david@redhat.com> 
Cc: Zi Yan <ziy@nvidia.com>  
Cc: Joshua Hahn <joshua.hahnjy@gmail.com> 
Cc: Rakie Kim <rakie.kim@sk.com> 
Cc: Byungchul Park <byungchul@sk.com> 
Cc: Gregory Price <gourry@gourry.net> 
Cc: Ying Huang <ying.huang@linux.alibaba.com> 
Cc: Alistair Popple <apopple@nvidia.com> 
Cc: Oscar Salvador <osalvador@suse.de> 
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> 
Cc: Baolin Wang <baolin.wang@linux.alibaba.com> 
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> 
Cc: Nico Pache <npache@redhat.com> 
Cc: Ryan Roberts <ryan.roberts@arm.com> 
Cc: Dev Jain <dev.jain@arm.com> 
Cc: Barry Song <baohua@kernel.org> 
Cc: Lyude Paul <lyude@redhat.com> 
Cc: Danilo Krummrich <dakr@kernel.org> 
Cc: David Airlie <airlied@gmail.com> 
Cc: Simona Vetter <simona@ffwll.ch> 
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>

Changelog v3 [5]:
- Addressed review comments
  - No more split_device_private_folio() helper
  - Device private large folios do not end up on deferred scan lists
  - Removed THP size order checks when initializing zone device folio
  - Fixed bugs reported by kernel test robot (Thanks!)

Changelog v2 [3]:
- Several review comments from David Hildenbrand were addressed; Mika,
  Zi and Matthew also provided helpful review comments
  - In paths where it makes sense a new helper
    is_pmd_device_private_entry() is used
  - anon_exclusive handling of zone device private pages in
    split_huge_pmd_locked() has been fixed
  - Patches that introduced helpers have been folded into where they
    are used
- Zone device handling in mm/huge_memory.c has benefited from the code
  and testing of Matthew Brost; he helped find bugs related to
  copy_huge_pmd() and partial unmapping of folios.
- Zone device THP PMD support via page_vma_mapped_walk() is restricted
  to try_to_migrate_one()
- There is a new dedicated helper to split large zone device folios

Changelog v1 [2]:
- Support for handling fault_folio and using trylock in the fault path
- A new test case has been added to measure the throughput improvement
- General refactoring of code to keep up with the changes in mm
- New split folio callback when the entire split is complete/done. The
  callback is used to know when the head order needs to be reset.

Testing:
- Testing was done with ZONE_DEVICE private pages on an x86 VM

Balbir Singh (11):
  mm/zone_device: support large zone device private folios
  mm/thp: zone_device awareness in THP handling code
  mm/migrate_device: THP migration of zone device pages
  mm/memory/fault: add support for zone device THP fault handling
  lib/test_hmm: test cases and support for zone device private THP
  mm/memremap: add folio_split support
  mm/thp: add split during migration support
  lib/test_hmm: add test case for split pages
  selftests/mm/hmm-tests: new tests for zone device THP migration
  gpu/drm/nouveau: add THP migration support
  selftests/mm/hmm-tests: new throughput tests including THP

 drivers/gpu/drm/nouveau/nouveau_dmem.c | 253 ++++++++---
 drivers/gpu/drm/nouveau/nouveau_svm.c  |   6 +-
 drivers/gpu/drm/nouveau/nouveau_svm.h  |   3 +-
 include/linux/huge_mm.h                |  18 +-
 include/linux/memremap.h               |  51 ++-
 include/linux/migrate.h                |   2 +
 include/linux/mm.h                     |   1 +
 include/linux/rmap.h                   |   2 +
 include/linux/swapops.h                |  17 +
 lib/test_hmm.c                         | 443 ++++++++++++++----
 lib/test_hmm_uapi.h                    |   3 +
 mm/huge_memory.c                       | 297 +++++++++---
 mm/memory.c                            |   6 +-
 mm/memremap.c                          |  38 +-
 mm/migrate_device.c                    | 567 ++++++++++++++++++++---
 mm/page_vma_mapped.c                   |  13 +-
 mm/pgtable-generic.c                   |   6 +
 mm/rmap.c                              |  28 +-
 tools/testing/selftests/mm/hmm-tests.c | 607 ++++++++++++++++++++++++-
 19 files changed, 2039 insertions(+), 322 deletions(-)

-- 
2.50.1




* [v3 01/11] mm/zone_device: support large zone device private folios
  2025-08-12  2:40 [v3 00/11] mm: support device-private THP Balbir Singh
@ 2025-08-12  2:40 ` Balbir Singh
  2025-08-26 14:22   ` David Hildenbrand
  2025-08-12  2:40 ` [v3 02/11] mm/thp: zone_device awareness in THP handling code Balbir Singh
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 36+ messages in thread
From: Balbir Singh @ 2025-08-12  2:40 UTC (permalink / raw)
  To: dri-devel, linux-mm, linux-kernel
  Cc: Balbir Singh, Andrew Morton, David Hildenbrand, Zi Yan,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Oscar Salvador, Lorenzo Stoakes, Baolin Wang,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Ralph Campbell, Mika Penttilä, Matthew Brost,
	Francois Dugast

Add routines to support allocation of large order zone device folios,
along with helper functions to check if a folio is device private and
to set zone device data.

When large folios are used, the existing page_free() callback in
pgmap is called when the folio is freed; this is true for both
PAGE_SIZE and higher order pages.

Zone device private large folios do not support deferred split and
scan like normal THP folios.
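
For reference, a sketch of how a driver might use the new entry point
for a PMD-sized device private folio (dmem_get_free_pages() and
drvdata are driver-specific placeholders; order-0 users keep calling
zone_device_page_init()):

        struct page *page = dmem_get_free_pages(HPAGE_PMD_NR);
        struct folio *folio = page_folio(page);

        /* takes 1 << order pgmap references, sets the refcount, locks */
        zone_device_folio_init(folio, HPAGE_PMD_ORDER);
        page->zone_device_data = drvdata;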

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Ying Huang <ying.huang@linux.alibaba.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>

Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
 include/linux/memremap.h | 10 +++++++++-
 mm/memremap.c            | 38 +++++++++++++++++++++++++-------------
 mm/rmap.c                |  6 +++++-
 3 files changed, 39 insertions(+), 15 deletions(-)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 4aa151914eab..a0723b35eeaa 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -199,7 +199,7 @@ static inline bool folio_is_fsdax(const struct folio *folio)
 }
 
 #ifdef CONFIG_ZONE_DEVICE
-void zone_device_page_init(struct page *page);
+void zone_device_folio_init(struct folio *folio, unsigned int order);
 void *memremap_pages(struct dev_pagemap *pgmap, int nid);
 void memunmap_pages(struct dev_pagemap *pgmap);
 void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
@@ -209,6 +209,14 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);
 
 unsigned long memremap_compat_align(void);
+
+static inline void zone_device_page_init(struct page *page)
+{
+	struct folio *folio = page_folio(page);
+
+	zone_device_folio_init(folio, 0);
+}
+
 #else
 static inline void *devm_memremap_pages(struct device *dev,
 		struct dev_pagemap *pgmap)
diff --git a/mm/memremap.c b/mm/memremap.c
index b0ce0d8254bd..13e87dd743ad 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -427,20 +427,19 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
 void free_zone_device_folio(struct folio *folio)
 {
 	struct dev_pagemap *pgmap = folio->pgmap;
+	unsigned long nr = folio_nr_pages(folio);
+	int i;
 
 	if (WARN_ON_ONCE(!pgmap))
 		return;
 
 	mem_cgroup_uncharge(folio);
 
-	/*
-	 * Note: we don't expect anonymous compound pages yet. Once supported
-	 * and we could PTE-map them similar to THP, we'd have to clear
-	 * PG_anon_exclusive on all tail pages.
-	 */
 	if (folio_test_anon(folio)) {
-		VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
-		__ClearPageAnonExclusive(folio_page(folio, 0));
+		for (i = 0; i < nr; i++)
+			__ClearPageAnonExclusive(folio_page(folio, i));
+	} else {
+		VM_WARN_ON_ONCE(folio_test_large(folio));
 	}
 
 	/*
@@ -464,11 +463,15 @@ void free_zone_device_folio(struct folio *folio)
 
 	switch (pgmap->type) {
 	case MEMORY_DEVICE_PRIVATE:
+		percpu_ref_put_many(&folio->pgmap->ref, nr);
+		pgmap->ops->page_free(&folio->page);
+		folio->page.mapping = NULL;
+		break;
 	case MEMORY_DEVICE_COHERENT:
 		if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->page_free))
 			break;
-		pgmap->ops->page_free(folio_page(folio, 0));
-		put_dev_pagemap(pgmap);
+		pgmap->ops->page_free(&folio->page);
+		percpu_ref_put(&folio->pgmap->ref);
 		break;
 
 	case MEMORY_DEVICE_GENERIC:
@@ -491,14 +494,23 @@ void free_zone_device_folio(struct folio *folio)
 	}
 }
 
-void zone_device_page_init(struct page *page)
+void zone_device_folio_init(struct folio *folio, unsigned int order)
 {
+	struct page *page = folio_page(folio, 0);
+
+	VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);
+
 	/*
 	 * Drivers shouldn't be allocating pages after calling
 	 * memunmap_pages().
 	 */
-	WARN_ON_ONCE(!percpu_ref_tryget_live(&page_pgmap(page)->ref));
-	set_page_count(page, 1);
+	WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << order));
+	folio_set_count(folio, 1);
 	lock_page(page);
+
+	if (order > 1) {
+		prep_compound_page(page, order);
+		folio_set_large_rmappable(folio);
+	}
 }
-EXPORT_SYMBOL_GPL(zone_device_page_init);
+EXPORT_SYMBOL_GPL(zone_device_folio_init);
diff --git a/mm/rmap.c b/mm/rmap.c
index 568198e9efc2..b5837075b6e0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1769,9 +1769,13 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
 	 * the folio is unmapped and at least one page is still mapped.
 	 *
 	 * Check partially_mapped first to ensure it is a large folio.
+	 *
+	 * Device private folios do not support deferred splitting and
+	 * shrinker based scanning of the folios to free.
 	 */
 	if (partially_mapped && folio_test_anon(folio) &&
-	    !folio_test_partially_mapped(folio))
+	    !folio_test_partially_mapped(folio) &&
+		!folio_is_device_private(folio))
 		deferred_split_folio(folio, true);
 
 	__folio_mod_stat(folio, -nr, -nr_pmdmapped);
-- 
2.50.1




* [v3 02/11] mm/thp: zone_device awareness in THP handling code
  2025-08-12  2:40 [v3 00/11] mm: support device-private THP Balbir Singh
  2025-08-12  2:40 ` [v3 01/11] mm/zone_device: support large zone device private folios Balbir Singh
@ 2025-08-12  2:40 ` Balbir Singh
  2025-08-12 14:47   ` kernel test robot
                     ` (2 more replies)
  2025-08-12  2:40 ` [v3 03/11] mm/migrate_device: THP migration of zone device pages Balbir Singh
                   ` (8 subsequent siblings)
  10 siblings, 3 replies; 36+ messages in thread
From: Balbir Singh @ 2025-08-12  2:40 UTC (permalink / raw)
  To: dri-devel, linux-mm, linux-kernel
  Cc: Balbir Singh, Andrew Morton, David Hildenbrand, Zi Yan,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Oscar Salvador, Lorenzo Stoakes, Baolin Wang,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Ralph Campbell, Mika Penttilä, Matthew Brost,
	Francois Dugast

Make the THP handling code in the mm subsystem aware of zone device
pages. Although the code is designed to be generic when it comes to
handling splitting of pages, it currently only works for THP page
sizes corresponding to HPAGE_PMD_NR.

Modify page_vma_mapped_walk() to return true when a zone device huge
entry is present, enabling try_to_migrate() and other migration code
paths to appropriately process the entry. page_vma_mapped_walk() will
return true for zone device private large folios only when
PVMW_THP_DEVICE_PRIVATE is passed. This prevents callers that do not
handle zone device private pages from having to add awareness. The key
callback that needs this flag is try_to_migrate_one(). The other
callbacks (page idle, damon) use the walk to set young/dirty bits,
which is not significant when it comes to pmd level bit harvesting.

pmd_pfn() does not work well with zone device entries; for zone device
entries, derive the pfn from the swap entry instead (via
pmd_to_swp_entry() and swp_offset_pfn()) for checking and comparison.
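
As a short sketch of that pattern (mirroring the rmap change in this
patch):

        if (is_pmd_device_private_entry(*pvmw.pmd)) {
                swp_entry_t entry = pmd_to_swp_entry(*pvmw.pmd);

                pfn = swp_offset_pfn(entry);
        } else {
                pfn = pmd_pfn(*pvmw.pmd);
        }
        subpage = folio_page(folio, pfn - folio_pfn(folio));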

Support partial unmapping of zone device private entries, which happens
via munmap(). munmap() causes the device private entry pmd to be split,
but the corresponding folio is not split. Deferred split does not work for
zone device private folios due to the need to split during fault
handling. Get migrate_vma_collect_pmd() to handle this case by splitting
partially unmapped device private folios.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Ying Huang <ying.huang@linux.alibaba.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
 include/linux/rmap.h    |   2 +
 include/linux/swapops.h |  17 ++++
 lib/test_hmm.c          |   2 +-
 mm/huge_memory.c        | 214 +++++++++++++++++++++++++++++++---------
 mm/migrate_device.c     |  47 +++++++++
 mm/page_vma_mapped.c    |  13 ++-
 mm/pgtable-generic.c    |   6 ++
 mm/rmap.c               |  24 ++++-
 8 files changed, 272 insertions(+), 53 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 6cd020eea37a..dfb7aae3d77b 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -927,6 +927,8 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
 #define PVMW_SYNC		(1 << 0)
 /* Look for migration entries rather than present PTEs */
 #define PVMW_MIGRATION		(1 << 1)
+/* Look for device private THP entries */
+#define PVMW_THP_DEVICE_PRIVATE	(1 << 2)
 
 struct page_vma_mapped_walk {
 	unsigned long pfn;
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 64ea151a7ae3..2641c01bd5d2 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -563,6 +563,7 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
 {
 	return is_swap_pmd(pmd) && is_migration_entry(pmd_to_swp_entry(pmd));
 }
+
 #else  /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
 static inline int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
 		struct page *page)
@@ -594,6 +595,22 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
 }
 #endif  /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
 
+#if defined(CONFIG_ZONE_DEVICE) && defined(CONFIG_ARCH_ENABLE_THP_MIGRATION)
+
+static inline int is_pmd_device_private_entry(pmd_t pmd)
+{
+	return is_swap_pmd(pmd) && is_device_private_entry(pmd_to_swp_entry(pmd));
+}
+
+#else /* CONFIG_ZONE_DEVICE && CONFIG_ARCH_ENABLE_THP_MIGRATION */
+
+static inline int is_pmd_device_private_entry(pmd_t pmd)
+{
+	return 0;
+}
+
+#endif /* CONFIG_ZONE_DEVICE && CONFIG_ARCH_ENABLE_THP_MIGRATION */
+
 static inline int non_swap_entry(swp_entry_t entry)
 {
 	return swp_type(entry) >= MAX_SWAPFILES;
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 761725bc713c..297f1e034045 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -1408,7 +1408,7 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
 	 * the mirror but here we use it to hold the page for the simulated
 	 * device memory and that page holds the pointer to the mirror.
 	 */
-	rpage = vmf->page->zone_device_data;
+	rpage = folio_page(page_folio(vmf->page), 0)->zone_device_data;
 	dmirror = rpage->zone_device_data;
 
 	/* FIXME demonstrate how we can adjust migrate range */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9c38a95e9f09..2495e3fdbfae 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1711,8 +1711,11 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	if (unlikely(is_swap_pmd(pmd))) {
 		swp_entry_t entry = pmd_to_swp_entry(pmd);
 
-		VM_BUG_ON(!is_pmd_migration_entry(pmd));
-		if (!is_readable_migration_entry(entry)) {
+		VM_WARN_ON(!is_pmd_migration_entry(pmd) &&
+				!is_pmd_device_private_entry(pmd));
+
+		if (is_migration_entry(entry) &&
+			is_writable_migration_entry(entry)) {
 			entry = make_readable_migration_entry(
 							swp_offset(entry));
 			pmd = swp_entry_to_pmd(entry);
@@ -1722,6 +1725,32 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 				pmd = pmd_swp_mkuffd_wp(pmd);
 			set_pmd_at(src_mm, addr, src_pmd, pmd);
 		}
+
+		if (is_device_private_entry(entry)) {
+			if (is_writable_device_private_entry(entry)) {
+				entry = make_readable_device_private_entry(
+					swp_offset(entry));
+				pmd = swp_entry_to_pmd(entry);
+
+				if (pmd_swp_soft_dirty(*src_pmd))
+					pmd = pmd_swp_mksoft_dirty(pmd);
+				if (pmd_swp_uffd_wp(*src_pmd))
+					pmd = pmd_swp_mkuffd_wp(pmd);
+				set_pmd_at(src_mm, addr, src_pmd, pmd);
+			}
+
+			src_folio = pfn_swap_entry_folio(entry);
+			VM_WARN_ON(!folio_test_large(src_folio));
+
+			folio_get(src_folio);
+			/*
+			 * folio_try_dup_anon_rmap_pmd does not fail for
+			 * device private entries.
+			 */
+			VM_WARN_ON(folio_try_dup_anon_rmap_pmd(src_folio,
+					  &src_folio->page, dst_vma, src_vma));
+		}
+
 		add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
 		mm_inc_nr_ptes(dst_mm);
 		pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
@@ -2219,15 +2248,22 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			folio_remove_rmap_pmd(folio, page, vma);
 			WARN_ON_ONCE(folio_mapcount(folio) < 0);
 			VM_BUG_ON_PAGE(!PageHead(page), page);
-		} else if (thp_migration_supported()) {
+		} else if (is_pmd_migration_entry(orig_pmd) ||
+				is_pmd_device_private_entry(orig_pmd)) {
 			swp_entry_t entry;
 
-			VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
 			entry = pmd_to_swp_entry(orig_pmd);
 			folio = pfn_swap_entry_folio(entry);
 			flush_needed = 0;
-		} else
-			WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
+
+			if (!thp_migration_supported())
+				WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
+
+			if (is_pmd_device_private_entry(orig_pmd)) {
+				folio_remove_rmap_pmd(folio, &folio->page, vma);
+				WARN_ON_ONCE(folio_mapcount(folio) < 0);
+			}
+		}
 
 		if (folio_test_anon(folio)) {
 			zap_deposited_table(tlb->mm, pmd);
@@ -2247,6 +2283,15 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 				folio_mark_accessed(folio);
 		}
 
+		/*
+		 * Do a folio put on zone device private pages after
+		 * changes to mm_counter, because the folio_put() will
+		 * clean folio->mapping and the folio_test_anon() check
+		 * will not be usable.
+		 */
+		if (folio_is_device_private(folio))
+			folio_put(folio);
+
 		spin_unlock(ptl);
 		if (flush_needed)
 			tlb_remove_page_size(tlb, &folio->page, HPAGE_PMD_SIZE);
@@ -2375,7 +2420,8 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		struct folio *folio = pfn_swap_entry_folio(entry);
 		pmd_t newpmd;
 
-		VM_BUG_ON(!is_pmd_migration_entry(*pmd));
+		VM_WARN_ON(!is_pmd_migration_entry(*pmd) &&
+			   !folio_is_device_private(folio));
 		if (is_writable_migration_entry(entry)) {
 			/*
 			 * A protection check is difficult so
@@ -2388,6 +2434,10 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			newpmd = swp_entry_to_pmd(entry);
 			if (pmd_swp_soft_dirty(*pmd))
 				newpmd = pmd_swp_mksoft_dirty(newpmd);
+		} else if (is_writable_device_private_entry(entry)) {
+			entry = make_readable_device_private_entry(
+							swp_offset(entry));
+			newpmd = swp_entry_to_pmd(entry);
 		} else {
 			newpmd = *pmd;
 		}
@@ -2842,16 +2892,19 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	struct page *page;
 	pgtable_t pgtable;
 	pmd_t old_pmd, _pmd;
-	bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
-	bool anon_exclusive = false, dirty = false;
+	bool young, write, soft_dirty, uffd_wp = false;
+	bool anon_exclusive = false, dirty = false, present = false;
 	unsigned long addr;
 	pte_t *pte;
 	int i;
+	swp_entry_t swp_entry;
 
 	VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
 	VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
 	VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma);
-	VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd));
+
+	VM_WARN_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd)
+			&& !(is_pmd_device_private_entry(*pmd)));
 
 	count_vm_event(THP_SPLIT_PMD);
 
@@ -2899,18 +2952,45 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		return __split_huge_zero_page_pmd(vma, haddr, pmd);
 	}
 
-	pmd_migration = is_pmd_migration_entry(*pmd);
-	if (unlikely(pmd_migration)) {
-		swp_entry_t entry;
 
+	present = pmd_present(*pmd);
+	if (unlikely(!present)) {
+		swp_entry = pmd_to_swp_entry(*pmd);
 		old_pmd = *pmd;
-		entry = pmd_to_swp_entry(old_pmd);
-		page = pfn_swap_entry_to_page(entry);
-		write = is_writable_migration_entry(entry);
-		if (PageAnon(page))
-			anon_exclusive = is_readable_exclusive_migration_entry(entry);
-		young = is_migration_entry_young(entry);
-		dirty = is_migration_entry_dirty(entry);
+
+		folio = pfn_swap_entry_folio(swp_entry);
+		VM_WARN_ON(!is_migration_entry(swp_entry) &&
+				!is_device_private_entry(swp_entry));
+		page = pfn_swap_entry_to_page(swp_entry);
+
+		if (is_pmd_migration_entry(old_pmd)) {
+			write = is_writable_migration_entry(swp_entry);
+			if (PageAnon(page))
+				anon_exclusive =
+					is_readable_exclusive_migration_entry(
+								swp_entry);
+			young = is_migration_entry_young(swp_entry);
+			dirty = is_migration_entry_dirty(swp_entry);
+		} else if (is_pmd_device_private_entry(old_pmd)) {
+			write = is_writable_device_private_entry(swp_entry);
+			anon_exclusive = PageAnonExclusive(page);
+			if (freeze && anon_exclusive &&
+			    folio_try_share_anon_rmap_pmd(folio, page))
+				freeze = false;
+			if (!freeze) {
+				rmap_t rmap_flags = RMAP_NONE;
+
+				if (anon_exclusive)
+					rmap_flags |= RMAP_EXCLUSIVE;
+
+				folio_ref_add(folio, HPAGE_PMD_NR - 1);
+				if (anon_exclusive)
+					rmap_flags |= RMAP_EXCLUSIVE;
+				folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
+						 vma, haddr, rmap_flags);
+			}
+		}
+
 		soft_dirty = pmd_swp_soft_dirty(old_pmd);
 		uffd_wp = pmd_swp_uffd_wp(old_pmd);
 	} else {
@@ -2996,30 +3076,49 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	 * Note that NUMA hinting access restrictions are not transferred to
 	 * avoid any possibility of altering permissions across VMAs.
 	 */
-	if (freeze || pmd_migration) {
+	if (freeze || !present) {
 		for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
 			pte_t entry;
-			swp_entry_t swp_entry;
-
-			if (write)
-				swp_entry = make_writable_migration_entry(
-							page_to_pfn(page + i));
-			else if (anon_exclusive)
-				swp_entry = make_readable_exclusive_migration_entry(
-							page_to_pfn(page + i));
-			else
-				swp_entry = make_readable_migration_entry(
-							page_to_pfn(page + i));
-			if (young)
-				swp_entry = make_migration_entry_young(swp_entry);
-			if (dirty)
-				swp_entry = make_migration_entry_dirty(swp_entry);
-			entry = swp_entry_to_pte(swp_entry);
-			if (soft_dirty)
-				entry = pte_swp_mksoft_dirty(entry);
-			if (uffd_wp)
-				entry = pte_swp_mkuffd_wp(entry);
-
+			if (freeze || is_migration_entry(swp_entry)) {
+				if (write)
+					swp_entry = make_writable_migration_entry(
+								page_to_pfn(page + i));
+				else if (anon_exclusive)
+					swp_entry = make_readable_exclusive_migration_entry(
+								page_to_pfn(page + i));
+				else
+					swp_entry = make_readable_migration_entry(
+								page_to_pfn(page + i));
+				if (young)
+					swp_entry = make_migration_entry_young(swp_entry);
+				if (dirty)
+					swp_entry = make_migration_entry_dirty(swp_entry);
+				entry = swp_entry_to_pte(swp_entry);
+				if (soft_dirty)
+					entry = pte_swp_mksoft_dirty(entry);
+				if (uffd_wp)
+					entry = pte_swp_mkuffd_wp(entry);
+			} else {
+				/*
+				 * anon_exclusive was already propagated to the relevant
+				 * pages corresponding to the pte entries when freeze
+				 * is false.
+				 */
+				if (write)
+					swp_entry = make_writable_device_private_entry(
+								page_to_pfn(page + i));
+				else
+					swp_entry = make_readable_device_private_entry(
+								page_to_pfn(page + i));
+				/*
+				 * Young and dirty bits are not progated via swp_entry
+				 */
+				entry = swp_entry_to_pte(swp_entry);
+				if (soft_dirty)
+					entry = pte_swp_mksoft_dirty(entry);
+				if (uffd_wp)
+					entry = pte_swp_mkuffd_wp(entry);
+			}
 			VM_WARN_ON(!pte_none(ptep_get(pte + i)));
 			set_pte_at(mm, addr, pte + i, entry);
 		}
@@ -3046,7 +3145,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	}
 	pte_unmap(pte);
 
-	if (!pmd_migration)
+	if (present)
 		folio_remove_rmap_pmd(folio, page, vma);
 	if (freeze)
 		put_page(page);
@@ -3058,8 +3157,10 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
 			   pmd_t *pmd, bool freeze)
 {
+
 	VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
-	if (pmd_trans_huge(*pmd) || is_pmd_migration_entry(*pmd))
+	if (pmd_trans_huge(*pmd) || is_pmd_migration_entry(*pmd) ||
+			(is_pmd_device_private_entry(*pmd)))
 		__split_huge_pmd_locked(vma, pmd, address, freeze);
 }
 
@@ -3238,6 +3339,9 @@ static void lru_add_split_folio(struct folio *folio, struct folio *new_folio,
 	VM_BUG_ON_FOLIO(folio_test_lru(new_folio), folio);
 	lockdep_assert_held(&lruvec->lru_lock);
 
+	if (folio_is_device_private(folio))
+		return;
+
 	if (list) {
 		/* page reclaim is reclaiming a huge page */
 		VM_WARN_ON(folio_test_lru(folio));
@@ -3252,6 +3356,7 @@ static void lru_add_split_folio(struct folio *folio, struct folio *new_folio,
 			list_add_tail(&new_folio->lru, &folio->lru);
 		folio_set_lru(new_folio);
 	}
+
 }
 
 /* Racy check whether the huge page can be split */
@@ -3727,7 +3832,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 
 	/* Prevent deferred_split_scan() touching ->_refcount */
 	spin_lock(&ds_queue->split_queue_lock);
-	if (folio_ref_freeze(folio, 1 + extra_pins)) {
+	if (folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio))) {
 		struct address_space *swap_cache = NULL;
 		struct lruvec *lruvec;
 		int expected_refs;
@@ -3858,8 +3963,9 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 	if (nr_shmem_dropped)
 		shmem_uncharge(mapping->host, nr_shmem_dropped);
 
-	if (!ret && is_anon)
+	if (!ret && is_anon && !folio_is_device_private(folio))
 		remap_flags = RMP_USE_SHARED_ZEROPAGE;
+
 	remap_page(folio, 1 << order, remap_flags);
 
 	/*
@@ -4603,7 +4709,10 @@ int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
 		return 0;
 
 	flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
-	pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
+	if (unlikely(is_pmd_device_private_entry(*pvmw->pmd)))
+		pmdval = pmdp_huge_clear_flush(vma, address, pvmw->pmd);
+	else
+		pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
 
 	/* See folio_try_share_anon_rmap_pmd(): invalidate PMD first. */
 	anon_exclusive = folio_test_anon(folio) && PageAnonExclusive(page);
@@ -4653,6 +4762,17 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
 	entry = pmd_to_swp_entry(*pvmw->pmd);
 	folio_get(folio);
 	pmde = folio_mk_pmd(folio, READ_ONCE(vma->vm_page_prot));
+
+	if (folio_is_device_private(folio)) {
+		if (pmd_write(pmde))
+			entry = make_writable_device_private_entry(
+							page_to_pfn(new));
+		else
+			entry = make_readable_device_private_entry(
+							page_to_pfn(new));
+		pmde = swp_entry_to_pmd(entry);
+	}
+
 	if (pmd_swp_soft_dirty(*pvmw->pmd))
 		pmde = pmd_mksoft_dirty(pmde);
 	if (is_writable_migration_entry(entry))
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index e05e14d6eacd..0ed337f94fcd 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -136,6 +136,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			 * page table entry. Other special swap entries are not
 			 * migratable, and we ignore regular swapped page.
 			 */
+			struct folio *folio;
+
 			entry = pte_to_swp_entry(pte);
 			if (!is_device_private_entry(entry))
 				goto next;
@@ -147,6 +149,51 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			    pgmap->owner != migrate->pgmap_owner)
 				goto next;
 
+			folio = page_folio(page);
+			if (folio_test_large(folio)) {
+				struct folio *new_folio;
+				struct folio *new_fault_folio = NULL;
+
+				/*
+				 * The reason for finding pmd present with a
+				 * device private pte and a large folio for the
+				 * pte is partial unmaps. Split the folio now
+				 * for the migration to be handled correctly
+				 */
+				pte_unmap_unlock(ptep, ptl);
+
+				folio_get(folio);
+				if (folio != fault_folio)
+					folio_lock(folio);
+				if (split_folio(folio)) {
+					if (folio != fault_folio)
+						folio_unlock(folio);
+					ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+					goto next;
+				}
+
+				new_folio = page_folio(page);
+				if (fault_folio)
+					new_fault_folio = page_folio(migrate->fault_page);
+
+				/*
+				 * Ensure the lock is held on the correct
+				 * folio after the split
+				 */
+				if (!new_fault_folio) {
+					folio_unlock(folio);
+					folio_put(folio);
+				} else if (folio != new_fault_folio) {
+					folio_get(new_fault_folio);
+					folio_lock(new_fault_folio);
+					folio_unlock(folio);
+					folio_put(folio);
+				}
+
+				addr = start;
+				goto again;
+			}
+
 			mpfn = migrate_pfn(page_to_pfn(page)) |
 					MIGRATE_PFN_MIGRATE;
 			if (is_writable_device_private_entry(entry))
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index e981a1a292d2..246e6c211f34 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -250,12 +250,11 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 			pvmw->ptl = pmd_lock(mm, pvmw->pmd);
 			pmde = *pvmw->pmd;
 			if (!pmd_present(pmde)) {
-				swp_entry_t entry;
+				swp_entry_t entry = pmd_to_swp_entry(pmde);
 
 				if (!thp_migration_supported() ||
 				    !(pvmw->flags & PVMW_MIGRATION))
 					return not_found(pvmw);
-				entry = pmd_to_swp_entry(pmde);
 				if (!is_migration_entry(entry) ||
 				    !check_pmd(swp_offset_pfn(entry), pvmw))
 					return not_found(pvmw);
@@ -277,6 +276,16 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 			 * cannot return prematurely, while zap_huge_pmd() has
 			 * cleared *pmd but not decremented compound_mapcount().
 			 */
+			swp_entry_t entry;
+
+			entry = pmd_to_swp_entry(pmde);
+
+			if (is_device_private_entry(entry) &&
+				(pvmw->flags & PVMW_THP_DEVICE_PRIVATE)) {
+				pvmw->ptl = pmd_lock(mm, pvmw->pmd);
+				return true;
+			}
+
 			if ((pvmw->flags & PVMW_SYNC) &&
 			    thp_vma_suitable_order(vma, pvmw->address,
 						   PMD_ORDER) &&
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 567e2d084071..604e8206a2ec 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -292,6 +292,12 @@ pte_t *___pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
 		*pmdvalp = pmdval;
 	if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
 		goto nomap;
+	if (is_swap_pmd(pmdval)) {
+		swp_entry_t entry = pmd_to_swp_entry(pmdval);
+
+		if (is_device_private_entry(entry))
+			goto nomap;
+	}
 	if (unlikely(pmd_trans_huge(pmdval)))
 		goto nomap;
 	if (unlikely(pmd_bad(pmdval))) {
diff --git a/mm/rmap.c b/mm/rmap.c
index b5837075b6e0..f40e45564295 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2285,7 +2285,8 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 		     unsigned long address, void *arg)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
+	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address,
+				PVMW_THP_DEVICE_PRIVATE);
 	bool anon_exclusive, writable, ret = true;
 	pte_t pteval;
 	struct page *subpage;
@@ -2330,6 +2331,10 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 	while (page_vma_mapped_walk(&pvmw)) {
 		/* PMD-mapped THP migration entry */
 		if (!pvmw.pte) {
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+			unsigned long pfn;
+#endif
+
 			if (flags & TTU_SPLIT_HUGE_PMD) {
 				split_huge_pmd_locked(vma, pvmw.address,
 						      pvmw.pmd, true);
@@ -2338,8 +2343,21 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 				break;
 			}
 #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
-			subpage = folio_page(folio,
-				pmd_pfn(*pvmw.pmd) - folio_pfn(folio));
+			/*
+			 * Zone device private folios do not work well with
+			 * pmd_pfn() on some architectures due to pte
+			 * inversion.
+			 */
+			if (is_pmd_device_private_entry(*pvmw.pmd)) {
+				swp_entry_t entry = pmd_to_swp_entry(*pvmw.pmd);
+
+				pfn = swp_offset_pfn(entry);
+			} else {
+				pfn = pmd_pfn(*pvmw.pmd);
+			}
+
+			subpage = folio_page(folio, pfn - folio_pfn(folio));
+
 			VM_BUG_ON_FOLIO(folio_test_hugetlb(folio) ||
 					!folio_test_pmd_mappable(folio), folio);
 
-- 
2.50.1




* [v3 03/11] mm/migrate_device: THP migration of zone device pages
  2025-08-12  2:40 [v3 00/11] mm: support device-private THP Balbir Singh
  2025-08-12  2:40 ` [v3 01/11] mm/zone_device: support large zone device private folios Balbir Singh
  2025-08-12  2:40 ` [v3 02/11] mm/thp: zone_device awareness in THP handling code Balbir Singh
@ 2025-08-12  2:40 ` Balbir Singh
  2025-08-12  5:35   ` Mika Penttilä
  2025-08-12  2:40 ` [v3 04/11] mm/memory/fault: add support for zone device THP fault handling Balbir Singh
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 36+ messages in thread
From: Balbir Singh @ 2025-08-12  2:40 UTC (permalink / raw)
  To: dri-devel, linux-mm, linux-kernel
  Cc: Balbir Singh, Andrew Morton, David Hildenbrand, Zi Yan,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Oscar Salvador, Lorenzo Stoakes, Baolin Wang,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Ralph Campbell, Mika Penttilä, Matthew Brost,
	Francois Dugast

MIGRATE_VMA_SELECT_COMPOUND will be used to select THP pages during
migrate_vma_setup(), and MIGRATE_PFN_COMPOUND will cause device pages
to be migrated as compound pages during device pfn migration.

migrate_device code paths go through the collect, setup
and finalize phases of migration.

The entries in the src and dst arrays passed to these functions still
remain at a PAGE_SIZE granularity. When a compound page is passed,
the first entry has the PFN along with MIGRATE_PFN_COMPOUND and other
flags set (MIGRATE_PFN_MIGRATE, MIGRATE_PFN_VALID), and the remaining
(HPAGE_PMD_NR - 1) entries are filled with 0's. This representation
allows for the compound page to be split into smaller page sizes.
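
For example, the head slot of a migrating 2M folio ends up looking
roughly like this (MIGRATE_PFN_VALID is set by migrate_pfn() itself;
head_pfn and i are local placeholders):

        src[0] = migrate_pfn(head_pfn) | MIGRATE_PFN_MIGRATE |
                 MIGRATE_PFN_COMPOUND; /* MIGRATE_PFN_WRITE if writable */

        /* tail slots stay zero unless the folio is split later */
        for (i = 1; i < HPAGE_PMD_NR; i++)
                src[i] = 0;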

migrate_vma_collect_hole() and migrate_vma_collect_pmd() are now THP
aware. Two new helper functions, migrate_vma_collect_huge_pmd() and
migrate_vma_insert_huge_pmd_page(), have been added.

migrate_vma_collect_huge_pmd() can collect THP pages, but if for
some reason this fails, there is fallback support to split the folio
and migrate it.

migrate_vma_insert_huge_pmd_page() closely follows the logic of
migrate_vma_insert_page().

Support for splitting pages as needed for migration will follow in
later patches in this series.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Ying Huang <ying.huang@linux.alibaba.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>

Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
 include/linux/migrate.h |   2 +
 mm/migrate_device.c     | 457 ++++++++++++++++++++++++++++++++++------
 2 files changed, 396 insertions(+), 63 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index acadd41e0b5c..d9cef0819f91 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -129,6 +129,7 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
 #define MIGRATE_PFN_VALID	(1UL << 0)
 #define MIGRATE_PFN_MIGRATE	(1UL << 1)
 #define MIGRATE_PFN_WRITE	(1UL << 3)
+#define MIGRATE_PFN_COMPOUND	(1UL << 4)
 #define MIGRATE_PFN_SHIFT	6
 
 static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
@@ -147,6 +148,7 @@ enum migrate_vma_direction {
 	MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
 	MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
 	MIGRATE_VMA_SELECT_DEVICE_COHERENT = 1 << 2,
+	MIGRATE_VMA_SELECT_COMPOUND = 1 << 3,
 };
 
 struct migrate_vma {
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 0ed337f94fcd..6621bba62710 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -14,6 +14,7 @@
 #include <linux/pagewalk.h>
 #include <linux/rmap.h>
 #include <linux/swapops.h>
+#include <asm/pgalloc.h>
 #include <asm/tlbflush.h>
 #include "internal.h"
 
@@ -44,6 +45,23 @@ static int migrate_vma_collect_hole(unsigned long start,
 	if (!vma_is_anonymous(walk->vma))
 		return migrate_vma_collect_skip(start, end, walk);
 
+	if (thp_migration_supported() &&
+		(migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
+		(IS_ALIGNED(start, HPAGE_PMD_SIZE) &&
+		 IS_ALIGNED(end, HPAGE_PMD_SIZE))) {
+		migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE |
+						MIGRATE_PFN_COMPOUND;
+		migrate->dst[migrate->npages] = 0;
+		migrate->npages++;
+		migrate->cpages++;
+
+		/*
+		 * Collect the remaining entries as holes, in case we
+		 * need to split later
+		 */
+		return migrate_vma_collect_skip(start + PAGE_SIZE, end, walk);
+	}
+
 	for (addr = start; addr < end; addr += PAGE_SIZE) {
 		migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE;
 		migrate->dst[migrate->npages] = 0;
@@ -54,57 +72,151 @@ static int migrate_vma_collect_hole(unsigned long start,
 	return 0;
 }
 
-static int migrate_vma_collect_pmd(pmd_t *pmdp,
-				   unsigned long start,
-				   unsigned long end,
-				   struct mm_walk *walk)
+/**
+ * migrate_vma_collect_huge_pmd - collect THP pages without splitting the
+ * folio for device private pages.
+ * @pmdp: pointer to pmd entry
+ * @start: start address of the range for migration
+ * @end: end address of the range for migration
+ * @walk: mm_walk callback structure
+ *
+ * Collect the huge pmd entry at @pmdp for migration and set the
+ * MIGRATE_PFN_COMPOUND flag in the migrate src entry to indicate that
+ * migration will occur at HPAGE_PMD granularity
+ */
+static int migrate_vma_collect_huge_pmd(pmd_t *pmdp, unsigned long start,
+					unsigned long end, struct mm_walk *walk,
+					struct folio *fault_folio)
 {
+	struct mm_struct *mm = walk->mm;
+	struct folio *folio;
 	struct migrate_vma *migrate = walk->private;
-	struct folio *fault_folio = migrate->fault_page ?
-		page_folio(migrate->fault_page) : NULL;
-	struct vm_area_struct *vma = walk->vma;
-	struct mm_struct *mm = vma->vm_mm;
-	unsigned long addr = start, unmapped = 0;
 	spinlock_t *ptl;
-	pte_t *ptep;
+	swp_entry_t entry;
+	int ret;
+	unsigned long write = 0;
 
-again:
-	if (pmd_none(*pmdp))
+	ptl = pmd_lock(mm, pmdp);
+	if (pmd_none(*pmdp)) {
+		spin_unlock(ptl);
 		return migrate_vma_collect_hole(start, end, -1, walk);
+	}
 
 	if (pmd_trans_huge(*pmdp)) {
-		struct folio *folio;
-
-		ptl = pmd_lock(mm, pmdp);
-		if (unlikely(!pmd_trans_huge(*pmdp))) {
+		if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)) {
 			spin_unlock(ptl);
-			goto again;
+			return migrate_vma_collect_skip(start, end, walk);
 		}
 
 		folio = pmd_folio(*pmdp);
 		if (is_huge_zero_folio(folio)) {
 			spin_unlock(ptl);
-			split_huge_pmd(vma, pmdp, addr);
-		} else {
-			int ret;
+			return migrate_vma_collect_hole(start, end, -1, walk);
+		}
+		if (pmd_write(*pmdp))
+			write = MIGRATE_PFN_WRITE;
+	} else if (!pmd_present(*pmdp)) {
+		entry = pmd_to_swp_entry(*pmdp);
+		folio = pfn_swap_entry_folio(entry);
+
+		if (!is_device_private_entry(entry) ||
+			!(migrate->flags & MIGRATE_VMA_SELECT_DEVICE_PRIVATE) ||
+			(folio->pgmap->owner != migrate->pgmap_owner)) {
+			spin_unlock(ptl);
+			return migrate_vma_collect_skip(start, end, walk);
+		}
 
-			folio_get(folio);
+		if (is_migration_entry(entry)) {
+			migration_entry_wait_on_locked(entry, ptl);
 			spin_unlock(ptl);
-			/* FIXME: we don't expect THP for fault_folio */
-			if (WARN_ON_ONCE(fault_folio == folio))
-				return migrate_vma_collect_skip(start, end,
-								walk);
-			if (unlikely(!folio_trylock(folio)))
-				return migrate_vma_collect_skip(start, end,
-								walk);
-			ret = split_folio(folio);
-			if (fault_folio != folio)
-				folio_unlock(folio);
-			folio_put(folio);
-			if (ret)
-				return migrate_vma_collect_skip(start, end,
-								walk);
+			return -EAGAIN;
 		}
+
+		if (is_writable_device_private_entry(entry))
+			write = MIGRATE_PFN_WRITE;
+	} else {
+		spin_unlock(ptl);
+		return -EAGAIN;
+	}
+
+	folio_get(folio);
+	if (folio != fault_folio && unlikely(!folio_trylock(folio))) {
+		spin_unlock(ptl);
+		folio_put(folio);
+		return migrate_vma_collect_skip(start, end, walk);
+	}
+
+	if (thp_migration_supported() &&
+		(migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
+		(IS_ALIGNED(start, HPAGE_PMD_SIZE) &&
+		 IS_ALIGNED(end, HPAGE_PMD_SIZE))) {
+
+		struct page_vma_mapped_walk pvmw = {
+			.ptl = ptl,
+			.address = start,
+			.pmd = pmdp,
+			.vma = walk->vma,
+		};
+
+		unsigned long pfn = page_to_pfn(folio_page(folio, 0));
+
+		migrate->src[migrate->npages] = migrate_pfn(pfn) | write
+						| MIGRATE_PFN_MIGRATE
+						| MIGRATE_PFN_COMPOUND;
+		migrate->dst[migrate->npages++] = 0;
+		migrate->cpages++;
+		ret = set_pmd_migration_entry(&pvmw, folio_page(folio, 0));
+		if (ret) {
+			migrate->npages--;
+			migrate->cpages--;
+			migrate->src[migrate->npages] = 0;
+			migrate->dst[migrate->npages] = 0;
+			goto fallback;
+		}
+		migrate_vma_collect_skip(start + PAGE_SIZE, end, walk);
+		spin_unlock(ptl);
+		return 0;
+	}
+
+fallback:
+	spin_unlock(ptl);
+	if (!folio_test_large(folio))
+		goto done;
+	ret = split_folio(folio);
+	if (fault_folio != folio)
+		folio_unlock(folio);
+	folio_put(folio);
+	if (ret)
+		return migrate_vma_collect_skip(start, end, walk);
+	if (pmd_none(pmdp_get_lockless(pmdp)))
+		return migrate_vma_collect_hole(start, end, -1, walk);
+
+done:
+	return -ENOENT;
+}
+
+static int migrate_vma_collect_pmd(pmd_t *pmdp,
+				   unsigned long start,
+				   unsigned long end,
+				   struct mm_walk *walk)
+{
+	struct migrate_vma *migrate = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long addr = start, unmapped = 0;
+	spinlock_t *ptl;
+	struct folio *fault_folio = migrate->fault_page ?
+		page_folio(migrate->fault_page) : NULL;
+	pte_t *ptep;
+
+again:
+	if (pmd_trans_huge(*pmdp) || !pmd_present(*pmdp)) {
+		int ret = migrate_vma_collect_huge_pmd(pmdp, start, end, walk, fault_folio);
+
+		if (ret == -EAGAIN)
+			goto again;
+		if (ret == 0)
+			return 0;
 	}
 
 	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
@@ -222,8 +334,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
 		}
 
-		/* FIXME support THP */
-		if (!page || !page->mapping || PageTransCompound(page)) {
+		if (!page || !page->mapping) {
 			mpfn = 0;
 			goto next;
 		}
@@ -394,14 +505,6 @@ static bool migrate_vma_check_page(struct page *page, struct page *fault_page)
 	 */
 	int extra = 1 + (page == fault_page);
 
-	/*
-	 * FIXME support THP (transparent huge page), it is bit more complex to
-	 * check them than regular pages, because they can be mapped with a pmd
-	 * or with a pte (split pte mapping).
-	 */
-	if (folio_test_large(folio))
-		return false;
-
 	/* Page from ZONE_DEVICE have one extra reference */
 	if (folio_is_zone_device(folio))
 		extra++;
@@ -432,17 +535,24 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
 
 	lru_add_drain();
 
-	for (i = 0; i < npages; i++) {
+	for (i = 0; i < npages; ) {
 		struct page *page = migrate_pfn_to_page(src_pfns[i]);
 		struct folio *folio;
+		unsigned int nr = 1;
 
 		if (!page) {
 			if (src_pfns[i] & MIGRATE_PFN_MIGRATE)
 				unmapped++;
-			continue;
+			goto next;
 		}
 
 		folio =	page_folio(page);
+		nr = folio_nr_pages(folio);
+
+		if (nr > 1)
+			src_pfns[i] |= MIGRATE_PFN_COMPOUND;
+
+
 		/* ZONE_DEVICE folios are not on LRU */
 		if (!folio_is_zone_device(folio)) {
 			if (!folio_test_lru(folio) && allow_drain) {
@@ -454,7 +564,7 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
 			if (!folio_isolate_lru(folio)) {
 				src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
 				restore++;
-				continue;
+				goto next;
 			}
 
 			/* Drop the reference we took in collect */
@@ -473,10 +583,12 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
 
 			src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
 			restore++;
-			continue;
+			goto next;
 		}
 
 		unmapped++;
+next:
+		i += nr;
 	}
 
 	for (i = 0; i < npages && restore; i++) {
@@ -622,6 +734,147 @@ int migrate_vma_setup(struct migrate_vma *args)
 }
 EXPORT_SYMBOL(migrate_vma_setup);
 
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+/**
+ * migrate_vma_insert_huge_pmd_page: Insert a huge folio into @migrate->vma->vm_mm
+ * at @addr. folio is already allocated as a part of the migration process with
+ * large page.
+ *
+ * @folio needs to be initialized and setup after it's allocated. The code bits
+ * here follow closely the code in __do_huge_pmd_anonymous_page(). This API does
+ * not support THP zero pages.
+ *
+ * @migrate: migrate_vma arguments
+ * @addr: address where the folio will be inserted
+ * @folio: folio to be inserted at @addr
+ * @src: src pfn which is being migrated
+ * @pmdp: pointer to the pmd
+ */
+static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
+					 unsigned long addr,
+					 struct page *page,
+					 unsigned long *src,
+					 pmd_t *pmdp)
+{
+	struct vm_area_struct *vma = migrate->vma;
+	gfp_t gfp = vma_thp_gfp_mask(vma);
+	struct folio *folio = page_folio(page);
+	int ret;
+	vm_fault_t csa_ret;
+	spinlock_t *ptl;
+	pgtable_t pgtable;
+	pmd_t entry;
+	bool flush = false;
+	unsigned long i;
+
+	VM_WARN_ON_FOLIO(!folio, folio);
+	VM_WARN_ON_ONCE(!pmd_none(*pmdp) && !is_huge_zero_pmd(*pmdp));
+
+	if (!thp_vma_suitable_order(vma, addr, HPAGE_PMD_ORDER))
+		return -EINVAL;
+
+	ret = anon_vma_prepare(vma);
+	if (ret)
+		return ret;
+
+	folio_set_order(folio, HPAGE_PMD_ORDER);
+	folio_set_large_rmappable(folio);
+
+	if (mem_cgroup_charge(folio, migrate->vma->vm_mm, gfp)) {
+		count_vm_event(THP_FAULT_FALLBACK);
+		count_mthp_stat(HPAGE_PMD_ORDER, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
+		ret = -ENOMEM;
+		goto abort;
+	}
+
+	__folio_mark_uptodate(folio);
+
+	pgtable = pte_alloc_one(vma->vm_mm);
+	if (unlikely(!pgtable))
+		goto abort;
+
+	if (folio_is_device_private(folio)) {
+		swp_entry_t swp_entry;
+
+		if (vma->vm_flags & VM_WRITE)
+			swp_entry = make_writable_device_private_entry(
+						page_to_pfn(page));
+		else
+			swp_entry = make_readable_device_private_entry(
+						page_to_pfn(page));
+		entry = swp_entry_to_pmd(swp_entry);
+	} else {
+		if (folio_is_zone_device(folio) &&
+		    !folio_is_device_coherent(folio)) {
+			goto abort;
+		}
+		entry = folio_mk_pmd(folio, vma->vm_page_prot);
+		if (vma->vm_flags & VM_WRITE)
+			entry = pmd_mkwrite(pmd_mkdirty(entry), vma);
+	}
+
+	ptl = pmd_lock(vma->vm_mm, pmdp);
+	csa_ret = check_stable_address_space(vma->vm_mm);
+	if (csa_ret)
+		goto abort;
+
+	/*
+	 * Check for userfaultfd but do not deliver the fault. Instead,
+	 * just back off.
+	 */
+	if (userfaultfd_missing(vma))
+		goto unlock_abort;
+
+	if (!pmd_none(*pmdp)) {
+		if (!is_huge_zero_pmd(*pmdp))
+			goto unlock_abort;
+		flush = true;
+	} else if (!pmd_none(*pmdp))
+		goto unlock_abort;
+
+	add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+	folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE);
+	if (!folio_is_zone_device(folio))
+		folio_add_lru_vma(folio, vma);
+	folio_get(folio);
+
+	if (flush) {
+		pte_free(vma->vm_mm, pgtable);
+		flush_cache_page(vma, addr, addr + HPAGE_PMD_SIZE);
+		pmdp_invalidate(vma, addr, pmdp);
+	} else {
+		pgtable_trans_huge_deposit(vma->vm_mm, pmdp, pgtable);
+		mm_inc_nr_ptes(vma->vm_mm);
+	}
+	set_pmd_at(vma->vm_mm, addr, pmdp, entry);
+	update_mmu_cache_pmd(vma, addr, pmdp);
+
+	spin_unlock(ptl);
+
+	count_vm_event(THP_FAULT_ALLOC);
+	count_mthp_stat(HPAGE_PMD_ORDER, MTHP_STAT_ANON_FAULT_ALLOC);
+	count_memcg_event_mm(vma->vm_mm, THP_FAULT_ALLOC);
+
+	return 0;
+
+unlock_abort:
+	spin_unlock(ptl);
+abort:
+	for (i = 0; i < HPAGE_PMD_NR; i++)
+		src[i] &= ~MIGRATE_PFN_MIGRATE;
+	return 0;
+}
+#else /* !CONFIG_ARCH_ENABLE_THP_MIGRATION */
+static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
+					 unsigned long addr,
+					 struct page *page,
+					 unsigned long *src,
+					 pmd_t *pmdp)
+{
+	return 0;
+}
+#endif
+
 /*
  * This code closely matches the code in:
  *   __handle_mm_fault()
@@ -632,9 +885,10 @@ EXPORT_SYMBOL(migrate_vma_setup);
  */
 static void migrate_vma_insert_page(struct migrate_vma *migrate,
 				    unsigned long addr,
-				    struct page *page,
+				    unsigned long *dst,
 				    unsigned long *src)
 {
+	struct page *page = migrate_pfn_to_page(*dst);
 	struct folio *folio = page_folio(page);
 	struct vm_area_struct *vma = migrate->vma;
 	struct mm_struct *mm = vma->vm_mm;
@@ -662,8 +916,25 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 	pmdp = pmd_alloc(mm, pudp, addr);
 	if (!pmdp)
 		goto abort;
-	if (pmd_trans_huge(*pmdp))
-		goto abort;
+
+	if (thp_migration_supported() && (*dst & MIGRATE_PFN_COMPOUND)) {
+		int ret = migrate_vma_insert_huge_pmd_page(migrate, addr, page,
+								src, pmdp);
+		if (ret)
+			goto abort;
+		return;
+	}
+
+	if (!pmd_none(*pmdp)) {
+		if (pmd_trans_huge(*pmdp)) {
+			if (!is_huge_zero_pmd(*pmdp))
+				goto abort;
+			folio_get(pmd_folio(*pmdp));
+			split_huge_pmd(vma, pmdp, addr);
+		} else if (pmd_leaf(*pmdp))
+			goto abort;
+	}
+
 	if (pte_alloc(mm, pmdp))
 		goto abort;
 	if (unlikely(anon_vma_prepare(vma)))
@@ -754,23 +1025,24 @@ static void __migrate_device_pages(unsigned long *src_pfns,
 	unsigned long i;
 	bool notified = false;
 
-	for (i = 0; i < npages; i++) {
+	for (i = 0; i < npages; ) {
 		struct page *newpage = migrate_pfn_to_page(dst_pfns[i]);
 		struct page *page = migrate_pfn_to_page(src_pfns[i]);
 		struct address_space *mapping;
 		struct folio *newfolio, *folio;
 		int r, extra_cnt = 0;
+		unsigned long nr = 1;
 
 		if (!newpage) {
 			src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
-			continue;
+			goto next;
 		}
 
 		if (!page) {
 			unsigned long addr;
 
 			if (!(src_pfns[i] & MIGRATE_PFN_MIGRATE))
-				continue;
+				goto next;
 
 			/*
 			 * The only time there is no vma is when called from
@@ -788,15 +1060,47 @@ static void __migrate_device_pages(unsigned long *src_pfns,
 					migrate->pgmap_owner);
 				mmu_notifier_invalidate_range_start(&range);
 			}
-			migrate_vma_insert_page(migrate, addr, newpage,
+
+			if ((src_pfns[i] & MIGRATE_PFN_COMPOUND) &&
+				(!(dst_pfns[i] & MIGRATE_PFN_COMPOUND))) {
+				nr = HPAGE_PMD_NR;
+				src_pfns[i] &= ~MIGRATE_PFN_COMPOUND;
+				src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
+				goto next;
+			}
+
+			migrate_vma_insert_page(migrate, addr, &dst_pfns[i],
 						&src_pfns[i]);
-			continue;
+			goto next;
 		}
 
 		newfolio = page_folio(newpage);
 		folio = page_folio(page);
 		mapping = folio_mapping(folio);
 
+		/*
+		 * If THP migration is enabled, check if both src and dst
+		 * can migrate large pages
+		 */
+		if (thp_migration_supported()) {
+			if ((src_pfns[i] & MIGRATE_PFN_MIGRATE) &&
+				(src_pfns[i] & MIGRATE_PFN_COMPOUND) &&
+				!(dst_pfns[i] & MIGRATE_PFN_COMPOUND)) {
+
+				if (!migrate) {
+					src_pfns[i] &= ~(MIGRATE_PFN_MIGRATE |
+							 MIGRATE_PFN_COMPOUND);
+					goto next;
+				}
+				src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
+			} else if ((src_pfns[i] & MIGRATE_PFN_MIGRATE) &&
+				(dst_pfns[i] & MIGRATE_PFN_COMPOUND) &&
+				!(src_pfns[i] & MIGRATE_PFN_COMPOUND)) {
+				src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
+			}
+		}
+
+
 		if (folio_is_device_private(newfolio) ||
 		    folio_is_device_coherent(newfolio)) {
 			if (mapping) {
@@ -809,7 +1113,7 @@ static void __migrate_device_pages(unsigned long *src_pfns,
 				if (!folio_test_anon(folio) ||
 				    !folio_free_swap(folio)) {
 					src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
-					continue;
+					goto next;
 				}
 			}
 		} else if (folio_is_zone_device(newfolio)) {
@@ -817,7 +1121,7 @@ static void __migrate_device_pages(unsigned long *src_pfns,
 			 * Other types of ZONE_DEVICE page are not supported.
 			 */
 			src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
-			continue;
+			goto next;
 		}
 
 		BUG_ON(folio_test_writeback(folio));
@@ -829,6 +1133,8 @@ static void __migrate_device_pages(unsigned long *src_pfns,
 			src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
 		else
 			folio_migrate_flags(newfolio, folio);
+next:
+		i += nr;
 	}
 
 	if (notified)
@@ -990,10 +1296,23 @@ static unsigned long migrate_device_pfn_lock(unsigned long pfn)
 int migrate_device_range(unsigned long *src_pfns, unsigned long start,
 			unsigned long npages)
 {
-	unsigned long i, pfn;
+	unsigned long i, j, pfn;
+
+	for (pfn = start, i = 0; i < npages; pfn++, i++) {
+		struct page *page = pfn_to_page(pfn);
+		struct folio *folio = page_folio(page);
+		unsigned int nr = 1;
 
-	for (pfn = start, i = 0; i < npages; pfn++, i++)
 		src_pfns[i] = migrate_device_pfn_lock(pfn);
+		nr = folio_nr_pages(folio);
+		if (nr > 1) {
+			src_pfns[i] |= MIGRATE_PFN_COMPOUND;
+			for (j = 1; j < nr; j++)
+				src_pfns[i+j] = 0;
+			i += j - 1;
+			pfn += j - 1;
+		}
+	}
 
 	migrate_device_unmap(src_pfns, npages, NULL);
 
@@ -1011,10 +1330,22 @@ EXPORT_SYMBOL(migrate_device_range);
  */
 int migrate_device_pfns(unsigned long *src_pfns, unsigned long npages)
 {
-	unsigned long i;
+	unsigned long i, j;
+
+	for (i = 0; i < npages; i++) {
+		struct page *page = pfn_to_page(src_pfns[i]);
+		struct folio *folio = page_folio(page);
+		unsigned int nr = 1;
 
-	for (i = 0; i < npages; i++)
 		src_pfns[i] = migrate_device_pfn_lock(src_pfns[i]);
+		nr = folio_nr_pages(folio);
+		if (nr > 1) {
+			src_pfns[i] |= MIGRATE_PFN_COMPOUND;
+			for (j = 1; j < nr; j++)
+				src_pfns[i+j] = 0;
+			i += j - 1;
+		}
+	}
 
 	migrate_device_unmap(src_pfns, npages, NULL);
 
-- 
2.50.1



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [v3 04/11] mm/memory/fault: add support for zone device THP fault handling
  2025-08-12  2:40 [v3 00/11] mm: support device-private THP Balbir Singh
                   ` (2 preceding siblings ...)
  2025-08-12  2:40 ` [v3 03/11] mm/migrate_device: THP migration of zone device pages Balbir Singh
@ 2025-08-12  2:40 ` Balbir Singh
  2025-08-12  2:40 ` [v3 05/11] lib/test_hmm: test cases and support for zone device private THP Balbir Singh
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 36+ messages in thread
From: Balbir Singh @ 2025-08-12  2:40 UTC (permalink / raw)
  To: dri-devel, linux-mm, linux-kernel
  Cc: Balbir Singh, Andrew Morton, David Hildenbrand, Zi Yan,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Oscar Salvador, Lorenzo Stoakes, Baolin Wang,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Ralph Campbell, Mika Penttilä, Matthew Brost,
	Francois Dugast

When the CPU touches a zone device THP entry, the data needs to be
migrated back to the CPU. Handle this by calling migrate_to_ram() on
these pages via the new do_huge_pmd_device_private() fault handling
helper.
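
For context, the hook invoked here is the same migrate_to_ram()
callback a driver already registers in its dev_pagemap_ops; a minimal
sketch of that driver side (my_migrate_to_ram and my_pgmap_ops are
hypothetical names, not part of this series):

	#include <linux/memremap.h>

	/*
	 * Invoked by do_huge_pmd_device_private() with the faulting page
	 * locked and a reference held; the driver migrates the folio back
	 * to system memory.
	 */
	static vm_fault_t my_migrate_to_ram(struct vm_fault *vmf)
	{
		/* a real driver would set up a migrate_vma for vmf->page */
		return VM_FAULT_SIGBUS;
	}

	static const struct dev_pagemap_ops my_pgmap_ops = {
		.migrate_to_ram	= my_migrate_to_ram,
	};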

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Ying Huang <ying.huang@linux.alibaba.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>

Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
 include/linux/huge_mm.h |  7 +++++++
 mm/huge_memory.c        | 36 ++++++++++++++++++++++++++++++++++++
 mm/memory.c             |  6 ++++--
 3 files changed, 47 insertions(+), 2 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 7748489fde1b..a4880fe98e46 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -474,6 +474,8 @@ static inline bool folio_test_pmd_mappable(struct folio *folio)
 
 vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);
 
+vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf);
+
 extern struct folio *huge_zero_folio;
 extern unsigned long huge_zero_pfn;
 
@@ -632,6 +634,11 @@ static inline vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 	return 0;
 }
 
+static inline vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf)
+{
+	return 0;
+}
+
 static inline bool is_huge_zero_folio(const struct folio *folio)
 {
 	return false;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2495e3fdbfae..8888140e57a3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1267,6 +1267,42 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 
 }
 
+vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	vm_fault_t ret = 0;
+	spinlock_t *ptl;
+	swp_entry_t swp_entry;
+	struct page *page;
+
+	if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
+		vma_end_read(vma);
+		return VM_FAULT_RETRY;
+	}
+
+	ptl = pmd_lock(vma->vm_mm, vmf->pmd);
+	if (unlikely(!pmd_same(*vmf->pmd, vmf->orig_pmd))) {
+		spin_unlock(ptl);
+		return 0;
+	}
+
+	swp_entry = pmd_to_swp_entry(vmf->orig_pmd);
+	page = pfn_swap_entry_to_page(swp_entry);
+	vmf->page = page;
+	vmf->pte = NULL;
+	if (trylock_page(vmf->page)) {
+		get_page(page);
+		spin_unlock(ptl);
+		ret = page_pgmap(page)->ops->migrate_to_ram(vmf);
+		unlock_page(vmf->page);
+		put_page(page);
+	} else {
+		spin_unlock(ptl);
+	}
+
+	return ret;
+}
+
 /*
  * always: directly stall for all thp allocations
  * defer: wake kswapd and fail if not immediately available
diff --git a/mm/memory.c b/mm/memory.c
index 92fd18a5d8d1..6c87f043eea1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6152,8 +6152,10 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 		vmf.orig_pmd = pmdp_get_lockless(vmf.pmd);
 
 		if (unlikely(is_swap_pmd(vmf.orig_pmd))) {
-			VM_BUG_ON(thp_migration_supported() &&
-					  !is_pmd_migration_entry(vmf.orig_pmd));
+			if (is_device_private_entry(
+					pmd_to_swp_entry(vmf.orig_pmd)))
+				return do_huge_pmd_device_private(&vmf);
+
 			if (is_pmd_migration_entry(vmf.orig_pmd))
 				pmd_migration_entry_wait(mm, vmf.pmd);
 			return 0;
-- 
2.50.1



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [v3 05/11] lib/test_hmm: test cases and support for zone device private THP
  2025-08-12  2:40 [v3 00/11] mm: support device-private THP Balbir Singh
                   ` (3 preceding siblings ...)
  2025-08-12  2:40 ` [v3 04/11] mm/memory/fault: add support for zone device THP fault handling Balbir Singh
@ 2025-08-12  2:40 ` Balbir Singh
  2025-08-12  2:40 ` [v3 06/11] mm/memremap: add folio_split support Balbir Singh
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 36+ messages in thread
From: Balbir Singh @ 2025-08-12  2:40 UTC (permalink / raw)
  To: dri-devel, linux-mm, linux-kernel
  Cc: Balbir Singh, Andrew Morton, David Hildenbrand, Zi Yan,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Oscar Salvador, Lorenzo Stoakes, Baolin Wang,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Ralph Campbell, Mika Penttilä, Matthew Brost,
	Francois Dugast

Enhance the hmm test driver (lib/test_hmm) with support for
THP pages.

A new free_folios pool has been added to the dmirror device; folios
are taken from it when a THP zone device private page is requested.

Add compound page awareness to the allocation function for both
normal migration and fault based migration. These routines also copy
folio_nr_pages() worth of pages when moving data between system
memory and device memory.

args.src and args.dst, which hold the migration entries, are now
dynamically allocated, as they need to hold HPAGE_PMD_NR entries or
more.
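
For example, the device fault path in the lib/test_hmm hunks below
now sizes the arrays to the faulting folio; condensed from the diff:

	unsigned int order = folio_order(page_folio(vmf->page));
	unsigned long nr = 1UL << order;	/* HPAGE_PMD_NR for a PMD THP */

	args.src = kcalloc(nr, sizeof(unsigned long), GFP_KERNEL);
	args.dst = kcalloc(nr, sizeof(unsigned long), GFP_KERNEL);
	if (!args.src || !args.dst)
		return VM_FAULT_OOM;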

Split and migrate support will be added in future patches in this
series.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Ying Huang <ying.huang@linux.alibaba.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>

Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
 include/linux/memremap.h |  12 ++
 lib/test_hmm.c           | 366 +++++++++++++++++++++++++++++++--------
 2 files changed, 303 insertions(+), 75 deletions(-)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index a0723b35eeaa..0c5141a7d58c 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -169,6 +169,18 @@ static inline bool folio_is_device_private(const struct folio *folio)
 	return is_device_private_page(&folio->page);
 }
 
+static inline void *folio_zone_device_data(const struct folio *folio)
+{
+	VM_WARN_ON_FOLIO(!folio_is_device_private(folio), folio);
+	return folio->page.zone_device_data;
+}
+
+static inline void folio_set_zone_device_data(struct folio *folio, void *data)
+{
+	VM_WARN_ON_FOLIO(!folio_is_device_private(folio), folio);
+	folio->page.zone_device_data = data;
+}
+
 static inline bool is_pci_p2pdma_page(const struct page *page)
 {
 	return IS_ENABLED(CONFIG_PCI_P2PDMA) &&
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 297f1e034045..d814056151d0 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -119,6 +119,7 @@ struct dmirror_device {
 	unsigned long		calloc;
 	unsigned long		cfree;
 	struct page		*free_pages;
+	struct folio		*free_folios;
 	spinlock_t		lock;		/* protects the above */
 };
 
@@ -492,7 +493,7 @@ static int dmirror_write(struct dmirror *dmirror, struct hmm_dmirror_cmd *cmd)
 }
 
 static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
-				   struct page **ppage)
+				  struct page **ppage, bool is_large)
 {
 	struct dmirror_chunk *devmem;
 	struct resource *res = NULL;
@@ -572,20 +573,45 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
 		pfn_first, pfn_last);
 
 	spin_lock(&mdevice->lock);
-	for (pfn = pfn_first; pfn < pfn_last; pfn++) {
+	for (pfn = pfn_first; pfn < pfn_last; ) {
 		struct page *page = pfn_to_page(pfn);
 
+		if (is_large && IS_ALIGNED(pfn, HPAGE_PMD_NR)
+			&& (pfn + HPAGE_PMD_NR <= pfn_last)) {
+			page->zone_device_data = mdevice->free_folios;
+			mdevice->free_folios = page_folio(page);
+			pfn += HPAGE_PMD_NR;
+			continue;
+		}
+
 		page->zone_device_data = mdevice->free_pages;
 		mdevice->free_pages = page;
+		pfn++;
 	}
+
+	ret = 0;
 	if (ppage) {
-		*ppage = mdevice->free_pages;
-		mdevice->free_pages = (*ppage)->zone_device_data;
-		mdevice->calloc++;
+		if (is_large) {
+			if (!mdevice->free_folios) {
+				ret = -ENOMEM;
+				goto err_unlock;
+			}
+			*ppage = folio_page(mdevice->free_folios, 0);
+			mdevice->free_folios = (*ppage)->zone_device_data;
+			mdevice->calloc += HPAGE_PMD_NR;
+		} else if (mdevice->free_pages) {
+			*ppage = mdevice->free_pages;
+			mdevice->free_pages = (*ppage)->zone_device_data;
+			mdevice->calloc++;
+		} else {
+			ret = -ENOMEM;
+			goto err_unlock;
+		}
 	}
+err_unlock:
 	spin_unlock(&mdevice->lock);
 
-	return 0;
+	return ret;
 
 err_release:
 	mutex_unlock(&mdevice->devmem_lock);
@@ -598,10 +624,13 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
 	return ret;
 }
 
-static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
+static struct page *dmirror_devmem_alloc_page(struct dmirror *dmirror,
+					      bool is_large)
 {
 	struct page *dpage = NULL;
 	struct page *rpage = NULL;
+	unsigned int order = is_large ? HPAGE_PMD_ORDER : 0;
+	struct dmirror_device *mdevice = dmirror->mdevice;
 
 	/*
 	 * For ZONE_DEVICE private type, this is a fake device so we allocate
@@ -610,49 +639,55 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
 	 * data and ignore rpage.
 	 */
 	if (dmirror_is_private_zone(mdevice)) {
-		rpage = alloc_page(GFP_HIGHUSER);
+		rpage = folio_page(folio_alloc(GFP_HIGHUSER, order), 0);
 		if (!rpage)
 			return NULL;
 	}
 	spin_lock(&mdevice->lock);
 
-	if (mdevice->free_pages) {
+	if (is_large && mdevice->free_folios) {
+		dpage = folio_page(mdevice->free_folios, 0);
+		mdevice->free_folios = dpage->zone_device_data;
+		mdevice->calloc += 1 << order;
+		spin_unlock(&mdevice->lock);
+	} else if (!is_large && mdevice->free_pages) {
 		dpage = mdevice->free_pages;
 		mdevice->free_pages = dpage->zone_device_data;
 		mdevice->calloc++;
 		spin_unlock(&mdevice->lock);
 	} else {
 		spin_unlock(&mdevice->lock);
-		if (dmirror_allocate_chunk(mdevice, &dpage))
+		if (dmirror_allocate_chunk(mdevice, &dpage, is_large))
 			goto error;
 	}
 
-	zone_device_page_init(dpage);
+	zone_device_folio_init(page_folio(dpage), order);
 	dpage->zone_device_data = rpage;
 	return dpage;
 
 error:
 	if (rpage)
-		__free_page(rpage);
+		__free_pages(rpage, order);
 	return NULL;
 }
 
 static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
 					   struct dmirror *dmirror)
 {
-	struct dmirror_device *mdevice = dmirror->mdevice;
 	const unsigned long *src = args->src;
 	unsigned long *dst = args->dst;
 	unsigned long addr;
 
-	for (addr = args->start; addr < args->end; addr += PAGE_SIZE,
-						   src++, dst++) {
+	for (addr = args->start; addr < args->end; ) {
 		struct page *spage;
 		struct page *dpage;
 		struct page *rpage;
+		bool is_large = *src & MIGRATE_PFN_COMPOUND;
+		int write = (*src & MIGRATE_PFN_WRITE) ? MIGRATE_PFN_WRITE : 0;
+		unsigned long nr = 1;
 
 		if (!(*src & MIGRATE_PFN_MIGRATE))
-			continue;
+			goto next;
 
 		/*
 		 * Note that spage might be NULL which is OK since it is an
@@ -662,17 +697,45 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
 		if (WARN(spage && is_zone_device_page(spage),
 		     "page already in device spage pfn: 0x%lx\n",
 		     page_to_pfn(spage)))
+			goto next;
+
+		dpage = dmirror_devmem_alloc_page(dmirror, is_large);
+		if (!dpage) {
+			struct folio *folio;
+			unsigned long i;
+			unsigned long spfn = *src >> MIGRATE_PFN_SHIFT;
+			struct page *src_page;
+
+			if (!is_large)
+				goto next;
+
+			if (!spage && is_large) {
+				nr = HPAGE_PMD_NR;
+			} else {
+				folio = page_folio(spage);
+				nr = folio_nr_pages(folio);
+			}
+
+			for (i = 0; i < nr && addr < args->end; i++) {
+				dpage = dmirror_devmem_alloc_page(dmirror, false);
+				rpage = BACKING_PAGE(dpage);
+				rpage->zone_device_data = dmirror;
+
+				*dst = migrate_pfn(page_to_pfn(dpage)) | write;
+				src_page = pfn_to_page(spfn + i);
+
+				if (spage)
+					copy_highpage(rpage, src_page);
+				else
+					clear_highpage(rpage);
+				src++;
+				dst++;
+				addr += PAGE_SIZE;
+			}
 			continue;
-
-		dpage = dmirror_devmem_alloc_page(mdevice);
-		if (!dpage)
-			continue;
+		}
 
 		rpage = BACKING_PAGE(dpage);
-		if (spage)
-			copy_highpage(rpage, spage);
-		else
-			clear_highpage(rpage);
 
 		/*
 		 * Normally, a device would use the page->zone_device_data to
@@ -684,10 +747,42 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
 
 		pr_debug("migrating from sys to dev pfn src: 0x%lx pfn dst: 0x%lx\n",
 			 page_to_pfn(spage), page_to_pfn(dpage));
-		*dst = migrate_pfn(page_to_pfn(dpage));
-		if ((*src & MIGRATE_PFN_WRITE) ||
-		    (!spage && args->vma->vm_flags & VM_WRITE))
-			*dst |= MIGRATE_PFN_WRITE;
+
+		*dst = migrate_pfn(page_to_pfn(dpage)) | write;
+
+		if (is_large) {
+			int i;
+			struct folio *folio = page_folio(dpage);
+			*dst |= MIGRATE_PFN_COMPOUND;
+
+			if (folio_test_large(folio)) {
+				for (i = 0; i < folio_nr_pages(folio); i++) {
+					struct page *dst_page =
+						pfn_to_page(page_to_pfn(rpage) + i);
+					struct page *src_page =
+						pfn_to_page(page_to_pfn(spage) + i);
+
+					if (spage)
+						copy_highpage(dst_page, src_page);
+					else
+						clear_highpage(dst_page);
+					src++;
+					dst++;
+					addr += PAGE_SIZE;
+				}
+				continue;
+			}
+		}
+
+		if (spage)
+			copy_highpage(rpage, spage);
+		else
+			clear_highpage(rpage);
+
+next:
+		src++;
+		dst++;
+		addr += PAGE_SIZE;
 	}
 }
 
@@ -734,14 +829,17 @@ static int dmirror_migrate_finalize_and_map(struct migrate_vma *args,
 	const unsigned long *src = args->src;
 	const unsigned long *dst = args->dst;
 	unsigned long pfn;
+	const unsigned long start_pfn = start >> PAGE_SHIFT;
+	const unsigned long end_pfn = end >> PAGE_SHIFT;
 
 	/* Map the migrated pages into the device's page tables. */
 	mutex_lock(&dmirror->mutex);
 
-	for (pfn = start >> PAGE_SHIFT; pfn < (end >> PAGE_SHIFT); pfn++,
-								src++, dst++) {
+	for (pfn = start_pfn; pfn < end_pfn; pfn++, src++, dst++) {
 		struct page *dpage;
 		void *entry;
+		int nr, i;
+		struct page *rpage;
 
 		if (!(*src & MIGRATE_PFN_MIGRATE))
 			continue;
@@ -750,13 +848,25 @@ static int dmirror_migrate_finalize_and_map(struct migrate_vma *args,
 		if (!dpage)
 			continue;
 
-		entry = BACKING_PAGE(dpage);
-		if (*dst & MIGRATE_PFN_WRITE)
-			entry = xa_tag_pointer(entry, DPT_XA_TAG_WRITE);
-		entry = xa_store(&dmirror->pt, pfn, entry, GFP_ATOMIC);
-		if (xa_is_err(entry)) {
-			mutex_unlock(&dmirror->mutex);
-			return xa_err(entry);
+		if (*dst & MIGRATE_PFN_COMPOUND)
+			nr = folio_nr_pages(page_folio(dpage));
+		else
+			nr = 1;
+
+		WARN_ON_ONCE(end_pfn < start_pfn + nr);
+
+		rpage = BACKING_PAGE(dpage);
+		VM_WARN_ON(folio_nr_pages(page_folio(rpage)) != nr);
+
+		for (i = 0; i < nr; i++) {
+			entry = folio_page(page_folio(rpage), i);
+			if (*dst & MIGRATE_PFN_WRITE)
+				entry = xa_tag_pointer(entry, DPT_XA_TAG_WRITE);
+			entry = xa_store(&dmirror->pt, pfn + i, entry, GFP_ATOMIC);
+			if (xa_is_err(entry)) {
+				mutex_unlock(&dmirror->mutex);
+				return xa_err(entry);
+			}
 		}
 	}
 
@@ -829,31 +939,66 @@ static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma *args,
 	unsigned long start = args->start;
 	unsigned long end = args->end;
 	unsigned long addr;
+	unsigned int order = 0;
+	int i;
 
-	for (addr = start; addr < end; addr += PAGE_SIZE,
-				       src++, dst++) {
+	for (addr = start; addr < end; ) {
 		struct page *dpage, *spage;
 
 		spage = migrate_pfn_to_page(*src);
-		if (!spage || !(*src & MIGRATE_PFN_MIGRATE))
-			continue;
+		if (!spage || !(*src & MIGRATE_PFN_MIGRATE)) {
+			addr += PAGE_SIZE;
+			goto next;
+		}
 
 		if (WARN_ON(!is_device_private_page(spage) &&
-			    !is_device_coherent_page(spage)))
-			continue;
+			    !is_device_coherent_page(spage))) {
+			addr += PAGE_SIZE;
+			goto next;
+		}
+
 		spage = BACKING_PAGE(spage);
-		dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
-		if (!dpage)
-			continue;
-		pr_debug("migrating from dev to sys pfn src: 0x%lx pfn dst: 0x%lx\n",
-			 page_to_pfn(spage), page_to_pfn(dpage));
+		order = folio_order(page_folio(spage));
 
+		if (order)
+			dpage = folio_page(vma_alloc_folio(GFP_HIGHUSER_MOVABLE,
+						order, args->vma, addr), 0);
+		else
+			dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
+
+		/* Try with smaller pages if large allocation fails */
+		if (!dpage && order) {
+			dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
+			if (!dpage)
+				return VM_FAULT_OOM;
+			order = 0;
+		}
+
+		pr_debug("migrating from sys to dev pfn src: 0x%lx pfn dst: 0x%lx\n",
+				page_to_pfn(spage), page_to_pfn(dpage));
 		lock_page(dpage);
 		xa_erase(&dmirror->pt, addr >> PAGE_SHIFT);
 		copy_highpage(dpage, spage);
 		*dst = migrate_pfn(page_to_pfn(dpage));
 		if (*src & MIGRATE_PFN_WRITE)
 			*dst |= MIGRATE_PFN_WRITE;
+		if (order)
+			*dst |= MIGRATE_PFN_COMPOUND;
+
+		for (i = 0; i < (1 << order); i++) {
+			struct page *src_page;
+			struct page *dst_page;
+
+			src_page = pfn_to_page(page_to_pfn(spage) + i);
+			dst_page = pfn_to_page(page_to_pfn(dpage) + i);
+
+			xa_erase(&dmirror->pt, addr >> PAGE_SHIFT);
+			copy_highpage(dst_page, src_page);
+		}
+next:
+		addr += PAGE_SIZE << order;
+		src += 1 << order;
+		dst += 1 << order;
 	}
 	return 0;
 }
@@ -879,11 +1024,14 @@ static int dmirror_migrate_to_system(struct dmirror *dmirror,
 	unsigned long size = cmd->npages << PAGE_SHIFT;
 	struct mm_struct *mm = dmirror->notifier.mm;
 	struct vm_area_struct *vma;
-	unsigned long src_pfns[32] = { 0 };
-	unsigned long dst_pfns[32] = { 0 };
 	struct migrate_vma args = { 0 };
 	unsigned long next;
 	int ret;
+	unsigned long *src_pfns;
+	unsigned long *dst_pfns;
+
+	src_pfns = kvcalloc(PTRS_PER_PTE, sizeof(*src_pfns), GFP_KERNEL | __GFP_NOFAIL);
+	dst_pfns = kvcalloc(PTRS_PER_PTE, sizeof(*dst_pfns), GFP_KERNEL | __GFP_NOFAIL);
 
 	start = cmd->addr;
 	end = start + size;
@@ -902,7 +1050,7 @@ static int dmirror_migrate_to_system(struct dmirror *dmirror,
 			ret = -EINVAL;
 			goto out;
 		}
-		next = min(end, addr + (ARRAY_SIZE(src_pfns) << PAGE_SHIFT));
+		next = min(end, addr + (PTRS_PER_PTE << PAGE_SHIFT));
 		if (next > vma->vm_end)
 			next = vma->vm_end;
 
@@ -912,7 +1060,7 @@ static int dmirror_migrate_to_system(struct dmirror *dmirror,
 		args.start = addr;
 		args.end = next;
 		args.pgmap_owner = dmirror->mdevice;
-		args.flags = dmirror_select_device(dmirror);
+		args.flags = dmirror_select_device(dmirror) | MIGRATE_VMA_SELECT_COMPOUND;
 
 		ret = migrate_vma_setup(&args);
 		if (ret)
@@ -928,6 +1076,8 @@ static int dmirror_migrate_to_system(struct dmirror *dmirror,
 out:
 	mmap_read_unlock(mm);
 	mmput(mm);
+	kvfree(src_pfns);
+	kvfree(dst_pfns);
 
 	return ret;
 }
@@ -939,12 +1089,12 @@ static int dmirror_migrate_to_device(struct dmirror *dmirror,
 	unsigned long size = cmd->npages << PAGE_SHIFT;
 	struct mm_struct *mm = dmirror->notifier.mm;
 	struct vm_area_struct *vma;
-	unsigned long src_pfns[32] = { 0 };
-	unsigned long dst_pfns[32] = { 0 };
 	struct dmirror_bounce bounce;
 	struct migrate_vma args = { 0 };
 	unsigned long next;
 	int ret;
+	unsigned long *src_pfns = NULL;
+	unsigned long *dst_pfns = NULL;
 
 	start = cmd->addr;
 	end = start + size;
@@ -955,6 +1105,18 @@ static int dmirror_migrate_to_device(struct dmirror *dmirror,
 	if (!mmget_not_zero(mm))
 		return -EINVAL;
 
+	ret = -ENOMEM;
+	src_pfns = kvcalloc(PTRS_PER_PTE, sizeof(*src_pfns),
+			  GFP_KERNEL | __GFP_NOFAIL);
+	if (!src_pfns)
+		goto free_mem;
+
+	dst_pfns = kvcalloc(PTRS_PER_PTE, sizeof(*dst_pfns),
+			  GFP_KERNEL | __GFP_NOFAIL);
+	if (!dst_pfns)
+		goto free_mem;
+
+	ret = 0;
 	mmap_read_lock(mm);
 	for (addr = start; addr < end; addr = next) {
 		vma = vma_lookup(mm, addr);
@@ -962,7 +1124,7 @@ static int dmirror_migrate_to_device(struct dmirror *dmirror,
 			ret = -EINVAL;
 			goto out;
 		}
-		next = min(end, addr + (ARRAY_SIZE(src_pfns) << PAGE_SHIFT));
+		next = min(end, addr + (PTRS_PER_PTE << PAGE_SHIFT));
 		if (next > vma->vm_end)
 			next = vma->vm_end;
 
@@ -972,7 +1134,8 @@ static int dmirror_migrate_to_device(struct dmirror *dmirror,
 		args.start = addr;
 		args.end = next;
 		args.pgmap_owner = dmirror->mdevice;
-		args.flags = MIGRATE_VMA_SELECT_SYSTEM;
+		args.flags = MIGRATE_VMA_SELECT_SYSTEM |
+				MIGRATE_VMA_SELECT_COMPOUND;
 		ret = migrate_vma_setup(&args);
 		if (ret)
 			goto out;
@@ -992,7 +1155,7 @@ static int dmirror_migrate_to_device(struct dmirror *dmirror,
 	 */
 	ret = dmirror_bounce_init(&bounce, start, size);
 	if (ret)
-		return ret;
+		goto free_mem;
 	mutex_lock(&dmirror->mutex);
 	ret = dmirror_do_read(dmirror, start, end, &bounce);
 	mutex_unlock(&dmirror->mutex);
@@ -1003,11 +1166,14 @@ static int dmirror_migrate_to_device(struct dmirror *dmirror,
 	}
 	cmd->cpages = bounce.cpages;
 	dmirror_bounce_fini(&bounce);
-	return ret;
+	goto free_mem;
 
 out:
 	mmap_read_unlock(mm);
 	mmput(mm);
+free_mem:
+	kfree(src_pfns);
+	kfree(dst_pfns);
 	return ret;
 }
 
@@ -1200,6 +1366,7 @@ static void dmirror_device_evict_chunk(struct dmirror_chunk *chunk)
 	unsigned long i;
 	unsigned long *src_pfns;
 	unsigned long *dst_pfns;
+	unsigned int order = 0;
 
 	src_pfns = kvcalloc(npages, sizeof(*src_pfns), GFP_KERNEL | __GFP_NOFAIL);
 	dst_pfns = kvcalloc(npages, sizeof(*dst_pfns), GFP_KERNEL | __GFP_NOFAIL);
@@ -1215,13 +1382,25 @@ static void dmirror_device_evict_chunk(struct dmirror_chunk *chunk)
 		if (WARN_ON(!is_device_private_page(spage) &&
 			    !is_device_coherent_page(spage)))
 			continue;
+
+		order = folio_order(page_folio(spage));
 		spage = BACKING_PAGE(spage);
-		dpage = alloc_page(GFP_HIGHUSER_MOVABLE | __GFP_NOFAIL);
+		if (src_pfns[i] & MIGRATE_PFN_COMPOUND) {
+			dpage = folio_page(folio_alloc(GFP_HIGHUSER_MOVABLE,
+					      order), 0);
+		} else {
+			dpage = alloc_page(GFP_HIGHUSER_MOVABLE | __GFP_NOFAIL);
+			order = 0;
+		}
+
+		/* TODO Support splitting here */
 		lock_page(dpage);
-		copy_highpage(dpage, spage);
 		dst_pfns[i] = migrate_pfn(page_to_pfn(dpage));
 		if (src_pfns[i] & MIGRATE_PFN_WRITE)
 			dst_pfns[i] |= MIGRATE_PFN_WRITE;
+		if (order)
+			dst_pfns[i] |= MIGRATE_PFN_COMPOUND;
+		folio_copy(page_folio(dpage), page_folio(spage));
 	}
 	migrate_device_pages(src_pfns, dst_pfns, npages);
 	migrate_device_finalize(src_pfns, dst_pfns, npages);
@@ -1234,7 +1413,12 @@ static void dmirror_remove_free_pages(struct dmirror_chunk *devmem)
 {
 	struct dmirror_device *mdevice = devmem->mdevice;
 	struct page *page;
+	struct folio *folio;
+
 
+	for (folio = mdevice->free_folios; folio; folio = folio_zone_device_data(folio))
+		if (dmirror_page_to_chunk(folio_page(folio, 0)) == devmem)
+			mdevice->free_folios = folio_zone_device_data(folio);
 	for (page = mdevice->free_pages; page; page = page->zone_device_data)
 		if (dmirror_page_to_chunk(page) == devmem)
 			mdevice->free_pages = page->zone_device_data;
@@ -1265,6 +1449,7 @@ static void dmirror_device_remove_chunks(struct dmirror_device *mdevice)
 		mdevice->devmem_count = 0;
 		mdevice->devmem_capacity = 0;
 		mdevice->free_pages = NULL;
+		mdevice->free_folios = NULL;
 		kfree(mdevice->devmem_chunks);
 		mdevice->devmem_chunks = NULL;
 	}
@@ -1378,18 +1563,30 @@ static void dmirror_devmem_free(struct page *page)
 {
 	struct page *rpage = BACKING_PAGE(page);
 	struct dmirror_device *mdevice;
+	struct folio *folio = page_folio(rpage);
+	unsigned int order = folio_order(folio);
 
-	if (rpage != page)
-		__free_page(rpage);
+	if (rpage != page) {
+		if (order)
+			__free_pages(rpage, order);
+		else
+			__free_page(rpage);
+		rpage = NULL;
+	}
 
 	mdevice = dmirror_page_to_device(page);
 	spin_lock(&mdevice->lock);
 
 	/* Return page to our allocator if not freeing the chunk */
 	if (!dmirror_page_to_chunk(page)->remove) {
-		mdevice->cfree++;
-		page->zone_device_data = mdevice->free_pages;
-		mdevice->free_pages = page;
+		mdevice->cfree += 1 << order;
+		if (order) {
+			page->zone_device_data = mdevice->free_folios;
+			mdevice->free_folios = page_folio(page);
+		} else {
+			page->zone_device_data = mdevice->free_pages;
+			mdevice->free_pages = page;
+		}
 	}
 	spin_unlock(&mdevice->lock);
 }
@@ -1397,11 +1594,10 @@ static void dmirror_devmem_free(struct page *page)
 static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
 {
 	struct migrate_vma args = { 0 };
-	unsigned long src_pfns = 0;
-	unsigned long dst_pfns = 0;
 	struct page *rpage;
 	struct dmirror *dmirror;
-	vm_fault_t ret;
+	vm_fault_t ret = 0;
+	unsigned int order, nr;
 
 	/*
 	 * Normally, a device would use the page->zone_device_data to point to
@@ -1412,21 +1608,38 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
 	dmirror = rpage->zone_device_data;
 
 	/* FIXME demonstrate how we can adjust migrate range */
+	order = folio_order(page_folio(vmf->page));
+	nr = 1 << order;
+
+	/*
+	 * Consider a per-cpu cache of src and dst pfns, but with
+	 * large number of cpus that might not scale well.
+	 */
+	args.start = ALIGN_DOWN(vmf->address, (PAGE_SIZE << order));
 	args.vma = vmf->vma;
-	args.start = vmf->address;
-	args.end = args.start + PAGE_SIZE;
-	args.src = &src_pfns;
-	args.dst = &dst_pfns;
+	args.end = args.start + (PAGE_SIZE << order);
+
+	nr = (args.end - args.start) >> PAGE_SHIFT;
+	args.src = kcalloc(nr, sizeof(unsigned long), GFP_KERNEL);
+	args.dst = kcalloc(nr, sizeof(unsigned long), GFP_KERNEL);
 	args.pgmap_owner = dmirror->mdevice;
 	args.flags = dmirror_select_device(dmirror);
 	args.fault_page = vmf->page;
 
+	if (!args.src || !args.dst) {
+		ret = VM_FAULT_OOM;
+		goto err;
+	}
+
+	if (order)
+		args.flags |= MIGRATE_VMA_SELECT_COMPOUND;
+
 	if (migrate_vma_setup(&args))
 		return VM_FAULT_SIGBUS;
 
 	ret = dmirror_devmem_fault_alloc_and_copy(&args, dmirror);
 	if (ret)
-		return ret;
+		goto err;
 	migrate_vma_pages(&args);
 	/*
 	 * No device finalize step is needed since
@@ -1434,7 +1647,10 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
 	 * invalidated the device page table.
 	 */
 	migrate_vma_finalize(&args);
-	return 0;
+err:
+	kfree(args.src);
+	kfree(args.dst);
+	return ret;
 }
 
 static const struct dev_pagemap_ops dmirror_devmem_ops = {
@@ -1465,7 +1681,7 @@ static int dmirror_device_init(struct dmirror_device *mdevice, int id)
 		return ret;
 
 	/* Build a list of free ZONE_DEVICE struct pages */
-	return dmirror_allocate_chunk(mdevice, NULL);
+	return dmirror_allocate_chunk(mdevice, NULL, false);
 }
 
 static void dmirror_device_remove(struct dmirror_device *mdevice)
-- 
2.50.1



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [v3 06/11] mm/memremap: add folio_split support
  2025-08-12  2:40 [v3 00/11] mm: support device-private THP Balbir Singh
                   ` (4 preceding siblings ...)
  2025-08-12  2:40 ` [v3 05/11] lib/test_hmm: test cases and support for zone device private THP Balbir Singh
@ 2025-08-12  2:40 ` Balbir Singh
  2025-08-12  2:40 ` [v3 07/11] mm/thp: add split during migration support Balbir Singh
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 36+ messages in thread
From: Balbir Singh @ 2025-08-12  2:40 UTC (permalink / raw)
  To: dri-devel, linux-mm, linux-kernel
  Cc: Balbir Singh, Andrew Morton, David Hildenbrand, Zi Yan,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Oscar Salvador, Lorenzo Stoakes, Baolin Wang,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Ralph Campbell, Mika Penttilä, Matthew Brost,
	Francois Dugast

When a zone device page is split (via a huge pmd folio split), the
driver callback for folio_split is invoked to let the device driver
know that the folio has been split into smaller order folios.

Provide a default implementation, for drivers that do not supply
this callback, which copies the pgmap and mapping fields to the
split folios.

Update the HMM test driver to handle the split.
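
A driver that wants to manage the split itself wires the new callback
into its dev_pagemap_ops; a minimal sketch (my_folio_split and
my_pgmap_ops are hypothetical names, other callbacks omitted):

	#include <linux/memremap.h>

	/*
	 * Called once per new tail folio and, later in the series, a final
	 * time with a NULL @tail for the original folio once the split is
	 * complete.
	 */
	static void my_folio_split(struct folio *head, struct folio *tail)
	{
		if (!tail)
			return;
		/* propagate per-folio driver state from @head to @tail */
		folio_set_zone_device_data(tail, folio_zone_device_data(head));
	}

	static const struct dev_pagemap_ops my_pgmap_ops = {
		.folio_split	= my_folio_split,
	};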

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Ying Huang <ying.huang@linux.alibaba.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>

Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
 include/linux/memremap.h | 29 +++++++++++++++++++++++++++++
 include/linux/mm.h       |  1 +
 lib/test_hmm.c           | 35 +++++++++++++++++++++++++++++++++++
 mm/huge_memory.c         |  2 +-
 4 files changed, 66 insertions(+), 1 deletion(-)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 0c5141a7d58c..20f4b5ebbc93 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -100,6 +100,13 @@ struct dev_pagemap_ops {
 	 */
 	int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long pfn,
 			      unsigned long nr_pages, int mf_flags);
+
+	/*
+	 * Used for private (un-addressable) device memory only.
+	 * This callback is used when a folio is split into
+	 * a smaller folio
+	 */
+	void (*folio_split)(struct folio *head, struct folio *tail);
 };
 
 #define PGMAP_ALTMAP_VALID	(1 << 0)
@@ -229,6 +236,23 @@ static inline void zone_device_page_init(struct page *page)
 	zone_device_folio_init(folio, 0);
 }
 
+static inline void zone_device_private_split_cb(struct folio *original_folio,
+						struct folio *new_folio)
+{
+	if (folio_is_device_private(original_folio)) {
+		if (!original_folio->pgmap->ops->folio_split) {
+			if (new_folio) {
+				new_folio->pgmap = original_folio->pgmap;
+				new_folio->page.mapping =
+					original_folio->page.mapping;
+			}
+		} else {
+			original_folio->pgmap->ops->folio_split(original_folio,
+								 new_folio);
+		}
+	}
+}
+
 #else
 static inline void *devm_memremap_pages(struct device *dev,
 		struct dev_pagemap *pgmap)
@@ -263,6 +287,11 @@ static inline unsigned long memremap_compat_align(void)
 {
 	return PAGE_SIZE;
 }
+
+static inline void zone_device_private_split_cb(struct folio *original_folio,
+						struct folio *new_folio)
+{
+}
 #endif /* CONFIG_ZONE_DEVICE */
 
 static inline void put_dev_pagemap(struct dev_pagemap *pgmap)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ceaa780a703a..f755afe533e5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1185,6 +1185,7 @@ static inline struct folio *virt_to_folio(const void *x)
 void __folio_put(struct folio *folio);
 
 void split_page(struct page *page, unsigned int order);
+void prep_compound_page(struct page *page, unsigned int order);
 void folio_copy(struct folio *dst, struct folio *src);
 int folio_mc_copy(struct folio *dst, struct folio *src);
 
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index d814056151d0..14dbce719896 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -1653,9 +1653,44 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
 	return ret;
 }
 
+static void dmirror_devmem_folio_split(struct folio *head, struct folio *tail)
+{
+	struct page *rpage = BACKING_PAGE(folio_page(head, 0));
+	struct page *rpage_tail;
+	struct folio *rfolio;
+	unsigned long offset = 0;
+
+	if (!rpage) {
+		tail->page.zone_device_data = NULL;
+		return;
+	}
+
+	rfolio = page_folio(rpage);
+
+	if (tail == NULL) {
+		folio_reset_order(rfolio);
+		rfolio->mapping = NULL;
+		folio_set_count(rfolio, 1);
+		return;
+	}
+
+	offset = folio_pfn(tail) - folio_pfn(head);
+
+	rpage_tail = folio_page(rfolio, offset);
+	tail->page.zone_device_data = rpage_tail;
+	rpage_tail->zone_device_data = rpage->zone_device_data;
+	clear_compound_head(rpage_tail);
+	rpage_tail->mapping = NULL;
+
+	folio_page(tail, 0)->mapping = folio_page(head, 0)->mapping;
+	tail->pgmap = head->pgmap;
+	folio_set_count(page_folio(rpage_tail), 1);
+}
+
 static const struct dev_pagemap_ops dmirror_devmem_ops = {
 	.page_free	= dmirror_devmem_free,
 	.migrate_to_ram	= dmirror_devmem_fault,
+	.folio_split	= dmirror_devmem_folio_split,
 };
 
 static int dmirror_device_init(struct dmirror_device *mdevice, int id)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8888140e57a3..dc58081b661c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3922,7 +3922,6 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 
 		ret = __split_unmapped_folio(folio, new_order, split_at, &xas,
 					     mapping, uniform_split);
-
 		/*
 		 * Unfreeze after-split folios and put them back to the right
 		 * list. @folio should be kept frozon until page cache
@@ -3973,6 +3972,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 			__filemap_remove_folio(new_folio, NULL);
 			folio_put_refs(new_folio, nr_pages);
 		}
+
 		/*
 		 * Unfreeze @folio only after all page cache entries, which
 		 * used to point to it, have been updated with new folios.
-- 
2.50.1



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [v3 07/11] mm/thp: add split during migration support
  2025-08-12  2:40 [v3 00/11] mm: support device-private THP Balbir Singh
                   ` (5 preceding siblings ...)
  2025-08-12  2:40 ` [v3 06/11] mm/memremap: add folio_split support Balbir Singh
@ 2025-08-12  2:40 ` Balbir Singh
  2025-08-27 20:29   ` David Hildenbrand
  2025-08-12  2:40 ` [v3 08/11] lib/test_hmm: add test case for split pages Balbir Singh
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 36+ messages in thread
From: Balbir Singh @ 2025-08-12  2:40 UTC (permalink / raw)
  To: dri-devel, linux-mm, linux-kernel
  Cc: Balbir Singh, Andrew Morton, David Hildenbrand, Zi Yan,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Oscar Salvador, Lorenzo Stoakes, Baolin Wang,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Ralph Campbell, Mika Penttilä, Matthew Brost,
	Francois Dugast

Support splitting pages during THP zone device migration as needed.
The common case is that, after setup, the destination is unable to
allocate MIGRATE_PFN_COMPOUND pages during the migrate phase.

Add a new routine migrate_vma_split_pages() to support the splitting
of already isolated pages. The pages being migrated are already
unmapped and marked for migration during setup (via unmap).
folio_split() and __split_unmapped_folio() take an additional
'unmapped' argument, to avoid unmapping and remapping these pages and
unlocking/putting the folio.
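
Condensed from the mm/migrate_device.c hunk below, splitting an
already unmapped folio boils down to:

	/* the folio is already unmapped and locked; split it in place */
	folio_get(folio);
	split_huge_pmd_address(migrate->vma, addr, true);
	ret = __split_huge_page_to_list_to_order(folio_page(folio, 0),
						 NULL, 0, /* unmapped */ true);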

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Ying Huang <ying.huang@linux.alibaba.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>

Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
 include/linux/huge_mm.h | 11 +++++--
 lib/test_hmm.c          |  9 ++++++
 mm/huge_memory.c        | 45 ++++++++++++++------------
 mm/migrate_device.c     | 71 ++++++++++++++++++++++++++++++++++-------
 4 files changed, 101 insertions(+), 35 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a4880fe98e46..52d8b435950b 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -343,8 +343,8 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
 		vm_flags_t vm_flags);
 
 bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
-int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
-		unsigned int new_order);
+int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
+		unsigned int new_order, bool unmapped);
 int min_order_for_split(struct folio *folio);
 int split_folio_to_list(struct folio *folio, struct list_head *list);
 bool uniform_split_supported(struct folio *folio, unsigned int new_order,
@@ -353,6 +353,13 @@ bool non_uniform_split_supported(struct folio *folio, unsigned int new_order,
 		bool warns);
 int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
 		struct list_head *list);
+
+static inline int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
+		unsigned int new_order)
+{
+	return __split_huge_page_to_list_to_order(page, list, new_order, false);
+}
+
 /*
  * try_folio_split - try to split a @folio at @page using non uniform split.
  * @folio: folio to be split
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 14dbce719896..dda87c34b440 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -1611,6 +1611,15 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
 	order = folio_order(page_folio(vmf->page));
 	nr = 1 << order;
 
+	/*
+	 * When folios are partially mapped, we can't rely on the folio
+	 * order of vmf->page as the folio might not be fully split yet
+	 */
+	if (vmf->pte) {
+		order = 0;
+		nr = 1;
+	}
+
 	/*
 	 * Consider a per-cpu cache of src and dst pfns, but with
 	 * large number of cpus that might not scale well.
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index dc58081b661c..863393dec1f1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3474,15 +3474,6 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
 		new_folio->mapping = folio->mapping;
 		new_folio->index = folio->index + i;
 
-		/*
-		 * page->private should not be set in tail pages. Fix up and warn once
-		 * if private is unexpectedly set.
-		 */
-		if (unlikely(new_folio->private)) {
-			VM_WARN_ON_ONCE_PAGE(true, new_head);
-			new_folio->private = NULL;
-		}
-
 		if (folio_test_swapcache(folio))
 			new_folio->swap.val = folio->swap.val + i;
 
@@ -3711,6 +3702,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
  * @lock_at: a page within @folio to be left locked to caller
  * @list: after-split folios will be put on it if non NULL
  * @uniform_split: perform uniform split or not (non-uniform split)
+ * @unmapped: The pages are already unmapped, they are migration entries.
  *
  * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
  * It is in charge of checking whether the split is supported or not and
@@ -3726,7 +3718,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
  */
 static int __folio_split(struct folio *folio, unsigned int new_order,
 		struct page *split_at, struct page *lock_at,
-		struct list_head *list, bool uniform_split)
+		struct list_head *list, bool uniform_split, bool unmapped)
 {
 	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
 	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
@@ -3776,13 +3768,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 		 * is taken to serialise against parallel split or collapse
 		 * operations.
 		 */
-		anon_vma = folio_get_anon_vma(folio);
-		if (!anon_vma) {
-			ret = -EBUSY;
-			goto out;
+		if (!unmapped) {
+			anon_vma = folio_get_anon_vma(folio);
+			if (!anon_vma) {
+				ret = -EBUSY;
+				goto out;
+			}
+			anon_vma_lock_write(anon_vma);
 		}
 		mapping = NULL;
-		anon_vma_lock_write(anon_vma);
 	} else {
 		unsigned int min_order;
 		gfp_t gfp;
@@ -3849,7 +3843,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 		goto out_unlock;
 	}
 
-	unmap_folio(folio);
+	if (!unmapped)
+		unmap_folio(folio);
 
 	/* block interrupt reentry in xa_lock and spinlock */
 	local_irq_disable();
@@ -3936,10 +3931,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 
 			next = folio_next(new_folio);
 
+			zone_device_private_split_cb(folio, new_folio);
+
 			expected_refs = folio_expected_ref_count(new_folio) + 1;
 			folio_ref_unfreeze(new_folio, expected_refs);
 
-			lru_add_split_folio(folio, new_folio, lruvec, list);
+			if (!unmapped)
+				lru_add_split_folio(folio, new_folio, lruvec, list);
 
 			/*
 			 * Anonymous folio with swap cache.
@@ -3973,6 +3971,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 			folio_put_refs(new_folio, nr_pages);
 		}
 
+		zone_device_private_split_cb(folio, NULL);
 		/*
 		 * Unfreeze @folio only after all page cache entries, which
 		 * used to point to it, have been updated with new folios.
@@ -3996,6 +3995,9 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 
 	local_irq_enable();
 
+	if (unmapped)
+		return ret;
+
 	if (nr_shmem_dropped)
 		shmem_uncharge(mapping->host, nr_shmem_dropped);
 
@@ -4086,12 +4088,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
  * Returns -EINVAL when trying to split to an order that is incompatible
  * with the folio. Splitting to order 0 is compatible with all folios.
  */
-int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
-				     unsigned int new_order)
+int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
+				     unsigned int new_order, bool unmapped)
 {
 	struct folio *folio = page_folio(page);
 
-	return __folio_split(folio, new_order, &folio->page, page, list, true);
+	return __folio_split(folio, new_order, &folio->page, page, list, true,
+				unmapped);
 }
 
 /*
@@ -4120,7 +4123,7 @@ int folio_split(struct folio *folio, unsigned int new_order,
 		struct page *split_at, struct list_head *list)
 {
 	return __folio_split(folio, new_order, split_at, &folio->page, list,
-			false);
+			false, false);
 }
 
 int min_order_for_split(struct folio *folio)
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 6621bba62710..9206a3d5c0d1 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -864,6 +864,29 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
 		src[i] &= ~MIGRATE_PFN_MIGRATE;
 	return 0;
 }
+
+static int migrate_vma_split_pages(struct migrate_vma *migrate,
+					unsigned long idx, unsigned long addr,
+					struct folio *folio)
+{
+	unsigned long i;
+	unsigned long pfn;
+	unsigned long flags;
+	int ret = 0;
+
+	folio_get(folio);
+	split_huge_pmd_address(migrate->vma, addr, true);
+	ret = __split_huge_page_to_list_to_order(folio_page(folio, 0), NULL,
+							0, true);
+	if (ret)
+		return ret;
+	migrate->src[idx] &= ~MIGRATE_PFN_COMPOUND;
+	flags = migrate->src[idx] & ((1UL << MIGRATE_PFN_SHIFT) - 1);
+	pfn = migrate->src[idx] >> MIGRATE_PFN_SHIFT;
+	for (i = 1; i < HPAGE_PMD_NR; i++)
+		migrate->src[i+idx] = migrate_pfn(pfn + i) | flags;
+	return ret;
+}
 #else /* !CONFIG_ARCH_ENABLE_THP_MIGRATION */
 static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
 					 unsigned long addr,
@@ -873,6 +896,13 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
 {
 	return 0;
 }
+
+static int migrate_vma_split_pages(struct migrate_vma *migrate,
+					unsigned long idx, unsigned long addr,
+					struct folio *folio)
+{
+	return 0;
+}
 #endif
 
 /*
@@ -1022,8 +1052,9 @@ static void __migrate_device_pages(unsigned long *src_pfns,
 				struct migrate_vma *migrate)
 {
 	struct mmu_notifier_range range;
-	unsigned long i;
+	unsigned long i, j;
 	bool notified = false;
+	unsigned long addr;
 
 	for (i = 0; i < npages; ) {
 		struct page *newpage = migrate_pfn_to_page(dst_pfns[i]);
@@ -1065,12 +1096,16 @@ static void __migrate_device_pages(unsigned long *src_pfns,
 				(!(dst_pfns[i] & MIGRATE_PFN_COMPOUND))) {
 				nr = HPAGE_PMD_NR;
 				src_pfns[i] &= ~MIGRATE_PFN_COMPOUND;
-				src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
-				goto next;
+			} else {
+				nr = 1;
 			}
 
-			migrate_vma_insert_page(migrate, addr, &dst_pfns[i],
-						&src_pfns[i]);
+			for (j = 0; j < nr && i + j < npages; j++) {
+				src_pfns[i+j] |= MIGRATE_PFN_MIGRATE;
+				migrate_vma_insert_page(migrate,
+					addr + j * PAGE_SIZE,
+					&dst_pfns[i+j], &src_pfns[i+j]);
+			}
 			goto next;
 		}
 
@@ -1092,7 +1127,14 @@ static void __migrate_device_pages(unsigned long *src_pfns,
 							 MIGRATE_PFN_COMPOUND);
 					goto next;
 				}
-				src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
+				nr = 1 << folio_order(folio);
+				addr = migrate->start + i * PAGE_SIZE;
+				if (migrate_vma_split_pages(migrate, i, addr,
+								folio)) {
+					src_pfns[i] &= ~(MIGRATE_PFN_MIGRATE |
+							 MIGRATE_PFN_COMPOUND);
+					goto next;
+				}
 			} else if ((src_pfns[i] & MIGRATE_PFN_MIGRATE) &&
 				(dst_pfns[i] & MIGRATE_PFN_COMPOUND) &&
 				!(src_pfns[i] & MIGRATE_PFN_COMPOUND)) {
@@ -1127,12 +1169,17 @@ static void __migrate_device_pages(unsigned long *src_pfns,
 		BUG_ON(folio_test_writeback(folio));
 
 		if (migrate && migrate->fault_page == page)
-			extra_cnt = 1;
-		r = folio_migrate_mapping(mapping, newfolio, folio, extra_cnt);
-		if (r != MIGRATEPAGE_SUCCESS)
-			src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
-		else
-			folio_migrate_flags(newfolio, folio);
+			extra_cnt++;
+		for (j = 0; j < nr && i + j < npages; j++) {
+			folio = page_folio(migrate_pfn_to_page(src_pfns[i+j]));
+			newfolio = page_folio(migrate_pfn_to_page(dst_pfns[i+j]));
+
+			r = folio_migrate_mapping(mapping, newfolio, folio, extra_cnt);
+			if (r != MIGRATEPAGE_SUCCESS)
+				src_pfns[i+j] &= ~MIGRATE_PFN_MIGRATE;
+			else
+				folio_migrate_flags(newfolio, folio);
+		}
 next:
 		i += nr;
 	}
-- 
2.50.1



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [v3 08/11] lib/test_hmm: add test case for split pages
  2025-08-12  2:40 [v3 00/11] mm: support device-private THP Balbir Singh
                   ` (6 preceding siblings ...)
  2025-08-12  2:40 ` [v3 07/11] mm/thp: add split during migration support Balbir Singh
@ 2025-08-12  2:40 ` Balbir Singh
  2025-08-12  2:40 ` [v3 09/11] selftests/mm/hmm-tests: new tests for zone device THP migration Balbir Singh
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 36+ messages in thread
From: Balbir Singh @ 2025-08-12  2:40 UTC (permalink / raw)
  To: dri-devel, linux-mm, linux-kernel
  Cc: Balbir Singh, Andrew Morton, David Hildenbrand, Zi Yan,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Oscar Salvador, Lorenzo Stoakes, Baolin Wang,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Ralph Campbell, Mika Penttilä, Matthew Brost,
	Francois Dugast

Add a new flag, HMM_DMIRROR_FLAG_FAIL_ALLOC, to emulate failure to
allocate a large page. This exercises the code paths involved in
split migration. The flag is set through the new HMM_DMIRROR_FLAGS
ioctl and is cleared by the driver after the first failed allocation,
making it a one-shot trigger.
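
From userspace the flag is armed before kicking off a migration; a
minimal sketch (fd is assumed to be an already open dmirror device
file descriptor):

	#include <sys/ioctl.h>
	#include "test_hmm_uapi.h"

	struct hmm_dmirror_cmd cmd = { 0 };

	/*
	 * The flag value is carried in .npages; the driver clears it after
	 * the first failed large allocation, so it is a one-shot trigger.
	 */
	cmd.npages = HMM_DMIRROR_FLAG_FAIL_ALLOC;
	if (ioctl(fd, HMM_DMIRROR_FLAGS, &cmd) < 0)
		perror("HMM_DMIRROR_FLAGS");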

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Ying Huang <ying.huang@linux.alibaba.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>

Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
 lib/test_hmm.c      | 61 ++++++++++++++++++++++++++++++---------------
 lib/test_hmm_uapi.h |  3 +++
 2 files changed, 44 insertions(+), 20 deletions(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index dda87c34b440..5c5bfb48ec8a 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -92,6 +92,7 @@ struct dmirror {
 	struct xarray			pt;
 	struct mmu_interval_notifier	notifier;
 	struct mutex			mutex;
+	__u64			flags;
 };
 
 /*
@@ -699,7 +700,12 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
 		     page_to_pfn(spage)))
 			goto next;
 
-		dpage = dmirror_devmem_alloc_page(dmirror, is_large);
+		if (dmirror->flags & HMM_DMIRROR_FLAG_FAIL_ALLOC) {
+			dmirror->flags &= ~HMM_DMIRROR_FLAG_FAIL_ALLOC;
+			dpage = NULL;
+		} else
+			dpage = dmirror_devmem_alloc_page(dmirror, is_large);
+
 		if (!dpage) {
 			struct folio *folio;
 			unsigned long i;
@@ -959,44 +965,55 @@ static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma *args,
 
 		spage = BACKING_PAGE(spage);
 		order = folio_order(page_folio(spage));
-
 		if (order)
+			*dst = MIGRATE_PFN_COMPOUND;
+		if (*src & MIGRATE_PFN_WRITE)
+			*dst |= MIGRATE_PFN_WRITE;
+
+		if (dmirror->flags & HMM_DMIRROR_FLAG_FAIL_ALLOC) {
+			dmirror->flags &= ~HMM_DMIRROR_FLAG_FAIL_ALLOC;
+			*dst &= ~MIGRATE_PFN_COMPOUND;
+			dpage = NULL;
+		} else if (order) {
 			dpage = folio_page(vma_alloc_folio(GFP_HIGHUSER_MOVABLE,
 						order, args->vma, addr), 0);
-		else
-			dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
-
-		/* Try with smaller pages if large allocation fails */
-		if (!dpage && order) {
+		} else {
 			dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
-			if (!dpage)
-				return VM_FAULT_OOM;
-			order = 0;
 		}
 
+		if (!dpage && !order)
+			return VM_FAULT_OOM;
+
 		pr_debug("migrating from sys to dev pfn src: 0x%lx pfn dst: 0x%lx\n",
 				page_to_pfn(spage), page_to_pfn(dpage));
-		lock_page(dpage);
-		xa_erase(&dmirror->pt, addr >> PAGE_SHIFT);
-		copy_highpage(dpage, spage);
-		*dst = migrate_pfn(page_to_pfn(dpage));
-		if (*src & MIGRATE_PFN_WRITE)
-			*dst |= MIGRATE_PFN_WRITE;
-		if (order)
-			*dst |= MIGRATE_PFN_COMPOUND;
+
+		if (dpage) {
+			lock_page(dpage);
+			*dst |= migrate_pfn(page_to_pfn(dpage));
+		}
 
 		for (i = 0; i < (1 << order); i++) {
 			struct page *src_page;
 			struct page *dst_page;
 
+			/* Try with smaller pages if large allocation fails */
+			if (!dpage && order) {
+				dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
+				lock_page(dpage);
+				dst[i] = migrate_pfn(page_to_pfn(dpage));
+				dst_page = pfn_to_page(page_to_pfn(dpage));
+				dpage = NULL; /* For the next iteration */
+			} else {
+				dst_page = pfn_to_page(page_to_pfn(dpage) + i);
+			}
+
 			src_page = pfn_to_page(page_to_pfn(spage) + i);
-			dst_page = pfn_to_page(page_to_pfn(dpage) + i);
 
 			xa_erase(&dmirror->pt, addr >> PAGE_SHIFT);
+			addr += PAGE_SIZE;
 			copy_highpage(dst_page, src_page);
 		}
 next:
-		addr += PAGE_SIZE << order;
 		src += 1 << order;
 		dst += 1 << order;
 	}
@@ -1514,6 +1531,10 @@ static long dmirror_fops_unlocked_ioctl(struct file *filp,
 		dmirror_device_remove_chunks(dmirror->mdevice);
 		ret = 0;
 		break;
+	case HMM_DMIRROR_FLAGS:
+		dmirror->flags = cmd.npages;
+		ret = 0;
+		break;
 
 	default:
 		return -EINVAL;
diff --git a/lib/test_hmm_uapi.h b/lib/test_hmm_uapi.h
index 8c818a2cf4f6..f94c6d457338 100644
--- a/lib/test_hmm_uapi.h
+++ b/lib/test_hmm_uapi.h
@@ -37,6 +37,9 @@ struct hmm_dmirror_cmd {
 #define HMM_DMIRROR_EXCLUSIVE		_IOWR('H', 0x05, struct hmm_dmirror_cmd)
 #define HMM_DMIRROR_CHECK_EXCLUSIVE	_IOWR('H', 0x06, struct hmm_dmirror_cmd)
 #define HMM_DMIRROR_RELEASE		_IOWR('H', 0x07, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_FLAGS		_IOWR('H', 0x08, struct hmm_dmirror_cmd)
+
+#define HMM_DMIRROR_FLAG_FAIL_ALLOC	(1ULL << 0)
 
 /*
  * Values returned in hmm_dmirror_cmd.ptr for HMM_DMIRROR_SNAPSHOT.
-- 
2.50.1




* [v3 09/11] selftests/mm/hmm-tests: new tests for zone device THP migration
  2025-08-12  2:40 [v3 00/11] mm: support device-private THP Balbir Singh
                   ` (7 preceding siblings ...)
  2025-08-12  2:40 ` [v3 08/11] lib/test_hmm: add test case for split pages Balbir Singh
@ 2025-08-12  2:40 ` Balbir Singh
  2025-08-12  2:40 ` [v3 10/11] gpu/drm/nouveau: add THP migration support Balbir Singh
  2025-08-12  2:40 ` [v3 11/11] selftests/mm/hmm-tests: new throughput tests including THP Balbir Singh
  10 siblings, 0 replies; 36+ messages in thread
From: Balbir Singh @ 2025-08-12  2:40 UTC (permalink / raw)
  To: dri-devel, linux-mm, linux-kernel
  Cc: Balbir Singh, Andrew Morton, David Hildenbrand, Zi Yan,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Oscar Salvador, Lorenzo Stoakes, Baolin Wang,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Ralph Campbell, Mika Penttilä, Matthew Brost,
	Francois Dugast

Add new tests for migrating anonymous THP pages, including anon_huge,
anon_huge_zero and error cases that force pages to be split during
migration.
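
All of these tests share the same setup, sketched below under the
assumption that TWOMEG, ALIGN() and the buffer helpers are the existing
ones in hmm-tests.c: reserve twice the THP size, align to a 2MB
boundary and request huge pages before migrating.

	size = TWOMEG;
	buffer->ptr = mmap(NULL, 2 * size, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, buffer->fd, 0);
	ASSERT_NE(buffer->ptr, MAP_FAILED);

	/* Align to the THP size so the range can be PMD mapped. */
	map = (void *)ALIGN((uintptr_t)buffer->ptr, size);
	ret = madvise(map, size, MADV_HUGEPAGE);
	ASSERT_EQ(ret, 0);
	old_ptr = buffer->ptr;
	buffer->ptr = map;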

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Ying Huang <ying.huang@linux.alibaba.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>

Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
 tools/testing/selftests/mm/hmm-tests.c | 410 +++++++++++++++++++++++++
 1 file changed, 410 insertions(+)

diff --git a/tools/testing/selftests/mm/hmm-tests.c b/tools/testing/selftests/mm/hmm-tests.c
index 141bf63cbe05..da3322a1282c 100644
--- a/tools/testing/selftests/mm/hmm-tests.c
+++ b/tools/testing/selftests/mm/hmm-tests.c
@@ -2056,4 +2056,414 @@ TEST_F(hmm, hmm_cow_in_device)
 
 	hmm_buffer_free(buffer);
 }
+
+/*
+ * Migrate private anonymous huge empty page.
+ */
+TEST_F(hmm, migrate_anon_huge_empty)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	void *old_ptr;
+	void *map;
+	int *ptr;
+	int ret;
+
+	size = TWOMEG;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = 2 * size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+	memset(buffer->mirror, 0xFF, size);
+
+	buffer->ptr = mmap(NULL, 2 * size,
+			   PROT_READ,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	npages = size >> self->page_shift;
+	map = (void *)ALIGN((uintptr_t)buffer->ptr, size);
+	ret = madvise(map, size, MADV_HUGEPAGE);
+	ASSERT_EQ(ret, 0);
+	old_ptr = buffer->ptr;
+	buffer->ptr = map;
+
+	/* Migrate memory to device. */
+	ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], 0);
+
+	buffer->ptr = old_ptr;
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Migrate private anonymous huge zero page.
+ */
+TEST_F(hmm, migrate_anon_huge_zero)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	void *old_ptr;
+	void *map;
+	int *ptr;
+	int ret;
+	int val;
+
+	size = TWOMEG;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = 2 * size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+	memset(buffer->mirror, 0xFF, size);
+
+	buffer->ptr = mmap(NULL, 2 * size,
+			   PROT_READ,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	npages = size >> self->page_shift;
+	map = (void *)ALIGN((uintptr_t)buffer->ptr, size);
+	ret = madvise(map, size, MADV_HUGEPAGE);
+	ASSERT_EQ(ret, 0);
+	old_ptr = buffer->ptr;
+	buffer->ptr = map;
+
+	/* Initialize a read-only zero huge page. */
+	val = *(int *)buffer->ptr;
+	ASSERT_EQ(val, 0);
+
+	/* Migrate memory to device. */
+	ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], 0);
+
+	/* Fault pages back to system memory and check them. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) {
+		ASSERT_EQ(ptr[i], 0);
+		/* If it asserts once, it probably will 500,000 times */
+		if (ptr[i] != 0)
+			break;
+	}
+
+	buffer->ptr = old_ptr;
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Migrate private anonymous huge page and free.
+ */
+TEST_F(hmm, migrate_anon_huge_free)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	void *old_ptr;
+	void *map;
+	int *ptr;
+	int ret;
+
+	size = TWOMEG;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = 2 * size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+	memset(buffer->mirror, 0xFF, size);
+
+	buffer->ptr = mmap(NULL, 2 * size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	npages = size >> self->page_shift;
+	map = (void *)ALIGN((uintptr_t)buffer->ptr, size);
+	ret = madvise(map, size, MADV_HUGEPAGE);
+	ASSERT_EQ(ret, 0);
+	old_ptr = buffer->ptr;
+	buffer->ptr = map;
+
+	/* Initialize buffer in system memory. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Migrate memory to device. */
+	ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	/* Try freeing it. */
+	ret = madvise(map, size, MADV_FREE);
+	ASSERT_EQ(ret, 0);
+
+	buffer->ptr = old_ptr;
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Migrate private anonymous huge page and fault back to sysmem.
+ */
+TEST_F(hmm, migrate_anon_huge_fault)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	void *old_ptr;
+	void *map;
+	int *ptr;
+	int ret;
+
+	size = TWOMEG;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = 2 * size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+	memset(buffer->mirror, 0xFF, size);
+
+	buffer->ptr = mmap(NULL, 2 * size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	npages = size >> self->page_shift;
+	map = (void *)ALIGN((uintptr_t)buffer->ptr, size);
+	ret = madvise(map, size, MADV_HUGEPAGE);
+	ASSERT_EQ(ret, 0);
+	old_ptr = buffer->ptr;
+	buffer->ptr = map;
+
+	/* Initialize buffer in system memory. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Migrate memory to device. */
+	ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	/* Fault pages back to system memory and check them. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	buffer->ptr = old_ptr;
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Migrate private anonymous huge page with allocation errors.
+ */
+TEST_F(hmm, migrate_anon_huge_err)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	void *old_ptr;
+	void *map;
+	int *ptr;
+	int ret;
+
+	size = TWOMEG;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = 2 * size;
+	buffer->mirror = malloc(2 * size);
+	ASSERT_NE(buffer->mirror, NULL);
+	memset(buffer->mirror, 0xFF, 2 * size);
+
+	old_ptr = mmap(NULL, 2 * size, PROT_READ | PROT_WRITE,
+			MAP_PRIVATE | MAP_ANONYMOUS, buffer->fd, 0);
+	ASSERT_NE(old_ptr, MAP_FAILED);
+
+	npages = size >> self->page_shift;
+	map = (void *)ALIGN((uintptr_t)old_ptr, size);
+	ret = madvise(map, size, MADV_HUGEPAGE);
+	ASSERT_EQ(ret, 0);
+	buffer->ptr = map;
+
+	/* Initialize buffer in system memory. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Migrate memory to device but force a THP allocation error. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_FLAGS, buffer,
+			      HMM_DMIRROR_FLAG_FAIL_ALLOC);
+	ASSERT_EQ(ret, 0);
+	ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i) {
+		ASSERT_EQ(ptr[i], i);
+		if (ptr[i] != i)
+			break;
+	}
+
+	/* Try faulting back a single (PAGE_SIZE) page. */
+	ptr = buffer->ptr;
+	ASSERT_EQ(ptr[2048], 2048);
+
+	/* unmap and remap the region to reset things. */
+	ret = munmap(old_ptr, 2 * size);
+	ASSERT_EQ(ret, 0);
+	old_ptr = mmap(NULL, 2 * size, PROT_READ | PROT_WRITE,
+			MAP_PRIVATE | MAP_ANONYMOUS, buffer->fd, 0);
+	ASSERT_NE(old_ptr, MAP_FAILED);
+	map = (void *)ALIGN((uintptr_t)old_ptr, size);
+	ret = madvise(map, size, MADV_HUGEPAGE);
+	ASSERT_EQ(ret, 0);
+	buffer->ptr = map;
+
+	/* Initialize buffer in system memory. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Migrate THP to device. */
+	ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/*
+	 * Force an allocation error when faulting back a THP resident in the
+	 * device.
+	 */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_FLAGS, buffer,
+			      HMM_DMIRROR_FLAG_FAIL_ALLOC);
+	ASSERT_EQ(ret, 0);
+
+	ret = hmm_migrate_dev_to_sys(self->fd, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ptr = buffer->ptr;
+	ASSERT_EQ(ptr[2048], 2048);
+
+	buffer->ptr = old_ptr;
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Migrate private anonymous huge zero page with allocation errors.
+ */
+TEST_F(hmm, migrate_anon_huge_zero_err)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	void *old_ptr;
+	void *map;
+	int *ptr;
+	int ret;
+
+	size = TWOMEG;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = 2 * size;
+	buffer->mirror = malloc(2 * size);
+	ASSERT_NE(buffer->mirror, NULL);
+	memset(buffer->mirror, 0xFF, 2 * size);
+
+	old_ptr = mmap(NULL, 2 * size, PROT_READ,
+			MAP_PRIVATE | MAP_ANONYMOUS, buffer->fd, 0);
+	ASSERT_NE(old_ptr, MAP_FAILED);
+
+	npages = size >> self->page_shift;
+	map = (void *)ALIGN((uintptr_t)old_ptr, size);
+	ret = madvise(map, size, MADV_HUGEPAGE);
+	ASSERT_EQ(ret, 0);
+	buffer->ptr = map;
+
+	/* Migrate memory to device but force a THP allocation error. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_FLAGS, buffer,
+			      HMM_DMIRROR_FLAG_FAIL_ALLOC);
+	ASSERT_EQ(ret, 0);
+	ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], 0);
+
+	/* Try faulting back a single (PAGE_SIZE) page. */
+	ptr = buffer->ptr;
+	ASSERT_EQ(ptr[2048], 0);
+
+	/* unmap and remap the region to reset things. */
+	ret = munmap(old_ptr, 2 * size);
+	ASSERT_EQ(ret, 0);
+	old_ptr = mmap(NULL, 2 * size, PROT_READ,
+			MAP_PRIVATE | MAP_ANONYMOUS, buffer->fd, 0);
+	ASSERT_NE(old_ptr, MAP_FAILED);
+	map = (void *)ALIGN((uintptr_t)old_ptr, size);
+	ret = madvise(map, size, MADV_HUGEPAGE);
+	ASSERT_EQ(ret, 0);
+	buffer->ptr = map;
+
+	/* Initialize buffer in system memory (zero THP page). */
+	ret = *(int *)buffer->ptr;
+	ASSERT_EQ(ret, 0);
+
+	/* Migrate memory to device but force a THP allocation error. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_FLAGS, buffer,
+			      HMM_DMIRROR_FLAG_FAIL_ALLOC);
+	ASSERT_EQ(ret, 0);
+	ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/* Fault the device memory back and check it. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], 0);
+
+	buffer->ptr = old_ptr;
+	hmm_buffer_free(buffer);
+}
 TEST_HARNESS_MAIN
-- 
2.50.1




* [v3 10/11] gpu/drm/nouveau: add THP migration support
  2025-08-12  2:40 [v3 00/11] mm: support device-private THP Balbir Singh
                   ` (8 preceding siblings ...)
  2025-08-12  2:40 ` [v3 09/11] selftests/mm/hmm-tests: new tests for zone device THP migration Balbir Singh
@ 2025-08-12  2:40 ` Balbir Singh
  2025-08-13  2:23   ` kernel test robot
  2025-08-12  2:40 ` [v3 11/11] selftests/mm/hmm-tests: new throughput tests including THP Balbir Singh
  10 siblings, 1 reply; 36+ messages in thread
From: Balbir Singh @ 2025-08-12  2:40 UTC (permalink / raw)
  To: dri-devel, linux-mm, linux-kernel
  Cc: Balbir Singh, Andrew Morton, David Hildenbrand, Zi Yan,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Oscar Salvador, Lorenzo Stoakes, Baolin Wang,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Ralph Campbell, Mika Penttilä, Matthew Brost,
	Francois Dugast

Change the code to add support for MIGRATE_VMA_SELECT_COMPOUND
and to handle page sizes appropriately in the migrate/evict
code paths.
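
At a high level the driver-side opt-in looks like the sketch below,
condensed from the nouveau_dmem.c changes in this patch (allocation,
copy and error handling omitted; vma, start, order, the src/dst arrays
and dpage are assumed to come from the surrounding fault path). The
driver asks for compound candidates with MIGRATE_VMA_SELECT_COMPOUND
and, when it can back the range with a large folio, publishes the
destination entry as MIGRATE_PFN_COMPOUND:

	struct migrate_vma args = {
		.vma		= vma,
		.start		= start,
		.end		= start + (PAGE_SIZE << order),
		.src		= src,
		.dst		= dst,
		.pgmap_owner	= drm->dev,
		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE |
				  MIGRATE_VMA_SELECT_COMPOUND,
	};

	if (migrate_vma_setup(&args) || !args.cpages)
		return 0;	/* nothing to migrate */

	args.dst[0] = migrate_pfn(page_to_pfn(dpage));
	if (order)
		args.dst[0] |= MIGRATE_PFN_COMPOUND;

	migrate_vma_pages(&args);
	migrate_vma_finalize(&args);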

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Ying Huang <ying.huang@linux.alibaba.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>

Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
 drivers/gpu/drm/nouveau/nouveau_dmem.c | 253 ++++++++++++++++++-------
 drivers/gpu/drm/nouveau/nouveau_svm.c  |   6 +-
 drivers/gpu/drm/nouveau/nouveau_svm.h  |   3 +-
 3 files changed, 186 insertions(+), 76 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index ca4932a150e3..408c1adf6f20 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -83,9 +83,15 @@ struct nouveau_dmem {
 	struct list_head chunks;
 	struct mutex mutex;
 	struct page *free_pages;
+	struct folio *free_folios;
 	spinlock_t lock;
 };
 
+struct nouveau_dmem_dma_info {
+	dma_addr_t dma_addr;
+	size_t size;
+};
+
 static struct nouveau_dmem_chunk *nouveau_page_to_chunk(struct page *page)
 {
 	return container_of(page_pgmap(page), struct nouveau_dmem_chunk,
@@ -112,10 +118,16 @@ static void nouveau_dmem_page_free(struct page *page)
 {
 	struct nouveau_dmem_chunk *chunk = nouveau_page_to_chunk(page);
 	struct nouveau_dmem *dmem = chunk->drm->dmem;
+	struct folio *folio = page_folio(page);
 
 	spin_lock(&dmem->lock);
-	page->zone_device_data = dmem->free_pages;
-	dmem->free_pages = page;
+	if (folio_order(folio)) {
+		folio_set_zone_device_data(folio, dmem->free_folios);
+		dmem->free_folios = folio;
+	} else {
+		page->zone_device_data = dmem->free_pages;
+		dmem->free_pages = page;
+	}
 
 	WARN_ON(!chunk->callocated);
 	chunk->callocated--;
@@ -139,20 +151,28 @@ static void nouveau_dmem_fence_done(struct nouveau_fence **fence)
 	}
 }
 
-static int nouveau_dmem_copy_one(struct nouveau_drm *drm, struct page *spage,
-				struct page *dpage, dma_addr_t *dma_addr)
+static int nouveau_dmem_copy_folio(struct nouveau_drm *drm,
+				   struct folio *sfolio, struct folio *dfolio,
+				   struct nouveau_dmem_dma_info *dma_info)
 {
 	struct device *dev = drm->dev->dev;
+	struct page *dpage = folio_page(dfolio, 0);
+	struct page *spage = folio_page(sfolio, 0);
 
-	lock_page(dpage);
+	folio_lock(dfolio);
 
-	*dma_addr = dma_map_page(dev, dpage, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
-	if (dma_mapping_error(dev, *dma_addr))
+	dma_info->dma_addr = dma_map_page(dev, dpage, 0, page_size(dpage),
+					DMA_BIDIRECTIONAL);
+	dma_info->size = page_size(dpage);
+	if (dma_mapping_error(dev, dma_info->dma_addr))
 		return -EIO;
 
-	if (drm->dmem->migrate.copy_func(drm, 1, NOUVEAU_APER_HOST, *dma_addr,
-					 NOUVEAU_APER_VRAM, nouveau_dmem_page_addr(spage))) {
-		dma_unmap_page(dev, *dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
+	if (drm->dmem->migrate.copy_func(drm, folio_nr_pages(sfolio),
+					 NOUVEAU_APER_HOST, dma_info->dma_addr,
+					 NOUVEAU_APER_VRAM,
+					 nouveau_dmem_page_addr(spage))) {
+		dma_unmap_page(dev, dma_info->dma_addr, page_size(dpage),
+					DMA_BIDIRECTIONAL);
 		return -EIO;
 	}
 
@@ -165,21 +185,47 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf)
 	struct nouveau_dmem *dmem = drm->dmem;
 	struct nouveau_fence *fence;
 	struct nouveau_svmm *svmm;
-	struct page *spage, *dpage;
-	unsigned long src = 0, dst = 0;
-	dma_addr_t dma_addr = 0;
+	struct page *dpage;
 	vm_fault_t ret = 0;
 	struct migrate_vma args = {
 		.vma		= vmf->vma,
-		.start		= vmf->address,
-		.end		= vmf->address + PAGE_SIZE,
-		.src		= &src,
-		.dst		= &dst,
 		.pgmap_owner	= drm->dev,
 		.fault_page	= vmf->page,
-		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
+		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE |
+				  MIGRATE_VMA_SELECT_COMPOUND,
+		.src = NULL,
+		.dst = NULL,
 	};
+	unsigned int order, nr;
+	struct folio *sfolio, *dfolio;
+	struct nouveau_dmem_dma_info dma_info;
+
+	sfolio = page_folio(vmf->page);
+	order = folio_order(sfolio);
+	nr = 1 << order;
 
+	/*
+	 * Handle partial unmap faults, where the folio is large, but
+	 * the pmd is split.
+	 */
+	if (vmf->pte) {
+		order = 0;
+		nr = 1;
+	}
+
+	if (order)
+		args.flags |= MIGRATE_VMA_SELECT_COMPOUND;
+
+	args.start = ALIGN_DOWN(vmf->address, (1 << (PAGE_SHIFT + order)));
+	args.vma = vmf->vma;
+	args.end = args.start + (PAGE_SIZE << order);
+	args.src = kcalloc(nr, sizeof(*args.src), GFP_KERNEL);
+	args.dst = kcalloc(nr, sizeof(*args.dst), GFP_KERNEL);
+
+	if (!args.src || !args.dst) {
+		ret = VM_FAULT_OOM;
+		goto err;
+	}
 	/*
 	 * FIXME what we really want is to find some heuristic to migrate more
 	 * than just one page on CPU fault. When such fault happens it is very
@@ -190,20 +236,26 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf)
 	if (!args.cpages)
 		return 0;
 
-	spage = migrate_pfn_to_page(src);
-	if (!spage || !(src & MIGRATE_PFN_MIGRATE))
-		goto done;
-
-	dpage = alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vmf->vma, vmf->address);
-	if (!dpage)
+	if (order)
+		dpage = folio_page(vma_alloc_folio(GFP_HIGHUSER | __GFP_ZERO,
+					order, vmf->vma, vmf->address), 0);
+	else
+		dpage = alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vmf->vma,
+					vmf->address);
+	if (!dpage) {
+		ret = VM_FAULT_OOM;
 		goto done;
+	}
 
-	dst = migrate_pfn(page_to_pfn(dpage));
+	args.dst[0] = migrate_pfn(page_to_pfn(dpage));
+	if (order)
+		args.dst[0] |= MIGRATE_PFN_COMPOUND;
+	dfolio = page_folio(dpage);
 
-	svmm = spage->zone_device_data;
+	svmm = folio_zone_device_data(sfolio);
 	mutex_lock(&svmm->mutex);
 	nouveau_svmm_invalidate(svmm, args.start, args.end);
-	ret = nouveau_dmem_copy_one(drm, spage, dpage, &dma_addr);
+	ret = nouveau_dmem_copy_folio(drm, sfolio, dfolio, &dma_info);
 	mutex_unlock(&svmm->mutex);
 	if (ret) {
 		ret = VM_FAULT_SIGBUS;
@@ -213,19 +265,33 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf)
 	nouveau_fence_new(&fence, dmem->migrate.chan);
 	migrate_vma_pages(&args);
 	nouveau_dmem_fence_done(&fence);
-	dma_unmap_page(drm->dev->dev, dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
+	dma_unmap_page(drm->dev->dev, dma_info.dma_addr, PAGE_SIZE,
+				DMA_BIDIRECTIONAL);
 done:
 	migrate_vma_finalize(&args);
+err:
+	kfree(args.src);
+	kfree(args.dst);
 	return ret;
 }
 
+static void nouveau_dmem_folio_split(struct folio *head, struct folio *tail)
+{
+	if (tail == NULL)
+		return;
+	tail->pgmap = head->pgmap;
+	folio_set_zone_device_data(tail, folio_zone_device_data(head));
+}
+
 static const struct dev_pagemap_ops nouveau_dmem_pagemap_ops = {
 	.page_free		= nouveau_dmem_page_free,
 	.migrate_to_ram		= nouveau_dmem_migrate_to_ram,
+	.folio_split		= nouveau_dmem_folio_split,
 };
 
 static int
-nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage)
+nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage,
+			 bool is_large)
 {
 	struct nouveau_dmem_chunk *chunk;
 	struct resource *res;
@@ -274,16 +340,21 @@ nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage)
 	pfn_first = chunk->pagemap.range.start >> PAGE_SHIFT;
 	page = pfn_to_page(pfn_first);
 	spin_lock(&drm->dmem->lock);
-	for (i = 0; i < DMEM_CHUNK_NPAGES - 1; ++i, ++page) {
-		page->zone_device_data = drm->dmem->free_pages;
-		drm->dmem->free_pages = page;
+
+	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) || !is_large) {
+		for (i = 0; i < DMEM_CHUNK_NPAGES - 1; ++i, ++page) {
+			page->zone_device_data = drm->dmem->free_pages;
+			drm->dmem->free_pages = page;
+		}
 	}
+
 	*ppage = page;
 	chunk->callocated++;
 	spin_unlock(&drm->dmem->lock);
 
-	NV_INFO(drm, "DMEM: registered %ldMB of device memory\n",
-		DMEM_CHUNK_SIZE >> 20);
+	NV_INFO(drm, "DMEM: registered %ldMB of %sdevice memory %lx %lx\n",
+		DMEM_CHUNK_SIZE >> 20, is_large ? "THP " : "", pfn_first,
+		nouveau_dmem_page_addr(page));
 
 	return 0;
 
@@ -298,27 +369,37 @@ nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage)
 }
 
 static struct page *
-nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
+nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm, bool is_large)
 {
 	struct nouveau_dmem_chunk *chunk;
 	struct page *page = NULL;
+	struct folio *folio = NULL;
 	int ret;
+	unsigned int order = 0;
 
 	spin_lock(&drm->dmem->lock);
-	if (drm->dmem->free_pages) {
+	if (is_large && drm->dmem->free_folios) {
+		folio = drm->dmem->free_folios;
+		drm->dmem->free_folios = folio_zone_device_data(folio);
+		chunk = nouveau_page_to_chunk(page);
+		chunk->callocated++;
+		spin_unlock(&drm->dmem->lock);
+		order = DMEM_CHUNK_NPAGES;
+	} else if (!is_large && drm->dmem->free_pages) {
 		page = drm->dmem->free_pages;
 		drm->dmem->free_pages = page->zone_device_data;
 		chunk = nouveau_page_to_chunk(page);
 		chunk->callocated++;
 		spin_unlock(&drm->dmem->lock);
+		folio = page_folio(page);
 	} else {
 		spin_unlock(&drm->dmem->lock);
-		ret = nouveau_dmem_chunk_alloc(drm, &page);
+		ret = nouveau_dmem_chunk_alloc(drm, &page, is_large);
 		if (ret)
 			return NULL;
 	}
 
-	zone_device_page_init(page);
+	zone_device_folio_init(folio, order);
 	return page;
 }
 
@@ -369,12 +450,12 @@ nouveau_dmem_evict_chunk(struct nouveau_dmem_chunk *chunk)
 {
 	unsigned long i, npages = range_len(&chunk->pagemap.range) >> PAGE_SHIFT;
 	unsigned long *src_pfns, *dst_pfns;
-	dma_addr_t *dma_addrs;
+	struct nouveau_dmem_dma_info *dma_info;
 	struct nouveau_fence *fence;
 
 	src_pfns = kvcalloc(npages, sizeof(*src_pfns), GFP_KERNEL | __GFP_NOFAIL);
 	dst_pfns = kvcalloc(npages, sizeof(*dst_pfns), GFP_KERNEL | __GFP_NOFAIL);
-	dma_addrs = kvcalloc(npages, sizeof(*dma_addrs), GFP_KERNEL | __GFP_NOFAIL);
+	dma_info = kvcalloc(npages, sizeof(*dma_info), GFP_KERNEL | __GFP_NOFAIL);
 
 	migrate_device_range(src_pfns, chunk->pagemap.range.start >> PAGE_SHIFT,
 			npages);
@@ -382,17 +463,28 @@ nouveau_dmem_evict_chunk(struct nouveau_dmem_chunk *chunk)
 	for (i = 0; i < npages; i++) {
 		if (src_pfns[i] & MIGRATE_PFN_MIGRATE) {
 			struct page *dpage;
+			struct folio *folio = page_folio(
+				migrate_pfn_to_page(src_pfns[i]));
+			unsigned int order = folio_order(folio);
+
+			if (src_pfns[i] & MIGRATE_PFN_COMPOUND) {
+				dpage = folio_page(
+						folio_alloc(
+						GFP_HIGHUSER_MOVABLE, order), 0);
+			} else {
+				/*
+				 * _GFP_NOFAIL because the GPU is going away and there
+				 * is nothing sensible we can do if we can't copy the
+				 * data back.
+				 */
+				dpage = alloc_page(GFP_HIGHUSER | __GFP_NOFAIL);
+			}
 
-			/*
-			 * _GFP_NOFAIL because the GPU is going away and there
-			 * is nothing sensible we can do if we can't copy the
-			 * data back.
-			 */
-			dpage = alloc_page(GFP_HIGHUSER | __GFP_NOFAIL);
 			dst_pfns[i] = migrate_pfn(page_to_pfn(dpage));
-			nouveau_dmem_copy_one(chunk->drm,
-					migrate_pfn_to_page(src_pfns[i]), dpage,
-					&dma_addrs[i]);
+			nouveau_dmem_copy_folio(chunk->drm,
+				page_folio(migrate_pfn_to_page(src_pfns[i])),
+				page_folio(dpage),
+				&dma_info[i]);
 		}
 	}
 
@@ -403,8 +495,9 @@ nouveau_dmem_evict_chunk(struct nouveau_dmem_chunk *chunk)
 	kvfree(src_pfns);
 	kvfree(dst_pfns);
 	for (i = 0; i < npages; i++)
-		dma_unmap_page(chunk->drm->dev->dev, dma_addrs[i], PAGE_SIZE, DMA_BIDIRECTIONAL);
-	kvfree(dma_addrs);
+		dma_unmap_page(chunk->drm->dev->dev, dma_info[i].dma_addr,
+				dma_info[i].size, DMA_BIDIRECTIONAL);
+	kvfree(dma_info);
 }
 
 void
@@ -607,31 +700,35 @@ nouveau_dmem_init(struct nouveau_drm *drm)
 
 static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
 		struct nouveau_svmm *svmm, unsigned long src,
-		dma_addr_t *dma_addr, u64 *pfn)
+		struct nouveau_dmem_dma_info *dma_info, u64 *pfn)
 {
 	struct device *dev = drm->dev->dev;
 	struct page *dpage, *spage;
 	unsigned long paddr;
+	bool is_large = false;
 
 	spage = migrate_pfn_to_page(src);
 	if (!(src & MIGRATE_PFN_MIGRATE))
 		goto out;
 
-	dpage = nouveau_dmem_page_alloc_locked(drm);
+	is_large = src & MIGRATE_PFN_COMPOUND;
+	dpage = nouveau_dmem_page_alloc_locked(drm, is_large);
 	if (!dpage)
 		goto out;
 
 	paddr = nouveau_dmem_page_addr(dpage);
 	if (spage) {
-		*dma_addr = dma_map_page(dev, spage, 0, page_size(spage),
+		dma_info->dma_addr = dma_map_page(dev, spage, 0, page_size(spage),
 					 DMA_BIDIRECTIONAL);
-		if (dma_mapping_error(dev, *dma_addr))
+		dma_info->size = page_size(spage);
+		if (dma_mapping_error(dev, dma_info->dma_addr))
 			goto out_free_page;
-		if (drm->dmem->migrate.copy_func(drm, 1,
-			NOUVEAU_APER_VRAM, paddr, NOUVEAU_APER_HOST, *dma_addr))
+		if (drm->dmem->migrate.copy_func(drm, folio_nr_pages(page_folio(spage)),
+			NOUVEAU_APER_VRAM, paddr, NOUVEAU_APER_HOST,
+			dma_info->dma_addr))
 			goto out_dma_unmap;
 	} else {
-		*dma_addr = DMA_MAPPING_ERROR;
+		dma_info->dma_addr = DMA_MAPPING_ERROR;
 		if (drm->dmem->migrate.clear_func(drm, page_size(dpage),
 			NOUVEAU_APER_VRAM, paddr))
 			goto out_free_page;
@@ -645,7 +742,7 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
 	return migrate_pfn(page_to_pfn(dpage));
 
 out_dma_unmap:
-	dma_unmap_page(dev, *dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
+	dma_unmap_page(dev, dma_info->dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
 out_free_page:
 	nouveau_dmem_page_free_locked(drm, dpage);
 out:
@@ -655,27 +752,33 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
 
 static void nouveau_dmem_migrate_chunk(struct nouveau_drm *drm,
 		struct nouveau_svmm *svmm, struct migrate_vma *args,
-		dma_addr_t *dma_addrs, u64 *pfns)
+		struct nouveau_dmem_dma_info *dma_info, u64 *pfns)
 {
 	struct nouveau_fence *fence;
 	unsigned long addr = args->start, nr_dma = 0, i;
+	unsigned long order = 0;
+
+	for (i = 0; addr < args->end; ) {
+		struct folio *folio;
 
-	for (i = 0; addr < args->end; i++) {
+		folio = page_folio(migrate_pfn_to_page(args->dst[i]));
+		order = folio_order(folio);
 		args->dst[i] = nouveau_dmem_migrate_copy_one(drm, svmm,
-				args->src[i], dma_addrs + nr_dma, pfns + i);
-		if (!dma_mapping_error(drm->dev->dev, dma_addrs[nr_dma]))
+				args->src[i], dma_info + nr_dma, pfns + i);
+		if (!dma_mapping_error(drm->dev->dev, dma_info[nr_dma].dma_addr))
 			nr_dma++;
-		addr += PAGE_SIZE;
+		i += 1 << order;
+		addr += (1 << order) * PAGE_SIZE;
 	}
 
 	nouveau_fence_new(&fence, drm->dmem->migrate.chan);
 	migrate_vma_pages(args);
 	nouveau_dmem_fence_done(&fence);
-	nouveau_pfns_map(svmm, args->vma->vm_mm, args->start, pfns, i);
+	nouveau_pfns_map(svmm, args->vma->vm_mm, args->start, pfns, i, order);
 
 	while (nr_dma--) {
-		dma_unmap_page(drm->dev->dev, dma_addrs[nr_dma], PAGE_SIZE,
-				DMA_BIDIRECTIONAL);
+		dma_unmap_page(drm->dev->dev, dma_info[nr_dma].dma_addr,
+				dma_info[nr_dma].size, DMA_BIDIRECTIONAL);
 	}
 	migrate_vma_finalize(args);
 }
@@ -689,20 +792,24 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm,
 {
 	unsigned long npages = (end - start) >> PAGE_SHIFT;
 	unsigned long max = min(SG_MAX_SINGLE_ALLOC, npages);
-	dma_addr_t *dma_addrs;
 	struct migrate_vma args = {
 		.vma		= vma,
 		.start		= start,
 		.pgmap_owner	= drm->dev,
-		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
+		.flags		= MIGRATE_VMA_SELECT_SYSTEM
+				  | MIGRATE_VMA_SELECT_COMPOUND,
 	};
 	unsigned long i;
 	u64 *pfns;
 	int ret = -ENOMEM;
+	struct nouveau_dmem_dma_info *dma_info;
 
 	if (drm->dmem == NULL)
 		return -ENODEV;
 
+	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+		max = max(HPAGE_PMD_NR, max);
+
 	args.src = kcalloc(max, sizeof(*args.src), GFP_KERNEL);
 	if (!args.src)
 		goto out;
@@ -710,8 +817,8 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm,
 	if (!args.dst)
 		goto out_free_src;
 
-	dma_addrs = kmalloc_array(max, sizeof(*dma_addrs), GFP_KERNEL);
-	if (!dma_addrs)
+	dma_info = kmalloc_array(max, sizeof(*dma_info), GFP_KERNEL);
+	if (!dma_info)
 		goto out_free_dst;
 
 	pfns = nouveau_pfns_alloc(max);
@@ -729,7 +836,7 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm,
 			goto out_free_pfns;
 
 		if (args.cpages)
-			nouveau_dmem_migrate_chunk(drm, svmm, &args, dma_addrs,
+			nouveau_dmem_migrate_chunk(drm, svmm, &args, dma_info,
 						   pfns);
 		args.start = args.end;
 	}
@@ -738,7 +845,7 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm,
 out_free_pfns:
 	nouveau_pfns_free(pfns);
 out_free_dma:
-	kfree(dma_addrs);
+	kfree(dma_info);
 out_free_dst:
 	kfree(args.dst);
 out_free_src:
diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
index 6fa387da0637..b8a3378154d5 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -921,12 +921,14 @@ nouveau_pfns_free(u64 *pfns)
 
 void
 nouveau_pfns_map(struct nouveau_svmm *svmm, struct mm_struct *mm,
-		 unsigned long addr, u64 *pfns, unsigned long npages)
+		 unsigned long addr, u64 *pfns, unsigned long npages,
+		 unsigned int page_shift)
 {
 	struct nouveau_pfnmap_args *args = nouveau_pfns_to_args(pfns);
 
 	args->p.addr = addr;
-	args->p.size = npages << PAGE_SHIFT;
+	args->p.size = npages << page_shift;
+	args->p.page = page_shift;
 
 	mutex_lock(&svmm->mutex);
 
diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.h b/drivers/gpu/drm/nouveau/nouveau_svm.h
index e7d63d7f0c2d..3fd78662f17e 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.h
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.h
@@ -33,7 +33,8 @@ void nouveau_svmm_invalidate(struct nouveau_svmm *svmm, u64 start, u64 limit);
 u64 *nouveau_pfns_alloc(unsigned long npages);
 void nouveau_pfns_free(u64 *pfns);
 void nouveau_pfns_map(struct nouveau_svmm *svmm, struct mm_struct *mm,
-		      unsigned long addr, u64 *pfns, unsigned long npages);
+		      unsigned long addr, u64 *pfns, unsigned long npages,
+		      unsigned int page_shift);
 #else /* IS_ENABLED(CONFIG_DRM_NOUVEAU_SVM) */
 static inline void nouveau_svm_init(struct nouveau_drm *drm) {}
 static inline void nouveau_svm_fini(struct nouveau_drm *drm) {}
-- 
2.50.1




* [v3 11/11] selftests/mm/hmm-tests: new throughput tests including THP
  2025-08-12  2:40 [v3 00/11] mm: support device-private THP Balbir Singh
                   ` (9 preceding siblings ...)
  2025-08-12  2:40 ` [v3 10/11] gpu/drm/nouveau: add THP migration support Balbir Singh
@ 2025-08-12  2:40 ` Balbir Singh
  10 siblings, 0 replies; 36+ messages in thread
From: Balbir Singh @ 2025-08-12  2:40 UTC (permalink / raw)
  To: dri-devel, linux-mm, linux-kernel
  Cc: Balbir Singh, Andrew Morton, David Hildenbrand, Zi Yan,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Oscar Salvador, Lorenzo Stoakes, Baolin Wang,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Ralph Campbell, Mika Penttilä, Matthew Brost,
	Francois Dugast

Add benchmark-style tests that measure transfer bandwidth for
zone device memory migrations, with and without THP.
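
Each benchmark times round trips through the existing
hmm_migrate_sys_to_dev()/hmm_migrate_dev_to_sys() helpers and derives
bandwidth from the averaged per-iteration time, roughly as in this
condensed sketch of the helper added below:

	start = get_time_ms();
	ret = hmm_migrate_sys_to_dev(fd, buffer, npages);
	end = get_time_ms();
	s2d_total += end - start;

	/* Average time per iteration (ms) and throughput in GB/s. */
	results->sys_to_dev_time = s2d_total / iterations;
	results->throughput_s2d = (buffer_size / (1024.0 * 1024.0 * 1024.0)) /
				  (results->sys_to_dev_time / 1000.0);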

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Ying Huang <ying.huang@linux.alibaba.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>

Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
 tools/testing/selftests/mm/hmm-tests.c | 197 ++++++++++++++++++++++++-
 1 file changed, 196 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/mm/hmm-tests.c b/tools/testing/selftests/mm/hmm-tests.c
index da3322a1282c..1325de70f44f 100644
--- a/tools/testing/selftests/mm/hmm-tests.c
+++ b/tools/testing/selftests/mm/hmm-tests.c
@@ -25,6 +25,7 @@
 #include <sys/stat.h>
 #include <sys/mman.h>
 #include <sys/ioctl.h>
+#include <sys/time.h>
 
 
 /*
@@ -207,8 +208,10 @@ static void hmm_buffer_free(struct hmm_buffer *buffer)
 	if (buffer == NULL)
 		return;
 
-	if (buffer->ptr)
+	if (buffer->ptr) {
 		munmap(buffer->ptr, buffer->size);
+		buffer->ptr = NULL;
+	}
 	free(buffer->mirror);
 	free(buffer);
 }
@@ -2466,4 +2469,196 @@ TEST_F(hmm, migrate_anon_huge_zero_err)
 	buffer->ptr = old_ptr;
 	hmm_buffer_free(buffer);
 }
+
+struct benchmark_results {
+	double sys_to_dev_time;
+	double dev_to_sys_time;
+	double throughput_s2d;
+	double throughput_d2s;
+};
+
+static double get_time_ms(void)
+{
+	struct timeval tv;
+
+	gettimeofday(&tv, NULL);
+	return (tv.tv_sec * 1000.0) + (tv.tv_usec / 1000.0);
+}
+
+static inline struct hmm_buffer *hmm_buffer_alloc(unsigned long size)
+{
+	struct hmm_buffer *buffer;
+
+	buffer = malloc(sizeof(*buffer));
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	memset(buffer->mirror, 0xFF, size);
+	return buffer;
+}
+
+static void print_benchmark_results(const char *test_name, size_t buffer_size,
+				     struct benchmark_results *thp,
+				     struct benchmark_results *regular)
+{
+	double s2d_improvement = ((regular->sys_to_dev_time - thp->sys_to_dev_time) /
+				 regular->sys_to_dev_time) * 100.0;
+	double d2s_improvement = ((regular->dev_to_sys_time - thp->dev_to_sys_time) /
+				 regular->dev_to_sys_time) * 100.0;
+	double throughput_s2d_improvement = ((thp->throughput_s2d - regular->throughput_s2d) /
+					    regular->throughput_s2d) * 100.0;
+	double throughput_d2s_improvement = ((thp->throughput_d2s - regular->throughput_d2s) /
+					    regular->throughput_d2s) * 100.0;
+
+	printf("\n=== %s (%.1f MB) ===\n", test_name, buffer_size / (1024.0 * 1024.0));
+	printf("                     | With THP        | Without THP     | Improvement\n");
+	printf("---------------------------------------------------------------------\n");
+	printf("Sys->Dev Migration   | %.3f ms        | %.3f ms        | %.1f%%\n",
+	       thp->sys_to_dev_time, regular->sys_to_dev_time, s2d_improvement);
+	printf("Dev->Sys Migration   | %.3f ms        | %.3f ms        | %.1f%%\n",
+	       thp->dev_to_sys_time, regular->dev_to_sys_time, d2s_improvement);
+	printf("S->D Throughput      | %.2f GB/s      | %.2f GB/s      | %.1f%%\n",
+	       thp->throughput_s2d, regular->throughput_s2d, throughput_s2d_improvement);
+	printf("D->S Throughput      | %.2f GB/s      | %.2f GB/s      | %.1f%%\n",
+	       thp->throughput_d2s, regular->throughput_d2s, throughput_d2s_improvement);
+}
+
+/*
+ * Run a single migration benchmark
+ * fd: file descriptor for hmm device
+ * use_thp: whether to use THP
+ * buffer_size: size of buffer to allocate
+ * iterations: number of iterations
+ * results: where to store results
+ */
+static inline int run_migration_benchmark(int fd, int use_thp, size_t buffer_size,
+					   int iterations, struct benchmark_results *results)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages = buffer_size / sysconf(_SC_PAGESIZE);
+	double start, end;
+	double s2d_total = 0, d2s_total = 0;
+	int ret, i;
+	int *ptr;
+
+	buffer = hmm_buffer_alloc(buffer_size);
+
+	/* Map memory */
+	buffer->ptr = mmap(NULL, buffer_size, PROT_READ | PROT_WRITE,
+			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+
+	if (buffer->ptr == MAP_FAILED)
+		return -1;
+
+	/* Apply THP hint if requested */
+	if (use_thp)
+		ret = madvise(buffer->ptr, buffer_size, MADV_HUGEPAGE);
+	else
+		ret = madvise(buffer->ptr, buffer_size, MADV_NOHUGEPAGE);
+
+	if (ret)
+		return ret;
+
+	/* Initialize memory to make sure pages are allocated */
+	ptr = (int *)buffer->ptr;
+	for (i = 0; i < buffer_size / sizeof(int); i++)
+		ptr[i] = i & 0xFF;
+
+	/* Warmup iteration */
+	ret = hmm_migrate_sys_to_dev(fd, buffer, npages);
+	if (ret)
+		return ret;
+
+	ret = hmm_migrate_dev_to_sys(fd, buffer, npages);
+	if (ret)
+		return ret;
+
+	/* Benchmark iterations */
+	for (i = 0; i < iterations; i++) {
+		/* System to device migration */
+		start = get_time_ms();
+
+		ret = hmm_migrate_sys_to_dev(fd, buffer, npages);
+		if (ret)
+			return ret;
+
+		end = get_time_ms();
+		s2d_total += (end - start);
+
+		/* Device to system migration */
+		start = get_time_ms();
+
+		ret = hmm_migrate_dev_to_sys(fd, buffer, npages);
+		if (ret)
+			return ret;
+
+		end = get_time_ms();
+		d2s_total += (end - start);
+	}
+
+	/* Calculate average times and throughput */
+	results->sys_to_dev_time = s2d_total / iterations;
+	results->dev_to_sys_time = d2s_total / iterations;
+	results->throughput_s2d = (buffer_size / (1024.0 * 1024.0 * 1024.0)) /
+				 (results->sys_to_dev_time / 1000.0);
+	results->throughput_d2s = (buffer_size / (1024.0 * 1024.0 * 1024.0)) /
+				 (results->dev_to_sys_time / 1000.0);
+
+	/* Cleanup */
+	hmm_buffer_free(buffer);
+	return 0;
+}
+
+/*
+ * Benchmark THP migration with different buffer sizes
+ */
+TEST_F_TIMEOUT(hmm, benchmark_thp_migration, 120)
+{
+	struct benchmark_results thp_results, regular_results;
+	size_t thp_size = 2 * 1024 * 1024; /* 2MB - typical THP size */
+	int iterations = 5;
+
+	printf("\nHMM THP Migration Benchmark\n");
+	printf("---------------------------\n");
+	printf("System page size: %ld bytes\n", sysconf(_SC_PAGESIZE));
+
+	/* Test different buffer sizes */
+	size_t test_sizes[] = {
+		thp_size / 4,      /* 512KB - smaller than THP */
+		thp_size / 2,      /* 1MB - half THP */
+		thp_size,          /* 2MB - single THP */
+		thp_size * 2,      /* 4MB - two THPs */
+		thp_size * 4,      /* 8MB - four THPs */
+		thp_size * 8,       /* 16MB - eight THPs */
+		thp_size * 128,     /* 256MB - one hundred twenty-eight THPs */
+	};
+
+	static const char *const test_names[] = {
+		"Small Buffer (512KB)",
+		"Half THP Size (1MB)",
+		"Single THP Size (2MB)",
+		"Two THP Size (4MB)",
+		"Four THP Size (8MB)",
+		"Eight THP Size (16MB)",
+		"One twenty eight THP Size (256MB)"
+	};
+
+	int num_tests = ARRAY_SIZE(test_sizes);
+
+	/* Run all tests */
+	for (int i = 0; i < num_tests; i++) {
+		/* Test with THP */
+		ASSERT_EQ(run_migration_benchmark(self->fd, 1, test_sizes[i],
+					iterations, &thp_results), 0);
+
+		/* Test without THP */
+		ASSERT_EQ(run_migration_benchmark(self->fd, 0, test_sizes[i],
+					iterations, &regular_results), 0);
+
+		/* Print results */
+		print_benchmark_results(test_names[i], test_sizes[i],
+					&thp_results, &regular_results);
+	}
+}
 TEST_HARNESS_MAIN
-- 
2.50.1




* Re: [v3 03/11] mm/migrate_device: THP migration of zone device pages
  2025-08-12  2:40 ` [v3 03/11] mm/migrate_device: THP migration of zone device pages Balbir Singh
@ 2025-08-12  5:35   ` Mika Penttilä
  2025-08-12  5:54     ` Matthew Brost
  2025-08-12 23:36     ` Balbir Singh
  0 siblings, 2 replies; 36+ messages in thread
From: Mika Penttilä @ 2025-08-12  5:35 UTC (permalink / raw)
  To: Balbir Singh, dri-devel, linux-mm, linux-kernel
  Cc: Andrew Morton, David Hildenbrand, Zi Yan, Joshua Hahn, Rakie Kim,
	Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
	Oscar Salvador, Lorenzo Stoakes, Baolin Wang, Liam R. Howlett,
	Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter, Ralph Campbell,
	Matthew Brost, Francois Dugast

Hi,

On 8/12/25 05:40, Balbir Singh wrote:

> MIGRATE_VMA_SELECT_COMPOUND will be used to select THP pages during
> migrate_vma_setup() and MIGRATE_PFN_COMPOUND will make migrating
> device pages as compound pages during device pfn migration.
>
> migrate_device code paths go through the collect, setup
> and finalize phases of migration.
>
> The entries in src and dst arrays passed to these functions still
> remain at a PAGE_SIZE granularity. When a compound page is passed,
> the first entry has the PFN along with MIGRATE_PFN_COMPOUND
> and other flags set (MIGRATE_PFN_MIGRATE, MIGRATE_PFN_VALID), the
> remaining entries (HPAGE_PMD_NR - 1) are filled with 0's. This
> representation allows for the compound page to be split into smaller
> page sizes.
>
> migrate_vma_collect_hole(), migrate_vma_collect_pmd() are now THP
> page aware. Two new helper functions migrate_vma_collect_huge_pmd()
> and migrate_vma_insert_huge_pmd_page() have been added.
>
> migrate_vma_collect_huge_pmd() can collect THP pages, but if for
> some reason this fails, there is fallback support to split the folio
> and migrate it.
>
> migrate_vma_insert_huge_pmd_page() closely follows the logic of
> migrate_vma_insert_page()
>
> Support for splitting pages as needed for migration will follow in
> later patches in this series.
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
> Cc: Rakie Kim <rakie.kim@sk.com>
> Cc: Byungchul Park <byungchul@sk.com>
> Cc: Gregory Price <gourry@gourry.net>
> Cc: Ying Huang <ying.huang@linux.alibaba.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Nico Pache <npache@redhat.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Dev Jain <dev.jain@arm.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Cc: Mika Penttilä <mpenttil@redhat.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Francois Dugast <francois.dugast@intel.com>
>
> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> ---
>  include/linux/migrate.h |   2 +
>  mm/migrate_device.c     | 457 ++++++++++++++++++++++++++++++++++------
>  2 files changed, 396 insertions(+), 63 deletions(-)
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index acadd41e0b5c..d9cef0819f91 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -129,6 +129,7 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
>  #define MIGRATE_PFN_VALID	(1UL << 0)
>  #define MIGRATE_PFN_MIGRATE	(1UL << 1)
>  #define MIGRATE_PFN_WRITE	(1UL << 3)
> +#define MIGRATE_PFN_COMPOUND	(1UL << 4)
>  #define MIGRATE_PFN_SHIFT	6
>  
>  static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
> @@ -147,6 +148,7 @@ enum migrate_vma_direction {
>  	MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
>  	MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
>  	MIGRATE_VMA_SELECT_DEVICE_COHERENT = 1 << 2,
> +	MIGRATE_VMA_SELECT_COMPOUND = 1 << 3,
>  };
>  
>  struct migrate_vma {
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 0ed337f94fcd..6621bba62710 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -14,6 +14,7 @@
>  #include <linux/pagewalk.h>
>  #include <linux/rmap.h>
>  #include <linux/swapops.h>
> +#include <asm/pgalloc.h>
>  #include <asm/tlbflush.h>
>  #include "internal.h"
>  
> @@ -44,6 +45,23 @@ static int migrate_vma_collect_hole(unsigned long start,
>  	if (!vma_is_anonymous(walk->vma))
>  		return migrate_vma_collect_skip(start, end, walk);
>  
> +	if (thp_migration_supported() &&
> +		(migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
> +		(IS_ALIGNED(start, HPAGE_PMD_SIZE) &&
> +		 IS_ALIGNED(end, HPAGE_PMD_SIZE))) {
> +		migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE |
> +						MIGRATE_PFN_COMPOUND;
> +		migrate->dst[migrate->npages] = 0;
> +		migrate->npages++;
> +		migrate->cpages++;
> +
> +		/*
> +		 * Collect the remaining entries as holes, in case we
> +		 * need to split later
> +		 */
> +		return migrate_vma_collect_skip(start + PAGE_SIZE, end, walk);
> +	}
> +
>  	for (addr = start; addr < end; addr += PAGE_SIZE) {
>  		migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE;
>  		migrate->dst[migrate->npages] = 0;
> @@ -54,57 +72,151 @@ static int migrate_vma_collect_hole(unsigned long start,
>  	return 0;
>  }
>  
> -static int migrate_vma_collect_pmd(pmd_t *pmdp,
> -				   unsigned long start,
> -				   unsigned long end,
> -				   struct mm_walk *walk)
> +/**
> + * migrate_vma_collect_huge_pmd - collect THP pages without splitting the
> + * folio for device private pages.
> + * @pmdp: pointer to pmd entry
> + * @start: start address of the range for migration
> + * @end: end address of the range for migration
> + * @walk: mm_walk callback structure
> + *
> + * Collect the huge pmd entry at @pmdp for migration and set the
> + * MIGRATE_PFN_COMPOUND flag in the migrate src entry to indicate that
> + * migration will occur at HPAGE_PMD granularity
> + */
> +static int migrate_vma_collect_huge_pmd(pmd_t *pmdp, unsigned long start,
> +					unsigned long end, struct mm_walk *walk,
> +					struct folio *fault_folio)
>  {
> +	struct mm_struct *mm = walk->mm;
> +	struct folio *folio;
>  	struct migrate_vma *migrate = walk->private;
> -	struct folio *fault_folio = migrate->fault_page ?
> -		page_folio(migrate->fault_page) : NULL;
> -	struct vm_area_struct *vma = walk->vma;
> -	struct mm_struct *mm = vma->vm_mm;
> -	unsigned long addr = start, unmapped = 0;
>  	spinlock_t *ptl;
> -	pte_t *ptep;
> +	swp_entry_t entry;
> +	int ret;
> +	unsigned long write = 0;
>  
> -again:
> -	if (pmd_none(*pmdp))
> +	ptl = pmd_lock(mm, pmdp);
> +	if (pmd_none(*pmdp)) {
> +		spin_unlock(ptl);
>  		return migrate_vma_collect_hole(start, end, -1, walk);
> +	}
>  
>  	if (pmd_trans_huge(*pmdp)) {
> -		struct folio *folio;
> -
> -		ptl = pmd_lock(mm, pmdp);
> -		if (unlikely(!pmd_trans_huge(*pmdp))) {
> +		if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)) {
>  			spin_unlock(ptl);
> -			goto again;
> +			return migrate_vma_collect_skip(start, end, walk);
>  		}
>  
>  		folio = pmd_folio(*pmdp);
>  		if (is_huge_zero_folio(folio)) {
>  			spin_unlock(ptl);
> -			split_huge_pmd(vma, pmdp, addr);
> -		} else {
> -			int ret;
> +			return migrate_vma_collect_hole(start, end, -1, walk);
> +		}
> +		if (pmd_write(*pmdp))
> +			write = MIGRATE_PFN_WRITE;
> +	} else if (!pmd_present(*pmdp)) {
> +		entry = pmd_to_swp_entry(*pmdp);
> +		folio = pfn_swap_entry_folio(entry);
> +
> +		if (!is_device_private_entry(entry) ||
> +			!(migrate->flags & MIGRATE_VMA_SELECT_DEVICE_PRIVATE) ||
> +			(folio->pgmap->owner != migrate->pgmap_owner)) {
> +			spin_unlock(ptl);
> +			return migrate_vma_collect_skip(start, end, walk);
> +		}
>  
> -			folio_get(folio);
> +		if (is_migration_entry(entry)) {
> +			migration_entry_wait_on_locked(entry, ptl);
>  			spin_unlock(ptl);
> -			/* FIXME: we don't expect THP for fault_folio */
> -			if (WARN_ON_ONCE(fault_folio == folio))
> -				return migrate_vma_collect_skip(start, end,
> -								walk);
> -			if (unlikely(!folio_trylock(folio)))
> -				return migrate_vma_collect_skip(start, end,
> -								walk);
> -			ret = split_folio(folio);
> -			if (fault_folio != folio)
> -				folio_unlock(folio);
> -			folio_put(folio);
> -			if (ret)
> -				return migrate_vma_collect_skip(start, end,
> -								walk);
> +			return -EAGAIN;
>  		}
> +
> +		if (is_writable_device_private_entry(entry))
> +			write = MIGRATE_PFN_WRITE;
> +	} else {
> +		spin_unlock(ptl);
> +		return -EAGAIN;
> +	}
> +
> +	folio_get(folio);
> +	if (folio != fault_folio && unlikely(!folio_trylock(folio))) {
> +		spin_unlock(ptl);
> +		folio_put(folio);
> +		return migrate_vma_collect_skip(start, end, walk);
> +	}
> +
> +	if (thp_migration_supported() &&
> +		(migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
> +		(IS_ALIGNED(start, HPAGE_PMD_SIZE) &&
> +		 IS_ALIGNED(end, HPAGE_PMD_SIZE))) {
> +
> +		struct page_vma_mapped_walk pvmw = {
> +			.ptl = ptl,
> +			.address = start,
> +			.pmd = pmdp,
> +			.vma = walk->vma,
> +		};
> +
> +		unsigned long pfn = page_to_pfn(folio_page(folio, 0));
> +
> +		migrate->src[migrate->npages] = migrate_pfn(pfn) | write
> +						| MIGRATE_PFN_MIGRATE
> +						| MIGRATE_PFN_COMPOUND;
> +		migrate->dst[migrate->npages++] = 0;
> +		migrate->cpages++;
> +		ret = set_pmd_migration_entry(&pvmw, folio_page(folio, 0));
> +		if (ret) {
> +			migrate->npages--;
> +			migrate->cpages--;
> +			migrate->src[migrate->npages] = 0;
> +			migrate->dst[migrate->npages] = 0;
> +			goto fallback;
> +		}
> +		migrate_vma_collect_skip(start + PAGE_SIZE, end, walk);
> +		spin_unlock(ptl);
> +		return 0;
> +	}
> +
> +fallback:
> +	spin_unlock(ptl);
> +	if (!folio_test_large(folio))
> +		goto done;
> +	ret = split_folio(folio);
> +	if (fault_folio != folio)
> +		folio_unlock(folio);
> +	folio_put(folio);
> +	if (ret)
> +		return migrate_vma_collect_skip(start, end, walk);
> +	if (pmd_none(pmdp_get_lockless(pmdp)))
> +		return migrate_vma_collect_hole(start, end, -1, walk);
> +
> +done:
> +	return -ENOENT;
> +}
> +
> +static int migrate_vma_collect_pmd(pmd_t *pmdp,
> +				   unsigned long start,
> +				   unsigned long end,
> +				   struct mm_walk *walk)
> +{
> +	struct migrate_vma *migrate = walk->private;
> +	struct vm_area_struct *vma = walk->vma;
> +	struct mm_struct *mm = vma->vm_mm;
> +	unsigned long addr = start, unmapped = 0;
> +	spinlock_t *ptl;
> +	struct folio *fault_folio = migrate->fault_page ?
> +		page_folio(migrate->fault_page) : NULL;
> +	pte_t *ptep;
> +
> +again:
> +	if (pmd_trans_huge(*pmdp) || !pmd_present(*pmdp)) {
> +		int ret = migrate_vma_collect_huge_pmd(pmdp, start, end, walk, fault_folio);
> +
> +		if (ret == -EAGAIN)
> +			goto again;
> +		if (ret == 0)
> +			return 0;
>  	}
>  
>  	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> @@ -222,8 +334,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  			mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
>  		}
>  
> -		/* FIXME support THP */
> -		if (!page || !page->mapping || PageTransCompound(page)) {
> +		if (!page || !page->mapping) {
>  			mpfn = 0;
>  			goto next;
>  		}
> @@ -394,14 +505,6 @@ static bool migrate_vma_check_page(struct page *page, struct page *fault_page)
>  	 */
>  	int extra = 1 + (page == fault_page);
>  
> -	/*
> -	 * FIXME support THP (transparent huge page), it is bit more complex to
> -	 * check them than regular pages, because they can be mapped with a pmd
> -	 * or with a pte (split pte mapping).
> -	 */
> -	if (folio_test_large(folio))
> -		return false;
> -

You cannot remove this check unless migrating normal mTHP folios to the device
is supported, which I think this series doesn't do, but maybe it should?

--Mika




* Re: [v3 03/11] mm/migrate_device: THP migration of zone device pages
  2025-08-12  5:35   ` Mika Penttilä
@ 2025-08-12  5:54     ` Matthew Brost
  2025-08-12  6:18       ` Matthew Brost
  2025-08-12  6:25       ` Mika Penttilä
  2025-08-12 23:36     ` Balbir Singh
  1 sibling, 2 replies; 36+ messages in thread
From: Matthew Brost @ 2025-08-12  5:54 UTC (permalink / raw)
  To: Mika Penttilä
  Cc: Balbir Singh, dri-devel, linux-mm, linux-kernel, Andrew Morton,
	David Hildenbrand, Zi Yan, Joshua Hahn, Rakie Kim, Byungchul Park,
	Gregory Price, Ying Huang, Alistair Popple, Oscar Salvador,
	Lorenzo Stoakes, Baolin Wang, Liam R. Howlett, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Ralph Campbell, Francois Dugast

On Tue, Aug 12, 2025 at 08:35:49AM +0300, Mika Penttilä wrote:
> Hi,
> 
> On 8/12/25 05:40, Balbir Singh wrote:
> 
> > MIGRATE_VMA_SELECT_COMPOUND will be used to select THP pages during
> > migrate_vma_setup() and MIGRATE_PFN_COMPOUND will make migrating
> > device pages as compound pages during device pfn migration.
> >
> > migrate_device code paths go through the collect, setup
> > and finalize phases of migration.
> >
> > The entries in src and dst arrays passed to these functions still
> > remain at a PAGE_SIZE granularity. When a compound page is passed,
> > the first entry has the PFN along with MIGRATE_PFN_COMPOUND
> > and other flags set (MIGRATE_PFN_MIGRATE, MIGRATE_PFN_VALID), the
> > remaining entries (HPAGE_PMD_NR - 1) are filled with 0's. This
> > representation allows for the compound page to be split into smaller
> > page sizes.
> >
> > migrate_vma_collect_hole(), migrate_vma_collect_pmd() are now THP
> > page aware. Two new helper functions migrate_vma_collect_huge_pmd()
> > and migrate_vma_insert_huge_pmd_page() have been added.
> >
> > migrate_vma_collect_huge_pmd() can collect THP pages, but if for
> > some reason this fails, there is fallback support to split the folio
> > and migrate it.
> >
> > migrate_vma_insert_huge_pmd_page() closely follows the logic of
> > migrate_vma_insert_page()
> >
> > Support for splitting pages as needed for migration will follow in
> > later patches in this series.
> >
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: Zi Yan <ziy@nvidia.com>
> > Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
> > Cc: Rakie Kim <rakie.kim@sk.com>
> > Cc: Byungchul Park <byungchul@sk.com>
> > Cc: Gregory Price <gourry@gourry.net>
> > Cc: Ying Huang <ying.huang@linux.alibaba.com>
> > Cc: Alistair Popple <apopple@nvidia.com>
> > Cc: Oscar Salvador <osalvador@suse.de>
> > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> > Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> > Cc: Nico Pache <npache@redhat.com>
> > Cc: Ryan Roberts <ryan.roberts@arm.com>
> > Cc: Dev Jain <dev.jain@arm.com>
> > Cc: Barry Song <baohua@kernel.org>
> > Cc: Lyude Paul <lyude@redhat.com>
> > Cc: Danilo Krummrich <dakr@kernel.org>
> > Cc: David Airlie <airlied@gmail.com>
> > Cc: Simona Vetter <simona@ffwll.ch>
> > Cc: Ralph Campbell <rcampbell@nvidia.com>
> > Cc: Mika Penttilä <mpenttil@redhat.com>
> > Cc: Matthew Brost <matthew.brost@intel.com>
> > Cc: Francois Dugast <francois.dugast@intel.com>
> >
> > Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> > ---
> >  include/linux/migrate.h |   2 +
> >  mm/migrate_device.c     | 457 ++++++++++++++++++++++++++++++++++------
> >  2 files changed, 396 insertions(+), 63 deletions(-)
> >
> > diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> > index acadd41e0b5c..d9cef0819f91 100644
> > --- a/include/linux/migrate.h
> > +++ b/include/linux/migrate.h
> > @@ -129,6 +129,7 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
> >  #define MIGRATE_PFN_VALID	(1UL << 0)
> >  #define MIGRATE_PFN_MIGRATE	(1UL << 1)
> >  #define MIGRATE_PFN_WRITE	(1UL << 3)
> > +#define MIGRATE_PFN_COMPOUND	(1UL << 4)
> >  #define MIGRATE_PFN_SHIFT	6
> >  
> >  static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
> > @@ -147,6 +148,7 @@ enum migrate_vma_direction {
> >  	MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
> >  	MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
> >  	MIGRATE_VMA_SELECT_DEVICE_COHERENT = 1 << 2,
> > +	MIGRATE_VMA_SELECT_COMPOUND = 1 << 3,
> >  };
> >  
> >  struct migrate_vma {
> > diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> > index 0ed337f94fcd..6621bba62710 100644
> > --- a/mm/migrate_device.c
> > +++ b/mm/migrate_device.c
> > @@ -14,6 +14,7 @@
> >  #include <linux/pagewalk.h>
> >  #include <linux/rmap.h>
> >  #include <linux/swapops.h>
> > +#include <asm/pgalloc.h>
> >  #include <asm/tlbflush.h>
> >  #include "internal.h"
> >  
> > @@ -44,6 +45,23 @@ static int migrate_vma_collect_hole(unsigned long start,
> >  	if (!vma_is_anonymous(walk->vma))
> >  		return migrate_vma_collect_skip(start, end, walk);
> >  
> > +	if (thp_migration_supported() &&
> > +		(migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
> > +		(IS_ALIGNED(start, HPAGE_PMD_SIZE) &&
> > +		 IS_ALIGNED(end, HPAGE_PMD_SIZE))) {
> > +		migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE |
> > +						MIGRATE_PFN_COMPOUND;
> > +		migrate->dst[migrate->npages] = 0;
> > +		migrate->npages++;
> > +		migrate->cpages++;
> > +
> > +		/*
> > +		 * Collect the remaining entries as holes, in case we
> > +		 * need to split later
> > +		 */
> > +		return migrate_vma_collect_skip(start + PAGE_SIZE, end, walk);
> > +	}
> > +
> >  	for (addr = start; addr < end; addr += PAGE_SIZE) {
> >  		migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE;
> >  		migrate->dst[migrate->npages] = 0;
> > @@ -54,57 +72,151 @@ static int migrate_vma_collect_hole(unsigned long start,
> >  	return 0;
> >  }
> >  
> > -static int migrate_vma_collect_pmd(pmd_t *pmdp,
> > -				   unsigned long start,
> > -				   unsigned long end,
> > -				   struct mm_walk *walk)
> > +/**
> > + * migrate_vma_collect_huge_pmd - collect THP pages without splitting the
> > + * folio for device private pages.
> > + * @pmdp: pointer to pmd entry
> > + * @start: start address of the range for migration
> > + * @end: end address of the range for migration
> > + * @walk: mm_walk callback structure
> > + *
> > + * Collect the huge pmd entry at @pmdp for migration and set the
> > + * MIGRATE_PFN_COMPOUND flag in the migrate src entry to indicate that
> > + * migration will occur at HPAGE_PMD granularity
> > + */
> > +static int migrate_vma_collect_huge_pmd(pmd_t *pmdp, unsigned long start,
> > +					unsigned long end, struct mm_walk *walk,
> > +					struct folio *fault_folio)
> >  {
> > +	struct mm_struct *mm = walk->mm;
> > +	struct folio *folio;
> >  	struct migrate_vma *migrate = walk->private;
> > -	struct folio *fault_folio = migrate->fault_page ?
> > -		page_folio(migrate->fault_page) : NULL;
> > -	struct vm_area_struct *vma = walk->vma;
> > -	struct mm_struct *mm = vma->vm_mm;
> > -	unsigned long addr = start, unmapped = 0;
> >  	spinlock_t *ptl;
> > -	pte_t *ptep;
> > +	swp_entry_t entry;
> > +	int ret;
> > +	unsigned long write = 0;
> >  
> > -again:
> > -	if (pmd_none(*pmdp))
> > +	ptl = pmd_lock(mm, pmdp);
> > +	if (pmd_none(*pmdp)) {
> > +		spin_unlock(ptl);
> >  		return migrate_vma_collect_hole(start, end, -1, walk);
> > +	}
> >  
> >  	if (pmd_trans_huge(*pmdp)) {
> > -		struct folio *folio;
> > -
> > -		ptl = pmd_lock(mm, pmdp);
> > -		if (unlikely(!pmd_trans_huge(*pmdp))) {
> > +		if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)) {
> >  			spin_unlock(ptl);
> > -			goto again;
> > +			return migrate_vma_collect_skip(start, end, walk);
> >  		}
> >  
> >  		folio = pmd_folio(*pmdp);
> >  		if (is_huge_zero_folio(folio)) {
> >  			spin_unlock(ptl);
> > -			split_huge_pmd(vma, pmdp, addr);
> > -		} else {
> > -			int ret;
> > +			return migrate_vma_collect_hole(start, end, -1, walk);
> > +		}
> > +		if (pmd_write(*pmdp))
> > +			write = MIGRATE_PFN_WRITE;
> > +	} else if (!pmd_present(*pmdp)) {
> > +		entry = pmd_to_swp_entry(*pmdp);
> > +		folio = pfn_swap_entry_folio(entry);
> > +
> > +		if (!is_device_private_entry(entry) ||
> > +			!(migrate->flags & MIGRATE_VMA_SELECT_DEVICE_PRIVATE) ||
> > +			(folio->pgmap->owner != migrate->pgmap_owner)) {
> > +			spin_unlock(ptl);
> > +			return migrate_vma_collect_skip(start, end, walk);
> > +		}
> >  
> > -			folio_get(folio);
> > +		if (is_migration_entry(entry)) {
> > +			migration_entry_wait_on_locked(entry, ptl);
> >  			spin_unlock(ptl);
> > -			/* FIXME: we don't expect THP for fault_folio */
> > -			if (WARN_ON_ONCE(fault_folio == folio))
> > -				return migrate_vma_collect_skip(start, end,
> > -								walk);
> > -			if (unlikely(!folio_trylock(folio)))
> > -				return migrate_vma_collect_skip(start, end,
> > -								walk);
> > -			ret = split_folio(folio);
> > -			if (fault_folio != folio)
> > -				folio_unlock(folio);
> > -			folio_put(folio);
> > -			if (ret)
> > -				return migrate_vma_collect_skip(start, end,
> > -								walk);
> > +			return -EAGAIN;
> >  		}
> > +
> > +		if (is_writable_device_private_entry(entry))
> > +			write = MIGRATE_PFN_WRITE;
> > +	} else {
> > +		spin_unlock(ptl);
> > +		return -EAGAIN;
> > +	}
> > +
> > +	folio_get(folio);
> > +	if (folio != fault_folio && unlikely(!folio_trylock(folio))) {
> > +		spin_unlock(ptl);
> > +		folio_put(folio);
> > +		return migrate_vma_collect_skip(start, end, walk);
> > +	}
> > +
> > +	if (thp_migration_supported() &&
> > +		(migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
> > +		(IS_ALIGNED(start, HPAGE_PMD_SIZE) &&
> > +		 IS_ALIGNED(end, HPAGE_PMD_SIZE))) {
> > +
> > +		struct page_vma_mapped_walk pvmw = {
> > +			.ptl = ptl,
> > +			.address = start,
> > +			.pmd = pmdp,
> > +			.vma = walk->vma,
> > +		};
> > +
> > +		unsigned long pfn = page_to_pfn(folio_page(folio, 0));
> > +
> > +		migrate->src[migrate->npages] = migrate_pfn(pfn) | write
> > +						| MIGRATE_PFN_MIGRATE
> > +						| MIGRATE_PFN_COMPOUND;
> > +		migrate->dst[migrate->npages++] = 0;
> > +		migrate->cpages++;
> > +		ret = set_pmd_migration_entry(&pvmw, folio_page(folio, 0));
> > +		if (ret) {
> > +			migrate->npages--;
> > +			migrate->cpages--;
> > +			migrate->src[migrate->npages] = 0;
> > +			migrate->dst[migrate->npages] = 0;
> > +			goto fallback;
> > +		}
> > +		migrate_vma_collect_skip(start + PAGE_SIZE, end, walk);
> > +		spin_unlock(ptl);
> > +		return 0;
> > +	}
> > +
> > +fallback:
> > +	spin_unlock(ptl);
> > +	if (!folio_test_large(folio))
> > +		goto done;
> > +	ret = split_folio(folio);
> > +	if (fault_folio != folio)
> > +		folio_unlock(folio);
> > +	folio_put(folio);
> > +	if (ret)
> > +		return migrate_vma_collect_skip(start, end, walk);
> > +	if (pmd_none(pmdp_get_lockless(pmdp)))
> > +		return migrate_vma_collect_hole(start, end, -1, walk);
> > +
> > +done:
> > +	return -ENOENT;
> > +}
> > +
> > +static int migrate_vma_collect_pmd(pmd_t *pmdp,
> > +				   unsigned long start,
> > +				   unsigned long end,
> > +				   struct mm_walk *walk)
> > +{
> > +	struct migrate_vma *migrate = walk->private;
> > +	struct vm_area_struct *vma = walk->vma;
> > +	struct mm_struct *mm = vma->vm_mm;
> > +	unsigned long addr = start, unmapped = 0;
> > +	spinlock_t *ptl;
> > +	struct folio *fault_folio = migrate->fault_page ?
> > +		page_folio(migrate->fault_page) : NULL;
> > +	pte_t *ptep;
> > +
> > +again:
> > +	if (pmd_trans_huge(*pmdp) || !pmd_present(*pmdp)) {
> > +		int ret = migrate_vma_collect_huge_pmd(pmdp, start, end, walk, fault_folio);
> > +
> > +		if (ret == -EAGAIN)
> > +			goto again;
> > +		if (ret == 0)
> > +			return 0;
> >  	}
> >  
> >  	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> > @@ -222,8 +334,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >  			mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
> >  		}
> >  
> > -		/* FIXME support THP */
> > -		if (!page || !page->mapping || PageTransCompound(page)) {
> > +		if (!page || !page->mapping) {
> >  			mpfn = 0;
> >  			goto next;
> >  		}
> > @@ -394,14 +505,6 @@ static bool migrate_vma_check_page(struct page *page, struct page *fault_page)
> >  	 */
> >  	int extra = 1 + (page == fault_page);
> >  
> > -	/*
> > -	 * FIXME support THP (transparent huge page), it is bit more complex to
> > -	 * check them than regular pages, because they can be mapped with a pmd
> > -	 * or with a pte (split pte mapping).
> > -	 */
> > -	if (folio_test_large(folio))
> > -		return false;
> > -
> 
> You cannot remove this check unless migrating normal mTHP folios to the device is supported,
> which I think this series doesn't do, but maybe should?
> 

Currently, mTHP should be split upon collection, right? The only way a
THP should be collected is if it directly maps to a PMD. If a THP or
mTHP is found in PTEs (i.e., in migrate_vma_collect_pmd outside of
migrate_vma_collect_huge_pmd), it should be split there. I sent this
logic to Balbir privately, but it appears to have been omitted.
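
Roughly something like this in the PTE loop of migrate_vma_collect_pmd(),
in place of just zeroing the mpfn (untested sketch from memory; the
lazy-MMU / deferred TLB flush bookkeeping in that loop is glossed over):

		if (page && PageTransCompound(page)) {
			struct folio *folio = page_folio(page);
			int ret;

			folio_get(folio);
			pte_unmap_unlock(ptep, ptl);
			if (folio != fault_folio && !folio_trylock(folio)) {
				folio_put(folio);
				return migrate_vma_collect_skip(addr, end, walk);
			}
			ret = split_folio(folio);
			if (folio != fault_folio)
				folio_unlock(folio);
			folio_put(folio);
			if (ret)
				return migrate_vma_collect_skip(addr, end, walk);
			/* the split leaves the PTEs in place, redo this address */
			goto again;
		}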

I’m quite sure this missing split is actually an upstream bug, but it
has been suppressed by PMDs being split upon device fault. I have a test
that performs a ton of complete mremap—nonsense no one would normally
do, but which should work—that exposed this. I can rebase on this series
and see if the bug appears, or try the same nonsense without the device
faulting first and splitting the pages, to trigger the bug.

Matt

> --Mika
> 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [v3 03/11] mm/migrate_device: THP migration of zone device pages
  2025-08-12  5:54     ` Matthew Brost
@ 2025-08-12  6:18       ` Matthew Brost
  2025-08-12  6:25       ` Mika Penttilä
  1 sibling, 0 replies; 36+ messages in thread
From: Matthew Brost @ 2025-08-12  6:18 UTC (permalink / raw)
  To: Mika Penttilä
  Cc: Balbir Singh, dri-devel, linux-mm, linux-kernel, Andrew Morton,
	David Hildenbrand, Zi Yan, Joshua Hahn, Rakie Kim, Byungchul Park,
	Gregory Price, Ying Huang, Alistair Popple, Oscar Salvador,
	Lorenzo Stoakes, Baolin Wang, Liam R. Howlett, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Ralph Campbell, Francois Dugast

On Mon, Aug 11, 2025 at 10:54:04PM -0700, Matthew Brost wrote:
> On Tue, Aug 12, 2025 at 08:35:49AM +0300, Mika Penttilä wrote:
> > Hi,
> > 
> > On 8/12/25 05:40, Balbir Singh wrote:
> > 
> > > MIGRATE_VMA_SELECT_COMPOUND will be used to select THP pages during
> > > migrate_vma_setup() and MIGRATE_PFN_COMPOUND will make migrating
> > > device pages as compound pages during device pfn migration.
> > >
> > > migrate_device code paths go through the collect, setup
> > > and finalize phases of migration.
> > >
> > > The entries in src and dst arrays passed to these functions still
> > > remain at a PAGE_SIZE granularity. When a compound page is passed,
> > > the first entry has the PFN along with MIGRATE_PFN_COMPOUND
> > > and other flags set (MIGRATE_PFN_MIGRATE, MIGRATE_PFN_VALID), the
> > > remaining entries (HPAGE_PMD_NR - 1) are filled with 0's. This
> > > representation allows for the compound page to be split into smaller
> > > page sizes.
> > >
> > > migrate_vma_collect_hole(), migrate_vma_collect_pmd() are now THP
> > > page aware. Two new helper functions migrate_vma_collect_huge_pmd()
> > > and migrate_vma_insert_huge_pmd_page() have been added.
> > >
> > > migrate_vma_collect_huge_pmd() can collect THP pages, but if for
> > > some reason this fails, there is fallback support to split the folio
> > > and migrate it.
> > >
> > > migrate_vma_insert_huge_pmd_page() closely follows the logic of
> > > migrate_vma_insert_page()
> > >
> > > Support for splitting pages as needed for migration will follow in
> > > later patches in this series.
> > >
> > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > Cc: David Hildenbrand <david@redhat.com>
> > > Cc: Zi Yan <ziy@nvidia.com>
> > > Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
> > > Cc: Rakie Kim <rakie.kim@sk.com>
> > > Cc: Byungchul Park <byungchul@sk.com>
> > > Cc: Gregory Price <gourry@gourry.net>
> > > Cc: Ying Huang <ying.huang@linux.alibaba.com>
> > > Cc: Alistair Popple <apopple@nvidia.com>
> > > Cc: Oscar Salvador <osalvador@suse.de>
> > > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> > > Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> > > Cc: Nico Pache <npache@redhat.com>
> > > Cc: Ryan Roberts <ryan.roberts@arm.com>
> > > Cc: Dev Jain <dev.jain@arm.com>
> > > Cc: Barry Song <baohua@kernel.org>
> > > Cc: Lyude Paul <lyude@redhat.com>
> > > Cc: Danilo Krummrich <dakr@kernel.org>
> > > Cc: David Airlie <airlied@gmail.com>
> > > Cc: Simona Vetter <simona@ffwll.ch>
> > > Cc: Ralph Campbell <rcampbell@nvidia.com>
> > > Cc: Mika Penttilä <mpenttil@redhat.com>
> > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > Cc: Francois Dugast <francois.dugast@intel.com>
> > >
> > > Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> > > ---
> > >  include/linux/migrate.h |   2 +
> > >  mm/migrate_device.c     | 457 ++++++++++++++++++++++++++++++++++------
> > >  2 files changed, 396 insertions(+), 63 deletions(-)
> > >
> > > diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> > > index acadd41e0b5c..d9cef0819f91 100644
> > > --- a/include/linux/migrate.h
> > > +++ b/include/linux/migrate.h
> > > @@ -129,6 +129,7 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
> > >  #define MIGRATE_PFN_VALID	(1UL << 0)
> > >  #define MIGRATE_PFN_MIGRATE	(1UL << 1)
> > >  #define MIGRATE_PFN_WRITE	(1UL << 3)
> > > +#define MIGRATE_PFN_COMPOUND	(1UL << 4)
> > >  #define MIGRATE_PFN_SHIFT	6
> > >  
> > >  static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
> > > @@ -147,6 +148,7 @@ enum migrate_vma_direction {
> > >  	MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
> > >  	MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
> > >  	MIGRATE_VMA_SELECT_DEVICE_COHERENT = 1 << 2,
> > > +	MIGRATE_VMA_SELECT_COMPOUND = 1 << 3,
> > >  };
> > >  
> > >  struct migrate_vma {
> > > diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> > > index 0ed337f94fcd..6621bba62710 100644
> > > --- a/mm/migrate_device.c
> > > +++ b/mm/migrate_device.c
> > > @@ -14,6 +14,7 @@
> > >  #include <linux/pagewalk.h>
> > >  #include <linux/rmap.h>
> > >  #include <linux/swapops.h>
> > > +#include <asm/pgalloc.h>
> > >  #include <asm/tlbflush.h>
> > >  #include "internal.h"
> > >  
> > > @@ -44,6 +45,23 @@ static int migrate_vma_collect_hole(unsigned long start,
> > >  	if (!vma_is_anonymous(walk->vma))
> > >  		return migrate_vma_collect_skip(start, end, walk);
> > >  
> > > +	if (thp_migration_supported() &&
> > > +		(migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
> > > +		(IS_ALIGNED(start, HPAGE_PMD_SIZE) &&
> > > +		 IS_ALIGNED(end, HPAGE_PMD_SIZE))) {
> > > +		migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE |
> > > +						MIGRATE_PFN_COMPOUND;
> > > +		migrate->dst[migrate->npages] = 0;
> > > +		migrate->npages++;
> > > +		migrate->cpages++;
> > > +
> > > +		/*
> > > +		 * Collect the remaining entries as holes, in case we
> > > +		 * need to split later
> > > +		 */
> > > +		return migrate_vma_collect_skip(start + PAGE_SIZE, end, walk);
> > > +	}
> > > +
> > >  	for (addr = start; addr < end; addr += PAGE_SIZE) {
> > >  		migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE;
> > >  		migrate->dst[migrate->npages] = 0;
> > > @@ -54,57 +72,151 @@ static int migrate_vma_collect_hole(unsigned long start,
> > >  	return 0;
> > >  }
> > >  
> > > -static int migrate_vma_collect_pmd(pmd_t *pmdp,
> > > -				   unsigned long start,
> > > -				   unsigned long end,
> > > -				   struct mm_walk *walk)
> > > +/**
> > > + * migrate_vma_collect_huge_pmd - collect THP pages without splitting the
> > > + * folio for device private pages.
> > > + * @pmdp: pointer to pmd entry
> > > + * @start: start address of the range for migration
> > > + * @end: end address of the range for migration
> > > + * @walk: mm_walk callback structure
> > > + *
> > > + * Collect the huge pmd entry at @pmdp for migration and set the
> > > + * MIGRATE_PFN_COMPOUND flag in the migrate src entry to indicate that
> > > + * migration will occur at HPAGE_PMD granularity
> > > + */
> > > +static int migrate_vma_collect_huge_pmd(pmd_t *pmdp, unsigned long start,
> > > +					unsigned long end, struct mm_walk *walk,
> > > +					struct folio *fault_folio)
> > >  {
> > > +	struct mm_struct *mm = walk->mm;
> > > +	struct folio *folio;
> > >  	struct migrate_vma *migrate = walk->private;
> > > -	struct folio *fault_folio = migrate->fault_page ?
> > > -		page_folio(migrate->fault_page) : NULL;
> > > -	struct vm_area_struct *vma = walk->vma;
> > > -	struct mm_struct *mm = vma->vm_mm;
> > > -	unsigned long addr = start, unmapped = 0;
> > >  	spinlock_t *ptl;
> > > -	pte_t *ptep;
> > > +	swp_entry_t entry;
> > > +	int ret;
> > > +	unsigned long write = 0;
> > >  
> > > -again:
> > > -	if (pmd_none(*pmdp))
> > > +	ptl = pmd_lock(mm, pmdp);
> > > +	if (pmd_none(*pmdp)) {
> > > +		spin_unlock(ptl);
> > >  		return migrate_vma_collect_hole(start, end, -1, walk);
> > > +	}
> > >  
> > >  	if (pmd_trans_huge(*pmdp)) {
> > > -		struct folio *folio;
> > > -
> > > -		ptl = pmd_lock(mm, pmdp);
> > > -		if (unlikely(!pmd_trans_huge(*pmdp))) {
> > > +		if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)) {
> > >  			spin_unlock(ptl);
> > > -			goto again;
> > > +			return migrate_vma_collect_skip(start, end, walk);
> > >  		}
> > >  
> > >  		folio = pmd_folio(*pmdp);
> > >  		if (is_huge_zero_folio(folio)) {
> > >  			spin_unlock(ptl);
> > > -			split_huge_pmd(vma, pmdp, addr);
> > > -		} else {
> > > -			int ret;
> > > +			return migrate_vma_collect_hole(start, end, -1, walk);
> > > +		}
> > > +		if (pmd_write(*pmdp))
> > > +			write = MIGRATE_PFN_WRITE;
> > > +	} else if (!pmd_present(*pmdp)) {
> > > +		entry = pmd_to_swp_entry(*pmdp);
> > > +		folio = pfn_swap_entry_folio(entry);
> > > +
> > > +		if (!is_device_private_entry(entry) ||
> > > +			!(migrate->flags & MIGRATE_VMA_SELECT_DEVICE_PRIVATE) ||
> > > +			(folio->pgmap->owner != migrate->pgmap_owner)) {
> > > +			spin_unlock(ptl);
> > > +			return migrate_vma_collect_skip(start, end, walk);
> > > +		}
> > >  
> > > -			folio_get(folio);
> > > +		if (is_migration_entry(entry)) {
> > > +			migration_entry_wait_on_locked(entry, ptl);
> > >  			spin_unlock(ptl);
> > > -			/* FIXME: we don't expect THP for fault_folio */
> > > -			if (WARN_ON_ONCE(fault_folio == folio))
> > > -				return migrate_vma_collect_skip(start, end,
> > > -								walk);
> > > -			if (unlikely(!folio_trylock(folio)))
> > > -				return migrate_vma_collect_skip(start, end,
> > > -								walk);
> > > -			ret = split_folio(folio);
> > > -			if (fault_folio != folio)
> > > -				folio_unlock(folio);
> > > -			folio_put(folio);
> > > -			if (ret)
> > > -				return migrate_vma_collect_skip(start, end,
> > > -								walk);
> > > +			return -EAGAIN;
> > >  		}
> > > +
> > > +		if (is_writable_device_private_entry(entry))
> > > +			write = MIGRATE_PFN_WRITE;
> > > +	} else {
> > > +		spin_unlock(ptl);
> > > +		return -EAGAIN;
> > > +	}
> > > +
> > > +	folio_get(folio);
> > > +	if (folio != fault_folio && unlikely(!folio_trylock(folio))) {
> > > +		spin_unlock(ptl);
> > > +		folio_put(folio);
> > > +		return migrate_vma_collect_skip(start, end, walk);
> > > +	}
> > > +
> > > +	if (thp_migration_supported() &&
> > > +		(migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
> > > +		(IS_ALIGNED(start, HPAGE_PMD_SIZE) &&
> > > +		 IS_ALIGNED(end, HPAGE_PMD_SIZE))) {
> > > +
> > > +		struct page_vma_mapped_walk pvmw = {
> > > +			.ptl = ptl,
> > > +			.address = start,
> > > +			.pmd = pmdp,
> > > +			.vma = walk->vma,
> > > +		};
> > > +
> > > +		unsigned long pfn = page_to_pfn(folio_page(folio, 0));
> > > +
> > > +		migrate->src[migrate->npages] = migrate_pfn(pfn) | write
> > > +						| MIGRATE_PFN_MIGRATE
> > > +						| MIGRATE_PFN_COMPOUND;
> > > +		migrate->dst[migrate->npages++] = 0;
> > > +		migrate->cpages++;
> > > +		ret = set_pmd_migration_entry(&pvmw, folio_page(folio, 0));
> > > +		if (ret) {
> > > +			migrate->npages--;
> > > +			migrate->cpages--;
> > > +			migrate->src[migrate->npages] = 0;
> > > +			migrate->dst[migrate->npages] = 0;
> > > +			goto fallback;
> > > +		}
> > > +		migrate_vma_collect_skip(start + PAGE_SIZE, end, walk);
> > > +		spin_unlock(ptl);
> > > +		return 0;
> > > +	}
> > > +
> > > +fallback:
> > > +	spin_unlock(ptl);
> > > +	if (!folio_test_large(folio))
> > > +		goto done;
> > > +	ret = split_folio(folio);
> > > +	if (fault_folio != folio)
> > > +		folio_unlock(folio);
> > > +	folio_put(folio);
> > > +	if (ret)
> > > +		return migrate_vma_collect_skip(start, end, walk);
> > > +	if (pmd_none(pmdp_get_lockless(pmdp)))
> > > +		return migrate_vma_collect_hole(start, end, -1, walk);
> > > +
> > > +done:
> > > +	return -ENOENT;
> > > +}
> > > +
> > > +static int migrate_vma_collect_pmd(pmd_t *pmdp,
> > > +				   unsigned long start,
> > > +				   unsigned long end,
> > > +				   struct mm_walk *walk)
> > > +{
> > > +	struct migrate_vma *migrate = walk->private;
> > > +	struct vm_area_struct *vma = walk->vma;
> > > +	struct mm_struct *mm = vma->vm_mm;
> > > +	unsigned long addr = start, unmapped = 0;
> > > +	spinlock_t *ptl;
> > > +	struct folio *fault_folio = migrate->fault_page ?
> > > +		page_folio(migrate->fault_page) : NULL;
> > > +	pte_t *ptep;
> > > +
> > > +again:
> > > +	if (pmd_trans_huge(*pmdp) || !pmd_present(*pmdp)) {
> > > +		int ret = migrate_vma_collect_huge_pmd(pmdp, start, end, walk, fault_folio);
> > > +
> > > +		if (ret == -EAGAIN)
> > > +			goto again;
> > > +		if (ret == 0)
> > > +			return 0;
> > >  	}
> > >  
> > >  	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> > > @@ -222,8 +334,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> > >  			mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
> > >  		}
> > >  
> > > -		/* FIXME support THP */
> > > -		if (!page || !page->mapping || PageTransCompound(page)) {
> > > +		if (!page || !page->mapping) {
> > >  			mpfn = 0;
> > >  			goto next;
> > >  		}
> > > @@ -394,14 +505,6 @@ static bool migrate_vma_check_page(struct page *page, struct page *fault_page)
> > >  	 */
> > >  	int extra = 1 + (page == fault_page);
> > >  
> > > -	/*
> > > -	 * FIXME support THP (transparent huge page), it is bit more complex to
> > > -	 * check them than regular pages, because they can be mapped with a pmd
> > > -	 * or with a pte (split pte mapping).
> > > -	 */
> > > -	if (folio_test_large(folio))
> > > -		return false;
> > > -
> > 
> > You cannot remove this check unless migrating normal mTHP folios to the device is supported,
> > which I think this series doesn't do, but maybe should?

I also agree that, eventually, what I detail below (collecting mTHP or
THPs in PTEs) should be supported without a split, albeit enabled by a
different migrate_vma_direction flag than the one introduced in this series.

Reasoning for a different flag: handling of mTHP/THP at the upper layers
(drivers) will have to be adjusted if the THP doesn't align to a PMD.
Fairly minor, but the upper layers should retain control over whether
mTHP/THP that are not PMD-aligned get collected.
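
For the sake of discussion, something like the below (name purely
hypothetical, not part of this series) next to the existing selection
flags:

enum migrate_vma_direction {
	MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
	MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
	MIGRATE_VMA_SELECT_DEVICE_COHERENT = 1 << 2,
	MIGRATE_VMA_SELECT_COMPOUND = 1 << 3,
	/* hypothetical: also collect PTE-mapped (m)THP without splitting */
	MIGRATE_VMA_SELECT_PTE_COMPOUND = 1 << 4,
};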

I also just noticed an interface problem — migrate_vma does not define
flags as migrate_vma_direction, but I digress.

Matt

> > 
> 
> Currently, mTHP should be split upon collection, right? The only way a
> THP should be collected is if it directly maps to a PMD. If a THP or
> mTHP is found in PTEs (i.e., in migrate_vma_collect_pmd outside of
> migrate_vma_collect_huge_pmd), it should be split there. I sent this
> logic to Balbir privately, but it appears to have been omitted.
> 
> I’m quite sure this missing split is actually an upstream bug, but it
> has been suppressed by PMDs being split upon device fault. I have a test
> that performs a ton of complete mremap—nonsense no one would normally
> do, but which should work—that exposed this. I can rebase on this series
> and see if the bug appears, or try the same nonsense without the device
> faulting first and splitting the pages, to trigger the bug.
> 
> Matt
> 
> > --Mika
> > 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [v3 03/11] mm/migrate_device: THP migration of zone device pages
  2025-08-12  5:54     ` Matthew Brost
  2025-08-12  6:18       ` Matthew Brost
@ 2025-08-12  6:25       ` Mika Penttilä
  2025-08-12  6:33         ` Matthew Brost
  1 sibling, 1 reply; 36+ messages in thread
From: Mika Penttilä @ 2025-08-12  6:25 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Balbir Singh, dri-devel, linux-mm, linux-kernel, Andrew Morton,
	David Hildenbrand, Zi Yan, Joshua Hahn, Rakie Kim, Byungchul Park,
	Gregory Price, Ying Huang, Alistair Popple, Oscar Salvador,
	Lorenzo Stoakes, Baolin Wang, Liam R. Howlett, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Ralph Campbell, Francois Dugast


On 8/12/25 08:54, Matthew Brost wrote:

> On Tue, Aug 12, 2025 at 08:35:49AM +0300, Mika Penttilä wrote:
>> Hi,
>>
>> On 8/12/25 05:40, Balbir Singh wrote:
>>
>>> MIGRATE_VMA_SELECT_COMPOUND will be used to select THP pages during
>>> migrate_vma_setup() and MIGRATE_PFN_COMPOUND will make migrating
>>> device pages as compound pages during device pfn migration.
>>>
>>> migrate_device code paths go through the collect, setup
>>> and finalize phases of migration.
>>>
>>> The entries in src and dst arrays passed to these functions still
>>> remain at a PAGE_SIZE granularity. When a compound page is passed,
>>> the first entry has the PFN along with MIGRATE_PFN_COMPOUND
>>> and other flags set (MIGRATE_PFN_MIGRATE, MIGRATE_PFN_VALID), the
>>> remaining entries (HPAGE_PMD_NR - 1) are filled with 0's. This
>>> representation allows for the compound page to be split into smaller
>>> page sizes.
>>>
>>> migrate_vma_collect_hole(), migrate_vma_collect_pmd() are now THP
>>> page aware. Two new helper functions migrate_vma_collect_huge_pmd()
>>> and migrate_vma_insert_huge_pmd_page() have been added.
>>>
>>> migrate_vma_collect_huge_pmd() can collect THP pages, but if for
>>> some reason this fails, there is fallback support to split the folio
>>> and migrate it.
>>>
>>> migrate_vma_insert_huge_pmd_page() closely follows the logic of
>>> migrate_vma_insert_page()
>>>
>>> Support for splitting pages as needed for migration will follow in
>>> later patches in this series.
>>>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: David Hildenbrand <david@redhat.com>
>>> Cc: Zi Yan <ziy@nvidia.com>
>>> Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
>>> Cc: Rakie Kim <rakie.kim@sk.com>
>>> Cc: Byungchul Park <byungchul@sk.com>
>>> Cc: Gregory Price <gourry@gourry.net>
>>> Cc: Ying Huang <ying.huang@linux.alibaba.com>
>>> Cc: Alistair Popple <apopple@nvidia.com>
>>> Cc: Oscar Salvador <osalvador@suse.de>
>>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
>>> Cc: Nico Pache <npache@redhat.com>
>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>> Cc: Dev Jain <dev.jain@arm.com>
>>> Cc: Barry Song <baohua@kernel.org>
>>> Cc: Lyude Paul <lyude@redhat.com>
>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>> Cc: David Airlie <airlied@gmail.com>
>>> Cc: Simona Vetter <simona@ffwll.ch>
>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>
>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>> ---
>>>  include/linux/migrate.h |   2 +
>>>  mm/migrate_device.c     | 457 ++++++++++++++++++++++++++++++++++------
>>>  2 files changed, 396 insertions(+), 63 deletions(-)
>>>
>>> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
>>> index acadd41e0b5c..d9cef0819f91 100644
>>> --- a/include/linux/migrate.h
>>> +++ b/include/linux/migrate.h
>>> @@ -129,6 +129,7 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
>>>  #define MIGRATE_PFN_VALID	(1UL << 0)
>>>  #define MIGRATE_PFN_MIGRATE	(1UL << 1)
>>>  #define MIGRATE_PFN_WRITE	(1UL << 3)
>>> +#define MIGRATE_PFN_COMPOUND	(1UL << 4)
>>>  #define MIGRATE_PFN_SHIFT	6
>>>  
>>>  static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
>>> @@ -147,6 +148,7 @@ enum migrate_vma_direction {
>>>  	MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
>>>  	MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
>>>  	MIGRATE_VMA_SELECT_DEVICE_COHERENT = 1 << 2,
>>> +	MIGRATE_VMA_SELECT_COMPOUND = 1 << 3,
>>>  };
>>>  
>>>  struct migrate_vma {
>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>> index 0ed337f94fcd..6621bba62710 100644
>>> --- a/mm/migrate_device.c
>>> +++ b/mm/migrate_device.c
>>> @@ -14,6 +14,7 @@
>>>  #include <linux/pagewalk.h>
>>>  #include <linux/rmap.h>
>>>  #include <linux/swapops.h>
>>> +#include <asm/pgalloc.h>
>>>  #include <asm/tlbflush.h>
>>>  #include "internal.h"
>>>  
>>> @@ -44,6 +45,23 @@ static int migrate_vma_collect_hole(unsigned long start,
>>>  	if (!vma_is_anonymous(walk->vma))
>>>  		return migrate_vma_collect_skip(start, end, walk);
>>>  
>>> +	if (thp_migration_supported() &&
>>> +		(migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
>>> +		(IS_ALIGNED(start, HPAGE_PMD_SIZE) &&
>>> +		 IS_ALIGNED(end, HPAGE_PMD_SIZE))) {
>>> +		migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE |
>>> +						MIGRATE_PFN_COMPOUND;
>>> +		migrate->dst[migrate->npages] = 0;
>>> +		migrate->npages++;
>>> +		migrate->cpages++;
>>> +
>>> +		/*
>>> +		 * Collect the remaining entries as holes, in case we
>>> +		 * need to split later
>>> +		 */
>>> +		return migrate_vma_collect_skip(start + PAGE_SIZE, end, walk);
>>> +	}
>>> +
>>>  	for (addr = start; addr < end; addr += PAGE_SIZE) {
>>>  		migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE;
>>>  		migrate->dst[migrate->npages] = 0;
>>> @@ -54,57 +72,151 @@ static int migrate_vma_collect_hole(unsigned long start,
>>>  	return 0;
>>>  }
>>>  
>>> -static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>> -				   unsigned long start,
>>> -				   unsigned long end,
>>> -				   struct mm_walk *walk)
>>> +/**
>>> + * migrate_vma_collect_huge_pmd - collect THP pages without splitting the
>>> + * folio for device private pages.
>>> + * @pmdp: pointer to pmd entry
>>> + * @start: start address of the range for migration
>>> + * @end: end address of the range for migration
>>> + * @walk: mm_walk callback structure
>>> + *
>>> + * Collect the huge pmd entry at @pmdp for migration and set the
>>> + * MIGRATE_PFN_COMPOUND flag in the migrate src entry to indicate that
>>> + * migration will occur at HPAGE_PMD granularity
>>> + */
>>> +static int migrate_vma_collect_huge_pmd(pmd_t *pmdp, unsigned long start,
>>> +					unsigned long end, struct mm_walk *walk,
>>> +					struct folio *fault_folio)
>>>  {
>>> +	struct mm_struct *mm = walk->mm;
>>> +	struct folio *folio;
>>>  	struct migrate_vma *migrate = walk->private;
>>> -	struct folio *fault_folio = migrate->fault_page ?
>>> -		page_folio(migrate->fault_page) : NULL;
>>> -	struct vm_area_struct *vma = walk->vma;
>>> -	struct mm_struct *mm = vma->vm_mm;
>>> -	unsigned long addr = start, unmapped = 0;
>>>  	spinlock_t *ptl;
>>> -	pte_t *ptep;
>>> +	swp_entry_t entry;
>>> +	int ret;
>>> +	unsigned long write = 0;
>>>  
>>> -again:
>>> -	if (pmd_none(*pmdp))
>>> +	ptl = pmd_lock(mm, pmdp);
>>> +	if (pmd_none(*pmdp)) {
>>> +		spin_unlock(ptl);
>>>  		return migrate_vma_collect_hole(start, end, -1, walk);
>>> +	}
>>>  
>>>  	if (pmd_trans_huge(*pmdp)) {
>>> -		struct folio *folio;
>>> -
>>> -		ptl = pmd_lock(mm, pmdp);
>>> -		if (unlikely(!pmd_trans_huge(*pmdp))) {
>>> +		if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)) {
>>>  			spin_unlock(ptl);
>>> -			goto again;
>>> +			return migrate_vma_collect_skip(start, end, walk);
>>>  		}
>>>  
>>>  		folio = pmd_folio(*pmdp);
>>>  		if (is_huge_zero_folio(folio)) {
>>>  			spin_unlock(ptl);
>>> -			split_huge_pmd(vma, pmdp, addr);
>>> -		} else {
>>> -			int ret;
>>> +			return migrate_vma_collect_hole(start, end, -1, walk);
>>> +		}
>>> +		if (pmd_write(*pmdp))
>>> +			write = MIGRATE_PFN_WRITE;
>>> +	} else if (!pmd_present(*pmdp)) {
>>> +		entry = pmd_to_swp_entry(*pmdp);
>>> +		folio = pfn_swap_entry_folio(entry);
>>> +
>>> +		if (!is_device_private_entry(entry) ||
>>> +			!(migrate->flags & MIGRATE_VMA_SELECT_DEVICE_PRIVATE) ||
>>> +			(folio->pgmap->owner != migrate->pgmap_owner)) {
>>> +			spin_unlock(ptl);
>>> +			return migrate_vma_collect_skip(start, end, walk);
>>> +		}
>>>  
>>> -			folio_get(folio);
>>> +		if (is_migration_entry(entry)) {
>>> +			migration_entry_wait_on_locked(entry, ptl);
>>>  			spin_unlock(ptl);
>>> -			/* FIXME: we don't expect THP for fault_folio */
>>> -			if (WARN_ON_ONCE(fault_folio == folio))
>>> -				return migrate_vma_collect_skip(start, end,
>>> -								walk);
>>> -			if (unlikely(!folio_trylock(folio)))
>>> -				return migrate_vma_collect_skip(start, end,
>>> -								walk);
>>> -			ret = split_folio(folio);
>>> -			if (fault_folio != folio)
>>> -				folio_unlock(folio);
>>> -			folio_put(folio);
>>> -			if (ret)
>>> -				return migrate_vma_collect_skip(start, end,
>>> -								walk);
>>> +			return -EAGAIN;
>>>  		}
>>> +
>>> +		if (is_writable_device_private_entry(entry))
>>> +			write = MIGRATE_PFN_WRITE;
>>> +	} else {
>>> +		spin_unlock(ptl);
>>> +		return -EAGAIN;
>>> +	}
>>> +
>>> +	folio_get(folio);
>>> +	if (folio != fault_folio && unlikely(!folio_trylock(folio))) {
>>> +		spin_unlock(ptl);
>>> +		folio_put(folio);
>>> +		return migrate_vma_collect_skip(start, end, walk);
>>> +	}
>>> +
>>> +	if (thp_migration_supported() &&
>>> +		(migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
>>> +		(IS_ALIGNED(start, HPAGE_PMD_SIZE) &&
>>> +		 IS_ALIGNED(end, HPAGE_PMD_SIZE))) {
>>> +
>>> +		struct page_vma_mapped_walk pvmw = {
>>> +			.ptl = ptl,
>>> +			.address = start,
>>> +			.pmd = pmdp,
>>> +			.vma = walk->vma,
>>> +		};
>>> +
>>> +		unsigned long pfn = page_to_pfn(folio_page(folio, 0));
>>> +
>>> +		migrate->src[migrate->npages] = migrate_pfn(pfn) | write
>>> +						| MIGRATE_PFN_MIGRATE
>>> +						| MIGRATE_PFN_COMPOUND;
>>> +		migrate->dst[migrate->npages++] = 0;
>>> +		migrate->cpages++;
>>> +		ret = set_pmd_migration_entry(&pvmw, folio_page(folio, 0));
>>> +		if (ret) {
>>> +			migrate->npages--;
>>> +			migrate->cpages--;
>>> +			migrate->src[migrate->npages] = 0;
>>> +			migrate->dst[migrate->npages] = 0;
>>> +			goto fallback;
>>> +		}
>>> +		migrate_vma_collect_skip(start + PAGE_SIZE, end, walk);
>>> +		spin_unlock(ptl);
>>> +		return 0;
>>> +	}
>>> +
>>> +fallback:
>>> +	spin_unlock(ptl);
>>> +	if (!folio_test_large(folio))
>>> +		goto done;
>>> +	ret = split_folio(folio);
>>> +	if (fault_folio != folio)
>>> +		folio_unlock(folio);
>>> +	folio_put(folio);
>>> +	if (ret)
>>> +		return migrate_vma_collect_skip(start, end, walk);
>>> +	if (pmd_none(pmdp_get_lockless(pmdp)))
>>> +		return migrate_vma_collect_hole(start, end, -1, walk);
>>> +
>>> +done:
>>> +	return -ENOENT;
>>> +}
>>> +
>>> +static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>> +				   unsigned long start,
>>> +				   unsigned long end,
>>> +				   struct mm_walk *walk)
>>> +{
>>> +	struct migrate_vma *migrate = walk->private;
>>> +	struct vm_area_struct *vma = walk->vma;
>>> +	struct mm_struct *mm = vma->vm_mm;
>>> +	unsigned long addr = start, unmapped = 0;
>>> +	spinlock_t *ptl;
>>> +	struct folio *fault_folio = migrate->fault_page ?
>>> +		page_folio(migrate->fault_page) : NULL;
>>> +	pte_t *ptep;
>>> +
>>> +again:
>>> +	if (pmd_trans_huge(*pmdp) || !pmd_present(*pmdp)) {
>>> +		int ret = migrate_vma_collect_huge_pmd(pmdp, start, end, walk, fault_folio);
>>> +
>>> +		if (ret == -EAGAIN)
>>> +			goto again;
>>> +		if (ret == 0)
>>> +			return 0;
>>>  	}
>>>  
>>>  	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>>> @@ -222,8 +334,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>  			mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
>>>  		}
>>>  
>>> -		/* FIXME support THP */
>>> -		if (!page || !page->mapping || PageTransCompound(page)) {
>>> +		if (!page || !page->mapping) {
>>>  			mpfn = 0;
>>>  			goto next;
>>>  		}
>>> @@ -394,14 +505,6 @@ static bool migrate_vma_check_page(struct page *page, struct page *fault_page)
>>>  	 */
>>>  	int extra = 1 + (page == fault_page);
>>>  
>>> -	/*
>>> -	 * FIXME support THP (transparent huge page), it is bit more complex to
>>> -	 * check them than regular pages, because they can be mapped with a pmd
>>> -	 * or with a pte (split pte mapping).
>>> -	 */
>>> -	if (folio_test_large(folio))
>>> -		return false;
>>> -
>> You cannot remove this check unless migrating normal mTHP folios to the device is supported,
>> which I think this series doesn't do, but maybe should?
>>
> Currently, mTHP should be split upon collection, right? The only way a
> THP should be collected is if it directly maps to a PMD. If a THP or
> mTHP is found in PTEs (i.e., in migrate_vma_collect_pmd outside of
> migrate_vma_collect_huge_pmd), it should be split there. I sent this
> logic to Balbir privately, but it appears to have been omitted.

I think currently, if an mTHP is found mapped by PTEs, the folio just isn't migrated.
Yes, maybe they should just be split when collected for now. Best would of course
be to migrate them (like as order-0 pages for the device) so as not to split all mTHPs.
And yes, maybe this should all be controlled by a different flag.

> I’m quite sure this missing split is actually an upstream bug, but it
> has been suppressed by PMDs being split upon device fault. I have a test
> that performs a ton of complete mremap—nonsense no one would normally
> do, but which should work—that exposed this. I can rebase on this series
> and see if the bug appears, or try the same nonsense without the device
> faulting first and splitting the pages, to trigger the bug.
>
> Matt
>
>> --Mika
>>
--Mika



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [v3 03/11] mm/migrate_device: THP migration of zone device pages
  2025-08-12  6:25       ` Mika Penttilä
@ 2025-08-12  6:33         ` Matthew Brost
  2025-08-12  6:37           ` Mika Penttilä
  0 siblings, 1 reply; 36+ messages in thread
From: Matthew Brost @ 2025-08-12  6:33 UTC (permalink / raw)
  To: Mika Penttilä
  Cc: Balbir Singh, dri-devel, linux-mm, linux-kernel, Andrew Morton,
	David Hildenbrand, Zi Yan, Joshua Hahn, Rakie Kim, Byungchul Park,
	Gregory Price, Ying Huang, Alistair Popple, Oscar Salvador,
	Lorenzo Stoakes, Baolin Wang, Liam R. Howlett, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Ralph Campbell, Francois Dugast

On Tue, Aug 12, 2025 at 09:25:29AM +0300, Mika Penttilä wrote:
> 
> On 8/12/25 08:54, Matthew Brost wrote:
> 
> > On Tue, Aug 12, 2025 at 08:35:49AM +0300, Mika Penttilä wrote:
> >> Hi,
> >>
> >> On 8/12/25 05:40, Balbir Singh wrote:
> >>
> >>> MIGRATE_VMA_SELECT_COMPOUND will be used to select THP pages during
> >>> migrate_vma_setup() and MIGRATE_PFN_COMPOUND will make migrating
> >>> device pages as compound pages during device pfn migration.
> >>>
> >>> migrate_device code paths go through the collect, setup
> >>> and finalize phases of migration.
> >>>
> >>> The entries in src and dst arrays passed to these functions still
> >>> remain at a PAGE_SIZE granularity. When a compound page is passed,
> >>> the first entry has the PFN along with MIGRATE_PFN_COMPOUND
> >>> and other flags set (MIGRATE_PFN_MIGRATE, MIGRATE_PFN_VALID), the
> >>> remaining entries (HPAGE_PMD_NR - 1) are filled with 0's. This
> >>> representation allows for the compound page to be split into smaller
> >>> page sizes.
> >>>
> >>> migrate_vma_collect_hole(), migrate_vma_collect_pmd() are now THP
> >>> page aware. Two new helper functions migrate_vma_collect_huge_pmd()
> >>> and migrate_vma_insert_huge_pmd_page() have been added.
> >>>
> >>> migrate_vma_collect_huge_pmd() can collect THP pages, but if for
> >>> some reason this fails, there is fallback support to split the folio
> >>> and migrate it.
> >>>
> >>> migrate_vma_insert_huge_pmd_page() closely follows the logic of
> >>> migrate_vma_insert_page()
> >>>
> >>> Support for splitting pages as needed for migration will follow in
> >>> later patches in this series.
> >>>
> >>> Cc: Andrew Morton <akpm@linux-foundation.org>
> >>> Cc: David Hildenbrand <david@redhat.com>
> >>> Cc: Zi Yan <ziy@nvidia.com>
> >>> Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
> >>> Cc: Rakie Kim <rakie.kim@sk.com>
> >>> Cc: Byungchul Park <byungchul@sk.com>
> >>> Cc: Gregory Price <gourry@gourry.net>
> >>> Cc: Ying Huang <ying.huang@linux.alibaba.com>
> >>> Cc: Alistair Popple <apopple@nvidia.com>
> >>> Cc: Oscar Salvador <osalvador@suse.de>
> >>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> >>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> >>> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> >>> Cc: Nico Pache <npache@redhat.com>
> >>> Cc: Ryan Roberts <ryan.roberts@arm.com>
> >>> Cc: Dev Jain <dev.jain@arm.com>
> >>> Cc: Barry Song <baohua@kernel.org>
> >>> Cc: Lyude Paul <lyude@redhat.com>
> >>> Cc: Danilo Krummrich <dakr@kernel.org>
> >>> Cc: David Airlie <airlied@gmail.com>
> >>> Cc: Simona Vetter <simona@ffwll.ch>
> >>> Cc: Ralph Campbell <rcampbell@nvidia.com>
> >>> Cc: Mika Penttilä <mpenttil@redhat.com>
> >>> Cc: Matthew Brost <matthew.brost@intel.com>
> >>> Cc: Francois Dugast <francois.dugast@intel.com>
> >>>
> >>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> >>> ---
> >>>  include/linux/migrate.h |   2 +
> >>>  mm/migrate_device.c     | 457 ++++++++++++++++++++++++++++++++++------
> >>>  2 files changed, 396 insertions(+), 63 deletions(-)
> >>>
> >>> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> >>> index acadd41e0b5c..d9cef0819f91 100644
> >>> --- a/include/linux/migrate.h
> >>> +++ b/include/linux/migrate.h
> >>> @@ -129,6 +129,7 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
> >>>  #define MIGRATE_PFN_VALID	(1UL << 0)
> >>>  #define MIGRATE_PFN_MIGRATE	(1UL << 1)
> >>>  #define MIGRATE_PFN_WRITE	(1UL << 3)
> >>> +#define MIGRATE_PFN_COMPOUND	(1UL << 4)
> >>>  #define MIGRATE_PFN_SHIFT	6
> >>>  
> >>>  static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
> >>> @@ -147,6 +148,7 @@ enum migrate_vma_direction {
> >>>  	MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
> >>>  	MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
> >>>  	MIGRATE_VMA_SELECT_DEVICE_COHERENT = 1 << 2,
> >>> +	MIGRATE_VMA_SELECT_COMPOUND = 1 << 3,
> >>>  };
> >>>  
> >>>  struct migrate_vma {
> >>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> >>> index 0ed337f94fcd..6621bba62710 100644
> >>> --- a/mm/migrate_device.c
> >>> +++ b/mm/migrate_device.c
> >>> @@ -14,6 +14,7 @@
> >>>  #include <linux/pagewalk.h>
> >>>  #include <linux/rmap.h>
> >>>  #include <linux/swapops.h>
> >>> +#include <asm/pgalloc.h>
> >>>  #include <asm/tlbflush.h>
> >>>  #include "internal.h"
> >>>  
> >>> @@ -44,6 +45,23 @@ static int migrate_vma_collect_hole(unsigned long start,
> >>>  	if (!vma_is_anonymous(walk->vma))
> >>>  		return migrate_vma_collect_skip(start, end, walk);
> >>>  
> >>> +	if (thp_migration_supported() &&
> >>> +		(migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
> >>> +		(IS_ALIGNED(start, HPAGE_PMD_SIZE) &&
> >>> +		 IS_ALIGNED(end, HPAGE_PMD_SIZE))) {
> >>> +		migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE |
> >>> +						MIGRATE_PFN_COMPOUND;
> >>> +		migrate->dst[migrate->npages] = 0;
> >>> +		migrate->npages++;
> >>> +		migrate->cpages++;
> >>> +
> >>> +		/*
> >>> +		 * Collect the remaining entries as holes, in case we
> >>> +		 * need to split later
> >>> +		 */
> >>> +		return migrate_vma_collect_skip(start + PAGE_SIZE, end, walk);
> >>> +	}
> >>> +
> >>>  	for (addr = start; addr < end; addr += PAGE_SIZE) {
> >>>  		migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE;
> >>>  		migrate->dst[migrate->npages] = 0;
> >>> @@ -54,57 +72,151 @@ static int migrate_vma_collect_hole(unsigned long start,
> >>>  	return 0;
> >>>  }
> >>>  
> >>> -static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >>> -				   unsigned long start,
> >>> -				   unsigned long end,
> >>> -				   struct mm_walk *walk)
> >>> +/**
> >>> + * migrate_vma_collect_huge_pmd - collect THP pages without splitting the
> >>> + * folio for device private pages.
> >>> + * @pmdp: pointer to pmd entry
> >>> + * @start: start address of the range for migration
> >>> + * @end: end address of the range for migration
> >>> + * @walk: mm_walk callback structure
> >>> + *
> >>> + * Collect the huge pmd entry at @pmdp for migration and set the
> >>> + * MIGRATE_PFN_COMPOUND flag in the migrate src entry to indicate that
> >>> + * migration will occur at HPAGE_PMD granularity
> >>> + */
> >>> +static int migrate_vma_collect_huge_pmd(pmd_t *pmdp, unsigned long start,
> >>> +					unsigned long end, struct mm_walk *walk,
> >>> +					struct folio *fault_folio)
> >>>  {
> >>> +	struct mm_struct *mm = walk->mm;
> >>> +	struct folio *folio;
> >>>  	struct migrate_vma *migrate = walk->private;
> >>> -	struct folio *fault_folio = migrate->fault_page ?
> >>> -		page_folio(migrate->fault_page) : NULL;
> >>> -	struct vm_area_struct *vma = walk->vma;
> >>> -	struct mm_struct *mm = vma->vm_mm;
> >>> -	unsigned long addr = start, unmapped = 0;
> >>>  	spinlock_t *ptl;
> >>> -	pte_t *ptep;
> >>> +	swp_entry_t entry;
> >>> +	int ret;
> >>> +	unsigned long write = 0;
> >>>  
> >>> -again:
> >>> -	if (pmd_none(*pmdp))
> >>> +	ptl = pmd_lock(mm, pmdp);
> >>> +	if (pmd_none(*pmdp)) {
> >>> +		spin_unlock(ptl);
> >>>  		return migrate_vma_collect_hole(start, end, -1, walk);
> >>> +	}
> >>>  
> >>>  	if (pmd_trans_huge(*pmdp)) {
> >>> -		struct folio *folio;
> >>> -
> >>> -		ptl = pmd_lock(mm, pmdp);
> >>> -		if (unlikely(!pmd_trans_huge(*pmdp))) {
> >>> +		if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)) {
> >>>  			spin_unlock(ptl);
> >>> -			goto again;
> >>> +			return migrate_vma_collect_skip(start, end, walk);
> >>>  		}
> >>>  
> >>>  		folio = pmd_folio(*pmdp);
> >>>  		if (is_huge_zero_folio(folio)) {
> >>>  			spin_unlock(ptl);
> >>> -			split_huge_pmd(vma, pmdp, addr);
> >>> -		} else {
> >>> -			int ret;
> >>> +			return migrate_vma_collect_hole(start, end, -1, walk);
> >>> +		}
> >>> +		if (pmd_write(*pmdp))
> >>> +			write = MIGRATE_PFN_WRITE;
> >>> +	} else if (!pmd_present(*pmdp)) {
> >>> +		entry = pmd_to_swp_entry(*pmdp);
> >>> +		folio = pfn_swap_entry_folio(entry);
> >>> +
> >>> +		if (!is_device_private_entry(entry) ||
> >>> +			!(migrate->flags & MIGRATE_VMA_SELECT_DEVICE_PRIVATE) ||
> >>> +			(folio->pgmap->owner != migrate->pgmap_owner)) {
> >>> +			spin_unlock(ptl);
> >>> +			return migrate_vma_collect_skip(start, end, walk);
> >>> +		}
> >>>  
> >>> -			folio_get(folio);
> >>> +		if (is_migration_entry(entry)) {
> >>> +			migration_entry_wait_on_locked(entry, ptl);
> >>>  			spin_unlock(ptl);
> >>> -			/* FIXME: we don't expect THP for fault_folio */
> >>> -			if (WARN_ON_ONCE(fault_folio == folio))
> >>> -				return migrate_vma_collect_skip(start, end,
> >>> -								walk);
> >>> -			if (unlikely(!folio_trylock(folio)))
> >>> -				return migrate_vma_collect_skip(start, end,
> >>> -								walk);
> >>> -			ret = split_folio(folio);
> >>> -			if (fault_folio != folio)
> >>> -				folio_unlock(folio);
> >>> -			folio_put(folio);
> >>> -			if (ret)
> >>> -				return migrate_vma_collect_skip(start, end,
> >>> -								walk);
> >>> +			return -EAGAIN;
> >>>  		}
> >>> +
> >>> +		if (is_writable_device_private_entry(entry))
> >>> +			write = MIGRATE_PFN_WRITE;
> >>> +	} else {
> >>> +		spin_unlock(ptl);
> >>> +		return -EAGAIN;
> >>> +	}
> >>> +
> >>> +	folio_get(folio);
> >>> +	if (folio != fault_folio && unlikely(!folio_trylock(folio))) {
> >>> +		spin_unlock(ptl);
> >>> +		folio_put(folio);
> >>> +		return migrate_vma_collect_skip(start, end, walk);
> >>> +	}
> >>> +
> >>> +	if (thp_migration_supported() &&
> >>> +		(migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
> >>> +		(IS_ALIGNED(start, HPAGE_PMD_SIZE) &&
> >>> +		 IS_ALIGNED(end, HPAGE_PMD_SIZE))) {
> >>> +
> >>> +		struct page_vma_mapped_walk pvmw = {
> >>> +			.ptl = ptl,
> >>> +			.address = start,
> >>> +			.pmd = pmdp,
> >>> +			.vma = walk->vma,
> >>> +		};
> >>> +
> >>> +		unsigned long pfn = page_to_pfn(folio_page(folio, 0));
> >>> +
> >>> +		migrate->src[migrate->npages] = migrate_pfn(pfn) | write
> >>> +						| MIGRATE_PFN_MIGRATE
> >>> +						| MIGRATE_PFN_COMPOUND;
> >>> +		migrate->dst[migrate->npages++] = 0;
> >>> +		migrate->cpages++;
> >>> +		ret = set_pmd_migration_entry(&pvmw, folio_page(folio, 0));
> >>> +		if (ret) {
> >>> +			migrate->npages--;
> >>> +			migrate->cpages--;
> >>> +			migrate->src[migrate->npages] = 0;
> >>> +			migrate->dst[migrate->npages] = 0;
> >>> +			goto fallback;
> >>> +		}
> >>> +		migrate_vma_collect_skip(start + PAGE_SIZE, end, walk);
> >>> +		spin_unlock(ptl);
> >>> +		return 0;
> >>> +	}
> >>> +
> >>> +fallback:
> >>> +	spin_unlock(ptl);
> >>> +	if (!folio_test_large(folio))
> >>> +		goto done;
> >>> +	ret = split_folio(folio);
> >>> +	if (fault_folio != folio)
> >>> +		folio_unlock(folio);
> >>> +	folio_put(folio);
> >>> +	if (ret)
> >>> +		return migrate_vma_collect_skip(start, end, walk);
> >>> +	if (pmd_none(pmdp_get_lockless(pmdp)))
> >>> +		return migrate_vma_collect_hole(start, end, -1, walk);
> >>> +
> >>> +done:
> >>> +	return -ENOENT;
> >>> +}
> >>> +
> >>> +static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >>> +				   unsigned long start,
> >>> +				   unsigned long end,
> >>> +				   struct mm_walk *walk)
> >>> +{
> >>> +	struct migrate_vma *migrate = walk->private;
> >>> +	struct vm_area_struct *vma = walk->vma;
> >>> +	struct mm_struct *mm = vma->vm_mm;
> >>> +	unsigned long addr = start, unmapped = 0;
> >>> +	spinlock_t *ptl;
> >>> +	struct folio *fault_folio = migrate->fault_page ?
> >>> +		page_folio(migrate->fault_page) : NULL;
> >>> +	pte_t *ptep;
> >>> +
> >>> +again:
> >>> +	if (pmd_trans_huge(*pmdp) || !pmd_present(*pmdp)) {
> >>> +		int ret = migrate_vma_collect_huge_pmd(pmdp, start, end, walk, fault_folio);
> >>> +
> >>> +		if (ret == -EAGAIN)
> >>> +			goto again;
> >>> +		if (ret == 0)
> >>> +			return 0;
> >>>  	}
> >>>  
> >>>  	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> >>> @@ -222,8 +334,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >>>  			mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
> >>>  		}
> >>>  
> >>> -		/* FIXME support THP */
> >>> -		if (!page || !page->mapping || PageTransCompound(page)) {
> >>> +		if (!page || !page->mapping) {
> >>>  			mpfn = 0;
> >>>  			goto next;
> >>>  		}
> >>> @@ -394,14 +505,6 @@ static bool migrate_vma_check_page(struct page *page, struct page *fault_page)
> >>>  	 */
> >>>  	int extra = 1 + (page == fault_page);
> >>>  
> >>> -	/*
> >>> -	 * FIXME support THP (transparent huge page), it is bit more complex to
> >>> -	 * check them than regular pages, because they can be mapped with a pmd
> >>> -	 * or with a pte (split pte mapping).
> >>> -	 */
> >>> -	if (folio_test_large(folio))
> >>> -		return false;
> >>> -
> >> You cannot remove this check unless migrating normal mTHP folios to the device is supported,
> >> which I think this series doesn't do, but maybe should?
> >>
> > Currently, mTHP should be split upon collection, right? The only way a
> > THP should be collected is if it directly maps to a PMD. If a THP or
> > mTHP is found in PTEs (i.e., in migrate_vma_collect_pmd outside of
> > migrate_vma_collect_huge_pmd), it should be split there. I sent this
> > logic to Balbir privately, but it appears to have been omitted.
> 
> I think currently, if an mTHP is found mapped by PTEs, the folio just isn't migrated.

If this is the fault page, you'd just spin forever. IIRC this is how it
popped up in my testing. I'll try to follow up with a fixes patch as I
have bandwidth.

> Yes, maybe they should just be split when collected for now. Best would of course

+1 for now.

> be to migrate them (like as order-0 pages for the device) so as not to split all mTHPs.
> And yes, maybe this should all be controlled by a different flag.
> 

+1 for different flag eventually.

Matt

> > I’m quite sure this missing split is actually an upstream bug, but it
> > has been suppressed by PMDs being split upon device fault. I have a test
> > that performs a ton of complete mremap—nonsense no one would normally
> > do, but which should work—that exposed this. I can rebase on this series
> > and see if the bug appears, or try the same nonsense without the device
> > faulting first and splitting the pages, to trigger the bug.
> >
> > Matt
> >
> >> --Mika
> >>
> --Mika
> 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [v3 03/11] mm/migrate_device: THP migration of zone device pages
  2025-08-12  6:33         ` Matthew Brost
@ 2025-08-12  6:37           ` Mika Penttilä
  0 siblings, 0 replies; 36+ messages in thread
From: Mika Penttilä @ 2025-08-12  6:37 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Balbir Singh, dri-devel, linux-mm, linux-kernel, Andrew Morton,
	David Hildenbrand, Zi Yan, Joshua Hahn, Rakie Kim, Byungchul Park,
	Gregory Price, Ying Huang, Alistair Popple, Oscar Salvador,
	Lorenzo Stoakes, Baolin Wang, Liam R. Howlett, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Ralph Campbell, Francois Dugast

On 8/12/25 09:33, Matthew Brost wrote:

> On Tue, Aug 12, 2025 at 09:25:29AM +0300, Mika Penttilä wrote:
>> On 8/12/25 08:54, Matthew Brost wrote:
>>
>>> On Tue, Aug 12, 2025 at 08:35:49AM +0300, Mika Penttilä wrote:
>>>> Hi,
>>>>
>>>> On 8/12/25 05:40, Balbir Singh wrote:
>>>>
>>>>> MIGRATE_VMA_SELECT_COMPOUND will be used to select THP pages during
>>>>> migrate_vma_setup() and MIGRATE_PFN_COMPOUND will make migrating
>>>>> device pages as compound pages during device pfn migration.
>>>>>
>>>>> migrate_device code paths go through the collect, setup
>>>>> and finalize phases of migration.
>>>>>
>>>>> The entries in src and dst arrays passed to these functions still
>>>>> remain at a PAGE_SIZE granularity. When a compound page is passed,
>>>>> the first entry has the PFN along with MIGRATE_PFN_COMPOUND
>>>>> and other flags set (MIGRATE_PFN_MIGRATE, MIGRATE_PFN_VALID), the
>>>>> remaining entries (HPAGE_PMD_NR - 1) are filled with 0's. This
>>>>> representation allows for the compound page to be split into smaller
>>>>> page sizes.
>>>>>
>>>>> migrate_vma_collect_hole(), migrate_vma_collect_pmd() are now THP
>>>>> page aware. Two new helper functions migrate_vma_collect_huge_pmd()
>>>>> and migrate_vma_insert_huge_pmd_page() have been added.
>>>>>
>>>>> migrate_vma_collect_huge_pmd() can collect THP pages, but if for
>>>>> some reason this fails, there is fallback support to split the folio
>>>>> and migrate it.
>>>>>
>>>>> migrate_vma_insert_huge_pmd_page() closely follows the logic of
>>>>> migrate_vma_insert_page()
>>>>>
>>>>> Support for splitting pages as needed for migration will follow in
>>>>> later patches in this series.
>>>>>
>>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>> Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
>>>>> Cc: Rakie Kim <rakie.kim@sk.com>
>>>>> Cc: Byungchul Park <byungchul@sk.com>
>>>>> Cc: Gregory Price <gourry@gourry.net>
>>>>> Cc: Ying Huang <ying.huang@linux.alibaba.com>
>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>> Cc: Oscar Salvador <osalvador@suse.de>
>>>>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
>>>>> Cc: Nico Pache <npache@redhat.com>
>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>> Cc: Dev Jain <dev.jain@arm.com>
>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>
>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>> ---
>>>>>  include/linux/migrate.h |   2 +
>>>>>  mm/migrate_device.c     | 457 ++++++++++++++++++++++++++++++++++------
>>>>>  2 files changed, 396 insertions(+), 63 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
>>>>> index acadd41e0b5c..d9cef0819f91 100644
>>>>> --- a/include/linux/migrate.h
>>>>> +++ b/include/linux/migrate.h
>>>>> @@ -129,6 +129,7 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
>>>>>  #define MIGRATE_PFN_VALID	(1UL << 0)
>>>>>  #define MIGRATE_PFN_MIGRATE	(1UL << 1)
>>>>>  #define MIGRATE_PFN_WRITE	(1UL << 3)
>>>>> +#define MIGRATE_PFN_COMPOUND	(1UL << 4)
>>>>>  #define MIGRATE_PFN_SHIFT	6
>>>>>  
>>>>>  static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
>>>>> @@ -147,6 +148,7 @@ enum migrate_vma_direction {
>>>>>  	MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
>>>>>  	MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
>>>>>  	MIGRATE_VMA_SELECT_DEVICE_COHERENT = 1 << 2,
>>>>> +	MIGRATE_VMA_SELECT_COMPOUND = 1 << 3,
>>>>>  };
>>>>>  
>>>>>  struct migrate_vma {
>>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>>>> index 0ed337f94fcd..6621bba62710 100644
>>>>> --- a/mm/migrate_device.c
>>>>> +++ b/mm/migrate_device.c
>>>>> @@ -14,6 +14,7 @@
>>>>>  #include <linux/pagewalk.h>
>>>>>  #include <linux/rmap.h>
>>>>>  #include <linux/swapops.h>
>>>>> +#include <asm/pgalloc.h>
>>>>>  #include <asm/tlbflush.h>
>>>>>  #include "internal.h"
>>>>>  
>>>>> @@ -44,6 +45,23 @@ static int migrate_vma_collect_hole(unsigned long start,
>>>>>  	if (!vma_is_anonymous(walk->vma))
>>>>>  		return migrate_vma_collect_skip(start, end, walk);
>>>>>  
>>>>> +	if (thp_migration_supported() &&
>>>>> +		(migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
>>>>> +		(IS_ALIGNED(start, HPAGE_PMD_SIZE) &&
>>>>> +		 IS_ALIGNED(end, HPAGE_PMD_SIZE))) {
>>>>> +		migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE |
>>>>> +						MIGRATE_PFN_COMPOUND;
>>>>> +		migrate->dst[migrate->npages] = 0;
>>>>> +		migrate->npages++;
>>>>> +		migrate->cpages++;
>>>>> +
>>>>> +		/*
>>>>> +		 * Collect the remaining entries as holes, in case we
>>>>> +		 * need to split later
>>>>> +		 */
>>>>> +		return migrate_vma_collect_skip(start + PAGE_SIZE, end, walk);
>>>>> +	}
>>>>> +
>>>>>  	for (addr = start; addr < end; addr += PAGE_SIZE) {
>>>>>  		migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE;
>>>>>  		migrate->dst[migrate->npages] = 0;
>>>>> @@ -54,57 +72,151 @@ static int migrate_vma_collect_hole(unsigned long start,
>>>>>  	return 0;
>>>>>  }
>>>>>  
>>>>> -static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>> -				   unsigned long start,
>>>>> -				   unsigned long end,
>>>>> -				   struct mm_walk *walk)
>>>>> +/**
>>>>> + * migrate_vma_collect_huge_pmd - collect THP pages without splitting the
>>>>> + * folio for device private pages.
>>>>> + * @pmdp: pointer to pmd entry
>>>>> + * @start: start address of the range for migration
>>>>> + * @end: end address of the range for migration
>>>>> + * @walk: mm_walk callback structure
>>>>> + *
>>>>> + * Collect the huge pmd entry at @pmdp for migration and set the
>>>>> + * MIGRATE_PFN_COMPOUND flag in the migrate src entry to indicate that
>>>>> + * migration will occur at HPAGE_PMD granularity
>>>>> + */
>>>>> +static int migrate_vma_collect_huge_pmd(pmd_t *pmdp, unsigned long start,
>>>>> +					unsigned long end, struct mm_walk *walk,
>>>>> +					struct folio *fault_folio)
>>>>>  {
>>>>> +	struct mm_struct *mm = walk->mm;
>>>>> +	struct folio *folio;
>>>>>  	struct migrate_vma *migrate = walk->private;
>>>>> -	struct folio *fault_folio = migrate->fault_page ?
>>>>> -		page_folio(migrate->fault_page) : NULL;
>>>>> -	struct vm_area_struct *vma = walk->vma;
>>>>> -	struct mm_struct *mm = vma->vm_mm;
>>>>> -	unsigned long addr = start, unmapped = 0;
>>>>>  	spinlock_t *ptl;
>>>>> -	pte_t *ptep;
>>>>> +	swp_entry_t entry;
>>>>> +	int ret;
>>>>> +	unsigned long write = 0;
>>>>>  
>>>>> -again:
>>>>> -	if (pmd_none(*pmdp))
>>>>> +	ptl = pmd_lock(mm, pmdp);
>>>>> +	if (pmd_none(*pmdp)) {
>>>>> +		spin_unlock(ptl);
>>>>>  		return migrate_vma_collect_hole(start, end, -1, walk);
>>>>> +	}
>>>>>  
>>>>>  	if (pmd_trans_huge(*pmdp)) {
>>>>> -		struct folio *folio;
>>>>> -
>>>>> -		ptl = pmd_lock(mm, pmdp);
>>>>> -		if (unlikely(!pmd_trans_huge(*pmdp))) {
>>>>> +		if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)) {
>>>>>  			spin_unlock(ptl);
>>>>> -			goto again;
>>>>> +			return migrate_vma_collect_skip(start, end, walk);
>>>>>  		}
>>>>>  
>>>>>  		folio = pmd_folio(*pmdp);
>>>>>  		if (is_huge_zero_folio(folio)) {
>>>>>  			spin_unlock(ptl);
>>>>> -			split_huge_pmd(vma, pmdp, addr);
>>>>> -		} else {
>>>>> -			int ret;
>>>>> +			return migrate_vma_collect_hole(start, end, -1, walk);
>>>>> +		}
>>>>> +		if (pmd_write(*pmdp))
>>>>> +			write = MIGRATE_PFN_WRITE;
>>>>> +	} else if (!pmd_present(*pmdp)) {
>>>>> +		entry = pmd_to_swp_entry(*pmdp);
>>>>> +		folio = pfn_swap_entry_folio(entry);
>>>>> +
>>>>> +		if (!is_device_private_entry(entry) ||
>>>>> +			!(migrate->flags & MIGRATE_VMA_SELECT_DEVICE_PRIVATE) ||
>>>>> +			(folio->pgmap->owner != migrate->pgmap_owner)) {
>>>>> +			spin_unlock(ptl);
>>>>> +			return migrate_vma_collect_skip(start, end, walk);
>>>>> +		}
>>>>>  
>>>>> -			folio_get(folio);
>>>>> +		if (is_migration_entry(entry)) {
>>>>> +			migration_entry_wait_on_locked(entry, ptl);
>>>>>  			spin_unlock(ptl);
>>>>> -			/* FIXME: we don't expect THP for fault_folio */
>>>>> -			if (WARN_ON_ONCE(fault_folio == folio))
>>>>> -				return migrate_vma_collect_skip(start, end,
>>>>> -								walk);
>>>>> -			if (unlikely(!folio_trylock(folio)))
>>>>> -				return migrate_vma_collect_skip(start, end,
>>>>> -								walk);
>>>>> -			ret = split_folio(folio);
>>>>> -			if (fault_folio != folio)
>>>>> -				folio_unlock(folio);
>>>>> -			folio_put(folio);
>>>>> -			if (ret)
>>>>> -				return migrate_vma_collect_skip(start, end,
>>>>> -								walk);
>>>>> +			return -EAGAIN;
>>>>>  		}
>>>>> +
>>>>> +		if (is_writable_device_private_entry(entry))
>>>>> +			write = MIGRATE_PFN_WRITE;
>>>>> +	} else {
>>>>> +		spin_unlock(ptl);
>>>>> +		return -EAGAIN;
>>>>> +	}
>>>>> +
>>>>> +	folio_get(folio);
>>>>> +	if (folio != fault_folio && unlikely(!folio_trylock(folio))) {
>>>>> +		spin_unlock(ptl);
>>>>> +		folio_put(folio);
>>>>> +		return migrate_vma_collect_skip(start, end, walk);
>>>>> +	}
>>>>> +
>>>>> +	if (thp_migration_supported() &&
>>>>> +		(migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
>>>>> +		(IS_ALIGNED(start, HPAGE_PMD_SIZE) &&
>>>>> +		 IS_ALIGNED(end, HPAGE_PMD_SIZE))) {
>>>>> +
>>>>> +		struct page_vma_mapped_walk pvmw = {
>>>>> +			.ptl = ptl,
>>>>> +			.address = start,
>>>>> +			.pmd = pmdp,
>>>>> +			.vma = walk->vma,
>>>>> +		};
>>>>> +
>>>>> +		unsigned long pfn = page_to_pfn(folio_page(folio, 0));
>>>>> +
>>>>> +		migrate->src[migrate->npages] = migrate_pfn(pfn) | write
>>>>> +						| MIGRATE_PFN_MIGRATE
>>>>> +						| MIGRATE_PFN_COMPOUND;
>>>>> +		migrate->dst[migrate->npages++] = 0;
>>>>> +		migrate->cpages++;
>>>>> +		ret = set_pmd_migration_entry(&pvmw, folio_page(folio, 0));
>>>>> +		if (ret) {
>>>>> +			migrate->npages--;
>>>>> +			migrate->cpages--;
>>>>> +			migrate->src[migrate->npages] = 0;
>>>>> +			migrate->dst[migrate->npages] = 0;
>>>>> +			goto fallback;
>>>>> +		}
>>>>> +		migrate_vma_collect_skip(start + PAGE_SIZE, end, walk);
>>>>> +		spin_unlock(ptl);
>>>>> +		return 0;
>>>>> +	}
>>>>> +
>>>>> +fallback:
>>>>> +	spin_unlock(ptl);
>>>>> +	if (!folio_test_large(folio))
>>>>> +		goto done;
>>>>> +	ret = split_folio(folio);
>>>>> +	if (fault_folio != folio)
>>>>> +		folio_unlock(folio);
>>>>> +	folio_put(folio);
>>>>> +	if (ret)
>>>>> +		return migrate_vma_collect_skip(start, end, walk);
>>>>> +	if (pmd_none(pmdp_get_lockless(pmdp)))
>>>>> +		return migrate_vma_collect_hole(start, end, -1, walk);
>>>>> +
>>>>> +done:
>>>>> +	return -ENOENT;
>>>>> +}
>>>>> +
>>>>> +static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>> +				   unsigned long start,
>>>>> +				   unsigned long end,
>>>>> +				   struct mm_walk *walk)
>>>>> +{
>>>>> +	struct migrate_vma *migrate = walk->private;
>>>>> +	struct vm_area_struct *vma = walk->vma;
>>>>> +	struct mm_struct *mm = vma->vm_mm;
>>>>> +	unsigned long addr = start, unmapped = 0;
>>>>> +	spinlock_t *ptl;
>>>>> +	struct folio *fault_folio = migrate->fault_page ?
>>>>> +		page_folio(migrate->fault_page) : NULL;
>>>>> +	pte_t *ptep;
>>>>> +
>>>>> +again:
>>>>> +	if (pmd_trans_huge(*pmdp) || !pmd_present(*pmdp)) {
>>>>> +		int ret = migrate_vma_collect_huge_pmd(pmdp, start, end, walk, fault_folio);
>>>>> +
>>>>> +		if (ret == -EAGAIN)
>>>>> +			goto again;
>>>>> +		if (ret == 0)
>>>>> +			return 0;
>>>>>  	}
>>>>>  
>>>>>  	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>>>>> @@ -222,8 +334,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>  			mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
>>>>>  		}
>>>>>  
>>>>> -		/* FIXME support THP */
>>>>> -		if (!page || !page->mapping || PageTransCompound(page)) {
>>>>> +		if (!page || !page->mapping) {
>>>>>  			mpfn = 0;
>>>>>  			goto next;
>>>>>  		}
>>>>> @@ -394,14 +505,6 @@ static bool migrate_vma_check_page(struct page *page, struct page *fault_page)
>>>>>  	 */
>>>>>  	int extra = 1 + (page == fault_page);
>>>>>  
>>>>> -	/*
>>>>> -	 * FIXME support THP (transparent huge page), it is bit more complex to
>>>>> -	 * check them than regular pages, because they can be mapped with a pmd
>>>>> -	 * or with a pte (split pte mapping).
>>>>> -	 */
>>>>> -	if (folio_test_large(folio))
>>>>> -		return false;
>>>>> -
>>>> You cannot remove this check unless you support migrating normal mTHP folios to the device,
>>>> which I think this series doesn't do, but maybe should?
>>>>
>>> Currently, mTHP should be split upon collection, right? The only way a
>>> THP should be collected is if it directly maps to a PMD. If a THP or
>>> mTHP is found in PTEs (i.e., in migrate_vma_collect_pmd outside of
>>> migrate_vma_collect_huge_pmd), it should be split there. I sent this
>>> logic to Balbir privately, but it appears to have been omitted.
>> I think currently, if an mTHP is found via PTEs, the folio just isn't migrated.
> If this is the fault page, you'd just spin forever. IIRC this is how it
> popped up in my testing. I'll try to follow up with a fixes patch as I have
> bandwidth.

Uh yes indeed that's a bug!

>
>> Yes, maybe they should just be split when collected, for now. Best would of course
> +1 for now.
>
>> be to migrate them (e.g. as order-0 pages for the device) so as not to split all mTHPs.
>> And yes, maybe this could all be controlled by a different flag.
>>
> +1 for different flag eventually.
>
> Matt
>
>>> I’m quite sure this missing split is actually an upstream bug, but it
>>> has been suppressed by PMDs being split upon device fault. I have a test
>>> that performs a ton of complete mremap—nonsense no one would normally
>>> do, but which should work—that exposed this. I can rebase on this series
>>> and see if the bug appears, or try the same nonsense without the device
>>> faulting first and splitting the pages, to trigger the bug.
>>>
>>> Matt
>>>
>>>> --Mika
>>>>
>> --Mika
>>



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [v3 02/11] mm/thp: zone_device awareness in THP handling code
  2025-08-12  2:40 ` [v3 02/11] mm/thp: zone_device awareness in THP handling code Balbir Singh
@ 2025-08-12 14:47   ` kernel test robot
  2025-08-26 15:19   ` David Hildenbrand
  2025-08-28 20:05   ` Matthew Brost
  2 siblings, 0 replies; 36+ messages in thread
From: kernel test robot @ 2025-08-12 14:47 UTC (permalink / raw)
  To: Balbir Singh, dri-devel, linux-mm, linux-kernel
  Cc: llvm, oe-kbuild-all, Balbir Singh, Andrew Morton,
	Linux Memory Management List, David Hildenbrand, Zi Yan,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Oscar Salvador, Lorenzo Stoakes, Baolin Wang,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Ralph Campbell, Mika Penttilä, Matthew Brost,
	Francois Dugast

Hi Balbir,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Balbir-Singh/mm-zone_device-support-large-zone-device-private-folios/20250812-105145
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20250812024036.690064-3-balbirs%40nvidia.com
patch subject: [v3 02/11] mm/thp: zone_device awareness in THP handling code
config: x86_64-randconfig-001-20250812 (https://download.01.org/0day-ci/archive/20250812/202508122212.qjsfr5Wf-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250812/202508122212.qjsfr5Wf-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202508122212.qjsfr5Wf-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> mm/migrate_device.c:154:19: warning: variable 'new_folio' set but not used [-Wunused-but-set-variable]
     154 |                                 struct folio *new_folio;
         |                                               ^
   1 warning generated.


vim +/new_folio +154 mm/migrate_device.c

    56	
    57	static int migrate_vma_collect_pmd(pmd_t *pmdp,
    58					   unsigned long start,
    59					   unsigned long end,
    60					   struct mm_walk *walk)
    61	{
    62		struct migrate_vma *migrate = walk->private;
    63		struct folio *fault_folio = migrate->fault_page ?
    64			page_folio(migrate->fault_page) : NULL;
    65		struct vm_area_struct *vma = walk->vma;
    66		struct mm_struct *mm = vma->vm_mm;
    67		unsigned long addr = start, unmapped = 0;
    68		spinlock_t *ptl;
    69		pte_t *ptep;
    70	
    71	again:
    72		if (pmd_none(*pmdp))
    73			return migrate_vma_collect_hole(start, end, -1, walk);
    74	
    75		if (pmd_trans_huge(*pmdp)) {
    76			struct folio *folio;
    77	
    78			ptl = pmd_lock(mm, pmdp);
    79			if (unlikely(!pmd_trans_huge(*pmdp))) {
    80				spin_unlock(ptl);
    81				goto again;
    82			}
    83	
    84			folio = pmd_folio(*pmdp);
    85			if (is_huge_zero_folio(folio)) {
    86				spin_unlock(ptl);
    87				split_huge_pmd(vma, pmdp, addr);
    88			} else {
    89				int ret;
    90	
    91				folio_get(folio);
    92				spin_unlock(ptl);
    93				/* FIXME: we don't expect THP for fault_folio */
    94				if (WARN_ON_ONCE(fault_folio == folio))
    95					return migrate_vma_collect_skip(start, end,
    96									walk);
    97				if (unlikely(!folio_trylock(folio)))
    98					return migrate_vma_collect_skip(start, end,
    99									walk);
   100				ret = split_folio(folio);
   101				if (fault_folio != folio)
   102					folio_unlock(folio);
   103				folio_put(folio);
   104				if (ret)
   105					return migrate_vma_collect_skip(start, end,
   106									walk);
   107			}
   108		}
   109	
   110		ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
   111		if (!ptep)
   112			goto again;
   113		arch_enter_lazy_mmu_mode();
   114	
   115		for (; addr < end; addr += PAGE_SIZE, ptep++) {
   116			struct dev_pagemap *pgmap;
   117			unsigned long mpfn = 0, pfn;
   118			struct folio *folio;
   119			struct page *page;
   120			swp_entry_t entry;
   121			pte_t pte;
   122	
   123			pte = ptep_get(ptep);
   124	
   125			if (pte_none(pte)) {
   126				if (vma_is_anonymous(vma)) {
   127					mpfn = MIGRATE_PFN_MIGRATE;
   128					migrate->cpages++;
   129				}
   130				goto next;
   131			}
   132	
   133			if (!pte_present(pte)) {
   134				/*
   135				 * Only care about unaddressable device page special
   136				 * page table entry. Other special swap entries are not
   137				 * migratable, and we ignore regular swapped page.
   138				 */
   139				struct folio *folio;
   140	
   141				entry = pte_to_swp_entry(pte);
   142				if (!is_device_private_entry(entry))
   143					goto next;
   144	
   145				page = pfn_swap_entry_to_page(entry);
   146				pgmap = page_pgmap(page);
   147				if (!(migrate->flags &
   148					MIGRATE_VMA_SELECT_DEVICE_PRIVATE) ||
   149				    pgmap->owner != migrate->pgmap_owner)
   150					goto next;
   151	
   152				folio = page_folio(page);
   153				if (folio_test_large(folio)) {
 > 154					struct folio *new_folio;
   155					struct folio *new_fault_folio = NULL;
   156	
   157					/*
   158					 * The reason for finding pmd present with a
   159					 * device private pte and a large folio for the
   160					 * pte is partial unmaps. Split the folio now
   161					 * for the migration to be handled correctly
   162					 */
   163					pte_unmap_unlock(ptep, ptl);
   164	
   165					folio_get(folio);
   166					if (folio != fault_folio)
   167						folio_lock(folio);
   168					if (split_folio(folio)) {
   169						if (folio != fault_folio)
   170							folio_unlock(folio);
   171						ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
   172						goto next;
   173					}
   174	
   175					new_folio = page_folio(page);
   176					if (fault_folio)
   177						new_fault_folio = page_folio(migrate->fault_page);
   178	
   179					/*
   180					 * Ensure the lock is held on the correct
   181					 * folio after the split
   182					 */
   183					if (!new_fault_folio) {
   184						folio_unlock(folio);
   185						folio_put(folio);
   186					} else if (folio != new_fault_folio) {
   187						folio_get(new_fault_folio);
   188						folio_lock(new_fault_folio);
   189						folio_unlock(folio);
   190						folio_put(folio);
   191					}
   192	
   193					addr = start;
   194					goto again;
   195				}
   196	
   197				mpfn = migrate_pfn(page_to_pfn(page)) |
   198						MIGRATE_PFN_MIGRATE;
   199				if (is_writable_device_private_entry(entry))
   200					mpfn |= MIGRATE_PFN_WRITE;
   201			} else {
   202				pfn = pte_pfn(pte);
   203				if (is_zero_pfn(pfn) &&
   204				    (migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)) {
   205					mpfn = MIGRATE_PFN_MIGRATE;
   206					migrate->cpages++;
   207					goto next;
   208				}
   209				page = vm_normal_page(migrate->vma, addr, pte);
   210				if (page && !is_zone_device_page(page) &&
   211				    !(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)) {
   212					goto next;
   213				} else if (page && is_device_coherent_page(page)) {
   214					pgmap = page_pgmap(page);
   215	
   216					if (!(migrate->flags &
   217						MIGRATE_VMA_SELECT_DEVICE_COHERENT) ||
   218						pgmap->owner != migrate->pgmap_owner)
   219						goto next;
   220				}
   221				mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
   222				mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
   223			}
   224	
   225			/* FIXME support THP */
   226			if (!page || !page->mapping || PageTransCompound(page)) {
   227				mpfn = 0;
   228				goto next;
   229			}
   230	
   231			/*
   232			 * By getting a reference on the folio we pin it and that blocks
   233			 * any kind of migration. Side effect is that it "freezes" the
   234			 * pte.
   235			 *
   236			 * We drop this reference after isolating the folio from the lru
   237			 * for non device folio (device folio are not on the lru and thus
   238			 * can't be dropped from it).
   239			 */
   240			folio = page_folio(page);
   241			folio_get(folio);
   242	
   243			/*
   244			 * We rely on folio_trylock() to avoid deadlock between
   245			 * concurrent migrations where each is waiting on the others
   246			 * folio lock. If we can't immediately lock the folio we fail this
   247			 * migration as it is only best effort anyway.
   248			 *
   249			 * If we can lock the folio it's safe to set up a migration entry
   250			 * now. In the common case where the folio is mapped once in a
   251			 * single process setting up the migration entry now is an
   252			 * optimisation to avoid walking the rmap later with
   253			 * try_to_migrate().
   254			 */
   255			if (fault_folio == folio || folio_trylock(folio)) {
   256				bool anon_exclusive;
   257				pte_t swp_pte;
   258	
   259				flush_cache_page(vma, addr, pte_pfn(pte));
   260				anon_exclusive = folio_test_anon(folio) &&
   261						  PageAnonExclusive(page);
   262				if (anon_exclusive) {
   263					pte = ptep_clear_flush(vma, addr, ptep);
   264	
   265					if (folio_try_share_anon_rmap_pte(folio, page)) {
   266						set_pte_at(mm, addr, ptep, pte);
   267						if (fault_folio != folio)
   268							folio_unlock(folio);
   269						folio_put(folio);
   270						mpfn = 0;
   271						goto next;
   272					}
   273				} else {
   274					pte = ptep_get_and_clear(mm, addr, ptep);
   275				}
   276	
   277				migrate->cpages++;
   278	
   279				/* Set the dirty flag on the folio now the pte is gone. */
   280				if (pte_dirty(pte))
   281					folio_mark_dirty(folio);
   282	
   283				/* Setup special migration page table entry */
   284				if (mpfn & MIGRATE_PFN_WRITE)
   285					entry = make_writable_migration_entry(
   286								page_to_pfn(page));
   287				else if (anon_exclusive)
   288					entry = make_readable_exclusive_migration_entry(
   289								page_to_pfn(page));
   290				else
   291					entry = make_readable_migration_entry(
   292								page_to_pfn(page));
   293				if (pte_present(pte)) {
   294					if (pte_young(pte))
   295						entry = make_migration_entry_young(entry);
   296					if (pte_dirty(pte))
   297						entry = make_migration_entry_dirty(entry);
   298				}
   299				swp_pte = swp_entry_to_pte(entry);
   300				if (pte_present(pte)) {
   301					if (pte_soft_dirty(pte))
   302						swp_pte = pte_swp_mksoft_dirty(swp_pte);
   303					if (pte_uffd_wp(pte))
   304						swp_pte = pte_swp_mkuffd_wp(swp_pte);
   305				} else {
   306					if (pte_swp_soft_dirty(pte))
   307						swp_pte = pte_swp_mksoft_dirty(swp_pte);
   308					if (pte_swp_uffd_wp(pte))
   309						swp_pte = pte_swp_mkuffd_wp(swp_pte);
   310				}
   311				set_pte_at(mm, addr, ptep, swp_pte);
   312	
   313				/*
   314				 * This is like regular unmap: we remove the rmap and
   315				 * drop the folio refcount. The folio won't be freed, as
   316				 * we took a reference just above.
   317				 */
   318				folio_remove_rmap_pte(folio, page, vma);
   319				folio_put(folio);
   320	
   321				if (pte_present(pte))
   322					unmapped++;
   323			} else {
   324				folio_put(folio);
   325				mpfn = 0;
   326			}
   327	
   328	next:
   329			migrate->dst[migrate->npages] = 0;
   330			migrate->src[migrate->npages++] = mpfn;
   331		}
   332	
   333		/* Only flush the TLB if we actually modified any entries */
   334		if (unmapped)
   335			flush_tlb_range(walk->vma, start, end);
   336	
   337		arch_leave_lazy_mmu_mode();
   338		pte_unmap_unlock(ptep - 1, ptl);
   339	
   340		return 0;
   341	}
   342	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [v3 03/11] mm/migrate_device: THP migration of zone device pages
  2025-08-12  5:35   ` Mika Penttilä
  2025-08-12  5:54     ` Matthew Brost
@ 2025-08-12 23:36     ` Balbir Singh
  2025-08-13  0:07       ` Mika Penttilä
  1 sibling, 1 reply; 36+ messages in thread
From: Balbir Singh @ 2025-08-12 23:36 UTC (permalink / raw)
  To: Mika Penttilä, dri-devel, linux-mm, linux-kernel
  Cc: Andrew Morton, David Hildenbrand, Zi Yan, Joshua Hahn, Rakie Kim,
	Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
	Oscar Salvador, Lorenzo Stoakes, Baolin Wang, Liam R. Howlett,
	Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter, Ralph Campbell,
	Matthew Brost, Francois Dugast

On 8/12/25 15:35, Mika Penttilä wrote:
> Hi,
> 
> On 8/12/25 05:40, Balbir Singh wrote:
> 
>> MIGRATE_VMA_SELECT_COMPOUND will be used to select THP pages during
>> migrate_vma_setup() and MIGRATE_PFN_COMPOUND will make migrating
>> device pages as compound pages during device pfn migration.
>>
>> migrate_device code paths go through the collect, setup
>> and finalize phases of migration.
>>
>> The entries in src and dst arrays passed to these functions still
>> remain at a PAGE_SIZE granularity. When a compound page is passed,
>> the first entry has the PFN along with MIGRATE_PFN_COMPOUND
>> and other flags set (MIGRATE_PFN_MIGRATE, MIGRATE_PFN_VALID), the
>> remaining entries (HPAGE_PMD_NR - 1) are filled with 0's. This
>> representation allows for the compound page to be split into smaller
>> page sizes.
>>
>> migrate_vma_collect_hole(), migrate_vma_collect_pmd() are now THP
>> page aware. Two new helper functions migrate_vma_collect_huge_pmd()
>> and migrate_vma_insert_huge_pmd_page() have been added.
>>
>> migrate_vma_collect_huge_pmd() can collect THP pages, but if for
>> some reason this fails, there is fallback support to split the folio
>> and migrate it.
>>
>> migrate_vma_insert_huge_pmd_page() closely follows the logic of
>> migrate_vma_insert_page()
>>
>> Support for splitting pages as needed for migration will follow in
>> later patches in this series.
>>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
>> Cc: Rakie Kim <rakie.kim@sk.com>
>> Cc: Byungchul Park <byungchul@sk.com>
>> Cc: Gregory Price <gourry@gourry.net>
>> Cc: Ying Huang <ying.huang@linux.alibaba.com>
>> Cc: Alistair Popple <apopple@nvidia.com>
>> Cc: Oscar Salvador <osalvador@suse.de>
>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
>> Cc: Nico Pache <npache@redhat.com>
>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>> Cc: Dev Jain <dev.jain@arm.com>
>> Cc: Barry Song <baohua@kernel.org>
>> Cc: Lyude Paul <lyude@redhat.com>
>> Cc: Danilo Krummrich <dakr@kernel.org>
>> Cc: David Airlie <airlied@gmail.com>
>> Cc: Simona Vetter <simona@ffwll.ch>
>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>> Cc: Mika Penttilä <mpenttil@redhat.com>
>> Cc: Matthew Brost <matthew.brost@intel.com>
>> Cc: Francois Dugast <francois.dugast@intel.com>
>>
>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>> ---
>>  include/linux/migrate.h |   2 +
>>  mm/migrate_device.c     | 457 ++++++++++++++++++++++++++++++++++------
>>  2 files changed, 396 insertions(+), 63 deletions(-)
>>
>> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
>> index acadd41e0b5c..d9cef0819f91 100644
>> --- a/include/linux/migrate.h
>> +++ b/include/linux/migrate.h
>> @@ -129,6 +129,7 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
>>  #define MIGRATE_PFN_VALID	(1UL << 0)
>>  #define MIGRATE_PFN_MIGRATE	(1UL << 1)
>>  #define MIGRATE_PFN_WRITE	(1UL << 3)
>> +#define MIGRATE_PFN_COMPOUND	(1UL << 4)
>>  #define MIGRATE_PFN_SHIFT	6
>>  
>>  static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
>> @@ -147,6 +148,7 @@ enum migrate_vma_direction {
>>  	MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
>>  	MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
>>  	MIGRATE_VMA_SELECT_DEVICE_COHERENT = 1 << 2,
>> +	MIGRATE_VMA_SELECT_COMPOUND = 1 << 3,
>>  };
>>  
>>  struct migrate_vma {
>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>> index 0ed337f94fcd..6621bba62710 100644
>> --- a/mm/migrate_device.c
>> +++ b/mm/migrate_device.c
>> @@ -14,6 +14,7 @@
>>  #include <linux/pagewalk.h>
>>  #include <linux/rmap.h>
>>  #include <linux/swapops.h>
>> +#include <asm/pgalloc.h>
>>  #include <asm/tlbflush.h>
>>  #include "internal.h"
>>  
>> @@ -44,6 +45,23 @@ static int migrate_vma_collect_hole(unsigned long start,
>>  	if (!vma_is_anonymous(walk->vma))
>>  		return migrate_vma_collect_skip(start, end, walk);
>>  
>> +	if (thp_migration_supported() &&
>> +		(migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
>> +		(IS_ALIGNED(start, HPAGE_PMD_SIZE) &&
>> +		 IS_ALIGNED(end, HPAGE_PMD_SIZE))) {
>> +		migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE |
>> +						MIGRATE_PFN_COMPOUND;
>> +		migrate->dst[migrate->npages] = 0;
>> +		migrate->npages++;
>> +		migrate->cpages++;
>> +
>> +		/*
>> +		 * Collect the remaining entries as holes, in case we
>> +		 * need to split later
>> +		 */
>> +		return migrate_vma_collect_skip(start + PAGE_SIZE, end, walk);
>> +	}
>> +
>>  	for (addr = start; addr < end; addr += PAGE_SIZE) {
>>  		migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE;
>>  		migrate->dst[migrate->npages] = 0;
>> @@ -54,57 +72,151 @@ static int migrate_vma_collect_hole(unsigned long start,
>>  	return 0;
>>  }
>>  
>> -static int migrate_vma_collect_pmd(pmd_t *pmdp,
>> -				   unsigned long start,
>> -				   unsigned long end,
>> -				   struct mm_walk *walk)
>> +/**
>> + * migrate_vma_collect_huge_pmd - collect THP pages without splitting the
>> + * folio for device private pages.
>> + * @pmdp: pointer to pmd entry
>> + * @start: start address of the range for migration
>> + * @end: end address of the range for migration
>> + * @walk: mm_walk callback structure
>> + *
>> + * Collect the huge pmd entry at @pmdp for migration and set the
>> + * MIGRATE_PFN_COMPOUND flag in the migrate src entry to indicate that
>> + * migration will occur at HPAGE_PMD granularity
>> + */
>> +static int migrate_vma_collect_huge_pmd(pmd_t *pmdp, unsigned long start,
>> +					unsigned long end, struct mm_walk *walk,
>> +					struct folio *fault_folio)
>>  {
>> +	struct mm_struct *mm = walk->mm;
>> +	struct folio *folio;
>>  	struct migrate_vma *migrate = walk->private;
>> -	struct folio *fault_folio = migrate->fault_page ?
>> -		page_folio(migrate->fault_page) : NULL;
>> -	struct vm_area_struct *vma = walk->vma;
>> -	struct mm_struct *mm = vma->vm_mm;
>> -	unsigned long addr = start, unmapped = 0;
>>  	spinlock_t *ptl;
>> -	pte_t *ptep;
>> +	swp_entry_t entry;
>> +	int ret;
>> +	unsigned long write = 0;
>>  
>> -again:
>> -	if (pmd_none(*pmdp))
>> +	ptl = pmd_lock(mm, pmdp);
>> +	if (pmd_none(*pmdp)) {
>> +		spin_unlock(ptl);
>>  		return migrate_vma_collect_hole(start, end, -1, walk);
>> +	}
>>  
>>  	if (pmd_trans_huge(*pmdp)) {
>> -		struct folio *folio;
>> -
>> -		ptl = pmd_lock(mm, pmdp);
>> -		if (unlikely(!pmd_trans_huge(*pmdp))) {
>> +		if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)) {
>>  			spin_unlock(ptl);
>> -			goto again;
>> +			return migrate_vma_collect_skip(start, end, walk);
>>  		}
>>  
>>  		folio = pmd_folio(*pmdp);
>>  		if (is_huge_zero_folio(folio)) {
>>  			spin_unlock(ptl);
>> -			split_huge_pmd(vma, pmdp, addr);
>> -		} else {
>> -			int ret;
>> +			return migrate_vma_collect_hole(start, end, -1, walk);
>> +		}
>> +		if (pmd_write(*pmdp))
>> +			write = MIGRATE_PFN_WRITE;
>> +	} else if (!pmd_present(*pmdp)) {
>> +		entry = pmd_to_swp_entry(*pmdp);
>> +		folio = pfn_swap_entry_folio(entry);
>> +
>> +		if (!is_device_private_entry(entry) ||
>> +			!(migrate->flags & MIGRATE_VMA_SELECT_DEVICE_PRIVATE) ||
>> +			(folio->pgmap->owner != migrate->pgmap_owner)) {
>> +			spin_unlock(ptl);
>> +			return migrate_vma_collect_skip(start, end, walk);
>> +		}
>>  
>> -			folio_get(folio);
>> +		if (is_migration_entry(entry)) {
>> +			migration_entry_wait_on_locked(entry, ptl);
>>  			spin_unlock(ptl);
>> -			/* FIXME: we don't expect THP for fault_folio */
>> -			if (WARN_ON_ONCE(fault_folio == folio))
>> -				return migrate_vma_collect_skip(start, end,
>> -								walk);
>> -			if (unlikely(!folio_trylock(folio)))
>> -				return migrate_vma_collect_skip(start, end,
>> -								walk);
>> -			ret = split_folio(folio);
>> -			if (fault_folio != folio)
>> -				folio_unlock(folio);
>> -			folio_put(folio);
>> -			if (ret)
>> -				return migrate_vma_collect_skip(start, end,
>> -								walk);
>> +			return -EAGAIN;
>>  		}
>> +
>> +		if (is_writable_device_private_entry(entry))
>> +			write = MIGRATE_PFN_WRITE;
>> +	} else {
>> +		spin_unlock(ptl);
>> +		return -EAGAIN;
>> +	}
>> +
>> +	folio_get(folio);
>> +	if (folio != fault_folio && unlikely(!folio_trylock(folio))) {
>> +		spin_unlock(ptl);
>> +		folio_put(folio);
>> +		return migrate_vma_collect_skip(start, end, walk);
>> +	}
>> +
>> +	if (thp_migration_supported() &&
>> +		(migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
>> +		(IS_ALIGNED(start, HPAGE_PMD_SIZE) &&
>> +		 IS_ALIGNED(end, HPAGE_PMD_SIZE))) {
>> +
>> +		struct page_vma_mapped_walk pvmw = {
>> +			.ptl = ptl,
>> +			.address = start,
>> +			.pmd = pmdp,
>> +			.vma = walk->vma,
>> +		};
>> +
>> +		unsigned long pfn = page_to_pfn(folio_page(folio, 0));
>> +
>> +		migrate->src[migrate->npages] = migrate_pfn(pfn) | write
>> +						| MIGRATE_PFN_MIGRATE
>> +						| MIGRATE_PFN_COMPOUND;
>> +		migrate->dst[migrate->npages++] = 0;
>> +		migrate->cpages++;
>> +		ret = set_pmd_migration_entry(&pvmw, folio_page(folio, 0));
>> +		if (ret) {
>> +			migrate->npages--;
>> +			migrate->cpages--;
>> +			migrate->src[migrate->npages] = 0;
>> +			migrate->dst[migrate->npages] = 0;
>> +			goto fallback;
>> +		}
>> +		migrate_vma_collect_skip(start + PAGE_SIZE, end, walk);
>> +		spin_unlock(ptl);
>> +		return 0;
>> +	}
>> +
>> +fallback:
>> +	spin_unlock(ptl);
>> +	if (!folio_test_large(folio))
>> +		goto done;
>> +	ret = split_folio(folio);
>> +	if (fault_folio != folio)
>> +		folio_unlock(folio);
>> +	folio_put(folio);
>> +	if (ret)
>> +		return migrate_vma_collect_skip(start, end, walk);
>> +	if (pmd_none(pmdp_get_lockless(pmdp)))
>> +		return migrate_vma_collect_hole(start, end, -1, walk);
>> +
>> +done:
>> +	return -ENOENT;
>> +}
>> +
>> +static int migrate_vma_collect_pmd(pmd_t *pmdp,
>> +				   unsigned long start,
>> +				   unsigned long end,
>> +				   struct mm_walk *walk)
>> +{
>> +	struct migrate_vma *migrate = walk->private;
>> +	struct vm_area_struct *vma = walk->vma;
>> +	struct mm_struct *mm = vma->vm_mm;
>> +	unsigned long addr = start, unmapped = 0;
>> +	spinlock_t *ptl;
>> +	struct folio *fault_folio = migrate->fault_page ?
>> +		page_folio(migrate->fault_page) : NULL;
>> +	pte_t *ptep;
>> +
>> +again:
>> +	if (pmd_trans_huge(*pmdp) || !pmd_present(*pmdp)) {
>> +		int ret = migrate_vma_collect_huge_pmd(pmdp, start, end, walk, fault_folio);
>> +
>> +		if (ret == -EAGAIN)
>> +			goto again;
>> +		if (ret == 0)
>> +			return 0;
>>  	}
>>  
>>  	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>> @@ -222,8 +334,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>  			mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
>>  		}
>>  
>> -		/* FIXME support THP */
>> -		if (!page || !page->mapping || PageTransCompound(page)) {
>> +		if (!page || !page->mapping) {
>>  			mpfn = 0;
>>  			goto next;
>>  		}
>> @@ -394,14 +505,6 @@ static bool migrate_vma_check_page(struct page *page, struct page *fault_page)
>>  	 */
>>  	int extra = 1 + (page == fault_page);
>>  
>> -	/*
>> -	 * FIXME support THP (transparent huge page), it is bit more complex to
>> -	 * check them than regular pages, because they can be mapped with a pmd
>> -	 * or with a pte (split pte mapping).
>> -	 */
>> -	if (folio_test_large(folio))
>> -		return false;
>> -
> 
> You cannot remove this check unless you support migrating normal mTHP folios to the device,
> which I think this series doesn't do, but maybe should?
> 

mTHP needs to be a follow-up series; there are comments in the cover letter. I had an
assert earlier to prevent other folio sizes, but I've removed it, and the interface
to drivers does not allow for mTHP device-private folios.

Balbir


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [v3 03/11] mm/migrate_device: THP migration of zone device pages
  2025-08-12 23:36     ` Balbir Singh
@ 2025-08-13  0:07       ` Mika Penttilä
  2025-08-14 22:51         ` Balbir Singh
  0 siblings, 1 reply; 36+ messages in thread
From: Mika Penttilä @ 2025-08-13  0:07 UTC (permalink / raw)
  To: Balbir Singh, dri-devel, linux-mm, linux-kernel
  Cc: Andrew Morton, David Hildenbrand, Zi Yan, Joshua Hahn, Rakie Kim,
	Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
	Oscar Salvador, Lorenzo Stoakes, Baolin Wang, Liam R. Howlett,
	Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter, Ralph Campbell,
	Matthew Brost, Francois Dugast


On 8/13/25 02:36, Balbir Singh wrote:

> On 8/12/25 15:35, Mika Penttilä wrote:
>> Hi,
>>
>> On 8/12/25 05:40, Balbir Singh wrote:
>>
>>> MIGRATE_VMA_SELECT_COMPOUND will be used to select THP pages during
>>> migrate_vma_setup() and MIGRATE_PFN_COMPOUND will make migrating
>>> device pages as compound pages during device pfn migration.
>>>
>>> migrate_device code paths go through the collect, setup
>>> and finalize phases of migration.
>>>
>>> The entries in src and dst arrays passed to these functions still
>>> remain at a PAGE_SIZE granularity. When a compound page is passed,
>>> the first entry has the PFN along with MIGRATE_PFN_COMPOUND
>>> and other flags set (MIGRATE_PFN_MIGRATE, MIGRATE_PFN_VALID), the
>>> remaining entries (HPAGE_PMD_NR - 1) are filled with 0's. This
>>> representation allows for the compound page to be split into smaller
>>> page sizes.
>>>
>>> migrate_vma_collect_hole(), migrate_vma_collect_pmd() are now THP
>>> page aware. Two new helper functions migrate_vma_collect_huge_pmd()
>>> and migrate_vma_insert_huge_pmd_page() have been added.
>>>
>>> migrate_vma_collect_huge_pmd() can collect THP pages, but if for
>>> some reason this fails, there is fallback support to split the folio
>>> and migrate it.
>>>
>>> migrate_vma_insert_huge_pmd_page() closely follows the logic of
>>> migrate_vma_insert_page()
>>>
>>> Support for splitting pages as needed for migration will follow in
>>> later patches in this series.
>>>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: David Hildenbrand <david@redhat.com>
>>> Cc: Zi Yan <ziy@nvidia.com>
>>> Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
>>> Cc: Rakie Kim <rakie.kim@sk.com>
>>> Cc: Byungchul Park <byungchul@sk.com>
>>> Cc: Gregory Price <gourry@gourry.net>
>>> Cc: Ying Huang <ying.huang@linux.alibaba.com>
>>> Cc: Alistair Popple <apopple@nvidia.com>
>>> Cc: Oscar Salvador <osalvador@suse.de>
>>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
>>> Cc: Nico Pache <npache@redhat.com>
>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>> Cc: Dev Jain <dev.jain@arm.com>
>>> Cc: Barry Song <baohua@kernel.org>
>>> Cc: Lyude Paul <lyude@redhat.com>
>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>> Cc: David Airlie <airlied@gmail.com>
>>> Cc: Simona Vetter <simona@ffwll.ch>
>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>
>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>> ---
>>>  include/linux/migrate.h |   2 +
>>>  mm/migrate_device.c     | 457 ++++++++++++++++++++++++++++++++++------
>>>  2 files changed, 396 insertions(+), 63 deletions(-)
>>>
>>> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
>>> index acadd41e0b5c..d9cef0819f91 100644
>>> --- a/include/linux/migrate.h
>>> +++ b/include/linux/migrate.h
>>> @@ -129,6 +129,7 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
>>>  #define MIGRATE_PFN_VALID	(1UL << 0)
>>>  #define MIGRATE_PFN_MIGRATE	(1UL << 1)
>>>  #define MIGRATE_PFN_WRITE	(1UL << 3)
>>> +#define MIGRATE_PFN_COMPOUND	(1UL << 4)
>>>  #define MIGRATE_PFN_SHIFT	6
>>>  
>>>  static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
>>> @@ -147,6 +148,7 @@ enum migrate_vma_direction {
>>>  	MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
>>>  	MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
>>>  	MIGRATE_VMA_SELECT_DEVICE_COHERENT = 1 << 2,
>>> +	MIGRATE_VMA_SELECT_COMPOUND = 1 << 3,
>>>  };
>>>  
>>>  struct migrate_vma {
>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>> index 0ed337f94fcd..6621bba62710 100644
>>> --- a/mm/migrate_device.c
>>> +++ b/mm/migrate_device.c
>>> @@ -14,6 +14,7 @@
>>>  #include <linux/pagewalk.h>
>>>  #include <linux/rmap.h>
>>>  #include <linux/swapops.h>
>>> +#include <asm/pgalloc.h>
>>>  #include <asm/tlbflush.h>
>>>  #include "internal.h"
>>>  
>>> @@ -44,6 +45,23 @@ static int migrate_vma_collect_hole(unsigned long start,
>>>  	if (!vma_is_anonymous(walk->vma))
>>>  		return migrate_vma_collect_skip(start, end, walk);
>>>  
>>> +	if (thp_migration_supported() &&
>>> +		(migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
>>> +		(IS_ALIGNED(start, HPAGE_PMD_SIZE) &&
>>> +		 IS_ALIGNED(end, HPAGE_PMD_SIZE))) {
>>> +		migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE |
>>> +						MIGRATE_PFN_COMPOUND;
>>> +		migrate->dst[migrate->npages] = 0;
>>> +		migrate->npages++;
>>> +		migrate->cpages++;
>>> +
>>> +		/*
>>> +		 * Collect the remaining entries as holes, in case we
>>> +		 * need to split later
>>> +		 */
>>> +		return migrate_vma_collect_skip(start + PAGE_SIZE, end, walk);
>>> +	}
>>> +
>>>  	for (addr = start; addr < end; addr += PAGE_SIZE) {
>>>  		migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE;
>>>  		migrate->dst[migrate->npages] = 0;
>>> @@ -54,57 +72,151 @@ static int migrate_vma_collect_hole(unsigned long start,
>>>  	return 0;
>>>  }
>>>  
>>> -static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>> -				   unsigned long start,
>>> -				   unsigned long end,
>>> -				   struct mm_walk *walk)
>>> +/**
>>> + * migrate_vma_collect_huge_pmd - collect THP pages without splitting the
>>> + * folio for device private pages.
>>> + * @pmdp: pointer to pmd entry
>>> + * @start: start address of the range for migration
>>> + * @end: end address of the range for migration
>>> + * @walk: mm_walk callback structure
>>> + *
>>> + * Collect the huge pmd entry at @pmdp for migration and set the
>>> + * MIGRATE_PFN_COMPOUND flag in the migrate src entry to indicate that
>>> + * migration will occur at HPAGE_PMD granularity
>>> + */
>>> +static int migrate_vma_collect_huge_pmd(pmd_t *pmdp, unsigned long start,
>>> +					unsigned long end, struct mm_walk *walk,
>>> +					struct folio *fault_folio)
>>>  {
>>> +	struct mm_struct *mm = walk->mm;
>>> +	struct folio *folio;
>>>  	struct migrate_vma *migrate = walk->private;
>>> -	struct folio *fault_folio = migrate->fault_page ?
>>> -		page_folio(migrate->fault_page) : NULL;
>>> -	struct vm_area_struct *vma = walk->vma;
>>> -	struct mm_struct *mm = vma->vm_mm;
>>> -	unsigned long addr = start, unmapped = 0;
>>>  	spinlock_t *ptl;
>>> -	pte_t *ptep;
>>> +	swp_entry_t entry;
>>> +	int ret;
>>> +	unsigned long write = 0;
>>>  
>>> -again:
>>> -	if (pmd_none(*pmdp))
>>> +	ptl = pmd_lock(mm, pmdp);
>>> +	if (pmd_none(*pmdp)) {
>>> +		spin_unlock(ptl);
>>>  		return migrate_vma_collect_hole(start, end, -1, walk);
>>> +	}
>>>  
>>>  	if (pmd_trans_huge(*pmdp)) {
>>> -		struct folio *folio;
>>> -
>>> -		ptl = pmd_lock(mm, pmdp);
>>> -		if (unlikely(!pmd_trans_huge(*pmdp))) {
>>> +		if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)) {
>>>  			spin_unlock(ptl);
>>> -			goto again;
>>> +			return migrate_vma_collect_skip(start, end, walk);
>>>  		}
>>>  
>>>  		folio = pmd_folio(*pmdp);
>>>  		if (is_huge_zero_folio(folio)) {
>>>  			spin_unlock(ptl);
>>> -			split_huge_pmd(vma, pmdp, addr);
>>> -		} else {
>>> -			int ret;
>>> +			return migrate_vma_collect_hole(start, end, -1, walk);
>>> +		}
>>> +		if (pmd_write(*pmdp))
>>> +			write = MIGRATE_PFN_WRITE;
>>> +	} else if (!pmd_present(*pmdp)) {
>>> +		entry = pmd_to_swp_entry(*pmdp);
>>> +		folio = pfn_swap_entry_folio(entry);
>>> +
>>> +		if (!is_device_private_entry(entry) ||
>>> +			!(migrate->flags & MIGRATE_VMA_SELECT_DEVICE_PRIVATE) ||
>>> +			(folio->pgmap->owner != migrate->pgmap_owner)) {
>>> +			spin_unlock(ptl);
>>> +			return migrate_vma_collect_skip(start, end, walk);
>>> +		}
>>>  
>>> -			folio_get(folio);
>>> +		if (is_migration_entry(entry)) {
>>> +			migration_entry_wait_on_locked(entry, ptl);
>>>  			spin_unlock(ptl);
>>> -			/* FIXME: we don't expect THP for fault_folio */
>>> -			if (WARN_ON_ONCE(fault_folio == folio))
>>> -				return migrate_vma_collect_skip(start, end,
>>> -								walk);
>>> -			if (unlikely(!folio_trylock(folio)))
>>> -				return migrate_vma_collect_skip(start, end,
>>> -								walk);
>>> -			ret = split_folio(folio);
>>> -			if (fault_folio != folio)
>>> -				folio_unlock(folio);
>>> -			folio_put(folio);
>>> -			if (ret)
>>> -				return migrate_vma_collect_skip(start, end,
>>> -								walk);
>>> +			return -EAGAIN;
>>>  		}
>>> +
>>> +		if (is_writable_device_private_entry(entry))
>>> +			write = MIGRATE_PFN_WRITE;
>>> +	} else {
>>> +		spin_unlock(ptl);
>>> +		return -EAGAIN;
>>> +	}
>>> +
>>> +	folio_get(folio);
>>> +	if (folio != fault_folio && unlikely(!folio_trylock(folio))) {
>>> +		spin_unlock(ptl);
>>> +		folio_put(folio);
>>> +		return migrate_vma_collect_skip(start, end, walk);
>>> +	}
>>> +
>>> +	if (thp_migration_supported() &&
>>> +		(migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
>>> +		(IS_ALIGNED(start, HPAGE_PMD_SIZE) &&
>>> +		 IS_ALIGNED(end, HPAGE_PMD_SIZE))) {
>>> +
>>> +		struct page_vma_mapped_walk pvmw = {
>>> +			.ptl = ptl,
>>> +			.address = start,
>>> +			.pmd = pmdp,
>>> +			.vma = walk->vma,
>>> +		};
>>> +
>>> +		unsigned long pfn = page_to_pfn(folio_page(folio, 0));
>>> +
>>> +		migrate->src[migrate->npages] = migrate_pfn(pfn) | write
>>> +						| MIGRATE_PFN_MIGRATE
>>> +						| MIGRATE_PFN_COMPOUND;
>>> +		migrate->dst[migrate->npages++] = 0;
>>> +		migrate->cpages++;
>>> +		ret = set_pmd_migration_entry(&pvmw, folio_page(folio, 0));
>>> +		if (ret) {
>>> +			migrate->npages--;
>>> +			migrate->cpages--;
>>> +			migrate->src[migrate->npages] = 0;
>>> +			migrate->dst[migrate->npages] = 0;
>>> +			goto fallback;
>>> +		}
>>> +		migrate_vma_collect_skip(start + PAGE_SIZE, end, walk);
>>> +		spin_unlock(ptl);
>>> +		return 0;
>>> +	}
>>> +
>>> +fallback:
>>> +	spin_unlock(ptl);
>>> +	if (!folio_test_large(folio))
>>> +		goto done;
>>> +	ret = split_folio(folio);
>>> +	if (fault_folio != folio)
>>> +		folio_unlock(folio);
>>> +	folio_put(folio);
>>> +	if (ret)
>>> +		return migrate_vma_collect_skip(start, end, walk);
>>> +	if (pmd_none(pmdp_get_lockless(pmdp)))
>>> +		return migrate_vma_collect_hole(start, end, -1, walk);
>>> +
>>> +done:
>>> +	return -ENOENT;
>>> +}
>>> +
>>> +static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>> +				   unsigned long start,
>>> +				   unsigned long end,
>>> +				   struct mm_walk *walk)
>>> +{
>>> +	struct migrate_vma *migrate = walk->private;
>>> +	struct vm_area_struct *vma = walk->vma;
>>> +	struct mm_struct *mm = vma->vm_mm;
>>> +	unsigned long addr = start, unmapped = 0;
>>> +	spinlock_t *ptl;
>>> +	struct folio *fault_folio = migrate->fault_page ?
>>> +		page_folio(migrate->fault_page) : NULL;
>>> +	pte_t *ptep;
>>> +
>>> +again:
>>> +	if (pmd_trans_huge(*pmdp) || !pmd_present(*pmdp)) {
>>> +		int ret = migrate_vma_collect_huge_pmd(pmdp, start, end, walk, fault_folio);
>>> +
>>> +		if (ret == -EAGAIN)
>>> +			goto again;
>>> +		if (ret == 0)
>>> +			return 0;
>>>  	}
>>>  
>>>  	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>>> @@ -222,8 +334,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>  			mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
>>>  		}
>>>  
>>> -		/* FIXME support THP */
>>> -		if (!page || !page->mapping || PageTransCompound(page)) {
>>> +		if (!page || !page->mapping) {
>>>  			mpfn = 0;
>>>  			goto next;
>>>  		}
>>> @@ -394,14 +505,6 @@ static bool migrate_vma_check_page(struct page *page, struct page *fault_page)
>>>  	 */
>>>  	int extra = 1 + (page == fault_page);
>>>  
>>> -	/*
>>> -	 * FIXME support THP (transparent huge page), it is bit more complex to
>>> -	 * check them than regular pages, because they can be mapped with a pmd
>>> -	 * or with a pte (split pte mapping).
>>> -	 */
>>> -	if (folio_test_large(folio))
>>> -		return false;
>>> -
>> You cannot remove this check unless you support migrating normal mTHP folios to the device,
>> which I think this series doesn't do, but maybe should?
>>
> mTHP needs to be a follow-up series; there are comments in the cover letter. I had an
> assert earlier to prevent other folio sizes, but I've removed it, and the interface
> to drivers does not allow for mTHP device-private folios.
>
> Balbir
>
pte-mapped device-private THPs of other sizes can also be created as a result of pmd and folio splits.
You should handle them consistently in one place, not in each driver.
As pointed out by Matthew, there's also the problem with the fault_page if it is not split and just ignored.
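
To illustrate how such a pte-mapped large device-private folio can show up
(illustrative sketch only; assumes a 2M PMD size and 4K base pages, and that
the driver has already migrated the range to device memory as one folio):

#include <sys/mman.h>

#define SZ_2M	(2UL << 20)
#define SZ_4K	(4UL << 10)

int main(void)
{
	/* anonymous mapping that THP can back with a 2M folio */
	void *buf = mmap(NULL, SZ_2M, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;

	/* (driver migrates buf to device memory as one compound folio here) */

	/*
	 * Unmapping a 4K piece splits the PMD entry into ptes, but the
	 * device-private folio itself is not split: the rest of the range
	 * is now pte-mapped by a still-large folio.
	 */
	munmap(buf, SZ_4K);
	return 0;
}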


--Mika



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [v3 10/11] gpu/drm/nouveau: add THP migration support
  2025-08-12  2:40 ` [v3 10/11] gpu/drm/nouveau: add THP migration support Balbir Singh
@ 2025-08-13  2:23   ` kernel test robot
  0 siblings, 0 replies; 36+ messages in thread
From: kernel test robot @ 2025-08-13  2:23 UTC (permalink / raw)
  To: Balbir Singh, dri-devel, linux-mm, linux-kernel
  Cc: oe-kbuild-all, Balbir Singh, Andrew Morton,
	Linux Memory Management List, David Hildenbrand, Zi Yan,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Oscar Salvador, Lorenzo Stoakes, Baolin Wang,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Ralph Campbell, Mika Penttilä, Matthew Brost,
	Francois Dugast

Hi Balbir,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Balbir-Singh/mm-zone_device-support-large-zone-device-private-folios/20250812-105145
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20250812024036.690064-11-balbirs%40nvidia.com
patch subject: [v3 10/11] gpu/drm/nouveau: add THP migration support
config: powerpc-allmodconfig (https://download.01.org/0day-ci/archive/20250813/202508130923.0VGA41Zv-lkp@intel.com/config)
compiler: powerpc64-linux-gcc (GCC) 15.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250813/202508130923.0VGA41Zv-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202508130923.0VGA41Zv-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from <command-line>:
   drivers/gpu/drm/nouveau/nouveau_dmem.c: In function 'nouveau_dmem_migrate_vma':
>> include/linux/compiler_types.h:572:45: error: call to '__compiletime_assert_706' declared with attribute error: max((1<<((16 + __pte_index_size)-16)), max) signedness error
     572 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |                                             ^
   include/linux/compiler_types.h:553:25: note: in definition of macro '__compiletime_assert'
     553 |                         prefix ## suffix();                             \
         |                         ^~~~~~
   include/linux/compiler_types.h:572:9: note: in expansion of macro '_compiletime_assert'
     572 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
      39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
         |                                     ^~~~~~~~~~~~~~~~~~
   include/linux/minmax.h:93:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
      93 |         BUILD_BUG_ON_MSG(!__types_ok(ux, uy),           \
         |         ^~~~~~~~~~~~~~~~
   include/linux/minmax.h:98:9: note: in expansion of macro '__careful_cmp_once'
      98 |         __careful_cmp_once(op, x, y, __UNIQUE_ID(x_), __UNIQUE_ID(y_))
         |         ^~~~~~~~~~~~~~~~~~
   include/linux/minmax.h:112:25: note: in expansion of macro '__careful_cmp'
     112 | #define max(x, y)       __careful_cmp(max, x, y)
         |                         ^~~~~~~~~~~~~
   drivers/gpu/drm/nouveau/nouveau_dmem.c:811:23: note: in expansion of macro 'max'
     811 |                 max = max(HPAGE_PMD_NR, max);
         |                       ^~~
--
   In file included from <command-line>:
   nouveau/nouveau_dmem.c: In function 'nouveau_dmem_migrate_vma':
>> include/linux/compiler_types.h:572:45: error: call to '__compiletime_assert_706' declared with attribute error: max((1<<((16 + __pte_index_size)-16)), max) signedness error
     572 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |                                             ^
   include/linux/compiler_types.h:553:25: note: in definition of macro '__compiletime_assert'
     553 |                         prefix ## suffix();                             \
         |                         ^~~~~~
   include/linux/compiler_types.h:572:9: note: in expansion of macro '_compiletime_assert'
     572 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
      39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
         |                                     ^~~~~~~~~~~~~~~~~~
   include/linux/minmax.h:93:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
      93 |         BUILD_BUG_ON_MSG(!__types_ok(ux, uy),           \
         |         ^~~~~~~~~~~~~~~~
   include/linux/minmax.h:98:9: note: in expansion of macro '__careful_cmp_once'
      98 |         __careful_cmp_once(op, x, y, __UNIQUE_ID(x_), __UNIQUE_ID(y_))
         |         ^~~~~~~~~~~~~~~~~~
   include/linux/minmax.h:112:25: note: in expansion of macro '__careful_cmp'
     112 | #define max(x, y)       __careful_cmp(max, x, y)
         |                         ^~~~~~~~~~~~~
   nouveau/nouveau_dmem.c:811:23: note: in expansion of macro 'max'
     811 |                 max = max(HPAGE_PMD_NR, max);
         |                       ^~~


vim +/__compiletime_assert_706 +572 include/linux/compiler_types.h

eb5c2d4b45e3d2 Will Deacon 2020-07-21  558  
eb5c2d4b45e3d2 Will Deacon 2020-07-21  559  #define _compiletime_assert(condition, msg, prefix, suffix) \
eb5c2d4b45e3d2 Will Deacon 2020-07-21  560  	__compiletime_assert(condition, msg, prefix, suffix)
eb5c2d4b45e3d2 Will Deacon 2020-07-21  561  
eb5c2d4b45e3d2 Will Deacon 2020-07-21  562  /**
eb5c2d4b45e3d2 Will Deacon 2020-07-21  563   * compiletime_assert - break build and emit msg if condition is false
eb5c2d4b45e3d2 Will Deacon 2020-07-21  564   * @condition: a compile-time constant condition to check
eb5c2d4b45e3d2 Will Deacon 2020-07-21  565   * @msg:       a message to emit if condition is false
eb5c2d4b45e3d2 Will Deacon 2020-07-21  566   *
eb5c2d4b45e3d2 Will Deacon 2020-07-21  567   * In tradition of POSIX assert, this macro will break the build if the
eb5c2d4b45e3d2 Will Deacon 2020-07-21  568   * supplied condition is *false*, emitting the supplied error message if the
eb5c2d4b45e3d2 Will Deacon 2020-07-21  569   * compiler has support to do so.
eb5c2d4b45e3d2 Will Deacon 2020-07-21  570   */
eb5c2d4b45e3d2 Will Deacon 2020-07-21  571  #define compiletime_assert(condition, msg) \
eb5c2d4b45e3d2 Will Deacon 2020-07-21 @572  	_compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
eb5c2d4b45e3d2 Will Deacon 2020-07-21  573  
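
For reference, this class of min()/max() signedness failure is usually resolved by forcing a common
type with max_t(). A sketch only, assuming the local variable max in nouveau_dmem_migrate_vma() is an
unsigned long (whether a later revision fixes it this way is not shown here):

	/* HPAGE_PMD_NR expands to a signed constant on this config, while
	 * "max" is unsigned, so max() trips the __types_ok() check.
	 * Forcing a common type avoids that: */
	max = max_t(unsigned long, HPAGE_PMD_NR, max);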

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [v3 03/11] mm/migrate_device: THP migration of zone device pages
  2025-08-13  0:07       ` Mika Penttilä
@ 2025-08-14 22:51         ` Balbir Singh
  2025-08-15  0:04           ` Matthew Brost
  0 siblings, 1 reply; 36+ messages in thread
From: Balbir Singh @ 2025-08-14 22:51 UTC (permalink / raw)
  To: Mika Penttilä, dri-devel, linux-mm, linux-kernel
  Cc: Andrew Morton, David Hildenbrand, Zi Yan, Joshua Hahn, Rakie Kim,
	Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
	Oscar Salvador, Lorenzo Stoakes, Baolin Wang, Liam R. Howlett,
	Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter, Ralph Campbell,
	Matthew Brost, Francois Dugast

On 8/13/25 10:07, Mika Penttilä wrote:
> 
> On 8/13/25 02:36, Balbir Singh wrote:
> 
>> On 8/12/25 15:35, Mika Penttilä wrote:
>>> Hi,
>>>
>>> On 8/12/25 05:40, Balbir Singh wrote:
>>>
>>>> MIGRATE_VMA_SELECT_COMPOUND will be used to select THP pages during
>>>> migrate_vma_setup() and MIGRATE_PFN_COMPOUND will make migrating
>>>> device pages as compound pages during device pfn migration.
>>>>
>>>> migrate_device code paths go through the collect, setup
>>>> and finalize phases of migration.
>>>>
>>>> The entries in src and dst arrays passed to these functions still
>>>> remain at a PAGE_SIZE granularity. When a compound page is passed,
>>>> the first entry has the PFN along with MIGRATE_PFN_COMPOUND
>>>> and other flags set (MIGRATE_PFN_MIGRATE, MIGRATE_PFN_VALID), the
>>>> remaining entries (HPAGE_PMD_NR - 1) are filled with 0's. This
>>>> representation allows for the compound page to be split into smaller
>>>> page sizes.
>>>>
>>>> migrate_vma_collect_hole(), migrate_vma_collect_pmd() are now THP
>>>> page aware. Two new helper functions migrate_vma_collect_huge_pmd()
>>>> and migrate_vma_insert_huge_pmd_page() have been added.
>>>>
>>>> migrate_vma_collect_huge_pmd() can collect THP pages, but if for
>>>> some reason this fails, there is fallback support to split the folio
>>>> and migrate it.
>>>>
>>>> migrate_vma_insert_huge_pmd_page() closely follows the logic of
>>>> migrate_vma_insert_page()
>>>>
>>>> Support for splitting pages as needed for migration will follow in
>>>> later patches in this series.
>>>>
>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>>> Cc: David Hildenbrand <david@redhat.com>
>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>> Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
>>>> Cc: Rakie Kim <rakie.kim@sk.com>
>>>> Cc: Byungchul Park <byungchul@sk.com>
>>>> Cc: Gregory Price <gourry@gourry.net>
>>>> Cc: Ying Huang <ying.huang@linux.alibaba.com>
>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>> Cc: Oscar Salvador <osalvador@suse.de>
>>>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
>>>> Cc: Nico Pache <npache@redhat.com>
>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>> Cc: Dev Jain <dev.jain@arm.com>
>>>> Cc: Barry Song <baohua@kernel.org>
>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>> Cc: David Airlie <airlied@gmail.com>
>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>
>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>> ---
>>>>  include/linux/migrate.h |   2 +
>>>>  mm/migrate_device.c     | 457 ++++++++++++++++++++++++++++++++++------
>>>>  2 files changed, 396 insertions(+), 63 deletions(-)
>>>>
>>>> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
>>>> index acadd41e0b5c..d9cef0819f91 100644
>>>> --- a/include/linux/migrate.h
>>>> +++ b/include/linux/migrate.h
>>>> @@ -129,6 +129,7 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
>>>>  #define MIGRATE_PFN_VALID	(1UL << 0)
>>>>  #define MIGRATE_PFN_MIGRATE	(1UL << 1)
>>>>  #define MIGRATE_PFN_WRITE	(1UL << 3)
>>>> +#define MIGRATE_PFN_COMPOUND	(1UL << 4)
>>>>  #define MIGRATE_PFN_SHIFT	6
>>>>  
>>>>  static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
>>>> @@ -147,6 +148,7 @@ enum migrate_vma_direction {
>>>>  	MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
>>>>  	MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
>>>>  	MIGRATE_VMA_SELECT_DEVICE_COHERENT = 1 << 2,
>>>> +	MIGRATE_VMA_SELECT_COMPOUND = 1 << 3,
>>>>  };
>>>>  
>>>>  struct migrate_vma {
>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>>> index 0ed337f94fcd..6621bba62710 100644
>>>> --- a/mm/migrate_device.c
>>>> +++ b/mm/migrate_device.c
>>>> @@ -14,6 +14,7 @@
>>>>  #include <linux/pagewalk.h>
>>>>  #include <linux/rmap.h>
>>>>  #include <linux/swapops.h>
>>>> +#include <asm/pgalloc.h>
>>>>  #include <asm/tlbflush.h>
>>>>  #include "internal.h"
>>>>  
>>>> @@ -44,6 +45,23 @@ static int migrate_vma_collect_hole(unsigned long start,
>>>>  	if (!vma_is_anonymous(walk->vma))
>>>>  		return migrate_vma_collect_skip(start, end, walk);
>>>>  
>>>> +	if (thp_migration_supported() &&
>>>> +		(migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
>>>> +		(IS_ALIGNED(start, HPAGE_PMD_SIZE) &&
>>>> +		 IS_ALIGNED(end, HPAGE_PMD_SIZE))) {
>>>> +		migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE |
>>>> +						MIGRATE_PFN_COMPOUND;
>>>> +		migrate->dst[migrate->npages] = 0;
>>>> +		migrate->npages++;
>>>> +		migrate->cpages++;
>>>> +
>>>> +		/*
>>>> +		 * Collect the remaining entries as holes, in case we
>>>> +		 * need to split later
>>>> +		 */
>>>> +		return migrate_vma_collect_skip(start + PAGE_SIZE, end, walk);
>>>> +	}
>>>> +
>>>>  	for (addr = start; addr < end; addr += PAGE_SIZE) {
>>>>  		migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE;
>>>>  		migrate->dst[migrate->npages] = 0;
>>>> @@ -54,57 +72,151 @@ static int migrate_vma_collect_hole(unsigned long start,
>>>>  	return 0;
>>>>  }
>>>>  
>>>> -static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>> -				   unsigned long start,
>>>> -				   unsigned long end,
>>>> -				   struct mm_walk *walk)
>>>> +/**
>>>> + * migrate_vma_collect_huge_pmd - collect THP pages without splitting the
>>>> + * folio for device private pages.
>>>> + * @pmdp: pointer to pmd entry
>>>> + * @start: start address of the range for migration
>>>> + * @end: end address of the range for migration
>>>> + * @walk: mm_walk callback structure
>>>> + *
>>>> + * Collect the huge pmd entry at @pmdp for migration and set the
>>>> + * MIGRATE_PFN_COMPOUND flag in the migrate src entry to indicate that
>>>> + * migration will occur at HPAGE_PMD granularity
>>>> + */
>>>> +static int migrate_vma_collect_huge_pmd(pmd_t *pmdp, unsigned long start,
>>>> +					unsigned long end, struct mm_walk *walk,
>>>> +					struct folio *fault_folio)
>>>>  {
>>>> +	struct mm_struct *mm = walk->mm;
>>>> +	struct folio *folio;
>>>>  	struct migrate_vma *migrate = walk->private;
>>>> -	struct folio *fault_folio = migrate->fault_page ?
>>>> -		page_folio(migrate->fault_page) : NULL;
>>>> -	struct vm_area_struct *vma = walk->vma;
>>>> -	struct mm_struct *mm = vma->vm_mm;
>>>> -	unsigned long addr = start, unmapped = 0;
>>>>  	spinlock_t *ptl;
>>>> -	pte_t *ptep;
>>>> +	swp_entry_t entry;
>>>> +	int ret;
>>>> +	unsigned long write = 0;
>>>>  
>>>> -again:
>>>> -	if (pmd_none(*pmdp))
>>>> +	ptl = pmd_lock(mm, pmdp);
>>>> +	if (pmd_none(*pmdp)) {
>>>> +		spin_unlock(ptl);
>>>>  		return migrate_vma_collect_hole(start, end, -1, walk);
>>>> +	}
>>>>  
>>>>  	if (pmd_trans_huge(*pmdp)) {
>>>> -		struct folio *folio;
>>>> -
>>>> -		ptl = pmd_lock(mm, pmdp);
>>>> -		if (unlikely(!pmd_trans_huge(*pmdp))) {
>>>> +		if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)) {
>>>>  			spin_unlock(ptl);
>>>> -			goto again;
>>>> +			return migrate_vma_collect_skip(start, end, walk);
>>>>  		}
>>>>  
>>>>  		folio = pmd_folio(*pmdp);
>>>>  		if (is_huge_zero_folio(folio)) {
>>>>  			spin_unlock(ptl);
>>>> -			split_huge_pmd(vma, pmdp, addr);
>>>> -		} else {
>>>> -			int ret;
>>>> +			return migrate_vma_collect_hole(start, end, -1, walk);
>>>> +		}
>>>> +		if (pmd_write(*pmdp))
>>>> +			write = MIGRATE_PFN_WRITE;
>>>> +	} else if (!pmd_present(*pmdp)) {
>>>> +		entry = pmd_to_swp_entry(*pmdp);
>>>> +		folio = pfn_swap_entry_folio(entry);
>>>> +
>>>> +		if (!is_device_private_entry(entry) ||
>>>> +			!(migrate->flags & MIGRATE_VMA_SELECT_DEVICE_PRIVATE) ||
>>>> +			(folio->pgmap->owner != migrate->pgmap_owner)) {
>>>> +			spin_unlock(ptl);
>>>> +			return migrate_vma_collect_skip(start, end, walk);
>>>> +		}
>>>>  
>>>> -			folio_get(folio);
>>>> +		if (is_migration_entry(entry)) {
>>>> +			migration_entry_wait_on_locked(entry, ptl);
>>>>  			spin_unlock(ptl);
>>>> -			/* FIXME: we don't expect THP for fault_folio */
>>>> -			if (WARN_ON_ONCE(fault_folio == folio))
>>>> -				return migrate_vma_collect_skip(start, end,
>>>> -								walk);
>>>> -			if (unlikely(!folio_trylock(folio)))
>>>> -				return migrate_vma_collect_skip(start, end,
>>>> -								walk);
>>>> -			ret = split_folio(folio);
>>>> -			if (fault_folio != folio)
>>>> -				folio_unlock(folio);
>>>> -			folio_put(folio);
>>>> -			if (ret)
>>>> -				return migrate_vma_collect_skip(start, end,
>>>> -								walk);
>>>> +			return -EAGAIN;
>>>>  		}
>>>> +
>>>> +		if (is_writable_device_private_entry(entry))
>>>> +			write = MIGRATE_PFN_WRITE;
>>>> +	} else {
>>>> +		spin_unlock(ptl);
>>>> +		return -EAGAIN;
>>>> +	}
>>>> +
>>>> +	folio_get(folio);
>>>> +	if (folio != fault_folio && unlikely(!folio_trylock(folio))) {
>>>> +		spin_unlock(ptl);
>>>> +		folio_put(folio);
>>>> +		return migrate_vma_collect_skip(start, end, walk);
>>>> +	}
>>>> +
>>>> +	if (thp_migration_supported() &&
>>>> +		(migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
>>>> +		(IS_ALIGNED(start, HPAGE_PMD_SIZE) &&
>>>> +		 IS_ALIGNED(end, HPAGE_PMD_SIZE))) {
>>>> +
>>>> +		struct page_vma_mapped_walk pvmw = {
>>>> +			.ptl = ptl,
>>>> +			.address = start,
>>>> +			.pmd = pmdp,
>>>> +			.vma = walk->vma,
>>>> +		};
>>>> +
>>>> +		unsigned long pfn = page_to_pfn(folio_page(folio, 0));
>>>> +
>>>> +		migrate->src[migrate->npages] = migrate_pfn(pfn) | write
>>>> +						| MIGRATE_PFN_MIGRATE
>>>> +						| MIGRATE_PFN_COMPOUND;
>>>> +		migrate->dst[migrate->npages++] = 0;
>>>> +		migrate->cpages++;
>>>> +		ret = set_pmd_migration_entry(&pvmw, folio_page(folio, 0));
>>>> +		if (ret) {
>>>> +			migrate->npages--;
>>>> +			migrate->cpages--;
>>>> +			migrate->src[migrate->npages] = 0;
>>>> +			migrate->dst[migrate->npages] = 0;
>>>> +			goto fallback;
>>>> +		}
>>>> +		migrate_vma_collect_skip(start + PAGE_SIZE, end, walk);
>>>> +		spin_unlock(ptl);
>>>> +		return 0;
>>>> +	}
>>>> +
>>>> +fallback:
>>>> +	spin_unlock(ptl);
>>>> +	if (!folio_test_large(folio))
>>>> +		goto done;
>>>> +	ret = split_folio(folio);
>>>> +	if (fault_folio != folio)
>>>> +		folio_unlock(folio);
>>>> +	folio_put(folio);
>>>> +	if (ret)
>>>> +		return migrate_vma_collect_skip(start, end, walk);
>>>> +	if (pmd_none(pmdp_get_lockless(pmdp)))
>>>> +		return migrate_vma_collect_hole(start, end, -1, walk);
>>>> +
>>>> +done:
>>>> +	return -ENOENT;
>>>> +}
>>>> +
>>>> +static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>> +				   unsigned long start,
>>>> +				   unsigned long end,
>>>> +				   struct mm_walk *walk)
>>>> +{
>>>> +	struct migrate_vma *migrate = walk->private;
>>>> +	struct vm_area_struct *vma = walk->vma;
>>>> +	struct mm_struct *mm = vma->vm_mm;
>>>> +	unsigned long addr = start, unmapped = 0;
>>>> +	spinlock_t *ptl;
>>>> +	struct folio *fault_folio = migrate->fault_page ?
>>>> +		page_folio(migrate->fault_page) : NULL;
>>>> +	pte_t *ptep;
>>>> +
>>>> +again:
>>>> +	if (pmd_trans_huge(*pmdp) || !pmd_present(*pmdp)) {
>>>> +		int ret = migrate_vma_collect_huge_pmd(pmdp, start, end, walk, fault_folio);
>>>> +
>>>> +		if (ret == -EAGAIN)
>>>> +			goto again;
>>>> +		if (ret == 0)
>>>> +			return 0;
>>>>  	}
>>>>  
>>>>  	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>>>> @@ -222,8 +334,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>  			mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
>>>>  		}
>>>>  
>>>> -		/* FIXME support THP */
>>>> -		if (!page || !page->mapping || PageTransCompound(page)) {
>>>> +		if (!page || !page->mapping) {
>>>>  			mpfn = 0;
>>>>  			goto next;
>>>>  		}
>>>> @@ -394,14 +505,6 @@ static bool migrate_vma_check_page(struct page *page, struct page *fault_page)
>>>>  	 */
>>>>  	int extra = 1 + (page == fault_page);
>>>>  
>>>> -	/*
>>>> -	 * FIXME support THP (transparent huge page), it is bit more complex to
>>>> -	 * check them than regular pages, because they can be mapped with a pmd
>>>> -	 * or with a pte (split pte mapping).
>>>> -	 */
>>>> -	if (folio_test_large(folio))
>>>> -		return false;
>>>> -
>>> You cannot remove this check unless support normal mTHP folios migrate to device, 
>>> which I think this series doesn't do, but maybe should?
>>>
>> mTHP needs to be a follow-up series; there are comments in the cover letter. I had an
>> assert earlier to prevent other folio sizes, but I've removed it, and the interface
>> to drivers does not allow for mTHP device private folios.
>>
>>
> pte-mapped device private THPs with other sizes can also be created as a result of pmd and folio splits.
> You should handle them consistently in one place, not in different drivers.
> As pointed out by Matthew, there's also the problem with the fault_page if it is not split and just ignored.
> 

I've not run into this in my testing; let me try with more mTHP sizes enabled. I'll wait for Matthew
to post his test case, or any results and issues seen.

Balbir Singh



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [v3 03/11] mm/migrate_device: THP migration of zone device pages
  2025-08-14 22:51         ` Balbir Singh
@ 2025-08-15  0:04           ` Matthew Brost
  2025-08-15 12:09             ` Balbir Singh
  2025-08-21 10:24             ` Balbir Singh
  0 siblings, 2 replies; 36+ messages in thread
From: Matthew Brost @ 2025-08-15  0:04 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Mika Penttilä, dri-devel, linux-mm, linux-kernel,
	Andrew Morton, David Hildenbrand, Zi Yan, Joshua Hahn, Rakie Kim,
	Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
	Oscar Salvador, Lorenzo Stoakes, Baolin Wang, Liam R. Howlett,
	Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter, Ralph Campbell,
	Francois Dugast

On Fri, Aug 15, 2025 at 08:51:21AM +1000, Balbir Singh wrote:
> On 8/13/25 10:07, Mika Penttilä wrote:
> > 
> > On 8/13/25 02:36, Balbir Singh wrote:
> > 
> >> On 8/12/25 15:35, Mika Penttilä wrote:
> >>> Hi,
> >>>
> >>> On 8/12/25 05:40, Balbir Singh wrote:
> >>>
> >>>> MIGRATE_VMA_SELECT_COMPOUND will be used to select THP pages during
> >>>> migrate_vma_setup() and MIGRATE_PFN_COMPOUND will make migrating
> >>>> device pages as compound pages during device pfn migration.
> >>>>
> >>>> migrate_device code paths go through the collect, setup
> >>>> and finalize phases of migration.
> >>>>
> >>>> The entries in src and dst arrays passed to these functions still
> >>>> remain at a PAGE_SIZE granularity. When a compound page is passed,
> >>>> the first entry has the PFN along with MIGRATE_PFN_COMPOUND
> >>>> and other flags set (MIGRATE_PFN_MIGRATE, MIGRATE_PFN_VALID), the
> >>>> remaining entries (HPAGE_PMD_NR - 1) are filled with 0's. This
> >>>> representation allows for the compound page to be split into smaller
> >>>> page sizes.
> >>>>
> >>>> migrate_vma_collect_hole(), migrate_vma_collect_pmd() are now THP
> >>>> page aware. Two new helper functions migrate_vma_collect_huge_pmd()
> >>>> and migrate_vma_insert_huge_pmd_page() have been added.
> >>>>
> >>>> migrate_vma_collect_huge_pmd() can collect THP pages, but if for
> >>>> some reason this fails, there is fallback support to split the folio
> >>>> and migrate it.
> >>>>
> >>>> migrate_vma_insert_huge_pmd_page() closely follows the logic of
> >>>> migrate_vma_insert_page()
> >>>>
> >>>> Support for splitting pages as needed for migration will follow in
> >>>> later patches in this series.
> >>>>
> >>>> Cc: Andrew Morton <akpm@linux-foundation.org>
> >>>> Cc: David Hildenbrand <david@redhat.com>
> >>>> Cc: Zi Yan <ziy@nvidia.com>
> >>>> Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
> >>>> Cc: Rakie Kim <rakie.kim@sk.com>
> >>>> Cc: Byungchul Park <byungchul@sk.com>
> >>>> Cc: Gregory Price <gourry@gourry.net>
> >>>> Cc: Ying Huang <ying.huang@linux.alibaba.com>
> >>>> Cc: Alistair Popple <apopple@nvidia.com>
> >>>> Cc: Oscar Salvador <osalvador@suse.de>
> >>>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> >>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> >>>> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> >>>> Cc: Nico Pache <npache@redhat.com>
> >>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
> >>>> Cc: Dev Jain <dev.jain@arm.com>
> >>>> Cc: Barry Song <baohua@kernel.org>
> >>>> Cc: Lyude Paul <lyude@redhat.com>
> >>>> Cc: Danilo Krummrich <dakr@kernel.org>
> >>>> Cc: David Airlie <airlied@gmail.com>
> >>>> Cc: Simona Vetter <simona@ffwll.ch>
> >>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
> >>>> Cc: Mika Penttilä <mpenttil@redhat.com>
> >>>> Cc: Matthew Brost <matthew.brost@intel.com>
> >>>> Cc: Francois Dugast <francois.dugast@intel.com>
> >>>>
> >>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> >>>> ---
> >>>>  include/linux/migrate.h |   2 +
> >>>>  mm/migrate_device.c     | 457 ++++++++++++++++++++++++++++++++++------
> >>>>  2 files changed, 396 insertions(+), 63 deletions(-)
> >>>>
> >>>> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> >>>> index acadd41e0b5c..d9cef0819f91 100644
> >>>> --- a/include/linux/migrate.h
> >>>> +++ b/include/linux/migrate.h
> >>>> @@ -129,6 +129,7 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
> >>>>  #define MIGRATE_PFN_VALID	(1UL << 0)
> >>>>  #define MIGRATE_PFN_MIGRATE	(1UL << 1)
> >>>>  #define MIGRATE_PFN_WRITE	(1UL << 3)
> >>>> +#define MIGRATE_PFN_COMPOUND	(1UL << 4)
> >>>>  #define MIGRATE_PFN_SHIFT	6
> >>>>  
> >>>>  static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
> >>>> @@ -147,6 +148,7 @@ enum migrate_vma_direction {
> >>>>  	MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
> >>>>  	MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
> >>>>  	MIGRATE_VMA_SELECT_DEVICE_COHERENT = 1 << 2,
> >>>> +	MIGRATE_VMA_SELECT_COMPOUND = 1 << 3,
> >>>>  };
> >>>>  
> >>>>  struct migrate_vma {
> >>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> >>>> index 0ed337f94fcd..6621bba62710 100644
> >>>> --- a/mm/migrate_device.c
> >>>> +++ b/mm/migrate_device.c
> >>>> @@ -14,6 +14,7 @@
> >>>>  #include <linux/pagewalk.h>
> >>>>  #include <linux/rmap.h>
> >>>>  #include <linux/swapops.h>
> >>>> +#include <asm/pgalloc.h>
> >>>>  #include <asm/tlbflush.h>
> >>>>  #include "internal.h"
> >>>>  
> >>>> @@ -44,6 +45,23 @@ static int migrate_vma_collect_hole(unsigned long start,
> >>>>  	if (!vma_is_anonymous(walk->vma))
> >>>>  		return migrate_vma_collect_skip(start, end, walk);
> >>>>  
> >>>> +	if (thp_migration_supported() &&
> >>>> +		(migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
> >>>> +		(IS_ALIGNED(start, HPAGE_PMD_SIZE) &&
> >>>> +		 IS_ALIGNED(end, HPAGE_PMD_SIZE))) {
> >>>> +		migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE |
> >>>> +						MIGRATE_PFN_COMPOUND;
> >>>> +		migrate->dst[migrate->npages] = 0;
> >>>> +		migrate->npages++;
> >>>> +		migrate->cpages++;
> >>>> +
> >>>> +		/*
> >>>> +		 * Collect the remaining entries as holes, in case we
> >>>> +		 * need to split later
> >>>> +		 */
> >>>> +		return migrate_vma_collect_skip(start + PAGE_SIZE, end, walk);
> >>>> +	}
> >>>> +
> >>>>  	for (addr = start; addr < end; addr += PAGE_SIZE) {
> >>>>  		migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE;
> >>>>  		migrate->dst[migrate->npages] = 0;
> >>>> @@ -54,57 +72,151 @@ static int migrate_vma_collect_hole(unsigned long start,
> >>>>  	return 0;
> >>>>  }
> >>>>  
> >>>> -static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >>>> -				   unsigned long start,
> >>>> -				   unsigned long end,
> >>>> -				   struct mm_walk *walk)
> >>>> +/**
> >>>> + * migrate_vma_collect_huge_pmd - collect THP pages without splitting the
> >>>> + * folio for device private pages.
> >>>> + * @pmdp: pointer to pmd entry
> >>>> + * @start: start address of the range for migration
> >>>> + * @end: end address of the range for migration
> >>>> + * @walk: mm_walk callback structure
> >>>> + *
> >>>> + * Collect the huge pmd entry at @pmdp for migration and set the
> >>>> + * MIGRATE_PFN_COMPOUND flag in the migrate src entry to indicate that
> >>>> + * migration will occur at HPAGE_PMD granularity
> >>>> + */
> >>>> +static int migrate_vma_collect_huge_pmd(pmd_t *pmdp, unsigned long start,
> >>>> +					unsigned long end, struct mm_walk *walk,
> >>>> +					struct folio *fault_folio)
> >>>>  {
> >>>> +	struct mm_struct *mm = walk->mm;
> >>>> +	struct folio *folio;
> >>>>  	struct migrate_vma *migrate = walk->private;
> >>>> -	struct folio *fault_folio = migrate->fault_page ?
> >>>> -		page_folio(migrate->fault_page) : NULL;
> >>>> -	struct vm_area_struct *vma = walk->vma;
> >>>> -	struct mm_struct *mm = vma->vm_mm;
> >>>> -	unsigned long addr = start, unmapped = 0;
> >>>>  	spinlock_t *ptl;
> >>>> -	pte_t *ptep;
> >>>> +	swp_entry_t entry;
> >>>> +	int ret;
> >>>> +	unsigned long write = 0;
> >>>>  
> >>>> -again:
> >>>> -	if (pmd_none(*pmdp))
> >>>> +	ptl = pmd_lock(mm, pmdp);
> >>>> +	if (pmd_none(*pmdp)) {
> >>>> +		spin_unlock(ptl);
> >>>>  		return migrate_vma_collect_hole(start, end, -1, walk);
> >>>> +	}
> >>>>  
> >>>>  	if (pmd_trans_huge(*pmdp)) {
> >>>> -		struct folio *folio;
> >>>> -
> >>>> -		ptl = pmd_lock(mm, pmdp);
> >>>> -		if (unlikely(!pmd_trans_huge(*pmdp))) {
> >>>> +		if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)) {
> >>>>  			spin_unlock(ptl);
> >>>> -			goto again;
> >>>> +			return migrate_vma_collect_skip(start, end, walk);
> >>>>  		}
> >>>>  
> >>>>  		folio = pmd_folio(*pmdp);
> >>>>  		if (is_huge_zero_folio(folio)) {
> >>>>  			spin_unlock(ptl);
> >>>> -			split_huge_pmd(vma, pmdp, addr);
> >>>> -		} else {
> >>>> -			int ret;
> >>>> +			return migrate_vma_collect_hole(start, end, -1, walk);
> >>>> +		}
> >>>> +		if (pmd_write(*pmdp))
> >>>> +			write = MIGRATE_PFN_WRITE;
> >>>> +	} else if (!pmd_present(*pmdp)) {
> >>>> +		entry = pmd_to_swp_entry(*pmdp);
> >>>> +		folio = pfn_swap_entry_folio(entry);
> >>>> +
> >>>> +		if (!is_device_private_entry(entry) ||
> >>>> +			!(migrate->flags & MIGRATE_VMA_SELECT_DEVICE_PRIVATE) ||
> >>>> +			(folio->pgmap->owner != migrate->pgmap_owner)) {
> >>>> +			spin_unlock(ptl);
> >>>> +			return migrate_vma_collect_skip(start, end, walk);
> >>>> +		}
> >>>>  
> >>>> -			folio_get(folio);
> >>>> +		if (is_migration_entry(entry)) {
> >>>> +			migration_entry_wait_on_locked(entry, ptl);
> >>>>  			spin_unlock(ptl);
> >>>> -			/* FIXME: we don't expect THP for fault_folio */
> >>>> -			if (WARN_ON_ONCE(fault_folio == folio))
> >>>> -				return migrate_vma_collect_skip(start, end,
> >>>> -								walk);
> >>>> -			if (unlikely(!folio_trylock(folio)))
> >>>> -				return migrate_vma_collect_skip(start, end,
> >>>> -								walk);
> >>>> -			ret = split_folio(folio);
> >>>> -			if (fault_folio != folio)
> >>>> -				folio_unlock(folio);
> >>>> -			folio_put(folio);
> >>>> -			if (ret)
> >>>> -				return migrate_vma_collect_skip(start, end,
> >>>> -								walk);
> >>>> +			return -EAGAIN;
> >>>>  		}
> >>>> +
> >>>> +		if (is_writable_device_private_entry(entry))
> >>>> +			write = MIGRATE_PFN_WRITE;
> >>>> +	} else {
> >>>> +		spin_unlock(ptl);
> >>>> +		return -EAGAIN;
> >>>> +	}
> >>>> +
> >>>> +	folio_get(folio);
> >>>> +	if (folio != fault_folio && unlikely(!folio_trylock(folio))) {
> >>>> +		spin_unlock(ptl);
> >>>> +		folio_put(folio);
> >>>> +		return migrate_vma_collect_skip(start, end, walk);
> >>>> +	}
> >>>> +
> >>>> +	if (thp_migration_supported() &&
> >>>> +		(migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
> >>>> +		(IS_ALIGNED(start, HPAGE_PMD_SIZE) &&
> >>>> +		 IS_ALIGNED(end, HPAGE_PMD_SIZE))) {
> >>>> +
> >>>> +		struct page_vma_mapped_walk pvmw = {
> >>>> +			.ptl = ptl,
> >>>> +			.address = start,
> >>>> +			.pmd = pmdp,
> >>>> +			.vma = walk->vma,
> >>>> +		};
> >>>> +
> >>>> +		unsigned long pfn = page_to_pfn(folio_page(folio, 0));
> >>>> +
> >>>> +		migrate->src[migrate->npages] = migrate_pfn(pfn) | write
> >>>> +						| MIGRATE_PFN_MIGRATE
> >>>> +						| MIGRATE_PFN_COMPOUND;
> >>>> +		migrate->dst[migrate->npages++] = 0;
> >>>> +		migrate->cpages++;
> >>>> +		ret = set_pmd_migration_entry(&pvmw, folio_page(folio, 0));
> >>>> +		if (ret) {
> >>>> +			migrate->npages--;
> >>>> +			migrate->cpages--;
> >>>> +			migrate->src[migrate->npages] = 0;
> >>>> +			migrate->dst[migrate->npages] = 0;
> >>>> +			goto fallback;
> >>>> +		}
> >>>> +		migrate_vma_collect_skip(start + PAGE_SIZE, end, walk);
> >>>> +		spin_unlock(ptl);
> >>>> +		return 0;
> >>>> +	}
> >>>> +
> >>>> +fallback:
> >>>> +	spin_unlock(ptl);
> >>>> +	if (!folio_test_large(folio))
> >>>> +		goto done;
> >>>> +	ret = split_folio(folio);
> >>>> +	if (fault_folio != folio)
> >>>> +		folio_unlock(folio);
> >>>> +	folio_put(folio);
> >>>> +	if (ret)
> >>>> +		return migrate_vma_collect_skip(start, end, walk);
> >>>> +	if (pmd_none(pmdp_get_lockless(pmdp)))
> >>>> +		return migrate_vma_collect_hole(start, end, -1, walk);
> >>>> +
> >>>> +done:
> >>>> +	return -ENOENT;
> >>>> +}
> >>>> +
> >>>> +static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >>>> +				   unsigned long start,
> >>>> +				   unsigned long end,
> >>>> +				   struct mm_walk *walk)
> >>>> +{
> >>>> +	struct migrate_vma *migrate = walk->private;
> >>>> +	struct vm_area_struct *vma = walk->vma;
> >>>> +	struct mm_struct *mm = vma->vm_mm;
> >>>> +	unsigned long addr = start, unmapped = 0;
> >>>> +	spinlock_t *ptl;
> >>>> +	struct folio *fault_folio = migrate->fault_page ?
> >>>> +		page_folio(migrate->fault_page) : NULL;
> >>>> +	pte_t *ptep;
> >>>> +
> >>>> +again:
> >>>> +	if (pmd_trans_huge(*pmdp) || !pmd_present(*pmdp)) {
> >>>> +		int ret = migrate_vma_collect_huge_pmd(pmdp, start, end, walk, fault_folio);
> >>>> +
> >>>> +		if (ret == -EAGAIN)
> >>>> +			goto again;
> >>>> +		if (ret == 0)
> >>>> +			return 0;
> >>>>  	}
> >>>>  
> >>>>  	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> >>>> @@ -222,8 +334,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >>>>  			mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
> >>>>  		}
> >>>>  
> >>>> -		/* FIXME support THP */
> >>>> -		if (!page || !page->mapping || PageTransCompound(page)) {
> >>>> +		if (!page || !page->mapping) {
> >>>>  			mpfn = 0;
> >>>>  			goto next;
> >>>>  		}
> >>>> @@ -394,14 +505,6 @@ static bool migrate_vma_check_page(struct page *page, struct page *fault_page)
> >>>>  	 */
> >>>>  	int extra = 1 + (page == fault_page);
> >>>>  
> >>>> -	/*
> >>>> -	 * FIXME support THP (transparent huge page), it is bit more complex to
> >>>> -	 * check them than regular pages, because they can be mapped with a pmd
> >>>> -	 * or with a pte (split pte mapping).
> >>>> -	 */
> >>>> -	if (folio_test_large(folio))
> >>>> -		return false;
> >>>> -
> >>> You cannot remove this check unless support normal mTHP folios migrate to device, 
> >>> which I think this series doesn't do, but maybe should?
> >>>
> >> mTHP needs to be a follow-up series; there are comments in the cover letter. I had an
> >> assert earlier to prevent other folio sizes, but I've removed it, and the interface
> >> to drivers does not allow for mTHP device private folios.
> >>
> >>
> > pte-mapped device private THPs with other sizes can also be created as a result of pmd and folio splits.
> > You should handle them consistently in one place, not in different drivers.
> > As pointed out by Matthew, there's also the problem with the fault_page if it is not split and just ignored.
> > 
> 
> I've not run into this in my testing; let me try with more mTHP sizes enabled. I'll wait for Matthew
> to post his test case, or any results and issues seen.
> 

I’ve hit this. In the code I shared privately, I split THPs in the
page-collection path. You omitted that in v2 and v3; I believe you’ll
need those changes. The code I'm referring to had the below comment.

 416         /*
 417          * XXX: No clean way to support higher-order folios that don't match PMD
 418          * boundaries for now — split them instead. Once mTHP support lands, add
 419          * proper support for this case.
 420          *
 421          * The test, which exposed this as problematic, remapped (memremap) a
 422          * large folio to an unaligned address, resulting in the folio being
 423          * found in the middle of the PTEs. The requested number of pages was
 424          * less than the folio size. Likely to be handled gracefully by upper
 425          * layers eventually, but not yet.
 426          */

I triggered it by doing some odd mremap operations, which caused the CPU
page-fault handler to spin indefinitely iirc. In that case, a large device
folio had been moved into the middle of a PMD.

Upstream could see the same problem if the device fault handler enforces
a must-migrate-to-device policy and mremap moves a large CPU folio into
the middle of a PMD.

I’m in the middle of other work; when I circle back, I’ll try to create
a selftest to reproduce this. My current test is a fairly convoluted IGT
with a bunch of threads doing remap nonsense, but I’ll try to distill it
into a concise selftest.
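
Very roughly, the shape of the reproducer is something like the below. This is an illustrative
sketch only, not the IGT; it assumes THP is enabled for anonymous memory and skips error handling.
The key part is the +4K destination offset, which leaves the 2M folio straddling a PMD boundary.

#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>

#define SZ_2M	(2UL << 20)
#define SZ_4K	4096UL

int main(void)
{
	/* Reserve a large window and carve out a 2M-aligned base. */
	char *raw = mmap(NULL, 10 * SZ_2M, PROT_NONE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *base = (char *)(((unsigned long)raw + SZ_2M - 1) & ~(SZ_2M - 1));

	/* Fault in a PMD-aligned, PMD-sized THP. */
	char *src = mmap(base, SZ_2M, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
	madvise(src, SZ_2M, MADV_HUGEPAGE);
	memset(src, 0xaa, SZ_2M);

	/* Move it to a page-aligned but PMD-unaligned address, so the
	 * large folio now sits in the middle of a PMD range. */
	char *dst = base + 4 * SZ_2M + SZ_4K;
	mremap(src, SZ_2M, SZ_2M, MREMAP_MAYMOVE | MREMAP_FIXED, dst);

	/* A device fault handler that insists on migrating this range to
	 * the device now finds the folio straddling a PMD boundary. */
	memset(dst, 0x55, SZ_2M);
	return 0;
}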

Matt

> Balbir Singh
> 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [v3 03/11] mm/migrate_device: THP migration of zone device pages
  2025-08-15  0:04           ` Matthew Brost
@ 2025-08-15 12:09             ` Balbir Singh
  2025-08-21 10:24             ` Balbir Singh
  1 sibling, 0 replies; 36+ messages in thread
From: Balbir Singh @ 2025-08-15 12:09 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Mika Penttilä, dri-devel, linux-mm, linux-kernel,
	Andrew Morton, David Hildenbrand, Zi Yan, Joshua Hahn, Rakie Kim,
	Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
	Oscar Salvador, Lorenzo Stoakes, Baolin Wang, Liam R. Howlett,
	Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter, Ralph Campbell,
	Francois Dugast

On 8/15/25 10:04, Matthew Brost wrote:
> On Fri, Aug 15, 2025 at 08:51:21AM +1000, Balbir Singh wrote:
>> On 8/13/25 10:07, Mika Penttilä wrote:
>>>
>>> On 8/13/25 02:36, Balbir Singh wrote:
>>>
>>>> On 8/12/25 15:35, Mika Penttilä wrote:
>>>>> Hi,
>>>>>
>>>>> On 8/12/25 05:40, Balbir Singh wrote:
>>>>>
>>>>>> MIGRATE_VMA_SELECT_COMPOUND will be used to select THP pages during
>>>>>> migrate_vma_setup() and MIGRATE_PFN_COMPOUND will make migrating
>>>>>> device pages as compound pages during device pfn migration.
>>>>>>
>>>>>> migrate_device code paths go through the collect, setup
>>>>>> and finalize phases of migration.
>>>>>>
>>>>>> The entries in src and dst arrays passed to these functions still
>>>>>> remain at a PAGE_SIZE granularity. When a compound page is passed,
>>>>>> the first entry has the PFN along with MIGRATE_PFN_COMPOUND
>>>>>> and other flags set (MIGRATE_PFN_MIGRATE, MIGRATE_PFN_VALID), the
>>>>>> remaining entries (HPAGE_PMD_NR - 1) are filled with 0's. This
>>>>>> representation allows for the compound page to be split into smaller
>>>>>> page sizes.
>>>>>>
>>>>>> migrate_vma_collect_hole(), migrate_vma_collect_pmd() are now THP
>>>>>> page aware. Two new helper functions migrate_vma_collect_huge_pmd()
>>>>>> and migrate_vma_insert_huge_pmd_page() have been added.
>>>>>>
>>>>>> migrate_vma_collect_huge_pmd() can collect THP pages, but if for
>>>>>> some reason this fails, there is fallback support to split the folio
>>>>>> and migrate it.
>>>>>>
>>>>>> migrate_vma_insert_huge_pmd_page() closely follows the logic of
>>>>>> migrate_vma_insert_page()
>>>>>>
>>>>>> Support for splitting pages as needed for migration will follow in
>>>>>> later patches in this series.
>>>>>>
>>>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>> Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
>>>>>> Cc: Rakie Kim <rakie.kim@sk.com>
>>>>>> Cc: Byungchul Park <byungchul@sk.com>
>>>>>> Cc: Gregory Price <gourry@gourry.net>
>>>>>> Cc: Ying Huang <ying.huang@linux.alibaba.com>
>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>> Cc: Oscar Salvador <osalvador@suse.de>
>>>>>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
>>>>>> Cc: Nico Pache <npache@redhat.com>
>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>> Cc: Dev Jain <dev.jain@arm.com>
>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>
>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>> ---
>>>>>>  include/linux/migrate.h |   2 +
>>>>>>  mm/migrate_device.c     | 457 ++++++++++++++++++++++++++++++++++------
>>>>>>  2 files changed, 396 insertions(+), 63 deletions(-)
>>>>>>
>>>>>> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
>>>>>> index acadd41e0b5c..d9cef0819f91 100644
>>>>>> --- a/include/linux/migrate.h
>>>>>> +++ b/include/linux/migrate.h
>>>>>> @@ -129,6 +129,7 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
>>>>>>  #define MIGRATE_PFN_VALID	(1UL << 0)
>>>>>>  #define MIGRATE_PFN_MIGRATE	(1UL << 1)
>>>>>>  #define MIGRATE_PFN_WRITE	(1UL << 3)
>>>>>> +#define MIGRATE_PFN_COMPOUND	(1UL << 4)
>>>>>>  #define MIGRATE_PFN_SHIFT	6
>>>>>>  
>>>>>>  static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
>>>>>> @@ -147,6 +148,7 @@ enum migrate_vma_direction {
>>>>>>  	MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
>>>>>>  	MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
>>>>>>  	MIGRATE_VMA_SELECT_DEVICE_COHERENT = 1 << 2,
>>>>>> +	MIGRATE_VMA_SELECT_COMPOUND = 1 << 3,
>>>>>>  };
>>>>>>  
>>>>>>  struct migrate_vma {
>>>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>>>>> index 0ed337f94fcd..6621bba62710 100644
>>>>>> --- a/mm/migrate_device.c
>>>>>> +++ b/mm/migrate_device.c
>>>>>> @@ -14,6 +14,7 @@
>>>>>>  #include <linux/pagewalk.h>
>>>>>>  #include <linux/rmap.h>
>>>>>>  #include <linux/swapops.h>
>>>>>> +#include <asm/pgalloc.h>
>>>>>>  #include <asm/tlbflush.h>
>>>>>>  #include "internal.h"
>>>>>>  
>>>>>> @@ -44,6 +45,23 @@ static int migrate_vma_collect_hole(unsigned long start,
>>>>>>  	if (!vma_is_anonymous(walk->vma))
>>>>>>  		return migrate_vma_collect_skip(start, end, walk);
>>>>>>  
>>>>>> +	if (thp_migration_supported() &&
>>>>>> +		(migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
>>>>>> +		(IS_ALIGNED(start, HPAGE_PMD_SIZE) &&
>>>>>> +		 IS_ALIGNED(end, HPAGE_PMD_SIZE))) {
>>>>>> +		migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE |
>>>>>> +						MIGRATE_PFN_COMPOUND;
>>>>>> +		migrate->dst[migrate->npages] = 0;
>>>>>> +		migrate->npages++;
>>>>>> +		migrate->cpages++;
>>>>>> +
>>>>>> +		/*
>>>>>> +		 * Collect the remaining entries as holes, in case we
>>>>>> +		 * need to split later
>>>>>> +		 */
>>>>>> +		return migrate_vma_collect_skip(start + PAGE_SIZE, end, walk);
>>>>>> +	}
>>>>>> +
>>>>>>  	for (addr = start; addr < end; addr += PAGE_SIZE) {
>>>>>>  		migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE;
>>>>>>  		migrate->dst[migrate->npages] = 0;
>>>>>> @@ -54,57 +72,151 @@ static int migrate_vma_collect_hole(unsigned long start,
>>>>>>  	return 0;
>>>>>>  }
>>>>>>  
>>>>>> -static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>> -				   unsigned long start,
>>>>>> -				   unsigned long end,
>>>>>> -				   struct mm_walk *walk)
>>>>>> +/**
>>>>>> + * migrate_vma_collect_huge_pmd - collect THP pages without splitting the
>>>>>> + * folio for device private pages.
>>>>>> + * @pmdp: pointer to pmd entry
>>>>>> + * @start: start address of the range for migration
>>>>>> + * @end: end address of the range for migration
>>>>>> + * @walk: mm_walk callback structure
>>>>>> + *
>>>>>> + * Collect the huge pmd entry at @pmdp for migration and set the
>>>>>> + * MIGRATE_PFN_COMPOUND flag in the migrate src entry to indicate that
>>>>>> + * migration will occur at HPAGE_PMD granularity
>>>>>> + */
>>>>>> +static int migrate_vma_collect_huge_pmd(pmd_t *pmdp, unsigned long start,
>>>>>> +					unsigned long end, struct mm_walk *walk,
>>>>>> +					struct folio *fault_folio)
>>>>>>  {
>>>>>> +	struct mm_struct *mm = walk->mm;
>>>>>> +	struct folio *folio;
>>>>>>  	struct migrate_vma *migrate = walk->private;
>>>>>> -	struct folio *fault_folio = migrate->fault_page ?
>>>>>> -		page_folio(migrate->fault_page) : NULL;
>>>>>> -	struct vm_area_struct *vma = walk->vma;
>>>>>> -	struct mm_struct *mm = vma->vm_mm;
>>>>>> -	unsigned long addr = start, unmapped = 0;
>>>>>>  	spinlock_t *ptl;
>>>>>> -	pte_t *ptep;
>>>>>> +	swp_entry_t entry;
>>>>>> +	int ret;
>>>>>> +	unsigned long write = 0;
>>>>>>  
>>>>>> -again:
>>>>>> -	if (pmd_none(*pmdp))
>>>>>> +	ptl = pmd_lock(mm, pmdp);
>>>>>> +	if (pmd_none(*pmdp)) {
>>>>>> +		spin_unlock(ptl);
>>>>>>  		return migrate_vma_collect_hole(start, end, -1, walk);
>>>>>> +	}
>>>>>>  
>>>>>>  	if (pmd_trans_huge(*pmdp)) {
>>>>>> -		struct folio *folio;
>>>>>> -
>>>>>> -		ptl = pmd_lock(mm, pmdp);
>>>>>> -		if (unlikely(!pmd_trans_huge(*pmdp))) {
>>>>>> +		if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)) {
>>>>>>  			spin_unlock(ptl);
>>>>>> -			goto again;
>>>>>> +			return migrate_vma_collect_skip(start, end, walk);
>>>>>>  		}
>>>>>>  
>>>>>>  		folio = pmd_folio(*pmdp);
>>>>>>  		if (is_huge_zero_folio(folio)) {
>>>>>>  			spin_unlock(ptl);
>>>>>> -			split_huge_pmd(vma, pmdp, addr);
>>>>>> -		} else {
>>>>>> -			int ret;
>>>>>> +			return migrate_vma_collect_hole(start, end, -1, walk);
>>>>>> +		}
>>>>>> +		if (pmd_write(*pmdp))
>>>>>> +			write = MIGRATE_PFN_WRITE;
>>>>>> +	} else if (!pmd_present(*pmdp)) {
>>>>>> +		entry = pmd_to_swp_entry(*pmdp);
>>>>>> +		folio = pfn_swap_entry_folio(entry);
>>>>>> +
>>>>>> +		if (!is_device_private_entry(entry) ||
>>>>>> +			!(migrate->flags & MIGRATE_VMA_SELECT_DEVICE_PRIVATE) ||
>>>>>> +			(folio->pgmap->owner != migrate->pgmap_owner)) {
>>>>>> +			spin_unlock(ptl);
>>>>>> +			return migrate_vma_collect_skip(start, end, walk);
>>>>>> +		}
>>>>>>  
>>>>>> -			folio_get(folio);
>>>>>> +		if (is_migration_entry(entry)) {
>>>>>> +			migration_entry_wait_on_locked(entry, ptl);
>>>>>>  			spin_unlock(ptl);
>>>>>> -			/* FIXME: we don't expect THP for fault_folio */
>>>>>> -			if (WARN_ON_ONCE(fault_folio == folio))
>>>>>> -				return migrate_vma_collect_skip(start, end,
>>>>>> -								walk);
>>>>>> -			if (unlikely(!folio_trylock(folio)))
>>>>>> -				return migrate_vma_collect_skip(start, end,
>>>>>> -								walk);
>>>>>> -			ret = split_folio(folio);
>>>>>> -			if (fault_folio != folio)
>>>>>> -				folio_unlock(folio);
>>>>>> -			folio_put(folio);
>>>>>> -			if (ret)
>>>>>> -				return migrate_vma_collect_skip(start, end,
>>>>>> -								walk);
>>>>>> +			return -EAGAIN;
>>>>>>  		}
>>>>>> +
>>>>>> +		if (is_writable_device_private_entry(entry))
>>>>>> +			write = MIGRATE_PFN_WRITE;
>>>>>> +	} else {
>>>>>> +		spin_unlock(ptl);
>>>>>> +		return -EAGAIN;
>>>>>> +	}
>>>>>> +
>>>>>> +	folio_get(folio);
>>>>>> +	if (folio != fault_folio && unlikely(!folio_trylock(folio))) {
>>>>>> +		spin_unlock(ptl);
>>>>>> +		folio_put(folio);
>>>>>> +		return migrate_vma_collect_skip(start, end, walk);
>>>>>> +	}
>>>>>> +
>>>>>> +	if (thp_migration_supported() &&
>>>>>> +		(migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
>>>>>> +		(IS_ALIGNED(start, HPAGE_PMD_SIZE) &&
>>>>>> +		 IS_ALIGNED(end, HPAGE_PMD_SIZE))) {
>>>>>> +
>>>>>> +		struct page_vma_mapped_walk pvmw = {
>>>>>> +			.ptl = ptl,
>>>>>> +			.address = start,
>>>>>> +			.pmd = pmdp,
>>>>>> +			.vma = walk->vma,
>>>>>> +		};
>>>>>> +
>>>>>> +		unsigned long pfn = page_to_pfn(folio_page(folio, 0));
>>>>>> +
>>>>>> +		migrate->src[migrate->npages] = migrate_pfn(pfn) | write
>>>>>> +						| MIGRATE_PFN_MIGRATE
>>>>>> +						| MIGRATE_PFN_COMPOUND;
>>>>>> +		migrate->dst[migrate->npages++] = 0;
>>>>>> +		migrate->cpages++;
>>>>>> +		ret = set_pmd_migration_entry(&pvmw, folio_page(folio, 0));
>>>>>> +		if (ret) {
>>>>>> +			migrate->npages--;
>>>>>> +			migrate->cpages--;
>>>>>> +			migrate->src[migrate->npages] = 0;
>>>>>> +			migrate->dst[migrate->npages] = 0;
>>>>>> +			goto fallback;
>>>>>> +		}
>>>>>> +		migrate_vma_collect_skip(start + PAGE_SIZE, end, walk);
>>>>>> +		spin_unlock(ptl);
>>>>>> +		return 0;
>>>>>> +	}
>>>>>> +
>>>>>> +fallback:
>>>>>> +	spin_unlock(ptl);
>>>>>> +	if (!folio_test_large(folio))
>>>>>> +		goto done;
>>>>>> +	ret = split_folio(folio);
>>>>>> +	if (fault_folio != folio)
>>>>>> +		folio_unlock(folio);
>>>>>> +	folio_put(folio);
>>>>>> +	if (ret)
>>>>>> +		return migrate_vma_collect_skip(start, end, walk);
>>>>>> +	if (pmd_none(pmdp_get_lockless(pmdp)))
>>>>>> +		return migrate_vma_collect_hole(start, end, -1, walk);
>>>>>> +
>>>>>> +done:
>>>>>> +	return -ENOENT;
>>>>>> +}
>>>>>> +
>>>>>> +static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>> +				   unsigned long start,
>>>>>> +				   unsigned long end,
>>>>>> +				   struct mm_walk *walk)
>>>>>> +{
>>>>>> +	struct migrate_vma *migrate = walk->private;
>>>>>> +	struct vm_area_struct *vma = walk->vma;
>>>>>> +	struct mm_struct *mm = vma->vm_mm;
>>>>>> +	unsigned long addr = start, unmapped = 0;
>>>>>> +	spinlock_t *ptl;
>>>>>> +	struct folio *fault_folio = migrate->fault_page ?
>>>>>> +		page_folio(migrate->fault_page) : NULL;
>>>>>> +	pte_t *ptep;
>>>>>> +
>>>>>> +again:
>>>>>> +	if (pmd_trans_huge(*pmdp) || !pmd_present(*pmdp)) {
>>>>>> +		int ret = migrate_vma_collect_huge_pmd(pmdp, start, end, walk, fault_folio);
>>>>>> +
>>>>>> +		if (ret == -EAGAIN)
>>>>>> +			goto again;
>>>>>> +		if (ret == 0)
>>>>>> +			return 0;
>>>>>>  	}
>>>>>>  
>>>>>>  	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>>>>>> @@ -222,8 +334,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>>  			mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
>>>>>>  		}
>>>>>>  
>>>>>> -		/* FIXME support THP */
>>>>>> -		if (!page || !page->mapping || PageTransCompound(page)) {
>>>>>> +		if (!page || !page->mapping) {
>>>>>>  			mpfn = 0;
>>>>>>  			goto next;
>>>>>>  		}
>>>>>> @@ -394,14 +505,6 @@ static bool migrate_vma_check_page(struct page *page, struct page *fault_page)
>>>>>>  	 */
>>>>>>  	int extra = 1 + (page == fault_page);
>>>>>>  
>>>>>> -	/*
>>>>>> -	 * FIXME support THP (transparent huge page), it is bit more complex to
>>>>>> -	 * check them than regular pages, because they can be mapped with a pmd
>>>>>> -	 * or with a pte (split pte mapping).
>>>>>> -	 */
>>>>>> -	if (folio_test_large(folio))
>>>>>> -		return false;
>>>>>> -
>>>>> You cannot remove this check unless support normal mTHP folios migrate to device, 
>>>>> which I think this series doesn't do, but maybe should?
>>>>>
>>>> mTHP needs to be a follow-up series; there are comments in the cover letter. I had an
>>>> assert earlier to prevent other folio sizes, but I've removed it, and the interface
>>>> to drivers does not allow for mTHP device private folios.
>>>>
>>>>
>>> pte-mapped device private THPs with other sizes can also be created as a result of pmd and folio splits.
>>> You should handle them consistently in one place, not in different drivers.
>>> As pointed out by Matthew, there's also the problem with the fault_page if it is not split and just ignored.
>>>
>>
>> I've not run into this in my testing; let me try with more mTHP sizes enabled. I'll wait for Matthew
>> to post his test case, or any results and issues seen.
>>
> 
> I’ve hit this. In the code I shared privately, I split THPs in the
> page-collection path. You omitted that in v2 and v3; I believe you’ll
> need those changes. The code I'm referring to had the below comment.
> 
>  416         /*
>  417          * XXX: No clean way to support higher-order folios that don't match PMD
>  418          * boundaries for now — split them instead. Once mTHP support lands, add
>  419          * proper support for this case.
>  420          *
>  421          * The test, which exposed this as problematic, remapped (memremap) a
>  422          * large folio to an unaligned address, resulting in the folio being
>  423          * found in the middle of the PTEs. The requested number of pages was
>  424          * less than the folio size. Likely to be handled gracefully by upper
>  425          * layers eventually, but not yet.
>  426          */
> 
> I triggered it by doing some odd mremap operations, which caused the CPU
> page-fault handler to spin indefinitely iirc. In that case, a large device
> folio had been moved into the middle of a PMD.
> 

That is interesting. In general, migrate_vma_pages() is protected by address alignment
checks and PMD checks; for the case you describe, even after the folio ends up
in the middle of the PMD, it should get split in migrate_vma_setup() if the PMD
is split.

> Upstream could see the same problem if the device fault handler enforces
> a must-migrate-to-device policy and mremap moves a large CPU folio into
> the middle of a PMD.
> 
> I’m in the middle of other work; when I circle back, I’ll try to create
> a selftest to reproduce this. My current test is a fairly convoluted IGT
> with a bunch of threads doing remap nonsense, but I’ll try to distill it
> into a concise selftest.
> 

Look forward to your tests and feedback based on the integration of this
series with your work.

Thanks,
Balbir


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [v3 03/11] mm/migrate_device: THP migration of zone device pages
  2025-08-15  0:04           ` Matthew Brost
  2025-08-15 12:09             ` Balbir Singh
@ 2025-08-21 10:24             ` Balbir Singh
  2025-08-28 23:14               ` Matthew Brost
  1 sibling, 1 reply; 36+ messages in thread
From: Balbir Singh @ 2025-08-21 10:24 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Mika Penttilä, dri-devel, linux-mm, linux-kernel,
	Andrew Morton, David Hildenbrand, Zi Yan, Joshua Hahn, Rakie Kim,
	Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
	Oscar Salvador, Lorenzo Stoakes, Baolin Wang, Liam R. Howlett,
	Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter, Ralph Campbell,
	Francois Dugast

On 8/15/25 10:04, Matthew Brost wrote:
> On Fri, Aug 15, 2025 at 08:51:21AM +1000, Balbir Singh wrote:
>> On 8/13/25 10:07, Mika Penttilä wrote:
>>>
>>> On 8/13/25 02:36, Balbir Singh wrote:
>>>
>>>> On 8/12/25 15:35, Mika Penttilä wrote:
>>>>> Hi,
>>>>>
>>>>> On 8/12/25 05:40, Balbir Singh wrote:
...

>> I've not run into this in my testing; let me try with more mTHP sizes enabled. I'll wait for Matthew
>> to post his test case, or any results and issues seen.
>>
> 
> I’ve hit this. In the code I shared privately, I split THPs in the
> page-collection path. You omitted that in v2 and v3; I believe you’ll
> need those changes. The code I'm referring to had the below comment.
> 
>  416         /*
>  417          * XXX: No clean way to support higher-order folios that don't match PMD
>  418          * boundaries for now — split them instead. Once mTHP support lands, add
>  419          * proper support for this case.
>  420          *
>  421          * The test, which exposed this as problematic, remapped (memremap) a
>  422          * large folio to an unaligned address, resulting in the folio being
>  423          * found in the middle of the PTEs. The requested number of pages was
>  424          * less than the folio size. Likely to be handled gracefully by upper
>  425          * layers eventually, but not yet.
>  426          */
> 
> I triggered it by doing some odd mremap operations, which caused the CPU
> page-fault handler to spin indefinitely iirc. In that case, a large device
> folio had been moved into the middle of a PMD.
> 
> Upstream could see the same problem if the device fault handler enforces
> a must-migrate-to-device policy and mremap moves a large CPU folio into
> the middle of a PMD.
> 
> I’m in the middle of other work; when I circle back, I’ll try to create
> a selftest to reproduce this. My current test is a fairly convoluted IGT
> with a bunch of threads doing remap nonsense, but I’ll try to distill it
> into a concise selftest.
> 

I ran into this while doing some testing as well; I fixed it in a manner similar
to split_folio() for partial unmaps. I will consolidate the folio splits into
a single helper and post it with v4.
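
Roughly along these lines -- a sketch only, the helper name and return codes are placeholders;
it just mirrors the existing fallback path in migrate_vma_collect_huge_pmd():

	/* Hypothetical helper: lock, split and release a large folio found
	 * during collection; the fault folio stays locked for the caller. */
	static int migrate_vma_split_folio(struct folio *folio,
					   struct folio *fault_folio)
	{
		int ret;

		folio_get(folio);
		if (folio != fault_folio && !folio_trylock(folio)) {
			folio_put(folio);
			return -EBUSY;
		}
		ret = split_folio(folio);
		if (folio != fault_folio)
			folio_unlock(folio);
		folio_put(folio);
		return ret;
	}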


Balbir Singh


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [v3 01/11] mm/zone_device: support large zone device private folios
  2025-08-12  2:40 ` [v3 01/11] mm/zone_device: support large zone device private folios Balbir Singh
@ 2025-08-26 14:22   ` David Hildenbrand
  0 siblings, 0 replies; 36+ messages in thread
From: David Hildenbrand @ 2025-08-26 14:22 UTC (permalink / raw)
  To: Balbir Singh, dri-devel, linux-mm, linux-kernel
  Cc: Andrew Morton, Zi Yan, Joshua Hahn, Rakie Kim, Byungchul Park,
	Gregory Price, Ying Huang, Alistair Popple, Oscar Salvador,
	Lorenzo Stoakes, Baolin Wang, Liam R. Howlett, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Ralph Campbell, Mika Penttilä,
	Matthew Brost, Francois Dugast

On 12.08.25 04:40, Balbir Singh wrote:
> Add routines to support allocation of large order zone device folios
> and helper functions for zone device folios, to check if a folio is
> device private and helpers for setting zone device data.
> 
> When large folios are used, the existing page_free() callback in
> pgmap is called when the folio is freed, this is true for both
> PAGE_SIZE and higher order pages.
> 
> Zone device private large folios do not support deferred split and
> scan like normal THP folios.

[...]


>   #else
>   static inline void *devm_memremap_pages(struct device *dev,
>   		struct dev_pagemap *pgmap)
> diff --git a/mm/memremap.c b/mm/memremap.c
> index b0ce0d8254bd..13e87dd743ad 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -427,20 +427,19 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
>   void free_zone_device_folio(struct folio *folio)
>   {
>   	struct dev_pagemap *pgmap = folio->pgmap;
> +	unsigned long nr = folio_nr_pages(folio);
> +	int i;

Not that it will currently matter much but

unsigned long i, nr = folio_nr_pages(folio);

might be more consistent

>   
>   	if (WARN_ON_ONCE(!pgmap))
>   		return;
>   
>   	mem_cgroup_uncharge(folio);
>   
> -	/*
> -	 * Note: we don't expect anonymous compound pages yet. Once supported
> -	 * and we could PTE-map them similar to THP, we'd have to clear
> -	 * PG_anon_exclusive on all tail pages.
> -	 */
>   	if (folio_test_anon(folio)) {
> -		VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> -		__ClearPageAnonExclusive(folio_page(folio, 0));
> +		for (i = 0; i < nr; i++)
> +			__ClearPageAnonExclusive(folio_page(folio, i));
> +	} else {
> +		VM_WARN_ON_ONCE(folio_test_large(folio));
>   	}
>   
>   	/*
> @@ -464,11 +463,15 @@ void free_zone_device_folio(struct folio *folio)
>   
>   	switch (pgmap->type) {
>   	case MEMORY_DEVICE_PRIVATE:

Why are you effectively dropping the

if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->page_free))

> +		percpu_ref_put_many(&folio->pgmap->ref, nr);
> +		pgmap->ops->page_free(&folio->page);


> +		folio->page.mapping = NULL;

Why are we adding this here? Does not seem large-folio specific.

> +		break;
>   	case MEMORY_DEVICE_COHERENT:
>   		if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->page_free))
>   			break;
> -		pgmap->ops->page_free(folio_page(folio, 0));
> -		put_dev_pagemap(pgmap);
> +		pgmap->ops->page_free(&folio->page);
> +		percpu_ref_put(&folio->pgmap->ref);

This looks like an independent change that does not belong in this patch.


Can't you just leave the code as is and simply convert percpu_ref_put
to percpu_ref_put_many()? What am I missing?
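
I.e., keep the shared path and only change how many references we drop -- something
like this (untested sketch):

	case MEMORY_DEVICE_PRIVATE:
	case MEMORY_DEVICE_COHERENT:
		if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->page_free))
			break;
		pgmap->ops->page_free(folio_page(folio, 0));
		percpu_ref_put_many(&pgmap->ref, nr);
		break;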

>   		break;
>   
>   	case MEMORY_DEVICE_GENERIC:
> @@ -491,14 +494,23 @@ void free_zone_device_folio(struct folio *folio)
>   	}
>   }
>   
> -void zone_device_page_init(struct page *page)
> +void zone_device_folio_init(struct folio *folio, unsigned int order)
>   {
> +	struct page *page = folio_page(folio, 0);
> +
> +	VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);

This compares an order against a number of pages, which is wrong.

In the context of [1], this should probably be

	VM_WARN_ON_ONCE(order > MAX_FOLIO_ORDER);

And before that series is in, it would be

	VM_WARN_ON_ONCE((1u << order) > MAX_FOLIO_NR_PAGES);

because we don't involve the buddy, so likely buddy limits do not apply.

[1] https://lore.kernel.org/all/20250821200701.1329277-1-david@redhat.com/

> +
>   	/*
>   	 * Drivers shouldn't be allocating pages after calling
>   	 * memunmap_pages().
>   	 */
> -	WARN_ON_ONCE(!percpu_ref_tryget_live(&page_pgmap(page)->ref));
> -	set_page_count(page, 1);
> +	WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << order));
> +	folio_set_count(folio, 1);
>   	lock_page(page);
> +
> +	if (order > 1) {
> +		prep_compound_page(page, order);
> +		folio_set_large_rmappable(folio);
> +	}
>   }
> -EXPORT_SYMBOL_GPL(zone_device_page_init);
> +EXPORT_SYMBOL_GPL(zone_device_folio_init);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 568198e9efc2..b5837075b6e0 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1769,9 +1769,13 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>   	 * the folio is unmapped and at least one page is still mapped.
>   	 *
>   	 * Check partially_mapped first to ensure it is a large folio.
> +	 *
> +	 * Device private folios do not support deferred splitting and
> +	 * shrinker based scanning of the folios to free.
>   	 */
>   	if (partially_mapped && folio_test_anon(folio) &&
> -	    !folio_test_partially_mapped(folio))
> +	    !folio_test_partially_mapped(folio) &&
> +		!folio_is_device_private(folio))

Please indent like the previous line.

if (partially_mapped && folio_test_anon(folio) &&
    !folio_test_partially_mapped(folio) &&
    !folio_is_device_private(folio))

>   		deferred_split_folio(folio, true);
>   
>   	__folio_mod_stat(folio, -nr, -nr_pmdmapped);


-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [v3 02/11] mm/thp: zone_device awareness in THP handling code
  2025-08-12  2:40 ` [v3 02/11] mm/thp: zone_device awareness in THP handling code Balbir Singh
  2025-08-12 14:47   ` kernel test robot
@ 2025-08-26 15:19   ` David Hildenbrand
  2025-08-27 10:14     ` Balbir Singh
  2025-08-28 20:05   ` Matthew Brost
  2 siblings, 1 reply; 36+ messages in thread
From: David Hildenbrand @ 2025-08-26 15:19 UTC (permalink / raw)
  To: Balbir Singh, dri-devel, linux-mm, linux-kernel
  Cc: Andrew Morton, Zi Yan, Joshua Hahn, Rakie Kim, Byungchul Park,
	Gregory Price, Ying Huang, Alistair Popple, Oscar Salvador,
	Lorenzo Stoakes, Baolin Wang, Liam R. Howlett, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Ralph Campbell, Mika Penttilä,
	Matthew Brost, Francois Dugast

On 12.08.25 04:40, Balbir Singh wrote:
> Make THP handling code in the mm subsystem for THP pages aware of zone
> device pages. Although the code is designed to be generic when it comes
> to handling splitting of pages, the code is designed to work for THP
> page sizes corresponding to HPAGE_PMD_NR.
> 
> Modify page_vma_mapped_walk() to return true when a zone device huge
> entry is present, enabling try_to_migrate() and other code migration
> paths to appropriately process the entry. page_vma_mapped_walk() will
> return true for zone device private large folios only when
> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
> not zone device private pages from having to add awareness.

Please don't if avoidable.

We should already have the same problem with small zone-device private
pages, and should have proper folio checks in place, no?


[...]

This thing is huge and hard to review. Given there are subtle changes in here that
are likely problematic, this is a problem. Is there any way to split this
into logical chunks?

Like teaching zap, mprotect, rmap walks .... code separately.

I'm sure you'll find a way to break this down so I don't walk out of a
review with a headache ;)

>   
>   struct page_vma_mapped_walk {
>   	unsigned long pfn;
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index 64ea151a7ae3..2641c01bd5d2 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -563,6 +563,7 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
>   {
>   	return is_swap_pmd(pmd) && is_migration_entry(pmd_to_swp_entry(pmd));
>   }
> +

^ unrelated change

>   #else  /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
>   static inline int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
>   		struct page *page)
> @@ -594,6 +595,22 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
>   }
>   #endif  /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
>   
> +#if defined(CONFIG_ZONE_DEVICE) && defined(CONFIG_ARCH_ENABLE_THP_MIGRATION)
> +
> +static inline int is_pmd_device_private_entry(pmd_t pmd)
> +{
> +	return is_swap_pmd(pmd) && is_device_private_entry(pmd_to_swp_entry(pmd));
> +}
> +
> +#else /* CONFIG_ZONE_DEVICE && CONFIG_ARCH_ENABLE_THP_MIGRATION */
> +
> +static inline int is_pmd_device_private_entry(pmd_t pmd)
> +{
> +	return 0;
> +}
> +
> +#endif /* CONFIG_ZONE_DEVICE && CONFIG_ARCH_ENABLE_THP_MIGRATION */
> +
>   static inline int non_swap_entry(swp_entry_t entry)
>   {
>   	return swp_type(entry) >= MAX_SWAPFILES;
> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
> index 761725bc713c..297f1e034045 100644
> --- a/lib/test_hmm.c
> +++ b/lib/test_hmm.c
> @@ -1408,7 +1408,7 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>   	 * the mirror but here we use it to hold the page for the simulated
>   	 * device memory and that page holds the pointer to the mirror.
>   	 */
> -	rpage = vmf->page->zone_device_data;
> +	rpage = folio_page(page_folio(vmf->page), 0)->zone_device_data;

Can we have a wrapper please to give us the zone_device_data for a folio, so
we have something like

rpage = folio_zone_device_data(page_folio(vmf->page));
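
A minimal sketch of such a helper (name and placement are only a suggestion):

	static inline void *folio_zone_device_data(const struct folio *folio)
	{
		return folio->page.zone_device_data;
	}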

>   	dmirror = rpage->zone_device_data;
>   
>   	/* FIXME demonstrate how we can adjust migrate range */
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 9c38a95e9f09..2495e3fdbfae 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1711,8 +1711,11 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>   	if (unlikely(is_swap_pmd(pmd))) {
>   		swp_entry_t entry = pmd_to_swp_entry(pmd);
>   
> -		VM_BUG_ON(!is_pmd_migration_entry(pmd));
> -		if (!is_readable_migration_entry(entry)) {
> +		VM_WARN_ON(!is_pmd_migration_entry(pmd) &&
> +				!is_pmd_device_private_entry(pmd));
> +
> +		if (is_migration_entry(entry) &&
> +			is_writable_migration_entry(entry)) {
>   			entry = make_readable_migration_entry(
>   							swp_offset(entry));

Careful: There is is_readable_exclusive_migration_entry(). So don't
change the !is_readable_migration_entry(entry) to is_writable_migration_entry(entry),
because it's wrong.

>   			pmd = swp_entry_to_pmd(entry);
> @@ -1722,6 +1725,32 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>   				pmd = pmd_swp_mkuffd_wp(pmd);
>   			set_pmd_at(src_mm, addr, src_pmd, pmd);
>   		}
> +
> +		if (is_device_private_entry(entry)) {

likely you want "else if" here.

> +			if (is_writable_device_private_entry(entry)) {
> +				entry = make_readable_device_private_entry(
> +					swp_offset(entry));
> +				pmd = swp_entry_to_pmd(entry);
> +
> +				if (pmd_swp_soft_dirty(*src_pmd))
> +					pmd = pmd_swp_mksoft_dirty(pmd);
> +				if (pmd_swp_uffd_wp(*src_pmd))
> +					pmd = pmd_swp_mkuffd_wp(pmd);
> +				set_pmd_at(src_mm, addr, src_pmd, pmd);
> +			}
> +
> +			src_folio = pfn_swap_entry_folio(entry);
> +			VM_WARN_ON(!folio_test_large(src_folio));
> +
> +			folio_get(src_folio);
> +			/*
> +			 * folio_try_dup_anon_rmap_pmd does not fail for
> +			 * device private entries.
> +			 */
> +			VM_WARN_ON(folio_try_dup_anon_rmap_pmd(src_folio,
> +					  &src_folio->page, dst_vma, src_vma));
> +		}

I would appreciate if this code flow here would resemble more what we have in
copy_nonpresent_pte(), at least regarding handling of these two cases.

(e.g., dropping the VM_WARN_ON)
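
Roughly like this (a sketch only, mirroring the copy_nonpresent_pte()
ordering and reusing the names from this patch):

		} else if (is_device_private_entry(entry)) {
			src_folio = pfn_swap_entry_folio(entry);

			folio_get(src_folio);
			/* Cannot fail as these pages cannot get pinned. */
			folio_try_dup_anon_rmap_pmd(src_folio, &src_folio->page,
						    dst_vma, src_vma);

			if (is_writable_device_private_entry(entry)) {
				entry = make_readable_device_private_entry(
							swp_offset(entry));
				pmd = swp_entry_to_pmd(entry);
				if (pmd_swp_soft_dirty(*src_pmd))
					pmd = pmd_swp_mksoft_dirty(pmd);
				if (pmd_swp_uffd_wp(*src_pmd))
					pmd = pmd_swp_mkuffd_wp(pmd);
				set_pmd_at(src_mm, addr, src_pmd, pmd);
			}
		}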

> +
>   		add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
>   		mm_inc_nr_ptes(dst_mm);
>   		pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
> @@ -2219,15 +2248,22 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>   			folio_remove_rmap_pmd(folio, page, vma);
>   			WARN_ON_ONCE(folio_mapcount(folio) < 0);
>   			VM_BUG_ON_PAGE(!PageHead(page), page);
> -		} else if (thp_migration_supported()) {
> +		} else if (is_pmd_migration_entry(orig_pmd) ||
> +				is_pmd_device_private_entry(orig_pmd)) {
>   			swp_entry_t entry;
>   
> -			VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
>   			entry = pmd_to_swp_entry(orig_pmd);
>   			folio = pfn_swap_entry_folio(entry);
>   			flush_needed = 0;
> -		} else
> -			WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
> +
> +			if (!thp_migration_supported())
> +				WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
> +
> +			if (is_pmd_device_private_entry(orig_pmd)) {
> +				folio_remove_rmap_pmd(folio, &folio->page, vma);
> +				WARN_ON_ONCE(folio_mapcount(folio) < 0);

Can we just move that into the folio_is_device_private() check below.

> +			}
> +		}
>   
>   		if (folio_test_anon(folio)) {
>   			zap_deposited_table(tlb->mm, pmd);
> @@ -2247,6 +2283,15 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>   				folio_mark_accessed(folio);
>   		}
>   
> +		/*
> +		 * Do a folio put on zone device private pages after
> +		 * changes to mm_counter, because the folio_put() will
> +		 * clean folio->mapping and the folio_test_anon() check
> +		 * will not be usable.
> +		 */

The comment can be dropped: it's simple, don't use "folio" after
dropping the reference when zapping.

> +		if (folio_is_device_private(folio))
> +			folio_put(folio);

> +
>   		spin_unlock(ptl);
>   		if (flush_needed)
>   			tlb_remove_page_size(tlb, &folio->page, HPAGE_PMD_SIZE);
> @@ -2375,7 +2420,8 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>   		struct folio *folio = pfn_swap_entry_folio(entry);
>   		pmd_t newpmd;
>   
> -		VM_BUG_ON(!is_pmd_migration_entry(*pmd));
> +		VM_WARN_ON(!is_pmd_migration_entry(*pmd) &&
> +			   !folio_is_device_private(folio));
>   		if (is_writable_migration_entry(entry)) {
>   			/*
>   			 * A protection check is difficult so
> @@ -2388,6 +2434,10 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>   			newpmd = swp_entry_to_pmd(entry);
>   			if (pmd_swp_soft_dirty(*pmd))
>   				newpmd = pmd_swp_mksoft_dirty(newpmd);
> +		} else if (is_writable_device_private_entry(entry)) {
> +			entry = make_readable_device_private_entry(
> +							swp_offset(entry));
> +			newpmd = swp_entry_to_pmd(entry);
>   		} else {
>   			newpmd = *pmd;
>   		}
> @@ -2842,16 +2892,19 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>   	struct page *page;
>   	pgtable_t pgtable;
>   	pmd_t old_pmd, _pmd;
> -	bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
> -	bool anon_exclusive = false, dirty = false;
> +	bool young, write, soft_dirty, uffd_wp = false;
> +	bool anon_exclusive = false, dirty = false, present = false;
>   	unsigned long addr;
>   	pte_t *pte;
>   	int i;
> +	swp_entry_t swp_entry;
>   
>   	VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
>   	VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
>   	VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma);
> -	VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd));
> +
> +	VM_WARN_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd)
> +			&& !(is_pmd_device_private_entry(*pmd)));

VM_WARN_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd) &&
	   !(is_pmd_device_private_entry(*pmd)));


>   
>   	count_vm_event(THP_SPLIT_PMD);
>   
> @@ -2899,18 +2952,45 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>   		return __split_huge_zero_page_pmd(vma, haddr, pmd);
>   	}
>   
> -	pmd_migration = is_pmd_migration_entry(*pmd);
> -	if (unlikely(pmd_migration)) {
> -		swp_entry_t entry;
>   
> +	present = pmd_present(*pmd);
> +	if (unlikely(!present)) {
> +		swp_entry = pmd_to_swp_entry(*pmd);
>   		old_pmd = *pmd;
> -		entry = pmd_to_swp_entry(old_pmd);
> -		page = pfn_swap_entry_to_page(entry);
> -		write = is_writable_migration_entry(entry);
> -		if (PageAnon(page))
> -			anon_exclusive = is_readable_exclusive_migration_entry(entry);
> -		young = is_migration_entry_young(entry);
> -		dirty = is_migration_entry_dirty(entry);
> +
> +		folio = pfn_swap_entry_folio(swp_entry);
> +		VM_WARN_ON(!is_migration_entry(swp_entry) &&
> +				!is_device_private_entry(swp_entry));
> +		page = pfn_swap_entry_to_page(swp_entry);
> +
> +		if (is_pmd_migration_entry(old_pmd)) {
> +			write = is_writable_migration_entry(swp_entry);
> +			if (PageAnon(page))
> +				anon_exclusive =
> +					is_readable_exclusive_migration_entry(
> +								swp_entry);
> +			young = is_migration_entry_young(swp_entry);
> +			dirty = is_migration_entry_dirty(swp_entry);
> +		} else if (is_pmd_device_private_entry(old_pmd)) {
> +			write = is_writable_device_private_entry(swp_entry);
> +			anon_exclusive = PageAnonExclusive(page);
> +			if (freeze && anon_exclusive &&
> +			    folio_try_share_anon_rmap_pmd(folio, page))
> +				freeze = false;
> +			if (!freeze) {
> +				rmap_t rmap_flags = RMAP_NONE;
> +
> +				if (anon_exclusive)
> +					rmap_flags |= RMAP_EXCLUSIVE;
> +
> +				folio_ref_add(folio, HPAGE_PMD_NR - 1);
> +				if (anon_exclusive)
> +					rmap_flags |= RMAP_EXCLUSIVE;
> +				folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
> +						 vma, haddr, rmap_flags);
> +			}
> +		}

This is massive and I'll have to review it with a fresh mind later.

[...]
	put_page(page);
> @@ -3058,8 +3157,10 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>   void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
>   			   pmd_t *pmd, bool freeze)
>   {
> +

^ unrelated

>   	VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
> -	if (pmd_trans_huge(*pmd) || is_pmd_migration_entry(*pmd))
> +	if (pmd_trans_huge(*pmd) || is_pmd_migration_entry(*pmd) ||
> +			(is_pmd_device_private_entry(*pmd)))

if (pmd_trans_huge(*pmd) || is_pmd_migration_entry(*pmd) ||
     is_pmd_device_private_entry(*pmd))

>   		__split_huge_pmd_locked(vma, pmd, address, freeze);
>   }
>   
> @@ -3238,6 +3339,9 @@ static void lru_add_split_folio(struct folio *folio, struct folio *new_folio,
>   	VM_BUG_ON_FOLIO(folio_test_lru(new_folio), folio);
>   	lockdep_assert_held(&lruvec->lru_lock);
>   
> +	if (folio_is_device_private(folio))
> +		return;
> +
>   	if (list) {
>   		/* page reclaim is reclaiming a huge page */
>   		VM_WARN_ON(folio_test_lru(folio));
> @@ -3252,6 +3356,7 @@ static void lru_add_split_folio(struct folio *folio, struct folio *new_folio,
>   			list_add_tail(&new_folio->lru, &folio->lru);
>   		folio_set_lru(new_folio);
>   	}
> +

^ unrelated

>   }
>   
>   /* Racy check whether the huge page can be split */
> @@ -3727,7 +3832,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>   
>   	/* Prevent deferred_split_scan() touching ->_refcount */
>   	spin_lock(&ds_queue->split_queue_lock);
> -	if (folio_ref_freeze(folio, 1 + extra_pins)) {
> +	if (folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio))) {

I think I discussed that with Zi Yan and it's tricky. Such a change should go
into a separate cleanup patch.


>   		struct address_space *swap_cache = NULL;
>   		struct lruvec *lruvec;
>   		int expected_refs;
> @@ -3858,8 +3963,9 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>   	if (nr_shmem_dropped)
>   		shmem_uncharge(mapping->host, nr_shmem_dropped);
>   
> -	if (!ret && is_anon)
> +	if (!ret && is_anon && !folio_is_device_private(folio))
>   		remap_flags = RMP_USE_SHARED_ZEROPAGE;
> +

^ unrelated

>   	remap_page(folio, 1 << order, remap_flags);
>   
>   	/*
> @@ -4603,7 +4709,10 @@ int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
>   		return 0;
>   
>   	flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
> -	pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
> +	if (unlikely(is_pmd_device_private_entry(*pvmw->pmd)))

Use pmd_present() instead, please. (just like in the pte code that handles this).

Why do we have to flush? pmd_clear() might be sufficient? In the PTE case we use pte_clear().
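
I.e. something like this sketch (assuming a non-present entry cannot have a
TLB entry that needs flushing):

	pmdval = pmdp_get(pvmw->pmd);
	if (likely(pmd_present(pmdval)))
		pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
	else
		pmd_clear(pvmw->pmd);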

[...]

>   		pmde = pmd_mksoft_dirty(pmde);
>   	if (is_writable_migration_entry(entry))
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index e05e14d6eacd..0ed337f94fcd 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -136,6 +136,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>   			 * page table entry. Other special swap entries are not
>   			 * migratable, and we ignore regular swapped page.
>   			 */
> +			struct folio *folio;
> +
>   			entry = pte_to_swp_entry(pte);
>   			if (!is_device_private_entry(entry))
>   				goto next;
> @@ -147,6 +149,51 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>   			    pgmap->owner != migrate->pgmap_owner)
>   				goto next;
>   
> +			folio = page_folio(page);
> +			if (folio_test_large(folio)) {
> +				struct folio *new_folio;
> +				struct folio *new_fault_folio = NULL;
> +
> +				/*
> +				 * The reason for finding pmd present with a
> +				 * device private pte and a large folio for the
> +				 * pte is partial unmaps. Split the folio now
> +				 * for the migration to be handled correctly
> +				 */

There are also other cases, like any VMA splits. Not sure if that makes a difference,
the folio is PTE mapped.

> +				pte_unmap_unlock(ptep, ptl);
> +
> +				folio_get(folio);
> +				if (folio != fault_folio)
> +					folio_lock(folio);
> +				if (split_folio(folio)) {
> +					if (folio != fault_folio)
> +						folio_unlock(folio);
> +					ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> +					goto next;
> +				}
> +
> +				new_folio = page_folio(page);
> +				if (fault_folio)
> +					new_fault_folio = page_folio(migrate->fault_page);
> +
> +				/*
> +				 * Ensure the lock is held on the correct
> +				 * folio after the split
> +				 */
> +				if (!new_fault_folio) {
> +					folio_unlock(folio);
> +					folio_put(folio);
> +				} else if (folio != new_fault_folio) {
> +					folio_get(new_fault_folio);
> +					folio_lock(new_fault_folio);
> +					folio_unlock(folio);
> +					folio_put(folio);
> +				}
> +
> +				addr = start;
> +				goto again;

Another thing to revisit with clean mind.

> +			}
> +
>   			mpfn = migrate_pfn(page_to_pfn(page)) |
>   					MIGRATE_PFN_MIGRATE;
>   			if (is_writable_device_private_entry(entry))
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index e981a1a292d2..246e6c211f34 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -250,12 +250,11 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>   			pvmw->ptl = pmd_lock(mm, pvmw->pmd);
>   			pmde = *pvmw->pmd;
>   			if (!pmd_present(pmde)) {
> -				swp_entry_t entry;
> +				swp_entry_t entry = pmd_to_swp_entry(pmde);
>   
>   				if (!thp_migration_supported() ||
>   				    !(pvmw->flags & PVMW_MIGRATION))
>   					return not_found(pvmw);
> -				entry = pmd_to_swp_entry(pmde);
>   				if (!is_migration_entry(entry) ||
>   				    !check_pmd(swp_offset_pfn(entry), pvmw))
>   					return not_found(pvmw);
> @@ -277,6 +276,16 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>   			 * cannot return prematurely, while zap_huge_pmd() has
>   			 * cleared *pmd but not decremented compound_mapcount().
>   			 */
> +			swp_entry_t entry;
> +
> +			entry = pmd_to_swp_entry(pmde);
> +
> +			if (is_device_private_entry(entry) &&
> +				(pvmw->flags & PVMW_THP_DEVICE_PRIVATE)) {
> +				pvmw->ptl = pmd_lock(mm, pvmw->pmd);
> +				return true;
> +			}
> +
>   			if ((pvmw->flags & PVMW_SYNC) &&
>   			    thp_vma_suitable_order(vma, pvmw->address,
>   						   PMD_ORDER) &&
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index 567e2d084071..604e8206a2ec 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -292,6 +292,12 @@ pte_t *___pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
>   		*pmdvalp = pmdval;
>   	if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
>   		goto nomap;
> +	if (is_swap_pmd(pmdval)) {
> +		swp_entry_t entry = pmd_to_swp_entry(pmdval);
> +
> +		if (is_device_private_entry(entry))
> +			goto nomap;
> +	}
>   	if (unlikely(pmd_trans_huge(pmdval)))
>   		goto nomap;
>   	if (unlikely(pmd_bad(pmdval))) {
> diff --git a/mm/rmap.c b/mm/rmap.c
> index b5837075b6e0..f40e45564295 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2285,7 +2285,8 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>   		     unsigned long address, void *arg)
>   {
>   	struct mm_struct *mm = vma->vm_mm;
> -	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
> +	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address,
> +				PVMW_THP_DEVICE_PRIVATE);
>   	bool anon_exclusive, writable, ret = true;
>   	pte_t pteval;
>   	struct page *subpage;
> @@ -2330,6 +2331,10 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>   	while (page_vma_mapped_walk(&pvmw)) {
>   		/* PMD-mapped THP migration entry */
>   		if (!pvmw.pte) {
> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> +			unsigned long pfn;
> +#endif
> +
>   			if (flags & TTU_SPLIT_HUGE_PMD) {
>   				split_huge_pmd_locked(vma, pvmw.address,
>   						      pvmw.pmd, true);
> @@ -2338,8 +2343,21 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>   				break;
>   			}
>   #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> -			subpage = folio_page(folio,
> -				pmd_pfn(*pvmw.pmd) - folio_pfn(folio));
> +			/*
> +			 * Zone device private folios do not work well with
> +			 * pmd_pfn() on some architectures due to pte
> +			 * inversion.
> +			 */

Please use the handling for the PTE case as inspiration.

		/*
		 * Handle PFN swap PTEs, such as device-exclusive ones, that
		 * actually map pages.
		 */
		pteval = ptep_get(pvmw.pte);
		if (likely(pte_present(pteval))) {
			pfn = pte_pfn(pteval);
		} else {
			pfn = swp_offset_pfn(pte_to_swp_entry(pteval));
			VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
		}


So I would expect here something like

		/*
		 * Handle PFN swap PTEs, such as device-exclusive ones, that
		 * actually map pages.
		 */
		pmdval = pmdp_get(pvmw.pmd);
		if (likely(pmd_present(pmdval)))
			pfn = pmd_pfn(pmdval);
		else
			pfn = swp_offset_pfn(pmd_to_swp_entry(pmdval));


> +			if (is_pmd_device_private_entry(*pvmw.pmd)) {
> +				swp_entry_t entry = pmd_to_swp_entry(*pvmw.pmd);
> +
> +				pfn = swp_offset_pfn(entry);
> +			} else {
> +				pfn = pmd_pfn(*pvmw.pmd);
> +			}
> +
> +			subpage = folio_page(folio, pfn - folio_pfn(folio));
> +
>   			VM_BUG_ON_FOLIO(folio_test_hugetlb(folio) ||
>   					!folio_test_pmd_mappable(folio), folio);
>   


-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [v3 02/11] mm/thp: zone_device awareness in THP handling code
  2025-08-26 15:19   ` David Hildenbrand
@ 2025-08-27 10:14     ` Balbir Singh
  2025-08-27 11:28       ` David Hildenbrand
  0 siblings, 1 reply; 36+ messages in thread
From: Balbir Singh @ 2025-08-27 10:14 UTC (permalink / raw)
  To: David Hildenbrand, dri-devel, linux-mm, linux-kernel
  Cc: Andrew Morton, Zi Yan, Joshua Hahn, Rakie Kim, Byungchul Park,
	Gregory Price, Ying Huang, Alistair Popple, Oscar Salvador,
	Lorenzo Stoakes, Baolin Wang, Liam R. Howlett, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Ralph Campbell, Mika Penttilä,
	Matthew Brost, Francois Dugast

On 8/27/25 01:19, David Hildenbrand wrote:
> On 12.08.25 04:40, Balbir Singh wrote:
>> Make THP handling code in the mm subsystem for THP pages aware of zone
>> device pages. Although the code is designed to be generic when it comes
>> to handling splitting of pages, the code is designed to work for THP
>> page sizes corresponding to HPAGE_PMD_NR.
>>
>> Modify page_vma_mapped_walk() to return true when a zone device huge
>> entry is present, enabling try_to_migrate() and other code migration
>> paths to appropriately process the entry. page_vma_mapped_walk() will
>> return true for zone device private large folios only when
>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>> not zone device private pages from having to add awareness.
> 
> Please don't if avoidable.
> 
> We should already have the same problem with small zone-device private
> pages, and should have proper folio checks in place, no?
> 
> 
> [...]
> 
> This thing is huge and hard to review. Given there are subtle changes in here that
> are likely problematic, this is a problem. Is there any way to split this
> into logical chunks?
> 
> Like teaching zap, mprotect, rmap walks .... code separately.
> 
> I'm sure you'll find a way to break this down so I don't walk out of a
> review with a headache ;)
> 

:) I had smaller chunks earlier, but then ran into the "don't add a change
unless you use it" problem

>>     struct page_vma_mapped_walk {
>>       unsigned long pfn;
>> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
>> index 64ea151a7ae3..2641c01bd5d2 100644
>> --- a/include/linux/swapops.h
>> +++ b/include/linux/swapops.h
>> @@ -563,6 +563,7 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
>>   {
>>       return is_swap_pmd(pmd) && is_migration_entry(pmd_to_swp_entry(pmd));
>>   }
>> +
> 
> ^ unrelated change

Ack

> 
>>   #else  /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
>>   static inline int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
>>           struct page *page)
>> @@ -594,6 +595,22 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
>>   }
>>   #endif  /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
>>   +#if defined(CONFIG_ZONE_DEVICE) && defined(CONFIG_ARCH_ENABLE_THP_MIGRATION)
>> +
>> +static inline int is_pmd_device_private_entry(pmd_t pmd)
>> +{
>> +    return is_swap_pmd(pmd) && is_device_private_entry(pmd_to_swp_entry(pmd));
>> +}
>> +
>> +#else /* CONFIG_ZONE_DEVICE && CONFIG_ARCH_ENABLE_THP_MIGRATION */
>> +
>> +static inline int is_pmd_device_private_entry(pmd_t pmd)
>> +{
>> +    return 0;
>> +}
>> +
>> +#endif /* CONFIG_ZONE_DEVICE && CONFIG_ARCH_ENABLE_THP_MIGRATION */
>> +
>>   static inline int non_swap_entry(swp_entry_t entry)
>>   {
>>       return swp_type(entry) >= MAX_SWAPFILES;
>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>> index 761725bc713c..297f1e034045 100644
>> --- a/lib/test_hmm.c
>> +++ b/lib/test_hmm.c
>> @@ -1408,7 +1408,7 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>>        * the mirror but here we use it to hold the page for the simulated
>>        * device memory and that page holds the pointer to the mirror.
>>        */
>> -    rpage = vmf->page->zone_device_data;
>> +    rpage = folio_page(page_folio(vmf->page), 0)->zone_device_data;
> 
> Can we have a wrapper please to give us the zone_device_data for a folio, so
> we have something like
> 
> rpage = folio_zone_device_data(page_folio(vmf->page));
> 

Yep, will change

>>       dmirror = rpage->zone_device_data;
>>         /* FIXME demonstrate how we can adjust migrate range */
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 9c38a95e9f09..2495e3fdbfae 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1711,8 +1711,11 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>>       if (unlikely(is_swap_pmd(pmd))) {
>>           swp_entry_t entry = pmd_to_swp_entry(pmd);
>>   -        VM_BUG_ON(!is_pmd_migration_entry(pmd));
>> -        if (!is_readable_migration_entry(entry)) {
>> +        VM_WARN_ON(!is_pmd_migration_entry(pmd) &&
>> +                !is_pmd_device_private_entry(pmd));
>> +
>> +        if (is_migration_entry(entry) &&
>> +            is_writable_migration_entry(entry)) {
>>               entry = make_readable_migration_entry(
>>                               swp_offset(entry));
> 
> Careful: There is is_readable_exclusive_migration_entry(). So don't
> change the !is_readable_migration_entry(entry) to is_writable_migration_entry(entry),
> because it's wrong.
> 

Ack, I assume you are referring to potential prot_none entries?

>>               pmd = swp_entry_to_pmd(entry);
>> @@ -1722,6 +1725,32 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>>                   pmd = pmd_swp_mkuffd_wp(pmd);
>>               set_pmd_at(src_mm, addr, src_pmd, pmd);
>>           }
>> +
>> +        if (is_device_private_entry(entry)) {
> 
> likely you want "else if" here.
> 

Ack

>> +            if (is_writable_device_private_entry(entry)) {
>> +                entry = make_readable_device_private_entry(
>> +                    swp_offset(entry));
>> +                pmd = swp_entry_to_pmd(entry);
>> +
>> +                if (pmd_swp_soft_dirty(*src_pmd))
>> +                    pmd = pmd_swp_mksoft_dirty(pmd);
>> +                if (pmd_swp_uffd_wp(*src_pmd))
>> +                    pmd = pmd_swp_mkuffd_wp(pmd);
>> +                set_pmd_at(src_mm, addr, src_pmd, pmd);
>> +            }
>> +
>> +            src_folio = pfn_swap_entry_folio(entry);
>> +            VM_WARN_ON(!folio_test_large(src_folio));
>> +
>> +            folio_get(src_folio);
>> +            /*
>> +             * folio_try_dup_anon_rmap_pmd does not fail for
>> +             * device private entries.
>> +             */
>> +            VM_WARN_ON(folio_try_dup_anon_rmap_pmd(src_folio,
>> +                      &src_folio->page, dst_vma, src_vma));
>> +        }
> 
> I would appreciate if this code flow here would resemble more what we have in
> copy_nonpresent_pte(), at least regarding handling of these two cases.
> 
> (e.g., dropping the VM_WARN_ON)

Ack

> 
>> +
>>           add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
>>           mm_inc_nr_ptes(dst_mm);
>>           pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
>> @@ -2219,15 +2248,22 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>               folio_remove_rmap_pmd(folio, page, vma);
>>               WARN_ON_ONCE(folio_mapcount(folio) < 0);
>>               VM_BUG_ON_PAGE(!PageHead(page), page);
>> -        } else if (thp_migration_supported()) {
>> +        } else if (is_pmd_migration_entry(orig_pmd) ||
>> +                is_pmd_device_private_entry(orig_pmd)) {
>>               swp_entry_t entry;
>>   -            VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
>>               entry = pmd_to_swp_entry(orig_pmd);
>>               folio = pfn_swap_entry_folio(entry);
>>               flush_needed = 0;
>> -        } else
>> -            WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
>> +
>> +            if (!thp_migration_supported())
>> +                WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
>> +
>> +            if (is_pmd_device_private_entry(orig_pmd)) {
>> +                folio_remove_rmap_pmd(folio, &folio->page, vma);
>> +                WARN_ON_ONCE(folio_mapcount(folio) < 0);
> 
>> Can we just move that into the folio_is_device_private() check below.

The check you mean?

> 
>> +            }
>> +        }
>>             if (folio_test_anon(folio)) {
>>               zap_deposited_table(tlb->mm, pmd);
>> @@ -2247,6 +2283,15 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>                   folio_mark_accessed(folio);
>>           }
>>   +        /*
>> +         * Do a folio put on zone device private pages after
>> +         * changes to mm_counter, because the folio_put() will
>> +         * clean folio->mapping and the folio_test_anon() check
>> +         * will not be usable.
>> +         */
> 
> The comment can be dropped: it's simple, don't use "folio" after
> dropping the reference when zapping.
> 

Ack

>> +        if (folio_is_device_private(folio))
>> +            folio_put(folio);
> 
>> +
>>           spin_unlock(ptl);
>>           if (flush_needed)
>>               tlb_remove_page_size(tlb, &folio->page, HPAGE_PMD_SIZE);
>> @@ -2375,7 +2420,8 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>           struct folio *folio = pfn_swap_entry_folio(entry);
>>           pmd_t newpmd;
>>   -        VM_BUG_ON(!is_pmd_migration_entry(*pmd));
>> +        VM_WARN_ON(!is_pmd_migration_entry(*pmd) &&
>> +               !folio_is_device_private(folio));
>>           if (is_writable_migration_entry(entry)) {
>>               /*
>>                * A protection check is difficult so
>> @@ -2388,6 +2434,10 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>               newpmd = swp_entry_to_pmd(entry);
>>               if (pmd_swp_soft_dirty(*pmd))
>>                   newpmd = pmd_swp_mksoft_dirty(newpmd);
>> +        } else if (is_writable_device_private_entry(entry)) {
>> +            entry = make_readable_device_private_entry(
>> +                            swp_offset(entry));
>> +            newpmd = swp_entry_to_pmd(entry);
>>           } else {
>>               newpmd = *pmd;
>>           }
>> @@ -2842,16 +2892,19 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>       struct page *page;
>>       pgtable_t pgtable;
>>       pmd_t old_pmd, _pmd;
>> -    bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
>> -    bool anon_exclusive = false, dirty = false;
>> +    bool young, write, soft_dirty, uffd_wp = false;
>> +    bool anon_exclusive = false, dirty = false, present = false;
>>       unsigned long addr;
>>       pte_t *pte;
>>       int i;
>> +    swp_entry_t swp_entry;
>>         VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
>>       VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
>>       VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma);
>> -    VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd));
>> +
>> +    VM_WARN_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd)
>> +            && !(is_pmd_device_private_entry(*pmd)));
> 
> VM_WARN_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd) &&
>        !(is_pmd_device_private_entry(*pmd)));
> 
>

Ack
 

>>         count_vm_event(THP_SPLIT_PMD);
>>   @@ -2899,18 +2952,45 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>           return __split_huge_zero_page_pmd(vma, haddr, pmd);
>>       }
>>   -    pmd_migration = is_pmd_migration_entry(*pmd);
>> -    if (unlikely(pmd_migration)) {
>> -        swp_entry_t entry;
>>   +    present = pmd_present(*pmd);
>> +    if (unlikely(!present)) {
>> +        swp_entry = pmd_to_swp_entry(*pmd);
>>           old_pmd = *pmd;
>> -        entry = pmd_to_swp_entry(old_pmd);
>> -        page = pfn_swap_entry_to_page(entry);
>> -        write = is_writable_migration_entry(entry);
>> -        if (PageAnon(page))
>> -            anon_exclusive = is_readable_exclusive_migration_entry(entry);
>> -        young = is_migration_entry_young(entry);
>> -        dirty = is_migration_entry_dirty(entry);
>> +
>> +        folio = pfn_swap_entry_folio(swp_entry);
>> +        VM_WARN_ON(!is_migration_entry(swp_entry) &&
>> +                !is_device_private_entry(swp_entry));
>> +        page = pfn_swap_entry_to_page(swp_entry);
>> +
>> +        if (is_pmd_migration_entry(old_pmd)) {
>> +            write = is_writable_migration_entry(swp_entry);
>> +            if (PageAnon(page))
>> +                anon_exclusive =
>> +                    is_readable_exclusive_migration_entry(
>> +                                swp_entry);
>> +            young = is_migration_entry_young(swp_entry);
>> +            dirty = is_migration_entry_dirty(swp_entry);
>> +        } else if (is_pmd_device_private_entry(old_pmd)) {
>> +            write = is_writable_device_private_entry(swp_entry);
>> +            anon_exclusive = PageAnonExclusive(page);
>> +            if (freeze && anon_exclusive &&
>> +                folio_try_share_anon_rmap_pmd(folio, page))
>> +                freeze = false;
>> +            if (!freeze) {
>> +                rmap_t rmap_flags = RMAP_NONE;
>> +
>> +                if (anon_exclusive)
>> +                    rmap_flags |= RMAP_EXCLUSIVE;
>> +
>> +                folio_ref_add(folio, HPAGE_PMD_NR - 1);
>> +                if (anon_exclusive)
>> +                    rmap_flags |= RMAP_EXCLUSIVE;
>> +                folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
>> +                         vma, haddr, rmap_flags);
>> +            }
>> +        }
> 
> This is massive and I'll have to review it with a fresh mind later.

It is similar to what we do for non-device-private folios: when we map/unmap the entire
folio during migration, the fresh folios/pages should be marked as anon exclusive.
But please do check

> 
> [...]
>     put_page(page);
>> @@ -3058,8 +3157,10 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>   void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
>>                  pmd_t *pmd, bool freeze)
>>   {
>> +
> 
> ^ unrelated
> 

Ack

>>       VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
>> -    if (pmd_trans_huge(*pmd) || is_pmd_migration_entry(*pmd))
>> +    if (pmd_trans_huge(*pmd) || is_pmd_migration_entry(*pmd) ||
>> +            (is_pmd_device_private_entry(*pmd)))
> 
> if (pmd_trans_huge(*pmd) || is_pmd_migration_entry(*pmd) ||
>     is_pmd_device_private_entry(*pmd))
> 
>>           __split_huge_pmd_locked(vma, pmd, address, freeze);
>>   }
>>   @@ -3238,6 +3339,9 @@ static void lru_add_split_folio(struct folio *folio, struct folio *new_folio,
>>       VM_BUG_ON_FOLIO(folio_test_lru(new_folio), folio);
>>       lockdep_assert_held(&lruvec->lru_lock);
>>   +    if (folio_is_device_private(folio))
>> +        return;
>> +
>>       if (list) {
>>           /* page reclaim is reclaiming a huge page */
>>           VM_WARN_ON(folio_test_lru(folio));
>> @@ -3252,6 +3356,7 @@ static void lru_add_split_folio(struct folio *folio, struct folio *new_folio,
>>               list_add_tail(&new_folio->lru, &folio->lru);
>>           folio_set_lru(new_folio);
>>       }
>> +
> 
> ^ unrelated
> 

Ack

>>   }
>>     /* Racy check whether the huge page can be split */
>> @@ -3727,7 +3832,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>         /* Prevent deferred_split_scan() touching ->_refcount */
>>       spin_lock(&ds_queue->split_queue_lock);
>> -    if (folio_ref_freeze(folio, 1 + extra_pins)) {
>> +    if (folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio))) {
> 
> I think I discussed that with Zi Yan and it's tricky. Such a change should go
> into a separate cleanup patch.
> 

Ack, I'll split it up as needed. The reason this is here is that large folios that
have been partially split (pmd split) have a count of nr_pages + 1 after the pmd split;
the map count falls after the split, but never goes to 0 as the ref_freeze code expects.

> 
>>           struct address_space *swap_cache = NULL;
>>           struct lruvec *lruvec;
>>           int expected_refs;
>> @@ -3858,8 +3963,9 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>       if (nr_shmem_dropped)
>>           shmem_uncharge(mapping->host, nr_shmem_dropped);
>>   -    if (!ret && is_anon)
>> +    if (!ret && is_anon && !folio_is_device_private(folio))
>>           remap_flags = RMP_USE_SHARED_ZEROPAGE;
>> +
> 
> ^ unrelated

Ack

> 
>>       remap_page(folio, 1 << order, remap_flags);
>>         /*
>> @@ -4603,7 +4709,10 @@ int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
>>           return 0;
>>         flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
>> -    pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
>> +    if (unlikely(is_pmd_device_private_entry(*pvmw->pmd)))
> 
> Use pmd_present() instead, please. (just like in the pte code that handles this).
> 

Ack

> Why do we have to flush? pmd_clear() might be sufficient? In the PTE case we use pte_clear().

Without the flush, other entities will not see the cleared pmd. Also, isn't pte_clear()
used only when should_defer_flush() is true?

> 
> [...]
> 
>>           pmde = pmd_mksoft_dirty(pmde);
>>       if (is_writable_migration_entry(entry))
>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>> index e05e14d6eacd..0ed337f94fcd 100644
>> --- a/mm/migrate_device.c
>> +++ b/mm/migrate_device.c
>> @@ -136,6 +136,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>                * page table entry. Other special swap entries are not
>>                * migratable, and we ignore regular swapped page.
>>                */
>> +            struct folio *folio;
>> +
>>               entry = pte_to_swp_entry(pte);
>>               if (!is_device_private_entry(entry))
>>                   goto next;
>> @@ -147,6 +149,51 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>                   pgmap->owner != migrate->pgmap_owner)
>>                   goto next;
>>   +            folio = page_folio(page);
>> +            if (folio_test_large(folio)) {
>> +                struct folio *new_folio;
>> +                struct folio *new_fault_folio = NULL;
>> +
>> +                /*
>> +                 * The reason for finding pmd present with a
>> +                 * device private pte and a large folio for the
>> +                 * pte is partial unmaps. Split the folio now
>> +                 * for the migration to be handled correctly
>> +                 */
> 
> There are also other cases, like any VMA splits. Not sure if that makes a difference,
> the folio is PTE mapped.
> 

Ack, I can clarify that the folio is just pte mapped or remove the comment

>> +                pte_unmap_unlock(ptep, ptl);
>> +
>> +                folio_get(folio);
>> +                if (folio != fault_folio)
>> +                    folio_lock(folio);
>> +                if (split_folio(folio)) {
>> +                    if (folio != fault_folio)
>> +                        folio_unlock(folio);
>> +                    ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>> +                    goto next;
>> +                }
>> +
>> +                new_folio = page_folio(page);
>> +                if (fault_folio)
>> +                    new_fault_folio = page_folio(migrate->fault_page);
>> +
>> +                /*
>> +                 * Ensure the lock is held on the correct
>> +                 * folio after the split
>> +                 */
>> +                if (!new_fault_folio) {
>> +                    folio_unlock(folio);
>> +                    folio_put(folio);
>> +                } else if (folio != new_fault_folio) {
>> +                    folio_get(new_fault_folio);
>> +                    folio_lock(new_fault_folio);
>> +                    folio_unlock(folio);
>> +                    folio_put(folio);
>> +                }
>> +
>> +                addr = start;
>> +                goto again;
> 
> Another thing to revisit with clean mind.
> 

Sure

>> +            }
>> +
>>               mpfn = migrate_pfn(page_to_pfn(page)) |
>>                       MIGRATE_PFN_MIGRATE;
>>               if (is_writable_device_private_entry(entry))
>> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
>> index e981a1a292d2..246e6c211f34 100644
>> --- a/mm/page_vma_mapped.c
>> +++ b/mm/page_vma_mapped.c
>> @@ -250,12 +250,11 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>>               pvmw->ptl = pmd_lock(mm, pvmw->pmd);
>>               pmde = *pvmw->pmd;
>>               if (!pmd_present(pmde)) {
>> -                swp_entry_t entry;
>> +                swp_entry_t entry = pmd_to_swp_entry(pmde);
>>                     if (!thp_migration_supported() ||
>>                       !(pvmw->flags & PVMW_MIGRATION))
>>                       return not_found(pvmw);
>> -                entry = pmd_to_swp_entry(pmde);
>>                   if (!is_migration_entry(entry) ||
>>                       !check_pmd(swp_offset_pfn(entry), pvmw))
>>                       return not_found(pvmw);
>> @@ -277,6 +276,16 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>>                * cannot return prematurely, while zap_huge_pmd() has
>>                * cleared *pmd but not decremented compound_mapcount().
>>                */
>> +            swp_entry_t entry;
>> +
>> +            entry = pmd_to_swp_entry(pmde);
>> +
>> +            if (is_device_private_entry(entry) &&
>> +                (pvmw->flags & PVMW_THP_DEVICE_PRIVATE)) {
>> +                pvmw->ptl = pmd_lock(mm, pvmw->pmd);
>> +                return true;
>> +            }
>> +
>>               if ((pvmw->flags & PVMW_SYNC) &&
>>                   thp_vma_suitable_order(vma, pvmw->address,
>>                              PMD_ORDER) &&
>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>> index 567e2d084071..604e8206a2ec 100644
>> --- a/mm/pgtable-generic.c
>> +++ b/mm/pgtable-generic.c
>> @@ -292,6 +292,12 @@ pte_t *___pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
>>           *pmdvalp = pmdval;
>>       if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
>>           goto nomap;
>> +    if (is_swap_pmd(pmdval)) {
>> +        swp_entry_t entry = pmd_to_swp_entry(pmdval);
>> +
>> +        if (is_device_private_entry(entry))
>> +            goto nomap;
>> +    }
>>       if (unlikely(pmd_trans_huge(pmdval)))
>>           goto nomap;
>>       if (unlikely(pmd_bad(pmdval))) {
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index b5837075b6e0..f40e45564295 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -2285,7 +2285,8 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>>                unsigned long address, void *arg)
>>   {
>>       struct mm_struct *mm = vma->vm_mm;
>> -    DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
>> +    DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address,
>> +                PVMW_THP_DEVICE_PRIVATE);
>>       bool anon_exclusive, writable, ret = true;
>>       pte_t pteval;
>>       struct page *subpage;
>> @@ -2330,6 +2331,10 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>>       while (page_vma_mapped_walk(&pvmw)) {
>>           /* PMD-mapped THP migration entry */
>>           if (!pvmw.pte) {
>> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
>> +            unsigned long pfn;
>> +#endif
>> +
>>               if (flags & TTU_SPLIT_HUGE_PMD) {
>>                   split_huge_pmd_locked(vma, pvmw.address,
>>                                 pvmw.pmd, true);
>> @@ -2338,8 +2343,21 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>>                   break;
>>               }
>>   #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
>> -            subpage = folio_page(folio,
>> -                pmd_pfn(*pvmw.pmd) - folio_pfn(folio));
>> +            /*
>> +             * Zone device private folios do not work well with
>> +             * pmd_pfn() on some architectures due to pte
>> +             * inversion.
>> +             */
> 
> Please use the handling for the PTE case as inspiration.
> 
>         /*
>          * Handle PFN swap PTEs, such as device-exclusive ones, that
>          * actually map pages.
>          */
>         pteval = ptep_get(pvmw.pte);
>         if (likely(pte_present(pteval))) {
>             pfn = pte_pfn(pteval);
>         } else {
>             pfn = swp_offset_pfn(pte_to_swp_entry(pteval));
>             VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
>         }
> 
> 
> So I would expect here something like
> 
>         /*
>          * Handle PFN swap PTEs, such as device-exclusive ones, that
>          * actually map pages.
>          */
>         pmdval = pmdp_get(pvmw.pmd);
>         if (likely(pmd_present(pmdval)))
>             pfn = pmd_pfn(pmdval);
>         else
>             pfn = swp_offset_pfn(pmd_to_swp_entry(pmdval));
> 
> 

I can switch over to pmd_present for the checks

>> +            if (is_pmd_device_private_entry(*pvmw.pmd)) {
>> +                swp_entry_t entry = pmd_to_swp_entry(*pvmw.pmd);
>> +
>> +                pfn = swp_offset_pfn(entry);
>> +            } else {
>> +                pfn = pmd_pfn(*pvmw.pmd);
>> +            }
>> +
>> +            subpage = folio_page(folio, pfn - folio_pfn(folio));
>> +
>>               VM_BUG_ON_FOLIO(folio_test_hugetlb(folio) ||
>>                       !folio_test_pmd_mappable(folio), folio);
>>   
> 
> 


Thanks for the review
Balbir


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [v3 02/11] mm/thp: zone_device awareness in THP handling code
  2025-08-27 10:14     ` Balbir Singh
@ 2025-08-27 11:28       ` David Hildenbrand
  0 siblings, 0 replies; 36+ messages in thread
From: David Hildenbrand @ 2025-08-27 11:28 UTC (permalink / raw)
  To: Balbir Singh, dri-devel, linux-mm, linux-kernel
  Cc: Andrew Morton, Zi Yan, Joshua Hahn, Rakie Kim, Byungchul Park,
	Gregory Price, Ying Huang, Alistair Popple, Oscar Salvador,
	Lorenzo Stoakes, Baolin Wang, Liam R. Howlett, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Ralph Campbell, Mika Penttilä,
	Matthew Brost, Francois Dugast

>> Like teaching zap, mprotect, rmap walks .... code separately.
>>
>> I'm sure you'll find a way to break this down so I don't walk out of a
>> review with a headache ;)
>>
> 
> :) I had smaller chunks earlier, but then ran into the "don't add a change
> unless you use it" problem
> 

It's perfectly reasonable to have something like

mm/huge_memory: teach copy_huge_pmd() about huge device-private entries
mm/huge_memory: support splitting device-private folios

...

etc :)

[...]

>> Careful: There is is_readable_exclusive_migration_entry(). So don't
>> change the !is_readable_migration_entry(entry) to is_writable_migration_entry(entry),
>> because it's wrong.
>>
> 
> Ack, I assume you are referring to potential prot_none entries?

readable_exclusive entries are used to maintain the PageAnonExclusive bit right
now for migration entries. So it's not related to prot_none.

[...]

>>> -            WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
>>> +
>>> +            if (!thp_migration_supported())
>>> +                WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
>>> +
>>> +            if (is_pmd_device_private_entry(orig_pmd)) {
>>> +                folio_remove_rmap_pmd(folio, &folio->page, vma);
>>> +                WARN_ON_ONCE(folio_mapcount(folio) < 0);
>>
>> Can we just move that into the folio_is_device_private() check below.
> 
> The check you mean?

The whole thing like

if (...) {
	folio_remove_rmap_pmd(folio, &folio->page, vma);
	WARN_ON_ONCE(folio_mapcount(folio) < 0);
	folio_put(folio)
}


[...]

> 
>> Why do we have to flush? pmd_clear() might be sufficient? In the PTE case we use pte_clear().
> 
> Without the flush, other entities will not see the cleared pmd. Also, isn't pte_clear()
> used only when should_defer_flush() is true?

It's a non-present page entry, so there should be no TLB entry to flush.

> 
>>
>> [...]
>>
>>>            pmde = pmd_mksoft_dirty(pmde);
>>>        if (is_writable_migration_entry(entry))
>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>> index e05e14d6eacd..0ed337f94fcd 100644
>>> --- a/mm/migrate_device.c
>>> +++ b/mm/migrate_device.c
>>> @@ -136,6 +136,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>                 * page table entry. Other special swap entries are not
>>>                 * migratable, and we ignore regular swapped page.
>>>                 */
>>> +            struct folio *folio;
>>> +
>>>                entry = pte_to_swp_entry(pte);
>>>                if (!is_device_private_entry(entry))
>>>                    goto next;
>>> @@ -147,6 +149,51 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>                    pgmap->owner != migrate->pgmap_owner)
>>>                    goto next;
>>>    +            folio = page_folio(page);
>>> +            if (folio_test_large(folio)) {
>>> +                struct folio *new_folio;
>>> +                struct folio *new_fault_folio = NULL;
>>> +
>>> +                /*
>>> +                 * The reason for finding pmd present with a
>>> +                 * device private pte and a large folio for the
>>> +                 * pte is partial unmaps. Split the folio now
>>> +                 * for the migration to be handled correctly
>>> +                 */
>>
>> There are also other cases, like any VMA splits. Not sure if that makes a difference,
>> the folio is PTE mapped.
>>
> 
> Ack, I can clarify that the folio is just pte mapped or remove the comment

Sounds good.


-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [v3 07/11] mm/thp: add split during migration support
  2025-08-12  2:40 ` [v3 07/11] mm/thp: add split during migration support Balbir Singh
@ 2025-08-27 20:29   ` David Hildenbrand
  0 siblings, 0 replies; 36+ messages in thread
From: David Hildenbrand @ 2025-08-27 20:29 UTC (permalink / raw)
  To: Balbir Singh, dri-devel, linux-mm, linux-kernel
  Cc: Andrew Morton, Zi Yan, Joshua Hahn, Rakie Kim, Byungchul Park,
	Gregory Price, Ying Huang, Alistair Popple, Oscar Salvador,
	Lorenzo Stoakes, Baolin Wang, Liam R. Howlett, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Ralph Campbell, Mika Penttilä,
	Matthew Brost, Francois Dugast

On 12.08.25 04:40, Balbir Singh wrote:
> Support splitting pages during THP zone device migration as needed.
> The common case that arises is that after setup, during migrate
> the destination might not be able to allocate MIGRATE_PFN_COMPOUND
> pages.
> 
> Add a new routine migrate_vma_split_pages() to support the splitting
> of already isolated pages. The pages being migrated are already unmapped
> and marked for migration during setup (via unmap). folio_split() and
> __split_unmapped_folio() take additional isolated arguments, to avoid
> unmapping and remapping these pages and unlocking/putting the folio.

No detailed review, just a high-level comment: please take better care 
of crafting your patch subjects.

This should probably be

mm/migrate_device: support splitting device folios during migration

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [v3 02/11] mm/thp: zone_device awareness in THP handling code
  2025-08-12  2:40 ` [v3 02/11] mm/thp: zone_device awareness in THP handling code Balbir Singh
  2025-08-12 14:47   ` kernel test robot
  2025-08-26 15:19   ` David Hildenbrand
@ 2025-08-28 20:05   ` Matthew Brost
  2025-08-28 20:12     ` David Hildenbrand
  2 siblings, 1 reply; 36+ messages in thread
From: Matthew Brost @ 2025-08-28 20:05 UTC (permalink / raw)
  To: Balbir Singh
  Cc: dri-devel, linux-mm, linux-kernel, Andrew Morton,
	David Hildenbrand, Zi Yan, Joshua Hahn, Rakie Kim, Byungchul Park,
	Gregory Price, Ying Huang, Alistair Popple, Oscar Salvador,
	Lorenzo Stoakes, Baolin Wang, Liam R. Howlett, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Ralph Campbell, Mika Penttilä,
	Francois Dugast

On Tue, Aug 12, 2025 at 12:40:27PM +1000, Balbir Singh wrote:
> Make THP handling code in the mm subsystem for THP pages aware of zone
> device pages. Although the code is designed to be generic when it comes
> to handling splitting of pages, the code is designed to work for THP
> page sizes corresponding to HPAGE_PMD_NR.
> 
> Modify page_vma_mapped_walk() to return true when a zone device huge
> entry is present, enabling try_to_migrate() and other code migration
> paths to appropriately process the entry. page_vma_mapped_walk() will
> return true for zone device private large folios only when
> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
> not zone device private pages from having to add awareness. The key
> callback that needs this flag is try_to_migrate_one(). The other
> callbacks page idle, damon use it for setting young/dirty bits, which is
> not significant when it comes to pmd level bit harvesting.
> 
> pmd_pfn() does not work well with zone device entries, use
> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
> entries.
> 
> Support partial unmapping of zone device private entries, which happens
> via munmap(). munmap() causes the device private entry pmd to be split,
> but the corresponding folio is not split. Deferred split does not work for
> zone device private folios due to the need to split during fault
> handling. Get migrate_vma_collect_pmd() to handle this case by splitting
> partially unmapped device private folios.
> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
> Cc: Rakie Kim <rakie.kim@sk.com>
> Cc: Byungchul Park <byungchul@sk.com>
> Cc: Gregory Price <gourry@gourry.net>
> Cc: Ying Huang <ying.huang@linux.alibaba.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Nico Pache <npache@redhat.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Dev Jain <dev.jain@arm.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Cc: Mika Penttilä <mpenttil@redhat.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Francois Dugast <francois.dugast@intel.com>
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> ---
>  include/linux/rmap.h    |   2 +
>  include/linux/swapops.h |  17 ++++
>  lib/test_hmm.c          |   2 +-
>  mm/huge_memory.c        | 214 +++++++++++++++++++++++++++++++---------
>  mm/migrate_device.c     |  47 +++++++++
>  mm/page_vma_mapped.c    |  13 ++-
>  mm/pgtable-generic.c    |   6 ++
>  mm/rmap.c               |  24 ++++-
>  8 files changed, 272 insertions(+), 53 deletions(-)
> 
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 6cd020eea37a..dfb7aae3d77b 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -927,6 +927,8 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
>  #define PVMW_SYNC		(1 << 0)
>  /* Look for migration entries rather than present PTEs */
>  #define PVMW_MIGRATION		(1 << 1)
> +/* Look for device private THP entries */
> +#define PVMW_THP_DEVICE_PRIVATE	(1 << 2)
>  
>  struct page_vma_mapped_walk {
>  	unsigned long pfn;
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index 64ea151a7ae3..2641c01bd5d2 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -563,6 +563,7 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
>  {
>  	return is_swap_pmd(pmd) && is_migration_entry(pmd_to_swp_entry(pmd));
>  }
> +
>  #else  /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
>  static inline int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
>  		struct page *page)
> @@ -594,6 +595,22 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
>  }
>  #endif  /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
>  
> +#if defined(CONFIG_ZONE_DEVICE) && defined(CONFIG_ARCH_ENABLE_THP_MIGRATION)
> +
> +static inline int is_pmd_device_private_entry(pmd_t pmd)
> +{
> +	return is_swap_pmd(pmd) && is_device_private_entry(pmd_to_swp_entry(pmd));
> +}
> +
> +#else /* CONFIG_ZONE_DEVICE && CONFIG_ARCH_ENABLE_THP_MIGRATION */
> +
> +static inline int is_pmd_device_private_entry(pmd_t pmd)
> +{
> +	return 0;
> +}
> +
> +#endif /* CONFIG_ZONE_DEVICE && CONFIG_ARCH_ENABLE_THP_MIGRATION */
> +
>  static inline int non_swap_entry(swp_entry_t entry)
>  {
>  	return swp_type(entry) >= MAX_SWAPFILES;
> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
> index 761725bc713c..297f1e034045 100644
> --- a/lib/test_hmm.c
> +++ b/lib/test_hmm.c
> @@ -1408,7 +1408,7 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>  	 * the mirror but here we use it to hold the page for the simulated
>  	 * device memory and that page holds the pointer to the mirror.
>  	 */
> -	rpage = vmf->page->zone_device_data;
> +	rpage = folio_page(page_folio(vmf->page), 0)->zone_device_data;
>  	dmirror = rpage->zone_device_data;
>  
>  	/* FIXME demonstrate how we can adjust migrate range */
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 9c38a95e9f09..2495e3fdbfae 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1711,8 +1711,11 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  	if (unlikely(is_swap_pmd(pmd))) {
>  		swp_entry_t entry = pmd_to_swp_entry(pmd);
>  
> -		VM_BUG_ON(!is_pmd_migration_entry(pmd));
> -		if (!is_readable_migration_entry(entry)) {
> +		VM_WARN_ON(!is_pmd_migration_entry(pmd) &&
> +				!is_pmd_device_private_entry(pmd));
> +
> +		if (is_migration_entry(entry) &&
> +			is_writable_migration_entry(entry)) {
>  			entry = make_readable_migration_entry(
>  							swp_offset(entry));
>  			pmd = swp_entry_to_pmd(entry);
> @@ -1722,6 +1725,32 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  				pmd = pmd_swp_mkuffd_wp(pmd);
>  			set_pmd_at(src_mm, addr, src_pmd, pmd);
>  		}
> +
> +		if (is_device_private_entry(entry)) {
> +			if (is_writable_device_private_entry(entry)) {
> +				entry = make_readable_device_private_entry(
> +					swp_offset(entry));
> +				pmd = swp_entry_to_pmd(entry);
> +
> +				if (pmd_swp_soft_dirty(*src_pmd))
> +					pmd = pmd_swp_mksoft_dirty(pmd);
> +				if (pmd_swp_uffd_wp(*src_pmd))
> +					pmd = pmd_swp_mkuffd_wp(pmd);
> +				set_pmd_at(src_mm, addr, src_pmd, pmd);
> +			}
> +
> +			src_folio = pfn_swap_entry_folio(entry);
> +			VM_WARN_ON(!folio_test_large(src_folio));
> +
> +			folio_get(src_folio);
> +			/*
> +			 * folio_try_dup_anon_rmap_pmd does not fail for
> +			 * device private entries.
> +			 */
> +			VM_WARN_ON(folio_try_dup_anon_rmap_pmd(src_folio,
> +					  &src_folio->page, dst_vma, src_vma));

VM_WARN_ON compiles out in non-debug builds. I hit this running the
fork selftest I shared with a non-debug build.

Matt 

[1] https://elixir.bootlin.com/linux/v6.16.3/source/include/linux/mmdebug.h#L112
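
For reference, with CONFIG_DEBUG_VM disabled the definitions in [1]
reduce to roughly the following (paraphrased), so the condition passed
to VM_WARN_ON() is only type-checked and never evaluated at run time:

  #ifdef CONFIG_DEBUG_VM
  #define VM_WARN_ON(cond)	(void)WARN_ON(cond)
  #else
  /* Type-checks 'cond' via sizeof(); the expression never runs. */
  #define VM_WARN_ON(cond)	BUILD_BUG_ON_INVALID(cond)
  #endif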

> +		}
> +
>  		add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
>  		mm_inc_nr_ptes(dst_mm);
>  		pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
> @@ -2219,15 +2248,22 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  			folio_remove_rmap_pmd(folio, page, vma);
>  			WARN_ON_ONCE(folio_mapcount(folio) < 0);
>  			VM_BUG_ON_PAGE(!PageHead(page), page);
> -		} else if (thp_migration_supported()) {
> +		} else if (is_pmd_migration_entry(orig_pmd) ||
> +				is_pmd_device_private_entry(orig_pmd)) {
>  			swp_entry_t entry;
>  
> -			VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
>  			entry = pmd_to_swp_entry(orig_pmd);
>  			folio = pfn_swap_entry_folio(entry);
>  			flush_needed = 0;
> -		} else
> -			WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
> +
> +			if (!thp_migration_supported())
> +				WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
> +
> +			if (is_pmd_device_private_entry(orig_pmd)) {
> +				folio_remove_rmap_pmd(folio, &folio->page, vma);
> +				WARN_ON_ONCE(folio_mapcount(folio) < 0);
> +			}
> +		}
>  
>  		if (folio_test_anon(folio)) {
>  			zap_deposited_table(tlb->mm, pmd);
> @@ -2247,6 +2283,15 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  				folio_mark_accessed(folio);
>  		}
>  
> +		/*
> +		 * Do a folio put on zone device private pages after
> +		 * changes to mm_counter, because the folio_put() will
> +		 * clean folio->mapping and the folio_test_anon() check
> +		 * will not be usable.
> +		 */
> +		if (folio_is_device_private(folio))
> +			folio_put(folio);
> +
>  		spin_unlock(ptl);
>  		if (flush_needed)
>  			tlb_remove_page_size(tlb, &folio->page, HPAGE_PMD_SIZE);
> @@ -2375,7 +2420,8 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		struct folio *folio = pfn_swap_entry_folio(entry);
>  		pmd_t newpmd;
>  
> -		VM_BUG_ON(!is_pmd_migration_entry(*pmd));
> +		VM_WARN_ON(!is_pmd_migration_entry(*pmd) &&
> +			   !folio_is_device_private(folio));
>  		if (is_writable_migration_entry(entry)) {
>  			/*
>  			 * A protection check is difficult so
> @@ -2388,6 +2434,10 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  			newpmd = swp_entry_to_pmd(entry);
>  			if (pmd_swp_soft_dirty(*pmd))
>  				newpmd = pmd_swp_mksoft_dirty(newpmd);
> +		} else if (is_writable_device_private_entry(entry)) {
> +			entry = make_readable_device_private_entry(
> +							swp_offset(entry));
> +			newpmd = swp_entry_to_pmd(entry);
>  		} else {
>  			newpmd = *pmd;
>  		}
> @@ -2842,16 +2892,19 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  	struct page *page;
>  	pgtable_t pgtable;
>  	pmd_t old_pmd, _pmd;
> -	bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
> -	bool anon_exclusive = false, dirty = false;
> +	bool young, write, soft_dirty, uffd_wp = false;
> +	bool anon_exclusive = false, dirty = false, present = false;
>  	unsigned long addr;
>  	pte_t *pte;
>  	int i;
> +	swp_entry_t swp_entry;
>  
>  	VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
>  	VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
>  	VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma);
> -	VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd));
> +
> +	VM_WARN_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd)
> +			&& !(is_pmd_device_private_entry(*pmd)));
>  
>  	count_vm_event(THP_SPLIT_PMD);
>  
> @@ -2899,18 +2952,45 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  		return __split_huge_zero_page_pmd(vma, haddr, pmd);
>  	}
>  
> -	pmd_migration = is_pmd_migration_entry(*pmd);
> -	if (unlikely(pmd_migration)) {
> -		swp_entry_t entry;
>  
> +	present = pmd_present(*pmd);
> +	if (unlikely(!present)) {
> +		swp_entry = pmd_to_swp_entry(*pmd);
>  		old_pmd = *pmd;
> -		entry = pmd_to_swp_entry(old_pmd);
> -		page = pfn_swap_entry_to_page(entry);
> -		write = is_writable_migration_entry(entry);
> -		if (PageAnon(page))
> -			anon_exclusive = is_readable_exclusive_migration_entry(entry);
> -		young = is_migration_entry_young(entry);
> -		dirty = is_migration_entry_dirty(entry);
> +
> +		folio = pfn_swap_entry_folio(swp_entry);
> +		VM_WARN_ON(!is_migration_entry(swp_entry) &&
> +				!is_device_private_entry(swp_entry));
> +		page = pfn_swap_entry_to_page(swp_entry);
> +
> +		if (is_pmd_migration_entry(old_pmd)) {
> +			write = is_writable_migration_entry(swp_entry);
> +			if (PageAnon(page))
> +				anon_exclusive =
> +					is_readable_exclusive_migration_entry(
> +								swp_entry);
> +			young = is_migration_entry_young(swp_entry);
> +			dirty = is_migration_entry_dirty(swp_entry);
> +		} else if (is_pmd_device_private_entry(old_pmd)) {
> +			write = is_writable_device_private_entry(swp_entry);
> +			anon_exclusive = PageAnonExclusive(page);
> +			if (freeze && anon_exclusive &&
> +			    folio_try_share_anon_rmap_pmd(folio, page))
> +				freeze = false;
> +			if (!freeze) {
> +				rmap_t rmap_flags = RMAP_NONE;
> +
> +				if (anon_exclusive)
> +					rmap_flags |= RMAP_EXCLUSIVE;
> +
> +				folio_ref_add(folio, HPAGE_PMD_NR - 1);
> +				if (anon_exclusive)
> +					rmap_flags |= RMAP_EXCLUSIVE;
> +				folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
> +						 vma, haddr, rmap_flags);
> +			}
> +		}
> +
>  		soft_dirty = pmd_swp_soft_dirty(old_pmd);
>  		uffd_wp = pmd_swp_uffd_wp(old_pmd);
>  	} else {
> @@ -2996,30 +3076,49 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  	 * Note that NUMA hinting access restrictions are not transferred to
>  	 * avoid any possibility of altering permissions across VMAs.
>  	 */
> -	if (freeze || pmd_migration) {
> +	if (freeze || !present) {
>  		for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
>  			pte_t entry;
> -			swp_entry_t swp_entry;
> -
> -			if (write)
> -				swp_entry = make_writable_migration_entry(
> -							page_to_pfn(page + i));
> -			else if (anon_exclusive)
> -				swp_entry = make_readable_exclusive_migration_entry(
> -							page_to_pfn(page + i));
> -			else
> -				swp_entry = make_readable_migration_entry(
> -							page_to_pfn(page + i));
> -			if (young)
> -				swp_entry = make_migration_entry_young(swp_entry);
> -			if (dirty)
> -				swp_entry = make_migration_entry_dirty(swp_entry);
> -			entry = swp_entry_to_pte(swp_entry);
> -			if (soft_dirty)
> -				entry = pte_swp_mksoft_dirty(entry);
> -			if (uffd_wp)
> -				entry = pte_swp_mkuffd_wp(entry);
> -
> +			if (freeze || is_migration_entry(swp_entry)) {
> +				if (write)
> +					swp_entry = make_writable_migration_entry(
> +								page_to_pfn(page + i));
> +				else if (anon_exclusive)
> +					swp_entry = make_readable_exclusive_migration_entry(
> +								page_to_pfn(page + i));
> +				else
> +					swp_entry = make_readable_migration_entry(
> +								page_to_pfn(page + i));
> +				if (young)
> +					swp_entry = make_migration_entry_young(swp_entry);
> +				if (dirty)
> +					swp_entry = make_migration_entry_dirty(swp_entry);
> +				entry = swp_entry_to_pte(swp_entry);
> +				if (soft_dirty)
> +					entry = pte_swp_mksoft_dirty(entry);
> +				if (uffd_wp)
> +					entry = pte_swp_mkuffd_wp(entry);
> +			} else {
> +				/*
> +				 * anon_exclusive was already propagated to the relevant
> +				 * pages corresponding to the pte entries when freeze
> +				 * is false.
> +				 */
> +				if (write)
> +					swp_entry = make_writable_device_private_entry(
> +								page_to_pfn(page + i));
> +				else
> +					swp_entry = make_readable_device_private_entry(
> +								page_to_pfn(page + i));
> +				/*
> +				 * Young and dirty bits are not progated via swp_entry
> +				 */
> +				entry = swp_entry_to_pte(swp_entry);
> +				if (soft_dirty)
> +					entry = pte_swp_mksoft_dirty(entry);
> +				if (uffd_wp)
> +					entry = pte_swp_mkuffd_wp(entry);
> +			}
>  			VM_WARN_ON(!pte_none(ptep_get(pte + i)));
>  			set_pte_at(mm, addr, pte + i, entry);
>  		}
> @@ -3046,7 +3145,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  	}
>  	pte_unmap(pte);
>  
> -	if (!pmd_migration)
> +	if (present)
>  		folio_remove_rmap_pmd(folio, page, vma);
>  	if (freeze)
>  		put_page(page);
> @@ -3058,8 +3157,10 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
>  			   pmd_t *pmd, bool freeze)
>  {
> +
>  	VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
> -	if (pmd_trans_huge(*pmd) || is_pmd_migration_entry(*pmd))
> +	if (pmd_trans_huge(*pmd) || is_pmd_migration_entry(*pmd) ||
> +			(is_pmd_device_private_entry(*pmd)))
>  		__split_huge_pmd_locked(vma, pmd, address, freeze);
>  }
>  
> @@ -3238,6 +3339,9 @@ static void lru_add_split_folio(struct folio *folio, struct folio *new_folio,
>  	VM_BUG_ON_FOLIO(folio_test_lru(new_folio), folio);
>  	lockdep_assert_held(&lruvec->lru_lock);
>  
> +	if (folio_is_device_private(folio))
> +		return;
> +
>  	if (list) {
>  		/* page reclaim is reclaiming a huge page */
>  		VM_WARN_ON(folio_test_lru(folio));
> @@ -3252,6 +3356,7 @@ static void lru_add_split_folio(struct folio *folio, struct folio *new_folio,
>  			list_add_tail(&new_folio->lru, &folio->lru);
>  		folio_set_lru(new_folio);
>  	}
> +
>  }
>  
>  /* Racy check whether the huge page can be split */
> @@ -3727,7 +3832,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>  
>  	/* Prevent deferred_split_scan() touching ->_refcount */
>  	spin_lock(&ds_queue->split_queue_lock);
> -	if (folio_ref_freeze(folio, 1 + extra_pins)) {
> +	if (folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio))) {
>  		struct address_space *swap_cache = NULL;
>  		struct lruvec *lruvec;
>  		int expected_refs;
> @@ -3858,8 +3963,9 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>  	if (nr_shmem_dropped)
>  		shmem_uncharge(mapping->host, nr_shmem_dropped);
>  
> -	if (!ret && is_anon)
> +	if (!ret && is_anon && !folio_is_device_private(folio))
>  		remap_flags = RMP_USE_SHARED_ZEROPAGE;
> +
>  	remap_page(folio, 1 << order, remap_flags);
>  
>  	/*
> @@ -4603,7 +4709,10 @@ int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
>  		return 0;
>  
>  	flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
> -	pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
> +	if (unlikely(is_pmd_device_private_entry(*pvmw->pmd)))
> +		pmdval = pmdp_huge_clear_flush(vma, address, pvmw->pmd);
> +	else
> +		pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
>  
>  	/* See folio_try_share_anon_rmap_pmd(): invalidate PMD first. */
>  	anon_exclusive = folio_test_anon(folio) && PageAnonExclusive(page);
> @@ -4653,6 +4762,17 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
>  	entry = pmd_to_swp_entry(*pvmw->pmd);
>  	folio_get(folio);
>  	pmde = folio_mk_pmd(folio, READ_ONCE(vma->vm_page_prot));
> +
> +	if (folio_is_device_private(folio)) {
> +		if (pmd_write(pmde))
> +			entry = make_writable_device_private_entry(
> +							page_to_pfn(new));
> +		else
> +			entry = make_readable_device_private_entry(
> +							page_to_pfn(new));
> +		pmde = swp_entry_to_pmd(entry);
> +	}
> +
>  	if (pmd_swp_soft_dirty(*pvmw->pmd))
>  		pmde = pmd_mksoft_dirty(pmde);
>  	if (is_writable_migration_entry(entry))
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index e05e14d6eacd..0ed337f94fcd 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -136,6 +136,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  			 * page table entry. Other special swap entries are not
>  			 * migratable, and we ignore regular swapped page.
>  			 */
> +			struct folio *folio;
> +
>  			entry = pte_to_swp_entry(pte);
>  			if (!is_device_private_entry(entry))
>  				goto next;
> @@ -147,6 +149,51 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  			    pgmap->owner != migrate->pgmap_owner)
>  				goto next;
>  
> +			folio = page_folio(page);
> +			if (folio_test_large(folio)) {
> +				struct folio *new_folio;
> +				struct folio *new_fault_folio = NULL;
> +
> +				/*
> +				 * The reason for finding pmd present with a
> +				 * device private pte and a large folio for the
> +				 * pte is partial unmaps. Split the folio now
> +				 * for the migration to be handled correctly
> +				 */
> +				pte_unmap_unlock(ptep, ptl);
> +
> +				folio_get(folio);
> +				if (folio != fault_folio)
> +					folio_lock(folio);
> +				if (split_folio(folio)) {
> +					if (folio != fault_folio)
> +						folio_unlock(folio);
> +					ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> +					goto next;
> +				}
> +
> +				new_folio = page_folio(page);
> +				if (fault_folio)
> +					new_fault_folio = page_folio(migrate->fault_page);
> +
> +				/*
> +				 * Ensure the lock is held on the correct
> +				 * folio after the split
> +				 */
> +				if (!new_fault_folio) {
> +					folio_unlock(folio);
> +					folio_put(folio);
> +				} else if (folio != new_fault_folio) {
> +					folio_get(new_fault_folio);
> +					folio_lock(new_fault_folio);
> +					folio_unlock(folio);
> +					folio_put(folio);
> +				}
> +
> +				addr = start;
> +				goto again;
> +			}
> +
>  			mpfn = migrate_pfn(page_to_pfn(page)) |
>  					MIGRATE_PFN_MIGRATE;
>  			if (is_writable_device_private_entry(entry))
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index e981a1a292d2..246e6c211f34 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -250,12 +250,11 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>  			pvmw->ptl = pmd_lock(mm, pvmw->pmd);
>  			pmde = *pvmw->pmd;
>  			if (!pmd_present(pmde)) {
> -				swp_entry_t entry;
> +				swp_entry_t entry = pmd_to_swp_entry(pmde);
>  
>  				if (!thp_migration_supported() ||
>  				    !(pvmw->flags & PVMW_MIGRATION))
>  					return not_found(pvmw);
> -				entry = pmd_to_swp_entry(pmde);
>  				if (!is_migration_entry(entry) ||
>  				    !check_pmd(swp_offset_pfn(entry), pvmw))
>  					return not_found(pvmw);
> @@ -277,6 +276,16 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>  			 * cannot return prematurely, while zap_huge_pmd() has
>  			 * cleared *pmd but not decremented compound_mapcount().
>  			 */
> +			swp_entry_t entry;
> +
> +			entry = pmd_to_swp_entry(pmde);
> +
> +			if (is_device_private_entry(entry) &&
> +				(pvmw->flags & PVMW_THP_DEVICE_PRIVATE)) {
> +				pvmw->ptl = pmd_lock(mm, pvmw->pmd);
> +				return true;
> +			}
> +
>  			if ((pvmw->flags & PVMW_SYNC) &&
>  			    thp_vma_suitable_order(vma, pvmw->address,
>  						   PMD_ORDER) &&
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index 567e2d084071..604e8206a2ec 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -292,6 +292,12 @@ pte_t *___pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
>  		*pmdvalp = pmdval;
>  	if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
>  		goto nomap;
> +	if (is_swap_pmd(pmdval)) {
> +		swp_entry_t entry = pmd_to_swp_entry(pmdval);
> +
> +		if (is_device_private_entry(entry))
> +			goto nomap;
> +	}
>  	if (unlikely(pmd_trans_huge(pmdval)))
>  		goto nomap;
>  	if (unlikely(pmd_bad(pmdval))) {
> diff --git a/mm/rmap.c b/mm/rmap.c
> index b5837075b6e0..f40e45564295 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2285,7 +2285,8 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>  		     unsigned long address, void *arg)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
> -	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
> +	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address,
> +				PVMW_THP_DEVICE_PRIVATE);
>  	bool anon_exclusive, writable, ret = true;
>  	pte_t pteval;
>  	struct page *subpage;
> @@ -2330,6 +2331,10 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>  	while (page_vma_mapped_walk(&pvmw)) {
>  		/* PMD-mapped THP migration entry */
>  		if (!pvmw.pte) {
> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> +			unsigned long pfn;
> +#endif
> +
>  			if (flags & TTU_SPLIT_HUGE_PMD) {
>  				split_huge_pmd_locked(vma, pvmw.address,
>  						      pvmw.pmd, true);
> @@ -2338,8 +2343,21 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>  				break;
>  			}
>  #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> -			subpage = folio_page(folio,
> -				pmd_pfn(*pvmw.pmd) - folio_pfn(folio));
> +			/*
> +			 * Zone device private folios do not work well with
> +			 * pmd_pfn() on some architectures due to pte
> +			 * inversion.
> +			 */
> +			if (is_pmd_device_private_entry(*pvmw.pmd)) {
> +				swp_entry_t entry = pmd_to_swp_entry(*pvmw.pmd);
> +
> +				pfn = swp_offset_pfn(entry);
> +			} else {
> +				pfn = pmd_pfn(*pvmw.pmd);
> +			}
> +
> +			subpage = folio_page(folio, pfn - folio_pfn(folio));
> +
>  			VM_BUG_ON_FOLIO(folio_test_hugetlb(folio) ||
>  					!folio_test_pmd_mappable(folio), folio);
>  
> -- 
> 2.50.1
> 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [v3 02/11] mm/thp: zone_device awareness in THP handling code
  2025-08-28 20:05   ` Matthew Brost
@ 2025-08-28 20:12     ` David Hildenbrand
  2025-08-28 20:17       ` Matthew Brost
  0 siblings, 1 reply; 36+ messages in thread
From: David Hildenbrand @ 2025-08-28 20:12 UTC (permalink / raw)
  To: Matthew Brost, Balbir Singh
  Cc: dri-devel, linux-mm, linux-kernel, Andrew Morton, Zi Yan,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Oscar Salvador, Lorenzo Stoakes, Baolin Wang,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Ralph Campbell, Mika Penttilä, Francois Dugast

On 28.08.25 22:05, Matthew Brost wrote:
> On Tue, Aug 12, 2025 at 12:40:27PM +1000, Balbir Singh wrote:
>> Make THP handling code in the mm subsystem for THP pages aware of zone
>> device pages. Although the code is designed to be generic when it comes
>> to handling splitting of pages, the code is designed to work for THP
>> page sizes corresponding to HPAGE_PMD_NR.
>>
>> Modify page_vma_mapped_walk() to return true when a zone device huge
>> entry is present, enabling try_to_migrate() and other code migration
>> paths to appropriately process the entry. page_vma_mapped_walk() will
>> return true for zone device private large folios only when
>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>> not zone device private pages from having to add awareness. The key
>> callback that needs this flag is try_to_migrate_one(). The other
>> callbacks page idle, damon use it for setting young/dirty bits, which is
>> not significant when it comes to pmd level bit harvesting.
>>
>> pmd_pfn() does not work well with zone device entries, use
>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>> entries.
>>
>> Support partial unmapping of zone device private entries, which happens
>> via munmap(). munmap() causes the device private entry pmd to be split,
>> but the corresponding folio is not split. Deferred split does not work for
>> zone device private folios due to the need to split during fault
>> handling. Get migrate_vma_collect_pmd() to handle this case by splitting
>> partially unmapped device private folios.
>>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
>> Cc: Rakie Kim <rakie.kim@sk.com>
>> Cc: Byungchul Park <byungchul@sk.com>
>> Cc: Gregory Price <gourry@gourry.net>
>> Cc: Ying Huang <ying.huang@linux.alibaba.com>
>> Cc: Alistair Popple <apopple@nvidia.com>
>> Cc: Oscar Salvador <osalvador@suse.de>
>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
>> Cc: Nico Pache <npache@redhat.com>
>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>> Cc: Dev Jain <dev.jain@arm.com>
>> Cc: Barry Song <baohua@kernel.org>
>> Cc: Lyude Paul <lyude@redhat.com>
>> Cc: Danilo Krummrich <dakr@kernel.org>
>> Cc: David Airlie <airlied@gmail.com>
>> Cc: Simona Vetter <simona@ffwll.ch>
>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>> Cc: Mika Penttilä <mpenttil@redhat.com>
>> Cc: Matthew Brost <matthew.brost@intel.com>
>> Cc: Francois Dugast <francois.dugast@intel.com>
>>
>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>> ---
>>   include/linux/rmap.h    |   2 +
>>   include/linux/swapops.h |  17 ++++
>>   lib/test_hmm.c          |   2 +-
>>   mm/huge_memory.c        | 214 +++++++++++++++++++++++++++++++---------
>>   mm/migrate_device.c     |  47 +++++++++
>>   mm/page_vma_mapped.c    |  13 ++-
>>   mm/pgtable-generic.c    |   6 ++
>>   mm/rmap.c               |  24 ++++-
>>   8 files changed, 272 insertions(+), 53 deletions(-)
>>
>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>> index 6cd020eea37a..dfb7aae3d77b 100644
>> --- a/include/linux/rmap.h
>> +++ b/include/linux/rmap.h
>> @@ -927,6 +927,8 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
>>   #define PVMW_SYNC		(1 << 0)
>>   /* Look for migration entries rather than present PTEs */
>>   #define PVMW_MIGRATION		(1 << 1)
>> +/* Look for device private THP entries */
>> +#define PVMW_THP_DEVICE_PRIVATE	(1 << 2)
>>   
>>   struct page_vma_mapped_walk {
>>   	unsigned long pfn;
>> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
>> index 64ea151a7ae3..2641c01bd5d2 100644
>> --- a/include/linux/swapops.h
>> +++ b/include/linux/swapops.h
>> @@ -563,6 +563,7 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
>>   {
>>   	return is_swap_pmd(pmd) && is_migration_entry(pmd_to_swp_entry(pmd));
>>   }
>> +
>>   #else  /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
>>   static inline int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
>>   		struct page *page)
>> @@ -594,6 +595,22 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
>>   }
>>   #endif  /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
>>   
>> +#if defined(CONFIG_ZONE_DEVICE) && defined(CONFIG_ARCH_ENABLE_THP_MIGRATION)
>> +
>> +static inline int is_pmd_device_private_entry(pmd_t pmd)
>> +{
>> +	return is_swap_pmd(pmd) && is_device_private_entry(pmd_to_swp_entry(pmd));
>> +}
>> +
>> +#else /* CONFIG_ZONE_DEVICE && CONFIG_ARCH_ENABLE_THP_MIGRATION */
>> +
>> +static inline int is_pmd_device_private_entry(pmd_t pmd)
>> +{
>> +	return 0;
>> +}
>> +
>> +#endif /* CONFIG_ZONE_DEVICE && CONFIG_ARCH_ENABLE_THP_MIGRATION */
>> +
>>   static inline int non_swap_entry(swp_entry_t entry)
>>   {
>>   	return swp_type(entry) >= MAX_SWAPFILES;
>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>> index 761725bc713c..297f1e034045 100644
>> --- a/lib/test_hmm.c
>> +++ b/lib/test_hmm.c
>> @@ -1408,7 +1408,7 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>>   	 * the mirror but here we use it to hold the page for the simulated
>>   	 * device memory and that page holds the pointer to the mirror.
>>   	 */
>> -	rpage = vmf->page->zone_device_data;
>> +	rpage = folio_page(page_folio(vmf->page), 0)->zone_device_data;
>>   	dmirror = rpage->zone_device_data;
>>   
>>   	/* FIXME demonstrate how we can adjust migrate range */
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 9c38a95e9f09..2495e3fdbfae 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1711,8 +1711,11 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>>   	if (unlikely(is_swap_pmd(pmd))) {
>>   		swp_entry_t entry = pmd_to_swp_entry(pmd);
>>   
>> -		VM_BUG_ON(!is_pmd_migration_entry(pmd));
>> -		if (!is_readable_migration_entry(entry)) {
>> +		VM_WARN_ON(!is_pmd_migration_entry(pmd) &&
>> +				!is_pmd_device_private_entry(pmd));
>> +
>> +		if (is_migration_entry(entry) &&
>> +			is_writable_migration_entry(entry)) {
>>   			entry = make_readable_migration_entry(
>>   							swp_offset(entry));
>>   			pmd = swp_entry_to_pmd(entry);
>> @@ -1722,6 +1725,32 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>>   				pmd = pmd_swp_mkuffd_wp(pmd);
>>   			set_pmd_at(src_mm, addr, src_pmd, pmd);
>>   		}
>> +
>> +		if (is_device_private_entry(entry)) {
>> +			if (is_writable_device_private_entry(entry)) {
>> +				entry = make_readable_device_private_entry(
>> +					swp_offset(entry));
>> +				pmd = swp_entry_to_pmd(entry);
>> +
>> +				if (pmd_swp_soft_dirty(*src_pmd))
>> +					pmd = pmd_swp_mksoft_dirty(pmd);
>> +				if (pmd_swp_uffd_wp(*src_pmd))
>> +					pmd = pmd_swp_mkuffd_wp(pmd);
>> +				set_pmd_at(src_mm, addr, src_pmd, pmd);
>> +			}
>> +
>> +			src_folio = pfn_swap_entry_folio(entry);
>> +			VM_WARN_ON(!folio_test_large(src_folio));
>> +
>> +			folio_get(src_folio);
>> +			/*
>> +			 * folio_try_dup_anon_rmap_pmd does not fail for
>> +			 * device private entries.
>> +			 */
>> +			VM_WARN_ON(folio_try_dup_anon_rmap_pmd(src_folio,
>> +					  &src_folio->page, dst_vma, src_vma));
> 
> VM_WARN_ON compiles out in non-debug builds. I hit this running the
> fork selftest I shared with a non-debug build.


folio_try_dup_anon_rmap_pmd() will never fail for 
folio_is_device_private(folio) -- unless something is deeply messed up 
that we wouldn't identify this folio as being device-private.
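
For reference, that is guaranteed by the pinning check in
__folio_try_dup_anon_rmap() (paraphrasing include/linux/rmap.h), which
never treats device-private folios as possibly pinned, so the -EBUSY
path cannot be reached for them:

  maybe_pinned = likely(!folio_is_device_private(folio)) &&
		 unlikely(folio_needs_cow_for_dma(src_vma, folio));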

Can you elaborate, what were you able to trigger, and in what kind of 
environment?

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [v3 02/11] mm/thp: zone_device awareness in THP handling code
  2025-08-28 20:12     ` David Hildenbrand
@ 2025-08-28 20:17       ` Matthew Brost
  2025-08-28 20:22         ` David Hildenbrand
  0 siblings, 1 reply; 36+ messages in thread
From: Matthew Brost @ 2025-08-28 20:17 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Balbir Singh, dri-devel, linux-mm, linux-kernel, Andrew Morton,
	Zi Yan, Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price,
	Ying Huang, Alistair Popple, Oscar Salvador, Lorenzo Stoakes,
	Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Ralph Campbell, Mika Penttilä,
	Francois Dugast

On Thu, Aug 28, 2025 at 10:12:53PM +0200, David Hildenbrand wrote:
> On 28.08.25 22:05, Matthew Brost wrote:
> > On Tue, Aug 12, 2025 at 12:40:27PM +1000, Balbir Singh wrote:
> > > Make THP handling code in the mm subsystem for THP pages aware of zone
> > > device pages. Although the code is designed to be generic when it comes
> > > to handling splitting of pages, the code is designed to work for THP
> > > page sizes corresponding to HPAGE_PMD_NR.
> > > 
> > > Modify page_vma_mapped_walk() to return true when a zone device huge
> > > entry is present, enabling try_to_migrate() and other code migration
> > > paths to appropriately process the entry. page_vma_mapped_walk() will
> > > return true for zone device private large folios only when
> > > PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
> > > not zone device private pages from having to add awareness. The key
> > > callback that needs this flag is try_to_migrate_one(). The other
> > > callbacks page idle, damon use it for setting young/dirty bits, which is
> > > not significant when it comes to pmd level bit harvesting.
> > > 
> > > pmd_pfn() does not work well with zone device entries, use
> > > pfn_pmd_entry_to_swap() for checking and comparison as for zone device
> > > entries.
> > > 
> > > Support partial unmapping of zone device private entries, which happens
> > > via munmap(). munmap() causes the device private entry pmd to be split,
> > > but the corresponding folio is not split. Deferred split does not work for
> > > zone device private folios due to the need to split during fault
> > > handling. Get migrate_vma_collect_pmd() to handle this case by splitting
> > > partially unmapped device private folios.
> > > 
> > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > Cc: David Hildenbrand <david@redhat.com>
> > > Cc: Zi Yan <ziy@nvidia.com>
> > > Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
> > > Cc: Rakie Kim <rakie.kim@sk.com>
> > > Cc: Byungchul Park <byungchul@sk.com>
> > > Cc: Gregory Price <gourry@gourry.net>
> > > Cc: Ying Huang <ying.huang@linux.alibaba.com>
> > > Cc: Alistair Popple <apopple@nvidia.com>
> > > Cc: Oscar Salvador <osalvador@suse.de>
> > > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> > > Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> > > Cc: Nico Pache <npache@redhat.com>
> > > Cc: Ryan Roberts <ryan.roberts@arm.com>
> > > Cc: Dev Jain <dev.jain@arm.com>
> > > Cc: Barry Song <baohua@kernel.org>
> > > Cc: Lyude Paul <lyude@redhat.com>
> > > Cc: Danilo Krummrich <dakr@kernel.org>
> > > Cc: David Airlie <airlied@gmail.com>
> > > Cc: Simona Vetter <simona@ffwll.ch>
> > > Cc: Ralph Campbell <rcampbell@nvidia.com>
> > > Cc: Mika Penttilä <mpenttil@redhat.com>
> > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > Cc: Francois Dugast <francois.dugast@intel.com>
> > > 
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> > > ---
> > >   include/linux/rmap.h    |   2 +
> > >   include/linux/swapops.h |  17 ++++
> > >   lib/test_hmm.c          |   2 +-
> > >   mm/huge_memory.c        | 214 +++++++++++++++++++++++++++++++---------
> > >   mm/migrate_device.c     |  47 +++++++++
> > >   mm/page_vma_mapped.c    |  13 ++-
> > >   mm/pgtable-generic.c    |   6 ++
> > >   mm/rmap.c               |  24 ++++-
> > >   8 files changed, 272 insertions(+), 53 deletions(-)
> > > 
> > > diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> > > index 6cd020eea37a..dfb7aae3d77b 100644
> > > --- a/include/linux/rmap.h
> > > +++ b/include/linux/rmap.h
> > > @@ -927,6 +927,8 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
> > >   #define PVMW_SYNC		(1 << 0)
> > >   /* Look for migration entries rather than present PTEs */
> > >   #define PVMW_MIGRATION		(1 << 1)
> > > +/* Look for device private THP entries */
> > > +#define PVMW_THP_DEVICE_PRIVATE	(1 << 2)
> > >   struct page_vma_mapped_walk {
> > >   	unsigned long pfn;
> > > diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> > > index 64ea151a7ae3..2641c01bd5d2 100644
> > > --- a/include/linux/swapops.h
> > > +++ b/include/linux/swapops.h
> > > @@ -563,6 +563,7 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
> > >   {
> > >   	return is_swap_pmd(pmd) && is_migration_entry(pmd_to_swp_entry(pmd));
> > >   }
> > > +
> > >   #else  /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
> > >   static inline int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
> > >   		struct page *page)
> > > @@ -594,6 +595,22 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
> > >   }
> > >   #endif  /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
> > > +#if defined(CONFIG_ZONE_DEVICE) && defined(CONFIG_ARCH_ENABLE_THP_MIGRATION)
> > > +
> > > +static inline int is_pmd_device_private_entry(pmd_t pmd)
> > > +{
> > > +	return is_swap_pmd(pmd) && is_device_private_entry(pmd_to_swp_entry(pmd));
> > > +}
> > > +
> > > +#else /* CONFIG_ZONE_DEVICE && CONFIG_ARCH_ENABLE_THP_MIGRATION */
> > > +
> > > +static inline int is_pmd_device_private_entry(pmd_t pmd)
> > > +{
> > > +	return 0;
> > > +}
> > > +
> > > +#endif /* CONFIG_ZONE_DEVICE && CONFIG_ARCH_ENABLE_THP_MIGRATION */
> > > +
> > >   static inline int non_swap_entry(swp_entry_t entry)
> > >   {
> > >   	return swp_type(entry) >= MAX_SWAPFILES;
> > > diff --git a/lib/test_hmm.c b/lib/test_hmm.c
> > > index 761725bc713c..297f1e034045 100644
> > > --- a/lib/test_hmm.c
> > > +++ b/lib/test_hmm.c
> > > @@ -1408,7 +1408,7 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
> > >   	 * the mirror but here we use it to hold the page for the simulated
> > >   	 * device memory and that page holds the pointer to the mirror.
> > >   	 */
> > > -	rpage = vmf->page->zone_device_data;
> > > +	rpage = folio_page(page_folio(vmf->page), 0)->zone_device_data;
> > >   	dmirror = rpage->zone_device_data;
> > >   	/* FIXME demonstrate how we can adjust migrate range */
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index 9c38a95e9f09..2495e3fdbfae 100644
> > > --- a/mm/huge_memory.c
> > > +++ b/mm/huge_memory.c
> > > @@ -1711,8 +1711,11 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > >   	if (unlikely(is_swap_pmd(pmd))) {
> > >   		swp_entry_t entry = pmd_to_swp_entry(pmd);
> > > -		VM_BUG_ON(!is_pmd_migration_entry(pmd));
> > > -		if (!is_readable_migration_entry(entry)) {
> > > +		VM_WARN_ON(!is_pmd_migration_entry(pmd) &&
> > > +				!is_pmd_device_private_entry(pmd));
> > > +
> > > +		if (is_migration_entry(entry) &&
> > > +			is_writable_migration_entry(entry)) {
> > >   			entry = make_readable_migration_entry(
> > >   							swp_offset(entry));
> > >   			pmd = swp_entry_to_pmd(entry);
> > > @@ -1722,6 +1725,32 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > >   				pmd = pmd_swp_mkuffd_wp(pmd);
> > >   			set_pmd_at(src_mm, addr, src_pmd, pmd);
> > >   		}
> > > +
> > > +		if (is_device_private_entry(entry)) {
> > > +			if (is_writable_device_private_entry(entry)) {
> > > +				entry = make_readable_device_private_entry(
> > > +					swp_offset(entry));
> > > +				pmd = swp_entry_to_pmd(entry);
> > > +
> > > +				if (pmd_swp_soft_dirty(*src_pmd))
> > > +					pmd = pmd_swp_mksoft_dirty(pmd);
> > > +				if (pmd_swp_uffd_wp(*src_pmd))
> > > +					pmd = pmd_swp_mkuffd_wp(pmd);
> > > +				set_pmd_at(src_mm, addr, src_pmd, pmd);
> > > +			}
> > > +
> > > +			src_folio = pfn_swap_entry_folio(entry);
> > > +			VM_WARN_ON(!folio_test_large(src_folio));
> > > +
> > > +			folio_get(src_folio);
> > > +			/*
> > > +			 * folio_try_dup_anon_rmap_pmd does not fail for
> > > +			 * device private entries.
> > > +			 */
> > > +			VM_WARN_ON(folio_try_dup_anon_rmap_pmd(src_folio,
> > > +					  &src_folio->page, dst_vma, src_vma));
> > 
> > VM_WARN_ON compiles out in non-debug builds. I hit this running the
> > fork selftest I shared with a non-debug build.
> 
> 
> folio_try_dup_anon_rmap_pmd() will never fail for
> folio_is_device_private(folio) -- unless something is deeply messed up that
> we wouldn't identify this folio as being device-private.
> 
> Can you elaborate, what were you able to trigger, and in what kind of
> environment?
> 

Maybe this was bad phrasing. I compiled the kernel with a non-debug
build and fork() broke for THP device pages because the above call to
folio_try_dup_anon_rmap_pmd() compiled out (i.e., it wasn't called).

Matt

> -- 
> Cheers
> 
> David / dhildenb
> 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [v3 02/11] mm/thp: zone_device awareness in THP handling code
  2025-08-28 20:17       ` Matthew Brost
@ 2025-08-28 20:22         ` David Hildenbrand
  0 siblings, 0 replies; 36+ messages in thread
From: David Hildenbrand @ 2025-08-28 20:22 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Balbir Singh, dri-devel, linux-mm, linux-kernel, Andrew Morton,
	Zi Yan, Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price,
	Ying Huang, Alistair Popple, Oscar Salvador, Lorenzo Stoakes,
	Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Ralph Campbell, Mika Penttilä,
	Francois Dugast

On 28.08.25 22:17, Matthew Brost wrote:
> On Thu, Aug 28, 2025 at 10:12:53PM +0200, David Hildenbrand wrote:
>> On 28.08.25 22:05, Matthew Brost wrote:
>>> On Tue, Aug 12, 2025 at 12:40:27PM +1000, Balbir Singh wrote:
>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>> device pages. Although the code is designed to be generic when it comes
>>>> to handling splitting of pages, the code is designed to work for THP
>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>
>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>> entry is present, enabling try_to_migrate() and other code migration
>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>> return true for zone device private large folios only when
>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>> not zone device private pages from having to add awareness. The key
>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>> not significant when it comes to pmd level bit harvesting.
>>>>
>>>> pmd_pfn() does not work well with zone device entries, use
>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>> entries.
>>>>
>>>> Support partial unmapping of zone device private entries, which happens
>>>> via munmap(). munmap() causes the device private entry pmd to be split,
>>>> but the corresponding folio is not split. Deferred split does not work for
>>>> zone device private folios due to the need to split during fault
>>>> handling. Get migrate_vma_collect_pmd() to handle this case by splitting
>>>> partially unmapped device private folios.
>>>>
>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>>> Cc: David Hildenbrand <david@redhat.com>
>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>> Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
>>>> Cc: Rakie Kim <rakie.kim@sk.com>
>>>> Cc: Byungchul Park <byungchul@sk.com>
>>>> Cc: Gregory Price <gourry@gourry.net>
>>>> Cc: Ying Huang <ying.huang@linux.alibaba.com>
>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>> Cc: Oscar Salvador <osalvador@suse.de>
>>>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
>>>> Cc: Nico Pache <npache@redhat.com>
>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>> Cc: Dev Jain <dev.jain@arm.com>
>>>> Cc: Barry Song <baohua@kernel.org>
>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>> Cc: David Airlie <airlied@gmail.com>
>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>
>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>> ---
>>>>    include/linux/rmap.h    |   2 +
>>>>    include/linux/swapops.h |  17 ++++
>>>>    lib/test_hmm.c          |   2 +-
>>>>    mm/huge_memory.c        | 214 +++++++++++++++++++++++++++++++---------
>>>>    mm/migrate_device.c     |  47 +++++++++
>>>>    mm/page_vma_mapped.c    |  13 ++-
>>>>    mm/pgtable-generic.c    |   6 ++
>>>>    mm/rmap.c               |  24 ++++-
>>>>    8 files changed, 272 insertions(+), 53 deletions(-)
>>>>
>>>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>>>> index 6cd020eea37a..dfb7aae3d77b 100644
>>>> --- a/include/linux/rmap.h
>>>> +++ b/include/linux/rmap.h
>>>> @@ -927,6 +927,8 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
>>>>    #define PVMW_SYNC		(1 << 0)
>>>>    /* Look for migration entries rather than present PTEs */
>>>>    #define PVMW_MIGRATION		(1 << 1)
>>>> +/* Look for device private THP entries */
>>>> +#define PVMW_THP_DEVICE_PRIVATE	(1 << 2)
>>>>    struct page_vma_mapped_walk {
>>>>    	unsigned long pfn;
>>>> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
>>>> index 64ea151a7ae3..2641c01bd5d2 100644
>>>> --- a/include/linux/swapops.h
>>>> +++ b/include/linux/swapops.h
>>>> @@ -563,6 +563,7 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
>>>>    {
>>>>    	return is_swap_pmd(pmd) && is_migration_entry(pmd_to_swp_entry(pmd));
>>>>    }
>>>> +
>>>>    #else  /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
>>>>    static inline int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
>>>>    		struct page *page)
>>>> @@ -594,6 +595,22 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
>>>>    }
>>>>    #endif  /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
>>>> +#if defined(CONFIG_ZONE_DEVICE) && defined(CONFIG_ARCH_ENABLE_THP_MIGRATION)
>>>> +
>>>> +static inline int is_pmd_device_private_entry(pmd_t pmd)
>>>> +{
>>>> +	return is_swap_pmd(pmd) && is_device_private_entry(pmd_to_swp_entry(pmd));
>>>> +}
>>>> +
>>>> +#else /* CONFIG_ZONE_DEVICE && CONFIG_ARCH_ENABLE_THP_MIGRATION */
>>>> +
>>>> +static inline int is_pmd_device_private_entry(pmd_t pmd)
>>>> +{
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +#endif /* CONFIG_ZONE_DEVICE && CONFIG_ARCH_ENABLE_THP_MIGRATION */
>>>> +
>>>>    static inline int non_swap_entry(swp_entry_t entry)
>>>>    {
>>>>    	return swp_type(entry) >= MAX_SWAPFILES;
>>>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>>>> index 761725bc713c..297f1e034045 100644
>>>> --- a/lib/test_hmm.c
>>>> +++ b/lib/test_hmm.c
>>>> @@ -1408,7 +1408,7 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>>>>    	 * the mirror but here we use it to hold the page for the simulated
>>>>    	 * device memory and that page holds the pointer to the mirror.
>>>>    	 */
>>>> -	rpage = vmf->page->zone_device_data;
>>>> +	rpage = folio_page(page_folio(vmf->page), 0)->zone_device_data;
>>>>    	dmirror = rpage->zone_device_data;
>>>>    	/* FIXME demonstrate how we can adjust migrate range */
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index 9c38a95e9f09..2495e3fdbfae 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -1711,8 +1711,11 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>>>>    	if (unlikely(is_swap_pmd(pmd))) {
>>>>    		swp_entry_t entry = pmd_to_swp_entry(pmd);
>>>> -		VM_BUG_ON(!is_pmd_migration_entry(pmd));
>>>> -		if (!is_readable_migration_entry(entry)) {
>>>> +		VM_WARN_ON(!is_pmd_migration_entry(pmd) &&
>>>> +				!is_pmd_device_private_entry(pmd));
>>>> +
>>>> +		if (is_migration_entry(entry) &&
>>>> +			is_writable_migration_entry(entry)) {
>>>>    			entry = make_readable_migration_entry(
>>>>    							swp_offset(entry));
>>>>    			pmd = swp_entry_to_pmd(entry);
>>>> @@ -1722,6 +1725,32 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>>>>    				pmd = pmd_swp_mkuffd_wp(pmd);
>>>>    			set_pmd_at(src_mm, addr, src_pmd, pmd);
>>>>    		}
>>>> +
>>>> +		if (is_device_private_entry(entry)) {
>>>> +			if (is_writable_device_private_entry(entry)) {
>>>> +				entry = make_readable_device_private_entry(
>>>> +					swp_offset(entry));
>>>> +				pmd = swp_entry_to_pmd(entry);
>>>> +
>>>> +				if (pmd_swp_soft_dirty(*src_pmd))
>>>> +					pmd = pmd_swp_mksoft_dirty(pmd);
>>>> +				if (pmd_swp_uffd_wp(*src_pmd))
>>>> +					pmd = pmd_swp_mkuffd_wp(pmd);
>>>> +				set_pmd_at(src_mm, addr, src_pmd, pmd);
>>>> +			}
>>>> +
>>>> +			src_folio = pfn_swap_entry_folio(entry);
>>>> +			VM_WARN_ON(!folio_test_large(src_folio));
>>>> +
>>>> +			folio_get(src_folio);
>>>> +			/*
>>>> +			 * folio_try_dup_anon_rmap_pmd does not fail for
>>>> +			 * device private entries.
>>>> +			 */
>>>> +			VM_WARN_ON(folio_try_dup_anon_rmap_pmd(src_folio,
>>>> +					  &src_folio->page, dst_vma, src_vma));
>>>
>>> VM_WARN_ON compiles out in non-debug builds. I hit this running the
>>> fork selftest I shared with a non-debug build.
>>
>>
>> folio_try_dup_anon_rmap_pmd() will never fail for
>> folio_is_device_private(folio) -- unless something is deeply messed up that
>> we wouldn't identify this folio as being device-private.
>>
>> Can you elaborate, what were you able to trigger, and in what kind of
>> environment?
>>
> 
> Maybe this was bad phrasing. I compilied the kernel with a non-debug
> build and fork() broke for THP device pages because the above call to
> folio_try_dup_anon_rmap_pmd compiled out (i.e., it wasn't called).

Ah, yes!

As I said in my reply, we should not do any kind of WARN here, like in 
the PTE case.
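
A minimal sketch of what that would look like here, mirroring the
PTE-level handling of device-private entries in copy_nonpresent_pte()
(illustration only, not the final patch):

	/* Cannot fail for device-private folios, see copy_nonpresent_pte(). */
	folio_try_dup_anon_rmap_pmd(src_folio, &src_folio->page,
				    dst_vma, src_vma);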

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [v3 03/11] mm/migrate_device: THP migration of zone device pages
  2025-08-21 10:24             ` Balbir Singh
@ 2025-08-28 23:14               ` Matthew Brost
  0 siblings, 0 replies; 36+ messages in thread
From: Matthew Brost @ 2025-08-28 23:14 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Mika Penttilä, dri-devel, linux-mm, linux-kernel,
	Andrew Morton, David Hildenbrand, Zi Yan, Joshua Hahn, Rakie Kim,
	Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
	Oscar Salvador, Lorenzo Stoakes, Baolin Wang, Liam R. Howlett,
	Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter, Ralph Campbell,
	Francois Dugast

On Thu, Aug 21, 2025 at 08:24:00PM +1000, Balbir Singh wrote:
> On 8/15/25 10:04, Matthew Brost wrote:
> > On Fri, Aug 15, 2025 at 08:51:21AM +1000, Balbir Singh wrote:
> >> On 8/13/25 10:07, Mika Penttilä wrote:
> >>>
> >>> On 8/13/25 02:36, Balbir Singh wrote:
> >>>
> >>>> On 8/12/25 15:35, Mika Penttilä wrote:
> >>>>> Hi,
> >>>>>
> >>>>> On 8/12/25 05:40, Balbir Singh wrote:
> ...
> 
> >> I've not run into this with my testing, let me try with more mTHP sizes enabled. I'll wait on Matthew
> >> to post his test case or any results, issues seen
> >>
> > 
> > I’ve hit this. In the code I shared privately, I split THPs in the
> > page-collection path. You omitted that in v2 and v3; I believe you’ll
> > need those changes. The code I'm referring to had the below comment.
> > 
> >  416         /*
> >  417          * XXX: No clean way to support higher-order folios that don't match PMD
> >  418          * boundaries for now — split them instead. Once mTHP support lands, add
> >  419          * proper support for this case.
> >  420          *
> >  421          * The test, which exposed this as problematic, remapped (memremap) a
> >  422          * large folio to an unaligned address, resulting in the folio being
> >  423          * found in the middle of the PTEs. The requested number of pages was
> >  424          * less than the folio size. Likely to be handled gracefully by upper
> >  425          * layers eventually, but not yet.
> >  426          */
> > 
> > I triggered it by doing some odd mremap operations, which caused the CPU
> > page-fault handler to spin indefinitely iirc. In that case, a large device
> > folio had been moved into the middle of a PMD.
> > 
> > Upstream could see the same problem if the device fault handler enforces
> > a must-migrate-to-device policy and mremap moves a large CPU folio into
> > the middle of a PMD.
> > 
> > I’m in the middle of other work; when I circle back, I’ll try to create
> > a selftest to reproduce this. My current test is a fairly convoluted IGT
> > with a bunch of threads doing remap nonsense, but I’ll try to distill it
> > into a concise selftest.
> > 
> 
> I ran into this while doing some testing as well, I fixed it in a manner similar
> to split_folio() for partial unmaps. I will consolidate the folio splits into
> a single helper and post it with v4.
> 

I created a selftest for this one. I'm going to send these over along
with the fixes I've applied to v3. Please include my selftests in v4.
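
For reference, the scenario above boils down to moving a THP-backed
range to a PMD-misaligned address and then faulting on it. A rough
sketch of the core sequence (sketch only; the helper name is made up,
error checking is omitted, and the migration to device memory is
driven through the hmm-tests harness, not shown here):

  #define _GNU_SOURCE
  #include <sys/mman.h>

  #define SZ_2M	(2UL << 20)

  static void remap_misaligned(void)	/* sketch only */
  {
	char *src, *dst;

	src = mmap(NULL, SZ_2M, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	madvise(src, SZ_2M, MADV_HUGEPAGE);
	src[0] = 1;			/* populate, ideally as a THP */

	/* ... migrate the range to device memory here ... */

	/* Reserve a window and move the range to a +4K offset. */
	dst = mmap(NULL, 2 * SZ_2M, PROT_NONE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	dst = mremap(src, SZ_2M, SZ_2M,
		     MREMAP_MAYMOVE | MREMAP_FIXED, dst + 4096);

	*(volatile char *)dst;		/* fault on the moved folio */
  }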

Matt 

> 
> Balbir Singh


^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads: [~2025-08-28 23:14 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-08-12  2:40 [v3 00/11] mm: support device-private THP Balbir Singh
2025-08-12  2:40 ` [v3 01/11] mm/zone_device: support large zone device private folios Balbir Singh
2025-08-26 14:22   ` David Hildenbrand
2025-08-12  2:40 ` [v3 02/11] mm/thp: zone_device awareness in THP handling code Balbir Singh
2025-08-12 14:47   ` kernel test robot
2025-08-26 15:19   ` David Hildenbrand
2025-08-27 10:14     ` Balbir Singh
2025-08-27 11:28       ` David Hildenbrand
2025-08-28 20:05   ` Matthew Brost
2025-08-28 20:12     ` David Hildenbrand
2025-08-28 20:17       ` Matthew Brost
2025-08-28 20:22         ` David Hildenbrand
2025-08-12  2:40 ` [v3 03/11] mm/migrate_device: THP migration of zone device pages Balbir Singh
2025-08-12  5:35   ` Mika Penttilä
2025-08-12  5:54     ` Matthew Brost
2025-08-12  6:18       ` Matthew Brost
2025-08-12  6:25       ` Mika Penttilä
2025-08-12  6:33         ` Matthew Brost
2025-08-12  6:37           ` Mika Penttilä
2025-08-12 23:36     ` Balbir Singh
2025-08-13  0:07       ` Mika Penttilä
2025-08-14 22:51         ` Balbir Singh
2025-08-15  0:04           ` Matthew Brost
2025-08-15 12:09             ` Balbir Singh
2025-08-21 10:24             ` Balbir Singh
2025-08-28 23:14               ` Matthew Brost
2025-08-12  2:40 ` [v3 04/11] mm/memory/fault: add support for zone device THP fault handling Balbir Singh
2025-08-12  2:40 ` [v3 05/11] lib/test_hmm: test cases and support for zone device private THP Balbir Singh
2025-08-12  2:40 ` [v3 06/11] mm/memremap: add folio_split support Balbir Singh
2025-08-12  2:40 ` [v3 07/11] mm/thp: add split during migration support Balbir Singh
2025-08-27 20:29   ` David Hildenbrand
2025-08-12  2:40 ` [v3 08/11] lib/test_hmm: add test case for split pages Balbir Singh
2025-08-12  2:40 ` [v3 09/11] selftests/mm/hmm-tests: new tests for zone device THP migration Balbir Singh
2025-08-12  2:40 ` [v3 10/11] gpu/drm/nouveau: add THP migration support Balbir Singh
2025-08-13  2:23   ` kernel test robot
2025-08-12  2:40 ` [v3 11/11] selftests/mm/hmm-tests: new throughput tests including THP Balbir Singh
