* [v1 resend 00/12] THP support for zone device page migration
@ 2025-07-03 23:34 Balbir Singh
2025-07-03 23:35 ` [v1 resend 01/12] mm/zone_device: support large zone device private folios Balbir Singh
` (14 more replies)
0 siblings, 15 replies; 99+ messages in thread
From: Balbir Singh @ 2025-07-03 23:34 UTC (permalink / raw)
To: linux-mm
Cc: akpm, linux-kernel, Balbir Singh, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
This patch series adds support for THP migration of zone device pages.
To do so, the patches add support for large order (compound) zone device
private folios and the infrastructure needed to set them up.
These patches build on the earlier posts by Ralph Campbell [1].

Two new flags are added to the migrate_vma API to select and mark compound
pages. migrate_vma_setup(), migrate_vma_pages() and migrate_vma_finalize()
support migration of these pages when MIGRATE_VMA_SELECT_COMPOUND
is passed in as a flag.
The series also adds zone device awareness to the (m)THP handling code,
along with fault handling of large zone device private pages. The page vma
walk and rmap code are also made zone device aware. Support has been added
for folios that might need to be split in the middle of migration (when the
src and dst do not agree on MIGRATE_PFN_COMPOUND), which occurs when the
src side of the migration can migrate large pages but the destination has
not been able to allocate large pages. The split path uses folio_split()
when migrating THP pages; it is also used when MIGRATE_VMA_SELECT_COMPOUND
is not passed as an argument to migrate_vma_setup().
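
To illustrate how a driver is expected to use the new flags, here is a
rough sketch (not code from any driver in this series;
dev_alloc_device_folio(), dev_copy_to_device() and drvdata are made-up
placeholders, and src_pfns/dst_pfns are driver-allocated arrays of
HPAGE_PMD_NR entries):

        struct migrate_vma args = {
                .vma            = vma,
                .start          = start,        /* HPAGE_PMD_SIZE aligned */
                .end            = start + HPAGE_PMD_SIZE,
                .src            = src_pfns,
                .dst            = dst_pfns,
                .pgmap_owner    = drvdata,
                .flags          = MIGRATE_VMA_SELECT_SYSTEM |
                                  MIGRATE_VMA_SELECT_COMPOUND,
        };

        if (migrate_vma_setup(&args))
                return -EBUSY;

        if (args.src[0] & MIGRATE_PFN_COMPOUND) {
                /* the destination must also be an HPAGE_PMD_ORDER folio */
                struct folio *dfolio = dev_alloc_device_folio(drvdata);

                dst_pfns[0] = migrate_pfn(folio_pfn(dfolio)) |
                              MIGRATE_PFN_COMPOUND;
                dev_copy_to_device(dfolio, &args);
        }

        migrate_vma_pages(&args);
        migrate_vma_finalize(&args);

If the destination cannot provide a large folio, it simply leaves
MIGRATE_PFN_COMPOUND out of the dst entry and the source folio is split
as described above.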
The test infrastructure in lib/test_hmm.c has been enhanced to support THP
migration. A new ioctl to emulate failure of large page allocations has
been added to exercise the folio split code path. hmm-tests.c has new test
cases for huge page migration and for the folio split path. A new
throughput test has been added as well.
The nouveau dmem code has been enhanced to use the new THP migration
capability.
Feedback from the RFC [2]:

It was advised that prep_compound_page() not be exposed just for the
purposes of testing (the lib/test_hmm.c test driver). Workarounds that
copy and split the folios did not work due to a lock order dependency in
the folio split callback.
mTHP support:

The patches hard-code HPAGE_PMD_NR in a few places, but the code has been
kept generic to support various order sizes. With additional refactoring
of the code, support for different order sizes should be possible.
The future plan is to post enhancements to support mTHP with a rough
design as follows:
1. Add the notion of allowable THP orders to the HMM-based test driver
2. For non-PMD-based THP paths in migrate_device.c, check whether a
   suitable order is found and supported by the driver
3. Iterate across orders to find the highest supported order for migration
4. Migrate and finalize
The mTHP patches can be built on top of this series; the key design
elements that need to be worked out are infrastructure and driver support
for multiple page orders and their migration.
References:
[1] https://lore.kernel.org/linux-mm/20201106005147.20113-1-rcampbell@nvidia.com/
[2] https://lore.kernel.org/linux-mm/20250306044239.3874247-3-balbirs@nvidia.com/T/
These patches are built on top of mm-unstable.
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Changelog v1:
- Changes from the RFC [2] include support for handling fault_folio and
using trylock in the fault path
- A new test case has been added to measure the throughput improvement
- General refactoring of code to keep up with the changes in mm
- New split folio callback invoked when the entire split is complete. The
callback is used to know when the head order needs to be reset.
Testing:
- Testing was done with ZONE_DEVICE private pages on an x86 VM
- Throughput showed up to a 5x improvement with THP migration; system to
device migration is slower due to the mirroring of data (see
buffer->mirror)
Balbir Singh (12):
mm/zone_device: support large zone device private folios
mm/migrate_device: flags for selecting device private THP pages
mm/thp: zone_device awareness in THP handling code
mm/migrate_device: THP migration of zone device pages
mm/memory/fault: add support for zone device THP fault handling
lib/test_hmm: test cases and support for zone device private THP
mm/memremap: add folio_split support
mm/thp: add split during migration support
lib/test_hmm: add test case for split pages
selftests/mm/hmm-tests: new tests for zone device THP migration
gpu/drm/nouveau: add THP migration support
selftests/mm/hmm-tests: new throughput tests including THP
drivers/gpu/drm/nouveau/nouveau_dmem.c | 246 +++++++---
drivers/gpu/drm/nouveau/nouveau_svm.c | 6 +-
drivers/gpu/drm/nouveau/nouveau_svm.h | 3 +-
include/linux/huge_mm.h | 18 +-
include/linux/memremap.h | 29 +-
include/linux/migrate.h | 2 +
include/linux/mm.h | 1 +
lib/test_hmm.c | 428 +++++++++++++----
lib/test_hmm_uapi.h | 3 +
mm/huge_memory.c | 261 ++++++++---
mm/memory.c | 6 +-
mm/memremap.c | 50 +-
mm/migrate.c | 2 +
mm/migrate_device.c | 488 +++++++++++++++++---
mm/page_alloc.c | 1 +
mm/page_vma_mapped.c | 10 +
mm/pgtable-generic.c | 6 +
mm/rmap.c | 19 +-
tools/testing/selftests/mm/hmm-tests.c | 607 ++++++++++++++++++++++++-
19 files changed, 1874 insertions(+), 312 deletions(-)
--
2.49.0
* [v1 resend 01/12] mm/zone_device: support large zone device private folios
2025-07-03 23:34 [v1 resend 00/12] THP support for zone device page migration Balbir Singh
@ 2025-07-03 23:35 ` Balbir Singh
2025-07-07 5:28 ` Alistair Popple
2025-07-03 23:35 ` [v1 resend 02/12] mm/migrate_device: flags for selecting device private THP pages Balbir Singh
` (13 subsequent siblings)
14 siblings, 1 reply; 99+ messages in thread
From: Balbir Singh @ 2025-07-03 23:35 UTC (permalink / raw)
To: linux-mm
Cc: akpm, linux-kernel, Balbir Singh, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
Add routines to support allocation of large order zone device folios,
along with helper functions to check whether a folio is device private
and to get and set the folio's zone device data.

When large folios are used, the existing page_free() callback in pgmap is
called when the folio is freed; this is true for both PAGE_SIZE and
higher order pages.
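
As an illustration, a driver carving a PMD-sized device private folio out
of its own pool would initialize it roughly as below (sketch only;
dev_alloc_private_folio() and drvdata are placeholders for the driver's
own allocator and private data):

        struct folio *folio = dev_alloc_private_folio(mdevice, HPAGE_PMD_ORDER);

        /*
         * Takes pgmap references for all pages in the folio, sets the
         * folio refcount to 1, locks the head page and, for order > 1,
         * preps the compound page.
         */
        init_zone_device_folio(folio, HPAGE_PMD_ORDER);
        folio_set_zone_device_data(folio, drvdata);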
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
include/linux/memremap.h | 22 +++++++++++++++++-
mm/memremap.c | 50 +++++++++++++++++++++++++++++-----------
2 files changed, 58 insertions(+), 14 deletions(-)
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 4aa151914eab..11d586dd8ef1 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -169,6 +169,18 @@ static inline bool folio_is_device_private(const struct folio *folio)
return is_device_private_page(&folio->page);
}
+static inline void *folio_zone_device_data(const struct folio *folio)
+{
+ VM_BUG_ON_FOLIO(!folio_is_device_private(folio), folio);
+ return folio->page.zone_device_data;
+}
+
+static inline void folio_set_zone_device_data(struct folio *folio, void *data)
+{
+ VM_BUG_ON_FOLIO(!folio_is_device_private(folio), folio);
+ folio->page.zone_device_data = data;
+}
+
static inline bool is_pci_p2pdma_page(const struct page *page)
{
return IS_ENABLED(CONFIG_PCI_P2PDMA) &&
@@ -199,7 +211,7 @@ static inline bool folio_is_fsdax(const struct folio *folio)
}
#ifdef CONFIG_ZONE_DEVICE
-void zone_device_page_init(struct page *page);
+void init_zone_device_folio(struct folio *folio, unsigned int order);
void *memremap_pages(struct dev_pagemap *pgmap, int nid);
void memunmap_pages(struct dev_pagemap *pgmap);
void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
@@ -209,6 +221,14 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);
unsigned long memremap_compat_align(void);
+
+static inline void zone_device_page_init(struct page *page)
+{
+ struct folio *folio = page_folio(page);
+
+ init_zone_device_folio(folio, 0);
+}
+
#else
static inline void *devm_memremap_pages(struct device *dev,
struct dev_pagemap *pgmap)
diff --git a/mm/memremap.c b/mm/memremap.c
index b0ce0d8254bd..4085a3893e64 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -427,20 +427,21 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
void free_zone_device_folio(struct folio *folio)
{
struct dev_pagemap *pgmap = folio->pgmap;
+ unsigned int nr = folio_nr_pages(folio);
+ int i;
+ bool anon = folio_test_anon(folio);
+ struct page *page = folio_page(folio, 0);
if (WARN_ON_ONCE(!pgmap))
return;
mem_cgroup_uncharge(folio);
- /*
- * Note: we don't expect anonymous compound pages yet. Once supported
- * and we could PTE-map them similar to THP, we'd have to clear
- * PG_anon_exclusive on all tail pages.
- */
- if (folio_test_anon(folio)) {
- VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
- __ClearPageAnonExclusive(folio_page(folio, 0));
+ WARN_ON_ONCE(folio_test_large(folio) && !anon);
+
+ for (i = 0; i < nr; i++) {
+ if (anon)
+ __ClearPageAnonExclusive(folio_page(folio, i));
}
/*
@@ -464,10 +465,19 @@ void free_zone_device_folio(struct folio *folio)
switch (pgmap->type) {
case MEMORY_DEVICE_PRIVATE:
+ if (folio_test_large(folio)) {
+ folio_unqueue_deferred_split(folio);
+
+ percpu_ref_put_many(&folio->pgmap->ref, nr - 1);
+ }
+ pgmap->ops->page_free(page);
+ put_dev_pagemap(pgmap);
+ page->mapping = NULL;
+ break;
case MEMORY_DEVICE_COHERENT:
if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->page_free))
break;
- pgmap->ops->page_free(folio_page(folio, 0));
+ pgmap->ops->page_free(page);
put_dev_pagemap(pgmap);
break;
@@ -491,14 +501,28 @@ void free_zone_device_folio(struct folio *folio)
}
}
-void zone_device_page_init(struct page *page)
+void init_zone_device_folio(struct folio *folio, unsigned int order)
{
+ struct page *page = folio_page(folio, 0);
+
+ VM_BUG_ON(order > MAX_ORDER_NR_PAGES);
+
+ WARN_ON_ONCE(order && order != HPAGE_PMD_ORDER);
+
/*
* Drivers shouldn't be allocating pages after calling
* memunmap_pages().
*/
- WARN_ON_ONCE(!percpu_ref_tryget_live(&page_pgmap(page)->ref));
- set_page_count(page, 1);
+ WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << order));
+ folio_set_count(folio, 1);
lock_page(page);
+
+ /*
+ * Only PMD level migration is supported for THP migration
+ */
+ if (order > 1) {
+ prep_compound_page(page, order);
+ folio_set_large_rmappable(folio);
+ }
}
-EXPORT_SYMBOL_GPL(zone_device_page_init);
+EXPORT_SYMBOL_GPL(init_zone_device_folio);
--
2.49.0
* [v1 resend 02/12] mm/migrate_device: flags for selecting device private THP pages
2025-07-03 23:34 [v1 resend 00/12] THP support for zone device page migration Balbir Singh
2025-07-03 23:35 ` [v1 resend 01/12] mm/zone_device: support large zone device private folios Balbir Singh
@ 2025-07-03 23:35 ` Balbir Singh
2025-07-07 5:31 ` Alistair Popple
2025-07-18 3:15 ` Matthew Brost
2025-07-03 23:35 ` [v1 resend 03/12] mm/thp: zone_device awareness in THP handling code Balbir Singh
` (12 subsequent siblings)
14 siblings, 2 replies; 99+ messages in thread
From: Balbir Singh @ 2025-07-03 23:35 UTC (permalink / raw)
To: linux-mm
Cc: akpm, linux-kernel, Balbir Singh, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
Add flags for selecting and marking zone device pages for migration.
MIGRATE_VMA_SELECT_COMPOUND will be used to select THP pages during
migrate_vma_setup(), and MIGRATE_PFN_COMPOUND marks a migrating device
pfn entry as a compound page.
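
Like the existing MIGRATE_PFN_* flags, MIGRATE_PFN_COMPOUND is OR-ed into
the encoded pfn entry. A minimal sketch (illustrative, not part of this
patch) of encoding and decoding a compound source entry:

        /* encode: head pfn of a PMD-sized folio, writable and compound */
        src[0] = migrate_pfn(folio_pfn(folio)) | MIGRATE_PFN_MIGRATE |
                 MIGRATE_PFN_WRITE | MIGRATE_PFN_COMPOUND;

        /* decode */
        struct page *head = migrate_pfn_to_page(src[0]);
        bool is_thp = src[0] & MIGRATE_PFN_COMPOUND;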
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
include/linux/migrate.h | 2 ++
1 file changed, 2 insertions(+)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index aaa2114498d6..1661e2d5479a 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -167,6 +167,7 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
#define MIGRATE_PFN_VALID (1UL << 0)
#define MIGRATE_PFN_MIGRATE (1UL << 1)
#define MIGRATE_PFN_WRITE (1UL << 3)
+#define MIGRATE_PFN_COMPOUND (1UL << 4)
#define MIGRATE_PFN_SHIFT 6
static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
@@ -185,6 +186,7 @@ enum migrate_vma_direction {
MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
MIGRATE_VMA_SELECT_DEVICE_COHERENT = 1 << 2,
+ MIGRATE_VMA_SELECT_COMPOUND = 1 << 3,
};
struct migrate_vma {
--
2.49.0
* [v1 resend 03/12] mm/thp: zone_device awareness in THP handling code
2025-07-03 23:34 [v1 resend 00/12] THP support for zone device page migration Balbir Singh
2025-07-03 23:35 ` [v1 resend 01/12] mm/zone_device: support large zone device private folios Balbir Singh
2025-07-03 23:35 ` [v1 resend 02/12] mm/migrate_device: flags for selecting device private THP pages Balbir Singh
@ 2025-07-03 23:35 ` Balbir Singh
2025-07-04 4:46 ` Mika Penttilä
` (4 more replies)
2025-07-03 23:35 ` [v1 resend 04/12] mm/migrate_device: THP migration of zone device pages Balbir Singh
` (11 subsequent siblings)
14 siblings, 5 replies; 99+ messages in thread
From: Balbir Singh @ 2025-07-03 23:35 UTC (permalink / raw)
To: linux-mm
Cc: akpm, linux-kernel, Balbir Singh, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
Make the THP handling code in the mm subsystem aware of zone device
pages. Although the code is designed to be generic when it comes to
handling splitting of pages, it currently works for THP sizes
corresponding to HPAGE_PMD_NR.

Modify page_vma_mapped_walk() to return true when a zone device huge
entry is present, enabling try_to_migrate() and other migration code
paths to process the entry appropriately.

pmd_pfn() does not work well with zone device entries; for zone device
entries the pfn is derived from the swap entry (swp_offset_pfn()) for
checking and comparison instead.

try_to_map_unused_to_zeropage() does not apply to zone device entries;
such entries are ignored in that call.
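
The check used throughout this patch to recognize a device private huge
entry boils down to the following helper-style sketch (illustrative only,
no such helper is added by this patch):

static inline bool pmd_is_device_private(pmd_t pmd)
{
        return is_swap_pmd(pmd) &&
               is_device_private_entry(pmd_to_swp_entry(pmd));
}

When such an entry is found, the pfn is taken from the swap entry via
swp_offset_pfn() rather than pmd_pfn().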
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
mm/huge_memory.c | 153 +++++++++++++++++++++++++++++++------------
mm/migrate.c | 2 +
mm/page_vma_mapped.c | 10 +++
mm/pgtable-generic.c | 6 ++
mm/rmap.c | 19 +++++-
5 files changed, 146 insertions(+), 44 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ce130225a8e5..e6e390d0308f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1711,7 +1711,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
if (unlikely(is_swap_pmd(pmd))) {
swp_entry_t entry = pmd_to_swp_entry(pmd);
- VM_BUG_ON(!is_pmd_migration_entry(pmd));
+ VM_BUG_ON(!is_pmd_migration_entry(pmd) &&
+ !is_device_private_entry(entry));
if (!is_readable_migration_entry(entry)) {
entry = make_readable_migration_entry(
swp_offset(entry));
@@ -2222,10 +2223,17 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
} else if (thp_migration_supported()) {
swp_entry_t entry;
- VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
entry = pmd_to_swp_entry(orig_pmd);
folio = pfn_swap_entry_folio(entry);
flush_needed = 0;
+
+ VM_BUG_ON(!is_pmd_migration_entry(*pmd) &&
+ !folio_is_device_private(folio));
+
+ if (folio_is_device_private(folio)) {
+ folio_remove_rmap_pmd(folio, folio_page(folio, 0), vma);
+ WARN_ON_ONCE(folio_mapcount(folio) < 0);
+ }
} else
WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
@@ -2247,6 +2255,15 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
folio_mark_accessed(folio);
}
+ /*
+ * Do a folio put on zone device private pages after
+ * changes to mm_counter, because the folio_put() will
+ * clean folio->mapping and the folio_test_anon() check
+ * will not be usable.
+ */
+ if (folio_is_device_private(folio))
+ folio_put(folio);
+
spin_unlock(ptl);
if (flush_needed)
tlb_remove_page_size(tlb, &folio->page, HPAGE_PMD_SIZE);
@@ -2375,7 +2392,8 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
struct folio *folio = pfn_swap_entry_folio(entry);
pmd_t newpmd;
- VM_BUG_ON(!is_pmd_migration_entry(*pmd));
+ VM_BUG_ON(!is_pmd_migration_entry(*pmd) &&
+ !folio_is_device_private(folio));
if (is_writable_migration_entry(entry)) {
/*
* A protection check is difficult so
@@ -2388,9 +2406,11 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
newpmd = swp_entry_to_pmd(entry);
if (pmd_swp_soft_dirty(*pmd))
newpmd = pmd_swp_mksoft_dirty(newpmd);
- } else {
+ } else if (is_writable_device_private_entry(entry)) {
+ newpmd = swp_entry_to_pmd(entry);
+ entry = make_device_exclusive_entry(swp_offset(entry));
+ } else
newpmd = *pmd;
- }
if (uffd_wp)
newpmd = pmd_swp_mkuffd_wp(newpmd);
@@ -2842,16 +2862,20 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
struct page *page;
pgtable_t pgtable;
pmd_t old_pmd, _pmd;
- bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
- bool anon_exclusive = false, dirty = false;
+ bool young, write, soft_dirty, uffd_wp = false;
+ bool anon_exclusive = false, dirty = false, present = false;
unsigned long addr;
pte_t *pte;
int i;
+ swp_entry_t swp_entry;
VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma);
- VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd));
+
+ VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd)
+ && !(is_swap_pmd(*pmd) &&
+ is_device_private_entry(pmd_to_swp_entry(*pmd))));
count_vm_event(THP_SPLIT_PMD);
@@ -2899,20 +2923,25 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
return __split_huge_zero_page_pmd(vma, haddr, pmd);
}
- pmd_migration = is_pmd_migration_entry(*pmd);
- if (unlikely(pmd_migration)) {
- swp_entry_t entry;
+ present = pmd_present(*pmd);
+ if (unlikely(!present)) {
+ swp_entry = pmd_to_swp_entry(*pmd);
old_pmd = *pmd;
- entry = pmd_to_swp_entry(old_pmd);
- page = pfn_swap_entry_to_page(entry);
- write = is_writable_migration_entry(entry);
+
+ folio = pfn_swap_entry_folio(swp_entry);
+ VM_BUG_ON(!is_migration_entry(swp_entry) &&
+ !is_device_private_entry(swp_entry));
+ page = pfn_swap_entry_to_page(swp_entry);
+ write = is_writable_migration_entry(swp_entry);
+
if (PageAnon(page))
- anon_exclusive = is_readable_exclusive_migration_entry(entry);
- young = is_migration_entry_young(entry);
- dirty = is_migration_entry_dirty(entry);
+ anon_exclusive =
+ is_readable_exclusive_migration_entry(swp_entry);
soft_dirty = pmd_swp_soft_dirty(old_pmd);
uffd_wp = pmd_swp_uffd_wp(old_pmd);
+ young = is_migration_entry_young(swp_entry);
+ dirty = is_migration_entry_dirty(swp_entry);
} else {
/*
* Up to this point the pmd is present and huge and userland has
@@ -2996,30 +3025,45 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
* Note that NUMA hinting access restrictions are not transferred to
* avoid any possibility of altering permissions across VMAs.
*/
- if (freeze || pmd_migration) {
+ if (freeze || !present) {
for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
pte_t entry;
- swp_entry_t swp_entry;
-
- if (write)
- swp_entry = make_writable_migration_entry(
- page_to_pfn(page + i));
- else if (anon_exclusive)
- swp_entry = make_readable_exclusive_migration_entry(
- page_to_pfn(page + i));
- else
- swp_entry = make_readable_migration_entry(
- page_to_pfn(page + i));
- if (young)
- swp_entry = make_migration_entry_young(swp_entry);
- if (dirty)
- swp_entry = make_migration_entry_dirty(swp_entry);
- entry = swp_entry_to_pte(swp_entry);
- if (soft_dirty)
- entry = pte_swp_mksoft_dirty(entry);
- if (uffd_wp)
- entry = pte_swp_mkuffd_wp(entry);
-
+ if (freeze || is_migration_entry(swp_entry)) {
+ if (write)
+ swp_entry = make_writable_migration_entry(
+ page_to_pfn(page + i));
+ else if (anon_exclusive)
+ swp_entry = make_readable_exclusive_migration_entry(
+ page_to_pfn(page + i));
+ else
+ swp_entry = make_readable_migration_entry(
+ page_to_pfn(page + i));
+ if (young)
+ swp_entry = make_migration_entry_young(swp_entry);
+ if (dirty)
+ swp_entry = make_migration_entry_dirty(swp_entry);
+ entry = swp_entry_to_pte(swp_entry);
+ if (soft_dirty)
+ entry = pte_swp_mksoft_dirty(entry);
+ if (uffd_wp)
+ entry = pte_swp_mkuffd_wp(entry);
+ } else {
+ VM_BUG_ON(!is_device_private_entry(swp_entry));
+ if (write)
+ swp_entry = make_writable_device_private_entry(
+ page_to_pfn(page + i));
+ else if (anon_exclusive)
+ swp_entry = make_device_exclusive_entry(
+ page_to_pfn(page + i));
+ else
+ swp_entry = make_readable_device_private_entry(
+ page_to_pfn(page + i));
+ entry = swp_entry_to_pte(swp_entry);
+ if (soft_dirty)
+ entry = pte_swp_mksoft_dirty(entry);
+ if (uffd_wp)
+ entry = pte_swp_mkuffd_wp(entry);
+ }
VM_WARN_ON(!pte_none(ptep_get(pte + i)));
set_pte_at(mm, addr, pte + i, entry);
}
@@ -3046,7 +3090,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
}
pte_unmap(pte);
- if (!pmd_migration)
+ if (present)
folio_remove_rmap_pmd(folio, page, vma);
if (freeze)
put_page(page);
@@ -3058,8 +3102,11 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
pmd_t *pmd, bool freeze)
{
+
VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
- if (pmd_trans_huge(*pmd) || is_pmd_migration_entry(*pmd))
+ if (pmd_trans_huge(*pmd) || is_pmd_migration_entry(*pmd) ||
+ (is_swap_pmd(*pmd) &&
+ is_device_private_entry(pmd_to_swp_entry(*pmd))))
__split_huge_pmd_locked(vma, pmd, address, freeze);
}
@@ -3238,6 +3285,9 @@ static void lru_add_split_folio(struct folio *folio, struct folio *new_folio,
VM_BUG_ON_FOLIO(folio_test_lru(new_folio), folio);
lockdep_assert_held(&lruvec->lru_lock);
+ if (folio_is_device_private(folio))
+ return;
+
if (list) {
/* page reclaim is reclaiming a huge page */
VM_WARN_ON(folio_test_lru(folio));
@@ -3252,6 +3302,7 @@ static void lru_add_split_folio(struct folio *folio, struct folio *new_folio,
list_add_tail(&new_folio->lru, &folio->lru);
folio_set_lru(new_folio);
}
+
}
/* Racy check whether the huge page can be split */
@@ -3543,6 +3594,10 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
((mapping || swap_cache) ?
folio_nr_pages(release) : 0));
+ if (folio_is_device_private(release))
+ percpu_ref_get_many(&release->pgmap->ref,
+ (1 << new_order) - 1);
+
lru_add_split_folio(origin_folio, release, lruvec,
list);
@@ -4596,7 +4651,10 @@ int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
return 0;
flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
- pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
+ if (!folio_is_device_private(folio))
+ pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
+ else
+ pmdval = pmdp_huge_clear_flush(vma, address, pvmw->pmd);
/* See folio_try_share_anon_rmap_pmd(): invalidate PMD first. */
anon_exclusive = folio_test_anon(folio) && PageAnonExclusive(page);
@@ -4646,6 +4704,17 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
entry = pmd_to_swp_entry(*pvmw->pmd);
folio_get(folio);
pmde = folio_mk_pmd(folio, READ_ONCE(vma->vm_page_prot));
+
+ if (unlikely(folio_is_device_private(folio))) {
+ if (pmd_write(pmde))
+ entry = make_writable_device_private_entry(
+ page_to_pfn(new));
+ else
+ entry = make_readable_device_private_entry(
+ page_to_pfn(new));
+ pmde = swp_entry_to_pmd(entry);
+ }
+
if (pmd_swp_soft_dirty(*pvmw->pmd))
pmde = pmd_mksoft_dirty(pmde);
if (is_writable_migration_entry(entry))
diff --git a/mm/migrate.c b/mm/migrate.c
index 767f503f0875..0b6ecf559b22 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -200,6 +200,8 @@ static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
if (PageCompound(page))
return false;
+ if (folio_is_device_private(folio))
+ return false;
VM_BUG_ON_PAGE(!PageAnon(page), page);
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(pte_present(ptep_get(pvmw->pte)), page);
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index e981a1a292d2..ff8254e52de5 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -277,6 +277,16 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
* cannot return prematurely, while zap_huge_pmd() has
* cleared *pmd but not decremented compound_mapcount().
*/
+ swp_entry_t entry;
+
+ if (!thp_migration_supported())
+ return not_found(pvmw);
+ entry = pmd_to_swp_entry(pmde);
+ if (is_device_private_entry(entry)) {
+ pvmw->ptl = pmd_lock(mm, pvmw->pmd);
+ return true;
+ }
+
if ((pvmw->flags & PVMW_SYNC) &&
thp_vma_suitable_order(vma, pvmw->address,
PMD_ORDER) &&
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 567e2d084071..604e8206a2ec 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -292,6 +292,12 @@ pte_t *___pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
*pmdvalp = pmdval;
if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
goto nomap;
+ if (is_swap_pmd(pmdval)) {
+ swp_entry_t entry = pmd_to_swp_entry(pmdval);
+
+ if (is_device_private_entry(entry))
+ goto nomap;
+ }
if (unlikely(pmd_trans_huge(pmdval)))
goto nomap;
if (unlikely(pmd_bad(pmdval))) {
diff --git a/mm/rmap.c b/mm/rmap.c
index bd83724d14b6..da1e5b03e1fe 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2336,8 +2336,23 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
break;
}
#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
- subpage = folio_page(folio,
- pmd_pfn(*pvmw.pmd) - folio_pfn(folio));
+ /*
+ * Zone device private folios do not work well with
+ * pmd_pfn() on some architectures due to pte
+ * inversion.
+ */
+ if (folio_is_device_private(folio)) {
+ swp_entry_t entry = pmd_to_swp_entry(*pvmw.pmd);
+ unsigned long pfn = swp_offset_pfn(entry);
+
+ subpage = folio_page(folio, pfn
+ - folio_pfn(folio));
+ } else {
+ subpage = folio_page(folio,
+ pmd_pfn(*pvmw.pmd)
+ - folio_pfn(folio));
+ }
+
VM_BUG_ON_FOLIO(folio_test_hugetlb(folio) ||
!folio_test_pmd_mappable(folio), folio);
--
2.49.0
* [v1 resend 04/12] mm/migrate_device: THP migration of zone device pages
2025-07-03 23:34 [v1 resend 00/12] THP support for zone device page migration Balbir Singh
` (2 preceding siblings ...)
2025-07-03 23:35 ` [v1 resend 03/12] mm/thp: zone_device awareness in THP handling code Balbir Singh
@ 2025-07-03 23:35 ` Balbir Singh
2025-07-04 15:35 ` kernel test robot
` (2 more replies)
2025-07-03 23:35 ` [v1 resend 05/12] mm/memory/fault: add support for zone device THP fault handling Balbir Singh
` (10 subsequent siblings)
14 siblings, 3 replies; 99+ messages in thread
From: Balbir Singh @ 2025-07-03 23:35 UTC (permalink / raw)
To: linux-mm
Cc: akpm, linux-kernel, Balbir Singh, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
The migrate_device code paths go through the collect, setup and finalize
phases of migration. Support for the MIGRATE_PFN_COMPOUND flag was added
earlier in the series so that THP pages can be marked as compound.

The entries in the src and dst arrays passed to these functions still
remain at a PAGE_SIZE granularity. When a compound page is passed, the
first entry has the PFN along with MIGRATE_PFN_COMPOUND and the other
flags set (MIGRATE_PFN_MIGRATE, MIGRATE_PFN_VALID), and the remaining
(HPAGE_PMD_NR - 1) entries are filled with 0's. This representation
allows the compound page to be split into smaller page sizes (see the
sketch below).

migrate_vma_collect_hole() and migrate_vma_collect_pmd() are now THP
aware. Two new helper functions, migrate_vma_collect_huge_pmd() and
migrate_vma_insert_huge_pmd_page(), have been added.

migrate_vma_collect_huge_pmd() can collect THP pages, but if this fails
for some reason, there is fallback support to split the folio and
migrate it.

migrate_vma_insert_huge_pmd_page() closely follows the logic of
migrate_vma_insert_page().

Support for splitting pages as needed for migration will follow in later
patches in this series.
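
For reference, for a single PMD-sized folio the src array handed to these
routines looks roughly like the sketch below (head_pfn being the pfn of
the folio's first page):

        src[0] = migrate_pfn(head_pfn) |        /* sets MIGRATE_PFN_VALID */
                 MIGRATE_PFN_MIGRATE | MIGRATE_PFN_COMPOUND;
        for (i = 1; i < HPAGE_PMD_NR; i++)
                src[i] = 0;     /* only populated if the folio is split */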
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
mm/migrate_device.c | 437 +++++++++++++++++++++++++++++++++++++-------
1 file changed, 376 insertions(+), 61 deletions(-)
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index e05e14d6eacd..41d0bd787969 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -14,6 +14,7 @@
#include <linux/pagewalk.h>
#include <linux/rmap.h>
#include <linux/swapops.h>
+#include <asm/pgalloc.h>
#include <asm/tlbflush.h>
#include "internal.h"
@@ -44,6 +45,23 @@ static int migrate_vma_collect_hole(unsigned long start,
if (!vma_is_anonymous(walk->vma))
return migrate_vma_collect_skip(start, end, walk);
+ if (thp_migration_supported() &&
+ (migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
+ (IS_ALIGNED(start, HPAGE_PMD_SIZE) &&
+ IS_ALIGNED(end, HPAGE_PMD_SIZE))) {
+ migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE |
+ MIGRATE_PFN_COMPOUND;
+ migrate->dst[migrate->npages] = 0;
+ migrate->npages++;
+ migrate->cpages++;
+
+ /*
+ * Collect the remaining entries as holes, in case we
+ * need to split later
+ */
+ return migrate_vma_collect_skip(start + PAGE_SIZE, end, walk);
+ }
+
for (addr = start; addr < end; addr += PAGE_SIZE) {
migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE;
migrate->dst[migrate->npages] = 0;
@@ -54,57 +72,148 @@ static int migrate_vma_collect_hole(unsigned long start,
return 0;
}
-static int migrate_vma_collect_pmd(pmd_t *pmdp,
- unsigned long start,
- unsigned long end,
- struct mm_walk *walk)
+/**
+ * migrate_vma_collect_huge_pmd - collect THP pages without splitting the
+ * folio for device private pages.
+ * @pmdp: pointer to pmd entry
+ * @start: start address of the range for migration
+ * @end: end address of the range for migration
+ * @walk: mm_walk callback structure
+ *
+ * Collect the huge pmd entry at @pmdp for migration and set the
+ * MIGRATE_PFN_COMPOUND flag in the migrate src entry to indicate that
+ * migration will occur at HPAGE_PMD granularity
+ */
+static int migrate_vma_collect_huge_pmd(pmd_t *pmdp, unsigned long start,
+ unsigned long end, struct mm_walk *walk,
+ struct folio *fault_folio)
{
+ struct mm_struct *mm = walk->mm;
+ struct folio *folio;
struct migrate_vma *migrate = walk->private;
- struct folio *fault_folio = migrate->fault_page ?
- page_folio(migrate->fault_page) : NULL;
- struct vm_area_struct *vma = walk->vma;
- struct mm_struct *mm = vma->vm_mm;
- unsigned long addr = start, unmapped = 0;
spinlock_t *ptl;
- pte_t *ptep;
+ swp_entry_t entry;
+ int ret;
+ unsigned long write = 0;
-again:
- if (pmd_none(*pmdp))
+ ptl = pmd_lock(mm, pmdp);
+ if (pmd_none(*pmdp)) {
+ spin_unlock(ptl);
return migrate_vma_collect_hole(start, end, -1, walk);
+ }
if (pmd_trans_huge(*pmdp)) {
- struct folio *folio;
-
- ptl = pmd_lock(mm, pmdp);
- if (unlikely(!pmd_trans_huge(*pmdp))) {
+ if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)) {
spin_unlock(ptl);
- goto again;
+ return migrate_vma_collect_skip(start, end, walk);
}
folio = pmd_folio(*pmdp);
if (is_huge_zero_folio(folio)) {
spin_unlock(ptl);
- split_huge_pmd(vma, pmdp, addr);
- } else {
- int ret;
+ return migrate_vma_collect_hole(start, end, -1, walk);
+ }
+ if (pmd_write(*pmdp))
+ write = MIGRATE_PFN_WRITE;
+ } else if (!pmd_present(*pmdp)) {
+ entry = pmd_to_swp_entry(*pmdp);
+ folio = pfn_swap_entry_folio(entry);
+
+ if (!is_device_private_entry(entry) ||
+ !(migrate->flags & MIGRATE_VMA_SELECT_DEVICE_PRIVATE) ||
+ (folio->pgmap->owner != migrate->pgmap_owner)) {
+ spin_unlock(ptl);
+ return migrate_vma_collect_skip(start, end, walk);
+ }
- folio_get(folio);
+ if (is_migration_entry(entry)) {
+ migration_entry_wait_on_locked(entry, ptl);
spin_unlock(ptl);
- /* FIXME: we don't expect THP for fault_folio */
- if (WARN_ON_ONCE(fault_folio == folio))
- return migrate_vma_collect_skip(start, end,
- walk);
- if (unlikely(!folio_trylock(folio)))
- return migrate_vma_collect_skip(start, end,
- walk);
- ret = split_folio(folio);
- if (fault_folio != folio)
- folio_unlock(folio);
- folio_put(folio);
- if (ret)
- return migrate_vma_collect_skip(start, end,
- walk);
+ return -EAGAIN;
}
+
+ if (is_writable_device_private_entry(entry))
+ write = MIGRATE_PFN_WRITE;
+ } else {
+ spin_unlock(ptl);
+ return -EAGAIN;
+ }
+
+ folio_get(folio);
+ if (folio != fault_folio && unlikely(!folio_trylock(folio))) {
+ spin_unlock(ptl);
+ folio_put(folio);
+ return migrate_vma_collect_skip(start, end, walk);
+ }
+
+ if (thp_migration_supported() &&
+ (migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
+ (IS_ALIGNED(start, HPAGE_PMD_SIZE) &&
+ IS_ALIGNED(end, HPAGE_PMD_SIZE))) {
+
+ struct page_vma_mapped_walk pvmw = {
+ .ptl = ptl,
+ .address = start,
+ .pmd = pmdp,
+ .vma = walk->vma,
+ };
+
+ unsigned long pfn = page_to_pfn(folio_page(folio, 0));
+
+ migrate->src[migrate->npages] = migrate_pfn(pfn) | write
+ | MIGRATE_PFN_MIGRATE
+ | MIGRATE_PFN_COMPOUND;
+ migrate->dst[migrate->npages++] = 0;
+ migrate->cpages++;
+ ret = set_pmd_migration_entry(&pvmw, folio_page(folio, 0));
+ if (ret) {
+ migrate->npages--;
+ migrate->cpages--;
+ migrate->src[migrate->npages] = 0;
+ migrate->dst[migrate->npages] = 0;
+ goto fallback;
+ }
+ migrate_vma_collect_skip(start + PAGE_SIZE, end, walk);
+ spin_unlock(ptl);
+ return 0;
+ }
+
+fallback:
+ spin_unlock(ptl);
+ ret = split_folio(folio);
+ if (fault_folio != folio)
+ folio_unlock(folio);
+ folio_put(folio);
+ if (ret)
+ return migrate_vma_collect_skip(start, end, walk);
+ if (pmd_none(pmdp_get_lockless(pmdp)))
+ return migrate_vma_collect_hole(start, end, -1, walk);
+
+ return -ENOENT;
+}
+
+static int migrate_vma_collect_pmd(pmd_t *pmdp,
+ unsigned long start,
+ unsigned long end,
+ struct mm_walk *walk)
+{
+ struct migrate_vma *migrate = walk->private;
+ struct vm_area_struct *vma = walk->vma;
+ struct mm_struct *mm = vma->vm_mm;
+ unsigned long addr = start, unmapped = 0;
+ spinlock_t *ptl;
+ struct folio *fault_folio = migrate->fault_page ?
+ page_folio(migrate->fault_page) : NULL;
+ pte_t *ptep;
+
+again:
+ if (pmd_trans_huge(*pmdp) || !pmd_present(*pmdp)) {
+ int ret = migrate_vma_collect_huge_pmd(pmdp, start, end, walk, fault_folio);
+
+ if (ret == -EAGAIN)
+ goto again;
+ if (ret == 0)
+ return 0;
}
ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
@@ -175,8 +284,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
}
- /* FIXME support THP */
- if (!page || !page->mapping || PageTransCompound(page)) {
+ if (!page || !page->mapping) {
mpfn = 0;
goto next;
}
@@ -347,14 +455,6 @@ static bool migrate_vma_check_page(struct page *page, struct page *fault_page)
*/
int extra = 1 + (page == fault_page);
- /*
- * FIXME support THP (transparent huge page), it is bit more complex to
- * check them than regular pages, because they can be mapped with a pmd
- * or with a pte (split pte mapping).
- */
- if (folio_test_large(folio))
- return false;
-
/* Page from ZONE_DEVICE have one extra reference */
if (folio_is_zone_device(folio))
extra++;
@@ -385,17 +485,24 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
lru_add_drain();
- for (i = 0; i < npages; i++) {
+ for (i = 0; i < npages; ) {
struct page *page = migrate_pfn_to_page(src_pfns[i]);
struct folio *folio;
+ unsigned int nr = 1;
if (!page) {
if (src_pfns[i] & MIGRATE_PFN_MIGRATE)
unmapped++;
- continue;
+ goto next;
}
folio = page_folio(page);
+ nr = folio_nr_pages(folio);
+
+ if (nr > 1)
+ src_pfns[i] |= MIGRATE_PFN_COMPOUND;
+
+
/* ZONE_DEVICE folios are not on LRU */
if (!folio_is_zone_device(folio)) {
if (!folio_test_lru(folio) && allow_drain) {
@@ -407,7 +514,7 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
if (!folio_isolate_lru(folio)) {
src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
restore++;
- continue;
+ goto next;
}
/* Drop the reference we took in collect */
@@ -426,10 +533,12 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
restore++;
- continue;
+ goto next;
}
unmapped++;
+next:
+ i += nr;
}
for (i = 0; i < npages && restore; i++) {
@@ -575,6 +684,146 @@ int migrate_vma_setup(struct migrate_vma *args)
}
EXPORT_SYMBOL(migrate_vma_setup);
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+/**
+ * migrate_vma_insert_huge_pmd_page: Insert a huge folio into @migrate->vma->vm_mm
+ * at @addr. folio is already allocated as a part of the migration process with
+ * large page.
+ *
+ * @folio needs to be initialized and setup after it's allocated. The code bits
+ * here follow closely the code in __do_huge_pmd_anonymous_page(). This API does
+ * not support THP zero pages.
+ *
+ * @migrate: migrate_vma arguments
+ * @addr: address where the folio will be inserted
+ * @folio: folio to be inserted at @addr
+ * @src: src pfn which is being migrated
+ * @pmdp: pointer to the pmd
+ */
+static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
+ unsigned long addr,
+ struct page *page,
+ unsigned long *src,
+ pmd_t *pmdp)
+{
+ struct vm_area_struct *vma = migrate->vma;
+ gfp_t gfp = vma_thp_gfp_mask(vma);
+ struct folio *folio = page_folio(page);
+ int ret;
+ spinlock_t *ptl;
+ pgtable_t pgtable;
+ pmd_t entry;
+ bool flush = false;
+ unsigned long i;
+
+ VM_WARN_ON_FOLIO(!folio, folio);
+ VM_WARN_ON_ONCE(!pmd_none(*pmdp) && !is_huge_zero_pmd(*pmdp));
+
+ if (!thp_vma_suitable_order(vma, addr, HPAGE_PMD_ORDER))
+ return -EINVAL;
+
+ ret = anon_vma_prepare(vma);
+ if (ret)
+ return ret;
+
+ folio_set_order(folio, HPAGE_PMD_ORDER);
+ folio_set_large_rmappable(folio);
+
+ if (mem_cgroup_charge(folio, migrate->vma->vm_mm, gfp)) {
+ count_vm_event(THP_FAULT_FALLBACK);
+ count_mthp_stat(HPAGE_PMD_ORDER, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
+ ret = -ENOMEM;
+ goto abort;
+ }
+
+ __folio_mark_uptodate(folio);
+
+ pgtable = pte_alloc_one(vma->vm_mm);
+ if (unlikely(!pgtable))
+ goto abort;
+
+ if (folio_is_device_private(folio)) {
+ swp_entry_t swp_entry;
+
+ if (vma->vm_flags & VM_WRITE)
+ swp_entry = make_writable_device_private_entry(
+ page_to_pfn(page));
+ else
+ swp_entry = make_readable_device_private_entry(
+ page_to_pfn(page));
+ entry = swp_entry_to_pmd(swp_entry);
+ } else {
+ if (folio_is_zone_device(folio) &&
+ !folio_is_device_coherent(folio)) {
+ goto abort;
+ }
+ entry = folio_mk_pmd(folio, vma->vm_page_prot);
+ if (vma->vm_flags & VM_WRITE)
+ entry = pmd_mkwrite(pmd_mkdirty(entry), vma);
+ }
+
+ ptl = pmd_lock(vma->vm_mm, pmdp);
+ ret = check_stable_address_space(vma->vm_mm);
+ if (ret)
+ goto abort;
+
+ /*
+ * Check for userfaultfd but do not deliver the fault. Instead,
+ * just back off.
+ */
+ if (userfaultfd_missing(vma))
+ goto unlock_abort;
+
+ if (!pmd_none(*pmdp)) {
+ if (!is_huge_zero_pmd(*pmdp))
+ goto unlock_abort;
+ flush = true;
+ } else if (!pmd_none(*pmdp))
+ goto unlock_abort;
+
+ add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+ folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE);
+ if (!folio_is_zone_device(folio))
+ folio_add_lru_vma(folio, vma);
+ folio_get(folio);
+
+ if (flush) {
+ pte_free(vma->vm_mm, pgtable);
+ flush_cache_page(vma, addr, addr + HPAGE_PMD_SIZE);
+ pmdp_invalidate(vma, addr, pmdp);
+ } else {
+ pgtable_trans_huge_deposit(vma->vm_mm, pmdp, pgtable);
+ mm_inc_nr_ptes(vma->vm_mm);
+ }
+ set_pmd_at(vma->vm_mm, addr, pmdp, entry);
+ update_mmu_cache_pmd(vma, addr, pmdp);
+
+ spin_unlock(ptl);
+
+ count_vm_event(THP_FAULT_ALLOC);
+ count_mthp_stat(HPAGE_PMD_ORDER, MTHP_STAT_ANON_FAULT_ALLOC);
+ count_memcg_event_mm(vma->vm_mm, THP_FAULT_ALLOC);
+
+ return 0;
+
+unlock_abort:
+ spin_unlock(ptl);
+abort:
+ for (i = 0; i < HPAGE_PMD_NR; i++)
+ src[i] &= ~MIGRATE_PFN_MIGRATE;
+ return 0;
+}
+#else /* !CONFIG_ARCH_ENABLE_THP_MIGRATION */
+static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
+ unsigned long addr,
+ struct page *page,
+ unsigned long *src,
+ pmd_t *pmdp)
+{
+ return 0;
+}
+#endif
+
/*
* This code closely matches the code in:
* __handle_mm_fault()
@@ -585,9 +834,10 @@ EXPORT_SYMBOL(migrate_vma_setup);
*/
static void migrate_vma_insert_page(struct migrate_vma *migrate,
unsigned long addr,
- struct page *page,
+ unsigned long *dst,
unsigned long *src)
{
+ struct page *page = migrate_pfn_to_page(*dst);
struct folio *folio = page_folio(page);
struct vm_area_struct *vma = migrate->vma;
struct mm_struct *mm = vma->vm_mm;
@@ -615,8 +865,25 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
pmdp = pmd_alloc(mm, pudp, addr);
if (!pmdp)
goto abort;
- if (pmd_trans_huge(*pmdp))
- goto abort;
+
+ if (thp_migration_supported() && (*dst & MIGRATE_PFN_COMPOUND)) {
+ int ret = migrate_vma_insert_huge_pmd_page(migrate, addr, page,
+ src, pmdp);
+ if (ret)
+ goto abort;
+ return;
+ }
+
+ if (!pmd_none(*pmdp)) {
+ if (pmd_trans_huge(*pmdp)) {
+ if (!is_huge_zero_pmd(*pmdp))
+ goto abort;
+ folio_get(pmd_folio(*pmdp));
+ split_huge_pmd(vma, pmdp, addr);
+ } else if (pmd_leaf(*pmdp))
+ goto abort;
+ }
+
if (pte_alloc(mm, pmdp))
goto abort;
if (unlikely(anon_vma_prepare(vma)))
@@ -707,23 +974,24 @@ static void __migrate_device_pages(unsigned long *src_pfns,
unsigned long i;
bool notified = false;
- for (i = 0; i < npages; i++) {
+ for (i = 0; i < npages; ) {
struct page *newpage = migrate_pfn_to_page(dst_pfns[i]);
struct page *page = migrate_pfn_to_page(src_pfns[i]);
struct address_space *mapping;
struct folio *newfolio, *folio;
int r, extra_cnt = 0;
+ unsigned long nr = 1;
if (!newpage) {
src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
- continue;
+ goto next;
}
if (!page) {
unsigned long addr;
if (!(src_pfns[i] & MIGRATE_PFN_MIGRATE))
- continue;
+ goto next;
/*
* The only time there is no vma is when called from
@@ -741,15 +1009,47 @@ static void __migrate_device_pages(unsigned long *src_pfns,
migrate->pgmap_owner);
mmu_notifier_invalidate_range_start(&range);
}
- migrate_vma_insert_page(migrate, addr, newpage,
+
+ if ((src_pfns[i] & MIGRATE_PFN_COMPOUND) &&
+ (!(dst_pfns[i] & MIGRATE_PFN_COMPOUND))) {
+ nr = HPAGE_PMD_NR;
+ src_pfns[i] &= ~MIGRATE_PFN_COMPOUND;
+ src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
+ goto next;
+ }
+
+ migrate_vma_insert_page(migrate, addr, &dst_pfns[i],
&src_pfns[i]);
- continue;
+ goto next;
}
newfolio = page_folio(newpage);
folio = page_folio(page);
mapping = folio_mapping(folio);
+ /*
+ * If THP migration is enabled, check if both src and dst
+ * can migrate large pages
+ */
+ if (thp_migration_supported()) {
+ if ((src_pfns[i] & MIGRATE_PFN_MIGRATE) &&
+ (src_pfns[i] & MIGRATE_PFN_COMPOUND) &&
+ !(dst_pfns[i] & MIGRATE_PFN_COMPOUND)) {
+
+ if (!migrate) {
+ src_pfns[i] &= ~(MIGRATE_PFN_MIGRATE |
+ MIGRATE_PFN_COMPOUND);
+ goto next;
+ }
+ src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
+ } else if ((src_pfns[i] & MIGRATE_PFN_MIGRATE) &&
+ (dst_pfns[i] & MIGRATE_PFN_COMPOUND) &&
+ !(src_pfns[i] & MIGRATE_PFN_COMPOUND)) {
+ src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
+ }
+ }
+
+
if (folio_is_device_private(newfolio) ||
folio_is_device_coherent(newfolio)) {
if (mapping) {
@@ -762,7 +1062,7 @@ static void __migrate_device_pages(unsigned long *src_pfns,
if (!folio_test_anon(folio) ||
!folio_free_swap(folio)) {
src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
- continue;
+ goto next;
}
}
} else if (folio_is_zone_device(newfolio)) {
@@ -770,7 +1070,7 @@ static void __migrate_device_pages(unsigned long *src_pfns,
* Other types of ZONE_DEVICE page are not supported.
*/
src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
- continue;
+ goto next;
}
BUG_ON(folio_test_writeback(folio));
@@ -782,6 +1082,8 @@ static void __migrate_device_pages(unsigned long *src_pfns,
src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
else
folio_migrate_flags(newfolio, folio);
+next:
+ i += nr;
}
if (notified)
@@ -943,10 +1245,23 @@ static unsigned long migrate_device_pfn_lock(unsigned long pfn)
int migrate_device_range(unsigned long *src_pfns, unsigned long start,
unsigned long npages)
{
- unsigned long i, pfn;
+ unsigned long i, j, pfn;
+
+ for (pfn = start, i = 0; i < npages; pfn++, i++) {
+ struct page *page = pfn_to_page(pfn);
+ struct folio *folio = page_folio(page);
+ unsigned int nr = 1;
- for (pfn = start, i = 0; i < npages; pfn++, i++)
src_pfns[i] = migrate_device_pfn_lock(pfn);
+ nr = folio_nr_pages(folio);
+ if (nr > 1) {
+ src_pfns[i] |= MIGRATE_PFN_COMPOUND;
+ for (j = 1; j < nr; j++)
+ src_pfns[i+j] = 0;
+ i += j - 1;
+ pfn += j - 1;
+ }
+ }
migrate_device_unmap(src_pfns, npages, NULL);
--
2.49.0
* [v1 resend 05/12] mm/memory/fault: add support for zone device THP fault handling
2025-07-03 23:34 [v1 resend 00/12] THP support for zone device page migration Balbir Singh
` (3 preceding siblings ...)
2025-07-03 23:35 ` [v1 resend 04/12] mm/migrate_device: THP migration of zone device pages Balbir Singh
@ 2025-07-03 23:35 ` Balbir Singh
2025-07-17 19:34 ` Matthew Brost
2025-07-03 23:35 ` [v1 resend 06/12] lib/test_hmm: test cases and support for zone device private THP Balbir Singh
` (9 subsequent siblings)
14 siblings, 1 reply; 99+ messages in thread
From: Balbir Singh @ 2025-07-03 23:35 UTC (permalink / raw)
To: linux-mm
Cc: akpm, linux-kernel, Balbir Singh, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
When the CPU touches a zone device THP entry, the data needs to be
migrated back to the CPU; call migrate_to_ram() on these pages via the
do_huge_pmd_device_private() fault handling helper.
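
On the driver side, the migrate_to_ram() callback sees vmf->page pointing
at the device private THP and can request a compound migration back to
system memory. A rough sketch (error handling and the data copy are
elided; src_pfns, dst_pfns, drvdata and drv_alloc_and_copy_to_ram() are
placeholders):

static vm_fault_t drv_migrate_to_ram(struct vm_fault *vmf)
{
        unsigned long start = vmf->address & HPAGE_PMD_MASK;
        struct migrate_vma args = {
                .vma            = vmf->vma,
                .start          = start,
                .end            = start + HPAGE_PMD_SIZE,
                .src            = src_pfns,
                .dst            = dst_pfns,
                .fault_page     = vmf->page,
                .pgmap_owner    = drvdata,
                .flags          = MIGRATE_VMA_SELECT_DEVICE_PRIVATE |
                                  MIGRATE_VMA_SELECT_COMPOUND,
        };

        if (migrate_vma_setup(&args))
                return VM_FAULT_SIGBUS;
        if (!args.cpages)
                return 0;

        /* allocate system folio(s), copy device data, fill args.dst[] */
        drv_alloc_and_copy_to_ram(&args);

        migrate_vma_pages(&args);
        migrate_vma_finalize(&args);
        return 0;
}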
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
include/linux/huge_mm.h | 7 +++++++
mm/huge_memory.c | 40 ++++++++++++++++++++++++++++++++++++++++
mm/memory.c | 6 ++++--
3 files changed, 51 insertions(+), 2 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 4d5bb67dc4ec..65a1bdf29bb9 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -474,6 +474,8 @@ static inline bool folio_test_pmd_mappable(struct folio *folio)
vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);
+vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf);
+
extern struct folio *huge_zero_folio;
extern unsigned long huge_zero_pfn;
@@ -627,6 +629,11 @@ static inline vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
return 0;
}
+static inline vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf)
+{
+ return 0;
+}
+
static inline bool is_huge_zero_folio(const struct folio *folio)
{
return false;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e6e390d0308f..f29add796931 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1267,6 +1267,46 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
}
+vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
+ vm_fault_t ret = 0;
+ spinlock_t *ptl;
+ swp_entry_t swp_entry;
+ struct page *page;
+
+ if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER))
+ return VM_FAULT_FALLBACK;
+
+ if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
+ vma_end_read(vma);
+ return VM_FAULT_RETRY;
+ }
+
+ ptl = pmd_lock(vma->vm_mm, vmf->pmd);
+ if (unlikely(!pmd_same(*vmf->pmd, vmf->orig_pmd))) {
+ spin_unlock(ptl);
+ return 0;
+ }
+
+ swp_entry = pmd_to_swp_entry(vmf->orig_pmd);
+ page = pfn_swap_entry_to_page(swp_entry);
+ vmf->page = page;
+ vmf->pte = NULL;
+ if (trylock_page(vmf->page)) {
+ get_page(page);
+ spin_unlock(ptl);
+ ret = page_pgmap(page)->ops->migrate_to_ram(vmf);
+ unlock_page(vmf->page);
+ put_page(page);
+ } else {
+ spin_unlock(ptl);
+ }
+
+ return ret;
+}
+
/*
* always: directly stall for all thp allocations
* defer: wake kswapd and fail if not immediately available
diff --git a/mm/memory.c b/mm/memory.c
index 0f9b32a20e5b..c26c421b8325 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6165,8 +6165,10 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
vmf.orig_pmd = pmdp_get_lockless(vmf.pmd);
if (unlikely(is_swap_pmd(vmf.orig_pmd))) {
- VM_BUG_ON(thp_migration_supported() &&
- !is_pmd_migration_entry(vmf.orig_pmd));
+ if (is_device_private_entry(
+ pmd_to_swp_entry(vmf.orig_pmd)))
+ return do_huge_pmd_device_private(&vmf);
+
if (is_pmd_migration_entry(vmf.orig_pmd))
pmd_migration_entry_wait(mm, vmf.pmd);
return 0;
--
2.49.0
* [v1 resend 06/12] lib/test_hmm: test cases and support for zone device private THP
2025-07-03 23:34 [v1 resend 00/12] THP support for zone device page migration Balbir Singh
` (4 preceding siblings ...)
2025-07-03 23:35 ` [v1 resend 05/12] mm/memory/fault: add support for zone device THP fault handling Balbir Singh
@ 2025-07-03 23:35 ` Balbir Singh
2025-07-03 23:35 ` [v1 resend 07/12] mm/memremap: add folio_split support Balbir Singh
` (8 subsequent siblings)
14 siblings, 0 replies; 99+ messages in thread
From: Balbir Singh @ 2025-07-03 23:35 UTC (permalink / raw)
To: linux-mm
Cc: akpm, linux-kernel, Balbir Singh, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
Enhance the hmm test driver (lib/test_hmm) with support for THP pages.

A new pool of free folios (free_folios) has been added to the dmirror
device, from which a THP zone device private page can be allocated when
requested.

Add compound page awareness to the allocation function for both normal
migration and fault based migration. These routines also copy
folio_nr_pages() worth of data when moving it between system memory and
device memory.

args.src and args.dst, which hold the migration entries, are now
dynamically allocated (as they need to hold HPAGE_PMD_NR entries or
more).

Split and migrate support will be added in future patches in this
series.
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
lib/test_hmm.c | 355 +++++++++++++++++++++++++++++++++++++++----------
1 file changed, 282 insertions(+), 73 deletions(-)
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 761725bc713c..95b4276a17fd 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -119,6 +119,7 @@ struct dmirror_device {
unsigned long calloc;
unsigned long cfree;
struct page *free_pages;
+ struct folio *free_folios;
spinlock_t lock; /* protects the above */
};
@@ -492,7 +493,7 @@ static int dmirror_write(struct dmirror *dmirror, struct hmm_dmirror_cmd *cmd)
}
static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
- struct page **ppage)
+ struct page **ppage, bool is_large)
{
struct dmirror_chunk *devmem;
struct resource *res = NULL;
@@ -572,20 +573,45 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
pfn_first, pfn_last);
spin_lock(&mdevice->lock);
- for (pfn = pfn_first; pfn < pfn_last; pfn++) {
+ for (pfn = pfn_first; pfn < pfn_last; ) {
struct page *page = pfn_to_page(pfn);
+ if (is_large && IS_ALIGNED(pfn, HPAGE_PMD_NR)
+ && (pfn + HPAGE_PMD_NR <= pfn_last)) {
+ page->zone_device_data = mdevice->free_folios;
+ mdevice->free_folios = page_folio(page);
+ pfn += HPAGE_PMD_NR;
+ continue;
+ }
+
page->zone_device_data = mdevice->free_pages;
mdevice->free_pages = page;
+ pfn++;
}
+
+ ret = 0;
if (ppage) {
- *ppage = mdevice->free_pages;
- mdevice->free_pages = (*ppage)->zone_device_data;
- mdevice->calloc++;
+ if (is_large) {
+ if (!mdevice->free_folios) {
+ ret = -ENOMEM;
+ goto err_unlock;
+ }
+ *ppage = folio_page(mdevice->free_folios, 0);
+ mdevice->free_folios = (*ppage)->zone_device_data;
+ mdevice->calloc += HPAGE_PMD_NR;
+ } else if (mdevice->free_pages) {
+ *ppage = mdevice->free_pages;
+ mdevice->free_pages = (*ppage)->zone_device_data;
+ mdevice->calloc++;
+ } else {
+ ret = -ENOMEM;
+ goto err_unlock;
+ }
}
+err_unlock:
spin_unlock(&mdevice->lock);
- return 0;
+ return ret;
err_release:
mutex_unlock(&mdevice->devmem_lock);
@@ -598,10 +624,13 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
return ret;
}
-static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
+static struct page *dmirror_devmem_alloc_page(struct dmirror *dmirror,
+ bool is_large)
{
struct page *dpage = NULL;
struct page *rpage = NULL;
+ unsigned int order = is_large ? HPAGE_PMD_ORDER : 0;
+ struct dmirror_device *mdevice = dmirror->mdevice;
/*
* For ZONE_DEVICE private type, this is a fake device so we allocate
@@ -610,49 +639,55 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
* data and ignore rpage.
*/
if (dmirror_is_private_zone(mdevice)) {
- rpage = alloc_page(GFP_HIGHUSER);
+ rpage = folio_page(folio_alloc(GFP_HIGHUSER, order), 0);
if (!rpage)
return NULL;
}
spin_lock(&mdevice->lock);
- if (mdevice->free_pages) {
+ if (is_large && mdevice->free_folios) {
+ dpage = folio_page(mdevice->free_folios, 0);
+ mdevice->free_folios = dpage->zone_device_data;
+ mdevice->calloc += 1 << order;
+ spin_unlock(&mdevice->lock);
+ } else if (!is_large && mdevice->free_pages) {
dpage = mdevice->free_pages;
mdevice->free_pages = dpage->zone_device_data;
mdevice->calloc++;
spin_unlock(&mdevice->lock);
} else {
spin_unlock(&mdevice->lock);
- if (dmirror_allocate_chunk(mdevice, &dpage))
+ if (dmirror_allocate_chunk(mdevice, &dpage, is_large))
goto error;
}
- zone_device_page_init(dpage);
+ init_zone_device_folio(page_folio(dpage), order);
dpage->zone_device_data = rpage;
return dpage;
error:
if (rpage)
- __free_page(rpage);
+ __free_pages(rpage, order);
return NULL;
}
static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
struct dmirror *dmirror)
{
- struct dmirror_device *mdevice = dmirror->mdevice;
const unsigned long *src = args->src;
unsigned long *dst = args->dst;
unsigned long addr;
- for (addr = args->start; addr < args->end; addr += PAGE_SIZE,
- src++, dst++) {
+ for (addr = args->start; addr < args->end; ) {
struct page *spage;
struct page *dpage;
struct page *rpage;
+ bool is_large = *src & MIGRATE_PFN_COMPOUND;
+ int write = (*src & MIGRATE_PFN_WRITE) ? MIGRATE_PFN_WRITE : 0;
+ unsigned long nr = 1;
if (!(*src & MIGRATE_PFN_MIGRATE))
- continue;
+ goto next;
/*
* Note that spage might be NULL which is OK since it is an
@@ -662,17 +697,45 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
if (WARN(spage && is_zone_device_page(spage),
"page already in device spage pfn: 0x%lx\n",
page_to_pfn(spage)))
+ goto next;
+
+ dpage = dmirror_devmem_alloc_page(dmirror, is_large);
+ if (!dpage) {
+ struct folio *folio;
+ unsigned long i;
+ unsigned long spfn = *src >> MIGRATE_PFN_SHIFT;
+ struct page *src_page;
+
+ if (!is_large)
+ goto next;
+
+ if (!spage && is_large) {
+ nr = HPAGE_PMD_NR;
+ } else {
+ folio = page_folio(spage);
+ nr = folio_nr_pages(folio);
+ }
+
+ for (i = 0; i < nr && addr < args->end; i++) {
+ dpage = dmirror_devmem_alloc_page(dmirror, false);
+ rpage = BACKING_PAGE(dpage);
+ rpage->zone_device_data = dmirror;
+
+ *dst = migrate_pfn(page_to_pfn(dpage)) | write;
+ src_page = pfn_to_page(spfn + i);
+
+ if (spage)
+ copy_highpage(rpage, src_page);
+ else
+ clear_highpage(rpage);
+ src++;
+ dst++;
+ addr += PAGE_SIZE;
+ }
continue;
-
- dpage = dmirror_devmem_alloc_page(mdevice);
- if (!dpage)
- continue;
+ }
rpage = BACKING_PAGE(dpage);
- if (spage)
- copy_highpage(rpage, spage);
- else
- clear_highpage(rpage);
/*
* Normally, a device would use the page->zone_device_data to
@@ -684,10 +747,42 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
pr_debug("migrating from sys to dev pfn src: 0x%lx pfn dst: 0x%lx\n",
page_to_pfn(spage), page_to_pfn(dpage));
- *dst = migrate_pfn(page_to_pfn(dpage));
- if ((*src & MIGRATE_PFN_WRITE) ||
- (!spage && args->vma->vm_flags & VM_WRITE))
- *dst |= MIGRATE_PFN_WRITE;
+
+ *dst = migrate_pfn(page_to_pfn(dpage)) | write;
+
+ if (is_large) {
+ int i;
+ struct folio *folio = page_folio(dpage);
+ *dst |= MIGRATE_PFN_COMPOUND;
+
+ if (folio_test_large(folio)) {
+ for (i = 0; i < folio_nr_pages(folio); i++) {
+ struct page *dst_page =
+ pfn_to_page(page_to_pfn(rpage) + i);
+ struct page *src_page =
+ pfn_to_page(page_to_pfn(spage) + i);
+
+ if (spage)
+ copy_highpage(dst_page, src_page);
+ else
+ clear_highpage(dst_page);
+ src++;
+ dst++;
+ addr += PAGE_SIZE;
+ }
+ continue;
+ }
+ }
+
+ if (spage)
+ copy_highpage(rpage, spage);
+ else
+ clear_highpage(rpage);
+
+next:
+ src++;
+ dst++;
+ addr += PAGE_SIZE;
}
}
@@ -734,14 +829,17 @@ static int dmirror_migrate_finalize_and_map(struct migrate_vma *args,
const unsigned long *src = args->src;
const unsigned long *dst = args->dst;
unsigned long pfn;
+ const unsigned long start_pfn = start >> PAGE_SHIFT;
+ const unsigned long end_pfn = end >> PAGE_SHIFT;
/* Map the migrated pages into the device's page tables. */
mutex_lock(&dmirror->mutex);
- for (pfn = start >> PAGE_SHIFT; pfn < (end >> PAGE_SHIFT); pfn++,
- src++, dst++) {
+ for (pfn = start_pfn; pfn < end_pfn; pfn++, src++, dst++) {
struct page *dpage;
void *entry;
+ int nr, i;
+ struct page *rpage;
if (!(*src & MIGRATE_PFN_MIGRATE))
continue;
@@ -750,13 +848,25 @@ static int dmirror_migrate_finalize_and_map(struct migrate_vma *args,
if (!dpage)
continue;
- entry = BACKING_PAGE(dpage);
- if (*dst & MIGRATE_PFN_WRITE)
- entry = xa_tag_pointer(entry, DPT_XA_TAG_WRITE);
- entry = xa_store(&dmirror->pt, pfn, entry, GFP_ATOMIC);
- if (xa_is_err(entry)) {
- mutex_unlock(&dmirror->mutex);
- return xa_err(entry);
+ if (*dst & MIGRATE_PFN_COMPOUND)
+ nr = folio_nr_pages(page_folio(dpage));
+ else
+ nr = 1;
+
+ WARN_ON_ONCE(end_pfn < start_pfn + nr);
+
+ rpage = BACKING_PAGE(dpage);
+ VM_BUG_ON(folio_nr_pages(page_folio(rpage)) != nr);
+
+ for (i = 0; i < nr; i++) {
+ entry = folio_page(page_folio(rpage), i);
+ if (*dst & MIGRATE_PFN_WRITE)
+ entry = xa_tag_pointer(entry, DPT_XA_TAG_WRITE);
+ entry = xa_store(&dmirror->pt, pfn + i, entry, GFP_ATOMIC);
+ if (xa_is_err(entry)) {
+ mutex_unlock(&dmirror->mutex);
+ return xa_err(entry);
+ }
}
}
@@ -829,31 +939,61 @@ static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma *args,
unsigned long start = args->start;
unsigned long end = args->end;
unsigned long addr;
+ unsigned int order = 0;
+ int i;
- for (addr = start; addr < end; addr += PAGE_SIZE,
- src++, dst++) {
+ for (addr = start; addr < end; ) {
struct page *dpage, *spage;
spage = migrate_pfn_to_page(*src);
if (!spage || !(*src & MIGRATE_PFN_MIGRATE))
- continue;
+ goto next;
if (WARN_ON(!is_device_private_page(spage) &&
!is_device_coherent_page(spage)))
- continue;
+ goto next;
spage = BACKING_PAGE(spage);
- dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
- if (!dpage)
- continue;
- pr_debug("migrating from dev to sys pfn src: 0x%lx pfn dst: 0x%lx\n",
- page_to_pfn(spage), page_to_pfn(dpage));
+ order = folio_order(page_folio(spage));
+
+ if (order)
+ dpage = folio_page(vma_alloc_folio(GFP_HIGHUSER_MOVABLE,
+ order, args->vma, addr), 0);
+ else
+ dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
+
+ /* Try with smaller pages if large allocation fails */
+ if (!dpage && order) {
+ dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
+ if (!dpage)
+ return VM_FAULT_OOM;
+ order = 0;
+ }
+ pr_debug("migrating from sys to dev pfn src: 0x%lx pfn dst: 0x%lx\n",
+ page_to_pfn(spage), page_to_pfn(dpage));
lock_page(dpage);
xa_erase(&dmirror->pt, addr >> PAGE_SHIFT);
copy_highpage(dpage, spage);
*dst = migrate_pfn(page_to_pfn(dpage));
if (*src & MIGRATE_PFN_WRITE)
*dst |= MIGRATE_PFN_WRITE;
+ if (order)
+ *dst |= MIGRATE_PFN_COMPOUND;
+
+ for (i = 0; i < (1 << order); i++) {
+ struct page *src_page;
+ struct page *dst_page;
+
+ src_page = pfn_to_page(page_to_pfn(spage) + i);
+ dst_page = pfn_to_page(page_to_pfn(dpage) + i);
+
+ xa_erase(&dmirror->pt, addr >> PAGE_SHIFT);
+ copy_highpage(dst_page, src_page);
+ }
+next:
+ addr += PAGE_SIZE << order;
+ src += 1 << order;
+ dst += 1 << order;
}
return 0;
}
@@ -879,11 +1019,14 @@ static int dmirror_migrate_to_system(struct dmirror *dmirror,
unsigned long size = cmd->npages << PAGE_SHIFT;
struct mm_struct *mm = dmirror->notifier.mm;
struct vm_area_struct *vma;
- unsigned long src_pfns[32] = { 0 };
- unsigned long dst_pfns[32] = { 0 };
struct migrate_vma args = { 0 };
unsigned long next;
int ret;
+ unsigned long *src_pfns;
+ unsigned long *dst_pfns;
+
+ src_pfns = kvcalloc(PTRS_PER_PTE, sizeof(*src_pfns), GFP_KERNEL | __GFP_NOFAIL);
+ dst_pfns = kvcalloc(PTRS_PER_PTE, sizeof(*dst_pfns), GFP_KERNEL | __GFP_NOFAIL);
start = cmd->addr;
end = start + size;
@@ -902,7 +1045,7 @@ static int dmirror_migrate_to_system(struct dmirror *dmirror,
ret = -EINVAL;
goto out;
}
- next = min(end, addr + (ARRAY_SIZE(src_pfns) << PAGE_SHIFT));
+ next = min(end, addr + (PTRS_PER_PTE << PAGE_SHIFT));
if (next > vma->vm_end)
next = vma->vm_end;
@@ -912,7 +1055,7 @@ static int dmirror_migrate_to_system(struct dmirror *dmirror,
args.start = addr;
args.end = next;
args.pgmap_owner = dmirror->mdevice;
- args.flags = dmirror_select_device(dmirror);
+ args.flags = dmirror_select_device(dmirror) | MIGRATE_VMA_SELECT_COMPOUND;
ret = migrate_vma_setup(&args);
if (ret)
@@ -928,6 +1071,8 @@ static int dmirror_migrate_to_system(struct dmirror *dmirror,
out:
mmap_read_unlock(mm);
mmput(mm);
+ kvfree(src_pfns);
+ kvfree(dst_pfns);
return ret;
}
@@ -939,12 +1084,12 @@ static int dmirror_migrate_to_device(struct dmirror *dmirror,
unsigned long size = cmd->npages << PAGE_SHIFT;
struct mm_struct *mm = dmirror->notifier.mm;
struct vm_area_struct *vma;
- unsigned long src_pfns[32] = { 0 };
- unsigned long dst_pfns[32] = { 0 };
struct dmirror_bounce bounce;
struct migrate_vma args = { 0 };
unsigned long next;
int ret;
+ unsigned long *src_pfns;
+ unsigned long *dst_pfns;
start = cmd->addr;
end = start + size;
@@ -955,6 +1100,18 @@ static int dmirror_migrate_to_device(struct dmirror *dmirror,
if (!mmget_not_zero(mm))
return -EINVAL;
+ ret = -ENOMEM;
+ src_pfns = kvcalloc(PTRS_PER_PTE, sizeof(*src_pfns),
+ GFP_KERNEL | __GFP_NOFAIL);
+ if (!src_pfns)
+ goto free_mem;
+
+ dst_pfns = kvcalloc(PTRS_PER_PTE, sizeof(*dst_pfns),
+ GFP_KERNEL | __GFP_NOFAIL);
+ if (!dst_pfns)
+ goto free_mem;
+
+ ret = 0;
mmap_read_lock(mm);
for (addr = start; addr < end; addr = next) {
vma = vma_lookup(mm, addr);
@@ -962,7 +1119,7 @@ static int dmirror_migrate_to_device(struct dmirror *dmirror,
ret = -EINVAL;
goto out;
}
- next = min(end, addr + (ARRAY_SIZE(src_pfns) << PAGE_SHIFT));
+ next = min(end, addr + (PTRS_PER_PTE << PAGE_SHIFT));
if (next > vma->vm_end)
next = vma->vm_end;
@@ -972,7 +1129,8 @@ static int dmirror_migrate_to_device(struct dmirror *dmirror,
args.start = addr;
args.end = next;
args.pgmap_owner = dmirror->mdevice;
- args.flags = MIGRATE_VMA_SELECT_SYSTEM;
+ args.flags = MIGRATE_VMA_SELECT_SYSTEM |
+ MIGRATE_VMA_SELECT_COMPOUND;
ret = migrate_vma_setup(&args);
if (ret)
goto out;
@@ -992,7 +1150,7 @@ static int dmirror_migrate_to_device(struct dmirror *dmirror,
*/
ret = dmirror_bounce_init(&bounce, start, size);
if (ret)
- return ret;
+ goto free_mem;
mutex_lock(&dmirror->mutex);
ret = dmirror_do_read(dmirror, start, end, &bounce);
mutex_unlock(&dmirror->mutex);
@@ -1003,11 +1161,14 @@ static int dmirror_migrate_to_device(struct dmirror *dmirror,
}
cmd->cpages = bounce.cpages;
dmirror_bounce_fini(&bounce);
- return ret;
+ goto free_mem;
out:
mmap_read_unlock(mm);
mmput(mm);
+free_mem:
+ kfree(src_pfns);
+ kfree(dst_pfns);
return ret;
}
@@ -1200,6 +1361,7 @@ static void dmirror_device_evict_chunk(struct dmirror_chunk *chunk)
unsigned long i;
unsigned long *src_pfns;
unsigned long *dst_pfns;
+ unsigned int order = 0;
src_pfns = kvcalloc(npages, sizeof(*src_pfns), GFP_KERNEL | __GFP_NOFAIL);
dst_pfns = kvcalloc(npages, sizeof(*dst_pfns), GFP_KERNEL | __GFP_NOFAIL);
@@ -1215,13 +1377,25 @@ static void dmirror_device_evict_chunk(struct dmirror_chunk *chunk)
if (WARN_ON(!is_device_private_page(spage) &&
!is_device_coherent_page(spage)))
continue;
+
+ order = folio_order(page_folio(spage));
spage = BACKING_PAGE(spage);
- dpage = alloc_page(GFP_HIGHUSER_MOVABLE | __GFP_NOFAIL);
+ if (src_pfns[i] & MIGRATE_PFN_COMPOUND) {
+ dpage = folio_page(folio_alloc(GFP_HIGHUSER_MOVABLE,
+ order), 0);
+ } else {
+ dpage = alloc_page(GFP_HIGHUSER_MOVABLE | __GFP_NOFAIL);
+ order = 0;
+ }
+
+ /* TODO Support splitting here */
lock_page(dpage);
- copy_highpage(dpage, spage);
dst_pfns[i] = migrate_pfn(page_to_pfn(dpage));
if (src_pfns[i] & MIGRATE_PFN_WRITE)
dst_pfns[i] |= MIGRATE_PFN_WRITE;
+ if (order)
+ dst_pfns[i] |= MIGRATE_PFN_COMPOUND;
+ folio_copy(page_folio(dpage), page_folio(spage));
}
migrate_device_pages(src_pfns, dst_pfns, npages);
migrate_device_finalize(src_pfns, dst_pfns, npages);
@@ -1234,7 +1408,12 @@ static void dmirror_remove_free_pages(struct dmirror_chunk *devmem)
{
struct dmirror_device *mdevice = devmem->mdevice;
struct page *page;
+ struct folio *folio;
+
+ for (folio = mdevice->free_folios; folio; folio = folio_zone_device_data(folio))
+ if (dmirror_page_to_chunk(folio_page(folio, 0)) == devmem)
+ mdevice->free_folios = folio_zone_device_data(folio);
for (page = mdevice->free_pages; page; page = page->zone_device_data)
if (dmirror_page_to_chunk(page) == devmem)
mdevice->free_pages = page->zone_device_data;
@@ -1265,6 +1444,7 @@ static void dmirror_device_remove_chunks(struct dmirror_device *mdevice)
mdevice->devmem_count = 0;
mdevice->devmem_capacity = 0;
mdevice->free_pages = NULL;
+ mdevice->free_folios = NULL;
kfree(mdevice->devmem_chunks);
mdevice->devmem_chunks = NULL;
}
@@ -1378,18 +1558,30 @@ static void dmirror_devmem_free(struct page *page)
{
struct page *rpage = BACKING_PAGE(page);
struct dmirror_device *mdevice;
+ struct folio *folio = page_folio(rpage);
+ unsigned int order = folio_order(folio);
- if (rpage != page)
- __free_page(rpage);
+ if (rpage != page) {
+ if (order)
+ __free_pages(rpage, order);
+ else
+ __free_page(rpage);
+ rpage = NULL;
+ }
mdevice = dmirror_page_to_device(page);
spin_lock(&mdevice->lock);
/* Return page to our allocator if not freeing the chunk */
if (!dmirror_page_to_chunk(page)->remove) {
- mdevice->cfree++;
- page->zone_device_data = mdevice->free_pages;
- mdevice->free_pages = page;
+ mdevice->cfree += 1 << order;
+ if (order) {
+ page->zone_device_data = mdevice->free_folios;
+ mdevice->free_folios = page_folio(page);
+ } else {
+ page->zone_device_data = mdevice->free_pages;
+ mdevice->free_pages = page;
+ }
}
spin_unlock(&mdevice->lock);
}
@@ -1397,11 +1589,10 @@ static void dmirror_devmem_free(struct page *page)
static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
{
struct migrate_vma args = { 0 };
- unsigned long src_pfns = 0;
- unsigned long dst_pfns = 0;
struct page *rpage;
struct dmirror *dmirror;
- vm_fault_t ret;
+ vm_fault_t ret = 0;
+ unsigned int order, nr;
/*
* Normally, a device would use the page->zone_device_data to point to
@@ -1412,21 +1603,36 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
dmirror = rpage->zone_device_data;
/* FIXME demonstrate how we can adjust migrate range */
+ order = folio_order(page_folio(vmf->page));
+ nr = 1 << order;
+
+ /*
+ * Consider a per-cpu cache of src and dst pfns, but with
+ * large number of cpus that might not scale well.
+ */
+ args.start = ALIGN_DOWN(vmf->address, (1 << (PAGE_SHIFT + order)));
args.vma = vmf->vma;
- args.start = vmf->address;
- args.end = args.start + PAGE_SIZE;
- args.src = &src_pfns;
- args.dst = &dst_pfns;
+ args.end = args.start + (PAGE_SIZE << order);
+ args.src = kcalloc(nr, sizeof(*args.src), GFP_KERNEL);
+ args.dst = kcalloc(nr, sizeof(*args.dst), GFP_KERNEL);
args.pgmap_owner = dmirror->mdevice;
args.flags = dmirror_select_device(dmirror);
args.fault_page = vmf->page;
+ if (!args.src || !args.dst) {
+ ret = VM_FAULT_OOM;
+ goto err;
+ }
+
+ if (order)
+ args.flags |= MIGRATE_VMA_SELECT_COMPOUND;
+
if (migrate_vma_setup(&args))
return VM_FAULT_SIGBUS;
ret = dmirror_devmem_fault_alloc_and_copy(&args, dmirror);
if (ret)
- return ret;
+ goto err;
migrate_vma_pages(&args);
/*
* No device finalize step is needed since
@@ -1434,7 +1640,10 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
* invalidated the device page table.
*/
migrate_vma_finalize(&args);
- return 0;
+err:
+ kfree(args.src);
+ kfree(args.dst);
+ return ret;
}
static const struct dev_pagemap_ops dmirror_devmem_ops = {
@@ -1465,7 +1674,7 @@ static int dmirror_device_init(struct dmirror_device *mdevice, int id)
return ret;
/* Build a list of free ZONE_DEVICE struct pages */
- return dmirror_allocate_chunk(mdevice, NULL);
+ return dmirror_allocate_chunk(mdevice, NULL, false);
}
static void dmirror_device_remove(struct dmirror_device *mdevice)
--
2.49.0
* [v1 resend 07/12] mm/memremap: add folio_split support
2025-07-03 23:34 [v1 resend 00/12] THP support for zone device page migration Balbir Singh
` (5 preceding siblings ...)
2025-07-03 23:35 ` [v1 resend 06/12] lib/test_hmm: test cases and support for zone device private THP Balbir Singh
@ 2025-07-03 23:35 ` Balbir Singh
2025-07-04 11:14 ` Mika Penttilä
2025-07-03 23:35 ` [v1 resend 08/12] mm/thp: add split during migration support Balbir Singh
` (7 subsequent siblings)
14 siblings, 1 reply; 99+ messages in thread
From: Balbir Singh @ 2025-07-03 23:35 UTC (permalink / raw)
To: linux-mm
Cc: akpm, linux-kernel, Balbir Singh, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
When a zone device folio is split (via a huge PMD folio split), the
driver's folio_split callback is invoked to let the device driver
know that the folio has been split into smaller orders.
The HMM test driver has been updated to handle the split. Because the
test driver uses backing pages (to implement the mirror device), it
needs a mechanism to reorganize those backing pages into folios of
the right order again. This is supported by exporting
prep_compound_page().
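For orientation, a minimal sketch of a driver hooking into the new
callback (the my_ names are hypothetical, not from this series): the
callback is invoked once per new tail folio, and one final time with
tail == NULL so the driver can fix up the head folio's metadata once
the split is complete.

static void my_devmem_folio_split(struct folio *head, struct folio *tail)
{
	if (!tail) {
		/* End of split: fix up the head folio's driver metadata */
		return;
	}
	/* Carry driver-private state over to the newly created tail folio */
	tail->page.zone_device_data = head->page.zone_device_data;
}

static const struct dev_pagemap_ops my_devmem_ops = {
	/* .page_free and .migrate_to_ram omitted for brevity */
	.folio_split	= my_devmem_folio_split,
};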
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
include/linux/memremap.h | 7 +++++++
include/linux/mm.h | 1 +
lib/test_hmm.c | 42 ++++++++++++++++++++++++++++++++++++++++
mm/huge_memory.c | 14 ++++++++++++++
mm/page_alloc.c | 1 +
5 files changed, 65 insertions(+)
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 11d586dd8ef1..2091b754f1da 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -100,6 +100,13 @@ struct dev_pagemap_ops {
*/
int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long pfn,
unsigned long nr_pages, int mf_flags);
+
+ /*
+ * Used for private (un-addressable) device memory only.
+ * This callback is used when a folio is split into
+ * a smaller folio
+ */
+ void (*folio_split)(struct folio *head, struct folio *tail);
};
#define PGMAP_ALTMAP_VALID (1 << 0)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ef40f68c1183..f7bda8b1e46c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1183,6 +1183,7 @@ static inline struct folio *virt_to_folio(const void *x)
void __folio_put(struct folio *folio);
void split_page(struct page *page, unsigned int order);
+void prep_compound_page(struct page *page, unsigned int order);
void folio_copy(struct folio *dst, struct folio *src);
int folio_mc_copy(struct folio *dst, struct folio *src);
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 95b4276a17fd..e20021fb7c69 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -1646,9 +1646,51 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
return ret;
}
+static void dmirror_devmem_folio_split(struct folio *head, struct folio *tail)
+{
+ struct page *rpage = BACKING_PAGE(folio_page(head, 0));
+ struct page *rpage_tail;
+ struct folio *rfolio;
+ unsigned long offset = 0;
+ unsigned int tail_order;
+ unsigned int head_order = folio_order(head);
+
+ if (!rpage) {
+ tail->page.zone_device_data = NULL;
+ return;
+ }
+
+ rfolio = page_folio(rpage);
+
+ if (tail == NULL) {
+ folio_reset_order(rfolio);
+ rfolio->mapping = NULL;
+ if (head_order)
+ prep_compound_page(rpage, head_order);
+ folio_set_count(rfolio, 1 << head_order);
+ return;
+ }
+
+ offset = folio_pfn(tail) - folio_pfn(head);
+
+ rpage_tail = folio_page(rfolio, offset);
+ tail->page.zone_device_data = rpage_tail;
+ clear_compound_head(rpage_tail);
+ rpage_tail->mapping = NULL;
+
+ tail_order = folio_order(tail);
+ if (tail_order)
+ prep_compound_page(rpage_tail, tail_order);
+
+ folio_page(tail, 0)->mapping = folio_page(head, 0)->mapping;
+ tail->pgmap = head->pgmap;
+ folio_set_count(page_folio(rpage_tail), 1 << tail_order);
+}
+
static const struct dev_pagemap_ops dmirror_devmem_ops = {
.page_free = dmirror_devmem_free,
.migrate_to_ram = dmirror_devmem_fault,
+ .folio_split = dmirror_devmem_folio_split,
};
static int dmirror_device_init(struct dmirror_device *mdevice, int id)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f29add796931..d55e36ae0c39 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3630,6 +3630,11 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
if (release == origin_folio)
continue;
+ if (folio_is_device_private(origin_folio) &&
+ origin_folio->pgmap->ops->folio_split)
+ origin_folio->pgmap->ops->folio_split(
+ origin_folio, release);
+
folio_ref_unfreeze(release, 1 +
((mapping || swap_cache) ?
folio_nr_pages(release) : 0));
@@ -3661,6 +3666,15 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
}
}
+ /*
+ * Mark the end of the split, so that the driver can setup origin_folio's
+ * order and other metadata
+ */
+ if (folio_is_device_private(origin_folio) &&
+ origin_folio->pgmap->ops->folio_split)
+ origin_folio->pgmap->ops->folio_split(
+ origin_folio, NULL);
+
/*
* Unfreeze origin_folio only after all page cache entries, which used
* to point to it, have been updated with new folios. Otherwise,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4f55f8ed65c7..0a538e9c24bd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -722,6 +722,7 @@ void prep_compound_page(struct page *page, unsigned int order)
prep_compound_head(page, order);
}
+EXPORT_SYMBOL_GPL(prep_compound_page);
static inline void set_buddy_order(struct page *page, unsigned int order)
{
--
2.49.0
* [v1 resend 08/12] mm/thp: add split during migration support
2025-07-03 23:34 [v1 resend 00/12] THP support for zone device page migration Balbir Singh
` (6 preceding siblings ...)
2025-07-03 23:35 ` [v1 resend 07/12] mm/memremap: add folio_split support Balbir Singh
@ 2025-07-03 23:35 ` Balbir Singh
2025-07-04 5:17 ` Mika Penttilä
2025-07-04 11:24 ` Zi Yan
2025-07-03 23:35 ` [v1 resend 09/12] lib/test_hmm: add test case for split pages Balbir Singh
` (6 subsequent siblings)
14 siblings, 2 replies; 99+ messages in thread
From: Balbir Singh @ 2025-07-03 23:35 UTC (permalink / raw)
To: linux-mm
Cc: akpm, linux-kernel, Balbir Singh, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
Support splitting pages during THP zone device migration as needed.
The common case is that, after setup, the destination cannot
allocate MIGRATE_PFN_COMPOUND pages during the migrate phase.
Add a new routine, migrate_vma_split_pages(), to support splitting
pages that are already isolated. The pages being migrated are already
unmapped and marked for migration during setup (via unmap).
folio_split() and __split_unmapped_folio() take an additional
'isolated' argument so that these pages are not unmapped and remapped
again and the folio is not unlocked/put.
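From a driver's point of view the contract is roughly the following
(sketch only, under the assumptions of this series;
my_alloc_device_thp()/my_alloc_device_page() are hypothetical
helpers): if the destination cannot provide a compound page, the
driver leaves MIGRATE_PFN_COMPOUND clear in the dst entries and the
source folio is split here via migrate_vma_split_pages().

	if (src[i] & MIGRATE_PFN_COMPOUND) {
		struct page *dpage = my_alloc_device_thp();	/* hypothetical */

		if (dpage) {
			dst[i] = migrate_pfn(page_to_pfn(dpage)) |
				 MIGRATE_PFN_COMPOUND;
		} else {
			unsigned long j;

			/*
			 * No large page on the destination side: provide one
			 * base page per entry and leave MIGRATE_PFN_COMPOUND
			 * clear; __migrate_device_pages() then splits the
			 * source folio via migrate_vma_split_pages().
			 */
			for (j = 0; j < HPAGE_PMD_NR; j++)
				dst[i + j] = migrate_pfn(page_to_pfn(
						my_alloc_device_page()));
		}
	}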
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
include/linux/huge_mm.h | 11 ++++++--
mm/huge_memory.c | 54 ++++++++++++++++++++-----------------
mm/migrate_device.c | 59 ++++++++++++++++++++++++++++++++---------
3 files changed, 85 insertions(+), 39 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 65a1bdf29bb9..5f55a754e57c 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -343,8 +343,8 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
vm_flags_t vm_flags);
bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
-int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
- unsigned int new_order);
+int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
+ unsigned int new_order, bool isolated);
int min_order_for_split(struct folio *folio);
int split_folio_to_list(struct folio *folio, struct list_head *list);
bool uniform_split_supported(struct folio *folio, unsigned int new_order,
@@ -353,6 +353,13 @@ bool non_uniform_split_supported(struct folio *folio, unsigned int new_order,
bool warns);
int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
struct list_head *list);
+
+static inline int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
+ unsigned int new_order)
+{
+ return __split_huge_page_to_list_to_order(page, list, new_order, false);
+}
+
/*
* try_folio_split - try to split a @folio at @page using non uniform split.
* @folio: folio to be split
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d55e36ae0c39..e00ddfed22fa 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3424,15 +3424,6 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
new_folio->mapping = folio->mapping;
new_folio->index = folio->index + i;
- /*
- * page->private should not be set in tail pages. Fix up and warn once
- * if private is unexpectedly set.
- */
- if (unlikely(new_folio->private)) {
- VM_WARN_ON_ONCE_PAGE(true, new_head);
- new_folio->private = NULL;
- }
-
if (folio_test_swapcache(folio))
new_folio->swap.val = folio->swap.val + i;
@@ -3519,7 +3510,7 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
struct page *split_at, struct page *lock_at,
struct list_head *list, pgoff_t end,
struct xa_state *xas, struct address_space *mapping,
- bool uniform_split)
+ bool uniform_split, bool isolated)
{
struct lruvec *lruvec;
struct address_space *swap_cache = NULL;
@@ -3643,8 +3634,9 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
percpu_ref_get_many(&release->pgmap->ref,
(1 << new_order) - 1);
- lru_add_split_folio(origin_folio, release, lruvec,
- list);
+ if (!isolated)
+ lru_add_split_folio(origin_folio, release,
+ lruvec, list);
/* Some pages can be beyond EOF: drop them from cache */
if (release->index >= end) {
@@ -3697,6 +3689,12 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
if (nr_dropped)
shmem_uncharge(mapping->host, nr_dropped);
+ /*
+ * Don't remap and unlock isolated folios
+ */
+ if (isolated)
+ return ret;
+
remap_page(origin_folio, 1 << order,
folio_test_anon(origin_folio) ?
RMP_USE_SHARED_ZEROPAGE : 0);
@@ -3790,6 +3788,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
* @lock_at: a page within @folio to be left locked to caller
* @list: after-split folios will be put on it if non NULL
* @uniform_split: perform uniform split or not (non-uniform split)
+ * @isolated: The pages are already unmapped
*
* It calls __split_unmapped_folio() to perform uniform and non-uniform split.
* It is in charge of checking whether the split is supported or not and
@@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
*/
static int __folio_split(struct folio *folio, unsigned int new_order,
struct page *split_at, struct page *lock_at,
- struct list_head *list, bool uniform_split)
+ struct list_head *list, bool uniform_split, bool isolated)
{
struct deferred_split *ds_queue = get_deferred_split_queue(folio);
XA_STATE(xas, &folio->mapping->i_pages, folio->index);
@@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
* is taken to serialise against parallel split or collapse
* operations.
*/
- anon_vma = folio_get_anon_vma(folio);
- if (!anon_vma) {
- ret = -EBUSY;
- goto out;
+ if (!isolated) {
+ anon_vma = folio_get_anon_vma(folio);
+ if (!anon_vma) {
+ ret = -EBUSY;
+ goto out;
+ }
+ anon_vma_lock_write(anon_vma);
}
end = -1;
mapping = NULL;
- anon_vma_lock_write(anon_vma);
} else {
unsigned int min_order;
gfp_t gfp;
@@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
goto out_unlock;
}
- unmap_folio(folio);
+ if (!isolated)
+ unmap_folio(folio);
/* block interrupt reentry in xa_lock and spinlock */
local_irq_disable();
@@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
ret = __split_unmapped_folio(folio, new_order,
split_at, lock_at, list, end, &xas, mapping,
- uniform_split);
+ uniform_split, isolated);
} else {
spin_unlock(&ds_queue->split_queue_lock);
fail:
if (mapping)
xas_unlock(&xas);
local_irq_enable();
- remap_page(folio, folio_nr_pages(folio), 0);
+ if (!isolated)
+ remap_page(folio, folio_nr_pages(folio), 0);
ret = -EAGAIN;
}
@@ -4046,12 +4049,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
* Returns -EINVAL when trying to split to an order that is incompatible
* with the folio. Splitting to order 0 is compatible with all folios.
*/
-int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
- unsigned int new_order)
+int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
+ unsigned int new_order, bool isolated)
{
struct folio *folio = page_folio(page);
- return __folio_split(folio, new_order, &folio->page, page, list, true);
+ return __folio_split(folio, new_order, &folio->page, page, list, true,
+ isolated);
}
/*
@@ -4080,7 +4084,7 @@ int folio_split(struct folio *folio, unsigned int new_order,
struct page *split_at, struct list_head *list)
{
return __folio_split(folio, new_order, split_at, &folio->page, list,
- false);
+ false, false);
}
int min_order_for_split(struct folio *folio)
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 41d0bd787969..acd2f03b178d 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -813,6 +813,24 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
src[i] &= ~MIGRATE_PFN_MIGRATE;
return 0;
}
+
+static void migrate_vma_split_pages(struct migrate_vma *migrate,
+ unsigned long idx, unsigned long addr,
+ struct folio *folio)
+{
+ unsigned long i;
+ unsigned long pfn;
+ unsigned long flags;
+
+ folio_get(folio);
+ split_huge_pmd_address(migrate->vma, addr, true);
+ __split_huge_page_to_list_to_order(folio_page(folio, 0), NULL, 0, true);
+ migrate->src[idx] &= ~MIGRATE_PFN_COMPOUND;
+ flags = migrate->src[idx] & ((1UL << MIGRATE_PFN_SHIFT) - 1);
+ pfn = migrate->src[idx] >> MIGRATE_PFN_SHIFT;
+ for (i = 1; i < HPAGE_PMD_NR; i++)
+ migrate->src[i+idx] = migrate_pfn(pfn + i) | flags;
+}
#else /* !CONFIG_ARCH_ENABLE_THP_MIGRATION */
static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
unsigned long addr,
@@ -822,6 +840,11 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
{
return 0;
}
+
+static void migrate_vma_split_pages(struct migrate_vma *migrate,
+ unsigned long idx, unsigned long addr,
+ struct folio *folio)
+{}
#endif
/*
@@ -971,8 +994,9 @@ static void __migrate_device_pages(unsigned long *src_pfns,
struct migrate_vma *migrate)
{
struct mmu_notifier_range range;
- unsigned long i;
+ unsigned long i, j;
bool notified = false;
+ unsigned long addr;
for (i = 0; i < npages; ) {
struct page *newpage = migrate_pfn_to_page(dst_pfns[i]);
@@ -1014,12 +1038,16 @@ static void __migrate_device_pages(unsigned long *src_pfns,
(!(dst_pfns[i] & MIGRATE_PFN_COMPOUND))) {
nr = HPAGE_PMD_NR;
src_pfns[i] &= ~MIGRATE_PFN_COMPOUND;
- src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
- goto next;
+ } else {
+ nr = 1;
}
- migrate_vma_insert_page(migrate, addr, &dst_pfns[i],
- &src_pfns[i]);
+ for (j = 0; j < nr && i + j < npages; j++) {
+ src_pfns[i+j] |= MIGRATE_PFN_MIGRATE;
+ migrate_vma_insert_page(migrate,
+ addr + j * PAGE_SIZE,
+ &dst_pfns[i+j], &src_pfns[i+j]);
+ }
goto next;
}
@@ -1041,7 +1069,9 @@ static void __migrate_device_pages(unsigned long *src_pfns,
MIGRATE_PFN_COMPOUND);
goto next;
}
- src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
+ nr = 1 << folio_order(folio);
+ addr = migrate->start + i * PAGE_SIZE;
+ migrate_vma_split_pages(migrate, i, addr, folio);
} else if ((src_pfns[i] & MIGRATE_PFN_MIGRATE) &&
(dst_pfns[i] & MIGRATE_PFN_COMPOUND) &&
!(src_pfns[i] & MIGRATE_PFN_COMPOUND)) {
@@ -1076,12 +1106,17 @@ static void __migrate_device_pages(unsigned long *src_pfns,
BUG_ON(folio_test_writeback(folio));
if (migrate && migrate->fault_page == page)
- extra_cnt = 1;
- r = folio_migrate_mapping(mapping, newfolio, folio, extra_cnt);
- if (r != MIGRATEPAGE_SUCCESS)
- src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
- else
- folio_migrate_flags(newfolio, folio);
+ extra_cnt++;
+ for (j = 0; j < nr && i + j < npages; j++) {
+ folio = page_folio(migrate_pfn_to_page(src_pfns[i+j]));
+ newfolio = page_folio(migrate_pfn_to_page(dst_pfns[i+j]));
+
+ r = folio_migrate_mapping(mapping, newfolio, folio, extra_cnt);
+ if (r != MIGRATEPAGE_SUCCESS)
+ src_pfns[i+j] &= ~MIGRATE_PFN_MIGRATE;
+ else
+ folio_migrate_flags(newfolio, folio);
+ }
next:
i += nr;
}
--
2.49.0
* [v1 resend 09/12] lib/test_hmm: add test case for split pages
2025-07-03 23:34 [v1 resend 00/12] THP support for zone device page migration Balbir Singh
` (7 preceding siblings ...)
2025-07-03 23:35 ` [v1 resend 08/12] mm/thp: add split during migration support Balbir Singh
@ 2025-07-03 23:35 ` Balbir Singh
2025-07-03 23:35 ` [v1 resend 10/12] selftests/mm/hmm-tests: new tests for zone device THP migration Balbir Singh
` (5 subsequent siblings)
14 siblings, 0 replies; 99+ messages in thread
From: Balbir Singh @ 2025-07-03 23:35 UTC (permalink / raw)
To: linux-mm
Cc: akpm, linux-kernel, Balbir Singh, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
Add a new flag, HMM_DMIRROR_FLAG_FAIL_ALLOC, to emulate failure to
allocate a large page. This exercises the code paths involved in
split migration.
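For context, arming the failure from user space is a one-liner via
the new ioctl (sketch, mirroring the hmm_dmirror_cmd() helper used by
the selftests; the flag value is carried in the npages field and the
driver clears it after the first failed allocation):

	struct hmm_dmirror_cmd cmd = { 0 };

	cmd.addr = (uint64_t)(uintptr_t)buffer->ptr;
	cmd.ptr = (uint64_t)(uintptr_t)buffer->mirror;
	cmd.npages = HMM_DMIRROR_FLAG_FAIL_ALLOC;	/* flag, not a page count */
	ret = ioctl(fd, HMM_DMIRROR_FLAGS, &cmd);
	/* the next large device page allocation in the driver fails once */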
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
lib/test_hmm.c | 61 ++++++++++++++++++++++++++++++---------------
lib/test_hmm_uapi.h | 3 +++
2 files changed, 44 insertions(+), 20 deletions(-)
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index e20021fb7c69..c322be89d54c 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -92,6 +92,7 @@ struct dmirror {
struct xarray pt;
struct mmu_interval_notifier notifier;
struct mutex mutex;
+ __u64 flags;
};
/*
@@ -699,7 +700,12 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
page_to_pfn(spage)))
goto next;
- dpage = dmirror_devmem_alloc_page(dmirror, is_large);
+ if (dmirror->flags & HMM_DMIRROR_FLAG_FAIL_ALLOC) {
+ dmirror->flags &= ~HMM_DMIRROR_FLAG_FAIL_ALLOC;
+ dpage = NULL;
+ } else
+ dpage = dmirror_devmem_alloc_page(dmirror, is_large);
+
if (!dpage) {
struct folio *folio;
unsigned long i;
@@ -954,44 +960,55 @@ static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma *args,
goto next;
spage = BACKING_PAGE(spage);
order = folio_order(page_folio(spage));
-
if (order)
+ *dst = MIGRATE_PFN_COMPOUND;
+ if (*src & MIGRATE_PFN_WRITE)
+ *dst |= MIGRATE_PFN_WRITE;
+
+ if (dmirror->flags & HMM_DMIRROR_FLAG_FAIL_ALLOC) {
+ dmirror->flags &= ~HMM_DMIRROR_FLAG_FAIL_ALLOC;
+ *dst &= ~MIGRATE_PFN_COMPOUND;
+ dpage = NULL;
+ } else if (order) {
dpage = folio_page(vma_alloc_folio(GFP_HIGHUSER_MOVABLE,
order, args->vma, addr), 0);
- else
- dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
-
- /* Try with smaller pages if large allocation fails */
- if (!dpage && order) {
+ } else {
dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
- if (!dpage)
- return VM_FAULT_OOM;
- order = 0;
}
+ if (!dpage && !order)
+ return VM_FAULT_OOM;
+
pr_debug("migrating from sys to dev pfn src: 0x%lx pfn dst: 0x%lx\n",
page_to_pfn(spage), page_to_pfn(dpage));
- lock_page(dpage);
- xa_erase(&dmirror->pt, addr >> PAGE_SHIFT);
- copy_highpage(dpage, spage);
- *dst = migrate_pfn(page_to_pfn(dpage));
- if (*src & MIGRATE_PFN_WRITE)
- *dst |= MIGRATE_PFN_WRITE;
- if (order)
- *dst |= MIGRATE_PFN_COMPOUND;
+
+ if (dpage) {
+ lock_page(dpage);
+ *dst |= migrate_pfn(page_to_pfn(dpage));
+ }
for (i = 0; i < (1 << order); i++) {
struct page *src_page;
struct page *dst_page;
+ /* Try with smaller pages if large allocation fails */
+ if (!dpage && order) {
+ dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
+ lock_page(dpage);
+ dst[i] = migrate_pfn(page_to_pfn(dpage));
+ dst_page = pfn_to_page(page_to_pfn(dpage));
+ dpage = NULL; /* For the next iteration */
+ } else {
+ dst_page = pfn_to_page(page_to_pfn(dpage) + i);
+ }
+
src_page = pfn_to_page(page_to_pfn(spage) + i);
- dst_page = pfn_to_page(page_to_pfn(dpage) + i);
xa_erase(&dmirror->pt, addr >> PAGE_SHIFT);
+ addr += PAGE_SIZE;
copy_highpage(dst_page, src_page);
}
next:
- addr += PAGE_SIZE << order;
src += 1 << order;
dst += 1 << order;
}
@@ -1509,6 +1526,10 @@ static long dmirror_fops_unlocked_ioctl(struct file *filp,
dmirror_device_remove_chunks(dmirror->mdevice);
ret = 0;
break;
+ case HMM_DMIRROR_FLAGS:
+ dmirror->flags = cmd.npages;
+ ret = 0;
+ break;
default:
return -EINVAL;
diff --git a/lib/test_hmm_uapi.h b/lib/test_hmm_uapi.h
index 8c818a2cf4f6..f94c6d457338 100644
--- a/lib/test_hmm_uapi.h
+++ b/lib/test_hmm_uapi.h
@@ -37,6 +37,9 @@ struct hmm_dmirror_cmd {
#define HMM_DMIRROR_EXCLUSIVE _IOWR('H', 0x05, struct hmm_dmirror_cmd)
#define HMM_DMIRROR_CHECK_EXCLUSIVE _IOWR('H', 0x06, struct hmm_dmirror_cmd)
#define HMM_DMIRROR_RELEASE _IOWR('H', 0x07, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_FLAGS _IOWR('H', 0x08, struct hmm_dmirror_cmd)
+
+#define HMM_DMIRROR_FLAG_FAIL_ALLOC (1ULL << 0)
/*
* Values returned in hmm_dmirror_cmd.ptr for HMM_DMIRROR_SNAPSHOT.
--
2.49.0
* [v1 resend 10/12] selftests/mm/hmm-tests: new tests for zone device THP migration
2025-07-03 23:34 [v1 resend 00/12] THP support for zone device page migration Balbir Singh
` (8 preceding siblings ...)
2025-07-03 23:35 ` [v1 resend 09/12] lib/test_hmm: add test case for split pages Balbir Singh
@ 2025-07-03 23:35 ` Balbir Singh
2025-07-03 23:35 ` [v1 resend 11/12] gpu/drm/nouveau: add THP migration support Balbir Singh
` (4 subsequent siblings)
14 siblings, 0 replies; 99+ messages in thread
From: Balbir Singh @ 2025-07-03 23:35 UTC (permalink / raw)
To: linux-mm
Cc: akpm, linux-kernel, Balbir Singh, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
Add new tests for migrating anon THP pages, including the anon_huge
and anon_huge_zero cases, as well as error cases that force pages to
be split during migration.
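All of the new tests share the same setup pattern, condensed here for
readability (error handling elided; ALIGN() is the selftest helper):
map twice the THP size so a naturally aligned 2MB window is
guaranteed to exist, then request huge pages on that window and run
the migration against it.

	size = TWOMEG;
	buffer->ptr = mmap(NULL, 2 * size, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	map = (void *)ALIGN((uintptr_t)buffer->ptr, size);	/* 2MB aligned */
	madvise(map, size, MADV_HUGEPAGE);
	old_ptr = buffer->ptr;
	buffer->ptr = map;	/* migrate and verify the aligned window */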
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
tools/testing/selftests/mm/hmm-tests.c | 410 +++++++++++++++++++++++++
1 file changed, 410 insertions(+)
diff --git a/tools/testing/selftests/mm/hmm-tests.c b/tools/testing/selftests/mm/hmm-tests.c
index 141bf63cbe05..da3322a1282c 100644
--- a/tools/testing/selftests/mm/hmm-tests.c
+++ b/tools/testing/selftests/mm/hmm-tests.c
@@ -2056,4 +2056,414 @@ TEST_F(hmm, hmm_cow_in_device)
hmm_buffer_free(buffer);
}
+
+/*
+ * Migrate private anonymous huge empty page.
+ */
+TEST_F(hmm, migrate_anon_huge_empty)
+{
+ struct hmm_buffer *buffer;
+ unsigned long npages;
+ unsigned long size;
+ unsigned long i;
+ void *old_ptr;
+ void *map;
+ int *ptr;
+ int ret;
+
+ size = TWOMEG;
+
+ buffer = malloc(sizeof(*buffer));
+ ASSERT_NE(buffer, NULL);
+
+ buffer->fd = -1;
+ buffer->size = 2 * size;
+ buffer->mirror = malloc(size);
+ ASSERT_NE(buffer->mirror, NULL);
+ memset(buffer->mirror, 0xFF, size);
+
+ buffer->ptr = mmap(NULL, 2 * size,
+ PROT_READ,
+ MAP_PRIVATE | MAP_ANONYMOUS,
+ buffer->fd, 0);
+ ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+ npages = size >> self->page_shift;
+ map = (void *)ALIGN((uintptr_t)buffer->ptr, size);
+ ret = madvise(map, size, MADV_HUGEPAGE);
+ ASSERT_EQ(ret, 0);
+ old_ptr = buffer->ptr;
+ buffer->ptr = map;
+
+ /* Migrate memory to device. */
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+
+ /* Check what the device read. */
+ for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i], 0);
+
+ buffer->ptr = old_ptr;
+ hmm_buffer_free(buffer);
+}
+
+/*
+ * Migrate private anonymous huge zero page.
+ */
+TEST_F(hmm, migrate_anon_huge_zero)
+{
+ struct hmm_buffer *buffer;
+ unsigned long npages;
+ unsigned long size;
+ unsigned long i;
+ void *old_ptr;
+ void *map;
+ int *ptr;
+ int ret;
+ int val;
+
+ size = TWOMEG;
+
+ buffer = malloc(sizeof(*buffer));
+ ASSERT_NE(buffer, NULL);
+
+ buffer->fd = -1;
+ buffer->size = 2 * size;
+ buffer->mirror = malloc(size);
+ ASSERT_NE(buffer->mirror, NULL);
+ memset(buffer->mirror, 0xFF, size);
+
+ buffer->ptr = mmap(NULL, 2 * size,
+ PROT_READ,
+ MAP_PRIVATE | MAP_ANONYMOUS,
+ buffer->fd, 0);
+ ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+ npages = size >> self->page_shift;
+ map = (void *)ALIGN((uintptr_t)buffer->ptr, size);
+ ret = madvise(map, size, MADV_HUGEPAGE);
+ ASSERT_EQ(ret, 0);
+ old_ptr = buffer->ptr;
+ buffer->ptr = map;
+
+ /* Initialize a read-only zero huge page. */
+ val = *(int *)buffer->ptr;
+ ASSERT_EQ(val, 0);
+
+ /* Migrate memory to device. */
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+
+ /* Check what the device read. */
+ for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i], 0);
+
+ /* Fault pages back to system memory and check them. */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) {
+ ASSERT_EQ(ptr[i], 0);
+ /* If it asserts once, it probably will 500,000 times */
+ if (ptr[i] != 0)
+ break;
+ }
+
+ buffer->ptr = old_ptr;
+ hmm_buffer_free(buffer);
+}
+
+/*
+ * Migrate private anonymous huge page and free.
+ */
+TEST_F(hmm, migrate_anon_huge_free)
+{
+ struct hmm_buffer *buffer;
+ unsigned long npages;
+ unsigned long size;
+ unsigned long i;
+ void *old_ptr;
+ void *map;
+ int *ptr;
+ int ret;
+
+ size = TWOMEG;
+
+ buffer = malloc(sizeof(*buffer));
+ ASSERT_NE(buffer, NULL);
+
+ buffer->fd = -1;
+ buffer->size = 2 * size;
+ buffer->mirror = malloc(size);
+ ASSERT_NE(buffer->mirror, NULL);
+ memset(buffer->mirror, 0xFF, size);
+
+ buffer->ptr = mmap(NULL, 2 * size,
+ PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS,
+ buffer->fd, 0);
+ ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+ npages = size >> self->page_shift;
+ map = (void *)ALIGN((uintptr_t)buffer->ptr, size);
+ ret = madvise(map, size, MADV_HUGEPAGE);
+ ASSERT_EQ(ret, 0);
+ old_ptr = buffer->ptr;
+ buffer->ptr = map;
+
+ /* Initialize buffer in system memory. */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ptr[i] = i;
+
+ /* Migrate memory to device. */
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+
+ /* Check what the device read. */
+ for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i], i);
+
+ /* Try freeing it. */
+ ret = madvise(map, size, MADV_FREE);
+ ASSERT_EQ(ret, 0);
+
+ buffer->ptr = old_ptr;
+ hmm_buffer_free(buffer);
+}
+
+/*
+ * Migrate private anonymous huge page and fault back to sysmem.
+ */
+TEST_F(hmm, migrate_anon_huge_fault)
+{
+ struct hmm_buffer *buffer;
+ unsigned long npages;
+ unsigned long size;
+ unsigned long i;
+ void *old_ptr;
+ void *map;
+ int *ptr;
+ int ret;
+
+ size = TWOMEG;
+
+ buffer = malloc(sizeof(*buffer));
+ ASSERT_NE(buffer, NULL);
+
+ buffer->fd = -1;
+ buffer->size = 2 * size;
+ buffer->mirror = malloc(size);
+ ASSERT_NE(buffer->mirror, NULL);
+ memset(buffer->mirror, 0xFF, size);
+
+ buffer->ptr = mmap(NULL, 2 * size,
+ PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS,
+ buffer->fd, 0);
+ ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+ npages = size >> self->page_shift;
+ map = (void *)ALIGN((uintptr_t)buffer->ptr, size);
+ ret = madvise(map, size, MADV_HUGEPAGE);
+ ASSERT_EQ(ret, 0);
+ old_ptr = buffer->ptr;
+ buffer->ptr = map;
+
+ /* Initialize buffer in system memory. */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ptr[i] = i;
+
+ /* Migrate memory to device. */
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+
+ /* Check what the device read. */
+ for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i], i);
+
+ /* Fault pages back to system memory and check them. */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i], i);
+
+ buffer->ptr = old_ptr;
+ hmm_buffer_free(buffer);
+}
+
+/*
+ * Migrate private anonymous huge page with allocation errors.
+ */
+TEST_F(hmm, migrate_anon_huge_err)
+{
+ struct hmm_buffer *buffer;
+ unsigned long npages;
+ unsigned long size;
+ unsigned long i;
+ void *old_ptr;
+ void *map;
+ int *ptr;
+ int ret;
+
+ size = TWOMEG;
+
+ buffer = malloc(sizeof(*buffer));
+ ASSERT_NE(buffer, NULL);
+
+ buffer->fd = -1;
+ buffer->size = 2 * size;
+ buffer->mirror = malloc(2 * size);
+ ASSERT_NE(buffer->mirror, NULL);
+ memset(buffer->mirror, 0xFF, 2 * size);
+
+ old_ptr = mmap(NULL, 2 * size, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS, buffer->fd, 0);
+ ASSERT_NE(old_ptr, MAP_FAILED);
+
+ npages = size >> self->page_shift;
+ map = (void *)ALIGN((uintptr_t)old_ptr, size);
+ ret = madvise(map, size, MADV_HUGEPAGE);
+ ASSERT_EQ(ret, 0);
+ buffer->ptr = map;
+
+ /* Initialize buffer in system memory. */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ptr[i] = i;
+
+ /* Migrate memory to device but force a THP allocation error. */
+ ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_FLAGS, buffer,
+ HMM_DMIRROR_FLAG_FAIL_ALLOC);
+ ASSERT_EQ(ret, 0);
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+
+ /* Check what the device read. */
+ for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i) {
+ ASSERT_EQ(ptr[i], i);
+ if (ptr[i] != i)
+ break;
+ }
+
+ /* Try faulting back a single (PAGE_SIZE) page. */
+ ptr = buffer->ptr;
+ ASSERT_EQ(ptr[2048], 2048);
+
+ /* unmap and remap the region to reset things. */
+ ret = munmap(old_ptr, 2 * size);
+ ASSERT_EQ(ret, 0);
+ old_ptr = mmap(NULL, 2 * size, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS, buffer->fd, 0);
+ ASSERT_NE(old_ptr, MAP_FAILED);
+ map = (void *)ALIGN((uintptr_t)old_ptr, size);
+ ret = madvise(map, size, MADV_HUGEPAGE);
+ ASSERT_EQ(ret, 0);
+ buffer->ptr = map;
+
+ /* Initialize buffer in system memory. */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ptr[i] = i;
+
+ /* Migrate THP to device. */
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+
+ /*
+ * Force an allocation error when faulting back a THP resident in the
+ * device.
+ */
+ ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_FLAGS, buffer,
+ HMM_DMIRROR_FLAG_FAIL_ALLOC);
+ ASSERT_EQ(ret, 0);
+
+ ret = hmm_migrate_dev_to_sys(self->fd, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ptr = buffer->ptr;
+ ASSERT_EQ(ptr[2048], 2048);
+
+ buffer->ptr = old_ptr;
+ hmm_buffer_free(buffer);
+}
+
+/*
+ * Migrate private anonymous huge zero page with allocation errors.
+ */
+TEST_F(hmm, migrate_anon_huge_zero_err)
+{
+ struct hmm_buffer *buffer;
+ unsigned long npages;
+ unsigned long size;
+ unsigned long i;
+ void *old_ptr;
+ void *map;
+ int *ptr;
+ int ret;
+
+ size = TWOMEG;
+
+ buffer = malloc(sizeof(*buffer));
+ ASSERT_NE(buffer, NULL);
+
+ buffer->fd = -1;
+ buffer->size = 2 * size;
+ buffer->mirror = malloc(2 * size);
+ ASSERT_NE(buffer->mirror, NULL);
+ memset(buffer->mirror, 0xFF, 2 * size);
+
+ old_ptr = mmap(NULL, 2 * size, PROT_READ,
+ MAP_PRIVATE | MAP_ANONYMOUS, buffer->fd, 0);
+ ASSERT_NE(old_ptr, MAP_FAILED);
+
+ npages = size >> self->page_shift;
+ map = (void *)ALIGN((uintptr_t)old_ptr, size);
+ ret = madvise(map, size, MADV_HUGEPAGE);
+ ASSERT_EQ(ret, 0);
+ buffer->ptr = map;
+
+ /* Migrate memory to device but force a THP allocation error. */
+ ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_FLAGS, buffer,
+ HMM_DMIRROR_FLAG_FAIL_ALLOC);
+ ASSERT_EQ(ret, 0);
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+
+ /* Check what the device read. */
+ for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i], 0);
+
+ /* Try faulting back a single (PAGE_SIZE) page. */
+ ptr = buffer->ptr;
+ ASSERT_EQ(ptr[2048], 0);
+
+ /* unmap and remap the region to reset things. */
+ ret = munmap(old_ptr, 2 * size);
+ ASSERT_EQ(ret, 0);
+ old_ptr = mmap(NULL, 2 * size, PROT_READ,
+ MAP_PRIVATE | MAP_ANONYMOUS, buffer->fd, 0);
+ ASSERT_NE(old_ptr, MAP_FAILED);
+ map = (void *)ALIGN((uintptr_t)old_ptr, size);
+ ret = madvise(map, size, MADV_HUGEPAGE);
+ ASSERT_EQ(ret, 0);
+ buffer->ptr = map;
+
+ /* Initialize buffer in system memory (zero THP page). */
+ ret = ptr[0];
+ ASSERT_EQ(ret, 0);
+
+ /* Migrate memory to device but force a THP allocation error. */
+ ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_FLAGS, buffer,
+ HMM_DMIRROR_FLAG_FAIL_ALLOC);
+ ASSERT_EQ(ret, 0);
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+
+ /* Fault the device memory back and check it. */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i], 0);
+
+ buffer->ptr = old_ptr;
+ hmm_buffer_free(buffer);
+}
TEST_HARNESS_MAIN
--
2.49.0
* [v1 resend 11/12] gpu/drm/nouveau: add THP migration support
2025-07-03 23:34 [v1 resend 00/12] THP support for zone device page migration Balbir Singh
` (9 preceding siblings ...)
2025-07-03 23:35 ` [v1 resend 10/12] selftests/mm/hmm-tests: new tests for zone device THP migration Balbir Singh
@ 2025-07-03 23:35 ` Balbir Singh
2025-07-03 23:35 ` [v1 resend 12/12] selftests/mm/hmm-tests: new throughput tests including THP Balbir Singh
` (3 subsequent siblings)
14 siblings, 0 replies; 99+ messages in thread
From: Balbir Singh @ 2025-07-03 23:35 UTC (permalink / raw)
To: linux-mm
Cc: akpm, linux-kernel, Balbir Singh, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
Change the code to add support for MIGRATE_VMA_SELECT_COMPOUND and to
handle page sizes appropriately in the migrate/evict code paths.
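The core of the change, condensed from the diff below for
orientation: DMA mappings and copies are now sized by the folio
rather than hard-coded to PAGE_SIZE, so a whole THP moves in a single
hardware copy.

	dma_info->size = page_size(dpage);	/* PAGE_SIZE or THP size */
	dma_info->dma_addr = dma_map_page(dev, dpage, 0, dma_info->size,
					  DMA_BIDIRECTIONAL);
	if (dma_mapping_error(dev, dma_info->dma_addr))
		return -EIO;
	/* one copy of folio_nr_pages(sfolio) pages instead of a per-page copy */
	drm->dmem->migrate.copy_func(drm, folio_nr_pages(sfolio),
				     NOUVEAU_APER_HOST, dma_info->dma_addr,
				     NOUVEAU_APER_VRAM,
				     nouveau_dmem_page_addr(spage));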
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
drivers/gpu/drm/nouveau/nouveau_dmem.c | 246 +++++++++++++++++--------
drivers/gpu/drm/nouveau/nouveau_svm.c | 6 +-
drivers/gpu/drm/nouveau/nouveau_svm.h | 3 +-
3 files changed, 178 insertions(+), 77 deletions(-)
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index ca4932a150e3..92b8877d8735 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -83,9 +83,15 @@ struct nouveau_dmem {
struct list_head chunks;
struct mutex mutex;
struct page *free_pages;
+ struct folio *free_folios;
spinlock_t lock;
};
+struct nouveau_dmem_dma_info {
+ dma_addr_t dma_addr;
+ size_t size;
+};
+
static struct nouveau_dmem_chunk *nouveau_page_to_chunk(struct page *page)
{
return container_of(page_pgmap(page), struct nouveau_dmem_chunk,
@@ -112,10 +118,16 @@ static void nouveau_dmem_page_free(struct page *page)
{
struct nouveau_dmem_chunk *chunk = nouveau_page_to_chunk(page);
struct nouveau_dmem *dmem = chunk->drm->dmem;
+ struct folio *folio = page_folio(page);
spin_lock(&dmem->lock);
- page->zone_device_data = dmem->free_pages;
- dmem->free_pages = page;
+ if (folio_order(folio)) {
+ folio_set_zone_device_data(folio, dmem->free_folios);
+ dmem->free_folios = folio;
+ } else {
+ page->zone_device_data = dmem->free_pages;
+ dmem->free_pages = page;
+ }
WARN_ON(!chunk->callocated);
chunk->callocated--;
@@ -139,20 +151,28 @@ static void nouveau_dmem_fence_done(struct nouveau_fence **fence)
}
}
-static int nouveau_dmem_copy_one(struct nouveau_drm *drm, struct page *spage,
- struct page *dpage, dma_addr_t *dma_addr)
+static int nouveau_dmem_copy_folio(struct nouveau_drm *drm,
+ struct folio *sfolio, struct folio *dfolio,
+ struct nouveau_dmem_dma_info *dma_info)
{
struct device *dev = drm->dev->dev;
+ struct page *dpage = folio_page(dfolio, 0);
+ struct page *spage = folio_page(sfolio, 0);
- lock_page(dpage);
+ folio_lock(dfolio);
- *dma_addr = dma_map_page(dev, dpage, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
- if (dma_mapping_error(dev, *dma_addr))
+ dma_info->dma_addr = dma_map_page(dev, dpage, 0, page_size(dpage),
+ DMA_BIDIRECTIONAL);
+ dma_info->size = page_size(dpage);
+ if (dma_mapping_error(dev, dma_info->dma_addr))
return -EIO;
- if (drm->dmem->migrate.copy_func(drm, 1, NOUVEAU_APER_HOST, *dma_addr,
- NOUVEAU_APER_VRAM, nouveau_dmem_page_addr(spage))) {
- dma_unmap_page(dev, *dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
+ if (drm->dmem->migrate.copy_func(drm, folio_nr_pages(sfolio),
+ NOUVEAU_APER_HOST, dma_info->dma_addr,
+ NOUVEAU_APER_VRAM,
+ nouveau_dmem_page_addr(spage))) {
+ dma_unmap_page(dev, dma_info->dma_addr, page_size(dpage),
+ DMA_BIDIRECTIONAL);
return -EIO;
}
@@ -165,21 +185,38 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf)
struct nouveau_dmem *dmem = drm->dmem;
struct nouveau_fence *fence;
struct nouveau_svmm *svmm;
- struct page *spage, *dpage;
- unsigned long src = 0, dst = 0;
- dma_addr_t dma_addr = 0;
+ struct page *dpage;
vm_fault_t ret = 0;
struct migrate_vma args = {
.vma = vmf->vma,
- .start = vmf->address,
- .end = vmf->address + PAGE_SIZE,
- .src = &src,
- .dst = &dst,
.pgmap_owner = drm->dev,
.fault_page = vmf->page,
- .flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
+ .flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE |
+ MIGRATE_VMA_SELECT_COMPOUND,
+ .src = NULL,
+ .dst = NULL,
};
-
+ unsigned int order, nr;
+ struct folio *sfolio, *dfolio;
+ struct nouveau_dmem_dma_info dma_info;
+
+ sfolio = page_folio(vmf->page);
+ order = folio_order(sfolio);
+ nr = 1 << order;
+
+ if (order)
+ args.flags |= MIGRATE_VMA_SELECT_COMPOUND;
+
+ args.start = ALIGN_DOWN(vmf->address, (1 << (PAGE_SHIFT + order)));
+ args.vma = vmf->vma;
+ args.end = args.start + (PAGE_SIZE << order);
+ args.src = kcalloc(nr, sizeof(*args.src), GFP_KERNEL);
+ args.dst = kcalloc(nr, sizeof(*args.dst), GFP_KERNEL);
+
+ if (!args.src || !args.dst) {
+ ret = VM_FAULT_OOM;
+ goto err;
+ }
/*
* FIXME what we really want is to find some heuristic to migrate more
* than just one page on CPU fault. When such fault happens it is very
@@ -190,20 +227,26 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf)
if (!args.cpages)
return 0;
- spage = migrate_pfn_to_page(src);
- if (!spage || !(src & MIGRATE_PFN_MIGRATE))
- goto done;
-
- dpage = alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vmf->vma, vmf->address);
- if (!dpage)
+ if (order)
+ dpage = folio_page(vma_alloc_folio(GFP_HIGHUSER | __GFP_ZERO,
+ order, vmf->vma, vmf->address), 0);
+ else
+ dpage = alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vmf->vma,
+ vmf->address);
+ if (!dpage) {
+ ret = VM_FAULT_OOM;
goto done;
+ }
- dst = migrate_pfn(page_to_pfn(dpage));
+ args.dst[0] = migrate_pfn(page_to_pfn(dpage));
+ if (order)
+ args.dst[0] |= MIGRATE_PFN_COMPOUND;
+ dfolio = page_folio(dpage);
- svmm = spage->zone_device_data;
+ svmm = folio_zone_device_data(sfolio);
mutex_lock(&svmm->mutex);
nouveau_svmm_invalidate(svmm, args.start, args.end);
- ret = nouveau_dmem_copy_one(drm, spage, dpage, &dma_addr);
+ ret = nouveau_dmem_copy_folio(drm, sfolio, dfolio, &dma_info);
mutex_unlock(&svmm->mutex);
if (ret) {
ret = VM_FAULT_SIGBUS;
@@ -213,19 +256,33 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf)
nouveau_fence_new(&fence, dmem->migrate.chan);
migrate_vma_pages(&args);
nouveau_dmem_fence_done(&fence);
- dma_unmap_page(drm->dev->dev, dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
+ dma_unmap_page(drm->dev->dev, dma_info.dma_addr, PAGE_SIZE,
+ DMA_BIDIRECTIONAL);
done:
migrate_vma_finalize(&args);
+err:
+ kfree(args.src);
+ kfree(args.dst);
return ret;
}
+static void nouveau_dmem_folio_split(struct folio *head, struct folio *tail)
+{
+ if (tail == NULL)
+ return;
+ tail->pgmap = head->pgmap;
+ folio_set_zone_device_data(tail, folio_zone_device_data(head));
+}
+
static const struct dev_pagemap_ops nouveau_dmem_pagemap_ops = {
.page_free = nouveau_dmem_page_free,
.migrate_to_ram = nouveau_dmem_migrate_to_ram,
+ .folio_split = nouveau_dmem_folio_split,
};
static int
-nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage)
+nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage,
+ bool is_large)
{
struct nouveau_dmem_chunk *chunk;
struct resource *res;
@@ -274,16 +331,21 @@ nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage)
pfn_first = chunk->pagemap.range.start >> PAGE_SHIFT;
page = pfn_to_page(pfn_first);
spin_lock(&drm->dmem->lock);
- for (i = 0; i < DMEM_CHUNK_NPAGES - 1; ++i, ++page) {
- page->zone_device_data = drm->dmem->free_pages;
- drm->dmem->free_pages = page;
+
+ if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) || !is_large) {
+ for (i = 0; i < DMEM_CHUNK_NPAGES - 1; ++i, ++page) {
+ page->zone_device_data = drm->dmem->free_pages;
+ drm->dmem->free_pages = page;
+ }
}
+
*ppage = page;
chunk->callocated++;
spin_unlock(&drm->dmem->lock);
- NV_INFO(drm, "DMEM: registered %ldMB of device memory\n",
- DMEM_CHUNK_SIZE >> 20);
+ NV_INFO(drm, "DMEM: registered %ldMB of %sdevice memory %lx %lx\n",
+ DMEM_CHUNK_SIZE >> 20, is_large ? "THP " : "", pfn_first,
+ nouveau_dmem_page_addr(page));
return 0;
@@ -298,27 +360,37 @@ nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage)
}
static struct page *
-nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
+nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm, bool is_large)
{
struct nouveau_dmem_chunk *chunk;
struct page *page = NULL;
+ struct folio *folio = NULL;
int ret;
+ unsigned int order = 0;
spin_lock(&drm->dmem->lock);
- if (drm->dmem->free_pages) {
+ if (is_large && drm->dmem->free_folios) {
+ folio = drm->dmem->free_folios;
+ drm->dmem->free_folios = folio_zone_device_data(folio);
+ chunk = nouveau_page_to_chunk(page);
+ chunk->callocated++;
+ spin_unlock(&drm->dmem->lock);
+ order = DMEM_CHUNK_NPAGES;
+ } else if (!is_large && drm->dmem->free_pages) {
page = drm->dmem->free_pages;
drm->dmem->free_pages = page->zone_device_data;
chunk = nouveau_page_to_chunk(page);
chunk->callocated++;
spin_unlock(&drm->dmem->lock);
+ folio = page_folio(page);
} else {
spin_unlock(&drm->dmem->lock);
- ret = nouveau_dmem_chunk_alloc(drm, &page);
+ ret = nouveau_dmem_chunk_alloc(drm, &page, is_large);
if (ret)
return NULL;
}
- zone_device_page_init(page);
+ init_zone_device_folio(folio, order);
return page;
}
@@ -369,12 +441,12 @@ nouveau_dmem_evict_chunk(struct nouveau_dmem_chunk *chunk)
{
unsigned long i, npages = range_len(&chunk->pagemap.range) >> PAGE_SHIFT;
unsigned long *src_pfns, *dst_pfns;
- dma_addr_t *dma_addrs;
+ struct nouveau_dmem_dma_info *dma_info;
struct nouveau_fence *fence;
src_pfns = kvcalloc(npages, sizeof(*src_pfns), GFP_KERNEL | __GFP_NOFAIL);
dst_pfns = kvcalloc(npages, sizeof(*dst_pfns), GFP_KERNEL | __GFP_NOFAIL);
- dma_addrs = kvcalloc(npages, sizeof(*dma_addrs), GFP_KERNEL | __GFP_NOFAIL);
+ dma_info = kvcalloc(npages, sizeof(*dma_info), GFP_KERNEL | __GFP_NOFAIL);
migrate_device_range(src_pfns, chunk->pagemap.range.start >> PAGE_SHIFT,
npages);
@@ -382,17 +454,28 @@ nouveau_dmem_evict_chunk(struct nouveau_dmem_chunk *chunk)
for (i = 0; i < npages; i++) {
if (src_pfns[i] & MIGRATE_PFN_MIGRATE) {
struct page *dpage;
+ struct folio *folio = page_folio(
+ migrate_pfn_to_page(src_pfns[i]));
+ unsigned int order = folio_order(folio);
+
+ if (src_pfns[i] & MIGRATE_PFN_COMPOUND) {
+ dpage = folio_page(
+ folio_alloc(
+ GFP_HIGHUSER_MOVABLE, order), 0);
+ } else {
+ /*
+ * _GFP_NOFAIL because the GPU is going away and there
+ * is nothing sensible we can do if we can't copy the
+ * data back.
+ */
+ dpage = alloc_page(GFP_HIGHUSER | __GFP_NOFAIL);
+ }
- /*
- * _GFP_NOFAIL because the GPU is going away and there
- * is nothing sensible we can do if we can't copy the
- * data back.
- */
- dpage = alloc_page(GFP_HIGHUSER | __GFP_NOFAIL);
dst_pfns[i] = migrate_pfn(page_to_pfn(dpage));
- nouveau_dmem_copy_one(chunk->drm,
- migrate_pfn_to_page(src_pfns[i]), dpage,
- &dma_addrs[i]);
+ nouveau_dmem_copy_folio(chunk->drm,
+ page_folio(migrate_pfn_to_page(src_pfns[i])),
+ page_folio(dpage),
+ &dma_info[i]);
}
}
@@ -403,8 +486,9 @@ nouveau_dmem_evict_chunk(struct nouveau_dmem_chunk *chunk)
kvfree(src_pfns);
kvfree(dst_pfns);
for (i = 0; i < npages; i++)
- dma_unmap_page(chunk->drm->dev->dev, dma_addrs[i], PAGE_SIZE, DMA_BIDIRECTIONAL);
- kvfree(dma_addrs);
+ dma_unmap_page(chunk->drm->dev->dev, dma_info[i].dma_addr,
+ dma_info[i].size, DMA_BIDIRECTIONAL);
+ kvfree(dma_info);
}
void
@@ -607,31 +691,35 @@ nouveau_dmem_init(struct nouveau_drm *drm)
static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
struct nouveau_svmm *svmm, unsigned long src,
- dma_addr_t *dma_addr, u64 *pfn)
+ struct nouveau_dmem_dma_info *dma_info, u64 *pfn)
{
struct device *dev = drm->dev->dev;
struct page *dpage, *spage;
unsigned long paddr;
+ bool is_large = false;
spage = migrate_pfn_to_page(src);
if (!(src & MIGRATE_PFN_MIGRATE))
goto out;
- dpage = nouveau_dmem_page_alloc_locked(drm);
+ is_large = src & MIGRATE_PFN_COMPOUND;
+ dpage = nouveau_dmem_page_alloc_locked(drm, is_large);
if (!dpage)
goto out;
paddr = nouveau_dmem_page_addr(dpage);
if (spage) {
- *dma_addr = dma_map_page(dev, spage, 0, page_size(spage),
+ dma_info->dma_addr = dma_map_page(dev, spage, 0, page_size(spage),
DMA_BIDIRECTIONAL);
- if (dma_mapping_error(dev, *dma_addr))
+ dma_info->size = page_size(spage);
+ if (dma_mapping_error(dev, dma_info->dma_addr))
goto out_free_page;
- if (drm->dmem->migrate.copy_func(drm, 1,
- NOUVEAU_APER_VRAM, paddr, NOUVEAU_APER_HOST, *dma_addr))
+ if (drm->dmem->migrate.copy_func(drm, folio_nr_pages(page_folio(spage)),
+ NOUVEAU_APER_VRAM, paddr, NOUVEAU_APER_HOST,
+ dma_info->dma_addr))
goto out_dma_unmap;
} else {
- *dma_addr = DMA_MAPPING_ERROR;
+ dma_info->dma_addr = DMA_MAPPING_ERROR;
if (drm->dmem->migrate.clear_func(drm, page_size(dpage),
NOUVEAU_APER_VRAM, paddr))
goto out_free_page;
@@ -645,7 +733,7 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
return migrate_pfn(page_to_pfn(dpage));
out_dma_unmap:
- dma_unmap_page(dev, *dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
+ dma_unmap_page(dev, dma_info->dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
out_free_page:
nouveau_dmem_page_free_locked(drm, dpage);
out:
@@ -655,27 +743,33 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
static void nouveau_dmem_migrate_chunk(struct nouveau_drm *drm,
struct nouveau_svmm *svmm, struct migrate_vma *args,
- dma_addr_t *dma_addrs, u64 *pfns)
+ struct nouveau_dmem_dma_info *dma_info, u64 *pfns)
{
struct nouveau_fence *fence;
unsigned long addr = args->start, nr_dma = 0, i;
+ unsigned long order = 0;
- for (i = 0; addr < args->end; i++) {
+ for (i = 0; addr < args->end; ) {
+ struct folio *folio;
+
+ folio = page_folio(migrate_pfn_to_page(args->dst[i]));
+ order = folio_order(folio);
args->dst[i] = nouveau_dmem_migrate_copy_one(drm, svmm,
- args->src[i], dma_addrs + nr_dma, pfns + i);
- if (!dma_mapping_error(drm->dev->dev, dma_addrs[nr_dma]))
+ args->src[i], dma_info + nr_dma, pfns + i);
+ if (!dma_mapping_error(drm->dev->dev, dma_info[nr_dma].dma_addr))
nr_dma++;
- addr += PAGE_SIZE;
+ i += 1 << order;
+ addr += (1 << order) * PAGE_SIZE;
}
nouveau_fence_new(&fence, drm->dmem->migrate.chan);
migrate_vma_pages(args);
nouveau_dmem_fence_done(&fence);
- nouveau_pfns_map(svmm, args->vma->vm_mm, args->start, pfns, i);
+ nouveau_pfns_map(svmm, args->vma->vm_mm, args->start, pfns, i, order);
while (nr_dma--) {
- dma_unmap_page(drm->dev->dev, dma_addrs[nr_dma], PAGE_SIZE,
- DMA_BIDIRECTIONAL);
+ dma_unmap_page(drm->dev->dev, dma_info[nr_dma].dma_addr,
+ dma_info[nr_dma].size, DMA_BIDIRECTIONAL);
}
migrate_vma_finalize(args);
}
@@ -689,20 +783,24 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm,
{
unsigned long npages = (end - start) >> PAGE_SHIFT;
unsigned long max = min(SG_MAX_SINGLE_ALLOC, npages);
- dma_addr_t *dma_addrs;
struct migrate_vma args = {
.vma = vma,
.start = start,
.pgmap_owner = drm->dev,
- .flags = MIGRATE_VMA_SELECT_SYSTEM,
+ .flags = MIGRATE_VMA_SELECT_SYSTEM
+ | MIGRATE_VMA_SELECT_COMPOUND,
};
unsigned long i;
u64 *pfns;
int ret = -ENOMEM;
+ struct nouveau_dmem_dma_info *dma_info;
if (drm->dmem == NULL)
return -ENODEV;
+ if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+ max = max(HPAGE_PMD_NR, max);
+
args.src = kcalloc(max, sizeof(*args.src), GFP_KERNEL);
if (!args.src)
goto out;
@@ -710,8 +808,8 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm,
if (!args.dst)
goto out_free_src;
- dma_addrs = kmalloc_array(max, sizeof(*dma_addrs), GFP_KERNEL);
- if (!dma_addrs)
+ dma_info = kmalloc_array(max, sizeof(*dma_info), GFP_KERNEL);
+ if (!dma_info)
goto out_free_dst;
pfns = nouveau_pfns_alloc(max);
@@ -729,7 +827,7 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm,
goto out_free_pfns;
if (args.cpages)
- nouveau_dmem_migrate_chunk(drm, svmm, &args, dma_addrs,
+ nouveau_dmem_migrate_chunk(drm, svmm, &args, dma_info,
pfns);
args.start = args.end;
}
@@ -738,7 +836,7 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm,
out_free_pfns:
nouveau_pfns_free(pfns);
out_free_dma:
- kfree(dma_addrs);
+ kfree(dma_info);
out_free_dst:
kfree(args.dst);
out_free_src:
diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
index 6fa387da0637..b8a3378154d5 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -921,12 +921,14 @@ nouveau_pfns_free(u64 *pfns)
void
nouveau_pfns_map(struct nouveau_svmm *svmm, struct mm_struct *mm,
- unsigned long addr, u64 *pfns, unsigned long npages)
+ unsigned long addr, u64 *pfns, unsigned long npages,
+ unsigned int page_shift)
{
struct nouveau_pfnmap_args *args = nouveau_pfns_to_args(pfns);
args->p.addr = addr;
- args->p.size = npages << PAGE_SHIFT;
+ args->p.size = npages << page_shift;
+ args->p.page = page_shift;
mutex_lock(&svmm->mutex);
diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.h b/drivers/gpu/drm/nouveau/nouveau_svm.h
index e7d63d7f0c2d..3fd78662f17e 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.h
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.h
@@ -33,7 +33,8 @@ void nouveau_svmm_invalidate(struct nouveau_svmm *svmm, u64 start, u64 limit);
u64 *nouveau_pfns_alloc(unsigned long npages);
void nouveau_pfns_free(u64 *pfns);
void nouveau_pfns_map(struct nouveau_svmm *svmm, struct mm_struct *mm,
- unsigned long addr, u64 *pfns, unsigned long npages);
+ unsigned long addr, u64 *pfns, unsigned long npages,
+ unsigned int page_shift);
#else /* IS_ENABLED(CONFIG_DRM_NOUVEAU_SVM) */
static inline void nouveau_svm_init(struct nouveau_drm *drm) {}
static inline void nouveau_svm_fini(struct nouveau_drm *drm) {}
--
2.49.0
* [v1 resend 12/12] selftests/mm/hmm-tests: new throughput tests including THP
2025-07-03 23:34 [v1 resend 00/12] THP support for zone device page migration Balbir Singh
` (10 preceding siblings ...)
2025-07-03 23:35 ` [v1 resend 11/12] gpu/drm/nouveau: add THP migration support Balbir Singh
@ 2025-07-03 23:35 ` Balbir Singh
2025-07-04 16:16 ` [v1 resend 00/12] THP support for zone device page migration Zi Yan
` (2 subsequent siblings)
14 siblings, 0 replies; 99+ messages in thread
From: Balbir Singh @ 2025-07-03 23:35 UTC (permalink / raw)
To: linux-mm
Cc: akpm, linux-kernel, Balbir Singh, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
Add new benchmark-style tests to measure transfer bandwidth for
zone device memory operations.
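The measurement itself is simple: time the migrate ioctls over a buffer
that has (or has not) been given a THP hint and convert the average to
GB/s. The snippet below is a stand-alone user-space sketch of that
madvise()-plus-timing pattern, included only for illustration; the real
tests time hmm_migrate_sys_to_dev()/hmm_migrate_dev_to_sys() against the
test device rather than the plain touch loop used here.
/* Stand-alone sketch of the timing pattern used by the benchmark. */
#include <stdio.h>
#include <sys/mman.h>
#include <sys/time.h>
static double now_ms(void)
{
	struct timeval tv;
	gettimeofday(&tv, NULL);
	return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}
int main(void)
{
	size_t size = 256UL << 20;	/* 256MB, the largest size the new test uses */
	char *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	double start, elapsed;
	size_t i;
	if (buf == MAP_FAILED)
		return 1;
	madvise(buf, size, MADV_HUGEPAGE);	/* request THP backing */
	start = now_ms();
	for (i = 0; i < size; i += 4096)
		buf[i] = 1;			/* stand-in for a timed migration */
	elapsed = now_ms() - start;
	printf("%.3f ms, %.2f GB/s\n", elapsed,
	       (size / (1024.0 * 1024.0 * 1024.0)) / (elapsed / 1000.0));
	munmap(buf, size);
	return 0;
}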
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
tools/testing/selftests/mm/hmm-tests.c | 197 ++++++++++++++++++++++++-
1 file changed, 196 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/mm/hmm-tests.c b/tools/testing/selftests/mm/hmm-tests.c
index da3322a1282c..1325de70f44f 100644
--- a/tools/testing/selftests/mm/hmm-tests.c
+++ b/tools/testing/selftests/mm/hmm-tests.c
@@ -25,6 +25,7 @@
#include <sys/stat.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
+#include <sys/time.h>
/*
@@ -207,8 +208,10 @@ static void hmm_buffer_free(struct hmm_buffer *buffer)
if (buffer == NULL)
return;
- if (buffer->ptr)
+ if (buffer->ptr) {
munmap(buffer->ptr, buffer->size);
+ buffer->ptr = NULL;
+ }
free(buffer->mirror);
free(buffer);
}
@@ -2466,4 +2469,196 @@ TEST_F(hmm, migrate_anon_huge_zero_err)
buffer->ptr = old_ptr;
hmm_buffer_free(buffer);
}
+
+struct benchmark_results {
+ double sys_to_dev_time;
+ double dev_to_sys_time;
+ double throughput_s2d;
+ double throughput_d2s;
+};
+
+static double get_time_ms(void)
+{
+ struct timeval tv;
+
+ gettimeofday(&tv, NULL);
+ return (tv.tv_sec * 1000.0) + (tv.tv_usec / 1000.0);
+}
+
+static inline struct hmm_buffer *hmm_buffer_alloc(unsigned long size)
+{
+ struct hmm_buffer *buffer;
+
+ buffer = malloc(sizeof(*buffer));
+
+ buffer->fd = -1;
+ buffer->size = size;
+ buffer->mirror = malloc(size);
+ memset(buffer->mirror, 0xFF, size);
+ return buffer;
+}
+
+static void print_benchmark_results(const char *test_name, size_t buffer_size,
+ struct benchmark_results *thp,
+ struct benchmark_results *regular)
+{
+ double s2d_improvement = ((regular->sys_to_dev_time - thp->sys_to_dev_time) /
+ regular->sys_to_dev_time) * 100.0;
+ double d2s_improvement = ((regular->dev_to_sys_time - thp->dev_to_sys_time) /
+ regular->dev_to_sys_time) * 100.0;
+ double throughput_s2d_improvement = ((thp->throughput_s2d - regular->throughput_s2d) /
+ regular->throughput_s2d) * 100.0;
+ double throughput_d2s_improvement = ((thp->throughput_d2s - regular->throughput_d2s) /
+ regular->throughput_d2s) * 100.0;
+
+ printf("\n=== %s (%.1f MB) ===\n", test_name, buffer_size / (1024.0 * 1024.0));
+ printf(" | With THP | Without THP | Improvement\n");
+ printf("---------------------------------------------------------------------\n");
+ printf("Sys->Dev Migration | %.3f ms | %.3f ms | %.1f%%\n",
+ thp->sys_to_dev_time, regular->sys_to_dev_time, s2d_improvement);
+ printf("Dev->Sys Migration | %.3f ms | %.3f ms | %.1f%%\n",
+ thp->dev_to_sys_time, regular->dev_to_sys_time, d2s_improvement);
+ printf("S->D Throughput | %.2f GB/s | %.2f GB/s | %.1f%%\n",
+ thp->throughput_s2d, regular->throughput_s2d, throughput_s2d_improvement);
+ printf("D->S Throughput | %.2f GB/s | %.2f GB/s | %.1f%%\n",
+ thp->throughput_d2s, regular->throughput_d2s, throughput_d2s_improvement);
+}
+
+/*
+ * Run a single migration benchmark
+ * fd: file descriptor for hmm device
+ * use_thp: whether to use THP
+ * buffer_size: size of buffer to allocate
+ * iterations: number of iterations
+ * results: where to store results
+ */
+static inline int run_migration_benchmark(int fd, int use_thp, size_t buffer_size,
+ int iterations, struct benchmark_results *results)
+{
+ struct hmm_buffer *buffer;
+ unsigned long npages = buffer_size / sysconf(_SC_PAGESIZE);
+ double start, end;
+ double s2d_total = 0, d2s_total = 0;
+ int ret, i;
+ int *ptr;
+
+ buffer = hmm_buffer_alloc(buffer_size);
+
+ /* Map memory */
+ buffer->ptr = mmap(NULL, buffer_size, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+
+ if (!buffer->ptr)
+ return -1;
+
+ /* Apply THP hint if requested */
+ if (use_thp)
+ ret = madvise(buffer->ptr, buffer_size, MADV_HUGEPAGE);
+ else
+ ret = madvise(buffer->ptr, buffer_size, MADV_NOHUGEPAGE);
+
+ if (ret)
+ return ret;
+
+ /* Initialize memory to make sure pages are allocated */
+ ptr = (int *)buffer->ptr;
+ for (i = 0; i < buffer_size / sizeof(int); i++)
+ ptr[i] = i & 0xFF;
+
+ /* Warmup iteration */
+ ret = hmm_migrate_sys_to_dev(fd, buffer, npages);
+ if (ret)
+ return ret;
+
+ ret = hmm_migrate_dev_to_sys(fd, buffer, npages);
+ if (ret)
+ return ret;
+
+ /* Benchmark iterations */
+ for (i = 0; i < iterations; i++) {
+ /* System to device migration */
+ start = get_time_ms();
+
+ ret = hmm_migrate_sys_to_dev(fd, buffer, npages);
+ if (ret)
+ return ret;
+
+ end = get_time_ms();
+ s2d_total += (end - start);
+
+ /* Device to system migration */
+ start = get_time_ms();
+
+ ret = hmm_migrate_dev_to_sys(fd, buffer, npages);
+ if (ret)
+ return ret;
+
+ end = get_time_ms();
+ d2s_total += (end - start);
+ }
+
+ /* Calculate average times and throughput */
+ results->sys_to_dev_time = s2d_total / iterations;
+ results->dev_to_sys_time = d2s_total / iterations;
+ results->throughput_s2d = (buffer_size / (1024.0 * 1024.0 * 1024.0)) /
+ (results->sys_to_dev_time / 1000.0);
+ results->throughput_d2s = (buffer_size / (1024.0 * 1024.0 * 1024.0)) /
+ (results->dev_to_sys_time / 1000.0);
+
+ /* Cleanup */
+ hmm_buffer_free(buffer);
+ return 0;
+}
+
+/*
+ * Benchmark THP migration with different buffer sizes
+ */
+TEST_F_TIMEOUT(hmm, benchmark_thp_migration, 120)
+{
+ struct benchmark_results thp_results, regular_results;
+ size_t thp_size = 2 * 1024 * 1024; /* 2MB - typical THP size */
+ int iterations = 5;
+
+ printf("\nHMM THP Migration Benchmark\n");
+ printf("---------------------------\n");
+ printf("System page size: %ld bytes\n", sysconf(_SC_PAGESIZE));
+
+ /* Test different buffer sizes */
+ size_t test_sizes[] = {
+ thp_size / 4, /* 512KB - smaller than THP */
+ thp_size / 2, /* 1MB - half THP */
+ thp_size, /* 2MB - single THP */
+ thp_size * 2, /* 4MB - two THPs */
+ thp_size * 4, /* 8MB - four THPs */
+ thp_size * 8, /* 16MB - eight THPs */
+ thp_size * 128, /* 256MB - one hundred twenty-eight THPs */
+ };
+
+ static const char *const test_names[] = {
+ "Small Buffer (512KB)",
+ "Half THP Size (1MB)",
+ "Single THP Size (2MB)",
+ "Two THP Size (4MB)",
+ "Four THP Size (8MB)",
+ "Eight THP Size (16MB)",
+ "One twenty eight THP Size (256MB)"
+ };
+
+ int num_tests = ARRAY_SIZE(test_sizes);
+
+ /* Run all tests */
+ for (int i = 0; i < num_tests; i++) {
+ /* Test with THP */
+ ASSERT_EQ(run_migration_benchmark(self->fd, 1, test_sizes[i],
+ iterations, &thp_results), 0);
+
+ /* Test without THP */
+ ASSERT_EQ(run_migration_benchmark(self->fd, 0, test_sizes[i],
+ iterations, &regular_results), 0);
+
+ /* Print results */
+ print_benchmark_results(test_names[i], test_sizes[i],
+ &thp_results, &regular_results);
+ }
+}
TEST_HARNESS_MAIN
--
2.49.0
* Re: [v1 resend 03/12] mm/thp: zone_device awareness in THP handling code
2025-07-03 23:35 ` [v1 resend 03/12] mm/thp: zone_device awareness in THP handling code Balbir Singh
@ 2025-07-04 4:46 ` Mika Penttilä
2025-07-06 1:21 ` Balbir Singh
2025-07-04 11:10 ` Mika Penttilä
` (3 subsequent siblings)
4 siblings, 1 reply; 99+ messages in thread
From: Mika Penttilä @ 2025-07-04 4:46 UTC (permalink / raw)
To: Balbir Singh, linux-mm
Cc: akpm, linux-kernel, Karol Herbst, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
David Hildenbrand, Barry Song, Baolin Wang, Ryan Roberts,
Matthew Wilcox, Peter Xu, Zi Yan, Kefeng Wang, Jane Chu,
Alistair Popple, Donet Tom
On 7/4/25 02:35, Balbir Singh wrote:
> Make THP handling code in the mm subsystem for THP pages
> aware of zone device pages. Although the code is
> designed to be generic when it comes to handling splitting
> of pages, the code is designed to work for THP page sizes
> corresponding to HPAGE_PMD_NR.
>
> Modify page_vma_mapped_walk() to return true when a zone
> device huge entry is present, enabling try_to_migrate()
> and other code migration paths to appropriately process the
> entry
>
> pmd_pfn() does not work well with zone device entries, use
> pfn_pmd_entry_to_swap() for checking and comparison as for
> zone device entries.
>
> try_to_map_to_unused_zeropage() does not apply to zone device
> entries, zone device entries are ignored in the call.
>
> Cc: Karol Herbst <kherbst@redhat.com>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
> Cc: Jane Chu <jane.chu@oracle.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Donet Tom <donettom@linux.ibm.com>
>
> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> ---
> mm/huge_memory.c | 153 +++++++++++++++++++++++++++++++------------
> mm/migrate.c | 2 +
> mm/page_vma_mapped.c | 10 +++
> mm/pgtable-generic.c | 6 ++
> mm/rmap.c | 19 +++++-
> 5 files changed, 146 insertions(+), 44 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index ce130225a8e5..e6e390d0308f 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1711,7 +1711,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> if (unlikely(is_swap_pmd(pmd))) {
> swp_entry_t entry = pmd_to_swp_entry(pmd);
>
> - VM_BUG_ON(!is_pmd_migration_entry(pmd));
> + VM_BUG_ON(!is_pmd_migration_entry(pmd) &&
> + !is_device_private_entry(entry));
> if (!is_readable_migration_entry(entry)) {
> entry = make_readable_migration_entry(
> swp_offset(entry));
> @@ -2222,10 +2223,17 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> } else if (thp_migration_supported()) {
> swp_entry_t entry;
>
> - VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
> entry = pmd_to_swp_entry(orig_pmd);
> folio = pfn_swap_entry_folio(entry);
> flush_needed = 0;
> +
> + VM_BUG_ON(!is_pmd_migration_entry(*pmd) &&
> + !folio_is_device_private(folio));
> +
> + if (folio_is_device_private(folio)) {
> + folio_remove_rmap_pmd(folio, folio_page(folio, 0), vma);
> + WARN_ON_ONCE(folio_mapcount(folio) < 0);
> + }
> } else
> WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
>
> @@ -2247,6 +2255,15 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> folio_mark_accessed(folio);
> }
>
> + /*
> + * Do a folio put on zone device private pages after
> + * changes to mm_counter, because the folio_put() will
> + * clean folio->mapping and the folio_test_anon() check
> + * will not be usable.
> + */
> + if (folio_is_device_private(folio))
> + folio_put(folio);
> +
> spin_unlock(ptl);
> if (flush_needed)
> tlb_remove_page_size(tlb, &folio->page, HPAGE_PMD_SIZE);
> @@ -2375,7 +2392,8 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> struct folio *folio = pfn_swap_entry_folio(entry);
> pmd_t newpmd;
>
> - VM_BUG_ON(!is_pmd_migration_entry(*pmd));
> + VM_BUG_ON(!is_pmd_migration_entry(*pmd) &&
> + !folio_is_device_private(folio));
> if (is_writable_migration_entry(entry)) {
> /*
> * A protection check is difficult so
> @@ -2388,9 +2406,11 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> newpmd = swp_entry_to_pmd(entry);
> if (pmd_swp_soft_dirty(*pmd))
> newpmd = pmd_swp_mksoft_dirty(newpmd);
> - } else {
> + } else if (is_writable_device_private_entry(entry)) {
> + newpmd = swp_entry_to_pmd(entry);
> + entry = make_device_exclusive_entry(swp_offset(entry));
> + } else
> newpmd = *pmd;
> - }
>
> if (uffd_wp)
> newpmd = pmd_swp_mkuffd_wp(newpmd);
> @@ -2842,16 +2862,20 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> struct page *page;
> pgtable_t pgtable;
> pmd_t old_pmd, _pmd;
> - bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
> - bool anon_exclusive = false, dirty = false;
> + bool young, write, soft_dirty, uffd_wp = false;
> + bool anon_exclusive = false, dirty = false, present = false;
> unsigned long addr;
> pte_t *pte;
> int i;
> + swp_entry_t swp_entry;
>
> VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
> VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
> VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma);
> - VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd));
> +
> + VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd)
> + && !(is_swap_pmd(*pmd) &&
> + is_device_private_entry(pmd_to_swp_entry(*pmd))));
>
> count_vm_event(THP_SPLIT_PMD);
>
> @@ -2899,20 +2923,25 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> return __split_huge_zero_page_pmd(vma, haddr, pmd);
> }
>
> - pmd_migration = is_pmd_migration_entry(*pmd);
> - if (unlikely(pmd_migration)) {
> - swp_entry_t entry;
>
> + present = pmd_present(*pmd);
> + if (unlikely(!present)) {
> + swp_entry = pmd_to_swp_entry(*pmd);
> old_pmd = *pmd;
> - entry = pmd_to_swp_entry(old_pmd);
> - page = pfn_swap_entry_to_page(entry);
> - write = is_writable_migration_entry(entry);
> +
> + folio = pfn_swap_entry_folio(swp_entry);
> + VM_BUG_ON(!is_migration_entry(swp_entry) &&
> + !is_device_private_entry(swp_entry));
> + page = pfn_swap_entry_to_page(swp_entry);
> + write = is_writable_migration_entry(swp_entry);
Shouldn't write also include is_writable_device_private_entry()?
> +
> if (PageAnon(page))
> - anon_exclusive = is_readable_exclusive_migration_entry(entry);
> - young = is_migration_entry_young(entry);
> - dirty = is_migration_entry_dirty(entry);
> + anon_exclusive =
> + is_readable_exclusive_migration_entry(swp_entry);
> soft_dirty = pmd_swp_soft_dirty(old_pmd);
> uffd_wp = pmd_swp_uffd_wp(old_pmd);
> + young = is_migration_entry_young(swp_entry);
> + dirty = is_migration_entry_dirty(swp_entry);
> } else {
> /*
> * Up to this point the pmd is present and huge and userland has
> @@ -2996,30 +3025,45 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> * Note that NUMA hinting access restrictions are not transferred to
> * avoid any possibility of altering permissions across VMAs.
> */
> - if (freeze || pmd_migration) {
> + if (freeze || !present) {
> for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
> pte_t entry;
> - swp_entry_t swp_entry;
> -
> - if (write)
> - swp_entry = make_writable_migration_entry(
> - page_to_pfn(page + i));
> - else if (anon_exclusive)
> - swp_entry = make_readable_exclusive_migration_entry(
> - page_to_pfn(page + i));
> - else
> - swp_entry = make_readable_migration_entry(
> - page_to_pfn(page + i));
> - if (young)
> - swp_entry = make_migration_entry_young(swp_entry);
> - if (dirty)
> - swp_entry = make_migration_entry_dirty(swp_entry);
> - entry = swp_entry_to_pte(swp_entry);
> - if (soft_dirty)
> - entry = pte_swp_mksoft_dirty(entry);
> - if (uffd_wp)
> - entry = pte_swp_mkuffd_wp(entry);
> -
> + if (freeze || is_migration_entry(swp_entry)) {
> + if (write)
> + swp_entry = make_writable_migration_entry(
> + page_to_pfn(page + i));
> + else if (anon_exclusive)
> + swp_entry = make_readable_exclusive_migration_entry(
> + page_to_pfn(page + i));
> + else
> + swp_entry = make_readable_migration_entry(
> + page_to_pfn(page + i));
> + if (young)
> + swp_entry = make_migration_entry_young(swp_entry);
> + if (dirty)
> + swp_entry = make_migration_entry_dirty(swp_entry);
> + entry = swp_entry_to_pte(swp_entry);
> + if (soft_dirty)
> + entry = pte_swp_mksoft_dirty(entry);
> + if (uffd_wp)
> + entry = pte_swp_mkuffd_wp(entry);
> + } else {
> + VM_BUG_ON(!is_device_private_entry(swp_entry));
> + if (write)
> + swp_entry = make_writable_device_private_entry(
> + page_to_pfn(page + i));
> + else if (anon_exclusive)
> + swp_entry = make_device_exclusive_entry(
> + page_to_pfn(page + i));
> + else
> + swp_entry = make_readable_device_private_entry(
> + page_to_pfn(page + i));
> + entry = swp_entry_to_pte(swp_entry);
> + if (soft_dirty)
> + entry = pte_swp_mksoft_dirty(entry);
> + if (uffd_wp)
> + entry = pte_swp_mkuffd_wp(entry);
> + }
> VM_WARN_ON(!pte_none(ptep_get(pte + i)));
> set_pte_at(mm, addr, pte + i, entry);
> }
> @@ -3046,7 +3090,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> }
> pte_unmap(pte);
>
> - if (!pmd_migration)
> + if (present)
> folio_remove_rmap_pmd(folio, page, vma);
> if (freeze)
> put_page(page);
> @@ -3058,8 +3102,11 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
> pmd_t *pmd, bool freeze)
> {
> +
> VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
> - if (pmd_trans_huge(*pmd) || is_pmd_migration_entry(*pmd))
> + if (pmd_trans_huge(*pmd) || is_pmd_migration_entry(*pmd) ||
> + (is_swap_pmd(*pmd) &&
> + is_device_private_entry(pmd_to_swp_entry(*pmd))))
> __split_huge_pmd_locked(vma, pmd, address, freeze);
> }
>
> @@ -3238,6 +3285,9 @@ static void lru_add_split_folio(struct folio *folio, struct folio *new_folio,
> VM_BUG_ON_FOLIO(folio_test_lru(new_folio), folio);
> lockdep_assert_held(&lruvec->lru_lock);
>
> + if (folio_is_device_private(folio))
> + return;
> +
> if (list) {
> /* page reclaim is reclaiming a huge page */
> VM_WARN_ON(folio_test_lru(folio));
> @@ -3252,6 +3302,7 @@ static void lru_add_split_folio(struct folio *folio, struct folio *new_folio,
> list_add_tail(&new_folio->lru, &folio->lru);
> folio_set_lru(new_folio);
> }
> +
> }
>
> /* Racy check whether the huge page can be split */
> @@ -3543,6 +3594,10 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
> ((mapping || swap_cache) ?
> folio_nr_pages(release) : 0));
>
> + if (folio_is_device_private(release))
> + percpu_ref_get_many(&release->pgmap->ref,
> + (1 << new_order) - 1);
> +
> lru_add_split_folio(origin_folio, release, lruvec,
> list);
>
> @@ -4596,7 +4651,10 @@ int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
> return 0;
>
> flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
> - pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
> + if (!folio_is_device_private(folio))
> + pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
> + else
> + pmdval = pmdp_huge_clear_flush(vma, address, pvmw->pmd);
>
> /* See folio_try_share_anon_rmap_pmd(): invalidate PMD first. */
> anon_exclusive = folio_test_anon(folio) && PageAnonExclusive(page);
> @@ -4646,6 +4704,17 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
> entry = pmd_to_swp_entry(*pvmw->pmd);
> folio_get(folio);
> pmde = folio_mk_pmd(folio, READ_ONCE(vma->vm_page_prot));
> +
> + if (unlikely(folio_is_device_private(folio))) {
> + if (pmd_write(pmde))
> + entry = make_writable_device_private_entry(
> + page_to_pfn(new));
> + else
> + entry = make_readable_device_private_entry(
> + page_to_pfn(new));
> + pmde = swp_entry_to_pmd(entry);
> + }
> +
> if (pmd_swp_soft_dirty(*pvmw->pmd))
> pmde = pmd_mksoft_dirty(pmde);
> if (is_writable_migration_entry(entry))
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 767f503f0875..0b6ecf559b22 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -200,6 +200,8 @@ static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
>
> if (PageCompound(page))
> return false;
> + if (folio_is_device_private(folio))
> + return false;
> VM_BUG_ON_PAGE(!PageAnon(page), page);
> VM_BUG_ON_PAGE(!PageLocked(page), page);
> VM_BUG_ON_PAGE(pte_present(ptep_get(pvmw->pte)), page);
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index e981a1a292d2..ff8254e52de5 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -277,6 +277,16 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> * cannot return prematurely, while zap_huge_pmd() has
> * cleared *pmd but not decremented compound_mapcount().
> */
> + swp_entry_t entry;
> +
> + if (!thp_migration_supported())
> + return not_found(pvmw);
> + entry = pmd_to_swp_entry(pmde);
> + if (is_device_private_entry(entry)) {
> + pvmw->ptl = pmd_lock(mm, pvmw->pmd);
> + return true;
> + }
> +
> if ((pvmw->flags & PVMW_SYNC) &&
> thp_vma_suitable_order(vma, pvmw->address,
> PMD_ORDER) &&
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index 567e2d084071..604e8206a2ec 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -292,6 +292,12 @@ pte_t *___pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
> *pmdvalp = pmdval;
> if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
> goto nomap;
> + if (is_swap_pmd(pmdval)) {
> + swp_entry_t entry = pmd_to_swp_entry(pmdval);
> +
> + if (is_device_private_entry(entry))
> + goto nomap;
> + }
> if (unlikely(pmd_trans_huge(pmdval)))
> goto nomap;
> if (unlikely(pmd_bad(pmdval))) {
> diff --git a/mm/rmap.c b/mm/rmap.c
> index bd83724d14b6..da1e5b03e1fe 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2336,8 +2336,23 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
> break;
> }
> #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> - subpage = folio_page(folio,
> - pmd_pfn(*pvmw.pmd) - folio_pfn(folio));
> + /*
> + * Zone device private folios do not work well with
> + * pmd_pfn() on some architectures due to pte
> + * inversion.
> + */
> + if (folio_is_device_private(folio)) {
> + swp_entry_t entry = pmd_to_swp_entry(*pvmw.pmd);
> + unsigned long pfn = swp_offset_pfn(entry);
> +
> + subpage = folio_page(folio, pfn
> + - folio_pfn(folio));
> + } else {
> + subpage = folio_page(folio,
> + pmd_pfn(*pvmw.pmd)
> + - folio_pfn(folio));
> + }
> +
> VM_BUG_ON_FOLIO(folio_test_hugetlb(folio) ||
> !folio_test_pmd_mappable(folio), folio);
>
* Re: [v1 resend 08/12] mm/thp: add split during migration support
2025-07-03 23:35 ` [v1 resend 08/12] mm/thp: add split during migration support Balbir Singh
@ 2025-07-04 5:17 ` Mika Penttilä
2025-07-04 6:43 ` Mika Penttilä
2025-07-04 11:24 ` Zi Yan
1 sibling, 1 reply; 99+ messages in thread
From: Mika Penttilä @ 2025-07-04 5:17 UTC (permalink / raw)
To: Balbir Singh, linux-mm
Cc: akpm, linux-kernel, Karol Herbst, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
David Hildenbrand, Barry Song, Baolin Wang, Ryan Roberts,
Matthew Wilcox, Peter Xu, Zi Yan, Kefeng Wang, Jane Chu,
Alistair Popple, Donet Tom
On 7/4/25 02:35, Balbir Singh wrote:
> Support splitting pages during THP zone device migration as needed.
> The common case that arises is that after setup, during migrate
> the destination might not be able to allocate MIGRATE_PFN_COMPOUND
> pages.
>
> Add a new routine migrate_vma_split_pages() to support the splitting
> of already isolated pages. The pages being migrated are already unmapped
> and marked for migration during setup (via unmap). folio_split() and
> __split_unmapped_folio() take additional isolated arguments, to avoid
> unmapping and remaping these pages and unlocking/putting the folio.
>
> Cc: Karol Herbst <kherbst@redhat.com>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
> Cc: Jane Chu <jane.chu@oracle.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Donet Tom <donettom@linux.ibm.com>
>
> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> ---
> include/linux/huge_mm.h | 11 ++++++--
> mm/huge_memory.c | 54 ++++++++++++++++++++-----------------
> mm/migrate_device.c | 59 ++++++++++++++++++++++++++++++++---------
> 3 files changed, 85 insertions(+), 39 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 65a1bdf29bb9..5f55a754e57c 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -343,8 +343,8 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
> vm_flags_t vm_flags);
>
> bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
> -int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> - unsigned int new_order);
> +int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> + unsigned int new_order, bool isolated);
> int min_order_for_split(struct folio *folio);
> int split_folio_to_list(struct folio *folio, struct list_head *list);
> bool uniform_split_supported(struct folio *folio, unsigned int new_order,
> @@ -353,6 +353,13 @@ bool non_uniform_split_supported(struct folio *folio, unsigned int new_order,
> bool warns);
> int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
> struct list_head *list);
> +
> +static inline int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> + unsigned int new_order)
> +{
> + return __split_huge_page_to_list_to_order(page, list, new_order, false);
> +}
> +
> /*
> * try_folio_split - try to split a @folio at @page using non uniform split.
> * @folio: folio to be split
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d55e36ae0c39..e00ddfed22fa 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3424,15 +3424,6 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
> new_folio->mapping = folio->mapping;
> new_folio->index = folio->index + i;
>
> - /*
> - * page->private should not be set in tail pages. Fix up and warn once
> - * if private is unexpectedly set.
> - */
> - if (unlikely(new_folio->private)) {
> - VM_WARN_ON_ONCE_PAGE(true, new_head);
> - new_folio->private = NULL;
> - }
> -
> if (folio_test_swapcache(folio))
> new_folio->swap.val = folio->swap.val + i;
>
> @@ -3519,7 +3510,7 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
> struct page *split_at, struct page *lock_at,
> struct list_head *list, pgoff_t end,
> struct xa_state *xas, struct address_space *mapping,
> - bool uniform_split)
> + bool uniform_split, bool isolated)
> {
> struct lruvec *lruvec;
> struct address_space *swap_cache = NULL;
> @@ -3643,8 +3634,9 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
> percpu_ref_get_many(&release->pgmap->ref,
> (1 << new_order) - 1);
>
> - lru_add_split_folio(origin_folio, release, lruvec,
> - list);
> + if (!isolated)
> + lru_add_split_folio(origin_folio, release,
> + lruvec, list);
>
> /* Some pages can be beyond EOF: drop them from cache */
> if (release->index >= end) {
> @@ -3697,6 +3689,12 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
> if (nr_dropped)
> shmem_uncharge(mapping->host, nr_dropped);
>
> + /*
> + * Don't remap and unlock isolated folios
> + */
> + if (isolated)
> + return ret;
> +
> remap_page(origin_folio, 1 << order,
> folio_test_anon(origin_folio) ?
> RMP_USE_SHARED_ZEROPAGE : 0);
> @@ -3790,6 +3788,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
> * @lock_at: a page within @folio to be left locked to caller
> * @list: after-split folios will be put on it if non NULL
> * @uniform_split: perform uniform split or not (non-uniform split)
> + * @isolated: The pages are already unmapped
> *
> * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
> * It is in charge of checking whether the split is supported or not and
> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
> */
> static int __folio_split(struct folio *folio, unsigned int new_order,
> struct page *split_at, struct page *lock_at,
> - struct list_head *list, bool uniform_split)
> + struct list_head *list, bool uniform_split, bool isolated)
> {
> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
> XA_STATE(xas, &folio->mapping->i_pages, folio->index);
> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> * is taken to serialise against parallel split or collapse
> * operations.
> */
> - anon_vma = folio_get_anon_vma(folio);
> - if (!anon_vma) {
> - ret = -EBUSY;
> - goto out;
> + if (!isolated) {
> + anon_vma = folio_get_anon_vma(folio);
> + if (!anon_vma) {
> + ret = -EBUSY;
> + goto out;
> + }
> + anon_vma_lock_write(anon_vma);
> }
> end = -1;
> mapping = NULL;
> - anon_vma_lock_write(anon_vma);
> } else {
> unsigned int min_order;
> gfp_t gfp;
> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> goto out_unlock;
> }
>
> - unmap_folio(folio);
> + if (!isolated)
> + unmap_folio(folio);
>
> /* block interrupt reentry in xa_lock and spinlock */
> local_irq_disable();
> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>
> ret = __split_unmapped_folio(folio, new_order,
> split_at, lock_at, list, end, &xas, mapping,
> - uniform_split);
> + uniform_split, isolated);
> } else {
> spin_unlock(&ds_queue->split_queue_lock);
> fail:
> if (mapping)
> xas_unlock(&xas);
> local_irq_enable();
> - remap_page(folio, folio_nr_pages(folio), 0);
> + if (!isolated)
> + remap_page(folio, folio_nr_pages(folio), 0);
> ret = -EAGAIN;
> }
>
> @@ -4046,12 +4049,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> * Returns -EINVAL when trying to split to an order that is incompatible
> * with the folio. Splitting to order 0 is compatible with all folios.
> */
> -int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> - unsigned int new_order)
> +int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> + unsigned int new_order, bool isolated)
> {
> struct folio *folio = page_folio(page);
>
> - return __folio_split(folio, new_order, &folio->page, page, list, true);
> + return __folio_split(folio, new_order, &folio->page, page, list, true,
> + isolated);
> }
>
> /*
> @@ -4080,7 +4084,7 @@ int folio_split(struct folio *folio, unsigned int new_order,
> struct page *split_at, struct list_head *list)
> {
> return __folio_split(folio, new_order, split_at, &folio->page, list,
> - false);
> + false, false);
> }
>
> int min_order_for_split(struct folio *folio)
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 41d0bd787969..acd2f03b178d 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -813,6 +813,24 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
> src[i] &= ~MIGRATE_PFN_MIGRATE;
> return 0;
> }
> +
> +static void migrate_vma_split_pages(struct migrate_vma *migrate,
> + unsigned long idx, unsigned long addr,
> + struct folio *folio)
> +{
> + unsigned long i;
> + unsigned long pfn;
> + unsigned long flags;
> +
> + folio_get(folio);
> + split_huge_pmd_address(migrate->vma, addr, true);
> + __split_huge_page_to_list_to_order(folio_page(folio, 0), NULL, 0, true);
We already have a reference to the folio, so why is folio_get() needed?
Splitting the page already splits the pmd for anon folios, so why is split_huge_pmd_address() called here?
> + migrate->src[idx] &= ~MIGRATE_PFN_COMPOUND;
> + flags = migrate->src[idx] & ((1UL << MIGRATE_PFN_SHIFT) - 1);
> + pfn = migrate->src[idx] >> MIGRATE_PFN_SHIFT;
> + for (i = 1; i < HPAGE_PMD_NR; i++)
> + migrate->src[i+idx] = migrate_pfn(pfn + i) | flags;
> +}
> #else /* !CONFIG_ARCH_ENABLE_THP_MIGRATION */
> static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
> unsigned long addr,
> @@ -822,6 +840,11 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
> {
> return 0;
> }
> +
> +static void migrate_vma_split_pages(struct migrate_vma *migrate,
> + unsigned long idx, unsigned long addr,
> + struct folio *folio)
> +{}
> #endif
>
> /*
> @@ -971,8 +994,9 @@ static void __migrate_device_pages(unsigned long *src_pfns,
> struct migrate_vma *migrate)
> {
> struct mmu_notifier_range range;
> - unsigned long i;
> + unsigned long i, j;
> bool notified = false;
> + unsigned long addr;
>
> for (i = 0; i < npages; ) {
> struct page *newpage = migrate_pfn_to_page(dst_pfns[i]);
> @@ -1014,12 +1038,16 @@ static void __migrate_device_pages(unsigned long *src_pfns,
> (!(dst_pfns[i] & MIGRATE_PFN_COMPOUND))) {
> nr = HPAGE_PMD_NR;
> src_pfns[i] &= ~MIGRATE_PFN_COMPOUND;
> - src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
> - goto next;
> + } else {
> + nr = 1;
> }
>
> - migrate_vma_insert_page(migrate, addr, &dst_pfns[i],
> - &src_pfns[i]);
> + for (j = 0; j < nr && i + j < npages; j++) {
> + src_pfns[i+j] |= MIGRATE_PFN_MIGRATE;
> + migrate_vma_insert_page(migrate,
> + addr + j * PAGE_SIZE,
> + &dst_pfns[i+j], &src_pfns[i+j]);
> + }
> goto next;
> }
>
> @@ -1041,7 +1069,9 @@ static void __migrate_device_pages(unsigned long *src_pfns,
> MIGRATE_PFN_COMPOUND);
> goto next;
> }
> - src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
> + nr = 1 << folio_order(folio);
> + addr = migrate->start + i * PAGE_SIZE;
> + migrate_vma_split_pages(migrate, i, addr, folio);
> } else if ((src_pfns[i] & MIGRATE_PFN_MIGRATE) &&
> (dst_pfns[i] & MIGRATE_PFN_COMPOUND) &&
> !(src_pfns[i] & MIGRATE_PFN_COMPOUND)) {
> @@ -1076,12 +1106,17 @@ static void __migrate_device_pages(unsigned long *src_pfns,
> BUG_ON(folio_test_writeback(folio));
>
> if (migrate && migrate->fault_page == page)
> - extra_cnt = 1;
> - r = folio_migrate_mapping(mapping, newfolio, folio, extra_cnt);
> - if (r != MIGRATEPAGE_SUCCESS)
> - src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
> - else
> - folio_migrate_flags(newfolio, folio);
> + extra_cnt++;
> + for (j = 0; j < nr && i + j < npages; j++) {
> + folio = page_folio(migrate_pfn_to_page(src_pfns[i+j]));
> + newfolio = page_folio(migrate_pfn_to_page(dst_pfns[i+j]));
> +
> + r = folio_migrate_mapping(mapping, newfolio, folio, extra_cnt);
> + if (r != MIGRATEPAGE_SUCCESS)
> + src_pfns[i+j] &= ~MIGRATE_PFN_MIGRATE;
> + else
> + folio_migrate_flags(newfolio, folio);
> + }
> next:
> i += nr;
> }
* Re: [v1 resend 08/12] mm/thp: add split during migration support
2025-07-04 5:17 ` Mika Penttilä
@ 2025-07-04 6:43 ` Mika Penttilä
2025-07-05 0:26 ` Balbir Singh
0 siblings, 1 reply; 99+ messages in thread
From: Mika Penttilä @ 2025-07-04 6:43 UTC (permalink / raw)
To: Balbir Singh, linux-mm
Cc: akpm, linux-kernel, Karol Herbst, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
David Hildenbrand, Barry Song, Baolin Wang, Ryan Roberts,
Matthew Wilcox, Peter Xu, Zi Yan, Kefeng Wang, Jane Chu,
Alistair Popple, Donet Tom
On 7/4/25 08:17, Mika Penttilä wrote:
> On 7/4/25 02:35, Balbir Singh wrote:
>> Support splitting pages during THP zone device migration as needed.
>> The common case that arises is that after setup, during migrate
>> the destination might not be able to allocate MIGRATE_PFN_COMPOUND
>> pages.
>>
>> Add a new routine migrate_vma_split_pages() to support the splitting
>> of already isolated pages. The pages being migrated are already unmapped
>> and marked for migration during setup (via unmap). folio_split() and
>> __split_unmapped_folio() take additional isolated arguments, to avoid
>> unmapping and remaping these pages and unlocking/putting the folio.
>>
>> Cc: Karol Herbst <kherbst@redhat.com>
>> Cc: Lyude Paul <lyude@redhat.com>
>> Cc: Danilo Krummrich <dakr@kernel.org>
>> Cc: David Airlie <airlied@gmail.com>
>> Cc: Simona Vetter <simona@ffwll.ch>
>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>> Cc: Shuah Khan <shuah@kernel.org>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Barry Song <baohua@kernel.org>
>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Cc: Peter Xu <peterx@redhat.com>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>> Cc: Jane Chu <jane.chu@oracle.com>
>> Cc: Alistair Popple <apopple@nvidia.com>
>> Cc: Donet Tom <donettom@linux.ibm.com>
>>
>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>> ---
>> include/linux/huge_mm.h | 11 ++++++--
>> mm/huge_memory.c | 54 ++++++++++++++++++++-----------------
>> mm/migrate_device.c | 59 ++++++++++++++++++++++++++++++++---------
>> 3 files changed, 85 insertions(+), 39 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 65a1bdf29bb9..5f55a754e57c 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -343,8 +343,8 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>> vm_flags_t vm_flags);
>>
>> bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>> -int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> - unsigned int new_order);
>> +int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> + unsigned int new_order, bool isolated);
>> int min_order_for_split(struct folio *folio);
>> int split_folio_to_list(struct folio *folio, struct list_head *list);
>> bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>> @@ -353,6 +353,13 @@ bool non_uniform_split_supported(struct folio *folio, unsigned int new_order,
>> bool warns);
>> int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
>> struct list_head *list);
>> +
>> +static inline int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> + unsigned int new_order)
>> +{
>> + return __split_huge_page_to_list_to_order(page, list, new_order, false);
>> +}
>> +
>> /*
>> * try_folio_split - try to split a @folio at @page using non uniform split.
>> * @folio: folio to be split
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index d55e36ae0c39..e00ddfed22fa 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -3424,15 +3424,6 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
>> new_folio->mapping = folio->mapping;
>> new_folio->index = folio->index + i;
>>
>> - /*
>> - * page->private should not be set in tail pages. Fix up and warn once
>> - * if private is unexpectedly set.
>> - */
>> - if (unlikely(new_folio->private)) {
>> - VM_WARN_ON_ONCE_PAGE(true, new_head);
>> - new_folio->private = NULL;
>> - }
>> -
>> if (folio_test_swapcache(folio))
>> new_folio->swap.val = folio->swap.val + i;
>>
>> @@ -3519,7 +3510,7 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
>> struct page *split_at, struct page *lock_at,
>> struct list_head *list, pgoff_t end,
>> struct xa_state *xas, struct address_space *mapping,
>> - bool uniform_split)
>> + bool uniform_split, bool isolated)
>> {
>> struct lruvec *lruvec;
>> struct address_space *swap_cache = NULL;
>> @@ -3643,8 +3634,9 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
>> percpu_ref_get_many(&release->pgmap->ref,
>> (1 << new_order) - 1);
>>
>> - lru_add_split_folio(origin_folio, release, lruvec,
>> - list);
>> + if (!isolated)
>> + lru_add_split_folio(origin_folio, release,
>> + lruvec, list);
>>
>> /* Some pages can be beyond EOF: drop them from cache */
>> if (release->index >= end) {
>> @@ -3697,6 +3689,12 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
>> if (nr_dropped)
>> shmem_uncharge(mapping->host, nr_dropped);
>>
>> + /*
>> + * Don't remap and unlock isolated folios
>> + */
>> + if (isolated)
>> + return ret;
>> +
>> remap_page(origin_folio, 1 << order,
>> folio_test_anon(origin_folio) ?
>> RMP_USE_SHARED_ZEROPAGE : 0);
>> @@ -3790,6 +3788,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>> * @lock_at: a page within @folio to be left locked to caller
>> * @list: after-split folios will be put on it if non NULL
>> * @uniform_split: perform uniform split or not (non-uniform split)
>> + * @isolated: The pages are already unmapped
>> *
>> * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>> * It is in charge of checking whether the split is supported or not and
>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>> */
>> static int __folio_split(struct folio *folio, unsigned int new_order,
>> struct page *split_at, struct page *lock_at,
>> - struct list_head *list, bool uniform_split)
>> + struct list_head *list, bool uniform_split, bool isolated)
>> {
>> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>> XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>> * is taken to serialise against parallel split or collapse
>> * operations.
>> */
>> - anon_vma = folio_get_anon_vma(folio);
>> - if (!anon_vma) {
>> - ret = -EBUSY;
>> - goto out;
>> + if (!isolated) {
>> + anon_vma = folio_get_anon_vma(folio);
>> + if (!anon_vma) {
>> + ret = -EBUSY;
>> + goto out;
>> + }
>> + anon_vma_lock_write(anon_vma);
>> }
>> end = -1;
>> mapping = NULL;
>> - anon_vma_lock_write(anon_vma);
>> } else {
>> unsigned int min_order;
>> gfp_t gfp;
>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>> goto out_unlock;
>> }
>>
>> - unmap_folio(folio);
>> + if (!isolated)
>> + unmap_folio(folio);
>>
>> /* block interrupt reentry in xa_lock and spinlock */
>> local_irq_disable();
>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>
>> ret = __split_unmapped_folio(folio, new_order,
>> split_at, lock_at, list, end, &xas, mapping,
>> - uniform_split);
>> + uniform_split, isolated);
>> } else {
>> spin_unlock(&ds_queue->split_queue_lock);
>> fail:
>> if (mapping)
>> xas_unlock(&xas);
>> local_irq_enable();
>> - remap_page(folio, folio_nr_pages(folio), 0);
>> + if (!isolated)
>> + remap_page(folio, folio_nr_pages(folio), 0);
>> ret = -EAGAIN;
>> }
>>
>> @@ -4046,12 +4049,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>> * Returns -EINVAL when trying to split to an order that is incompatible
>> * with the folio. Splitting to order 0 is compatible with all folios.
>> */
>> -int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> - unsigned int new_order)
>> +int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> + unsigned int new_order, bool isolated)
>> {
>> struct folio *folio = page_folio(page);
>>
>> - return __folio_split(folio, new_order, &folio->page, page, list, true);
>> + return __folio_split(folio, new_order, &folio->page, page, list, true,
>> + isolated);
>> }
>>
>> /*
>> @@ -4080,7 +4084,7 @@ int folio_split(struct folio *folio, unsigned int new_order,
>> struct page *split_at, struct list_head *list)
>> {
>> return __folio_split(folio, new_order, split_at, &folio->page, list,
>> - false);
>> + false, false);
>> }
>>
>> int min_order_for_split(struct folio *folio)
>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>> index 41d0bd787969..acd2f03b178d 100644
>> --- a/mm/migrate_device.c
>> +++ b/mm/migrate_device.c
>> @@ -813,6 +813,24 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
>> src[i] &= ~MIGRATE_PFN_MIGRATE;
>> return 0;
>> }
>> +
>> +static void migrate_vma_split_pages(struct migrate_vma *migrate,
>> + unsigned long idx, unsigned long addr,
>> + struct folio *folio)
>> +{
>> + unsigned long i;
>> + unsigned long pfn;
>> + unsigned long flags;
>> +
>> + folio_get(folio);
>> + split_huge_pmd_address(migrate->vma, addr, true);
>> + __split_huge_page_to_list_to_order(folio_page(folio, 0), NULL, 0, true);
> We already have a reference to the folio, why is folio_get() needed?
>
> Splitting the page splits the pmd for anon folios, so why is there a split_huge_pmd_address() call?
Oh, I see:
+ if (!isolated)
+ unmap_folio(folio);
which explains the explicit split_huge_pmd_address(migrate->vma, addr, true).
Still, why the folio_get(folio)?
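For reference, the freeze path in the 03/12 hunks does drop a page reference,
which may be what the extra folio_get() is balancing (a hedged reading, not
confirmed here):

	/* end of __split_huge_pmd_locked(), as modified in patch 03/12 */
	if (present)
		folio_remove_rmap_pmd(folio, page, vma);
	if (freeze)
		put_page(page);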
>
>> + migrate->src[idx] &= ~MIGRATE_PFN_COMPOUND;
>> + flags = migrate->src[idx] & ((1UL << MIGRATE_PFN_SHIFT) - 1);
>> + pfn = migrate->src[idx] >> MIGRATE_PFN_SHIFT;
>> + for (i = 1; i < HPAGE_PMD_NR; i++)
>> + migrate->src[i+idx] = migrate_pfn(pfn + i) | flags;
>> +}
>> #else /* !CONFIG_ARCH_ENABLE_THP_MIGRATION */
>> static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
>> unsigned long addr,
>> @@ -822,6 +840,11 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
>> {
>> return 0;
>> }
>> +
>> +static void migrate_vma_split_pages(struct migrate_vma *migrate,
>> + unsigned long idx, unsigned long addr,
>> + struct folio *folio)
>> +{}
>> #endif
>>
>> /*
>> @@ -971,8 +994,9 @@ static void __migrate_device_pages(unsigned long *src_pfns,
>> struct migrate_vma *migrate)
>> {
>> struct mmu_notifier_range range;
>> - unsigned long i;
>> + unsigned long i, j;
>> bool notified = false;
>> + unsigned long addr;
>>
>> for (i = 0; i < npages; ) {
>> struct page *newpage = migrate_pfn_to_page(dst_pfns[i]);
>> @@ -1014,12 +1038,16 @@ static void __migrate_device_pages(unsigned long *src_pfns,
>> (!(dst_pfns[i] & MIGRATE_PFN_COMPOUND))) {
>> nr = HPAGE_PMD_NR;
>> src_pfns[i] &= ~MIGRATE_PFN_COMPOUND;
>> - src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
>> - goto next;
>> + } else {
>> + nr = 1;
>> }
>>
>> - migrate_vma_insert_page(migrate, addr, &dst_pfns[i],
>> - &src_pfns[i]);
>> + for (j = 0; j < nr && i + j < npages; j++) {
>> + src_pfns[i+j] |= MIGRATE_PFN_MIGRATE;
>> + migrate_vma_insert_page(migrate,
>> + addr + j * PAGE_SIZE,
>> + &dst_pfns[i+j], &src_pfns[i+j]);
>> + }
>> goto next;
>> }
>>
>> @@ -1041,7 +1069,9 @@ static void __migrate_device_pages(unsigned long *src_pfns,
>> MIGRATE_PFN_COMPOUND);
>> goto next;
>> }
>> - src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
>> + nr = 1 << folio_order(folio);
>> + addr = migrate->start + i * PAGE_SIZE;
>> + migrate_vma_split_pages(migrate, i, addr, folio);
>> } else if ((src_pfns[i] & MIGRATE_PFN_MIGRATE) &&
>> (dst_pfns[i] & MIGRATE_PFN_COMPOUND) &&
>> !(src_pfns[i] & MIGRATE_PFN_COMPOUND)) {
>> @@ -1076,12 +1106,17 @@ static void __migrate_device_pages(unsigned long *src_pfns,
>> BUG_ON(folio_test_writeback(folio));
>>
>> if (migrate && migrate->fault_page == page)
>> - extra_cnt = 1;
>> - r = folio_migrate_mapping(mapping, newfolio, folio, extra_cnt);
>> - if (r != MIGRATEPAGE_SUCCESS)
>> - src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
>> - else
>> - folio_migrate_flags(newfolio, folio);
>> + extra_cnt++;
>> + for (j = 0; j < nr && i + j < npages; j++) {
>> + folio = page_folio(migrate_pfn_to_page(src_pfns[i+j]));
>> + newfolio = page_folio(migrate_pfn_to_page(dst_pfns[i+j]));
>> +
>> + r = folio_migrate_mapping(mapping, newfolio, folio, extra_cnt);
>> + if (r != MIGRATEPAGE_SUCCESS)
>> + src_pfns[i+j] &= ~MIGRATE_PFN_MIGRATE;
>> + else
>> + folio_migrate_flags(newfolio, folio);
>> + }
>> next:
>> i += nr;
>> }
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 03/12] mm/thp: zone_device awareness in THP handling code
2025-07-03 23:35 ` [v1 resend 03/12] mm/thp: zone_device awareness in THP handling code Balbir Singh
2025-07-04 4:46 ` Mika Penttilä
@ 2025-07-04 11:10 ` Mika Penttilä
2025-07-05 0:14 ` Balbir Singh
2025-07-07 3:49 ` Mika Penttilä
` (2 subsequent siblings)
4 siblings, 1 reply; 99+ messages in thread
From: Mika Penttilä @ 2025-07-04 11:10 UTC (permalink / raw)
To: Balbir Singh, linux-mm
Cc: akpm, linux-kernel, Karol Herbst, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
David Hildenbrand, Barry Song, Baolin Wang, Ryan Roberts,
Matthew Wilcox, Peter Xu, Zi Yan, Kefeng Wang, Jane Chu,
Alistair Popple, Donet Tom
On 7/4/25 02:35, Balbir Singh wrote:
> Make the THP handling code in the mm subsystem aware of zone
> device pages. Although the code is designed to be generic when
> it comes to handling the splitting of pages, it currently works
> only for THP page sizes corresponding to HPAGE_PMD_NR.
>
> Modify page_vma_mapped_walk() to return true when a zone
> device huge entry is present, enabling try_to_migrate() and
> other migration paths to process the entry appropriately.
>
> pmd_pfn() does not work well with zone device entries; use
> pfn_pmd_entry_to_swap() for checking and comparing zone device
> entries instead.
>
> try_to_map_unused_to_zeropage() does not apply to zone device
> entries; they are ignored in the call.
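For orientation, the hunks below repeatedly open-code the same test for a huge
device private entry; a minimal sketch of it as a helper (the name
pmd_is_device_private() is hypothetical, not part of the series):

static inline bool pmd_is_device_private(pmd_t pmd)
{
	return is_swap_pmd(pmd) &&
	       is_device_private_entry(pmd_to_swp_entry(pmd));
}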
>
> Cc: Karol Herbst <kherbst@redhat.com>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
> Cc: Jane Chu <jane.chu@oracle.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Donet Tom <donettom@linux.ibm.com>
>
> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> ---
> mm/huge_memory.c | 153 +++++++++++++++++++++++++++++++------------
> mm/migrate.c | 2 +
> mm/page_vma_mapped.c | 10 +++
> mm/pgtable-generic.c | 6 ++
> mm/rmap.c | 19 +++++-
> 5 files changed, 146 insertions(+), 44 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index ce130225a8e5..e6e390d0308f 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1711,7 +1711,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> if (unlikely(is_swap_pmd(pmd))) {
> swp_entry_t entry = pmd_to_swp_entry(pmd);
>
> - VM_BUG_ON(!is_pmd_migration_entry(pmd));
> + VM_BUG_ON(!is_pmd_migration_entry(pmd) &&
> + !is_device_private_entry(entry));
> if (!is_readable_migration_entry(entry)) {
> entry = make_readable_migration_entry(
> swp_offset(entry));
> @@ -2222,10 +2223,17 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> } else if (thp_migration_supported()) {
> swp_entry_t entry;
>
> - VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
> entry = pmd_to_swp_entry(orig_pmd);
> folio = pfn_swap_entry_folio(entry);
> flush_needed = 0;
> +
> + VM_BUG_ON(!is_pmd_migration_entry(*pmd) &&
> + !folio_is_device_private(folio));
> +
> + if (folio_is_device_private(folio)) {
> + folio_remove_rmap_pmd(folio, folio_page(folio, 0), vma);
> + WARN_ON_ONCE(folio_mapcount(folio) < 0);
> + }
> } else
> WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
>
> @@ -2247,6 +2255,15 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> folio_mark_accessed(folio);
> }
>
> + /*
> + * Do a folio put on zone device private pages after
> + * changes to mm_counter, because the folio_put() will
> + * clean folio->mapping and the folio_test_anon() check
> + * will not be usable.
> + */
> + if (folio_is_device_private(folio))
> + folio_put(folio);
> +
> spin_unlock(ptl);
> if (flush_needed)
> tlb_remove_page_size(tlb, &folio->page, HPAGE_PMD_SIZE);
> @@ -2375,7 +2392,8 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> struct folio *folio = pfn_swap_entry_folio(entry);
> pmd_t newpmd;
>
> - VM_BUG_ON(!is_pmd_migration_entry(*pmd));
> + VM_BUG_ON(!is_pmd_migration_entry(*pmd) &&
> + !folio_is_device_private(folio));
> if (is_writable_migration_entry(entry)) {
> /*
> * A protection check is difficult so
> @@ -2388,9 +2406,11 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> newpmd = swp_entry_to_pmd(entry);
> if (pmd_swp_soft_dirty(*pmd))
> newpmd = pmd_swp_mksoft_dirty(newpmd);
> - } else {
> + } else if (is_writable_device_private_entry(entry)) {
> + newpmd = swp_entry_to_pmd(entry);
> + entry = make_device_exclusive_entry(swp_offset(entry));
> + } else
> newpmd = *pmd;
> - }
>
> if (uffd_wp)
> newpmd = pmd_swp_mkuffd_wp(newpmd);
> @@ -2842,16 +2862,20 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> struct page *page;
> pgtable_t pgtable;
> pmd_t old_pmd, _pmd;
> - bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
> - bool anon_exclusive = false, dirty = false;
> + bool young, write, soft_dirty, uffd_wp = false;
> + bool anon_exclusive = false, dirty = false, present = false;
> unsigned long addr;
> pte_t *pte;
> int i;
> + swp_entry_t swp_entry;
>
> VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
> VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
> VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma);
> - VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd));
> +
> + VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd)
> + && !(is_swap_pmd(*pmd) &&
> + is_device_private_entry(pmd_to_swp_entry(*pmd))));
>
> count_vm_event(THP_SPLIT_PMD);
>
> @@ -2899,20 +2923,25 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> return __split_huge_zero_page_pmd(vma, haddr, pmd);
> }
>
> - pmd_migration = is_pmd_migration_entry(*pmd);
> - if (unlikely(pmd_migration)) {
> - swp_entry_t entry;
>
> + present = pmd_present(*pmd);
> + if (unlikely(!present)) {
> + swp_entry = pmd_to_swp_entry(*pmd);
> old_pmd = *pmd;
> - entry = pmd_to_swp_entry(old_pmd);
> - page = pfn_swap_entry_to_page(entry);
> - write = is_writable_migration_entry(entry);
> +
> + folio = pfn_swap_entry_folio(swp_entry);
> + VM_BUG_ON(!is_migration_entry(swp_entry) &&
> + !is_device_private_entry(swp_entry));
> + page = pfn_swap_entry_to_page(swp_entry);
> + write = is_writable_migration_entry(swp_entry);
> +
> if (PageAnon(page))
> - anon_exclusive = is_readable_exclusive_migration_entry(entry);
> - young = is_migration_entry_young(entry);
> - dirty = is_migration_entry_dirty(entry);
> + anon_exclusive =
> + is_readable_exclusive_migration_entry(swp_entry);
> soft_dirty = pmd_swp_soft_dirty(old_pmd);
> uffd_wp = pmd_swp_uffd_wp(old_pmd);
> + young = is_migration_entry_young(swp_entry);
> + dirty = is_migration_entry_dirty(swp_entry);
> } else {
> /*
> * Up to this point the pmd is present and huge and userland has
> @@ -2996,30 +3025,45 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> * Note that NUMA hinting access restrictions are not transferred to
> * avoid any possibility of altering permissions across VMAs.
> */
> - if (freeze || pmd_migration) {
> + if (freeze || !present) {
> for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
> pte_t entry;
> - swp_entry_t swp_entry;
> -
> - if (write)
> - swp_entry = make_writable_migration_entry(
> - page_to_pfn(page + i));
> - else if (anon_exclusive)
> - swp_entry = make_readable_exclusive_migration_entry(
> - page_to_pfn(page + i));
> - else
> - swp_entry = make_readable_migration_entry(
> - page_to_pfn(page + i));
> - if (young)
> - swp_entry = make_migration_entry_young(swp_entry);
> - if (dirty)
> - swp_entry = make_migration_entry_dirty(swp_entry);
> - entry = swp_entry_to_pte(swp_entry);
> - if (soft_dirty)
> - entry = pte_swp_mksoft_dirty(entry);
> - if (uffd_wp)
> - entry = pte_swp_mkuffd_wp(entry);
> -
> + if (freeze || is_migration_entry(swp_entry)) {
> + if (write)
> + swp_entry = make_writable_migration_entry(
> + page_to_pfn(page + i));
> + else if (anon_exclusive)
> + swp_entry = make_readable_exclusive_migration_entry(
> + page_to_pfn(page + i));
> + else
> + swp_entry = make_readable_migration_entry(
> + page_to_pfn(page + i));
> + if (young)
> + swp_entry = make_migration_entry_young(swp_entry);
> + if (dirty)
> + swp_entry = make_migration_entry_dirty(swp_entry);
> + entry = swp_entry_to_pte(swp_entry);
> + if (soft_dirty)
> + entry = pte_swp_mksoft_dirty(entry);
> + if (uffd_wp)
> + entry = pte_swp_mkuffd_wp(entry);
> + } else {
> + VM_BUG_ON(!is_device_private_entry(swp_entry));
> + if (write)
> + swp_entry = make_writable_device_private_entry(
> + page_to_pfn(page + i));
> + else if (anon_exclusive)
> + swp_entry = make_device_exclusive_entry(
> + page_to_pfn(page + i));
> + else
> + swp_entry = make_readable_device_private_entry(
> + page_to_pfn(page + i));
> + entry = swp_entry_to_pte(swp_entry);
> + if (soft_dirty)
> + entry = pte_swp_mksoft_dirty(entry);
> + if (uffd_wp)
> + entry = pte_swp_mkuffd_wp(entry);
> + }
> VM_WARN_ON(!pte_none(ptep_get(pte + i)));
> set_pte_at(mm, addr, pte + i, entry);
> }
> @@ -3046,7 +3090,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> }
> pte_unmap(pte);
>
> - if (!pmd_migration)
> + if (present)
> folio_remove_rmap_pmd(folio, page, vma);
> if (freeze)
> put_page(page);
> @@ -3058,8 +3102,11 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
> pmd_t *pmd, bool freeze)
> {
> +
> VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
> - if (pmd_trans_huge(*pmd) || is_pmd_migration_entry(*pmd))
> + if (pmd_trans_huge(*pmd) || is_pmd_migration_entry(*pmd) ||
> + (is_swap_pmd(*pmd) &&
> + is_device_private_entry(pmd_to_swp_entry(*pmd))))
> __split_huge_pmd_locked(vma, pmd, address, freeze);
> }
>
> @@ -3238,6 +3285,9 @@ static void lru_add_split_folio(struct folio *folio, struct folio *new_folio,
> VM_BUG_ON_FOLIO(folio_test_lru(new_folio), folio);
> lockdep_assert_held(&lruvec->lru_lock);
>
> + if (folio_is_device_private(folio))
> + return;
> +
> if (list) {
> /* page reclaim is reclaiming a huge page */
> VM_WARN_ON(folio_test_lru(folio));
> @@ -3252,6 +3302,7 @@ static void lru_add_split_folio(struct folio *folio, struct folio *new_folio,
> list_add_tail(&new_folio->lru, &folio->lru);
> folio_set_lru(new_folio);
> }
> +
> }
>
> /* Racy check whether the huge page can be split */
> @@ -3543,6 +3594,10 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
> ((mapping || swap_cache) ?
> folio_nr_pages(release) : 0));
>
> + if (folio_is_device_private(release))
> + percpu_ref_get_many(&release->pgmap->ref,
> + (1 << new_order) - 1);
The pgmap refcount should not be modified here; the count should remain the same after the split as well.
> +
> lru_add_split_folio(origin_folio, release, lruvec,
> list);
>
> @@ -4596,7 +4651,10 @@ int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
> return 0;
>
> flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
> - pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
> + if (!folio_is_device_private(folio))
> + pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
> + else
> + pmdval = pmdp_huge_clear_flush(vma, address, pvmw->pmd);
>
> /* See folio_try_share_anon_rmap_pmd(): invalidate PMD first. */
> anon_exclusive = folio_test_anon(folio) && PageAnonExclusive(page);
> @@ -4646,6 +4704,17 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
> entry = pmd_to_swp_entry(*pvmw->pmd);
> folio_get(folio);
> pmde = folio_mk_pmd(folio, READ_ONCE(vma->vm_page_prot));
> +
> + if (unlikely(folio_is_device_private(folio))) {
> + if (pmd_write(pmde))
> + entry = make_writable_device_private_entry(
> + page_to_pfn(new));
> + else
> + entry = make_readable_device_private_entry(
> + page_to_pfn(new));
> + pmde = swp_entry_to_pmd(entry);
> + }
> +
> if (pmd_swp_soft_dirty(*pvmw->pmd))
> pmde = pmd_mksoft_dirty(pmde);
> if (is_writable_migration_entry(entry))
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 767f503f0875..0b6ecf559b22 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -200,6 +200,8 @@ static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
>
> if (PageCompound(page))
> return false;
> + if (folio_is_device_private(folio))
> + return false;
> VM_BUG_ON_PAGE(!PageAnon(page), page);
> VM_BUG_ON_PAGE(!PageLocked(page), page);
> VM_BUG_ON_PAGE(pte_present(ptep_get(pvmw->pte)), page);
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index e981a1a292d2..ff8254e52de5 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -277,6 +277,16 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> * cannot return prematurely, while zap_huge_pmd() has
> * cleared *pmd but not decremented compound_mapcount().
> */
> + swp_entry_t entry;
> +
> + if (!thp_migration_supported())
> + return not_found(pvmw);
> + entry = pmd_to_swp_entry(pmde);
> + if (is_device_private_entry(entry)) {
> + pvmw->ptl = pmd_lock(mm, pvmw->pmd);
> + return true;
> + }
> +
> if ((pvmw->flags & PVMW_SYNC) &&
> thp_vma_suitable_order(vma, pvmw->address,
> PMD_ORDER) &&
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index 567e2d084071..604e8206a2ec 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -292,6 +292,12 @@ pte_t *___pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
> *pmdvalp = pmdval;
> if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
> goto nomap;
> + if (is_swap_pmd(pmdval)) {
> + swp_entry_t entry = pmd_to_swp_entry(pmdval);
> +
> + if (is_device_private_entry(entry))
> + goto nomap;
> + }
> if (unlikely(pmd_trans_huge(pmdval)))
> goto nomap;
> if (unlikely(pmd_bad(pmdval))) {
> diff --git a/mm/rmap.c b/mm/rmap.c
> index bd83724d14b6..da1e5b03e1fe 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2336,8 +2336,23 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
> break;
> }
> #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> - subpage = folio_page(folio,
> - pmd_pfn(*pvmw.pmd) - folio_pfn(folio));
> + /*
> + * Zone device private folios do not work well with
> + * pmd_pfn() on some architectures due to pte
> + * inversion.
> + */
> + if (folio_is_device_private(folio)) {
> + swp_entry_t entry = pmd_to_swp_entry(*pvmw.pmd);
> + unsigned long pfn = swp_offset_pfn(entry);
> +
> + subpage = folio_page(folio, pfn
> + - folio_pfn(folio));
> + } else {
> + subpage = folio_page(folio,
> + pmd_pfn(*pvmw.pmd)
> + - folio_pfn(folio));
> + }
> +
> VM_BUG_ON_FOLIO(folio_test_hugetlb(folio) ||
> !folio_test_pmd_mappable(folio), folio);
>
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 07/12] mm/memremap: add folio_split support
2025-07-03 23:35 ` [v1 resend 07/12] mm/memremap: add folio_split support Balbir Singh
@ 2025-07-04 11:14 ` Mika Penttilä
2025-07-06 1:24 ` Balbir Singh
0 siblings, 1 reply; 99+ messages in thread
From: Mika Penttilä @ 2025-07-04 11:14 UTC (permalink / raw)
To: Balbir Singh, linux-mm
Cc: akpm, linux-kernel, Karol Herbst, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
David Hildenbrand, Barry Song, Baolin Wang, Ryan Roberts,
Matthew Wilcox, Peter Xu, Zi Yan, Kefeng Wang, Jane Chu,
Alistair Popple, Donet Tom
On 7/4/25 02:35, Balbir Singh wrote:
> When a zone device page is split (via a huge pmd folio split), the
> driver callback for folio_split is invoked to let the device driver
> know that the folio has been split into a smaller order.
>
> The HMM test driver has been updated to handle the split. Since the
> test driver uses backing pages (which are used to create a mirror
> device), it requires a mechanism for reorganizing the backing pages
> into pages of the right order. This is supported by exporting
> prep_compound_page().
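A minimal sketch of how a driver wires up the new callback, mirroring the
dmirror_devmem_ops change below (the mydrv_* names are hypothetical and the
driver's existing page_free/migrate_to_ram handlers are assumed):

static void mydrv_folio_split(struct folio *head, struct folio *tail)
{
	/*
	 * tail == NULL marks the end of the split: restore the original
	 * folio's driver-side metadata. Otherwise, set up the backing
	 * state for the new tail folio.
	 */
}

static const struct dev_pagemap_ops mydrv_devmem_ops = {
	.page_free	= mydrv_page_free,
	.migrate_to_ram	= mydrv_migrate_to_ram,
	.folio_split	= mydrv_folio_split,
};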
>
> Cc: Karol Herbst <kherbst@redhat.com>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
> Cc: Jane Chu <jane.chu@oracle.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Donet Tom <donettom@linux.ibm.com>
>
> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> ---
> include/linux/memremap.h | 7 +++++++
> include/linux/mm.h | 1 +
> lib/test_hmm.c | 42 ++++++++++++++++++++++++++++++++++++++++
> mm/huge_memory.c | 14 ++++++++++++++
> mm/page_alloc.c | 1 +
> 5 files changed, 65 insertions(+)
>
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 11d586dd8ef1..2091b754f1da 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -100,6 +100,13 @@ struct dev_pagemap_ops {
> */
> int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long pfn,
> unsigned long nr_pages, int mf_flags);
> +
> + /*
> + * Used for private (un-addressable) device memory only.
> + * This callback is used when a folio is split into
> + * a smaller folio
> + */
> + void (*folio_split)(struct folio *head, struct folio *tail);
> };
>
> #define PGMAP_ALTMAP_VALID (1 << 0)
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ef40f68c1183..f7bda8b1e46c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1183,6 +1183,7 @@ static inline struct folio *virt_to_folio(const void *x)
> void __folio_put(struct folio *folio);
>
> void split_page(struct page *page, unsigned int order);
> +void prep_compound_page(struct page *page, unsigned int order);
> void folio_copy(struct folio *dst, struct folio *src);
> int folio_mc_copy(struct folio *dst, struct folio *src);
>
> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
> index 95b4276a17fd..e20021fb7c69 100644
> --- a/lib/test_hmm.c
> +++ b/lib/test_hmm.c
> @@ -1646,9 +1646,51 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
> return ret;
> }
>
> +static void dmirror_devmem_folio_split(struct folio *head, struct folio *tail)
> +{
> + struct page *rpage = BACKING_PAGE(folio_page(head, 0));
> + struct page *rpage_tail;
> + struct folio *rfolio;
> + unsigned long offset = 0;
> + unsigned int tail_order;
> + unsigned int head_order = folio_order(head);
> +
> + if (!rpage) {
> + tail->page.zone_device_data = NULL;
> + return;
> + }
> +
> + rfolio = page_folio(rpage);
> +
> + if (tail == NULL) {
> + folio_reset_order(rfolio);
> + rfolio->mapping = NULL;
> + if (head_order)
> + prep_compound_page(rpage, head_order);
> + folio_set_count(rfolio, 1 << head_order);
> + return;
> + }
> +
> + offset = folio_pfn(tail) - folio_pfn(head);
> +
> + rpage_tail = folio_page(rfolio, offset);
> + tail->page.zone_device_data = rpage_tail;
> + clear_compound_head(rpage_tail);
> + rpage_tail->mapping = NULL;
> +
> + tail_order = folio_order(tail);
> + if (tail_order)
> + prep_compound_page(rpage_tail, tail_order);
> +
> + folio_page(tail, 0)->mapping = folio_page(head, 0)->mapping;
> + tail->pgmap = head->pgmap;
> + folio_set_count(page_folio(rpage_tail), 1 << tail_order);
> +}
> +
> static const struct dev_pagemap_ops dmirror_devmem_ops = {
> .page_free = dmirror_devmem_free,
> .migrate_to_ram = dmirror_devmem_fault,
> + .folio_split = dmirror_devmem_folio_split,
> };
>
> static int dmirror_device_init(struct dmirror_device *mdevice, int id)
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index f29add796931..d55e36ae0c39 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3630,6 +3630,11 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
> if (release == origin_folio)
> continue;
>
> + if (folio_is_device_private(origin_folio) &&
> + origin_folio->pgmap->ops->folio_split)
> + origin_folio->pgmap->ops->folio_split(
> + origin_folio, release);
Should the folio split fail if pgmap->ops->folio_split() is not defined? If not, then at least the ->pgmap pointer copy should be in the common code.
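One possible reading of that suggestion (illustrative only, not a tested change):

	if (folio_is_device_private(origin_folio)) {
		release->pgmap = origin_folio->pgmap;
		if (origin_folio->pgmap->ops->folio_split)
			origin_folio->pgmap->ops->folio_split(origin_folio,
							      release);
	}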
> +
> folio_ref_unfreeze(release, 1 +
> ((mapping || swap_cache) ?
> folio_nr_pages(release) : 0));
> @@ -3661,6 +3666,15 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
> }
> }
>
> + /*
> + * Mark the end of the split, so that the driver can setup origin_folio's
> + * order and other metadata
> + */
> + if (folio_is_device_private(origin_folio) &&
> + origin_folio->pgmap->ops->folio_split)
> + origin_folio->pgmap->ops->folio_split(
> + origin_folio, NULL);
> +
> /*
> * Unfreeze origin_folio only after all page cache entries, which used
> * to point to it, have been updated with new folios. Otherwise,
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4f55f8ed65c7..0a538e9c24bd 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -722,6 +722,7 @@ void prep_compound_page(struct page *page, unsigned int order)
>
> prep_compound_head(page, order);
> }
> +EXPORT_SYMBOL_GPL(prep_compound_page);
>
> static inline void set_buddy_order(struct page *page, unsigned int order)
> {
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 08/12] mm/thp: add split during migration support
2025-07-03 23:35 ` [v1 resend 08/12] mm/thp: add split during migration support Balbir Singh
2025-07-04 5:17 ` Mika Penttilä
@ 2025-07-04 11:24 ` Zi Yan
2025-07-05 0:58 ` Balbir Singh
1 sibling, 1 reply; 99+ messages in thread
From: Zi Yan @ 2025-07-04 11:24 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-mm, akpm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
On 3 Jul 2025, at 19:35, Balbir Singh wrote:
> Support splitting pages during THP zone device migration as needed.
> The common case that arises is that, after setup, the destination
> might not be able to allocate MIGRATE_PFN_COMPOUND pages during
> migration.
>
> Add a new routine migrate_vma_split_pages() to support the splitting
> of already isolated pages. The pages being migrated are already unmapped
> and marked for migration during setup (via unmap). folio_split() and
> __split_unmapped_folio() take an additional 'isolated' argument, to avoid
> unmapping and remapping these pages and unlocking/putting the folio.
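From the driver side, the fallback described above looks roughly like this
(illustrative only; dst_huge_page and dst_page are hypothetical allocations,
migrate_pfn() and MIGRATE_PFN_COMPOUND are used as in this series):

	if (dst_huge_page)
		migrate->dst[i] = migrate_pfn(page_to_pfn(dst_huge_page)) |
				  MIGRATE_PFN_COMPOUND;
	else
		/* no large page available: core code splits the source THP */
		migrate->dst[i] = migrate_pfn(page_to_pfn(dst_page));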
>
> Cc: Karol Herbst <kherbst@redhat.com>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
> Cc: Jane Chu <jane.chu@oracle.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Donet Tom <donettom@linux.ibm.com>
>
> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> ---
> include/linux/huge_mm.h | 11 ++++++--
> mm/huge_memory.c | 54 ++++++++++++++++++++-----------------
> mm/migrate_device.c | 59 ++++++++++++++++++++++++++++++++---------
> 3 files changed, 85 insertions(+), 39 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 65a1bdf29bb9..5f55a754e57c 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -343,8 +343,8 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
> vm_flags_t vm_flags);
>
> bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
> -int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> - unsigned int new_order);
> +int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> + unsigned int new_order, bool isolated);
> int min_order_for_split(struct folio *folio);
> int split_folio_to_list(struct folio *folio, struct list_head *list);
> bool uniform_split_supported(struct folio *folio, unsigned int new_order,
> @@ -353,6 +353,13 @@ bool non_uniform_split_supported(struct folio *folio, unsigned int new_order,
> bool warns);
> int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
> struct list_head *list);
> +
> +static inline int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> + unsigned int new_order)
> +{
> + return __split_huge_page_to_list_to_order(page, list, new_order, false);
> +}
> +
> /*
> * try_folio_split - try to split a @folio at @page using non uniform split.
> * @folio: folio to be split
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d55e36ae0c39..e00ddfed22fa 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3424,15 +3424,6 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
> new_folio->mapping = folio->mapping;
> new_folio->index = folio->index + i;
>
> - /*
> - * page->private should not be set in tail pages. Fix up and warn once
> - * if private is unexpectedly set.
> - */
> - if (unlikely(new_folio->private)) {
> - VM_WARN_ON_ONCE_PAGE(true, new_head);
> - new_folio->private = NULL;
> - }
> -
> if (folio_test_swapcache(folio))
> new_folio->swap.val = folio->swap.val + i;
>
> @@ -3519,7 +3510,7 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
> struct page *split_at, struct page *lock_at,
> struct list_head *list, pgoff_t end,
> struct xa_state *xas, struct address_space *mapping,
> - bool uniform_split)
> + bool uniform_split, bool isolated)
> {
> struct lruvec *lruvec;
> struct address_space *swap_cache = NULL;
> @@ -3643,8 +3634,9 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
> percpu_ref_get_many(&release->pgmap->ref,
> (1 << new_order) - 1);
>
> - lru_add_split_folio(origin_folio, release, lruvec,
> - list);
> + if (!isolated)
> + lru_add_split_folio(origin_folio, release,
> + lruvec, list);
>
> /* Some pages can be beyond EOF: drop them from cache */
> if (release->index >= end) {
> @@ -3697,6 +3689,12 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
> if (nr_dropped)
> shmem_uncharge(mapping->host, nr_dropped);
>
> + /*
> + * Don't remap and unlock isolated folios
> + */
> + if (isolated)
> + return ret;
> +
> remap_page(origin_folio, 1 << order,
> folio_test_anon(origin_folio) ?
> RMP_USE_SHARED_ZEROPAGE : 0);
> @@ -3790,6 +3788,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
> * @lock_at: a page within @folio to be left locked to caller
> * @list: after-split folios will be put on it if non NULL
> * @uniform_split: perform uniform split or not (non-uniform split)
> + * @isolated: The pages are already unmapped
s/pages/folio
Why name it isolated if the folio is unmapped? "Isolated" usually means a folio
has been removed from the LRU lists, so "isolated" here causes confusion.
> *
> * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
> * It is in charge of checking whether the split is supported or not and
> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
> */
> static int __folio_split(struct folio *folio, unsigned int new_order,
> struct page *split_at, struct page *lock_at,
> - struct list_head *list, bool uniform_split)
> + struct list_head *list, bool uniform_split, bool isolated)
> {
> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
> XA_STATE(xas, &folio->mapping->i_pages, folio->index);
> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> * is taken to serialise against parallel split or collapse
> * operations.
> */
> - anon_vma = folio_get_anon_vma(folio);
> - if (!anon_vma) {
> - ret = -EBUSY;
> - goto out;
> + if (!isolated) {
> + anon_vma = folio_get_anon_vma(folio);
> + if (!anon_vma) {
> + ret = -EBUSY;
> + goto out;
> + }
> + anon_vma_lock_write(anon_vma);
> }
> end = -1;
> mapping = NULL;
> - anon_vma_lock_write(anon_vma);
> } else {
> unsigned int min_order;
> gfp_t gfp;
> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> goto out_unlock;
> }
>
> - unmap_folio(folio);
> + if (!isolated)
> + unmap_folio(folio);
>
> /* block interrupt reentry in xa_lock and spinlock */
> local_irq_disable();
> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>
> ret = __split_unmapped_folio(folio, new_order,
> split_at, lock_at, list, end, &xas, mapping,
> - uniform_split);
> + uniform_split, isolated);
> } else {
> spin_unlock(&ds_queue->split_queue_lock);
> fail:
> if (mapping)
> xas_unlock(&xas);
> local_irq_enable();
> - remap_page(folio, folio_nr_pages(folio), 0);
> + if (!isolated)
> + remap_page(folio, folio_nr_pages(folio), 0);
> ret = -EAGAIN;
> }
These "isolated" special handlings does not look good, I wonder if there
is a way of letting split code handle device private folios more gracefully.
It also causes confusions, since why does "isolated/unmapped" folios
not need to unmap_page(), remap_page(), or unlock?
>
> @@ -4046,12 +4049,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> * Returns -EINVAL when trying to split to an order that is incompatible
> * with the folio. Splitting to order 0 is compatible with all folios.
> */
> -int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> - unsigned int new_order)
> +int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> + unsigned int new_order, bool isolated)
> {
> struct folio *folio = page_folio(page);
>
> - return __folio_split(folio, new_order, &folio->page, page, list, true);
> + return __folio_split(folio, new_order, &folio->page, page, list, true,
> + isolated);
> }
>
> /*
> @@ -4080,7 +4084,7 @@ int folio_split(struct folio *folio, unsigned int new_order,
> struct page *split_at, struct list_head *list)
> {
> return __folio_split(folio, new_order, split_at, &folio->page, list,
> - false);
> + false, false);
> }
>
> int min_order_for_split(struct folio *folio)
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 41d0bd787969..acd2f03b178d 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -813,6 +813,24 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
> src[i] &= ~MIGRATE_PFN_MIGRATE;
> return 0;
> }
> +
> +static void migrate_vma_split_pages(struct migrate_vma *migrate,
> + unsigned long idx, unsigned long addr,
> + struct folio *folio)
> +{
> + unsigned long i;
> + unsigned long pfn;
> + unsigned long flags;
> +
> + folio_get(folio);
> + split_huge_pmd_address(migrate->vma, addr, true);
> + __split_huge_page_to_list_to_order(folio_page(folio, 0), NULL, 0, true);
If you need to split PMD entries here, why not let unmap_page() and remap_page()
in the split code do that?
--
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 04/12] mm/migrate_device: THP migration of zone device pages
2025-07-03 23:35 ` [v1 resend 04/12] mm/migrate_device: THP migration of zone device pages Balbir Singh
@ 2025-07-04 15:35 ` kernel test robot
2025-07-18 6:59 ` Matthew Brost
2025-07-19 2:10 ` Matthew Brost
2 siblings, 0 replies; 99+ messages in thread
From: kernel test robot @ 2025-07-04 15:35 UTC (permalink / raw)
To: Balbir Singh, linux-mm
Cc: llvm, oe-kbuild-all, akpm, linux-kernel, Balbir Singh,
Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
Simona Vetter, Jérôme Glisse, Shuah Khan,
David Hildenbrand, Barry Song, Baolin Wang, Ryan Roberts,
Matthew Wilcox, Peter Xu, Zi Yan, Kefeng Wang, Jane Chu,
Alistair Popple, Donet Tom
Hi Balbir,
kernel test robot noticed the following build warnings:
[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on next-20250704]
[cannot apply to akpm-mm/mm-nonmm-unstable shuah-kselftest/next shuah-kselftest/fixes linus/master v6.16-rc4]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Balbir-Singh/mm-zone_device-support-large-zone-device-private-folios/20250704-073807
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20250703233511.2028395-5-balbirs%40nvidia.com
patch subject: [v1 resend 04/12] mm/migrate_device: THP migration of zone device pages
config: x86_64-randconfig-075-20250704 (https://download.01.org/0day-ci/archive/20250704/202507042336.o2mutGeh-lkp@intel.com/config)
compiler: clang version 20.1.7 (https://github.com/llvm/llvm-project 6146a88f60492b520a36f8f8f3231e15f3cc6082)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250704/202507042336.o2mutGeh-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202507042336.o2mutGeh-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> Warning: mm/migrate_device.c:89 function parameter 'fault_folio' not described in 'migrate_vma_collect_huge_pmd'
>> Warning: mm/migrate_device.c:707 function parameter 'page' not described in 'migrate_vma_insert_huge_pmd_page'
Warning: mm/migrate_device.c:707 Excess function parameter 'folio' description in 'migrate_vma_insert_huge_pmd_page'
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 00/12] THP support for zone device page migration
2025-07-03 23:34 [v1 resend 00/12] THP support for zone device page migration Balbir Singh
` (11 preceding siblings ...)
2025-07-03 23:35 ` [v1 resend 12/12] selftests/mm/hmm-tests: new throughput tests including THP Balbir Singh
@ 2025-07-04 16:16 ` Zi Yan
2025-07-04 23:56 ` Balbir Singh
2025-07-08 14:53 ` David Hildenbrand
2025-07-17 23:40 ` Matthew Brost
14 siblings, 1 reply; 99+ messages in thread
From: Zi Yan @ 2025-07-04 16:16 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-mm, akpm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
On 3 Jul 2025, at 19:34, Balbir Singh wrote:
> This patch series adds support for THP migration of zone device pages.
> To do so, the patches implement support for folio zone device pages
> by adding support for setting up larger order pages.
>
> These patches build on the earlier posts by Ralph Campbell [1]
>
> Two new flags are added in vma_migration to select and mark compound pages.
> migrate_vma_setup(), migrate_vma_pages() and migrate_vma_finalize()
> support migration of these pages when MIGRATE_VMA_SELECT_COMPOUND
> is passed in as arguments.
>
> The series also adds zone device awareness to (m)THP pages along
> with fault handling of large zone device private pages. page vma walk
> and the rmap code is also zone device aware. Support has also been
> added for folios that might need to be split in the middle
> of migration (when the src and dst do not agree on
> MIGRATE_PFN_COMPOUND), that occurs when src side of the migration can
> migrate large pages, but the destination has not been able to allocate
> large pages. The code supported and used folio_split() when migrating
> THP pages, this is used when MIGRATE_VMA_SELECT_COMPOUND is not passed
> as an argument to migrate_vma_setup().
>
> The test infrastructure lib/test_hmm.c has been enhanced to support THP
> migration. A new ioctl to emulate failure of large page allocations has
> been added to test the folio split code path. hmm-tests.c has new test
> cases for huge page migration and to test the folio split path. A new
> throughput test has been added as well.
>
> The nouveau dmem code has been enhanced to use the new THP migration
> capability.
>
> Feedback from the RFC [2]:
>
> It was advised that prep_compound_page() not be exposed just for the purposes
> of testing (test driver lib/test_hmm.c). Work arounds of copy and split the
> folios did not work due to lock order dependency in the callback for
> split folio.
>
> mTHP support:
>
> The patches hard code, HPAGE_PMD_NR in a few places, but the code has
> been kept generic to support various order sizes. With additional
> refactoring of the code support of different order sizes should be
> possible.
>
> The future plan is to post enhancements to support mTHP with a rough
> design as follows:
>
> 1. Add the notion of allowable thp orders to the HMM based test driver
> 2. For non PMD based THP paths in migrate_device.c, check to see if
> a suitable order is found and supported by the driver
> 3. Iterate across orders to check the highest supported order for migration
> 4. Migrate and finalize
>
> The mTHP patches can be built on top of this series, the key design elements
> that need to be worked out are infrastructure and driver support for multiple
> ordered pages and their migration.
To help me better review the patches, can you tell me if my mental model below
for device private folios is correct or not?
1. device private folios represent device memory, but the associated PFNs
do not exist in the system. folio->pgmap contains the meta info about
device memory.
2. when data is migrated from system memory to device private memory, a device
private page table entry is established in place of the original entry.
A device private page table entry is a swap entry with a device private type.
And the swap entry points to a device private folio in which the data resides
in the device private memory.
3. when CPU tries to access an address with device private page table entry,
a fault happens and data is migrated from device private memory to system
memory. The device private folio pointed by the device private page table
entry tells driver where to look for the data on the device.
4. one of the reasons for splitting a large device private folio is that,
when it is migrated back to system memory, there is no free large folio
in system memory. So the driver splits the large device private folio
and only migrates a subpage instead.
Thanks.
--
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 00/12] THP support for zone device page migration
2025-07-04 16:16 ` [v1 resend 00/12] THP support for zone device page migration Zi Yan
@ 2025-07-04 23:56 ` Balbir Singh
0 siblings, 0 replies; 99+ messages in thread
From: Balbir Singh @ 2025-07-04 23:56 UTC (permalink / raw)
To: Zi Yan
Cc: linux-mm, akpm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
On 7/5/25 02:16, Zi Yan wrote:
> On 3 Jul 2025, at 19:34, Balbir Singh wrote:
>
>> This patch series adds support for THP migration of zone device pages.
>> To do so, the patches implement support for folio zone device pages
>> by adding support for setting up larger order pages.
>>
>> These patches build on the earlier posts by Ralph Campbell [1]
>>
>> Two new flags are added in vma_migration to select and mark compound pages.
>> migrate_vma_setup(), migrate_vma_pages() and migrate_vma_finalize()
>> support migration of these pages when MIGRATE_VMA_SELECT_COMPOUND
>> is passed in as arguments.
>>
>> The series also adds zone device awareness to (m)THP pages along
>> with fault handling of large zone device private pages. page vma walk
>> and the rmap code is also zone device aware. Support has also been
>> added for folios that might need to be split in the middle
>> of migration (when the src and dst do not agree on
>> MIGRATE_PFN_COMPOUND), that occurs when src side of the migration can
>> migrate large pages, but the destination has not been able to allocate
>> large pages. The code supported and used folio_split() when migrating
>> THP pages, this is used when MIGRATE_VMA_SELECT_COMPOUND is not passed
>> as an argument to migrate_vma_setup().
>>
>> The test infrastructure lib/test_hmm.c has been enhanced to support THP
>> migration. A new ioctl to emulate failure of large page allocations has
>> been added to test the folio split code path. hmm-tests.c has new test
>> cases for huge page migration and to test the folio split path. A new
>> throughput test has been added as well.
>>
>> The nouveau dmem code has been enhanced to use the new THP migration
>> capability.
>>
>> Feedback from the RFC [2]:
>>
>> It was advised that prep_compound_page() not be exposed just for the purposes
>> of testing (test driver lib/test_hmm.c). Work arounds of copy and split the
>> folios did not work due to lock order dependency in the callback for
>> split folio.
>>
>> mTHP support:
>>
>> The patches hard code, HPAGE_PMD_NR in a few places, but the code has
>> been kept generic to support various order sizes. With additional
>> refactoring of the code support of different order sizes should be
>> possible.
>>
>> The future plan is to post enhancements to support mTHP with a rough
>> design as follows:
>>
>> 1. Add the notion of allowable thp orders to the HMM based test driver
>> 2. For non PMD based THP paths in migrate_device.c, check to see if
>> a suitable order is found and supported by the driver
>> 3. Iterate across orders to check the highest supported order for migration
>> 4. Migrate and finalize
>>
>> The mTHP patches can be built on top of this series, the key design elements
>> that need to be worked out are infrastructure and driver support for multiple
>> ordered pages and their migration.
>
> To help me better review the patches, can you tell me if my mental model below
> for device private folios is correct or not?
>
> 1. device private folios represent device memory, but the associated PFNs
> do not exist in the system. folio->pgmap contains the meta info about
> device memory.
Yes, that is right.
>
> 2. when data is migrated from system memory to device private memory, a device
> private page table entry is established in place of the original entry.
> A device private page table entry is a swap entry with a device private type.
> And the swap entry points to a device private folio in which the data resides
> in the device private memory.
>
Yes
> 3. when CPU tries to access an address with device private page table entry,
> a fault happens and data is migrated from device private memory to system
> memory. The device private folio pointed by the device private page table
> entry tells driver where to look for the data on the device.
>
> 4. one of the reasons for splitting a large device private folio is that,
> when it is migrated back to system memory, there is no free large folio
> in system memory. So the driver splits the large device private folio
> and only migrates a subpage instead.
>
Both points are correct. To add to point 4, it can also happen that during
migration of memory from system to device we might need a split as well:
effectively, the destination is unable to allocate a large page for the migration.
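As a reference, points 2 and 3 of the model above look roughly like this at the
page table level, using existing helpers (illustrative only; dpage, mm, addr,
ptep and vmf are assumed context, not code from this series):

	/* 2. install a device private swap entry for the migrated data */
	swp_entry_t entry = make_writable_device_private_entry(page_to_pfn(dpage));

	set_pte_at(mm, addr, ptep, swp_entry_to_pte(entry));

	/* 3. the CPU fault path recognizes the entry and calls the driver */
	entry = pte_to_swp_entry(ptep_get(ptep));
	if (is_device_private_entry(entry)) {
		struct folio *folio = pfn_swap_entry_folio(entry);

		return folio->pgmap->ops->migrate_to_ram(vmf);
	}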
Balbir
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 03/12] mm/thp: zone_device awareness in THP handling code
2025-07-04 11:10 ` Mika Penttilä
@ 2025-07-05 0:14 ` Balbir Singh
2025-07-07 6:09 ` Alistair Popple
0 siblings, 1 reply; 99+ messages in thread
From: Balbir Singh @ 2025-07-05 0:14 UTC (permalink / raw)
To: Mika Penttilä, linux-mm
Cc: akpm, linux-kernel, Karol Herbst, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
David Hildenbrand, Barry Song, Baolin Wang, Ryan Roberts,
Matthew Wilcox, Peter Xu, Zi Yan, Kefeng Wang, Jane Chu,
Alistair Popple, Donet Tom
On 7/4/25 21:10, Mika Penttilä wrote:
>> /* Racy check whether the huge page can be split */
>> @@ -3543,6 +3594,10 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
>> ((mapping || swap_cache) ?
>> folio_nr_pages(release) : 0));
>>
>> + if (folio_is_device_private(release))
>> + percpu_ref_get_many(&release->pgmap->ref,
>> + (1 << new_order) - 1);
>
> pgmap refcount should not be modified here, count should remain the same after the split also
>
>
Good point, let me revisit the accounting.
For this patch series, the tests did not catch it since the new refs evaluate to 0.
Thanks,
Balbir Singh
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 08/12] mm/thp: add split during migration support
2025-07-04 6:43 ` Mika Penttilä
@ 2025-07-05 0:26 ` Balbir Singh
2025-07-05 3:17 ` Mika Penttilä
0 siblings, 1 reply; 99+ messages in thread
From: Balbir Singh @ 2025-07-05 0:26 UTC (permalink / raw)
To: Mika Penttilä, linux-mm
Cc: akpm, linux-kernel, Karol Herbst, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
David Hildenbrand, Barry Song, Baolin Wang, Ryan Roberts,
Matthew Wilcox, Peter Xu, Zi Yan, Kefeng Wang, Jane Chu,
Alistair Popple, Donet Tom
On 7/4/25 16:43, Mika Penttilä wrote:
>
> On 7/4/25 08:17, Mika Penttilä wrote:
>> On 7/4/25 02:35, Balbir Singh wrote:
>>> Support splitting pages during THP zone device migration as needed.
>>> The common case that arises is that, after setup, the destination
>>> might not be able to allocate MIGRATE_PFN_COMPOUND pages during
>>> migration.
>>>
>>> Add a new routine migrate_vma_split_pages() to support the splitting
>>> of already isolated pages. The pages being migrated are already unmapped
>>> and marked for migration during setup (via unmap). folio_split() and
>>> __split_unmapped_folio() take an additional 'isolated' argument, to avoid
>>> unmapping and remapping these pages and unlocking/putting the folio.
>>>
>>> Cc: Karol Herbst <kherbst@redhat.com>
>>> Cc: Lyude Paul <lyude@redhat.com>
>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>> Cc: David Airlie <airlied@gmail.com>
>>> Cc: Simona Vetter <simona@ffwll.ch>
>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>> Cc: Shuah Khan <shuah@kernel.org>
>>> Cc: David Hildenbrand <david@redhat.com>
>>> Cc: Barry Song <baohua@kernel.org>
>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>> Cc: Matthew Wilcox <willy@infradead.org>
>>> Cc: Peter Xu <peterx@redhat.com>
>>> Cc: Zi Yan <ziy@nvidia.com>
>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>> Cc: Jane Chu <jane.chu@oracle.com>
>>> Cc: Alistair Popple <apopple@nvidia.com>
>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>
>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>> ---
>>> include/linux/huge_mm.h | 11 ++++++--
>>> mm/huge_memory.c | 54 ++++++++++++++++++++-----------------
>>> mm/migrate_device.c | 59 ++++++++++++++++++++++++++++++++---------
>>> 3 files changed, 85 insertions(+), 39 deletions(-)
>>>
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index 65a1bdf29bb9..5f55a754e57c 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>> @@ -343,8 +343,8 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>> vm_flags_t vm_flags);
>>>
>>> bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>>> -int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>> - unsigned int new_order);
>>> +int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>> + unsigned int new_order, bool isolated);
>>> int min_order_for_split(struct folio *folio);
>>> int split_folio_to_list(struct folio *folio, struct list_head *list);
>>> bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>> @@ -353,6 +353,13 @@ bool non_uniform_split_supported(struct folio *folio, unsigned int new_order,
>>> bool warns);
>>> int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
>>> struct list_head *list);
>>> +
>>> +static inline int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>> + unsigned int new_order)
>>> +{
>>> + return __split_huge_page_to_list_to_order(page, list, new_order, false);
>>> +}
>>> +
>>> /*
>>> * try_folio_split - try to split a @folio at @page using non uniform split.
>>> * @folio: folio to be split
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index d55e36ae0c39..e00ddfed22fa 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -3424,15 +3424,6 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
>>> new_folio->mapping = folio->mapping;
>>> new_folio->index = folio->index + i;
>>>
>>> - /*
>>> - * page->private should not be set in tail pages. Fix up and warn once
>>> - * if private is unexpectedly set.
>>> - */
>>> - if (unlikely(new_folio->private)) {
>>> - VM_WARN_ON_ONCE_PAGE(true, new_head);
>>> - new_folio->private = NULL;
>>> - }
>>> -
>>> if (folio_test_swapcache(folio))
>>> new_folio->swap.val = folio->swap.val + i;
>>>
>>> @@ -3519,7 +3510,7 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
>>> struct page *split_at, struct page *lock_at,
>>> struct list_head *list, pgoff_t end,
>>> struct xa_state *xas, struct address_space *mapping,
>>> - bool uniform_split)
>>> + bool uniform_split, bool isolated)
>>> {
>>> struct lruvec *lruvec;
>>> struct address_space *swap_cache = NULL;
>>> @@ -3643,8 +3634,9 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
>>> percpu_ref_get_many(&release->pgmap->ref,
>>> (1 << new_order) - 1);
>>>
>>> - lru_add_split_folio(origin_folio, release, lruvec,
>>> - list);
>>> + if (!isolated)
>>> + lru_add_split_folio(origin_folio, release,
>>> + lruvec, list);
>>>
>>> /* Some pages can be beyond EOF: drop them from cache */
>>> if (release->index >= end) {
>>> @@ -3697,6 +3689,12 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
>>> if (nr_dropped)
>>> shmem_uncharge(mapping->host, nr_dropped);
>>>
>>> + /*
>>> + * Don't remap and unlock isolated folios
>>> + */
>>> + if (isolated)
>>> + return ret;
>>> +
>>> remap_page(origin_folio, 1 << order,
>>> folio_test_anon(origin_folio) ?
>>> RMP_USE_SHARED_ZEROPAGE : 0);
>>> @@ -3790,6 +3788,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>> * @lock_at: a page within @folio to be left locked to caller
>>> * @list: after-split folios will be put on it if non NULL
>>> * @uniform_split: perform uniform split or not (non-uniform split)
>>> + * @isolated: The pages are already unmapped
>>> *
>>> * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>>> * It is in charge of checking whether the split is supported or not and
>>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>> */
>>> static int __folio_split(struct folio *folio, unsigned int new_order,
>>> struct page *split_at, struct page *lock_at,
>>> - struct list_head *list, bool uniform_split)
>>> + struct list_head *list, bool uniform_split, bool isolated)
>>> {
>>> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>> XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>> * is taken to serialise against parallel split or collapse
>>> * operations.
>>> */
>>> - anon_vma = folio_get_anon_vma(folio);
>>> - if (!anon_vma) {
>>> - ret = -EBUSY;
>>> - goto out;
>>> + if (!isolated) {
>>> + anon_vma = folio_get_anon_vma(folio);
>>> + if (!anon_vma) {
>>> + ret = -EBUSY;
>>> + goto out;
>>> + }
>>> + anon_vma_lock_write(anon_vma);
>>> }
>>> end = -1;
>>> mapping = NULL;
>>> - anon_vma_lock_write(anon_vma);
>>> } else {
>>> unsigned int min_order;
>>> gfp_t gfp;
>>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>> goto out_unlock;
>>> }
>>>
>>> - unmap_folio(folio);
>>> + if (!isolated)
>>> + unmap_folio(folio);
>>>
>>> /* block interrupt reentry in xa_lock and spinlock */
>>> local_irq_disable();
>>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>
>>> ret = __split_unmapped_folio(folio, new_order,
>>> split_at, lock_at, list, end, &xas, mapping,
>>> - uniform_split);
>>> + uniform_split, isolated);
>>> } else {
>>> spin_unlock(&ds_queue->split_queue_lock);
>>> fail:
>>> if (mapping)
>>> xas_unlock(&xas);
>>> local_irq_enable();
>>> - remap_page(folio, folio_nr_pages(folio), 0);
>>> + if (!isolated)
>>> + remap_page(folio, folio_nr_pages(folio), 0);
>>> ret = -EAGAIN;
>>> }
>>>
>>> @@ -4046,12 +4049,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>> * Returns -EINVAL when trying to split to an order that is incompatible
>>> * with the folio. Splitting to order 0 is compatible with all folios.
>>> */
>>> -int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>> - unsigned int new_order)
>>> +int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>> + unsigned int new_order, bool isolated)
>>> {
>>> struct folio *folio = page_folio(page);
>>>
>>> - return __folio_split(folio, new_order, &folio->page, page, list, true);
>>> + return __folio_split(folio, new_order, &folio->page, page, list, true,
>>> + isolated);
>>> }
>>>
>>> /*
>>> @@ -4080,7 +4084,7 @@ int folio_split(struct folio *folio, unsigned int new_order,
>>> struct page *split_at, struct list_head *list)
>>> {
>>> return __folio_split(folio, new_order, split_at, &folio->page, list,
>>> - false);
>>> + false, false);
>>> }
>>>
>>> int min_order_for_split(struct folio *folio)
>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>> index 41d0bd787969..acd2f03b178d 100644
>>> --- a/mm/migrate_device.c
>>> +++ b/mm/migrate_device.c
>>> @@ -813,6 +813,24 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
>>> src[i] &= ~MIGRATE_PFN_MIGRATE;
>>> return 0;
>>> }
>>> +
>>> +static void migrate_vma_split_pages(struct migrate_vma *migrate,
>>> + unsigned long idx, unsigned long addr,
>>> + struct folio *folio)
>>> +{
>>> + unsigned long i;
>>> + unsigned long pfn;
>>> + unsigned long flags;
>>> +
>>> + folio_get(folio);
>>> + split_huge_pmd_address(migrate->vma, addr, true);
>>> + __split_huge_page_to_list_to_order(folio_page(folio, 0), NULL, 0, true);
>> We already have reference to folio, why is folio_get() needed ?
>>
>> Splitting the page splits pmd for anon folios, why is there split_huge_pmd_address() ?
>
> Oh I see
> + if (!isolated)
> + unmap_folio(folio);
>
> which explains the explicit split_huge_pmd_address(migrate->vma, addr, true);
>
> Still, why the folio_get(folio);?
>
>
That is for split_huge_pmd_address(); when called with freeze=true, it drops the
ref count on the page:

	if (freeze)
		put_page(page);
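
For reference, a short annotated sketch of the pairing being described (it just
restates the hunk quoted above, with the refcount movement called out):

	/* take an extra reference up front ... */
	folio_get(folio);
	/* ... because the freeze path of the PMD split does put_page() */
	split_huge_pmd_address(migrate->vma, addr, true);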
Balbir
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 08/12] mm/thp: add split during migration support
2025-07-04 11:24 ` Zi Yan
@ 2025-07-05 0:58 ` Balbir Singh
2025-07-05 1:55 ` Zi Yan
0 siblings, 1 reply; 99+ messages in thread
From: Balbir Singh @ 2025-07-05 0:58 UTC (permalink / raw)
To: Zi Yan
Cc: linux-mm, akpm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
On 7/4/25 21:24, Zi Yan wrote:
>
> s/pages/folio
>
Thanks, will make the changes
> Why name it isolated if the folio is unmapped? Isolated folios often mean
> they are removed from LRU lists. isolated here causes confusion.
>
Ack, will change the name
>> *
>> * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>> * It is in charge of checking whether the split is supported or not and
>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>> */
>> static int __folio_split(struct folio *folio, unsigned int new_order,
>> struct page *split_at, struct page *lock_at,
>> - struct list_head *list, bool uniform_split)
>> + struct list_head *list, bool uniform_split, bool isolated)
>> {
>> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>> XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>> * is taken to serialise against parallel split or collapse
>> * operations.
>> */
>> - anon_vma = folio_get_anon_vma(folio);
>> - if (!anon_vma) {
>> - ret = -EBUSY;
>> - goto out;
>> + if (!isolated) {
>> + anon_vma = folio_get_anon_vma(folio);
>> + if (!anon_vma) {
>> + ret = -EBUSY;
>> + goto out;
>> + }
>> + anon_vma_lock_write(anon_vma);
>> }
>> end = -1;
>> mapping = NULL;
>> - anon_vma_lock_write(anon_vma);
>> } else {
>> unsigned int min_order;
>> gfp_t gfp;
>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>> goto out_unlock;
>> }
>>
>> - unmap_folio(folio);
>> + if (!isolated)
>> + unmap_folio(folio);
>>
>> /* block interrupt reentry in xa_lock and spinlock */
>> local_irq_disable();
>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>
>> ret = __split_unmapped_folio(folio, new_order,
>> split_at, lock_at, list, end, &xas, mapping,
>> - uniform_split);
>> + uniform_split, isolated);
>> } else {
>> spin_unlock(&ds_queue->split_queue_lock);
>> fail:
>> if (mapping)
>> xas_unlock(&xas);
>> local_irq_enable();
>> - remap_page(folio, folio_nr_pages(folio), 0);
>> + if (!isolated)
>> + remap_page(folio, folio_nr_pages(folio), 0);
>> ret = -EAGAIN;
>> }
>
> These "isolated" special handlings does not look good, I wonder if there
> is a way of letting split code handle device private folios more gracefully.
> It also causes confusions, since why does "isolated/unmapped" folios
> not need to unmap_page(), remap_page(), or unlock?
>
>
There are two reasons for going down the current code path:
1. If the isolated check is not present, folio_get_anon_vma() will fail and cause
the split routine to return with -EBUSY.
2. Going through unmap_page(), remap_page() causes a full page table walk, which
the migrate_device API has already just done as part of the migration. The
entries under consideration are already migration entries in this case.
This is wasteful and in some cases unexpected.
Thanks for the review,
Balbir Singh
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 08/12] mm/thp: add split during migration support
2025-07-05 0:58 ` Balbir Singh
@ 2025-07-05 1:55 ` Zi Yan
2025-07-06 1:15 ` Balbir Singh
0 siblings, 1 reply; 99+ messages in thread
From: Zi Yan @ 2025-07-05 1:55 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-mm, akpm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
On 4 Jul 2025, at 20:58, Balbir Singh wrote:
> On 7/4/25 21:24, Zi Yan wrote:
>>
>> s/pages/folio
>>
>
> Thanks, will make the changes
>
>> Why name it isolated if the folio is unmapped? Isolated folios often mean
>> they are removed from LRU lists. isolated here causes confusion.
>>
>
> Ack, will change the name
>
>
>>> *
>>> * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>>> * It is in charge of checking whether the split is supported or not and
>>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>> */
>>> static int __folio_split(struct folio *folio, unsigned int new_order,
>>> struct page *split_at, struct page *lock_at,
>>> - struct list_head *list, bool uniform_split)
>>> + struct list_head *list, bool uniform_split, bool isolated)
>>> {
>>> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>> XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>> * is taken to serialise against parallel split or collapse
>>> * operations.
>>> */
>>> - anon_vma = folio_get_anon_vma(folio);
>>> - if (!anon_vma) {
>>> - ret = -EBUSY;
>>> - goto out;
>>> + if (!isolated) {
>>> + anon_vma = folio_get_anon_vma(folio);
>>> + if (!anon_vma) {
>>> + ret = -EBUSY;
>>> + goto out;
>>> + }
>>> + anon_vma_lock_write(anon_vma);
>>> }
>>> end = -1;
>>> mapping = NULL;
>>> - anon_vma_lock_write(anon_vma);
>>> } else {
>>> unsigned int min_order;
>>> gfp_t gfp;
>>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>> goto out_unlock;
>>> }
>>>
>>> - unmap_folio(folio);
>>> + if (!isolated)
>>> + unmap_folio(folio);
>>>
>>> /* block interrupt reentry in xa_lock and spinlock */
>>> local_irq_disable();
>>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>
>>> ret = __split_unmapped_folio(folio, new_order,
>>> split_at, lock_at, list, end, &xas, mapping,
>>> - uniform_split);
>>> + uniform_split, isolated);
>>> } else {
>>> spin_unlock(&ds_queue->split_queue_lock);
>>> fail:
>>> if (mapping)
>>> xas_unlock(&xas);
>>> local_irq_enable();
>>> - remap_page(folio, folio_nr_pages(folio), 0);
>>> + if (!isolated)
>>> + remap_page(folio, folio_nr_pages(folio), 0);
>>> ret = -EAGAIN;
>>> }
>>
>> These "isolated" special handlings does not look good, I wonder if there
>> is a way of letting split code handle device private folios more gracefully.
>> It also causes confusions, since why does "isolated/unmapped" folios
>> not need to unmap_page(), remap_page(), or unlock?
>>
>>
>
> There are two reasons for going down the current code path
After thinking more, I think adding isolated/unmapped is not the right
way, since an unmapped folio is a very generic concept. If you add it,
one can easily misuse the folio split code by first unmapping a folio
and trying to split it with unmapped = true. I do not think that is
supported, and your patch does not prevent that from happening in the future.
You should teach the different parts of the folio split code path to handle
device private folios properly. Details are below.
>
> 1. if the isolated check is not present, folio_get_anon_vma will fail and cause
> the split routine to return with -EBUSY
You could do something like the below instead:

	if (!anon_vma && !folio_is_device_private(folio)) {
		ret = -EBUSY;
		goto out;
	} else if (anon_vma) {
		anon_vma_lock_write(anon_vma);
	}

That way people can know that device private folio split needs special handling.
BTW, why can a device private folio also be anonymous? Does it mean that
if a page cache folio is migrated to device private, the kernel also
sees it as both device private and file-backed?
> 2. Going through unmap_page(), remap_page() causes a full page table walk, which
> the migrate_device API has already just done as a part of the migration. The
> entries under consideration are already migration entries in this case.
> This is wasteful and in some case unexpected.
unmap_folio() already adds TTU_SPLIT_HUGE_PMD to try to split the
PMD mapping, which is what you do in migrate_vma_split_pages(). You probably
can teach either try_to_migrate() or try_to_unmap() to just split the
device private PMD mapping. Or, if that is not preferred,
you can simply call split_huge_pmd_address() when unmap_folio()
sees a device private folio.
For remap_page(), you can simply return for device private folios,
like it currently does for non-anonymous folios.
For lru_add_split_folio(), you can skip it if a device private
folio is seen.
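A rough sketch of what those two suggestions could look like (assuming the
current remap_page()/lru_add_split_folio() call sites; this is not code from
the series):

	/* in remap_page(): bail out for device private folios as well */
	if (!folio_test_anon(folio) || folio_is_device_private(folio))
		return;

	/* in __split_unmapped_folio(): keep device private folios off the LRU */
	if (!folio_is_device_private(release))
		lru_add_split_folio(origin_folio, release, lruvec, list);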
Last, for the unlock part, why do you need to keep all after-split folios
locked? It should be possible to just keep the to-be-migrated folio
locked and unlock the rest for a later retry. But I could be missing something,
since I am not familiar with the device private migration code.
--
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 08/12] mm/thp: add split during migration support
2025-07-05 0:26 ` Balbir Singh
@ 2025-07-05 3:17 ` Mika Penttilä
2025-07-07 2:35 ` Balbir Singh
0 siblings, 1 reply; 99+ messages in thread
From: Mika Penttilä @ 2025-07-05 3:17 UTC (permalink / raw)
To: Balbir Singh, linux-mm
Cc: akpm, linux-kernel, Karol Herbst, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
David Hildenbrand, Barry Song, Baolin Wang, Ryan Roberts,
Matthew Wilcox, Peter Xu, Zi Yan, Kefeng Wang, Jane Chu,
Alistair Popple, Donet Tom
>>>> +static void migrate_vma_split_pages(struct migrate_vma *migrate,
>>>> + unsigned long idx, unsigned long addr,
>>>> + struct folio *folio)
>>>> +{
>>>> + unsigned long i;
>>>> + unsigned long pfn;
>>>> + unsigned long flags;
>>>> +
>>>> + folio_get(folio);
>>>> + split_huge_pmd_address(migrate->vma, addr, true);
>>>> + __split_huge_page_to_list_to_order(folio_page(folio, 0), NULL, 0, true);
>>> We already have reference to folio, why is folio_get() needed ?
>>>
>>> Splitting the page splits pmd for anon folios, why is there split_huge_pmd_address() ?
>> Oh I see
>> + if (!isolated)
>> + unmap_folio(folio);
>>
>> which explains the explicit split_huge_pmd_address(migrate->vma, addr, true);
>>
>> Still, why the folio_get(folio);?
>>
>>
> That is for split_huge_pmd_address, when called with freeze=true, it drops the
> ref count on the page
>
> if (freeze)
> put_page(page);
>
> Balbir
>
Yeah, I guess you could have used the pmd_migration path in __split_huge_pmd_locked() and not used freeze, because you have already installed the migration pmd entry.
Which brings up a bigger concern: you do need the freeze semantics, like clearing PageAnonExclusive (which may fail). I think you did not get this part
right in the 3/12 patch. And in this patch, you can't assume the split succeeds, which would mean you can't migrate the range at all.
Doing the split this late is quite problematic all in all.
--Mika
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 08/12] mm/thp: add split during migration support
2025-07-05 1:55 ` Zi Yan
@ 2025-07-06 1:15 ` Balbir Singh
2025-07-06 1:34 ` Zi Yan
0 siblings, 1 reply; 99+ messages in thread
From: Balbir Singh @ 2025-07-06 1:15 UTC (permalink / raw)
To: Zi Yan
Cc: linux-mm, akpm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
On 7/5/25 11:55, Zi Yan wrote:
> On 4 Jul 2025, at 20:58, Balbir Singh wrote:
>
>> On 7/4/25 21:24, Zi Yan wrote:
>>>
>>> s/pages/folio
>>>
>>
>> Thanks, will make the changes
>>
>>> Why name it isolated if the folio is unmapped? Isolated folios often mean
>>> they are removed from LRU lists. isolated here causes confusion.
>>>
>>
>> Ack, will change the name
>>
>>
>>>> *
>>>> * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>>>> * It is in charge of checking whether the split is supported or not and
>>>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>> */
>>>> static int __folio_split(struct folio *folio, unsigned int new_order,
>>>> struct page *split_at, struct page *lock_at,
>>>> - struct list_head *list, bool uniform_split)
>>>> + struct list_head *list, bool uniform_split, bool isolated)
>>>> {
>>>> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>> XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>> * is taken to serialise against parallel split or collapse
>>>> * operations.
>>>> */
>>>> - anon_vma = folio_get_anon_vma(folio);
>>>> - if (!anon_vma) {
>>>> - ret = -EBUSY;
>>>> - goto out;
>>>> + if (!isolated) {
>>>> + anon_vma = folio_get_anon_vma(folio);
>>>> + if (!anon_vma) {
>>>> + ret = -EBUSY;
>>>> + goto out;
>>>> + }
>>>> + anon_vma_lock_write(anon_vma);
>>>> }
>>>> end = -1;
>>>> mapping = NULL;
>>>> - anon_vma_lock_write(anon_vma);
>>>> } else {
>>>> unsigned int min_order;
>>>> gfp_t gfp;
>>>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>> goto out_unlock;
>>>> }
>>>>
>>>> - unmap_folio(folio);
>>>> + if (!isolated)
>>>> + unmap_folio(folio);
>>>>
>>>> /* block interrupt reentry in xa_lock and spinlock */
>>>> local_irq_disable();
>>>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>
>>>> ret = __split_unmapped_folio(folio, new_order,
>>>> split_at, lock_at, list, end, &xas, mapping,
>>>> - uniform_split);
>>>> + uniform_split, isolated);
>>>> } else {
>>>> spin_unlock(&ds_queue->split_queue_lock);
>>>> fail:
>>>> if (mapping)
>>>> xas_unlock(&xas);
>>>> local_irq_enable();
>>>> - remap_page(folio, folio_nr_pages(folio), 0);
>>>> + if (!isolated)
>>>> + remap_page(folio, folio_nr_pages(folio), 0);
>>>> ret = -EAGAIN;
>>>> }
>>>
>>> These "isolated" special handlings does not look good, I wonder if there
>>> is a way of letting split code handle device private folios more gracefully.
>>> It also causes confusions, since why does "isolated/unmapped" folios
>>> not need to unmap_page(), remap_page(), or unlock?
>>>
>>>
>>
>> There are two reasons for going down the current code path
>
> After thinking more, I think adding isolated/unmapped is not the right
> way, since unmapped folio is a very generic concept. If you add it,
> one can easily misuse the folio split code by first unmapping a folio
> and trying to split it with unmapped = true. I do not think that is
> supported and your patch does not prevent that from happening in the future.
>
I don't understand the misuse case you mention; I assume you mean someone can
get the usage wrong? The responsibility is on the caller to do the right thing
when calling the API with unmapped.
> You should teach different parts of folio split code path to handle
> device private folios properly. Details are below.
>
>>
>> 1. if the isolated check is not present, folio_get_anon_vma will fail and cause
>> the split routine to return with -EBUSY
>
> You do something below instead.
>
> if (!anon_vma && !folio_is_device_private(folio)) {
> ret = -EBUSY;
> goto out;
> } else if (anon_vma) {
> anon_vma_lock_write(anon_vma);
> }
>
folio_get_anon_vma() cannot be called for unmapped folios. In our case the page has
already been unmapped. Is there a reason why you mix anon_vma_lock_write() with
the check for device private folios?
> People can know device private folio split needs a special handling.
>
> BTW, why a device private folio can also be anonymous? Does it mean
> if a page cache folio is migrated to device private, kernel also
> sees it as both device private and file-backed?
>
FYI: device private folios only work with anonymous private pages, hence
the name device private.
>
>> 2. Going through unmap_page(), remap_page() causes a full page table walk, which
>> the migrate_device API has already just done as a part of the migration. The
>> entries under consideration are already migration entries in this case.
>> This is wasteful and in some case unexpected.
>
> unmap_folio() already adds TTU_SPLIT_HUGE_PMD to try to split
> PMD mapping, which you did in migrate_vma_split_pages(). You probably
> can teach either try_to_migrate() or try_to_unmap() to just split
> device private PMD mapping. Or if that is not preferred,
> you can simply call split_huge_pmd_address() when unmap_folio()
> sees a device private folio.
>
> For remap_page(), you can simply return for device private folios
> like it is currently doing for non anonymous folios.
>
Doing a full rmap walk does not make sense with unmap_folio() and
remap_page(), because:
1. We need to do a page table walk/rmap walk again.
2. We'll need special handling of migration <-> migration entries
in the rmap handling (set/remove migration ptes).
3. In this context, the code is already in the middle of migration,
so trying to do that again does not make sense.
>
> For lru_add_split_folio(), you can skip it if a device private
> folio is seen.
>
> Last, for unlock part, why do you need to keep all after-split folios
> locked? It should be possible to just keep the to-be-migrated folio
> locked and unlock the rest for a later retry. But I could miss something
> since I am not familiar with device private migration code.
>
Not sure I follow this comment
Balbir
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 03/12] mm/thp: zone_device awareness in THP handling code
2025-07-04 4:46 ` Mika Penttilä
@ 2025-07-06 1:21 ` Balbir Singh
0 siblings, 0 replies; 99+ messages in thread
From: Balbir Singh @ 2025-07-06 1:21 UTC (permalink / raw)
To: Mika Penttilä, linux-mm
Cc: akpm, linux-kernel, Karol Herbst, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
David Hildenbrand, Barry Song, Baolin Wang, Ryan Roberts,
Matthew Wilcox, Peter Xu, Zi Yan, Kefeng Wang, Jane Chu,
Alistair Popple, Donet Tom
On 7/4/25 14:46, Mika Penttilä wrote:
> On 7/4/25 02:35, Balbir Singh wrote:
>>
>> @@ -2899,20 +2923,25 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>> return __split_huge_zero_page_pmd(vma, haddr, pmd);
>> }
>>
>> - pmd_migration = is_pmd_migration_entry(*pmd);
>> - if (unlikely(pmd_migration)) {
>> - swp_entry_t entry;
>>
>> + present = pmd_present(*pmd);
>> + if (unlikely(!present)) {
>> + swp_entry = pmd_to_swp_entry(*pmd);
>> old_pmd = *pmd;
>> - entry = pmd_to_swp_entry(old_pmd);
>> - page = pfn_swap_entry_to_page(entry);
>> - write = is_writable_migration_entry(entry);
>> +
>> + folio = pfn_swap_entry_folio(swp_entry);
>> + VM_BUG_ON(!is_migration_entry(swp_entry) &&
>> + !is_device_private_entry(swp_entry));
>> + page = pfn_swap_entry_to_page(swp_entry);
>> + write = is_writable_migration_entry(swp_entry);
>
> Shouldn't write include is_writable_device_private_entry() also?
>
>
Good point, will fix.
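Presumably the fix amounts to something like this (a sketch, not the posted code):

	write = is_writable_migration_entry(swp_entry) ||
		is_writable_device_private_entry(swp_entry);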
Balbir
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 07/12] mm/memremap: add folio_split support
2025-07-04 11:14 ` Mika Penttilä
@ 2025-07-06 1:24 ` Balbir Singh
0 siblings, 0 replies; 99+ messages in thread
From: Balbir Singh @ 2025-07-06 1:24 UTC (permalink / raw)
To: Mika Penttilä, linux-mm
Cc: akpm, linux-kernel, Karol Herbst, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
David Hildenbrand, Barry Song, Baolin Wang, Ryan Roberts,
Matthew Wilcox, Peter Xu, Zi Yan, Kefeng Wang, Jane Chu,
Alistair Popple, Donet Tom
On 7/4/25 21:14, Mika Penttilä wrote:
>
> On 7/4/25 02:35, Balbir Singh wrote:
>>
>> + if (folio_is_device_private(origin_folio) &&
>> + origin_folio->pgmap->ops->folio_split)
>> + origin_folio->pgmap->ops->folio_split(
>> + origin_folio, release);
>
> Should folio split fail if pgmap->ops->folio_split() is not defined? If not then at least the ->pgmap pointer copy should be in the common code.
>
We could change the code to check that the ->folio_split callback is not NULL when MIGRATE_VMA_SELECT_COMPOUND is used and cause migrate_vma to fail otherwise.
Providing a default implementation might surprise drivers that do not expect the split.
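A sketch of such a check for device private folios (where exactly it lives, and
whether it fails outright or just clears MIGRATE_PFN_COMPOUND, is open):

	/* refuse compound selection when the folio's pgmap cannot split it */
	if ((migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
	    (!folio->pgmap->ops || !folio->pgmap->ops->folio_split))
		return -EINVAL;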
Balbir Singh
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 08/12] mm/thp: add split during migration support
2025-07-06 1:15 ` Balbir Singh
@ 2025-07-06 1:34 ` Zi Yan
2025-07-06 1:47 ` Balbir Singh
0 siblings, 1 reply; 99+ messages in thread
From: Zi Yan @ 2025-07-06 1:34 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-mm, akpm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
On 5 Jul 2025, at 21:15, Balbir Singh wrote:
> On 7/5/25 11:55, Zi Yan wrote:
>> On 4 Jul 2025, at 20:58, Balbir Singh wrote:
>>
>>> On 7/4/25 21:24, Zi Yan wrote:
>>>>
>>>> s/pages/folio
>>>>
>>>
>>> Thanks, will make the changes
>>>
>>>> Why name it isolated if the folio is unmapped? Isolated folios often mean
>>>> they are removed from LRU lists. isolated here causes confusion.
>>>>
>>>
>>> Ack, will change the name
>>>
>>>
>>>>> *
>>>>> * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>>>>> * It is in charge of checking whether the split is supported or not and
>>>>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>>> */
>>>>> static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>> struct page *split_at, struct page *lock_at,
>>>>> - struct list_head *list, bool uniform_split)
>>>>> + struct list_head *list, bool uniform_split, bool isolated)
>>>>> {
>>>>> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>>> XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>>>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>> * is taken to serialise against parallel split or collapse
>>>>> * operations.
>>>>> */
>>>>> - anon_vma = folio_get_anon_vma(folio);
>>>>> - if (!anon_vma) {
>>>>> - ret = -EBUSY;
>>>>> - goto out;
>>>>> + if (!isolated) {
>>>>> + anon_vma = folio_get_anon_vma(folio);
>>>>> + if (!anon_vma) {
>>>>> + ret = -EBUSY;
>>>>> + goto out;
>>>>> + }
>>>>> + anon_vma_lock_write(anon_vma);
>>>>> }
>>>>> end = -1;
>>>>> mapping = NULL;
>>>>> - anon_vma_lock_write(anon_vma);
>>>>> } else {
>>>>> unsigned int min_order;
>>>>> gfp_t gfp;
>>>>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>> goto out_unlock;
>>>>> }
>>>>>
>>>>> - unmap_folio(folio);
>>>>> + if (!isolated)
>>>>> + unmap_folio(folio);
>>>>>
>>>>> /* block interrupt reentry in xa_lock and spinlock */
>>>>> local_irq_disable();
>>>>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>
>>>>> ret = __split_unmapped_folio(folio, new_order,
>>>>> split_at, lock_at, list, end, &xas, mapping,
>>>>> - uniform_split);
>>>>> + uniform_split, isolated);
>>>>> } else {
>>>>> spin_unlock(&ds_queue->split_queue_lock);
>>>>> fail:
>>>>> if (mapping)
>>>>> xas_unlock(&xas);
>>>>> local_irq_enable();
>>>>> - remap_page(folio, folio_nr_pages(folio), 0);
>>>>> + if (!isolated)
>>>>> + remap_page(folio, folio_nr_pages(folio), 0);
>>>>> ret = -EAGAIN;
>>>>> }
>>>>
>>>> These "isolated" special handlings does not look good, I wonder if there
>>>> is a way of letting split code handle device private folios more gracefully.
>>>> It also causes confusions, since why does "isolated/unmapped" folios
>>>> not need to unmap_page(), remap_page(), or unlock?
>>>>
>>>>
>>>
>>> There are two reasons for going down the current code path
>>
>> After thinking more, I think adding isolated/unmapped is not the right
>> way, since unmapped folio is a very generic concept. If you add it,
>> one can easily misuse the folio split code by first unmapping a folio
>> and trying to split it with unmapped = true. I do not think that is
>> supported and your patch does not prevent that from happening in the future.
>>
>
> I don't understand the misuse case you mention, I assume you mean someone can
> get the usage wrong? The responsibility is on the caller to do the right thing
> if calling the API with unmapped
Before your patch, there was no use case for splitting unmapped folios.
Your patch only adds support for device private page split, not for any unmapped
folio split. So using a generic isolated/unmapped parameter is not OK.
>
>> You should teach different parts of folio split code path to handle
>> device private folios properly. Details are below.
>>
>>>
>>> 1. if the isolated check is not present, folio_get_anon_vma will fail and cause
>>> the split routine to return with -EBUSY
>>
>> You do something below instead.
>>
>> if (!anon_vma && !folio_is_device_private(folio)) {
>> ret = -EBUSY;
>> goto out;
>> } else if (anon_vma) {
>> anon_vma_lock_write(anon_vma);
>> }
>>
>
> folio_get_anon() cannot be called for unmapped folios. In our case the page has
> already been unmapped. Is there a reason why you mix anon_vma_lock_write with
> the check for device private folios?
Oh, I did not notice that anon_vma = folio_get_anon_vma(folio) is also
in the if (!isolated) branch. In that case, just do:

	if (folio_is_device_private(folio)) {
		...
	} else if (is_anon) {
		...
	} else {
		...
	}
>
>> People can know device private folio split needs a special handling.
>>
>> BTW, why a device private folio can also be anonymous? Does it mean
>> if a page cache folio is migrated to device private, kernel also
>> sees it as both device private and file-backed?
>>
>
> FYI: device private folios only work with anonymous private pages, hence
> the name device private.
OK.
>
>>
>>> 2. Going through unmap_page(), remap_page() causes a full page table walk, which
>>> the migrate_device API has already just done as a part of the migration. The
>>> entries under consideration are already migration entries in this case.
>>> This is wasteful and in some case unexpected.
>>
>> unmap_folio() already adds TTU_SPLIT_HUGE_PMD to try to split
>> PMD mapping, which you did in migrate_vma_split_pages(). You probably
>> can teach either try_to_migrate() or try_to_unmap() to just split
>> device private PMD mapping. Or if that is not preferred,
>> you can simply call split_huge_pmd_address() when unmap_folio()
>> sees a device private folio.
>>
>> For remap_page(), you can simply return for device private folios
>> like it is currently doing for non anonymous folios.
>>
>
> Doing a full rmap walk does not make sense with unmap_folio() and
> remap_folio(), because
>
> 1. We need to do a page table walk/rmap walk again
> 2. We'll need special handling of migration <-> migration entries
> in the rmap handling (set/remove migration ptes)
> 3. In this context, the code is already in the middle of migration,
> so trying to do that again does not make sense.
Why do the split in the middle of migration? Existing split code
assumes to-be-split folios are mapped.
What prevents doing the split before migration?
>
>
>>
>> For lru_add_split_folio(), you can skip it if a device private
>> folio is seen.
>>
>> Last, for unlock part, why do you need to keep all after-split folios
>> locked? It should be possible to just keep the to-be-migrated folio
>> locked and unlock the rest for a later retry. But I could miss something
>> since I am not familiar with device private migration code.
>>
>
> Not sure I follow this comment
Because the patch is doing the split in the middle of migration, which the existing
split code never supports. My comment is based on the assumption that
the split is done when a folio is mapped.
--
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 08/12] mm/thp: add split during migration support
2025-07-06 1:34 ` Zi Yan
@ 2025-07-06 1:47 ` Balbir Singh
2025-07-06 2:34 ` Zi Yan
2025-07-16 5:34 ` Matthew Brost
0 siblings, 2 replies; 99+ messages in thread
From: Balbir Singh @ 2025-07-06 1:47 UTC (permalink / raw)
To: Zi Yan
Cc: linux-mm, akpm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
On 7/6/25 11:34, Zi Yan wrote:
> On 5 Jul 2025, at 21:15, Balbir Singh wrote:
>
>> On 7/5/25 11:55, Zi Yan wrote:
>>> On 4 Jul 2025, at 20:58, Balbir Singh wrote:
>>>
>>>> On 7/4/25 21:24, Zi Yan wrote:
>>>>>
>>>>> s/pages/folio
>>>>>
>>>>
>>>> Thanks, will make the changes
>>>>
>>>>> Why name it isolated if the folio is unmapped? Isolated folios often mean
>>>>> they are removed from LRU lists. isolated here causes confusion.
>>>>>
>>>>
>>>> Ack, will change the name
>>>>
>>>>
>>>>>> *
>>>>>> * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>>>>>> * It is in charge of checking whether the split is supported or not and
>>>>>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>>>> */
>>>>>> static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>> struct page *split_at, struct page *lock_at,
>>>>>> - struct list_head *list, bool uniform_split)
>>>>>> + struct list_head *list, bool uniform_split, bool isolated)
>>>>>> {
>>>>>> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>>>> XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>>>>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>> * is taken to serialise against parallel split or collapse
>>>>>> * operations.
>>>>>> */
>>>>>> - anon_vma = folio_get_anon_vma(folio);
>>>>>> - if (!anon_vma) {
>>>>>> - ret = -EBUSY;
>>>>>> - goto out;
>>>>>> + if (!isolated) {
>>>>>> + anon_vma = folio_get_anon_vma(folio);
>>>>>> + if (!anon_vma) {
>>>>>> + ret = -EBUSY;
>>>>>> + goto out;
>>>>>> + }
>>>>>> + anon_vma_lock_write(anon_vma);
>>>>>> }
>>>>>> end = -1;
>>>>>> mapping = NULL;
>>>>>> - anon_vma_lock_write(anon_vma);
>>>>>> } else {
>>>>>> unsigned int min_order;
>>>>>> gfp_t gfp;
>>>>>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>> goto out_unlock;
>>>>>> }
>>>>>>
>>>>>> - unmap_folio(folio);
>>>>>> + if (!isolated)
>>>>>> + unmap_folio(folio);
>>>>>>
>>>>>> /* block interrupt reentry in xa_lock and spinlock */
>>>>>> local_irq_disable();
>>>>>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>
>>>>>> ret = __split_unmapped_folio(folio, new_order,
>>>>>> split_at, lock_at, list, end, &xas, mapping,
>>>>>> - uniform_split);
>>>>>> + uniform_split, isolated);
>>>>>> } else {
>>>>>> spin_unlock(&ds_queue->split_queue_lock);
>>>>>> fail:
>>>>>> if (mapping)
>>>>>> xas_unlock(&xas);
>>>>>> local_irq_enable();
>>>>>> - remap_page(folio, folio_nr_pages(folio), 0);
>>>>>> + if (!isolated)
>>>>>> + remap_page(folio, folio_nr_pages(folio), 0);
>>>>>> ret = -EAGAIN;
>>>>>> }
>>>>>
>>>>> These "isolated" special handlings does not look good, I wonder if there
>>>>> is a way of letting split code handle device private folios more gracefully.
>>>>> It also causes confusions, since why does "isolated/unmapped" folios
>>>>> not need to unmap_page(), remap_page(), or unlock?
>>>>>
>>>>>
>>>>
>>>> There are two reasons for going down the current code path
>>>
>>> After thinking more, I think adding isolated/unmapped is not the right
>>> way, since unmapped folio is a very generic concept. If you add it,
>>> one can easily misuse the folio split code by first unmapping a folio
>>> and trying to split it with unmapped = true. I do not think that is
>>> supported and your patch does not prevent that from happening in the future.
>>>
>>
>> I don't understand the misuse case you mention, I assume you mean someone can
>> get the usage wrong? The responsibility is on the caller to do the right thing
>> if calling the API with unmapped
>
> Before your patch, there is no use case of splitting unmapped folios.
> Your patch only adds support for device private page split, not any unmapped
> folio split. So using a generic isolated/unmapped parameter is not OK.
>
There is a use for splitting unmapped folios (see below)
>>
>>> You should teach different parts of folio split code path to handle
>>> device private folios properly. Details are below.
>>>
>>>>
>>>> 1. if the isolated check is not present, folio_get_anon_vma will fail and cause
>>>> the split routine to return with -EBUSY
>>>
>>> You do something below instead.
>>>
>>> if (!anon_vma && !folio_is_device_private(folio)) {
>>> ret = -EBUSY;
>>> goto out;
>>> } else if (anon_vma) {
>>> anon_vma_lock_write(anon_vma);
>>> }
>>>
>>
>> folio_get_anon() cannot be called for unmapped folios. In our case the page has
>> already been unmapped. Is there a reason why you mix anon_vma_lock_write with
>> the check for device private folios?
>
> Oh, I did not notice that anon_vma = folio_get_anon_vma(folio) is also
> in if (!isolated) branch. In that case, just do
>
> if (folio_is_device_private(folio) {
> ...
> } else if (is_anon) {
> ...
> } else {
> ...
> }
>
>>
>>> People can know device private folio split needs a special handling.
>>>
>>> BTW, why a device private folio can also be anonymous? Does it mean
>>> if a page cache folio is migrated to device private, kernel also
>>> sees it as both device private and file-backed?
>>>
>>
>> FYI: device private folios only work with anonymous private pages, hence
>> the name device private.
>
> OK.
>
>>
>>>
>>>> 2. Going through unmap_page(), remap_page() causes a full page table walk, which
>>>> the migrate_device API has already just done as a part of the migration. The
>>>> entries under consideration are already migration entries in this case.
>>>> This is wasteful and in some case unexpected.
>>>
>>> unmap_folio() already adds TTU_SPLIT_HUGE_PMD to try to split
>>> PMD mapping, which you did in migrate_vma_split_pages(). You probably
>>> can teach either try_to_migrate() or try_to_unmap() to just split
>>> device private PMD mapping. Or if that is not preferred,
>>> you can simply call split_huge_pmd_address() when unmap_folio()
>>> sees a device private folio.
>>>
>>> For remap_page(), you can simply return for device private folios
>>> like it is currently doing for non anonymous folios.
>>>
>>
>> Doing a full rmap walk does not make sense with unmap_folio() and
>> remap_folio(), because
>>
>> 1. We need to do a page table walk/rmap walk again
>> 2. We'll need special handling of migration <-> migration entries
>> in the rmap handling (set/remove migration ptes)
>> 3. In this context, the code is already in the middle of migration,
>> so trying to do that again does not make sense.
>
> Why doing split in the middle of migration? Existing split code
> assumes to-be-split folios are mapped.
>
> What prevents doing split before migration?
>
The code does do a split prior to migration if THP selection fails.
Please see https://lore.kernel.org/lkml/20250703233511.2028395-5-balbirs@nvidia.com/
and the fallback path, which calls split_folio().
But the case under consideration is special, since the device needs to allocate the
corresponding pfns as well. The changelog mentions it:
"The common case that arises is that after setup, during migrate
the destination might not be able to allocate MIGRATE_PFN_COMPOUND
pages."
I can expand on it, because migrate_vma() is a multi-phase operation:
1. migrate_vma_setup()
2. migrate_vma_pages()
3. migrate_vma_finalize()
It can happen that, when the destination pfns are being allocated, the destination
is not able to allocate a large page, so we do the split in migrate_vma_pages().
The pages have been unmapped and collected in migrate_vma_setup().
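To illustrate the flow being described, a driver-side sketch (not code from the
series; try_alloc_device_thp() and alloc_small_dst_pages() are made-up helpers,
and the flag names are as used in this series):

	struct migrate_vma args = {
		.vma   = vma,
		.start = addr,
		.end   = addr + HPAGE_PMD_SIZE,
		.src   = src_pfns,
		.dst   = dst_pfns,
		.flags = MIGRATE_VMA_SELECT_SYSTEM |
			 MIGRATE_VMA_SELECT_COMPOUND,
	};

	if (migrate_vma_setup(&args))		/* 1. unmap and collect src pages */
		return -EBUSY;

	if (args.src[0] & MIGRATE_PFN_COMPOUND) {
		struct page *dpage = try_alloc_device_thp();	/* made up */

		if (dpage)
			args.dst[0] = migrate_pfn(page_to_pfn(dpage)) |
				      MIGRATE_PFN_COMPOUND;
		else
			/* no large page available on the device: hand back
			 * order-0 dst pages, the src folio then gets split
			 * inside migrate_vma_pages() */
			alloc_small_dst_pages(&args);		/* made up */
	}

	/* driver copy of the data elided */
	migrate_vma_pages(&args);		/* 2. split happens here if needed */
	migrate_vma_finalize(&args);		/* 3. */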
The next patch in the series 9/12 (https://lore.kernel.org/lkml/20250703233511.2028395-10-balbirs@nvidia.com/)
tests the split and emulates a failure on the device side to allocate large pages
and tests it in 10/12 (https://lore.kernel.org/lkml/20250703233511.2028395-11-balbirs@nvidia.com/)
>>
>>
>>>
>>> For lru_add_split_folio(), you can skip it if a device private
>>> folio is seen.
>>>
>>> Last, for unlock part, why do you need to keep all after-split folios
>>> locked? It should be possible to just keep the to-be-migrated folio
>>> locked and unlock the rest for a later retry. But I could miss something
>>> since I am not familiar with device private migration code.
>>>
>>
>> Not sure I follow this comment
>
> Because the patch is doing split in the middle of migration and existing
> split code never supports. My comment is based on the assumption that
> the split is done when a folio is mapped.
>
Understood, hopefully I've explained the reason for the split in the middle
of migration
Thanks for the detailed review
Balbir
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 08/12] mm/thp: add split during migration support
2025-07-06 1:47 ` Balbir Singh
@ 2025-07-06 2:34 ` Zi Yan
2025-07-06 3:03 ` Zi Yan
2025-07-16 5:34 ` Matthew Brost
1 sibling, 1 reply; 99+ messages in thread
From: Zi Yan @ 2025-07-06 2:34 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-mm, akpm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
On 5 Jul 2025, at 21:47, Balbir Singh wrote:
> On 7/6/25 11:34, Zi Yan wrote:
>> On 5 Jul 2025, at 21:15, Balbir Singh wrote:
>>
>>> On 7/5/25 11:55, Zi Yan wrote:
>>>> On 4 Jul 2025, at 20:58, Balbir Singh wrote:
>>>>
>>>>> On 7/4/25 21:24, Zi Yan wrote:
>>>>>>
>>>>>> s/pages/folio
>>>>>>
>>>>>
>>>>> Thanks, will make the changes
>>>>>
>>>>>> Why name it isolated if the folio is unmapped? Isolated folios often mean
>>>>>> they are removed from LRU lists. isolated here causes confusion.
>>>>>>
>>>>>
>>>>> Ack, will change the name
>>>>>
>>>>>
>>>>>>> *
>>>>>>> * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>>>>>>> * It is in charge of checking whether the split is supported or not and
>>>>>>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>>>>> */
>>>>>>> static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>> struct page *split_at, struct page *lock_at,
>>>>>>> - struct list_head *list, bool uniform_split)
>>>>>>> + struct list_head *list, bool uniform_split, bool isolated)
>>>>>>> {
>>>>>>> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>>>>> XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>>>>>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>> * is taken to serialise against parallel split or collapse
>>>>>>> * operations.
>>>>>>> */
>>>>>>> - anon_vma = folio_get_anon_vma(folio);
>>>>>>> - if (!anon_vma) {
>>>>>>> - ret = -EBUSY;
>>>>>>> - goto out;
>>>>>>> + if (!isolated) {
>>>>>>> + anon_vma = folio_get_anon_vma(folio);
>>>>>>> + if (!anon_vma) {
>>>>>>> + ret = -EBUSY;
>>>>>>> + goto out;
>>>>>>> + }
>>>>>>> + anon_vma_lock_write(anon_vma);
>>>>>>> }
>>>>>>> end = -1;
>>>>>>> mapping = NULL;
>>>>>>> - anon_vma_lock_write(anon_vma);
>>>>>>> } else {
>>>>>>> unsigned int min_order;
>>>>>>> gfp_t gfp;
>>>>>>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>> goto out_unlock;
>>>>>>> }
>>>>>>>
>>>>>>> - unmap_folio(folio);
>>>>>>> + if (!isolated)
>>>>>>> + unmap_folio(folio);
>>>>>>>
>>>>>>> /* block interrupt reentry in xa_lock and spinlock */
>>>>>>> local_irq_disable();
>>>>>>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>
>>>>>>> ret = __split_unmapped_folio(folio, new_order,
>>>>>>> split_at, lock_at, list, end, &xas, mapping,
>>>>>>> - uniform_split);
>>>>>>> + uniform_split, isolated);
>>>>>>> } else {
>>>>>>> spin_unlock(&ds_queue->split_queue_lock);
>>>>>>> fail:
>>>>>>> if (mapping)
>>>>>>> xas_unlock(&xas);
>>>>>>> local_irq_enable();
>>>>>>> - remap_page(folio, folio_nr_pages(folio), 0);
>>>>>>> + if (!isolated)
>>>>>>> + remap_page(folio, folio_nr_pages(folio), 0);
>>>>>>> ret = -EAGAIN;
>>>>>>> }
>>>>>>
>>>>>> These "isolated" special handlings does not look good, I wonder if there
>>>>>> is a way of letting split code handle device private folios more gracefully.
>>>>>> It also causes confusions, since why does "isolated/unmapped" folios
>>>>>> not need to unmap_page(), remap_page(), or unlock?
>>>>>>
>>>>>>
>>>>>
>>>>> There are two reasons for going down the current code path
>>>>
>>>> After thinking more, I think adding isolated/unmapped is not the right
>>>> way, since unmapped folio is a very generic concept. If you add it,
>>>> one can easily misuse the folio split code by first unmapping a folio
>>>> and trying to split it with unmapped = true. I do not think that is
>>>> supported and your patch does not prevent that from happening in the future.
>>>>
>>>
>>> I don't understand the misuse case you mention, I assume you mean someone can
>>> get the usage wrong? The responsibility is on the caller to do the right thing
>>> if calling the API with unmapped
>>
>> Before your patch, there is no use case of splitting unmapped folios.
>> Your patch only adds support for device private page split, not any unmapped
>> folio split. So using a generic isolated/unmapped parameter is not OK.
>>
>
> There is a use for splitting unmapped folios (see below)
>
>>>
>>>> You should teach different parts of folio split code path to handle
>>>> device private folios properly. Details are below.
>>>>
>>>>>
>>>>> 1. if the isolated check is not present, folio_get_anon_vma will fail and cause
>>>>> the split routine to return with -EBUSY
>>>>
>>>> You do something below instead.
>>>>
>>>> if (!anon_vma && !folio_is_device_private(folio)) {
>>>> ret = -EBUSY;
>>>> goto out;
>>>> } else if (anon_vma) {
>>>> anon_vma_lock_write(anon_vma);
>>>> }
>>>>
>>>
>>> folio_get_anon() cannot be called for unmapped folios. In our case the page has
>>> already been unmapped. Is there a reason why you mix anon_vma_lock_write with
>>> the check for device private folios?
>>
>> Oh, I did not notice that anon_vma = folio_get_anon_vma(folio) is also
>> in if (!isolated) branch. In that case, just do
>>
>> if (folio_is_device_private(folio) {
>> ...
>> } else if (is_anon) {
>> ...
>> } else {
>> ...
>> }
>>
>>>
>>>> People can know device private folio split needs a special handling.
>>>>
>>>> BTW, why a device private folio can also be anonymous? Does it mean
>>>> if a page cache folio is migrated to device private, kernel also
>>>> sees it as both device private and file-backed?
>>>>
>>>
>>> FYI: device private folios only work with anonymous private pages, hence
>>> the name device private.
>>
>> OK.
>>
>>>
>>>>
>>>>> 2. Going through unmap_page(), remap_page() causes a full page table walk, which
>>>>> the migrate_device API has already just done as a part of the migration. The
>>>>> entries under consideration are already migration entries in this case.
>>>>> This is wasteful and in some case unexpected.
>>>>
>>>> unmap_folio() already adds TTU_SPLIT_HUGE_PMD to try to split
>>>> PMD mapping, which you did in migrate_vma_split_pages(). You probably
>>>> can teach either try_to_migrate() or try_to_unmap() to just split
>>>> device private PMD mapping. Or if that is not preferred,
>>>> you can simply call split_huge_pmd_address() when unmap_folio()
>>>> sees a device private folio.
>>>>
>>>> For remap_page(), you can simply return for device private folios
>>>> like it is currently doing for non anonymous folios.
>>>>
>>>
>>> Doing a full rmap walk does not make sense with unmap_folio() and
>>> remap_folio(), because
>>>
>>> 1. We need to do a page table walk/rmap walk again
>>> 2. We'll need special handling of migration <-> migration entries
>>> in the rmap handling (set/remove migration ptes)
>>> 3. In this context, the code is already in the middle of migration,
>>> so trying to do that again does not make sense.
>>
>> Why doing split in the middle of migration? Existing split code
>> assumes to-be-split folios are mapped.
>>
>> What prevents doing split before migration?
>>
>
> The code does do a split prior to migration if THP selection fails
>
> Please see https://lore.kernel.org/lkml/20250703233511.2028395-5-balbirs@nvidia.com/
> and the fallback part which calls split_folio()
So this split is done when the folio in system memory is mapped.
>
> But the case under consideration is special since the device needs to allocate
> corresponding pfn's as well. The changelog mentions it:
>
> "The common case that arises is that after setup, during migrate
> the destination might not be able to allocate MIGRATE_PFN_COMPOUND
> pages."
>
> I can expand on it, because migrate_vma() is a multi-phase operation
>
> 1. migrate_vma_setup()
> 2. migrate_vma_pages()
> 3. migrate_vma_finalize()
>
> It can so happen that when we get the destination pfn's allocated the destination
> might not be able to allocate a large page, so we do the split in migrate_vma_pages().
>
> The pages have been unmapped and collected in migrate_vma_setup()
So these unmapped folios are system memory folios? I thought they were
large device private folios.
OK. It sounds like splitting unmapped folios is really needed. I think
it is better to make a new split_unmapped_folio() function
by reusing __split_unmapped_folio(), since __folio_split() assumes
the input folio is mapped.
>
> The next patch in the series 9/12 (https://lore.kernel.org/lkml/20250703233511.2028395-10-balbirs@nvidia.com/)
> tests the split and emulates a failure on the device side to allocate large pages
> and tests it in 10/12 (https://lore.kernel.org/lkml/20250703233511.2028395-11-balbirs@nvidia.com/)
>
>
>>>
>>>
>>>>
>>>> For lru_add_split_folio(), you can skip it if a device private
>>>> folio is seen.
>>>>
>>>> Last, for unlock part, why do you need to keep all after-split folios
>>>> locked? It should be possible to just keep the to-be-migrated folio
>>>> locked and unlock the rest for a later retry. But I could miss something
>>>> since I am not familiar with device private migration code.
>>>>
>>>
>>> Not sure I follow this comment
>>
>> Because the patch is doing split in the middle of migration and existing
>> split code never supports. My comment is based on the assumption that
>> the split is done when a folio is mapped.
>>
>
> Understood, hopefully I've explained the reason for the split in the middle
> of migration
--
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 08/12] mm/thp: add split during migration support
2025-07-06 2:34 ` Zi Yan
@ 2025-07-06 3:03 ` Zi Yan
2025-07-07 2:29 ` Balbir Singh
0 siblings, 1 reply; 99+ messages in thread
From: Zi Yan @ 2025-07-06 3:03 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-mm, akpm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
On 5 Jul 2025, at 22:34, Zi Yan wrote:
> On 5 Jul 2025, at 21:47, Balbir Singh wrote:
>
>> On 7/6/25 11:34, Zi Yan wrote:
>>> On 5 Jul 2025, at 21:15, Balbir Singh wrote:
>>>
>>>> On 7/5/25 11:55, Zi Yan wrote:
>>>>> On 4 Jul 2025, at 20:58, Balbir Singh wrote:
>>>>>
>>>>>> On 7/4/25 21:24, Zi Yan wrote:
>>>>>>>
>>>>>>> s/pages/folio
>>>>>>>
>>>>>>
>>>>>> Thanks, will make the changes
>>>>>>
>>>>>>> Why name it isolated if the folio is unmapped? Isolated folios often mean
>>>>>>> they are removed from LRU lists. isolated here causes confusion.
>>>>>>>
>>>>>>
>>>>>> Ack, will change the name
>>>>>>
>>>>>>
>>>>>>>> *
>>>>>>>> * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>>>>>>>> * It is in charge of checking whether the split is supported or not and
>>>>>>>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>>>>>> */
>>>>>>>> static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>> struct page *split_at, struct page *lock_at,
>>>>>>>> - struct list_head *list, bool uniform_split)
>>>>>>>> + struct list_head *list, bool uniform_split, bool isolated)
>>>>>>>> {
>>>>>>>> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>>>>>> XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>>>>>>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>> * is taken to serialise against parallel split or collapse
>>>>>>>> * operations.
>>>>>>>> */
>>>>>>>> - anon_vma = folio_get_anon_vma(folio);
>>>>>>>> - if (!anon_vma) {
>>>>>>>> - ret = -EBUSY;
>>>>>>>> - goto out;
>>>>>>>> + if (!isolated) {
>>>>>>>> + anon_vma = folio_get_anon_vma(folio);
>>>>>>>> + if (!anon_vma) {
>>>>>>>> + ret = -EBUSY;
>>>>>>>> + goto out;
>>>>>>>> + }
>>>>>>>> + anon_vma_lock_write(anon_vma);
>>>>>>>> }
>>>>>>>> end = -1;
>>>>>>>> mapping = NULL;
>>>>>>>> - anon_vma_lock_write(anon_vma);
>>>>>>>> } else {
>>>>>>>> unsigned int min_order;
>>>>>>>> gfp_t gfp;
>>>>>>>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>> goto out_unlock;
>>>>>>>> }
>>>>>>>>
>>>>>>>> - unmap_folio(folio);
>>>>>>>> + if (!isolated)
>>>>>>>> + unmap_folio(folio);
>>>>>>>>
>>>>>>>> /* block interrupt reentry in xa_lock and spinlock */
>>>>>>>> local_irq_disable();
>>>>>>>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>
>>>>>>>> ret = __split_unmapped_folio(folio, new_order,
>>>>>>>> split_at, lock_at, list, end, &xas, mapping,
>>>>>>>> - uniform_split);
>>>>>>>> + uniform_split, isolated);
>>>>>>>> } else {
>>>>>>>> spin_unlock(&ds_queue->split_queue_lock);
>>>>>>>> fail:
>>>>>>>> if (mapping)
>>>>>>>> xas_unlock(&xas);
>>>>>>>> local_irq_enable();
>>>>>>>> - remap_page(folio, folio_nr_pages(folio), 0);
>>>>>>>> + if (!isolated)
>>>>>>>> + remap_page(folio, folio_nr_pages(folio), 0);
>>>>>>>> ret = -EAGAIN;
>>>>>>>> }
>>>>>>>
>>>>>>> These "isolated" special handlings does not look good, I wonder if there
>>>>>>> is a way of letting split code handle device private folios more gracefully.
>>>>>>> It also causes confusions, since why does "isolated/unmapped" folios
>>>>>>> not need to unmap_page(), remap_page(), or unlock?
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> There are two reasons for going down the current code path
>>>>>
>>>>> After thinking more, I think adding isolated/unmapped is not the right
>>>>> way, since unmapped folio is a very generic concept. If you add it,
>>>>> one can easily misuse the folio split code by first unmapping a folio
>>>>> and trying to split it with unmapped = true. I do not think that is
>>>>> supported and your patch does not prevent that from happening in the future.
>>>>>
>>>>
>>>> I don't understand the misuse case you mention, I assume you mean someone can
>>>> get the usage wrong? The responsibility is on the caller to do the right thing
>>>> if calling the API with unmapped
>>>
>>> Before your patch, there is no use case of splitting unmapped folios.
>>> Your patch only adds support for device private page split, not any unmapped
>>> folio split. So using a generic isolated/unmapped parameter is not OK.
>>>
>>
>> There is a use for splitting unmapped folios (see below)
>>
>>>>
>>>>> You should teach different parts of folio split code path to handle
>>>>> device private folios properly. Details are below.
>>>>>
>>>>>>
>>>>>> 1. if the isolated check is not present, folio_get_anon_vma will fail and cause
>>>>>> the split routine to return with -EBUSY
>>>>>
>>>>> You do something below instead.
>>>>>
>>>>> if (!anon_vma && !folio_is_device_private(folio)) {
>>>>> ret = -EBUSY;
>>>>> goto out;
>>>>> } else if (anon_vma) {
>>>>> anon_vma_lock_write(anon_vma);
>>>>> }
>>>>>
>>>>
>>>> folio_get_anon() cannot be called for unmapped folios. In our case the page has
>>>> already been unmapped. Is there a reason why you mix anon_vma_lock_write with
>>>> the check for device private folios?
>>>
>>> Oh, I did not notice that anon_vma = folio_get_anon_vma(folio) is also
>>> in if (!isolated) branch. In that case, just do
>>>
>>> if (folio_is_device_private(folio) {
>>> ...
>>> } else if (is_anon) {
>>> ...
>>> } else {
>>> ...
>>> }
>>>
>>>>
>>>>> People can know device private folio split needs a special handling.
>>>>>
>>>>> BTW, why a device private folio can also be anonymous? Does it mean
>>>>> if a page cache folio is migrated to device private, kernel also
>>>>> sees it as both device private and file-backed?
>>>>>
>>>>
>>>> FYI: device private folios only work with anonymous private pages, hence
>>>> the name device private.
>>>
>>> OK.
>>>
>>>>
>>>>>
>>>>>> 2. Going through unmap_page(), remap_page() causes a full page table walk, which
>>>>>> the migrate_device API has already just done as a part of the migration. The
>>>>>> entries under consideration are already migration entries in this case.
>>>>>> This is wasteful and in some case unexpected.
>>>>>
>>>>> unmap_folio() already adds TTU_SPLIT_HUGE_PMD to try to split
>>>>> PMD mapping, which you did in migrate_vma_split_pages(). You probably
>>>>> can teach either try_to_migrate() or try_to_unmap() to just split
>>>>> device private PMD mapping. Or if that is not preferred,
>>>>> you can simply call split_huge_pmd_address() when unmap_folio()
>>>>> sees a device private folio.
>>>>>
>>>>> For remap_page(), you can simply return for device private folios
>>>>> like it is currently doing for non anonymous folios.
>>>>>
>>>>
>>>> Doing a full rmap walk does not make sense with unmap_folio() and
>>>> remap_folio(), because
>>>>
>>>> 1. We need to do a page table walk/rmap walk again
>>>> 2. We'll need special handling of migration <-> migration entries
>>>> in the rmap handling (set/remove migration ptes)
>>>> 3. In this context, the code is already in the middle of migration,
>>>> so trying to do that again does not make sense.
>>>
>>> Why doing split in the middle of migration? Existing split code
>>> assumes to-be-split folios are mapped.
>>>
>>> What prevents doing split before migration?
>>>
>>
>> The code does do a split prior to migration if THP selection fails
>>
>> Please see https://lore.kernel.org/lkml/20250703233511.2028395-5-balbirs@nvidia.com/
>> and the fallback part which calls split_folio()
>
> So this split is done when the folio in system memory is mapped.
>
>>
>> But the case under consideration is special since the device needs to allocate
>> corresponding pfn's as well. The changelog mentions it:
>>
>> "The common case that arises is that after setup, during migrate
>> the destination might not be able to allocate MIGRATE_PFN_COMPOUND
>> pages."
>>
>> I can expand on it, because migrate_vma() is a multi-phase operation
>>
>> 1. migrate_vma_setup()
>> 2. migrate_vma_pages()
>> 3. migrate_vma_finalize()
>>
>> It can so happen that when we get the destination pfn's allocated the destination
>> might not be able to allocate a large page, so we do the split in migrate_vma_pages().
>>
>> The pages have been unmapped and collected in migrate_vma_setup()
>
> So these unmapped folios are system memory folios? I thought they are
> large device private folios.
>
> OK. It sounds like splitting unmapped folios is really needed. I think
> it is better to make a new split_unmapped_folio() function
> by reusing __split_unmapped_folio(), since __folio_split() assumes
> the input folio is mapped.
And to make __split_unmapped_folio()'s functionality match its name,
I will later refactor it. At least move local_irq_enable(), remap_page(),
and folio_unlocks out of it. I will think about how to deal with
lru_add_split_folio(). The goal is to remove the to-be-added "unmapped"
parameter from __split_unmapped_folio().
--
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 08/12] mm/thp: add split during migration support
2025-07-06 3:03 ` Zi Yan
@ 2025-07-07 2:29 ` Balbir Singh
2025-07-07 2:45 ` Zi Yan
0 siblings, 1 reply; 99+ messages in thread
From: Balbir Singh @ 2025-07-07 2:29 UTC (permalink / raw)
To: Zi Yan
Cc: linux-mm, akpm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
On 7/6/25 13:03, Zi Yan wrote:
> On 5 Jul 2025, at 22:34, Zi Yan wrote:
>
>> On 5 Jul 2025, at 21:47, Balbir Singh wrote:
>>
>>> On 7/6/25 11:34, Zi Yan wrote:
>>>> On 5 Jul 2025, at 21:15, Balbir Singh wrote:
>>>>
>>>>> On 7/5/25 11:55, Zi Yan wrote:
>>>>>> On 4 Jul 2025, at 20:58, Balbir Singh wrote:
>>>>>>
>>>>>>> On 7/4/25 21:24, Zi Yan wrote:
>>>>>>>>
>>>>>>>> s/pages/folio
>>>>>>>>
>>>>>>>
>>>>>>> Thanks, will make the changes
>>>>>>>
>>>>>>>> Why name it isolated if the folio is unmapped? Isolated folios often mean
>>>>>>>> they are removed from LRU lists. isolated here causes confusion.
>>>>>>>>
>>>>>>>
>>>>>>> Ack, will change the name
>>>>>>>
>>>>>>>
>>>>>>>>> *
>>>>>>>>> * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>>>>>>>>> * It is in charge of checking whether the split is supported or not and
>>>>>>>>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>>>>>>> */
>>>>>>>>> static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>> struct page *split_at, struct page *lock_at,
>>>>>>>>> - struct list_head *list, bool uniform_split)
>>>>>>>>> + struct list_head *list, bool uniform_split, bool isolated)
>>>>>>>>> {
>>>>>>>>> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>>>>>>> XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>>>>>>>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>> * is taken to serialise against parallel split or collapse
>>>>>>>>> * operations.
>>>>>>>>> */
>>>>>>>>> - anon_vma = folio_get_anon_vma(folio);
>>>>>>>>> - if (!anon_vma) {
>>>>>>>>> - ret = -EBUSY;
>>>>>>>>> - goto out;
>>>>>>>>> + if (!isolated) {
>>>>>>>>> + anon_vma = folio_get_anon_vma(folio);
>>>>>>>>> + if (!anon_vma) {
>>>>>>>>> + ret = -EBUSY;
>>>>>>>>> + goto out;
>>>>>>>>> + }
>>>>>>>>> + anon_vma_lock_write(anon_vma);
>>>>>>>>> }
>>>>>>>>> end = -1;
>>>>>>>>> mapping = NULL;
>>>>>>>>> - anon_vma_lock_write(anon_vma);
>>>>>>>>> } else {
>>>>>>>>> unsigned int min_order;
>>>>>>>>> gfp_t gfp;
>>>>>>>>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>> goto out_unlock;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> - unmap_folio(folio);
>>>>>>>>> + if (!isolated)
>>>>>>>>> + unmap_folio(folio);
>>>>>>>>>
>>>>>>>>> /* block interrupt reentry in xa_lock and spinlock */
>>>>>>>>> local_irq_disable();
>>>>>>>>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>
>>>>>>>>> ret = __split_unmapped_folio(folio, new_order,
>>>>>>>>> split_at, lock_at, list, end, &xas, mapping,
>>>>>>>>> - uniform_split);
>>>>>>>>> + uniform_split, isolated);
>>>>>>>>> } else {
>>>>>>>>> spin_unlock(&ds_queue->split_queue_lock);
>>>>>>>>> fail:
>>>>>>>>> if (mapping)
>>>>>>>>> xas_unlock(&xas);
>>>>>>>>> local_irq_enable();
>>>>>>>>> - remap_page(folio, folio_nr_pages(folio), 0);
>>>>>>>>> + if (!isolated)
>>>>>>>>> + remap_page(folio, folio_nr_pages(folio), 0);
>>>>>>>>> ret = -EAGAIN;
>>>>>>>>> }
>>>>>>>>
>>>>>>>> These "isolated" special handlings does not look good, I wonder if there
>>>>>>>> is a way of letting split code handle device private folios more gracefully.
>>>>>>>> It also causes confusions, since why does "isolated/unmapped" folios
>>>>>>>> not need to unmap_page(), remap_page(), or unlock?
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> There are two reasons for going down the current code path
>>>>>>
>>>>>> After thinking more, I think adding isolated/unmapped is not the right
>>>>>> way, since unmapped folio is a very generic concept. If you add it,
>>>>>> one can easily misuse the folio split code by first unmapping a folio
>>>>>> and trying to split it with unmapped = true. I do not think that is
>>>>>> supported and your patch does not prevent that from happening in the future.
>>>>>>
>>>>>
>>>>> I don't understand the misuse case you mention, I assume you mean someone can
>>>>> get the usage wrong? The responsibility is on the caller to do the right thing
>>>>> if calling the API with unmapped
>>>>
>>>> Before your patch, there is no use case of splitting unmapped folios.
>>>> Your patch only adds support for device private page split, not any unmapped
>>>> folio split. So using a generic isolated/unmapped parameter is not OK.
>>>>
>>>
>>> There is a use for splitting unmapped folios (see below)
>>>
>>>>>
>>>>>> You should teach different parts of folio split code path to handle
>>>>>> device private folios properly. Details are below.
>>>>>>
>>>>>>>
>>>>>>> 1. if the isolated check is not present, folio_get_anon_vma will fail and cause
>>>>>>> the split routine to return with -EBUSY
>>>>>>
>>>>>> You do something below instead.
>>>>>>
>>>>>> if (!anon_vma && !folio_is_device_private(folio)) {
>>>>>> ret = -EBUSY;
>>>>>> goto out;
>>>>>> } else if (anon_vma) {
>>>>>> anon_vma_lock_write(anon_vma);
>>>>>> }
>>>>>>
>>>>>
>>>>> folio_get_anon() cannot be called for unmapped folios. In our case the page has
>>>>> already been unmapped. Is there a reason why you mix anon_vma_lock_write with
>>>>> the check for device private folios?
>>>>
>>>> Oh, I did not notice that anon_vma = folio_get_anon_vma(folio) is also
>>>> in if (!isolated) branch. In that case, just do
>>>>
>>>> if (folio_is_device_private(folio) {
>>>> ...
>>>> } else if (is_anon) {
>>>> ...
>>>> } else {
>>>> ...
>>>> }
>>>>
>>>>>
>>>>>> People can know device private folio split needs a special handling.
>>>>>>
>>>>>> BTW, why a device private folio can also be anonymous? Does it mean
>>>>>> if a page cache folio is migrated to device private, kernel also
>>>>>> sees it as both device private and file-backed?
>>>>>>
>>>>>
>>>>> FYI: device private folios only work with anonymous private pages, hence
>>>>> the name device private.
>>>>
>>>> OK.
>>>>
>>>>>
>>>>>>
>>>>>>> 2. Going through unmap_page(), remap_page() causes a full page table walk, which
>>>>>>> the migrate_device API has already just done as a part of the migration. The
>>>>>>> entries under consideration are already migration entries in this case.
>>>>>>> This is wasteful and in some case unexpected.
>>>>>>
>>>>>> unmap_folio() already adds TTU_SPLIT_HUGE_PMD to try to split
>>>>>> PMD mapping, which you did in migrate_vma_split_pages(). You probably
>>>>>> can teach either try_to_migrate() or try_to_unmap() to just split
>>>>>> device private PMD mapping. Or if that is not preferred,
>>>>>> you can simply call split_huge_pmd_address() when unmap_folio()
>>>>>> sees a device private folio.
>>>>>>
>>>>>> For remap_page(), you can simply return for device private folios
>>>>>> like it is currently doing for non anonymous folios.
>>>>>>
>>>>>
>>>>> Doing a full rmap walk does not make sense with unmap_folio() and
>>>>> remap_folio(), because
>>>>>
>>>>> 1. We need to do a page table walk/rmap walk again
>>>>> 2. We'll need special handling of migration <-> migration entries
>>>>> in the rmap handling (set/remove migration ptes)
>>>>> 3. In this context, the code is already in the middle of migration,
>>>>> so trying to do that again does not make sense.
>>>>
>>>> Why doing split in the middle of migration? Existing split code
>>>> assumes to-be-split folios are mapped.
>>>>
>>>> What prevents doing split before migration?
>>>>
>>>
>>> The code does do a split prior to migration if THP selection fails
>>>
>>> Please see https://lore.kernel.org/lkml/20250703233511.2028395-5-balbirs@nvidia.com/
>>> and the fallback part which calls split_folio()
>>
>> So this split is done when the folio in system memory is mapped.
>>
>>>
>>> But the case under consideration is special since the device needs to allocate
>>> corresponding pfn's as well. The changelog mentions it:
>>>
>>> "The common case that arises is that after setup, during migrate
>>> the destination might not be able to allocate MIGRATE_PFN_COMPOUND
>>> pages."
>>>
>>> I can expand on it, because migrate_vma() is a multi-phase operation
>>>
>>> 1. migrate_vma_setup()
>>> 2. migrate_vma_pages()
>>> 3. migrate_vma_finalize()
>>>
>>> It can so happen that when we get the destination pfn's allocated the destination
>>> might not be able to allocate a large page, so we do the split in migrate_vma_pages().
>>>
>>> The pages have been unmapped and collected in migrate_vma_setup()
>>
>> So these unmapped folios are system memory folios? I thought they are
>> large device private folios.
>>
>> OK. It sounds like splitting unmapped folios is really needed. I think
>> it is better to make a new split_unmapped_folio() function
>> by reusing __split_unmapped_folio(), since __folio_split() assumes
>> the input folio is mapped.
>
> And to make __split_unmapped_folio()'s functionality match its name,
> I will later refactor it. At least move local_irq_enable(), remap_page(),
> and folio_unlocks out of it. I will think about how to deal with
> lru_add_split_folio(). The goal is to remove the to-be-added "unmapped"
> parameter from __split_unmapped_folio().
>
That sounds like a plan. It seems there needs to be a finish phase of
the split, and that phase does not belong in __split_unmapped_folio(). I would
propose that we rename "isolated" to "folio_is_migrating" and your cleanups can
then follow. Once your cleanups come in, we won't need to pass the parameter to
__split_unmapped_folio().
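(Illustration only, not a posted hunk: with that rename, the prototype in the
diff quoted above would read
static int __folio_split(struct folio *folio, unsigned int new_order,
		struct page *split_at, struct page *lock_at,
		struct list_head *list, bool uniform_split,
		bool folio_is_migrating);
and the corresponding "if (!isolated)" checks would become
"if (!folio_is_migrating)".)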
Balbir Singh
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 08/12] mm/thp: add split during migration support
2025-07-05 3:17 ` Mika Penttilä
@ 2025-07-07 2:35 ` Balbir Singh
2025-07-07 3:29 ` Mika Penttilä
0 siblings, 1 reply; 99+ messages in thread
From: Balbir Singh @ 2025-07-07 2:35 UTC (permalink / raw)
To: Mika Penttilä, linux-mm
Cc: akpm, linux-kernel, Karol Herbst, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
David Hildenbrand, Barry Song, Baolin Wang, Ryan Roberts,
Matthew Wilcox, Peter Xu, Zi Yan, Kefeng Wang, Jane Chu,
Alistair Popple, Donet Tom
On 7/5/25 13:17, Mika Penttilä wrote:
>
>>>>> +static void migrate_vma_split_pages(struct migrate_vma *migrate,
>>>>> + unsigned long idx, unsigned long addr,
>>>>> + struct folio *folio)
>>>>> +{
>>>>> + unsigned long i;
>>>>> + unsigned long pfn;
>>>>> + unsigned long flags;
>>>>> +
>>>>> + folio_get(folio);
>>>>> + split_huge_pmd_address(migrate->vma, addr, true);
>>>>> + __split_huge_page_to_list_to_order(folio_page(folio, 0), NULL, 0, true);
>>>> We already have reference to folio, why is folio_get() needed ?
>>>>
>>>> Splitting the page splits pmd for anon folios, why is there split_huge_pmd_address() ?
>>> Oh I see
>>> + if (!isolated)
>>> + unmap_folio(folio);
>>>
>>> which explains the explicit split_huge_pmd_address(migrate->vma, addr, true);
>>>
>>> Still, why the folio_get(folio);?
>>>
>>>
>> That is for split_huge_pmd_address, when called with freeze=true, it drops the
>> ref count on the page
>>
>> if (freeze)
>> put_page(page);
>>
>> Balbir
>>
> yeah I guess you could have used the pmd_migration path in __split_huge_pmd_locked, and not use freeze because you have installed the migration pmd entry already.
> Which brings to a bigger concern, that you do need the freeze semantics, like clear PageAnonExclusive (which may fail). I think you did not get this part
> right in the 3/12 patch. And in this patch, you can't assume the split succeeds, which would mean you can't migrate the range at all.
> Doing the split this late is quite problematic all in all.
>
Clearing PageAnonExclusive will *not* fail for device private pages, from what I can see in __folio_try_share_anon_rmap().
Doing the split late is a requirement due to the nature of the three stage migration operation: the other
side (the destination) might fail to allocate THP sized pages, so the code needs to deal with that.
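As a rough driver-side sketch of that three stage flow (illustration only; it
assumes the MIGRATE_VMA_SELECT_COMPOUND/MIGRATE_PFN_COMPOUND flags from this
series, "vma", "addr" and "drv_owner" are placeholder context, and device page
allocation and error handling are elided):
	unsigned long src[HPAGE_PMD_NR], dst[HPAGE_PMD_NR];
	struct migrate_vma args = {
		.vma		= vma,
		.start		= addr,
		.end		= addr + HPAGE_PMD_SIZE,
		.src		= src,
		.dst		= dst,
		.pgmap_owner	= drv_owner,
		.flags		= MIGRATE_VMA_SELECT_SYSTEM |
				  MIGRATE_VMA_SELECT_COMPOUND,
	};
	if (migrate_vma_setup(&args))		/* 1. unmap + collect source pages */
		return;
	/*
	 * 2. The driver allocates destination pages here.  If a THP sized
	 *    device page cannot be allocated, it falls back to order-0 pages
	 *    and leaves MIGRATE_PFN_COMPOUND out of dst[]; the source folio,
	 *    already unmapped in step 1, then has to be split inside
	 *    migrate_vma_pages() -- this is the "late" split.
	 */
	migrate_vma_pages(&args);
	migrate_vma_finalize(&args);		/* 3. remove migration entries */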
Balbir Singh
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 08/12] mm/thp: add split during migration support
2025-07-07 2:29 ` Balbir Singh
@ 2025-07-07 2:45 ` Zi Yan
2025-07-08 3:31 ` Balbir Singh
2025-07-08 7:43 ` Balbir Singh
0 siblings, 2 replies; 99+ messages in thread
From: Zi Yan @ 2025-07-07 2:45 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-mm, akpm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
On 6 Jul 2025, at 22:29, Balbir Singh wrote:
> On 7/6/25 13:03, Zi Yan wrote:
>> On 5 Jul 2025, at 22:34, Zi Yan wrote:
>>
>>> On 5 Jul 2025, at 21:47, Balbir Singh wrote:
>>>
>>>> On 7/6/25 11:34, Zi Yan wrote:
>>>>> On 5 Jul 2025, at 21:15, Balbir Singh wrote:
>>>>>
>>>>>> On 7/5/25 11:55, Zi Yan wrote:
>>>>>>> On 4 Jul 2025, at 20:58, Balbir Singh wrote:
>>>>>>>
>>>>>>>> On 7/4/25 21:24, Zi Yan wrote:
>>>>>>>>>
>>>>>>>>> s/pages/folio
>>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks, will make the changes
>>>>>>>>
>>>>>>>>> Why name it isolated if the folio is unmapped? Isolated folios often mean
>>>>>>>>> they are removed from LRU lists. isolated here causes confusion.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Ack, will change the name
>>>>>>>>
>>>>>>>>
>>>>>>>>>> *
>>>>>>>>>> * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>>>>>>>>>> * It is in charge of checking whether the split is supported or not and
>>>>>>>>>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>>>>>>>> */
>>>>>>>>>> static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>> struct page *split_at, struct page *lock_at,
>>>>>>>>>> - struct list_head *list, bool uniform_split)
>>>>>>>>>> + struct list_head *list, bool uniform_split, bool isolated)
>>>>>>>>>> {
>>>>>>>>>> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>>>>>>>> XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>>>>>>>>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>> * is taken to serialise against parallel split or collapse
>>>>>>>>>> * operations.
>>>>>>>>>> */
>>>>>>>>>> - anon_vma = folio_get_anon_vma(folio);
>>>>>>>>>> - if (!anon_vma) {
>>>>>>>>>> - ret = -EBUSY;
>>>>>>>>>> - goto out;
>>>>>>>>>> + if (!isolated) {
>>>>>>>>>> + anon_vma = folio_get_anon_vma(folio);
>>>>>>>>>> + if (!anon_vma) {
>>>>>>>>>> + ret = -EBUSY;
>>>>>>>>>> + goto out;
>>>>>>>>>> + }
>>>>>>>>>> + anon_vma_lock_write(anon_vma);
>>>>>>>>>> }
>>>>>>>>>> end = -1;
>>>>>>>>>> mapping = NULL;
>>>>>>>>>> - anon_vma_lock_write(anon_vma);
>>>>>>>>>> } else {
>>>>>>>>>> unsigned int min_order;
>>>>>>>>>> gfp_t gfp;
>>>>>>>>>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>> goto out_unlock;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> - unmap_folio(folio);
>>>>>>>>>> + if (!isolated)
>>>>>>>>>> + unmap_folio(folio);
>>>>>>>>>>
>>>>>>>>>> /* block interrupt reentry in xa_lock and spinlock */
>>>>>>>>>> local_irq_disable();
>>>>>>>>>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>>
>>>>>>>>>> ret = __split_unmapped_folio(folio, new_order,
>>>>>>>>>> split_at, lock_at, list, end, &xas, mapping,
>>>>>>>>>> - uniform_split);
>>>>>>>>>> + uniform_split, isolated);
>>>>>>>>>> } else {
>>>>>>>>>> spin_unlock(&ds_queue->split_queue_lock);
>>>>>>>>>> fail:
>>>>>>>>>> if (mapping)
>>>>>>>>>> xas_unlock(&xas);
>>>>>>>>>> local_irq_enable();
>>>>>>>>>> - remap_page(folio, folio_nr_pages(folio), 0);
>>>>>>>>>> + if (!isolated)
>>>>>>>>>> + remap_page(folio, folio_nr_pages(folio), 0);
>>>>>>>>>> ret = -EAGAIN;
>>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> These "isolated" special handlings does not look good, I wonder if there
>>>>>>>>> is a way of letting split code handle device private folios more gracefully.
>>>>>>>>> It also causes confusions, since why does "isolated/unmapped" folios
>>>>>>>>> not need to unmap_page(), remap_page(), or unlock?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> There are two reasons for going down the current code path
>>>>>>>
>>>>>>> After thinking more, I think adding isolated/unmapped is not the right
>>>>>>> way, since unmapped folio is a very generic concept. If you add it,
>>>>>>> one can easily misuse the folio split code by first unmapping a folio
>>>>>>> and trying to split it with unmapped = true. I do not think that is
>>>>>>> supported and your patch does not prevent that from happening in the future.
>>>>>>>
>>>>>>
>>>>>> I don't understand the misuse case you mention, I assume you mean someone can
>>>>>> get the usage wrong? The responsibility is on the caller to do the right thing
>>>>>> if calling the API with unmapped
>>>>>
>>>>> Before your patch, there is no use case of splitting unmapped folios.
>>>>> Your patch only adds support for device private page split, not any unmapped
>>>>> folio split. So using a generic isolated/unmapped parameter is not OK.
>>>>>
>>>>
>>>> There is a use for splitting unmapped folios (see below)
>>>>
>>>>>>
>>>>>>> You should teach different parts of folio split code path to handle
>>>>>>> device private folios properly. Details are below.
>>>>>>>
>>>>>>>>
>>>>>>>> 1. if the isolated check is not present, folio_get_anon_vma will fail and cause
>>>>>>>> the split routine to return with -EBUSY
>>>>>>>
>>>>>>> You do something below instead.
>>>>>>>
>>>>>>> if (!anon_vma && !folio_is_device_private(folio)) {
>>>>>>> ret = -EBUSY;
>>>>>>> goto out;
>>>>>>> } else if (anon_vma) {
>>>>>>> anon_vma_lock_write(anon_vma);
>>>>>>> }
>>>>>>>
>>>>>>
>>>>>> folio_get_anon() cannot be called for unmapped folios. In our case the page has
>>>>>> already been unmapped. Is there a reason why you mix anon_vma_lock_write with
>>>>>> the check for device private folios?
>>>>>
>>>>> Oh, I did not notice that anon_vma = folio_get_anon_vma(folio) is also
>>>>> in if (!isolated) branch. In that case, just do
>>>>>
>>>>> if (folio_is_device_private(folio) {
>>>>> ...
>>>>> } else if (is_anon) {
>>>>> ...
>>>>> } else {
>>>>> ...
>>>>> }
>>>>>
>>>>>>
>>>>>>> People can know device private folio split needs a special handling.
>>>>>>>
>>>>>>> BTW, why a device private folio can also be anonymous? Does it mean
>>>>>>> if a page cache folio is migrated to device private, kernel also
>>>>>>> sees it as both device private and file-backed?
>>>>>>>
>>>>>>
>>>>>> FYI: device private folios only work with anonymous private pages, hence
>>>>>> the name device private.
>>>>>
>>>>> OK.
>>>>>
>>>>>>
>>>>>>>
>>>>>>>> 2. Going through unmap_page(), remap_page() causes a full page table walk, which
>>>>>>>> the migrate_device API has already just done as a part of the migration. The
>>>>>>>> entries under consideration are already migration entries in this case.
>>>>>>>> This is wasteful and in some case unexpected.
>>>>>>>
>>>>>>> unmap_folio() already adds TTU_SPLIT_HUGE_PMD to try to split
>>>>>>> PMD mapping, which you did in migrate_vma_split_pages(). You probably
>>>>>>> can teach either try_to_migrate() or try_to_unmap() to just split
>>>>>>> device private PMD mapping. Or if that is not preferred,
>>>>>>> you can simply call split_huge_pmd_address() when unmap_folio()
>>>>>>> sees a device private folio.
>>>>>>>
>>>>>>> For remap_page(), you can simply return for device private folios
>>>>>>> like it is currently doing for non anonymous folios.
>>>>>>>
>>>>>>
>>>>>> Doing a full rmap walk does not make sense with unmap_folio() and
>>>>>> remap_folio(), because
>>>>>>
>>>>>> 1. We need to do a page table walk/rmap walk again
>>>>>> 2. We'll need special handling of migration <-> migration entries
>>>>>> in the rmap handling (set/remove migration ptes)
>>>>>> 3. In this context, the code is already in the middle of migration,
>>>>>> so trying to do that again does not make sense.
>>>>>
>>>>> Why doing split in the middle of migration? Existing split code
>>>>> assumes to-be-split folios are mapped.
>>>>>
>>>>> What prevents doing split before migration?
>>>>>
>>>>
>>>> The code does do a split prior to migration if THP selection fails
>>>>
>>>> Please see https://lore.kernel.org/lkml/20250703233511.2028395-5-balbirs@nvidia.com/
>>>> and the fallback part which calls split_folio()
>>>
>>> So this split is done when the folio in system memory is mapped.
>>>
>>>>
>>>> But the case under consideration is special since the device needs to allocate
>>>> corresponding pfn's as well. The changelog mentions it:
>>>>
>>>> "The common case that arises is that after setup, during migrate
>>>> the destination might not be able to allocate MIGRATE_PFN_COMPOUND
>>>> pages."
>>>>
>>>> I can expand on it, because migrate_vma() is a multi-phase operation
>>>>
>>>> 1. migrate_vma_setup()
>>>> 2. migrate_vma_pages()
>>>> 3. migrate_vma_finalize()
>>>>
>>>> It can so happen that when we get the destination pfn's allocated the destination
>>>> might not be able to allocate a large page, so we do the split in migrate_vma_pages().
>>>>
>>>> The pages have been unmapped and collected in migrate_vma_setup()
>>>
>>> So these unmapped folios are system memory folios? I thought they are
>>> large device private folios.
>>>
>>> OK. It sounds like splitting unmapped folios is really needed. I think
>>> it is better to make a new split_unmapped_folio() function
>>> by reusing __split_unmapped_folio(), since __folio_split() assumes
>>> the input folio is mapped.
>>
>> And to make __split_unmapped_folio()'s functionality match its name,
>> I will later refactor it. At least move local_irq_enable(), remap_page(),
>> and folio_unlocks out of it. I will think about how to deal with
>> lru_add_split_folio(). The goal is to remove the to-be-added "unmapped"
>> parameter from __split_unmapped_folio().
>>
>
> That sounds like a plan, it seems like there needs to be a finish phase of
> the split and it does not belong to __split_unmapped_folio(). I would propose
> that we rename "isolated" to "folio_is_migrating" and then your cleanups can
> follow? Once your cleanups come in, we won't need to pass the parameter to
> __split_unmapped_folio().
Sure.
The patch below should work. So far it has only passed the mm selftests and I am
planning to do more testing. If you are brave enough, you can give it a try and use
__split_unmapped_folio() from it.
From e594924d689bef740c38d93c7c1653f31bd5ae83 Mon Sep 17 00:00:00 2001
From: Zi Yan <ziy@nvidia.com>
Date: Sun, 6 Jul 2025 22:40:53 -0400
Subject: [PATCH] mm/huge_memory: move epilogue code out of
__split_unmapped_folio()
The epilogue code is not related to the actual splitting of an unmapped folio. Move
it out, so that __split_unmapped_folio() only does split work on unmapped
folios.
Signed-off-by: Zi Yan <ziy@nvidia.com>
---
mm/huge_memory.c | 226 ++++++++++++++++++++++++-----------------------
1 file changed, 116 insertions(+), 110 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3eb1c34be601..6eead616583f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3396,9 +3396,6 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
* order - 1 to new_order).
* @split_at: in buddy allocator like split, the folio containing @split_at
* will be split until its order becomes @new_order.
- * @lock_at: the folio containing @lock_at is left locked for caller.
- * @list: the after split folios will be added to @list if it is not NULL,
- * otherwise to LRU lists.
* @end: the end of the file @folio maps to. -1 if @folio is anonymous memory.
* @xas: xa_state pointing to folio->mapping->i_pages and locked by caller
* @mapping: @folio->mapping
@@ -3436,40 +3433,20 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
* split. The caller needs to check the input folio.
*/
static int __split_unmapped_folio(struct folio *folio, int new_order,
- struct page *split_at, struct page *lock_at,
- struct list_head *list, pgoff_t end,
- struct xa_state *xas, struct address_space *mapping,
- bool uniform_split)
+ struct page *split_at, struct xa_state *xas,
+ struct address_space *mapping,
+ bool uniform_split)
{
- struct lruvec *lruvec;
- struct address_space *swap_cache = NULL;
- struct folio *origin_folio = folio;
- struct folio *next_folio = folio_next(folio);
- struct folio *new_folio;
struct folio *next;
int order = folio_order(folio);
int split_order;
int start_order = uniform_split ? new_order : order - 1;
- int nr_dropped = 0;
int ret = 0;
bool stop_split = false;
- if (folio_test_swapcache(folio)) {
- VM_BUG_ON(mapping);
-
- /* a swapcache folio can only be uniformly split to order-0 */
- if (!uniform_split || new_order != 0)
- return -EINVAL;
-
- swap_cache = swap_address_space(folio->swap);
- xa_lock(&swap_cache->i_pages);
- }
-
if (folio_test_anon(folio))
mod_mthp_stat(order, MTHP_STAT_NR_ANON, -1);
- /* lock lru list/PageCompound, ref frozen by page_ref_freeze */
- lruvec = folio_lruvec_lock(folio);
folio_clear_has_hwpoisoned(folio);
@@ -3541,89 +3518,10 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
MTHP_STAT_NR_ANON, 1);
}
- /*
- * origin_folio should be kept frozon until page cache
- * entries are updated with all the other after-split
- * folios to prevent others seeing stale page cache
- * entries.
- */
- if (release == origin_folio)
- continue;
-
- folio_ref_unfreeze(release, 1 +
- ((mapping || swap_cache) ?
- folio_nr_pages(release) : 0));
-
- lru_add_split_folio(origin_folio, release, lruvec,
- list);
-
- /* Some pages can be beyond EOF: drop them from cache */
- if (release->index >= end) {
- if (shmem_mapping(mapping))
- nr_dropped += folio_nr_pages(release);
- else if (folio_test_clear_dirty(release))
- folio_account_cleaned(release,
- inode_to_wb(mapping->host));
- __filemap_remove_folio(release, NULL);
- folio_put_refs(release, folio_nr_pages(release));
- } else if (mapping) {
- __xa_store(&mapping->i_pages,
- release->index, release, 0);
- } else if (swap_cache) {
- __xa_store(&swap_cache->i_pages,
- swap_cache_index(release->swap),
- release, 0);
- }
}
}
- /*
- * Unfreeze origin_folio only after all page cache entries, which used
- * to point to it, have been updated with new folios. Otherwise,
- * a parallel folio_try_get() can grab origin_folio and its caller can
- * see stale page cache entries.
- */
- folio_ref_unfreeze(origin_folio, 1 +
- ((mapping || swap_cache) ? folio_nr_pages(origin_folio) : 0));
-
- unlock_page_lruvec(lruvec);
-
- if (swap_cache)
- xa_unlock(&swap_cache->i_pages);
- if (mapping)
- xa_unlock(&mapping->i_pages);
- /* Caller disabled irqs, so they are still disabled here */
- local_irq_enable();
-
- if (nr_dropped)
- shmem_uncharge(mapping->host, nr_dropped);
-
- remap_page(origin_folio, 1 << order,
- folio_test_anon(origin_folio) ?
- RMP_USE_SHARED_ZEROPAGE : 0);
-
- /*
- * At this point, folio should contain the specified page.
- * For uniform split, it is left for caller to unlock.
- * For buddy allocator like split, the first after-split folio is left
- * for caller to unlock.
- */
- for (new_folio = origin_folio; new_folio != next_folio; new_folio = next) {
- next = folio_next(new_folio);
- if (new_folio == page_folio(lock_at))
- continue;
-
- folio_unlock(new_folio);
- /*
- * Subpages may be freed if there wasn't any mapping
- * like if add_to_swap() is running on a lru page that
- * had its mapping zapped. And freeing these pages
- * requires taking the lru_lock so we do the put_page
- * of the tail pages after the split is complete.
- */
- free_folio_and_swap_cache(new_folio);
- }
return ret;
}
@@ -3706,10 +3604,12 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
{
struct deferred_split *ds_queue = get_deferred_split_queue(folio);
XA_STATE(xas, &folio->mapping->i_pages, folio->index);
+ struct folio *next_folio = folio_next(folio);
bool is_anon = folio_test_anon(folio);
struct address_space *mapping = NULL;
struct anon_vma *anon_vma = NULL;
int order = folio_order(folio);
+ struct folio *new_folio, *next;
int extra_pins, ret;
pgoff_t end;
bool is_hzp;
@@ -3840,6 +3740,10 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
/* Prevent deferred_split_scan() touching ->_refcount */
spin_lock(&ds_queue->split_queue_lock);
if (folio_ref_freeze(folio, 1 + extra_pins)) {
+ struct address_space *swap_cache = NULL;
+ struct lruvec *lruvec;
+ int nr_dropped = 0;
+
if (folio_order(folio) > 1 &&
!list_empty(&folio->_deferred_list)) {
ds_queue->split_queue_len--;
@@ -3873,19 +3777,121 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
}
}
- ret = __split_unmapped_folio(folio, new_order,
- split_at, lock_at, list, end, &xas, mapping,
- uniform_split);
+ if (folio_test_swapcache(folio)) {
+ VM_BUG_ON(mapping);
+
+ /* a swapcache folio can only be uniformly split to order-0 */
+ if (!uniform_split || new_order != 0) {
+ ret = -EINVAL;
+ goto out_unlock;
+ }
+
+ swap_cache = swap_address_space(folio->swap);
+ xa_lock(&swap_cache->i_pages);
+ }
+
+ /* lock lru list/PageCompound, ref frozen by page_ref_freeze */
+ lruvec = folio_lruvec_lock(folio);
+
+ ret = __split_unmapped_folio(folio, new_order, split_at, &xas,
+ mapping, uniform_split);
+
+ /* Unfreeze after-split folios */
+ for (new_folio = folio; new_folio != next_folio;
+ new_folio = next) {
+ next = folio_next(new_folio);
+ /*
+ * @folio should be kept frozon until page cache
+ * entries are updated with all the other after-split
+ * folios to prevent others seeing stale page cache
+ * entries.
+ */
+ if (new_folio == folio)
+ continue;
+
+ folio_ref_unfreeze(
+ new_folio,
+ 1 + ((mapping || swap_cache) ?
+ folio_nr_pages(new_folio) :
+ 0));
+
+ lru_add_split_folio(folio, new_folio, lruvec, list);
+
+ /* Some pages can be beyond EOF: drop them from cache */
+ if (new_folio->index >= end) {
+ if (shmem_mapping(mapping))
+ nr_dropped += folio_nr_pages(new_folio);
+ else if (folio_test_clear_dirty(new_folio))
+ folio_account_cleaned(
+ new_folio,
+ inode_to_wb(mapping->host));
+ __filemap_remove_folio(new_folio, NULL);
+ folio_put_refs(new_folio,
+ folio_nr_pages(new_folio));
+ } else if (mapping) {
+ __xa_store(&mapping->i_pages, new_folio->index,
+ new_folio, 0);
+ } else if (swap_cache) {
+ __xa_store(&swap_cache->i_pages,
+ swap_cache_index(new_folio->swap),
+ new_folio, 0);
+ }
+ }
+ /*
+ * Unfreeze @folio only after all page cache entries, which
+ * used to point to it, have been updated with new folios.
+ * Otherwise, a parallel folio_try_get() can grab origin_folio
+ * and its caller can see stale page cache entries.
+ */
+ folio_ref_unfreeze(folio, 1 +
+ ((mapping || swap_cache) ? folio_nr_pages(folio) : 0));
+
+ unlock_page_lruvec(lruvec);
+
+ if (swap_cache)
+ xa_unlock(&swap_cache->i_pages);
+ if (mapping)
+ xa_unlock(&mapping->i_pages);
+
+ if (nr_dropped)
+ shmem_uncharge(mapping->host, nr_dropped);
+
} else {
spin_unlock(&ds_queue->split_queue_lock);
fail:
if (mapping)
xas_unlock(&xas);
- local_irq_enable();
- remap_page(folio, folio_nr_pages(folio), 0);
ret = -EAGAIN;
}
+ local_irq_enable();
+
+ remap_page(folio, 1 << order,
+ !ret && folio_test_anon(folio) ? RMP_USE_SHARED_ZEROPAGE :
+ 0);
+
+ /*
+ * At this point, folio should contain the specified page.
+ * For uniform split, it is left for caller to unlock.
+ * For buddy allocator like split, the first after-split folio is left
+ * for caller to unlock.
+ */
+ for (new_folio = folio; new_folio != next_folio; new_folio = next) {
+ next = folio_next(new_folio);
+ if (new_folio == page_folio(lock_at))
+ continue;
+
+ folio_unlock(new_folio);
+ /*
+ * Subpages may be freed if there wasn't any mapping
+ * like if add_to_swap() is running on a lru page that
+ * had its mapping zapped. And freeing these pages
+ * requires taking the lru_lock so we do the put_page
+ * of the tail pages after the split is complete.
+ */
+ free_folio_and_swap_cache(new_folio);
+ }
+
out_unlock:
if (anon_vma) {
anon_vma_unlock_write(anon_vma);
--
2.47.2
--
Best Regards,
Yan, Zi
^ permalink raw reply related [flat|nested] 99+ messages in thread
* Re: [v1 resend 08/12] mm/thp: add split during migration support
2025-07-07 2:35 ` Balbir Singh
@ 2025-07-07 3:29 ` Mika Penttilä
2025-07-08 7:37 ` Balbir Singh
0 siblings, 1 reply; 99+ messages in thread
From: Mika Penttilä @ 2025-07-07 3:29 UTC (permalink / raw)
To: Balbir Singh, linux-mm
Cc: akpm, linux-kernel, Karol Herbst, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
David Hildenbrand, Barry Song, Baolin Wang, Ryan Roberts,
Matthew Wilcox, Peter Xu, Zi Yan, Kefeng Wang, Jane Chu,
Alistair Popple, Donet Tom
On 7/7/25 05:35, Balbir Singh wrote:
> On 7/5/25 13:17, Mika Penttilä wrote:
>>>>>> +static void migrate_vma_split_pages(struct migrate_vma *migrate,
>>>>>> + unsigned long idx, unsigned long addr,
>>>>>> + struct folio *folio)
>>>>>> +{
>>>>>> + unsigned long i;
>>>>>> + unsigned long pfn;
>>>>>> + unsigned long flags;
>>>>>> +
>>>>>> + folio_get(folio);
>>>>>> + split_huge_pmd_address(migrate->vma, addr, true);
>>>>>> + __split_huge_page_to_list_to_order(folio_page(folio, 0), NULL, 0, true);
>>>>> We already have reference to folio, why is folio_get() needed ?
>>>>>
>>>>> Splitting the page splits pmd for anon folios, why is there split_huge_pmd_address() ?
>>>> Oh I see
>>>> + if (!isolated)
>>>> + unmap_folio(folio);
>>>>
>>>> which explains the explicit split_huge_pmd_address(migrate->vma, addr, true);
>>>>
>>>> Still, why the folio_get(folio);?
>>>>
>>>>
>>> That is for split_huge_pmd_address, when called with freeze=true, it drops the
>>> ref count on the page
>>>
>>> if (freeze)
>>> put_page(page);
>>>
>>> Balbir
>>>
>> yeah I guess you could have used the pmd_migration path in __split_huge_pmd_locked, and not use freeze because you have installed the migration pmd entry already.
>> Which brings to a bigger concern, that you do need the freeze semantics, like clear PageAnonExclusive (which may fail). I think you did not get this part
>> right in the 3/12 patch. And in this patch, you can't assume the split succeeds, which would mean you can't migrate the range at all.
>> Doing the split this late is quite problematic all in all.
>>
> Clearing PageAnonExclusive will *not* fail for device private pages from what I can see in __folio_try_share_anon_rmap().
> Doing the split late is a requirement due to the nature of the three stage migration operation, the other side
> might fail to allocate THP sized pages and so the code needs to deal with it
>
> Balbir Singh
Yes, it seems clearing PageAnonExclusive doesn't fail for device private pages in the end,
but the 3/12 patch doesn't even try to clear PageAnonExclusive with your changes AFAICS,
which is a separate issue.
And __split_huge_page_to_list_to_order() (whose return value is not checked) can fail with out of memory.
So I think you cannot just assume the split works. If the late split is a requirement (which I can understand it is),
you should be prepared to somehow roll back the operation.
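(Illustration only, not code from the series: one possible shape, assuming
migrate_vma_split_pages() were changed to return an error and hand-waving over
restoring the already-installed migration PMD, would be
	ret = __split_huge_page_to_list_to_order(folio_page(folio, 0),
						 NULL, 0, true);
	if (ret) {
		/*
		 * Split failed (e.g. -ENOMEM): skip migrating this entry,
		 * drop the reference taken for the split and let the
		 * finalize phase restore the original state.
		 */
		migrate->src[idx] &= ~MIGRATE_PFN_MIGRATE;
		folio_put(folio);
		return ret;
	}
so that a failed split does not silently continue migrating a half-split range.)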
>
--Mika
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 03/12] mm/thp: zone_device awareness in THP handling code
2025-07-03 23:35 ` [v1 resend 03/12] mm/thp: zone_device awareness in THP handling code Balbir Singh
2025-07-04 4:46 ` Mika Penttilä
2025-07-04 11:10 ` Mika Penttilä
@ 2025-07-07 3:49 ` Mika Penttilä
2025-07-08 4:20 ` Balbir Singh
2025-07-07 6:07 ` Alistair Popple
2025-07-22 4:42 ` Matthew Brost
4 siblings, 1 reply; 99+ messages in thread
From: Mika Penttilä @ 2025-07-07 3:49 UTC (permalink / raw)
To: Balbir Singh, linux-mm
Cc: akpm, linux-kernel, Karol Herbst, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
David Hildenbrand, Barry Song, Baolin Wang, Ryan Roberts,
Matthew Wilcox, Peter Xu, Zi Yan, Kefeng Wang, Jane Chu,
Alistair Popple, Donet Tom
On 7/4/25 02:35, Balbir Singh wrote:
> Make THP handling code in the mm subsystem for THP pages
> aware of zone device pages. Although the code is
> designed to be generic when it comes to handling splitting
> of pages, the code is designed to work for THP page sizes
> corresponding to HPAGE_PMD_NR.
>
> Modify page_vma_mapped_walk() to return true when a zone
> device huge entry is present, enabling try_to_migrate()
> and other code migration paths to appropriately process the
> entry
>
> pmd_pfn() does not work well with zone device entries, use
> pfn_pmd_entry_to_swap() for checking and comparison as for
> zone device entries.
>
> try_to_map_to_unused_zeropage() does not apply to zone device
> entries, zone device entries are ignored in the call.
>
> Cc: Karol Herbst <kherbst@redhat.com>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
> Cc: Jane Chu <jane.chu@oracle.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Donet Tom <donettom@linux.ibm.com>
>
> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> ---
> mm/huge_memory.c | 153 +++++++++++++++++++++++++++++++------------
> mm/migrate.c | 2 +
> mm/page_vma_mapped.c | 10 +++
> mm/pgtable-generic.c | 6 ++
> mm/rmap.c | 19 +++++-
> 5 files changed, 146 insertions(+), 44 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index ce130225a8e5..e6e390d0308f 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1711,7 +1711,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> if (unlikely(is_swap_pmd(pmd))) {
> swp_entry_t entry = pmd_to_swp_entry(pmd);
>
> - VM_BUG_ON(!is_pmd_migration_entry(pmd));
> + VM_BUG_ON(!is_pmd_migration_entry(pmd) &&
> + !is_device_private_entry(entry));
> if (!is_readable_migration_entry(entry)) {
> entry = make_readable_migration_entry(
> swp_offset(entry));
> @@ -2222,10 +2223,17 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> } else if (thp_migration_supported()) {
> swp_entry_t entry;
>
> - VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
> entry = pmd_to_swp_entry(orig_pmd);
> folio = pfn_swap_entry_folio(entry);
> flush_needed = 0;
> +
> + VM_BUG_ON(!is_pmd_migration_entry(*pmd) &&
> + !folio_is_device_private(folio));
> +
> + if (folio_is_device_private(folio)) {
> + folio_remove_rmap_pmd(folio, folio_page(folio, 0), vma);
> + WARN_ON_ONCE(folio_mapcount(folio) < 0);
> + }
> } else
> WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
>
> @@ -2247,6 +2255,15 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> folio_mark_accessed(folio);
> }
>
> + /*
> + * Do a folio put on zone device private pages after
> + * changes to mm_counter, because the folio_put() will
> + * clean folio->mapping and the folio_test_anon() check
> + * will not be usable.
> + */
> + if (folio_is_device_private(folio))
> + folio_put(folio);
> +
> spin_unlock(ptl);
> if (flush_needed)
> tlb_remove_page_size(tlb, &folio->page, HPAGE_PMD_SIZE);
> @@ -2375,7 +2392,8 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> struct folio *folio = pfn_swap_entry_folio(entry);
> pmd_t newpmd;
>
> - VM_BUG_ON(!is_pmd_migration_entry(*pmd));
> + VM_BUG_ON(!is_pmd_migration_entry(*pmd) &&
> + !folio_is_device_private(folio));
> if (is_writable_migration_entry(entry)) {
> /*
> * A protection check is difficult so
> @@ -2388,9 +2406,11 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> newpmd = swp_entry_to_pmd(entry);
> if (pmd_swp_soft_dirty(*pmd))
> newpmd = pmd_swp_mksoft_dirty(newpmd);
> - } else {
> + } else if (is_writable_device_private_entry(entry)) {
> + newpmd = swp_entry_to_pmd(entry);
> + entry = make_device_exclusive_entry(swp_offset(entry));
> + } else
> newpmd = *pmd;
> - }
>
> if (uffd_wp)
> newpmd = pmd_swp_mkuffd_wp(newpmd);
> @@ -2842,16 +2862,20 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> struct page *page;
> pgtable_t pgtable;
> pmd_t old_pmd, _pmd;
> - bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
> - bool anon_exclusive = false, dirty = false;
> + bool young, write, soft_dirty, uffd_wp = false;
> + bool anon_exclusive = false, dirty = false, present = false;
> unsigned long addr;
> pte_t *pte;
> int i;
> + swp_entry_t swp_entry;
>
> VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
> VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
> VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma);
> - VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd));
> +
> + VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd)
> + && !(is_swap_pmd(*pmd) &&
> + is_device_private_entry(pmd_to_swp_entry(*pmd))));
>
> count_vm_event(THP_SPLIT_PMD);
>
> @@ -2899,20 +2923,25 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> return __split_huge_zero_page_pmd(vma, haddr, pmd);
> }
>
> - pmd_migration = is_pmd_migration_entry(*pmd);
> - if (unlikely(pmd_migration)) {
> - swp_entry_t entry;
>
> + present = pmd_present(*pmd);
> + if (unlikely(!present)) {
> + swp_entry = pmd_to_swp_entry(*pmd);
> old_pmd = *pmd;
> - entry = pmd_to_swp_entry(old_pmd);
> - page = pfn_swap_entry_to_page(entry);
> - write = is_writable_migration_entry(entry);
> +
> + folio = pfn_swap_entry_folio(swp_entry);
> + VM_BUG_ON(!is_migration_entry(swp_entry) &&
> + !is_device_private_entry(swp_entry));
> + page = pfn_swap_entry_to_page(swp_entry);
> + write = is_writable_migration_entry(swp_entry);
> +
> if (PageAnon(page))
> - anon_exclusive = is_readable_exclusive_migration_entry(entry);
> - young = is_migration_entry_young(entry);
> - dirty = is_migration_entry_dirty(entry);
> + anon_exclusive =
> + is_readable_exclusive_migration_entry(swp_entry);
> soft_dirty = pmd_swp_soft_dirty(old_pmd);
> uffd_wp = pmd_swp_uffd_wp(old_pmd);
> + young = is_migration_entry_young(swp_entry);
> + dirty = is_migration_entry_dirty(swp_entry);
> } else {
This is where folio_try_share_anon_rmap_pmd() gets skipped for device private pages, which is what I referred to in
https://lore.kernel.org/linux-mm/f1e26e18-83db-4c0e-b8d8-0af8ffa8a206@redhat.com/
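(For reference, the mapped pmd_present() branch does roughly the following, and
it is this handling that the new non-present/device-private path above bypasses:
	anon_exclusive = PageAnonExclusive(page);
	if (freeze && anon_exclusive &&
	    folio_try_share_anon_rmap_pmd(folio, page))
		freeze = false;
i.e. when PageAnonExclusive cannot be cleared, the split falls back to a
non-freezing PMD split rather than installing migration entries.)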
--Mika
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 01/12] mm/zone_device: support large zone device private folios
2025-07-03 23:35 ` [v1 resend 01/12] mm/zone_device: support large zone device private folios Balbir Singh
@ 2025-07-07 5:28 ` Alistair Popple
2025-07-08 6:47 ` Balbir Singh
0 siblings, 1 reply; 99+ messages in thread
From: Alistair Popple @ 2025-07-07 5:28 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-mm, akpm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Donet Tom
On Fri, Jul 04, 2025 at 09:35:00AM +1000, Balbir Singh wrote:
> Add routines to support allocation of large order zone device folios
> and helper functions for zone device folios, to check if a folio is
> device private and helpers for setting zone device data.
>
> When large folios are used, the existing page_free() callback in
> pgmap is called when the folio is freed, this is true for both
> PAGE_SIZE and higher order pages.
>
> Cc: Karol Herbst <kherbst@redhat.com>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
> Cc: Jane Chu <jane.chu@oracle.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Donet Tom <donettom@linux.ibm.com>
>
> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> ---
> include/linux/memremap.h | 22 +++++++++++++++++-
> mm/memremap.c | 50 +++++++++++++++++++++++++++++-----------
> 2 files changed, 58 insertions(+), 14 deletions(-)
>
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 4aa151914eab..11d586dd8ef1 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -169,6 +169,18 @@ static inline bool folio_is_device_private(const struct folio *folio)
> return is_device_private_page(&folio->page);
> }
>
> +static inline void *folio_zone_device_data(const struct folio *folio)
> +{
> + VM_BUG_ON_FOLIO(!folio_is_device_private(folio), folio);
> + return folio->page.zone_device_data;
> +}
> +
> +static inline void folio_set_zone_device_data(struct folio *folio, void *data)
> +{
> + VM_BUG_ON_FOLIO(!folio_is_device_private(folio), folio);
> + folio->page.zone_device_data = data;
> +}
> +
> static inline bool is_pci_p2pdma_page(const struct page *page)
> {
> return IS_ENABLED(CONFIG_PCI_P2PDMA) &&
> @@ -199,7 +211,7 @@ static inline bool folio_is_fsdax(const struct folio *folio)
> }
>
> #ifdef CONFIG_ZONE_DEVICE
> -void zone_device_page_init(struct page *page);
> +void init_zone_device_folio(struct folio *folio, unsigned int order);
> void *memremap_pages(struct dev_pagemap *pgmap, int nid);
> void memunmap_pages(struct dev_pagemap *pgmap);
> void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
> @@ -209,6 +221,14 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
> bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);
>
> unsigned long memremap_compat_align(void);
> +
> +static inline void zone_device_page_init(struct page *page)
> +{
> + struct folio *folio = page_folio(page);
> +
> + init_zone_device_folio(folio, 0);
Minor nit, but why not call this zone_device_folio_init() to keep the naming
consistent with zone_device_page_init()?
> +}
> +
> #else
> static inline void *devm_memremap_pages(struct device *dev,
> struct dev_pagemap *pgmap)
> diff --git a/mm/memremap.c b/mm/memremap.c
> index b0ce0d8254bd..4085a3893e64 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -427,20 +427,21 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
> void free_zone_device_folio(struct folio *folio)
> {
> struct dev_pagemap *pgmap = folio->pgmap;
> + unsigned int nr = folio_nr_pages(folio);
> + int i;
> + bool anon = folio_test_anon(folio);
> + struct page *page = folio_page(folio, 0);
>
> if (WARN_ON_ONCE(!pgmap))
> return;
>
> mem_cgroup_uncharge(folio);
>
> - /*
> - * Note: we don't expect anonymous compound pages yet. Once supported
> - * and we could PTE-map them similar to THP, we'd have to clear
> - * PG_anon_exclusive on all tail pages.
> - */
> - if (folio_test_anon(folio)) {
> - VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> - __ClearPageAnonExclusive(folio_page(folio, 0));
> + WARN_ON_ONCE(folio_test_large(folio) && !anon);
> +
> + for (i = 0; i < nr; i++) {
The above comment says we should do this for all tail pages, but this appears to
do it for the head page as well. Is there a particular reason for that?
> + if (anon)
> + __ClearPageAnonExclusive(folio_page(folio, i));
> }
>
> /*
> @@ -464,10 +465,19 @@ void free_zone_device_folio(struct folio *folio)
>
> switch (pgmap->type) {
> case MEMORY_DEVICE_PRIVATE:
> + if (folio_test_large(folio)) {
> + folio_unqueue_deferred_split(folio);
> +
> + percpu_ref_put_many(&folio->pgmap->ref, nr - 1);
> + }
> + pgmap->ops->page_free(page);
> + put_dev_pagemap(pgmap);
Why is this needed/added, and where is the associated get_dev_pagemap()? Note
that the whole {get|put}_dev_pagemap() thing is basically unused now, which
reminds me that I should send a patch to remove it.
> + page->mapping = NULL;
> + break;
> case MEMORY_DEVICE_COHERENT:
> if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->page_free))
> break;
> - pgmap->ops->page_free(folio_page(folio, 0));
> + pgmap->ops->page_free(page);
> put_dev_pagemap(pgmap);
> break;
>
> @@ -491,14 +501,28 @@ void free_zone_device_folio(struct folio *folio)
> }
> }
>
> -void zone_device_page_init(struct page *page)
> +void init_zone_device_folio(struct folio *folio, unsigned int order)
See above for some bike-shedding on the name.
> {
> + struct page *page = folio_page(folio, 0);
> +
> + VM_BUG_ON(order > MAX_ORDER_NR_PAGES);
> +
> + WARN_ON_ONCE(order && order != HPAGE_PMD_ORDER);
> +
> /*
> * Drivers shouldn't be allocating pages after calling
> * memunmap_pages().
> */
> - WARN_ON_ONCE(!percpu_ref_tryget_live(&page_pgmap(page)->ref));
> - set_page_count(page, 1);
> + WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << order));
> + folio_set_count(folio, 1);
> lock_page(page);
> +
> + /*
> + * Only PMD level migration is supported for THP migration
> + */
> + if (order > 1) {
> + prep_compound_page(page, order);
Shouldn't this happen for order > 0, not order > 1? What about calling
INIT_LIST_HEAD(&folio->_deferred_list)? Last time I looked, prep_compound_page()
didn't do that, and I see above that you are calling folio_unqueue_deferred_split(),
so I assume you need to do this for DEVICE_PRIVATE pages too.
> + folio_set_large_rmappable(folio);
> + }
> }
> -EXPORT_SYMBOL_GPL(zone_device_page_init);
> +EXPORT_SYMBOL_GPL(init_zone_device_folio);
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 02/12] mm/migrate_device: flags for selecting device private THP pages
2025-07-03 23:35 ` [v1 resend 02/12] mm/migrate_device: flags for selecting device private THP pages Balbir Singh
@ 2025-07-07 5:31 ` Alistair Popple
2025-07-08 7:31 ` Balbir Singh
2025-07-18 3:15 ` Matthew Brost
1 sibling, 1 reply; 99+ messages in thread
From: Alistair Popple @ 2025-07-07 5:31 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-mm, akpm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Donet Tom
On Fri, Jul 04, 2025 at 09:35:01AM +1000, Balbir Singh wrote:
> Add flags to mark zone device migration pages.
>
> MIGRATE_VMA_SELECT_COMPOUND will be used to select THP pages during
> migrate_vma_setup() and MIGRATE_PFN_COMPOUND will make migrating
> device pages as compound pages during device pfn migration.
>
> Cc: Karol Herbst <kherbst@redhat.com>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
> Cc: Jane Chu <jane.chu@oracle.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Donet Tom <donettom@linux.ibm.com>
>
> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> ---
> include/linux/migrate.h | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index aaa2114498d6..1661e2d5479a 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -167,6 +167,7 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
> #define MIGRATE_PFN_VALID (1UL << 0)
> #define MIGRATE_PFN_MIGRATE (1UL << 1)
> #define MIGRATE_PFN_WRITE (1UL << 3)
> +#define MIGRATE_PFN_COMPOUND (1UL << 4)
Why is this necessary? Couldn't migrate_vma just use folio_order() to figure out
if it's a compound page or not?
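i.e. on the collection side something like this (untested, names are just
placeholders) could derive it from the folio instead of carrying a flag bit:

        struct page *page = migrate_pfn_to_page(migrate->src[i]);
        bool compound = page && folio_test_large(page_folio(page));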
> #define MIGRATE_PFN_SHIFT 6
>
> static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
> @@ -185,6 +186,7 @@ enum migrate_vma_direction {
> MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
> MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
> MIGRATE_VMA_SELECT_DEVICE_COHERENT = 1 << 2,
> + MIGRATE_VMA_SELECT_COMPOUND = 1 << 3,
> };
>
> struct migrate_vma {
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 03/12] mm/thp: zone_device awareness in THP handling code
2025-07-03 23:35 ` [v1 resend 03/12] mm/thp: zone_device awareness in THP handling code Balbir Singh
` (2 preceding siblings ...)
2025-07-07 3:49 ` Mika Penttilä
@ 2025-07-07 6:07 ` Alistair Popple
2025-07-08 4:59 ` Balbir Singh
2025-07-22 4:42 ` Matthew Brost
4 siblings, 1 reply; 99+ messages in thread
From: Alistair Popple @ 2025-07-07 6:07 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-mm, akpm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Donet Tom
On Fri, Jul 04, 2025 at 09:35:02AM +1000, Balbir Singh wrote:
> Make the THP handling code in the mm subsystem aware of zone
> device pages. Although the code is designed to be generic when
> it comes to handling the splitting of pages, it is written to
> work for THP page sizes corresponding to HPAGE_PMD_NR.
>
> Modify page_vma_mapped_walk() to return true when a zone
> device huge entry is present, enabling try_to_migrate()
> and other migration code paths to appropriately process the
> entry.
>
> pmd_pfn() does not work well with zone device entries; use
> pfn_pmd_entry_to_swap() for checking and comparison of
> zone device entries.
>
> try_to_map_to_unused_zeropage() does not apply to zone device
> entries; such entries are ignored in the call.
>
> Cc: Karol Herbst <kherbst@redhat.com>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
> Cc: Jane Chu <jane.chu@oracle.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Donet Tom <donettom@linux.ibm.com>
>
> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> ---
> mm/huge_memory.c | 153 +++++++++++++++++++++++++++++++------------
> mm/migrate.c | 2 +
> mm/page_vma_mapped.c | 10 +++
> mm/pgtable-generic.c | 6 ++
> mm/rmap.c | 19 +++++-
> 5 files changed, 146 insertions(+), 44 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index ce130225a8e5..e6e390d0308f 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1711,7 +1711,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> if (unlikely(is_swap_pmd(pmd))) {
> swp_entry_t entry = pmd_to_swp_entry(pmd);
>
> - VM_BUG_ON(!is_pmd_migration_entry(pmd));
> + VM_BUG_ON(!is_pmd_migration_entry(pmd) &&
> + !is_device_private_entry(entry));
> if (!is_readable_migration_entry(entry)) {
> entry = make_readable_migration_entry(
> swp_offset(entry));
> @@ -2222,10 +2223,17 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> } else if (thp_migration_supported()) {
> swp_entry_t entry;
>
> - VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
> entry = pmd_to_swp_entry(orig_pmd);
> folio = pfn_swap_entry_folio(entry);
> flush_needed = 0;
> +
> + VM_BUG_ON(!is_pmd_migration_entry(*pmd) &&
> + !folio_is_device_private(folio));
> +
> + if (folio_is_device_private(folio)) {
> + folio_remove_rmap_pmd(folio, folio_page(folio, 0), vma);
> + WARN_ON_ONCE(folio_mapcount(folio) < 0);
> + }
> } else
> WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
>
> @@ -2247,6 +2255,15 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> folio_mark_accessed(folio);
> }
>
> + /*
> + * Do a folio put on zone device private pages after
> + * changes to mm_counter, because the folio_put() will
> + * clean folio->mapping and the folio_test_anon() check
> + * will not be usable.
> + */
> + if (folio_is_device_private(folio))
> + folio_put(folio);
> +
> spin_unlock(ptl);
> if (flush_needed)
> tlb_remove_page_size(tlb, &folio->page, HPAGE_PMD_SIZE);
> @@ -2375,7 +2392,8 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> struct folio *folio = pfn_swap_entry_folio(entry);
> pmd_t newpmd;
>
> - VM_BUG_ON(!is_pmd_migration_entry(*pmd));
> + VM_BUG_ON(!is_pmd_migration_entry(*pmd) &&
> + !folio_is_device_private(folio));
> if (is_writable_migration_entry(entry)) {
> /*
> * A protection check is difficult so
> @@ -2388,9 +2406,11 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> newpmd = swp_entry_to_pmd(entry);
> if (pmd_swp_soft_dirty(*pmd))
> newpmd = pmd_swp_mksoft_dirty(newpmd);
> - } else {
> + } else if (is_writable_device_private_entry(entry)) {
> + newpmd = swp_entry_to_pmd(entry);
> + entry = make_device_exclusive_entry(swp_offset(entry));
Argh. The naming here is terribly confusing (of which I'm at least partly to
blame) because it ended up clashing with David's PG_anon_exclusive which is a
completely different concept - see 6c287605fd56 ("mm: remember exclusively
mapped anonymous pages with PG_anon_exclusive").
The exclusive entries you are creating here are for emulating atomic access -
see the documentation for make_device_exclusive() for more details - and are
almost certainly not what you want.
As far as I understand things we don't need to create anon exclusive entries for
device private pages because they can never be pinned, so likely you just want
make_readable_device_private_entry() here. If we really want to track anon
exclusive status you probably need pte_swp_exclusive(), but then we should do it
for non-THP device private pages as well and that sounds like a whole different
problem/patch series.
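i.e. the hunk above would become something like (untested):

		} else if (is_writable_device_private_entry(entry)) {
			entry = make_readable_device_private_entry(
							swp_offset(entry));
			newpmd = swp_entry_to_pmd(entry);
		}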
> + } else
> newpmd = *pmd;
> - }
>
> if (uffd_wp)
> newpmd = pmd_swp_mkuffd_wp(newpmd);
> @@ -2842,16 +2862,20 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> struct page *page;
> pgtable_t pgtable;
> pmd_t old_pmd, _pmd;
> - bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
> - bool anon_exclusive = false, dirty = false;
> + bool young, write, soft_dirty, uffd_wp = false;
> + bool anon_exclusive = false, dirty = false, present = false;
> unsigned long addr;
> pte_t *pte;
> int i;
> + swp_entry_t swp_entry;
>
> VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
> VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
> VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma);
> - VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd));
> +
> + VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd)
> + && !(is_swap_pmd(*pmd) &&
> + is_device_private_entry(pmd_to_swp_entry(*pmd))));
>
> count_vm_event(THP_SPLIT_PMD);
>
> @@ -2899,20 +2923,25 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> return __split_huge_zero_page_pmd(vma, haddr, pmd);
> }
>
> - pmd_migration = is_pmd_migration_entry(*pmd);
> - if (unlikely(pmd_migration)) {
> - swp_entry_t entry;
>
> + present = pmd_present(*pmd);
> + if (unlikely(!present)) {
> + swp_entry = pmd_to_swp_entry(*pmd);
> old_pmd = *pmd;
> - entry = pmd_to_swp_entry(old_pmd);
> - page = pfn_swap_entry_to_page(entry);
> - write = is_writable_migration_entry(entry);
> +
> + folio = pfn_swap_entry_folio(swp_entry);
> + VM_BUG_ON(!is_migration_entry(swp_entry) &&
> + !is_device_private_entry(swp_entry));
> + page = pfn_swap_entry_to_page(swp_entry);
> + write = is_writable_migration_entry(swp_entry);
> +
> if (PageAnon(page))
> - anon_exclusive = is_readable_exclusive_migration_entry(entry);
> - young = is_migration_entry_young(entry);
> - dirty = is_migration_entry_dirty(entry);
> + anon_exclusive =
> + is_readable_exclusive_migration_entry(swp_entry);
> soft_dirty = pmd_swp_soft_dirty(old_pmd);
> uffd_wp = pmd_swp_uffd_wp(old_pmd);
> + young = is_migration_entry_young(swp_entry);
> + dirty = is_migration_entry_dirty(swp_entry);
This could be a device-private swp_entry right? In which case calling
is_migration_entry_*() on them isn't correct. I suspect you want to have
separate code paths for migration vs. device_private entries here.
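i.e. roughly (untested):

		if (is_migration_entry(swp_entry)) {
			write = is_writable_migration_entry(swp_entry);
			if (PageAnon(page))
				anon_exclusive =
					is_readable_exclusive_migration_entry(swp_entry);
			young = is_migration_entry_young(swp_entry);
			dirty = is_migration_entry_dirty(swp_entry);
		} else {
			VM_WARN_ON_ONCE(!is_device_private_entry(swp_entry));
			write = is_writable_device_private_entry(swp_entry);
		}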
> } else {
> /*
> * Up to this point the pmd is present and huge and userland has
> @@ -2996,30 +3025,45 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> * Note that NUMA hinting access restrictions are not transferred to
> * avoid any possibility of altering permissions across VMAs.
> */
> - if (freeze || pmd_migration) {
> + if (freeze || !present) {
> for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
> pte_t entry;
> - swp_entry_t swp_entry;
> -
> - if (write)
> - swp_entry = make_writable_migration_entry(
> - page_to_pfn(page + i));
> - else if (anon_exclusive)
> - swp_entry = make_readable_exclusive_migration_entry(
> - page_to_pfn(page + i));
> - else
> - swp_entry = make_readable_migration_entry(
> - page_to_pfn(page + i));
> - if (young)
> - swp_entry = make_migration_entry_young(swp_entry);
> - if (dirty)
> - swp_entry = make_migration_entry_dirty(swp_entry);
> - entry = swp_entry_to_pte(swp_entry);
> - if (soft_dirty)
> - entry = pte_swp_mksoft_dirty(entry);
> - if (uffd_wp)
> - entry = pte_swp_mkuffd_wp(entry);
> -
> + if (freeze || is_migration_entry(swp_entry)) {
> + if (write)
> + swp_entry = make_writable_migration_entry(
> + page_to_pfn(page + i));
> + else if (anon_exclusive)
> + swp_entry = make_readable_exclusive_migration_entry(
> + page_to_pfn(page + i));
> + else
> + swp_entry = make_readable_migration_entry(
> + page_to_pfn(page + i));
> + if (young)
> + swp_entry = make_migration_entry_young(swp_entry);
> + if (dirty)
> + swp_entry = make_migration_entry_dirty(swp_entry);
> + entry = swp_entry_to_pte(swp_entry);
> + if (soft_dirty)
> + entry = pte_swp_mksoft_dirty(entry);
> + if (uffd_wp)
> + entry = pte_swp_mkuffd_wp(entry);
> + } else {
> + VM_BUG_ON(!is_device_private_entry(swp_entry));
> + if (write)
> + swp_entry = make_writable_device_private_entry(
> + page_to_pfn(page + i));
> + else if (anon_exclusive)
> + swp_entry = make_device_exclusive_entry(
As above, you're welcome for the naming :-)
> + page_to_pfn(page + i));
> + else
> + swp_entry = make_readable_device_private_entry(
> + page_to_pfn(page + i));
> + entry = swp_entry_to_pte(swp_entry);
> + if (soft_dirty)
> + entry = pte_swp_mksoft_dirty(entry);
> + if (uffd_wp)
> + entry = pte_swp_mkuffd_wp(entry);
> + }
> VM_WARN_ON(!pte_none(ptep_get(pte + i)));
> set_pte_at(mm, addr, pte + i, entry);
> }
> @@ -3046,7 +3090,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> }
> pte_unmap(pte);
>
> - if (!pmd_migration)
> + if (present)
> folio_remove_rmap_pmd(folio, page, vma);
> if (freeze)
> put_page(page);
> @@ -3058,8 +3102,11 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
> pmd_t *pmd, bool freeze)
> {
> +
> VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
> - if (pmd_trans_huge(*pmd) || is_pmd_migration_entry(*pmd))
> + if (pmd_trans_huge(*pmd) || is_pmd_migration_entry(*pmd) ||
> + (is_swap_pmd(*pmd) &&
Should we create is_pmd_device_entry() to match is_pmd_migration_entry()?
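Something like this, mirroring the existing helper (name open to bike-shedding):

        static inline int is_pmd_device_private_entry(pmd_t pmd)
        {
                return is_swap_pmd(pmd) &&
                       is_device_private_entry(pmd_to_swp_entry(pmd));
        }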
> + is_device_private_entry(pmd_to_swp_entry(*pmd))))
> __split_huge_pmd_locked(vma, pmd, address, freeze);
> }
>
> @@ -3238,6 +3285,9 @@ static void lru_add_split_folio(struct folio *folio, struct folio *new_folio,
> VM_BUG_ON_FOLIO(folio_test_lru(new_folio), folio);
> lockdep_assert_held(&lruvec->lru_lock);
>
> + if (folio_is_device_private(folio))
> + return;
> +
> if (list) {
> /* page reclaim is reclaiming a huge page */
> VM_WARN_ON(folio_test_lru(folio));
> @@ -3252,6 +3302,7 @@ static void lru_add_split_folio(struct folio *folio, struct folio *new_folio,
> list_add_tail(&new_folio->lru, &folio->lru);
> folio_set_lru(new_folio);
> }
> +
> }
>
> /* Racy check whether the huge page can be split */
> @@ -3543,6 +3594,10 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
> ((mapping || swap_cache) ?
> folio_nr_pages(release) : 0));
>
> + if (folio_is_device_private(release))
> + percpu_ref_get_many(&release->pgmap->ref,
> + (1 << new_order) - 1);
Is there a description somewhere for how we think pgmap->ref works for compound/
higher-order device private pages? Not that it matters too much, I'd like to
remove it. Maybe I can create a patch to do that which you can base on top of.
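For what it's worth, the pairing in this series appears to be (reading the hunks
here and in the zone device patch) roughly:

        /* init_zone_device_folio(): one reference per page in the folio */
        percpu_ref_tryget_many(&pgmap->ref, 1 << order);

        /* free_zone_device_folio(): drop the tail page references ... */
        percpu_ref_put_many(&folio->pgmap->ref, nr - 1);
        /* ... and the remaining head reference via put_dev_pagemap() */
        put_dev_pagemap(pgmap);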
> lru_add_split_folio(origin_folio, release, lruvec,
> list);
>
> @@ -4596,7 +4651,10 @@ int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
> return 0;
>
> flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
> - pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
> + if (!folio_is_device_private(folio))
> + pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
> + else
> + pmdval = pmdp_huge_clear_flush(vma, address, pvmw->pmd);
Do we need to flush? A device private entry is already non-present so is the
flush necessary?
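If not, the device private case could presumably get away with just clearing
the entry, e.g. (untested):

		pmdval = pmdp_huge_get_and_clear(vma->vm_mm, address, pvmw->pmd);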
>
> /* See folio_try_share_anon_rmap_pmd(): invalidate PMD first. */
> anon_exclusive = folio_test_anon(folio) && PageAnonExclusive(page);
> @@ -4646,6 +4704,17 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
> entry = pmd_to_swp_entry(*pvmw->pmd);
> folio_get(folio);
> pmde = folio_mk_pmd(folio, READ_ONCE(vma->vm_page_prot));
> +
> + if (unlikely(folio_is_device_private(folio))) {
> + if (pmd_write(pmde))
> + entry = make_writable_device_private_entry(
> + page_to_pfn(new));
> + else
> + entry = make_readable_device_private_entry(
> + page_to_pfn(new));
> + pmde = swp_entry_to_pmd(entry);
> + }
> +
> if (pmd_swp_soft_dirty(*pvmw->pmd))
> pmde = pmd_mksoft_dirty(pmde);
> if (is_writable_migration_entry(entry))
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 767f503f0875..0b6ecf559b22 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -200,6 +200,8 @@ static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
>
> if (PageCompound(page))
> return false;
> + if (folio_is_device_private(folio))
> + return false;
> VM_BUG_ON_PAGE(!PageAnon(page), page);
> VM_BUG_ON_PAGE(!PageLocked(page), page);
> VM_BUG_ON_PAGE(pte_present(ptep_get(pvmw->pte)), page);
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index e981a1a292d2..ff8254e52de5 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -277,6 +277,16 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> * cannot return prematurely, while zap_huge_pmd() has
> * cleared *pmd but not decremented compound_mapcount().
> */
> + swp_entry_t entry;
> +
> + if (!thp_migration_supported())
> + return not_found(pvmw);
> + entry = pmd_to_swp_entry(pmde);
> + if (is_device_private_entry(entry)) {
> + pvmw->ptl = pmd_lock(mm, pvmw->pmd);
> + return true;
Do other callers of page_vma_mapped_walk() need to be updated now that large
device private pages may be returned?
> + }
> +
> if ((pvmw->flags & PVMW_SYNC) &&
> thp_vma_suitable_order(vma, pvmw->address,
> PMD_ORDER) &&
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index 567e2d084071..604e8206a2ec 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -292,6 +292,12 @@ pte_t *___pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
> *pmdvalp = pmdval;
> if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
> goto nomap;
> + if (is_swap_pmd(pmdval)) {
> + swp_entry_t entry = pmd_to_swp_entry(pmdval);
> +
> + if (is_device_private_entry(entry))
> + goto nomap;
> + }
> if (unlikely(pmd_trans_huge(pmdval)))
> goto nomap;
> if (unlikely(pmd_bad(pmdval))) {
> diff --git a/mm/rmap.c b/mm/rmap.c
> index bd83724d14b6..da1e5b03e1fe 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2336,8 +2336,23 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
> break;
> }
> #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> - subpage = folio_page(folio,
> - pmd_pfn(*pvmw.pmd) - folio_pfn(folio));
> + /*
> + * Zone device private folios do not work well with
> + * pmd_pfn() on some architectures due to pte
> + * inversion.
> + */
> + if (folio_is_device_private(folio)) {
> + swp_entry_t entry = pmd_to_swp_entry(*pvmw.pmd);
> + unsigned long pfn = swp_offset_pfn(entry);
> +
> + subpage = folio_page(folio, pfn
> + - folio_pfn(folio));
> + } else {
> + subpage = folio_page(folio,
> + pmd_pfn(*pvmw.pmd)
> + - folio_pfn(folio));
> + }
> +
> VM_BUG_ON_FOLIO(folio_test_hugetlb(folio) ||
> !folio_test_pmd_mappable(folio), folio);
>
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 03/12] mm/thp: zone_device awareness in THP handling code
2025-07-05 0:14 ` Balbir Singh
@ 2025-07-07 6:09 ` Alistair Popple
2025-07-08 7:40 ` Balbir Singh
0 siblings, 1 reply; 99+ messages in thread
From: Alistair Popple @ 2025-07-07 6:09 UTC (permalink / raw)
To: Balbir Singh
Cc: Mika Penttilä, linux-mm, akpm, linux-kernel, Karol Herbst,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Donet Tom
On Sat, Jul 05, 2025 at 10:14:18AM +1000, Balbir Singh wrote:
> On 7/4/25 21:10, Mika Penttilä wrote:
> >> /* Racy check whether the huge page can be split */
> >> @@ -3543,6 +3594,10 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
> >> ((mapping || swap_cache) ?
> >> folio_nr_pages(release) : 0));
> >>
> >> + if (folio_is_device_private(release))
> >> + percpu_ref_get_many(&release->pgmap->ref,
> >> + (1 << new_order) - 1);
> >
> > pgmap refcount should not be modified here; the count should remain the same after the split.
Agreed.
> >
> >
>
> Good point, let me revisit the accounting
Yes, hopefully we can just delete it.
> For this patch series, the tests did not catch it since the new refs evaluate to 0.
You may not notice bad accounting here unless you unload the kernel module,
which can hang in memunmap_pages() waiting for the refcount to go to zero.
> Thanks,
> Balbir Singh
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 08/12] mm/thp: add split during migration support
2025-07-07 2:45 ` Zi Yan
@ 2025-07-08 3:31 ` Balbir Singh
2025-07-08 7:43 ` Balbir Singh
1 sibling, 0 replies; 99+ messages in thread
From: Balbir Singh @ 2025-07-08 3:31 UTC (permalink / raw)
To: Zi Yan
Cc: linux-mm, akpm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
On 7/7/25 12:45, Zi Yan wrote:
> On 6 Jul 2025, at 22:29, Balbir Singh wrote:
>
>> On 7/6/25 13:03, Zi Yan wrote:
>>> On 5 Jul 2025, at 22:34, Zi Yan wrote:
>>>
>>>> On 5 Jul 2025, at 21:47, Balbir Singh wrote:
>>>>
>>>>> On 7/6/25 11:34, Zi Yan wrote:
>>>>>> On 5 Jul 2025, at 21:15, Balbir Singh wrote:
>>>>>>
>>>>>>> On 7/5/25 11:55, Zi Yan wrote:
>>>>>>>> On 4 Jul 2025, at 20:58, Balbir Singh wrote:
>>>>>>>>
>>>>>>>>> On 7/4/25 21:24, Zi Yan wrote:
>>>>>>>>>>
>>>>>>>>>> s/pages/folio
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks, will make the changes
>>>>>>>>>
>>>>>>>>>> Why name it isolated if the folio is unmapped? Isolated folios often mean
>>>>>>>>>> they are removed from LRU lists. isolated here causes confusion.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Ack, will change the name
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> *
>>>>>>>>>>> * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>>>>>>>>>>> * It is in charge of checking whether the split is supported or not and
>>>>>>>>>>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>>>>>>>>> */
>>>>>>>>>>> static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>>> struct page *split_at, struct page *lock_at,
>>>>>>>>>>> - struct list_head *list, bool uniform_split)
>>>>>>>>>>> + struct list_head *list, bool uniform_split, bool isolated)
>>>>>>>>>>> {
>>>>>>>>>>> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>>>>>>>>> XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>>>>>>>>>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>>> * is taken to serialise against parallel split or collapse
>>>>>>>>>>> * operations.
>>>>>>>>>>> */
>>>>>>>>>>> - anon_vma = folio_get_anon_vma(folio);
>>>>>>>>>>> - if (!anon_vma) {
>>>>>>>>>>> - ret = -EBUSY;
>>>>>>>>>>> - goto out;
>>>>>>>>>>> + if (!isolated) {
>>>>>>>>>>> + anon_vma = folio_get_anon_vma(folio);
>>>>>>>>>>> + if (!anon_vma) {
>>>>>>>>>>> + ret = -EBUSY;
>>>>>>>>>>> + goto out;
>>>>>>>>>>> + }
>>>>>>>>>>> + anon_vma_lock_write(anon_vma);
>>>>>>>>>>> }
>>>>>>>>>>> end = -1;
>>>>>>>>>>> mapping = NULL;
>>>>>>>>>>> - anon_vma_lock_write(anon_vma);
>>>>>>>>>>> } else {
>>>>>>>>>>> unsigned int min_order;
>>>>>>>>>>> gfp_t gfp;
>>>>>>>>>>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>>> goto out_unlock;
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> - unmap_folio(folio);
>>>>>>>>>>> + if (!isolated)
>>>>>>>>>>> + unmap_folio(folio);
>>>>>>>>>>>
>>>>>>>>>>> /* block interrupt reentry in xa_lock and spinlock */
>>>>>>>>>>> local_irq_disable();
>>>>>>>>>>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>>>
>>>>>>>>>>> ret = __split_unmapped_folio(folio, new_order,
>>>>>>>>>>> split_at, lock_at, list, end, &xas, mapping,
>>>>>>>>>>> - uniform_split);
>>>>>>>>>>> + uniform_split, isolated);
>>>>>>>>>>> } else {
>>>>>>>>>>> spin_unlock(&ds_queue->split_queue_lock);
>>>>>>>>>>> fail:
>>>>>>>>>>> if (mapping)
>>>>>>>>>>> xas_unlock(&xas);
>>>>>>>>>>> local_irq_enable();
>>>>>>>>>>> - remap_page(folio, folio_nr_pages(folio), 0);
>>>>>>>>>>> + if (!isolated)
>>>>>>>>>>> + remap_page(folio, folio_nr_pages(folio), 0);
>>>>>>>>>>> ret = -EAGAIN;
>>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> These "isolated" special handlings does not look good, I wonder if there
>>>>>>>>>> is a way of letting split code handle device private folios more gracefully.
>>>>>>>>>> It also causes confusions, since why does "isolated/unmapped" folios
>>>>>>>>>> not need to unmap_page(), remap_page(), or unlock?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> There are two reasons for going down the current code path
>>>>>>>>
>>>>>>>> After thinking more, I think adding isolated/unmapped is not the right
>>>>>>>> way, since unmapped folio is a very generic concept. If you add it,
>>>>>>>> one can easily misuse the folio split code by first unmapping a folio
>>>>>>>> and trying to split it with unmapped = true. I do not think that is
>>>>>>>> supported and your patch does not prevent that from happening in the future.
>>>>>>>>
>>>>>>>
>>>>>>> I don't understand the misuse case you mention; I assume you mean someone can
>>>>>>> get the usage wrong? The responsibility is on the caller to do the right thing
>>>>>>> when calling the API with unmapped.
>>>>>>
>>>>>> Before your patch, there is no use case of splitting unmapped folios.
>>>>>> Your patch only adds support for device private page split, not any unmapped
>>>>>> folio split. So using a generic isolated/unmapped parameter is not OK.
>>>>>>
>>>>>
>>>>> There is a use for splitting unmapped folios (see below)
>>>>>
>>>>>>>
>>>>>>>> You should teach different parts of folio split code path to handle
>>>>>>>> device private folios properly. Details are below.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> 1. if the isolated check is not present, folio_get_anon_vma will fail and cause
>>>>>>>>> the split routine to return with -EBUSY
>>>>>>>>
>>>>>>>> You do something below instead.
>>>>>>>>
>>>>>>>> if (!anon_vma && !folio_is_device_private(folio)) {
>>>>>>>> ret = -EBUSY;
>>>>>>>> goto out;
>>>>>>>> } else if (anon_vma) {
>>>>>>>> anon_vma_lock_write(anon_vma);
>>>>>>>> }
>>>>>>>>
>>>>>>>
>>>>>>> folio_get_anon_vma() cannot be called for unmapped folios. In our case the page has
>>>>>>> already been unmapped. Is there a reason why you mix anon_vma_lock_write with
>>>>>>> the check for device private folios?
>>>>>>
>>>>>> Oh, I did not notice that anon_vma = folio_get_anon_vma(folio) is also
>>>>>> in if (!isolated) branch. In that case, just do
>>>>>>
>>>>>> if (folio_is_device_private(folio)) {
>>>>>> ...
>>>>>> } else if (is_anon) {
>>>>>> ...
>>>>>> } else {
>>>>>> ...
>>>>>> }
>>>>>>
>>>>>>>
>>>>>>>> People can then see that device private folio split needs special handling.
>>>>>>>>
>>>>>>>> BTW, why can a device private folio also be anonymous? Does it mean
>>>>>>>> that if a page cache folio is migrated to device private, the kernel
>>>>>>>> also sees it as both device private and file-backed?
>>>>>>>>
>>>>>>>
>>>>>>> FYI: device private folios only work with anonymous private pages, hence
>>>>>>> the name device private.
>>>>>>
>>>>>> OK.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>> 2. Going through unmap_page(), remap_page() causes a full page table walk, which
>>>>>>>>> the migrate_device API has already done as part of the migration. The
>>>>>>>>> entries under consideration are already migration entries in this case.
>>>>>>>>> This is wasteful and in some cases unexpected.
>>>>>>>>
>>>>>>>> unmap_folio() already adds TTU_SPLIT_HUGE_PMD to try to split
>>>>>>>> PMD mapping, which you did in migrate_vma_split_pages(). You probably
>>>>>>>> can teach either try_to_migrate() or try_to_unmap() to just split
>>>>>>>> device private PMD mapping. Or if that is not preferred,
>>>>>>>> you can simply call split_huge_pmd_address() when unmap_folio()
>>>>>>>> sees a device private folio.
>>>>>>>>
>>>>>>>> For remap_page(), you can simply return for device private folios
>>>>>>>> like it is currently doing for non anonymous folios.
>>>>>>>>
>>>>>>>
>>>>>>> Doing a full rmap walk does not make sense with unmap_folio() and
>>>>>>> remap_folio(), because
>>>>>>>
>>>>>>> 1. We need to do a page table walk/rmap walk again
>>>>>>> 2. We'll need special handling of migration <-> migration entries
>>>>>>> in the rmap handling (set/remove migration ptes)
>>>>>>> 3. In this context, the code is already in the middle of migration,
>>>>>>> so trying to do that again does not make sense.
>>>>>>
>>>>>> Why do the split in the middle of migration? The existing split code
>>>>>> assumes to-be-split folios are mapped.
>>>>>>
>>>>>> What prevents doing split before migration?
>>>>>>
>>>>>
>>>>> The code does do a split prior to migration if THP selection fails
>>>>>
>>>>> Please see https://lore.kernel.org/lkml/20250703233511.2028395-5-balbirs@nvidia.com/
>>>>> and the fallback part which calls split_folio()
>>>>
>>>> So this split is done when the folio in system memory is mapped.
>>>>
>>>>>
>>>>> But the case under consideration is special since the device needs to allocate
>>>>> corresponding pfn's as well. The changelog mentions it:
>>>>>
>>>>> "The common case that arises is that after setup, during migrate
>>>>> the destination might not be able to allocate MIGRATE_PFN_COMPOUND
>>>>> pages."
>>>>>
>>>>> I can expand on it, because migrate_vma() is a multi-phase operation
>>>>>
>>>>> 1. migrate_vma_setup()
>>>>> 2. migrate_vma_pages()
>>>>> 3. migrate_vma_finalize()
>>>>>
>>>>> It can so happen that, when the destination pfns are allocated, the destination
>>>>> is not able to allocate a large page, so we do the split in migrate_vma_pages().
>>>>>
>>>>> The pages have been unmapped and collected in migrate_vma_setup()
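To make that concrete, the driver-side flow is roughly the following (names are
placeholders; error handling and the device-side copy are elided):

        struct migrate_vma args = {
                .vma            = vma,
                .start          = start,
                .end            = end,
                .src            = src_pfns,
                .dst            = dst_pfns,
                .pgmap_owner    = drvdata,
                .flags          = MIGRATE_VMA_SELECT_SYSTEM |
                                  MIGRATE_VMA_SELECT_COMPOUND,
        };

        /* 1. collect and unmap the source pages */
        if (migrate_vma_setup(&args))
                return -EBUSY;

        /*
         * 2. the driver allocates destination pages and fills args.dst; if it
         *    cannot provide a large page for a MIGRATE_PFN_COMPOUND source
         *    entry, the source folio is split in migrate_vma_pages().
         */
        migrate_vma_pages(&args);

        /* 3. remove the migration entries and drop the source pages */
        migrate_vma_finalize(&args);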
>>>>
>>>> So these unmapped folios are system memory folios? I thought they are
>>>> large device private folios.
>>>>
>>>> OK. It sounds like splitting unmapped folios is really needed. I think
>>>> it is better to make a new split_unmapped_folio() function
>>>> by reusing __split_unmapped_folio(), since __folio_split() assumes
>>>> the input folio is mapped.
>>>
>>> And to make __split_unmapped_folio()'s functionality match its name,
>>> I will later refactor it. At least move local_irq_enable(), remap_page(),
>>> and folio_unlocks out of it. I will think about how to deal with
>>> lru_add_split_folio(). The goal is to remove the to-be-added "unmapped"
>>> parameter from __split_unmapped_folio().
>>>
>>
>> That sounds like a plan; it seems like there needs to be a finish phase of
>> the split and it does not belong to __split_unmapped_folio(). I would propose
>> that we rename "isolated" to "folio_is_migrating" and then your cleanups can
>> follow? Once your cleanups come in, we won't need to pass the parameter to
>> __split_unmapped_folio().
>
> Sure.
>
> The patch below should work. It only passed mm selftests and I am planning
> to do more. If you are brave enough, you can give it a try and use
> __split_unmapped_folio() from it.
>
> From e594924d689bef740c38d93c7c1653f31bd5ae83 Mon Sep 17 00:00:00 2001
> From: Zi Yan <ziy@nvidia.com>
> Date: Sun, 6 Jul 2025 22:40:53 -0400
> Subject: [PATCH] mm/huge_memory: move epilogue code out of
> __split_unmapped_folio()
>
> The code is not related to splitting unmapped folio operations. Move
> it out, so that __split_unmapped_folio() only does split work on unmapped
> folios.
>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>
The patch fails to apply for me; let me try to rebase it on top of this series.
Balbir
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 03/12] mm/thp: zone_device awareness in THP handling code
2025-07-07 3:49 ` Mika Penttilä
@ 2025-07-08 4:20 ` Balbir Singh
2025-07-08 4:30 ` Mika Penttilä
0 siblings, 1 reply; 99+ messages in thread
From: Balbir Singh @ 2025-07-08 4:20 UTC (permalink / raw)
To: Mika Penttilä, linux-mm
Cc: akpm, linux-kernel, Karol Herbst, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
David Hildenbrand, Barry Song, Baolin Wang, Ryan Roberts,
Matthew Wilcox, Peter Xu, Zi Yan, Kefeng Wang, Jane Chu,
Alistair Popple, Donet Tom
On 7/7/25 13:49, Mika Penttilä wrote:
>
> On 7/4/25 02:35, Balbir Singh wrote:
>> Make the THP handling code in the mm subsystem aware of zone
>> device pages. Although the code is designed to be generic when
>> it comes to handling the splitting of pages, it is written to
>> work for THP page sizes corresponding to HPAGE_PMD_NR.
>>
>> Modify page_vma_mapped_walk() to return true when a zone
>> device huge entry is present, enabling try_to_migrate()
>> and other migration code paths to appropriately process the
>> entry.
>>
>> pmd_pfn() does not work well with zone device entries; use
>> pfn_pmd_entry_to_swap() for checking and comparison of
>> zone device entries.
>>
>> try_to_map_to_unused_zeropage() does not apply to zone device
>> entries; such entries are ignored in the call.
>>
>> Cc: Karol Herbst <kherbst@redhat.com>
>> Cc: Lyude Paul <lyude@redhat.com>
>> Cc: Danilo Krummrich <dakr@kernel.org>
>> Cc: David Airlie <airlied@gmail.com>
>> Cc: Simona Vetter <simona@ffwll.ch>
>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>> Cc: Shuah Khan <shuah@kernel.org>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Barry Song <baohua@kernel.org>
>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Cc: Peter Xu <peterx@redhat.com>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>> Cc: Jane Chu <jane.chu@oracle.com>
>> Cc: Alistair Popple <apopple@nvidia.com>
>> Cc: Donet Tom <donettom@linux.ibm.com>
>>
>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>> ---
>> mm/huge_memory.c | 153 +++++++++++++++++++++++++++++++------------
>> mm/migrate.c | 2 +
>> mm/page_vma_mapped.c | 10 +++
>> mm/pgtable-generic.c | 6 ++
>> mm/rmap.c | 19 +++++-
>> 5 files changed, 146 insertions(+), 44 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index ce130225a8e5..e6e390d0308f 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1711,7 +1711,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>> if (unlikely(is_swap_pmd(pmd))) {
>> swp_entry_t entry = pmd_to_swp_entry(pmd);
>>
>> - VM_BUG_ON(!is_pmd_migration_entry(pmd));
>> + VM_BUG_ON(!is_pmd_migration_entry(pmd) &&
>> + !is_device_private_entry(entry));
>> if (!is_readable_migration_entry(entry)) {
>> entry = make_readable_migration_entry(
>> swp_offset(entry));
>> @@ -2222,10 +2223,17 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>> } else if (thp_migration_supported()) {
>> swp_entry_t entry;
>>
>> - VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
>> entry = pmd_to_swp_entry(orig_pmd);
>> folio = pfn_swap_entry_folio(entry);
>> flush_needed = 0;
>> +
>> + VM_BUG_ON(!is_pmd_migration_entry(*pmd) &&
>> + !folio_is_device_private(folio));
>> +
>> + if (folio_is_device_private(folio)) {
>> + folio_remove_rmap_pmd(folio, folio_page(folio, 0), vma);
>> + WARN_ON_ONCE(folio_mapcount(folio) < 0);
>> + }
>> } else
>> WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
>>
>> @@ -2247,6 +2255,15 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>> folio_mark_accessed(folio);
>> }
>>
>> + /*
>> + * Do a folio put on zone device private pages after
>> + * changes to mm_counter, because the folio_put() will
>> + * clean folio->mapping and the folio_test_anon() check
>> + * will not be usable.
>> + */
>> + if (folio_is_device_private(folio))
>> + folio_put(folio);
>> +
>> spin_unlock(ptl);
>> if (flush_needed)
>> tlb_remove_page_size(tlb, &folio->page, HPAGE_PMD_SIZE);
>> @@ -2375,7 +2392,8 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>> struct folio *folio = pfn_swap_entry_folio(entry);
>> pmd_t newpmd;
>>
>> - VM_BUG_ON(!is_pmd_migration_entry(*pmd));
>> + VM_BUG_ON(!is_pmd_migration_entry(*pmd) &&
>> + !folio_is_device_private(folio));
>> if (is_writable_migration_entry(entry)) {
>> /*
>> * A protection check is difficult so
>> @@ -2388,9 +2406,11 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>> newpmd = swp_entry_to_pmd(entry);
>> if (pmd_swp_soft_dirty(*pmd))
>> newpmd = pmd_swp_mksoft_dirty(newpmd);
>> - } else {
>> + } else if (is_writable_device_private_entry(entry)) {
>> + newpmd = swp_entry_to_pmd(entry);
>> + entry = make_device_exclusive_entry(swp_offset(entry));
>> + } else
>> newpmd = *pmd;
>> - }
>>
>> if (uffd_wp)
>> newpmd = pmd_swp_mkuffd_wp(newpmd);
>> @@ -2842,16 +2862,20 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>> struct page *page;
>> pgtable_t pgtable;
>> pmd_t old_pmd, _pmd;
>> - bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
>> - bool anon_exclusive = false, dirty = false;
>> + bool young, write, soft_dirty, uffd_wp = false;
>> + bool anon_exclusive = false, dirty = false, present = false;
>> unsigned long addr;
>> pte_t *pte;
>> int i;
>> + swp_entry_t swp_entry;
>>
>> VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
>> VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
>> VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma);
>> - VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd));
>> +
>> + VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd)
>> + && !(is_swap_pmd(*pmd) &&
>> + is_device_private_entry(pmd_to_swp_entry(*pmd))));
>>
>> count_vm_event(THP_SPLIT_PMD);
>>
>> @@ -2899,20 +2923,25 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>> return __split_huge_zero_page_pmd(vma, haddr, pmd);
>> }
>>
>> - pmd_migration = is_pmd_migration_entry(*pmd);
>> - if (unlikely(pmd_migration)) {
>> - swp_entry_t entry;
>>
>> + present = pmd_present(*pmd);
>> + if (unlikely(!present)) {
>> + swp_entry = pmd_to_swp_entry(*pmd);
>> old_pmd = *pmd;
>> - entry = pmd_to_swp_entry(old_pmd);
>> - page = pfn_swap_entry_to_page(entry);
>> - write = is_writable_migration_entry(entry);
>> +
>> + folio = pfn_swap_entry_folio(swp_entry);
>> + VM_BUG_ON(!is_migration_entry(swp_entry) &&
>> + !is_device_private_entry(swp_entry));
>> + page = pfn_swap_entry_to_page(swp_entry);
>> + write = is_writable_migration_entry(swp_entry);
>> +
>> if (PageAnon(page))
>> - anon_exclusive = is_readable_exclusive_migration_entry(entry);
>> - young = is_migration_entry_young(entry);
>> - dirty = is_migration_entry_dirty(entry);
>> + anon_exclusive =
>> + is_readable_exclusive_migration_entry(swp_entry);
>> soft_dirty = pmd_swp_soft_dirty(old_pmd);
>> uffd_wp = pmd_swp_uffd_wp(old_pmd);
>> + young = is_migration_entry_young(swp_entry);
>> + dirty = is_migration_entry_dirty(swp_entry);
>> } else {
>
> This is where folio_try_share_anon_rmap_pmd() is skipped for device private pages, to which I referred in
> https://lore.kernel.org/linux-mm/f1e26e18-83db-4c0e-b8d8-0af8ffa8a206@redhat.com/
>
Does it matter for device private pages/folios? It does not affect the freeze value.
Balbir Singh
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 03/12] mm/thp: zone_device awareness in THP handling code
2025-07-08 4:20 ` Balbir Singh
@ 2025-07-08 4:30 ` Mika Penttilä
0 siblings, 0 replies; 99+ messages in thread
From: Mika Penttilä @ 2025-07-08 4:30 UTC (permalink / raw)
To: Balbir Singh, linux-mm
Cc: akpm, linux-kernel, Karol Herbst, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
David Hildenbrand, Barry Song, Baolin Wang, Ryan Roberts,
Matthew Wilcox, Peter Xu, Zi Yan, Kefeng Wang, Jane Chu,
Alistair Popple, Donet Tom
On 7/8/25 07:20, Balbir Singh wrote:
> On 7/7/25 13:49, Mika Penttilä wrote:
>> On 7/4/25 02:35, Balbir Singh wrote:
>>> Make the THP handling code in the mm subsystem aware of zone
>>> device pages. Although the code is designed to be generic when
>>> it comes to handling the splitting of pages, it is written to
>>> work for THP page sizes corresponding to HPAGE_PMD_NR.
>>>
>>> Modify page_vma_mapped_walk() to return true when a zone
>>> device huge entry is present, enabling try_to_migrate()
>>> and other migration code paths to appropriately process the
>>> entry.
>>>
>>> pmd_pfn() does not work well with zone device entries; use
>>> pfn_pmd_entry_to_swap() for checking and comparison of
>>> zone device entries.
>>>
>>> try_to_map_to_unused_zeropage() does not apply to zone device
>>> entries; such entries are ignored in the call.
>>>
>>> Cc: Karol Herbst <kherbst@redhat.com>
>>> Cc: Lyude Paul <lyude@redhat.com>
>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>> Cc: David Airlie <airlied@gmail.com>
>>> Cc: Simona Vetter <simona@ffwll.ch>
>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>> Cc: Shuah Khan <shuah@kernel.org>
>>> Cc: David Hildenbrand <david@redhat.com>
>>> Cc: Barry Song <baohua@kernel.org>
>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>> Cc: Matthew Wilcox <willy@infradead.org>
>>> Cc: Peter Xu <peterx@redhat.com>
>>> Cc: Zi Yan <ziy@nvidia.com>
>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>> Cc: Jane Chu <jane.chu@oracle.com>
>>> Cc: Alistair Popple <apopple@nvidia.com>
>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>
>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>> ---
>>> mm/huge_memory.c | 153 +++++++++++++++++++++++++++++++------------
>>> mm/migrate.c | 2 +
>>> mm/page_vma_mapped.c | 10 +++
>>> mm/pgtable-generic.c | 6 ++
>>> mm/rmap.c | 19 +++++-
>>> 5 files changed, 146 insertions(+), 44 deletions(-)
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index ce130225a8e5..e6e390d0308f 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -1711,7 +1711,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>>> if (unlikely(is_swap_pmd(pmd))) {
>>> swp_entry_t entry = pmd_to_swp_entry(pmd);
>>>
>>> - VM_BUG_ON(!is_pmd_migration_entry(pmd));
>>> + VM_BUG_ON(!is_pmd_migration_entry(pmd) &&
>>> + !is_device_private_entry(entry));
>>> if (!is_readable_migration_entry(entry)) {
>>> entry = make_readable_migration_entry(
>>> swp_offset(entry));
>>> @@ -2222,10 +2223,17 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>> } else if (thp_migration_supported()) {
>>> swp_entry_t entry;
>>>
>>> - VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
>>> entry = pmd_to_swp_entry(orig_pmd);
>>> folio = pfn_swap_entry_folio(entry);
>>> flush_needed = 0;
>>> +
>>> + VM_BUG_ON(!is_pmd_migration_entry(*pmd) &&
>>> + !folio_is_device_private(folio));
>>> +
>>> + if (folio_is_device_private(folio)) {
>>> + folio_remove_rmap_pmd(folio, folio_page(folio, 0), vma);
>>> + WARN_ON_ONCE(folio_mapcount(folio) < 0);
>>> + }
>>> } else
>>> WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
>>>
>>> @@ -2247,6 +2255,15 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>> folio_mark_accessed(folio);
>>> }
>>>
>>> + /*
>>> + * Do a folio put on zone device private pages after
>>> + * changes to mm_counter, because the folio_put() will
>>> + * clean folio->mapping and the folio_test_anon() check
>>> + * will not be usable.
>>> + */
>>> + if (folio_is_device_private(folio))
>>> + folio_put(folio);
>>> +
>>> spin_unlock(ptl);
>>> if (flush_needed)
>>> tlb_remove_page_size(tlb, &folio->page, HPAGE_PMD_SIZE);
>>> @@ -2375,7 +2392,8 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>> struct folio *folio = pfn_swap_entry_folio(entry);
>>> pmd_t newpmd;
>>>
>>> - VM_BUG_ON(!is_pmd_migration_entry(*pmd));
>>> + VM_BUG_ON(!is_pmd_migration_entry(*pmd) &&
>>> + !folio_is_device_private(folio));
>>> if (is_writable_migration_entry(entry)) {
>>> /*
>>> * A protection check is difficult so
>>> @@ -2388,9 +2406,11 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>> newpmd = swp_entry_to_pmd(entry);
>>> if (pmd_swp_soft_dirty(*pmd))
>>> newpmd = pmd_swp_mksoft_dirty(newpmd);
>>> - } else {
>>> + } else if (is_writable_device_private_entry(entry)) {
>>> + newpmd = swp_entry_to_pmd(entry);
>>> + entry = make_device_exclusive_entry(swp_offset(entry));
>>> + } else
>>> newpmd = *pmd;
>>> - }
>>>
>>> if (uffd_wp)
>>> newpmd = pmd_swp_mkuffd_wp(newpmd);
>>> @@ -2842,16 +2862,20 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>> struct page *page;
>>> pgtable_t pgtable;
>>> pmd_t old_pmd, _pmd;
>>> - bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
>>> - bool anon_exclusive = false, dirty = false;
>>> + bool young, write, soft_dirty, uffd_wp = false;
>>> + bool anon_exclusive = false, dirty = false, present = false;
>>> unsigned long addr;
>>> pte_t *pte;
>>> int i;
>>> + swp_entry_t swp_entry;
>>>
>>> VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
>>> VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
>>> VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma);
>>> - VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd));
>>> +
>>> + VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd)
>>> + && !(is_swap_pmd(*pmd) &&
>>> + is_device_private_entry(pmd_to_swp_entry(*pmd))));
>>>
>>> count_vm_event(THP_SPLIT_PMD);
>>>
>>> @@ -2899,20 +2923,25 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>> return __split_huge_zero_page_pmd(vma, haddr, pmd);
>>> }
>>>
>>> - pmd_migration = is_pmd_migration_entry(*pmd);
>>> - if (unlikely(pmd_migration)) {
>>> - swp_entry_t entry;
>>>
>>> + present = pmd_present(*pmd);
>>> + if (unlikely(!present)) {
>>> + swp_entry = pmd_to_swp_entry(*pmd);
>>> old_pmd = *pmd;
>>> - entry = pmd_to_swp_entry(old_pmd);
>>> - page = pfn_swap_entry_to_page(entry);
>>> - write = is_writable_migration_entry(entry);
>>> +
>>> + folio = pfn_swap_entry_folio(swp_entry);
>>> + VM_BUG_ON(!is_migration_entry(swp_entry) &&
>>> + !is_device_private_entry(swp_entry));
>>> + page = pfn_swap_entry_to_page(swp_entry);
>>> + write = is_writable_migration_entry(swp_entry);
>>> +
>>> if (PageAnon(page))
>>> - anon_exclusive = is_readable_exclusive_migration_entry(entry);
>>> - young = is_migration_entry_young(entry);
>>> - dirty = is_migration_entry_dirty(entry);
>>> + anon_exclusive =
>>> + is_readable_exclusive_migration_entry(swp_entry);
>>> soft_dirty = pmd_swp_soft_dirty(old_pmd);
>>> uffd_wp = pmd_swp_uffd_wp(old_pmd);
>>> + young = is_migration_entry_young(swp_entry);
>>> + dirty = is_migration_entry_dirty(swp_entry);
>>> } else {
>> This is where folio_try_share_anon_rmap_pmd() is skipped for device private pages, to which I referred in
>> https://lore.kernel.org/linux-mm/f1e26e18-83db-4c0e-b8d8-0af8ffa8a206@redhat.com/
>>
> Does it matter for device private pages/folios? It does not affect the freeze value.
I think ClearPageAnonExclusive is needed.
>
> Balbir Singh
>
--Mika
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 03/12] mm/thp: zone_device awareness in THP handling code
2025-07-07 6:07 ` Alistair Popple
@ 2025-07-08 4:59 ` Balbir Singh
0 siblings, 0 replies; 99+ messages in thread
From: Balbir Singh @ 2025-07-08 4:59 UTC (permalink / raw)
To: Alistair Popple
Cc: linux-mm, akpm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Donet Tom
On 7/7/25 16:07, Alistair Popple wrote:
> On Fri, Jul 04, 2025 at 09:35:02AM +1000, Balbir Singh wrote:
>> Make the THP handling code in the mm subsystem aware of zone
>> device pages. Although the code is designed to be generic when
>> it comes to handling the splitting of pages, it is written to
>> work for THP page sizes corresponding to HPAGE_PMD_NR.
>>
>> Modify page_vma_mapped_walk() to return true when a zone
>> device huge entry is present, enabling try_to_migrate()
>> and other migration code paths to appropriately process the
>> entry.
>>
>> pmd_pfn() does not work well with zone device entries; use
>> pfn_pmd_entry_to_swap() for checking and comparison of
>> zone device entries.
>>
>> try_to_map_to_unused_zeropage() does not apply to zone device
>> entries; such entries are ignored in the call.
>>
>> Cc: Karol Herbst <kherbst@redhat.com>
>> Cc: Lyude Paul <lyude@redhat.com>
>> Cc: Danilo Krummrich <dakr@kernel.org>
>> Cc: David Airlie <airlied@gmail.com>
>> Cc: Simona Vetter <simona@ffwll.ch>
>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>> Cc: Shuah Khan <shuah@kernel.org>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Barry Song <baohua@kernel.org>
>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Cc: Peter Xu <peterx@redhat.com>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>> Cc: Jane Chu <jane.chu@oracle.com>
>> Cc: Alistair Popple <apopple@nvidia.com>
>> Cc: Donet Tom <donettom@linux.ibm.com>
>>
>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>> ---
>> mm/huge_memory.c | 153 +++++++++++++++++++++++++++++++------------
>> mm/migrate.c | 2 +
>> mm/page_vma_mapped.c | 10 +++
>> mm/pgtable-generic.c | 6 ++
>> mm/rmap.c | 19 +++++-
>> 5 files changed, 146 insertions(+), 44 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index ce130225a8e5..e6e390d0308f 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1711,7 +1711,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>> if (unlikely(is_swap_pmd(pmd))) {
>> swp_entry_t entry = pmd_to_swp_entry(pmd);
>>
>> - VM_BUG_ON(!is_pmd_migration_entry(pmd));
>> + VM_BUG_ON(!is_pmd_migration_entry(pmd) &&
>> + !is_device_private_entry(entry));
>> if (!is_readable_migration_entry(entry)) {
>> entry = make_readable_migration_entry(
>> swp_offset(entry));
>> @@ -2222,10 +2223,17 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>> } else if (thp_migration_supported()) {
>> swp_entry_t entry;
>>
>> - VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
>> entry = pmd_to_swp_entry(orig_pmd);
>> folio = pfn_swap_entry_folio(entry);
>> flush_needed = 0;
>> +
>> + VM_BUG_ON(!is_pmd_migration_entry(*pmd) &&
>> + !folio_is_device_private(folio));
>> +
>> + if (folio_is_device_private(folio)) {
>> + folio_remove_rmap_pmd(folio, folio_page(folio, 0), vma);
>> + WARN_ON_ONCE(folio_mapcount(folio) < 0);
>> + }
>> } else
>> WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
>>
>> @@ -2247,6 +2255,15 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>> folio_mark_accessed(folio);
>> }
>>
>> + /*
>> + * Do a folio put on zone device private pages after
>> + * changes to mm_counter, because the folio_put() will
>> + * clean folio->mapping and the folio_test_anon() check
>> + * will not be usable.
>> + */
>> + if (folio_is_device_private(folio))
>> + folio_put(folio);
>> +
>> spin_unlock(ptl);
>> if (flush_needed)
>> tlb_remove_page_size(tlb, &folio->page, HPAGE_PMD_SIZE);
>> @@ -2375,7 +2392,8 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>> struct folio *folio = pfn_swap_entry_folio(entry);
>> pmd_t newpmd;
>>
>> - VM_BUG_ON(!is_pmd_migration_entry(*pmd));
>> + VM_BUG_ON(!is_pmd_migration_entry(*pmd) &&
>> + !folio_is_device_private(folio));
>> if (is_writable_migration_entry(entry)) {
>> /*
>> * A protection check is difficult so
>> @@ -2388,9 +2406,11 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>> newpmd = swp_entry_to_pmd(entry);
>> if (pmd_swp_soft_dirty(*pmd))
>> newpmd = pmd_swp_mksoft_dirty(newpmd);
>> - } else {
>> + } else if (is_writable_device_private_entry(entry)) {
>> + newpmd = swp_entry_to_pmd(entry);
>> + entry = make_device_exclusive_entry(swp_offset(entry));
>
> Argh. The naming here is terribly confusing (of which I'm at least partly to
> blame) because it ended up clashing with David's PG_anon_exclusive which is a
> completely different concept - see 6c287605fd56 ("mm: remember exclusively
> mapped anonymous pages with PG_anon_exclusive").
>
> The exclusive entries you are creating here are for emulating atomic access -
> see the documentation for make_device_exclusive() for more details - and are
> almost certainly not what you want.
>
> As far as I understand things we don't need to create anon exclusive entries for
> device private pages because they can never be pinned, so likely you just want
> make_readable_device_private_entry() here. If we really want to track anon
> exclusive status you probably need pte_swp_exclusive(), but then we should do it
> for non-THP device private pages as well and that sounds like a whole different
> problem/patch series.
>
Thanks for catching this, I agree we don't need to track anon exclusive for device
private pages.
>> + } else
>> newpmd = *pmd;
>> - }
>>
>> if (uffd_wp)
>> newpmd = pmd_swp_mkuffd_wp(newpmd);
>> @@ -2842,16 +2862,20 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>> struct page *page;
>> pgtable_t pgtable;
>> pmd_t old_pmd, _pmd;
>> - bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
>> - bool anon_exclusive = false, dirty = false;
>> + bool young, write, soft_dirty, uffd_wp = false;
>> + bool anon_exclusive = false, dirty = false, present = false;
>> unsigned long addr;
>> pte_t *pte;
>> int i;
>> + swp_entry_t swp_entry;
>>
>> VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
>> VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
>> VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma);
>> - VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd));
>> +
>> + VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd)
>> + && !(is_swap_pmd(*pmd) &&
>> + is_device_private_entry(pmd_to_swp_entry(*pmd))));
>>
>> count_vm_event(THP_SPLIT_PMD);
>>
>> @@ -2899,20 +2923,25 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>> return __split_huge_zero_page_pmd(vma, haddr, pmd);
>> }
>>
>> - pmd_migration = is_pmd_migration_entry(*pmd);
>> - if (unlikely(pmd_migration)) {
>> - swp_entry_t entry;
>>
>> + present = pmd_present(*pmd);
>> + if (unlikely(!present)) {
>> + swp_entry = pmd_to_swp_entry(*pmd);
>> old_pmd = *pmd;
>> - entry = pmd_to_swp_entry(old_pmd);
>> - page = pfn_swap_entry_to_page(entry);
>> - write = is_writable_migration_entry(entry);
>> +
>> + folio = pfn_swap_entry_folio(swp_entry);
>> + VM_BUG_ON(!is_migration_entry(swp_entry) &&
>> + !is_device_private_entry(swp_entry));
>> + page = pfn_swap_entry_to_page(swp_entry);
>> + write = is_writable_migration_entry(swp_entry);
>> +
>> if (PageAnon(page))
>> - anon_exclusive = is_readable_exclusive_migration_entry(entry);
>> - young = is_migration_entry_young(entry);
>> - dirty = is_migration_entry_dirty(entry);
>> + anon_exclusive =
>> + is_readable_exclusive_migration_entry(swp_entry);
>> soft_dirty = pmd_swp_soft_dirty(old_pmd);
>> uffd_wp = pmd_swp_uffd_wp(old_pmd);
>> + young = is_migration_entry_young(swp_entry);
>> + dirty = is_migration_entry_dirty(swp_entry);
>
> This could be a device-private swp_entry right? In which case calling
> is_migration_entry_*() on them isn't correct. I suspect you want to have
> separate code paths for migration vs. device_private entries here.
>
Yep, I will split them up based on the entry type
>> } else {
>> /*
>> * Up to this point the pmd is present and huge and userland has
>> @@ -2996,30 +3025,45 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>> * Note that NUMA hinting access restrictions are not transferred to
>> * avoid any possibility of altering permissions across VMAs.
>> */
>> - if (freeze || pmd_migration) {
>> + if (freeze || !present) {
>> for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
>> pte_t entry;
>> - swp_entry_t swp_entry;
>> -
>> - if (write)
>> - swp_entry = make_writable_migration_entry(
>> - page_to_pfn(page + i));
>> - else if (anon_exclusive)
>> - swp_entry = make_readable_exclusive_migration_entry(
>> - page_to_pfn(page + i));
>> - else
>> - swp_entry = make_readable_migration_entry(
>> - page_to_pfn(page + i));
>> - if (young)
>> - swp_entry = make_migration_entry_young(swp_entry);
>> - if (dirty)
>> - swp_entry = make_migration_entry_dirty(swp_entry);
>> - entry = swp_entry_to_pte(swp_entry);
>> - if (soft_dirty)
>> - entry = pte_swp_mksoft_dirty(entry);
>> - if (uffd_wp)
>> - entry = pte_swp_mkuffd_wp(entry);
>> -
>> + if (freeze || is_migration_entry(swp_entry)) {
>> + if (write)
>> + swp_entry = make_writable_migration_entry(
>> + page_to_pfn(page + i));
>> + else if (anon_exclusive)
>> + swp_entry = make_readable_exclusive_migration_entry(
>> + page_to_pfn(page + i));
>> + else
>> + swp_entry = make_readable_migration_entry(
>> + page_to_pfn(page + i));
>> + if (young)
>> + swp_entry = make_migration_entry_young(swp_entry);
>> + if (dirty)
>> + swp_entry = make_migration_entry_dirty(swp_entry);
>> + entry = swp_entry_to_pte(swp_entry);
>> + if (soft_dirty)
>> + entry = pte_swp_mksoft_dirty(entry);
>> + if (uffd_wp)
>> + entry = pte_swp_mkuffd_wp(entry);
>> + } else {
>> + VM_BUG_ON(!is_device_private_entry(swp_entry));
>> + if (write)
>> + swp_entry = make_writable_device_private_entry(
>> + page_to_pfn(page + i));
>> + else if (anon_exclusive)
>> + swp_entry = make_device_exclusive_entry(
>
> As above, you're welcome for the naming :-)
>
:) We don't need to track these, I'll fix up the patch
>> + page_to_pfn(page + i));
>> + else
>> + swp_entry = make_readable_device_private_entry(
>> + page_to_pfn(page + i));
>> + entry = swp_entry_to_pte(swp_entry);
>> + if (soft_dirty)
>> + entry = pte_swp_mksoft_dirty(entry);
>> + if (uffd_wp)
>> + entry = pte_swp_mkuffd_wp(entry);
>> + }
>> VM_WARN_ON(!pte_none(ptep_get(pte + i)));
>> set_pte_at(mm, addr, pte + i, entry);
>> }
>> @@ -3046,7 +3090,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>> }
>> pte_unmap(pte);
>>
>> - if (!pmd_migration)
>> + if (present)
>> folio_remove_rmap_pmd(folio, page, vma);
>> if (freeze)
>> put_page(page);
>> @@ -3058,8 +3102,11 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>> void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
>> pmd_t *pmd, bool freeze)
>> {
>> +
>> VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
>> - if (pmd_trans_huge(*pmd) || is_pmd_migration_entry(*pmd))
>> + if (pmd_trans_huge(*pmd) || is_pmd_migration_entry(*pmd) ||
>> + (is_swap_pmd(*pmd) &&
>
> Should we create is_pmd_device_entry() to match is_pmd_migration_entry()?
>
Yes, I think that's reasonable, I'll look into it
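Something along these lines, mirroring is_pmd_migration_entry() (just a sketch; the final name and location may differ):

static inline int is_pmd_device_private_entry(pmd_t pmd)
{
	return is_swap_pmd(pmd) &&
	       is_device_private_entry(pmd_to_swp_entry(pmd));
}

which would let the check above collapse to:

	if (pmd_trans_huge(*pmd) || is_pmd_migration_entry(*pmd) ||
	    is_pmd_device_private_entry(*pmd))
		__split_huge_pmd_locked(vma, pmd, address, freeze);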
>> + is_device_private_entry(pmd_to_swp_entry(*pmd))))
>> __split_huge_pmd_locked(vma, pmd, address, freeze);
>> }
>>
>> @@ -3238,6 +3285,9 @@ static void lru_add_split_folio(struct folio *folio, struct folio *new_folio,
>> VM_BUG_ON_FOLIO(folio_test_lru(new_folio), folio);
>> lockdep_assert_held(&lruvec->lru_lock);
>>
>> + if (folio_is_device_private(folio))
>> + return;
>> +
>> if (list) {
>> /* page reclaim is reclaiming a huge page */
>> VM_WARN_ON(folio_test_lru(folio));
>> @@ -3252,6 +3302,7 @@ static void lru_add_split_folio(struct folio *folio, struct folio *new_folio,
>> list_add_tail(&new_folio->lru, &folio->lru);
>> folio_set_lru(new_folio);
>> }
>> +
>> }
>>
>> /* Racy check whether the huge page can be split */
>> @@ -3543,6 +3594,10 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
>> ((mapping || swap_cache) ?
>> folio_nr_pages(release) : 0));
>>
>> + if (folio_is_device_private(release))
>> + percpu_ref_get_many(&release->pgmap->ref,
>> + (1 << new_order) - 1);
>
> Is there a description somewhere for how we think pgmap->ref works for compound/
> higher-order device private pages? Not that it matters too much, I'd like to
> remove it. Maybe I can create a patch to do that which you can base on top of.
>
This bit is not needed, I'll be removing the above lines of code.
>> lru_add_split_folio(origin_folio, release, lruvec,
>> list);
>>
>> @@ -4596,7 +4651,10 @@ int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
>> return 0;
>>
>> flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
>> - pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
>> + if (!folio_is_device_private(folio))
>> + pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
>> + else
>> + pmdval = pmdp_huge_clear_flush(vma, address, pvmw->pmd);
>
> Do we need to flush? A device private entry is already non-present so is the
> flush necessary?
>
We're clearing the entry as well, why do you think a flush is not required?
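(For completeness, the alternative being hinted at would presumably look like the sketch below; whether skipping the flush is actually safe for a non-present device-private entry is exactly the open question.)

	if (!folio_is_device_private(folio))
		pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
	else
		/* entry is non-present, so clear it without a TLB flush */
		pmdval = pmdp_huge_get_and_clear(vma->vm_mm, address,
						 pvmw->pmd);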
>>
>> /* See folio_try_share_anon_rmap_pmd(): invalidate PMD first. */
>> anon_exclusive = folio_test_anon(folio) && PageAnonExclusive(page);
>> @@ -4646,6 +4704,17 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
>> entry = pmd_to_swp_entry(*pvmw->pmd);
>> folio_get(folio);
>> pmde = folio_mk_pmd(folio, READ_ONCE(vma->vm_page_prot));
>> +
>> + if (unlikely(folio_is_device_private(folio))) {
>> + if (pmd_write(pmde))
>> + entry = make_writable_device_private_entry(
>> + page_to_pfn(new));
>> + else
>> + entry = make_readable_device_private_entry(
>> + page_to_pfn(new));
>> + pmde = swp_entry_to_pmd(entry);
>> + }
>> +
>> if (pmd_swp_soft_dirty(*pvmw->pmd))
>> pmde = pmd_mksoft_dirty(pmde);
>> if (is_writable_migration_entry(entry))
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index 767f503f0875..0b6ecf559b22 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -200,6 +200,8 @@ static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
>>
>> if (PageCompound(page))
>> return false;
>> + if (folio_is_device_private(folio))
>> + return false;
>> VM_BUG_ON_PAGE(!PageAnon(page), page);
>> VM_BUG_ON_PAGE(!PageLocked(page), page);
>> VM_BUG_ON_PAGE(pte_present(ptep_get(pvmw->pte)), page);
>> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
>> index e981a1a292d2..ff8254e52de5 100644
>> --- a/mm/page_vma_mapped.c
>> +++ b/mm/page_vma_mapped.c
>> @@ -277,6 +277,16 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>> * cannot return prematurely, while zap_huge_pmd() has
>> * cleared *pmd but not decremented compound_mapcount().
>> */
>> + swp_entry_t entry;
>> +
>> + if (!thp_migration_supported())
>> + return not_found(pvmw);
>> + entry = pmd_to_swp_entry(pmde);
>> + if (is_device_private_entry(entry)) {
>> + pvmw->ptl = pmd_lock(mm, pvmw->pmd);
>> + return true;
>
> Do other callers of page_vma_mapped_walk() need to be updated now that large
> device private pages may be returned?
>
I think we probably need a new flag in the page walk code to return true for device
private entries at points where the callers/walkers care about tracking them.
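As a rough illustration of that idea (the flag name and bit value below are hypothetical, not part of this series):

/* hypothetical flag: caller opts in to seeing device-private PMD entries */
#define PVMW_THP_DEVICE_PRIVATE	(1 << 2)

	entry = pmd_to_swp_entry(pmde);
	if (is_device_private_entry(entry)) {
		if (!(pvmw->flags & PVMW_THP_DEVICE_PRIVATE))
			return not_found(pvmw);
		pvmw->ptl = pmd_lock(mm, pvmw->pmd);
		return true;
	}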
>> + }
>> +
>> if ((pvmw->flags & PVMW_SYNC) &&
>> thp_vma_suitable_order(vma, pvmw->address,
>> PMD_ORDER) &&
>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>> index 567e2d084071..604e8206a2ec 100644
>> --- a/mm/pgtable-generic.c
>> +++ b/mm/pgtable-generic.c
>> @@ -292,6 +292,12 @@ pte_t *___pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
>> *pmdvalp = pmdval;
>> if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
>> goto nomap;
>> + if (is_swap_pmd(pmdval)) {
>> + swp_entry_t entry = pmd_to_swp_entry(pmdval);
>> +
>> + if (is_device_private_entry(entry))
>> + goto nomap;
>> + }
>> if (unlikely(pmd_trans_huge(pmdval)))
>> goto nomap;
>> if (unlikely(pmd_bad(pmdval))) {
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index bd83724d14b6..da1e5b03e1fe 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -2336,8 +2336,23 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>> break;
>> }
>> #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
>> - subpage = folio_page(folio,
>> - pmd_pfn(*pvmw.pmd) - folio_pfn(folio));
>> + /*
>> + * Zone device private folios do not work well with
>> + * pmd_pfn() on some architectures due to pte
>> + * inversion.
>> + */
>> + if (folio_is_device_private(folio)) {
>> + swp_entry_t entry = pmd_to_swp_entry(*pvmw.pmd);
>> + unsigned long pfn = swp_offset_pfn(entry);
>> +
>> + subpage = folio_page(folio, pfn
>> + - folio_pfn(folio));
>> + } else {
>> + subpage = folio_page(folio,
>> + pmd_pfn(*pvmw.pmd)
>> + - folio_pfn(folio));
>> + }
>> +
>> VM_BUG_ON_FOLIO(folio_test_hugetlb(folio) ||
>> !folio_test_pmd_mappable(folio), folio);
>>
>> --
>> 2.49.0
>>
Thanks for the detailed feedback
Balbir
* Re: [v1 resend 01/12] mm/zone_device: support large zone device private folios
2025-07-07 5:28 ` Alistair Popple
@ 2025-07-08 6:47 ` Balbir Singh
0 siblings, 0 replies; 99+ messages in thread
From: Balbir Singh @ 2025-07-08 6:47 UTC (permalink / raw)
To: Alistair Popple
Cc: linux-mm, akpm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Donet Tom
On 7/7/25 15:28, Alistair Popple wrote:
> On Fri, Jul 04, 2025 at 09:35:00AM +1000, Balbir Singh wrote:
>> Add routines to support allocation of large order zone device folios
>> and helper functions for zone device folios, to check if a folio is
>> device private and helpers for setting zone device data.
>>
>> When large folios are used, the existing page_free() callback in
>> pgmap is called when the folio is freed, this is true for both
>> PAGE_SIZE and higher order pages.
>>
>> Cc: Karol Herbst <kherbst@redhat.com>
>> Cc: Lyude Paul <lyude@redhat.com>
>> Cc: Danilo Krummrich <dakr@kernel.org>
>> Cc: David Airlie <airlied@gmail.com>
>> Cc: Simona Vetter <simona@ffwll.ch>
>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>> Cc: Shuah Khan <shuah@kernel.org>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Barry Song <baohua@kernel.org>
>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Cc: Peter Xu <peterx@redhat.com>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>> Cc: Jane Chu <jane.chu@oracle.com>
>> Cc: Alistair Popple <apopple@nvidia.com>
>> Cc: Donet Tom <donettom@linux.ibm.com>
>>
>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>> ---
>> include/linux/memremap.h | 22 +++++++++++++++++-
>> mm/memremap.c | 50 +++++++++++++++++++++++++++++-----------
>> 2 files changed, 58 insertions(+), 14 deletions(-)
>>
>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>> index 4aa151914eab..11d586dd8ef1 100644
>> --- a/include/linux/memremap.h
>> +++ b/include/linux/memremap.h
>> @@ -169,6 +169,18 @@ static inline bool folio_is_device_private(const struct folio *folio)
>> return is_device_private_page(&folio->page);
>> }
>>
>> +static inline void *folio_zone_device_data(const struct folio *folio)
>> +{
>> + VM_BUG_ON_FOLIO(!folio_is_device_private(folio), folio);
>> + return folio->page.zone_device_data;
>> +}
>> +
>> +static inline void folio_set_zone_device_data(struct folio *folio, void *data)
>> +{
>> + VM_BUG_ON_FOLIO(!folio_is_device_private(folio), folio);
>> + folio->page.zone_device_data = data;
>> +}
>> +
>> static inline bool is_pci_p2pdma_page(const struct page *page)
>> {
>> return IS_ENABLED(CONFIG_PCI_P2PDMA) &&
>> @@ -199,7 +211,7 @@ static inline bool folio_is_fsdax(const struct folio *folio)
>> }
>>
>> #ifdef CONFIG_ZONE_DEVICE
>> -void zone_device_page_init(struct page *page);
>> +void init_zone_device_folio(struct folio *folio, unsigned int order);
>> void *memremap_pages(struct dev_pagemap *pgmap, int nid);
>> void memunmap_pages(struct dev_pagemap *pgmap);
>> void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
>> @@ -209,6 +221,14 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
>> bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);
>>
>> unsigned long memremap_compat_align(void);
>> +
>> +static inline void zone_device_page_init(struct page *page)
>> +{
>> + struct folio *folio = page_folio(page);
>> +
>> + init_zone_device_folio(folio, 0);
>
> Minor nit, but why not call this zone_device_folio_init() to keep the naming
> consistent with zone_device_page_init()?
>
Ack, will do!
>> +}
>> +
>> #else
>> static inline void *devm_memremap_pages(struct device *dev,
>> struct dev_pagemap *pgmap)
>> diff --git a/mm/memremap.c b/mm/memremap.c
>> index b0ce0d8254bd..4085a3893e64 100644
>> --- a/mm/memremap.c
>> +++ b/mm/memremap.c
>> @@ -427,20 +427,21 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
>> void free_zone_device_folio(struct folio *folio)
>> {
>> struct dev_pagemap *pgmap = folio->pgmap;
>> + unsigned int nr = folio_nr_pages(folio);
>> + int i;
>> + bool anon = folio_test_anon(folio);
>> + struct page *page = folio_page(folio, 0);
>>
>> if (WARN_ON_ONCE(!pgmap))
>> return;
>>
>> mem_cgroup_uncharge(folio);
>>
>> - /*
>> - * Note: we don't expect anonymous compound pages yet. Once supported
>> - * and we could PTE-map them similar to THP, we'd have to clear
>> - * PG_anon_exclusive on all tail pages.
>> - */
>> - if (folio_test_anon(folio)) {
>> - VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>> - __ClearPageAnonExclusive(folio_page(folio, 0));
>> + WARN_ON_ONCE(folio_test_large(folio) && !anon);
>> +
>> + for (i = 0; i < nr; i++) {
>
> The above comment says we should do this for all tail pages, but this appears to
> do it for the head page as well. Is there a particular reason for that?
>
The original code clears the head page (when the folio is not large, it is the only
page). I don't think the head page can be skipped.
>> + if (anon)
>> + __ClearPageAnonExclusive(folio_page(folio, i));
>> }
>>
>> /*
>> @@ -464,10 +465,19 @@ void free_zone_device_folio(struct folio *folio)
>>
>> switch (pgmap->type) {
>> case MEMORY_DEVICE_PRIVATE:
>> + if (folio_test_large(folio)) {
>> + folio_unqueue_deferred_split(folio);
>> +
>> + percpu_ref_put_many(&folio->pgmap->ref, nr - 1);
>> + }
>> + pgmap->ops->page_free(page);
>> + put_dev_pagemap(pgmap);
>
> Why is this needed/added, and where is the associated get_dev_pagemap()? Note
> that the whole {get|put}_dev_pagemap() thing is basically unused now. Which
> reminds me I should send a patch to remove it.
>
Thanks, I'll remove these bits
>> + page->mapping = NULL;
>> + break;
>> case MEMORY_DEVICE_COHERENT:
>> if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->page_free))
>> break;
>> - pgmap->ops->page_free(folio_page(folio, 0));
>> + pgmap->ops->page_free(page);
>> put_dev_pagemap(pgmap);
>> break;
>>
>> @@ -491,14 +501,28 @@ void free_zone_device_folio(struct folio *folio)
>> }
>> }
>>
>> -void zone_device_page_init(struct page *page)
>> +void init_zone_device_folio(struct folio *folio, unsigned int order)
>
> See above for some bike-shedding on the name.
>
Ack
>> {
>> + struct page *page = folio_page(folio, 0);
>> +
>> + VM_BUG_ON(order > MAX_ORDER_NR_PAGES);
>> +
>> + WARN_ON_ONCE(order && order != HPAGE_PMD_ORDER);
>> +
>> /*
>> * Drivers shouldn't be allocating pages after calling
>> * memunmap_pages().
>> */
>> - WARN_ON_ONCE(!percpu_ref_tryget_live(&page_pgmap(page)->ref));
>> - set_page_count(page, 1);
>> + WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << order));
>> + folio_set_count(folio, 1);
>> lock_page(page);
>> +
>> + /*
>> + * Only PMD level migration is supported for THP migration
>> + */
>> + if (order > 1) {
>> + prep_compound_page(page, order);
>
> Shouldn't this happen for order > 0 not 1? What about calling
> INIT_LIST_HEAD(&folio->_deferred_list)? Last time I looked prep_compound_page()
> didn't do that and I see above you are calling folio_unqueue_deferred_split() so
> I assume you need to do this for DEVICE_PRIVATE pages too.
>
Folios of order == 1 have no _deferred_list; prep_compound_page() handles the INIT_LIST_HEAD() for larger orders.
>> + folio_set_large_rmappable(folio);
>> + }
>> }
>> -EXPORT_SYMBOL_GPL(zone_device_page_init);
>> +EXPORT_SYMBOL_GPL(init_zone_device_folio);
>> --
>> 2.49.0
>>
Thanks for the review
Balbir
* Re: [v1 resend 02/12] mm/migrate_device: flags for selecting device private THP pages
2025-07-07 5:31 ` Alistair Popple
@ 2025-07-08 7:31 ` Balbir Singh
2025-07-19 20:06 ` Matthew Brost
0 siblings, 1 reply; 99+ messages in thread
From: Balbir Singh @ 2025-07-08 7:31 UTC (permalink / raw)
To: Alistair Popple
Cc: linux-mm, akpm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Donet Tom
On 7/7/25 15:31, Alistair Popple wrote:
> On Fri, Jul 04, 2025 at 09:35:01AM +1000, Balbir Singh wrote:
>> Add flags to mark zone device migration pages.
>>
>> MIGRATE_VMA_SELECT_COMPOUND will be used to select THP pages during
>> migrate_vma_setup() and MIGRATE_PFN_COMPOUND will make migrating
>> device pages as compound pages during device pfn migration.
>>
>> Cc: Karol Herbst <kherbst@redhat.com>
>> Cc: Lyude Paul <lyude@redhat.com>
>> Cc: Danilo Krummrich <dakr@kernel.org>
>> Cc: David Airlie <airlied@gmail.com>
>> Cc: Simona Vetter <simona@ffwll.ch>
>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>> Cc: Shuah Khan <shuah@kernel.org>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Barry Song <baohua@kernel.org>
>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Cc: Peter Xu <peterx@redhat.com>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>> Cc: Jane Chu <jane.chu@oracle.com>
>> Cc: Alistair Popple <apopple@nvidia.com>
>> Cc: Donet Tom <donettom@linux.ibm.com>
>>
>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>> ---
>> include/linux/migrate.h | 2 ++
>> 1 file changed, 2 insertions(+)
>>
>> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
>> index aaa2114498d6..1661e2d5479a 100644
>> --- a/include/linux/migrate.h
>> +++ b/include/linux/migrate.h
>> @@ -167,6 +167,7 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
>> #define MIGRATE_PFN_VALID (1UL << 0)
>> #define MIGRATE_PFN_MIGRATE (1UL << 1)
>> #define MIGRATE_PFN_WRITE (1UL << 3)
>> +#define MIGRATE_PFN_COMPOUND (1UL << 4)
>
> Why is this necessary? Couldn't migrate_vma just use folio_order() to figure out
> if it's a compound page or not?
>
I can definitely explore that angle. As we move towards mTHP, we'll need additional bits for the various order sizes as well.
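For reference, the folio_order() angle on the source side would be something like the sketch below (purely illustrative):

	struct page *page = migrate_pfn_to_page(migrate->src[i]);
	struct folio *folio = page ? page_folio(page) : NULL;
	unsigned int order = (folio && folio_test_large(folio)) ?
			     folio_order(folio) : 0;

	/* order, rather than a single COMPOUND bit, drives the dst allocation */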
Balbir Singh
* Re: [v1 resend 08/12] mm/thp: add split during migration support
2025-07-07 3:29 ` Mika Penttilä
@ 2025-07-08 7:37 ` Balbir Singh
0 siblings, 0 replies; 99+ messages in thread
From: Balbir Singh @ 2025-07-08 7:37 UTC (permalink / raw)
To: Mika Penttilä, linux-mm
Cc: akpm, linux-kernel, Karol Herbst, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
David Hildenbrand, Barry Song, Baolin Wang, Ryan Roberts,
Matthew Wilcox, Peter Xu, Zi Yan, Kefeng Wang, Jane Chu,
Alistair Popple, Donet Tom
On 7/7/25 13:29, Mika Penttilä wrote:
> On 7/7/25 05:35, Balbir Singh wrote:
>> On 7/5/25 13:17, Mika Penttilä wrote:
>>>>>>> +static void migrate_vma_split_pages(struct migrate_vma *migrate,
>>>>>>> + unsigned long idx, unsigned long addr,
>>>>>>> + struct folio *folio)
>>>>>>> +{
>>>>>>> + unsigned long i;
>>>>>>> + unsigned long pfn;
>>>>>>> + unsigned long flags;
>>>>>>> +
>>>>>>> + folio_get(folio);
>>>>>>> + split_huge_pmd_address(migrate->vma, addr, true);
>>>>>>> + __split_huge_page_to_list_to_order(folio_page(folio, 0), NULL, 0, true);
>>>>>> We already have reference to folio, why is folio_get() needed ?
>>>>>>
>>>>>> Splitting the page splits pmd for anon folios, why is there split_huge_pmd_address() ?
>>>>> Oh I see
>>>>> + if (!isolated)
>>>>> + unmap_folio(folio);
>>>>>
>>>>> which explains the explicit split_huge_pmd_address(migrate->vma, addr, true);
>>>>>
>>>>> Still, why the folio_get(folio);?
>>>>>
>>>>>
>>>> That is for split_huge_pmd_address, when called with freeze=true, it drops the
>>>> ref count on the page
>>>>
>>>> if (freeze)
>>>> put_page(page);
>>>>
>>>> Balbir
>>>>
>>> yeah I guess you could have used the pmd_migration path in __split_huge_pmd_locked, and not use freeze because you have installed the migration pmd entry already.
>>> Which brings to a bigger concern, that you do need the freeze semantics, like clear PageAnonExclusive (which may fail). I think you did not get this part
>>> right in the 3/12 patch. And in this patch, you can't assume the split succeeds, which would mean you can't migrate the range at all.
>>> Doing the split this late is quite problematic all in all.
>>>
>> Clearing PageAnonExclusive will *not* fail for device private pages from what I can see in __folio_try_share_anon_rmap().
>> Doing the split late is a requirement due to the nature of the three stage migration operation, the other side
>> might fail to allocate THP sized pages and so the code needs to deal with it
>>
>> Balbir Singh
>
> Yes seems clearing PageAnonExclusive doesn't fail for device private pages in the end,
> but the 3/12 patch doesn't even try to clear PageAnonExclusive with your changes afaics,
> which is a separate issue.
>
> And __split_huge_page_to_list_to_order() (return value is not checked) can fail for out of memory.
> So think you can not just assume split just works. If late split is a requirement (which I can understand is),
> you should be prepared to rollback somehow the operation.
>
I'll add a check; rolling back is just a matter of setting up the entries so they are not migrated.
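Roughly (a sketch against the helper introduced in this patch; the exact error propagation is illustrative):

	ret = __split_huge_page_to_list_to_order(folio_page(folio, 0),
						 NULL, 0, true);
	if (ret) {
		/* split failed: mark the entry as not migrating so it gets restored */
		migrate->src[idx] &= ~MIGRATE_PFN_MIGRATE;
		return ret;
	}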
Thanks,
Balbir Singh
* Re: [v1 resend 03/12] mm/thp: zone_device awareness in THP handling code
2025-07-07 6:09 ` Alistair Popple
@ 2025-07-08 7:40 ` Balbir Singh
0 siblings, 0 replies; 99+ messages in thread
From: Balbir Singh @ 2025-07-08 7:40 UTC (permalink / raw)
To: Alistair Popple
Cc: Mika Penttilä, linux-mm, akpm, linux-kernel, Karol Herbst,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Donet Tom
On 7/7/25 16:09, Alistair Popple wrote:
> On Sat, Jul 05, 2025 at 10:14:18AM +1000, Balbir Singh wrote:
>> On 7/4/25 21:10, Mika Penttilä wrote:
>>>> /* Racy check whether the huge page can be split */
>>>> @@ -3543,6 +3594,10 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
>>>> ((mapping || swap_cache) ?
>>>> folio_nr_pages(release) : 0));
>>>>
>>>> + if (folio_is_device_private(release))
>>>> + percpu_ref_get_many(&release->pgmap->ref,
>>>> + (1 << new_order) - 1);
>>>
>>> pgmap refcount should not be modified here, count should remain the same after the split also
>
> Agreed.
>
>>>
>>>
>>
>> Good point, let me revisit the accounting
>
> Yes, hopefully we can just delete it.
>
>> For this patch series, the tests did not catch it since new ref's evaluate to 0
>
> You may not notice bad accounting here unless you unload the kernel module,
> which can hang during memunmap() pages waiting for the refcount to go to zero.
>
The tests do have an eviction test, which checks that all pages can indeed be evicted,
and I do unload/reload the driver.
Balbir Singh
* Re: [v1 resend 08/12] mm/thp: add split during migration support
2025-07-07 2:45 ` Zi Yan
2025-07-08 3:31 ` Balbir Singh
@ 2025-07-08 7:43 ` Balbir Singh
1 sibling, 0 replies; 99+ messages in thread
From: Balbir Singh @ 2025-07-08 7:43 UTC (permalink / raw)
To: Zi Yan
Cc: linux-mm, akpm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
>>
>> That sounds like a plan, it seems like there needs to be a finish phase of
>> the split and it does not belong to __split_unmapped_folio(). I would propose
>> that we rename "isolated" to "folio_is_migrating" and then your cleanups can
>> follow? Once your cleanups come in, we won't need to pass the parameter to
>> __split_unmapped_folio().
>
> Sure.
>
> The patch below should work. It only passed mm selftests and I am planning
> to do more. If you are brave enough, you can give it a try and use
> __split_unmapped_folio() from it.
>
> From e594924d689bef740c38d93c7c1653f31bd5ae83 Mon Sep 17 00:00:00 2001
> From: Zi Yan <ziy@nvidia.com>
> Date: Sun, 6 Jul 2025 22:40:53 -0400
> Subject: [PATCH] mm/huge_memory: move epilogue code out of
> __split_unmapped_folio()
>
> The code is not related to splitting unmapped folio operations. Move
> it out, so that __split_unmapped_folio() only do split works on unmapped
> folios.
>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
> mm/huge_memory.c | 226 ++++++++++++++++++++++++-----------------------
> 1 file changed, 116 insertions(+), 110 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 3eb1c34be601..6eead616583f 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3396,9 +3396,6 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
> * order - 1 to new_order).
> * @split_at: in buddy allocator like split, the folio containing @split_at
> * will be split until its order becomes @new_order.
> - * @lock_at: the folio containing @lock_at is left locked for caller.
> - * @list: the after split folios will be added to @list if it is not NULL,
> - * otherwise to LRU lists.
> * @end: the end of the file @folio maps to. -1 if @folio is anonymous memory.
> * @xas: xa_state pointing to folio->mapping->i_pages and locked by caller
> * @mapping: @folio->mapping
> @@ -3436,40 +3433,20 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
> * split. The caller needs to check the input folio.
> */
> static int __split_unmapped_folio(struct folio *folio, int new_order,
> - struct page *split_at, struct page *lock_at,
> - struct list_head *list, pgoff_t end,
> - struct xa_state *xas, struct address_space *mapping,
> - bool uniform_split)
> + struct page *split_at, struct xa_state *xas,
> + struct address_space *mapping,
> + bool uniform_split)
> {
> - struct lruvec *lruvec;
> - struct address_space *swap_cache = NULL;
> - struct folio *origin_folio = folio;
> - struct folio *next_folio = folio_next(folio);
> - struct folio *new_folio;
> struct folio *next;
> int order = folio_order(folio);
> int split_order;
> int start_order = uniform_split ? new_order : order - 1;
> - int nr_dropped = 0;
> int ret = 0;
> bool stop_split = false;
>
> - if (folio_test_swapcache(folio)) {
> - VM_BUG_ON(mapping);
> -
> - /* a swapcache folio can only be uniformly split to order-0 */
> - if (!uniform_split || new_order != 0)
> - return -EINVAL;
> -
> - swap_cache = swap_address_space(folio->swap);
> - xa_lock(&swap_cache->i_pages);
> - }
> -
> if (folio_test_anon(folio))
> mod_mthp_stat(order, MTHP_STAT_NR_ANON, -1);
>
> - /* lock lru list/PageCompound, ref frozen by page_ref_freeze */
> - lruvec = folio_lruvec_lock(folio);
>
> folio_clear_has_hwpoisoned(folio);
>
> @@ -3541,89 +3518,10 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
> MTHP_STAT_NR_ANON, 1);
> }
>
> - /*
> - * origin_folio should be kept frozon until page cache
> - * entries are updated with all the other after-split
> - * folios to prevent others seeing stale page cache
> - * entries.
> - */
> - if (release == origin_folio)
> - continue;
> -
> - folio_ref_unfreeze(release, 1 +
> - ((mapping || swap_cache) ?
> - folio_nr_pages(release) : 0));
> -
> - lru_add_split_folio(origin_folio, release, lruvec,
> - list);
> -
> - /* Some pages can be beyond EOF: drop them from cache */
> - if (release->index >= end) {
> - if (shmem_mapping(mapping))
> - nr_dropped += folio_nr_pages(release);
> - else if (folio_test_clear_dirty(release))
> - folio_account_cleaned(release,
> - inode_to_wb(mapping->host));
> - __filemap_remove_folio(release, NULL);
> - folio_put_refs(release, folio_nr_pages(release));
> - } else if (mapping) {
> - __xa_store(&mapping->i_pages,
> - release->index, release, 0);
> - } else if (swap_cache) {
> - __xa_store(&swap_cache->i_pages,
> - swap_cache_index(release->swap),
> - release, 0);
> - }
> }
> }
>
> - /*
> - * Unfreeze origin_folio only after all page cache entries, which used
> - * to point to it, have been updated with new folios. Otherwise,
> - * a parallel folio_try_get() can grab origin_folio and its caller can
> - * see stale page cache entries.
> - */
> - folio_ref_unfreeze(origin_folio, 1 +
> - ((mapping || swap_cache) ? folio_nr_pages(origin_folio) : 0));
> -
> - unlock_page_lruvec(lruvec);
> -
> - if (swap_cache)
> - xa_unlock(&swap_cache->i_pages);
> - if (mapping)
> - xa_unlock(&mapping->i_pages);
>
> - /* Caller disabled irqs, so they are still disabled here */
> - local_irq_enable();
> -
> - if (nr_dropped)
> - shmem_uncharge(mapping->host, nr_dropped);
> -
> - remap_page(origin_folio, 1 << order,
> - folio_test_anon(origin_folio) ?
> - RMP_USE_SHARED_ZEROPAGE : 0);
> -
> - /*
> - * At this point, folio should contain the specified page.
> - * For uniform split, it is left for caller to unlock.
> - * For buddy allocator like split, the first after-split folio is left
> - * for caller to unlock.
> - */
> - for (new_folio = origin_folio; new_folio != next_folio; new_folio = next) {
> - next = folio_next(new_folio);
> - if (new_folio == page_folio(lock_at))
> - continue;
> -
> - folio_unlock(new_folio);
> - /*
> - * Subpages may be freed if there wasn't any mapping
> - * like if add_to_swap() is running on a lru page that
> - * had its mapping zapped. And freeing these pages
> - * requires taking the lru_lock so we do the put_page
> - * of the tail pages after the split is complete.
> - */
> - free_folio_and_swap_cache(new_folio);
> - }
> return ret;
> }
>
> @@ -3706,10 +3604,12 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> {
> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
> XA_STATE(xas, &folio->mapping->i_pages, folio->index);
> + struct folio *next_folio = folio_next(folio);
> bool is_anon = folio_test_anon(folio);
> struct address_space *mapping = NULL;
> struct anon_vma *anon_vma = NULL;
> int order = folio_order(folio);
> + struct folio *new_folio, *next;
> int extra_pins, ret;
> pgoff_t end;
> bool is_hzp;
> @@ -3840,6 +3740,10 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> /* Prevent deferred_split_scan() touching ->_refcount */
> spin_lock(&ds_queue->split_queue_lock);
> if (folio_ref_freeze(folio, 1 + extra_pins)) {
> + struct address_space *swap_cache = NULL;
> + struct lruvec *lruvec;
> + int nr_dropped = 0;
> +
> if (folio_order(folio) > 1 &&
> !list_empty(&folio->_deferred_list)) {
> ds_queue->split_queue_len--;
> @@ -3873,19 +3777,121 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> }
> }
>
> - ret = __split_unmapped_folio(folio, new_order,
> - split_at, lock_at, list, end, &xas, mapping,
> - uniform_split);
> + if (folio_test_swapcache(folio)) {
> + VM_BUG_ON(mapping);
> +
> + /* a swapcache folio can only be uniformly split to order-0 */
> + if (!uniform_split || new_order != 0) {
> + ret = -EINVAL;
> + goto out_unlock;
> + }
> +
> + swap_cache = swap_address_space(folio->swap);
> + xa_lock(&swap_cache->i_pages);
> + }
> +
> + /* lock lru list/PageCompound, ref frozen by page_ref_freeze */
> + lruvec = folio_lruvec_lock(folio);
> +
> + ret = __split_unmapped_folio(folio, new_order, split_at, &xas,
> + mapping, uniform_split);
> +
> + /* Unfreeze after-split folios */
> + for (new_folio = folio; new_folio != next_folio;
> + new_folio = next) {
> + next = folio_next(new_folio);
> + /*
> + * @folio should be kept frozon until page cache
> + * entries are updated with all the other after-split
> + * folios to prevent others seeing stale page cache
> + * entries.
> + */
> + if (new_folio == folio)
> + continue;
> +
> + folio_ref_unfreeze(
> + new_folio,
> + 1 + ((mapping || swap_cache) ?
> + folio_nr_pages(new_folio) :
> + 0));
> +
> + lru_add_split_folio(folio, new_folio, lruvec, list);
> +
> + /* Some pages can be beyond EOF: drop them from cache */
> + if (new_folio->index >= end) {
> + if (shmem_mapping(mapping))
> + nr_dropped += folio_nr_pages(new_folio);
> + else if (folio_test_clear_dirty(new_folio))
> + folio_account_cleaned(
> + new_folio,
> + inode_to_wb(mapping->host));
> + __filemap_remove_folio(new_folio, NULL);
> + folio_put_refs(new_folio,
> + folio_nr_pages(new_folio));
> + } else if (mapping) {
> + __xa_store(&mapping->i_pages, new_folio->index,
> + new_folio, 0);
> + } else if (swap_cache) {
> + __xa_store(&swap_cache->i_pages,
> + swap_cache_index(new_folio->swap),
> + new_folio, 0);
> + }
> + }
> + /*
> + * Unfreeze @folio only after all page cache entries, which
> + * used to point to it, have been updated with new folios.
> + * Otherwise, a parallel folio_try_get() can grab origin_folio
> + * and its caller can see stale page cache entries.
> + */
> + folio_ref_unfreeze(folio, 1 +
> + ((mapping || swap_cache) ? folio_nr_pages(folio) : 0));
> +
> + unlock_page_lruvec(lruvec);
> +
> + if (swap_cache)
> + xa_unlock(&swap_cache->i_pages);
> + if (mapping)
> + xa_unlock(&mapping->i_pages);
> +
> + if (nr_dropped)
> + shmem_uncharge(mapping->host, nr_dropped);
> +
> } else {
> spin_unlock(&ds_queue->split_queue_lock);
> fail:
> if (mapping)
> xas_unlock(&xas);
> - local_irq_enable();
> - remap_page(folio, folio_nr_pages(folio), 0);
> ret = -EAGAIN;
> }
>
> + local_irq_enable();
> +
> + remap_page(folio, 1 << order,
> + !ret && folio_test_anon(folio) ? RMP_USE_SHARED_ZEROPAGE :
> + 0);
> +
> + /*
> + * At this point, folio should contain the specified page.
> + * For uniform split, it is left for caller to unlock.
> + * For buddy allocator like split, the first after-split folio is left
> + * for caller to unlock.
> + */
> + for (new_folio = folio; new_folio != next_folio; new_folio = next) {
> + next = folio_next(new_folio);
> + if (new_folio == page_folio(lock_at))
> + continue;
> +
> + folio_unlock(new_folio);
> + /*
> + * Subpages may be freed if there wasn't any mapping
> + * like if add_to_swap() is running on a lru page that
> + * had its mapping zapped. And freeing these pages
> + * requires taking the lru_lock so we do the put_page
> + * of the tail pages after the split is complete.
> + */
> + free_folio_and_swap_cache(new_folio);
> + }
> +
> out_unlock:
> if (anon_vma) {
> anon_vma_unlock_write(anon_vma);
I applied my changes and tested on top of this patch. Thanks!
Tested-by: Balbir Singh <balbirs@nvidia.com>
* Re: [v1 resend 00/12] THP support for zone device page migration
2025-07-03 23:34 [v1 resend 00/12] THP support for zone device page migration Balbir Singh
` (12 preceding siblings ...)
2025-07-04 16:16 ` [v1 resend 00/12] THP support for zone device page migration Zi Yan
@ 2025-07-08 14:53 ` David Hildenbrand
2025-07-08 22:43 ` Balbir Singh
2025-07-17 23:40 ` Matthew Brost
14 siblings, 1 reply; 99+ messages in thread
From: David Hildenbrand @ 2025-07-08 14:53 UTC (permalink / raw)
To: Balbir Singh, linux-mm
Cc: akpm, linux-kernel, Karol Herbst, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
On 04.07.25 01:34, Balbir Singh wrote:
So, I shared some feedback as a reply to "[RFC 00/11] THP support for zone
device pages".
It popped up in my mailbox again after there apparently was a
discussion a couple of days ago.
... and now I realize that this is apparently the same series with a
renamed subject. Gah.
So please, find my feedback there -- most of that should still apply,
scanning your changelog ...
> Changelog v1:
> - Changes from RFC [2], include support for handling fault_folio and using
> trylock in the fault path
> - A new test case has been added to measure the throughput improvement
> - General refactoring of code to keep up with the changes in mm
> - New split folio callback when the entire split is complete/done. The
> callback is used to know when the head order needs to be reset.
>
> Testing:
> - Testing was done with ZONE_DEVICE private pages on an x86 VM
> - Throughput showed upto 5x improvement with THP migration, system to device
> migration is slower due to the mirroring of data (see buffer->mirror)
>
>
> Balbir Singh (12):
> mm/zone_device: support large zone device private folios
> mm/migrate_device: flags for selecting device private THP pages
> mm/thp: zone_device awareness in THP handling code
> mm/migrate_device: THP migration of zone device pages
> mm/memory/fault: add support for zone device THP fault handling
> lib/test_hmm: test cases and support for zone device private THP
> mm/memremap: add folio_split support
> mm/thp: add split during migration support
> lib/test_hmm: add test case for split pages
> selftests/mm/hmm-tests: new tests for zone device THP migration
> gpu/drm/nouveau: add THP migration support
> selftests/mm/hmm-tests: new throughput tests including THP
>
> drivers/gpu/drm/nouveau/nouveau_dmem.c | 246 +++++++---
> drivers/gpu/drm/nouveau/nouveau_svm.c | 6 +-
> drivers/gpu/drm/nouveau/nouveau_svm.h | 3 +-
> include/linux/huge_mm.h | 18 +-
> include/linux/memremap.h | 29 +-
> include/linux/migrate.h | 2 +
> include/linux/mm.h | 1 +
> lib/test_hmm.c | 428 +++++++++++++----
> lib/test_hmm_uapi.h | 3 +
> mm/huge_memory.c | 261 ++++++++---
> mm/memory.c | 6 +-
> mm/memremap.c | 50 +-
> mm/migrate.c | 2 +
> mm/migrate_device.c | 488 +++++++++++++++++---
> mm/page_alloc.c | 1 +
> mm/page_vma_mapped.c | 10 +
> mm/pgtable-generic.c | 6 +
> mm/rmap.c | 19 +-
> tools/testing/selftests/mm/hmm-tests.c | 607 ++++++++++++++++++++++++-
> 19 files changed, 1874 insertions(+), 312 deletions(-)
>
--
Cheers,
David / dhildenb
* Re: [v1 resend 00/12] THP support for zone device page migration
2025-07-08 14:53 ` David Hildenbrand
@ 2025-07-08 22:43 ` Balbir Singh
0 siblings, 0 replies; 99+ messages in thread
From: Balbir Singh @ 2025-07-08 22:43 UTC (permalink / raw)
To: David Hildenbrand, linux-mm
Cc: akpm, linux-kernel, Karol Herbst, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
On 7/9/25 00:53, David Hildenbrand wrote:
> On 04.07.25 01:34, Balbir Singh wrote:
>
> So, I shared some feedback as reply to "[RFC 00/11] THP support for zone device pages".
>
> It popped up in my mail box again after there apparently was a discussion a couple of days ago.
>
> ... and now I realize that this is apparently the same series with renamed subject. Gah.
>
> So please, find my feedback there -- most of that should still apply, scanning your changelog ...
>
Will do
Thanks for letting me know
Balbir Singh
* Re: [v1 resend 08/12] mm/thp: add split during migration support
2025-07-06 1:47 ` Balbir Singh
2025-07-06 2:34 ` Zi Yan
@ 2025-07-16 5:34 ` Matthew Brost
2025-07-16 11:19 ` Zi Yan
1 sibling, 1 reply; 99+ messages in thread
From: Matthew Brost @ 2025-07-16 5:34 UTC (permalink / raw)
To: Balbir Singh
Cc: Zi Yan, linux-mm, akpm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
On Sun, Jul 06, 2025 at 11:47:10AM +1000, Balbir Singh wrote:
> On 7/6/25 11:34, Zi Yan wrote:
> > On 5 Jul 2025, at 21:15, Balbir Singh wrote:
> >
> >> On 7/5/25 11:55, Zi Yan wrote:
> >>> On 4 Jul 2025, at 20:58, Balbir Singh wrote:
> >>>
> >>>> On 7/4/25 21:24, Zi Yan wrote:
> >>>>>
> >>>>> s/pages/folio
> >>>>>
> >>>>
> >>>> Thanks, will make the changes
> >>>>
> >>>>> Why name it isolated if the folio is unmapped? Isolated folios often mean
> >>>>> they are removed from LRU lists. isolated here causes confusion.
> >>>>>
> >>>>
> >>>> Ack, will change the name
> >>>>
> >>>>
> >>>>>> *
> >>>>>> * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
> >>>>>> * It is in charge of checking whether the split is supported or not and
> >>>>>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
> >>>>>> */
> >>>>>> static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>> struct page *split_at, struct page *lock_at,
> >>>>>> - struct list_head *list, bool uniform_split)
> >>>>>> + struct list_head *list, bool uniform_split, bool isolated)
> >>>>>> {
> >>>>>> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
> >>>>>> XA_STATE(xas, &folio->mapping->i_pages, folio->index);
> >>>>>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>> * is taken to serialise against parallel split or collapse
> >>>>>> * operations.
> >>>>>> */
> >>>>>> - anon_vma = folio_get_anon_vma(folio);
> >>>>>> - if (!anon_vma) {
> >>>>>> - ret = -EBUSY;
> >>>>>> - goto out;
> >>>>>> + if (!isolated) {
> >>>>>> + anon_vma = folio_get_anon_vma(folio);
> >>>>>> + if (!anon_vma) {
> >>>>>> + ret = -EBUSY;
> >>>>>> + goto out;
> >>>>>> + }
> >>>>>> + anon_vma_lock_write(anon_vma);
> >>>>>> }
> >>>>>> end = -1;
> >>>>>> mapping = NULL;
> >>>>>> - anon_vma_lock_write(anon_vma);
> >>>>>> } else {
> >>>>>> unsigned int min_order;
> >>>>>> gfp_t gfp;
> >>>>>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>> goto out_unlock;
> >>>>>> }
> >>>>>>
> >>>>>> - unmap_folio(folio);
> >>>>>> + if (!isolated)
> >>>>>> + unmap_folio(folio);
> >>>>>>
> >>>>>> /* block interrupt reentry in xa_lock and spinlock */
> >>>>>> local_irq_disable();
> >>>>>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>
> >>>>>> ret = __split_unmapped_folio(folio, new_order,
> >>>>>> split_at, lock_at, list, end, &xas, mapping,
> >>>>>> - uniform_split);
> >>>>>> + uniform_split, isolated);
> >>>>>> } else {
> >>>>>> spin_unlock(&ds_queue->split_queue_lock);
> >>>>>> fail:
> >>>>>> if (mapping)
> >>>>>> xas_unlock(&xas);
> >>>>>> local_irq_enable();
> >>>>>> - remap_page(folio, folio_nr_pages(folio), 0);
> >>>>>> + if (!isolated)
> >>>>>> + remap_page(folio, folio_nr_pages(folio), 0);
> >>>>>> ret = -EAGAIN;
> >>>>>> }
> >>>>>
> >>>>> These "isolated" special handlings does not look good, I wonder if there
> >>>>> is a way of letting split code handle device private folios more gracefully.
> >>>>> It also causes confusions, since why does "isolated/unmapped" folios
> >>>>> not need to unmap_page(), remap_page(), or unlock?
> >>>>>
> >>>>>
> >>>>
> >>>> There are two reasons for going down the current code path
> >>>
> >>> After thinking more, I think adding isolated/unmapped is not the right
> >>> way, since unmapped folio is a very generic concept. If you add it,
> >>> one can easily misuse the folio split code by first unmapping a folio
> >>> and trying to split it with unmapped = true. I do not think that is
> >>> supported and your patch does not prevent that from happening in the future.
> >>>
> >>
> >> I don't understand the misuse case you mention, I assume you mean someone can
> >> get the usage wrong? The responsibility is on the caller to do the right thing
> >> if calling the API with unmapped
> >
> > Before your patch, there is no use case of splitting unmapped folios.
> > Your patch only adds support for device private page split, not any unmapped
> > folio split. So using a generic isolated/unmapped parameter is not OK.
> >
>
> There is a use for splitting unmapped folios (see below)
>
> >>
> >>> You should teach different parts of folio split code path to handle
> >>> device private folios properly. Details are below.
> >>>
> >>>>
> >>>> 1. if the isolated check is not present, folio_get_anon_vma will fail and cause
> >>>> the split routine to return with -EBUSY
> >>>
> >>> You do something below instead.
> >>>
> >>> if (!anon_vma && !folio_is_device_private(folio)) {
> >>> ret = -EBUSY;
> >>> goto out;
> >>> } else if (anon_vma) {
> >>> anon_vma_lock_write(anon_vma);
> >>> }
> >>>
> >>
> >> folio_get_anon() cannot be called for unmapped folios. In our case the page has
> >> already been unmapped. Is there a reason why you mix anon_vma_lock_write with
> >> the check for device private folios?
> >
> > Oh, I did not notice that anon_vma = folio_get_anon_vma(folio) is also
> > in if (!isolated) branch. In that case, just do
> >
> > if (folio_is_device_private(folio) {
> > ...
> > } else if (is_anon) {
> > ...
> > } else {
> > ...
> > }
> >
> >>
> >>> People can know device private folio split needs a special handling.
> >>>
> >>> BTW, why a device private folio can also be anonymous? Does it mean
> >>> if a page cache folio is migrated to device private, kernel also
> >>> sees it as both device private and file-backed?
> >>>
> >>
> >> FYI: device private folios only work with anonymous private pages, hence
> >> the name device private.
> >
> > OK.
> >
> >>
> >>>
> >>>> 2. Going through unmap_page(), remap_page() causes a full page table walk, which
> >>>> the migrate_device API has already just done as a part of the migration. The
> >>>> entries under consideration are already migration entries in this case.
> >>>> This is wasteful and in some case unexpected.
> >>>
> >>> unmap_folio() already adds TTU_SPLIT_HUGE_PMD to try to split
> >>> PMD mapping, which you did in migrate_vma_split_pages(). You probably
> >>> can teach either try_to_migrate() or try_to_unmap() to just split
> >>> device private PMD mapping. Or if that is not preferred,
> >>> you can simply call split_huge_pmd_address() when unmap_folio()
> >>> sees a device private folio.
> >>>
> >>> For remap_page(), you can simply return for device private folios
> >>> like it is currently doing for non anonymous folios.
> >>>
> >>
> >> Doing a full rmap walk does not make sense with unmap_folio() and
> >> remap_folio(), because
> >>
> >> 1. We need to do a page table walk/rmap walk again
> >> 2. We'll need special handling of migration <-> migration entries
> >> in the rmap handling (set/remove migration ptes)
> >> 3. In this context, the code is already in the middle of migration,
> >> so trying to do that again does not make sense.
> >
> > Why doing split in the middle of migration? Existing split code
> > assumes to-be-split folios are mapped.
> >
> > What prevents doing split before migration?
> >
>
> The code does do a split prior to migration if THP selection fails
>
> Please see https://lore.kernel.org/lkml/20250703233511.2028395-5-balbirs@nvidia.com/
> and the fallback part which calls split_folio()
>
> But the case under consideration is special since the device needs to allocate
> corresponding pfn's as well. The changelog mentions it:
>
> "The common case that arises is that after setup, during migrate
> the destination might not be able to allocate MIGRATE_PFN_COMPOUND
> pages."
>
> I can expand on it, because migrate_vma() is a multi-phase operation
>
> 1. migrate_vma_setup()
> 2. migrate_vma_pages()
> 3. migrate_vma_finalize()
>
> It can so happen that when we get the destination pfn's allocated the destination
> might not be able to allocate a large page, so we do the split in migrate_vma_pages().
>
> The pages have been unmapped and collected in migrate_vma_setup()
>
> The next patch in the series 9/12 (https://lore.kernel.org/lkml/20250703233511.2028395-10-balbirs@nvidia.com/)
> tests the split and emulates a failure on the device side to allocate large pages
> and tests it in 10/12 (https://lore.kernel.org/lkml/20250703233511.2028395-11-balbirs@nvidia.com/)
>
Another use case I’ve seen is when a previously allocated high-order
folio, now in the free memory pool, is reallocated as a lower-order
page. For example, a 2MB fault allocates a folio, the memory is later
freed, and then a 4KB fault reuses a page from that previously allocated
folio. This will actually be quite common in Xe / GPU SVM. In such
cases, the folio in an unmapped state needs to be split. I’d suggest
adding a migrate_device_* helper here, built on top of the core MM
__split_folio function.
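Something like this, purely as a sketch (the name is hypothetical, and it
leans on the unmapped-split path this series adds):

/* hypothetical helper for splitting an already-unmapped device folio */
static int migrate_device_split_folio(struct folio *folio)
{
	/*
	 * Caller holds the folio lock and a reference; the folio is
	 * device-private and not mapped, so no unmap/remap is needed.
	 */
	return __split_huge_page_to_list_to_order(folio_page(folio, 0),
						  NULL, 0, true);
}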
Matt
>
> >>
> >>
> >>>
> >>> For lru_add_split_folio(), you can skip it if a device private
> >>> folio is seen.
> >>>
> >>> Last, for unlock part, why do you need to keep all after-split folios
> >>> locked? It should be possible to just keep the to-be-migrated folio
> >>> locked and unlock the rest for a later retry. But I could miss something
> >>> since I am not familiar with device private migration code.
> >>>
> >>
> >> Not sure I follow this comment
> >
> > Because the patch is doing split in the middle of migration and existing
> > split code never supports. My comment is based on the assumption that
> > the split is done when a folio is mapped.
> >
>
> Understood, hopefully I've explained the reason for the split in the middle
> of migration
>
> Thanks for the detailed review
> Balbir
* Re: [v1 resend 08/12] mm/thp: add split during migration support
2025-07-16 5:34 ` Matthew Brost
@ 2025-07-16 11:19 ` Zi Yan
2025-07-16 16:24 ` Matthew Brost
0 siblings, 1 reply; 99+ messages in thread
From: Zi Yan @ 2025-07-16 11:19 UTC (permalink / raw)
To: Matthew Brost
Cc: Balbir Singh, linux-mm, akpm, linux-kernel, Karol Herbst,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
On 16 Jul 2025, at 1:34, Matthew Brost wrote:
> On Sun, Jul 06, 2025 at 11:47:10AM +1000, Balbir Singh wrote:
>> On 7/6/25 11:34, Zi Yan wrote:
>>> On 5 Jul 2025, at 21:15, Balbir Singh wrote:
>>>
>>>> On 7/5/25 11:55, Zi Yan wrote:
>>>>> On 4 Jul 2025, at 20:58, Balbir Singh wrote:
>>>>>
>>>>>> On 7/4/25 21:24, Zi Yan wrote:
>>>>>>>
>>>>>>> s/pages/folio
>>>>>>>
>>>>>>
>>>>>> Thanks, will make the changes
>>>>>>
>>>>>>> Why name it isolated if the folio is unmapped? Isolated folios often mean
>>>>>>> they are removed from LRU lists. isolated here causes confusion.
>>>>>>>
>>>>>>
>>>>>> Ack, will change the name
>>>>>>
>>>>>>
>>>>>>>> *
>>>>>>>> * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>>>>>>>> * It is in charge of checking whether the split is supported or not and
>>>>>>>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>>>>>> */
>>>>>>>> static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>> struct page *split_at, struct page *lock_at,
>>>>>>>> - struct list_head *list, bool uniform_split)
>>>>>>>> + struct list_head *list, bool uniform_split, bool isolated)
>>>>>>>> {
>>>>>>>> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>>>>>> XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>>>>>>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>> * is taken to serialise against parallel split or collapse
>>>>>>>> * operations.
>>>>>>>> */
>>>>>>>> - anon_vma = folio_get_anon_vma(folio);
>>>>>>>> - if (!anon_vma) {
>>>>>>>> - ret = -EBUSY;
>>>>>>>> - goto out;
>>>>>>>> + if (!isolated) {
>>>>>>>> + anon_vma = folio_get_anon_vma(folio);
>>>>>>>> + if (!anon_vma) {
>>>>>>>> + ret = -EBUSY;
>>>>>>>> + goto out;
>>>>>>>> + }
>>>>>>>> + anon_vma_lock_write(anon_vma);
>>>>>>>> }
>>>>>>>> end = -1;
>>>>>>>> mapping = NULL;
>>>>>>>> - anon_vma_lock_write(anon_vma);
>>>>>>>> } else {
>>>>>>>> unsigned int min_order;
>>>>>>>> gfp_t gfp;
>>>>>>>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>> goto out_unlock;
>>>>>>>> }
>>>>>>>>
>>>>>>>> - unmap_folio(folio);
>>>>>>>> + if (!isolated)
>>>>>>>> + unmap_folio(folio);
>>>>>>>>
>>>>>>>> /* block interrupt reentry in xa_lock and spinlock */
>>>>>>>> local_irq_disable();
>>>>>>>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>
>>>>>>>> ret = __split_unmapped_folio(folio, new_order,
>>>>>>>> split_at, lock_at, list, end, &xas, mapping,
>>>>>>>> - uniform_split);
>>>>>>>> + uniform_split, isolated);
>>>>>>>> } else {
>>>>>>>> spin_unlock(&ds_queue->split_queue_lock);
>>>>>>>> fail:
>>>>>>>> if (mapping)
>>>>>>>> xas_unlock(&xas);
>>>>>>>> local_irq_enable();
>>>>>>>> - remap_page(folio, folio_nr_pages(folio), 0);
>>>>>>>> + if (!isolated)
>>>>>>>> + remap_page(folio, folio_nr_pages(folio), 0);
>>>>>>>> ret = -EAGAIN;
>>>>>>>> }
>>>>>>>
>>>>>>> These "isolated" special handlings does not look good, I wonder if there
>>>>>>> is a way of letting split code handle device private folios more gracefully.
>>>>>>> It also causes confusions, since why does "isolated/unmapped" folios
>>>>>>> not need to unmap_page(), remap_page(), or unlock?
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> There are two reasons for going down the current code path
>>>>>
>>>>> After thinking more, I think adding isolated/unmapped is not the right
>>>>> way, since unmapped folio is a very generic concept. If you add it,
>>>>> one can easily misuse the folio split code by first unmapping a folio
>>>>> and trying to split it with unmapped = true. I do not think that is
>>>>> supported and your patch does not prevent that from happening in the future.
>>>>>
>>>>
>>>> I don't understand the misuse case you mention, I assume you mean someone can
>>>> get the usage wrong? The responsibility is on the caller to do the right thing
>>>> if calling the API with unmapped
>>>
>>> Before your patch, there is no use case of splitting unmapped folios.
>>> Your patch only adds support for device private page split, not any unmapped
>>> folio split. So using a generic isolated/unmapped parameter is not OK.
>>>
>>
>> There is a use for splitting unmapped folios (see below)
>>
>>>>
>>>>> You should teach different parts of folio split code path to handle
>>>>> device private folios properly. Details are below.
>>>>>
>>>>>>
>>>>>> 1. if the isolated check is not present, folio_get_anon_vma will fail and cause
>>>>>> the split routine to return with -EBUSY
>>>>>
>>>>> You do something below instead.
>>>>>
>>>>> if (!anon_vma && !folio_is_device_private(folio)) {
>>>>> ret = -EBUSY;
>>>>> goto out;
>>>>> } else if (anon_vma) {
>>>>> anon_vma_lock_write(anon_vma);
>>>>> }
>>>>>
>>>>
>>>> folio_get_anon() cannot be called for unmapped folios. In our case the page has
>>>> already been unmapped. Is there a reason why you mix anon_vma_lock_write with
>>>> the check for device private folios?
>>>
>>> Oh, I did not notice that anon_vma = folio_get_anon_vma(folio) is also
>>> in if (!isolated) branch. In that case, just do
>>>
>>> if (folio_is_device_private(folio) {
>>> ...
>>> } else if (is_anon) {
>>> ...
>>> } else {
>>> ...
>>> }
>>>
>>>>
>>>>> People can know device private folio split needs a special handling.
>>>>>
>>>>> BTW, why a device private folio can also be anonymous? Does it mean
>>>>> if a page cache folio is migrated to device private, kernel also
>>>>> sees it as both device private and file-backed?
>>>>>
>>>>
>>>> FYI: device private folios only work with anonymous private pages, hence
>>>> the name device private.
>>>
>>> OK.
>>>
>>>>
>>>>>
>>>>>> 2. Going through unmap_page(), remap_page() causes a full page table walk, which
>>>>>> the migrate_device API has already just done as a part of the migration. The
>>>>>> entries under consideration are already migration entries in this case.
>>>>>> This is wasteful and in some case unexpected.
>>>>>
>>>>> unmap_folio() already adds TTU_SPLIT_HUGE_PMD to try to split
>>>>> PMD mapping, which you did in migrate_vma_split_pages(). You probably
>>>>> can teach either try_to_migrate() or try_to_unmap() to just split
>>>>> device private PMD mapping. Or if that is not preferred,
>>>>> you can simply call split_huge_pmd_address() when unmap_folio()
>>>>> sees a device private folio.
>>>>>
>>>>> For remap_page(), you can simply return for device private folios
>>>>> like it is currently doing for non anonymous folios.
>>>>>
>>>>
>>>> Doing a full rmap walk does not make sense with unmap_folio() and
>>>> remap_folio(), because
>>>>
>>>> 1. We need to do a page table walk/rmap walk again
>>>> 2. We'll need special handling of migration <-> migration entries
>>>> in the rmap handling (set/remove migration ptes)
>>>> 3. In this context, the code is already in the middle of migration,
>>>> so trying to do that again does not make sense.
>>>
>>> Why doing split in the middle of migration? Existing split code
>>> assumes to-be-split folios are mapped.
>>>
>>> What prevents doing split before migration?
>>>
>>
>> The code does do a split prior to migration if THP selection fails
>>
>> Please see https://lore.kernel.org/lkml/20250703233511.2028395-5-balbirs@nvidia.com/
>> and the fallback part which calls split_folio()
>>
>> But the case under consideration is special since the device needs to allocate
>> corresponding pfn's as well. The changelog mentions it:
>>
>> "The common case that arises is that after setup, during migrate
>> the destination might not be able to allocate MIGRATE_PFN_COMPOUND
>> pages."
>>
>> I can expand on it, because migrate_vma() is a multi-phase operation
>>
>> 1. migrate_vma_setup()
>> 2. migrate_vma_pages()
>> 3. migrate_vma_finalize()
>>
>> It can so happen that when we get the destination pfn's allocated the destination
>> might not be able to allocate a large page, so we do the split in migrate_vma_pages().
>>
>> The pages have been unmapped and collected in migrate_vma_setup()
>>
>> The next patch in the series 9/12 (https://lore.kernel.org/lkml/20250703233511.2028395-10-balbirs@nvidia.com/)
>> tests the split and emulates a failure on the device side to allocate large pages
>> and tests it in 10/12 (https://lore.kernel.org/lkml/20250703233511.2028395-11-balbirs@nvidia.com/)
>>
>
> Another use case I’ve seen is when a previously allocated high-order
> folio, now in the free memory pool, is reallocated as a lower-order
> page. For example, a 2MB fault allocates a folio, the memory is later
That is different. If the high-order folio is free, it should be split
using split_page() from mm/page_alloc.c.
> freed, and then a 4KB fault reuses a page from that previously allocated
> folio. This will be actually quite common in Xe / GPU SVM. In such
> cases, the folio in an unmapped state needs to be split. I’d suggest a
This folio is unused, so ->flags, ->mapping, etc. are not set;
__split_unmapped_folio() is not meant for it, unless you mean something
different by "free folio".
> migrate_device_* helper built on top of the core MM __split_folio
> function add here.
>
--
Best Regards,
Yan, Zi
* Re: [v1 resend 08/12] mm/thp: add split during migration support
2025-07-16 11:19 ` Zi Yan
@ 2025-07-16 16:24 ` Matthew Brost
2025-07-16 21:53 ` Balbir Singh
0 siblings, 1 reply; 99+ messages in thread
From: Matthew Brost @ 2025-07-16 16:24 UTC (permalink / raw)
To: Zi Yan
Cc: Balbir Singh, linux-mm, akpm, linux-kernel, Karol Herbst,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
On Wed, Jul 16, 2025 at 07:19:10AM -0400, Zi Yan wrote:
> On 16 Jul 2025, at 1:34, Matthew Brost wrote:
>
> > On Sun, Jul 06, 2025 at 11:47:10AM +1000, Balbir Singh wrote:
> >> On 7/6/25 11:34, Zi Yan wrote:
> >>> On 5 Jul 2025, at 21:15, Balbir Singh wrote:
> >>>
> >>>> On 7/5/25 11:55, Zi Yan wrote:
> >>>>> On 4 Jul 2025, at 20:58, Balbir Singh wrote:
> >>>>>
> >>>>>> On 7/4/25 21:24, Zi Yan wrote:
> >>>>>>>
> >>>>>>> s/pages/folio
> >>>>>>>
> >>>>>>
> >>>>>> Thanks, will make the changes
> >>>>>>
> >>>>>>> Why name it isolated if the folio is unmapped? Isolated folios often mean
> >>>>>>> they are removed from LRU lists. isolated here causes confusion.
> >>>>>>>
> >>>>>>
> >>>>>> Ack, will change the name
> >>>>>>
> >>>>>>
> >>>>>>>> *
> >>>>>>>> * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
> >>>>>>>> * It is in charge of checking whether the split is supported or not and
> >>>>>>>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
> >>>>>>>> */
> >>>>>>>> static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>>> struct page *split_at, struct page *lock_at,
> >>>>>>>> - struct list_head *list, bool uniform_split)
> >>>>>>>> + struct list_head *list, bool uniform_split, bool isolated)
> >>>>>>>> {
> >>>>>>>> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
> >>>>>>>> XA_STATE(xas, &folio->mapping->i_pages, folio->index);
> >>>>>>>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>>> * is taken to serialise against parallel split or collapse
> >>>>>>>> * operations.
> >>>>>>>> */
> >>>>>>>> - anon_vma = folio_get_anon_vma(folio);
> >>>>>>>> - if (!anon_vma) {
> >>>>>>>> - ret = -EBUSY;
> >>>>>>>> - goto out;
> >>>>>>>> + if (!isolated) {
> >>>>>>>> + anon_vma = folio_get_anon_vma(folio);
> >>>>>>>> + if (!anon_vma) {
> >>>>>>>> + ret = -EBUSY;
> >>>>>>>> + goto out;
> >>>>>>>> + }
> >>>>>>>> + anon_vma_lock_write(anon_vma);
> >>>>>>>> }
> >>>>>>>> end = -1;
> >>>>>>>> mapping = NULL;
> >>>>>>>> - anon_vma_lock_write(anon_vma);
> >>>>>>>> } else {
> >>>>>>>> unsigned int min_order;
> >>>>>>>> gfp_t gfp;
> >>>>>>>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>>> goto out_unlock;
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> - unmap_folio(folio);
> >>>>>>>> + if (!isolated)
> >>>>>>>> + unmap_folio(folio);
> >>>>>>>>
> >>>>>>>> /* block interrupt reentry in xa_lock and spinlock */
> >>>>>>>> local_irq_disable();
> >>>>>>>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>>>
> >>>>>>>> ret = __split_unmapped_folio(folio, new_order,
> >>>>>>>> split_at, lock_at, list, end, &xas, mapping,
> >>>>>>>> - uniform_split);
> >>>>>>>> + uniform_split, isolated);
> >>>>>>>> } else {
> >>>>>>>> spin_unlock(&ds_queue->split_queue_lock);
> >>>>>>>> fail:
> >>>>>>>> if (mapping)
> >>>>>>>> xas_unlock(&xas);
> >>>>>>>> local_irq_enable();
> >>>>>>>> - remap_page(folio, folio_nr_pages(folio), 0);
> >>>>>>>> + if (!isolated)
> >>>>>>>> + remap_page(folio, folio_nr_pages(folio), 0);
> >>>>>>>> ret = -EAGAIN;
> >>>>>>>> }
> >>>>>>>
> >>>>>>> These "isolated" special handlings does not look good, I wonder if there
> >>>>>>> is a way of letting split code handle device private folios more gracefully.
> >>>>>>> It also causes confusions, since why does "isolated/unmapped" folios
> >>>>>>> not need to unmap_page(), remap_page(), or unlock?
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>> There are two reasons for going down the current code path
> >>>>>
> >>>>> After thinking more, I think adding isolated/unmapped is not the right
> >>>>> way, since unmapped folio is a very generic concept. If you add it,
> >>>>> one can easily misuse the folio split code by first unmapping a folio
> >>>>> and trying to split it with unmapped = true. I do not think that is
> >>>>> supported and your patch does not prevent that from happening in the future.
> >>>>>
> >>>>
> >>>> I don't understand the misuse case you mention, I assume you mean someone can
> >>>> get the usage wrong? The responsibility is on the caller to do the right thing
> >>>> if calling the API with unmapped
> >>>
> >>> Before your patch, there is no use case of splitting unmapped folios.
> >>> Your patch only adds support for device private page split, not any unmapped
> >>> folio split. So using a generic isolated/unmapped parameter is not OK.
> >>>
> >>
> >> There is a use for splitting unmapped folios (see below)
> >>
> >>>>
> >>>>> You should teach different parts of folio split code path to handle
> >>>>> device private folios properly. Details are below.
> >>>>>
> >>>>>>
> >>>>>> 1. if the isolated check is not present, folio_get_anon_vma will fail and cause
> >>>>>> the split routine to return with -EBUSY
> >>>>>
> >>>>> You do something below instead.
> >>>>>
> >>>>> if (!anon_vma && !folio_is_device_private(folio)) {
> >>>>> ret = -EBUSY;
> >>>>> goto out;
> >>>>> } else if (anon_vma) {
> >>>>> anon_vma_lock_write(anon_vma);
> >>>>> }
> >>>>>
> >>>>
> >>>> folio_get_anon() cannot be called for unmapped folios. In our case the page has
> >>>> already been unmapped. Is there a reason why you mix anon_vma_lock_write with
> >>>> the check for device private folios?
> >>>
> >>> Oh, I did not notice that anon_vma = folio_get_anon_vma(folio) is also
> >>> in if (!isolated) branch. In that case, just do
> >>>
> >>> if (folio_is_device_private(folio) {
> >>> ...
> >>> } else if (is_anon) {
> >>> ...
> >>> } else {
> >>> ...
> >>> }
> >>>
> >>>>
> >>>>> People can know device private folio split needs a special handling.
> >>>>>
> >>>>> BTW, why a device private folio can also be anonymous? Does it mean
> >>>>> if a page cache folio is migrated to device private, kernel also
> >>>>> sees it as both device private and file-backed?
> >>>>>
> >>>>
> >>>> FYI: device private folios only work with anonymous private pages, hence
> >>>> the name device private.
> >>>
> >>> OK.
> >>>
> >>>>
> >>>>>
> >>>>>> 2. Going through unmap_page(), remap_page() causes a full page table walk, which
> >>>>>> the migrate_device API has already just done as a part of the migration. The
> >>>>>> entries under consideration are already migration entries in this case.
> >>>>>> This is wasteful and in some case unexpected.
> >>>>>
> >>>>> unmap_folio() already adds TTU_SPLIT_HUGE_PMD to try to split
> >>>>> PMD mapping, which you did in migrate_vma_split_pages(). You probably
> >>>>> can teach either try_to_migrate() or try_to_unmap() to just split
> >>>>> device private PMD mapping. Or if that is not preferred,
> >>>>> you can simply call split_huge_pmd_address() when unmap_folio()
> >>>>> sees a device private folio.
> >>>>>
> >>>>> For remap_page(), you can simply return for device private folios
> >>>>> like it is currently doing for non anonymous folios.
> >>>>>
> >>>>
> >>>> Doing a full rmap walk does not make sense with unmap_folio() and
> >>>> remap_folio(), because
> >>>>
> >>>> 1. We need to do a page table walk/rmap walk again
> >>>> 2. We'll need special handling of migration <-> migration entries
> >>>> in the rmap handling (set/remove migration ptes)
> >>>> 3. In this context, the code is already in the middle of migration,
> >>>> so trying to do that again does not make sense.
> >>>
> >>> Why doing split in the middle of migration? Existing split code
> >>> assumes to-be-split folios are mapped.
> >>>
> >>> What prevents doing split before migration?
> >>>
> >>
> >> The code does do a split prior to migration if THP selection fails
> >>
> >> Please see https://lore.kernel.org/lkml/20250703233511.2028395-5-balbirs@nvidia.com/
> >> and the fallback part which calls split_folio()
> >>
> >> But the case under consideration is special since the device needs to allocate
> >> corresponding pfn's as well. The changelog mentions it:
> >>
> >> "The common case that arises is that after setup, during migrate
> >> the destination might not be able to allocate MIGRATE_PFN_COMPOUND
> >> pages."
> >>
> >> I can expand on it, because migrate_vma() is a multi-phase operation
> >>
> >> 1. migrate_vma_setup()
> >> 2. migrate_vma_pages()
> >> 3. migrate_vma_finalize()
> >>
> >> It can so happen that when we get the destination pfn's allocated the destination
> >> might not be able to allocate a large page, so we do the split in migrate_vma_pages().
> >>
> >> The pages have been unmapped and collected in migrate_vma_setup()
> >>
> >> The next patch in the series 9/12 (https://lore.kernel.org/lkml/20250703233511.2028395-10-balbirs@nvidia.com/)
> >> tests the split and emulates a failure on the device side to allocate large pages
> >> and tests it in 10/12 (https://lore.kernel.org/lkml/20250703233511.2028395-11-balbirs@nvidia.com/)
> >>
> >
> > Another use case I’ve seen is when a previously allocated high-order
> > folio, now in the free memory pool, is reallocated as a lower-order
> > page. For example, a 2MB fault allocates a folio, the memory is later
>
> That is different. If the high-order folio is free, it should be split
> using split_page() from mm/page_alloc.c.
>
Ah, ok. Let me see if that works - it would be easier.
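To make sure I follow, this is roughly what I would try - a sketch only,
assuming the backing allocation is a plain non-compound high-order page;
the helper name is made up:

static struct page *svm_reuse_as_order0(struct page *huge_page,
					unsigned int order,
					unsigned long offset)
{
	/*
	 * split_page() (mm/page_alloc.c) expects a non-compound
	 * high-order page with a non-zero refcount; afterwards every
	 * constituent page is an independent order-0 page.
	 */
	split_page(huge_page, order);

	/* hand this one out for the 4KB fault */
	return huge_page + offset;
}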
> > freed, and then a 4KB fault reuses a page from that previously allocated
> > folio. This will be actually quite common in Xe / GPU SVM. In such
> > cases, the folio in an unmapped state needs to be split. I’d suggest a
>
> This folio is unused, so ->flags, ->mapping, and etc. are not set,
> __split_unmapped_folio() is not for it, unless you mean free folio
> differently.
>
This is right, those fields should be clear.
Thanks for the tip.
Matt
> > migrate_device_* helper built on top of the core MM __split_folio
> > function add here.
> >
>
> --
> Best Regards,
> Yan, Zi
* Re: [v1 resend 08/12] mm/thp: add split during migration support
2025-07-16 16:24 ` Matthew Brost
@ 2025-07-16 21:53 ` Balbir Singh
2025-07-17 22:24 ` Matthew Brost
0 siblings, 1 reply; 99+ messages in thread
From: Balbir Singh @ 2025-07-16 21:53 UTC (permalink / raw)
To: Matthew Brost, Zi Yan
Cc: linux-mm, akpm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
On 7/17/25 02:24, Matthew Brost wrote:
> On Wed, Jul 16, 2025 at 07:19:10AM -0400, Zi Yan wrote:
>> On 16 Jul 2025, at 1:34, Matthew Brost wrote:
>>
>>> On Sun, Jul 06, 2025 at 11:47:10AM +1000, Balbir Singh wrote:
>>>> On 7/6/25 11:34, Zi Yan wrote:
>>>>> On 5 Jul 2025, at 21:15, Balbir Singh wrote:
>>>>>
>>>>>> On 7/5/25 11:55, Zi Yan wrote:
>>>>>>> On 4 Jul 2025, at 20:58, Balbir Singh wrote:
>>>>>>>
>>>>>>>> On 7/4/25 21:24, Zi Yan wrote:
>>>>>>>>>
>>>>>>>>> s/pages/folio
>>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks, will make the changes
>>>>>>>>
>>>>>>>>> Why name it isolated if the folio is unmapped? Isolated folios often mean
>>>>>>>>> they are removed from LRU lists. isolated here causes confusion.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Ack, will change the name
>>>>>>>>
>>>>>>>>
>>>>>>>>>> *
>>>>>>>>>> * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>>>>>>>>>> * It is in charge of checking whether the split is supported or not and
>>>>>>>>>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>>>>>>>> */
>>>>>>>>>> static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>> struct page *split_at, struct page *lock_at,
>>>>>>>>>> - struct list_head *list, bool uniform_split)
>>>>>>>>>> + struct list_head *list, bool uniform_split, bool isolated)
>>>>>>>>>> {
>>>>>>>>>> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>>>>>>>> XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>>>>>>>>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>> * is taken to serialise against parallel split or collapse
>>>>>>>>>> * operations.
>>>>>>>>>> */
>>>>>>>>>> - anon_vma = folio_get_anon_vma(folio);
>>>>>>>>>> - if (!anon_vma) {
>>>>>>>>>> - ret = -EBUSY;
>>>>>>>>>> - goto out;
>>>>>>>>>> + if (!isolated) {
>>>>>>>>>> + anon_vma = folio_get_anon_vma(folio);
>>>>>>>>>> + if (!anon_vma) {
>>>>>>>>>> + ret = -EBUSY;
>>>>>>>>>> + goto out;
>>>>>>>>>> + }
>>>>>>>>>> + anon_vma_lock_write(anon_vma);
>>>>>>>>>> }
>>>>>>>>>> end = -1;
>>>>>>>>>> mapping = NULL;
>>>>>>>>>> - anon_vma_lock_write(anon_vma);
>>>>>>>>>> } else {
>>>>>>>>>> unsigned int min_order;
>>>>>>>>>> gfp_t gfp;
>>>>>>>>>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>> goto out_unlock;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> - unmap_folio(folio);
>>>>>>>>>> + if (!isolated)
>>>>>>>>>> + unmap_folio(folio);
>>>>>>>>>>
>>>>>>>>>> /* block interrupt reentry in xa_lock and spinlock */
>>>>>>>>>> local_irq_disable();
>>>>>>>>>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>>
>>>>>>>>>> ret = __split_unmapped_folio(folio, new_order,
>>>>>>>>>> split_at, lock_at, list, end, &xas, mapping,
>>>>>>>>>> - uniform_split);
>>>>>>>>>> + uniform_split, isolated);
>>>>>>>>>> } else {
>>>>>>>>>> spin_unlock(&ds_queue->split_queue_lock);
>>>>>>>>>> fail:
>>>>>>>>>> if (mapping)
>>>>>>>>>> xas_unlock(&xas);
>>>>>>>>>> local_irq_enable();
>>>>>>>>>> - remap_page(folio, folio_nr_pages(folio), 0);
>>>>>>>>>> + if (!isolated)
>>>>>>>>>> + remap_page(folio, folio_nr_pages(folio), 0);
>>>>>>>>>> ret = -EAGAIN;
>>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> These "isolated" special handlings does not look good, I wonder if there
>>>>>>>>> is a way of letting split code handle device private folios more gracefully.
>>>>>>>>> It also causes confusions, since why does "isolated/unmapped" folios
>>>>>>>>> not need to unmap_page(), remap_page(), or unlock?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> There are two reasons for going down the current code path
>>>>>>>
>>>>>>> After thinking more, I think adding isolated/unmapped is not the right
>>>>>>> way, since unmapped folio is a very generic concept. If you add it,
>>>>>>> one can easily misuse the folio split code by first unmapping a folio
>>>>>>> and trying to split it with unmapped = true. I do not think that is
>>>>>>> supported and your patch does not prevent that from happening in the future.
>>>>>>>
>>>>>>
>>>>>> I don't understand the misuse case you mention, I assume you mean someone can
>>>>>> get the usage wrong? The responsibility is on the caller to do the right thing
>>>>>> if calling the API with unmapped
>>>>>
>>>>> Before your patch, there is no use case of splitting unmapped folios.
>>>>> Your patch only adds support for device private page split, not any unmapped
>>>>> folio split. So using a generic isolated/unmapped parameter is not OK.
>>>>>
>>>>
>>>> There is a use for splitting unmapped folios (see below)
>>>>
>>>>>>
>>>>>>> You should teach different parts of folio split code path to handle
>>>>>>> device private folios properly. Details are below.
>>>>>>>
>>>>>>>>
>>>>>>>> 1. if the isolated check is not present, folio_get_anon_vma will fail and cause
>>>>>>>> the split routine to return with -EBUSY
>>>>>>>
>>>>>>> You do something below instead.
>>>>>>>
>>>>>>> if (!anon_vma && !folio_is_device_private(folio)) {
>>>>>>> ret = -EBUSY;
>>>>>>> goto out;
>>>>>>> } else if (anon_vma) {
>>>>>>> anon_vma_lock_write(anon_vma);
>>>>>>> }
>>>>>>>
>>>>>>
>>>>>> folio_get_anon() cannot be called for unmapped folios. In our case the page has
>>>>>> already been unmapped. Is there a reason why you mix anon_vma_lock_write with
>>>>>> the check for device private folios?
>>>>>
>>>>> Oh, I did not notice that anon_vma = folio_get_anon_vma(folio) is also
>>>>> in if (!isolated) branch. In that case, just do
>>>>>
>>>>> if (folio_is_device_private(folio) {
>>>>> ...
>>>>> } else if (is_anon) {
>>>>> ...
>>>>> } else {
>>>>> ...
>>>>> }
>>>>>
>>>>>>
>>>>>>> People can know device private folio split needs a special handling.
>>>>>>>
>>>>>>> BTW, why a device private folio can also be anonymous? Does it mean
>>>>>>> if a page cache folio is migrated to device private, kernel also
>>>>>>> sees it as both device private and file-backed?
>>>>>>>
>>>>>>
>>>>>> FYI: device private folios only work with anonymous private pages, hence
>>>>>> the name device private.
>>>>>
>>>>> OK.
>>>>>
>>>>>>
>>>>>>>
>>>>>>>> 2. Going through unmap_page(), remap_page() causes a full page table walk, which
>>>>>>>> the migrate_device API has already just done as a part of the migration. The
>>>>>>>> entries under consideration are already migration entries in this case.
>>>>>>>> This is wasteful and in some case unexpected.
>>>>>>>
>>>>>>> unmap_folio() already adds TTU_SPLIT_HUGE_PMD to try to split
>>>>>>> PMD mapping, which you did in migrate_vma_split_pages(). You probably
>>>>>>> can teach either try_to_migrate() or try_to_unmap() to just split
>>>>>>> device private PMD mapping. Or if that is not preferred,
>>>>>>> you can simply call split_huge_pmd_address() when unmap_folio()
>>>>>>> sees a device private folio.
>>>>>>>
>>>>>>> For remap_page(), you can simply return for device private folios
>>>>>>> like it is currently doing for non anonymous folios.
>>>>>>>
>>>>>>
>>>>>> Doing a full rmap walk does not make sense with unmap_folio() and
>>>>>> remap_folio(), because
>>>>>>
>>>>>> 1. We need to do a page table walk/rmap walk again
>>>>>> 2. We'll need special handling of migration <-> migration entries
>>>>>> in the rmap handling (set/remove migration ptes)
>>>>>> 3. In this context, the code is already in the middle of migration,
>>>>>> so trying to do that again does not make sense.
>>>>>
>>>>> Why doing split in the middle of migration? Existing split code
>>>>> assumes to-be-split folios are mapped.
>>>>>
>>>>> What prevents doing split before migration?
>>>>>
>>>>
>>>> The code does do a split prior to migration if THP selection fails
>>>>
>>>> Please see https://lore.kernel.org/lkml/20250703233511.2028395-5-balbirs@nvidia.com/
>>>> and the fallback part which calls split_folio()
>>>>
>>>> But the case under consideration is special since the device needs to allocate
>>>> corresponding pfn's as well. The changelog mentions it:
>>>>
>>>> "The common case that arises is that after setup, during migrate
>>>> the destination might not be able to allocate MIGRATE_PFN_COMPOUND
>>>> pages."
>>>>
>>>> I can expand on it, because migrate_vma() is a multi-phase operation
>>>>
>>>> 1. migrate_vma_setup()
>>>> 2. migrate_vma_pages()
>>>> 3. migrate_vma_finalize()
>>>>
>>>> It can so happen that when we get the destination pfn's allocated the destination
>>>> might not be able to allocate a large page, so we do the split in migrate_vma_pages().
>>>>
>>>> The pages have been unmapped and collected in migrate_vma_setup()
>>>>
>>>> The next patch in the series 9/12 (https://lore.kernel.org/lkml/20250703233511.2028395-10-balbirs@nvidia.com/)
>>>> tests the split and emulates a failure on the device side to allocate large pages
>>>> and tests it in 10/12 (https://lore.kernel.org/lkml/20250703233511.2028395-11-balbirs@nvidia.com/)
>>>>
>>>
>>> Another use case I’ve seen is when a previously allocated high-order
>>> folio, now in the free memory pool, is reallocated as a lower-order
>>> page. For example, a 2MB fault allocates a folio, the memory is later
>>
>> That is different. If the high-order folio is free, it should be split
>> using split_page() from mm/page_alloc.c.
>>
>
> Ah, ok. Let me see if that works - it would easier.
>
>>> freed, and then a 4KB fault reuses a page from that previously allocated
>>> folio. This will be actually quite common in Xe / GPU SVM. In such
>>> cases, the folio in an unmapped state needs to be split. I’d suggest a
>>
>> This folio is unused, so ->flags, ->mapping, and etc. are not set,
>> __split_unmapped_folio() is not for it, unless you mean free folio
>> differently.
>>
>
> This is right, those fields should be clear.
>
> Thanks for the tip.
>
I was hoping to reuse __split_folio_to_order() at some point in the future
to split the backing pages in the driver, but it is not an immediate priority.
Balbir
* Re: [v1 resend 05/12] mm/memory/fault: add support for zone device THP fault handling
2025-07-03 23:35 ` [v1 resend 05/12] mm/memory/fault: add support for zone device THP fault handling Balbir Singh
@ 2025-07-17 19:34 ` Matthew Brost
0 siblings, 0 replies; 99+ messages in thread
From: Matthew Brost @ 2025-07-17 19:34 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-mm, akpm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
On Fri, Jul 04, 2025 at 09:35:04AM +1000, Balbir Singh wrote:
> When the CPU touches a zone device THP entry, the data needs to
> be migrated back to the CPU, call migrate_to_ram() on these pages
> via do_huge_pmd_device_private() fault handling helper.
>
> Cc: Karol Herbst <kherbst@redhat.com>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
> Cc: Jane Chu <jane.chu@oracle.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Donet Tom <donettom@linux.ibm.com>
>
> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> ---
> include/linux/huge_mm.h | 7 +++++++
> mm/huge_memory.c | 40 ++++++++++++++++++++++++++++++++++++++++
> mm/memory.c | 6 ++++--
> 3 files changed, 51 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 4d5bb67dc4ec..65a1bdf29bb9 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -474,6 +474,8 @@ static inline bool folio_test_pmd_mappable(struct folio *folio)
>
> vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);
>
> +vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf);
> +
> extern struct folio *huge_zero_folio;
> extern unsigned long huge_zero_pfn;
>
> @@ -627,6 +629,11 @@ static inline vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
> return 0;
> }
>
> +static inline vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf)
> +{
> + return 0;
> +}
> +
> static inline bool is_huge_zero_folio(const struct folio *folio)
> {
> return false;
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index e6e390d0308f..f29add796931 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1267,6 +1267,46 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>
> }
>
> +vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf)
> +{
> + struct vm_area_struct *vma = vmf->vma;
> + unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
> + vm_fault_t ret = 0;
> + spinlock_t *ptl;
> + swp_entry_t swp_entry;
> + struct page *page;
> +
> + if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER))
> + return VM_FAULT_FALLBACK;
> +
> + if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> + vma_end_read(vma);
> + return VM_FAULT_RETRY;
> + }
> +
> + ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> + if (unlikely(!pmd_same(*vmf->pmd, vmf->orig_pmd))) {
> + spin_unlock(ptl);
> + return 0;
> + }
> +
> + swp_entry = pmd_to_swp_entry(vmf->orig_pmd);
> + page = pfn_swap_entry_to_page(swp_entry);
Would it make more sense here to use folio lock, unlock, put, get
functions?
Matt
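i.e. something along these lines - untested, same logic, just using the
folio API:

	struct folio *folio = page_folio(page);

	vmf->page = page;
	vmf->pte = NULL;
	if (folio_trylock(folio)) {
		folio_get(folio);
		spin_unlock(ptl);
		ret = page_pgmap(page)->ops->migrate_to_ram(vmf);
		folio_unlock(folio);
		folio_put(folio);
	} else {
		spin_unlock(ptl);
	}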
> + vmf->page = page;
> + vmf->pte = NULL;
> + if (trylock_page(vmf->page)) {
> + get_page(page);
> + spin_unlock(ptl);
> + ret = page_pgmap(page)->ops->migrate_to_ram(vmf);
> + unlock_page(vmf->page);
> + put_page(page);
> + } else {
> + spin_unlock(ptl);
> + }
> +
> + return ret;
> +}
> +
> /*
> * always: directly stall for all thp allocations
> * defer: wake kswapd and fail if not immediately available
> diff --git a/mm/memory.c b/mm/memory.c
> index 0f9b32a20e5b..c26c421b8325 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -6165,8 +6165,10 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
> vmf.orig_pmd = pmdp_get_lockless(vmf.pmd);
>
> if (unlikely(is_swap_pmd(vmf.orig_pmd))) {
> - VM_BUG_ON(thp_migration_supported() &&
> - !is_pmd_migration_entry(vmf.orig_pmd));
> + if (is_device_private_entry(
> + pmd_to_swp_entry(vmf.orig_pmd)))
> + return do_huge_pmd_device_private(&vmf);
> +
> if (is_pmd_migration_entry(vmf.orig_pmd))
> pmd_migration_entry_wait(mm, vmf.pmd);
> return 0;
> --
> 2.49.0
>
* Re: [v1 resend 08/12] mm/thp: add split during migration support
2025-07-16 21:53 ` Balbir Singh
@ 2025-07-17 22:24 ` Matthew Brost
2025-07-17 23:04 ` Zi Yan
0 siblings, 1 reply; 99+ messages in thread
From: Matthew Brost @ 2025-07-17 22:24 UTC (permalink / raw)
To: Balbir Singh
Cc: Zi Yan, linux-mm, akpm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
On Thu, Jul 17, 2025 at 07:53:40AM +1000, Balbir Singh wrote:
> On 7/17/25 02:24, Matthew Brost wrote:
> > On Wed, Jul 16, 2025 at 07:19:10AM -0400, Zi Yan wrote:
> >> On 16 Jul 2025, at 1:34, Matthew Brost wrote:
> >>
> >>> On Sun, Jul 06, 2025 at 11:47:10AM +1000, Balbir Singh wrote:
> >>>> On 7/6/25 11:34, Zi Yan wrote:
> >>>>> On 5 Jul 2025, at 21:15, Balbir Singh wrote:
> >>>>>
> >>>>>> On 7/5/25 11:55, Zi Yan wrote:
> >>>>>>> On 4 Jul 2025, at 20:58, Balbir Singh wrote:
> >>>>>>>
> >>>>>>>> On 7/4/25 21:24, Zi Yan wrote:
> >>>>>>>>>
> >>>>>>>>> s/pages/folio
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> Thanks, will make the changes
> >>>>>>>>
> >>>>>>>>> Why name it isolated if the folio is unmapped? Isolated folios often mean
> >>>>>>>>> they are removed from LRU lists. isolated here causes confusion.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> Ack, will change the name
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>>> *
> >>>>>>>>>> * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
> >>>>>>>>>> * It is in charge of checking whether the split is supported or not and
> >>>>>>>>>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
> >>>>>>>>>> */
> >>>>>>>>>> static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>>>>> struct page *split_at, struct page *lock_at,
> >>>>>>>>>> - struct list_head *list, bool uniform_split)
> >>>>>>>>>> + struct list_head *list, bool uniform_split, bool isolated)
> >>>>>>>>>> {
> >>>>>>>>>> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
> >>>>>>>>>> XA_STATE(xas, &folio->mapping->i_pages, folio->index);
> >>>>>>>>>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>>>>> * is taken to serialise against parallel split or collapse
> >>>>>>>>>> * operations.
> >>>>>>>>>> */
> >>>>>>>>>> - anon_vma = folio_get_anon_vma(folio);
> >>>>>>>>>> - if (!anon_vma) {
> >>>>>>>>>> - ret = -EBUSY;
> >>>>>>>>>> - goto out;
> >>>>>>>>>> + if (!isolated) {
> >>>>>>>>>> + anon_vma = folio_get_anon_vma(folio);
> >>>>>>>>>> + if (!anon_vma) {
> >>>>>>>>>> + ret = -EBUSY;
> >>>>>>>>>> + goto out;
> >>>>>>>>>> + }
> >>>>>>>>>> + anon_vma_lock_write(anon_vma);
> >>>>>>>>>> }
> >>>>>>>>>> end = -1;
> >>>>>>>>>> mapping = NULL;
> >>>>>>>>>> - anon_vma_lock_write(anon_vma);
> >>>>>>>>>> } else {
> >>>>>>>>>> unsigned int min_order;
> >>>>>>>>>> gfp_t gfp;
> >>>>>>>>>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>>>>> goto out_unlock;
> >>>>>>>>>> }
> >>>>>>>>>>
> >>>>>>>>>> - unmap_folio(folio);
> >>>>>>>>>> + if (!isolated)
> >>>>>>>>>> + unmap_folio(folio);
> >>>>>>>>>>
> >>>>>>>>>> /* block interrupt reentry in xa_lock and spinlock */
> >>>>>>>>>> local_irq_disable();
> >>>>>>>>>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>>>>>
> >>>>>>>>>> ret = __split_unmapped_folio(folio, new_order,
> >>>>>>>>>> split_at, lock_at, list, end, &xas, mapping,
> >>>>>>>>>> - uniform_split);
> >>>>>>>>>> + uniform_split, isolated);
> >>>>>>>>>> } else {
> >>>>>>>>>> spin_unlock(&ds_queue->split_queue_lock);
> >>>>>>>>>> fail:
> >>>>>>>>>> if (mapping)
> >>>>>>>>>> xas_unlock(&xas);
> >>>>>>>>>> local_irq_enable();
> >>>>>>>>>> - remap_page(folio, folio_nr_pages(folio), 0);
> >>>>>>>>>> + if (!isolated)
> >>>>>>>>>> + remap_page(folio, folio_nr_pages(folio), 0);
> >>>>>>>>>> ret = -EAGAIN;
> >>>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>> These "isolated" special handlings does not look good, I wonder if there
> >>>>>>>>> is a way of letting split code handle device private folios more gracefully.
> >>>>>>>>> It also causes confusions, since why does "isolated/unmapped" folios
> >>>>>>>>> not need to unmap_page(), remap_page(), or unlock?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> There are two reasons for going down the current code path
> >>>>>>>
> >>>>>>> After thinking more, I think adding isolated/unmapped is not the right
> >>>>>>> way, since unmapped folio is a very generic concept. If you add it,
> >>>>>>> one can easily misuse the folio split code by first unmapping a folio
> >>>>>>> and trying to split it with unmapped = true. I do not think that is
> >>>>>>> supported and your patch does not prevent that from happening in the future.
> >>>>>>>
> >>>>>>
> >>>>>> I don't understand the misuse case you mention, I assume you mean someone can
> >>>>>> get the usage wrong? The responsibility is on the caller to do the right thing
> >>>>>> if calling the API with unmapped
> >>>>>
> >>>>> Before your patch, there is no use case of splitting unmapped folios.
> >>>>> Your patch only adds support for device private page split, not any unmapped
> >>>>> folio split. So using a generic isolated/unmapped parameter is not OK.
> >>>>>
> >>>>
> >>>> There is a use for splitting unmapped folios (see below)
> >>>>
> >>>>>>
> >>>>>>> You should teach different parts of folio split code path to handle
> >>>>>>> device private folios properly. Details are below.
> >>>>>>>
> >>>>>>>>
> >>>>>>>> 1. if the isolated check is not present, folio_get_anon_vma will fail and cause
> >>>>>>>> the split routine to return with -EBUSY
> >>>>>>>
> >>>>>>> You do something below instead.
> >>>>>>>
> >>>>>>> if (!anon_vma && !folio_is_device_private(folio)) {
> >>>>>>> ret = -EBUSY;
> >>>>>>> goto out;
> >>>>>>> } else if (anon_vma) {
> >>>>>>> anon_vma_lock_write(anon_vma);
> >>>>>>> }
> >>>>>>>
> >>>>>>
> >>>>>> folio_get_anon() cannot be called for unmapped folios. In our case the page has
> >>>>>> already been unmapped. Is there a reason why you mix anon_vma_lock_write with
> >>>>>> the check for device private folios?
> >>>>>
> >>>>> Oh, I did not notice that anon_vma = folio_get_anon_vma(folio) is also
> >>>>> in if (!isolated) branch. In that case, just do
> >>>>>
> >>>>> if (folio_is_device_private(folio) {
> >>>>> ...
> >>>>> } else if (is_anon) {
> >>>>> ...
> >>>>> } else {
> >>>>> ...
> >>>>> }
> >>>>>
> >>>>>>
> >>>>>>> People can know device private folio split needs a special handling.
> >>>>>>>
> >>>>>>> BTW, why a device private folio can also be anonymous? Does it mean
> >>>>>>> if a page cache folio is migrated to device private, kernel also
> >>>>>>> sees it as both device private and file-backed?
> >>>>>>>
> >>>>>>
> >>>>>> FYI: device private folios only work with anonymous private pages, hence
> >>>>>> the name device private.
> >>>>>
> >>>>> OK.
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>>> 2. Going through unmap_page(), remap_page() causes a full page table walk, which
> >>>>>>>> the migrate_device API has already just done as a part of the migration. The
> >>>>>>>> entries under consideration are already migration entries in this case.
> >>>>>>>> This is wasteful and in some case unexpected.
> >>>>>>>
> >>>>>>> unmap_folio() already adds TTU_SPLIT_HUGE_PMD to try to split
> >>>>>>> PMD mapping, which you did in migrate_vma_split_pages(). You probably
> >>>>>>> can teach either try_to_migrate() or try_to_unmap() to just split
> >>>>>>> device private PMD mapping. Or if that is not preferred,
> >>>>>>> you can simply call split_huge_pmd_address() when unmap_folio()
> >>>>>>> sees a device private folio.
> >>>>>>>
> >>>>>>> For remap_page(), you can simply return for device private folios
> >>>>>>> like it is currently doing for non anonymous folios.
> >>>>>>>
> >>>>>>
> >>>>>> Doing a full rmap walk does not make sense with unmap_folio() and
> >>>>>> remap_folio(), because
> >>>>>>
> >>>>>> 1. We need to do a page table walk/rmap walk again
> >>>>>> 2. We'll need special handling of migration <-> migration entries
> >>>>>> in the rmap handling (set/remove migration ptes)
> >>>>>> 3. In this context, the code is already in the middle of migration,
> >>>>>> so trying to do that again does not make sense.
> >>>>>
> >>>>> Why doing split in the middle of migration? Existing split code
> >>>>> assumes to-be-split folios are mapped.
> >>>>>
> >>>>> What prevents doing split before migration?
> >>>>>
> >>>>
> >>>> The code does do a split prior to migration if THP selection fails
> >>>>
> >>>> Please see https://lore.kernel.org/lkml/20250703233511.2028395-5-balbirs@nvidia.com/
> >>>> and the fallback part which calls split_folio()
> >>>>
> >>>> But the case under consideration is special since the device needs to allocate
> >>>> corresponding pfn's as well. The changelog mentions it:
> >>>>
> >>>> "The common case that arises is that after setup, during migrate
> >>>> the destination might not be able to allocate MIGRATE_PFN_COMPOUND
> >>>> pages."
> >>>>
> >>>> I can expand on it, because migrate_vma() is a multi-phase operation
> >>>>
> >>>> 1. migrate_vma_setup()
> >>>> 2. migrate_vma_pages()
> >>>> 3. migrate_vma_finalize()
> >>>>
> >>>> It can so happen that when we get the destination pfn's allocated the destination
> >>>> might not be able to allocate a large page, so we do the split in migrate_vma_pages().
> >>>>
> >>>> The pages have been unmapped and collected in migrate_vma_setup()
> >>>>
> >>>> The next patch in the series 9/12 (https://lore.kernel.org/lkml/20250703233511.2028395-10-balbirs@nvidia.com/)
> >>>> tests the split and emulates a failure on the device side to allocate large pages
> >>>> and tests it in 10/12 (https://lore.kernel.org/lkml/20250703233511.2028395-11-balbirs@nvidia.com/)
> >>>>
> >>>
> >>> Another use case I’ve seen is when a previously allocated high-order
> >>> folio, now in the free memory pool, is reallocated as a lower-order
> >>> page. For example, a 2MB fault allocates a folio, the memory is later
> >>
> >> That is different. If the high-order folio is free, it should be split
> >> using split_page() from mm/page_alloc.c.
> >>
> >
> > Ah, ok. Let me see if that works - it would easier.
> >
This suggestion quickly blows up as PageCompound is true and page_count
here is zero.
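For reference, split_page() in mm/page_alloc.c starts with roughly these
checks, which is where it trips for me:

void split_page(struct page *page, unsigned int order)
{
	VM_BUG_ON_PAGE(PageCompound(page), page);
	VM_BUG_ON_PAGE(!page_count(page), page);
	/* ... */
}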
> >>> freed, and then a 4KB fault reuses a page from that previously allocated
> >>> folio. This will be actually quite common in Xe / GPU SVM. In such
> >>> cases, the folio in an unmapped state needs to be split. I’d suggest a
> >>
> >> This folio is unused, so ->flags, ->mapping, and etc. are not set,
> >> __split_unmapped_folio() is not for it, unless you mean free folio
> >> differently.
> >>
> >
> > This is right, those fields should be clear.
> >
> > Thanks for the tip.
> >
> I was hoping to reuse __split_folio_to_order() at some point in the future
> to split the backing pages in the driver, but it is not an immediate priority
>
I think we need something for the scenario I describe here. I was able to
make __split_huge_page_to_list_to_order() work with a couple of hacks, but
it is almost certainly not right, as Zi pointed out.
I'm new to the MM stuff, but I'll play around with this a bit and see if I
can come up with something that will work here.
Matt
> Balbir
* Re: [v1 resend 08/12] mm/thp: add split during migration support
2025-07-17 22:24 ` Matthew Brost
@ 2025-07-17 23:04 ` Zi Yan
2025-07-18 0:41 ` Matthew Brost
0 siblings, 1 reply; 99+ messages in thread
From: Zi Yan @ 2025-07-17 23:04 UTC (permalink / raw)
To: Matthew Brost
Cc: Balbir Singh, linux-mm, akpm, linux-kernel, Karol Herbst,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
On 17 Jul 2025, at 18:24, Matthew Brost wrote:
> On Thu, Jul 17, 2025 at 07:53:40AM +1000, Balbir Singh wrote:
>> On 7/17/25 02:24, Matthew Brost wrote:
>>> On Wed, Jul 16, 2025 at 07:19:10AM -0400, Zi Yan wrote:
>>>> On 16 Jul 2025, at 1:34, Matthew Brost wrote:
>>>>
>>>>> On Sun, Jul 06, 2025 at 11:47:10AM +1000, Balbir Singh wrote:
>>>>>> On 7/6/25 11:34, Zi Yan wrote:
>>>>>>> On 5 Jul 2025, at 21:15, Balbir Singh wrote:
>>>>>>>
>>>>>>>> On 7/5/25 11:55, Zi Yan wrote:
>>>>>>>>> On 4 Jul 2025, at 20:58, Balbir Singh wrote:
>>>>>>>>>
>>>>>>>>>> On 7/4/25 21:24, Zi Yan wrote:
>>>>>>>>>>>
>>>>>>>>>>> s/pages/folio
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks, will make the changes
>>>>>>>>>>
>>>>>>>>>>> Why name it isolated if the folio is unmapped? Isolated folios often mean
>>>>>>>>>>> they are removed from LRU lists. isolated here causes confusion.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Ack, will change the name
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>> *
>>>>>>>>>>>> * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>>>>>>>>>>>> * It is in charge of checking whether the split is supported or not and
>>>>>>>>>>>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>>>>>>>>>> */
>>>>>>>>>>>> static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>>>> struct page *split_at, struct page *lock_at,
>>>>>>>>>>>> - struct list_head *list, bool uniform_split)
>>>>>>>>>>>> + struct list_head *list, bool uniform_split, bool isolated)
>>>>>>>>>>>> {
>>>>>>>>>>>> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>>>>>>>>>> XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>>>>>>>>>>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>>>> * is taken to serialise against parallel split or collapse
>>>>>>>>>>>> * operations.
>>>>>>>>>>>> */
>>>>>>>>>>>> - anon_vma = folio_get_anon_vma(folio);
>>>>>>>>>>>> - if (!anon_vma) {
>>>>>>>>>>>> - ret = -EBUSY;
>>>>>>>>>>>> - goto out;
>>>>>>>>>>>> + if (!isolated) {
>>>>>>>>>>>> + anon_vma = folio_get_anon_vma(folio);
>>>>>>>>>>>> + if (!anon_vma) {
>>>>>>>>>>>> + ret = -EBUSY;
>>>>>>>>>>>> + goto out;
>>>>>>>>>>>> + }
>>>>>>>>>>>> + anon_vma_lock_write(anon_vma);
>>>>>>>>>>>> }
>>>>>>>>>>>> end = -1;
>>>>>>>>>>>> mapping = NULL;
>>>>>>>>>>>> - anon_vma_lock_write(anon_vma);
>>>>>>>>>>>> } else {
>>>>>>>>>>>> unsigned int min_order;
>>>>>>>>>>>> gfp_t gfp;
>>>>>>>>>>>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>>>> goto out_unlock;
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> - unmap_folio(folio);
>>>>>>>>>>>> + if (!isolated)
>>>>>>>>>>>> + unmap_folio(folio);
>>>>>>>>>>>>
>>>>>>>>>>>> /* block interrupt reentry in xa_lock and spinlock */
>>>>>>>>>>>> local_irq_disable();
>>>>>>>>>>>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>>>>
>>>>>>>>>>>> ret = __split_unmapped_folio(folio, new_order,
>>>>>>>>>>>> split_at, lock_at, list, end, &xas, mapping,
>>>>>>>>>>>> - uniform_split);
>>>>>>>>>>>> + uniform_split, isolated);
>>>>>>>>>>>> } else {
>>>>>>>>>>>> spin_unlock(&ds_queue->split_queue_lock);
>>>>>>>>>>>> fail:
>>>>>>>>>>>> if (mapping)
>>>>>>>>>>>> xas_unlock(&xas);
>>>>>>>>>>>> local_irq_enable();
>>>>>>>>>>>> - remap_page(folio, folio_nr_pages(folio), 0);
>>>>>>>>>>>> + if (!isolated)
>>>>>>>>>>>> + remap_page(folio, folio_nr_pages(folio), 0);
>>>>>>>>>>>> ret = -EAGAIN;
>>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> These "isolated" special handlings does not look good, I wonder if there
>>>>>>>>>>> is a way of letting split code handle device private folios more gracefully.
>>>>>>>>>>> It also causes confusions, since why does "isolated/unmapped" folios
>>>>>>>>>>> not need to unmap_page(), remap_page(), or unlock?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> There are two reasons for going down the current code path
>>>>>>>>>
>>>>>>>>> After thinking more, I think adding isolated/unmapped is not the right
>>>>>>>>> way, since unmapped folio is a very generic concept. If you add it,
>>>>>>>>> one can easily misuse the folio split code by first unmapping a folio
>>>>>>>>> and trying to split it with unmapped = true. I do not think that is
>>>>>>>>> supported and your patch does not prevent that from happening in the future.
>>>>>>>>>
>>>>>>>>
>>>>>>>> I don't understand the misuse case you mention, I assume you mean someone can
>>>>>>>> get the usage wrong? The responsibility is on the caller to do the right thing
>>>>>>>> if calling the API with unmapped
>>>>>>>
>>>>>>> Before your patch, there is no use case of splitting unmapped folios.
>>>>>>> Your patch only adds support for device private page split, not any unmapped
>>>>>>> folio split. So using a generic isolated/unmapped parameter is not OK.
>>>>>>>
>>>>>>
>>>>>> There is a use for splitting unmapped folios (see below)
>>>>>>
>>>>>>>>
>>>>>>>>> You should teach different parts of folio split code path to handle
>>>>>>>>> device private folios properly. Details are below.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 1. if the isolated check is not present, folio_get_anon_vma will fail and cause
>>>>>>>>>> the split routine to return with -EBUSY
>>>>>>>>>
>>>>>>>>> You do something below instead.
>>>>>>>>>
>>>>>>>>> if (!anon_vma && !folio_is_device_private(folio)) {
>>>>>>>>> ret = -EBUSY;
>>>>>>>>> goto out;
>>>>>>>>> } else if (anon_vma) {
>>>>>>>>> anon_vma_lock_write(anon_vma);
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>
>>>>>>>> folio_get_anon() cannot be called for unmapped folios. In our case the page has
>>>>>>>> already been unmapped. Is there a reason why you mix anon_vma_lock_write with
>>>>>>>> the check for device private folios?
>>>>>>>
>>>>>>> Oh, I did not notice that anon_vma = folio_get_anon_vma(folio) is also
>>>>>>> in if (!isolated) branch. In that case, just do
>>>>>>>
>>>>>>> if (folio_is_device_private(folio) {
>>>>>>> ...
>>>>>>> } else if (is_anon) {
>>>>>>> ...
>>>>>>> } else {
>>>>>>> ...
>>>>>>> }
>>>>>>>
>>>>>>>>
>>>>>>>>> People can know device private folio split needs a special handling.
>>>>>>>>>
>>>>>>>>> BTW, why a device private folio can also be anonymous? Does it mean
>>>>>>>>> if a page cache folio is migrated to device private, kernel also
>>>>>>>>> sees it as both device private and file-backed?
>>>>>>>>>
>>>>>>>>
>>>>>>>> FYI: device private folios only work with anonymous private pages, hence
>>>>>>>> the name device private.
>>>>>>>
>>>>>>> OK.
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> 2. Going through unmap_page(), remap_page() causes a full page table walk, which
>>>>>>>>>> the migrate_device API has already just done as a part of the migration. The
>>>>>>>>>> entries under consideration are already migration entries in this case.
>>>>>>>>>> This is wasteful and in some case unexpected.
>>>>>>>>>
>>>>>>>>> unmap_folio() already adds TTU_SPLIT_HUGE_PMD to try to split
>>>>>>>>> PMD mapping, which you did in migrate_vma_split_pages(). You probably
>>>>>>>>> can teach either try_to_migrate() or try_to_unmap() to just split
>>>>>>>>> device private PMD mapping. Or if that is not preferred,
>>>>>>>>> you can simply call split_huge_pmd_address() when unmap_folio()
>>>>>>>>> sees a device private folio.
>>>>>>>>>
>>>>>>>>> For remap_page(), you can simply return for device private folios
>>>>>>>>> like it is currently doing for non anonymous folios.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Doing a full rmap walk does not make sense with unmap_folio() and
>>>>>>>> remap_folio(), because
>>>>>>>>
>>>>>>>> 1. We need to do a page table walk/rmap walk again
>>>>>>>> 2. We'll need special handling of migration <-> migration entries
>>>>>>>> in the rmap handling (set/remove migration ptes)
>>>>>>>> 3. In this context, the code is already in the middle of migration,
>>>>>>>> so trying to do that again does not make sense.
>>>>>>>
>>>>>>> Why doing split in the middle of migration? Existing split code
>>>>>>> assumes to-be-split folios are mapped.
>>>>>>>
>>>>>>> What prevents doing split before migration?
>>>>>>>
>>>>>>
>>>>>> The code does do a split prior to migration if THP selection fails
>>>>>>
>>>>>> Please see https://lore.kernel.org/lkml/20250703233511.2028395-5-balbirs@nvidia.com/
>>>>>> and the fallback part which calls split_folio()
>>>>>>
>>>>>> But the case under consideration is special since the device needs to allocate
>>>>>> corresponding pfn's as well. The changelog mentions it:
>>>>>>
>>>>>> "The common case that arises is that after setup, during migrate
>>>>>> the destination might not be able to allocate MIGRATE_PFN_COMPOUND
>>>>>> pages."
>>>>>>
>>>>>> I can expand on it, because migrate_vma() is a multi-phase operation
>>>>>>
>>>>>> 1. migrate_vma_setup()
>>>>>> 2. migrate_vma_pages()
>>>>>> 3. migrate_vma_finalize()
>>>>>>
>>>>>> It can so happen that when we get the destination pfn's allocated the destination
>>>>>> might not be able to allocate a large page, so we do the split in migrate_vma_pages().
>>>>>>
>>>>>> The pages have been unmapped and collected in migrate_vma_setup()
>>>>>>
>>>>>> The next patch in the series 9/12 (https://lore.kernel.org/lkml/20250703233511.2028395-10-balbirs@nvidia.com/)
>>>>>> tests the split and emulates a failure on the device side to allocate large pages
>>>>>> and tests it in 10/12 (https://lore.kernel.org/lkml/20250703233511.2028395-11-balbirs@nvidia.com/)
>>>>>>
>>>>>
>>>>> Another use case I’ve seen is when a previously allocated high-order
>>>>> folio, now in the free memory pool, is reallocated as a lower-order
>>>>> page. For example, a 2MB fault allocates a folio, the memory is later
>>>>
>>>> That is different. If the high-order folio is free, it should be split
>>>> using split_page() from mm/page_alloc.c.
>>>>
>>>
>>> Ah, ok. Let me see if that works - it would easier.
>>>
>
> This suggestion quickly blows up as PageCompound is true and page_count
> here is zero.
OK, your folio has PageCompound set. Then you will need __split_unmapped_folio().
>
>>>>> freed, and then a 4KB fault reuses a page from that previously allocated
>>>>> folio. This will be actually quite common in Xe / GPU SVM. In such
>>>>> cases, the folio in an unmapped state needs to be split. I’d suggest a
>>>>
>>>> This folio is unused, so ->flags, ->mapping, and etc. are not set,
>>>> __split_unmapped_folio() is not for it, unless you mean free folio
>>>> differently.
>>>>
>>>
>>> This is right, those fields should be clear.
>>>
>>> Thanks for the tip.
>>>
>> I was hoping to reuse __split_folio_to_order() at some point in the future
>> to split the backing pages in the driver, but it is not an immediate priority
>>
>
> I think we need something for the scenario I describe here. I was to
> make __split_huge_page_to_list_to_order with a couple of hacks but it
> almostly certainig not right as Zi pointed out.
>
> New to the MM stuff, but play around with this a bit and see if I can
> come up with something that will work here.
Can you try to write a new split_page()-style function using
__split_unmapped_folio()? From your description, your folio is not mapped.
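Roughly the shape I have in mind - a sketch only, not tested. The name is
made up, it would have to live in mm/huge_memory.c next to __folio_split()
since __split_unmapped_folio() is static, and a real version needs the same
setup __folio_split() does around the call (frozen refcount, split queue
lock, irqs off). The argument list below is the pre-series one, i.e. without
the new "isolated" flag:

/*
 * Split an already-unmapped anon/device-private folio without the
 * rmap walk.  Caller holds the folio lock and a reference.
 */
static int split_unmapped_folio(struct folio *folio, unsigned int new_order)
{
	/* No page cache is involved for these folios. */
	XA_STATE(xas, NULL, 0);

	VM_WARN_ON_ONCE(!folio_test_locked(folio));
	VM_WARN_ON_ONCE(folio_mapped(folio));

	return __split_unmapped_folio(folio, new_order, &folio->page,
				      &folio->page, /* list */ NULL,
				      /* end */ -1, &xas, /* mapping */ NULL,
				      /* uniform_split */ true);
}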
Best Regards,
Yan, Zi
* Re: [v1 resend 00/12] THP support for zone device page migration
2025-07-03 23:34 [v1 resend 00/12] THP support for zone device page migration Balbir Singh
` (13 preceding siblings ...)
2025-07-08 14:53 ` David Hildenbrand
@ 2025-07-17 23:40 ` Matthew Brost
2025-07-18 3:57 ` Balbir Singh
14 siblings, 1 reply; 99+ messages in thread
From: Matthew Brost @ 2025-07-17 23:40 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-mm, akpm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
On Fri, Jul 04, 2025 at 09:34:59AM +1000, Balbir Singh wrote:
> This patch series adds support for THP migration of zone device pages.
> To do so, the patches implement support for folio zone device pages
> by adding support for setting up larger order pages.
>
> These patches build on the earlier posts by Ralph Campbell [1]
>
> Two new flags are added in vma_migration to select and mark compound pages.
> migrate_vma_setup(), migrate_vma_pages() and migrate_vma_finalize()
> support migration of these pages when MIGRATE_VMA_SELECT_COMPOUND
> is passed in as arguments.
>
> The series also adds zone device awareness to (m)THP pages along
> with fault handling of large zone device private pages. page vma walk
> and the rmap code is also zone device aware. Support has also been
> added for folios that might need to be split in the middle
> of migration (when the src and dst do not agree on
> MIGRATE_PFN_COMPOUND), that occurs when src side of the migration can
> migrate large pages, but the destination has not been able to allocate
> large pages. The code supported and used folio_split() when migrating
> THP pages, this is used when MIGRATE_VMA_SELECT_COMPOUND is not passed
> as an argument to migrate_vma_setup().
>
> The test infrastructure lib/test_hmm.c has been enhanced to support THP
> migration. A new ioctl to emulate failure of large page allocations has
> been added to test the folio split code path. hmm-tests.c has new test
> cases for huge page migration and to test the folio split path. A new
> throughput test has been added as well.
>
> The nouveau dmem code has been enhanced to use the new THP migration
> capability.
>
> Feedback from the RFC [2]:
>
Thanks for the patches; the results look very promising. I wanted to give
some quick feedback:
- You appear to have missed updating hmm_range_fault, specifically
hmm_vma_handle_pmd, to check for device-private entries and populate the
HMM PFNs accordingly. My colleague François has a fix for this if you're
interested.
- I believe copy_huge_pmd also needs to be updated to avoid installing a
migration entry if the swap entry is device-private. I don't have an
exact fix yet due to my limited experience with core MM. The test case
that triggers this is fairly simple: fault in a 2MB device page on the
GPU, then fork a process that reads the page — the kernel crashes in
this scenario.
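Not an exact fix, but the direction I mean is modeled on the device-private
handling in copy_nonpresent_pte(), applied at the PMD level - a rough,
untested sketch with a made-up helper name; rss accounting, rmap
duplication and soft-dirty/uffd-wp bits are left out:

/* Called from copy_huge_pmd() for a device-private swap PMD, instead of
 * treating it as a migration entry.  "pmd" is the value read from
 * *src_pmd under the page table lock.
 */
static void copy_device_private_pmd(struct mm_struct *dst_mm,
				    struct mm_struct *src_mm,
				    pmd_t *dst_pmd, pmd_t *src_pmd,
				    unsigned long addr, pmd_t pmd)
{
	swp_entry_t entry = pmd_to_swp_entry(pmd);
	struct page *page = pfn_swap_entry_to_page(entry);

	/* The child gets its own reference on the device page. */
	get_page(page);

	/*
	 * Downgrade a writable entry so the next write in either process
	 * faults and goes through migrate_to_ram(), as the PTE-level
	 * code does for CoW mappings.
	 */
	if (is_writable_device_private_entry(entry)) {
		entry = make_readable_device_private_entry(swp_offset(entry));
		pmd = swp_entry_to_pmd(entry);
		set_pmd_at(src_mm, addr, src_pmd, pmd);
	}
	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
}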
Matt
> It was advised that prep_compound_page() not be exposed just for the purposes
> of testing (test driver lib/test_hmm.c). Work arounds of copy and split the
> folios did not work due to lock order dependency in the callback for
> split folio.
>
> mTHP support:
>
> The patches hard code, HPAGE_PMD_NR in a few places, but the code has
> been kept generic to support various order sizes. With additional
> refactoring of the code support of different order sizes should be
> possible.
>
> The future plan is to post enhancements to support mTHP with a rough
> design as follows:
>
> 1. Add the notion of allowable thp orders to the HMM based test driver
> 2. For non PMD based THP paths in migrate_device.c, check to see if
> a suitable order is found and supported by the driver
> 3. Iterate across orders to check the highest supported order for migration
> 4. Migrate and finalize
>
> The mTHP patches can be built on top of this series, the key design elements
> that need to be worked out are infrastructure and driver support for multiple
> ordered pages and their migration.
>
> References:
> [1] https://lore.kernel.org/linux-mm/20201106005147.20113-1-rcampbell@nvidia.com/
> [2] https://lore.kernel.org/linux-mm/20250306044239.3874247-3-balbirs@nvidia.com/T/
>
> These patches are built on top of mm-unstable
>
> Cc: Karol Herbst <kherbst@redhat.com>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
> Cc: Jane Chu <jane.chu@oracle.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Donet Tom <donettom@linux.ibm.com>
>
> Changelog v1:
> - Changes from RFC [2], include support for handling fault_folio and using
> trylock in the fault path
> - A new test case has been added to measure the throughput improvement
> - General refactoring of code to keep up with the changes in mm
> - New split folio callback when the entire split is complete/done. The
> callback is used to know when the head order needs to be reset.
>
> Testing:
> - Testing was done with ZONE_DEVICE private pages on an x86 VM
> - Throughput showed upto 5x improvement with THP migration, system to device
> migration is slower due to the mirroring of data (see buffer->mirror)
>
>
> Balbir Singh (12):
> mm/zone_device: support large zone device private folios
> mm/migrate_device: flags for selecting device private THP pages
> mm/thp: zone_device awareness in THP handling code
> mm/migrate_device: THP migration of zone device pages
> mm/memory/fault: add support for zone device THP fault handling
> lib/test_hmm: test cases and support for zone device private THP
> mm/memremap: add folio_split support
> mm/thp: add split during migration support
> lib/test_hmm: add test case for split pages
> selftests/mm/hmm-tests: new tests for zone device THP migration
> gpu/drm/nouveau: add THP migration support
> selftests/mm/hmm-tests: new throughput tests including THP
>
> drivers/gpu/drm/nouveau/nouveau_dmem.c | 246 +++++++---
> drivers/gpu/drm/nouveau/nouveau_svm.c | 6 +-
> drivers/gpu/drm/nouveau/nouveau_svm.h | 3 +-
> include/linux/huge_mm.h | 18 +-
> include/linux/memremap.h | 29 +-
> include/linux/migrate.h | 2 +
> include/linux/mm.h | 1 +
> lib/test_hmm.c | 428 +++++++++++++----
> lib/test_hmm_uapi.h | 3 +
> mm/huge_memory.c | 261 ++++++++---
> mm/memory.c | 6 +-
> mm/memremap.c | 50 +-
> mm/migrate.c | 2 +
> mm/migrate_device.c | 488 +++++++++++++++++---
> mm/page_alloc.c | 1 +
> mm/page_vma_mapped.c | 10 +
> mm/pgtable-generic.c | 6 +
> mm/rmap.c | 19 +-
> tools/testing/selftests/mm/hmm-tests.c | 607 ++++++++++++++++++++++++-
> 19 files changed, 1874 insertions(+), 312 deletions(-)
>
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 08/12] mm/thp: add split during migration support
2025-07-17 23:04 ` Zi Yan
@ 2025-07-18 0:41 ` Matthew Brost
2025-07-18 1:25 ` Zi Yan
0 siblings, 1 reply; 99+ messages in thread
From: Matthew Brost @ 2025-07-18 0:41 UTC (permalink / raw)
To: Zi Yan
Cc: Balbir Singh, linux-mm, akpm, linux-kernel, Karol Herbst,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
On Thu, Jul 17, 2025 at 07:04:48PM -0400, Zi Yan wrote:
> On 17 Jul 2025, at 18:24, Matthew Brost wrote:
>
> > On Thu, Jul 17, 2025 at 07:53:40AM +1000, Balbir Singh wrote:
> >> On 7/17/25 02:24, Matthew Brost wrote:
> >>> On Wed, Jul 16, 2025 at 07:19:10AM -0400, Zi Yan wrote:
> >>>> On 16 Jul 2025, at 1:34, Matthew Brost wrote:
> >>>>
> >>>>> On Sun, Jul 06, 2025 at 11:47:10AM +1000, Balbir Singh wrote:
> >>>>>> On 7/6/25 11:34, Zi Yan wrote:
> >>>>>>> On 5 Jul 2025, at 21:15, Balbir Singh wrote:
> >>>>>>>
> >>>>>>>> On 7/5/25 11:55, Zi Yan wrote:
> >>>>>>>>> On 4 Jul 2025, at 20:58, Balbir Singh wrote:
> >>>>>>>>>
> >>>>>>>>>> On 7/4/25 21:24, Zi Yan wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> s/pages/folio
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Thanks, will make the changes
> >>>>>>>>>>
> >>>>>>>>>>> Why name it isolated if the folio is unmapped? Isolated folios often mean
> >>>>>>>>>>> they are removed from LRU lists. isolated here causes confusion.
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Ack, will change the name
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>>> *
> >>>>>>>>>>>> * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
> >>>>>>>>>>>> * It is in charge of checking whether the split is supported or not and
> >>>>>>>>>>>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
> >>>>>>>>>>>> */
> >>>>>>>>>>>> static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>>>>>>> struct page *split_at, struct page *lock_at,
> >>>>>>>>>>>> - struct list_head *list, bool uniform_split)
> >>>>>>>>>>>> + struct list_head *list, bool uniform_split, bool isolated)
> >>>>>>>>>>>> {
> >>>>>>>>>>>> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
> >>>>>>>>>>>> XA_STATE(xas, &folio->mapping->i_pages, folio->index);
> >>>>>>>>>>>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>>>>>>> * is taken to serialise against parallel split or collapse
> >>>>>>>>>>>> * operations.
> >>>>>>>>>>>> */
> >>>>>>>>>>>> - anon_vma = folio_get_anon_vma(folio);
> >>>>>>>>>>>> - if (!anon_vma) {
> >>>>>>>>>>>> - ret = -EBUSY;
> >>>>>>>>>>>> - goto out;
> >>>>>>>>>>>> + if (!isolated) {
> >>>>>>>>>>>> + anon_vma = folio_get_anon_vma(folio);
> >>>>>>>>>>>> + if (!anon_vma) {
> >>>>>>>>>>>> + ret = -EBUSY;
> >>>>>>>>>>>> + goto out;
> >>>>>>>>>>>> + }
> >>>>>>>>>>>> + anon_vma_lock_write(anon_vma);
> >>>>>>>>>>>> }
> >>>>>>>>>>>> end = -1;
> >>>>>>>>>>>> mapping = NULL;
> >>>>>>>>>>>> - anon_vma_lock_write(anon_vma);
> >>>>>>>>>>>> } else {
> >>>>>>>>>>>> unsigned int min_order;
> >>>>>>>>>>>> gfp_t gfp;
> >>>>>>>>>>>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>>>>>>> goto out_unlock;
> >>>>>>>>>>>> }
> >>>>>>>>>>>>
> >>>>>>>>>>>> - unmap_folio(folio);
> >>>>>>>>>>>> + if (!isolated)
> >>>>>>>>>>>> + unmap_folio(folio);
> >>>>>>>>>>>>
> >>>>>>>>>>>> /* block interrupt reentry in xa_lock and spinlock */
> >>>>>>>>>>>> local_irq_disable();
> >>>>>>>>>>>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>>>>>>>
> >>>>>>>>>>>> ret = __split_unmapped_folio(folio, new_order,
> >>>>>>>>>>>> split_at, lock_at, list, end, &xas, mapping,
> >>>>>>>>>>>> - uniform_split);
> >>>>>>>>>>>> + uniform_split, isolated);
> >>>>>>>>>>>> } else {
> >>>>>>>>>>>> spin_unlock(&ds_queue->split_queue_lock);
> >>>>>>>>>>>> fail:
> >>>>>>>>>>>> if (mapping)
> >>>>>>>>>>>> xas_unlock(&xas);
> >>>>>>>>>>>> local_irq_enable();
> >>>>>>>>>>>> - remap_page(folio, folio_nr_pages(folio), 0);
> >>>>>>>>>>>> + if (!isolated)
> >>>>>>>>>>>> + remap_page(folio, folio_nr_pages(folio), 0);
> >>>>>>>>>>>> ret = -EAGAIN;
> >>>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>> These "isolated" special handlings does not look good, I wonder if there
> >>>>>>>>>>> is a way of letting split code handle device private folios more gracefully.
> >>>>>>>>>>> It also causes confusions, since why does "isolated/unmapped" folios
> >>>>>>>>>>> not need to unmap_page(), remap_page(), or unlock?
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> There are two reasons for going down the current code path
> >>>>>>>>>
> >>>>>>>>> After thinking more, I think adding isolated/unmapped is not the right
> >>>>>>>>> way, since unmapped folio is a very generic concept. If you add it,
> >>>>>>>>> one can easily misuse the folio split code by first unmapping a folio
> >>>>>>>>> and trying to split it with unmapped = true. I do not think that is
> >>>>>>>>> supported and your patch does not prevent that from happening in the future.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> I don't understand the misuse case you mention, I assume you mean someone can
> >>>>>>>> get the usage wrong? The responsibility is on the caller to do the right thing
> >>>>>>>> if calling the API with unmapped
> >>>>>>>
> >>>>>>> Before your patch, there is no use case of splitting unmapped folios.
> >>>>>>> Your patch only adds support for device private page split, not any unmapped
> >>>>>>> folio split. So using a generic isolated/unmapped parameter is not OK.
> >>>>>>>
> >>>>>>
> >>>>>> There is a use for splitting unmapped folios (see below)
> >>>>>>
> >>>>>>>>
> >>>>>>>>> You should teach different parts of folio split code path to handle
> >>>>>>>>> device private folios properly. Details are below.
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> 1. if the isolated check is not present, folio_get_anon_vma will fail and cause
> >>>>>>>>>> the split routine to return with -EBUSY
> >>>>>>>>>
> >>>>>>>>> You do something below instead.
> >>>>>>>>>
> >>>>>>>>> if (!anon_vma && !folio_is_device_private(folio)) {
> >>>>>>>>> ret = -EBUSY;
> >>>>>>>>> goto out;
> >>>>>>>>> } else if (anon_vma) {
> >>>>>>>>> anon_vma_lock_write(anon_vma);
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> folio_get_anon() cannot be called for unmapped folios. In our case the page has
> >>>>>>>> already been unmapped. Is there a reason why you mix anon_vma_lock_write with
> >>>>>>>> the check for device private folios?
> >>>>>>>
> >>>>>>> Oh, I did not notice that anon_vma = folio_get_anon_vma(folio) is also
> >>>>>>> in if (!isolated) branch. In that case, just do
> >>>>>>>
> >>>>>>> if (folio_is_device_private(folio) {
> >>>>>>> ...
> >>>>>>> } else if (is_anon) {
> >>>>>>> ...
> >>>>>>> } else {
> >>>>>>> ...
> >>>>>>> }
> >>>>>>>
> >>>>>>>>
> >>>>>>>>> People can know device private folio split needs a special handling.
> >>>>>>>>>
> >>>>>>>>> BTW, why a device private folio can also be anonymous? Does it mean
> >>>>>>>>> if a page cache folio is migrated to device private, kernel also
> >>>>>>>>> sees it as both device private and file-backed?
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> FYI: device private folios only work with anonymous private pages, hence
> >>>>>>>> the name device private.
> >>>>>>>
> >>>>>>> OK.
> >>>>>>>
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> 2. Going through unmap_page(), remap_page() causes a full page table walk, which
> >>>>>>>>>> the migrate_device API has already just done as a part of the migration. The
> >>>>>>>>>> entries under consideration are already migration entries in this case.
> >>>>>>>>>> This is wasteful and in some case unexpected.
> >>>>>>>>>
> >>>>>>>>> unmap_folio() already adds TTU_SPLIT_HUGE_PMD to try to split
> >>>>>>>>> PMD mapping, which you did in migrate_vma_split_pages(). You probably
> >>>>>>>>> can teach either try_to_migrate() or try_to_unmap() to just split
> >>>>>>>>> device private PMD mapping. Or if that is not preferred,
> >>>>>>>>> you can simply call split_huge_pmd_address() when unmap_folio()
> >>>>>>>>> sees a device private folio.
> >>>>>>>>>
> >>>>>>>>> For remap_page(), you can simply return for device private folios
> >>>>>>>>> like it is currently doing for non anonymous folios.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> Doing a full rmap walk does not make sense with unmap_folio() and
> >>>>>>>> remap_folio(), because
> >>>>>>>>
> >>>>>>>> 1. We need to do a page table walk/rmap walk again
> >>>>>>>> 2. We'll need special handling of migration <-> migration entries
> >>>>>>>> in the rmap handling (set/remove migration ptes)
> >>>>>>>> 3. In this context, the code is already in the middle of migration,
> >>>>>>>> so trying to do that again does not make sense.
> >>>>>>>
> >>>>>>> Why doing split in the middle of migration? Existing split code
> >>>>>>> assumes to-be-split folios are mapped.
> >>>>>>>
> >>>>>>> What prevents doing split before migration?
> >>>>>>>
> >>>>>>
> >>>>>> The code does do a split prior to migration if THP selection fails
> >>>>>>
> >>>>>> Please see https://lore.kernel.org/lkml/20250703233511.2028395-5-balbirs@nvidia.com/
> >>>>>> and the fallback part which calls split_folio()
> >>>>>>
> >>>>>> But the case under consideration is special since the device needs to allocate
> >>>>>> corresponding pfn's as well. The changelog mentions it:
> >>>>>>
> >>>>>> "The common case that arises is that after setup, during migrate
> >>>>>> the destination might not be able to allocate MIGRATE_PFN_COMPOUND
> >>>>>> pages."
> >>>>>>
> >>>>>> I can expand on it, because migrate_vma() is a multi-phase operation
> >>>>>>
> >>>>>> 1. migrate_vma_setup()
> >>>>>> 2. migrate_vma_pages()
> >>>>>> 3. migrate_vma_finalize()
> >>>>>>
> >>>>>> It can so happen that when we get the destination pfn's allocated the destination
> >>>>>> might not be able to allocate a large page, so we do the split in migrate_vma_pages().
> >>>>>>
> >>>>>> The pages have been unmapped and collected in migrate_vma_setup()
> >>>>>>
> >>>>>> The next patch in the series 9/12 (https://lore.kernel.org/lkml/20250703233511.2028395-10-balbirs@nvidia.com/)
> >>>>>> tests the split and emulates a failure on the device side to allocate large pages
> >>>>>> and tests it in 10/12 (https://lore.kernel.org/lkml/20250703233511.2028395-11-balbirs@nvidia.com/)
> >>>>>>
> >>>>>
> >>>>> Another use case I’ve seen is when a previously allocated high-order
> >>>>> folio, now in the free memory pool, is reallocated as a lower-order
> >>>>> page. For example, a 2MB fault allocates a folio, the memory is later
> >>>>
> >>>> That is different. If the high-order folio is free, it should be split
> >>>> using split_page() from mm/page_alloc.c.
> >>>>
> >>>
> >>> Ah, ok. Let me see if that works - it would easier.
> >>>
> >
> > This suggestion quickly blows up as PageCompound is true and page_count
> > here is zero.
>
> OK, your folio has PageCompound set. Then you will need __split_unmapped_foio().
>
> >
> >>>>> freed, and then a 4KB fault reuses a page from that previously allocated
> >>>>> folio. This will be actually quite common in Xe / GPU SVM. In such
> >>>>> cases, the folio in an unmapped state needs to be split. I’d suggest a
> >>>>
> >>>> This folio is unused, so ->flags, ->mapping, and etc. are not set,
> >>>> __split_unmapped_folio() is not for it, unless you mean free folio
> >>>> differently.
> >>>>
> >>>
> >>> This is right, those fields should be clear.
> >>>
> >>> Thanks for the tip.
> >>>
> >> I was hoping to reuse __split_folio_to_order() at some point in the future
> >> to split the backing pages in the driver, but it is not an immediate priority
> >>
> >
> > I think we need something for the scenario I describe here. I was to
> > make __split_huge_page_to_list_to_order with a couple of hacks but it
> > almostly certainig not right as Zi pointed out.
> >
> > New to the MM stuff, but play around with this a bit and see if I can
> > come up with something that will work here.
>
> Can you try to write a new split_page function with __split_unmapped_folio()?
> Since based on your description, your folio is not mapped.
>
Yes, page->mapping is NULL in this case - that was part of the hacks to
__split_huge_page_to_list_to_order (more specifically __folio_split) I had
to make in order to get something working for this case.
I can try out something based on __split_unmapped_folio and report back.
Matt
>
> Best Regards,
> Yan, Zi
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 08/12] mm/thp: add split during migration support
2025-07-18 0:41 ` Matthew Brost
@ 2025-07-18 1:25 ` Zi Yan
2025-07-18 3:33 ` Matthew Brost
0 siblings, 1 reply; 99+ messages in thread
From: Zi Yan @ 2025-07-18 1:25 UTC (permalink / raw)
To: Matthew Brost
Cc: Balbir Singh, linux-mm, akpm, linux-kernel, Karol Herbst,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
On 17 Jul 2025, at 20:41, Matthew Brost wrote:
> On Thu, Jul 17, 2025 at 07:04:48PM -0400, Zi Yan wrote:
>> On 17 Jul 2025, at 18:24, Matthew Brost wrote:
>>
>>> On Thu, Jul 17, 2025 at 07:53:40AM +1000, Balbir Singh wrote:
>>>> On 7/17/25 02:24, Matthew Brost wrote:
>>>>> On Wed, Jul 16, 2025 at 07:19:10AM -0400, Zi Yan wrote:
>>>>>> On 16 Jul 2025, at 1:34, Matthew Brost wrote:
>>>>>>
>>>>>>> On Sun, Jul 06, 2025 at 11:47:10AM +1000, Balbir Singh wrote:
>>>>>>>> On 7/6/25 11:34, Zi Yan wrote:
>>>>>>>>> On 5 Jul 2025, at 21:15, Balbir Singh wrote:
>>>>>>>>>
>>>>>>>>>> On 7/5/25 11:55, Zi Yan wrote:
>>>>>>>>>>> On 4 Jul 2025, at 20:58, Balbir Singh wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On 7/4/25 21:24, Zi Yan wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> s/pages/folio
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks, will make the changes
>>>>>>>>>>>>
>>>>>>>>>>>>> Why name it isolated if the folio is unmapped? Isolated folios often mean
>>>>>>>>>>>>> they are removed from LRU lists. isolated here causes confusion.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Ack, will change the name
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>> *
>>>>>>>>>>>>>> * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>>>>>>>>>>>>>> * It is in charge of checking whether the split is supported or not and
>>>>>>>>>>>>>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>>>>>>>>>>>> */
>>>>>>>>>>>>>> static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>>>>>> struct page *split_at, struct page *lock_at,
>>>>>>>>>>>>>> - struct list_head *list, bool uniform_split)
>>>>>>>>>>>>>> + struct list_head *list, bool uniform_split, bool isolated)
>>>>>>>>>>>>>> {
>>>>>>>>>>>>>> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>>>>>>>>>>>> XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>>>>>>>>>>>>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>>>>>> * is taken to serialise against parallel split or collapse
>>>>>>>>>>>>>> * operations.
>>>>>>>>>>>>>> */
>>>>>>>>>>>>>> - anon_vma = folio_get_anon_vma(folio);
>>>>>>>>>>>>>> - if (!anon_vma) {
>>>>>>>>>>>>>> - ret = -EBUSY;
>>>>>>>>>>>>>> - goto out;
>>>>>>>>>>>>>> + if (!isolated) {
>>>>>>>>>>>>>> + anon_vma = folio_get_anon_vma(folio);
>>>>>>>>>>>>>> + if (!anon_vma) {
>>>>>>>>>>>>>> + ret = -EBUSY;
>>>>>>>>>>>>>> + goto out;
>>>>>>>>>>>>>> + }
>>>>>>>>>>>>>> + anon_vma_lock_write(anon_vma);
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>> end = -1;
>>>>>>>>>>>>>> mapping = NULL;
>>>>>>>>>>>>>> - anon_vma_lock_write(anon_vma);
>>>>>>>>>>>>>> } else {
>>>>>>>>>>>>>> unsigned int min_order;
>>>>>>>>>>>>>> gfp_t gfp;
>>>>>>>>>>>>>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>>>>>> goto out_unlock;
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - unmap_folio(folio);
>>>>>>>>>>>>>> + if (!isolated)
>>>>>>>>>>>>>> + unmap_folio(folio);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> /* block interrupt reentry in xa_lock and spinlock */
>>>>>>>>>>>>>> local_irq_disable();
>>>>>>>>>>>>>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ret = __split_unmapped_folio(folio, new_order,
>>>>>>>>>>>>>> split_at, lock_at, list, end, &xas, mapping,
>>>>>>>>>>>>>> - uniform_split);
>>>>>>>>>>>>>> + uniform_split, isolated);
>>>>>>>>>>>>>> } else {
>>>>>>>>>>>>>> spin_unlock(&ds_queue->split_queue_lock);
>>>>>>>>>>>>>> fail:
>>>>>>>>>>>>>> if (mapping)
>>>>>>>>>>>>>> xas_unlock(&xas);
>>>>>>>>>>>>>> local_irq_enable();
>>>>>>>>>>>>>> - remap_page(folio, folio_nr_pages(folio), 0);
>>>>>>>>>>>>>> + if (!isolated)
>>>>>>>>>>>>>> + remap_page(folio, folio_nr_pages(folio), 0);
>>>>>>>>>>>>>> ret = -EAGAIN;
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>
>>>>>>>>>>>>> These "isolated" special handlings does not look good, I wonder if there
>>>>>>>>>>>>> is a way of letting split code handle device private folios more gracefully.
>>>>>>>>>>>>> It also causes confusions, since why does "isolated/unmapped" folios
>>>>>>>>>>>>> not need to unmap_page(), remap_page(), or unlock?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> There are two reasons for going down the current code path
>>>>>>>>>>>
>>>>>>>>>>> After thinking more, I think adding isolated/unmapped is not the right
>>>>>>>>>>> way, since unmapped folio is a very generic concept. If you add it,
>>>>>>>>>>> one can easily misuse the folio split code by first unmapping a folio
>>>>>>>>>>> and trying to split it with unmapped = true. I do not think that is
>>>>>>>>>>> supported and your patch does not prevent that from happening in the future.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I don't understand the misuse case you mention, I assume you mean someone can
>>>>>>>>>> get the usage wrong? The responsibility is on the caller to do the right thing
>>>>>>>>>> if calling the API with unmapped
>>>>>>>>>
>>>>>>>>> Before your patch, there is no use case of splitting unmapped folios.
>>>>>>>>> Your patch only adds support for device private page split, not any unmapped
>>>>>>>>> folio split. So using a generic isolated/unmapped parameter is not OK.
>>>>>>>>>
>>>>>>>>
>>>>>>>> There is a use for splitting unmapped folios (see below)
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> You should teach different parts of folio split code path to handle
>>>>>>>>>>> device private folios properly. Details are below.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 1. if the isolated check is not present, folio_get_anon_vma will fail and cause
>>>>>>>>>>>> the split routine to return with -EBUSY
>>>>>>>>>>>
>>>>>>>>>>> You do something below instead.
>>>>>>>>>>>
>>>>>>>>>>> if (!anon_vma && !folio_is_device_private(folio)) {
>>>>>>>>>>> ret = -EBUSY;
>>>>>>>>>>> goto out;
>>>>>>>>>>> } else if (anon_vma) {
>>>>>>>>>>> anon_vma_lock_write(anon_vma);
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> folio_get_anon() cannot be called for unmapped folios. In our case the page has
>>>>>>>>>> already been unmapped. Is there a reason why you mix anon_vma_lock_write with
>>>>>>>>>> the check for device private folios?
>>>>>>>>>
>>>>>>>>> Oh, I did not notice that anon_vma = folio_get_anon_vma(folio) is also
>>>>>>>>> in if (!isolated) branch. In that case, just do
>>>>>>>>>
>>>>>>>>> if (folio_is_device_private(folio) {
>>>>>>>>> ...
>>>>>>>>> } else if (is_anon) {
>>>>>>>>> ...
>>>>>>>>> } else {
>>>>>>>>> ...
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> People can know device private folio split needs a special handling.
>>>>>>>>>>>
>>>>>>>>>>> BTW, why a device private folio can also be anonymous? Does it mean
>>>>>>>>>>> if a page cache folio is migrated to device private, kernel also
>>>>>>>>>>> sees it as both device private and file-backed?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> FYI: device private folios only work with anonymous private pages, hence
>>>>>>>>>> the name device private.
>>>>>>>>>
>>>>>>>>> OK.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> 2. Going through unmap_page(), remap_page() causes a full page table walk, which
>>>>>>>>>>>> the migrate_device API has already just done as a part of the migration. The
>>>>>>>>>>>> entries under consideration are already migration entries in this case.
>>>>>>>>>>>> This is wasteful and in some case unexpected.
>>>>>>>>>>>
>>>>>>>>>>> unmap_folio() already adds TTU_SPLIT_HUGE_PMD to try to split
>>>>>>>>>>> PMD mapping, which you did in migrate_vma_split_pages(). You probably
>>>>>>>>>>> can teach either try_to_migrate() or try_to_unmap() to just split
>>>>>>>>>>> device private PMD mapping. Or if that is not preferred,
>>>>>>>>>>> you can simply call split_huge_pmd_address() when unmap_folio()
>>>>>>>>>>> sees a device private folio.
>>>>>>>>>>>
>>>>>>>>>>> For remap_page(), you can simply return for device private folios
>>>>>>>>>>> like it is currently doing for non anonymous folios.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Doing a full rmap walk does not make sense with unmap_folio() and
>>>>>>>>>> remap_folio(), because
>>>>>>>>>>
>>>>>>>>>> 1. We need to do a page table walk/rmap walk again
>>>>>>>>>> 2. We'll need special handling of migration <-> migration entries
>>>>>>>>>> in the rmap handling (set/remove migration ptes)
>>>>>>>>>> 3. In this context, the code is already in the middle of migration,
>>>>>>>>>> so trying to do that again does not make sense.
>>>>>>>>>
>>>>>>>>> Why doing split in the middle of migration? Existing split code
>>>>>>>>> assumes to-be-split folios are mapped.
>>>>>>>>>
>>>>>>>>> What prevents doing split before migration?
>>>>>>>>>
>>>>>>>>
>>>>>>>> The code does do a split prior to migration if THP selection fails
>>>>>>>>
>>>>>>>> Please see https://lore.kernel.org/lkml/20250703233511.2028395-5-balbirs@nvidia.com/
>>>>>>>> and the fallback part which calls split_folio()
>>>>>>>>
>>>>>>>> But the case under consideration is special since the device needs to allocate
>>>>>>>> corresponding pfn's as well. The changelog mentions it:
>>>>>>>>
>>>>>>>> "The common case that arises is that after setup, during migrate
>>>>>>>> the destination might not be able to allocate MIGRATE_PFN_COMPOUND
>>>>>>>> pages."
>>>>>>>>
>>>>>>>> I can expand on it, because migrate_vma() is a multi-phase operation
>>>>>>>>
>>>>>>>> 1. migrate_vma_setup()
>>>>>>>> 2. migrate_vma_pages()
>>>>>>>> 3. migrate_vma_finalize()
>>>>>>>>
>>>>>>>> It can so happen that when we get the destination pfn's allocated the destination
>>>>>>>> might not be able to allocate a large page, so we do the split in migrate_vma_pages().
>>>>>>>>
>>>>>>>> The pages have been unmapped and collected in migrate_vma_setup()
>>>>>>>>
>>>>>>>> The next patch in the series 9/12 (https://lore.kernel.org/lkml/20250703233511.2028395-10-balbirs@nvidia.com/)
>>>>>>>> tests the split and emulates a failure on the device side to allocate large pages
>>>>>>>> and tests it in 10/12 (https://lore.kernel.org/lkml/20250703233511.2028395-11-balbirs@nvidia.com/)
>>>>>>>>
>>>>>>>
>>>>>>> Another use case I’ve seen is when a previously allocated high-order
>>>>>>> folio, now in the free memory pool, is reallocated as a lower-order
>>>>>>> page. For example, a 2MB fault allocates a folio, the memory is later
>>>>>>
>>>>>> That is different. If the high-order folio is free, it should be split
>>>>>> using split_page() from mm/page_alloc.c.
>>>>>>
>>>>>
>>>>> Ah, ok. Let me see if that works - it would easier.
>>>>>
>>>
>>> This suggestion quickly blows up as PageCompound is true and page_count
>>> here is zero.
>>
>> OK, your folio has PageCompound set. Then you will need __split_unmapped_foio().
>>
>>>
>>>>>>> freed, and then a 4KB fault reuses a page from that previously allocated
>>>>>>> folio. This will be actually quite common in Xe / GPU SVM. In such
>>>>>>> cases, the folio in an unmapped state needs to be split. I’d suggest a
>>>>>>
>>>>>> This folio is unused, so ->flags, ->mapping, and etc. are not set,
>>>>>> __split_unmapped_folio() is not for it, unless you mean free folio
>>>>>> differently.
>>>>>>
>>>>>
>>>>> This is right, those fields should be clear.
>>>>>
>>>>> Thanks for the tip.
>>>>>
>>>> I was hoping to reuse __split_folio_to_order() at some point in the future
>>>> to split the backing pages in the driver, but it is not an immediate priority
>>>>
>>>
>>> I think we need something for the scenario I describe here. I was to
>>> make __split_huge_page_to_list_to_order with a couple of hacks but it
>>> almostly certainig not right as Zi pointed out.
>>>
>>> New to the MM stuff, but play around with this a bit and see if I can
>>> come up with something that will work here.
>>
>> Can you try to write a new split_page function with __split_unmapped_folio()?
>> Since based on your description, your folio is not mapped.
>>
>
> Yes, page->mapping is NULL in this case - that was part of the hacks to
> __split_huge_page_to_list_to_order (more specially __folio_split) I had
> to make in order to get something working for this case.
>
> I can try out something based on __split_unmapped_folio and report back.
The mm-new tree has an updated __split_unmapped_folio() version; it moves
all the unmap-irrelevant code out of __split_unmapped_folio(). You might find
it easier to reuse.
See: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/tree/mm/huge_memory.c?h=mm-new#n3430
I am about to update the code with v4 patches. I will cc you, so that
you can get the updated __split_unmapped_folio().
Feel free to ask questions about the folio split code.
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 02/12] mm/migrate_device: flags for selecting device private THP pages
2025-07-03 23:35 ` [v1 resend 02/12] mm/migrate_device: flags for selecting device private THP pages Balbir Singh
2025-07-07 5:31 ` Alistair Popple
@ 2025-07-18 3:15 ` Matthew Brost
1 sibling, 0 replies; 99+ messages in thread
From: Matthew Brost @ 2025-07-18 3:15 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-mm, akpm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
On Fri, Jul 04, 2025 at 09:35:01AM +1000, Balbir Singh wrote:
> Add flags to mark zone device migration pages.
>
> MIGRATE_VMA_SELECT_COMPOUND will be used to select THP pages during
> migrate_vma_setup() and MIGRATE_PFN_COMPOUND will make migrating
> device pages as compound pages during device pfn migration.
>
> Cc: Karol Herbst <kherbst@redhat.com>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
> Cc: Jane Chu <jane.chu@oracle.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Donet Tom <donettom@linux.ibm.com>
>
> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> ---
> include/linux/migrate.h | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index aaa2114498d6..1661e2d5479a 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -167,6 +167,7 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
> #define MIGRATE_PFN_VALID (1UL << 0)
> #define MIGRATE_PFN_MIGRATE (1UL << 1)
> #define MIGRATE_PFN_WRITE (1UL << 3)
> +#define MIGRATE_PFN_COMPOUND (1UL << 4)
Can some documentation be added around the usage of MIGRATE_PFN_COMPOUND?
In particular, how is the flag used in relation to the migrate_vma_* functions?
For example, when MIGRATE_PFN_COMPOUND is set in a returned mpfn, the caller
should check the order of the folio associated with that mpfn, and then expect
the next (1 << order) - 1 entries in the source array to be unpopulated.
Likewise, when a caller populates an mpfn with MIGRATE_PFN_COMPOUND, the next
(1 << order) - 1 entries should also be left unpopulated.
This behavior wasn't immediately obvious, so I think it would be helpful to
document it to avoid requiring readers to reverse-engineer the code.
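Something along these lines next to the flag definition would go a long way
(rough wording only, lifted from the cover letter / patch 4 description, so
adjust as needed):

	/*
	 * MIGRATE_PFN_COMPOUND: this src/dst entry describes a compound (THP)
	 * page. Only the first entry of the compound range carries a pfn and
	 * flags; the remaining (1 << order) - 1 entries of that range are
	 * left as 0 so the folio can still be split to a smaller order later.
	 * Callers that set MIGRATE_PFN_COMPOUND in a dst entry must lay out
	 * their entries the same way.
	 */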
> #define MIGRATE_PFN_SHIFT 6
>
> static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
> @@ -185,6 +186,7 @@ enum migrate_vma_direction {
Maybe out of scope, but migrate_vma_direction is not a great name. While
we are here, it could be cleaned up.
> MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
> MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
> MIGRATE_VMA_SELECT_DEVICE_COHERENT = 1 << 2,
> + MIGRATE_VMA_SELECT_COMPOUND = 1 << 3,
Same here.
For example, what happens if MIGRATE_VMA_SELECT_COMPOUND is selected vs. unselected
when a higher-order folio is found?
Matt
> };
>
> struct migrate_vma {
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 08/12] mm/thp: add split during migration support
2025-07-18 1:25 ` Zi Yan
@ 2025-07-18 3:33 ` Matthew Brost
2025-07-18 15:06 ` Zi Yan
0 siblings, 1 reply; 99+ messages in thread
From: Matthew Brost @ 2025-07-18 3:33 UTC (permalink / raw)
To: Zi Yan
Cc: Balbir Singh, linux-mm, akpm, linux-kernel, Karol Herbst,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
On Thu, Jul 17, 2025 at 09:25:02PM -0400, Zi Yan wrote:
> On 17 Jul 2025, at 20:41, Matthew Brost wrote:
>
> > On Thu, Jul 17, 2025 at 07:04:48PM -0400, Zi Yan wrote:
> >> On 17 Jul 2025, at 18:24, Matthew Brost wrote:
> >>
> >>> On Thu, Jul 17, 2025 at 07:53:40AM +1000, Balbir Singh wrote:
> >>>> On 7/17/25 02:24, Matthew Brost wrote:
> >>>>> On Wed, Jul 16, 2025 at 07:19:10AM -0400, Zi Yan wrote:
> >>>>>> On 16 Jul 2025, at 1:34, Matthew Brost wrote:
> >>>>>>
> >>>>>>> On Sun, Jul 06, 2025 at 11:47:10AM +1000, Balbir Singh wrote:
> >>>>>>>> On 7/6/25 11:34, Zi Yan wrote:
> >>>>>>>>> On 5 Jul 2025, at 21:15, Balbir Singh wrote:
> >>>>>>>>>
> >>>>>>>>>> On 7/5/25 11:55, Zi Yan wrote:
> >>>>>>>>>>> On 4 Jul 2025, at 20:58, Balbir Singh wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> On 7/4/25 21:24, Zi Yan wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> s/pages/folio
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks, will make the changes
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Why name it isolated if the folio is unmapped? Isolated folios often mean
> >>>>>>>>>>>>> they are removed from LRU lists. isolated here causes confusion.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Ack, will change the name
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>>> *
> >>>>>>>>>>>>>> * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
> >>>>>>>>>>>>>> * It is in charge of checking whether the split is supported or not and
> >>>>>>>>>>>>>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
> >>>>>>>>>>>>>> */
> >>>>>>>>>>>>>> static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>>>>>>>>> struct page *split_at, struct page *lock_at,
> >>>>>>>>>>>>>> - struct list_head *list, bool uniform_split)
> >>>>>>>>>>>>>> + struct list_head *list, bool uniform_split, bool isolated)
> >>>>>>>>>>>>>> {
> >>>>>>>>>>>>>> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
> >>>>>>>>>>>>>> XA_STATE(xas, &folio->mapping->i_pages, folio->index);
> >>>>>>>>>>>>>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>>>>>>>>> * is taken to serialise against parallel split or collapse
> >>>>>>>>>>>>>> * operations.
> >>>>>>>>>>>>>> */
> >>>>>>>>>>>>>> - anon_vma = folio_get_anon_vma(folio);
> >>>>>>>>>>>>>> - if (!anon_vma) {
> >>>>>>>>>>>>>> - ret = -EBUSY;
> >>>>>>>>>>>>>> - goto out;
> >>>>>>>>>>>>>> + if (!isolated) {
> >>>>>>>>>>>>>> + anon_vma = folio_get_anon_vma(folio);
> >>>>>>>>>>>>>> + if (!anon_vma) {
> >>>>>>>>>>>>>> + ret = -EBUSY;
> >>>>>>>>>>>>>> + goto out;
> >>>>>>>>>>>>>> + }
> >>>>>>>>>>>>>> + anon_vma_lock_write(anon_vma);
> >>>>>>>>>>>>>> }
> >>>>>>>>>>>>>> end = -1;
> >>>>>>>>>>>>>> mapping = NULL;
> >>>>>>>>>>>>>> - anon_vma_lock_write(anon_vma);
> >>>>>>>>>>>>>> } else {
> >>>>>>>>>>>>>> unsigned int min_order;
> >>>>>>>>>>>>>> gfp_t gfp;
> >>>>>>>>>>>>>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>>>>>>>>> goto out_unlock;
> >>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> - unmap_folio(folio);
> >>>>>>>>>>>>>> + if (!isolated)
> >>>>>>>>>>>>>> + unmap_folio(folio);
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> /* block interrupt reentry in xa_lock and spinlock */
> >>>>>>>>>>>>>> local_irq_disable();
> >>>>>>>>>>>>>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> ret = __split_unmapped_folio(folio, new_order,
> >>>>>>>>>>>>>> split_at, lock_at, list, end, &xas, mapping,
> >>>>>>>>>>>>>> - uniform_split);
> >>>>>>>>>>>>>> + uniform_split, isolated);
> >>>>>>>>>>>>>> } else {
> >>>>>>>>>>>>>> spin_unlock(&ds_queue->split_queue_lock);
> >>>>>>>>>>>>>> fail:
> >>>>>>>>>>>>>> if (mapping)
> >>>>>>>>>>>>>> xas_unlock(&xas);
> >>>>>>>>>>>>>> local_irq_enable();
> >>>>>>>>>>>>>> - remap_page(folio, folio_nr_pages(folio), 0);
> >>>>>>>>>>>>>> + if (!isolated)
> >>>>>>>>>>>>>> + remap_page(folio, folio_nr_pages(folio), 0);
> >>>>>>>>>>>>>> ret = -EAGAIN;
> >>>>>>>>>>>>>> }
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> These "isolated" special handlings does not look good, I wonder if there
> >>>>>>>>>>>>> is a way of letting split code handle device private folios more gracefully.
> >>>>>>>>>>>>> It also causes confusions, since why does "isolated/unmapped" folios
> >>>>>>>>>>>>> not need to unmap_page(), remap_page(), or unlock?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> There are two reasons for going down the current code path
> >>>>>>>>>>>
> >>>>>>>>>>> After thinking more, I think adding isolated/unmapped is not the right
> >>>>>>>>>>> way, since unmapped folio is a very generic concept. If you add it,
> >>>>>>>>>>> one can easily misuse the folio split code by first unmapping a folio
> >>>>>>>>>>> and trying to split it with unmapped = true. I do not think that is
> >>>>>>>>>>> supported and your patch does not prevent that from happening in the future.
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> I don't understand the misuse case you mention, I assume you mean someone can
> >>>>>>>>>> get the usage wrong? The responsibility is on the caller to do the right thing
> >>>>>>>>>> if calling the API with unmapped
> >>>>>>>>>
> >>>>>>>>> Before your patch, there is no use case of splitting unmapped folios.
> >>>>>>>>> Your patch only adds support for device private page split, not any unmapped
> >>>>>>>>> folio split. So using a generic isolated/unmapped parameter is not OK.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> There is a use for splitting unmapped folios (see below)
> >>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> You should teach different parts of folio split code path to handle
> >>>>>>>>>>> device private folios properly. Details are below.
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> 1. if the isolated check is not present, folio_get_anon_vma will fail and cause
> >>>>>>>>>>>> the split routine to return with -EBUSY
> >>>>>>>>>>>
> >>>>>>>>>>> You do something below instead.
> >>>>>>>>>>>
> >>>>>>>>>>> if (!anon_vma && !folio_is_device_private(folio)) {
> >>>>>>>>>>> ret = -EBUSY;
> >>>>>>>>>>> goto out;
> >>>>>>>>>>> } else if (anon_vma) {
> >>>>>>>>>>> anon_vma_lock_write(anon_vma);
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> folio_get_anon() cannot be called for unmapped folios. In our case the page has
> >>>>>>>>>> already been unmapped. Is there a reason why you mix anon_vma_lock_write with
> >>>>>>>>>> the check for device private folios?
> >>>>>>>>>
> >>>>>>>>> Oh, I did not notice that anon_vma = folio_get_anon_vma(folio) is also
> >>>>>>>>> in if (!isolated) branch. In that case, just do
> >>>>>>>>>
> >>>>>>>>> if (folio_is_device_private(folio) {
> >>>>>>>>> ...
> >>>>>>>>> } else if (is_anon) {
> >>>>>>>>> ...
> >>>>>>>>> } else {
> >>>>>>>>> ...
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> People can know device private folio split needs a special handling.
> >>>>>>>>>>>
> >>>>>>>>>>> BTW, why a device private folio can also be anonymous? Does it mean
> >>>>>>>>>>> if a page cache folio is migrated to device private, kernel also
> >>>>>>>>>>> sees it as both device private and file-backed?
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> FYI: device private folios only work with anonymous private pages, hence
> >>>>>>>>>> the name device private.
> >>>>>>>>>
> >>>>>>>>> OK.
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>> 2. Going through unmap_page(), remap_page() causes a full page table walk, which
> >>>>>>>>>>>> the migrate_device API has already just done as a part of the migration. The
> >>>>>>>>>>>> entries under consideration are already migration entries in this case.
> >>>>>>>>>>>> This is wasteful and in some case unexpected.
> >>>>>>>>>>>
> >>>>>>>>>>> unmap_folio() already adds TTU_SPLIT_HUGE_PMD to try to split
> >>>>>>>>>>> PMD mapping, which you did in migrate_vma_split_pages(). You probably
> >>>>>>>>>>> can teach either try_to_migrate() or try_to_unmap() to just split
> >>>>>>>>>>> device private PMD mapping. Or if that is not preferred,
> >>>>>>>>>>> you can simply call split_huge_pmd_address() when unmap_folio()
> >>>>>>>>>>> sees a device private folio.
> >>>>>>>>>>>
> >>>>>>>>>>> For remap_page(), you can simply return for device private folios
> >>>>>>>>>>> like it is currently doing for non anonymous folios.
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Doing a full rmap walk does not make sense with unmap_folio() and
> >>>>>>>>>> remap_folio(), because
> >>>>>>>>>>
> >>>>>>>>>> 1. We need to do a page table walk/rmap walk again
> >>>>>>>>>> 2. We'll need special handling of migration <-> migration entries
> >>>>>>>>>> in the rmap handling (set/remove migration ptes)
> >>>>>>>>>> 3. In this context, the code is already in the middle of migration,
> >>>>>>>>>> so trying to do that again does not make sense.
> >>>>>>>>>
> >>>>>>>>> Why doing split in the middle of migration? Existing split code
> >>>>>>>>> assumes to-be-split folios are mapped.
> >>>>>>>>>
> >>>>>>>>> What prevents doing split before migration?
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> The code does do a split prior to migration if THP selection fails
> >>>>>>>>
> >>>>>>>> Please see https://lore.kernel.org/lkml/20250703233511.2028395-5-balbirs@nvidia.com/
> >>>>>>>> and the fallback part which calls split_folio()
> >>>>>>>>
> >>>>>>>> But the case under consideration is special since the device needs to allocate
> >>>>>>>> corresponding pfn's as well. The changelog mentions it:
> >>>>>>>>
> >>>>>>>> "The common case that arises is that after setup, during migrate
> >>>>>>>> the destination might not be able to allocate MIGRATE_PFN_COMPOUND
> >>>>>>>> pages."
> >>>>>>>>
> >>>>>>>> I can expand on it, because migrate_vma() is a multi-phase operation
> >>>>>>>>
> >>>>>>>> 1. migrate_vma_setup()
> >>>>>>>> 2. migrate_vma_pages()
> >>>>>>>> 3. migrate_vma_finalize()
> >>>>>>>>
> >>>>>>>> It can so happen that when we get the destination pfn's allocated the destination
> >>>>>>>> might not be able to allocate a large page, so we do the split in migrate_vma_pages().
> >>>>>>>>
> >>>>>>>> The pages have been unmapped and collected in migrate_vma_setup()
> >>>>>>>>
> >>>>>>>> The next patch in the series 9/12 (https://lore.kernel.org/lkml/20250703233511.2028395-10-balbirs@nvidia.com/)
> >>>>>>>> tests the split and emulates a failure on the device side to allocate large pages
> >>>>>>>> and tests it in 10/12 (https://lore.kernel.org/lkml/20250703233511.2028395-11-balbirs@nvidia.com/)
> >>>>>>>>
> >>>>>>>
> >>>>>>> Another use case I’ve seen is when a previously allocated high-order
> >>>>>>> folio, now in the free memory pool, is reallocated as a lower-order
> >>>>>>> page. For example, a 2MB fault allocates a folio, the memory is later
> >>>>>>
> >>>>>> That is different. If the high-order folio is free, it should be split
> >>>>>> using split_page() from mm/page_alloc.c.
> >>>>>>
> >>>>>
> >>>>> Ah, ok. Let me see if that works - it would easier.
> >>>>>
> >>>
> >>> This suggestion quickly blows up as PageCompound is true and page_count
> >>> here is zero.
> >>
> >> OK, your folio has PageCompound set. Then you will need __split_unmapped_foio().
> >>
> >>>
> >>>>>>> freed, and then a 4KB fault reuses a page from that previously allocated
> >>>>>>> folio. This will be actually quite common in Xe / GPU SVM. In such
> >>>>>>> cases, the folio in an unmapped state needs to be split. I’d suggest a
> >>>>>>
> >>>>>> This folio is unused, so ->flags, ->mapping, and etc. are not set,
> >>>>>> __split_unmapped_folio() is not for it, unless you mean free folio
> >>>>>> differently.
> >>>>>>
> >>>>>
> >>>>> This is right, those fields should be clear.
> >>>>>
> >>>>> Thanks for the tip.
> >>>>>
> >>>> I was hoping to reuse __split_folio_to_order() at some point in the future
> >>>> to split the backing pages in the driver, but it is not an immediate priority
> >>>>
> >>>
> >>> I think we need something for the scenario I describe here. I was to
> >>> make __split_huge_page_to_list_to_order with a couple of hacks but it
> >>> almostly certainig not right as Zi pointed out.
> >>>
> >>> New to the MM stuff, but play around with this a bit and see if I can
> >>> come up with something that will work here.
> >>
> >> Can you try to write a new split_page function with __split_unmapped_folio()?
> >> Since based on your description, your folio is not mapped.
> >>
> >
> > Yes, page->mapping is NULL in this case - that was part of the hacks to
> > __split_huge_page_to_list_to_order (more specially __folio_split) I had
> > to make in order to get something working for this case.
> >
> > I can try out something based on __split_unmapped_folio and report back.
>
> mm-new tree has an updated __split_unmapped_folio() version, it moves
> all unmap irrelevant code out of __split_unmaped_folio(). You might find
> it easier to reuse.
>
> See: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/tree/mm/huge_memory.c?h=mm-new#n3430
>
Will take a look. It is possible some of the issues we are hitting are
due to working on drm-tip, pulling in the core MM patches in this series on
top of that branch, and then missing some other patches in mm-new. I'll see
if we can figure out a workflow to have the latest and greatest from
both drm-tip and the MM branches.
Will these changes be in 6.17?
> I am about to update the code with v4 patches. I will cc you, so that
> you can get the updated __split_unmaped_folio().
>
> Feel free to ask questions on folio split code.
>
Thanks.
Matt
> Best Regards,
> Yan, Zi
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 00/12] THP support for zone device page migration
2025-07-17 23:40 ` Matthew Brost
@ 2025-07-18 3:57 ` Balbir Singh
2025-07-18 4:57 ` Matthew Brost
` (2 more replies)
0 siblings, 3 replies; 99+ messages in thread
From: Balbir Singh @ 2025-07-18 3:57 UTC (permalink / raw)
To: Matthew Brost
Cc: linux-mm, akpm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
On 7/18/25 09:40, Matthew Brost wrote:
> On Fri, Jul 04, 2025 at 09:34:59AM +1000, Balbir Singh wrote:
...
>>
>> The nouveau dmem code has been enhanced to use the new THP migration
>> capability.
>>
>> Feedback from the RFC [2]:
>>
>
> Thanks for the patches, results look very promising. I wanted to give
> some quick feedback:
>
Are you seeing improvements with the patchset?
> - You appear to have missed updating hmm_range_fault, specifically
> hmm_vma_handle_pmd, to check for device-private entries and populate the
> HMM PFNs accordingly. My colleague François has a fix for this if you're
> interested.
>
Sure, please feel free to post them.
> - I believe copy_huge_pmd also needs to be updated to avoid installing a
> migration entry if the swap entry is device-private. I don't have an
> exact fix yet due to my limited experience with core MM. The test case
> that triggers this is fairly simple: fault in a 2MB device page on the
> GPU, then fork a process that reads the page — the kernel crashes in
> this scenario.
>
I'd be happy to look at any traces you have, or at any fixes you post
Thanks for the feedback
Balbir Singh
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 00/12] THP support for zone device page migration
2025-07-18 3:57 ` Balbir Singh
@ 2025-07-18 4:57 ` Matthew Brost
2025-07-21 23:48 ` Balbir Singh
2025-07-19 0:53 ` Matthew Brost
2025-07-21 11:42 ` Francois Dugast
2 siblings, 1 reply; 99+ messages in thread
From: Matthew Brost @ 2025-07-18 4:57 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-mm, akpm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
On Fri, Jul 18, 2025 at 01:57:13PM +1000, Balbir Singh wrote:
> On 7/18/25 09:40, Matthew Brost wrote:
> > On Fri, Jul 04, 2025 at 09:34:59AM +1000, Balbir Singh wrote:
> ...
> >>
> >> The nouveau dmem code has been enhanced to use the new THP migration
> >> capability.
> >>
> >> Feedback from the RFC [2]:
> >>
> >
> > Thanks for the patches, results look very promising. I wanted to give
> > some quick feedback:
> >
>
> Are you seeing improvements with the patchset?
>
We're nowhere near stable yet, but basic testing shows that CPU time
from the start of migrate_vma_* to the end drops from ~300µs to ~6µs on
a 2MB GPU fault. A lot of this improvement comes from dma-mapping at 2M
granularity for the CPU<->GPU copy rather than mapping 512 4K pages
individually.
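For reference, the 2M-granularity path boils down to a single map call over
the folio instead of a per-page loop; a minimal sketch (made-up helper name,
'dev' being whatever struct device the copy engine uses):

	static dma_addr_t map_thp_for_copy(struct device *dev, struct folio *folio)
	{
		/*
		 * One dma_map_page() call covering the whole 2M folio
		 * instead of HPAGE_PMD_NR (512) separate 4K mappings.
		 */
		return dma_map_page(dev, folio_page(folio, 0), 0,
				    HPAGE_PMD_SIZE, DMA_BIDIRECTIONAL);
	}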
> > - You appear to have missed updating hmm_range_fault, specifically
> > hmm_vma_handle_pmd, to check for device-private entries and populate the
> > HMM PFNs accordingly. My colleague François has a fix for this if you're
> > interested.
> >
>
> Sure, please feel free to post them.
>
> > - I believe copy_huge_pmd also needs to be updated to avoid installing a
> > migration entry if the swap entry is device-private. I don't have an
> > exact fix yet due to my limited experience with core MM. The test case
> > that triggers this is fairly simple: fault in a 2MB device page on the
> > GPU, then fork a process that reads the page — the kernel crashes in
> > this scenario.
> >
>
> I'd be happy to look at any traces you have or post any fixes you have
>
I've got it to the point where the kernel doesn't explode, but I still get warnings like:
[ 3564.850036] mm/pgtable-generic.c:54: bad pmd ffff8881290408e0(efffff80042bfe00)
[ 3565.298186] BUG: Bad rss-counter state mm:ffff88810a100c40 type:MM_ANONPAGES val:114688
[ 3565.306108] BUG: non-zero pgtables_bytes on freeing mm: 917504
I basically just skip the is_swap_pmd clause if the entry is device
private, and let the rest of the function execute. This avoids
installing a migration entry (which isn't required and causes the
crash) and allows the rmap code to run, which flips the pages to
non-anon-exclusive (effectively making them copy-on-write (?), though
that doesn't fully apply to device pages). It's not 100% correct yet,
but it's a step in the right direction.
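The check I gate the is_swap_pmd() branch on is just something like this
(hypothetical helper, shown only to make the above concrete):

	static bool pmd_is_device_private(pmd_t pmd)
	{
		/* non-present pmd holding a device private swap entry */
		return is_swap_pmd(pmd) &&
		       is_device_private_entry(pmd_to_swp_entry(pmd));
	}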
Matt
> Thanks for the feedback
> Balbir Singh
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 04/12] mm/migrate_device: THP migration of zone device pages
2025-07-03 23:35 ` [v1 resend 04/12] mm/migrate_device: THP migration of zone device pages Balbir Singh
2025-07-04 15:35 ` kernel test robot
@ 2025-07-18 6:59 ` Matthew Brost
2025-07-18 7:04 ` Balbir Singh
2025-07-19 2:10 ` Matthew Brost
2 siblings, 1 reply; 99+ messages in thread
From: Matthew Brost @ 2025-07-18 6:59 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-mm, akpm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
On Fri, Jul 04, 2025 at 09:35:03AM +1000, Balbir Singh wrote:
> migrate_device code paths go through the collect, setup
> and finalize phases of migration. Support for MIGRATE_PFN_COMPOUND
> was added earlier in the series to mark THP pages as
> MIGRATE_PFN_COMPOUND.
>
> The entries in src and dst arrays passed to these functions still
> remain at a PAGE_SIZE granularity. When a compound page is passed,
> the first entry has the PFN along with MIGRATE_PFN_COMPOUND
> and other flags set (MIGRATE_PFN_MIGRATE, MIGRATE_PFN_VALID), the
> remaining entries (HPAGE_PMD_NR - 1) are filled with 0's. This
> representation allows for the compound page to be split into smaller
> page sizes.
>
> migrate_vma_collect_hole(), migrate_vma_collect_pmd() are now THP
> page aware. Two new helper functions migrate_vma_collect_huge_pmd()
> and migrate_vma_insert_huge_pmd_page() have been added.
>
> migrate_vma_collect_huge_pmd() can collect THP pages, but if for
> some reason this fails, there is fallback support to split the folio
> and migrate it.
>
> migrate_vma_insert_huge_pmd_page() closely follows the logic of
> migrate_vma_insert_page()
>
> Support for splitting pages as needed for migration will follow in
> later patches in this series.
>
> Cc: Karol Herbst <kherbst@redhat.com>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
> Cc: Jane Chu <jane.chu@oracle.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Donet Tom <donettom@linux.ibm.com>
>
> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> ---
> mm/migrate_device.c | 437 +++++++++++++++++++++++++++++++++++++-------
> 1 file changed, 376 insertions(+), 61 deletions(-)
>
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index e05e14d6eacd..41d0bd787969 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -14,6 +14,7 @@
> #include <linux/pagewalk.h>
> #include <linux/rmap.h>
> #include <linux/swapops.h>
> +#include <asm/pgalloc.h>
> #include <asm/tlbflush.h>
> #include "internal.h"
>
> @@ -44,6 +45,23 @@ static int migrate_vma_collect_hole(unsigned long start,
> if (!vma_is_anonymous(walk->vma))
> return migrate_vma_collect_skip(start, end, walk);
>
> + if (thp_migration_supported() &&
> + (migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
> + (IS_ALIGNED(start, HPAGE_PMD_SIZE) &&
> + IS_ALIGNED(end, HPAGE_PMD_SIZE))) {
> + migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE |
> + MIGRATE_PFN_COMPOUND;
> + migrate->dst[migrate->npages] = 0;
> + migrate->npages++;
> + migrate->cpages++;
It's a bit unclear what cpages and npages actually represent when
collecting a THP. In my opinion, they should reflect the total number of
minimum-sized pages collected, i.e. we should increment by the shifted
order (512) here. I'm fairly certain the logic in migrate_device.c would
break if a 4MB range was requested and a THP was found first, followed by a
non-THP.
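To make that concrete, for a 4MB range where the first 2MB is a THP and the
second 2MB is small pages, the arrays end up roughly as:

	src[0]          pfn | MIGRATE_PFN_VALID | MIGRATE_PFN_MIGRATE | MIGRATE_PFN_COMPOUND
	src[1..511]     0
	src[512..1023]  pfn | MIGRATE_PFN_VALID | MIGRATE_PFN_MIGRATE  (one entry per 4K page)

so whatever convention cpages/npages follow has to stay consistent across
that boundary.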
Matt
> +
> + /*
> + * Collect the remaining entries as holes, in case we
> + * need to split later
> + */
> + return migrate_vma_collect_skip(start + PAGE_SIZE, end, walk);
> + }
> +
> for (addr = start; addr < end; addr += PAGE_SIZE) {
> migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE;
> migrate->dst[migrate->npages] = 0;
> @@ -54,57 +72,148 @@ static int migrate_vma_collect_hole(unsigned long start,
> return 0;
> }
>
> -static int migrate_vma_collect_pmd(pmd_t *pmdp,
> - unsigned long start,
> - unsigned long end,
> - struct mm_walk *walk)
> +/**
> + * migrate_vma_collect_huge_pmd - collect THP pages without splitting the
> + * folio for device private pages.
> + * @pmdp: pointer to pmd entry
> + * @start: start address of the range for migration
> + * @end: end address of the range for migration
> + * @walk: mm_walk callback structure
> + *
> + * Collect the huge pmd entry at @pmdp for migration and set the
> + * MIGRATE_PFN_COMPOUND flag in the migrate src entry to indicate that
> + * migration will occur at HPAGE_PMD granularity
> + */
> +static int migrate_vma_collect_huge_pmd(pmd_t *pmdp, unsigned long start,
> + unsigned long end, struct mm_walk *walk,
> + struct folio *fault_folio)
> {
> + struct mm_struct *mm = walk->mm;
> + struct folio *folio;
> struct migrate_vma *migrate = walk->private;
> - struct folio *fault_folio = migrate->fault_page ?
> - page_folio(migrate->fault_page) : NULL;
> - struct vm_area_struct *vma = walk->vma;
> - struct mm_struct *mm = vma->vm_mm;
> - unsigned long addr = start, unmapped = 0;
> spinlock_t *ptl;
> - pte_t *ptep;
> + swp_entry_t entry;
> + int ret;
> + unsigned long write = 0;
>
> -again:
> - if (pmd_none(*pmdp))
> + ptl = pmd_lock(mm, pmdp);
> + if (pmd_none(*pmdp)) {
> + spin_unlock(ptl);
> return migrate_vma_collect_hole(start, end, -1, walk);
> + }
>
> if (pmd_trans_huge(*pmdp)) {
> - struct folio *folio;
> -
> - ptl = pmd_lock(mm, pmdp);
> - if (unlikely(!pmd_trans_huge(*pmdp))) {
> + if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)) {
> spin_unlock(ptl);
> - goto again;
> + return migrate_vma_collect_skip(start, end, walk);
> }
>
> folio = pmd_folio(*pmdp);
> if (is_huge_zero_folio(folio)) {
> spin_unlock(ptl);
> - split_huge_pmd(vma, pmdp, addr);
> - } else {
> - int ret;
> + return migrate_vma_collect_hole(start, end, -1, walk);
> + }
> + if (pmd_write(*pmdp))
> + write = MIGRATE_PFN_WRITE;
> + } else if (!pmd_present(*pmdp)) {
> + entry = pmd_to_swp_entry(*pmdp);
> + folio = pfn_swap_entry_folio(entry);
> +
> + if (!is_device_private_entry(entry) ||
> + !(migrate->flags & MIGRATE_VMA_SELECT_DEVICE_PRIVATE) ||
> + (folio->pgmap->owner != migrate->pgmap_owner)) {
> + spin_unlock(ptl);
> + return migrate_vma_collect_skip(start, end, walk);
> + }
>
> - folio_get(folio);
> + if (is_migration_entry(entry)) {
> + migration_entry_wait_on_locked(entry, ptl);
> spin_unlock(ptl);
> - /* FIXME: we don't expect THP for fault_folio */
> - if (WARN_ON_ONCE(fault_folio == folio))
> - return migrate_vma_collect_skip(start, end,
> - walk);
> - if (unlikely(!folio_trylock(folio)))
> - return migrate_vma_collect_skip(start, end,
> - walk);
> - ret = split_folio(folio);
> - if (fault_folio != folio)
> - folio_unlock(folio);
> - folio_put(folio);
> - if (ret)
> - return migrate_vma_collect_skip(start, end,
> - walk);
> + return -EAGAIN;
> }
> +
> + if (is_writable_device_private_entry(entry))
> + write = MIGRATE_PFN_WRITE;
> + } else {
> + spin_unlock(ptl);
> + return -EAGAIN;
> + }
> +
> + folio_get(folio);
> + if (folio != fault_folio && unlikely(!folio_trylock(folio))) {
> + spin_unlock(ptl);
> + folio_put(folio);
> + return migrate_vma_collect_skip(start, end, walk);
> + }
> +
> + if (thp_migration_supported() &&
> + (migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
> + (IS_ALIGNED(start, HPAGE_PMD_SIZE) &&
> + IS_ALIGNED(end, HPAGE_PMD_SIZE))) {
> +
> + struct page_vma_mapped_walk pvmw = {
> + .ptl = ptl,
> + .address = start,
> + .pmd = pmdp,
> + .vma = walk->vma,
> + };
> +
> + unsigned long pfn = page_to_pfn(folio_page(folio, 0));
> +
> + migrate->src[migrate->npages] = migrate_pfn(pfn) | write
> + | MIGRATE_PFN_MIGRATE
> + | MIGRATE_PFN_COMPOUND;
> + migrate->dst[migrate->npages++] = 0;
> + migrate->cpages++;
> + ret = set_pmd_migration_entry(&pvmw, folio_page(folio, 0));
> + if (ret) {
> + migrate->npages--;
> + migrate->cpages--;
> + migrate->src[migrate->npages] = 0;
> + migrate->dst[migrate->npages] = 0;
> + goto fallback;
> + }
> + migrate_vma_collect_skip(start + PAGE_SIZE, end, walk);
> + spin_unlock(ptl);
> + return 0;
> + }
> +
> +fallback:
> + spin_unlock(ptl);
> + ret = split_folio(folio);
> + if (fault_folio != folio)
> + folio_unlock(folio);
> + folio_put(folio);
> + if (ret)
> + return migrate_vma_collect_skip(start, end, walk);
> + if (pmd_none(pmdp_get_lockless(pmdp)))
> + return migrate_vma_collect_hole(start, end, -1, walk);
> +
> + return -ENOENT;
> +}
> +
> +static int migrate_vma_collect_pmd(pmd_t *pmdp,
> + unsigned long start,
> + unsigned long end,
> + struct mm_walk *walk)
> +{
> + struct migrate_vma *migrate = walk->private;
> + struct vm_area_struct *vma = walk->vma;
> + struct mm_struct *mm = vma->vm_mm;
> + unsigned long addr = start, unmapped = 0;
> + spinlock_t *ptl;
> + struct folio *fault_folio = migrate->fault_page ?
> + page_folio(migrate->fault_page) : NULL;
> + pte_t *ptep;
> +
> +again:
> + if (pmd_trans_huge(*pmdp) || !pmd_present(*pmdp)) {
> + int ret = migrate_vma_collect_huge_pmd(pmdp, start, end, walk, fault_folio);
> +
> + if (ret == -EAGAIN)
> + goto again;
> + if (ret == 0)
> + return 0;
> }
>
> ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> @@ -175,8 +284,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
> }
>
> - /* FIXME support THP */
> - if (!page || !page->mapping || PageTransCompound(page)) {
> + if (!page || !page->mapping) {
> mpfn = 0;
> goto next;
> }
> @@ -347,14 +455,6 @@ static bool migrate_vma_check_page(struct page *page, struct page *fault_page)
> */
> int extra = 1 + (page == fault_page);
>
> - /*
> - * FIXME support THP (transparent huge page), it is bit more complex to
> - * check them than regular pages, because they can be mapped with a pmd
> - * or with a pte (split pte mapping).
> - */
> - if (folio_test_large(folio))
> - return false;
> -
> /* Page from ZONE_DEVICE have one extra reference */
> if (folio_is_zone_device(folio))
> extra++;
> @@ -385,17 +485,24 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
>
> lru_add_drain();
>
> - for (i = 0; i < npages; i++) {
> + for (i = 0; i < npages; ) {
> struct page *page = migrate_pfn_to_page(src_pfns[i]);
> struct folio *folio;
> + unsigned int nr = 1;
>
> if (!page) {
> if (src_pfns[i] & MIGRATE_PFN_MIGRATE)
> unmapped++;
> - continue;
> + goto next;
> }
>
> folio = page_folio(page);
> + nr = folio_nr_pages(folio);
> +
> + if (nr > 1)
> + src_pfns[i] |= MIGRATE_PFN_COMPOUND;
> +
> +
> /* ZONE_DEVICE folios are not on LRU */
> if (!folio_is_zone_device(folio)) {
> if (!folio_test_lru(folio) && allow_drain) {
> @@ -407,7 +514,7 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
> if (!folio_isolate_lru(folio)) {
> src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
> restore++;
> - continue;
> + goto next;
> }
>
> /* Drop the reference we took in collect */
> @@ -426,10 +533,12 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
>
> src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
> restore++;
> - continue;
> + goto next;
> }
>
> unmapped++;
> +next:
> + i += nr;
> }
>
> for (i = 0; i < npages && restore; i++) {
> @@ -575,6 +684,146 @@ int migrate_vma_setup(struct migrate_vma *args)
> }
> EXPORT_SYMBOL(migrate_vma_setup);
>
> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> +/**
> + * migrate_vma_insert_huge_pmd_page: Insert a huge folio into @migrate->vma->vm_mm
> + * at @addr. folio is already allocated as a part of the migration process with
> + * large page.
> + *
> + * @folio needs to be initialized and setup after it's allocated. The code bits
> + * here follow closely the code in __do_huge_pmd_anonymous_page(). This API does
> + * not support THP zero pages.
> + *
> + * @migrate: migrate_vma arguments
> + * @addr: address where the folio will be inserted
> + * @folio: folio to be inserted at @addr
> + * @src: src pfn which is being migrated
> + * @pmdp: pointer to the pmd
> + */
> +static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
> + unsigned long addr,
> + struct page *page,
> + unsigned long *src,
> + pmd_t *pmdp)
> +{
> + struct vm_area_struct *vma = migrate->vma;
> + gfp_t gfp = vma_thp_gfp_mask(vma);
> + struct folio *folio = page_folio(page);
> + int ret;
> + spinlock_t *ptl;
> + pgtable_t pgtable;
> + pmd_t entry;
> + bool flush = false;
> + unsigned long i;
> +
> + VM_WARN_ON_FOLIO(!folio, folio);
> + VM_WARN_ON_ONCE(!pmd_none(*pmdp) && !is_huge_zero_pmd(*pmdp));
> +
> + if (!thp_vma_suitable_order(vma, addr, HPAGE_PMD_ORDER))
> + return -EINVAL;
> +
> + ret = anon_vma_prepare(vma);
> + if (ret)
> + return ret;
> +
> + folio_set_order(folio, HPAGE_PMD_ORDER);
> + folio_set_large_rmappable(folio);
> +
> + if (mem_cgroup_charge(folio, migrate->vma->vm_mm, gfp)) {
> + count_vm_event(THP_FAULT_FALLBACK);
> + count_mthp_stat(HPAGE_PMD_ORDER, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
> + ret = -ENOMEM;
> + goto abort;
> + }
> +
> + __folio_mark_uptodate(folio);
> +
> + pgtable = pte_alloc_one(vma->vm_mm);
> + if (unlikely(!pgtable))
> + goto abort;
> +
> + if (folio_is_device_private(folio)) {
> + swp_entry_t swp_entry;
> +
> + if (vma->vm_flags & VM_WRITE)
> + swp_entry = make_writable_device_private_entry(
> + page_to_pfn(page));
> + else
> + swp_entry = make_readable_device_private_entry(
> + page_to_pfn(page));
> + entry = swp_entry_to_pmd(swp_entry);
> + } else {
> + if (folio_is_zone_device(folio) &&
> + !folio_is_device_coherent(folio)) {
> + goto abort;
> + }
> + entry = folio_mk_pmd(folio, vma->vm_page_prot);
> + if (vma->vm_flags & VM_WRITE)
> + entry = pmd_mkwrite(pmd_mkdirty(entry), vma);
> + }
> +
> + ptl = pmd_lock(vma->vm_mm, pmdp);
> + ret = check_stable_address_space(vma->vm_mm);
> + if (ret)
> + goto abort;
> +
> + /*
> + * Check for userfaultfd but do not deliver the fault. Instead,
> + * just back off.
> + */
> + if (userfaultfd_missing(vma))
> + goto unlock_abort;
> +
> + if (!pmd_none(*pmdp)) {
> + if (!is_huge_zero_pmd(*pmdp))
> + goto unlock_abort;
> + flush = true;
> + } else if (!pmd_none(*pmdp))
> + goto unlock_abort;
> +
> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
> + folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE);
> + if (!folio_is_zone_device(folio))
> + folio_add_lru_vma(folio, vma);
> + folio_get(folio);
> +
> + if (flush) {
> + pte_free(vma->vm_mm, pgtable);
> + flush_cache_page(vma, addr, addr + HPAGE_PMD_SIZE);
> + pmdp_invalidate(vma, addr, pmdp);
> + } else {
> + pgtable_trans_huge_deposit(vma->vm_mm, pmdp, pgtable);
> + mm_inc_nr_ptes(vma->vm_mm);
> + }
> + set_pmd_at(vma->vm_mm, addr, pmdp, entry);
> + update_mmu_cache_pmd(vma, addr, pmdp);
> +
> + spin_unlock(ptl);
> +
> + count_vm_event(THP_FAULT_ALLOC);
> + count_mthp_stat(HPAGE_PMD_ORDER, MTHP_STAT_ANON_FAULT_ALLOC);
> + count_memcg_event_mm(vma->vm_mm, THP_FAULT_ALLOC);
> +
> + return 0;
> +
> +unlock_abort:
> + spin_unlock(ptl);
> +abort:
> + for (i = 0; i < HPAGE_PMD_NR; i++)
> + src[i] &= ~MIGRATE_PFN_MIGRATE;
> + return 0;
> +}
> +#else /* !CONFIG_ARCH_ENABLE_THP_MIGRATION */
> +static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
> + unsigned long addr,
> + struct page *page,
> + unsigned long *src,
> + pmd_t *pmdp)
> +{
> + return 0;
> +}
> +#endif
> +
> /*
> * This code closely matches the code in:
> * __handle_mm_fault()
> @@ -585,9 +834,10 @@ EXPORT_SYMBOL(migrate_vma_setup);
> */
> static void migrate_vma_insert_page(struct migrate_vma *migrate,
> unsigned long addr,
> - struct page *page,
> + unsigned long *dst,
> unsigned long *src)
> {
> + struct page *page = migrate_pfn_to_page(*dst);
> struct folio *folio = page_folio(page);
> struct vm_area_struct *vma = migrate->vma;
> struct mm_struct *mm = vma->vm_mm;
> @@ -615,8 +865,25 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
> pmdp = pmd_alloc(mm, pudp, addr);
> if (!pmdp)
> goto abort;
> - if (pmd_trans_huge(*pmdp))
> - goto abort;
> +
> + if (thp_migration_supported() && (*dst & MIGRATE_PFN_COMPOUND)) {
> + int ret = migrate_vma_insert_huge_pmd_page(migrate, addr, page,
> + src, pmdp);
> + if (ret)
> + goto abort;
> + return;
> + }
> +
> + if (!pmd_none(*pmdp)) {
> + if (pmd_trans_huge(*pmdp)) {
> + if (!is_huge_zero_pmd(*pmdp))
> + goto abort;
> + folio_get(pmd_folio(*pmdp));
> + split_huge_pmd(vma, pmdp, addr);
> + } else if (pmd_leaf(*pmdp))
> + goto abort;
> + }
> +
> if (pte_alloc(mm, pmdp))
> goto abort;
> if (unlikely(anon_vma_prepare(vma)))
> @@ -707,23 +974,24 @@ static void __migrate_device_pages(unsigned long *src_pfns,
> unsigned long i;
> bool notified = false;
>
> - for (i = 0; i < npages; i++) {
> + for (i = 0; i < npages; ) {
> struct page *newpage = migrate_pfn_to_page(dst_pfns[i]);
> struct page *page = migrate_pfn_to_page(src_pfns[i]);
> struct address_space *mapping;
> struct folio *newfolio, *folio;
> int r, extra_cnt = 0;
> + unsigned long nr = 1;
>
> if (!newpage) {
> src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
> - continue;
> + goto next;
> }
>
> if (!page) {
> unsigned long addr;
>
> if (!(src_pfns[i] & MIGRATE_PFN_MIGRATE))
> - continue;
> + goto next;
>
> /*
> * The only time there is no vma is when called from
> @@ -741,15 +1009,47 @@ static void __migrate_device_pages(unsigned long *src_pfns,
> migrate->pgmap_owner);
> mmu_notifier_invalidate_range_start(&range);
> }
> - migrate_vma_insert_page(migrate, addr, newpage,
> +
> + if ((src_pfns[i] & MIGRATE_PFN_COMPOUND) &&
> + (!(dst_pfns[i] & MIGRATE_PFN_COMPOUND))) {
> + nr = HPAGE_PMD_NR;
> + src_pfns[i] &= ~MIGRATE_PFN_COMPOUND;
> + src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
> + goto next;
> + }
> +
> + migrate_vma_insert_page(migrate, addr, &dst_pfns[i],
> &src_pfns[i]);
> - continue;
> + goto next;
> }
>
> newfolio = page_folio(newpage);
> folio = page_folio(page);
> mapping = folio_mapping(folio);
>
> + /*
> + * If THP migration is enabled, check if both src and dst
> + * can migrate large pages
> + */
> + if (thp_migration_supported()) {
> + if ((src_pfns[i] & MIGRATE_PFN_MIGRATE) &&
> + (src_pfns[i] & MIGRATE_PFN_COMPOUND) &&
> + !(dst_pfns[i] & MIGRATE_PFN_COMPOUND)) {
> +
> + if (!migrate) {
> + src_pfns[i] &= ~(MIGRATE_PFN_MIGRATE |
> + MIGRATE_PFN_COMPOUND);
> + goto next;
> + }
> + src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
> + } else if ((src_pfns[i] & MIGRATE_PFN_MIGRATE) &&
> + (dst_pfns[i] & MIGRATE_PFN_COMPOUND) &&
> + !(src_pfns[i] & MIGRATE_PFN_COMPOUND)) {
> + src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
> + }
> + }
> +
> +
> if (folio_is_device_private(newfolio) ||
> folio_is_device_coherent(newfolio)) {
> if (mapping) {
> @@ -762,7 +1062,7 @@ static void __migrate_device_pages(unsigned long *src_pfns,
> if (!folio_test_anon(folio) ||
> !folio_free_swap(folio)) {
> src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
> - continue;
> + goto next;
> }
> }
> } else if (folio_is_zone_device(newfolio)) {
> @@ -770,7 +1070,7 @@ static void __migrate_device_pages(unsigned long *src_pfns,
> * Other types of ZONE_DEVICE page are not supported.
> */
> src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
> - continue;
> + goto next;
> }
>
> BUG_ON(folio_test_writeback(folio));
> @@ -782,6 +1082,8 @@ static void __migrate_device_pages(unsigned long *src_pfns,
> src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
> else
> folio_migrate_flags(newfolio, folio);
> +next:
> + i += nr;
> }
>
> if (notified)
> @@ -943,10 +1245,23 @@ static unsigned long migrate_device_pfn_lock(unsigned long pfn)
> int migrate_device_range(unsigned long *src_pfns, unsigned long start,
> unsigned long npages)
> {
> - unsigned long i, pfn;
> + unsigned long i, j, pfn;
> +
> + for (pfn = start, i = 0; i < npages; pfn++, i++) {
> + struct page *page = pfn_to_page(pfn);
> + struct folio *folio = page_folio(page);
> + unsigned int nr = 1;
>
> - for (pfn = start, i = 0; i < npages; pfn++, i++)
> src_pfns[i] = migrate_device_pfn_lock(pfn);
> + nr = folio_nr_pages(folio);
> + if (nr > 1) {
> + src_pfns[i] |= MIGRATE_PFN_COMPOUND;
> + for (j = 1; j < nr; j++)
> + src_pfns[i+j] = 0;
> + i += j - 1;
> + pfn += j - 1;
> + }
> + }
>
> migrate_device_unmap(src_pfns, npages, NULL);
>
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 04/12] mm/migrate_device: THP migration of zone device pages
2025-07-18 6:59 ` Matthew Brost
@ 2025-07-18 7:04 ` Balbir Singh
2025-07-18 7:21 ` Matthew Brost
0 siblings, 1 reply; 99+ messages in thread
From: Balbir Singh @ 2025-07-18 7:04 UTC (permalink / raw)
To: Matthew Brost
Cc: linux-mm, akpm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
On 7/18/25 16:59, Matthew Brost wrote:
> On Fri, Jul 04, 2025 at 09:35:03AM +1000, Balbir Singh wrote:
>> + if (thp_migration_supported() &&
>> + (migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
>> + (IS_ALIGNED(start, HPAGE_PMD_SIZE) &&
>> + IS_ALIGNED(end, HPAGE_PMD_SIZE))) {
>> + migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE |
>> + MIGRATE_PFN_COMPOUND;
>> + migrate->dst[migrate->npages] = 0;
>> + migrate->npages++;
>> + migrate->cpages++;
>
> It's a bit unclear what cpages and npages actually represent when
> collecting a THP. In my opinion, they should reflect the total number of
> minimum sized pages collected—i.e., we should increment by the shifted
> order (512) here. I'm fairly certain the logic in migrate_device.c would
> break if a 4MB range was requested and a THP was found first, followed by a
> non-THP.
>
cpages and npages represent entries in the array and when or'ed with MIGRATE_PFN_COMPOUND
represent the right number of entries populated. If you have a test that shows
the breakage, I'd be keen to see it. We do populate other entries in 4k size(s) when
collecting to allow for a split of the folio.
Thanks for the review,
Balbir Singh
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 04/12] mm/migrate_device: THP migration of zone device pages
2025-07-18 7:04 ` Balbir Singh
@ 2025-07-18 7:21 ` Matthew Brost
2025-07-18 8:22 ` Matthew Brost
0 siblings, 1 reply; 99+ messages in thread
From: Matthew Brost @ 2025-07-18 7:21 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-mm, akpm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
On Fri, Jul 18, 2025 at 05:04:39PM +1000, Balbir Singh wrote:
> On 7/18/25 16:59, Matthew Brost wrote:
> > On Fri, Jul 04, 2025 at 09:35:03AM +1000, Balbir Singh wrote:
> >> + if (thp_migration_supported() &&
> >> + (migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
> >> + (IS_ALIGNED(start, HPAGE_PMD_SIZE) &&
> >> + IS_ALIGNED(end, HPAGE_PMD_SIZE))) {
> >> + migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE |
> >> + MIGRATE_PFN_COMPOUND;
> >> + migrate->dst[migrate->npages] = 0;
> >> + migrate->npages++;
> >> + migrate->cpages++;
> >
> > It's a bit unclear what cpages and npages actually represent when
> > collecting a THP. In my opinion, they should reflect the total number of
> > minimum sized pages collected—i.e., we should increment by the shifted
> > order (512) here. I'm fairly certain the logic in migrate_device.c would
> > break if a 4MB range was requested and a THP was found first, followed by a
> > non-THP.
> >
>
> cpages and npages represent entries in the array and when or'ed with MIGRATE_PFN_COMPOUND
> represent the right number of entries populated. If you have a test that shows
> the breakage, I'd be keen to see it. We do populate other entries in 4k size(s) when
> collecting to allow for a split of the folio.
>
I don’t have a test case, but let me quickly point out a logic bug.
Look at migrate_device_unmap. The variable i is incremented by
folio_nr_pages, which seems correct. However, in the earlier code, we
populate migrate->src using migrate->npages as the index, then increment
it by 1. So, if two THPs are found back to back, they’ll occupy entries
0 and 1, while migrate_device_unmap will access entries 0 and 512.
Given that we have no idea what mix of THP vs non-THP we’ll encounter,
the only sane approach is to populate the input array at minimum
page-entry alignment. Similarly, npages and cpages should reflect the
number of minimum-sized pages found, with the caller (and
migrate_device) understanding that src and dst will be sparsely
populated based on each entry’s folio order.
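For concreteness, a minimal sketch (illustrative only, not patch code) of
how migrate_device.c could then walk src[] under that convention, with a
compound folio occupying one populated entry followed by empty slots:

static void walk_src_array(unsigned long *src, unsigned long npages)
{
	unsigned long i;

	for (i = 0; i < npages; ) {
		struct page *page = migrate_pfn_to_page(src[i]);
		unsigned long nr = 1;

		if (page) {
			struct folio *folio = page_folio(page);

			/* one populated entry covers the whole folio */
			nr = folio_nr_pages(folio);
			if (src[i] & MIGRATE_PFN_MIGRATE) {
				/* act on the folio here */
			}
		}
		/* skip the (nr - 1) empty placeholder entries */
		i += nr;
	}
}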
Matt
> Thanks for the review,
> Balbir Singh
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 04/12] mm/migrate_device: THP migration of zone device pages
2025-07-18 7:21 ` Matthew Brost
@ 2025-07-18 8:22 ` Matthew Brost
2025-07-22 4:54 ` Matthew Brost
0 siblings, 1 reply; 99+ messages in thread
From: Matthew Brost @ 2025-07-18 8:22 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-mm, akpm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
On Fri, Jul 18, 2025 at 12:21:36AM -0700, Matthew Brost wrote:
> On Fri, Jul 18, 2025 at 05:04:39PM +1000, Balbir Singh wrote:
> > On 7/18/25 16:59, Matthew Brost wrote:
> > > On Fri, Jul 04, 2025 at 09:35:03AM +1000, Balbir Singh wrote:
> > >> + if (thp_migration_supported() &&
> > >> + (migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
> > >> + (IS_ALIGNED(start, HPAGE_PMD_SIZE) &&
> > >> + IS_ALIGNED(end, HPAGE_PMD_SIZE))) {
> > >> + migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE |
> > >> + MIGRATE_PFN_COMPOUND;
> > >> + migrate->dst[migrate->npages] = 0;
> > >> + migrate->npages++;
> > >> + migrate->cpages++;
> > >
> > > It's a bit unclear what cpages and npages actually represent when
> > > collecting a THP. In my opinion, they should reflect the total number of
> > > minimum sized pages collected—i.e., we should increment by the shifted
> > > order (512) here. I'm fairly certain the logic in migrate_device.c would
> > > break if a 4MB range was requested and a THP was found first, followed by a
> > > non-THP.
> > >
> >
> > cpages and npages represent entries in the array and when or'ed with MIGRATE_PFN_COMPOUND
> > represent the right number of entries populated. If you have a test that shows
> > the breakage, I'd be keen to see it. We do populate other entries in 4k size(s) when
> > collecting to allow for a split of the folio.
> >
>
> I don’t have a test case, but let me quickly point out a logic bug.
>
> Look at migrate_device_unmap. The variable i is incremented by
> folio_nr_pages, which seems correct. However, in the earlier code, we
> populate migrate->src using migrate->npages as the index, then increment
> it by 1. So, if two THPs are found back to back, they’ll occupy entries
> 0 and 1, while migrate_device_unmap will access entries 0 and 512.
>
> Given that we have no idea what mix of THP vs non-THP we’ll encounter,
> the only sane approach is to populate the input array at minimum
> page-entry alignment. Similarly, npages and cpages should reflect the
> number of minimum-sized pages found, with the caller (and
> migrate_device) understanding that src and dst will be sparsely
> populated based on each entry’s folio order.
>
I looked into this further and found another case where the logic breaks.
In __migrate_device_pages, the call to migrate_vma_split_pages assumes
that, based on the folio's order, it can populate the subsequent entries
after the split. This requires the source array to reflect the folio's
order at the index where the folio was found.
Here’s a summary of how I believe the migrate_vma_setup interface should
behave, assuming 4K pages and 2M THPs:
Example A: 4MB requested, 2 THPs found and unmapped
src[0]: folio, order 9, migrate flag set
src[1–511]: not present
src[512]: folio, order 9, migrate flag set
src[513–1023]: not present
npages = 1024, cpages = 1024
Example B: 4MB requested, 2 THPs found, first THP unmap fails
src[0]: folio, order 9, migrate flag clear
src[1–511]: not present
src[512]: folio, order 9, migrate flag set
src[513–1023]: not present
npages = 1024, cpages = 512
Example C: 4MB requested, 512 small pages + 1 THP found, some small pages fail to unmap
src[0–7]: folio, order 0, migrate flag clear
src[8–511]: folio, order 0, migrate flag set
src[512]: folio, order 9, migrate flag set
src[513–1023]: not present
npages = 1024, cpages = 1016
As I suggested in my previous reply to patch #2, this should be
documented—preferably in kernel-doc—so the final behavior is clear to
both migrate_device.c (and the structs in migrate.h) and the layers
above. I can help take a pass at writing kernel-doc for both, as the
behavior is fairly well defined even before your changes.
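Something along these lines (wording is only a sketch of the behavior in
Examples A-C above, not text from the series) could sit above struct
migrate_vma in include/linux/migrate.h:

/*
 * @src and @dst hold one migrate PFN entry per PAGE_SIZE page of the
 * requested range, so @npages always counts minimum-sized pages. When a
 * higher-order folio is collected, only the entry for its first page is
 * populated (with MIGRATE_PFN_COMPOUND set); the following
 * folio_nr_pages() - 1 entries are left empty and callers must walk the
 * arrays in folio-order-sized steps. @cpages counts the minimum-sized
 * pages successfully collected, including every page covered by a
 * compound entry.
 */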
Matt
> Matt
>
> > Thanks for the review,
> > Balbir Singh
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 08/12] mm/thp: add split during migration support
2025-07-18 3:33 ` Matthew Brost
@ 2025-07-18 15:06 ` Zi Yan
2025-07-23 0:00 ` Matthew Brost
0 siblings, 1 reply; 99+ messages in thread
From: Zi Yan @ 2025-07-18 15:06 UTC (permalink / raw)
To: Matthew Brost
Cc: Balbir Singh, linux-mm, akpm, linux-kernel, Karol Herbst,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
On 17 Jul 2025, at 23:33, Matthew Brost wrote:
> On Thu, Jul 17, 2025 at 09:25:02PM -0400, Zi Yan wrote:
>> On 17 Jul 2025, at 20:41, Matthew Brost wrote:
>>
>>> On Thu, Jul 17, 2025 at 07:04:48PM -0400, Zi Yan wrote:
>>>> On 17 Jul 2025, at 18:24, Matthew Brost wrote:
>>>>
>>>>> On Thu, Jul 17, 2025 at 07:53:40AM +1000, Balbir Singh wrote:
>>>>>> On 7/17/25 02:24, Matthew Brost wrote:
>>>>>>> On Wed, Jul 16, 2025 at 07:19:10AM -0400, Zi Yan wrote:
>>>>>>>> On 16 Jul 2025, at 1:34, Matthew Brost wrote:
>>>>>>>>
>>>>>>>>> On Sun, Jul 06, 2025 at 11:47:10AM +1000, Balbir Singh wrote:
>>>>>>>>>> On 7/6/25 11:34, Zi Yan wrote:
>>>>>>>>>>> On 5 Jul 2025, at 21:15, Balbir Singh wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On 7/5/25 11:55, Zi Yan wrote:
>>>>>>>>>>>>> On 4 Jul 2025, at 20:58, Balbir Singh wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 7/4/25 21:24, Zi Yan wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> s/pages/folio
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks, will make the changes
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Why name it isolated if the folio is unmapped? Isolated folios often mean
>>>>>>>>>>>>>>> they are removed from LRU lists. isolated here causes confusion.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Ack, will change the name
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> *
>>>>>>>>>>>>>>>> * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>>>>>>>>>>>>>>>> * It is in charge of checking whether the split is supported or not and
>>>>>>>>>>>>>>>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>> static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>>>>>>>> struct page *split_at, struct page *lock_at,
>>>>>>>>>>>>>>>> - struct list_head *list, bool uniform_split)
>>>>>>>>>>>>>>>> + struct list_head *list, bool uniform_split, bool isolated)
>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>>>>>>>>>>>>>> XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>>>>>>>>>>>>>>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>>>>>>>> * is taken to serialise against parallel split or collapse
>>>>>>>>>>>>>>>> * operations.
>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>> - anon_vma = folio_get_anon_vma(folio);
>>>>>>>>>>>>>>>> - if (!anon_vma) {
>>>>>>>>>>>>>>>> - ret = -EBUSY;
>>>>>>>>>>>>>>>> - goto out;
>>>>>>>>>>>>>>>> + if (!isolated) {
>>>>>>>>>>>>>>>> + anon_vma = folio_get_anon_vma(folio);
>>>>>>>>>>>>>>>> + if (!anon_vma) {
>>>>>>>>>>>>>>>> + ret = -EBUSY;
>>>>>>>>>>>>>>>> + goto out;
>>>>>>>>>>>>>>>> + }
>>>>>>>>>>>>>>>> + anon_vma_lock_write(anon_vma);
>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>> end = -1;
>>>>>>>>>>>>>>>> mapping = NULL;
>>>>>>>>>>>>>>>> - anon_vma_lock_write(anon_vma);
>>>>>>>>>>>>>>>> } else {
>>>>>>>>>>>>>>>> unsigned int min_order;
>>>>>>>>>>>>>>>> gfp_t gfp;
>>>>>>>>>>>>>>>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>>>>>>>> goto out_unlock;
>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> - unmap_folio(folio);
>>>>>>>>>>>>>>>> + if (!isolated)
>>>>>>>>>>>>>>>> + unmap_folio(folio);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> /* block interrupt reentry in xa_lock and spinlock */
>>>>>>>>>>>>>>>> local_irq_disable();
>>>>>>>>>>>>>>>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ret = __split_unmapped_folio(folio, new_order,
>>>>>>>>>>>>>>>> split_at, lock_at, list, end, &xas, mapping,
>>>>>>>>>>>>>>>> - uniform_split);
>>>>>>>>>>>>>>>> + uniform_split, isolated);
>>>>>>>>>>>>>>>> } else {
>>>>>>>>>>>>>>>> spin_unlock(&ds_queue->split_queue_lock);
>>>>>>>>>>>>>>>> fail:
>>>>>>>>>>>>>>>> if (mapping)
>>>>>>>>>>>>>>>> xas_unlock(&xas);
>>>>>>>>>>>>>>>> local_irq_enable();
>>>>>>>>>>>>>>>> - remap_page(folio, folio_nr_pages(folio), 0);
>>>>>>>>>>>>>>>> + if (!isolated)
>>>>>>>>>>>>>>>> + remap_page(folio, folio_nr_pages(folio), 0);
>>>>>>>>>>>>>>>> ret = -EAGAIN;
>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> These "isolated" special handlings does not look good, I wonder if there
>>>>>>>>>>>>>>> is a way of letting split code handle device private folios more gracefully.
>>>>>>>>>>>>>>> It also causes confusions, since why does "isolated/unmapped" folios
>>>>>>>>>>>>>>> not need to unmap_page(), remap_page(), or unlock?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> There are two reasons for going down the current code path
>>>>>>>>>>>>>
>>>>>>>>>>>>> After thinking more, I think adding isolated/unmapped is not the right
>>>>>>>>>>>>> way, since unmapped folio is a very generic concept. If you add it,
>>>>>>>>>>>>> one can easily misuse the folio split code by first unmapping a folio
>>>>>>>>>>>>> and trying to split it with unmapped = true. I do not think that is
>>>>>>>>>>>>> supported and your patch does not prevent that from happening in the future.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I don't understand the misuse case you mention, I assume you mean someone can
>>>>>>>>>>>> get the usage wrong? The responsibility is on the caller to do the right thing
>>>>>>>>>>>> if calling the API with unmapped
>>>>>>>>>>>
>>>>>>>>>>> Before your patch, there is no use case of splitting unmapped folios.
>>>>>>>>>>> Your patch only adds support for device private page split, not any unmapped
>>>>>>>>>>> folio split. So using a generic isolated/unmapped parameter is not OK.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> There is a use for splitting unmapped folios (see below)
>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> You should teach different parts of folio split code path to handle
>>>>>>>>>>>>> device private folios properly. Details are below.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. if the isolated check is not present, folio_get_anon_vma will fail and cause
>>>>>>>>>>>>>> the split routine to return with -EBUSY
>>>>>>>>>>>>>
>>>>>>>>>>>>> You do something below instead.
>>>>>>>>>>>>>
>>>>>>>>>>>>> if (!anon_vma && !folio_is_device_private(folio)) {
>>>>>>>>>>>>> ret = -EBUSY;
>>>>>>>>>>>>> goto out;
>>>>>>>>>>>>> } else if (anon_vma) {
>>>>>>>>>>>>> anon_vma_lock_write(anon_vma);
>>>>>>>>>>>>> }
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> folio_get_anon_vma() cannot be called for unmapped folios. In our case the page has
>>>>>>>>>>>> already been unmapped. Is there a reason why you mix anon_vma_lock_write with
>>>>>>>>>>>> the check for device private folios?
>>>>>>>>>>>
>>>>>>>>>>> Oh, I did not notice that anon_vma = folio_get_anon_vma(folio) is also
>>>>>>>>>>> in if (!isolated) branch. In that case, just do
>>>>>>>>>>>
>>>>>>>>>>> if (folio_is_device_private(folio) {
>>>>>>>>>>> ...
>>>>>>>>>>> } else if (is_anon) {
>>>>>>>>>>> ...
>>>>>>>>>>> } else {
>>>>>>>>>>> ...
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> People can know device private folio split needs a special handling.
>>>>>>>>>>>>>
>>>>>>>>>>>>> BTW, why a device private folio can also be anonymous? Does it mean
>>>>>>>>>>>>> if a page cache folio is migrated to device private, kernel also
>>>>>>>>>>>>> sees it as both device private and file-backed?
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> FYI: device private folios only work with anonymous private pages, hence
>>>>>>>>>>>> the name device private.
>>>>>>>>>>>
>>>>>>>>>>> OK.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2. Going through unmap_page(), remap_page() causes a full page table walk, which
>>>>>>>>>>>>>> the migrate_device API has already just done as a part of the migration. The
>>>>>>>>>>>>>> entries under consideration are already migration entries in this case.
>>>>>>>>>>>>>> This is wasteful and in some case unexpected.
>>>>>>>>>>>>>
>>>>>>>>>>>>> unmap_folio() already adds TTU_SPLIT_HUGE_PMD to try to split
>>>>>>>>>>>>> PMD mapping, which you did in migrate_vma_split_pages(). You probably
>>>>>>>>>>>>> can teach either try_to_migrate() or try_to_unmap() to just split
>>>>>>>>>>>>> device private PMD mapping. Or if that is not preferred,
>>>>>>>>>>>>> you can simply call split_huge_pmd_address() when unmap_folio()
>>>>>>>>>>>>> sees a device private folio.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For remap_page(), you can simply return for device private folios
>>>>>>>>>>>>> like it is currently doing for non anonymous folios.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Doing a full rmap walk does not make sense with unmap_folio() and
>>>>>>>>>>>> remap_folio(), because
>>>>>>>>>>>>
>>>>>>>>>>>> 1. We need to do a page table walk/rmap walk again
>>>>>>>>>>>> 2. We'll need special handling of migration <-> migration entries
>>>>>>>>>>>> in the rmap handling (set/remove migration ptes)
>>>>>>>>>>>> 3. In this context, the code is already in the middle of migration,
>>>>>>>>>>>> so trying to do that again does not make sense.
>>>>>>>>>>>
>>>>>>>>>>> Why doing split in the middle of migration? Existing split code
>>>>>>>>>>> assumes to-be-split folios are mapped.
>>>>>>>>>>>
>>>>>>>>>>> What prevents doing split before migration?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The code does do a split prior to migration if THP selection fails
>>>>>>>>>>
>>>>>>>>>> Please see https://lore.kernel.org/lkml/20250703233511.2028395-5-balbirs@nvidia.com/
>>>>>>>>>> and the fallback part which calls split_folio()
>>>>>>>>>>
>>>>>>>>>> But the case under consideration is special since the device needs to allocate
>>>>>>>>>> corresponding pfn's as well. The changelog mentions it:
>>>>>>>>>>
>>>>>>>>>> "The common case that arises is that after setup, during migrate
>>>>>>>>>> the destination might not be able to allocate MIGRATE_PFN_COMPOUND
>>>>>>>>>> pages."
>>>>>>>>>>
>>>>>>>>>> I can expand on it, because migrate_vma() is a multi-phase operation
>>>>>>>>>>
>>>>>>>>>> 1. migrate_vma_setup()
>>>>>>>>>> 2. migrate_vma_pages()
>>>>>>>>>> 3. migrate_vma_finalize()
>>>>>>>>>>
>>>>>>>>>> It can so happen that when we get the destination pfn's allocated the destination
>>>>>>>>>> might not be able to allocate a large page, so we do the split in migrate_vma_pages().
>>>>>>>>>>
>>>>>>>>>> The pages have been unmapped and collected in migrate_vma_setup()
>>>>>>>>>>
>>>>>>>>>> The next patch in the series 9/12 (https://lore.kernel.org/lkml/20250703233511.2028395-10-balbirs@nvidia.com/)
>>>>>>>>>> tests the split and emulates a failure on the device side to allocate large pages
>>>>>>>>>> and tests it in 10/12 (https://lore.kernel.org/lkml/20250703233511.2028395-11-balbirs@nvidia.com/)
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Another use case I’ve seen is when a previously allocated high-order
>>>>>>>>> folio, now in the free memory pool, is reallocated as a lower-order
>>>>>>>>> page. For example, a 2MB fault allocates a folio, the memory is later
>>>>>>>>
>>>>>>>> That is different. If the high-order folio is free, it should be split
>>>>>>>> using split_page() from mm/page_alloc.c.
>>>>>>>>
>>>>>>>
>>>>>>> Ah, ok. Let me see if that works - it would easier.
>>>>>>>
>>>>>
>>>>> This suggestion quickly blows up as PageCompound is true and page_count
>>>>> here is zero.
>>>>
>>>> OK, your folio has PageCompound set. Then you will need __split_unmapped_folio().
>>>>
>>>>>
>>>>>>>>> freed, and then a 4KB fault reuses a page from that previously allocated
>>>>>>>>> folio. This will be actually quite common in Xe / GPU SVM. In such
>>>>>>>>> cases, the folio in an unmapped state needs to be split. I’d suggest a
>>>>>>>>
>>>>>>>> This folio is unused, so ->flags, ->mapping, and etc. are not set,
>>>>>>>> __split_unmapped_folio() is not for it, unless you mean free folio
>>>>>>>> differently.
>>>>>>>>
>>>>>>>
>>>>>>> This is right, those fields should be clear.
>>>>>>>
>>>>>>> Thanks for the tip.
>>>>>>>
>>>>>> I was hoping to reuse __split_folio_to_order() at some point in the future
>>>>>> to split the backing pages in the driver, but it is not an immediate priority
>>>>>>
>>>>>
>>>>> I think we need something for the scenario I describe here. I was able to
>>>>> make __split_huge_page_to_list_to_order work with a couple of hacks, but it is
>>>>> almost certainly not right, as Zi pointed out.
>>>>>
>>>>> New to the MM stuff, but I'll play around with this a bit and see if I can
>>>>> come up with something that will work here.
>>>>
>>>> Can you try to write a new split_page function with __split_unmapped_folio()?
>>>> Since based on your description, your folio is not mapped.
>>>>
>>>
>>> Yes, page->mapping is NULL in this case - that was part of the hacks to
>>> __split_huge_page_to_list_to_order (more specially __folio_split) I had
>>> to make in order to get something working for this case.
>>>
>>> I can try out something based on __split_unmapped_folio and report back.
>>
>> mm-new tree has an updated __split_unmapped_folio() version; it moves
>> all unmap-irrelevant code out of __split_unmapped_folio(). You might find
>> it easier to reuse.
>>
>> See: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/tree/mm/huge_memory.c?h=mm-new#n3430
>>
>
> Will take a look. It is possible some of the issues we are hitting are
> due to working on drm-tip + pulling in core MM patches in this series on
> top of that branch, then missing some other patches in mm-new. I'll see
> if we can figure out a workflow to have the latest and greatest from
> both drm-tip and the MM branches.
>
> Will these changes be in 6.17?
Hopefully yes. mm patches usually go from mm-new to mm-unstable
to mm-stable to mainline. If not, we will figure it out. :)
>
>> I am about to update the code with v4 patches. I will cc you, so that
>> you can get the updated __split_unmapped_folio().
>>
>> Feel free to ask questions on folio split code.
>>
>
> Thanks.
>
> Matt
>
>> Best Regards,
>> Yan, Zi
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 00/12] THP support for zone device page migration
2025-07-18 3:57 ` Balbir Singh
2025-07-18 4:57 ` Matthew Brost
@ 2025-07-19 0:53 ` Matthew Brost
2025-07-21 11:42 ` Francois Dugast
2 siblings, 0 replies; 99+ messages in thread
From: Matthew Brost @ 2025-07-19 0:53 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-mm, akpm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
On Fri, Jul 18, 2025 at 01:57:13PM +1000, Balbir Singh wrote:
> On 7/18/25 09:40, Matthew Brost wrote:
> > On Fri, Jul 04, 2025 at 09:34:59AM +1000, Balbir Singh wrote:
> ...
> >>
> >> The nouveau dmem code has been enhanced to use the new THP migration
> >> capability.
> >>
> >> Feedback from the RFC [2]:
> >>
> >
> > Thanks for the patches, results look very promising. I wanted to give
> > some quick feedback:
> >
>
> Are you seeing improvements with the patchset?
>
> > - You appear to have missed updating hmm_range_fault, specifically
> > hmm_vma_handle_pmd, to check for device-private entries and populate the
> > HMM PFNs accordingly. My colleague François has a fix for this if you're
> > interested.
> >
>
> Sure, please feel free to post them.
>
> > - I believe copy_huge_pmd also needs to be updated to avoid installing a
> > migration entry if the swap entry is device-private. I don't have an
> > exact fix yet due to my limited experience with core MM. The test case
> > that triggers this is fairly simple: fault in a 2MB device page on the
> > GPU, then fork a process that reads the page — the kernel crashes in
> > this scenario.
> >
>
> I'd be happy to look at any traces you have or post any fixes you have
>
Ok, I think I have some code that works after slowly reverse-engineering
the core MM code - my test case passes without any warnings / kernel
crashes.
I've included it below. Feel free to include it in your next revision,
modify it as you see fit, or do whatever you like with it.
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2b2563f35544..1cd6d9a10657 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1773,17 +1773,46 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
swp_entry_t entry = pmd_to_swp_entry(pmd);
VM_BUG_ON(!is_pmd_migration_entry(pmd) &&
- !is_device_private_entry(entry));
- if (!is_readable_migration_entry(entry)) {
- entry = make_readable_migration_entry(
- swp_offset(entry));
+ !is_device_private_entry(entry));
+
+ if (!is_device_private_entry(entry) &&
+ !is_readable_migration_entry(entry)) {
+ entry = make_readable_migration_entry(swp_offset(entry));
pmd = swp_entry_to_pmd(entry);
if (pmd_swp_soft_dirty(*src_pmd))
pmd = pmd_swp_mksoft_dirty(pmd);
if (pmd_swp_uffd_wp(*src_pmd))
pmd = pmd_swp_mkuffd_wp(pmd);
set_pmd_at(src_mm, addr, src_pmd, pmd);
+ } else if (is_device_private_entry(entry)) {
+ if (is_writable_device_private_entry(entry)) {
+ entry = make_readable_device_private_entry(swp_offset(entry));
+
+ pmd = swp_entry_to_pmd(entry);
+ if (pmd_swp_soft_dirty(*src_pmd))
+ pmd = pmd_swp_mksoft_dirty(pmd);
+ if (pmd_swp_uffd_wp(*src_pmd))
+ pmd = pmd_swp_mkuffd_wp(pmd);
+ set_pmd_at(src_mm, addr, src_pmd, pmd);
+ }
+
+ src_page = pfn_swap_entry_to_page(entry);
+ VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
+ src_folio = page_folio(src_page);
+
+ folio_get(src_folio);
+ if (unlikely(folio_try_dup_anon_rmap_pmd(src_folio, src_page,
+ dst_vma, src_vma))) {
+ /* Page maybe pinned: split and retry the fault on PTEs. */
+ folio_put(src_folio);
+ pte_free(dst_mm, pgtable);
+ spin_unlock(src_ptl);
+ spin_unlock(dst_ptl);
+ __split_huge_pmd(src_vma, src_pmd, addr, false);
+ return -EAGAIN;
+ }
}
+
add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
mm_inc_nr_ptes(dst_mm);
pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
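For reference, the shape of that test is roughly the following (a sketch
only, not the actual test code; migrate_to_device() is a placeholder for
the driver- or test-harness-specific call and not a real API, and 2MB
alignment of the mapping is omitted for brevity):

#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define SZ_2M	(2UL << 20)

void migrate_to_device(void *addr, size_t len);	/* hypothetical helper */

int main(void)
{
	char *buf;
	pid_t pid;

	buf = mmap(NULL, SZ_2M, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;
	madvise(buf, SZ_2M, MADV_HUGEPAGE);
	memset(buf, 1, SZ_2M);		/* fault the range in, ideally as a THP */

	migrate_to_device(buf, SZ_2M);	/* region now backed by device-private memory */

	pid = fork();			/* copy_huge_pmd() runs on the device-private PMD */
	if (pid == 0) {
		volatile char c = buf[0];	/* child read faults the folio back */

		(void)c;
		_exit(0);
	}
	waitpid(pid, NULL, 0);
	munmap(buf, SZ_2M);
	return 0;
}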
Matt
> Thanks for the feedback
> Balbir Singh
^ permalink raw reply related [flat|nested] 99+ messages in thread
* Re: [v1 resend 04/12] mm/migrate_device: THP migration of zone device pages
2025-07-03 23:35 ` [v1 resend 04/12] mm/migrate_device: THP migration of zone device pages Balbir Singh
2025-07-04 15:35 ` kernel test robot
2025-07-18 6:59 ` Matthew Brost
@ 2025-07-19 2:10 ` Matthew Brost
2 siblings, 0 replies; 99+ messages in thread
From: Matthew Brost @ 2025-07-19 2:10 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-mm, akpm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
On Fri, Jul 04, 2025 at 09:35:03AM +1000, Balbir Singh wrote:
> migrate_device code paths go through the collect, setup
> and finalize phases of migration. Support for MIGRATE_PFN_COMPOUND
> was added earlier in the series to mark THP pages as
> MIGRATE_PFN_COMPOUND.
>
> The entries in src and dst arrays passed to these functions still
> remain at a PAGE_SIZE granularity. When a compound page is passed,
> the first entry has the PFN along with MIGRATE_PFN_COMPOUND
> and other flags set (MIGRATE_PFN_MIGRATE, MIGRATE_PFN_VALID), the
> remaining entries (HPAGE_PMD_NR - 1) are filled with 0's. This
> representation allows for the compound page to be split into smaller
> page sizes.
>
> migrate_vma_collect_hole(), migrate_vma_collect_pmd() are now THP
> page aware. Two new helper functions migrate_vma_collect_huge_pmd()
> and migrate_vma_insert_huge_pmd_page() have been added.
>
> migrate_vma_collect_huge_pmd() can collect THP pages, but if for
> some reason this fails, there is fallback support to split the folio
> and migrate it.
>
> migrate_vma_insert_huge_pmd_page() closely follows the logic of
> migrate_vma_insert_page()
>
> Support for splitting pages as needed for migration will follow in
> later patches in this series.
>
> Cc: Karol Herbst <kherbst@redhat.com>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
> Cc: Jane Chu <jane.chu@oracle.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Donet Tom <donettom@linux.ibm.com>
>
> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> ---
> mm/migrate_device.c | 437 +++++++++++++++++++++++++++++++++++++-------
> 1 file changed, 376 insertions(+), 61 deletions(-)
>
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index e05e14d6eacd..41d0bd787969 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -14,6 +14,7 @@
> #include <linux/pagewalk.h>
> #include <linux/rmap.h>
> #include <linux/swapops.h>
> +#include <asm/pgalloc.h>
> #include <asm/tlbflush.h>
> #include "internal.h"
>
> @@ -44,6 +45,23 @@ static int migrate_vma_collect_hole(unsigned long start,
> if (!vma_is_anonymous(walk->vma))
> return migrate_vma_collect_skip(start, end, walk);
>
> + if (thp_migration_supported() &&
> + (migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
> + (IS_ALIGNED(start, HPAGE_PMD_SIZE) &&
> + IS_ALIGNED(end, HPAGE_PMD_SIZE))) {
> + migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE |
> + MIGRATE_PFN_COMPOUND;
> + migrate->dst[migrate->npages] = 0;
> + migrate->npages++;
> + migrate->cpages++;
> +
> + /*
> + * Collect the remaining entries as holes, in case we
> + * need to split later
> + */
> + return migrate_vma_collect_skip(start + PAGE_SIZE, end, walk);
> + }
> +
> for (addr = start; addr < end; addr += PAGE_SIZE) {
> migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE;
> migrate->dst[migrate->npages] = 0;
> @@ -54,57 +72,148 @@ static int migrate_vma_collect_hole(unsigned long start,
> return 0;
> }
>
> -static int migrate_vma_collect_pmd(pmd_t *pmdp,
> - unsigned long start,
> - unsigned long end,
> - struct mm_walk *walk)
> +/**
> + * migrate_vma_collect_huge_pmd - collect THP pages without splitting the
> + * folio for device private pages.
> + * @pmdp: pointer to pmd entry
> + * @start: start address of the range for migration
> + * @end: end address of the range for migration
> + * @walk: mm_walk callback structure
> + *
> + * Collect the huge pmd entry at @pmdp for migration and set the
> + * MIGRATE_PFN_COMPOUND flag in the migrate src entry to indicate that
> + * migration will occur at HPAGE_PMD granularity
> + */
> +static int migrate_vma_collect_huge_pmd(pmd_t *pmdp, unsigned long start,
> + unsigned long end, struct mm_walk *walk,
> + struct folio *fault_folio)
> {
> + struct mm_struct *mm = walk->mm;
> + struct folio *folio;
> struct migrate_vma *migrate = walk->private;
> - struct folio *fault_folio = migrate->fault_page ?
> - page_folio(migrate->fault_page) : NULL;
> - struct vm_area_struct *vma = walk->vma;
> - struct mm_struct *mm = vma->vm_mm;
> - unsigned long addr = start, unmapped = 0;
> spinlock_t *ptl;
> - pte_t *ptep;
> + swp_entry_t entry;
> + int ret;
> + unsigned long write = 0;
>
> -again:
> - if (pmd_none(*pmdp))
> + ptl = pmd_lock(mm, pmdp);
> + if (pmd_none(*pmdp)) {
> + spin_unlock(ptl);
> return migrate_vma_collect_hole(start, end, -1, walk);
> + }
>
> if (pmd_trans_huge(*pmdp)) {
> - struct folio *folio;
> -
> - ptl = pmd_lock(mm, pmdp);
> - if (unlikely(!pmd_trans_huge(*pmdp))) {
> + if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)) {
> spin_unlock(ptl);
> - goto again;
> + return migrate_vma_collect_skip(start, end, walk);
> }
>
> folio = pmd_folio(*pmdp);
> if (is_huge_zero_folio(folio)) {
> spin_unlock(ptl);
> - split_huge_pmd(vma, pmdp, addr);
> - } else {
> - int ret;
> + return migrate_vma_collect_hole(start, end, -1, walk);
> + }
> + if (pmd_write(*pmdp))
> + write = MIGRATE_PFN_WRITE;
> + } else if (!pmd_present(*pmdp)) {
> + entry = pmd_to_swp_entry(*pmdp);
> + folio = pfn_swap_entry_folio(entry);
> +
> + if (!is_device_private_entry(entry) ||
> + !(migrate->flags & MIGRATE_VMA_SELECT_DEVICE_PRIVATE) ||
> + (folio->pgmap->owner != migrate->pgmap_owner)) {
> + spin_unlock(ptl);
> + return migrate_vma_collect_skip(start, end, walk);
> + }
>
> - folio_get(folio);
> + if (is_migration_entry(entry)) {
> + migration_entry_wait_on_locked(entry, ptl);
> spin_unlock(ptl);
> - /* FIXME: we don't expect THP for fault_folio */
> - if (WARN_ON_ONCE(fault_folio == folio))
> - return migrate_vma_collect_skip(start, end,
> - walk);
> - if (unlikely(!folio_trylock(folio)))
> - return migrate_vma_collect_skip(start, end,
> - walk);
> - ret = split_folio(folio);
> - if (fault_folio != folio)
> - folio_unlock(folio);
> - folio_put(folio);
> - if (ret)
> - return migrate_vma_collect_skip(start, end,
> - walk);
> + return -EAGAIN;
> }
> +
> + if (is_writable_device_private_entry(entry))
> + write = MIGRATE_PFN_WRITE;
> + } else {
> + spin_unlock(ptl);
> + return -EAGAIN;
> + }
> +
> + folio_get(folio);
> + if (folio != fault_folio && unlikely(!folio_trylock(folio))) {
> + spin_unlock(ptl);
> + folio_put(folio);
> + return migrate_vma_collect_skip(start, end, walk);
> + }
> +
> + if (thp_migration_supported() &&
> + (migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
> + (IS_ALIGNED(start, HPAGE_PMD_SIZE) &&
> + IS_ALIGNED(end, HPAGE_PMD_SIZE))) {
> +
> + struct page_vma_mapped_walk pvmw = {
> + .ptl = ptl,
> + .address = start,
> + .pmd = pmdp,
> + .vma = walk->vma,
> + };
> +
> + unsigned long pfn = page_to_pfn(folio_page(folio, 0));
> +
> + migrate->src[migrate->npages] = migrate_pfn(pfn) | write
> + | MIGRATE_PFN_MIGRATE
> + | MIGRATE_PFN_COMPOUND;
> + migrate->dst[migrate->npages++] = 0;
> + migrate->cpages++;
> + ret = set_pmd_migration_entry(&pvmw, folio_page(folio, 0));
> + if (ret) {
> + migrate->npages--;
> + migrate->cpages--;
> + migrate->src[migrate->npages] = 0;
> + migrate->dst[migrate->npages] = 0;
> + goto fallback;
> + }
> + migrate_vma_collect_skip(start + PAGE_SIZE, end, walk);
> + spin_unlock(ptl);
> + return 0;
> + }
> +
> +fallback:
> + spin_unlock(ptl);
> + ret = split_folio(folio);
> + if (fault_folio != folio)
> + folio_unlock(folio);
> + folio_put(folio);
> + if (ret)
> + return migrate_vma_collect_skip(start, end, walk);
> + if (pmd_none(pmdp_get_lockless(pmdp)))
> + return migrate_vma_collect_hole(start, end, -1, walk);
> +
> + return -ENOENT;
> +}
> +
> +static int migrate_vma_collect_pmd(pmd_t *pmdp,
> + unsigned long start,
> + unsigned long end,
> + struct mm_walk *walk)
> +{
> + struct migrate_vma *migrate = walk->private;
> + struct vm_area_struct *vma = walk->vma;
> + struct mm_struct *mm = vma->vm_mm;
> + unsigned long addr = start, unmapped = 0;
> + spinlock_t *ptl;
> + struct folio *fault_folio = migrate->fault_page ?
> + page_folio(migrate->fault_page) : NULL;
> + pte_t *ptep;
> +
> +again:
> + if (pmd_trans_huge(*pmdp) || !pmd_present(*pmdp)) {
> + int ret = migrate_vma_collect_huge_pmd(pmdp, start, end, walk, fault_folio);
> +
> + if (ret == -EAGAIN)
> + goto again;
> + if (ret == 0)
> + return 0;
> }
>
> ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> @@ -175,8 +284,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
> }
>
> - /* FIXME support THP */
> - if (!page || !page->mapping || PageTransCompound(page)) {
> + if (!page || !page->mapping) {
> mpfn = 0;
> goto next;
> }
> @@ -347,14 +455,6 @@ static bool migrate_vma_check_page(struct page *page, struct page *fault_page)
> */
> int extra = 1 + (page == fault_page);
>
> - /*
> - * FIXME support THP (transparent huge page), it is bit more complex to
> - * check them than regular pages, because they can be mapped with a pmd
> - * or with a pte (split pte mapping).
> - */
> - if (folio_test_large(folio))
> - return false;
> -
> /* Page from ZONE_DEVICE have one extra reference */
> if (folio_is_zone_device(folio))
> extra++;
> @@ -385,17 +485,24 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
>
> lru_add_drain();
>
> - for (i = 0; i < npages; i++) {
> + for (i = 0; i < npages; ) {
> struct page *page = migrate_pfn_to_page(src_pfns[i]);
> struct folio *folio;
> + unsigned int nr = 1;
>
> if (!page) {
> if (src_pfns[i] & MIGRATE_PFN_MIGRATE)
> unmapped++;
> - continue;
> + goto next;
> }
>
> folio = page_folio(page);
> + nr = folio_nr_pages(folio);
> +
> + if (nr > 1)
> + src_pfns[i] |= MIGRATE_PFN_COMPOUND;
> +
> +
> /* ZONE_DEVICE folios are not on LRU */
> if (!folio_is_zone_device(folio)) {
> if (!folio_test_lru(folio) && allow_drain) {
> @@ -407,7 +514,7 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
> if (!folio_isolate_lru(folio)) {
> src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
> restore++;
> - continue;
> + goto next;
> }
>
> /* Drop the reference we took in collect */
> @@ -426,10 +533,12 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
>
> src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
> restore++;
> - continue;
> + goto next;
> }
>
> unmapped++;
> +next:
> + i += nr;
> }
>
> for (i = 0; i < npages && restore; i++) {
> @@ -575,6 +684,146 @@ int migrate_vma_setup(struct migrate_vma *args)
> }
> EXPORT_SYMBOL(migrate_vma_setup);
>
> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> +/**
> + * migrate_vma_insert_huge_pmd_page: Insert a huge folio into @migrate->vma->vm_mm
> + * at @addr. folio is already allocated as a part of the migration process with
> + * large page.
> + *
> + * @folio needs to be initialized and setup after it's allocated. The code bits
> + * here follow closely the code in __do_huge_pmd_anonymous_page(). This API does
> + * not support THP zero pages.
> + *
> + * @migrate: migrate_vma arguments
> + * @addr: address where the folio will be inserted
> + * @folio: folio to be inserted at @addr
> + * @src: src pfn which is being migrated
> + * @pmdp: pointer to the pmd
> + */
> +static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
> + unsigned long addr,
> + struct page *page,
> + unsigned long *src,
> + pmd_t *pmdp)
> +{
> + struct vm_area_struct *vma = migrate->vma;
> + gfp_t gfp = vma_thp_gfp_mask(vma);
> + struct folio *folio = page_folio(page);
> + int ret;
> + spinlock_t *ptl;
> + pgtable_t pgtable;
> + pmd_t entry;
> + bool flush = false;
> + unsigned long i;
> +
> + VM_WARN_ON_FOLIO(!folio, folio);
> + VM_WARN_ON_ONCE(!pmd_none(*pmdp) && !is_huge_zero_pmd(*pmdp));
> +
> + if (!thp_vma_suitable_order(vma, addr, HPAGE_PMD_ORDER))
> + return -EINVAL;
> +
> + ret = anon_vma_prepare(vma);
> + if (ret)
> + return ret;
> +
> + folio_set_order(folio, HPAGE_PMD_ORDER);
> + folio_set_large_rmappable(folio);
> +
> + if (mem_cgroup_charge(folio, migrate->vma->vm_mm, gfp)) {
> + count_vm_event(THP_FAULT_FALLBACK);
> + count_mthp_stat(HPAGE_PMD_ORDER, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
> + ret = -ENOMEM;
> + goto abort;
> + }
> +
> + __folio_mark_uptodate(folio);
> +
> + pgtable = pte_alloc_one(vma->vm_mm);
> + if (unlikely(!pgtable))
> + goto abort;
> +
> + if (folio_is_device_private(folio)) {
> + swp_entry_t swp_entry;
> +
> + if (vma->vm_flags & VM_WRITE)
> + swp_entry = make_writable_device_private_entry(
> + page_to_pfn(page));
> + else
> + swp_entry = make_readable_device_private_entry(
> + page_to_pfn(page));
> + entry = swp_entry_to_pmd(swp_entry);
> + } else {
> + if (folio_is_zone_device(folio) &&
> + !folio_is_device_coherent(folio)) {
> + goto abort;
> + }
> + entry = folio_mk_pmd(folio, vma->vm_page_prot);
> + if (vma->vm_flags & VM_WRITE)
> + entry = pmd_mkwrite(pmd_mkdirty(entry), vma);
> + }
> +
> + ptl = pmd_lock(vma->vm_mm, pmdp);
> + ret = check_stable_address_space(vma->vm_mm);
> + if (ret)
> + goto abort;
> +
> + /*
> + * Check for userfaultfd but do not deliver the fault. Instead,
> + * just back off.
> + */
> + if (userfaultfd_missing(vma))
> + goto unlock_abort;
> +
> + if (!pmd_none(*pmdp)) {
> + if (!is_huge_zero_pmd(*pmdp))
> + goto unlock_abort;
> + flush = true;
> + } else if (!pmd_none(*pmdp))
> + goto unlock_abort;
> +
> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
> + folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE);
> + if (!folio_is_zone_device(folio))
> + folio_add_lru_vma(folio, vma);
> + folio_get(folio);
> +
> + if (flush) {
> + pte_free(vma->vm_mm, pgtable);
> + flush_cache_page(vma, addr, addr + HPAGE_PMD_SIZE);
> + pmdp_invalidate(vma, addr, pmdp);
> + } else {
> + pgtable_trans_huge_deposit(vma->vm_mm, pmdp, pgtable);
> + mm_inc_nr_ptes(vma->vm_mm);
> + }
> + set_pmd_at(vma->vm_mm, addr, pmdp, entry);
> + update_mmu_cache_pmd(vma, addr, pmdp);
> +
> + spin_unlock(ptl);
> +
> + count_vm_event(THP_FAULT_ALLOC);
> + count_mthp_stat(HPAGE_PMD_ORDER, MTHP_STAT_ANON_FAULT_ALLOC);
> + count_memcg_event_mm(vma->vm_mm, THP_FAULT_ALLOC);
> +
> + return 0;
> +
> +unlock_abort:
> + spin_unlock(ptl);
> +abort:
> + for (i = 0; i < HPAGE_PMD_NR; i++)
> + src[i] &= ~MIGRATE_PFN_MIGRATE;
> + return 0;
> +}
> +#else /* !CONFIG_ARCH_ENABLE_THP_MIGRATION */
> +static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
> + unsigned long addr,
> + struct page *page,
> + unsigned long *src,
> + pmd_t *pmdp)
> +{
> + return 0;
> +}
> +#endif
> +
> /*
> * This code closely matches the code in:
> * __handle_mm_fault()
> @@ -585,9 +834,10 @@ EXPORT_SYMBOL(migrate_vma_setup);
> */
> static void migrate_vma_insert_page(struct migrate_vma *migrate,
> unsigned long addr,
> - struct page *page,
> + unsigned long *dst,
> unsigned long *src)
> {
> + struct page *page = migrate_pfn_to_page(*dst);
> struct folio *folio = page_folio(page);
> struct vm_area_struct *vma = migrate->vma;
> struct mm_struct *mm = vma->vm_mm;
> @@ -615,8 +865,25 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
> pmdp = pmd_alloc(mm, pudp, addr);
> if (!pmdp)
> goto abort;
> - if (pmd_trans_huge(*pmdp))
> - goto abort;
> +
> + if (thp_migration_supported() && (*dst & MIGRATE_PFN_COMPOUND)) {
> + int ret = migrate_vma_insert_huge_pmd_page(migrate, addr, page,
> + src, pmdp);
> + if (ret)
> + goto abort;
> + return;
> + }
> +
> + if (!pmd_none(*pmdp)) {
> + if (pmd_trans_huge(*pmdp)) {
> + if (!is_huge_zero_pmd(*pmdp))
> + goto abort;
> + folio_get(pmd_folio(*pmdp));
> + split_huge_pmd(vma, pmdp, addr);
> + } else if (pmd_leaf(*pmdp))
> + goto abort;
> + }
> +
> if (pte_alloc(mm, pmdp))
> goto abort;
> if (unlikely(anon_vma_prepare(vma)))
> @@ -707,23 +974,24 @@ static void __migrate_device_pages(unsigned long *src_pfns,
> unsigned long i;
> bool notified = false;
>
> - for (i = 0; i < npages; i++) {
> + for (i = 0; i < npages; ) {
> struct page *newpage = migrate_pfn_to_page(dst_pfns[i]);
> struct page *page = migrate_pfn_to_page(src_pfns[i]);
> struct address_space *mapping;
> struct folio *newfolio, *folio;
> int r, extra_cnt = 0;
> + unsigned long nr = 1;
>
> if (!newpage) {
> src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
> - continue;
> + goto next;
> }
>
> if (!page) {
> unsigned long addr;
>
> if (!(src_pfns[i] & MIGRATE_PFN_MIGRATE))
> - continue;
> + goto next;
>
> /*
> * The only time there is no vma is when called from
> @@ -741,15 +1009,47 @@ static void __migrate_device_pages(unsigned long *src_pfns,
> migrate->pgmap_owner);
> mmu_notifier_invalidate_range_start(&range);
> }
> - migrate_vma_insert_page(migrate, addr, newpage,
> +
> + if ((src_pfns[i] & MIGRATE_PFN_COMPOUND) &&
> + (!(dst_pfns[i] & MIGRATE_PFN_COMPOUND))) {
> + nr = HPAGE_PMD_NR;
> + src_pfns[i] &= ~MIGRATE_PFN_COMPOUND;
> + src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
> + goto next;
> + }
> +
> + migrate_vma_insert_page(migrate, addr, &dst_pfns[i],
> &src_pfns[i]);
> - continue;
> + goto next;
> }
>
> newfolio = page_folio(newpage);
> folio = page_folio(page);
> mapping = folio_mapping(folio);
>
> + /*
> + * If THP migration is enabled, check if both src and dst
> + * can migrate large pages
> + */
> + if (thp_migration_supported()) {
> + if ((src_pfns[i] & MIGRATE_PFN_MIGRATE) &&
> + (src_pfns[i] & MIGRATE_PFN_COMPOUND) &&
> + !(dst_pfns[i] & MIGRATE_PFN_COMPOUND)) {
> +
> + if (!migrate) {
> + src_pfns[i] &= ~(MIGRATE_PFN_MIGRATE |
> + MIGRATE_PFN_COMPOUND);
> + goto next;
> + }
> + src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
> + } else if ((src_pfns[i] & MIGRATE_PFN_MIGRATE) &&
> + (dst_pfns[i] & MIGRATE_PFN_COMPOUND) &&
> + !(src_pfns[i] & MIGRATE_PFN_COMPOUND)) {
> + src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
> + }
> + }
> +
> +
> if (folio_is_device_private(newfolio) ||
> folio_is_device_coherent(newfolio)) {
> if (mapping) {
> @@ -762,7 +1062,7 @@ static void __migrate_device_pages(unsigned long *src_pfns,
> if (!folio_test_anon(folio) ||
> !folio_free_swap(folio)) {
> src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
> - continue;
> + goto next;
> }
> }
> } else if (folio_is_zone_device(newfolio)) {
> @@ -770,7 +1070,7 @@ static void __migrate_device_pages(unsigned long *src_pfns,
> * Other types of ZONE_DEVICE page are not supported.
> */
> src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
> - continue;
> + goto next;
> }
>
> BUG_ON(folio_test_writeback(folio));
> @@ -782,6 +1082,8 @@ static void __migrate_device_pages(unsigned long *src_pfns,
> src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
> else
> folio_migrate_flags(newfolio, folio);
> +next:
> + i += nr;
> }
>
> if (notified)
> @@ -943,10 +1245,23 @@ static unsigned long migrate_device_pfn_lock(unsigned long pfn)
> int migrate_device_range(unsigned long *src_pfns, unsigned long start,
> unsigned long npages)
I think migrate_device_pfns() should be updated in a similar way too.
Here is what I came up with. Again, feel free to include or modify these
changes as you see fit.
* Similar to migrate_device_range() but supports non-contiguous pre-popluated
- * array of device pages to migrate.
+ * array of device pages to migrate. If a higher-order folio is found, the mpfn
+ * is OR'ed with MIGRATE_PFN_COMPOUND, and the subsequent mpfns within the range
+ * of the order are cleared.
*/
int migrate_device_pfns(unsigned long *src_pfns, unsigned long npages)
{
- unsigned long i;
+ unsigned long i, j;
+
+ for (i = 0; i < npages; i++) {
+ struct page *page = pfn_to_page(src_pfns[i]);
+ struct folio *folio = page_folio(page);
+ unsigned int nr = 1;
- for (i = 0; i < npages; i++)
src_pfns[i] = migrate_device_pfn_lock(src_pfns[i]);
+ nr = folio_nr_pages(folio);
+ if (nr > 1) {
+ src_pfns[i] |= MIGRATE_PFN_COMPOUND;
+ for (j = 1; j < nr; j++)
+ src_pfns[i+j] = 0;
+ i += j - 1;
+ }
+ }
Matt
> {
> - unsigned long i, pfn;
> + unsigned long i, j, pfn;
> +
> + for (pfn = start, i = 0; i < npages; pfn++, i++) {
> + struct page *page = pfn_to_page(pfn);
> + struct folio *folio = page_folio(page);
> + unsigned int nr = 1;
>
> - for (pfn = start, i = 0; i < npages; pfn++, i++)
> src_pfns[i] = migrate_device_pfn_lock(pfn);
> + nr = folio_nr_pages(folio);
> + if (nr > 1) {
> + src_pfns[i] |= MIGRATE_PFN_COMPOUND;
> + for (j = 1; j < nr; j++)
> + src_pfns[i+j] = 0;
> + i += j - 1;
> + pfn += j - 1;
> + }
> + }
>
> migrate_device_unmap(src_pfns, npages, NULL);
>
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 02/12] mm/migrate_device: flags for selecting device private THP pages
2025-07-08 7:31 ` Balbir Singh
@ 2025-07-19 20:06 ` Matthew Brost
2025-07-19 20:16 ` Matthew Brost
0 siblings, 1 reply; 99+ messages in thread
From: Matthew Brost @ 2025-07-19 20:06 UTC (permalink / raw)
To: Balbir Singh
Cc: Alistair Popple, linux-mm, akpm, linux-kernel, Karol Herbst,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Donet Tom
On Tue, Jul 08, 2025 at 05:31:49PM +1000, Balbir Singh wrote:
> On 7/7/25 15:31, Alistair Popple wrote:
> > On Fri, Jul 04, 2025 at 09:35:01AM +1000, Balbir Singh wrote:
> >> Add flags to mark zone device migration pages.
> >>
> >> MIGRATE_VMA_SELECT_COMPOUND will be used to select THP pages during
> >> migrate_vma_setup() and MIGRATE_PFN_COMPOUND will make migrating
> >> device pages as compound pages during device pfn migration.
> >>
> >> Cc: Karol Herbst <kherbst@redhat.com>
> >> Cc: Lyude Paul <lyude@redhat.com>
> >> Cc: Danilo Krummrich <dakr@kernel.org>
> >> Cc: David Airlie <airlied@gmail.com>
> >> Cc: Simona Vetter <simona@ffwll.ch>
> >> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> >> Cc: Shuah Khan <shuah@kernel.org>
> >> Cc: David Hildenbrand <david@redhat.com>
> >> Cc: Barry Song <baohua@kernel.org>
> >> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> >> Cc: Ryan Roberts <ryan.roberts@arm.com>
> >> Cc: Matthew Wilcox <willy@infradead.org>
> >> Cc: Peter Xu <peterx@redhat.com>
> >> Cc: Zi Yan <ziy@nvidia.com>
> >> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
> >> Cc: Jane Chu <jane.chu@oracle.com>
> >> Cc: Alistair Popple <apopple@nvidia.com>
> >> Cc: Donet Tom <donettom@linux.ibm.com>
> >>
> >> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> >> ---
> >> include/linux/migrate.h | 2 ++
> >> 1 file changed, 2 insertions(+)
> >>
> >> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> >> index aaa2114498d6..1661e2d5479a 100644
> >> --- a/include/linux/migrate.h
> >> +++ b/include/linux/migrate.h
> >> @@ -167,6 +167,7 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
> >> #define MIGRATE_PFN_VALID (1UL << 0)
> >> #define MIGRATE_PFN_MIGRATE (1UL << 1)
> >> #define MIGRATE_PFN_WRITE (1UL << 3)
> >> +#define MIGRATE_PFN_COMPOUND (1UL << 4)
> >
> > Why is this necessary? Couldn't migrate_vma just use folio_order() to figure out
> > if it's a compound page or not?
> >
>
> I can definitely explore that angle. As we move towards mTHP, we'll need additional bits for the various order sizes as well.
>
I agree you probably could get away without having an explicit mpfn flag
for compound pages and rely on the folio order everywhere.
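A minimal sketch of that alternative, just as an illustration (it reuses
helpers migrate_device.c already has; not a drop-in change): derive the step
size from the folio instead of testing an mpfn flag.

struct page *page = migrate_pfn_to_page(src_pfns[i]);
unsigned long nr = 1;

if (page) {
	struct folio *folio = page_folio(page);

	/* the order is known from the folio itself, no flag needed */
	if (folio_test_large(folio))
		nr = folio_nr_pages(folio);
}
i += nr;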
Matt
> Balbir Singh
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 02/12] mm/migrate_device: flags for selecting device private THP pages
2025-07-19 20:06 ` Matthew Brost
@ 2025-07-19 20:16 ` Matthew Brost
0 siblings, 0 replies; 99+ messages in thread
From: Matthew Brost @ 2025-07-19 20:16 UTC (permalink / raw)
To: Balbir Singh
Cc: Alistair Popple, linux-mm, akpm, linux-kernel, Karol Herbst,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Donet Tom
On Sat, Jul 19, 2025 at 01:06:21PM -0700, Matthew Brost wrote:
> On Tue, Jul 08, 2025 at 05:31:49PM +1000, Balbir Singh wrote:
> > On 7/7/25 15:31, Alistair Popple wrote:
> > > On Fri, Jul 04, 2025 at 09:35:01AM +1000, Balbir Singh wrote:
> > >> Add flags to mark zone device migration pages.
> > >>
> > >> MIGRATE_VMA_SELECT_COMPOUND will be used to select THP pages during
> > >> migrate_vma_setup() and MIGRATE_PFN_COMPOUND will make migrating
> > >> device pages as compound pages during device pfn migration.
> > >>
> > >> Cc: Karol Herbst <kherbst@redhat.com>
> > >> Cc: Lyude Paul <lyude@redhat.com>
> > >> Cc: Danilo Krummrich <dakr@kernel.org>
> > >> Cc: David Airlie <airlied@gmail.com>
> > >> Cc: Simona Vetter <simona@ffwll.ch>
> > >> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> > >> Cc: Shuah Khan <shuah@kernel.org>
> > >> Cc: David Hildenbrand <david@redhat.com>
> > >> Cc: Barry Song <baohua@kernel.org>
> > >> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> > >> Cc: Ryan Roberts <ryan.roberts@arm.com>
> > >> Cc: Matthew Wilcox <willy@infradead.org>
> > >> Cc: Peter Xu <peterx@redhat.com>
> > >> Cc: Zi Yan <ziy@nvidia.com>
> > >> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
> > >> Cc: Jane Chu <jane.chu@oracle.com>
> > >> Cc: Alistair Popple <apopple@nvidia.com>
> > >> Cc: Donet Tom <donettom@linux.ibm.com>
> > >>
> > >> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> > >> ---
> > >> include/linux/migrate.h | 2 ++
> > >> 1 file changed, 2 insertions(+)
> > >>
> > >> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> > >> index aaa2114498d6..1661e2d5479a 100644
> > >> --- a/include/linux/migrate.h
> > >> +++ b/include/linux/migrate.h
> > >> @@ -167,6 +167,7 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
> > >> #define MIGRATE_PFN_VALID (1UL << 0)
> > >> #define MIGRATE_PFN_MIGRATE (1UL << 1)
> > >> #define MIGRATE_PFN_WRITE (1UL << 3)
> > >> +#define MIGRATE_PFN_COMPOUND (1UL << 4)
> > >
> > > Why is this necessary? Couldn't migrate_vma just use folio_order() to figure out
> > > if it's a compound page or not?
> > >
> >
> > I can definitely explore that angle. As we move towards mTHP, we'll need additional bits for the various order sizes as well.
> >
>
> I agree you probably could get away without having an explicit mpfn flag
> for compound pages and rely on the folio order everywhere.
>
Actually, I thought of one case where the flag is a bit useful: when a PMD is
not faulted at the time migrate_vma_setup() is called in a RAM → VRAM
migration, the flag communicates back to the caller that it can create a THP
device page. I suppose the caller could infer this from the src entries being
unpopulated, but I thought it was worth bringing up this case.
Not sure the MIGRATE_PFN_COMPOUND flag alone helps in the above scenario
with mTHP though. In that case, you'd probably need the order to be
encoded in the mpfn as well.
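For mTHP, one purely hypothetical option (not part of this series, and it
assumes MIGRATE_PFN_SHIFT is widened so the bits below the pfn are free)
would be a small order field in the mpfn:

/* Hypothetical sketch only: report/select an order even when no folio
 * exists yet, e.g. for the unpopulated-PMD RAM -> VRAM case above. */
#define MIGRATE_PFN_ORDER_SHIFT	5
#define MIGRATE_PFN_ORDER_MASK	(0x1fUL << MIGRATE_PFN_ORDER_SHIFT)

static inline unsigned int migrate_pfn_order(unsigned long mpfn)
{
	return (mpfn & MIGRATE_PFN_ORDER_MASK) >> MIGRATE_PFN_ORDER_SHIFT;
}

static inline unsigned long migrate_pfn_mkorder(unsigned long mpfn,
						unsigned int order)
{
	return mpfn | ((unsigned long)order << MIGRATE_PFN_ORDER_SHIFT);
}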
Matt
> Matt
>
> > Balbir Singh
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 00/12] THP support for zone device page migration
2025-07-18 3:57 ` Balbir Singh
2025-07-18 4:57 ` Matthew Brost
2025-07-19 0:53 ` Matthew Brost
@ 2025-07-21 11:42 ` Francois Dugast
2025-07-21 23:34 ` Balbir Singh
2 siblings, 1 reply; 99+ messages in thread
From: Francois Dugast @ 2025-07-21 11:42 UTC (permalink / raw)
To: Balbir Singh
Cc: Matthew Brost, linux-mm, akpm, linux-kernel, Karol Herbst,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
On Fri, Jul 18, 2025 at 01:57:13PM +1000, Balbir Singh wrote:
> On 7/18/25 09:40, Matthew Brost wrote:
> > On Fri, Jul 04, 2025 at 09:34:59AM +1000, Balbir Singh wrote:
> ...
> >>
> >> The nouveau dmem code has been enhanced to use the new THP migration
> >> capability.
> >>
> >> Feedback from the RFC [2]:
> >>
> >
> > Thanks for the patches, results look very promising. I wanted to give
> > some quick feedback:
> >
>
> Are you seeing improvements with the patchset?
>
> > - You appear to have missed updating hmm_range_fault, specifically
> > hmm_vma_handle_pmd, to check for device-private entries and populate the
> > HMM PFNs accordingly. My colleague François has a fix for this if you're
> > interested.
> >
>
> Sure, please feel free to post them.
Hi Balbir,
It seems we are missing this special handling in hmm_vma_walk_pmd():
diff --git a/mm/hmm.c b/mm/hmm.c
index f2415b4b2cdd..449025f72b2f 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -355,6 +355,27 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
}
if (!pmd_present(pmd)) {
+ swp_entry_t entry = pmd_to_swp_entry(pmd);
+
+ /*
+ * Don't fault in device private pages owned by the caller,
+ * just report the PFNs.
+ */
+ if (is_device_private_entry(entry) &&
+ pfn_swap_entry_folio(entry)->pgmap->owner ==
+ range->dev_private_owner) {
+ unsigned long cpu_flags = pmd_to_hmm_pfn_flags(range, pmd);
+ unsigned long pfn = swp_offset_pfn(entry);
+ unsigned long i;
+
+ for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
+ hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
+ hmm_pfns[i] |= pfn | cpu_flags;
+ }
+
+ return 0;
+ }
+
if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0))
return -EFAULT;
return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
Francois
>
> > - I believe copy_huge_pmd also needs to be updated to avoid installing a
> > migration entry if the swap entry is device-private. I don't have an
> > exact fix yet due to my limited experience with core MM. The test case
> > that triggers this is fairly simple: fault in a 2MB device page on the
> > GPU, then fork a process that reads the page — the kernel crashes in
> > this scenario.
> >
>
> I'd be happy to look at any traces you have or post any fixes you have
>
> Thanks for the feedback
> Balbir Singh
>
^ permalink raw reply related [flat|nested] 99+ messages in thread
* Re: [v1 resend 00/12] THP support for zone device page migration
2025-07-21 11:42 ` Francois Dugast
@ 2025-07-21 23:34 ` Balbir Singh
2025-07-22 0:01 ` Matthew Brost
2025-07-22 19:34 ` [PATCH] mm/hmm: Do not fault in device private pages owned by the caller Francois Dugast
0 siblings, 2 replies; 99+ messages in thread
From: Balbir Singh @ 2025-07-21 23:34 UTC (permalink / raw)
To: Francois Dugast
Cc: Matthew Brost, linux-mm, akpm, linux-kernel, Karol Herbst,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
On 7/21/25 21:42, Francois Dugast wrote:
> On Fri, Jul 18, 2025 at 01:57:13PM +1000, Balbir Singh wrote:
>> On 7/18/25 09:40, Matthew Brost wrote:
>>> On Fri, Jul 04, 2025 at 09:34:59AM +1000, Balbir Singh wrote:
>> ...
>>>>
>>>> The nouveau dmem code has been enhanced to use the new THP migration
>>>> capability.
>>>>
>>>> Feedback from the RFC [2]:
>>>>
>>>
>>> Thanks for the patches, results look very promising. I wanted to give
>>> some quick feedback:
>>>
>>
>> Are you seeing improvements with the patchset?
>>
>>> - You appear to have missed updating hmm_range_fault, specifically
>>> hmm_vma_handle_pmd, to check for device-private entries and populate the
>>> HMM PFNs accordingly. My colleague François has a fix for this if you're
>>> interested.
>>>
>>
>> Sure, please feel free to post them.
>
> Hi Balbir,
>
> It seems we are missing this special handling in hmm_vma_walk_pmd():
>
> diff --git a/mm/hmm.c b/mm/hmm.c
> index f2415b4b2cdd..449025f72b2f 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -355,6 +355,27 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
> }
>
> if (!pmd_present(pmd)) {
> + swp_entry_t entry = pmd_to_swp_entry(pmd);
> +
> + /*
> + * Don't fault in device private pages owned by the caller,
> + * just report the PFNs.
> + */
> + if (is_device_private_entry(entry) &&
> + pfn_swap_entry_folio(entry)->pgmap->owner ==
> + range->dev_private_owner) {
> + unsigned long cpu_flags = pmd_to_hmm_pfn_flags(range, pmd);
> + unsigned long pfn = swp_offset_pfn(entry);
> + unsigned long i;
> +
> + for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
> + hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
> + hmm_pfns[i] |= pfn | cpu_flags;
Won't we use hmm_pfn_to_map_order()? Do we still need to populate each entry in hmm_pfns[i]?
> + }
> +
> + return 0;
> + }
> +
Thanks for the patch! If you could send it with a full sign-off, I can add it to my series while
posting
Balbir Singh
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 00/12] THP support for zone device page migration
2025-07-18 4:57 ` Matthew Brost
@ 2025-07-21 23:48 ` Balbir Singh
2025-07-22 0:07 ` Matthew Brost
0 siblings, 1 reply; 99+ messages in thread
From: Balbir Singh @ 2025-07-21 23:48 UTC (permalink / raw)
To: Matthew Brost
Cc: linux-mm, akpm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
On 7/18/25 14:57, Matthew Brost wrote:
> On Fri, Jul 18, 2025 at 01:57:13PM +1000, Balbir Singh wrote:
>> On 7/18/25 09:40, Matthew Brost wrote:
>>> On Fri, Jul 04, 2025 at 09:34:59AM +1000, Balbir Singh wrote:
>> ...
>>>>
>>>> The nouveau dmem code has been enhanced to use the new THP migration
>>>> capability.
>>>>
>>>> Feedback from the RFC [2]:
>>>>
>>>
>>> Thanks for the patches, results look very promising. I wanted to give
>>> some quick feedback:
>>>
>>
>> Are you seeing improvements with the patchset?
>>
>
> We're nowhere near stable yet, but basic testing shows that CPU time
> from the start of migrate_vma_* to the end drops from ~300µs to ~6µs on
> a 2MB GPU fault. A lot of this improvement is dma-mapping at 2M
> granularity for the CPU<->GPU copy rather than mapping 512 4k pages
> too.
>
>>> - You appear to have missed updating hmm_range_fault, specifically
>>> hmm_vma_handle_pmd, to check for device-private entries and populate the
>>> HMM PFNs accordingly. My colleague François has a fix for this if you're
>>> interested.
>>>
>>
>> Sure, please feel free to post them.
>>
>>> - I believe copy_huge_pmd also needs to be updated to avoid installing a
>>> migration entry if the swap entry is device-private. I don't have an
>>> exact fix yet due to my limited experience with core MM. The test case
>>> that triggers this is fairly simple: fault in a 2MB device page on the
>>> GPU, then fork a process that reads the page — the kernel crashes in
>>> this scenario.
>>>
>>
>> I'd be happy to look at any traces you have or post any fixes you have
>>
>
> I've got it so the kernel doesn't explode but still get warnings like:
>
> [ 3564.850036] mm/pgtable-generic.c:54: bad pmd ffff8881290408e0(efffff80042bfe00)
> [ 3565.298186] BUG: Bad rss-counter state mm:ffff88810a100c40 type:MM_ANONPAGES val:114688
> [ 3565.306108] BUG: non-zero pgtables_bytes on freeing mm: 917504
>
> I'm basically just skipping the is_swap_pmd clause if the entry is device
> private, and letting the rest of the function execute. This avoids
> installing a migration entry (which isn't required and causes the
> crash) and allows the rmap code to run, which flips the pages to not
> anonymous exclusive (effectively making them copy-on-write (?), though
> that doesn't fully apply to device pages). It's not 100% correct yet,
> but it's a step in the right direction.
>
Thanks, could you post the stack trace as well. This is usually a symptom of
not freeing up the page table cleanly.
Do you have my latest patches that have
if (is_swap_pmd(pmdval)) {
swp_entry_t entry = pmd_to_swp_entry(pmdval);
if (is_device_private_entry(entry))
goto nomap;
}
in __pte_offset_map()?
Balbir Singh
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 00/12] THP support for zone device page migration
2025-07-21 23:34 ` Balbir Singh
@ 2025-07-22 0:01 ` Matthew Brost
2025-07-22 19:34 ` [PATCH] mm/hmm: Do not fault in device private pages owned by the caller Francois Dugast
1 sibling, 0 replies; 99+ messages in thread
From: Matthew Brost @ 2025-07-22 0:01 UTC (permalink / raw)
To: Balbir Singh
Cc: Francois Dugast, linux-mm, akpm, linux-kernel, Karol Herbst,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
On Tue, Jul 22, 2025 at 09:34:13AM +1000, Balbir Singh wrote:
> On 7/21/25 21:42, Francois Dugast wrote:
> > On Fri, Jul 18, 2025 at 01:57:13PM +1000, Balbir Singh wrote:
> >> On 7/18/25 09:40, Matthew Brost wrote:
> >>> On Fri, Jul 04, 2025 at 09:34:59AM +1000, Balbir Singh wrote:
> >> ...
> >>>>
> >>>> The nouveau dmem code has been enhanced to use the new THP migration
> >>>> capability.
> >>>>
> >>>> Feedback from the RFC [2]:
> >>>>
> >>>
> >>> Thanks for the patches, results look very promising. I wanted to give
> >>> some quick feedback:
> >>>
> >>
> >> Are you seeing improvements with the patchset?
> >>
> >>> - You appear to have missed updating hmm_range_fault, specifically
> >>> hmm_vma_handle_pmd, to check for device-private entries and populate the
> >>> HMM PFNs accordingly. My colleague François has a fix for this if you're
> >>> interested.
> >>>
> >>
> >> Sure, please feel free to post them.
> >
> > Hi Balbir,
> >
> > It seems we are missing this special handling in hmm_vma_walk_pmd():
> >
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > index f2415b4b2cdd..449025f72b2f 100644
> > --- a/mm/hmm.c
> > +++ b/mm/hmm.c
> > @@ -355,6 +355,27 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
> > }
> >
> > if (!pmd_present(pmd)) {
> > + swp_entry_t entry = pmd_to_swp_entry(pmd);
> > +
> > + /*
> > + * Don't fault in device private pages owned by the caller,
> > + * just report the PFNs.
> > + */
> > + if (is_device_private_entry(entry) &&
> > + pfn_swap_entry_folio(entry)->pgmap->owner ==
> > + range->dev_private_owner) {
> > + unsigned long cpu_flags = pmd_to_hmm_pfn_flags(range, pmd);
> > + unsigned long pfn = swp_offset_pfn(entry);
> > + unsigned long i;
> > +
> > + for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
> > + hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
> > + hmm_pfns[i] |= pfn | cpu_flags;
>
> Won't we use hmm_pfn_to_map_order()? Do we still need to populate each entry in hmm_pfns[i]?
>
I had the same question. hmm_vma_handle_pmd populates subsequent PFNs as
well, but this seems unnecessary. I removed both of these cases and my
code still worked. However, I haven't audited all kernel callers to
ensure this doesn't cause issues elsewhere.
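For reference, a rough sketch of the caller-side loop that would make
populating the tail entries unnecessary (use_mapping() is just a placeholder
for whatever the driver does with the result):

unsigned long i;

for (i = 0; i < npages; ) {
	unsigned long npfns = 1UL << hmm_pfn_to_map_order(hmm_pfns[i]);

	if (hmm_pfns[i] & HMM_PFN_VALID)
		/* one call per mapping, covering npfns base pages */
		use_mapping(hmm_pfn_to_page(hmm_pfns[i]), npfns);

	i += npfns;
}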
Matt
> > + }
> > +
> > + return 0;
> > + }
> > +
>
> Thanks for the patch! If you could send it with a full sign-off, I can add it to my series while
> posting
>
> Balbir Singh
>
>
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 00/12] THP support for zone device page migration
2025-07-21 23:48 ` Balbir Singh
@ 2025-07-22 0:07 ` Matthew Brost
2025-07-22 0:51 ` Balbir Singh
0 siblings, 1 reply; 99+ messages in thread
From: Matthew Brost @ 2025-07-22 0:07 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-mm, akpm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
On Tue, Jul 22, 2025 at 09:48:18AM +1000, Balbir Singh wrote:
> On 7/18/25 14:57, Matthew Brost wrote:
> > On Fri, Jul 18, 2025 at 01:57:13PM +1000, Balbir Singh wrote:
> >> On 7/18/25 09:40, Matthew Brost wrote:
> >>> On Fri, Jul 04, 2025 at 09:34:59AM +1000, Balbir Singh wrote:
> >> ...
> >>>>
> >>>> The nouveau dmem code has been enhanced to use the new THP migration
> >>>> capability.
> >>>>
> >>>> Feedback from the RFC [2]:
> >>>>
> >>>
> >>> Thanks for the patches, results look very promising. I wanted to give
> >>> some quick feedback:
> >>>
> >>
> >> Are you seeing improvements with the patchset?
> >>
> >
> > We're nowhere near stable yet, but basic testing shows that CPU time
> > from the start of migrate_vma_* to the end drops from ~300µs to ~6µs on
> > a 2MB GPU fault. A lot of this improvement is dma-mapping at 2M
> > granularity for the CPU<->GPU copy rather than mapping 512 4k pages
> > too.
> >
> >>> - You appear to have missed updating hmm_range_fault, specifically
> >>> hmm_vma_handle_pmd, to check for device-private entries and populate the
> >>> HMM PFNs accordingly. My colleague François has a fix for this if you're
> >>> interested.
> >>>
> >>
> >> Sure, please feel free to post them.
> >>
> >>> - I believe copy_huge_pmd also needs to be updated to avoid installing a
> >>> migration entry if the swap entry is device-private. I don't have an
> >>> exact fix yet due to my limited experience with core MM. The test case
> >>> that triggers this is fairly simple: fault in a 2MB device page on the
> >>> GPU, then fork a process that reads the page — the kernel crashes in
> >>> this scenario.
> >>>
> >>
> >> I'd be happy to look at any traces you have or post any fixes you have
> >>
> >
> > I've got it so the kernel doesn't explode but still get warnings like:
> >
> > [ 3564.850036] mm/pgtable-generic.c:54: bad pmd ffff8881290408e0(efffff80042bfe00)
> > [ 3565.298186] BUG: Bad rss-counter state mm:ffff88810a100c40 type:MM_ANONPAGES val:114688
> > [ 3565.306108] BUG: non-zero pgtables_bytes on freeing mm: 917504
> >
> > I'm basically just skipping the is_swap_pmd clause if the entry is device
> > private, and letting the rest of the function execute. This avoids
> > installing a migration entry (which isn't required and causes the
> > crash) and allows the rmap code to run, which flips the pages to not
> > anonymous exclusive (effectively making them copy-on-write (?), though
> > that doesn't fully apply to device pages). It's not 100% correct yet,
> > but it's a step in the right direction.
> >
>
>
> Thanks, could you post the stack trace as well. This is usually a symptom of
> not freeing up the page table cleanly.
>
Did you see my reply here [1]? I've got this working cleanly.
I actually have all my tests passing with a few additional core MM
changes. I'll reply shortly to a few other patches with those details
and will also send over a complete set of the core MM changes we've made
to get things stable.
Matt
[1] https://lore.kernel.org/linux-mm/aHrsdvjjliBBdVQm@lstrano-desk.jf.intel.com/
> Do you have my latest patches that have
>
> if (is_swap_pmd(pmdval)) {
> swp_entry_t entry = pmd_to_swp_entry(pmdval);
>
> if (is_device_private_entry(entry))
> goto nomap;
> }
>
> in __pte_offset_map()?
>
> Balbir Singh
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 00/12] THP support for zone device page migration
2025-07-22 0:07 ` Matthew Brost
@ 2025-07-22 0:51 ` Balbir Singh
0 siblings, 0 replies; 99+ messages in thread
From: Balbir Singh @ 2025-07-22 0:51 UTC (permalink / raw)
To: Matthew Brost
Cc: linux-mm, akpm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
On 7/22/25 10:07, Matthew Brost wrote:
> On Tue, Jul 22, 2025 at 09:48:18AM +1000, Balbir Singh wrote:
>> On 7/18/25 14:57, Matthew Brost wrote:
>>> On Fri, Jul 18, 2025 at 01:57:13PM +1000, Balbir Singh wrote:
>>>> On 7/18/25 09:40, Matthew Brost wrote:
>>>>> On Fri, Jul 04, 2025 at 09:34:59AM +1000, Balbir Singh wrote:
>>>> ...
>>>>>>
>>>>>> The nouveau dmem code has been enhanced to use the new THP migration
>>>>>> capability.
>>>>>>
>>>>>> Feedback from the RFC [2]:
>>>>>>
>>>>>
>>>>> Thanks for the patches, results look very promising. I wanted to give
>>>>> some quick feedback:
>>>>>
>>>>
>>>> Are you seeing improvements with the patchset?
>>>>
>>>
>>> We're nowhere near stable yet, but basic testing shows that CPU time
>>> from the start of migrate_vma_* to the end drops from ~300µs to ~6µs on
>>> a 2MB GPU fault. A lot of this improvement is dma-mapping at 2M
>>> granularity for the CPU<->GPU copy rather than mapping 512 4k pages
>>> too.
>>>
>>>>> - You appear to have missed updating hmm_range_fault, specifically
>>>>> hmm_vma_handle_pmd, to check for device-private entries and populate the
>>>>> HMM PFNs accordingly. My colleague François has a fix for this if you're
>>>>> interested.
>>>>>
>>>>
>>>> Sure, please feel free to post them.
>>>>
>>>>> - I believe copy_huge_pmd also needs to be updated to avoid installing a
>>>>> migration entry if the swap entry is device-private. I don't have an
>>>>> exact fix yet due to my limited experience with core MM. The test case
>>>>> that triggers this is fairly simple: fault in a 2MB device page on the
>>>>> GPU, then fork a process that reads the page — the kernel crashes in
>>>>> this scenario.
>>>>>
>>>>
>>>> I'd be happy to look at any traces you have or post any fixes you have
>>>>
>>>
>>> I've got it so the kernel doesn't explode but still get warnings like:
>>>
>>> [ 3564.850036] mm/pgtable-generic.c:54: bad pmd ffff8881290408e0(efffff80042bfe00)
>>> [ 3565.298186] BUG: Bad rss-counter state mm:ffff88810a100c40 type:MM_ANONPAGES val:114688
>>> [ 3565.306108] BUG: non-zero pgtables_bytes on freeing mm: 917504
>>>
>>> I'm basically just skipping the is_swap_pmd clause if the entry is device
>>> private, and letting the rest of the function execute. This avoids
>>> installing a migration entry (which isn't required and causes the
>>> crash) and allows the rmap code to run, which flips the pages to not
>>> anonymous exclusive (effectively making them copy-on-write (?), though
>>> that doesn't fully apply to device pages). It's not 100% correct yet,
>>> but it's a step in the right direction.
>>>
>>
>>
>> Thanks, could you post the stack trace as well. This is usually a symptom of
>> not freeing up the page table cleanly.
>>
>
> Did you see my reply here [1]? I've got this working cleanly.
>
> I actually have all my tests passing with a few additional core MM
> changes. I'll reply shortly to a few other patches with those details
> and will also send over a complete set of the core MM changes we've made
> to get things stable.
>
> Matt
>
> [1] https://lore.kernel.org/linux-mm/aHrsdvjjliBBdVQm@lstrano-desk.jf.intel.com/
Sorry I missed it. Checking now!
I'd be happy to hear any numbers you have. In my throughput tests that rely on
page copying (lib/test_hmm.c), I am seeing a 500% improvement in performance and
an 80% improvement in latency.
Thanks,
Balbir
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 03/12] mm/thp: zone_device awareness in THP handling code
2025-07-03 23:35 ` [v1 resend 03/12] mm/thp: zone_device awareness in THP handling code Balbir Singh
` (3 preceding siblings ...)
2025-07-07 6:07 ` Alistair Popple
@ 2025-07-22 4:42 ` Matthew Brost
4 siblings, 0 replies; 99+ messages in thread
From: Matthew Brost @ 2025-07-22 4:42 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-mm, akpm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
On Fri, Jul 04, 2025 at 09:35:02AM +1000, Balbir Singh wrote:
> Make THP handling code in the mm subsystem for THP pages
> aware of zone device pages. Although the code is
> designed to be generic when it comes to handling splitting
> of pages, the code is designed to work for THP page sizes
> corresponding to HPAGE_PMD_NR.
>
> Modify page_vma_mapped_walk() to return true when a zone
> device huge entry is present, enabling try_to_migrate()
> and other code migration paths to appropriately process the
> entry
>
> pmd_pfn() does not work well with zone device entries, use
> pfn_pmd_entry_to_swap() for checking and comparison as for
> zone device entries.
>
> try_to_map_to_unused_zeropage() does not apply to zone device
> entries, zone device entries are ignored in the call.
>
> Cc: Karol Herbst <kherbst@redhat.com>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
> Cc: Jane Chu <jane.chu@oracle.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Donet Tom <donettom@linux.ibm.com>
>
> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> ---
> mm/huge_memory.c | 153 +++++++++++++++++++++++++++++++------------
> mm/migrate.c | 2 +
> mm/page_vma_mapped.c | 10 +++
> mm/pgtable-generic.c | 6 ++
> mm/rmap.c | 19 +++++-
> 5 files changed, 146 insertions(+), 44 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index ce130225a8e5..e6e390d0308f 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1711,7 +1711,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
As mentioned in a reply to the cover letter [1], this code is crashing for
us (Intel) when we fork and then read device pages. I’ve suggested a fix
in another reply [2] and will send Nvidia’s stakeholders a complete
patch with all the necessary fixes to stabilize our code; more on that
below.
[1] https://lore.kernel.org/linux-mm/aHmJ+L3fCc0tju7A@lstrano-desk.jf.intel.com/
[2] https://lore.kernel.org/linux-mm/aHrsdvjjliBBdVQm@lstrano-desk.jf.intel.com/#t
> if (unlikely(is_swap_pmd(pmd))) {
> swp_entry_t entry = pmd_to_swp_entry(pmd);
>
> - VM_BUG_ON(!is_pmd_migration_entry(pmd));
> + VM_BUG_ON(!is_pmd_migration_entry(pmd) &&
> + !is_device_private_entry(entry));
> if (!is_readable_migration_entry(entry)) {
> entry = make_readable_migration_entry(
> swp_offset(entry));
> @@ -2222,10 +2223,17 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> } else if (thp_migration_supported()) {
> swp_entry_t entry;
>
> - VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
> entry = pmd_to_swp_entry(orig_pmd);
> folio = pfn_swap_entry_folio(entry);
> flush_needed = 0;
> +
> + VM_BUG_ON(!is_pmd_migration_entry(*pmd) &&
> + !folio_is_device_private(folio));
> +
> + if (folio_is_device_private(folio)) {
> + folio_remove_rmap_pmd(folio, folio_page(folio, 0), vma);
> + WARN_ON_ONCE(folio_mapcount(folio) < 0);
> + }
> } else
> WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
>
> @@ -2247,6 +2255,15 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> folio_mark_accessed(folio);
> }
>
> + /*
> + * Do a folio put on zone device private pages after
> + * changes to mm_counter, because the folio_put() will
> + * clean folio->mapping and the folio_test_anon() check
> + * will not be usable.
> + */
> + if (folio_is_device_private(folio))
> + folio_put(folio);
> +
> spin_unlock(ptl);
> if (flush_needed)
> tlb_remove_page_size(tlb, &folio->page, HPAGE_PMD_SIZE);
> @@ -2375,7 +2392,8 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> struct folio *folio = pfn_swap_entry_folio(entry);
> pmd_t newpmd;
>
> - VM_BUG_ON(!is_pmd_migration_entry(*pmd));
> + VM_BUG_ON(!is_pmd_migration_entry(*pmd) &&
> + !folio_is_device_private(folio));
> if (is_writable_migration_entry(entry)) {
> /*
> * A protection check is difficult so
> @@ -2388,9 +2406,11 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> newpmd = swp_entry_to_pmd(entry);
> if (pmd_swp_soft_dirty(*pmd))
> newpmd = pmd_swp_mksoft_dirty(newpmd);
> - } else {
> + } else if (is_writable_device_private_entry(entry)) {
> + newpmd = swp_entry_to_pmd(entry);
> + entry = make_device_exclusive_entry(swp_offset(entry));
> + } else
> newpmd = *pmd;
> - }
>
> if (uffd_wp)
> newpmd = pmd_swp_mkuffd_wp(newpmd);
> @@ -2842,16 +2862,20 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> struct page *page;
> pgtable_t pgtable;
> pmd_t old_pmd, _pmd;
> - bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
> - bool anon_exclusive = false, dirty = false;
> + bool young, write, soft_dirty, uffd_wp = false;
> + bool anon_exclusive = false, dirty = false, present = false;
> unsigned long addr;
> pte_t *pte;
> int i;
> + swp_entry_t swp_entry;
>
> VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
> VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
> VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma);
> - VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd));
> +
> + VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd)
> + && !(is_swap_pmd(*pmd) &&
> + is_device_private_entry(pmd_to_swp_entry(*pmd))));
>
> count_vm_event(THP_SPLIT_PMD);
>
> @@ -2899,20 +2923,25 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> return __split_huge_zero_page_pmd(vma, haddr, pmd);
This function is causing a crash with the following test case:
- Fault in a 2M GPU page (i.e., an order-9 device folio is in the PMD)
- User space calls munmap on a partial region
A quick explanation of the crash:
- zap_pmd_range() calls __split_huge_pmd()
- zap_nonpresent_ptes() finds multiple PTE swap entries that all point
to the same large folio; it decrements the refcount multiple times,
causing a kernel crash
I believe there are likely several other problematic cases in the kernel
as well, but I only deep-dived into the case above.
The solution I came up with is: if a device-private PMD is found, split
the folio. This seems to work.
Rather than include the fix I came up with here, I’ve just sent Nvidia’s
stakeholders a patch titled "mm: Changes for Nvidia's device THP series
to enable device THP in GPU SVM / Xe", which contains all the core MM
changes we made to stabilize our code. Feel free to make that patch
public for discussion or use it however you see fit.
Matt
> }
>
> - pmd_migration = is_pmd_migration_entry(*pmd);
> - if (unlikely(pmd_migration)) {
> - swp_entry_t entry;
>
> + present = pmd_present(*pmd);
> + if (unlikely(!present)) {
> + swp_entry = pmd_to_swp_entry(*pmd);
> old_pmd = *pmd;
> - entry = pmd_to_swp_entry(old_pmd);
> - page = pfn_swap_entry_to_page(entry);
> - write = is_writable_migration_entry(entry);
> +
> + folio = pfn_swap_entry_folio(swp_entry);
> + VM_BUG_ON(!is_migration_entry(swp_entry) &&
> + !is_device_private_entry(swp_entry));
> + page = pfn_swap_entry_to_page(swp_entry);
> + write = is_writable_migration_entry(swp_entry);
> +
> if (PageAnon(page))
> - anon_exclusive = is_readable_exclusive_migration_entry(entry);
> - young = is_migration_entry_young(entry);
> - dirty = is_migration_entry_dirty(entry);
> + anon_exclusive =
> + is_readable_exclusive_migration_entry(swp_entry);
> soft_dirty = pmd_swp_soft_dirty(old_pmd);
> uffd_wp = pmd_swp_uffd_wp(old_pmd);
> + young = is_migration_entry_young(swp_entry);
> + dirty = is_migration_entry_dirty(swp_entry);
> } else {
> /*
> * Up to this point the pmd is present and huge and userland has
> @@ -2996,30 +3025,45 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> * Note that NUMA hinting access restrictions are not transferred to
> * avoid any possibility of altering permissions across VMAs.
> */
> - if (freeze || pmd_migration) {
> + if (freeze || !present) {
> for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
> pte_t entry;
> - swp_entry_t swp_entry;
> -
> - if (write)
> - swp_entry = make_writable_migration_entry(
> - page_to_pfn(page + i));
> - else if (anon_exclusive)
> - swp_entry = make_readable_exclusive_migration_entry(
> - page_to_pfn(page + i));
> - else
> - swp_entry = make_readable_migration_entry(
> - page_to_pfn(page + i));
> - if (young)
> - swp_entry = make_migration_entry_young(swp_entry);
> - if (dirty)
> - swp_entry = make_migration_entry_dirty(swp_entry);
> - entry = swp_entry_to_pte(swp_entry);
> - if (soft_dirty)
> - entry = pte_swp_mksoft_dirty(entry);
> - if (uffd_wp)
> - entry = pte_swp_mkuffd_wp(entry);
> -
> + if (freeze || is_migration_entry(swp_entry)) {
> + if (write)
> + swp_entry = make_writable_migration_entry(
> + page_to_pfn(page + i));
> + else if (anon_exclusive)
> + swp_entry = make_readable_exclusive_migration_entry(
> + page_to_pfn(page + i));
> + else
> + swp_entry = make_readable_migration_entry(
> + page_to_pfn(page + i));
> + if (young)
> + swp_entry = make_migration_entry_young(swp_entry);
> + if (dirty)
> + swp_entry = make_migration_entry_dirty(swp_entry);
> + entry = swp_entry_to_pte(swp_entry);
> + if (soft_dirty)
> + entry = pte_swp_mksoft_dirty(entry);
> + if (uffd_wp)
> + entry = pte_swp_mkuffd_wp(entry);
> + } else {
> + VM_BUG_ON(!is_device_private_entry(swp_entry));
> + if (write)
> + swp_entry = make_writable_device_private_entry(
> + page_to_pfn(page + i));
> + else if (anon_exclusive)
> + swp_entry = make_device_exclusive_entry(
> + page_to_pfn(page + i));
> + else
> + swp_entry = make_readable_device_private_entry(
> + page_to_pfn(page + i));
> + entry = swp_entry_to_pte(swp_entry);
> + if (soft_dirty)
> + entry = pte_swp_mksoft_dirty(entry);
> + if (uffd_wp)
> + entry = pte_swp_mkuffd_wp(entry);
> + }
> VM_WARN_ON(!pte_none(ptep_get(pte + i)));
> set_pte_at(mm, addr, pte + i, entry);
> }
> @@ -3046,7 +3090,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> }
> pte_unmap(pte);
>
> - if (!pmd_migration)
> + if (present)
> folio_remove_rmap_pmd(folio, page, vma);
> if (freeze)
> put_page(page);
> @@ -3058,8 +3102,11 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
> pmd_t *pmd, bool freeze)
> {
> +
> VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
> - if (pmd_trans_huge(*pmd) || is_pmd_migration_entry(*pmd))
> + if (pmd_trans_huge(*pmd) || is_pmd_migration_entry(*pmd) ||
> + (is_swap_pmd(*pmd) &&
> + is_device_private_entry(pmd_to_swp_entry(*pmd))))
> __split_huge_pmd_locked(vma, pmd, address, freeze);
> }
>
> @@ -3238,6 +3285,9 @@ static void lru_add_split_folio(struct folio *folio, struct folio *new_folio,
> VM_BUG_ON_FOLIO(folio_test_lru(new_folio), folio);
> lockdep_assert_held(&lruvec->lru_lock);
>
> + if (folio_is_device_private(folio))
> + return;
> +
> if (list) {
> /* page reclaim is reclaiming a huge page */
> VM_WARN_ON(folio_test_lru(folio));
> @@ -3252,6 +3302,7 @@ static void lru_add_split_folio(struct folio *folio, struct folio *new_folio,
> list_add_tail(&new_folio->lru, &folio->lru);
> folio_set_lru(new_folio);
> }
> +
> }
>
> /* Racy check whether the huge page can be split */
> @@ -3543,6 +3594,10 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
> ((mapping || swap_cache) ?
> folio_nr_pages(release) : 0));
>
> + if (folio_is_device_private(release))
> + percpu_ref_get_many(&release->pgmap->ref,
> + (1 << new_order) - 1);
> +
> lru_add_split_folio(origin_folio, release, lruvec,
> list);
>
> @@ -4596,7 +4651,10 @@ int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
> return 0;
>
> flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
> - pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
> + if (!folio_is_device_private(folio))
> + pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
> + else
> + pmdval = pmdp_huge_clear_flush(vma, address, pvmw->pmd);
>
> /* See folio_try_share_anon_rmap_pmd(): invalidate PMD first. */
> anon_exclusive = folio_test_anon(folio) && PageAnonExclusive(page);
> @@ -4646,6 +4704,17 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
> entry = pmd_to_swp_entry(*pvmw->pmd);
> folio_get(folio);
> pmde = folio_mk_pmd(folio, READ_ONCE(vma->vm_page_prot));
> +
> + if (unlikely(folio_is_device_private(folio))) {
> + if (pmd_write(pmde))
> + entry = make_writable_device_private_entry(
> + page_to_pfn(new));
> + else
> + entry = make_readable_device_private_entry(
> + page_to_pfn(new));
> + pmde = swp_entry_to_pmd(entry);
> + }
> +
> if (pmd_swp_soft_dirty(*pvmw->pmd))
> pmde = pmd_mksoft_dirty(pmde);
> if (is_writable_migration_entry(entry))
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 767f503f0875..0b6ecf559b22 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -200,6 +200,8 @@ static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
>
> if (PageCompound(page))
> return false;
> + if (folio_is_device_private(folio))
> + return false;
> VM_BUG_ON_PAGE(!PageAnon(page), page);
> VM_BUG_ON_PAGE(!PageLocked(page), page);
> VM_BUG_ON_PAGE(pte_present(ptep_get(pvmw->pte)), page);
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index e981a1a292d2..ff8254e52de5 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -277,6 +277,16 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> * cannot return prematurely, while zap_huge_pmd() has
> * cleared *pmd but not decremented compound_mapcount().
> */
> + swp_entry_t entry;
> +
> + if (!thp_migration_supported())
> + return not_found(pvmw);
> + entry = pmd_to_swp_entry(pmde);
> + if (is_device_private_entry(entry)) {
> + pvmw->ptl = pmd_lock(mm, pvmw->pmd);
> + return true;
> + }
> +
> if ((pvmw->flags & PVMW_SYNC) &&
> thp_vma_suitable_order(vma, pvmw->address,
> PMD_ORDER) &&
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index 567e2d084071..604e8206a2ec 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -292,6 +292,12 @@ pte_t *___pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
> *pmdvalp = pmdval;
> if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
> goto nomap;
> + if (is_swap_pmd(pmdval)) {
> + swp_entry_t entry = pmd_to_swp_entry(pmdval);
> +
> + if (is_device_private_entry(entry))
> + goto nomap;
> + }
> if (unlikely(pmd_trans_huge(pmdval)))
> goto nomap;
> if (unlikely(pmd_bad(pmdval))) {
> diff --git a/mm/rmap.c b/mm/rmap.c
> index bd83724d14b6..da1e5b03e1fe 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2336,8 +2336,23 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
> break;
> }
> #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> - subpage = folio_page(folio,
> - pmd_pfn(*pvmw.pmd) - folio_pfn(folio));
> + /*
> + * Zone device private folios do not work well with
> + * pmd_pfn() on some architectures due to pte
> + * inversion.
> + */
> + if (folio_is_device_private(folio)) {
> + swp_entry_t entry = pmd_to_swp_entry(*pvmw.pmd);
> + unsigned long pfn = swp_offset_pfn(entry);
> +
> + subpage = folio_page(folio, pfn
> + - folio_pfn(folio));
> + } else {
> + subpage = folio_page(folio,
> + pmd_pfn(*pvmw.pmd)
> + - folio_pfn(folio));
> + }
> +
> VM_BUG_ON_FOLIO(folio_test_hugetlb(folio) ||
> !folio_test_pmd_mappable(folio), folio);
>
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 04/12] mm/migrate_device: THP migration of zone device pages
2025-07-18 8:22 ` Matthew Brost
@ 2025-07-22 4:54 ` Matthew Brost
0 siblings, 0 replies; 99+ messages in thread
From: Matthew Brost @ 2025-07-22 4:54 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-mm, akpm, linux-kernel, Karol Herbst, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
On Fri, Jul 18, 2025 at 01:22:24AM -0700, Matthew Brost wrote:
> On Fri, Jul 18, 2025 at 12:21:36AM -0700, Matthew Brost wrote:
> > On Fri, Jul 18, 2025 at 05:04:39PM +1000, Balbir Singh wrote:
> > > On 7/18/25 16:59, Matthew Brost wrote:
> > > > On Fri, Jul 04, 2025 at 09:35:03AM +1000, Balbir Singh wrote:
> > > >> + if (thp_migration_supported() &&
> > > >> + (migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
> > > >> + (IS_ALIGNED(start, HPAGE_PMD_SIZE) &&
> > > >> + IS_ALIGNED(end, HPAGE_PMD_SIZE))) {
> > > >> + migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE |
> > > >> + MIGRATE_PFN_COMPOUND;
> > > >> + migrate->dst[migrate->npages] = 0;
> > > >> + migrate->npages++;
> > > >> + migrate->cpages++;
> > > >
> > > > It's a bit unclear what cpages and npages actually represent when
> > > > collecting a THP. In my opinion, they should reflect the total number of
> > > > minimum sized pages collected—i.e., we should increment by the shifted
> > > > order (512) here. I'm fairly certain the logic in migrate_device.c would
> > > > break if a 4MB range was requested and a THP was found first, followed by a
> > > > non-THP.
> > > >
> > >
> > > cpages and npages represent entries in the array, and when OR'ed with MIGRATE_PFN_COMPOUND
> > > they represent the right number of entries populated. If you have a test that shows
> > > the breakage, I'd be keen to see it. We do populate other entries in 4k size(s) when
> > > collecting to allow for a split of the folio.
> > >
> >
> > I don’t have a test case, but let me quickly point out a logic bug.
> >
> > Look at migrate_device_unmap. The variable i is incremented by
> > folio_nr_pages, which seems correct. However, in the earlier code, we
> > populate migrate->src using migrate->npages as the index, then increment
> > it by 1. So, if two THPs are found back to back, they’ll occupy entries
> > 0 and 1, while migrate_device_unmap will access entries 0 and 512.
> >
Ugh, ignore this logic bug explanation — I was wrong. I missed that
migrate_vma_collect_skip increments npages to create the desired holes
in the source array for folio splits or skip-over logic.
But my point still stands regarding what cpages should represent — the
total number of minimum-sized pages collected and unmapped, in an effort
to keep the meaning of npages and cpages consistent.
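To make that concrete, a sketch of the collection-side accounting (an
assumption of how it could look, not the series' current code, using the
flag this series introduces):

/* PMD-sized folio collected: populate only the head entry, but account
 * npages/cpages in base pages so their meaning stays consistent. */
migrate->src[migrate->npages] = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE |
				MIGRATE_PFN_COMPOUND;
migrate->dst[migrate->npages] = 0;
memset(&migrate->src[migrate->npages + 1], 0,
       (HPAGE_PMD_NR - 1) * sizeof(*migrate->src));
migrate->npages += HPAGE_PMD_NR;
migrate->cpages += HPAGE_PMD_NR;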
Matt
> > Given that we have no idea what mix of THP vs non-THP we’ll encounter,
> > the only sane approach is to populate the input array at minimum
> > page-entry alignment. Similarly, npages and cpages should reflect the
> > number of minimum-sized pages found, with the caller (and
> > migrate_device) understanding that src and dst will be sparsely
> > populated based on each entry’s folio order.
> >
>
> I looked into this further and found another case where the logic breaks.
>
> In __migrate_device_pages, the call to migrate_vma_split_pages assumes
> that, based on the folio's order, it can populate subsequent entries upon
> split. This requires the source array to reflect the folio order as soon
> as the folio is found.
>
> Here’s a summary of how I believe the migrate_vma_setup interface should
> behave, assuming 4K pages and 2M THPs:
>
> Example A: 4MB requested, 2 THPs found and unmapped
> src[0]: folio, order 9, migrate flag set
> src[1–511]: not present
> src[512]: folio, order 9, migrate flag set
> src[513–1023]: not present
> npages = 1024, cpages = 1024
>
> Example B: 4MB requested, 2 THPs found, first THP unmap fails
> src[0]: folio, order 9, migrate flag clear
> src[1–511]: not present
> src[512]: folio, order 9, migrate flag set
> src[513–1023]: not present
> npages = 1024, cpages = 512
>
> Example C: 4MB requested, 512 small pages + 1 THP found, some small pages fail to unmap
> src[0–7]: folio, order 0, migrate flag clear
> src[8–511]: folio, order 0, migrate flag set
> src[512]: folio, order 9, migrate flag set
> src[513–1023]: not present
> npages = 1024, cpages = 1016
>
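A rough sketch of how a consumer would then walk src[]/dst[] laid out as in
the examples above (copy_one() stands in for the driver's allocate/copy step):

unsigned long i;

for (i = 0; i < npages; ) {
	struct page *page = migrate_pfn_to_page(src[i]);
	unsigned long nr = 1;

	if (page && (src[i] & MIGRATE_PFN_COMPOUND))
		nr = folio_nr_pages(page_folio(page));

	if (src[i] & MIGRATE_PFN_MIGRATE)
		copy_one(&src[i], &dst[i], nr);	/* placeholder */

	/* step over the "not present" tail entries of a THP */
	i += nr;
}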
> As I suggested in my previous reply to patch #2, this should be
> documented—preferably in kernel-doc—so the final behavior is clear to
> both migrate_device.c (and the structs in migrate.h) and the layers
> above. I can help take a pass at writing kernel-doc for both, as its
> behavior is fairly clear before your changes.
>
> Matt
>
> > Matt
> >
> > > Thanks for the review,
> > > Balbir Singh
^ permalink raw reply [flat|nested] 99+ messages in thread
* [PATCH] mm/hmm: Do not fault in device private pages owned by the caller
2025-07-21 23:34 ` Balbir Singh
2025-07-22 0:01 ` Matthew Brost
@ 2025-07-22 19:34 ` Francois Dugast
2025-07-22 20:07 ` Andrew Morton
` (2 more replies)
1 sibling, 3 replies; 99+ messages in thread
From: Francois Dugast @ 2025-07-22 19:34 UTC (permalink / raw)
To: balbirs
Cc: airlied, akpm, apopple, baohua, baolin.wang, dakr, david,
donettom, francois.dugast, jane.chu, jglisse, kherbst,
linux-kernel, linux-mm, lyude, matthew.brost, peterx,
ryan.roberts, shuah, simona, wangkefeng.wang, willy, ziy
When the PMD swap entry is device private and owned by the caller,
skip the range faulting and instead just set the correct HMM PFNs.
This is similar to the logic for PTEs in hmm_vma_handle_pte().
For now, each hmm_pfns[i] entry is populated as it is currently done
in hmm_vma_handle_pmd() but this might not be necessary. A follow-up
optimization could be to make use of the order and skip populating
subsequent PFNs.
Signed-off-by: Francois Dugast <francois.dugast@intel.com>
---
mm/hmm.c | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)
diff --git a/mm/hmm.c b/mm/hmm.c
index f2415b4b2cdd..63ec1b18a656 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -355,6 +355,31 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
}
if (!pmd_present(pmd)) {
+ swp_entry_t entry = pmd_to_swp_entry(pmd);
+
+ /*
+ * Don't fault in device private pages owned by the caller,
+ * just report the PFNs.
+ */
+ if (is_device_private_entry(entry) &&
+ pfn_swap_entry_folio(entry)->pgmap->owner ==
+ range->dev_private_owner) {
+ unsigned long cpu_flags = HMM_PFN_VALID |
+ hmm_pfn_flags_order(PMD_SHIFT - PAGE_SHIFT);
+ unsigned long pfn = swp_offset_pfn(entry);
+ unsigned long i;
+
+ if (is_writable_device_private_entry(entry))
+ cpu_flags |= HMM_PFN_WRITE;
+
+ for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
+ hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
+ hmm_pfns[i] |= pfn | cpu_flags;
+ }
+
+ return 0;
+ }
+
if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0))
return -EFAULT;
return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
--
2.43.0
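For context, a hedged sketch of the driver-side call this change affects
(values are illustrative; the mmu_interval_notifier retry sequencing that
hmm_range_fault() requires is elided):

unsigned long hmm_pfns[HPAGE_PMD_NR];
struct hmm_range range = {
	.notifier		= &notifier,	/* driver's mmu_interval_notifier */
	.start			= start,
	.end			= start + HPAGE_PMD_SIZE,
	.hmm_pfns		= hmm_pfns,
	.default_flags		= HMM_PFN_REQ_FAULT,
	.dev_private_owner	= my_owner,	/* must match pgmap->owner */
};
int ret;

mmap_read_lock(mm);
ret = hmm_range_fault(&range);
mmap_read_unlock(mm);
/* With this patch, a device-private PMD owned by "my_owner" is reported
 * via hmm_pfns[] (order from hmm_pfn_to_map_order()) instead of being
 * migrated back to system memory by the fault. */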
^ permalink raw reply related [flat|nested] 99+ messages in thread
* Re: [PATCH] mm/hmm: Do not fault in device private pages owned by the caller
2025-07-22 19:34 ` [PATCH] mm/hmm: Do not fault in device private pages owned by the caller Francois Dugast
@ 2025-07-22 20:07 ` Andrew Morton
2025-07-23 15:34 ` Francois Dugast
2025-07-24 0:25 ` Balbir Singh
2025-08-08 0:21 ` Matthew Brost
2 siblings, 1 reply; 99+ messages in thread
From: Andrew Morton @ 2025-07-22 20:07 UTC (permalink / raw)
To: Francois Dugast
Cc: balbirs, airlied, apopple, baohua, baolin.wang, dakr, david,
donettom, jane.chu, jglisse, kherbst, linux-kernel, linux-mm,
lyude, matthew.brost, peterx, ryan.roberts, shuah, simona,
wangkefeng.wang, willy, ziy
On Tue, 22 Jul 2025 21:34:45 +0200 Francois Dugast <francois.dugast@intel.com> wrote:
> When the PMD swap entry is device private and owned by the caller,
> skip the range faulting and instead just set the correct HMM PFNs.
> This is similar to the logic for PTEs in hmm_vma_handle_pte().
Please always tell us why a patch does something, not only what it does.
> For now, each hmm_pfns[i] entry is populated as it is currently done
> in hmm_vma_handle_pmd() but this might not be necessary. A follow-up
> optimization could be to make use of the order and skip populating
> subsequent PFNs.
I infer from this paragraph that this patch is a performance
optimization? Have its effects been measured?
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -355,6 +355,31 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
> }
>
> if (!pmd_present(pmd)) {
> + swp_entry_t entry = pmd_to_swp_entry(pmd);
> +
> + /*
> + * Don't fault in device private pages owned by the caller,
> + * just report the PFNs.
> + */
Similarly, this tells us "what" it does, which is fairly obvious from
the code itself. What is not obvious from the code is the "why".
> + if (is_device_private_entry(entry) &&
> + pfn_swap_entry_folio(entry)->pgmap->owner ==
> + range->dev_private_owner) {
> + unsigned long cpu_flags = HMM_PFN_VALID |
> + hmm_pfn_flags_order(PMD_SHIFT - PAGE_SHIFT);
> + unsigned long pfn = swp_offset_pfn(entry);
> + unsigned long i;
> +
> + if (is_writable_device_private_entry(entry))
> + cpu_flags |= HMM_PFN_WRITE;
> +
> + for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
> + hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
> + hmm_pfns[i] |= pfn | cpu_flags;
> + }
> +
> + return 0;
> + }
> +
> if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0))
> return -EFAULT;
> return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [v1 resend 08/12] mm/thp: add split during migration support
2025-07-18 15:06 ` Zi Yan
@ 2025-07-23 0:00 ` Matthew Brost
0 siblings, 0 replies; 99+ messages in thread
From: Matthew Brost @ 2025-07-23 0:00 UTC (permalink / raw)
To: Zi Yan
Cc: Balbir Singh, linux-mm, akpm, linux-kernel, Karol Herbst,
Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Jérôme Glisse, Shuah Khan, David Hildenbrand,
Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom
On Fri, Jul 18, 2025 at 11:06:09AM -0400, Zi Yan wrote:
> On 17 Jul 2025, at 23:33, Matthew Brost wrote:
>
> > On Thu, Jul 17, 2025 at 09:25:02PM -0400, Zi Yan wrote:
> >> On 17 Jul 2025, at 20:41, Matthew Brost wrote:
> >>
> >>> On Thu, Jul 17, 2025 at 07:04:48PM -0400, Zi Yan wrote:
> >>>> On 17 Jul 2025, at 18:24, Matthew Brost wrote:
> >>>>
> >>>>> On Thu, Jul 17, 2025 at 07:53:40AM +1000, Balbir Singh wrote:
> >>>>>> On 7/17/25 02:24, Matthew Brost wrote:
> >>>>>>> On Wed, Jul 16, 2025 at 07:19:10AM -0400, Zi Yan wrote:
> >>>>>>>> On 16 Jul 2025, at 1:34, Matthew Brost wrote:
> >>>>>>>>
> >>>>>>>>> On Sun, Jul 06, 2025 at 11:47:10AM +1000, Balbir Singh wrote:
> >>>>>>>>>> On 7/6/25 11:34, Zi Yan wrote:
> >>>>>>>>>>> On 5 Jul 2025, at 21:15, Balbir Singh wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> On 7/5/25 11:55, Zi Yan wrote:
> >>>>>>>>>>>>> On 4 Jul 2025, at 20:58, Balbir Singh wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> On 7/4/25 21:24, Zi Yan wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> s/pages/folio
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks, will make the changes
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Why name it isolated if the folio is unmapped? Isolated folios often mean
> >>>>>>>>>>>>>>> they are removed from LRU lists. isolated here causes confusion.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Ack, will change the name
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> *
> >>>>>>>>>>>>>>>> * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
> >>>>>>>>>>>>>>>> * It is in charge of checking whether the split is supported or not and
> >>>>>>>>>>>>>>>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
> >>>>>>>>>>>>>>>> */
> >>>>>>>>>>>>>>>> static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>>>>>>>>>>> struct page *split_at, struct page *lock_at,
> >>>>>>>>>>>>>>>> - struct list_head *list, bool uniform_split)
> >>>>>>>>>>>>>>>> + struct list_head *list, bool uniform_split, bool isolated)
> >>>>>>>>>>>>>>>> {
> >>>>>>>>>>>>>>>> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
> >>>>>>>>>>>>>>>> XA_STATE(xas, &folio->mapping->i_pages, folio->index);
> >>>>>>>>>>>>>>>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>>>>>>>>>>> * is taken to serialise against parallel split or collapse
> >>>>>>>>>>>>>>>> * operations.
> >>>>>>>>>>>>>>>> */
> >>>>>>>>>>>>>>>> - anon_vma = folio_get_anon_vma(folio);
> >>>>>>>>>>>>>>>> - if (!anon_vma) {
> >>>>>>>>>>>>>>>> - ret = -EBUSY;
> >>>>>>>>>>>>>>>> - goto out;
> >>>>>>>>>>>>>>>> + if (!isolated) {
> >>>>>>>>>>>>>>>> + anon_vma = folio_get_anon_vma(folio);
> >>>>>>>>>>>>>>>> + if (!anon_vma) {
> >>>>>>>>>>>>>>>> + ret = -EBUSY;
> >>>>>>>>>>>>>>>> + goto out;
> >>>>>>>>>>>>>>>> + }
> >>>>>>>>>>>>>>>> + anon_vma_lock_write(anon_vma);
> >>>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>> end = -1;
> >>>>>>>>>>>>>>>> mapping = NULL;
> >>>>>>>>>>>>>>>> - anon_vma_lock_write(anon_vma);
> >>>>>>>>>>>>>>>> } else {
> >>>>>>>>>>>>>>>> unsigned int min_order;
> >>>>>>>>>>>>>>>> gfp_t gfp;
> >>>>>>>>>>>>>>>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>>>>>>>>>>> goto out_unlock;
> >>>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> - unmap_folio(folio);
> >>>>>>>>>>>>>>>> + if (!isolated)
> >>>>>>>>>>>>>>>> + unmap_folio(folio);
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> /* block interrupt reentry in xa_lock and spinlock */
> >>>>>>>>>>>>>>>> local_irq_disable();
> >>>>>>>>>>>>>>>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> ret = __split_unmapped_folio(folio, new_order,
> >>>>>>>>>>>>>>>> split_at, lock_at, list, end, &xas, mapping,
> >>>>>>>>>>>>>>>> - uniform_split);
> >>>>>>>>>>>>>>>> + uniform_split, isolated);
> >>>>>>>>>>>>>>>> } else {
> >>>>>>>>>>>>>>>> spin_unlock(&ds_queue->split_queue_lock);
> >>>>>>>>>>>>>>>> fail:
> >>>>>>>>>>>>>>>> if (mapping)
> >>>>>>>>>>>>>>>> xas_unlock(&xas);
> >>>>>>>>>>>>>>>> local_irq_enable();
> >>>>>>>>>>>>>>>> - remap_page(folio, folio_nr_pages(folio), 0);
> >>>>>>>>>>>>>>>> + if (!isolated)
> >>>>>>>>>>>>>>>> + remap_page(folio, folio_nr_pages(folio), 0);
> >>>>>>>>>>>>>>>> ret = -EAGAIN;
> >>>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> These "isolated" special cases do not look good; I wonder if there
> >>>>>>>>>>>>>>> is a way of letting the split code handle device private folios more gracefully.
> >>>>>>>>>>>>>>> It also causes confusion, since why do "isolated/unmapped" folios
> >>>>>>>>>>>>>>> not need unmap_page(), remap_page(), or unlock?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> There are two reasons for going down the current code path
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> After thinking more, I think adding isolated/unmapped is not the right
> >>>>>>>>>>>>> way, since unmapped folio is a very generic concept. If you add it,
> >>>>>>>>>>>>> one can easily misuse the folio split code by first unmapping a folio
> >>>>>>>>>>>>> and trying to split it with unmapped = true. I do not think that is
> >>>>>>>>>>>>> supported and your patch does not prevent that from happening in the future.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> I don't understand the misuse case you mention, I assume you mean someone can
> >>>>>>>>>>>> get the usage wrong? The responsibility is on the caller to do the right thing
> >>>>>>>>>>>> if calling the API with unmapped
> >>>>>>>>>>>
> >>>>>>>>>>> Before your patch, there is no use case of splitting unmapped folios.
> >>>>>>>>>>> Your patch only adds support for device private page split, not any unmapped
> >>>>>>>>>>> folio split. So using a generic isolated/unmapped parameter is not OK.
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> There is a use for splitting unmapped folios (see below)
> >>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> You should teach different parts of folio split code path to handle
> >>>>>>>>>>>>> device private folios properly. Details are below.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 1. if the isolated check is not present, folio_get_anon_vma will fail and cause
> >>>>>>>>>>>>>> the split routine to return with -EBUSY
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> You do something below instead.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> if (!anon_vma && !folio_is_device_private(folio)) {
> >>>>>>>>>>>>> ret = -EBUSY;
> >>>>>>>>>>>>> goto out;
> >>>>>>>>>>>>> } else if (anon_vma) {
> >>>>>>>>>>>>> anon_vma_lock_write(anon_vma);
> >>>>>>>>>>>>> }
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> folio_get_anon_vma() cannot be called for unmapped folios. Is there a reason why you mix anon_vma_lock_write with
> >>>>>>>>>>>> the check for device private folios?
> >>>>>>>>>>>> the check for device private folios?
> >>>>>>>>>>>
> >>>>>>>>>>> Oh, I did not notice that anon_vma = folio_get_anon_vma(folio) is also
> >>>>>>>>>>> in if (!isolated) branch. In that case, just do
> >>>>>>>>>>>
> >>>>>>>>>>>>> if (folio_is_device_private(folio)) {
> >>>>>>>>>>> ...
> >>>>>>>>>>> } else if (is_anon) {
> >>>>>>>>>>> ...
> >>>>>>>>>>> } else {
> >>>>>>>>>>> ...
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> People can know device private folio split needs a special handling.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> BTW, why a device private folio can also be anonymous? Does it mean
> >>>>>>>>>>>>> if a page cache folio is migrated to device private, kernel also
> >>>>>>>>>>>>> sees it as both device private and file-backed?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> FYI: device private folios only work with anonymous private pages, hence
> >>>>>>>>>>>> the name device private.
> >>>>>>>>>>>
> >>>>>>>>>>> OK.
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> 2. Going through unmap_page(), remap_page() causes a full page table walk, which
> >>>>>>>>>>>>>> the migrate_device API has already just done as a part of the migration. The
> >>>>>>>>>>>>>> entries under consideration are already migration entries in this case.
> >>>>>>>>>>>>>> This is wasteful and in some cases unexpected.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> unmap_folio() already adds TTU_SPLIT_HUGE_PMD to try to split
> >>>>>>>>>>>>> PMD mapping, which you did in migrate_vma_split_pages(). You probably
> >>>>>>>>>>>>> can teach either try_to_migrate() or try_to_unmap() to just split
> >>>>>>>>>>>>> device private PMD mapping. Or if that is not preferred,
> >>>>>>>>>>>>> you can simply call split_huge_pmd_address() when unmap_folio()
> >>>>>>>>>>>>> sees a device private folio.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> For remap_page(), you can simply return for device private folios
> >>>>>>>>>>>>> like it is currently doing for non anonymous folios.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Doing a full rmap walk does not make sense with unmap_folio() and
> >>>>>>>>>>>> remap_folio(), because
> >>>>>>>>>>>>
> >>>>>>>>>>>> 1. We need to do a page table walk/rmap walk again
> >>>>>>>>>>>> 2. We'll need special handling of migration <-> migration entries
> >>>>>>>>>>>> in the rmap handling (set/remove migration ptes)
> >>>>>>>>>>>> 3. In this context, the code is already in the middle of migration,
> >>>>>>>>>>>> so trying to do that again does not make sense.
> >>>>>>>>>>>
> >>>>>>>>>>> Why doing split in the middle of migration? Existing split code
> >>>>>>>>>>> assumes to-be-split folios are mapped.
> >>>>>>>>>>>
> >>>>>>>>>>> What prevents doing split before migration?
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> The code does do a split prior to migration if THP selection fails
> >>>>>>>>>>
> >>>>>>>>>> Please see https://lore.kernel.org/lkml/20250703233511.2028395-5-balbirs@nvidia.com/
> >>>>>>>>>> and the fallback part which calls split_folio()
> >>>>>>>>>>
> >>>>>>>>>> But the case under consideration is special since the device needs to allocate
> >>>>>>>>>> corresponding pfn's as well. The changelog mentions it:
> >>>>>>>>>>
> >>>>>>>>>> "The common case that arises is that after setup, during migrate
> >>>>>>>>>> the destination might not be able to allocate MIGRATE_PFN_COMPOUND
> >>>>>>>>>> pages."
> >>>>>>>>>>
> >>>>>>>>>> I can expand on it, because migrate_vma() is a multi-phase operation
> >>>>>>>>>>
> >>>>>>>>>> 1. migrate_vma_setup()
> >>>>>>>>>> 2. migrate_vma_pages()
> >>>>>>>>>> 3. migrate_vma_finalize()
> >>>>>>>>>>
> >>>>>>>>>> It can so happen that when we get the destination pfn's allocated the destination
> >>>>>>>>>> might not be able to allocate a large page, so we do the split in migrate_vma_pages().
> >>>>>>>>>>
> >>>>>>>>>> The pages have been unmapped and collected in migrate_vma_setup()
> >>>>>>>>>>
> >>>>>>>>>> The next patch in the series 9/12 (https://lore.kernel.org/lkml/20250703233511.2028395-10-balbirs@nvidia.com/)
> >>>>>>>>>> tests the split and emulates a failure on the device side to allocate large pages
> >>>>>>>>>> and tests it in 10/12 (https://lore.kernel.org/lkml/20250703233511.2028395-11-balbirs@nvidia.com/)
> >>>>>>>>>>
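For reference, a minimal sketch of the three-phase flow described above, assuming a
hypothetical driver: migrate_vma_setup()/migrate_vma_pages()/migrate_vma_finalize()
and MIGRATE_VMA_SELECT_DEVICE_PRIVATE are the existing API, while drv_alloc_and_copy()
and drv_owner are made up for illustration.

	struct migrate_vma args = {
		.vma		= vma,
		.start		= start,
		.end		= end,
		.src		= src_pfns,
		.dst		= dst_pfns,
		.pgmap_owner	= drv_owner,	/* made-up owner cookie */
		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
	};

	if (migrate_vma_setup(&args))	/* 1. collect and unmap the source pages */
		return -EBUSY;

	/* made up: allocate destination pages (a large page may not be
	 * available) and copy the data over
	 */
	drv_alloc_and_copy(&args);

	migrate_vma_pages(&args);	/* 2. the split discussed above happens here
					 *    when the dst entry lacks MIGRATE_PFN_COMPOUND
					 */
	migrate_vma_finalize(&args);	/* 3. remap and release references */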
> >>>>>>>>>
> >>>>>>>>> Another use case I’ve seen is when a previously allocated high-order
> >>>>>>>>> folio, now in the free memory pool, is reallocated as a lower-order
> >>>>>>>>> page. For example, a 2MB fault allocates a folio, the memory is later
> >>>>>>>>
> >>>>>>>> That is different. If the high-order folio is free, it should be split
> >>>>>>>> using split_page() from mm/page_alloc.c.
> >>>>>>>>
> >>>>>>>
> >>>>>>> Ah, ok. Let me see if that works - it would be easier.
> >>>>>>>
> >>>>>
> >>>>> This suggestion quickly blows up as PageCompound is true and page_count
> >>>>> here is zero.
> >>>>
> >>>> OK, your folio has PageCompound set. Then you will need __split_unmapped_folio().
> >>>>
> >>>>>
> >>>>>>>>> freed, and then a 4KB fault reuses a page from that previously allocated
> >>>>>>>>> folio. This will be actually quite common in Xe / GPU SVM. In such
> >>>>>>>>> cases, the folio in an unmapped state needs to be split. I’d suggest a
> >>>>>>>>
> >>>>>>>> This folio is unused, so ->flags, ->mapping, etc. are not set;
> >>>>>>>> __split_unmapped_folio() is not for it, unless you mean free folio
> >>>>>>>> differently.
> >>>>>>>>
> >>>>>>>
> >>>>>>> This is right, those fields should be clear.
> >>>>>>>
> >>>>>>> Thanks for the tip.
> >>>>>>>
> >>>>>> I was hoping to reuse __split_folio_to_order() at some point in the future
> >>>>>> to split the backing pages in the driver, but it is not an immediate priority
> >>>>>>
> >>>>>
> >>>>> I think we need something for the scenario I describe here. I was able to
> >>>>> make __split_huge_page_to_list_to_order work with a couple of hacks but it is
> >>>>> almost certainly not right, as Zi pointed out.
> >>>>>
> >>>>> I'm new to the MM stuff, but I'll play around with this a bit and see if I can
> >>>>> come up with something that will work here.
> >>>>
> >>>> Can you try to write a new split_page function with __split_unmapped_folio()?
> >>>> Since based on your description, your folio is not mapped.
> >>>>
> >>>
> >>> Yes, page->mapping is NULL in this case - that was part of the hacks to
> >>> __split_huge_page_to_list_to_order (more specifically __folio_split) I had
> >>> to make in order to get something working for this case.
> >>>
> >>> I can try out something based on __split_unmapped_folio and report back.
> >>
> >> mm-new tree has an updated __split_unmapped_folio() version; it moves
> >> all unmap-irrelevant code out of __split_unmapped_folio(). You might find
> >> it easier to reuse.
> >>
> >> See: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/tree/mm/huge_memory.c?h=mm-new#n3430
> >>
I pulled in the new version and it works for this case.
Matt
> >
> > Will take a look. It is possible some of the issues we are hitting are
> > due to working on drm-tip + pulling in core MM patches in this series on
> > top of that branch then missing some other patches in mm-new. I'll see
> > if we can figure out a workflow to have the latest and greatest from
> > both drm-tip and the MM branches.
> >
> > Will these changes be in 6.17?
>
> Hopefully yes. mm patches usually go from mm-new to mm-unstable
> to mm-stable to mainline. If not, we will figure it out. :)
>
> >
> >> I am about to update the code with v4 patches. I will cc you, so that
> >> you can get the updated __split_unmapped_folio().
> >>
> >> Feel free to ask questions on folio split code.
> >>
> >
> > Thanks.
> >
> > Matt
> >
> >> Best Regards,
> >> Yan, Zi
>
>
> Best Regards,
> Yan, Zi
* Re: [PATCH] mm/hmm: Do not fault in device private pages owned by the caller
2025-07-22 20:07 ` Andrew Morton
@ 2025-07-23 15:34 ` Francois Dugast
2025-07-23 18:05 ` Matthew Brost
0 siblings, 1 reply; 99+ messages in thread
From: Francois Dugast @ 2025-07-23 15:34 UTC (permalink / raw)
To: Andrew Morton
Cc: balbirs, airlied, apopple, baohua, baolin.wang, dakr, david,
donettom, jane.chu, jglisse, kherbst, linux-kernel, linux-mm,
lyude, matthew.brost, peterx, ryan.roberts, shuah, simona,
wangkefeng.wang, willy, ziy
On Tue, Jul 22, 2025 at 01:07:21PM -0700, Andrew Morton wrote:
> On Tue, 22 Jul 2025 21:34:45 +0200 Francois Dugast <francois.dugast@intel.com> wrote:
>
> > When the PMD swap entry is device private and owned by the caller,
> > skip the range faulting and instead just set the correct HMM PFNs.
> > This is similar to the logic for PTEs in hmm_vma_handle_pte().
>
> Please always tell us why a patch does something, not only what it does.
Sure, let me improve this in the next version.
>
> > For now, each hmm_pfns[i] entry is populated as it is currently done
> > in hmm_vma_handle_pmd() but this might not be necessary. A follow-up
> > optimization could be to make use of the order and skip populating
> > subsequent PFNs.
>
> I infer from this paragraph that this patch is a performance
> optimization? Have its effects been measured?
Yes, this performance optimization would come from avoiding the loop
over the range but it has neither been properly tested nor measured
yet.
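For reference, the follow-up being discussed would be roughly the following inside the
new branch (a sketch only, not what this patch does), relying on the order already
encoded in cpu_flags; callers that expect every entry to be populated would need to be
audited first (see the later discussion in this thread):

	hmm_pfns[0] &= HMM_PFN_INOUT_FLAGS;
	hmm_pfns[0] |= swp_offset_pfn(entry) | cpu_flags;

	return 0;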
>
> > --- a/mm/hmm.c
> > +++ b/mm/hmm.c
> > @@ -355,6 +355,31 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
> > }
> >
> > if (!pmd_present(pmd)) {
> > + swp_entry_t entry = pmd_to_swp_entry(pmd);
> > +
> > + /*
> > + * Don't fault in device private pages owned by the caller,
> > + * just report the PFNs.
> > + */
>
> Similarly, this tells us "what" it does, which is fairly obvious from
> the code itself. What is not obvious from the code is the "why".
Indeed, will fix.
>
> > + if (is_device_private_entry(entry) &&
> > + pfn_swap_entry_folio(entry)->pgmap->owner ==
> > + range->dev_private_owner) {
> > + unsigned long cpu_flags = HMM_PFN_VALID |
> > + hmm_pfn_flags_order(PMD_SHIFT - PAGE_SHIFT);
> > + unsigned long pfn = swp_offset_pfn(entry);
> > + unsigned long i;
> > +
> > + if (is_writable_device_private_entry(entry))
> > + cpu_flags |= HMM_PFN_WRITE;
> > +
> > + for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
> > + hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
> > + hmm_pfns[i] |= pfn | cpu_flags;
> > + }
> > +
> > + return 0;
> > + }
> > +
> > if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0))
> > return -EFAULT;
> > return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
>
* Re: [PATCH] mm/hmm: Do not fault in device private pages owned by the caller
2025-07-23 15:34 ` Francois Dugast
@ 2025-07-23 18:05 ` Matthew Brost
0 siblings, 0 replies; 99+ messages in thread
From: Matthew Brost @ 2025-07-23 18:05 UTC (permalink / raw)
To: Francois Dugast
Cc: Andrew Morton, balbirs, airlied, apopple, baohua, baolin.wang,
dakr, david, donettom, jane.chu, jglisse, kherbst, linux-kernel,
linux-mm, lyude, peterx, ryan.roberts, shuah, simona,
wangkefeng.wang, willy, ziy
On Wed, Jul 23, 2025 at 05:34:17PM +0200, Francois Dugast wrote:
> On Tue, Jul 22, 2025 at 01:07:21PM -0700, Andrew Morton wrote:
> > On Tue, 22 Jul 2025 21:34:45 +0200 Francois Dugast <francois.dugast@intel.com> wrote:
> >
> > > When the PMD swap entry is device private and owned by the caller,
> > > skip the range faulting and instead just set the correct HMM PFNs.
> > > This is similar to the logic for PTEs in hmm_vma_handle_pte().
> >
> > Please always tell us why a patch does something, not only what it does.
>
> Sure, let me improve this in the next version.
>
> >
> > > For now, each hmm_pfns[i] entry is populated as it is currently done
> > > in hmm_vma_handle_pmd() but this might not be necessary. A follow-up
> > > optimization could be to make use of the order and skip populating
> > > subsequent PFNs.
> >
> > I infer from this paragraph that this patch is a performance
> > optimization? Have its effects been measured?
>
> Yes, this performance optimization would come from avoiding the loop
> over the range but it has neither been properly tested nor measured
> yet.
>
This is also a functional change. Once THP device pages are enabled (for
performance), we will encounter device-private swap entries in PMDs. At
that point, the correct behavior is to populate HMM PFNs from the swap
entry when dev_private_owner matches; otherwise, trigger a fault if the
HMM range-walk input requests one, or skip it in the non-faulting case.
It’s harmless to merge this patch before THP device pages are enabled,
since with the current code base we never find device-private swap
entries in PMDs.
I'd include something like the above explanation in the patch commit
message, or in code comments if needed.
Matt
> >
> > > --- a/mm/hmm.c
> > > +++ b/mm/hmm.c
> > > @@ -355,6 +355,31 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
> > > }
> > >
> > > if (!pmd_present(pmd)) {
> > > + swp_entry_t entry = pmd_to_swp_entry(pmd);
> > > +
> > > + /*
> > > + * Don't fault in device private pages owned by the caller,
> > > + * just report the PFNs.
> > > + */
> >
> > Similarly, this tells us "what" it does, which is fairly obvious from
> > the code itself. What is not obvious from the code is the "why".
>
> Indeed, will fix.
>
> >
> > > + if (is_device_private_entry(entry) &&
> > > + pfn_swap_entry_folio(entry)->pgmap->owner ==
> > > + range->dev_private_owner) {
> > > + unsigned long cpu_flags = HMM_PFN_VALID |
> > > + hmm_pfn_flags_order(PMD_SHIFT - PAGE_SHIFT);
> > > + unsigned long pfn = swp_offset_pfn(entry);
> > > + unsigned long i;
> > > +
> > > + if (is_writable_device_private_entry(entry))
> > > + cpu_flags |= HMM_PFN_WRITE;
> > > +
> > > + for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
> > > + hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
> > > + hmm_pfns[i] |= pfn | cpu_flags;
> > > + }
> > > +
> > > + return 0;
> > > + }
> > > +
> > > if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0))
> > > return -EFAULT;
> > > return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
> >
* Re: [PATCH] mm/hmm: Do not fault in device private pages owned by the caller
2025-07-22 19:34 ` [PATCH] mm/hmm: Do not fault in device private pages owned by the caller Francois Dugast
2025-07-22 20:07 ` Andrew Morton
@ 2025-07-24 0:25 ` Balbir Singh
2025-07-24 5:02 ` Matthew Brost
2025-08-08 0:21 ` Matthew Brost
2 siblings, 1 reply; 99+ messages in thread
From: Balbir Singh @ 2025-07-24 0:25 UTC (permalink / raw)
To: Francois Dugast
Cc: airlied, akpm, apopple, baohua, baolin.wang, dakr, david,
donettom, jane.chu, jglisse, kherbst, linux-kernel, linux-mm,
lyude, matthew.brost, peterx, ryan.roberts, shuah, simona,
wangkefeng.wang, willy, ziy
On 7/23/25 05:34, Francois Dugast wrote:
> When the PMD swap entry is device private and owned by the caller,
> skip the range faulting and instead just set the correct HMM PFNs.
> This is similar to the logic for PTEs in hmm_vma_handle_pte().
>
> For now, each hmm_pfns[i] entry is populated as it is currently done
> in hmm_vma_handle_pmd() but this might not be necessary. A follow-up
> optimization could be to make use of the order and skip populating
> subsequent PFNs.
I think we should test and remove these now
>
> Signed-off-by: Francois Dugast <francois.dugast@intel.com>
> ---
> mm/hmm.c | 25 +++++++++++++++++++++++++
> 1 file changed, 25 insertions(+)
>
> diff --git a/mm/hmm.c b/mm/hmm.c
> index f2415b4b2cdd..63ec1b18a656 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -355,6 +355,31 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
> }
>
> if (!pmd_present(pmd)) {
> + swp_entry_t entry = pmd_to_swp_entry(pmd);
> +
> + /*
> + * Don't fault in device private pages owned by the caller,
> + * just report the PFNs.
> + */
> + if (is_device_private_entry(entry) &&
> + pfn_swap_entry_folio(entry)->pgmap->owner ==
> + range->dev_private_owner) {
> + unsigned long cpu_flags = HMM_PFN_VALID |
> + hmm_pfn_flags_order(PMD_SHIFT - PAGE_SHIFT);
> + unsigned long pfn = swp_offset_pfn(entry);
> + unsigned long i;
> +
> + if (is_writable_device_private_entry(entry))
> + cpu_flags |= HMM_PFN_WRITE;
> +
> + for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
> + hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
> + hmm_pfns[i] |= pfn | cpu_flags;
> + }
> +
As discussed, can we remove these.
> + return 0;
> + }
All of this should be under CONFIG_ARCH_ENABLE_THP_MIGRATION
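Something like the following, as a sketch (whether an #ifdef or IS_ENABLED() is used is
a style choice):

#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
	if (is_device_private_entry(entry) &&
	    pfn_swap_entry_folio(entry)->pgmap->owner ==
				range->dev_private_owner) {
		/* ... populate hmm_pfns[] as in the hunk above ... */
		return 0;
	}
#endif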
> +
> if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0))
> return -EFAULT;
> return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
Balbir Singh
* Re: [PATCH] mm/hmm: Do not fault in device private pages owned by the caller
2025-07-24 0:25 ` Balbir Singh
@ 2025-07-24 5:02 ` Matthew Brost
2025-07-24 5:46 ` Mika Penttilä
2025-07-28 13:34 ` Jason Gunthorpe
0 siblings, 2 replies; 99+ messages in thread
From: Matthew Brost @ 2025-07-24 5:02 UTC (permalink / raw)
To: Balbir Singh, jgg, leonro
Cc: Francois Dugast, airlied, akpm, apopple, baohua, baolin.wang,
dakr, david, donettom, jane.chu, jglisse, kherbst, linux-kernel,
linux-mm, lyude, peterx, ryan.roberts, shuah, simona,
wangkefeng.wang, willy, ziy
On Thu, Jul 24, 2025 at 10:25:11AM +1000, Balbir Singh wrote:
> On 7/23/25 05:34, Francois Dugast wrote:
> > When the PMD swap entry is device private and owned by the caller,
> > skip the range faulting and instead just set the correct HMM PFNs.
> > This is similar to the logic for PTEs in hmm_vma_handle_pte().
> >
> > For now, each hmm_pfns[i] entry is populated as it is currently done
> > in hmm_vma_handle_pmd() but this might not be necessary. A follow-up
> > optimization could be to make use of the order and skip populating
> > subsequent PFNs.
>
> I think we should test and remove these now
>
+Jason, Leon – perhaps either of you can provide insight into why
hmm_vma_handle_pmd fully populates the HMM PFNs when a higher-order page
is found.
If we can be assured that changing this won’t break other parts of the
kernel, I agree it should be removed. A snippet of documentation should
also be added indicating that when higher-order PFNs are found,
subsequent PFNs within the range will remain unpopulated. I can verify
that GPU SVM works just fine without these PFNs being populated.
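A possible wording for that snippet, as a sketch of the proposed (not current)
behaviour; where it would live, e.g. next to hmm_pfn_to_map_order() in
include/linux/hmm.h, is an assumption:

/*
 * When a mapping larger than PAGE_SIZE is found, only the first entry of the
 * mapping is populated in range->hmm_pfns[]. Callers must use
 * hmm_pfn_to_map_order() to derive the remaining PFNs, which are left
 * untouched.
 */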
Matt
> >
> > Signed-off-by: Francois Dugast <francois.dugast@intel.com>
> > ---
> > mm/hmm.c | 25 +++++++++++++++++++++++++
> > 1 file changed, 25 insertions(+)
> >
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > index f2415b4b2cdd..63ec1b18a656 100644
> > --- a/mm/hmm.c
> > +++ b/mm/hmm.c
> > @@ -355,6 +355,31 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
> > }
> >
> > if (!pmd_present(pmd)) {
> > + swp_entry_t entry = pmd_to_swp_entry(pmd);
> > +
> > + /*
> > + * Don't fault in device private pages owned by the caller,
> > + * just report the PFNs.
> > + */
> > + if (is_device_private_entry(entry) &&
> > + pfn_swap_entry_folio(entry)->pgmap->owner ==
> > + range->dev_private_owner) {
> > + unsigned long cpu_flags = HMM_PFN_VALID |
> > + hmm_pfn_flags_order(PMD_SHIFT - PAGE_SHIFT);
> > + unsigned long pfn = swp_offset_pfn(entry);
> > + unsigned long i;
> > +
> > + if (is_writable_device_private_entry(entry))
> > + cpu_flags |= HMM_PFN_WRITE;
> > +
> > + for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
> > + hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
> > + hmm_pfns[i] |= pfn | cpu_flags;
> > + }
> > +
>
> As discussed, can we remove these.
>
> > + return 0;
> > + }
>
> All of this be under CONFIG_ARCH_ENABLE_THP_MIGRATION
>
> > +
> > if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0))
> > return -EFAULT;
> > return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
>
>
>
> Balbir Singh
* Re: [PATCH] mm/hmm: Do not fault in device private pages owned by the caller
2025-07-24 5:02 ` Matthew Brost
@ 2025-07-24 5:46 ` Mika Penttilä
2025-07-24 5:57 ` Matthew Brost
2025-07-28 13:34 ` Jason Gunthorpe
1 sibling, 1 reply; 99+ messages in thread
From: Mika Penttilä @ 2025-07-24 5:46 UTC (permalink / raw)
To: Matthew Brost
Cc: Francois Dugast, airlied, akpm, apopple, baohua, baolin.wang,
dakr, david, donettom, jane.chu, jglisse, kherbst, linux-kernel,
linux-mm, lyude, peterx, ryan.roberts, shuah, simona,
wangkefeng.wang, willy, ziy, Balbir Singh, jgg, Leon Romanovsky
On 7/24/25 08:02, Matthew Brost wrote:
> On Thu, Jul 24, 2025 at 10:25:11AM +1000, Balbir Singh wrote:
>> On 7/23/25 05:34, Francois Dugast wrote:
>>> When the PMD swap entry is device private and owned by the caller,
>>> skip the range faulting and instead just set the correct HMM PFNs.
>>> This is similar to the logic for PTEs in hmm_vma_handle_pte().
>>>
>>> For now, each hmm_pfns[i] entry is populated as it is currently done
>>> in hmm_vma_handle_pmd() but this might not be necessary. A follow-up
>>> optimization could be to make use of the order and skip populating
>>> subsequent PFNs.
>> I think we should test and remove these now
>>
> +Jason, Leon – perhaps either of you can provide insight into why
> hmm_vma_handle_pmd fully populates the HMM PFNs when a higher-order page
> is found.
>
> If we can be assured that changing this won’t break other parts of the
> kernel, I agree it should be removed. A snippet of documentation should
> also be added indicating that when higher-order PFNs are found,
> subsequent PFNs within the range will remain unpopulated. I can verify
> that GPU SVM works just fine without these PFNs being populated.
afaics the device can consume the range as smaller pages also, and some
hmm users depend on that.
> Matt
--Mika
>
>>> Signed-off-by: Francois Dugast <francois.dugast@intel.com>
>>> ---
>>> mm/hmm.c | 25 +++++++++++++++++++++++++
>>> 1 file changed, 25 insertions(+)
>>>
>>> diff --git a/mm/hmm.c b/mm/hmm.c
>>> index f2415b4b2cdd..63ec1b18a656 100644
>>> --- a/mm/hmm.c
>>> +++ b/mm/hmm.c
>>> @@ -355,6 +355,31 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
>>> }
>>>
>>> if (!pmd_present(pmd)) {
>>> + swp_entry_t entry = pmd_to_swp_entry(pmd);
>>> +
>>> + /*
>>> + * Don't fault in device private pages owned by the caller,
>>> + * just report the PFNs.
>>> + */
>>> + if (is_device_private_entry(entry) &&
>>> + pfn_swap_entry_folio(entry)->pgmap->owner ==
>>> + range->dev_private_owner) {
>>> + unsigned long cpu_flags = HMM_PFN_VALID |
>>> + hmm_pfn_flags_order(PMD_SHIFT - PAGE_SHIFT);
>>> + unsigned long pfn = swp_offset_pfn(entry);
>>> + unsigned long i;
>>> +
>>> + if (is_writable_device_private_entry(entry))
>>> + cpu_flags |= HMM_PFN_WRITE;
>>> +
>>> + for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
>>> + hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
>>> + hmm_pfns[i] |= pfn | cpu_flags;
>>> + }
>>> +
>> As discussed, can we remove these.
>>
>>> + return 0;
>>> + }
>> All of this be under CONFIG_ARCH_ENABLE_THP_MIGRATION
>>
>>> +
>>> if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0))
>>> return -EFAULT;
>>> return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
>>
>>
>> Balbir Singh
* Re: [PATCH] mm/hmm: Do not fault in device private pages owned by the caller
2025-07-24 5:46 ` Mika Penttilä
@ 2025-07-24 5:57 ` Matthew Brost
2025-07-24 6:04 ` Mika Penttilä
0 siblings, 1 reply; 99+ messages in thread
From: Matthew Brost @ 2025-07-24 5:57 UTC (permalink / raw)
To: Mika Penttilä
Cc: Francois Dugast, airlied, akpm, apopple, baohua, baolin.wang,
dakr, david, donettom, jane.chu, jglisse, kherbst, linux-kernel,
linux-mm, lyude, peterx, ryan.roberts, shuah, simona,
wangkefeng.wang, willy, ziy, Balbir Singh, jgg, Leon Romanovsky
On Thu, Jul 24, 2025 at 08:46:11AM +0300, Mika Penttilä wrote:
>
> On 7/24/25 08:02, Matthew Brost wrote:
> > On Thu, Jul 24, 2025 at 10:25:11AM +1000, Balbir Singh wrote:
> >> On 7/23/25 05:34, Francois Dugast wrote:
> >>> When the PMD swap entry is device private and owned by the caller,
> >>> skip the range faulting and instead just set the correct HMM PFNs.
> >>> This is similar to the logic for PTEs in hmm_vma_handle_pte().
> >>>
> >>> For now, each hmm_pfns[i] entry is populated as it is currently done
> >>> in hmm_vma_handle_pmd() but this might not be necessary. A follow-up
> >>> optimization could be to make use of the order and skip populating
> >>> subsequent PFNs.
> >> I think we should test and remove these now
> >>
> > +Jason, Leon – perhaps either of you can provide insight into why
> > hmm_vma_handle_pmd fully populates the HMM PFNs when a higher-order page
> > is found.
> >
> > If we can be assured that changing this won’t break other parts of the
> > kernel, I agree it should be removed. A snippet of documentation should
> > also be added indicating that when higher-order PFNs are found,
> > subsequent PFNs within the range will remain unpopulated. I can verify
> > that GPU SVM works just fine without these PFNs being populated.
>
> afaics the device can consume the range as smaller pages also, and some
> hmm users depend on that.
>
Sure, but I think that should be fixed in the device code. If a
large-order PFN is found, the subsequent PFNs can clearly be inferred.
It's a micro-optimization here, but devices or callers capable of
handling this properly shouldn't force a hacky, less optimal behavior on
core code. If anything relies on the current behavior, we should fix it
and ensure correctness.
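For illustration, the inference on the consumer side could look roughly like this;
hmm_pfn_to_map_order(), HMM_PFN_FLAGS and HMM_PFN_WRITE are existing helpers/flags,
while idx and consume_pfn() are made up:

	unsigned long head = range->hmm_pfns[idx];	/* first entry of the huge mapping */
	unsigned long pfn = head & ~HMM_PFN_FLAGS;
	unsigned int order = hmm_pfn_to_map_order(head);
	unsigned long i;

	for (i = 0; i < (1UL << order); i++)
		consume_pfn(pfn + i, head & HMM_PFN_WRITE);	/* made-up consumer hook */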
Matt
>
> > Matt
>
>
> --Mika
>
>
> >
> >>> Signed-off-by: Francois Dugast <francois.dugast@intel.com>
> >>> ---
> >>> mm/hmm.c | 25 +++++++++++++++++++++++++
> >>> 1 file changed, 25 insertions(+)
> >>>
> >>> diff --git a/mm/hmm.c b/mm/hmm.c
> >>> index f2415b4b2cdd..63ec1b18a656 100644
> >>> --- a/mm/hmm.c
> >>> +++ b/mm/hmm.c
> >>> @@ -355,6 +355,31 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
> >>> }
> >>>
> >>> if (!pmd_present(pmd)) {
> >>> + swp_entry_t entry = pmd_to_swp_entry(pmd);
> >>> +
> >>> + /*
> >>> + * Don't fault in device private pages owned by the caller,
> >>> + * just report the PFNs.
> >>> + */
> >>> + if (is_device_private_entry(entry) &&
> >>> + pfn_swap_entry_folio(entry)->pgmap->owner ==
> >>> + range->dev_private_owner) {
> >>> + unsigned long cpu_flags = HMM_PFN_VALID |
> >>> + hmm_pfn_flags_order(PMD_SHIFT - PAGE_SHIFT);
> >>> + unsigned long pfn = swp_offset_pfn(entry);
> >>> + unsigned long i;
> >>> +
> >>> + if (is_writable_device_private_entry(entry))
> >>> + cpu_flags |= HMM_PFN_WRITE;
> >>> +
> >>> + for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
> >>> + hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
> >>> + hmm_pfns[i] |= pfn | cpu_flags;
> >>> + }
> >>> +
> >> As discussed, can we remove these.
> >>
> >>> + return 0;
> >>> + }
> >> All of this be under CONFIG_ARCH_ENABLE_THP_MIGRATION
> >>
> >>> +
> >>> if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0))
> >>> return -EFAULT;
> >>> return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
> >>
> >>
> >> Balbir Singh
>
* Re: [PATCH] mm/hmm: Do not fault in device private pages owned by the caller
2025-07-24 5:57 ` Matthew Brost
@ 2025-07-24 6:04 ` Mika Penttilä
2025-07-24 6:47 ` Leon Romanovsky
0 siblings, 1 reply; 99+ messages in thread
From: Mika Penttilä @ 2025-07-24 6:04 UTC (permalink / raw)
To: Matthew Brost
Cc: Francois Dugast, airlied, akpm, apopple, baohua, baolin.wang,
dakr, david, donettom, jane.chu, jglisse, kherbst, linux-kernel,
linux-mm, lyude, peterx, ryan.roberts, shuah, simona,
wangkefeng.wang, willy, ziy, Balbir Singh, jgg, Leon Romanovsky
On 7/24/25 08:57, Matthew Brost wrote:
> On Thu, Jul 24, 2025 at 08:46:11AM +0300, Mika Penttilä wrote:
>> On 7/24/25 08:02, Matthew Brost wrote:
>>> On Thu, Jul 24, 2025 at 10:25:11AM +1000, Balbir Singh wrote:
>>>> On 7/23/25 05:34, Francois Dugast wrote:
>>>>> When the PMD swap entry is device private and owned by the caller,
>>>>> skip the range faulting and instead just set the correct HMM PFNs.
>>>>> This is similar to the logic for PTEs in hmm_vma_handle_pte().
>>>>>
>>>>> For now, each hmm_pfns[i] entry is populated as it is currently done
>>>>> in hmm_vma_handle_pmd() but this might not be necessary. A follow-up
>>>>> optimization could be to make use of the order and skip populating
>>>>> subsequent PFNs.
>>>> I think we should test and remove these now
>>>>
>>> +Jason, Leon – perhaps either of you can provide insight into why
>>> hmm_vma_handle_pmd fully populates the HMM PFNs when a higher-order page
>>> is found.
>>>
>>> If we can be assured that changing this won’t break other parts of the
>>> kernel, I agree it should be removed. A snippet of documentation should
>>> also be added indicating that when higher-order PFNs are found,
>>> subsequent PFNs within the range will remain unpopulated. I can verify
>>> that GPU SVM works just fine without these PFNs being populated.
>> afaics the device can consume the range as smaller pages also, and some
>> hmm users depend on that.
>>
> Sure, but I think that should be fixed in the device code. If a
> large-order PFN is found, the subsequent PFNs can clearly be inferred.
> It's a micro-optimization here, but devices or callers capable of
> handling this properly shouldn't force a hacky, less optimal behavior on
> core code. If anything relies on the current behavior, we should fix it
> and ensure correctness.
Yes, sure, device code can be changed, but I meant to say we can't just
delete those lines without breaking existing users.
>
> Matt
--Mika
>
>>> Matt
>>
>> --Mika
>>
>>
>>>>> Signed-off-by: Francois Dugast <francois.dugast@intel.com>
>>>>> ---
>>>>> mm/hmm.c | 25 +++++++++++++++++++++++++
>>>>> 1 file changed, 25 insertions(+)
>>>>>
>>>>> diff --git a/mm/hmm.c b/mm/hmm.c
>>>>> index f2415b4b2cdd..63ec1b18a656 100644
>>>>> --- a/mm/hmm.c
>>>>> +++ b/mm/hmm.c
>>>>> @@ -355,6 +355,31 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
>>>>> }
>>>>>
>>>>> if (!pmd_present(pmd)) {
>>>>> + swp_entry_t entry = pmd_to_swp_entry(pmd);
>>>>> +
>>>>> + /*
>>>>> + * Don't fault in device private pages owned by the caller,
>>>>> + * just report the PFNs.
>>>>> + */
>>>>> + if (is_device_private_entry(entry) &&
>>>>> + pfn_swap_entry_folio(entry)->pgmap->owner ==
>>>>> + range->dev_private_owner) {
>>>>> + unsigned long cpu_flags = HMM_PFN_VALID |
>>>>> + hmm_pfn_flags_order(PMD_SHIFT - PAGE_SHIFT);
>>>>> + unsigned long pfn = swp_offset_pfn(entry);
>>>>> + unsigned long i;
>>>>> +
>>>>> + if (is_writable_device_private_entry(entry))
>>>>> + cpu_flags |= HMM_PFN_WRITE;
>>>>> +
>>>>> + for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
>>>>> + hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
>>>>> + hmm_pfns[i] |= pfn | cpu_flags;
>>>>> + }
>>>>> +
>>>> As discussed, can we remove these.
>>>>
>>>>> + return 0;
>>>>> + }
>>>> All of this be under CONFIG_ARCH_ENABLE_THP_MIGRATION
>>>>
>>>>> +
>>>>> if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0))
>>>>> return -EFAULT;
>>>>> return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
>>>>
>>>> Balbir Singh
* Re: [PATCH] mm/hmm: Do not fault in device private pages owned by the caller
2025-07-24 6:04 ` Mika Penttilä
@ 2025-07-24 6:47 ` Leon Romanovsky
0 siblings, 0 replies; 99+ messages in thread
From: Leon Romanovsky @ 2025-07-24 6:47 UTC (permalink / raw)
To: Matthew Brost, Mika Penttilä
Cc: Francois Dugast, airlied, akpm, apopple, baohua, baolin.wang,
dakr, david, donettom, jane.chu, jglisse, kherbst, linux-kernel,
linux-mm, lyude, peterx, ryan.roberts, shuah, simona,
wangkefeng.wang, willy, ziy, Balbir Singh, jgg
On Thu, Jul 24, 2025 at 09:04:36AM +0300, Mika Penttilä wrote:
>
> On 7/24/25 08:57, Matthew Brost wrote:
> > On Thu, Jul 24, 2025 at 08:46:11AM +0300, Mika Penttilä wrote:
> >> On 7/24/25 08:02, Matthew Brost wrote:
> >>> On Thu, Jul 24, 2025 at 10:25:11AM +1000, Balbir Singh wrote:
> >>>> On 7/23/25 05:34, Francois Dugast wrote:
> >>>>> When the PMD swap entry is device private and owned by the caller,
> >>>>> skip the range faulting and instead just set the correct HMM PFNs.
> >>>>> This is similar to the logic for PTEs in hmm_vma_handle_pte().
> >>>>>
> >>>>> For now, each hmm_pfns[i] entry is populated as it is currently done
> >>>>> in hmm_vma_handle_pmd() but this might not be necessary. A follow-up
> >>>>> optimization could be to make use of the order and skip populating
> >>>>> subsequent PFNs.
> >>>> I think we should test and remove these now
> >>>>
> >>> +Jason, Leon – perhaps either of you can provide insight into why
> >>> hmm_vma_handle_pmd fully populates the HMM PFNs when a higher-order page
> >>> is found.
> >>>
> >>> If we can be assured that changing this won’t break other parts of the
> >>> kernel, I agree it should be removed. A snippet of documentation should
> >>> also be added indicating that when higher-order PFNs are found,
> >>> subsequent PFNs within the range will remain unpopulated. I can verify
> >>> that GPU SVM works just fine without these PFNs being populated.
> >> afaics the device can consume the range as smaller pages also, and some
> >> hmm users depend on that.
> >>
> > Sure, but I think that should be fixed in the device code. If a
> > large-order PFN is found, the subsequent PFNs can clearly be inferred.
> > It's a micro-optimization here, but devices or callers capable of
> > handling this properly shouldn't force a hacky, less optimal behavior on
> > core code. If anything relies on the current behavior, we should fix it
> > and ensure correctness.
>
> Yes sure device code can be changed but meant to say we can't just
> delete those lines without breaking existing users.
Mika is right. The RDMA subsystem and the HMM users there need to be updated.
We have a special flag (IB_ACCESS_HUGETLB) that prepares the whole RDMA stack
to handle large order PFNs. If this flag is not provided, we need to
fall back to the basic device page size (4k), and for that we expect a fully
populated PFN list.
Thanks
* Re: [PATCH] mm/hmm: Do not fault in device private pages owned by the caller
2025-07-24 5:02 ` Matthew Brost
2025-07-24 5:46 ` Mika Penttilä
@ 2025-07-28 13:34 ` Jason Gunthorpe
1 sibling, 0 replies; 99+ messages in thread
From: Jason Gunthorpe @ 2025-07-28 13:34 UTC (permalink / raw)
To: Matthew Brost
Cc: Balbir Singh, leonro, Francois Dugast, airlied, akpm, apopple,
baohua, baolin.wang, dakr, david, donettom, jane.chu, jglisse,
kherbst, linux-kernel, linux-mm, lyude, peterx, ryan.roberts,
shuah, simona, wangkefeng.wang, willy, ziy
On Wed, Jul 23, 2025 at 10:02:58PM -0700, Matthew Brost wrote:
> On Thu, Jul 24, 2025 at 10:25:11AM +1000, Balbir Singh wrote:
> > On 7/23/25 05:34, Francois Dugast wrote:
> > > When the PMD swap entry is device private and owned by the caller,
> > > skip the range faulting and instead just set the correct HMM PFNs.
> > > This is similar to the logic for PTEs in hmm_vma_handle_pte().
> > >
> > > For now, each hmm_pfns[i] entry is populated as it is currently done
> > > in hmm_vma_handle_pmd() but this might not be necessary. A follow-up
> > > optimization could be to make use of the order and skip populating
> > > subsequent PFNs.
> >
> > I think we should test and remove these now
> >
>
> +Jason, Leon – perhaps either of you can provide insight into why
> hmm_vma_handle_pmd fully populates the HMM PFNs when a higher-order page
> is found.
I'm not sure why this is buried in some weird unrelated thread,
please post patches normally and CC the right people.
At least the main patch looks reasonable to me, and probably should do
pgd as well while at it?
> If we can be assured that changing this won’t break other parts of the
> kernel, I agree it should be removed. A snippet of documentation should
> also be added indicating that when higher-order PFNs are found,
> subsequent PFNs within the range will remain unpopulated. I can verify
> that GPU SVM works just fine without these PFNs being populated.
We can only do this if someone audits all the current users to confirm
they are compatible. Do that and then it is OK to propose the change.
I think the current version evolved as an optimization so I would not
be surprised to see that some callers need fixing.
Jason
* Re: [PATCH] mm/hmm: Do not fault in device private pages owned by the caller
2025-07-22 19:34 ` [PATCH] mm/hmm: Do not fault in device private pages owned by the caller Francois Dugast
2025-07-22 20:07 ` Andrew Morton
2025-07-24 0:25 ` Balbir Singh
@ 2025-08-08 0:21 ` Matthew Brost
2025-08-08 9:43 ` Francois Dugast
2 siblings, 1 reply; 99+ messages in thread
From: Matthew Brost @ 2025-08-08 0:21 UTC (permalink / raw)
To: Francois Dugast
Cc: balbirs, airlied, akpm, apopple, baohua, baolin.wang, dakr, david,
donettom, jane.chu, jglisse, kherbst, linux-kernel, linux-mm,
lyude, peterx, ryan.roberts, shuah, simona, wangkefeng.wang,
willy, ziy
On Tue, Jul 22, 2025 at 09:34:45PM +0200, Francois Dugast wrote:
> When the PMD swap entry is device private and owned by the caller,
> skip the range faulting and instead just set the correct HMM PFNs.
> This is similar to the logic for PTEs in hmm_vma_handle_pte().
>
> For now, each hmm_pfns[i] entry is populated as it is currently done
> in hmm_vma_handle_pmd() but this might not be necessary. A follow-up
> optimization could be to make use of the order and skip populating
> subsequent PFNs.
>
> Signed-off-by: Francois Dugast <francois.dugast@intel.com>
> ---
> mm/hmm.c | 25 +++++++++++++++++++++++++
> 1 file changed, 25 insertions(+)
>
> diff --git a/mm/hmm.c b/mm/hmm.c
> index f2415b4b2cdd..63ec1b18a656 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -355,6 +355,31 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
> }
>
> if (!pmd_present(pmd)) {
> + swp_entry_t entry = pmd_to_swp_entry(pmd);
> +
> + /*
> + * Don't fault in device private pages owned by the caller,
> + * just report the PFNs.
> + */
> + if (is_device_private_entry(entry) &&
> + pfn_swap_entry_folio(entry)->pgmap->owner ==
> + range->dev_private_owner) {
> + unsigned long cpu_flags = HMM_PFN_VALID |
> + hmm_pfn_flags_order(PMD_SHIFT - PAGE_SHIFT);
> + unsigned long pfn = swp_offset_pfn(entry);
> + unsigned long i;
> +
> + if (is_writable_device_private_entry(entry))
> + cpu_flags |= HMM_PFN_WRITE;
> +
> + for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
> + hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
> + hmm_pfns[i] |= pfn | cpu_flags;
> + }
> +
> + return 0;
> + }
> +
> if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0))
> return -EFAULT;
I think that here, if this is an is_device_private_entry(entry), we need to
call hmm_vma_fault().
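A sketch of that, following the pattern of the other handlers in mm/hmm.c: the
existing "return -EFAULT" in the faulting case would become a hmm_vma_fault() call
when the entry is device private (exact placement inside the !pmd_present() branch
is an assumption):

	unsigned int required_fault;

	required_fault = hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0);
	if (required_fault) {
		/* a device private entry we do not own: fault it back to system RAM */
		if (is_device_private_entry(entry))
			return hmm_vma_fault(addr, end, required_fault, walk);
		return -EFAULT;
	}
	return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);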
Matt
> return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
> --
> 2.43.0
>
* Re: [PATCH] mm/hmm: Do not fault in device private pages owned by the caller
2025-08-08 0:21 ` Matthew Brost
@ 2025-08-08 9:43 ` Francois Dugast
0 siblings, 0 replies; 99+ messages in thread
From: Francois Dugast @ 2025-08-08 9:43 UTC (permalink / raw)
To: Matthew Brost
Cc: balbirs, airlied, akpm, apopple, baohua, baolin.wang, dakr, david,
donettom, jane.chu, jglisse, kherbst, linux-kernel, linux-mm,
lyude, peterx, ryan.roberts, shuah, simona, wangkefeng.wang,
willy, ziy
On Thu, Aug 07, 2025 at 05:21:45PM -0700, Matthew Brost wrote:
> On Tue, Jul 22, 2025 at 09:34:45PM +0200, Francois Dugast wrote:
> > When the PMD swap entry is device private and owned by the caller,
> > skip the range faulting and instead just set the correct HMM PFNs.
> > This is similar to the logic for PTEs in hmm_vma_handle_pte().
> >
> > For now, each hmm_pfns[i] entry is populated as it is currently done
> > in hmm_vma_handle_pmd() but this might not be necessary. A follow-up
> > optimization could be to make use of the order and skip populating
> > subsequent PFNs.
> >
> > Signed-off-by: Francois Dugast <francois.dugast@intel.com>
> > ---
> > mm/hmm.c | 25 +++++++++++++++++++++++++
> > 1 file changed, 25 insertions(+)
> >
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > index f2415b4b2cdd..63ec1b18a656 100644
> > --- a/mm/hmm.c
> > +++ b/mm/hmm.c
> > @@ -355,6 +355,31 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
> > }
> >
> > if (!pmd_present(pmd)) {
> > + swp_entry_t entry = pmd_to_swp_entry(pmd);
> > +
> > + /*
> > + * Don't fault in device private pages owned by the caller,
> > + * just report the PFNs.
> > + */
> > + if (is_device_private_entry(entry) &&
> > + pfn_swap_entry_folio(entry)->pgmap->owner ==
> > + range->dev_private_owner) {
> > + unsigned long cpu_flags = HMM_PFN_VALID |
> > + hmm_pfn_flags_order(PMD_SHIFT - PAGE_SHIFT);
> > + unsigned long pfn = swp_offset_pfn(entry);
> > + unsigned long i;
> > +
> > + if (is_writable_device_private_entry(entry))
> > + cpu_flags |= HMM_PFN_WRITE;
> > +
> > + for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
> > + hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
> > + hmm_pfns[i] |= pfn | cpu_flags;
> > + }
> > +
> > + return 0;
> > + }
> > +
> > if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0))
> > return -EFAULT;
>
> I think here if this is a is_device_private_entry(entry), we need to
> call hmm_vma_fault.
Yes that seems needed, thanks for the catch.
Francois
>
> Matt
>
> > return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
> > --
> > 2.43.0
> >
Thread overview: 99+ messages (end of thread; newest: 2025-08-08 9:44 UTC)
2025-07-03 23:34 [v1 resend 00/12] THP support for zone device page migration Balbir Singh
2025-07-03 23:35 ` [v1 resend 01/12] mm/zone_device: support large zone device private folios Balbir Singh
2025-07-07 5:28 ` Alistair Popple
2025-07-08 6:47 ` Balbir Singh
2025-07-03 23:35 ` [v1 resend 02/12] mm/migrate_device: flags for selecting device private THP pages Balbir Singh
2025-07-07 5:31 ` Alistair Popple
2025-07-08 7:31 ` Balbir Singh
2025-07-19 20:06 ` Matthew Brost
2025-07-19 20:16 ` Matthew Brost
2025-07-18 3:15 ` Matthew Brost
2025-07-03 23:35 ` [v1 resend 03/12] mm/thp: zone_device awareness in THP handling code Balbir Singh
2025-07-04 4:46 ` Mika Penttilä
2025-07-06 1:21 ` Balbir Singh
2025-07-04 11:10 ` Mika Penttilä
2025-07-05 0:14 ` Balbir Singh
2025-07-07 6:09 ` Alistair Popple
2025-07-08 7:40 ` Balbir Singh
2025-07-07 3:49 ` Mika Penttilä
2025-07-08 4:20 ` Balbir Singh
2025-07-08 4:30 ` Mika Penttilä
2025-07-07 6:07 ` Alistair Popple
2025-07-08 4:59 ` Balbir Singh
2025-07-22 4:42 ` Matthew Brost
2025-07-03 23:35 ` [v1 resend 04/12] mm/migrate_device: THP migration of zone device pages Balbir Singh
2025-07-04 15:35 ` kernel test robot
2025-07-18 6:59 ` Matthew Brost
2025-07-18 7:04 ` Balbir Singh
2025-07-18 7:21 ` Matthew Brost
2025-07-18 8:22 ` Matthew Brost
2025-07-22 4:54 ` Matthew Brost
2025-07-19 2:10 ` Matthew Brost
2025-07-03 23:35 ` [v1 resend 05/12] mm/memory/fault: add support for zone device THP fault handling Balbir Singh
2025-07-17 19:34 ` Matthew Brost
2025-07-03 23:35 ` [v1 resend 06/12] lib/test_hmm: test cases and support for zone device private THP Balbir Singh
2025-07-03 23:35 ` [v1 resend 07/12] mm/memremap: add folio_split support Balbir Singh
2025-07-04 11:14 ` Mika Penttilä
2025-07-06 1:24 ` Balbir Singh
2025-07-03 23:35 ` [v1 resend 08/12] mm/thp: add split during migration support Balbir Singh
2025-07-04 5:17 ` Mika Penttilä
2025-07-04 6:43 ` Mika Penttilä
2025-07-05 0:26 ` Balbir Singh
2025-07-05 3:17 ` Mika Penttilä
2025-07-07 2:35 ` Balbir Singh
2025-07-07 3:29 ` Mika Penttilä
2025-07-08 7:37 ` Balbir Singh
2025-07-04 11:24 ` Zi Yan
2025-07-05 0:58 ` Balbir Singh
2025-07-05 1:55 ` Zi Yan
2025-07-06 1:15 ` Balbir Singh
2025-07-06 1:34 ` Zi Yan
2025-07-06 1:47 ` Balbir Singh
2025-07-06 2:34 ` Zi Yan
2025-07-06 3:03 ` Zi Yan
2025-07-07 2:29 ` Balbir Singh
2025-07-07 2:45 ` Zi Yan
2025-07-08 3:31 ` Balbir Singh
2025-07-08 7:43 ` Balbir Singh
2025-07-16 5:34 ` Matthew Brost
2025-07-16 11:19 ` Zi Yan
2025-07-16 16:24 ` Matthew Brost
2025-07-16 21:53 ` Balbir Singh
2025-07-17 22:24 ` Matthew Brost
2025-07-17 23:04 ` Zi Yan
2025-07-18 0:41 ` Matthew Brost
2025-07-18 1:25 ` Zi Yan
2025-07-18 3:33 ` Matthew Brost
2025-07-18 15:06 ` Zi Yan
2025-07-23 0:00 ` Matthew Brost
2025-07-03 23:35 ` [v1 resend 09/12] lib/test_hmm: add test case for split pages Balbir Singh
2025-07-03 23:35 ` [v1 resend 10/12] selftests/mm/hmm-tests: new tests for zone device THP migration Balbir Singh
2025-07-03 23:35 ` [v1 resend 11/12] gpu/drm/nouveau: add THP migration support Balbir Singh
2025-07-03 23:35 ` [v1 resend 12/12] selftests/mm/hmm-tests: new throughput tests including THP Balbir Singh
2025-07-04 16:16 ` [v1 resend 00/12] THP support for zone device page migration Zi Yan
2025-07-04 23:56 ` Balbir Singh
2025-07-08 14:53 ` David Hildenbrand
2025-07-08 22:43 ` Balbir Singh
2025-07-17 23:40 ` Matthew Brost
2025-07-18 3:57 ` Balbir Singh
2025-07-18 4:57 ` Matthew Brost
2025-07-21 23:48 ` Balbir Singh
2025-07-22 0:07 ` Matthew Brost
2025-07-22 0:51 ` Balbir Singh
2025-07-19 0:53 ` Matthew Brost
2025-07-21 11:42 ` Francois Dugast
2025-07-21 23:34 ` Balbir Singh
2025-07-22 0:01 ` Matthew Brost
2025-07-22 19:34 ` [PATCH] mm/hmm: Do not fault in device private pages owned by the caller Francois Dugast
2025-07-22 20:07 ` Andrew Morton
2025-07-23 15:34 ` Francois Dugast
2025-07-23 18:05 ` Matthew Brost
2025-07-24 0:25 ` Balbir Singh
2025-07-24 5:02 ` Matthew Brost
2025-07-24 5:46 ` Mika Penttilä
2025-07-24 5:57 ` Matthew Brost
2025-07-24 6:04 ` Mika Penttilä
2025-07-24 6:47 ` Leon Romanovsky
2025-07-28 13:34 ` Jason Gunthorpe
2025-08-08 0:21 ` Matthew Brost
2025-08-08 9:43 ` Francois Dugast