* [v2 00/11] THP support for zone device page migration
@ 2025-07-30  9:21 Balbir Singh
  2025-07-30  9:21 ` [v2 01/11] mm/zone_device: support large zone device private folios Balbir Singh
                   ` (12 more replies)
  0 siblings, 13 replies; 71+ messages in thread
From: Balbir Singh @ 2025-07-30  9:21 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Balbir Singh, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, David Hildenbrand,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
	Ralph Campbell, Mika Penttilä, Matthew Brost,
	Francois Dugast

This patch series adds support for THP migration of zone device pages.
To do so, the patches implement support for folio zone device pages
by adding support for setting up larger order pages. Larger order
pages provide a speedup in throughput and latency.

In my local testing (using lib/test_hmm) and a throughput test, the
series shows a 350% improvement in data transfer throughput and a
500% improvement in latency.

These patches build on the earlier posts by Ralph Campbell [1].

Two new flags are added to the migrate_vma interface to select and mark
compound pages. migrate_vma_setup(), migrate_vma_pages() and
migrate_vma_finalize() support migration of these pages when
MIGRATE_VMA_SELECT_COMPOUND is passed as an argument.
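
As a rough illustration (vma, addr, src_pfns and dst_pfns below are
placeholders, not names from this series), a driver migrating a single
PMD-sized range to device memory might do something like:

	unsigned long src_pfns[HPAGE_PMD_NR] = { 0 };	/* real drivers allocate these */
	unsigned long dst_pfns[HPAGE_PMD_NR] = { 0 };
	struct migrate_vma args = {
		.vma	= vma,
		.start	= addr,				/* HPAGE_PMD_SIZE aligned */
		.end	= addr + HPAGE_PMD_SIZE,
		.src	= src_pfns,
		.dst	= dst_pfns,
		.flags	= MIGRATE_VMA_SELECT_SYSTEM |
			  MIGRATE_VMA_SELECT_COMPOUND,
	};

	if (migrate_vma_setup(&args) < 0 || !args.cpages)
		return -EFAULT;
	/* allocate device destination page(s), fill dst_pfns[], copy data */
	migrate_vma_pages(&args);
	migrate_vma_finalize(&args);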

The series also adds zone device awareness to the (m)THP code paths,
along with fault handling for large zone device private pages. The page
vma walk and rmap code are also made zone device aware. Support has
been added for folios that might need to be split in the middle of
migration (when the src and dst do not agree on MIGRATE_PFN_COMPOUND),
which occurs when the src side of the migration can migrate large pages
but the destination has not been able to allocate large pages. The code
uses folio_split() when migrating THP pages; that path is also taken
when MIGRATE_VMA_SELECT_COMPOUND is not passed as an argument to
migrate_vma_setup().
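
For example (a sketch only, with small_pages[] standing in for whatever
order-0 destination pages the driver managed to allocate), the
destination side leaves MIGRATE_PFN_COMPOUND clear and fills one dst
slot per small page, which is what triggers the source folio split:

	/* src[i] was collected as a single compound entry */
	for (j = 0; j < HPAGE_PMD_NR; j++)
		dst[i + j] = migrate_pfn(page_to_pfn(small_pages[j]));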

The test infrastructure lib/test_hmm.c has been enhanced to support THP
migration. A new ioctl to emulate failure of large page allocations has
been added to test the folio split code path. hmm-tests.c has new test
cases for huge page migration and to test the folio split path. A new
throughput test has been added as well.

The nouveau dmem code has been enhanced to use the new THP migration
capability.

mTHP support:

The patches hard code HPAGE_PMD_NR in a few places, but the code has
been kept generic to support various orders. With additional
refactoring, support for different order sizes should be possible.

The future plan is to post enhancements to support mTHP with a rough
design as follows:

1. Add the notion of allowable thp orders to the HMM based test driver
2. For non PMD based THP paths in migrate_device.c, check to see if
   a suitable order is found and supported by the driver
3. Iterate across orders to check the highest supported order for migration
   (a rough sketch follows this list)
4. Migrate and finalize
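
A rough sketch of step 3, with supported_orders standing in for a
hypothetical per-driver mask of allowable orders:

	unsigned int order;

	for (order = HPAGE_PMD_ORDER; order; order--) {
		if (!(supported_orders & BIT(order)))
			continue;
		if (IS_ALIGNED(addr, PAGE_SIZE << order) &&
		    addr + (PAGE_SIZE << order) <= end)
			break;	/* highest order usable for this range */
	}
	/* order == 0 means fall back to base page migration */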

The mTHP patches can be built on top of this series; the key design
elements that need to be worked out are infrastructure and driver
support for multiple page orders and their migration.

HMM support for large folios:

Francois Dugast posted patches adding support for HMM handling of large
folios [4]; the proposed changes can build on top of this series to
provide HMM fault handling.

References:
[1] https://lore.kernel.org/linux-mm/20201106005147.20113-1-rcampbell@nvidia.com/
[2] https://lore.kernel.org/linux-mm/20250306044239.3874247-3-balbirs@nvidia.com/T/
[3] https://lore.kernel.org/lkml/20250703233511.2028395-1-balbirs@nvidia.com/
[4] https://lore.kernel.org/lkml/20250722193445.1588348-1-francois.dugast@intel.com/

These patches are built on top of mm/mm-stable.

Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>

Changelog v2 [3] :
- Several review comments from David Hildenbrand were addressed; Mika,
  Zi and Matthew also provided helpful review comments
  - In paths where it makes sense a new helper
    is_pmd_device_private_entry() is used
  - anon_exclusive handling of zone device private pages in
    split_huge_pmd_locked() has been fixed
  - Patches that introduced helpers have been folded into where they
    are used
- Zone device handling in mm/huge_memory.c has benefited from the code
  and testing of Matthew Brost, who helped find bugs related to
  copy_huge_pmd() and partial unmapping of folios.
- Zone device THP PMD support via page_vma_mapped_walk() is restricted
  to try_to_migrate_one()
- There is a new dedicated helper to split large zone device folios

Changelog v1 [2]:
- Support for handling fault_folio and using trylock in the fault path
- A new test case has been added to measure the throughput improvement
- General refactoring of code to keep up with the changes in mm
- New split folio callback invoked when the entire split is complete. The
  callback is used to know when the head order needs to be reset.

Testing:
- Testing was done with ZONE_DEVICE private pages on an x86 VM

Balbir Singh (11):
  mm/zone_device: support large zone device private folios
  mm/thp: zone_device awareness in THP handling code
  mm/migrate_device: THP migration of zone device pages
  mm/memory/fault: add support for zone device THP fault handling
  lib/test_hmm: test cases and support for zone device private THP
  mm/memremap: add folio_split support
  mm/thp: add split during migration support
  lib/test_hmm: add test case for split pages
  selftests/mm/hmm-tests: new tests for zone device THP migration
  gpu/drm/nouveau: add THP migration support
  selftests/mm/hmm-tests: new throughput tests including THP

 drivers/gpu/drm/nouveau/nouveau_dmem.c | 246 +++++++---
 drivers/gpu/drm/nouveau/nouveau_svm.c  |   6 +-
 drivers/gpu/drm/nouveau/nouveau_svm.h  |   3 +-
 include/linux/huge_mm.h                |  19 +-
 include/linux/memremap.h               |  51 ++-
 include/linux/migrate.h                |   2 +
 include/linux/mm.h                     |   1 +
 include/linux/rmap.h                   |   2 +
 include/linux/swapops.h                |  17 +
 lib/test_hmm.c                         | 432 ++++++++++++++----
 lib/test_hmm_uapi.h                    |   3 +
 mm/huge_memory.c                       | 358 ++++++++++++---
 mm/memory.c                            |   6 +-
 mm/memremap.c                          |  48 +-
 mm/migrate_device.c                    | 517 ++++++++++++++++++---
 mm/page_vma_mapped.c                   |  13 +-
 mm/pgtable-generic.c                   |   6 +
 mm/rmap.c                              |  22 +-
 tools/testing/selftests/mm/hmm-tests.c | 607 ++++++++++++++++++++++++-
 19 files changed, 2040 insertions(+), 319 deletions(-)

-- 
2.50.1


* [v2 01/11] mm/zone_device: support large zone device private folios
  2025-07-30  9:21 [v2 00/11] THP support for zone device page migration Balbir Singh
@ 2025-07-30  9:21 ` Balbir Singh
  2025-07-30  9:50   ` David Hildenbrand
  2025-07-30  9:21 ` [v2 02/11] mm/thp: zone_device awareness in THP handling code Balbir Singh
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 71+ messages in thread
From: Balbir Singh @ 2025-07-30  9:21 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Balbir Singh, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, David Hildenbrand,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
	Ralph Campbell, Mika Penttilä, Matthew Brost,
	Francois Dugast

Add routines to support allocation of large order zone device folios,
along with helper functions to check whether a folio is device private
and to set zone device data.

When large folios are used, the existing page_free() callback in
pgmap is called when the folio is freed; this is true for both
PAGE_SIZE and higher order pages.
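
A hypothetical driver-side sketch (page is the head struct page of a
free device private range owned by the driver, drv_data its per-chunk
private data):

	struct folio *folio = page_folio(page);

	zone_device_folio_init(folio, HPAGE_PMD_ORDER);
	folio->page.zone_device_data = drv_data;
	/* order-0 callers can keep using zone_device_page_init(page) */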

Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>

Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
 include/linux/memremap.h | 10 ++++++++-
 mm/memremap.c            | 48 +++++++++++++++++++++++++++++-----------
 2 files changed, 44 insertions(+), 14 deletions(-)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 4aa151914eab..a0723b35eeaa 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -199,7 +199,7 @@ static inline bool folio_is_fsdax(const struct folio *folio)
 }
 
 #ifdef CONFIG_ZONE_DEVICE
-void zone_device_page_init(struct page *page);
+void zone_device_folio_init(struct folio *folio, unsigned int order);
 void *memremap_pages(struct dev_pagemap *pgmap, int nid);
 void memunmap_pages(struct dev_pagemap *pgmap);
 void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
@@ -209,6 +209,14 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);
 
 unsigned long memremap_compat_align(void);
+
+static inline void zone_device_page_init(struct page *page)
+{
+	struct folio *folio = page_folio(page);
+
+	zone_device_folio_init(folio, 0);
+}
+
 #else
 static inline void *devm_memremap_pages(struct device *dev,
 		struct dev_pagemap *pgmap)
diff --git a/mm/memremap.c b/mm/memremap.c
index b0ce0d8254bd..3ca136e7455e 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -427,20 +427,19 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
 void free_zone_device_folio(struct folio *folio)
 {
 	struct dev_pagemap *pgmap = folio->pgmap;
+	unsigned int nr = folio_nr_pages(folio);
+	int i;
 
 	if (WARN_ON_ONCE(!pgmap))
 		return;
 
 	mem_cgroup_uncharge(folio);
 
-	/*
-	 * Note: we don't expect anonymous compound pages yet. Once supported
-	 * and we could PTE-map them similar to THP, we'd have to clear
-	 * PG_anon_exclusive on all tail pages.
-	 */
 	if (folio_test_anon(folio)) {
-		VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
-		__ClearPageAnonExclusive(folio_page(folio, 0));
+		for (i = 0; i < nr; i++)
+			__ClearPageAnonExclusive(folio_page(folio, i));
+	} else {
+		VM_WARN_ON_ONCE(folio_test_large(folio));
 	}
 
 	/*
@@ -464,11 +463,20 @@ void free_zone_device_folio(struct folio *folio)
 
 	switch (pgmap->type) {
 	case MEMORY_DEVICE_PRIVATE:
+		if (folio_test_large(folio)) {
+			folio_unqueue_deferred_split(folio);
+
+			percpu_ref_put_many(&folio->pgmap->ref, nr - 1);
+		}
+		pgmap->ops->page_free(&folio->page);
+		percpu_ref_put(&folio->pgmap->ref);
+		folio->page.mapping = NULL;
+		break;
 	case MEMORY_DEVICE_COHERENT:
 		if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->page_free))
 			break;
-		pgmap->ops->page_free(folio_page(folio, 0));
-		put_dev_pagemap(pgmap);
+		pgmap->ops->page_free(&folio->page);
+		percpu_ref_put(&folio->pgmap->ref);
 		break;
 
 	case MEMORY_DEVICE_GENERIC:
@@ -491,14 +499,28 @@ void free_zone_device_folio(struct folio *folio)
 	}
 }
 
-void zone_device_page_init(struct page *page)
+void zone_device_folio_init(struct folio *folio, unsigned int order)
 {
+	struct page *page = folio_page(folio, 0);
+
+	VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);
+
+	/*
+	 * Only PMD level migration is supported for THP migration
+	 */
+	WARN_ON_ONCE(order && order != HPAGE_PMD_ORDER);
+
 	/*
 	 * Drivers shouldn't be allocating pages after calling
 	 * memunmap_pages().
 	 */
-	WARN_ON_ONCE(!percpu_ref_tryget_live(&page_pgmap(page)->ref));
-	set_page_count(page, 1);
+	WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << order));
+	folio_set_count(folio, 1);
 	lock_page(page);
+
+	if (order > 1) {
+		prep_compound_page(page, order);
+		folio_set_large_rmappable(folio);
+	}
 }
-EXPORT_SYMBOL_GPL(zone_device_page_init);
+EXPORT_SYMBOL_GPL(zone_device_folio_init);
-- 
2.50.1


* [v2 02/11] mm/thp: zone_device awareness in THP handling code
  2025-07-30  9:21 [v2 00/11] THP support for zone device page migration Balbir Singh
  2025-07-30  9:21 ` [v2 01/11] mm/zone_device: support large zone device private folios Balbir Singh
@ 2025-07-30  9:21 ` Balbir Singh
  2025-07-30 11:16   ` Mika Penttilä
  2025-07-30 20:05   ` kernel test robot
  2025-07-30  9:21 ` [v2 03/11] mm/migrate_device: THP migration of zone device pages Balbir Singh
                   ` (10 subsequent siblings)
  12 siblings, 2 replies; 71+ messages in thread
From: Balbir Singh @ 2025-07-30  9:21 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Balbir Singh, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, David Hildenbrand,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
	Mika Penttilä, Matthew Brost, Francois Dugast,
	Ralph Campbell

Make the THP handling code in the mm subsystem aware of zone device
pages. Although the code is designed to be generic when it comes to
handling splitting of pages, it currently works only for THP page sizes
corresponding to HPAGE_PMD_NR.

Modify page_vma_mapped_walk() to return true when a zone device huge
entry is present, enabling try_to_migrate() and other migration code
paths to process the entry appropriately. page_vma_mapped_walk() will
return true for zone device private large folios only when
PVMW_THP_DEVICE_PRIVATE is passed, so that callers which do not handle
zone device private pages do not need to add awareness. The key
callback that needs this flag is try_to_migrate_one(). The other
callbacks (page idle, damon) use the walk for setting young/dirty bits,
which is not significant when it comes to pmd level bit harvesting.

pmd_pfn() does not work well with zone device entries; use
pfn_pmd_entry_to_swap() instead for checking and comparing zone device
entries.
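
For reference, the pattern used in the try_to_migrate_one() hunk below
boils down to:

	if (is_pmd_device_private_entry(*pvmw.pmd))
		pfn = swp_offset_pfn(pmd_to_swp_entry(*pvmw.pmd));
	else
		pfn = pmd_pfn(*pvmw.pmd);
	subpage = folio_page(folio, pfn - folio_pfn(folio));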

Zone device private entries, when split via munmap, go through a pmd
split but also need a folio split. Deferred split does not work if a
fault is encountered, because fault handling involves migration entries
(via folio_migrate_mapping) and the folio sizes are expected to match
there. This introduces the need to split the folio while handling the
pmd split. Because the folio is still mapped, calling folio_split()
would cause lock recursion, so the __split_unmapped_folio() code is
used via a new wrapper helper, split_device_private_folio(), which
skips the checks around folio->mapping and the swapcache, and avoids
the need to unmap and remap the folio.

Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
 include/linux/huge_mm.h |   1 +
 include/linux/rmap.h    |   2 +
 include/linux/swapops.h |  17 +++
 mm/huge_memory.c        | 268 +++++++++++++++++++++++++++++++++-------
 mm/page_vma_mapped.c    |  13 +-
 mm/pgtable-generic.c    |   6 +
 mm/rmap.c               |  22 +++-
 7 files changed, 278 insertions(+), 51 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 7748489fde1b..2a6f5ff7bca3 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -345,6 +345,7 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
 bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
 int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 		unsigned int new_order);
+int split_device_private_folio(struct folio *folio);
 int min_order_for_split(struct folio *folio);
 int split_folio_to_list(struct folio *folio, struct list_head *list);
 bool uniform_split_supported(struct folio *folio, unsigned int new_order,
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 20803fcb49a7..625f36dcc121 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -905,6 +905,8 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
 #define PVMW_SYNC		(1 << 0)
 /* Look for migration entries rather than present PTEs */
 #define PVMW_MIGRATION		(1 << 1)
+/* Look for device private THP entries */
+#define PVMW_THP_DEVICE_PRIVATE	(1 << 2)
 
 struct page_vma_mapped_walk {
 	unsigned long pfn;
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 64ea151a7ae3..2641c01bd5d2 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -563,6 +563,7 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
 {
 	return is_swap_pmd(pmd) && is_migration_entry(pmd_to_swp_entry(pmd));
 }
+
 #else  /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
 static inline int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
 		struct page *page)
@@ -594,6 +595,22 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
 }
 #endif  /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
 
+#if defined(CONFIG_ZONE_DEVICE) && defined(CONFIG_ARCH_ENABLE_THP_MIGRATION)
+
+static inline int is_pmd_device_private_entry(pmd_t pmd)
+{
+	return is_swap_pmd(pmd) && is_device_private_entry(pmd_to_swp_entry(pmd));
+}
+
+#else /* CONFIG_ZONE_DEVICE && CONFIG_ARCH_ENABLE_THP_MIGRATION */
+
+static inline int is_pmd_device_private_entry(pmd_t pmd)
+{
+	return 0;
+}
+
+#endif /* CONFIG_ZONE_DEVICE && CONFIG_ARCH_ENABLE_THP_MIGRATION */
+
 static inline int non_swap_entry(swp_entry_t entry)
 {
 	return swp_type(entry) >= MAX_SWAPFILES;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9c38a95e9f09..e373c6578894 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -72,6 +72,10 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
 					  struct shrink_control *sc);
 static unsigned long deferred_split_scan(struct shrinker *shrink,
 					 struct shrink_control *sc);
+static int __split_unmapped_folio(struct folio *folio, int new_order,
+		struct page *split_at, struct xa_state *xas,
+		struct address_space *mapping, bool uniform_split);
+
 static bool split_underused_thp = true;
 
 static atomic_t huge_zero_refcount;
@@ -1711,8 +1715,11 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	if (unlikely(is_swap_pmd(pmd))) {
 		swp_entry_t entry = pmd_to_swp_entry(pmd);
 
-		VM_BUG_ON(!is_pmd_migration_entry(pmd));
-		if (!is_readable_migration_entry(entry)) {
+		VM_WARN_ON(!is_pmd_migration_entry(pmd) &&
+				!is_pmd_device_private_entry(pmd));
+
+		if (is_migration_entry(entry) &&
+			is_writable_migration_entry(entry)) {
 			entry = make_readable_migration_entry(
 							swp_offset(entry));
 			pmd = swp_entry_to_pmd(entry);
@@ -1722,6 +1729,32 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 				pmd = pmd_swp_mkuffd_wp(pmd);
 			set_pmd_at(src_mm, addr, src_pmd, pmd);
 		}
+
+		if (is_device_private_entry(entry)) {
+			if (is_writable_device_private_entry(entry)) {
+				entry = make_readable_device_private_entry(
+					swp_offset(entry));
+				pmd = swp_entry_to_pmd(entry);
+
+				if (pmd_swp_soft_dirty(*src_pmd))
+					pmd = pmd_swp_mksoft_dirty(pmd);
+				if (pmd_swp_uffd_wp(*src_pmd))
+					pmd = pmd_swp_mkuffd_wp(pmd);
+				set_pmd_at(src_mm, addr, src_pmd, pmd);
+			}
+
+			src_folio = pfn_swap_entry_folio(entry);
+			VM_WARN_ON(!folio_test_large(src_folio));
+
+			folio_get(src_folio);
+			/*
+			 * folio_try_dup_anon_rmap_pmd does not fail for
+			 * device private entries.
+			 */
+			VM_WARN_ON(folio_try_dup_anon_rmap_pmd(src_folio,
+					  &src_folio->page, dst_vma, src_vma));
+		}
+
 		add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
 		mm_inc_nr_ptes(dst_mm);
 		pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
@@ -2219,15 +2252,22 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			folio_remove_rmap_pmd(folio, page, vma);
 			WARN_ON_ONCE(folio_mapcount(folio) < 0);
 			VM_BUG_ON_PAGE(!PageHead(page), page);
-		} else if (thp_migration_supported()) {
+		} else if (is_pmd_migration_entry(orig_pmd) ||
+				is_pmd_device_private_entry(orig_pmd)) {
 			swp_entry_t entry;
 
-			VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
 			entry = pmd_to_swp_entry(orig_pmd);
 			folio = pfn_swap_entry_folio(entry);
 			flush_needed = 0;
-		} else
-			WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
+
+			if (!thp_migration_supported())
+				WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
+
+			if (is_pmd_device_private_entry(orig_pmd)) {
+				folio_remove_rmap_pmd(folio, &folio->page, vma);
+				WARN_ON_ONCE(folio_mapcount(folio) < 0);
+			}
+		}
 
 		if (folio_test_anon(folio)) {
 			zap_deposited_table(tlb->mm, pmd);
@@ -2247,6 +2287,15 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 				folio_mark_accessed(folio);
 		}
 
+		/*
+		 * Do a folio put on zone device private pages after
+		 * changes to mm_counter, because the folio_put() will
+		 * clean folio->mapping and the folio_test_anon() check
+		 * will not be usable.
+		 */
+		if (folio_is_device_private(folio))
+			folio_put(folio);
+
 		spin_unlock(ptl);
 		if (flush_needed)
 			tlb_remove_page_size(tlb, &folio->page, HPAGE_PMD_SIZE);
@@ -2375,7 +2424,8 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		struct folio *folio = pfn_swap_entry_folio(entry);
 		pmd_t newpmd;
 
-		VM_BUG_ON(!is_pmd_migration_entry(*pmd));
+		VM_WARN_ON(!is_pmd_migration_entry(*pmd) &&
+			   !folio_is_device_private(folio));
 		if (is_writable_migration_entry(entry)) {
 			/*
 			 * A protection check is difficult so
@@ -2388,6 +2438,10 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			newpmd = swp_entry_to_pmd(entry);
 			if (pmd_swp_soft_dirty(*pmd))
 				newpmd = pmd_swp_mksoft_dirty(newpmd);
+		} else if (is_writable_device_private_entry(entry)) {
+			entry = make_readable_device_private_entry(
+							swp_offset(entry));
+			newpmd = swp_entry_to_pmd(entry);
 		} else {
 			newpmd = *pmd;
 		}
@@ -2834,6 +2888,44 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
 	pmd_populate(mm, pmd, pgtable);
 }
 
+/**
+ * split_device_private_folio - split a huge device private folio into
+ * smaller pages (of order 0), currently used by migrate_device logic to
+ * split folios for pages that are partially mapped
+ *
+ * @folio: the folio to split
+ *
+ * The caller has to hold the folio_lock and a reference via folio_get
+ */
+int split_device_private_folio(struct folio *folio)
+{
+	struct folio *end_folio = folio_next(folio);
+	struct folio *new_folio;
+	int ret = 0;
+
+	/*
+	 * Split the folio now. In the case of device
+	 * private pages, this path is executed when
+	 * the pmd is split and since freeze is not true
+	 * it is likely the folio will be deferred_split.
+	 *
+	 * With device private pages, deferred splits of
+	 * folios should be handled here to prevent partial
+	 * unmaps from causing issues later on in migration
+	 * and fault handling flows.
+	 */
+	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
+	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
+	VM_WARN_ON(ret);
+	for (new_folio = folio_next(folio); new_folio != end_folio;
+					new_folio = folio_next(new_folio)) {
+		folio_ref_unfreeze(new_folio, 1 + folio_expected_ref_count(
+								new_folio));
+	}
+	folio_ref_unfreeze(folio, 1 + folio_expected_ref_count(folio));
+	return ret;
+}
+
 static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long haddr, bool freeze)
 {
@@ -2842,16 +2934,19 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	struct page *page;
 	pgtable_t pgtable;
 	pmd_t old_pmd, _pmd;
-	bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
-	bool anon_exclusive = false, dirty = false;
+	bool young, write, soft_dirty, uffd_wp = false;
+	bool anon_exclusive = false, dirty = false, present = false;
 	unsigned long addr;
 	pte_t *pte;
 	int i;
+	swp_entry_t swp_entry;
 
 	VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
 	VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
 	VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma);
-	VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd));
+
+	VM_WARN_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd)
+			&& !(is_pmd_device_private_entry(*pmd)));
 
 	count_vm_event(THP_SPLIT_PMD);
 
@@ -2899,18 +2994,60 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		return __split_huge_zero_page_pmd(vma, haddr, pmd);
 	}
 
-	pmd_migration = is_pmd_migration_entry(*pmd);
-	if (unlikely(pmd_migration)) {
-		swp_entry_t entry;
 
+	present = pmd_present(*pmd);
+	if (unlikely(!present)) {
+		swp_entry = pmd_to_swp_entry(*pmd);
 		old_pmd = *pmd;
-		entry = pmd_to_swp_entry(old_pmd);
-		page = pfn_swap_entry_to_page(entry);
-		write = is_writable_migration_entry(entry);
-		if (PageAnon(page))
-			anon_exclusive = is_readable_exclusive_migration_entry(entry);
-		young = is_migration_entry_young(entry);
-		dirty = is_migration_entry_dirty(entry);
+
+		folio = pfn_swap_entry_folio(swp_entry);
+		VM_WARN_ON(!is_migration_entry(swp_entry) &&
+				!is_device_private_entry(swp_entry));
+		page = pfn_swap_entry_to_page(swp_entry);
+
+		if (is_pmd_migration_entry(old_pmd)) {
+			write = is_writable_migration_entry(swp_entry);
+			if (PageAnon(page))
+				anon_exclusive =
+					is_readable_exclusive_migration_entry(
+								swp_entry);
+			young = is_migration_entry_young(swp_entry);
+			dirty = is_migration_entry_dirty(swp_entry);
+		} else if (is_pmd_device_private_entry(old_pmd)) {
+			write = is_writable_device_private_entry(swp_entry);
+			anon_exclusive = PageAnonExclusive(page);
+			if (freeze && anon_exclusive &&
+			    folio_try_share_anon_rmap_pmd(folio, page))
+				freeze = false;
+			if (!freeze) {
+				rmap_t rmap_flags = RMAP_NONE;
+				unsigned long addr = haddr;
+				struct folio *new_folio;
+				struct folio *end_folio = folio_next(folio);
+
+				if (anon_exclusive)
+					rmap_flags |= RMAP_EXCLUSIVE;
+
+				folio_lock(folio);
+				folio_get(folio);
+
+				split_device_private_folio(folio);
+
+				for (new_folio = folio_next(folio);
+					new_folio != end_folio;
+					new_folio = folio_next(new_folio)) {
+					addr += PAGE_SIZE;
+					folio_unlock(new_folio);
+					folio_add_anon_rmap_ptes(new_folio,
+						&new_folio->page, 1,
+						vma, addr, rmap_flags);
+				}
+				folio_unlock(folio);
+				folio_add_anon_rmap_ptes(folio, &folio->page,
+						1, vma, haddr, rmap_flags);
+			}
+		}
+
 		soft_dirty = pmd_swp_soft_dirty(old_pmd);
 		uffd_wp = pmd_swp_uffd_wp(old_pmd);
 	} else {
@@ -2996,30 +3133,49 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	 * Note that NUMA hinting access restrictions are not transferred to
 	 * avoid any possibility of altering permissions across VMAs.
 	 */
-	if (freeze || pmd_migration) {
+	if (freeze || !present) {
 		for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
 			pte_t entry;
-			swp_entry_t swp_entry;
-
-			if (write)
-				swp_entry = make_writable_migration_entry(
-							page_to_pfn(page + i));
-			else if (anon_exclusive)
-				swp_entry = make_readable_exclusive_migration_entry(
-							page_to_pfn(page + i));
-			else
-				swp_entry = make_readable_migration_entry(
-							page_to_pfn(page + i));
-			if (young)
-				swp_entry = make_migration_entry_young(swp_entry);
-			if (dirty)
-				swp_entry = make_migration_entry_dirty(swp_entry);
-			entry = swp_entry_to_pte(swp_entry);
-			if (soft_dirty)
-				entry = pte_swp_mksoft_dirty(entry);
-			if (uffd_wp)
-				entry = pte_swp_mkuffd_wp(entry);
-
+			if (freeze || is_migration_entry(swp_entry)) {
+				if (write)
+					swp_entry = make_writable_migration_entry(
+								page_to_pfn(page + i));
+				else if (anon_exclusive)
+					swp_entry = make_readable_exclusive_migration_entry(
+								page_to_pfn(page + i));
+				else
+					swp_entry = make_readable_migration_entry(
+								page_to_pfn(page + i));
+				if (young)
+					swp_entry = make_migration_entry_young(swp_entry);
+				if (dirty)
+					swp_entry = make_migration_entry_dirty(swp_entry);
+				entry = swp_entry_to_pte(swp_entry);
+				if (soft_dirty)
+					entry = pte_swp_mksoft_dirty(entry);
+				if (uffd_wp)
+					entry = pte_swp_mkuffd_wp(entry);
+			} else {
+				/*
+				 * anon_exclusive was already propagated to the relevant
+				 * pages corresponding to the pte entries when freeze
+				 * is false.
+				 */
+				if (write)
+					swp_entry = make_writable_device_private_entry(
+								page_to_pfn(page + i));
+				else
+					swp_entry = make_readable_device_private_entry(
+								page_to_pfn(page + i));
+				/*
+				 * Young and dirty bits are not propagated via swp_entry
+				 */
+				entry = swp_entry_to_pte(swp_entry);
+				if (soft_dirty)
+					entry = pte_swp_mksoft_dirty(entry);
+				if (uffd_wp)
+					entry = pte_swp_mkuffd_wp(entry);
+			}
 			VM_WARN_ON(!pte_none(ptep_get(pte + i)));
 			set_pte_at(mm, addr, pte + i, entry);
 		}
@@ -3046,7 +3202,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	}
 	pte_unmap(pte);
 
-	if (!pmd_migration)
+	if (present)
 		folio_remove_rmap_pmd(folio, page, vma);
 	if (freeze)
 		put_page(page);
@@ -3058,8 +3214,10 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
 			   pmd_t *pmd, bool freeze)
 {
+
 	VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
-	if (pmd_trans_huge(*pmd) || is_pmd_migration_entry(*pmd))
+	if (pmd_trans_huge(*pmd) || is_pmd_migration_entry(*pmd) ||
+			(is_pmd_device_private_entry(*pmd)))
 		__split_huge_pmd_locked(vma, pmd, address, freeze);
 }
 
@@ -3238,6 +3396,9 @@ static void lru_add_split_folio(struct folio *folio, struct folio *new_folio,
 	VM_BUG_ON_FOLIO(folio_test_lru(new_folio), folio);
 	lockdep_assert_held(&lruvec->lru_lock);
 
+	if (folio_is_device_private(folio))
+		return;
+
 	if (list) {
 		/* page reclaim is reclaiming a huge page */
 		VM_WARN_ON(folio_test_lru(folio));
@@ -3252,6 +3413,7 @@ static void lru_add_split_folio(struct folio *folio, struct folio *new_folio,
 			list_add_tail(&new_folio->lru, &folio->lru);
 		folio_set_lru(new_folio);
 	}
+
 }
 
 /* Racy check whether the huge page can be split */
@@ -3727,7 +3889,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 
 	/* Prevent deferred_split_scan() touching ->_refcount */
 	spin_lock(&ds_queue->split_queue_lock);
-	if (folio_ref_freeze(folio, 1 + extra_pins)) {
+	if (folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio))) {
 		struct address_space *swap_cache = NULL;
 		struct lruvec *lruvec;
 		int expected_refs;
@@ -4603,7 +4765,10 @@ int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
 		return 0;
 
 	flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
-	pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
+	if (unlikely(is_pmd_device_private_entry(*pvmw->pmd)))
+		pmdval = pmdp_huge_clear_flush(vma, address, pvmw->pmd);
+	else
+		pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
 
 	/* See folio_try_share_anon_rmap_pmd(): invalidate PMD first. */
 	anon_exclusive = folio_test_anon(folio) && PageAnonExclusive(page);
@@ -4653,6 +4818,17 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
 	entry = pmd_to_swp_entry(*pvmw->pmd);
 	folio_get(folio);
 	pmde = folio_mk_pmd(folio, READ_ONCE(vma->vm_page_prot));
+
+	if (folio_is_device_private(folio)) {
+		if (pmd_write(pmde))
+			entry = make_writable_device_private_entry(
+							page_to_pfn(new));
+		else
+			entry = make_readable_device_private_entry(
+							page_to_pfn(new));
+		pmde = swp_entry_to_pmd(entry);
+	}
+
 	if (pmd_swp_soft_dirty(*pvmw->pmd))
 		pmde = pmd_mksoft_dirty(pmde);
 	if (is_writable_migration_entry(entry))
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index e981a1a292d2..246e6c211f34 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -250,12 +250,11 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 			pvmw->ptl = pmd_lock(mm, pvmw->pmd);
 			pmde = *pvmw->pmd;
 			if (!pmd_present(pmde)) {
-				swp_entry_t entry;
+				swp_entry_t entry = pmd_to_swp_entry(pmde);
 
 				if (!thp_migration_supported() ||
 				    !(pvmw->flags & PVMW_MIGRATION))
 					return not_found(pvmw);
-				entry = pmd_to_swp_entry(pmde);
 				if (!is_migration_entry(entry) ||
 				    !check_pmd(swp_offset_pfn(entry), pvmw))
 					return not_found(pvmw);
@@ -277,6 +276,16 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 			 * cannot return prematurely, while zap_huge_pmd() has
 			 * cleared *pmd but not decremented compound_mapcount().
 			 */
+			swp_entry_t entry;
+
+			entry = pmd_to_swp_entry(pmde);
+
+			if (is_device_private_entry(entry) &&
+				(pvmw->flags & PVMW_THP_DEVICE_PRIVATE)) {
+				pvmw->ptl = pmd_lock(mm, pvmw->pmd);
+				return true;
+			}
+
 			if ((pvmw->flags & PVMW_SYNC) &&
 			    thp_vma_suitable_order(vma, pvmw->address,
 						   PMD_ORDER) &&
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 567e2d084071..604e8206a2ec 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -292,6 +292,12 @@ pte_t *___pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
 		*pmdvalp = pmdval;
 	if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
 		goto nomap;
+	if (is_swap_pmd(pmdval)) {
+		swp_entry_t entry = pmd_to_swp_entry(pmdval);
+
+		if (is_device_private_entry(entry))
+			goto nomap;
+	}
 	if (unlikely(pmd_trans_huge(pmdval)))
 		goto nomap;
 	if (unlikely(pmd_bad(pmdval))) {
diff --git a/mm/rmap.c b/mm/rmap.c
index f93ce27132ab..5c5c1c777ce3 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2281,7 +2281,8 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 		     unsigned long address, void *arg)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
+	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address,
+				PVMW_THP_DEVICE_PRIVATE);
 	bool anon_exclusive, writable, ret = true;
 	pte_t pteval;
 	struct page *subpage;
@@ -2326,6 +2327,8 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 	while (page_vma_mapped_walk(&pvmw)) {
 		/* PMD-mapped THP migration entry */
 		if (!pvmw.pte) {
+			unsigned long pfn;
+
 			if (flags & TTU_SPLIT_HUGE_PMD) {
 				split_huge_pmd_locked(vma, pvmw.address,
 						      pvmw.pmd, true);
@@ -2334,8 +2337,21 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 				break;
 			}
 #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
-			subpage = folio_page(folio,
-				pmd_pfn(*pvmw.pmd) - folio_pfn(folio));
+			/*
+			 * Zone device private folios do not work well with
+			 * pmd_pfn() on some architectures due to pte
+			 * inversion.
+			 */
+			if (is_pmd_device_private_entry(*pvmw.pmd)) {
+				swp_entry_t entry = pmd_to_swp_entry(*pvmw.pmd);
+
+				pfn = swp_offset_pfn(entry);
+			} else {
+				pfn = pmd_pfn(*pvmw.pmd);
+			}
+
+			subpage = folio_page(folio, pfn - folio_pfn(folio));
+
 			VM_BUG_ON_FOLIO(folio_test_hugetlb(folio) ||
 					!folio_test_pmd_mappable(folio), folio);
 
-- 
2.50.1


* [v2 03/11] mm/migrate_device: THP migration of zone device pages
  2025-07-30  9:21 [v2 00/11] THP support for zone device page migration Balbir Singh
  2025-07-30  9:21 ` [v2 01/11] mm/zone_device: support large zone device private folios Balbir Singh
  2025-07-30  9:21 ` [v2 02/11] mm/thp: zone_device awareness in THP handling code Balbir Singh
@ 2025-07-30  9:21 ` Balbir Singh
  2025-07-31 16:19   ` kernel test robot
  2025-07-30  9:21 ` [v2 04/11] mm/memory/fault: add support for zone device THP fault handling Balbir Singh
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 71+ messages in thread
From: Balbir Singh @ 2025-07-30  9:21 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Balbir Singh, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, David Hildenbrand,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
	Mika Penttilä, Matthew Brost, Francois Dugast,
	Ralph Campbell

MIGRATE_VMA_SELECT_COMPOUND will be used to select THP pages during
migrate_vma_setup(), and MIGRATE_PFN_COMPOUND will mark device pages so
that they are migrated as compound pages during device pfn migration.

migrate_device code paths go through the collect, setup
and finalize phases of migration.

The entries in the src and dst arrays passed to these functions still
remain at PAGE_SIZE granularity. When a compound page is passed, the
first entry has the PFN along with MIGRATE_PFN_COMPOUND and other flags
set (MIGRATE_PFN_MIGRATE, MIGRATE_PFN_VALID); the remaining
(HPAGE_PMD_NR - 1) entries are filled with 0's. This representation
allows the compound page to be split into smaller page sizes.
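
Roughly, for a PMD-sized folio collected at index i (a sketch of the
encoding described above; migrate_pfn() already sets MIGRATE_PFN_VALID):

	src[i] = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE |
		 MIGRATE_PFN_COMPOUND;		/* plus MIGRATE_PFN_WRITE if writable */
	for (j = 1; j < HPAGE_PMD_NR; j++)
		src[i + j] = 0;			/* holes, filled in if a split is needed */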

migrate_vma_collect_hole() and migrate_vma_collect_pmd() are now THP
aware. Two new helper functions, migrate_vma_collect_huge_pmd() and
migrate_vma_insert_huge_pmd_page(), have been added.

migrate_vma_collect_huge_pmd() can collect THP pages, but if for
some reason this fails, there is fallback support to split the folio
and migrate it.

migrate_vma_insert_huge_pmd_page() closely follows the logic of
migrate_vma_insert_page().

Support for splitting pages as needed for migration will follow in
later patches in this series.

Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>

Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
 include/linux/migrate.h |   2 +
 mm/migrate_device.c     | 456 ++++++++++++++++++++++++++++++++++------
 2 files changed, 395 insertions(+), 63 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index acadd41e0b5c..d9cef0819f91 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -129,6 +129,7 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
 #define MIGRATE_PFN_VALID	(1UL << 0)
 #define MIGRATE_PFN_MIGRATE	(1UL << 1)
 #define MIGRATE_PFN_WRITE	(1UL << 3)
+#define MIGRATE_PFN_COMPOUND	(1UL << 4)
 #define MIGRATE_PFN_SHIFT	6
 
 static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
@@ -147,6 +148,7 @@ enum migrate_vma_direction {
 	MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
 	MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
 	MIGRATE_VMA_SELECT_DEVICE_COHERENT = 1 << 2,
+	MIGRATE_VMA_SELECT_COMPOUND = 1 << 3,
 };
 
 struct migrate_vma {
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index e05e14d6eacd..4c3334cc3228 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -14,6 +14,7 @@
 #include <linux/pagewalk.h>
 #include <linux/rmap.h>
 #include <linux/swapops.h>
+#include <asm/pgalloc.h>
 #include <asm/tlbflush.h>
 #include "internal.h"
 
@@ -44,6 +45,23 @@ static int migrate_vma_collect_hole(unsigned long start,
 	if (!vma_is_anonymous(walk->vma))
 		return migrate_vma_collect_skip(start, end, walk);
 
+	if (thp_migration_supported() &&
+		(migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
+		(IS_ALIGNED(start, HPAGE_PMD_SIZE) &&
+		 IS_ALIGNED(end, HPAGE_PMD_SIZE))) {
+		migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE |
+						MIGRATE_PFN_COMPOUND;
+		migrate->dst[migrate->npages] = 0;
+		migrate->npages++;
+		migrate->cpages++;
+
+		/*
+		 * Collect the remaining entries as holes, in case we
+		 * need to split later
+		 */
+		return migrate_vma_collect_skip(start + PAGE_SIZE, end, walk);
+	}
+
 	for (addr = start; addr < end; addr += PAGE_SIZE) {
 		migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE;
 		migrate->dst[migrate->npages] = 0;
@@ -54,57 +72,151 @@ static int migrate_vma_collect_hole(unsigned long start,
 	return 0;
 }
 
-static int migrate_vma_collect_pmd(pmd_t *pmdp,
-				   unsigned long start,
-				   unsigned long end,
-				   struct mm_walk *walk)
+/**
+ * migrate_vma_collect_huge_pmd - collect THP pages without splitting the
+ * folio for device private pages.
+ * @pmdp: pointer to pmd entry
+ * @start: start address of the range for migration
+ * @end: end address of the range for migration
+ * @walk: mm_walk callback structure
+ *
+ * Collect the huge pmd entry at @pmdp for migration and set the
+ * MIGRATE_PFN_COMPOUND flag in the migrate src entry to indicate that
+ * migration will occur at HPAGE_PMD granularity
+ */
+static int migrate_vma_collect_huge_pmd(pmd_t *pmdp, unsigned long start,
+					unsigned long end, struct mm_walk *walk,
+					struct folio *fault_folio)
 {
+	struct mm_struct *mm = walk->mm;
+	struct folio *folio;
 	struct migrate_vma *migrate = walk->private;
-	struct folio *fault_folio = migrate->fault_page ?
-		page_folio(migrate->fault_page) : NULL;
-	struct vm_area_struct *vma = walk->vma;
-	struct mm_struct *mm = vma->vm_mm;
-	unsigned long addr = start, unmapped = 0;
 	spinlock_t *ptl;
-	pte_t *ptep;
+	swp_entry_t entry;
+	int ret;
+	unsigned long write = 0;
 
-again:
-	if (pmd_none(*pmdp))
+	ptl = pmd_lock(mm, pmdp);
+	if (pmd_none(*pmdp)) {
+		spin_unlock(ptl);
 		return migrate_vma_collect_hole(start, end, -1, walk);
+	}
 
 	if (pmd_trans_huge(*pmdp)) {
-		struct folio *folio;
-
-		ptl = pmd_lock(mm, pmdp);
-		if (unlikely(!pmd_trans_huge(*pmdp))) {
+		if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)) {
 			spin_unlock(ptl);
-			goto again;
+			return migrate_vma_collect_skip(start, end, walk);
 		}
 
 		folio = pmd_folio(*pmdp);
 		if (is_huge_zero_folio(folio)) {
 			spin_unlock(ptl);
-			split_huge_pmd(vma, pmdp, addr);
-		} else {
-			int ret;
+			return migrate_vma_collect_hole(start, end, -1, walk);
+		}
+		if (pmd_write(*pmdp))
+			write = MIGRATE_PFN_WRITE;
+	} else if (!pmd_present(*pmdp)) {
+		entry = pmd_to_swp_entry(*pmdp);
+		folio = pfn_swap_entry_folio(entry);
+
+		if (!is_device_private_entry(entry) ||
+			!(migrate->flags & MIGRATE_VMA_SELECT_DEVICE_PRIVATE) ||
+			(folio->pgmap->owner != migrate->pgmap_owner)) {
+			spin_unlock(ptl);
+			return migrate_vma_collect_skip(start, end, walk);
+		}
 
-			folio_get(folio);
+		if (is_migration_entry(entry)) {
+			migration_entry_wait_on_locked(entry, ptl);
 			spin_unlock(ptl);
-			/* FIXME: we don't expect THP for fault_folio */
-			if (WARN_ON_ONCE(fault_folio == folio))
-				return migrate_vma_collect_skip(start, end,
-								walk);
-			if (unlikely(!folio_trylock(folio)))
-				return migrate_vma_collect_skip(start, end,
-								walk);
-			ret = split_folio(folio);
-			if (fault_folio != folio)
-				folio_unlock(folio);
-			folio_put(folio);
-			if (ret)
-				return migrate_vma_collect_skip(start, end,
-								walk);
+			return -EAGAIN;
 		}
+
+		if (is_writable_device_private_entry(entry))
+			write = MIGRATE_PFN_WRITE;
+	} else {
+		spin_unlock(ptl);
+		return -EAGAIN;
+	}
+
+	folio_get(folio);
+	if (folio != fault_folio && unlikely(!folio_trylock(folio))) {
+		spin_unlock(ptl);
+		folio_put(folio);
+		return migrate_vma_collect_skip(start, end, walk);
+	}
+
+	if (thp_migration_supported() &&
+		(migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
+		(IS_ALIGNED(start, HPAGE_PMD_SIZE) &&
+		 IS_ALIGNED(end, HPAGE_PMD_SIZE))) {
+
+		struct page_vma_mapped_walk pvmw = {
+			.ptl = ptl,
+			.address = start,
+			.pmd = pmdp,
+			.vma = walk->vma,
+		};
+
+		unsigned long pfn = page_to_pfn(folio_page(folio, 0));
+
+		migrate->src[migrate->npages] = migrate_pfn(pfn) | write
+						| MIGRATE_PFN_MIGRATE
+						| MIGRATE_PFN_COMPOUND;
+		migrate->dst[migrate->npages++] = 0;
+		migrate->cpages++;
+		ret = set_pmd_migration_entry(&pvmw, folio_page(folio, 0));
+		if (ret) {
+			migrate->npages--;
+			migrate->cpages--;
+			migrate->src[migrate->npages] = 0;
+			migrate->dst[migrate->npages] = 0;
+			goto fallback;
+		}
+		migrate_vma_collect_skip(start + PAGE_SIZE, end, walk);
+		spin_unlock(ptl);
+		return 0;
+	}
+
+fallback:
+	spin_unlock(ptl);
+	if (!folio_test_large(folio))
+		goto done;
+	ret = split_folio(folio);
+	if (fault_folio != folio)
+		folio_unlock(folio);
+	folio_put(folio);
+	if (ret)
+		return migrate_vma_collect_skip(start, end, walk);
+	if (pmd_none(pmdp_get_lockless(pmdp)))
+		return migrate_vma_collect_hole(start, end, -1, walk);
+
+done:
+	return -ENOENT;
+}
+
+static int migrate_vma_collect_pmd(pmd_t *pmdp,
+				   unsigned long start,
+				   unsigned long end,
+				   struct mm_walk *walk)
+{
+	struct migrate_vma *migrate = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long addr = start, unmapped = 0;
+	spinlock_t *ptl;
+	struct folio *fault_folio = migrate->fault_page ?
+		page_folio(migrate->fault_page) : NULL;
+	pte_t *ptep;
+
+again:
+	if (pmd_trans_huge(*pmdp) || !pmd_present(*pmdp)) {
+		int ret = migrate_vma_collect_huge_pmd(pmdp, start, end, walk, fault_folio);
+
+		if (ret == -EAGAIN)
+			goto again;
+		if (ret == 0)
+			return 0;
 	}
 
 	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
@@ -175,8 +287,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
 		}
 
-		/* FIXME support THP */
-		if (!page || !page->mapping || PageTransCompound(page)) {
+		if (!page || !page->mapping) {
 			mpfn = 0;
 			goto next;
 		}
@@ -347,14 +458,6 @@ static bool migrate_vma_check_page(struct page *page, struct page *fault_page)
 	 */
 	int extra = 1 + (page == fault_page);
 
-	/*
-	 * FIXME support THP (transparent huge page), it is bit more complex to
-	 * check them than regular pages, because they can be mapped with a pmd
-	 * or with a pte (split pte mapping).
-	 */
-	if (folio_test_large(folio))
-		return false;
-
 	/* Page from ZONE_DEVICE have one extra reference */
 	if (folio_is_zone_device(folio))
 		extra++;
@@ -385,17 +488,24 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
 
 	lru_add_drain();
 
-	for (i = 0; i < npages; i++) {
+	for (i = 0; i < npages; ) {
 		struct page *page = migrate_pfn_to_page(src_pfns[i]);
 		struct folio *folio;
+		unsigned int nr = 1;
 
 		if (!page) {
 			if (src_pfns[i] & MIGRATE_PFN_MIGRATE)
 				unmapped++;
-			continue;
+			goto next;
 		}
 
 		folio =	page_folio(page);
+		nr = folio_nr_pages(folio);
+
+		if (nr > 1)
+			src_pfns[i] |= MIGRATE_PFN_COMPOUND;
+
+
 		/* ZONE_DEVICE folios are not on LRU */
 		if (!folio_is_zone_device(folio)) {
 			if (!folio_test_lru(folio) && allow_drain) {
@@ -407,7 +517,7 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
 			if (!folio_isolate_lru(folio)) {
 				src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
 				restore++;
-				continue;
+				goto next;
 			}
 
 			/* Drop the reference we took in collect */
@@ -426,10 +536,12 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
 
 			src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
 			restore++;
-			continue;
+			goto next;
 		}
 
 		unmapped++;
+next:
+		i += nr;
 	}
 
 	for (i = 0; i < npages && restore; i++) {
@@ -575,6 +687,146 @@ int migrate_vma_setup(struct migrate_vma *args)
 }
 EXPORT_SYMBOL(migrate_vma_setup);
 
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+/**
+ * migrate_vma_insert_huge_pmd_page: Insert a huge folio into @migrate->vma->vm_mm
+ * at @addr. folio is already allocated as a part of the migration process with
+ * large page.
+ *
+ * @folio needs to be initialized and setup after it's allocated. The code bits
+ * here follow closely the code in __do_huge_pmd_anonymous_page(). This API does
+ * not support THP zero pages.
+ *
+ * @migrate: migrate_vma arguments
+ * @addr: address where the folio will be inserted
+ * @folio: folio to be inserted at @addr
+ * @src: src pfn which is being migrated
+ * @pmdp: pointer to the pmd
+ */
+static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
+					 unsigned long addr,
+					 struct page *page,
+					 unsigned long *src,
+					 pmd_t *pmdp)
+{
+	struct vm_area_struct *vma = migrate->vma;
+	gfp_t gfp = vma_thp_gfp_mask(vma);
+	struct folio *folio = page_folio(page);
+	int ret;
+	spinlock_t *ptl;
+	pgtable_t pgtable;
+	pmd_t entry;
+	bool flush = false;
+	unsigned long i;
+
+	VM_WARN_ON_FOLIO(!folio, folio);
+	VM_WARN_ON_ONCE(!pmd_none(*pmdp) && !is_huge_zero_pmd(*pmdp));
+
+	if (!thp_vma_suitable_order(vma, addr, HPAGE_PMD_ORDER))
+		return -EINVAL;
+
+	ret = anon_vma_prepare(vma);
+	if (ret)
+		return ret;
+
+	folio_set_order(folio, HPAGE_PMD_ORDER);
+	folio_set_large_rmappable(folio);
+
+	if (mem_cgroup_charge(folio, migrate->vma->vm_mm, gfp)) {
+		count_vm_event(THP_FAULT_FALLBACK);
+		count_mthp_stat(HPAGE_PMD_ORDER, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
+		ret = -ENOMEM;
+		goto abort;
+	}
+
+	__folio_mark_uptodate(folio);
+
+	pgtable = pte_alloc_one(vma->vm_mm);
+	if (unlikely(!pgtable))
+		goto abort;
+
+	if (folio_is_device_private(folio)) {
+		swp_entry_t swp_entry;
+
+		if (vma->vm_flags & VM_WRITE)
+			swp_entry = make_writable_device_private_entry(
+						page_to_pfn(page));
+		else
+			swp_entry = make_readable_device_private_entry(
+						page_to_pfn(page));
+		entry = swp_entry_to_pmd(swp_entry);
+	} else {
+		if (folio_is_zone_device(folio) &&
+		    !folio_is_device_coherent(folio)) {
+			goto abort;
+		}
+		entry = folio_mk_pmd(folio, vma->vm_page_prot);
+		if (vma->vm_flags & VM_WRITE)
+			entry = pmd_mkwrite(pmd_mkdirty(entry), vma);
+	}
+
+	ptl = pmd_lock(vma->vm_mm, pmdp);
+	ret = check_stable_address_space(vma->vm_mm);
+	if (ret)
+		goto abort;
+
+	/*
+	 * Check for userfaultfd but do not deliver the fault. Instead,
+	 * just back off.
+	 */
+	if (userfaultfd_missing(vma))
+		goto unlock_abort;
+
+	if (!pmd_none(*pmdp)) {
+		if (!is_huge_zero_pmd(*pmdp))
+			goto unlock_abort;
+		flush = true;
+	} else if (!pmd_none(*pmdp))
+		goto unlock_abort;
+
+	add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+	folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE);
+	if (!folio_is_zone_device(folio))
+		folio_add_lru_vma(folio, vma);
+	folio_get(folio);
+
+	if (flush) {
+		pte_free(vma->vm_mm, pgtable);
+		flush_cache_page(vma, addr, addr + HPAGE_PMD_SIZE);
+		pmdp_invalidate(vma, addr, pmdp);
+	} else {
+		pgtable_trans_huge_deposit(vma->vm_mm, pmdp, pgtable);
+		mm_inc_nr_ptes(vma->vm_mm);
+	}
+	set_pmd_at(vma->vm_mm, addr, pmdp, entry);
+	update_mmu_cache_pmd(vma, addr, pmdp);
+
+	spin_unlock(ptl);
+
+	count_vm_event(THP_FAULT_ALLOC);
+	count_mthp_stat(HPAGE_PMD_ORDER, MTHP_STAT_ANON_FAULT_ALLOC);
+	count_memcg_event_mm(vma->vm_mm, THP_FAULT_ALLOC);
+
+	return 0;
+
+unlock_abort:
+	spin_unlock(ptl);
+abort:
+	for (i = 0; i < HPAGE_PMD_NR; i++)
+		src[i] &= ~MIGRATE_PFN_MIGRATE;
+	return 0;
+}
+#else /* !CONFIG_ARCH_ENABLE_THP_MIGRATION */
+static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
+					 unsigned long addr,
+					 struct page *page,
+					 unsigned long *src,
+					 pmd_t *pmdp)
+{
+	return 0;
+}
+#endif
+
 /*
  * This code closely matches the code in:
  *   __handle_mm_fault()
@@ -585,9 +837,10 @@ EXPORT_SYMBOL(migrate_vma_setup);
  */
 static void migrate_vma_insert_page(struct migrate_vma *migrate,
 				    unsigned long addr,
-				    struct page *page,
+				    unsigned long *dst,
 				    unsigned long *src)
 {
+	struct page *page = migrate_pfn_to_page(*dst);
 	struct folio *folio = page_folio(page);
 	struct vm_area_struct *vma = migrate->vma;
 	struct mm_struct *mm = vma->vm_mm;
@@ -615,8 +868,25 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 	pmdp = pmd_alloc(mm, pudp, addr);
 	if (!pmdp)
 		goto abort;
-	if (pmd_trans_huge(*pmdp))
-		goto abort;
+
+	if (thp_migration_supported() && (*dst & MIGRATE_PFN_COMPOUND)) {
+		int ret = migrate_vma_insert_huge_pmd_page(migrate, addr, page,
+								src, pmdp);
+		if (ret)
+			goto abort;
+		return;
+	}
+
+	if (!pmd_none(*pmdp)) {
+		if (pmd_trans_huge(*pmdp)) {
+			if (!is_huge_zero_pmd(*pmdp))
+				goto abort;
+			folio_get(pmd_folio(*pmdp));
+			split_huge_pmd(vma, pmdp, addr);
+		} else if (pmd_leaf(*pmdp))
+			goto abort;
+	}
+
 	if (pte_alloc(mm, pmdp))
 		goto abort;
 	if (unlikely(anon_vma_prepare(vma)))
@@ -707,23 +977,24 @@ static void __migrate_device_pages(unsigned long *src_pfns,
 	unsigned long i;
 	bool notified = false;
 
-	for (i = 0; i < npages; i++) {
+	for (i = 0; i < npages; ) {
 		struct page *newpage = migrate_pfn_to_page(dst_pfns[i]);
 		struct page *page = migrate_pfn_to_page(src_pfns[i]);
 		struct address_space *mapping;
 		struct folio *newfolio, *folio;
 		int r, extra_cnt = 0;
+		unsigned long nr = 1;
 
 		if (!newpage) {
 			src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
-			continue;
+			goto next;
 		}
 
 		if (!page) {
 			unsigned long addr;
 
 			if (!(src_pfns[i] & MIGRATE_PFN_MIGRATE))
-				continue;
+				goto next;
 
 			/*
 			 * The only time there is no vma is when called from
@@ -741,15 +1012,47 @@ static void __migrate_device_pages(unsigned long *src_pfns,
 					migrate->pgmap_owner);
 				mmu_notifier_invalidate_range_start(&range);
 			}
-			migrate_vma_insert_page(migrate, addr, newpage,
+
+			if ((src_pfns[i] & MIGRATE_PFN_COMPOUND) &&
+				(!(dst_pfns[i] & MIGRATE_PFN_COMPOUND))) {
+				nr = HPAGE_PMD_NR;
+				src_pfns[i] &= ~MIGRATE_PFN_COMPOUND;
+				src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
+				goto next;
+			}
+
+			migrate_vma_insert_page(migrate, addr, &dst_pfns[i],
 						&src_pfns[i]);
-			continue;
+			goto next;
 		}
 
 		newfolio = page_folio(newpage);
 		folio = page_folio(page);
 		mapping = folio_mapping(folio);
 
+		/*
+		 * If THP migration is enabled, check if both src and dst
+		 * can migrate large pages
+		 */
+		if (thp_migration_supported()) {
+			if ((src_pfns[i] & MIGRATE_PFN_MIGRATE) &&
+				(src_pfns[i] & MIGRATE_PFN_COMPOUND) &&
+				!(dst_pfns[i] & MIGRATE_PFN_COMPOUND)) {
+
+				if (!migrate) {
+					src_pfns[i] &= ~(MIGRATE_PFN_MIGRATE |
+							 MIGRATE_PFN_COMPOUND);
+					goto next;
+				}
+				src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
+			} else if ((src_pfns[i] & MIGRATE_PFN_MIGRATE) &&
+				(dst_pfns[i] & MIGRATE_PFN_COMPOUND) &&
+				!(src_pfns[i] & MIGRATE_PFN_COMPOUND)) {
+				src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
+			}
+		}
+
+
 		if (folio_is_device_private(newfolio) ||
 		    folio_is_device_coherent(newfolio)) {
 			if (mapping) {
@@ -762,7 +1065,7 @@ static void __migrate_device_pages(unsigned long *src_pfns,
 				if (!folio_test_anon(folio) ||
 				    !folio_free_swap(folio)) {
 					src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
-					continue;
+					goto next;
 				}
 			}
 		} else if (folio_is_zone_device(newfolio)) {
@@ -770,7 +1073,7 @@ static void __migrate_device_pages(unsigned long *src_pfns,
 			 * Other types of ZONE_DEVICE page are not supported.
 			 */
 			src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
-			continue;
+			goto next;
 		}
 
 		BUG_ON(folio_test_writeback(folio));
@@ -782,6 +1085,8 @@ static void __migrate_device_pages(unsigned long *src_pfns,
 			src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
 		else
 			folio_migrate_flags(newfolio, folio);
+next:
+		i += nr;
 	}
 
 	if (notified)
@@ -943,10 +1248,23 @@ static unsigned long migrate_device_pfn_lock(unsigned long pfn)
 int migrate_device_range(unsigned long *src_pfns, unsigned long start,
 			unsigned long npages)
 {
-	unsigned long i, pfn;
+	unsigned long i, j, pfn;
+
+	for (pfn = start, i = 0; i < npages; pfn++, i++) {
+		struct page *page = pfn_to_page(pfn);
+		struct folio *folio = page_folio(page);
+		unsigned int nr = 1;
 
-	for (pfn = start, i = 0; i < npages; pfn++, i++)
 		src_pfns[i] = migrate_device_pfn_lock(pfn);
+		nr = folio_nr_pages(folio);
+		if (nr > 1) {
+			src_pfns[i] |= MIGRATE_PFN_COMPOUND;
+			for (j = 1; j < nr; j++)
+				src_pfns[i+j] = 0;
+			i += j - 1;
+			pfn += j - 1;
+		}
+	}
 
 	migrate_device_unmap(src_pfns, npages, NULL);
 
@@ -964,10 +1282,22 @@ EXPORT_SYMBOL(migrate_device_range);
  */
 int migrate_device_pfns(unsigned long *src_pfns, unsigned long npages)
 {
-	unsigned long i;
+	unsigned long i, j;
+
+	for (i = 0; i < npages; i++) {
+		struct page *page = pfn_to_page(src_pfns[i]);
+		struct folio *folio = page_folio(page);
+		unsigned int nr = 1;
 
-	for (i = 0; i < npages; i++)
 		src_pfns[i] = migrate_device_pfn_lock(src_pfns[i]);
+		nr = folio_nr_pages(folio);
+		if (nr > 1) {
+			src_pfns[i] |= MIGRATE_PFN_COMPOUND;
+			for (j = 1; j < nr; j++)
+				src_pfns[i+j] = 0;
+			i += j - 1;
+		}
+	}
 
 	migrate_device_unmap(src_pfns, npages, NULL);
 
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [v2 04/11] mm/memory/fault: add support for zone device THP fault handling
  2025-07-30  9:21 [v2 00/11] THP support for zone device page migration Balbir Singh
                   ` (2 preceding siblings ...)
  2025-07-30  9:21 ` [v2 03/11] mm/migrate_device: THP migration of zone device pages Balbir Singh
@ 2025-07-30  9:21 ` Balbir Singh
  2025-07-30  9:21 ` [v2 05/11] lib/test_hmm: test cases and support for zone device private THP Balbir Singh
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 71+ messages in thread
From: Balbir Singh @ 2025-07-30  9:21 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Balbir Singh, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, David Hildenbrand,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
	Ralph Campbell, Mika Penttilä, Matthew Brost,
	Francois Dugast

When the CPU touches a zone device THP entry, the data needs to
be migrated back to the CPU. Call migrate_to_ram() on these pages
via the do_huge_pmd_device_private() fault handling helper.
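
As a rough illustration of what this enables on the driver side, below
is a minimal sketch of a migrate_to_ram() handler for a device-private
THP. The my_* names and the elided copy step are hypothetical; the only
interfaces assumed from this series are MIGRATE_VMA_SELECT_COMPOUND and
the PMD-level dispatch to pgmap->ops->migrate_to_ram().

static vm_fault_t my_devmem_migrate_to_ram(struct vm_fault *vmf)
{
	struct folio *folio = page_folio(vmf->page);
	unsigned long npages = folio_nr_pages(folio);
	unsigned long start = ALIGN_DOWN(vmf->address, npages << PAGE_SHIFT);
	unsigned long *src, *dst;
	struct migrate_vma args = { 0 };
	vm_fault_t ret = 0;

	src = kcalloc(npages, sizeof(*src), GFP_KERNEL);
	dst = kcalloc(npages, sizeof(*dst), GFP_KERNEL);
	if (!src || !dst) {
		ret = VM_FAULT_OOM;
		goto out;
	}

	args.vma = vmf->vma;
	args.start = start;
	args.end = start + (npages << PAGE_SHIFT);
	args.src = src;
	args.dst = dst;
	args.fault_page = vmf->page;
	args.pgmap_owner = my_pgmap_owner;	/* hypothetical */
	args.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
	if (folio_test_large(folio))
		args.flags |= MIGRATE_VMA_SELECT_COMPOUND;

	if (migrate_vma_setup(&args)) {
		ret = VM_FAULT_SIGBUS;
		goto out;
	}

	/* allocate system folios, copy the device data, fill args.dst[] */

	migrate_vma_pages(&args);
	migrate_vma_finalize(&args);
out:
	kfree(src);
	kfree(dst);
	return ret;
}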

Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>

Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
 include/linux/huge_mm.h |  7 +++++++
 mm/huge_memory.c        | 36 ++++++++++++++++++++++++++++++++++++
 mm/memory.c             |  6 ++++--
 3 files changed, 47 insertions(+), 2 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2a6f5ff7bca3..56fdcaf7604b 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -475,6 +475,8 @@ static inline bool folio_test_pmd_mappable(struct folio *folio)
 
 vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);
 
+vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf);
+
 extern struct folio *huge_zero_folio;
 extern unsigned long huge_zero_pfn;
 
@@ -633,6 +635,11 @@ static inline vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 	return 0;
 }
 
+static inline vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf)
+{
+	return 0;
+}
+
 static inline bool is_huge_zero_folio(const struct folio *folio)
 {
 	return false;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e373c6578894..713dd433d352 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1271,6 +1271,42 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 
 }
 
+vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	vm_fault_t ret = 0;
+	spinlock_t *ptl;
+	swp_entry_t swp_entry;
+	struct page *page;
+
+	if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
+		vma_end_read(vma);
+		return VM_FAULT_RETRY;
+	}
+
+	ptl = pmd_lock(vma->vm_mm, vmf->pmd);
+	if (unlikely(!pmd_same(*vmf->pmd, vmf->orig_pmd))) {
+		spin_unlock(ptl);
+		return 0;
+	}
+
+	swp_entry = pmd_to_swp_entry(vmf->orig_pmd);
+	page = pfn_swap_entry_to_page(swp_entry);
+	vmf->page = page;
+	vmf->pte = NULL;
+	if (trylock_page(vmf->page)) {
+		get_page(page);
+		spin_unlock(ptl);
+		ret = page_pgmap(page)->ops->migrate_to_ram(vmf);
+		unlock_page(vmf->page);
+		put_page(page);
+	} else {
+		spin_unlock(ptl);
+	}
+
+	return ret;
+}
+
 /*
  * always: directly stall for all thp allocations
  * defer: wake kswapd and fail if not immediately available
diff --git a/mm/memory.c b/mm/memory.c
index 92fd18a5d8d1..6c87f043eea1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6152,8 +6152,10 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 		vmf.orig_pmd = pmdp_get_lockless(vmf.pmd);
 
 		if (unlikely(is_swap_pmd(vmf.orig_pmd))) {
-			VM_BUG_ON(thp_migration_supported() &&
-					  !is_pmd_migration_entry(vmf.orig_pmd));
+			if (is_device_private_entry(
+					pmd_to_swp_entry(vmf.orig_pmd)))
+				return do_huge_pmd_device_private(&vmf);
+
 			if (is_pmd_migration_entry(vmf.orig_pmd))
 				pmd_migration_entry_wait(mm, vmf.pmd);
 			return 0;
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [v2 05/11] lib/test_hmm: test cases and support for zone device private THP
  2025-07-30  9:21 [v2 00/11] THP support for zone device page migration Balbir Singh
                   ` (3 preceding siblings ...)
  2025-07-30  9:21 ` [v2 04/11] mm/memory/fault: add support for zone device THP fault handling Balbir Singh
@ 2025-07-30  9:21 ` Balbir Singh
  2025-07-31 11:17   ` kernel test robot
  2025-07-30  9:21 ` [v2 06/11] mm/memremap: add folio_split support Balbir Singh
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 71+ messages in thread
From: Balbir Singh @ 2025-07-30  9:21 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Balbir Singh, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, David Hildenbrand,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
	Ralph Campbell, Mika Penttilä, Matthew Brost,
	Francois Dugast

Enhance the hmm test driver (lib/test_hmm) with support for
THP pages.

A new pool of free folios (free_folios) has been added to the dmirror
device; folios are taken from this pool when a request for a THP zone
device private page is made.

Add compound page awareness to the allocation function for both normal
migration and fault based migration. These routines also copy
folio_nr_pages() pages when moving data between system memory and
device memory.

args.src and args.dst, which hold the migration entries, are now
dynamically allocated (they need to hold HPAGE_PMD_NR entries or
more).

Split and migrate support will be added in future patches in this
series.
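
For clarity, the sketch below shows roughly how the new free_folios
pool is consumed; the helper name is hypothetical, but the fields
(free_folios, free_pages, lock, calloc) and the chaining of free
folios through zone_device_data follow the driver changes in this
patch.

static struct page *my_take_free_device_thp(struct dmirror_device *mdevice)
{
	struct page *dpage = NULL;

	spin_lock(&mdevice->lock);
	if (mdevice->free_folios) {
		/* head page of the next free THP-sized folio */
		dpage = folio_page(mdevice->free_folios, 0);
		/* the next free folio is chained via zone_device_data */
		mdevice->free_folios = dpage->zone_device_data;
		mdevice->calloc += HPAGE_PMD_NR;
	}
	spin_unlock(&mdevice->lock);
	return dpage;
}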

Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>

Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
 include/linux/memremap.h |  12 ++
 lib/test_hmm.c           | 366 +++++++++++++++++++++++++++++++--------
 2 files changed, 303 insertions(+), 75 deletions(-)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index a0723b35eeaa..0c5141a7d58c 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -169,6 +169,18 @@ static inline bool folio_is_device_private(const struct folio *folio)
 	return is_device_private_page(&folio->page);
 }
 
+static inline void *folio_zone_device_data(const struct folio *folio)
+{
+	VM_WARN_ON_FOLIO(!folio_is_device_private(folio), folio);
+	return folio->page.zone_device_data;
+}
+
+static inline void folio_set_zone_device_data(struct folio *folio, void *data)
+{
+	VM_WARN_ON_FOLIO(!folio_is_device_private(folio), folio);
+	folio->page.zone_device_data = data;
+}
+
 static inline bool is_pci_p2pdma_page(const struct page *page)
 {
 	return IS_ENABLED(CONFIG_PCI_P2PDMA) &&
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 761725bc713c..4850f9026694 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -119,6 +119,7 @@ struct dmirror_device {
 	unsigned long		calloc;
 	unsigned long		cfree;
 	struct page		*free_pages;
+	struct folio		*free_folios;
 	spinlock_t		lock;		/* protects the above */
 };
 
@@ -492,7 +493,7 @@ static int dmirror_write(struct dmirror *dmirror, struct hmm_dmirror_cmd *cmd)
 }
 
 static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
-				   struct page **ppage)
+				  struct page **ppage, bool is_large)
 {
 	struct dmirror_chunk *devmem;
 	struct resource *res = NULL;
@@ -572,20 +573,45 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
 		pfn_first, pfn_last);
 
 	spin_lock(&mdevice->lock);
-	for (pfn = pfn_first; pfn < pfn_last; pfn++) {
+	for (pfn = pfn_first; pfn < pfn_last; ) {
 		struct page *page = pfn_to_page(pfn);
 
+		if (is_large && IS_ALIGNED(pfn, HPAGE_PMD_NR)
+			&& (pfn + HPAGE_PMD_NR <= pfn_last)) {
+			page->zone_device_data = mdevice->free_folios;
+			mdevice->free_folios = page_folio(page);
+			pfn += HPAGE_PMD_NR;
+			continue;
+		}
+
 		page->zone_device_data = mdevice->free_pages;
 		mdevice->free_pages = page;
+		pfn++;
 	}
+
+	ret = 0;
 	if (ppage) {
-		*ppage = mdevice->free_pages;
-		mdevice->free_pages = (*ppage)->zone_device_data;
-		mdevice->calloc++;
+		if (is_large) {
+			if (!mdevice->free_folios) {
+				ret = -ENOMEM;
+				goto err_unlock;
+			}
+			*ppage = folio_page(mdevice->free_folios, 0);
+			mdevice->free_folios = (*ppage)->zone_device_data;
+			mdevice->calloc += HPAGE_PMD_NR;
+		} else if (mdevice->free_pages) {
+			*ppage = mdevice->free_pages;
+			mdevice->free_pages = (*ppage)->zone_device_data;
+			mdevice->calloc++;
+		} else {
+			ret = -ENOMEM;
+			goto err_unlock;
+		}
 	}
+err_unlock:
 	spin_unlock(&mdevice->lock);
 
-	return 0;
+	return ret;
 
 err_release:
 	mutex_unlock(&mdevice->devmem_lock);
@@ -598,10 +624,13 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
 	return ret;
 }
 
-static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
+static struct page *dmirror_devmem_alloc_page(struct dmirror *dmirror,
+					      bool is_large)
 {
 	struct page *dpage = NULL;
 	struct page *rpage = NULL;
+	unsigned int order = is_large ? HPAGE_PMD_ORDER : 0;
+	struct dmirror_device *mdevice = dmirror->mdevice;
 
 	/*
 	 * For ZONE_DEVICE private type, this is a fake device so we allocate
@@ -610,49 +639,55 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
 	 * data and ignore rpage.
 	 */
 	if (dmirror_is_private_zone(mdevice)) {
-		rpage = alloc_page(GFP_HIGHUSER);
+		rpage = folio_page(folio_alloc(GFP_HIGHUSER, order), 0);
 		if (!rpage)
 			return NULL;
 	}
 	spin_lock(&mdevice->lock);
 
-	if (mdevice->free_pages) {
+	if (is_large && mdevice->free_folios) {
+		dpage = folio_page(mdevice->free_folios, 0);
+		mdevice->free_folios = dpage->zone_device_data;
+		mdevice->calloc += 1 << order;
+		spin_unlock(&mdevice->lock);
+	} else if (!is_large && mdevice->free_pages) {
 		dpage = mdevice->free_pages;
 		mdevice->free_pages = dpage->zone_device_data;
 		mdevice->calloc++;
 		spin_unlock(&mdevice->lock);
 	} else {
 		spin_unlock(&mdevice->lock);
-		if (dmirror_allocate_chunk(mdevice, &dpage))
+		if (dmirror_allocate_chunk(mdevice, &dpage, is_large))
 			goto error;
 	}
 
-	zone_device_page_init(dpage);
+	zone_device_folio_init(page_folio(dpage), order);
 	dpage->zone_device_data = rpage;
 	return dpage;
 
 error:
 	if (rpage)
-		__free_page(rpage);
+		__free_pages(rpage, order);
 	return NULL;
 }
 
 static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
 					   struct dmirror *dmirror)
 {
-	struct dmirror_device *mdevice = dmirror->mdevice;
 	const unsigned long *src = args->src;
 	unsigned long *dst = args->dst;
 	unsigned long addr;
 
-	for (addr = args->start; addr < args->end; addr += PAGE_SIZE,
-						   src++, dst++) {
+	for (addr = args->start; addr < args->end; ) {
 		struct page *spage;
 		struct page *dpage;
 		struct page *rpage;
+		bool is_large = *src & MIGRATE_PFN_COMPOUND;
+		int write = (*src & MIGRATE_PFN_WRITE) ? MIGRATE_PFN_WRITE : 0;
+		unsigned long nr = 1;
 
 		if (!(*src & MIGRATE_PFN_MIGRATE))
-			continue;
+			goto next;
 
 		/*
 		 * Note that spage might be NULL which is OK since it is an
@@ -662,17 +697,45 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
 		if (WARN(spage && is_zone_device_page(spage),
 		     "page already in device spage pfn: 0x%lx\n",
 		     page_to_pfn(spage)))
+			goto next;
+
+		dpage = dmirror_devmem_alloc_page(dmirror, is_large);
+		if (!dpage) {
+			struct folio *folio;
+			unsigned long i;
+			unsigned long spfn = *src >> MIGRATE_PFN_SHIFT;
+			struct page *src_page;
+
+			if (!is_large)
+				goto next;
+
+			if (!spage && is_large) {
+				nr = HPAGE_PMD_NR;
+			} else {
+				folio = page_folio(spage);
+				nr = folio_nr_pages(folio);
+			}
+
+			for (i = 0; i < nr && addr < args->end; i++) {
+				dpage = dmirror_devmem_alloc_page(dmirror, false);
+				rpage = BACKING_PAGE(dpage);
+				rpage->zone_device_data = dmirror;
+
+				*dst = migrate_pfn(page_to_pfn(dpage)) | write;
+				src_page = pfn_to_page(spfn + i);
+
+				if (spage)
+					copy_highpage(rpage, src_page);
+				else
+					clear_highpage(rpage);
+				src++;
+				dst++;
+				addr += PAGE_SIZE;
+			}
 			continue;
-
-		dpage = dmirror_devmem_alloc_page(mdevice);
-		if (!dpage)
-			continue;
+		}
 
 		rpage = BACKING_PAGE(dpage);
-		if (spage)
-			copy_highpage(rpage, spage);
-		else
-			clear_highpage(rpage);
 
 		/*
 		 * Normally, a device would use the page->zone_device_data to
@@ -684,10 +747,42 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
 
 		pr_debug("migrating from sys to dev pfn src: 0x%lx pfn dst: 0x%lx\n",
 			 page_to_pfn(spage), page_to_pfn(dpage));
-		*dst = migrate_pfn(page_to_pfn(dpage));
-		if ((*src & MIGRATE_PFN_WRITE) ||
-		    (!spage && args->vma->vm_flags & VM_WRITE))
-			*dst |= MIGRATE_PFN_WRITE;
+
+		*dst = migrate_pfn(page_to_pfn(dpage)) | write;
+
+		if (is_large) {
+			int i;
+			struct folio *folio = page_folio(dpage);
+			*dst |= MIGRATE_PFN_COMPOUND;
+
+			if (folio_test_large(folio)) {
+				for (i = 0; i < folio_nr_pages(folio); i++) {
+					struct page *dst_page =
+						pfn_to_page(page_to_pfn(rpage) + i);
+					struct page *src_page =
+						pfn_to_page(page_to_pfn(spage) + i);
+
+					if (spage)
+						copy_highpage(dst_page, src_page);
+					else
+						clear_highpage(dst_page);
+					src++;
+					dst++;
+					addr += PAGE_SIZE;
+				}
+				continue;
+			}
+		}
+
+		if (spage)
+			copy_highpage(rpage, spage);
+		else
+			clear_highpage(rpage);
+
+next:
+		src++;
+		dst++;
+		addr += PAGE_SIZE;
 	}
 }
 
@@ -734,14 +829,17 @@ static int dmirror_migrate_finalize_and_map(struct migrate_vma *args,
 	const unsigned long *src = args->src;
 	const unsigned long *dst = args->dst;
 	unsigned long pfn;
+	const unsigned long start_pfn = start >> PAGE_SHIFT;
+	const unsigned long end_pfn = end >> PAGE_SHIFT;
 
 	/* Map the migrated pages into the device's page tables. */
 	mutex_lock(&dmirror->mutex);
 
-	for (pfn = start >> PAGE_SHIFT; pfn < (end >> PAGE_SHIFT); pfn++,
-								src++, dst++) {
+	for (pfn = start_pfn; pfn < end_pfn; pfn++, src++, dst++) {
 		struct page *dpage;
 		void *entry;
+		int nr, i;
+		struct page *rpage;
 
 		if (!(*src & MIGRATE_PFN_MIGRATE))
 			continue;
@@ -750,13 +848,25 @@ static int dmirror_migrate_finalize_and_map(struct migrate_vma *args,
 		if (!dpage)
 			continue;
 
-		entry = BACKING_PAGE(dpage);
-		if (*dst & MIGRATE_PFN_WRITE)
-			entry = xa_tag_pointer(entry, DPT_XA_TAG_WRITE);
-		entry = xa_store(&dmirror->pt, pfn, entry, GFP_ATOMIC);
-		if (xa_is_err(entry)) {
-			mutex_unlock(&dmirror->mutex);
-			return xa_err(entry);
+		if (*dst & MIGRATE_PFN_COMPOUND)
+			nr = folio_nr_pages(page_folio(dpage));
+		else
+			nr = 1;
+
+		WARN_ON_ONCE(end_pfn < start_pfn + nr);
+
+		rpage = BACKING_PAGE(dpage);
+		VM_WARN_ON(folio_nr_pages(page_folio(rpage)) != nr);
+
+		for (i = 0; i < nr; i++) {
+			entry = folio_page(page_folio(rpage), i);
+			if (*dst & MIGRATE_PFN_WRITE)
+				entry = xa_tag_pointer(entry, DPT_XA_TAG_WRITE);
+			entry = xa_store(&dmirror->pt, pfn + i, entry, GFP_ATOMIC);
+			if (xa_is_err(entry)) {
+				mutex_unlock(&dmirror->mutex);
+				return xa_err(entry);
+			}
 		}
 	}
 
@@ -829,31 +939,66 @@ static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma *args,
 	unsigned long start = args->start;
 	unsigned long end = args->end;
 	unsigned long addr;
+	unsigned int order = 0;
+	int i;
 
-	for (addr = start; addr < end; addr += PAGE_SIZE,
-				       src++, dst++) {
+	for (addr = start; addr < end; ) {
 		struct page *dpage, *spage;
 
 		spage = migrate_pfn_to_page(*src);
-		if (!spage || !(*src & MIGRATE_PFN_MIGRATE))
-			continue;
+		if (!spage || !(*src & MIGRATE_PFN_MIGRATE)) {
+			addr += PAGE_SIZE;
+			goto next;
+		}
 
 		if (WARN_ON(!is_device_private_page(spage) &&
-			    !is_device_coherent_page(spage)))
-			continue;
+			    !is_device_coherent_page(spage))) {
+			addr += PAGE_SIZE;
+			goto next;
+		}
+
 		spage = BACKING_PAGE(spage);
-		dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
-		if (!dpage)
-			continue;
-		pr_debug("migrating from dev to sys pfn src: 0x%lx pfn dst: 0x%lx\n",
-			 page_to_pfn(spage), page_to_pfn(dpage));
+		order = folio_order(page_folio(spage));
 
+		if (order)
+			dpage = folio_page(vma_alloc_folio(GFP_HIGHUSER_MOVABLE,
+						order, args->vma, addr), 0);
+		else
+			dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
+
+		/* Try with smaller pages if large allocation fails */
+		if (!dpage && order) {
+			dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
+			if (!dpage)
+				return VM_FAULT_OOM;
+			order = 0;
+		}
+
+		pr_debug("migrating from sys to dev pfn src: 0x%lx pfn dst: 0x%lx\n",
+				page_to_pfn(spage), page_to_pfn(dpage));
 		lock_page(dpage);
 		xa_erase(&dmirror->pt, addr >> PAGE_SHIFT);
 		copy_highpage(dpage, spage);
 		*dst = migrate_pfn(page_to_pfn(dpage));
 		if (*src & MIGRATE_PFN_WRITE)
 			*dst |= MIGRATE_PFN_WRITE;
+		if (order)
+			*dst |= MIGRATE_PFN_COMPOUND;
+
+		for (i = 0; i < (1 << order); i++) {
+			struct page *src_page;
+			struct page *dst_page;
+
+			src_page = pfn_to_page(page_to_pfn(spage) + i);
+			dst_page = pfn_to_page(page_to_pfn(dpage) + i);
+
+			xa_erase(&dmirror->pt, addr >> PAGE_SHIFT);
+			copy_highpage(dst_page, src_page);
+		}
+next:
+		addr += PAGE_SIZE << order;
+		src += 1 << order;
+		dst += 1 << order;
 	}
 	return 0;
 }
@@ -879,11 +1024,14 @@ static int dmirror_migrate_to_system(struct dmirror *dmirror,
 	unsigned long size = cmd->npages << PAGE_SHIFT;
 	struct mm_struct *mm = dmirror->notifier.mm;
 	struct vm_area_struct *vma;
-	unsigned long src_pfns[32] = { 0 };
-	unsigned long dst_pfns[32] = { 0 };
 	struct migrate_vma args = { 0 };
 	unsigned long next;
 	int ret;
+	unsigned long *src_pfns;
+	unsigned long *dst_pfns;
+
+	src_pfns = kvcalloc(PTRS_PER_PTE, sizeof(*src_pfns), GFP_KERNEL | __GFP_NOFAIL);
+	dst_pfns = kvcalloc(PTRS_PER_PTE, sizeof(*dst_pfns), GFP_KERNEL | __GFP_NOFAIL);
 
 	start = cmd->addr;
 	end = start + size;
@@ -902,7 +1050,7 @@ static int dmirror_migrate_to_system(struct dmirror *dmirror,
 			ret = -EINVAL;
 			goto out;
 		}
-		next = min(end, addr + (ARRAY_SIZE(src_pfns) << PAGE_SHIFT));
+		next = min(end, addr + (PTRS_PER_PTE << PAGE_SHIFT));
 		if (next > vma->vm_end)
 			next = vma->vm_end;
 
@@ -912,7 +1060,7 @@ static int dmirror_migrate_to_system(struct dmirror *dmirror,
 		args.start = addr;
 		args.end = next;
 		args.pgmap_owner = dmirror->mdevice;
-		args.flags = dmirror_select_device(dmirror);
+		args.flags = dmirror_select_device(dmirror) | MIGRATE_VMA_SELECT_COMPOUND;
 
 		ret = migrate_vma_setup(&args);
 		if (ret)
@@ -928,6 +1076,8 @@ static int dmirror_migrate_to_system(struct dmirror *dmirror,
 out:
 	mmap_read_unlock(mm);
 	mmput(mm);
+	kvfree(src_pfns);
+	kvfree(dst_pfns);
 
 	return ret;
 }
@@ -939,12 +1089,12 @@ static int dmirror_migrate_to_device(struct dmirror *dmirror,
 	unsigned long size = cmd->npages << PAGE_SHIFT;
 	struct mm_struct *mm = dmirror->notifier.mm;
 	struct vm_area_struct *vma;
-	unsigned long src_pfns[32] = { 0 };
-	unsigned long dst_pfns[32] = { 0 };
 	struct dmirror_bounce bounce;
 	struct migrate_vma args = { 0 };
 	unsigned long next;
 	int ret;
+	unsigned long *src_pfns;
+	unsigned long *dst_pfns;
 
 	start = cmd->addr;
 	end = start + size;
@@ -955,6 +1105,18 @@ static int dmirror_migrate_to_device(struct dmirror *dmirror,
 	if (!mmget_not_zero(mm))
 		return -EINVAL;
 
+	ret = -ENOMEM;
+	src_pfns = kvcalloc(PTRS_PER_PTE, sizeof(*src_pfns),
+			  GFP_KERNEL | __GFP_NOFAIL);
+	if (!src_pfns)
+		goto free_mem;
+
+	dst_pfns = kvcalloc(PTRS_PER_PTE, sizeof(*dst_pfns),
+			  GFP_KERNEL | __GFP_NOFAIL);
+	if (!dst_pfns)
+		goto free_mem;
+
+	ret = 0;
 	mmap_read_lock(mm);
 	for (addr = start; addr < end; addr = next) {
 		vma = vma_lookup(mm, addr);
@@ -962,7 +1124,7 @@ static int dmirror_migrate_to_device(struct dmirror *dmirror,
 			ret = -EINVAL;
 			goto out;
 		}
-		next = min(end, addr + (ARRAY_SIZE(src_pfns) << PAGE_SHIFT));
+		next = min(end, addr + (PTRS_PER_PTE << PAGE_SHIFT));
 		if (next > vma->vm_end)
 			next = vma->vm_end;
 
@@ -972,7 +1134,8 @@ static int dmirror_migrate_to_device(struct dmirror *dmirror,
 		args.start = addr;
 		args.end = next;
 		args.pgmap_owner = dmirror->mdevice;
-		args.flags = MIGRATE_VMA_SELECT_SYSTEM;
+		args.flags = MIGRATE_VMA_SELECT_SYSTEM |
+				MIGRATE_VMA_SELECT_COMPOUND;
 		ret = migrate_vma_setup(&args);
 		if (ret)
 			goto out;
@@ -992,7 +1155,7 @@ static int dmirror_migrate_to_device(struct dmirror *dmirror,
 	 */
 	ret = dmirror_bounce_init(&bounce, start, size);
 	if (ret)
-		return ret;
+		goto free_mem;
 	mutex_lock(&dmirror->mutex);
 	ret = dmirror_do_read(dmirror, start, end, &bounce);
 	mutex_unlock(&dmirror->mutex);
@@ -1003,11 +1166,14 @@ static int dmirror_migrate_to_device(struct dmirror *dmirror,
 	}
 	cmd->cpages = bounce.cpages;
 	dmirror_bounce_fini(&bounce);
-	return ret;
+	goto free_mem;
 
 out:
 	mmap_read_unlock(mm);
 	mmput(mm);
+free_mem:
+	kvfree(src_pfns);
+	kvfree(dst_pfns);
 	return ret;
 }
 
@@ -1200,6 +1366,7 @@ static void dmirror_device_evict_chunk(struct dmirror_chunk *chunk)
 	unsigned long i;
 	unsigned long *src_pfns;
 	unsigned long *dst_pfns;
+	unsigned int order = 0;
 
 	src_pfns = kvcalloc(npages, sizeof(*src_pfns), GFP_KERNEL | __GFP_NOFAIL);
 	dst_pfns = kvcalloc(npages, sizeof(*dst_pfns), GFP_KERNEL | __GFP_NOFAIL);
@@ -1215,13 +1382,25 @@ static void dmirror_device_evict_chunk(struct dmirror_chunk *chunk)
 		if (WARN_ON(!is_device_private_page(spage) &&
 			    !is_device_coherent_page(spage)))
 			continue;
+
+		order = folio_order(page_folio(spage));
 		spage = BACKING_PAGE(spage);
-		dpage = alloc_page(GFP_HIGHUSER_MOVABLE | __GFP_NOFAIL);
+		if (src_pfns[i] & MIGRATE_PFN_COMPOUND) {
+			dpage = folio_page(folio_alloc(GFP_HIGHUSER_MOVABLE,
+					      order), 0);
+		} else {
+			dpage = alloc_page(GFP_HIGHUSER_MOVABLE | __GFP_NOFAIL);
+			order = 0;
+		}
+
+		/* TODO Support splitting here */
 		lock_page(dpage);
-		copy_highpage(dpage, spage);
 		dst_pfns[i] = migrate_pfn(page_to_pfn(dpage));
 		if (src_pfns[i] & MIGRATE_PFN_WRITE)
 			dst_pfns[i] |= MIGRATE_PFN_WRITE;
+		if (order)
+			dst_pfns[i] |= MIGRATE_PFN_COMPOUND;
+		folio_copy(page_folio(dpage), page_folio(spage));
 	}
 	migrate_device_pages(src_pfns, dst_pfns, npages);
 	migrate_device_finalize(src_pfns, dst_pfns, npages);
@@ -1234,7 +1413,12 @@ static void dmirror_remove_free_pages(struct dmirror_chunk *devmem)
 {
 	struct dmirror_device *mdevice = devmem->mdevice;
 	struct page *page;
+	struct folio *folio;
 
+
+	for (folio = mdevice->free_folios; folio; folio = folio_zone_device_data(folio))
+		if (dmirror_page_to_chunk(folio_page(folio, 0)) == devmem)
+			mdevice->free_folios = folio_zone_device_data(folio);
 	for (page = mdevice->free_pages; page; page = page->zone_device_data)
 		if (dmirror_page_to_chunk(page) == devmem)
 			mdevice->free_pages = page->zone_device_data;
@@ -1265,6 +1449,7 @@ static void dmirror_device_remove_chunks(struct dmirror_device *mdevice)
 		mdevice->devmem_count = 0;
 		mdevice->devmem_capacity = 0;
 		mdevice->free_pages = NULL;
+		mdevice->free_folios = NULL;
 		kfree(mdevice->devmem_chunks);
 		mdevice->devmem_chunks = NULL;
 	}
@@ -1378,18 +1563,30 @@ static void dmirror_devmem_free(struct page *page)
 {
 	struct page *rpage = BACKING_PAGE(page);
 	struct dmirror_device *mdevice;
+	struct folio *folio = page_folio(rpage);
+	unsigned int order = folio_order(folio);
 
-	if (rpage != page)
-		__free_page(rpage);
+	if (rpage != page) {
+		if (order)
+			__free_pages(rpage, order);
+		else
+			__free_page(rpage);
+		rpage = NULL;
+	}
 
 	mdevice = dmirror_page_to_device(page);
 	spin_lock(&mdevice->lock);
 
 	/* Return page to our allocator if not freeing the chunk */
 	if (!dmirror_page_to_chunk(page)->remove) {
-		mdevice->cfree++;
-		page->zone_device_data = mdevice->free_pages;
-		mdevice->free_pages = page;
+		mdevice->cfree += 1 << order;
+		if (order) {
+			page->zone_device_data = mdevice->free_folios;
+			mdevice->free_folios = page_folio(page);
+		} else {
+			page->zone_device_data = mdevice->free_pages;
+			mdevice->free_pages = page;
+		}
 	}
 	spin_unlock(&mdevice->lock);
 }
@@ -1397,11 +1594,10 @@ static void dmirror_devmem_free(struct page *page)
 static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
 {
 	struct migrate_vma args = { 0 };
-	unsigned long src_pfns = 0;
-	unsigned long dst_pfns = 0;
 	struct page *rpage;
 	struct dmirror *dmirror;
-	vm_fault_t ret;
+	vm_fault_t ret = 0;
+	unsigned int order, nr;
 
 	/*
 	 * Normally, a device would use the page->zone_device_data to point to
@@ -1412,21 +1608,38 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
 	dmirror = rpage->zone_device_data;
 
 	/* FIXME demonstrate how we can adjust migrate range */
+	order = folio_order(page_folio(vmf->page));
+	nr = 1 << order;
+
+	/*
+	 * Consider a per-cpu cache of src and dst pfns, but with
+	 * a large number of cpus that might not scale well.
+	 */
+	args.start = ALIGN_DOWN(vmf->address, (PAGE_SIZE << order));
 	args.vma = vmf->vma;
-	args.start = vmf->address;
-	args.end = args.start + PAGE_SIZE;
-	args.src = &src_pfns;
-	args.dst = &dst_pfns;
+	args.end = args.start + (PAGE_SIZE << order);
+
+	nr = (args.end - args.start) >> PAGE_SHIFT;
+	args.src = kcalloc(nr, sizeof(unsigned long), GFP_KERNEL);
+	args.dst = kcalloc(nr, sizeof(unsigned long), GFP_KERNEL);
 	args.pgmap_owner = dmirror->mdevice;
 	args.flags = dmirror_select_device(dmirror);
 	args.fault_page = vmf->page;
 
+	if (!args.src || !args.dst) {
+		ret = VM_FAULT_OOM;
+		goto err;
+	}
+
+	if (order)
+		args.flags |= MIGRATE_VMA_SELECT_COMPOUND;
+
 	if (migrate_vma_setup(&args))
 		return VM_FAULT_SIGBUS;
 
 	ret = dmirror_devmem_fault_alloc_and_copy(&args, dmirror);
 	if (ret)
-		return ret;
+		goto err;
 	migrate_vma_pages(&args);
 	/*
 	 * No device finalize step is needed since
@@ -1434,7 +1647,10 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
 	 * invalidated the device page table.
 	 */
 	migrate_vma_finalize(&args);
-	return 0;
+err:
+	kfree(args.src);
+	kfree(args.dst);
+	return ret;
 }
 
 static const struct dev_pagemap_ops dmirror_devmem_ops = {
@@ -1465,7 +1681,7 @@ static int dmirror_device_init(struct dmirror_device *mdevice, int id)
 		return ret;
 
 	/* Build a list of free ZONE_DEVICE struct pages */
-	return dmirror_allocate_chunk(mdevice, NULL);
+	return dmirror_allocate_chunk(mdevice, NULL, false);
 }
 
 static void dmirror_device_remove(struct dmirror_device *mdevice)
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [v2 06/11] mm/memremap: add folio_split support
  2025-07-30  9:21 [v2 00/11] THP support for zone device page migration Balbir Singh
                   ` (4 preceding siblings ...)
  2025-07-30  9:21 ` [v2 05/11] lib/test_hmm: test cases and support for zone device private THP Balbir Singh
@ 2025-07-30  9:21 ` Balbir Singh
  2025-07-30  9:21 ` [v2 07/11] mm/thp: add split during migration support Balbir Singh
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 71+ messages in thread
From: Balbir Singh @ 2025-07-30  9:21 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Balbir Singh, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, David Hildenbrand,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
	Ralph Campbell, Mika Penttilä, Matthew Brost,
	Francois Dugast

When a zone device folio is split (via a huge pmd folio split), the
folio_split driver callback is invoked to let the device driver know
that the folio has been split into a smaller order.

The HMM test driver has been updated to handle the split. Since the
test driver uses backing pages (which are used to create the mirror
device), it needs a mechanism to reorganize those backing pages into
pages of the right order again. This is supported by exporting
prep_compound_page().
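
For a driver other than the test driver, wiring up the callback would
look roughly like the sketch below (the my_devmem_* names are
hypothetical). The @tail argument is NULL on the final invocation,
which marks the end of the split of @head; dmirror_devmem_folio_split()
in this patch is the full version for the test driver.

static void my_devmem_folio_split(struct folio *head, struct folio *tail)
{
	if (!tail)
		return;		/* last callback: split of @head is done */

	/* the core skips this fixup when a folio_split callback exists */
	tail->pgmap = head->pgmap;
	folio_page(tail, 0)->mapping = folio_page(head, 0)->mapping;

	/* carry driver-private state over to the newly created folio */
	tail->page.zone_device_data = head->page.zone_device_data;
}

static const struct dev_pagemap_ops my_devmem_ops = {
	.page_free	= my_devmem_page_free,		/* hypothetical */
	.migrate_to_ram	= my_devmem_migrate_to_ram,	/* hypothetical */
	.folio_split	= my_devmem_folio_split,
};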

Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>

Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
 include/linux/memremap.h | 29 +++++++++++++++++++++++++++++
 include/linux/mm.h       |  1 +
 lib/test_hmm.c           | 35 +++++++++++++++++++++++++++++++++++
 mm/huge_memory.c         |  9 ++++++++-
 4 files changed, 73 insertions(+), 1 deletion(-)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 0c5141a7d58c..20f4b5ebbc93 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -100,6 +100,13 @@ struct dev_pagemap_ops {
 	 */
 	int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long pfn,
 			      unsigned long nr_pages, int mf_flags);
+
+	/*
+	 * Used for private (un-addressable) device memory only.
+	 * This callback is used when a folio is split into
+	 * smaller folios.
+	 */
+	void (*folio_split)(struct folio *head, struct folio *tail);
 };
 
 #define PGMAP_ALTMAP_VALID	(1 << 0)
@@ -229,6 +236,23 @@ static inline void zone_device_page_init(struct page *page)
 	zone_device_folio_init(folio, 0);
 }
 
+static inline void zone_device_private_split_cb(struct folio *original_folio,
+						struct folio *new_folio)
+{
+	if (folio_is_device_private(original_folio)) {
+		if (!original_folio->pgmap->ops->folio_split) {
+			if (new_folio) {
+				new_folio->pgmap = original_folio->pgmap;
+				new_folio->page.mapping =
+					original_folio->page.mapping;
+			}
+		} else {
+			original_folio->pgmap->ops->folio_split(original_folio,
+								 new_folio);
+		}
+	}
+}
+
 #else
 static inline void *devm_memremap_pages(struct device *dev,
 		struct dev_pagemap *pgmap)
@@ -263,6 +287,11 @@ static inline unsigned long memremap_compat_align(void)
 {
 	return PAGE_SIZE;
 }
+
+static inline void zone_device_private_split_cb(struct folio *original_folio,
+						struct folio *new_folio)
+{
+}
 #endif /* CONFIG_ZONE_DEVICE */
 
 static inline void put_dev_pagemap(struct dev_pagemap *pgmap)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8e3a4c5b78ff..d0ecf8386dd9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1183,6 +1183,7 @@ static inline struct folio *virt_to_folio(const void *x)
 void __folio_put(struct folio *folio);
 
 void split_page(struct page *page, unsigned int order);
+void prep_compound_page(struct page *page, unsigned int order);
 void folio_copy(struct folio *dst, struct folio *src);
 int folio_mc_copy(struct folio *dst, struct folio *src);
 
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 4850f9026694..a8d0d24b4b7a 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -1653,9 +1653,44 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
 	return ret;
 }
 
+static void dmirror_devmem_folio_split(struct folio *head, struct folio *tail)
+{
+	struct page *rpage = BACKING_PAGE(folio_page(head, 0));
+	struct page *rpage_tail;
+	struct folio *rfolio;
+	unsigned long offset = 0;
+
+	if (!rpage) {
+		tail->page.zone_device_data = NULL;
+		return;
+	}
+
+	rfolio = page_folio(rpage);
+
+	if (tail == NULL) {
+		folio_reset_order(rfolio);
+		rfolio->mapping = NULL;
+		folio_set_count(rfolio, 1);
+		return;
+	}
+
+	offset = folio_pfn(tail) - folio_pfn(head);
+
+	rpage_tail = folio_page(rfolio, offset);
+	tail->page.zone_device_data = rpage_tail;
+	rpage_tail->zone_device_data = rpage->zone_device_data;
+	clear_compound_head(rpage_tail);
+	rpage_tail->mapping = NULL;
+
+	folio_page(tail, 0)->mapping = folio_page(head, 0)->mapping;
+	tail->pgmap = head->pgmap;
+	folio_set_count(page_folio(rpage_tail), 1);
+}
+
 static const struct dev_pagemap_ops dmirror_devmem_ops = {
 	.page_free	= dmirror_devmem_free,
 	.migrate_to_ram	= dmirror_devmem_fault,
+	.folio_split	= dmirror_devmem_folio_split,
 };
 
 static int dmirror_device_init(struct dmirror_device *mdevice, int id)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 713dd433d352..75b368e7e33f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2955,9 +2955,16 @@ int split_device_private_folio(struct folio *folio)
 	VM_WARN_ON(ret);
 	for (new_folio = folio_next(folio); new_folio != end_folio;
 					new_folio = folio_next(new_folio)) {
+		zone_device_private_split_cb(folio, new_folio);
 		folio_ref_unfreeze(new_folio, 1 + folio_expected_ref_count(
 								new_folio));
 	}
+
+	/*
+	 * Mark the end of the folio split for device private THP
+	 * split
+	 */
+	zone_device_private_split_cb(folio, NULL);
 	folio_ref_unfreeze(folio, 1 + folio_expected_ref_count(folio));
 	return ret;
 }
@@ -3979,7 +3986,6 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 
 		ret = __split_unmapped_folio(folio, new_order, split_at, &xas,
 					     mapping, uniform_split);
-
 		/*
 		 * Unfreeze after-split folios and put them back to the right
 		 * list. @folio should be kept frozon until page cache
@@ -4030,6 +4036,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 			__filemap_remove_folio(new_folio, NULL);
 			folio_put_refs(new_folio, nr_pages);
 		}
+
 		/*
 		 * Unfreeze @folio only after all page cache entries, which
 		 * used to point to it, have been updated with new folios.
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [v2 07/11] mm/thp: add split during migration support
  2025-07-30  9:21 [v2 00/11] THP support for zone device page migration Balbir Singh
                   ` (5 preceding siblings ...)
  2025-07-30  9:21 ` [v2 06/11] mm/memremap: add folio_split support Balbir Singh
@ 2025-07-30  9:21 ` Balbir Singh
  2025-07-31 10:04   ` kernel test robot
  2025-07-30  9:21 ` [v2 08/11] lib/test_hmm: add test case for split pages Balbir Singh
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 71+ messages in thread
From: Balbir Singh @ 2025-07-30  9:21 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Balbir Singh, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, David Hildenbrand,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
	Ralph Campbell, Mika Penttilä, Matthew Brost,
	Francois Dugast

Support splitting pages during THP zone device migration as needed.
The common case is that, after setup, the destination is unable to
allocate MIGRATE_PFN_COMPOUND pages during the migrate phase.

Add a new routine, migrate_vma_split_pages(), to support splitting
pages that have already been isolated. The pages being migrated are
already unmapped and marked for migration during setup (via unmap).
__folio_split() and __split_huge_page_to_list_to_order() take an
additional 'unmapped' argument to avoid unmapping and remapping these
pages and unlocking/putting the folio.
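
From the driver's point of view the split is transparent: if a compound
destination cannot be allocated, the driver fills per-page dst entries
and leaves MIGRATE_PFN_COMPOUND clear, and __migrate_device_pages()
splits the already-unmapped source folio via migrate_vma_split_pages().
A sketch of that driver-side fallback (the my_alloc_* helpers are
hypothetical):

static void my_fill_dst_for_thp_src(struct migrate_vma *args, unsigned long i)
{
	struct page *dpage = my_alloc_device_thp();	/* hypothetical */
	unsigned long j;

	if (dpage) {
		args->dst[i] = migrate_pfn(page_to_pfn(dpage)) |
			       MIGRATE_PFN_COMPOUND;
		return;
	}

	/* per-page fallback: the core splits the source folio for us */
	for (j = 0; j < HPAGE_PMD_NR; j++) {
		struct page *p = my_alloc_device_page();	/* hypothetical */

		if (!p)
			break;
		args->dst[i + j] = migrate_pfn(page_to_pfn(p));
	}
}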

Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>

Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
 include/linux/huge_mm.h | 11 +++++--
 mm/huge_memory.c        | 46 ++++++++++++++-------------
 mm/migrate_device.c     | 69 ++++++++++++++++++++++++++++++++++-------
 3 files changed, 91 insertions(+), 35 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 56fdcaf7604b..19e7e3b7c2b7 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -343,9 +343,9 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
 		vm_flags_t vm_flags);
 
 bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
-int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
-		unsigned int new_order);
 int split_device_private_folio(struct folio *folio);
+int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
+		unsigned int new_order, bool unmapped);
 int min_order_for_split(struct folio *folio);
 int split_folio_to_list(struct folio *folio, struct list_head *list);
 bool uniform_split_supported(struct folio *folio, unsigned int new_order,
@@ -354,6 +354,13 @@ bool non_uniform_split_supported(struct folio *folio, unsigned int new_order,
 		bool warns);
 int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
 		struct list_head *list);
+
+static inline int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
+		unsigned int new_order)
+{
+	return __split_huge_page_to_list_to_order(page, list, new_order, false);
+}
+
 /*
  * try_folio_split - try to split a @folio at @page using non uniform split.
  * @folio: folio to be split
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 75b368e7e33f..1fc1efa219c8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3538,15 +3538,6 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
 		new_folio->mapping = folio->mapping;
 		new_folio->index = folio->index + i;
 
-		/*
-		 * page->private should not be set in tail pages. Fix up and warn once
-		 * if private is unexpectedly set.
-		 */
-		if (unlikely(new_folio->private)) {
-			VM_WARN_ON_ONCE_PAGE(true, new_head);
-			new_folio->private = NULL;
-		}
-
 		if (folio_test_swapcache(folio))
 			new_folio->swap.val = folio->swap.val + i;
 
@@ -3775,6 +3766,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
  * @lock_at: a page within @folio to be left locked to caller
  * @list: after-split folios will be put on it if non NULL
  * @uniform_split: perform uniform split or not (non-uniform split)
+ * @unmapped: The pages are already unmapped; they are migration entries.
  *
  * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
  * It is in charge of checking whether the split is supported or not and
@@ -3790,7 +3782,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
  */
 static int __folio_split(struct folio *folio, unsigned int new_order,
 		struct page *split_at, struct page *lock_at,
-		struct list_head *list, bool uniform_split)
+		struct list_head *list, bool uniform_split, bool unmapped)
 {
 	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
 	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
@@ -3840,13 +3832,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 		 * is taken to serialise against parallel split or collapse
 		 * operations.
 		 */
-		anon_vma = folio_get_anon_vma(folio);
-		if (!anon_vma) {
-			ret = -EBUSY;
-			goto out;
+		if (!unmapped) {
+			anon_vma = folio_get_anon_vma(folio);
+			if (!anon_vma) {
+				ret = -EBUSY;
+				goto out;
+			}
+			anon_vma_lock_write(anon_vma);
 		}
 		mapping = NULL;
-		anon_vma_lock_write(anon_vma);
 	} else {
 		unsigned int min_order;
 		gfp_t gfp;
@@ -3913,7 +3907,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 		goto out_unlock;
 	}
 
-	unmap_folio(folio);
+	if (!unmapped)
+		unmap_folio(folio);
 
 	/* block interrupt reentry in xa_lock and spinlock */
 	local_irq_disable();
@@ -4000,10 +3995,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 
 			next = folio_next(new_folio);
 
+			zone_device_private_split_cb(folio, new_folio);
+
 			expected_refs = folio_expected_ref_count(new_folio) + 1;
 			folio_ref_unfreeze(new_folio, expected_refs);
 
-			lru_add_split_folio(folio, new_folio, lruvec, list);
+			if (!unmapped)
+				lru_add_split_folio(folio, new_folio, lruvec, list);
 
 			/*
 			 * Anonymous folio with swap cache.
@@ -4037,6 +4035,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 			folio_put_refs(new_folio, nr_pages);
 		}
 
+		zone_device_private_split_cb(folio, NULL);
 		/*
 		 * Unfreeze @folio only after all page cache entries, which
 		 * used to point to it, have been updated with new folios.
@@ -4060,11 +4059,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 
 	local_irq_enable();
 
+	if (unmapped)
+		return ret;
+
 	if (nr_shmem_dropped)
 		shmem_uncharge(mapping->host, nr_shmem_dropped);
 
 	if (!ret && is_anon)
 		remap_flags = RMP_USE_SHARED_ZEROPAGE;
+
 	remap_page(folio, 1 << order, remap_flags);
 
 	/*
@@ -4149,12 +4152,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
  * Returns -EINVAL when trying to split to an order that is incompatible
  * with the folio. Splitting to order 0 is compatible with all folios.
  */
-int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
-				     unsigned int new_order)
+int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
+				     unsigned int new_order, bool unmapped)
 {
 	struct folio *folio = page_folio(page);
 
-	return __folio_split(folio, new_order, &folio->page, page, list, true);
+	return __folio_split(folio, new_order, &folio->page, page, list, true,
+				unmapped);
 }
 
 /*
@@ -4183,7 +4187,7 @@ int folio_split(struct folio *folio, unsigned int new_order,
 		struct page *split_at, struct list_head *list)
 {
 	return __folio_split(folio, new_order, split_at, &folio->page, list,
-			false);
+			false, false);
 }
 
 int min_order_for_split(struct folio *folio)
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 4c3334cc3228..49962ea19109 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -816,6 +816,29 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
 		src[i] &= ~MIGRATE_PFN_MIGRATE;
 	return 0;
 }
+
+static int migrate_vma_split_pages(struct migrate_vma *migrate,
+					unsigned long idx, unsigned long addr,
+					struct folio *folio)
+{
+	unsigned long i;
+	unsigned long pfn;
+	unsigned long flags;
+	int ret = 0;
+
+	folio_get(folio);
+	split_huge_pmd_address(migrate->vma, addr, true);
+	ret = __split_huge_page_to_list_to_order(folio_page(folio, 0), NULL,
+							0, true);
+	if (ret)
+		return ret;
+	migrate->src[idx] &= ~MIGRATE_PFN_COMPOUND;
+	flags = migrate->src[idx] & ((1UL << MIGRATE_PFN_SHIFT) - 1);
+	pfn = migrate->src[idx] >> MIGRATE_PFN_SHIFT;
+	for (i = 1; i < HPAGE_PMD_NR; i++)
+		migrate->src[i+idx] = migrate_pfn(pfn + i) | flags;
+	return ret;
+}
 #else /* !CONFIG_ARCH_ENABLE_THP_MIGRATION */
 static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
 					 unsigned long addr,
@@ -825,6 +848,11 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
 {
 	return 0;
 }
+
+static void migrate_vma_split_pages(struct migrate_vma *migrate,
+					unsigned long idx, unsigned long addr,
+					struct folio *folio)
+{}
 #endif
 
 /*
@@ -974,8 +1002,9 @@ static void __migrate_device_pages(unsigned long *src_pfns,
 				struct migrate_vma *migrate)
 {
 	struct mmu_notifier_range range;
-	unsigned long i;
+	unsigned long i, j;
 	bool notified = false;
+	unsigned long addr;
 
 	for (i = 0; i < npages; ) {
 		struct page *newpage = migrate_pfn_to_page(dst_pfns[i]);
@@ -1017,12 +1046,16 @@ static void __migrate_device_pages(unsigned long *src_pfns,
 				(!(dst_pfns[i] & MIGRATE_PFN_COMPOUND))) {
 				nr = HPAGE_PMD_NR;
 				src_pfns[i] &= ~MIGRATE_PFN_COMPOUND;
-				src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
-				goto next;
+			} else {
+				nr = 1;
 			}
 
-			migrate_vma_insert_page(migrate, addr, &dst_pfns[i],
-						&src_pfns[i]);
+			for (j = 0; j < nr && i + j < npages; j++) {
+				src_pfns[i+j] |= MIGRATE_PFN_MIGRATE;
+				migrate_vma_insert_page(migrate,
+					addr + j * PAGE_SIZE,
+					&dst_pfns[i+j], &src_pfns[i+j]);
+			}
 			goto next;
 		}
 
@@ -1044,7 +1077,14 @@ static void __migrate_device_pages(unsigned long *src_pfns,
 							 MIGRATE_PFN_COMPOUND);
 					goto next;
 				}
-				src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
+				nr = 1 << folio_order(folio);
+				addr = migrate->start + i * PAGE_SIZE;
+				if (migrate_vma_split_pages(migrate, i, addr,
+								folio)) {
+					src_pfns[i] &= ~(MIGRATE_PFN_MIGRATE |
+							 MIGRATE_PFN_COMPOUND);
+					goto next;
+				}
 			} else if ((src_pfns[i] & MIGRATE_PFN_MIGRATE) &&
 				(dst_pfns[i] & MIGRATE_PFN_COMPOUND) &&
 				!(src_pfns[i] & MIGRATE_PFN_COMPOUND)) {
@@ -1079,12 +1119,17 @@ static void __migrate_device_pages(unsigned long *src_pfns,
 		BUG_ON(folio_test_writeback(folio));
 
 		if (migrate && migrate->fault_page == page)
-			extra_cnt = 1;
-		r = folio_migrate_mapping(mapping, newfolio, folio, extra_cnt);
-		if (r != MIGRATEPAGE_SUCCESS)
-			src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
-		else
-			folio_migrate_flags(newfolio, folio);
+			extra_cnt++;
+		for (j = 0; j < nr && i + j < npages; j++) {
+			folio = page_folio(migrate_pfn_to_page(src_pfns[i+j]));
+			newfolio = page_folio(migrate_pfn_to_page(dst_pfns[i+j]));
+
+			r = folio_migrate_mapping(mapping, newfolio, folio, extra_cnt);
+			if (r != MIGRATEPAGE_SUCCESS)
+				src_pfns[i+j] &= ~MIGRATE_PFN_MIGRATE;
+			else
+				folio_migrate_flags(newfolio, folio);
+		}
 next:
 		i += nr;
 	}
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [v2 08/11] lib/test_hmm: add test case for split pages
  2025-07-30  9:21 [v2 00/11] THP support for zone device page migration Balbir Singh
                   ` (6 preceding siblings ...)
  2025-07-30  9:21 ` [v2 07/11] mm/thp: add split during migration support Balbir Singh
@ 2025-07-30  9:21 ` Balbir Singh
  2025-07-30  9:21 ` [v2 09/11] selftests/mm/hmm-tests: new tests for zone device THP migration Balbir Singh
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 71+ messages in thread
From: Balbir Singh @ 2025-07-30  9:21 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Balbir Singh, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, David Hildenbrand,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
	Ralph Campbell, Mika Penttilä, Matthew Brost,
	Francois Dugast

Add a new flag, HMM_DMIRROR_FLAG_FAIL_ALLOC, to emulate failure to
allocate a large page. This exercises the code paths involved in
splitting pages during migration.
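
Userspace arms the failure through the new ioctl; the flag value is
carried in cmd.npages, since the driver stores cmd.npages into
dmirror->flags. A minimal sketch (the helper name is hypothetical,
error handling and the subsequent migration call are elided):

static int my_arm_fail_alloc(int fd)	/* fd: open test_hmm device */
{
	struct hmm_dmirror_cmd cmd = { 0 };

	cmd.npages = HMM_DMIRROR_FLAG_FAIL_ALLOC;
	return ioctl(fd, HMM_DMIRROR_FLAGS, &cmd);
}

The next large device page allocation in the driver then fails once,
forcing the split path that the new selftests exercise.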

Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>

Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
 lib/test_hmm.c      | 61 ++++++++++++++++++++++++++++++---------------
 lib/test_hmm_uapi.h |  3 +++
 2 files changed, 44 insertions(+), 20 deletions(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index a8d0d24b4b7a..341ae2af44ec 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -92,6 +92,7 @@ struct dmirror {
 	struct xarray			pt;
 	struct mmu_interval_notifier	notifier;
 	struct mutex			mutex;
+	__u64			flags;
 };
 
 /*
@@ -699,7 +700,12 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
 		     page_to_pfn(spage)))
 			goto next;
 
-		dpage = dmirror_devmem_alloc_page(dmirror, is_large);
+		if (dmirror->flags & HMM_DMIRROR_FLAG_FAIL_ALLOC) {
+			dmirror->flags &= ~HMM_DMIRROR_FLAG_FAIL_ALLOC;
+			dpage = NULL;
+		} else
+			dpage = dmirror_devmem_alloc_page(dmirror, is_large);
+
 		if (!dpage) {
 			struct folio *folio;
 			unsigned long i;
@@ -959,44 +965,55 @@ static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma *args,
 
 		spage = BACKING_PAGE(spage);
 		order = folio_order(page_folio(spage));
-
 		if (order)
+			*dst = MIGRATE_PFN_COMPOUND;
+		if (*src & MIGRATE_PFN_WRITE)
+			*dst |= MIGRATE_PFN_WRITE;
+
+		if (dmirror->flags & HMM_DMIRROR_FLAG_FAIL_ALLOC) {
+			dmirror->flags &= ~HMM_DMIRROR_FLAG_FAIL_ALLOC;
+			*dst &= ~MIGRATE_PFN_COMPOUND;
+			dpage = NULL;
+		} else if (order) {
 			dpage = folio_page(vma_alloc_folio(GFP_HIGHUSER_MOVABLE,
 						order, args->vma, addr), 0);
-		else
-			dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
-
-		/* Try with smaller pages if large allocation fails */
-		if (!dpage && order) {
+		} else {
 			dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
-			if (!dpage)
-				return VM_FAULT_OOM;
-			order = 0;
 		}
 
+		if (!dpage && !order)
+			return VM_FAULT_OOM;
+
 		pr_debug("migrating from sys to dev pfn src: 0x%lx pfn dst: 0x%lx\n",
 				page_to_pfn(spage), page_to_pfn(dpage));
-		lock_page(dpage);
-		xa_erase(&dmirror->pt, addr >> PAGE_SHIFT);
-		copy_highpage(dpage, spage);
-		*dst = migrate_pfn(page_to_pfn(dpage));
-		if (*src & MIGRATE_PFN_WRITE)
-			*dst |= MIGRATE_PFN_WRITE;
-		if (order)
-			*dst |= MIGRATE_PFN_COMPOUND;
+
+		if (dpage) {
+			lock_page(dpage);
+			*dst |= migrate_pfn(page_to_pfn(dpage));
+		}
 
 		for (i = 0; i < (1 << order); i++) {
 			struct page *src_page;
 			struct page *dst_page;
 
+			/* Try with smaller pages if large allocation fails */
+			if (!dpage && order) {
+				dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
+				lock_page(dpage);
+				dst[i] = migrate_pfn(page_to_pfn(dpage));
+				dst_page = pfn_to_page(page_to_pfn(dpage));
+				dpage = NULL; /* For the next iteration */
+			} else {
+				dst_page = pfn_to_page(page_to_pfn(dpage) + i);
+			}
+
 			src_page = pfn_to_page(page_to_pfn(spage) + i);
-			dst_page = pfn_to_page(page_to_pfn(dpage) + i);
 
 			xa_erase(&dmirror->pt, addr >> PAGE_SHIFT);
+			addr += PAGE_SIZE;
 			copy_highpage(dst_page, src_page);
 		}
 next:
-		addr += PAGE_SIZE << order;
 		src += 1 << order;
 		dst += 1 << order;
 	}
@@ -1514,6 +1531,10 @@ static long dmirror_fops_unlocked_ioctl(struct file *filp,
 		dmirror_device_remove_chunks(dmirror->mdevice);
 		ret = 0;
 		break;
+	case HMM_DMIRROR_FLAGS:
+		dmirror->flags = cmd.npages;
+		ret = 0;
+		break;
 
 	default:
 		return -EINVAL;
diff --git a/lib/test_hmm_uapi.h b/lib/test_hmm_uapi.h
index 8c818a2cf4f6..f94c6d457338 100644
--- a/lib/test_hmm_uapi.h
+++ b/lib/test_hmm_uapi.h
@@ -37,6 +37,9 @@ struct hmm_dmirror_cmd {
 #define HMM_DMIRROR_EXCLUSIVE		_IOWR('H', 0x05, struct hmm_dmirror_cmd)
 #define HMM_DMIRROR_CHECK_EXCLUSIVE	_IOWR('H', 0x06, struct hmm_dmirror_cmd)
 #define HMM_DMIRROR_RELEASE		_IOWR('H', 0x07, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_FLAGS		_IOWR('H', 0x08, struct hmm_dmirror_cmd)
+
+#define HMM_DMIRROR_FLAG_FAIL_ALLOC	(1ULL << 0)
 
 /*
  * Values returned in hmm_dmirror_cmd.ptr for HMM_DMIRROR_SNAPSHOT.
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [v2 09/11] selftests/mm/hmm-tests: new tests for zone device THP migration
  2025-07-30  9:21 [v2 00/11] THP support for zone device page migration Balbir Singh
                   ` (7 preceding siblings ...)
  2025-07-30  9:21 ` [v2 08/11] lib/test_hmm: add test case for split pages Balbir Singh
@ 2025-07-30  9:21 ` Balbir Singh
  2025-07-30  9:21 ` [v2 10/11] gpu/drm/nouveau: add THP migration support Balbir Singh
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 71+ messages in thread
From: Balbir Singh @ 2025-07-30  9:21 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Balbir Singh, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, David Hildenbrand,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
	Ralph Campbell, Mika Penttilä, Matthew Brost,
	Francois Dugast

Add new tests for migrating anon THP pages, including anon_huge,
anon_huge_zero and error cases involving forced splitting of pages
during migration.
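
For reference, all of the new tests share the same setup pattern:
over-allocate the VMA, align the test window to the THP size, request a
huge page with madvise(), and then drive the migration through the
existing hmm-tests.c helpers. A condensed sketch of that common path
(error handling and the per-test checks omitted):

	size = TWOMEG;
	buffer->ptr = mmap(NULL, 2 * size, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, buffer->fd, 0);
	map = (void *)ALIGN((uintptr_t)buffer->ptr, size);	/* 2MB aligned */
	ret = madvise(map, size, MADV_HUGEPAGE);		/* request a THP */
	buffer->ptr = map;
	npages = size >> self->page_shift;
	ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);

The error-path tests additionally arm a one-shot allocation failure in
the dmirror driver via the HMM_DMIRROR_FLAGS ioctl with
HMM_DMIRROR_FLAG_FAIL_ALLOC (added in the previous patch) before
migrating, which forces the folio split path.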

Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>

Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
 tools/testing/selftests/mm/hmm-tests.c | 410 +++++++++++++++++++++++++
 1 file changed, 410 insertions(+)

diff --git a/tools/testing/selftests/mm/hmm-tests.c b/tools/testing/selftests/mm/hmm-tests.c
index 141bf63cbe05..da3322a1282c 100644
--- a/tools/testing/selftests/mm/hmm-tests.c
+++ b/tools/testing/selftests/mm/hmm-tests.c
@@ -2056,4 +2056,414 @@ TEST_F(hmm, hmm_cow_in_device)
 
 	hmm_buffer_free(buffer);
 }
+
+/*
+ * Migrate private anonymous huge empty page.
+ */
+TEST_F(hmm, migrate_anon_huge_empty)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	void *old_ptr;
+	void *map;
+	int *ptr;
+	int ret;
+
+	size = TWOMEG;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = 2 * size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+	memset(buffer->mirror, 0xFF, size);
+
+	buffer->ptr = mmap(NULL, 2 * size,
+			   PROT_READ,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	npages = size >> self->page_shift;
+	map = (void *)ALIGN((uintptr_t)buffer->ptr, size);
+	ret = madvise(map, size, MADV_HUGEPAGE);
+	ASSERT_EQ(ret, 0);
+	old_ptr = buffer->ptr;
+	buffer->ptr = map;
+
+	/* Migrate memory to device. */
+	ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], 0);
+
+	buffer->ptr = old_ptr;
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Migrate private anonymous huge zero page.
+ */
+TEST_F(hmm, migrate_anon_huge_zero)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	void *old_ptr;
+	void *map;
+	int *ptr;
+	int ret;
+	int val;
+
+	size = TWOMEG;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = 2 * size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+	memset(buffer->mirror, 0xFF, size);
+
+	buffer->ptr = mmap(NULL, 2 * size,
+			   PROT_READ,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	npages = size >> self->page_shift;
+	map = (void *)ALIGN((uintptr_t)buffer->ptr, size);
+	ret = madvise(map, size, MADV_HUGEPAGE);
+	ASSERT_EQ(ret, 0);
+	old_ptr = buffer->ptr;
+	buffer->ptr = map;
+
+	/* Initialize a read-only zero huge page. */
+	val = *(int *)buffer->ptr;
+	ASSERT_EQ(val, 0);
+
+	/* Migrate memory to device. */
+	ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], 0);
+
+	/* Fault pages back to system memory and check them. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) {
+		ASSERT_EQ(ptr[i], 0);
+		/* If it asserts once, it probably will 500,000 times */
+		if (ptr[i] != 0)
+			break;
+	}
+
+	buffer->ptr = old_ptr;
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Migrate private anonymous huge page and free.
+ */
+TEST_F(hmm, migrate_anon_huge_free)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	void *old_ptr;
+	void *map;
+	int *ptr;
+	int ret;
+
+	size = TWOMEG;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = 2 * size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+	memset(buffer->mirror, 0xFF, size);
+
+	buffer->ptr = mmap(NULL, 2 * size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	npages = size >> self->page_shift;
+	map = (void *)ALIGN((uintptr_t)buffer->ptr, size);
+	ret = madvise(map, size, MADV_HUGEPAGE);
+	ASSERT_EQ(ret, 0);
+	old_ptr = buffer->ptr;
+	buffer->ptr = map;
+
+	/* Initialize buffer in system memory. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Migrate memory to device. */
+	ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	/* Try freeing it. */
+	ret = madvise(map, size, MADV_FREE);
+	ASSERT_EQ(ret, 0);
+
+	buffer->ptr = old_ptr;
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Migrate private anonymous huge page and fault back to sysmem.
+ */
+TEST_F(hmm, migrate_anon_huge_fault)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	void *old_ptr;
+	void *map;
+	int *ptr;
+	int ret;
+
+	size = TWOMEG;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = 2 * size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+	memset(buffer->mirror, 0xFF, size);
+
+	buffer->ptr = mmap(NULL, 2 * size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	npages = size >> self->page_shift;
+	map = (void *)ALIGN((uintptr_t)buffer->ptr, size);
+	ret = madvise(map, size, MADV_HUGEPAGE);
+	ASSERT_EQ(ret, 0);
+	old_ptr = buffer->ptr;
+	buffer->ptr = map;
+
+	/* Initialize buffer in system memory. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Migrate memory to device. */
+	ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	/* Fault pages back to system memory and check them. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	buffer->ptr = old_ptr;
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Migrate private anonymous huge page with allocation errors.
+ */
+TEST_F(hmm, migrate_anon_huge_err)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	void *old_ptr;
+	void *map;
+	int *ptr;
+	int ret;
+
+	size = TWOMEG;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = 2 * size;
+	buffer->mirror = malloc(2 * size);
+	ASSERT_NE(buffer->mirror, NULL);
+	memset(buffer->mirror, 0xFF, 2 * size);
+
+	old_ptr = mmap(NULL, 2 * size, PROT_READ | PROT_WRITE,
+			MAP_PRIVATE | MAP_ANONYMOUS, buffer->fd, 0);
+	ASSERT_NE(old_ptr, MAP_FAILED);
+
+	npages = size >> self->page_shift;
+	map = (void *)ALIGN((uintptr_t)old_ptr, size);
+	ret = madvise(map, size, MADV_HUGEPAGE);
+	ASSERT_EQ(ret, 0);
+	buffer->ptr = map;
+
+	/* Initialize buffer in system memory. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Migrate memory to device but force a THP allocation error. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_FLAGS, buffer,
+			      HMM_DMIRROR_FLAG_FAIL_ALLOC);
+	ASSERT_EQ(ret, 0);
+	ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i) {
+		ASSERT_EQ(ptr[i], i);
+		if (ptr[i] != i)
+			break;
+	}
+
+	/* Try faulting back a single (PAGE_SIZE) page. */
+	ptr = buffer->ptr;
+	ASSERT_EQ(ptr[2048], 2048);
+
+	/* unmap and remap the region to reset things. */
+	ret = munmap(old_ptr, 2 * size);
+	ASSERT_EQ(ret, 0);
+	old_ptr = mmap(NULL, 2 * size, PROT_READ | PROT_WRITE,
+			MAP_PRIVATE | MAP_ANONYMOUS, buffer->fd, 0);
+	ASSERT_NE(old_ptr, MAP_FAILED);
+	map = (void *)ALIGN((uintptr_t)old_ptr, size);
+	ret = madvise(map, size, MADV_HUGEPAGE);
+	ASSERT_EQ(ret, 0);
+	buffer->ptr = map;
+
+	/* Initialize buffer in system memory. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Migrate THP to device. */
+	ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/*
+	 * Force an allocation error when faulting back a THP resident in the
+	 * device.
+	 */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_FLAGS, buffer,
+			      HMM_DMIRROR_FLAG_FAIL_ALLOC);
+	ASSERT_EQ(ret, 0);
+
+	ret = hmm_migrate_dev_to_sys(self->fd, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ptr = buffer->ptr;
+	ASSERT_EQ(ptr[2048], 2048);
+
+	buffer->ptr = old_ptr;
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Migrate private anonymous huge zero page with allocation errors.
+ */
+TEST_F(hmm, migrate_anon_huge_zero_err)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	void *old_ptr;
+	void *map;
+	int *ptr;
+	int ret;
+
+	size = TWOMEG;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = 2 * size;
+	buffer->mirror = malloc(2 * size);
+	ASSERT_NE(buffer->mirror, NULL);
+	memset(buffer->mirror, 0xFF, 2 * size);
+
+	old_ptr = mmap(NULL, 2 * size, PROT_READ,
+			MAP_PRIVATE | MAP_ANONYMOUS, buffer->fd, 0);
+	ASSERT_NE(old_ptr, MAP_FAILED);
+
+	npages = size >> self->page_shift;
+	map = (void *)ALIGN((uintptr_t)old_ptr, size);
+	ret = madvise(map, size, MADV_HUGEPAGE);
+	ASSERT_EQ(ret, 0);
+	buffer->ptr = map;
+
+	/* Migrate memory to device but force a THP allocation error. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_FLAGS, buffer,
+			      HMM_DMIRROR_FLAG_FAIL_ALLOC);
+	ASSERT_EQ(ret, 0);
+	ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], 0);
+
+	/* Try faulting back a single (PAGE_SIZE) page. */
+	ptr = buffer->ptr;
+	ASSERT_EQ(ptr[2048], 0);
+
+	/* unmap and remap the region to reset things. */
+	ret = munmap(old_ptr, 2 * size);
+	ASSERT_EQ(ret, 0);
+	old_ptr = mmap(NULL, 2 * size, PROT_READ,
+			MAP_PRIVATE | MAP_ANONYMOUS, buffer->fd, 0);
+	ASSERT_NE(old_ptr, MAP_FAILED);
+	map = (void *)ALIGN((uintptr_t)old_ptr, size);
+	ret = madvise(map, size, MADV_HUGEPAGE);
+	ASSERT_EQ(ret, 0);
+	buffer->ptr = map;
+
+	/* Initialize buffer in system memory (zero THP page). */
+	ret = ptr[0];
+	ASSERT_EQ(ret, 0);
+
+	/* Migrate memory to device but force a THP allocation error. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_FLAGS, buffer,
+			      HMM_DMIRROR_FLAG_FAIL_ALLOC);
+	ASSERT_EQ(ret, 0);
+	ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/* Fault the device memory back and check it. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], 0);
+
+	buffer->ptr = old_ptr;
+	hmm_buffer_free(buffer);
+}
 TEST_HARNESS_MAIN
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [v2 10/11] gpu/drm/nouveau: add THP migration support
  2025-07-30  9:21 [v2 00/11] THP support for zone device page migration Balbir Singh
                   ` (8 preceding siblings ...)
  2025-07-30  9:21 ` [v2 09/11] selftests/mm/hmm-tests: new tests for zone device THP migration Balbir Singh
@ 2025-07-30  9:21 ` Balbir Singh
  2025-07-30  9:21 ` [v2 11/11] selftests/mm/hmm-tests: new throughput tests including THP Balbir Singh
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 71+ messages in thread
From: Balbir Singh @ 2025-07-30  9:21 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Balbir Singh, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, David Hildenbrand,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
	Ralph Campbell, Mika Penttilä, Matthew Brost,
	Francois Dugast

Change the code to add support for MIGRATE_VMA_SELECT_COMPOUND
and to handle page sizes appropriately in the migrate/evict
code paths.
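
In rough outline, the migrate-to-ram path now opts into compound pages
and describes a whole folio with a single destination entry; a
simplified sketch of the pattern used in nouveau_dmem_migrate_to_ram()
below (destination allocation, copy and fence steps omitted):

	unsigned int order = folio_order(page_folio(vmf->page));
	struct migrate_vma args = {
		.vma		= vmf->vma,
		.pgmap_owner	= drm->dev,
		.fault_page	= vmf->page,
		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE |
				  MIGRATE_VMA_SELECT_COMPOUND,
	};

	args.src = kcalloc(1 << order, sizeof(*args.src), GFP_KERNEL);
	args.dst = kcalloc(1 << order, sizeof(*args.dst), GFP_KERNEL);
	args.start = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
	args.end = args.start + (PAGE_SIZE << order);

	migrate_vma_setup(&args);
	if (args.cpages) {
		/* dpage: newly allocated system page/folio of the same order */
		args.dst[0] = migrate_pfn(page_to_pfn(dpage));
		if (order)
			args.dst[0] |= MIGRATE_PFN_COMPOUND;
		migrate_vma_pages(&args);
		migrate_vma_finalize(&args);
	}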

Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>

Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
 drivers/gpu/drm/nouveau/nouveau_dmem.c | 246 +++++++++++++++++--------
 drivers/gpu/drm/nouveau/nouveau_svm.c  |   6 +-
 drivers/gpu/drm/nouveau/nouveau_svm.h  |   3 +-
 3 files changed, 178 insertions(+), 77 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index ca4932a150e3..d3672d01e8b5 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -83,9 +83,15 @@ struct nouveau_dmem {
 	struct list_head chunks;
 	struct mutex mutex;
 	struct page *free_pages;
+	struct folio *free_folios;
 	spinlock_t lock;
 };
 
+struct nouveau_dmem_dma_info {
+	dma_addr_t dma_addr;
+	size_t size;
+};
+
 static struct nouveau_dmem_chunk *nouveau_page_to_chunk(struct page *page)
 {
 	return container_of(page_pgmap(page), struct nouveau_dmem_chunk,
@@ -112,10 +118,16 @@ static void nouveau_dmem_page_free(struct page *page)
 {
 	struct nouveau_dmem_chunk *chunk = nouveau_page_to_chunk(page);
 	struct nouveau_dmem *dmem = chunk->drm->dmem;
+	struct folio *folio = page_folio(page);
 
 	spin_lock(&dmem->lock);
-	page->zone_device_data = dmem->free_pages;
-	dmem->free_pages = page;
+	if (folio_order(folio)) {
+		folio_set_zone_device_data(folio, dmem->free_folios);
+		dmem->free_folios = folio;
+	} else {
+		page->zone_device_data = dmem->free_pages;
+		dmem->free_pages = page;
+	}
 
 	WARN_ON(!chunk->callocated);
 	chunk->callocated--;
@@ -139,20 +151,28 @@ static void nouveau_dmem_fence_done(struct nouveau_fence **fence)
 	}
 }
 
-static int nouveau_dmem_copy_one(struct nouveau_drm *drm, struct page *spage,
-				struct page *dpage, dma_addr_t *dma_addr)
+static int nouveau_dmem_copy_folio(struct nouveau_drm *drm,
+				   struct folio *sfolio, struct folio *dfolio,
+				   struct nouveau_dmem_dma_info *dma_info)
 {
 	struct device *dev = drm->dev->dev;
+	struct page *dpage = folio_page(dfolio, 0);
+	struct page *spage = folio_page(sfolio, 0);
 
-	lock_page(dpage);
+	folio_lock(dfolio);
 
-	*dma_addr = dma_map_page(dev, dpage, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
-	if (dma_mapping_error(dev, *dma_addr))
+	dma_info->dma_addr = dma_map_page(dev, dpage, 0, page_size(dpage),
+					DMA_BIDIRECTIONAL);
+	dma_info->size = page_size(dpage);
+	if (dma_mapping_error(dev, dma_info->dma_addr))
 		return -EIO;
 
-	if (drm->dmem->migrate.copy_func(drm, 1, NOUVEAU_APER_HOST, *dma_addr,
-					 NOUVEAU_APER_VRAM, nouveau_dmem_page_addr(spage))) {
-		dma_unmap_page(dev, *dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
+	if (drm->dmem->migrate.copy_func(drm, folio_nr_pages(sfolio),
+					 NOUVEAU_APER_HOST, dma_info->dma_addr,
+					 NOUVEAU_APER_VRAM,
+					 nouveau_dmem_page_addr(spage))) {
+		dma_unmap_page(dev, dma_info->dma_addr, page_size(dpage),
+					DMA_BIDIRECTIONAL);
 		return -EIO;
 	}
 
@@ -165,21 +185,38 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf)
 	struct nouveau_dmem *dmem = drm->dmem;
 	struct nouveau_fence *fence;
 	struct nouveau_svmm *svmm;
-	struct page *spage, *dpage;
-	unsigned long src = 0, dst = 0;
-	dma_addr_t dma_addr = 0;
+	struct page *dpage;
 	vm_fault_t ret = 0;
 	struct migrate_vma args = {
 		.vma		= vmf->vma,
-		.start		= vmf->address,
-		.end		= vmf->address + PAGE_SIZE,
-		.src		= &src,
-		.dst		= &dst,
 		.pgmap_owner	= drm->dev,
 		.fault_page	= vmf->page,
-		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
+		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE |
+				  MIGRATE_VMA_SELECT_COMPOUND,
+		.src = NULL,
+		.dst = NULL,
 	};
-
+	unsigned int order, nr;
+	struct folio *sfolio, *dfolio;
+	struct nouveau_dmem_dma_info dma_info;
+
+	sfolio = page_folio(vmf->page);
+	order = folio_order(sfolio);
+	nr = 1 << order;
+
+	if (order)
+		args.flags |= MIGRATE_VMA_SELECT_COMPOUND;
+
+	args.start = ALIGN_DOWN(vmf->address, (1 << (PAGE_SHIFT + order)));
+	args.vma = vmf->vma;
+	args.end = args.start + (PAGE_SIZE << order);
+	args.src = kcalloc(nr, sizeof(*args.src), GFP_KERNEL);
+	args.dst = kcalloc(nr, sizeof(*args.dst), GFP_KERNEL);
+
+	if (!args.src || !args.dst) {
+		ret = VM_FAULT_OOM;
+		goto err;
+	}
 	/*
 	 * FIXME what we really want is to find some heuristic to migrate more
 	 * than just one page on CPU fault. When such fault happens it is very
@@ -190,20 +227,26 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf)
 	if (!args.cpages)
 		return 0;
 
-	spage = migrate_pfn_to_page(src);
-	if (!spage || !(src & MIGRATE_PFN_MIGRATE))
-		goto done;
-
-	dpage = alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vmf->vma, vmf->address);
-	if (!dpage)
+	if (order)
+		dpage = folio_page(vma_alloc_folio(GFP_HIGHUSER | __GFP_ZERO,
+					order, vmf->vma, vmf->address), 0);
+	else
+		dpage = alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vmf->vma,
+					vmf->address);
+	if (!dpage) {
+		ret = VM_FAULT_OOM;
 		goto done;
+	}
 
-	dst = migrate_pfn(page_to_pfn(dpage));
+	args.dst[0] = migrate_pfn(page_to_pfn(dpage));
+	if (order)
+		args.dst[0] |= MIGRATE_PFN_COMPOUND;
+	dfolio = page_folio(dpage);
 
-	svmm = spage->zone_device_data;
+	svmm = folio_zone_device_data(sfolio);
 	mutex_lock(&svmm->mutex);
 	nouveau_svmm_invalidate(svmm, args.start, args.end);
-	ret = nouveau_dmem_copy_one(drm, spage, dpage, &dma_addr);
+	ret = nouveau_dmem_copy_folio(drm, sfolio, dfolio, &dma_info);
 	mutex_unlock(&svmm->mutex);
 	if (ret) {
 		ret = VM_FAULT_SIGBUS;
@@ -213,19 +256,33 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf)
 	nouveau_fence_new(&fence, dmem->migrate.chan);
 	migrate_vma_pages(&args);
 	nouveau_dmem_fence_done(&fence);
-	dma_unmap_page(drm->dev->dev, dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
+	dma_unmap_page(drm->dev->dev, dma_info.dma_addr, PAGE_SIZE,
+				DMA_BIDIRECTIONAL);
 done:
 	migrate_vma_finalize(&args);
+err:
+	kfree(args.src);
+	kfree(args.dst);
 	return ret;
 }
 
+static void nouveau_dmem_folio_split(struct folio *head, struct folio *tail)
+{
+	if (tail == NULL)
+		return;
+	tail->pgmap = head->pgmap;
+	folio_set_zone_device_data(tail, folio_zone_device_data(head));
+}
+
 static const struct dev_pagemap_ops nouveau_dmem_pagemap_ops = {
 	.page_free		= nouveau_dmem_page_free,
 	.migrate_to_ram		= nouveau_dmem_migrate_to_ram,
+	.folio_split		= nouveau_dmem_folio_split,
 };
 
 static int
-nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage)
+nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage,
+			 bool is_large)
 {
 	struct nouveau_dmem_chunk *chunk;
 	struct resource *res;
@@ -274,16 +331,21 @@ nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage)
 	pfn_first = chunk->pagemap.range.start >> PAGE_SHIFT;
 	page = pfn_to_page(pfn_first);
 	spin_lock(&drm->dmem->lock);
-	for (i = 0; i < DMEM_CHUNK_NPAGES - 1; ++i, ++page) {
-		page->zone_device_data = drm->dmem->free_pages;
-		drm->dmem->free_pages = page;
+
+	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) || !is_large) {
+		for (i = 0; i < DMEM_CHUNK_NPAGES - 1; ++i, ++page) {
+			page->zone_device_data = drm->dmem->free_pages;
+			drm->dmem->free_pages = page;
+		}
 	}
+
 	*ppage = page;
 	chunk->callocated++;
 	spin_unlock(&drm->dmem->lock);
 
-	NV_INFO(drm, "DMEM: registered %ldMB of device memory\n",
-		DMEM_CHUNK_SIZE >> 20);
+	NV_INFO(drm, "DMEM: registered %ldMB of %sdevice memory %lx %lx\n",
+		DMEM_CHUNK_SIZE >> 20, is_large ? "THP " : "", pfn_first,
+		nouveau_dmem_page_addr(page));
 
 	return 0;
 
@@ -298,27 +360,37 @@ nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage)
 }
 
 static struct page *
-nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
+nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm, bool is_large)
 {
 	struct nouveau_dmem_chunk *chunk;
 	struct page *page = NULL;
+	struct folio *folio = NULL;
 	int ret;
+	unsigned int order = 0;
 
 	spin_lock(&drm->dmem->lock);
-	if (drm->dmem->free_pages) {
+	if (is_large && drm->dmem->free_folios) {
+		folio = drm->dmem->free_folios;
+		drm->dmem->free_folios = folio_zone_device_data(folio);
+		chunk = nouveau_page_to_chunk(page);
+		chunk->callocated++;
+		spin_unlock(&drm->dmem->lock);
+		order = DMEM_CHUNK_NPAGES;
+	} else if (!is_large && drm->dmem->free_pages) {
 		page = drm->dmem->free_pages;
 		drm->dmem->free_pages = page->zone_device_data;
 		chunk = nouveau_page_to_chunk(page);
 		chunk->callocated++;
 		spin_unlock(&drm->dmem->lock);
+		folio = page_folio(page);
 	} else {
 		spin_unlock(&drm->dmem->lock);
-		ret = nouveau_dmem_chunk_alloc(drm, &page);
+		ret = nouveau_dmem_chunk_alloc(drm, &page, is_large);
 		if (ret)
 			return NULL;
 	}
 
-	zone_device_page_init(page);
+	zone_device_folio_init(folio, order);
 	return page;
 }
 
@@ -369,12 +441,12 @@ nouveau_dmem_evict_chunk(struct nouveau_dmem_chunk *chunk)
 {
 	unsigned long i, npages = range_len(&chunk->pagemap.range) >> PAGE_SHIFT;
 	unsigned long *src_pfns, *dst_pfns;
-	dma_addr_t *dma_addrs;
+	struct nouveau_dmem_dma_info *dma_info;
 	struct nouveau_fence *fence;
 
 	src_pfns = kvcalloc(npages, sizeof(*src_pfns), GFP_KERNEL | __GFP_NOFAIL);
 	dst_pfns = kvcalloc(npages, sizeof(*dst_pfns), GFP_KERNEL | __GFP_NOFAIL);
-	dma_addrs = kvcalloc(npages, sizeof(*dma_addrs), GFP_KERNEL | __GFP_NOFAIL);
+	dma_info = kvcalloc(npages, sizeof(*dma_info), GFP_KERNEL | __GFP_NOFAIL);
 
 	migrate_device_range(src_pfns, chunk->pagemap.range.start >> PAGE_SHIFT,
 			npages);
@@ -382,17 +454,28 @@ nouveau_dmem_evict_chunk(struct nouveau_dmem_chunk *chunk)
 	for (i = 0; i < npages; i++) {
 		if (src_pfns[i] & MIGRATE_PFN_MIGRATE) {
 			struct page *dpage;
+			struct folio *folio = page_folio(
+				migrate_pfn_to_page(src_pfns[i]));
+			unsigned int order = folio_order(folio);
+
+			if (src_pfns[i] & MIGRATE_PFN_COMPOUND) {
+				dpage = folio_page(
+						folio_alloc(
+						GFP_HIGHUSER_MOVABLE, order), 0);
+			} else {
+				/*
+				 * _GFP_NOFAIL because the GPU is going away and there
+				 * is nothing sensible we can do if we can't copy the
+				 * data back.
+				 */
+				dpage = alloc_page(GFP_HIGHUSER | __GFP_NOFAIL);
+			}
 
-			/*
-			 * _GFP_NOFAIL because the GPU is going away and there
-			 * is nothing sensible we can do if we can't copy the
-			 * data back.
-			 */
-			dpage = alloc_page(GFP_HIGHUSER | __GFP_NOFAIL);
 			dst_pfns[i] = migrate_pfn(page_to_pfn(dpage));
-			nouveau_dmem_copy_one(chunk->drm,
-					migrate_pfn_to_page(src_pfns[i]), dpage,
-					&dma_addrs[i]);
+			nouveau_dmem_copy_folio(chunk->drm,
+				page_folio(migrate_pfn_to_page(src_pfns[i])),
+				page_folio(dpage),
+				&dma_info[i]);
 		}
 	}
 
@@ -403,8 +486,9 @@ nouveau_dmem_evict_chunk(struct nouveau_dmem_chunk *chunk)
 	kvfree(src_pfns);
 	kvfree(dst_pfns);
 	for (i = 0; i < npages; i++)
-		dma_unmap_page(chunk->drm->dev->dev, dma_addrs[i], PAGE_SIZE, DMA_BIDIRECTIONAL);
-	kvfree(dma_addrs);
+		dma_unmap_page(chunk->drm->dev->dev, dma_info[i].dma_addr,
+				dma_info[i].size, DMA_BIDIRECTIONAL);
+	kvfree(dma_info);
 }
 
 void
@@ -607,31 +691,35 @@ nouveau_dmem_init(struct nouveau_drm *drm)
 
 static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
 		struct nouveau_svmm *svmm, unsigned long src,
-		dma_addr_t *dma_addr, u64 *pfn)
+		struct nouveau_dmem_dma_info *dma_info, u64 *pfn)
 {
 	struct device *dev = drm->dev->dev;
 	struct page *dpage, *spage;
 	unsigned long paddr;
+	bool is_large = false;
 
 	spage = migrate_pfn_to_page(src);
 	if (!(src & MIGRATE_PFN_MIGRATE))
 		goto out;
 
-	dpage = nouveau_dmem_page_alloc_locked(drm);
+	is_large = src & MIGRATE_PFN_COMPOUND;
+	dpage = nouveau_dmem_page_alloc_locked(drm, is_large);
 	if (!dpage)
 		goto out;
 
 	paddr = nouveau_dmem_page_addr(dpage);
 	if (spage) {
-		*dma_addr = dma_map_page(dev, spage, 0, page_size(spage),
+		dma_info->dma_addr = dma_map_page(dev, spage, 0, page_size(spage),
 					 DMA_BIDIRECTIONAL);
-		if (dma_mapping_error(dev, *dma_addr))
+		dma_info->size = page_size(spage);
+		if (dma_mapping_error(dev, dma_info->dma_addr))
 			goto out_free_page;
-		if (drm->dmem->migrate.copy_func(drm, 1,
-			NOUVEAU_APER_VRAM, paddr, NOUVEAU_APER_HOST, *dma_addr))
+		if (drm->dmem->migrate.copy_func(drm, folio_nr_pages(page_folio(spage)),
+			NOUVEAU_APER_VRAM, paddr, NOUVEAU_APER_HOST,
+			dma_info->dma_addr))
 			goto out_dma_unmap;
 	} else {
-		*dma_addr = DMA_MAPPING_ERROR;
+		dma_info->dma_addr = DMA_MAPPING_ERROR;
 		if (drm->dmem->migrate.clear_func(drm, page_size(dpage),
 			NOUVEAU_APER_VRAM, paddr))
 			goto out_free_page;
@@ -645,7 +733,7 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
 	return migrate_pfn(page_to_pfn(dpage));
 
 out_dma_unmap:
-	dma_unmap_page(dev, *dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
+	dma_unmap_page(dev, dma_info->dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
 out_free_page:
 	nouveau_dmem_page_free_locked(drm, dpage);
 out:
@@ -655,27 +743,33 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
 
 static void nouveau_dmem_migrate_chunk(struct nouveau_drm *drm,
 		struct nouveau_svmm *svmm, struct migrate_vma *args,
-		dma_addr_t *dma_addrs, u64 *pfns)
+		struct nouveau_dmem_dma_info *dma_info, u64 *pfns)
 {
 	struct nouveau_fence *fence;
 	unsigned long addr = args->start, nr_dma = 0, i;
+	unsigned long order = 0;
 
-	for (i = 0; addr < args->end; i++) {
+	for (i = 0; addr < args->end; ) {
+		struct folio *folio;
+
+		folio = page_folio(migrate_pfn_to_page(args->dst[i]));
+		order = folio_order(folio);
 		args->dst[i] = nouveau_dmem_migrate_copy_one(drm, svmm,
-				args->src[i], dma_addrs + nr_dma, pfns + i);
-		if (!dma_mapping_error(drm->dev->dev, dma_addrs[nr_dma]))
+				args->src[i], dma_info + nr_dma, pfns + i);
+		if (!dma_mapping_error(drm->dev->dev, dma_info[nr_dma].dma_addr))
 			nr_dma++;
-		addr += PAGE_SIZE;
+		i += 1 << order;
+		addr += (1 << order) * PAGE_SIZE;
 	}
 
 	nouveau_fence_new(&fence, drm->dmem->migrate.chan);
 	migrate_vma_pages(args);
 	nouveau_dmem_fence_done(&fence);
-	nouveau_pfns_map(svmm, args->vma->vm_mm, args->start, pfns, i);
+	nouveau_pfns_map(svmm, args->vma->vm_mm, args->start, pfns, i, order);
 
 	while (nr_dma--) {
-		dma_unmap_page(drm->dev->dev, dma_addrs[nr_dma], PAGE_SIZE,
-				DMA_BIDIRECTIONAL);
+		dma_unmap_page(drm->dev->dev, dma_info[nr_dma].dma_addr,
+				dma_info[nr_dma].size, DMA_BIDIRECTIONAL);
 	}
 	migrate_vma_finalize(args);
 }
@@ -689,20 +783,24 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm,
 {
 	unsigned long npages = (end - start) >> PAGE_SHIFT;
 	unsigned long max = min(SG_MAX_SINGLE_ALLOC, npages);
-	dma_addr_t *dma_addrs;
 	struct migrate_vma args = {
 		.vma		= vma,
 		.start		= start,
 		.pgmap_owner	= drm->dev,
-		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
+		.flags		= MIGRATE_VMA_SELECT_SYSTEM
+				  | MIGRATE_VMA_SELECT_COMPOUND,
 	};
 	unsigned long i;
 	u64 *pfns;
 	int ret = -ENOMEM;
+	struct nouveau_dmem_dma_info *dma_info;
 
 	if (drm->dmem == NULL)
 		return -ENODEV;
 
+	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+		max = max(HPAGE_PMD_NR, max);
+
 	args.src = kcalloc(max, sizeof(*args.src), GFP_KERNEL);
 	if (!args.src)
 		goto out;
@@ -710,8 +808,8 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm,
 	if (!args.dst)
 		goto out_free_src;
 
-	dma_addrs = kmalloc_array(max, sizeof(*dma_addrs), GFP_KERNEL);
-	if (!dma_addrs)
+	dma_info = kmalloc_array(max, sizeof(*dma_info), GFP_KERNEL);
+	if (!dma_info)
 		goto out_free_dst;
 
 	pfns = nouveau_pfns_alloc(max);
@@ -729,7 +827,7 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm,
 			goto out_free_pfns;
 
 		if (args.cpages)
-			nouveau_dmem_migrate_chunk(drm, svmm, &args, dma_addrs,
+			nouveau_dmem_migrate_chunk(drm, svmm, &args, dma_info,
 						   pfns);
 		args.start = args.end;
 	}
@@ -738,7 +836,7 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm,
 out_free_pfns:
 	nouveau_pfns_free(pfns);
 out_free_dma:
-	kfree(dma_addrs);
+	kfree(dma_info);
 out_free_dst:
 	kfree(args.dst);
 out_free_src:
diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
index 6fa387da0637..b8a3378154d5 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -921,12 +921,14 @@ nouveau_pfns_free(u64 *pfns)
 
 void
 nouveau_pfns_map(struct nouveau_svmm *svmm, struct mm_struct *mm,
-		 unsigned long addr, u64 *pfns, unsigned long npages)
+		 unsigned long addr, u64 *pfns, unsigned long npages,
+		 unsigned int page_shift)
 {
 	struct nouveau_pfnmap_args *args = nouveau_pfns_to_args(pfns);
 
 	args->p.addr = addr;
-	args->p.size = npages << PAGE_SHIFT;
+	args->p.size = npages << page_shift;
+	args->p.page = page_shift;
 
 	mutex_lock(&svmm->mutex);
 
diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.h b/drivers/gpu/drm/nouveau/nouveau_svm.h
index e7d63d7f0c2d..3fd78662f17e 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.h
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.h
@@ -33,7 +33,8 @@ void nouveau_svmm_invalidate(struct nouveau_svmm *svmm, u64 start, u64 limit);
 u64 *nouveau_pfns_alloc(unsigned long npages);
 void nouveau_pfns_free(u64 *pfns);
 void nouveau_pfns_map(struct nouveau_svmm *svmm, struct mm_struct *mm,
-		      unsigned long addr, u64 *pfns, unsigned long npages);
+		      unsigned long addr, u64 *pfns, unsigned long npages,
+		      unsigned int page_shift);
 #else /* IS_ENABLED(CONFIG_DRM_NOUVEAU_SVM) */
 static inline void nouveau_svm_init(struct nouveau_drm *drm) {}
 static inline void nouveau_svm_fini(struct nouveau_drm *drm) {}
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [v2 11/11] selftests/mm/hmm-tests: new throughput tests including THP
  2025-07-30  9:21 [v2 00/11] THP support for zone device page migration Balbir Singh
                   ` (9 preceding siblings ...)
  2025-07-30  9:21 ` [v2 10/11] gpu/drm/nouveau: add THP migration support Balbir Singh
@ 2025-07-30  9:21 ` Balbir Singh
  2025-07-30 11:30 ` [v2 00/11] THP support for zone device page migration David Hildenbrand
  2025-08-05 21:34 ` Matthew Brost
  12 siblings, 0 replies; 71+ messages in thread
From: Balbir Singh @ 2025-07-30  9:21 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Balbir Singh, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, David Hildenbrand,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
	Ralph Campbell, Mika Penttilä, Matthew Brost,
	Francois Dugast

Add new benchmark-style tests to measure transfer bandwidth for
zone device memory operations.
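
The reported figures are straightforward wall-clock averages over a few
iterations; per direction the measurement boils down to the following
(using the get_time_ms()/hmm_migrate_sys_to_dev() helpers added below):

	for (i = 0; i < iterations; i++) {
		start = get_time_ms();
		ret = hmm_migrate_sys_to_dev(fd, buffer, npages);
		end = get_time_ms();
		s2d_total += end - start;
	}

	/* average ms per migration, then bytes -> GB and ms -> s */
	results->sys_to_dev_time = s2d_total / iterations;
	results->throughput_s2d = (buffer_size / (1024.0 * 1024.0 * 1024.0)) /
				  (results->sys_to_dev_time / 1000.0);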

Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>

Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
 tools/testing/selftests/mm/hmm-tests.c | 197 ++++++++++++++++++++++++-
 1 file changed, 196 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/mm/hmm-tests.c b/tools/testing/selftests/mm/hmm-tests.c
index da3322a1282c..1325de70f44f 100644
--- a/tools/testing/selftests/mm/hmm-tests.c
+++ b/tools/testing/selftests/mm/hmm-tests.c
@@ -25,6 +25,7 @@
 #include <sys/stat.h>
 #include <sys/mman.h>
 #include <sys/ioctl.h>
+#include <sys/time.h>
 
 
 /*
@@ -207,8 +208,10 @@ static void hmm_buffer_free(struct hmm_buffer *buffer)
 	if (buffer == NULL)
 		return;
 
-	if (buffer->ptr)
+	if (buffer->ptr) {
 		munmap(buffer->ptr, buffer->size);
+		buffer->ptr = NULL;
+	}
 	free(buffer->mirror);
 	free(buffer);
 }
@@ -2466,4 +2469,196 @@ TEST_F(hmm, migrate_anon_huge_zero_err)
 	buffer->ptr = old_ptr;
 	hmm_buffer_free(buffer);
 }
+
+struct benchmark_results {
+	double sys_to_dev_time;
+	double dev_to_sys_time;
+	double throughput_s2d;
+	double throughput_d2s;
+};
+
+static double get_time_ms(void)
+{
+	struct timeval tv;
+
+	gettimeofday(&tv, NULL);
+	return (tv.tv_sec * 1000.0) + (tv.tv_usec / 1000.0);
+}
+
+static inline struct hmm_buffer *hmm_buffer_alloc(unsigned long size)
+{
+	struct hmm_buffer *buffer;
+
+	buffer = malloc(sizeof(*buffer));
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	memset(buffer->mirror, 0xFF, size);
+	return buffer;
+}
+
+static void print_benchmark_results(const char *test_name, size_t buffer_size,
+				     struct benchmark_results *thp,
+				     struct benchmark_results *regular)
+{
+	double s2d_improvement = ((regular->sys_to_dev_time - thp->sys_to_dev_time) /
+				 regular->sys_to_dev_time) * 100.0;
+	double d2s_improvement = ((regular->dev_to_sys_time - thp->dev_to_sys_time) /
+				 regular->dev_to_sys_time) * 100.0;
+	double throughput_s2d_improvement = ((thp->throughput_s2d - regular->throughput_s2d) /
+					    regular->throughput_s2d) * 100.0;
+	double throughput_d2s_improvement = ((thp->throughput_d2s - regular->throughput_d2s) /
+					    regular->throughput_d2s) * 100.0;
+
+	printf("\n=== %s (%.1f MB) ===\n", test_name, buffer_size / (1024.0 * 1024.0));
+	printf("                     | With THP        | Without THP     | Improvement\n");
+	printf("---------------------------------------------------------------------\n");
+	printf("Sys->Dev Migration   | %.3f ms        | %.3f ms        | %.1f%%\n",
+	       thp->sys_to_dev_time, regular->sys_to_dev_time, s2d_improvement);
+	printf("Dev->Sys Migration   | %.3f ms        | %.3f ms        | %.1f%%\n",
+	       thp->dev_to_sys_time, regular->dev_to_sys_time, d2s_improvement);
+	printf("S->D Throughput      | %.2f GB/s      | %.2f GB/s      | %.1f%%\n",
+	       thp->throughput_s2d, regular->throughput_s2d, throughput_s2d_improvement);
+	printf("D->S Throughput      | %.2f GB/s      | %.2f GB/s      | %.1f%%\n",
+	       thp->throughput_d2s, regular->throughput_d2s, throughput_d2s_improvement);
+}
+
+/*
+ * Run a single migration benchmark
+ * fd: file descriptor for hmm device
+ * use_thp: whether to use THP
+ * buffer_size: size of buffer to allocate
+ * iterations: number of iterations
+ * results: where to store results
+ */
+static inline int run_migration_benchmark(int fd, int use_thp, size_t buffer_size,
+					   int iterations, struct benchmark_results *results)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages = buffer_size / sysconf(_SC_PAGESIZE);
+	double start, end;
+	double s2d_total = 0, d2s_total = 0;
+	int ret, i;
+	int *ptr;
+
+	buffer = hmm_buffer_alloc(buffer_size);
+
+	/* Map memory */
+	buffer->ptr = mmap(NULL, buffer_size, PROT_READ | PROT_WRITE,
+			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+
+	if (!buffer->ptr)
+		return -1;
+
+	/* Apply THP hint if requested */
+	if (use_thp)
+		ret = madvise(buffer->ptr, buffer_size, MADV_HUGEPAGE);
+	else
+		ret = madvise(buffer->ptr, buffer_size, MADV_NOHUGEPAGE);
+
+	if (ret)
+		return ret;
+
+	/* Initialize memory to make sure pages are allocated */
+	ptr = (int *)buffer->ptr;
+	for (i = 0; i < buffer_size / sizeof(int); i++)
+		ptr[i] = i & 0xFF;
+
+	/* Warmup iteration */
+	ret = hmm_migrate_sys_to_dev(fd, buffer, npages);
+	if (ret)
+		return ret;
+
+	ret = hmm_migrate_dev_to_sys(fd, buffer, npages);
+	if (ret)
+		return ret;
+
+	/* Benchmark iterations */
+	for (i = 0; i < iterations; i++) {
+		/* System to device migration */
+		start = get_time_ms();
+
+		ret = hmm_migrate_sys_to_dev(fd, buffer, npages);
+		if (ret)
+			return ret;
+
+		end = get_time_ms();
+		s2d_total += (end - start);
+
+		/* Device to system migration */
+		start = get_time_ms();
+
+		ret = hmm_migrate_dev_to_sys(fd, buffer, npages);
+		if (ret)
+			return ret;
+
+		end = get_time_ms();
+		d2s_total += (end - start);
+	}
+
+	/* Calculate average times and throughput */
+	results->sys_to_dev_time = s2d_total / iterations;
+	results->dev_to_sys_time = d2s_total / iterations;
+	results->throughput_s2d = (buffer_size / (1024.0 * 1024.0 * 1024.0)) /
+				 (results->sys_to_dev_time / 1000.0);
+	results->throughput_d2s = (buffer_size / (1024.0 * 1024.0 * 1024.0)) /
+				 (results->dev_to_sys_time / 1000.0);
+
+	/* Cleanup */
+	hmm_buffer_free(buffer);
+	return 0;
+}
+
+/*
+ * Benchmark THP migration with different buffer sizes
+ */
+TEST_F_TIMEOUT(hmm, benchmark_thp_migration, 120)
+{
+	struct benchmark_results thp_results, regular_results;
+	size_t thp_size = 2 * 1024 * 1024; /* 2MB - typical THP size */
+	int iterations = 5;
+
+	printf("\nHMM THP Migration Benchmark\n");
+	printf("---------------------------\n");
+	printf("System page size: %ld bytes\n", sysconf(_SC_PAGESIZE));
+
+	/* Test different buffer sizes */
+	size_t test_sizes[] = {
+		thp_size / 4,      /* 512KB - smaller than THP */
+		thp_size / 2,      /* 1MB - half THP */
+		thp_size,          /* 2MB - single THP */
+		thp_size * 2,      /* 4MB - two THPs */
+		thp_size * 4,      /* 8MB - four THPs */
+		thp_size * 8,       /* 16MB - eight THPs */
+		thp_size * 128,       /* 256MB - one twenty eight THPs */
+	};
+
+	static const char *const test_names[] = {
+		"Small Buffer (512KB)",
+		"Half THP Size (1MB)",
+		"Single THP Size (2MB)",
+		"Two THP Size (4MB)",
+		"Four THP Size (8MB)",
+		"Eight THP Size (16MB)",
+		"One twenty eight THP Size (256MB)"
+	};
+
+	int num_tests = ARRAY_SIZE(test_sizes);
+
+	/* Run all tests */
+	for (int i = 0; i < num_tests; i++) {
+		/* Test with THP */
+		ASSERT_EQ(run_migration_benchmark(self->fd, 1, test_sizes[i],
+					iterations, &thp_results), 0);
+
+		/* Test without THP */
+		ASSERT_EQ(run_migration_benchmark(self->fd, 0, test_sizes[i],
+					iterations, &regular_results), 0);
+
+		/* Print results */
+		print_benchmark_results(test_names[i], test_sizes[i],
+					&thp_results, &regular_results);
+	}
+}
 TEST_HARNESS_MAIN
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [v2 01/11] mm/zone_device: support large zone device private folios
  2025-07-30  9:21 ` [v2 01/11] mm/zone_device: support large zone device private folios Balbir Singh
@ 2025-07-30  9:50   ` David Hildenbrand
  2025-08-04 23:43     ` Balbir Singh
  2025-08-05  4:22     ` Balbir Singh
  0 siblings, 2 replies; 71+ messages in thread
From: David Hildenbrand @ 2025-07-30  9:50 UTC (permalink / raw)
  To: Balbir Singh, linux-mm
  Cc: linux-kernel, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
	Ralph Campbell, Mika Penttilä, Matthew Brost,
	Francois Dugast

On 30.07.25 11:21, Balbir Singh wrote:
> Add routines to support allocation of large order zone device folios
> and helper functions for zone device folios, to check if a folio is
> device private and helpers for setting zone device data.
> 
> When large folios are used, the existing page_free() callback in
> pgmap is called when the folio is freed, this is true for both
> PAGE_SIZE and higher order pages.
> 
> Cc: Karol Herbst <kherbst@redhat.com>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
> Cc: Jane Chu <jane.chu@oracle.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Donet Tom <donettom@linux.ibm.com>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Cc: Mika Penttilä <mpenttil@redhat.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Francois Dugast <francois.dugast@intel.com>
> 
> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> ---
>   include/linux/memremap.h | 10 ++++++++-
>   mm/memremap.c            | 48 +++++++++++++++++++++++++++++-----------
>   2 files changed, 44 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 4aa151914eab..a0723b35eeaa 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -199,7 +199,7 @@ static inline bool folio_is_fsdax(const struct folio *folio)
>   }
>   
>   #ifdef CONFIG_ZONE_DEVICE
> -void zone_device_page_init(struct page *page);
> +void zone_device_folio_init(struct folio *folio, unsigned int order);
>   void *memremap_pages(struct dev_pagemap *pgmap, int nid);
>   void memunmap_pages(struct dev_pagemap *pgmap);
>   void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
> @@ -209,6 +209,14 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
>   bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);
>   
>   unsigned long memremap_compat_align(void);
> +
> +static inline void zone_device_page_init(struct page *page)
> +{
> +	struct folio *folio = page_folio(page);
> +
> +	zone_device_folio_init(folio, 0);
> +}
> +
>   #else
>   static inline void *devm_memremap_pages(struct device *dev,
>   		struct dev_pagemap *pgmap)
> diff --git a/mm/memremap.c b/mm/memremap.c
> index b0ce0d8254bd..3ca136e7455e 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -427,20 +427,19 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
>   void free_zone_device_folio(struct folio *folio)
>   {
>   	struct dev_pagemap *pgmap = folio->pgmap;
> +	unsigned int nr = folio_nr_pages(folio);
> +	int i;

"unsigned long" is to be future-proof.

(folio_nr_pages() returns long and probably soon unsigned long)

[ I'd probably call it "nr_pages" ]

>   
>   	if (WARN_ON_ONCE(!pgmap))
>   		return;
>   
>   	mem_cgroup_uncharge(folio);
>   
> -	/*
> -	 * Note: we don't expect anonymous compound pages yet. Once supported
> -	 * and we could PTE-map them similar to THP, we'd have to clear
> -	 * PG_anon_exclusive on all tail pages.
> -	 */
>   	if (folio_test_anon(folio)) {
> -		VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> -		__ClearPageAnonExclusive(folio_page(folio, 0));
> +		for (i = 0; i < nr; i++)
> +			__ClearPageAnonExclusive(folio_page(folio, i));
> +	} else {
> +		VM_WARN_ON_ONCE(folio_test_large(folio));
>   	}
>   
>   	/*
> @@ -464,11 +463,20 @@ void free_zone_device_folio(struct folio *folio)
>   
>   	switch (pgmap->type) {
>   	case MEMORY_DEVICE_PRIVATE:
> +		if (folio_test_large(folio)) {

Could do "nr > 1" if we already have that value around.

> +			folio_unqueue_deferred_split(folio);

I think I asked that already but maybe missed the reply: Should these 
folios ever be added to the deferred split queue and is there any value 
in splitting them under memory pressure in the shrinker?

My gut feeling is "No", because the buddy cannot make use of these 
folios, but maybe there is an interesting case where we want that behavior?

> +
> +			percpu_ref_put_many(&folio->pgmap->ref, nr - 1);
> +		}
> +		pgmap->ops->page_free(&folio->page);
> +		percpu_ref_put(&folio->pgmap->ref);

Could you simply do a

	percpu_ref_put_many(&folio->pgmap->ref, nr);

here, or would that be problematic?

> +		folio->page.mapping = NULL;
> +		break;
>   	case MEMORY_DEVICE_COHERENT:
>   		if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->page_free))
>   			break;
> -		pgmap->ops->page_free(folio_page(folio, 0));
> -		put_dev_pagemap(pgmap);
> +		pgmap->ops->page_free(&folio->page);
> +		percpu_ref_put(&folio->pgmap->ref);
>   		break;
>   
>   	case MEMORY_DEVICE_GENERIC:
> @@ -491,14 +499,28 @@ void free_zone_device_folio(struct folio *folio)
>   	}
>   }
>   
> -void zone_device_page_init(struct page *page)
> +void zone_device_folio_init(struct folio *folio, unsigned int order)
>   {
> +	struct page *page = folio_page(folio, 0);
> +
> +	VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);
> +
> +	/*
> +	 * Only PMD level migration is supported for THP migration
> +	 */

Talking about something that does not exist yet (and is very specific) 
sounds a bit weird.

Should this go into a different patch, or could we rephrase the comment 
to be a bit more generic?

In this patch here, nothing would really object to "order" being 
intermediate.

(also, this is a device_private limitation? shouldn't that check go 
somewhere where we can perform this device-private limitation check?)

> +	WARN_ON_ONCE(order && order != HPAGE_PMD_ORDER);
> +
>   	/*
>   	 * Drivers shouldn't be allocating pages after calling
>   	 * memunmap_pages().
>   	 */
> -	WARN_ON_ONCE(!percpu_ref_tryget_live(&page_pgmap(page)->ref));
> -	set_page_count(page, 1);
> +	WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << order));
> +	folio_set_count(folio, 1);
>   	lock_page(page);
> +
> +	if (order > 1) {
> +		prep_compound_page(page, order);
> +		folio_set_large_rmappable(folio);
> +	}
>   }
> -EXPORT_SYMBOL_GPL(zone_device_page_init);
> +EXPORT_SYMBOL_GPL(zone_device_folio_init);


-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
  2025-07-30  9:21 ` [v2 02/11] mm/thp: zone_device awareness in THP handling code Balbir Singh
@ 2025-07-30 11:16   ` Mika Penttilä
  2025-07-30 11:27     ` Zi Yan
  2025-07-30 20:05   ` kernel test robot
  1 sibling, 1 reply; 71+ messages in thread
From: Mika Penttilä @ 2025-07-30 11:16 UTC (permalink / raw)
  To: Balbir Singh, linux-mm
  Cc: linux-kernel, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
	David Hildenbrand, Barry Song, Baolin Wang, Ryan Roberts,
	Matthew Wilcox, Peter Xu, Zi Yan, Kefeng Wang, Jane Chu,
	Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
	Ralph Campbell

Hi,

On 7/30/25 12:21, Balbir Singh wrote:
> Make THP handling code in the mm subsystem for THP pages aware of zone
> device pages. Although the code is designed to be generic when it comes
> to handling splitting of pages, the code is designed to work for THP
> page sizes corresponding to HPAGE_PMD_NR.
>
> Modify page_vma_mapped_walk() to return true when a zone device huge
> entry is present, enabling try_to_migrate() and other code migration
> paths to appropriately process the entry. page_vma_mapped_walk() will
> return true for zone device private large folios only when
> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
> not zone device private pages from having to add awareness. The key
> callback that needs this flag is try_to_migrate_one(). The other
> callbacks page idle, damon use it for setting young/dirty bits, which is
> not significant when it comes to pmd level bit harvesting.
>
> pmd_pfn() does not work well with zone device entries, use
> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
> entries.
>
> Zone device private entries when split via munmap go through pmd split,
> but need to go through a folio split, deferred split does not work if a
> fault is encountered because fault handling involves migration entries
> (via folio_migrate_mapping) and the folio sizes are expected to be the
> same there. This introduces the need to split the folio while handling
> the pmd split. Because the folio is still mapped, but calling
> folio_split() will cause lock recursion, the __split_unmapped_folio()
> code is used with a new helper to wrap the code
> split_device_private_folio(), which skips the checks around
> folio->mapping, swapcache and the need to go through unmap and remap
> folio.
>
> Cc: Karol Herbst <kherbst@redhat.com>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
> Cc: Jane Chu <jane.chu@oracle.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Donet Tom <donettom@linux.ibm.com>
> Cc: Mika Penttilä <mpenttil@redhat.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Francois Dugast <francois.dugast@intel.com>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> ---
>  include/linux/huge_mm.h |   1 +
>  include/linux/rmap.h    |   2 +
>  include/linux/swapops.h |  17 +++
>  mm/huge_memory.c        | 268 +++++++++++++++++++++++++++++++++-------
>  mm/page_vma_mapped.c    |  13 +-
>  mm/pgtable-generic.c    |   6 +
>  mm/rmap.c               |  22 +++-
>  7 files changed, 278 insertions(+), 51 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 7748489fde1b..2a6f5ff7bca3 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -345,6 +345,7 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>  int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>  		unsigned int new_order);
> +int split_device_private_folio(struct folio *folio);
>  int min_order_for_split(struct folio *folio);
>  int split_folio_to_list(struct folio *folio, struct list_head *list);
>  bool uniform_split_supported(struct folio *folio, unsigned int new_order,
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 20803fcb49a7..625f36dcc121 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -905,6 +905,8 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
>  #define PVMW_SYNC		(1 << 0)
>  /* Look for migration entries rather than present PTEs */
>  #define PVMW_MIGRATION		(1 << 1)
> +/* Look for device private THP entries */
> +#define PVMW_THP_DEVICE_PRIVATE	(1 << 2)
>  
>  struct page_vma_mapped_walk {
>  	unsigned long pfn;
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index 64ea151a7ae3..2641c01bd5d2 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -563,6 +563,7 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
>  {
>  	return is_swap_pmd(pmd) && is_migration_entry(pmd_to_swp_entry(pmd));
>  }
> +
>  #else  /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
>  static inline int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
>  		struct page *page)
> @@ -594,6 +595,22 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
>  }
>  #endif  /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
>  
> +#if defined(CONFIG_ZONE_DEVICE) && defined(CONFIG_ARCH_ENABLE_THP_MIGRATION)
> +
> +static inline int is_pmd_device_private_entry(pmd_t pmd)
> +{
> +	return is_swap_pmd(pmd) && is_device_private_entry(pmd_to_swp_entry(pmd));
> +}
> +
> +#else /* CONFIG_ZONE_DEVICE && CONFIG_ARCH_ENABLE_THP_MIGRATION */
> +
> +static inline int is_pmd_device_private_entry(pmd_t pmd)
> +{
> +	return 0;
> +}
> +
> +#endif /* CONFIG_ZONE_DEVICE && CONFIG_ARCH_ENABLE_THP_MIGRATION */
> +
>  static inline int non_swap_entry(swp_entry_t entry)
>  {
>  	return swp_type(entry) >= MAX_SWAPFILES;
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 9c38a95e9f09..e373c6578894 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -72,6 +72,10 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
>  					  struct shrink_control *sc);
>  static unsigned long deferred_split_scan(struct shrinker *shrink,
>  					 struct shrink_control *sc);
> +static int __split_unmapped_folio(struct folio *folio, int new_order,
> +		struct page *split_at, struct xa_state *xas,
> +		struct address_space *mapping, bool uniform_split);
> +
>  static bool split_underused_thp = true;
>  
>  static atomic_t huge_zero_refcount;
> @@ -1711,8 +1715,11 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  	if (unlikely(is_swap_pmd(pmd))) {
>  		swp_entry_t entry = pmd_to_swp_entry(pmd);
>  
> -		VM_BUG_ON(!is_pmd_migration_entry(pmd));
> -		if (!is_readable_migration_entry(entry)) {
> +		VM_WARN_ON(!is_pmd_migration_entry(pmd) &&
> +				!is_pmd_device_private_entry(pmd));
> +
> +		if (is_migration_entry(entry) &&
> +			is_writable_migration_entry(entry)) {
>  			entry = make_readable_migration_entry(
>  							swp_offset(entry));
>  			pmd = swp_entry_to_pmd(entry);
> @@ -1722,6 +1729,32 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  				pmd = pmd_swp_mkuffd_wp(pmd);
>  			set_pmd_at(src_mm, addr, src_pmd, pmd);
>  		}
> +
> +		if (is_device_private_entry(entry)) {
> +			if (is_writable_device_private_entry(entry)) {
> +				entry = make_readable_device_private_entry(
> +					swp_offset(entry));
> +				pmd = swp_entry_to_pmd(entry);
> +
> +				if (pmd_swp_soft_dirty(*src_pmd))
> +					pmd = pmd_swp_mksoft_dirty(pmd);
> +				if (pmd_swp_uffd_wp(*src_pmd))
> +					pmd = pmd_swp_mkuffd_wp(pmd);
> +				set_pmd_at(src_mm, addr, src_pmd, pmd);
> +			}
> +
> +			src_folio = pfn_swap_entry_folio(entry);
> +			VM_WARN_ON(!folio_test_large(src_folio));
> +
> +			folio_get(src_folio);
> +			/*
> +			 * folio_try_dup_anon_rmap_pmd does not fail for
> +			 * device private entries.
> +			 */
> +			VM_WARN_ON(folio_try_dup_anon_rmap_pmd(src_folio,
> +					  &src_folio->page, dst_vma, src_vma));
> +		}
> +
>  		add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
>  		mm_inc_nr_ptes(dst_mm);
>  		pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
> @@ -2219,15 +2252,22 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  			folio_remove_rmap_pmd(folio, page, vma);
>  			WARN_ON_ONCE(folio_mapcount(folio) < 0);
>  			VM_BUG_ON_PAGE(!PageHead(page), page);
> -		} else if (thp_migration_supported()) {
> +		} else if (is_pmd_migration_entry(orig_pmd) ||
> +				is_pmd_device_private_entry(orig_pmd)) {
>  			swp_entry_t entry;
>  
> -			VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
>  			entry = pmd_to_swp_entry(orig_pmd);
>  			folio = pfn_swap_entry_folio(entry);
>  			flush_needed = 0;
> -		} else
> -			WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
> +
> +			if (!thp_migration_supported())
> +				WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
> +
> +			if (is_pmd_device_private_entry(orig_pmd)) {
> +				folio_remove_rmap_pmd(folio, &folio->page, vma);
> +				WARN_ON_ONCE(folio_mapcount(folio) < 0);
> +			}
> +		}
>  
>  		if (folio_test_anon(folio)) {
>  			zap_deposited_table(tlb->mm, pmd);
> @@ -2247,6 +2287,15 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  				folio_mark_accessed(folio);
>  		}
>  
> +		/*
> +		 * Do a folio put on zone device private pages after
> +		 * changes to mm_counter, because the folio_put() will
> +		 * clean folio->mapping and the folio_test_anon() check
> +		 * will not be usable.
> +		 */
> +		if (folio_is_device_private(folio))
> +			folio_put(folio);
> +
>  		spin_unlock(ptl);
>  		if (flush_needed)
>  			tlb_remove_page_size(tlb, &folio->page, HPAGE_PMD_SIZE);
> @@ -2375,7 +2424,8 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		struct folio *folio = pfn_swap_entry_folio(entry);
>  		pmd_t newpmd;
>  
> -		VM_BUG_ON(!is_pmd_migration_entry(*pmd));
> +		VM_WARN_ON(!is_pmd_migration_entry(*pmd) &&
> +			   !folio_is_device_private(folio));
>  		if (is_writable_migration_entry(entry)) {
>  			/*
>  			 * A protection check is difficult so
> @@ -2388,6 +2438,10 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  			newpmd = swp_entry_to_pmd(entry);
>  			if (pmd_swp_soft_dirty(*pmd))
>  				newpmd = pmd_swp_mksoft_dirty(newpmd);
> +		} else if (is_writable_device_private_entry(entry)) {
> +			entry = make_readable_device_private_entry(
> +							swp_offset(entry));
> +			newpmd = swp_entry_to_pmd(entry);
>  		} else {
>  			newpmd = *pmd;
>  		}
> @@ -2834,6 +2888,44 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>  	pmd_populate(mm, pmd, pgtable);
>  }
>  
> +/**
> + * split_huge_device_private_folio - split a huge device private folio into
> + * smaller pages (of order 0), currently used by migrate_device logic to
> + * split folios for pages that are partially mapped
> + *
> + * @folio: the folio to split
> + *
> + * The caller has to hold the folio_lock and a reference via folio_get
> + */
> +int split_device_private_folio(struct folio *folio)
> +{
> +	struct folio *end_folio = folio_next(folio);
> +	struct folio *new_folio;
> +	int ret = 0;
> +
> +	/*
> +	 * Split the folio now. In the case of device
> +	 * private pages, this path is executed when
> +	 * the pmd is split and since freeze is not true
> +	 * it is likely the folio will be deferred_split.
> +	 *
> +	 * With device private pages, deferred splits of
> +	 * folios should be handled here to prevent partial
> +	 * unmaps from causing issues later on in migration
> +	 * and fault handling flows.
> +	 */
> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));

Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?

> +	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);

Confusing to  __split_unmapped_folio() if folio is mapped...
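
As an illustration of the freeze concern, and only as a sketch rather than code
from the patch: folio_ref_freeze() returns a bool, so a failure-aware caller
would look roughly like the below, where the -EAGAIN handling is a hypothetical
choice.

	/*
	 * Sketch only: bail out if the folio holds references beyond the
	 * expected ones instead of assuming the freeze always succeeds.
	 */
	if (!folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio)))
		return -EAGAIN;		/* hypothetical error handling */

	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);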

--Mika



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
  2025-07-30 11:16   ` Mika Penttilä
@ 2025-07-30 11:27     ` Zi Yan
  2025-07-30 11:30       ` Zi Yan
  0 siblings, 1 reply; 71+ messages in thread
From: Zi Yan @ 2025-07-30 11:27 UTC (permalink / raw)
  To: Mika Penttilä
  Cc: Balbir Singh, linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, David Hildenbrand,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom, Matthew Brost,
	Francois Dugast, Ralph Campbell

On 30 Jul 2025, at 7:16, Mika Penttilä wrote:

> Hi,
>
> On 7/30/25 12:21, Balbir Singh wrote:
>> Make THP handling code in the mm subsystem for THP pages aware of zone
>> device pages. Although the code is designed to be generic when it comes
>> to handling splitting of pages, the code is designed to work for THP
>> page sizes corresponding to HPAGE_PMD_NR.
>>
>> Modify page_vma_mapped_walk() to return true when a zone device huge
>> entry is present, enabling try_to_migrate() and other code migration
>> paths to appropriately process the entry. page_vma_mapped_walk() will
>> return true for zone device private large folios only when
>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>> not zone device private pages from having to add awareness. The key
>> callback that needs this flag is try_to_migrate_one(). The other
>> callbacks page idle, damon use it for setting young/dirty bits, which is
>> not significant when it comes to pmd level bit harvesting.
>>
>> pmd_pfn() does not work well with zone device entries, use
>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>> entries.
>>
>> Zone device private entries when split via munmap go through pmd split,
>> but need to go through a folio split, deferred split does not work if a
>> fault is encountered because fault handling involves migration entries
>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>> same there. This introduces the need to split the folio while handling
>> the pmd split. Because the folio is still mapped, but calling
>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>> code is used with a new helper to wrap the code
>> split_device_private_folio(), which skips the checks around
>> folio->mapping, swapcache and the need to go through unmap and remap
>> folio.
>>
>> Cc: Karol Herbst <kherbst@redhat.com>
>> Cc: Lyude Paul <lyude@redhat.com>
>> Cc: Danilo Krummrich <dakr@kernel.org>
>> Cc: David Airlie <airlied@gmail.com>
>> Cc: Simona Vetter <simona@ffwll.ch>
>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>> Cc: Shuah Khan <shuah@kernel.org>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Barry Song <baohua@kernel.org>
>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Cc: Peter Xu <peterx@redhat.com>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>> Cc: Jane Chu <jane.chu@oracle.com>
>> Cc: Alistair Popple <apopple@nvidia.com>
>> Cc: Donet Tom <donettom@linux.ibm.com>
>> Cc: Mika Penttilä <mpenttil@redhat.com>
>> Cc: Matthew Brost <matthew.brost@intel.com>
>> Cc: Francois Dugast <francois.dugast@intel.com>
>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>
>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>> ---
>>  include/linux/huge_mm.h |   1 +
>>  include/linux/rmap.h    |   2 +
>>  include/linux/swapops.h |  17 +++
>>  mm/huge_memory.c        | 268 +++++++++++++++++++++++++++++++++-------
>>  mm/page_vma_mapped.c    |  13 +-
>>  mm/pgtable-generic.c    |   6 +
>>  mm/rmap.c               |  22 +++-
>>  7 files changed, 278 insertions(+), 51 deletions(-)
>>

<snip>

>> +/**
>> + * split_huge_device_private_folio - split a huge device private folio into
>> + * smaller pages (of order 0), currently used by migrate_device logic to
>> + * split folios for pages that are partially mapped
>> + *
>> + * @folio: the folio to split
>> + *
>> + * The caller has to hold the folio_lock and a reference via folio_get
>> + */
>> +int split_device_private_folio(struct folio *folio)
>> +{
>> +	struct folio *end_folio = folio_next(folio);
>> +	struct folio *new_folio;
>> +	int ret = 0;
>> +
>> +	/*
>> +	 * Split the folio now. In the case of device
>> +	 * private pages, this path is executed when
>> +	 * the pmd is split and since freeze is not true
>> +	 * it is likely the folio will be deferred_split.
>> +	 *
>> +	 * With device private pages, deferred splits of
>> +	 * folios should be handled here to prevent partial
>> +	 * unmaps from causing issues later on in migration
>> +	 * and fault handling flows.
>> +	 */
>> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>
> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?

Based on my off-list conversation with Balbir, the folio is unmapped in
CPU side but mapped in the device. folio_ref_freeze() is not aware of
device side mapping.

>
>> +	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>
> Confusing to  __split_unmapped_folio() if folio is mapped...

From driver point of view, __split_unmapped_folio() probably should be renamed
to __split_cpu_unmapped_folio(), since it is only dealing with CPU side
folio meta data for split.


Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 00/11] THP support for zone device page migration
  2025-07-30  9:21 [v2 00/11] THP support for zone device page migration Balbir Singh
                   ` (10 preceding siblings ...)
  2025-07-30  9:21 ` [v2 11/11] selftests/mm/hmm-tests: new throughput tests including THP Balbir Singh
@ 2025-07-30 11:30 ` David Hildenbrand
  2025-07-30 23:18   ` Alistair Popple
  2025-07-31  8:41   ` Balbir Singh
  2025-08-05 21:34 ` Matthew Brost
  12 siblings, 2 replies; 71+ messages in thread
From: David Hildenbrand @ 2025-07-30 11:30 UTC (permalink / raw)
  To: Balbir Singh, linux-mm
  Cc: linux-kernel, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
	Ralph Campbell, Mika Penttilä, Matthew Brost,
	Francois Dugast

On 30.07.25 11:21, Balbir Singh wrote:

BTW, I keep getting confused by the topic.

Isn't this essentially

"mm: support device-private THP"

and the support for migration is just a necessary requirement to 
*enable* device private?

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
  2025-07-30 11:27     ` Zi Yan
@ 2025-07-30 11:30       ` Zi Yan
  2025-07-30 11:42         ` Mika Penttilä
  0 siblings, 1 reply; 71+ messages in thread
From: Zi Yan @ 2025-07-30 11:30 UTC (permalink / raw)
  To: Mika Penttilä
  Cc: Balbir Singh, linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, David Hildenbrand,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom, Matthew Brost,
	Francois Dugast, Ralph Campbell

On 30 Jul 2025, at 7:27, Zi Yan wrote:

> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>
>> Hi,
>>
>> On 7/30/25 12:21, Balbir Singh wrote:
>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>> device pages. Although the code is designed to be generic when it comes
>>> to handling splitting of pages, the code is designed to work for THP
>>> page sizes corresponding to HPAGE_PMD_NR.
>>>
>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>> entry is present, enabling try_to_migrate() and other code migration
>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>> return true for zone device private large folios only when
>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>> not zone device private pages from having to add awareness. The key
>>> callback that needs this flag is try_to_migrate_one(). The other
>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>> not significant when it comes to pmd level bit harvesting.
>>>
>>> pmd_pfn() does not work well with zone device entries, use
>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>> entries.
>>>
>>> Zone device private entries when split via munmap go through pmd split,
>>> but need to go through a folio split, deferred split does not work if a
>>> fault is encountered because fault handling involves migration entries
>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>> same there. This introduces the need to split the folio while handling
>>> the pmd split. Because the folio is still mapped, but calling
>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>> code is used with a new helper to wrap the code
>>> split_device_private_folio(), which skips the checks around
>>> folio->mapping, swapcache and the need to go through unmap and remap
>>> folio.
>>>
>>> Cc: Karol Herbst <kherbst@redhat.com>
>>> Cc: Lyude Paul <lyude@redhat.com>
>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>> Cc: David Airlie <airlied@gmail.com>
>>> Cc: Simona Vetter <simona@ffwll.ch>
>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>> Cc: Shuah Khan <shuah@kernel.org>
>>> Cc: David Hildenbrand <david@redhat.com>
>>> Cc: Barry Song <baohua@kernel.org>
>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>> Cc: Matthew Wilcox <willy@infradead.org>
>>> Cc: Peter Xu <peterx@redhat.com>
>>> Cc: Zi Yan <ziy@nvidia.com>
>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>> Cc: Jane Chu <jane.chu@oracle.com>
>>> Cc: Alistair Popple <apopple@nvidia.com>
>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>
>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>> ---
>>>  include/linux/huge_mm.h |   1 +
>>>  include/linux/rmap.h    |   2 +
>>>  include/linux/swapops.h |  17 +++
>>>  mm/huge_memory.c        | 268 +++++++++++++++++++++++++++++++++-------
>>>  mm/page_vma_mapped.c    |  13 +-
>>>  mm/pgtable-generic.c    |   6 +
>>>  mm/rmap.c               |  22 +++-
>>>  7 files changed, 278 insertions(+), 51 deletions(-)
>>>
>
> <snip>
>
>>> +/**
>>> + * split_huge_device_private_folio - split a huge device private folio into
>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>> + * split folios for pages that are partially mapped
>>> + *
>>> + * @folio: the folio to split
>>> + *
>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>> + */
>>> +int split_device_private_folio(struct folio *folio)
>>> +{
>>> +	struct folio *end_folio = folio_next(folio);
>>> +	struct folio *new_folio;
>>> +	int ret = 0;
>>> +
>>> +	/*
>>> +	 * Split the folio now. In the case of device
>>> +	 * private pages, this path is executed when
>>> +	 * the pmd is split and since freeze is not true
>>> +	 * it is likely the folio will be deferred_split.
>>> +	 *
>>> +	 * With device private pages, deferred splits of
>>> +	 * folios should be handled here to prevent partial
>>> +	 * unmaps from causing issues later on in migration
>>> +	 * and fault handling flows.
>>> +	 */
>>> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>
>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>
> Based on my off-list conversation with Balbir, the folio is unmapped in
> CPU side but mapped in the device. folio_ref_freeze() is not aware of
> device side mapping.

Maybe we should make it aware of device private mapping? So that the
process mirrors CPU side folio split: 1) unmap device private mapping,
2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
5) remap device private mapping.
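
Purely as a sketch of those five steps (the *_device_private_mapping() helpers
and unfreeze_split_folios() below are hypothetical names, not existing
functions):

	/* 1) unmap the device private mapping (hypothetical helper) */
	unmap_device_private_mapping(folio);

	/* 2) freeze; a real version would need failure handling */
	if (!folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio)))
		goto remap;

	/* 3) split the now fully unmapped folio */
	__split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);

	/* 4) unfreeze each resulting folio (hypothetical helper) */
	unfreeze_split_folios(folio);

remap:
	/* 5) remap the device private mapping (hypothetical helper) */
	remap_device_private_mapping(folio);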

>
>>
>>> +	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>
>> Confusing to  __split_unmapped_folio() if folio is mapped...
>
> From driver point of view, __split_unmapped_folio() probably should be renamed
> to __split_cpu_unmapped_folio(), since it is only dealing with CPU side
> folio meta data for split.
>
>
> Best Regards,
> Yan, Zi


Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
  2025-07-30 11:30       ` Zi Yan
@ 2025-07-30 11:42         ` Mika Penttilä
  2025-07-30 12:08           ` Mika Penttilä
  0 siblings, 1 reply; 71+ messages in thread
From: Mika Penttilä @ 2025-07-30 11:42 UTC (permalink / raw)
  To: Zi Yan
  Cc: Balbir Singh, linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, David Hildenbrand,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom, Matthew Brost,
	Francois Dugast, Ralph Campbell


On 7/30/25 14:30, Zi Yan wrote:
> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>
>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>
>>> Hi,
>>>
>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>> device pages. Although the code is designed to be generic when it comes
>>>> to handling splitting of pages, the code is designed to work for THP
>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>
>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>> entry is present, enabling try_to_migrate() and other code migration
>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>> return true for zone device private large folios only when
>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>> not zone device private pages from having to add awareness. The key
>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>> not significant when it comes to pmd level bit harvesting.
>>>>
>>>> pmd_pfn() does not work well with zone device entries, use
>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>> entries.
>>>>
>>>> Zone device private entries when split via munmap go through pmd split,
>>>> but need to go through a folio split, deferred split does not work if a
>>>> fault is encountered because fault handling involves migration entries
>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>> same there. This introduces the need to split the folio while handling
>>>> the pmd split. Because the folio is still mapped, but calling
>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>> code is used with a new helper to wrap the code
>>>> split_device_private_folio(), which skips the checks around
>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>> folio.
>>>>
>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>> Cc: David Airlie <airlied@gmail.com>
>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>> Cc: David Hildenbrand <david@redhat.com>
>>>> Cc: Barry Song <baohua@kernel.org>
>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>> Cc: Peter Xu <peterx@redhat.com>
>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>
>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>> ---
>>>>  include/linux/huge_mm.h |   1 +
>>>>  include/linux/rmap.h    |   2 +
>>>>  include/linux/swapops.h |  17 +++
>>>>  mm/huge_memory.c        | 268 +++++++++++++++++++++++++++++++++-------
>>>>  mm/page_vma_mapped.c    |  13 +-
>>>>  mm/pgtable-generic.c    |   6 +
>>>>  mm/rmap.c               |  22 +++-
>>>>  7 files changed, 278 insertions(+), 51 deletions(-)
>>>>
>> <snip>
>>
>>>> +/**
>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>> + * split folios for pages that are partially mapped
>>>> + *
>>>> + * @folio: the folio to split
>>>> + *
>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>> + */
>>>> +int split_device_private_folio(struct folio *folio)
>>>> +{
>>>> +	struct folio *end_folio = folio_next(folio);
>>>> +	struct folio *new_folio;
>>>> +	int ret = 0;
>>>> +
>>>> +	/*
>>>> +	 * Split the folio now. In the case of device
>>>> +	 * private pages, this path is executed when
>>>> +	 * the pmd is split and since freeze is not true
>>>> +	 * it is likely the folio will be deferred_split.
>>>> +	 *
>>>> +	 * With device private pages, deferred splits of
>>>> +	 * folios should be handled here to prevent partial
>>>> +	 * unmaps from causing issues later on in migration
>>>> +	 * and fault handling flows.
>>>> +	 */
>>>> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>> Based on my off-list conversation with Balbir, the folio is unmapped in
>> CPU side but mapped in the device. folio_ref_freeze() is not aware of
>> device side mapping.
> Maybe we should make it aware of device private mapping? So that the
> process mirrors CPU side folio split: 1) unmap device private mapping,
> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
> 5) remap device private mapping.

Ah ok this was about device private page obviously here, nevermind..

>>>> +	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>> Confusing to  __split_unmapped_folio() if folio is mapped...
>> From driver point of view, __split_unmapped_folio() probably should be renamed
>> to __split_cpu_unmapped_folio(), since it is only dealing with CPU side
>> folio meta data for split.
>>
>>
>> Best Regards,
>> Yan, Zi
>
> Best Regards,
> Yan, Zi
>


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
  2025-07-30 11:42         ` Mika Penttilä
@ 2025-07-30 12:08           ` Mika Penttilä
  2025-07-30 12:25             ` Zi Yan
  0 siblings, 1 reply; 71+ messages in thread
From: Mika Penttilä @ 2025-07-30 12:08 UTC (permalink / raw)
  To: Zi Yan
  Cc: Balbir Singh, linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, David Hildenbrand,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom, Matthew Brost,
	Francois Dugast, Ralph Campbell


On 7/30/25 14:42, Mika Penttilä wrote:
> On 7/30/25 14:30, Zi Yan wrote:
>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>
>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>
>>>> Hi,
>>>>
>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>> device pages. Although the code is designed to be generic when it comes
>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>
>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>> return true for zone device private large folios only when
>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>> not zone device private pages from having to add awareness. The key
>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>
>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>> entries.
>>>>>
>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>> but need to go through a folio split, deferred split does not work if a
>>>>> fault is encountered because fault handling involves migration entries
>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>> same there. This introduces the need to split the folio while handling
>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>> code is used with a new helper to wrap the code
>>>>> split_device_private_folio(), which skips the checks around
>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>> folio.
>>>>>
>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>
>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>> ---
>>>>>  include/linux/huge_mm.h |   1 +
>>>>>  include/linux/rmap.h    |   2 +
>>>>>  include/linux/swapops.h |  17 +++
>>>>>  mm/huge_memory.c        | 268 +++++++++++++++++++++++++++++++++-------
>>>>>  mm/page_vma_mapped.c    |  13 +-
>>>>>  mm/pgtable-generic.c    |   6 +
>>>>>  mm/rmap.c               |  22 +++-
>>>>>  7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>
>>> <snip>
>>>
>>>>> +/**
>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>> + * split folios for pages that are partially mapped
>>>>> + *
>>>>> + * @folio: the folio to split
>>>>> + *
>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>> + */
>>>>> +int split_device_private_folio(struct folio *folio)
>>>>> +{
>>>>> +	struct folio *end_folio = folio_next(folio);
>>>>> +	struct folio *new_folio;
>>>>> +	int ret = 0;
>>>>> +
>>>>> +	/*
>>>>> +	 * Split the folio now. In the case of device
>>>>> +	 * private pages, this path is executed when
>>>>> +	 * the pmd is split and since freeze is not true
>>>>> +	 * it is likely the folio will be deferred_split.
>>>>> +	 *
>>>>> +	 * With device private pages, deferred splits of
>>>>> +	 * folios should be handled here to prevent partial
>>>>> +	 * unmaps from causing issues later on in migration
>>>>> +	 * and fault handling flows.
>>>>> +	 */
>>>>> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>> CPU side but mapped in the device. folio_ref_freeze() is not aware of
>>> device side mapping.
>> Maybe we should make it aware of device private mapping? So that the
>> process mirrors CPU side folio split: 1) unmap device private mapping,
>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>> 5) remap device private mapping.
> Ah ok this was about device private page obviously here, nevermind..

Still, isn't this reachable from split_huge_pmd() paths while the folio is mapped to CPU page tables as a huge device page by one or more tasks?

>
>>>>> +	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>>> Confusing to  __split_unmapped_folio() if folio is mapped...
>>> From driver point of view, __split_unmapped_folio() probably should be renamed
>>> to __split_cpu_unmapped_folio(), since it is only dealing with CPU side
>>> folio meta data for split.
>>>
>>>
>>> Best Regards,
>>> Yan, Zi
>> Best Regards,
>> Yan, Zi
>>

--Mika



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
  2025-07-30 12:08           ` Mika Penttilä
@ 2025-07-30 12:25             ` Zi Yan
  2025-07-30 12:49               ` Mika Penttilä
  0 siblings, 1 reply; 71+ messages in thread
From: Zi Yan @ 2025-07-30 12:25 UTC (permalink / raw)
  To: Mika Penttilä
  Cc: Balbir Singh, linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, David Hildenbrand,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom, Matthew Brost,
	Francois Dugast, Ralph Campbell

On 30 Jul 2025, at 8:08, Mika Penttilä wrote:

> On 7/30/25 14:42, Mika Penttilä wrote:
>> On 7/30/25 14:30, Zi Yan wrote:
>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>
>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>
>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>> return true for zone device private large folios only when
>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>> not zone device private pages from having to add awareness. The key
>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>
>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>> entries.
>>>>>>
>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>> fault is encountered because fault handling involves migration entries
>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>> same there. This introduces the need to split the folio while handling
>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>> code is used with a new helper to wrap the code
>>>>>> split_device_private_folio(), which skips the checks around
>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>> folio.
>>>>>>
>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>
>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>> ---
>>>>>>  include/linux/huge_mm.h |   1 +
>>>>>>  include/linux/rmap.h    |   2 +
>>>>>>  include/linux/swapops.h |  17 +++
>>>>>>  mm/huge_memory.c        | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>  mm/page_vma_mapped.c    |  13 +-
>>>>>>  mm/pgtable-generic.c    |   6 +
>>>>>>  mm/rmap.c               |  22 +++-
>>>>>>  7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>
>>>> <snip>
>>>>
>>>>>> +/**
>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>> + * split folios for pages that are partially mapped
>>>>>> + *
>>>>>> + * @folio: the folio to split
>>>>>> + *
>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>> + */
>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>> +{
>>>>>> +	struct folio *end_folio = folio_next(folio);
>>>>>> +	struct folio *new_folio;
>>>>>> +	int ret = 0;
>>>>>> +
>>>>>> +	/*
>>>>>> +	 * Split the folio now. In the case of device
>>>>>> +	 * private pages, this path is executed when
>>>>>> +	 * the pmd is split and since freeze is not true
>>>>>> +	 * it is likely the folio will be deferred_split.
>>>>>> +	 *
>>>>>> +	 * With device private pages, deferred splits of
>>>>>> +	 * folios should be handled here to prevent partial
>>>>>> +	 * unmaps from causing issues later on in migration
>>>>>> +	 * and fault handling flows.
>>>>>> +	 */
>>>>>> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>> CPU side but mapped in the device. folio_ref_freeze() is not aware of
>>>> device side mapping.
>>> Maybe we should make it aware of device private mapping? So that the
>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>> 5) remap device private mapping.
>> Ah ok this was about device private page obviously here, nevermind..
>
> Still, isn't this reachable from split_huge_pmd() paths while the folio is mapped to CPU page tables as a huge device page by one or more tasks?

The folio only has migration entries pointing to it. From CPU perspective,
it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
folio by replacing existing page table entries with migration entries
and after that the folio is regarded as “unmapped”.

The migration entry is an invalid CPU page table entry, so it is not a CPU
mapping, IIUC.
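
As an illustration, combining the helper added by the quoted patch with the
existing migration helper: both entry types are non-present swap-style entries,
so neither gives the CPU access to the folio (the wrapper name below is
hypothetical).

	/* Sketch: a PMD carrying either entry type is "unmapped" for the CPU. */
	static inline bool pmd_is_cpu_unmapped(pmd_t pmd)
	{
		return is_pmd_migration_entry(pmd) ||
		       is_pmd_device_private_entry(pmd);
	}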

>
>>
>>>>>> +	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>>>> Confusing to  __split_unmapped_folio() if folio is mapped...
>>>> From driver point of view, __split_unmapped_folio() probably should be renamed
>>>> to __split_cpu_unmapped_folio(), since it is only dealing with CPU side
>>>> folio meta data for split.
>>>>
>>>>
>>>> Best Regards,
>>>> Yan, Zi
>>> Best Regards,
>>> Yan, Zi
>>>
>
> --Mika


Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
  2025-07-30 12:25             ` Zi Yan
@ 2025-07-30 12:49               ` Mika Penttilä
  2025-07-30 15:10                 ` Zi Yan
  0 siblings, 1 reply; 71+ messages in thread
From: Mika Penttilä @ 2025-07-30 12:49 UTC (permalink / raw)
  To: Zi Yan
  Cc: Balbir Singh, linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, David Hildenbrand,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom, Matthew Brost,
	Francois Dugast, Ralph Campbell

On 7/30/25 15:25, Zi Yan wrote:
> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>
>> On 7/30/25 14:42, Mika Penttilä wrote:
>>> On 7/30/25 14:30, Zi Yan wrote:
>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>
>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>
>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>> return true for zone device private large folios only when
>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>
>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>> entries.
>>>>>>>
>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>> code is used with a new helper to wrap the code
>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>> folio.
>>>>>>>
>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>
>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>> ---
>>>>>>>  include/linux/huge_mm.h |   1 +
>>>>>>>  include/linux/rmap.h    |   2 +
>>>>>>>  include/linux/swapops.h |  17 +++
>>>>>>>  mm/huge_memory.c        | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>  mm/page_vma_mapped.c    |  13 +-
>>>>>>>  mm/pgtable-generic.c    |   6 +
>>>>>>>  mm/rmap.c               |  22 +++-
>>>>>>>  7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>
>>>>> <snip>
>>>>>
>>>>>>> +/**
>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>> + * split folios for pages that are partially mapped
>>>>>>> + *
>>>>>>> + * @folio: the folio to split
>>>>>>> + *
>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>> + */
>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>> +{
>>>>>>> +	struct folio *end_folio = folio_next(folio);
>>>>>>> +	struct folio *new_folio;
>>>>>>> +	int ret = 0;
>>>>>>> +
>>>>>>> +	/*
>>>>>>> +	 * Split the folio now. In the case of device
>>>>>>> +	 * private pages, this path is executed when
>>>>>>> +	 * the pmd is split and since freeze is not true
>>>>>>> +	 * it is likely the folio will be deferred_split.
>>>>>>> +	 *
>>>>>>> +	 * With device private pages, deferred splits of
>>>>>>> +	 * folios should be handled here to prevent partial
>>>>>>> +	 * unmaps from causing issues later on in migration
>>>>>>> +	 * and fault handling flows.
>>>>>>> +	 */
>>>>>>> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>> CPU side but mapped in the device. folio_ref_freeze() is not aware of
>>>>> device side mapping.
>>>> Maybe we should make it aware of device private mapping? So that the
>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>> 5) remap device private mapping.
>>> Ah ok this was about device private page obviously here, nevermind..
>> Still, isn't this reachable from split_huge_pmd() paths while the folio is mapped to CPU page tables as a huge device page by one or more tasks?
> The folio only has migration entries pointing to it. From CPU perspective,
> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
> folio by replacing existing page table entries with migration entries
> and after that the folio is regarded as “unmapped”.
>
> The migration entry is an invalid CPU page table entry, so it is not a CPU

split_device_private_folio() is called for device private entry, not migrate entry afaics. 
And it is called from split_huge_pmd() with freeze == false, not from folio split but pmd split.

> mapping, IIUC.
>
>>>>>>> +	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>>>>> Confusing to  __split_unmapped_folio() if folio is mapped...
>>>>> From driver point of view, __split_unmapped_folio() probably should be renamed
>>>>> to __split_cpu_unmapped_folio(), since it is only dealing with CPU side
>>>>> folio meta data for split.
>>>>>
>>>>>
>>>>> Best Regards,
>>>>> Yan, Zi
>>>> Best Regards,
>>>> Yan, Zi
>>>>
>> --Mika
>
> Best Regards,
> Yan, Zi
>
--Mika



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
  2025-07-30 12:49               ` Mika Penttilä
@ 2025-07-30 15:10                 ` Zi Yan
  2025-07-30 15:40                   ` Mika Penttilä
  0 siblings, 1 reply; 71+ messages in thread
From: Zi Yan @ 2025-07-30 15:10 UTC (permalink / raw)
  To: Mika Penttilä
  Cc: Balbir Singh, linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, David Hildenbrand,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom, Matthew Brost,
	Francois Dugast, Ralph Campbell

On 30 Jul 2025, at 8:49, Mika Penttilä wrote:

> On 7/30/25 15:25, Zi Yan wrote:
>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>
>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>
>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>
>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>> return true for zone device private large folios only when
>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>
>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>> entries.
>>>>>>>>
>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>> folio.
>>>>>>>>
>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>
>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>> ---
>>>>>>>>  include/linux/huge_mm.h |   1 +
>>>>>>>>  include/linux/rmap.h    |   2 +
>>>>>>>>  include/linux/swapops.h |  17 +++
>>>>>>>>  mm/huge_memory.c        | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>  mm/page_vma_mapped.c    |  13 +-
>>>>>>>>  mm/pgtable-generic.c    |   6 +
>>>>>>>>  mm/rmap.c               |  22 +++-
>>>>>>>>  7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>
>>>>>> <snip>
>>>>>>
>>>>>>>> +/**
>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>> + *
>>>>>>>> + * @folio: the folio to split
>>>>>>>> + *
>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>> + */
>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>> +{
>>>>>>>> +	struct folio *end_folio = folio_next(folio);
>>>>>>>> +	struct folio *new_folio;
>>>>>>>> +	int ret = 0;
>>>>>>>> +
>>>>>>>> +	/*
>>>>>>>> +	 * Split the folio now. In the case of device
>>>>>>>> +	 * private pages, this path is executed when
>>>>>>>> +	 * the pmd is split and since freeze is not true
>>>>>>>> +	 * it is likely the folio will be deferred_split.
>>>>>>>> +	 *
>>>>>>>> +	 * With device private pages, deferred splits of
>>>>>>>> +	 * folios should be handled here to prevent partial
>>>>>>>> +	 * unmaps from causing issues later on in migration
>>>>>>>> +	 * and fault handling flows.
>>>>>>>> +	 */
>>>>>>>> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>> CPU side but mapped in the device. folio_ref_freeze() is not aware of
>>>>>> device side mapping.
>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>> 5) remap device private mapping.
>>>> Ah ok this was about device private page obviously here, nevermind..
>>> Still, isn't this reachable from split_huge_pmd() paths while the folio is mapped to CPU page tables as a huge device page by one or more tasks?
>> The folio only has migration entries pointing to it. From CPU perspective,
>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>> folio by replacing existing page table entries with migration entries
>> and after that the folio is regarded as “unmapped”.
>>
>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>
> split_device_private_folio() is called for device private entry, not migrate entry afaics.

Yes, but from CPU perspective, both device private entry and migration entry
are invalid CPU page table entries, so the device private folio is “unmapped”
at CPU side.


> And it is called from split_huge_pmd() with freeze == false, not from folio split but pmd split.

I am not sure that is the right time to split a folio. The device private
folio can be kept unsplit at split_huge_pmd() time.

But from the CPU's perspective, a device private folio has no CPU mapping, so no
other CPU can access or manipulate the folio. It should be OK to split it.

>
>> mapping, IIUC.
>>
>>>>>>>> +	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>>>>>> Confusing to  __split_unmapped_folio() if folio is mapped...
>>>>>> From driver point of view, __split_unmapped_folio() probably should be renamed
>>>>>> to __split_cpu_unmapped_folio(), since it is only dealing with CPU side
>>>>>> folio meta data for split.
>>>>>>
>>>>>>
>>>>>> Best Regards,
>>>>>> Yan, Zi
>>>>> Best Regards,
>>>>> Yan, Zi
>>>>>
>>> --Mika
>>
>> Best Regards,
>> Yan, Zi
>>
> --Mika


Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
  2025-07-30 15:10                 ` Zi Yan
@ 2025-07-30 15:40                   ` Mika Penttilä
  2025-07-30 15:58                     ` Zi Yan
  0 siblings, 1 reply; 71+ messages in thread
From: Mika Penttilä @ 2025-07-30 15:40 UTC (permalink / raw)
  To: Zi Yan
  Cc: Balbir Singh, linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, David Hildenbrand,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom, Matthew Brost,
	Francois Dugast, Ralph Campbell


On 7/30/25 18:10, Zi Yan wrote:
> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>
>> On 7/30/25 15:25, Zi Yan wrote:
>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>
>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>
>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>
>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>
>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>> entries.
>>>>>>>>>
>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>> folio.
>>>>>>>>>
>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>
>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>> ---
>>>>>>>>>  include/linux/huge_mm.h |   1 +
>>>>>>>>>  include/linux/rmap.h    |   2 +
>>>>>>>>>  include/linux/swapops.h |  17 +++
>>>>>>>>>  mm/huge_memory.c        | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>  mm/page_vma_mapped.c    |  13 +-
>>>>>>>>>  mm/pgtable-generic.c    |   6 +
>>>>>>>>>  mm/rmap.c               |  22 +++-
>>>>>>>>>  7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>
>>>>>>> <snip>
>>>>>>>
>>>>>>>>> +/**
>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>> + *
>>>>>>>>> + * @folio: the folio to split
>>>>>>>>> + *
>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>> + */
>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>> +{
>>>>>>>>> +	struct folio *end_folio = folio_next(folio);
>>>>>>>>> +	struct folio *new_folio;
>>>>>>>>> +	int ret = 0;
>>>>>>>>> +
>>>>>>>>> +	/*
>>>>>>>>> +	 * Split the folio now. In the case of device
>>>>>>>>> +	 * private pages, this path is executed when
>>>>>>>>> +	 * the pmd is split and since freeze is not true
>>>>>>>>> +	 * it is likely the folio will be deferred_split.
>>>>>>>>> +	 *
>>>>>>>>> +	 * With device private pages, deferred splits of
>>>>>>>>> +	 * folios should be handled here to prevent partial
>>>>>>>>> +	 * unmaps from causing issues later on in migration
>>>>>>>>> +	 * and fault handling flows.
>>>>>>>>> +	 */
>>>>>>>>> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>> device side mapping.
>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>> 5) remap device private mapping.
>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>> The folio only has migration entries pointing to it. From CPU perspective,
>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>> folio by replacing existing page table entries with migration entries
>>> and after that the folio is regarded as “unmapped”.
>>>
>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
> Yes, but from CPU perspective, both device private entry and migration entry
> are invalid CPU page table entries, so the device private folio is “unmapped”
> at CPU side.

Yes, both are "swap entries", but there's a difference: the device private ones contribute to mapcount and refcount.

Also, what might confuse here is that v1 of the series had only
  migrate_vma_split_pages()
which operated only on truly unmapped (mapcount-wise) folios, which was a motivation for split_unmapped_folio(). Now,
  split_device_private_folio()
operates on folios with mapcount != 0.
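
To make "truly unmapped (mapcount-wise)" concrete, the v1 path effectively assumed a guard like the one below; this is only an illustrative sketch, not code from either version:

	/* sketch: only split folios with no mappings left at all */
	if (folio_mapcount(folio))
		return -EBUSY;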

>
>
>> And it is called from split_huge_pmd() with freeze == false, not from folio split but pmd split.
> I am not sure that is the right timing of splitting a folio. The device private
> folio can be kept without splitting at split_huge_pmd() time.

Yes, this doesn't look quite right, and also

+	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));

looks suspicious, since the return value is not checked.

Maybe split_device_private_folio() tries to solve some corner case, but it would be good to elaborate
on the exact conditions; there might be a better fix.
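
For comparison, the regular __folio_split() path treats the freeze as something that can fail; roughly like this (simplified sketch of the existing code, not of this patch):

	/* sketch: __folio_split() handles a freeze that can fail */
	/* extra_pins is what can_split_folio() worked out beforehand */
	if (folio_ref_freeze(folio, 1 + extra_pins)) {
		/* no unexpected references: safe to carry out the split */
	} else {
		/* unexpected references exist: back out and let the caller retry */
		ret = -EAGAIN;
	}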

>
> But from CPU perspective, a device private folio has no CPU mapping, no other
> CPU can access or manipulate the folio. It should be OK to split it.
>
>>> mapping, IIUC.
>>>
>>>>>>>>> +	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>>>>>>> Confusing to  __split_unmapped_folio() if folio is mapped...
>>>>>>> From driver point of view, __split_unmapped_folio() probably should be renamed
>>>>>>> to __split_cpu_unmapped_folio(), since it is only dealing with CPU side
>>>>>>> folio meta data for split.
>>>>>>>
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Yan, Zi
>>>>>> Best Regards,
>>>>>> Yan, Zi
>>>>>>
>>>> --Mika
>>> Best Regards,
>>> Yan, Zi
>>>
>> --Mika
>
> Best Regards,
> Yan, Zi
>
--Mika



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
  2025-07-30 15:40                   ` Mika Penttilä
@ 2025-07-30 15:58                     ` Zi Yan
  2025-07-30 16:29                       ` Mika Penttilä
  0 siblings, 1 reply; 71+ messages in thread
From: Zi Yan @ 2025-07-30 15:58 UTC (permalink / raw)
  To: Mika Penttilä
  Cc: Balbir Singh, linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, David Hildenbrand,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom, Matthew Brost,
	Francois Dugast, Ralph Campbell

On 30 Jul 2025, at 11:40, Mika Penttilä wrote:

> On 7/30/25 18:10, Zi Yan wrote:
>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>
>>> On 7/30/25 15:25, Zi Yan wrote:
>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>
>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>
>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>
>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>
>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>> entries.
>>>>>>>>>>
>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>> folio.
>>>>>>>>>>
>>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>>> ---
>>>>>>>>>>  include/linux/huge_mm.h |   1 +
>>>>>>>>>>  include/linux/rmap.h    |   2 +
>>>>>>>>>>  include/linux/swapops.h |  17 +++
>>>>>>>>>>  mm/huge_memory.c        | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>  mm/page_vma_mapped.c    |  13 +-
>>>>>>>>>>  mm/pgtable-generic.c    |   6 +
>>>>>>>>>>  mm/rmap.c               |  22 +++-
>>>>>>>>>>  7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>
>>>>>>>> <snip>
>>>>>>>>
>>>>>>>>>> +/**
>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>> + *
>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>> + *
>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>> + */
>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>> +{
>>>>>>>>>> +	struct folio *end_folio = folio_next(folio);
>>>>>>>>>> +	struct folio *new_folio;
>>>>>>>>>> +	int ret = 0;
>>>>>>>>>> +
>>>>>>>>>> +	/*
>>>>>>>>>> +	 * Split the folio now. In the case of device
>>>>>>>>>> +	 * private pages, this path is executed when
>>>>>>>>>> +	 * the pmd is split and since freeze is not true
>>>>>>>>>> +	 * it is likely the folio will be deferred_split.
>>>>>>>>>> +	 *
>>>>>>>>>> +	 * With device private pages, deferred splits of
>>>>>>>>>> +	 * folios should be handled here to prevent partial
>>>>>>>>>> +	 * unmaps from causing issues later on in migration
>>>>>>>>>> +	 * and fault handling flows.
>>>>>>>>>> +	 */
>>>>>>>>>> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>> device side mapping.
>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>> 5) remap device private mapping.
>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>> folio by replacing existing page table entries with migration entries
>>>> and after that the folio is regarded as “unmapped”.
>>>>
>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>> Yes, but from CPU perspective, both device private entry and migration entry
>> are invalid CPU page table entries, so the device private folio is “unmapped”
>> at CPU side.
>
> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.

Right. That confused me when I was talking to Balbir and looking at v1.
When a device private folio is processed in __folio_split(), Balbir needed to
add code to skip the CPU mapping handling. Basically, device private folios are
CPU unmapped and device mapped.

Here are my questions on device private folios:
1. How is mapcount used for device private folios? Why is it needed from the CPU
   perspective? Can it be stored in a device-private-specific data structure?
2. When a device private folio is mapped on the device, can someone other than
   the device driver manipulate it, assuming core-mm just skips device private
   folios (barring the CPU access fault handling)?

Where I am going with this is: can device private folios be treated as unmapped
folios by the CPU, with only the device driver manipulating their mappings?


>
> Also which might confuse is that v1 of the series had only
>   migrate_vma_split_pages()
> which operated only on truly unmapped (mapcount wise) folios. Which was a motivation for split_unmapped_folio()..
> Now,
>   split_device_private_folio()
> operates on mapcount != 0 folios.
>
>>
>>
>>> And it is called from split_huge_pmd() with freeze == false, not from folio split but pmd split.
>> I am not sure that is the right timing of splitting a folio. The device private
>> folio can be kept without splitting at split_huge_pmd() time.
>
> Yes this doesn't look quite right, and also
> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));

I wonder if we need to freeze a device private folio at all. Can anyone other than
the device driver change its refcount, given that the CPU just sees it as an unmapped folio?

>
> looks suspicious
>
> Maybe split_device_private_folio() tries to solve some corner case but maybe good to elaborate
> more the exact conditions, there might be a better fix.
>
>>
>> But from CPU perspective, a device private folio has no CPU mapping, no other
>> CPU can access or manipulate the folio. It should be OK to split it.
>>
>>>> mapping, IIUC.
>>>>
>>>>>>>>>> +	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>>>>>>>> Confusing to  __split_unmapped_folio() if folio is mapped...
>>>>>>>> From driver point of view, __split_unmapped_folio() probably should be renamed
>>>>>>>> to __split_cpu_unmapped_folio(), since it is only dealing with CPU side
>>>>>>>> folio meta data for split.



Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
  2025-07-30 15:58                     ` Zi Yan
@ 2025-07-30 16:29                       ` Mika Penttilä
  2025-07-31  7:15                         ` David Hildenbrand
  0 siblings, 1 reply; 71+ messages in thread
From: Mika Penttilä @ 2025-07-30 16:29 UTC (permalink / raw)
  To: Zi Yan
  Cc: Balbir Singh, linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, David Hildenbrand,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom, Matthew Brost,
	Francois Dugast, Ralph Campbell


On 7/30/25 18:58, Zi Yan wrote:
> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>
>> On 7/30/25 18:10, Zi Yan wrote:
>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>
>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>
>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>
>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>
>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>
>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>> entries.
>>>>>>>>>>>
>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>> folio.
>>>>>>>>>>>
>>>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>>>> ---
>>>>>>>>>>>  include/linux/huge_mm.h |   1 +
>>>>>>>>>>>  include/linux/rmap.h    |   2 +
>>>>>>>>>>>  include/linux/swapops.h |  17 +++
>>>>>>>>>>>  mm/huge_memory.c        | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>  mm/page_vma_mapped.c    |  13 +-
>>>>>>>>>>>  mm/pgtable-generic.c    |   6 +
>>>>>>>>>>>  mm/rmap.c               |  22 +++-
>>>>>>>>>>>  7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>>
>>>>>>>>> <snip>
>>>>>>>>>
>>>>>>>>>>> +/**
>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>> + *
>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>> + *
>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>> + */
>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>> +{
>>>>>>>>>>> +	struct folio *end_folio = folio_next(folio);
>>>>>>>>>>> +	struct folio *new_folio;
>>>>>>>>>>> +	int ret = 0;
>>>>>>>>>>> +
>>>>>>>>>>> +	/*
>>>>>>>>>>> +	 * Split the folio now. In the case of device
>>>>>>>>>>> +	 * private pages, this path is executed when
>>>>>>>>>>> +	 * the pmd is split and since freeze is not true
>>>>>>>>>>> +	 * it is likely the folio will be deferred_split.
>>>>>>>>>>> +	 *
>>>>>>>>>>> +	 * With device private pages, deferred splits of
>>>>>>>>>>> +	 * folios should be handled here to prevent partial
>>>>>>>>>>> +	 * unmaps from causing issues later on in migration
>>>>>>>>>>> +	 * and fault handling flows.
>>>>>>>>>>> +	 */
>>>>>>>>>>> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>> device side mapping.
>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>> 5) remap device private mapping.
>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>> folio by replacing existing page table entries with migration entries
>>>>> and after that the folio is regarded as “unmapped”.
>>>>>
>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>> Yes, but from CPU perspective, both device private entry and migration entry
>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>> at CPU side.
>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
> Right. That confused me when I was talking to Balbir and looking at v1.
> When a device private folio is processed in __folio_split(), Balbir needed to
> add code to skip CPU mapping handling code. Basically device private folios are
> CPU unmapped and device mapped.
>
> Here are my questions on device private folios:
> 1. How is mapcount used for device private folios? Why is it needed from CPU
>    perspective? Can it be stored in a device private specific data structure?

Mostly like for normal folios, for instance for rmap when doing migration. I think it would make
the common code more messy if it were not done that way, but it is certainly possible.
And not consuming pfns (address space) at all would have benefits.

> 2. When a device private folio is mapped on device, can someone other than
>    the device driver manipulate it assuming core-mm just skips device private
>    folios (barring the CPU access fault handling)?
>
> Where I am going is that can device private folios be treated as unmapped folios
> by CPU and only device driver manipulates their mappings?
>
Yes, not present from the CPU's point of view, but mm keeps bookkeeping on them. The private page
has no content someone could change while it is in the device; it's just a pfn.
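
Concretely, the bookkeeping is set up the same way as for an ordinary anonymous folio; this is roughly what migrate_vma_insert_page() already does for the device-private case (simplified sketch):

	/* sketch: install a device-private "mapping" for a migrated page */
	/* 'page' is the device-private page backing 'folio' */
	swp_entry_t swp_entry;

	folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE);	/* mapcount/refcount bookkeeping */
	if (vma->vm_flags & VM_WRITE)
		swp_entry = make_writable_device_private_entry(page_to_pfn(page));
	else
		swp_entry = make_readable_device_private_entry(page_to_pfn(page));
	/* non-present PTE that only encodes the pfn; the CPU cannot access it */
	set_pte_at(mm, addr, ptep, swp_entry_to_pte(swp_entry));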

>> Also which might confuse is that v1 of the series had only
>>   migrate_vma_split_pages()
>> which operated only on truly unmapped (mapcount wise) folios. Which was a motivation for split_unmapped_folio()..
>> Now,
>>   split_device_private_folio()
>> operates on mapcount != 0 folios.
>>
>>>
>>>> And it is called from split_huge_pmd() with freeze == false, not from folio split but pmd split.
>>> I am not sure that is the right timing of splitting a folio. The device private
>>> folio can be kept without splitting at split_huge_pmd() time.
>> Yes this doesn't look quite right, and also
>> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
> I wonder if we need to freeze a device private folio. Can anyone other than
> device driver change its refcount? Since CPU just sees it as an unmapped folio.
>
>> looks suspicious
>>
>> Maybe split_device_private_folio() tries to solve some corner case but maybe good to elaborate
>> more the exact conditions, there might be a better fix.
>>
>>> But from CPU perspective, a device private folio has no CPU mapping, no other
>>> CPU can access or manipulate the folio. It should be OK to split it.
>>>
>>>>> mapping, IIUC.
>>>>>
>>>>>>>>>>> +	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>>>>>>>>> Confusing to  __split_unmapped_folio() if folio is mapped...
>>>>>>>>> From driver point of view, __split_unmapped_folio() probably should be renamed
>>>>>>>>> to __split_cpu_unmapped_folio(), since it is only dealing with CPU side
>>>>>>>>> folio meta data for split.
>
>
> Best Regards,
> Yan, Zi
>

--Mika



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
  2025-07-30  9:21 ` [v2 02/11] mm/thp: zone_device awareness in THP handling code Balbir Singh
  2025-07-30 11:16   ` Mika Penttilä
@ 2025-07-30 20:05   ` kernel test robot
  1 sibling, 0 replies; 71+ messages in thread
From: kernel test robot @ 2025-07-30 20:05 UTC (permalink / raw)
  To: Balbir Singh, linux-mm
  Cc: oe-kbuild-all, linux-kernel, Balbir Singh, Karol Herbst,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, David Hildenbrand,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
	Mika Penttilä, Matthew Brost, Francois Dugast,
	Ralph Campbell

Hi Balbir,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on next-20250730]
[cannot apply to akpm-mm/mm-nonmm-unstable shuah-kselftest/next shuah-kselftest/fixes linus/master v6.16]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Balbir-Singh/mm-zone_device-support-large-zone-device-private-folios/20250730-172600
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20250730092139.3890844-3-balbirs%40nvidia.com
patch subject: [v2 02/11] mm/thp: zone_device awareness in THP handling code
config: i386-buildonly-randconfig-001-20250731 (https://download.01.org/0day-ci/archive/20250731/202507310343.ZipoyitU-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14+deb12u1) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250731/202507310343.ZipoyitU-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202507310343.ZipoyitU-lkp@intel.com/

All warnings (new ones prefixed by >>):

   mm/rmap.c: In function 'try_to_migrate_one':
>> mm/rmap.c:2330:39: warning: unused variable 'pfn' [-Wunused-variable]
    2330 |                         unsigned long pfn;
         |                                       ^~~


vim +/pfn +2330 mm/rmap.c

  2273	
  2274	/*
  2275	 * @arg: enum ttu_flags will be passed to this argument.
  2276	 *
  2277	 * If TTU_SPLIT_HUGE_PMD is specified any PMD mappings will be split into PTEs
  2278	 * containing migration entries.
  2279	 */
  2280	static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
  2281			     unsigned long address, void *arg)
  2282	{
  2283		struct mm_struct *mm = vma->vm_mm;
  2284		DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address,
  2285					PVMW_THP_DEVICE_PRIVATE);
  2286		bool anon_exclusive, writable, ret = true;
  2287		pte_t pteval;
  2288		struct page *subpage;
  2289		struct mmu_notifier_range range;
  2290		enum ttu_flags flags = (enum ttu_flags)(long)arg;
  2291		unsigned long pfn;
  2292		unsigned long hsz = 0;
  2293	
  2294		/*
  2295		 * When racing against e.g. zap_pte_range() on another cpu,
  2296		 * in between its ptep_get_and_clear_full() and folio_remove_rmap_*(),
  2297		 * try_to_migrate() may return before page_mapped() has become false,
  2298		 * if page table locking is skipped: use TTU_SYNC to wait for that.
  2299		 */
  2300		if (flags & TTU_SYNC)
  2301			pvmw.flags = PVMW_SYNC;
  2302	
  2303		/*
  2304		 * For THP, we have to assume the worse case ie pmd for invalidation.
  2305		 * For hugetlb, it could be much worse if we need to do pud
  2306		 * invalidation in the case of pmd sharing.
  2307		 *
  2308		 * Note that the page can not be free in this function as call of
  2309		 * try_to_unmap() must hold a reference on the page.
  2310		 */
  2311		range.end = vma_address_end(&pvmw);
  2312		mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
  2313					address, range.end);
  2314		if (folio_test_hugetlb(folio)) {
  2315			/*
  2316			 * If sharing is possible, start and end will be adjusted
  2317			 * accordingly.
  2318			 */
  2319			adjust_range_if_pmd_sharing_possible(vma, &range.start,
  2320							     &range.end);
  2321	
  2322			/* We need the huge page size for set_huge_pte_at() */
  2323			hsz = huge_page_size(hstate_vma(vma));
  2324		}
  2325		mmu_notifier_invalidate_range_start(&range);
  2326	
  2327		while (page_vma_mapped_walk(&pvmw)) {
  2328			/* PMD-mapped THP migration entry */
  2329			if (!pvmw.pte) {
> 2330				unsigned long pfn;
  2331	
  2332				if (flags & TTU_SPLIT_HUGE_PMD) {
  2333					split_huge_pmd_locked(vma, pvmw.address,
  2334							      pvmw.pmd, true);
  2335					ret = false;
  2336					page_vma_mapped_walk_done(&pvmw);
  2337					break;
  2338				}
  2339	#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
  2340				/*
  2341				 * Zone device private folios do not work well with
  2342				 * pmd_pfn() on some architectures due to pte
  2343				 * inversion.
  2344				 */
  2345				if (is_pmd_device_private_entry(*pvmw.pmd)) {
  2346					swp_entry_t entry = pmd_to_swp_entry(*pvmw.pmd);
  2347	
  2348					pfn = swp_offset_pfn(entry);
  2349				} else {
  2350					pfn = pmd_pfn(*pvmw.pmd);
  2351				}
  2352	
  2353				subpage = folio_page(folio, pfn - folio_pfn(folio));
  2354	
  2355				VM_BUG_ON_FOLIO(folio_test_hugetlb(folio) ||
  2356						!folio_test_pmd_mappable(folio), folio);
  2357	
  2358				if (set_pmd_migration_entry(&pvmw, subpage)) {
  2359					ret = false;
  2360					page_vma_mapped_walk_done(&pvmw);
  2361					break;
  2362				}
  2363				continue;
  2364	#endif
  2365			}
  2366	
  2367			/* Unexpected PMD-mapped THP? */
  2368			VM_BUG_ON_FOLIO(!pvmw.pte, folio);
  2369	
  2370			/*
  2371			 * Handle PFN swap PTEs, such as device-exclusive ones, that
  2372			 * actually map pages.
  2373			 */
  2374			pteval = ptep_get(pvmw.pte);
  2375			if (likely(pte_present(pteval))) {
  2376				pfn = pte_pfn(pteval);
  2377			} else {
  2378				pfn = swp_offset_pfn(pte_to_swp_entry(pteval));
  2379				VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
  2380			}
  2381	
  2382			subpage = folio_page(folio, pfn - folio_pfn(folio));
  2383			address = pvmw.address;
  2384			anon_exclusive = folio_test_anon(folio) &&
  2385					 PageAnonExclusive(subpage);
  2386	
  2387			if (folio_test_hugetlb(folio)) {
  2388				bool anon = folio_test_anon(folio);
  2389	
  2390				/*
  2391				 * huge_pmd_unshare may unmap an entire PMD page.
  2392				 * There is no way of knowing exactly which PMDs may
  2393				 * be cached for this mm, so we must flush them all.
  2394				 * start/end were already adjusted above to cover this
  2395				 * range.
  2396				 */
  2397				flush_cache_range(vma, range.start, range.end);
  2398	
  2399				/*
  2400				 * To call huge_pmd_unshare, i_mmap_rwsem must be
  2401				 * held in write mode.  Caller needs to explicitly
  2402				 * do this outside rmap routines.
  2403				 *
  2404				 * We also must hold hugetlb vma_lock in write mode.
  2405				 * Lock order dictates acquiring vma_lock BEFORE
  2406				 * i_mmap_rwsem.  We can only try lock here and
  2407				 * fail if unsuccessful.
  2408				 */
  2409				if (!anon) {
  2410					VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
  2411					if (!hugetlb_vma_trylock_write(vma)) {
  2412						page_vma_mapped_walk_done(&pvmw);
  2413						ret = false;
  2414						break;
  2415					}
  2416					if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
  2417						hugetlb_vma_unlock_write(vma);
  2418						flush_tlb_range(vma,
  2419							range.start, range.end);
  2420	
  2421						/*
  2422						 * The ref count of the PMD page was
  2423						 * dropped which is part of the way map
  2424						 * counting is done for shared PMDs.
  2425						 * Return 'true' here.  When there is
  2426						 * no other sharing, huge_pmd_unshare
  2427						 * returns false and we will unmap the
  2428						 * actual page and drop map count
  2429						 * to zero.
  2430						 */
  2431						page_vma_mapped_walk_done(&pvmw);
  2432						break;
  2433					}
  2434					hugetlb_vma_unlock_write(vma);
  2435				}
  2436				/* Nuke the hugetlb page table entry */
  2437				pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
  2438				if (pte_dirty(pteval))
  2439					folio_mark_dirty(folio);
  2440				writable = pte_write(pteval);
  2441			} else if (likely(pte_present(pteval))) {
  2442				flush_cache_page(vma, address, pfn);
  2443				/* Nuke the page table entry. */
  2444				if (should_defer_flush(mm, flags)) {
  2445					/*
  2446					 * We clear the PTE but do not flush so potentially
  2447					 * a remote CPU could still be writing to the folio.
  2448					 * If the entry was previously clean then the
  2449					 * architecture must guarantee that a clear->dirty
  2450					 * transition on a cached TLB entry is written through
  2451					 * and traps if the PTE is unmapped.
  2452					 */
  2453					pteval = ptep_get_and_clear(mm, address, pvmw.pte);
  2454	
  2455					set_tlb_ubc_flush_pending(mm, pteval, address, address + PAGE_SIZE);
  2456				} else {
  2457					pteval = ptep_clear_flush(vma, address, pvmw.pte);
  2458				}
  2459				if (pte_dirty(pteval))
  2460					folio_mark_dirty(folio);
  2461				writable = pte_write(pteval);
  2462			} else {
  2463				pte_clear(mm, address, pvmw.pte);
  2464				writable = is_writable_device_private_entry(pte_to_swp_entry(pteval));
  2465			}
  2466	
  2467			VM_WARN_ON_FOLIO(writable && folio_test_anon(folio) &&
  2468					!anon_exclusive, folio);
  2469	
  2470			/* Update high watermark before we lower rss */
  2471			update_hiwater_rss(mm);
  2472	
  2473			if (PageHWPoison(subpage)) {
  2474				VM_WARN_ON_FOLIO(folio_is_device_private(folio), folio);
  2475	
  2476				pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
  2477				if (folio_test_hugetlb(folio)) {
  2478					hugetlb_count_sub(folio_nr_pages(folio), mm);
  2479					set_huge_pte_at(mm, address, pvmw.pte, pteval,
  2480							hsz);
  2481				} else {
  2482					dec_mm_counter(mm, mm_counter(folio));
  2483					set_pte_at(mm, address, pvmw.pte, pteval);
  2484				}
  2485			} else if (likely(pte_present(pteval)) && pte_unused(pteval) &&
  2486				   !userfaultfd_armed(vma)) {
  2487				/*
  2488				 * The guest indicated that the page content is of no
  2489				 * interest anymore. Simply discard the pte, vmscan
  2490				 * will take care of the rest.
  2491				 * A future reference will then fault in a new zero
  2492				 * page. When userfaultfd is active, we must not drop
  2493				 * this page though, as its main user (postcopy
  2494				 * migration) will not expect userfaults on already
  2495				 * copied pages.
  2496				 */
  2497				dec_mm_counter(mm, mm_counter(folio));
  2498			} else {
  2499				swp_entry_t entry;
  2500				pte_t swp_pte;
  2501	
  2502				/*
  2503				 * arch_unmap_one() is expected to be a NOP on
  2504				 * architectures where we could have PFN swap PTEs,
  2505				 * so we'll not check/care.
  2506				 */
  2507				if (arch_unmap_one(mm, vma, address, pteval) < 0) {
  2508					if (folio_test_hugetlb(folio))
  2509						set_huge_pte_at(mm, address, pvmw.pte,
  2510								pteval, hsz);
  2511					else
  2512						set_pte_at(mm, address, pvmw.pte, pteval);
  2513					ret = false;
  2514					page_vma_mapped_walk_done(&pvmw);
  2515					break;
  2516				}
  2517	
  2518				/* See folio_try_share_anon_rmap_pte(): clear PTE first. */
  2519				if (folio_test_hugetlb(folio)) {
  2520					if (anon_exclusive &&
  2521					    hugetlb_try_share_anon_rmap(folio)) {
  2522						set_huge_pte_at(mm, address, pvmw.pte,
  2523								pteval, hsz);
  2524						ret = false;
  2525						page_vma_mapped_walk_done(&pvmw);
  2526						break;
  2527					}
  2528				} else if (anon_exclusive &&
  2529					   folio_try_share_anon_rmap_pte(folio, subpage)) {
  2530					set_pte_at(mm, address, pvmw.pte, pteval);
  2531					ret = false;
  2532					page_vma_mapped_walk_done(&pvmw);
  2533					break;
  2534				}
  2535	
  2536				/*
  2537				 * Store the pfn of the page in a special migration
  2538				 * pte. do_swap_page() will wait until the migration
  2539				 * pte is removed and then restart fault handling.
  2540				 */
  2541				if (writable)
  2542					entry = make_writable_migration_entry(
  2543								page_to_pfn(subpage));
  2544				else if (anon_exclusive)
  2545					entry = make_readable_exclusive_migration_entry(
  2546								page_to_pfn(subpage));
  2547				else
  2548					entry = make_readable_migration_entry(
  2549								page_to_pfn(subpage));
  2550				if (likely(pte_present(pteval))) {
  2551					if (pte_young(pteval))
  2552						entry = make_migration_entry_young(entry);
  2553					if (pte_dirty(pteval))
  2554						entry = make_migration_entry_dirty(entry);
  2555					swp_pte = swp_entry_to_pte(entry);
  2556					if (pte_soft_dirty(pteval))
  2557						swp_pte = pte_swp_mksoft_dirty(swp_pte);
  2558					if (pte_uffd_wp(pteval))
  2559						swp_pte = pte_swp_mkuffd_wp(swp_pte);
  2560				} else {
  2561					swp_pte = swp_entry_to_pte(entry);
  2562					if (pte_swp_soft_dirty(pteval))
  2563						swp_pte = pte_swp_mksoft_dirty(swp_pte);
  2564					if (pte_swp_uffd_wp(pteval))
  2565						swp_pte = pte_swp_mkuffd_wp(swp_pte);
  2566				}
  2567				if (folio_test_hugetlb(folio))
  2568					set_huge_pte_at(mm, address, pvmw.pte, swp_pte,
  2569							hsz);
  2570				else
  2571					set_pte_at(mm, address, pvmw.pte, swp_pte);
  2572				trace_set_migration_pte(address, pte_val(swp_pte),
  2573							folio_order(folio));
  2574				/*
  2575				 * No need to invalidate here it will synchronize on
  2576				 * against the special swap migration pte.
  2577				 */
  2578			}
  2579	
  2580			if (unlikely(folio_test_hugetlb(folio)))
  2581				hugetlb_remove_rmap(folio);
  2582			else
  2583				folio_remove_rmap_pte(folio, subpage, vma);
  2584			if (vma->vm_flags & VM_LOCKED)
  2585				mlock_drain_local();
  2586			folio_put(folio);
  2587		}
  2588	
  2589		mmu_notifier_invalidate_range_end(&range);
  2590	
  2591		return ret;
  2592	}
  2593	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 00/11] THP support for zone device page migration
  2025-07-30 11:30 ` [v2 00/11] THP support for zone device page migration David Hildenbrand
@ 2025-07-30 23:18   ` Alistair Popple
  2025-07-31  8:41   ` Balbir Singh
  1 sibling, 0 replies; 71+ messages in thread
From: Alistair Popple @ 2025-07-30 23:18 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Balbir Singh, linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
	Ryan Roberts, Matthew Wilcox, Peter Xu, Zi Yan, Kefeng Wang,
	Jane Chu, Donet Tom, Ralph Campbell, Mika Penttilä,
	Matthew Brost, Francois Dugast

On Wed, Jul 30, 2025 at 01:30:13PM +0200, David Hildenbrand wrote:
> On 30.07.25 11:21, Balbir Singh wrote:
> 
> BTW, I keep getting confused by the topic.
> 
> Isn't this essentially
> 
> "mm: support device-private THP"
> 
> and the support for migration is just a necessary requirement to *enable*
> device private?

Yes, that's a good point. Migration is one component but there is also fault
handling, etc. so I think calling this "support device-private THP" makes sense.

 - Alistair

> -- 
> Cheers,
> 
> David / dhildenb
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
  2025-07-30 16:29                       ` Mika Penttilä
@ 2025-07-31  7:15                         ` David Hildenbrand
  2025-07-31  8:39                           ` Balbir Singh
  2025-07-31 11:26                           ` Zi Yan
  0 siblings, 2 replies; 71+ messages in thread
From: David Hildenbrand @ 2025-07-31  7:15 UTC (permalink / raw)
  To: Mika Penttilä, Zi Yan
  Cc: Balbir Singh, linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
	Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
	Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
	Ralph Campbell

On 30.07.25 18:29, Mika Penttilä wrote:
> 
> On 7/30/25 18:58, Zi Yan wrote:
>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>
>>> On 7/30/25 18:10, Zi Yan wrote:
>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>
>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>
>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>
>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>
>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>>
>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>>> entries.
>>>>>>>>>>>>
>>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>>> folio.
>>>>>>>>>>>>
>>>>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>>>>> ---
>>>>>>>>>>>>   include/linux/huge_mm.h |   1 +
>>>>>>>>>>>>   include/linux/rmap.h    |   2 +
>>>>>>>>>>>>   include/linux/swapops.h |  17 +++
>>>>>>>>>>>>   mm/huge_memory.c        | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>   mm/page_vma_mapped.c    |  13 +-
>>>>>>>>>>>>   mm/pgtable-generic.c    |   6 +
>>>>>>>>>>>>   mm/rmap.c               |  22 +++-
>>>>>>>>>>>>   7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>> <snip>
>>>>>>>>>>
>>>>>>>>>>>> +/**
>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>> + */
>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>> +{
>>>>>>>>>>>> +	struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>> +	struct folio *new_folio;
>>>>>>>>>>>> +	int ret = 0;
>>>>>>>>>>>> +
>>>>>>>>>>>> +	/*
>>>>>>>>>>>> +	 * Split the folio now. In the case of device
>>>>>>>>>>>> +	 * private pages, this path is executed when
>>>>>>>>>>>> +	 * the pmd is split and since freeze is not true
>>>>>>>>>>>> +	 * it is likely the folio will be deferred_split.
>>>>>>>>>>>> +	 *
>>>>>>>>>>>> +	 * With device private pages, deferred splits of
>>>>>>>>>>>> +	 * folios should be handled here to prevent partial
>>>>>>>>>>>> +	 * unmaps from causing issues later on in migration
>>>>>>>>>>>> +	 * and fault handling flows.
>>>>>>>>>>>> +	 */
>>>>>>>>>>>> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>> device side mapping.
>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>> 5) remap device private mapping.
>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>> folio by replacing existing page table entries with migration entries
>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>
>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>> at CPU side.
>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>> Right. That confused me when I was talking to Balbir and looking at v1.
>> When a device private folio is processed in __folio_split(), Balbir needed to
>> add code to skip CPU mapping handling code. Basically device private folios are
>> CPU unmapped and device mapped.
>>
>> Here are my questions on device private folios:
>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>     perspective? Can it be stored in a device private specific data structure?
> 
> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
> common code more messy if not done that way but sure possible.
> And not consuming pfns (address space) at all would have benefits.
> 
>> 2. When a device private folio is mapped on device, can someone other than
>>     the device driver manipulate it assuming core-mm just skips device private
>>     folios (barring the CPU access fault handling)?
>>
>> Where I am going is that can device private folios be treated as unmapped folios
>> by CPU and only device driver manipulates their mappings?
>>
> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
> someone could change while in device, it's just pfn.

Just to clarify: a device-private entry, like a device-exclusive entry,
is a *page table mapping* tracked through the rmap -- even though these
are not present page table entries.

It would be better if they were present page table entries that are
PROT_NONE, but it's tricky to mark them as being "special"
device-private, device-exclusive, etc. Maybe there are ways to do that in
the future.

Maybe device-private could just be PROT_NONE, because we can identify
the entry type based on the folio. Device-exclusive is harder ...


So consider device-private entries just like PROT_NONE present page
table entries. Refcount and mapcount are adjusted accordingly by the rmap
functions.
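
In other words, a page table walker ends up seeing something like this (just an illustrative sketch):

	/* sketch: a non-present PTE that is nevertheless a tracked mapping */
	pte_t pteval = ptep_get(ptep);

	if (!pte_present(pteval)) {
		swp_entry_t entry = pte_to_swp_entry(pteval);

		if (is_device_private_entry(entry)) {
			struct folio *folio = pfn_swap_entry_folio(entry);

			/*
			 * Not CPU-accessible, but installed via the usual rmap
			 * helpers, so it is reflected in the folio's mapcount
			 * and refcount just like a present mapping would be.
			 */
			VM_WARN_ON_ONCE(!folio_is_device_private(folio));
		}
	}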

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
  2025-07-31  7:15                         ` David Hildenbrand
@ 2025-07-31  8:39                           ` Balbir Singh
  2025-07-31 11:26                           ` Zi Yan
  1 sibling, 0 replies; 71+ messages in thread
From: Balbir Singh @ 2025-07-31  8:39 UTC (permalink / raw)
  To: David Hildenbrand, Mika Penttilä, Zi Yan
  Cc: linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
	Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
	Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
	Ralph Campbell

On 7/31/25 17:15, David Hildenbrand wrote:
> On 30.07.25 18:29, Mika Penttilä wrote:
>>
>> On 7/30/25 18:58, Zi Yan wrote:
>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>
>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>
>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>
>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>
>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>>>
>>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>>>> entries.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>>>> folio.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>>>>>> ---
>>>>>>>>>>>>>   include/linux/huge_mm.h |   1 +
>>>>>>>>>>>>>   include/linux/rmap.h    |   2 +
>>>>>>>>>>>>>   include/linux/swapops.h |  17 +++
>>>>>>>>>>>>>   mm/huge_memory.c        | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>   mm/page_vma_mapped.c    |  13 +-
>>>>>>>>>>>>>   mm/pgtable-generic.c    |   6 +
>>>>>>>>>>>>>   mm/rmap.c               |  22 +++-
>>>>>>>>>>>>>   7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>>>>
>>>>>>>>>>> <snip>
>>>>>>>>>>>
>>>>>>>>>>>>> +/**
>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>> + *
>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>> + *
>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>> + */
>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>> +{
>>>>>>>>>>>>> +    struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>> +    struct folio *new_folio;
>>>>>>>>>>>>> +    int ret = 0;
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +    /*
>>>>>>>>>>>>> +     * Split the folio now. In the case of device
>>>>>>>>>>>>> +     * private pages, this path is executed when
>>>>>>>>>>>>> +     * the pmd is split and since freeze is not true
>>>>>>>>>>>>> +     * it is likely the folio will be deferred_split.
>>>>>>>>>>>>> +     *
>>>>>>>>>>>>> +     * With device private pages, deferred splits of
>>>>>>>>>>>>> +     * folios should be handled here to prevent partial
>>>>>>>>>>>>> +     * unmaps from causing issues later on in migration
>>>>>>>>>>>>> +     * and fault handling flows.
>>>>>>>>>>>>> +     */
>>>>>>>>>>>>> +    folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>> device side mapping.
>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>
>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>> at CPU side.
>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>> add code to skip CPU mapping handling code. Basically device private folios are
>>> CPU unmapped and device mapped.
>>>
>>> Here are my questions on device private folios:
>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>     perspective? Can it be stored in a device private specific data structure?
>>
>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>> common code more messy if not done that way but sure possible.
>> And not consuming pfns (address space) at all would have benefits.
>>
>>> 2. When a device private folio is mapped on device, can someone other than
>>>     the device driver manipulate it assuming core-mm just skips device private
>>>     folios (barring the CPU access fault handling)?
>>>
>>> Where I am going is that can device private folios be treated as unmapped folios
>>> by CPU and only device driver manipulates their mappings?
>>>
>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>> someone could change while in device, it's just pfn.
> 
> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
> 
> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
> 
> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
> 
> 
> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
> 

Thanks for clarifying on my behalf; I am just catching up with the discussion.

When I was referring to "mapped" in the discussion with Zi, I was talking about how touching the entry will cause a fault and migration back, so in that sense they can be considered unmapped on the CPU side, because they are mapped on the device side. Device private entries are mapped into the page tables, and the page/folio has a refcount associated with those device entries.

Balbir Singh


^ permalink raw reply	[flat|nested] 71+ messages in thread
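
An aside for readers following the "mapped vs. unmapped" point above: the CPU
fault path already treats a device private entry as a non-present swap entry
that still contributes to the folio's refcount and mapcount. A minimal sketch
of that path, loosely based on do_swap_page(), with locking and reference
handling elided:

	entry = pte_to_swp_entry(vmf->orig_pte);
	if (is_device_private_entry(entry)) {
		/* the non-present entry still refers to the device private folio */
		vmf->page = pfn_swap_entry_to_page(entry);
		/* hand the fault to the driver, which migrates the data back to RAM */
		ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
	}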

* Re: [v2 00/11] THP support for zone device page migration
  2025-07-30 11:30 ` [v2 00/11] THP support for zone device page migration David Hildenbrand
  2025-07-30 23:18   ` Alistair Popple
@ 2025-07-31  8:41   ` Balbir Singh
  2025-07-31  8:56     ` David Hildenbrand
  1 sibling, 1 reply; 71+ messages in thread
From: Balbir Singh @ 2025-07-31  8:41 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm
  Cc: linux-kernel, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
	Ralph Campbell, Mika Penttilä, Matthew Brost,
	Francois Dugast

On 7/30/25 21:30, David Hildenbrand wrote:
> On 30.07.25 11:21, Balbir Singh wrote:
> 
> BTW, I keep getting confused by the topic.
> 
> Isn't this essentially
> 
> "mm: support device-private THP"
> 
> and the support for migration is just a necessary requirement to *enable* device private?
> 

I agree, I can change the title, but the focus of the use case is to
support THP migration for improved latency and throughput. All of that
involves supporting device-private THP.

Balbir Singh

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 00/11] THP support for zone device page migration
  2025-07-31  8:41   ` Balbir Singh
@ 2025-07-31  8:56     ` David Hildenbrand
  0 siblings, 0 replies; 71+ messages in thread
From: David Hildenbrand @ 2025-07-31  8:56 UTC (permalink / raw)
  To: Balbir Singh, linux-mm
  Cc: linux-kernel, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
	Ralph Campbell, Mika Penttilä, Matthew Brost,
	Francois Dugast

On 31.07.25 10:41, Balbir Singh wrote:
> On 7/30/25 21:30, David Hildenbrand wrote:
>> On 30.07.25 11:21, Balbir Singh wrote:
>>
>> BTW, I keep getting confused by the topic.
>>
>> Isn't this essentially
>>
>> "mm: support device-private THP"
>>
>> and the support for migration is just a necessary requirement to *enable* device private?
>>
> 
> I agree, I can change the title, but the focus of the use case is to
> support THP migration for improved latency and throughput. All of that
> involves support of device-private THP

Well, the subject as is makes one believe that THP support for 
zone-device pages would already be there, and that you are adding 
migration support.

That was the confusing part to me, because in the very first patch you 
add ... THP support for (selected/private) zone device pages.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 07/11] mm/thp: add split during migration support
  2025-07-30  9:21 ` [v2 07/11] mm/thp: add split during migration support Balbir Singh
@ 2025-07-31 10:04   ` kernel test robot
  0 siblings, 0 replies; 71+ messages in thread
From: kernel test robot @ 2025-07-31 10:04 UTC (permalink / raw)
  To: Balbir Singh, linux-mm
  Cc: llvm, oe-kbuild-all, linux-kernel, Balbir Singh, Karol Herbst,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, David Hildenbrand,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
	Ralph Campbell, Mika Penttilä, Matthew Brost,
	Francois Dugast

Hi Balbir,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on next-20250731]
[cannot apply to akpm-mm/mm-nonmm-unstable shuah-kselftest/next shuah-kselftest/fixes linus/master v6.16]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Balbir-Singh/mm-zone_device-support-large-zone-device-private-folios/20250730-172600
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20250730092139.3890844-8-balbirs%40nvidia.com
patch subject: [v2 07/11] mm/thp: add split during migration support
config: x86_64-randconfig-071-20250731 (https://download.01.org/0day-ci/archive/20250731/202507311724.mavZerV1-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250731/202507311724.mavZerV1-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202507311724.mavZerV1-lkp@intel.com/

All errors (new ones prefixed by >>):

>> mm/migrate_device.c:1082:5: error: statement requires expression of scalar type ('void' invalid)
    1082 |                                 if (migrate_vma_split_pages(migrate, i, addr,
         |                                 ^   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    1083 |                                                                 folio)) {
         |                                                                 ~~~~~~
   1 error generated.


vim +1082 mm/migrate_device.c

   999	
  1000	static void __migrate_device_pages(unsigned long *src_pfns,
  1001					unsigned long *dst_pfns, unsigned long npages,
  1002					struct migrate_vma *migrate)
  1003	{
  1004		struct mmu_notifier_range range;
  1005		unsigned long i, j;
  1006		bool notified = false;
  1007		unsigned long addr;
  1008	
  1009		for (i = 0; i < npages; ) {
  1010			struct page *newpage = migrate_pfn_to_page(dst_pfns[i]);
  1011			struct page *page = migrate_pfn_to_page(src_pfns[i]);
  1012			struct address_space *mapping;
  1013			struct folio *newfolio, *folio;
  1014			int r, extra_cnt = 0;
  1015			unsigned long nr = 1;
  1016	
  1017			if (!newpage) {
  1018				src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
  1019				goto next;
  1020			}
  1021	
  1022			if (!page) {
  1023				unsigned long addr;
  1024	
  1025				if (!(src_pfns[i] & MIGRATE_PFN_MIGRATE))
  1026					goto next;
  1027	
  1028				/*
  1029				 * The only time there is no vma is when called from
  1030				 * migrate_device_coherent_folio(). However this isn't
  1031				 * called if the page could not be unmapped.
  1032				 */
  1033				VM_BUG_ON(!migrate);
  1034				addr = migrate->start + i*PAGE_SIZE;
  1035				if (!notified) {
  1036					notified = true;
  1037	
  1038					mmu_notifier_range_init_owner(&range,
  1039						MMU_NOTIFY_MIGRATE, 0,
  1040						migrate->vma->vm_mm, addr, migrate->end,
  1041						migrate->pgmap_owner);
  1042					mmu_notifier_invalidate_range_start(&range);
  1043				}
  1044	
  1045				if ((src_pfns[i] & MIGRATE_PFN_COMPOUND) &&
  1046					(!(dst_pfns[i] & MIGRATE_PFN_COMPOUND))) {
  1047					nr = HPAGE_PMD_NR;
  1048					src_pfns[i] &= ~MIGRATE_PFN_COMPOUND;
  1049				} else {
  1050					nr = 1;
  1051				}
  1052	
  1053				for (j = 0; j < nr && i + j < npages; j++) {
  1054					src_pfns[i+j] |= MIGRATE_PFN_MIGRATE;
  1055					migrate_vma_insert_page(migrate,
  1056						addr + j * PAGE_SIZE,
  1057						&dst_pfns[i+j], &src_pfns[i+j]);
  1058				}
  1059				goto next;
  1060			}
  1061	
  1062			newfolio = page_folio(newpage);
  1063			folio = page_folio(page);
  1064			mapping = folio_mapping(folio);
  1065	
  1066			/*
  1067			 * If THP migration is enabled, check if both src and dst
  1068			 * can migrate large pages
  1069			 */
  1070			if (thp_migration_supported()) {
  1071				if ((src_pfns[i] & MIGRATE_PFN_MIGRATE) &&
  1072					(src_pfns[i] & MIGRATE_PFN_COMPOUND) &&
  1073					!(dst_pfns[i] & MIGRATE_PFN_COMPOUND)) {
  1074	
  1075					if (!migrate) {
  1076						src_pfns[i] &= ~(MIGRATE_PFN_MIGRATE |
  1077								 MIGRATE_PFN_COMPOUND);
  1078						goto next;
  1079					}
  1080					nr = 1 << folio_order(folio);
  1081					addr = migrate->start + i * PAGE_SIZE;
> 1082					if (migrate_vma_split_pages(migrate, i, addr,
  1083									folio)) {
  1084						src_pfns[i] &= ~(MIGRATE_PFN_MIGRATE |
  1085								 MIGRATE_PFN_COMPOUND);
  1086						goto next;
  1087					}
  1088				} else if ((src_pfns[i] & MIGRATE_PFN_MIGRATE) &&
  1089					(dst_pfns[i] & MIGRATE_PFN_COMPOUND) &&
  1090					!(src_pfns[i] & MIGRATE_PFN_COMPOUND)) {
  1091					src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
  1092				}
  1093			}
  1094	
  1095	
  1096			if (folio_is_device_private(newfolio) ||
  1097			    folio_is_device_coherent(newfolio)) {
  1098				if (mapping) {
  1099					/*
  1100					 * For now only support anonymous memory migrating to
  1101					 * device private or coherent memory.
  1102					 *
  1103					 * Try to get rid of swap cache if possible.
  1104					 */
  1105					if (!folio_test_anon(folio) ||
  1106					    !folio_free_swap(folio)) {
  1107						src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
  1108						goto next;
  1109					}
  1110				}
  1111			} else if (folio_is_zone_device(newfolio)) {
  1112				/*
  1113				 * Other types of ZONE_DEVICE page are not supported.
  1114				 */
  1115				src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
  1116				goto next;
  1117			}
  1118	
  1119			BUG_ON(folio_test_writeback(folio));
  1120	
  1121			if (migrate && migrate->fault_page == page)
  1122				extra_cnt++;
  1123			for (j = 0; j < nr && i + j < npages; j++) {
  1124				folio = page_folio(migrate_pfn_to_page(src_pfns[i+j]));
  1125				newfolio = page_folio(migrate_pfn_to_page(dst_pfns[i+j]));
  1126	
  1127				r = folio_migrate_mapping(mapping, newfolio, folio, extra_cnt);
  1128				if (r != MIGRATEPAGE_SUCCESS)
  1129					src_pfns[i+j] &= ~MIGRATE_PFN_MIGRATE;
  1130				else
  1131					folio_migrate_flags(newfolio, folio);
  1132			}
  1133	next:
  1134			i += nr;
  1135		}
  1136	
  1137		if (notified)
  1138			mmu_notifier_invalidate_range_end(&range);
  1139	}
  1140	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 71+ messages in thread
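
A note on the error above: in this configuration migrate_vma_split_pages() is
declared with a void return type, while the call at line 1082 uses its result
in an if-condition. Assuming the patch provides a separate variant or stub of
the function for configurations without THP migration (an assumption, the
actual patch layout may differ), the minimal fix is to give every variant an
int return type, for example:

	/*
	 * Sketch of a stub for configurations that cannot split; the
	 * signature is inferred from the call site above.
	 */
	static int migrate_vma_split_pages(struct migrate_vma *migrate,
					   unsigned long idx, unsigned long addr,
					   struct folio *folio)
	{
		/* caller clears MIGRATE_PFN_MIGRATE and skips the entry */
		return -EINVAL;
	}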

* Re: [v2 05/11] lib/test_hmm: test cases and support for zone device private THP
  2025-07-30  9:21 ` [v2 05/11] lib/test_hmm: test cases and support for zone device private THP Balbir Singh
@ 2025-07-31 11:17   ` kernel test robot
  0 siblings, 0 replies; 71+ messages in thread
From: kernel test robot @ 2025-07-31 11:17 UTC (permalink / raw)
  To: Balbir Singh, linux-mm
  Cc: oe-kbuild-all, linux-kernel, Balbir Singh, Karol Herbst,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, David Hildenbrand,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
	Ralph Campbell, Mika Penttilä, Matthew Brost,
	Francois Dugast

Hi Balbir,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on next-20250731]
[cannot apply to akpm-mm/mm-nonmm-unstable shuah-kselftest/next shuah-kselftest/fixes linus/master v6.16]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Balbir-Singh/mm-zone_device-support-large-zone-device-private-folios/20250730-172600
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20250730092139.3890844-6-balbirs%40nvidia.com
patch subject: [v2 05/11] lib/test_hmm: test cases and support for zone device private THP
config: loongarch-allyesconfig (https://download.01.org/0day-ci/archive/20250731/202507311818.V6HUiudq-lkp@intel.com/config)
compiler: clang version 22.0.0git (https://github.com/llvm/llvm-project 8f09b03aebb71c154f3bbe725c29e3f47d37c26e)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250731/202507311818.V6HUiudq-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202507311818.V6HUiudq-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> lib/test_hmm.c:1111:6: warning: variable 'dst_pfns' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
    1111 |         if (!src_pfns)
         |             ^~~~~~~~~
   lib/test_hmm.c:1176:8: note: uninitialized use occurs here
    1176 |         kfree(dst_pfns);
         |               ^~~~~~~~
   lib/test_hmm.c:1111:2: note: remove the 'if' if its condition is always false
    1111 |         if (!src_pfns)
         |         ^~~~~~~~~~~~~~
    1112 |                 goto free_mem;
         |                 ~~~~~~~~~~~~~
   lib/test_hmm.c:1097:25: note: initialize the variable 'dst_pfns' to silence this warning
    1097 |         unsigned long *dst_pfns;
         |                                ^
         |                                 = NULL
   1 warning generated.


vim +1111 lib/test_hmm.c

  1084	
  1085	static int dmirror_migrate_to_device(struct dmirror *dmirror,
  1086					struct hmm_dmirror_cmd *cmd)
  1087	{
  1088		unsigned long start, end, addr;
  1089		unsigned long size = cmd->npages << PAGE_SHIFT;
  1090		struct mm_struct *mm = dmirror->notifier.mm;
  1091		struct vm_area_struct *vma;
  1092		struct dmirror_bounce bounce;
  1093		struct migrate_vma args = { 0 };
  1094		unsigned long next;
  1095		int ret;
  1096		unsigned long *src_pfns;
  1097		unsigned long *dst_pfns;
  1098	
  1099		start = cmd->addr;
  1100		end = start + size;
  1101		if (end < start)
  1102			return -EINVAL;
  1103	
  1104		/* Since the mm is for the mirrored process, get a reference first. */
  1105		if (!mmget_not_zero(mm))
  1106			return -EINVAL;
  1107	
  1108		ret = -ENOMEM;
  1109		src_pfns = kvcalloc(PTRS_PER_PTE, sizeof(*src_pfns),
  1110				  GFP_KERNEL | __GFP_NOFAIL);
> 1111		if (!src_pfns)
  1112			goto free_mem;
  1113	
  1114		dst_pfns = kvcalloc(PTRS_PER_PTE, sizeof(*dst_pfns),
  1115				  GFP_KERNEL | __GFP_NOFAIL);
  1116		if (!dst_pfns)
  1117			goto free_mem;
  1118	
  1119		ret = 0;
  1120		mmap_read_lock(mm);
  1121		for (addr = start; addr < end; addr = next) {
  1122			vma = vma_lookup(mm, addr);
  1123			if (!vma || !(vma->vm_flags & VM_READ)) {
  1124				ret = -EINVAL;
  1125				goto out;
  1126			}
  1127			next = min(end, addr + (PTRS_PER_PTE << PAGE_SHIFT));
  1128			if (next > vma->vm_end)
  1129				next = vma->vm_end;
  1130	
  1131			args.vma = vma;
  1132			args.src = src_pfns;
  1133			args.dst = dst_pfns;
  1134			args.start = addr;
  1135			args.end = next;
  1136			args.pgmap_owner = dmirror->mdevice;
  1137			args.flags = MIGRATE_VMA_SELECT_SYSTEM |
  1138					MIGRATE_VMA_SELECT_COMPOUND;
  1139			ret = migrate_vma_setup(&args);
  1140			if (ret)
  1141				goto out;
  1142	
  1143			pr_debug("Migrating from sys mem to device mem\n");
  1144			dmirror_migrate_alloc_and_copy(&args, dmirror);
  1145			migrate_vma_pages(&args);
  1146			dmirror_migrate_finalize_and_map(&args, dmirror);
  1147			migrate_vma_finalize(&args);
  1148		}
  1149		mmap_read_unlock(mm);
  1150		mmput(mm);
  1151	
  1152		/*
  1153		 * Return the migrated data for verification.
  1154		 * Only for pages in device zone
  1155		 */
  1156		ret = dmirror_bounce_init(&bounce, start, size);
  1157		if (ret)
  1158			goto free_mem;
  1159		mutex_lock(&dmirror->mutex);
  1160		ret = dmirror_do_read(dmirror, start, end, &bounce);
  1161		mutex_unlock(&dmirror->mutex);
  1162		if (ret == 0) {
  1163			if (copy_to_user(u64_to_user_ptr(cmd->ptr), bounce.ptr,
  1164					 bounce.size))
  1165				ret = -EFAULT;
  1166		}
  1167		cmd->cpages = bounce.cpages;
  1168		dmirror_bounce_fini(&bounce);
  1169		goto free_mem;
  1170	
  1171	out:
  1172		mmap_read_unlock(mm);
  1173		mmput(mm);
  1174	free_mem:
  1175		kfree(src_pfns);
  1176		kfree(dst_pfns);
  1177		return ret;
  1178	}
  1179	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 71+ messages in thread
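
A note on the warning above: if the first kvcalloc() failed, the code would
jump to free_mem and pass the never-initialized dst_pfns to kfree(). The
minimal fix is the one clang suggests, initializing both pointers so the
early-exit path frees NULL (a no-op):

	unsigned long *src_pfns = NULL;
	unsigned long *dst_pfns = NULL;

It is probably also worth pairing the kvcalloc() allocations with kvfree()
rather than kfree(), since kvcalloc() may fall back to vmalloc for larger
allocations.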

* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
  2025-07-31  7:15                         ` David Hildenbrand
  2025-07-31  8:39                           ` Balbir Singh
@ 2025-07-31 11:26                           ` Zi Yan
  2025-07-31 12:32                             ` David Hildenbrand
  2025-08-01  0:49                             ` Balbir Singh
  1 sibling, 2 replies; 71+ messages in thread
From: Zi Yan @ 2025-07-31 11:26 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Mika Penttilä, Balbir Singh, linux-mm, linux-kernel,
	Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Jérôme Glisse, Shuah Khan, Barry Song,
	Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang,
	Jane Chu, Alistair Popple, Donet Tom, Matthew Brost,
	Francois Dugast, Ralph Campbell

On 31 Jul 2025, at 3:15, David Hildenbrand wrote:

> On 30.07.25 18:29, Mika Penttilä wrote:
>>
>> On 7/30/25 18:58, Zi Yan wrote:
>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>
>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>
>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>
>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>
>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>>>
>>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>>>> entries.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>>>> folio.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>>>>>> ---
>>>>>>>>>>>>>   include/linux/huge_mm.h |   1 +
>>>>>>>>>>>>>   include/linux/rmap.h    |   2 +
>>>>>>>>>>>>>   include/linux/swapops.h |  17 +++
>>>>>>>>>>>>>   mm/huge_memory.c        | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>   mm/page_vma_mapped.c    |  13 +-
>>>>>>>>>>>>>   mm/pgtable-generic.c    |   6 +
>>>>>>>>>>>>>   mm/rmap.c               |  22 +++-
>>>>>>>>>>>>>   7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>>>>
>>>>>>>>>>> <snip>
>>>>>>>>>>>
>>>>>>>>>>>>> +/**
>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>> + *
>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>> + *
>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>> + */
>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>> +{
>>>>>>>>>>>>> +	struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>> +	struct folio *new_folio;
>>>>>>>>>>>>> +	int ret = 0;
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +	/*
>>>>>>>>>>>>> +	 * Split the folio now. In the case of device
>>>>>>>>>>>>> +	 * private pages, this path is executed when
>>>>>>>>>>>>> +	 * the pmd is split and since freeze is not true
>>>>>>>>>>>>> +	 * it is likely the folio will be deferred_split.
>>>>>>>>>>>>> +	 *
>>>>>>>>>>>>> +	 * With device private pages, deferred splits of
>>>>>>>>>>>>> +	 * folios should be handled here to prevent partial
>>>>>>>>>>>>> +	 * unmaps from causing issues later on in migration
>>>>>>>>>>>>> +	 * and fault handling flows.
>>>>>>>>>>>>> +	 */
>>>>>>>>>>>>> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>> device side mapping.
>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>
>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>> at CPU side.
>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>> add code to skip CPU mapping handling code. Basically device private folios are
>>> CPU unmapped and device mapped.
>>>
>>> Here are my questions on device private folios:
>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>     perspective? Can it be stored in a device private specific data structure?
>>
>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>> common code more messy if not done that way but sure possible.
>> And not consuming pfns (address space) at all would have benefits.
>>
>>> 2. When a device private folio is mapped on device, can someone other than
>>>     the device driver manipulate it assuming core-mm just skips device private
>>>     folios (barring the CPU access fault handling)?
>>>
>>> Where I am going is that can device private folios be treated as unmapped folios
>>> by CPU and only device driver manipulates their mappings?
>>>
>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>> someone could change while in device, it's just pfn.
>
> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>
> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>
> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>
>
> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.

Thanks for the clarification.

So folio_mapcount() for device private folios should be treated the same
as normal folios, even if the corresponding PTEs are not accessible from CPUs.
Then I wonder if the device private large folio split should go through
__folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
remap. Otherwise, how can we prevent rmap changes during the split?


Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
  2025-07-31 11:26                           ` Zi Yan
@ 2025-07-31 12:32                             ` David Hildenbrand
  2025-07-31 13:34                               ` Zi Yan
  2025-08-01  0:49                             ` Balbir Singh
  1 sibling, 1 reply; 71+ messages in thread
From: David Hildenbrand @ 2025-07-31 12:32 UTC (permalink / raw)
  To: Zi Yan
  Cc: Mika Penttilä, Balbir Singh, linux-mm, linux-kernel,
	Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Jérôme Glisse, Shuah Khan, Barry Song,
	Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang,
	Jane Chu, Alistair Popple, Donet Tom, Matthew Brost,
	Francois Dugast, Ralph Campbell

On 31.07.25 13:26, Zi Yan wrote:
> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
> 
>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>
>>> On 7/30/25 18:58, Zi Yan wrote:
>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>
>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>
>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>
>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>>>>> entries.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>>>>> folio.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>    include/linux/huge_mm.h |   1 +
>>>>>>>>>>>>>>    include/linux/rmap.h    |   2 +
>>>>>>>>>>>>>>    include/linux/swapops.h |  17 +++
>>>>>>>>>>>>>>    mm/huge_memory.c        | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>>    mm/page_vma_mapped.c    |  13 +-
>>>>>>>>>>>>>>    mm/pgtable-generic.c    |   6 +
>>>>>>>>>>>>>>    mm/rmap.c               |  22 +++-
>>>>>>>>>>>>>>    7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>>>>>
>>>>>>>>>>>> <snip>
>>>>>>>>>>>>
>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>> +	struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>> +	struct folio *new_folio;
>>>>>>>>>>>>>> +	int ret = 0;
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +	/*
>>>>>>>>>>>>>> +	 * Split the folio now. In the case of device
>>>>>>>>>>>>>> +	 * private pages, this path is executed when
>>>>>>>>>>>>>> +	 * the pmd is split and since freeze is not true
>>>>>>>>>>>>>> +	 * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>> +	 *
>>>>>>>>>>>>>> +	 * With device private pages, deferred splits of
>>>>>>>>>>>>>> +	 * folios should be handled here to prevent partial
>>>>>>>>>>>>>> +	 * unmaps from causing issues later on in migration
>>>>>>>>>>>>>> +	 * and fault handling flows.
>>>>>>>>>>>>>> +	 */
>>>>>>>>>>>>>> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>>> device side mapping.
>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>>
>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>>> at CPU side.
>>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>>> add code to skip CPU mapping handling code. Basically device private folios are
>>>> CPU unmapped and device mapped.
>>>>
>>>> Here are my questions on device private folios:
>>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>>      perspective? Can it be stored in a device private specific data structure?
>>>
>>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>>> common code more messy if not done that way but sure possible.
>>> And not consuming pfns (address space) at all would have benefits.
>>>
>>>> 2. When a device private folio is mapped on device, can someone other than
>>>>      the device driver manipulate it assuming core-mm just skips device private
>>>>      folios (barring the CPU access fault handling)?
>>>>
>>>> Where I am going is that can device private folios be treated as unmapped folios
>>>> by CPU and only device driver manipulates their mappings?
>>>>
>>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>>> someone could change while in device, it's just pfn.
>>
>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>
>> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>
>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>
>>
>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
> 
> Thanks for the clarification.
> 
> So folio_mapcount() for device private folios should be treated the same
> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
> Then I wonder if the device private large folio split should go through
> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
> remap. Otherwise, how can we prevent rmap changes during the split?

That is what I would expect: replace device-private entries with migration 
entries, perform the migration/split/whatever, then restore the migration 
entries to device-private entries.

That will drive the mapcount to 0.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 71+ messages in thread
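
To make the sequence above concrete, here is a rough sketch of a device
private folio split that mirrors __folio_split(): unmap, freeze, split,
unfreeze, remap. It is illustrative only and not the patch's implementation;
the split step itself, locking, TTU flags and per-subfolio handling are
elided, and the function name is made up for the example:

	static int device_private_split_sketch(struct folio *folio)
	{
		int ret = 0;

		/* 1) unmap: replace device-private entries with migration entries */
		try_to_migrate(folio, 0);

		/* 2) freeze the refcount so rmap cannot change during the split */
		if (!folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio))) {
			ret = -EAGAIN;
			goto remap;
		}

		/* 3) split the now-unmapped folio, e.g. via __split_unmapped_folio() */

		/* 4) unfreeze once the split is done */
		folio_ref_unfreeze(folio, 1 + folio_expected_ref_count(folio));

	remap:
		/* 5) remap: migration entries are restored as device-private entries */
		remove_migration_ptes(folio, folio, 0);
		return ret;
	}

Unmapping and freezing before the split, as above, is what prevents the rmap
changes Zi Yan is asking about.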

* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
  2025-07-31 12:32                             ` David Hildenbrand
@ 2025-07-31 13:34                               ` Zi Yan
  2025-07-31 19:09                                 ` David Hildenbrand
  0 siblings, 1 reply; 71+ messages in thread
From: Zi Yan @ 2025-07-31 13:34 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Mika Penttilä, Balbir Singh, linux-mm, linux-kernel,
	Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Jérôme Glisse, Shuah Khan, Barry Song,
	Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang,
	Jane Chu, Alistair Popple, Donet Tom, Matthew Brost,
	Francois Dugast, Ralph Campbell

On 31 Jul 2025, at 8:32, David Hildenbrand wrote:

> On 31.07.25 13:26, Zi Yan wrote:
>> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
>>
>>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>>
>>>> On 7/30/25 18:58, Zi Yan wrote:
>>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>>
>>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>>
>>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>>
>>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>>>>>> entries.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>>>>>> folio.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>    include/linux/huge_mm.h |   1 +
>>>>>>>>>>>>>>>    include/linux/rmap.h    |   2 +
>>>>>>>>>>>>>>>    include/linux/swapops.h |  17 +++
>>>>>>>>>>>>>>>    mm/huge_memory.c        | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>>>    mm/page_vma_mapped.c    |  13 +-
>>>>>>>>>>>>>>>    mm/pgtable-generic.c    |   6 +
>>>>>>>>>>>>>>>    mm/rmap.c               |  22 +++-
>>>>>>>>>>>>>>>    7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>> +	struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>>> +	struct folio *new_folio;
>>>>>>>>>>>>>>> +	int ret = 0;
>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>> +	/*
>>>>>>>>>>>>>>> +	 * Split the folio now. In the case of device
>>>>>>>>>>>>>>> +	 * private pages, this path is executed when
>>>>>>>>>>>>>>> +	 * the pmd is split and since freeze is not true
>>>>>>>>>>>>>>> +	 * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>>> +	 *
>>>>>>>>>>>>>>> +	 * With device private pages, deferred splits of
>>>>>>>>>>>>>>> +	 * folios should be handled here to prevent partial
>>>>>>>>>>>>>>> +	 * unmaps from causing issues later on in migration
>>>>>>>>>>>>>>> +	 * and fault handling flows.
>>>>>>>>>>>>>>> +	 */
>>>>>>>>>>>>>>> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>>>> device side mapping.
>>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>>>
>>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>>>> at CPU side.
>>>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>>>> add code to skip CPU mapping handling code. Basically device private folios are
>>>>> CPU unmapped and device mapped.
>>>>>
>>>>> Here are my questions on device private folios:
>>>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>>>      perspective? Can it be stored in a device private specific data structure?
>>>>
>>>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>>>> common code more messy if not done that way but sure possible.
>>>> And not consuming pfns (address space) at all would have benefits.
>>>>
>>>>> 2. When a device private folio is mapped on device, can someone other than
>>>>>      the device driver manipulate it assuming core-mm just skips device private
>>>>>      folios (barring the CPU access fault handling)?
>>>>>
>>>>> Where I am going is that can device private folios be treated as unmapped folios
>>>>> by CPU and only device driver manipulates their mappings?
>>>>>
>>>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>>>> someone could change while in device, it's just pfn.
>>>
>>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>>
>>> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>>
>>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>>
>>>
>>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
>>
>> Thanks for the clarification.
>>
>> So folio_mapcount() for device private folios should be treated the same
>> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
>> Then I wonder if the device private large folio split should go through
>> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
>> remap. Otherwise, how can we prevent rmap changes during the split?
>
> That is what I would expect: Replace device-private by migration entries, perform the migration/split/whatever, restore migration entries to device-private entries.
>
> That will drive the mapcount to 0.

Great. That matches my expectations as well. One potential optimization: since
a device private entry is already CPU inaccessible, the TLB flush could be
avoided.


Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 03/11] mm/migrate_device: THP migration of zone device pages
  2025-07-30  9:21 ` [v2 03/11] mm/migrate_device: THP migration of zone device pages Balbir Singh
@ 2025-07-31 16:19   ` kernel test robot
  0 siblings, 0 replies; 71+ messages in thread
From: kernel test robot @ 2025-07-31 16:19 UTC (permalink / raw)
  To: Balbir Singh, linux-mm
  Cc: oe-kbuild-all, linux-kernel, Balbir Singh, Karol Herbst,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, David Hildenbrand,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
	Mika Penttilä, Matthew Brost, Francois Dugast,
	Ralph Campbell

Hi Balbir,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on next-20250731]
[cannot apply to akpm-mm/mm-nonmm-unstable shuah-kselftest/next shuah-kselftest/fixes linus/master v6.16]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Balbir-Singh/mm-zone_device-support-large-zone-device-private-folios/20250730-172600
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20250730092139.3890844-4-balbirs%40nvidia.com
patch subject: [v2 03/11] mm/migrate_device: THP migration of zone device pages
config: x86_64-randconfig-122-20250731 (https://download.01.org/0day-ci/archive/20250731/202507312342.dmLxVgli-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14+deb12u1) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250731/202507312342.dmLxVgli-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202507312342.dmLxVgli-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
>> mm/migrate_device.c:769:13: sparse: sparse: incorrect type in assignment (different base types) @@     expected int [assigned] ret @@     got restricted vm_fault_t @@
   mm/migrate_device.c:769:13: sparse:     expected int [assigned] ret
   mm/migrate_device.c:769:13: sparse:     got restricted vm_fault_t
   mm/migrate_device.c:130:25: sparse: sparse: context imbalance in 'migrate_vma_collect_huge_pmd' - unexpected unlock
   mm/migrate_device.c:815:16: sparse: sparse: context imbalance in 'migrate_vma_insert_huge_pmd_page' - different lock contexts for basic block

vim +769 mm/migrate_device.c

   689	
   690	#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
   691	/**
   692	 * migrate_vma_insert_huge_pmd_page: Insert a huge folio into @migrate->vma->vm_mm
   693	 * at @addr. folio is already allocated as a part of the migration process with
   694	 * large page.
   695	 *
   696	 * @folio needs to be initialized and setup after it's allocated. The code bits
   697	 * here follow closely the code in __do_huge_pmd_anonymous_page(). This API does
   698	 * not support THP zero pages.
   699	 *
   700	 * @migrate: migrate_vma arguments
   701	 * @addr: address where the folio will be inserted
   702	 * @folio: folio to be inserted at @addr
   703	 * @src: src pfn which is being migrated
   704	 * @pmdp: pointer to the pmd
   705	 */
   706	static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
   707						 unsigned long addr,
   708						 struct page *page,
   709						 unsigned long *src,
   710						 pmd_t *pmdp)
   711	{
   712		struct vm_area_struct *vma = migrate->vma;
   713		gfp_t gfp = vma_thp_gfp_mask(vma);
   714		struct folio *folio = page_folio(page);
   715		int ret;
   716		spinlock_t *ptl;
   717		pgtable_t pgtable;
   718		pmd_t entry;
   719		bool flush = false;
   720		unsigned long i;
   721	
   722		VM_WARN_ON_FOLIO(!folio, folio);
   723		VM_WARN_ON_ONCE(!pmd_none(*pmdp) && !is_huge_zero_pmd(*pmdp));
   724	
   725		if (!thp_vma_suitable_order(vma, addr, HPAGE_PMD_ORDER))
   726			return -EINVAL;
   727	
   728		ret = anon_vma_prepare(vma);
   729		if (ret)
   730			return ret;
   731	
   732		folio_set_order(folio, HPAGE_PMD_ORDER);
   733		folio_set_large_rmappable(folio);
   734	
   735		if (mem_cgroup_charge(folio, migrate->vma->vm_mm, gfp)) {
   736			count_vm_event(THP_FAULT_FALLBACK);
   737			count_mthp_stat(HPAGE_PMD_ORDER, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
   738			ret = -ENOMEM;
   739			goto abort;
   740		}
   741	
   742		__folio_mark_uptodate(folio);
   743	
   744		pgtable = pte_alloc_one(vma->vm_mm);
   745		if (unlikely(!pgtable))
   746			goto abort;
   747	
   748		if (folio_is_device_private(folio)) {
   749			swp_entry_t swp_entry;
   750	
   751			if (vma->vm_flags & VM_WRITE)
   752				swp_entry = make_writable_device_private_entry(
   753							page_to_pfn(page));
   754			else
   755				swp_entry = make_readable_device_private_entry(
   756							page_to_pfn(page));
   757			entry = swp_entry_to_pmd(swp_entry);
   758		} else {
   759			if (folio_is_zone_device(folio) &&
   760			    !folio_is_device_coherent(folio)) {
   761				goto abort;
   762			}
   763			entry = folio_mk_pmd(folio, vma->vm_page_prot);
   764			if (vma->vm_flags & VM_WRITE)
   765				entry = pmd_mkwrite(pmd_mkdirty(entry), vma);
   766		}
   767	
   768		ptl = pmd_lock(vma->vm_mm, pmdp);
 > 769		ret = check_stable_address_space(vma->vm_mm);
   770		if (ret)
   771			goto abort;
   772	
   773		/*
   774		 * Check for userfaultfd but do not deliver the fault. Instead,
   775		 * just back off.
   776		 */
   777		if (userfaultfd_missing(vma))
   778			goto unlock_abort;
   779	
   780		if (!pmd_none(*pmdp)) {
   781			if (!is_huge_zero_pmd(*pmdp))
   782				goto unlock_abort;
   783			flush = true;
   784		} else if (!pmd_none(*pmdp))
   785			goto unlock_abort;
   786	
   787		add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
   788		folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE);
   789		if (!folio_is_zone_device(folio))
   790			folio_add_lru_vma(folio, vma);
   791		folio_get(folio);
   792	
   793		if (flush) {
   794			pte_free(vma->vm_mm, pgtable);
   795			flush_cache_page(vma, addr, addr + HPAGE_PMD_SIZE);
   796			pmdp_invalidate(vma, addr, pmdp);
   797		} else {
   798			pgtable_trans_huge_deposit(vma->vm_mm, pmdp, pgtable);
   799			mm_inc_nr_ptes(vma->vm_mm);
   800		}
   801		set_pmd_at(vma->vm_mm, addr, pmdp, entry);
   802		update_mmu_cache_pmd(vma, addr, pmdp);
   803	
   804		spin_unlock(ptl);
   805	
   806		count_vm_event(THP_FAULT_ALLOC);
   807		count_mthp_stat(HPAGE_PMD_ORDER, MTHP_STAT_ANON_FAULT_ALLOC);
   808		count_memcg_event_mm(vma->vm_mm, THP_FAULT_ALLOC);
   809	
   810		return 0;
   811	
   812	unlock_abort:
   813		spin_unlock(ptl);
   814	abort:
   815		for (i = 0; i < HPAGE_PMD_NR; i++)
   816			src[i] &= ~MIGRATE_PFN_MIGRATE;
   817		return 0;
   818	}
   819	#else /* !CONFIG_ARCH_ENABLE_THP_MIGRATION */
   820	static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
   821						 unsigned long addr,
   822						 struct page *page,
   823						 unsigned long *src,
   824						 pmd_t *pmdp)
   825	{
   826		return 0;
   827	}
   828	#endif
   829	
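
One possible way to address both reports, assuming check_stable_address_space()
keeps returning vm_fault_t, is to test its return value directly instead of
assigning it to the int 'ret', and to take the unlock path on failure since the
pmd lock is already held at that point (a sketch only, not a patch from this
series):

	ptl = pmd_lock(vma->vm_mm, pmdp);
	if (check_stable_address_space(vma->vm_mm))
		goto unlock_abort;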

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
  2025-07-31 13:34                               ` Zi Yan
@ 2025-07-31 19:09                                 ` David Hildenbrand
  0 siblings, 0 replies; 71+ messages in thread
From: David Hildenbrand @ 2025-07-31 19:09 UTC (permalink / raw)
  To: Zi Yan
  Cc: Mika Penttilä, Balbir Singh, linux-mm, linux-kernel,
	Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Jérôme Glisse, Shuah Khan, Barry Song,
	Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang,
	Jane Chu, Alistair Popple, Donet Tom, Matthew Brost,
	Francois Dugast, Ralph Campbell

On 31.07.25 15:34, Zi Yan wrote:
> On 31 Jul 2025, at 8:32, David Hildenbrand wrote:
> 
>> On 31.07.25 13:26, Zi Yan wrote:
>>> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
>>>
>>>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>>>
>>>>> On 7/30/25 18:58, Zi Yan wrote:
>>>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>>>
>>>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>>>
>>>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>>>
>>>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>>>>>>> entries.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>>>>>>> folio.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>     include/linux/huge_mm.h |   1 +
>>>>>>>>>>>>>>>>     include/linux/rmap.h    |   2 +
>>>>>>>>>>>>>>>>     include/linux/swapops.h |  17 +++
>>>>>>>>>>>>>>>>     mm/huge_memory.c        | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>>>>     mm/page_vma_mapped.c    |  13 +-
>>>>>>>>>>>>>>>>     mm/pgtable-generic.c    |   6 +
>>>>>>>>>>>>>>>>     mm/rmap.c               |  22 +++-
>>>>>>>>>>>>>>>>     7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>> +	struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>>>> +	struct folio *new_folio;
>>>>>>>>>>>>>>>> +	int ret = 0;
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +	/*
>>>>>>>>>>>>>>>> +	 * Split the folio now. In the case of device
>>>>>>>>>>>>>>>> +	 * private pages, this path is executed when
>>>>>>>>>>>>>>>> +	 * the pmd is split and since freeze is not true
>>>>>>>>>>>>>>>> +	 * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>>>> +	 *
>>>>>>>>>>>>>>>> +	 * With device private pages, deferred splits of
>>>>>>>>>>>>>>>> +	 * folios should be handled here to prevent partial
>>>>>>>>>>>>>>>> +	 * unmaps from causing issues later on in migration
>>>>>>>>>>>>>>>> +	 * and fault handling flows.
>>>>>>>>>>>>>>>> +	 */
>>>>>>>>>>>>>>>> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>>>>> device side mapping.
>>>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>>>>
>>>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>>>>> at CPU side.
>>>>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>>>>> add code to skip CPU mapping handling code. Basically device private folios are
>>>>>> CPU unmapped and device mapped.
>>>>>>
>>>>>> Here are my questions on device private folios:
>>>>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>>>>       perspective? Can it be stored in a device private specific data structure?
>>>>>
>>>>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>>>>> common code more messy if not done that way but sure possible.
>>>>> And not consuming pfns (address space) at all would have benefits.
>>>>>
>>>>>> 2. When a device private folio is mapped on device, can someone other than
>>>>>>       the device driver manipulate it assuming core-mm just skips device private
>>>>>>       folios (barring the CPU access fault handling)?
>>>>>>
>>>>>> Where I am going is that can device private folios be treated as unmapped folios
>>>>>> by CPU and only device driver manipulates their mappings?
>>>>>>
>>>>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>>>>> someone could change while in device, it's just pfn.
>>>>
>>>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>>>
>>>> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>>>
>>>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>>>
>>>>
>>>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
>>>
>>> Thanks for the clarification.
>>>
>>> So folio_mapcount() for device private folios should be treated the same
>>> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
>>> Then I wonder if the device private large folio split should go through
>>> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
>>> remap. Otherwise, how can we prevent rmap changes during the split?
>>
>> That is what I would expect: Replace device-private by migration entries, perform the migration/split/whatever, restore migration entries to device-private entries.
>>
>> That will drive the mapcount to 0.
> 
> Great. That matches my expectations as well. One potential optimization:
> since a device private entry is already CPU inaccessible, the TLB flush can
> be avoided.

Right, I would assume that is already done, or could easily be added.

Not using proper migration entries sounds like a hack that we shouldn't
start with. We should start with as few special cases as possible in
core-mm.

For example, as you probably implied, there is nothing stopping a
concurrent fork() or zap from messing with the refcount+mapcount.
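
(Condensed from the existing device-private handling in the zap path, with
details such as rss accounting elided, just to illustrate the race; entry and
vma come from the surrounding PTE loop. A concurrent unmap drops both counts
regardless of what the split code assumes.)

	if (is_device_private_entry(entry)) {
		struct page *page = pfn_swap_entry_to_page(entry);
		struct folio *folio = page_folio(page);

		/* Concurrent zap: mapcount and refcount both drop here. */
		folio_remove_rmap_pte(folio, page, vma);
		folio_put(folio);
	}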

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
  2025-07-31 11:26                           ` Zi Yan
  2025-07-31 12:32                             ` David Hildenbrand
@ 2025-08-01  0:49                             ` Balbir Singh
  2025-08-01  1:09                               ` Zi Yan
  2025-08-01  1:16                               ` Mika Penttilä
  1 sibling, 2 replies; 71+ messages in thread
From: Balbir Singh @ 2025-08-01  0:49 UTC (permalink / raw)
  To: Zi Yan, David Hildenbrand
  Cc: Mika Penttilä, linux-mm, linux-kernel, Karol Herbst,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
	Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
	Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
	Ralph Campbell

On 7/31/25 21:26, Zi Yan wrote:
> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
> 
>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>
>>> On 7/30/25 18:58, Zi Yan wrote:
>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>
>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>
>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>
>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>>>>> entries.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>>>>> folio.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> <snip Cc list, tags and diffstat>
>>>>>>>>>>>>>>
>>>>>>>>>>>> <snip>
>>>>>>>>>>>>
>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>> +	struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>> +	struct folio *new_folio;
>>>>>>>>>>>>>> +	int ret = 0;
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +	/*
>>>>>>>>>>>>>> +	 * Split the folio now. In the case of device
>>>>>>>>>>>>>> +	 * private pages, this path is executed when
>>>>>>>>>>>>>> +	 * the pmd is split and since freeze is not true
>>>>>>>>>>>>>> +	 * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>> +	 *
>>>>>>>>>>>>>> +	 * With device private pages, deferred splits of
>>>>>>>>>>>>>> +	 * folios should be handled here to prevent partial
>>>>>>>>>>>>>> +	 * unmaps from causing issues later on in migration
>>>>>>>>>>>>>> +	 * and fault handling flows.
>>>>>>>>>>>>>> +	 */
>>>>>>>>>>>>>> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>>> device side mapping.
>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>>
>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>>> at CPU side.
>>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>>> add code to skip CPU mapping handling code. Basically device private folios are
>>>> CPU unmapped and device mapped.
>>>>
>>>> Here are my questions on device private folios:
>>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>>     perspective? Can it be stored in a device private specific data structure?
>>>
>>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>>> common code more messy if not done that way but sure possible.
>>> And not consuming pfns (address space) at all would have benefits.
>>>
>>>> 2. When a device private folio is mapped on device, can someone other than
>>>>     the device driver manipulate it assuming core-mm just skips device private
>>>>     folios (barring the CPU access fault handling)?
>>>>
>>>> Where I am going is that can device private folios be treated as unmapped folios
>>>> by CPU and only device driver manipulates their mappings?
>>>>
>>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>>> someone could change while in device, it's just pfn.
>>
>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>
>> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>
>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>
>>
>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
> 
> Thanks for the clarification.
> 
> So folio_mapcount() for device private folios should be treated the same
> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
> Then I wonder if the device private large folio split should go through
> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
> remap. Otherwise, how can we prevent rmap changes during the split?
> 

That is true in general; the special cases I mentioned are:

1. Split during migration (where the sizes on the source/destination do not
   match), so we need to split in the middle of migration. The entries
   there are already unmapped, hence the special handling.
2. The partial unmap case, where we need to split in the context of the unmap
   due to the issues mentioned in the patch. The folio split code for device
   private folios was expanded into its own helper, which does not need to do
   the xas/mapped/lru folio handling. During partial unmap the original folio
   does get replaced by new anon rmap ptes (split_huge_pmd_locked).

For (2), I spent some time examining the implications of not unmapping the
folios prior to the split; in the partial unmap path, once we split the PMD,
the folios diverge. I did not run into any particular race with the tests
either.
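
For case (1), the call site in migrate_device.c could be shaped roughly as
below. This is an illustrative sketch using the flag and helper names from
this series (MIGRATE_PFN_COMPOUND, split_device_private_folio()), not the
literal patch, and it assumes the source folio is already unmapped and locked:

	if ((migrate->src[i] & MIGRATE_PFN_COMPOUND) &&
	    !(migrate->dst[i] & MIGRATE_PFN_COMPOUND)) {
		struct page *page = migrate_pfn_to_page(migrate->src[i]);
		struct folio *folio = page_folio(page);

		/*
		 * The destination could not allocate a large page: split the
		 * already-unmapped source folio so the range can fall back to
		 * order-0 migration, or skip migrating it on failure.
		 */
		if (split_device_private_folio(folio))
			migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
	}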

Balbir Singh

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
  2025-08-01  0:49                             ` Balbir Singh
@ 2025-08-01  1:09                               ` Zi Yan
  2025-08-01  7:01                                 ` David Hildenbrand
  2025-08-01  1:16                               ` Mika Penttilä
  1 sibling, 1 reply; 71+ messages in thread
From: Zi Yan @ 2025-08-01  1:09 UTC (permalink / raw)
  To: Balbir Singh
  Cc: David Hildenbrand, Mika Penttilä, linux-mm, linux-kernel,
	Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Jérôme Glisse, Shuah Khan, Barry Song,
	Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang,
	Jane Chu, Alistair Popple, Donet Tom, Matthew Brost,
	Francois Dugast, Ralph Campbell

On 31 Jul 2025, at 20:49, Balbir Singh wrote:

> On 7/31/25 21:26, Zi Yan wrote:
>> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
>>
>>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>>
>>>> On 7/30/25 18:58, Zi Yan wrote:
>>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>>
>>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>>
>>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>>
>>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>>>>>> entries.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>>>>>> folio.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> <snip Cc list, tags and diffstat>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>> +	struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>>> +	struct folio *new_folio;
>>>>>>>>>>>>>>> +	int ret = 0;
>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>> +	/*
>>>>>>>>>>>>>>> +	 * Split the folio now. In the case of device
>>>>>>>>>>>>>>> +	 * private pages, this path is executed when
>>>>>>>>>>>>>>> +	 * the pmd is split and since freeze is not true
>>>>>>>>>>>>>>> +	 * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>>> +	 *
>>>>>>>>>>>>>>> +	 * With device private pages, deferred splits of
>>>>>>>>>>>>>>> +	 * folios should be handled here to prevent partial
>>>>>>>>>>>>>>> +	 * unmaps from causing issues later on in migration
>>>>>>>>>>>>>>> +	 * and fault handling flows.
>>>>>>>>>>>>>>> +	 */
>>>>>>>>>>>>>>> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>>>> device side mapping.
>>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>>>
>>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>>>> at CPU side.
>>>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>>>> add code to skip CPU mapping handling code. Basically device private folios are
>>>>> CPU unmapped and device mapped.
>>>>>
>>>>> Here are my questions on device private folios:
>>>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>>>     perspective? Can it be stored in a device private specific data structure?
>>>>
>>>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>>>> common code more messy if not done that way but sure possible.
>>>> And not consuming pfns (address space) at all would have benefits.
>>>>
>>>>> 2. When a device private folio is mapped on device, can someone other than
>>>>>     the device driver manipulate it assuming core-mm just skips device private
>>>>>     folios (barring the CPU access fault handling)?
>>>>>
>>>>> Where I am going is that can device private folios be treated as unmapped folios
>>>>> by CPU and only device driver manipulates their mappings?
>>>>>
>>>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>>>> someone could change while in device, it's just pfn.
>>>
>>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>>
>>> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>>
>>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>>
>>>
>>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
>>
>> Thanks for the clarification.
>>
>> So folio_mapcount() for device private folios should be treated the same
>> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
>> Then I wonder if the device private large folio split should go through
>> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
>> remap. Otherwise, how can we prevent rmap changes during the split?
>>
>
> That is true in general; the special cases I mentioned are:
>
> 1. Split during migration (where the sizes on the source/destination do not
>    match), so we need to split in the middle of migration. The entries
>    there are already unmapped, hence the special handling.

In this case, all device private entries pointing to this device private
folio should be turned into migration entries, and folio_mapcount() should
be 0. split_device_private_folio() handles this situation, although the
function name is not very descriptive. You might want to add a comment to
this function about its use and add a check to make sure folio_mapcount()
is 0.
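
For example, something along these lines at the top of the helper
(illustrative only):

	/* All CPU-side entries must already be migration entries here. */
	VM_WARN_ON_ONCE_FOLIO(folio_mapcount(folio), folio);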

> 2. The partial unmap case, where we need to split in the context of the unmap
>    due to the issues mentioned in the patch. The folio split code for device
>    private folios was expanded into its own helper, which does not need to do
>    the xas/mapped/lru folio handling. During partial unmap the original folio
>    does get replaced by new anon rmap ptes (split_huge_pmd_locked).
>
> For (2), I spent some time examining the implications of not unmapping the
> folios prior to the split; in the partial unmap path, once we split the PMD,
> the folios diverge. I did not run into any particular race with the tests
> either.

For the partial unmap case, you should be able to handle it in the same way
as a normal PTE-mapped large folio, since, as David said, each device private
entry can be seen as a PROT_NONE entry. At PMD split time, the PMD page table
page should be filled with device private PTEs, each of them pointing to the
corresponding subpage. When the device driver unmaps some of those PTEs, the
rmap code should take care of the folio_mapcount().
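
A rough sketch of that fill step, modelled on the existing migration-entry
handling in __split_huge_pmd_locked() (mm, haddr, pte, page, write and the
loop locals are assumed from that context; this is not the actual patch):

	for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
		swp_entry_t swp_entry;
		pte_t entry;

		/* Each PTE points at the corresponding device private subpage. */
		if (write)
			swp_entry = make_writable_device_private_entry(
						page_to_pfn(page + i));
		else
			swp_entry = make_readable_device_private_entry(
						page_to_pfn(page + i));
		entry = swp_entry_to_pte(swp_entry);
		set_pte_at(mm, addr, pte + i, entry);
	}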


Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
  2025-08-01  0:49                             ` Balbir Singh
  2025-08-01  1:09                               ` Zi Yan
@ 2025-08-01  1:16                               ` Mika Penttilä
  2025-08-01  4:44                                 ` Balbir Singh
  1 sibling, 1 reply; 71+ messages in thread
From: Mika Penttilä @ 2025-08-01  1:16 UTC (permalink / raw)
  To: Balbir Singh, Zi Yan, David Hildenbrand
  Cc: linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
	Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
	Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
	Ralph Campbell

Hi,

On 8/1/25 03:49, Balbir Singh wrote:

> On 7/31/25 21:26, Zi Yan wrote:
>> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
>>
>>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>> On 7/30/25 18:58, Zi Yan wrote:
>>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>>
>>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>>
>>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>>
>>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>>>>>> entries.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>>>>>> folio.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> <snip Cc list, tags and diffstat>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>> +	struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>>> +	struct folio *new_folio;
>>>>>>>>>>>>>>> +	int ret = 0;
>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>> +	/*
>>>>>>>>>>>>>>> +	 * Split the folio now. In the case of device
>>>>>>>>>>>>>>> +	 * private pages, this path is executed when
>>>>>>>>>>>>>>> +	 * the pmd is split and since freeze is not true
>>>>>>>>>>>>>>> +	 * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>>> +	 *
>>>>>>>>>>>>>>> +	 * With device private pages, deferred splits of
>>>>>>>>>>>>>>> +	 * folios should be handled here to prevent partial
>>>>>>>>>>>>>>> +	 * unmaps from causing issues later on in migration
>>>>>>>>>>>>>>> +	 * and fault handling flows.
>>>>>>>>>>>>>>> +	 */
>>>>>>>>>>>>>>> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>>>> device side mapping.
>>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>>>
>>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>>>> at CPU side.
>>>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>>>> add code to skip CPU mapping handling code. Basically device private folios are
>>>>> CPU unmapped and device mapped.
>>>>>
>>>>> Here are my questions on device private folios:
>>>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>>>     perspective? Can it be stored in a device private specific data structure?
>>>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>>>> common code more messy if not done that way but sure possible.
>>>> And not consuming pfns (address space) at all would have benefits.
>>>>
>>>>> 2. When a device private folio is mapped on device, can someone other than
>>>>>     the device driver manipulate it assuming core-mm just skips device private
>>>>>     folios (barring the CPU access fault handling)?
>>>>>
>>>>> Where I am going is that can device private folios be treated as unmapped folios
>>>>> by CPU and only device driver manipulates their mappings?
>>>>>
>>>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>>>> someone could change while in device, it's just pfn.
>>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>>
>>> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>>
>>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>>
>>>
>>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
>> Thanks for the clarification.
>>
>> So folio_mapcount() for device private folios should be treated the same
>> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
>> Then I wonder if the device private large folio split should go through
>> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
>> remap. Otherwise, how can we prevent rmap changes during the split?
>>
> That is true in general; the special cases I mentioned are:
>
> 1. Split during migration (where the sizes on the source/destination do not
>    match), so we need to split in the middle of migration. The entries
>    there are already unmapped, hence the special handling.
> 2. The partial unmap case, where we need to split in the context of the unmap
>    due to the issues mentioned in the patch. The folio split code for device
>    private folios was expanded into its own helper, which does not need to do
>    the xas/mapped/lru folio handling. During partial unmap the original folio
>    does get replaced by new anon rmap ptes (split_huge_pmd_locked).
>
> For (2), I spent some time examining the implications of not unmapping the
> folios prior to the split; in the partial unmap path, once we split the PMD,
> the folios diverge. I did not run into any particular race with the tests
> either.

1) is totally fine. This was in v1 and led to Zi's split_unmapped_folio().

2) is a problem because the folio is mapped. split_huge_pmd() can also be reached from paths other than unmap.
It is vulnerable to races through rmap. And, for instance, this does not look right without checking the return value:

   folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));

You mention 2) is needed because of some later problems in the fault path after the pmd split. Would it be
possible to split the folio at fault time then?
Also, I didn't quite follow what kind of lock recursion you encountered when doing a proper split_folio()
instead?


> Balbir Singh
>
--Mika


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
  2025-08-01  1:16                               ` Mika Penttilä
@ 2025-08-01  4:44                                 ` Balbir Singh
  2025-08-01  5:57                                   ` Balbir Singh
                                                     ` (2 more replies)
  0 siblings, 3 replies; 71+ messages in thread
From: Balbir Singh @ 2025-08-01  4:44 UTC (permalink / raw)
  To: Mika Penttilä, Zi Yan, David Hildenbrand
  Cc: linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
	Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
	Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
	Ralph Campbell

On 8/1/25 11:16, Mika Penttilä wrote:
> Hi,
> 
> On 8/1/25 03:49, Balbir Singh wrote:
> 
>> On 7/31/25 21:26, Zi Yan wrote:
>>> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
>>>
>>>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>>> On 7/30/25 18:58, Zi Yan wrote:
>>>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>>>
>>>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>>>
>>>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>>>
>>>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>>>>>>> entries.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>>>>>>> folio.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>   include/linux/huge_mm.h |   1 +
>>>>>>>>>>>>>>>>   include/linux/rmap.h    |   2 +
>>>>>>>>>>>>>>>>   include/linux/swapops.h |  17 +++
>>>>>>>>>>>>>>>>   mm/huge_memory.c        | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>>>>   mm/page_vma_mapped.c    |  13 +-
>>>>>>>>>>>>>>>>   mm/pgtable-generic.c    |   6 +
>>>>>>>>>>>>>>>>   mm/rmap.c               |  22 +++-
>>>>>>>>>>>>>>>>   7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>> +	struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>>>> +	struct folio *new_folio;
>>>>>>>>>>>>>>>> +	int ret = 0;
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +	/*
>>>>>>>>>>>>>>>> +	 * Split the folio now. In the case of device
>>>>>>>>>>>>>>>> +	 * private pages, this path is executed when
>>>>>>>>>>>>>>>> +	 * the pmd is split and since freeze is not true
>>>>>>>>>>>>>>>> +	 * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>>>> +	 *
>>>>>>>>>>>>>>>> +	 * With device private pages, deferred splits of
>>>>>>>>>>>>>>>> +	 * folios should be handled here to prevent partial
>>>>>>>>>>>>>>>> +	 * unmaps from causing issues later on in migration
>>>>>>>>>>>>>>>> +	 * and fault handling flows.
>>>>>>>>>>>>>>>> +	 */
>>>>>>>>>>>>>>>> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>>>>> device side mapping.
>>>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>>>>
>>>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>>>>> at CPU side.
>>>>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>>>>> add code to skip CPU mapping handling code. Basically device private folios are
>>>>>> CPU unmapped and device mapped.
>>>>>>
>>>>>> Here are my questions on device private folios:
>>>>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>>>>     perspective? Can it be stored in a device private specific data structure?
>>>>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>>>>> common code more messy if not done that way but sure possible.
>>>>> And not consuming pfns (address space) at all would have benefits.
>>>>>
>>>>>> 2. When a device private folio is mapped on device, can someone other than
>>>>>>     the device driver manipulate it assuming core-mm just skips device private
>>>>>>     folios (barring the CPU access fault handling)?
>>>>>>
>>>>>> Where I am going is that can device private folios be treated as unmapped folios
>>>>>> by CPU and only device driver manipulates their mappings?
>>>>>>
>>>>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>>>>> someone could change while in device, it's just pfn.
>>>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>>>
>>>> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>>>
>>>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>>>
>>>>
>>>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
>>> Thanks for the clarification.
>>>
>>> So folio_mapcount() for device private folios should be treated the same
>>> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
>>> Then I wonder if the device private large folio split should go through
>>> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
>>> remap. Otherwise, how can we prevent rmap changes during the split?
>>>
>> That is true in general, the special cases I mentioned are:
>>
>> 1. split during migration (where the sizes on source/destination do not
>>    match) and so we need to split in the middle of migration. The entries
>>    there are already unmapped and hence the special handling
>> 2. Partial unmap case, where we need to split in the context of the unmap
>>    due to the issues mentioned in the patch. I expanded the folio split code
>>    for device private folios into its own helper, which does not
>>    need to do the xas/mapped/lru folio handling. During partial unmap the
>>    original folio does get replaced by new anon rmap ptes (split_huge_pmd_locked)
>>
>> For (2), I spent some time examining the implications of not unmapping the
>> folios prior to split and in the partial unmap path, once we split the PMD
>> the folios diverge. I did not run into any particular race either with the
>> tests.
> 
> 1) is totally fine. This was in v1 and led to Zi's split_unmapped_folio()
> 
> 2) is a problem because the folio is mapped. split_huge_pmd() can also be reached from paths other than the unmap path.
> It is vulnerable to races via rmap. And for instance, this does not look right without checking:
> 
>    folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
> 

I can add checks to make sure that the call does succeed. 
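
Roughly, the check could be factored like this (sketch only; the helper name and the
-EAGAIN return value are placeholders, not necessarily what the final patch will do):

/*
 * Sketch: make a failed refcount freeze visible to the caller instead
 * of assuming it always succeeds.
 */
static int device_private_folio_freeze(struct folio *folio)
{
        /*
         * folio_expected_ref_count() covers the references mm knows about;
         * the extra 1 is the caller's own reference. Any other, unexpected
         * reference makes the freeze fail.
         */
        if (!folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio)))
                return -EAGAIN;

        return 0;
}

split_device_private_folio() could then bail out and leave the folio intact when the
freeze fails, rather than splitting with an unfrozen refcount.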

> You mention 2) is needed because of some later problems in the fault path after the pmd split. Would it be
> possible to split the folio at fault time then?

So after the partial unmap, the folio ends up in a somewhat strange situation: the folio is large,
but no longer mapped (large_mapcount can drop to 0 after all the folio_remove_rmap_ptes() calls). Calling folio_split()
on such a partially unmapped folio fails because folio_get_anon_vma() fails: folio_mapped() returns false
once folio_large_mapcount() is 0. There is also additional complexity with refcounts and mapping.
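
Roughly, the failure mode looks like this from the caller's side (sketch only; the wrapper
name is made up and the real __folio_split() path has more steps, but folio_get_anon_vma()
is the relevant part):

/* Sketch: why the normal split path cannot be taken once the folio is CPU-unmapped. */
static int try_normal_split(struct folio *folio)
{
        struct anon_vma *anon_vma;

        if (!folio_test_anon(folio))
                return -EINVAL;

        /* Returns NULL when folio_mapped() is false, i.e. large_mapcount is 0. */
        anon_vma = folio_get_anon_vma(folio);
        if (!anon_vma)
                return -EBUSY;

        /* ... unmap, freeze, __split_unmapped_folio(), unfreeze, remap ... */

        put_anon_vma(anon_vma);
        return 0;
}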


> Also, I didn't quite follow what kind of lock recursion you encountered when doing a proper split_folio()
> instead?
> 
> 

Splitting during partial unmap causes recursive locking issues with anon_vma when invoked from the
split_huge_pmd_locked() path. Deferred splits do not work for device private pages, due to the
migration requirements for fault handling.

Balbir Singh


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
  2025-08-01  4:44                                 ` Balbir Singh
@ 2025-08-01  5:57                                   ` Balbir Singh
  2025-08-01  6:01                                   ` Mika Penttilä
  2025-08-01  7:04                                   ` David Hildenbrand
  2 siblings, 0 replies; 71+ messages in thread
From: Balbir Singh @ 2025-08-01  5:57 UTC (permalink / raw)
  To: Mika Penttilä, Zi Yan, David Hildenbrand
  Cc: linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
	Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
	Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
	Ralph Campbell

On 8/1/25 14:44, Balbir Singh wrote:
> On 8/1/25 11:16, Mika Penttilä wrote:
>> Hi,
>>
>> On 8/1/25 03:49, Balbir Singh wrote:
>>
>>> On 7/31/25 21:26, Zi Yan wrote:
>>>> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
>>>>
>>>>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>>>> On 7/30/25 18:58, Zi Yan wrote:
>>>>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>>>>
>>>>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>>>>
>>>>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>>>>>>>> entries.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>>>>>>>> folio.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>   include/linux/huge_mm.h |   1 +
>>>>>>>>>>>>>>>>>   include/linux/rmap.h    |   2 +
>>>>>>>>>>>>>>>>>   include/linux/swapops.h |  17 +++
>>>>>>>>>>>>>>>>>   mm/huge_memory.c        | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>>>>>   mm/page_vma_mapped.c    |  13 +-
>>>>>>>>>>>>>>>>>   mm/pgtable-generic.c    |   6 +
>>>>>>>>>>>>>>>>>   mm/rmap.c               |  22 +++-
>>>>>>>>>>>>>>>>>   7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>> +	struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>>>>> +	struct folio *new_folio;
>>>>>>>>>>>>>>>>> +	int ret = 0;
>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>> +	/*
>>>>>>>>>>>>>>>>> +	 * Split the folio now. In the case of device
>>>>>>>>>>>>>>>>> +	 * private pages, this path is executed when
>>>>>>>>>>>>>>>>> +	 * the pmd is split and since freeze is not true
>>>>>>>>>>>>>>>>> +	 * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>>>>> +	 *
>>>>>>>>>>>>>>>>> +	 * With device private pages, deferred splits of
>>>>>>>>>>>>>>>>> +	 * folios should be handled here to prevent partial
>>>>>>>>>>>>>>>>> +	 * unmaps from causing issues later on in migration
>>>>>>>>>>>>>>>>> +	 * and fault handling flows.
>>>>>>>>>>>>>>>>> +	 */
>>>>>>>>>>>>>>>>> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>>>>>> device side mapping.
>>>>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>>>>>
>>>>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>>>>>> at CPU side.
>>>>>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>>>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>>>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>>>>>> add code to skip CPU mapping handling code. Basically device private folios are
>>>>>>> CPU unmapped and device mapped.
>>>>>>>
>>>>>>> Here are my questions on device private folios:
>>>>>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>>>>>     perspective? Can it be stored in a device private specific data structure?
>>>>>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>>>>>> common code more messy if not done that way but sure possible.
>>>>>> And not consuming pfns (address space) at all would have benefits.
>>>>>>
>>>>>>> 2. When a device private folio is mapped on device, can someone other than
>>>>>>>     the device driver manipulate it assuming core-mm just skips device private
>>>>>>>     folios (barring the CPU access fault handling)?
>>>>>>>
>>>>>>> Where I am going is that can device private folios be treated as unmapped folios
>>>>>>> by CPU and only device driver manipulates their mappings?
>>>>>>>
>>>>>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>>>>>> someone could change while in device, it's just pfn.
>>>>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>>>>
>>>>> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>>>>
>>>>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>>>>
>>>>>
>>>>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
>>>> Thanks for the clarification.
>>>>
>>>> So folio_mapcount() for device private folios should be treated the same
>>>> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
>>>> Then I wonder if the device private large folio split should go through
>>>> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
>>>> remap. Otherwise, how can we prevent rmap changes during the split?
>>>>
>>> That is true in general, the special cases I mentioned are:
>>>
>>> 1. split during migration (where the sizes on source/destination do not
>>>    match) and so we need to split in the middle of migration. The entries
>>>    there are already unmapped and hence the special handling
>>> 2. Partial unmap case, where we need to split in the context of the unmap
>>>    due to the issues mentioned in the patch. I expanded the folio split code
>>>    for device private folios into its own helper, which does not
>>>    need to do the xas/mapped/lru folio handling. During partial unmap the
>>>    original folio does get replaced by new anon rmap ptes (split_huge_pmd_locked)
>>>
>>> For (2), I spent some time examining the implications of not unmapping the
>>> folios prior to split and in the partial unmap path, once we split the PMD
>>> the folios diverge. I did not run into any particular race either with the
>>> tests.
>>
>> 1) is totally fine. This was in v1 and led to Zi's split_unmapped_folio()
>>
>> 2) is a problem because the folio is mapped. split_huge_pmd() can also be reached from paths other than the unmap path.
>> It is vulnerable to races via rmap. And for instance, this does not look right without checking:
>>
>>    folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>
> 
> I can add checks to make sure that the call does succeed. 
> 
>> You mention 2) is needed because of some later problems in the fault path after the pmd split. Would it be
>> possible to split the folio at fault time then?
> 
> So after the partial unmap, the folio ends up in a somewhat strange situation: the folio is large,
> but no longer mapped (large_mapcount can drop to 0 after all the folio_remove_rmap_ptes() calls). Calling folio_split()
> on such a partially unmapped folio fails because folio_get_anon_vma() fails: folio_mapped() returns false
> once folio_large_mapcount() is 0. There is also additional complexity with refcounts and mapping.
> 

Let me get back to you on this with data; I was playing around with CONFIG_MM_IDS and might
have different data from it.

> 
>> Also, I didn't quite follow what kind of lock recursion you encountered when doing a proper split_folio()
>> instead?
>>
>>
> 
> Splitting during partial unmap causes recursive locking issues with anon_vma when invoked from the
> split_huge_pmd_locked() path. Deferred splits do not work for device private pages, due to the
> migration requirements for fault handling.
> 
> Balbir Singh
> 

Balbir

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
  2025-08-01  4:44                                 ` Balbir Singh
  2025-08-01  5:57                                   ` Balbir Singh
@ 2025-08-01  6:01                                   ` Mika Penttilä
  2025-08-01  7:04                                   ` David Hildenbrand
  2 siblings, 0 replies; 71+ messages in thread
From: Mika Penttilä @ 2025-08-01  6:01 UTC (permalink / raw)
  To: Balbir Singh, Zi Yan, David Hildenbrand
  Cc: linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
	Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
	Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
	Ralph Campbell


On 8/1/25 07:44, Balbir Singh wrote:
> On 8/1/25 11:16, Mika Penttilä wrote:
>> Hi,
>>
>> On 8/1/25 03:49, Balbir Singh wrote:
>>
>>> On 7/31/25 21:26, Zi Yan wrote:
>>>> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
>>>>
>>>>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>>>> On 7/30/25 18:58, Zi Yan wrote:
>>>>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>>>>
>>>>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>>>>
>>>>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>>>>>>>> entries.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>>>>>>>> folio.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>   include/linux/huge_mm.h |   1 +
>>>>>>>>>>>>>>>>>   include/linux/rmap.h    |   2 +
>>>>>>>>>>>>>>>>>   include/linux/swapops.h |  17 +++
>>>>>>>>>>>>>>>>>   mm/huge_memory.c        | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>>>>>   mm/page_vma_mapped.c    |  13 +-
>>>>>>>>>>>>>>>>>   mm/pgtable-generic.c    |   6 +
>>>>>>>>>>>>>>>>>   mm/rmap.c               |  22 +++-
>>>>>>>>>>>>>>>>>   7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>> +	struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>>>>> +	struct folio *new_folio;
>>>>>>>>>>>>>>>>> +	int ret = 0;
>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>> +	/*
>>>>>>>>>>>>>>>>> +	 * Split the folio now. In the case of device
>>>>>>>>>>>>>>>>> +	 * private pages, this path is executed when
>>>>>>>>>>>>>>>>> +	 * the pmd is split and since freeze is not true
>>>>>>>>>>>>>>>>> +	 * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>>>>> +	 *
>>>>>>>>>>>>>>>>> +	 * With device private pages, deferred splits of
>>>>>>>>>>>>>>>>> +	 * folios should be handled here to prevent partial
>>>>>>>>>>>>>>>>> +	 * unmaps from causing issues later on in migration
>>>>>>>>>>>>>>>>> +	 * and fault handling flows.
>>>>>>>>>>>>>>>>> +	 */
>>>>>>>>>>>>>>>>> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>>>>>> device side mapping.
>>>>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>>>>>
>>>>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>>>>>> at CPU side.
>>>>>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>>>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>>>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>>>>>> add code to skip CPU mapping handling code. Basically device private folios are
>>>>>>> CPU unmapped and device mapped.
>>>>>>>
>>>>>>> Here are my questions on device private folios:
>>>>>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>>>>>     perspective? Can it be stored in a device private specific data structure?
>>>>>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>>>>>> common code more messy if not done that way but sure possible.
>>>>>> And not consuming pfns (address space) at all would have benefits.
>>>>>>
>>>>>>> 2. When a device private folio is mapped on device, can someone other than
>>>>>>>     the device driver manipulate it assuming core-mm just skips device private
>>>>>>>     folios (barring the CPU access fault handling)?
>>>>>>>
>>>>>>> Where I am going is that can device private folios be treated as unmapped folios
>>>>>>> by CPU and only device driver manipulates their mappings?
>>>>>>>
>>>>>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>>>>>> someone could change while in device, it's just pfn.
>>>>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>>>>
>>>>> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>>>>
>>>>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>>>>
>>>>>
>>>>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
>>>> Thanks for the clarification.
>>>>
>>>> So folio_mapcount() for device private folios should be treated the same
>>>> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
>>>> Then I wonder if the device private large folio split should go through
>>>> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
>>>> remap. Otherwise, how can we prevent rmap changes during the split?
>>>>
>>> That is true in general, the special cases I mentioned are:
>>>
>>> 1. split during migration (where the sizes on source/destination do not
>>>    match) and so we need to split in the middle of migration. The entries
>>>    there are already unmapped and hence the special handling
>>> 2. Partial unmap case, where we need to split in the context of the unmap
>>>    due to the issues mentioned in the patch. I expanded the folio split code
>>>    for device private folios into its own helper, which does not
>>>    need to do the xas/mapped/lru folio handling. During partial unmap the
>>>    original folio does get replaced by new anon rmap ptes (split_huge_pmd_locked)
>>>
>>> For (2), I spent some time examining the implications of not unmapping the
>>> folios prior to split and in the partial unmap path, once we split the PMD
>>> the folios diverge. I did not run into any particular race either with the
>>> tests.
>> 1) is totally fine. This was in v1 and led to Zi's split_unmapped_folio()
>>
>> 2) is a problem because the folio is mapped. split_huge_pmd() can also be reached from paths other than the unmap path.
>> It is vulnerable to races via rmap. And for instance, this does not look right without checking:
>>
>>    folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>
> I can add checks to make sure that the call does succeed. 
>
>> You mention 2) is needed because of some later problems in the fault path after the pmd split. Would it be
>> possible to split the folio at fault time then?
> So after the partial unmap, the folio ends up in a somewhat strange situation: the folio is large,
> but no longer mapped (large_mapcount can drop to 0 after all the folio_remove_rmap_ptes() calls). Calling folio_split()
> on such a partially unmapped folio fails because folio_get_anon_vma() fails: folio_mapped() returns false
> once folio_large_mapcount() is 0. There is also additional complexity with refcounts and mapping.

Is this after the deferred split -> map_unused_to_zeropage flow which would leave the page unmapped? Maybe disable that for device pages?


>
>
>> Also, I didn't quite follow what kind of lock recursion you encountered when doing a proper split_folio()
>> instead?
>>
>>
> Splitting during partial unmap causes recursive locking issues with anon_vma when invoked from the
> split_huge_pmd_locked() path. Deferred splits do not work for device private pages, due to the
> migration requirements for fault handling.
>
> Balbir Singh

--Mika


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
  2025-08-01  1:09                               ` Zi Yan
@ 2025-08-01  7:01                                 ` David Hildenbrand
  0 siblings, 0 replies; 71+ messages in thread
From: David Hildenbrand @ 2025-08-01  7:01 UTC (permalink / raw)
  To: Zi Yan, Balbir Singh
  Cc: Mika Penttilä, linux-mm, linux-kernel, Karol Herbst,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
	Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
	Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
	Ralph Campbell

On 01.08.25 03:09, Zi Yan wrote:
> On 31 Jul 2025, at 20:49, Balbir Singh wrote:
> 
>> On 7/31/25 21:26, Zi Yan wrote:
>>> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
>>>
>>>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>>>
>>>>> On 7/30/25 18:58, Zi Yan wrote:
>>>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>>>
>>>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>>>
>>>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>>>
>>>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>>>>>>> entries.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>>>>>>> folio.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>    include/linux/huge_mm.h |   1 +
>>>>>>>>>>>>>>>>    include/linux/rmap.h    |   2 +
>>>>>>>>>>>>>>>>    include/linux/swapops.h |  17 +++
>>>>>>>>>>>>>>>>    mm/huge_memory.c        | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>>>>    mm/page_vma_mapped.c    |  13 +-
>>>>>>>>>>>>>>>>    mm/pgtable-generic.c    |   6 +
>>>>>>>>>>>>>>>>    mm/rmap.c               |  22 +++-
>>>>>>>>>>>>>>>>    7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>> +	struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>>>> +	struct folio *new_folio;
>>>>>>>>>>>>>>>> +	int ret = 0;
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +	/*
>>>>>>>>>>>>>>>> +	 * Split the folio now. In the case of device
>>>>>>>>>>>>>>>> +	 * private pages, this path is executed when
>>>>>>>>>>>>>>>> +	 * the pmd is split and since freeze is not true
>>>>>>>>>>>>>>>> +	 * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>>>> +	 *
>>>>>>>>>>>>>>>> +	 * With device private pages, deferred splits of
>>>>>>>>>>>>>>>> +	 * folios should be handled here to prevent partial
>>>>>>>>>>>>>>>> +	 * unmaps from causing issues later on in migration
>>>>>>>>>>>>>>>> +	 * and fault handling flows.
>>>>>>>>>>>>>>>> +	 */
>>>>>>>>>>>>>>>> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>>>>> device side mapping.
>>>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>>>>
>>>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>>>>> at CPU side.
>>>>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>>>>> add code to skip CPU mapping handling code. Basically device private folios are
>>>>>> CPU unmapped and device mapped.
>>>>>>
>>>>>> Here are my questions on device private folios:
>>>>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>>>>      perspective? Can it be stored in a device private specific data structure?
>>>>>
>>>>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>>>>> common code more messy if not done that way but sure possible.
>>>>> And not consuming pfns (address space) at all would have benefits.
>>>>>
>>>>>> 2. When a device private folio is mapped on device, can someone other than
>>>>>>      the device driver manipulate it assuming core-mm just skips device private
>>>>>>      folios (barring the CPU access fault handling)?
>>>>>>
>>>>>> Where I am going is that can device private folios be treated as unmapped folios
>>>>>> by CPU and only device driver manipulates their mappings?
>>>>>>
>>>>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>>>>> someone could change while in device, it's just pfn.
>>>>
>>>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>>>
>>>> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>>>
>>>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>>>
>>>>
>>>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
>>>
>>> Thanks for the clarification.
>>>
>>> So folio_mapcount() for device private folios should be treated the same
>>> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
>>> Then I wonder if the device private large folio split should go through
>>> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
>>> remap. Otherwise, how can we prevent rmap changes during the split?
>>>
>>
>> That is true in general, the special cases I mentioned are:
>>
>> 1. split during migration (where the sizes on source/destination do not
>>     match) and so we need to split in the middle of migration. The entries
>>     there are already unmapped and hence the special handling
> 
> In this case, all device private entries pointing to this device private
> folio should be turned into migration entries and folio_mapcount() should
> be 0. split_device_private_folio() handles this situation, although
> the function name is not very descriptive. You might want to add a comment
> to this function about its use and add a check to make sure folio_mapcount()
> is 0.
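
A minimal sketch of such a check (the helper name is made up; VM_WARN_ON_ONCE_FOLIO() based
reporting is just one option):

/*
 * Sketch: split_device_private_folio() is only meant for device private
 * folios whose CPU entries have already been replaced by migration
 * entries, so nothing should map the folio anymore.
 */
static bool device_private_folio_splittable(struct folio *folio)
{
        VM_WARN_ON_ONCE_FOLIO(!folio_is_device_private(folio), folio);
        VM_WARN_ON_ONCE_FOLIO(folio_mapcount(folio), folio);

        return folio_is_device_private(folio) && !folio_mapcount(folio);
}
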
> 
>> 2. Partial unmap case, where we need to split in the context of the unmap
>>     due to the issues mentioned in the patch. I expanded the folio split code
>>     for device private folios into its own helper, which does not
>>     need to do the xas/mapped/lru folio handling. During partial unmap the
>>     original folio does get replaced by new anon rmap ptes (split_huge_pmd_locked)
>>
>> For (2), I spent some time examining the implications of not unmapping the
>> folios prior to split and in the partial unmap path, once we split the PMD
>> the folios diverge. I did not run into any particular race either with the
>> tests.
> 
> For the partial unmap case, you should be able to handle it in the same way
> as a normal PTE-mapped large folio, since, as David said, each device private
> entry can be seen as a PROT_NONE entry. At PMD split, the PMD page table page
> should be filled with device private PTEs, each of them pointing to the
> corresponding subpage. When some of those PTEs are unmapped, the rmap code
> should take care of the folio_mapcount().

Right. In general, no splitting of any THP with a mapcount > 0 
(folio_mapped()). It's a clear indication that you are doing something 
wrong.
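
To make the PMD-split point above a bit more concrete, a rough sketch of filling the
deposited page table with device private PTEs, one per subpage (purely illustrative:
the helper name is invented, and rmap accounting, dirty/soft-dirty/uffd-wp bits and the
actual __split_huge_pmd_locked() plumbing are all left out):

/*
 * Sketch: replace one PMD-sized device private mapping with HPAGE_PMD_NR
 * PTE-level device private entries, so that a later partial unmap becomes
 * ordinary PTE zapping plus rmap accounting.
 */
static void fill_device_private_ptes(struct vm_area_struct *vma,
                                     unsigned long haddr, pte_t *pte,
                                     struct folio *folio, bool write)
{
        unsigned long addr = haddr;
        int i;

        for (i = 0; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
                unsigned long pfn = page_to_pfn(folio_page(folio, i));
                swp_entry_t entry;

                if (write)
                        entry = make_writable_device_private_entry(pfn);
                else
                        entry = make_readable_device_private_entry(pfn);

                set_pte_at(vma->vm_mm, addr, pte + i, swp_entry_to_pte(entry));
        }
}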

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
  2025-08-01  4:44                                 ` Balbir Singh
  2025-08-01  5:57                                   ` Balbir Singh
  2025-08-01  6:01                                   ` Mika Penttilä
@ 2025-08-01  7:04                                   ` David Hildenbrand
  2025-08-01  8:01                                     ` Balbir Singh
  2 siblings, 1 reply; 71+ messages in thread
From: David Hildenbrand @ 2025-08-01  7:04 UTC (permalink / raw)
  To: Balbir Singh, Mika Penttilä, Zi Yan
  Cc: linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
	Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
	Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
	Ralph Campbell

On 01.08.25 06:44, Balbir Singh wrote:
> On 8/1/25 11:16, Mika Penttilä wrote:
>> Hi,
>>
>> On 8/1/25 03:49, Balbir Singh wrote:
>>
>>> On 7/31/25 21:26, Zi Yan wrote:
>>>> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
>>>>
>>>>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>>>> On 7/30/25 18:58, Zi Yan wrote:
>>>>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>>>>
>>>>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>>>>
>>>>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>>>>>>>> entries.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>>>>>>>> folio.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>    include/linux/huge_mm.h |   1 +
>>>>>>>>>>>>>>>>>    include/linux/rmap.h    |   2 +
>>>>>>>>>>>>>>>>>    include/linux/swapops.h |  17 +++
>>>>>>>>>>>>>>>>>    mm/huge_memory.c        | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>>>>>    mm/page_vma_mapped.c    |  13 +-
>>>>>>>>>>>>>>>>>    mm/pgtable-generic.c    |   6 +
>>>>>>>>>>>>>>>>>    mm/rmap.c               |  22 +++-
>>>>>>>>>>>>>>>>>    7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>> +	struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>>>>> +	struct folio *new_folio;
>>>>>>>>>>>>>>>>> +	int ret = 0;
>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>> +	/*
>>>>>>>>>>>>>>>>> +	 * Split the folio now. In the case of device
>>>>>>>>>>>>>>>>> +	 * private pages, this path is executed when
>>>>>>>>>>>>>>>>> +	 * the pmd is split and since freeze is not true
>>>>>>>>>>>>>>>>> +	 * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>>>>> +	 *
>>>>>>>>>>>>>>>>> +	 * With device private pages, deferred splits of
>>>>>>>>>>>>>>>>> +	 * folios should be handled here to prevent partial
>>>>>>>>>>>>>>>>> +	 * unmaps from causing issues later on in migration
>>>>>>>>>>>>>>>>> +	 * and fault handling flows.
>>>>>>>>>>>>>>>>> +	 */
>>>>>>>>>>>>>>>>> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>>>>>> device side mapping.
>>>>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>>>>>
>>>>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>>>>>> at CPU side.
>>>>>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>>>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>>>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>>>>>> add code to skip CPU mapping handling code. Basically device private folios are
>>>>>>> CPU unmapped and device mapped.
>>>>>>>
>>>>>>> Here are my questions on device private folios:
>>>>>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>>>>>      perspective? Can it be stored in a device private specific data structure?
>>>>>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>>>>>> common code more messy if not done that way but sure possible.
>>>>>> And not consuming pfns (address space) at all would have benefits.
>>>>>>
>>>>>>> 2. When a device private folio is mapped on device, can someone other than
>>>>>>>      the device driver manipulate it assuming core-mm just skips device private
>>>>>>>      folios (barring the CPU access fault handling)?
>>>>>>>
>>>>>>> Where I am going is that can device private folios be treated as unmapped folios
>>>>>>> by CPU and only device driver manipulates their mappings?
>>>>>>>
>>>>>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>>>>>> someone could change while in device, it's just pfn.
>>>>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>>>>
>>>>> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>>>>
>>>>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>>>>
>>>>>
>>>>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
>>>> Thanks for the clarification.
>>>>
>>>> So folio_mapcount() for device private folios should be treated the same
>>>> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
>>>> Then I wonder if the device private large folio split should go through
>>>> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
>>>> remap. Otherwise, how can we prevent rmap changes during the split?
>>>>
>>> That is true in general, the special cases I mentioned are:
>>>
>>> 1. split during migration (where the sizes on source/destination do not
>>>     match) and so we need to split in the middle of migration. The entries
>>>     there are already unmapped and hence the special handling
>>> 2. Partial unmap case, where we need to split in the context of the unmap
>>>     due to the issues mentioned in the patch. I expanded the folio split code
>>>     for device private pages into its own helper, which does not need to do
>>>     the xas/mapped/lru folio handling. During partial unmap the original
>>>     folio does get replaced by new anon rmap ptes (split_huge_pmd_locked)
>>>
>>> For (2), I spent some time examining the implications of not unmapping the
>>> folios prior to the split; in the partial unmap path, once we split the PMD,
>>> the folios diverge. I did not run into any particular race with the tests
>>> either.
>>
>> 1) is totally fine. This was in v1 and lead to Zi's split_unmapped_folio()
>>
>> 2) is a problem because folio is mapped. split_huge_pmd() can be reached also from other than unmap path.
>> It is vulnerable to races by rmap. And for instance this does not look right without checking:
>>
>>     folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>
> 
> I can add checks to make sure that the call does succeed.
> 
>> You mention 2) is needed because of some later problems in fault path after pmd split. Would it be
>> possible to split the folio at fault time then?
> 
> So after the partial unmap, the folio ends up in a slightly strange situation: the folio is large,
> but not mapped (since large_mapcount can be 0 after all the folio_remove_rmap_ptes() calls). Calling folio_split()
> on a partially unmapped folio fails because folio_get_anon_vma() fails, due to folio_mapped() failing
> on the folio_large_mapcount check. There is also additional complexity with refcounts and mapping.

I think you mean "Calling folio_split() on a *fully* unmapped folio 
fails ..."

A partially mapped folio still has folio_mapcount() > 0 -> 
folio_mapped() == true.
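
(To spell it out, what folio_mapped() does today is roughly just:

	static inline bool folio_mapped(const struct folio *folio)
	{
		return folio_mapcount(folio) >= 1;
	}

so as long as any PTE or PMD mapping remains, including the device-private
entries that rmap accounts for, the folio counts as mapped.)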

> 
> 
>> Also, didn't quite follow what kind of lock recursion did you encounter doing proper split_folio()
>> instead?
>>
>>
> 
> Splitting during partial unmap causes recursive locking issues with anon_vma when invoked from
> split_huge_pmd_locked() path.

Yes, that's very complicated.

> Deferred splits do not work for device private pages, due to the
> migration requirements for fault handling.

Can you elaborate on that?

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
  2025-08-01  7:04                                   ` David Hildenbrand
@ 2025-08-01  8:01                                     ` Balbir Singh
  2025-08-01  8:46                                       ` David Hildenbrand
  0 siblings, 1 reply; 71+ messages in thread
From: Balbir Singh @ 2025-08-01  8:01 UTC (permalink / raw)
  To: David Hildenbrand, Mika Penttilä, Zi Yan
  Cc: linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
	Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
	Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
	Ralph Campbell

On 8/1/25 17:04, David Hildenbrand wrote:
> On 01.08.25 06:44, Balbir Singh wrote:
>> On 8/1/25 11:16, Mika Penttilä wrote:
>>> Hi,
>>>
>>> On 8/1/25 03:49, Balbir Singh wrote:
>>>
>>>> On 7/31/25 21:26, Zi Yan wrote:
>>>>> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
>>>>>
>>>>>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>>>>> On 7/30/25 18:58, Zi Yan wrote:
>>>>>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>>>>>
>>>>>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>>>>>
>>>>>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>>>>>>>>> entries.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>>>>>>>>> folio.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>>>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>>>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>>>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>>>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>>>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>>>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>>>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>>>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>>>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>>>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>>>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>>>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>>>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>>>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>>>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>>>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>    include/linux/huge_mm.h |   1 +
>>>>>>>>>>>>>>>>>>    include/linux/rmap.h    |   2 +
>>>>>>>>>>>>>>>>>>    include/linux/swapops.h |  17 +++
>>>>>>>>>>>>>>>>>>    mm/huge_memory.c        | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>>>>>>    mm/page_vma_mapped.c    |  13 +-
>>>>>>>>>>>>>>>>>>    mm/pgtable-generic.c    |   6 +
>>>>>>>>>>>>>>>>>>    mm/rmap.c               |  22 +++-
>>>>>>>>>>>>>>>>>>    7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>>> +    struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>>>>>> +    struct folio *new_folio;
>>>>>>>>>>>>>>>>>> +    int ret = 0;
>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>> +    /*
>>>>>>>>>>>>>>>>>> +     * Split the folio now. In the case of device
>>>>>>>>>>>>>>>>>> +     * private pages, this path is executed when
>>>>>>>>>>>>>>>>>> +     * the pmd is split and since freeze is not true
>>>>>>>>>>>>>>>>>> +     * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>>>>>> +     *
>>>>>>>>>>>>>>>>>> +     * With device private pages, deferred splits of
>>>>>>>>>>>>>>>>>> +     * folios should be handled here to prevent partial
>>>>>>>>>>>>>>>>>> +     * unmaps from causing issues later on in migration
>>>>>>>>>>>>>>>>>> +     * and fault handling flows.
>>>>>>>>>>>>>>>>>> +     */
>>>>>>>>>>>>>>>>>> +    folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>>>>>>> device side mapping.
>>>>>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>>>>>>
>>>>>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>>>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>>>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>>>>>>> at CPU side.
>>>>>>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>>>>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>>>>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>>>>>>> add code to skip CPU mapping handling code. Basically device private folios are
>>>>>>>> CPU unmapped and device mapped.
>>>>>>>>
>>>>>>>> Here are my questions on device private folios:
>>>>>>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>>>>>>      perspective? Can it be stored in a device private specific data structure?
>>>>>>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>>>>>>> common code more messy if not done that way but sure possible.
>>>>>>> And not consuming pfns (address space) at all would have benefits.
>>>>>>>
>>>>>>>> 2. When a device private folio is mapped on device, can someone other than
>>>>>>>>      the device driver manipulate it assuming core-mm just skips device private
>>>>>>>>      folios (barring the CPU access fault handling)?
>>>>>>>>
>>>>>>>> Where I am going is that can device private folios be treated as unmapped folios
>>>>>>>> by CPU and only device driver manipulates their mappings?
>>>>>>>>
>>>>>>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>>>>>>> someone could change while in device, it's just pfn.
>>>>>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>>>>>
>>>>>> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>>>>>
>>>>>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>>>>>
>>>>>>
>>>>>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
>>>>> Thanks for the clarification.
>>>>>
>>>>> So folio_mapcount() for device private folios should be treated the same
>>>>> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
>>>>> Then I wonder if the device private large folio split should go through
>>>>> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
>>>>> remap. Otherwise, how can we prevent rmap changes during the split?
>>>>>
>>>> That is true in general, the special cases I mentioned are:
>>>>
>>>> 1. split during migration (where the sizes on source/destination do not
>>>>     match) and so we need to split in the middle of migration. The entries
>>>>     there are already unmapped and hence the special handling
>>>> 2. Partial unmap case, where we need to split in the context of the unmap
>>>>     due to the issues mentioned in the patch. I expanded the folio split code
>>>>     for device private pages into its own helper, which does not need to do
>>>>     the xas/mapped/lru folio handling. During partial unmap the original
>>>>     folio does get replaced by new anon rmap ptes (split_huge_pmd_locked)
>>>>
>>>> For (2), I spent some time examining the implications of not unmapping the
>>>> folios prior to the split; in the partial unmap path, once we split the PMD,
>>>> the folios diverge. I did not run into any particular race with the tests
>>>> either.
>>>
>>> 1) is totally fine. This was in v1 and lead to Zi's split_unmapped_folio()
>>>
>>> 2) is a problem because folio is mapped. split_huge_pmd() can be reached also from other than unmap path.
>>> It is vulnerable to races by rmap. And for instance this does not look right without checking:
>>>
>>>     folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>
>>
>> I can add checks to make sure that the call does succeed.
>>
>>> You mention 2) is needed because of some later problems in fault path after pmd split. Would it be
>>> possible to split the folio at fault time then?
>>
>> So after the partial unmap, the folio ends up in a slightly strange situation: the folio is large,
>> but not mapped (since large_mapcount can be 0 after all the folio_remove_rmap_ptes() calls). Calling folio_split()
>> on a partially unmapped folio fails because folio_get_anon_vma() fails, due to folio_mapped() failing
>> on the folio_large_mapcount check. There is also additional complexity with refcounts and mapping.
> 
> I think you mean "Calling folio_split() on a *fully* unmapped folio fails ..."
> 
> A partially mapped folio still has folio_mapcount() > 0 -> folio_mapped() == true.
> 

Looking into this again at my end

>>
>>
>>> Also, didn't quite follow what kind of lock recursion did you encounter doing proper split_folio()
>>> instead?
>>>
>>>
>>
>> Splitting during partial unmap causes recursive locking issues with anon_vma when invoked from
>> split_huge_pmd_locked() path.
> 
> Yes, that's very complicated.
> 

Yes, and I want to avoid going down that path.

>> Deferred splits do not work for device private pages, due to the
>> migration requirements for fault handling.
> 
> Can you elaborate on that?
> 

If a folio is under deferred_split() and is still pending a split, and a fault is then handled on that
partially mapped folio, the code in folio_migrate_mapping() (run as part of fault handling during
migration) assumes that the folio sizes are the same (via the reference and mapcount checks).
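
Roughly the kind of check I mean (paraphrasing the !mapping path of
folio_migrate_mapping() from memory, not quoting it):

	expected = folio_expected_ref_count(folio) + extra_count + 1;
	if (!folio_ref_freeze(folio, expected))
		return -EAGAIN;	/* refcount/mapcount did not match */

The expected count is derived from the folio as a whole, which is where the
"sizes must be the same" assumption comes from.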

Balbir



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
  2025-08-01  8:01                                     ` Balbir Singh
@ 2025-08-01  8:46                                       ` David Hildenbrand
  2025-08-01 11:10                                         ` Zi Yan
  0 siblings, 1 reply; 71+ messages in thread
From: David Hildenbrand @ 2025-08-01  8:46 UTC (permalink / raw)
  To: Balbir Singh, Mika Penttilä, Zi Yan
  Cc: linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
	Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
	Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
	Ralph Campbell

On 01.08.25 10:01, Balbir Singh wrote:
> On 8/1/25 17:04, David Hildenbrand wrote:
>> On 01.08.25 06:44, Balbir Singh wrote:
>>> On 8/1/25 11:16, Mika Penttilä wrote:
>>>> Hi,
>>>>
>>>> On 8/1/25 03:49, Balbir Singh wrote:
>>>>
>>>>> On 7/31/25 21:26, Zi Yan wrote:
>>>>>> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
>>>>>>
>>>>>>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>>>>>> On 7/30/25 18:58, Zi Yan wrote:
>>>>>>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>>>>>>
>>>>>>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>>>>>>>>>> entries.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>>>>>>>>>> folio.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>>>>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>>>>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>>>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>>>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>>>>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>>>>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>>>>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>>>>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>>>>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>>>>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>>>>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>>>>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>>>>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>>>>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>>>>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>>>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>>>>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>>>>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>>>>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>>     include/linux/huge_mm.h |   1 +
>>>>>>>>>>>>>>>>>>>     include/linux/rmap.h    |   2 +
>>>>>>>>>>>>>>>>>>>     include/linux/swapops.h |  17 +++
>>>>>>>>>>>>>>>>>>>     mm/huge_memory.c        | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>>>>>>>     mm/page_vma_mapped.c    |  13 +-
>>>>>>>>>>>>>>>>>>>     mm/pgtable-generic.c    |   6 +
>>>>>>>>>>>>>>>>>>>     mm/rmap.c               |  22 +++-
>>>>>>>>>>>>>>>>>>>     7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>>>> +    struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>>>>>>> +    struct folio *new_folio;
>>>>>>>>>>>>>>>>>>> +    int ret = 0;
>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>> +    /*
>>>>>>>>>>>>>>>>>>> +     * Split the folio now. In the case of device
>>>>>>>>>>>>>>>>>>> +     * private pages, this path is executed when
>>>>>>>>>>>>>>>>>>> +     * the pmd is split and since freeze is not true
>>>>>>>>>>>>>>>>>>> +     * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>>>>>>> +     *
>>>>>>>>>>>>>>>>>>> +     * With device private pages, deferred splits of
>>>>>>>>>>>>>>>>>>> +     * folios should be handled here to prevent partial
>>>>>>>>>>>>>>>>>>> +     * unmaps from causing issues later on in migration
>>>>>>>>>>>>>>>>>>> +     * and fault handling flows.
>>>>>>>>>>>>>>>>>>> +     */
>>>>>>>>>>>>>>>>>>> +    folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>>>>>>>> device side mapping.
>>>>>>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>>>>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>>>>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>>>>>>>> at CPU side.
>>>>>>>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>>>>>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>>>>>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>>>>>>>> add code to skip CPU mapping handling code. Basically device private folios are
>>>>>>>>> CPU unmapped and device mapped.
>>>>>>>>>
>>>>>>>>> Here are my questions on device private folios:
>>>>>>>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>>>>>>>       perspective? Can it be stored in a device private specific data structure?
>>>>>>>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>>>>>>>> common code more messy if not done that way but sure possible.
>>>>>>>> And not consuming pfns (address space) at all would have benefits.
>>>>>>>>
>>>>>>>>> 2. When a device private folio is mapped on device, can someone other than
>>>>>>>>>       the device driver manipulate it assuming core-mm just skips device private
>>>>>>>>>       folios (barring the CPU access fault handling)?
>>>>>>>>>
>>>>>>>>> Where I am going is that can device private folios be treated as unmapped folios
>>>>>>>>> by CPU and only device driver manipulates their mappings?
>>>>>>>>>
>>>>>>>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>>>>>>>> someone could change while in device, it's just pfn.
>>>>>>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>>>>>>
>>>>>>> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>>>>>>
>>>>>>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>>>>>>
>>>>>>>
>>>>>>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
>>>>>> Thanks for the clarification.
>>>>>>
>>>>>> So folio_mapcount() for device private folios should be treated the same
>>>>>> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
>>>>>> Then I wonder if the device private large folio split should go through
>>>>>> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
>>>>>> remap. Otherwise, how can we prevent rmap changes during the split?
>>>>>>
>>>>> That is true in general, the special cases I mentioned are:
>>>>>
>>>>> 1. split during migration (where the sizes on source/destination do not
>>>>>      match) and so we need to split in the middle of migration. The entries
>>>>>      there are already unmapped and hence the special handling
>>>>> 2. Partial unmap case, where we need to split in the context of the unmap
>>>>>      due to the issues mentioned in the patch. I expanded the folio split code
>>>>>      for device private pages into its own helper, which does not need to do
>>>>>      the xas/mapped/lru folio handling. During partial unmap the original
>>>>>      folio does get replaced by new anon rmap ptes (split_huge_pmd_locked)
>>>>>
>>>>> For (2), I spent some time examining the implications of not unmapping the
>>>>> folios prior to the split; in the partial unmap path, once we split the PMD,
>>>>> the folios diverge. I did not run into any particular race with the tests
>>>>> either.
>>>>
>>>> 1) is totally fine. This was in v1 and lead to Zi's split_unmapped_folio()
>>>>
>>>> 2) is a problem because folio is mapped. split_huge_pmd() can be reached also from other than unmap path.
>>>> It is vulnerable to races by rmap. And for instance this does not look right without checking:
>>>>
>>>>      folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>
>>>
>>> I can add checks to make sure that the call does succeed.
>>>
>>>> You mention 2) is needed because of some later problems in fault path after pmd split. Would it be
>>>> possible to split the folio at fault time then?
>>>
>>> So after the partial unmap, the folio ends up in a slightly strange situation: the folio is large,
>>> but not mapped (since large_mapcount can be 0 after all the folio_remove_rmap_ptes() calls). Calling folio_split()
>>> on a partially unmapped folio fails because folio_get_anon_vma() fails, due to folio_mapped() failing
>>> on the folio_large_mapcount check. There is also additional complexity with refcounts and mapping.
>>
>> I think you mean "Calling folio_split() on a *fully* unmapped folio fails ..."
>>
>> A partially mapped folio still has folio_mapcount() > 0 -> folio_mapped() == true.
>>
> 
> Looking into this again at my end
> 
>>>
>>>
>>>> Also, didn't quite follow what kind of lock recursion did you encounter doing proper split_folio()
>>>> instead?
>>>>
>>>>
>>>
>>> Splitting during partial unmap causes recursive locking issues with anon_vma when invoked from
>>> split_huge_pmd_locked() path.
>>
>> Yes, that's very complicated.
>>
> 
> Yes and I want to avoid going down that path.
> 
>>> Deferred splits do not work for device private pages, due to the
>>> migration requirements for fault handling.
>>
>> Can you elaborate on that?
>>
> 
> If a folio is under deferred_split() and is still pending a split, and a fault is then handled on that
> partially mapped folio, the code in folio_migrate_mapping() (run as part of fault handling during
> migration) assumes that the folio sizes are the same (via the reference and mapcount checks).

If you hit a partially-mapped folio, instead of migrating, you would
actually want to split and then migrate, I assume.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
  2025-08-01  8:46                                       ` David Hildenbrand
@ 2025-08-01 11:10                                         ` Zi Yan
  2025-08-01 12:20                                           ` Mika Penttilä
  0 siblings, 1 reply; 71+ messages in thread
From: Zi Yan @ 2025-08-01 11:10 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Balbir Singh, Mika Penttilä, linux-mm, linux-kernel,
	Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Jérôme Glisse, Shuah Khan, Barry Song,
	Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang,
	Jane Chu, Alistair Popple, Donet Tom, Matthew Brost,
	Francois Dugast, Ralph Campbell

On 1 Aug 2025, at 4:46, David Hildenbrand wrote:

> On 01.08.25 10:01, Balbir Singh wrote:
>> On 8/1/25 17:04, David Hildenbrand wrote:
>>> On 01.08.25 06:44, Balbir Singh wrote:
>>>> On 8/1/25 11:16, Mika Penttilä wrote:
>>>>> Hi,
>>>>>
>>>>> On 8/1/25 03:49, Balbir Singh wrote:
>>>>>
>>>>>> On 7/31/25 21:26, Zi Yan wrote:
>>>>>>> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
>>>>>>>
>>>>>>>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>>>>>>> On 7/30/25 18:58, Zi Yan wrote:
>>>>>>>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>>>>>>>
>>>>>>>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>>>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>>>>>>>>>>> entries.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>>>>>>>>>>> folio.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>>>>>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>>>>>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>>>>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>>>>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>>>>>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>>>>>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>>>>>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>>>>>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>>>>>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>>>>>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>>>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>>>>>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>>>>>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>>>>>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>>>>>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>>>>>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>>>>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>>>>>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>>>>>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>>>>>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>>>     include/linux/huge_mm.h |   1 +
>>>>>>>>>>>>>>>>>>>>     include/linux/rmap.h    |   2 +
>>>>>>>>>>>>>>>>>>>>     include/linux/swapops.h |  17 +++
>>>>>>>>>>>>>>>>>>>>     mm/huge_memory.c        | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>>>>>>>>     mm/page_vma_mapped.c    |  13 +-
>>>>>>>>>>>>>>>>>>>>     mm/pgtable-generic.c    |   6 +
>>>>>>>>>>>>>>>>>>>>     mm/rmap.c               |  22 +++-
>>>>>>>>>>>>>>>>>>>>     7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>>>>> +    struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>>>>>>>> +    struct folio *new_folio;
>>>>>>>>>>>>>>>>>>>> +    int ret = 0;
>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>> +    /*
>>>>>>>>>>>>>>>>>>>> +     * Split the folio now. In the case of device
>>>>>>>>>>>>>>>>>>>> +     * private pages, this path is executed when
>>>>>>>>>>>>>>>>>>>> +     * the pmd is split and since freeze is not true
>>>>>>>>>>>>>>>>>>>> +     * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>>>>>>>> +     *
>>>>>>>>>>>>>>>>>>>> +     * With device private pages, deferred splits of
>>>>>>>>>>>>>>>>>>>> +     * folios should be handled here to prevent partial
>>>>>>>>>>>>>>>>>>>> +     * unmaps from causing issues later on in migration
>>>>>>>>>>>>>>>>>>>> +     * and fault handling flows.
>>>>>>>>>>>>>>>>>>>> +     */
>>>>>>>>>>>>>>>>>>>> +    folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>>>>>>>>> device side mapping.
>>>>>>>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>>>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>>>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>>>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>>>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>>>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>>>>>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>>>>>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>>>>>>>>> at CPU side.
>>>>>>>>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>>>>>>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>>>>>>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>>>>>>>>> add code to skip CPU mapping handling code. Basically device private folios are
>>>>>>>>>> CPU unmapped and device mapped.
>>>>>>>>>>
>>>>>>>>>> Here are my questions on device private folios:
>>>>>>>>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>>>>>>>>       perspective? Can it be stored in a device private specific data structure?
>>>>>>>>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>>>>>>>>> common code more messy if not done that way but sure possible.
>>>>>>>>> And not consuming pfns (address space) at all would have benefits.
>>>>>>>>>
>>>>>>>>>> 2. When a device private folio is mapped on device, can someone other than
>>>>>>>>>>       the device driver manipulate it assuming core-mm just skips device private
>>>>>>>>>>       folios (barring the CPU access fault handling)?
>>>>>>>>>>
>>>>>>>>>> Where I am going is that can device private folios be treated as unmapped folios
>>>>>>>>>> by CPU and only device driver manipulates their mappings?
>>>>>>>>>>
>>>>>>>>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>>>>>>>>> someone could change while in device, it's just pfn.
>>>>>>>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>>>>>>>
>>>>>>>> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>>>>>>>
>>>>>>>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>>>>>>>
>>>>>>>>
>>>>>>>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
>>>>>>> Thanks for the clarification.
>>>>>>>
>>>>>>> So folio_mapcount() for device private folios should be treated the same
>>>>>>> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
>>>>>>> Then I wonder if the device private large folio split should go through
>>>>>>> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
>>>>>>> remap. Otherwise, how can we prevent rmap changes during the split?
>>>>>>>
>>>>>> That is true in general, the special cases I mentioned are:
>>>>>>
>>>>>> 1. split during migration (where the sizes on source/destination do not
>>>>>>      match) and so we need to split in the middle of migration. The entries
>>>>>>      there are already unmapped and hence the special handling
>>>>>> 2. Partial unmap case, where we need to split in the context of the unmap
>>>>>>      due to the issues mentioned in the patch. I expanded the folio split code
>>>>>>      for device private pages into its own helper, which does not need to do
>>>>>>      the xas/mapped/lru folio handling. During partial unmap the original
>>>>>>      folio does get replaced by new anon rmap ptes (split_huge_pmd_locked)
>>>>>>
>>>>>> For (2), I spent some time examining the implications of not unmapping the
>>>>>> folios prior to the split; in the partial unmap path, once we split the PMD,
>>>>>> the folios diverge. I did not run into any particular race with the tests
>>>>>> either.
>>>>>
>>>>> 1) is totally fine. This was in v1 and lead to Zi's split_unmapped_folio()
>>>>>
>>>>> 2) is a problem because folio is mapped. split_huge_pmd() can be reached also from other than unmap path.
>>>>> It is vulnerable to races by rmap. And for instance this does not look right without checking:
>>>>>
>>>>>      folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>
>>>>
>>>> I can add checks to make sure that the call does succeed.
>>>>
>>>>> You mention 2) is needed because of some later problems in fault path after pmd split. Would it be
>>>>> possible to split the folio at fault time then?
>>>>
>>>> So after the partial unmap, the folio ends up in a slightly strange situation: the folio is large,
>>>> but not mapped (since large_mapcount can be 0 after all the folio_remove_rmap_ptes() calls). Calling folio_split()
>>>> on a partially unmapped folio fails because folio_get_anon_vma() fails, due to folio_mapped() failing
>>>> on the folio_large_mapcount check. There is also additional complexity with refcounts and mapping.
>>>
>>> I think you mean "Calling folio_split() on a *fully* unmapped folio fails ..."
>>>
>>> A partially mapped folio still has folio_mapcount() > 0 -> folio_mapped() == true.
>>>
>>
>> Looking into this again at my end
>>
>>>>
>>>>
>>>>> Also, didn't quite follow what kind of lock recursion did you encounter doing proper split_folio()
>>>>> instead?
>>>>>
>>>>>
>>>>
>>>> Splitting during partial unmap causes recursive locking issues with anon_vma when invoked from
>>>> split_huge_pmd_locked() path.
>>>
>>> Yes, that's very complicated.
>>>
>>
>> Yes and I want to avoid going down that path.
>>
>>>> Deferred splits do not work for device private pages, due to the
>>>> migration requirements for fault handling.
>>>
>>> Can you elaborate on that?
>>>
>>
>> If a folio is under deferred_split() and is still pending a split, and a fault is then handled on the partially
>> mapped folio, the code in folio_migrate_mapping() assumes, as part of fault handling during migration,
>> that the folio sizes are the same (via checks on the reference count and mapcount)
>
> If you hit a partially-mapped folio, instead of migrating, you would actually want to split and then migrate I assume.

Yes, that is exactly what migrate_pages() does, and if the split fails, the
migration fails too. Device private folios should probably do the same thing,
assuming that splitting a device private folio always succeeds.
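
Something along these lines is what I mean -- a minimal sketch only, not the
actual migrate_pages() code; try_migrate_one_folio() is a made-up name and the
caller is assumed to already hold a reference on the folio:

/*
 * Minimal sketch of the split-then-migrate pattern: split a partially
 * mapped large folio before migrating it, and fail the migration if
 * the split fails.
 */
static int try_migrate_one_folio(struct folio *folio)
{
	int ret = 0;

	if (folio_test_large(folio)) {
		folio_lock(folio);
		ret = split_folio(folio);	/* split down to order-0 folios */
		folio_unlock(folio);
		if (ret)
			return ret;		/* split failed -> migration fails */
	}

	/* ... migrate the (now order-0) folio(s) as usual ... */
	return ret;
}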

Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
  2025-08-01 11:10                                         ` Zi Yan
@ 2025-08-01 12:20                                           ` Mika Penttilä
  2025-08-01 12:28                                             ` Zi Yan
  0 siblings, 1 reply; 71+ messages in thread
From: Mika Penttilä @ 2025-08-01 12:20 UTC (permalink / raw)
  To: Zi Yan, David Hildenbrand
  Cc: Balbir Singh, linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
	Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
	Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
	Ralph Campbell


On 8/1/25 14:10, Zi Yan wrote:
> On 1 Aug 2025, at 4:46, David Hildenbrand wrote:
>
>> On 01.08.25 10:01, Balbir Singh wrote:
>>> On 8/1/25 17:04, David Hildenbrand wrote:
>>>> On 01.08.25 06:44, Balbir Singh wrote:
>>>>> On 8/1/25 11:16, Mika Penttilä wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 8/1/25 03:49, Balbir Singh wrote:
>>>>>>
>>>>>>> On 7/31/25 21:26, Zi Yan wrote:
>>>>>>>> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
>>>>>>>>
>>>>>>>>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>>>>>>>> On 7/30/25 18:58, Zi Yan wrote:
>>>>>>>>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>>>>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>>>>>>>>>>>> entries.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>>>>>>>>>>>> folio.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>>>>>>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>>>>>>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>>>>>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>>>>>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>>>>>>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>>>>>>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>>>>>>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>>>>>>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>>>>>>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>>>>>>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>>>>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>>>>>>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>>>>>>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>>>>>>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>>>>>>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>>>>>>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>>>>>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>>>>>>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>>>>>>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>>>>>>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>>>>     include/linux/huge_mm.h |   1 +
>>>>>>>>>>>>>>>>>>>>>     include/linux/rmap.h    |   2 +
>>>>>>>>>>>>>>>>>>>>>     include/linux/swapops.h |  17 +++
>>>>>>>>>>>>>>>>>>>>>     mm/huge_memory.c        | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>>>>>>>>>     mm/page_vma_mapped.c    |  13 +-
>>>>>>>>>>>>>>>>>>>>>     mm/pgtable-generic.c    |   6 +
>>>>>>>>>>>>>>>>>>>>>     mm/rmap.c               |  22 +++-
>>>>>>>>>>>>>>>>>>>>>     7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>>>>>> +    struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>>>>>>>>> +    struct folio *new_folio;
>>>>>>>>>>>>>>>>>>>>> +    int ret = 0;
>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>> +    /*
>>>>>>>>>>>>>>>>>>>>> +     * Split the folio now. In the case of device
>>>>>>>>>>>>>>>>>>>>> +     * private pages, this path is executed when
>>>>>>>>>>>>>>>>>>>>> +     * the pmd is split and since freeze is not true
>>>>>>>>>>>>>>>>>>>>> +     * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>>>>>>>>> +     *
>>>>>>>>>>>>>>>>>>>>> +     * With device private pages, deferred splits of
>>>>>>>>>>>>>>>>>>>>> +     * folios should be handled here to prevent partial
>>>>>>>>>>>>>>>>>>>>> +     * unmaps from causing issues later on in migration
>>>>>>>>>>>>>>>>>>>>> +     * and fault handling flows.
>>>>>>>>>>>>>>>>>>>>> +     */
>>>>>>>>>>>>>>>>>>>>> +    folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>>>>>>>>>> device side mapping.
>>>>>>>>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>>>>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>>>>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>>>>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>>>>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>>>>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>>>>>>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>>>>>>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>>>>>>>>>> at CPU side.
>>>>>>>>>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>>>>>>>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>>>>>>>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>>>>>>>>>> add code to skip CPU mapping handling code. Basically device private folios are
>>>>>>>>>>> CPU unmapped and device mapped.
>>>>>>>>>>>
>>>>>>>>>>> Here are my questions on device private folios:
>>>>>>>>>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>>>>>>>>>       perspective? Can it be stored in a device private specific data structure?
>>>>>>>>>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>>>>>>>>>> common code more messy if not done that way but sure possible.
>>>>>>>>>> And not consuming pfns (address space) at all would have benefits.
>>>>>>>>>>
>>>>>>>>>>> 2. When a device private folio is mapped on device, can someone other than
>>>>>>>>>>>       the device driver manipulate it assuming core-mm just skips device private
>>>>>>>>>>>       folios (barring the CPU access fault handling)?
>>>>>>>>>>>
>>>>>>>>>>> Where I am going is that can device private folios be treated as unmapped folios
>>>>>>>>>>> by CPU and only device driver manipulates their mappings?
>>>>>>>>>>>
>>>>>>>>>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>>>>>>>>>> someone could change while in device, it's just pfn.
>>>>>>>>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>>>>>>>>
>>>>>>>>> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>>>>>>>>
>>>>>>>>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
>>>>>>>> Thanks for the clarification.
>>>>>>>>
>>>>>>>> So folio_mapcount() for device private folios should be treated the same
>>>>>>>> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
>>>>>>>> Then I wonder if the device private large folio split should go through
>>>>>>>> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
>>>>>>>> remap. Otherwise, how can we prevent rmap changes during the split?
>>>>>>>>
>>>>>>> That is true in general, the special cases I mentioned are:
>>>>>>>
>>>>>>> 1. split during migration (where the sizes on source/destination do not
>>>>>>>      match) and so we need to split in the middle of migration. The entries
>>>>>>>      there are already unmapped and hence the special handling
>>>>>>> 2. Partial unmap case, where we need to split in the context of the unmap
>>>>>>>      due to the issues mentioned in the patch. The folio split code for
>>>>>>>      device private pages has been expanded into its own helper, which does not
>>>>>>>      need to do the xas/mapped/lru folio handling. During partial unmap the
>>>>>>>      original folio does get replaced by new anon rmap ptes (split_huge_pmd_locked)
>>>>>>>
>>>>>>> For (2), I spent some time examining the implications of not unmapping the
>>>>>>> folios prior to split and in the partial unmap path, once we split the PMD
>>>>>>> the folios diverge. I did not run into any particular race either with the
>>>>>>> tests.
>>>>>> 1) is totally fine. This was in v1 and led to Zi's split_unmapped_folio()
>>>>>>
>>>>>> 2) is a problem because folio is mapped. split_huge_pmd() can be reached also from other than unmap path.
>>>>>> It is vulnerable to races by rmap. And for instance this does not look right without checking:
>>>>>>
>>>>>>      folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>
>>>>> I can add checks to make sure that the call does succeed.
>>>>>
>>>>>> You mention 2) is needed because of some later problems in fault path after pmd split. Would it be
>>>>>> possible to split the folio at fault time then?
>>>>> So after the partial unmap, the folio ends up in a little strange situation, the folio is large,
>>>>> but not mapped (since large_mapcount can be 0, after all the folio_rmap_remove_ptes). Calling folio_split()
>>>>> on partially unmapped fails because folio_get_anon_vma() fails due to the folio_mapped() failures
>>>>> related to folio_large_mapcount. There is also additional complexity with ref counts and mapping.
>>>> I think you mean "Calling folio_split() on a *fully* unmapped folio fails ..."
>>>>
>>>> A partially mapped folio still has folio_mapcount() > 0 -> folio_mapped() == true.
>>>>
>>> Looking into this again at my end
>>>
>>>>>
>>>>>> Also, didn't quite follow what kind of lock recursion did you encounter doing proper split_folio()
>>>>>> instead?
>>>>>>
>>>>>>
>>>>> Splitting during partial unmap causes recursive locking issues with anon_vma when invoked from
>>>>> split_huge_pmd_locked() path.
>>>> Yes, that's very complicated.
>>>>
>>> Yes and I want to avoid going down that path.
>>>
>>>>> Deferred splits do not work for device private pages, due to the
>>>>> migration requirements for fault handling.
>>>> Can you elaborate on that?
>>>>
>>> If a folio is under deferred_split() and is still pending a split, and a fault is then handled on the partially
>>> mapped folio, the code in folio_migrate_mapping() assumes, as part of fault handling during migration,
>>> that the folio sizes are the same (via checks on the reference count and mapcount)
>> If you hit a partially-mapped folio, instead of migrating, you would actually want to split and then migrate I assume.
> Yes, that is exactly what migrate_pages() does, and if the split fails, the
> migration fails too. Device private folios should probably do the same thing,
> assuming that splitting a device private folio always succeeds.

Hmm, afaics the normal folio_split() wants to use RMP_USE_SHARED_ZEROPAGE when splitting and remapping
device private pages; that can't work.

>
> Best Regards,
> Yan, Zi
>
--Mika


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
  2025-08-01 12:20                                           ` Mika Penttilä
@ 2025-08-01 12:28                                             ` Zi Yan
  2025-08-02  1:17                                               ` Balbir Singh
  2025-08-02 10:37                                               ` Balbir Singh
  0 siblings, 2 replies; 71+ messages in thread
From: Zi Yan @ 2025-08-01 12:28 UTC (permalink / raw)
  To: Mika Penttilä
  Cc: David Hildenbrand, Balbir Singh, linux-mm, linux-kernel,
	Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Jérôme Glisse, Shuah Khan, Barry Song,
	Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang,
	Jane Chu, Alistair Popple, Donet Tom, Matthew Brost,
	Francois Dugast, Ralph Campbell

On 1 Aug 2025, at 8:20, Mika Penttilä wrote:

> On 8/1/25 14:10, Zi Yan wrote:
>> On 1 Aug 2025, at 4:46, David Hildenbrand wrote:
>>
>>> On 01.08.25 10:01, Balbir Singh wrote:
>>>> On 8/1/25 17:04, David Hildenbrand wrote:
>>>>> On 01.08.25 06:44, Balbir Singh wrote:
>>>>>> On 8/1/25 11:16, Mika Penttilä wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 8/1/25 03:49, Balbir Singh wrote:
>>>>>>>
>>>>>>>> On 7/31/25 21:26, Zi Yan wrote:
>>>>>>>>> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
>>>>>>>>>
>>>>>>>>>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>>>>>>>>> On 7/30/25 18:58, Zi Yan wrote:
>>>>>>>>>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>>>>>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>>>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>>>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>>>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>>>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>>>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>>>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>>>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>>>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>>>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>>>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>>>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>>>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>>>>>>>>>>>>> entries.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>>>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>>>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>>>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>>>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>>>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>>>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>>>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>>>>>>>>>>>>> folio.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>>>>>>>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>>>>>>>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>>>>>>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>>>>>>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>>>>>>>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>>>>>>>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>>>>>>>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>>>>>>>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>>>>>>>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>>>>>>>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>>>>>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>>>>>>>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>>>>>>>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>>>>>>>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>>>>>>>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>>>>>>>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>>>>>>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>>>>>>>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>>>>>>>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>>>>>>>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>>>>>     include/linux/huge_mm.h |   1 +
>>>>>>>>>>>>>>>>>>>>>>     include/linux/rmap.h    |   2 +
>>>>>>>>>>>>>>>>>>>>>>     include/linux/swapops.h |  17 +++
>>>>>>>>>>>>>>>>>>>>>>     mm/huge_memory.c        | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>>>>>>>>>>     mm/page_vma_mapped.c    |  13 +-
>>>>>>>>>>>>>>>>>>>>>>     mm/pgtable-generic.c    |   6 +
>>>>>>>>>>>>>>>>>>>>>>     mm/rmap.c               |  22 +++-
>>>>>>>>>>>>>>>>>>>>>>     7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>>>>>>> +    struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>>>>>>>>>> +    struct folio *new_folio;
>>>>>>>>>>>>>>>>>>>>>> +    int ret = 0;
>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>> +    /*
>>>>>>>>>>>>>>>>>>>>>> +     * Split the folio now. In the case of device
>>>>>>>>>>>>>>>>>>>>>> +     * private pages, this path is executed when
>>>>>>>>>>>>>>>>>>>>>> +     * the pmd is split and since freeze is not true
>>>>>>>>>>>>>>>>>>>>>> +     * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>>>>>>>>>> +     *
>>>>>>>>>>>>>>>>>>>>>> +     * With device private pages, deferred splits of
>>>>>>>>>>>>>>>>>>>>>> +     * folios should be handled here to prevent partial
>>>>>>>>>>>>>>>>>>>>>> +     * unmaps from causing issues later on in migration
>>>>>>>>>>>>>>>>>>>>>> +     * and fault handling flows.
>>>>>>>>>>>>>>>>>>>>>> +     */
>>>>>>>>>>>>>>>>>>>>>> +    folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>>>>>>>>>>> device side mapping.
>>>>>>>>>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>>>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>>>>>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>>>>>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>>>>>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>>>>>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>>>>>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>>>>>>>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>>>>>>>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>>>>>>>>>>> at CPU side.
>>>>>>>>>>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>>>>>>>>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>>>>>>>>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>>>>>>>>>>> add code to skip CPU mapping handling code. Basically device private folios are
>>>>>>>>>>>> CPU unmapped and device mapped.
>>>>>>>>>>>>
>>>>>>>>>>>> Here are my questions on device private folios:
>>>>>>>>>>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>>>>>>>>>>       perspective? Can it be stored in a device private specific data structure?
>>>>>>>>>>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>>>>>>>>>>> common code more messy if not done that way but sure possible.
>>>>>>>>>>> And not consuming pfns (address space) at all would have benefits.
>>>>>>>>>>>
>>>>>>>>>>>> 2. When a device private folio is mapped on device, can someone other than
>>>>>>>>>>>>       the device driver manipulate it assuming core-mm just skips device private
>>>>>>>>>>>>       folios (barring the CPU access fault handling)?
>>>>>>>>>>>>
>>>>>>>>>>>> Where I am going is that can device private folios be treated as unmapped folios
>>>>>>>>>>>> by CPU and only device driver manipulates their mappings?
>>>>>>>>>>>>
>>>>>>>>>>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>>>>>>>>>>> someone could change while in device, it's just pfn.
>>>>>>>>>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>>>>>>>>>
>>>>>>>>>> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>>>>>>>>>
>>>>>>>>>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
>>>>>>>>> Thanks for the clarification.
>>>>>>>>>
>>>>>>>>> So folio_mapcount() for device private folios should be treated the same
>>>>>>>>> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
>>>>>>>>> Then I wonder if the device private large folio split should go through
>>>>>>>>> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
>>>>>>>>> remap. Otherwise, how can we prevent rmap changes during the split?
>>>>>>>>>
>>>>>>>> That is true in general, the special cases I mentioned are:
>>>>>>>>
>>>>>>>> 1. split during migration (where the sizes on source/destination do not
>>>>>>>>      match) and so we need to split in the middle of migration. The entries
>>>>>>>>      there are already unmapped and hence the special handling
>>>>>>>> 2. Partial unmap case, where we need to split in the context of the unmap
>>>>>>>>      due to the issues mentioned in the patch. The folio split code for
>>>>>>>>      device private pages has been expanded into its own helper, which does not
>>>>>>>>      need to do the xas/mapped/lru folio handling. During partial unmap the
>>>>>>>>      original folio does get replaced by new anon rmap ptes (split_huge_pmd_locked)
>>>>>>>>
>>>>>>>> For (2), I spent some time examining the implications of not unmapping the
>>>>>>>> folios prior to split and in the partial unmap path, once we split the PMD
>>>>>>>> the folios diverge. I did not run into any particular race either with the
>>>>>>>> tests.
>>>>>>> 1) is totally fine. This was in v1 and led to Zi's split_unmapped_folio()
>>>>>>>
>>>>>>> 2) is a problem because folio is mapped. split_huge_pmd() can be reached also from other than unmap path.
>>>>>>> It is vulnerable to races by rmap. And for instance this does not look right without checking:
>>>>>>>
>>>>>>>      folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>
>>>>>> I can add checks to make sure that the call does succeed.
>>>>>>
>>>>>>> You mention 2) is needed because of some later problems in fault path after pmd split. Would it be
>>>>>>> possible to split the folio at fault time then?
>>>>>> So after the partial unmap, the folio ends up in a little strange situation, the folio is large,
>>>>>> but not mapped (since large_mapcount can be 0, after all the folio_rmap_remove_ptes). Calling folio_split()
>>>>>> on partially unmapped fails because folio_get_anon_vma() fails due to the folio_mapped() failures
>>>>>> related to folio_large_mapcount. There is also additional complexity with ref counts and mapping.
>>>>> I think you mean "Calling folio_split() on a *fully* unmapped folio fails ..."
>>>>>
>>>>> A partially mapped folio still has folio_mapcount() > 0 -> folio_mapped() == true.
>>>>>
>>>> Looking into this again at my end
>>>>
>>>>>>
>>>>>>> Also, didn't quite follow what kind of lock recursion did you encounter doing proper split_folio()
>>>>>>> instead?
>>>>>>>
>>>>>>>
>>>>>> Splitting during partial unmap causes recursive locking issues with anon_vma when invoked from
>>>>>> split_huge_pmd_locked() path.
>>>>> Yes, that's very complicated.
>>>>>
>>>> Yes and I want to avoid going down that path.
>>>>
>>>>>> Deferred splits do not work for device private pages, due to the
>>>>>> migration requirements for fault handling.
>>>>> Can you elaborate on that?
>>>>>
>>>> If a folio is under deferred_split() and is still pending a split, and a fault is then handled on the partially
>>>> mapped folio, the code in folio_migrate_mapping() assumes, as part of fault handling during migration,
>>>> that the folio sizes are the same (via checks on the reference count and mapcount)
>>> If you hit a partially-mapped folio, instead of migrating, you would actually want to split and then migrate I assume.
>> Yes, that is exactly what migrate_pages() does, and if the split fails, the
>> migration fails too. Device private folios should probably do the same thing,
>> assuming that splitting a device private folio always succeeds.
>
> Hmm, afaics the normal folio_split() wants to use RMP_USE_SHARED_ZEROPAGE when splitting and remapping
> device private pages; that can't work.

It is fine to exclude device private folios from using RMP_USE_SHARED_ZEROPAGE, like:

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2b4ea5a2ce7d..b97dfd3521a9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3858,7 +3858,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
        if (nr_shmem_dropped)
                shmem_uncharge(mapping->host, nr_shmem_dropped);

-       if (!ret && is_anon)
+       if (!ret && is_anon && !folio_is_device_private(folio))
                remap_flags = RMP_USE_SHARED_ZEROPAGE;
        remap_page(folio, 1 << order, remap_flags);

Or it can be done in remove_migration_pte().
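
A sketch of that alternative -- assuming the try_to_map_unused_to_zeropage()
call site in remove_migration_pte() that RMP_USE_SHARED_ZEROPAGE feeds, with
the surrounding context abbreviated; only the folio_is_device_private() check
is new:

		/*
		 * Never try to fold device private folios onto the shared
		 * zeropage: their contents live in device memory and cannot
		 * be compared against the CPU zero page at remap time.
		 */
		if (rmap_walk_arg->map_unused_to_zeropage &&
		    !folio_is_device_private(folio) &&
		    try_to_map_unused_to_zeropage(&pvmw, folio, idx))
			continue;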

Best Regards,
Yan, Zi

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
  2025-08-01 12:28                                             ` Zi Yan
@ 2025-08-02  1:17                                               ` Balbir Singh
  2025-08-02 10:37                                               ` Balbir Singh
  1 sibling, 0 replies; 71+ messages in thread
From: Balbir Singh @ 2025-08-02  1:17 UTC (permalink / raw)
  To: Zi Yan, Mika Penttilä
  Cc: David Hildenbrand, linux-mm, linux-kernel, Karol Herbst,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
	Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
	Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
	Ralph Campbell

On 8/1/25 22:28, Zi Yan wrote:
> On 1 Aug 2025, at 8:20, Mika Penttilä wrote:
> 
>> On 8/1/25 14:10, Zi Yan wrote:
>>> On 1 Aug 2025, at 4:46, David Hildenbrand wrote:
>>>
>>>> On 01.08.25 10:01, Balbir Singh wrote:
>>>>> On 8/1/25 17:04, David Hildenbrand wrote:
>>>>>> On 01.08.25 06:44, Balbir Singh wrote:
>>>>>>> On 8/1/25 11:16, Mika Penttilä wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On 8/1/25 03:49, Balbir Singh wrote:
>>>>>>>>
>>>>>>>>> On 7/31/25 21:26, Zi Yan wrote:
>>>>>>>>>> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
>>>>>>>>>>
>>>>>>>>>>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>>>>>>>>>> On 7/30/25 18:58, Zi Yan wrote:
>>>>>>>>>>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>>>>>>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>>>>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>>>>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>>>>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>>>>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>>>>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>>>>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>>>>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>>>>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>>>>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>>>>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>>>>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>>>>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>>>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>>>>>>>>>>>>>> entries.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>>>>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>>>>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>>>>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>>>>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>>>>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>>>>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>>>>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>>>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>>>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>>>>>>>>>>>>>> folio.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>>>>>>>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>>>>>>>>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>>>>>>>>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>>>>>>     include/linux/huge_mm.h |   1 +
>>>>>>>>>>>>>>>>>>>>>>>     include/linux/rmap.h    |   2 +
>>>>>>>>>>>>>>>>>>>>>>>     include/linux/swapops.h |  17 +++
>>>>>>>>>>>>>>>>>>>>>>>     mm/huge_memory.c        | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>>>>>>>>>>>     mm/page_vma_mapped.c    |  13 +-
>>>>>>>>>>>>>>>>>>>>>>>     mm/pgtable-generic.c    |   6 +
>>>>>>>>>>>>>>>>>>>>>>>     mm/rmap.c               |  22 +++-
>>>>>>>>>>>>>>>>>>>>>>>     7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>>>>>>>> +    struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>>>>>>>>>>> +    struct folio *new_folio;
>>>>>>>>>>>>>>>>>>>>>>> +    int ret = 0;
>>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>>> +    /*
>>>>>>>>>>>>>>>>>>>>>>> +     * Split the folio now. In the case of device
>>>>>>>>>>>>>>>>>>>>>>> +     * private pages, this path is executed when
>>>>>>>>>>>>>>>>>>>>>>> +     * the pmd is split and since freeze is not true
>>>>>>>>>>>>>>>>>>>>>>> +     * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>>>>>>>>>>> +     *
>>>>>>>>>>>>>>>>>>>>>>> +     * With device private pages, deferred splits of
>>>>>>>>>>>>>>>>>>>>>>> +     * folios should be handled here to prevent partial
>>>>>>>>>>>>>>>>>>>>>>> +     * unmaps from causing issues later on in migration
>>>>>>>>>>>>>>>>>>>>>>> +     * and fault handling flows.
>>>>>>>>>>>>>>>>>>>>>>> +     */
>>>>>>>>>>>>>>>>>>>>>>> +    folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>>>>>>>>>>>> device side mapping.
>>>>>>>>>>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>>>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>>>>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>>>>>>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>>>>>>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>>>>>>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>>>>>>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>>>>>>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>>>>>>>>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>>>>>>>>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>>>>>>>>>>>> at CPU side.
>>>>>>>>>>>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>>>>>>>>>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>>>>>>>>>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>>>>>>>>>>>> add code to skip CPU mapping handling code. Basically device private folios are
>>>>>>>>>>>>> CPU unmapped and device mapped.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here are my questions on device private folios:
>>>>>>>>>>>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>>>>>>>>>>>       perspective? Can it be stored in a device private specific data structure?
>>>>>>>>>>>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>>>>>>>>>>>> common code more messy if not done that way but sure possible.
>>>>>>>>>>>> And not consuming pfns (address space) at all would have benefits.
>>>>>>>>>>>>
>>>>>>>>>>>>> 2. When a device private folio is mapped on device, can someone other than
>>>>>>>>>>>>>       the device driver manipulate it assuming core-mm just skips device private
>>>>>>>>>>>>>       folios (barring the CPU access fault handling)?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Where I am going is that can device private folios be treated as unmapped folios
>>>>>>>>>>>>> by CPU and only device driver manipulates their mappings?
>>>>>>>>>>>>>
>>>>>>>>>>>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>>>>>>>>>>>> someone could change while in device, it's just pfn.
>>>>>>>>>>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>>>>>>>>>>
>>>>>>>>>>> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>>>>>>>>>>
>>>>>>>>>>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
>>>>>>>>>> Thanks for the clarification.
>>>>>>>>>>
>>>>>>>>>> So folio_mapcount() for device private folios should be treated the same
>>>>>>>>>> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
>>>>>>>>>> Then I wonder if the device private large folio split should go through
>>>>>>>>>> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
>>>>>>>>>> remap. Otherwise, how can we prevent rmap changes during the split?
>>>>>>>>>>
>>>>>>>>> That is true in general, the special cases I mentioned are:
>>>>>>>>>
>>>>>>>>> 1. split during migration (where the sizes on source/destination do not
>>>>>>>>>      match) and so we need to split in the middle of migration. The entries
>>>>>>>>>      there are already unmapped and hence the special handling
>>>>>>>>> 2. Partial unmap case, where we need to split in the context of the unmap
>>>>>>>>>      due to the issues mentioned in the patch. The folio split code for
>>>>>>>>>      device private pages has been expanded into its own helper, which does not
>>>>>>>>>      need to do the xas/mapped/lru folio handling. During partial unmap the
>>>>>>>>>      original folio does get replaced by new anon rmap ptes (split_huge_pmd_locked)
>>>>>>>>>
>>>>>>>>> For (2), I spent some time examining the implications of not unmapping the
>>>>>>>>> folios prior to split and in the partial unmap path, once we split the PMD
>>>>>>>>> the folios diverge. I did not run into any particular race either with the
>>>>>>>>> tests.
>>>>>>>> 1) is totally fine. This was in v1 and led to Zi's split_unmapped_folio()
>>>>>>>>
>>>>>>>> 2) is a problem because folio is mapped. split_huge_pmd() can be reached also from other than unmap path.
>>>>>>>> It is vulnerable to races by rmap. And for instance this does not look right without checking:
>>>>>>>>
>>>>>>>>      folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>
>>>>>>> I can add checks to make sure that the call does succeed.
>>>>>>>
>>>>>>>> You mention 2) is needed because of some later problems in fault path after pmd split. Would it be
>>>>>>>> possible to split the folio at fault time then?
>>>>>>> So after the partial unmap, the folio ends up in a little strange situation, the folio is large,
>>>>>>> but not mapped (since large_mapcount can be 0, after all the folio_rmap_remove_ptes). Calling folio_split()
>>>>>>> on partially unmapped fails because folio_get_anon_vma() fails due to the folio_mapped() failures
>>>>>>> related to folio_large_mapcount. There is also additional complexity with ref counts and mapping.
>>>>>> I think you mean "Calling folio_split() on a *fully* unmapped folio fails ..."
>>>>>>
>>>>>> A partially mapped folio still has folio_mapcount() > 0 -> folio_mapped() == true.
>>>>>>
>>>>> Looking into this again at my end
>>>>>
>>>>>>>
>>>>>>>> Also, didn't quite follow what kind of lock recursion did you encounter doing proper split_folio()
>>>>>>>> instead?
>>>>>>>>
>>>>>>>>
>>>>>>> Splitting during partial unmap causes recursive locking issues with anon_vma when invoked from
>>>>>>> split_huge_pmd_locked() path.
>>>>>> Yes, that's very complicated.
>>>>>>
>>>>> Yes and I want to avoid going down that path.
>>>>>
>>>>>>> Deferred splits do not work for device private pages, due to the
>>>>>>> migration requirements for fault handling.
>>>>>> Can you elaborate on that?
>>>>>>
>>>>> If a folio is under deferred_split() and is still pending a split, and a fault is then handled on the partially
>>>>> mapped folio, the code in folio_migrate_mapping() assumes, as part of fault handling during migration,
>>>>> that the folio sizes are the same (via checks on the reference count and mapcount)
>>>> If you hit a partially-mapped folio, instead of migrating, you would actually want to split and then migrate I assume.
>>> Yes, that is exactly what migrate_pages() does, and if the split fails, the
>>> migration fails too. Device private folios should probably do the same thing,
>>> assuming that splitting a device private folio always succeeds.
>>
>> Hmm, afaics the normal folio_split() wants to use RMP_USE_SHARED_ZEROPAGE when splitting and remapping
>> device private pages; that can't work.
> 
> It is fine to exclude device private folios from using RMP_USE_SHARED_ZEROPAGE, like:
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2b4ea5a2ce7d..b97dfd3521a9 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3858,7 +3858,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>         if (nr_shmem_dropped)
>                 shmem_uncharge(mapping->host, nr_shmem_dropped);
> 
> -       if (!ret && is_anon)
> +       if (!ret && is_anon && !folio_is_device_private(folio))
>                 remap_flags = RMP_USE_SHARED_ZEROPAGE;
>         remap_page(folio, 1 << order, remap_flags);
> 
> Or it can be done in remove_migration_pte().


I have the same set of changes, plus more, to see if the logic can be simplified and the
well-known paths taken.

Balbir Singh

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
  2025-08-01 12:28                                             ` Zi Yan
  2025-08-02  1:17                                               ` Balbir Singh
@ 2025-08-02 10:37                                               ` Balbir Singh
  2025-08-02 12:13                                                 ` Mika Penttilä
  1 sibling, 1 reply; 71+ messages in thread
From: Balbir Singh @ 2025-08-02 10:37 UTC (permalink / raw)
  To: Zi Yan, Mika Penttilä
  Cc: David Hildenbrand, linux-mm, linux-kernel, Karol Herbst,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
	Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
	Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
	Ralph Campbell

FYI:

I have the following patch on top of my series that seems to make it work
without requiring the helper to split device private folios.


Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
 include/linux/huge_mm.h |  1 -
 lib/test_hmm.c          | 11 +++++-
 mm/huge_memory.c        | 76 ++++-------------------------------------
 mm/migrate_device.c     | 51 +++++++++++++++++++++++++++
 4 files changed, 67 insertions(+), 72 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 19e7e3b7c2b7..52d8b435950b 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -343,7 +343,6 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
 		vm_flags_t vm_flags);
 
 bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
-int split_device_private_folio(struct folio *folio);
 int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 		unsigned int new_order, bool unmapped);
 int min_order_for_split(struct folio *folio);
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 341ae2af44ec..444477785882 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -1625,13 +1625,22 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
 	 * the mirror but here we use it to hold the page for the simulated
 	 * device memory and that page holds the pointer to the mirror.
 	 */
-	rpage = vmf->page->zone_device_data;
+	rpage = folio_page(page_folio(vmf->page), 0)->zone_device_data;
 	dmirror = rpage->zone_device_data;
 
 	/* FIXME demonstrate how we can adjust migrate range */
 	order = folio_order(page_folio(vmf->page));
 	nr = 1 << order;
 
+	/*
+	 * When folios are partially mapped, we can't rely on the folio
+	 * order of vmf->page as the folio might not be fully split yet
+	 */
+	if (vmf->pte) {
+		order = 0;
+		nr = 1;
+	}
+
 	/*
 	 * Consider a per-cpu cache of src and dst pfns, but with
 	 * large number of cpus that might not scale well.
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1fc1efa219c8..863393dec1f1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -72,10 +72,6 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
 					  struct shrink_control *sc);
 static unsigned long deferred_split_scan(struct shrinker *shrink,
 					 struct shrink_control *sc);
-static int __split_unmapped_folio(struct folio *folio, int new_order,
-		struct page *split_at, struct xa_state *xas,
-		struct address_space *mapping, bool uniform_split);
-
 static bool split_underused_thp = true;
 
 static atomic_t huge_zero_refcount;
@@ -2924,51 +2920,6 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
 	pmd_populate(mm, pmd, pgtable);
 }
 
-/**
- * split_huge_device_private_folio - split a huge device private folio into
- * smaller pages (of order 0), currently used by migrate_device logic to
- * split folios for pages that are partially mapped
- *
- * @folio: the folio to split
- *
- * The caller has to hold the folio_lock and a reference via folio_get
- */
-int split_device_private_folio(struct folio *folio)
-{
-	struct folio *end_folio = folio_next(folio);
-	struct folio *new_folio;
-	int ret = 0;
-
-	/*
-	 * Split the folio now. In the case of device
-	 * private pages, this path is executed when
-	 * the pmd is split and since freeze is not true
-	 * it is likely the folio will be deferred_split.
-	 *
-	 * With device private pages, deferred splits of
-	 * folios should be handled here to prevent partial
-	 * unmaps from causing issues later on in migration
-	 * and fault handling flows.
-	 */
-	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
-	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
-	VM_WARN_ON(ret);
-	for (new_folio = folio_next(folio); new_folio != end_folio;
-					new_folio = folio_next(new_folio)) {
-		zone_device_private_split_cb(folio, new_folio);
-		folio_ref_unfreeze(new_folio, 1 + folio_expected_ref_count(
-								new_folio));
-	}
-
-	/*
-	 * Mark the end of the folio split for device private THP
-	 * split
-	 */
-	zone_device_private_split_cb(folio, NULL);
-	folio_ref_unfreeze(folio, 1 + folio_expected_ref_count(folio));
-	return ret;
-}
-
 static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long haddr, bool freeze)
 {
@@ -3064,30 +3015,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 				freeze = false;
 			if (!freeze) {
 				rmap_t rmap_flags = RMAP_NONE;
-				unsigned long addr = haddr;
-				struct folio *new_folio;
-				struct folio *end_folio = folio_next(folio);
 
 				if (anon_exclusive)
 					rmap_flags |= RMAP_EXCLUSIVE;
 
-				folio_lock(folio);
-				folio_get(folio);
-
-				split_device_private_folio(folio);
-
-				for (new_folio = folio_next(folio);
-					new_folio != end_folio;
-					new_folio = folio_next(new_folio)) {
-					addr += PAGE_SIZE;
-					folio_unlock(new_folio);
-					folio_add_anon_rmap_ptes(new_folio,
-						&new_folio->page, 1,
-						vma, addr, rmap_flags);
-				}
-				folio_unlock(folio);
-				folio_add_anon_rmap_ptes(folio, &folio->page,
-						1, vma, haddr, rmap_flags);
+				folio_ref_add(folio, HPAGE_PMD_NR - 1);
+				if (anon_exclusive)
+					rmap_flags |= RMAP_EXCLUSIVE;
+				folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
+						 vma, haddr, rmap_flags);
 			}
 		}
 
@@ -4065,7 +4001,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 	if (nr_shmem_dropped)
 		shmem_uncharge(mapping->host, nr_shmem_dropped);
 
-	if (!ret && is_anon)
+	if (!ret && is_anon && !folio_is_device_private(folio))
 		remap_flags = RMP_USE_SHARED_ZEROPAGE;
 
 	remap_page(folio, 1 << order, remap_flags);
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 49962ea19109..4264c0290d08 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -248,6 +248,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			 * page table entry. Other special swap entries are not
 			 * migratable, and we ignore regular swapped page.
 			 */
+			struct folio *folio;
+
 			entry = pte_to_swp_entry(pte);
 			if (!is_device_private_entry(entry))
 				goto next;
@@ -259,6 +261,55 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			    pgmap->owner != migrate->pgmap_owner)
 				goto next;
 
+			folio = page_folio(page);
+			if (folio_test_large(folio)) {
+				struct folio *new_folio;
+				struct folio *new_fault_folio;
+
+				/*
+				 * The reason for finding pmd present with a
+				 * device private pte and a large folio for the
+				 * pte is partial unmaps. Split the folio now
+				 * for the migration to be handled correctly
+				 */
+				pte_unmap_unlock(ptep, ptl);
+
+				folio_get(folio);
+				if (folio != fault_folio)
+					folio_lock(folio);
+				if (split_folio(folio)) {
+					if (folio != fault_folio)
+						folio_unlock(folio);
+					ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+					goto next;
+				}
+
+				/*
+				 * After the split, get back the extra reference
+				 * on the fault_page, this reference is checked during
+				 * folio_migrate_mapping()
+				 */
+				if (migrate->fault_page) {
+					new_fault_folio = page_folio(migrate->fault_page);
+					folio_get(new_fault_folio);
+				}
+
+				new_folio = page_folio(page);
+				pfn = page_to_pfn(page);
+
+				/*
+				 * Ensure the lock is held on the correct
+				 * folio after the split
+				 */
+				if (folio != new_folio) {
+					folio_unlock(folio);
+					folio_lock(new_folio);
+				}
+				folio_put(folio);
+				addr = start;
+				goto again;
+			}
+
 			mpfn = migrate_pfn(page_to_pfn(page)) |
 					MIGRATE_PFN_MIGRATE;
 			if (is_writable_device_private_entry(entry))
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
  2025-08-02 10:37                                               ` Balbir Singh
@ 2025-08-02 12:13                                                 ` Mika Penttilä
  2025-08-04 22:46                                                   ` Balbir Singh
  0 siblings, 1 reply; 71+ messages in thread
From: Mika Penttilä @ 2025-08-02 12:13 UTC (permalink / raw)
  To: Balbir Singh, Zi Yan
  Cc: David Hildenbrand, linux-mm, linux-kernel, Karol Herbst,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
	Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
	Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
	Ralph Campbell

Hi,

On 8/2/25 13:37, Balbir Singh wrote:
> FYI:
>
> I have the following patch on top of my series that seems to make it work
> without requiring the helper to split device private folios
>
I think this looks much better!

> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> ---
>  include/linux/huge_mm.h |  1 -
>  lib/test_hmm.c          | 11 +++++-
>  mm/huge_memory.c        | 76 ++++-------------------------------------
>  mm/migrate_device.c     | 51 +++++++++++++++++++++++++++
>  4 files changed, 67 insertions(+), 72 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 19e7e3b7c2b7..52d8b435950b 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -343,7 +343,6 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>  		vm_flags_t vm_flags);
>  
>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
> -int split_device_private_folio(struct folio *folio);
>  int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>  		unsigned int new_order, bool unmapped);
>  int min_order_for_split(struct folio *folio);
> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
> index 341ae2af44ec..444477785882 100644
> --- a/lib/test_hmm.c
> +++ b/lib/test_hmm.c
> @@ -1625,13 +1625,22 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>  	 * the mirror but here we use it to hold the page for the simulated
>  	 * device memory and that page holds the pointer to the mirror.
>  	 */
> -	rpage = vmf->page->zone_device_data;
> +	rpage = folio_page(page_folio(vmf->page), 0)->zone_device_data;
>  	dmirror = rpage->zone_device_data;
>  
>  	/* FIXME demonstrate how we can adjust migrate range */
>  	order = folio_order(page_folio(vmf->page));
>  	nr = 1 << order;
>  
> +	/*
> +	 * When folios are partially mapped, we can't rely on the folio
> +	 * order of vmf->page as the folio might not be fully split yet
> +	 */
> +	if (vmf->pte) {
> +		order = 0;
> +		nr = 1;
> +	}
> +
>  	/*
>  	 * Consider a per-cpu cache of src and dst pfns, but with
>  	 * large number of cpus that might not scale well.
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 1fc1efa219c8..863393dec1f1 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -72,10 +72,6 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
>  					  struct shrink_control *sc);
>  static unsigned long deferred_split_scan(struct shrinker *shrink,
>  					 struct shrink_control *sc);
> -static int __split_unmapped_folio(struct folio *folio, int new_order,
> -		struct page *split_at, struct xa_state *xas,
> -		struct address_space *mapping, bool uniform_split);
> -
>  static bool split_underused_thp = true;
>  
>  static atomic_t huge_zero_refcount;
> @@ -2924,51 +2920,6 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>  	pmd_populate(mm, pmd, pgtable);
>  }
>  
> -/**
> - * split_huge_device_private_folio - split a huge device private folio into
> - * smaller pages (of order 0), currently used by migrate_device logic to
> - * split folios for pages that are partially mapped
> - *
> - * @folio: the folio to split
> - *
> - * The caller has to hold the folio_lock and a reference via folio_get
> - */
> -int split_device_private_folio(struct folio *folio)
> -{
> -	struct folio *end_folio = folio_next(folio);
> -	struct folio *new_folio;
> -	int ret = 0;
> -
> -	/*
> -	 * Split the folio now. In the case of device
> -	 * private pages, this path is executed when
> -	 * the pmd is split and since freeze is not true
> -	 * it is likely the folio will be deferred_split.
> -	 *
> -	 * With device private pages, deferred splits of
> -	 * folios should be handled here to prevent partial
> -	 * unmaps from causing issues later on in migration
> -	 * and fault handling flows.
> -	 */
> -	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
> -	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
> -	VM_WARN_ON(ret);
> -	for (new_folio = folio_next(folio); new_folio != end_folio;
> -					new_folio = folio_next(new_folio)) {
> -		zone_device_private_split_cb(folio, new_folio);
> -		folio_ref_unfreeze(new_folio, 1 + folio_expected_ref_count(
> -								new_folio));
> -	}
> -
> -	/*
> -	 * Mark the end of the folio split for device private THP
> -	 * split
> -	 */
> -	zone_device_private_split_cb(folio, NULL);
> -	folio_ref_unfreeze(folio, 1 + folio_expected_ref_count(folio));
> -	return ret;
> -}
> -
>  static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  		unsigned long haddr, bool freeze)
>  {
> @@ -3064,30 +3015,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  				freeze = false;
>  			if (!freeze) {
>  				rmap_t rmap_flags = RMAP_NONE;
> -				unsigned long addr = haddr;
> -				struct folio *new_folio;
> -				struct folio *end_folio = folio_next(folio);
>  
>  				if (anon_exclusive)
>  					rmap_flags |= RMAP_EXCLUSIVE;
>  
> -				folio_lock(folio);
> -				folio_get(folio);
> -
> -				split_device_private_folio(folio);
> -
> -				for (new_folio = folio_next(folio);
> -					new_folio != end_folio;
> -					new_folio = folio_next(new_folio)) {
> -					addr += PAGE_SIZE;
> -					folio_unlock(new_folio);
> -					folio_add_anon_rmap_ptes(new_folio,
> -						&new_folio->page, 1,
> -						vma, addr, rmap_flags);
> -				}
> -				folio_unlock(folio);
> -				folio_add_anon_rmap_ptes(folio, &folio->page,
> -						1, vma, haddr, rmap_flags);
> +				folio_ref_add(folio, HPAGE_PMD_NR - 1);
> +				if (anon_exclusive)
> +					rmap_flags |= RMAP_EXCLUSIVE;
> +				folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
> +						 vma, haddr, rmap_flags);
>  			}
>  		}
>  
> @@ -4065,7 +4001,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>  	if (nr_shmem_dropped)
>  		shmem_uncharge(mapping->host, nr_shmem_dropped);
>  
> -	if (!ret && is_anon)
> +	if (!ret && is_anon && !folio_is_device_private(folio))
>  		remap_flags = RMP_USE_SHARED_ZEROPAGE;
>  
>  	remap_page(folio, 1 << order, remap_flags);
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 49962ea19109..4264c0290d08 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -248,6 +248,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  			 * page table entry. Other special swap entries are not
>  			 * migratable, and we ignore regular swapped page.
>  			 */
> +			struct folio *folio;
> +
>  			entry = pte_to_swp_entry(pte);
>  			if (!is_device_private_entry(entry))
>  				goto next;
> @@ -259,6 +261,55 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  			    pgmap->owner != migrate->pgmap_owner)
>  				goto next;
>  
> +			folio = page_folio(page);
> +			if (folio_test_large(folio)) {
> +				struct folio *new_folio;
> +				struct folio *new_fault_folio;
> +
> +				/*
> +				 * The reason for finding pmd present with a
> +				 * device private pte and a large folio for the
> +				 * pte is partial unmaps. Split the folio now
> +				 * for the migration to be handled correctly
> +				 */
> +				pte_unmap_unlock(ptep, ptl);
> +
> +				folio_get(folio);
> +				if (folio != fault_folio)
> +					folio_lock(folio);
> +				if (split_folio(folio)) {
> +					if (folio != fault_folio)
> +						folio_unlock(folio);
> +					ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> +					goto next;
> +				}
> +

The nouveau migrate_to_ram handler also needs adjustment if the split happens.

> +				/*
> +				 * After the split, get back the extra reference
> +				 * on the fault_page, this reference is checked during
> +				 * folio_migrate_mapping()
> +				 */
> +				if (migrate->fault_page) {
> +					new_fault_folio = page_folio(migrate->fault_page);
> +					folio_get(new_fault_folio);
> +				}
> +
> +				new_folio = page_folio(page);
> +				pfn = page_to_pfn(page);
> +
> +				/*
> +				 * Ensure the lock is held on the correct
> +				 * folio after the split
> +				 */
> +				if (folio != new_folio) {
> +					folio_unlock(folio);
> +					folio_lock(new_folio);
> +				}

Maybe be careful not to unlock fault_page?

> +				folio_put(folio);
> +				addr = start;
> +				goto again;
> +			}
> +
>  			mpfn = migrate_pfn(page_to_pfn(page)) |
>  					MIGRATE_PFN_MIGRATE;
>  			if (is_writable_device_private_entry(entry))

--Mika


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
  2025-08-02 12:13                                                 ` Mika Penttilä
@ 2025-08-04 22:46                                                   ` Balbir Singh
  2025-08-04 23:26                                                     ` Mika Penttilä
  0 siblings, 1 reply; 71+ messages in thread
From: Balbir Singh @ 2025-08-04 22:46 UTC (permalink / raw)
  To: Mika Penttilä, Zi Yan
  Cc: David Hildenbrand, linux-mm, linux-kernel, Karol Herbst,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
	Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
	Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
	Ralph Campbell

On 8/2/25 22:13, Mika Penttilä wrote:
> Hi,
> 
> On 8/2/25 13:37, Balbir Singh wrote:
>> FYI:
>>
>> I have the following patch on top of my series that seems to make it work
>> without requiring the helper to split device private folios
>>
> I think this looks much better!
> 

Thanks!

>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>> ---
>>  include/linux/huge_mm.h |  1 -
>>  lib/test_hmm.c          | 11 +++++-
>>  mm/huge_memory.c        | 76 ++++-------------------------------------
>>  mm/migrate_device.c     | 51 +++++++++++++++++++++++++++
>>  4 files changed, 67 insertions(+), 72 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 19e7e3b7c2b7..52d8b435950b 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -343,7 +343,6 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>  		vm_flags_t vm_flags);
>>  
>>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>> -int split_device_private_folio(struct folio *folio);
>>  int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>  		unsigned int new_order, bool unmapped);
>>  int min_order_for_split(struct folio *folio);
>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>> index 341ae2af44ec..444477785882 100644
>> --- a/lib/test_hmm.c
>> +++ b/lib/test_hmm.c
>> @@ -1625,13 +1625,22 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>>  	 * the mirror but here we use it to hold the page for the simulated
>>  	 * device memory and that page holds the pointer to the mirror.
>>  	 */
>> -	rpage = vmf->page->zone_device_data;
>> +	rpage = folio_page(page_folio(vmf->page), 0)->zone_device_data;
>>  	dmirror = rpage->zone_device_data;
>>  
>>  	/* FIXME demonstrate how we can adjust migrate range */
>>  	order = folio_order(page_folio(vmf->page));
>>  	nr = 1 << order;
>>  
>> +	/*
>> +	 * When folios are partially mapped, we can't rely on the folio
>> +	 * order of vmf->page as the folio might not be fully split yet
>> +	 */
>> +	if (vmf->pte) {
>> +		order = 0;
>> +		nr = 1;
>> +	}
>> +
>>  	/*
>>  	 * Consider a per-cpu cache of src and dst pfns, but with
>>  	 * large number of cpus that might not scale well.
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 1fc1efa219c8..863393dec1f1 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -72,10 +72,6 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
>>  					  struct shrink_control *sc);
>>  static unsigned long deferred_split_scan(struct shrinker *shrink,
>>  					 struct shrink_control *sc);
>> -static int __split_unmapped_folio(struct folio *folio, int new_order,
>> -		struct page *split_at, struct xa_state *xas,
>> -		struct address_space *mapping, bool uniform_split);
>> -
>>  static bool split_underused_thp = true;
>>  
>>  static atomic_t huge_zero_refcount;
>> @@ -2924,51 +2920,6 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>>  	pmd_populate(mm, pmd, pgtable);
>>  }
>>  
>> -/**
>> - * split_huge_device_private_folio - split a huge device private folio into
>> - * smaller pages (of order 0), currently used by migrate_device logic to
>> - * split folios for pages that are partially mapped
>> - *
>> - * @folio: the folio to split
>> - *
>> - * The caller has to hold the folio_lock and a reference via folio_get
>> - */
>> -int split_device_private_folio(struct folio *folio)
>> -{
>> -	struct folio *end_folio = folio_next(folio);
>> -	struct folio *new_folio;
>> -	int ret = 0;
>> -
>> -	/*
>> -	 * Split the folio now. In the case of device
>> -	 * private pages, this path is executed when
>> -	 * the pmd is split and since freeze is not true
>> -	 * it is likely the folio will be deferred_split.
>> -	 *
>> -	 * With device private pages, deferred splits of
>> -	 * folios should be handled here to prevent partial
>> -	 * unmaps from causing issues later on in migration
>> -	 * and fault handling flows.
>> -	 */
>> -	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>> -	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>> -	VM_WARN_ON(ret);
>> -	for (new_folio = folio_next(folio); new_folio != end_folio;
>> -					new_folio = folio_next(new_folio)) {
>> -		zone_device_private_split_cb(folio, new_folio);
>> -		folio_ref_unfreeze(new_folio, 1 + folio_expected_ref_count(
>> -								new_folio));
>> -	}
>> -
>> -	/*
>> -	 * Mark the end of the folio split for device private THP
>> -	 * split
>> -	 */
>> -	zone_device_private_split_cb(folio, NULL);
>> -	folio_ref_unfreeze(folio, 1 + folio_expected_ref_count(folio));
>> -	return ret;
>> -}
>> -
>>  static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>  		unsigned long haddr, bool freeze)
>>  {
>> @@ -3064,30 +3015,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>  				freeze = false;
>>  			if (!freeze) {
>>  				rmap_t rmap_flags = RMAP_NONE;
>> -				unsigned long addr = haddr;
>> -				struct folio *new_folio;
>> -				struct folio *end_folio = folio_next(folio);
>>  
>>  				if (anon_exclusive)
>>  					rmap_flags |= RMAP_EXCLUSIVE;
>>  
>> -				folio_lock(folio);
>> -				folio_get(folio);
>> -
>> -				split_device_private_folio(folio);
>> -
>> -				for (new_folio = folio_next(folio);
>> -					new_folio != end_folio;
>> -					new_folio = folio_next(new_folio)) {
>> -					addr += PAGE_SIZE;
>> -					folio_unlock(new_folio);
>> -					folio_add_anon_rmap_ptes(new_folio,
>> -						&new_folio->page, 1,
>> -						vma, addr, rmap_flags);
>> -				}
>> -				folio_unlock(folio);
>> -				folio_add_anon_rmap_ptes(folio, &folio->page,
>> -						1, vma, haddr, rmap_flags);
>> +				folio_ref_add(folio, HPAGE_PMD_NR - 1);
>> +				if (anon_exclusive)
>> +					rmap_flags |= RMAP_EXCLUSIVE;
>> +				folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
>> +						 vma, haddr, rmap_flags);
>>  			}
>>  		}
>>  
>> @@ -4065,7 +4001,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>  	if (nr_shmem_dropped)
>>  		shmem_uncharge(mapping->host, nr_shmem_dropped);
>>  
>> -	if (!ret && is_anon)
>> +	if (!ret && is_anon && !folio_is_device_private(folio))
>>  		remap_flags = RMP_USE_SHARED_ZEROPAGE;
>>  
>>  	remap_page(folio, 1 << order, remap_flags);
>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>> index 49962ea19109..4264c0290d08 100644
>> --- a/mm/migrate_device.c
>> +++ b/mm/migrate_device.c
>> @@ -248,6 +248,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>  			 * page table entry. Other special swap entries are not
>>  			 * migratable, and we ignore regular swapped page.
>>  			 */
>> +			struct folio *folio;
>> +
>>  			entry = pte_to_swp_entry(pte);
>>  			if (!is_device_private_entry(entry))
>>  				goto next;
>> @@ -259,6 +261,55 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>  			    pgmap->owner != migrate->pgmap_owner)
>>  				goto next;
>>  
>> +			folio = page_folio(page);
>> +			if (folio_test_large(folio)) {
>> +				struct folio *new_folio;
>> +				struct folio *new_fault_folio;
>> +
>> +				/*
>> +				 * The reason for finding pmd present with a
>> +				 * device private pte and a large folio for the
>> +				 * pte is partial unmaps. Split the folio now
>> +				 * for the migration to be handled correctly
>> +				 */
>> +				pte_unmap_unlock(ptep, ptl);
>> +
>> +				folio_get(folio);
>> +				if (folio != fault_folio)
>> +					folio_lock(folio);
>> +				if (split_folio(folio)) {
>> +					if (folio != fault_folio)
>> +						folio_unlock(folio);
>> +					ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>> +					goto next;
>> +				}
>> +
> 
> The nouveau migrate_to_ram handler needs adjustment also if split happens.
> 

test_hmm needs adjustment because of the way the backup folios are set up.
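Roughly what I have in mind is to do the fixup from the folio_split callback that
this series already invokes via zone_device_private_split_cb(), so that every
post-split device page still finds its own backing page. Untested sketch; the
callback name, its signature and the assumption that the backing store is a single
compound allocation are all mine:

	static void dmirror_devmem_folio_split(struct folio *original,
					       struct folio *new_folio)
	{
		struct page *backing;
		unsigned long offset;

		if (!new_folio)		/* end-of-split marker */
			return;

		/* test_hmm keeps the backing page in the head's zone_device_data */
		backing = folio_page(original, 0)->zone_device_data;

		/* point the new folio at the matching sub-page of the backing store */
		offset = folio_pfn(new_folio) - folio_pfn(original);
		folio_page(new_folio, 0)->zone_device_data = nth_page(backing, offset);
	}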

>> +				/*
>> +				 * After the split, get back the extra reference
>> +				 * on the fault_page, this reference is checked during
>> +				 * folio_migrate_mapping()
>> +				 */
>> +				if (migrate->fault_page) {
>> +					new_fault_folio = page_folio(migrate->fault_page);
>> +					folio_get(new_fault_folio);
>> +				}
>> +
>> +				new_folio = page_folio(page);
>> +				pfn = page_to_pfn(page);
>> +
>> +				/*
>> +				 * Ensure the lock is held on the correct
>> +				 * folio after the split
>> +				 */
>> +				if (folio != new_folio) {
>> +					folio_unlock(folio);
>> +					folio_lock(new_folio);
>> +				}
> 
> Maybe careful not to unlock fault_page ?
> 

split_folio() will unlock everything but the original folio; the code then takes the lock
on the folio that contains the page after the split

>> +				folio_put(folio);
>> +				addr = start;
>> +				goto again;
>> +			}
>> +
>>  			mpfn = migrate_pfn(page_to_pfn(page)) |
>>  					MIGRATE_PFN_MIGRATE;
>>  			if (is_writable_device_private_entry(entry))
> 

Balbir

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
  2025-08-04 22:46                                                   ` Balbir Singh
@ 2025-08-04 23:26                                                     ` Mika Penttilä
  2025-08-05  4:10                                                       ` Balbir Singh
  0 siblings, 1 reply; 71+ messages in thread
From: Mika Penttilä @ 2025-08-04 23:26 UTC (permalink / raw)
  To: Balbir Singh, Zi Yan
  Cc: David Hildenbrand, linux-mm, linux-kernel, Karol Herbst,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
	Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
	Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
	Ralph Campbell

Hi,

On 8/5/25 01:46, Balbir Singh wrote:
> On 8/2/25 22:13, Mika Penttilä wrote:
>> Hi,
>>
>> On 8/2/25 13:37, Balbir Singh wrote:
>>> FYI:
>>>
>>> I have the following patch on top of my series that seems to make it work
>>> without requiring the helper to split device private folios
>>>
>> I think this looks much better!
>>
> Thanks!
>
>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>> ---
>>>  include/linux/huge_mm.h |  1 -
>>>  lib/test_hmm.c          | 11 +++++-
>>>  mm/huge_memory.c        | 76 ++++-------------------------------------
>>>  mm/migrate_device.c     | 51 +++++++++++++++++++++++++++
>>>  4 files changed, 67 insertions(+), 72 deletions(-)
>>>
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index 19e7e3b7c2b7..52d8b435950b 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>> @@ -343,7 +343,6 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>>  		vm_flags_t vm_flags);
>>>  
>>>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>>> -int split_device_private_folio(struct folio *folio);
>>>  int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>  		unsigned int new_order, bool unmapped);
>>>  int min_order_for_split(struct folio *folio);
>>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>>> index 341ae2af44ec..444477785882 100644
>>> --- a/lib/test_hmm.c
>>> +++ b/lib/test_hmm.c
>>> @@ -1625,13 +1625,22 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>>>  	 * the mirror but here we use it to hold the page for the simulated
>>>  	 * device memory and that page holds the pointer to the mirror.
>>>  	 */
>>> -	rpage = vmf->page->zone_device_data;
>>> +	rpage = folio_page(page_folio(vmf->page), 0)->zone_device_data;
>>>  	dmirror = rpage->zone_device_data;
>>>  
>>>  	/* FIXME demonstrate how we can adjust migrate range */
>>>  	order = folio_order(page_folio(vmf->page));
>>>  	nr = 1 << order;
>>>  
>>> +	/*
>>> +	 * When folios are partially mapped, we can't rely on the folio
>>> +	 * order of vmf->page as the folio might not be fully split yet
>>> +	 */
>>> +	if (vmf->pte) {
>>> +		order = 0;
>>> +		nr = 1;
>>> +	}
>>> +
>>>  	/*
>>>  	 * Consider a per-cpu cache of src and dst pfns, but with
>>>  	 * large number of cpus that might not scale well.
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 1fc1efa219c8..863393dec1f1 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -72,10 +72,6 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
>>>  					  struct shrink_control *sc);
>>>  static unsigned long deferred_split_scan(struct shrinker *shrink,
>>>  					 struct shrink_control *sc);
>>> -static int __split_unmapped_folio(struct folio *folio, int new_order,
>>> -		struct page *split_at, struct xa_state *xas,
>>> -		struct address_space *mapping, bool uniform_split);
>>> -
>>>  static bool split_underused_thp = true;
>>>  
>>>  static atomic_t huge_zero_refcount;
>>> @@ -2924,51 +2920,6 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>>>  	pmd_populate(mm, pmd, pgtable);
>>>  }
>>>  
>>> -/**
>>> - * split_huge_device_private_folio - split a huge device private folio into
>>> - * smaller pages (of order 0), currently used by migrate_device logic to
>>> - * split folios for pages that are partially mapped
>>> - *
>>> - * @folio: the folio to split
>>> - *
>>> - * The caller has to hold the folio_lock and a reference via folio_get
>>> - */
>>> -int split_device_private_folio(struct folio *folio)
>>> -{
>>> -	struct folio *end_folio = folio_next(folio);
>>> -	struct folio *new_folio;
>>> -	int ret = 0;
>>> -
>>> -	/*
>>> -	 * Split the folio now. In the case of device
>>> -	 * private pages, this path is executed when
>>> -	 * the pmd is split and since freeze is not true
>>> -	 * it is likely the folio will be deferred_split.
>>> -	 *
>>> -	 * With device private pages, deferred splits of
>>> -	 * folios should be handled here to prevent partial
>>> -	 * unmaps from causing issues later on in migration
>>> -	 * and fault handling flows.
>>> -	 */
>>> -	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>> -	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>> -	VM_WARN_ON(ret);
>>> -	for (new_folio = folio_next(folio); new_folio != end_folio;
>>> -					new_folio = folio_next(new_folio)) {
>>> -		zone_device_private_split_cb(folio, new_folio);
>>> -		folio_ref_unfreeze(new_folio, 1 + folio_expected_ref_count(
>>> -								new_folio));
>>> -	}
>>> -
>>> -	/*
>>> -	 * Mark the end of the folio split for device private THP
>>> -	 * split
>>> -	 */
>>> -	zone_device_private_split_cb(folio, NULL);
>>> -	folio_ref_unfreeze(folio, 1 + folio_expected_ref_count(folio));
>>> -	return ret;
>>> -}
>>> -
>>>  static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>  		unsigned long haddr, bool freeze)
>>>  {
>>> @@ -3064,30 +3015,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>  				freeze = false;
>>>  			if (!freeze) {
>>>  				rmap_t rmap_flags = RMAP_NONE;
>>> -				unsigned long addr = haddr;
>>> -				struct folio *new_folio;
>>> -				struct folio *end_folio = folio_next(folio);
>>>  
>>>  				if (anon_exclusive)
>>>  					rmap_flags |= RMAP_EXCLUSIVE;
>>>  
>>> -				folio_lock(folio);
>>> -				folio_get(folio);
>>> -
>>> -				split_device_private_folio(folio);
>>> -
>>> -				for (new_folio = folio_next(folio);
>>> -					new_folio != end_folio;
>>> -					new_folio = folio_next(new_folio)) {
>>> -					addr += PAGE_SIZE;
>>> -					folio_unlock(new_folio);
>>> -					folio_add_anon_rmap_ptes(new_folio,
>>> -						&new_folio->page, 1,
>>> -						vma, addr, rmap_flags);
>>> -				}
>>> -				folio_unlock(folio);
>>> -				folio_add_anon_rmap_ptes(folio, &folio->page,
>>> -						1, vma, haddr, rmap_flags);
>>> +				folio_ref_add(folio, HPAGE_PMD_NR - 1);
>>> +				if (anon_exclusive)
>>> +					rmap_flags |= RMAP_EXCLUSIVE;
>>> +				folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
>>> +						 vma, haddr, rmap_flags);
>>>  			}
>>>  		}
>>>  
>>> @@ -4065,7 +4001,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>  	if (nr_shmem_dropped)
>>>  		shmem_uncharge(mapping->host, nr_shmem_dropped);
>>>  
>>> -	if (!ret && is_anon)
>>> +	if (!ret && is_anon && !folio_is_device_private(folio))
>>>  		remap_flags = RMP_USE_SHARED_ZEROPAGE;
>>>  
>>>  	remap_page(folio, 1 << order, remap_flags);
>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>> index 49962ea19109..4264c0290d08 100644
>>> --- a/mm/migrate_device.c
>>> +++ b/mm/migrate_device.c
>>> @@ -248,6 +248,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>  			 * page table entry. Other special swap entries are not
>>>  			 * migratable, and we ignore regular swapped page.
>>>  			 */
>>> +			struct folio *folio;
>>> +
>>>  			entry = pte_to_swp_entry(pte);
>>>  			if (!is_device_private_entry(entry))
>>>  				goto next;
>>> @@ -259,6 +261,55 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>  			    pgmap->owner != migrate->pgmap_owner)
>>>  				goto next;
>>>  
>>> +			folio = page_folio(page);
>>> +			if (folio_test_large(folio)) {
>>> +				struct folio *new_folio;
>>> +				struct folio *new_fault_folio;
>>> +
>>> +				/*
>>> +				 * The reason for finding pmd present with a
>>> +				 * device private pte and a large folio for the
>>> +				 * pte is partial unmaps. Split the folio now
>>> +				 * for the migration to be handled correctly
>>> +				 */
>>> +				pte_unmap_unlock(ptep, ptl);
>>> +
>>> +				folio_get(folio);
>>> +				if (folio != fault_folio)
>>> +					folio_lock(folio);
>>> +				if (split_folio(folio)) {
>>> +					if (folio != fault_folio)
>>> +						folio_unlock(folio);
>>> +					ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>>> +					goto next;
>>> +				}
>>> +
>> The nouveau migrate_to_ram handler needs adjustment also if split happens.
>>
> test_hmm needs adjustment because of the way the backup folios are setup.

nouveau should check the folio order after the possible split happens.
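Something along these lines in the fault handler is what I mean (rough, untested
sketch; "args", "order" and "nr" are placeholders rather than the actual nouveau
variable names):

	/*
	 * Sample the folio order only after migrate_vma_setup(): the
	 * collect path may have split a partially unmapped device
	 * private THP, in which case vmf->page now sits in an order-0
	 * folio. The migration range and the src/dst arrays still have
	 * to be sized for the worst case up front.
	 */
	args.fault_page = vmf->page;
	if (migrate_vma_setup(&args))
		return VM_FAULT_SIGBUS;

	order = folio_order(page_folio(vmf->page));
	nr = 1UL << order;
	/* allocate destination pages and copy 'nr' pages from here on */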

>
>>> +				/*
>>> +				 * After the split, get back the extra reference
>>> +				 * on the fault_page, this reference is checked during
>>> +				 * folio_migrate_mapping()
>>> +				 */
>>> +				if (migrate->fault_page) {
>>> +					new_fault_folio = page_folio(migrate->fault_page);
>>> +					folio_get(new_fault_folio);
>>> +				}
>>> +
>>> +				new_folio = page_folio(page);
>>> +				pfn = page_to_pfn(page);
>>> +
>>> +				/*
>>> +				 * Ensure the lock is held on the correct
>>> +				 * folio after the split
>>> +				 */
>>> +				if (folio != new_folio) {
>>> +					folio_unlock(folio);
>>> +					folio_lock(new_folio);
>>> +				}
>> Maybe careful not to unlock fault_page ?
>>
> split_page will unlock everything but the original folio, the code takes the lock
> on the folio corresponding to the new folio

I mean do_swap_page() unlocks the folio of fault_page and expects it to remain locked.

>
>>> +				folio_put(folio);
>>> +				addr = start;
>>> +				goto again;
>>> +			}
>>> +
>>>  			mpfn = migrate_pfn(page_to_pfn(page)) |
>>>  					MIGRATE_PFN_MIGRATE;
>>>  			if (is_writable_device_private_entry(entry))
> Balbir
>
--Mika


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 01/11] mm/zone_device: support large zone device private folios
  2025-07-30  9:50   ` David Hildenbrand
@ 2025-08-04 23:43     ` Balbir Singh
  2025-08-05  4:22     ` Balbir Singh
  1 sibling, 0 replies; 71+ messages in thread
From: Balbir Singh @ 2025-08-04 23:43 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm
  Cc: linux-kernel, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
	Ralph Campbell, Mika Penttilä, Matthew Brost,
	Francois Dugast

On 7/30/25 19:50, David Hildenbrand wrote:
> On 30.07.25 11:21, Balbir Singh wrote:
>> Add routines to support allocation of large order zone device folios
>> and helper functions for zone device folios, to check if a folio is
>> device private and helpers for setting zone device data.
>>
>> When large folios are used, the existing page_free() callback in
>> pgmap is called when the folio is freed, this is true for both
>> PAGE_SIZE and higher order pages.
>>
>> Cc: Karol Herbst <kherbst@redhat.com>
>> Cc: Lyude Paul <lyude@redhat.com>
>> Cc: Danilo Krummrich <dakr@kernel.org>
>> Cc: David Airlie <airlied@gmail.com>
>> Cc: Simona Vetter <simona@ffwll.ch>
>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>> Cc: Shuah Khan <shuah@kernel.org>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Barry Song <baohua@kernel.org>
>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Cc: Peter Xu <peterx@redhat.com>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>> Cc: Jane Chu <jane.chu@oracle.com>
>> Cc: Alistair Popple <apopple@nvidia.com>
>> Cc: Donet Tom <donettom@linux.ibm.com>
>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>> Cc: Mika Penttilä <mpenttil@redhat.com>
>> Cc: Matthew Brost <matthew.brost@intel.com>
>> Cc: Francois Dugast <francois.dugast@intel.com>
>>
>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>> ---
>>   include/linux/memremap.h | 10 ++++++++-
>>   mm/memremap.c            | 48 +++++++++++++++++++++++++++++-----------
>>   2 files changed, 44 insertions(+), 14 deletions(-)
>>
>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>> index 4aa151914eab..a0723b35eeaa 100644
>> --- a/include/linux/memremap.h
>> +++ b/include/linux/memremap.h
>> @@ -199,7 +199,7 @@ static inline bool folio_is_fsdax(const struct folio *folio)
>>   }
>>     #ifdef CONFIG_ZONE_DEVICE
>> -void zone_device_page_init(struct page *page);
>> +void zone_device_folio_init(struct folio *folio, unsigned int order);
>>   void *memremap_pages(struct dev_pagemap *pgmap, int nid);
>>   void memunmap_pages(struct dev_pagemap *pgmap);
>>   void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
>> @@ -209,6 +209,14 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
>>   bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);
>>     unsigned long memremap_compat_align(void);
>> +
>> +static inline void zone_device_page_init(struct page *page)
>> +{
>> +    struct folio *folio = page_folio(page);
>> +
>> +    zone_device_folio_init(folio, 0);
>> +}
>> +
>>   #else
>>   static inline void *devm_memremap_pages(struct device *dev,
>>           struct dev_pagemap *pgmap)
>> diff --git a/mm/memremap.c b/mm/memremap.c
>> index b0ce0d8254bd..3ca136e7455e 100644
>> --- a/mm/memremap.c
>> +++ b/mm/memremap.c
>> @@ -427,20 +427,19 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
>>   void free_zone_device_folio(struct folio *folio)
>>   {
>>       struct dev_pagemap *pgmap = folio->pgmap;
>> +    unsigned int nr = folio_nr_pages(folio);
>> +    int i;
> 
> "unsigned long" is to be future-proof.

Will change this for v3

> 
> (folio_nr_pages() returns long and probably soon unsigned long)
> 
> [ I'd probably all it "nr_pages" ]

Ack

> 
>>         if (WARN_ON_ONCE(!pgmap))
>>           return;
>>         mem_cgroup_uncharge(folio);
>>   -    /*
>> -     * Note: we don't expect anonymous compound pages yet. Once supported
>> -     * and we could PTE-map them similar to THP, we'd have to clear
>> -     * PG_anon_exclusive on all tail pages.
>> -     */
>>       if (folio_test_anon(folio)) {
>> -        VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>> -        __ClearPageAnonExclusive(folio_page(folio, 0));
>> +        for (i = 0; i < nr; i++)
>> +            __ClearPageAnonExclusive(folio_page(folio, i));
>> +    } else {
>> +        VM_WARN_ON_ONCE(folio_test_large(folio));
>>       }
>>         /*
>> @@ -464,11 +463,20 @@ void free_zone_device_folio(struct folio *folio)
>>         switch (pgmap->type) {
>>       case MEMORY_DEVICE_PRIVATE:
>> +        if (folio_test_large(folio)) {
> 
> Could do "nr > 1" if we already have that value around.
> 

Ack

>> +            folio_unqueue_deferred_split(folio);
> 
> I think I asked that already but maybe missed the reply: Should these folios ever be added to the deferred split queue and is there any value in splitting them under memory pressure in the shrinker?
> 
> My gut feeling is "No", because the buddy cannot make use of these folios, but maybe there is an interesting case where we want that behavior?
> 
>> +
>> +            percpu_ref_put_many(&folio->pgmap->ref, nr - 1);
>> +        }
>> +        pgmap->ops->page_free(&folio->page);
>> +        percpu_ref_put(&folio->pgmap->ref);
> 
> Could you simply do a
> 
>     percpu_ref_put_many(&folio->pgmap->ref, nr);
> 
> here, or would that be problematic?
> 

I can definitely try that
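Roughly like this, I expect (quick, untested sketch, with the nr_pages rename from
above applied):

	case MEMORY_DEVICE_PRIVATE:
		if (folio_test_large(folio))
			folio_unqueue_deferred_split(folio);
		pgmap->ops->page_free(&folio->page);
		/* drop all of the folio's pgmap references in one go */
		percpu_ref_put_many(&folio->pgmap->ref, nr_pages);
		folio->page.mapping = NULL;
		break;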

>> +        folio->page.mapping = NULL;
>> +        break;
>>       case MEMORY_DEVICE_COHERENT:
>>           if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->page_free))
>>               break;
>> -        pgmap->ops->page_free(folio_page(folio, 0));
>> -        put_dev_pagemap(pgmap);
>> +        pgmap->ops->page_free(&folio->page);
>> +        percpu_ref_put(&folio->pgmap->ref);
>>           break;
>>         case MEMORY_DEVICE_GENERIC:
>> @@ -491,14 +499,28 @@ void free_zone_device_folio(struct folio *folio)
>>       }
>>   }
>>   -void zone_device_page_init(struct page *page)
>> +void zone_device_folio_init(struct folio *folio, unsigned int order)
>>   {
>> +    struct page *page = folio_page(folio, 0);
>> +
>> +    VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);
>> +
>> +    /*
>> +     * Only PMD level migration is supported for THP migration
>> +     */
> 
> Talking about something that does not exist yet (and is very specific) sounds a bit weird.
> 
> Should this go into a different patch, or could we rephrase the comment to be a bit more generic?
> 
> In this patch here, nothing would really object to "order" being intermediate.
> 
> (also, this is a device_private limitation? shouldn't that check go somehwere where we can perform this device-private limitation check?)
> 

I can remove the limitation and keep it generic

>> +    WARN_ON_ONCE(order && order != HPAGE_PMD_ORDER);
>> +
>>       /*
>>        * Drivers shouldn't be allocating pages after calling
>>        * memunmap_pages().
>>        */
>> -    WARN_ON_ONCE(!percpu_ref_tryget_live(&page_pgmap(page)->ref));
>> -    set_page_count(page, 1);
>> +    WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << order));
>> +    folio_set_count(folio, 1);
>>       lock_page(page);
>> +
>> +    if (order > 1) {
>> +        prep_compound_page(page, order);
>> +        folio_set_large_rmappable(folio);
>> +    }
>>   }
>> -EXPORT_SYMBOL_GPL(zone_device_page_init);
>> +EXPORT_SYMBOL_GPL(zone_device_folio_init);
> 
> 

Thanks,
Balbir

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
  2025-08-04 23:26                                                     ` Mika Penttilä
@ 2025-08-05  4:10                                                       ` Balbir Singh
  2025-08-05  4:24                                                         ` Mika Penttilä
  0 siblings, 1 reply; 71+ messages in thread
From: Balbir Singh @ 2025-08-05  4:10 UTC (permalink / raw)
  To: Mika Penttilä, Zi Yan
  Cc: David Hildenbrand, linux-mm, linux-kernel, Karol Herbst,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
	Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
	Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
	Ralph Campbell

On 8/5/25 09:26, Mika Penttilä wrote:
> Hi,
> 
> On 8/5/25 01:46, Balbir Singh wrote:
>> On 8/2/25 22:13, Mika Penttilä wrote:
>>> Hi,
>>>
>>> On 8/2/25 13:37, Balbir Singh wrote:
>>>> FYI:
>>>>
>>>> I have the following patch on top of my series that seems to make it work
>>>> without requiring the helper to split device private folios
>>>>
>>> I think this looks much better!
>>>
>> Thanks!
>>
>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>> ---
>>>>  include/linux/huge_mm.h |  1 -
>>>>  lib/test_hmm.c          | 11 +++++-
>>>>  mm/huge_memory.c        | 76 ++++-------------------------------------
>>>>  mm/migrate_device.c     | 51 +++++++++++++++++++++++++++
>>>>  4 files changed, 67 insertions(+), 72 deletions(-)
>>>>
>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>> index 19e7e3b7c2b7..52d8b435950b 100644
>>>> --- a/include/linux/huge_mm.h
>>>> +++ b/include/linux/huge_mm.h
>>>> @@ -343,7 +343,6 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>>>  		vm_flags_t vm_flags);
>>>>  
>>>>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>>>> -int split_device_private_folio(struct folio *folio);
>>>>  int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>>  		unsigned int new_order, bool unmapped);
>>>>  int min_order_for_split(struct folio *folio);
>>>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>>>> index 341ae2af44ec..444477785882 100644
>>>> --- a/lib/test_hmm.c
>>>> +++ b/lib/test_hmm.c
>>>> @@ -1625,13 +1625,22 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>>>>  	 * the mirror but here we use it to hold the page for the simulated
>>>>  	 * device memory and that page holds the pointer to the mirror.
>>>>  	 */
>>>> -	rpage = vmf->page->zone_device_data;
>>>> +	rpage = folio_page(page_folio(vmf->page), 0)->zone_device_data;
>>>>  	dmirror = rpage->zone_device_data;
>>>>  
>>>>  	/* FIXME demonstrate how we can adjust migrate range */
>>>>  	order = folio_order(page_folio(vmf->page));
>>>>  	nr = 1 << order;
>>>>  
>>>> +	/*
>>>> +	 * When folios are partially mapped, we can't rely on the folio
>>>> +	 * order of vmf->page as the folio might not be fully split yet
>>>> +	 */
>>>> +	if (vmf->pte) {
>>>> +		order = 0;
>>>> +		nr = 1;
>>>> +	}
>>>> +
>>>>  	/*
>>>>  	 * Consider a per-cpu cache of src and dst pfns, but with
>>>>  	 * large number of cpus that might not scale well.
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index 1fc1efa219c8..863393dec1f1 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -72,10 +72,6 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
>>>>  					  struct shrink_control *sc);
>>>>  static unsigned long deferred_split_scan(struct shrinker *shrink,
>>>>  					 struct shrink_control *sc);
>>>> -static int __split_unmapped_folio(struct folio *folio, int new_order,
>>>> -		struct page *split_at, struct xa_state *xas,
>>>> -		struct address_space *mapping, bool uniform_split);
>>>> -
>>>>  static bool split_underused_thp = true;
>>>>  
>>>>  static atomic_t huge_zero_refcount;
>>>> @@ -2924,51 +2920,6 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>>>>  	pmd_populate(mm, pmd, pgtable);
>>>>  }
>>>>  
>>>> -/**
>>>> - * split_huge_device_private_folio - split a huge device private folio into
>>>> - * smaller pages (of order 0), currently used by migrate_device logic to
>>>> - * split folios for pages that are partially mapped
>>>> - *
>>>> - * @folio: the folio to split
>>>> - *
>>>> - * The caller has to hold the folio_lock and a reference via folio_get
>>>> - */
>>>> -int split_device_private_folio(struct folio *folio)
>>>> -{
>>>> -	struct folio *end_folio = folio_next(folio);
>>>> -	struct folio *new_folio;
>>>> -	int ret = 0;
>>>> -
>>>> -	/*
>>>> -	 * Split the folio now. In the case of device
>>>> -	 * private pages, this path is executed when
>>>> -	 * the pmd is split and since freeze is not true
>>>> -	 * it is likely the folio will be deferred_split.
>>>> -	 *
>>>> -	 * With device private pages, deferred splits of
>>>> -	 * folios should be handled here to prevent partial
>>>> -	 * unmaps from causing issues later on in migration
>>>> -	 * and fault handling flows.
>>>> -	 */
>>>> -	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>> -	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>>> -	VM_WARN_ON(ret);
>>>> -	for (new_folio = folio_next(folio); new_folio != end_folio;
>>>> -					new_folio = folio_next(new_folio)) {
>>>> -		zone_device_private_split_cb(folio, new_folio);
>>>> -		folio_ref_unfreeze(new_folio, 1 + folio_expected_ref_count(
>>>> -								new_folio));
>>>> -	}
>>>> -
>>>> -	/*
>>>> -	 * Mark the end of the folio split for device private THP
>>>> -	 * split
>>>> -	 */
>>>> -	zone_device_private_split_cb(folio, NULL);
>>>> -	folio_ref_unfreeze(folio, 1 + folio_expected_ref_count(folio));
>>>> -	return ret;
>>>> -}
>>>> -
>>>>  static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>  		unsigned long haddr, bool freeze)
>>>>  {
>>>> @@ -3064,30 +3015,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>  				freeze = false;
>>>>  			if (!freeze) {
>>>>  				rmap_t rmap_flags = RMAP_NONE;
>>>> -				unsigned long addr = haddr;
>>>> -				struct folio *new_folio;
>>>> -				struct folio *end_folio = folio_next(folio);
>>>>  
>>>>  				if (anon_exclusive)
>>>>  					rmap_flags |= RMAP_EXCLUSIVE;
>>>>  
>>>> -				folio_lock(folio);
>>>> -				folio_get(folio);
>>>> -
>>>> -				split_device_private_folio(folio);
>>>> -
>>>> -				for (new_folio = folio_next(folio);
>>>> -					new_folio != end_folio;
>>>> -					new_folio = folio_next(new_folio)) {
>>>> -					addr += PAGE_SIZE;
>>>> -					folio_unlock(new_folio);
>>>> -					folio_add_anon_rmap_ptes(new_folio,
>>>> -						&new_folio->page, 1,
>>>> -						vma, addr, rmap_flags);
>>>> -				}
>>>> -				folio_unlock(folio);
>>>> -				folio_add_anon_rmap_ptes(folio, &folio->page,
>>>> -						1, vma, haddr, rmap_flags);
>>>> +				folio_ref_add(folio, HPAGE_PMD_NR - 1);
>>>> +				if (anon_exclusive)
>>>> +					rmap_flags |= RMAP_EXCLUSIVE;
>>>> +				folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
>>>> +						 vma, haddr, rmap_flags);
>>>>  			}
>>>>  		}
>>>>  
>>>> @@ -4065,7 +4001,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>  	if (nr_shmem_dropped)
>>>>  		shmem_uncharge(mapping->host, nr_shmem_dropped);
>>>>  
>>>> -	if (!ret && is_anon)
>>>> +	if (!ret && is_anon && !folio_is_device_private(folio))
>>>>  		remap_flags = RMP_USE_SHARED_ZEROPAGE;
>>>>  
>>>>  	remap_page(folio, 1 << order, remap_flags);
>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>>> index 49962ea19109..4264c0290d08 100644
>>>> --- a/mm/migrate_device.c
>>>> +++ b/mm/migrate_device.c
>>>> @@ -248,6 +248,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>  			 * page table entry. Other special swap entries are not
>>>>  			 * migratable, and we ignore regular swapped page.
>>>>  			 */
>>>> +			struct folio *folio;
>>>> +
>>>>  			entry = pte_to_swp_entry(pte);
>>>>  			if (!is_device_private_entry(entry))
>>>>  				goto next;
>>>> @@ -259,6 +261,55 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>  			    pgmap->owner != migrate->pgmap_owner)
>>>>  				goto next;
>>>>  
>>>> +			folio = page_folio(page);
>>>> +			if (folio_test_large(folio)) {
>>>> +				struct folio *new_folio;
>>>> +				struct folio *new_fault_folio;
>>>> +
>>>> +				/*
>>>> +				 * The reason for finding pmd present with a
>>>> +				 * device private pte and a large folio for the
>>>> +				 * pte is partial unmaps. Split the folio now
>>>> +				 * for the migration to be handled correctly
>>>> +				 */
>>>> +				pte_unmap_unlock(ptep, ptl);
>>>> +
>>>> +				folio_get(folio);
>>>> +				if (folio != fault_folio)
>>>> +					folio_lock(folio);
>>>> +				if (split_folio(folio)) {
>>>> +					if (folio != fault_folio)
>>>> +						folio_unlock(folio);
>>>> +					ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>>>> +					goto next;
>>>> +				}
>>>> +
>>> The nouveau migrate_to_ram handler needs adjustment also if split happens.
>>>
>> test_hmm needs adjustment because of the way the backup folios are setup.
> 
> nouveau should check the folio order after the possible split happens.
> 

You mean the folio_split callback?

>>
>>>> +				/*
>>>> +				 * After the split, get back the extra reference
>>>> +				 * on the fault_page, this reference is checked during
>>>> +				 * folio_migrate_mapping()
>>>> +				 */
>>>> +				if (migrate->fault_page) {
>>>> +					new_fault_folio = page_folio(migrate->fault_page);
>>>> +					folio_get(new_fault_folio);
>>>> +				}
>>>> +
>>>> +				new_folio = page_folio(page);
>>>> +				pfn = page_to_pfn(page);
>>>> +
>>>> +				/*
>>>> +				 * Ensure the lock is held on the correct
>>>> +				 * folio after the split
>>>> +				 */
>>>> +				if (folio != new_folio) {
>>>> +					folio_unlock(folio);
>>>> +					folio_lock(new_folio);
>>>> +				}
>>> Maybe careful not to unlock fault_page ?
>>>
>> split_page will unlock everything but the original folio, the code takes the lock
>> on the folio corresponding to the new folio
> 
> I mean do_swap_page() unlocks folio of fault_page and expects it to remain locked.
> 

I'm not sure I follow what you're trying to say here.

Balbir


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 01/11] mm/zone_device: support large zone device private folios
  2025-07-30  9:50   ` David Hildenbrand
  2025-08-04 23:43     ` Balbir Singh
@ 2025-08-05  4:22     ` Balbir Singh
  2025-08-05 10:57       ` David Hildenbrand
  1 sibling, 1 reply; 71+ messages in thread
From: Balbir Singh @ 2025-08-05  4:22 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm
  Cc: linux-kernel, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
	Ralph Campbell, Mika Penttilä, Matthew Brost,
	Francois Dugast

On 7/30/25 19:50, David Hildenbrand wrote:

> I think I asked that already but maybe missed the reply: Should these folios ever be added to the deferred split queue and is there any value in splitting them under memory pressure in the shrinker?
> 
> My gut feeling is "No", because the buddy cannot make use of these folios, but maybe there is an interesting case where we want that behavior?
> 

I realized I did not answer this

Deferred splitting is the default action when partial unmaps take place. Anything that
calls folio_remove_rmap_ptes() can cause the folio to be deferred-split if it gets
partially unmapped.

We can optimize for this later if needed
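If we do, it is probably not much more than an early bail-out in
deferred_split_folio(), something like (untested):

	/* device private folios never feed the buddy, so skip the queue */
	if (folio_is_device_private(folio))
		return;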

Balbir Singh

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
  2025-08-05  4:10                                                       ` Balbir Singh
@ 2025-08-05  4:24                                                         ` Mika Penttilä
  2025-08-05  5:19                                                           ` Mika Penttilä
  2025-08-05 10:27                                                           ` Balbir Singh
  0 siblings, 2 replies; 71+ messages in thread
From: Mika Penttilä @ 2025-08-05  4:24 UTC (permalink / raw)
  To: Balbir Singh, Zi Yan
  Cc: David Hildenbrand, linux-mm, linux-kernel, Karol Herbst,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
	Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
	Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
	Ralph Campbell

Hi,

On 8/5/25 07:10, Balbir Singh wrote:
> On 8/5/25 09:26, Mika Penttilä wrote:
>> Hi,
>>
>> On 8/5/25 01:46, Balbir Singh wrote:
>>> On 8/2/25 22:13, Mika Penttilä wrote:
>>>> Hi,
>>>>
>>>> On 8/2/25 13:37, Balbir Singh wrote:
>>>>> FYI:
>>>>>
>>>>> I have the following patch on top of my series that seems to make it work
>>>>> without requiring the helper to split device private folios
>>>>>
>>>> I think this looks much better!
>>>>
>>> Thanks!
>>>
>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>> ---
>>>>>  include/linux/huge_mm.h |  1 -
>>>>>  lib/test_hmm.c          | 11 +++++-
>>>>>  mm/huge_memory.c        | 76 ++++-------------------------------------
>>>>>  mm/migrate_device.c     | 51 +++++++++++++++++++++++++++
>>>>>  4 files changed, 67 insertions(+), 72 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>>> index 19e7e3b7c2b7..52d8b435950b 100644
>>>>> --- a/include/linux/huge_mm.h
>>>>> +++ b/include/linux/huge_mm.h
>>>>> @@ -343,7 +343,6 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>>>>  		vm_flags_t vm_flags);
>>>>>  
>>>>>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>>>>> -int split_device_private_folio(struct folio *folio);
>>>>>  int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>>>  		unsigned int new_order, bool unmapped);
>>>>>  int min_order_for_split(struct folio *folio);
>>>>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>>>>> index 341ae2af44ec..444477785882 100644
>>>>> --- a/lib/test_hmm.c
>>>>> +++ b/lib/test_hmm.c
>>>>> @@ -1625,13 +1625,22 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>>>>>  	 * the mirror but here we use it to hold the page for the simulated
>>>>>  	 * device memory and that page holds the pointer to the mirror.
>>>>>  	 */
>>>>> -	rpage = vmf->page->zone_device_data;
>>>>> +	rpage = folio_page(page_folio(vmf->page), 0)->zone_device_data;
>>>>>  	dmirror = rpage->zone_device_data;
>>>>>  
>>>>>  	/* FIXME demonstrate how we can adjust migrate range */
>>>>>  	order = folio_order(page_folio(vmf->page));
>>>>>  	nr = 1 << order;
>>>>>  
>>>>> +	/*
>>>>> +	 * When folios are partially mapped, we can't rely on the folio
>>>>> +	 * order of vmf->page as the folio might not be fully split yet
>>>>> +	 */
>>>>> +	if (vmf->pte) {
>>>>> +		order = 0;
>>>>> +		nr = 1;
>>>>> +	}
>>>>> +
>>>>>  	/*
>>>>>  	 * Consider a per-cpu cache of src and dst pfns, but with
>>>>>  	 * large number of cpus that might not scale well.
>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>> index 1fc1efa219c8..863393dec1f1 100644
>>>>> --- a/mm/huge_memory.c
>>>>> +++ b/mm/huge_memory.c
>>>>> @@ -72,10 +72,6 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
>>>>>  					  struct shrink_control *sc);
>>>>>  static unsigned long deferred_split_scan(struct shrinker *shrink,
>>>>>  					 struct shrink_control *sc);
>>>>> -static int __split_unmapped_folio(struct folio *folio, int new_order,
>>>>> -		struct page *split_at, struct xa_state *xas,
>>>>> -		struct address_space *mapping, bool uniform_split);
>>>>> -
>>>>>  static bool split_underused_thp = true;
>>>>>  
>>>>>  static atomic_t huge_zero_refcount;
>>>>> @@ -2924,51 +2920,6 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>>>>>  	pmd_populate(mm, pmd, pgtable);
>>>>>  }
>>>>>  
>>>>> -/**
>>>>> - * split_huge_device_private_folio - split a huge device private folio into
>>>>> - * smaller pages (of order 0), currently used by migrate_device logic to
>>>>> - * split folios for pages that are partially mapped
>>>>> - *
>>>>> - * @folio: the folio to split
>>>>> - *
>>>>> - * The caller has to hold the folio_lock and a reference via folio_get
>>>>> - */
>>>>> -int split_device_private_folio(struct folio *folio)
>>>>> -{
>>>>> -	struct folio *end_folio = folio_next(folio);
>>>>> -	struct folio *new_folio;
>>>>> -	int ret = 0;
>>>>> -
>>>>> -	/*
>>>>> -	 * Split the folio now. In the case of device
>>>>> -	 * private pages, this path is executed when
>>>>> -	 * the pmd is split and since freeze is not true
>>>>> -	 * it is likely the folio will be deferred_split.
>>>>> -	 *
>>>>> -	 * With device private pages, deferred splits of
>>>>> -	 * folios should be handled here to prevent partial
>>>>> -	 * unmaps from causing issues later on in migration
>>>>> -	 * and fault handling flows.
>>>>> -	 */
>>>>> -	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>> -	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>>>> -	VM_WARN_ON(ret);
>>>>> -	for (new_folio = folio_next(folio); new_folio != end_folio;
>>>>> -					new_folio = folio_next(new_folio)) {
>>>>> -		zone_device_private_split_cb(folio, new_folio);
>>>>> -		folio_ref_unfreeze(new_folio, 1 + folio_expected_ref_count(
>>>>> -								new_folio));
>>>>> -	}
>>>>> -
>>>>> -	/*
>>>>> -	 * Mark the end of the folio split for device private THP
>>>>> -	 * split
>>>>> -	 */
>>>>> -	zone_device_private_split_cb(folio, NULL);
>>>>> -	folio_ref_unfreeze(folio, 1 + folio_expected_ref_count(folio));
>>>>> -	return ret;
>>>>> -}
>>>>> -
>>>>>  static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>>  		unsigned long haddr, bool freeze)
>>>>>  {
>>>>> @@ -3064,30 +3015,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>>  				freeze = false;
>>>>>  			if (!freeze) {
>>>>>  				rmap_t rmap_flags = RMAP_NONE;
>>>>> -				unsigned long addr = haddr;
>>>>> -				struct folio *new_folio;
>>>>> -				struct folio *end_folio = folio_next(folio);
>>>>>  
>>>>>  				if (anon_exclusive)
>>>>>  					rmap_flags |= RMAP_EXCLUSIVE;
>>>>>  
>>>>> -				folio_lock(folio);
>>>>> -				folio_get(folio);
>>>>> -
>>>>> -				split_device_private_folio(folio);
>>>>> -
>>>>> -				for (new_folio = folio_next(folio);
>>>>> -					new_folio != end_folio;
>>>>> -					new_folio = folio_next(new_folio)) {
>>>>> -					addr += PAGE_SIZE;
>>>>> -					folio_unlock(new_folio);
>>>>> -					folio_add_anon_rmap_ptes(new_folio,
>>>>> -						&new_folio->page, 1,
>>>>> -						vma, addr, rmap_flags);
>>>>> -				}
>>>>> -				folio_unlock(folio);
>>>>> -				folio_add_anon_rmap_ptes(folio, &folio->page,
>>>>> -						1, vma, haddr, rmap_flags);
>>>>> +				folio_ref_add(folio, HPAGE_PMD_NR - 1);
>>>>> +				if (anon_exclusive)
>>>>> +					rmap_flags |= RMAP_EXCLUSIVE;
>>>>> +				folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
>>>>> +						 vma, haddr, rmap_flags);
>>>>>  			}
>>>>>  		}
>>>>>  
>>>>> @@ -4065,7 +4001,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>  	if (nr_shmem_dropped)
>>>>>  		shmem_uncharge(mapping->host, nr_shmem_dropped);
>>>>>  
>>>>> -	if (!ret && is_anon)
>>>>> +	if (!ret && is_anon && !folio_is_device_private(folio))
>>>>>  		remap_flags = RMP_USE_SHARED_ZEROPAGE;
>>>>>  
>>>>>  	remap_page(folio, 1 << order, remap_flags);
>>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>>>> index 49962ea19109..4264c0290d08 100644
>>>>> --- a/mm/migrate_device.c
>>>>> +++ b/mm/migrate_device.c
>>>>> @@ -248,6 +248,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>  			 * page table entry. Other special swap entries are not
>>>>>  			 * migratable, and we ignore regular swapped page.
>>>>>  			 */
>>>>> +			struct folio *folio;
>>>>> +
>>>>>  			entry = pte_to_swp_entry(pte);
>>>>>  			if (!is_device_private_entry(entry))
>>>>>  				goto next;
>>>>> @@ -259,6 +261,55 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>  			    pgmap->owner != migrate->pgmap_owner)
>>>>>  				goto next;
>>>>>  
>>>>> +			folio = page_folio(page);
>>>>> +			if (folio_test_large(folio)) {
>>>>> +				struct folio *new_folio;
>>>>> +				struct folio *new_fault_folio;
>>>>> +
>>>>> +				/*
>>>>> +				 * The reason for finding pmd present with a
>>>>> +				 * device private pte and a large folio for the
>>>>> +				 * pte is partial unmaps. Split the folio now
>>>>> +				 * for the migration to be handled correctly
>>>>> +				 */
>>>>> +				pte_unmap_unlock(ptep, ptl);
>>>>> +
>>>>> +				folio_get(folio);
>>>>> +				if (folio != fault_folio)
>>>>> +					folio_lock(folio);
>>>>> +				if (split_folio(folio)) {
>>>>> +					if (folio != fault_folio)
>>>>> +						folio_unlock(folio);
>>>>> +					ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>>>>> +					goto next;
>>>>> +				}
>>>>> +
>>>> The nouveau migrate_to_ram handler needs adjustment also if split happens.
>>>>
>>> test_hmm needs adjustment because of the way the backup folios are setup.
>> nouveau should check the folio order after the possible split happens.
>>
> You mean the folio_split callback?

no, nouveau_dmem_migrate_to_ram():
..
        sfolio = page_folio(vmf->page);
        order = folio_order(sfolio);
...
        migrate_vma_setup()
..
if sfolio is split, order still reflects the pre-split order
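
Something like re-reading the folio after migrate_vma_setup() would handle
it; just a rough sketch of the idea (not a tested patch, and the local names
are from my reading of nouveau_dmem_migrate_to_ram()):

        sfolio = page_folio(vmf->page);
        order = folio_order(sfolio);
        ...
        if (migrate_vma_setup(&args) < 0)
                return VM_FAULT_SIGBUS;
        /*
         * The source folio may have been split while collecting the
         * range, so the order sampled above can be stale; re-read it
         * from vmf->page before sizing the destination allocation.
         */
        sfolio = page_folio(vmf->page);
        order = folio_order(sfolio);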

>
>>>>> +				/*
>>>>> +				 * After the split, get back the extra reference
>>>>> +				 * on the fault_page, this reference is checked during
>>>>> +				 * folio_migrate_mapping()
>>>>> +				 */
>>>>> +				if (migrate->fault_page) {
>>>>> +					new_fault_folio = page_folio(migrate->fault_page);
>>>>> +					folio_get(new_fault_folio);
>>>>> +				}
>>>>> +
>>>>> +				new_folio = page_folio(page);
>>>>> +				pfn = page_to_pfn(page);
>>>>> +
>>>>> +				/*
>>>>> +				 * Ensure the lock is held on the correct
>>>>> +				 * folio after the split
>>>>> +				 */
>>>>> +				if (folio != new_folio) {
>>>>> +					folio_unlock(folio);
>>>>> +					folio_lock(new_folio);
>>>>> +				}
>>>> Maybe careful not to unlock fault_page ?
>>>>
>>> split_page will unlock everything but the original folio, the code takes the lock
>>> on the folio corresponding to the new folio
>> I mean do_swap_page() unlocks folio of fault_page and expects it to remain locked.
>>
> Not sure I follow what you're trying to elaborate on here

do_swap_page:
..
        if (trylock_page(vmf->page)) {
                ret = pgmap->ops->migrate_to_ram(vmf);
                               <- vmf->page should be locked here even after split
                unlock_page(vmf->page);

> Balbir
>
--Mika


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
  2025-08-05  4:24                                                         ` Mika Penttilä
@ 2025-08-05  5:19                                                           ` Mika Penttilä
  2025-08-05 10:27                                                           ` Balbir Singh
  1 sibling, 0 replies; 71+ messages in thread
From: Mika Penttilä @ 2025-08-05  5:19 UTC (permalink / raw)
  To: Balbir Singh, Zi Yan
  Cc: David Hildenbrand, linux-mm, linux-kernel, Karol Herbst,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
	Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
	Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
	Ralph Campbell


On 8/5/25 07:24, Mika Penttilä wrote:

> Hi,
>
> On 8/5/25 07:10, Balbir Singh wrote:
>> On 8/5/25 09:26, Mika Penttilä wrote:
>>> Hi,
>>>
>>> On 8/5/25 01:46, Balbir Singh wrote:
>>>> On 8/2/25 22:13, Mika Penttilä wrote:
>>>>> Hi,
>>>>>
>>>>> On 8/2/25 13:37, Balbir Singh wrote:
>>>>>> FYI:
>>>>>>
>>>>>> I have the following patch on top of my series that seems to make it work
>>>>>> without requiring the helper to split device private folios
>>>>>>
>>>>> I think this looks much better!
>>>>>
>>>> Thanks!
>>>>
>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>> ---
>>>>>>  include/linux/huge_mm.h |  1 -
>>>>>>  lib/test_hmm.c          | 11 +++++-
>>>>>>  mm/huge_memory.c        | 76 ++++-------------------------------------
>>>>>>  mm/migrate_device.c     | 51 +++++++++++++++++++++++++++
>>>>>>  4 files changed, 67 insertions(+), 72 deletions(-)
>>>>>>
>>>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>>>> index 19e7e3b7c2b7..52d8b435950b 100644
>>>>>> --- a/include/linux/huge_mm.h
>>>>>> +++ b/include/linux/huge_mm.h
>>>>>> @@ -343,7 +343,6 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>>>>>  		vm_flags_t vm_flags);
>>>>>>  
>>>>>>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>>>>>> -int split_device_private_folio(struct folio *folio);
>>>>>>  int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>>>>  		unsigned int new_order, bool unmapped);
>>>>>>  int min_order_for_split(struct folio *folio);
>>>>>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>>>>>> index 341ae2af44ec..444477785882 100644
>>>>>> --- a/lib/test_hmm.c
>>>>>> +++ b/lib/test_hmm.c
>>>>>> @@ -1625,13 +1625,22 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>>>>>>  	 * the mirror but here we use it to hold the page for the simulated
>>>>>>  	 * device memory and that page holds the pointer to the mirror.
>>>>>>  	 */
>>>>>> -	rpage = vmf->page->zone_device_data;
>>>>>> +	rpage = folio_page(page_folio(vmf->page), 0)->zone_device_data;
>>>>>>  	dmirror = rpage->zone_device_data;
>>>>>>  
>>>>>>  	/* FIXME demonstrate how we can adjust migrate range */
>>>>>>  	order = folio_order(page_folio(vmf->page));
>>>>>>  	nr = 1 << order;
>>>>>>  
>>>>>> +	/*
>>>>>> +	 * When folios are partially mapped, we can't rely on the folio
>>>>>> +	 * order of vmf->page as the folio might not be fully split yet
>>>>>> +	 */
>>>>>> +	if (vmf->pte) {
>>>>>> +		order = 0;
>>>>>> +		nr = 1;
>>>>>> +	}
>>>>>> +
>>>>>>  	/*
>>>>>>  	 * Consider a per-cpu cache of src and dst pfns, but with
>>>>>>  	 * large number of cpus that might not scale well.
>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>> index 1fc1efa219c8..863393dec1f1 100644
>>>>>> --- a/mm/huge_memory.c
>>>>>> +++ b/mm/huge_memory.c
>>>>>> @@ -72,10 +72,6 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
>>>>>>  					  struct shrink_control *sc);
>>>>>>  static unsigned long deferred_split_scan(struct shrinker *shrink,
>>>>>>  					 struct shrink_control *sc);
>>>>>> -static int __split_unmapped_folio(struct folio *folio, int new_order,
>>>>>> -		struct page *split_at, struct xa_state *xas,
>>>>>> -		struct address_space *mapping, bool uniform_split);
>>>>>> -
>>>>>>  static bool split_underused_thp = true;
>>>>>>  
>>>>>>  static atomic_t huge_zero_refcount;
>>>>>> @@ -2924,51 +2920,6 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>>>>>>  	pmd_populate(mm, pmd, pgtable);
>>>>>>  }
>>>>>>  
>>>>>> -/**
>>>>>> - * split_huge_device_private_folio - split a huge device private folio into
>>>>>> - * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>> - * split folios for pages that are partially mapped
>>>>>> - *
>>>>>> - * @folio: the folio to split
>>>>>> - *
>>>>>> - * The caller has to hold the folio_lock and a reference via folio_get
>>>>>> - */
>>>>>> -int split_device_private_folio(struct folio *folio)
>>>>>> -{
>>>>>> -	struct folio *end_folio = folio_next(folio);
>>>>>> -	struct folio *new_folio;
>>>>>> -	int ret = 0;
>>>>>> -
>>>>>> -	/*
>>>>>> -	 * Split the folio now. In the case of device
>>>>>> -	 * private pages, this path is executed when
>>>>>> -	 * the pmd is split and since freeze is not true
>>>>>> -	 * it is likely the folio will be deferred_split.
>>>>>> -	 *
>>>>>> -	 * With device private pages, deferred splits of
>>>>>> -	 * folios should be handled here to prevent partial
>>>>>> -	 * unmaps from causing issues later on in migration
>>>>>> -	 * and fault handling flows.
>>>>>> -	 */
>>>>>> -	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>> -	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>>>>> -	VM_WARN_ON(ret);
>>>>>> -	for (new_folio = folio_next(folio); new_folio != end_folio;
>>>>>> -					new_folio = folio_next(new_folio)) {
>>>>>> -		zone_device_private_split_cb(folio, new_folio);
>>>>>> -		folio_ref_unfreeze(new_folio, 1 + folio_expected_ref_count(
>>>>>> -								new_folio));
>>>>>> -	}
>>>>>> -
>>>>>> -	/*
>>>>>> -	 * Mark the end of the folio split for device private THP
>>>>>> -	 * split
>>>>>> -	 */
>>>>>> -	zone_device_private_split_cb(folio, NULL);
>>>>>> -	folio_ref_unfreeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>> -	return ret;
>>>>>> -}
>>>>>> -
>>>>>>  static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>>>  		unsigned long haddr, bool freeze)
>>>>>>  {
>>>>>> @@ -3064,30 +3015,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>>>  				freeze = false;
>>>>>>  			if (!freeze) {
>>>>>>  				rmap_t rmap_flags = RMAP_NONE;
>>>>>> -				unsigned long addr = haddr;
>>>>>> -				struct folio *new_folio;
>>>>>> -				struct folio *end_folio = folio_next(folio);
>>>>>>  
>>>>>>  				if (anon_exclusive)
>>>>>>  					rmap_flags |= RMAP_EXCLUSIVE;
>>>>>>  
>>>>>> -				folio_lock(folio);
>>>>>> -				folio_get(folio);
>>>>>> -
>>>>>> -				split_device_private_folio(folio);
>>>>>> -
>>>>>> -				for (new_folio = folio_next(folio);
>>>>>> -					new_folio != end_folio;
>>>>>> -					new_folio = folio_next(new_folio)) {
>>>>>> -					addr += PAGE_SIZE;
>>>>>> -					folio_unlock(new_folio);
>>>>>> -					folio_add_anon_rmap_ptes(new_folio,
>>>>>> -						&new_folio->page, 1,
>>>>>> -						vma, addr, rmap_flags);
>>>>>> -				}
>>>>>> -				folio_unlock(folio);
>>>>>> -				folio_add_anon_rmap_ptes(folio, &folio->page,
>>>>>> -						1, vma, haddr, rmap_flags);
>>>>>> +				folio_ref_add(folio, HPAGE_PMD_NR - 1);
>>>>>> +				if (anon_exclusive)
>>>>>> +					rmap_flags |= RMAP_EXCLUSIVE;
>>>>>> +				folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
>>>>>> +						 vma, haddr, rmap_flags);
>>>>>>  			}
>>>>>>  		}
>>>>>>  
>>>>>> @@ -4065,7 +4001,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>  	if (nr_shmem_dropped)
>>>>>>  		shmem_uncharge(mapping->host, nr_shmem_dropped);
>>>>>>  
>>>>>> -	if (!ret && is_anon)
>>>>>> +	if (!ret && is_anon && !folio_is_device_private(folio))
>>>>>>  		remap_flags = RMP_USE_SHARED_ZEROPAGE;
>>>>>>  
>>>>>>  	remap_page(folio, 1 << order, remap_flags);
>>>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>>>>> index 49962ea19109..4264c0290d08 100644
>>>>>> --- a/mm/migrate_device.c
>>>>>> +++ b/mm/migrate_device.c
>>>>>> @@ -248,6 +248,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>>  			 * page table entry. Other special swap entries are not
>>>>>>  			 * migratable, and we ignore regular swapped page.
>>>>>>  			 */
>>>>>> +			struct folio *folio;
>>>>>> +
>>>>>>  			entry = pte_to_swp_entry(pte);
>>>>>>  			if (!is_device_private_entry(entry))
>>>>>>  				goto next;
>>>>>> @@ -259,6 +261,55 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>>  			    pgmap->owner != migrate->pgmap_owner)
>>>>>>  				goto next;
>>>>>>  
>>>>>> +			folio = page_folio(page);
>>>>>> +			if (folio_test_large(folio)) {
>>>>>> +				struct folio *new_folio;
>>>>>> +				struct folio *new_fault_folio;
>>>>>> +
>>>>>> +				/*
>>>>>> +				 * The reason for finding pmd present with a
>>>>>> +				 * device private pte and a large folio for the
>>>>>> +				 * pte is partial unmaps. Split the folio now
>>>>>> +				 * for the migration to be handled correctly
>>>>>> +				 */
>>>>>> +				pte_unmap_unlock(ptep, ptl);
>>>>>> +
>>>>>> +				folio_get(folio);
>>>>>> +				if (folio != fault_folio)
>>>>>> +					folio_lock(folio);
>>>>>> +				if (split_folio(folio)) {
>>>>>> +					if (folio != fault_folio)
>>>>>> +						folio_unlock(folio);
>>>>>> +					ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>>>>>> +					goto next;
>>>>>> +				}
>>>>>> +
>>>>> The nouveau migrate_to_ram handler needs adjustment also if split happens.
>>>>>
>>>> test_hmm needs adjustment because of the way the backup folios are setup.
>>> nouveau should check the folio order after the possible split happens.
>>>
>> You mean the folio_split callback?
> no, nouveau_dmem_migrate_to_ram():
> ..
>         sfolio = page_folio(vmf->page);
>         order = folio_order(sfolio);
> ...
>         migrate_vma_setup()
> ..
> if sfolio is split order still reflects the pre-split order
>
>>>>>> +				/*
>>>>>> +				 * After the split, get back the extra reference
>>>>>> +				 * on the fault_page, this reference is checked during
>>>>>> +				 * folio_migrate_mapping()
>>>>>> +				 */
>>>>>> +				if (migrate->fault_page) {
>>>>>> +					new_fault_folio = page_folio(migrate->fault_page);
>>>>>> +					folio_get(new_fault_folio);
>>>>>> +				}
>>>>>> +
>>>>>> +				new_folio = page_folio(page);
>>>>>> +				pfn = page_to_pfn(page);
>>>>>> +
>>>>>> +				/*
>>>>>> +				 * Ensure the lock is held on the correct
>>>>>> +				 * folio after the split
>>>>>> +				 */
>>>>>> +				if (folio != new_folio) {
>>>>>> +					folio_unlock(folio);
>>>>>> +					folio_lock(new_folio);
>>>>>> +				}
>>>>> Maybe careful not to unlock fault_page ?
>>>>>
>>>> split_page will unlock everything but the original folio, the code takes the lock
>>>> on the folio corresponding to the new folio
>>> I mean do_swap_page() unlocks folio of fault_page and expects it to remain locked.
>>>
>> Not sure I follow what you're trying to elaborate on here
>
Actually fault_folio should be fine, but should we have:

if (fault_folio) {
        if (folio != new_folio) {
                folio_unlock(folio);
                folio_lock(new_folio);
        }
} else {
        folio_unlock(folio);
}




>> Balbir
>>
> --Mika
>


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
  2025-08-05  4:24                                                         ` Mika Penttilä
  2025-08-05  5:19                                                           ` Mika Penttilä
@ 2025-08-05 10:27                                                           ` Balbir Singh
  2025-08-05 10:35                                                             ` Mika Penttilä
  1 sibling, 1 reply; 71+ messages in thread
From: Balbir Singh @ 2025-08-05 10:27 UTC (permalink / raw)
  To: Mika Penttilä, Zi Yan
  Cc: David Hildenbrand, linux-mm, linux-kernel, Karol Herbst,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
	Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
	Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
	Ralph Campbell

On 8/5/25 14:24, Mika Penttilä wrote:
> Hi,
> 
> On 8/5/25 07:10, Balbir Singh wrote:
>> On 8/5/25 09:26, Mika Penttilä wrote:
>>> Hi,
>>>
>>> On 8/5/25 01:46, Balbir Singh wrote:
>>>> On 8/2/25 22:13, Mika Penttilä wrote:
>>>>> Hi,
>>>>>
>>>>> On 8/2/25 13:37, Balbir Singh wrote:
>>>>>> FYI:
>>>>>>
>>>>>> I have the following patch on top of my series that seems to make it work
>>>>>> without requiring the helper to split device private folios
>>>>>>
>>>>> I think this looks much better!
>>>>>
>>>> Thanks!
>>>>
>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>> ---
>>>>>>  include/linux/huge_mm.h |  1 -
>>>>>>  lib/test_hmm.c          | 11 +++++-
>>>>>>  mm/huge_memory.c        | 76 ++++-------------------------------------
>>>>>>  mm/migrate_device.c     | 51 +++++++++++++++++++++++++++
>>>>>>  4 files changed, 67 insertions(+), 72 deletions(-)
>>>>>>
>>>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>>>> index 19e7e3b7c2b7..52d8b435950b 100644
>>>>>> --- a/include/linux/huge_mm.h
>>>>>> +++ b/include/linux/huge_mm.h
>>>>>> @@ -343,7 +343,6 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>>>>>  		vm_flags_t vm_flags);
>>>>>>  
>>>>>>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>>>>>> -int split_device_private_folio(struct folio *folio);
>>>>>>  int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>>>>  		unsigned int new_order, bool unmapped);
>>>>>>  int min_order_for_split(struct folio *folio);
>>>>>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>>>>>> index 341ae2af44ec..444477785882 100644
>>>>>> --- a/lib/test_hmm.c
>>>>>> +++ b/lib/test_hmm.c
>>>>>> @@ -1625,13 +1625,22 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>>>>>>  	 * the mirror but here we use it to hold the page for the simulated
>>>>>>  	 * device memory and that page holds the pointer to the mirror.
>>>>>>  	 */
>>>>>> -	rpage = vmf->page->zone_device_data;
>>>>>> +	rpage = folio_page(page_folio(vmf->page), 0)->zone_device_data;
>>>>>>  	dmirror = rpage->zone_device_data;
>>>>>>  
>>>>>>  	/* FIXME demonstrate how we can adjust migrate range */
>>>>>>  	order = folio_order(page_folio(vmf->page));
>>>>>>  	nr = 1 << order;
>>>>>>  
>>>>>> +	/*
>>>>>> +	 * When folios are partially mapped, we can't rely on the folio
>>>>>> +	 * order of vmf->page as the folio might not be fully split yet
>>>>>> +	 */
>>>>>> +	if (vmf->pte) {
>>>>>> +		order = 0;
>>>>>> +		nr = 1;
>>>>>> +	}
>>>>>> +
>>>>>>  	/*
>>>>>>  	 * Consider a per-cpu cache of src and dst pfns, but with
>>>>>>  	 * large number of cpus that might not scale well.
>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>> index 1fc1efa219c8..863393dec1f1 100644
>>>>>> --- a/mm/huge_memory.c
>>>>>> +++ b/mm/huge_memory.c
>>>>>> @@ -72,10 +72,6 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
>>>>>>  					  struct shrink_control *sc);
>>>>>>  static unsigned long deferred_split_scan(struct shrinker *shrink,
>>>>>>  					 struct shrink_control *sc);
>>>>>> -static int __split_unmapped_folio(struct folio *folio, int new_order,
>>>>>> -		struct page *split_at, struct xa_state *xas,
>>>>>> -		struct address_space *mapping, bool uniform_split);
>>>>>> -
>>>>>>  static bool split_underused_thp = true;
>>>>>>  
>>>>>>  static atomic_t huge_zero_refcount;
>>>>>> @@ -2924,51 +2920,6 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>>>>>>  	pmd_populate(mm, pmd, pgtable);
>>>>>>  }
>>>>>>  
>>>>>> -/**
>>>>>> - * split_huge_device_private_folio - split a huge device private folio into
>>>>>> - * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>> - * split folios for pages that are partially mapped
>>>>>> - *
>>>>>> - * @folio: the folio to split
>>>>>> - *
>>>>>> - * The caller has to hold the folio_lock and a reference via folio_get
>>>>>> - */
>>>>>> -int split_device_private_folio(struct folio *folio)
>>>>>> -{
>>>>>> -	struct folio *end_folio = folio_next(folio);
>>>>>> -	struct folio *new_folio;
>>>>>> -	int ret = 0;
>>>>>> -
>>>>>> -	/*
>>>>>> -	 * Split the folio now. In the case of device
>>>>>> -	 * private pages, this path is executed when
>>>>>> -	 * the pmd is split and since freeze is not true
>>>>>> -	 * it is likely the folio will be deferred_split.
>>>>>> -	 *
>>>>>> -	 * With device private pages, deferred splits of
>>>>>> -	 * folios should be handled here to prevent partial
>>>>>> -	 * unmaps from causing issues later on in migration
>>>>>> -	 * and fault handling flows.
>>>>>> -	 */
>>>>>> -	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>> -	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>>>>> -	VM_WARN_ON(ret);
>>>>>> -	for (new_folio = folio_next(folio); new_folio != end_folio;
>>>>>> -					new_folio = folio_next(new_folio)) {
>>>>>> -		zone_device_private_split_cb(folio, new_folio);
>>>>>> -		folio_ref_unfreeze(new_folio, 1 + folio_expected_ref_count(
>>>>>> -								new_folio));
>>>>>> -	}
>>>>>> -
>>>>>> -	/*
>>>>>> -	 * Mark the end of the folio split for device private THP
>>>>>> -	 * split
>>>>>> -	 */
>>>>>> -	zone_device_private_split_cb(folio, NULL);
>>>>>> -	folio_ref_unfreeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>> -	return ret;
>>>>>> -}
>>>>>> -
>>>>>>  static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>>>  		unsigned long haddr, bool freeze)
>>>>>>  {
>>>>>> @@ -3064,30 +3015,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>>>  				freeze = false;
>>>>>>  			if (!freeze) {
>>>>>>  				rmap_t rmap_flags = RMAP_NONE;
>>>>>> -				unsigned long addr = haddr;
>>>>>> -				struct folio *new_folio;
>>>>>> -				struct folio *end_folio = folio_next(folio);
>>>>>>  
>>>>>>  				if (anon_exclusive)
>>>>>>  					rmap_flags |= RMAP_EXCLUSIVE;
>>>>>>  
>>>>>> -				folio_lock(folio);
>>>>>> -				folio_get(folio);
>>>>>> -
>>>>>> -				split_device_private_folio(folio);
>>>>>> -
>>>>>> -				for (new_folio = folio_next(folio);
>>>>>> -					new_folio != end_folio;
>>>>>> -					new_folio = folio_next(new_folio)) {
>>>>>> -					addr += PAGE_SIZE;
>>>>>> -					folio_unlock(new_folio);
>>>>>> -					folio_add_anon_rmap_ptes(new_folio,
>>>>>> -						&new_folio->page, 1,
>>>>>> -						vma, addr, rmap_flags);
>>>>>> -				}
>>>>>> -				folio_unlock(folio);
>>>>>> -				folio_add_anon_rmap_ptes(folio, &folio->page,
>>>>>> -						1, vma, haddr, rmap_flags);
>>>>>> +				folio_ref_add(folio, HPAGE_PMD_NR - 1);
>>>>>> +				if (anon_exclusive)
>>>>>> +					rmap_flags |= RMAP_EXCLUSIVE;
>>>>>> +				folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
>>>>>> +						 vma, haddr, rmap_flags);
>>>>>>  			}
>>>>>>  		}
>>>>>>  
>>>>>> @@ -4065,7 +4001,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>  	if (nr_shmem_dropped)
>>>>>>  		shmem_uncharge(mapping->host, nr_shmem_dropped);
>>>>>>  
>>>>>> -	if (!ret && is_anon)
>>>>>> +	if (!ret && is_anon && !folio_is_device_private(folio))
>>>>>>  		remap_flags = RMP_USE_SHARED_ZEROPAGE;
>>>>>>  
>>>>>>  	remap_page(folio, 1 << order, remap_flags);
>>>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>>>>> index 49962ea19109..4264c0290d08 100644
>>>>>> --- a/mm/migrate_device.c
>>>>>> +++ b/mm/migrate_device.c
>>>>>> @@ -248,6 +248,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>>  			 * page table entry. Other special swap entries are not
>>>>>>  			 * migratable, and we ignore regular swapped page.
>>>>>>  			 */
>>>>>> +			struct folio *folio;
>>>>>> +
>>>>>>  			entry = pte_to_swp_entry(pte);
>>>>>>  			if (!is_device_private_entry(entry))
>>>>>>  				goto next;
>>>>>> @@ -259,6 +261,55 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>>  			    pgmap->owner != migrate->pgmap_owner)
>>>>>>  				goto next;
>>>>>>  
>>>>>> +			folio = page_folio(page);
>>>>>> +			if (folio_test_large(folio)) {
>>>>>> +				struct folio *new_folio;
>>>>>> +				struct folio *new_fault_folio;
>>>>>> +
>>>>>> +				/*
>>>>>> +				 * The reason for finding pmd present with a
>>>>>> +				 * device private pte and a large folio for the
>>>>>> +				 * pte is partial unmaps. Split the folio now
>>>>>> +				 * for the migration to be handled correctly
>>>>>> +				 */
>>>>>> +				pte_unmap_unlock(ptep, ptl);
>>>>>> +
>>>>>> +				folio_get(folio);
>>>>>> +				if (folio != fault_folio)
>>>>>> +					folio_lock(folio);
>>>>>> +				if (split_folio(folio)) {
>>>>>> +					if (folio != fault_folio)
>>>>>> +						folio_unlock(folio);
>>>>>> +					ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>>>>>> +					goto next;
>>>>>> +				}
>>>>>> +
>>>>> The nouveau migrate_to_ram handler needs adjustment also if split happens.
>>>>>
>>>> test_hmm needs adjustment because of the way the backup folios are setup.
>>> nouveau should check the folio order after the possible split happens.
>>>
>> You mean the folio_split callback?
> 
> no, nouveau_dmem_migrate_to_ram():
> ..
>         sfolio = page_folio(vmf->page);
>         order = folio_order(sfolio);
> ...
>         migrate_vma_setup()
> ..
> if sfolio is split order still reflects the pre-split order
> 

Will fix, good catch!

>>
>>>>>> +				/*
>>>>>> +				 * After the split, get back the extra reference
>>>>>> +				 * on the fault_page, this reference is checked during
>>>>>> +				 * folio_migrate_mapping()
>>>>>> +				 */
>>>>>> +				if (migrate->fault_page) {
>>>>>> +					new_fault_folio = page_folio(migrate->fault_page);
>>>>>> +					folio_get(new_fault_folio);
>>>>>> +				}
>>>>>> +
>>>>>> +				new_folio = page_folio(page);
>>>>>> +				pfn = page_to_pfn(page);
>>>>>> +
>>>>>> +				/*
>>>>>> +				 * Ensure the lock is held on the correct
>>>>>> +				 * folio after the split
>>>>>> +				 */
>>>>>> +				if (folio != new_folio) {
>>>>>> +					folio_unlock(folio);
>>>>>> +					folio_lock(new_folio);
>>>>>> +				}
>>>>> Maybe careful not to unlock fault_page ?
>>>>>
>>>> split_page will unlock everything but the original folio, the code takes the lock
>>>> on the folio corresponding to the new folio
>>> I mean do_swap_page() unlocks folio of fault_page and expects it to remain locked.
>>>
>> Not sure I follow what you're trying to elaborate on here
> 
> do_swap_page:
> ..
>         if (trylock_page(vmf->page)) {
>                 ret = pgmap->ops->migrate_to_ram(vmf);
>                                <- vmf->page should be locked here even after split
>                 unlock_page(vmf->page);
> 

Yep, the split will unlock all tail folios, leaving just the head folio
locked. With this change, the lock we need to hold is the folio lock
associated with fault_page's pte entry, and we must not unlock it when the
cause is a fault. The code seems to do the right thing there; I double
checked, and it does.
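
To make the expectation explicit, something like the below could go in after
the split (just a sketch of the invariant, not part of the posted patch):

				/*
				 * If this migration was triggered by a fault,
				 * the folio now backing migrate->fault_page
				 * must still be locked when we return, since
				 * do_swap_page() will unlock_page(vmf->page).
				 */
				if (migrate->fault_page)
					VM_WARN_ON_ONCE(!folio_test_locked(
						page_folio(migrate->fault_page)));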

Balbir

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
  2025-08-05 10:27                                                           ` Balbir Singh
@ 2025-08-05 10:35                                                             ` Mika Penttilä
  2025-08-05 10:36                                                               ` Balbir Singh
  0 siblings, 1 reply; 71+ messages in thread
From: Mika Penttilä @ 2025-08-05 10:35 UTC (permalink / raw)
  To: Balbir Singh, Zi Yan
  Cc: David Hildenbrand, linux-mm, linux-kernel, Karol Herbst,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
	Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
	Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
	Ralph Campbell


On 8/5/25 13:27, Balbir Singh wrote:

> On 8/5/25 14:24, Mika Penttilä wrote:
>> Hi,
>>
>> On 8/5/25 07:10, Balbir Singh wrote:
>>> On 8/5/25 09:26, Mika Penttilä wrote:
>>>> Hi,
>>>>
>>>> On 8/5/25 01:46, Balbir Singh wrote:
>>>>> On 8/2/25 22:13, Mika Penttilä wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 8/2/25 13:37, Balbir Singh wrote:
>>>>>>> FYI:
>>>>>>>
>>>>>>> I have the following patch on top of my series that seems to make it work
>>>>>>> without requiring the helper to split device private folios
>>>>>>>
>>>>>> I think this looks much better!
>>>>>>
>>>>> Thanks!
>>>>>
>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>> ---
>>>>>>>  include/linux/huge_mm.h |  1 -
>>>>>>>  lib/test_hmm.c          | 11 +++++-
>>>>>>>  mm/huge_memory.c        | 76 ++++-------------------------------------
>>>>>>>  mm/migrate_device.c     | 51 +++++++++++++++++++++++++++
>>>>>>>  4 files changed, 67 insertions(+), 72 deletions(-)
>>>>>>>
>>>>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>>>>> index 19e7e3b7c2b7..52d8b435950b 100644
>>>>>>> --- a/include/linux/huge_mm.h
>>>>>>> +++ b/include/linux/huge_mm.h
>>>>>>> @@ -343,7 +343,6 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>>>>>>  		vm_flags_t vm_flags);
>>>>>>>  
>>>>>>>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>>>>>>> -int split_device_private_folio(struct folio *folio);
>>>>>>>  int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>>>>>  		unsigned int new_order, bool unmapped);
>>>>>>>  int min_order_for_split(struct folio *folio);
>>>>>>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>>>>>>> index 341ae2af44ec..444477785882 100644
>>>>>>> --- a/lib/test_hmm.c
>>>>>>> +++ b/lib/test_hmm.c
>>>>>>> @@ -1625,13 +1625,22 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>>>>>>>  	 * the mirror but here we use it to hold the page for the simulated
>>>>>>>  	 * device memory and that page holds the pointer to the mirror.
>>>>>>>  	 */
>>>>>>> -	rpage = vmf->page->zone_device_data;
>>>>>>> +	rpage = folio_page(page_folio(vmf->page), 0)->zone_device_data;
>>>>>>>  	dmirror = rpage->zone_device_data;
>>>>>>>  
>>>>>>>  	/* FIXME demonstrate how we can adjust migrate range */
>>>>>>>  	order = folio_order(page_folio(vmf->page));
>>>>>>>  	nr = 1 << order;
>>>>>>>  
>>>>>>> +	/*
>>>>>>> +	 * When folios are partially mapped, we can't rely on the folio
>>>>>>> +	 * order of vmf->page as the folio might not be fully split yet
>>>>>>> +	 */
>>>>>>> +	if (vmf->pte) {
>>>>>>> +		order = 0;
>>>>>>> +		nr = 1;
>>>>>>> +	}
>>>>>>> +
>>>>>>>  	/*
>>>>>>>  	 * Consider a per-cpu cache of src and dst pfns, but with
>>>>>>>  	 * large number of cpus that might not scale well.
>>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>>> index 1fc1efa219c8..863393dec1f1 100644
>>>>>>> --- a/mm/huge_memory.c
>>>>>>> +++ b/mm/huge_memory.c
>>>>>>> @@ -72,10 +72,6 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
>>>>>>>  					  struct shrink_control *sc);
>>>>>>>  static unsigned long deferred_split_scan(struct shrinker *shrink,
>>>>>>>  					 struct shrink_control *sc);
>>>>>>> -static int __split_unmapped_folio(struct folio *folio, int new_order,
>>>>>>> -		struct page *split_at, struct xa_state *xas,
>>>>>>> -		struct address_space *mapping, bool uniform_split);
>>>>>>> -
>>>>>>>  static bool split_underused_thp = true;
>>>>>>>  
>>>>>>>  static atomic_t huge_zero_refcount;
>>>>>>> @@ -2924,51 +2920,6 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>>>>>>>  	pmd_populate(mm, pmd, pgtable);
>>>>>>>  }
>>>>>>>  
>>>>>>> -/**
>>>>>>> - * split_huge_device_private_folio - split a huge device private folio into
>>>>>>> - * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>> - * split folios for pages that are partially mapped
>>>>>>> - *
>>>>>>> - * @folio: the folio to split
>>>>>>> - *
>>>>>>> - * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>> - */
>>>>>>> -int split_device_private_folio(struct folio *folio)
>>>>>>> -{
>>>>>>> -	struct folio *end_folio = folio_next(folio);
>>>>>>> -	struct folio *new_folio;
>>>>>>> -	int ret = 0;
>>>>>>> -
>>>>>>> -	/*
>>>>>>> -	 * Split the folio now. In the case of device
>>>>>>> -	 * private pages, this path is executed when
>>>>>>> -	 * the pmd is split and since freeze is not true
>>>>>>> -	 * it is likely the folio will be deferred_split.
>>>>>>> -	 *
>>>>>>> -	 * With device private pages, deferred splits of
>>>>>>> -	 * folios should be handled here to prevent partial
>>>>>>> -	 * unmaps from causing issues later on in migration
>>>>>>> -	 * and fault handling flows.
>>>>>>> -	 */
>>>>>>> -	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>> -	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>>>>>> -	VM_WARN_ON(ret);
>>>>>>> -	for (new_folio = folio_next(folio); new_folio != end_folio;
>>>>>>> -					new_folio = folio_next(new_folio)) {
>>>>>>> -		zone_device_private_split_cb(folio, new_folio);
>>>>>>> -		folio_ref_unfreeze(new_folio, 1 + folio_expected_ref_count(
>>>>>>> -								new_folio));
>>>>>>> -	}
>>>>>>> -
>>>>>>> -	/*
>>>>>>> -	 * Mark the end of the folio split for device private THP
>>>>>>> -	 * split
>>>>>>> -	 */
>>>>>>> -	zone_device_private_split_cb(folio, NULL);
>>>>>>> -	folio_ref_unfreeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>> -	return ret;
>>>>>>> -}
>>>>>>> -
>>>>>>>  static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>>>>  		unsigned long haddr, bool freeze)
>>>>>>>  {
>>>>>>> @@ -3064,30 +3015,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>>>>  				freeze = false;
>>>>>>>  			if (!freeze) {
>>>>>>>  				rmap_t rmap_flags = RMAP_NONE;
>>>>>>> -				unsigned long addr = haddr;
>>>>>>> -				struct folio *new_folio;
>>>>>>> -				struct folio *end_folio = folio_next(folio);
>>>>>>>  
>>>>>>>  				if (anon_exclusive)
>>>>>>>  					rmap_flags |= RMAP_EXCLUSIVE;
>>>>>>>  
>>>>>>> -				folio_lock(folio);
>>>>>>> -				folio_get(folio);
>>>>>>> -
>>>>>>> -				split_device_private_folio(folio);
>>>>>>> -
>>>>>>> -				for (new_folio = folio_next(folio);
>>>>>>> -					new_folio != end_folio;
>>>>>>> -					new_folio = folio_next(new_folio)) {
>>>>>>> -					addr += PAGE_SIZE;
>>>>>>> -					folio_unlock(new_folio);
>>>>>>> -					folio_add_anon_rmap_ptes(new_folio,
>>>>>>> -						&new_folio->page, 1,
>>>>>>> -						vma, addr, rmap_flags);
>>>>>>> -				}
>>>>>>> -				folio_unlock(folio);
>>>>>>> -				folio_add_anon_rmap_ptes(folio, &folio->page,
>>>>>>> -						1, vma, haddr, rmap_flags);
>>>>>>> +				folio_ref_add(folio, HPAGE_PMD_NR - 1);
>>>>>>> +				if (anon_exclusive)
>>>>>>> +					rmap_flags |= RMAP_EXCLUSIVE;
>>>>>>> +				folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
>>>>>>> +						 vma, haddr, rmap_flags);
>>>>>>>  			}
>>>>>>>  		}
>>>>>>>  
>>>>>>> @@ -4065,7 +4001,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>  	if (nr_shmem_dropped)
>>>>>>>  		shmem_uncharge(mapping->host, nr_shmem_dropped);
>>>>>>>  
>>>>>>> -	if (!ret && is_anon)
>>>>>>> +	if (!ret && is_anon && !folio_is_device_private(folio))
>>>>>>>  		remap_flags = RMP_USE_SHARED_ZEROPAGE;
>>>>>>>  
>>>>>>>  	remap_page(folio, 1 << order, remap_flags);
>>>>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>>>>>> index 49962ea19109..4264c0290d08 100644
>>>>>>> --- a/mm/migrate_device.c
>>>>>>> +++ b/mm/migrate_device.c
>>>>>>> @@ -248,6 +248,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>>>  			 * page table entry. Other special swap entries are not
>>>>>>>  			 * migratable, and we ignore regular swapped page.
>>>>>>>  			 */
>>>>>>> +			struct folio *folio;
>>>>>>> +
>>>>>>>  			entry = pte_to_swp_entry(pte);
>>>>>>>  			if (!is_device_private_entry(entry))
>>>>>>>  				goto next;
>>>>>>> @@ -259,6 +261,55 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>>>  			    pgmap->owner != migrate->pgmap_owner)
>>>>>>>  				goto next;
>>>>>>>  
>>>>>>> +			folio = page_folio(page);
>>>>>>> +			if (folio_test_large(folio)) {
>>>>>>> +				struct folio *new_folio;
>>>>>>> +				struct folio *new_fault_folio;
>>>>>>> +
>>>>>>> +				/*
>>>>>>> +				 * The reason for finding pmd present with a
>>>>>>> +				 * device private pte and a large folio for the
>>>>>>> +				 * pte is partial unmaps. Split the folio now
>>>>>>> +				 * for the migration to be handled correctly
>>>>>>> +				 */
>>>>>>> +				pte_unmap_unlock(ptep, ptl);
>>>>>>> +
>>>>>>> +				folio_get(folio);
>>>>>>> +				if (folio != fault_folio)
>>>>>>> +					folio_lock(folio);
>>>>>>> +				if (split_folio(folio)) {
>>>>>>> +					if (folio != fault_folio)
>>>>>>> +						folio_unlock(folio);
>>>>>>> +					ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>>>>>>> +					goto next;
>>>>>>> +				}
>>>>>>> +
>>>>>> The nouveau migrate_to_ram handler needs adjustment also if split happens.
>>>>>>
>>>>> test_hmm needs adjustment because of the way the backup folios are setup.
>>>> nouveau should check the folio order after the possible split happens.
>>>>
>>> You mean the folio_split callback?
>> no, nouveau_dmem_migrate_to_ram():
>> ..
>>         sfolio = page_folio(vmf->page);
>>         order = folio_order(sfolio);
>> ...
>>         migrate_vma_setup()
>> ..
>> if sfolio is split order still reflects the pre-split order
>>
> Will fix, good catch!
>
>>>>>>> +				/*
>>>>>>> +				 * After the split, get back the extra reference
>>>>>>> +				 * on the fault_page, this reference is checked during
>>>>>>> +				 * folio_migrate_mapping()
>>>>>>> +				 */
>>>>>>> +				if (migrate->fault_page) {
>>>>>>> +					new_fault_folio = page_folio(migrate->fault_page);
>>>>>>> +					folio_get(new_fault_folio);
>>>>>>> +				}
>>>>>>> +
>>>>>>> +				new_folio = page_folio(page);
>>>>>>> +				pfn = page_to_pfn(page);
>>>>>>> +
>>>>>>> +				/*
>>>>>>> +				 * Ensure the lock is held on the correct
>>>>>>> +				 * folio after the split
>>>>>>> +				 */
>>>>>>> +				if (folio != new_folio) {
>>>>>>> +					folio_unlock(folio);
>>>>>>> +					folio_lock(new_folio);
>>>>>>> +				}
>>>>>> Maybe careful not to unlock fault_page ?
>>>>>>
>>>>> split_page will unlock everything but the original folio, the code takes the lock
>>>>> on the folio corresponding to the new folio
>>>> I mean do_swap_page() unlocks folio of fault_page and expects it to remain locked.
>>>>
>>> Not sure I follow what you're trying to elaborate on here
>> do_swap_page:
>> ..
>>         if (trylock_page(vmf->page)) {
>>                 ret = pgmap->ops->migrate_to_ram(vmf);
>>                                <- vmf->page should be locked here even after split
>>                 unlock_page(vmf->page);
>>
> Yep, the split will unlock all tail folios, leaving the just head folio locked
> and this the change, the lock we need to hold is the folio lock associated with
> fault_page, pte entry and not unlock when the cause is a fault. The code seems
> to do the right thing there, let me double check

Yes, the fault case is ok. But if the migrate is not for a fault, we should not leave any page locked.

>
> Balbir
> and the code does the right thing there.
>
--Mika


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
  2025-08-05 10:35                                                             ` Mika Penttilä
@ 2025-08-05 10:36                                                               ` Balbir Singh
  2025-08-05 10:46                                                                 ` Mika Penttilä
  0 siblings, 1 reply; 71+ messages in thread
From: Balbir Singh @ 2025-08-05 10:36 UTC (permalink / raw)
  To: Mika Penttilä, Zi Yan
  Cc: David Hildenbrand, linux-mm, linux-kernel, Karol Herbst,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
	Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
	Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
	Ralph Campbell

On 8/5/25 20:35, Mika Penttilä wrote:
> 
> On 8/5/25 13:27, Balbir Singh wrote:
> 
>> On 8/5/25 14:24, Mika Penttilä wrote:
>>> Hi,
>>>
>>> On 8/5/25 07:10, Balbir Singh wrote:
>>>> On 8/5/25 09:26, Mika Penttilä wrote:
>>>>> Hi,
>>>>>
>>>>> On 8/5/25 01:46, Balbir Singh wrote:
>>>>>> On 8/2/25 22:13, Mika Penttilä wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 8/2/25 13:37, Balbir Singh wrote:
>>>>>>>> FYI:
>>>>>>>>
>>>>>>>> I have the following patch on top of my series that seems to make it work
>>>>>>>> without requiring the helper to split device private folios
>>>>>>>>
>>>>>>> I think this looks much better!
>>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>> ---
>>>>>>>>  include/linux/huge_mm.h |  1 -
>>>>>>>>  lib/test_hmm.c          | 11 +++++-
>>>>>>>>  mm/huge_memory.c        | 76 ++++-------------------------------------
>>>>>>>>  mm/migrate_device.c     | 51 +++++++++++++++++++++++++++
>>>>>>>>  4 files changed, 67 insertions(+), 72 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>>>>>> index 19e7e3b7c2b7..52d8b435950b 100644
>>>>>>>> --- a/include/linux/huge_mm.h
>>>>>>>> +++ b/include/linux/huge_mm.h
>>>>>>>> @@ -343,7 +343,6 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>>>>>>>  		vm_flags_t vm_flags);
>>>>>>>>  
>>>>>>>>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>>>>>>>> -int split_device_private_folio(struct folio *folio);
>>>>>>>>  int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>>>>>>  		unsigned int new_order, bool unmapped);
>>>>>>>>  int min_order_for_split(struct folio *folio);
>>>>>>>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>>>>>>>> index 341ae2af44ec..444477785882 100644
>>>>>>>> --- a/lib/test_hmm.c
>>>>>>>> +++ b/lib/test_hmm.c
>>>>>>>> @@ -1625,13 +1625,22 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>>>>>>>>  	 * the mirror but here we use it to hold the page for the simulated
>>>>>>>>  	 * device memory and that page holds the pointer to the mirror.
>>>>>>>>  	 */
>>>>>>>> -	rpage = vmf->page->zone_device_data;
>>>>>>>> +	rpage = folio_page(page_folio(vmf->page), 0)->zone_device_data;
>>>>>>>>  	dmirror = rpage->zone_device_data;
>>>>>>>>  
>>>>>>>>  	/* FIXME demonstrate how we can adjust migrate range */
>>>>>>>>  	order = folio_order(page_folio(vmf->page));
>>>>>>>>  	nr = 1 << order;
>>>>>>>>  
>>>>>>>> +	/*
>>>>>>>> +	 * When folios are partially mapped, we can't rely on the folio
>>>>>>>> +	 * order of vmf->page as the folio might not be fully split yet
>>>>>>>> +	 */
>>>>>>>> +	if (vmf->pte) {
>>>>>>>> +		order = 0;
>>>>>>>> +		nr = 1;
>>>>>>>> +	}
>>>>>>>> +
>>>>>>>>  	/*
>>>>>>>>  	 * Consider a per-cpu cache of src and dst pfns, but with
>>>>>>>>  	 * large number of cpus that might not scale well.
>>>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>>>> index 1fc1efa219c8..863393dec1f1 100644
>>>>>>>> --- a/mm/huge_memory.c
>>>>>>>> +++ b/mm/huge_memory.c
>>>>>>>> @@ -72,10 +72,6 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
>>>>>>>>  					  struct shrink_control *sc);
>>>>>>>>  static unsigned long deferred_split_scan(struct shrinker *shrink,
>>>>>>>>  					 struct shrink_control *sc);
>>>>>>>> -static int __split_unmapped_folio(struct folio *folio, int new_order,
>>>>>>>> -		struct page *split_at, struct xa_state *xas,
>>>>>>>> -		struct address_space *mapping, bool uniform_split);
>>>>>>>> -
>>>>>>>>  static bool split_underused_thp = true;
>>>>>>>>  
>>>>>>>>  static atomic_t huge_zero_refcount;
>>>>>>>> @@ -2924,51 +2920,6 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>>>>>>>>  	pmd_populate(mm, pmd, pgtable);
>>>>>>>>  }
>>>>>>>>  
>>>>>>>> -/**
>>>>>>>> - * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>> - * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>> - * split folios for pages that are partially mapped
>>>>>>>> - *
>>>>>>>> - * @folio: the folio to split
>>>>>>>> - *
>>>>>>>> - * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>> - */
>>>>>>>> -int split_device_private_folio(struct folio *folio)
>>>>>>>> -{
>>>>>>>> -	struct folio *end_folio = folio_next(folio);
>>>>>>>> -	struct folio *new_folio;
>>>>>>>> -	int ret = 0;
>>>>>>>> -
>>>>>>>> -	/*
>>>>>>>> -	 * Split the folio now. In the case of device
>>>>>>>> -	 * private pages, this path is executed when
>>>>>>>> -	 * the pmd is split and since freeze is not true
>>>>>>>> -	 * it is likely the folio will be deferred_split.
>>>>>>>> -	 *
>>>>>>>> -	 * With device private pages, deferred splits of
>>>>>>>> -	 * folios should be handled here to prevent partial
>>>>>>>> -	 * unmaps from causing issues later on in migration
>>>>>>>> -	 * and fault handling flows.
>>>>>>>> -	 */
>>>>>>>> -	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>> -	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>>>>>>> -	VM_WARN_ON(ret);
>>>>>>>> -	for (new_folio = folio_next(folio); new_folio != end_folio;
>>>>>>>> -					new_folio = folio_next(new_folio)) {
>>>>>>>> -		zone_device_private_split_cb(folio, new_folio);
>>>>>>>> -		folio_ref_unfreeze(new_folio, 1 + folio_expected_ref_count(
>>>>>>>> -								new_folio));
>>>>>>>> -	}
>>>>>>>> -
>>>>>>>> -	/*
>>>>>>>> -	 * Mark the end of the folio split for device private THP
>>>>>>>> -	 * split
>>>>>>>> -	 */
>>>>>>>> -	zone_device_private_split_cb(folio, NULL);
>>>>>>>> -	folio_ref_unfreeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>> -	return ret;
>>>>>>>> -}
>>>>>>>> -
>>>>>>>>  static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>>>>>  		unsigned long haddr, bool freeze)
>>>>>>>>  {
>>>>>>>> @@ -3064,30 +3015,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>>>>>  				freeze = false;
>>>>>>>>  			if (!freeze) {
>>>>>>>>  				rmap_t rmap_flags = RMAP_NONE;
>>>>>>>> -				unsigned long addr = haddr;
>>>>>>>> -				struct folio *new_folio;
>>>>>>>> -				struct folio *end_folio = folio_next(folio);
>>>>>>>>  
>>>>>>>>  				if (anon_exclusive)
>>>>>>>>  					rmap_flags |= RMAP_EXCLUSIVE;
>>>>>>>>  
>>>>>>>> -				folio_lock(folio);
>>>>>>>> -				folio_get(folio);
>>>>>>>> -
>>>>>>>> -				split_device_private_folio(folio);
>>>>>>>> -
>>>>>>>> -				for (new_folio = folio_next(folio);
>>>>>>>> -					new_folio != end_folio;
>>>>>>>> -					new_folio = folio_next(new_folio)) {
>>>>>>>> -					addr += PAGE_SIZE;
>>>>>>>> -					folio_unlock(new_folio);
>>>>>>>> -					folio_add_anon_rmap_ptes(new_folio,
>>>>>>>> -						&new_folio->page, 1,
>>>>>>>> -						vma, addr, rmap_flags);
>>>>>>>> -				}
>>>>>>>> -				folio_unlock(folio);
>>>>>>>> -				folio_add_anon_rmap_ptes(folio, &folio->page,
>>>>>>>> -						1, vma, haddr, rmap_flags);
>>>>>>>> +				folio_ref_add(folio, HPAGE_PMD_NR - 1);
>>>>>>>> +				if (anon_exclusive)
>>>>>>>> +					rmap_flags |= RMAP_EXCLUSIVE;
>>>>>>>> +				folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
>>>>>>>> +						 vma, haddr, rmap_flags);
>>>>>>>>  			}
>>>>>>>>  		}
>>>>>>>>  
>>>>>>>> @@ -4065,7 +4001,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>  	if (nr_shmem_dropped)
>>>>>>>>  		shmem_uncharge(mapping->host, nr_shmem_dropped);
>>>>>>>>  
>>>>>>>> -	if (!ret && is_anon)
>>>>>>>> +	if (!ret && is_anon && !folio_is_device_private(folio))
>>>>>>>>  		remap_flags = RMP_USE_SHARED_ZEROPAGE;
>>>>>>>>  
>>>>>>>>  	remap_page(folio, 1 << order, remap_flags);
>>>>>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>>>>>>> index 49962ea19109..4264c0290d08 100644
>>>>>>>> --- a/mm/migrate_device.c
>>>>>>>> +++ b/mm/migrate_device.c
>>>>>>>> @@ -248,6 +248,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>>>>  			 * page table entry. Other special swap entries are not
>>>>>>>>  			 * migratable, and we ignore regular swapped page.
>>>>>>>>  			 */
>>>>>>>> +			struct folio *folio;
>>>>>>>> +
>>>>>>>>  			entry = pte_to_swp_entry(pte);
>>>>>>>>  			if (!is_device_private_entry(entry))
>>>>>>>>  				goto next;
>>>>>>>> @@ -259,6 +261,55 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>>>>  			    pgmap->owner != migrate->pgmap_owner)
>>>>>>>>  				goto next;
>>>>>>>>  
>>>>>>>> +			folio = page_folio(page);
>>>>>>>> +			if (folio_test_large(folio)) {
>>>>>>>> +				struct folio *new_folio;
>>>>>>>> +				struct folio *new_fault_folio;
>>>>>>>> +
>>>>>>>> +				/*
>>>>>>>> +				 * The reason for finding pmd present with a
>>>>>>>> +				 * device private pte and a large folio for the
>>>>>>>> +				 * pte is partial unmaps. Split the folio now
>>>>>>>> +				 * for the migration to be handled correctly
>>>>>>>> +				 */
>>>>>>>> +				pte_unmap_unlock(ptep, ptl);
>>>>>>>> +
>>>>>>>> +				folio_get(folio);
>>>>>>>> +				if (folio != fault_folio)
>>>>>>>> +					folio_lock(folio);
>>>>>>>> +				if (split_folio(folio)) {
>>>>>>>> +					if (folio != fault_folio)
>>>>>>>> +						folio_unlock(folio);
>>>>>>>> +					ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>>>>>>>> +					goto next;
>>>>>>>> +				}
>>>>>>>> +
>>>>>>> The nouveau migrate_to_ram handler needs adjustment also if split happens.
>>>>>>>
>>>>>> test_hmm needs adjustment because of the way the backup folios are setup.
>>>>> nouveau should check the folio order after the possible split happens.
>>>>>
>>>> You mean the folio_split callback?
>>> no, nouveau_dmem_migrate_to_ram():
>>> ..
>>>         sfolio = page_folio(vmf->page);
>>>         order = folio_order(sfolio);
>>> ...
>>>         migrate_vma_setup()
>>> ..
>>> if sfolio is split order still reflects the pre-split order
>>>
>> Will fix, good catch!
>>
>>>>>>>> +				/*
>>>>>>>> +				 * After the split, get back the extra reference
>>>>>>>> +				 * on the fault_page, this reference is checked during
>>>>>>>> +				 * folio_migrate_mapping()
>>>>>>>> +				 */
>>>>>>>> +				if (migrate->fault_page) {
>>>>>>>> +					new_fault_folio = page_folio(migrate->fault_page);
>>>>>>>> +					folio_get(new_fault_folio);
>>>>>>>> +				}
>>>>>>>> +
>>>>>>>> +				new_folio = page_folio(page);
>>>>>>>> +				pfn = page_to_pfn(page);
>>>>>>>> +
>>>>>>>> +				/*
>>>>>>>> +				 * Ensure the lock is held on the correct
>>>>>>>> +				 * folio after the split
>>>>>>>> +				 */
>>>>>>>> +				if (folio != new_folio) {
>>>>>>>> +					folio_unlock(folio);
>>>>>>>> +					folio_lock(new_folio);
>>>>>>>> +				}
>>>>>>> Maybe careful not to unlock fault_page ?
>>>>>>>
>>>>>> split_page will unlock everything but the original folio, the code takes the lock
>>>>>> on the folio corresponding to the new folio
>>>>> I mean do_swap_page() unlocks folio of fault_page and expects it to remain locked.
>>>>>
>>>> Not sure I follow what you're trying to elaborate on here
>>> do_swap_page:
>>> ..
>>>         if (trylock_page(vmf->page)) {
>>>                 ret = pgmap->ops->migrate_to_ram(vmf);
>>>                                <- vmf->page should be locked here even after split
>>>                 unlock_page(vmf->page);
>>>
>> Yep, the split will unlock all tail folios, leaving the just head folio locked
>> and this the change, the lock we need to hold is the folio lock associated with
>> fault_page, pte entry and not unlock when the cause is a fault. The code seems
>> to do the right thing there, let me double check
> 
> Yes the fault case is ok. But if migrate not for a fault, we should not leave any page locked
> 

migrate_vma_finalize() handles this
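
For reference, the finalize path already skips the fault folio when it
unlocks the collected pages; roughly (paraphrased from my reading of the
finalize code in mm/migrate_device.c, not verbatim):

	remove_migration_ptes(src, dst, 0);
	if (fault_folio != src)
		folio_unlock(src);
	folio_put(src);

so non-fault pages are unlocked there, and only the fault folio is left
locked for do_swap_page() to unlock.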

Balbir

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
  2025-08-05 10:36                                                               ` Balbir Singh
@ 2025-08-05 10:46                                                                 ` Mika Penttilä
  0 siblings, 0 replies; 71+ messages in thread
From: Mika Penttilä @ 2025-08-05 10:46 UTC (permalink / raw)
  To: Balbir Singh, Zi Yan
  Cc: David Hildenbrand, linux-mm, linux-kernel, Karol Herbst,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
	Ryan Roberts, Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu,
	Alistair Popple, Donet Tom, Matthew Brost, Francois Dugast,
	Ralph Campbell


On 8/5/25 13:36, Balbir Singh wrote:
> On 8/5/25 20:35, Mika Penttilä wrote:
>> On 8/5/25 13:27, Balbir Singh wrote:
>>
>>> On 8/5/25 14:24, Mika Penttilä wrote:
>>>> Hi,
>>>>
>>>> On 8/5/25 07:10, Balbir Singh wrote:
>>>>> On 8/5/25 09:26, Mika Penttilä wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 8/5/25 01:46, Balbir Singh wrote:
>>>>>>> On 8/2/25 22:13, Mika Penttilä wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On 8/2/25 13:37, Balbir Singh wrote:
>>>>>>>>> FYI:
>>>>>>>>>
>>>>>>>>> I have the following patch on top of my series that seems to make it work
>>>>>>>>> without requiring the helper to split device private folios
>>>>>>>>>
>>>>>>>> I think this looks much better!
>>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>> ---
>>>>>>>>>  include/linux/huge_mm.h |  1 -
>>>>>>>>>  lib/test_hmm.c          | 11 +++++-
>>>>>>>>>  mm/huge_memory.c        | 76 ++++-------------------------------------
>>>>>>>>>  mm/migrate_device.c     | 51 +++++++++++++++++++++++++++
>>>>>>>>>  4 files changed, 67 insertions(+), 72 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>>>>>>> index 19e7e3b7c2b7..52d8b435950b 100644
>>>>>>>>> --- a/include/linux/huge_mm.h
>>>>>>>>> +++ b/include/linux/huge_mm.h
>>>>>>>>> @@ -343,7 +343,6 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>>>>>>>>  		vm_flags_t vm_flags);
>>>>>>>>>  
>>>>>>>>>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>>>>>>>>> -int split_device_private_folio(struct folio *folio);
>>>>>>>>>  int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>>>>>>>  		unsigned int new_order, bool unmapped);
>>>>>>>>>  int min_order_for_split(struct folio *folio);
>>>>>>>>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>>>>>>>>> index 341ae2af44ec..444477785882 100644
>>>>>>>>> --- a/lib/test_hmm.c
>>>>>>>>> +++ b/lib/test_hmm.c
>>>>>>>>> @@ -1625,13 +1625,22 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>>>>>>>>>  	 * the mirror but here we use it to hold the page for the simulated
>>>>>>>>>  	 * device memory and that page holds the pointer to the mirror.
>>>>>>>>>  	 */
>>>>>>>>> -	rpage = vmf->page->zone_device_data;
>>>>>>>>> +	rpage = folio_page(page_folio(vmf->page), 0)->zone_device_data;
>>>>>>>>>  	dmirror = rpage->zone_device_data;
>>>>>>>>>  
>>>>>>>>>  	/* FIXME demonstrate how we can adjust migrate range */
>>>>>>>>>  	order = folio_order(page_folio(vmf->page));
>>>>>>>>>  	nr = 1 << order;
>>>>>>>>>  
>>>>>>>>> +	/*
>>>>>>>>> +	 * When folios are partially mapped, we can't rely on the folio
>>>>>>>>> +	 * order of vmf->page as the folio might not be fully split yet
>>>>>>>>> +	 */
>>>>>>>>> +	if (vmf->pte) {
>>>>>>>>> +		order = 0;
>>>>>>>>> +		nr = 1;
>>>>>>>>> +	}
>>>>>>>>> +
>>>>>>>>>  	/*
>>>>>>>>>  	 * Consider a per-cpu cache of src and dst pfns, but with
>>>>>>>>>  	 * large number of cpus that might not scale well.
>>>>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>>>>> index 1fc1efa219c8..863393dec1f1 100644
>>>>>>>>> --- a/mm/huge_memory.c
>>>>>>>>> +++ b/mm/huge_memory.c
>>>>>>>>> @@ -72,10 +72,6 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
>>>>>>>>>  					  struct shrink_control *sc);
>>>>>>>>>  static unsigned long deferred_split_scan(struct shrinker *shrink,
>>>>>>>>>  					 struct shrink_control *sc);
>>>>>>>>> -static int __split_unmapped_folio(struct folio *folio, int new_order,
>>>>>>>>> -		struct page *split_at, struct xa_state *xas,
>>>>>>>>> -		struct address_space *mapping, bool uniform_split);
>>>>>>>>> -
>>>>>>>>>  static bool split_underused_thp = true;
>>>>>>>>>  
>>>>>>>>>  static atomic_t huge_zero_refcount;
>>>>>>>>> @@ -2924,51 +2920,6 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>>>>>>>>>  	pmd_populate(mm, pmd, pgtable);
>>>>>>>>>  }
>>>>>>>>>  
>>>>>>>>> -/**
>>>>>>>>> - * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>> - * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>> - * split folios for pages that are partially mapped
>>>>>>>>> - *
>>>>>>>>> - * @folio: the folio to split
>>>>>>>>> - *
>>>>>>>>> - * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>> - */
>>>>>>>>> -int split_device_private_folio(struct folio *folio)
>>>>>>>>> -{
>>>>>>>>> -	struct folio *end_folio = folio_next(folio);
>>>>>>>>> -	struct folio *new_folio;
>>>>>>>>> -	int ret = 0;
>>>>>>>>> -
>>>>>>>>> -	/*
>>>>>>>>> -	 * Split the folio now. In the case of device
>>>>>>>>> -	 * private pages, this path is executed when
>>>>>>>>> -	 * the pmd is split and since freeze is not true
>>>>>>>>> -	 * it is likely the folio will be deferred_split.
>>>>>>>>> -	 *
>>>>>>>>> -	 * With device private pages, deferred splits of
>>>>>>>>> -	 * folios should be handled here to prevent partial
>>>>>>>>> -	 * unmaps from causing issues later on in migration
>>>>>>>>> -	 * and fault handling flows.
>>>>>>>>> -	 */
>>>>>>>>> -	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>> -	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>>>>>>>> -	VM_WARN_ON(ret);
>>>>>>>>> -	for (new_folio = folio_next(folio); new_folio != end_folio;
>>>>>>>>> -					new_folio = folio_next(new_folio)) {
>>>>>>>>> -		zone_device_private_split_cb(folio, new_folio);
>>>>>>>>> -		folio_ref_unfreeze(new_folio, 1 + folio_expected_ref_count(
>>>>>>>>> -								new_folio));
>>>>>>>>> -	}
>>>>>>>>> -
>>>>>>>>> -	/*
>>>>>>>>> -	 * Mark the end of the folio split for device private THP
>>>>>>>>> -	 * split
>>>>>>>>> -	 */
>>>>>>>>> -	zone_device_private_split_cb(folio, NULL);
>>>>>>>>> -	folio_ref_unfreeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>> -	return ret;
>>>>>>>>> -}
>>>>>>>>> -
>>>>>>>>>  static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>>>>>>  		unsigned long haddr, bool freeze)
>>>>>>>>>  {
>>>>>>>>> @@ -3064,30 +3015,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>>>>>>  				freeze = false;
>>>>>>>>>  			if (!freeze) {
>>>>>>>>>  				rmap_t rmap_flags = RMAP_NONE;
>>>>>>>>> -				unsigned long addr = haddr;
>>>>>>>>> -				struct folio *new_folio;
>>>>>>>>> -				struct folio *end_folio = folio_next(folio);
>>>>>>>>>  
>>>>>>>>>  				if (anon_exclusive)
>>>>>>>>>  					rmap_flags |= RMAP_EXCLUSIVE;
>>>>>>>>>  
>>>>>>>>> -				folio_lock(folio);
>>>>>>>>> -				folio_get(folio);
>>>>>>>>> -
>>>>>>>>> -				split_device_private_folio(folio);
>>>>>>>>> -
>>>>>>>>> -				for (new_folio = folio_next(folio);
>>>>>>>>> -					new_folio != end_folio;
>>>>>>>>> -					new_folio = folio_next(new_folio)) {
>>>>>>>>> -					addr += PAGE_SIZE;
>>>>>>>>> -					folio_unlock(new_folio);
>>>>>>>>> -					folio_add_anon_rmap_ptes(new_folio,
>>>>>>>>> -						&new_folio->page, 1,
>>>>>>>>> -						vma, addr, rmap_flags);
>>>>>>>>> -				}
>>>>>>>>> -				folio_unlock(folio);
>>>>>>>>> -				folio_add_anon_rmap_ptes(folio, &folio->page,
>>>>>>>>> -						1, vma, haddr, rmap_flags);
>>>>>>>>> +				folio_ref_add(folio, HPAGE_PMD_NR - 1);
>>>>>>>>> +				if (anon_exclusive)
>>>>>>>>> +					rmap_flags |= RMAP_EXCLUSIVE;
>>>>>>>>> +				folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
>>>>>>>>> +						 vma, haddr, rmap_flags);
>>>>>>>>>  			}
>>>>>>>>>  		}
>>>>>>>>>  
>>>>>>>>> @@ -4065,7 +4001,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>  	if (nr_shmem_dropped)
>>>>>>>>>  		shmem_uncharge(mapping->host, nr_shmem_dropped);
>>>>>>>>>  
>>>>>>>>> -	if (!ret && is_anon)
>>>>>>>>> +	if (!ret && is_anon && !folio_is_device_private(folio))
>>>>>>>>>  		remap_flags = RMP_USE_SHARED_ZEROPAGE;
>>>>>>>>>  
>>>>>>>>>  	remap_page(folio, 1 << order, remap_flags);
>>>>>>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>>>>>>>> index 49962ea19109..4264c0290d08 100644
>>>>>>>>> --- a/mm/migrate_device.c
>>>>>>>>> +++ b/mm/migrate_device.c
>>>>>>>>> @@ -248,6 +248,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>>>>>  			 * page table entry. Other special swap entries are not
>>>>>>>>>  			 * migratable, and we ignore regular swapped page.
>>>>>>>>>  			 */
>>>>>>>>> +			struct folio *folio;
>>>>>>>>> +
>>>>>>>>>  			entry = pte_to_swp_entry(pte);
>>>>>>>>>  			if (!is_device_private_entry(entry))
>>>>>>>>>  				goto next;
>>>>>>>>> @@ -259,6 +261,55 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>>>>>  			    pgmap->owner != migrate->pgmap_owner)
>>>>>>>>>  				goto next;
>>>>>>>>>  
>>>>>>>>> +			folio = page_folio(page);
>>>>>>>>> +			if (folio_test_large(folio)) {
>>>>>>>>> +				struct folio *new_folio;
>>>>>>>>> +				struct folio *new_fault_folio;
>>>>>>>>> +
>>>>>>>>> +				/*
>>>>>>>>> +				 * The reason for finding pmd present with a
>>>>>>>>> +				 * device private pte and a large folio for the
>>>>>>>>> +				 * pte is partial unmaps. Split the folio now
>>>>>>>>> +				 * for the migration to be handled correctly
>>>>>>>>> +				 */
>>>>>>>>> +				pte_unmap_unlock(ptep, ptl);
>>>>>>>>> +
>>>>>>>>> +				folio_get(folio);
>>>>>>>>> +				if (folio != fault_folio)
>>>>>>>>> +					folio_lock(folio);
>>>>>>>>> +				if (split_folio(folio)) {
>>>>>>>>> +					if (folio != fault_folio)
>>>>>>>>> +						folio_unlock(folio);
>>>>>>>>> +					ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>>>>>>>>> +					goto next;
>>>>>>>>> +				}
>>>>>>>>> +
>>>>>>>> The nouveau migrate_to_ram handler needs adjustment also if split happens.
>>>>>>>>
>>>>>>> test_hmm needs adjustment because of the way the backup folios are set up.
>>>>>> nouveau should check the folio order after the possible split happens.
>>>>>>
>>>>> You mean the folio_split callback?
>>>> no, nouveau_dmem_migrate_to_ram():
>>>> ..
>>>>         sfolio = page_folio(vmf->page);
>>>>         order = folio_order(sfolio);
>>>> ...
>>>>         migrate_vma_setup()
>>>> ..
>>>> if sfolio is split, order still reflects the pre-split order
>>>>
>>> Will fix, good catch!
>>>
>>>>>>>>> +				/*
>>>>>>>>> +				 * After the split, get back the extra reference
>>>>>>>>> +				 * on the fault_page, this reference is checked during
>>>>>>>>> +				 * folio_migrate_mapping()
>>>>>>>>> +				 */
>>>>>>>>> +				if (migrate->fault_page) {
>>>>>>>>> +					new_fault_folio = page_folio(migrate->fault_page);
>>>>>>>>> +					folio_get(new_fault_folio);
>>>>>>>>> +				}
>>>>>>>>> +
>>>>>>>>> +				new_folio = page_folio(page);
>>>>>>>>> +				pfn = page_to_pfn(page);
>>>>>>>>> +
>>>>>>>>> +				/*
>>>>>>>>> +				 * Ensure the lock is held on the correct
>>>>>>>>> +				 * folio after the split
>>>>>>>>> +				 */
>>>>>>>>> +				if (folio != new_folio) {
>>>>>>>>> +					folio_unlock(folio);
>>>>>>>>> +					folio_lock(new_folio);
>>>>>>>>> +				}
>>>>>>>> Maybe careful not to unlock fault_page ?
>>>>>>>>
>>>>>>> split_page will unlock everything but the original folio; the code takes the lock
>>>>>>> on the folio corresponding to the new folio
>>>>>> I mean do_swap_page() itself unlocks the folio of fault_page afterwards, and expects it to remain locked across migrate_to_ram().
>>>>>>
>>>>> Not sure I follow what you're trying to elaborate on here
>>>> do_swap_page:
>>>> ..
>>>>         if (trylock_page(vmf->page)) {
>>>>                 ret = pgmap->ops->migrate_to_ram(vmf);
>>>>                                <- vmf->page should be locked here even after split
>>>>                 unlock_page(vmf->page);
>>>>
>>> Yep, the split will unlock all tail folios, leaving just the head folio
>>> locked. With this change, the lock we need to hold is the folio lock
>>> associated with the fault_page pte entry, and we should not unlock it when
>>> the cause is a fault. The code seems to do the right thing there, let me
>>> double check
>> Yes, the fault case is ok. But if the migration is not for a fault, we should not leave any page locked
>>
> migrate_vma_finalize() handles this

But after the split we are still in migrate_vma_collect_pmd() and go on to collect the pte, which locks the page again.
So the folio needs to be unlocked after the split.
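
To make that concrete, something along the following lines after the split
could work (a rough, hypothetical sketch against the hunk quoted above, not
the posted patch; variable names as in that hunk):

	/*
	 * After split_folio() the pte maps an order-0 folio and only the
	 * original head folio is still locked.  Keep a lock only on the
	 * fault folio (do_swap_page() expects vmf->page to stay locked)
	 * and leave nothing else locked, since the collection code further
	 * down takes the folio lock itself via folio_trylock().
	 */
	new_folio = page_folio(page);
	new_fault_folio = migrate->fault_page ?
			  page_folio(migrate->fault_page) : NULL;

	if (folio != new_folio) {
		folio_unlock(folio);		/* only the head stayed locked */
		if (new_folio == new_fault_folio)
			folio_lock(new_folio);	/* hand the lock to the fault folio */
	} else if (new_folio != new_fault_folio) {
		folio_unlock(new_folio);	/* non-fault case: leave it unlocked */
	}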

>
> Balbir
>


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 01/11] mm/zone_device: support large zone device private folios
  2025-08-05  4:22     ` Balbir Singh
@ 2025-08-05 10:57       ` David Hildenbrand
  2025-08-05 11:01         ` Balbir Singh
  0 siblings, 1 reply; 71+ messages in thread
From: David Hildenbrand @ 2025-08-05 10:57 UTC (permalink / raw)
  To: Balbir Singh, linux-mm
  Cc: linux-kernel, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
	Ralph Campbell, Mika Penttilä, Matthew Brost,
	Francois Dugast

On 05.08.25 06:22, Balbir Singh wrote:
> On 7/30/25 19:50, David Hildenbrand wrote:
> 
>> I think I asked that already but maybe missed the reply: Should these folios ever be added to the deferred split queue and is there any value in splitting them under memory pressure in the shrinker?
>>
>> My gut feeling is "No", because the buddy cannot make use of these folios, but maybe there is an interesting case where we want that behavior?
>>
> 
> I realized I did not answer this
> 
> deferred_split() is the default action when partial unmaps take place. Anything that does
> folio_rmap_remove_ptes can cause the folio to be deferred split if it gets partially
> unmapped.

Right, but it's easy to exclude zone-device folios here. So the real 
question is: do you want to deal with deferred splits or not?

If not, then just disable it right from the start.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 01/11] mm/zone_device: support large zone device private folios
  2025-08-05 10:57       ` David Hildenbrand
@ 2025-08-05 11:01         ` Balbir Singh
  2025-08-05 12:58           ` David Hildenbrand
  0 siblings, 1 reply; 71+ messages in thread
From: Balbir Singh @ 2025-08-05 11:01 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm
  Cc: linux-kernel, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
	Ralph Campbell, Mika Penttilä, Matthew Brost,
	Francois Dugast

On 8/5/25 20:57, David Hildenbrand wrote:
> On 05.08.25 06:22, Balbir Singh wrote:
>> On 7/30/25 19:50, David Hildenbrand wrote:
>>
>>> I think I asked that already but maybe missed the reply: Should these folios ever be added to the deferred split queue and is there any value in splitting them under memory pressure in the shrinker?
>>>
>>> My gut feeling is "No", because the buddy cannot make use of these folios, but maybe there is an interesting case where we want that behavior?
>>>
>>
>> I realized I did not answer this
>>
>> deferred_split() is the default action when partial unmaps take place. Anything that does
>> folio_rmap_remove_ptes can cause the folio to be deferred split if it gets partially
>> unmapped.
> 
> Right, but it's easy to exclude zone-device folios here. So the real question is: do you want to deal with deferred splits or not?
> 
> If not, then just disable it right from the start.
> 

I agree; I was trying to avoid special-casing device private folios unless needed, to the extent possible

Balbir

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 01/11] mm/zone_device: support large zone device private folios
  2025-08-05 11:01         ` Balbir Singh
@ 2025-08-05 12:58           ` David Hildenbrand
  2025-08-05 21:15             ` Matthew Brost
  0 siblings, 1 reply; 71+ messages in thread
From: David Hildenbrand @ 2025-08-05 12:58 UTC (permalink / raw)
  To: Balbir Singh, linux-mm
  Cc: linux-kernel, Karol Herbst, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Jérôme Glisse, Shuah Khan,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
	Ralph Campbell, Mika Penttilä, Matthew Brost,
	Francois Dugast

On 05.08.25 13:01, Balbir Singh wrote:
> On 8/5/25 20:57, David Hildenbrand wrote:
>> On 05.08.25 06:22, Balbir Singh wrote:
>>> On 7/30/25 19:50, David Hildenbrand wrote:
>>>
>>>> I think I asked that already but maybe missed the reply: Should these folios ever be added to the deferred split queue and is there any value in splitting them under memory pressure in the shrinker?
>>>>
>>>> My gut feeling is "No", because the buddy cannot make use of these folios, but maybe there is an interesting case where we want that behavior?
>>>>
>>>
>>> I realized I did not answer this
>>>
>>> deferred_split() is the default action when partial unmaps take place. Anything that does
>>> folio_rmap_remove_ptes can cause the folio to be deferred split if it gets partially
>>> unmapped.
>>
>> Right, but it's easy to exclude zone-device folios here. So the real question is: do you want to deal with deferred splits or not?
>>
>> If not, then just disable it right from the start.
>>
> 
> I agree; I was trying to avoid special-casing device private folios unless needed, to the extent possible

By introducing a completely separate split logic :P

Jokes aside, we have plenty of zone_device special-casing already, no 
harm in adding one more folio_is_zone_device() there.

Deferred splitting is weird enough already that you can call yourself 
fortunate if you don't have to mess with it for zone-device folios.

Again, unless there is a benefit in having it.
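
For illustration, the exclusion could be as small as an early bail-out in the
deferred-split entry point (a hedged sketch only, not something from the
posted series; where exactly it lands is up to the series):

void deferred_split_folio(struct folio *folio, bool partially_mapped)
{
	/*
	 * Zone-device folios never go back to the buddy allocator, so
	 * splitting them from the shrinker under memory pressure buys
	 * nothing: keep them off the deferred split queue entirely.
	 */
	if (folio_is_zone_device(folio))
		return;

	/* ... existing queueing logic unchanged ... */
}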

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 01/11] mm/zone_device: support large zone device private folios
  2025-08-05 12:58           ` David Hildenbrand
@ 2025-08-05 21:15             ` Matthew Brost
  2025-08-06 12:19               ` Balbir Singh
  0 siblings, 1 reply; 71+ messages in thread
From: Matthew Brost @ 2025-08-05 21:15 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Balbir Singh, linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
	Ryan Roberts, Matthew Wilcox, Peter Xu, Zi Yan, Kefeng Wang,
	Jane Chu, Alistair Popple, Donet Tom, Ralph Campbell,
	Mika Penttilä, Francois Dugast

On Tue, Aug 05, 2025 at 02:58:42PM +0200, David Hildenbrand wrote:
> On 05.08.25 13:01, Balbir Singh wrote:
> > On 8/5/25 20:57, David Hildenbrand wrote:
> > > On 05.08.25 06:22, Balbir Singh wrote:
> > > > On 7/30/25 19:50, David Hildenbrand wrote:
> > > > 
> > > > > I think I asked that already but maybe missed the reply: Should these folios ever be added to the deferred split queue and is there any value in splitting them under memory pressure in the shrinker?
> > > > > 
> > > > > My gut feeling is "No", because the buddy cannot make use of these folios, but maybe there is an interesting case where we want that behavior?
> > > > > 
> > > > 
> > > > I realized I did not answer this
> > > > 
> > > > deferred_split() is the default action when partial unmaps take place. Anything that does
> > > > folio_rmap_remove_ptes can cause the folio to be deferred split if it gets partially
> > > > unmapped.
> > > 
> > > Right, but it's easy to exclude zone-device folios here. So the real question is: do you want to deal with deferred splits or not?
> > > 
> > > If not, then just disable it right from the start.
> > > 
> > 
> > I agree; I was trying to avoid special-casing device private folios unless needed, to the extent possible
> 
> By introducing a completely separate split logic :P
> 
> Jokes aside, we have plenty of zone_device special-casing already, no harm
> in adding one more folio_is_zone_device() there.
> 
> Deferred splitting is weird enough already that you can call yourself fortunate
> if you don't have to mess with it for zone-device folios.
> 
> Again, unless there is a benefit in having it.

+1 on no deferred split for device folios.

Matt

> 
> -- 
> Cheers,
> 
> David / dhildenb
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 00/11] THP support for zone device page migration
  2025-07-30  9:21 [v2 00/11] THP support for zone device page migration Balbir Singh
                   ` (11 preceding siblings ...)
  2025-07-30 11:30 ` [v2 00/11] THP support for zone device page migration David Hildenbrand
@ 2025-08-05 21:34 ` Matthew Brost
  12 siblings, 0 replies; 71+ messages in thread
From: Matthew Brost @ 2025-08-05 21:34 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, David Hildenbrand,
	Barry Song, Baolin Wang, Ryan Roberts, Matthew Wilcox, Peter Xu,
	Zi Yan, Kefeng Wang, Jane Chu, Alistair Popple, Donet Tom,
	Ralph Campbell, Mika Penttilä, Francois Dugast

On Wed, Jul 30, 2025 at 07:21:28PM +1000, Balbir Singh wrote:
> This patch series adds support for THP migration of zone device pages.
> To do so, the patches implement support for folio zone device pages
> by adding support for setting up larger order pages. Larger order
> pages provide a speedup in throughput and latency.
> 
> In my local testing (using lib/test_hmm) and a throughput test, the
> series shows a 350% improvement in data transfer throughput and a
> 500% improvement in latency
> 
> These patches build on the earlier posts by Ralph Campbell [1]
> 
> Two new flags are added in vma_migration to select and mark compound pages.
> migrate_vma_setup(), migrate_vma_pages() and migrate_vma_finalize()
> support migration of these pages when MIGRATE_VMA_SELECT_COMPOUND
> is passed in as arguments.
> 
> The series also adds zone device awareness to (m)THP pages along
> with fault handling of large zone device private pages. page vma walk
> and the rmap code is also zone device aware. Support has also been
> added for folios that might need to be split in the middle
> of migration (when the src and dst do not agree on
> MIGRATE_PFN_COMPOUND), that occurs when src side of the migration can
> migrate large pages, but the destination has not been able to allocate
> large pages. The code supported and used folio_split() when migrating
> THP pages, this is used when MIGRATE_VMA_SELECT_COMPOUND is not passed
> as an argument to migrate_vma_setup().
> 
> The test infrastructure lib/test_hmm.c has been enhanced to support THP
> migration. A new ioctl to emulate failure of large page allocations has
> been added to test the folio split code path. hmm-tests.c has new test
> cases for huge page migration and to test the folio split path. A new
> throughput test has been added as well.
> 
> The nouveau dmem code has been enhanced to use the new THP migration
> capability.
> 
> mTHP support:
> 
> The patches hard code, HPAGE_PMD_NR in a few places, but the code has
> been kept generic to support various order sizes. With additional
> refactoring of the code support of different order sizes should be
> possible.
> 
> The future plan is to post enhancements to support mTHP with a rough
> design as follows:
> 
> 1. Add the notion of allowable thp orders to the HMM based test driver
> 2. For non PMD based THP paths in migrate_device.c, check to see if
>    a suitable order is found and supported by the driver
> 3. Iterate across orders to check the highest supported order for migration
> 4. Migrate and finalize
> 
> The mTHP patches can be built on top of this series, the key design
> elements that need to be worked out are infrastructure and driver support
> for multiple ordered pages and their migration.
> 
> HMM support for large folios:
> 
> Francois Dugast posted patches support for HMM handling [4], the proposed
> changes can build on top of this series to provide support for HMM fault
> handling.
> 
> References:
> [1] https://lore.kernel.org/linux-mm/20201106005147.20113-1-rcampbell@nvidia.com/
> [2] https://lore.kernel.org/linux-mm/20250306044239.3874247-3-balbirs@nvidia.com/T/
> [3] https://lore.kernel.org/lkml/20250703233511.2028395-1-balbirs@nvidia.com/
> [4] https://lore.kernel.org/lkml/20250722193445.1588348-1-francois.dugast@intel.com/
> 
> These patches are built on top of mm/mm-stable
> 
> Cc: Karol Herbst <kherbst@redhat.com>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
> Cc: Jane Chu <jane.chu@oracle.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Donet Tom <donettom@linux.ibm.com>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Cc: Mika Penttilä <mpenttil@redhat.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Francois Dugast <francois.dugast@intel.com>
> 
> Changelog v2 [3] :
> - Several review comments from David Hildenbrand were addressed, Mika,
>   Zi, Matthew also provided helpful review comments
>   - In paths where it makes sense a new helper
>     is_pmd_device_private_entry() is used
>   - anon_exclusive handling of zone device private pages in
>     split_huge_pmd_locked() has been fixed
>   - Patches that introduced helpers have been folded into where they
>     are used
> - Zone device handling in mm/huge_memory.c has benefited from the code
>   and testing of Matthew Brost, he helped find bugs related to
>   copy_huge_pmd() and partial unmapping of folios.

I see a ton of discussion on this series, particularly patch 2. It looks
like you have landed on a different solution for partial unmaps. I
wanted to pull this series in for testing, but if this is actively
being refactored, it's likely best to hold off until the next post or test
off a WIP branch if you have one.

Matt

> - Zone device THP PMD support via page_vma_mapped_walk() is restricted
>   to try_to_migrate_one()
> - There is a new dedicated helper to split large zone device folios
> 
> Changelog v1 [2]:
> - Support for handling fault_folio and using trylock in the fault path
> - A new test case has been added to measure the throughput improvement
> - General refactoring of code to keep up with the changes in mm
> - New split folio callback when the entire split is complete/done. The
>   callback is used to know when the head order needs to be reset.
> 
> Testing:
> - Testing was done with ZONE_DEVICE private pages on an x86 VM
> 
> Balbir Singh (11):
>   mm/zone_device: support large zone device private folios
>   mm/thp: zone_device awareness in THP handling code
>   mm/migrate_device: THP migration of zone device pages
>   mm/memory/fault: add support for zone device THP fault handling
>   lib/test_hmm: test cases and support for zone device private THP
>   mm/memremap: add folio_split support
>   mm/thp: add split during migration support
>   lib/test_hmm: add test case for split pages
>   selftests/mm/hmm-tests: new tests for zone device THP migration
>   gpu/drm/nouveau: add THP migration support
>   selftests/mm/hmm-tests: new throughput tests including THP
> 
>  drivers/gpu/drm/nouveau/nouveau_dmem.c | 246 +++++++---
>  drivers/gpu/drm/nouveau/nouveau_svm.c  |   6 +-
>  drivers/gpu/drm/nouveau/nouveau_svm.h  |   3 +-
>  include/linux/huge_mm.h                |  19 +-
>  include/linux/memremap.h               |  51 ++-
>  include/linux/migrate.h                |   2 +
>  include/linux/mm.h                     |   1 +
>  include/linux/rmap.h                   |   2 +
>  include/linux/swapops.h                |  17 +
>  lib/test_hmm.c                         | 432 ++++++++++++++----
>  lib/test_hmm_uapi.h                    |   3 +
>  mm/huge_memory.c                       | 358 ++++++++++++---
>  mm/memory.c                            |   6 +-
>  mm/memremap.c                          |  48 +-
>  mm/migrate_device.c                    | 517 ++++++++++++++++++---
>  mm/page_vma_mapped.c                   |  13 +-
>  mm/pgtable-generic.c                   |   6 +
>  mm/rmap.c                              |  22 +-
>  tools/testing/selftests/mm/hmm-tests.c | 607 ++++++++++++++++++++++++-
>  19 files changed, 2040 insertions(+), 319 deletions(-)
> 
> -- 
> 2.50.1
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [v2 01/11] mm/zone_device: support large zone device private folios
  2025-08-05 21:15             ` Matthew Brost
@ 2025-08-06 12:19               ` Balbir Singh
  0 siblings, 0 replies; 71+ messages in thread
From: Balbir Singh @ 2025-08-06 12:19 UTC (permalink / raw)
  To: Matthew Brost, David Hildenbrand
  Cc: linux-mm, linux-kernel, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter,
	Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang,
	Ryan Roberts, Matthew Wilcox, Peter Xu, Zi Yan, Kefeng Wang,
	Jane Chu, Alistair Popple, Donet Tom, Ralph Campbell,
	Mika Penttilä, Francois Dugast

On 8/6/25 07:15, Matthew Brost wrote:
> On Tue, Aug 05, 2025 at 02:58:42PM +0200, David Hildenbrand wrote:
>> On 05.08.25 13:01, Balbir Singh wrote:
>>> On 8/5/25 20:57, David Hildenbrand wrote:
>>>> On 05.08.25 06:22, Balbir Singh wrote:
>>>>> On 7/30/25 19:50, David Hildenbrand wrote:
>>>>>
>>>>>> I think I asked that already but maybe missed the reply: Should these folios ever be added to the deferred split queue and is there any value in splitting them under memory pressure in the shrinker?
>>>>>>
>>>>>> My gut feeling is "No", because the buddy cannot make use of these folios, but maybe there is an interesting case where we want that behavior?
>>>>>>
>>>>>
>>>>> I realized I did not answer this
>>>>>
>>>>> deferred_split() is the default action when partial unmaps take place. Anything that does
>>>>> folio_rmap_remove_ptes can cause the folio to be deferred split if it gets partially
>>>>> unmapped.
>>>>
>>>> Right, but it's easy to exclude zone-device folios here. So the real question is: do you want to deal with deferred splits or not?
>>>>
>>>> If not, then just disable it right from the start.
>>>>
>>>
>>> I agree; I was trying to avoid special-casing device private folios unless needed, to the extent possible
>>
>> By introducing a completely separate split logic :P
>>
>> Jokes aside, we have plenty of zone_device special-casing already, no harm
>> in adding one more folio_is_zone_device() there.
>>
>> Deferred splitting is weird enough already that you can call yourself fortunate
>> if you don't have to mess with it for zone-device folios.
>>
>> Again, unless there is a benefit in having it.
> 
> +1 on no deferred split for device folios.
> 
>

I'll add it to v3 to check that we do not do deferred splits on zone device folios

Balbir

^ permalink raw reply	[flat|nested] 71+ messages in thread

end of thread, other threads:[~2025-08-06 12:19 UTC | newest]

Thread overview: 71+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-07-30  9:21 [v2 00/11] THP support for zone device page migration Balbir Singh
2025-07-30  9:21 ` [v2 01/11] mm/zone_device: support large zone device private folios Balbir Singh
2025-07-30  9:50   ` David Hildenbrand
2025-08-04 23:43     ` Balbir Singh
2025-08-05  4:22     ` Balbir Singh
2025-08-05 10:57       ` David Hildenbrand
2025-08-05 11:01         ` Balbir Singh
2025-08-05 12:58           ` David Hildenbrand
2025-08-05 21:15             ` Matthew Brost
2025-08-06 12:19               ` Balbir Singh
2025-07-30  9:21 ` [v2 02/11] mm/thp: zone_device awareness in THP handling code Balbir Singh
2025-07-30 11:16   ` Mika Penttilä
2025-07-30 11:27     ` Zi Yan
2025-07-30 11:30       ` Zi Yan
2025-07-30 11:42         ` Mika Penttilä
2025-07-30 12:08           ` Mika Penttilä
2025-07-30 12:25             ` Zi Yan
2025-07-30 12:49               ` Mika Penttilä
2025-07-30 15:10                 ` Zi Yan
2025-07-30 15:40                   ` Mika Penttilä
2025-07-30 15:58                     ` Zi Yan
2025-07-30 16:29                       ` Mika Penttilä
2025-07-31  7:15                         ` David Hildenbrand
2025-07-31  8:39                           ` Balbir Singh
2025-07-31 11:26                           ` Zi Yan
2025-07-31 12:32                             ` David Hildenbrand
2025-07-31 13:34                               ` Zi Yan
2025-07-31 19:09                                 ` David Hildenbrand
2025-08-01  0:49                             ` Balbir Singh
2025-08-01  1:09                               ` Zi Yan
2025-08-01  7:01                                 ` David Hildenbrand
2025-08-01  1:16                               ` Mika Penttilä
2025-08-01  4:44                                 ` Balbir Singh
2025-08-01  5:57                                   ` Balbir Singh
2025-08-01  6:01                                   ` Mika Penttilä
2025-08-01  7:04                                   ` David Hildenbrand
2025-08-01  8:01                                     ` Balbir Singh
2025-08-01  8:46                                       ` David Hildenbrand
2025-08-01 11:10                                         ` Zi Yan
2025-08-01 12:20                                           ` Mika Penttilä
2025-08-01 12:28                                             ` Zi Yan
2025-08-02  1:17                                               ` Balbir Singh
2025-08-02 10:37                                               ` Balbir Singh
2025-08-02 12:13                                                 ` Mika Penttilä
2025-08-04 22:46                                                   ` Balbir Singh
2025-08-04 23:26                                                     ` Mika Penttilä
2025-08-05  4:10                                                       ` Balbir Singh
2025-08-05  4:24                                                         ` Mika Penttilä
2025-08-05  5:19                                                           ` Mika Penttilä
2025-08-05 10:27                                                           ` Balbir Singh
2025-08-05 10:35                                                             ` Mika Penttilä
2025-08-05 10:36                                                               ` Balbir Singh
2025-08-05 10:46                                                                 ` Mika Penttilä
2025-07-30 20:05   ` kernel test robot
2025-07-30  9:21 ` [v2 03/11] mm/migrate_device: THP migration of zone device pages Balbir Singh
2025-07-31 16:19   ` kernel test robot
2025-07-30  9:21 ` [v2 04/11] mm/memory/fault: add support for zone device THP fault handling Balbir Singh
2025-07-30  9:21 ` [v2 05/11] lib/test_hmm: test cases and support for zone device private THP Balbir Singh
2025-07-31 11:17   ` kernel test robot
2025-07-30  9:21 ` [v2 06/11] mm/memremap: add folio_split support Balbir Singh
2025-07-30  9:21 ` [v2 07/11] mm/thp: add split during migration support Balbir Singh
2025-07-31 10:04   ` kernel test robot
2025-07-30  9:21 ` [v2 08/11] lib/test_hmm: add test case for split pages Balbir Singh
2025-07-30  9:21 ` [v2 09/11] selftests/mm/hmm-tests: new tests for zone device THP migration Balbir Singh
2025-07-30  9:21 ` [v2 10/11] gpu/drm/nouveau: add THP migration support Balbir Singh
2025-07-30  9:21 ` [v2 11/11] selftests/mm/hmm-tests: new throughput tests including THP Balbir Singh
2025-07-30 11:30 ` [v2 00/11] THP support for zone device page migration David Hildenbrand
2025-07-30 23:18   ` Alistair Popple
2025-07-31  8:41   ` Balbir Singh
2025-07-31  8:56     ` David Hildenbrand
2025-08-05 21:34 ` Matthew Brost

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).